date:20160331

Re: [Qemu-devel] [PATCH V2 1/3] colo-compare: introduce colo compare initlization

2016-03-31 Thread Li Zhijian

On 04/01/2016 01:11 PM, Jason Wang wrote:

On 03/31/2016 05:24 PM, Dr. David Alan Gilbert wrote:

* Zhang Chen (zhangchen.f...@cn.fujitsu.com) wrote:

On 03/30/2016 05:25 PM, Dr. David Alan Gilbert wrote:

* Zhang Chen (zhangchen.f...@cn.fujitsu.com) wrote:

packet come from primary char indev will be send to
outdev - packet come from secondary char dev will be drop

Please put in the description an example of how you invoke
the filter on the primary and secondary.

OK.
colo-compare get packet from chardev(primary_in,secondary_in),
and output to other chardev(outdev), so, we can use it with the
help of filter-mirror and filter-redirector. like that:

Wow, this is a bit more complicated than I expected - I was expecting just one
socket.  So let me draw this out; see comments below.

primary:
-netdev
tap,id=hn0,vhost=off,script=/etc/qemu-ifup,downscript=/etc/qemu-ifdown
-device e1000,id=e0,netdev=hn0,mac=52:a4:00:12:78:66
-chardev socket,id=mirror0,host=3.3.3.3,port=9003,server,nowait
-chardev socket,id=compare1,host=3.3.3.3,port=9004,server,nowait
-chardev socket,id=compare0,host=3.3.3.3,port=9001,server,nowait
-chardev socket,id=compare0-0,host=3.3.3.3,port=9001
-chardev socket,id=compare_out,host=3.3.3.3,port=9005,server,nowait
-chardev socket,id=compare_out0,host=3.3.3.3,port=9005
-object filter-mirror,id=m0,netdev=hn0,queue=tx,outdev=mirror0
-object filter-redirector,netdev=hn0,id=redire0,queue=rx,indev=compare_out
-object filter-redirector,netdev=hn0,id=redire1,queue=rx,outdev=compare0
-object 
colo-compare,id=comp0,primary_in=compare0-0,secondary_in=compare1,outdev=compare_out0

 > mirror0: socket:secondary:9003
 |
 mirror-m0 <-- e1000
 |
 v
 redirector-redire1 ---> compare0 socket:primary:9001 (to compare0-0)

 -< compare0-0 socket:primary:9001 (to compare0)
 |  primary_in
 |
 compare-comp0   -> compare_out0 socket:primary:9005
 |
 |  secondary_in
 -< compare1   socket:secondary:9004

tap <-- redirector-redire0 <--- compare_out socket:primary:9005 (from 
compare_out0)

secondary:
-netdev tap,id=hn0,vhost=off,script=/etc/qemu-ifup,down
script=/etc/qemu-ifdown
-device e1000,netdev=hn0,mac=52:a4:00:12:78:66
-chardev socket,id=red0,host=3.3.3.3,port=9003
-chardev socket,id=red1,host=3.3.3.3,port=9004
-object filter-redirector,id=f1,netdev=hn0,queue=tx,indev=red0
-object filter-redirector,id=f2,netdev=hn0,queue=rx,outdev=red1

 > red0 socket::9003
 |
tap <-- redirector-f1 <--
e1000
 --> redirector-f2 -->
 |
 < red1 socket::9004

OK, so I think we need to find a way to fix two things:
a) There must be an easier way of connecting two filters within the same
   qemu than going up to the kernel's socket code, around it's TCP stack
   and back.  Is there a better type of chardev to use we can short circuit
   with - it shouldn't need to leave QEMU (although I can see it's easier
   for debugging like this).  Jason/Dan - any ideas?

Looks like there's no such type of chardev. I think we could start with
tcp socket chardev first and do necessary optimization on top. In fact,
there's also advantages, with tcp sockets, the colo-compare codes could
even be an independent program out of qemu.

For the chardev type, this also reminds me one thing. Since
mirror/redirector can only work for tcp socket, we may need inspect the
chardev types and fail if not a tcp socket like what vhost-user did.

b) We should only need one socket for the connection between primary and
   secondary - I'm not sure how to change it to let it do that.

Looks like we can (e.g for secondary):

-device e1000,netdev=hn0,mac=52:a4:00:12:78:66
-chardev socket,id=red0,host=3.3.3.3,port=9003
-object filter-redirector,id=f1,netdev=hn0,queue=tx,indev=red0
-object filter-redirector,id=f2,netdev=hn0,queue=rx,outdev=red0

If not, probably a bug.

Right,
but firstly, 'chardev socket' need to enable 'mux=on' if we want two filter 
reference the
same socket.

Thanks

c) You need to be able to turn off nagling (socket no delay) on the sockets;
  I found a noticeable benefit of doing this on the connection between 
primary
  and secondary in my world.

Right.

Dave

.

Re: [Qemu-devel] [RFC PATCH v2.1 08/12] spapr: Add CPU type specific core devices

2016-03-31 Thread David Gibson

On Thu, Mar 31, 2016 at 02:09:17PM +0530, Bharata B Rao wrote:
> Introduce core devices for each CPU type supported by sPAPR. These
> core devices are derived from the base spapr-cpu-core device type.
> 
> TODO:
> - Add core types for other remaining CPU types
> - Handle CPU model alias correctly
> 
> Signed-off-by: Bharata B Rao 
> ---
>  hw/ppc/spapr.c  |   3 +-
>  hw/ppc/spapr_cpu_core.c | 118 
> 
>  include/hw/ppc/spapr.h  |   1 +
>  include/hw/ppc/spapr_cpu_core.h |  36 
>  4 files changed, 156 insertions(+), 2 deletions(-)
> 
> diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
> index 64c4acc..45ac5dc 100644
> --- a/hw/ppc/spapr.c
> +++ b/hw/ppc/spapr.c
> @@ -1614,8 +1614,7 @@ static void spapr_boot_set(void *opaque, const char 
> *boot_device,
>  machine->boot_order = g_strdup(boot_device);
>  }
>  
> -static void spapr_cpu_init(sPAPRMachineState *spapr, PowerPCCPU *cpu,
> -   Error **errp)
> +void spapr_cpu_init(sPAPRMachineState *spapr, PowerPCCPU *cpu, Error **errp)
>  {
>  CPUPPCState *env = >env;
>  
> diff --git a/hw/ppc/spapr_cpu_core.c b/hw/ppc/spapr_cpu_core.c
> index 8cbe2a5..3751a54 100644
> --- a/hw/ppc/spapr_cpu_core.c
> +++ b/hw/ppc/spapr_cpu_core.c
> @@ -22,9 +22,127 @@ static const TypeInfo spapr_cpu_core_type_info = {
>  .instance_size = sizeof(sPAPRCPUCore),
>  };
>  
> +#define DEFINE_SPAPR_CPU_CORE(_name) 
>   \
> +static void  
>   \
> +glue(_name, _spapr_cpu_core_create_threads)(DeviceState *dev, int threads,   
>   \
> +Error **errp)
>   \
> +{
>   \
> +int i;   
>   \
> +Error *local_err = NULL; 
>   \
> +sPAPRCPUCore *sc = SPAPR_CPU_CORE(OBJECT(dev));  
>   \
> +glue(_name, sPAPRCPUCore) * core =   
>   \
> + glue(_name, _SPAPR_CPU_CORE)(OBJECT(dev));  
>   \
> + 
>   \
> +for (i = 0; i < threads; i++) {  
>   \
> +char id[32]; 
>   \
> + 
>   \
> +object_initialize(>threads[i], sizeof(sc->threads[i]),   
>   \
> +  object_class_get_name(core->cpu)); 
>   \
> +snprintf(id, sizeof(id), "thread[%d]", i);   
>   \
> +object_property_add_child(OBJECT(core), id, OBJECT(>threads[i]), 
>   \
> +  _err);   
>   \
> +if (local_err) { 
>   \
> +goto err;
>   \
> +}
>   \
> +}
>   \
> +return;  
>   \
> + 
>   \
> +err: 
>   \
> +while (--i) {
>   \
> +object_unparent(OBJECT(>threads[i]));
>   \
> +}
>   \
> +error_propagate(errp, local_err);
>   \
> +}
>   \
> + 
>   \
> +static int   
>   \
> +glue(_name, _spapr_cpu_core_realize_child)(Object *child, void *opaque)  
>   \
> +{
>   \
> +Error **errp = opaque;   
>   \
> +sPAPRMachineState *spapr = SPAPR_MACHINE(qdev_get_machine());
>   \
> +CPUState *cs = CPU(child);   
>   \
> +PowerPCCPU *cpu = POWERPC_CPU(cs);   
>   \
> + 
>   \
> +object_property_set_bool(child, true, "realized", errp);

Re: [Qemu-devel] [RFC PATCH v2.1 09/12] spapr: convert boot CPUs into CPU core devices

2016-03-31 Thread David Gibson

On Thu, Mar 31, 2016 at 02:09:18PM +0530, Bharata B Rao wrote:
> Introduce sPAPRMachineClass.dr_cpu_enabled to indicate support for
> CPU core hotplug. Initialize boot time CPUs as core deivces and prevent
> topologies that result in partially filled cores. Both of these are done
> only if CPU core hotplug is supported.
> 
> Note: An unrelated change in the call to xics_system_init() is done
> in this patch as it makes sense to use the local variable smt introduced
> in this patch instead of kvmppc_smt_threads() call here.
> 
> Signed-off-by: Bharata B Rao 

Reviewed-by: David Gibson 

> ---
>  hw/ppc/spapr.c  | 73 
> +++--
>  hw/ppc/spapr_cpu_core.c | 45 +
>  include/hw/ppc/spapr.h  |  2 ++
>  include/hw/ppc/spapr_cpu_core.h |  3 ++
>  4 files changed, 113 insertions(+), 10 deletions(-)
> 
> diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
> index 45ac5dc..1ead043 100644
> --- a/hw/ppc/spapr.c
> +++ b/hw/ppc/spapr.c
> @@ -64,6 +64,7 @@
>  
>  #include "hw/compat.h"
>  #include "qemu-common.h"
> +#include "hw/ppc/spapr_cpu_core.h"
>  
>  #include 
>  
> @@ -1614,6 +1615,10 @@ static void spapr_boot_set(void *opaque, const char 
> *boot_device,
>  machine->boot_order = g_strdup(boot_device);
>  }
>  
> +/*
> + * TODO: Check if some of these can be moved to rtas_start_cpu() where
> + * a few other things required for hotplugged CPUs are being done.
> + */
>  void spapr_cpu_init(sPAPRMachineState *spapr, PowerPCCPU *cpu, Error **errp)
>  {
>  CPUPPCState *env = >env;
> @@ -1644,6 +1649,7 @@ void spapr_cpu_init(sPAPRMachineState *spapr, 
> PowerPCCPU *cpu, Error **errp)
>  xics_cpu_setup(spapr->icp, cpu);
>  
>  qemu_register_reset(spapr_cpu_reset, cpu);
> +spapr_cpu_reset(cpu);
>  }
>  
>  /*
> @@ -1727,7 +1733,6 @@ static void ppc_spapr_init(MachineState *machine)
>  const char *kernel_filename = machine->kernel_filename;
>  const char *kernel_cmdline = machine->kernel_cmdline;
>  const char *initrd_filename = machine->initrd_filename;
> -PowerPCCPU *cpu;
>  PCIHostState *phb;
>  int i;
>  MemoryRegion *sysmem = get_system_memory();
> @@ -1741,6 +1746,22 @@ static void ppc_spapr_init(MachineState *machine)
>  long load_limit, fw_size;
>  bool kernel_le = false;
>  char *filename;
> +int smt = kvmppc_smt_threads();
> +int spapr_cores = smp_cpus / smp_threads;
> +int spapr_max_cores = max_cpus / smp_threads;
> +
> +if (smc->dr_cpu_enabled) {
> +if (smp_cpus % smp_threads) {
> +error_report("smp_cpus (%u) must be multiple of threads (%u)",
> + smp_cpus, smp_threads);
> +exit(1);
> +}
> +if (max_cpus % smp_threads) {
> +error_report("max_cpus (%u) must be multiple of threads (%u)",
> + max_cpus, smp_threads);
> +exit(1);
> +}
> +}
>  
>  msi_supported = true;
>  
> @@ -1787,8 +1808,7 @@ static void ppc_spapr_init(MachineState *machine)
>  
>  /* Set up Interrupt Controller before we create the VCPUs */
>  spapr->icp = xics_system_init(machine,
> -  DIV_ROUND_UP(max_cpus * 
> kvmppc_smt_threads(),
> -   smp_threads),
> +  DIV_ROUND_UP(max_cpus * smt, smp_threads),
>XICS_IRQS, _fatal);
>  
>  if (smc->dr_lmb_enabled) {
> @@ -1799,13 +1819,34 @@ static void ppc_spapr_init(MachineState *machine)
>  if (machine->cpu_model == NULL) {
>  machine->cpu_model = kvm_enabled() ? "host" : "POWER7";
>  }
> -for (i = 0; i < smp_cpus; i++) {
> -cpu = cpu_ppc_init(machine->cpu_model);
> -if (cpu == NULL) {
> -error_report("Unable to find PowerPC CPU definition");
> -exit(1);
> +
> +if (smc->dr_cpu_enabled) {
> +spapr->cores = g_new0(Object *, spapr_max_cores);
> +
> +for (i = 0; i < spapr_max_cores; i++) {
> +int core_dt_id = i * smt;
> +
> +if (i < spapr_cores) {
> +char *type = spapr_get_cpu_core_type(machine->cpu_model);
> +Object *core  = object_new(type);
> +
> +g_free(type);
> +object_property_set_int(core, smp_threads, "threads",
> +_fatal);
> +object_property_set_int(core, core_dt_id, CPU_CORE_PROP_CORE,
> +_fatal);
> +object_property_set_bool(core, true, "realized", 
> _fatal);
> +}
>  }
> -spapr_cpu_init(spapr, cpu, _fatal);
> +} else {
> +for (i = 0; i < smp_cpus; i++) {
> +PowerPCCPU *cpu = cpu_ppc_init(machine->cpu_model);
> +if (cpu == NULL) {
> +

Re: [Qemu-devel] [PATCH V2 1/3] colo-compare: introduce colo compare initlization

2016-03-31 Thread Jason Wang



On 03/31/2016 05:24 PM, Dr. David Alan Gilbert wrote:
> * Zhang Chen (zhangchen.f...@cn.fujitsu.com) wrote:
>>
>> On 03/30/2016 05:25 PM, Dr. David Alan Gilbert wrote:
>>> * Zhang Chen (zhangchen.f...@cn.fujitsu.com) wrote:
 packet come from primary char indev will be send to
 outdev - packet come from secondary char dev will be drop
>>> Please put in the description an example of how you invoke
>>> the filter on the primary and secondary.
>> OK.
>> colo-compare get packet from chardev(primary_in,secondary_in),
>> and output to other chardev(outdev), so, we can use it with the
>> help of filter-mirror and filter-redirector. like that:
> Wow, this is a bit more complicated than I expected - I was expecting just one
> socket.  So let me draw this out; see comments below.
>
>> primary:
>> -netdev
>> tap,id=hn0,vhost=off,script=/etc/qemu-ifup,downscript=/etc/qemu-ifdown
>> -device e1000,id=e0,netdev=hn0,mac=52:a4:00:12:78:66
>> -chardev socket,id=mirror0,host=3.3.3.3,port=9003,server,nowait
>> -chardev socket,id=compare1,host=3.3.3.3,port=9004,server,nowait
>> -chardev socket,id=compare0,host=3.3.3.3,port=9001,server,nowait
>> -chardev socket,id=compare0-0,host=3.3.3.3,port=9001
>> -chardev socket,id=compare_out,host=3.3.3.3,port=9005,server,nowait
>> -chardev socket,id=compare_out0,host=3.3.3.3,port=9005
>> -object filter-mirror,id=m0,netdev=hn0,queue=tx,outdev=mirror0
>> -object filter-redirector,netdev=hn0,id=redire0,queue=rx,indev=compare_out
>> -object filter-redirector,netdev=hn0,id=redire1,queue=rx,outdev=compare0
>> -object 
>> colo-compare,id=comp0,primary_in=compare0-0,secondary_in=compare1,outdev=compare_out0
> > mirror0: socket:secondary:9003
> |
> mirror-m0 <-- e1000
> |
> v
> redirector-redire1 ---> compare0 socket:primary:9001 (to compare0-0)
>   
> -< compare0-0 socket:primary:9001 (to compare0)
> |  primary_in
> |
> compare-comp0   -> compare_out0 socket:primary:9005
> |
> |  secondary_in
> -< compare1   socket:secondary:9004
>
> tap <-- redirector-redire0 <--- compare_out socket:primary:9005 (from 
> compare_out0)
>
>
>> secondary:
>> -netdev tap,id=hn0,vhost=off,script=/etc/qemu-ifup,down
>> script=/etc/qemu-ifdown
>> -device e1000,netdev=hn0,mac=52:a4:00:12:78:66
>> -chardev socket,id=red0,host=3.3.3.3,port=9003
>> -chardev socket,id=red1,host=3.3.3.3,port=9004
>> -object filter-redirector,id=f1,netdev=hn0,queue=tx,indev=red0
>> -object filter-redirector,id=f2,netdev=hn0,queue=rx,outdev=red1
> > red0 socket::9003
> |
> tap <-- redirector-f1 <--
>e1000
> --> redirector-f2 -->
> |
> < red1 socket::9004
>
> OK, so I think we need to find a way to fix two things:
>a) There must be an easier way of connecting two filters within the same
>   qemu than going up to the kernel's socket code, around it's TCP stack
>   and back.  Is there a better type of chardev to use we can short circuit
>   with - it shouldn't need to leave QEMU (although I can see it's easier
>   for debugging like this).  Jason/Dan - any ideas?

Looks like there's no such type of chardev. I think we could start with
tcp socket chardev first and do necessary optimization on top. In fact,
there's also advantages, with tcp sockets, the colo-compare codes could
even be an independent program out of qemu.

For the chardev type, this also reminds me one thing. Since
mirror/redirector can only work for tcp socket, we may need inspect the
chardev types and fail if not a tcp socket like what vhost-user did.


>
>b) We should only need one socket for the connection between primary and
>   secondary - I'm not sure how to change it to let it do that.

Looks like we can (e.g for secondary):

-device e1000,netdev=hn0,mac=52:a4:00:12:78:66
-chardev socket,id=red0,host=3.3.3.3,port=9003
-object filter-redirector,id=f1,netdev=hn0,queue=tx,indev=red0
-object filter-redirector,id=f2,netdev=hn0,queue=rx,outdev=red0

If not, probably a bug.

>c) You need to be able to turn off nagling (socket no delay) on the 
> sockets;
>  I found a noticeable benefit of doing this on the connection between 
> primary
>  and secondary in my world.

Right.

>
> Dave
>

[Qemu-devel] [Bug 1563887] Re: qemu-system-ppc64 freezes on starting image on ppc64le

2016-03-31 Thread Serge Hallyn

I'll try building the package source from kilo on the xenial host and
see if that succeeds.  I'm having doubts.

-- 
You received this bug notification because you are a member of qemu-
devel-ml, which is subscribed to QEMU.
https://bugs.launchpad.net/bugs/1563887

Title:
  qemu-system-ppc64 freezes on starting image on ppc64le

Status in QEMU:
  New
Status in qemu package in Ubuntu:
  New

Bug description:
  qemu-system-ppc64 running on Ubuntu 16.04 beta-2 fails to start an
  image as part of the certification process. This on an IBM ppc64le in
  PowerVM mode running Ubuntu 16.04 beta-2 deployed by MAAS 1.9.1. There
  is no error output.

  ubuntu@alpine01:~/kvm$ qemu-system-ppc64 -m 256 -display none -nographic -net 
nic -net user,net=10.0.0.0/8,host=10.0.0.1,hostfwd=tcp::-:22 -machine 
pseries -drive file=xenial-server-cloudimg-ppc64el-disk1.img,if=virtio -drive 
file=seed.iso,if=virtio
  WARNING: Image format was not specified for 'seed.iso' and probing guessed 
raw.
   Automatically detecting the format is dangerous for raw images, 
write operations on block 0 will be restricted.
   Specify the 'raw' format explicitly to remove the restrictions.

  SLOF **
  QEMU Starting
   Build Date = Jan 29 2016 18:58:37
   FW Version = buildd@ release 20151103
   Press "s" to enter Open Firmware.

  Populating /vdevice methods
  Populating /vdevice/vty@7100
  Populating /vdevice/nvram@7101
  Populating /vdevice/l-lan@7102
  Populating /vdevice/v-scsi@7103
     SCSI: Looking for devices
    8200 CD-ROM   : "QEMU QEMU CD-ROM  2.5+"
  Populating /pci@8002000
   00 1800 (D) : 1af4 1001virtio [ block ]
   00 1000 (D) : 1af4 1001virtio [ block ]
   00 0800 (D) : 106b 003fserial bus [ usb-ohci ]
   00  (D) : 1234 qemu vga
  No NVRAM common partition, re-initializing...
  Installing QEMU fb

  Scanning USB
    OHCI: initializing
  USB Keyboard
  USB mouse
  No console specified using screen & keyboard

    Welcome to Open Firmware

    Copyright (c) 2004, 2011 IBM Corporation All rights reserved.
    This program and the accompanying materials are made available
    under the terms of the BSD License available at
    http://www.opensource.org/licenses/bsd-license.php

  Trying to load:  from: /pci@8002000/scsi@3 ...
  E3404: Not a bootable device!
  Trying to load:  from: /pci@8002000/scsi@2 ...   Successfully loaded
  Linux ppc64le
  #31-Ubuntu SMP F

  ProblemType: Bug
  DistroRelease: Ubuntu 16.04
  Package: qemu-system-ppc 1:2.5+dfsg-5ubuntu6
  ProcVersionSignature: Ubuntu 4.4.0-16.32-generic 4.4.6
  Uname: Linux 4.4.0-16-generic ppc64le
  ApportVersion: 2.20-0ubuntu3
  Architecture: ppc64el
  Date: Wed Mar 30 14:10:01 2016
  KvmCmdLine:
   COMMAND STAT  EUID  RUID   PID  PPID %CPU COMMAND
   kvm-irqfd-clean S<   0 0  1172 2  0.0 [kvm-irqfd-clean]
   qemu-nbdSsl  0 0 13467 1  0.0 qemu-nbd -c /dev/nbd0 
xenial-server-cloudimg-ppc64el-disk1.img
   qemu-system-ppc Sl+   1000  1000 18973 18896  101 qemu-system-ppc64 -m 256 
-display none -nographic -net nic -net 
user,net=10.0.0.0/8,host=10.0.0.1,hostfwd=tcp::-:22 -machine pseries -drive 
file=xenial-server-cloudimg-ppc64el-disk1.img,if=virtio -drive 
file=seed.iso,if=virtio
  Lsusb: Error: command ['lsusb'] failed with exit code 1:
  ProcEnviron:
   TERM=xterm
   PATH=(custom, no user)
   LANG=en_US.UTF-8
   SHELL=/bin/bash
  ProcKernelCmdLine: BOOT_IMAGE=/boot/vmlinux-4.4.0-16-generic 
root=UUID=92d820c8-ab25-497b-9b1e-f1435992bbf3 ro
  ProcLoadAvg: 1.08 0.94 0.58 2/616 19571
  ProcLocks:
   1: POSIX  ADVISORY  WRITE 886 00:13:381 0 EOF
   2: POSIX  ADVISORY  WRITE 1339 00:13:528 0 EOF
   3: FLOCK  ADVISORY  WRITE 1284 00:13:522 0 EOF
   4: POSIX  ADVISORY  WRITE 2281 00:13:563 0 EOF
   5: POSIX  ADVISORY  WRITE 1331 00:13:536 0 EOF
  ProcSwaps:
   Filename TypeSizeUsedPriority
   /swap.img   file 8388544 0   -1
  ProcVersion: Linux version 4.4.0-16-generic (buildd@bos01-ppc64el-001) (gcc 
version 5.3.1 20160320 (Ubuntu/Linaro/IBM 5.3.1-12ubuntu4) ) #32-Ubuntu SMP Thu 
Mar 24 22:31:14 UTC 2016
  SourcePackage: qemu
  UpgradeStatus: No upgrade log present (probably fresh install)
  bootlist:
   /pci@8002011/pci1014,034A@0/sas/disk@4068402c40
   
/pci@8002018/ethernet@0:speed=auto,duplex=auto,csarch,000.000.000.000,,000.000.000.000,000.000.000.000,5,5,000.000.000.000,512
   
/pci@8002018/ethernet@0,1:speed=auto,duplex=auto,csarch,000.000.000.000,,000.000.000.000,000.000.000.000,5,5,000.000.000.000,512
   
/pci@8002018/ethernet@0,2:speed=auto,duplex=auto,csarch,000.000.000.000,,000.000.000.000,000.000.000.000,5,5,000.000.000.000,512

[Qemu-devel] [Bug 1563887] Re: qemu-system-ppc64 freezes on starting image on ppc64le

2016-03-31 Thread Serge Hallyn

Hm, building 2.2.0 (close to what is in the kilo cloud archive) doesn't
help.

-- 
You received this bug notification because you are a member of qemu-
devel-ml, which is subscribed to QEMU.
https://bugs.launchpad.net/bugs/1563887

Title:
  qemu-system-ppc64 freezes on starting image on ppc64le

Status in QEMU:
  New
Status in qemu package in Ubuntu:
  New

Bug description:
  qemu-system-ppc64 running on Ubuntu 16.04 beta-2 fails to start an
  image as part of the certification process. This on an IBM ppc64le in
  PowerVM mode running Ubuntu 16.04 beta-2 deployed by MAAS 1.9.1. There
  is no error output.

  ubuntu@alpine01:~/kvm$ qemu-system-ppc64 -m 256 -display none -nographic -net 
nic -net user,net=10.0.0.0/8,host=10.0.0.1,hostfwd=tcp::-:22 -machine 
pseries -drive file=xenial-server-cloudimg-ppc64el-disk1.img,if=virtio -drive 
file=seed.iso,if=virtio
  WARNING: Image format was not specified for 'seed.iso' and probing guessed 
raw.
   Automatically detecting the format is dangerous for raw images, 
write operations on block 0 will be restricted.
   Specify the 'raw' format explicitly to remove the restrictions.

  SLOF **
  QEMU Starting
   Build Date = Jan 29 2016 18:58:37
   FW Version = buildd@ release 20151103
   Press "s" to enter Open Firmware.

  Populating /vdevice methods
  Populating /vdevice/vty@7100
  Populating /vdevice/nvram@7101
  Populating /vdevice/l-lan@7102
  Populating /vdevice/v-scsi@7103
     SCSI: Looking for devices
    8200 CD-ROM   : "QEMU QEMU CD-ROM  2.5+"
  Populating /pci@8002000
   00 1800 (D) : 1af4 1001virtio [ block ]
   00 1000 (D) : 1af4 1001virtio [ block ]
   00 0800 (D) : 106b 003fserial bus [ usb-ohci ]
   00  (D) : 1234 qemu vga
  No NVRAM common partition, re-initializing...
  Installing QEMU fb

  Scanning USB
    OHCI: initializing
  USB Keyboard
  USB mouse
  No console specified using screen & keyboard

    Welcome to Open Firmware

    Copyright (c) 2004, 2011 IBM Corporation All rights reserved.
    This program and the accompanying materials are made available
    under the terms of the BSD License available at
    http://www.opensource.org/licenses/bsd-license.php

  Trying to load:  from: /pci@8002000/scsi@3 ...
  E3404: Not a bootable device!
  Trying to load:  from: /pci@8002000/scsi@2 ...   Successfully loaded
  Linux ppc64le
  #31-Ubuntu SMP F

  ProblemType: Bug
  DistroRelease: Ubuntu 16.04
  Package: qemu-system-ppc 1:2.5+dfsg-5ubuntu6
  ProcVersionSignature: Ubuntu 4.4.0-16.32-generic 4.4.6
  Uname: Linux 4.4.0-16-generic ppc64le
  ApportVersion: 2.20-0ubuntu3
  Architecture: ppc64el
  Date: Wed Mar 30 14:10:01 2016
  KvmCmdLine:
   COMMAND STAT  EUID  RUID   PID  PPID %CPU COMMAND
   kvm-irqfd-clean S<   0 0  1172 2  0.0 [kvm-irqfd-clean]
   qemu-nbdSsl  0 0 13467 1  0.0 qemu-nbd -c /dev/nbd0 
xenial-server-cloudimg-ppc64el-disk1.img
   qemu-system-ppc Sl+   1000  1000 18973 18896  101 qemu-system-ppc64 -m 256 
-display none -nographic -net nic -net 
user,net=10.0.0.0/8,host=10.0.0.1,hostfwd=tcp::-:22 -machine pseries -drive 
file=xenial-server-cloudimg-ppc64el-disk1.img,if=virtio -drive 
file=seed.iso,if=virtio
  Lsusb: Error: command ['lsusb'] failed with exit code 1:
  ProcEnviron:
   TERM=xterm
   PATH=(custom, no user)
   LANG=en_US.UTF-8
   SHELL=/bin/bash
  ProcKernelCmdLine: BOOT_IMAGE=/boot/vmlinux-4.4.0-16-generic 
root=UUID=92d820c8-ab25-497b-9b1e-f1435992bbf3 ro
  ProcLoadAvg: 1.08 0.94 0.58 2/616 19571
  ProcLocks:
   1: POSIX  ADVISORY  WRITE 886 00:13:381 0 EOF
   2: POSIX  ADVISORY  WRITE 1339 00:13:528 0 EOF
   3: FLOCK  ADVISORY  WRITE 1284 00:13:522 0 EOF
   4: POSIX  ADVISORY  WRITE 2281 00:13:563 0 EOF
   5: POSIX  ADVISORY  WRITE 1331 00:13:536 0 EOF
  ProcSwaps:
   Filename TypeSizeUsedPriority
   /swap.img   file 8388544 0   -1
  ProcVersion: Linux version 4.4.0-16-generic (buildd@bos01-ppc64el-001) (gcc 
version 5.3.1 20160320 (Ubuntu/Linaro/IBM 5.3.1-12ubuntu4) ) #32-Ubuntu SMP Thu 
Mar 24 22:31:14 UTC 2016
  SourcePackage: qemu
  UpgradeStatus: No upgrade log present (probably fresh install)
  bootlist:
   /pci@8002011/pci1014,034A@0/sas/disk@4068402c40
   
/pci@8002018/ethernet@0:speed=auto,duplex=auto,csarch,000.000.000.000,,000.000.000.000,000.000.000.000,5,5,000.000.000.000,512
   
/pci@8002018/ethernet@0,1:speed=auto,duplex=auto,csarch,000.000.000.000,,000.000.000.000,000.000.000.000,5,5,000.000.000.000,512
   
/pci@8002018/ethernet@0,2:speed=auto,duplex=auto,csarch,000.000.000.000,,000.000.000.000,000.000.000.000,5,5,000.000.000.000,512

[Qemu-devel] [Bug 1562653] Re: Ubuntu 15.10: QEMU VM hang if memory >= 1T

2016-03-31 Thread changlimin

The issue is sloved after change cpuid[8008];

--- a/target-i386/cpu.c
+++ b/target-i386/cpu.c
@@ -2547,7 +2547,7 @@ void cpu_x86_cpuid(CPUX86State *env, uint32_t index, 
uint32_t count,
 if (env->features[FEAT_8000_0001_EDX] & CPUID_EXT2_LM) {
 /* 64 bit processor */
 /* XXX: The physical address space is limited to 42 bits in exec.c. */
-*eax = 0x3028; /* 48 bits virtual, 40 bits physical */
+*eax = 0x3029; /* 48 bits virtual, 41 bits physical */
 } else {
 if (env->features[FEAT_1_EDX] & CPUID_PSE36) {
 *eax = 0x0024; /* 36 bits physical */

-- 
You received this bug notification because you are a member of qemu-
devel-ml, which is subscribed to QEMU.
https://bugs.launchpad.net/bugs/1562653

Title:
  Ubuntu 15.10: QEMU VM hang if memory >= 1T

Status in QEMU:
  New
Status in qemu package in Ubuntu:
  Incomplete

Bug description:
  1. Ubuntu 15.10 x86_64 installed on HP SuperDome X with 8CPUs and 4T
  memory.

  2. Create a VM, install Ubuntu 15.10, if memory >= 1T , VM hang when start. 
If memory < 1T, it is good.
  
u1510-1
39eefe1e-4829-4843-b892-026d143f3ec7
1073741824
1073741824
16

  hvm
  
  


  
  
  


destroy
restart
restart

  /usr/bin/kvm
  




  
  




  
  
  

  
  

  
  





  
  
  
  

  
  


  
  

  

  

  3. The panic stack is
... cannot show
async_page_fault+0x28
ioread32_rep+0x38
ata_sff_data_xfer32+0x8a
ata_pio_sector+0x93
ata_pio_sectors+0x34
ata_sff_hsm_move+0x226
RIP: kthread_data+0x10
CR2: _FFD8

  4. Change the host os to Redhat 7.2 , the vm is good even memory
  >=3.8T.

To manage notifications about this bug go to:
https://bugs.launchpad.net/qemu/+bug/1562653/+subscriptions

Re: [Qemu-devel] [Qemu-ppc] [PATCH v2] spapr: compute interrupt vector address from LPCR

2016-03-31 Thread David Gibson

On Thu, Mar 31, 2016 at 03:55:42PM +1100, David Gibson wrote:
> On Wed, Mar 30, 2016 at 05:38:34PM +0200, Cédric Le Goater wrote:
> > This address is changed by the linux kernel using the H_SET_MODE hcall
> > and needs to be restored when migrating a spapr VM running in
> > TCG. This can be done using the AIL bits from the LPCR register.
> > 
> > The patch introduces a helper routine cpu_ppc_get_excp_prefix() which
> > returns the effective address offset of the interrupt handler
> > depending on the LPCR_AIL bits. The same helper can be used in the
> > H_SET_MODE hcall, which lets us remove the H_SET_MODE_ADDR_TRANS_*
> > defines.
> > 
> > Signed-off-by: Cédric Le Goater 
> 
> I've applied this (with Greg's minor amendments) to ppc-for-2.6),
> since it certainly improves behaviour, although I have a couple of
> queries:

And.. I've take it out again now.  In addition to the fact that I'd
like some rework suggested elsewhere, it breaks compile for 32-bit ppc
targets.

-- 
David Gibson| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson


signature.asc
Description: PGP signature

Re: [Qemu-devel] [RFC PATCH v2.1 05/12] qdev: hotplug: Introduce HotplugHandler.pre_plug() callback

2016-03-31 Thread David Gibson

On Thu, Mar 31, 2016 at 02:09:14PM +0530, Bharata B Rao wrote:
> From: Igor Mammedov 
> 
> pre_plug callback is to be called before device.realize() is executed.
> This would allow to check/set device's properties from HotplugHandler.
> 
> Signed-off-by: Igor Mammedov 
> Signed-off-by: Bharata B Rao 

Reviewed-by: David Gibson 

It would be really nice to get some opinion on this from Andreas or
Paolo.

> ---
>  hw/core/hotplug.c| 11 +++
>  hw/core/qdev.c   |  9 -
>  include/hw/hotplug.h | 14 +-
>  3 files changed, 32 insertions(+), 2 deletions(-)
> 
> diff --git a/hw/core/hotplug.c b/hw/core/hotplug.c
> index 645cfca..17ac986 100644
> --- a/hw/core/hotplug.c
> +++ b/hw/core/hotplug.c
> @@ -13,6 +13,17 @@
>  #include "hw/hotplug.h"
>  #include "qemu/module.h"
>  
> +void hotplug_handler_pre_plug(HotplugHandler *plug_handler,
> +  DeviceState *plugged_dev,
> +  Error **errp)
> +{
> +HotplugHandlerClass *hdc = HOTPLUG_HANDLER_GET_CLASS(plug_handler);
> +
> +if (hdc->pre_plug) {
> +hdc->pre_plug(plug_handler, plugged_dev, errp);
> +}
> +}
> +
>  void hotplug_handler_plug(HotplugHandler *plug_handler,
>DeviceState *plugged_dev,
>Error **errp)
> diff --git a/hw/core/qdev.c b/hw/core/qdev.c
> index db41aa1..a0b3aad 100644
> --- a/hw/core/qdev.c
> +++ b/hw/core/qdev.c
> @@ -1062,6 +1062,14 @@ static void device_set_realized(Object *obj, bool 
> value, Error **errp)
>  g_free(name);
>  }
>  
> +hotplug_ctrl = qdev_get_hotplug_handler(dev);
> +if (hotplug_ctrl) {
> +hotplug_handler_pre_plug(hotplug_ctrl, dev, _err);
> +if (local_err != NULL) {
> +goto fail;
> +}
> +}
> +
>  if (dc->realize) {
>  dc->realize(dev, _err);
>  }
> @@ -1072,7 +1080,6 @@ static void device_set_realized(Object *obj, bool 
> value, Error **errp)
>  
>  DEVICE_LISTENER_CALL(realize, Forward, dev);
>  
> -hotplug_ctrl = qdev_get_hotplug_handler(dev);
>  if (hotplug_ctrl) {
>  hotplug_handler_plug(hotplug_ctrl, dev, _err);
>  }
> diff --git a/include/hw/hotplug.h b/include/hw/hotplug.h
> index 2db025d..50d84e9 100644
> --- a/include/hw/hotplug.h
> +++ b/include/hw/hotplug.h
> @@ -46,7 +46,8 @@ typedef void (*hotplug_fn)(HotplugHandler *plug_handler,
>   * hardware (un)plug functions.
>   *
>   * @parent: Opaque parent interface.
> - * @plug: plug callback.
> + * @pre_plug: pre plug callback called at start of device.realize(true)
> + * @plug: plug callback called at end of device.realize(true).
>   * @unplug_request: unplug request callback.
>   *  Used as a means to initiate device unplug for devices 
> that
>   *  require asynchronous unplug handling.
> @@ -59,6 +60,7 @@ typedef struct HotplugHandlerClass {
>  InterfaceClass parent;
>  
>  /*  */
> +hotplug_fn pre_plug;
>  hotplug_fn plug;
>  hotplug_fn unplug_request;
>  hotplug_fn unplug;
> @@ -74,6 +76,16 @@ void hotplug_handler_plug(HotplugHandler *plug_handler,
>Error **errp);
>  
>  /**
> + * hotplug_handler_pre_plug:
> + *
> + * Call #HotplugHandlerClass.pre_plug callback of @plug_handler.
> + */
> +void hotplug_handler_pre_plug(HotplugHandler *plug_handler,
> +  DeviceState *plugged_dev,
> +  Error **errp);
> +
> +
> +/**
>   * hotplug_handler_unplug_request:
>   *
>   * Calls #HotplugHandlerClass.unplug_request callback of @plug_handler.

-- 
David Gibson| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson


signature.asc
Description: PGP signature

[Qemu-devel] [PATCH] target-ppc: Correct KVM synchronization for ppc_hash64_set_external_hpt()

2016-03-31 Thread David Gibson

ppc_hash64_set_external_hpt() was added in e5c0d3c "target-ppc: Add helpers
for updating a CPU's SDR1 and external HPT".  This helper contains a
cpu_synchronize_state() since it may need to push state back to KVM
afterwards.

This turns out to break things when it is used in the reset path, which is
the only current user.  It appears that kvm_vcpu_dirty is not being set
early in the reset path, so the cpu_synchronize_state() is clobbering state
set up by the early part of the cpu reset path with stale state from KVM.

To fix this, remove the cpu_synchronize_state() from
ppc_hash64_set_external_hpt().  Any future non-reset-path users will need
to manually invoke cpu_synchronize_state().

Reported-by: Laurent Vivier 
Signed-off-by: David Gibson 
---
 target-ppc/mmu-hash64.c | 2 --
 1 file changed, 2 deletions(-)

Paolo, Peter,

This seems like the right minimal fix in the qemu-2.6 timeframe to fix
the actual bug.  However, longer term it seems like the correct thing
to do might be to set kvm_vcpu_dirty early in the reset path.  Thoughts?

diff --git a/target-ppc/mmu-hash64.c b/target-ppc/mmu-hash64.c
index 72c4ab5..caf41ce 100644
--- a/target-ppc/mmu-hash64.c
+++ b/target-ppc/mmu-hash64.c
@@ -283,8 +283,6 @@ void ppc_hash64_set_external_hpt(PowerPCCPU *cpu, void 
*hpt, int shift,
 CPUPPCState *env = >env;
 Error *local_err = NULL;
 
-cpu_synchronize_state(CPU(cpu));
-
 if (hpt) {
 env->external_htab = hpt;
 } else {
-- 
2.5.5

Re: [Qemu-devel] [PATCH for-2.6] spapr_drc: enable immediate detach for unsignalled devices

2016-03-31 Thread Michael Roth

Quoting David Gibson (2016-03-31 21:33:25)
> On Thu, Mar 31, 2016 at 05:27:37PM -0500, Michael Roth wrote:
> > Currently spapr doesn't support "aborting" hotplug of PCI
> > devices by allowing device_del to immediately remove the
> > device if we haven't signalled the presence of the device
> > to the guest.
> > 
> > In the past this wasn't an issue, since we always immediately
> > signalled device attach and simply relied on full guest-aware
> > add->remove path for device removal. However, as of 788d259,
> > we now defer signalling for PCI functions until function 0
> > is attached, so now we need to deal with these "abort" operations
> > for cases where a user hotplugs a non-0 function, then opts to
> > remove it prior hotplugging function 0. Currently they'd have to
> > reboot before the unplug completed. PCIe multifunction hotplug
> > does not have this requirement however, so from a management
> > implementation perspective it would be good to address this within
> > the same release as 788d259.
> > 
> > We accomplish this by simply adding a 'signalled' flag to track
> > whether a device hotplug event has been sent to the guest. If it
> > hasn't, we allow immediate removal under the assumption that the
> > guest will not be using the device. Devices present at boot/reset
> > time are also assumed to be 'signalled'.
> > 
> > For CPU/memory/etc, signalling will still happen immediately
> > as part of device_add, so only PCI functions should be affected.
> > 
> > Cc: bhar...@linux.vnet.ibm.com
> > Cc: da...@gibson.dropbear.id.au
> > Cc: sb...@linux.vnet.ibm.com
> > Cc: qemu-...@nongnu.org
> > Signed-off-by: Michael Roth 
> 
> So, I'm disinclined to take this during the hard freeze, without some
> evidence that it fixes a problem case that's really being hit in
> practice.

The basic situation is that previously we had:

device_add virtio-net-pci,id=hp5.2,addr=0x5.2
  a1) plug device into slot
  a2) signal guest of attach
  a3) guest adds device
device_del hp5.2 
  d1) signal guest of detach
  d2) wait for guest to clean up device and signal completion
  d3) unplug device from slot

After 788d259 we have:

device_add virtio-net-pci,id=hp5.2,addr=0x5.2
  a1) plug device into slot
  a2) defer signalling until we see 0x5.0 added
device_del hp5.2 
  d1) defer signalling removal until we see 0x5.0 deleted
  d2) wait for guest to clean up device and signal completion

But d2) never happens because the guest was never signalled that
the device was added in the first place.

A real world situation where this might occur is admittedly a bit
contrived, but in practice I can certainly see it happening:

a) user hotplugs multifunction virtio-net-pci to slot 5, starting
   with function 3 at 0x5.3
b) user notices they made a mistake, for instance, they forget enable
   vhost in the accompanying netdev
c) user attempts to unplug function and "abort" the operation, but
   device does not go away

Shiva (on cc:) is working on the libvirt implementation for this
and hit this in testing. Since it differs in behavior from the PCIe
workflow it was based on (where aborts are explicitly allowed), and
causes a minor regression from 2.5, I thought it might be worth
considering for 2.6, but I can certainly understand if you opt to
hold off till 2.7.

> 
> > ---
> >  hw/ppc/spapr_drc.c | 34 ++
> >  hw/ppc/spapr_events.c  | 11 +++
> >  include/hw/ppc/spapr_drc.h |  2 ++
> >  3 files changed, 47 insertions(+)
> > 
> > diff --git a/hw/ppc/spapr_drc.c b/hw/ppc/spapr_drc.c
> > index ef063c0..5568e44 100644
> > --- a/hw/ppc/spapr_drc.c
> > +++ b/hw/ppc/spapr_drc.c
> > @@ -173,6 +173,12 @@ static void set_configured(sPAPRDRConnector *drc)
> >  drc->configured = true;
> >  }
> >  
> > +/* has the guest been notified of device attachment? */
> > +static void set_signalled(sPAPRDRConnector *drc)
> > +{
> > +drc->signalled = true;
> > +}
> > +
> >  /*
> >   * dr-entity-sense sensor value
> >   * returned via get-sensor-state RTAS calls
> > @@ -355,6 +361,7 @@ static void attach(sPAPRDRConnector *drc, DeviceState 
> > *d, void *fdt,
> >  drc->fdt = fdt;
> >  drc->fdt_start_offset = fdt_start_offset;
> >  drc->configured = coldplug;
> > +drc->signalled = coldplug;
> >  
> >  object_property_add_link(OBJECT(drc), "device",
> >   object_get_typename(OBJECT(drc->dev)),
> > @@ -371,6 +378,26 @@ static void detach(sPAPRDRConnector *drc, DeviceState 
> > *d,
> >  drc->detach_cb = detach_cb;
> >  drc->detach_cb_opaque = detach_cb_opaque;
> >  
> > +/* if we've signalled device presence to the guest, or if the guest
> > + * has gone ahead and configured the device (via manually-executed
> > + * device add via drmgr in guest, namely), we need to wait
> > + * for the guest to quiesce the device before completing detach.
> > + * Otherwise, we can assume the guest hasn't seen it and complete

Re: [Qemu-devel] [PATCH] Improve documentation of FUA and FLUSH

2016-03-31 Thread Eric Blake

On 03/31/2016 05:03 PM, Alex Bligh wrote:
> Improve the documentation of NBD_CMD_FLUSH and NBD_CMD_FLAG_FUA. Specifically
> the latter may be set on any command, and its semantics on commands other
> than NBD_CMD_WRITE need explaining. Further, explain how these relate to
> reordering of commands.
> 
> Signed-off-by: Alex Bligh 
> ---
>  doc/proto.md | 52 ++--
>  1 file changed, 42 insertions(+), 10 deletions(-)
> 
> diff --git a/doc/proto.md b/doc/proto.md
> index c1e05c5..bc4483d 100644
> --- a/doc/proto.md
> +++ b/doc/proto.md
> @@ -197,6 +197,37 @@ handle as was sent by the client in the corresponding 
> request.  In
>  this way, the client can correlate which request is receiving a
>  response.
>  
> + Ordering of messages and writes
> +
> +The server MAY process commands out of order, and MAY reply out of
> +order, save that:
> +
> +* All write commands (that includes both `NBD_CMD_WRITE` and
> +  `NBD_CMD_TRIM`) that the server completes (i.e. replies to)
> +  prior to processing to a `NBD_CMD_FLUSH` MUST be written to non-volatile
> +  storage prior to replying to that `NBD_CMD_FLUSH`. The server SHOULD ensure
> +  that all write command received prior to processing the `NBD_CMD_FLUSH`
> +  (whether they are replied to or not) are written to non-volatile
> +  storage prior to processing an `NBD_CMD_FLUSH`; note this is a
> +  stronger condition than the previous 'MUST' condition. This
> +  paragram only applies if `NBD_FLAG_SEND_FLUSH` is set within

s/paragram/paragraph/

> +  the transmission flags, as otherwise `NBD_CMD_FLUSH` will never
> +  be sent by the client to the server.
> +
> +* A server MUST NOT reply to a command that has `NBD_CMD_FLAG_FUA` set
> +  in its command flags until   the data area referred to by that command

why multiple spaces?

> +  is persisted to non-volatile storage. This only applies if
> +  `NBD_FLAG_SEND_FLUSH` is set within the transmission flags, as otherwise

s/FLUSH/FUA/

> +  `NBD_CMD_FLAG_FUA` will not be set on any commands sent to the server
> +  by the client.
> +
> +`NBD_CMD_FLUSH` is modelled on the Linux kernel empty bio with
> +`REQ_FLUSH` set. `NBD_CMD_FLAG_FUA` is modelled on the Linux
> +kernel bio with `REQ_FUA` set. In case of ambiguity in this
> +specification, the
> +[kernel 
> documentation](https://www.kernel.org/doc/Documentation/block/writeback_cache_control.txt)
> +may be useful.
> +
>   Request message
>  
>  The request message, sent by the client, looks as follows:
> @@ -444,10 +475,17 @@ affects a particular command.  Clients MUST NOT set a 
> command flag bit
>  that is not documented for the particular command; and whether a flag is
>  valid may depend on negotiation during the handshake phase.
>  
> -- bit 0, `NBD_CMD_FLAG_FUA`; valid during `NBD_CMD_WRITE`.  SHOULD be
> -  set to 1 if the client requires "Force Unit Access" mode of
> -  operation.  MUST NOT be set unless transmission flags included
> -  `NBD_FLAG_SEND_FUA`.
> +- bit 0, `NBD_CMD_FLAG_FUA`. This bit
> +  MUST be set to 0 unless the `NBD_FLAG_SEND_FUA` flag ("Force Unit Access")
> +  was set in the transmission flags field. If the `NBD_FLAG_SEND_FUA`
> +  is set in the transmission flags field, the client MAY set
> +  `NBD_CMD_FLAG_FUA` in any request. If this bit is set, the server
> +  MUST NOT send a reply until it has ensured that any data referred to
> +  by this request (i.e. written data on a write or trim, read data on
> +  a read) has reached permanent storage. There will be certain commands
> +  (e.g. `NBD_CMD_DISC`) for which this flag will thus not alter behaviour
> +  (as the command does not refer to any data), in which case the server
> +  MUST ignore this bit.

Makes sense, but we now need to fix the reference implementation  to
match (the recent commit ab22e082 rejects NBD_CMD_FLAG_FUA on all but
writes).

-- 
Eric Blake   eblake redhat com+1-919-301-3266
Libvirt virtualization library http://libvirt.org



signature.asc
Description: OpenPGP digital signature

[Qemu-devel] [PATCH] rtl8139: using CP_TX_OWN for ownership transferring during tx

2016-03-31 Thread Jason Wang

Through CP_TX_OWN and CP_RX_OWN points to the same bit, we'd better use
CP_TX_OWN for tx descriptor handling.

Signed-off-by: Jason Wang 
---
 hw/net/rtl8139.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/hw/net/rtl8139.c b/hw/net/rtl8139.c
index fee97bf..1e5ec14 100644
--- a/hw/net/rtl8139.c
+++ b/hw/net/rtl8139.c
@@ -2046,7 +2046,7 @@ static int rtl8139_cplus_transmit_one(RTL8139State *s)
 }
 
 /* transfer ownership to target */
-txdw0 &= ~CP_RX_OWN;
+txdw0 &= ~CP_TX_OWN;
 
 /* reset error indicator bits */
 txdw0 &= ~CP_TX_STATUS_UNF;
-- 
2.5.0

Re: [Qemu-devel] [PATCH for-2.6] spapr_drc: enable immediate detach for unsignalled devices

2016-03-31 Thread David Gibson

On Thu, Mar 31, 2016 at 05:27:37PM -0500, Michael Roth wrote:
> Currently spapr doesn't support "aborting" hotplug of PCI
> devices by allowing device_del to immediately remove the
> device if we haven't signalled the presence of the device
> to the guest.
> 
> In the past this wasn't an issue, since we always immediately
> signalled device attach and simply relied on full guest-aware
> add->remove path for device removal. However, as of 788d259,
> we now defer signalling for PCI functions until function 0
> is attached, so now we need to deal with these "abort" operations
> for cases where a user hotplugs a non-0 function, then opts to
> remove it prior hotplugging function 0. Currently they'd have to
> reboot before the unplug completed. PCIe multifunction hotplug
> does not have this requirement however, so from a management
> implementation perspective it would be good to address this within
> the same release as 788d259.
> 
> We accomplish this by simply adding a 'signalled' flag to track
> whether a device hotplug event has been sent to the guest. If it
> hasn't, we allow immediate removal under the assumption that the
> guest will not be using the device. Devices present at boot/reset
> time are also assumed to be 'signalled'.
> 
> For CPU/memory/etc, signalling will still happen immediately
> as part of device_add, so only PCI functions should be affected.
> 
> Cc: bhar...@linux.vnet.ibm.com
> Cc: da...@gibson.dropbear.id.au
> Cc: sb...@linux.vnet.ibm.com
> Cc: qemu-...@nongnu.org
> Signed-off-by: Michael Roth 

So, I'm disinclined to take this during the hard freeze, without some
evidence that it fixes a problem case that's really being hit in
practice.

> ---
>  hw/ppc/spapr_drc.c | 34 ++
>  hw/ppc/spapr_events.c  | 11 +++
>  include/hw/ppc/spapr_drc.h |  2 ++
>  3 files changed, 47 insertions(+)
> 
> diff --git a/hw/ppc/spapr_drc.c b/hw/ppc/spapr_drc.c
> index ef063c0..5568e44 100644
> --- a/hw/ppc/spapr_drc.c
> +++ b/hw/ppc/spapr_drc.c
> @@ -173,6 +173,12 @@ static void set_configured(sPAPRDRConnector *drc)
>  drc->configured = true;
>  }
>  
> +/* has the guest been notified of device attachment? */
> +static void set_signalled(sPAPRDRConnector *drc)
> +{
> +drc->signalled = true;
> +}
> +
>  /*
>   * dr-entity-sense sensor value
>   * returned via get-sensor-state RTAS calls
> @@ -355,6 +361,7 @@ static void attach(sPAPRDRConnector *drc, DeviceState *d, 
> void *fdt,
>  drc->fdt = fdt;
>  drc->fdt_start_offset = fdt_start_offset;
>  drc->configured = coldplug;
> +drc->signalled = coldplug;
>  
>  object_property_add_link(OBJECT(drc), "device",
>   object_get_typename(OBJECT(drc->dev)),
> @@ -371,6 +378,26 @@ static void detach(sPAPRDRConnector *drc, DeviceState *d,
>  drc->detach_cb = detach_cb;
>  drc->detach_cb_opaque = detach_cb_opaque;
>  
> +/* if we've signalled device presence to the guest, or if the guest
> + * has gone ahead and configured the device (via manually-executed
> + * device add via drmgr in guest, namely), we need to wait
> + * for the guest to quiesce the device before completing detach.
> + * Otherwise, we can assume the guest hasn't seen it and complete the
> + * detach immediately. Note that there is a small race window
> + * just before, or during, configuration, which is this context
> + * refers mainly to fetching the device tree via RTAS.
> + * During this window the device access will be arbitrated by
> + * associated DRC, which will simply fail the RTAS calls as invalid.
> + * This is recoverable within guest and current implementations of
> + * drmgr should be able to cope.

Sorry, I'm really not following this description of the race, or
seeing how it relates to allowing removal of "half plugged" devices.

> + */
> +if (!drc->signalled && !drc->configured) {
> +/* if the guest hasn't seen the device we can't rely on it to
> + * set it back to an isolated state via RTAS, so do it here manually
> + */
> +drc->isolation_state = SPAPR_DR_ISOLATION_STATE_ISOLATED;
> +}
> +
>  if (drc->isolation_state != SPAPR_DR_ISOLATION_STATE_ISOLATED) {
>  DPRINTFN("awaiting transition to isolated state before removal");
>  drc->awaiting_release = true;
> @@ -409,6 +436,7 @@ static void reset(DeviceState *d)
>  {
>  sPAPRDRConnector *drc = SPAPR_DR_CONNECTOR(d);
>  sPAPRDRConnectorClass *drck = SPAPR_DR_CONNECTOR_GET_CLASS(drc);
> +sPAPRDREntitySense state;
>  
>  DPRINTFN("drc reset: %x", drck->get_index(drc));
>  /* immediately upon reset we can safely assume DRCs whose devices
> @@ -436,6 +464,11 @@ static void reset(DeviceState *d)
>  drck->set_allocation_state(drc, 
> SPAPR_DR_ALLOCATION_STATE_UNUSABLE);
>  }
>  }
> +
> +drck->entity_sense(drc, );
> +if (state ==

Re: [Qemu-devel] [Qemu-ppc] [PATCH v2] spapr: compute interrupt vector address from LPCR

2016-03-31 Thread David Gibson

On Thu, Mar 31, 2016 at 08:13:09AM +0100, Mark Cave-Ayland wrote:
> On 31/03/16 05:55, David Gibson wrote:
> 
> > On Wed, Mar 30, 2016 at 05:38:34PM +0200, Cédric Le Goater wrote:
> >> This address is changed by the linux kernel using the H_SET_MODE hcall
> >> and needs to be restored when migrating a spapr VM running in
> >> TCG. This can be done using the AIL bits from the LPCR register.
> >>
> >> The patch introduces a helper routine cpu_ppc_get_excp_prefix() which
> >> returns the effective address offset of the interrupt handler
> >> depending on the LPCR_AIL bits. The same helper can be used in the
> >> H_SET_MODE hcall, which lets us remove the H_SET_MODE_ADDR_TRANS_*
> >> defines.
> >>
> >> Signed-off-by: Cédric Le Goater 
> > 
> > I've applied this (with Greg's minor amendments) to ppc-for-2.6),
> > since it certainly improves behaviour, although I have a couple of
> > queries:
> > 
> >> ---
> >>
> >>  Changes since v1:
> >>
> >>  - moved helper routine under target-ppc/
> >>  - moved the restore of excp_prefix under cpu_post_load()
> >>
> >>  hw/ppc/spapr_hcall.c|   13 ++---
> >>  include/hw/ppc/spapr.h  |5 -
> >>  target-ppc/cpu.h|9 +
> >>  target-ppc/machine.c|   20 +++-
> >>  target-ppc/translate_init.c |   14 ++
> >>  5 files changed, 44 insertions(+), 17 deletions(-)
> >>
> >> Index: qemu-dgibson-for-2.6.git/hw/ppc/spapr_hcall.c
> >> ===
> >> --- qemu-dgibson-for-2.6.git.orig/hw/ppc/spapr_hcall.c
> >> +++ qemu-dgibson-for-2.6.git/hw/ppc/spapr_hcall.c
> >> @@ -835,17 +835,8 @@ static target_ulong h_set_mode_resource_
> >>  return H_P4;
> >>  }
> >>  
> >> -switch (mflags) {
> >> -case H_SET_MODE_ADDR_TRANS_NONE:
> >> -prefix = 0;
> >> -break;
> >> -case H_SET_MODE_ADDR_TRANS_0001_8000:
> >> -prefix = 0x18000;
> >> -break;
> >> -case H_SET_MODE_ADDR_TRANS_C000___4000:
> >> -prefix = 0xC0004000ULL;
> >> -break;
> >> -default:
> >> +prefix = cpu_ppc_get_excp_prefix(mflags);
> >> +if (prefix == (target_ulong) -1ULL) {
> >>  return H_UNSUPPORTED_FLAG;
> >>  }
> >>  
> >> Index: qemu-dgibson-for-2.6.git/target-ppc/machine.c
> >> ===
> >> --- qemu-dgibson-for-2.6.git.orig/target-ppc/machine.c
> >> +++ qemu-dgibson-for-2.6.git/target-ppc/machine.c
> >> @@ -156,12 +156,26 @@ static void cpu_pre_save(void *opaque)
> >>  }
> >>  }
> >>  
> >> +
> >> +static int cpu_post_load_excp_prefix(CPUPPCState *env)
> >> +{
> >> +int ail = (env->spr[SPR_LPCR] & LPCR_AIL) >> LPCR_AIL_SHIFT;
> >> +target_ulong prefix = cpu_ppc_get_excp_prefix(ail);
> >> +
> >> +if (prefix == (target_ulong) -1ULL) {
> >> +return -EINVAL;
> >> +}
> >> +env->excp_prefix = prefix;
> >> +return 0;
> >> +}
> >> +
> >>  static int cpu_post_load(void *opaque, int version_id)
> >>  {
> >>  PowerPCCPU *cpu = opaque;
> >>  CPUPPCState *env = >env;
> >>  int i;
> >>  target_ulong msr;
> >> +int ret = 0;
> >>  
> >>  /*
> >>   * We always ignore the source PVR. The user or management
> >> @@ -201,7 +215,11 @@ static int cpu_post_load(void *opaque, i
> >>  
> >>  hreg_compute_mem_idx(env);
> >>  
> >> -return 0;
> >> +if (env->spr[SPR_LPCR] & LPCR_AIL) {
> >> +ret = cpu_post_load_excp_prefix(env);
> >> +}
> > 
> > Why not call this unconditionally?  If AIL == 0 it will still do the
> > right thing.
> > 
> > Aren't there also circumstances where the exception prefix can depend
> > on the MSR?  Do those need to be handled somewhere?
> 
> Yes indeed - this was part of my patchset last year to fix up various
> migration issues for the Mac PPC machines (see commit
> 2360b6e84f78d41fa0f76555a947148b73645259).
> 
> I agree that having the env->excp_prefix logic split like this isn't a
> particularly great idea. Let's just have a single helper function as in
> the patch above and use that in both places (and in fact with recent
> changes I have a feeling env->excp_prefix is one of the few remaining
> reasons that the above change is still required, but please check this).

Right.

So, what I'd really prefer to see here is a single
update_excp_prefix() helper which will set env->excp_prefix based on
whatever registers are relevant (LPCR, MSR and probably the cpu
model).  That would be called on incoming migration, and after
updating any of the relevant registers (so from the mtmsr and mtlpcr
emulations).  H_SET_MODE should be handled by first updating the LPCR
value, then calling the update helper.

Cédric, I'm ok if you provide a version of that which only handles
POWER7 and POWER8 (i.e. spapr compatible models for now).  Mark - if
you can supply corrections to that outline for Macintosh era models,
that would be great.

I do

Re: [Qemu-devel] [PATCH for-2.7] hw/net/spapr_llan: Delay flushing of the RX queue while adding new RX buffers

2016-03-31 Thread David Gibson

On Thu, Mar 31, 2016 at 01:47:05PM +0200, Thomas Huth wrote:
> Currently, the spapr-vlan device is trying to flush the RX queue
> after each RX buffer that has been added by the guest via the
> H_ADD_LOGICAL_LAN_BUFFER hypercall. In case the receive buffer pool
> was empty before, we only pass single packets to the guest this
> way. This can cause very bad performance if a sender is trying
> to stream fragmented UDP packets to the guest. For example when
> using the UDP_STREAM test from netperf with UDP packets that are
> much bigger than the MTU size, almost all UDP packets are dropped
> in the guest since the chances are quite high that at least one of
> the fragments got lost on the way.
> 
> When flushing the receive queue, it's much better if we'd have
> a bunch of receive buffers available already, so that fragmented
> packets can be passed to the guest in one go. To do this, the
> spapr_vlan_receive() function should return 0 instead of -1 if there
> are no more receive buffers available, so that receive_disabled = 1
> gets temporarily set for the receive queue, and we have to delay
> the queue flushing at the end of h_add_logical_lan_buffer() a little
> bit by using a timer, so that the guest gets a chance to add multiple
> RX buffers before we flush the queue again.
> 
> This improves the UDP_STREAM test with the spapr-vlan device a lot:
> Running
>  netserver -p 4 -L  -f -D -4
> in the guest, and
>  netperf -p 4 -L  -H  -t UDP_STREAM -l 60 -- -m 16384
> in the host, I get the following values _without_ this patch:
> 
> Socket  Message  Elapsed  Messages
> SizeSize Time Okay Errors   Throughput
> bytes   bytessecs#  #   10^6bits/sec
> 
> 229376   16384   60.00 1738970  03798.83
> 229376   60.00  23  0.05
> 
> That "0.05" means that almost all UDP packets got lost/discarded
> at the receiving side.
> With this patch applied, the value look much better:
> 
> Socket  Message  Elapsed  Messages
> SizeSize Time Okay Errors   Throughput
> bytes   bytessecs#  #   10^6bits/sec
> 
> 229376   16384   60.00 1789104  03908.35
> 229376   60.00   22818 49.85
> 
> Signed-off-by: Thomas Huth 
> ---
>  As a reference: When running the same test with a "e1000" NIC
>  instead of "spapr-vlan", I get a throughput of ~85 up to 100 MBits/s
>  ... so the spapr-vlan is still not as good as other NICs here, but
>  at least it's much better than before.

Nice work!

Applied to ppc-for-2.7.

> 
>  hw/net/spapr_llan.c | 28 +---
>  1 file changed, 25 insertions(+), 3 deletions(-)
> 
> diff --git a/hw/net/spapr_llan.c b/hw/net/spapr_llan.c
> index a647f25..d604d55 100644
> --- a/hw/net/spapr_llan.c
> +++ b/hw/net/spapr_llan.c
> @@ -109,6 +109,7 @@ typedef struct VIOsPAPRVLANDevice {
>  target_ulong buf_list;
>  uint32_t add_buf_ptr, use_buf_ptr, rx_bufs;
>  target_ulong rxq_ptr;
> +QEMUTimer *rxp_timer;
>  uint32_t compat_flags; /* Compatability flags for migration 
> */
>  RxBufPool *rx_pool[RX_MAX_POOLS];  /* Receive buffer descriptor pools */
>  } VIOsPAPRVLANDevice;
> @@ -205,7 +206,7 @@ static ssize_t spapr_vlan_receive(NetClientState *nc, 
> const uint8_t *buf,
>  }
>  
>  if (!dev->rx_bufs) {
> -return -1;
> +return 0;
>  }
>  
>  if (dev->compat_flags & SPAPRVLAN_FLAG_RX_BUF_POOLS) {
> @@ -214,7 +215,7 @@ static ssize_t spapr_vlan_receive(NetClientState *nc, 
> const uint8_t *buf,
>  bd = spapr_vlan_get_rx_bd_from_page(dev, size);
>  }
>  if (!bd) {
> -return -1;
> +return 0;
>  }
>  
>  dev->rx_bufs--;
> @@ -265,6 +266,13 @@ static NetClientInfo net_spapr_vlan_info = {
>  .receive = spapr_vlan_receive,
>  };
>  
> +static void spapr_vlan_flush_rx_queue(void *opaque)
> +{
> +VIOsPAPRVLANDevice *dev = opaque;
> +
> +qemu_flush_queued_packets(qemu_get_queue(dev->nic));
> +}
> +
>  static void spapr_vlan_reset_rx_pool(RxBufPool *rxp)
>  {
>  /*
> @@ -301,6 +309,9 @@ static void spapr_vlan_realize(VIOsPAPRDevice *sdev, 
> Error **errp)
>  dev->nic = qemu_new_nic(_spapr_vlan_info, >nicconf,
>  object_get_typename(OBJECT(sdev)), 
> sdev->qdev.id, dev);
>  qemu_format_nic_info_str(qemu_get_queue(dev->nic), 
> dev->nicconf.macaddr.a);
> +
> +dev->rxp_timer = timer_new_us(QEMU_CLOCK_VIRTUAL, 
> spapr_vlan_flush_rx_queue,
> +  dev);
>  }
>  
>  static void spapr_vlan_instance_init(Object *obj)
> @@ -331,6 +342,11 @@ static void spapr_vlan_instance_finalize(Object *obj)
>  dev->rx_pool[i] = NULL;
>  }
>  }
> +
> +if (dev->rxp_timer) {
> +timer_del(dev->rxp_timer);
> +timer_free(dev->rxp_timer);
> +}
>  }
>  
>  void spapr_vlan_create(VIOsPAPRBus *bus, NICInfo *nd)
> @@ -628,7 +644,13 @@

Re: [Qemu-devel] [PATCH] net: fix OptsVisitor memory leak

2016-03-31 Thread Jason Wang



On 04/01/2016 12:22 AM, Eric Blake wrote:
> On 03/31/2016 08:28 AM, Paolo Bonzini wrote:
>> Signed-off-by: Paolo Bonzini 
> May want to add: regression introduced in commit 96a1616.
>
>> ---
>>  net/net.c | 1 +
>>  1 file changed, 1 insertion(+)
> Reviewed-by: Eric Blake 

Applied to -net.

Thanks

>
>> diff --git a/net/net.c b/net/net.c
>> index 594c3b8..3847a13 100644
>> --- a/net/net.c
>> +++ b/net/net.c
>> @@ -1100,6 +1100,7 @@ int net_client_init(QemuOpts *opts, int is_netdev, 
>> Error **errp)
>>  }
>>  
>>  error_propagate(errp, err);
>> +opts_visitor_cleanup(ov);
>>  return ret;
>>  }
>>  
>>

Re: [Qemu-devel] [patch v5 11/12] vfio: device may stuck in D3 when doing aer recovery

2016-03-31 Thread Chen Fan



On 04/01/2016 09:40 AM, Chen Fan wrote:


On 03/31/2016 11:44 PM, Alex Williamson wrote:

On Thu, 31 Mar 2016 14:55:07 +0800
Chen Fan  wrote:


On 03/25/2016 10:22 AM, Alex Williamson wrote:

On Fri, 25 Mar 2016 09:38:09 +0800
Chen Fan  wrote:

On 03/25/2016 06:54 AM, Alex Williamson wrote:

On Wed, 23 Mar 2016 18:12:06 +0800
Cao jin  wrote:

From: Chen Fan 

when a physical device aer occurred, the device state probably
is not in D0 in a short time, if we recover the device quickly.
we may stuck in D3 state when force to change device state to D0.
we may need to wait for a short time to inject the error to guest.

Signed-off-by: Chen Fan 
---
hw/vfio/pci.c | 3 +++
1 file changed, 3 insertions(+)

diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index 25fc095..5216e7f 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -2658,6 +2658,9 @@ static void vfio_err_notifier_handler(void 
*opaque)

msg.severity = isfatal ? PCI_ERR_ROOT_CMD_FATAL_EN :
PCI_ERR_ROOT_CMD_NONFATAL_EN;
+/* wait a bit to ensure aer device is ready */
+usleep(2 * 1000);

Where does this number come from?  Why would the device be in D3?  I
don't understand this at all.

Hi Alex,

   when I tested the code in my environment, I found that when 
I used
the aer-inject module to inject a fake aer error to device on 
host, the qemu
would throw out the message "vfio: Unable to power on device, 
stuck in D3"
on and off. if I use "gdb" to debug the vfio_pci_pre_reset, the 
phenomenon
would not appearance, I just thought it should be some timing race 
issue,
so I use a sleep() to wait 2ms (double the reset time of 1ms) to 
ensure the

device state is ready. maybe the root reason still need to be
investigated deeply.

Yes, it sounds like you need to investigate this further, the delay is
arbitrary and perhaps suggests a race that needs to be fixed
correctly.  Thanks,

Hi Alex,

  after done some investigation of the problem, I found that only
when the injected
   error is fatal, the problem will appear. because in aer do_recovery,
host will call reset_link
on the root port, which would invoke pci_reset_bridge_secondary_bus in
aer_root_reset,
that would reset the bridge and all the device under that. so when qemu
receive the aer
notification, then propagate the error to guest, guest does the same 
way

to perform the
recovery, if the guest `reset_link` that will call the vfio_pre_reset
done at the stage of host
bridge reset, the device status would probable stick in D3.

so I think after qemu receive the aer notification, we should wait for
enough time to
ensure the bridge has been reset completely. I just use sleep <=10ms to
test the code,
seems still appear the message "vfio: Unable to power on device, stuck
in D3". so I think
we should sleep 100ms to ensure the delay sufficient. I have tested 
that

code 100+ times
by inject aer error. the issue no longer appears.

I'm not satisfied with this.  pci_reset_bridge_secondary_bus() is
invoked by both the host AER code and the guest AER code, the latter
via the vfio PCI hot reset interface.  The
pci_reset_bridge_secondary_bus() function includes the spec defined
delay by which point all the devices should be operational again.  The
spec also defines that devices are in D0 after reset, which implies
that the only reason we would ever be seeing a device in D3 is if we're
reading the device while it is still in reset or before it has
recovered from reset.  That implies that either
pci_reset_bridge_secondary_bus() is not waiting long enough or QEMU is
allowing one device to call vfio_pre_reset while another device is
still in reset.  I suspect QEMU serializes reset such that the latter
case is not possible, which means that you might have a device that
takes longer to reset than the spec defines.  Such a quirk should be
handled in the host kernel reset, not by adding arbitrary delays in
userspace code.  Thanks,

maybe it is not the delay issue actually. let me analyze the aer process
in do_recovery:

   host qemu   guest

broadcast: error_detected
+--> error_detected (VFIO)
  +--> eventfd_signal   ->inject_aer -> 
broadcast: error_detected

*reset_link* +--> error_detected (real driver)
*reset_link*
broadcast: mmio/slot_reset 
broadcast: mmio/slot_reset
broadcast: 
resume  broadcast: resume

apologies to all, the above diagram was chaotic. I redraw it.

   host qemuguest

broadcast: error_detected (host)
 +--> error_detected (VFIO)
   +--> eventfd_signal   ->  inject_aer  -> broadcast: 
error_detected (gust)

*reset_link* (host) +--> error_detected (real driver) (gust)
*reset_link* (gust)
broadcast:

Re: [Qemu-devel] [PATCH v5] net: Allocating Large sized arrays to heap

2016-03-31 Thread Jason Wang



On 03/28/2016 08:34 PM, Pooja Dhannawat wrote:
> nc_sendv_compat has a huge stack usage of 69680 bytes approx.
> Moving large arrays to heap to reduce stack usage.
>
> Reviewed-by: Stefan Hajnoczi 
> Signed-off-by: Pooja Dhannawat 
> ---
>  net/net.c | 13 +
>  1 file changed, 9 insertions(+), 4 deletions(-)
>
> diff --git a/net/net.c b/net/net.c
> index b0c832e..663da13 100644
> --- a/net/net.c
> +++ b/net/net.c
> @@ -709,23 +709,28 @@ ssize_t qemu_send_packet_raw(NetClientState *nc, const 
> uint8_t *buf, int size)
>  static ssize_t nc_sendv_compat(NetClientState *nc, const struct iovec *iov,
> int iovcnt, unsigned flags)
>  {
> -uint8_t buf[NET_BUFSIZE];
> +uint8_t *buf = NULL;
>  uint8_t *buffer;
>  size_t offset;
> +ssize_t ret;
>  
>  if (iovcnt == 1) {
>  buffer = iov[0].iov_base;
>  offset = iov[0].iov_len;
>  } else {
> +buf = g_new(uint8_t, NET_BUFSIZE);
>  buffer = buf;
> -offset = iov_to_buf(iov, iovcnt, 0, buf, sizeof(buf));
> +offset = iov_to_buf(iov, iovcnt, 0, buf, NET_BUFSIZE);
>  }
>  
>  if (flags & QEMU_NET_PACKET_FLAG_RAW && nc->info->receive_raw) {
> -return nc->info->receive_raw(nc, buffer, offset);
> +ret = nc->info->receive_raw(nc, buffer, offset);
>  } else {
> -return nc->info->receive(nc, buffer, offset);
> +ret = nc->info->receive(nc, buffer, offset);
>  }
> +
> +g_free(buf);
> +return ret;
>  }
>  
>  ssize_t qemu_deliver_packet_iov(NetClientState *sender,

Applied to -net.

Thanks

Re: [Qemu-devel] [PATCH v2] util: Improved qemu_hexmap() to include an ascii dump of the buffer

2016-03-31 Thread Jason Wang



On 03/30/2016 03:18 PM, Thomas Huth wrote:
> There's a typo in the title: qemu_hexmap should be qemu_hexdump/ instead.
>
> On 25.03.2016 11:42, Isaac Lozano wrote:
>> qemu_hexdump() in util/hexdump.c has been changed to give also include a
>> ascii dump of the buffer. Also, calls to hex_dump() in net/net.c have
>> been replaced with calls to qemu_hexdump(). This takes care of two misc
>> BiteSized Tasks.
>>
>> Signed-off-by: Isaac Lozano <109loza...@gmail.com>
>> ---
>>
>> v2: Fixed code-style and made debug line smaller.
>>
>>  net/net.c  | 30 +-
>>  util/hexdump.c | 33 ++---
>>  2 files changed, 23 insertions(+), 40 deletions(-)
> [...]
>
> Code looks fine now, thanks for taking care of this!
>
> Reviewed-by: Thomas Huth 
>
> Jason, since there is no explicit maintainer for util/hexdump.c, could
> you maybe take this through your net tree, since this patch touches
> net/net.c as well?

Yes, applied in -net.

Thanks

>
>  Thomas
>
>

Re: [Qemu-devel] [patch v5 11/12] vfio: device may stuck in D3 when doing aer recovery

2016-03-31 Thread Chen Fan



On 03/31/2016 11:44 PM, Alex Williamson wrote:

On Thu, 31 Mar 2016 14:55:07 +0800
Chen Fan  wrote:


On 03/25/2016 10:22 AM, Alex Williamson wrote:

On Fri, 25 Mar 2016 09:38:09 +0800
Chen Fan  wrote:
  

On 03/25/2016 06:54 AM, Alex Williamson wrote:

On Wed, 23 Mar 2016 18:12:06 +0800
Cao jin  wrote:
 

From: Chen Fan 

when a physical device aer occurred, the device state probably
is not in D0 in a short time, if we recover the device quickly.
we may stuck in D3 state when force to change device state to D0.
we may need to wait for a short time to inject the error to guest.

Signed-off-by: Chen Fan 
---
hw/vfio/pci.c | 3 +++
1 file changed, 3 insertions(+)

diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index 25fc095..5216e7f 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -2658,6 +2658,9 @@ static void vfio_err_notifier_handler(void *opaque)
msg.severity = isfatal ? PCI_ERR_ROOT_CMD_FATAL_EN :
 PCI_ERR_ROOT_CMD_NONFATAL_EN;

+/* wait a bit to ensure aer device is ready */

+usleep(2 * 1000);

Where does this number come from?  Why would the device be in D3?  I
don't understand this at all.

Hi Alex,

   when I tested the code in my environment, I found that when I used
the aer-inject module to inject a fake aer error to device on host, the qemu
would throw out the message "vfio: Unable to power on device, stuck in D3"
on and off. if I use "gdb" to debug the vfio_pci_pre_reset, the phenomenon
would not appearance, I just thought it should be some timing race issue,
so I use a sleep() to wait 2ms (double the reset time of 1ms) to ensure the
device state is ready. maybe the root reason still need to be
investigated deeply.

Yes, it sounds like you need to investigate this further, the delay is
arbitrary and perhaps suggests a race that needs to be fixed
correctly.  Thanks,

Hi Alex,

  after done some investigation of the problem, I found that only
when the injected
   error is fatal, the problem will appear. because in aer do_recovery,
host will call reset_link
on the root port, which would invoke pci_reset_bridge_secondary_bus in
aer_root_reset,
that would reset the bridge and all the device under that. so when qemu
receive the aer
notification, then propagate the error to guest, guest does the same way
to perform the
recovery, if the guest `reset_link` that will call the vfio_pre_reset
done at the stage of host
bridge reset, the device status would probable stick in D3.

so I think after qemu receive the aer notification, we should wait for
enough time to
ensure the bridge has been reset completely. I just use sleep <=10ms to
test the code,
seems still appear the message "vfio: Unable to power on device, stuck
in D3". so I think
we should sleep 100ms to ensure the delay sufficient. I have tested that
code 100+ times
by inject aer error. the issue no longer appears.

I'm not satisfied with this.  pci_reset_bridge_secondary_bus() is
invoked by both the host AER code and the guest AER code, the latter
via the vfio PCI hot reset interface.  The
pci_reset_bridge_secondary_bus() function includes the spec defined
delay by which point all the devices should be operational again.  The
spec also defines that devices are in D0 after reset, which implies
that the only reason we would ever be seeing a device in D3 is if we're
reading the device while it is still in reset or before it has
recovered from reset.  That implies that either
pci_reset_bridge_secondary_bus() is not waiting long enough or QEMU is
allowing one device to call vfio_pre_reset while another device is
still in reset.  I suspect QEMU serializes reset such that the latter
case is not possible, which means that you might have a device that
takes longer to reset than the spec defines.  Such a quirk should be
handled in the host kernel reset, not by adding arbitrary delays in
userspace code.  Thanks,

maybe it is not the delay issue actually. let me analyze the aer process
in do_recovery:

   host  qemu  
 guest


broadcast: error_detected
+--> error_detected (VFIO)
  +--> eventfd_signal   ->inject_aer -> 
broadcast: error_detected

*reset_link* +--> error_detected (real driver)
*reset_link*
broadcast: mmio/slot_reset 
broadcast: mmio/slot_reset
broadcast: 
resume  broadcast: resume


we can see that the reset_link in host and in guest are not called in 
order. the reset_link
in guest may execute during the host broadcast "error_detected" or 
"reset_link" as long
as there are many devices need to walk on the bus. so in the case qemu 
has the opportunity

to call vfio_pre_reset while host is still in reset.

so can we

Re: [Qemu-devel] [PATCH V2 0/3] Introduce COLO-compare

2016-03-31 Thread Li Zhijian




On 03/31/2016 05:43 PM, Dr. David Alan Gilbert wrote:

* Li Zhijian (lizhij...@cn.fujitsu.com) wrote:



On 03/30/2016 08:05 PM, Dr. David Alan Gilbert wrote:

* Zhang Chen (zhangchen.f...@cn.fujitsu.com) wrote:

COLO-compare is a part of COLO project. It is used
to compare the network package to help COLO decide
whether to do checkpoint.


Hi Zhang Chen,
   I've put comments on the individual patches, but some more general things:

   1) Please add a coment giving the example of the command line for the primary
 and secondary use of this module - it helps make it easier to understand 
the patches.

   2) There's no tracing in here - please add some; I found when I tried to get
 COLO working I needed to use lots of tracing and debugging to understand 
the
 packet flow.

   3) Add comments; e.g. for each function say which thread is using it and 
where
  the packets are coming from; e.g.
 called from the main thread on the primary for packets arriving over 
the socket
 from the secondary.

  There's just so many packets going in so many directions it would make it
  easier to follow.

   4) A more fundamental problem is what happens if the secondary never sends 
anything
  on the socket, the result is you end up running until the end of the long 
COLO
  checkpoint without triggering a discompare - in my world I added a 
timeout (400ms)
  for an unmatched packet from the primary, where if no matching packet was 
received
  a checkpoint would be triggered.

   5) I see the packet comparison is still the simple memcmpy that you had in 
December;
  are you planning on doing anything more complicated; you must be seing 
most packets
  miscompare?

You can see my current world at; 
https://github.com/orbitfp7/qemu/commits/orbit-wp4-colo-mar16
which has my basic TCP comparison (it's only tracking incoming connections) and 
I know it's
not complete either.  It mostly works OK, although I've got an occasional seg
(which makes me wonder if I need to add the conn_list_lock I see you added).  
I'm also
not doing any TCP reassembly which is probably needed.


Thank you very much for your comments.
I just see you tree, you put in a lot of work(tcp comparison enhance, 
sequence/acknowledge
number re-write, timeout...)

Actually, this compare module is just in a RFC stage(only including compare 
frame), there are
many works to be done:

1) Integrate to COLO frame(and Let COLO primary and secondary at running state)


Yes; although I think you've had most of the code for that, also feel free to 
use
any of the bits I've changed.

Thank you, we will add this part to next version.




2) ip segment defrag


Yes, this seems the hardest bit to me. It is needed for some workloads; for 
example
a simple case if just an ssh connection, the differences in timing between the 
primary
and secondary you see that the primary might send two short packets and the 
secondary
might send one long packet with the same data.

OK, let it as a TODO issue




3) comparison base on the sequence number(tcp and udp) if packet has
Because tcp re-transmission is quit common. IRC, your code will compare the 
whole tcp
 packet(sequence number will be compare)


Yes; once the secondary reworks the sequence number I found the seuence number
matched most of the time on the primary.

well, let (3) (4) (5) as a TODO issue too or move it to an extra 
patchset(depending on 3)



 One problem (depending on the traffic)
is that the ack number might not match and I'm not sure of the best fix, for 
example,
consider:

 a) Send packet 1
 b) Send packet 2
 c) Receive response

   The order of b & c is random - if the response on the same socket arrives 
(c) before (b)
then the ack number in packet 2 is different; on one workload this caused a lot 
of miscompares
but only if the timing is just wrong.

Yes, I've had this problem too.
the logic is a bit complicated, as kernel colo-proxy, we compare the tcp 
playload(excluding sequence/ack number)
only. And introduce the 'max_ack'(max_ack = MAX(primary_max_ack, 
secondary_max_ack)) to guarantee
the packet which ack is <= 'max_ack' can be release to client.



(I also had to turn off TCP timestamping to get useful comparisons)

Yes, it works.




4) packet belongs to the same connection is sort by sequence number

5) Out-Of-Oder packet handle


I think 4 & 5 both happen as part of the defrag; out of order packets were a 
problem for
me on some of the workloads; I had to turn off multiqueue in one case.


6) cleanup the un-active conn_list which maybe closed. the simple way is to 
introduce a
timer to record whether a connection have packet come within a timeout, 
connection gone
 beyond this timeout should be cleanup.


At the moment that isn't a big problem because when you receive the next 
checkpoint you
can flush the list.
The only case where you have to deal with this is for continuous failover when 
the
original

Re: [Qemu-devel] [Bug 1563887] Re: qemu-system-ppc64 freezes on starting image on ppc64le

2016-03-31 Thread Serge Hallyn

Thanks very much for testing.

I think I have a system I can use to try and bisect tonight/tomorrow.

-- 
You received this bug notification because you are a member of qemu-
devel-ml, which is subscribed to QEMU.
https://bugs.launchpad.net/bugs/1563887

Title:
  qemu-system-ppc64 freezes on starting image on ppc64le

Status in QEMU:
  New
Status in qemu package in Ubuntu:
  New

Bug description:
  qemu-system-ppc64 running on Ubuntu 16.04 beta-2 fails to start an
  image as part of the certification process. This on an IBM ppc64le in
  PowerVM mode running Ubuntu 16.04 beta-2 deployed by MAAS 1.9.1. There
  is no error output.

  ubuntu@alpine01:~/kvm$ qemu-system-ppc64 -m 256 -display none -nographic -net 
nic -net user,net=10.0.0.0/8,host=10.0.0.1,hostfwd=tcp::-:22 -machine 
pseries -drive file=xenial-server-cloudimg-ppc64el-disk1.img,if=virtio -drive 
file=seed.iso,if=virtio
  WARNING: Image format was not specified for 'seed.iso' and probing guessed 
raw.
   Automatically detecting the format is dangerous for raw images, 
write operations on block 0 will be restricted.
   Specify the 'raw' format explicitly to remove the restrictions.

  SLOF **
  QEMU Starting
   Build Date = Jan 29 2016 18:58:37
   FW Version = buildd@ release 20151103
   Press "s" to enter Open Firmware.

  Populating /vdevice methods
  Populating /vdevice/vty@7100
  Populating /vdevice/nvram@7101
  Populating /vdevice/l-lan@7102
  Populating /vdevice/v-scsi@7103
     SCSI: Looking for devices
    8200 CD-ROM   : "QEMU QEMU CD-ROM  2.5+"
  Populating /pci@8002000
   00 1800 (D) : 1af4 1001virtio [ block ]
   00 1000 (D) : 1af4 1001virtio [ block ]
   00 0800 (D) : 106b 003fserial bus [ usb-ohci ]
   00  (D) : 1234 qemu vga
  No NVRAM common partition, re-initializing...
  Installing QEMU fb

  Scanning USB
    OHCI: initializing
  USB Keyboard
  USB mouse
  No console specified using screen & keyboard

    Welcome to Open Firmware

    Copyright (c) 2004, 2011 IBM Corporation All rights reserved.
    This program and the accompanying materials are made available
    under the terms of the BSD License available at
    http://www.opensource.org/licenses/bsd-license.php

  Trying to load:  from: /pci@8002000/scsi@3 ...
  E3404: Not a bootable device!
  Trying to load:  from: /pci@8002000/scsi@2 ...   Successfully loaded
  Linux ppc64le
  #31-Ubuntu SMP F

  ProblemType: Bug
  DistroRelease: Ubuntu 16.04
  Package: qemu-system-ppc 1:2.5+dfsg-5ubuntu6
  ProcVersionSignature: Ubuntu 4.4.0-16.32-generic 4.4.6
  Uname: Linux 4.4.0-16-generic ppc64le
  ApportVersion: 2.20-0ubuntu3
  Architecture: ppc64el
  Date: Wed Mar 30 14:10:01 2016
  KvmCmdLine:
   COMMAND STAT  EUID  RUID   PID  PPID %CPU COMMAND
   kvm-irqfd-clean S<   0 0  1172 2  0.0 [kvm-irqfd-clean]
   qemu-nbdSsl  0 0 13467 1  0.0 qemu-nbd -c /dev/nbd0 
xenial-server-cloudimg-ppc64el-disk1.img
   qemu-system-ppc Sl+   1000  1000 18973 18896  101 qemu-system-ppc64 -m 256 
-display none -nographic -net nic -net 
user,net=10.0.0.0/8,host=10.0.0.1,hostfwd=tcp::-:22 -machine pseries -drive 
file=xenial-server-cloudimg-ppc64el-disk1.img,if=virtio -drive 
file=seed.iso,if=virtio
  Lsusb: Error: command ['lsusb'] failed with exit code 1:
  ProcEnviron:
   TERM=xterm
   PATH=(custom, no user)
   LANG=en_US.UTF-8
   SHELL=/bin/bash
  ProcKernelCmdLine: BOOT_IMAGE=/boot/vmlinux-4.4.0-16-generic 
root=UUID=92d820c8-ab25-497b-9b1e-f1435992bbf3 ro
  ProcLoadAvg: 1.08 0.94 0.58 2/616 19571
  ProcLocks:
   1: POSIX  ADVISORY  WRITE 886 00:13:381 0 EOF
   2: POSIX  ADVISORY  WRITE 1339 00:13:528 0 EOF
   3: FLOCK  ADVISORY  WRITE 1284 00:13:522 0 EOF
   4: POSIX  ADVISORY  WRITE 2281 00:13:563 0 EOF
   5: POSIX  ADVISORY  WRITE 1331 00:13:536 0 EOF
  ProcSwaps:
   Filename TypeSizeUsedPriority
   /swap.img   file 8388544 0   -1
  ProcVersion: Linux version 4.4.0-16-generic (buildd@bos01-ppc64el-001) (gcc 
version 5.3.1 20160320 (Ubuntu/Linaro/IBM 5.3.1-12ubuntu4) ) #32-Ubuntu SMP Thu 
Mar 24 22:31:14 UTC 2016
  SourcePackage: qemu
  UpgradeStatus: No upgrade log present (probably fresh install)
  bootlist:
   /pci@8002011/pci1014,034A@0/sas/disk@4068402c40
   
/pci@8002018/ethernet@0:speed=auto,duplex=auto,csarch,000.000.000.000,,000.000.000.000,000.000.000.000,5,5,000.000.000.000,512
   
/pci@8002018/ethernet@0,1:speed=auto,duplex=auto,csarch,000.000.000.000,,000.000.000.000,000.000.000.000,5,5,000.000.000.000,512
   
/pci@8002018/ethernet@0,2:speed=auto,duplex=auto,csarch,000.000.000.000,,000.000.000.000,000.000.000.000,5,5,000.000.000.000,512

Re: [Qemu-devel] [PATCH] xilinx_zynq: merged support for ULPI PHY and ULPI viewport from xilinx/qemu

2016-03-31 Thread Alistair Francis

On Wed, Mar 30, 2016 at 3:46 AM, Joscha Benz  wrote:
> Hi,
>
> On 29.03.2016 22:23, Alistair Francis wrote:
>> On Tue, Mar 29, 2016 at 12:51 PM,   wrote:
>>> Signed-off-by: Joscha Benz 
>>
>> Hello Joscha ,
>>
>> Thanks for the patch. In future can you please use git send-email to
>> send the patches instead of attaching the patch file. You can find
>> more information on doing this at:
>> http://wiki.qemu.org/Contribute/SubmitAPatch
>
> Of course.

Great! Thanks

>
>> What are you trying to do here? Why are you trying to apply the Xilinx
>> model of the device? This relies on the register API (which you have
>> included in your patch) which isn't in mainline QEMU yet, so that part
>> can't be accepted just yet. We are working at upstreaming that at the
>> moment though.
>
> I was trying to get USB support for that device. That's why i tried to
> apply the Xilinx model since it has USB support.

Do you have any more specifics on what you need? At the moment we
won't be able to take all of the code until the full register API is
accepted.

Thanks,

Alistair

>
> Regards,
>
> Joscha
>

Re: [Qemu-devel] [PATCH v2 1/1] NBD proto: add WRITE_ZEROES extension

2016-03-31 Thread Eric Blake

On 03/31/2016 07:02 AM, Denis V. Lunev wrote:
> From: Pavel Borzenkov 
> 
> There exist some cases when a client knows that the data it is going to
> write is all zeroes. Such cases include mirroring or backing up a device
> implemented by a sparse file.
> 
> With current NBD command set, the client has to issue NBD_CMD_WRITE
> command with zeroed payload and transfer these zero bytes through the
> wire. The server has to write the data onto disk, effectively denying
> the sparseness.
> 
> To remedy this, the patch adds WRITE_ZEROES extension with one new
> NBD_CMD_WRITE_ZEROES command.
> 

> +++ b/doc/proto.md
> @@ -261,6 +261,8 @@ immediately after the handshake flags field in oldstyle 
> negotiation:
>schedule I/O accesses as for a rotational medium
>  - bit 5, `NBD_FLAG_SEND_TRIM`; should be set to 1 if the server supports
>`NBD_CMD_TRIM` commands
> +- bit 6, `NBD_FLAG_SEND_WRITE_ZEROES`; should be set to 1 if the server
> +  supports `NBD_CMD_WRITE_ZEROES` commands

Hmm, we've picked overlapping bits between your proposal and mine for
`NBD_FLAG_SEND_DF`. Obviously, whoever goes in second gets bit 7.

-- 
Eric Blake   eblake redhat com+1-919-301-3266
Libvirt virtualization library http://libvirt.org



signature.asc
Description: OpenPGP digital signature

[Qemu-devel] [PATCH] Docs: proto.md: Clarify NUL in export name

2016-03-31 Thread Alex Bligh

Clarify that

* The name is not NUL terminated (not just that the length
  'does not include NUL termination' which might be taken to mean
  there is NUL termination but the length doesn't include it.

* The name cannot itself include embedded NUL characters (despite
  it not being NUL terminated).

Signed-off-by: Alex Bligh 
---
 doc/proto.md | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/doc/proto.md b/doc/proto.md
index c1e05c5..7729051 100644
--- a/doc/proto.md
+++ b/doc/proto.md
@@ -292,7 +292,8 @@ of the newstyle negotiation.
 haggling, and proceed to the transmission phase. Data: name of the
 export, free-form UTF-8 text (subject to limitations by server
 implementation). The length of the name is determined from the
-option header, and does NOT include a NUL terminator.  If the
+option header. The name is not NUL terminated, and may not
+contain embedded NUL characters. If the
 chosen export does not exist or requirements for the chosen export
 are not met (e.g., the client did not negotiate TLS for an export
 where the server requires it), the server should close the
-- 
1.9.1

[Qemu-devel] [PATCH] Improve documentation of FUA and FLUSH

2016-03-31 Thread Alex Bligh

Improve the documentation of NBD_CMD_FLUSH and NBD_CMD_FLAG_FUA. Specifically
the latter may be set on any command, and its semantics on commands other
than NBD_CMD_WRITE need explaining. Further, explain how these relate to
reordering of commands.

Signed-off-by: Alex Bligh 
---
 doc/proto.md | 52 ++--
 1 file changed, 42 insertions(+), 10 deletions(-)

diff --git a/doc/proto.md b/doc/proto.md
index c1e05c5..bc4483d 100644
--- a/doc/proto.md
+++ b/doc/proto.md
@@ -197,6 +197,37 @@ handle as was sent by the client in the corresponding 
request.  In
 this way, the client can correlate which request is receiving a
 response.
 
+ Ordering of messages and writes
+
+The server MAY process commands out of order, and MAY reply out of
+order, save that:
+
+* All write commands (that includes both `NBD_CMD_WRITE` and
+  `NBD_CMD_TRIM`) that the server completes (i.e. replies to)
+  prior to processing to a `NBD_CMD_FLUSH` MUST be written to non-volatile
+  storage prior to replying to that `NBD_CMD_FLUSH`. The server SHOULD ensure
+  that all write command received prior to processing the `NBD_CMD_FLUSH`
+  (whether they are replied to or not) are written to non-volatile
+  storage prior to processing an `NBD_CMD_FLUSH`; note this is a
+  stronger condition than the previous 'MUST' condition. This
+  paragram only applies if `NBD_FLAG_SEND_FLUSH` is set within
+  the transmission flags, as otherwise `NBD_CMD_FLUSH` will never
+  be sent by the client to the server.
+
+* A server MUST NOT reply to a command that has `NBD_CMD_FLAG_FUA` set
+  in its command flags until   the data area referred to by that command
+  is persisted to non-volatile storage. This only applies if
+  `NBD_FLAG_SEND_FLUSH` is set within the transmission flags, as otherwise
+  `NBD_CMD_FLAG_FUA` will not be set on any commands sent to the server
+  by the client.
+
+`NBD_CMD_FLUSH` is modelled on the Linux kernel empty bio with
+`REQ_FLUSH` set. `NBD_CMD_FLAG_FUA` is modelled on the Linux
+kernel bio with `REQ_FUA` set. In case of ambiguity in this
+specification, the
+[kernel 
documentation](https://www.kernel.org/doc/Documentation/block/writeback_cache_control.txt)
+may be useful.
+
  Request message
 
 The request message, sent by the client, looks as follows:
@@ -444,10 +475,17 @@ affects a particular command.  Clients MUST NOT set a 
command flag bit
 that is not documented for the particular command; and whether a flag is
 valid may depend on negotiation during the handshake phase.
 
-- bit 0, `NBD_CMD_FLAG_FUA`; valid during `NBD_CMD_WRITE`.  SHOULD be
-  set to 1 if the client requires "Force Unit Access" mode of
-  operation.  MUST NOT be set unless transmission flags included
-  `NBD_FLAG_SEND_FUA`.
+- bit 0, `NBD_CMD_FLAG_FUA`. This bit
+  MUST be set to 0 unless the `NBD_FLAG_SEND_FUA` flag ("Force Unit Access")
+  was set in the transmission flags field. If the `NBD_FLAG_SEND_FUA`
+  is set in the transmission flags field, the client MAY set
+  `NBD_CMD_FLAG_FUA` in any request. If this bit is set, the server
+  MUST NOT send a reply until it has ensured that any data referred to
+  by this request (i.e. written data on a write or trim, read data on
+  a read) has reached permanent storage. There will be certain commands
+  (e.g. `NBD_CMD_DISC`) for which this flag will thus not alter behaviour
+  (as the command does not refer to any data), in which case the server
+  MUST ignore this bit.
 
  Request types
 
@@ -479,12 +517,6 @@ The following request types exist:
 message. The server MAY send the reply message before the data has
 reached permanent storage.
 
-If the `NBD_FLAG_SEND_FUA` flag ("Force Unit Access") was set in the
-transmission flags field, the client MAY set the flag `NBD_CMD_FLAG_FUA` in
-the command flags field. If this flag was set, the server MUST NOT send
-the reply until it has ensured that the newly-written data has reached
-permanent storage.
-
 If an error occurs, the server SHOULD set the appropriate error code
 in the error field. The server MAY then close the connection.
 
-- 
1.9.1

Re: [Qemu-devel] Why is SeaBIOS used with -kernel?

2016-03-31 Thread Kevin O'Connor

On Thu, Mar 31, 2016 at 11:17:30PM +0100, Richard W.M. Jones wrote:
> On Thu, Mar 31, 2016 at 12:22:23PM -0400, Kevin O'Connor wrote:
> > On Thu, Mar 31, 2016 at 10:21:25AM +0100, Stefan Hajnoczi wrote:
> > > On Sat, Mar 19, 2016 at 08:31:24PM +, Richard W.M. Jones wrote:
> > > > Is there something I'm missing, or for Linux + -kernel could we use a
> > > > much simpler BIOS?
> > > 
> > > The data that Marc Mari collected when comparing qboot with an optimized
> > > SeaBIOS/QEMU showed that there's no need for a separate "lightweight
> > > firmware" codebase.
> 
> [http://www.seabios.org/pipermail/seabios/2015-July/009554.html]
> 
> The problem is that now we've solved the fw_cfg problem, SeaBIOS is
> again a bottleneck (but one of several, and not the biggest).
> 
> > > https://github.com/bonzini/qboot
> 
> I'm actually comparing this to the extremely minimal BIOS used by
> kvmtool (and hence by Intel Clear Containers).  That "BIOS" (it's
> hardly fair to call it that) contains only a the bare minimum calls
> necessary to service the Linux startup code.  In this scenario Linux
> is memcpy'd into the guest memory and jumped to directly, so there is
> no separate BIOS loading step at all.  The BIOS is only needed because
> Linux startup issues BIOS calls eg to get the e820 memory map and do
> some VGA mode manipulation.

I think you'll find that if you compile out some features from
SeaBIOS, it will be of a similar speed to that "minimal BIOS".  Try
this:

cd /path/to/seabios/
echo -e 
'CONFIG_USB=n\nCONFIG_DRIVES=n\nCONFIG_KEYBOARD=n\nCONFIG_MOUSE=n\nCONFIG_WRITABLE_UPPERMEMORY=y\nCONFIG_TCGBIOS=n\nCONFIG_PIRTABLE=n\nCONFIG_MPTABLE=n\nCONFIG_SMBIOS=n\nCONFIG_ACPI=n\nCONFIG_DEBUG_LEVEL=0'
 > .config
make olddefconfig
make

What time do you get with the above stripped down seabios (the
generated bios is in out/bios.bin)?

[...]
> I'd dearly love to get rid of the sgabios option ROM.  It looks like
> SeaBIOS nearly supports a full serial console now?

Last I checked, one could disable the option rom by adding "-device
VGA,romfile=" to the qemu command line.

-Kevin

[Qemu-devel] [PATCH for-2.6] spapr_drc: enable immediate detach for unsignalled devices

2016-03-31 Thread Michael Roth

Currently spapr doesn't support "aborting" hotplug of PCI
devices by allowing device_del to immediately remove the
device if we haven't signalled the presence of the device
to the guest.

In the past this wasn't an issue, since we always immediately
signalled device attach and simply relied on full guest-aware
add->remove path for device removal. However, as of 788d259,
we now defer signalling for PCI functions until function 0
is attached, so now we need to deal with these "abort" operations
for cases where a user hotplugs a non-0 function, then opts to
remove it prior hotplugging function 0. Currently they'd have to
reboot before the unplug completed. PCIe multifunction hotplug
does not have this requirement however, so from a management
implementation perspective it would be good to address this within
the same release as 788d259.

We accomplish this by simply adding a 'signalled' flag to track
whether a device hotplug event has been sent to the guest. If it
hasn't, we allow immediate removal under the assumption that the
guest will not be using the device. Devices present at boot/reset
time are also assumed to be 'signalled'.

For CPU/memory/etc, signalling will still happen immediately
as part of device_add, so only PCI functions should be affected.

Cc: bhar...@linux.vnet.ibm.com
Cc: da...@gibson.dropbear.id.au
Cc: sb...@linux.vnet.ibm.com
Cc: qemu-...@nongnu.org
Signed-off-by: Michael Roth 
---
 hw/ppc/spapr_drc.c | 34 ++
 hw/ppc/spapr_events.c  | 11 +++
 include/hw/ppc/spapr_drc.h |  2 ++
 3 files changed, 47 insertions(+)

diff --git a/hw/ppc/spapr_drc.c b/hw/ppc/spapr_drc.c
index ef063c0..5568e44 100644
--- a/hw/ppc/spapr_drc.c
+++ b/hw/ppc/spapr_drc.c
@@ -173,6 +173,12 @@ static void set_configured(sPAPRDRConnector *drc)
 drc->configured = true;
 }
 
+/* has the guest been notified of device attachment? */
+static void set_signalled(sPAPRDRConnector *drc)
+{
+drc->signalled = true;
+}
+
 /*
  * dr-entity-sense sensor value
  * returned via get-sensor-state RTAS calls
@@ -355,6 +361,7 @@ static void attach(sPAPRDRConnector *drc, DeviceState *d, 
void *fdt,
 drc->fdt = fdt;
 drc->fdt_start_offset = fdt_start_offset;
 drc->configured = coldplug;
+drc->signalled = coldplug;
 
 object_property_add_link(OBJECT(drc), "device",
  object_get_typename(OBJECT(drc->dev)),
@@ -371,6 +378,26 @@ static void detach(sPAPRDRConnector *drc, DeviceState *d,
 drc->detach_cb = detach_cb;
 drc->detach_cb_opaque = detach_cb_opaque;
 
+/* if we've signalled device presence to the guest, or if the guest
+ * has gone ahead and configured the device (via manually-executed
+ * device add via drmgr in guest, namely), we need to wait
+ * for the guest to quiesce the device before completing detach.
+ * Otherwise, we can assume the guest hasn't seen it and complete the
+ * detach immediately. Note that there is a small race window
+ * just before, or during, configuration, which is this context
+ * refers mainly to fetching the device tree via RTAS.
+ * During this window the device access will be arbitrated by
+ * associated DRC, which will simply fail the RTAS calls as invalid.
+ * This is recoverable within guest and current implementations of
+ * drmgr should be able to cope.
+ */
+if (!drc->signalled && !drc->configured) {
+/* if the guest hasn't seen the device we can't rely on it to
+ * set it back to an isolated state via RTAS, so do it here manually
+ */
+drc->isolation_state = SPAPR_DR_ISOLATION_STATE_ISOLATED;
+}
+
 if (drc->isolation_state != SPAPR_DR_ISOLATION_STATE_ISOLATED) {
 DPRINTFN("awaiting transition to isolated state before removal");
 drc->awaiting_release = true;
@@ -409,6 +436,7 @@ static void reset(DeviceState *d)
 {
 sPAPRDRConnector *drc = SPAPR_DR_CONNECTOR(d);
 sPAPRDRConnectorClass *drck = SPAPR_DR_CONNECTOR_GET_CLASS(drc);
+sPAPRDREntitySense state;
 
 DPRINTFN("drc reset: %x", drck->get_index(drc));
 /* immediately upon reset we can safely assume DRCs whose devices
@@ -436,6 +464,11 @@ static void reset(DeviceState *d)
 drck->set_allocation_state(drc, 
SPAPR_DR_ALLOCATION_STATE_UNUSABLE);
 }
 }
+
+drck->entity_sense(drc, );
+if (state == SPAPR_DR_ENTITY_SENSE_PRESENT) {
+drck->set_signalled(drc);
+}
 }
 
 static void realize(DeviceState *d, Error **errp)
@@ -594,6 +627,7 @@ static void spapr_dr_connector_class_init(ObjectClass *k, 
void *data)
 drck->attach = attach;
 drck->detach = detach;
 drck->release_pending = release_pending;
+drck->set_signalled = set_signalled;
 /*
  * Reason: it crashes FIXME find and document the real reason
  */
diff --git a/hw/ppc/spapr_events.c b/hw/ppc/spapr_events.c
index 39f4682..be8de63 100644
---

Re: [Qemu-devel] [PATCH] slirp: Fix migration from older versions of QEMU to the current one

2016-03-31 Thread Samuel Thibault

Samuel Thibault, on Fri 01 Apr 2016 00:00:43 +0200, wrote:
> Just realizing... We'd need to add AF_INET6 cases here too, to be able
> to save/restore a VM using ipv6 connections.

That seems quite involved however, maybe we should postpone that to
later?

Samuel

[Qemu-devel] [PULL] slirp: Fix migration from older versions of QEMU to the current one

2016-03-31 Thread Samuel Thibault

From: Thomas Huth 

While adding the IPv6 support, the commit eae303ff23f51259eddc8856c71453d8
("slirp: Make Socket structure IPv6 compatible") changed the format of
the migration stream, without taking into account that we might still
receive an old migration stream layout when upgrading from QEMU version
2.5 (or older) to QEMU 2.6. Currently, QEMU bails out when doing a
migration from QEMU 2.5 to the recent master version when it has
been started with a "-net user,guestfwd=..." network. So let's fix
this by checking the version ID of the migration stream and by using
the old behavior if we've detected version 3 or less.

Signed-off-by: Thomas Huth 
Signed-off-by: Samuel Thibault 
---
 slirp/slirp.c | 44 ++--
 1 file changed, 26 insertions(+), 18 deletions(-)

diff --git a/slirp/slirp.c b/slirp/slirp.c
index 3481fcc..998f278 100644
--- a/slirp/slirp.c
+++ b/slirp/slirp.c
@@ -1233,31 +1233,39 @@ static int slirp_sbuf_load(QEMUFile *f, struct sbuf 
*sbuf)
 return 0;
 }
 
-static int slirp_socket_load(QEMUFile *f, struct socket *so)
+static int slirp_socket_load(QEMUFile *f, struct socket *so, int version_id)
 {
 if (tcp_attach(so) < 0)
 return -ENOMEM;
 
 so->so_urgc = qemu_get_be32(f);
-so->so_ffamily = qemu_get_be16(f);
-switch (so->so_ffamily) {
-case AF_INET:
+if (version_id <= 3) {
+so->so_ffamily = AF_INET;
 so->so_faddr.s_addr = qemu_get_be32(f);
-so->so_fport = qemu_get_be16(f);
-break;
-default:
-error_report(
-"so_ffamily unknown, unable to restore so_faddr and 
so_lport\n");
-}
-so->so_lfamily = qemu_get_be16(f);
-switch (so->so_lfamily) {
-case AF_INET:
 so->so_laddr.s_addr = qemu_get_be32(f);
+so->so_fport = qemu_get_be16(f);
 so->so_lport = qemu_get_be16(f);
-break;
-default:
-error_report(
-"so_ffamily unknown, unable to restore so_laddr and 
so_lport\n");
+} else {
+so->so_ffamily = qemu_get_be16(f);
+switch (so->so_ffamily) {
+case AF_INET:
+so->so_faddr.s_addr = qemu_get_be32(f);
+so->so_fport = qemu_get_be16(f);
+break;
+default:
+error_report(
+"so_ffamily unknown, unable to restore so_faddr and so_lport");
+}
+so->so_lfamily = qemu_get_be16(f);
+switch (so->so_lfamily) {
+case AF_INET:
+so->so_laddr.s_addr = qemu_get_be32(f);
+so->so_lport = qemu_get_be16(f);
+break;
+default:
+error_report(
+"so_ffamily unknown, unable to restore so_laddr and so_lport");
+}
 }
 so->so_iptos = qemu_get_byte(f);
 so->so_emu = qemu_get_byte(f);
@@ -1294,7 +1302,7 @@ static int slirp_state_load(QEMUFile *f, void *opaque, 
int version_id)
 if (!so)
 return -ENOMEM;
 
-ret = slirp_socket_load(f, so);
+ret = slirp_socket_load(f, so, version_id);
 
 if (ret < 0)
 return ret;
-- 
2.8.0.rc3

[Qemu-devel] [PULL] slirp updates

2016-03-31 Thread Samuel Thibault

  MAINTAINERS: Delete invalid maintainer entries of the Exynos section 
(2016-03-31 18:21:01 +0100)

are available in the git repository at:

  http://people.debian.org/~sthibault/qemu.git tags/samuel-thibault-2

for you to fetch changes up to eaf136f9a21e02a2f55346e44d2d88df37b2cde3:

  slirp: Fix migration from older versions of QEMU to the current one 
(2016-04-01 00:05:06 +0200)


slirp updates (2)


Thomas Huth (1):
  slirp: Fix migration from older versions of QEMU to the current one

 slirp/slirp.c | 44 ++--
 1 file changed, 26 insertions(+), 18 deletions(-)

Re: [Qemu-devel] Why is SeaBIOS used with -kernel?

2016-03-31 Thread Richard W.M. Jones

[This time without the massive attachment]

On Thu, Mar 31, 2016 at 12:22:23PM -0400, Kevin O'Connor wrote:
> On Thu, Mar 31, 2016 at 10:21:25AM +0100, Stefan Hajnoczi wrote:
> > On Sat, Mar 19, 2016 at 08:31:24PM +, Richard W.M. Jones wrote:
> > > Is there something I'm missing, or for Linux + -kernel could we use a
> > > much simpler BIOS?
> > 
> > The data that Marc Mari collected when comparing qboot with an optimized
> > SeaBIOS/QEMU showed that there's no need for a separate "lightweight
> > firmware" codebase.

[http://www.seabios.org/pipermail/seabios/2015-July/009554.html]

The problem is that now we've solved the fw_cfg problem, SeaBIOS is
again a bottleneck (but one of several, and not the biggest).

> > https://github.com/bonzini/qboot

I'm actually comparing this to the extremely minimal BIOS used by
kvmtool (and hence by Intel Clear Containers).  That "BIOS" (it's
hardly fair to call it that) contains only a the bare minimum calls
necessary to service the Linux startup code.  In this scenario Linux
is memcpy'd into the guest memory and jumped to directly, so there is
no separate BIOS loading step at all.  The BIOS is only needed because
Linux startup issues BIOS calls eg to get the e820 memory map and do
some VGA mode manipulation.

https://git.kernel.org/cgit/linux/kernel/git/will/kvmtool.git/tree/x86

BTW my target for the total time taken to go from the qemu command to
a guest shell prompt is 150 ms.  Intel Clear Containers can do this
already, albeit using kvmtool and a heavily patched kernel with a
minimal config.

> > It would create a maintenance burden and eventually we'd want many of
> > the SeaBIOS features anyway.  It's better to optimize linuxboot.bin and
> > SeaBIOS instead.
> 
> In the tests I've run, the time spent in SeaBIOS is dominated by
> hardware delays.  (Or for qemu, the time needed for seabios to
> communicate with virtual hardware and the time required for qemu to
> implement the requests.)  As such, boot times can be most easily
> improved by configuring the VM with less hardware, or configuring (via
> SeaBIOS kconfig) less hardware drivers in SeaBIOS.
>
> It's possible to do course grained profiling with SeaBIOS by timing
> its debug messages - see:
> http://www.seabios.org/Debugging#Timing_debug_messages

This is roughly what I've been doing.  I'm using my own test harness:

https://github.com/libguestfs/libguestfs/blob/master/tests/qemu/boot-analysis.c
https://github.com/libguestfs/libguestfs/blob/master/tests/qemu/boot-analysis-timeline.c

The output from this is here (note: download it and use `less -r' to
view it):

http://oirase.annexia.org/tmp/seabios.txt

Actual total boot time is about 1s now.  The attached file
takes longer for a couple of reasons:

 - SeaBIOS debugging is enabled

 - kernel initcall debugging is enabled

I'd dearly love to get rid of the sgabios option ROM.  It looks like
SeaBIOS nearly supports a full serial console now?

> The debug messages themselves can consume time though (one can
> eliminate debug messages using CONFIG_DEBUG_LEVEL=0).  I use the
> following to profile while also accounting for the debug message delay
> (on my system, each debug character takes ~2.5us):
> 
> scripts/readserial.py -f ../qemu-test/qemudebugpipe -t 2.5
> 
> I find a standard SeaBIOS build on my machine with KVM takes ~50ms to
> start the OS (not including OS load time).  This breaks down roughly
> to the following times:
> 
> 6ms  - enabling shadow ram (qemu makes 0xc-0x10 read/writable)
> 4ms  - PCI initialization
> 2ms  - smm init
> 4ms  - load acpi tables from qemu
> 16ms - init and enable vga console
> 5ms  - load and run various option roms from qemu (eg, ipxe)
> 7ms  - locking shadow ram (qemu makes 0xc-0x10 readonly)
> 6ms  - other
> 
> There are several SeaBIOS kconfig options that would remove many of
> the above delays (with the obvious caveat that the given hardware
> would no longer be initialized by seabios).
> 
> > Kevin O'Connor had some SeaBIOS optimizations that improved boot time by
> > skipping unnecessary probing and timer calibration IIRC.  I have CCed
> > Marc and Kevin on this email.
> 
> There were a couple of optimizations in SeaBIOS (avoid TSC calibration
> when not using the TSC, avoid a PS2 keyboard reset delay) found last
> year, but they were committed and released in SeaBIOS v1.9.0.  They
> should already be in QEMU.

Rich.

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
virt-p2v converts physical machines to virtual machines.  Boot with a
live CD or over the network (PXE) and turn machines into KVM guests.
http://libguestfs.org/virt-v2v

Re: [Qemu-devel] [PATCH] slirp: Fix migration from older versions of QEMU to the current one

2016-03-31 Thread Samuel Thibault

Thomas Huth, on Thu 31 Mar 2016 16:48:41 +0200, wrote:
> While adding the IPv6 support, the commit eae303ff23f51259eddc8856c71453d8
> ("slirp: Make Socket structure IPv6 compatible") changed the format of
> the migration stream, without taking into account that we might still
> receive an old migration stream layout when upgrading from QEMU version
> 2.5 (or older) to QEMU 2.6. Currently, QEMU bails out when doing a
> migration from QEMU 2.5 to the recent master version when it has
> been started with a "-net user,guestfwd=..." network. So let's fix
> this by checking the version ID of the migration stream and by using
> the old behavior if we've detected version 3 or less.
> 
> Signed-off-by: Thomas Huth 

Reviewed-by: Samuel Thibault 

Just realizing... We'd need to add AF_INET6 cases here too, to be able
to save/restore a VM using ipv6 connections.

> ---
>  slirp/slirp.c | 44 ++--
>  1 file changed, 26 insertions(+), 18 deletions(-)
> 
> diff --git a/slirp/slirp.c b/slirp/slirp.c
> index 3481fcc..998f278 100644
> --- a/slirp/slirp.c
> +++ b/slirp/slirp.c
> @@ -1233,31 +1233,39 @@ static int slirp_sbuf_load(QEMUFile *f, struct sbuf 
> *sbuf)
>  return 0;
>  }
>  
> -static int slirp_socket_load(QEMUFile *f, struct socket *so)
> +static int slirp_socket_load(QEMUFile *f, struct socket *so, int version_id)
>  {
>  if (tcp_attach(so) < 0)
>  return -ENOMEM;
>  
>  so->so_urgc = qemu_get_be32(f);
> -so->so_ffamily = qemu_get_be16(f);
> -switch (so->so_ffamily) {
> -case AF_INET:
> +if (version_id <= 3) {
> +so->so_ffamily = AF_INET;
>  so->so_faddr.s_addr = qemu_get_be32(f);
> -so->so_fport = qemu_get_be16(f);
> -break;
> -default:
> -error_report(
> -"so_ffamily unknown, unable to restore so_faddr and 
> so_lport\n");
> -}
> -so->so_lfamily = qemu_get_be16(f);
> -switch (so->so_lfamily) {
> -case AF_INET:
>  so->so_laddr.s_addr = qemu_get_be32(f);
> +so->so_fport = qemu_get_be16(f);
>  so->so_lport = qemu_get_be16(f);
> -break;
> -default:
> -error_report(
> -"so_ffamily unknown, unable to restore so_laddr and 
> so_lport\n");
> +} else {
> +so->so_ffamily = qemu_get_be16(f);
> +switch (so->so_ffamily) {
> +case AF_INET:
> +so->so_faddr.s_addr = qemu_get_be32(f);
> +so->so_fport = qemu_get_be16(f);
> +break;
> +default:
> +error_report(
> +"so_ffamily unknown, unable to restore so_faddr and 
> so_lport");
> +}
> +so->so_lfamily = qemu_get_be16(f);
> +switch (so->so_lfamily) {
> +case AF_INET:
> +so->so_laddr.s_addr = qemu_get_be32(f);
> +so->so_lport = qemu_get_be16(f);
> +break;
> +default:
> +error_report(
> +"so_ffamily unknown, unable to restore so_laddr and 
> so_lport");
> +}
>  }
>  so->so_iptos = qemu_get_byte(f);
>  so->so_emu = qemu_get_byte(f);
> @@ -1294,7 +1302,7 @@ static int slirp_state_load(QEMUFile *f, void *opaque, 
> int version_id)
>  if (!so)
>  return -ENOMEM;
>  
> -ret = slirp_socket_load(f, so);
> +ret = slirp_socket_load(f, so, version_id);
>  
>  if (ret < 0)
>  return ret;
> -- 
> 1.8.3.1
> 

-- 
Samuel
requests.agnjo
gj a po  mi
shnthdrdcvallus hsx mvgduwolgfwtq
uzuy
s
p
h
 -+- spams forever ... -+-

[Qemu-devel] [PATCH 1/3] nbd: Treat flags vs. command type as separate fields

2016-03-31 Thread Eric Blake

Current upstream NBD documents that requests have a 16-bit flags,
followed by a 16-bit type integer; although older versions mentioned
only a 32-bit field with masking to find flags.  Since the protocol
is in network order (big-endian over the wire), the ABI is unchanged;
but dealing with the flags as a separate field rather than masking
will make it easier to add support for upcoming NBD extensions that
improve the efficiency of sparse file handling (since those
extensions will increase the number of both flags and commands).

Improve some comments in nbd.h based on the current upstream
NBD protocol (https://github.com/yoe/nbd/blob/master/doc/proto.md),
and touch some nearby code to keep checkpatch.pl happy.

Signed-off-by: Eric Blake 
---
 include/block/nbd.h | 17 +++--
 block/nbd-client.c  | 11 ---
 nbd/client.c| 17 ++---
 nbd/server.c| 28 
 4 files changed, 41 insertions(+), 32 deletions(-)

diff --git a/include/block/nbd.h b/include/block/nbd.h
index b86a976..f018968 100644
--- a/include/block/nbd.h
+++ b/include/block/nbd.h
@@ -27,7 +27,8 @@

 struct nbd_request {
 uint32_t magic;
-uint32_t type;
+uint16_t flags;
+uint16_t type;
 uint64_t handle;
 uint64_t from;
 uint32_t len;
@@ -39,6 +40,8 @@ struct nbd_reply {
 uint64_t handle;
 } QEMU_PACKED;

+/* Transmission (export) flags: sent from server to client during handshake,
+   but describe what will happen during transmission */
 #define NBD_FLAG_HAS_FLAGS  (1 << 0)/* Flags are there */
 #define NBD_FLAG_READ_ONLY  (1 << 1)/* Device is read-only */
 #define NBD_FLAG_SEND_FLUSH (1 << 2)/* Send FLUSH */
@@ -46,10 +49,12 @@ struct nbd_reply {
 #define NBD_FLAG_ROTATIONAL (1 << 4)/* Use elevator algorithm - 
rotational media */
 #define NBD_FLAG_SEND_TRIM  (1 << 5)/* Send TRIM (discard) */

-/* New-style global flags. */
+/* New-style handshake (global) flags, sent from server to client, and
+   control what will happen during handshake phase. */
 #define NBD_FLAG_FIXED_NEWSTYLE (1 << 0)/* Fixed newstyle protocol. */

-/* New-style client flags. */
+/* New-style client flags, sent from client to server to control what happens
+   during handshake phase. */
 #define NBD_FLAG_C_FIXED_NEWSTYLE   (1 << 0)/* Fixed newstyle protocol. */

 /* Reply types. */
@@ -60,10 +65,10 @@ struct nbd_reply {
 #define NBD_REP_ERR_INVALID ((UINT32_C(1) << 31) | 3) /* Invalid length. */
 #define NBD_REP_ERR_TLS_REQD((UINT32_C(1) << 31) | 5) /* TLS required */

+/* Request flags, sent from client to server during transmission phase */
+#define NBD_CMD_FLAG_FUA(1 << 0)

-#define NBD_CMD_MASK_COMMAND   0x
-#define NBD_CMD_FLAG_FUA   (1 << 16)
-
+/* Supported request types */
 enum {
 NBD_CMD_READ = 0,
 NBD_CMD_WRITE = 1,
diff --git a/block/nbd-client.c b/block/nbd-client.c
index 021a88b..8645be4 100644
--- a/block/nbd-client.c
+++ b/block/nbd-client.c
@@ -252,7 +252,7 @@ static int nbd_co_writev_1(BlockDriverState *bs, int64_t 
sector_num,

 if ((*flags & BDRV_REQ_FUA) && (client->nbdflags & NBD_FLAG_SEND_FUA)) {
 *flags &= ~BDRV_REQ_FUA;
-request.type |= NBD_CMD_FLAG_FUA;
+request.flags |= NBD_CMD_FLAG_FUA;
 }

 request.from = sector_num * 512;
@@ -319,8 +319,9 @@ int nbd_client_co_flush(BlockDriverState *bs)
 return 0;
 }

+/* XXX: NBD Protocol only documents use of FUA with WRITE */
 if (client->nbdflags & NBD_FLAG_SEND_FUA) {
-request.type |= NBD_CMD_FLAG_FUA;
+request.flags |= NBD_CMD_FLAG_FUA;
 }

 request.from = 0;
@@ -380,11 +381,7 @@ void nbd_client_attach_aio_context(BlockDriverState *bs,
 void nbd_client_close(BlockDriverState *bs)
 {
 NbdClientSession *client = nbd_get_client_session(bs);
-struct nbd_request request = {
-.type = NBD_CMD_DISC,
-.from = 0,
-.len = 0
-};
+struct nbd_request request = { .type = NBD_CMD_DISC };

 if (client->ioc == NULL) {
 return;
diff --git a/nbd/client.c b/nbd/client.c
index f89c0a1..5af9f26 100644
--- a/nbd/client.c
+++ b/nbd/client.c
@@ -628,15 +628,18 @@ ssize_t nbd_send_request(QIOChannel *ioc, struct 
nbd_request *request)
 uint8_t buf[NBD_REQUEST_SIZE];
 ssize_t ret;

-cpu_to_be32w((uint32_t*)buf, NBD_REQUEST_MAGIC);
-cpu_to_be32w((uint32_t*)(buf + 4), request->type);
-cpu_to_be64w((uint64_t*)(buf + 8), request->handle);
-cpu_to_be64w((uint64_t*)(buf + 16), request->from);
-cpu_to_be32w((uint32_t*)(buf + 24), request->len);
+cpu_to_be32w((uint32_t *)buf, NBD_REQUEST_MAGIC);
+cpu_to_be16w((uint16_t *)(buf + 4), request->flags);
+cpu_to_be16w((uint16_t *)(buf + 6), request->type);
+cpu_to_be64w((uint64_t *)(buf + 8), request->handle);
+cpu_to_be64w((uint64_t *)(buf + 16), request->from);
+cpu_to_be32w((uint32_t *)(buf

[Qemu-devel] [PATCH 3/3] nbd: Reject unknown request flags

2016-03-31 Thread Eric Blake

The NBD protocol says that clients should not send a command flag
that has not been negotiated (whether by the client requesting an
option during a handshake, or because we advertise support for the
flag in response to NBD_OPT_EXPORT_NAME), and that servers should
reject invalid flags with EINVAL.  We were silently ignoring the
flags instead.  The client can't rely on our behavior, since it is
their fault for passing the bad flag in the first place, but it's
better to be robust up front than to possibly behave differently
than the client was expecting with the attempted flag.

Signed-off-by: Eric Blake 
---
 nbd/server.c | 4 
 1 file changed, 4 insertions(+)

diff --git a/nbd/server.c b/nbd/server.c
index a590773..31bd9c5 100644
--- a/nbd/server.c
+++ b/nbd/server.c
@@ -974,6 +974,10 @@ static ssize_t nbd_co_receive_request(NBDRequest *req, 
struct nbd_request *reque
 goto out;
 }

+if (request->flags & ~NBD_CMD_FLAG_FUA) {
+LOG("unsupported flags (got 0x%x)", request->flags);
+return -EINVAL;
+}
 if ((request->from + request->len) < request->from) {
 LOG("integer overflow detected! "
 "you're probably being attacked");
-- 
2.5.5

[Qemu-devel] [PATCH 2/3] nbd: Fix poor debug message

2016-03-31 Thread Eric Blake

The client sends messages to the server, not itself.

Signed-off-by: Eric Blake 
---
 nbd/client.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/nbd/client.c b/nbd/client.c
index 5af9f26..d95ad7a 100644
--- a/nbd/client.c
+++ b/nbd/client.c
@@ -635,7 +635,7 @@ ssize_t nbd_send_request(QIOChannel *ioc, struct 
nbd_request *request)
 cpu_to_be64w((uint64_t *)(buf + 16), request->from);
 cpu_to_be32w((uint32_t *)(buf + 24), request->len);

-TRACE("Sending request to client: "
+TRACE("Sending request to server: "
   "{ .from = %" PRIu64", .len = %u, .handle = %" PRIu64
   ", .flags=%x, .type=%i}",
   request->from, request->len, request->handle, request->flags,
-- 
2.5.5

[Qemu-devel] [PATCH 0/3] NBD: flag handling cleanup

2016-03-31 Thread Eric Blake

I'm currently helping with work on adding extensions to the
upstream NBD protocol to allow more efficient handling of sparse
files, and wanted to get my feet wet in the qemu NBD code (both
client and server), to actually implement those proposals to see
how well they work.  In the process, I found a flaw where we
were silently ignoring bad client flags; the same flaw was just
recently patched in upstream NBD:
https://github.com/yoe/nbd/commit/ab22e0820

I didn't go quite as far as upstream (that is, I still silently
permit FLAG_FUA in combination with CMD_TRIM, which does not
have defined semantics upstream, and where our current
implementation might be too weak compared to the actual semantics
that upstream wants to propose, because we aren't actually
flushing in that case), but this at least gets rid of the
worst offenses and starts to document some anomolies that may
need later fixing.

Eric Blake (3):
  nbd: Treat flags vs. command type as separate fields
  nbd: Fix poor debug message
  nbd: Reject unknown request flags

 include/block/nbd.h | 17 +++--
 block/nbd-client.c  | 11 ---
 nbd/client.c| 19 +++
 nbd/server.c| 32 
 4 files changed, 46 insertions(+), 33 deletions(-)

-- 
2.5.5

Re: [Qemu-devel] [Nbd] Is NBD_CMD_FLAG_FUA valid during NBD_CMD_FLUSH?

2016-03-31 Thread Eric Blake

On 03/31/2016 02:17 PM, Alex Bligh wrote:
> OK so I actually went and researched what my answer was last time I
> was asked ( :-) ):
> 
> Here was my conclusion last time after trawling through lkml
> on the subject:
> 
> From https://sourceforge.net/p/nbd/mailman/message/27569820/
> 
>> You may process commands out of order, and reply out of order,
>> save that
>> a) all write commands *completed* before you process a REQ_FLUSH
>>  must be written to non-volatile storage prior to completing
>>  that REQ_FLUSH (though apparently you should, if possible, make
>>  this true for all write commands *received*, which is a stronger
>>  condition) [Ignore this if you don't set SEND_REQ_FLUSH]
>> b) a REQ_FUA flagged write must not complete until its payload
>>  is written to non-volatile storage [ignore this if you don't
>>  set SEND_REQ_FUA]
>>
> 
> 
> Perhaps it would be good for that to actually go in the docs!

Indeed.

> 
> I don't think we need a 'stronger barrier' as the client can
> implement that itself merely by waiting for all commands to
> complete prior to sending FLUSH.
> 
> Incidentally, last time I looked, the linux kernel always sent
> a FLUSH immediately after any bio marked FUA. Does qemu use
> more interesting behavioural modes?

I'm just learning the qemu nbd code myself, so I don't have a good
answer, other than what I wrote before:

>>
>> In qemu, read+FUA just triggers blk_co_flush() prior to reading; but
>> that's the same function it calls for write+FUA.
> 
> That's harmless, but unnecessary in the sense that current documented
> behaviour doesn't require it. Perhaps it should?

It _is_ a reasonable semantic - it means you are guaranteed that what
YOU read will match what anyone ELSE reads (without waiting for writes
to land, what YOU read SHOULD favor what is sitting in pending writes,
while what others read may be stale on-disk data about to be overwritten
by pending writes).  And while I'm a bit fuzzy on the POSIX semantics of
O_SYNC and O_DSYNC with open(), and on sync() vs. fsync() vs.
fdatasync() vs. syncfs() (they are all subtly different, and I never
remember which is stronger in what scenarios, nor how Linux subtly
differs from what POSIX says), POSIX does have some wording syncs being
useful even on reads to force the read to not complete until you can
guarantee that everyone else will read the same content (that is, the
sync flushes the pending writes, even though you are doing a read
operation).

> 
> I suppose TRIM etc. should support FUA too?

Probably, and with similar semantics to WRITE (only affects this
transaction rather than all pending ones, but guarantees that the trim
lands on disk before returning).

>> Meanwhile, it sounds like FUA is valid on read, write, AND flush
>> (because the kernel supports all three),
> 
> Do you have a pointer to what FUA means on kernel reads? Does it

No clue. I'm not a kernel expert, and was assuming that you knew more
about it than me.

> mean "force unit access for the read" or does it mean "flush any
> write for that block first"? The first is subtly different if the
> file is remote and being accessed by multiple people (e.g. NFS, Ceph etc.)

I would lean to the latter - FUA on a read seems like it is most useful
if it means "guarantee that no one else can read something older than
what I read", and NOT "give me possibly stale data because I accessed
the underlying storage rather than paying attention to in-flight writes
that would change what I read".  In other words, I think you should
ALWAYS prefer data from in-flight writes over going to backing storage,
but USUALLY don't need the overhead of waiting for those writes to
complete; FUA slows down your read, but gives you better data assurance.

> 
>> even if we aren't quite sure
>> what to document of those flags.  And that means qemu is correct, and
>> the NBD protocol has a bug.  Since you contributed the FUA flag, is that
>> something you can try to improve?
> 
> Yeah. My mess so I should clean it up. I think FUA should be valid
> on essentially everything.
> 
> I think I might wait until structured replies is in though!

It's also tricky because we just barely documented that servers SHOULD
reject invalid flags with EINVAL; and that clients MUST NOT send FUA on
commands where it is not documented; I don't know if we have an adequate
discovery system in place to learn _which_ commands support FUA,
especially if you are proposing that we expand the scope of FUA to be
valid alongside a TRIM request.

It doesn't have to be solved today, though, so I'm fine if you wait for
structured replies first.

-- 
Eric Blake   eblake redhat com+1-919-301-3266
Libvirt virtualization library http://libvirt.org

signature.asc
Description: OpenPGP digital signature

[Qemu-devel] [Bug 1563887] Re: qemu-system-ppc64 freezes on starting image on ppc64le

2016-03-31 Thread Mike Rushton

I have installed qemu-slof from trusty and ran again:

qemu-slof:
  Installed: 20151103+dfsg-1ubuntu1
  Candidate: 20151103+dfsg-1ubuntu1
  Version table:
 *** 20151103+dfsg-1ubuntu1 500
500 http://ports.ubuntu.com/ubuntu-ports xenial/main ppc64el Packages
100 /var/lib/dpkg/status
 20131015+dfsg-1ubuntu1 500
500 http://ports.ubuntu.com/ubuntu-ports trusty/universe ppc64el 
Packages
ubuntu@alpine01:~$ sudo apt-get install qemu-slof=20131015+dfsg-1ubuntu1

It froze on the exact same spot.

-- 
You received this bug notification because you are a member of qemu-
devel-ml, which is subscribed to QEMU.
https://bugs.launchpad.net/bugs/1563887

Title:
  qemu-system-ppc64 freezes on starting image on ppc64le

Status in QEMU:
  New
Status in qemu package in Ubuntu:
  New

Bug description:
  qemu-system-ppc64 running on Ubuntu 16.04 beta-2 fails to start an
  image as part of the certification process. This on an IBM ppc64le in
  PowerVM mode running Ubuntu 16.04 beta-2 deployed by MAAS 1.9.1. There
  is no error output.

  ubuntu@alpine01:~/kvm$ qemu-system-ppc64 -m 256 -display none -nographic -net 
nic -net user,net=10.0.0.0/8,host=10.0.0.1,hostfwd=tcp::-:22 -machine 
pseries -drive file=xenial-server-cloudimg-ppc64el-disk1.img,if=virtio -drive 
file=seed.iso,if=virtio
  WARNING: Image format was not specified for 'seed.iso' and probing guessed 
raw.
   Automatically detecting the format is dangerous for raw images, 
write operations on block 0 will be restricted.
   Specify the 'raw' format explicitly to remove the restrictions.

  SLOF **
  QEMU Starting
   Build Date = Jan 29 2016 18:58:37
   FW Version = buildd@ release 20151103
   Press "s" to enter Open Firmware.

  Populating /vdevice methods
  Populating /vdevice/vty@7100
  Populating /vdevice/nvram@7101
  Populating /vdevice/l-lan@7102
  Populating /vdevice/v-scsi@7103
     SCSI: Looking for devices
    8200 CD-ROM   : "QEMU QEMU CD-ROM  2.5+"
  Populating /pci@8002000
   00 1800 (D) : 1af4 1001virtio [ block ]
   00 1000 (D) : 1af4 1001virtio [ block ]
   00 0800 (D) : 106b 003fserial bus [ usb-ohci ]
   00  (D) : 1234 qemu vga
  No NVRAM common partition, re-initializing...
  Installing QEMU fb

  Scanning USB
    OHCI: initializing
  USB Keyboard
  USB mouse
  No console specified using screen & keyboard

    Welcome to Open Firmware

    Copyright (c) 2004, 2011 IBM Corporation All rights reserved.
    This program and the accompanying materials are made available
    under the terms of the BSD License available at
    http://www.opensource.org/licenses/bsd-license.php

  Trying to load:  from: /pci@8002000/scsi@3 ...
  E3404: Not a bootable device!
  Trying to load:  from: /pci@8002000/scsi@2 ...   Successfully loaded
  Linux ppc64le
  #31-Ubuntu SMP F

  ProblemType: Bug
  DistroRelease: Ubuntu 16.04
  Package: qemu-system-ppc 1:2.5+dfsg-5ubuntu6
  ProcVersionSignature: Ubuntu 4.4.0-16.32-generic 4.4.6
  Uname: Linux 4.4.0-16-generic ppc64le
  ApportVersion: 2.20-0ubuntu3
  Architecture: ppc64el
  Date: Wed Mar 30 14:10:01 2016
  KvmCmdLine:
   COMMAND STAT  EUID  RUID   PID  PPID %CPU COMMAND
   kvm-irqfd-clean S<   0 0  1172 2  0.0 [kvm-irqfd-clean]
   qemu-nbdSsl  0 0 13467 1  0.0 qemu-nbd -c /dev/nbd0 
xenial-server-cloudimg-ppc64el-disk1.img
   qemu-system-ppc Sl+   1000  1000 18973 18896  101 qemu-system-ppc64 -m 256 
-display none -nographic -net nic -net 
user,net=10.0.0.0/8,host=10.0.0.1,hostfwd=tcp::-:22 -machine pseries -drive 
file=xenial-server-cloudimg-ppc64el-disk1.img,if=virtio -drive 
file=seed.iso,if=virtio
  Lsusb: Error: command ['lsusb'] failed with exit code 1:
  ProcEnviron:
   TERM=xterm
   PATH=(custom, no user)
   LANG=en_US.UTF-8
   SHELL=/bin/bash
  ProcKernelCmdLine: BOOT_IMAGE=/boot/vmlinux-4.4.0-16-generic 
root=UUID=92d820c8-ab25-497b-9b1e-f1435992bbf3 ro
  ProcLoadAvg: 1.08 0.94 0.58 2/616 19571
  ProcLocks:
   1: POSIX  ADVISORY  WRITE 886 00:13:381 0 EOF
   2: POSIX  ADVISORY  WRITE 1339 00:13:528 0 EOF
   3: FLOCK  ADVISORY  WRITE 1284 00:13:522 0 EOF
   4: POSIX  ADVISORY  WRITE 2281 00:13:563 0 EOF
   5: POSIX  ADVISORY  WRITE 1331 00:13:536 0 EOF
  ProcSwaps:
   Filename TypeSizeUsedPriority
   /swap.img   file 8388544 0   -1
  ProcVersion: Linux version 4.4.0-16-generic (buildd@bos01-ppc64el-001) (gcc 
version 5.3.1 20160320 (Ubuntu/Linaro/IBM 5.3.1-12ubuntu4) ) #32-Ubuntu SMP Thu 
Mar 24 22:31:14 UTC 2016
  SourcePackage: qemu
  UpgradeStatus: No upgrade log present (probably fresh install)
  bootlist:
   /pci@8002011/pci1014,034A@0/sas/disk@4068402c40

Re: [Qemu-devel] [PATCH 2/2] Correct definition of NBD_CMD_FLAG_FUA

2016-03-31 Thread Alex Bligh


On 31 Mar 2016, at 21:07, Eric Blake  wrote:

>> 
>> or include/uapi/linux/nbd.h ; these have no reference to FUA at all,
> 
> I'm not finding that file on my system; not sure what it contains, or
> where it is maintained.

files have moved around in the kernel, and the userspace api file now lives in
include/uapi/linux/nbd.h e.g:

https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/include/uapi/linux/nbd.h

include/linux/nbd.h has gone from the kernel, and previously included
internal structs (now in nbd.c I think)

Disclaimer: it's a while since I looked at nbd kernel side.

--
Alex Bligh






signature.asc
Description: Message signed with OpenPGP using GPGMail

Re: [Qemu-devel] [Nbd] Is NBD_CMD_FLAG_FUA valid during NBD_CMD_FLUSH?

2016-03-31 Thread Alex Bligh


On 31 Mar 2016, at 20:54, Eric Blake  wrote:
> 
> Oh, and I also just found that qemu's nbd-server tries to honor FUA on
> read, even though the protocol doesn't document that as valid either.

Potentially useful, but I believe not required (I don't believe the
kernel does that, and I *believe* qemu's block layer does the
same as the kernel).

>> This turned out to be an easier way of describing the operations
>> than describing them semantically (in particular FLUSH, where I
>> couldn't get an entirely consistent answer of what it required
>> of inflight requests, specifically whether it required all
>> requests inflight at the time of making the request to be written
>> to disk prior to answering, or all requests inflight prior to the
>> time of replying to be written to disk prior to answering, though
>> I believe the former).
>> 
>> FUA just requires that particular request to be persisted to
>> disk, and does not require other requests to be persisted to disk
> 
> As written, NBD says that FUA requires the current write operation to
> land on disk (but says nothing about any other writes, whether those
> writes had an early reply).

That is my understanding.

>  And for flush, NBD only requires that all
> writes that have _sent_ their reply to the client must land on disk, but
> this can certainly be a smaller set of write requests than _all_ writes
> issued prior to that point in time.  So maybe flush+FUA is a valid thing
> to support, and means that ALL in-flight writes must land, whether or
> not a reply has been sent to the client, for an even stronger barrier?

OK so I actually went and researched what my answer was last time I
was asked ( :-) ):

Here was my conclusion last time after trawling through lkml
on the subject:

From https://sourceforge.net/p/nbd/mailman/message/27569820/

> You may process commands out of order, and reply out of order,
> save that
> a) all write commands *completed* before you process a REQ_FLUSH
>  must be written to non-volatile storage prior to completing
>  that REQ_FLUSH (though apparently you should, if possible, make
>  this true for all write commands *received*, which is a stronger
>  condition) [Ignore this if you don't set SEND_REQ_FLUSH]
> b) a REQ_FUA flagged write must not complete until its payload
>  is written to non-volatile storage [ignore this if you don't
>  set SEND_REQ_FUA]
> 


Perhaps it would be good for that to actually go in the docs!

I don't think we need a 'stronger barrier' as the client can
implement that itself merely by waiting for all commands to
complete prior to sending FLUSH.

Incidentally, last time I looked, the linux kernel always sent
a FLUSH immediately after any bio marked FUA. Does qemu use
more interesting behavioural modes?

>> So in answer to your question, my understanding is that FLUSH requires
>> (some subset) of otherwise potentially non-persisted requests to
>> be persisted to disk. In that sense it implies FUA. It is permitted
>> to set FUA (as it is permitted, I believe, in the linux block layer)
>> but it will make no difference.
>> 
>> I once thought FUA on read should bypass any local read cache, though
>> that is not part of the spec currently.
> 
> In qemu, read+FUA just triggers blk_co_flush() prior to reading; but
> that's the same function it calls for write+FUA.

That's harmless, but unnecessary in the sense that current documented
behaviour doesn't require it. Perhaps it should?

I suppose TRIM etc. should support FUA too?

>  And for flush (whether
> or not FUA was specified), qemu still calls blk_co_flush().  So from
> qemu's perspective, FUA is synonymous with "finish ALL pending
> transactions", which is stronger than what the NBD protocol requires.
> (Nothing wrong with an implementation doing more work than required,
> although it may be less efficient).  Alas, that means I can't use qemu's
> behavior as a good reference for how to improve the NBD spec.
> 
> Meanwhile, it sounds like FUA is valid on read, write, AND flush
> (because the kernel supports all three),

Do you have a pointer to what FUA means on kernel reads? Does it
mean "force unit access for the read" or does it mean "flush any
write for that block first"? The first is subtly different if the
file is remote and being accessed by multiple people (e.g. NFS, Ceph etc.)

> even if we aren't quite sure
> what to document of those flags.  And that means qemu is correct, and
> the NBD protocol has a bug.  Since you contributed the FUA flag, is that
> something you can try to improve?

Yeah. My mess so I should clean it up. I think FUA should be valid
on essentially everything.

I think I might wait until structured replies is in though!

--
Alex Bligh






signature.asc
Description: Message signed with OpenPGP using GPGMail

Re: [Qemu-devel] [PATCH 2/2] Correct definition of NBD_CMD_FLAG_FUA

2016-03-31 Thread Eric Blake

On 03/31/2016 01:25 PM, Alex Bligh wrote:
> 
> On 31 Mar 2016, at 20:14, Eric Blake  wrote:
> 
>>>
>>>
>>> Should we produce a new name for it (and future command flags)
>>> that aren't shifted left 16 places, and just maintain the
>>> current value for compatibility?
>>
>> I don't see the point.  Your fix looks correct.
> 
> OK. And the wrongness hasn't yet got into /usr/include/linux/nbd.h

Still using the older '__be32 type;' instead of the newer '__be16 flags;
__be16 type;', changing that one will be ABI compatible, but not API
compatible.  I don't know what people want to do there, :(

> or include/uapi/linux/nbd.h ; these have no reference to FUA at all,

I'm not finding that file on my system; not sure what it contains, or
where it is maintained.

> even though it does have NBD_FLAG_SEND_FLUSH, which is odd as I added
> both flags at the same time.
> 
> So I'm guessing it's safe.
> 
> --
> Alex Bligh
> 
> 
> 
> 

-- 
Eric Blake   eblake redhat com+1-919-301-3266
Libvirt virtualization library http://libvirt.org



signature.asc
Description: OpenPGP digital signature

Re: [Qemu-devel] tcg: reworking tb_invalidated_flag

2016-03-31 Thread Paolo Bonzini

On 31/03/2016 21:03, Sergey Fedorov wrote:
> Looks like we have to ensure all vCPUs are out of translated code when
> doing TB patching either doing tb_add_jump() or tb_phys_invalidate().
> Did I missed something?

Almost all TCG targets have naturally aligned instructions, so that's
not a problem; we can assume that 32-bit writes are atomic, though
perhaps we can change them to atomic_set just to be safe.

Only s390 and x86 can have unaligned instructions.  For x86 I suppose
you can use 1 to 3 byte nops so that the first byte of the jump ends up
at ip%4=3.  For s390 you can do the same, I don't know the encoding of
the canonical nop but an "or 0,0" instruction can do and is 16 bits wide
(in this case instructions are 16-bit aligned so you'd want ip%4=2).

Paolo

Re: [Qemu-devel] [Nbd] Is NBD_CMD_FLAG_FUA valid during NBD_CMD_FLUSH?

2016-03-31 Thread Eric Blake

On 03/31/2016 01:41 PM, Alex Bligh wrote:
> 
> On 31 Mar 2016, at 20:33, Eric Blake  wrote:
> 
>> Qemu's nbd-client is setting NBD_CMD_FLAG_FUA during a flush command,
>> but the official NBD protocol documentation doesn't describe this as
>> valid (it merely states that flush must not have a reply until all
>> acknowledged writes have hit permanent storage).  Does this flag make
>> sense (what semantics would the flag add, and we need to fix the NBD
>> docs as well as relax the reference implementation to allow the flag),
>> or is it a bug in qemu (and the recent tightening of NBD to throw EINVAL
>> on unsupported flags will trip up qemu)?
> 
> As the original author of that particular mess, the intent was that
> they should reflect exactly the Linux kernel's semantics for FLUSH
> and FUA, not only in terms of whether they can be used together,
> but also exactly what they mean.

Oh, and I also just found that qemu's nbd-server tries to honor FUA on
read, even though the protocol doesn't document that as valid either.

> 
> This turned out to be an easier way of describing the operations
> than describing them semantically (in particular FLUSH, where I
> couldn't get an entirely consistent answer of what it required
> of inflight requests, specifically whether it required all
> requests inflight at the time of making the request to be written
> to disk prior to answering, or all requests inflight prior to the
> time of replying to be written to disk prior to answering, though
> I believe the former).
> 
> FUA just requires that particular request to be persisted to
> disk, and does not require other requests to be persisted to disk

As written, NBD says that FUA requires the current write operation to
land on disk (but says nothing about any other writes, whether those
writes had an early reply).  And for flush, NBD only requires that all
writes that have _sent_ their reply to the client must land on disk, but
this can certainly be a smaller set of write requests than _all_ writes
issued prior to that point in time.  So maybe flush+FUA is a valid thing
to support, and means that ALL in-flight writes must land, whether or
not a reply has been sent to the client, for an even stronger barrier?

> So in answer to your question, my understanding is that FLUSH requires
> (some subset) of otherwise potentially non-persisted requests to
> be persisted to disk. In that sense it implies FUA. It is permitted
> to set FUA (as it is permitted, I believe, in the linux block layer)
> but it will make no difference.
> 
> I once thought FUA on read should bypass any local read cache, though
> that is not part of the spec currently.

In qemu, read+FUA just triggers blk_co_flush() prior to reading; but
that's the same function it calls for write+FUA.  And for flush (whether
or not FUA was specified), qemu still calls blk_co_flush().  So from
qemu's perspective, FUA is synonymous with "finish ALL pending
transactions", which is stronger than what the NBD protocol requires.
(Nothing wrong with an implementation doing more work than required,
although it may be less efficient).  Alas, that means I can't use qemu's
behavior as a good reference for how to improve the NBD spec.

Meanwhile, it sounds like FUA is valid on read, write, AND flush
(because the kernel supports all three), even if we aren't quite sure
what to document of those flags.  And that means qemu is correct, and
the NBD protocol has a bug.  Since you contributed the FUA flag, is that
something you can try to improve?

-- 
Eric Blake   eblake redhat com+1-919-301-3266
Libvirt virtualization library http://libvirt.org

signature.asc
Description: OpenPGP digital signature

Re: [Qemu-devel] [Nbd] Is NBD_CMD_FLAG_FUA valid during NBD_CMD_FLUSH?

2016-03-31 Thread Alex Bligh

On 31 Mar 2016, at 20:33, Eric Blake  wrote:

> Qemu's nbd-client is setting NBD_CMD_FLAG_FUA during a flush command,
> but the official NBD protocol documentation doesn't describe this as
> valid (it merely states that flush must not have a reply until all
> acknowledged writes have hit permanent storage).  Does this flag make
> sense (what semantics would the flag add, and we need to fix the NBD
> docs as well as relax the reference implementation to allow the flag),
> or is it a bug in qemu (and the recent tightening of NBD to throw EINVAL
> on unsupported flags will trip up qemu)?

As the original author of that particular mess, the intent was that
they should reflect exactly the Linux kernel's semantics for FLUSH
and FUA, not only in terms of whether they can be used together,
but also exactly what they mean.

This turned out to be an easier way of describing the operations
than describing them semantically (in particular FLUSH, where I
couldn't get an entirely consistent answer of what it required
of inflight requests, specifically whether it required all
requests inflight at the time of making the request to be written
to disk prior to answering, or all requests inflight prior to the
time of replying to be written to disk prior to answering, though
I believe the former).

FUA just requires that particular request to be persisted to
disk, and does not require other requests to be persisted to disk

So in answer to your question, my understanding is that FLUSH requires
(some subset) of otherwise potentially non-persisted requests to
be persisted to disk. In that sense it implies FUA. It is permitted
to set FUA (as it is permitted, I believe, in the linux block layer)
but it will make no difference.

I once thought FUA on read should bypass any local read cache, though
that is not part of the spec currently.

--
Alex Bligh

signature.asc
Description: Message signed with OpenPGP using GPGMail

[Qemu-devel] Is NBD_CMD_FLAG_FUA valid during NBD_CMD_FLUSH?

2016-03-31 Thread Eric Blake

Qemu's nbd-client is setting NBD_CMD_FLAG_FUA during a flush command,
but the official NBD protocol documentation doesn't describe this as
valid (it merely states that flush must not have a reply until all
acknowledged writes have hit permanent storage).  Does this flag make
sense (what semantics would the flag add, and we need to fix the NBD
docs as well as relax the reference implementation to allow the flag),
or is it a bug in qemu (and the recent tightening of NBD to throw EINVAL
on unsupported flags will trip up qemu)?

-- 
Eric Blake   eblake redhat com+1-919-301-3266
Libvirt virtualization library http://libvirt.org



signature.asc
Description: OpenPGP digital signature

Re: [Qemu-devel] [PATCH 2/2] Correct definition of NBD_CMD_FLAG_FUA

2016-03-31 Thread Alex Bligh

On 31 Mar 2016, at 20:14, Eric Blake  wrote:

>> 
>> 
>> Should we produce a new name for it (and future command flags)
>> that aren't shifted left 16 places, and just maintain the
>> current value for compatibility?
> 
> I don't see the point.  Your fix looks correct.

OK. And the wrongness hasn't yet got into /usr/include/linux/nbd.h
or include/uapi/linux/nbd.h ; these have no reference to FUA at all,
even though it does have NBD_FLAG_SEND_FLUSH, which is odd as I added
both flags at the same time.

So I'm guessing it's safe.

--
Alex Bligh

signature.asc
Description: Message signed with OpenPGP using GPGMail

Re: [Qemu-devel] [Qemu-stable] [PATCH] target-i386: do not read/write MSR_TSC_AUX from KVM if CPUID bit is not set

2016-03-31 Thread Peter Lieven

Am 31.03.2016 um 18:52 schrieb Paolo Bonzini:
>
> On 31/03/2016 17:38, Peter Lieven wrote:
>>
>>> Am 31.03.2016 um 15:23 schrieb Paolo Bonzini :
>>>
>>>
>>>
 On 31/03/2016 15:12, Peter Lieven wrote:
>> KVM does not let you read or write this MSR if the corresponding CPUID
>> bit is not set.  This in turn causes MSRs that come after MSR_TSC_AUX
>> to be ignored by KVM_SET_MRSS.
 Is it possible that this causes a freeze when migrating vom qemu 2.2.0
 to 2.5.1?
>>> I wouldn't exclude it if the CPU model is Westmere or earlier.
>> the Host CPU or the Emulated CPU?
> The guest.

I can confirm this seems to fix the migration for emulated Westmere CPUs. 
Regardless if its a 2.2.0 -> 2.5.1 or 2.5.1 -> 2.5.1
migration.

Peter

Re: [Qemu-devel] [PATCH 2/2] Correct definition of NBD_CMD_FLAG_FUA

2016-03-31 Thread Eric Blake

On 03/31/2016 12:21 PM, Alex Bligh wrote:
> 
> On 31 Mar 2016, at 19:15, Alex Bligh  wrote:
> 
>> It is doubtful whether anyone is using NBD_CMD_FLAG_FUA
>> at the moment in any case.
> 
> Drat. I spoke too soon. Qemu uses it, but presumably from its
> own .h file.

Yes, qemu has its own nbd.h, which still has nbd_request with a single
uint32_type that holds both flags and command type.  It wouldn't be too
hard to rework that to more closely match upstream NBD.

> 
> However, it's now nonsensical having it defined as 1<<16 in a
> 16 bit flags variable.

I don't see any problem with your patch on the NBD project side of
things; it's not like 'make install' is dumping a header into
/usr/include for client programs to reuse (which is _why_ qemu is using
its own nbd.h), because no one has really churned out an NBD-client
library for embedding in larger programs.

> 
> Should we produce a new name for it (and future command flags)
> that aren't shifted left 16 places, and just maintain the
> current value for compatibility?

I don't see the point.  Your fix looks correct.

-- 
Eric Blake   eblake redhat com+1-919-301-3266
Libvirt virtualization library http://libvirt.org

signature.asc
Description: OpenPGP digital signature

Re: [Qemu-devel] [PATCH 1/2] proto.md: Clearly set out NBDMAGIC is the actual value

2016-03-31 Thread Eric Blake

On 03/31/2016 12:15 PM, Alex Bligh wrote:
> Clearly set out NBDMAGIC, not the name of a constant equal to
> some value. Set out the value in hex as well.
> 
> Signed-off-by: Alex Bligh 
> ---
>  doc/proto.md | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/doc/proto.md b/doc/proto.md
> index c1e05c5..7994076 100644
> --- a/doc/proto.md
> +++ b/doc/proto.md
> @@ -67,7 +67,7 @@ newstyle negotiation.
>  
>   Oldstyle negotiation
>  
> -S: 64 bits, `NBDMAGIC` (also known as the `INIT_PASSWD`)  
> +S: 64 bits, `0x4e42444d41474943` (ASCII '`NBDMAGIC`') (also known as the 
> `INIT_PASSWD`)  
>  S: 64 bits, `0x00420281861253` (`cliserv_magic`, a magic number)  
>  S: 64 bits, size of the export in bytes (unsigned)  
>  S: 32 bits, flags  
> @@ -96,7 +96,7 @@ production purposes.
>  
>  The initial few exchanges in newstyle negotiation look as follows:
>  
> -S: 64 bits, `NBDMAGIC` (as in the old style handshake)  
> +S: 64 bits, `0x4e42444d41474943` (ASCII '`NBDMAGIC`') (as in the old style 
> handshake)  

Markdown doesn't care, but the rest of this file is less than 80
columns, making this a long line.  You can wrap it (see how I wrapped
the line regarding 124 bytes of zeroes in response to NBD_OPT_EXPORT_NAME).

>  S: 64 bits, `0x49484156454F5054` (note different magic number)  

As long as we are spelling out ASCII counterpart, is it worth mentioning
that this is (ASCII `'IHAVEOPT'`)?

-- 
Eric Blake   eblake redhat com+1-919-301-3266
Libvirt virtualization library http://libvirt.org



signature.asc
Description: OpenPGP digital signature

Re: [Qemu-devel] tcg: reworking tb_invalidated_flag

2016-03-31 Thread Sergey Fedorov

On 31/03/16 17:06, Sergey Fedorov wrote:
> It should be safe to invalidate a TB while some other CPU is executing
> its translated code.

Probably it's not safe to invalidate a TB while some other thread is
executing the translated code. Direct jumps to the TB being invalidated
should be reset. In case of using direct jump method, native jump
instruction should be patched in the translated code. There are some
restrictions on modification of concurrently executing code, e.g. see
section "3.4 Atomic Modification of Machine-Code Instructions" in [1].
For instance, only aligned, 8-byte atomic code modification are safe on
AMD processors, otherwise we can wind up executing a corrupted
instruction stream. I can't see i386 TCG backend does some alignment of
the jump target when translating goto_tb TCG op. I suspect other TCG
targets also have their limitations.

Looks like we have to ensure all vCPUs are out of translated code when
doing TB patching either doing tb_add_jump() or tb_phys_invalidate().
Did I missed something?

[1]
http://static.googleusercontent.com/media/research.google.com/en//pubs/archive/37204.pdf

Kind regards,
Sergey

[Qemu-devel] q35 migration broken

2016-03-31 Thread Dr. David Alan Gilbert

Hi,
  I'm seeing a breakage on q35 migration on head (and possibly older
but certainly head; it's also on a 2.5.0 world I've got with a bunch
of patches but I've not tried a clean 2.5.0 yet).

It looks like some type of interrupt screwup; with a virtio-net device
I get a:
  BUG: soft lockup - CPU#0 stuck for 22s!
  ...  virtnet_config_changed_work 

but if I swap that out for an e1000 I get:
  Disabling IRQ #22

  and various timeouts on e1000 and cdrom (scsi).
The guest kind of limps along with an existing terminal scrolling dmesg -w 
output.

This is an f23 guest on a rhel7.2-ish host; with the guest sitting an idle
(MATE) Gui.

i440fx works.

qemu 30823 1 15 14:51 ?00:00:07 
/opt/qemu-head/bin/qemu-system-x86_64 -name f23-q35 -S -machine 
pc-q35-2.6,accel=kvm,usb=off,vmport=off -cpu SandyBridge -m 4096 -realtime 
mlock=off -smp 4,sockets=4,cores=1,threads=1 -uuid 
3cc93d9b-9b87-4472-847c-25cea2bfc51f -no-user-config -nodefaults -chardev 
socket,id=charmonitor,path=/var/lib/libvirt/qemu/domain-12-f23-q35/monitor.sock,server,nowait
 -mon chardev=charmonitor,id=monitor,mode=control -rtc base=utc,driftfix=slew 
-global kvm-pit.lost_tick_policy=discard -no-hpet -no-shutdown -global 
ICH9-LPC.disable_s3=1 -global ICH9-LPC.disable_s4=1 -boot strict=on -device 
i82801b11-bridge,id=pci.1,bus=pcie.0,addr=0x1e -device 
pci-bridge,chassis_nr=2,id=pci.2,bus=pci.1,addr=0x1 -device 
ich9-usb-ehci1,id=usb,bus=pcie.0,addr=0x1d.0x7 -device 
ich9-usb-uhci1,masterbus=usb.0,firstport=0,bus=pcie.0,multifunction=on,addr=0x1d
 -device ich9-usb-uhci2,masterbus=usb.0,firstport=2,bus=pcie.0,addr=0x1d.0x1 
-device ich9-usb-uhci3,masterbus=usb.0,firstport=4,bus=pcie.0,addr=0x1d.0x2 
-device virtio-scsi-pci,id=scsi0,bus=pci.2,addr=0x3 -device 
virtio-serial-pci,id=virtio-serial0,bus=pci.2,addr=0x4 -drive 
file=/home/vms/f23-q35.qcow2,format=qcow2,if=none,id=drive-virtio-disk0,cache=none
 -device 
virtio-blk-pci,scsi=off,bus=pci.2,addr=0x5,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1
 -drive if=none,id=drive-scsi0-0-0-0,readonly=on -device 
scsi-cd,bus=scsi0.0,channel=0,scsi-id=0,lun=0,drive=drive-scsi0-0-0-0,id=scsi0-0-0-0
 -netdev tap,fd=25,id=hostnet0 -device 
e1000,netdev=hostnet0,id=net0,mac=52:54:00:98:12:7d,bus=pci.2,addr=0x1 -chardev 
pty,id=charserial0 -device isa-serial,chardev=charserial0,id=serial0 -chardev 
socket,id=charchannel0,path=/var/lib/libvirt/qemu/channel/target/domain-12-f23-q35/org.qemu.guest_agent.0,server,nowait
 -device 
virtserialport,bus=virtio-serial0.0,nr=1,chardev=charchannel0,id=channel0,name=org.qemu.guest_agent.0
 -chardev spicevmc,id=charchannel1,name=vdagent -device 
virtserialport,bus=virtio-serial0.0,nr=2,chardev=charchannel1,id=channel1,name=com.redhat.spice.0
 -device usb-tablet,id=input0 -spice 
port=5900,addr=127.0.0.1,disable-ticketing,seamless-migration=on -device 
qxl-vga,id=video0,ram_size=67108864,vram_size=67108864,vram64_size_mb=0,vgamem_mb=16,bus=pcie.0,addr=0x1
 -device intel-hda,id=sound0,bus=pci.2,addr=0x2 -device 
hda-duplex,id=sound0-codec0,bus=sound0.0,cad=0 -chardev 
spicevmc,id=charredir0,name=usbredir -device 
usb-redir,chardev=charredir0,id=redir0 -chardev 
spicevmc,id=charredir1,name=usbredir -device 
usb-redir,chardev=charredir1,id=redir1 -incoming defer -device 
virtio-balloon-pci,id=balloon0,bus=pci.2,addr=0x6 -msg timestamp=on

(Attaching libvirt xml that generated that lot).

Dave
--
Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK



  f23-q35
  3cc93d9b-9b87-4472-847c-25cea2bfc51f
  4194304
  4194304
  4
  
hvm

  
  



  
  
SandyBridge
  
  



  
  destroy
  restart
  restart
  


  
  
/opt/qemu-head/bin/qemu-system-x86_64

Re: [Qemu-devel] [PATCH 2/2] Correct definition of NBD_CMD_FLAG_FUA

2016-03-31 Thread Alex Bligh

On 31 Mar 2016, at 19:15, Alex Bligh  wrote:

> It is doubtful whether anyone is using NBD_CMD_FLAG_FUA
> at the moment in any case.

Drat. I spoke too soon. Qemu uses it, but presumably from its
own .h file.

However, it's now nonsensical having it defined as 1<<16 in a
16 bit flags variable.

Should we produce a new name for it (and future command flags)
that aren't shifted left 16 places, and just maintain the
current value for compatibility?

-- 
Alex Bligh

[Qemu-devel] [Bug 1563887] Re: qemu-system-ppc64 freezes on starting image on ppc64le

2016-03-31 Thread Serge Hallyn

Failing any brighter ideas, this should be pretty bisectable to figure
out what happened.  Hardware availability is the main problem.  Would
you be able to use your system to bisect to the commit introducing the
bug?

Actually before we get to that, could you try installing the 14.04 slof
package on the 16.04 system, and see whether that fixes it?


** Also affects: qemu
   Importance: Undecided
   Status: New

** Changed in: qemu (Ubuntu)
   Importance: Undecided => High

-- 
You received this bug notification because you are a member of qemu-
devel-ml, which is subscribed to QEMU.
https://bugs.launchpad.net/bugs/1563887

Title:
  qemu-system-ppc64 freezes on starting image on ppc64le

Status in QEMU:
  New
Status in qemu package in Ubuntu:
  New

Bug description:
  qemu-system-ppc64 running on Ubuntu 16.04 beta-2 fails to start an
  image as part of the certification process. This on an IBM ppc64le in
  PowerVM mode running Ubuntu 16.04 beta-2 deployed by MAAS 1.9.1. There
  is no error output.

  ubuntu@alpine01:~/kvm$ qemu-system-ppc64 -m 256 -display none -nographic -net 
nic -net user,net=10.0.0.0/8,host=10.0.0.1,hostfwd=tcp::-:22 -machine 
pseries -drive file=xenial-server-cloudimg-ppc64el-disk1.img,if=virtio -drive 
file=seed.iso,if=virtio
  WARNING: Image format was not specified for 'seed.iso' and probing guessed 
raw.
   Automatically detecting the format is dangerous for raw images, 
write operations on block 0 will be restricted.
   Specify the 'raw' format explicitly to remove the restrictions.

  SLOF **
  QEMU Starting
   Build Date = Jan 29 2016 18:58:37
   FW Version = buildd@ release 20151103
   Press "s" to enter Open Firmware.

  Populating /vdevice methods
  Populating /vdevice/vty@7100
  Populating /vdevice/nvram@7101
  Populating /vdevice/l-lan@7102
  Populating /vdevice/v-scsi@7103
     SCSI: Looking for devices
    8200 CD-ROM   : "QEMU QEMU CD-ROM  2.5+"
  Populating /pci@8002000
   00 1800 (D) : 1af4 1001virtio [ block ]
   00 1000 (D) : 1af4 1001virtio [ block ]
   00 0800 (D) : 106b 003fserial bus [ usb-ohci ]
   00  (D) : 1234 qemu vga
  No NVRAM common partition, re-initializing...
  Installing QEMU fb

  Scanning USB
    OHCI: initializing
  USB Keyboard
  USB mouse
  No console specified using screen & keyboard

    Welcome to Open Firmware

    Copyright (c) 2004, 2011 IBM Corporation All rights reserved.
    This program and the accompanying materials are made available
    under the terms of the BSD License available at
    http://www.opensource.org/licenses/bsd-license.php

  Trying to load:  from: /pci@8002000/scsi@3 ...
  E3404: Not a bootable device!
  Trying to load:  from: /pci@8002000/scsi@2 ...   Successfully loaded
  Linux ppc64le
  #31-Ubuntu SMP F

  ProblemType: Bug
  DistroRelease: Ubuntu 16.04
  Package: qemu-system-ppc 1:2.5+dfsg-5ubuntu6
  ProcVersionSignature: Ubuntu 4.4.0-16.32-generic 4.4.6
  Uname: Linux 4.4.0-16-generic ppc64le
  ApportVersion: 2.20-0ubuntu3
  Architecture: ppc64el
  Date: Wed Mar 30 14:10:01 2016
  KvmCmdLine:
   COMMAND STAT  EUID  RUID   PID  PPID %CPU COMMAND
   kvm-irqfd-clean S<   0 0  1172 2  0.0 [kvm-irqfd-clean]
   qemu-nbdSsl  0 0 13467 1  0.0 qemu-nbd -c /dev/nbd0 
xenial-server-cloudimg-ppc64el-disk1.img
   qemu-system-ppc Sl+   1000  1000 18973 18896  101 qemu-system-ppc64 -m 256 
-display none -nographic -net nic -net 
user,net=10.0.0.0/8,host=10.0.0.1,hostfwd=tcp::-:22 -machine pseries -drive 
file=xenial-server-cloudimg-ppc64el-disk1.img,if=virtio -drive 
file=seed.iso,if=virtio
  Lsusb: Error: command ['lsusb'] failed with exit code 1:
  ProcEnviron:
   TERM=xterm
   PATH=(custom, no user)
   LANG=en_US.UTF-8
   SHELL=/bin/bash
  ProcKernelCmdLine: BOOT_IMAGE=/boot/vmlinux-4.4.0-16-generic 
root=UUID=92d820c8-ab25-497b-9b1e-f1435992bbf3 ro
  ProcLoadAvg: 1.08 0.94 0.58 2/616 19571
  ProcLocks:
   1: POSIX  ADVISORY  WRITE 886 00:13:381 0 EOF
   2: POSIX  ADVISORY  WRITE 1339 00:13:528 0 EOF
   3: FLOCK  ADVISORY  WRITE 1284 00:13:522 0 EOF
   4: POSIX  ADVISORY  WRITE 2281 00:13:563 0 EOF
   5: POSIX  ADVISORY  WRITE 1331 00:13:536 0 EOF
  ProcSwaps:
   Filename TypeSizeUsedPriority
   /swap.img   file 8388544 0   -1
  ProcVersion: Linux version 4.4.0-16-generic (buildd@bos01-ppc64el-001) (gcc 
version 5.3.1 20160320 (Ubuntu/Linaro/IBM 5.3.1-12ubuntu4) ) #32-Ubuntu SMP Thu 
Mar 24 22:31:14 UTC 2016
  SourcePackage: qemu
  UpgradeStatus: No upgrade log present (probably fresh install)
  bootlist:
   /pci@8002011/pci1014,034A@0/sas/disk@4068402c40

[Qemu-devel] [PATCH 2/2] Correct definition of NBD_CMD_FLAG_FUA

2016-03-31 Thread Alex Bligh

NBD_CMD_FLAG_FUA is defined as 1<<0 in the documentation, but
1<<16 in nbd.h. It is not used anywhere within the code.
1<<16 cannot work as the flags word is only 16 bits long.
It is doubtful whether anyone is using NBD_CMD_FLAG_FUA
at the moment in any case.

Signed-off-by: Alex Bligh 
---
 nbd.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/nbd.h b/nbd.h
index f2a32dd..53b6ca1 100644
--- a/nbd.h
+++ b/nbd.h
@@ -38,7 +38,7 @@ enum {
 };
 
 #define NBD_CMD_MASK_COMMAND 0x
-#define NBD_CMD_FLAG_FUA (1<<16)
+#define NBD_CMD_FLAG_FUA (1 << 0)
 
 /* values for flags field */
 #define NBD_FLAG_HAS_FLAGS (1 << 0)/* Flags are there */
-- 
1.9.1

[Qemu-devel] [PATCH 1/2] proto.md: Clearly set out NBDMAGIC is the actual value

2016-03-31 Thread Alex Bligh

Clearly set out NBDMAGIC, not the name of a constant equal to
some value. Set out the value in hex as well.

Signed-off-by: Alex Bligh 
---
 doc/proto.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/doc/proto.md b/doc/proto.md
index c1e05c5..7994076 100644
--- a/doc/proto.md
+++ b/doc/proto.md
@@ -67,7 +67,7 @@ newstyle negotiation.
 
  Oldstyle negotiation
 
-S: 64 bits, `NBDMAGIC` (also known as the `INIT_PASSWD`)  
+S: 64 bits, `0x4e42444d41474943` (ASCII '`NBDMAGIC`') (also known as the 
`INIT_PASSWD`)  
 S: 64 bits, `0x00420281861253` (`cliserv_magic`, a magic number)  
 S: 64 bits, size of the export in bytes (unsigned)  
 S: 32 bits, flags  
@@ -96,7 +96,7 @@ production purposes.
 
 The initial few exchanges in newstyle negotiation look as follows:
 
-S: 64 bits, `NBDMAGIC` (as in the old style handshake)  
+S: 64 bits, `0x4e42444d41474943` (ASCII '`NBDMAGIC`') (as in the old style 
handshake)  
 S: 64 bits, `0x49484156454F5054` (note different magic number)  
 S: 16 bits, handshake flags  
 C: 32 bits, flags  
-- 
1.9.1

[Qemu-devel] Ballooning on TPS!=HPS hosts

2016-03-31 Thread Dr. David Alan Gilbert

Hi,
  I was reading the balloon code and am confused as to how/if ballooning
works on hosts where the host page size is larger than the
target page size.

static void balloon_page(void *addr, int deflate)
{
#if defined(__linux__)
if (!qemu_balloon_is_inhibited() && (!kvm_enabled() ||
 kvm_has_sync_mmu())) {
qemu_madvise(addr, TARGET_PAGE_SIZE,
deflate ? QEMU_MADV_WILLNEED : QEMU_MADV_DONTNEED);
}
#endif
}

The virtio-balloon code only does stuff through ballon_page,
and an madvise DONTNEED should fail if you try and do it on
a size smaller than the host page size.  So does ballooning work on
Power/ARM?

Am I misunderstanding this?

Of course looking at the above we won't actually generate an error since
we don't check the return of qemu_madvise.

We have three sizes:
a) host page size
b) target page size
c) VIRTIO_BALLOON_PFN_SHIFT

 c == 12 (4k) for everyone
 

1) I think the virtio-balloon code needs to coallesce adjecent requests
  and call balloon_page on whole chunks at once passing a length.
2) why does balloon_page use TARGET_PAGE_SIZE, ignoring anything else
   shouldn't it be 1 << VIRTIO_BALLOON_PFN_SHIFT ?
3) I'm guessing the guest kernel doesn't know the host page size, so
   how can it know what size chunks of balloon to work in?

Dave

--
Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK

Re: [Qemu-devel] tcg/ppc: maintainer's mail delivery failure

2016-03-31 Thread Richard Henderson

On 03/31/2016 10:30 AM, Peter Maydell wrote:
> On 24 March 2016 at 11:09, Peter Maydell  wrote:
>> On 24 March 2016 at 10:55, Sergey Fedorov  wrote:
>>> Probably we don't have a maintainer for PPC TCG backend.
>>>
>>> PPC
>>> M: Vassili Karpov (malc) 
>>> S: Maintained
>>> F: tcg/ppc/
>>> F: disas/ppc.c
>>
>> Yes, malc has not been with the project for some years now. We
>> should probably go through and tidy up the MAINTAINERS file.
> 
> Would anybody like to put their name forward for tcg/ppc maintainer
> (possibly at Odd Fixes status), or should we mark this as Orphaned ?
> 
> I have cc'd some plausible suspects. If nobody cares to volunteer
> within a few weeks I'll send out a patch marking it as Orphan.

I'll do odd fixes.  Thankfully the gcc compile farm makes this easy.


r~

Re: [Qemu-devel] tcg/ppc: maintainer's mail delivery failure

2016-03-31 Thread Peter Maydell

On 24 March 2016 at 11:09, Peter Maydell  wrote:
> On 24 March 2016 at 10:55, Sergey Fedorov  wrote:
>> Probably we don't have a maintainer for PPC TCG backend.
>>
>> PPC
>> M: Vassili Karpov (malc) 
>> S: Maintained
>> F: tcg/ppc/
>> F: disas/ppc.c
>
> Yes, malc has not been with the project for some years now. We
> should probably go through and tidy up the MAINTAINERS file.

Would anybody like to put their name forward for tcg/ppc maintainer
(possibly at Odd Fixes status), or should we mark this as Orphaned ?

I have cc'd some plausible suspects. If nobody cares to volunteer
within a few weeks I'll send out a patch marking it as Orphan.

thanks
-- PMM

Re: [Qemu-devel] [PATCH] Xen: update MAINTAINERS info

2016-03-31 Thread Peter Maydell

On 29 March 2016 at 11:02, Stefano Stabellini
 wrote:
> Add Anthony Perard as Xen co-maintainer.
> Update my email address.
>
> Signed-off-by: Stefano Stabellini 
> Acked-by: Anthony Perard 
>
> diff --git a/MAINTAINERS b/MAINTAINERS
> index afbe845..66abde8 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -278,7 +278,8 @@ Guest CPU Cores (Xen):
>  --
>
>  X86
> -M: Stefano Stabellini 
> +M: Stefano Stabellini 
> +M: Anthony Perard 
>  L: xen-de...@lists.xensource.com
>  S: Supported
>  F: xen-*

Applied to master, thanks.

-- PMM

Re: [Qemu-devel] [PATCH] MAINTAINERS: Delete invalid maintainer entries of the Exynos section

2016-03-31 Thread Peter Maydell

On 30 March 2016 at 13:32, Thomas Huth  wrote:
> Mails to these e-mail addresses are rejected by the mail server
> of Samsung with "User unknown" messages, so it seems like these
> Exynos maintainers are no longer available.
>
> Signed-off-by: Thomas Huth 
> ---
>  MAINTAINERS | 3 ---
>  1 file changed, 3 deletions(-)
>
> diff --git a/MAINTAINERS b/MAINTAINERS
> index afbe845..12eb3fd 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -357,10 +357,7 @@ F: include/hw/timer/a9gtimer.h
>  F: include/hw/timer/arm_mptimer.h
>
>  Exynos
> -M: Evgeny Voevodin 
> -M: Maksim Kozlov 
>  M: Igor Mitsyanko 
> -M: Dmitry Solodkiy 
>  L: qemu-...@nongnu.org
>  S: Maintained
>  F: hw/*/exynos*
> --
> 1.8.3.1

Applied to master, thanks.

-- PMM

Re: [Qemu-devel] [FreeBSD] Host build i386 failing to build aarch64 targets

2016-03-31 Thread Sean Bruno

-BEGIN PGP SIGNED MESSAGE-
Hash: SHA512

This seems to at least allow things to compile, qemu-system seems to
be able to boot images on i386, but I may have missed something:

diff --git a/cpus.c b/cpus.c
index 23cf7aa..11f8bab 100644
- --- a/cpus.c
+++ b/cpus.c
@@ -338,15 +338,8 @@ static int64_t qemu_icount_round(int64_t count)

 static void icount_warp_rt(void)
 {
- -/* The icount_warp_timer is rescheduled soon after
vm_clock_warp_start
- - * changes from -1 to another value, so the race here is okay.
- - */
- -if (atomic_read(_clock_warp_start) == -1) {
- -return;
- -}
- -
 seqlock_write_lock(_state.vm_clock_seqlock);
- -if (runstate_is_running()) {
+if ((vm_clock_warp_start != -1) && runstate_is_running()) {
 int64_t clock = REPLAY_CLOCK(REPLAY_CLOCK_VIRTUAL_RT,
  cpu_get_clock_locked());
 int64_t warp_delta;
-BEGIN PGP SIGNATURE-
Version: GnuPG v2

iQF8BAEBCgBmBQJW/VbrXxSAAC4AKGlzc3Vlci1mcHJAbm90YXRpb25zLm9w
ZW5wZ3AuZmlmdGhob3JzZW1hbi5uZXRCQUFENDYzMkU3MTIxREU4RDIwOTk3REQx
MjAxRUZDQTFFNzI3RTY0AAoJEBIB78oecn5k4PUH/0PP+LzCDZmHCUN0ZuQ6Z6SK
VWevvfg6d7mgqcobGD94E8V5AZYO9y4aj+WerCPULXq7I2qAthgqlccUaPMN4KKz
+cJiBo02pzOXJmbPLiFetsQj87+nUzUpcJ2yVY0eiFcO2lbNQV20UZdYDuacQ0ds
bjyhq49dC51wsn6ET1FMY6BH8vmdicWe4SMDEgSw32pFvZmIlGKFwuP3w+2dLpYg
H4aajemN2PrHggcsRI8Gxu2G6yPbuWuGB6eaqS/OQ+6NxUmXz2lqOgzMfBAZjlzQ
5kVp8XRcU42koS2hLPmmxyxZ98W54oM8LtrUYdoecTabvJG3cwaApw7OITWZJPM=
=Hpp4
-END PGP SIGNATURE-

Re: [Qemu-devel] [Qemu-stable] [PATCH] target-i386: do not read/write MSR_TSC_AUX from KVM if CPUID bit is not set

2016-03-31 Thread Paolo Bonzini



On 31/03/2016 17:38, Peter Lieven wrote:
> 
> 
>> Am 31.03.2016 um 15:23 schrieb Paolo Bonzini :
>>
>>
>>
>>> On 31/03/2016 15:12, Peter Lieven wrote:
>
> KVM does not let you read or write this MSR if the corresponding CPUID
> bit is not set.  This in turn causes MSRs that come after MSR_TSC_AUX
> to be ignored by KVM_SET_MRSS.
>>>
>>> Is it possible that this causes a freeze when migrating vom qemu 2.2.0
>>> to 2.5.1?
>>
>> I wouldn't exclude it if the CPU model is Westmere or earlier.
> 
> the Host CPU or the Emulated CPU?

The guest.

Paolo

> We emulate Westmere and have also some old blades with Westmere CPUs.
> 
> anyway, i give this a try.
> 
> Michael, this should have gone into 2.5.1 i think...
> 
> Thanks,
> Peter
> 
>

Re: [Qemu-devel] [PATCH] target-i386: assert that KVM_GET/SET_MSRS can set all requested MSRs

2016-03-31 Thread Eduardo Habkost

On Thu, Mar 31, 2016 at 03:01:29PM +0200, Laszlo Ersek wrote:
> On 03/30/16 22:59, Paolo Bonzini wrote:
> > This would have caught the bug in the previous patch.
> 
> Should this patch share a series with
> ? Otherwise
> they could be separated by other patches in the commit history, and then
> "previous patch" would be misleading.
> 
> (Alternatively, the reference to "previous patch" could be made by subject.)
> 
> > Signed-off-by: Paolo Bonzini 
> > ---
> >  target-i386/kvm.c | 34 ++
> >  1 file changed, 30 insertions(+), 4 deletions(-)
> > 
> > diff --git a/target-i386/kvm.c b/target-i386/kvm.c
> > index 19e2d94..799fdfa 100644
> > --- a/target-i386/kvm.c
> > +++ b/target-i386/kvm.c
> > @@ -141,6 +141,7 @@ static int kvm_get_tsc(CPUState *cs)
> >  return ret;
> >  }
> >  
> > +assert(ret == 1);
> >  env->tsc = msr_data.entries[0].data;
> >  return 0;
> >  }
> > @@ -1446,6 +1447,7 @@ static int kvm_put_tscdeadline_msr(X86CPU *cpu)
> >  struct kvm_msr_entry entries[1];
> >  } msr_data;
> >  struct kvm_msr_entry *msrs = msr_data.entries;
> > +int ret;
> >  
> >  if (!has_msr_tsc_deadline) {
> >  return 0;
> > @@ -1457,7 +1459,13 @@ static int kvm_put_tscdeadline_msr(X86CPU *cpu)
> >  .nmsrs = 1,
> >  };
> >  
> > -return kvm_vcpu_ioctl(CPU(cpu), KVM_SET_MSRS, _data);
> > +ret = kvm_vcpu_ioctl(CPU(cpu), KVM_SET_MSRS, _data);
> > +if (ret < 0) {
> > +return ret;
> > +}
> > +
> > +assert(ret == 1);
> > +return 0;
> >  }
> 
> This changes the return value of kvm_put_tscdeadline_msr() -- and
> friends below -- for successful invocations. I guess that's fine, but a
> note about it in the commit message would be nice.

All these functions have only one caller each, that only checks
if ret < 0.

(As they are all static functions with a single caller in
target-i386/kvm.c, I don't mind if this is not mentioned in the
commit message.)

> 
> Anyway, I'm not an "expert" in this area, so the best I can offer for
> this two-part (almost-) series, with the commit message nits fixed, is
> 
> Acked-by: Laszlo Ersek 
> 
> Thanks
> Laszlo
> 
> >  
> >  /*
> > @@ -1472,6 +1480,11 @@ static int kvm_put_msr_feature_control(X86CPU *cpu)
> >  struct kvm_msrs info;
> >  struct kvm_msr_entry entry;
> >  } msr_data;
> > +int ret;
> > +
> > +if (!has_msr_feature_control) {
> > +return 0;
> > +}

This is not strictly needed to implement what's described in the
commit message, but it makes kvm_put_msr_feature_control() safer
and harder to break.

Reviewed-by: Eduardo Habkost 

> >  
> >  kvm_msr_entry_set(_data.entry, MSR_IA32_FEATURE_CONTROL,
> >cpu->env.msr_ia32_feature_control);
> > @@ -1480,7 +1493,13 @@ static int kvm_put_msr_feature_control(X86CPU *cpu)
> >  .nmsrs = 1,
> >  };
> >  
> > -return kvm_vcpu_ioctl(CPU(cpu), KVM_SET_MSRS, _data);
> > +ret = kvm_vcpu_ioctl(CPU(cpu), KVM_SET_MSRS, _data);
> > +if (ret < 0) {
> > +return ret;
> > +}
> > +
> > +assert(ret == 1);
> > +return 0;
> >  }
> >  
> >  static int kvm_put_msrs(X86CPU *cpu, int level)
> > @@ -1492,6 +1511,7 @@ static int kvm_put_msrs(X86CPU *cpu, int level)
> >  } msr_data;
> >  struct kvm_msr_entry *msrs = msr_data.entries;
> >  int n = 0, i;
> > +int ret;
> >  
> >  kvm_msr_entry_set([n++], MSR_IA32_SYSENTER_CS, env->sysenter_cs);
> >  kvm_msr_entry_set([n++], MSR_IA32_SYSENTER_ESP, 
> > env->sysenter_esp);
> > @@ -1685,8 +1705,13 @@ static int kvm_put_msrs(X86CPU *cpu, int level)
> >  .nmsrs = n,
> >  };
> >  
> > -return kvm_vcpu_ioctl(CPU(cpu), KVM_SET_MSRS, _data);
> > +ret = kvm_vcpu_ioctl(CPU(cpu), KVM_SET_MSRS, _data);
> > +if (ret < 0) {
> > +return ret;
> > +}
> >  
> > +assert(ret == n);
> > +return 0;
> >  }
> >  
> >  
> > @@ -2055,6 +2080,7 @@ static int kvm_get_msrs(X86CPU *cpu)
> >  return ret;
> >  }
> >  
> > +assert(ret == n);
> >  for (i = 0; i < ret; i++) {
> >  uint32_t index = msrs[i].index;
> >  switch (index) {
> > @@ -2511,7 +2537,7 @@ int kvm_arch_put_registers(CPUState *cpu, int level)
> >  
> >  assert(cpu_is_stopped(cpu) || qemu_cpu_is_self(cpu));
> >  
> > -if (level >= KVM_PUT_RESET_STATE && has_msr_feature_control) {
> > +if (level >= KVM_PUT_RESET_STATE) {
> >  ret = kvm_put_msr_feature_control(x86_cpu);
> >  if (ret < 0) {
> >  return ret;
> > 
> 

-- 
Eduardo

Re: [Qemu-devel] [PATCH v2] migration: skip sending ram pages released by virtio-balloon driver.

2016-03-31 Thread Dr. David Alan Gilbert

* Jitendra Kolhe (jitendra.ko...@hpe.com) wrote:
> While measuring live migration performance for qemu/kvm guest, it
> was observed that the qemu doesn’t maintain any intelligence for the
> guest ram pages which are released by the guest balloon driver and
> treat such pages as any other normal guest ram pages. This has direct
> impact on overall migration time for the guest which has released
> (ballooned out) memory to the host.

Hi Jitendra,
  I've read over the patch and I've got a mix of comments; I've not read
it in full detail:

   a) It does need splitting up; it's a bit big to review in one go;
  I suggest you split it into the main code, and separately the bitmap
  save/load.  It might be worth splitting it up even more.

   b) in balloon_bitmap_load check the next and len fields; since it's read
  over the wire we've got to treat them as hostile; so check they don't
  run over the length of the bitmap.

   c) The bitmap_load/save needs to be tied to the machine type or something
  that means that if you were migraitng in a stream from an older qemu
  it wouldn't get upset when it tried to read the extra data you read.
  I prefer if it's tied to either the config setting or the new machine
  type (that way backwards migration works as well).

   d) I agree with the other comments tha the stuff looking up the ram blocks
  addressing looks wrong;  you use last_ram_offset() to size the bitmap,
  so it makes me think it's the whole of the ram_addr_t; but I think you're
  saying you're not interested in all of it.   However remember that
  the order of ram_addr_t is not stable between two qemu's - even something
  like hotplugging a ethercard in one qemu vs having it on the command line 
on
  the other can change that order; so anything going over the wire has
  to be block+offset-into-block. Also remember that last_ram_offset() 
includes
  annoying things like firmware RAM, video memory and all those things;

   e) It should be possible to get it to work for postcopy if you just use
  it as a way to speed up the zero detection but still send the zero page
  messages.

Dave

> 
> In case of large systems, where we can configure large guests with 1TB
> and with considerable amount of memory release by balloon driver to the,
> host the migration time gets worse.
> 
> The solution proposed below is local only to qemu (and does not require
> any modification to Linux kernel or any guest driver). We have verified
> the fix for large guests =1TB on HPE Superdome X (which can support up
> to 240 cores and 12TB of memory) and in case where 90% of memory is
> released by balloon driver the migration time for an idle guests reduces
> to ~600 sec's from ~1200 sec’s.
> 
> Detail: During live migration, as part of 1st iteration in ram_save_iterate()
> -> ram_find_and_save_block () will try to migrate ram pages which are
> released by vitrio-balloon driver as part of dynamic memory delete.
> Even though the pages which are returned to the host by virtio-balloon
> driver are zero pages, the migration algorithm will still end up
> scanning the entire page ram_find_and_save_block() -> ram_save_page/
> ram_save_compressed_page -> save_zero_page() -> is_zero_range().  We
> also end-up sending some control information over network for these
> page during migration. This adds to total migration time.
> 
> The proposed fix, uses the existing bitmap infrastructure to create
> a virtio-balloon bitmap. The bits in the bitmap represent a guest ram
> page of size 1UL<< VIRTIO_BALLOON_PFN_SHIFT. The bitmap represents
> entire guest ram memory till max configured memory. Guest ram pages
> claimed by the virtio-balloon driver will be represented by 1 in the
> bitmap. During live migration, each guest ram page (host VA offset)
> is checked against the virtio-balloon bitmap, if the bit is set the
> corresponding ram page will be excluded from scanning and sending
> control information during migration. The bitmap is also migrated to
> the target as part of every ram_save_iterate loop and after the
> guest is stopped remaining balloon bitmap is migrated as part of
> balloon driver save / load interface.
> 
> With the proposed fix, the average migration time for an idle guest
> with 1TB maximum memory and 64vCpus
>  - reduces from ~1200 secs to ~600 sec, with guest memory ballooned
>down to 128GB (~10% of 1TB).
>  - reduces from ~1300 to ~1200 sec (7%), with guest memory ballooned
>down to 896GB (~90% of 1TB),
>  - with no ballooning configured, we don’t expect to see any impact
>on total migration time.
> 
> The optimization gets temporarily disabled, if the balloon operation is
> in progress. Since the optimization skips scanning and migrating control
> information for ballooned out pages, we might skip guest ram pages in
> cases where the guest balloon driver has freed the ram page to the guest
> but not yet informed the host/qemu about the ram

Re: [Qemu-devel] [PATCH 2/2] target-mips: Implement IEEE 754-2008 functionality for R6 and MSA instructions

2016-03-31 Thread Richard Henderson

On 03/31/2016 04:55 AM, Aleksandar Markovic wrote:
> Hi, Richard, what would you think about this approach:
> 
> Functionality of . and ..
> instructions is dependent on flags ABS2008 and NAN2008 in FCR31. There are
> MIPS architectures (for example mips32r5) that allow implementations
> with different values of these flags. So, in order to detect the desired
> behavior in translate-time, insn_flags field can't be used - and, therefore,
> it makes sense to add two new members to the MIPS's DisasContext:
> 
> typedef struct DisasContext {
> . . .
> bool nan2008;
> bool abs2008;
> } DisasContext;
> 
> Their initialization could be in gen_intermediate_code_internal():
> 
> ctx.nan2008 = (env->active_fpu.fcr31 >> FCR31_NAN2008) & 1;
> ctx.abs2008 = (env->active_fpu.fcr31 >> FCR31_ABS2008) & 1;
> 
> Now, ABS.D (and all .) handling might look like this:
> 
> case OPC_ABS_D:
> check_cp1_registers(ctx, fs | fd);
> {
> TCGv_i64 fp0 = tcg_temp_new_i64();
> 
> gen_load_fpr64(ctx, fp0, fs);
> if (ctx->abs2008) {
> tcg_gen_andi_i64(fp0, fp0, 0x7fffULL);
> } else {
> gen_helper_float_abs_d(fp0, fp0);
> }
> gen_store_fpr64(ctx, fp0, fd);
> tcg_temp_free_i64(fp0);
> }
> opn = "abs.d";
> break;
> 
> Here, 2008-style ABS.D is implemented inline, without a helper, and
> gen_helper_float_abs_d() is an old pre-2008 helper that would be intact
> (the same as it is currently) with this change.

Yes, that's exactly what I had in mind.

> On the other hand, CVT.L.D (and all ..)
> handling would take this form:
> 
> case OPC_CVT_L_D:
> check_cp1_64bitmode(ctx);
> {
> TCGv_i64 fp0 = tcg_temp_new_i64();
> 
> gen_load_fpr64(ctx, fp0, fs);
> if (ctx->nan2008) {
> gen_helper_float_cvt_2008_l_d(fp0, cpu_env, fp0);
> } else {
> gen_helper_float_cvt_l_d(fp0, cpu_env, fp0);
> }
> gen_store_fpr64(ctx, fp0, fd);
> tcg_temp_free_i64(fp0);
> }
> opn = "cvt.l.d";
> break;
> 
> Function helper_float_cvt_2008_l_d() is a new, only-2008-style helper for
> CVT.L.D and would look like this:
> 
> uint64_t helper_float_cvt_2008_l_d(CPUMIPSState *env, uint64_t fdt0)
> {
> uint64_t dt2;
> 
> dt2 = float64_to_int64(fdt0, >active_fpu.fp_status);
> if (get_float_exception_flags(>active_fpu.fp_status)
> & (float_flag_invalid | float_flag_overflow)) {
> dt2 = DBL_TO_INT64_OVERFLOW(fdt0)
> }
> update_fcr31(env, GETPC());
> return dt2;
> }

That looks fine as well.



r~

Re: [Qemu-devel] [PATCH] target-i386: do not read/write MSR_TSC_AUX from KVM if CPUID bit is not set

2016-03-31 Thread Eduardo Habkost

On Wed, Mar 30, 2016 at 10:59:42PM +0200, Paolo Bonzini wrote:
> KVM does not let you read or write this MSR if the corresponding CPUID
> bit is not set.  This in turn causes MSRs that come after MSR_TSC_AUX
> to be ignored by KVM_SET_MRSS.
> 
> One visible symptom is that s3.flat from kvm-unit-tests fails with
> CPUs that do not have RDTSCP, because the SMBASE is not reset to
> 0x3 after reset.
> 
> Fixes: c9b8f6b6210847b4381c5b2ee172b1c7eb9985d6
> Cc: qemu-sta...@nongnu.org
> Signed-off-by: Paolo Bonzini 

Reviewed-by: Eduardo Habkost 

-- 
Eduardo

Re: [Qemu-devel] tcg: reworking tb_invalidated_flag

2016-03-31 Thread Richard Henderson

On 03/31/2016 05:42 AM, Sergey Fedorov wrote:
> On 31/03/16 13:48, Alex Bennée wrote:
>> I know we are system focused at the moment but does linux-user ever
>> flush groups of TBs, say when mappings change? Or does this trigger a
>> whole tb_flush?
> 
> Yes, e.g. target_mmap() calls tb_invalidate_phys_range().

Mapping over executable code is exceedingly rare, however.
We could do just as well with a full flush, if that's simpler.


r~

Re: [Qemu-devel] Virtio-9p

2016-03-31 Thread Greg Kurz

On Wed, 30 Mar 2016 16:27:48 +0200
Pradeep Kiruvale  wrote:

> Hi Greg,
> 

Hi Pradeep,

> Thanks for the reply.
> 
> Let me put it this way, virtio-blk-pci is used for block IO on the devices
> shared between the guest and the host.

I don't really understand the "devices shared between the guest and the
host" wording... virtio-blk-pci exposes a virtio-blk device through PCI
to the guest. The virtio-blk device can be backed by a file or a block
device from the host.

> Here I want to share the file and have QoS between the guests. So I am
> using the Virtio-9p-pci.
> 

What file ?

> Basically I want to have QoS for virtio-9p-pci.
> 

Can you provide a more detailed scenario on the result you want to reach ?

> Regards,
> Pradeep
> 

Cheers.

--
Greg

> On 30 March 2016 at 16:13, Greg Kurz  wrote:
> 
> > On Wed, 30 Mar 2016 14:10:38 +0200
> > Pradeep Kiruvale  wrote:
> >
> > > Hi All,
> > >
> > > Is virtio-9p-pci device only supports the fsdev deices? I am trying to
> > use
> > > -drive option for applying QoS for block device using Virtio-9p-pci
> > device,
> > > but failing to create/add a device other than fsdev. Can you please help
> > me
> > > on this?
> > >
> > > Regards,
> > > Pradeep
> >
> > Hi Pradeep,
> >
> > Not sure to catch what you want to do but I confirm that virti-9p-pci only
> > supports
> > fsdev... if you want a block device, why don't you use virtio-blk-pci ?
> >
> > Cheers.
> >
> > --
> > Greg
> >
> >

Re: [Qemu-devel] Why is SeaBIOS used with -kernel?

2016-03-31 Thread Kevin O'Connor

On Thu, Mar 31, 2016 at 10:21:25AM +0100, Stefan Hajnoczi wrote:
> On Sat, Mar 19, 2016 at 08:31:24PM +, Richard W.M. Jones wrote:
> > Is there something I'm missing, or for Linux + -kernel could we use a
> > much simpler BIOS?
> 
> The data that Marc Mari collected when comparing qboot with an optimized
> SeaBIOS/QEMU showed that there's no need for a separate "lightweight
> firmware" codebase.
> 
> https://github.com/bonzini/qboot
> 
> It would create a maintenance burden and eventually we'd want many of
> the SeaBIOS features anyway.  It's better to optimize linuxboot.bin and
> SeaBIOS instead.

In the tests I've run, the time spent in SeaBIOS is dominated by
hardware delays.  (Or for qemu, the time needed for seabios to
communicate with virtual hardware and the time required for qemu to
implement the requests.)  As such, boot times can be most easily
improved by configuring the VM with less hardware, or configuring (via
SeaBIOS kconfig) less hardware drivers in SeaBIOS.

It's possible to do course grained profiling with SeaBIOS by timing
its debug messages - see:
http://www.seabios.org/Debugging#Timing_debug_messages

The debug messages themselves can consume time though (one can
eliminate debug messages using CONFIG_DEBUG_LEVEL=0).  I use the
following to profile while also accounting for the debug message delay
(on my system, each debug character takes ~2.5us):

scripts/readserial.py -f ../qemu-test/qemudebugpipe -t 2.5

I find a standard SeaBIOS build on my machine with KVM takes ~50ms to
start the OS (not including OS load time).  This breaks down roughly
to the following times:

6ms  - enabling shadow ram (qemu makes 0xc-0x10 read/writable)
4ms  - PCI initialization
2ms  - smm init
4ms  - load acpi tables from qemu
16ms - init and enable vga console
5ms  - load and run various option roms from qemu (eg, ipxe)
7ms  - locking shadow ram (qemu makes 0xc-0x10 readonly)
6ms  - other

There are several SeaBIOS kconfig options that would remove many of
the above delays (with the obvious caveat that the given hardware
would no longer be initialized by seabios).

> Kevin O'Connor had some SeaBIOS optimizations that improved boot time by
> skipping unnecessary probing and timer calibration IIRC.  I have CCed
> Marc and Kevin on this email.

There were a couple of optimizations in SeaBIOS (avoid TSC calibration
when not using the TSC, avoid a PS2 keyboard reset delay) found last
year, but they were committed and released in SeaBIOS v1.9.0.  They
should already be in QEMU.

Cheers,
-Kevin

Re: [Qemu-devel] [PATCH] net: fix OptsVisitor memory leak

2016-03-31 Thread Eric Blake

On 03/31/2016 08:28 AM, Paolo Bonzini wrote:
> Signed-off-by: Paolo Bonzini 

May want to add: regression introduced in commit 96a1616.

> ---
>  net/net.c | 1 +
>  1 file changed, 1 insertion(+)

Reviewed-by: Eric Blake 

> 
> diff --git a/net/net.c b/net/net.c
> index 594c3b8..3847a13 100644
> --- a/net/net.c
> +++ b/net/net.c
> @@ -1100,6 +1100,7 @@ int net_client_init(QemuOpts *opts, int is_netdev, 
> Error **errp)
>  }
>  
>  error_propagate(errp, err);
> +opts_visitor_cleanup(ov);
>  return ret;
>  }
>  
> 

-- 
Eric Blake   eblake redhat com+1-919-301-3266
Libvirt virtualization library http://libvirt.org



signature.asc
Description: OpenPGP digital signature

Re: [Qemu-devel] [PATCH v1] net: fix missing include of qapi/error.h in netmap.c

2016-03-31 Thread Eric Blake

On 03/31/2016 07:08 AM, Daniel P. Berrange wrote:
> The netmap.c file fails to build on FreeBSD with
> 
> net/netmap.c:95:9: warning: implicit declaration of function 
> 'error_setg_errno' is invalid in C99 [-Wimplicit-function-declaration]
>  error_setg_errno(errp, errno, "Failed to nm_open() %s",
>  ^
> net/netmap.c:432:9: warning: implicit declaration of function 
> 'error_propagate' is invalid in C99 [-Wimplicit-function-declaration]
>  error_propagate(errp, err);
>  ^
> 
> Signed-off-by: Daniel P. Berrange 
> ---
>  net/netmap.c | 1 +
>  1 file changed, 1 insertion(+)
> 

Reviewed-by: Eric Blake 

-- 
Eric Blake   eblake redhat com+1-919-301-3266
Libvirt virtualization library http://libvirt.org



signature.asc
Description: OpenPGP digital signature

Re: [Qemu-devel] [patch v5 11/12] vfio: device may stuck in D3 when doing aer recovery

2016-03-31 Thread Alex Williamson

On Thu, 31 Mar 2016 14:55:07 +0800
Chen Fan  wrote:

> On 03/25/2016 10:22 AM, Alex Williamson wrote:
> > On Fri, 25 Mar 2016 09:38:09 +0800
> > Chen Fan  wrote:
> >  
> >> On 03/25/2016 06:54 AM, Alex Williamson wrote:  
> >>> On Wed, 23 Mar 2016 18:12:06 +0800
> >>> Cao jin  wrote:
> >>> 
>  From: Chen Fan 
> 
>  when a physical device aer occurred, the device state probably
>  is not in D0 in a short time, if we recover the device quickly.
>  we may stuck in D3 state when force to change device state to D0.
>  we may need to wait for a short time to inject the error to guest.
> 
>  Signed-off-by: Chen Fan 
>  ---
> hw/vfio/pci.c | 3 +++
> 1 file changed, 3 insertions(+)
> 
>  diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
>  index 25fc095..5216e7f 100644
>  --- a/hw/vfio/pci.c
>  +++ b/hw/vfio/pci.c
>  @@ -2658,6 +2658,9 @@ static void vfio_err_notifier_handler(void *opaque)
> msg.severity = isfatal ? PCI_ERR_ROOT_CMD_FATAL_EN :
>  PCI_ERR_ROOT_CMD_NONFATAL_EN;
> 
>  +/* wait a bit to ensure aer device is ready */
>  +usleep(2 * 1000);  
> >>> Where does this number come from?  Why would the device be in D3?  I
> >>> don't understand this at all.  
> >> Hi Alex,
> >>
> >>   when I tested the code in my environment, I found that when I used
> >> the aer-inject module to inject a fake aer error to device on host, the 
> >> qemu
> >> would throw out the message "vfio: Unable to power on device, stuck in D3"
> >> on and off. if I use "gdb" to debug the vfio_pci_pre_reset, the phenomenon
> >> would not appearance, I just thought it should be some timing race issue,
> >> so I use a sleep() to wait 2ms (double the reset time of 1ms) to ensure the
> >> device state is ready. maybe the root reason still need to be
> >> investigated deeply.  
> > Yes, it sounds like you need to investigate this further, the delay is
> > arbitrary and perhaps suggests a race that needs to be fixed
> > correctly.  Thanks,  
> Hi Alex,
> 
>  after done some investigation of the problem, I found that only 
> when the injected
>   error is fatal, the problem will appear. because in aer do_recovery, 
> host will call reset_link
> on the root port, which would invoke pci_reset_bridge_secondary_bus in 
> aer_root_reset,
> that would reset the bridge and all the device under that. so when qemu 
> receive the aer
> notification, then propagate the error to guest, guest does the same way 
> to perform the
> recovery, if the guest `reset_link` that will call the vfio_pre_reset 
> done at the stage of host
> bridge reset, the device status would probable stick in D3.
> 
> so I think after qemu receive the aer notification, we should wait for 
> enough time to
> ensure the bridge has been reset completely. I just use sleep <=10ms to 
> test the code,
> seems still appear the message "vfio: Unable to power on device, stuck 
> in D3". so I think
> we should sleep 100ms to ensure the delay sufficient. I have tested that 
> code 100+ times
> by inject aer error. the issue no longer appears.

I'm not satisfied with this.  pci_reset_bridge_secondary_bus() is
invoked by both the host AER code and the guest AER code, the latter
via the vfio PCI hot reset interface.  The
pci_reset_bridge_secondary_bus() function includes the spec defined
delay by which point all the devices should be operational again.  The
spec also defines that devices are in D0 after reset, which implies
that the only reason we would ever be seeing a device in D3 is if we're
reading the device while it is still in reset or before it has
recovered from reset.  That implies that either
pci_reset_bridge_secondary_bus() is not waiting long enough or QEMU is
allowing one device to call vfio_pre_reset while another device is
still in reset.  I suspect QEMU serializes reset such that the latter
case is not possible, which means that you might have a device that
takes longer to reset than the spec defines.  Such a quirk should be
handled in the host kernel reset, not by adding arbitrary delays in
userspace code.  Thanks,

Alex

Re: [Qemu-devel] [Qemu-stable] [PATCH] target-i386: do not read/write MSR_TSC_AUX from KVM if CPUID bit is not set

2016-03-31 Thread Peter Lieven



> Am 31.03.2016 um 15:23 schrieb Paolo Bonzini :
> 
> 
> 
>> On 31/03/2016 15:12, Peter Lieven wrote:
 
 KVM does not let you read or write this MSR if the corresponding CPUID
 bit is not set.  This in turn causes MSRs that come after MSR_TSC_AUX
 to be ignored by KVM_SET_MRSS.
>> 
>> Is it possible that this causes a freeze when migrating vom qemu 2.2.0
>> to 2.5.1?
> 
> I wouldn't exclude it if the CPU model is Westmere or earlier.

the Host CPU or the Emulated CPU?

We emulate Westmere and have also some old blades with Westmere CPUs.

anyway, i give this a try.

Michael, this should have gone into 2.5.1 i think...

Thanks,
Peter

[Qemu-devel] [PATCH v2 0/2] Fixing non-blocking operation of chardevs

2016-03-31 Thread Daniel P. Berrange

This fixes socket chardevs to always be in non-blocking
mode, as they were before the QIOChannel conversion. The
second patch was already posted before, but dropped when
Peter discovered a problem on OS-X causing ahci-test to
hang:

  https://lists.gnu.org/archive/html/qemu-devel/2016-03/msg05807.html

I traced this down to broken EAGAIN handling affecting
OS-X, hence the first patch in this series.

Changed in v2:

 - Also fix qemu_chr_fe_write_log method

Daniel P. Berrange (2):
  char: fix broken EAGAIN retry on OS-X due to errno clobbering
  char: ensure all clients are in non-blocking mode

 qemu-char.c | 39 ---
 1 file changed, 20 insertions(+), 19 deletions(-)

-- 
2.5.5

[Qemu-devel] [PATCH v2 1/2] char: fix broken EAGAIN retry on OS-X due to errno clobbering

2016-03-31 Thread Daniel P. Berrange

Some of the chardev I/O paths really want to write the
complete data buffer even though the channel is in
non-blocking mode. To achieve this they look for EAGAIN
and g_usleep() for 100ms. Unfortunately the code is set
to check errno == EAGAIN a second time, after the g_usleep()
call has completed. On OS-X at least, g_usleep clobbers
errno to ETIMEDOUT, causing the retry to be skipped.

This failure to retry means the full data isn't written
to the chardev backend, which causes various failures
including making the tests/ahci-test qtest hang.

Rather than playing games trying to reset errno just
simplify the code to use a goto to retry instead of a
a loop.

Signed-off-by: Daniel P. Berrange 
---
 qemu-char.c | 36 ++--
 1 file changed, 18 insertions(+), 18 deletions(-)

diff --git a/qemu-char.c b/qemu-char.c
index 270819a..93fd733 100644
--- a/qemu-char.c
+++ b/qemu-char.c
@@ -225,12 +225,12 @@ static void qemu_chr_fe_write_log(CharDriverState *s,
 }
 
 while (done < len) {
-do {
-ret = write(s->logfd, buf + done, len - done);
-if (ret == -1 && errno == EAGAIN) {
-g_usleep(100);
-}
-} while (ret == -1 && errno == EAGAIN);
+retry:
+ret = write(s->logfd, buf + done, len - done);
+if (ret == -1 && errno == EAGAIN) {
+g_usleep(100);
+goto retry;
+}
 
 if (ret <= 0) {
 return;
@@ -246,12 +246,12 @@ static int qemu_chr_fe_write_buffer(CharDriverState *s, 
const uint8_t *buf, int
 
 qemu_mutex_lock(>chr_write_lock);
 while (*offset < len) {
-do {
-res = s->chr_write(s, buf + *offset, len - *offset);
-if (res == -1 && errno == EAGAIN) {
-g_usleep(100);
-}
-} while (res == -1 && errno == EAGAIN);
+retry:
+res = s->chr_write(s, buf + *offset, len - *offset);
+if (res < 0 && errno == EAGAIN) {
+g_usleep(100);
+goto retry;
+}
 
 if (res <= 0) {
 break;
@@ -333,12 +333,12 @@ int qemu_chr_fe_read_all(CharDriverState *s, uint8_t 
*buf, int len)
 }
 
 while (offset < len) {
-do {
-res = s->chr_sync_read(s, buf + offset, len - offset);
-if (res == -1 && errno == EAGAIN) {
-g_usleep(100);
-}
-} while (res == -1 && errno == EAGAIN);
+retry:
+res = s->chr_sync_read(s, buf + offset, len - offset);
+if (res == -1 && errno == EAGAIN) {
+g_usleep(100);
+goto retry;
+}
 
 if (res == 0) {
 break;
-- 
2.5.5

[Qemu-devel] [PATCH v2 2/2] char: ensure all clients are in non-blocking mode

2016-03-31 Thread Daniel P. Berrange

Only some callers of tcp_chr_new_client are putting the
socket client into non-blocking mode. Move the call to
qio_channel_set_blocking() into the tcp_chr_new_client
method to guarantee that all code paths set non-blocking
mode

Reported-by: Andrew Baumann 
Reported-by: Laurent Vivier 
Signed-off-by: Daniel P. Berrange 
---
 qemu-char.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/qemu-char.c b/qemu-char.c
index 93fd733..b597ee1 100644
--- a/qemu-char.c
+++ b/qemu-char.c
@@ -3081,6 +3081,8 @@ static int tcp_chr_new_client(CharDriverState *chr, 
QIOChannelSocket *sioc)
 s->sioc = sioc;
 object_ref(OBJECT(sioc));
 
+qio_channel_set_blocking(s->ioc, false, NULL);
+
 if (s->do_nodelay) {
 qio_channel_set_delay(s->ioc, false);
 }
@@ -3112,7 +3114,6 @@ static int tcp_chr_add_client(CharDriverState *chr, int 
fd)
 if (!sioc) {
 return -1;
 }
-qio_channel_set_blocking(QIO_CHANNEL(sioc), false, NULL);
 ret = tcp_chr_new_client(chr, sioc);
 object_unref(OBJECT(sioc));
 return ret;
-- 
2.5.5

[Qemu-devel] [PATCH 2/4] target-arm: Remove incorrect ALIAS tags from ESR_EL2 and ESR_EL3

2016-03-31 Thread Peter Maydell

The regdefs for the ESR_EL2 and ESR_EL3 system registers should not
be marked as ARM_CP_ALIAS, because these are the master copies; the
DFSR regdef in vmsa_pmsa_cp_reginfo[] is marked as an alias.
Remove the ALIAS tags so that these registers are correctly migrated.

Signed-off-by: Peter Maydell 
---
 target-arm/helper.c | 2 --
 1 file changed, 2 deletions(-)

diff --git a/target-arm/helper.c b/target-arm/helper.c
index e583e6a..0e54d90 100644
--- a/target-arm/helper.c
+++ b/target-arm/helper.c
@@ -3509,7 +3509,6 @@ static const ARMCPRegInfo el2_cp_reginfo[] = {
   .access = PL2_RW,
   .fieldoffset = offsetof(CPUARMState, elr_el[2]) },
 { .name = "ESR_EL2", .state = ARM_CP_STATE_AA64,
-  .type = ARM_CP_ALIAS,
   .opc0 = 3, .opc1 = 4, .crn = 5, .crm = 2, .opc2 = 0,
   .access = PL2_RW, .fieldoffset = offsetof(CPUARMState, cp15.esr_el[2]) },
 { .name = "FAR_EL2", .state = ARM_CP_STATE_AA64,
@@ -3759,7 +3758,6 @@ static const ARMCPRegInfo el3_cp_reginfo[] = {
   .access = PL3_RW,
   .fieldoffset = offsetof(CPUARMState, elr_el[3]) },
 { .name = "ESR_EL3", .state = ARM_CP_STATE_AA64,
-  .type = ARM_CP_ALIAS,
   .opc0 = 3, .opc1 = 6, .crn = 5, .crm = 2, .opc2 = 0,
   .access = PL3_RW, .fieldoffset = offsetof(CPUARMState, cp15.esr_el[3]) },
 { .name = "FAR_EL3", .state = ARM_CP_STATE_AA64,
-- 
1.9.1

Re: [Qemu-devel] [PATCH 1/4] target-arm: Correctly reset SCTLR_EL3 for 64-bit CPUs

2016-03-31 Thread Laurent Desnogues

On Thu, Mar 31, 2016 at 4:49 PM, Peter Maydell  wrote:
> The regdef for SCTRL_EL3 was incorrectly marked as being an
> ARM_CP_ALIAS, with the remark that this was because the 32-bit
> definition would take care of reset and migration. However the
> intention for banked registers as documented in the comment in
> add_cpreg_to_hashtable() is:
>
>  * 2) If ARMv8 is enabled then we can count on a 64-bit version
>  *taking care of the secure bank.  This requires that separate
>  *32 and 64-bit definitions are provided.
>
> and so it marks the 32-bit secure banked version as an alias.
> This results in the sctlr_s/sctlr_el[3] field never being reset
> or migrated for a 64-bit CPU with EL3 enabled.
>
> Fix this by removing the ARM_CP_ALIAS annotation from SCTLR_EL3.
> Since this means it now needs a real reset value, move the regdef
> into the same place that we define the 32-bit SCTLR.
>
> Signed-off-by: Peter Maydell 

Reviewed-by: Laurent Desnogues 

Thanks,

Laurent

> ---
>  target-arm/helper.c | 23 +--
>  1 file changed, 13 insertions(+), 10 deletions(-)
>
> diff --git a/target-arm/helper.c b/target-arm/helper.c
> index 19d5d52..e583e6a 100644
> --- a/target-arm/helper.c
> +++ b/target-arm/helper.c
> @@ -3744,11 +3744,6 @@ static const ARMCPRegInfo el3_cp_reginfo[] = {
>.access = PL1_RW, .accessfn = access_trap_aa32s_el1,
>.writefn = vbar_write, .resetvalue = 0,
>.fieldoffset = offsetof(CPUARMState, cp15.mvbar) },
> -{ .name = "SCTLR_EL3", .state = ARM_CP_STATE_AA64,
> -  .type = ARM_CP_ALIAS, /* reset handled by AArch32 view */
> -  .opc0 = 3, .opc1 = 6, .crn = 1, .crm = 0, .opc2 = 0,
> -  .access = PL3_RW, .raw_writefn = raw_write, .writefn = sctlr_write,
> -  .fieldoffset = offsetof(CPUARMState, cp15.sctlr_el[3]) },
>  { .name = "TTBR0_EL3", .state = ARM_CP_STATE_AA64,
>.opc0 = 3, .opc1 = 6, .crn = 2, .crm = 0, .opc2 = 0,
>.access = PL3_RW, .writefn = vmsa_ttbr_write, .resetvalue = 0,
> @@ -4641,12 +4636,20 @@ void register_cp_regs_for_features(ARMCPU *cpu)
>  }
>  if (arm_feature(env, ARM_FEATURE_EL3)) {
>  define_arm_cp_regs(cpu, el3_cp_reginfo);
> -ARMCPRegInfo rvbar = {
> -.name = "RVBAR_EL3", .state = ARM_CP_STATE_AA64,
> -.opc0 = 3, .opc1 = 6, .crn = 12, .crm = 0, .opc2 = 1,
> -.type = ARM_CP_CONST, .access = PL3_R, .resetvalue = cpu->rvbar
> +ARMCPRegInfo el3_regs[] = {
> +{ .name = "RVBAR_EL3", .state = ARM_CP_STATE_AA64,
> +  .opc0 = 3, .opc1 = 6, .crn = 12, .crm = 0, .opc2 = 1,
> +  .type = ARM_CP_CONST, .access = PL3_R, .resetvalue = 
> cpu->rvbar },
> +{ .name = "SCTLR_EL3", .state = ARM_CP_STATE_AA64,
> +  .opc0 = 3, .opc1 = 6, .crn = 1, .crm = 0, .opc2 = 0,
> +  .access = PL3_RW,
> +  .raw_writefn = raw_write, .writefn = sctlr_write,
> +  .fieldoffset = offsetof(CPUARMState, cp15.sctlr_el[3]),
> +  .resetvalue = cpu->reset_sctlr },
> +REGINFO_SENTINEL
>  };
> -define_one_arm_cp_reg(cpu, );
> +
> +define_arm_cp_regs(cpu, el3_regs);
>  }
>  /* The behaviour of NSACR is sufficiently various that we don't
>   * try to describe it in a single reginfo:
> --
> 1.9.1
>
>

[Qemu-devel] [PATCH for-2.6 0/4] various regdef fixes for EL2/EL3 regs

2016-03-31 Thread Peter Maydell

This patchset fixes a number of minor bugs in regdefs for some EL2
and EL3 registers. The most interesting one here is the first --
we weren't resetting SCTLR_EL3 correctly for 64-bit CPUs. The
rest are things I discovered by code inspection looking at other
registers:
 * we weren't migrating ESR_EL2 and ESR_EL3
 * we weren't migrating the (RES0) high 32 bits of VTCR_EL2
 * unneeded TLB flush on TCR_EL2 writes

I think these should go into 2.6 since they're bug fixes.

thanks
-- PMM


Peter Maydell (4):
  target-arm: Correctly reset SCTLR_EL3 for 64-bit CPUs
  target-arm: Remove incorrect ALIAS tags from ESR_EL2 and ESR_EL3
  target-arm: Make the 64-bit version of VTCR do the migration
  target-arm: Avoid unnecessary TLB flush on TCR_EL2 writes

 target-arm/helper.c | 37 ++---
 1 file changed, 22 insertions(+), 15 deletions(-)

-- 
1.9.1

[Qemu-devel] [PATCH 1/4] target-arm: Correctly reset SCTLR_EL3 for 64-bit CPUs

2016-03-31 Thread Peter Maydell

The regdef for SCTRL_EL3 was incorrectly marked as being an
ARM_CP_ALIAS, with the remark that this was because the 32-bit
definition would take care of reset and migration. However the
intention for banked registers as documented in the comment in
add_cpreg_to_hashtable() is:

 * 2) If ARMv8 is enabled then we can count on a 64-bit version
 *taking care of the secure bank.  This requires that separate
 *32 and 64-bit definitions are provided.

and so it marks the 32-bit secure banked version as an alias.
This results in the sctlr_s/sctlr_el[3] field never being reset
or migrated for a 64-bit CPU with EL3 enabled.

Fix this by removing the ARM_CP_ALIAS annotation from SCTLR_EL3.
Since this means it now needs a real reset value, move the regdef
into the same place that we define the 32-bit SCTLR.

Signed-off-by: Peter Maydell 
---
 target-arm/helper.c | 23 +--
 1 file changed, 13 insertions(+), 10 deletions(-)

diff --git a/target-arm/helper.c b/target-arm/helper.c
index 19d5d52..e583e6a 100644
--- a/target-arm/helper.c
+++ b/target-arm/helper.c
@@ -3744,11 +3744,6 @@ static const ARMCPRegInfo el3_cp_reginfo[] = {
   .access = PL1_RW, .accessfn = access_trap_aa32s_el1,
   .writefn = vbar_write, .resetvalue = 0,
   .fieldoffset = offsetof(CPUARMState, cp15.mvbar) },
-{ .name = "SCTLR_EL3", .state = ARM_CP_STATE_AA64,
-  .type = ARM_CP_ALIAS, /* reset handled by AArch32 view */
-  .opc0 = 3, .opc1 = 6, .crn = 1, .crm = 0, .opc2 = 0,
-  .access = PL3_RW, .raw_writefn = raw_write, .writefn = sctlr_write,
-  .fieldoffset = offsetof(CPUARMState, cp15.sctlr_el[3]) },
 { .name = "TTBR0_EL3", .state = ARM_CP_STATE_AA64,
   .opc0 = 3, .opc1 = 6, .crn = 2, .crm = 0, .opc2 = 0,
   .access = PL3_RW, .writefn = vmsa_ttbr_write, .resetvalue = 0,
@@ -4641,12 +4636,20 @@ void register_cp_regs_for_features(ARMCPU *cpu)
 }
 if (arm_feature(env, ARM_FEATURE_EL3)) {
 define_arm_cp_regs(cpu, el3_cp_reginfo);
-ARMCPRegInfo rvbar = {
-.name = "RVBAR_EL3", .state = ARM_CP_STATE_AA64,
-.opc0 = 3, .opc1 = 6, .crn = 12, .crm = 0, .opc2 = 1,
-.type = ARM_CP_CONST, .access = PL3_R, .resetvalue = cpu->rvbar
+ARMCPRegInfo el3_regs[] = {
+{ .name = "RVBAR_EL3", .state = ARM_CP_STATE_AA64,
+  .opc0 = 3, .opc1 = 6, .crn = 12, .crm = 0, .opc2 = 1,
+  .type = ARM_CP_CONST, .access = PL3_R, .resetvalue = cpu->rvbar 
},
+{ .name = "SCTLR_EL3", .state = ARM_CP_STATE_AA64,
+  .opc0 = 3, .opc1 = 6, .crn = 1, .crm = 0, .opc2 = 0,
+  .access = PL3_RW,
+  .raw_writefn = raw_write, .writefn = sctlr_write,
+  .fieldoffset = offsetof(CPUARMState, cp15.sctlr_el[3]),
+  .resetvalue = cpu->reset_sctlr },
+REGINFO_SENTINEL
 };
-define_one_arm_cp_reg(cpu, );
+
+define_arm_cp_regs(cpu, el3_regs);
 }
 /* The behaviour of NSACR is sufficiently various that we don't
  * try to describe it in a single reginfo:
-- 
1.9.1

[Qemu-devel] [PATCH 3/4] target-arm: Make the 64-bit version of VTCR do the migration

2016-03-31 Thread Peter Maydell

Move the ALIAS tag from VTCR_EL2 to VTCR so that we migrate the
64-bit version, as is usual. (This has no particular effect now
unless the guest wrote to the high RES0 bits of VTCR_EL2.)
Add a comment about why it's OK that we don't have the various
accessor functions that the EL1 TCR regdefs do.

Signed-off-by: Peter Maydell 
---
 target-arm/helper.c | 6 +-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/target-arm/helper.c b/target-arm/helper.c
index 0e54d90..09638b2 100644
--- a/target-arm/helper.c
+++ b/target-arm/helper.c
@@ -3564,11 +3564,15 @@ static const ARMCPRegInfo el2_cp_reginfo[] = {
   .fieldoffset = offsetof(CPUARMState, cp15.tcr_el[2]) },
 { .name = "VTCR", .state = ARM_CP_STATE_AA32,
   .cp = 15, .opc1 = 4, .crn = 2, .crm = 1, .opc2 = 2,
+  .type = ARM_CP_ALIAS,
   .access = PL2_RW, .accessfn = access_el3_aa32ns,
   .fieldoffset = offsetof(CPUARMState, cp15.vtcr_el2) },
 { .name = "VTCR_EL2", .state = ARM_CP_STATE_AA64,
   .opc0 = 3, .opc1 = 4, .crn = 2, .crm = 1, .opc2 = 2,
-  .access = PL2_RW, .type = ARM_CP_ALIAS,
+  .access = PL2_RW,
+  /* no .writefn needed as this can't cause an ASID change;
+   * no .raw_writefn or .resetfn needed as we never use mask/base_mask
+   */
   .fieldoffset = offsetof(CPUARMState, cp15.vtcr_el2) },
 { .name = "VTTBR", .state = ARM_CP_STATE_AA32,
   .cp = 15, .opc1 = 6, .crm = 2,
-- 
1.9.1

Re: [Qemu-devel] [PULL 0/5] ipv4-only and ipv6-only support

2016-03-31 Thread Samuel Thibault

Peter Maydell, on Thu 31 Mar 2016 15:44:13 +0100, wrote:
>  if (get_dns6_addr(>sin6_addr,
> >sin6_scope_id) < 0) {
>  ^
> In file included from 
> /home/petmay01/linaro/qemu-for-merges/slirp/slirp.h:99:0,
>  from /home/petmay01/linaro/qemu-for-merges/slirp/socket.c:10:
> /home/petmay01/linaro/qemu-for-merges/slirp/libslirp.h:10:5: note:
> expected ‘uint32_t *’ but argument is of type ‘u_long *’

Urgl, so Windows is again not posix-compliant here... I guess the least
ugly workaround is as below.

Samuel

diff --git a/slirp/socket.c b/slirp/socket.c
index 896c27e..3d3c72a 100644
--- a/slirp/socket.c
+++ b/slirp/socket.c
@@ -796,7 +796,10 @@ void sotranslate_out(struct socket *so, struct 
sockaddr_storage *addr)
 if (in6_equal_net(>so_faddr6, >vprefix_addr6,
 slirp->vprefix_len)) {
 if (in6_equal(>so_faddr6, >vnameserver_addr6)) {
-if (get_dns6_addr(>sin6_addr, >sin6_scope_id) < 0) 
{
+uint32_t scope_id;
+if (get_dns6_addr(>sin6_addr, _id) >= 0) {
+sin6->sin6_scope_id = scope_id;
+} else {
 sin6->sin6_addr = in6addr_loopback;
 }
 } else {

[Qemu-devel] [PATCH 4/4] target-arm: Avoid unnecessary TLB flush on TCR_EL2 writes

2016-03-31 Thread Peter Maydell

The TCR_EL2 regdef was incorrectly using the vmsa_tcr_el1_write
function for writes. Since TCR_EL2 doesn't have the A1 bit that
TCR_EL1 does, we don't need to do a tlb_flush() when it is written.
Remove the unnecessary .writefn and also the harmless but unneeded
.raw_writefn and .resetfn definitions.

Signed-off-by: Peter Maydell 
---
 target-arm/helper.c | 6 --
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/target-arm/helper.c b/target-arm/helper.c
index 09638b2..4dbd844 100644
--- a/target-arm/helper.c
+++ b/target-arm/helper.c
@@ -3559,8 +3559,10 @@ static const ARMCPRegInfo el2_cp_reginfo[] = {
   .resetvalue = 0 },
 { .name = "TCR_EL2", .state = ARM_CP_STATE_BOTH,
   .opc0 = 3, .opc1 = 4, .crn = 2, .crm = 0, .opc2 = 2,
-  .access = PL2_RW, .writefn = vmsa_tcr_el1_write,
-  .resetfn = vmsa_ttbcr_reset, .raw_writefn = raw_write,
+  .access = PL2_RW,
+  /* no .writefn needed as this can't cause an ASID change;
+   * no .raw_writefn or .resetfn needed as we never use mask/base_mask
+   */
   .fieldoffset = offsetof(CPUARMState, cp15.tcr_el[2]) },
 { .name = "VTCR", .state = ARM_CP_STATE_AA32,
   .cp = 15, .opc1 = 4, .crn = 2, .crm = 1, .opc2 = 2,
-- 
1.9.1

[Qemu-devel] [PATCH] slirp: Fix migration from older versions of QEMU to the current one

2016-03-31 Thread Thomas Huth

While adding the IPv6 support, the commit eae303ff23f51259eddc8856c71453d8
("slirp: Make Socket structure IPv6 compatible") changed the format of
the migration stream, without taking into account that we might still
receive an old migration stream layout when upgrading from QEMU version
2.5 (or older) to QEMU 2.6. Currently, QEMU bails out when doing a
migration from QEMU 2.5 to the recent master version when it has
been started with a "-net user,guestfwd=..." network. So let's fix
this by checking the version ID of the migration stream and by using
the old behavior if we've detected version 3 or less.

Signed-off-by: Thomas Huth 
---
 slirp/slirp.c | 44 ++--
 1 file changed, 26 insertions(+), 18 deletions(-)

diff --git a/slirp/slirp.c b/slirp/slirp.c
index 3481fcc..998f278 100644
--- a/slirp/slirp.c
+++ b/slirp/slirp.c
@@ -1233,31 +1233,39 @@ static int slirp_sbuf_load(QEMUFile *f, struct sbuf 
*sbuf)
 return 0;
 }
 
-static int slirp_socket_load(QEMUFile *f, struct socket *so)
+static int slirp_socket_load(QEMUFile *f, struct socket *so, int version_id)
 {
 if (tcp_attach(so) < 0)
 return -ENOMEM;
 
 so->so_urgc = qemu_get_be32(f);
-so->so_ffamily = qemu_get_be16(f);
-switch (so->so_ffamily) {
-case AF_INET:
+if (version_id <= 3) {
+so->so_ffamily = AF_INET;
 so->so_faddr.s_addr = qemu_get_be32(f);
-so->so_fport = qemu_get_be16(f);
-break;
-default:
-error_report(
-"so_ffamily unknown, unable to restore so_faddr and 
so_lport\n");
-}
-so->so_lfamily = qemu_get_be16(f);
-switch (so->so_lfamily) {
-case AF_INET:
 so->so_laddr.s_addr = qemu_get_be32(f);
+so->so_fport = qemu_get_be16(f);
 so->so_lport = qemu_get_be16(f);
-break;
-default:
-error_report(
-"so_ffamily unknown, unable to restore so_laddr and 
so_lport\n");
+} else {
+so->so_ffamily = qemu_get_be16(f);
+switch (so->so_ffamily) {
+case AF_INET:
+so->so_faddr.s_addr = qemu_get_be32(f);
+so->so_fport = qemu_get_be16(f);
+break;
+default:
+error_report(
+"so_ffamily unknown, unable to restore so_faddr and so_lport");
+}
+so->so_lfamily = qemu_get_be16(f);
+switch (so->so_lfamily) {
+case AF_INET:
+so->so_laddr.s_addr = qemu_get_be32(f);
+so->so_lport = qemu_get_be16(f);
+break;
+default:
+error_report(
+"so_ffamily unknown, unable to restore so_laddr and so_lport");
+}
 }
 so->so_iptos = qemu_get_byte(f);
 so->so_emu = qemu_get_byte(f);
@@ -1294,7 +1302,7 @@ static int slirp_state_load(QEMUFile *f, void *opaque, 
int version_id)
 if (!so)
 return -ENOMEM;
 
-ret = slirp_socket_load(f, so);
+ret = slirp_socket_load(f, so, version_id);
 
 if (ret < 0)
 return ret;
-- 
1.8.3.1

Re: [Qemu-devel] [PATCH 1/2] char: fix broken EAGAIN retry on OS-X due to errno clobbering

2016-03-31 Thread Daniel P. Berrange

On Thu, Mar 31, 2016 at 03:35:44PM +0100, Peter Maydell wrote:
> On 31 March 2016 at 15:29, Daniel P. Berrange  wrote:
> > Some of the chardev I/O paths really want to write the
> > complete data buffer even though the channel is in
> > non-blocking mode. To achieve this they look for EAGAIN
> > and g_usleep() for 100ms. Unfortunately the code is set
> > to check errno == EAGAIN a second time, after the g_usleep()
> > call has completed. On OS-X at least, g_usleep clobbers
> > errno to ETIMEDOUT, causing the retry to be skipped.
> >
> > This failure to retry means the full data isn't written
> > to the chardev backend, which causes various failures
> > including making the tests/ahci-test qtest hang.
> >
> > Rather than playing games trying to reset errno just
> > simplify the code to use a goto to retry instead of a
> > a loop.
> >
> > Signed-off-by: Daniel P. Berrange 
> > ---
> >  qemu-char.c | 24 
> >  1 file changed, 12 insertions(+), 12 deletions(-)
> >
> > diff --git a/qemu-char.c b/qemu-char.c
> > index 270819a..6e623c3 100644
> > --- a/qemu-char.c
> > +++ b/qemu-char.c
> > @@ -246,12 +246,12 @@ static int qemu_chr_fe_write_buffer(CharDriverState 
> > *s, const uint8_t *buf, int
> >
> >  qemu_mutex_lock(>chr_write_lock);
> >  while (*offset < len) {
> > -do {
> > -res = s->chr_write(s, buf + *offset, len - *offset);
> > -if (res == -1 && errno == EAGAIN) {
> > -g_usleep(100);
> > -}
> > -} while (res == -1 && errno == EAGAIN);
> > +retry:
> > +res = s->chr_write(s, buf + *offset, len - *offset);
> > +if (res < 0 && errno == EAGAIN) {
> > +g_usleep(100);
> > +goto retry;
> > +}
> >
> >  if (res <= 0) {
> >  break;
> > @@ -333,12 +333,12 @@ int qemu_chr_fe_read_all(CharDriverState *s, uint8_t 
> > *buf, int len)
> >  }
> >
> >  while (offset < len) {
> > -do {
> > -res = s->chr_sync_read(s, buf + offset, len - offset);
> > -if (res == -1 && errno == EAGAIN) {
> > -g_usleep(100);
> > -}
> > -} while (res == -1 && errno == EAGAIN);
> > +retry:
> > +res = s->chr_sync_read(s, buf + offset, len - offset);
> > +if (res == -1 && errno == EAGAIN) {
> > +g_usleep(100);
> > +goto retry;
> > +}
> >
> >  if (res == 0) {
> >  break;
> 
> qemu_chr_fe_write_log() also seems to have this broken retry pattern.

Oh yes, so it does. Will resend a v2.

Regards,
Daniel
-- 
|: http://berrange.com  -o-http://www.flickr.com/photos/dberrange/ :|
|: http://libvirt.org  -o- http://virt-manager.org :|
|: http://autobuild.org   -o- http://search.cpan.org/~danberr/ :|
|: http://entangle-photo.org   -o-   http://live.gnome.org/gtk-vnc :|

Re: [Qemu-devel] tcg: reworking tb_invalidated_flag

2016-03-31 Thread Paolo Bonzini



On 31/03/2016 16:35, Sergey Fedorov wrote:
>> > My plan was to use some kind of double buffering, where only half of
>> > code_gen_buffer is in use.  At the end of tb_flush you call cpu_exit()
>> > on all CPUs, so that CPUs stop executing chained TBs from the old half
>> > before they can see one from the new half.
>> >
>> > If code_gen_buffer is static you have to preallocate two buffers (and
>> > two tbs arrays) and waste one of them; while it is theoretically
>> > possible to have CPUs still executing from the old half while you finish
>> > the new half, it can be more or less ignored.
>> >
>> > If it is dynamic, the previously used areas can be freed with call_rcu,
>> > and you can safely allocate a new code_gen_buffer and tbs array.
>> >
>> > I haven't thought much about it; it might require keeping a cache of the
>> > tbs array per CPU, and possibly changing the code under "if
>> > (tcg_ctx.tb_ctx.tb_invalidated_flag)" to simply exit cpu_exec.
> Maybe save this idea for latter? :) We'd better use a simpler approach
> at first and then move on and optimize. BTW, a few years ago I came
> across an interesting paper on code cache eviction granularities [1].

It depends on what is simpler.  Emilio and Fred's code was tricky and
touched the central CPU execution loop in cpus.c.

Paolo

Re: [Qemu-devel] [PULL 0/5] ipv4-only and ipv6-only support

2016-03-31 Thread Peter Maydell

On 31 March 2016 at 15:19, Samuel Thibault  wrote:
> Peter Maydell, on Thu 31 Mar 2016 15:11:27 +0100, wrote:
>> On 31 March 2016 at 13:51, Samuel Thibault  wrote:
>> > Peter Maydell, on Thu 31 Mar 2016 13:47:28 +0100, wrote:
>> >> On 31 March 2016 at 10:20, Samuel Thibault  
>> >> wrote:
>> >> >   Update version for v2.6.0-rc0 release (2016-03-30 19:25:40 +0100)
>> >> >
>> >> > are available in the git repository at:
>> >> >
>> >> >   http://people.debian.org/~sthibault/qemu.git tags/samuel-thibault-2
>> >> >
>> >> > for you to fetch changes up to c99751f2a711e9eecf60901520c6d4197bdaf9b4:
>> >> >
>> >> >   slirp: Add RDNSS advertisement (2016-03-31 11:18:13 +0200)
>> >> >
>> >> > 
>> >> > slirp updates (2)
>> >> >
>> >> > 
>> >> > Samuel Thibault (5):
>> >> >   slirp: Allow disabling IPv4 or IPv6
>> >> >   slirp: Split get_dns_addr
>> >> >   slirp: Add dns6 resolution
>> >> >   slirp: Support link-local DNS addresses
>> >> >   slirp: Add RDNSS advertisement
>> >>
>> >> Hi. I'm afraid this doesn't build for Windows:
>> >
>> > Indeed, I forgot to fix the the windows version of the function, could
>> > you try with the attached change?
>>
>> I'm afraid that's not sufficient:
>
> Ah, this is using -Werror, OK, here are more fixes.

Still nope:
/home/petmay01/linaro/qemu-for-merges/slirp/socket.c: In function
‘sotranslate_out’:
/home/petmay01/linaro/qemu-for-merges/slirp/socket.c:799:17: error:
passing argument 2 of ‘get_dns6_addr’ from incompatible pointer type
[-Werror]
 if (get_dns6_addr(>sin6_addr,
>sin6_scope_id) < 0) {
 ^
In file included from /home/petmay01/linaro/qemu-for-merges/slirp/slirp.h:99:0,
 from /home/petmay01/linaro/qemu-for-merges/slirp/socket.c:10:
/home/petmay01/linaro/qemu-for-merges/slirp/libslirp.h:10:5: note:
expected ‘uint32_t *’ but argument is of type ‘u_long *’
 int get_dns6_addr(struct in6_addr *pdns6_addr, uint32_t *scope_id);
 ^
cc1: all warnings being treated as errors

thanks
-- PMM

Re: [Qemu-devel] tcg: reworking tb_invalidated_flag

2016-03-31 Thread Sergey Fedorov

On 31/03/16 16:40, Paolo Bonzini wrote:
>
> On 31/03/2016 15:14, Sergey Fedorov wrote:
>> On 30/03/16 21:13, Paolo Bonzini wrote:
>>> On 30/03/2016 19:08, Sergey Fedorov wrote:
 The second approach is to make 'tb_invalidated_flag' per-CPU. This
 would be conceptually similar to what we have, but would give us thread
 safety. With this approach, we need to be careful to correctly clear and
 set the flag.
>>> You can just ensure that setting and clearing it is done under tb_lock.
>> So it could remain sitting in 'tcg_ctx.tb_ctx'. I'm just wondering what
>> could be real benefits for making it per-CPU then?
> All CPUs need to observe it in order to clear their own local next_tb
> variable.  It is not enough to do that once, so it has to be per-CPU.

So for each vCPU thread we have a separate flag to clear it safely. Got
it, thanks.

>
>>> Because TranslationBlocks live in tcg_ctx.tb_ctx.tbs you need
>>> special code to exit all CPUs at tb_flush time, otherwise you risk that
>>> a tb_alloc reuses a TranslationBlock while it is in use by a VCPU.
>> Looks like no matter which approach we use, it's ultimately necessary to
>> ensure all CPUs have exited from translated code before the translation
>> buffer may be safely flushed.
> My plan was to use some kind of double buffering, where only half of
> code_gen_buffer is in use.  At the end of tb_flush you call cpu_exit()
> on all CPUs, so that CPUs stop executing chained TBs from the old half
> before they can see one from the new half.
>
> If code_gen_buffer is static you have to preallocate two buffers (and
> two tbs arrays) and waste one of them; while it is theoretically
> possible to have CPUs still executing from the old half while you finish
> the new half, it can be more or less ignored.
>
> If it is dynamic, the previously used areas can be freed with call_rcu,
> and you can safely allocate a new code_gen_buffer and tbs array.
>
> I haven't thought much about it; it might require keeping a cache of the
> tbs array per CPU, and possibly changing the code under "if
> (tcg_ctx.tb_ctx.tb_invalidated_flag)" to simply exit cpu_exec.

Maybe save this idea for latter? :) We'd better use a simpler approach
at first and then move on and optimize. BTW, a few years ago I came
across an interesting paper on code cache eviction granularities [1].

[1]
http://www.cs.virginia.edu/kim/courses/cs851/papers/hazelwood04mediumgrained.pdf

Kind regards,
Sergey

Re: [Qemu-devel] [PATCH v2 1/1] NBD proto: add WRITE_ZEROES extension

2016-03-31 Thread Paolo Bonzini

On 31/03/2016 16:27, Alex Bligh wrote:
> > > IE why not always permit trimming PROVIDED the data always reads back
> > > as zero? This would be far simpler.
> > 
> > Because trimming can make future operations more expensive and cause
> > fragmentation (which may not be as bad as it used to be at the media
> > level, but it is still somewhat bad at the filesystem level).
> > 
> > So if you want a fully-provisioned file, the simplest way to do so is to
> > write zeroes to it, and trimming is undesirable.
> But isn't the server in a better position to know this than the
> client?

There are at least three possible states for a sector:

- hole (thin-provisioned)

- allocated as data (disk contains actual zeroes)

- allocated as unwritten (blocks reserved on backing storage, reads as
zeroes but the disk may not contain actual zeroes)

It's always okay for the backend to convert a zero block to an unwritten
extent; it's generally not okay for a backend to take a request to
create an unwritten extent and instead create a hole.

It's all an "as if" situation. The server must provide the semantics
requested by the client.  For example, writing to a hole could cause
ENOSPC, writing to an unwritten extend could not.  The server might know
better, because it certainly is in a better position to know how to
fulfill the client's request.

But even if it's just a hint, it makes sense for NBD to provide it.
It's not a coincidence that this hint exists at all levels: SCSI has an
UNMAP bit that can be set in the WRITE SAME command (and it has UNMAP
which matches NBD's TRIM); the fallocate system call has
FALLOC_FL_ZERO_RANGE and FALLOC_FL_PUNCH_HOLE (plus Linux has the
BLKDISCARD ioctl which again matches NBD's TRIM for block devices).

> EG if the server has a back end implementation (as I suspect
> Ceph on qemu-nbd does)

Ceph doesn't, but gluster does.

> which never actually stores all zero blocks,
> it won't make a difference, and conceivably you're generating a whole
> pile of I/O to avoid sparseness when sparseness might be faster. Take
> for example a persistent memory interface, where fragmentation is
> irrelevant, and writing piles of zeroes to memory is a waste of time.

It certainly isn't a waste of time if your intention is to scrub data
belonging to a previous tenant, before giving access to someone else!
If you have a metadata layer above then you can handle the command there
(that's why we're adding it); if you haven't you do have to write the
zeroes.

Paolo

Re: [Qemu-devel] [PATCH v2 1/1] NBD proto: add WRITE_ZEROES extension

2016-03-31 Thread Alex Bligh

On 31 Mar 2016, at 14:55, Paolo Bonzini  wrote:

> On 31/03/2016 15:53, Alex Bligh wrote:
 +If the flag `NBD_CMD_FLAG_MAY_TRIM` was set by the client in the 
 command
 +flags field, the server MAY use trimming to zero out the area, but it
 +MUST ensure that the data reads back as zero.
 +
>> Can you give an example of a situation where the client would not set this
>> and it would be undesirable for the server to create a 'hole' using
>> 'trim' type technology, even when the client doesn't specify it?
>> I suspect there are already some backends (e.g. ceph on qemu-nbd) which
>> will effectively do a 'trim' if you write 4k of zeroes even under
>> current circumstances.
>> 
>> IE why not always permit trimming PROVIDED the data always reads back
>> as zero? This would be far simpler.
> 
> Because trimming can make future operations more expensive and cause
> fragmentation (which may not be as bad as it used to be at the media
> level, but it is still somewhat bad at the filesystem level).
> 
> So if you want a fully-provisioned file, the simplest way to do so is to
> write zeroes to it, and trimming is undesirable.

But isn't the server in a better position to know this than the
client? EG if the server has a back end implementation (as I suspect
Ceph on qemu-nbd does) which never actually stores all zero blocks,
it won't make a difference, and conceivably you're generating a whole
pile of I/O to avoid sparseness when sparseness might be faster. Take
for example a persistent memory interface, where fragmentation is
irrelevant, and writing piles of zeroes to memory is a waste of time.

and on the same subject

On 31 Mar 2016, at 15:08, Eric Blake  wrote:
> Yes, I can see situations where the client REQUIRES that the server
> write actual zeroes, rather than trimming.  The biggest reason is that
> in an environment where storage can be oversubscribed (multiple sparse
> files that in name occupy more data than the underlying storage
> contains), explicitly writing zeroes without punching a hole guarantees
> that YOUR file has storage allocated to it (whereas if YOUR file is
> trimmed, some other file can then use enough allocation to prevent you
> from actually writing data in place of the hole).  Of course, the client
> can still achieve this by sticking with NBD_CMD_WRITE, but that requires
> more network traffic.

Ditto, the server is surely in a better position to know this. Perhaps
the server KNOWS it doesn't oversubscribe.

On the other hand, a third reason I suppose could be security.

Whatever, the implication that a server may never use a trim type
operation unless NBD_CMD_FLAG_MAY_TRIM is specified seems to me
pretty draconian. I'd prefer this as NBD_CMD_FLAG_NO_TRIM
(as Eric sets out below), and to make it a 'hint', saying the
data SHOULD actually be written out as zeroes for security and
to maintain allocation and lack of sparseness.

A good example of why this can only be a 'SHOULD' would be
a file system that itself is CoW (or perhaps journals
data). Either way, you aren't going to get your space back, you
aren't going to get secure overwriting, and sparseness doesn't
much mean anything.

> However, having written that, I'm thinking we have the wrong sense for
> the flag.  I think it makes more sense to allow trim/hole-punching by
> default (but ONLY when the server can guarantee that reads will still be
> zeroes), and make the flag NBD_CMD_FLAG_NO_TRIM to explicitly specify
> the cases where the server MUST NOT trim but allocate and write actual
> zeroes.  I suspect that explicit allocation requests are less common,
> and also less efficient; so having the default state of the flag geared
> towards efficiency (both in the sense that punching holes can be faster
> than writing zeroes, and that most people LIKE the storage savings of
> sparse files).

I agree with the sense reversal, but I think it should be a SHOULD NOT
(for the reasons set out above), and explaining why would be helpful.

-- 
Alex Bligh

Re: [Qemu-devel] [PATCH 1/2] char: fix broken EAGAIN retry on OS-X due to errno clobbering

2016-03-31 Thread Peter Maydell

On 31 March 2016 at 15:29, Daniel P. Berrange  wrote:
> Some of the chardev I/O paths really want to write the
> complete data buffer even though the channel is in
> non-blocking mode. To achieve this they look for EAGAIN
> and g_usleep() for 100ms. Unfortunately the code is set
> to check errno == EAGAIN a second time, after the g_usleep()
> call has completed. On OS-X at least, g_usleep clobbers
> errno to ETIMEDOUT, causing the retry to be skipped.
>
> This failure to retry means the full data isn't written
> to the chardev backend, which causes various failures
> including making the tests/ahci-test qtest hang.
>
> Rather than playing games trying to reset errno just
> simplify the code to use a goto to retry instead of a
> a loop.
>
> Signed-off-by: Daniel P. Berrange 
> ---
>  qemu-char.c | 24 
>  1 file changed, 12 insertions(+), 12 deletions(-)
>
> diff --git a/qemu-char.c b/qemu-char.c
> index 270819a..6e623c3 100644
> --- a/qemu-char.c
> +++ b/qemu-char.c
> @@ -246,12 +246,12 @@ static int qemu_chr_fe_write_buffer(CharDriverState *s, 
> const uint8_t *buf, int
>
>  qemu_mutex_lock(>chr_write_lock);
>  while (*offset < len) {
> -do {
> -res = s->chr_write(s, buf + *offset, len - *offset);
> -if (res == -1 && errno == EAGAIN) {
> -g_usleep(100);
> -}
> -} while (res == -1 && errno == EAGAIN);
> +retry:
> +res = s->chr_write(s, buf + *offset, len - *offset);
> +if (res < 0 && errno == EAGAIN) {
> +g_usleep(100);
> +goto retry;
> +}
>
>  if (res <= 0) {
>  break;
> @@ -333,12 +333,12 @@ int qemu_chr_fe_read_all(CharDriverState *s, uint8_t 
> *buf, int len)
>  }
>
>  while (offset < len) {
> -do {
> -res = s->chr_sync_read(s, buf + offset, len - offset);
> -if (res == -1 && errno == EAGAIN) {
> -g_usleep(100);
> -}
> -} while (res == -1 && errno == EAGAIN);
> +retry:
> +res = s->chr_sync_read(s, buf + offset, len - offset);
> +if (res == -1 && errno == EAGAIN) {
> +g_usleep(100);
> +goto retry;
> +}
>
>  if (res == 0) {
>  break;

qemu_chr_fe_write_log() also seems to have this broken retry pattern.

thanks
-- PMM

[Qemu-devel] [PATCH 1/2] char: fix broken EAGAIN retry on OS-X due to errno clobbering

2016-03-31 Thread Daniel P. Berrange

Some of the chardev I/O paths really want to write the
complete data buffer even though the channel is in
non-blocking mode. To achieve this they look for EAGAIN
and g_usleep() for 100ms. Unfortunately the code is set
to check errno == EAGAIN a second time, after the g_usleep()
call has completed. On OS-X at least, g_usleep clobbers
errno to ETIMEDOUT, causing the retry to be skipped.

This failure to retry means the full data isn't written
to the chardev backend, which causes various failures
including making the tests/ahci-test qtest hang.

Rather than playing games trying to reset errno just
simplify the code to use a goto to retry instead of a
a loop.

Signed-off-by: Daniel P. Berrange 
---
 qemu-char.c | 24 
 1 file changed, 12 insertions(+), 12 deletions(-)

diff --git a/qemu-char.c b/qemu-char.c
index 270819a..6e623c3 100644
--- a/qemu-char.c
+++ b/qemu-char.c
@@ -246,12 +246,12 @@ static int qemu_chr_fe_write_buffer(CharDriverState *s, 
const uint8_t *buf, int
 
 qemu_mutex_lock(>chr_write_lock);
 while (*offset < len) {
-do {
-res = s->chr_write(s, buf + *offset, len - *offset);
-if (res == -1 && errno == EAGAIN) {
-g_usleep(100);
-}
-} while (res == -1 && errno == EAGAIN);
+retry:
+res = s->chr_write(s, buf + *offset, len - *offset);
+if (res < 0 && errno == EAGAIN) {
+g_usleep(100);
+goto retry;
+}
 
 if (res <= 0) {
 break;
@@ -333,12 +333,12 @@ int qemu_chr_fe_read_all(CharDriverState *s, uint8_t 
*buf, int len)
 }
 
 while (offset < len) {
-do {
-res = s->chr_sync_read(s, buf + offset, len - offset);
-if (res == -1 && errno == EAGAIN) {
-g_usleep(100);
-}
-} while (res == -1 && errno == EAGAIN);
+retry:
+res = s->chr_sync_read(s, buf + offset, len - offset);
+if (res == -1 && errno == EAGAIN) {
+g_usleep(100);
+goto retry;
+}
 
 if (res == 0) {
 break;
-- 
2.5.5

[Qemu-devel] [PATCH 2/2] char: ensure all clients are in non-blocking mode

2016-03-31 Thread Daniel P. Berrange

Only some callers of tcp_chr_new_client are putting the
socket client into non-blocking mode. Move the call to
qio_channel_set_blocking() into the tcp_chr_new_client
method to guarantee that all code paths set non-blocking
mode

Reported-by: Andrew Baumann 
Reported-by: Laurent Vivier 
Signed-off-by: Daniel P. Berrange 
---
 qemu-char.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/qemu-char.c b/qemu-char.c
index 6e623c3..3558429 100644
--- a/qemu-char.c
+++ b/qemu-char.c
@@ -3081,6 +3081,8 @@ static int tcp_chr_new_client(CharDriverState *chr, 
QIOChannelSocket *sioc)
 s->sioc = sioc;
 object_ref(OBJECT(sioc));
 
+qio_channel_set_blocking(s->ioc, false, NULL);
+
 if (s->do_nodelay) {
 qio_channel_set_delay(s->ioc, false);
 }
@@ -3112,7 +3114,6 @@ static int tcp_chr_add_client(CharDriverState *chr, int 
fd)
 if (!sioc) {
 return -1;
 }
-qio_channel_set_blocking(QIO_CHANNEL(sioc), false, NULL);
 ret = tcp_chr_new_client(chr, sioc);
 object_unref(OBJECT(sioc));
 return ret;
-- 
2.5.5

[Qemu-devel] [PATCH 0/2] Fixing non-blocking operation of chardevs

2016-03-31 Thread Daniel P. Berrange

This fixes socket chardevs to always be in non-blocking
mode, as they were before the QIOChannel conversion. The
second patch was already posted before, but dropped when
Peter discovered a problem on OS-X causing ahci-test to
hang:

  https://lists.gnu.org/archive/html/qemu-devel/2016-03/msg05807.html

I traced this down to broken EAGAIN handling affecting
OS-X, hence the first patch in this series.

Daniel P. Berrange (2):
  char: fix broken EAGAIN retry on OS-X due to errno clobbering
  char: ensure all clients are in non-blocking mode

 qemu-char.c | 27 ++-
 1 file changed, 14 insertions(+), 13 deletions(-)

-- 
2.5.5

[Qemu-devel] [PATCH] net: fix OptsVisitor memory leak

2016-03-31 Thread Paolo Bonzini

Signed-off-by: Paolo Bonzini 
---
 net/net.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/net/net.c b/net/net.c
index 594c3b8..3847a13 100644
--- a/net/net.c
+++ b/net/net.c
@@ -1100,6 +1100,7 @@ int net_client_init(QemuOpts *opts, int is_netdev, Error 
**errp)
 }
 
 error_propagate(errp, err);
+opts_visitor_cleanup(ov);
 return ret;
 }
 
-- 
2.5.5

Re: [Qemu-devel] [PULL 0/5] ipv4-only and ipv6-only support

2016-03-31 Thread Samuel Thibault

Peter Maydell, on Thu 31 Mar 2016 15:11:27 +0100, wrote:
> On 31 March 2016 at 13:51, Samuel Thibault  wrote:
> > Peter Maydell, on Thu 31 Mar 2016 13:47:28 +0100, wrote:
> >> On 31 March 2016 at 10:20, Samuel Thibault  
> >> wrote:
> >> >   Update version for v2.6.0-rc0 release (2016-03-30 19:25:40 +0100)
> >> >
> >> > are available in the git repository at:
> >> >
> >> >   http://people.debian.org/~sthibault/qemu.git tags/samuel-thibault-2
> >> >
> >> > for you to fetch changes up to c99751f2a711e9eecf60901520c6d4197bdaf9b4:
> >> >
> >> >   slirp: Add RDNSS advertisement (2016-03-31 11:18:13 +0200)
> >> >
> >> > 
> >> > slirp updates (2)
> >> >
> >> > 
> >> > Samuel Thibault (5):
> >> >   slirp: Allow disabling IPv4 or IPv6
> >> >   slirp: Split get_dns_addr
> >> >   slirp: Add dns6 resolution
> >> >   slirp: Support link-local DNS addresses
> >> >   slirp: Add RDNSS advertisement
> >>
> >> Hi. I'm afraid this doesn't build for Windows:
> >
> > Indeed, I forgot to fix the the windows version of the function, could
> > you try with the attached change?
> 
> I'm afraid that's not sufficient:

Ah, this is using -Werror, OK, here are more fixes.

Samuel

diff --git a/slirp/libslirp.h b/slirp/libslirp.h
index 81bd139..b6fc584 100644
--- a/slirp/libslirp.h
+++ b/slirp/libslirp.h
@@ -7,7 +7,7 @@ struct Slirp;
 typedef struct Slirp Slirp;
 
 int get_dns_addr(struct in_addr *pdns_addr);
-int get_dns6_addr(struct in6_addr *pdns6_addr, unsigned *scope_id);
+int get_dns6_addr(struct in6_addr *pdns6_addr, uint32_t *scope_id);
 
 Slirp *slirp_init(int restricted, bool in_enabled, struct in_addr vnetwork,
   struct in_addr vnetmask, struct in_addr vhost,
diff --git a/slirp/slirp.c b/slirp/slirp.c
index c6bcc6e..551a63c 100644
--- a/slirp/slirp.c
+++ b/slirp/slirp.c
@@ -46,9 +50,13 @@ static QTAILQ_HEAD(slirp_instances, Slirp) slirp_instances =
 QTAILQ_HEAD_INITIALIZER(slirp_instances);
 
 static struct in_addr dns_addr;
+#ifndef _WIN32
 static struct in6_addr dns6_addr;
+#endif
 static u_int dns_addr_time;
+#ifndef _WIN32
 static u_int dns6_addr_time;
+#endif
 
 #define TIMEOUT_FAST 2  /* milliseconds */
 #define TIMEOUT_SLOW 499  /* milliseconds */
@@ -110,7 +110,7 @@ int get_dns_addr(struct in_addr *pdns_addr)
 return 0;
 }
 
-int get_dns6_addr(struct in6_addr *pdns6_addr, unsigned *scope_id)
+int get_dns6_addr(struct in6_addr *pdns6_addr, uint32_t *scope_id)
 {
 return -1;
 }
@@ -146,7 +146,7 @@ static int get_dns_addr_cached(void *pdns_addr, void 
*cached_addr,
 }
 
 static int get_dns_addr_resolv_conf(int af, void *pdns_addr, void *cached_addr,
-socklen_t addrlen, unsigned *scope_id,
+socklen_t addrlen, uint32_t *scope_id,
 u_int *cached_time)
 {
 char buff[512];
@@ -229,7 +229,7 @@ int get_dns_addr(struct in_addr *pdns_addr)
 sizeof(dns_addr), NULL, _addr_time);
 }
 
-int get_dns6_addr(struct in6_addr *pdns6_addr, unsigned *scope_id)
+int get_dns6_addr(struct in6_addr *pdns6_addr, uint32_t *scope_id)
 {
 static struct stat dns6_addr_stat;

Re: [Qemu-devel] [PULL 0/5] ipv4-only and ipv6-only support

2016-03-31 Thread Peter Maydell

On 31 March 2016 at 13:51, Samuel Thibault  wrote:
> Peter Maydell, on Thu 31 Mar 2016 13:47:28 +0100, wrote:
>> On 31 March 2016 at 10:20, Samuel Thibault  
>> wrote:
>> >   Update version for v2.6.0-rc0 release (2016-03-30 19:25:40 +0100)
>> >
>> > are available in the git repository at:
>> >
>> >   http://people.debian.org/~sthibault/qemu.git tags/samuel-thibault-2
>> >
>> > for you to fetch changes up to c99751f2a711e9eecf60901520c6d4197bdaf9b4:
>> >
>> >   slirp: Add RDNSS advertisement (2016-03-31 11:18:13 +0200)
>> >
>> > 
>> > slirp updates (2)
>> >
>> > 
>> > Samuel Thibault (5):
>> >   slirp: Allow disabling IPv4 or IPv6
>> >   slirp: Split get_dns_addr
>> >   slirp: Add dns6 resolution
>> >   slirp: Support link-local DNS addresses
>> >   slirp: Add RDNSS advertisement
>>
>> Hi. I'm afraid this doesn't build for Windows:
>
> Indeed, I forgot to fix the the windows version of the function, could
> you try with the attached change?

I'm afraid that's not sufficient:

/home/petmay01/linaro/qemu-for-merges/slirp/slirp.c:53:24: error:
‘dns6_addr’ defined but not used [-Werror=unused-variable]
 static struct in6_addr dns6_addr;
^
/home/petmay01/linaro/qemu-for-merges/slirp/slirp.c:55:14: error:
‘dns6_addr_time’ defined but not used [-Werror=unused-variable]
 static u_int dns6_addr_time;
  ^
cc1: all warnings being treated as errors
make: *** [slirp/slirp.o] Error 1
make: *** Waiting for unfinished jobs
/home/petmay01/linaro/qemu-for-merges/slirp/socket.c: In function
‘sotranslate_out’:
/home/petmay01/linaro/qemu-for-merges/slirp/socket.c:799:17: error:
passing argument 2 of ‘get_dns6_addr’ from incompatible pointer type
[-Werror]
 if (get_dns6_addr(>sin6_addr,
>sin6_scope_id) < 0) {
 ^
In file included from /home/petmay01/linaro/qemu-for-merges/slirp/slirp.h:99:0,
 from /home/petmay01/linaro/qemu-for-merges/slirp/socket.c:10:
/home/petmay01/linaro/qemu-for-merges/slirp/libslirp.h:10:5: note:
expected ‘unsigned int *’ but argument is of type ‘u_long *’
 int get_dns6_addr(struct in6_addr *pdns6_addr, unsigned *scope_id);
 ^
cc1: all warnings being treated as errors

thanks
-- PMM

Re: [Qemu-devel] [PATCH v2 1/1] NBD proto: add WRITE_ZEROES extension

2016-03-31 Thread Eric Blake

On 03/31/2016 07:53 AM, Alex Bligh wrote:
> 
> On 31 Mar 2016, at 14:02, Denis V. Lunev  wrote:
> 
>> From: Pavel Borzenkov 
>>
>> There exist some cases when a client knows that the data it is going to
>> write is all zeroes. Such cases include mirroring or backing up a device
>> implemented by a sparse file.
> 
> Useful.
> 
>> -- bit 0, `NBD_CMD_FLAG_FUA`; valid during `NBD_CMD_WRITE`.  SHOULD be
>> -  set to 1 if the client requires "Force Unit Access" mode of
>> -  operation.  MUST NOT be set unless transmission flags included
>> -  `NBD_FLAG_SEND_FUA`.
>> +- bit 0, `NBD_CMD_FLAG_FUA`; valid during `NBD_CMD_WRITE` and
>> +  `NBD_CMD_WRITE_ZEROES` commands.  SHOULD be set to 1 if the client 
>> requires
>> +  "Force Unit Access" mode of operation.  MUST NOT be set unless 
>> transmission
>> +  flags included `NBD_FLAG_SEND_FUA`.
> 
> Not your fault, but this should actually say "unless export flags
> included". Transmission flags would be the flags with the command.

No, we just barely renamed 'export flags' to 'transmission flags', to
represent the 16 bits sent by the server at the end of handshake phase;
these are named 'NBD_FLAG_*'.  We still use the term 'command flags'
(although maybe 'request flags' is better) for the 16 bits sent with
each request; these are named 'NBD_CMD_FLAG_*'.

So Pavel's text is correct as-is.

> 
>> +- bit 1, `NBD_CMD_MAY_TRIM`; defined by the experimental `WRITE_ZEROES`
>> +  extension; see below.
> 
> For consistency, probably useful to say here:
> 
> MUST NOT be set unless the export flags include NBD_FLAG_SEND_WRITE_ZEROES.

Elsewhere, when defining an experimental extension, the forward
reference has been as sparse as possible; so this sentence (about the
transmission flags including NBD_FLAG_SEND_WRITE_ZEROES) should appear
only in the experimental section, if it is not already there.

>>
>> +### `WRITE_ZEROES` extension
>> +
>> +There exist some cases when a client knows that the data it is going to 
>> write
>> +is all zeroes. Such cases include mirroring or backing up a device 
>> implemented
>> +by a sparse file. With current NBD command set, the client has to issue
>> +`NBD_CMD_WRITE` command with zeroed payload and transfer these zero bytes
>> +through the wire. The server has to write the data onto disk, effectively
>> +losing the sparseness.
>> +
>> +To remedy this, a `WRITE_ZEROES` extension is envisioned. This extension 
>> adds
>> +one new command and one new command flag.
>> +
>> +* `NBD_CMD_WRITE_ZEROES` (6)

Wouter recently pointed out that we explicitly do NOT want to repeat
constants in more than one location; define the value to (6) above where
you make the forward reference in the normative section, then keep the
experimental section referring to the command by name only.  Especially
useful if we end up renumbering things because we have multiple
extension proposals in flight at the moment.

>> +If the flag `NBD_CMD_FLAG_MAY_TRIM` was set by the client in the command
>> +flags field, the server MAY use trimming to zero out the area, but it
>> +MUST ensure that the data reads back as zero.
>> +
> 
> Can you give an example of a situation where the client would not set this
> and it would be undesirable for the server to create a 'hole' using
> 'trim' type technology, even when the client doesn't specify it?

Yes, I can see situations where the client REQUIRES that the server
write actual zeroes, rather than trimming.  The biggest reason is that
in an environment where storage can be oversubscribed (multiple sparse
files that in name occupy more data than the underlying storage
contains), explicitly writing zeroes without punching a hole guarantees
that YOUR file has storage allocated to it (whereas if YOUR file is
trimmed, some other file can then use enough allocation to prevent you
from actually writing data in place of the hole).  Of course, the client
can still achieve this by sticking with NBD_CMD_WRITE, but that requires
more network traffic.

However, having written that, I'm thinking we have the wrong sense for
the flag.  I think it makes more sense to allow trim/hole-punching by
default (but ONLY when the server can guarantee that reads will still be
zeroes), and make the flag NBD_CMD_FLAG_NO_TRIM to explicitly specify
the cases where the server MUST NOT trim but allocate and write actual
zeroes.  I suspect that explicit allocation requests are less common,
and also less efficient; so having the default state of the flag geared
towards efficiency (both in the sense that punching holes can be faster
than writing zeroes, and that most people LIKE the storage savings of
sparse files).

> I suspect there are already some backends (e.g. ceph on qemu-nbd) which
> will effectively do a 'trim' if you write 4k of zeroes even under
> current circumstances.
> 
> IE why not always permit trimming PROVIDED the data always reads back
> as zero? This would be far simpler.
> 

-- 
Eric Blake   eblake

1 2 3 >

1 - 100 of 227 matches

Mail list logo