date:20201022

Re: [PATCH] CHANGELOG: remove disused file

2020-10-22 Thread Thomas Huth

On 22/10/2020 18.28, John Snow wrote:
> There's no reason to keep this here; the versions described are
> ancient. Everything here is still mirrored on
> https://wiki.qemu.org/ChangeLog/old if anyone is curious; otherwise, use
> the git history.
> 
> Signed-off-by: John Snow 
> ---
>  Changelog | 580 --
>  1 file changed, 580 deletions(-)
>  delete mode 100644 Changelog
> 
> diff --git a/Changelog b/Changelog
> deleted file mode 100644
> index f7e178ccc01..000
> --- a/Changelog
> +++ /dev/null
> @@ -1,580 +0,0 @@
> -This file documents changes for QEMU releases 0.12 and earlier.
> -For changelog information for later releases, see
> -https://wiki.qemu.org/ChangeLog or look at the git history for
> -more detailed information.

I agree with removing the old log. But should we maybe leave a pointer to
https://wiki.qemu.org/ChangeLog / the git history here to let people know
how to see the changelogs?

 Thomas

Re: [PATCH v5 2/2] hw/block/nvme: add the dataset management command

2020-10-22 Thread Klaus Jensen

On Oct 22 23:02, Philippe Mathieu-Daudé wrote:
> Hi Klaus,
> 

Hi Philippe,

Thanks for your comments!

> On 10/22/20 8:49 PM, Klaus Jensen wrote:
> > From: Klaus Jensen 
> > 
> > Add support for the Dataset Management command and the Deallocate
> > attribute. Deallocation results in discards being sent to the underlying
> > block device. Whether or not the blocks are actually deallocated is
> > affected by the same factors as Write Zeroes (see previous commit).
> > 
> >   format | discard | dsm (512b)  dsm (4kb)  dsm (64kb)
> 
> Please use B/KiB units which are unambiguous (kb is for kbits)
> (if you queue this yourself, you can fix when applying, no need
> to repost).
> 

Thanks, I'll change it.

> >   ------------------------------------------------------
> >   qcow2    ignore    n           n          n
> >   qcow2    unmap     n           n          y
> >   raw      ignore    n           n          n
> >   raw      unmap     n           y          y
> > 
> > Again, a raw format and 4kb LBAs are preferable.
> > 
> > In order to set the Namespace Preferred Deallocate Granularity and
> > Alignment fields (NPDG and NPDA), choose a sane minimum discard
> > granularity of 4kb. If we are using a passthru device supporting discard
> > at a 512b granularity, user should set the discard_granularity property
> 
> Ditto.
> 
> > explicitly. NPDG and NPDA will also account for the cluster_size of the
> > block driver if required (i.e. for QCOW2).
> > 
> > See NVM Express 1.3d, Section 6.7 ("Dataset Management command").
> > 
> > Signed-off-by: Klaus Jensen 
> > ---
> >   hw/block/nvme.h  |   2 +
> >   include/block/nvme.h |   7 ++-
> >   hw/block/nvme-ns.c   |  36 +--
> >   hw/block/nvme.c  | 101 ++-
> >   4 files changed, 140 insertions(+), 6 deletions(-)
> > 
> > diff --git a/hw/block/nvme.h b/hw/block/nvme.h
> > index e080a2318a50..574333caa3f9 100644
> > --- a/hw/block/nvme.h
> > +++ b/hw/block/nvme.h
> > @@ -28,6 +28,7 @@ typedef struct NvmeRequest {
> >   struct NvmeNamespace*ns;
> >   BlockAIOCB  *aiocb;
> >   uint16_tstatus;
> > +void*opaque;
> >   NvmeCqe cqe;
> >   NvmeCmd cmd;
> >   BlockAcctCookie acct;
> > @@ -60,6 +61,7 @@ static inline const char *nvme_io_opc_str(uint8_t opc)
> >   case NVME_CMD_WRITE:return "NVME_NVM_CMD_WRITE";
> >   case NVME_CMD_READ: return "NVME_NVM_CMD_READ";
> >   case NVME_CMD_WRITE_ZEROES: return "NVME_NVM_CMD_WRITE_ZEROES";
> > +case NVME_CMD_DSM:  return "NVME_NVM_CMD_DSM";
> >   default:return "NVME_NVM_CMD_UNKNOWN";
> >   }
> >   }
> > diff --git a/include/block/nvme.h b/include/block/nvme.h
> > index 966c3bb304bd..e95ff6ca9b37 100644
> > --- a/include/block/nvme.h
> > +++ b/include/block/nvme.h
> > @@ -990,7 +990,12 @@ typedef struct QEMU_PACKED NvmeIdNs {
> >   uint16_tnabspf;
> >   uint16_tnoiob;
> >   uint8_t nvmcap[16];
> > -uint8_t rsvd64[40];
> > +uint16_tnpwg;
> > +uint16_tnpwa;
> > +uint16_tnpdg;
> > +uint16_tnpda;
> > +uint16_tnows;
> > +uint8_t rsvd74[30];
> >   uint8_t nguid[16];
> >   uint64_teui64;
> >   NvmeLBAFlbaf[16];
> 
> If you consider "block/nvme.h" shared by 2 different subsystems,
> it is better to add the changes in a separate patch. That way
> the changes can be acked individually.
> 

Sure. Some other stuff here warrants a v6, I think, so I will split it.

> > diff --git a/hw/block/nvme-ns.c b/hw/block/nvme-ns.c
> > index f1cc734c60f5..840651db7256 100644
> > --- a/hw/block/nvme-ns.c
> > +++ b/hw/block/nvme-ns.c
> > @@ -28,10 +28,14 @@
> >   #include "nvme.h"
> >   #include "nvme-ns.h"
> > -static void nvme_ns_init(NvmeNamespace *ns)
> > +#define MIN_DISCARD_GRANULARITY (4 * KiB)
> > +
> > +static int nvme_ns_init(NvmeNamespace *ns, Error **errp)
> 
> Hmm the Error* argument could be squashed in "hw/block/nvme:
> support multiple namespaces". Else better split patch in dumb
> units IMHO (maybe a reviewer's taste).
> 

Yeah, I guess I can squash that in.

> >   {
> > +BlockDriverInfo bdi;
> > >   NvmeIdNs *id_ns = &ns->id_ns;
> >   int lba_index = NVME_ID_NS_FLBAS_INDEX(ns->id_ns.flbas);
> > +int npdg, ret;
> >   ns->id_ns.dlfeat = 0x9;
> > @@ -43,8 +47,25 @@ static void nvme_ns_init(NvmeNamespace *ns)
> >   id_ns->ncap = id_ns->nsze;
> >   id_ns->nuse = id_ns->ncap;
> > -/* support DULBE */
> > -id_ns->nsfeat |= 0x4;
> > +/* support DULBE and I/O optimization fields */
> > +id_ns->nsfeat |= (0x4 | 0x10);
> 
> The comment helps, but isn't needed if you use explicit definitions
> for these flags. You already introduced the NVME_ID_NS_NSFEAT_DULBE
> and NVME_ID_NS_FLBAS_EXTENDED but they are restricted to extract bits.
>
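
As a rough illustration of the two points discussed in this message (named
NSFEAT flags instead of the `0x4 | 0x10` literal, and the NPDG/NPDA value
derived from the discard granularity and the block driver cluster size), here
is a minimal standalone sketch. All `example_*` / `EXAMPLE_*` names are
hypothetical and are not taken from the patch. The sketch assumes that a
discard granularity of 0 means "not set by the user", and that NPDG/NPDA are
reported as 0's based counts of logical blocks, which is my reading of the
NVMe specification rather than something stated in the thread.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical names; the bit values are the literals used in the patch. */
#define EXAMPLE_NSFEAT_DULBE    (1u << 2)   /* 0x04: DULBE supported */
#define EXAMPLE_NSFEAT_OPTPERF  (1u << 4)   /* 0x10: I/O optimization fields */

/* 4 KiB minimum, as chosen in the commit message. */
#define EXAMPLE_MIN_DISCARD_GRANULARITY (4u * 1024u)

/* Set both feature bits without disturbing the other NSFEAT bits. */
static inline uint8_t example_nsfeat_enable(uint8_t nsfeat)
{
    return (uint8_t)(nsfeat | EXAMPLE_NSFEAT_DULBE | EXAMPLE_NSFEAT_OPTPERF);
}

/*
 * Compute a 0's based NPDG/NPDA value.
 * All sizes are in bytes. discard_granularity == 0 is assumed to mean
 * "not set", and cluster_size == 0 is assumed to mean "unknown".
 */
static inline uint16_t example_npdg(uint32_t discard_granularity,
                                    uint32_t cluster_size,
                                    uint32_t lba_size)
{
    uint32_t granularity = discard_granularity;
    uint32_t blocks;

    assert(lba_size > 0);

    if (granularity == 0) {
        granularity = EXAMPLE_MIN_DISCARD_GRANULARITY;
    }
    /* A larger cluster size (e.g. qcow2) takes precedence. */
    if (cluster_size > granularity) {
        granularity = cluster_size;
    }

    blocks = granularity / lba_size;
    if (blocks == 0) {
        blocks = 1;
    }
    assert(blocks <= 65536u);

    return (uint16_t)(blocks - 1);
}
```

With 512 B LBAs and no explicit granularity this yields 7 (eight blocks of
512 B per 4 KiB); with 4 KiB LBAs on a 64 KiB cluster image it yields 15.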

Re: [PATCH v3 00/15] raspi: add the bcm2835 cprman clock manager

2020-10-22 Thread Guenter Roeck

On 10/22/20 3:06 PM, Philippe Mathieu-Daudé wrote:
> Cc'ing Guenter who had a similar patch and might be interested
> to test :)
> 
I applied the series on top of qemu mainline and ran all my tests with it
(raspi2 with qemu-system-arm as well as qemu-system-aarch64, and raspi3
in both big endian and little endian mode with qemu-system-aarch64).
All tests passed without errors or kernel warnings.

For the series:

Tested-by: Guenter Roeck 

Guenter

> patch 16/15 fixup:
> https://www.mail-archive.com/qemu-devel@nongnu.org/msg752113.html
> 
> On 10/10/20 3:57 PM, Luc Michel wrote:
>> v2 -> v3:
>>    - patch 03: moved clock_new definition to hw/core/clock.c [Phil]
>>    - patch 03: commit message typo [Clement]
>>    - patch 10: clarifications around the CM_CTL/CM_DIV mux registers.
>>    reg_cm replaced with reg_ctl and reg_div. Add some
>>    comments for clarity. [Phil]
>>    - patch 10: fixed update_mux_from_cm not matching the CM_DIV offset
>>    correctly. [Phil]
>>    - patch 11: replaced manual bitfield extraction with extract32 [Phil]
>>    - patch 11: added a visual representation of CM_DIV for clarity [Phil]
>>    - patch 11: added a missing return in clock_mux_update.
>>
>> v1 -> v2:
>>    - patch 05: Added a comment about MMIO .valid constraints [Phil]
>>    - patch 05: Added MMIO .impl [Phil]
>>    - patch 05: Moved init_internal_clock to the public clock API, renamed
>>  clock_new (new patch 03) [Phil]
>>    - patch 11: use muldiv64 for clock mux frequency output computation [Phil]
>>    - patch 11: add a check for null divisor (Phil: I dropped your r-b)
>>    - Typos, formatting, naming, style [Phil]
>>
>> Patches without review: 03, 11, 13
>>
>> Hi,
>>
>> This series add the BCM2835 CPRMAN clock manager peripheral to the
>> Raspberry Pi machine.
>>
>> Patches 1-4 are preliminary changes, patches 5-13 are the actual
>> implementation.
>>
>> The two last patches add a clock input to the PL011 and
>> connect it to the CPRMAN.
>>
>> This series has been tested with Linux 5.4.61 (the current raspios
>> version). It fixes the kernel Oops at boot time due to invalid UART
>> clock value, and other warnings/errors here and there because of bad
>> clocks or lack of CPRMAN.
>>
>> Here is the clock tree as seen by Linux when booted in QEMU:
>> (/sys/kernel/debug/clk/clk_summary with some columns removed)
>>
>>  enable  prepare
>>     clock count    count  rate
>> -
>>   otg 0    0 48000
>>   osc 5    5  1920
>>  gp2  1    1 32768
>>  tsens    0    0   192
>>  otp  0    0   480
>>  timer    0    0   102
>>  pllh 4    4 86400
>>     pllh_pix_prediv   1    1   3375000
>>    pllh_pix   0    0    337500
>>     pllh_aux  1    1 21600
>>    vec    0    0 10800
>>     pllh_rcal_prediv  1    1   3375000
>>    pllh_rcal  0    0    337500
>>  plld 3    3    200024
>>     plld_dsi1 0    0   7812501
>>     plld_dsi0 0    0   7812501
>>     plld_per  3    3 50006
>>    gp1    1    1  2500
>>    uart   1    2  47999625
>>     plld_core 2    2 50006
>>    sdram  0    0 16668
>>  pllc 3    3    24
>>     pllc_per  1    1    12
>>    emmc   0    0 2
>>     pllc_core2    0    0   9375000
>>     pllc_core1    0    0   9375000
>>     pllc_core0    2    2    12
>>    vpu    1    1 7
>>   aux_spi2    0    0 7
>>   aux_spi1    0    0 7
>>   aux_uart    0    0 7
>>   peri_image  0    0 7
>>  plla 2    2    225000
>>     plla_ccp2 0    0   8789063
>>     plla_dsi0 0    0   8789063
>>     plla_core 1    1 75000
>>    h264   0    0 25000
>>    isp    0    0 25000
>>   dsi1p   0    0 0
>>   dsi0p   0    0 0
>>   dsi1e   0    0 0
>>   dsi0e   0    0 0
>>   cam1
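
The changelog quoted above mentions computing the clock mux output frequency
with muldiv64 and checking for a null divisor. A minimal standalone sketch of
that kind of computation follows. It assumes the CM_DIV value is a fixed-point
divisor with 12 integer bits above 12 fractional bits, and that a null divisor
yields an output of 0 Hz; both are assumptions of this sketch, not statements
from the series. The function name is hypothetical, and a plain 64-bit shift
stands in for muldiv64.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed CM_DIV layout: 12 integer bits over 12 fractional bits. */
#define EXAMPLE_CM_DIV_FRAC_BITS 12
#define EXAMPLE_CM_DIV_MASK      0xffffffu

/* Divide the input frequency by the fixed-point divisor in cm_div. */
static inline uint64_t example_mux_freq(uint64_t freq_in_hz, uint32_t cm_div)
{
    uint32_t div = cm_div & EXAMPLE_CM_DIV_MASK;

    /* Keep the shifted value inside 64 bits. */
    assert(freq_in_hz < (UINT64_C(1) << 52));

    /* Null divisor: report a stopped clock instead of dividing by zero. */
    if (div == 0) {
        return 0;
    }

    return (freq_in_hz << EXAMPLE_CM_DIV_FRAC_BITS) / div;
}
```

For example, a 19.2 MHz input with a divisor of 1.5 (0x1800) gives 12.8 MHz.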

Re: [PATCH] pci: Refuse to hotplug PCI Devices when the Guest OS is not ready

2020-10-22 Thread David Gibson

On Thu, 22 Oct 2020 11:01:04 -0400
"Michael S. Tsirkin"  wrote:

> On Thu, Oct 22, 2020 at 05:50:51PM +0300, Marcel Apfelbaum wrote:
>  [...]  
> 
> Right. After detecting, just failing unconditionally is a bit too
> simplistic IMHO.

There's also another factor here, which I thought I'd mentioned
already, but looks like I didn't: I think we're still missing some
details in what's going on.

The premise for this patch is that plugging while the indicator is in
transition state is allowed to fail in any way on the guest side.  I
don't think that's a reasonable interpretation, because it's unworkable
for physical hotplug.  If the indicator starts blinking while you're in
the middle of shoving a card in, you'd be in trouble.

So, what I'm assuming here is that while "don't plug while blinking" is
the instruction for the operator to obey as best they can, on the guest
side the rule has to be "start blinking, wait a while and by the time
you leave blinking state again, you can be confident any plugs or
unplugs have completed".  Obviously still racy in the strict computer
science sense, but about the best you can do with slow humans in the
mix.

So, qemu should of course endeavour to follow that rule as though it
was a human operator on a physical machine and not plug when the
indicator is blinking.  *But* the qemu plug will in practice be fast
enough that if we're hitting real problems here, it suggests the guest
is still doing something wrong.

-- 
David Gibson 
Principal Software Engineer, Virtualization, Red Hat


Enable MSI-X support in PCIe device.

2020-10-22 Thread Douglas Su

To use MSI-X interrupts in my PCIe device, I initialize MSI-X in the
realize() function like this:

#define MYDEV_MSIX_VEC_NUM 5

void realize() {
    memory_region_init(&edu->msix, OBJECT(edu), "mydev-msix",
                       MYDEV_MSIX_SIZE);
    pci_register_bar(pdev, MYDEV_MSIX_IDX,
                     PCI_BASE_ADDRESS_SPACE_MEMORY, &edu->msix);

    rv = msix_init(pdev, MYDEV_MSIX_VEC_NUM,
                   &edu->msix, MYDEV_MSIX_IDX, MYDEV_MSIX_TABLE,
                   &edu->msix, MYDEV_MSIX_IDX, MYDEV_MSIX_PBA,
                   0, errp);
}

After this, I added simple logic that triggers an interrupt when a command is
written to a specific BAR0 address.

void trigger() {
    msix_notify(pdev, 1); // send vector 1 to msix
}

In the OS driver, MSI-X is enabled via `pci_alloc_irq_vectors()`, as
described in the Linux kernel documentation `Documentation/PCI/msi-howto.rst`
(I use kernel 5.7).

That function returns the expected number of vectors, but the driver never
receives an interrupt from the device. The IRQ returned by
`pci_irq_vector()` is registered via `request_irq()` in the driver.

Can anyone give a clue?
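
One thing that may be worth checking: as far as I recall, in QEMU of this era
msix_notify() silently returns when the requested vector has not been marked
as in use, and device models mark vectors with msix_vector_use() after
msix_init(). The realize() shown above does not do that. The standalone toy
model below only illustrates that gating behaviour; every `example_*` name is
hypothetical and none of it is QEMU code.

```c
#include <assert.h>
#include <stdbool.h>

#define EXAMPLE_VEC_NUM 5   /* mirrors MYDEV_MSIX_VEC_NUM in the message */

typedef struct {
    bool used[EXAMPLE_VEC_NUM];          /* set once at realize time */
    unsigned delivered[EXAMPLE_VEC_NUM]; /* count of delivered messages */
} ExampleMsixState;

/* Mark a vector as in use so that later notifications are delivered. */
static inline void example_vector_use(ExampleMsixState *s, unsigned vector)
{
    assert(vector < EXAMPLE_VEC_NUM);
    s->used[vector] = true;
}

/*
 * Notify a vector. Out-of-range or unused vectors are dropped without
 * any error, which is why a missing "use" call is easy to overlook.
 * Returns true if the message was delivered.
 */
static inline bool example_notify(ExampleMsixState *s, unsigned vector)
{
    if (vector >= EXAMPLE_VEC_NUM || !s->used[vector]) {
        return false;
    }
    s->delivered[vector]++;
    return true;
}
```

In this model, notifying vector 1 before marking it as used delivers nothing,
which would match the symptom described (vectors allocated in the guest, but
no interrupt ever arrives).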

Re: [PATCH] pci: Refuse to hotplug PCI Devices when the Guest OS is not ready

2020-10-22 Thread David Gibson

On Thu, 22 Oct 2020 16:55:10 +0300
Marcel Apfelbaum  wrote:

> Hi David, Michael,
> 
> On Thu, Oct 22, 2020 at 3:56 PM David Gibson  wrote:
> 
>  [...]  
>  [...]  
>  [...]  
>  [...]  
>  [...]  
>  [...]  
>  [...]  
>  [...]  
>  [...]  
>  [...]  
>  [...]  
>  [...]  
>  [...]  
>  [...]  
>  [...]  
> > >
> > > Probably the only way to handle for existing machine types.  
> >  
> 
> I agree
> 
> 
> > > For new ones, can't we queue it in host memory somewhere?  
> >
> >  
> I am not sure I understand what will be the flow.
>   - The user asks for a hotplug operation.
>   -  QEMU deferred operation.
> After that the operation may still fail, how would the user know if the
> operation
> succeeded or not?
> 
> 
> 
> > I'm not actually convinced we can't do that even for existing machine
> > types.  
> 
> 
> Is a Guest visible change, I don't think we can do it.

How is it a guest visible change?

> > So I'm a bit hesitant to suggest going ahead with this without
> > looking a bit closer at whether we can implement a wait-for-ready in
> > qemu, rather than forcing every user of qemu (human or machine) to do
> > so.
> 
> While I agree it is a pain from the usability point of view, hotplug
> operations
> are allowed to fail. This is not more than a corner case, ensuring the right
> response (gracefully erroring out) may be enough.
> 
> Thanks,
> Marcel
> 
> 
> 
>  [...]  


-- 
David Gibson 
Principal Software Engineer, Virtualization, Red Hat



Re: [PATCH] pci: Refuse to hotplug PCI Devices when the Guest OS is not ready

2020-10-22 Thread David Gibson

On Thu, 22 Oct 2020 09:15:28 -0400
"Michael S. Tsirkin"  wrote:

> On Thu, Oct 22, 2020 at 11:56:32PM +1100, David Gibson wrote:
>  [...]  
>  [...]  
>  [...]  
> > > 
> > > Probably the only way to handle for existing machine types.
> > > For new ones, can't we queue it in host memory somewhere?  
> > 
> > I'm not actually convinced we can't do that even for existing machine
> > types.  
> 
> The difficulty would be in migrating the extra "requested but deferred"
> state.

Ah, true.  Although we could block migration for the duration instead.
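
To make the deferral idea from this thread concrete, here is a minimal
standalone sketch: a plug request that arrives while the indicator is blinking
is queued and completed when the guest leaves the blinking state, and
migration is treated as blocked while a request is pending. The types and
function names are hypothetical, and this is only one possible shape of the
idea, not a proposal for the actual QEMU implementation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef enum {
    EXAMPLE_IND_OFF,
    EXAMPLE_IND_ON,
    EXAMPLE_IND_BLINK,
} ExampleIndicator;

typedef struct {
    ExampleIndicator indicator;
    bool occupied;      /* a device is plugged into the slot */
    bool plug_pending;  /* "requested but deferred" state */
} ExampleSlot;

/* Returns true if the plug completed immediately, false if deferred. */
static inline bool example_slot_request_plug(ExampleSlot *s)
{
    assert(s != NULL);

    if (s->indicator == EXAMPLE_IND_BLINK) {
        s->plug_pending = true;
        return false;
    }
    s->occupied = true;
    return true;
}

/* Guest updates the indicator; leaving BLINK completes a deferred plug. */
static inline void example_slot_set_indicator(ExampleSlot *s,
                                              ExampleIndicator next)
{
    bool leaving_blink = s->indicator == EXAMPLE_IND_BLINK &&
                         next != EXAMPLE_IND_BLINK;

    s->indicator = next;
    if (leaving_blink && s->plug_pending) {
        s->plug_pending = false;
        s->occupied = true;
    }
}

/* The pending flag is the extra state: block migration while it is set. */
static inline bool example_slot_blocks_migration(const ExampleSlot *s)
{
    return s->plug_pending;
}
```

The sketch leaves open how the user learns whether a deferred request
eventually succeeded, which is the concern Marcel raises above.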

-- 
David Gibson 
Principal Software Engineer, Virtualization, Red Hat



Re: [PULL 22/23] hw/sd: Fix incorrect populated function switch status data structure

2020-10-22 Thread Bin Meng

Hi Niek,

On Thu, Oct 22, 2020 at 11:20 PM Niek Linnenbank
 wrote:
>
> Hi Bin, Philippe,
>
> If I'm correct, the acceptance tests for orange pi need to be run with a flag 
> ARMBIAN_ARTIFACTS_CACHED set that explicitly allows them to be run using the 
> armbian mirror. So if you pass that flag on the same command that Philippe 
> gave, the tests should run.

Thank you for the hints. Actually I noticed the environment variable
ARMBIAN_ARTIFACTS_CACHED when looking at the test code, but after I
turned on the flag it still could not download the test asset from the
apt.armbian.com website.

> I have a follow-up question and I'm interested to hear your opinion on that, 
> Philippe. Should we perhaps update the orange pi tests (and maybe others) so 
> they use a reliable mirror that we can control, for example a github repo? I 
> would be happy to create a repo for that, at least for the orange pi tests. 
> But maybe there is already something planned as a more general solution for 
> artifacts of other machines as well?
>

Regards,
Bin

Re: Emulation for riscv

2020-10-22 Thread Alistair Francis

On Thu, Oct 22, 2020 at 4:58 PM Moises Arreola  wrote:
>
> Hello everyone, my name is Moses and I'm trying to set up a VM for a RISC-V 
> processor. I'm using the RISC-V Getting Started Guide, and on the final step 
> I'm getting an error while trying to launch the virtual machine using the command:

Hello,

Please don't use the RISC-V Getting Started Guide. Pretty much all of
the information there is out of date and wrong. Unfortunately we are
unable to correct it.

The QEMU wiki is a much better place for information:
https://wiki.qemu.org/Documentation/Platforms/RISCV

>
> sudo qemu-system-riscv64 -nographic -machine virt \
> -kernel linux/arch/riscv/boot/Image -append "root=/dev/vda ro console=ttyS0" \
> -drive file=busybox,format=raw,id=hd0 \
> -device virtio-blk-device,drive=hd0
>
> But what I get in return is a message telling me that the file I gave wasn't 
> the right one, the actual output is:
>
> qemu-system-riscv64: -drive file=busybox,format=raw,id=hd0: A regular file 
> was expected by the 'file' driver, but something else was given
>
> And I checked the file busybox with the "file" command and got the following:
> busybox: ELF 64-bit LSB executable, UCB RISC-V, version 1 (SYSV), dynamically 
> linked, interpreter /lib/ld-linux-riscv64-lp64d.so.1, for GNU/Linux 4.15.0, 
> stripped

That looks like an ELF, which won't work when attached as a drive.

How are you building this rootFS?

Alistair

>
> So I was wondering if the error message was related to qemu.
> Thanks in advance for answering; any suggestions are welcome.

RE: [PATCH] i386/cpu: Expose the PTWRITE to the guest

2020-10-22 Thread Kang, Luwei

> > PTWRITE provides a mechanism by which software can instrument the
> > Intel PT trace. The current implementation will mask off this feature
> > even when PTWRITE is supported on the host, because the Intel PT
> > CPUID is a constant value (ICX CPUID) in QEMU. This patch will expose
> > the PTWRITE feature to the guest.
> >
> > Signed-off-by: Luwei Kang 
> > ---
> >  target/i386/cpu.c | 24
> >  target/i386/cpu.h |  4
> >  2 files changed, 28 insertions(+)
> >
> > diff --git a/target/i386/cpu.c b/target/i386/cpu.c
> > index aeabdd5bd4..242ba8a870 100644
> > --- a/target/i386/cpu.c
> > +++ b/target/i386/cpu.c
> > @@ -672,6 +672,7 @@ static void x86_cpu_vendor_words2str(char *dst, uint32_t vendor1,
> >  #define TCG_XSAVE_FEATURES (CPUID_XSAVE_XSAVEOPT | CPUID_XSAVE_XGETBV1)
> >            /* missing:
> >            CPUID_XSAVE_XSAVEC, CPUID_XSAVE_XSAVES */
> > +#define TCG_14_0_EBX_FEATURES 0
> >  #define TCG_14_0_ECX_FEATURES 0
> >
> >  typedef enum FeatureWordType {
> > @@ -1302,6 +1303,26 @@ static FeatureWordInfo feature_word_info[FEATURE_WORDS] = {
> >  }
> >  },
> >
> > +[FEAT_14_0_EBX] = {
> > +.type = CPUID_FEATURE_WORD,
> > +.feat_names = {
> > +NULL, NULL, NULL, NULL,
> > +"ptwrite", NULL, NULL, NULL,
> > +NULL, NULL, NULL, NULL,
> > +NULL, NULL, NULL, NULL,
> > +NULL, NULL, NULL, NULL,
> > +NULL, NULL, NULL, NULL,
> > +NULL, NULL, NULL, NULL,
> > +NULL, NULL, NULL, NULL,
> > +},
> > +.cpuid = {
> > +.eax = 0x14,
> > +.needs_ecx = true, .ecx = 0,
> > +.reg = R_EBX,
> > +},
> > +.tcg_features = TCG_14_0_EBX_FEATURES,
> > +},
> > +
> 
> Please add a dependency on the processor tracing flag too.

Will fix it in the next version. Thanks.

Luwei Kang

> 
> Paolo
>

Emulation for riscv

2020-10-22 Thread Moises Arreola

Hello everyone, my name is Moses and I'm trying to set up a VM for a RISC-V
processor. I'm using the RISC-V Getting Started Guide, and on the final step
I'm getting an error while trying to launch the virtual machine using the
command:

sudo qemu-system-riscv64 -nographic -machine virt \
-kernel linux/arch/riscv/boot/Image -append "root=/dev/vda ro console=ttyS0" \
-drive file=busybox,format=raw,id=hd0 \
-device virtio-blk-device,drive=hd0

But what I get in return is a message telling me that the file I gave
wasn't the right one, the actual output is:

qemu-system-riscv64: -drive file=busybox,format=raw,id=hd0: A regular file
was expected by the 'file' driver, but something else was given

And I checked the file busybox with the "file" command and got the following:
busybox: ELF 64-bit LSB executable, UCB RISC-V, version 1 (SYSV),
dynamically linked, interpreter /lib/ld-linux-riscv64-lp64d.so.1, for
GNU/Linux 4.15.0, stripped

So I was wondering if the error message was related to qemu.
Thanks in advance for answering; any suggestions are welcome.

Re: [PATCH v3 00/15] raspi: add the bcm2835 cprman clock manager

2020-10-22 Thread Guenter Roeck

Hi,

On 10/22/20 3:06 PM, Philippe Mathieu-Daudé wrote:
> Cc'ing Guenter who had a similar patch and might be interested
> to test :)
> 

great. I think my patch doesn't work anymore since qemu 5.0 (at least not
for raspi3), and it was pretty hackish anyway. I'll give the series a try.

Guenter


[PATCH 4/4] pc: Use object_class_property_add_bool_ptr()

2020-10-22 Thread Eduardo Habkost

Get rid of manually written property getters/setters.

Signed-off-by: Eduardo Habkost 
---
Cc: Paolo Bonzini 
Cc: Richard Henderson 
Cc: Eduardo Habkost 
Cc: "Michael S. Tsirkin" 
Cc: Marcel Apfelbaum 
Cc: qemu-devel@nongnu.org
---
 hw/i386/pc.c | 57 +---
 1 file changed, 9 insertions(+), 48 deletions(-)

diff --git a/hw/i386/pc.c b/hw/i386/pc.c
index 4e323755d0..d5a5b1b2ae 100644
--- a/hw/i386/pc.c
+++ b/hw/i386/pc.c
@@ -1493,48 +1493,6 @@ static void pc_machine_set_vmport(Object *obj, Visitor *v, const char *name,
 visit_type_OnOffAuto(v, name, &pcms->vmport, errp);
 }
 
-static bool pc_machine_get_smbus(Object *obj, Error **errp)
-{
-PCMachineState *pcms = PC_MACHINE(obj);
-
-return pcms->smbus_enabled;
-}
-
-static void pc_machine_set_smbus(Object *obj, bool value, Error **errp)
-{
-PCMachineState *pcms = PC_MACHINE(obj);
-
-pcms->smbus_enabled = value;
-}
-
-static bool pc_machine_get_sata(Object *obj, Error **errp)
-{
-PCMachineState *pcms = PC_MACHINE(obj);
-
-return pcms->sata_enabled;
-}
-
-static void pc_machine_set_sata(Object *obj, bool value, Error **errp)
-{
-PCMachineState *pcms = PC_MACHINE(obj);
-
-pcms->sata_enabled = value;
-}
-
-static bool pc_machine_get_pit(Object *obj, Error **errp)
-{
-PCMachineState *pcms = PC_MACHINE(obj);
-
-return pcms->pit_enabled;
-}
-
-static void pc_machine_set_pit(Object *obj, bool value, Error **errp)
-{
-PCMachineState *pcms = PC_MACHINE(obj);
-
-pcms->pit_enabled = value;
-}
-
 static void pc_machine_get_max_ram_below_4g(Object *obj, Visitor *v,
 const char *name, void *opaque,
 Error **errp)
@@ -1697,14 +1655,17 @@ static void pc_machine_class_init(ObjectClass *oc, void *data)
 object_class_property_set_description(oc, PC_MACHINE_VMPORT,
 "Enable vmport (pc & q35)");
 
-object_class_property_add_bool(oc, PC_MACHINE_SMBUS,
-pc_machine_get_smbus, pc_machine_set_smbus);
+object_class_property_add_bool_ptr(oc, PC_MACHINE_SMBUS,
+   offsetof(PCMachineState, smbus_enabled),
+   OBJ_PROP_FLAG_READWRITE);
 
-object_class_property_add_bool(oc, PC_MACHINE_SATA,
-pc_machine_get_sata, pc_machine_set_sata);
+object_class_property_add_bool_ptr(oc, PC_MACHINE_SATA,
+   offsetof(PCMachineState, sata_enabled),
+   OBJ_PROP_FLAG_READWRITE);
 
-object_class_property_add_bool(oc, PC_MACHINE_PIT,
-pc_machine_get_pit, pc_machine_set_pit);
+object_class_property_add_bool_ptr(oc, PC_MACHINE_PIT,
+   offsetof(PCMachineState, pit_enabled),
+   OBJ_PROP_FLAG_READWRITE);
 }
 
 static const TypeInfo pc_machine_info = {
-- 
2.28.0
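
The conversion in this patch relies on registering a property by the offset of
a bool field inside the object struct, so that one generic getter/setter pair
can serve every such property. The standalone sketch below shows the idea
outside of QOM; the struct, field and function names are hypothetical and the
real helper additionally deals with visitors, flags and error reporting.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* A property is just a name plus the offset of a bool field. */
typedef struct {
    const char *name;
    ptrdiff_t offset;
} ExampleBoolProp;

/* Stand-in for an object struct with a few bool fields. */
typedef struct {
    int unrelated;
    bool smbus_enabled;
    bool sata_enabled;
} ExampleMachine;

/* Locate the field described by the property inside the object. */
static inline bool *example_bool_field(void *obj, const ExampleBoolProp *prop)
{
    return (bool *)((char *)obj + prop->offset);
}

/* Generic getter: works for any bool field, no per-property code. */
static inline bool example_prop_get(void *obj, const ExampleBoolProp *prop)
{
    return *example_bool_field(obj, prop);
}

/* Generic setter: writes the field and nothing else. */
static inline void example_prop_set(void *obj, const ExampleBoolProp *prop,
                                    bool value)
{
    *example_bool_field(obj, prop) = value;
}
```

A property would be described as `{ "sata", offsetof(ExampleMachine,
sata_enabled) }`. Because the generic setter only stores the value, a property
whose setter has side effects cannot be expressed this way and still needs a
hand-written accessor.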

[PATCH 2/4] authz/listfile: Use object_class_property_add_bool_ptr()

2020-10-22 Thread Eduardo Habkost

Signed-off-by: Eduardo Habkost 
---
Cc: "Daniel P. Berrangé" 
Cc: qemu-devel@nongnu.org
---
 authz/listfile.c | 27 +++
 1 file changed, 3 insertions(+), 24 deletions(-)

diff --git a/authz/listfile.c b/authz/listfile.c
index aaf930453d..911c4e45f2 100644
--- a/authz/listfile.c
+++ b/authz/listfile.c
@@ -184,27 +184,6 @@ qauthz_list_file_prop_get_filename(Object *obj,
 }
 
 
-static void
-qauthz_list_file_prop_set_refresh(Object *obj,
-  bool value,
-  Error **errp G_GNUC_UNUSED)
-{
-QAuthZListFile *fauthz = QAUTHZ_LIST_FILE(obj);
-
-fauthz->refresh = value;
-}
-
-
-static bool
-qauthz_list_file_prop_get_refresh(Object *obj,
-  Error **errp G_GNUC_UNUSED)
-{
-QAuthZListFile *fauthz = QAUTHZ_LIST_FILE(obj);
-
-return fauthz->refresh;
-}
-
-
 static void
 qauthz_list_file_finalize(Object *obj)
 {
@@ -227,9 +206,9 @@ qauthz_list_file_class_init(ObjectClass *oc, void *data)
 object_class_property_add_str(oc, "filename",
   qauthz_list_file_prop_get_filename,
   qauthz_list_file_prop_set_filename);
-object_class_property_add_bool(oc, "refresh",
-   qauthz_list_file_prop_get_refresh,
-   qauthz_list_file_prop_set_refresh);
+object_class_property_add_bool_ptr(oc, "refresh",
+   offsetof(QAuthZListFile, refresh),
+   OBJ_PROP_FLAG_READWRITE);
 
 authz->is_allowed = qauthz_list_file_is_allowed;
 }
-- 
2.28.0

[PATCH 3/4] machine: Use object_class_property_add_bool_ptr() when possible

2020-10-22 Thread Eduardo Habkost

Get rid of some manually written property getters/setters.

Not all properties could be converted because they have extra
logic in the property setter.

Signed-off-by: Eduardo Habkost 
---
Cc: Eduardo Habkost 
Cc: Marcel Apfelbaum 
Cc: qemu-devel@nongnu.org
---
 hw/core/machine.c | 78 ---
 1 file changed, 13 insertions(+), 65 deletions(-)

diff --git a/hw/core/machine.c b/hw/core/machine.c
index d740a7e963..21cad22b9e 100644
--- a/hw/core/machine.c
+++ b/hw/core/machine.c
@@ -325,34 +325,6 @@ static void machine_set_dt_compatible(Object *obj, const char *value, Error **er
 ms->dt_compatible = g_strdup(value);
 }
 
-static bool machine_get_dump_guest_core(Object *obj, Error **errp)
-{
-MachineState *ms = MACHINE(obj);
-
-return ms->dump_guest_core;
-}
-
-static void machine_set_dump_guest_core(Object *obj, bool value, Error **errp)
-{
-MachineState *ms = MACHINE(obj);
-
-ms->dump_guest_core = value;
-}
-
-static bool machine_get_mem_merge(Object *obj, Error **errp)
-{
-MachineState *ms = MACHINE(obj);
-
-return ms->mem_merge;
-}
-
-static void machine_set_mem_merge(Object *obj, bool value, Error **errp)
-{
-MachineState *ms = MACHINE(obj);
-
-ms->mem_merge = value;
-}
-
 static bool machine_get_usb(Object *obj, Error **errp)
 {
 MachineState *ms = MACHINE(obj);
@@ -368,20 +340,6 @@ static void machine_set_usb(Object *obj, bool value, Error **errp)
 ms->usb_disabled = !value;
 }
 
-static bool machine_get_graphics(Object *obj, Error **errp)
-{
-MachineState *ms = MACHINE(obj);
-
-return ms->enable_graphics;
-}
-
-static void machine_set_graphics(Object *obj, bool value, Error **errp)
-{
-MachineState *ms = MACHINE(obj);
-
-ms->enable_graphics = value;
-}
-
 static char *machine_get_firmware(Object *obj, Error **errp)
 {
 MachineState *ms = MACHINE(obj);
@@ -397,20 +355,6 @@ static void machine_set_firmware(Object *obj, const char *value, Error **errp)
 ms->firmware = g_strdup(value);
 }
 
-static void machine_set_suppress_vmdesc(Object *obj, bool value, Error **errp)
-{
-MachineState *ms = MACHINE(obj);
-
-ms->suppress_vmdesc = value;
-}
-
-static bool machine_get_suppress_vmdesc(Object *obj, Error **errp)
-{
-MachineState *ms = MACHINE(obj);
-
-return ms->suppress_vmdesc;
-}
-
 static void machine_set_enforce_config_section(Object *obj, bool value,
  Error **errp)
 {
@@ -449,7 +393,7 @@ static void machine_set_memory_encryption(Object *obj, const char *value,
  * so there's no point in it trying to merge areas.
  */
 if (value) {
-machine_set_mem_merge(obj, false, errp);
+ms->mem_merge = false;
 }
 }
 
@@ -827,13 +771,15 @@ static void machine_class_init(ObjectClass *oc, void *data)
 object_class_property_set_description(oc, "dt-compatible",
 "Overrides the \"compatible\" property of the dt root node");
 
-object_class_property_add_bool(oc, "dump-guest-core",
-machine_get_dump_guest_core, machine_set_dump_guest_core);
+object_class_property_add_bool_ptr(oc, "dump-guest-core",
+   offsetof(MachineState, dump_guest_core),
+   OBJ_PROP_FLAG_READWRITE);
 object_class_property_set_description(oc, "dump-guest-core",
 "Include guest memory in a core dump");
 
-object_class_property_add_bool(oc, "mem-merge",
-machine_get_mem_merge, machine_set_mem_merge);
+object_class_property_add_bool_ptr(oc, "mem-merge",
+   offsetof(MachineState, mem_merge),
+   OBJ_PROP_FLAG_READWRITE);
 object_class_property_set_description(oc, "mem-merge",
 "Enable/disable memory merge support");
 
@@ -842,8 +788,9 @@ static void machine_class_init(ObjectClass *oc, void *data)
 object_class_property_set_description(oc, "usb",
 "Set on/off to enable/disable usb");
 
-object_class_property_add_bool(oc, "graphics",
-machine_get_graphics, machine_set_graphics);
+object_class_property_add_bool_ptr(oc, "graphics",
+   offsetof(MachineState, enable_graphics),
+   OBJ_PROP_FLAG_READWRITE);
 object_class_property_set_description(oc, "graphics",
 "Set on/off to enable/disable graphics emulation");
 
@@ -852,8 +799,9 @@ static void machine_class_init(ObjectClass *oc, void *data)
 object_class_property_set_description(oc, "firmware",
 "Firmware image");
 
-object_class_property_add_bool(oc, "suppress-vmdesc",
-machine_get_suppress_vmdesc, machine_set_suppress_vmdesc);
+object_class_property_add_bool_ptr(oc, "suppress-vmdesc",
+   offsetof(MachineState, suppress_vmdesc),
+   OBJ_PROP_FLAG_READWRITE);

[PATCH 1/4] qom: object*_property_add_bool_ptr() functions

2020-10-22 Thread Eduardo Habkost

Provide helpers for registering boolean properties that simply
read/write a struct field, to reduce the need to manually write
property getters and setters.

Signed-off-by: Eduardo Habkost 
---
Cc: Paolo Bonzini 
Cc: "Daniel P. Berrangé" 
Cc: Eduardo Habkost 
Cc: qemu-devel@nongnu.org
---
 include/qom/object.h | 23 +++
 qom/object.c | 31 +++
 2 files changed, 54 insertions(+)

diff --git a/include/qom/object.h b/include/qom/object.h
index a124cf897d..954a26c567 100644
--- a/include/qom/object.h
+++ b/include/qom/object.h
@@ -1815,6 +1815,29 @@ ObjectProperty 
*object_class_property_add_uint64_ptr(ObjectClass *klass,
   ptrdiff_t offset,
   ObjectPropertyFlags flags);
 
+/**
+ * object_property_add_bool_ptr:
+ * @obj: the object to add a property to
+ * @name: the name of the property
+ * @v: pointer to value
+ * @flags: bitwise-or'd ObjectPropertyFlags
+ *
+ * Add a bool property in memory.  This function will add a
+ * property of type 'bool'.
+ *
+ * Returns: The newly added property on success, or %NULL on failure.
+ */
+ObjectProperty *
+object_property_add_bool_ptr(Object *obj, const char *name,
+ bool *v,
+ ObjectPropertyFlags flags);
+
+ObjectProperty *
+object_class_property_add_bool_ptr(ObjectClass *klass,
+   const char *name,
+   ptrdiff_t offset,
+   ObjectPropertyFlags flags);
+
 /**
  * object_property_add_alias:
  * @obj: the object to add a property to
diff --git a/qom/object.c b/qom/object.c
index 73f27b8b7e..2abc2bda33 100644
--- a/qom/object.c
+++ b/qom/object.c
@@ -2713,6 +2713,37 @@ object_class_property_add_uint64_ptr(ObjectClass *klass, 
const char *name,
   flags, offset);
 }
 
+static void property_visit_bool_ptr(Object *obj, Visitor *v, const char *name,
+void *opaque, Error **errp)
+{
+PointerProperty *prop = opaque;
+bool *field = pointer_property_get_ptr(obj, prop);
+visit_type_bool(v, name, field, errp);
+}
+
+ObjectProperty *
+object_property_add_bool_ptr(Object *obj, const char *name,
+ bool *v,
+ ObjectPropertyFlags flags)
+{
+return object_property_add_uint_ptr(obj, name, "bool",
+property_visit_bool_ptr,
+property_visit_bool_ptr,
+flags,
+(void *)v);
+}
+
+ObjectProperty *
+object_class_property_add_bool_ptr(ObjectClass *klass, const char *name,
+   ptrdiff_t offset,
+   ObjectPropertyFlags flags)
+{
+return object_class_property_add_uint_ptr(klass, name, "bool",
+  property_visit_bool_ptr,
+  property_visit_bool_ptr,
+  flags, offset);
+}
+
 typedef struct {
 Object *target_obj;
 char *target_name;
-- 
2.28.0

[PATCH 0/4] qom: Introduce object*_property_add_bool_ptr() functions

2020-10-22 Thread Eduardo Habkost

Based-on: 20201009160122.1662082-1-ehabk...@redhat.com
Git branch: https://github.com/ehabkost/qemu work/qom-bool-ptr-prop

This series introduces a helper to make it easier to register
simple boolean QOM properties.  It will be useful for simplifying
existing property code in some types that can't use
QDEV_PROP_BOOL yet (because they are not TYPE_DEVICE subtypes).
As examples, some TYPE_MACHINE and TYPE_QAUTHZ_LIST_FILE
properties are converted to use the new functions.
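
A rough way to picture what such a helper does, sketched in Python with ctypes and not with QEMU's C API: one generic accessor is shared by every boolean field, and the only per-property data is the field's byte offset inside the struct. The `MachineState` layout, `BoolPtrProperty`, and `add_bool_ptr` below are hypothetical stand-ins for illustration, not the functions added by this series.

```python
import ctypes


class MachineState(ctypes.Structure):
    # Hypothetical stand-in for the C struct; only the bool fields matter here.
    _fields_ = [
        ("dump_guest_core", ctypes.c_bool),
        ("mem_merge", ctypes.c_bool),
        ("enable_graphics", ctypes.c_bool),
    ]


class BoolPtrProperty:
    """One getter/setter pair shared by all bool fields, keyed by offset."""

    def __init__(self, offset):
        self.offset = offset

    def _field(self, obj):
        # Comparable to computing (bool *)((char *)obj + offset) in C:
        # the returned c_bool shares memory with the struct instance.
        return ctypes.c_bool.from_buffer(obj, self.offset)

    def get(self, obj):
        return self._field(obj).value

    def set(self, obj, value):
        self._field(obj).value = bool(value)


def add_bool_ptr(props, name, struct_type, field_name):
    """Register a property that simply reads/writes one struct field."""
    props[name] = BoolPtrProperty(getattr(struct_type, field_name).offset)
```

With this shape, registering a new boolean property is a single call that names the field, and no per-property getter or setter has to be written.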

This depends on the QOM property code cleanup that was also
submitted as part of:

  
  https://lore.kernel.org/qemu-devel/20201009160122.1662082-1-ehabk...@redhat.com
  Subject: [PATCH 00/12] qom: Make all -object types use only class properties

Eduardo Habkost (4):
  qom: object*_property_add_bool_ptr() functions
  authz/listfile: Use object_class_property_add_bool_ptr()
  machine: Use object_class_property_add_bool_ptr() when possible
  pc: Use object_class_property_add_bool_ptr()

 include/qom/object.h | 23 +
 authz/listfile.c | 27 ++-
 hw/core/machine.c| 78 
 hw/i386/pc.c | 57 +---
 qom/object.c | 31 ++
 5 files changed, 79 insertions(+), 137 deletions(-)

-- 
2.28.0

[Bug 1894869] Re: Chelsio T4 has old MSIX PBA offset bug

2020-10-22 Thread Bug Watch Updater

** Changed in: debian
   Status: In Progress => Invalid

-- 
You received this bug notification because you are a member of qemu-
devel-ml, which is subscribed to QEMU.
https://bugs.launchpad.net/bugs/1894869

Title:
  Chelsio T4 has old MSIX PBA offset bug

Status in QEMU:
  Invalid
Status in Debian:
  Invalid

Bug description:
  There exists a bug with Chelsio NICs T4 that causes the following
  error:

  kvm: -device vfio-
  pci,host=:83:00.7,id=hostpci1.7,bus=pci.0,addr=0x11.7: vfio
  :83:00.7: hardware reports invalid configuration, MSIX PBA outside
  of specified BAR

  I discovered this bug on a Proxmox system, and I was working with a
  downstream Proxmox developer to try to fix this issue. They provided
  me with the following change to make from line 1484 of hw/vfio/pci.c:

  static void vfio_msix_early_setup(VFIOPCIDevice *vdev, Error **errp)
    * is 0x1000, so we hard code that here.
    */
   if (vdev->vendor_id == PCI_VENDOR_ID_CHELSIO &&
  -(vdev->device_id & 0xff00) == 0x5800) {
  +((vdev->device_id & 0xff00) == 0x5800 ||
  + (vdev->device_id & 0xff00) == 0x1425)) {
   msix->pba_offset = 0x1000;
   } else if (vdev->msix_relo == OFF_AUTOPCIBAR_OFF) {
   error_setg(errp, "hardware reports invalid configuration, "

  However, I found that this did not fix the issue, so the bug appears
  to work differently than the one that was present on the T5 NICs, which
  has already been patched. I have attached the output of my lspci
  -nnkvv.
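
  One detail that may be worth checking, offered as an observation and not
  as a confirmed diagnosis: a comparison of the form
  (id & 0xff00) == 0x1425 cannot succeed for any 16-bit ID, because the
  mask clears the low byte while 0x1425 has a nonzero low byte. The Python
  sketch below (helper names are made up for illustration) checks this
  exhaustively.

```python
def mask_can_match(mask, expected):
    """A masked compare can only succeed if `expected` has no bits outside `mask`."""
    return (expected & ~mask) == 0


def matching_ids(mask, expected, id_bits=16):
    """Return every ID of the given width for which (id & mask) == expected."""
    return [i for i in range(1 << id_bits) if (i & mask) == expected]
```

  Under this check the 0x5800 comparison from the existing code matches a
  whole range of IDs, while the added 0x1425 comparison matches none.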


Re: [PATCH v27 17/17] qapi: Add VFIO devices migration stats in Migration stats

2020-10-22 Thread Alex Williamson

On Thu, 22 Oct 2020 16:42:07 +0530
Kirti Wankhede  wrote:

> Added amount of bytes transferred to the VM at destination by all VFIO
> devices
> 
> Signed-off-by: Kirti Wankhede 
> Reviewed-by: Dr. David Alan Gilbert 
> ---
>  hw/vfio/common.c| 20 
>  hw/vfio/migration.c | 10 ++
>  include/qemu/vfio-helpers.h |  3 +++
>  migration/migration.c   | 14 ++
>  monitor/hmp-cmds.c  |  6 ++
>  qapi/migration.json | 17 +
>  6 files changed, 70 insertions(+)
> 
> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
> index 9c879e5c0f62..8d0758eda9fa 100644
> --- a/hw/vfio/common.c
> +++ b/hw/vfio/common.c
> @@ -39,6 +39,7 @@
>  #include "trace.h"
>  #include "qapi/error.h"
>  #include "migration/migration.h"
> +#include "qemu/vfio-helpers.h"
>  
>  VFIOGroupList vfio_group_list =
>  QLIST_HEAD_INITIALIZER(vfio_group_list);
> @@ -292,6 +293,25 @@ const MemoryRegionOps vfio_region_ops = {
>   * Device state interfaces
>   */
>  
> +bool vfio_mig_active(void)
> +{
> +VFIOGroup *group;
> +VFIODevice *vbasedev;
> +
> +if (QLIST_EMPTY(&vfio_group_list)) {
> +return false;
> +}
> +
> +QLIST_FOREACH(group, &vfio_group_list, next) {
> +QLIST_FOREACH(vbasedev, &group->device_list, next) {
> +if (vbasedev->migration_blocker) {
> +return false;
> +}
> +}
> +}
> +return true;
> +}
> +
>  static bool vfio_devices_all_stopped_and_saving(VFIOContainer *container)
>  {
>  VFIOGroup *group;
> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
> index 77ee60a43ea5..b23e21c6de2b 100644
> --- a/hw/vfio/migration.c
> +++ b/hw/vfio/migration.c
> @@ -28,6 +28,7 @@
>  #include "pci.h"
>  #include "trace.h"
>  #include "hw/hw.h"
> +#include "qemu/vfio-helpers.h"
>  
>  /*
>   * Flags to be used as unique delimiters for VFIO devices in the migration
> @@ -45,6 +46,8 @@
>  #define VFIO_MIG_FLAG_DEV_SETUP_STATE   (0xef13ULL)
>  #define VFIO_MIG_FLAG_DEV_DATA_STATE(0xef14ULL)
>  
> +static int64_t bytes_transferred;
> +
>  static inline int vfio_mig_access(VFIODevice *vbasedev, void *val, int count,
>off_t off, bool iswrite)
>  {
> @@ -255,6 +258,7 @@ static int vfio_save_buffer(QEMUFile *f, VFIODevice 
> *vbasedev, uint64_t *size)
>  *size = data_size;
>  }
>  
> +bytes_transferred += data_size;
>  return ret;
>  }
>  
> @@ -776,6 +780,7 @@ static void vfio_migration_state_notifier(Notifier 
> *notifier, void *data)
>  case MIGRATION_STATUS_CANCELLING:
>  case MIGRATION_STATUS_CANCELLED:
>  case MIGRATION_STATUS_FAILED:
> +bytes_transferred = 0;
>  ret = vfio_migration_set_state(vbasedev,
>~(VFIO_DEVICE_STATE_SAVING | 
> VFIO_DEVICE_STATE_RESUMING),
>VFIO_DEVICE_STATE_RUNNING);
> @@ -862,6 +867,11 @@ err:
>  
>  /* -- */
>  
> +int64_t vfio_mig_bytes_transferred(void)
> +{
> +return bytes_transferred;
> +}
> +
>  int vfio_migration_probe(VFIODevice *vbasedev, Error **errp)
>  {
>  VFIOContainer *container = vbasedev->group->container;
> diff --git a/include/qemu/vfio-helpers.h b/include/qemu/vfio-helpers.h
> index 4491c8e1a6e9..7f7a46e6ef2d 100644
> --- a/include/qemu/vfio-helpers.h
> +++ b/include/qemu/vfio-helpers.h
> @@ -29,4 +29,7 @@ void qemu_vfio_pci_unmap_bar(QEMUVFIOState *s, int index, 
> void *bar,
>  int qemu_vfio_pci_init_irq(QEMUVFIOState *s, EventNotifier *e,
> int irq_type, Error **errp);
>  
> +bool vfio_mig_active(void);
> +int64_t vfio_mig_bytes_transferred(void);
> +
>  #endif


I don't think vfio-helpers is the right place for this, this header is
specifically for using util/vfio-helpers.c.  Would
include/hw/vfio/vfio-common.h work?


> diff --git a/migration/migration.c b/migration/migration.c
> index 0575ecb37953..8b2865d25ef4 100644
> --- a/migration/migration.c
> +++ b/migration/migration.c
> @@ -56,6 +56,7 @@
>  #include "net/announce.h"
>  #include "qemu/queue.h"
>  #include "multifd.h"
> +#include "qemu/vfio-helpers.h"
>  
>  #define MAX_THROTTLE  (128 << 20)  /* Migration transfer speed 
> throttling */
>  
> @@ -1002,6 +1003,17 @@ static void populate_disk_info(MigrationInfo *info)
>  }
>  }
>  
> +static void populate_vfio_info(MigrationInfo *info)
> +{
> +#ifdef CONFIG_LINUX

Use CONFIG_VFIO?  I get a build failure on qemu-system-avr

/usr/bin/ld: /tmp/tmp.3QbqxgbENl/build/../migration/migration.c:1012:
undefined reference to `vfio_mig_bytes_transferred'.  Thanks,

Alex

> +if (vfio_mig_active()) {
> +info->has_vfio = true;
> +info->vfio = g_malloc0(sizeof(*info->vfio));
> +info->vfio->transferred = vfio_mig_bytes_transferred();
> +}
> +#endif
> +}
> +
>  static void fill_source_migration_info(MigrationInfo *info)
>  {
>

Re: [PATCH v3 00/15] raspi: add the bcm2835 cprman clock manager

2020-10-22 Thread Philippe Mathieu-Daudé


Cc'ing Guenter who had a similar patch and might be interested
to test :)

patch 16/15 fixup:
https://www.mail-archive.com/qemu-devel@nongnu.org/msg752113.html

On 10/10/20 3:57 PM, Luc Michel wrote:

v2 -> v3:
   - patch 03: moved clock_new definition to hw/core/clock.c [Phil]
   - patch 03: commit message typo [Clement]
   - patch 10: clarifications around the CM_CTL/CM_DIV mux registers.
   reg_cm replaced with reg_ctl and reg_div. Add some
   comments for clarity. [Phil]
   - patch 10: fixed update_mux_from_cm not matching the CM_DIV offset
   correctly. [Phil]
   - patch 11: replaced manual bitfield extraction with extract32 [Phil]
   - patch 11: added a visual representation of CM_DIV for clarity [Phil]
   - patch 11: added a missing return in clock_mux_update.

v1 -> v2:
   - patch 05: Added a comment about MMIO .valid constraints [Phil]
   - patch 05: Added MMIO .impl [Phil]
   - patch 05: Moved init_internal_clock to the public clock API, renamed
 clock_new (new patch 03) [Phil]
   - patch 11: use muldiv64 for clock mux frequency output computation [Phil]
   - patch 11: add a check for null divisor (Phil: I dropped your r-b)
   - Typos, formatting, naming, style [Phil]

Patches without review: 03, 11, 13

Hi,

This series adds the BCM2835 CPRMAN clock manager peripheral to the
Raspberry Pi machine.

Patches 1-4 are preliminary changes, patches 5-13 are the actual
implementation.

The two last patches add a clock input to the PL011 and
connect it to the CPRMAN.

This series has been tested with Linux 5.4.61 (the current raspios
version). It fixes the kernel Oops at boot time due to invalid UART
clock value, and other warnings/errors here and there because of bad
clocks or lack of CPRMAN.

Here is the clock tree as seen by Linux when booted in QEMU:
(/sys/kernel/debug/clk/clk_summary with some columns removed)

 enable  prepare
clock countcount  rate
-
  otg 00 48000
  osc 55  1920
 gp2  11 32768
 tsens00   192
 otp  00   480
 timer00   102
 pllh 44 86400
pllh_pix_prediv   11   3375000
   pllh_pix   00337500
pllh_aux  11 21600
   vec00 10800
pllh_rcal_prediv  11   3375000
   pllh_rcal  00337500
 plld 33200024
plld_dsi1 00   7812501
plld_dsi0 00   7812501
plld_per  33 50006
   gp111  2500
   uart   12  47999625
plld_core 22 50006
   sdram  00 16668
 pllc 3324
pllc_per  1112
   emmc   00 2
pllc_core200   9375000
pllc_core100   9375000
pllc_core02212
   vpu11 7
  aux_spi200 7
  aux_spi100 7
  aux_uart00 7
  peri_image  00 7
 plla 22225000
plla_ccp2 00   8789063
plla_dsi0 00   8789063
plla_core 11 75000
   h264   00 25000
   isp00 25000
  dsi1p   00 0
  dsi0p   00 0
  dsi1e   00 0
  dsi0e   00 0
  cam100 0
  cam000 0
  dpi 00 0
  tec 00 0
  smi 00 0
  slim00 0
  gp0 00 0
  dft 00 0
  aveo00 0
  pcm 00 0
  pwm 00 0
  hsm 00

[RFC PATCH] hw/nvme: Move NVMe emulation out of hw/block/ directory

2020-10-22 Thread Philippe Mathieu-Daudé

As IDE used to be, NVMe emulation is becoming an active
subsystem. Move it into its own namespace.

Signed-off-by: Philippe Mathieu-Daudé 
---
Sent as RFC, in case it helps NVMe developers. As nvme-next
has patches queued, if the idea of moving seems useful, this
patch can be resent later.
---
 meson.build   |   1 +
 hw/{block/nvme.h => nvme/nvme-internal.h} |   4 +-
 hw/{block/nvme.c => nvme/core.c}  |   2 +-
 MAINTAINERS   |   2 +-
 hw/Kconfig|   1 +
 hw/block/Kconfig  |   5 -
 hw/block/meson.build  |   1 -
 hw/block/trace-events | 124 -
 hw/meson.build|   1 +
 hw/nvme/Kconfig   |   4 +
 hw/nvme/meson.build   |   1 +
 hw/nvme/trace-events  | 125 ++
 12 files changed, 137 insertions(+), 134 deletions(-)
 rename hw/{block/nvme.h => nvme/nvme-internal.h} (98%)
 rename hw/{block/nvme.c => nvme/core.c} (99%)
 create mode 100644 hw/nvme/Kconfig
 create mode 100644 hw/nvme/meson.build
 create mode 100644 hw/nvme/trace-events

diff --git a/meson.build b/meson.build
index 7627a0ae46e..24234ebd473 100644
--- a/meson.build
+++ b/meson.build
@@ -1356,6 +1356,7 @@
 'hw/misc',
 'hw/misc/macio',
 'hw/net',
+'hw/nvme',
 'hw/nvram',
 'hw/pci',
 'hw/pci-host',
diff --git a/hw/block/nvme.h b/hw/nvme/nvme-internal.h
similarity index 98%
rename from hw/block/nvme.h
rename to hw/nvme/nvme-internal.h
index 52ba794f2e9..824788d9c6e 100644
--- a/hw/block/nvme.h
+++ b/hw/nvme/nvme-internal.h
@@ -1,5 +1,5 @@
-#ifndef HW_NVME_H
-#define HW_NVME_H
+#ifndef HW_NVME_INTERNAL_H
+#define HW_NVME_INTERNAL_H
 
 #include "block/nvme.h"
 
diff --git a/hw/block/nvme.c b/hw/nvme/core.c
similarity index 99%
rename from hw/block/nvme.c
rename to hw/nvme/core.c
index 44fa5b90769..04391fbb083 100644
--- a/hw/block/nvme.c
+++ b/hw/nvme/core.c
@@ -67,8 +67,8 @@
 #include "qemu/log.h"
 #include "qemu/module.h"
 #include "qemu/cutils.h"
+#include "nvme-internal.h"
 #include "trace.h"
-#include "nvme.h"
 
 #define NVME_MAX_IOQPAIRS 0x
 #define NVME_DB_SIZE  4
diff --git a/MAINTAINERS b/MAINTAINERS
index 6a197bd358d..7132bbe3ff4 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -1875,7 +1875,7 @@ M: Keith Busch 
 M: Klaus Jensen 
 L: qemu-bl...@nongnu.org
 S: Supported
-F: hw/block/nvme*
+F: hw/nvme/
 F: tests/qtest/nvme-test.c
 T: git git://git.infradead.org/qemu-nvme.git nvme-next
 
diff --git a/hw/Kconfig b/hw/Kconfig
index 4de1797ffda..4ef9ca40ab0 100644
--- a/hw/Kconfig
+++ b/hw/Kconfig
@@ -21,6 +21,7 @@ source mem/Kconfig
 source misc/Kconfig
 source net/Kconfig
 source nubus/Kconfig
+source nvme/Kconfig
 source nvram/Kconfig
 source pci-bridge/Kconfig
 source pci-host/Kconfig
diff --git a/hw/block/Kconfig b/hw/block/Kconfig
index 2d17f481adc..c2213173ffe 100644
--- a/hw/block/Kconfig
+++ b/hw/block/Kconfig
@@ -22,11 +22,6 @@ config ECC
 config ONENAND
 bool
 
-config NVME_PCI
-bool
-default y if PCI_DEVICES
-depends on PCI
-
 config VIRTIO_BLK
 bool
 default y
diff --git a/hw/block/meson.build b/hw/block/meson.build
index 78cad8f7cba..96697f739c0 100644
--- a/hw/block/meson.build
+++ b/hw/block/meson.build
@@ -13,7 +13,6 @@
 softmmu_ss.add(when: 'CONFIG_SWIM', if_true: files('swim.c'))
 softmmu_ss.add(when: 'CONFIG_XEN', if_true: files('xen-block.c'))
 softmmu_ss.add(when: 'CONFIG_SH4', if_true: files('tc58128.c'))
-softmmu_ss.add(when: 'CONFIG_NVME_PCI', if_true: files('nvme.c'))
 
 specific_ss.add(when: 'CONFIG_VIRTIO_BLK', if_true: files('virtio-blk.c'))
 specific_ss.add(when: 'CONFIG_VHOST_USER_BLK', if_true: 
files('vhost-user-blk.c'))
diff --git a/hw/block/trace-events b/hw/block/trace-events
index ec94c56a416..31be146 100644
--- a/hw/block/trace-events
+++ b/hw/block/trace-events
@@ -27,130 +27,6 @@ virtio_blk_submit_multireq(void *vdev, void *mrb, int 
start, int num_reqs, uint6
 hd_geometry_lchs_guess(void *blk, int cyls, int heads, int secs) "blk %p LCHS 
%d %d %d"
 hd_geometry_guess(void *blk, uint32_t cyls, uint32_t heads, uint32_t secs, int 
trans) "blk %p CHS %u %u %u trans %d"
 
-# nvme.c
-# nvme traces for successful events
-pci_nvme_irq_msix(uint32_t vector) "raising MSI-X IRQ vector %u"
-pci_nvme_irq_pin(void) "pulsing IRQ pin"
-pci_nvme_irq_masked(void) "IRQ is masked"
-pci_nvme_dma_read(uint64_t prp1, uint64_t prp2) "DMA read, prp1=0x%"PRIx64" 
prp2=0x%"PRIx64""
-pci_nvme_map_addr(uint64_t addr, uint64_t len) "addr 0x%"PRIx64" len %"PRIu64""
-pci_nvme_map_addr_cmb(uint64_t addr, uint64_t len) "addr 0x%"PRIx64" len 
%"PRIu64""
-pci_nvme_map_prp(uint64_t trans_len, uint32_t len, uint64_t prp1, uint64_t 
prp2, int num_prps) "trans_len %"PRIu64" len %"PRIu32" prp1 0x%"PRIx64" prp2 
0x%"PRIx64" num_prps %d"
-pci_nvme_io_cmd(uint16_t cid, uint32_t nsid, uint16_t sqid, uint8_t opcode) 
"cid

Re: [PATCH v4 2/7] nbd: Add new qemu:allocation-depth metadata context

2020-10-22 Thread Eric Blake

On 10/14/20 6:52 AM, Vladimir Sementsov-Ogievskiy wrote:

>>   docs/interop/nbd.txt | 27 ++---
> 
> [..]
> 
>> +In the allocation depth context, bits 0 and 1 form a tri-state value:
>> +
>> +    bits 0-1: 00: NBD_STATE_DEPTH_UNALLOC, the extent is unallocated
>> +  01: NBD_STATE_DEPTH_LOCAL, the extent is allocated in the
>> +  top level of the image
> 
> Hmm. I always thought that "image" == file, so a backing chain is a chain
> of images, not several levels of one image. If so, then it should be "the
> top level image". And "levels of the image" may designate internal qcow2
> snapshots, which are unrelated here.

It's fuzzy.  From the guest point of view, we are serving a single guest
image by use of multiple files in the host.  I will do s/level/layer/,
to match the wording I already had on the next line:

>   10: NBD_STATE_DEPTH_BACKING, the extent is inherited from a
>   backing layer
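
For readers following along, the tri-state encoding quoted above can be
decoded as in this Python sketch. The constant names and values for 00, 01
and 10 come from the quoted text; the function name and the handling of the
remaining bit pattern (11) as "reserved" are assumptions made only for
illustration.

```python
NBD_STATE_DEPTH_UNALLOC = 0b00   # extent is unallocated
NBD_STATE_DEPTH_LOCAL = 0b01     # extent is allocated in the top layer
NBD_STATE_DEPTH_BACKING = 0b10   # extent is inherited from a backing layer
DEPTH_MASK = 0b11                # bits 0-1 carry the tri-state value

_NAMES = {
    NBD_STATE_DEPTH_UNALLOC: "unallocated",
    NBD_STATE_DEPTH_LOCAL: "local",
    NBD_STATE_DEPTH_BACKING: "backing",
}


def depth_state(flags):
    """Return the name of the tri-state value held in bits 0-1 of `flags`."""
    # 0b11 is not described in the quoted text; it is reported as
    # "reserved" here purely as an assumption of this sketch.
    return _NAMES.get(flags & DEPTH_MASK, "reserved")
```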

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3226
Virtualization:  qemu.org | libvirt.org

Re: [PATCH 03/12] qom: Make object_class_property_add_uint*_ptr() get offset

2020-10-22 Thread Eduardo Habkost

On Thu, Oct 22, 2020 at 07:06:58AM +0200, Markus Armbruster wrote:
[...]
> > I don't love object*_property_add_*_ptr() either.  I consider the
> > qdev property API better.  But we need a reasonable alternative,
> > because the qdev API can't be used by non-device objects yet.
> 
> Emphasis on *yet*: we should be able to lift it up into QOM, shouldn't
> we?

We should, yes.  My plan is to make object_property_*_ptr() and
PropertyInfo code converge until they look exactly the same and
become a single API.

-- 
Eduardo

Re: [PATCH v27 00/17] Add migration support for VFIO devices

2020-10-22 Thread Alex Williamson

On Thu, 22 Oct 2020 16:41:50 +0530
Kirti Wankhede  wrote:

> Hi,
> 
> This Patch set adds migration support for VFIO devices in QEMU.

We're cutting it pretty close for the 5.2 soft freeze, but clearly
we've seen this series a few times.  The key points for me are that I
no longer see anything that should adversely affect non-migration
support (aside from the easily fixed bugs noted) and I think our config
space vmstate is sane now, so we hopefully won't need to throw it away
and start over (experts, please verify).  I think there's still a respin
needed, but I hope that others can squeeze in a review, find and verify
issues they've noted previously, re-confirm their reviews and acks, and
maybe we can get this in by Tuesday.  If migration is broken, we can
fix that as we go, but the foundation looks reasonable enough to me.
Thanks,

Alex

Re: [PATCH v5 2/2] hw/block/nvme: add the dataset management command

2020-10-22 Thread Keith Busch

On Thu, Oct 22, 2020 at 11:02:04PM +0200, Philippe Mathieu-Daudé wrote:
> On 10/22/20 8:49 PM, Klaus Jensen wrote:
> > +static uint16_t nvme_dsm(NvmeCtrl *n, NvmeRequest *req)
> > +{
> > +NvmeNamespace *ns = req->ns;
> > +NvmeDsmCmd *dsm = (NvmeDsmCmd *) &req->cmd;
> > +NvmeDsmRange *range = NULL;
> 
> g_autofree?

Or just allocate the array on the stack. The upper limit is just 512
entries.

Re: [PATCH v2 15/20] iotests: 219: prepare for backup over block-copy

2020-10-22 Thread Vladimir Sementsov-Ogievskiy


23.07.2020 11:35, Max Reitz wrote:

On 01.06.20 20:11, Vladimir Sementsov-Ogievskiy wrote:

The further change of moving backup to be a on block-copy call will


-on?


make copying chunk-size and cluster-size a separate things. So, even


s/a/two/


with 64k cluster sized qcow2 image, default chunk would be 1M.
Test 219 depends on specified chunk-size. Update it for explicit
chunk-size for backup as for mirror.

Signed-off-by: Vladimir Sementsov-Ogievskiy 
---
  tests/qemu-iotests/219 | 13 +++--
  1 file changed, 7 insertions(+), 6 deletions(-)

diff --git a/tests/qemu-iotests/219 b/tests/qemu-iotests/219
index db272c5249..2bbed28f39 100755
--- a/tests/qemu-iotests/219
+++ b/tests/qemu-iotests/219
@@ -203,13 +203,13 @@ with iotests.FilePath('disk.img') as disk_path, \
  # but related to this also automatic state transitions like job
  # completion), but still get pause points often enough to avoid making 
this
  # test very slow, it's important to have the right ratio between speed and
-# buf_size.
+# copy-chunk-size.
  #
-# For backup, buf_size is hard-coded to the source image cluster size 
(64k),
-# so we'll pick the same for mirror. The slice time, i.e. the granularity
-# of the rate limiting is 100ms. With a speed of 256k per second, we can
-# get four pause points per second. This gives us 250ms per iteration,
-# which should be enough to stay deterministic.
+# Chose 64k copy-chunk-size both for mirror (by buf_size) and backup (by
+# x-max-chunk). The slice time, i.e. the granularity of the rate limiting
+# is 100ms. With a speed of 256k per second, we can get four pause points
+# per second. This gives us 250ms per iteration, which should be enough to
+# stay deterministic.


Don’t we also have to limit the number of workers to 1 so we actually
keep 250 ms per iteration instead of just finishing four requests
immediately, then pausing for a second?


The block-copy rate limiter works well enough: it will not start too many requests.
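
The arithmetic behind the comment in the hunk above can be checked with a
small Python sketch. The helper names are illustrative, and it assumes that
exactly one chunk is copied per rate-limited iteration.

```python
def pause_points_per_second(speed_bytes_per_s, chunk_bytes):
    """How many chunk-sized iterations fit into one second at this speed."""
    return speed_bytes_per_s / chunk_bytes


def iteration_ms(speed_bytes_per_s, chunk_bytes):
    """Time budget, in milliseconds, for copying one chunk at this speed."""
    return 1000 * chunk_bytes / speed_bytes_per_s
```

With 'speed': 262144 and a 65536-byte chunk this gives four pause points
per second and 250 ms per iteration. Under the same assumption, a 1M chunk
at the same speed would take 4 s per iteration, which is presumably why the
test pins the chunk size explicitly.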




  test_job_lifecycle(vm, 'drive-mirror', has_ready=True, job_args={
  'device': 'drive0-node',
@@ -226,6 +226,7 @@ with iotests.FilePath('disk.img') as disk_path, \
  'target': copy_path,
  'sync': 'full',
  'speed': 262144,
+'x-max-chunk': 65536,
  'auto-finalize': auto_finalize,
  'auto-dismiss': auto_dismiss,
  })







--
Best regards,
Vladimir

Re: Ramping up Continuous Fuzzing of Virtual Devices in QEMU

2020-10-22 Thread Philippe Mathieu-Daudé


On 10/22/20 6:39 PM, Daniel P. Berrangé wrote:

On Thu, Oct 22, 2020 at 12:24:16PM -0400, Alexander Bulekov wrote:

+CC Prasad

On 201022 1219, Alexander Bulekov wrote:

Hello,
QEMU was accepted into Google's oss-fuzz continuous-fuzzing platform [1]
earlier this year. The fuzzers currently running on oss-fuzz are based on my
2019 Google Summer of Code Project, which leveraged libfuzzer, qtest and libqos
to provide a framework for writing virtual-device fuzzers. At the moment, there
are a handful of fuzzers upstream and running on oss-fuzz (located in
tests/qtest/fuzz/). They fuzz only a few devices and serve mostly as
examples.

If everything goes well, soon a generic fuzzer [2] will land upstream, which
allows us to fuzz many configurations of QEMU, without any device-specific
code. To date this fuzzer has led to ~50 bug reports on launchpad. Once the
generic-fuzzer lands upstream, OSS-Fuzz will automatically start fuzzing a
bunch [3] of fuzzer configurations, and it is likely to find bugs.  Others will
also be able to send simple patches to add additional device configurations for
fuzzing.

The oss-fuzz process looks roughly like this:
 1. oss-fuzz fuzzes QEMU
 2. When oss-fuzz finds a bug, it reports it to a few [4] people that have
 access to reports and reproducers.
 3. If a fix is merged upstream, oss-fuzz will figure this out and mark the
 bug as fixed and make the report public 30 days later.
 4. After 90 days the bug (fixed or not) becomes public, so anyone can view
 it here https://bugs.chromium.org/p/oss-fuzz/issues/list
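
The disclosure timing in the steps above can be sketched as follows. This
is one reading of the list (public 30 days after a fix, and in any case no
later than 90 days after the report); the function name and the example
dates are made up for illustration.

```python
from datetime import date, timedelta

FIX_EMBARGO = timedelta(days=30)    # report goes public this long after a fix
HARD_DEADLINE = timedelta(days=90)  # report goes public regardless after this


def public_date(reported, fixed=None):
    """Date on which a report becomes public, given report and fix dates."""
    deadline = reported + HARD_DEADLINE
    if fixed is None:
        return deadline
    # A late fix does not extend the 90-day deadline.
    return min(fixed + FIX_EMBARGO, deadline)
```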

The oss-fuzz reports look like this:
https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=23701=qemu=2

This means that when oss-fuzz finds new bugs, the relevant developers do not
know about them unless someone with access files a separate report to the
list/launchpad. So far this hasn't been a problem, since oss-fuzz has only been
running some small example fuzzers. Once [2] lands upstream, we should
see a significant uptick in oss-fuzz reports, and I hope that we can develop a
process to ensure these bugs are properly dealt with. One option we have is to
make the reports public immediately and send notifications to
qemu-devel. This is the approach taken by some other projects on
oss-fuzz, such as LLVM. Though it's not on oss-fuzz, bugs found by
syzkaller in the kernel are also automatically sent to a public list.
The question is:

What approach should we take for dealing with bugs found on oss-fuzz?


If we assume that a non-negligible number of fuzz bugs will be exploitable
by a malicious guest OS to break out into the host, then I think it is
likely undesirable to make them public immediately without at least a basic
human triage step to catch possibly serious security issues.

Still a large % are likely to be low impact / not urgent to deal with so
we want a low overhead way to handle the fuzz output, which doesn't create
a bottleneck on a small number of people.

Overall my feeling is that we want to be able to farm out triage to the
respective subsystem maintainers, who can then decide whether the bug
needs notifying to the security team, or can be made public immediately.

I think ideally we would be doing triage in QEMU's own bug tracker, so
we don't need to have maintainers create accounts on a 3rd party tracker
to see reports.

Is it practical to identify a primary affected source file from the fuzz
crash report with any level of reliability, such that we could file a private
launchpad bug, automatically CC'ing a subsystem maintainer (and the security
team)?


Also what is not very clear is, who is supposed/going to fix these bugs?

I see this pattern:

a) bug found by human: human keeps asking for the bug
  1/ security issue: someone assigned to fix
  2/ else: if human keeps asking, the bug gets eventually fixed.

b) bug found by fuzzer:
  1/ security issue: someone assigned to fix
  2/ else: nothing happens because unlikely hit by user

Do we want to keep tracking b.2 bug reports? I think this is the case of
the ~50 Alexander mentioned.

Regards,

Phil.

Re: [PATCH v2 14/20] iotests: 185: prepare for backup over block-copy

2020-10-22 Thread Vladimir Sementsov-Ogievskiy


23.07.2020 11:19, Max Reitz wrote:

On 01.06.20 20:11, Vladimir Sementsov-Ogievskiy wrote:

The further change of moving backup to be a on block-copy call will


-on?


one :)




make copying chunk-size and cluster-size a separate things. So, even


s/a/two/


with 64k cluster sized qcow2 image, default chunk would be 1M.
185 test however assumes, that with speed limited to 64K, one iteration
would result in offset=64K. It will change, as first iteration would
result in offset=1M independently of speed.

So, let's explicitly specify, what test wants: set max-chunk to 64K, so
that one iteration is 64K. Note, that we don't need to limit
max-workers, as block-copy rate limitator will handle the situation and


*limitator


wouldn't start new workers when speed limit is obviously reached.

Signed-off-by: Vladimir Sementsov-Ogievskiy 
---
  tests/qemu-iotests/185 | 3 ++-
  tests/qemu-iotests/185.out | 2 +-
  2 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/tests/qemu-iotests/185 b/tests/qemu-iotests/185
index fd5e6ebe11..6afb3fc82f 100755
--- a/tests/qemu-iotests/185
+++ b/tests/qemu-iotests/185
@@ -182,7 +182,8 @@ _send_qemu_cmd $h \
'target': '$TEST_IMG.copy',
'format': '$IMGFMT',
'sync': 'full',
-  'speed': 65536 } }" \
+  'speed': 65536,
+  'x-max-chunk': 65536 } }" \


Out of curiosity, would it also suffice to disable copy offloading?


Note that x-max-chunk works even with copy offloading enabled: it sets the
maximum only for background copying, not for all operations.



But anyway:

Reviewed-by: Max Reitz 


  "return"
  
  # If we don't sleep here 'quit' command races with disk I/O

diff --git a/tests/qemu-iotests/185.out b/tests/qemu-iotests/185.out
index ac5ab16bc8..5232647972 100644
--- a/tests/qemu-iotests/185.out
+++ b/tests/qemu-iotests/185.out
@@ -61,7 +61,7 @@ Formatting 'TEST_DIR/t.qcow2.copy', fmt=qcow2 size=67108864 
cluster_size=65536 l
  
  { 'execute': 'qmp_capabilities' }

  {"return": {}}
-{ 'execute': 'drive-backup', 'arguments': { 'device': 'disk', 'target': 
'TEST_DIR/t.IMGFMT.copy', 'format': 'IMGFMT', 'sync': 'full', 'speed': 65536 } }
+{ 'execute': 'drive-backup', 'arguments': { 'device': 'disk', 'target': 
'TEST_DIR/t.IMGFMT.copy', 'format': 'IMGFMT', 'sync': 'full', 'speed': 65536, 
'x-max-chunk': 65536 } }
  Formatting 'TEST_DIR/t.qcow2.copy', fmt=qcow2 size=67108864 
cluster_size=65536 lazy_refcounts=off refcount_bits=16 compression_type=zlib
  {"timestamp": {"seconds":  TIMESTAMP, "microseconds":  TIMESTAMP}, "event": "JOB_STATUS_CHANGE", "data": 
{"status": "created", "id": "disk"}}
  {"timestamp": {"seconds":  TIMESTAMP, "microseconds":  TIMESTAMP}, "event": "JOB_STATUS_CHANGE", "data": 
{"status": "running", "id": "disk"}}







--
Best regards,
Vladimir

Re: [PATCH v2 13/20] iotests: 129: prepare for backup over block-copy

2020-10-22 Thread Vladimir Sementsov-Ogievskiy


23.07.2020 11:03, Max Reitz wrote:

On 01.06.20 20:11, Vladimir Sementsov-Ogievskiy wrote:

After introducing parallel async copy requests instead of plain
cluster-by-cluster copying loop, backup job may finish earlier than
final assertion in do_test_stop. Let's require slow backup explicitly
by specifying speed parameter.


Isn’t the problem really that block_set_io_throttle does absolutely
nothing?  (Which is a long-standing problem with 129.  I personally just
never run it, honestly.)


Hmm, would it be better to drop test_drive_backup() from here?




Signed-off-by: Vladimir Sementsov-Ogievskiy 
---
  tests/qemu-iotests/129 | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/tests/qemu-iotests/129 b/tests/qemu-iotests/129
index 4db5eca441..bca56b589d 100755
--- a/tests/qemu-iotests/129
+++ b/tests/qemu-iotests/129
@@ -76,7 +76,7 @@ class TestStopWithBlockJob(iotests.QMPTestCase):
  def test_drive_backup(self):
  self.do_test_stop("drive-backup", device="drive0",
target=self.target_img,
-  sync="full")
+  sync="full", speed=1024)
  
  def test_block_commit(self):

  self.do_test_stop("block-commit", device="drive0")







--
Best regards,
Vladimir

Re: [PATCH 1/2] target/m68k: remove useless qregs array

2020-10-22 Thread Philippe Mathieu-Daudé


On 10/22/20 10:29 PM, Laurent Vivier wrote:

They are unused since the target has been converted to TCG.

Fixes: e1f3808e03f7 ("Convert m68k target to TCG.")
Signed-off-by: Laurent Vivier 


Reviewed-by: Philippe Mathieu-Daudé 


---
  target/m68k/cpu.h | 4 
  1 file changed, 4 deletions(-)

diff --git a/target/m68k/cpu.h b/target/m68k/cpu.h
index 521ac67cdd04..9a6f0400fcfe 100644
--- a/target/m68k/cpu.h
+++ b/target/m68k/cpu.h
@@ -33,8 +33,6 @@
  #define OS_PACKED   6
  #define OS_UNSIZED  7
  
-#define MAX_QREGS 32

-
  #define EXCP_ACCESS 2   /* Access (MMU) error.  */
  #define EXCP_ADDRESS3   /* Address error.  */
  #define EXCP_ILLEGAL4   /* Illegal instruction.  */
@@ -139,8 +137,6 @@ typedef struct CPUM68KState {
  int pending_vector;
  int pending_level;
  
-uint32_t qregs[MAX_QREGS];

-
  /* Fields up to this point are cleared by a CPU reset */
  struct {} end_reset_fields;

Re: [PATCH v5 2/2] hw/block/nvme: add the dataset management command

2020-10-22 Thread Philippe Mathieu-Daudé


Hi Klaus,

On 10/22/20 8:49 PM, Klaus Jensen wrote:

From: Klaus Jensen 

Add support for the Dataset Management command and the Deallocate
attribute. Deallocation results in discards being sent to the underlying
block device. Whether or not the blocks are actually deallocated is
affected by the same factors as Write Zeroes (see previous commit).

  format | discard | dsm (512b)  dsm (4kb)  dsm (64kb)


Please use B/KiB units which are unambiguous (kb is for kbits)
(if you queue this yourself, you can fix when applying, no need
to repost).


 ------------------------------------------------------
  qcow2    ignore    n           n          n
  qcow2    unmap     n           n          y
  raw      ignore    n           n          n
  raw      unmap     n           y          y

Again, a raw format and 4kb LBAs are preferable.

In order to set the Namespace Preferred Deallocate Granularity and
Alignment fields (NPDG and NPDA), choose a sane minimum discard
granularity of 4kb. If we are using a passthru device supporting discard
at a 512b granularity, user should set the discard_granularity property


Ditto.


explicitly. NPDG and NPDA will also account for the cluster_size of the
block driver if required (i.e. for QCOW2).
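
As a worked example of the NPDG/NPDA rule described above, here is a minimal standalone sketch. npdg_field() and its parameters are hypothetical names, not QEMU functions; the sketch only mirrors the arithmetic in the patch (the larger of the discard granularity and the cluster size, expressed in logical blocks and stored as a 0's based value).

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper (not part of QEMU): compute the 0's based NPDG/NPDA
 * value from the discard granularity, the block driver cluster size and the
 * logical block size, all in bytes. cluster_size == 0 means "not reported".
 */
uint16_t npdg_field(uint32_t discard_granularity, uint32_t cluster_size,
                    uint32_t logical_block_size)
{
    uint32_t granularity = discard_granularity;

    assert(logical_block_size > 0);
    assert(discard_granularity >= logical_block_size);

    /* A larger cluster size (e.g. qcow2) overrides the configured value. */
    if (cluster_size && cluster_size > granularity) {
        granularity = cluster_size;
    }

    /* The field counts logical blocks and is 0's based. */
    return (uint16_t)(granularity / logical_block_size - 1);
}
```

With 512 B LBAs and a 4 KiB discard granularity this yields 7; if the image reports 64 KiB clusters it yields 127.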

See NVM Express 1.3d, Section 6.7 ("Dataset Management command").

Signed-off-by: Klaus Jensen 
---
  hw/block/nvme.h  |   2 +
  include/block/nvme.h |   7 ++-
  hw/block/nvme-ns.c   |  36 +--
  hw/block/nvme.c  | 101 ++-
  4 files changed, 140 insertions(+), 6 deletions(-)

diff --git a/hw/block/nvme.h b/hw/block/nvme.h
index e080a2318a50..574333caa3f9 100644
--- a/hw/block/nvme.h
+++ b/hw/block/nvme.h
@@ -28,6 +28,7 @@ typedef struct NvmeRequest {
  struct NvmeNamespace*ns;
  BlockAIOCB  *aiocb;
  uint16_tstatus;
+void*opaque;
  NvmeCqe cqe;
  NvmeCmd cmd;
  BlockAcctCookie acct;
@@ -60,6 +61,7 @@ static inline const char *nvme_io_opc_str(uint8_t opc)
  case NVME_CMD_WRITE:return "NVME_NVM_CMD_WRITE";
  case NVME_CMD_READ: return "NVME_NVM_CMD_READ";
  case NVME_CMD_WRITE_ZEROES: return "NVME_NVM_CMD_WRITE_ZEROES";
+case NVME_CMD_DSM:  return "NVME_NVM_CMD_DSM";
  default:return "NVME_NVM_CMD_UNKNOWN";
  }
  }
diff --git a/include/block/nvme.h b/include/block/nvme.h
index 966c3bb304bd..e95ff6ca9b37 100644
--- a/include/block/nvme.h
+++ b/include/block/nvme.h
@@ -990,7 +990,12 @@ typedef struct QEMU_PACKED NvmeIdNs {
  uint16_tnabspf;
  uint16_tnoiob;
  uint8_t nvmcap[16];
-uint8_t rsvd64[40];
+uint16_tnpwg;
+uint16_tnpwa;
+uint16_tnpdg;
+uint16_tnpda;
+uint16_tnows;
+uint8_t rsvd74[30];
  uint8_t nguid[16];
  uint64_teui64;
  NvmeLBAFlbaf[16];


If you consider "block/nvme.h" shared by 2 different subsystems,
it is better to add the changes in a separate patch. That way
the changes can be acked individually.


diff --git a/hw/block/nvme-ns.c b/hw/block/nvme-ns.c
index f1cc734c60f5..840651db7256 100644
--- a/hw/block/nvme-ns.c
+++ b/hw/block/nvme-ns.c
@@ -28,10 +28,14 @@
  #include "nvme.h"
  #include "nvme-ns.h"
  
-static void nvme_ns_init(NvmeNamespace *ns)

+#define MIN_DISCARD_GRANULARITY (4 * KiB)
+
+static int nvme_ns_init(NvmeNamespace *ns, Error **errp)


Hmm the Error* argument could be squashed in "hw/block/nvme:
support multiple namespaces". Else better split patch in dumb
units IMHO (maybe a reviewer's taste).


  {
+BlockDriverInfo bdi;
  NvmeIdNs *id_ns = &ns->id_ns;
  int lba_index = NVME_ID_NS_FLBAS_INDEX(ns->id_ns.flbas);
+int npdg, ret;
  
  ns->id_ns.dlfeat = 0x9;
  
@@ -43,8 +47,25 @@ static void nvme_ns_init(NvmeNamespace *ns)

  id_ns->ncap = id_ns->nsze;
  id_ns->nuse = id_ns->ncap;
  
-/* support DULBE */

-id_ns->nsfeat |= 0x4;
+/* support DULBE and I/O optimization fields */
+id_ns->nsfeat |= (0x4 | 0x10);


The comment helps, but isn't needed if you use explicit definitions
for these flags. You already introduced the NVME_ID_NS_NSFEAT_DULBE
and NVME_ID_NS_FLBAS_EXTENDED but they are restricted to extract bits.
This is why I personally prefer the registerfields API (see
"hw/registerfields.h").

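To illustrate the point about explicit definitions, a minimal sketch follows. The EXAMPLE_* names are made up for illustration, and the sketch uses plain shift macros rather than the registerfields API; only the two values 0x4 and 0x10 come from the patch.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative names only; the patch itself uses the raw values 0x4 and 0x10. */
#define EXAMPLE_NSFEAT_DULBE    (1u << 2)  /* deallocated/unwritten block error */
#define EXAMPLE_NSFEAT_OPTPERF  (1u << 4)  /* I/O optimization fields are valid */

/* Compile-time check that the named bits match the literals in the patch. */
static_assert((EXAMPLE_NSFEAT_DULBE | EXAMPLE_NSFEAT_OPTPERF) == (0x4 | 0x10),
              "named flags must match the raw values");

/* Return nsfeat with both features advertised. */
uint8_t example_nsfeat_enable(uint8_t nsfeat)
{
    return (uint8_t)(nsfeat | EXAMPLE_NSFEAT_DULBE | EXAMPLE_NSFEAT_OPTPERF);
}
```

With names like these the "support DULBE and I/O optimization fields" comment becomes redundant, since the expression documents itself.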

+
+npdg = ns->blkconf.discard_granularity / ns->blkconf.logical_block_size;
+
+ret = bdrv_get_info(blk_bs(ns->blkconf.blk), &bdi);
+if (ret < 0) {
+error_setg_errno(errp, -ret, "could not get block driver info");
+return ret;
+}
+
+if (bdi.cluster_size &&
+bdi.cluster_size > ns->blkconf.discard_granularity) {
+npdg = bdi.cluster_size / ns->blkconf.logical_block_size;
+}
+
+id_ns->npda = id_ns->npdg = npdg - 1;
+
+return 0;
  }
  
  static int nvme_ns_init_blk(NvmeCtrl *n,

Re: [PATCH v10 09/10] virtio-iommu: Set supported page size mask

2020-10-22 Thread Peter Xu

On Thu, Oct 22, 2020 at 06:39:37PM +0200, Jean-Philippe Brucker wrote:
> So what I'd like to do for next version:
> 
> * Set qemu_real_host_page_mask as the default page mask, instead of the
>   rather arbitrary TARGET_PAGE_MASK.

Oh, I thought TARGET_PAGE_MASK was intended - kernel commit 39b3b3c9cac1
("iommu/virtio: Reject IOMMU page granule larger than PAGE_SIZE", 2020-03-27)
explicitly introduced a check that virtio-iommu kernel driver will fail
directly if this psize is bigger than PAGE_SIZE in the guest.  So it sounds
reasonable to have the default value as PAGE_SIZE (if it's the same as
TARGET_PAGE_SIZE in QEMU, which seems true?).

For example, I'm thinking whether qemu_real_host_page_mask could be bigger than
PAGE_SIZE in the guest in some environments, then it seems virtio-iommu won't
boot anymore without assigned devices, because that extra check above will
always fail.
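
To make the concern concrete, here is a rough model of the guest-side check, assuming it compares the smallest page size in the advertised mask against the guest PAGE_SIZE. granule_fits_guest() is a hypothetical name, not the kernel's code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical model of the guest-side check: the smallest page size the
 * device advertises (lowest set bit of the mask) must not exceed the guest
 * PAGE_SIZE. Names are illustrative, not the kernel's.
 */
bool granule_fits_guest(uint64_t page_size_mask, uint64_t guest_page_size)
{
    assert(page_size_mask != 0);

    /* Isolate the lowest set bit, i.e. the smallest supported page size. */
    uint64_t smallest = page_size_mask & (~page_size_mask + 1);

    return smallest <= guest_page_size;
}
```

Under this model a mask derived from a 64 KiB host page size is rejected by a 4 KiB guest, while a 4 KiB mask is accepted by both 4 KiB and 64 KiB guests.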

>   Otherwise we cannot hotplug assigned
>   devices on a 64kB host, since TARGET_PAGE_MASK is pretty much always
>   4kB.
> 
> * Disallow changing the page size. It's simpler and works in
>   practice if we default to qemu_real_host_page_mask.
> 
> * For non-hotplug devices, allow changing the rest of the mask. For
>   hotplug devices, only warn about it.

Could I ask what's "the rest of the mask"?  On the driver side, I see that
viommu_domain_finalise() will pick the largest supported page size to use, if
so then we seem to be quite restricted on what page size we can use.

I'm also a bit curious about what scenario we plan to support in this initial
version, especially for ARM.  For x86, I think it's probably always 4k
everywhere so it's fairly simple.  I know little about the ARM side...

Thanks,

-- 
Peter Xu

Re: [PATCH v3 12/17] hw/block/nvme: add support for scatter gather lists

2020-10-22 Thread Philippe Mathieu-Daudé


On 9/22/20 10:45 AM, Klaus Jensen wrote:

From: Klaus Jensen 

For now, support the Data Block, Segment and Last Segment descriptor
types.

See NVM Express 1.3d, Section 4.4 ("Scatter Gather List (SGL)").

Signed-off-by: Klaus Jensen 
Reviewed-by: Keith Busch 
---
  include/block/nvme.h  |   6 +-
  hw/block/nvme.c   | 329 ++
  hw/block/trace-events |   4 +
  3 files changed, 279 insertions(+), 60 deletions(-)

diff --git a/include/block/nvme.h b/include/block/nvme.h
index 65e68a82c897..58647bcdad0b 100644
--- a/include/block/nvme.h
+++ b/include/block/nvme.h
@@ -412,9 +412,9 @@ typedef union NvmeCmdDptr {
  } NvmeCmdDptr;
  
  enum NvmePsdt {

-PSDT_PRP = 0x0,
-PSDT_SGL_MPTR_CONTIGUOUS = 0x1,
-PSDT_SGL_MPTR_SGL= 0x2,
+NVME_PSDT_PRP = 0x0,
+NVME_PSDT_SGL_MPTR_CONTIGUOUS = 0x1,
+NVME_PSDT_SGL_MPTR_SGL= 0x2,
  };
  
  typedef struct QEMU_PACKED NvmeCmd {

diff --git a/hw/block/nvme.c b/hw/block/nvme.c
index 3b901efd1ec0..c5d09ff1edf5 100644
--- a/hw/block/nvme.c
+++ b/hw/block/nvme.c
@@ -413,13 +413,262 @@ static uint16_t nvme_map_prp(NvmeCtrl *n, uint64_t prp1, 
uint64_t prp2,
  return NVME_SUCCESS;
  }
  
-static uint16_t nvme_dma_prp(NvmeCtrl *n, uint8_t *ptr, uint32_t len,

- uint64_t prp1, uint64_t prp2, DMADirection dir,
+/*
+ * Map 'nsgld' data descriptors from 'segment'. The function will subtract the
+ * number of bytes mapped in len.
+ */
+static uint16_t nvme_map_sgl_data(NvmeCtrl *n, QEMUSGList *qsg,
+  QEMUIOVector *iov,
+  NvmeSglDescriptor *segment, uint64_t nsgld,
+  size_t *len, NvmeRequest *req)
+{
+dma_addr_t addr, trans_len;
+uint32_t dlen;
+uint16_t status;
+
+for (int i = 0; i < nsgld; i++) {
+uint8_t type = NVME_SGL_TYPE(segment[i].type);
+
+switch (type) {
+case NVME_SGL_DESCR_TYPE_DATA_BLOCK:
+break;
+case NVME_SGL_DESCR_TYPE_SEGMENT:
+case NVME_SGL_DESCR_TYPE_LAST_SEGMENT:
+return NVME_INVALID_NUM_SGL_DESCRS | NVME_DNR;
+default:
+return NVME_SGL_DESCR_TYPE_INVALID | NVME_DNR;
+}
+
+dlen = le32_to_cpu(segment[i].len);
+if (!dlen) {
+continue;
+}
+
+if (*len == 0) {
+/*
+ * All data has been mapped, but the SGL contains additional
+ * segments and/or descriptors. The controller might accept
+ * ignoring the rest of the SGL.
+ */
+uint16_t sgls = le16_to_cpu(n->id_ctrl.sgls);
+if (sgls & NVME_CTRL_SGLS_EXCESS_LENGTH) {
+break;
+}
+
+trace_pci_nvme_err_invalid_sgl_excess_length(nvme_cid(req));
+return NVME_DATA_SGL_LEN_INVALID | NVME_DNR;
+}
+
+trans_len = MIN(*len, dlen);
+addr = le64_to_cpu(segment[i].addr);
+
+if (UINT64_MAX - addr < dlen) {
+return NVME_DATA_SGL_LEN_INVALID | NVME_DNR;
+}
+
+status = nvme_map_addr(n, qsg, iov, addr, trans_len);
+if (status) {
+return status;
+}
+
+*len -= trans_len;
+}
+
+return NVME_SUCCESS;
+}
+
+static uint16_t nvme_map_sgl(NvmeCtrl *n, QEMUSGList *qsg, QEMUIOVector *iov,
+ NvmeSglDescriptor sgl, size_t len,
   NvmeRequest *req)
+{
+/*
+ * Read the segment in chunks of 256 descriptors (one 4k page) to avoid
+ * dynamically allocating a potentially huge SGL. The spec allows the SGL
+ * to be larger (as in number of bytes required to describe the SGL
+ * descriptors and segment chain) than the command transfer size, so it is
+ * not bounded by MDTS.
+ */
+const int SEG_CHUNK_SIZE = 256;
+
+NvmeSglDescriptor segment[SEG_CHUNK_SIZE], *sgld, *last_sgld;
+uint64_t nsgld;
+uint32_t seg_len;
+uint16_t status;
+bool sgl_in_cmb = false;
+hwaddr addr;
+int ret;
+
+sgld = &sgl;
+addr = le64_to_cpu(sgl.addr);
+
+trace_pci_nvme_map_sgl(nvme_cid(req), NVME_SGL_TYPE(sgl.type), len);
+
+/*
+ * If the entire transfer can be described with a single data block it can
+ * be mapped directly.
+ */
+if (NVME_SGL_TYPE(sgl.type) == NVME_SGL_DESCR_TYPE_DATA_BLOCK) {
+status = nvme_map_sgl_data(n, qsg, iov, sgld, 1, &len, req);
+if (status) {
+goto unmap;
+}
+
+goto out;
+}
+
+/*
+ * If the segment is located in the CMB, the submission queue of the
+ * request must also reside there.
+ */
+if (nvme_addr_is_cmb(n, addr)) {
+if (!nvme_addr_is_cmb(n, req->sq->dma_addr)) {
+return NVME_INVALID_USE_OF_CMB | NVME_DNR;
+}
+
+sgl_in_cmb = true;
+}
+
+for (;;) {
+switch (NVME_SGL_TYPE(sgld->type)) {
+

Re: [PATCH v2 08/20] block/block-copy: add block_copy_cancel

2020-10-22 Thread Vladimir Sementsov-Ogievskiy


22.07.2020 14:28, Max Reitz wrote:

On 01.06.20 20:11, Vladimir Sementsov-Ogievskiy wrote:

Add function to cancel running async block-copy call. It will be used
in backup.

Signed-off-by: Vladimir Sementsov-Ogievskiy 
---
  include/block/block-copy.h |  7 +++
  block/block-copy.c | 22 +++---
  2 files changed, 26 insertions(+), 3 deletions(-)

diff --git a/include/block/block-copy.h b/include/block/block-copy.h
index d40e691123..370a194d3c 100644
--- a/include/block/block-copy.h
+++ b/include/block/block-copy.h
@@ -67,6 +67,13 @@ BlockCopyCallState *block_copy_async(BlockCopyState *s,
  void block_copy_set_speed(BlockCopyState *s, BlockCopyCallState *call_state,
uint64_t speed);
  
+/*

+ * Cancel running block-copy call.
+ * Cancel leaves block-copy state valid: dirty bits are correct and you may use
+ * cancel +  to emulate pause/resume.
+ */
+void block_copy_cancel(BlockCopyCallState *call_state);
+
  BdrvDirtyBitmap *block_copy_dirty_bitmap(BlockCopyState *s);
  void block_copy_set_skip_unallocated(BlockCopyState *s, bool skip);
  
diff --git a/block/block-copy.c b/block/block-copy.c

index 851d9c8aaf..b551feb6c2 100644
--- a/block/block-copy.c
+++ b/block/block-copy.c
@@ -44,6 +44,8 @@ typedef struct BlockCopyCallState {
  bool failed;
  bool finished;
  QemuCoSleepState *sleep_state;
+bool cancelled;
+Coroutine *canceller;
  
  /* OUT parameters */

  bool error_is_read;
@@ -582,7 +584,7 @@ block_copy_dirty_clusters(BlockCopyCallState *call_state)
  assert(QEMU_IS_ALIGNED(offset, s->cluster_size));
  assert(QEMU_IS_ALIGNED(bytes, s->cluster_size));
  
-while (bytes && aio_task_pool_status(aio) == 0) {

+while (bytes && aio_task_pool_status(aio) == 0 && !call_state->cancelled) {
  BlockCopyTask *task;
  int64_t status_bytes;
  
@@ -693,7 +695,7 @@ static int coroutine_fn block_copy_common(BlockCopyCallState *call_state)

  do {
  ret = block_copy_dirty_clusters(call_state);
  
-if (ret == 0) {

+if (ret == 0 && !call_state->cancelled) {
  ret = block_copy_wait_one(call_state->s, call_state->offset,
call_state->bytes);
  }
@@ -707,13 +709,18 @@ static int coroutine_fn 
block_copy_common(BlockCopyCallState *call_state)
   * 2. We have waited for some intersecting block-copy request
   *It may have failed and produced new dirty bits.
   */
-} while (ret > 0);
+} while (ret > 0 && !call_state->cancelled);


Would it be cleaner if block_copy_dirty_cluster() just returned
-ECANCELED?  Or would that pose a problem for its callers or the async
callback?



I'd prefer not to merge io ret with block-copy logic: who knows what underlying 
operations may return.. Can't it be _another_ ECANCELED?
And it would be just a sugar for block_copy_dirty_clusters() call, I'll have to 
check ->cancelled after block_copy_wait_one() anyway.
Also, for the next version I'll try to make it more obvious that a finished 
block-copy call is in one of three states:
 - success
 - failed
 - cancelled

Hmm. Also, cancelled should be OK for copy-on-write operations in the filter; it 
just means that we don't need to care anymore.
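
A small sketch of the three-state idea, keeping the cancel flag separate from the I/O return value. The enum and function names are hypothetical, and letting an I/O error win over cancellation is just one possible policy, not something the series defines.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative names; the series itself only has 'ret' and 'cancelled'. */
typedef enum {
    EXAMPLE_CALL_SUCCESS,
    EXAMPLE_CALL_FAILED,
    EXAMPLE_CALL_CANCELLED,
} ExampleCallResult;

/*
 * Classify a finished call from its I/O result and its cancel flag.
 * An I/O error wins over cancellation here so a failure is never hidden.
 */
ExampleCallResult example_call_result(int ret, bool cancelled)
{
    /* A finished call has ret <= 0; ret > 0 means "loop again". */
    assert(ret <= 0);

    if (ret < 0) {
        return EXAMPLE_CALL_FAILED;
    }
    return cancelled ? EXAMPLE_CALL_CANCELLED : EXAMPLE_CALL_SUCCESS;
}
```

Because the flag is never folded into ret, an -ECANCELED coming from an underlying operation is still reported as a plain failure.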


  if (call_state->cb) {
  call_state->cb(ret, call_state->error_is_read,
 call_state->s->progress_opaque);
  }
  
+if (call_state->canceller) {

+aio_co_wake(call_state->canceller);
+call_state->canceller = NULL;
+}
+
  call_state->finished = true;
  
  return ret;





--
Best regards,
Vladimir

Re: [PATCH v2 11/20] qapi: backup: add x-max-chunk and x-max-workers parameters

2020-10-22 Thread Vladimir Sementsov-Ogievskiy


22.07.2020 15:22, Max Reitz wrote:

On 01.06.20 20:11, Vladimir Sementsov-Ogievskiy wrote:

Add new parameters to configure future backup features. The patch
doesn't introduce aio backup requests (so we actually have only one
worker) nor requests larger than one cluster. Still, formally we
satisfy these maximums anyway, so add the parameters now, to facilitate
further patch which will really change backup job behavior.

Options are added with x- prefixes, as the only use for them are some
very conservative iotests which will be updated soon.

Signed-off-by: Vladimir Sementsov-Ogievskiy 
---
  qapi/block-core.json  |  9 -
  include/block/block_int.h |  7 +++
  block/backup.c| 21 +
  block/replication.c   |  2 +-
  blockdev.c|  5 +
  5 files changed, 42 insertions(+), 2 deletions(-)



[..]


@@ -422,6 +436,11 @@ BlockJob *backup_job_create(const char *job_id, 
BlockDriverState *bs,
  if (cluster_size < 0) {
  goto error;
  }
+if (max_chunk && max_chunk < cluster_size) {
+error_setg(errp, "Required max-chunk (%" PRIi64") is less than backup "


(missing a space after PRIi64)


+   "cluster size (%" PRIi64 ")", max_chunk, cluster_size);


Should this be noted in the QAPI documentation?


Hmm.. It makes sense, but I don't know what to write: should be >= job 
cluster_size? But then I'll have to describe the logic of 
backup_calculate_cluster_size()...


 (And perhaps the fact
that without copy offloading, we’ll never copy anything bigger than one
cluster at a time anyway?)


This is a parameter for background copying. Look at block_copy_task_create(): 
if call_state has its own max_chunk, it's used instead of the common copy_size 
(derived from cluster_size). But at the point of this patch, background copying 
through async block-copy is not yet implemented, so the parameter has no effect, 
as described in the commit message.
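
For illustration, the check from the patch and the chunk selection described above could be modelled as below. Both function names are hypothetical stand-ins, and the selection rule is only my reading of the description, not a copy of block_copy_task_create().

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/*
 * Hypothetical stand-in for the check added to backup_job_create():
 * return 0 if max_chunk is acceptable, -EINVAL otherwise.
 * max_chunk == 0 means "not set".
 */
int example_check_max_chunk(int64_t max_chunk, int64_t cluster_size)
{
    assert(cluster_size > 0);

    if (max_chunk && max_chunk < cluster_size) {
        return -EINVAL;
    }
    return 0;
}

/* Assumed rule: a per-call max_chunk, when set, replaces the common copy_size. */
int64_t example_effective_chunk(int64_t copy_size, int64_t call_max_chunk)
{
    return call_max_chunk ? call_max_chunk : copy_size;
}
```

So a max-chunk of 4 KiB is refused for a job with 64 KiB clusters, while 64 KiB or "unset" is accepted.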




+return NULL;
+}
  
  /*


[..]


--
Best regards,
Vladimir

Re: [PATCH v27 14/17] vfio: Dirty page tracking when vIOMMU is enabled

2020-10-22 Thread Alex Williamson

On Thu, 22 Oct 2020 16:42:04 +0530
Kirti Wankhede  wrote:

> When vIOMMU is enabled, register MAP notifier from log_sync when all
> devices in container are in stop and copy phase of migration. Call replay
> and get dirty pages from notifier callback.
> 
> Suggested-by: Alex Williamson 
> Signed-off-by: Kirti Wankhede 
> ---
>  hw/vfio/common.c  | 95 
> ---
>  hw/vfio/trace-events  |  1 +
>  include/hw/vfio/vfio-common.h |  1 +
>  3 files changed, 91 insertions(+), 6 deletions(-)
> 
> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
> index 2634387df948..98c2b1f9b190 100644
> --- a/hw/vfio/common.c
> +++ b/hw/vfio/common.c
> @@ -442,8 +442,8 @@ static bool 
> vfio_listener_skipped_section(MemoryRegionSection *section)
>  }
>  
>  /* Called with rcu_read_lock held.  */
> -static bool vfio_get_vaddr(IOMMUTLBEntry *iotlb, void **vaddr,
> -   bool *read_only)
> +static bool vfio_get_xlat_addr(IOMMUTLBEntry *iotlb, void **vaddr,
> +   ram_addr_t *ram_addr, bool *read_only)
>  {
>  MemoryRegion *mr;
>  hwaddr xlat;
> @@ -474,8 +474,17 @@ static bool vfio_get_vaddr(IOMMUTLBEntry *iotlb, void 
> **vaddr,
>  return false;
>  }
>  
> -*vaddr = memory_region_get_ram_ptr(mr) + xlat;
> -*read_only = !writable || mr->readonly;
> +if (vaddr) {
> +*vaddr = memory_region_get_ram_ptr(mr) + xlat;
> +}
> +
> +if (ram_addr) {
> +*ram_addr = memory_region_get_ram_addr(mr) + xlat;
> +}
> +
> +if (read_only) {
> +*read_only = !writable || mr->readonly;
> +}
>  
>  return true;
>  }
> @@ -485,7 +494,6 @@ static void vfio_iommu_map_notify(IOMMUNotifier *n, 
> IOMMUTLBEntry *iotlb)
>  VFIOGuestIOMMU *giommu = container_of(n, VFIOGuestIOMMU, n);
>  VFIOContainer *container = giommu->container;
>  hwaddr iova = iotlb->iova + giommu->iommu_offset;
> -bool read_only;
>  void *vaddr;
>  int ret;
>  
> @@ -501,7 +509,9 @@ static void vfio_iommu_map_notify(IOMMUNotifier *n, 
> IOMMUTLBEntry *iotlb)
>  rcu_read_lock();
>  
>  if ((iotlb->perm & IOMMU_RW) != IOMMU_NONE) {
> -if (!vfio_get_vaddr(iotlb, &vaddr, &read_only)) {
> +bool read_only;
> +
> +if (!vfio_get_xlat_addr(iotlb, &vaddr, NULL, &read_only)) {
>  goto out;
>  }
>  /*
> @@ -899,11 +909,84 @@ err_out:
>  return ret;
>  }
>  
> +static void vfio_iommu_map_dirty_notify(IOMMUNotifier *n, IOMMUTLBEntry 
> *iotlb)
> +{
> +VFIOGuestIOMMU *giommu = container_of(n, VFIOGuestIOMMU, dirty_notify);
> +VFIOContainer *container = giommu->container;
> +hwaddr iova = iotlb->iova + giommu->iommu_offset;
> +ram_addr_t translated_addr;
> +
> +trace_vfio_iommu_map_dirty_notify(iova, iova + iotlb->addr_mask);
> +
> +if (iotlb->target_as != &address_space_memory) {
> +error_report("Wrong target AS \"%s\", only system memory is allowed",
> + iotlb->target_as->name ? iotlb->target_as->name : 
> "none");
> +return;
> +}
> +
> +rcu_read_lock();
> +
> +if (vfio_get_xlat_addr(iotlb, NULL, &translated_addr, NULL)) {
> +int ret;
> +
> +ret = vfio_get_dirty_bitmap(container, iova, iotlb->addr_mask + 1,
> +translated_addr);
> +if (ret) {
> +error_report("vfio_iommu_map_dirty_notify(%p, 0x%"HWADDR_PRIx", "
> + "0x%"HWADDR_PRIx") = %d (%m)",
> + container, iova,
> + iotlb->addr_mask + 1, ret);
> +}
> +}
> +
> +rcu_read_unlock();
> +}
> +
>  static int vfio_sync_dirty_bitmap(VFIOContainer *container,
>MemoryRegionSection *section)
>  {
>  ram_addr_t ram_addr;
>  
> +if (memory_region_is_iommu(section->mr)) {
> +VFIOGuestIOMMU *giommu;
> +int ret = 0;
> +
> +QLIST_FOREACH(giommu, &container->giommu_list, giommu_next) {
> +if (MEMORY_REGION(giommu->iommu) == section->mr &&
> +giommu->n.start == section->offset_within_region) {
> +Int128 llend;
> +Error *err = NULL;
> +int idx = memory_region_iommu_attrs_to_index(giommu->iommu,
> +   
> MEMTXATTRS_UNSPECIFIED);
> +
> +llend = 
> int128_add(int128_make64(section->offset_within_region),
> +   section->size);
> +llend = int128_sub(llend, int128_one());
> +
> +iommu_notifier_init(&giommu->dirty_notify,
> +vfio_iommu_map_dirty_notify,
> +IOMMU_NOTIFIER_MAP,
> +section->offset_within_region,
> +int128_get64(llend),
> +idx);
> +ret =

[PATCH 2/2] target/m68k: Add vmstate definition for M68kCPU

2020-10-22 Thread Laurent Vivier

Signed-off-by: Laurent Vivier 
---
 target/m68k/cpu.h|   1 +
 target/m68k/cpu.c| 193 ++-
 target/m68k/fpu_helper.c |  10 +-
 3 files changed, 198 insertions(+), 6 deletions(-)

diff --git a/target/m68k/cpu.h b/target/m68k/cpu.h
index 9a6f0400fcfe..de5b9875fea3 100644
--- a/target/m68k/cpu.h
+++ b/target/m68k/cpu.h
@@ -179,6 +179,7 @@ int cpu_m68k_signal_handler(int host_signum, void *pinfo,
 uint32_t cpu_m68k_get_ccr(CPUM68KState *env);
 void cpu_m68k_set_ccr(CPUM68KState *env, uint32_t);
 void cpu_m68k_set_sr(CPUM68KState *env, uint32_t);
+void cpu_m68k_restore_fp_status(CPUM68KState *env);
 void cpu_m68k_set_fpcr(CPUM68KState *env, uint32_t val);
 
 
diff --git a/target/m68k/cpu.c b/target/m68k/cpu.c
index 72c545149e9b..b811a0bdde2d 100644
--- a/target/m68k/cpu.c
+++ b/target/m68k/cpu.c
@@ -260,10 +260,198 @@ static void m68k_cpu_initfn(Object *obj)
 cpu_set_cpustate_pointers(cpu);
 }
 
+#if defined(CONFIG_SOFTMMU)
+static bool fpu_needed(void *opaque)
+{
+M68kCPU *s = opaque;
+
+return m68k_feature(&s->env, M68K_FEATURE_CF_FPU) ||
+   m68k_feature(&s->env, M68K_FEATURE_FPU);
+}
+
+typedef struct m68k_FPReg_tmp {
+FPReg *parent;
+uint64_t tmp_mant;
+uint16_t tmp_exp;
+} m68k_FPReg_tmp;
+
+static void cpu_get_fp80(uint64_t *pmant, uint16_t *pexp, floatx80 f)
+{
+CPU_LDoubleU temp;
+
+temp.d = f;
+*pmant = temp.l.lower;
+*pexp = temp.l.upper;
+}
+
+static floatx80 cpu_set_fp80(uint64_t mant, uint16_t upper)
+{
+CPU_LDoubleU temp;
+
+temp.l.upper = upper;
+temp.l.lower = mant;
+return temp.d;
+}
+
+static int freg_pre_save(void *opaque)
+{
+m68k_FPReg_tmp *tmp = opaque;
+
+cpu_get_fp80(&tmp->tmp_mant, &tmp->tmp_exp, tmp->parent->d);
+
+return 0;
+}
+
+static int freg_post_load(void *opaque, int version)
+{
+m68k_FPReg_tmp *tmp = opaque;
+
+tmp->parent->d = cpu_set_fp80(tmp->tmp_mant, tmp->tmp_exp);
+
+return 0;
+}
+
+static const VMStateDescription vmstate_freg_tmp = {
+.name = "freg_tmp",
+.post_load = freg_post_load,
+.pre_save  = freg_pre_save,
+.fields = (VMStateField[]) {
+VMSTATE_UINT64(tmp_mant, m68k_FPReg_tmp),
+VMSTATE_UINT16(tmp_exp, m68k_FPReg_tmp),
+VMSTATE_END_OF_LIST()
+}
+};
+
+static const VMStateDescription vmstate_freg = {
+.name = "freg",
+.fields = (VMStateField[]) {
+VMSTATE_WITH_TMP(FPReg, m68k_FPReg_tmp, vmstate_freg_tmp),
+VMSTATE_END_OF_LIST()
+}
+};
+
+static int fpu_post_load(void *opaque, int version)
+{
+M68kCPU *s = opaque;
+
+cpu_m68k_restore_fp_status(&s->env);
+
+return 0;
+}
+
+const VMStateDescription vmmstate_fpu = {
+.name = "cpu/fpu",
+.version_id = 1,
+.minimum_version_id = 1,
+.needed = fpu_needed,
+.post_load = fpu_post_load,
+.fields = (VMStateField[]) {
+VMSTATE_UINT32(env.fpcr, M68kCPU),
+VMSTATE_UINT32(env.fpsr, M68kCPU),
+VMSTATE_STRUCT_ARRAY(env.fregs, M68kCPU, 8, 0, vmstate_freg, FPReg),
+VMSTATE_STRUCT(env.fp_result, M68kCPU, 0, vmstate_freg, FPReg),
+VMSTATE_END_OF_LIST()
+}
+};
+
+static bool cf_spregs_needed(void *opaque)
+{
+M68kCPU *s = opaque;
+
+return m68k_feature(&s->env, M68K_FEATURE_CF_ISA_A);
+}
+
+const VMStateDescription vmstate_cf_spregs = {
+.name = "cpu/cf_spregs",
+.version_id = 1,
+.minimum_version_id = 1,
+.needed = cf_spregs_needed,
+.fields = (VMStateField[]) {
+VMSTATE_UINT64_ARRAY(env.macc, M68kCPU, 4),
+VMSTATE_UINT32(env.macsr, M68kCPU),
+VMSTATE_UINT32(env.mac_mask, M68kCPU),
+VMSTATE_UINT32(env.rambar0, M68kCPU),
+VMSTATE_UINT32(env.mbar, M68kCPU),
+VMSTATE_END_OF_LIST()
+}
+};
+
+static bool cpu_68040_mmu_needed(void *opaque)
+{
+M68kCPU *s = opaque;
+
+return m68k_feature(&s->env, M68K_FEATURE_M68040);
+}
+
+const VMStateDescription vmstate_68040_mmu = {
+.name = "cpu/68040_mmu",
+.version_id = 1,
+.minimum_version_id = 1,
+.needed = cpu_68040_mmu_needed,
+.fields = (VMStateField[]) {
+VMSTATE_UINT32(env.mmu.ar, M68kCPU),
+VMSTATE_UINT32(env.mmu.ssw, M68kCPU),
+VMSTATE_UINT16(env.mmu.tcr, M68kCPU),
+VMSTATE_UINT32(env.mmu.urp, M68kCPU),
+VMSTATE_UINT32(env.mmu.srp, M68kCPU),
+VMSTATE_BOOL(env.mmu.fault, M68kCPU),
+VMSTATE_UINT32_ARRAY(env.mmu.ttr, M68kCPU, 4),
+VMSTATE_UINT32(env.mmu.mmusr, M68kCPU),
+VMSTATE_END_OF_LIST()
+}
+};
+
+static bool cpu_68040_spregs_needed(void *opaque)
+{
+M68kCPU *s = opaque;
+
+return m68k_feature(&s->env, M68K_FEATURE_M68040);
+}
+
+const VMStateDescription vmstate_68040_spregs = {
+.name = "cpu/68040_spregs",
+.version_id = 1,
+.minimum_version_id = 1,
+.needed = cpu_68040_spregs_needed,
+.fields = (VMStateField[]) {
+VMSTATE_UINT32(env.vbr, M68kCPU),
+VMSTATE_UINT32(env.cacr, M68kCPU),
+

[PATCH 0/2] target/m68k: add vmstate structure to migrate M68kCPU

2020-10-22 Thread Laurent Vivier

First patch is a cleanup patch.

The second patch defines the vmstate structure for M68kCPU.

I have tested the migration with my experimental machine virt-m68k.

I didn't check if q800 machine type has all the needed vmstates
for all the hardware devices it uses.

Thanks,
Laurent

Laurent Vivier (2):
  target/m68k: remove useless qregs array
  target/m68k: Add vmstate definition for M68kCPU

 target/m68k/cpu.h|   5 +-
 target/m68k/cpu.c| 193 ++-
 target/m68k/fpu_helper.c |  10 +-
 3 files changed, 198 insertions(+), 10 deletions(-)

-- 
2.26.2

[PATCH 1/2] target/m68k: remove useless qregs array

2020-10-22 Thread Laurent Vivier

They are unused since the target has been converted to TCG.

Fixes: e1f3808e03f7 ("Convert m68k target to TCG.")
Signed-off-by: Laurent Vivier 
---
 target/m68k/cpu.h | 4 
 1 file changed, 4 deletions(-)

diff --git a/target/m68k/cpu.h b/target/m68k/cpu.h
index 521ac67cdd04..9a6f0400fcfe 100644
--- a/target/m68k/cpu.h
+++ b/target/m68k/cpu.h
@@ -33,8 +33,6 @@
 #define OS_PACKED   6
 #define OS_UNSIZED  7
 
-#define MAX_QREGS 32
-
 #define EXCP_ACCESS 2   /* Access (MMU) error.  */
 #define EXCP_ADDRESS3   /* Address error.  */
 #define EXCP_ILLEGAL4   /* Illegal instruction.  */
@@ -139,8 +137,6 @@ typedef struct CPUM68KState {
 int pending_vector;
 int pending_level;
 
-uint32_t qregs[MAX_QREGS];
-
 /* Fields up to this point are cleared by a CPU reset */
 struct {} end_reset_fields;
 
-- 
2.26.2

Re: [PATCH v27 10/17] memory: Set DIRTY_MEMORY_MIGRATION when IOMMU is enabled

2020-10-22 Thread Alex Williamson

Paolo,

I think this would need your ack.  Thanks,

Alex


On Thu, 22 Oct 2020 16:42:00 +0530
Kirti Wankhede  wrote:

> mr->ram_block is NULL when mr->is_iommu is true, then fr.dirty_log_mask
> wasn't set correctly due to which memory listener's log_sync doesn't
> get called.
> This patch returns log_mask with DIRTY_MEMORY_MIGRATION set when
> IOMMU is enabled.
> 
> Signed-off-by: Kirti Wankhede 
> ---
>  softmmu/memory.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/softmmu/memory.c b/softmmu/memory.c
> index 403ff3abc99b..94f606e9d9d9 100644
> --- a/softmmu/memory.c
> +++ b/softmmu/memory.c
> @@ -1792,7 +1792,7 @@ bool memory_region_is_ram_device(MemoryRegion *mr)
>  uint8_t memory_region_get_dirty_log_mask(MemoryRegion *mr)
>  {
>  uint8_t mask = mr->dirty_log_mask;
> -if (global_dirty_log && mr->ram_block) {
> +if (global_dirty_log && (mr->ram_block || memory_region_is_iommu(mr))) {
>  mask |= (1 << DIRTY_MEMORY_MIGRATION);
>  }
>  return mask;
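
The effect of the one-line change quoted above can be modelled in isolation as follows. ExampleRegion is a minimal stand-in and not QEMU's MemoryRegion, and the bit index used for DIRTY_MEMORY_MIGRATION is an assumption made for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for QEMU's DIRTY_MEMORY_MIGRATION bit index; value assumed here. */
#define EXAMPLE_DIRTY_MEMORY_MIGRATION 2

/* Minimal model of the fields the function looks at; not QEMU's MemoryRegion. */
typedef struct {
    uint8_t dirty_log_mask;
    bool has_ram_block;
    bool is_iommu;
} ExampleRegion;

/* Same logic as the patched function, applied to the stand-in struct. */
uint8_t example_get_dirty_log_mask(const ExampleRegion *mr,
                                   bool global_dirty_log)
{
    uint8_t mask;

    assert(mr != NULL);

    mask = mr->dirty_log_mask;
    if (global_dirty_log && (mr->has_ram_block || mr->is_iommu)) {
        mask |= (uint8_t)(1u << EXAMPLE_DIRTY_MEMORY_MIGRATION);
    }
    return mask;
}
```

In this model an IOMMU region without a RAM block only gets the migration bit while global dirty logging is on, which is the case the commit message says was previously missed.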

Re: [PATCH v27 09/17] vfio: Add load state functions to SaveVMHandlers

2020-10-22 Thread Alex Williamson

On Thu, 22 Oct 2020 16:41:59 +0530
Kirti Wankhede  wrote:

> Sequence  during _RESUMING device state:
> While data for this device is available, repeat below steps:
> a. read data_offset from where user application should write data.
> b. write data of data_size to migration region from data_offset.
> c. write data_size which indicates vendor driver that data is written in
>staging buffer.
> 
> For user, data is opaque. User should write data in the same order as
> received.
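
The a/b/c sequence above can be sketched as a toy simulation. Everything here is illustrative: ExampleMigRegion is not the VFIO UAPI layout, the sizes are arbitrary, and the "vendor driver" is reduced to a function that copies out the staged bytes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

enum { EXAMPLE_REGION_SIZE = 16, EXAMPLE_DATA_OFFSET = 8 };

/* Toy migration region; the data section starts at data_offset inside mem. */
typedef struct {
    uint8_t mem[EXAMPLE_REGION_SIZE];
    uint64_t data_offset;   /* step a: read by the user */
    uint64_t data_size;     /* step c: written by the user */
    uint8_t received[64];   /* what the simulated driver has consumed */
    size_t received_len;
} ExampleMigRegion;

void example_region_init(ExampleMigRegion *r)
{
    memset(r, 0, sizeof(*r));
    r->data_offset = EXAMPLE_DATA_OFFSET;
}

/* Simulated vendor driver: consume data_size bytes from the data section. */
static void example_driver_consume(ExampleMigRegion *r)
{
    memcpy(r->received + r->received_len, r->mem + r->data_offset,
           (size_t)r->data_size);
    r->received_len += (size_t)r->data_size;
}

/* User side: feed 'len' bytes in order, one data section at a time. */
int example_resume(ExampleMigRegion *r, const uint8_t *data, size_t len)
{
    assert(r != NULL && data != NULL);

    if (len > sizeof(r->received) - r->received_len) {
        return -1;
    }

    while (len > 0) {
        /* a. read data_offset: where the data section starts */
        uint64_t offset = r->data_offset;
        size_t room = EXAMPLE_REGION_SIZE - (size_t)offset;
        size_t chunk = len < room ? len : room;

        /* b. write the next chunk to the region at data_offset */
        memcpy(r->mem + offset, data, chunk);

        /* c. write data_size so the driver knows the chunk is staged */
        r->data_size = chunk;
        example_driver_consume(r);

        data += chunk;
        len -= chunk;
    }
    return 0;
}
```

Feeding 20 bytes through an 8-byte data section takes three rounds (8, 8 and 4 bytes), and the simulated driver sees them in the original order.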
> 
> Signed-off-by: Kirti Wankhede 
> Reviewed-by: Neo Jia 
> Reviewed-by: Dr. David Alan Gilbert 
> ---
>  hw/vfio/migration.c  | 192 
> +++
>  hw/vfio/trace-events |   3 +
>  2 files changed, 195 insertions(+)
> 
> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
> index 5506cef15d88..46d05d230e2a 100644
> --- a/hw/vfio/migration.c
> +++ b/hw/vfio/migration.c
> @@ -257,6 +257,77 @@ static int vfio_save_buffer(QEMUFile *f, VFIODevice 
> *vbasedev, uint64_t *size)
>  return ret;
>  }
>  
> +static int vfio_load_buffer(QEMUFile *f, VFIODevice *vbasedev,
> +uint64_t data_size)
> +{
> +VFIORegion *region = &vbasedev->migration->region;
> +uint64_t data_offset = 0, size, report_size;
> +int ret;
> +
> +do {
> +ret = vfio_mig_read(vbasedev, &data_offset, sizeof(data_offset),
> +  region->fd_offset + 
> VFIO_MIG_STRUCT_OFFSET(data_offset));
> +if (ret < 0) {
> +return ret;
> +}
> +
> +if (data_offset + data_size > region->size) {
> +/*
> + * If data_size is greater than the data section of migration 
> region
> + * then iterate the write buffer operation. This case can occur 
> if
> + * size of migration region at destination is smaller than size 
> of
> + * migration region at source.
> + */
> +report_size = size = region->size - data_offset;
> +data_size -= size;
> +} else {
> +report_size = size = data_size;
> +data_size = 0;
> +}
> +
> +trace_vfio_load_state_device_data(vbasedev->name, data_offset, size);
> +
> +while (size) {
> +void *buf;
> +uint64_t sec_size;
> +bool buf_alloc = false;
> +
> +buf = get_data_section_size(region, data_offset, size, 
> &sec_size);
> +
> +if (!buf) {
> +buf = g_try_malloc(sec_size);
> +if (!buf) {
> +error_report("%s: Error allocating buffer ", __func__);
> +return -ENOMEM;
> +}
> +buf_alloc = true;
> +}
> +
> +qemu_get_buffer(f, buf, sec_size);
> +
> +if (buf_alloc) {
> +ret = vfio_mig_write(vbasedev, buf, sec_size,
> +region->fd_offset + data_offset);
> +g_free(buf);
> +
> +if (ret < 0) {
> +return ret;
> +}
> +}
> +size -= sec_size;
> +data_offset += sec_size;
> +}
> +
> +ret = vfio_mig_write(vbasedev, &report_size, sizeof(report_size),
> +region->fd_offset + 
> VFIO_MIG_STRUCT_OFFSET(data_size));
> +if (ret < 0) {
> +return ret;
> +}
> +} while (data_size);
> +
> +return 0;
> +}
> +
>  static int vfio_update_pending(VFIODevice *vbasedev)
>  {
>  VFIOMigration *migration = vbasedev->migration;
> @@ -293,6 +364,33 @@ static int vfio_save_device_config_state(QEMUFile *f, 
> void *opaque)
>  return qemu_file_get_error(f);
>  }
>  
> +static int vfio_load_device_config_state(QEMUFile *f, void *opaque)
> +{
> +VFIODevice *vbasedev = opaque;
> +uint64_t data;
> +
> +if (vbasedev->ops && vbasedev->ops->vfio_load_config) {
> +int ret;
> +
> +ret = vbasedev->ops->vfio_load_config(vbasedev, f);
> +if (ret) {
> +error_report("%s: Failed to load device config space",
> + vbasedev->name);
> +return ret;
> +}
> +}
> +
> +data = qemu_get_be64(f);
> +if (data != VFIO_MIG_FLAG_END_OF_STATE) {
> +error_report("%s: Failed loading device config space, "
> + "end flag incorrect 0x%"PRIx64, vbasedev->name, data);
> +return -EINVAL;
> +}
> +
> +trace_vfio_load_device_config_state(vbasedev->name);
> +return qemu_file_get_error(f);
> +}
> +
>  /* -- */
>  
>  static int vfio_save_setup(QEMUFile *f, void *opaque)
> @@ -477,12 +575,106 @@ static int vfio_save_complete_precopy(QEMUFile *f, 
> void *opaque)
>  return ret;
>  }
>  
> +static int vfio_load_setup(QEMUFile *f, void *opaque)
> +{
> +VFIODevice *vbasedev = opaque;
> +VFIOMigration *migration = vbasedev->migration;
> +int ret =

Re: [PATCH] gitlab-ci: Clone from GitLab itself

2020-10-22 Thread Alex Bennée



Philippe Mathieu-Daudé  writes:

> Let GitLab runners use GitLab repository directly.

Queued to testing/next, thanks.

> Suggested-by: Paolo Bonzini 
> Signed-off-by: Philippe Mathieu-Daudé 
> ---
>  .gitlab-ci.yml | 1 +
>  1 file changed, 1 insertion(+)
>
> diff --git a/.gitlab-ci.yml b/.gitlab-ci.yml
> index 66ad7aa5c22..ba77af51f2f 100644
> --- a/.gitlab-ci.yml
> +++ b/.gitlab-ci.yml
> @@ -24,6 +24,7 @@ include:
>image: $CI_REGISTRY_IMAGE/qemu/$IMAGE:latest
>before_script:
>  - JOBS=$(expr $(nproc) + 1)
> +- sed -i s,git.qemu.org/git,gitlab.com/qemu-project, .gitmodules
>script:
>  - mkdir build
>  - cd build


-- 
Alex Bennée
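
For readers less familiar with the sed expression in the patch above: it rewrites the submodule URLs in .gitmodules so that they point at the GitLab project instead of git.qemu.org. A minimal Python sketch of the same substitution follows. The function name and the sample .gitmodules entry are illustrative assumptions, not part of the patch.

```python
def rewrite_gitmodules(text: str) -> str:
    """Roughly mimic: sed -i s,git.qemu.org/git,gitlab.com/qemu-project, .gitmodules"""
    # sed without the 'g' flag replaces only the first match on each line,
    # hence count=1. sed treats '.' as "any character"; a literal match is
    # assumed to be good enough for this sketch.
    return "".join(
        line.replace("git.qemu.org/git", "gitlab.com/qemu-project", 1)
        for line in text.splitlines(keepends=True)
    )


# Hypothetical .gitmodules entry, used only for demonstration.
sample = (
    '[submodule "roms/example"]\n'
    "\tpath = roms/example\n"
    "\turl = https://git.qemu.org/git/example.git\n"
)

if __name__ == "__main__":
    print(rewrite_gitmodules(sample), end="")
```

Lines that do not mention git.qemu.org/git pass through unchanged, which is why the command is safe to run unconditionally in before_script.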

Re: [PATCH v3] virtiofsd: add container-friendly -o sandbox=chroot option

2020-10-22 Thread Vivek Goyal

On Thu, Oct 22, 2020 at 08:19:54PM +0100, Dr. David Alan Gilbert wrote:
> * Stefan Hajnoczi (stefa...@redhat.com) wrote:
> > virtiofsd cannot run in a container because CAP_SYS_ADMIN is required to
> > create namespaces.
> > 
> > Introduce a weaker sandbox mode that is sufficient in container
> > environments because the container runtime already sets up namespaces.
> > Use chroot to restrict path traversal to the shared directory.
> > 
> > virtiofsd loses the following:
> > 
> > 1. Mount namespace. The process chroots to the shared directory but
> >leaves the mounts in place. Seccomp rejects mount(2)/umount(2)
> >syscalls.
> > 
> > 2. Pid namespace. This should be fine because virtiofsd is the only
> >process running in the container.
> > 
> > 3. Network namespace. This should be fine because seccomp already
> >rejects the connect(2) syscall, but an additional layer of security
> >is lost. Container runtime-specific network security policies can be
> > >used to drop network traffic (except for the vhost-user UNIX domain
> >socket).
> > 
> > Signed-off-by: Stefan Hajnoczi 
> 
> I've just tripped over another case where this probably helps (but not
> yet tested...); pivot_root doesn't work if your current / isn't a
> mountpoint - so you can't currently run the existing virtiofsd inside
> a chroot.

Can we avoid that issue simply by doing a bind mount of the directory
before chroot()?

Vivek

> 
> (pivot_root is awful for telling you this - it has 6 different manpage
> listed reasons it might return EINVAL and leaves you to figure out how
> you offended it).
> 
> Dave
> 
> > ---
> > v3:
> >  * Rebased onto David Gilbert's latest migration & virtiofsd pull
> >request
> > 
> >  tools/virtiofsd/helper.c |  8 +
> >  tools/virtiofsd/passthrough_ll.c | 57 ++--
> >  docs/tools/virtiofsd.rst | 32 ++
> >  3 files changed, 88 insertions(+), 9 deletions(-)
> > 
> > diff --git a/tools/virtiofsd/helper.c b/tools/virtiofsd/helper.c
> > index 85770d63f1..2e181a49b5 100644
> > --- a/tools/virtiofsd/helper.c
> > +++ b/tools/virtiofsd/helper.c
> > @@ -166,6 +166,14 @@ void fuse_cmdline_help(void)
> > "   enable/disable readirplus\n"
> > "   default: readdirplus except 
> > with "
> > "cache=none\n"
> > +   "-o sandbox=namespace|chroot\n"
> > +   "   sandboxing mode:\n"
> > +   "   - namespace: mount, pid, and 
> > net\n"
> > +   " namespaces with 
> > pivot_root(2)\n"
> > +   " into shared directory\n"
> > +   "   - chroot: chroot(2) into 
> > shared\n"
> > +   " directory (use in 
> > containers)\n"
> > +   "   default: namespace\n"
> > "-o timeout=I/O timeout (seconds)\n"
> > "   default: depends on cache= 
> > option.\n"
> > "-o writeback|no_writeback  enable/disable writeback 
> > cache\n"
> > diff --git a/tools/virtiofsd/passthrough_ll.c 
> > b/tools/virtiofsd/passthrough_ll.c
> > index ff53df4451..5b9064278a 100644
> > --- a/tools/virtiofsd/passthrough_ll.c
> > +++ b/tools/virtiofsd/passthrough_ll.c
> > @@ -137,8 +137,14 @@ enum {
> >  CACHE_ALWAYS,
> >  };
> >  
> > +enum {
> > +SANDBOX_NAMESPACE,
> > +SANDBOX_CHROOT,
> > +};
> > +
> >  struct lo_data {
> >  pthread_mutex_t mutex;
> > +int sandbox;
> >  int debug;
> >  int writeback;
> >  int flock;
> > @@ -163,6 +169,12 @@ struct lo_data {
> >  };
> >  
> >  static const struct fuse_opt lo_opts[] = {
> > +{ "sandbox=namespace",
> > +  offsetof(struct lo_data, sandbox),
> > +  SANDBOX_NAMESPACE },
> > +{ "sandbox=chroot",
> > +  offsetof(struct lo_data, sandbox),
> > +  SANDBOX_CHROOT },
> >  { "writeback", offsetof(struct lo_data, writeback), 1 },
> >  { "no_writeback", offsetof(struct lo_data, writeback), 0 },
> >  { "source=%s", offsetof(struct lo_data, source), 0 },
> > @@ -2660,6 +2672,41 @@ static void setup_capabilities(char *modcaps_in)
> >  pthread_mutex_unlock();
> >  }
> >  
> > +/*
> > + * Use chroot as a weaker sandbox for environments where the process is
> > + * launched without CAP_SYS_ADMIN.
> > + */
> > +static void setup_chroot(struct lo_data *lo)
> > +{
> > +lo->proc_self_fd = open("/proc/self/fd", O_PATH);
> > +if (lo->proc_self_fd == -1) {
> > +fuse_log(FUSE_LOG_ERR, "open(\"/proc/self/fd\", O_PATH): %m\n");
> > +exit(1);
> > +}
> > +
> > +/*
> > + * Make the shared directory the file system root so that FUSE_OPEN
> > + * (lo_open()) cannot escape the shared directory by opening a symlink.
> > + *
> > +

Re: [PATCH v3] virtiofsd: add container-friendly -o sandbox=chroot option

2020-10-22 Thread Dr. David Alan Gilbert

* Stefan Hajnoczi (stefa...@redhat.com) wrote:
> virtiofsd cannot run in a container because CAP_SYS_ADMIN is required to
> create namespaces.
> 
> Introduce a weaker sandbox mode that is sufficient in container
> environments because the container runtime already sets up namespaces.
> Use chroot to restrict path traversal to the shared directory.
> 
> virtiofsd loses the following:
> 
> 1. Mount namespace. The process chroots to the shared directory but
>leaves the mounts in place. Seccomp rejects mount(2)/umount(2)
>syscalls.
> 
> 2. Pid namespace. This should be fine because virtiofsd is the only
>process running in the container.
> 
> 3. Network namespace. This should be fine because seccomp already
>rejects the connect(2) syscall, but an additional layer of security
>is lost. Container runtime-specific network security policies can be
>used to drop network traffic (except for the vhost-user UNIX domain
>socket).
> 
> Signed-off-by: Stefan Hajnoczi 

I've just tripped over another case where this probably helps (but not
yet tested...); pivot_root doesn't work if your current / isn't a
mountpoint - so you can't currently run the existing virtiofsd inside
a chroot.

(pivot_root is awful for telling you this - it has 6 different manpage
listed reasons it might return EINVAL and leaves you to figure out how
you offended it).

Dave
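
One way to check the "current / isn't a mountpoint" condition before pivot_root complains is to look for an entry with mount point "/" in /proc/self/mountinfo. The sketch below assumes, as far as I understand the kernel's behaviour, that mounts above the process root are not listed there, so a plain chroot into a non-mountpoint directory leaves no "/" entry. The function name and the sample mountinfo lines are made up for illustration.

```python
def root_is_mount_point(mountinfo: str) -> bool:
    """Return True if some mountinfo entry has "/" as its mount point."""
    for line in mountinfo.splitlines():
        fields = line.split()
        # Field layout per proc(5): mount ID, parent ID, major:minor,
        # root, mount point, options, ...
        if len(fields) > 4 and fields[4] == "/":
            return True
    return False


# Hypothetical samples: a normal host view and a view from inside a chroot
# whose root directory is not itself a mount point.
host_view = (
    "22 1 8:2 / / rw,relatime shared:1 - ext4 /dev/sda2 rw\n"
    "23 22 0:21 / /proc rw,nosuid - proc proc rw\n"
)
chroot_view = "23 22 0:21 / /proc rw,nosuid - proc proc rw\n"

if __name__ == "__main__":
    print(root_is_mount_point(host_view))
    print(root_is_mount_point(chroot_view))
```

On a real system the input would come from reading /proc/self/mountinfo. A bind mount of the directory onto itself before the chroot would add the missing "/" entry.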

> ---
> v3:
>  * Rebased onto David Gilbert's latest migration & virtiofsd pull
>request
> 
>  tools/virtiofsd/helper.c |  8 +
>  tools/virtiofsd/passthrough_ll.c | 57 ++--
>  docs/tools/virtiofsd.rst | 32 ++
>  3 files changed, 88 insertions(+), 9 deletions(-)
> 
> diff --git a/tools/virtiofsd/helper.c b/tools/virtiofsd/helper.c
> index 85770d63f1..2e181a49b5 100644
> --- a/tools/virtiofsd/helper.c
> +++ b/tools/virtiofsd/helper.c
> @@ -166,6 +166,14 @@ void fuse_cmdline_help(void)
> "   enable/disable readirplus\n"
> "   default: readdirplus except with "
> "cache=none\n"
> +   "-o sandbox=namespace|chroot\n"
> +   "   sandboxing mode:\n"
> +   "   - namespace: mount, pid, and 
> net\n"
> +   " namespaces with pivot_root(2)\n"
> +   " into shared directory\n"
> +   "   - chroot: chroot(2) into shared\n"
> +   " directory (use in containers)\n"
> +   "   default: namespace\n"
> "-o timeout=I/O timeout (seconds)\n"
> "   default: depends on cache= 
> option.\n"
> "-o writeback|no_writeback  enable/disable writeback cache\n"
> diff --git a/tools/virtiofsd/passthrough_ll.c 
> b/tools/virtiofsd/passthrough_ll.c
> index ff53df4451..5b9064278a 100644
> --- a/tools/virtiofsd/passthrough_ll.c
> +++ b/tools/virtiofsd/passthrough_ll.c
> @@ -137,8 +137,14 @@ enum {
>  CACHE_ALWAYS,
>  };
>  
> +enum {
> +SANDBOX_NAMESPACE,
> +SANDBOX_CHROOT,
> +};
> +
>  struct lo_data {
>  pthread_mutex_t mutex;
> +int sandbox;
>  int debug;
>  int writeback;
>  int flock;
> @@ -163,6 +169,12 @@ struct lo_data {
>  };
>  
>  static const struct fuse_opt lo_opts[] = {
> +{ "sandbox=namespace",
> +  offsetof(struct lo_data, sandbox),
> +  SANDBOX_NAMESPACE },
> +{ "sandbox=chroot",
> +  offsetof(struct lo_data, sandbox),
> +  SANDBOX_CHROOT },
>  { "writeback", offsetof(struct lo_data, writeback), 1 },
>  { "no_writeback", offsetof(struct lo_data, writeback), 0 },
>  { "source=%s", offsetof(struct lo_data, source), 0 },
> @@ -2660,6 +2672,41 @@ static void setup_capabilities(char *modcaps_in)
>  pthread_mutex_unlock();
>  }
>  
> +/*
> + * Use chroot as a weaker sandbox for environments where the process is
> + * launched without CAP_SYS_ADMIN.
> + */
> +static void setup_chroot(struct lo_data *lo)
> +{
> +lo->proc_self_fd = open("/proc/self/fd", O_PATH);
> +if (lo->proc_self_fd == -1) {
> +fuse_log(FUSE_LOG_ERR, "open(\"/proc/self/fd\", O_PATH): %m\n");
> +exit(1);
> +}
> +
> +/*
> + * Make the shared directory the file system root so that FUSE_OPEN
> + * (lo_open()) cannot escape the shared directory by opening a symlink.
> + *
> + * The chroot(2) syscall is later disabled by seccomp and the
> + * CAP_SYS_CHROOT capability is dropped so that tampering with the chroot
> + * is not possible.
> + *
> + * However, it's still possible to escape the chroot via lo->proc_self_fd
> + * but that requires first gaining control of the process.
> + */
> +if (chroot(lo->source) != 0) {
> +fuse_log(FUSE_LOG_ERR, "chroot(\"%s\"):

[PATCH RFC v5 12/12] iotests: rename and move 169 and 199 tests

2020-10-22 Thread Vladimir Sementsov-Ogievskiy

Rename the bitmaps migration tests and move them to the tests subdirectory
to demonstrate the new human-friendly test naming.

Signed-off-by: Vladimir Sementsov-Ogievskiy 
---
 tests/qemu-iotests/{199 => tests/migrate-bitmaps-postcopy-test}   | 0
 .../{199.out => tests/migrate-bitmaps-postcopy-test.out}  | 0
 tests/qemu-iotests/{169 => tests/migrate-bitmaps-test}| 0
 tests/qemu-iotests/{169.out => tests/migrate-bitmaps-test.out}| 0
 4 files changed, 0 insertions(+), 0 deletions(-)
 rename tests/qemu-iotests/{199 => tests/migrate-bitmaps-postcopy-test} (100%)
 rename tests/qemu-iotests/{199.out => tests/migrate-bitmaps-postcopy-test.out} 
(100%)
 rename tests/qemu-iotests/{169 => tests/migrate-bitmaps-test} (100%)
 rename tests/qemu-iotests/{169.out => tests/migrate-bitmaps-test.out} (100%)

diff --git a/tests/qemu-iotests/199 
b/tests/qemu-iotests/tests/migrate-bitmaps-postcopy-test
similarity index 100%
rename from tests/qemu-iotests/199
rename to tests/qemu-iotests/tests/migrate-bitmaps-postcopy-test
diff --git a/tests/qemu-iotests/199.out 
b/tests/qemu-iotests/tests/migrate-bitmaps-postcopy-test.out
similarity index 100%
rename from tests/qemu-iotests/199.out
rename to tests/qemu-iotests/tests/migrate-bitmaps-postcopy-test.out
diff --git a/tests/qemu-iotests/169 
b/tests/qemu-iotests/tests/migrate-bitmaps-test
similarity index 100%
rename from tests/qemu-iotests/169
rename to tests/qemu-iotests/tests/migrate-bitmaps-test
diff --git a/tests/qemu-iotests/169.out 
b/tests/qemu-iotests/tests/migrate-bitmaps-test.out
similarity index 100%
rename from tests/qemu-iotests/169.out
rename to tests/qemu-iotests/tests/migrate-bitmaps-test.out
-- 
2.21.3

[PATCH v5 11/12] iotests: rewrite check into python

2020-10-22 Thread Vladimir Sementsov-Ogievskiy

Rewrite check using the classes introduced in the previous three commits.
The behavior differences are described in those commits.

Drop the group file, as it becomes unused.

Signed-off-by: Vladimir Sementsov-Ogievskiy 
---
 tests/qemu-iotests/check | 977 ++-
 tests/qemu-iotests/group | 317 -
 2 files changed, 28 insertions(+), 1266 deletions(-)
 delete mode 100644 tests/qemu-iotests/group

diff --git a/tests/qemu-iotests/check b/tests/qemu-iotests/check
index 678b6e4910..48bb3128c3 100755
--- a/tests/qemu-iotests/check
+++ b/tests/qemu-iotests/check
@@ -1,7 +1,8 @@
-#!/usr/bin/env bash
+#!/usr/bin/env python3
 #
-# Copyright (C) 2009 Red Hat, Inc.
-# Copyright (c) 2000-2002,2006 Silicon Graphics, Inc.  All Rights Reserved.
+# Configure environment and run group of tests in it.
+#
+# Copyright (c) 2020 Virtuozzo International GmbH
 #
 # This program is free software; you can redistribute it and/or
 # modify it under the terms of the GNU General Public License as
@@ -14,950 +15,28 @@
 #
 # You should have received a copy of the GNU General Public License
 # along with this program.  If not, see .
-#
-#
-# Control script for QA
-#
-
-status=0
-needwrap=true
-try=0
-n_bad=0
-bad=""
-notrun=""
-casenotrun=""
-interrupt=true
-makecheck=false
-
-_init_error()
-{
-echo "check: $1" >&2
-exit 1
-}
-
-if [ -L "$0" ]
-then
-# called from the build tree
-source_iotests=$(dirname "$(readlink "$0")")
-if [ -z "$source_iotests" ]
-then
-_init_error "failed to obtain source tree name from check symlink"
-fi
-source_iotests=$(cd "$source_iotests"; pwd) || _init_error "failed to 
enter source tree"
-build_iotests=$(cd "$(dirname "$0")"; pwd)
-else
-# called from the source tree
-source_iotests=$PWD
-# this may be an in-tree build (note that in the following code we may not
-# assume that it truly is and have to test whether the build results
-# actually exist)
-build_iotests=$PWD
-fi
-
-build_root="$build_iotests/../.."
-
-# we need common.env
-if ! . "$build_iotests/common.env"
-then
-_init_error "failed to source common.env (make sure the qemu-iotests are 
run from tests/qemu-iotests in the build tree)"
-fi
-
-# we need common.config
-if ! . "$source_iotests/common.config"
-then
-_init_error "failed to source common.config"
-fi
-
-_full_imgfmt_details()
-{
-if [ -n "$IMGOPTS" ]; then
-echo "$IMGFMT ($IMGOPTS)"
-else
-echo "$IMGFMT"
-fi
-}
-
-_full_platform_details()
-{
-os=$(uname -s)
-host=$(hostname -s)
-kernel=$(uname -r)
-platform=$(uname -m)
-echo "$os/$platform $host $kernel"
-}
-
-_full_env_details()
-{
-cat < /dev/null)
-if [ -n "$p" -a -x "$p" ]; then
-type -p "$p"
-else
-return 1
-fi
-}
-
-if [ -z "$TEST_DIR" ]; then
-TEST_DIR=$PWD/scratch
-fi
-mkdir -p "$TEST_DIR" || _init_error 'Failed to create TEST_DIR'
-
-tmp_sock_dir=false
-if [ -z "$SOCK_DIR" ]; then
-SOCK_DIR=$(mktemp -d)
-tmp_sock_dir=true
-fi
-mkdir -p "$SOCK_DIR" || _init_error 'Failed to create SOCK_DIR'
-
-diff="diff -u"
-verbose=false
-debug=false
-group=false
-xgroup=false
-imgopts=false
-showme=false
-sortme=false
-expunge=true
-have_test_arg=false
-cachemode=false
-aiomode=false
-
-tmp="${TEST_DIR}"/$$
-rm -f $tmp.list $tmp.tmp $tmp.sed
-
-export IMGFMT=raw
-export IMGFMT_GENERIC=true
-export IMGPROTO=file
-export IMGOPTS=""
-export CACHEMODE="writeback"
-export AIOMODE="threads"
-export QEMU_IO_OPTIONS=""
-export QEMU_IO_OPTIONS_NO_FMT=""
-export CACHEMODE_IS_DEFAULT=true
-export VALGRIND_QEMU=
-export IMGKEYSECRET=
-export IMGOPTSSYNTAX=false
-
-# Save current tty settings, since an aborting qemu call may leave things
-# screwed up
-STTY_RESTORE=
-if test -t 0; then
-STTY_RESTORE=$(stty -g)
-fi
-
-for r
-do
-
-if $group
-then
-# arg after -g
-group_list=$(sed -n <"$source_iotests/group" -e 's/$/ /' -e 
"/^[0-9][0-9][0-9].* $r /"'{
-s/ .*//p
-}')
-if [ -z "$group_list" ]
-then
-echo "Group \"$r\" is empty or not defined?"
-exit 1
-fi
-[ ! -s $tmp.list ] && touch $tmp.list
-for t in $group_list
-do
-if grep -s "^$t\$" $tmp.list >/dev/null
-then
-:
-else
-echo "$t" >>$tmp.list
-fi
-done
-group=false
-continue
-
-elif $xgroup
-then
-# arg after -x
-# Populate $tmp.list with all tests
-awk '/^[0-9]{3,}/ {print $1}' "${source_iotests}/group" > $tmp.list 
2>/dev/null
-group_list=$(sed -n <"$source_iotests/group" -e 's/$/ /' -e 
"/^[0-9][0-9][0-9].* $r /"'{
-s/ .*//p
-}')
-if [ -z "$group_list" ]
-then
-echo "Group \"$r\" is empty or not defined?"
-exit 1
-fi
-numsed=0
-rm -f $tmp.sed
-for t in $group_list
-

[PATCH v5 07/12] iotests: define group in each iotest

2020-10-22 Thread Vladimir Sementsov-Ogievskiy

We are going to drop the group file. Define the groups in each test as a
preparatory step.

The patch is generated by

cd tests/qemu-iotests

grep '^[0-9]\{3\} ' group | while read line; do
file=$(awk '{print $1}' <<< "$line");
groups=$(sed -e 's/^... //' <<< "$line");
awk "NR==2{print \"# group: $groups\"}1" $file > tmp;
cat tmp > $file;
done
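
For reference, the shell loop above can be sketched in Python roughly as follows. This is only an illustration of what the loop does, not how the patch was generated; the function names are hypothetical.

```python
import re
from typing import Optional, Tuple


def parse_group_line(line: str) -> Optional[Tuple[str, str]]:
    """Split a group-file line like '001 rw auto quick' into (test, groups)."""
    # Same filter as: grep '^[0-9]\{3\} '
    if not re.match(r"[0-9]{3} ", line):
        return None
    line = line.rstrip()
    # awk '{print $1}' gives the test name; sed 's/^... //' gives the rest.
    return line[:3], line[4:]


def add_group_comment(content: str, groups: str) -> str:
    """Insert '# group: ...' so that it becomes the second line of the test."""
    lines = content.splitlines(keepends=True)
    if len(lines) < 2:
        # awk's NR==2 rule never fires for a one-line file.
        return content
    lines.insert(1, f"# group: {groups}\n")
    return "".join(lines)


if __name__ == "__main__":
    name, groups = parse_group_line("001 rw auto quick")
    print(name)
    print(add_group_comment("#!/usr/bin/env bash\n#\n# Test\n", groups), end="")
```

The group comment lands right after the shebang line, so it never displaces the interpreter line of the test.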

Signed-off-by: Vladimir Sementsov-Ogievskiy 
---
 tests/qemu-iotests/001 | 1 +
 tests/qemu-iotests/002 | 1 +
 tests/qemu-iotests/003 | 1 +
 tests/qemu-iotests/004 | 1 +
 tests/qemu-iotests/005 | 1 +
 tests/qemu-iotests/007 | 1 +
 tests/qemu-iotests/008 | 1 +
 tests/qemu-iotests/009 | 1 +
 tests/qemu-iotests/010 | 1 +
 tests/qemu-iotests/011 | 1 +
 tests/qemu-iotests/012 | 1 +
 tests/qemu-iotests/013 | 1 +
 tests/qemu-iotests/014 | 1 +
 tests/qemu-iotests/015 | 1 +
 tests/qemu-iotests/017 | 1 +
 tests/qemu-iotests/018 | 1 +
 tests/qemu-iotests/019 | 1 +
 tests/qemu-iotests/020 | 1 +
 tests/qemu-iotests/021 | 1 +
 tests/qemu-iotests/022 | 1 +
 tests/qemu-iotests/023 | 1 +
 tests/qemu-iotests/024 | 1 +
 tests/qemu-iotests/025 | 1 +
 tests/qemu-iotests/026 | 1 +
 tests/qemu-iotests/027 | 1 +
 tests/qemu-iotests/028 | 1 +
 tests/qemu-iotests/029 | 1 +
 tests/qemu-iotests/030 | 1 +
 tests/qemu-iotests/031 | 1 +
 tests/qemu-iotests/032 | 1 +
 tests/qemu-iotests/033 | 1 +
 tests/qemu-iotests/034 | 1 +
 tests/qemu-iotests/035 | 1 +
 tests/qemu-iotests/036 | 1 +
 tests/qemu-iotests/037 | 1 +
 tests/qemu-iotests/038 | 1 +
 tests/qemu-iotests/039 | 1 +
 tests/qemu-iotests/040 | 1 +
 tests/qemu-iotests/041 | 1 +
 tests/qemu-iotests/042 | 1 +
 tests/qemu-iotests/043 | 1 +
 tests/qemu-iotests/044 | 1 +
 tests/qemu-iotests/045 | 1 +
 tests/qemu-iotests/046 | 1 +
 tests/qemu-iotests/047 | 1 +
 tests/qemu-iotests/048 | 1 +
 tests/qemu-iotests/049 | 1 +
 tests/qemu-iotests/050 | 1 +
 tests/qemu-iotests/051 | 1 +
 tests/qemu-iotests/052 | 1 +
 tests/qemu-iotests/053 | 1 +
 tests/qemu-iotests/054 | 1 +
 tests/qemu-iotests/055 | 1 +
 tests/qemu-iotests/056 | 1 +
 tests/qemu-iotests/057 | 1 +
 tests/qemu-iotests/058 | 1 +
 tests/qemu-iotests/059 | 1 +
 tests/qemu-iotests/060 | 1 +
 tests/qemu-iotests/061 | 1 +
 tests/qemu-iotests/062 | 1 +
 tests/qemu-iotests/063 | 1 +
 tests/qemu-iotests/064 | 1 +
 tests/qemu-iotests/065 | 1 +
 tests/qemu-iotests/066 | 1 +
 tests/qemu-iotests/068 | 1 +
 tests/qemu-iotests/069 | 1 +
 tests/qemu-iotests/070 | 1 +
 tests/qemu-iotests/071 | 1 +
 tests/qemu-iotests/072 | 1 +
 tests/qemu-iotests/073 | 1 +
 tests/qemu-iotests/074 | 1 +
 tests/qemu-iotests/075 | 1 +
 tests/qemu-iotests/076 | 1 +
 tests/qemu-iotests/077 | 1 +
 tests/qemu-iotests/078 | 1 +
 tests/qemu-iotests/079 | 1 +
 tests/qemu-iotests/080 | 1 +
 tests/qemu-iotests/081 | 1 +
 tests/qemu-iotests/082 | 1 +
 tests/qemu-iotests/083 | 1 +
 tests/qemu-iotests/084 | 1 +
 tests/qemu-iotests/085 | 1 +
 tests/qemu-iotests/086 | 1 +
 tests/qemu-iotests/087 | 1 +
 tests/qemu-iotests/088 | 1 +
 tests/qemu-iotests/089 | 1 +
 tests/qemu-iotests/090 | 1 +
 tests/qemu-iotests/091 | 1 +
 tests/qemu-iotests/092 | 1 +
 tests/qemu-iotests/093 | 1 +
 tests/qemu-iotests/094 | 1 +
 tests/qemu-iotests/095 | 1 +
 tests/qemu-iotests/096 | 1 +
 tests/qemu-iotests/097 | 1 +
 tests/qemu-iotests/098 | 1 +
 tests/qemu-iotests/099 | 1 +
 tests/qemu-iotests/101 | 1 +
 tests/qemu-iotests/102 | 1 +
 tests/qemu-iotests/103 | 1 +
 tests/qemu-iotests/104 | 1 +
 tests/qemu-iotests/105 | 1 +
 tests/qemu-iotests/106 | 1 +
 tests/qemu-iotests/107 | 1 +
 tests/qemu-iotests/108 | 1 +
 tests/qemu-iotests/109 | 1 +
 tests/qemu-iotests/110 | 1 +
 tests/qemu-iotests/111 | 1 +
 tests/qemu-iotests/112 | 1 +
 tests/qemu-iotests/113 | 1 +
 tests/qemu-iotests/114 | 1 +
 tests/qemu-iotests/115 | 1 +
 tests/qemu-iotests/116 | 1 +
 tests/qemu-iotests/117 | 1 +
 tests/qemu-iotests/118 | 1 +
 tests/qemu-iotests/119 | 1 +
 tests/qemu-iotests/120 | 1 +
 tests/qemu-iotests/121 | 1 +
 tests/qemu-iotests/122 | 1 +
 tests/qemu-iotests/123 | 1 +
 tests/qemu-iotests/124 | 1 +
 tests/qemu-iotests/125 | 1 +
 tests/qemu-iotests/126 | 1 +
 tests/qemu-iotests/127 | 1 +
 tests/qemu-iotests/128 | 1 +
 tests/qemu-iotests/129 | 1 +
 tests/qemu-iotests/130 | 1 +
 tests/qemu-iotests/131 | 1 +
 tests/qemu-iotests/132 | 1 +
 tests/qemu-iotests/133 | 1 +
 tests/qemu-iotests/134 | 1 +
 tests/qemu-iotests/135 | 1 +
 tests/qemu-iotests/136 | 1 +
 tests/qemu-iotests/137 | 1 +
 tests/qemu-iotests/138 | 1 +
 tests/qemu-iotests/139 | 1 +
 tests/qemu-iotests/140 | 1 +
 tests/qemu-iotests/141 | 1 +
 tests/qemu-iotests/143 | 1 +
 tests/qemu-iotests/144 | 1 +
 tests/qemu-iotests/145 | 1 +
 tests/qemu-iotests/146 | 1 +
 tests/qemu-iotests/147 | 1 +
 tests/qemu-iotests/148 | 1 +
 tests/qemu-iotests/149 | 1 +
 tests/qemu-iotests/150 | 1 +
 tests/qemu-iotests/151 | 1 +
 tests/qemu-iotests/152 | 1 +
 tests/qemu-iotests/153 | 1 +
 tests/qemu-iotests/154 | 1 +
 tests/qemu-iotests/155 | 1 +
 tests/qemu-iotests/156 | 1 +
 tests/qemu-iotests/157 | 1 +

[PATCH v5 10/12] iotests: add testrunner.py

2020-10-22 Thread Vladimir Sementsov-Ogievskiy

Add the TestRunner class, which runs tests in the new Python iotests
running framework.

There are some differences from the current ./check behavior. The most
significant are:
- All tests are considered self-executable and are run directly; python
  is not invoked by hand.
- Elapsed time is cached in a JSON file.
- Elapsed time precision is increased a bit.
- Python's difflib is used instead of "diff -w". To ignore spaces at line
  ends, lines are stripped by hand. Other spaces are not ignored.
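
To illustrate the last point, here is a small stand-alone sketch of the comparison rule: trailing whitespace is ignored, any other whitespace difference is reported. It mirrors the idea of file_diff() in the patch but works on lists of lines; the function name is only for illustration.

```python
import difflib
from typing import List


def diff_ignoring_trailing_space(expected: List[str],
                                 actual: List[str]) -> List[str]:
    """Unified diff of two line lists, ignoring whitespace at line ends."""
    seq1 = [line.rstrip() for line in expected]
    seq2 = [line.rstrip() for line in actual]
    return list(difflib.unified_diff(seq1, seq2, "expected", "actual",
                                     lineterm=""))


if __name__ == "__main__":
    # Differs only in trailing spaces: no diff is produced.
    print(diff_ignoring_trailing_space(["*** done  \n"], ["*** done\n"]))
    # Differs in inner spacing: reported, unlike with "diff -w".
    for line in diff_ignoring_trailing_space(["found  (L2)\n"],
                                             ["found (L2)\n"]):
        print(line)
```

This is stricter than "diff -w", which ignores all whitespace; that is presumably why some .out files need whitespace fixes before the switch.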

Signed-off-by: Vladimir Sementsov-Ogievskiy 
---
 tests/qemu-iotests/testrunner.py | 351 +++
 1 file changed, 351 insertions(+)
 create mode 100644 tests/qemu-iotests/testrunner.py

diff --git a/tests/qemu-iotests/testrunner.py b/tests/qemu-iotests/testrunner.py
new file mode 100644
index 00..e395877882
--- /dev/null
+++ b/tests/qemu-iotests/testrunner.py
@@ -0,0 +1,351 @@
+# Class for actual tests running.
+#
+# Copyright (c) 2020 Virtuozzo International GmbH
+#
+# This program is free software; you can redistribute it and/or modify
+# it under the terms of the GNU General Public License as published by
+# the Free Software Foundation; either version 2 of the License, or
+# (at your option) any later version.
+#
+# This program is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program.  If not, see .
+#
+
+import os
+import random
+from pathlib import Path
+import datetime
+import time
+import difflib
+import subprocess
+import collections
+import contextlib
+import json
+import argparse
+import termios
+import sys
+from contextlib import contextmanager
+from typing import List, Optional, Iterator
+
+from testenv import TestEnv
+
+
+def silent_unlink(path: Path) -> None:
+try:
+path.unlink()
+except OSError:
+pass
+
+
+def file_diff(file1: str, file2: str) -> List[str]:
+with open(file1) as f1, open(file2) as f2:
+# We want to ignore spaces at line ends. There are a lot of mess about
+# it in iotests.
+# TODO: fix all tests to not produce extra spaces, fix all .out files
+# and use strict diff here!
+seq1 = [line.rstrip() for line in f1]
+seq2 = [line.rstrip() for line in f2]
+return list(difflib.unified_diff(seq1, seq2, file1, file2))
+
+
+# We want to save current tty settings during test run,
+# since an aborting qemu call may leave things screwed up.
+@contextmanager
+def savetty() -> Iterator:
+isterm = sys.stdin.isatty()
+if isterm:
+fd = sys.stdin.fileno()
+attr = termios.tcgetattr(0)
+
+try:
+yield
+finally:
+if isterm:
+termios.tcsetattr(fd, termios.TCSADRAIN, attr)
+
+
+class LastElapsedTime:
+""" Cache for elapsed time for tests, to show it during new test run
+
+Use get() in any time. But, if use update you should then call save(),
+or use update() inside with-block.
+"""
+def __init__(self, cache_file: str, env: TestEnv) -> None:
+self.env = env
+self.cache_file = cache_file
+
+try:
+with open(cache_file) as f:
+self.cache = json.load(f)
+except (OSError, ValueError):
+self.cache = {}
+
+def get(self, test: str,
+default: Optional[float] = None) -> Optional[float]:
+if test not in self.cache:
+return default
+
+if self.env.imgproto not in self.cache[test]:
+return default
+
+return self.cache[test][self.env.imgproto].get(self.env.imgfmt,
+   default)
+
+def update(self, test: str, elapsed: float) -> None:
+d = self.cache.setdefault(test, {})
+d = d.setdefault(self.env.imgproto, {})
+d[self.env.imgfmt] = elapsed
+
+def save(self) -> None:
+with open(self.cache_file, 'w') as f:
+json.dump(self.cache, f)
+
+def __enter__(self) -> 'LastElapsedTime':
+return self
+
+def __exit__(self, *args) -> None:
+self.save()
+
+
+TestResult = collections.namedtuple(
+'TestResult',
+['status', 'description', 'elapsed', 'diff', 'casenotrun'],
+defaults=('', '', '', ''))
+
+
+class TestRunner:
+_argparser = None
+@classmethod
+def get_argparser(cls) -> argparse.ArgumentParser:
+if cls._argparser is not None:
+return cls._argparser
+
+p = argparse.ArgumentParser(description="= test running options =",
+add_help=False, usage=argparse.SUPPRESS)
+
+p.add_argument('-makecheck', action='store_true',
+   help='pretty print output for make check')
+
+cls._argparser = p
+return p
+
+def

[PATCH v5 04/12] iotests/283: make executable

2020-10-22 Thread Vladimir Sementsov-Ogievskiy

All other test files are executable, except for this one. Fix that.
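
A quick way to find and fix such a file outside of git, sketched in Python. The helper names are made up, and the snippet assumes a POSIX file system; it adds the execute bits the same way the 100644 => 100755 mode change below does.

```python
import os
import stat


def is_executable(path: str) -> bool:
    """True if the owner execute bit is set on path."""
    return bool(os.stat(path).st_mode & stat.S_IXUSR)


def make_executable(path: str) -> None:
    """Add execute permission for user, group and others (644 -> 755)."""
    mode = os.stat(path).st_mode
    os.chmod(path, mode | stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH)
```

Running is_executable() over tests/qemu-iotests/[0-9][0-9][0-9] would be one way to confirm that no other test has the same problem.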

Signed-off-by: Vladimir Sementsov-Ogievskiy 
Reviewed-by: Eric Blake 
Reviewed-by: Philippe Mathieu-Daudé 
---
 tests/qemu-iotests/283 | 0
 1 file changed, 0 insertions(+), 0 deletions(-)
 mode change 100644 => 100755 tests/qemu-iotests/283

diff --git a/tests/qemu-iotests/283 b/tests/qemu-iotests/283
old mode 100644
new mode 100755
-- 
2.21.3

[PATCH v5 09/12] iotests: add testenv.py

2020-10-22 Thread Vladimir Sementsov-Ogievskiy

Add the TestEnv class, which handles the test environment in the new
Python iotests running framework.

Differences from the current ./check interface:
- The -v (verbose) option is dropped, as it is unused.
- The -xdiff option is dropped, until somebody complains that it is needed.
- The same goes for the -n option.
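
As a rough picture of how such an environment class could feed a test process, here is a cut-down sketch. MiniEnv and child_environment are hypothetical names; only the idea of mapping lower-case attributes to upper-case environment variables is taken from the patch below.

```python
import os
from typing import Dict, List, Optional


class MiniEnv:
    """Tiny stand-in for TestEnv with just three variables."""
    env_variables: List[str] = ["IMGFMT", "IMGPROTO", "TEST_DIR"]

    def __init__(self, imgfmt: str = "raw", imgproto: str = "file",
                 test_dir: Optional[str] = None) -> None:
        self.imgfmt = imgfmt
        self.imgproto = imgproto
        self.test_dir = test_dir

    def get_env(self) -> Dict[str, str]:
        # Export every variable whose lower-cased attribute is set.
        env = {}
        for v in self.env_variables:
            val = getattr(self, v.lower(), None)
            if val is not None:
                env[v] = val
        return env


def child_environment(env: MiniEnv) -> Dict[str, str]:
    """Environment for a test process: inherited variables plus overrides."""
    merged = dict(os.environ)
    merged.update(env.get_env())
    return merged


if __name__ == "__main__":
    print(MiniEnv(imgfmt="qcow2").get_env())
```

A runner could then pass child_environment(env) as the env argument of subprocess.run() when starting a test. Unset attributes are simply not exported.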

Signed-off-by: Vladimir Sementsov-Ogievskiy 
---
 tests/qemu-iotests/testenv.py | 325 ++
 1 file changed, 325 insertions(+)
 create mode 100755 tests/qemu-iotests/testenv.py

diff --git a/tests/qemu-iotests/testenv.py b/tests/qemu-iotests/testenv.py
new file mode 100755
index 00..97c75f70df
--- /dev/null
+++ b/tests/qemu-iotests/testenv.py
@@ -0,0 +1,325 @@
+#!/usr/bin/env python3
+#
+# Parse command line options to manage test environment variables.
+#
+# Copyright (c) 2020 Virtuozzo International GmbH
+#
+# This program is free software; you can redistribute it and/or modify
+# it under the terms of the GNU General Public License as published by
+# the Free Software Foundation; either version 2 of the License, or
+# (at your option) any later version.
+#
+# This program is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program.  If not, see .
+#
+
+import os
+import sys
+import tempfile
+from pathlib import Path
+import shutil
+import collections
+import subprocess
+import argparse
+from typing import List, Dict
+
+
+def get_default_machine(qemu_prog: str) -> str:
+outp = subprocess.run([qemu_prog, '-machine', 'help'], check=True,
+  text=True, stdout=subprocess.PIPE).stdout
+
+machines = outp.split('\n')
+default_machine = next(m for m in machines if m.endswith(' (default)'))
+default_machine = default_machine.split(' ', 1)[0]
+
+alias_suf = ' (alias of {})'.format(default_machine)
+alias = next((m for m in machines if m.endswith(alias_suf)), None)
+if alias is not None:
+default_machine = alias.split(' ', 1)[0]
+
+return default_machine
+
+
+class TestEnv:
+"""
+Manage system environment for running tests
+
+The following variables are supported/provided. They are represented by
+lower-cased TestEnv attributes.
+"""
+env_variables = ['PYTHONPATH', 'TEST_DIR', 'SOCK_DIR', 'SAMPLE_IMG_DIR',
+ 'OUTPUT_DIR', 'PYTHON', 'QEMU_PROG', 'QEMU_IMG_PROG',
+ 'QEMU_IO_PROG', 'QEMU_NBD_PROG',
+ 'SOCKET_SCM_HELPER', 'QEMU_OPTIONS', 'QEMU_IMG_OPTIONS',
+ 'QEMU_IO_OPTIONS', 'QEMU_NBD_OPTIONS', 'IMGOPTS',
+ 'IMGFMT', 'IMGPROTO', 'AIOMODE', 'CACHEMODE',
+ 'VALGRIND_QEMU', 'CACHEMODE_IS_DEFAULT', 'IMGFMT_GENERIC',
+ 'IMGOPTSSYNTAX', 'IMGKEYSECRET', 'QEMU_DEFAULT_MACHINE']
+
+def get_env(self) -> Dict[str, str]:
+env = {}
+for v in self.env_variables:
+val = getattr(self, v.lower(), None)
+if val is not None:
+env[v] = val
+
+return env
+
+_argparser = None
+@classmethod
+def get_argparser(cls) -> argparse.ArgumentParser:
+if cls._argparser is not None:
+return cls._argparser
+
+p = argparse.ArgumentParser(description="= test environment options =",
+add_help=False, usage=argparse.SUPPRESS)
+
+p.add_argument('-d', dest='debug', action='store_true', help='debug')
+p.add_argument('-misalign', action='store_true',
+   help='misalign memory allocations')
+
+p.set_defaults(imgfmt='raw', imgproto='file')
+
+format_list = ['raw', 'bochs', 'cloop', 'parallels', 'qcow', 'qcow2',
+   'qed', 'vdi', 'vpc', 'vhdx', 'vmdk', 'luks', 'dmg']
+g = p.add_argument_group(
+'image format options',
+'The following options sets IMGFMT environment variable. '
+'At most one chose is allowed, default is "raw"')
+g = g.add_mutually_exclusive_group()
+for fmt in format_list:
+g.add_argument('-' + fmt, dest='imgfmt', action='store_const',
+   const=fmt)
+
+protocol_list = ['file', 'rbd', 'sheepdoc', 'nbd', 'ssh', 'nfs']
+g = p.add_argument_group(
+'image protocol options',
+'The following options sets IMGPROTO environment variably. '
+'At most one chose is allowed, default is "file"')
+g = g.add_mutually_exclusive_group()
+for prt in protocol_list:
+g.add_argument('-' + prt, dest='imgproto', action='store_const',
+   const=prt)
+
+g = p.add_mutually_exclusive_group()
+# We don't set default for

[PATCH v5 03/12] iotests: fix some whitespaces in test output files

2020-10-22 Thread Vladimir Sementsov-Ogievskiy

We are going to be stricter about comparing test results with .out
files. So, fix some whitespace now.

Signed-off-by: Vladimir Sementsov-Ogievskiy 
---
 tests/qemu-iotests/175.out |  2 +-
 tests/qemu-iotests/271.out | 12 ++--
 tests/qemu-iotests/287.out | 10 +-
 3 files changed, 12 insertions(+), 12 deletions(-)

diff --git a/tests/qemu-iotests/175.out b/tests/qemu-iotests/175.out
index 39c2ee0f62..40a5bd1ce6 100644
--- a/tests/qemu-iotests/175.out
+++ b/tests/qemu-iotests/175.out
@@ -23,4 +23,4 @@ size=4096, min allocation
 == resize empty image with block_resize ==
 Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=0
 size=1048576, min allocation
- *** done
+*** done
diff --git a/tests/qemu-iotests/271.out b/tests/qemu-iotests/271.out
index 92deb7ebb0..81043ba4d7 100644
--- a/tests/qemu-iotests/271.out
+++ b/tests/qemu-iotests/271.out
@@ -500,7 +500,7 @@ L2 entry #0: 0x80050001 0001
 
 Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=1048576
 L2 entry #1: 0x8006 0001
-qcow2: Marking image as corrupt: Invalid cluster entry found  (L2 offset: 
0x4, L2 index: 0x1); further corruption events will be suppressed
+qcow2: Marking image as corrupt: Invalid cluster entry found (L2 offset: 
0x4, L2 index: 0x1); further corruption events will be suppressed
 write failed: Input/output error
 
 ### Corrupted L2 entries - write test (unallocated) ###
@@ -515,14 +515,14 @@ L2 entry #0: 0x8006 0001
 
 Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=1048576
 L2 entry #0: 0x 0001
-qcow2: Marking image as corrupt: Invalid cluster entry found  (L2 offset: 
0x4, L2 index: 0); further corruption events will be suppressed
+qcow2: Marking image as corrupt: Invalid cluster entry found (L2 offset: 
0x4, L2 index: 0); further corruption events will be suppressed
 write failed: Input/output error
 
 # Both 'subcluster is zero' and 'subcluster is allocated' bits set
 
 Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=1048576
 L2 entry #1: 0x 00010001
-qcow2: Marking image as corrupt: Invalid cluster entry found  (L2 offset: 
0x4, L2 index: 0x1); further corruption events will be suppressed
+qcow2: Marking image as corrupt: Invalid cluster entry found (L2 offset: 
0x4, L2 index: 0x1); further corruption events will be suppressed
 write failed: Input/output error
 
 ### Compressed cluster with subcluster bitmap != 0 - write test ###
@@ -583,7 +583,7 @@ read 524288/524288 bytes at offset 0
 read 524288/524288 bytes at offset 524288
 512 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
 Offset  Length  Mapped to   File
-0   0x80   TEST_DIR/t.qcow2.base
+0   0x8 0   TEST_DIR/t.qcow2.base
 # backing file and preallocation=falloc
 Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=524288 
backing_file=TEST_DIR/t.IMGFMT.base backing_fmt=raw preallocation=falloc
 Image resized.
@@ -592,7 +592,7 @@ read 524288/524288 bytes at offset 0
 read 524288/524288 bytes at offset 524288
 512 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
 Offset  Length  Mapped to   File
-0   0x80   TEST_DIR/t.qcow2.base
+0   0x8 0   TEST_DIR/t.qcow2.base
 # backing file and preallocation=full
 Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=524288 
backing_file=TEST_DIR/t.IMGFMT.base backing_fmt=raw preallocation=full
 Image resized.
@@ -601,7 +601,7 @@ read 524288/524288 bytes at offset 0
 read 524288/524288 bytes at offset 524288
 512 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
 Offset  Length  Mapped to   File
-0   0x80   TEST_DIR/t.qcow2.base
+0   0x8 0   TEST_DIR/t.qcow2.base
 
 ### Image resizing with preallocation and backing files ###
 
diff --git a/tests/qemu-iotests/287.out b/tests/qemu-iotests/287.out
index 6b9dfb4af0..49ab6a27d5 100644
--- a/tests/qemu-iotests/287.out
+++ b/tests/qemu-iotests/287.out
@@ -10,22 +10,22 @@ incompatible_features []
 Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=67108864
 incompatible_features [3]
 
-=== Testing zlib with incompatible bit set  ===
+=== Testing zlib with incompatible bit set ===
 
 Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=67108864
 incompatible_features [3]
 
-=== Testing zstd with incompatible bit unset  ===
+=== Testing zstd with incompatible bit unset ===
 
 Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=67108864
 incompatible_features []
 
-=== Testing compression type values  ===
+=== Testing compression type values ===
 
 Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=67108864
-   0
+0
 Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=67108864
-   1
+1
 
 === Testing simple reading and writing with zstd ===
 
-- 
2.21.3

[PATCH v5 06/12] iotests/294: add shebang line

2020-10-22 Thread Vladimir Sementsov-Ogievskiy

Signed-off-by: Vladimir Sementsov-Ogievskiy 
---
 tests/qemu-iotests/294 | 1 +
 1 file changed, 1 insertion(+)

diff --git a/tests/qemu-iotests/294 b/tests/qemu-iotests/294
index 9c95ed8c9a..4bdb7364af 100755
--- a/tests/qemu-iotests/294
+++ b/tests/qemu-iotests/294
@@ -1,3 +1,4 @@
+#!/usr/bin/env bash
 #
 # Copyright (C) 2019 Red Hat, Inc.
 #
-- 
2.21.3

[PATCH v5 05/12] iotests/299: make executable

2020-10-22 Thread Vladimir Sementsov-Ogievskiy

Signed-off-by: Vladimir Sementsov-Ogievskiy 
---
 tests/qemu-iotests/299 | 0
 1 file changed, 0 insertions(+), 0 deletions(-)
 mode change 100644 => 100755 tests/qemu-iotests/299

diff --git a/tests/qemu-iotests/299 b/tests/qemu-iotests/299
old mode 100644
new mode 100755
-- 
2.21.3

[PATCH v5 08/12] iotests: add findtests.py

2020-10-22 Thread Vladimir Sementsov-Ogievskiy

Add a Python script with new logic for finding tests:

Current ./check behavior:
 - tests are named [0-9][0-9][0-9]
 - tests must be registered in the group file (even if a test doesn't
   belong to any group, like 142)

Behavior of findtests.py:
 - the group file is dropped
 - tests are all files in the tests/ subdirectory (except for .out files),
   so there is no longer any need to "register" a test: just create it
   with an appropriate name in the tests/ subdirectory. Old names like
   [0-9][0-9][0-9] (in the root iotests directory) are supported too, but
   not recommended for new tests
 - groups are parsed from the '# group: ' line inside test files
 - an optional file group.local may be used to define additional
   groups for downstreams
 - the 'disabled' group is used to temporarily disable tests. So instead
   of commenting out tests in the old 'group' file, you can now add them
   to the disabled group with the help of the 'group.local' file
 - selecting test ranges like 5-15 is no longer supported
   (to support restarting a failed ./check command from the middle of the
   process, a new argument is added: --start-from)

Benefits:
 - no rebase conflicts in the group file when porting patches from branch
   to branch
 - no conflicts upstream when different series want to occupy the same
   test number
 - meaningful names for test files
   For example, with numeric names, someone who wants to add a test
   about block-stream will most probably just create a new test. But if
   a test-block-stream test already existed, they would look at it first
   and maybe just add a test case to it.
   And anyway, meaningful names are better.

This commit doesn't update the check behavior (which will be done in a
later commit); still, the documentation is changed as if the new behavior
were already here.  Let's live with this small inconsistency for the
following few commits, until the final change.

The file findtests.py is self-executable and may be used for debugging
purposes.
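
As a rough illustration of the group handling described above (this is
not the findtests.py code itself), the sketch below reads groups from a
'# group: ' line and from a group.local-style file. The function names
parse_test_groups, parse_group_local and demo are made up for this
example, as is the test name used in the demo.

```python
import os
import tempfile


def parse_test_groups(path):
    """Return the groups named on the first '# group: ' line of a test."""
    with open(path, encoding='utf-8') as f:
        for line in f:
            if line.startswith('# group: '):
                return line.split(':', 1)[1].split()
    return []


def parse_group_local(path):
    """Parse 'TEST_NAME TEST_GROUP [TEST_GROUP ]...' lines.

    Blank lines and lines starting with '#' are skipped.
    """
    groups = {}
    with open(path, encoding='utf-8') as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith('#'):
                continue
            name, *test_groups = line.split()
            groups.setdefault(name, []).extend(test_groups)
    return groups


def demo():
    # Hypothetical files, created only for this illustration.
    with tempfile.TemporaryDirectory() as d:
        test = os.path.join(d, 'test-block-stream')
        with open(test, 'w', encoding='utf-8') as f:
            f.write('#!/usr/bin/env python3\n# group: auto quick\n#\n')
        local = os.path.join(d, 'group.local')
        with open(local, 'w', encoding='utf-8') as f:
            f.write('# downstream groups\n013 ci\n210 disabled\n')
        return parse_test_groups(test), parse_group_local(local)
```

A test listed under 'disabled' in the second mapping would then simply be
filtered out before running.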

Signed-off-by: Vladimir Sementsov-Ogievskiy 
---
 docs/devel/testing.rst  |  50 ++-
 tests/qemu-iotests/findtests.py | 229 
 2 files changed, 278 insertions(+), 1 deletion(-)
 create mode 100755 tests/qemu-iotests/findtests.py

diff --git a/docs/devel/testing.rst b/docs/devel/testing.rst
index 0c3e79d31c..b2a4f6ce42 100644
--- a/docs/devel/testing.rst
+++ b/docs/devel/testing.rst
@@ -111,7 +111,7 @@ check-block
 ---
 
 ``make check-block`` runs a subset of the block layer iotests (the tests that
-are in the "auto" group in ``tests/qemu-iotests/group``).
+are in the "auto" group).
 See the "QEMU iotests" section below for more information.
 
 GCC gcov support
@@ -224,6 +224,54 @@ another application on the host may have locked the file, 
possibly leading to a
 test failure.  If using such devices are explicitly desired, consider adding
 ``locking=off`` option to disable image locking.
 
+Test case groups
+
+
+A test may belong to one or more groups, which you define in a comment inside
+the test. By convention, test groups are listed in the second line of the test
+file, after the "#!/..." line, like this:
+
+.. code::
+
+  #!/usr/bin/env python3
+  # group: auto quick
+  #
+  ...
+
+Another way of defining groups is to create a tests/qemu-iotests/group.local
+file. This should be used only for downstream (this file should never appear
+in upstream). This file may be used for defining some downstream test groups
+or for temporarily disabling tests, like this:
+
+.. code::
+
+  # groups for some company downstream process
+  #
+  # ci - tests to run on build
+  # down - our downstream tests, not for upstream
+  #
+  # Format of each line is:
+  # TEST_NAME TEST_GROUP [TEST_GROUP ]...
+
+  013 ci
+  210 disabled
+  215 disabled
+  our-ugly-workaround-test down ci
+
+The (not exhaustive) list of groups:
+
+- quick : Tests in this group should finish within some few seconds.
+
+- auto : Tests in this group are used during "make check" and should be
+  runnable in any case. That means they should run with every QEMU binary
+  (also non-x86), with every QEMU configuration (i.e. must not fail if
+  an optional feature is not compiled in - but reporting a "skip" is ok),
+  work at least with the qcow2 file format, work with all kind of host
+  filesystems and users (e.g. "nobody" or "root") and must not take too
+  much memory and disk space (since CI pipelines tend to fail otherwise).
+
+- disabled : Tests in this group are disabled and ignored by check.
+
 .. _docker-ref:
 
 Docker based tests
diff --git a/tests/qemu-iotests/findtests.py b/tests/qemu-iotests/findtests.py
new file mode 100755
index 00..b053db48e8
--- /dev/null
+++ b/tests/qemu-iotests/findtests.py
@@ -0,0 +1,229 @@
+#!/usr/bin/env python3
+#
+# Parse command line options to define set of tests to run.
+#
+# Copyright (c) 2020 Virtuozzo International GmbH
+#
+# This program is free software; you can redistribute it and/or modify
+# it under the terms of the GNU General Public

[PATCH v5 01/12] iotests/277: use dot slash for nbd-fault-injector.py running

2020-10-22 Thread Vladimir Sementsov-Ogievskiy

If you run './check 277', check includes common.config which adjusts
$PATH to include '.' first, and therefore finds nbd-fault-injector.py
on PATH.  But if you run './277' directly, there is nothing to adjust
PATH, and if '.' is not already on your PATH by other means, the test
fails because the executable is not found.  Adjust how we invoke the
helper executable to avoid needing a PATH search in the first place.
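
To make the difference concrete, here is a small self-contained sketch
(POSIX only, and not part of the patch). It creates a throwaway helper
script whose name, iotests-helper-demo.sh, is invented for the example,
then shows that a bare name depends on a PATH search while './name' is
resolved relative to the working directory.

```python
import os
import shutil
import subprocess
import tempfile


def demo():
    with tempfile.TemporaryDirectory() as d:
        helper = os.path.join(d, 'iotests-helper-demo.sh')
        with open(helper, 'w', encoding='utf-8') as f:
            f.write('#!/bin/sh\necho ok\n')
        os.chmod(helper, 0o755)

        # Bare name: needs a PATH search, which fails when neither '.'
        # nor the helper's directory is on the search path.
        found = shutil.which('iotests-helper-demo.sh', path='/usr/bin:/bin')

        # './name': no PATH search, resolved relative to the working
        # directory of the child (set here with cwd=).
        out = subprocess.run(['./iotests-helper-demo.sh'], cwd=d,
                             stdout=subprocess.PIPE,
                             universal_newlines=True, check=True).stdout
        return found, out
```

The same reasoning applies to any helper that lives next to the test.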

Signed-off-by: Vladimir Sementsov-Ogievskiy 
Reviewed-by: Eric Blake 
---
 tests/qemu-iotests/277 | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/tests/qemu-iotests/277 b/tests/qemu-iotests/277
index d34f87021f..a39ce2d873 100755
--- a/tests/qemu-iotests/277
+++ b/tests/qemu-iotests/277
@@ -42,7 +42,7 @@ def make_conf_file(event):
 def start_server_NBD(event):
 make_conf_file(event)
 
-srv = subprocess.Popen(['nbd-fault-injector.py', '--classic-negotiation',
+srv = subprocess.Popen(['./nbd-fault-injector.py', '--classic-negotiation',
nbd_sock, conf_file], stdout=subprocess.PIPE,
stderr=subprocess.STDOUT, universal_newlines=True)
 line = srv.stdout.readline()
-- 
2.21.3

Re: [PATCH v27 07/17] vfio: Register SaveVMHandlers for VFIO device

2020-10-22 Thread Alex Williamson

On Thu, 22 Oct 2020 16:41:57 +0530
Kirti Wankhede  wrote:

> Define flags to be used as delimiter in migration stream for VFIO devices.
> Added .save_setup and .save_cleanup functions. Map & unmap migration
> region from these functions at source during saving or pre-copy phase.
> 
> Set VFIO device state depending on VM's state. During live migration, VM is
> running when .save_setup is called, _SAVING | _RUNNING state is set for VFIO
> device. During save-restore, VM is paused, _SAVING state is set for VFIO 
> device.
> 
> Signed-off-by: Kirti Wankhede 
> Reviewed-by: Neo Jia 
> ---
>  hw/vfio/migration.c  | 96 
> 
>  hw/vfio/trace-events |  2 ++
>  2 files changed, 98 insertions(+)
> 
> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
> index 7c4fa0d08ea6..2e1054bf7f43 100644
> --- a/hw/vfio/migration.c
> +++ b/hw/vfio/migration.c
> @@ -8,12 +8,15 @@
>   */
>  
>  #include "qemu/osdep.h"
> +#include "qemu/main-loop.h"
> +#include "qemu/cutils.h"
>  #include 
>  
>  #include "sysemu/runstate.h"
>  #include "hw/vfio/vfio-common.h"
>  #include "cpu.h"
>  #include "migration/migration.h"
> +#include "migration/vmstate.h"
>  #include "migration/qemu-file.h"
>  #include "migration/register.h"
>  #include "migration/blocker.h"
> @@ -25,6 +28,22 @@
>  #include "trace.h"
>  #include "hw/hw.h"
>  
> +/*
> + * Flags to be used as unique delimiters for VFIO devices in the migration
> + * stream. These flags are composed as:
> + * 0x => MSB 32-bit all 1s
> + * 0xef10 => Magic ID, represents emulated (virtual) function IO
> + * 0x => 16-bits reserved for flags
> + *
> + * The beginning of state information is marked by _DEV_CONFIG_STATE,
> + * _DEV_SETUP_STATE, or _DEV_DATA_STATE, respectively. The end of a
> + * certain state information is marked by _END_OF_STATE.
> + */
> +#define VFIO_MIG_FLAG_END_OF_STATE  (0xef11ULL)
> +#define VFIO_MIG_FLAG_DEV_CONFIG_STATE  (0xef12ULL)
> +#define VFIO_MIG_FLAG_DEV_SETUP_STATE   (0xef13ULL)
> +#define VFIO_MIG_FLAG_DEV_DATA_STATE(0xef14ULL)
> +
>  static inline int vfio_mig_access(VFIODevice *vbasedev, void *val, int count,
>off_t off, bool iswrite)
>  {
> @@ -129,6 +148,69 @@ static int vfio_migration_set_state(VFIODevice 
> *vbasedev, uint32_t mask,
>  return 0;
>  }
>  
> +/* -- */
> +
> +static int vfio_save_setup(QEMUFile *f, void *opaque)
> +{
> +VFIODevice *vbasedev = opaque;
> +VFIOMigration *migration = vbasedev->migration;
> +int ret;
> +
> +trace_vfio_save_setup(vbasedev->name);
> +
> +qemu_put_be64(f, VFIO_MIG_FLAG_DEV_SETUP_STATE);
> +
> +if (migration->region.mmaps) {
> +/*
> + * vfio_region_mmap() called from migration thread. Memory API called
> + * from vfio_regio_mmap() need it when called from outdide the main 
> loop
> + * thread.
> + */

Thanks for adding this detail, maybe refine slightly as:

  Calling vfio_region_mmap() from migration thread.  Memory APIs called
  from this function require locking the iothread when called from
  outside the main loop thread.

Does that capture the intent?

> +qemu_mutex_lock_iothread();
> +ret = vfio_region_mmap(&migration->region);
> +qemu_mutex_unlock_iothread();
> +if (ret) {
> +error_report("%s: Failed to mmap VFIO migration region: %s",
> + vbasedev->name, strerror(-ret));
> +error_report("%s: Falling back to slow path", vbasedev->name);
> +}
> +}
> +
> +ret = vfio_migration_set_state(vbasedev, VFIO_DEVICE_STATE_MASK,
> +   VFIO_DEVICE_STATE_SAVING);
> +if (ret) {
> +error_report("%s: Failed to set state SAVING", vbasedev->name);
> +return ret;
> +}
> +
> +qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
> +
> +ret = qemu_file_get_error(f);
> +if (ret) {
> +return ret;
> +}
> +
> +return 0;
> +}
> +
> +static void vfio_save_cleanup(void *opaque)
> +{
> +VFIODevice *vbasedev = opaque;
> +VFIOMigration *migration = vbasedev->migration;
> +
> +if (migration->region.mmaps) {
> +vfio_region_unmap(&migration->region);
> +}


Are we in a different thread context here that we don't need that same
iothread locking?


> +trace_vfio_save_cleanup(vbasedev->name);
> +}
> +
> +static SaveVMHandlers savevm_vfio_handlers = {
> +.save_setup = vfio_save_setup,
> +.save_cleanup = vfio_save_cleanup,
> +};
> +
> +/* -- */
> +
>  static void vfio_vmstate_change(void *opaque, int running, RunState state)
>  {
>  VFIODevice *vbasedev = opaque;
> @@ -219,6 +301,8 @@ static int vfio_migration_init(VFIODevice *vbasedev,
>  int ret;
>  Object *obj;
>  VFIOMigration

[PATCH v5 00/12] Rework iotests/check

2020-10-22 Thread Vladimir Sementsov-Ogievskiy

Hi all!

This series has 3 goals:

 - get rid of group file
 - introduce human-readable names for tests
 - rewrite check in Python

v5: [rebase on master]
02: new
03: updated
05,06: new
07: updated
09: rebased on master:
  vxhs removed
  qemu_prog path changed
  build_iotests path calculation changed
  handle mega2560 and gdbsim-r5f562n8 machines

Mark the last patch as RFC: it's OK to postpone it for a while.
Also, patches 01-06 are simple cleanups and may be merged separately.

Vladimir Sementsov-Ogievskiy (12):
  iotests/277: use dot slash for nbd-fault-injector.py running
  iotests/303: use dot slash for qcow2.py running
  iotests: fix some whitespaces in test output files
  iotests/283: make executable
  iotests/299: make executable
  iotests/294: add shebang line
  iotests: define group in each iotest
  iotests: add findtests.py
  iotests: add testenv.py
  iotests: add testrunner.py
  iotests: rewrite check into python
  iotests: rename and move 169 and 199 tests

 docs/devel/testing.rst|  50 +-
 tests/qemu-iotests/001|   1 +
 tests/qemu-iotests/002|   1 +
 tests/qemu-iotests/003|   1 +
 tests/qemu-iotests/004|   1 +
 tests/qemu-iotests/005|   1 +
 tests/qemu-iotests/007|   1 +
 tests/qemu-iotests/008|   1 +
 tests/qemu-iotests/009|   1 +
 tests/qemu-iotests/010|   1 +
 tests/qemu-iotests/011|   1 +
 tests/qemu-iotests/012|   1 +
 tests/qemu-iotests/013|   1 +
 tests/qemu-iotests/014|   1 +
 tests/qemu-iotests/015|   1 +
 tests/qemu-iotests/017|   1 +
 tests/qemu-iotests/018|   1 +
 tests/qemu-iotests/019|   1 +
 tests/qemu-iotests/020|   1 +
 tests/qemu-iotests/021|   1 +
 tests/qemu-iotests/022|   1 +
 tests/qemu-iotests/023|   1 +
 tests/qemu-iotests/024|   1 +
 tests/qemu-iotests/025|   1 +
 tests/qemu-iotests/026|   1 +
 tests/qemu-iotests/027|   1 +
 tests/qemu-iotests/028|   1 +
 tests/qemu-iotests/029|   1 +
 tests/qemu-iotests/030|   1 +
 tests/qemu-iotests/031|   1 +
 tests/qemu-iotests/032|   1 +
 tests/qemu-iotests/033|   1 +
 tests/qemu-iotests/034|   1 +
 tests/qemu-iotests/035|   1 +
 tests/qemu-iotests/036|   1 +
 tests/qemu-iotests/037|   1 +
 tests/qemu-iotests/038|   1 +
 tests/qemu-iotests/039|   1 +
 tests/qemu-iotests/040|   1 +
 tests/qemu-iotests/041|   1 +
 tests/qemu-iotests/042|   1 +
 tests/qemu-iotests/043|   1 +
 tests/qemu-iotests/044|   1 +
 tests/qemu-iotests/045|   1 +
 tests/qemu-iotests/046|   1 +
 tests/qemu-iotests/047|   1 +
 tests/qemu-iotests/048|   1 +
 tests/qemu-iotests/049|   1 +
 tests/qemu-iotests/050|   1 +
 tests/qemu-iotests/051|   1 +
 tests/qemu-iotests/052|   1 +
 tests/qemu-iotests/053|   1 +
 tests/qemu-iotests/054|   1 +
 tests/qemu-iotests/055|   1 +
 tests/qemu-iotests/056|   1 +
 tests/qemu-iotests/057|   1 +
 tests/qemu-iotests/058|   1 +
 tests/qemu-iotests/059|   1 +
 tests/qemu-iotests/060|   1 +
 tests/qemu-iotests/061|   1 +
 tests/qemu-iotests/062|   1 +
 tests/qemu-iotests/063|   1 +
 tests/qemu-iotests/064|   1 +
 tests/qemu-iotests/065|   1 +
 tests/qemu-iotests/066|   1 +
 tests/qemu-iotests/068|   1 +
 tests/qemu-iotests/069|   1 +
 tests/qemu-iotests/070|   1 +
 tests/qemu-iotests/071|   1 +
 tests/qemu-iotests/072|   1 +
 tests/qemu-iotests/073|   1 +
 tests/qemu-iotests/074|   1 +
 tests/qemu-iotests/075|   1

[PATCH v5 02/12] iotests/303: use dot slash for qcow2.py running

2020-10-22 Thread Vladimir Sementsov-Ogievskiy

If you run './check 303', check includes common.config which adjusts
$PATH to include '.' first, and therefore finds qcow2.py on PATH.  But
if you run './303' directly, there is nothing to adjust PATH, and if
'.' is not already on your PATH by other means, the test fails because
the executable is not found.  Adjust how we invoke the helper
executable to avoid needing a PATH search in the first place.

Signed-off-by: Vladimir Sementsov-Ogievskiy 
---
 tests/qemu-iotests/303 | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/tests/qemu-iotests/303 b/tests/qemu-iotests/303
index 6c21774483..11cd9eeb26 100755
--- a/tests/qemu-iotests/303
+++ b/tests/qemu-iotests/303
@@ -56,7 +56,7 @@ qemu_img_create('-f', iotests.imgfmt, disk, '10M')
 
 add_bitmap(1, 0, 6, False)
 add_bitmap(2, 6, 8, True)
-dump = ['qcow2.py', disk, 'dump-header']
+dump = ['./qcow2.py', disk, 'dump-header']
 subprocess.run(dump)
 # Dump the metadata in JSON format
 dump.append('-j')
-- 
2.21.3

[PATCH v5 2/2] hw/block/nvme: add the dataset management command

2020-10-22 Thread Klaus Jensen

From: Klaus Jensen 

Add support for the Dataset Management command and the Deallocate
attribute. Deallocation results in discards being sent to the underlying
block device. Whether or not the blocks are actually deallocated is
affected by the same factors as Write Zeroes (see previous commit).

  format | discard | dsm (512b)  dsm (4kb)  dsm (64kb)
 ------------------------------------------------------
  qcow2    ignore    n           n          n
  qcow2    unmap     n           n          y
  raw      ignore    n           n          n
  raw      unmap     n           y          y

Again, a raw format and 4kb LBAs are preferable.

In order to set the Namespace Preferred Deallocate Granularity and
Alignment fields (NPDG and NPDA), choose a sane minimum discard
granularity of 4kb. If we are using a passthru device supporting discard
at a 512b granularity, the user should set the discard_granularity
property explicitly. NPDG and NPDA will also account for the cluster_size
of the block driver if required (i.e. for QCOW2).

See NVM Express 1.3d, Section 6.7 ("Dataset Management command").
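
The arithmetic behind NPDG/NPDA can be sketched in a few lines of Python.
This mirrors the logic of the nvme_ns_init() hunk below as I read it; the
function name npdg_field is hypothetical, and None stands in for the
"unset" (-1) discard_granularity property.

```python
KiB = 1024
MIN_DISCARD_GRANULARITY = 4 * KiB


def npdg_field(logical_block_size, discard_granularity=None, cluster_size=0):
    """Return the 0's based NPDG/NPDA value, in logical blocks."""
    if discard_granularity is None:
        # Unset: fall back to the larger of the LBA size and 4 KiB.
        discard_granularity = max(logical_block_size, MIN_DISCARD_GRANULARITY)

    npdg = discard_granularity // logical_block_size

    # A larger block driver cluster size (e.g. qcow2) takes precedence.
    if cluster_size and cluster_size > discard_granularity:
        npdg = cluster_size // logical_block_size

    return npdg - 1
```

For example, 512 byte LBAs with the default granularity give 4096 / 512 - 1,
while a 64 KiB cluster size raises the value to 65536 / 512 - 1.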

Signed-off-by: Klaus Jensen 
---
 hw/block/nvme.h  |   2 +
 include/block/nvme.h |   7 ++-
 hw/block/nvme-ns.c   |  36 +--
 hw/block/nvme.c  | 101 ++-
 4 files changed, 140 insertions(+), 6 deletions(-)

diff --git a/hw/block/nvme.h b/hw/block/nvme.h
index e080a2318a50..574333caa3f9 100644
--- a/hw/block/nvme.h
+++ b/hw/block/nvme.h
@@ -28,6 +28,7 @@ typedef struct NvmeRequest {
 struct NvmeNamespace*ns;
 BlockAIOCB  *aiocb;
 uint16_tstatus;
+void*opaque;
 NvmeCqe cqe;
 NvmeCmd cmd;
 BlockAcctCookie acct;
@@ -60,6 +61,7 @@ static inline const char *nvme_io_opc_str(uint8_t opc)
 case NVME_CMD_WRITE:return "NVME_NVM_CMD_WRITE";
 case NVME_CMD_READ: return "NVME_NVM_CMD_READ";
 case NVME_CMD_WRITE_ZEROES: return "NVME_NVM_CMD_WRITE_ZEROES";
+case NVME_CMD_DSM:  return "NVME_NVM_CMD_DSM";
 default:return "NVME_NVM_CMD_UNKNOWN";
 }
 }
diff --git a/include/block/nvme.h b/include/block/nvme.h
index 966c3bb304bd..e95ff6ca9b37 100644
--- a/include/block/nvme.h
+++ b/include/block/nvme.h
@@ -990,7 +990,12 @@ typedef struct QEMU_PACKED NvmeIdNs {
 uint16_tnabspf;
 uint16_tnoiob;
 uint8_t nvmcap[16];
-uint8_t rsvd64[40];
+uint16_tnpwg;
+uint16_tnpwa;
+uint16_tnpdg;
+uint16_tnpda;
+uint16_tnows;
+uint8_t rsvd74[30];
 uint8_t nguid[16];
 uint64_teui64;
 NvmeLBAFlbaf[16];
diff --git a/hw/block/nvme-ns.c b/hw/block/nvme-ns.c
index f1cc734c60f5..840651db7256 100644
--- a/hw/block/nvme-ns.c
+++ b/hw/block/nvme-ns.c
@@ -28,10 +28,14 @@
 #include "nvme.h"
 #include "nvme-ns.h"
 
-static void nvme_ns_init(NvmeNamespace *ns)
+#define MIN_DISCARD_GRANULARITY (4 * KiB)
+
+static int nvme_ns_init(NvmeNamespace *ns, Error **errp)
 {
+BlockDriverInfo bdi;
 NvmeIdNs *id_ns = &ns->id_ns;
 int lba_index = NVME_ID_NS_FLBAS_INDEX(ns->id_ns.flbas);
+int npdg, ret;
 
 ns->id_ns.dlfeat = 0x9;
 
@@ -43,8 +47,25 @@ static void nvme_ns_init(NvmeNamespace *ns)
 id_ns->ncap = id_ns->nsze;
 id_ns->nuse = id_ns->ncap;
 
-/* support DULBE */
-id_ns->nsfeat |= 0x4;
+/* support DULBE and I/O optimization fields */
+id_ns->nsfeat |= (0x4 | 0x10);
+
+npdg = ns->blkconf.discard_granularity / ns->blkconf.logical_block_size;
+
+ret = bdrv_get_info(blk_bs(ns->blkconf.blk), &bdi);
+if (ret < 0) {
+error_setg_errno(errp, -ret, "could not get block driver info");
+return ret;
+}
+
+if (bdi.cluster_size &&
+bdi.cluster_size > ns->blkconf.discard_granularity) {
+npdg = bdi.cluster_size / ns->blkconf.logical_block_size;
+}
+
+id_ns->npda = id_ns->npdg = npdg - 1;
+
+return 0;
 }
 
 static int nvme_ns_init_blk(NvmeCtrl *n, NvmeNamespace *ns, Error **errp)
@@ -59,6 +80,11 @@ static int nvme_ns_init_blk(NvmeCtrl *n, NvmeNamespace *ns, 
Error **errp)
 return -1;
 }
 
+if (ns->blkconf.discard_granularity == -1) {
+ns->blkconf.discard_granularity =
+MAX(ns->blkconf.logical_block_size, MIN_DISCARD_GRANULARITY);
+}
+
 ns->size = blk_getlength(ns->blkconf.blk);
 if (ns->size < 0) {
 error_setg_errno(errp, -ns->size, "could not get blockdev size");
@@ -92,7 +118,9 @@ int nvme_ns_setup(NvmeCtrl *n, NvmeNamespace *ns, Error 
**errp)
 return -1;
 }
 
-nvme_ns_init(ns);
+if (nvme_ns_init(ns, errp)) {
+return -1;
+}
 
 if (nvme_register_namespace(n, ns, errp)) {
 return -1;
diff --git a/hw/block/nvme.c b/hw/block/nvme.c
index 4ab0705f5a92..7acb9e9dc38a 100644
--- a/hw/block/nvme.c
+++ b/hw/block/nvme.c
@@ -959,6

[PATCH v5 1/2] hw/block/nvme: add dulbe support

2020-10-22 Thread Klaus Jensen

From: Klaus Jensen 

Add support for reporting the Deallocated or Unwritten Logical Block
Error (DULBE).

Rely on the block status flags reported by the block layer and consider
any block with the BDRV_BLOCK_ZERO flag to be deallocated.

Multiple factors affect when a Write Zeroes command results in
deallocation of blocks.

  * the underlying file system block size
  * the blockdev format
  * the 'discard' and 'logical_block_size' parameters

  format | discard | wz (512b)  wz (4kb)  wz (64kb)
 ---------------------------------------------------
  qcow2    ignore    n          n         y
  qcow2    unmap     n          n         y
  raw      ignore    n          y         y
  raw      unmap     n          y         y

So, this works best with an image in raw format and 4kb LBAs, since
holes can then be punched on a per-block basis (this assumes a file
system with a 4kb block size, YMMV). A qcow2 image uses a cluster size
of 64kb by default and blocks will only be marked deallocated if a full
cluster is zeroed or discarded. However, this *is* consistent with the
spec since Write Zeroes "should" deallocate the block if the Deallocate
attribute is set and "may" deallocate if the Deallocate attribute is not
set. Thus, we always try to deallocate (the BDRV_REQ_MAY_UNMAP flag is
always set).
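
A simplified way to reason about the table above: a zeroed range can only
end up deallocated where it fully covers an aligned allocation unit (the
qcow2 cluster, or the file system block for raw). The Python sketch below
counts such units; the function name fully_covered_units is made up, and
the real outcome also depends on the block layer and host file system.

```python
def fully_covered_units(offset, length, granularity):
    """Count granularity-aligned units fully inside [offset, offset+length)."""
    first = -(-offset // granularity)          # round start up to a boundary
    last = (offset + length) // granularity    # round end down to a boundary
    return max(0, last - first)
```

With a 64 KiB granularity a single 512 byte or 4 KiB write covers no full
unit, while an aligned 64 KiB write covers exactly one.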

Signed-off-by: Klaus Jensen 
Reviewed-by: Keith Busch 
---
 hw/block/nvme-ns.h|  4 +++
 include/block/nvme.h  |  5 +++
 hw/block/nvme-ns.c|  8 +++--
 hw/block/nvme.c   | 83 +--
 hw/block/trace-events |  4 +++
 5 files changed, 99 insertions(+), 5 deletions(-)

diff --git a/hw/block/nvme-ns.h b/hw/block/nvme-ns.h
index 83734f4606e1..44bf6271b744 100644
--- a/hw/block/nvme-ns.h
+++ b/hw/block/nvme-ns.h
@@ -31,6 +31,10 @@ typedef struct NvmeNamespace {
 NvmeIdNs id_ns;
 
 NvmeNamespaceParams params;
+
+struct {
+uint32_t err_rec;
+} features;
 } NvmeNamespace;
 
 static inline uint32_t nvme_nsid(NvmeNamespace *ns)
diff --git a/include/block/nvme.h b/include/block/nvme.h
index 8a46d9cf015f..966c3bb304bd 100644
--- a/include/block/nvme.h
+++ b/include/block/nvme.h
@@ -687,6 +687,7 @@ enum NvmeStatusCodes {
 NVME_E2E_REF_ERROR  = 0x0284,
 NVME_CMP_FAILURE= 0x0285,
 NVME_ACCESS_DENIED  = 0x0286,
+NVME_DULB   = 0x0287,
 NVME_MORE   = 0x2000,
 NVME_DNR= 0x4000,
 NVME_NO_COMPLETE= 0x,
@@ -903,6 +904,9 @@ enum NvmeIdCtrlLpa {
 #define NVME_AEC_NS_ATTR(aec)   ((aec >> 8) & 0x1)
 #define NVME_AEC_FW_ACTIVATION(aec) ((aec >> 9) & 0x1)
 
+#define NVME_ERR_REC_TLER(err_rec)  (err_rec & 0x)
+#define NVME_ERR_REC_DULBE(err_rec) (err_rec & 0x1)
+
 enum NvmeFeatureIds {
 NVME_ARBITRATION= 0x1,
 NVME_POWER_MANAGEMENT   = 0x2,
@@ -1023,6 +1027,7 @@ enum NvmeNsIdentifierType {
 
 
 #define NVME_ID_NS_NSFEAT_THIN(nsfeat)  ((nsfeat & 0x1))
+#define NVME_ID_NS_NSFEAT_DULBE(nsfeat) ((nsfeat >> 2) & 0x1)
 #define NVME_ID_NS_FLBAS_EXTENDED(flbas)((flbas >> 4) & 0x1)
 #define NVME_ID_NS_FLBAS_INDEX(flbas)   ((flbas & 0xf))
 #define NVME_ID_NS_MC_SEPARATE(mc)  ((mc >> 1) & 0x1)
diff --git a/hw/block/nvme-ns.c b/hw/block/nvme-ns.c
index 31c80cdf5b5f..f1cc734c60f5 100644
--- a/hw/block/nvme-ns.c
+++ b/hw/block/nvme-ns.c
@@ -33,9 +33,7 @@ static void nvme_ns_init(NvmeNamespace *ns)
 NvmeIdNs *id_ns = &ns->id_ns;
 int lba_index = NVME_ID_NS_FLBAS_INDEX(ns->id_ns.flbas);
 
-if (blk_get_flags(ns->blkconf.blk) & BDRV_O_UNMAP) {
-ns->id_ns.dlfeat = 0x9;
-}
+ns->id_ns.dlfeat = 0x9;
 
 id_ns->lbaf[lba_index].ds = 31 - clz32(ns->blkconf.logical_block_size);
 
@@ -44,6 +42,9 @@ static void nvme_ns_init(NvmeNamespace *ns)
 /* no thin provisioning */
 id_ns->ncap = id_ns->nsze;
 id_ns->nuse = id_ns->ncap;
+
+/* support DULBE */
+id_ns->nsfeat |= 0x4;
 }
 
 static int nvme_ns_init_blk(NvmeCtrl *n, NvmeNamespace *ns, Error **errp)
@@ -92,6 +93,7 @@ int nvme_ns_setup(NvmeCtrl *n, NvmeNamespace *ns, Error 
**errp)
 }
 
 nvme_ns_init(ns);
+
 if (nvme_register_namespace(n, ns, errp)) {
 return -1;
 }
diff --git a/hw/block/nvme.c b/hw/block/nvme.c
index fa2cba744b57..4ab0705f5a92 100644
--- a/hw/block/nvme.c
+++ b/hw/block/nvme.c
@@ -105,6 +105,7 @@ static const bool nvme_feature_support[NVME_FID_MAX] = {
 
 static const uint32_t nvme_feature_cap[NVME_FID_MAX] = {
 [NVME_TEMPERATURE_THRESHOLD]= NVME_FEAT_CAP_CHANGE,
+[NVME_ERROR_RECOVERY]   = NVME_FEAT_CAP_CHANGE | NVME_FEAT_CAP_NS,
 [NVME_VOLATILE_WRITE_CACHE] = NVME_FEAT_CAP_CHANGE,
 [NVME_NUMBER_OF_QUEUES] = NVME_FEAT_CAP_CHANGE,
 [NVME_ASYNCHRONOUS_EVENT_CONF]  = NVME_FEAT_CAP_CHANGE,
@@ -878,6 +879,41 @@ static inline uint16_t nvme_check_bounds(NvmeCtrl *n, 
NvmeNamespace *ns,
 return NVME_SUCCESS;
 }
 
+static

[PATCH v5 0/2] hw/block/nvme: dulbe and dsm support

2020-10-22 Thread Klaus Jensen

From: Klaus Jensen 

This adds support for the Deallocated or Unwritten Logical Block error
recovery feature as well as the Dataset Management command.

I wanted to add support for the NPDG and NPDA fields such that the host
could get a hint on how many blocks to request deallocation of for the
deallocation to actually happen, but I cannot find a reliable way to
get the actual block size of the underlying device. If it is an image on
a file system we could typically use the host page size, but if it is a
raw device, we might have 512 byte sectors that we can issue discards
on. And QEMU doesn't seem to provide this without root privileges at
least.

See the two patches for some gotchas.

I also integrated this into my zoned proposal. I'll spare you the v4, nobody
cares anyway. But I put it in my repo[1] for posterity.

  [1]: https://irrelevant.dk/g/pci-nvme.git/tag/?h=zoned-v4.

v5:
  - Restore status code from callback (Keith)

v4:
  - Removed mixed declaration and code (Keith)
  - Set NPDG and NPDA and account for the blockdev cluster size.

Klaus Jensen (2):
  hw/block/nvme: add dulbe support
  hw/block/nvme: add the dataset management command

 hw/block/nvme-ns.h|   4 +
 hw/block/nvme.h   |   2 +
 include/block/nvme.h  |  12 ++-
 hw/block/nvme-ns.c|  40 +++--
 hw/block/nvme.c   | 184 +-
 hw/block/trace-events |   4 +
 6 files changed, 237 insertions(+), 9 deletions(-)

-- 
2.28.0

Re: [PATCH v4 2/2] hw/block/nvme: add the dataset management command

2020-10-22 Thread Klaus Jensen

On Oct 22 10:50, Keith Busch wrote:
> On Thu, Oct 22, 2020 at 07:43:33PM +0200, Klaus Jensen wrote:
> > On Oct 22 08:01, Keith Busch wrote:
> > > On Thu, Oct 22, 2020 at 09:33:13AM +0200, Klaus Jensen wrote:
> > > > +if (--(*discards)) {
> > > > +status = NVME_NO_COMPLETE;
> > > > +} else {
> > > > +g_free(discards);
> > > > +req->opaque = NULL;
> > > 
> > > This case needs a
> > > 
> > > status = req->status;
> > > 
> > > So that we get the error set in the callback.
> > > 
> > 
> > There are no cases that result in a non-zero status code here.
> 
> Your callback has a case that sets NVME_INTERNAL_DEV_ERROR status. That
> would get ignored if the final discard reference is dropped from the
> submission side.
> 

Oh. Crap. You are right. Nice catch!

> +static void nvme_aio_discard_cb(void *opaque, int ret)
> +{
> +NvmeRequest *req = opaque;
> +int *discards = req->opaque;
> +
> +trace_pci_nvme_aio_discard_cb(nvme_cid(req));
> +
> +if (ret) {
> +req->status = NVME_INTERNAL_DEV_ERROR;
> +trace_pci_nvme_err_aio(nvme_cid(req), strerror(ret),
> +   req->status);
> +}
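
For the archive, the problem Keith points out can be modelled in a few
lines of Python. This is only a toy model of the reference counting, not
the QEMU code: the Request class, the helper names and the constant
values are all placeholders chosen for the example.

```python
# Placeholder status values; only their distinctness matters here.
SUCCESS = 'success'
INTERNAL_DEV_ERROR = 'internal-dev-error'
NO_COMPLETE = 'no-complete'


class Request:
    def __init__(self, nr_discards):
        self.status = SUCCESS
        # One reference per in-flight discard plus one held by the
        # submission side while it is still issuing discards.
        self.discards = nr_discards + 1


def drop_ref(req):
    """Drop one reference; whoever drops the last one completes the request."""
    req.discards -= 1
    if req.discards:
        return NO_COMPLETE
    # The fix: report whatever a callback recorded, not a fresh success.
    return req.status


def discard_cb(req, ret):
    """Completion callback for one discard; ret != 0 means it failed."""
    if ret:
        req.status = INTERNAL_DEV_ERROR
    return drop_ref(req)
```

If both callbacks run before the submission side drops its reference, the
final drop_ref() happens on the submission side, and without returning
req.status the error recorded by the callback would be lost.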



Re: [PATCH v8 2/2] hw/misc/sifive_u_otp: Add backend drive support

2020-10-22 Thread Alistair Francis

On Mon, Oct 19, 2020 at 8:37 PM Green Wan  wrote:
>
> Add '-drive' support to OTP device. Allow users to assign a raw file
> as OTP image.
>
> test commands for 16k otp.img filled with zero:
>
> $ dd if=/dev/zero of=./otp.img bs=1k count=16
> $ ./qemu-system-riscv64 -M sifive_u -m 256M -nographic -bios none \
> -kernel ../opensbi/build/platform/sifive/fu540/firmware/fw_payload.elf \
> -d guest_errors -drive if=none,format=raw,file=otp.img
>
> Signed-off-by: Green Wan 
> Reviewed-by: Bin Meng 
> Tested-by: Bin Meng 

Acked-by: Alistair Francis 

Alistair

> ---
>  hw/misc/sifive_u_otp.c | 65 ++
>  include/hw/misc/sifive_u_otp.h |  2 ++
>  2 files changed, 67 insertions(+)
>
> diff --git a/hw/misc/sifive_u_otp.c b/hw/misc/sifive_u_otp.c
> index b9238d64cb..60066375ab 100644
> --- a/hw/misc/sifive_u_otp.c
> +++ b/hw/misc/sifive_u_otp.c
> @@ -19,11 +19,14 @@
>   */
>
>  #include "qemu/osdep.h"
> +#include "qapi/error.h"
>  #include "hw/qdev-properties.h"
>  #include "hw/sysbus.h"
>  #include "qemu/log.h"
>  #include "qemu/module.h"
>  #include "hw/misc/sifive_u_otp.h"
> +#include "sysemu/blockdev.h"
> +#include "sysemu/block-backend.h"
>
>  #define WRITTEN_BIT_ON 0x1
>
> @@ -54,6 +57,16 @@ static uint64_t sifive_u_otp_read(void *opaque, hwaddr 
> addr, unsigned int size)
>  if ((s->pce & SIFIVE_U_OTP_PCE_EN) &&
>  (s->pdstb & SIFIVE_U_OTP_PDSTB_EN) &&
>  (s->ptrim & SIFIVE_U_OTP_PTRIM_EN)) {
> +
> +/* read from backend */
> +if (s->blk) {
> +int32_t buf;
> +
> +blk_pread(s->blk, s->pa * SIFIVE_U_OTP_FUSE_WORD, &buf,
> +  SIFIVE_U_OTP_FUSE_WORD);
> +return buf;
> +}
> +
>  return s->fuse[s->pa & SIFIVE_U_OTP_PA_MASK];
>  } else {
>  return 0xff;
> @@ -145,6 +158,12 @@ static void sifive_u_otp_write(void *opaque, hwaddr addr,
>  /* write bit data */
>  SET_FUSEARRAY_BIT(s->fuse, s->pa, s->paio, s->pdin);
>
> +/* write to backend */
> +if (s->blk) {
> +blk_pwrite(s->blk, s->pa * SIFIVE_U_OTP_FUSE_WORD,
> +   &s->fuse[s->pa], SIFIVE_U_OTP_FUSE_WORD, 0);
> +}
> +
>  /* update written bit */
>  SET_FUSEARRAY_BIT(s->fuse_wo, s->pa, s->paio, WRITTEN_BIT_ON);
>  }
> @@ -168,16 +187,48 @@ static const MemoryRegionOps sifive_u_otp_ops = {
>
>  static Property sifive_u_otp_properties[] = {
>  DEFINE_PROP_UINT32("serial", SiFiveUOTPState, serial, 0),
> +DEFINE_PROP_DRIVE("drive", SiFiveUOTPState, blk),
>  DEFINE_PROP_END_OF_LIST(),
>  };
>
>  static void sifive_u_otp_realize(DeviceState *dev, Error **errp)
>  {
>  SiFiveUOTPState *s = SIFIVE_U_OTP(dev);
> +DriveInfo *dinfo;
>
>  memory_region_init_io(&s->mmio, OBJECT(dev), &sifive_u_otp_ops, s,
>TYPE_SIFIVE_U_OTP, SIFIVE_U_OTP_REG_SIZE);
>  sysbus_init_mmio(SYS_BUS_DEVICE(dev), &s->mmio);
> +
> +dinfo = drive_get_next(IF_NONE);
> +if (dinfo) {
> +int ret;
> +uint64_t perm;
> +int filesize;
> +BlockBackend *blk;
> +
> +blk = blk_by_legacy_dinfo(dinfo);
> +filesize = SIFIVE_U_OTP_NUM_FUSES * SIFIVE_U_OTP_FUSE_WORD;
> +if (blk_getlength(blk) < filesize) {
> +error_setg(errp, "OTP drive size < 16K");
> +return;
> +}
> +
> +qdev_prop_set_drive_err(dev, "drive", blk, errp);
> +
> +if (s->blk) {
> +perm = BLK_PERM_CONSISTENT_READ |
> +   (blk_is_read_only(s->blk) ? 0 : BLK_PERM_WRITE);
> +ret = blk_set_perm(s->blk, perm, BLK_PERM_ALL, errp);
> +if (ret < 0) {
> +return;
> +}
> +
> +if (blk_pread(s->blk, 0, s->fuse, filesize) != filesize) {
> +error_setg(errp, "failed to read the initial flash content");
> +}
> +}
> +}
>  }
>
>  static void sifive_u_otp_reset(DeviceState *dev)
> @@ -191,6 +242,20 @@ static void sifive_u_otp_reset(DeviceState *dev)
>  s->fuse[SIFIVE_U_OTP_SERIAL_ADDR] = s->serial;
>  s->fuse[SIFIVE_U_OTP_SERIAL_ADDR + 1] = ~(s->serial);
>
> +if (s->blk) {
> +/* Put serial number to backend as well*/
> +uint32_t serial_data;
> +int index = SIFIVE_U_OTP_SERIAL_ADDR;
> +
> +serial_data = s->serial;
> +blk_pwrite(s->blk, index * SIFIVE_U_OTP_FUSE_WORD,
> +   &serial_data, SIFIVE_U_OTP_FUSE_WORD, 0);
> +
> +serial_data = ~(s->serial);
> +blk_pwrite(s->blk, (index + 1) * SIFIVE_U_OTP_FUSE_WORD,
> +   &serial_data, SIFIVE_U_OTP_FUSE_WORD, 0);
> +}
> +
>  /* Initialize write-once map */
>  memset(s->fuse_wo, 0x00, sizeof(s->fuse_wo));
>  }
> diff --git a/include/hw/misc/sifive_u_otp.h b/include/hw/misc/sifive_u_otp.h
> index ebffbc1fa5..5d0d7df455 100644
> ---
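
The backend access pattern in the quoted patch (one fuse word stored at byte offset pa * SIFIVE_U_OTP_FUSE_WORD, image at least 16K) can be sketched in Python as below. This is only a model, not the device code: the geometry constants (0x1000 words of 4 bytes) are assumptions chosen to match the "OTP drive size < 16K" check, the little-endian layout is assumed, and the helper names are hypothetical.

```python
import struct

# Assumed geometry: 0x1000 fuse words of 4 bytes each (16 KiB in total).
# These values are not shown in the quoted diff.
NUM_FUSES = 0x1000
FUSE_WORD = 4
IMAGE_SIZE = NUM_FUSES * FUSE_WORD


def blank_image() -> bytearray:
    """In-memory equivalent of `dd if=/dev/zero bs=1k count=16`."""
    return bytearray(IMAGE_SIZE)


def write_fuse(image: bytearray, pa: int, value: int) -> None:
    """Store one 32-bit word at byte offset pa * FUSE_WORD (little-endian assumed)."""
    if len(image) < IMAGE_SIZE:
        raise ValueError("OTP image smaller than 16 KiB")
    struct.pack_into("<I", image, pa * FUSE_WORD, value & 0xFFFFFFFF)


def read_fuse(image: bytearray, pa: int) -> int:
    """Load the 32-bit word stored at byte offset pa * FUSE_WORD."""
    return struct.unpack_from("<I", image, pa * FUSE_WORD)[0]
```

Writing a value and its bitwise complement to two adjacent words mirrors what the reset handler does with the serial number.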

[PATCH v12 12/14] copy-on-read: skip non-guest reads if no copy needed

2020-10-22 Thread Andrey Shinkevich via

If the flag BDRV_REQ_PREFETCH is set, skip idle read/write
operations in the COR driver. This can be taken into account when
optimizing the COR algorithms. At the moment, this check is made
during the block-stream job.

Signed-off-by: Andrey Shinkevich 
---
 block/copy-on-read.c | 12 
 1 file changed, 8 insertions(+), 4 deletions(-)

diff --git a/block/copy-on-read.c b/block/copy-on-read.c
index a2b180a..081e661 100644
--- a/block/copy-on-read.c
+++ b/block/copy-on-read.c
@@ -153,10 +153,14 @@ static int coroutine_fn 
cor_co_preadv_part(BlockDriverState *bs,
 }
 }
 
-ret = bdrv_co_preadv_part(bs->file, offset, n, qiov, qiov_offset,
-  local_flags);
-if (ret < 0) {
-return ret;
+/* Skip if neither read nor write are needed */
+if ((local_flags & (BDRV_REQ_PREFETCH | BDRV_REQ_COPY_ON_READ)) !=
+BDRV_REQ_PREFETCH) {
+ret = bdrv_co_preadv_part(bs->file, offset, n, qiov, qiov_offset,
+  local_flags);
+if (ret < 0) {
+return ret;
+}
 }
 
 offset += n;
-- 
1.8.3.1
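
The condition added in the patch above reads as "issue the read unless PREFETCH is set without COPY_ON_READ". A small Python truth-table sketch of that test follows; BDRV_REQ_PREFETCH = 0x200 is the value shown elsewhere in this series, while BDRV_REQ_COPY_ON_READ = 0x1 is an assumption made for illustration, and must_read() is a hypothetical name.

```python
BDRV_REQ_COPY_ON_READ = 0x1   # assumed value, for illustration only
BDRV_REQ_PREFETCH = 0x200     # value shown in include/block/block.h


def must_read(local_flags: int) -> bool:
    """Mirror of the C condition that guards bdrv_co_preadv_part()."""
    mask = BDRV_REQ_PREFETCH | BDRV_REQ_COPY_ON_READ
    return (local_flags & mask) != BDRV_REQ_PREFETCH
```

Only the combination "PREFETCH alone" skips the read; a plain guest read, a COR read, and a COR prefetch all still reach the child.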

Re: [PATCH v27 05/17] vfio: Add VM state change handler to know state of VM

2020-10-22 Thread Alex Williamson

On Thu, 22 Oct 2020 23:11:39 +0530
Kirti Wankhede  wrote:

> On 10/22/2020 10:05 PM, Alex Williamson wrote:
> > On Thu, 22 Oct 2020 16:41:55 +0530
> > Kirti Wankhede  wrote:
> >   
> >> VM state change handler is called on change in VM's state. Based on
> >> VM state, VFIO device state should be changed.
> >> Added read/write helper functions for migration region.
> >> Added function to set device_state.
> >>
> >> Signed-off-by: Kirti Wankhede 
> >> Reviewed-by: Neo Jia 
> >> Reviewed-by: Dr. David Alan Gilbert 
> >> ---
> >>   hw/vfio/migration.c   | 158 
> >> ++
> >>   hw/vfio/trace-events  |   2 +
> >>   include/hw/vfio/vfio-common.h |   4 ++
> >>   3 files changed, 164 insertions(+)
> >>
> >> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
> >> index 5f74a3ad1d72..34f39c7e2e28 100644
> >> --- a/hw/vfio/migration.c
> >> +++ b/hw/vfio/migration.c
> >> @@ -10,6 +10,7 @@
> >>   #include "qemu/osdep.h"
> >>   #include 
> >>   
> >> +#include "sysemu/runstate.h"
> >>   #include "hw/vfio/vfio-common.h"
> >>   #include "cpu.h"
> >>   #include "migration/migration.h"
> >> @@ -22,6 +23,157 @@
> >>   #include "exec/ram_addr.h"
> >>   #include "pci.h"
> >>   #include "trace.h"
> >> +#include "hw/hw.h"
> >> +
> >> +static inline int vfio_mig_access(VFIODevice *vbasedev, void *val, int 
> >> count,
> >> +  off_t off, bool iswrite)
> >> +{
> >> +int ret;
> >> +
> >> +ret = iswrite ? pwrite(vbasedev->fd, val, count, off) :
> >> +pread(vbasedev->fd, val, count, off);
> >> +if (ret < count) {
> >> +error_report("vfio_mig_%s %d byte %s: failed at offset 0x%lx, 
> >> err: %s",
> >> + iswrite ? "write" : "read", count,
> >> + vbasedev->name, off, strerror(errno));
> >> +return (ret < 0) ? ret : -EINVAL;
> >> +}
> >> +return 0;
> >> +}
> >> +
> >> +static int vfio_mig_rw(VFIODevice *vbasedev, __u8 *buf, size_t count,
> >> +   off_t off, bool iswrite)
> >> +{
> >> +int ret, done = 0;
> >> +__u8 *tbuf = buf;
> >> +
> >> +while (count) {
> >> +int bytes = 0;
> >> +
> >> +if (count >= 8 && !(off % 8)) {
> >> +bytes = 8;
> >> +} else if (count >= 4 && !(off % 4)) {
> >> +bytes = 4;
> >> +} else if (count >= 2 && !(off % 2)) {
> >> +bytes = 2;
> >> +} else {
> >> +bytes = 1;
> >> +}
> >> +
> >> +ret = vfio_mig_access(vbasedev, tbuf, bytes, off, iswrite);
> >> +if (ret) {
> >> +return ret;
> >> +}
> >> +
> >> +count -= bytes;
> >> +done += bytes;
> >> +off += bytes;
> >> +tbuf += bytes;
> >> +}
> >> +return done;
> >> +}
> >> +
> >> +#define vfio_mig_read(f, v, c, o)   vfio_mig_rw(f, (__u8 *)v, c, o, 
> >> false)
> >> +#define vfio_mig_write(f, v, c, o)  vfio_mig_rw(f, (__u8 *)v, c, o, 
> >> true)
> >> +
> >> +#define VFIO_MIG_STRUCT_OFFSET(f)   \
> >> + offsetof(struct 
> >> vfio_device_migration_info, f)
> >> +/*
> >> + * Change the device_state register for device @vbasedev. Bits set in 
> >> @mask
> >> + * are preserved, bits set in @value are set, and bits not set in either 
> >> @mask
> >> + * or @value are cleared in device_state. If the register cannot be 
> >> accessed,
> >> + * the resulting state would be invalid, or the device enters an error 
> >> state,
> >> + * an error is returned.
> >> + */
> >> +
> >> +static int vfio_migration_set_state(VFIODevice *vbasedev, uint32_t mask,
> >> +uint32_t value)
> >> +{
> >> +VFIOMigration *migration = vbasedev->migration;
> >> +VFIORegion *region = &migration->region;
> >> +off_t dev_state_off = region->fd_offset +
> >> +  VFIO_MIG_STRUCT_OFFSET(device_state);
> >> +uint32_t device_state;
> >> +int ret;
> >> +
> >> +ret = vfio_mig_read(vbasedev, &device_state, sizeof(device_state),
> >> +dev_state_off);
> >> +if (ret < 0) {
> >> +return ret;
> >> +}
> >> +
> >> +device_state = (device_state & mask) | value;
> >> +
> >> +if (!VFIO_DEVICE_STATE_VALID(device_state)) {
> >> +return -EINVAL;
> >> +}
> >> +
> >> +ret = vfio_mig_write(vbasedev, &device_state, sizeof(device_state),
> >> + dev_state_off);
> >> +if (ret < 0) {
> >> +int rret;
> >> +
> >> +rret = vfio_mig_read(vbasedev, &device_state,
> >> sizeof(device_state),
> >> + dev_state_off);
> >> +
> >> +if ((rret < 0) || (VFIO_DEVICE_STATE_IS_ERROR(device_state))) {
> >> +hw_error("%s: Device in error state 0x%x", vbasedev->name,
> >> + device_state);
> >> +return rret ? rret : -EIO;
> >> +}
> >> +return ret;
> >> +}
> >> +
> >> +
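
Two pieces of the quoted patch lend themselves to a short model: the way vfio_mig_rw() splits a transfer into naturally aligned 8/4/2/1-byte accesses, and the mask/value update described in the comment above vfio_migration_set_state(). The Python below is a sketch of that arithmetic only; the function names are hypothetical and no device access is performed.

```python
from typing import List, Tuple


def split_accesses(off: int, count: int) -> List[Tuple[int, int]]:
    """Return (offset, size) pairs, picking the largest aligned size each step."""
    accesses = []
    while count:
        for size in (8, 4, 2):
            if count >= size and off % size == 0:
                break
        else:
            size = 1  # no larger aligned access fits
        accesses.append((off, size))
        off += size
        count -= size
    return accesses


def next_device_state(current: int, mask: int, value: int) -> int:
    """Keep the bits selected by mask, set the bits in value, clear the rest."""
    return (current & mask) | value
```

For example, an 8-byte transfer starting at offset 1 is split into accesses of 1, 2, 4 and 1 bytes, since the offset only becomes 4-byte aligned after the first three bytes.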

[PATCH v12 13/14] stream: skip filters when writing backing file name to QCOW2 header

2020-10-22 Thread Andrey Shinkevich via

Avoid writing a filter's JSON file name and format name to the QCOW2
image when the backing file is changed after the block-stream job.
A user can still set the 'backing-file' parameter for a
block-stream job, keeping in mind the possible issue mentioned above.
If the user does not specify the 'backing-file' parameter, QEMU will
assign it automatically.

Signed-off-by: Andrey Shinkevich 
---
 block/stream.c | 15 +--
 blockdev.c |  9 ++---
 2 files changed, 15 insertions(+), 9 deletions(-)

diff --git a/block/stream.c b/block/stream.c
index e0540ee..1ba74ab 100644
--- a/block/stream.c
+++ b/block/stream.c
@@ -65,6 +65,7 @@ static int stream_prepare(Job *job)
 BlockDriverState *bs = blk_bs(bjob->blk);
 BlockDriverState *unfiltered_bs = bdrv_skip_filters(bs);
 BlockDriverState *base = bdrv_filter_or_cow_bs(s->above_base);
+BlockDriverState *base_unfiltered = NULL;
 Error *local_err = NULL;
 int ret = 0;
 
@@ -75,8 +76,18 @@ static int stream_prepare(Job *job)
 const char *base_id = NULL, *base_fmt = NULL;
 if (base) {
 base_id = s->backing_file_str;
-if (base->drv) {
-base_fmt = base->drv->format_name;
+if (base_id) {
+if (base->drv) {
+base_fmt = base->drv->format_name;
+}
+} else {
+base_unfiltered = bdrv_skip_filters(base);
+if (base_unfiltered) {
+base_id = base_unfiltered->filename;
+if (base_unfiltered->drv) {
+base_fmt = base_unfiltered->drv->format_name;
+}
+}
 }
 }
 bdrv_set_backing_hd(unfiltered_bs, base, &local_err);
diff --git a/blockdev.c b/blockdev.c
index c917625..0e9c783 100644
--- a/blockdev.c
+++ b/blockdev.c
@@ -2508,7 +2508,6 @@ void qmp_block_stream(bool has_job_id, const char 
*job_id, const char *device,
 BlockDriverState *base_bs = NULL;
 AioContext *aio_context;
 Error *local_err = NULL;
-const char *base_name = NULL;
 int job_flags = JOB_DEFAULT;
 
 if (!has_on_error) {
@@ -2536,7 +2535,6 @@ void qmp_block_stream(bool has_job_id, const char 
*job_id, const char *device,
 goto out;
 }
 assert(bdrv_get_aio_context(base_bs) == aio_context);
-base_name = base;
 }
 
 if (has_base_node) {
@@ -2551,7 +2549,6 @@ void qmp_block_stream(bool has_job_id, const char 
*job_id, const char *device,
 }
 assert(bdrv_get_aio_context(base_bs) == aio_context);
 bdrv_refresh_filename(base_bs);
-base_name = base_bs->filename;
 }
 
 /* Check for op blockers in the whole chain between bs and base */
@@ -2571,9 +2568,6 @@ void qmp_block_stream(bool has_job_id, const char 
*job_id, const char *device,
 goto out;
 }
 
-/* backing_file string overrides base bs filename */
-base_name = has_backing_file ? backing_file : base_name;
-
 if (has_auto_finalize && !auto_finalize) {
 job_flags |= JOB_MANUAL_FINALIZE;
 }
@@ -2581,7 +2575,8 @@ void qmp_block_stream(bool has_job_id, const char 
*job_id, const char *device,
 job_flags |= JOB_MANUAL_DISMISS;
 }
 
-stream_start(has_job_id ? job_id : NULL, bs, base_bs, base_name,
+stream_start(has_job_id ? job_id : NULL, bs, base_bs,
+ has_backing_file ? backing_file : NULL,
  job_flags, has_speed ? speed : 0, on_error,
  filter_node_name, &local_err);
 if (local_err) {
-- 
1.8.3.1
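
The selection logic that stream_prepare() gains in the patch above can be modelled roughly as follows. This is a sketch under simplifying assumptions: Node is a hypothetical stand-in for BlockDriverState with only the fields used here, and skip_filters() imitates bdrv_skip_filters() for a linear chain.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class Node:
    """Hypothetical stand-in for a BlockDriverState."""
    filename: str
    format_name: Optional[str]
    is_filter: bool = False
    child: Optional["Node"] = None  # node directly below a filter


def skip_filters(node: Optional[Node]) -> Optional[Node]:
    """Walk down past filter nodes, as bdrv_skip_filters() does."""
    while node is not None and node.is_filter:
        node = node.child
    return node


def backing_reference(base: Optional[Node],
                      backing_file_str: Optional[str]
                      ) -> Tuple[Optional[str], Optional[str]]:
    """Pick the (name, format) pair to record as the new backing file."""
    if base is None:
        return None, None
    if backing_file_str:
        # The user's string wins; the format is taken from base as given.
        return backing_file_str, base.format_name
    unfiltered = skip_filters(base)
    if unfiltered is None:
        return None, None
    return unfiltered.filename, unfiltered.format_name
```

With no user-supplied string, a filter sitting on top of the base image is skipped, so the image's own file name and format are recorded rather than the filter's.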

[PATCH v12 08/14] iotests: add #310 to test bottom node in COR driver

2020-10-22 Thread Andrey Shinkevich via

The test case #310 is similar to #216 by Max Reitz. The difference is
that the test #310 involves a bottom node to the COR filter driver.

Signed-off-by: Andrey Shinkevich 
---
 tests/qemu-iotests/310 | 109 +
 tests/qemu-iotests/310.out |  15 +++
 tests/qemu-iotests/group   |   3 +-
 3 files changed, 126 insertions(+), 1 deletion(-)
 create mode 100755 tests/qemu-iotests/310
 create mode 100644 tests/qemu-iotests/310.out

diff --git a/tests/qemu-iotests/310 b/tests/qemu-iotests/310
new file mode 100755
index 000..5ad7ad2
--- /dev/null
+++ b/tests/qemu-iotests/310
@@ -0,0 +1,109 @@
+#!/usr/bin/env python3
+#
+# Copy-on-read tests using a COR filter with a bottom node
+#
+# Copyright (c) 2020 Virtuozzo International GmbH
+#
+# This program is free software; you can redistribute it and/or modify
+# it under the terms of the GNU General Public License as published by
+# the Free Software Foundation; either version 2 of the License, or
+# (at your option) any later version.
+#
+# This program is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program.  If not, see .
+#
+
+import iotests
+from iotests import log, qemu_img, qemu_io_silent
+
+# Need backing file support
+iotests.script_initialize(supported_fmts=['qcow2', 'qcow', 'qed', 'vmdk'],
+  supported_platforms=['linux'])
+
+log('')
+log('=== Copy-on-read across nodes ===')
+log('')
+
+# This test is similar to the 216 one by Max Reitz 
+# The difference is that this test case involves a bottom node to the
+# COR filter driver.
+
+with iotests.FilePath('base.img') as base_img_path, \
+ iotests.FilePath('mid.img') as mid_img_path, \
+ iotests.FilePath('top.img') as top_img_path, \
+ iotests.VM() as vm:
+
+log('--- Setting up images ---')
+log('')
+
+assert qemu_img('create', '-f', iotests.imgfmt, base_img_path, '64M') == 0
+assert qemu_io_silent(base_img_path, '-c', 'write -P 1 0M 1M') == 0
+assert qemu_io_silent(base_img_path, '-c', 'write -P 1 3M 1M') == 0
+assert qemu_img('create', '-f', iotests.imgfmt, '-b', base_img_path,
+'-F', iotests.imgfmt, mid_img_path) == 0
+assert qemu_io_silent(mid_img_path,  '-c', 'write -P 3 2M 1M') == 0
+assert qemu_io_silent(mid_img_path,  '-c', 'write -P 3 4M 1M') == 0
+assert qemu_img('create', '-f', iotests.imgfmt, '-b', mid_img_path,
+'-F', iotests.imgfmt, top_img_path) == 0
+assert qemu_io_silent(top_img_path,  '-c', 'write -P 2 1M 1M') == 0
+
+log('Done')
+
+log('')
+log('--- Doing COR ---')
+log('')
+
+vm.launch()
+
+log(vm.qmp('blockdev-add',
+node_name='node0',
+driver='copy-on-read',
+bottom='node2',
+file={
+'driver': iotests.imgfmt,
+'file': {
+'driver': 'file',
+'filename': top_img_path
+},
+'backing': {
+'node-name': 'node2',
+'driver': iotests.imgfmt,
+'file': {
+'driver': 'file',
+'filename': mid_img_path
+},
+'backing': {
+#'node-name': 'node2',
+'driver': iotests.imgfmt,
+'file': {
+'driver': 'file',
+'filename': base_img_path
+}
+},
+}
+}))
+
+# Trigger COR
+log(vm.qmp('human-monitor-command',
+   command_line='qemu-io node0 "read 0 5M"'))
+
+vm.shutdown()
+
+log('')
+log('--- Checking COR result ---')
+log('')
+
+assert qemu_io_silent(base_img_path, '-c', 'discard 0 4M') == 0
+assert qemu_io_silent(mid_img_path, '-c', 'discard 0M 5M') == 0
+assert qemu_io_silent(top_img_path,  '-c', 'read -P 1 0M 1M') != 0
+assert qemu_io_silent(top_img_path,  '-c', 'read -P 2 1M 1M') == 0
+assert qemu_io_silent(top_img_path,  '-c', 'read -P 3 2M 1M') == 0
+assert qemu_io_silent(top_img_path,  '-c', 'read -P 1 3M 1M') != 0
+assert qemu_io_silent(top_img_path,  '-c', 'read -P 3 4M 1M') == 0
+
+log('Done')
diff --git a/tests/qemu-iotests/310.out b/tests/qemu-iotests/310.out
new file mode 100644
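
The outcome that test #310 checks can be reasoned about with a toy model of the three-image chain. In the Python sketch below each image is a dict mapping a 1 MiB block index to its fill pattern, and cor_read() is a hypothetical helper that copies data into the top image only when it is found at or above the bottom node. The block granularity and the names are simplifications, not the behaviour of the real driver.

```python
from typing import Dict, List


def cor_read(chain: List[Dict[int, int]], bottom_index: int,
             start: int, end: int) -> List[int]:
    """Read blocks [start, end) through chain[0], copying up from layers
    1..bottom_index; deeper layers are read but never copied."""
    top = chain[0]
    data = []
    for block in range(start, end):
        value = 0  # unallocated everywhere reads as zeroes
        for depth, layer in enumerate(chain):
            if block in layer:
                value = layer[block]
                if 0 < depth <= bottom_index:
                    top[block] = value
                break
        data.append(value)
    return data


# Same layout as the test: patterns 1 in base, 3 in mid, 2 in top.
base = {0: 1, 3: 1}
mid = {2: 3, 4: 3}
top = {1: 2}
result = cor_read([top, mid, base], bottom_index=1, start=0, end=5)
```

After the read, the top image holds blocks 1, 2 and 4 only, which is why the test expects the pattern-1 reads at 0M and 3M to fail once the lower images have been discarded.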

[PATCH v12 05/14] qapi: create BlockdevOptionsCor structure for COR driver

2020-10-22 Thread Andrey Shinkevich via

Create the BlockdevOptionsCor structure for COR driver specific options,
splitting it off from BlockdevOptionsGenericFormat. The only option in
the structure, the 'bottom' node, denotes an image file that limits the
COR operations in the backing chain.

Suggested-by: Max Reitz 
Signed-off-by: Andrey Shinkevich 
---
 qapi/block-core.json | 21 -
 1 file changed, 20 insertions(+), 1 deletion(-)

diff --git a/qapi/block-core.json b/qapi/block-core.json
index 0a64306..bf465f6 100644
--- a/qapi/block-core.json
+++ b/qapi/block-core.json
@@ -3938,6 +3938,25 @@
   'data': { 'throttle-group': 'str',
 'file' : 'BlockdevRef'
  } }
+
+##
+# @BlockdevOptionsCor:
+#
+# Driver specific block device options for the copy-on-read driver.
+#
+# @bottom: the name of a non-filter node (allocation-bearing layer) that limits
+#  the COR operations in the backing chain (inclusive).
+#  For the block-stream job, it will be the first non-filter overlay of
+#  the base node. We do not involve the base node into the COR
+#  operations because the base may change due to a concurrent
+#  block-commit job on the same backing chain.
+#
+# Since: 5.2
+##
+{ 'struct': 'BlockdevOptionsCor',
+  'base': 'BlockdevOptionsGenericFormat',
+  'data': { '*bottom': 'str' } }
+
 ##
 # @BlockdevOptions:
 #
@@ -3990,7 +4009,7 @@
   'bochs':  'BlockdevOptionsGenericFormat',
   'cloop':  'BlockdevOptionsGenericFormat',
   'compress':   'BlockdevOptionsGenericFormat',
-  'copy-on-read':'BlockdevOptionsGenericFormat',
+  'copy-on-read':'BlockdevOptionsCor',
   'dmg':'BlockdevOptionsGenericFormat',
   'file':   'BlockdevOptionsFile',
   'ftp':'BlockdevOptionsCurlFtp',
-- 
1.8.3.1
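
With the schema change above, a blockdev-add command for the copy-on-read driver may carry the optional 'bottom' member next to the usual 'file' reference. The Python sketch below only builds such a QMP message as a dict; cor_blockdev_add() and the node names in the usage line are hypothetical, and nothing is sent to a QEMU instance.

```python
import json
from typing import Any, Dict, Optional


def cor_blockdev_add(node_name: str, file_node: str,
                     bottom: Optional[str] = None) -> Dict[str, Any]:
    """Build a blockdev-add message for a copy-on-read filter node."""
    arguments: Dict[str, Any] = {
        "driver": "copy-on-read",
        "node-name": node_name,
        "file": file_node,  # reference to an existing node by name
    }
    if bottom is not None:
        arguments["bottom"] = bottom  # optional, per '*bottom' in the schema
    return {"execute": "blockdev-add", "arguments": arguments}


# Hypothetical node names, for illustration only.
message = json.dumps(cor_blockdev_add("cor0", "top0", bottom="mid0"))
```

Leaving 'bottom' out yields the same arguments the driver accepted before the split from BlockdevOptionsGenericFormat.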

[PATCH v12 10/14] block: include supported_read_flags into BDS structure

2020-10-22 Thread Andrey Shinkevich via

Add the new member supported_read_flags to the BlockDriverState
structure. It will control the flags set for copy-on-read operations.
Make the block generic layer evaluate supported read flags before they
go to a block driver.

Suggested-by: Vladimir Sementsov-Ogievskiy 
Signed-off-by: Andrey Shinkevich 
---
 block/io.c| 12 ++--
 include/block/block_int.h |  4 
 2 files changed, 14 insertions(+), 2 deletions(-)

diff --git a/block/io.c b/block/io.c
index 54f0968..78ddf13 100644
--- a/block/io.c
+++ b/block/io.c
@@ -1392,6 +1392,9 @@ static int coroutine_fn bdrv_aligned_preadv(BdrvChild 
*child,
 if (flags & BDRV_REQ_COPY_ON_READ) {
 int64_t pnum;
 
+/* The flag BDRV_REQ_COPY_ON_READ has reached its addressee */
+flags &= ~BDRV_REQ_COPY_ON_READ;
+
 ret = bdrv_is_allocated(bs, offset, bytes, &pnum);
 if (ret < 0) {
 goto out;
@@ -1413,9 +1416,13 @@ static int coroutine_fn bdrv_aligned_preadv(BdrvChild 
*child,
 goto out;
 }
 
+if (flags & ~bs->supported_read_flags) {
+abort();
+}
+
 max_bytes = ROUND_UP(MAX(0, total_bytes - offset), align);
 if (bytes <= max_bytes && bytes <= max_transfer) {
-ret = bdrv_driver_preadv(bs, offset, bytes, qiov, qiov_offset, 0);
+ret = bdrv_driver_preadv(bs, offset, bytes, qiov, qiov_offset, flags);
 goto out;
 }
 
@@ -1428,7 +1435,8 @@ static int coroutine_fn bdrv_aligned_preadv(BdrvChild 
*child,
 
 ret = bdrv_driver_preadv(bs, offset + bytes - bytes_remaining,
  num, qiov,
- qiov_offset + bytes - bytes_remaining, 0);
+ qiov_offset + bytes - bytes_remaining,
+ flags);
 max_bytes -= num;
 } else {
 num = bytes_remaining;
diff --git a/include/block/block_int.h b/include/block/block_int.h
index f782737..474174c 100644
--- a/include/block/block_int.h
+++ b/include/block/block_int.h
@@ -873,6 +873,10 @@ struct BlockDriverState {
 /* I/O Limits */
 BlockLimits bl;
 
+/*
+ * Flags honored during pread
+ */
+unsigned int supported_read_flags;
 /* Flags honored during pwrite (so far: BDRV_REQ_FUA,
  * BDRV_REQ_WRITE_UNCHANGED).
  * If a driver does not support BDRV_REQ_WRITE_UNCHANGED, those
-- 
1.8.3.1
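
The flag handling added to bdrv_aligned_preadv() above amounts to two steps: the generic layer consumes BDRV_REQ_COPY_ON_READ itself, then refuses to pass on any flag the driver has not declared in supported_read_flags. A Python sketch of that filter follows; the value 0x1 for BDRV_REQ_COPY_ON_READ is assumed, flags_for_driver() is a hypothetical name, and an exception stands in for the abort() in the C code.

```python
BDRV_REQ_COPY_ON_READ = 0x1   # assumed value, for illustration only
BDRV_REQ_PREFETCH = 0x200     # value shown in include/block/block.h


def flags_for_driver(flags: int, supported_read_flags: int) -> int:
    """Return the flags that would reach the driver's read callback."""
    # COPY_ON_READ is handled by the generic layer and never forwarded.
    flags &= ~BDRV_REQ_COPY_ON_READ
    unsupported = flags & ~supported_read_flags
    if unsupported:
        # The C code calls abort() here; the model raises instead.
        raise ValueError(f"unsupported read flags: {unsupported:#x}")
    return flags
```

A driver that leaves supported_read_flags at zero therefore still sees flags == 0 for ordinary and COR reads, as it did before the patch.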

[PATCH v12 06/14] copy-on-read: pass bottom node name to COR driver

2020-10-22 Thread Andrey Shinkevich via

We are going to use the COR-filter for a block-stream job.
To limit COR operations by the base node in the backing chain during
a stream job, pass the bottom node name, that is, the first non-filter
overlay of the base, to the copy-on-read driver, as the base node itself
may change due to possible concurrent jobs.
The rest of the functionality will be implemented in the patch that
follows.

Signed-off-by: Andrey Shinkevich 
---
 block/copy-on-read.c | 16 
 1 file changed, 16 insertions(+)

diff --git a/block/copy-on-read.c b/block/copy-on-read.c
index 618c4c4..3d8e4db 100644
--- a/block/copy-on-read.c
+++ b/block/copy-on-read.c
@@ -24,18 +24,24 @@
 #include "block/block_int.h"
 #include "qemu/module.h"
 #include "qapi/error.h"
+#include "qapi/qmp/qerror.h"
+#include "qapi/qmp/qdict.h"
 #include "block/copy-on-read.h"
 
 
 typedef struct BDRVStateCOR {
 bool active;
+BlockDriverState *bottom_bs;
 } BDRVStateCOR;
 
 
 static int cor_open(BlockDriverState *bs, QDict *options, int flags,
 Error **errp)
 {
+BlockDriverState *bottom_bs = NULL;
 BDRVStateCOR *state = bs->opaque;
+/* Find a bottom node name, if any */
+const char *bottom_node = qdict_get_try_str(options, "bottom");
 
 bs->file = bdrv_open_child(NULL, options, "file", bs, &child_of_bds,
BDRV_CHILD_FILTERED | BDRV_CHILD_PRIMARY,
@@ -51,7 +57,17 @@ static int cor_open(BlockDriverState *bs, QDict *options, 
int flags,
 ((BDRV_REQ_FUA | BDRV_REQ_MAY_UNMAP | BDRV_REQ_NO_FALLBACK) &
 bs->file->bs->supported_zero_flags);
 
+if (bottom_node) {
+bottom_bs = bdrv_lookup_bs(NULL, bottom_node, errp);
+if (!bottom_bs) {
+error_setg(errp, QERR_BASE_NOT_FOUND, bottom_node);
+qdict_del(options, "bottom");
+return -EINVAL;
+}
+qdict_del(options, "bottom");
+}
 state->active = true;
+state->bottom_bs = bottom_bs;
 
 /*
  * We don't need to call bdrv_child_refresh_perms() now as the permissions
-- 
1.8.3.1

[PATCH v12 14/14] block: apply COR-filter to block-stream jobs

2020-10-22 Thread Andrey Shinkevich via

This patch completes the series with the COR-filter insertion for
block-stream operations. Adding the filter makes it possible for copied
regions to be discarded in backing files during the block-stream job,
which will reduce disk overuse.
The COR-filter insertion requires changes to the iotests case
245:test_block_stream_4, which reopens the backing chain during a
block-stream job. There are changes to iotest #030 as well.
The iotests case 030:test_stream_parallel was deleted due to multiple
conflicts between the concurrent job operations over the same backing
chain. The base backing node for one job is the top node for another
job. It may change due to the filter node inserted into the backing
chain while both jobs are running. Another issue is that parts of
the backing chain are frozen by the running job and cannot be
changed by the concurrent job when needed. The concept of parallel
jobs with common nodes is no longer considered vital.

Signed-off-by: Andrey Shinkevich 
---
 block/stream.c | 98 ++
 tests/qemu-iotests/030 | 51 +++-
 tests/qemu-iotests/030.out |  4 +-
 tests/qemu-iotests/141.out |  2 +-
 tests/qemu-iotests/245 | 22 +++
 5 files changed, 87 insertions(+), 90 deletions(-)

diff --git a/block/stream.c b/block/stream.c
index 1ba74ab..f6ed315 100644
--- a/block/stream.c
+++ b/block/stream.c
@@ -17,8 +17,10 @@
 #include "block/blockjob_int.h"
 #include "qapi/error.h"
 #include "qapi/qmp/qerror.h"
+#include "qapi/qmp/qdict.h"
 #include "qemu/ratelimit.h"
 #include "sysemu/block-backend.h"
+#include "block/copy-on-read.h"
 
 enum {
 /*
@@ -33,6 +35,8 @@ typedef struct StreamBlockJob {
 BlockJob common;
 BlockDriverState *base_overlay; /* COW overlay (stream from this) */
 BlockDriverState *above_base;   /* Node directly above the base */
+BlockDriverState *cor_filter_bs;
+BlockDriverState *target_bs;
 BlockdevOnError on_error;
 char *backing_file_str;
 bool bs_read_only;
@@ -44,8 +48,7 @@ static int coroutine_fn stream_populate(BlockBackend *blk,
 {
 assert(bytes < SIZE_MAX);
 
-return blk_co_preadv(blk, offset, bytes, NULL,
- BDRV_REQ_COPY_ON_READ | BDRV_REQ_PREFETCH);
+return blk_co_preadv(blk, offset, bytes, NULL, BDRV_REQ_PREFETCH);
 }
 
 static void stream_abort(Job *job)
@@ -53,23 +56,20 @@ static void stream_abort(Job *job)
 StreamBlockJob *s = container_of(job, StreamBlockJob, common.job);
 
 if (s->chain_frozen) {
-BlockJob *bjob = &s->common;
-bdrv_unfreeze_backing_chain(blk_bs(bjob->blk), s->above_base);
+bdrv_unfreeze_backing_chain(s->cor_filter_bs, s->above_base);
 }
 }
 
 static int stream_prepare(Job *job)
 {
 StreamBlockJob *s = container_of(job, StreamBlockJob, common.job);
-BlockJob *bjob = &s->common;
-BlockDriverState *bs = blk_bs(bjob->blk);
-BlockDriverState *unfiltered_bs = bdrv_skip_filters(bs);
+BlockDriverState *unfiltered_bs = bdrv_skip_filters(s->target_bs);
 BlockDriverState *base = bdrv_filter_or_cow_bs(s->above_base);
 BlockDriverState *base_unfiltered = NULL;
 Error *local_err = NULL;
 int ret = 0;
 
-bdrv_unfreeze_backing_chain(bs, s->above_base);
+bdrv_unfreeze_backing_chain(s->cor_filter_bs, s->above_base);
 s->chain_frozen = false;
 
 if (bdrv_cow_child(unfiltered_bs)) {
@@ -105,15 +105,16 @@ static void stream_clean(Job *job)
 {
 StreamBlockJob *s = container_of(job, StreamBlockJob, common.job);
 BlockJob *bjob = &s->common;
-BlockDriverState *bs = blk_bs(bjob->blk);
 
 /* Reopen the image back in read-only mode if necessary */
 if (s->bs_read_only) {
 /* Give up write permissions before making it read-only */
 blk_set_perm(bjob->blk, 0, BLK_PERM_ALL, &error_abort);
-bdrv_reopen_set_read_only(bs, true, NULL);
+bdrv_reopen_set_read_only(s->target_bs, true, NULL);
 }
 
+bdrv_cor_filter_drop(s->cor_filter_bs);
+
 g_free(s->backing_file_str);
 }
 
@@ -121,9 +122,7 @@ static int coroutine_fn stream_run(Job *job, Error **errp)
 {
 StreamBlockJob *s = container_of(job, StreamBlockJob, common.job);
 BlockBackend *blk = s->common.blk;
-BlockDriverState *bs = blk_bs(blk);
-BlockDriverState *unfiltered_bs = bdrv_skip_filters(bs);
-bool enable_cor = !bdrv_cow_child(s->base_overlay);
+BlockDriverState *unfiltered_bs = bdrv_skip_filters(s->target_bs);
 int64_t len;
 int64_t offset = 0;
 uint64_t delay_ns = 0;
@@ -135,21 +134,12 @@ static int coroutine_fn stream_run(Job *job, Error **errp)
 return 0;
 }
 
-len = bdrv_getlength(bs);
+len = bdrv_getlength(s->target_bs);
 if (len < 0) {
 return len;
 }
 job_progress_set_remaining(&s->common.job, len);
 
-/* Turn on copy-on-read for the whole block device so that guest read
- * requests help us make progress.  Only do this when copying

[PATCH v12 00/14] Apply COR-filter to the block-stream permanently

2020-10-22 Thread Andrey Shinkevich via

The node insert/remove functions were added at the block generic layer.
COR-filter options structure was added to the QAPI.
The test case #310 was added to check the 'bottom' node limit for COR.
The 'supported_read_flags' member was added to the BDS structure
(with the flags check at the block generic layer for drivers).

v12:
  02: New.
  03: Only the temporary drop filter function left.
  05: New (suggested by Max)
  06: 'base' -> 'bottom' option.
  07: Fixes based on the review of the v11.
  08: New.
  09: The comment text was modified.
  10: The read flags check at the block generic layer.
  11: COR flag was added.
  12: The condition was fixed.
  13: The 'backing-file' parameter returned. No deprecation.
  14: The COR-filter 'add' function replaced with the 'insert node' generic
  function. Fixes based on the review of the v11.

Andrey Shinkevich (14):
  copy-on-read: support preadv/pwritev_part functions
  block: add insert/remove node functions
  copy-on-read: add filter drop function
  qapi: add filter-node-name to block-stream
  qapi: create BlockdevOptionsCor structure for COR driver
  copy-on-read: pass bottom node name to COR driver
  copy-on-read: limit COR operations to bottom node
  iotests: add #310 to test bottom node in COR driver
  block: modify the comment for BDRV_REQ_PREFETCH flag
  block: include supported_read_flags into BDS structure
  copy-on-read: add support for read flags to COR-filter
  copy-on-read: skip non-guest reads if no copy needed
  stream: skip filters when writing backing file name to QCOW2 header
  block: apply COR-filter to block-stream jobs

 block.c|  49 ++
 block/copy-on-read.c   | 144 +
 block/copy-on-read.h   |  32 +
 block/io.c |  12 +++-
 block/monitor/block-hmp-cmds.c |   4 +-
 block/stream.c | 117 ++---
 blockdev.c |  13 ++--
 include/block/block.h  |  11 +++-
 include/block/block_int.h  |  11 +++-
 qapi/block-core.json   |  27 +++-
 tests/qemu-iotests/030 |  51 ++-
 tests/qemu-iotests/030.out |   4 +-
 tests/qemu-iotests/141.out |   2 +-
 tests/qemu-iotests/245 |  22 +--
 tests/qemu-iotests/310 | 109 +++
 tests/qemu-iotests/310.out |  15 +
 tests/qemu-iotests/group   |   3 +-
 17 files changed, 503 insertions(+), 123 deletions(-)
 create mode 100644 block/copy-on-read.h
 create mode 100755 tests/qemu-iotests/310
 create mode 100644 tests/qemu-iotests/310.out

-- 
1.8.3.1

[PATCH v12 09/14] block: modify the comment for BDRV_REQ_PREFETCH flag

2020-10-22 Thread Andrey Shinkevich via

Modify the comment for the flag BDRV_REQ_PREFETCH as we are going to
use it alone and pass it to the COR-filter driver for further
processing.

Signed-off-by: Andrey Shinkevich 
---
 include/block/block.h | 8 +---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/include/block/block.h b/include/block/block.h
index ae7612f..1b6742f 100644
--- a/include/block/block.h
+++ b/include/block/block.h
@@ -81,9 +81,11 @@ typedef enum {
 BDRV_REQ_NO_FALLBACK= 0x100,
 
 /*
- * BDRV_REQ_PREFETCH may be used only together with BDRV_REQ_COPY_ON_READ
- * on read request and means that caller doesn't really need data to be
- * written to qiov parameter which may be NULL.
+ * BDRV_REQ_PREFETCH makes sense only in the context of copy-on-read
+ * (i.e., together with the BDRV_REQ_COPY_ON_READ flag or when a COR
+ * filter is involved), in which case it signals that the COR operation
+ * need not read the data into memory (qiov) but only ensure they are
+ * copied to the top layer (i.e., that COR operation is done).
  */
 BDRV_REQ_PREFETCH  = 0x200,
 /* Mask of valid flags */
-- 
1.8.3.1

[PATCH v12 01/14] copy-on-read: support preadv/pwritev_part functions

2020-10-22 Thread Andrey Shinkevich via

Add support for the recently introduced functions
bdrv_co_preadv_part()
and
bdrv_co_pwritev_part()
to the COR-filter driver.

Signed-off-by: Andrey Shinkevich 
Reviewed-by: Vladimir Sementsov-Ogievskiy 
---
 block/copy-on-read.c | 28 
 1 file changed, 16 insertions(+), 12 deletions(-)

diff --git a/block/copy-on-read.c b/block/copy-on-read.c
index 2816e61..cb03e0f 100644
--- a/block/copy-on-read.c
+++ b/block/copy-on-read.c
@@ -74,21 +74,25 @@ static int64_t cor_getlength(BlockDriverState *bs)
 }
 
 
-static int coroutine_fn cor_co_preadv(BlockDriverState *bs,
-  uint64_t offset, uint64_t bytes,
-  QEMUIOVector *qiov, int flags)
+static int coroutine_fn cor_co_preadv_part(BlockDriverState *bs,
+   uint64_t offset, uint64_t bytes,
+   QEMUIOVector *qiov,
+   size_t qiov_offset,
+   int flags)
 {
-return bdrv_co_preadv(bs->file, offset, bytes, qiov,
-  flags | BDRV_REQ_COPY_ON_READ);
+return bdrv_co_preadv_part(bs->file, offset, bytes, qiov, qiov_offset,
+   flags | BDRV_REQ_COPY_ON_READ);
 }
 
 
-static int coroutine_fn cor_co_pwritev(BlockDriverState *bs,
-   uint64_t offset, uint64_t bytes,
-   QEMUIOVector *qiov, int flags)
+static int coroutine_fn cor_co_pwritev_part(BlockDriverState *bs,
+uint64_t offset,
+uint64_t bytes,
+QEMUIOVector *qiov,
+size_t qiov_offset, int flags)
 {
-
-return bdrv_co_pwritev(bs->file, offset, bytes, qiov, flags);
+return bdrv_co_pwritev_part(bs->file, offset, bytes, qiov, qiov_offset,
+flags);
 }
 
 
@@ -137,8 +141,8 @@ static BlockDriver bdrv_copy_on_read = {
 
 .bdrv_getlength = cor_getlength,
 
-.bdrv_co_preadv = cor_co_preadv,
-.bdrv_co_pwritev= cor_co_pwritev,
+.bdrv_co_preadv_part= cor_co_preadv_part,
+.bdrv_co_pwritev_part   = cor_co_pwritev_part,
 .bdrv_co_pwrite_zeroes  = cor_co_pwrite_zeroes,
 .bdrv_co_pdiscard   = cor_co_pdiscard,
 .bdrv_co_pwritev_compressed = cor_co_pwritev_compressed,
-- 
1.8.3.1

[PATCH v12 03/14] copy-on-read: add filter drop function

2020-10-22 Thread Andrey Shinkevich via

Provide an API for COR-filter removal. Also, drop the filter child
permissions for an inactive state when the filter node is being
removed. This function may be considered an intermediate solution
until we are able to use bdrv_remove_node(), which will be possible once
the QEMU permission update system has been overhauled.
To insert the filter, the generic block layer function
bdrv_insert_node() can be used.

Signed-off-by: Andrey Shinkevich 
---
 block/copy-on-read.c | 56 
 block/copy-on-read.h | 32 ++
 2 files changed, 88 insertions(+)
 create mode 100644 block/copy-on-read.h

diff --git a/block/copy-on-read.c b/block/copy-on-read.c
index cb03e0f..618c4c4 100644
--- a/block/copy-on-read.c
+++ b/block/copy-on-read.c
@@ -23,11 +23,20 @@
 #include "qemu/osdep.h"
 #include "block/block_int.h"
 #include "qemu/module.h"
+#include "qapi/error.h"
+#include "block/copy-on-read.h"
+
+
+typedef struct BDRVStateCOR {
+bool active;
+} BDRVStateCOR;
 
 
 static int cor_open(BlockDriverState *bs, QDict *options, int flags,
 Error **errp)
 {
+BDRVStateCOR *state = bs->opaque;
+
 bs->file = bdrv_open_child(NULL, options, "file", bs, &child_of_bds,
BDRV_CHILD_FILTERED | BDRV_CHILD_PRIMARY,
false, errp);
@@ -42,6 +51,13 @@ static int cor_open(BlockDriverState *bs, QDict *options, int flags,
 ((BDRV_REQ_FUA | BDRV_REQ_MAY_UNMAP | BDRV_REQ_NO_FALLBACK) &
 bs->file->bs->supported_zero_flags);
 
+state->active = true;
+
+/*
+ * We don't need to call bdrv_child_refresh_perms() now as the permissions
+ * will be updated later when the filter node gets its parent.
+ */
+
 return 0;
 }
 
@@ -57,6 +73,17 @@ static void cor_child_perm(BlockDriverState *bs, BdrvChild *c,
uint64_t perm, uint64_t shared,
uint64_t *nperm, uint64_t *nshared)
 {
+BDRVStateCOR *s = bs->opaque;
+
+if (!s->active) {
+/*
+ * While the filter is being removed
+ */
+*nperm = 0;
+*nshared = BLK_PERM_ALL;
+return;
+}
+
 *nperm = perm & PERM_PASSTHROUGH;
 *nshared = (shared & PERM_PASSTHROUGH) | PERM_UNCHANGED;
 
@@ -135,6 +162,7 @@ static void cor_lock_medium(BlockDriverState *bs, bool locked)
 
 static BlockDriver bdrv_copy_on_read = {
 .format_name= "copy-on-read",
+.instance_size  = sizeof(BDRVStateCOR),
 
 .bdrv_open  = cor_open,
 .bdrv_child_perm= cor_child_perm,
@@ -154,6 +182,34 @@ static BlockDriver bdrv_copy_on_read = {
 .is_filter  = true,
 };
 
+
+void bdrv_cor_filter_drop(BlockDriverState *cor_filter_bs)
+{
+BdrvChild *child;
+BlockDriverState *bs;
+BDRVStateCOR *s = cor_filter_bs->opaque;
+
+child = bdrv_filter_child(cor_filter_bs);
+if (!child) {
+return;
+}
+bs = child->bs;
+
+/* Retain the BDS until we complete the graph change. */
+bdrv_ref(bs);
+/* Hold a guest back from writing while permissions are being reset. */
+bdrv_drained_begin(bs);
+/* Drop permissions before the graph change. */
+s->active = false;
+bdrv_child_refresh_perms(cor_filter_bs, child, &error_abort);
+bdrv_replace_node(cor_filter_bs, bs, &error_abort);
+
+bdrv_drained_end(bs);
+bdrv_unref(bs);
+bdrv_unref(cor_filter_bs);
+}
+
+
 static void bdrv_copy_on_read_init(void)
 {
 bdrv_register(&bdrv_copy_on_read);
diff --git a/block/copy-on-read.h b/block/copy-on-read.h
new file mode 100644
index 000..7bf405d
--- /dev/null
+++ b/block/copy-on-read.h
@@ -0,0 +1,32 @@
+/*
+ * Copy-on-read filter block driver
+ *
+ * The filter driver performs Copy-On-Read (COR) operations
+ *
+ * Copyright (c) 2018-2020 Virtuozzo International GmbH.
+ *
+ * Author:
+ *   Andrey Shinkevich 
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program. If not, see .
+ */
+
+#ifndef BLOCK_COPY_ON_READ
+#define BLOCK_COPY_ON_READ
+
+#include "block/block_int.h"
+
+void bdrv_cor_filter_drop(BlockDriverState *cor_filter_bs);
+
+#endif /* BLOCK_COPY_ON_READ */
-- 
1.8.3.1

[PATCH v12 02/14] block: add insert/remove node functions

2020-10-22 Thread Andrey Shinkevich via

Provide API for a node insertion to and removal from a backing chain.

Suggested-by: Max Reitz 
Signed-off-by: Andrey Shinkevich 
---
 block.c   | 49 +
 include/block/block.h |  3 +++
 2 files changed, 52 insertions(+)

diff --git a/block.c b/block.c
index 430edf7..502b483 100644
--- a/block.c
+++ b/block.c
@@ -4670,6 +4670,55 @@ static void bdrv_delete(BlockDriverState *bs)
 g_free(bs);
 }
 
+BlockDriverState *bdrv_insert_node(BlockDriverState *bs, QDict *node_options,
+   int flags, Error **errp)
+{
+BlockDriverState *new_node_bs;
+Error *local_err = NULL;
+
+new_node_bs =  bdrv_open(NULL, NULL, node_options, flags, errp);
+if (new_node_bs == NULL) {
+error_prepend(errp, "Could not create node: ");
+return NULL;
+}
+
+bdrv_drained_begin(bs);
+bdrv_replace_node(bs, new_node_bs, &local_err);
+bdrv_drained_end(bs);
+
+if (local_err) {
+bdrv_unref(new_node_bs);
+error_propagate(errp, local_err);
+return NULL;
+}
+
+return new_node_bs;
+}
+
+void bdrv_remove_node(BlockDriverState *bs)
+{
+BdrvChild *child;
+BlockDriverState *inferior_bs;
+
+child = bdrv_filter_or_cow_child(bs);
+if (!child) {
+return;
+}
+inferior_bs = child->bs;
+
+/* Retain the BDS until we complete the graph change. */
+bdrv_ref(inferior_bs);
+/* Hold a guest back from writing while permissions are being reset. */
+bdrv_drained_begin(inferior_bs);
+/* Refresh permissions before the graph change. */
+bdrv_child_refresh_perms(bs, child, &error_abort);
+bdrv_replace_node(bs, inferior_bs, &error_abort);
+
+bdrv_drained_end(inferior_bs);
+bdrv_unref(inferior_bs);
+bdrv_unref(bs);
+}
+
 /*
  * Run consistency checks on an image
  *
diff --git a/include/block/block.h b/include/block/block.h
index d16c401..ae7612f 100644
--- a/include/block/block.h
+++ b/include/block/block.h
@@ -350,6 +350,9 @@ void bdrv_append(BlockDriverState *bs_new, BlockDriverState *bs_top,
  Error **errp);
 void bdrv_replace_node(BlockDriverState *from, BlockDriverState *to,
Error **errp);
+BlockDriverState *bdrv_insert_node(BlockDriverState *bs, QDict *node_options,
+   int flags, Error **errp);
+void bdrv_remove_node(BlockDriverState *bs);
 
 int bdrv_parse_aio(const char *mode, int *flags);
 int bdrv_parse_cache_mode(const char *mode, int *flags, bool *writethrough);
-- 
1.8.3.1

[PATCH v12 07/14] copy-on-read: limit COR operations to bottom node

2020-10-22 Thread Andrey Shinkevich via

Limit COR operations to the bottom node (inclusive) in the backing
chain when the bottom node name is given. This will be useful for a block
stream job when the COR-filter is applied. The bottom node is passed
rather than the base because the base itself may change due to concurrent
commit jobs on the same backing chain.

Signed-off-by: Andrey Shinkevich 
---
 block/copy-on-read.c | 42 --
 1 file changed, 40 insertions(+), 2 deletions(-)

diff --git a/block/copy-on-read.c b/block/copy-on-read.c
index 3d8e4db..8178a91 100644
--- a/block/copy-on-read.c
+++ b/block/copy-on-read.c
@@ -123,8 +123,46 @@ static int coroutine_fn cor_co_preadv_part(BlockDriverState *bs,
size_t qiov_offset,
int flags)
 {
-return bdrv_co_preadv_part(bs->file, offset, bytes, qiov, qiov_offset,
-   flags | BDRV_REQ_COPY_ON_READ);
+int64_t n = 0;
+int local_flags;
+int ret;
+BDRVStateCOR *state = bs->opaque;
+
+if (!state->bottom_bs) {
+return bdrv_co_preadv_part(bs->file, offset, bytes, qiov, qiov_offset,
+   flags | BDRV_REQ_COPY_ON_READ);
+}
+
+while (bytes) {
+local_flags = flags;
+
+/* In case of failure, try to copy-on-read anyway */
+ret = bdrv_is_allocated(bs->file->bs, offset, bytes, &n);
+if (!ret || ret < 0) {
+ret = bdrv_is_allocated_above(bdrv_backing_chain_next(bs->file->bs),
+  state->bottom_bs, true, offset,
+  n, &n);
+if (ret == 1 || ret < 0) {
+local_flags |= BDRV_REQ_COPY_ON_READ;
+}
+/* Finish earlier if the end of a backing file has been reached */
+if (ret == 0 && n == 0) {
+break;
+}
+}
+
+ret = bdrv_co_preadv_part(bs->file, offset, n, qiov, qiov_offset,
+  local_flags);
+if (ret < 0) {
+return ret;
+}
+
+offset += n;
+qiov_offset += n;
+bytes -= n;
+}
+
+return 0;
 }
 
 
-- 
1.8.3.1

[PATCH v12 04/14] qapi: add filter-node-name to block-stream

2020-10-22 Thread Andrey Shinkevich via

Provide the possibility to pass the 'filter-node-name' parameter to the
block-stream job as it is done for the commit block job.

Signed-off-by: Andrey Shinkevich 
Reviewed-by: Vladimir Sementsov-Ogievskiy 
---
 block/monitor/block-hmp-cmds.c | 4 ++--
 block/stream.c | 4 +++-
 blockdev.c | 4 +++-
 include/block/block_int.h  | 7 ++-
 qapi/block-core.json   | 6 ++
 5 files changed, 20 insertions(+), 5 deletions(-)

diff --git a/block/monitor/block-hmp-cmds.c b/block/monitor/block-hmp-cmds.c
index d15a2be..e8a58f3 100644
--- a/block/monitor/block-hmp-cmds.c
+++ b/block/monitor/block-hmp-cmds.c
@@ -508,8 +508,8 @@ void hmp_block_stream(Monitor *mon, const QDict *qdict)
 
 qmp_block_stream(true, device, device, base != NULL, base, false, NULL,
  false, NULL, qdict_haskey(qdict, "speed"), speed, true,
- BLOCKDEV_ON_ERROR_REPORT, false, false, false, false,
- &error);
+ BLOCKDEV_ON_ERROR_REPORT, false, NULL, false, false, false,
+ false, &error);
 
 hmp_handle_error(mon, error);
 }
diff --git a/block/stream.c b/block/stream.c
index 8ce6729..e0540ee 100644
--- a/block/stream.c
+++ b/block/stream.c
@@ -221,7 +221,9 @@ static const BlockJobDriver stream_job_driver = {
 void stream_start(const char *job_id, BlockDriverState *bs,
   BlockDriverState *base, const char *backing_file_str,
   int creation_flags, int64_t speed,
-  BlockdevOnError on_error, Error **errp)
+  BlockdevOnError on_error,
+  const char *filter_node_name,
+  Error **errp)
 {
 StreamBlockJob *s;
 BlockDriverState *iter;
diff --git a/blockdev.c b/blockdev.c
index fe6fb5d..c917625 100644
--- a/blockdev.c
+++ b/blockdev.c
@@ -2499,6 +2499,7 @@ void qmp_block_stream(bool has_job_id, const char *job_id, const char *device,
   bool has_backing_file, const char *backing_file,
   bool has_speed, int64_t speed,
   bool has_on_error, BlockdevOnError on_error,
+  bool has_filter_node_name, const char *filter_node_name,
   bool has_auto_finalize, bool auto_finalize,
   bool has_auto_dismiss, bool auto_dismiss,
   Error **errp)
@@ -2581,7 +2582,8 @@ void qmp_block_stream(bool has_job_id, const char *job_id, const char *device,
 }
 
 stream_start(has_job_id ? job_id : NULL, bs, base_bs, base_name,
- job_flags, has_speed ? speed : 0, on_error, &local_err);
+ job_flags, has_speed ? speed : 0, on_error,
+ filter_node_name, &local_err);
 if (local_err) {
 error_propagate(errp, local_err);
 goto out;
diff --git a/include/block/block_int.h b/include/block/block_int.h
index 38cad9d..f782737 100644
--- a/include/block/block_int.h
+++ b/include/block/block_int.h
@@ -1134,6 +1134,9 @@ int is_windows_drive(const char *filename);
  *  See @BlockJobCreateFlags
  * @speed: The maximum speed, in bytes per second, or 0 for unlimited.
  * @on_error: The action to take upon error.
+ * @filter_node_name: The node name that should be assigned to the filter
+ * driver that the stream job inserts into the graph above @bs. NULL means
+ * that a node name should be autogenerated.
  * @errp: Error object.
  *
  * Start a streaming operation on @bs.  Clusters that are unallocated
@@ -1146,7 +1149,9 @@ int is_windows_drive(const char *filename);
 void stream_start(const char *job_id, BlockDriverState *bs,
   BlockDriverState *base, const char *backing_file_str,
   int creation_flags, int64_t speed,
-  BlockdevOnError on_error, Error **errp);
+  BlockdevOnError on_error,
+  const char *filter_node_name,
+  Error **errp);
 
 /**
  * commit_start:
diff --git a/qapi/block-core.json b/qapi/block-core.json
index ee5ebef..0a64306 100644
--- a/qapi/block-core.json
+++ b/qapi/block-core.json
@@ -2542,6 +2542,11 @@
 #'stop' and 'enospc' can only be used if the block device
 #supports io-status (see BlockInfo).  Since 1.3.
 #
+# @filter-node-name: the node name that should be assigned to the
+#filter driver that the stream job inserts into the graph
+#above @device. If this option is not given, a node name is
+#autogenerated. (Since: 5.2)
+#
 # @auto-finalize: When false, this job will wait in a PENDING state after it has
 # finished its work, waiting for @block-job-finalize before
 # making any block graph changes.
@@ -2572,6 +2577,7 @@
   'data': { '*job-id': 'str', 'device': 'str', '*base': 'str',
 '*base-node': 'str', '*backing-file': 'str', '*speed': 'int',
 '*on-error': 'BlockdevOnError',
+

[PATCH v12 11/14] copy-on-read: add support for read flags to COR-filter

2020-10-22 Thread Andrey Shinkevich via

Add the BDRV_REQ_COPY_ON_READ and BDRV_REQ_PREFETCH flags to the
supported_read_flags of the COR-filter.

Signed-off-by: Andrey Shinkevich 
---
 block/copy-on-read.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/block/copy-on-read.c b/block/copy-on-read.c
index 8178a91..a2b180a 100644
--- a/block/copy-on-read.c
+++ b/block/copy-on-read.c
@@ -50,6 +50,8 @@ static int cor_open(BlockDriverState *bs, QDict *options, int flags,
 return -EINVAL;
 }
 
+bs->supported_read_flags = BDRV_REQ_COPY_ON_READ | BDRV_REQ_PREFETCH;
+
 bs->supported_write_flags = BDRV_REQ_WRITE_UNCHANGED |
 (BDRV_REQ_FUA & bs->file->bs->supported_write_flags);
 
-- 
1.8.3.1

Re: Ramping up Continuous Fuzzing of Virtual Devices in QEMU

2020-10-22 Thread Alexander Bulekov

On 201022 1739, Daniel P. Berrangé wrote:
> On Thu, Oct 22, 2020 at 12:24:16PM -0400, Alexander Bulekov wrote:
> > +CC Prasad
> > 
> > On 201022 1219, Alexander Bulekov wrote:
> > > Hello,
> > > QEMU was accepted into Google's oss-fuzz continuous-fuzzing platform [1]
> > > earlier this year. The fuzzers currently running on oss-fuzz are based on
> > > my 2019 Google Summer of Code project, which leveraged libfuzzer, qtest
> > > and libqos to provide a framework for writing virtual-device fuzzers. At
> > > the moment, there are a handful of fuzzers upstream and running on
> > > oss-fuzz (located in tests/qtest/fuzz/). They fuzz only a few devices and
> > > serve mostly as examples.
> > > 
> > > If everything goes well, soon a generic fuzzer [2] will land upstream,
> > > which allows us to fuzz many configurations of QEMU, without any
> > > device-specific code. To date this fuzzer has led to ~50 bug reports on
> > > launchpad. Once the generic-fuzzer lands upstream, OSS-Fuzz will
> > > automatically start fuzzing a bunch [3] of fuzzer configurations, and it
> > > is likely to find bugs. Others will also be able to send simple patches
> > > to add additional device configurations for fuzzing.
> > > 
> > > The oss-fuzz process looks roughly like this:
> > > 1. oss-fuzz fuzzes QEMU
> > > 2. When oss-fuzz finds a bug, it reports it to a few [4] people that
> > >    have access to reports and reproducers.
> > > 3. If a fix is merged upstream, oss-fuzz will figure this out and
> > >    mark the bug as fixed and make the report public 30 days later.
> > > 4. After 90 days the bug (fixed or not) becomes public, so anyone can
> > >    view it here: https://bugs.chromium.org/p/oss-fuzz/issues/list
> > > 
> > > The oss-fuzz reports look like this:
> > > https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=23701=qemu=2
> > > 
> > > This means that when oss-fuzz finds new bugs, the relevant developers do
> > > not know about them unless someone with access files a separate report
> > > to the list/launchpad. So far this hasn't been a problem, since oss-fuzz
> > > has only been running some small example fuzzers. Once [2] lands
> > > upstream, we should see a significant uptick in oss-fuzz reports, and I
> > > hope that we can develop a process to ensure these bugs are properly
> > > dealt with. One option we have is to make the reports public immediately
> > > and send notifications to qemu-devel. This is the approach taken by some
> > > other projects on oss-fuzz, such as LLVM. Though it's not on oss-fuzz,
> > > bugs found by syzkaller in the kernel are also automatically sent to a
> > > public list.
> > > The question is:
> > > 
> > > What approach should we take for dealing with bugs found on oss-fuzz?
> 
> If we assume that a non-negligible number of fuzz bugs will be exploitable
> by a malicious guest OS to break out into the host, then I think it is
> likely undesirable to make them public immediately without at least a basic
> human triage step to catch possibly serious security issues.
> 
> Still a large % are likely to be low impact / not urgent to deal with so
> we want a low overhead way to handle the fuzz output, which doesn't create
> a bottleneck on a small number of people.
> 
> Overall my feeling is that we want to be able to farm out triage to the
> respective subsystem maintainers, who can then decide whether the bug
> needs notifying to the security team, or can be made public immediately.
> 
> I think ideally we would be doing triage in QEMU's own bug tracker, so
> we don't need to have maintainers create accounts on a 3rd party tracker
> to see reports.
> 
> Is it practical to identify a primary affected source file from the fuzz
> crash report with any level of reliability such that we could file a private
> launchpad bug, automatically CC'ing a subsystem maintainer (and the security
> team)?

Hi Daniel,
As far as I know, there is currently no API for accessing oss-fuzz
results. We could use email-based scripts to parse the automated reports
(e.g.  [1]) and follow the links to automatically download crash-traces
and reproducers. However, accessing those requires a login through
google.com which might be tough to script against in a reliable way.

Assuming we have found a way to automatically download the binary fuzzer
reproducer, we should be able to automatically convert it into a qtest
reproducer that we could send to the right people. 
There are a few approaches I can think of to automatically identify the
maintainers to CC:
 1. Walk the stack-trace until we find the line likely responsible for
 the bug. This can be tricky, since the buggy line is often not the
 first line. E.g. from [2]
#3 __GI___assert_fail assert.c:101
#4 iov_from_buf_full util/iov.c:40
#5 iov_from_buf iov.h:49
#6 net_tx_pkt_update_ip_checksums hw/net/net_tx_pkt.c:139
#7 e1000e_setup_tx_offloads
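
As a rough illustration of approach 1 (a sketch only, not taken from any existing oss-fuzz or QEMU script), one could skip frames until the first one whose source file lives under a directory of interest. The `struct frame` layout, the function name `first_frame_under()` and the choice of "hw/" as the interesting prefix are all assumptions made for this example.

```c
#include <stddef.h>
#include <string.h>

/* One parsed stack frame: function name and source file (hypothetical layout). */
struct frame {
    const char *func;
    const char *file;   /* path relative to the source root, NULL if unknown */
};

/*
 * Hypothetical heuristic: return the index of the first frame whose source
 * file starts with one of the given path prefixes (for example "hw/"), or
 * -1 if no frame matches. Frames in libc or generic helpers are skipped.
 */
static int first_frame_under(const struct frame *frames, size_t nframes,
                             const char *const *prefixes, size_t nprefixes)
{
    for (size_t i = 0; i < nframes; i++) {
        if (!frames[i].file) {
            continue;   /* no location recorded for this frame */
        }
        for (size_t j = 0; j < nprefixes; j++) {
            size_t len = strlen(prefixes[j]);
            if (strncmp(frames[i].file, prefixes[j], len) == 0) {
                return (int)i;
            }
        }
    }
    return -1;
}
```

For the trace above and the single prefix "hw/", this would skip the assert and iov frames and pick the net_tx_pkt_update_ip_checksums frame in hw/net/net_tx_pkt.c, whose path could then be matched against MAINTAINERS.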

Re: [PATCH v4 2/2] hw/block/nvme: add the dataset management command

2020-10-22 Thread Keith Busch

On Thu, Oct 22, 2020 at 07:43:33PM +0200, Klaus Jensen wrote:
> On Oct 22 08:01, Keith Busch wrote:
> > On Thu, Oct 22, 2020 at 09:33:13AM +0200, Klaus Jensen wrote:
> > > +if (--(*discards)) {
> > > +status = NVME_NO_COMPLETE;
> > > +} else {
> > > +g_free(discards);
> > > +req->opaque = NULL;
> > 
> > This case needs a
> > 
> > status = req->status;
> > 
> > So that we get the error set in the callback.
> > 
> 
> There are no cases that result in a non-zero status code here.

Your callback has a case that sets NVME_INTERNAL_DEV_ERROR status. That
would get ignored if the final discard reference is dropped from the
submission side.

+static void nvme_aio_discard_cb(void *opaque, int ret)
+{
+NvmeRequest *req = opaque;
+int *discards = req->opaque;
+
+trace_pci_nvme_aio_discard_cb(nvme_cid(req));
+
+if (ret) {
+req->status = NVME_INTERNAL_DEV_ERROR;
+trace_pci_nvme_err_aio(nvme_cid(req), strerror(ret),
+   req->status);
+}
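
The concern can be shown with a stripped-down model of the reference counting (a sketch under assumptions, not the NVMe device code): if a callback records an error and the submission side is the one that drops the last reference, the submitter has to return the recorded status rather than success. The struct, the function names and the STATUS_* values below are placeholders invented for this example.

```c
#include <stdint.h>

/* Placeholder status codes for this sketch only. */
#define STATUS_SUCCESS      0x0000
#define STATUS_INTERNAL_ERR 0x0006
#define STATUS_NO_COMPLETE  0xffff

struct request {
    int pending;      /* outstanding references: submitter + in-flight aios */
    uint16_t status;  /* sticky error recorded by any completion callback */
};

/* Drop one reference; whoever drops the last one reports req->status. */
static uint16_t put_ref(struct request *req)
{
    if (--req->pending) {
        return STATUS_NO_COMPLETE;  /* someone else will complete the request */
    }
    return req->status;             /* not a hard-coded STATUS_SUCCESS */
}

/* Completion callback: remember a failure, then drop this aio's reference. */
static uint16_t aio_done(struct request *req, int ret)
{
    if (ret) {
        req->status = STATUS_INTERNAL_ERR;
    }
    return put_ref(req);
}
```

With one aio in flight (pending == 2), a failing aio_done() followed by the submitter's put_ref() yields STATUS_INTERNAL_ERR from the submitter; returning success unconditionally in that path would lose the error.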

Re: [PATCH 0/2] hw/block/nvme: two fixes for create sq/cq

2020-10-22 Thread Klaus Jensen

On Oct 22 08:20, Keith Busch wrote:
> On Thu, Oct 22, 2020 at 03:24:02PM +0200, Klaus Jensen wrote:
> > From: Klaus Jensen 
> > 
> > The first patch is a follow up to "hw/block/nvme: fix prp mapping status
> > codes" and fixes some status codes in the nvme_create_{sq,cq} functions.
> > 
> > The second patch fixes a faulty check on the given queue identifier.
> 
> Looks good.
> 
> Reviewed-by: Keith Busch 

Thanks! Applied to nvme-next.



Re: [PATCH v4 2/2] hw/block/nvme: add the dataset management command

2020-10-22 Thread Klaus Jensen

On Oct 22 08:01, Keith Busch wrote:
> On Thu, Oct 22, 2020 at 09:33:13AM +0200, Klaus Jensen wrote:
> > +if (--(*discards)) {
> > +status = NVME_NO_COMPLETE;
> > +} else {
> > +g_free(discards);
> > +req->opaque = NULL;
> 
> This case needs a
> 
> status = req->status;
> 
> So that we get the error set in the callback.
> 

There are no cases that result in a non-zero status code here. If an LBA
range is invalid we simply continue with the next. In case the DMA
transfer fails, we return the error directly and the normal path takes
care of it. The else block is for when there are no pending aios for
some reason (all invalid ranges or they completed immediately) - in that
case we can just return NVME_SUCCESS directly.

> Otherwise, this looks fine. I am assuming everything still runs single
> threaded since this isn't using atomics.

Yeah, all device code (including callbacks) run on the main thread.


Re: [PATCH v27 05/17] vfio: Add VM state change handler to know state of VM

2020-10-22 Thread Kirti Wankhede





On 10/22/2020 10:05 PM, Alex Williamson wrote:

On Thu, 22 Oct 2020 16:41:55 +0530
Kirti Wankhede  wrote:


VM state change handler is called on change in VM's state. Based on
VM state, VFIO device state should be changed.
Added read/write helper functions for migration region.
Added function to set device_state.

Signed-off-by: Kirti Wankhede 
Reviewed-by: Neo Jia 
Reviewed-by: Dr. David Alan Gilbert 
---
  hw/vfio/migration.c   | 158 ++
  hw/vfio/trace-events  |   2 +
  include/hw/vfio/vfio-common.h |   4 ++
  3 files changed, 164 insertions(+)

diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
index 5f74a3ad1d72..34f39c7e2e28 100644
--- a/hw/vfio/migration.c
+++ b/hw/vfio/migration.c
@@ -10,6 +10,7 @@
  #include "qemu/osdep.h"
  #include 
  
+#include "sysemu/runstate.h"

  #include "hw/vfio/vfio-common.h"
  #include "cpu.h"
  #include "migration/migration.h"
@@ -22,6 +23,157 @@
  #include "exec/ram_addr.h"
  #include "pci.h"
  #include "trace.h"
+#include "hw/hw.h"
+
+static inline int vfio_mig_access(VFIODevice *vbasedev, void *val, int count,
+  off_t off, bool iswrite)
+{
+int ret;
+
+ret = iswrite ? pwrite(vbasedev->fd, val, count, off) :
+pread(vbasedev->fd, val, count, off);
+if (ret < count) {
+error_report("vfio_mig_%s %d byte %s: failed at offset 0x%lx, err: %s",
+ iswrite ? "write" : "read", count,
+ vbasedev->name, off, strerror(errno));
+return (ret < 0) ? ret : -EINVAL;
+}
+return 0;
+}
+
+static int vfio_mig_rw(VFIODevice *vbasedev, __u8 *buf, size_t count,
+   off_t off, bool iswrite)
+{
+int ret, done = 0;
+__u8 *tbuf = buf;
+
+while (count) {
+int bytes = 0;
+
+if (count >= 8 && !(off % 8)) {
+bytes = 8;
+} else if (count >= 4 && !(off % 4)) {
+bytes = 4;
+} else if (count >= 2 && !(off % 2)) {
+bytes = 2;
+} else {
+bytes = 1;
+}
+
+ret = vfio_mig_access(vbasedev, tbuf, bytes, off, iswrite);
+if (ret) {
+return ret;
+}
+
+count -= bytes;
+done += bytes;
+off += bytes;
+tbuf += bytes;
+}
+return done;
+}
+
+#define vfio_mig_read(f, v, c, o)   vfio_mig_rw(f, (__u8 *)v, c, o, false)
+#define vfio_mig_write(f, v, c, o)  vfio_mig_rw(f, (__u8 *)v, c, o, true)
+
+#define VFIO_MIG_STRUCT_OFFSET(f)   \
+ offsetof(struct vfio_device_migration_info, f)
+/*
+ * Change the device_state register for device @vbasedev. Bits set in @mask
+ * are preserved, bits set in @value are set, and bits not set in either @mask
+ * or @value are cleared in device_state. If the register cannot be accessed,
+ * the resulting state would be invalid, or the device enters an error state,
+ * an error is returned.
+ */
+
+static int vfio_migration_set_state(VFIODevice *vbasedev, uint32_t mask,
+uint32_t value)
+{
+VFIOMigration *migration = vbasedev->migration;
+VFIORegion *region = &migration->region;
+off_t dev_state_off = region->fd_offset +
+  VFIO_MIG_STRUCT_OFFSET(device_state);
+uint32_t device_state;
+int ret;
+
+ret = vfio_mig_read(vbasedev, &device_state, sizeof(device_state),
+dev_state_off);
+if (ret < 0) {
+return ret;
+}
+
+device_state = (device_state & mask) | value;
+
+if (!VFIO_DEVICE_STATE_VALID(device_state)) {
+return -EINVAL;
+}
+
+ret = vfio_mig_write(vbasedev, &device_state, sizeof(device_state),
+ dev_state_off);
+if (ret < 0) {
+int rret;
+
+rret = vfio_mig_read(vbasedev, &device_state, sizeof(device_state),
+ dev_state_off);
+
+if ((rret < 0) || (VFIO_DEVICE_STATE_IS_ERROR(device_state))) {
+hw_error("%s: Device in error state 0x%x", vbasedev->name,
+ device_state);
+return rret ? rret : -EIO;
+}
+return ret;
+}
+
+migration->device_state = device_state;
+trace_vfio_migration_set_state(vbasedev->name, device_state);
+return 0;
+}
+
+static void vfio_vmstate_change(void *opaque, int running, RunState state)
+{
+VFIODevice *vbasedev = opaque;
+VFIOMigration *migration = vbasedev->migration;
+uint32_t value, mask;
+int ret;
+
+if ((vbasedev->migration->vm_running == running)) {
+return;
+}
+
+if (running) {
+/*
+ * Here device state can have one of _SAVING, _RESUMING or _STOP bit.
+ * Transition from _SAVING to _RUNNING can happen if there is migration
+ * failure, in that case clear _SAVING bit.
+ * Transition from _RESUMING to _RUNNING occurs during resuming
+ * phase, in that case clear _RESUMING bit.
+

Re: [RFC] Using gitlab for upstream qemu repo?

2020-10-22 Thread Eric Blake

On 10/22/20 11:47 AM, Paolo Bonzini wrote:
> Hi all,
> 
> now that Gitlab is the primary CI infrastructure for QEMU, and that all
> QEMU git repositories (including mirrors) are available on Gitlab, I
> would like to propose that committers use Gitlab when merging commits to
> QEMU repositories.
> 

> Nothing would change for developers, who would still have access to all
> three sets of repositories (git.qemu.org, gitlab.com and github.com).
> Committers however would need to have an account on the
> https://gitlab.com/qemu-project organization with access to the
> repositories they care about.  They would also lose write access to
> /srv/git on qemu.org.

For clarification, I'm assuming the set of committers is rather small,
and not the same as the set of subsystem maintainers who send pull
requests for a committer to then merge in.  Does this proposal mean that
pull requests would have to switch to gitlab merge requests, or would
there be a transition period where submaintainers still send pull
requests via whichever means desired (mail or gitlab merge request), but
the eventual committer repackages that as a gitlab merge request before
it is upstream?

> 
> Of course this is just starting a discussion, so I'm not even proposing
> a date for the switch.

I'm hoping that as part of the consideration we make sure that
command line tooling can still drive everything; there is a difference
between requiring a web page to initiate a merge request, vs. proper
command line tooling that lets one leave the web page as an optional part
of the workflow for only those who want it.

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3226
Virtualization:  qemu.org | libvirt.org

[Bug 1901068] [NEW] Deleted tests are still run if they exist in the build tree

2020-10-22 Thread Havard Skinnemoen

Public bug reported:

Steps to reproduce:
1. Add a new device along with a qtest to exercise it.
2. Run make check-qtest. It passes.
3. Revert the commit that added the device and qtest.
4. Run make check-qtest again. It now fails because the device no longer
exists, but the test is somehow still there even though the source files are
gone and it's not mentioned in tests/qtest/meson.build.

After running make clean, make check-qtest passes again.

$ git describe
v5.1.0-2465-g4c5b97bfd0

** Affects: qemu
 Importance: Undecided
 Status: New

-- 
You received this bug notification because you are a member of qemu-
devel-ml, which is subscribed to QEMU.
https://bugs.launchpad.net/bugs/1901068

Title:
  Deleted tests are still run if they exist in the build tree

Status in QEMU:
  New

To manage notifications about this bug go to:
https://bugs.launchpad.net/qemu/+bug/1901068/+subscriptions

[RFC] Using gitlab for upstream qemu repo?

2020-10-22 Thread Paolo Bonzini

Hi all,

now that Gitlab is the primary CI infrastructure for QEMU, and that all
QEMU git repositories (including mirrors) are available on Gitlab, I
would like to propose that committers use Gitlab when merging commits to
QEMU repositories.

There are four reasons for this:

- this would be a step towards ensuring that all commits go through the
CI process, and it would also provide a way to run the deployment of the
web site via .gitlab-ci.yml.

- right now Gitlab pulls from upstream repos and qemu.org pulls from
gitlab, but this is not true for the qemu, qemu-web and openbios
repositories where Gitlab pulls from qemu.org and qemu.org is the main
repository.  With this switch, all the main repositories would be on
Gitlab and then mirrored to both qemu.org and GitHub.  Having a
homogeneous configuration makes it easier to document what's going on.

- it would limit the number of people with access to qemu.org, since
committers would no longer need an account on the machine.

- by treating gitlab as authoritative, we could include it in the
.gitmodules file and remove load on the qemu.org server

Nothing would change for developers, who would still have access to all
three sets of repositories (git.qemu.org, gitlab.com and github.com).
Committers however would need to have an account on the
https://gitlab.com/qemu-project organization with access to the
repositories they care about.  They would also lose write access to
/srv/git on qemu.org.

Of course this is just starting a discussion, so I'm not even proposing
a date for the switch.

Paolo

Re: [PATCH v6 3/6] migration: Maintain postcopy faulted addresses

2020-10-22 Thread Dr. David Alan Gilbert

* Peter Xu (pet...@redhat.com) wrote:
> Maintain a list of faulted addresses on the destination host for which we're
> waiting on.  This is implemented using a GTree rather than a real list to make
> sure even there're plenty of vCPUs/threads that are faulting, the lookup will
> still be fast with O(log(N)) (because we'll do that after placing each page).
> It should bring a slight overhead, but ideally that shouldn't be a big problem
> simply because in most cases the requested page list will be short.
> 
> Actually we did similar things for postcopy blocktime measurements.  This 
> patch
> didn't use that simply because:
> 
>   (1) blocktime measurement is towards vcpu threads only, but here we need to
>   record all faulted addresses, including main thread and external
>   thread (like, DPDK via vhost-user).
> 
>   (2) blocktime measurement will require UFFD_FEATURE_THREAD_ID, but here we
>   don't want to add that extra dependency on the kernel version since not
>   necessary.  E.g., we don't need to know which thread faulted on which
>   page, we also don't care about multiple threads faulting on the same
>   page.  But we only care about what addresses are faulted so waiting for 
> a
>   page copying from src.
> 
>   (3) blocktime measurement is not enabled by default.  However we need this 
> by
>   default especially for postcopy recover.
> 
> Another thing to mention is that this patch introduced a new mutex to serialize
> the receivedmap and the page_requested tree, however that serialization does
> not cover other procedures like UFFDIO_COPY.
> 
> Signed-off-by: Peter Xu 

Reviewed-by: Dr. David Alan Gilbert 

> ---
>  migration/migration.c| 41 +++-
>  migration/migration.h| 19 ++-
>  migration/postcopy-ram.c | 17 ++---
>  migration/trace-events   |  2 ++
>  4 files changed, 74 insertions(+), 5 deletions(-)
> 
> diff --git a/migration/migration.c b/migration/migration.c
> index 255e69c8aa..e3a958b299 100644
> --- a/migration/migration.c
> +++ b/migration/migration.c
> @@ -143,6 +143,13 @@ static int migration_maybe_pause(MigrationState *s,
>   int new_state);
>  static void migrate_fd_cancel(MigrationState *s);
>  
> +static gint page_request_addr_cmp(gconstpointer ap, gconstpointer bp)
> +{
> +uintptr_t a = (uintptr_t) ap, b = (uintptr_t) bp;
> +
> +return (a > b) - (a < b);
> +}
> +
>  void migration_object_init(void)
>  {
>  MachineState *ms = MACHINE(qdev_get_machine());
> @@ -165,6 +172,8 @@ void migration_object_init(void)
>  qemu_event_init(&current_incoming->main_thread_load_event, false);
>  qemu_sem_init(&current_incoming->postcopy_pause_sem_dst, 0);
>  qemu_sem_init(&current_incoming->postcopy_pause_sem_fault, 0);
> +qemu_mutex_init(&current_incoming->page_request_mutex);
> +current_incoming->page_requested = g_tree_new(page_request_addr_cmp);
>  
>  if (!migration_object_check(current_migration, &err)) {
>  error_report_err(err);
> @@ -240,6 +249,11 @@ void migration_incoming_state_destroy(void)
>  
>  qemu_event_reset(&mis->main_thread_load_event);
>  
> +if (mis->page_requested) {
> +g_tree_destroy(mis->page_requested);
> +mis->page_requested = NULL;
> +}
> +
>  if (mis->socket_address_list) {
>  qapi_free_SocketAddressList(mis->socket_address_list);
>  mis->socket_address_list = NULL;
> @@ -354,8 +368,33 @@ int migrate_send_rp_message_req_pages(MigrationIncomingState *mis,
>  }
>  
>  int migrate_send_rp_req_pages(MigrationIncomingState *mis,
> -  RAMBlock *rb, ram_addr_t start)
> +  RAMBlock *rb, ram_addr_t start, uint64_t haddr)
>  {
> +void *aligned = (void *)(uintptr_t)(haddr & (-qemu_ram_pagesize(rb)));
> +bool received;
> +
> +WITH_QEMU_LOCK_GUARD(&mis->page_request_mutex) {
> +received = ramblock_recv_bitmap_test_byte_offset(rb, start);
> +if (!received && !g_tree_lookup(mis->page_requested, aligned)) {
> +/*
> + * The page has not been received, and it's not yet in the page
> + * request list.  Queue it.  Set the value of element to 1, so that
> + * things like g_tree_lookup() will return TRUE (1) when found.
> + */
> +g_tree_insert(mis->page_requested, aligned, (gpointer)1);
> +mis->page_requested_count++;
> +trace_postcopy_page_req_add(aligned, mis->page_requested_count);
> +}
> +}
> +
> +/*
> + * If the page is there, skip sending the message.  We don't even need the
> + * lock because as long as the page arrived, it'll be there forever.
> + */
> +if (received) {
> +return 0;
> +}
> +
>  return migrate_send_rp_message_req_pages(mis, rb, start);
>  }
>  
> diff --git a/migration/migration.h b/migration/migration.h
> index e853ccf8b1..8d2d1ce839 100644
>
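
For readers following the quoted patch, the two small idioms it relies on can be tried in isolation. This is only a minimal sketch under stated assumptions, not the QEMU code: align_down_to_page() is a hypothetical helper standing in for the haddr & (-qemu_ram_pagesize(rb)) expression, the page size is assumed to be a power of two, and qsort() stands in for the GTree so that nothing beyond the C library is needed. Unlike page_request_addr_cmp(), which compares the key pointers themselves, the qsort() callback here has to dereference its arguments.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/*
 * Hypothetical helper: round addr down to a page boundary.
 * For a power-of-two pagesize, -pagesize equals ~(pagesize - 1).
 */
static uint64_t align_down_to_page(uint64_t addr, uint64_t pagesize)
{
    assert(pagesize != 0 && (pagesize & (pagesize - 1)) == 0);
    return addr & -pagesize;
}

/*
 * Three-way comparison without subtraction, so it cannot overflow.
 * Same shape as the comparator in the patch, but as a qsort() callback
 * the arguments point at the values to compare.
 */
static int addr_cmp(const void *ap, const void *bp)
{
    uint64_t a = *(const uint64_t *)ap, b = *(const uint64_t *)bp;

    return (a > b) - (a < b);
}

/* Sort an array of faulted addresses in ascending order. */
static void sort_addrs(uint64_t *addrs, size_t n)
{
    qsort(addrs, n, sizeof(addrs[0]), addr_cmp);
}
```

The (a > b) - (a < b) form returns exactly -1, 0 or 1, which is why it is preferred over a - b for full-width unsigned keys.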

[PATCH] target/arm: Get correct MMU index for other-security-state

2020-10-22 Thread Peter Maydell

In arm_v7m_mmu_idx_for_secstate() we get the 'priv' level to pass to
arm_v7m_mmu_idx_for_secstate_and_priv() by calling arm_current_el().
This is incorrect when the security state being queried is not the
current one, because arm_current_el() uses the current security state
to determine which of the banked CONTROL.nPRIV bits to look at.
The effect was that if (for instance) Secure state was in privileged
mode but Non-Secure was not then we would return the wrong MMU index.

The only places where we are using this function in a way that could
trigger this bug are for the stack loads during a v8M function-return
and for the instruction fetch of a v8M SG insn.

Fix the bug by expanding out the M-profile version of the
arm_current_el() logic inline so it can use the passed in secstate
rather than env->v7m.secure.

Signed-off-by: Peter Maydell 
---
 target/arm/m_helper.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/target/arm/m_helper.c b/target/arm/m_helper.c
index 036454234c7..aad01ea0127 100644
--- a/target/arm/m_helper.c
+++ b/target/arm/m_helper.c
@@ -2719,7 +2719,8 @@ ARMMMUIdx arm_v7m_mmu_idx_for_secstate_and_priv(CPUARMState *env,
 /* Return the MMU index for a v7M CPU in the specified security state */
 ARMMMUIdx arm_v7m_mmu_idx_for_secstate(CPUARMState *env, bool secstate)
 {
-bool priv = arm_current_el(env) != 0;
+bool priv = arm_v7m_is_handler_mode(env) ||
+!(env->v7m.control[secstate] & 1);
 
 return arm_v7m_mmu_idx_for_secstate_and_priv(env, secstate, priv);
 }
-- 
2.20.1
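
To illustrate the failure mode described in the commit message, here is a minimal standalone sketch. It is not the QEMU code: struct m_state is a hypothetical, cut-down stand-in for the M-profile parts of CPUARMState, and the only architectural fact assumed is that CONTROL.nPRIV is bit 0 and is banked per security state.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, cut-down stand-in for the M-profile CPU state. */
struct m_state {
    bool handler_mode;    /* true while executing an exception handler */
    bool secure;          /* current security state */
    uint32_t control[2];  /* banked CONTROL, indexed by security state */
};

#define CONTROL_NPRIV 1u  /* bit 0: set means Thread mode is unprivileged */

/* Privilege as seen from the *requested* security state. */
static bool priv_for_secstate(const struct m_state *s, bool secstate)
{
    assert(s != NULL);
    return s->handler_mode || !(s->control[secstate] & CONTROL_NPRIV);
}

/* The old behaviour: always consults the bank of the current state. */
static bool priv_for_current_state(const struct m_state *s)
{
    return priv_for_secstate(s, s->secure);
}
```

With Secure privileged (control[1] == 0), Non-Secure unprivileged (control[0] == 1) and the CPU currently in Secure Thread mode, priv_for_current_state() reports privileged, while priv_for_secstate(s, false) correctly reports unprivileged for the Non-Secure state.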

Re: [PATCH v10 09/10] virtio-iommu: Set supported page size mask

2020-10-22 Thread Jean-Philippe Brucker

On Fri, Oct 16, 2020 at 03:08:03PM +0200, Auger Eric wrote:
> > +static int virtio_iommu_set_page_size_mask(IOMMUMemoryRegion *mr,
> > +   uint64_t page_size_mask,
> > +   Error **errp)
> > +{
> > +int new_granule, old_granule;
> > +IOMMUDevice *sdev = container_of(mr, IOMMUDevice, iommu_mr);
> > +VirtIOIOMMU *s = sdev->viommu;
> > +
> > +if (!page_size_mask) {
> set errp

Woops, fixed

> > +return -1;
> > +}
> > +
> > +new_granule = ctz64(page_size_mask);
> > +old_granule = ctz64(s->config.page_size_mask);
> 
> I think this would be interesting to add a trace point

Agreed

Thanks,
Jean

Re: [PATCH v10 07/10] memory: Add interface to set iommu page size mask

2020-10-22 Thread Jean-Philippe Brucker

On Fri, Oct 16, 2020 at 11:24:08AM +0200, Auger Eric wrote:
> > +/*
> > + * Set supported IOMMU page size
> > + *
> > + * If supported, allows to restrict the page size mask that can be supported
> To match other docs: Optional method:
> > + * with a given IOMMU memory region. For example, to propagate host physical
> > + * IOMMU page size mask limitations to the virtual IOMMU.
> > + *
> > + * Returns 0 on success, or a negative error. In case of failure, the error
> > + * object must be created.
> document args as done for other functions?

I'll change this comment to:

/**
 * @iommu_set_page_size_mask:
 *
 * Restrict the page size mask that can be supported with a given IOMMU
 * memory region. Used for example to propagate host physical IOMMU page
 * size mask limitations to the virtual IOMMU.
 *
 * Optional method: if this method is not provided, then the default global
 * page mask is used.
 *
 * @iommu: the IOMMUMemoryRegion
 *
 * @page_size_mask: a bitmask of supported page sizes. At least one bit,
 * representing the smallest page size, must be set. Additional set bits
 * represent supported block sizes. For example a host physical IOMMU that
 * uses page tables with a page size of 4kB, and supports 2MB and 1GB
 * blocks, will set mask 0x40201000. A granule of 4kB with indiscriminate
 * block sizes is specified with mask 0xfffffffffffff000.
 *
 * Returns 0 on success, or a negative error. In case of failure, the error
 * object must be created.
 */
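
As a quick sanity check of the mask encoding, a minimal sketch in plain C follows. The helper names are hypothetical and not part of the proposed API; the only assumption is the encoding described in the comment, where bit N set means a page or block size of 2^N bytes is supported. The example mask 0x40201000 has bits 12, 21 and 30 set, i.e. 4kB, 2MB and 1GB.

```c
#include <assert.h>
#include <stdint.h>

/* Index of the lowest set bit; mask must be non-zero. */
static int lowest_set_bit(uint64_t mask)
{
    int n = 0;

    assert(mask != 0);
    while (!(mask & 1)) {
        mask >>= 1;
        n++;
    }
    return n;
}

/* Smallest supported page size (the granule), in bytes. */
static uint64_t granule_size(uint64_t page_size_mask)
{
    return UINT64_C(1) << lowest_set_bit(page_size_mask);
}

/* Non-zero if a page or block of exactly `size` bytes is supported. */
static int size_supported(uint64_t page_size_mask, uint64_t size)
{
    /* size must be a power of two to correspond to a single mask bit */
    return size != 0 && !(size & (size - 1)) && (page_size_mask & size) != 0;
}
```

For instance, granule_size(0x40201000) is 4096, and size_supported() accepts 2MB and 1GB for that mask but rejects 64kB.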

Thanks,
Jean

Re: [PATCH v10 05/10] virtio-iommu: Add replay() memory region callback

2020-10-22 Thread Jean-Philippe Brucker

On Fri, Oct 16, 2020 at 11:12:35AM +0200, Auger Eric wrote:
> > +static gboolean virtio_iommu_remap(gpointer key, gpointer value, gpointer data)
> > +{
> > +VirtIOIOMMUMapping *mapping = (VirtIOIOMMUMapping *) value;
> > +VirtIOIOMMUInterval *interval = (VirtIOIOMMUInterval *) key;
> > +IOMMUMemoryRegion *mr = (IOMMUMemoryRegion *) data;
> > +
> > +trace_virtio_iommu_remap(mr->parent_obj.name, interval->low, interval->high,
> > + mapping->phys_addr);
> > +virtio_iommu_notify_unmap(mr, interval->low, interval->high);
> > +virtio_iommu_notify_map(mr, interval->low, interval->high,
> > +mapping->phys_addr);
> I don't get the preliminary unmap with the same data. Why isn't the map
> sufficient to replay?
> 
> The default implementation only notifies for valid entries.

Yes it should be enough, I'll remove the unmap

Thanks,
Jean

Re: [PATCH v10 03/10] virtio-iommu: Add memory notifiers for map/unmap

2020-10-22 Thread Jean-Philippe Brucker

On Fri, Oct 16, 2020 at 09:58:28AM +0200, Auger Eric wrote:
> > +static void virtio_iommu_notify_map(IOMMUMemoryRegion *mr, hwaddr virt_start,
> > +hwaddr virt_end, hwaddr paddr)
> > +{
> > +IOMMUTLBEntry entry;
> > +IOMMUNotifierFlag flags = mr->iommu_notify_flags;
> > +
> > +if (!(flags & IOMMU_NOTIFIER_MAP)) {
> > +return;
> > +}
> > +
> > +trace_virtio_iommu_notify_map(mr->parent_obj.name, virt_start, virt_end,
> > +  paddr);
> > +
> > +entry.target_as = &address_space_memory;
> > +entry.addr_mask = virt_end - virt_start;
> > +entry.iova = virt_start;
> > +entry.perm = IOMMU_RW;
> logically you should be able to cascade the struct virtio_iommu_req_map
> *req flags field instead.

Agreed.

I'm also thinking of adding a check for VIRTIO_IOMMU_MAP_F_MMIO, to avoid
going further into the notifier and maybe do the same for unmap.
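
Roughly what I have in mind, as a standalone sketch rather than the actual patch. The MAP_F_* and PERM_* names below are local stand-ins whose values are assumed to mirror VIRTIO_IOMMU_MAP_F_* and the IOMMU_NONE/RO/WO/RW permissions, struct tlb_entry is a hypothetical reduction of IOMMUTLBEntry, and the MMIO check is read as "skip the notifier for MMIO mappings":

```c
#include <assert.h>
#include <stdint.h>

/* Local stand-ins; values assumed to mirror the real definitions. */
#define MAP_F_READ   (1u << 0)
#define MAP_F_WRITE  (1u << 1)
#define MAP_F_MMIO   (1u << 2)

enum perm { PERM_NONE = 0, PERM_RO = 1, PERM_WO = 2, PERM_RW = 3 };

/* Hypothetical reduction of the notifier entry to the fields used here. */
struct tlb_entry {
    uint64_t iova;
    uint64_t addr_mask;
    enum perm perm;
};

/* Derive the permission from the map request flags instead of using RW. */
static enum perm perm_from_map_flags(uint32_t flags)
{
    int perm = PERM_NONE;

    if (flags & MAP_F_READ) {
        perm |= PERM_RO;
    }
    if (flags & MAP_F_WRITE) {
        perm |= PERM_WO;
    }
    return (enum perm)perm;
}

/*
 * Fill *e for the range [virt_start, virt_end] (inclusive end).
 * Returns 1 if the notifier should be called, 0 if the mapping is
 * skipped because it is flagged as MMIO; *e is left untouched then.
 */
static int build_map_entry(struct tlb_entry *e, uint64_t virt_start,
                           uint64_t virt_end, uint32_t flags)
{
    assert(virt_end >= virt_start);

    if (flags & MAP_F_MMIO) {
        return 0;
    }
    e->iova = virt_start;
    e->addr_mask = virt_end - virt_start;
    e->perm = perm_from_map_flags(flags);
    return 1;
}
```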

Thanks,
Jean

Re: [PATCH] CHANGELOG: remove disused file

2020-10-22 Thread Daniel P . Berrangé

On Thu, Oct 22, 2020 at 12:28:43PM -0400, John Snow wrote:
> There's no reason to keep this here; the versions described are
> ancient. Everything here is still mirrored on
> https://wiki.qemu.org/ChangeLog/old if anyone is curious; otherwise, use
> the git history.
> 
> Signed-off-by: John Snow 
> ---
>  Changelog | 580 --
>  1 file changed, 580 deletions(-)
>  delete mode 100644 Changelog

Reviewed-by: Daniel P. Berrangé 


Regards,
Daniel
-- 
|: https://berrange.com  -o-https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org -o-https://fstop138.berrange.com :|
|: https://entangle-photo.org-o-https://www.instagram.com/dberrange :|

Re: Ramping up Continuous Fuzzing of Virtual Devices in QEMU

2020-10-22 Thread Daniel P . Berrangé

On Thu, Oct 22, 2020 at 12:24:16PM -0400, Alexander Bulekov wrote:
> +CC Prasad
> 
> On 201022 1219, Alexander Bulekov wrote:
> > Hello,
> > QEMU was accepted into Google's oss-fuzz continuous-fuzzing platform [1]
> > earlier this year. The fuzzers currently running on oss-fuzz are based on my
> > 2019 Google Summer of Code Project, which leveraged libfuzzer, qtest and libqos
> > to provide a framework for writing virtual-device fuzzers. At the moment, there
> > are a handful of fuzzers upstream and running on oss-fuzz (located in
> > tests/qtest/fuzz/). They fuzz only a few devices and serve mostly as
> > examples.
> > 
> > If everything goes well, soon a generic fuzzer [2] will land upstream, which
> > allows us to fuzz many configurations of QEMU, without any device-specific
> > code. To date this fuzzer has led to ~50 bug reports on launchpad. Once the
> > generic-fuzzer lands upstream, OSS-Fuzz will automatically start fuzzing a
> > bunch [3] of fuzzer configurations, and it is likely to find bugs.  Others will
> > also be able to send simple patches to add additional device configurations for
> > fuzzing.
> > 
> > The oss-fuzz process looks roughly like this:
> > 1. oss-fuzz fuzzes QEMU
> > 2. When oss-fuzz finds a bug, it reports it to a few [4] people that have
> > access to reports and reproducers.
> > 3. If a fix is merged upstream, oss-fuzz will figure this out and mark the
> > bug as fixed and make the report public 30 days later.
> > 4. After 90 days the bug (fixed or not) becomes public, so anyone can view
> > it here https://bugs.chromium.org/p/oss-fuzz/issues/list
> > 
> > The oss-fuzz reports look like this:
> > https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=23701=qemu=2
> > 
> > This means that when oss-fuzz finds new bugs, the relevant developers do not
> > know about them unless someone with access files a separate report to the
> > list/launchpad. So far this hasn't been a problem, since oss-fuzz has only been
> > running some small example fuzzers. Once [2] lands upstream, we should
> > see a significant uptick in oss-fuzz reports, and I hope that we can develop a
> > process to ensure these bugs are properly dealt with. One option we have is to
> > make the reports public immediately and send notifications to
> > qemu-devel. This is the approach taken by some other projects on
> > oss-fuzz, such as LLVM. Though it's not on oss-fuzz, bugs found by
> > syzkaller in the kernel are also automatically sent to a public list.
> > The question is: 
> > 
> > What approach should we take for dealing with bugs found on oss-fuzz?

If we assume that a non-negligible number of fuzz bugs will be exploitable
by a malicious guest OS to break out into the host, then I think it is
likely undesirable to make them public immediately without at least a basic
human triage step to catch possibly serious security issues.

Still, a large percentage are likely to be low impact / not urgent to deal
with, so we want a low-overhead way to handle the fuzz output that doesn't
create a bottleneck on a small number of people.

Overall my feeling is that we want to be able to farm out triage to the
respective subsystem maintainers, who can then decide whether the bug
needs notifying to the security team, or can be made public immediately.

I think ideally we would be doing triage in QEMU's own bug tracker, so
we don't need to have maintainers create accounts on a 3rd party tracker
to see reports.

Is it practical to identify a primary affected source file from the fuzz
crash report with any level of reliability, such that we could file a private
launchpad bug, automatically CC'ing a subsystem maintainer (and the security
team)?


Regards,
Daniel
-- 
|: https://berrange.com  -o-https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org -o-https://fstop138.berrange.com :|
|: https://entangle-photo.org-o-https://www.instagram.com/dberrange :|

Re: [PATCH v10 09/10] virtio-iommu: Set supported page size mask

2020-10-22 Thread Jean-Philippe Brucker

On Mon, Oct 19, 2020 at 05:35:39PM -0400, Peter Xu wrote:
> > +/*
> > + * Disallow shrinking the page size. For example if an endpoint only
> > + * supports 64kB pages, we can't globally enable 4kB pages. But that
> > + * shouldn't happen, the host is unlikely to setup differing page granules.
> > + * The other bits are only hints describing optimal block sizes.
> > + */
> > +if (new_granule < old_granule) {
> > +error_setg(errp, "memory region shrinks the virtio-iommu page granule");
> > +return -1;
> > +}
> 
> My understanding is that shrink is actually allowed, instead we should forbid
> growing of the mask?  For example, initially the old_granule will always point
> to the guest page size.  Then as long as the host page size (which new_granule
> represents) is smaller than the old_granule, then it seems fine... Or am I wrong?

The case I was checking against is two assigned devices with different
page sizes. First one sets a 64kB page size, then the second one shouldn't
be able to shrink it back to 4kB, because the guest would create mappings
not aligned on 64kB, which can't be applied by the pIOMMU of the first
device.

But let's forget this case for now, in practice all assigned devices use
the host page size.

> 
> Another thing, IIUC this function will mostly be called in vfio code when the
> container page mask will be passed into it.  If there're multiple vfio
> containers that support different host IOMMU page sizes, then IIUC the order of
> the call to virtio_iommu_set_page_size_mask() is undefined.  It's probably
> related to which "-device vfio-pci,..." parameter is earlier.
> 
> To make this simpler, I'm thinking whether we should just forbid the case where
> devices have different iommu page sizes.  So when assigned devices are used, we
> make sure all host iommu page sizes are the same, and the value should be
> smaller than guest page size.  Otherwise we'll simply fall back to guest psize.

Mostly agree, I need to simplify this function.

I don't think we care about guest page size, though. Currently our default
mask is TARGET_PAGE_MASK, which is the smallest size supported by vCPUs
(generally 4kB), but it doesn't really mean guest page size, since the
guest can choose a larger granule at runtime. Besides, virtio-iommu can in
theory map at byte granule if there isn't any assigned device, so our
default mask could as well be ~0ULL (but that doesn't work at the moment, I've
tried).

So what I'd like to do for next version:

* Set qemu_real_host_page_mask as the default page mask, instead of the
  rather arbitrary TARGET_PAGE_MASK. Otherwise we cannot hotplug assigned
  devices on a 64kB host, since TARGET_PAGE_MASK is pretty much always
  4kB.

* Disallow changing the page size. It's simpler and works in
  practice if we default to qemu_real_host_page_mask.

* For non-hotplug devices, allow changing the rest of the mask. For
  hotplug devices, only warn about it.
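
The policy in the list above could look roughly like this. It is a standalone sketch under assumptions, not the next version of the patch: the function and enum names are hypothetical, the current mask is assumed to start out as the host page mask, and a plain loop stands in for ctz64().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum mask_result { MASK_OK, MASK_WARN, MASK_REJECT };

/* Index of the lowest set bit, i.e. log2 of the granule; mask != 0. */
static int granule_shift(uint64_t mask)
{
    int n = 0;

    assert(mask != 0);
    while (!(mask & 1)) {
        mask >>= 1;
        n++;
    }
    return n;
}

/*
 * Apply new_mask on top of *cur:
 *  - an empty mask or a different granule is rejected;
 *  - for cold-plugged devices the remaining (block size) bits may change;
 *  - for hotplugged devices a differing mask is only warned about and
 *    *cur is left untouched.
 */
static enum mask_result set_page_size_mask(uint64_t *cur, uint64_t new_mask,
                                           int hotplugged)
{
    assert(cur != NULL && *cur != 0);

    if (!new_mask || granule_shift(new_mask) != granule_shift(*cur)) {
        return MASK_REJECT;
    }
    if (new_mask == *cur) {
        return MASK_OK;
    }
    if (hotplugged) {
        return MASK_WARN;
    }
    *cur = new_mask;
    return MASK_OK;
}
```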

Thanks,
Jean

Re: [PATCH v27 05/17] vfio: Add VM state change handler to know state of VM

2020-10-22 Thread Alex Williamson

On Thu, 22 Oct 2020 16:41:55 +0530
Kirti Wankhede  wrote:

> VM state change handler is called on change in VM's state. Based on
> VM state, VFIO device state should be changed.
> Added read/write helper functions for migration region.
> Added function to set device_state.
> 
> Signed-off-by: Kirti Wankhede 
> Reviewed-by: Neo Jia 
> Reviewed-by: Dr. David Alan Gilbert 
> ---
>  hw/vfio/migration.c   | 158 
> ++
>  hw/vfio/trace-events  |   2 +
>  include/hw/vfio/vfio-common.h |   4 ++
>  3 files changed, 164 insertions(+)
> 
> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
> index 5f74a3ad1d72..34f39c7e2e28 100644
> --- a/hw/vfio/migration.c
> +++ b/hw/vfio/migration.c
> @@ -10,6 +10,7 @@
>  #include "qemu/osdep.h"
>  #include 
>  
> +#include "sysemu/runstate.h"
>  #include "hw/vfio/vfio-common.h"
>  #include "cpu.h"
>  #include "migration/migration.h"
> @@ -22,6 +23,157 @@
>  #include "exec/ram_addr.h"
>  #include "pci.h"
>  #include "trace.h"
> +#include "hw/hw.h"
> +
> +static inline int vfio_mig_access(VFIODevice *vbasedev, void *val, int count,
> +  off_t off, bool iswrite)
> +{
> +int ret;
> +
> +ret = iswrite ? pwrite(vbasedev->fd, val, count, off) :
> +pread(vbasedev->fd, val, count, off);
> +if (ret < count) {
> +error_report("vfio_mig_%s %d byte %s: failed at offset 0x%lx, err: %s",
> + iswrite ? "write" : "read", count,
> + vbasedev->name, off, strerror(errno));
> +return (ret < 0) ? ret : -EINVAL;
> +}
> +return 0;
> +}
> +
> +static int vfio_mig_rw(VFIODevice *vbasedev, __u8 *buf, size_t count,
> +   off_t off, bool iswrite)
> +{
> +int ret, done = 0;
> +__u8 *tbuf = buf;
> +
> +while (count) {
> +int bytes = 0;
> +
> +if (count >= 8 && !(off % 8)) {
> +bytes = 8;
> +} else if (count >= 4 && !(off % 4)) {
> +bytes = 4;
> +} else if (count >= 2 && !(off % 2)) {
> +bytes = 2;
> +} else {
> +bytes = 1;
> +}
> +
> +ret = vfio_mig_access(vbasedev, tbuf, bytes, off, iswrite);
> +if (ret) {
> +return ret;
> +}
> +
> +count -= bytes;
> +done += bytes;
> +off += bytes;
> +tbuf += bytes;
> +}
> +return done;
> +}
> +
> +#define vfio_mig_read(f, v, c, o)   vfio_mig_rw(f, (__u8 *)v, c, o, false)
> +#define vfio_mig_write(f, v, c, o)  vfio_mig_rw(f, (__u8 *)v, c, o, true)
> +
> +#define VFIO_MIG_STRUCT_OFFSET(f)   \
> + offsetof(struct vfio_device_migration_info, f)
> +/*
> + * Change the device_state register for device @vbasedev. Bits set in @mask
> + * are preserved, bits set in @value are set, and bits not set in either @mask
> + * or @value are cleared in device_state. If the register cannot be accessed,
> + * the resulting state would be invalid, or the device enters an error state,
> + * an error is returned.
> + */
> +
> +static int vfio_migration_set_state(VFIODevice *vbasedev, uint32_t mask,
> +uint32_t value)
> +{
> +VFIOMigration *migration = vbasedev->migration;
> +VFIORegion *region = &migration->region;
> +off_t dev_state_off = region->fd_offset +
> +  VFIO_MIG_STRUCT_OFFSET(device_state);
> +uint32_t device_state;
> +int ret;
> +
> +ret = vfio_mig_read(vbasedev, &device_state, sizeof(device_state),
> +dev_state_off);
> +if (ret < 0) {
> +return ret;
> +}
> +
> +device_state = (device_state & mask) | value;
> +
> +if (!VFIO_DEVICE_STATE_VALID(device_state)) {
> +return -EINVAL;
> +}
> +
> +ret = vfio_mig_write(vbasedev, &device_state, sizeof(device_state),
> + dev_state_off);
> +if (ret < 0) {
> +int rret;
> +
> +rret = vfio_mig_read(vbasedev, &device_state, sizeof(device_state),
> + dev_state_off);
> +
> +if ((rret < 0) || (VFIO_DEVICE_STATE_IS_ERROR(device_state))) {
> +hw_error("%s: Device in error state 0x%x", vbasedev->name,
> + device_state);
> +return rret ? rret : -EIO;
> +}
> +return ret;
> +}
> +
> +migration->device_state = device_state;
> +trace_vfio_migration_set_state(vbasedev->name, device_state);
> +return 0;
> +}
> +
> +static void vfio_vmstate_change(void *opaque, int running, RunState state)
> +{
> +VFIODevice *vbasedev = opaque;
> +VFIOMigration *migration = vbasedev->migration;
> +uint32_t value, mask;
> +int ret;
> +
> +if ((vbasedev->migration->vm_running == running)) {
> +return;
> +}
> +
> +if (running) {
> +/*
> + * Here device state can have one of _SAVING, _RESUMING or _STOP bit.
> + *
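
As a side note on vfio_mig_rw() in the quoted patch: each iteration picks the largest access of 8, 4, 2 or 1 bytes that both fits in the remaining count and keeps the offset naturally aligned. The selection logic can be tried on its own with the minimal sketch below. The function names are hypothetical, and no device access is performed; only the sizes are computed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Size in bytes of the next access: the largest of 8, 4, 2, 1 that fits
 * in the remaining count and keeps the offset naturally aligned.
 */
static size_t next_access_size(uint64_t off, size_t count)
{
    assert(count > 0);

    if (count >= 8 && !(off % 8)) {
        return 8;
    } else if (count >= 4 && !(off % 4)) {
        return 4;
    } else if (count >= 2 && !(off % 2)) {
        return 2;
    }
    return 1;
}

/*
 * Split a transfer of count bytes at offset off into accesses.
 * Returns the number of accesses; the first `max` sizes are stored in
 * sizes[] when sizes is not NULL.
 */
static size_t split_accesses(uint64_t off, size_t count,
                             size_t *sizes, size_t max)
{
    size_t n = 0;

    while (count) {
        size_t bytes = next_access_size(off, count);

        if (sizes && n < max) {
            sizes[n] = bytes;
        }
        n++;
        off += bytes;
        count -= bytes;
    }
    return n;
}
```

A 15-byte transfer starting at offset 1, for example, is split into accesses of 1, 2, 4 and 8 bytes, each naturally aligned.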

Re: [PATCH v1 0/2] Add timeout mechanism to qmp actions

2020-10-22 Thread Fam Zheng

On Tue, 2020-10-20 at 09:34 +0800, Zhenyu Ye wrote:
> On 2020/10/19 21:25, Paolo Bonzini wrote:
> > On 19/10/20 14:40, Zhenyu Ye wrote:
> > > The kernel backtrace for io_submit in GUEST is:
> > > 
> > >   guest# ./offcputime -K -p `pgrep -nx fio`
> > >   b'finish_task_switch'
> > >   b'__schedule'
> > >   b'schedule'
> > >   b'io_schedule'
> > >   b'blk_mq_get_tag'
> > >   b'blk_mq_get_request'
> > >   b'blk_mq_make_request'
> > >   b'generic_make_request'
> > >   b'submit_bio'
> > >   b'blkdev_direct_IO'
> > >   b'generic_file_read_iter'
> > >   b'aio_read'
> > >   b'io_submit_one'
> > >   b'__x64_sys_io_submit'
> > >   b'do_syscall_64'
> > >   b'entry_SYSCALL_64_after_hwframe'
> > >   -fio (1464)
> > >   40031912
> > > 
> > > And Linux io_uring can avoid the latency problem.

Thanks for the info. What this tells us is basically the inflight
requests are high. It's sad that the linux-aio is in practice
implemented as a blocking API.

Host side backtrace will be of more help. Can you get that too?

Fam

> > 
> > What filesystem are you using?
> > 
> 
> On host, the VM image and disk images are based on ext4 filesystem.
> In guest, the '/' uses xfs filesystem, and the disks are raw devices.
> 
> guest# df -hT
> Filesystem  Type  Size  Used Avail Use% Mounted on
> devtmpfsdevtmpfs   16G 0   16G   0% /dev
> tmpfs   tmpfs  16G 0   16G   0% /dev/shm
> tmpfs   tmpfs  16G  976K   16G   1% /run
> /dev/mapper/fedora-root xfs   8.0G  3.2G  4.9G  40% /
> tmpfs   tmpfs  16G 0   16G   0% /tmp
> /dev/sda1   xfs  1014M  181M  834M  18% /boot
> tmpfs   tmpfs 3.2G 0  3.2G   0% /run/user/0
> 
> guest# lsblk
> NAMEMAJ:MIN RM SIZE RO TYPE MOUNTPOINT
> sda   8:00  10G  0 disk
> ├─sda18:10   1G  0 part /boot
> └─sda28:20   9G  0 part
>   ├─fedora-root 253:00   8G  0 lvm  /
>   └─fedora-swap 253:10   1G  0 lvm  [SWAP]
> vda 252:00  10G  0 disk
> vdb 252:16   0  10G  0 disk
> vdc 252:32   0  10G  0 disk
> vdd 252:48   0  10G  0 disk
> 
> Thanks,
> Zhenyu
>

[PATCH] CHANGELOG: remove disused file

2020-10-22 Thread John Snow

There's no reason to keep this here; the versions described are
ancient. Everything here is still mirrored on
https://wiki.qemu.org/ChangeLog/old if anyone is curious; otherwise, use
the git history.

Signed-off-by: John Snow 
---
 Changelog | 580 --
 1 file changed, 580 deletions(-)
 delete mode 100644 Changelog

diff --git a/Changelog b/Changelog
deleted file mode 100644
index f7e178ccc01..000
--- a/Changelog
+++ /dev/null
@@ -1,580 +0,0 @@
-This file documents changes for QEMU releases 0.12 and earlier.
-For changelog information for later releases, see
-https://wiki.qemu.org/ChangeLog or look at the git history for
-more detailed information.
-
-
-version 0.12.0:
-
-  - Update to SeaBIOS 0.5.0
-  - e1000: fix device link status in Linux (Anthony Liguori)
-  - monitor: fix QMP for balloon command (Luiz Capitulino)
-  - QMP: Return an empty dict by default (Luiz Capitulino)
-  - QMP: Only handle converted commands (Luiz Capitulino)
-  - pci: support PCI based option rom loading (Gerd Hoffman/Anthony Liguori)
-  - Fix backcompat for hotplug of SCSI controllers (Daniel P. Berrange)
-  - fdc: fix migration from 0.11 (Juan Quintela)
-  - vmware-vga: fix segv on cursor resize. (Dave Airlie)
-  - vmware-vga: various fixes (Dave Airlie/Anthony Liguori)
-  - qdev: improve property error reporting. (Gerd Hoffmann)
-  - fix vga names in default_list (Gerd Hoffmann)
-  - usb-host: check mon before using it. (Gerd Hoffmann)
-  - usb-net: use qdev for -usbdevice (Gerd Hoffmann)
-  - monitor: Catch printing to non-existent monitor (Luiz Capitulino)
-  - Avoid permanently disabled QEMU monitor when UNIX migration fails (Daniel P. Berrange)
-  - Fix loading of ELF multiboot kernels (Kevin Wolf)
-  - qemu-io: Fix memory leak (Kevin Wolf)
-  - Fix thinko in linuxboot.S (Paolo Bonzini)
-  - target-i386: Fix evaluation of DR7 register (Jan Kiszka)
-  - vnc: hextile: do not generate ForegroundSpecified and SubrectsColoured tiles (Anthony Liguori)
-  - S390: Bail out without KVM (Alexander Graf)
-  - S390: Don't tell guest we're updating config space (Alexander Graf)
-  - target-s390: Fail on unknown instructions (Alexander Graf)
-  - osdep: Fix runtime failure on older Linux kernels (Andre Przywara)
-  - Fix a make -j race (Juergen Lock)
-  - target-alpha: Fix generic ctz64. (Richard Henderson)
-  - s390: Fix buggy assignment (Stefan Weil)
-  - target-mips: fix user-mode emulation startup (Nathan Froyd)
-  - target-i386: Update CPUID feature set for TCG (Andre Przywara)
-  - s390: fix build on 32 bit host (Michael S. Tsirkin)
-   
-version 0.12.0-rc2:
-
-  - v2: properly save kvm system time msr registers (Glauber Costa)
-  - convert more monitor commands to qmp (Luiz Capitulino)
-  - vnc: fix capslock tracking logic. (Gerd Hoffmann)
-  - QemuOpts: allow larger option values. (Gerd Hoffmann)
-  - scsi: fix drive hotplug. (Gerd Hoffmann)
-  - pci: don't hw_error() when no slot is available. (Gerd Hoffmann)
-  - pci: don't abort() when trying to hotplug with acpi off. (Gerd Hoffmann)
-  - allow default devices to be implemented in config file (Gerd Hoffman)
-  - vc: colorize chardev title line with blue background. (Gerd Hoffmann)
-  - chardev: make chardevs specified in config file work. (Gerd Hoffmann)
-  - qdev: also match bus name for global properties (Gerd Hoffmann)
-  - qdev: add command line option to set global defaults for properties. (Gerd Hoffmann)
-  - kvm: x86: Save/restore exception_index (Jan Kiszka)
-  - qdev: Replace device names containing whitespace (Markus Armbruster)
-  - fix rtc-td-hack on host without high-res timers (Gleb Natapov)
-  - virtio: verify features on load (Michael S. Tsirkin)
-  - vmware_vga: add rom file so that it boots. (Dave Airlie)
-  - Do not abort on qemu_malloc(0) in production builds (Anthony Liguori)
-  - Fix ARM userspace strex implementation. (Paul Brook)
-  - qemu: delete rule target on error (Michael S. Tsirkin)
-  - QMP: add human-readable description to error response (Markus Armbruster)
-  - convert more monitor commands to QError (Markus Armbruster)
-  - monitor: Fix double-prompt after "change vnc passwd BLA" (Markus Armbruster)
-  - monitor: do_cont(): Don't ask for passwords (Luiz Capitulino)
-  - monitor: Introduce 'block_passwd' command (Luiz Capitulino)
-  - pci: interrupt disable bit support (Michael S. Tsirkin)
-  - pci: interrupt status bit implementation (Michael S. Tsirkin)
-  - pci: prepare irq code for interrupt state (Michael S. Tsirkin)
-  - msix: function mask support (Michael S. Tsirkin)
-  - msix: macro rename for function mask support (Michael S. Tsirkin)
-  - cpuid: Fix multicore setup on Intel (Andre Przywara)
-  - kvm: x86: Fix initial kvm_has_msr_star (Jan Kiszka)
-  - Update OpenBIOS images to r640 (Aurelien Jarno)
-
-version 0.10.2:
-
-  - fix savevm/loadvm (Anthony Liguori)
-  - live migration: fix dirty tracking windows (Glauber Costa)
-  - live migration: improve error

Re: Ramping up Continuous Fuzzing of Virtual Devices in QEMU

2020-10-22 Thread Alexander Bulekov

+CC Prasad

On 201022 1219, Alexander Bulekov wrote:
> Hello,
> QEMU was accepted into Google's oss-fuzz continuous-fuzzing platform [1]
> earlier this year. The fuzzers currently running on oss-fuzz are based on my
> 2019 Google Summer of Code Project, which leveraged libfuzzer, qtest and libqos
> to provide a framework for writing virtual-device fuzzers. At the moment, there
> are a handful of fuzzers upstream and running on oss-fuzz (located in
> tests/qtest/fuzz/). They fuzz only a few devices and serve mostly as
> examples.
> 
> If everything goes well, soon a generic fuzzer [2] will land upstream, which
> allows us to fuzz many configurations of QEMU, without any device-specific
> code. To date this fuzzer has led to ~50 bug reports on launchpad. Once the
> generic-fuzzer lands upstream, OSS-Fuzz will automatically start fuzzing a
> bunch [3] of fuzzer configurations, and it is likely to find bugs.  Others will
> also be able to send simple patches to add additional device configurations for
> fuzzing.
> 
> The oss-fuzz process looks roughly like this:
> 1. oss-fuzz fuzzes QEMU
> 2. When oss-fuzz finds a bug, it reports it to a few [4] people that have
> access to reports and reproducers.
> 3. If a fix is merged upstream, oss-fuzz will figure this out and mark the
> bug as fixed and make the report public 30 days later.
> 4. After 90 days the bug (fixed or not) becomes public, so anyone can view
> it here https://bugs.chromium.org/p/oss-fuzz/issues/list
> 
> The oss-fuzz reports look like this:
> https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=23701=qemu=2
> 
> This means that when oss-fuzz finds new bugs, the relevant developers do not
> know about them unless someone with access files a separate report to the
> list/launchpad. So far this hasn't been a problem, since oss-fuzz has only been
> running some small example fuzzers. Once [2] lands upstream, we should
> see a significant uptick in oss-fuzz reports, and I hope that we can develop a
> process to ensure these bugs are properly dealt with. One option we have is to
> make the reports public immediately and send notifications to
> qemu-devel. This is the approach taken by some other projects on
> oss-fuzz, such as LLVM. Though it's not on oss-fuzz, bugs found by
> syzkaller in the kernel are also automatically sent to a public list.
> The question is: 
> 
> What approach should we take for dealing with bugs found on oss-fuzz?
> 
> [1] https://github.com/google/oss-fuzz
> [2] https://lists.gnu.org/archive/html/qemu-devel/2020-10/msg06331.html
> [3] https://lists.gnu.org/archive/html/qemu-devel/2020-10/msg06345.html
> [4] https://github.com/google/oss-fuzz/blob/fbf916ce14952ba192e58fe8550096b868fcf62d/projects/qemu/project.yaml#L4
> 
> For further reference, the vast majority of these bugs were found with the
> generic-fuzzer:
> https://bugs.launchpad.net/~a1xndr/+bugs
> 
> There are more that I haven't yet had time to write reports for.
> Thank you
> -Alex
