Re: [Qemu-devel] How to reserve guest physical region for ACPI

2016-01-07 Thread Michael S. Tsirkin
On Thu, Jan 07, 2016 at 11:30:25AM +0100, Igor Mammedov wrote:
> On Tue, 5 Jan 2016 18:43:02 +0200
> "Michael S. Tsirkin" <m...@redhat.com> wrote:
> 
> > On Tue, Jan 05, 2016 at 05:30:25PM +0100, Igor Mammedov wrote:
> > > > > bios-linker-loader is a great interface for initializing some
> > > > > guest owned data and linking it together but I think it adds
> > > > > unnecessary complexity and is misused if it's used to handle
> > > > > device owned data/on device memory in this and VMGID cases.
> > > > 
> > > > I want a generic interface for guest to enumerate these things.  linker
> > > > seems quite reasonable but if you see a reason why it won't do, or want
> > > > to propose a better interface, fine.
> > > > 
> > > > PCI would do, too - though windows guys had concerns about
> > > > returning PCI BARs from ACPI.  
> > > There were potential issues with pSeries bootloader that treated
> > > PCI_CLASS_MEMORY_RAM as conventional RAM but it was fixed.
> > > Could you point out to discussion about windows issues?
> > > 
> > > What VMGEN patches that used PCI for mapping purposes were
> > > stuck at, was that it was suggested to use PCI_CLASS_MEMORY_RAM
> > > class id but we couldn't agree on it.
> > > 
> > > VMGEN v13 with full discussion is here
> > > https://patchwork.ozlabs.org/patch/443554/
> > > So to continue with this route we would need to pick some other
> > > driver less class id so windows won't prompt for driver or
> > > maybe supply our own driver stub to guarantee that no one
> > > would touch it. Any suggestions?  
> > 
> > Pick any device/vendor id pair for which windows specifies no driver.
> > There's a small risk that this will conflict with some
> > guest but I think it's minimal.
> device/vendor id pair was QEMU specific so it doesn't conflict with anything.
> The issue we were trying to solve was to prevent Windows asking for a driver,
> even though it does so only once if told not to ask again.
> 
> That's why PCI_CLASS_MEMORY_RAM was selected: it's a generic driver-less
> device descriptor in the INF file which matches as the last resort if
> there isn't any other driver that matches the device by device/vendor id pair.

I think this is the only class in this inf.
If you can't use it, you must use an existing device/vendor id pair,
there's some risk involved but probably not much.

> > 
> > 
> > > > 
> > > >   
> > > > > There was RFC on list to make BIOS boot from NVDIMM already
> > > > > doing some ACPI table lookup/parsing. Now if they were forced
> > > > > to also parse and execute AML to initialize QEMU with guest
> > > > > allocated address that would complicate them quite a bit.
> > > > 
> > > > If they just need to find a table by name, it won't be
> > > > too bad, would it?  
> > > that's what they were doing, scanning memory for a static NVDIMM table.
> > > However if it were a DataTable, the BIOS side would have to execute
> > > AML so that the table address could be told to QEMU.  
> > 
> > Not at all. You can find any table by its signature without
> > parsing AML.
> yep, and then BIOS would need to tell its address to QEMU,
> writing to an IO port which is allocated statically in QEMU
> for this purpose and is described in AML only on the guest side.

io ports are an ABI too but they are way easier to
maintain.

> > 
> > 
> > > In case of direct mapping or PCI BAR there is no need to initialize
> > > QEMU side from AML.
> > > That also saves us IO port where this address should be written
> > > if bios-linker-loader approach is used.
> > >   
> > > >   
> > > > > While with NVDIMM control memory region mapped directly by QEMU,
> > > > > respective patches don't need in any way to initialize QEMU,
> > > > > all they would need just read necessary data from control region.
> > > > > 
> > > > > Also using bios-linker-loader takes away some usable RAM
> > > > > from guest and in the end that doesn't scale,
> > > > > the more devices I add the less usable RAM is left for guest OS
> > > > > while all the device needs is a piece of GPA address space
> > > > > that would belong to it.
> > > > 
> > > > I don't get this comment. I don't think it's MMIO that is wanted.
> > > > If it's backed by qemu virtual memory then it's RAM.  
> > 

Re: [RFC PATCH 0/3] x86: Add support for guest DMA dirty page tracking

2016-01-05 Thread Michael S. Tsirkin
On Mon, Jan 04, 2016 at 07:11:25PM -0800, Alexander Duyck wrote:
> >> The two mechanisms referenced above would likely require coordination with
> >> QEMU and as such are open to discussion.  I haven't attempted to address
> >> them as I am not sure there is a consensus as of yet.  My personal
> >> preference would be to add a vendor-specific configuration block to the
> >> emulated pci-bridge interfaces created by QEMU that would allow us to
> >> essentially extend shpc to support guest live migration with pass-through
> >> devices.
> >
> > shpc?
> 
> That is kind of what I was thinking.  We basically need some mechanism
> to allow for the host to ask the device to quiesce.  It has been
> proposed to possibly even look at something like an ACPI interface
> since I know ACPI is used by QEMU to manage hot-plug in the standard
> case.
> 
> - Alex


Start by using hot-unplug for this!

Really use your patch guest side, and write host side
to allow starting migration with the device, but
defer completing it.

So

1.- host tells guest to start tracking memory writes
2.- guest acks
3.- migration starts
4.- most memory is migrated
5.- host tells guest to eject device
6.- guest acks
7.- stop vm and migrate rest of state


It will already be a win since hot unplug after migration starts and
most memory has been migrated is better than hot unplug before migration
starts.

Then measure downtime and profile. Then we can look at ways
to quiesce device faster which really means step 5 is replaced
with "host tells guest to quiesce device and dirty (or just unmap!)
all memory mapped for write by device".

-- 
MST
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH 0/3] x86: Add support for guest DMA dirty page tracking

2016-01-05 Thread Michael S. Tsirkin
On Tue, Jan 05, 2016 at 10:01:04AM +, Dr. David Alan Gilbert wrote:
> * Michael S. Tsirkin (m...@redhat.com) wrote:
> > On Mon, Jan 04, 2016 at 07:11:25PM -0800, Alexander Duyck wrote:
> > > >> The two mechanisms referenced above would likely require coordination 
> > > >> with
> > > >> QEMU and as such are open to discussion.  I haven't attempted to 
> > > >> address
> > > >> them as I am not sure there is a consensus as of yet.  My personal
> > > >> preference would be to add a vendor-specific configuration block to the
> > > >> emulated pci-bridge interfaces created by QEMU that would allow us to
> > > >> essentially extend shpc to support guest live migration with 
> > > >> pass-through
> > > >> devices.
> > > >
> > > > shpc?
> > > 
> > > That is kind of what I was thinking.  We basically need some mechanism
> > > to allow for the host to ask the device to quiesce.  It has been
> > > proposed to possibly even look at something like an ACPI interface
> > > since I know ACPI is used by QEMU to manage hot-plug in the standard
> > > case.
> > > 
> > > - Alex
> > 
> > 
> > Start by using hot-unplug for this!
> > 
> > Really use your patch guest side, and write host side
> > to allow starting migration with the device, but
> > defer completing it.
> > 
> > So
> > 
> > 1.- host tells guest to start tracking memory writes
> > 2.- guest acks
> > 3.- migration starts
> > 4.- most memory is migrated
> > 5.- host tells guest to eject device
> > 6.- guest acks
> > 7.- stop vm and migrate rest of state
> > 
> > 
> > It will already be a win since hot unplug after migration starts and
> > most memory has been migrated is better than hot unplug before migration
> > starts.
> > 
> > Then measure downtime and profile. Then we can look at ways
> > to quiesce device faster which really means step 5 is replaced
> > with "host tells guest to quiesce device and dirty (or just unmap!)
> > all memory mapped for write by device".
> 
> 
> Doing a hot-unplug is going to upset the guest's network stack's view
> of the world; that's something we don't want to change.
> 
> Dave

It might but if you store the IP and restore it quickly
after migration e.g. using guest agent, as opposed to DHCP,
then it won't.

It allows calming the device down in a generic way;
specific drivers can then implement the fast quiesce.

> > 
> > -- 
> > MST
> --
> Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK


[PATCH] kvm/s390: drop unpaired smp_mb

2016-01-05 Thread Michael S. Tsirkin
smp_mb on vcpu destroy isn't paired with anything, violating pairing
rules, and seems to be useless.

Drop it.

Signed-off-by: Michael S. Tsirkin <m...@redhat.com>
---

Untested.

 arch/s390/kvm/kvm-s390.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/arch/s390/kvm/kvm-s390.c b/arch/s390/kvm/kvm-s390.c
index 8465892..7305d2c 100644
--- a/arch/s390/kvm/kvm-s390.c
+++ b/arch/s390/kvm/kvm-s390.c
@@ -1195,7 +1195,6 @@ void kvm_arch_vcpu_destroy(struct kvm_vcpu *vcpu)
(__u64) vcpu->arch.sie_block)
vcpu->kvm->arch.sca->cpu[vcpu->vcpu_id].sda = 0;
}
-   smp_mb();
 
if (kvm_is_ucontrol(vcpu->kvm))
gmap_free(vcpu->arch.gmap);
-- 
MST


Re: How to reserve guest physical region for ACPI

2016-01-05 Thread Michael S. Tsirkin
> > > > > necessary.
> > > > > (https://lists.nongnu.org/archive/html/qemu-devel/2015-11/msg01010.html)
> > > > >   
> > > > 
> > > > Yes, this technique works.
> > > > 
> > > > An alternative is to add an XSDT, XP ignores that.
> > > > XSDT at the moment breaks OVMF (because it loads both
> > > > the RSDT and the XSDT, which is wrong), but I think
> > > > Laszlo was working on a fix for that.  
> > > Using XSDT would increase the RAM occupied by ACPI tables,
> > > as it would duplicate the DSDT plus non-XP-supported AML
> > > at global namespace.  
> > 
> > Not at all - I posted patches linking to same
> > tables from both RSDT and XSDT at some point.
> > Only the list of pointers would be different.
> if you put XP-incompatible AML in a separate SSDT and link it
> only from the XSDT then that would work, but if the incompatibility
> is in the DSDT, one would have to provide a compat DSDT for the RSDT
> and an incompat DSDT for the XSDT.

So don't do this.

> So far the policy was: don't try to run a guest OS on a QEMU
> configuration that isn't supported by it.

It's better if guests don't see some features but
don't crash. It's not always possible of course but
we should try to avoid this.

> For example we use VAR_PACKAGE when running with more
> than 255 VCPUs (commit b4f4d5481) which BSODs XP.

Yes. And it's because we violate the spec, DSDT
should not have this stuff.

> So we can continue with that policy without resorting to
> using both RSDT and XSDT.
> It would be even easier as all AML would be dynamically
> generated and the DSDT would only contain AML elements for
> a concrete QEMU configuration.

I'd prefer XSDT but I won't nack it if you do it in DSDT.
I think it's not spec compliant but guests do not
seem to care.

> > > So far we've managed keep DSDT compatible with XP while
> > > introducing features from v2 and higher ACPI revisions as
> > > AML that is only evaluated on demand.
> > > We can continue doing so unless we have to unconditionally
> > > add incompatible AML at global scope.
> > >   
> > 
> > Yes.
> > 
> > > >   
> > > > > Michael, Paolo, what do you think about these ideas?
> > > > > 
> > > > > Thanks!  
> > > > 
> > > > 
> > > > 
> > > > So using a patch below, we can add Name(PQRS, 0x0) at the top of the
> > > > SSDT (or bottom, or add a separate SSDT just for that).  It returns the
> > > > current offset so we can add that to the linker.
> > > > 
> > > > Won't work if you append the Name to the Aml structure (these can be
> > > > nested to arbitrary depth using aml_append), so using plain GArray for
> > > > this API makes sense to me.
> > > >   
> > > > --->  
> > > > 
> > > > acpi: add build_append_named_dword, returning an offset in buffer
> > > > 
> > > > This is a very limited form of support for runtime patching -
> > > > similar in functionality to what we can do with ACPI_EXTRACT
> > > > macros in python, but implemented in C.
> > > > 
> > > > This is to allow ACPI code direct access to data tables -
> > > > which is exactly what DataTableRegion is there for, except
> > > > no known windows release so far implements DataTableRegion.  
> > > unsupported means Windows will BSOD, so it's practically
> > > unusable unless MS patches currently existing Windows
> > > versions.  
> > 
> > Yes. That's why my patch allows patching SSDT without using
> > DataTableRegion.
> > 
> > > Another thing about DataTableRegion is that ACPI tables are
> > > supposed to have static content which matches the checksum in
> > > the table header, while you are trying to use it for dynamic
> > > data. It would be cleaner/more compatible to teach
> > > bios-linker-loader to just allocate memory and patch the AML
> > > with the allocated address.  
> > 
> > Yes - if address is static, you need to put it outside
> > the table. Can come right before or right after this.
> > 
> > > Also if OperationRegion() is used, then one has to patch
> > > DefOpRegion directly as RegionOffset must be Integer,
> > > using variable names is not permitted there.  
> > 
> > I am not sure the comment was understood correctly.
> > The comment says really "we can't use DataTableRegion
> > so here is an alternative".
> so how are you going to access the data that the patched
> NameString points to?
>

Re: [RFC PATCH 0/3] x86: Add support for guest DMA dirty page tracking

2016-01-05 Thread Michael S. Tsirkin
On Tue, Jan 05, 2016 at 10:45:25AM +, Dr. David Alan Gilbert wrote:
> * Michael S. Tsirkin (m...@redhat.com) wrote:
> > On Tue, Jan 05, 2016 at 10:01:04AM +, Dr. David Alan Gilbert wrote:
> > > * Michael S. Tsirkin (m...@redhat.com) wrote:
> > > > On Mon, Jan 04, 2016 at 07:11:25PM -0800, Alexander Duyck wrote:
> > > > > >> The two mechanisms referenced above would likely require 
> > > > > >> coordination with
> > > > > >> QEMU and as such are open to discussion.  I haven't attempted to 
> > > > > >> address
> > > > > >> them as I am not sure there is a consensus as of yet.  My personal
> > > > > >> preference would be to add a vendor-specific configuration block 
> > > > > >> to the
> > > > > >> emulated pci-bridge interfaces created by QEMU that would allow us 
> > > > > >> to
> > > > > >> essentially extend shpc to support guest live migration with 
> > > > > >> pass-through
> > > > > >> devices.
> > > > > >
> > > > > > shpc?
> > > > > 
> > > > > That is kind of what I was thinking.  We basically need some mechanism
> > > > > to allow for the host to ask the device to quiesce.  It has been
> > > > > proposed to possibly even look at something like an ACPI interface
> > > > > since I know ACPI is used by QEMU to manage hot-plug in the standard
> > > > > case.
> > > > > 
> > > > > - Alex
> > > > 
> > > > 
> > > > Start by using hot-unplug for this!
> > > > 
> > > > Really use your patch guest side, and write host side
> > > > to allow starting migration with the device, but
> > > > defer completing it.
> > > > 
> > > > So
> > > > 
> > > > 1.- host tells guest to start tracking memory writes
> > > > 2.- guest acks
> > > > 3.- migration starts
> > > > 4.- most memory is migrated
> > > > 5.- host tells guest to eject device
> > > > 6.- guest acks
> > > > 7.- stop vm and migrate rest of state
> > > > 
> > > > 
> > > > It will already be a win since hot unplug after migration starts and
> > > > most memory has been migrated is better than hot unplug before migration
> > > > starts.
> > > > 
> > > > Then measure downtime and profile. Then we can look at ways
> > > > to quiesce device faster which really means step 5 is replaced
> > > > with "host tells guest to quiesce device and dirty (or just unmap!)
> > > > all memory mapped for write by device".
> > > 
> > > 
> > > Doing a hot-unplug is going to upset the guests network stacks view
> > > of the world; that's something we don't want to change.
> > > 
> > > Dave
> > 
> > It might but if you store the IP and restore it quickly
> > after migration e.g. using guest agent, as opposed to DHCP,
> > then it won't.
> 
> I thought if you hot-unplug then it will lose any outstanding connections
> on that device.

Which connections and which device?  TCP connections and an ethernet
device?  These are on different layers so of course you don't lose them.
Just do not change the IP address.

Some guests send a signal to applications to close connections
when all links go down. One can work around this
in a variety of ways.

> > It allows calming the device down in a generic way,
> > specific drivers can then implement the fast quiesce.
> 
> Except that if it breaks the guest networking it's useless.
> 
> Dave
> 
> > 
> > > > 
> > > > -- 
> > > > MST
> > > --
> > > Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK
> --
> Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK


Re: [RFC PATCH 0/3] x86: Add support for guest DMA dirty page tracking

2016-01-05 Thread Michael S. Tsirkin
On Tue, Jan 05, 2016 at 12:59:54PM +0200, Michael S. Tsirkin wrote:
> On Tue, Jan 05, 2016 at 10:45:25AM +, Dr. David Alan Gilbert wrote:
> > * Michael S. Tsirkin (m...@redhat.com) wrote:
> > > On Tue, Jan 05, 2016 at 10:01:04AM +, Dr. David Alan Gilbert wrote:
> > > > * Michael S. Tsirkin (m...@redhat.com) wrote:
> > > > > On Mon, Jan 04, 2016 at 07:11:25PM -0800, Alexander Duyck wrote:
> > > > > > >> The two mechanisms referenced above would likely require 
> > > > > > >> coordination with
> > > > > > >> QEMU and as such are open to discussion.  I haven't attempted to 
> > > > > > >> address
> > > > > > >> them as I am not sure there is a consensus as of yet.  My 
> > > > > > >> personal
> > > > > > >> preference would be to add a vendor-specific configuration block 
> > > > > > >> to the
> > > > > > >> emulated pci-bridge interfaces created by QEMU that would allow 
> > > > > > >> us to
> > > > > > >> essentially extend shpc to support guest live migration with 
> > > > > > >> pass-through
> > > > > > >> devices.
> > > > > > >
> > > > > > > shpc?
> > > > > > 
> > > > > > That is kind of what I was thinking.  We basically need some 
> > > > > > mechanism
> > > > > > to allow for the host to ask the device to quiesce.  It has been
> > > > > > proposed to possibly even look at something like an ACPI interface
> > > > > > since I know ACPI is used by QEMU to manage hot-plug in the standard
> > > > > > case.
> > > > > > 
> > > > > > - Alex
> > > > > 
> > > > > 
> > > > > Start by using hot-unplug for this!
> > > > > 
> > > > > Really use your patch guest side, and write host side
> > > > > to allow starting migration with the device, but
> > > > > defer completing it.
> > > > > 
> > > > > So
> > > > > 
> > > > > 1.- host tells guest to start tracking memory writes
> > > > > 2.- guest acks
> > > > > 3.- migration starts
> > > > > 4.- most memory is migrated
> > > > > 5.- host tells guest to eject device
> > > > > 6.- guest acks
> > > > > 7.- stop vm and migrate rest of state
> > > > > 
> > > > > 
> > > > > It will already be a win since hot unplug after migration starts and
> > > > > most memory has been migrated is better than hot unplug before 
> > > > > migration
> > > > > starts.
> > > > > 
> > > > > Then measure downtime and profile. Then we can look at ways
> > > > > to quiesce device faster which really means step 5 is replaced
> > > > > with "host tells guest to quiesce device and dirty (or just unmap!)
> > > > > all memory mapped for write by device".
> > > > 
> > > > 
> > > > Doing a hot-unplug is going to upset the guests network stacks view
> > > > of the world; that's something we don't want to change.
> > > > 
> > > > Dave
> > > 
> > > It might but if you store the IP and restore it quickly
> > > after migration e.g. using guest agent, as opposed to DHCP,
> > > then it won't.
> > 
> > I thought if you hot-unplug then it will lose any outstanding connections
> > on that device.
> > 
> > > It allows calming the device down in a generic way,
> > > specific drivers can then implement the fast quiesce.
> > 
> > Except that if it breaks the guest networking it's useless.
> > 
> > Dave
> 
> Is hot unplug useless then?

Actually I misunderstood the question, unplug does not
have to break guest networking.

> > > 
> > > > > 
> > > > > -- 
> > > > > MST
> > > > --
> > > > Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK
> > --
> > Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK


Re: [RFC PATCH 0/3] x86: Add support for guest DMA dirty page tracking

2016-01-05 Thread Michael S. Tsirkin
On Tue, Jan 05, 2016 at 10:45:25AM +, Dr. David Alan Gilbert wrote:
> * Michael S. Tsirkin (m...@redhat.com) wrote:
> > On Tue, Jan 05, 2016 at 10:01:04AM +, Dr. David Alan Gilbert wrote:
> > > * Michael S. Tsirkin (m...@redhat.com) wrote:
> > > > On Mon, Jan 04, 2016 at 07:11:25PM -0800, Alexander Duyck wrote:
> > > > > >> The two mechanisms referenced above would likely require 
> > > > > >> coordination with
> > > > > >> QEMU and as such are open to discussion.  I haven't attempted to 
> > > > > >> address
> > > > > >> them as I am not sure there is a consensus as of yet.  My personal
> > > > > >> preference would be to add a vendor-specific configuration block 
> > > > > >> to the
> > > > > >> emulated pci-bridge interfaces created by QEMU that would allow us 
> > > > > >> to
> > > > > >> essentially extend shpc to support guest live migration with 
> > > > > >> pass-through
> > > > > >> devices.
> > > > > >
> > > > > > shpc?
> > > > > 
> > > > > That is kind of what I was thinking.  We basically need some mechanism
> > > > > to allow for the host to ask the device to quiesce.  It has been
> > > > > proposed to possibly even look at something like an ACPI interface
> > > > > since I know ACPI is used by QEMU to manage hot-plug in the standard
> > > > > case.
> > > > > 
> > > > > - Alex
> > > > 
> > > > 
> > > > Start by using hot-unplug for this!
> > > > 
> > > > Really use your patch guest side, and write host side
> > > > to allow starting migration with the device, but
> > > > defer completing it.
> > > > 
> > > > So
> > > > 
> > > > 1.- host tells guest to start tracking memory writes
> > > > 2.- guest acks
> > > > 3.- migration starts
> > > > 4.- most memory is migrated
> > > > 5.- host tells guest to eject device
> > > > 6.- guest acks
> > > > 7.- stop vm and migrate rest of state
> > > > 
> > > > 
> > > > It will already be a win since hot unplug after migration starts and
> > > > most memory has been migrated is better than hot unplug before migration
> > > > starts.
> > > > 
> > > > Then measure downtime and profile. Then we can look at ways
> > > > to quiesce device faster which really means step 5 is replaced
> > > > with "host tells guest to quiesce device and dirty (or just unmap!)
> > > > all memory mapped for write by device".
> > > 
> > > 
> > > Doing a hot-unplug is going to upset the guests network stacks view
> > > of the world; that's something we don't want to change.
> > > 
> > > Dave
> > 
> > It might but if you store the IP and restore it quickly
> > after migration e.g. using guest agent, as opposed to DHCP,
> > then it won't.
> 
> I thought if you hot-unplug then it will lose any outstanding connections
> on that device.
> 
> > It allows calming the device down in a generic way,
> > specific drivers can then implement the fast quiesce.
> 
> Except that if it breaks the guest networking it's useless.
> 
> Dave

Is hot unplug useless then?

> > 
> > > > 
> > > > -- 
> > > > MST
> > > --
> > > Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK
> --
> Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK


Re: [RFC PATCH 0/3] x86: Add support for guest DMA dirty page tracking

2016-01-05 Thread Michael S. Tsirkin
On Tue, Jan 05, 2016 at 11:03:38AM +, Dr. David Alan Gilbert wrote:
> * Michael S. Tsirkin (m...@redhat.com) wrote:
> > On Tue, Jan 05, 2016 at 10:45:25AM +, Dr. David Alan Gilbert wrote:
> > > * Michael S. Tsirkin (m...@redhat.com) wrote:
> > > > On Tue, Jan 05, 2016 at 10:01:04AM +, Dr. David Alan Gilbert wrote:
> > > > > * Michael S. Tsirkin (m...@redhat.com) wrote:
> > > > > > On Mon, Jan 04, 2016 at 07:11:25PM -0800, Alexander Duyck wrote:
> > > > > > > >> The two mechanisms referenced above would likely require 
> > > > > > > >> coordination with
> > > > > > > >> QEMU and as such are open to discussion.  I haven't attempted 
> > > > > > > >> to address
> > > > > > > >> them as I am not sure there is a consensus as of yet.  My 
> > > > > > > >> personal
> > > > > > > >> preference would be to add a vendor-specific configuration 
> > > > > > > >> block to the
> > > > > > > >> emulated pci-bridge interfaces created by QEMU that would 
> > > > > > > >> allow us to
> > > > > > > >> essentially extend shpc to support guest live migration with 
> > > > > > > >> pass-through
> > > > > > > >> devices.
> > > > > > > >
> > > > > > > > shpc?
> > > > > > > 
> > > > > > > That is kind of what I was thinking.  We basically need some 
> > > > > > > mechanism
> > > > > > > to allow for the host to ask the device to quiesce.  It has been
> > > > > > > proposed to possibly even look at something like an ACPI interface
> > > > > > > since I know ACPI is used by QEMU to manage hot-plug in the 
> > > > > > > standard
> > > > > > > case.
> > > > > > > 
> > > > > > > - Alex
> > > > > > 
> > > > > > 
> > > > > > Start by using hot-unplug for this!
> > > > > > 
> > > > > > Really use your patch guest side, and write host side
> > > > > > to allow starting migration with the device, but
> > > > > > defer completing it.
> > > > > > 
> > > > > > So
> > > > > > 
> > > > > > 1.- host tells guest to start tracking memory writes
> > > > > > 2.- guest acks
> > > > > > 3.- migration starts
> > > > > > 4.- most memory is migrated
> > > > > > 5.- host tells guest to eject device
> > > > > > 6.- guest acks
> > > > > > 7.- stop vm and migrate rest of state
> > > > > > 
> > > > > > 
> > > > > > It will already be a win since hot unplug after migration starts and
> > > > > > most memory has been migrated is better than hot unplug before 
> > > > > > migration
> > > > > > starts.
> > > > > > 
> > > > > > Then measure downtime and profile. Then we can look at ways
> > > > > > to quiesce device faster which really means step 5 is replaced
> > > > > > with "host tells guest to quiesce device and dirty (or just unmap!)
> > > > > > all memory mapped for write by device".
> > > > > 
> > > > > 
> > > > > Doing a hot-unplug is going to upset the guests network stacks view
> > > > > of the world; that's something we don't want to change.
> > > > > 
> > > > > Dave
> > > > 
> > > > It might but if you store the IP and restore it quickly
> > > > after migration e.g. using guest agent, as opposed to DHCP,
> > > > then it won't.
> > > 
> > > I thought if you hot-unplug then it will lose any outstanding connections
> > > on that device.
> > > 
> > > > It allows calming the device down in a generic way,
> > > > specific drivers can then implement the fast quiesce.
> > > 
> > > Except that if it breaks the guest networking it's useless.
> > > 
> > > Dave
> > 
> > Is hot unplug useless then?
> 
> As a migration hack, yes,

That's based on the premise that it breaks connections, but it
does not have to.

> unless it's paired with a second network device
> as a redundant route.

You can do this too.

But this is not a must at all.

> To do what's being suggested here it's got to be done at the device level
> and not visible to the networking stack.
> 
> Dave

The need for this was never demonstrated.

> > 
> > > > 
> > > > > > 
> > > > > > -- 
> > > > > > MST
> > > > > --
> > > > > Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK
> > > --
> > > Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK
> --
> Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK


Re: [RFC PATCH 0/3] x86: Add support for guest DMA dirty page tracking

2016-01-05 Thread Michael S. Tsirkin
On Tue, Jan 05, 2016 at 12:43:03PM +, Dr. David Alan Gilbert wrote:
> * Michael S. Tsirkin (m...@redhat.com) wrote:
> > On Tue, Jan 05, 2016 at 10:45:25AM +, Dr. David Alan Gilbert wrote:
> > > * Michael S. Tsirkin (m...@redhat.com) wrote:
> > > > On Tue, Jan 05, 2016 at 10:01:04AM +, Dr. David Alan Gilbert wrote:
> > > > > * Michael S. Tsirkin (m...@redhat.com) wrote:
> > > > > > On Mon, Jan 04, 2016 at 07:11:25PM -0800, Alexander Duyck wrote:
> > > > > > > >> The two mechanisms referenced above would likely require 
> > > > > > > >> coordination with
> > > > > > > >> QEMU and as such are open to discussion.  I haven't attempted 
> > > > > > > >> to address
> > > > > > > >> them as I am not sure there is a consensus as of yet.  My 
> > > > > > > >> personal
> > > > > > > >> preference would be to add a vendor-specific configuration 
> > > > > > > >> block to the
> > > > > > > >> emulated pci-bridge interfaces created by QEMU that would 
> > > > > > > >> allow us to
> > > > > > > >> essentially extend shpc to support guest live migration with 
> > > > > > > >> pass-through
> > > > > > > >> devices.
> > > > > > > >
> > > > > > > > shpc?
> > > > > > > 
> > > > > > > That is kind of what I was thinking.  We basically need some 
> > > > > > > mechanism
> > > > > > > to allow for the host to ask the device to quiesce.  It has been
> > > > > > > proposed to possibly even look at something like an ACPI interface
> > > > > > > since I know ACPI is used by QEMU to manage hot-plug in the 
> > > > > > > standard
> > > > > > > case.
> > > > > > > 
> > > > > > > - Alex
> > > > > > 
> > > > > > 
> > > > > > Start by using hot-unplug for this!
> > > > > > 
> > > > > > Really use your patch guest side, and write host side
> > > > > > to allow starting migration with the device, but
> > > > > > defer completing it.
> > > > > > 
> > > > > > So
> > > > > > 
> > > > > > 1.- host tells guest to start tracking memory writes
> > > > > > 2.- guest acks
> > > > > > 3.- migration starts
> > > > > > 4.- most memory is migrated
> > > > > > 5.- host tells guest to eject device
> > > > > > 6.- guest acks
> > > > > > 7.- stop vm and migrate rest of state
> > > > > > 
> > > > > > 
> > > > > > It will already be a win since hot unplug after migration starts and
> > > > > > most memory has been migrated is better than hot unplug before 
> > > > > > migration
> > > > > > starts.
> > > > > > 
> > > > > > Then measure downtime and profile. Then we can look at ways
> > > > > > to quiesce device faster which really means step 5 is replaced
> > > > > > with "host tells guest to quiesce device and dirty (or just unmap!)
> > > > > > all memory mapped for write by device".
> > > > > 
> > > > > 
> > > > > Doing a hot-unplug is going to upset the guest's network stack's view
> > > > > of the world; that's something we don't want to change.
> > > > > 
> > > > > Dave
> > > > 
> > > > It might but if you store the IP and restore it quickly
> > > > after migration e.g. using guest agent, as opposed to DHCP,
> > > > then it won't.
> > > 
> > > I thought if you hot-unplug then it will lose any outstanding connections
> > > on that device.
> > 
> > Which connections and which device?  TCP connections and an ethernet
> > device?  These are on different layers so of course you don't lose them.
> > Just do not change the IP address.
> > 
> > Some guests send a signal to applications to close connections
> > when all links go down. One can work around this
> > in a variety of ways.
> 
> So, OK, I was surprised that a simple connection didn't go down when
> I test

Re: [PATCH RFC] vhost: basic device IOTLB support

2015-12-31 Thread Michael S. Tsirkin
On Thu, Dec 31, 2015 at 03:13:45PM +0800, Jason Wang wrote:
> This patch tries to implement a device IOTLB for vhost. This could be
> used for co-operation with a userspace (qemu) implementation of an
> iommu for a secure DMA environment in the guest.
> 
> The idea is simple. When vhost meets an IOTLB miss, it will request
> the assistance of userspace to do the translation, this is done
> through:
> 
> - Fill the translation request in a preset userspace address (This
>   address is set through ioctl VHOST_SET_IOTLB_REQUEST_ENTRY).
> - Notify userspace through eventfd (This eventfd was set through ioctl
>   VHOST_SET_IOTLB_FD).
> 
> When userspace finishes the translation, it will update the vhost
> IOTLB through VHOST_UPDATE_IOTLB ioctl. Userspace is also in charge of
> snooping the IOTLB invalidation of IOMMU IOTLB and use
> VHOST_UPDATE_IOTLB to invalidate the possible entry in vhost.
> 
> For simplicity, the IOTLB was implemented as a simple hash array. The
> index is calculated from the IOVA page frame number, so it only works
> at PAGE_SIZE granularity.
> 
> A qemu implementation (for reference) is available at:
> g...@github.com:jasowang/qemu.git iommu
> 
> TODO & Known issues:
> 
> - read/write permission validation was not implemented.
> - no feature negotiation.
> - VHOST_SET_MEM_TABLE is not reused (maybe there's a chance).
> - working at PAGE_SIZE level, don't support large mappings.
> - better data structure for IOTLB instead of simple hash array.
> - better API, e.g using mmap() instead of preset userspace address.
> 
> Signed-off-by: Jason Wang 

Interesting. I'm working on a slightly different approach
which is direct vt-d support in vhost.
This one has the advantage of being more portable.

> ---
>  drivers/vhost/net.c|   2 +-
>  drivers/vhost/vhost.c  | 190 
> -
>  drivers/vhost/vhost.h  |  13 
>  include/uapi/linux/vhost.h |  26 +++
>  4 files changed, 229 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
> index 9eda69e..a172be9 100644
> --- a/drivers/vhost/net.c
> +++ b/drivers/vhost/net.c
> @@ -1083,7 +1083,7 @@ static long vhost_net_ioctl(struct file *f, unsigned int ioctl,
>   r = vhost_dev_ioctl(&n->dev, ioctl, argp);
>   if (r == -ENOIOCTLCMD)
>   r = vhost_vring_ioctl(&n->dev, ioctl, argp);
> - else
> + else if (ioctl != VHOST_UPDATE_IOTLB)
>   vhost_net_flush(n);
>   mutex_unlock(&n->dev.mutex);
>   return r;
> diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> index eec2f11..729fe05 100644
> --- a/drivers/vhost/vhost.c
> +++ b/drivers/vhost/vhost.c
> @@ -113,6 +113,11 @@ static void vhost_init_is_le(struct vhost_virtqueue *vq)
>  }
>  #endif /* CONFIG_VHOST_CROSS_ENDIAN_LEGACY */
>  
> +static inline int vhost_iotlb_hash(u64 iova)
> +{
> + return (iova >> PAGE_SHIFT) & (VHOST_IOTLB_SIZE - 1);
> +}
> +
>  static void vhost_poll_func(struct file *file, wait_queue_head_t *wqh,
>   poll_table *pt)
>  {
> @@ -384,8 +389,14 @@ void vhost_dev_init(struct vhost_dev *dev,
>   dev->memory = NULL;
>   dev->mm = NULL;
>   spin_lock_init(&dev->work_lock);
> + spin_lock_init(&dev->iotlb_lock);
> + mutex_init(&dev->iotlb_req_mutex);
>   INIT_LIST_HEAD(&dev->work_list);
>   dev->worker = NULL;
> + dev->iotlb_request = NULL;
> + dev->iotlb_ctx = NULL;
> + dev->iotlb_file = NULL;
> + dev->pending_request.flags.type = VHOST_IOTLB_INVALIDATE;
>  
>   for (i = 0; i < dev->nvqs; ++i) {
>   vq = dev->vqs[i];
> @@ -393,12 +404,17 @@ void vhost_dev_init(struct vhost_dev *dev,
>   vq->indirect = NULL;
>   vq->heads = NULL;
>   vq->dev = dev;
> + vq->iotlb_request = NULL;
>   mutex_init(&vq->mutex);
>   vhost_vq_reset(dev, vq);
>   if (vq->handle_kick)
>   vhost_poll_init(&vq->poll, vq->handle_kick,
>   POLLIN, dev);
>   }
> +
> + init_completion(&dev->iotlb_completion);
> + for (i = 0; i < VHOST_IOTLB_SIZE; i++)
> + dev->iotlb[i].flags.valid = VHOST_IOTLB_INVALID;
>  }
>  EXPORT_SYMBOL_GPL(vhost_dev_init);
>  
> @@ -940,9 +956,10 @@ long vhost_dev_ioctl(struct vhost_dev *d, unsigned int ioctl, void __user *argp)
>  {
>   struct file *eventfp, *filep = NULL;
>   struct eventfd_ctx *ctx = NULL;
> + struct vhost_iotlb_entry entry;
>   u64 p;
>   long r;
> - int i, fd;
> + int index, i, fd;
>  
>   /* If you are not the owner, you can become one */
>   if (ioctl == VHOST_SET_OWNER) {
> @@ -1008,6 +1025,80 @@ long vhost_dev_ioctl(struct vhost_dev *d, unsigned int ioctl, void __user *argp)
>   if (filep)
>   fput(filep);
>   break;
> + case VHOST_SET_IOTLB_FD:
> +   

Re: How to reserve guest physical region for ACPI

2015-12-30 Thread Michael S. Tsirkin
On Wed, Dec 30, 2015 at 04:55:54PM +0100, Igor Mammedov wrote:
> On Mon, 28 Dec 2015 14:50:15 +0200
> "Michael S. Tsirkin" <m...@redhat.com> wrote:
> 
> > On Mon, Dec 28, 2015 at 10:39:04AM +0800, Xiao Guangrong wrote:
> > > 
> > > Hi Michael, Paolo,
> > > 
> > > Now it is the time to return to the challenge that how to reserve guest
> > > physical region internally used by ACPI.
> > > 
> > > Igor suggested that:
> > > | An alternative place to allocate reserve from could be high memory.
> > > | For pc we have "reserved-memory-end" which currently makes sure
> > > | that hotpluggable memory range isn't used by firmware
> > > (https://lists.nongnu.org/archive/html/qemu-devel/2015-11/msg00926.html)
> > 
> > I don't want to tie things to reserved-memory-end because this
> > does not scale: next time we need to reserve memory,
> > we'll need to find yet another way to figure out what is where.
> Could you elaborate a bit more on a problem you're seeing?
> 
> To me it looks like it scales rather well.
> For example, let's imagine that we are adding a device
> that has some on-device memory that should be mapped into GPA space;
> the code to do so would look like:
> 
>   pc_machine_device_plug_cb(dev)
>   {
>...
>if (dev == OUR_NEW_DEVICE_TYPE) {
>memory_region_add_subregion(as, current_reserved_end, &dev->mr);
>set_new_reserved_end(current_reserved_end +
>memory_region_size(&dev->mr));
>}
>   }
> 
> we can practically add any number of new devices that way.

Yes but we'll have to build a host side allocator for these, and that's
nasty. We'll also have to maintain these addresses indefinitely (at
least per machine version) as they are guest visible.
Not only that, there's no way for guest to know if we move things
around, so basically we'll never be able to change addresses.


>  
> > I would like ./hw/acpi/bios-linker-loader.c interface to be extended to
> > support 64 bit RAM instead (and maybe a way to allocate and
> > zero-initialize buffer without loading it through fwcfg), this way bios
> > does the allocation, and addresses can be patched into acpi.
> and then guest side needs to parse/execute some AML that would
> initialize QEMU side so it would know where to write data.

Well not really - we can put it in a data table, by itself
so it's easy to find.

AML is only needed if access from ACPI is desired.


> bios-linker-loader is a great interface for initializing some
> guest owned data and linking it together but I think it adds
> unnecessary complexity and is misused if it's used to handle
> device owned data/on device memory in this and VMGID cases.

I want a generic interface for guest to enumerate these things.  linker
seems quite reasonable but if you see a reason why it won't do, or want
to propose a better interface, fine.

PCI would do, too - though windows guys had concerns about
returning PCI BARs from ACPI.


> There was RFC on list to make BIOS boot from NVDIMM already
> doing some ACPI table lookup/parsing. Now if they were forced
> to also parse and execute AML to initialize QEMU with guest
> allocated address that would complicate them quite a bit.

If they just need to find a table by name, it won't be
too bad, would it?

> While with NVDIMM control memory region mapped directly by QEMU,
> respective patches don't need in any way to initialize QEMU,
> all they would need just read necessary data from control region.
> 
> Also using bios-linker-loader takes away some usable RAM
> from guest and in the end that doesn't scale,
> the more devices I add the less usable RAM is left for guest OS
> while all the device needs is a piece of GPA address space
> that would belong to it.

I don't get this comment. I don't think it's MMIO that is wanted.
If it's backed by qemu virtual memory then it's RAM.

> > 
> > See patch at the bottom that might be handy.
> > 
> > > he also innovated a way to use 64-bit address in DSDT/SSDT.rev = 1:
> > > | when writing ASL one shall make sure that only XP supported
> > > | features are in global scope, which is evaluated when tables
> > > | are loaded and features of rev2 and higher are inside methods.
> > > | That way XP doesn't crash as far as it doesn't evaluate unsupported
> > > | opcode and one can guard those opcodes checking _REV object if 
> > > neccesary.
> > > (https://lists.nongnu.org/archive/html/qemu-devel/2015-11/msg01010.html)
> > 
> > Yes, this technique works.
> > 
> > An alternative is to add an XSDT, XP ignores that.
> > XSDT at the moment breaks OVMF (because it loads both
> > the RSDT and the XSDT, which is wrong), but I think
> > Laszlo was working on a fix for that.

Re: [Qemu-devel] live migration vs device assignment (motivation)

2015-12-29 Thread Michael S. Tsirkin
On Tue, Dec 29, 2015 at 01:42:14AM +0800, Lan, Tianyu wrote:
> 
> 
> On 12/25/2015 8:11 PM, Michael S. Tsirkin wrote:
> >As long as you keep up this vague talk about performance during
> >migration, without even bothering with any measurements, this patchset
> >will keep going nowhere.
> >
> 
> I measured network service downtime for "keep device alive"(RFC patch V1
> presented) and "put down and up network interface"(RFC patch V2 presented)
> during migration with some optimizations.
> 
> The former is around 140ms and the later is around 240ms.
> 
> My patchset relies on the mailbox irq which doesn't work in the suspend state
> and so can't get downtime for suspend/resume cases. Will try to get the
> result later.


Interesting. So you are saying merely ifdown/ifup is 100ms?
This does not sound reasonable.
Is there a chance you are e.g. getting IP from dhcp?

If so that is wrong - clearly should reconfigure the old IP
back without playing with dhcp. For testing, just set up
a static IP.

> >
> >
> >
> >There's Alex's patch that tracks memory changes during migration.  It
> >needs some simple enhancements to be useful in production (e.g. add a
> >host/guest handshake to both enable tracking in guest and to detect the
> >support in host), then it can allow starting migration with an assigned
> >device, by invoking hot-unplug after most of memory have been migrated.
> >
> >Please implement this in qemu and measure the speed.
> 
> Sure. Will do that.
> 
> >I will not be surprised if destroying/creating netdev in linux
> >turns out to take too long, but before anyone bothered
> >checking, it does not make sense to discuss further enhancements.
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Qemu-devel] live migration vs device assignment (motivation)

2015-12-29 Thread Michael S. Tsirkin
On Tue, Dec 29, 2015 at 09:04:51AM -0800, Alexander Duyck wrote:
> On Tue, Dec 29, 2015 at 8:46 AM, Michael S. Tsirkin <m...@redhat.com> wrote:
> > On Tue, Dec 29, 2015 at 01:42:14AM +0800, Lan, Tianyu wrote:
> >>
> >>
> >> On 12/25/2015 8:11 PM, Michael S. Tsirkin wrote:
> >> >As long as you keep up this vague talk about performance during
> >> >migration, without even bothering with any measurements, this patchset
> >> >will keep going nowhere.
> >> >
> >>
> >> I measured network service downtime for "keep device alive"(RFC patch V1
> >> presented) and "put down and up network interface"(RFC patch V2 presented)
> >> during migration with some optimizations.
> >>
> >> The former is around 140ms and the later is around 240ms.
> >>
> >> My patchset relies on the mailbox irq which doesn't work in the suspend 
> >> state
> >> and so can't get downtime for suspend/resume cases. Will try to get the
> >> result later.
> >
> >
> > Interesting. So you are saying merely ifdown/ifup is 100ms?
> > This does not sound reasonable.
> > Is there a chance you are e.g. getting IP from dhcp?
> 
> 
> Actually it wouldn't surprise me if that is due to a reset logic in
> the driver.  For starters there is a 10 msec delay in the call
> ixgbevf_reset_hw_vf which I believe is present to allow the PF time to
> clear registers after the VF has requested a reset.  There is also a
> 10 to 20 msec sleep in ixgbevf_down which occurs after the Rx queues
> were disabled.  That is in addition to the fact that the function that
> disables the queues does so serially and polls each queue until the
> hardware acknowledges that the queues are actually disabled.  The
> driver also does the serial enable with poll logic on re-enabling the
> queues which likely doesn't help things.
> 
> Really this driver is probably in need of a refactor to clean the
> cruft out of the reset and initialization logic.  I suspect we have
> far more delays than we really need and that is the source of much of
> the slow down.
> 
> - Alex

For ifdown, why is there any need to reset the device at all?
Is it so buffers can be reclaimed?

-- 
MST


Re: [Qemu-devel] live migration vs device assignment (motivation)

2015-12-28 Thread Michael S. Tsirkin
On Mon, Dec 28, 2015 at 03:20:10AM +, Dong, Eddie wrote:
> > >
> > > Even if the device driver doesn't support migration, you still want to
> > > migrate the VM? That may be risky and we should add the "bad path" for
> > > the driver at least.
> > 
> > At a minimum we should have support for hot-plug if we are expecting to
> > support migration.  You would simply have to hot-plug the device before you
> > start migration and then return it after.  That is how the current bonding
> > approach for this works if I am not mistaken.
> 
> Hotplug is good to eliminate the device-specific state clone, but the
> bonding approach is very network specific; it doesn't work for other
> devices such as FPGA devices, QAT devices & GPU devices, which we plan
> to support gradually :)

Alexander didn't say do bonding. He just said bonding uses hot-unplug.

Gradual and generic is the correct approach. So focus on splitting the
work into manageable pieces which are also useful by themselves, and
generally reusable by different devices.

So leave the pausing alone for a moment.

Start from Alexander's patchset for tracking dirty memory, add a way to
control and detect it from userspace (and maybe from host), and a way to
start migration while device is attached, removing it at the last
possible moment.

That will be a nice first step.


> > 
> > The advantage we are looking to gain is to avoid removing/disabling the
> > device for as long as possible.  Ideally we want to keep the device active
> > through the warm-up period, but if the guest doesn't do that we should still
> > be able to fall back on the older approaches if needed.
> > 


Re: [Qemu-devel] live migration vs device assignment (motivation)

2015-12-28 Thread Michael S. Tsirkin
On Sun, Dec 27, 2015 at 01:45:15PM -0800, Alexander Duyck wrote:
> On Sun, Dec 27, 2015 at 1:21 AM, Michael S. Tsirkin <m...@redhat.com> wrote:
> > On Fri, Dec 25, 2015 at 02:31:14PM -0800, Alexander Duyck wrote:
> >> The PCI hot-plug specification calls out that the OS can optionally
> >> implement a "pause" mechanism which is meant to be used for high
> >> availability type environments.  What I am proposing is basically
> >> extending the standard SHPC capable PCI bridge so that we can support
> >> the DMA page dirtying for everything hosted on it, add a vendor
> >> specific block to the config space so that the guest can notify the
> >> host that it will do page dirtying, and add a mechanism to indicate
> >> that all hot-plug events during the warm-up phase of the migration are
> >> pause events instead of full removals.
> >
> > Two comments:
> >
> > 1. A vendor specific capability will always be problematic.
> > Better to register a capability id with pci sig.
> >
> > 2. There are actually several capabilities:
> >
> > A. support for memory dirtying
> > if not supported, we must stop device before migration
> >
> > This is supported by core guest OS code,
> > using patches similar to posted by you.
> >
> >
> > B. support for device replacement
> > This is a faster form of hotplug, where device is removed and
> > later another device using same driver is inserted in the same slot.
> >
> > This is a possible optimization, but I am convinced
> > (A) should be implemented independently of (B).
> >
> 
> My thought on this was that we don't need much to really implement
> either feature.  Really only a bit or two for either one.  I had
> thought about extending the PCI Advanced Features, but for now it
> might make more sense to just implement it as a vendor capability for
> the QEMU based bridges instead of trying to make this a true PCI
> capability since I am not sure if this in any way would apply to
> physical hardware.  The fact is the PCI Advanced Features capability
> is essentially just a vendor specific capability with a different ID

Interesting. I see it more as a backport of pci express
features to pci.

> so if we were to use 2 bits that are currently reserved in the
> capability we could later merge the functionality without much
> overhead.

Don't do this. You must not touch reserved bits.

> I fully agree that the two implementations should be separate but
> nothing says we have to implement them completely different.  If we
> are just using 3 bits for capability, status, and control of each
> feature there is no reason for them to need to be stored in separate
> locations.

True.

> >> I've been poking around in the kernel and QEMU code and the part I
> >> have been trying to sort out is how to get QEMU based pci-bridge to
> >> use the SHPC driver because from what I can tell the driver never
> >> actually gets loaded on the device as it is left in the control of
> >> ACPI hot-plug.
> >
> > There are ways, but you can just use pci express, it's easier.
> 
> That's true.  I should probably just give up on trying to do an
> implementation that works with the i440fx implementation.  I could
> probably move over to the q35 and once that is done then we could look
> at something like the PCI Advanced Features solution for something
> like the PCI-bridge drivers.
> 
> - Alex

Once we have a decent idea of what's required, I can write
an ECN for pci code and id assignment specification.
That's cleaner than vendor specific stuff that's tied to
a specific device/vendor ID.

-- 
MST


Re: [Qemu-devel] live migration vs device assignment (motivation)

2015-12-28 Thread Michael S. Tsirkin
On Mon, Dec 28, 2015 at 11:52:43AM +0300, Pavel Fedin wrote:
>  Hello!
> 
> > A dedicated IRQ per device for something that is a system wide event
> > sounds like a waste.  I don't understand why a spec change is strictly
> > required, we only need to support this with the specific virtual bridge
> > used by QEMU, so I think that a vendor specific capability will do.
> > Once this works well in the field, a PCI spec ECN might make sense
> > to standardise the capability.
> 
>  Keeping track of your discussion for some time, decided to jump in...
>  So far, we want to have some kind of mailbox to notify the guest about 
> migration. So what about some dedicated "pci device" for
> this purpose? Some kind of "migration controller". This is:
> a) perhaps easier to implement than capability, we don't need to push 
> anything to PCI spec.
> b) could easily make friends with Windows, because this means that no bus 
> code has to be touched at all. It would rely only on
> drivers' ability to communicate with each other (i guess it should be 
> possible in Windows, isn't it?)
> c) does not need to steal resources (BARs, IRQs, etc) from the actual devices.
> 
> Kind regards,
> Pavel Fedin
> Expert Engineer
> Samsung Electronics Research center Russia
> 

Sure, or we can use an ACPI device.  It doesn't really matter what we do
for the mailbox. Whoever writes this first will get to select a
mechanism.

-- 
MST


Re: How to reserve guest physical region for ACPI

2015-12-28 Thread Michael S. Tsirkin
On Mon, Dec 28, 2015 at 10:39:04AM +0800, Xiao Guangrong wrote:
> 
> Hi Michael, Paolo,
> 
> Now it is the time to return to the challenge that how to reserve guest
> physical region internally used by ACPI.
> 
> Igor suggested that:
> | An alternative place to allocate reserve from could be high memory.
> | For pc we have "reserved-memory-end" which currently makes sure
> | that hotpluggable memory range isn't used by firmware
> (https://lists.nongnu.org/archive/html/qemu-devel/2015-11/msg00926.html)

I don't want to tie things to reserved-memory-end because this
does not scale: next time we need to reserve memory,
we'll need to find yet another way to figure out what is where.

I would like ./hw/acpi/bios-linker-loader.c interface to be extended to
support 64 bit RAM instead (and maybe a way to allocate and
zero-initialize buffer without loading it through fwcfg), this way bios
does the allocation, and addresses can be patched into acpi.

See patch at the bottom that might be handy.

> he also innovated a way to use 64-bit address in DSDT/SSDT.rev = 1:
> | when writing ASL one shall make sure that only XP supported
> | features are in global scope, which is evaluated when tables
> | are loaded and features of rev2 and higher are inside methods.
> | That way XP doesn't crash as far as it doesn't evaluate unsupported
> | opcode and one can guard those opcodes checking _REV object if neccesary.
> (https://lists.nongnu.org/archive/html/qemu-devel/2015-11/msg01010.html)

Yes, this technique works.

An alternative is to add an XSDT, XP ignores that.
XSDT at the moment breaks OVMF (because it loads both
the RSDT and the XSDT, which is wrong), but I think
Laszlo was working on a fix for that.

> Michael, Paolo, what do you think about these ideas?
> 
> Thanks!



So using a patch below, we can add Name(PQRS, 0x0) at the top of the
SSDT (or bottom, or add a separate SSDT just for that).  It returns the
current offset so we can add that to the linker.

Won't work if you append the Name to the Aml structure (these can be
nested to arbitrary depth using aml_append), so using plain GArray for
this API makes sense to me.

--->

acpi: add build_append_named_dword, returning an offset in buffer

This is a very limited form of support for runtime patching -
similar in functionality to what we can do with ACPI_EXTRACT
macros in python, but implemented in C.

This is to allow ACPI code direct access to data tables -
which is exactly what DataTableRegion is there for, except
no known windows release so far implements DataTableRegion.

Signed-off-by: Michael S. Tsirkin <m...@redhat.com>

---

diff --git a/include/hw/acpi/aml-build.h b/include/hw/acpi/aml-build.h
index 1b632dc..f8998ea 100644
--- a/include/hw/acpi/aml-build.h
+++ b/include/hw/acpi/aml-build.h
@@ -286,4 +286,7 @@ void acpi_build_tables_cleanup(AcpiBuildTables *tables, bool mfre);
 void
 build_rsdt(GArray *table_data, GArray *linker, GArray *table_offsets);
 
+int
+build_append_named_dword(GArray *array, const char *name_format, ...);
+
 #endif
diff --git a/hw/acpi/aml-build.c b/hw/acpi/aml-build.c
index 0d4b324..7f9fa65 100644
--- a/hw/acpi/aml-build.c
+++ b/hw/acpi/aml-build.c
@@ -262,6 +262,32 @@ static void build_append_int(GArray *table, uint64_t value)
 }
 }
 
+/* Build NAME(<name>, 0x0) where 0x0 is encoded as a dword,
+ * and return the offset to 0x0 for runtime patching.
+ *
+ * Warning: runtime patching is best avoided. Only use this as
+ * a replacement for DataTableRegion (for guests that don't
+ * support it).
+ */
+int
+build_append_named_dword(GArray *array, const char *name_format, ...)
+{
+    int offset;
+    va_list ap;
+
+    va_start(ap, name_format);
+    build_append_namestringv(array, name_format, ap);
+    va_end(ap);
+
+    build_append_byte(array, 0x0C); /* DWordPrefix */
+
+    offset = array->len;
+    build_append_int_noprefix(array, 0x0, 4);
+    assert(array->len == offset + 4);
+
+    return offset;
+}
+
 static GPtrArray *alloc_list;
 
 static Aml *aml_alloc(void)




Re: [Qemu-devel] live migration vs device assignment (motivation)

2015-12-27 Thread Michael S. Tsirkin
On Fri, Dec 25, 2015 at 02:31:14PM -0800, Alexander Duyck wrote:
> The PCI hot-plug specification calls out that the OS can optionally
> implement a "pause" mechanism which is meant to be used for high
> availability type environments.  What I am proposing is basically
> extending the standard SHPC capable PCI bridge so that we can support
> the DMA page dirtying for everything hosted on it, add a vendor
> specific block to the config space so that the guest can notify the
> host that it will do page dirtying, and add a mechanism to indicate
> that all hot-plug events during the warm-up phase of the migration are
> pause events instead of full removals.

Two comments:

1. A vendor specific capability will always be problematic.
Better to register a capability id with pci sig.

2. There are actually several capabilities:

A. support for memory dirtying
if not supported, we must stop device before migration

This is supported by core guest OS code,
using patches similar to posted by you.


B. support for device replacement
This is a faster form of hotplug, where device is removed and
later another device using same driver is inserted in the same slot.

This is a possible optimization, but I am convinced
(A) should be implemented independently of (B).




> I've been poking around in the kernel and QEMU code and the part I
> have been trying to sort out is how to get QEMU based pci-bridge to
> use the SHPC driver because from what I can tell the driver never
> actually gets loaded on the device as it is left in the control of
> ACPI hot-plug.
> 
> - Alex

There are ways, but you can just use pci express, it's easier.

-- 
MST


Re: [Qemu-devel] live migration vs device assignment (motivation)

2015-12-25 Thread Michael S. Tsirkin
On Fri, Dec 25, 2015 at 03:03:47PM +0800, Lan Tianyu wrote:
> Merry Christmas.
> Sorry for the late response due to personal affairs.
> 
> On 2015-12-14 03:30, Alexander Duyck wrote:
> >> > This sounds like we need to add a fake bridge for migration and add a
> >> > driver in the guest for it. It also needs to extend the PCI bus/hotplug
> >> > driver to pause/resume other devices, right?
> >> >
> >> > My concern is still that whether we can change PCI bus/hotplug like that
> >> > without spec change.
> >> >
> >> > IRQ should be general for any devices and we may extend it for
> >> > migration. Device driver also can make decision to support migration
> >> > or not.
> > The device should have no say in the matter.  Either we are going to
> > migrate or we will not.  This is why I have suggested my approach as
> > it allows for the least amount of driver intrusion while providing the
> > maximum number of ways to still perform migration even if the device
> > doesn't support it.
> 
> Even if the device driver doesn't support migration, you still want to
> migrate the VM? That may be risky and we should add the "bad path" for the
> driver at least.
> 
> > 
> > The solution I have proposed is simple:
> > 
> > 1.  Extend swiotlb to allow for a page dirtying functionality.
> > 
> >  This part is pretty straight forward.  I'll submit a few patches
> > later today as RFC that can provided the minimal functionality needed
> > for this.
> 
> Very appreciate to do that.
> 
> > 
> > 2.  Provide a vendor specific configuration space option on the QEMU
> > implementation of a PCI bridge to act as a bridge between direct
> > assigned devices and the host bridge.
> > 
> >  My thought was to add some vendor specific block that includes a
> > capabilities, status, and control register so you could go through and
> > synchronize things like the DMA page dirtying feature.  The bridge
> > itself could manage the migration capable bit inside QEMU for all
> > devices assigned to it.  So if you added a VF to the bridge it would
> > flag that you can support migration in QEMU, while the bridge would
> > indicate you cannot until the DMA page dirtying control bit is set by
> > the guest.
> > 
> >  We could also go through and optimize the DMA page dirtying after
> > this is added so that we can narrow down the scope of use, and as a
> > result improve the performance for other devices that don't need to
> > support migration.  It would then be a matter of adding an interrupt
> > in the device to handle an event such as the DMA page dirtying status
> > bit being set in the config space status register, while the bit is
> > not set in the control register.  If it doesn't get set then we would
> > have to evict the devices before the warm-up phase of the migration,
> > otherwise we can defer it until the end of the warm-up phase.
> > 
> > 3.  Extend existing shpc driver to support the optional "pause"
> > functionality as called out in section 4.1.2 of the Revision 1.1 PCI
> > hot-plug specification.
> 
> Since your solution has added a fake PCI bridge, why not notify the
> bridge directly during migration via an irq and call the device driver's
> callback in the new bridge driver?
> 
> Otherwise, the new bridge driver can also check whether the device
> driver provides migration callbacks and call them to improve the
> passthrough device's performance during migration.

As long as you keep up this vague talk about performance during
migration, without even bothering with any measurements, this patchset
will keep going nowhere.




There's Alex's patch that tracks memory changes during migration.  It
needs some simple enhancements to be useful in production (e.g. add a
host/guest handshake to both enable tracking in the guest and to detect
the support in the host), then it can allow starting migration with an
assigned device, by invoking hot-unplug after most of memory has been
migrated.

Please implement this in qemu and measure the speed.
I will not be surprised if destroying/creating a netdev in linux
turns out to take too long, but until someone bothers to check,
it does not make sense to discuss further enhancements.



> > 
> >  Note I call out "extend" here instead of saying to add this.
> > Basically what we should do is provide a means of quiescing the device
> > without unloading the driver.  This is called out as something the OS
> > vendor can optionally implement in the PCI hot-plug specification.  On
> > OSes that don't support this it would just be treated as a standard
> > hot-plug event.  We could add a capability, status, and control bit
> > in the vendor-specific configuration block for this as well: setting
> > the status bit would indicate the host wants to pause instead of
> > remove, and the control bit would indicate the guest OS supports
> > "pause".  We then could optionally disable guest migration while the
> > VF is present and pause is not supported.
> > 
> >  To support this we would need 

[PULL] vhost: cleanups and fixes

2015-12-20 Thread Michael S. Tsirkin
The following changes since commit 9f9499ae8e6415cefc4fe0a96ad0e27864353c89:

  Linux 4.4-rc5 (2015-12-13 17:42:58 -0800)

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost.git tags/for_linus

for you to fetch changes up to 74a599f09bec7419b2490039f0fb33bc8581ef7c:

  virtio/s390: handle error values in irb (2015-12-17 10:37:33 +0200)


virtio: fixes on top of 4.4-rc5

This includes a single fix for virtio ccw error handling.

Signed-off-by: Michael S. Tsirkin <m...@redhat.com>


Cornelia Huck (1):
  virtio/s390: handle error values in irb

 drivers/s390/virtio/virtio_ccw.c | 62 
 1 file changed, 37 insertions(+), 25 deletions(-)
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Qemu-devel] live migration vs device assignment (motivation)

2015-12-14 Thread Michael S. Tsirkin
On Sun, Dec 13, 2015 at 11:47:44PM +0800, Lan, Tianyu wrote:
> 
> 
> On 12/11/2015 1:16 AM, Alexander Duyck wrote:
> >On Thu, Dec 10, 2015 at 6:38 AM, Lan, Tianyu  wrote:
> >>
> >>
> >>On 12/10/2015 7:41 PM, Dr. David Alan Gilbert wrote:
> 
> Ideally, it would be able to leave the guest driver unmodified, but it
> >requires the hypervisor or qemu to be aware of the device, which means
> >we may need a driver in the hypervisor or qemu to handle the device on
> >behalf of the guest driver.
> >>>
> >>>Can you answer the question of when do you use your code -
> >>> at the start of migration or
> >>> just before the end?
> >>
> >>
> >>Just before stopping the VCPU in this version; we inject a VF mailbox
> >>irq to notify the driver if the irq handler is installed.
> >>The Qemu side will also check this via the faked PCI migration capability,
> >>and the driver will set the status during its open() or resume() callback.
> >
> >The VF mailbox interrupt is a very bad idea.  Really the device should
> >be in a reset state on the other side of a migration.  It doesn't make
> >sense to have the interrupt firing if the device is not configured.
> >This is one of the things that is preventing you from being able to
> >migrate the device while the interface is administratively down or the
> >VF driver is not loaded.
> 
> In my opinion, if the VF driver is not loaded and the hardware hasn't
> started working, the device state doesn't need to be migrated.
> 
> We may add a flag for the driver to check whether migration happened while
> it was down, and reinitialize the hardware and clear the flag when the
> system tries to bring it up.
> 
> We may add a migration core to the Linux kernel and provide some helper
> functions to facilitate adding migration support to drivers.
> The migration core is in charge of syncing status with Qemu.
> 
> Example.
> migration_register()
> Driver provides
> - Callbacks to be called before and after migration, or on the bad path
> - The irq it prefers to use for migration events
> 
> migration_event_check()
> The driver calls this in its irq handler. Migration core code will check
> the migration status and call the callbacks when migration happens.
> 
> 
> >
> >My thought on all this is that it might make sense to move this
> >functionality into a PCI-to-PCI bridge device and make it a
> >requirement that all direct-assigned devices have to exist behind that
> >device in order to support migration.  That way you would be working
> >with a directly emulated device that would likely already be
> >supporting hot-plug anyway.  Then it would just be a matter of coming
> >up with a few Qemu specific extensions that you would need to add to
> >the device itself.  The same approach would likely be portable enough
> >that you could achieve it with PCIe as well via the same configuration
> >space being present on the upstream side of a PCIe port or maybe a
> >PCIe switch of some sort.
> >
> >It would then be possible to signal via your vendor-specific PCI
> >capability on that device that all devices behind this bridge require
> >DMA page dirtying, you could use the configuration in addition to the
> >interrupt already provided for hot-plug to signal things like when you
> >are starting migration, and possibly even just extend the shpc
> >functionality so that if this capability is present you have the
> >option to pause/resume instead of remove/probe the device in the case
> >of certain hot-plug events.  The fact is there may be some use for a
> >pause/resume type approach for PCIe hot-plug in the near future
> >anyway.  From the sounds of it Apple has required it for all
> >Thunderbolt device drivers so that they can halt the device in order
> >to shuffle resources around, perhaps we should look at something
> >similar for Linux.
> >
> >The other advantage behind grouping functions on one bridge is things
> >like reset domains.  The PCI error handling logic will want to be able
> >to reset any devices that experienced an error in the event of
> >something such as a surprise removal.  By grouping all of the devices
> >you could disable/reset/enable them as one logical group in the event
> >of something such as the "bad path" approach Michael has mentioned.
> >
> 
> This sounds like we need to add a fake bridge for migration and a
> driver in the guest for it. It also needs to extend the PCI bus/hotplug
> driver to pause/resume other devices, right?
> 
> My concern is still whether we can change the PCI bus/hotplug like that
> without a spec change.
> 
> The IRQ should be generic for any device, and we may extend it for
> migration. The device driver can also decide whether to support
> migration or not.

A dedicated IRQ per device for something that is a system-wide event
sounds like a waste.  I don't understand why a spec change is strictly
required; we only need to support this with the specific virtual bridge
used by QEMU, so I think that a vendor-specific capability will do.
Once this works well in the field, a PCI spec 

Re: [Qemu-devel] live migration vs device assignment (motivation)

2015-12-14 Thread Michael S. Tsirkin
On Fri, Dec 11, 2015 at 03:32:04PM +0800, Lan, Tianyu wrote:
> 
> 
> On 12/11/2015 12:11 AM, Michael S. Tsirkin wrote:
> >On Thu, Dec 10, 2015 at 10:38:32PM +0800, Lan, Tianyu wrote:
> >>
> >>
> >>On 12/10/2015 7:41 PM, Dr. David Alan Gilbert wrote:
> >>>>Ideally, it would be able to leave the guest driver unmodified, but it
> >>>>>requires the hypervisor or qemu to be aware of the device, which means
> >>>>>we may need a driver in the hypervisor or qemu to handle the device on
> >>>>>behalf of the guest driver.
> >>>Can you answer the question of when do you use your code -
> >>>at the start of migration or
> >>>just before the end?
> >>
> >>Just before stopping the VCPU in this version; we inject a VF mailbox
> >>irq to notify the driver if the irq handler is installed.
> >>The Qemu side will also check this via the faked PCI migration capability,
> >>and the driver will set the status during its open() or resume() callback.
> >
> >Right, this is the "good path" optimization. Whether this buys anything
> >as compared to just sending reset to the device when VCPU is stopped
> >needs to be measured. In any case, we probably do need a way to
> >interrupt driver on destination to make it reconfigure the device -
> >otherwise it might take seconds for it to notice.  And a way to make
> >sure driver can handle this surprise reset so we can block migration if
> >it can't.
> >
> 
> Yes, we need such a way to notify the driver about migration status and do
> the reset or restore operation on the destination machine. My original
> design was to take advantage of the device's irq to do that. The driver
> can tell Qemu which irq it prefers to handle such tasks and whether the
> irq is enabled or bound to a handler. We may discuss the details in the
> other thread.
> 
> >>>
> >>>>>>>It would be great if we could avoid changing the guest; but at least
> >>>>>>>your guest driver changes don't actually seem to be that hardware
> >>>>>>>specific; could your changes actually be moved to the generic PCI
> >>>>>>>level so they could be made to work for lots of drivers?
> >>>>>
> >>>>>It is impossible to use one common solution for all devices unless the
> >>>>>PCIE spec documents it clearly, and I think one day it will be there.
> >>>>>But before that, we need some workarounds in the guest driver to make
> >>>>>it work, even if it looks ugly.
> >>
> >>Yes, so far there is no hardware migration support
> >
> >VT-D supports setting dirty bit in the PTE in hardware.
> 
> Actually, the current hardware doesn't support this.
> The VTD spec documents the dirty bit for first-level translation, which
> requires devices to support DMA requests with a PASID (process
> address space identifier). Most devices don't support the feature.

True, I missed this.  It's generally unfortunate that first level
translation only applies to requests with PASID.  All other features
limited to requests with PASID like nested translation would be very
useful for all requests, not just requests with PASID.


> >
> >>and it's hard to modify
> >>bus level code.
> >
> >Why is it hard?
> 
> As Yang said, the concern is that the PCI spec doesn't document how to do
> migration.

We can submit a PCI spec ECN documenting a new capability.

I think for existing devices which lack it, adding this capability to
the bridge to which the device is attached is preferable to trying to
add it to the device itself.

> >
> >>It also will block implementation on the Windows.
> >
> >Implementation of what?  We are discussing motivation here, not
> >implementation.  E.g. windows drivers typically support surprise
> >removal, should you use that, you get some working code for free.  Just
> >stop worrying about it.  Make it work, worry about closed source
> >software later.
> >
> >>>Dave
> >>>


Re: [PATCH 0/1] virtio/s390: one fix

2015-12-14 Thread Michael S. Tsirkin
On Mon, Dec 14, 2015 at 04:02:33PM +0100, Cornelia Huck wrote:
> On Thu,  3 Dec 2015 17:23:59 +0100
> Cornelia Huck  wrote:
> 
> > Michael,
> > 
> > here's one fix for the virtio-ccw driver.
> > 
> > Patch is against master, as your git branches on mst/vhost.git seem
> > to be quite old. Is there some branch I can base my own branch on,
> > or do you prefer to take patches directly anyway?
> > 
> > Cornelia Huck (1):
> >   virtio/s390: handle error values in irb
> > 
> >  drivers/s390/virtio/virtio_ccw.c | 62 
> > 
> >  1 file changed, 37 insertions(+), 25 deletions(-)
> > 
> 
> Ping?

Thanks, I'll merge this, hope it can still make 4.4.

-- 
MST


Re: [RFC PATCH 3/3] x86: Create dma_mark_dirty to dirty pages used for DMA by VM guest

2015-12-14 Thread Michael S. Tsirkin
On Sun, Dec 13, 2015 at 01:28:31PM -0800, Alexander Duyck wrote:
> This patch is meant to provide the guest with a way of flagging DMA pages
> as being dirty to the host when using a direct-assign device within a
> guest.  The advantage of this approach is that it is fairly simple; however,
> it currently has a significant impact on device performance in all the
> scenarios where it won't be needed.
> 
> As such this is really meant only as a proof of concept and to get the ball
> rolling in terms of figuring out how best to approach the issue of dirty
> page tracking for a guest that is using a direct assigned device.  In
> addition with just this patch it should be possible to modify current
> migration approaches so that instead of having to hot-remove the device
> before starting the migration this can instead be delayed until the period
> before the final stop and copy.
> 
> Signed-off-by: Alexander Duyck 
> ---
>  arch/arm/include/asm/dma-mapping.h   |3 ++-
>  arch/arm64/include/asm/dma-mapping.h |5 ++---
>  arch/ia64/include/asm/dma.h  |1 +
>  arch/mips/include/asm/dma-mapping.h  |1 +
>  arch/powerpc/include/asm/swiotlb.h   |1 +
>  arch/tile/include/asm/dma-mapping.h  |1 +
>  arch/unicore32/include/asm/dma-mapping.h |1 +
>  arch/x86/Kconfig |   11 +++
>  arch/x86/include/asm/swiotlb.h   |   26 ++
>  drivers/xen/swiotlb-xen.c|6 ++
>  lib/swiotlb.c|6 ++
>  11 files changed, 58 insertions(+), 4 deletions(-)
> 
> diff --git a/arch/arm/include/asm/dma-mapping.h 
> b/arch/arm/include/asm/dma-mapping.h
> index ccb3aa64640d..1962d7b471c7 100644
> --- a/arch/arm/include/asm/dma-mapping.h
> +++ b/arch/arm/include/asm/dma-mapping.h
> @@ -167,7 +167,8 @@ static inline bool dma_capable(struct device *dev, 
> dma_addr_t addr, size_t size)
>   return 1;
>  }
>  
> -static inline void dma_mark_clean(void *addr, size_t size) { }
> +static inline void dma_mark_clean(void *addr, size_t size) {}
> +static inline void dma_mark_dirty(void *addr, size_t size) {}
>  
>  extern int arm_dma_set_mask(struct device *dev, u64 dma_mask);
>  
> diff --git a/arch/arm64/include/asm/dma-mapping.h 
> b/arch/arm64/include/asm/dma-mapping.h
> index 61e08f360e31..8d24fe11c8a3 100644
> --- a/arch/arm64/include/asm/dma-mapping.h
> +++ b/arch/arm64/include/asm/dma-mapping.h
> @@ -84,9 +84,8 @@ static inline bool dma_capable(struct device *dev, 
> dma_addr_t addr, size_t size)
>   return addr + size - 1 <= *dev->dma_mask;
>  }
>  
> -static inline void dma_mark_clean(void *addr, size_t size)
> -{
> -}
> +static inline void dma_mark_clean(void *addr, size_t size) {}
> +static inline void dma_mark_dirty(void *addr, size_t size) {}
>  
>  #endif   /* __KERNEL__ */
>  #endif   /* __ASM_DMA_MAPPING_H */
> diff --git a/arch/ia64/include/asm/dma.h b/arch/ia64/include/asm/dma.h
> index 4d97f60f1ef5..d92ebeb2758e 100644
> --- a/arch/ia64/include/asm/dma.h
> +++ b/arch/ia64/include/asm/dma.h
> @@ -20,5 +20,6 @@ extern unsigned long MAX_DMA_ADDRESS;
>  #define free_dma(x)
>  
>  void dma_mark_clean(void *addr, size_t size);
> +static inline void dma_mark_dirty(void *addr, size_t size) {}
>  
>  #endif /* _ASM_IA64_DMA_H */
> diff --git a/arch/mips/include/asm/dma-mapping.h 
> b/arch/mips/include/asm/dma-mapping.h
> index e604f760c4a0..567f6e03e337 100644
> --- a/arch/mips/include/asm/dma-mapping.h
> +++ b/arch/mips/include/asm/dma-mapping.h
> @@ -28,6 +28,7 @@ static inline bool dma_capable(struct device *dev, 
> dma_addr_t addr, size_t size)
>  }
>  
>  static inline void dma_mark_clean(void *addr, size_t size) {}
> +static inline void dma_mark_dirty(void *addr, size_t size) {}
>  
>  #include 
>  
> diff --git a/arch/powerpc/include/asm/swiotlb.h 
> b/arch/powerpc/include/asm/swiotlb.h
> index de99d6e29430..b694e8399e28 100644
> --- a/arch/powerpc/include/asm/swiotlb.h
> +++ b/arch/powerpc/include/asm/swiotlb.h
> @@ -16,6 +16,7 @@
>  extern struct dma_map_ops swiotlb_dma_ops;
>  
>  static inline void dma_mark_clean(void *addr, size_t size) {}
> +static inline void dma_mark_dirty(void *addr, size_t size) {}
>  
>  extern unsigned int ppc_swiotlb_enable;
>  int __init swiotlb_setup_bus_notifier(void);
> diff --git a/arch/tile/include/asm/dma-mapping.h 
> b/arch/tile/include/asm/dma-mapping.h
> index 96ac6cce4a32..79953f09e938 100644
> --- a/arch/tile/include/asm/dma-mapping.h
> +++ b/arch/tile/include/asm/dma-mapping.h
> @@ -58,6 +58,7 @@ static inline phys_addr_t dma_to_phys(struct device *dev, 
> dma_addr_t daddr)
>  }
>  
>  static inline void dma_mark_clean(void *addr, size_t size) {}
> +static inline void dma_mark_dirty(void *addr, size_t size) {}
>  
>  static inline void set_dma_ops(struct device *dev, struct dma_map_ops *ops)
>  {
> diff --git a/arch/unicore32/include/asm/dma-mapping.h 
> 

Re: [RFC PATCH 0/3] x86: Add support for guest DMA dirty page tracking

2015-12-14 Thread Michael S. Tsirkin
On Mon, Dec 14, 2015 at 03:20:26PM +0800, Yang Zhang wrote:
> On 2015/12/14 13:46, Alexander Duyck wrote:
> >On Sun, Dec 13, 2015 at 9:22 PM, Yang Zhang  wrote:
> >>On 2015/12/14 12:54, Alexander Duyck wrote:
> >>>
> >>>On Sun, Dec 13, 2015 at 6:27 PM, Yang Zhang 
> >>>wrote:
> 
> On 2015/12/14 5:28, Alexander Duyck wrote:
> >
> >
> >This patch set is meant to be the guest side code for a proof of concept
> >involving leaving pass-through devices in the guest during the warm-up
> >phase of guest live migration.  In order to accomplish this I have added
> >a
> >new function called dma_mark_dirty that will mark the pages associated
> >with
> >the DMA transaction as dirty in the case of either an unmap or a
> >sync_.*_for_cpu where the DMA direction is either DMA_FROM_DEVICE or
> >DMA_BIDIRECTIONAL.  The pass-through device must still be removed before
> >the stop-and-copy phase, however allowing the device to be present
> >should
> >significantly improve the performance of the guest during the warm-up
> >period.
> >
> >This current implementation is very preliminary and there are number of
> >items still missing.  Specifically in order to make this a more complete
> >solution we need to support:
> >1.  Notifying hypervisor that drivers are dirtying DMA pages received
> >2.  Bypassing page dirtying when it is not needed.
> >
> 
> Shouldn't current log dirty mechanism already cover them?
> >>>
> >>>
> >>>The guest has no way of currently knowing that the hypervisor is doing
> >>>dirty page logging, and the log dirty mechanism currently has no way
> >>>of tracking device DMA accesses.  This change is meant to bridge the
> >>>two so that the guest device driver will force the SWIOTLB DMA API to
> >>>mark pages written to by the device as dirty.
> >>
> >>
> >>OK. This is what we called the "dummy write mechanism". Actually, this is
> >>just a workaround until the iommu dirty bit is ready. Eventually, we need
> >>to change to use the hardware dirty bit. Besides, we may still lose data
> >>if DMA happens during/just before the stop-and-copy phase.
> >
> >Right, this is a "dummy write mechanism" in order to allow for entry
> >tracking.  This only works completely if we force the hardware to
> >quiesce via a hot-plug event before we reach the stop-and-copy phase
> >of the migration.
> >
> >The IOMMU dirty bit approach is likely going to have a significant
> >number of challenges involved.  Looking over the driver and the data
> >sheet it looks like the current implementation is using a form of huge
> >pages in the IOMMU, as such we will need to tear that down and replace
> >it with 4K pages if we don't want to dirty large regions with each DMA
> 
> Yes, we need to split the huge page into small pages to get the small dirty
> range.
> 
> >transaction, and I'm not sure that is something we can change while
> >DMA is active to the affected regions.  In addition the data sheet
> 
> what changes do you mean?
> 
> >references the fact that the page table entries are stored in a
> >translation cache and in order to sync things up you have to
> >invalidate the entries.  I'm not sure what the total overhead would be
> >for invalidating something like a half million 4K pages to migrate a
> >guest with just 2G of RAM, but I would think that might be a bit
> 
> Do you mean the cost of submitting the flush request or the performance
> impact due to IOTLB misses? For the former, we have domain-selective
> invalidation. For the latter, it would be acceptable since live migration
> shouldn't last too long.

That's pretty weak - if migration time is short and speed does not
matter during migration, then all this work is useless, temporarily
switching to a virtual card would be preferable.

> >expensive given the fact that IOMMU accesses aren't known for being
> >incredibly fast when invalidating DMA on the host.
> >
> >- Alex
> >
> 
> 
> -- 
> best regards
> yang


Re: [RFC PATCH 3/3] x86: Create dma_mark_dirty to dirty pages used for DMA by VM guest

2015-12-14 Thread Michael S. Tsirkin
On Mon, Dec 14, 2015 at 08:34:00AM -0800, Alexander Duyck wrote:
> > This way distro can use a guest agent to disable
> > dirtying until before migration starts.
> 
> Right.  For a v2 version I would definitely want to have some way to
> limit the scope of this.  My main reason for putting this out here is
> to start altering the course of discussions since it seems like we
> weren't getting anywhere with the ixgbevf migration changes that were
> being proposed.

Absolutely, thanks for working on this.

> >> + unsigned long pg_addr, start;
> >> +
> >> + start = (unsigned long)addr;
> >> + pg_addr = PAGE_ALIGN(start + size);
> >> + start &= ~(sizeof(atomic_t) - 1);
> >> +
> >> + /* trigger a write fault on each page, excluding first page */
> >> + while ((pg_addr -= PAGE_SIZE) > start)
> >> + atomic_add(0, (atomic_t *)pg_addr);
> >> +
> >> + /* trigger a write fault on first word of DMA */
> >> + atomic_add(0, (atomic_t *)start);
> >
> > start might not be aligned correctly for a cast to atomic_t.
> > It's harmless to do this for any memory, so I think you should
> > just do this for 1st byte of all pages including the first one.
> 
> You may not have noticed it but I actually aligned start in the line
> after pg_addr.

Yes you did. alignof would make it a bit more noticeable.

>  However instead of aligning to the start of the next
> atomic_t I just masked off the lower bits so that we start at the
> DWORD that contains the first byte of the starting address.  The
> assumption here is that I cannot trigger any sort of fault since if I
> have access to a given byte within a DWORD I will have access to the
> entire DWORD.

I'm curious where does this come from.  Isn't it true that access is
controlled at page granularity normally, so you can touch beginning of
page just as well?

>  I coded this up so that the spots where we touch the
> memory should match up with addresses provided by the hardware to
> perform the DMA over the PCI bus.

Yes but there's no requirement to do it like this from
virt POV. You just need to touch each page.

> Also I intentionally ran from highest address to lowest since that way
> we don't risk pushing the first cache line of the DMA buffer out of
> the L1 cache due to the PAGE_SIZE stride.
> 
> - Alex

Interesting. How does order of access help with this?

By the way, if you are into these micro-optimizations you might want to
limit prefetch, to this end you want to access the last line of the
page.  And it's probably worth benchmarking a bit and not doing it all just
based on theory, keep code simple in v1 otherwise.

-- 
MST


Re: [RFC PATCH 3/3] x86: Create dma_mark_dirty to dirty pages used for DMA by VM guest

2015-12-14 Thread Michael S. Tsirkin
On Mon, Dec 14, 2015 at 09:59:13AM -0800, Alexander Duyck wrote:
> On Mon, Dec 14, 2015 at 9:20 AM, Michael S. Tsirkin <m...@redhat.com> wrote:
> > On Mon, Dec 14, 2015 at 08:34:00AM -0800, Alexander Duyck wrote:
> >> > This way distro can use a guest agent to disable
> >> > dirtying until before migration starts.
> >>
> >> Right.  For a v2 version I would definitely want to have some way to
> >> limit the scope of this.  My main reason for putting this out here is
> >> to start altering the course of discussions since it seems like we
> >> weren't getting anywhere with the ixgbevf migration changes that were
> >> being proposed.
> >
> > Absolutely, thanks for working on this.
> >
> >> >> + unsigned long pg_addr, start;
> >> >> +
> >> >> + start = (unsigned long)addr;
> >> >> + pg_addr = PAGE_ALIGN(start + size);
> >> >> + start &= ~(sizeof(atomic_t) - 1);
> >> >> +
> >> >> + /* trigger a write fault on each page, excluding first page */
> >> >> + while ((pg_addr -= PAGE_SIZE) > start)
> >> >> + atomic_add(0, (atomic_t *)pg_addr);
> >> >> +
> >> >> + /* trigger a write fault on first word of DMA */
> >> >> + atomic_add(0, (atomic_t *)start);

Actually, I have second thoughts about using atomic_add here,
especially for _sync.

Many architectures do

#define ATOMIC_OP_RETURN(op, c_op)  \
static inline int atomic_##op##_return(int i, atomic_t *v)  \
{   \
unsigned long flags;\
int ret;\
\
raw_local_irq_save(flags);  \
ret = (v->counter = v->counter c_op i); \
raw_local_irq_restore(flags);   \
\
return ret; \
}

and this is not safe if device is still doing DMA to/from
this memory.

Generally, atomic_t is there for SMP effects, not for sync
with devices.

This is why I said you should do
cmpxchg(pg_addr, 0xdead, 0xdead); 

Yes, we probably never actually want to run m68k within a VM,
but let's not misuse interfaces like this.


> >> >
> >> > start might not be aligned correctly for a cast to atomic_t.
> >> > It's harmless to do this for any memory, so I think you should
> >> > just do this for 1st byte of all pages including the first one.
> >>
> >> You may not have noticed it but I actually aligned start in the line
> >> after pg_addr.
> >
> > Yes you did. alignof would make it a bit more noticeable.
> >
> >>  However instead of aligning to the start of the next
> >> atomic_t I just masked off the lower bits so that we start at the
> >> DWORD that contains the first byte of the starting address.  The
> >> assumption here is that I cannot trigger any sort of fault since if I
> >> have access to a given byte within a DWORD I will have access to the
> >> entire DWORD.
> >
> > I'm curious where does this come from.  Isn't it true that access is
> > controlled at page granularity normally, so you can touch beginning of
> > page just as well?
> 
> Yeah, I am pretty sure it probably is page granularity.  However my
> thought was to try and stick to the start of the DMA as the last
> access.  That way we don't pull in any more cache lines than we need
> to in order to dirty the pages.  Usually the start of the DMA region
> will contain some sort of headers or something that needs to be
> accessed with the highest priority so I wanted to make certain that we
> were forcing usable data into the L1 cache rather than just the first
> cache line of the page where the DMA started.  If however the start of
> a DMA was the start of the page there is nothing there to prevent
> that.

OK, maybe this helps. You should document all these tricks
in code comments.

> >>  I coded this up so that the spots where we touch the
> >> memory should match up with addresses provided by the hardware to
> >> perform the DMA over the PCI bus.
> >
> > Yes but there's no requirement to do it like this from
> > virt POV. You just need to touch each page.
> 
> I know, but at the same time if we mat

Re: live migration vs device assignment (motivation)

2015-12-10 Thread Michael S. Tsirkin
On Thu, Dec 10, 2015 at 11:04:54AM +0800, Lan, Tianyu wrote:
> 
> On 12/10/2015 4:07 AM, Michael S. Tsirkin wrote:
> >On Thu, Dec 10, 2015 at 12:26:25AM +0800, Lan, Tianyu wrote:
> >>On 12/8/2015 12:50 AM, Michael S. Tsirkin wrote:
> >>>I thought about what this is doing at the high level, and I do have some
> >>>value in what you are trying to do, but I also think we need to clarify
> >>>the motivation a bit more.  What you are saying is not really what the
> >>>patches are doing.
> >>>
> >>>And with that clearer understanding of the motivation in mind (assuming
> >>>it actually captures a real need), I would also like to suggest some
> >>>changes.
> >>
> >>Motivation:
> >>Most current solutions for migration with a passthrough device are based
> >>on PCI hotplug, but it has side effects and can't work for all devices.
> >>
> >>For NIC device:
> >>The PCI hotplug solution can work around network device migration
> >>by switching between the VF and PF.
> >
> >This is just more confusion. hotplug is just a way to add and remove
> >devices. switching VF and PF is up to guest and hypervisor.
> 
> This is a combination. Because it's not currently possible to migrate
> device state during migration (which is what we are doing), existing
> solutions for migrating a VM with a passthrough NIC rely on PCI hotplug.

That's where you go wrong I think. This marketing speak about a solution
for migrating VMs with passthrough is just confusing people.

There's no way to do migration with device passthrough on KVM at the
moment, in particular because of lack of way for host to save and
restore device state, and you do not propose a way either.

So how do people migrate? Stop doing device passthrough.
So what I think your patches do is add ability to do the two things
in parallel: stop doing passthrough and start migration.
You still can not migrate with passthrough.

> Unplug the VF
> before starting migration and then switch the network from the VF NIC to
> the PV NIC in order to maintain the network connection.

Again, this is mixing unrelated things.  This switching is not really
related to migration. You can do this at any time for any number of
reasons.  If migration takes a lot of time and if you unplug before
migration, then switching to another interface might make sense.
But it's question of policy.

> Plug the VF again after
> migration and then switch from PV back to VF. The bond driver provides a
> way to switch between the PV and VF NIC automatically with the same IP and
> MAC, so the bond driver is preferred.

Preferred over switching manually? As long as it works well, sure.  But
one can come up with other techniques.  For example, don't switch. Save
ip, mac etc, remove source device and add the destination one.  You were
also complaining that the switch took too long.

> >
> >>But switching network interface will introduce service down time.
> >>
> >>I tested the service downtime by putting the VF and PV interfaces
> >>into a bonded interface and pinging the bonded interface during plug
> >>and unplug of the VF.
> >>1) About 100ms when add VF
> >>2) About 30ms when del VF
> >
> >OK and what's the source of the downtime?
> >I'm guessing that's just arp being repopulated.  So simply save and
> >re-populate it.
> >
> >There would be a much cleaner solution.
> >
> >Or maybe there's a timer there that just delays hotplug
> >for no reason. Fix it, everyone will benefit.
> >
> >>It also requires the guest to do the switch configuration.
> >
> >That's just wrong. if you want a switch, you need to
> >configure a switch.
> 
> I meant the config of switching operation between PV and VF.

I see. So sure, there are many ways to configure networking
on linux. You seem to see this as a downside and so want
to hardcode a single configuration into the driver.

> >
> >>These are hard to
> >>manage and deploy from our customers.
> >
> >So the kernel wants to remain flexible, and the stack is
> >configurable. Downside: customers need to deploy userspace
> >to configure it. Your solution: a hard-coded configuration
> >within kernel and hypervisor.  Sorry, this makes no sense.
> >If kernel is easier for you to deploy than userspace,
> >you need to rethink your deployment strategy.
> 
> This is one factor.
> 
> >
> >>To maintain PV performance during
> >>migration, host side also needs to assign a VF to PV device. This
> >>affects scalability.
> >
> >No idea what this means.
> >
> >>These factors block SRIOV NIC passthrough usage in the cloud service and
> >>OPNFV which

Re: [Qemu-devel] live migration vs device assignment (motivation)

2015-12-10 Thread Michael S. Tsirkin
On Thu, Dec 10, 2015 at 10:38:32PM +0800, Lan, Tianyu wrote:
> 
> 
> On 12/10/2015 7:41 PM, Dr. David Alan Gilbert wrote:
> >>Ideally, it is able to leave the guest driver unmodified, but it requires the
> >>>hypervisor or QEMU to be aware of the device, which means we may need a driver in
> >>>the hypervisor or QEMU to handle the device on behalf of the guest driver.
> >Can you answer the question of when do you use your code -
> >at the start of migration or
> >just before the end?
> 
> Just before stopping VCPU in this version and inject VF mailbox irq to
> notify the driver if the irq handler is installed.
> Qemu side also will check this via the faked PCI migration capability
> and driver will set the status during device open() or resume() callback.

Right, this is the "good path" optimization. Whether this buys anything
as compared to just sending reset to the device when VCPU is stopped
needs to be measured. In any case, we probably do need a way to
interrupt driver on destination to make it reconfigure the device -
otherwise it might take seconds for it to notice.  And a way to make
sure driver can handle this surprise reset so we can block migration if
it can't.

> >
>  >It would be great if we could avoid changing the guest; but at least
>  >your guest driver changes don't actually seem to be that hardware
>  >specific; could your changes actually be moved to the generic PCI level
>  >so they could be made to work for lots of drivers?
> >>>
> >>>It is impossible to use one common solution for all devices unless the PCIE
> >>>spec documents it clearly, and I think one day it will be there. But before
> >>>that, we need some workarounds in the guest driver to make it work even if
> >>>it looks ugly.
> 
> Yes, so far there is not hardware migration support

VT-D supports setting dirty bit in the PTE in hardware.

> and it's hard to modify
> bus level code.

Why is it hard?

> It also will block implementation on the Windows.

Implementation of what?  We are discussing motivation here, not
implementation.  E.g. windows drivers typically support surprise
removal, should you use that, you get some working code for free.  Just
stop worrying about it.  Make it work, worry about closed source
software later.

> >Dave
> >
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH V2 0/3] IXGBE/VFIO: Add live migration support for SRIOV NIC

2015-12-09 Thread Michael S. Tsirkin
On Sat, Dec 05, 2015 at 12:32:00AM +0800, Lan, Tianyu wrote:
> Hi Michael & Alexander:
> Thanks a lot for your comments and suggestions.

It's nice that it's appreciated, but you then go on and ignore
all that I have written here:
https://www.mail-archive.com/kvm@vger.kernel.org/msg123826.html

> We still need to support Windows guest for migration and this is why our
> patches keep all changes in the driver since it's impossible to change
> Windows kernel.

This is not a reasonable argument.  It makes no sense to duplicate code
on Linux because you must duplicate code on Windows.  Let's assume you
must do it in the driver on windows because windows has closed source
drivers.  What does it matter? Linux can still do it as part of DMA API
and have it apply to all drivers.

> Following is my idea to do DMA tracking.
> 
> Inject an event to the VF driver after the memory iterate stage
> and before stopping the VCPU, and then the VF driver marks all
> in-use DMA memory dirty. Newly allocated pages also need to
> be marked dirty before stopping the VCPU. All memory dirtied
> in this time slot will be migrated during the stop-and-copy
> stage. We also need to make sure to disable the VF, by clearing its
> bus master enable bit, before migrating this memory.
> 
> DMA pages allocated by the VF driver also need to reserve space
> for the dummy write.

I suggested ways to do it all in the hypervisor without driver hacks, or
hide it within DMA API without need to reserve extra space. Both
approaches seem much cleaner.

-- 
MST


Re: [RFC PATCH V2 0/3] IXGBE/VFIO: Add live migration support for SRIOV NIC

2015-12-09 Thread Michael S. Tsirkin
On Wed, Dec 09, 2015 at 07:19:15PM +0800, Lan, Tianyu wrote:
> On 12/9/2015 6:37 PM, Michael S. Tsirkin wrote:
> >On Sat, Dec 05, 2015 at 12:32:00AM +0800, Lan, Tianyu wrote:
> >>Hi Michael & Alexander:
> >>Thanks a lot for your comments and suggestions.
> >
> >It's nice that it's appreciated, but you then go on and ignore
> >all that I have written here:
> >https://www.mail-archive.com/kvm@vger.kernel.org/msg123826.html
> >
> 
> No, I will reply to it separately and, per your suggestion, snip it into
> 3 threads.
> 
> >>We still need to support Windows guest for migration and this is why our
> >>patches keep all changes in the driver since it's impossible to change
> >>Windows kernel.
> >
> >This is not a reasonable argument.  It makes no sense to duplicate code
> >on Linux because you must duplicate code on Windows.  Let's assume you
> >must do it in the driver on windows because windows has closed source
> >drivers.  What does it matter? Linux can still do it as part of DMA API
> >and have it apply to all drivers.
> >
> 
> Sure. Duplicated code should be encapsulated and made reusable
> by other drivers. Just like you said about the dummy write part.
> 
> I meant the framework should not require changing Windows kernel code
> (such as the PM core or PCI bus driver), and this will block implementation on
> Windows.

I remember reading that it's possible to implement a bus driver
on windows if required.  But basically I don't see how windows can be
relevant to discussing guest driver patches. That discussion
probably belongs on the qemu mailing list, not on lkml.

> I think it's not problem to duplicate code in the Windows drivers.
> 
> >>Following is my idea to do DMA tracking.
> >>
> >>Inject an event to the VF driver after the memory iterate stage
> >>and before stopping the VCPU, and then the VF driver marks all
> >>in-use DMA memory dirty. Newly allocated pages also need to
> >>be marked dirty before stopping the VCPU. All memory dirtied
> >>in this time slot will be migrated during the stop-and-copy
> >>stage. We also need to make sure to disable the VF, by clearing its
> >>bus master enable bit, before migrating this memory.
> >>
> >>DMA pages allocated by the VF driver also need to reserve space
> >>for the dummy write.
> >
> >I suggested ways to do it all in the hypervisor without driver hacks, or
> >hide it within DMA API without need to reserve extra space. Both
> >approaches seem much cleaner.
> >
> 
> This sounds reasonable. We may discuss it detail in the separate thread.


Re: [PATCH v3 0/4] Add virtio transport for AF_VSOCK

2015-12-09 Thread Michael S. Tsirkin
On Wed, Dec 09, 2015 at 08:03:49PM +0800, Stefan Hajnoczi wrote:
> Note: the virtio-vsock device specification is currently under review but not
> yet finalized.  Please review this code but don't merge until I send an update
> when the spec is finalized.  Thanks!

Yes, this should have RFC in the subject.

> v3:
>  * Remove unnecessary 3-way handshake, just do REQUEST/RESPONSE instead
>of REQUEST/RESPONSE/ACK
>  * Remove SOCK_DGRAM support and focus on SOCK_STREAM first
>(also drop v2 Patch 1, it's only needed for SOCK_DGRAM)
>  * Only allow host->guest connections (same security model as latest
>VMware)
>  * Don't put vhost vsock driver into staging
>  * Add missing Kconfig dependencies (Arnd Bergmann )
>  * Remove unneeded variable used to store return value
>(Fengguang Wu  and Julia Lawall
>)
> 
> v2:
>  * Rebased onto Linux v4.4-rc2
>  * vhost: Refuse to assign reserved CIDs
>  * vhost: Refuse guest CID if already in use
>  * vhost: Only accept correctly addressed packets (no spoofing!)
>  * vhost: Support flexible rx/tx descriptor layout
>  * vhost: Add missing total_tx_buf decrement
>  * virtio_transport: Fix total_tx_buf accounting
>  * virtio_transport: Add virtio_transport global mutex to prevent races
>  * common: Notify other side of SOCK_STREAM disconnect (fixes shutdown
>semantics)
>  * common: Avoid recursive mutex_lock(tx_lock) for write_space (fixes deadlock)
>  * common: Define VIRTIO_VSOCK_TYPE_STREAM/DGRAM hardware interface constants
>  * common: Define VIRTIO_VSOCK_SHUTDOWN_RCV/SEND hardware interface constants
>  * common: Fix peer_buf_alloc inheritance on child socket
> 
> This patch series adds a virtio transport for AF_VSOCK (net/vmw_vsock/).
> AF_VSOCK is designed for communication between virtual machines and
> hypervisors.  It is currently only implemented for VMware's VMCI transport.
> 
> This series implements the proposed virtio-vsock device specification from
> here:
> http://permalink.gmane.org/gmane.comp.emulators.virtio.devel/980
> 
> Most of the work was done by Asias He and Gerd Hoffmann a while back.  I have
> picked up the series again.
> 
> The QEMU userspace changes are here:
> https://github.com/stefanha/qemu/commits/vsock
> 
> Why virtio-vsock?
> -----------------
> Guest<->host communication is currently done over the virtio-serial device.
> This makes it hard to port sockets API-based applications and is limited to
> static ports.
> 
> virtio-vsock uses the sockets API so that applications can rely on familiar
> SOCK_STREAM semantics.  Applications on the host can easily connect to guest
> agents because the sockets API allows multiple connections to a listen socket
> (unlike virtio-serial).  This simplifies the guest<->host communication and
> eliminates the need for extra processes on the host to arbitrate virtio-serial
> ports.
> 
> Overview
> --------
> This series adds 3 pieces:
> 
> 1. virtio_transport_common.ko - core virtio vsock code that uses vsock.ko
> 
> 2. virtio_transport.ko - guest driver
> 
> 3. drivers/vhost/vsock.ko - host driver
> 
> Howto
> -----
> The following kernel options are needed:
>   CONFIG_VSOCKETS=y
>   CONFIG_VIRTIO_VSOCKETS=y
>   CONFIG_VIRTIO_VSOCKETS_COMMON=y
>   CONFIG_VHOST_VSOCK=m
> 
> Launch QEMU as follows:
>   # qemu ... -device vhost-vsock-pci,id=vhost-vsock-pci0,guest-cid=3
> 
> Guest and host can communicate via AF_VSOCK sockets.  The host's CID (address)
> is 2 and the guest must be assigned a CID (3 in the example above).
> 
> Status
> ------
> This patch series implements the latest draft specification.  Please review.
> 
> Asias He (4):
>   VSOCK: Introduce virtio-vsock-common.ko
>   VSOCK: Introduce virtio-vsock.ko
>   VSOCK: Introduce vhost-vsock.ko
>   VSOCK: Add Makefile and Kconfig
> 
>  drivers/vhost/Kconfig   |  10 +
>  drivers/vhost/Makefile  |   4 +
>  drivers/vhost/vsock.c   | 628 +++
>  drivers/vhost/vsock.h   |   4 +
>  include/linux/virtio_vsock.h| 203 
>  include/uapi/linux/virtio_ids.h |   1 +
>  include/uapi/linux/virtio_vsock.h   |  87 
>  net/vmw_vsock/Kconfig   |  18 +
>  net/vmw_vsock/Makefile  |   2 +
>  net/vmw_vsock/virtio_transport.c| 466 +
>  net/vmw_vsock/virtio_transport_common.c | 854 
> 
>  11 files changed, 2277 insertions(+)
>  create mode 100644 drivers/vhost/vsock.c
>  create mode 100644 drivers/vhost/vsock.h
>  create mode 100644 include/linux/virtio_vsock.h
>  create mode 100644 include/uapi/linux/virtio_vsock.h
>  create mode 100644 net/vmw_vsock/virtio_transport.c
>  create mode 100644 net/vmw_vsock/virtio_transport_common.c
> 
> -- 
> 2.5.0

Re: live migration vs device assignment (motivation)

2015-12-09 Thread Michael S. Tsirkin
On Thu, Dec 10, 2015 at 12:26:25AM +0800, Lan, Tianyu wrote:
> On 12/8/2015 12:50 AM, Michael S. Tsirkin wrote:
> >I thought about what this is doing at the high level, and I do see some
> >value in what you are trying to do, but I also think we need to clarify
> >the motivation a bit more.  What you are saying is not really what the
> >patches are doing.
> >
> >And with that clearer understanding of the motivation in mind (assuming
> >it actually captures a real need), I would also like to suggest some
> >changes.
> 
> Motivation:
> Most current solutions for migration with a passthrough device are based on
> PCI hotplug, but it has side effects and can't work for all devices.
> 
> For NIC device:
> PCI hotplug solution can work around Network device migration
> via switching VF and PF.

This is just more confusion. Hotplug is just a way to add and remove
devices. Switching between VF and PF is up to the guest and hypervisor.

> But switching network interface will introduce service down time.
> 
> I tested the service down time via putting VF and PV interface
> into a bonded interface and ping the bonded interface during plug
> and unplug VF.
> 1) About 100ms when add VF
> 2) About 30ms when del VF

OK and what's the source of the downtime?
I'm guessing that's just arp being repopulated.  So simply save and
re-populate it.

There would be a much cleaner solution.

Or maybe there's a timer there that just delays hotplug
for no reason. Fix it, everyone will benefit.

> It also requires guest to do switch configuration.

That's just wrong. If you want a switch, you need to
configure a switch.

> These are hard to
> manage and deploy from our customers.

So the kernel wants to remain flexible, and the stack is
configurable. Downside: customers need to deploy userspace
to configure it. Your solution: a hard-coded configuration
within kernel and hypervisor.  Sorry, this makes no sense.
If kernel is easier for you to deploy than userspace,
you need to rethink your deployment strategy.

> To maintain PV performance during
> migration, host side also needs to assign a VF to PV device. This
> affects scalability.

No idea what this means.

> These factors block SRIOV NIC passthrough usage in the cloud service and
> OPNFV, which require high network performance and stability.

Everyone needs performance and scalability.

> 
> For other kinds of devices, it's hard to make this work.
> We are also adding migration support for the QAT (QuickAssist Technology) device.
> 
> QAT device user case introduction.
> Server, networking, big data, and storage applications use QuickAssist
> Technology to offload servers from handling compute-intensive operations,
> such as:
> 1) Symmetric cryptography functions including cipher operations and
> authentication operations
> 2) Public key functions including RSA, Diffie-Hellman, and elliptic curve
> cryptography
> 3) Compression and decompression functions including DEFLATE and LZS
> 
> PCI hotplug will not work for such devices during migration, and these
> operations will fail when the device is unplugged.
> 
> So we are trying to implement a new solution which really migrates
> device state to the target machine and won't affect the user during migration,
> with low service downtime.

Let's assume for the sake of the argument that there's a lot going on
and removing the device is just too slow (though you should figure out
what's going on before giving up and just building something new from
scratch).

I still don't think you should be migrating state.  That's just too
fragile, and it also means you depend on the driver to be nice and shut down
the device on the source, so you cannot migrate at will.  Instead, reset the
device on the destination and re-initialize it.

-- 
MST


Re: [PATCH] vhost: vsock: select CONFIG_VHOST

2015-12-08 Thread Michael S. Tsirkin
On Tue, Dec 08, 2015 at 04:46:08PM +0100, Arnd Bergmann wrote:
> When building the new vsock code without vhost, we get a build error:
> 
> drivers/built-in.o: In function `vhost_vsock_flush':
> :(.text+0x24d29c): undefined reference to `vhost_poll_flush'
> 
> This adds an explicit 'select' like we have for the other vhost
> drivers.
> 
> Signed-off-by: Arnd Bergmann <a...@arndb.de>

This will need to be done eventually, so

Acked-by: Michael S. Tsirkin <m...@redhat.com>

but I really think the right thing for now is to revert current vsock
code, or disable building it unconditionally.  Stefan, could you please
send a patch like this?

> ---
>  drivers/vhost/Kconfig.vsock | 2 ++
>  1 file changed, 2 insertions(+)
> 
> The patch causing the problem is currently in net-next, so the fix should be
> applied on top of that.
> 
> diff --git a/drivers/vhost/Kconfig.vsock b/drivers/vhost/Kconfig.vsock
> index 3491865d3eb9..bfb9edc4b5d6 100644
> --- a/drivers/vhost/Kconfig.vsock
> +++ b/drivers/vhost/Kconfig.vsock
> @@ -2,6 +2,8 @@ config VHOST_VSOCK
>   tristate "vhost virtio-vsock driver"
>   depends on VSOCKETS && EVENTFD
>   select VIRTIO_VSOCKETS_COMMON
> + select VHOST
> + select VHOST_RING
>   default n
>   ---help---
>   Say M here to enable the vhost-vsock for virtio-vsock guests
> -- 
> 2.1.0.rc2
> 


live migration vs device assignment (was Re: [RFC PATCH V2 00/10] Qemu: Add live migration support for SRIOV NIC)

2015-12-07 Thread Michael S. Tsirkin
On Tue, Nov 24, 2015 at 09:35:17PM +0800, Lan Tianyu wrote:
> This patchset is to propose a solution of adding live migration
> support for SRIOV NIC.

I thought about what this is doing at the high level, and I do see some
value in what you are trying to do, but I also think we need to clarify
the motivation a bit more.  What you are saying is not really what the
patches are doing.

And with that clearer understanding of the motivation in mind (assuming
it actually captures a real need), I would also like to suggest some
changes.

TLDR:
- split this into 3 unrelated efforts/patchsets
- try implementing this host-side only using VT-d dirty tracking
- if making guest changes, make them in a way that makes many devices benefit
- measure speed before trying to improve it

---

First, this does not help to actually do migration with an
active assigned device. Guest needs to deactivate the device
before VM is moved around.

What they are actually able to do, instead, is three things.
My suggestion is to split them up, and work on them
separately.  There's really no need to have them all.

I discuss all 3 things below, but if we do need to have some discussion,
please snip and  let's have separate threads for each item please.


1. Starting live migration with device running.
This might help speed up networking during pre-copy where there is a
long warm-up phase.

Note: To complete migration, one also has to do something to stop
the device, but that's a separate item, since existing hot-unplug
request will do that just as well.


Proposed changes of approach:
One option is to write into the dma memory to make it dirty.  Your
patches do this within the driver, but doing this in the generic dma
unmap code seems more elegant as it will help all devices.  An
interesting note: on unplug, driver unmaps all memory for DMA, so this
works out fine.


Some benchmarking will be needed to show the performance overhead.
It is likely non zero, so an interface would be needed
to enable this tracking before starting migration.


According to the VT-d spec, I note that bit 6 in the PTE is the dirty
bit.  Why don't we use this to detect memory changes by the device?
Specifically, periodically scan pages that we have already
sent, test and clear atomically the dirty bit in the PTE of
the IOMMU, and if set, resend the page.
The interface could be simply an ioctl for VFIO giving
it a range of memory, and have VFIO do the scan and set
bits for userspace.

This might be slower than writing into DMA page,
since e.g. PML does not work here.

We could go for a mixed approach, where we negotiate with the
guest: if guest can write into memory on unmap, then
skip the scanning, otherwise do scanning of IOMMU PTEs
as described above.

I would suggest starting with clean IOMMU PTE polling
on host. If you see that there is a performance problem,
optimize later by enabling the updates within guest
if required.

2.  (Presumably) faster device stop.
After the warmup phase, we need to enter the stop and
copy phase. At that point, device needs to be stopped.
One way to do this is to send request to guest while
we continue to track and send memory changes.
I am not sure whether this is what you are doing,
but I'm assuming it is.

I don't know what you do on the host;
I guess you could send a removal request to the guest, and
keep sending page updates meanwhile.
After guest eject/stop acknowledge is received on the host,
you can enter stop and copy.

Your patches seem to stop device with a custom device specific
register, but using more generic interfaces, such as
e.g. device removal, could also work, even if
it's less optimal.

The way you defined the interfaces, they don't
seem device specific at all.
A new PCI capability ID reserved by the PCI SIG
could be one way to add the new interface
if it's needed.


We also need a way to know what does guest support.
With hotplug we know all modern guests support
it, but with any custom code we need negotiation,
and then fall back on either hot unplug
or blocking migration.

Additionally, hot-unplug will unmap all dma
memory so if all dma unmap callbacks do
a write, you get that memory dirtied for free.

At the moment, device removal destroys state such as IP address and arp
cache, but we could have guest move these around
if necessary. Possibly this can be done in userspace with
the guest agent. We could discuss guest kernel or firmware solutions
if we need to address corner cases such as network boot.

You might run into hotplug behaviour such as
a 5 second timeout until device is actually
detected. It always seemed silly to me.
A simple if (!kvm) in that code might be justified.

The fact that guest cooperation is needed
to complete migration is a big problem IMHO.
This practically means you need to give a lot of
CPU to a guest on an overcommitted host
in order to be able to move it out to another host.
Meanwhile, guest can abuse the extra CPU it got.

Can't surprise removal be emulated instead?

Re: [RFC PATCH V2 0/3] IXGBE/VFIO: Add live migration support for SRIOV NIC

2015-12-07 Thread Michael S. Tsirkin
On Mon, Dec 07, 2015 at 09:12:08AM -0800, Alexander Duyck wrote:
> On Mon, Dec 7, 2015 at 7:40 AM, Lan, Tianyu  wrote:
> > On 12/5/2015 1:07 AM, Alexander Duyck wrote:
> >>>
> >>>
> >>> We still need to support Windows guest for migration and this is why our
> >>> patches keep all changes in the driver since it's impossible to change
> >>> Windows kernel.
> >>
> >>
> >> That is a poor argument.  I highly doubt Microsoft is interested in
> >> having to modify all of the drivers that will support direct assignment
> >> in order to support migration.  They would likely request something
> >> similar to what I have in that they will want a way to do DMA tracking
> >> with minimal modification required to the drivers.
> >
> >
> > This totally depends on the NIC or other device vendors, and they
> > should decide whether to support migration or not. If yes, they would
> > modify the driver.
> 
> Having to modify every driver that wants to support live migration is
> a bit much.  In addition I don't see this being limited only to NIC
> devices.  You can direct assign a number of different devices, your
> solution cannot be specific to NICs.
> 
> > If the target is just to call suspend/resume during migration, the feature will
> > be meaningless. Most cases don't want to affect the user much during migration,
> > and so the service downtime is vital. Our target is to apply
> > SRIOV NIC passthrough to cloud service and NFV (network functions
> > virtualization) projects, which are sensitive to network performance
> > and stability. In my opinion, we should give the device
> > driver a chance to implement its own migration job. Call the suspend and resume
> > callbacks in the driver if it doesn't care about performance during migration.
> 
> The suspend/resume callback should be efficient in terms of time.
> After all we don't want the system to stall for a long period of time
> when it should be either running or asleep.  Having it burn cycles in
> a power state limbo doesn't do anyone any good.  If nothing else maybe
> it will help to push the vendors to speed up those functions which
> then benefit migration and the system sleep states.
> 
> Also you keep assuming you can keep the device running while you do
> the migration and you can't.  You are going to corrupt the memory if
> you do, and you have yet to provide any means to explain how you are
> going to solve that.
> 
> 
> >
> >>
> >>> Following is my idea to do DMA tracking.
> >>>
> >>> Inject an event to the VF driver after the memory iterate stage
> >>> and before stopping the VCPU, and then the VF driver marks all
> >>> in-use DMA memory dirty. Newly allocated pages also need to
> >>> be marked dirty before stopping the VCPU. All memory dirtied
> >>> in this time slot will be migrated during the stop-and-copy
> >>> stage. We also need to make sure to disable the VF, by clearing its
> >>> bus master enable bit, before migrating this memory.
> >>
> >>
> >> The ordering of your explanation here doesn't quite work.  What needs to
> >> happen is that you have to disable DMA and then mark the pages as dirty.
> >>   What the disabling of the BME does is signal to the hypervisor that
> >> the device is now stopped.  The ixgbevf_suspend call already supported
> >> by the driver is almost exactly what is needed to take care of something
> >> like this.
> >
> >
> > This is why I hope to reserve a piece of space in the DMA page to do a dummy
> > write. This can help mark the page dirty without requiring DMA to stop and
> > without racing with DMA data.
> 
> You can't and it will still race.  What concerns me is that your
> patches and the document you referenced earlier show a considerable
> lack of understanding about how DMA and device drivers work.  There is
> a reason why device drivers have so many memory barriers and the like
> in them.  The fact is when you have CPU and a device both accessing
> memory things have to be done in a very specific order and you cannot
> violate that.
> 
> If you have a contiguous block of memory you expect the device to
> write into you cannot just poke a hole in it.  Such a situation is not
> supported by any hardware that I am aware of.
> 
> As far as writing to dirty the pages it only works so long as you halt
> the DMA and then mark the pages dirty.  It has to be in that order.
> Any other order will result in data corruption and I am sure the NFV
> customers definitely don't want that.
> 
> > If we can't do that, we have to stop DMA for a short time to mark all DMA
> > pages dirty and then re-enable it. I am not sure how much we can get by
> > this way to track all DMA memory with the device running during migration. I
> > need to do some tests and compare results with stopping DMA directly at the
> > last stage during migration.
> 
> We have to halt the DMA before we can complete the migration.  So
> please feel free to test this.
> 
> In addition I still feel you would be better off taking this in
> smaller steps.  I still say your first step would be to come up with a
> generic 

Re: [RFC PATCH V2 00/10] Qemu: Add live migration support for SRIOV NIC

2015-12-04 Thread Michael S. Tsirkin
On Fri, Dec 04, 2015 at 02:42:36PM +0800, Lan, Tianyu wrote:
> 
> On 12/2/2015 10:31 PM, Michael S. Tsirkin wrote:
> >>>We hope
> >>>to find a better way to make SRIOV NIC work in these cases and this is
> >>>worth to do since SRIOV NIC provides better network performance compared
> >>>with PV NIC.
> >If this is a performance optimization as the above implies,
> >you need to include some numbers, and document how you implemented
> >the switch and how you measured the performance.
> >
> 
> OK. Some ideas of my patches come from paper "CompSC: Live Migration with
> Pass-through Devices".
> http://www.cl.cam.ac.uk/research/srg/netos/vee_2012/papers/p109.pdf
> 
> It compared performance data between the solution of switching between PV and
> VF, and VF migration. (Chapter 7: Discussion)
> 

I haven't read it, but I would like to note you can't rely on research
papers.  If you propose a patch to be merged you need to measure what is
its actual effect on modern linux at the end of 2015.

> >>>Current patches have some issues. I think we can find
> >>>solutions for them and improve them step by step.


Re: [PATCH] VSOCK: mark virtio_transport.ko experimental

2015-12-04 Thread Michael S. Tsirkin
On Fri, Dec 04, 2015 at 11:49:18AM +0800, Stefan Hajnoczi wrote:
> Be explicit that the virtio_transport.ko code implements a draft virtio
> specification that is still subject to change.
> 
> Signed-off-by: Stefan Hajnoczi 
> ---
> If you'd rather wait until the device specification has been finalized, feel
> free to revert the virtio-vsock code for now.  Apologies for not mentioning
> the status in the Kconfig earlier.
> 
>  net/vmw_vsock/Kconfig | 7 ++-
>  1 file changed, 6 insertions(+), 1 deletion(-)
> 
> diff --git a/net/vmw_vsock/Kconfig b/net/vmw_vsock/Kconfig
> index 74e0bc8..d8be850 100644
> --- a/net/vmw_vsock/Kconfig
> +++ b/net/vmw_vsock/Kconfig
> @@ -28,12 +28,17 @@ config VMWARE_VMCI_VSOCKETS
> will be called vmw_vsock_vmci_transport. If unsure, say N.
>  
>  config VIRTIO_VSOCKETS
> - tristate "virtio transport for Virtual Sockets"
> + tristate "virtio transport for Virtual Sockets (Experimental)"
>   depends on VSOCKETS && VIRTIO
>   select VIRTIO_VSOCKETS_COMMON
> + default n
>   help
> This module implements a virtio transport for Virtual Sockets.
>  
> +   This feature is based on a draft of the virtio-vsock device
> +   specification that is still subject to change.  It can be used
> +   to begin developing applications that use Virtual Sockets.
> +
> Enable this transport if your Virtual Machine runs on Qemu/KVM.
>  
> To compile this driver as a module, choose M here: the module

I'm pretty sure this alone is not enough.  I think depending on an entry
under drivers/staging is necessary. The issue is userspace depending on
the interface, not kernel code itself being unstable.  We can create
drivers/staging/virtio for this purpose, even if it's just to hold the
Kconfig entry.  And I'd rather add STAGING within the Kconfig names too,
so people enabling it don't get a surprise when their userspace stops
working.

But yes, revert would be cleaner and easier than all this temporary
work.

If you agree, could you send a patch to do one of these two things pls?

> -- 
> 2.5.0


Re: [PATCH v2 0/5] Add virtio transport for AF_VSOCK

2015-12-03 Thread Michael S. Tsirkin
On Wed, Dec 02, 2015 at 02:43:58PM +0800, Stefan Hajnoczi wrote:
> v2:
>  * Rebased onto Linux v4.4-rc2
>  * vhost: Refuse to assign reserved CIDs
>  * vhost: Refuse guest CID if already in use
>  * vhost: Only accept correctly addressed packets (no spoofing!)
>  * vhost: Support flexible rx/tx descriptor layout
>  * vhost: Add missing total_tx_buf decrement
>  * virtio_transport: Fix total_tx_buf accounting
>  * virtio_transport: Add virtio_transport global mutex to prevent races
>  * common: Notify other side of SOCK_STREAM disconnect (fixes shutdown
>semantics)
>  * common: Avoid recursive mutex_lock(tx_lock) for write_space (fixes 
> deadlock)
>  * common: Define VIRTIO_VSOCK_TYPE_STREAM/DGRAM hardware interface constants
>  * common: Define VIRTIO_VSOCK_SHUTDOWN_RCV/SEND hardware interface constants
>  * common: Fix peer_buf_alloc inheritance on child socket
> 
> This patch series adds a virtio transport for AF_VSOCK (net/vmw_vsock/).
> AF_VSOCK is designed for communication between virtual machines and
> hypervisors.  It is currently only implemented for VMware's VMCI transport.
> 
> This series implements the proposed virtio-vsock device specification from
> here:
> http://comments.gmane.org/gmane.comp.emulators.virtio.devel/855
> 
> Most of the work was done by Asias He and Gerd Hoffmann a while back.  I have
> picked up the series again.
> 
> The QEMU userspace changes are here:
> https://github.com/stefanha/qemu/commits/vsock
> 
> Why virtio-vsock?
> -----------------
> Guest<->host communication is currently done over the virtio-serial device.
> This makes it hard to port sockets API-based applications and is limited to
> static ports.
> 
> virtio-vsock uses the sockets API so that applications can rely on familiar
> SOCK_STREAM and SOCK_DGRAM semantics.  Applications on the host can easily
> connect to guest agents because the sockets API allows multiple connections to
> a listen socket (unlike virtio-serial).  This simplifies the guest<->host
> communication and eliminates the need for extra processes on the host to
> arbitrate virtio-serial ports.
> 
> Overview
> --------
> This series adds 3 pieces:
> 
> 1. virtio_transport_common.ko - core virtio vsock code that uses vsock.ko
> 
> 2. virtio_transport.ko - guest driver
> 
> 3. drivers/vhost/vsock.ko - host driver
> 
> Howto
> -----
> The following kernel options are needed:
>   CONFIG_VSOCKETS=y
>   CONFIG_VIRTIO_VSOCKETS=y
>   CONFIG_VIRTIO_VSOCKETS_COMMON=y
>   CONFIG_VHOST_VSOCK=m
> 
> Launch QEMU as follows:
>   # qemu ... -device vhost-vsock-pci,id=vhost-vsock-pci0,guest-cid=3
> 
> Guest and host can communicate via AF_VSOCK sockets.  The host's CID (address)
> is 2 and the guest is automatically assigned a CID (use VMADDR_CID_ANY (-1) to
> bind to it).
> 
> Status
> --
> There are a few design changes I'd like to make to the virtio-vsock device:
> 
> 1. The 3-way handshake isn't necessary over a reliable transport (virtqueue).
>Spoofing packets is also impossible so the security aspects of the 3-way
>handshake (including syn cookie) add nothing.  The next version will have a
>single operation to establish a connection.

It's hard to discuss without seeing the details, but we do need to
slow down guests that are flooding the host with socket creation requests.
The handshake is a simple way for the hypervisor to defer
such requests until it has resources, without
breaking things.

> 2. Credit-based flow control doesn't work for SOCK_DGRAM since multiple 
> clients
>can transmit to the same listen socket.  There is no way for the clients to
>coordinate buffer space with each other fairly.  The next version will drop
>credit-based flow control for SOCK_DGRAM and only rely on best-effort
>delivery.  SOCK_STREAM still has guaranteed delivery.

I suspect in the end we will need a measure of fairness even
if you drop packets. And recovering from packet loss is
hard enough that not many applications do it correctly.
I suggest disabling SOCK_DGRAM for now.

> 3. In the next version only the host will be able to establish connections
>(i.e. to connect to a guest agent).  This is for security reasons since
>there is currently no ability to provide host services only to certain
>guests.  This also matches how AF_VSOCK works on modern VMware hypervisors.


I see David merged this one already, but the planned changes above are
visible to userspace and to the hypervisor/guest.

Once this is upstream and userspace/guests start relying on it,
we'll be stuck supporting this version in addition to
whatever we really want, with no easy way to even test it.

Might it not be better to defer enabling this upstream until the interface is
finalized?


> Asias He (5):
>   VSOCK: Introduce vsock_find_unbound_socket and
> vsock_bind_dgram_generic
>   VSOCK: Introduce virtio-vsock-common.ko
>   VSOCK: Introduce virtio-vsock.ko
>   VSOCK: Introduce vhost-vsock.ko
>   VSOCK: Add Makefile and Kconfig
> 
>  drivers/vhost/Kconfig

Re: [RFC PATCH V2 00/10] Qemu: Add live migration support for SRIOV NIC

2015-12-02 Thread Michael S. Tsirkin
On Wed, Dec 02, 2015 at 10:08:25PM +0800, Lan, Tianyu wrote:
> On 12/1/2015 11:02 PM, Michael S. Tsirkin wrote:
> >>But
> >>it requires the guest OS to do specific configuration inside and to rely on
> >>the bonding driver, which prevents it from working on Windows.
> >> From the performance side,
> >>putting the VF and virtio NIC under a bonded interface will affect their
> >>performance even when not migrating. These factors block VF NIC
> >>passthrough in some use cases (especially in the cloud) which require
> >>migration.
> >
> >That's really up to guest. You don't need to do bonding,
> >you can just move the IP and mac from userspace, that's
> >possible on most OS-es.
> >
> >Or write something in guest kernel that is more lightweight if you are
> >so inclined. What we are discussing here is the host-guest interface,
> >not the in-guest interface.
> >
> >>The current solution we proposed changes the NIC driver and QEMU. The
> >>guest OS doesn't need to do anything special for migration.
> >>It's easy to deploy.
> >
> >
> >Except of course these patches don't even work properly yet.
> >
> >And when they do, even minor changes in host side NIC hardware across
> >migration will break guests in hard to predict ways.
> 
> Switching between the PV and VF NIC will introduce a network stop, and the
> latency of hotplugging the VF is measurable.
> For some use cases (cloud service
> and OPNFV) which are sensitive to network stability and performance,
> these are not friendly and they block SRIOV NIC usage.

I find this hard to credit. Hotplug is not normally a data-path
operation.

> We hope
> to find a better way to make the SRIOV NIC work in these cases; this is
> worth doing since the SRIOV NIC provides better network performance compared
> with a PV NIC.

If this is a performance optimization as the above implies,
you need to include some numbers, and document how you
implemented the switch and how you measured the performance.

> Current patches have some issues. I think we can find
> solutions for them and improve them step by step.

-- 
MST


Re: [RFC PATCH V2 0/3] IXGBE/VFIO: Add live migration support for SRIOV NIC

2015-12-02 Thread Michael S. Tsirkin
On Tue, Dec 01, 2015 at 10:36:33AM -0800, Alexander Duyck wrote:
> On Tue, Dec 1, 2015 at 9:37 AM, Michael S. Tsirkin <m...@redhat.com> wrote:
> > On Tue, Dec 01, 2015 at 09:04:32AM -0800, Alexander Duyck wrote:
> >> On Tue, Dec 1, 2015 at 7:28 AM, Michael S. Tsirkin <m...@redhat.com> wrote:
> 
> >> > There are several components to this:
> >> > - dma_map_* needs to prevent page from
> >> >   being migrated while device is running.
> >> >   For example, expose some kind of bitmap from guest
> >> >   to host, set bit there while page is mapped.
> >> >   What happens if we stop the guest and some
> >> >   bits are still set? See dma_alloc_coherent below
> >> >   for some ideas.
> >>
> >> Yeah, I could see something like this working.  Maybe we could do
> >> something like what was done for the NX bit and make use of the upper
> >> order bits beyond the limits of the memory range to mark pages as
> >> non-migratable?
> >>
> >> I'm curious.  What we have with a DMA mapped region is essentially
> >> shared memory between the guest and the device.  How would we resolve
> >> something like this with IVSHMEM, or are we blocked there as well in
> >> terms of migration?
> >
> > I have some ideas. Will post later.
> 
> I look forward to it.
> 
> >> > - dma_unmap_* needs to mark page as dirty
> >> >   This can be done by writing into a page.
> >> >
> >> > - dma_sync_* needs to mark page as dirty
> >> >   This is trickier as we can not change the data.
> >> >   One solution is using atomics.
> >> >   For example:
> >> > int x = ACCESS_ONCE(*p);
> >> > cmpxchg(p, x, x);
> >> >   Seems to do a write without changing page
> >> >   contents.
> >>
> >> Like I said we can probably kill 2 birds with one stone by just
> >> implementing our own dma_mark_clean() for x86 virtualized
> >> environments.
> >>
> >> I'd say we could take your solution one step further and just use 0
> >> instead of bothering to read the value.  After all it won't write the
> >> area if the value at the offset is not 0.
> >
> > Really almost any atomic that has no side effect will do.
> > atomic or with 0
> > atomic and with 
> >
> > It's just that cmpxchg already happens to have a portable
> > wrapper.
> 
> I was originally thinking maybe an atomic_add with 0 would be the way
> to go.

cmpxchg with any value too.

>  Either way though we still are using a locked prefix and
> having to dirty a cache line per page which is going to come at some
> cost.

I agree. It's likely not necessary for everyone
to be doing this: only people that both
run within the VM and want migration to work
need to do this logging.

So set some module option to have the driver tell the hypervisor that it
supports logging.  If bus mastering is enabled before this, migration is
blocked.  Or even pass some flag from the hypervisor so the
driver can detect it needs to log writes.
I guess this could be put in device config somewhere,
though in practice it's a global thing, not a per-device one, so
maybe we need some new channel to
pass this flag to the guest. CPUID?
Or maybe we can put some kind of agent in the initrd
and use the existing guest agent channel after all.
An agent in the initrd could open up a lot of new possibilities.


> >> > - dma_alloc_coherent memory (e.g. device rings)
> >> >   must be migrated after device stopped modifying it.
> >> >   Just stopping the VCPU is not enough:
> >> >   you must make sure device is not changing it.
> >> >
> >> >   Or maybe the device has some kind of ring flush operation,
> >> >   if there was a reasonably portable way to do this
> >> >   (e.g. a flush capability could maybe be added to SRIOV)
> >> >   then hypervisor could do this.
> >>
> >> This is where things start to get messy. I was suggesting the
> >> suspend/resume to resolve this bit, but it might be possible to also
> >> deal with this via something like this via clearing the bus master
> >> enable bit for the VF.  If I am not mistaken that should disable MSI-X
> >> interrupts and halt any DMA.  That should work as long as you have
> >> some mechanism that is tracking the pages in use for DMA.
> >
> > A bigger issue is recovering afterwards.
> 
> Agreed.
> 
> >> >   In case you need to resume on source, you
> >> >   really need to follow the same path
> &

Re: [PATCH net-next 3/3] vhost_net: basic polling support

2015-12-02 Thread Michael S. Tsirkin
On Wed, Dec 02, 2015 at 01:04:03PM +0800, Jason Wang wrote:
> 
> 
> On 12/01/2015 10:43 PM, Michael S. Tsirkin wrote:
> > On Tue, Dec 01, 2015 at 01:17:49PM +0800, Jason Wang wrote:
> >>
> >> On 11/30/2015 06:44 PM, Michael S. Tsirkin wrote:
> >>> On Wed, Nov 25, 2015 at 03:11:29PM +0800, Jason Wang wrote:
> >>>>> This patch tries to poll for new added tx buffer or socket receive
> >>>>> queue for a while at the end of tx/rx processing. The maximum time
> >>>>> spent on polling were specified through a new kind of vring ioctl.
> >>>>>
> >>>>> Signed-off-by: Jason Wang <jasow...@redhat.com>
> >>> One further enhancement would be to actually poll
> >>> the underlying device. This should be reasonably
> >>> straight-forward with macvtap (especially in the
> >>> passthrough mode).
> >>>
> >>>
> >> Yes, it is. I have some patches to do this by replacing
> >> skb_queue_empty() with sk_busy_loop() but for tap.
> > We probably don't want to do this unconditionally, though.
> >
> >> Tests do not show
> >> any improvement, only some regression.
> > Did you add code to call sk_mark_napi_id on tap then?
> > sk_busy_loop won't do anything useful without.
> 
> Yes I did. Probably something wrong elsewhere.

Is this for guest-to-guest? The patch to do NAPI
for tap is still not upstream due to a minor performance
regression.  Want me to repost it?

> >
> >>  Maybe it's better to test macvtap.
> > Same thing ...
> >


Re: [RFC PATCH V2 0/3] IXGBE/VFIO: Add live migration support for SRIOV NIC

2015-12-01 Thread Michael S. Tsirkin
On Tue, Dec 01, 2015 at 09:04:32AM -0800, Alexander Duyck wrote:
> On Tue, Dec 1, 2015 at 7:28 AM, Michael S. Tsirkin <m...@redhat.com> wrote:
> > On Tue, Dec 01, 2015 at 11:04:31PM +0800, Lan, Tianyu wrote:
> >>
> >>
> >> On 12/1/2015 12:07 AM, Alexander Duyck wrote:
> >> >They can only be corrected if the underlying assumptions are correct
> >> >and they aren't.  Your solution would have never worked correctly.
> >> >The problem is you assume you can keep the device running when you are
> >> >migrating and you simply cannot.  At some point you will always have
> >> >to stop the device in order to complete the migration, and you cannot
> >> >stop it before you have stopped your page tracking mechanism.  So
> >> >unless the platform has an IOMMU that is somehow taking part in the
> >> >dirty page tracking you will not be able to stop the guest and then
> >> >the device, it will have to be the device and then the guest.
> >> >
> >> >>>Doing suspend and resume() may help to do migration easily but some
> >> >>>devices requires low service down time. Especially network and I got
> >> >>>that some cloud company promised less than 500ms network service 
> >> >>>downtime.
> >> >Honestly focusing on the downtime is getting the cart ahead of the
> >> >horse.  First you need to be able to do this without corrupting system
> >> >memory and regardless of the state of the device.  You haven't even
> >> >gotten to that state yet.  Last I knew the device had to be up in
> >> >order for your migration to even work.
> >>
> >> I think the issue is that the content of an rx packet delivered to the
> >> stack may be changed during migration, because that piece of memory won't
> >> be migrated to the new machine. This may confuse applications or the
> >> stack. The current dummy-write solution can ensure the packet content
> >> won't change after the dummy write, while the content may not be the
> >> received data if migration happens before that point. We can recheck the
> >> content via a checksum or CRC in the protocol after the dummy write to
> >> ensure the content is what the VF received. I think the stack already
> >> does such checks and the packet will be dropped if it fails them.
> >
> >
> > Most people nowadays rely on hardware checksums so I don't think this can
> > fly.
> 
> Correct.  The checksum/crc approach will not work since it is possible
> for a checksum to even be mangled in the case of some features such as
> LRO or GRO.
> 
> >> Another way is to tell QEMU about all the memory the driver is using and
> >> let QEMU migrate that memory after stopping the VCPU and the device. This
> >> seems safe but the implementation may be complex.
> >
> > Not really 100% safe.  See below.
> >
> > I think hiding these details behind dma_* API does have
> > some appeal. In any case, it gives us a good
> > terminology as it covers what most drivers do.
> 
> That was kind of my thought.  If we were to build our own
> dma_mark_clean() type function that will mark the DMA region dirty on
> sync or unmap then that is half the battle right there as we would be
> able to at least keep the regions consistent after they have left the
> driver.
> 
> > There are several components to this:
> > - dma_map_* needs to prevent page from
> >   being migrated while device is running.
> >   For example, expose some kind of bitmap from guest
> >   to host, set bit there while page is mapped.
> >   What happens if we stop the guest and some
> >   bits are still set? See dma_alloc_coherent below
> >   for some ideas.
> 
> Yeah, I could see something like this working.  Maybe we could do
> something like what was done for the NX bit and make use of the upper
> order bits beyond the limits of the memory range to mark pages as
> non-migratable?
> 
> I'm curious.  What we have with a DMA mapped region is essentially
> shared memory between the guest and the device.  How would we resolve
> something like this with IVSHMEM, or are we blocked there as well in
> terms of migration?

I have some ideas. Will post later.

> > - dma_unmap_* needs to mark page as dirty
> >   This can be done by writing into a page.
> >
> > - dma_sync_* needs to mark page as dirty
> >   This is trickier as we can not change the data.
> >   One solution is using atomics.
> >   For example:
> > int x = ACCESS_ONCE(*p);
> > cmpxchg(p, x, x);

Re: [RFC PATCH V2 0/3] IXGBE/VFIO: Add live migration support for SRIOV NIC

2015-12-01 Thread Michael S. Tsirkin
On Tue, Dec 01, 2015 at 11:04:31PM +0800, Lan, Tianyu wrote:
> 
> 
> On 12/1/2015 12:07 AM, Alexander Duyck wrote:
> >They can only be corrected if the underlying assumptions are correct
> >and they aren't.  Your solution would have never worked correctly.
> >The problem is you assume you can keep the device running when you are
> >migrating and you simply cannot.  At some point you will always have
> >to stop the device in order to complete the migration, and you cannot
> >stop it before you have stopped your page tracking mechanism.  So
> >unless the platform has an IOMMU that is somehow taking part in the
> >dirty page tracking you will not be able to stop the guest and then
> >the device, it will have to be the device and then the guest.
> >
> >>>Doing suspend and resume() may help to do migration easily but some
> >>>devices requires low service down time. Especially network and I got
> >>>that some cloud company promised less than 500ms network service downtime.
> >Honestly focusing on the downtime is getting the cart ahead of the
> >horse.  First you need to be able to do this without corrupting system
> >memory and regardless of the state of the device.  You haven't even
> >gotten to that state yet.  Last I knew the device had to be up in
> >order for your migration to even work.
> 
> I think the issue is that the content of an rx packet delivered to the stack
> may be changed during migration, because that piece of memory won't be
> migrated to the new machine. This may confuse applications or the stack. The
> current dummy-write solution can ensure the packet content won't change
> after the dummy write, while the content may not be the received data if
> migration happens before that point. We can recheck the content via a
> checksum or CRC in the protocol after the dummy write to ensure the content
> is what the VF received. I think the stack already does such checks and the
> packet will be dropped if it fails them.


Most people nowadays rely on hardware checksums so I don't think this can
fly.

> Another way is to tell QEMU about all the memory the driver is using and
> let QEMU migrate that memory after stopping the VCPU and the device. This
> seems safe but the implementation may be complex.

Not really 100% safe.  See below.

I think hiding these details behind dma_* API does have
some appeal. In any case, it gives us a good
terminology as it covers what most drivers do.

There are several components to this:
- dma_map_* needs to prevent page from
  being migrated while device is running.
  For example, expose some kind of bitmap from guest
  to host, set bit there while page is mapped.
  What happens if we stop the guest and some
  bits are still set? See dma_alloc_coherent below
  for some ideas.


- dma_unmap_* needs to mark page as dirty
  This can be done by writing into a page.

- dma_sync_* needs to mark page as dirty
  This is trickier as we can not change the data.
  One solution is using atomics.
  For example:
int x = ACCESS_ONCE(*p);
cmpxchg(p, x, x);
  Seems to do a write without changing page
  contents.

- dma_alloc_coherent memory (e.g. device rings)
  must be migrated after device stopped modifying it.
  Just stopping the VCPU is not enough:
  you must make sure device is not changing it.

  Or maybe the device has some kind of ring flush operation,
  if there was a reasonably portable way to do this
  (e.g. a flush capability could maybe be added to SRIOV)
  then hypervisor could do this.

  With existing devices,
  either do it after device reset, or disable
  memory access in the IOMMU. Maybe both.

  In case you need to resume on source, you
  really need to follow the same path
  as on destination, preferably detecting
  device reset and restoring the device
  state.

  A similar approach could work for dma_map_ above.

-- 
MST


Re: [PATCH net-next 3/3] vhost_net: basic polling support

2015-12-01 Thread Michael S. Tsirkin
On Tue, Dec 01, 2015 at 01:17:49PM +0800, Jason Wang wrote:
> 
> 
> On 11/30/2015 06:44 PM, Michael S. Tsirkin wrote:
> > On Wed, Nov 25, 2015 at 03:11:29PM +0800, Jason Wang wrote:
> >> > This patch tries to poll for new added tx buffer or socket receive
> >> > queue for a while at the end of tx/rx processing. The maximum time
> >> > spent on polling were specified through a new kind of vring ioctl.
> >> > 
> >> > Signed-off-by: Jason Wang <jasow...@redhat.com>
> > One further enhancement would be to actually poll
> > the underlying device. This should be reasonably
> > straight-forward with macvtap (especially in the
> > passthrough mode).
> >
> >
> 
> Yes, it is. I have some patches to do this by replacing
> skb_queue_empty() with sk_busy_loop() but for tap.

We probably don't want to do this unconditionally, though.

> Tests do not show
> any improvement, only some regression.

Did you add code to call sk_mark_napi_id on tap then?
sk_busy_loop won't do anything useful without.

>  Maybe it's better to test macvtap.

Same thing ...

-- 
MST


Re: [RFC PATCH V2 00/10] Qemu: Add live migration support for SRIOV NIC

2015-12-01 Thread Michael S. Tsirkin
On Tue, Dec 01, 2015 at 02:26:57PM +0800, Lan, Tianyu wrote:
> 
> 
> On 11/30/2015 4:01 PM, Michael S. Tsirkin wrote:
> >It is still not very clear what it is you are trying to achieve, and
> >whether your patchset achieves it.  You merely say "adding live
> >migration" but it seems pretty clear this isn't about being able to
> >migrate a guest transparently, since you are adding a host/guest
> >handshake.
> >
> >This isn't about functionality either: I think that on KVM, it isn't
> >hard to live migrate if you can do a host/guest handshake, even today,
> >with no kernel changes:
> >1. before migration, expose a pv nic to guest (can be done directly on
> >   boot)
> >2. use e.g. a serial connection to move IP from an assigned device to pv nic
> >3. maybe move the mac as well
> >4. eject the assigned device
> >5. detect eject on host (QEMU generates a DEVICE_DELETED event when this
> >happens) and start migration
> >
> 
> This looks like the bonding driver solution

Why does it? Unlike bonding, this doesn't touch the data path or
any kernel code. Just run a script from the guest agent.

> which puts the pv nic and VF
> in one bonded interface under active-backup mode. The bonding driver
> will switch from the VF to the PV nic automatically when the VF is unplugged
> during migration. This is the only available solution for VF NIC migration.

It really isn't. For one, there is also teaming.

> But
> it requires the guest OS to do specific configuration inside and to rely on
> the bonding driver, which prevents it from working on Windows.
> From the performance side,
> putting the VF and virtio NIC under a bonded interface will affect their
> performance even when not migrating. These factors block VF NIC
> passthrough in some use cases (especially in the cloud) which require
> migration.

That's really up to the guest. You don't need to do bonding;
you can just move the IP and MAC from userspace, which
is possible on most OSes.

Or write something in guest kernel that is more lightweight if you are
so inclined. What we are discussing here is the host-guest interface,
not the in-guest interface.

> The current solution we proposed changes the NIC driver and QEMU. The
> guest OS doesn't need to do anything special for migration.
> It's easy to deploy.


Except of course these patches don't even work properly yet.

And when they do, even minor changes in host side NIC hardware across
migration will break guests in hard to predict ways.

> and
> all changes are in the NIC driver, NIC vendor can implement migration
> support just in the their driver.

Kernel code and hypervisor code are not easier to develop and deploy than
a userspace script.  If that is all the motivation there is, that's a
pretty small return on investment.

-- 
MST


Re: [PATCH] vhost: replace % with & on data path

2015-11-30 Thread Michael S. Tsirkin
On Mon, Nov 30, 2015 at 10:34:07AM +0200, Michael S. Tsirkin wrote:
> We know vring num is a power of 2, so use &
> to mask the high bits.
> 
> Signed-off-by: Michael S. Tsirkin <m...@redhat.com>
> ---
>  drivers/vhost/vhost.c | 8 +++++---
>  1 file changed, 5 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> index 080422f..85f0f0a 100644
> --- a/drivers/vhost/vhost.c
> +++ b/drivers/vhost/vhost.c
> @@ -1366,10 +1366,12 @@ int vhost_get_vq_desc(struct vhost_virtqueue *vq,
>   /* Only get avail ring entries after they have been exposed by guest. */
>   smp_rmb();
>  
> + }
> +

Oops. This sneaked in from an unrelated patch.
Pls ignore, will repost.

>   /* Grab the next descriptor number they're advertising, and increment
>* the index we've seen. */
>   if (unlikely(__get_user(ring_head,
> - &vq->avail->ring[last_avail_idx % vq->num]))) {
> + &vq->avail->ring[last_avail_idx & (vq->num - 1)]))) {
>   vq_err(vq, "Failed to read head: idx %d address %p\n",
>  last_avail_idx,
>  &vq->avail->ring[last_avail_idx % vq->num]);
> @@ -1489,7 +1491,7 @@ static int __vhost_add_used_n(struct vhost_virtqueue 
> *vq,
>   u16 old, new;
>   int start;
>  
> - start = vq->last_used_idx % vq->num;
> + start = vq->last_used_idx & (vq->num - 1);
>   used = vq->used->ring + start;
>   if (count == 1) {
>   if (__put_user(heads[0].id, &used->id)) {
> @@ -1531,7 +1533,7 @@ int vhost_add_used_n(struct vhost_virtqueue *vq, struct 
> vring_used_elem *heads,
>  {
>   int start, n, r;
>  
> - start = vq->last_used_idx % vq->num;
> + start = vq->last_used_idx & (vq->num - 1);
>   n = vq->num - start;
>   if (n < count) {
>   r = __vhost_add_used_n(vq, heads, n);
> -- 
> MST


Re: [PATCH] vhost: replace % with & on data path

2015-11-30 Thread Michael S. Tsirkin
On Mon, Nov 30, 2015 at 12:42:49AM -0800, Joe Perches wrote:
> On Mon, 2015-11-30 at 10:34 +0200, Michael S. Tsirkin wrote:
> > We know vring num is a power of 2, so use &
> > to mask the high bits.
> []
> > diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> []
> > @@ -1366,10 +1366,12 @@ int vhost_get_vq_desc(struct vhost_virtqueue *vq,
> > /* Only get avail ring entries after they have been exposed by guest. */
> > smp_rmb();
> >  
> > +   }
> 
> ?

Yes, I noticed this - I moved this chunk from the next patch
in my tree by mistake.

Will fix, thanks!

-- 
MST


Re: [PATCH v8 3/5] nvdimm acpi: build ACPI NFIT table

2015-11-30 Thread Michael S. Tsirkin
On Mon, Nov 16, 2015 at 06:51:01PM +0800, Xiao Guangrong wrote:
> NFIT is defined in ACPI 6.0: 5.2.25 NVDIMM Firmware Interface Table (NFIT)
> 
> Currently, we only support PMEM mode. Each device has 3 structures:
> - SPA structure, defines the PMEM region info
> 
> - MEM DEV structure, it has the @handle which is used to associate the
>   specified ACPI NVDIMM device we will introduce in a later patch.
>   Also we can happily ignore the memory device's interleave, since the real
>   nvdimm hardware access is hidden behind the host
> 
> - DCR structure, it defines the vendor ID used to associate the specified
>   vendor's nvdimm driver. Since we only implement PMEM mode this time, the
>   command window and data window are not needed
> 
> The NVDIMM functionality is controlled by the parameter 'nvdimm-support',
> introduced for PIIX4_PM and ICH9-LPC; it is true by default and
> false on 2.4 and earlier versions to keep compatibility

Will need to make it false on 2.5 too.

Isn't there a device that needs to be created for this
to work?  It would be cleaner to just key off
the device presence; then we don't need compat gunk,
and further, people not using it don't get a
bunch of unused AML.


> Signed-off-by: Xiao Guangrong 
> ---
>  default-configs/i386-softmmu.mak   |   1 +
>  default-configs/x86_64-softmmu.mak |   1 +
>  hw/acpi/Makefile.objs  |   1 +
>  hw/acpi/ich9.c |  19 ++
>  hw/acpi/nvdimm.c   | 382 
> +
>  hw/acpi/piix4.c|   4 +
>  hw/i386/acpi-build.c   |   6 +
>  include/hw/acpi/ich9.h |   3 +
>  include/hw/i386/pc.h   |  12 +-
>  include/hw/mem/nvdimm.h|  12 ++
>  10 files changed, 440 insertions(+), 1 deletion(-)
>  create mode 100644 hw/acpi/nvdimm.c
> 
> diff --git a/default-configs/i386-softmmu.mak 
> b/default-configs/i386-softmmu.mak
> index 4c79d3b..53fb517 100644
> --- a/default-configs/i386-softmmu.mak
> +++ b/default-configs/i386-softmmu.mak
> @@ -47,6 +47,7 @@ CONFIG_IOAPIC=y
>  CONFIG_PVPANIC=y
>  CONFIG_MEM_HOTPLUG=y
>  CONFIG_NVDIMM=y
> +CONFIG_ACPI_NVDIMM=y
>  CONFIG_XIO3130=y
>  CONFIG_IOH3420=y
>  CONFIG_I82801B11=y
> diff --git a/default-configs/x86_64-softmmu.mak 
> b/default-configs/x86_64-softmmu.mak
> index e42d2fc..766c27c 100644
> --- a/default-configs/x86_64-softmmu.mak
> +++ b/default-configs/x86_64-softmmu.mak
> @@ -47,6 +47,7 @@ CONFIG_IOAPIC=y
>  CONFIG_PVPANIC=y
>  CONFIG_MEM_HOTPLUG=y
>  CONFIG_NVDIMM=y
> +CONFIG_ACPI_NVDIMM=y
>  CONFIG_XIO3130=y
>  CONFIG_IOH3420=y
>  CONFIG_I82801B11=y
> diff --git a/hw/acpi/Makefile.objs b/hw/acpi/Makefile.objs
> index 7d3230c..095597f 100644
> --- a/hw/acpi/Makefile.objs
> +++ b/hw/acpi/Makefile.objs
> @@ -2,6 +2,7 @@ common-obj-$(CONFIG_ACPI_X86) += core.o piix4.o pcihp.o
>  common-obj-$(CONFIG_ACPI_X86_ICH) += ich9.o tco.o
>  common-obj-$(CONFIG_ACPI_CPU_HOTPLUG) += cpu_hotplug.o
>  common-obj-$(CONFIG_ACPI_MEMORY_HOTPLUG) += memory_hotplug.o
> +common-obj-$(CONFIG_ACPI_NVDIMM) += nvdimm.o
>  common-obj-$(CONFIG_ACPI) += acpi_interface.o
>  common-obj-$(CONFIG_ACPI) += bios-linker-loader.o
>  common-obj-$(CONFIG_ACPI) += aml-build.o
> diff --git a/hw/acpi/ich9.c b/hw/acpi/ich9.c
> index 1c7fcfa..275796f 100644
> --- a/hw/acpi/ich9.c
> +++ b/hw/acpi/ich9.c
> @@ -307,6 +307,20 @@ static void ich9_pm_set_memory_hotplug_support(Object 
> *obj, bool value,
>  s->pm.acpi_memory_hotplug.is_enabled = value;
>  }
>  
> +static bool ich9_pm_get_nvdimm_support(Object *obj, Error **errp)
> +{
> +ICH9LPCState *s = ICH9_LPC_DEVICE(obj);
> +
> +return s->pm.nvdimm_acpi_state.is_enabled;
> +}
> +
> +static void ich9_pm_set_nvdimm_support(Object *obj, bool value, Error **errp)
> +{
> +ICH9LPCState *s = ICH9_LPC_DEVICE(obj);
> +
> +s->pm.nvdimm_acpi_state.is_enabled = value;
> +}
> +
>  static void ich9_pm_get_disable_s3(Object *obj, Visitor *v,
> void *opaque, const char *name,
> Error **errp)
> @@ -404,6 +418,7 @@ void ich9_pm_add_properties(Object *obj, ICH9LPCPMRegs 
> *pm, Error **errp)
>  {
>  static const uint32_t gpe0_len = ICH9_PMIO_GPE0_LEN;
>  pm->acpi_memory_hotplug.is_enabled = true;
> +pm->nvdimm_acpi_state.is_enabled = true;
>  pm->disable_s3 = 0;
>  pm->disable_s4 = 0;
>  pm->s4_val = 2;
> @@ -419,6 +434,10 @@ void ich9_pm_add_properties(Object *obj, ICH9LPCPMRegs 
> *pm, Error **errp)
>   ich9_pm_get_memory_hotplug_support,
>   ich9_pm_set_memory_hotplug_support,
>   NULL);
> +object_property_add_bool(obj, "nvdimm-support",
> + ich9_pm_get_nvdimm_support,
> + ich9_pm_set_nvdimm_support,
> + NULL);
>  object_property_add(obj, ACPI_PM_PROP_S3_DISABLED, "uint8",
>  

Re: [PATCH v8 4/5] nvdimm acpi: build ACPI nvdimm devices

2015-11-30 Thread Michael S. Tsirkin
On Mon, Nov 16, 2015 at 06:51:02PM +0800, Xiao Guangrong wrote:
> NVDIMM devices are defined in ACPI 6.0 9.20 NVDIMM Devices
> 
> There is a root device under \_SB and specified NVDIMM devices are under the
> root device. Each NVDIMM device has _ADR, which returns the handle used to
> associate it with the MEMDEV structure in the NFIT
> 
> Currently, we do not support any function on _DSM, which means NVDIMM
> label data is not supported yet
> 
> Signed-off-by: Xiao Guangrong 
> ---
>  hw/acpi/nvdimm.c | 85 
> 
>  1 file changed, 85 insertions(+)
> 
> diff --git a/hw/acpi/nvdimm.c b/hw/acpi/nvdimm.c
> index 98c004d..abe0daa 100644
> --- a/hw/acpi/nvdimm.c
> +++ b/hw/acpi/nvdimm.c
> @@ -367,6 +367,90 @@ static void nvdimm_build_nfit(GSList *device_list, 
> GArray *table_offsets,
>  g_array_free(structures, true);
>  }
>  
> +static void nvdimm_build_common_dsm(Aml *root_dev)
> +{
> +Aml *method, *ifctx, *function;
> +uint8_t byte_list[1];
> +
> +method = aml_method("NCAL", 4);

This "NCAL" needs a define as it's used
in multiple places. It's really just a DSM
implementation, right? Reflect this in the macro
name.

> +{

What's this doing?

> +function = aml_arg(2);
> +
> +/*
> + * function 0 is called to inquire what functions are supported by
> + * OSPM
> + */
> +ifctx = aml_if(aml_equal(function, aml_int(0)));
> +byte_list[0] = 0 /* No function Supported */;
> +aml_append(ifctx, aml_return(aml_buffer(1, byte_list)));
> +aml_append(method, ifctx);
> +
> +/* No function is supported yet. */
> +byte_list[0] = 1 /* Not Supported */;
> +aml_append(method, aml_return(aml_buffer(1, byte_list)));
> +}
> +aml_append(root_dev, method);
> +}
> +
> +static void nvdimm_build_nvdimm_devices(GSList *device_list, Aml *root_dev)
> +{
> +for (; device_list; device_list = device_list->next) {
> +DeviceState *dev = device_list->data;
> +int slot = object_property_get_int(OBJECT(dev), PC_DIMM_SLOT_PROP,
> +   NULL);
> +uint32_t handle = nvdimm_slot_to_handle(slot);
> +Aml *nvdimm_dev, *method;
> +
> +nvdimm_dev = aml_device("NV%02X", slot);
> +aml_append(nvdimm_dev, aml_name_decl("_ADR", aml_int(handle)));
> +
> +method = aml_method("_DSM", 4);
> +{
> +aml_append(method, aml_return(aml_call4("NCAL", aml_arg(0),
> +   aml_arg(1), aml_arg(2), aml_arg(3))));
> +}
> +aml_append(nvdimm_dev, method);
> +
> +aml_append(root_dev, nvdimm_dev);
> +}
> +}
> +
> +static void nvdimm_build_ssdt(GSList *device_list, GArray *table_offsets,
> +  GArray *table_data, GArray *linker)
> +{
> +Aml *ssdt, *sb_scope, *dev, *method;
> +
> +acpi_add_table(table_offsets, table_data);
> +
> +ssdt = init_aml_allocator();
> +acpi_data_push(ssdt->buf, sizeof(AcpiTableHeader));
> +
> +sb_scope = aml_scope("\\_SB");
> +
> +dev = aml_device("NVDR");
> +aml_append(dev, aml_name_decl("_HID", aml_string("ACPI0012")));

Pls add a comment explaining that ACPI0012 is NVDIMM root device.

Also - this will now appear for all users, e.g.
windows guests will prompt users for a driver.
Not nice if user didn't actually ask for nvdimm.

A simple solution is to default this functionality
to off.

> +
> +nvdimm_build_common_dsm(dev);
> +method = aml_method("_DSM", 4);
> +{
> +aml_append(method, aml_return(aml_call4("NCAL", aml_arg(0),
> +   aml_arg(1), aml_arg(2), aml_arg(3))));
> +}

Some duplication here, move above to a sub-function please.

> +aml_append(dev, method);
> +
> +nvdimm_build_nvdimm_devices(device_list, dev);
> +
> +aml_append(sb_scope, dev);
> +
> +aml_append(ssdt, sb_scope);
> +/* copy AML table into ACPI tables blob and patch header there */
> +g_array_append_vals(table_data, ssdt->buf->data, ssdt->buf->len);
> +build_header(linker, table_data,
> +(void *)(table_data->data + table_data->len - ssdt->buf->len),
> +"SSDT", ssdt->buf->len, 1, "NVDIMM");
> +free_aml_allocator();
> +}
> +
>  void nvdimm_build_acpi(GArray *table_offsets, GArray *table_data,
> GArray *linker)
>  {
> @@ -378,5 +462,6 @@ void nvdimm_build_acpi(GArray *table_offsets, GArray 
> *table_data,
>  return;
>  }
>  nvdimm_build_nfit(device_list, table_offsets, table_data, linker);
> +nvdimm_build_ssdt(device_list, table_offsets, table_data, linker);
>  g_slist_free(device_list);
>  }
> -- 
> 1.8.3.1
> 
> --
> To unsubscribe from this list: send the line "unsubscribe kvm" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH net-next 2/3] vhost: introduce vhost_vq_more_avail()

2015-11-30 Thread Michael S. Tsirkin
On Wed, Nov 25, 2015 at 03:11:28PM +0800, Jason Wang wrote:
> Signed-off-by: Jason Wang 
> ---
>  drivers/vhost/vhost.c | 26 +-
>  drivers/vhost/vhost.h |  1 +
>  2 files changed, 18 insertions(+), 9 deletions(-)
> 
> diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> index 163b365..b86c5aa 100644
> --- a/drivers/vhost/vhost.c
> +++ b/drivers/vhost/vhost.c
> @@ -1633,10 +1633,25 @@ void vhost_add_used_and_signal_n(struct vhost_dev 
> *dev,
>  }
>  EXPORT_SYMBOL_GPL(vhost_add_used_and_signal_n);
>  
> +bool vhost_vq_more_avail(struct vhost_dev *dev, struct vhost_virtqueue *vq)
> +{
> + __virtio16 avail_idx;
> + int r;
> +
> + r = __get_user(avail_idx, &vq->avail->idx);
> + if (r) {
> + vq_err(vq, "Failed to check avail idx at %p: %d\n",
> +&vq->avail->idx, r);
> + return false;

In patch 3 you are calling this under preempt disable,
so this actually can fail and it isn't a VQ error.

> + }
> +
> + return vhost16_to_cpu(vq, avail_idx) != vq->avail_idx;
> +}
> +EXPORT_SYMBOL_GPL(vhost_vq_more_avail);
> +
>  /* OK, now we need to know about added descriptors. */
>  bool vhost_enable_notify(struct vhost_dev *dev, struct vhost_virtqueue *vq)
>  {
> - __virtio16 avail_idx;
>   int r;
>  
>   if (!(vq->used_flags & VRING_USED_F_NO_NOTIFY))
> @@ -1660,14 +1675,7 @@ bool vhost_enable_notify(struct vhost_dev *dev, struct 
> vhost_virtqueue *vq)
>   /* They could have slipped one in as we were doing that: make
>* sure it's written, then check again. */
>   smp_mb();
> - r = __get_user(avail_idx, &vq->avail->idx);
> - if (r) {
> - vq_err(vq, "Failed to check avail idx at %p: %d\n",
> -&vq->avail->idx, r);
> - return false;
> - }
> -
> - return vhost16_to_cpu(vq, avail_idx) != vq->avail_idx;
> + return vhost_vq_more_avail(dev, vq);
>  }
>  EXPORT_SYMBOL_GPL(vhost_enable_notify);
>  

This path does need an error though.
It's probably easier to just leave this call site alone.

> diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
> index 43284ad..2f3c57c 100644
> --- a/drivers/vhost/vhost.h
> +++ b/drivers/vhost/vhost.h
> @@ -159,6 +159,7 @@ void vhost_add_used_and_signal_n(struct vhost_dev *, 
> struct vhost_virtqueue *,
>  struct vring_used_elem *heads, unsigned count);
>  void vhost_signal(struct vhost_dev *, struct vhost_virtqueue *);
>  void vhost_disable_notify(struct vhost_dev *, struct vhost_virtqueue *);
> +bool vhost_vq_more_avail(struct vhost_dev *, struct vhost_virtqueue *);
>  bool vhost_enable_notify(struct vhost_dev *, struct vhost_virtqueue *);
>  
>  int vhost_log_write(struct vhost_virtqueue *vq, struct vhost_log *log,
> -- 
> 2.5.0


Re: [PATCH net-next 0/3] basic busy polling support for vhost_net

2015-11-30 Thread Michael S. Tsirkin
On Sun, Nov 29, 2015 at 10:31:10PM -0500, David Miller wrote:
> From: Jason Wang 
> Date: Wed, 25 Nov 2015 15:11:26 +0800
> 
> > This series tries to add basic busy polling for vhost net. The idea is
> > simple: at the end of tx/rx processing, busy poll for newly added tx
> > descriptors and on the rx receive socket for a while. The maximum amount
> > of time (in us) that can be spent on busy polling is specified via ioctl.
> > 
> > Test A were done through:
> > 
> > - 50 us as busy loop timeout
> > - Netperf 2.6
> > - Two machines with back to back connected ixgbe
> > - Guest with 1 vcpu and 1 queue
> > 
> > Results:
> > - For stream workload, ioexits were reduced dramatically in medium
> >   sizes (1024-2048) of tx (at most -43%) and almost all rx (at most
> >   -84%) as a result of polling. This compensates for the possibly
> >   wasted cpu cycles more or less. That is probably why we can still
> >   see some increase in the normalized throughput in some cases.
> > - Throughput of tx was increased (at most 50%) except for the huge
> >   write (16384). And we can send more packets in that case (+tpkts
> >   were increased).
> > - Very minor rx regression in some cases.
> > - Improvement on TCP_RR (at most 17%).
> 
> Michael are you going to take this?  It's touching vhost core as
> much as it is the vhost_net driver.

There's a minor bug there, but once it's fixed - I agree,
it belongs in the vhost tree.

-- 
MST


Re: [RFC PATCH V2 00/10] Qemu: Add live migration support for SRIOV NIC

2015-11-30 Thread Michael S. Tsirkin
On Tue, Nov 24, 2015 at 09:35:17PM +0800, Lan Tianyu wrote:
> This patchset is to propose a solution of adding live migration
> support for SRIOV NIC.
> 
> During migration, Qemu needs to let VF driver in the VM to know
> migration start and end. Qemu adds faked PCI migration capability
> to help to sync status between two sides during migration.
> 
> Qemu triggers VF's mailbox irq via sending MSIX msg when migration
> status is changed. VF driver tells Qemu its mailbox vector index
> via the new PCI capability. In some cases(NIC is suspended or closed),
> VF mailbox irq is freed and VF driver can disable irq injecting via
> new capability.   
> 
> VF driver will put down nic before migration and put up again on
> the target machine.

It is still not very clear what it is you are trying to achieve, and
whether your patchset achieves it.  You merely say "adding live
migration" but it seems pretty clear this isn't about being able to
migrate a guest transparently, since you are adding a host/guest
handshake.

This isn't about functionality either: I think that on KVM, it isn't
hard to live migrate if you can do a host/guest handshake, even today,
with no kernel changes:
1. before migration, expose a pv nic to guest (can be done directly on
  boot)
2. use e.g. a serial connection to move IP from an assigned device to pv nic
3. maybe move the mac as well
4. eject the assigned device
5. detect eject on host (QEMU generates a DEVICE_DELETED event when this
   happens) and start migration

Is this patchset a performance optimization then?
If yes it needs to be accompanied with some performance numbers.

-- 
MST


Re: [PATCH v8 4/5] nvdimm acpi: build ACPI nvdimm devices

2015-11-30 Thread Michael S. Tsirkin
On Mon, Nov 16, 2015 at 06:51:02PM +0800, Xiao Guangrong wrote:
> NVDIMM devices are defined in ACPI 6.0 9.20 NVDIMM Devices

Forgot to mention:

Pls put spec info in code comments near
relevant functions, not just the log.

> 
> There is a root device under \_SB and specified NVDIMM devices are under the
> root device. Each NVDIMM device has _ADR, which returns the handle used to
> associate it with the MEMDEV structure in the NFIT
> 
> Currently, we do not support any function on _DSM, which means NVDIMM
> label data is not supported yet
> 
> Signed-off-by: Xiao Guangrong 
> ---
>  hw/acpi/nvdimm.c | 85 
> 
>  1 file changed, 85 insertions(+)
> 
> diff --git a/hw/acpi/nvdimm.c b/hw/acpi/nvdimm.c
> index 98c004d..abe0daa 100644
> --- a/hw/acpi/nvdimm.c
> +++ b/hw/acpi/nvdimm.c
> @@ -367,6 +367,90 @@ static void nvdimm_build_nfit(GSList *device_list, 
> GArray *table_offsets,
>  g_array_free(structures, true);
>  }
>  
> +static void nvdimm_build_common_dsm(Aml *root_dev)
> +{
> +Aml *method, *ifctx, *function;
> +uint8_t byte_list[1];
> +
> +method = aml_method("NCAL", 4);
> +{
> +function = aml_arg(2);
> +
> +/*
> + * function 0 is called to inquire what functions are supported by
> + * OSPM
> + */
> +ifctx = aml_if(aml_equal(function, aml_int(0)));
> +byte_list[0] = 0 /* No function Supported */;
> +aml_append(ifctx, aml_return(aml_buffer(1, byte_list)));
> +aml_append(method, ifctx);
> +
> +/* No function is supported yet. */
> +byte_list[0] = 1 /* Not Supported */;
> +aml_append(method, aml_return(aml_buffer(1, byte_list)));
> +}
> +aml_append(root_dev, method);
> +}
> +
> +static void nvdimm_build_nvdimm_devices(GSList *device_list, Aml *root_dev)
> +{
> +for (; device_list; device_list = device_list->next) {
> +DeviceState *dev = device_list->data;
> +int slot = object_property_get_int(OBJECT(dev), PC_DIMM_SLOT_PROP,
> +   NULL);
> +uint32_t handle = nvdimm_slot_to_handle(slot);
> +Aml *nvdimm_dev, *method;
> +
> +nvdimm_dev = aml_device("NV%02X", slot);
> +aml_append(nvdimm_dev, aml_name_decl("_ADR", aml_int(handle)));
> +
> +method = aml_method("_DSM", 4);
> +{
> +aml_append(method, aml_return(aml_call4("NCAL", aml_arg(0),
> +   aml_arg(1), aml_arg(2), aml_arg(3))));
> +}
> +aml_append(nvdimm_dev, method);
> +
> +aml_append(root_dev, nvdimm_dev);
> +}
> +}
> +
> +static void nvdimm_build_ssdt(GSList *device_list, GArray *table_offsets,
> +  GArray *table_data, GArray *linker)
> +{
> +Aml *ssdt, *sb_scope, *dev, *method;

So why don't we skip this completely if device list is empty?

> +
> +acpi_add_table(table_offsets, table_data);
> +
> +ssdt = init_aml_allocator();
> +acpi_data_push(ssdt->buf, sizeof(AcpiTableHeader));
> +
> +sb_scope = aml_scope("\\_SB");
> +
> +dev = aml_device("NVDR");
> +aml_append(dev, aml_name_decl("_HID", aml_string("ACPI0012")));
> +
> +nvdimm_build_common_dsm(dev);
> +method = aml_method("_DSM", 4);
> +{
> +aml_append(method, aml_return(aml_call4("NCAL", aml_arg(0),
> +   aml_arg(1), aml_arg(2), aml_arg(3))));
> +}
> +aml_append(dev, method);
> +
> +nvdimm_build_nvdimm_devices(device_list, dev);
> +
> +aml_append(sb_scope, dev);
> +
> +aml_append(ssdt, sb_scope);
> +/* copy AML table into ACPI tables blob and patch header there */
> +g_array_append_vals(table_data, ssdt->buf->data, ssdt->buf->len);
> +build_header(linker, table_data,
> +(void *)(table_data->data + table_data->len - ssdt->buf->len),
> +"SSDT", ssdt->buf->len, 1, "NVDIMM");
> +free_aml_allocator();
> +}
> +
>  void nvdimm_build_acpi(GArray *table_offsets, GArray *table_data,
> GArray *linker)
>  {
> @@ -378,5 +462,6 @@ void nvdimm_build_acpi(GArray *table_offsets, GArray 
> *table_data,
>  return;
>  }
>  nvdimm_build_nfit(device_list, table_offsets, table_data, linker);
> +nvdimm_build_ssdt(device_list, table_offsets, table_data, linker);
>  g_slist_free(device_list);
>  }
> -- 
> 1.8.3.1
> 


Re: [PATCH net-next 3/3] vhost_net: basic polling support

2015-11-30 Thread Michael S. Tsirkin
On Wed, Nov 25, 2015 at 03:11:29PM +0800, Jason Wang wrote:
> This patch tries to poll for newly added tx buffers or the socket receive
> queue for a while at the end of tx/rx processing. The maximum time
> spent on polling is specified through a new kind of vring ioctl.
> 
> Signed-off-by: Jason Wang 

One further enhancement would be to actually poll
the underlying device. This should be reasonably
straight-forward with macvtap (especially in the
passthrough mode).


> ---
>  drivers/vhost/net.c| 72 
> ++
>  drivers/vhost/vhost.c  | 15 ++
>  drivers/vhost/vhost.h  |  1 +
>  include/uapi/linux/vhost.h | 11 +++
>  4 files changed, 94 insertions(+), 5 deletions(-)
> 
> diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
> index 9eda69e..ce6da77 100644
> --- a/drivers/vhost/net.c
> +++ b/drivers/vhost/net.c
> @@ -287,6 +287,41 @@ static void vhost_zerocopy_callback(struct ubuf_info 
> *ubuf, bool success)
>   rcu_read_unlock_bh();
>  }
>  
> +static inline unsigned long busy_clock(void)
> +{
> + return local_clock() >> 10;
> +}
> +
> +static bool vhost_can_busy_poll(struct vhost_dev *dev,
> + unsigned long endtime)
> +{
> + return likely(!need_resched()) &&
> +likely(!time_after(busy_clock(), endtime)) &&
> +likely(!signal_pending(current)) &&
> +!vhost_has_work(dev) &&
> +single_task_running();
> +}
> +
> +static int vhost_net_tx_get_vq_desc(struct vhost_net *net,
> + struct vhost_virtqueue *vq,
> + struct iovec iov[], unsigned int iov_size,
> + unsigned int *out_num, unsigned int *in_num)
> +{
> + unsigned long uninitialized_var(endtime);
> +
> + if (vq->busyloop_timeout) {
> + preempt_disable();
> + endtime = busy_clock() + vq->busyloop_timeout;
> + while (vhost_can_busy_poll(vq->dev, endtime) &&
> +!vhost_vq_more_avail(vq->dev, vq))
> + cpu_relax();
> + preempt_enable();
> + }
> +
> + return vhost_get_vq_desc(vq, vq->iov, ARRAY_SIZE(vq->iov),
> +  out_num, in_num, NULL, NULL);
> +}
> +
>  /* Expects to be always run from workqueue - which acts as
>   * read-size critical section for our kind of RCU. */
>  static void handle_tx(struct vhost_net *net)
> @@ -331,10 +366,9 @@ static void handle_tx(struct vhost_net *net)
> % UIO_MAXIOV == nvq->done_idx))
>   break;
>  
> - head = vhost_get_vq_desc(vq, vq->iov,
> -  ARRAY_SIZE(vq->iov),
> -  &out, &in,
> -  NULL, NULL);
> + head = vhost_net_tx_get_vq_desc(net, vq, vq->iov,
> + ARRAY_SIZE(vq->iov),
> + &out, &in);
>   /* On error, stop handling until the next kick. */
>   if (unlikely(head < 0))
>   break;
> @@ -435,6 +469,34 @@ static int peek_head_len(struct sock *sk)
>   return len;
>  }
>  
> +static int vhost_net_peek_head_len(struct vhost_net *net, struct sock *sk)
> +{
> + struct vhost_net_virtqueue *nvq = &net->vqs[VHOST_NET_VQ_TX];
> + struct vhost_virtqueue *vq = &nvq->vq;
> + unsigned long uninitialized_var(endtime);
> +
> + if (vq->busyloop_timeout) {
> + mutex_lock(&vq->mutex);
> + vhost_disable_notify(&net->dev, vq);
> +
> + preempt_disable();
> + endtime = busy_clock() + vq->busyloop_timeout;
> +
> + while (vhost_can_busy_poll(&net->dev, endtime) &&
> +skb_queue_empty(&sk->sk_receive_queue) &&
> +!vhost_vq_more_avail(&net->dev, vq))
> + cpu_relax();
> +
> + preempt_enable();
> +
> + if (vhost_enable_notify(&net->dev, vq))
> + vhost_poll_queue(&vq->poll);
> + mutex_unlock(&vq->mutex);
> + }
> +
> + return peek_head_len(sk);
> +}
> +
>  /* This is a multi-buffer version of vhost_get_desc, that works if
>   *   vq has read descriptors only.
>   * @vq   - the relevant virtqueue
> @@ -553,7 +615,7 @@ static void handle_rx(struct vhost_net *net)
>   vq->log : NULL;
>   mergeable = vhost_has_feature(vq, VIRTIO_NET_F_MRG_RXBUF);
>  
> - while ((sock_len = peek_head_len(sock->sk))) {
> + while ((sock_len = vhost_net_peek_head_len(net, sock->sk))) {
>   sock_len += sock_hlen;
>   vhost_len = sock_len + vhost_hlen;
>   headcount = get_rx_bufs(vq, vq->heads, vhost_len,
> diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> index b86c5aa..857af6c 100644
> --- a/drivers/vhost/vhost.c
> +++ b/drivers/vhost/vhost.c
> @@ -285,6 +285,7 @@ 

Re: [PATCH v8 0/5] implement vNVDIMM

2015-11-30 Thread Michael S. Tsirkin
On Mon, Nov 16, 2015 at 06:50:58PM +0800, Xiao Guangrong wrote:
> This patchset can be found at:
>   https://github.com/xiaogr/qemu.git nvdimm-v8
> 
> It is based on pci branch on Michael's tree and the top commit is:
> commit e3a4e177d9 (migration/ram: fix build on 32 bit hosts).
> 
> Changelog in v8:
> We split the long patch series into small parts; as you see now, this
> is the first part, which enables NVDIMM without label data support.

Finally found some time to review this.  Very nice, this is making good
progress, and I think to split it like this is a great idea.  I sent
some comments, most of them minor.

Thanks!

> The command line has been changed because some patches simplifying the
> things have not been included into this series, you should specify the
> file size exactly using the parameters as follows:
>memory-backend-file,id=mem1,share,mem-path=/tmp/nvdimm1,size=10G \
>-device nvdimm,memdev=mem1,id=nv1
> 
> Changelog in v7:
> - changes from Vladimir Sementsov-Ogievskiy's comments:
>   1) let gethugepagesize() report failure if fstat fails instead of
>  falling back to the normal page size
>   2) rename  open_file_path to open_ram_file_path
>   3) better log the error message by using error_setg_errno
>   4) update commit in the commit log to explain hugepage detection on
>  Windows
> 
> - changes from Eduardo Habkost's comments:
>   1) use 'Error**' to collect error message for qemu_file_get_page_size()
>   2) move gethugepagesize() replacement to the same patch to make it
>  better for review
>   3) introduce qemu_get_file_size to unity the code with raw_getlength()
> 
> - changes from Stefan's comments:
>   1) check the memory region is large enough to contain DSM output
>  buffer
> 
> - changes from Eric Blake's comments:
>   1) update the shell command in the commit log to generate the patch
>  which drops 'pc-dimm' prefix
>   
> - others:
>   pick up Reviewed-by from Stefan, Vladimir Sementsov-Ogievskiy, and
>   Eric Blake.
> 
> Changelog in v6:
> - changes from Stefan's comments:
>   1) fix code style of struct naming by CamelCase way
>   2) fix offset + length overflow when read/write label data
>   3) compile hw/acpi/nvdimm.c for per target so that TARGET_PAGE_SIZE can
>  be used to replace getpagesize()
> 
> Changelog in v5:
> - changes from Michael's comments:
>   1) prefix nvdimm_ to everything in NVDIMM source files
>   2) make parsing _DSM Arg3 more clear
>   3) comment style fix
>   5) drop single used definition
>   6) fix dirty dsm buffer lost due to memory write happened on host
>   7) check dsm buffer if it is big enough to contain input data
>   8) use build_append_int_noprefix to store single value to GArray
> 
> - changes from Michael's and Igor's comments:
>   1) introduce 'nvdimm-support' parameter to control nvdimm
>  enablement and it is disabled for 2.4 and its earlier versions
>  to make live migration compatible
>   2) only reserve 1 RAM page and 4 bytes IO Port for NVDIMM ACPI
>  virtualization
> 
> - changes from Stefan's comments:
>   1) do endian adjustment for the buffer length
> 
> - changes from Bharata B Rao's comments:
>   1) fix compile on ppc
> 
> - others:
>   1) the buffer length is directly got from IO read rather than got
>  from dsm memory
>   2) fix dirty label data lost due to memory write happened on host
> 
> Changelog in v4:
> - changes from Michael's comments:
>   1) show the message, "Memory is not allocated from HugeTlbfs", if file
>  based memory is not allocated from hugetlbfs.
>   2) introduce function, acpi_get_nvdimm_state(), to get NVDIMMState
>  from Machine.
>   3) statically define UUID and make its operation more clear
>   4) use GArray to build device structures to avoid potential buffer
>  overflow
>   4) improve comments in the code
>   5) improve code style
> 
> - changes from Igor's comments:
>   1) add NVDIMM ACPI spec document
>   2) use serialized method to avoid Mutex
>   3) move NVDIMM ACPI's code to hw/acpi/nvdimm.c
>   4) introduce a common ASL method used by _DSM for all devices to reduce
>  ACPI size
>   5) handle UUID in ACPI AML code. BTW, i'd keep handling revision in QEMU
>  it's better to upgrade QEMU to support Rev2 in the future
> 
> - changes from Stefan's comments:
>   1) copy input data from DSM memory to local buffer to avoid potential
>  issues as DSM memory is visible to guest. Output data is handled
>  in a similar way
> 
> - changes from Dan's comments:
>   1) drop static namespace as Linux has already supported label-less
>  nvdimm devices
> 
> - changes from Vladimir's comments:
>   1) print a better message, "failed to get file size for %s, can't create
>  backend on it", if any file operation failed to obtain the file size
> 
> - others:
>   create a git repo on github.com for better review/test
> 
> Also, thanks for Eric Blake's review on QAPI's side.
> 
> Thank all of you to review this patchset.
> 
> Changelog in v3:
> There 

[PATCH] vhost: replace % with & on data path

2015-11-30 Thread Michael S. Tsirkin
We know vring num is a power of 2, so use &
to mask the high bits.

Signed-off-by: Michael S. Tsirkin <m...@redhat.com>
---
 drivers/vhost/vhost.c | 8 +---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 080422f..85f0f0a 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -1366,10 +1366,12 @@ int vhost_get_vq_desc(struct vhost_virtqueue *vq,
/* Only get avail ring entries after they have been exposed by guest. */
smp_rmb();
 
+   }
+
/* Grab the next descriptor number they're advertising, and increment
 * the index we've seen. */
	if (unlikely(__get_user(ring_head,
-	   &vq->avail->ring[last_avail_idx % vq->num]))) {
+	   &vq->avail->ring[last_avail_idx & (vq->num - 1)]))) {
		vq_err(vq, "Failed to read head: idx %d address %p\n",
		   last_avail_idx,
		   &vq->avail->ring[last_avail_idx % vq->num]);
@@ -1489,7 +1491,7 @@ static int __vhost_add_used_n(struct vhost_virtqueue *vq,
u16 old, new;
int start;
 
-   start = vq->last_used_idx % vq->num;
+   start = vq->last_used_idx & (vq->num - 1);
used = vq->used->ring + start;
if (count == 1) {
	if (__put_user(heads[0].id, &used->id)) {
@@ -1531,7 +1533,7 @@ int vhost_add_used_n(struct vhost_virtqueue *vq, struct 
vring_used_elem *heads,
 {
int start, n, r;
 
-   start = vq->last_used_idx % vq->num;
+   start = vq->last_used_idx & (vq->num - 1);
n = vq->num - start;
if (n < count) {
r = __vhost_add_used_n(vq, heads, n);
-- 
MST


[PATCH v2] vhost: replace % with & on data path

2015-11-30 Thread Michael S. Tsirkin
We know vring num is a power of 2, so use &
to mask the high bits.

Signed-off-by: Michael S. Tsirkin <m...@redhat.com>
---

Changes from v1: drop an unrelated chunk

 drivers/vhost/vhost.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 080422f..ad2146a 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -1369,7 +1369,7 @@ int vhost_get_vq_desc(struct vhost_virtqueue *vq,
/* Grab the next descriptor number they're advertising, and increment
 * the index we've seen. */
	if (unlikely(__get_user(ring_head,
-	   &vq->avail->ring[last_avail_idx % vq->num]))) {
+	   &vq->avail->ring[last_avail_idx & (vq->num - 1)]))) {
		vq_err(vq, "Failed to read head: idx %d address %p\n",
		   last_avail_idx,
		   &vq->avail->ring[last_avail_idx % vq->num]);
@@ -1489,7 +1489,7 @@ static int __vhost_add_used_n(struct vhost_virtqueue *vq,
u16 old, new;
int start;
 
-   start = vq->last_used_idx % vq->num;
+   start = vq->last_used_idx & (vq->num - 1);
used = vq->used->ring + start;
if (count == 1) {
	if (__put_user(heads[0].id, &used->id)) {
@@ -1531,7 +1531,7 @@ int vhost_add_used_n(struct vhost_virtqueue *vq, struct 
vring_used_elem *heads,
 {
int start, n, r;
 
-   start = vq->last_used_idx % vq->num;
+   start = vq->last_used_idx & (vq->num - 1);
n = vq->num - start;
if (n < count) {
r = __vhost_add_used_n(vq, heads, n);
-- 
MST


[PATCH RFC v2] virtio: skip avail/used index reads

2015-11-30 Thread Michael S. Tsirkin
This adds a new vring feature bit: when enabled, host and guest poll the
available/used ring directly instead of looking at the index field
first.

To guarantee it is possible to detect updates, the high bits (above
vring.num - 1) in the ring head ID value are modified to match the index
bits - these change on each wrap-around.  Writer also XORs this with
0x8000 such that rings can be zero-initialized.

Reader is modified to ignore these high bits when looking
up descriptors.

The point is to reduce the number of cacheline misses
for both reads and writes.

I see a performance improvement of about 20% on multithreaded benchmarks
(e.g. virtio-test), but regression of about 2% on vring_bench.
I think this has to do with the fact that complete_multi_user
is implemented suboptimally.

TODO:
investigate single-threaded regression
look at more aggressive ring layout changes
better name for a feature flag
split the patch to make it easier to review

This is on top of the following patches in my tree:
virtio_ring: Shadow available ring flags & index
vhost: replace % with & on data path
tools/virtio: fix byteswap logic
tools/virtio: move list macro stubs

Signed-off-by: Michael S. Tsirkin <m...@redhat.com>
---

Changes from v1:
add a missing chunk in vhost_get_vq_desc

 drivers/vhost/vhost.h|   3 +-
 include/linux/vringh.h   |   3 +
 include/uapi/linux/virtio_ring.h |   3 +
 drivers/vhost/vhost.c| 104 ++
 drivers/vhost/vringh.c   | 153 +--
 drivers/virtio/virtio_ring.c |  40 --
 tools/virtio/virtio_test.c   |  14 +++-
 7 files changed, 256 insertions(+), 64 deletions(-)

diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
index d3f7674..aeeb15d 100644
--- a/drivers/vhost/vhost.h
+++ b/drivers/vhost/vhost.h
@@ -175,7 +175,8 @@ enum {
 (1ULL << VIRTIO_RING_F_EVENT_IDX) |
 (1ULL << VHOST_F_LOG_ALL) |
 (1ULL << VIRTIO_F_ANY_LAYOUT) |
-(1ULL << VIRTIO_F_VERSION_1)
+(1ULL << VIRTIO_F_VERSION_1) |
+(1ULL << VIRTIO_RING_F_POLL)
 };
 
 static inline bool vhost_has_feature(struct vhost_virtqueue *vq, int bit)
diff --git a/include/linux/vringh.h b/include/linux/vringh.h
index bc6c28d..13a9e3e 100644
--- a/include/linux/vringh.h
+++ b/include/linux/vringh.h
@@ -40,6 +40,9 @@ struct vringh {
/* Can we get away with weak barriers? */
bool weak_barriers;
 
+   /* Poll ring directly */
+   bool poll;
+
/* Last available index we saw (ie. where we're up to). */
u16 last_avail_idx;
 
diff --git a/include/uapi/linux/virtio_ring.h b/include/uapi/linux/virtio_ring.h
index c072959..bf3ca1d 100644
--- a/include/uapi/linux/virtio_ring.h
+++ b/include/uapi/linux/virtio_ring.h
@@ -62,6 +62,9 @@
  * at the end of the used ring. Guest should ignore the used->flags field. */
 #define VIRTIO_RING_F_EVENT_IDX29
 
+/* Support ring polling */
+#define VIRTIO_RING_F_POLL 33
+
 /* Virtio ring descriptors: 16 bytes.  These can chain together via "next". */
 struct vring_desc {
/* Address (guest-physical). */
diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index ad2146a..cdbabf5 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -1346,25 +1346,27 @@ int vhost_get_vq_desc(struct vhost_virtqueue *vq,
 
/* Check it isn't doing very strange things with descriptor numbers. */
last_avail_idx = vq->last_avail_idx;
-   if (unlikely(__get_user(avail_idx, &vq->avail->idx))) {
-   vq_err(vq, "Failed to access avail idx at %p\n",
-  &vq->avail->idx);
-   return -EFAULT;
-   }
-   vq->avail_idx = vhost16_to_cpu(vq, avail_idx);
+   if (!vhost_has_feature(vq, VIRTIO_RING_F_POLL)) {
+   if (unlikely(__get_user(avail_idx, &vq->avail->idx))) {
+   vq_err(vq, "Failed to access avail idx at %p\n",
+  &vq->avail->idx);
+   return -EFAULT;
+   }
+   vq->avail_idx = vhost16_to_cpu(vq, avail_idx);
 
-   if (unlikely((u16)(vq->avail_idx - last_avail_idx) > vq->num)) {
-   vq_err(vq, "Guest moved used index from %u to %u",
-  last_avail_idx, vq->avail_idx);
-   return -EFAULT;
-   }
+   if (unlikely((u16)(vq->avail_idx - last_avail_idx) > vq->num)) {
+   vq_err(vq, "Guest moved used index from %u to %u",
+  last_avail_idx, vq->avail_idx);
+   return -EFAULT;
+   }
 
-   /* If there's

[PATCH RFC] virtio: skip avail/used index reads

2015-11-30 Thread Michael S. Tsirkin
This adds a new vring feature bit: when enabled, host and guest poll the
available/used ring directly instead of looking at the index field
first.

To guarantee it is possible to detect updates, the high bits (above
vring.num - 1) in the ring head ID value are modified to match the index
bits - these change on each wrap-around.  Writer also XORs this with
0x8000 such that rings can be zero-initialized.

Reader is modified to ignore these high bits when looking
up descriptors.

The point is to reduce the number of cacheline misses
for both reads and writes.
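
As a minimal sketch of the head-ID encoding described above (RING_NUM
and the helper names are invented for illustration; they are not taken
from the patch):

```c
#include <assert.h>
#include <stdint.h>

#define RING_NUM 256  /* ring size; must be a power of two */

/* Writer: encode a descriptor head with wrap-detection bits.  The high
 * bits (above RING_NUM - 1) are taken from the ring index, so they flip
 * on each wrap-around, then XORed with 0x8000 so that a zero-filled
 * (freshly initialized) ring never looks like it holds a valid entry. */
static uint16_t encode_head(uint16_t head, uint16_t idx)
{
    return (head & (RING_NUM - 1)) | ((idx & ~(RING_NUM - 1)) ^ 0x8000);
}

/* Reader: does this slot hold a new entry for ring index idx? */
static int head_is_valid(uint16_t val, uint16_t idx)
{
    return (val & ~(RING_NUM - 1)) == ((idx & ~(RING_NUM - 1)) ^ 0x8000);
}

/* Reader: recover the descriptor head, ignoring the high bits. */
static uint16_t decode_head(uint16_t val)
{
    return val & (RING_NUM - 1);
}
```

This is why the reader can poll the ring entries directly: a stale or
zeroed slot fails the high-bits check, so no separate index read is
needed to detect updates.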

I see a speedup of about 20% on a multithreaded micro-benchmark
(virtio-test), but a regression of about 2% on a single-threaded one
(vring_bench).  I think this has to do with the fact that
complete_multi_user is implemented suboptimally.

TODO:
investigate single-threaded regression
better name for a feature flag
split the patch to make it easier to review
look at more aggressive ring layout changes
write a spec patch

This is on top of the following patches in my tree:
virtio_ring: Shadow available ring flags & index
vhost: replace % with & on data path
tools/virtio: fix byteswap logic
tools/virtio: move list macro stubs

Signed-off-by: Michael S. Tsirkin <m...@redhat.com>
---
 drivers/vhost/vhost.h|   3 +-
 include/linux/vringh.h   |   3 +
 include/uapi/linux/virtio_ring.h |   3 +
 drivers/vhost/vhost.c| 104 ++
 drivers/vhost/vringh.c   | 153 +--
 drivers/virtio/virtio_ring.c |  40 --
 tools/virtio/virtio_test.c   |  14 +++-
 7 files changed, 255 insertions(+), 65 deletions(-)

diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
index d3f7674..aeeb15d 100644
--- a/drivers/vhost/vhost.h
+++ b/drivers/vhost/vhost.h
@@ -175,7 +175,8 @@ enum {
 (1ULL << VIRTIO_RING_F_EVENT_IDX) |
 (1ULL << VHOST_F_LOG_ALL) |
 (1ULL << VIRTIO_F_ANY_LAYOUT) |
-(1ULL << VIRTIO_F_VERSION_1)
+(1ULL << VIRTIO_F_VERSION_1) |
+(1ULL << VIRTIO_RING_F_POLL)
 };
 
 static inline bool vhost_has_feature(struct vhost_virtqueue *vq, int bit)
diff --git a/include/linux/vringh.h b/include/linux/vringh.h
index bc6c28d..13a9e3e 100644
--- a/include/linux/vringh.h
+++ b/include/linux/vringh.h
@@ -40,6 +40,9 @@ struct vringh {
/* Can we get away with weak barriers? */
bool weak_barriers;
 
+   /* Poll ring directly */
+   bool poll;
+
/* Last available index we saw (ie. where we're up to). */
u16 last_avail_idx;
 
diff --git a/include/uapi/linux/virtio_ring.h b/include/uapi/linux/virtio_ring.h
index c072959..bf3ca1d 100644
--- a/include/uapi/linux/virtio_ring.h
+++ b/include/uapi/linux/virtio_ring.h
@@ -62,6 +62,9 @@
  * at the end of the used ring. Guest should ignore the used->flags field. */
 #define VIRTIO_RING_F_EVENT_IDX29
 
+/* Support ring polling */
+#define VIRTIO_RING_F_POLL 33
+
 /* Virtio ring descriptors: 16 bytes.  These can chain together via "next". */
 struct vring_desc {
/* Address (guest-physical). */
diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 85f0f0a..cdbabf5 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -1346,26 +1346,26 @@ int vhost_get_vq_desc(struct vhost_virtqueue *vq,
 
/* Check it isn't doing very strange things with descriptor numbers. */
last_avail_idx = vq->last_avail_idx;
-   if (unlikely(__get_user(avail_idx, &vq->avail->idx))) {
-   vq_err(vq, "Failed to access avail idx at %p\n",
-  &vq->avail->idx);
-   return -EFAULT;
-   }
-   vq->avail_idx = vhost16_to_cpu(vq, avail_idx);
-
-   if (unlikely((u16)(vq->avail_idx - last_avail_idx) > vq->num)) {
-   vq_err(vq, "Guest moved used index from %u to %u",
-  last_avail_idx, vq->avail_idx);
-   return -EFAULT;
-   }
+   if (!vhost_has_feature(vq, VIRTIO_RING_F_POLL)) {
+   if (unlikely(__get_user(avail_idx, &vq->avail->idx))) {
+   vq_err(vq, "Failed to access avail idx at %p\n",
+  &vq->avail->idx);
+   return -EFAULT;
+   }
+   vq->avail_idx = vhost16_to_cpu(vq, avail_idx);
 
-   /* If there's nothing new since last we looked, return invalid. */
-   if (vq->avail_idx == last_avail_idx)
-   return vq->num;
+   if (unlikely((u16)(vq->avail_idx - last_avail_idx) > vq->num)) {
+   vq_err(vq, "Guest moved used index from %u to %u",
+  

Re: [RFC PATCH V2 09/10] Qemu/VFIO: Add SRIOV VF migration support

2015-11-25 Thread Michael S. Tsirkin
On Wed, Nov 25, 2015 at 11:32:23PM +0800, Lan, Tianyu wrote:
> 
> On 11/25/2015 5:03 AM, Michael S. Tsirkin wrote:
> >>>+void vfio_migration_cap_handle(PCIDevice *pdev, uint32_t addr,
> >>>+  uint32_t val, int len)
> >>>+{
> >>>+VFIOPCIDevice *vdev = DO_UPCAST(VFIOPCIDevice, pdev, pdev);
> >>>+
> >>>+if (addr == vdev->migration_cap + PCI_VF_MIGRATION_VF_STATUS
> >>>+&& val == PCI_VF_READY_FOR_MIGRATION) {
> >>>+qemu_event_set(&migration_event);
> >This would wake migration so it can proceed -
> >except it needs QEMU lock to run, and that's
> >taken by the migration thread.
> 
> Sorry, I seem to miss something.
> Which lock may cause dead lock when calling vfio_migration_cap_handle()
> and run migration?

qemu_global_mutex.

> The function is called when the VF accesses the fake PCI capability.
> 
> >
> >It seems unlikely that this ever worked - how
> >did you test this?
> >
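
For readers unfamiliar with the hazard being pointed out: if the
migration thread waited for the event while holding the global mutex,
the vCPU thread that runs the config-write handler (and needs that same
mutex) could never set the event.  A minimal sketch of the lock-safe
pattern, using plain pthreads as a stand-in for QEMU's global mutex and
event (all names here are invented for illustration):

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

static pthread_mutex_t global_mutex = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  migration_cv = PTHREAD_COND_INITIALIZER;
static bool vf_ready;

/* Stand-in for the vCPU thread: the config-write handler runs under
 * the global mutex and reports that the VF wrote ..._VF_STATUS. */
static void *vcpu_thread(void *arg)
{
    (void)arg;
    pthread_mutex_lock(&global_mutex);
    vf_ready = true;
    pthread_cond_signal(&migration_cv);
    pthread_mutex_unlock(&global_mutex);
    return NULL;
}

/* Stand-in for the migration thread: pthread_cond_wait atomically
 * releases the mutex while waiting, so the vCPU thread can acquire it
 * and signal -- waiting on a plain event while holding the mutex would
 * deadlock instead. */
static void migration_wait(void)
{
    pthread_mutex_lock(&global_mutex);
    while (!vf_ready)
        pthread_cond_wait(&migration_cv, &global_mutex);
    pthread_mutex_unlock(&global_mutex);
}
```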
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH V2 3/3] Ixgbevf: Add migration support for ixgbevf driver

2015-11-25 Thread Michael S. Tsirkin
On Thu, Nov 26, 2015 at 12:02:33AM +0800, Lan, Tianyu wrote:
> On 11/25/2015 8:28 PM, Michael S. Tsirkin wrote:
> >Frankly, I don't really see what this short term hack buys us,
> >and if it goes in, we'll have to maintain it forever.
> >
> 
> The framework of how to notify VF about migration status won't be
> changed regardless of stopping VF or not before doing migration.
> We hope to reach agreement on this first.

Well, it has to be bi-directional; a uni-directional framework
won't work.
Further, if you use this interface just to stop the device
for now, you won't be able to do anything else
with it, and will need a new interface down the road.


> Tracking dirty memory still
> needs more discussion and we will continue working on it. Stopping the VF may
> help to work around the issue and make tracking easier.
> 
> 
> >Also, assuming you just want to do ifdown/ifup for some reason, it's
> >easy enough to do using a guest agent, in a completely generic way.
> >
> 
> Just ifdown/ifup is not enough for migration. It needs to restore some PCI
> settings before doing ifup on the target machine

I'd focus on just restoring then.

-- 
MST


Re: [RFC PATCH V2 3/3] Ixgbevf: Add migration support for ixgbevf driver

2015-11-25 Thread Michael S. Tsirkin
On Wed, Nov 25, 2015 at 08:24:38AM -0800, Alexander Duyck wrote:
> >> Also, assuming you just want to do ifdown/ifup for some reason, it's
> >> easy enough to do using a guest agent, in a completely generic way.
> >>
> >
> > Just ifdown/ifup is not enough for migration. It needs to restore some PCI
> > settings before doing ifup on the target machine
> 
> That is why I have been suggesting making use of suspend/resume logic
> that is already in place for PCI power management.  In the case of a
> suspend/resume we already have to deal with the fact that the device
> will go through a D0->D3->D0 reset so we have to restore all of the
> existing state.  It would take a significant load off of Qemu since
> the guest would be restoring its own state instead of making Qemu have
> to do all of the device migration work.

That can work, though again, the issue is you need guest
cooperation to migrate.

If you reset the device on the destination instead of restoring state,
then that issue goes away, but the downtime
may increase.

Will it really, though? I think it's worth starting with the
simplest solution (reset on destination), seeing
what the effect is, and then adding optimizations.


One thing that I've been thinking about for a while is saving (some)
state speculatively.  For example, notify the guest a bit before migration
is done, so it can save device state. If the guest responds quickly, you
have state that can be restored.  If it doesn't, still migrate, and it
will have to reset on the destination.
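
The decision logic of that idea can be sketched as follows (all names
are invented for illustration; this is not QEMU's migration API):

```c
#include <assert.h>
#include <stdbool.h>
#include <time.h>

/* Which path the destination takes after migration. */
typedef enum { DEST_RESTORE, DEST_RESET } dest_plan;

/* guest_acked reports whether the guest has acknowledged the
 * "save your device state" request; budget_ms bounds how long
 * migration is willing to wait before proceeding anyway. */
static dest_plan plan_for_destination(bool (*guest_acked)(void), int budget_ms)
{
    struct timespec one_ms = { 0, 1000000 };

    for (int waited = 0; waited < budget_ms; waited++) {
        if (guest_acked())
            return DEST_RESTORE;   /* saved state can be restored */
        nanosleep(&one_ms, NULL);
    }
    return DEST_RESET;             /* migrate anyway; reset on destination */
}
```

The key property is that migration never blocks on the guest: a stuck
guest just degrades to the reset-on-destination path.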


-- 
MST


Re: [RFC PATCH V2 3/3] Ixgbevf: Add migration support for ixgbevf driver

2015-11-25 Thread Michael S. Tsirkin
On Wed, Nov 25, 2015 at 01:39:32PM +0800, Lan Tianyu wrote:
> On 2015年11月25日 05:20, Michael S. Tsirkin wrote:
> > I have to say, I was much more interested in the idea
> > of tracking dirty memory. I have some thoughts about
> > that one - did you give up on it then?
> 
> No, our final target is to keep the VF active during
> migration, and tracking dirty memory is essential for that. But this
> seems hard to do upstream in the short term. As a
> starting point, we stop the VF before migration.

Frankly, I don't really see what this short term hack buys us,
and if it goes in, we'll have to maintain it forever.

Also, assuming you just want to do ifdown/ifup for some reason, it's
easy enough to do using a guest agent, in a completely generic way.


> After further thought, stopping the VF still requires tracking
> DMA-dirtied memory, to make sure receive buffers written
> before the VF was stopped get migrated. It's easier to do that via a dummy
> write to the data buffer when a packet is received.
> 
> 
> -- 
> Best regards
> Tianyu Lan


Re: [RFC PATCH V2 09/10] Qemu/VFIO: Add SRIOV VF migration support

2015-11-24 Thread Michael S. Tsirkin
On Tue, Nov 24, 2015 at 09:35:26PM +0800, Lan Tianyu wrote:
> This patch adds SRIOV VF migration support.
> Create a new device type "vfio-sriov" and add a fake PCI migration capability
> to it.
> 
> The purpose of the new capability is to:
> 1) sync migration status with the VF driver in the VM
> 2) get the mailbox irq vector to notify the VF driver during migration
> 3) provide a way to control whether to inject the irq
> 
> Qemu will migrate the PCI config space regs and MSI-X config for the VF.
> It injects the mailbox irq at the last stage of migration to notify the VF
> about the migration event and waits for the VF driver to be ready.

I think this last bit, "wait for the VF driver to be ready for migration",
is wrong. Not a lot is gained compared to hot-unplug.

To really benefit from this feature, migration should
succeed even if the guest is stuck; the interrupt should then
tell the guest that it has to reset the driver.


> The VF driver
> writes the PCI config reg PCI_VF_MIGRATION_VF_STATUS in the new cap table
> to tell Qemu.
> 
> Signed-off-by: Lan Tianyu 
> ---
>  hw/vfio/Makefile.objs |   2 +-
>  hw/vfio/pci.c |   6 ++
>  hw/vfio/pci.h |   4 ++
>  hw/vfio/sriov.c   | 178 
> ++
>  4 files changed, 189 insertions(+), 1 deletion(-)
>  create mode 100644 hw/vfio/sriov.c
> 
> diff --git a/hw/vfio/Makefile.objs b/hw/vfio/Makefile.objs
> index d540c9d..9cf0178 100644
> --- a/hw/vfio/Makefile.objs
> +++ b/hw/vfio/Makefile.objs
> @@ -1,6 +1,6 @@
>  ifeq ($(CONFIG_LINUX), y)
>  obj-$(CONFIG_SOFTMMU) += common.o
> -obj-$(CONFIG_PCI) += pci.o
> +obj-$(CONFIG_PCI) += pci.o sriov.o
>  obj-$(CONFIG_SOFTMMU) += platform.o
>  obj-$(CONFIG_SOFTMMU) += calxeda-xgmac.o
>  endif
> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
> index 7c43fc1..e7583b5 100644
> --- a/hw/vfio/pci.c
> +++ b/hw/vfio/pci.c
> @@ -2013,6 +2013,11 @@ void vfio_pci_write_config(PCIDevice *pdev, uint32_t 
> addr,
>  } else if (was_enabled && !is_enabled) {
>  vfio_disable_msix(vdev);
>  }
> +} else if (vdev->migration_cap &&
> +ranges_overlap(addr, len, vdev->migration_cap, 0x10)) {
> +/* Write everything to QEMU to keep emulated bits correct */
> +pci_default_write_config(pdev, addr, val, len);
> +vfio_migration_cap_handle(pdev, addr, val, len);
>  } else {
>  /* Write everything to QEMU to keep emulated bits correct */
>  pci_default_write_config(pdev, addr, val, len);
> @@ -3517,6 +3522,7 @@ static int vfio_initfn(PCIDevice *pdev)
>  vfio_register_err_notifier(vdev);
>  vfio_register_req_notifier(vdev);
>  vfio_setup_resetfn(vdev);
> +vfio_add_migration_capability(vdev);
>  
>  return 0;
>  
> diff --git a/hw/vfio/pci.h b/hw/vfio/pci.h
> index 6c00575..ee6ca5e 100644
> --- a/hw/vfio/pci.h
> +++ b/hw/vfio/pci.h
> @@ -134,6 +134,7 @@ typedef struct VFIOPCIDevice {
>  PCIHostDeviceAddress host;
>  EventNotifier err_notifier;
>  EventNotifier req_notifier;
> +uint16_tmigration_cap;
>  int (*resetfn)(struct VFIOPCIDevice *);
>  uint32_t features;
>  #define VFIO_FEATURE_ENABLE_VGA_BIT 0
> @@ -162,3 +163,6 @@ uint32_t vfio_pci_read_config(PCIDevice *pdev, uint32_t 
> addr, int len);
>  void vfio_pci_write_config(PCIDevice *pdev, uint32_t addr,
> uint32_t val, int len);
>  void vfio_enable_msix(VFIOPCIDevice *vdev);
> +void vfio_add_migration_capability(VFIOPCIDevice *vdev);
> +void vfio_migration_cap_handle(PCIDevice *pdev, uint32_t addr,
> +   uint32_t val, int len);
> diff --git a/hw/vfio/sriov.c b/hw/vfio/sriov.c
> new file mode 100644
> index 000..3109538
> --- /dev/null
> +++ b/hw/vfio/sriov.c
> @@ -0,0 +1,178 @@
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +
> +#include "hw/hw.h"
> +#include "hw/vfio/pci.h"
> +#include "hw/vfio/vfio.h"
> +#include "hw/vfio/vfio-common.h"
> +
> +#define TYPE_VFIO_SRIOV "vfio-sriov"
> +
> +#define SRIOV_LM_SETUP 0x01
> +#define SRIOV_LM_COMPLETE 0x02
> +
> +QemuEvent migration_event;
> +
> +static void vfio_dev_post_load(void *opaque)
> +{
> +struct PCIDevice *pdev = (struct PCIDevice *)opaque;
> +VFIOPCIDevice *vdev = DO_UPCAST(VFIOPCIDevice, pdev, pdev);
> +MSIMessage msg;
> +int vector;
> +
> +if (vfio_pci_read_config(pdev,
> +vdev->migration_cap + PCI_VF_MIGRATION_CAP, 1)
> +!= PCI_VF_MIGRATION_ENABLE)
> +return;
> +
> +vector = vfio_pci_read_config(pdev,
> +vdev->migration_cap + PCI_VF_MIGRATION_IRQ, 1);
> +
> +msg = msix_get_message(pdev, vector);
> +kvm_irqchip_send_msi(kvm_state, msg);
> +}
> +
> +static int vfio_dev_load(QEMUFile *f, void *opaque, int version_id)
> +{
> +struct PCIDevice *pdev = (struct PCIDevice *)opaque;
> +VFIOPCIDevice *vdev = DO_UPCAST(VFIOPCIDevice, pdev, pdev);
> +int ret;
> +
> +

Re: [RFC PATCH V2 3/3] Ixgbevf: Add migration support for ixgbevf driver

2015-11-24 Thread Michael S. Tsirkin
On Tue, Nov 24, 2015 at 09:38:18PM +0800, Lan Tianyu wrote:
> This patch adds migration support to the ixgbevf driver. A
> fake PCI migration capability table is used to communicate with Qemu to
> share migration status and the mailbox irq vector index.
> 
> Qemu will notify VF via sending MSIX msg to trigger mailbox
> vector during migration and store migration status in the
> PCI_VF_MIGRATION_VMM_STATUS regs in the new capability table.
> The mailbox irq will be triggered just before the stop-and-copy stage
> and after migration on the target machine.
> 
> The VF driver will bring the net device down when it detects migration and tell
> Qemu it's ready for migration by writing the PCI_VF_MIGRATION_VF_STATUS
> reg. After migration, it brings the net device up again.
> 
> Qemu will in charge of migrating PCI config space regs and MSIX config.
> 
> The patch targets the normal case, where net traffic works
> while the mailbox irq is enabled. For other cases (such as the driver
> not being loaded, or the adapter being suspended or closed), the mailbox irq
> won't be triggered and the VF driver will disable it via the
> PCI_VF_MIGRATION_CAP reg. These cases will be resolved later.
> 
> Signed-off-by: Lan Tianyu 

I have to say, I was much more interested in the idea
of tracking dirty memory. I have some thoughts about
that one - did you give up on it then?



> ---
>  drivers/net/ethernet/intel/ixgbevf/ixgbevf.h  |   5 ++
>  drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c | 102 
> ++
>  2 files changed, 107 insertions(+)
> 
> diff --git a/drivers/net/ethernet/intel/ixgbevf/ixgbevf.h 
> b/drivers/net/ethernet/intel/ixgbevf/ixgbevf.h
> index 775d089..4b8ba2f 100644
> --- a/drivers/net/ethernet/intel/ixgbevf/ixgbevf.h
> +++ b/drivers/net/ethernet/intel/ixgbevf/ixgbevf.h
> @@ -438,6 +438,11 @@ struct ixgbevf_adapter {
>   u64 bp_tx_missed;
>  #endif
>  
> + u8 migration_cap;
> + u8 last_migration_reg;
> + unsigned long migration_status;
> + struct work_struct migration_task;
> +
>   u8 __iomem *io_addr; /* Mainly for iounmap use */
>   u32 link_speed;
>   bool link_up;
> diff --git a/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c 
> b/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
> index a16d267..95860c2 100644
> --- a/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
> +++ b/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
> @@ -96,6 +96,8 @@ static int debug = -1;
>  module_param(debug, int, 0);
>  MODULE_PARM_DESC(debug, "Debug level (0=none,...,16=all)");
>  
> +#define MIGRATION_IN_PROGRESS0
> +
>  static void ixgbevf_service_event_schedule(struct ixgbevf_adapter *adapter)
>  {
>   if (!test_bit(__IXGBEVF_DOWN, &adapter->state) &&
> @@ -1262,6 +1264,22 @@ static void ixgbevf_set_itr(struct ixgbevf_q_vector 
> *q_vector)
>   }
>  }
>  
> +static void ixgbevf_migration_check(struct ixgbevf_adapter *adapter) 
> +{
> + struct pci_dev *pdev = adapter->pdev;
> + u8 val;
> +
> + pci_read_config_byte(pdev,
> +  adapter->migration_cap + PCI_VF_MIGRATION_VMM_STATUS,
> +  &val);
> +
> + if (val != adapter->last_migration_reg) {
> + schedule_work(&adapter->migration_task);
> + adapter->last_migration_reg = val;
> + }
> +
> +}
> +
>  static irqreturn_t ixgbevf_msix_other(int irq, void *data)
>  {
>   struct ixgbevf_adapter *adapter = data;
> @@ -1269,6 +1287,7 @@ static irqreturn_t ixgbevf_msix_other(int irq, void 
> *data)
>  
>   hw->mac.get_link_status = 1;
>  
> + ixgbevf_migration_check(adapter);
>   ixgbevf_service_event_schedule(adapter);
>  
>   IXGBE_WRITE_REG(hw, IXGBE_VTEIMS, adapter->eims_other);
> @@ -1383,6 +1402,7 @@ out:
>  static int ixgbevf_request_msix_irqs(struct ixgbevf_adapter *adapter)
>  {
>   struct net_device *netdev = adapter->netdev;
> + struct pci_dev *pdev = adapter->pdev;
>   int q_vectors = adapter->num_msix_vectors - NON_Q_VECTORS;
>   int vector, err;
>   int ri = 0, ti = 0;
> @@ -1423,6 +1443,12 @@ static int ixgbevf_request_msix_irqs(struct 
> ixgbevf_adapter *adapter)
>   goto free_queue_irqs;
>   }
>  
> + if (adapter->migration_cap) {
> + pci_write_config_byte(pdev,
> + adapter->migration_cap + PCI_VF_MIGRATION_IRQ,
> + vector);
> + }
> +
>   return 0;
>  
>  free_queue_irqs:
> @@ -2891,6 +2917,59 @@ static void ixgbevf_watchdog_subtask(struct 
> ixgbevf_adapter *adapter)
>   ixgbevf_update_stats(adapter);
>  }
>  
> +static void ixgbevf_migration_task(struct work_struct *work)
> +{
> + struct ixgbevf_adapter *adapter = container_of(work,
> + struct ixgbevf_adapter,
> + migration_task);
> + struct pci_dev *pdev = adapter->pdev;
> + struct net_device *netdev = adapter->netdev;
> + u8 val;
> +
> + if (!test_bit(MIGRATION_IN_PROGRESS, &adapter->migration_status)) {
> + pci_read_config_byte(pdev,
> + 

Re: [PATCH v3 0/3] virtio DMA API core stuff

2015-11-22 Thread Michael S. Tsirkin
On Sun, Nov 22, 2015 at 10:21:34PM -, David Woodhouse wrote:
> 
> 
> > There's that, and there's an "I care about security, but
> > do not want to burn up cycles on fake protections that
> > do not work" case.
> 
> It would seem to make most sense for this use case simply *not* to expose
> virtio devices to guests as being behind an IOMMU at all. Sure, there are
> esoteric use cases where the guest actually nests and runs further guests
> inside itself and wants to pass through the virtio devices from the real
> hardware host. But presumably those configurations will have multiple
> virtio devices assigned by the host anyway,  and further tweaking the
> configuration to put them behind an IOMMU shouldn't be hard.

Unfortunately it's a no-go: this breaks the much less esoteric usecase
of DPDK: using virtio devices with userspace drivers.

Well, it doesn't break it as such, since this doesn't currently work anyway,
but this approach would prevent us from making it work.

> 
> -- 
> dwmw2


Re: [PATCH v3 0/3] virtio DMA API core stuff

2015-11-22 Thread Michael S. Tsirkin
On Sun, Nov 22, 2015 at 03:58:28PM +, David Woodhouse wrote:
> On Fri, 2015-11-20 at 10:21 +0200, Michael S. Tsirkin wrote:
> > 
> > David, there are two things a hypervisor needs to tell the guest.
> > 1. The actual device is behind an IOMMU. This is what you
> >    are suggesting we use DMAR for.
> > 2. Using IOMMU from kernel (as opposed to from userspace with VFIO)
> >    actually adds security. For exising virtio devices on KVM,
> >    the answer is no. And DMAR has no way to reflect that.
> 
> Using the IOMMU from the kernel *always* adds security. It protects
> against device driver (and device) bugs which can be made exploitable
> by allowing DMA to anywhere in the system.

No - speaking about QEMU/KVM here - you are not "allowing" DMA - by
programming the virtual IOMMU you are asking the hypervisor nicely to do
that. If it's buggy, it can ignore you and there's nothing you can do.

As with any random change in the system, some bugs might get masked and
become non-exploitable, but then some other bugs might surface and
become exploitable.

I gather that e.g. Xen is different.


> Sure, there are classes of that which are far more interesting, for
> example where you give the whole device to a guest and let it load the
> firmware. But "we trust the hypervisor" and "we trust the hardware" are
> not *so* far apart conceptually.

Depends on the hypervisor I guess. At least for QEMU/KVM, one conceptual
difference is that we actually could have the hypervisor tell us whether
a specific device has to be trusted, or can be protected against, and
user can actually read the code and verify that QEMU is doing the right
thing.

Hardware is closed source so harder to trust.

> Hell, with ATS you *still* have to trust the hardware to a large
> extent.
>
> I really think that something like the proposed DMA_ATTR_IOMMU_BYPASS
> should suffice

I'm not sure how that is supposed to be used - does
the driver request DMA_ATTR_IOMMU_BYPASS at setup time?

If yes then I think that will work for virtio -
we can just set that in the driver.

> for the "who cares about security; we want performance"
> case.
> 
> -- 
> dwmw2
> 

There's that, and there's an "I care about security, but
do not want to burn up cycles on fake protections that
do not work" case.


-- 
MST


Re: [PATCH v3 0/3] virtio DMA API core stuff

2015-11-22 Thread Michael S. Tsirkin
On Sun, Nov 22, 2015 at 03:54:21PM +, David Woodhouse wrote:
> On Sun, 2015-11-22 at 15:06 +0200, Marcel Apfelbaum wrote:
> > 
> > 
> > I tried to generate a DMAR table that excludes some devices from
> > IOMMU translation, however it does not help.
> > 
> > The reason is, as far as I understand, that Linux kernel does
> > not allow any device being outside an IOMMU scope if the
> > iommu kernel option is activated.
> > 
> > Does anybody know if it is "by design" or is simply an uncommon
> > configuration?
> > (some devices in an IOMMU scope, while others outside *any* IOMMU
> > scope)
> 
> That's a kernel bug in the way it handles per-device DMA operations. Or
> more to the point, in the way it doesn't — the non-translated devices
> end up being pointed to the intel_dma_ops despite the fact they
> shouldn't be. I'm working on that...
> 
> -- 
> dwmw2
> 

Interesting. This seems to imply such configurations aren't
common, so I wonder whether other guest OS-es treat them
correctly.

If many of them mishandle it, we probably shouldn't use this in QEMU:
we care about guests actually working :)

-- 
MST


Re: [PATCH v3 0/3] virtio DMA API core stuff

2015-11-20 Thread Michael S. Tsirkin
On Thu, Nov 19, 2015 at 11:38:06PM +, David Woodhouse wrote:
> On Thu, 2015-11-19 at 13:59 -0800, Andy Lutomirski wrote:
> > 
> > >
> > > So thinking hard about it, I don't see any real drawbacks to making this
> > > conditional on a new feature bit, that Xen can then set..
> > 
> > Can you elaborate?  If I run QEMU, hosting Xen, hosting Linux, and the
> > virtio device is provided by QEMU, then how does Xen set the bit?
> > Similarly, how would Xen set the bit for a real physical device?
> 
> Right. This is *not* a fundamental characteristic of the device. This
> is all about how your *particular* hypervisor (in the set of turtles-
> all-the-way-down) happened to expose the thing to you.
> 
> This is why it lives in the DMAR table, in the Intel world, which
> *tells* you which devices are behind which IOMMU (and which are not).

David, there are two things a hypervisor needs to tell the guest.
1. The actual device is behind an IOMMU. This is what you
   are suggesting we use DMAR for.
2. Using IOMMU from kernel (as opposed to from userspace with VFIO)
   actually adds security. For existing virtio devices on KVM,
   the answer is no. And DMAR has no way to reflect that.

Question 2 only makes sense if you answer yes to question 1, if the user
wants protection from malicious devices with iommu=on, and
if you care about getting good performance from *other*
devices.  What the guest would do is use a 1:1 mapping for the
devices where the answer to 2 is "no".

Maybe for now I should just give up and say "don't use iommu=on within
VMs if you want any performance".  But the point is, if we just fix QEMU
to actually obey IOMMU mappings for assigned devices, then there's
already an implicit answer: virtio is trusted since it's part of the
hypervisor, all without guest changes. It seems kind of sad to let
performance regress.

So a (yet another) feature bit would be a possible solution there, but
we don't seem to be able to even agree on using a feature bit for a
quirk.


> And why I keep repeating myself that it has nothing to do with the
> actual device or the virtio drivers.
>
> I understand that POWER and other platforms don't currently have a
> clean way to indicate that certain device don't have translation. And I
> understand that we may end up with a *quirk* which ensures that the DMA
> API does the right thing (i.e. nothing) in certain cases.

So assuming we forget about 2 above for now, then yes, all we need
is a quirk, using some logic to detect these systems.

> But we should *NOT* be involving the virtio device drivers in that
> quirk, in any way. And putting a feature bit in the virtio device
> itself doesn't seem at all sane either.

Only if there's some other device that benefits from all this work.  If
virtio is the only one that benefits, then why do we want to
spread the quirk rules around so much? A feature bit gives us
a single, portable rule that the quirk can use on all platforms.

> Bear in mind that qemu-system-x86_64 currently has the *same* problem
> with assigned physical devices. It's claiming they're translated, and
> they're not.
> 
> -- 
> dwmw2
> 

Presumably people either don't assign
devices or don't have an iommu, otherwise things won't work for them;
but if they do have an iommu and don't assign devices, then Andy's
patch will break them.

This is not QEMU specific unfortunately, we don't know who
might have implemented virtio.





-- 
MST


Re: [PATCH v3 0/3] virtio DMA API core stuff

2015-11-20 Thread Michael S. Tsirkin
On Fri, Nov 20, 2015 at 01:56:39PM +1100, Benjamin Herrenschmidt wrote:
> On Thu, 2015-11-19 at 23:38 +, David Woodhouse wrote:
> > 
> > I understand that POWER and other platforms don't currently have a
> > clean way to indicate that certain device don't have translation. And I
> > understand that we may end up with a *quirk* which ensures that the DMA
> > API does the right thing (i.e. nothing) in certain cases.
> > 
> > But we should *NOT* be involving the virtio device drivers in that
> > quirk, in any way. And putting a feature bit in the virtio device
> > itself doesn't seem at all sane either.
> > 
> > Bear in mind that qemu-system-x86_64 currently has the *same* problem
> > with assigned physical devices. It's claiming they're translated, and
> > they're not.
> 
> It's not that clear but yeah ... as I mentioned, I can't find a
> way to do that quirk that won't break when we want to actually use
> the iommu... 
> 
> Ben.

Yes, I am not at all sure we need a quirk for assigned devices.
Better teach QEMU to make iommu work for them.


-- 
MST


Re: [PATCH v8 0/5] implement vNVDIMM

2015-11-19 Thread Michael S. Tsirkin
On Thu, Nov 19, 2015 at 10:39:05AM +0800, Xiao Guangrong wrote:
> 
> 
> On 11/19/2015 04:44 AM, Michael S. Tsirkin wrote:
> >On Wed, Nov 18, 2015 at 05:18:17PM -0200, Eduardo Habkost wrote:
> >>On Wed, Nov 18, 2015 at 09:59:34AM +0800, Xiao Guangrong wrote:
> >>>
> >>>Ping...
> >>>
> >>>Do you have any comment on this patchset? Could it be applied to somewhere
> >>>if it is okay for you?
> >>
> >>I have no additional comments, as the memory-backend patches I
> >>was reviewing are not included in this version. I didn't take the
> >>time to review the TYPE_NVDIMM and ACPI changes.
> >
> >No, I don't think the way guest memory is allocated here is ok.  I'm
> 
> Since the DSM memory/ACPI memory was not included in this patchset, I really
> do not understand what "guest memory is allocated" stands for exactly...

I might even be confusing this with another patchset.
Let's have this discussion when I have the time to review
and respond properly.

> >sorry, I'm busy with 2.5 now, and this is clearly not 2.5 material.
> 
> I still see some pull requests were sent out for the 2.5 merge window today
> and yesterday ...
> 
> This patchset is the simplest version we can figure out to implement basic
> functionality for vNVDIMM and only minor change is needed for other code.
> It would be nice, and really appreciated, if it could go into 2.5.

Sorry, no way, we are in a bugfix only mode for 2.5.

> >Once that's out, I'll post some suggestions.
> 
> Looking forward to your suggestions.
> 
> Thanks for your time, Michael and Eduardo!


Re: [PATCH v3 0/3] virtio DMA API core stuff

2015-11-19 Thread Michael S. Tsirkin
On Fri, Nov 20, 2015 at 08:56:46AM +0200, Michael S. Tsirkin wrote:
> On Thu, Nov 19, 2015 at 01:59:05PM -0800, Andy Lutomirski wrote:
> > On Nov 19, 2015 5:45 AM, "Michael S. Tsirkin" <m...@redhat.com> wrote:
> > >
> > > On Tue, Oct 27, 2015 at 11:38:57PM -0700, Andy Lutomirski wrote:
> > > > This switches virtio to use the DMA API unconditionally.  I'm sure
> > > > it breaks things, but it seems to work on x86 using virtio-pci, with
> > > > and without Xen, and using both the modern 1.0 variant and the
> > > > legacy variant.
> > >
> > > So thinking hard about it, I don't see any real drawbacks to making this
> > > conditional on a new feature bit, that Xen can then set..
> > 
> > Can you elaborate?  If I run QEMU, hosting Xen, hosting Linux, and the
> > virtio device is provided by QEMU, then how does Xen set the bit?
> 
> You would run QEMU with the appropriate flag. E.g.
> -global virtio-pci,use_platform_dma=on

Or Xen code within QEMU can tweak this global internally
so users don't need to care.

> > Similarly, how would Xen set the bit for a real physical device?
> > 
> > 
> > --Andy
> 
> There's no need to set bits for physical devices I think: from security
> point of view, using them from a VM isn't very different from using them
> from host.
> 
> 
> 
> -- 
> MST


Re: [PATCH v3 0/3] virtio DMA API core stuff

2015-11-19 Thread Michael S. Tsirkin
On Thu, Nov 19, 2015 at 01:59:05PM -0800, Andy Lutomirski wrote:
> On Nov 19, 2015 5:45 AM, "Michael S. Tsirkin" <m...@redhat.com> wrote:
> >
> > On Tue, Oct 27, 2015 at 11:38:57PM -0700, Andy Lutomirski wrote:
> > > This switches virtio to use the DMA API unconditionally.  I'm sure
> > > it breaks things, but it seems to work on x86 using virtio-pci, with
> > > and without Xen, and using both the modern 1.0 variant and the
> > > legacy variant.
> >
> > So thinking hard about it, I don't see any real drawbacks to making this
> > conditional on a new feature bit, that Xen can then set..
> 
> Can you elaborate?  If I run QEMU, hosting Xen, hosting Linux, and the
> virtio device is provided by QEMU, then how does Xen set the bit?

You would run QEMU with the appropriate flag. E.g.
-global virtio-pci,use_platform_dma=on

> Similarly, how would Xen set the bit for a real physical device?
> 
> 
> --Andy

There's no need to set bits for physical devices I think: from security
point of view, using them from a VM isn't very different from using them
from host.



-- 
MST


Re: [PATCH v3 0/3] virtio DMA API core stuff

2015-11-19 Thread Michael S. Tsirkin
On Tue, Oct 27, 2015 at 11:38:57PM -0700, Andy Lutomirski wrote:
> This switches virtio to use the DMA API unconditionally.  I'm sure
> it breaks things, but it seems to work on x86 using virtio-pci, with
> and without Xen, and using both the modern 1.0 variant and the
> legacy variant.

So thinking hard about it, I don't see any real drawbacks to making this
conditional on a new feature bit, that Xen can then set.

As a bonus, the host can distinguish between old and new guests using the
feature bit. That said, making the driver *control* whether the IOMMU is
bypassed makes userspace drivers unsafe, so it might not be a good idea.

A tiny bit more code but not by much, and we clearly won't
be breaking anything that's not already broken,
and we will be able to drop the extra code later
if we think it's a good idea.

I'll run this by the virtio TC on OASIS next week so we
can reserve a feature bit.

> Changes from v2:
>  - Fix really embarrassing bug.  This version actually works.
> 
> Changes from v1:
>  - Fix an endian conversion error causing a BUG to hit.
>  - Fix a DMA ordering issue (swiotlb=force works now).
>  - Minor cleanups.
> 
> Andy Lutomirski (3):
>   virtio_net: Stop doing DMA from the stack
>   virtio_ring: Support DMA APIs
>   virtio_pci: Use the DMA API
> 
>  drivers/net/virtio_net.c   |  53 +++
>  drivers/virtio/Kconfig |   2 +-
>  drivers/virtio/virtio_pci_common.h |   3 +-
>  drivers/virtio/virtio_pci_legacy.c |  19 +++-
>  drivers/virtio/virtio_pci_modern.c |  34 +--
>  drivers/virtio/virtio_ring.c   | 187 
> ++---
>  tools/virtio/linux/dma-mapping.h   |  17 
>  7 files changed, 246 insertions(+), 69 deletions(-)
>  create mode 100644 tools/virtio/linux/dma-mapping.h
> 
> -- 
> 2.4.3


Re: [PATCH v8 0/5] implement vNVDIMM

2015-11-18 Thread Michael S. Tsirkin
On Wed, Nov 18, 2015 at 05:18:17PM -0200, Eduardo Habkost wrote:
> On Wed, Nov 18, 2015 at 09:59:34AM +0800, Xiao Guangrong wrote:
> > 
> > Ping...
> > 
> > Do you have any comment on this patchset? Could it be applied to somewhere
> > if it is okay for you?
> 
> I have no additional comments, as the memory-backend patches I
> was reviewing are not included in this version. I didn't take the
> time to review the TYPE_NVDIMM and ACPI changes.

No, I don't think the way guest memory is allocated here is ok.  I'm
sorry, I'm busy with 2.5 now, and this is clearly not 2.5 material.
Once that's out, I'll post some suggestions.

Thanks!

> -- 
> Eduardo


[PATCH] vhost: relax log address alignment

2015-11-16 Thread Michael S. Tsirkin
commit 5d9a07b0de512b77bf28d2401e5fe3351f00a240 ("vhost: relax used
address alignment") fixed the alignment for the used virtual address,
but not for the physical address used for logging.

That's a mistake: alignment should clearly be the same for virtual and
physical addresses.

Signed-off-by: Michael S. Tsirkin <m...@redhat.com>
---
 drivers/vhost/vhost.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index eec2f11..080422f 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -819,7 +819,7 @@ long vhost_vring_ioctl(struct vhost_dev *d, int ioctl, void 
__user *argp)
BUILD_BUG_ON(__alignof__ *vq->used > VRING_USED_ALIGN_SIZE);
if ((a.avail_user_addr & (VRING_AVAIL_ALIGN_SIZE - 1)) ||
(a.used_user_addr & (VRING_USED_ALIGN_SIZE - 1)) ||
-   (a.log_guest_addr & (sizeof(u64) - 1))) {
+   (a.log_guest_addr & (VRING_USED_ALIGN_SIZE - 1))) {
r = -EINVAL;
break;
}
-- 
MST


Re: [PATCH] kvm/vmx: EPTP switching test

2015-11-16 Thread Michael S. Tsirkin
On Mon, Nov 16, 2015 at 06:51:06PM +0100, Radim Krčmář wrote:
> 2015-11-15 18:00+0200, Michael S. Tsirkin:
> > This patch adds a new parameter: eptp_switching_test, which enables
> > 
> > testing EPT switching on VMX if supported by hardware.  All EPT entries
> > are initialized to the same value so this adds no useful functionality
> > by itself, but can be used to test VMFUNC performance, and serve as a
> > basis for future features based on EPTP switching.
> > 
> > Support for nested virt is not enabled.
> > 
> > This was tested using the following code within guest:
> > #define VMX_VMFUNC ".byte 0x0f,0x01,0xd4"
> > static void vmfunc(unsigned int nr, unsigned int ept)
> > {
> > asm volatile(VMX_VMFUNC
> >  :
> >  : "a"(nr), "c"(ept)
> >  : "memory");
> > }
> > 
> > VMFUNC instruction cost was measured at ~122 cycles.
> > (Note: recent versions of gnu toolchain support
> >  the vmfunc instruction - removing the need for writing
> >  the bytecode manually).
> > 
> > Signed-off-by: Michael S. Tsirkin <m...@redhat.com>
> > ---
> > 
> > I think I'd like to put this upstream so future eptp switching work can
> > be implemented on top. Comments?
> 
> I'd wait for the future.  Patch is already on the list so people
> interested in benchmarking VMFUNC can quickly compile a kernel and
> developers will need to overwrite the code anyway.

It'll bitrot though.  But I'll let Paolo decide that.

> (And I think that eptp switching is expected to be used in conjuction
>  with #VE, so it'd then make sense to implement a nop for it as well.)

No idea how would I even test it, so I'm not interested in #VE at this
point.  If you are - go ahead and post a patch for that on top though,
why not.

> > diff --git a/arch/x86/include/asm/vmx.h b/arch/x86/include/asm/vmx.h
> > @@ -3011,6 +3035,7 @@ static __init int setup_vmcs_config(struct 
> > vmcs_config *vmcs_conf)
> > SECONDARY_EXEC_PAUSE_LOOP_EXITING |
> > SECONDARY_EXEC_RDTSCP |
> > SECONDARY_EXEC_ENABLE_INVPCID |
> > +   SECONDARY_EXEC_ENABLE_VM_FUNCTIONS |
> 
> The VMFUNC vmexit should be handled to prevent guests from triggering a
> WARN_ON on the host.  (VMFUNC did just #UD before this patch.)

Do you mean VMFUNC other than EPTP switch 0?  True, thanks!

> 
> After that, it's ok for KVM.


[PATCH] kvm/vmx: EPTP switching test

2015-11-15 Thread Michael S. Tsirkin
This patch adds a new parameter: eptp_switching_test, which enables
testing EPT switching on VMX if supported by hardware.  All EPT entries
are initialized to the same value so this adds no useful functionality
by itself, but can be used to test VMFUNC performance, and serve as a
basis for future features based on EPTP switching.

Support for nested virt is not enabled.

This was tested using the following code within guest:
#define VMX_VMFUNC ".byte 0x0f,0x01,0xd4"
static void vmfunc(unsigned int nr, unsigned int ept)
{
asm volatile(VMX_VMFUNC
 :
 : "a"(nr), "c"(ept)
 : "memory");
}

VMFUNC instruction cost was measured at ~122 cycles.
(Note: recent versions of gnu toolchain support
 the vmfunc instruction - removing the need for writing
 the bytecode manually).

Signed-off-by: Michael S. Tsirkin <m...@redhat.com>
---

I think I'd like to put this upstream so future eptp switching work can
be implemented on top. Comments?

 arch/x86/include/asm/vmx.h |  7 
 arch/x86/kvm/vmx.c | 84 ++
 2 files changed, 91 insertions(+)

diff --git a/arch/x86/include/asm/vmx.h b/arch/x86/include/asm/vmx.h
index 448b7ca..ceb68d9 100644
--- a/arch/x86/include/asm/vmx.h
+++ b/arch/x86/include/asm/vmx.h
@@ -69,10 +69,13 @@
 #define SECONDARY_EXEC_VIRTUAL_INTR_DELIVERY0x0200
 #define SECONDARY_EXEC_PAUSE_LOOP_EXITING  0x0400
 #define SECONDARY_EXEC_ENABLE_INVPCID  0x1000
+#define SECONDARY_EXEC_ENABLE_VM_FUNCTIONS 0x2000
 #define SECONDARY_EXEC_SHADOW_VMCS  0x4000
 #define SECONDARY_EXEC_ENABLE_PML   0x0002
 #define SECONDARY_EXEC_XSAVES  0x0010
 
+/* Definitions for VM-function controls */
+#define VM_FUNCTION_EPTP_SWITCHING 0x0001
 
 #define PIN_BASED_EXT_INTR_MASK 0x0001
 #define PIN_BASED_NMI_EXITING   0x0008
@@ -153,6 +156,8 @@ enum vmcs_field {
APIC_ACCESS_ADDR_HIGH   = 0x2015,
POSTED_INTR_DESC_ADDR   = 0x2016,
POSTED_INTR_DESC_ADDR_HIGH  = 0x2017,
+   VM_FUNCTION_CTRL= 0x2018,
+   VM_FUNCTION_CTRL_HIGH   = 0x2019,
EPT_POINTER = 0x201a,
EPT_POINTER_HIGH= 0x201b,
EOI_EXIT_BITMAP0= 0x201c,
@@ -163,6 +168,8 @@ enum vmcs_field {
EOI_EXIT_BITMAP2_HIGH   = 0x2021,
EOI_EXIT_BITMAP3= 0x2022,
EOI_EXIT_BITMAP3_HIGH   = 0x2023,
+   EPTP_LIST_ADDRESS   = 0x2024,
+   EPTP_LIST_ADDRESS_HIGH  = 0x2025,
VMREAD_BITMAP   = 0x2026,
VMWRITE_BITMAP  = 0x2028,
XSS_EXIT_BITMAP = 0x202C,
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 6a8bc64..3d1f613 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -45,6 +45,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include "trace.h"
 #include "pmu.h"
@@ -105,6 +106,9 @@ static u64 __read_mostly host_xss;
 static bool __read_mostly enable_pml = 1;
 module_param_named(pml, enable_pml, bool, S_IRUGO);
 
+static bool __read_mostly enable_eptp_switching = 0;
+module_param_named(eptp_switching_test, enable_eptp_switching, bool, S_IRUGO);
+
 #define KVM_GUEST_CR0_MASK (X86_CR0_NW | X86_CR0_CD)
 #define KVM_VM_CR0_ALWAYS_ON_UNRESTRICTED_GUEST (X86_CR0_WP | X86_CR0_NE)
 #define KVM_VM_CR0_ALWAYS_ON   \
@@ -547,6 +551,10 @@ struct vcpu_vmx {
/* Support for PML */
 #define PML_ENTITY_NUM 512
struct page *pml_pg;
+
+   /* Support for EPTP switching */
+#define EPTP_LIST_NUM  512
+   struct page *eptp_list_pg;
 };
 
 enum segment_cache_field {
@@ -1113,6 +1121,22 @@ static inline bool cpu_has_vmx_pml(void)
return vmcs_config.cpu_based_2nd_exec_ctrl & SECONDARY_EXEC_ENABLE_PML;
 }
 
+static inline bool cpu_has_vmx_vm_functions(void)
+{
+   return vmcs_config.cpu_based_2nd_exec_ctrl &
+   SECONDARY_EXEC_ENABLE_VM_FUNCTIONS;
+}
+
+/* check if the cpu supports EPTP switching */
+static inline bool cpu_has_vmx_eptp_switching(void)
+{
+   u64 vmx_msr;
+
+   rdmsrl(MSR_IA32_VMX_VMFUNC, vmx_msr);
+   /* This MSR has same format as VM-function controls */
+   return vmx_msr & VM_FUNCTION_EPTP_SWITCHING;
+}
+
 static inline bool report_flexpriority(void)
 {
return flexpriority_enabled;
@@ -3011,6 +3035,7 @@ static __init int setup_vmcs_config(struct vmcs_config 
*vmcs_conf)
SECONDARY_EXEC_PAUSE_LOOP_EXITING |
   

Re: [PATCH v3 0/3] virtio DMA API core stuff

2015-11-12 Thread Michael S. Tsirkin
On Wed, Nov 11, 2015 at 11:30:27PM +0100, David Woodhouse wrote:
> On Wed, 2015-11-11 at 07:56 -0800, Andy Lutomirski wrote:
> > 
> > Can you flesh out this trick?
> > 
> > On x86 IIUC the IOMMU more-or-less defaults to passthrough.  If the
> > kernel wants, it can switch it to a non-passthrough mode.  My patches
> > cause the virtio driver to do exactly this, except that the host
> > implementation doesn't actually exist yet, so the patches will instead
> > have no particular effect.
> 
> At some level, yes — we're compatible with a 1982 IBM PC and thus the
> IOMMU is entirely disabled at boot until the kernel turns it on —
> except in TXT mode where we abandon that compatibility.
> 
> But no, the virtio driver has *nothing* to do with switching the device
> out of passthrough mode. It is either in passthrough mode, or it isn't.
> 
> If the VMM *doesn't* expose an IOMMU to the guest, obviously the
> devices are in passthrough mode. If the guest kernel doesn't have IOMMU
> support enabled, then obviously the devices are in passthrough mode.
> And if the ACPI tables exposed to the guest kernel *tell* it that the
> virtio devices are not actually behind the IOMMU (which qemu gets
> wrong), then it'll be in passthrough mode.
> 
> If the IOMMU is exposed, and enabled, and telling the guest kernel that
> it *does* cover the virtio devices, then those virtio devices will
> *not* be in passthrough mode.

This we need to fix, because in most configurations, if you are
using kernel drivers then you don't want an IOMMU with virtio,
but if you are using VFIO then you do.

Intel's IOMMU can be programmed to still
do a kind of passthrough (1:1) mapping; it's
just a matter of doing this for virtio devices
when not using VFIO.

> You choosing to use the DMA API in the virtio device drivers instead of
> being buggy, has nothing to do with whether it's actually in
> passthrough mode or not. Whether it's in passthrough mode or not, using
> the DMA API is technically the right thing to do — because it should
> either *do* the translation, or return a 1:1 mapped IOVA, as
> appropriate.

Right, but first we need to actually make the DMA API do the right thing,
at least on x86, ppc and arm.

> > On powerpc and sparc, we *already* screwed up.  The host already tells
> > the guest that there's an IOMMU and that it's *enabled* because those
> > platforms don't have selective IOMMU coverage the way that x86 does.
> > So we need to work around it.
> 
> No, we need it on x86 too because once we fix the virtio device driver
> bug and make it start using the DMA API, then we start to trip up on
> the qemu bug where it lies about which devices are covered by the
> IOMMU.
> 
> Of course, we still have that same qemu bug w.r.t. assigned devices,
> which it *also* claims are behind its IOMMU when they're not...

I'm not worried about qemu bugs that much.  I am interested in being
able to use both VFIO and kernel drivers with virtio devices with good
performance and without tweaking kernel parameters.


> > I think that, if we want fancy virt-friendly IOMMU stuff like you're
> > talking about, then the right thing to do is to create a virtio bus
> > instead of pretending to be PCI.  That bus could have a virtio IOMMU
> > and its own cross-platform enumeration mechanism for devices on the
> > bus, and everything would be peachy.
> 
> That doesn't really help very much for the x86 case where the problem
> is compatibility with *existing* (arguably broken) qemu
> implementations.
> 
> Having said that, if this were real hardware I'd just be blacklisting
> it and saying "Another BIOS with broken DMAR tables --> IOMMU
> completely disabled". So perhaps we should just do that.
> 

Yes, once there is new QEMU where virtio is covered by the IOMMU,
that would be one way to address existing QEMU bugs. 

> > I still don't understand what trick.  If we want virtio devices to be
> > assignable, then they should be translated through the IOMMU, and the
> > DMA API is the right interface for that.
> 
> The DMA API is the right interface *regardless* of whether there's
> actual translation to be done. The device driver itself should not be
> involved in any way with that decision.

With virt, each device can have different privileges:
some are part of the hypervisor, so with a kernel driver,
trying to get protection from them using an IOMMU that is also
part of the hypervisor makes no sense - but when using a
userspace driver, getting protection from the userspace
driver does make sense. Others are real devices, so
getting protection from them makes some sense.

Which is which? It's easiest for the device driver itself to
gain that knowledge. Please note this is *not* the same
question as whether a specific device is covered by an IOMMU.

> When you want to access MMIO, you use ioremap() and writel() instead of
> doing random crap for yourself. When you want DMA, you use the DMA API
> to get a bus address for your device *even* if you expect there to be
> no IOMMU and 

Re: [PATCH] vhost: move is_le setup to the backend

2015-11-12 Thread Michael S. Tsirkin
On Fri, Oct 30, 2015 at 12:42:35PM +0100, Greg Kurz wrote:
> The vq->is_le field is used to fix endianness when accessing the vring via
> the cpu_to_vhost16() and vhost16_to_cpu() helpers in the following cases:
> 
> 1) host is big endian and device is modern virtio
> 
> 2) host has cross-endian support and device is legacy virtio with a different
>endianness than the host
> 
> Both cases rely on the VHOST_SET_FEATURES ioctl, but 2) also needs the
> VHOST_SET_VRING_ENDIAN ioctl to be called by userspace. Since vq->is_le
> is only needed when the backend is active, it was decided to set it at
> backend start.
> 
> This is currently done in vhost_init_used()->vhost_init_is_le() but it
> obfuscates the core vhost code. This patch moves the is_le setup to a
> dedicated function that is called from the backend code.
> 
> Note vhost_net is the only backend that can pass vq->private_data == NULL to
> vhost_init_used(), hence the "if (sock)" branch.
> 
> No behaviour change.
> 
> Signed-off-by: Greg Kurz 

I plan to look at this next week, busy with QEMU 2.5 now.

> ---
>  drivers/vhost/net.c   |6 ++
>  drivers/vhost/scsi.c  |3 +++
>  drivers/vhost/test.c  |2 ++
>  drivers/vhost/vhost.c |   12 +++-
>  drivers/vhost/vhost.h |1 +
>  5 files changed, 19 insertions(+), 5 deletions(-)
> 
> diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
> index 9eda69e40678..d6319cb2664c 100644
> --- a/drivers/vhost/net.c
> +++ b/drivers/vhost/net.c
> @@ -917,6 +917,12 @@ static long vhost_net_set_backend(struct vhost_net *n, 
> unsigned index, int fd)
>  
>   vhost_net_disable_vq(n, vq);
>   vq->private_data = sock;
> +
> + if (sock)
> + vhost_set_is_le(vq);
> + else
> + vq->is_le = virtio_legacy_is_little_endian();
> +
>   r = vhost_init_used(vq);
>   if (r)
>   goto err_used;
> diff --git a/drivers/vhost/scsi.c b/drivers/vhost/scsi.c
> index e25a23692822..e2644a301fa5 100644
> --- a/drivers/vhost/scsi.c
> +++ b/drivers/vhost/scsi.c
> @@ -1276,6 +1276,9 @@ vhost_scsi_set_endpoint(struct vhost_scsi *vs,
>   vq = &vs->vqs[i].vq;
>   mutex_lock(&vq->mutex);
>   vq->private_data = vs_tpg;
> +
> + vhost_set_is_le(vq);
> +
>   vhost_init_used(vq);
>   mutex_unlock(&vq->mutex);
>   }
> diff --git a/drivers/vhost/test.c b/drivers/vhost/test.c
> index f2882ac98726..b1c7df502211 100644
> --- a/drivers/vhost/test.c
> +++ b/drivers/vhost/test.c
> @@ -196,6 +196,8 @@ static long vhost_test_run(struct vhost_test *n, int test)
>   oldpriv = vq->private_data;
>   vq->private_data = priv;
>  
> + vhost_set_is_le(vq);
> +
>   r = vhost_init_used(&n->vqs[index]);
>  
>   mutex_unlock(&vq->mutex);
> diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> index eec2f11809ff..6be863dcbd13 100644
> --- a/drivers/vhost/vhost.c
> +++ b/drivers/vhost/vhost.c
> @@ -113,6 +113,12 @@ static void vhost_init_is_le(struct vhost_virtqueue *vq)
>  }
>  #endif /* CONFIG_VHOST_CROSS_ENDIAN_LEGACY */
>  
> +void vhost_set_is_le(struct vhost_virtqueue *vq)
> +{
> + vhost_init_is_le(vq);
> +}
> +EXPORT_SYMBOL_GPL(vhost_set_is_le);
> +
>  static void vhost_poll_func(struct file *file, wait_queue_head_t *wqh,
>   poll_table *pt)
>  {
> @@ -1156,12 +1162,8 @@ int vhost_init_used(struct vhost_virtqueue *vq)
>  {
>   __virtio16 last_used_idx;
>   int r;
> - if (!vq->private_data) {
> - vq->is_le = virtio_legacy_is_little_endian();
> + if (!vq->private_data)
>   return 0;
> - }
> -
> - vhost_init_is_le(vq);
>  
>   r = vhost_update_used_flags(vq);
>   if (r)
> diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
> index 4772862b71a7..8a62041959fe 100644
> --- a/drivers/vhost/vhost.h
> +++ b/drivers/vhost/vhost.h
> @@ -162,6 +162,7 @@ bool vhost_enable_notify(struct vhost_dev *, struct 
> vhost_virtqueue *);
>  
>  int vhost_log_write(struct vhost_virtqueue *vq, struct vhost_log *log,
>   unsigned int log_num, u64 len);
> +void vhost_set_is_le(struct vhost_virtqueue *vq);
>  
>  #define vq_err(vq, fmt, ...) do {  \
>   pr_debug(pr_fmt(fmt), ##__VA_ARGS__);   \


[PULL 03/11] KVM: add support for any length io eventfd

2015-11-12 Thread Michael S. Tsirkin
From: Jason Wang <jasow...@redhat.com>

Signed-off-by: Jason Wang <jasow...@redhat.com>
Reviewed-by: Michael S. Tsirkin <m...@redhat.com>
Signed-off-by: Michael S. Tsirkin <m...@redhat.com>
Acked-by: Paolo Bonzini <pbonz...@redhat.com>
---
 include/sysemu/kvm.h | 8 
 kvm-all.c| 4 
 kvm-stub.c   | 1 +
 3 files changed, 13 insertions(+)

diff --git a/include/sysemu/kvm.h b/include/sysemu/kvm.h
index 4ac6176..b31f325 100644
--- a/include/sysemu/kvm.h
+++ b/include/sysemu/kvm.h
@@ -53,6 +53,7 @@ extern bool kvm_gsi_routing_allowed;
 extern bool kvm_gsi_direct_mapping;
 extern bool kvm_readonly_mem_allowed;
 extern bool kvm_direct_msi_allowed;
+extern bool kvm_ioeventfd_any_length_allowed;
 
 #if defined CONFIG_KVM || !defined NEED_CPU_H
 #define kvm_enabled()   (kvm_allowed)
@@ -153,6 +154,12 @@ extern bool kvm_direct_msi_allowed;
  */
 #define kvm_direct_msi_enabled() (kvm_direct_msi_allowed)
 
+/**
+ * kvm_ioeventfd_any_length_enabled:
+ * Returns: true if KVM allows any length io eventfd.
+ */
+#define kvm_ioeventfd_any_length_enabled() (kvm_ioeventfd_any_length_allowed)
+
 #else
 #define kvm_enabled()   (0)
 #define kvm_irqchip_in_kernel() (false)
@@ -166,6 +173,7 @@ extern bool kvm_direct_msi_allowed;
 #define kvm_gsi_direct_mapping() (false)
 #define kvm_readonly_mem_enabled() (false)
 #define kvm_direct_msi_enabled() (false)
+#define kvm_ioeventfd_any_length_enabled() (false)
 #endif
 
 struct kvm_run;
diff --git a/kvm-all.c b/kvm-all.c
index de3c8c4..c648b81 100644
--- a/kvm-all.c
+++ b/kvm-all.c
@@ -109,6 +109,7 @@ bool kvm_allowed;
 bool kvm_readonly_mem_allowed;
 bool kvm_vm_attributes_allowed;
 bool kvm_direct_msi_allowed;
+bool kvm_ioeventfd_any_length_allowed;
 
 static const KVMCapabilityInfo kvm_required_capabilites[] = {
 KVM_CAP_INFO(USER_MEMORY),
@@ -1611,6 +1612,9 @@ static int kvm_init(MachineState *ms)
 kvm_vm_attributes_allowed =
 (kvm_check_extension(s, KVM_CAP_VM_ATTRIBUTES) > 0);
 
+kvm_ioeventfd_any_length_allowed =
+(kvm_check_extension(s, KVM_CAP_IOEVENTFD_ANY_LENGTH) > 0);
+
 ret = kvm_arch_init(ms, s);
 if (ret < 0) {
 goto err;
diff --git a/kvm-stub.c b/kvm-stub.c
index a5051f7..dc97a5e 100644
--- a/kvm-stub.c
+++ b/kvm-stub.c
@@ -30,6 +30,7 @@ bool kvm_gsi_routing_allowed;
 bool kvm_gsi_direct_mapping;
 bool kvm_allowed;
 bool kvm_readonly_mem_allowed;
+bool kvm_ioeventfd_any_length_allowed;
 
 int kvm_init_vcpu(CPUState *cpu)
 {
-- 
MST



Re: [PATCH v3 0/3] virtio DMA API core stuff

2015-11-11 Thread Michael S. Tsirkin
On Sat, Oct 31, 2015 at 12:16:12AM +0900, Joerg Roedel wrote:
> On Thu, Oct 29, 2015 at 11:01:41AM +0200, Michael S. Tsirkin wrote:
> > Example: you have a mix of assigned devices and virtio devices. You
> > don't trust your assigned device vendor not to corrupt your memory so
> > you want to limit the damage your assigned device can do to your guest,
> > so you use an IOMMU for that.  Thus existing iommu=pt within guest is out.
> > 
> > But you trust your hypervisor (you have no choice anyway),
> > and you don't want the overhead of tweaking IOMMU
> > on data path for virtio. Thus iommu=on is out too.
> 
> IOMMUs on x86 usually come with an ACPI table that describes which
> IOMMUs are in the system and which devices they translate. So you can
> easily describe all devices there that are not behind an IOMMU.
> 
> The ACPI table is built by the BIOS, and the platform intialization code
> sets the device dma_ops accordingly. If the BIOS provides wrong
> information in the ACPI table this is a platform bug.

It doesn't look like I managed to put the point across.
My point is that an IOMMU is required to do things like
userspace drivers; what we need is a way to express
"there is an IOMMU but it is part of the device itself, use passthrough
 unless your driver is untrusted".

> > I'm not sure what ACPI has to do with it.  It's about a way for guest
> > users to specify whether they want to bypass an IOMMU for a given
> > device.
> 
> We have no way yet to request passthrough-mode per-device from the IOMMU
> drivers, but that can easily be added. But as I see it:
> 
> > By the way, a bunch of code is missing on the QEMU side
> > to make this useful:
> > 1. virtio ignores the iommu
> > 2. vhost user ignores the iommu
> > 3. dataplane ignores the iommu
> > 4. vhost-net ignores the iommu
> > 5. VFIO ignores the iommu
> 
> Qemu does not implement IOMMU translation for virtio devices anyway
> (which is fine), so it just should tell the guest so in the ACPI table
> built to describe the emulated IOMMU.
> 
> 
>   Joerg

This is a short term limitation.




Re: [PATCH v3 0/3] virtio DMA API core stuff

2015-11-11 Thread Michael S. Tsirkin
On Tue, Nov 10, 2015 at 10:54:21AM -0800, Andy Lutomirski wrote:
> On Nov 10, 2015 7:02 AM, "Michael S. Tsirkin" <m...@redhat.com> wrote:
> >
> > On Sun, Nov 08, 2015 at 12:49:46PM +0100, Joerg Roedel wrote:
> > > On Sun, Nov 08, 2015 at 12:37:47PM +0200, Michael S. Tsirkin wrote:
> > > > I have no problem with that. For example, can we teach
> > > > the DMA API on intel x86 to use PT for virtio by default?
> > > > That would allow merging Andy's patches with
> > > > full compatibility with old guests and hosts.
> > >
> > > Well, the only incompatibility comes from an experimental qemu feature,
> > > more explicitly from a bug in that features implementation. So why
> > > should we work around that in the kernel? I think it is not too hard to
> > > fix qemu to generate a correct DMAR table which excludes the virtio
> > > devices from iommu translation.
> > >
> > >
> > >   Joerg
> >
> > It's not that easy - you'd have to dedicate some buses
> > for iommu bypass, and teach management tools to only put
> > virtio there - but it's possible.
> >
> > This will absolutely address guests that don't need to set up IOMMU for
> > virtio devices, and virtio that bypasses the IOMMU.
> >
> > But the problem is that we do want to *allow* guests
> > to set up IOMMU for virtio devices.
> > In that case, these are two other usecases:
> >
> > A- monolitic virtio within QEMU:
> > iommu only needed for VFIO ->
> > guest should always use iommu=pt
> > iommu=on works but is just useless overhead.
> >
> > B- modular out of process virtio outside QEMU:
> > iommu needed for VFIO or kernel driver ->
> > guest should use iommu=pt or iommu=on
> > depending on security/performance requirements
> >
> > Note that there could easily be a mix of these in the same system.
> >
> > So for these cases we do need QEMU to specify to guest that IOMMU covers
> > the virtio devices.  Also, once one does this, the default on linux is
> > iommu=on and not pt, which works but ATM is very slow.
> >
> > This poses three problems:
> >
> > 1. How do we address the different needs of A and B?
> >One way would be for virtio to pass the information to guest
> >using some virtio specific way, and have drivers
> >specify what kind of DMA access they want.
> >
> > 2. (Kind of a subset of 1) once we do allow IOMMU, how do we make sure most 
> > guests
> >use the more sensible iommu=pt.
> >
> > 3. Once we do allow IOMMU, how can we keep existing guests work in this 
> > configuration?
> >Creating different hypervisor configurations depending on guest is very 
> > nasty.
> >Again, one way would be some virtio specific interface.
> >
> > I'd rather we figured the answers to this before merging Andy's patches
> > because I'm concerned that instead of 1 broken configuration
> > (virtio always bypasses IOMMU) we'll get two bad configurations
> > (in the second one, virtio uses the slow default with no
> > gain in security).
> >
> > Suggestions welcome.
> 
> I think there's still no downside of using my patches, even on x86.
> 
> Old kernels on new QEMU work unless IOMMU is enabled on the host.  I
> think that's the best we can possibly do.
> New kernels work at full speed on old QEMU.

Only if IOMMU is disabled, right?

> New kernels with new QEMU and iommu enabled work slower.  Even newer
> kernels with default passthrough work at full speed, and there's no
> obvious downside to the existence of kernels with just my patches.
> 
> --Andy
> 

I tried to explain the possible downside. Let me try again.  Imagine
that guest kernel notifies hypervisor that it wants IOMMU to actually
work.  This will make old kernel on new QEMU work even with IOMMU
enabled on host - better than "the best we can do" that you described
above.  Specifically, QEMU will assume that if it didn't get
notification, it's an old kernel so it should ignore the IOMMU.

But if we apply your patches this trick won't work.

Without implementing it all, I think the easiest incremental step would
be to teach linux to make passthrough the default when running as a
guest on top of QEMU, with your patches on top. If someone specifies
non-passthrough on the command line it'll still be broken,
but not too bad.


> >
> > --
> > MST
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] virtio_ring: Shadow available ring flags & index

2015-11-11 Thread Michael S. Tsirkin
On Tue, Nov 10, 2015 at 04:21:07PM -0800, Venkatesh Srinivas wrote:
> Improves cacheline transfer flow of available ring header.
> 
> Virtqueues are implemented as a pair of rings, one producer->consumer
> avail ring and one consumer->producer used ring; preceding the
> avail ring in memory are two contiguous u16 fields -- avail->flags
> and avail->idx. A producer posts work by writing to avail->idx and
> a consumer reads avail->idx.
> 
> The flags and idx fields only need to be written by a producer CPU
> and only read by a consumer CPU; when the producer and consumer are
> running on different CPUs and the virtio_ring code is structured to
> only have source writes/sink reads, we can continuously transfer the
> avail header cacheline between cores in the 'M' state. This flow
> optimizes core -> core bandwidth on certain CPUs.
> 
> (see: "Software Optimization Guide for AMD Family 15h Processors",
> Section 11.6; similar language appears in the 10h guide and should
> apply to CPUs w/ exclusive caches, using LLC as a transfer cache)
> 
> Unfortunately, the existing virtio_ring code issued reads of
> avail->idx and read-modify-writes of avail->flags on the producer.
> 
> This change shadows the flags and index fields in producer memory;
> the vring code now reads from the shadows and only ever writes to
> avail->flags and avail->idx, allowing the cacheline to transfer
> core -> core optimally.

Sounds logical, I'll apply this after a bit of testing
of my own, thanks!

> In a concurrent version of vring_bench, the time required for
> 10,000,000 buffer checkout/returns was reduced by ~2% (average
> across many runs) on an AMD Piledriver (15h) CPU:
> 
> (w/o shadowing):
>  Performance counter stats for './vring_bench':
>  5,451,082,016  L1-dcache-loads
>  ...
>2.221477739 seconds time elapsed
> 
> (w/ shadowing):
>  Performance counter stats for './vring_bench':
>  5,405,701,361  L1-dcache-loads
>  ...
>2.168405376 seconds time elapsed

Could you supply the full command line you used
to test this?

> The further away (in a NUMA sense) virtio producers and consumers are
> from each other, the more we expect to benefit. Physical implementations
> of virtio devices and implementations of virtio where the consumer polls
> vring avail indexes (vhost) should also benefit.
> 
> Signed-off-by: Venkatesh Srinivas 

Here's a similar patch for the ring itself:
https://lkml.org/lkml/2015/9/10/111

Does it help you as well?


> ---
>  drivers/virtio/virtio_ring.c | 46 ++++++++++++++++++++++++++++++++++------------
>  1 file changed, 34 insertions(+), 12 deletions(-)
> 
> diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
> index 096b857..6262015 100644
> --- a/drivers/virtio/virtio_ring.c
> +++ b/drivers/virtio/virtio_ring.c
> @@ -80,6 +80,12 @@ struct vring_virtqueue {
>   /* Last used index we've seen. */
>   u16 last_used_idx;
>  
> + /* Last written value to avail->flags */
> + u16 avail_flags_shadow;
> +
> + /* Last written value to avail->idx in guest byte order */
> + u16 avail_idx_shadow;
> +
>   /* How to notify other side. FIXME: commonalize hcalls! */
>   bool (*notify)(struct virtqueue *vq);
>  
> @@ -235,13 +241,14 @@ static inline int virtqueue_add(struct virtqueue *_vq,
>  
>   /* Put entry in available array (but don't update avail->idx until they
>* do sync). */
> - avail = virtio16_to_cpu(_vq->vdev, vq->vring.avail->idx) & (vq->vring.num - 1);
> + avail = vq->avail_idx_shadow & (vq->vring.num - 1);
>   vq->vring.avail->ring[avail] = cpu_to_virtio16(_vq->vdev, head);
>  
>   /* Descriptors and available array need to be set before we expose the
>* new available array entries. */
>   virtio_wmb(vq->weak_barriers);
> - vq->vring.avail->idx = cpu_to_virtio16(_vq->vdev, virtio16_to_cpu(_vq->vdev, vq->vring.avail->idx) + 1);
> + vq->avail_idx_shadow++;
> + vq->vring.avail->idx = cpu_to_virtio16(_vq->vdev, vq->avail_idx_shadow);
>   vq->num_added++;
>  
>   pr_debug("Added buffer head %i to %p\n", head, vq);
> @@ -354,8 +361,8 @@ bool virtqueue_kick_prepare(struct virtqueue *_vq)
>* event. */
>   virtio_mb(vq->weak_barriers);
>  
> - old = virtio16_to_cpu(_vq->vdev, vq->vring.avail->idx) - vq->num_added;
> - new = virtio16_to_cpu(_vq->vdev, vq->vring.avail->idx);
> + old = vq->avail_idx_shadow - vq->num_added;
> + new = vq->avail_idx_shadow;
>   vq->num_added = 0;
>  
>  #ifdef DEBUG
> @@ -510,7 +517,7 @@ void *virtqueue_get_buf(struct virtqueue *_vq, unsigned int *len)
>   /* If we expect an interrupt for the next entry, tell host
>* by writing event index and flush out the write before
>* the read in the next get_buf call. */
> - if (!(vq->vring.avail->flags & cpu_to_virtio16(_vq->vdev, VRING_AVAIL_F_NO_INTERRUPT))) {
> + if (!(vq->avail_flags_shadow & 

Re: [PATCH v3 0/3] virtio DMA API core stuff

2015-11-10 Thread Michael S. Tsirkin
On Sun, Nov 08, 2015 at 12:49:46PM +0100, Joerg Roedel wrote:
> On Sun, Nov 08, 2015 at 12:37:47PM +0200, Michael S. Tsirkin wrote:
> > I have no problem with that. For example, can we teach
> > the DMA API on intel x86 to use PT for virtio by default?
> > That would allow merging Andy's patches with
> > full compatibility with old guests and hosts.
> 
> Well, the only incompatibility comes from an experimental qemu feature,
> more explicitly from a bug in that feature's implementation. So why
> should we work around that in the kernel? I think it is not too hard to
> fix qemu to generate a correct DMAR table which excludes the virtio
> devices from iommu translation.
> 
> 
>   Joerg

It's not that easy - you'd have to dedicate some buses
for iommu bypass, and teach management tools to only put
virtio there - but it's possible.

This will absolutely address guests that don't need to set up IOMMU for
virtio devices, and virtio that bypasses the IOMMU.

But the problem is that we do want to *allow* guests
to set up IOMMU for virtio devices.
In that case, these are two other usecases:

A- monolithic virtio within QEMU:
iommu only needed for VFIO ->
guest should always use iommu=pt
iommu=on works but is just useless overhead.

B- modular out of process virtio outside QEMU:
iommu needed for VFIO or kernel driver ->
guest should use iommu=pt or iommu=on
depending on security/performance requirements

Note that there could easily be a mix of these in the same system.

So for these cases we do need QEMU to specify to guest that IOMMU covers
the virtio devices.  Also, once one does this, the default on linux is
iommu=on and not pt, which works but ATM is very slow.

This poses three problems:

1. How do we address the different needs of A and B?
   One way would be for virtio to pass the information to guest
   using some virtio specific way, and have drivers
   specify what kind of DMA access they want.

2. (Kind of a subset of 1) Once we do allow IOMMU, how do we make sure
   most guests use the more sensible iommu=pt?

3. Once we do allow IOMMU, how can we keep existing guests working in this
   configuration?
   Creating different hypervisor configurations depending on the guest is
   very nasty.
   Again, one way would be some virtio specific interface.

I'd rather we figured the answers to this before merging Andy's patches
because I'm concerned that instead of 1 broken configuration
(virtio always bypasses IOMMU) we'll get two bad configurations
(in the second one, virtio uses the slow default with no
gain in security).

Suggestions welcome.

-- 
MST
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v4 0/6] virtio core DMA API conversion

2015-11-10 Thread Michael S. Tsirkin
On Tue, Nov 10, 2015 at 09:37:54PM +1100, Benjamin Herrenschmidt wrote:
> On Mon, 2015-11-09 at 21:35 -0800, Andy Lutomirski wrote:
> > 
> > We could do it the other way around: on powerpc, if a PCI device is in
> > that range and doesn't have the "bypass" property at all, then it's
> > assumed to bypass the IOMMU.  This means that everything that
> > currently works continues working.  If someone builds a physical
> > virtio device or uses another system in PCIe target mode speaking
> > virtio, then it won't work until they upgrade their firmware to set
> > bypass=0.  Meanwhile everyone using hypothetical new QEMU also gets
> > bypass=0 and no ambiguity.
> >
> > vfio will presumably notice the bypass and correctly refuse to map any
> > current virtio devices.
> > 
> > Would that work?
> 
> That would be extremely strange from a platform perspective. Any device
> in that vendor/device range would bypass the iommu unless some new
> property "actually-works-like-a-real-pci-device" happens to exist in
> the device-tree, which we would then need to define somewhere and
handle across at least 3 different platforms which get their device-tree
from wildly different places.

Then we are back to virtio driver telling DMA core
whether it wants a 1:1 mapping in the iommu?
If that's acceptable to others, I don't think that's too bad.


> Also if tomorrow I create a PCI device that implements virtio-net and
> put it in a machine running IBM proprietary firmware (or Apple's or
> Sun's), it won't have that property...
> 
> This is not hypothetical. People are using virtio to do point-to-point
> communication between machines via PCIe today.
> 
> Cheers,
> Ben.

But not virtio-pci I think - that's broken for that usecase since we use
weaker barriers than required for real IO, as these have measurable
overhead.  We could have a feature "is a real PCI device",
that's completely reasonable.

-- 
MST
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v4 0/6] virtio core DMA API conversion

2015-11-09 Thread Michael S. Tsirkin
On Thu, Oct 29, 2015 at 06:09:45PM -0700, Andy Lutomirski wrote:
> This switches virtio to use the DMA API unconditionally.  I'm sure
> it breaks things, but it seems to work on x86 using virtio-pci, with
> and without Xen, and using both the modern 1.0 variant and the
> legacy variant.
> 
> This appears to work on native and Xen x86_64 using both modern and
> legacy virtio-pci.  It also appears to work on arm and arm64.
> 
> It definitely won't work as-is on s390x, and I haven't been able to
> test Christian's patches because I can't get virtio-ccw to work in
> QEMU at all.  I don't know what I'm doing wrong.
> 
> It doesn't work on ppc64.  Ben, consider yourself pinged to send me
> a patch :)
> 
> It doesn't work on sparc64.  I didn't realize at Kernel Summit that
> sparc64 has the same problem as ppc64.
> 
> DaveM, for background, we're trying to fix virtio to use the DMA
> API.  That will require that every platform that uses virtio
> supplies valid DMA operations on devices that use virtio_ring.
> Unfortunately, QEMU historically ignores the IOMMU on virtio
> devices.
> 
> On x86, this isn't really a problem.  x86 has a nice way for the
> platform to describe which devices are behind an IOMMU, and QEMU
> will be adjusted accordingly.  The only thing that will break is a
> recently-added experimental mode.

Well that's not exactly true. I think we would like to make
it possible to put virtio devices behind an IOMMU on x86,
but if this means existing guests break, then many people won't be able
to use this option: having to find out which kernel version your guest
is running is a significant burden.


So on the host side, we need to detect guests that
don't program the IOMMU and make QEMU ignore it.
I think we need to figure out a way to do this
before we commit to the guest change.

Additionally, IOMMU overhead is very high when running within the VM.
So for uses such as VFIO, we'd like a way to make something like
iommu-pt the default.



> Ben's plan for powerpc is to add a quirk for existing virtio-pci
> devices and to eventually update the devicetree stuff to allow QEMU
> to tell the guest which devices use the IOMMU.
> 
> AFAICT sparc has a similar problem to powerpc.  DaveM, can you come
> up with a straightforward way to get sparc's DMA API to work
> correctly for virtio-pci devices?
> 
> NB: Sadly, the platforms I've successfully tested on don't include any
> big-endian platforms, so there could still be lurking endian problems.
> 
> Changes from v3:
>  - More big-endian fixes.
>  - Added better virtio-ring APIs that handle allocation and use them in
>virtio-mmio and virtio-pci.
>  - Switch to Michael's virtio-net patch.
> 
> Changes from v2:
>  - Fix vring_mapping_error incorrect argument
> 
> Changes from v1:
>  - Fix an endian conversion error causing a BUG to hit.
>  - Fix a DMA ordering issue (swiotlb=force works now).
>  - Minor cleanups.
> 
> Andy Lutomirski (5):
>   virtio_ring: Support DMA APIs
>   virtio_pci: Use the DMA API
>   virtio: Add improved queue allocation API
>   virtio_mmio: Use the DMA API
>   virtio_pci: Use the DMA API
> 
> Michael S. Tsirkin (1):
>   virtio-net: Stop doing DMA from the stack
> 
>  drivers/net/virtio_net.c   |  34 ++--
>  drivers/virtio/Kconfig |   2 +-
>  drivers/virtio/virtio_mmio.c   |  67 ++-
>  drivers/virtio/virtio_pci_common.h |   6 -
>  drivers/virtio/virtio_pci_legacy.c |  42 ++---
>  drivers/virtio/virtio_pci_modern.c |  61 ++-
>  drivers/virtio/virtio_ring.c   | 348 ++---
>  include/linux/virtio.h |  23 ++-
>  include/linux/virtio_ring.h|  35 
>  tools/virtio/linux/dma-mapping.h   |  17 ++
>  10 files changed, 426 insertions(+), 209 deletions(-)
>  create mode 100644 tools/virtio/linux/dma-mapping.h
> 
> -- 
> 2.4.3
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v6 07/33] util: introduce qemu_file_get_page_size()

2015-11-09 Thread Michael S. Tsirkin
On Sat, Oct 31, 2015 at 04:09:56PM +0800, Xiao Guangrong wrote:
> 
> 
> On 10/30/2015 11:54 PM, Eduardo Habkost wrote:
> >On Fri, Oct 30, 2015 at 01:56:01PM +0800, Xiao Guangrong wrote:
> >>There are three places that use the same logic to get the page size
> >>from a file path or file fd
> >>
> >>This patch introduces qemu_file_get_page_size() to unify the code
> >>
> >>Signed-off-by: Xiao Guangrong 
> >[...]
> >>diff --git a/util/oslib-posix.c b/util/oslib-posix.c
> >>index 914cef5..ad94c5a 100644
> >>--- a/util/oslib-posix.c
> >>+++ b/util/oslib-posix.c
> >>@@ -360,6 +360,22 @@ static size_t fd_getpagesize(int fd)
> >>  return getpagesize();
> >>  }
> >>
> >>+size_t qemu_file_get_page_size(const char *path)
> >>+{
> >>+size_t size = 0;
> >>+int fd = qemu_open(path, O_RDONLY);
> >>+
> >>+if (fd < 0) {
> >>+fprintf(stderr, "Could not open %s.\n", path);
> >>+goto exit;
> >
> >Have you considered using a Error** argument here?
> >
> >>+}
> >>+
> >>+size = fd_getpagesize(fd);
> >>+qemu_close(fd);
> >>+exit:
> >>+return size;
> >>+}
> >>+
> >>diff --git a/target-ppc/kvm.c b/target-ppc/kvm.c
> >>index ac70f08..c661f1c 100644
> >>--- a/target-ppc/kvm.c
> >>+++ b/target-ppc/kvm.c
> >>@@ -308,28 +308,13 @@ static void kvm_get_smmu_info(PowerPCCPU *cpu, struct kvm_ppc_smmu_info *info)
> >>
> >>  static long gethugepagesize(const char *mem_path)
> >>  {
> >>-struct statfs fs;
> >>-int ret;
> >>-
> >>-do {
> >>-ret = statfs(mem_path, &fs);
> >>-} while (ret != 0 && errno == EINTR);
> >>+long size = qemu_file_get_page_size(mem_path);
> >>
> >>-if (ret != 0) {
> >>-fprintf(stderr, "Couldn't statfs() memory path: %s\n",
> >>-strerror(errno));
> >>+if (!size) {
> >>  exit(1);
> >>  }
> >>
> >>-#define HUGETLBFS_MAGIC   0x958458f6
> >>-
> >>-if (fs.f_type != HUGETLBFS_MAGIC) {
> >>-/* Explicit mempath, but it's ordinary pages */
> >>-return getpagesize();
> >>-}
> >>-
> >>-/* It's hugepage, return the huge page size */
> >>-return fs.f_bsize;
> >>+return size;
> >>  }
> >
> >Why are you changing target-ppc/kvm.c:gethugepagesize() to use the new
> >function, but not the copy at exec.c? To make it simpler, we could
> >eliminate both gethugepagesize() functions completely and replace them
> >with qemu_file_get_page_size() calls (maybe as part of this patch, maybe
> >in a separate patch, I'm not sure).
> >
> 
> The gethugepagesize() in exec.c will be eliminated in later patch :).

That's why it's not a good idea to split a patchset like this,
where patch 1 adds a new function and patch 2 uses it.
It's better if the user is in the same patch.
An exception is when a completely separate group of people
should review the function and its usage,
e.g. some logic in memory core versus caller in acpi.


> And the gethugepagesize() on the ppc platform has error handling logic
> and multiple callers. It's not so bad to keep it.
> 

-- 
MST
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v6 14/33] pc-dimm: drop the prefix of pc-dimm

2015-11-09 Thread Michael S. Tsirkin
On Fri, Oct 30, 2015 at 01:56:08PM +0800, Xiao Guangrong wrote:
> This patch is generated by this script:
> 
> find ./ -name "*.[ch]" -o -name "*.json" -o -name "trace-events" -type f \
> | xargs sed -i "s/PC_DIMM/DIMM/g"
> 
> find ./ -name "*.[ch]" -o -name "*.json" -o -name "trace-events" -type f \
> | xargs sed -i "s/PCDIMM/DIMM/g"
> 
> find ./ -name "*.[ch]" -o -name "*.json" -o -name "trace-events" -type f \
> | xargs sed -i "s/pc_dimm/dimm/g"
> 
> find ./ -name "trace-events" -type f | xargs sed -i "s/pc-dimm/dimm/g"
> 
> It prepares for the work that abstracts a common dimm device type for both
> pc-dimm and nvdimm
> 
> Signed-off-by: Xiao Guangrong 

I can see two ways this patchset can get merged
- merge refactorings first, nvdimm support on top
- merge nvdimm support first, refactor code on top

The way you put it in the middle of the series
allows neither.  And I definitely favor option 2:
it's easier to reason about the best way to refactor code
when you have multiple users before you.


> ---
>  hmp.c   |   2 +-
>  hw/acpi/ich9.c  |   6 +-
>  hw/acpi/memory_hotplug.c|  16 ++---
>  hw/acpi/piix4.c |   6 +-
>  hw/i386/pc.c|  32 -
>  hw/mem/pc-dimm.c| 148
>  hw/ppc/spapr.c  |  18 ++---
>  include/hw/mem/pc-dimm.h|  62 -
>  numa.c  |   2 +-
>  qapi-schema.json|   8 +--
>  qmp.c   |   2 +-
>  stubs/qmp_pc_dimm_device_list.c |   2 +-
>  trace-events|   8 +--
>  13 files changed, 156 insertions(+), 156 deletions(-)
> 
> diff --git a/hmp.c b/hmp.c
> index 5048eee..5c617d2 100644
> --- a/hmp.c
> +++ b/hmp.c
> @@ -1952,7 +1952,7 @@ void hmp_info_memory_devices(Monitor *mon, const QDict *qdict)
>  MemoryDeviceInfoList *info_list = qmp_query_memory_devices();
>  MemoryDeviceInfoList *info;
>  MemoryDeviceInfo *value;
> -PCDIMMDeviceInfo *di;
> +DIMMDeviceInfo *di;
>  
>  for (info = info_list; info; info = info->next) {
>  value = info->value;
> diff --git a/hw/acpi/ich9.c b/hw/acpi/ich9.c
> index 1c7fcfa..b0d6a67 100644
> --- a/hw/acpi/ich9.c
> +++ b/hw/acpi/ich9.c
> @@ -440,7 +440,7 @@ void ich9_pm_add_properties(Object *obj, ICH9LPCPMRegs *pm, Error **errp)
>  void ich9_pm_device_plug_cb(ICH9LPCPMRegs *pm, DeviceState *dev, Error 
> **errp)
>  {
>  if (pm->acpi_memory_hotplug.is_enabled &&
> -object_dynamic_cast(OBJECT(dev), TYPE_PC_DIMM)) {
> +object_dynamic_cast(OBJECT(dev), TYPE_DIMM)) {
>  acpi_memory_plug_cb(&pm->acpi_regs, pm->irq,
>  &pm->acpi_memory_hotplug, dev, errp);
>  } else if (object_dynamic_cast(OBJECT(dev), TYPE_CPU)) {
> @@ -455,7 +455,7 @@ void ich9_pm_device_unplug_request_cb(ICH9LPCPMRegs *pm, DeviceState *dev,
>Error **errp)
>  {
>  if (pm->acpi_memory_hotplug.is_enabled &&
> -object_dynamic_cast(OBJECT(dev), TYPE_PC_DIMM)) {
> +object_dynamic_cast(OBJECT(dev), TYPE_DIMM)) {
>  acpi_memory_unplug_request_cb(&pm->acpi_regs, pm->irq,
>  &pm->acpi_memory_hotplug, dev, errp);
>  } else {
> @@ -468,7 +468,7 @@ void ich9_pm_device_unplug_cb(ICH9LPCPMRegs *pm, DeviceState *dev,
>Error **errp)
>  {
>  if (pm->acpi_memory_hotplug.is_enabled &&
> -object_dynamic_cast(OBJECT(dev), TYPE_PC_DIMM)) {
> +object_dynamic_cast(OBJECT(dev), TYPE_DIMM)) {
>  acpi_memory_unplug_cb(&pm->acpi_memory_hotplug, dev, errp);
>  } else {
>  error_setg(errp, "acpi: device unplug for not supported device"
> diff --git a/hw/acpi/memory_hotplug.c b/hw/acpi/memory_hotplug.c
> index ce428df..e687852 100644
> --- a/hw/acpi/memory_hotplug.c
> +++ b/hw/acpi/memory_hotplug.c
> @@ -54,23 +54,23 @@ static uint64_t acpi_memory_hotplug_read(void *opaque, hwaddr addr,
>  o = OBJECT(mdev->dimm);
>  switch (addr) {
>  case 0x0: /* Lo part of phys address where DIMM is mapped */
> -val = o ? object_property_get_int(o, PC_DIMM_ADDR_PROP, NULL) : 0;
> +val = o ? object_property_get_int(o, DIMM_ADDR_PROP, NULL) : 0;
>  trace_mhp_acpi_read_addr_lo(mem_st->selector, val);
>  break;
>  case 0x4: /* Hi part of phys address where DIMM is mapped */
> -val = o ? object_property_get_int(o, PC_DIMM_ADDR_PROP, NULL) >> 32 : 0;
> +val = o ? object_property_get_int(o, DIMM_ADDR_PROP, NULL) >> 32 : 0;
>  trace_mhp_acpi_read_addr_hi(mem_st->selector, val);
>  break;
>  case 0x8: /* Lo part of DIMM size */
> -val = o ? object_property_get_int(o, PC_DIMM_SIZE_PROP, NULL) : 0;
> +val = o ? object_property_get_int(o, DIMM_SIZE_PROP, NULL) : 0;
>  

Re: [PATCH V6 0/6] Fast mmio eventfd fixes

2015-11-09 Thread Michael S. Tsirkin
On Mon, Nov 09, 2015 at 12:35:45PM +0800, Jason Wang wrote:
> 
> 
> On 11/09/2015 01:11 AM, Michael S. Tsirkin wrote:
> > On Tue, Sep 15, 2015 at 02:41:53PM +0800, Jason Wang wrote:
> >> Hi:
> >>
> >> This series fixes two issues of fast mmio eventfd:
> >>
> >> 1) A single iodev instance was registered on two buses: KVM_MMIO_BUS
> >>and KVM_FAST_MMIO_BUS. This will cause a double free in
> >>ioeventfd_destructor()
> >> 2) A zero length iodev on KVM_MMIO_BUS will never be found by
> >>kvm_io_bus_cmp(). This will lead to e.g. the eventfd being trapped
> >>by qemu instead of the host.
> >>
> >> 1 is fixed by allocating two instances of iodev and introducing a new
> >> capability for userspace. 2 is fixed by ignoring the actual length if
> >> the length of the iodev is zero in kvm_io_bus_cmp().
> >>
> >> Please review.
> >> Changes from V5:
> >> - move patch of explicitly checking for KVM_MMIO_BUS to patch 1 and
> >>   remove the unnecessary checks
> >> - even more grammar and typo fixes
> >> - rebase to kvm.git
> >> - document KVM_CAP_FAST_MMIO
> > What's up with userspace using this capability?
> 
> It was renamed to KVM_CAP_IOEVENTFD_ANY_LENGTH.
> 
> > Did patches ever get posted?
> 
> See https://lkml.org/lkml/2015/9/28/208

Talking about userspace here.
QEMU freeze is approaching, it really should
use this to avoid regressions.


> >
> >> Changes from V4:
> >> - move the location of kvm_assign_ioeventfd() in patch 1 which reduce
> >>   the change set.
> >> - commit log typo fixes
> >> - switch to use kvm_deassign_ioeventfd_id() when failing to register to
> >>   the fast mmio bus
> >> - change kvm_io_bus_cmp() as Paolo's suggestions
> >> - introduce a new capability to avoid new userspace crashing old kernels
> >> - add a new patch that only try to register mmio eventfd on fast mmio
> >>   bus
> >>
> >> Changes from V3:
> >>
> >> - Don't do search on two buses when trying to do write on
> >>   KVM_MMIO_BUS. This fixes a small regression found by vmexit.flat.
> >> - Since we don't do search on two buses, change kvm_io_bus_cmp() to
> >>   let it can find zero length iodevs.
> >> - Fix the unnecessary lines in tracepoint patch.
> >>
> >> Changes from V2:
> >> - Tweak styles and comment suggested by Cornelia.
> >>
> >> Changes from v1:
> >> - change ioeventfd_bus_from_flags() to return KVM_FAST_MMIO_BUS when
> >>   needed to save lots of unnecessary changes.
> >>
> >> Jason Wang (6):
> >>   kvm: don't try to register to KVM_FAST_MMIO_BUS for non mmio eventfd
> >>   kvm: factor out core eventfd assign/deassign logic
> >>   kvm: fix double free for fast mmio eventfd
> >>   kvm: fix zero length mmio searching
> >>   kvm: add tracepoint for fast mmio
> >>   kvm: add fast mmio capabilitiy
> >>
> >>  Documentation/virtual/kvm/api.txt |   7 ++-
> >>  arch/x86/kvm/trace.h  |  18 ++
> >>  arch/x86/kvm/vmx.c|   1 +
> >>  arch/x86/kvm/x86.c|   1 +
> >>  include/uapi/linux/kvm.h  |   1 +
> >>  virt/kvm/eventfd.c| 124 ++
> >>  virt/kvm/kvm_main.c   |  20 +-
> >>  7 files changed, 118 insertions(+), 54 deletions(-)
> >>
> >> -- 
> >> 2.1.4
> > --
> > To unsubscribe from this list: send the line "unsubscribe kvm" in
> > the body of a message to majord...@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v6 19/33] dimm: keep the state of the whole backend memory

2015-11-09 Thread Michael S. Tsirkin
On Fri, Oct 30, 2015 at 01:56:13PM +0800, Xiao Guangrong wrote:
> QEMU keeps the state of a dimm device's memory during live migration;
> however, this is not enough for an nvdimm device, as that memory does
> not contain its label data, so we should protect the whole backend
> memory instead
> 
> Signed-off-by: Xiao Guangrong 

It looks like there's now a difference between
host_memory_backend_get_memory and get_memory_region,
whereas previously they were exactly interchangeable.

This needs some thought, in particular the theoretically
generic dimm.c has to do tricks to accommodate nvdimm.

The missing piece for NVDIMM is the 128k label space at the end,
right?  Can't nvdimm specific code just register that as a
separate RAM chunk in order to migrate it?

> ---
>  hw/mem/dimm.c | 14 --
>  1 file changed, 12 insertions(+), 2 deletions(-)
> 
> diff --git a/hw/mem/dimm.c b/hw/mem/dimm.c
> index 498d380..7d1 100644
> --- a/hw/mem/dimm.c
> +++ b/hw/mem/dimm.c
> @@ -134,9 +134,16 @@ void dimm_memory_plug(DeviceState *dev, MemoryHotplugState *hpms,
>  }
>  
>  memory_region_add_subregion(&hpms->mr, addr - hpms->base, mr);
> -vmstate_register_ram(mr, dev);
>  numa_set_mem_node_id(addr, memory_region_size(mr), dimm->node);
>  
> +/*
> + * saving the state of @mr alone is not enough, as it does not contain
> + * the label data of the NVDIMM device, so we keep the state of the
> + * whole hostmem instead.
> + */
> +vmstate_register_ram(host_memory_backend_get_memory(dimm->hostmem, errp),
> + dev);
> +
>  out:
>  error_propagate(errp, local_err);
>  }
> @@ -145,10 +152,13 @@ void dimm_memory_unplug(DeviceState *dev, MemoryHotplugState *hpms,
> MemoryRegion *mr)
>  {
>  DIMMDevice *dimm = DIMM(dev);
> +MemoryRegion *backend_mr;
> +
> +backend_mr = host_memory_backend_get_memory(dimm->hostmem, &error_abort);
>  
>  numa_unset_mem_node_id(dimm->addr, memory_region_size(mr), dimm->node);
>  memory_region_del_subregion(&hpms->mr, mr);
> -vmstate_unregister_ram(mr, dev);
> +vmstate_unregister_ram(backend_mr, dev);
>  }
>  
>  int qmp_dimm_device_list(Object *obj, void *opaque)
> -- 
> 1.8.3.1
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v6 06/33] acpi: add aml_method_serialized

2015-11-09 Thread Michael S. Tsirkin
On Fri, Oct 30, 2015 at 01:56:00PM +0800, Xiao Guangrong wrote:
> It avoids an explicit Mutex and will be used by NVDIMM ACPI
> 
> Signed-off-by: Xiao Guangrong 

I'd rather you squashed these utility patches in with where
the code is used. This is just making it harder to review
as I have to jump back and forth.

> ---
>  hw/acpi/aml-build.c | 26 --
>  include/hw/acpi/aml-build.h |  1 +
>  2 files changed, 25 insertions(+), 2 deletions(-)
> 
> diff --git a/hw/acpi/aml-build.c b/hw/acpi/aml-build.c
> index 9f792ab..8bee8b2 100644
> --- a/hw/acpi/aml-build.c
> +++ b/hw/acpi/aml-build.c
> @@ -696,14 +696,36 @@ Aml *aml_while(Aml *predicate)
>  }
>  
>  /* ACPI 1.0b: 16.2.5.2 Named Objects Encoding: DefMethod */
> -Aml *aml_method(const char *name, int arg_count)
> +static Aml *__aml_method(const char *name, int arg_count, bool serialized)

Please don't prefix names with __.
What should you call it instead?
For example, you can call it aml_method_serialized.

>  {
>  Aml *var = aml_bundle(0x14 /* MethodOp */, AML_PACKAGE);
> +int methodflags;
> +
> +/*
> + * MethodFlags:
> + *   bit 0-2: ArgCount (0-7)
> + *   bit 3: SerializeFlag
> + * 0: NotSerialized
> + * 1: Serialized
> + *   bit 4-7: reserved (must be 0)
> + */
> +assert(!(arg_count & ~7));

Or shorter assert(arg_count < 8);

> +methodflags = arg_count | (serialized << 3);
>  build_append_namestring(var->buf, "%s", name);
> -build_append_byte(var->buf, arg_count); /* MethodFlags: ArgCount */
> +build_append_byte(var->buf, methodflags);
>  return var;
>  }
>  
> +Aml *aml_method(const char *name, int arg_count)
> +{
> +return __aml_method(name, arg_count, false);
> +}
> +
> +Aml *aml_method_serialized(const char *name, int arg_count)
> +{
> +return __aml_method(name, arg_count, true);
> +}
> +
>  /* ACPI 1.0b: 16.2.5.2 Named Objects Encoding: DefDevice */
>  Aml *aml_device(const char *name_format, ...)
>  {
> diff --git a/include/hw/acpi/aml-build.h b/include/hw/acpi/aml-build.h
> index 5b8a118..00cf40e 100644
> --- a/include/hw/acpi/aml-build.h
> +++ b/include/hw/acpi/aml-build.h
> @@ -263,6 +263,7 @@ Aml *aml_qword_memory(AmlDecode dec, AmlMinFixed min_fixed,
>  Aml *aml_scope(const char *name_format, ...) GCC_FMT_ATTR(1, 2);
>  Aml *aml_device(const char *name_format, ...) GCC_FMT_ATTR(1, 2);
>  Aml *aml_method(const char *name, int arg_count);
> +Aml *aml_method_serialized(const char *name, int arg_count);
>  Aml *aml_if(Aml *predicate);
>  Aml *aml_else(void);
>  Aml *aml_while(Aml *predicate);
> -- 
> 1.8.3.1
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v6 07/33] util: introduce qemu_file_get_page_size()

2015-11-09 Thread Michael S. Tsirkin
On Fri, Oct 30, 2015 at 01:56:01PM +0800, Xiao Guangrong wrote:
> There are three places that use the same logic to get the page size
> from a file path or file fd
> 
> This patch introduces qemu_file_get_page_size() to unify the code
> 
> Signed-off-by: Xiao Guangrong 
> ---
>  include/qemu/osdep.h |  1 +
>  target-ppc/kvm.c | 21 +++--
>  util/oslib-posix.c   | 16 
>  util/oslib-win32.c   |  5 +
>  4 files changed, 25 insertions(+), 18 deletions(-)
> 
> diff --git a/include/qemu/osdep.h b/include/qemu/osdep.h
> index b568424..d4dde02 100644
> --- a/include/qemu/osdep.h
> +++ b/include/qemu/osdep.h
> @@ -302,4 +302,5 @@ int qemu_read_password(char *buf, int buf_size);
>   */
>  pid_t qemu_fork(Error **errp);
>  
> +size_t qemu_file_get_page_size(const char *mem_path);
>  #endif
> diff --git a/target-ppc/kvm.c b/target-ppc/kvm.c
> index ac70f08..c661f1c 100644
> --- a/target-ppc/kvm.c
> +++ b/target-ppc/kvm.c
> @@ -308,28 +308,13 @@ static void kvm_get_smmu_info(PowerPCCPU *cpu, struct 
> kvm_ppc_smmu_info *info)
>  
>  static long gethugepagesize(const char *mem_path)
>  {
> -struct statfs fs;
> -int ret;
> -
> -do {
> -ret = statfs(mem_path, &fs);
> -} while (ret != 0 && errno == EINTR);
> +long size = qemu_file_get_page_size(mem_path);
>  
> -if (ret != 0) {
> -fprintf(stderr, "Couldn't statfs() memory path: %s\n",
> -strerror(errno));
> +if (!size) {
>  exit(1);
>  }
>  
> -#define HUGETLBFS_MAGIC   0x958458f6
> -
> -if (fs.f_type != HUGETLBFS_MAGIC) {
> -/* Explicit mempath, but it's ordinary pages */
> -return getpagesize();
> -}
> -
> -/* It's hugepage, return the huge page size */
> -return fs.f_bsize;
> +return size;
>  }
>  
>  static int find_max_supported_pagesize(Object *obj, void *opaque)
> diff --git a/util/oslib-posix.c b/util/oslib-posix.c
> index 914cef5..ad94c5a 100644
> --- a/util/oslib-posix.c
> +++ b/util/oslib-posix.c
> @@ -360,6 +360,22 @@ static size_t fd_getpagesize(int fd)
>  return getpagesize();
>  }
>  
> +size_t qemu_file_get_page_size(const char *path)
> +{
> +size_t size = 0;
> +int fd = qemu_open(path, O_RDONLY);
> +
> +if (fd < 0) {
> +fprintf(stderr, "Could not open %s.\n", path);
> +goto exit;
> +}
> +
> +size = fd_getpagesize(fd);
> +qemu_close(fd);
> +exit:
> +return size;
> +}
> +
>  void os_mem_prealloc(int fd, char *area, size_t memory)
>  {
>  int ret;

So this is opening the file for the sole purpose of
doing the fstatfs on it. Seems strange, just do statfs instead.
In fact, maybe we want statfs_getpagesize.

> diff --git a/util/oslib-win32.c b/util/oslib-win32.c
> index 09f9e98..a18aa87 100644
> --- a/util/oslib-win32.c
> +++ b/util/oslib-win32.c
> @@ -462,6 +462,11 @@ size_t getpagesize(void)
>  return system_info.dwPageSize;
>  }
>  
> +size_t qemu_file_get_page_size(const char *path)
> +{
> +return getpagesize();
> +}
> +
>  void os_mem_prealloc(int fd, char *area, size_t memory)
>  {
>  int i;

And why is this needed on win32?

> -- 
> 1.8.3.1


Re: [PATCH v6 08/33] exec: allow memory to be allocated from any kind of path

2015-11-09 Thread Michael S. Tsirkin
On Fri, Oct 30, 2015 at 01:56:02PM +0800, Xiao Guangrong wrote:
> Currently file_ram_alloc() is designed for hugetlbfs; however, the memory
> of an nvdimm can come from either a raw pmem device, e.g. /dev/pmem,
> or a file located on a DAX-enabled filesystem
> 
> So this patch let it work on any kind of path
> 
> Signed-off-by: Xiao Guangrong 

So this allows regular memory to be specified directly.
This needs to be split out and merged separately
from acpi/nvdimm bits.

Alternatively, if it's possible to use nvdimm with DAX fs
(similar to hugetlbfs), leave these patches off for now.


> ---
>  exec.c | 56 +---
>  1 file changed, 17 insertions(+), 39 deletions(-)
> 
> diff --git a/exec.c b/exec.c
> index 8af2570..3ca7e50 100644
> --- a/exec.c
> +++ b/exec.c
> @@ -1174,32 +1174,6 @@ void qemu_mutex_unlock_ramlist(void)
>  }
>  
>  #ifdef __linux__
> -
> -#include <sys/vfs.h>
> -
> -#define HUGETLBFS_MAGIC   0x958458f6
> -
> -static long gethugepagesize(const char *path, Error **errp)
> -{
> -struct statfs fs;
> -int ret;
> -
> -do {
> -ret = statfs(path, &fs);
> -} while (ret != 0 && errno == EINTR);
> -
> -if (ret != 0) {
> -error_setg_errno(errp, errno, "failed to get page size of file %s",
> - path);
> -return 0;
> -}
> -
> -if (fs.f_type != HUGETLBFS_MAGIC)
> -fprintf(stderr, "Warning: path not on HugeTLBFS: %s\n", path);
> -
> -return fs.f_bsize;
> -}
> -
>  static void *file_ram_alloc(RAMBlock *block,
>  ram_addr_t memory,
>  const char *path,
> @@ -1210,20 +1184,24 @@ static void *file_ram_alloc(RAMBlock *block,
>  char *c;
>  void *area;
>  int fd;
> -uint64_t hpagesize;
> -Error *local_err = NULL;
> +uint64_t pagesize;
>  
> -hpagesize = gethugepagesize(path, _err);
> -if (local_err) {
> -error_propagate(errp, local_err);
> +pagesize = qemu_file_get_page_size(path);
> +if (!pagesize) {
> +error_setg(errp, "can't get page size for %s", path);
>  goto error;
>  }
> -block->mr->align = hpagesize;
>  
> -if (memory < hpagesize) {
> +if (pagesize == getpagesize()) {
> +fprintf(stderr, "Memory is not allocated from HugeTlbfs.\n");
> +}
> +
> +block->mr->align = pagesize;
> +
> +if (memory < pagesize) {
>  error_setg(errp, "memory size 0x" RAM_ADDR_FMT " must be equal to "
> -   "or larger than huge page size 0x%" PRIx64,
> -   memory, hpagesize);
> +   "or larger than page size 0x%" PRIx64,
> +   memory, pagesize);
>  goto error;
>  }
>  
> @@ -1247,14 +1225,14 @@ static void *file_ram_alloc(RAMBlock *block,
>  fd = mkstemp(filename);
>  if (fd < 0) {
>  error_setg_errno(errp, errno,
> - "unable to create backing store for hugepages");
> + "unable to create backing store for path %s", path);
>  g_free(filename);
>  goto error;
>  }
>  unlink(filename);
>  g_free(filename);

Looks like we are still calling mkstemp/unlink here.
How does this work?

>  
> -memory = ROUND_UP(memory, hpagesize);
> +memory = ROUND_UP(memory, pagesize);
>  
>  /*
>   * ftruncate is not supported by hugetlbfs in older
> @@ -1266,10 +1244,10 @@ static void *file_ram_alloc(RAMBlock *block,
>  perror("ftruncate");
>  }
>  
> -area = qemu_ram_mmap(fd, memory, hpagesize, block->flags & RAM_SHARED);
> +area = qemu_ram_mmap(fd, memory, pagesize, block->flags & RAM_SHARED);
>  if (area == MAP_FAILED) {
>  error_setg_errno(errp, errno,
> - "unable to map backing store for hugepages");
> + "unable to map backing store for path %s", path);
>  close(fd);
>  goto error;
>  }
> -- 
> 1.8.3.1


Re: [PATCH v6 12/33] pc-dimm: remove DEFAULT_PC_DIMMSIZE

2015-11-09 Thread Michael S. Tsirkin
On Fri, Oct 30, 2015 at 01:56:06PM +0800, Xiao Guangrong wrote:
> It's not used any more
> 
> Signed-off-by: Xiao Guangrong 

You should leave the renames and cleanups off for later.
This patchset is large enough as it is.

> ---
>  include/hw/mem/pc-dimm.h | 2 --
>  1 file changed, 2 deletions(-)
> 
> diff --git a/include/hw/mem/pc-dimm.h b/include/hw/mem/pc-dimm.h
> index d83bf30..11a8937 100644
> --- a/include/hw/mem/pc-dimm.h
> +++ b/include/hw/mem/pc-dimm.h
> @@ -20,8 +20,6 @@
>  #include "sysemu/hostmem.h"
>  #include "hw/qdev.h"
>  
> -#define DEFAULT_PC_DIMMSIZE (1024*1024*1024)
> -
>  #define TYPE_PC_DIMM "pc-dimm"
>  #define PC_DIMM(obj) \
>  OBJECT_CHECK(PCDIMMDevice, (obj), TYPE_PC_DIMM)
> -- 
> 1.8.3.1


Re: [PATCH v6 09/33] exec: allow file_ram_alloc to work on file

2015-11-09 Thread Michael S. Tsirkin
On Fri, Oct 30, 2015 at 01:56:03PM +0800, Xiao Guangrong wrote:
> Currently, file_ram_alloc() only works on a directory - it creates a file
> under @path and mmaps it
> 
> This patch allows it to work on a file directly: if @path is a
> directory it works as before, otherwise it treats @path as the target
> file and directly allocates memory from it
> 
> Signed-off-by: Xiao Guangrong 
> ---
>  exec.c | 80 
> ++
>  1 file changed, 51 insertions(+), 29 deletions(-)
> 
> diff --git a/exec.c b/exec.c
> index 3ca7e50..f219010 100644
> --- a/exec.c
> +++ b/exec.c
> @@ -1174,14 +1174,60 @@ void qemu_mutex_unlock_ramlist(void)
>  }
>  
>  #ifdef __linux__
> +static bool path_is_dir(const char *path)
> +{
> +struct stat fs;
> +
> +return stat(path, &fs) == 0 && S_ISDIR(fs.st_mode);

This means a file that doesn't exist is treated as a regular file.
I can't figure out whether that's intentional; it should
be documented in any case.

> +}
> +
> +static int open_file_path(RAMBlock *block, const char *path, size_t size)
> +{
> +char *filename;
> +char *sanitized_name;
> +char *c;
> +int fd;
> +
> +if (!path_is_dir(path)) {
> +int flags = (block->flags & RAM_SHARED) ? O_RDWR : O_RDONLY;

Why does this make sense?

> +
> +flags |= O_EXCL;

And why does this make sense?

> +return open(path, flags);
> +}
> +
> +/* Make name safe to use with mkstemp by replacing '/' with '_'. */
> +sanitized_name = g_strdup(memory_region_name(block->mr));
> +for (c = sanitized_name; *c != '\0'; c++) {
> +if (*c == '/') {
> +*c = '_';
> +}
> +}
> +filename = g_strdup_printf("%s/qemu_back_mem.%s.XXXXXX", path,
> +   sanitized_name);
> +g_free(sanitized_name);
> +fd = mkstemp(filename);
> +if (fd >= 0) {
> +unlink(filename);
> +/*
> + * ftruncate is not supported by hugetlbfs in older
> + * hosts, so don't bother bailing out on errors.
> + * If anything goes wrong with it under other filesystems,
> + * mmap will fail.
> + */
> +if (ftruncate(fd, size)) {
> +perror("ftruncate");
> +}
> +}
> +g_free(filename);
> +
> +return fd;
> +}
> +
>  static void *file_ram_alloc(RAMBlock *block,
>  ram_addr_t memory,
>  const char *path,
>  Error **errp)
>  {
> -char *filename;
> -char *sanitized_name;
> -char *c;
>  void *area;
>  int fd;
>  uint64_t pagesize;
> @@ -1211,38 +1257,14 @@ static void *file_ram_alloc(RAMBlock *block,
>  goto error;
>  }
>  
> -/* Make name safe to use with mkstemp by replacing '/' with '_'. */
> -sanitized_name = g_strdup(memory_region_name(block->mr));
> -for (c = sanitized_name; *c != '\0'; c++) {
> -if (*c == '/')
> -*c = '_';
> -}
> -
> -filename = g_strdup_printf("%s/qemu_back_mem.%s.XXXXXX", path,
> -   sanitized_name);
> -g_free(sanitized_name);
> +memory = ROUND_UP(memory, pagesize);
>  
> -fd = mkstemp(filename);
> +fd = open_file_path(block, path, memory);
>  if (fd < 0) {
>  error_setg_errno(errp, errno,
>   "unable to create backing store for path %s", path);
> -g_free(filename);
>  goto error;
>  }
> -unlink(filename);
> -g_free(filename);
> -
> -memory = ROUND_UP(memory, pagesize);
> -
> -/*
> - * ftruncate is not supported by hugetlbfs in older
> - * hosts, so don't bother bailing out on errors.
> - * If anything goes wrong with it under other filesystems,
> - * mmap will fail.
> - */
> -if (ftruncate(fd, memory)) {
> -perror("ftruncate");
> -}
>  
>  area = qemu_ram_mmap(fd, memory, pagesize, block->flags & RAM_SHARED);
>  if (area == MAP_FAILED) {
> -- 
> 1.8.3.1


Re: [PATCH v6 11/33] hostmem-file: use whole file size if possible

2015-11-09 Thread Michael S. Tsirkin
On Fri, Oct 30, 2015 at 01:56:05PM +0800, Xiao Guangrong wrote:
> Use the whole file size if @size is not specified, which is useful
> if we want to pass a file directly to the guest
> 
> Signed-off-by: Xiao Guangrong 

Better split these simplifications off from the series.

> ---
>  backends/hostmem-file.c | 48 
>  1 file changed, 44 insertions(+), 4 deletions(-)
> 
> diff --git a/backends/hostmem-file.c b/backends/hostmem-file.c
> index 9097a57..e1bc9ff 100644
> --- a/backends/hostmem-file.c
> +++ b/backends/hostmem-file.c
> @@ -9,6 +9,9 @@
>   * This work is licensed under the terms of the GNU GPL, version 2 or later.
>   * See the COPYING file in the top-level directory.
>   */
> +#include 
> +#include 
> +
>  #include "qemu-common.h"
>  #include "sysemu/hostmem.h"
>  #include "sysemu/sysemu.h"
> @@ -33,20 +36,57 @@ struct HostMemoryBackendFile {
>  char *mem_path;
>  };
>  
> +static uint64_t get_file_size(const char *file)
> +{
> +struct stat stat_buf;
> +uint64_t size = 0;
> +int fd;
> +
> +fd = open(file, O_RDONLY);
> +if (fd < 0) {
> +return 0;
> +}
> +
> +if (stat(file, &stat_buf) < 0) {
> +goto exit;
> +}
> +
> +if ((S_ISBLK(stat_buf.st_mode)) && !ioctl(fd, BLKGETSIZE64, &size)) {

You must test S_ISDIR too.

> +goto exit;
> +}
> +
> +size = lseek(fd, 0, SEEK_END);
> +if (size == -1) {
> +size = 0;
> +}
> +exit:
> +close(fd);
> +return size;
> +}
> +
>  static void
>  file_backend_memory_alloc(HostMemoryBackend *backend, Error **errp)
>  {
>  HostMemoryBackendFile *fb = MEMORY_BACKEND_FILE(backend);
>  
> -if (!backend->size) {
> -error_setg(errp, "can't create backend with size 0");
> -return;
> -}
>  if (!fb->mem_path) {
>  error_setg(errp, "mem-path property not set");
>  return;
>  }
>  
> +if (!backend->size) {
> +/*
> + * use the whole file size if @size is not specified.
> + */
> +backend->size = get_file_size(fb->mem_path);
> +}
> +
> +if (!backend->size) {
> +error_setg(errp, "failed to get file size for %s, can't create "
> + "backend on it", fb->mem_path);
> +return;
> +}
> +
>  backend->force_prealloc = mem_prealloc;
> +memory_region_init_ram_from_file(&backend->mr, OBJECT(backend),
>   object_get_canonical_path(OBJECT(backend)),
> -- 
> 1.8.3.1


Re: [PATCH v6 05/33] acpi: add aml_object_type

2015-11-09 Thread Michael S. Tsirkin
On Fri, Oct 30, 2015 at 01:55:59PM +0800, Xiao Guangrong wrote:
> Implement ObjectType which is used by NVDIMM _DSM method in
> later patch
> 
> Signed-off-by: Xiao Guangrong 

I had to go dig in the _DSM patch to see how it's used.
And sure enough, callers have to know AML to make
sense of it. So please don't split out tiny patches like this;
include the callee with the caller.

> ---
>  hw/acpi/aml-build.c | 8 
>  include/hw/acpi/aml-build.h | 1 +
>  2 files changed, 9 insertions(+)
> 
> diff --git a/hw/acpi/aml-build.c b/hw/acpi/aml-build.c
> index efc06ab..9f792ab 100644
> --- a/hw/acpi/aml-build.c
> +++ b/hw/acpi/aml-build.c
> @@ -1178,6 +1178,14 @@ Aml *aml_concatenate(Aml *source1, Aml *source2, Aml 
> *target)
>  return var;
>  }
>  
> +/* ACPI 1.0b: 16.2.5.4 Type 2 Opcodes Encoding: DefObjectType */
> +Aml *aml_object_type(Aml *object)
> +{
> +Aml *var = aml_opcode(0x8E /* ObjectTypeOp */);
> +aml_append(var, object);
> +return var;
> +}
> +

It would be better to have a higher-level API
that can be used without knowing AML.
For example:

aml_object_type_is_package()


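A sketch of what such a wrapper could look like on top of the patch's aml_object_type(). The helper name is the reviewer's suggestion and hypothetical; aml_lequal() and aml_int() are existing aml-build calls, and 4 is the ACPI ObjectType return value for a Package:

```c
/* Hypothetical wrapper: builds "ObjectType(object) == 4", i.e.
 * "is this a Package?", so callers don't need to know AML type codes. */
Aml *aml_object_type_is_package(Aml *object)
{
    return aml_lequal(aml_object_type(object), aml_int(4 /* Package */));
}
```

This keeps the AML spec knowledge (the type-code table) inside aml-build, which is the point of the review comment.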

>  void
>  build_header(GArray *linker, GArray *table_data,
>   AcpiTableHeader *h, const char *sig, int len, uint8_t rev)
> diff --git a/include/hw/acpi/aml-build.h b/include/hw/acpi/aml-build.h
> index 325782d..5b8a118 100644
> --- a/include/hw/acpi/aml-build.h
> +++ b/include/hw/acpi/aml-build.h
> @@ -278,6 +278,7 @@ Aml *aml_derefof(Aml *arg);
>  Aml *aml_sizeof(Aml *arg);
>  Aml *aml_create_field(Aml *srcbuf, Aml *index, Aml *len, const char *name);
>  Aml *aml_concatenate(Aml *source1, Aml *source2, Aml *target);
> +Aml *aml_object_type(Aml *object);
>  
>  void
>  build_header(GArray *linker, GArray *table_data,
> -- 
> 1.8.3.1

