Re: [PATCH v5 6/9] x86: prevent inline distortion by paravirt ops

2018-06-19 Thread kbuild test robot
Hi Nadav,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on linus/master]
[also build test ERROR on v4.18-rc1 next-20180619]
[cannot apply to tip/x86/core]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url:
https://github.com/0day-ci/linux/commits/Nadav-Amit/x86-macrofying-inline-asm-for-better-compilation/20180620-112043
config: i386-randconfig-x016-201824 (attached as .config)
compiler: gcc-7 (Debian 7.3.0-16) 7.3.0
reproduce:
# save the attached .config to linux build tree
make ARCH=i386 

All errors (new ones prefixed by >>):

   arch/x86/include/asm/paravirt.h: Assembler messages:
>> arch/x86/include/asm/paravirt.h:253: Error: can't mix positional and keyword arguments
--
   arch/x86/include/asm/paravirt.h: Assembler messages:
   arch/x86/include/asm/paravirt.h:29: Error: can't mix positional and keyword arguments
   arch/x86/include/asm/paravirt.h:29: Error: can't mix positional and keyword arguments
   arch/x86/include/asm/paravirt.h:29: Error: can't mix positional and keyword arguments
   arch/x86/include/asm/paravirt.h:29: Error: can't mix positional and keyword arguments
   arch/x86/include/asm/paravirt.h:29: Error: can't mix positional and keyword arguments
   arch/x86/include/asm/paravirt.h:29: Error: can't mix positional and keyword arguments
   arch/x86/include/asm/paravirt.h:29: Error: can't mix positional and keyword arguments
   arch/x86/include/asm/paravirt.h:29: Error: can't mix positional and keyword arguments
   arch/x86/include/asm/paravirt.h:29: Error: can't mix positional and keyword arguments
   arch/x86/include/asm/paravirt.h:29: Error: can't mix positional and keyword arguments
   arch/x86/include/asm/paravirt.h:29: Error: can't mix positional and keyword arguments
   arch/x86/include/asm/paravirt.h:29: Error: can't mix positional and keyword arguments
   arch/x86/include/asm/paravirt.h:29: Error: can't mix positional and keyword arguments
   arch/x86/include/asm/paravirt.h:29: Error: can't mix positional and keyword arguments
   arch/x86/include/asm/paravirt.h:29: Error: can't mix positional and keyword arguments
   arch/x86/include/asm/paravirt.h:29: Error: can't mix positional and keyword arguments
   arch/x86/include/asm/paravirt.h:29: Error: can't mix positional and keyword arguments
   arch/x86/include/asm/paravirt.h:29: Error: can't mix positional and keyword arguments
   arch/x86/include/asm/paravirt.h:29: Error: can't mix positional and keyword arguments
   arch/x86/include/asm/paravirt.h:29: Error: can't mix positional and keyword arguments
   arch/x86/include/asm/paravirt.h:29: Error: can't mix positional and keyword arguments
   arch/x86/include/asm/paravirt.h:29: Error: can't mix positional and keyword arguments
   arch/x86/include/asm/paravirt.h:29: Error: can't mix positional and keyword arguments
   arch/x86/include/asm/paravirt.h:29: Error: can't mix positional and keyword arguments
>> arch/x86/include/asm/paravirt.h:253: Error: can't mix positional and keyword arguments
   arch/x86/include/asm/paravirt.h:29: Error: can't mix positional and keyword arguments

vim +253 arch/x86/include/asm/paravirt.h

014b15be include/asm-x86/paravirt.h  Glauber de Oliveira Costa 2008-01-30  249  
014b15be include/asm-x86/paravirt.h  Glauber de Oliveira Costa 2008-01-30  250  static inline void write_gdt_entry(struct desc_struct *dt, int entry,
014b15be include/asm-x86/paravirt.h  Glauber de Oliveira Costa 2008-01-30  251  				   void *desc, int type)
f8822f42 include/asm-i386/paravirt.h Jeremy Fitzhardinge       2007-05-02  252  {
014b15be include/asm-x86/paravirt.h  Glauber de Oliveira Costa 2008-01-30 @253  	PVOP_VCALL4(pv_cpu_ops.write_gdt_entry, dt, entry, desc, type);
f8822f42 include/asm-i386/paravirt.h Jeremy Fitzhardinge       2007-05-02  254  }
014b15be include/asm-x86/paravirt.h  Glauber de Oliveira Costa 2008-01-30  255  

:: The code at line 253 was first introduced by commit
:: 014b15be30c04622d130946ab7c0a9101b523a8a x86: change write_gdt_entry signature.

:: TO: Glauber de Oliveira Costa 
:: CC: Ingo Molnar 
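
For reference, here is a minimal stand-alone sketch of the gas behaviour behind the diagnostic above (an illustrative file of my own, not the kernel's PVOP macros): gas rejects a macro invocation in which a positional argument follows a keyword (name=value) argument, which, as far as I can tell, is how the expanded paravirt macro arguments end up being misparsed here.

/*
 * Illustrative repro only, not kernel code.  Assemble with
 * "gcc -c gas-macro-repro.c" to see the diagnostic from the last case.
 */
asm(".macro store val, reg\n\t"
    "movl \\val, \\reg\n"
    ".endm");

void all_positional(void)
{
	asm("store $1, %eax");		/* OK: both arguments positional */
}

void all_keyword(void)
{
	asm("store val=$1, reg=%eax");	/* OK: both arguments keyword */
}

void mixed(void)
{
	/* keyword argument followed by a positional one:
	 * Error: can't mix positional and keyword arguments */
	asm("store val=$1, %eax");
}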

---
0-DAY kernel test infrastructure            Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all   Intel Corporation



Re: [virtio-dev] Re: [Qemu-devel] [PATCH] qemu: Introduce VIRTIO_NET_F_STANDBY feature bit to virtio_net

2018-06-19 Thread Michael S. Tsirkin
On Tue, Jun 19, 2018 at 12:54:53PM +0200, Cornelia Huck wrote:
> Sorry about dragging mainframes into this, but this will only work for
> homogeneous device coupling, not for heterogeneous. Consider my vfio-pci
> + virtio-net-ccw example again: The guest cannot find out that the two
> belong together by checking some group ID, it has to either use the MAC
> or some needs-to-be-architectured property.
> 
> Alternatively, we could propose that mechanism as pci-only, which means
> we can rely on mechanisms that won't necessarily work on non-pci
> transports. (FWIW, I don't see a use case for using vfio-ccw to pass
> through a network card anytime in the near future, due to the nature of
> network cards currently in use on s390.)

That's what it boils down to, yes.  If there's need to have this for
non-pci devices, then we should put it in config space.
Cornelia, what do you think?

-- 
MST


Re: [virtio-dev] Re: [Qemu-devel] [PATCH] qemu: Introduce VIRTIO_NET_F_STANDBY feature bit to virtio_net

2018-06-19 Thread Siwei Liu
On Tue, Jun 19, 2018 at 3:54 AM, Cornelia Huck  wrote:
> On Fri, 15 Jun 2018 10:06:07 -0700
> Siwei Liu  wrote:
>
>> On Fri, Jun 15, 2018 at 4:48 AM, Cornelia Huck  wrote:
>> > On Thu, 14 Jun 2018 18:57:11 -0700
>> > Siwei Liu  wrote:
>> >
>> >> Thank you for sharing your thoughts, Cornelia. With questions below, I
>> >> think you raised really good points, some of which I don't have answer
>> >> yet and would also like to explore here.
>> >>
>> >> First off, I don't want to push the discussion to the extreme at this
>> >> point, or sell anything about having QEMU manage everything
>> >> automatically. Don't get me wrong, it's not there yet. Let's not
>> >> assume we are tied to a specific or concrete solution. I think the key
>> >> for our discussion might be to define or refine the boundary between
>> >> VM and guest,  e.g. what each layer is expected to control and manage
>> >> exactly.
>> >>
>> >> In my view, there might be possibly 3 different options to represent
>> >> the failover device concept to QEMU and libvirt (or any upper layer
>> >> software):
>> >>
>> >> a. Separate device: in this model, virtio and passthrough remain
>> >> separate devices just as today. QEMU exposes the standby feature bit
>> >> for virtio, and publish status/event around the negotiation process of
>> >> this feature bit for libvirt to react upon. Since Libvirt has the
>> >> pairing relationship itself, maybe through MAC address or something
>> >> else, it can control the presence of primary by hot plugging or
>> >> unplugging the passthrough device, although it has to work tightly
>> >> with virtio's feature negotiation process. Not just for migration but
>> >> also various corner scenarios (driver/feature ok, device reset,
>> >> reboot, legacy guest etc) along virtio's feature negotiation.
>> >
>> > Yes, that one has obvious tie-ins to virtio's modus operandi.
>> >
>> >>
>> >> b. Coupled device: in this model, virtio and passthrough devices are
>> >> weakly coupled using some group ID, i.e. QEMU matches the passthrough
>> >> device for a standby virtio instance by comparing the group ID value
>> >> present behind each device's bridge. Libvirt provides QEMU the group
>> >> ID for both type of devices, and only deals with hot plug for
>> >> migration, by checking some migration status exposed (e.g. the feature
>> >> negotiation status on the virtio device) by QEMU. QEMU manages the
>> >> visibility of the primary in guest along virtio's feature negotiation
>> >> process.
>> >
>> > I'm a bit confused here. What, exactly, ties the two devices together?
>>
>> The group UUID. Since the QEMU VFIO device has no insight into the MAC
>> address (which it doesn't have to), the association between the VFIO
>> passthrough and standby must be specified for QEMU to understand the
>> relationship with this model. Note that the standby feature is no longer
>> required to be exposed under this model.
>
> Isn't that a bit limiting, though?
>
> With this model, you can probably tie a vfio-pci device and a
> virtio-net-pci device together. But this will fail if you have
> different transports: Consider tying together a vfio-pci device and a
> virtio-net-ccw device on s390, for example. The standby feature bit is
> on the virtio-net level and should not have any dependency on the
> transport used.

Probably we'd limit the support for grouping to the virtio-net-pci and
vfio-pci devices only. For virtio-net-pci, as you might see in Venu's
patch, we store the group UUID in the config space of virtio-pci, which
is only applicable to the PCI transport.

If virtio-net-ccw needs to support the same, I think a similar grouping
interface should be defined on the virtio CCW transport. The current
implementation of the Linux failover driver assumes that the device to
pair with is an SR-IOV VF with the same MAC address as the virtio-net-pci
device, and that the PV path is on the same PF, so the network does not
need to be told about a port-MAC association change. If we need to extend
the grouping mechanism to virtio-net-ccw, it would have to pass such a
failover mode to the virtio driver through some other option, I guess.
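
To make model (b) a little more concrete, a purely hypothetical sketch of carrying a group identifier in virtio-pci vendor-specific config space follows; the structure name and layout are illustrative only and are neither Venu's actual patch nor anything in the virtio specification:

#include <stdint.h>

/* Hypothetical layout, for illustration only: a vendor-specific PCI
 * capability on the virtio-net-pci (standby) device carrying the group
 * UUID that QEMU would match against the vfio-pci (primary) device. */
struct virtio_pci_group_id_cap {
	uint8_t  cap_vndr;	/* PCI_CAP_ID_VNDR */
	uint8_t  cap_next;	/* link to the next capability */
	uint8_t  cap_len;	/* length of this structure */
	uint8_t  cfg_type;	/* hypothetical "group id" config type */
	uint8_t  uuid[16];	/* UUID shared by the paired devices */
};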

>
>>
>> > If libvirt already has the knowledge that it should manage the two as a
>> > couple, why do we need the group id (or something else for other
>> > architectures)? (Maybe I'm simply missing something because I'm not
>> > that familiar with pci.)
>>
>> The idea is to have QEMU control the visibility and enumeration order
>> of the passthrough VFIO for the failover scenario. Hotplug can be one
>> way to achieve it, and perhaps there are other ways as well. The
>> group ID is not just for QEMU to couple devices; it's also helpful to
>> the guest, as grouping by MAC address is just not safe.
>
> Sorry about dragging mainframes into this, but this will only work for
> homogeneous device coupling, not for heterogeneous. Consider my vfio-pci
> + virtio-net-ccw example again: The guest cannot find out that the two
> belong together by checking some group ID, it has to 

Re: Design Decision for KVM based anti rootkit

2018-06-19 Thread Ahmed Soliman
On 19 June 2018 at 19:37, David Vrabel  wrote:
> It's not clear how this increases security. What threats is this
> protecting against?
It won't completely prevent rootkits, because rootkits can still edit
dynamic kernel data structures, but it will limit the damage they can do
to dynamic data only. This way system calls and interrupt tables can't
be changed.
> As an attacker, modifying the sensitive pages (kernel text?) will
> require either: a) altering the existing mappings for these (to make
> them read-write or user-writable for example); or b) creating aliased
> mappings with suitable permissions.
>
> If the attacker can modify page tables in this way then it can also
> bypass the suggested hypervisor's read-only protection by changing the
> mappings to point to a unprotected page.

I think I was missing this part, but I meant to say we would completely
prevent any modification to those pages, including the guest physical
address to guest virtual address mapping for the protected pages. Another
tricky solution (something that just popped into my mind right now, better
to say it than to forget it) is making new memory mappings inherit the
same protection as the old ones; I assume the hypervisor can do either of
these things. Also, that was the kind of performance hit I was talking
about. I am not sure whether that might break things, or rather I can say
it will for sure heavily limit some functionality, like maybe hibernating
the guest. But that is the kind of trade-off I am expecting, at least at
the beginning.
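
As a rough sketch of the kind of guest-to-hypervisor request being discussed here (the hypercall number and semantics are purely hypothetical, not an existing KVM ABI): the guest would ask the hypervisor to write-protect a range of guest-physical pages at the EPT/NPT level, so that no later change to the guest's own page tables can regain write access to them.

#include <asm/kvm_para.h>	/* kvm_hypercall2() guest-side helper */

#define KVM_HC_PROTECT_RO	100	/* hypothetical hypercall number */

/* Ask the hypervisor to make [gpa, gpa + npages * PAGE_SIZE) read-only
 * in the second-level page tables (sketch only). */
static long protect_kernel_text(unsigned long gpa, unsigned long npages)
{
	return kvm_hypercall2(KVM_HC_PROTECT_RO, gpa, npages);
}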


Re: [PATCH v4 6/9] x86: prevent inline distortion by paravirt ops

2018-06-19 Thread Juergen Gross
On 12/06/18 13:50, Nadav Amit wrote:
> GCC considers the number of statements in inlined assembly blocks,
> according to new-lines and semicolons, as an indication to the cost of
> the block in time and space. This data is distorted by the kernel code,
> which puts information in alternative sections. As a result, the
> compiler may perform incorrect inlining and branch optimizations.
> 
> The solution is to set an assembly macro and call it from the inlined
> assembly block. As a result GCC considers the inline assembly block as
> a single instruction.
> 
> The effect of the patch is a more aggressive inlining, which also
> causes a size increase of the kernel.
> 
>text  data bss dec hex filename
> 18147336 10226688 2957312 31331336 1de1408 ./vmlinux before
> 18162555 10226288 2957312 31346155 1de4deb ./vmlinux after (+14819)
> 
> Static text symbols:
> Before:   40053
> After:39942   (-111)
> 
> Cc: Juergen Gross 
> Cc: Alok Kataria 
> Cc: Thomas Gleixner 
> Cc: Ingo Molnar 
> Cc: "H. Peter Anvin" 
> Cc: x...@kernel.org
> Cc: virtualization@lists.linux-foundation.org
> 
> Signed-off-by: Nadav Amit 

Reviewed-by: Juergen Gross 


Juergen
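
For readers who want the shape of the trick described in the quoted commit message, here is a simplified sketch (illustrative names and section, not the kernel's actual PVOP/PARAVIRT_CALL macros): the multi-line body, including the side-band section record, moves into an assembler macro, so the inline asm statement that GCC sizes up is a single macro invocation.

/* Sketch only; the macro name and section are illustrative. */
asm(".macro PV_SITE insn:vararg\n"
    "771:\n\t"
    "\\insn\n\t"
    ".pushsection .discard.pv_sites, \"a\"\n\t"
    ".long 771b\n\t"
    ".popsection\n"
    ".endm");

static inline void pv_site_nop(void)
{
	/* From GCC's cost model this is now one short statement, even though
	 * the assembler still expands it to the full annotated sequence. */
	asm volatile("PV_SITE nop" ::: "memory");
}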


Re: [PATCH v5 0/3] extern inline native_save_fl for paravirt

2018-06-19 Thread Juergen Gross
On 13/06/18 23:05, Nick Desaulniers wrote:
> paravirt depends on a custom calling convention (callee saved), but
> expects this from a static inline function that it then forces to be
> outlined. This is problematic because different compilers or flags can
> then add a stack guard that violates the calling conventions.
> 
> Uses extern inline with the out-of-line definition in assembly to
> prevent compilers from adding stack guards to the outlined version.
> 
> Other parts of the codebase overwrite KBUILD_CFLAGS, which is *extremely
> problematic* for extern inline, as the semantics are completely the
> opposite depending on what C standard is used.
> http://blahg.josefsipek.net/?p=529
> 
> Changes since v4:
>   Take Arnd's and hpa's suggestions to properly feature-detect the gnu_inline
>   attribute to support gcc 4.1.
> 
> Changes since v3:
>   Take Joe's suggestion to hoist __inline__ and __inline out of
>   conditional block.
> 
> Changes since v2:
>   Take hpa's _ASM_ARG patch into the set in order to simplify cross
>   32b/64b x86 assembly and rebase my final patch to use it.  Apply
>   Sedat's typo fix to commit message and sign off on it. Take Joe's
>   suggestion to simplify __inline__ and __inline. Add Arnd to list of
>   folks who made helpful suggestions.
> 
> Changes since v1:
>   Prefer gnu_inline function attribute instead of explicitly setting C
>   standard compiler flag in problematic Makefiles. We should instead
>   carefully evaluate if those Makefiles should be overwriting
>   KBUILD_CFLAGS at all. Dropped the previous first two patches and added
>   a new first patch.
> 
> H. Peter Anvin (1):
>   x86/asm: add _ASM_ARG* constants for argument registers to 
> 
> Nick Desaulniers (2):
>   compiler-gcc.h: add gnu_inline to all inline declarations
>   x86: paravirt: make native_save_fl extern inline
> 
>  arch/x86/include/asm/asm.h  | 59 +
>  arch/x86/include/asm/irqflags.h |  2 +-
>  arch/x86/kernel/Makefile|  1 +
>  arch/x86/kernel/irqflags.S  | 26 +++
>  include/linux/compiler-gcc.h| 29 
>  5 files changed, 109 insertions(+), 8 deletions(-)
>  create mode 100644 arch/x86/kernel/irqflags.S
> 

For the series:

Acked-by: Juergen Gross 


Juergen
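
A condensed sketch of the pattern the cover letter describes (simplified here; in the series itself gnu_inline is wired up through compiler-gcc.h and the out-of-line symbol is provided by the new arch/x86/kernel/irqflags.S):

/* With gnu_inline, "extern inline" provides a body for inlining only and
 * never emits an out-of-line copy from C.  The only callable
 * native_save_fl symbol is then the hand-written assembly one, so no
 * compiler-inserted stack guard can break the callee-saved convention
 * that paravirt relies on. */
extern inline __attribute__((gnu_inline)) unsigned long native_save_fl(void)
{
	unsigned long flags;

	asm volatile("pushf ; pop %0"
		     : "=rm" (flags)
		     : /* no input */
		     : "memory");
	return flags;
}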


Re: [virtio-dev] Re: [PATCH v33 2/4] virtio-balloon: VIRTIO_BALLOON_F_FREE_PAGE_HINT

2018-06-19 Thread Michael S. Tsirkin
On Tue, Jun 19, 2018 at 08:13:37PM +0800, Wei Wang wrote:
> On 06/19/2018 11:05 AM, Michael S. Tsirkin wrote:
> > On Tue, Jun 19, 2018 at 01:06:48AM +, Wang, Wei W wrote:
> > > On Monday, June 18, 2018 10:29 AM, Michael S. Tsirkin wrote:
> > > > On Sat, Jun 16, 2018 at 01:09:44AM +, Wang, Wei W wrote:
> > > > > Not necessarily, I think. We have min(4m_page_blocks / 512, 1024) 
> > > > > above,
> > > > so the maximum memory that can be reported is 2TB. For larger guests, 
> > > > e.g.
> > > > 4TB, the optimization can still offer 2TB free memory (better than no
> > > > optimization).
> > > > 
> > > > Maybe it's better, maybe it isn't. It certainly muddies the waters even 
> > > > more.
> > > > I'd rather we had a better plan. From that POV I like what Matthew 
> > > > Wilcox
> > > > suggested for this which is to steal the necessary # of entries off the 
> > > > list.
> > > Actually what Matthew suggested doesn't make a difference here. That 
> > > method always steal the first free page blocks, and sure can be changed 
> > > to take more. But all these can be achieved via kmalloc
> > I'd do get_user_pages really. You don't want pages split, etc.

Oops sorry. I meant get_free_pages .

> 
> > > by the caller which is more prudent and makes the code more 
> > > straightforward. I think we don't need to take that risk unless the MM 
> > > folks strongly endorse that approach.
> > > 
> > > The max size of the kmalloc-ed memory is 4MB, which gives us the 
> > > limitation that the max free memory to report is 2TB. Back to the 
> > > motivation of this work, the cloud guys want to use this optimization to 
> > > accelerate their guest live migration. 2TB guests are not common in 
> > > today's clouds. When huge guests become common in the future, we can 
> > > easily tweak this API to fill hints into scattered buffer (e.g. several 
> > > 4MB arrays passed to this API) instead of one as in this version.
> > > 
> > > This limitation doesn't cause any issue from functionality perspective. 
> > > For the extreme case like a 100TB guest live migration which is 
> > > theoretically possible today, this optimization helps skip 2TB of its 
> > > free memory. This result is that it may reduce only 2% live migration 
> > > time, but still better than not skipping the 2TB (if not using the 
> > > feature).
> > Not clearly better, no, since you are slowing the guest.
> 
> Not really. Live migration slows down the guest itself. It seems that the
> guest spends a little extra time reporting free pages, but in return the
> live migration time gets reduced a lot, which makes the guest endure less
> from live migration. (there is no drop of the workload performance when
> using the optimization in the tests)

My point was you can't say what is better without measuring.
Without special limitations you have hint overhead vs migration
overhead. I think we need to  build to scale to huge guests.
We might discover scalability problems down the road,
but no sense in building in limitations straight away.

> 
> 
> > 
> > 
> > > So, for the first release of this feature, I think it is better to have 
> > > the simpler and more straightforward solution as we have now, and clearly 
> > > document why it can report up to 2TB free memory.
> > No one has the time to read documentation about how an internal flag
> > within a device works. Come on, getting two pages isn't much harder
> > than a single one.
> 
> > > > If that doesn't fly, we can allocate out of the loop and just retry 
> > > > with more
> > > > pages.
> > > > 
> > > > > On the other hand, large guests being large mostly because the guests 
> > > > > need
> > > > to use large memory. In that case, they usually won't have that much 
> > > > free
> > > > memory to report.
> > > > 
> > > > And following this logic small guests don't have a lot of memory to 
> > > > report at
> > > > all.
> > > > Could you remind me why are we considering this optimization then?
> > > If there is a 3TB guest, it is 3TB not 2TB mostly because it would need 
> > > to use e.g. 2.5TB memory from time to time. In the worst case, it only 
> > > has 0.5TB free memory to report, but reporting 0.5TB with this 
> > > optimization is better than no optimization. (and the current 2TB 
> > > limitation isn't a limitation for the 3TB guest in this case)
> > I'd rather not spend time writing up random limitations.
> 
> This is not a random limitation. It would be more clear to see the code.

Users don't see code though, that's the point.

Exporting internal limitations from code to users isn't great.


> Also I'm not sure how get_user_pages could be used in our case, and what you
> meant by "getting two pages". I'll post out a new version, and we can
> discuss on the code.

Sorry, I meant get_free_pages.

> 
> Best,
> Wei
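
For reference, the 2TB figure being debated above falls out of simple arithmetic, assuming (as the quoted text suggests) a hint buffer capped at a 4MB kmalloc, 8 bytes per hint, and one hint per MAX_ORDER (4MB) free-page block:

#include <stdio.h>

int main(void)
{
	unsigned long long buf_bytes = 4ULL << 20;		/* 4MB hint buffer  */
	unsigned long long entries   = buf_bytes / 8;		/* 524,288 hints    */
	unsigned long long covered   = entries * (4ULL << 20);	/* bytes described  */

	printf("free memory coverable by one buffer: %llu TB\n", covered >> 40);
	return 0;	/* prints 2 */
}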


Re: [virtio-dev] Re: [PATCH v33 2/4] virtio-balloon: VIRTIO_BALLOON_F_FREE_PAGE_HINT

2018-06-19 Thread Wei Wang

On 06/19/2018 11:05 AM, Michael S. Tsirkin wrote:

> On Tue, Jun 19, 2018 at 01:06:48AM +, Wang, Wei W wrote:
> > On Monday, June 18, 2018 10:29 AM, Michael S. Tsirkin wrote:
> > > On Sat, Jun 16, 2018 at 01:09:44AM +, Wang, Wei W wrote:
> > > > Not necessarily, I think. We have min(4m_page_blocks / 512, 1024) above,
> > > so the maximum memory that can be reported is 2TB. For larger guests, e.g.
> > > 4TB, the optimization can still offer 2TB free memory (better than no
> > > optimization).
> > > 
> > > Maybe it's better, maybe it isn't. It certainly muddies the waters even more.
> > > I'd rather we had a better plan. From that POV I like what Matthew Wilcox
> > > suggested for this which is to steal the necessary # of entries off the list.
> > 
> > Actually what Matthew suggested doesn't make a difference here. That method
> > always steal the first free page blocks, and sure can be changed to take more.
> > But all these can be achieved via kmalloc
> 
> I'd do get_user_pages really. You don't want pages split, etc.
> 
> > by the caller which is more prudent and makes the code more straightforward. I
> > think we don't need to take that risk unless the MM folks strongly endorse that
> > approach.
> > 
> > The max size of the kmalloc-ed memory is 4MB, which gives us the limitation
> > that the max free memory to report is 2TB. Back to the motivation of this work,
> > the cloud guys want to use this optimization to accelerate their guest live
> > migration. 2TB guests are not common in today's clouds. When huge guests become
> > common in the future, we can easily tweak this API to fill hints into scattered
> > buffer (e.g. several 4MB arrays passed to this API) instead of one as in this
> > version.
> > 
> > This limitation doesn't cause any issue from functionality perspective. For the
> > extreme case like a 100TB guest live migration which is theoretically possible
> > today, this optimization helps skip 2TB of its free memory. This result is that
> > it may reduce only 2% live migration time, but still better than not skipping
> > the 2TB (if not using the feature).
> 
> Not clearly better, no, since you are slowing the guest.

Not really. Live migration slows down the guest itself. It seems that 
the guest spends a little extra time reporting free pages, but in return 
the live migration time gets reduced a lot, which makes the guest endure 
less from live migration. (there is no drop of the workload performance 
when using the optimization in the tests)

> > So, for the first release of this feature, I think it is better to have the
> > simpler and more straightforward solution as we have now, and clearly document
> > why it can report up to 2TB free memory.
> 
> No one has the time to read documentation about how an internal flag
> within a device works. Come on, getting two pages isn't much harder
> than a single one.
> 
> > > If that doesn't fly, we can allocate out of the loop and just retry with more
> > > pages.
> > > 
> > > > On the other hand, large guests being large mostly because the guests need
> > > to use large memory. In that case, they usually won't have that much free
> > > memory to report.
> > > 
> > > And following this logic small guests don't have a lot of memory to report at
> > > all.
> > > Could you remind me why are we considering this optimization then?
> > 
> > If there is a 3TB guest, it is 3TB not 2TB mostly because it would need to use
> > e.g. 2.5TB memory from time to time. In the worst case, it only has 0.5TB free
> > memory to report, but reporting 0.5TB with this optimization is better than no
> > optimization. (and the current 2TB limitation isn't a limitation for the 3TB
> > guest in this case)
> 
> I'd rather not spend time writing up random limitations.


This is not a random limitation. It would be more clear to see the code. 
Also I'm not sure how get_user_pages could be used in our case, and what 
you meant by "getting two pages". I'll post out a new version, and we 
can discuss on the code.



Best,
Wei


Re: [virtio-dev] Re: [Qemu-devel] [PATCH] qemu: Introduce VIRTIO_NET_F_STANDBY feature bit to virtio_net

2018-06-19 Thread Cornelia Huck
On Fri, 15 Jun 2018 10:06:07 -0700
Siwei Liu  wrote:

> On Fri, Jun 15, 2018 at 4:48 AM, Cornelia Huck  wrote:
> > On Thu, 14 Jun 2018 18:57:11 -0700
> > Siwei Liu  wrote:
> >  
> >> Thank you for sharing your thoughts, Cornelia. With questions below, I
> >> think you raised really good points, some of which I don't have answer
> >> yet and would also like to explore here.
> >>
> >> First off, I don't want to push the discussion to the extreme at this
> >> point, or sell anything about having QEMU manage everything
> >> automatically. Don't get me wrong, it's not there yet. Let's not
> >> assume we are tied to a specific or concrete solution. I think the key
> >> for our discussion might be to define or refine the boundary between
> >> VM and guest,  e.g. what each layer is expected to control and manage
> >> exactly.
> >>
> >> In my view, there might be possibly 3 different options to represent
> >> the failover device concept to QEMU and libvirt (or any upper layer
> >> software):
> >>
> >> a. Separate device: in this model, virtio and passthrough remain
> >> separate devices just as today. QEMU exposes the standby feature bit
> >> for virtio, and publish status/event around the negotiation process of
> >> this feature bit for libvirt to react upon. Since Libvirt has the
> >> pairing relationship itself, maybe through MAC address or something
> >> else, it can control the presence of primary by hot plugging or
> >> unplugging the passthrough device, although it has to work tightly
> >> with virtio's feature negotiation process. Not just for migration but
> >> also various corner scenarios (driver/feature ok, device reset,
> >> reboot, legacy guest etc) along virtio's feature negotiation.  
> >
> > Yes, that one has obvious tie-ins to virtio's modus operandi.
> >  
> >>
> >> b. Coupled device: in this model, virtio and passthrough devices are
> >> weakly coupled using some group ID, i.e. QEMU matches the passthrough
> >> device for a standby virtio instance by comparing the group ID value
> >> present behind each device's bridge. Libvirt provides QEMU the group
> >> ID for both type of devices, and only deals with hot plug for
> >> migration, by checking some migration status exposed (e.g. the feature
> >> negotiation status on the virtio device) by QEMU. QEMU manages the
> >> visibility of the primary in guest along virtio's feature negotiation
> >> process.  
> >
> > I'm a bit confused here. What, exactly, ties the two devices together?  
> 
> The group UUID. Since the QEMU VFIO device has no insight into the MAC
> address (which it doesn't have to), the association between the VFIO
> passthrough and standby must be specified for QEMU to understand the
> relationship with this model. Note that the standby feature is no longer
> required to be exposed under this model.

Isn't that a bit limiting, though?

With this model, you can probably tie a vfio-pci device and a
virtio-net-pci device together. But this will fail if you have
different transports: Consider tying together a vfio-pci device and a
virtio-net-ccw device on s390, for example. The standby feature bit is
on the virtio-net level and should not have any dependency on the
transport used.

> 
> > If libvirt already has the knowledge that it should manage the two as a
> > couple, why do we need the group id (or something else for other
> > architectures)? (Maybe I'm simply missing something because I'm not
> > that familiar with pci.)  
> 
> The idea is to have QEMU control the visibility and enumeration order
> of the passthrough VFIO for the failover scenario. Hotplug can be one
> way to achieve it, and perhaps there are other ways as well. The
> group ID is not just for QEMU to couple devices; it's also helpful to
> the guest, as grouping by MAC address is just not safe.

Sorry about dragging mainframes into this, but this will only work for
homogeneous device coupling, not for heterogeneous. Consider my vfio-pci
+ virtio-net-ccw example again: The guest cannot find out that the two
belong together by checking some group ID, it has to either use the MAC
or some needs-to-be-architectured property.

Alternatively, we could propose that mechanism as pci-only, which means
we can rely on mechanisms that won't necessarily work on non-pci
transports. (FWIW, I don't see a use case for using vfio-ccw to pass
through a network card anytime in the near future, due to the nature of
network cards currently in use on s390.)

> 
> >  
> >>
> >> c. Fully combined device: in this model, virtio and passthrough devices
> >> are viewed as a single VM interface altogther. QEMU not just controls
> >> the visibility of the primary in guest, but can also manage the
> >> exposure of the passthrough for migratability. It can be like that
> >> libvirt supplies the group ID to QEMU. Or libvirt does not even have
> >> to provide group ID for grouping the two devices, if just one single
> >> combined device is exposed by QEMU. In either case, QEMU manages all
> >> aspect of such internal