Re: [PATCH 0/7] Do not read from descriptor ring

2021-06-08 Thread Andy Lutomirski
On 6/3/21 10:53 PM, Jason Wang wrote:
> Hi:
> 
> The virtio driver should not trust the device. This became more urgent
> for the case of encrypted VM or VDUSE[1]. In both cases, technology
> like swiotlb/IOMMU is used to prevent the poking/mangling of memory
> from the device. But this is not sufficient since current virtio
> driver may trust what is stored in the descriptor table (coherent
> mapping) for performing the DMA operations like unmap and bounce so
> the device may choose to utilize the behaviour of swiotlb to perform
> attacks[2].

Based on a quick skim, this looks entirely reasonable to me.

(I'm not a virtio maintainer or expert.  I got my hands very dirty with
virtio once dealing with the DMA mess, but that's about it.)

--Andy


Re: [PATCH v1 1/8] virtio: Force only split mode with protected guest

2021-06-03 Thread Andy Lutomirski
On 6/3/21 4:32 PM, Andi Kleen wrote:
> 
>> We do not need an increasing pile of kludges
> 
> Do you mean disabling features is a kludge?
> 
> If yes I disagree with that characterization.
> 
> 
>> to make TDX and SEV “secure”.  We need the actual loaded driver to be
>> secure.  The virtio architecture is full of legacy nonsense,
>> and there is no good reason for SEV and TDX to be a giant special case.
> 
> I don't know where you see a "giant special case". Except for the
> limited feature negotiation all the changes are common, and the
> disabling of features (which is not new BTW, but already done e.g. with
> forcing DMA API in some cases) can be of course used by all these other
> technologies too. But it just cannot be done by default for everything
> because it would break compatibility. So every technology with such
> requirements has to explicitly opt-in.
> 
> 
>>
>> As I said before, real PCIe (Thunderbolt/USB-C or anything else) has
>> the exact same problem.  The fact that TDX has encrypted memory is, at
>> best, a poor proxy for the actual condition.  The actual condition is
>> that the host does not trust the device to implement the virtio
>> protocol correctly.
> 
> Right they can do similar limitations of feature sets. But again it
> cannot be default.

Let me try again.

For most Linux drivers, a report that a misbehaving device can corrupt
host memory is a bug, not a feature.  If a USB device can corrupt kernel
memory, that's a serious bug.  If a USB-C device can corrupt kernel
memory, that's also a serious bug, although, sadly, we probably have
lots of these bugs.  If a Firewire device can corrupt kernel memory,
news at 11.  If a Bluetooth or WiFi peer can corrupt kernel memory,
people write sonnets about it and give it clever names.  Why is virtio
special?

If, for some reason, the virtio driver cannot be fixed so that it is
secure and compatible [1], then I think that the limited cases that are
secure should be accessible to anyone, with or without TDX.  Have a
virtio.secure_mode module option or a udev-controllable parameter or an
alternative driver name or *something*.  An alternative driver name
would allow userspace to prevent the insecure mode from auto-binding to
devices.  And make whatever system configures encrypted guests for
security use this mode.  (Linux is not going to be magically secure just
by booting it in TDX.  There's a whole process of unsealing or remote
attestation, something needs to prevent the hypervisor from connecting a
virtual keyboard and typing init=/bin/bash, something needs to provision
an SSH key, etc.)
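
For concreteness, the opt-in could be as small as a module parameter
consulted during feature negotiation.  A minimal sketch -- the parameter
name and the check are hypothetical, not existing upstream code:

/* Hypothetical opt-in: refuse feature sets that require trusting the device. */
static bool secure_mode;
module_param(secure_mode, bool, 0444);
MODULE_PARM_DESC(secure_mode, "only negotiate hardened device feature sets");

static int virtio_check_secure_mode(struct virtio_device *dev)
{
	/* Insist on the modern, DMA-API-respecting subset of the spec. */
	if (secure_mode &&
	    (!virtio_has_feature(dev, VIRTIO_F_VERSION_1) ||
	     !virtio_has_feature(dev, VIRTIO_F_ACCESS_PLATFORM)))
		return -ENODEV;

	return 0;
}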

In my opinion, it is not so great to identify bugs in the driver and
then say that they're only being fixed for TDX and SEV.

Keep in mind that, as I understand it, there is nothing virt specific
about virtio.  There are real physical devices that speak virtio.

[1] The DMA quirk is nasty.  Fortunately, it's the only case I'm aware
of in which the virtio driver genuinely cannot be made secure and
compatible at the same time.  Also, fortunately, most real deployments
except on powerpc work just fine with the DMA quirk unquirked.
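
(For readers unfamiliar with the quirk: when a device does not offer
VIRTIO_F_ACCESS_PLATFORM, the driver bypasses the DMA API and hands the
device raw physical addresses.  The heuristic in
drivers/virtio/virtio_ring.c of this era looks roughly like:)

static bool vring_use_dma_api(struct virtio_device *vdev)
{
	if (!virtio_has_dma_quirk(vdev))
		return true;	/* device honors VIRTIO_F_ACCESS_PLATFORM */

	/* Otherwise we are left to guess; Xen always needs the DMA API. */
	if (xen_domain())
		return true;

	return false;
}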

> 
> 
>>
>>>
>>> TDX and SEV use the arch hook to enforce DMA API, so that part is also
>>> solved.
>>>
>> Can you point me to the code you’re referring to?
> 
> See 4/8 in this patch kit. It uses an existing hook which is already
> used in tree by s390.

This one:

+int arch_has_restricted_virtio_memory_access(void)
+{
+   return is_tdx_guest();
+}

I'm looking at a fairly recent kernel, and I don't see anything for s390
wired up in vring_use_dma_api.
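
(For reference, the s390 wiring does not go through vring_use_dma_api()
at all; the hook is consumed by the virtio core during feature
finalization, roughly:)

	if (arch_has_restricted_virtio_memory_access() &&
	    !virtio_has_feature(dev, VIRTIO_F_VERSION_1)) {
		dev_warn(&dev->dev, "device must provide VIRTIO_F_VERSION_1\n");
		return -ENODEV;
	}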


Re: [PATCH v1 1/8] virtio: Force only split mode with protected guest

2021-06-03 Thread Andy Lutomirski


On Thu, Jun 3, 2021, at 12:53 PM, Andi Kleen wrote:
> 
> > Tell that to every crypto downgrade attack ever.
> 
> That's exactly what this patch addresses.
> 
> >
> > I see two credible solutions:
> >
> > 1. Actually harden the virtio driver.
> That's exactly what this patchkit, and the alternative approaches, like 
> Jason's, are doing.
> >
> > 2. Have a new virtio-modern driver and use it for modern use cases. Maybe 
> > rename the old driver virtio-legacy or virtio-insecure.  They can share 
> > code.
> 
> In most use cases the legacy driver is not insecure because there is no 
> memory protection anyways.
> 
> Yes maybe such a split would be a good idea for maintenance and maybe 
> performance reasons, but at least from the security perspective I don't 
> see any need for it.


Please reread my email.

We do not need an increasing pile of kludges to make TDX and SEV “secure”.  We 
need the actual loaded driver to be secure.  The virtio architecture is full of 
legacy nonsense, and there is no good reason for SEV and TDX to be a giant 
special case.

As I said before, real PCIe (Thunderbolt/USB-C or anything else) has the exact 
same problem.  The fact that TDX has encrypted memory is, at best, a poor proxy 
for the actual condition.  The actual condition is that the host does not trust 
the device to implement the virtio protocol correctly.

> 
> >
> > Another snag you may hit: virtio’s heuristic for whether to use proper DMA 
> > ops or to bypass them is a giant kludge. I’m very slightly optimistic that 
> > getting the heuristic wrong will make the driver fail to operate but won’t 
> > allow the host to take over the guest, but I’m not really convinced. And I 
> > wrote that code!  A virtio-modern mode probably should not have a 
> > heuristic, and the various iommu-bypassing modes should be fixed to work at 
> > the bus level, not the device level
> 
> TDX and SEV use the arch hook to enforce DMA API, so that part is also 
> solved.
> 

Can you point me to the code you’re referring to?

> 
> -Andi
> 
> 

Re: [PATCH v1 1/8] virtio: Force only split mode with protected guest

2021-06-03 Thread Andy Lutomirski


On Thu, Jun 3, 2021, at 11:00 AM, Andi Kleen wrote:
> 
> On 6/3/2021 10:33 AM, Andy Lutomirski wrote:
> > On 6/2/21 5:41 PM, Andi Kleen wrote:
> >> Only allow split mode when in a protected guest. Follow-on
> >> patches harden the split mode code paths, and we don't want
> >> a malicious host to force anything else. Also disallow
> >> indirect mode for similar reasons.
> > I read this as "the virtio driver is buggy.  Let's disable most of the
> > buggy code in one special case in which we need a driver without bugs.
> > In all the other cases (e.g. hardware virtio device connected over
> > USB-C), driver bugs are still allowed."
> 
> My understanding is most of the other modes (except for split with 
> separate descriptors) are obsolete and just there for compatibility. As 
> long as they're deprecated they won't harm anyone.
> 
>

Tell that to every crypto downgrade attack ever.

I see two credible solutions:

1. Actually harden the virtio driver.

2. Have a new virtio-modern driver and use it for modern use cases. Maybe 
rename the old driver virtio-legacy or virtio-insecure.  They can share code.

Another snag you may hit: virtio’s heuristic for whether to use proper DMA ops 
or to bypass them is a giant kludge. I’m very slightly optimistic that getting 
the heuristic wrong will make the driver fail to operate but won’t allow the 
host to take over the guest, but I’m not really convinced. And I wrote that 
code!  A virtio-modern mode probably should not have a heuristic, and the 
various iommu-bypassing modes should be fixed to work at the bus level, not the 
device level.

Re: [PATCH v1 1/8] virtio: Force only split mode with protected guest

2021-06-03 Thread Andy Lutomirski
On 6/2/21 5:41 PM, Andi Kleen wrote:
> Only allow split mode when in a protected guest. Follow-on
> patches harden the split mode code paths, and we don't want
> a malicious host to force anything else. Also disallow
> indirect mode for similar reasons.

I read this as "the virtio driver is buggy.  Let's disable most of the
buggy code in one special case in which we need a driver without bugs.
In all the other cases (e.g. hardware virtio device connected over
USB-C), driver bugs are still allowed."

Can we just fix the driver without special cases?

--Andy


Re: [PATCH 2/3] x86/sev-es: Check if regs->sp is trusted before adjusting #VC IST stack

2021-02-18 Thread Andy Lutomirski
On Thu, Feb 18, 2021 at 11:21 AM Joerg Roedel  wrote:
>
> On Thu, Feb 18, 2021 at 09:49:06AM -0800, Andy Lutomirski wrote:
> > I don't understand what this means.  The whole entry mechanism on x86
> > is structured so that we call a C function *and return from that C
> > function without longjmp-like magic* with the sole exception of
> > unwind_stack_do_exit().  This means that you can match up enters and
> > exits, and that unwind_stack_do_exit() needs to unwind correctly.  In
> > the former case, it's normal C and we can use normal local variables.
> > In the latter case, we know exactly what state we're trying to restore
> > and we can restore it directly without any linked lists or similar.
>
> Okay, the unwinder will likely get confused by this logic.
>
> > What do you have in mind that requires a linked list?
>
> Cases when there are multiple IST vectors besides NMI that can hit while
> the #VC handler is still on its own IST stack. #MCE comes to mind, but
> that is broken anyway. At some point #VC itself will be one of them, but
> when that happens the code will kill the machine.
> This leaves #HV in the list, and I am not sure how that is going to be
> handled yet. I think the goal is that the #HV handler is not allowed to
> cause any #VC exception. In that case the linked-list logic will not be
> needed.

Can you give me an example, even artificial, in which the linked-list
logic is useful?

>
> > > I don't see how this would break, can you elaborate?
> > >
> > > What I think happens is:
> > >
> > > SYSCALL gap (RSP is from userspace and untrusted)
> > >
> > > -> #VC - Handler on #VC IST stack detects that it interrupted
> > >the SYSCALL gap and switches to the task stack.
> > >
> >
> > Can you point me to exactly what code you're referring to?  I spent a
> > while digging through the code and macro tangle and I can't find this.
>
> See the entry code in arch/x86/entry/entry_64.S, macro idtentry_vc. It
> creates the assembly code for the handler. At some point it calls
> vc_switch_off_ist(), which is a C function in arch/x86/kernel/traps.c.
> This function tries to find a new stack for the #VC handler.
>
> The first thing it does is checking whether the exception came from the
> SYSCALL gap and just uses the task stack in that case.
>
> Then it will check for other kernel stacks which are safe to switch
> to. If that fails it uses the fall-back stack (VC2), which will direct
> the handler to a separate function which, for now, just calls panic().
> Not safe are the entry or unknown stacks.
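
(A condensed sketch of the logic being described -- paraphrased from
vc_switch_off_ist(), with the gap and stack checks reduced to
illustrative helpers, not the verbatim source:)

struct pt_regs *vc_switch_off_ist(struct pt_regs *regs)
{
	unsigned long sp;

	/* SYSCALL gap: regs->sp is user-controlled, use the task stack. */
	if (from_syscall_gap(regs))
		sp = this_cpu_read(cpu_current_top_of_stack);
	else if (stack_is_safe(regs->sp))	/* not entry/unknown stacks */
		sp = regs->sp;
	else
		sp = __this_cpu_ist_top_va(VC2);  /* fall-back stack -> panic() */

	/* Copy pt_regs to the chosen stack and return the new frame. */
	sp = ALIGN_DOWN(sp, 8) - sizeof(*regs);
	*(struct pt_regs *)sp = *regs;

	return (struct pt_regs *)sp;
}

(The ALIGN_DOWN(sp, 8) is the 8-byte alignment questioned below.)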

Can you explain your reasoning in considering the entry stack unsafe?
It's 4k bytes these days.

You forgot about entry_SYSCALL_compat.

Your 8-byte alignment is confusing to me.  In valid kernel code, SP
should be 8-byte-aligned already, and, if you're trying to match
architectural behavior, the CPU aligns to 16 bytes.

We're not robust against #VC, NMI in the #VC prologue before the magic
stack switch, and a new #VC in the NMI prologue.  Nor do we appear to
have any detection of the case where #VC nests directly inside its own
prologue.  Or did I miss something else here?

If we get NMI and get #VC in the NMI *asm*, the #VC magic stack switch
looks like it will merrily run itself in the NMI special-stack-layout
section, and that sounds really quite bad.

>
> The function then copies pt_regs and returns the new stack pointer to
> assembly code, which then writes it to %RSP.
>
> > Unless AMD is more magic than I realize, the MOV SS bug^Wfeature means
> > that #DB is *not* always called in safe places.
>
> You are right, forgot about this. The MOV SS bug can very well
> trigger a #VC(#DB) exception from the syscall gap.
>
> > > And with SNP we need to be able to at least detect a malicious HV so we
> > > can reliably kill the guest. Otherwise the HV could potentially take
> > > control over the guest's execution flow and make it reveal its secrets.
> >
> > True.  But is the rest of the machinery to be secure against EFLAGS.IF
> > violations and such in place yet?
>
> Not sure what you mean by EFLAGS.IF violations, probably enabling IRQs
> while in the #VC handler? The #VC handler _must_ _not_ enable IRQs
> anywhere in its call-path. If that ever happens it is a bug.
>

I mean that, IIRC, a malicious hypervisor can inject inappropriate
vectors at inappropriate times if the #HV mechanism isn't enabled.
For example, it could inject a page fault or an interrupt in a context
in which we have the wrong GSBASE loaded.

But the #DB issue makes this moot.  We have to use IST unless we turn
off SCE.  But I admit I'm leaning toward turning off SCE until we have
a solution that seems convincingly robust.


Re: [PATCH 2/3] x86/sev-es: Check if regs->sp is trusted before adjusting #VC IST stack

2021-02-18 Thread Andy Lutomirski
On Thu, Feb 18, 2021 at 3:25 AM Joerg Roedel  wrote:
>
> Hi Andy,
>
> On Wed, Feb 17, 2021 at 10:09:46AM -0800, Andy Lutomirski wrote:
> > Can you get rid of the linked list hack while you're at it?  This code
> > is unnecessarily convoluted right now, and it seems to be just asking
> > for weird bugs.  Just stash the old value in a local variable, please.
>
> Yeah, the linked list is not really necessary right now, because of the
> way nested NMI handling works and given that these functions are only
> used in the NMI handler right now.
> The whole #VC handling code was written with future requirements in
> mind, like what is needed when debugging registers get virtualized and
> #HV gets enabled.
> Until it's clear whether __sev_es_ist_enter/exit() is needed in any of
> these paths, I'd like to keep the linked list for now. It is more
> complicated but allows nesting.

I don't understand what this means.  The whole entry mechanism on x86
is structured so that we call a C function *and return from that C
function without longjmp-like magic* with the sole exception of
unwind_stack_do_exit().  This means that you can match up enters and
exits, and that unwind_stack_do_exit() needs to unwind correctly.  In
the former case, it's normal C and we can use normal local variables.
In the latter case, we know exactly what state we're trying to restore
and we can restore it directly without any linked lists or similar.

What do you have in mind that requires a linked list?

>
> > Meanwhile, I'm pretty sure I can break this whole scheme if the
> > hypervisor is messing with us.  As a trivial example, the sequence
> > SYSCALL gap -> #VC -> NMI -> #VC will go quite poorly.
>
> I don't see how this would break, can you elaborate?
>
> What I think happens is:
>
> SYSCALL gap (RSP is from userspace and untrusted)
>
> -> #VC - Handler on #VC IST stack detects that it interrupted
>the SYSCALL gap and switches to the task stack.
>

Can you point me to exactly what code you're referring to?  I spent a
while digging through the code and macro tangle and I can't find this.

>
> -> NMI - Now running on NMI IST stack. Depending on whether the
>stack switch in the #VC handler already happened, the #VC IST
>entry is adjusted so that a subsequent #VC will not overwrite
>the interrupted handlers stack frame.
>
> -> #VC - Handler runs on the adjusted #VC IST stack and switches
>itself back to the NMI IST stack. This is safe wrt. nested
>NMIs as long as nested NMIs itself are safe.
>
> As a rule of thumb, think of the #VC handler as trying to be a non-IST
> handler by switching itself to the interrupted stack or the task stack.
> If it detects that this is not possible (which can't happen right now,
> but with SNP), it will kill the guest.

I will try to think of this, but it's hard, since I can't find the code :)

I found this comment:

 * With the current implementation it is always possible to switch to a safe
 * stack because #VC exceptions only happen at known places, like intercepted
 * instructions or accesses to MMIO areas/IO ports. They can also happen with
 * code instrumentation when the hypervisor intercepts #DB, but the critical
 * paths are forbidden to be instrumented, so #DB exceptions currently also
 * only happen in safe places.

Unless AMD is more magic than I realize, the MOV SS bug^Wfeature means
that #DB is *not* always called in safe places.

But I *think* the code you're talking about is this:

/*
 * If the entry is from userspace, switch stacks and treat it as
 * a normal entry.
 */
testb   $3, CS-ORIG_RAX(%rsp)
jnz     .Lfrom_usermode_switch_stack_\@

which does not run on #VC from kernel code.

> It needs to be IST, even without SNP, because #DB is IST too. When the
> hypervisor intercepts #DB then any #DB exception will be turned into
> #VC, so #VC needs to be handled anywhere a #DB can happen.

Eww.

>
> And with SNP we need to be able to at least detect a malicious HV so we
> can reliably kill the guest. Otherwise the HV could potentially take
> control over the guest's execution flow and make it reveal its secrets.

True.  But is the rest of the machinery to be secure against EFLAGS.IF
violations and such in place yet?

>
> Regards,
>
> Joerg


Re: [PATCH 2/3] x86/sev-es: Check if regs->sp is trusted before adjusting #VC IST stack

2021-02-17 Thread Andy Lutomirski
On Wed, Feb 17, 2021 at 4:02 AM Joerg Roedel  wrote:
>
> From: Joerg Roedel 
>
> The code in the NMI handler to adjust the #VC handler IST stack is
> needed in case an NMI hits when the #VC handler is still using its IST
> stack.
> But the check for this condition also needs to look if the regs->sp
> value is trusted, meaning it was not set by user-space. Extend the
> check to not use regs->sp when the NMI interrupted user-space code or
> the SYSCALL gap.
>
> Reported-by: Andy Lutomirski 
> Fixes: 315562c9af3d5 ("x86/sev-es: Adjust #VC IST Stack on entering NMI handler")
> Cc: sta...@vger.kernel.org # 5.10+
> Signed-off-by: Joerg Roedel 
> ---
>  arch/x86/kernel/sev-es.c | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
>
> diff --git a/arch/x86/kernel/sev-es.c b/arch/x86/kernel/sev-es.c
> index 84c1821819af..0df38b185d53 100644
> --- a/arch/x86/kernel/sev-es.c
> +++ b/arch/x86/kernel/sev-es.c
> @@ -144,7 +144,9 @@ void noinstr __sev_es_ist_enter(struct pt_regs *regs)
> old_ist = __this_cpu_read(cpu_tss_rw.x86_tss.ist[IST_INDEX_VC]);
>
> /* Make room on the IST stack */
> -   if (on_vc_stack(regs->sp))
> +   if (on_vc_stack(regs->sp) &&
> +   !user_mode(regs) &&
> +   !from_syscall_gap(regs))
> new_ist = ALIGN_DOWN(regs->sp, 8) - sizeof(old_ist);
> else
>
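
(from_syscall_gap() is not shown in the hunk; presumably it is something
along the lines of this sketch, checking whether the interrupt landed
between the SYSCALL entry point and the point where the kernel stack is
live -- and, as noted elsewhere in the thread, entry_SYSCALL_compat has
the same gap:)

/* Hypothetical sketch of the helper used above. */
static inline bool from_syscall_gap(struct pt_regs *regs)
{
	return regs->ip >= (unsigned long)entry_SYSCALL_64 &&
	       regs->ip <  (unsigned long)entry_SYSCALL_64_safe_stack;
}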

Can you get rid of the linked list hack while you're at it?  This code
is unnecessarily convoluted right now, and it seems to be just asking
for weird bugs.  Just stash the old value in a local variable, please.

Meanwhile, I'm pretty sure I can break this whole scheme if the
hypervisor is messing with us.  As a trivial example, the sequence
SYSCALL gap -> #VC -> NMI -> #VC will go quite poorly.  Is this really
better than just turning IST off for #VC and documenting that we are
not secure against a malicious hypervisor yet?

--Andy


Re: [PATCH v2 05/12] x86: rework arch_local_irq_restore() to not use popf

2020-11-23 Thread Andy Lutomirski




> On Nov 22, 2020, at 9:22 PM, Jürgen Groß  wrote:
> 
> On 22.11.20 22:44, Andy Lutomirski wrote:
>>> On Sat, Nov 21, 2020 at 10:55 PM Jürgen Groß  wrote:
>>> 
>>> On 20.11.20 12:59, Peter Zijlstra wrote:
>>>> On Fri, Nov 20, 2020 at 12:46:23PM +0100, Juergen Gross wrote:
>>>>> +static __always_inline void arch_local_irq_restore(unsigned long flags)
>>>>> +{
>>>>> +if (!arch_irqs_disabled_flags(flags))
>>>>> +arch_local_irq_enable();
>>>>> +}
>>>> 
>>>> If someone were to write horrible code like:
>>>> 
>>>>   local_irq_disable();
>>>>   local_irq_save(flags);
>>>>   local_irq_enable();
>>>>   local_irq_restore(flags);
>>>> 
>>>> we'd be up some creek without a paddle... now I don't _think_ we have
>>>> genius code like that, but I'd feel saver if we can haz an assertion in
>>>> there somewhere...
>>>> 
>>>> Maybe something like:
>>>> 
>>>> #ifdef CONFIG_DEBUG_ENTRY // for lack of something saner
>>>>   WARN_ON_ONCE((arch_local_save_flags() ^ flags) & X86_EFLAGS_IF);
>>>> #endif
>>>> 
>>>> At the end?
>>> 
>>> I'd like to, but using WARN_ON_ONCE() in include/asm/irqflags.h sounds
>>> like a perfect recipe for include dependency hell.
>>> 
>>> We could use a plain asm("ud2") instead.
>> How about out-of-lining it:
>> #ifdef CONFIG_DEBUG_ENTRY
>> extern void warn_bogus_irqrestore();
>> #endif
>> static __always_inline void arch_local_irq_restore(unsigned long flags)
>> {
>>         if (!arch_irqs_disabled_flags(flags)) {
>>                 arch_local_irq_enable();
>>         } else {
>> #ifdef CONFIG_DEBUG_ENTRY
>>                 if (unlikely(arch_local_irq_save() & X86_EFLAGS_IF))
>>                         warn_bogus_irqrestore();
>> #endif
>>         }
>> }
> 
> This couldn't be a WARN_ON_ONCE() then (or it would be a catch all).

If you put the WARN_ON_ONCE in the out-of-line helper, it should work 
reasonably well.
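
Roughly, keeping only the once-only bookkeeping out of line (a sketch):

/* Out of line, so irqflags.h never has to pull in bug.h. */
void warn_bogus_irqrestore(void)
{
	WARN_ON_ONCE(1);
}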

> Another approach might be to open-code the WARN_ON_ONCE(), like:
> 
> #ifdef CONFIG_DEBUG_ENTRY
> extern void warn_bogus_irqrestore(bool *once);
> #endif
> 
> static __always_inline void arch_local_irq_restore(unsigned long flags)
> {
>         if (!arch_irqs_disabled_flags(flags))
>                 arch_local_irq_enable();
> #ifdef CONFIG_DEBUG_ENTRY
>         {
>                 static bool once;
>
>                 if (unlikely(arch_local_irq_save() & X86_EFLAGS_IF))
>                         warn_bogus_irqrestore(&once);
>         }
> #endif
> }
> 

I don’t know precisely what a static variable in an __always_inline function 
will do, but I imagine it will be, at best, erratic, especially when modules 
are involved.

> 
> Juergen
> 

Re: [PATCH v2 05/12] x86: rework arch_local_irq_restore() to not use popf

2020-11-22 Thread Andy Lutomirski
On Sat, Nov 21, 2020 at 10:55 PM Jürgen Groß  wrote:
>
> On 20.11.20 12:59, Peter Zijlstra wrote:
> > On Fri, Nov 20, 2020 at 12:46:23PM +0100, Juergen Gross wrote:
> >> +static __always_inline void arch_local_irq_restore(unsigned long flags)
> >> +{
> >> +if (!arch_irqs_disabled_flags(flags))
> >> +arch_local_irq_enable();
> >> +}
> >
> > If someone were to write horrible code like:
> >
> >   local_irq_disable();
> >   local_irq_save(flags);
> >   local_irq_enable();
> >   local_irq_restore(flags);
> >
> > we'd be up some creek without a paddle... now I don't _think_ we have
> > genius code like that, but I'd feel safer if we can haz an assertion in
> > there somewhere...
> >
> > Maybe something like:
> >
> > #ifdef CONFIG_DEBUG_ENTRY // for lack of something saner
> >   WARN_ON_ONCE((arch_local_save_flags() ^ flags) & X86_EFLAGS_IF);
> > #endif
> >
> > At the end?
>
> I'd like to, but using WARN_ON_ONCE() in include/asm/irqflags.h sounds
> like a perfect recipe for include dependency hell.
>
> We could use a plain asm("ud2") instead.

How about out-of-lining it:

#ifdef CONFIG_DEBUG_ENTRY
extern void warn_bogus_irqrestore();
#endif

static __always_inline void arch_local_irq_restore(unsigned long flags)
{
	if (!arch_irqs_disabled_flags(flags)) {
		arch_local_irq_enable();
	} else {
#ifdef CONFIG_DEBUG_ENTRY
		if (unlikely(arch_local_irq_save() & X86_EFLAGS_IF))
			warn_bogus_irqrestore();
#endif
	}
}

Re: [PATCH 4/4] x86/xen: drop USERGS_SYSRET64 paravirt call

2020-11-16 Thread Andy Lutomirski
On Mon, Nov 16, 2020 at 7:23 AM Juergen Gross  wrote:
>
> USERGS_SYSRET64 is used to return from a syscall via sysret, but
> a Xen PV guest will nevertheless use the iret hypercall, as there
> is no sysret PV hypercall defined.
>
> So instead of testing all the prerequisites for doing a sysret and
> then mangling the stack for Xen PV again for doing an iret just use
> the iret exit from the beginning.
>
> This can easily be done via an ALTERNATIVE like it is done for the
> sysenter compat case already.
>
> While at it remove to stale sysret32 remnants.

s/to/the/


Re: [PATCH 4/4] x86/xen: drop USERGS_SYSRET64 paravirt call

2020-11-16 Thread Andy Lutomirski
On Mon, Nov 16, 2020 at 7:23 AM Juergen Gross  wrote:
>
> USERGS_SYSRET64 is used to return from a syscall via sysret, but
> a Xen PV guest will nevertheless use the iret hypercall, as there
> is no sysret PV hypercall defined.
>
> So instead of testing all the prerequisites for doing a sysret and
> then mangling the stack for Xen PV again for doing an iret just use
> the iret exit from the beginning.
>
> This can easily be done via an ALTERNATIVE like it is done for the
> sysenter compat case already.
>
> While at it remove to stale sysret32 remnants.
>
> Signed-off-by: Juergen Gross 

Acked-by: Andy Lutomirski 

FWIW, you've lost the VGCF_in_syscall optimization.  Let me see if I
can give it back to you better.

> ---
>  arch/x86/entry/entry_64.S | 22 +-
>  arch/x86/include/asm/irqflags.h   |  6 --
>  arch/x86/include/asm/paravirt.h   |  5 -
>  arch/x86/include/asm/paravirt_types.h |  8 
>  arch/x86/kernel/asm-offsets_64.c  |  2 --
>  arch/x86/kernel/paravirt.c|  5 +
>  arch/x86/kernel/paravirt_patch.c  |  4 
>  arch/x86/xen/enlighten_pv.c   |  1 -
>  arch/x86/xen/xen-asm.S| 20 
>  arch/x86/xen/xen-ops.h|  2 --
>  10 files changed, 10 insertions(+), 65 deletions(-)
>
> diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
> index a876204a73e0..df865eebd3d7 100644
> --- a/arch/x86/entry/entry_64.S
> +++ b/arch/x86/entry/entry_64.S
> @@ -46,14 +46,6 @@
>  .code64
>  .section .entry.text, "ax"
>
> -#ifdef CONFIG_PARAVIRT_XXL
> -SYM_CODE_START(native_usergs_sysret64)
> -   UNWIND_HINT_EMPTY
> -   swapgs
> -   sysretq
> -SYM_CODE_END(native_usergs_sysret64)
> -#endif /* CONFIG_PARAVIRT_XXL */
> -
>  /*
>   * 64-bit SYSCALL instruction entry. Up to 6 arguments in registers.
>   *
> @@ -123,12 +115,15 @@ SYM_INNER_LABEL(entry_SYSCALL_64_after_hwframe, SYM_L_GLOBAL)
>  * Try to use SYSRET instead of IRET if we're returning to
>  * a completely clean 64-bit userspace context.  If we're not,
>  * go to the slow exit path.
> +    * In the Xen PV case we must use iret anyway.
>  */
> -   movq    RCX(%rsp), %rcx
> -   movq    RIP(%rsp), %r11
>
> -   cmpq    %rcx, %r11  /* SYSRET requires RCX == RIP */
> -   jne swapgs_restore_regs_and_return_to_usermode
> +   ALTERNATIVE __stringify( \
> +   movq    RCX(%rsp), %rcx; \
> +   movq    RIP(%rsp), %r11; \
> +   cmpq    %rcx, %r11; /* SYSRET requires RCX == RIP */ \
> +   jne swapgs_restore_regs_and_return_to_usermode), \
> +   "jmp    swapgs_restore_regs_and_return_to_usermode", X86_FEATURE_XENPV

I'm not in love with this save-a-few-bytes stringify, but I can live with it.

--Andy


Re: [PATCH 3/4] x86/pv: switch SWAPGS to ALTERNATIVE

2020-11-16 Thread Andy Lutomirski
On Mon, Nov 16, 2020 at 7:23 AM Juergen Gross  wrote:
>
> SWAPGS is used only for interrupts coming from user mode or for
> returning to user mode. So there is no reason to use the PARAVIRT
> framework, as it can easily be replaced by an ALTERNATIVE depending
> on X86_FEATURE_XENPV.
>
> There are several instances using the PV-aware SWAPGS macro in paths
> which are never executed in a Xen PV guest. Replace those with the
> plain swapgs instruction. For SWAPGS_UNSAFE_STACK the same applies.

Acked-by: Andy Lutomirski 


Re: [PATCH] drivers/virt: vmgenid: add vm generation id driver

2020-10-18 Thread Andy Lutomirski
On Sun, Oct 18, 2020 at 8:59 AM Michael S. Tsirkin  wrote:
>
> On Sun, Oct 18, 2020 at 08:54:36AM -0700, Andy Lutomirski wrote:
> > On Sun, Oct 18, 2020 at 8:52 AM Michael S. Tsirkin  wrote:
> > >
> > > On Sat, Oct 17, 2020 at 03:24:08PM +0200, Jason A. Donenfeld wrote:
> > > > 4c. The guest kernel maintains an array of physical addresses that are
> > > > MADV_WIPEONFORK. The hypervisor knows about this array and its
> > > > location through whatever protocol, and before resuming a
> > > > moved/snapshotted/duplicated VM, it takes the responsibility for
> > > > memzeroing this memory. The huge pro here would be that this
> > > > eliminates all races, and reduces complexity quite a bit, because the
> > > > hypervisor can perfectly synchronize its bringup (and SMP bringup)
> > > > with this, and it can even optimize things like on-disk memory
> > > > snapshots to simply not write out those pages to disk.
> > > >
> > > > A 4c-like approach seems like it'd be a lot of bang for the buck -- we
> > > > reuse the existing mechanism (MADV_WIPEONFORK), so there's no new
> > > > userspace API to deal with, and it'd be race free, and eliminate a lot
> > > > of kernel complexity.
> > >
> > > Clearly this has a chance to break applications, right?
> > > If there's an app that uses this as a non-system-calls way
> > > to find out whether there was a fork, it will break
> > > when wipe triggers without a fork ...
> > > For example, imagine:
> > >
> > > MADV_WIPEONFORK
> > > copy secret data to MADV_DONTFORK
> > > fork
> > >
> > >
> > > used to work, with this change it gets 0s instead of the secret data.
> > >
> > >
> > > I am also not sure it's wise to expose each guest process
> > > to the hypervisor like this. E.g. each process needs a
> > > guest physical address of its own then. This is a finite resource.
> > >
> > >
> > > The mmap interface proposed here is somewhat baroque, but it is
> > > certainly simple to implement ...
> >
> > Wipe of fork/vmgenid/whatever could end up being much more problematic
> > than it naively appears -- it could be wiped in the middle of a read.
> > Either the API needs to handle this cleanly, or we need something more
> > aggressive like signal-on-fork.
> >
> > --Andy
>
>
> Right, it's not on fork, it's actually when process is snapshotted.
>
> If we assume it's CRIU we care about, then I
> wonder what's wrong with something like
> MADV_CHANGEONPTRACE_SEIZE
> and basically say it's X bytes which change the value...

I feel like we may be approaching this from the wrong end.  Rather
than saying "what data structure can the kernel expose that might
plausibly be useful", how about we try identifying some specific
userspace needs and see what a good solution could look like.  I can
identify two major cryptographic use cases:

1. A userspace RNG.  The API exposed by the userspace end is a
function that generates random numbers.  The userspace code in turn
wants to know some things from the kernel: it wants some
best-quality-available random seed data from the kernel (and possibly
an indication of how good it is) as well as an indication of whether
the userspace memory may have been cloned or rolled back, or, failing
that, an indication of whether a reseed is needed.  Userspace could
implement a wide variety of algorithms on top depending on its goals
and compliance requirements, but the end goal is for the userspace
part to be very, very fast.

2. A userspace crypto stack that wants to avoid shooting itself in the
foot due to inadvertently doing the same thing twice.  For example, an
AES-GCM stack does not want to reuse an IV, *especially* if there is
even the slightest chance that it might reuse the IV for different
data.  This use case doesn't necessarily involve random numbers, but,
if anything, it needs to be even faster than #1.
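
To make the performance constraint concrete, the fast path of #1 with a
kernel-maintained generation counter could be as small as the sketch
below.  The mapped counter and the DRBG step are assumptions for
illustration -- no such interface exists yet:

#include <stddef.h>
#include <stdint.h>
#include <sys/random.h>

/* Assumed: a read-only, kernel-maintained clone/rollback counter. */
extern volatile const uint64_t *vm_generation;
/* Placeholder for a fast userspace DRBG output step. */
extern void drbg_generate(unsigned char *pool, unsigned char *out, size_t len);

static uint64_t seen_generation;
static unsigned char pool[64];

void rng_read(unsigned char *out, size_t len)
{
	uint64_t g;

	do {
		g = *vm_generation;
		if (g != seen_generation) {
			/* Slow path: pull fresh seed material from the kernel. */
			getrandom(pool, sizeof(pool), 0);
			seen_generation = g;
		}
		drbg_generate(pool, out, len);
	} while (*vm_generation != g);	/* retry if cloned mid-generation */
}

The trailing re-check is one way to address the wiped-in-the-middle-of-a-read
race quoted above: output produced while a clone was happening is thrown away
and regenerated from the fresh seed.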

The threats here are not really the same.  For #1, a userspace RNG
should be able to recover from a scenario in which an adversary clones
the entire process *and gets to own the clone*.  For example, in
Android, an adversary can often gain complete control of a fork of the
zygote -- this shouldn't adversely affect the security properties of
other forks.  Similarly, a server farm could operate by having one
booted server that is cloned to create more workers.  Those clones
could be provisioned with secrets and permissions post-clone, and at
attacker gaining control of a fresh clone could be considered
acceptable.  For #2, in contrast, if an adversary gains control of a
clone of a

Re: [PATCH] drivers/virt: vmgenid: add vm generation id driver

2020-10-18 Thread Andy Lutomirski
On Sun, Oct 18, 2020 at 8:52 AM Michael S. Tsirkin  wrote:
>
> On Sat, Oct 17, 2020 at 03:24:08PM +0200, Jason A. Donenfeld wrote:
> > 4c. The guest kernel maintains an array of physical addresses that are
> > MADV_WIPEONFORK. The hypervisor knows about this array and its
> > location through whatever protocol, and before resuming a
> > moved/snapshotted/duplicated VM, it takes the responsibility for
> > memzeroing this memory. The huge pro here would be that this
> > eliminates all races, and reduces complexity quite a bit, because the
> > hypervisor can perfectly synchronize its bringup (and SMP bringup)
> > with this, and it can even optimize things like on-disk memory
> > snapshots to simply not write out those pages to disk.
> >
> > A 4c-like approach seems like it'd be a lot of bang for the buck -- we
> > reuse the existing mechanism (MADV_WIPEONFORK), so there's no new
> > userspace API to deal with, and it'd be race free, and eliminate a lot
> > of kernel complexity.
>
> Clearly this has a chance to break applications, right?
> If there's an app that uses this as a non-system-calls way
> to find out whether there was a fork, it will break
> when wipe triggers without a fork ...
> For example, imagine:
>
> MADV_WIPEONFORK
> copy secret data to MADV_DONTFORK
> fork
>
>
> used to work, with this change it gets 0s instead of the secret data.
>
>
> I am also not sure it's wise to expose each guest process
> to the hypervisor like this. E.g. each process needs a
> guest physical address of its own then. This is a finite resource.
>
>
> The mmap interface proposed here is somewhat baroque, but it is
> certainly simple to implement ...

Wipe of fork/vmgenid/whatever could end up being much more problematic
than it naively appears -- it could be wiped in the middle of a read.
Either the API needs to handle this cleanly, or we need something more
aggressive like signal-on-fork.

--Andy


Re: [PATCH] drivers/virt: vmgenid: add vm generation id driver

2020-10-17 Thread Andy Lutomirski
On Fri, Oct 16, 2020 at 6:40 PM Jann Horn  wrote:
>
> [adding some more people who are interested in RNG stuff: Andy, Jason,
> Theodore, Willy Tarreau, Eric Biggers. also linux-api@, because this
> concerns some pretty fundamental API stuff related to RNG usage]
>
> On Fri, Oct 16, 2020 at 4:33 PM Catangiu, Adrian Costin
>  wrote:
> > - Background
> >
> > The VM Generation ID is a feature defined by Microsoft (paper:
> > http://go.microsoft.com/fwlink/?LinkId=260709) and supported by
> > multiple hypervisor vendors.
> >
> > The feature is required in virtualized environments by apps that work
> > with local copies/caches of world-unique data such as random values,
> > uuids, monotonically increasing counters, etc.
> > Such apps can be negatively affected by VM snapshotting when the VM
> > is either cloned or returned to an earlier point in time.
> >
> > The VM Generation ID is a simple concept meant to alleviate the issue
> > by providing a unique ID that changes each time the VM is restored
> > from a snapshot. The hw provided UUID value can be used to
> > differentiate between VMs or different generations of the same VM.
> >
> > - Problem
> >
> > The VM Generation ID is exposed through an ACPI device by multiple
> > hypervisor vendors, but neither the vendors nor upstream Linux have a
> > default driver for it, leaving users to fend for themselves.
> >
> > Furthermore, simply finding out about a VM generation change is only
> > the starting point of a process to renew internal states of possibly
> > multiple applications across the system. This process could benefit
> > from a driver that provides an interface through which orchestration
> > can be easily done.
> >
> > - Solution
> >
> > This patch is a driver which exposes the Virtual Machine Generation ID
> > via a char-dev FS interface that provides ID update sync and async
> > notification, retrieval and confirmation mechanisms:
> >
> > When the device is 'open()'ed a copy of the current vm UUID is
> > associated with the file handle. 'read()' operations block until the
> > associated UUID is no longer up to date - until HW vm gen id changes -
> > at which point the new UUID is provided/returned. Nonblocking 'read()'
> > uses EWOULDBLOCK to signal that there is no _new_ UUID available.
> >
> > 'poll()' is implemented to allow polling for UUID updates. Such
> > updates result in 'EPOLLIN' events.
> >
> > Subsequent read()s following a UUID update no longer block, but return
> > the updated UUID. The application needs to acknowledge the UUID update
> > by confirming it through a 'write()'.
> > Only on writing back to the driver the right/latest UUID, will the
> > driver mark this "watcher" as up to date and remove EPOLLIN status.
> >
> > 'mmap()' support allows mapping a single read-only shared page which
> > will always contain the latest UUID value at offset 0.
>
> It would be nicer if that page just contained an incrementing counter,
> instead of a UUID. It's not like the application cares *what* the UUID
> changed to, just that it *did* change and all RNGs state now needs to
> be reseeded from the kernel, right? And an application can't reliably
> read the entire UUID from the memory mapping anyway, because the VM
> might be forked in the middle.
>
> So I think your kernel driver should detect UUID changes and then turn
> those into a monotonically incrementing counter. (Probably 64 bits
> wide?) (That's probably also a little bit faster than comparing an
> entire UUID.)
>
> An option might be to put that counter into the vDSO, instead of a
> separate VMA; but I don't know how the other folks feel about that.
> Andy, do you have opinions on this? That way, normal userspace code
> that uses this infrastructure wouldn't have to mess around with a
> special device at all. And it'd be usable in seccomp sandboxes and so
> on without needing special plumbing. And libraries wouldn't have to
> call open() and mess with file descriptor numbers.

The vDSO might be annoyingly slow for this.  Something like the rseq
page might make sense.  It could be a generic indication of "system
went through some form of suspend".
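
For reference, a consumer of the char-dev protocol described in the
quoted patch would look roughly like this (device path and read/ack
semantics as proposed above; this is not an existing upstream
interface):

#include <fcntl.h>
#include <poll.h>
#include <unistd.h>

int main(void)
{
	unsigned char uuid[16];
	int fd = open("/dev/vmgenid", O_RDWR);

	if (fd < 0)
		return 1;

	for (;;) {
		struct pollfd pfd = { .fd = fd, .events = POLLIN };

		poll(&pfd, 1, -1);              /* EPOLLIN on generation change */
		read(fd, uuid, sizeof(uuid));   /* fetch the new UUID */
		/* ... reseed RNGs, invalidate cached unique values ... */
		write(fd, uuid, sizeof(uuid));  /* ack latest UUID, clears EPOLLIN */
	}
}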


Re: Should SEV-ES #VC use IST? (Re: [PATCH] Allow RDTSC and RDTSCP from userspace)

2020-06-23 Thread Andy Lutomirski
On Tue, Jun 23, 2020 at 8:23 AM Andrew Cooper  wrote:
>
> On 23/06/2020 14:03, Peter Zijlstra wrote:
> > On Tue, Jun 23, 2020 at 02:12:37PM +0200, Joerg Roedel wrote:
> >> On Tue, Jun 23, 2020 at 01:50:14PM +0200, Peter Zijlstra wrote:
> >>> If SNP is the sole reason #VC needs to be IST, then I'd strongly urge
> >>> you to only make it IST if/when you try and make SNP happen, not before.
> >> It is not the only reason, when ES guests gain debug register support
> >> then #VC also needs to be IST, because #DB can be promoted into #VC
> >> then, and as #DB is IST for a reason, #VC needs to be too.
> > Didn't I read somewhere that that is only so for Rome/Naples but not for
> > the later chips (Milan) which have #DB pass-through?
>
> I don't know about hardware timelines, but some future part can now opt
> in to having debug registers as part of the encrypted state, and swapped
> by VMExit, which would make debug facilities generally usable, and
> supposedly safe against the #DB infinite loop issues, at which point the
> hypervisor need not intercept #DB for safety reasons.
>
> It's worth noting that on current parts, the hypervisor can set up debug
> facilities on behalf of the guest (or behind its back) as the DR state
> is unencrypted, but that attempting to intercept #DB will redirect to
> #VC inside the guest and cause fun. (Also spare a thought for 32bit
> kernels which have to cope with userspace singlestepping the SYSENTER
> path with every #DB turning into #VC.)

What do you mean 32-bit?  64-bit kernels have exactly the same
problem.  At least the stack is okay, though.


Anyway, since I'm way behind on this thread, here are some thoughts:

First, I plan to implement actual precise recursion detection for the
IST stacks.  We'll be able to reliably panic when disallowed recursion
happens.

Second, I don't object *that* strongly to switching to a second #VC
stack if an NMI or MCE happens, but we really need to make sure we
cover *all* the bases.  And #VC is distressingly close to "happens at
all kinds of unfortunate times and the guest doesn't actually have
much ability to predict it" right now.  So we have #VC + #DB + #VC,
#VC + NMI + #VC, #VC + MCE + #VC, and even worse options.  So doing
the shift in a reliable way is not necessarily possible in a clean
way.

Let me contemplate.   And maybe produce some code soon.
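
(A sketch of what "precise recursion detection" could mean -- a per-cpu
depth counter per IST vector, checked on entry; names are illustrative,
not an existing implementation:)

static DEFINE_PER_CPU(unsigned int, vc_ist_depth);

noinstr void vc_ist_enter(void)
{
	/* Nesting on the same IST stack would clobber the earlier frame. */
	if (this_cpu_inc_return(vc_ist_depth) > 1)
		panic("Unexpected #VC IST recursion");
}

noinstr void vc_ist_exit(void)
{
	this_cpu_dec(vc_ist_depth);
}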


Should SEV-ES #VC use IST? (Re: [PATCH] Allow RDTSC and RDTSCP from userspace)

2020-04-27 Thread Andy Lutomirski
On Sat, Apr 25, 2020 at 3:10 PM Andy Lutomirski  wrote:
>
> On Sat, Apr 25, 2020 at 1:23 PM Joerg Roedel  wrote:
> >
> > On Sat, Apr 25, 2020 at 12:47:31PM -0700, Andy Lutomirski wrote:
> > > I assume the race you mean is:
> > >
> > > #VC
> > > Immediate NMI before IST gets shifted
> > > #VC
> > >
> > > Kaboom.
> > >
> > > How are you dealing with this?  Ultimately, I think that NMI will need
> > > to turn off IST before engaging in any funny business. Let me ponder
> > > this a bit.
> >
> > Right, I dealt with that by unconditionally shifting/unshifting the #VC IST 
> > entry
> > in do_nmi() (thanks to David Kaplan for the idea). It might cause
> > one of the IST stacks to be unused during nesting, but that is fine. The
> > stack memory for #VC is only allocated when SEV-ES is active (in an
> > SEV-ES VM).
>
> Blech.  It probably works, but still, yuck.  It's a bit sad that we
> seem to be growing more and more poorly designed happens-anywhere
> exception types at an alarming rate.  We seem to have #NM, #MC, #VC,
> #HV, and #DB.  This doesn't really scale.

I have a somewhat serious question: should we use IST for #VC at all?
As I understand it, Rome and Naples make it mandatory for hypervisors
to intercept #DB, which means that, due to the MOV SS mess, it's sort
of mandatory to use IST for #VC.  But Milan fixes the #DB issue, so,
if we're running under a sufficiently sensible hypervisor, we don't
need IST for #VC.

So I think we have two choices:

1. Use IST for #VC and deal with all the mess that entails.

2. Say that SEV-ES client support on Rome and Naples is for
development only and do a quick boot-time check for whether #DB is
intercepted.  (Just set TF and see what vector we get.)  If #DB is
intercepted, print a very loud warning and refuse to boot unless some
special sev_es.insecure_development_mode or similar option is set.

#2 results in simpler and more robust entry code.  #1 is more secure.

So my question is: will anyone actually use SEV-ES in production on
Rome or Naples?  As I understand it, it's not really ready for prime
time on those chips.  And do we care if the combination of a malicious
hypervisor and malicious guest userspace on Milan can compromise the
guest kernel?  I don't think SEV-ES is really meant to resist a
concerted effort by the hypervisor to compromise the guest.

--Andy


Re: [PATCH] Allow RDTSC and RDTSCP from userspace

2020-04-25 Thread Andy Lutomirski
On Sat, Apr 25, 2020 at 1:23 PM Joerg Roedel  wrote:
>
> On Sat, Apr 25, 2020 at 12:47:31PM -0700, Andy Lutomirski wrote:
> > I assume the race you mean is:
> >
> > #VC
> > Immediate NMI before IST gets shifted
> > #VC
> >
> > Kaboom.
> >
> > How are you dealing with this?  Ultimately, I think that NMI will need
> > to turn off IST before engaging in any funny business. Let me ponder
> > this a bit.
>
> Right, I dealt with that by unconditionally shifting/unshifting the #VC IST 
> entry
> in do_nmi() (thanks to David Kaplan for the idea). It might cause
> one of the IST stacks to be unused during nesting, but that is fine. The
> stack memory for #VC is only allocated when SEV-ES is active (in an
> SEV-ES VM).

Blech.  It probably works, but still, yuck.  It's a bit sad that we
seem to be growing more and more poorly designed happens-anywhere
exception types at an alarming rate.  We seem to have #NM, #MC, #VC,
#HV, and #DB.  This doesn't really scale.

--Andy


Re: [PATCH] Allow RDTSC and RDTSCP from userspace

2020-04-25 Thread Andy Lutomirski


> On Apr 25, 2020, at 12:10 PM, Joerg Roedel  wrote:
> 
> On Sat, Apr 25, 2020 at 11:15:35AM -0700, Andy Lutomirski wrote:
>> shift_ist is gross.  What's it for?  If it's not needed, I'd rather
>> not use it, and I eventually want to get rid of it for #DB as well.
> 
> The #VC handler needs to be able to nest, there is no way around that
> for various reasons, the two most important ones are:
> 
>1. The #VC -> NMI -> #VC case. #VCs can happen in the NMI
>   handler, for example (but not exclusivly) for RDPMC.
> 
>2. In case of an error the #VC handler needs to print out error
>   information by calling one of the printk wrappers. Those will
>   end up doing IO to some console/serial port/whatever which
>   will also cause #VC exceptions to emulate the access to the
>   output devices.
> 
> Using shift_ist is perfect for that, the only problem is the race
> condition with the NMI handler, as shift_ist does not work well with
> exceptions that can also trigger within the NMI handler. But I have
> taken care of that for #VC.
> 

I assume the race you mean is:

#VC
Immediate NMI before IST gets shifted
#VC

Kaboom.

How are you dealing with this?  Ultimately, I think that NMI will need to turn 
off IST before engaging in any funny business. Let me ponder this a bit.

> 
> Regards,
> 
>Joerg
> 

Re: [PATCH] Allow RDTSC and RDTSCP from userspace

2020-04-25 Thread Andy Lutomirski
On Sat, Apr 25, 2020 at 5:49 AM Joerg Roedel  wrote:
>
> Hi Dave,
>
> On Fri, Apr 24, 2020 at 03:53:09PM -0700, Dave Hansen wrote:
> > Ahh, so any instruction that can have an instruction intercept set
> > potentially needs to be able to tolerate a #VC?  Those instruction
> > intercepts are under the control of the (untrusted relative to the
> > guest) hypervisor, right?
> >
> > >From the main sev-es series:
> >
> > +#ifdef CONFIG_AMD_MEM_ENCRYPT
> > +idtentry vmm_communication do_vmm_communication has_error_code=1
> > +#endif
>
> The next version of the patch-set (which I will hopefully have ready
> next week) will have this changed. The #VC exception handler uses an IST
> stack and is set to paranoid=1 and shift_ist. The IST stacks for the #VC
> handler are only allocated when SEV-ES is active.

shift_ist is gross.  What's it for?  If it's not needed, I'd rather
not use it, and I eventually want to get rid of it for #DB as well.

--Andy


Re: [PATCH 70/70] x86/sev-es: Add NMI state tracking

2020-03-19 Thread Andy Lutomirski
On Thu, Mar 19, 2020 at 12:26 PM Joerg Roedel  wrote:
>
> On Thu, Mar 19, 2020 at 11:40:39AM -0700, Andy Lutomirski wrote:
>
> > Nope.  A nested NMI will reset the interrupted NMI's return frame to
> > cause it to run again when it's done.  I don't think this will have
> > any real interaction with #VC.  There's no longjmp() here.
>
> Ahh, so I misunderstood that part, in this case your proposal of sending
> the NMI-complete message right at the beginning of do_nmi() should work
> just fine. I will test this and see how it works out.
>
> > I certainly *like* preventing nesting, but I don't think we really
> > want a whole alternate NMI path just for a couple of messed-up AMD
> > generations.  And the TF trick is not so pretty either.
>
> Indeed, if it could be avoided, it should.
>
> >
> > > > This causes us to pop the NMI frame off the stack.  Assuming the NMI
> > > > restart logic is invoked (which is maybe impossible?), we get #DB,
> > > > which presumably is actually delivered.  And we end up on the #DB
> > > > stack, which might already have been in use, so we have a potential
> > > > increase in nesting.  Also, #DB may be called from an unexpected
> > > > context.
> > >
> > > An SEV-ES hypervisor is required to intercept #DB, which means that the
> > > #DB exception actually ends up being a #VC exception. So it will not end
> > > up on the #DB stack.
> >
> > With your patch set, #DB doesn't seem to end up on the #DB stack either.
>
> Right, it does not use the #DB stack or shift-ist stuff. Maybe it
> should, is this needed for anything else than making entry code
> debugable by kgdb?

AIUI the shift-ist stuff is because we aren't very good about the way
that we handle tracing right now, and that can cause a limited degree
of recursion.  #DB uses IST for historical reasons that don't
necessarily make sense.  Right now, we need it for only one reason:
the MOV SS issue.  IIRC this isn't actually triggerable without
debugging enabled -- MOV SS with no breakpoint but TF on doesn't seem
to malfunction quite as badly.

--Andy

>
> Regards,
>
> Joerg


Re: [PATCH 41/70] x86/sev-es: Add Runtime #VC Exception Handler

2020-03-19 Thread Andy Lutomirski
On Thu, Mar 19, 2020 at 9:24 AM Joerg Roedel  wrote:
>
> On Thu, Mar 19, 2020 at 08:44:03AM -0700, Andy Lutomirski wrote:
> > On Thu, Mar 19, 2020 at 2:14 AM Joerg Roedel  wrote:
> > >
> > > From: Tom Lendacky 
> > >
> > > Add the handler for #VC exceptions invoked at runtime.
> >
> > If I read this correctly, this does not use IST.  If that's true, I
> > don't see how this can possibly work.  There at least two nasty cases
> > that come to mind:
> >
> > 1. SYSCALL followed by NMI.  The NMI IRET hack gets to #VC and we
> > explode.  This is fixable by getting rid of the NMI EFLAGS.TF hack.
>
> Not an issue in this patch-set, the confusion comes from the fact that I
> left some parts of the single-step-over-iret code in the patch. But it
> is not used. The NMI handling in this patch-set sends the NMI-complete
> message before the IRET, when the kernel is still in a safe environment
> (kernel stack, kernel cr3).

Got it!

>
> > 2. tools/testing/selftests/x86/mov_ss_trap_64.  User code does MOV
> > (addr), SS; SYSCALL, where addr has a data breakpoint.  We get #DB
> > promoted to #VC with no stack.
>
> Also not an issue, as debugging is not supported at the moment in SEV-ES
> guests (hardware has no way yet to save/restore the debug registers
> across #VMEXITs). But this will change with future hardware. If you look
> at the implementation for dr7 read/write events, you see that the dr7
> value is cached and returned, but does not make it to the hardware dr7.

Eek.  This would probably benefit from some ptrace / perf logic to
prevent the kernel or userspace from thinking that debugging works.

I guess this means that #DB only happens due to TF or INT01.  I
suppose this is probably okay.

>
> I though about using IST for the #VC handler, but the implications for
> nesting #VC handlers made me decide against it. But for future hardware
> that supports debugging inside SEV-ES guests it will be an issue. I'll
> think about how to fix the problem, it probably has to be IST :(

Or future generations could have enough hardware support for debugging
that #DB doesn't need to be intercepted or can be re-injected
correctly with the #DB vector.

>
> Regards,
>
> Joerg


Re: [PATCH 70/70] x86/sev-es: Add NMI state tracking

2020-03-19 Thread Andy Lutomirski
On Thu, Mar 19, 2020 at 9:07 AM Joerg Roedel  wrote:
>
> Hi Andy,
>
> On Thu, Mar 19, 2020 at 08:35:59AM -0700, Andy Lutomirski wrote:
> > On Thu, Mar 19, 2020 at 2:14 AM Joerg Roedel  wrote:
> > >
> > > From: Joerg Roedel 
> > >
> > > Keep NMI state in SEV-ES code so the kernel can re-enable NMIs for the
> > > vCPU when it reaches IRET.
> >
> > IIRC I suggested just re-enabling NMI in C from do_nmi().  What was
> > wrong with that approach?
>
> If I understand the code correctly a nested NMI will just reset the
> interrupted NMI handler to start executing again at 'restart_nmi'.
> The interrupted NMI handler could be in the #VC handler, and it is not
> safe to just jump back to the start of the NMI handler from somewhere
> within the #VC handler.

Nope.  A nested NMI will reset the interrupted NMI's return frame to
cause it to run again when it's done.  I don't think this will have
any real interaction with #VC.  There's no longjmp() here.

>
> So I decided to not allow NMI nesting for SEV-ES and only re-enable the
> NMI window when the first NMI returns. This is not implemented in this
> patch, but I will do that once Thomas' entry-code rewrite is upstream.
>

I certainly *like* preventing nesting, but I don't think we really
want a whole alternate NMI path just for a couple of messed-up AMD
generations.  And the TF trick is not so pretty either.

> > This causes us to pop the NMI frame off the stack.  Assuming the NMI
> > restart logic is invoked (which is maybe impossible?), we get #DB,
> > which presumably is actually delivered.  And we end up on the #DB
> > stack, which might already have been in use, so we have a potential
> > increase in nesting.  Also, #DB may be called from an unexpected
> > context.
>
> An SEV-ES hypervisor is required to intercept #DB, which means that the
> #DB exception actually ends up being a #VC exception. So it will not end
> up on the #DB stack.

With your patch set, #DB doesn't seem to end up on the #DB stack either.

>
> > I think there are two credible ways to approach this:
> >
> > 1. Just put the NMI unmask in do_nmi().  The kernel *already* knows
> > how to handle running do_nmi() with NMIs unmasked.  This is much, much
> > simpler than your code.
>
> Right, and I thought about that, but the implication is that the
> complexity is moved somewhere else, namely into the #VC handler, which
> then has to be restartable.

As above, I don't think there's an actual problem here.

>
> > 2. Have an entirely separate NMI path for the
> > SEV-ES-on-misdesigned-CPU case.  And have very clear documentation for
> > what prevents this code from being executed on future CPUs (Zen3?)
> > that have this issue fixed for real?
>
> That sounds like a good alternative, I will investigate this approach.
> The NMI handler should be much simpler as it doesn't need to allow NMI
> nesting. The question is, does the C code down the NMI path depend on
> the NMI handler's stack-frame layout (e.g. the in-nmi flag)?

Nope.  In particular, the 32-bit path doesn't have all this.


Re: [PATCH 42/70] x86/sev-es: Support nested #VC exceptions

2020-03-19 Thread Andy Lutomirski
On Thu, Mar 19, 2020 at 2:14 AM Joerg Roedel  wrote:
>
> From: Joerg Roedel 
>
> Handle #VC exceptions that happen while the GHCB is in use. This can
> happen when an NMI happens in the #VC exception handler and the NMI
> handler causes a #VC exception itself. Save the contents of the GHCB
> when nesting is detected and restore it when the GHCB is no longer
> used.
>
> Signed-off-by: Joerg Roedel 
> ---
>  arch/x86/kernel/sev-es.c | 63 +---
>  1 file changed, 59 insertions(+), 4 deletions(-)
>
> diff --git a/arch/x86/kernel/sev-es.c b/arch/x86/kernel/sev-es.c
> index 97241d2f0f70..3b7bbc8d841e 100644
> --- a/arch/x86/kernel/sev-es.c
> +++ b/arch/x86/kernel/sev-es.c
> @@ -32,9 +32,57 @@ struct ghcb boot_ghcb_page __bss_decrypted __aligned(PAGE_SIZE);
>   */
>  struct ghcb __initdata *boot_ghcb;
>
> +struct ghcb_state {
> +   struct ghcb *ghcb;
> +};
> +
>  /* Runtime GHCB pointers */
>  static struct ghcb __percpu *ghcb_page;
>
> +/*
> + * Mark the per-cpu GHCB as in-use to detect nested #VC exceptions.
> + * There is no need for it to be atomic, because nothing is written to
> + * the GHCB between the read and the write of ghcb_active. So it is safe
> + * to use it when a nested #VC exception happens before the write.
> + */
> +static DEFINE_PER_CPU(bool, ghcb_active);
> +
> +static struct ghcb *sev_es_get_ghcb(struct ghcb_state *state)
> +{
> +   struct ghcb *ghcb = (struct ghcb *)this_cpu_ptr(ghcb_page);
> +   bool *active = this_cpu_ptr(&ghcb_active);
> +
> +   if (unlikely(*active)) {
> +   /* GHCB is already in use - save its contents */
> +
> +   state->ghcb = kzalloc(sizeof(struct ghcb), GFP_ATOMIC);
> +   if (!state->ghcb)
> +   return NULL;

This can't possibly end well.  Maybe have a little percpu list of
GHCBs and make sure there are enough for any possible nesting?

Also, I admit confusion.  Isn't the GHCB required to be unencrypted?
How does that work with kzalloc()?
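(For concreteness: a minimal sketch of the fixed per-CPU backup idea, with
hypothetical names and an assumed nesting bound.  The backup copies can stay
in encrypted memory - only the live GHCB page must be shared with the
hypervisor - which also answers the kzalloc() question above:)

    #define GHCB_BACKUP_SLOTS       2       /* assumption: max #VC nesting depth */

    static DEFINE_PER_CPU(struct ghcb, ghcb_backup[GHCB_BACKUP_SLOTS]);
    static DEFINE_PER_CPU(int, ghcb_nest_level);

    static struct ghcb *sev_es_get_ghcb(struct ghcb_state *state)
    {
            struct ghcb *ghcb = (struct ghcb *)this_cpu_ptr(ghcb_page);
            int level = this_cpu_read(ghcb_nest_level);

            if (WARN_ON_ONCE(level > GHCB_BACKUP_SLOTS))
                    return NULL;

            if (level) {
                    /*
                     * GHCB already in use - snapshot it into a fixed slot.
                     * The snapshot lives in encrypted memory; only the live
                     * GHCB page itself has to be mapped unencrypted.
                     */
                    state->ghcb = this_cpu_ptr(&ghcb_backup[level - 1]);
                    memcpy(state->ghcb, ghcb, sizeof(*ghcb));
            } else {
                    state->ghcb = NULL;
            }

            this_cpu_write(ghcb_nest_level, level + 1);

            return ghcb;
    }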


Re: [PATCH 41/70] x86/sev-es: Add Runtime #VC Exception Handler

2020-03-19 Thread Andy Lutomirski
On Thu, Mar 19, 2020 at 2:14 AM Joerg Roedel  wrote:
>
> From: Tom Lendacky 
>
> Add the handler for #VC exceptions invoked at runtime.

If I read this correctly, this does not use IST.  If that's true, I
don't see how this can possibly work.  There at least two nasty cases
that come to mind:

1. SYSCALL followed by NMI.  The NMI IRET hack gets to #VC and we
explode.  This is fixable by getting rid of the NMI EFLAGS.TF hack.

2. tools/testing/selftests/x86/mov_ss_trap_64.  User code does MOV
(addr), SS; SYSCALL, where addr has a data breakpoint.  We get #DB
promoted to #VC with no stack.


Re: [PATCH 70/70] x86/sev-es: Add NMI state tracking

2020-03-19 Thread Andy Lutomirski
On Thu, Mar 19, 2020 at 2:14 AM Joerg Roedel  wrote:
>
> From: Joerg Roedel 
>
> Keep NMI state in SEV-ES code so the kernel can re-enable NMIs for the
> vCPU when it reaches IRET.

IIRC I suggested just re-enabling NMI in C from do_nmi().  What was
wrong with that approach?

> +#ifdef CONFIG_AMD_MEM_ENCRYPT
> +SYM_CODE_START(sev_es_iret_user)
> +   UNWIND_HINT_IRET_REGS offset=8
> +   /*
> +* The kernel jumps here directly from
> +* swapgs_restore_regs_and_return_to_usermode. %rsp points already to
> +* trampoline stack, but %cr3 is still from kernel. User-regs are live
> +* except %rdi. Switch to user CR3, restore user %rdi and user gs_base
> +* and single-step over IRET
> +*/
> +   SWITCH_TO_USER_CR3_STACK scratch_reg=%rdi
> +   popq    %rdi
> +   SWAPGS
> +   /*
> +* Enable single-stepping and execute IRET. When IRET is
> +* finished the resulting #DB exception will cause a #VC
> +* exception to be raised. The #VC exception handler will send a
> +* NMI-complete message to the hypervisor to re-open the NMI
> +* window.

This is distressing to say the least.  The sequence of events is, roughly:

1. We're here with NMI masking in an unknown state because do_nmi()
and any nested faults could have done IRET, at least architecturally.
NMI could occur or it could not.  I suppose that, on SEV-ES, at least
on current CPUs, NMI is definitely masked.  What about on newer CPUs?
What if we migrate?

> +*/
> +sev_es_iret_kernel:
> +   pushf
> +   btsq $X86_EFLAGS_TF_BIT, (%rsp)
> +   popf

Now we have TF on, NMIs (architecturally) in unknown state.

> +   iretq

This causes us to pop the NMI frame off the stack.  Assuming the NMI
restart logic is invoked (which is maybe impossible?), we get #DB,
which presumably is actually delivered.  And we end up on the #DB
stack, which might already have been in use, so we have a potential
increase in nesting.  Also, #DB may be called from an unexpected
context.

Now somehow #DB is supposed to invoke #VC, which is supposed to do the
magic hypercall, and all of this is supposed to be safe?  Or is #DB
unconditionally redirected to #VC?  What happens if we had no stack
(e.g. we interrupted SYSCALL) or we were already in #VC to begin with?

I think there are two credible ways to approach this:

1. Just put the NMI unmask in do_nmi().  The kernel *already* knows
how to handle running do_nmi() with NMIs unmasked.  This is much, much
simpler than your code.

2. Have an entirely separate NMI path for the
SEV-ES-on-misdesigned-CPU case.  And have very clear documentation for
what prevents this code from being executed on future CPUs (Zen3?)
that have this issue fixed for real?

This hybrid code is no good.

--Andy


Re: [PATCH 38/62] x86/sev-es: Handle instruction fetches from user-space

2020-03-17 Thread Andy Lutomirski
On Fri, Mar 13, 2020 at 2:12 AM Joerg Roedel  wrote:
>
> On Wed, Feb 12, 2020 at 01:42:48PM -0800, Andy Lutomirski wrote:
> > I realize that this is a somewhat arbitrary point in the series to
> > complain about this, but: the kernel already has infrastructure to
> > decode and fix up an instruction-based exception.  See
> > fixup_umip_exception().  Please refactor code so that you can share
> > the same infrastructure rather than creating an entirely new thing.
>
> Okay, but 'infrastructure' is a bold word for the call path down
> fixup_umip_exception().

I won't argue with that.

> It uses the in-kernel instruction decoder, which
> I already use in my patch-set. But I agree that some code in this
> patch-set is duplicated and already present in the instruction decoder,
> and that fixup_umip_exception() has more robust instruction decoding.
>
> I factor the instruction decoding part out and make is usable for the
> #VC handler too and remove the code that is already present in the
> instruction decoder.

Thanks!

>
> Regards,
>
> Joerg
>


Re: [PATCH 38/62] x86/sev-es: Handle instruction fetches from user-space

2020-02-12 Thread Andy Lutomirski
On Tue, Feb 11, 2020 at 5:53 AM Joerg Roedel  wrote:
>
> From: Joerg Roedel 
>
> When a #VC exception is triggered by user-space the instruction
> decoder needs to read the instruction bytes from user addresses.
> Enhance es_fetch_insn_byte() to safely fetch kernel and user
> instruction bytes.

I realize that this is a somewhat arbitrary point in the series to
complain about this, but: the kernel already has infrastructure to
decode and fix up an instruction-based exception.  See
fixup_umip_exception().  Please refactor code so that you can share
the same infrastructure rather than creating an entirely new thing.

FWIW, the fixup_umip_exception() code seems to have much more robust
segment handling than yours :)

--Andy


Re: [PATCH 23/62] x86/idt: Move IDT to data segment

2020-02-12 Thread Andy Lutomirski


> On Feb 12, 2020, at 3:55 AM, Joerg Roedel  wrote:
> 
> On Tue, Feb 11, 2020 at 02:41:25PM -0800, Andy Lutomirski wrote:
>>> On Tue, Feb 11, 2020 at 5:53 AM Joerg Roedel  wrote:
>>> 
>>> From: Joerg Roedel 
>>> 
>>> With SEV-ES, exception handling is needed very early, even before the
>>> kernel has cleared the bss segment. In order to prevent clearing the
>>> currently used IDT, move the IDT to the data segment.
>> 
>> Ugh.  At the very least this needs a comment in the code.
> 
> Yes, right, added a comment for that.
> 
>> I had a patch to fix the kernel ELF loader to clear BSS, which would
>> fix this problem once and for all, but it didn't work due to the messy
>> way that the decompressor handles memory.  I never got around to
>> fixing this, sadly.
> 
> Aren't there other ways of booting (Xen-PV?) which don't use the kernel
> ELF loader?

Dunno. I would hope that any sane loader would clear BSS before executing 
anything. This isn’t currently the case, though. Oh well.

> 
> Regards,
> 
>Joerg

Re: [PATCH 14/62] x86/boot/compressed/64: Add stage1 #VC handler

2020-02-12 Thread Andy Lutomirski


> On Feb 12, 2020, at 3:38 AM, Joerg Roedel  wrote:
> 
> On Tue, Feb 11, 2020 at 02:23:22PM -0800, Andy Lutomirski wrote:
>>> On Tue, Feb 11, 2020 at 5:53 AM Joerg Roedel  wrote:
>>> +void __init no_ghcb_vc_handler(struct pt_regs *regs)
>> 
>> Isn't there a second parameter: unsigned long error_code?
> 
> No, the function gets the error-code from regs->orig_ax. This particular
> function only needs to check for error_code == SVM_EXIT_CPUID, as that
> is the only one supported when there is no GHCB.
> 

Hmm. It might be nice to use the same signature for early handlers as for 
normal ones.

> Regards,
> 
>Joerg

Re: [RFC PATCH 00/62] Linux as SEV-ES Guest Support

2020-02-11 Thread Andy Lutomirski


> On Feb 11, 2020, at 5:53 AM, Joerg Roedel  wrote:

> 
> 
>* Putting some NMI-load on the guest will make it crash usually
>  within a minute

Suppose you do CPUID or some MMIO and get #VC. You fill in the GHCB to ask for 
help. Some time between when you start filling it out and when you do VMGEXIT, 
you get NMI. If the NMI does
its own GHCB access [0], it will clobber the outer #VC’s state, resulting in a 
failure when VMGEXIT happens. There’s a related failure mode if the NMI is 
after the VMGEXIT but before the result is read.

I suspect you can fix this by saving the GHCB at the beginning of do_nmi and 
restoring it at the end. This has the major caveat that it will not work if 
do_nmi comes from user mode and schedules, but I don’t believe this can happen.

[0] Due to the NMI_COMPLETE catastrophe, there is a 100% chance that this 
happens.
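(A sketch of that save/restore, with hypothetical names.  A per-CPU buffer is
used instead of a stack copy because struct ghcb is a full page, far too big
for the NMI stack; and per the caveat above this relies on do_nmi() never
scheduling:)

    static DEFINE_PER_CPU(struct ghcb, nmi_ghcb_backup);

    dotraplinkage notrace void do_nmi(struct pt_regs *regs, long error_code)
    {
            struct ghcb *ghcb   = (struct ghcb *)this_cpu_ptr(ghcb_page);
            struct ghcb *backup = this_cpu_ptr(&nmi_ghcb_backup);

            memcpy(backup, ghcb, sizeof(*backup)); /* save interrupted #VC state */
            default_do_nmi(regs);                  /* the existing NMI body */
            memcpy(ghcb, backup, sizeof(*ghcb));   /* restore before the outer
                                                      #VC does its VMGEXIT */
    }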

Re: [PATCH 50/62] x86/sev-es: Handle VMMCALL Events

2020-02-11 Thread Andy Lutomirski


> On Feb 11, 2020, at 5:53 AM, Joerg Roedel  wrote:
> 
> From: Tom Lendacky 
> 
> Implement a handler for #VC exceptions caused by VMMCALL instructions.
> This patch is only a starting point, VMMCALL emulation under SEV-ES
> needs further hypervisor-specific changes to provide additional state.
> 

How about we just don’t do VMMCALL if we’re a SEV-ES guest?  Otherwise we add 
thousands of cycles of extra latency for no good reason.

Re: [PATCH 46/62] x86/sev-es: Handle INVD Events

2020-02-11 Thread Andy Lutomirski


> On Feb 11, 2020, at 5:53 AM, Joerg Roedel  wrote:
> 
> From: Tom Lendacky 
> 
> Implement a handler for #VC exceptions caused by INVD instructions.

Uh, what?  Surely the #VC code can have a catch-all OOPS path for things like 
this. Linux should never ever do INVD.
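(Sketch of such a catch-all in the exit-code dispatch, using the series'
enum es_result; the exact function names are assumptions:)

    static enum es_result vc_handle_exitcode(struct es_em_ctxt *ctxt,
                                             struct ghcb *ghcb, u64 exit_code)
    {
            switch (exit_code) {
            case SVM_EXIT_CPUID:
                    return vc_handle_cpuid(ghcb, ctxt);
            /* ... other exit codes Linux can legitimately trigger ... */
            default:
                    /*
                     * INVD and friends land here; the caller turns
                     * ES_UNSUPPORTED into an oops instead of emulating.
                     */
                    return ES_UNSUPPORTED;
            }
    }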

Re: [PATCH 62/62] x86/sev-es: Add NMI state tracking

2020-02-11 Thread Andy Lutomirski
On Tue, Feb 11, 2020 at 5:53 AM Joerg Roedel  wrote:
>
> From: Joerg Roedel 
>
> Keep NMI state in SEV-ES code so the kernel can re-enable NMIs for the
> vCPU when it reaches IRET.

This patch is overcomplicated IMO.  Just do the magic incantation in C
from do_nmi or from here:

/*
 * For ease of testing, unmask NMIs right away.  Disabled by
 * default because IRET is very expensive.

If you do the latter, you'll need to handle the case where the NMI
came from user mode.

The ideal solution is do_nmi, I think.

if (static_cpu_has(X86_BUG_AMD_FORGOT_ABOUT_NMI))
  sev_es_unmask_nmi();

Feel free to use X86_FEATURE_SEV_ES instead :)

--Andy
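(What sev_es_unmask_nmi() might boil down to, sketched with the GHCB
accessors from this series.  SVM_VMGEXIT_NMI_COMPLETE is the NMI_COMPLETE
message discussed elsewhere in the thread, and the GHCB MSR is assumed to
already point at this vCPU's GHCB:)

    static void sev_es_unmask_nmi(void)
    {
            struct ghcb *ghcb = (struct ghcb *)this_cpu_ptr(ghcb_page);

            ghcb_set_sw_exit_code(ghcb, SVM_VMGEXIT_NMI_COMPLETE);
            ghcb_set_sw_exit_info_1(ghcb, 0);
            ghcb_set_sw_exit_info_2(ghcb, 0);

            VMGEXIT();      /* rep; vmmcall - hypervisor re-opens the NMI window */
    }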


Re: [PATCH 39/62] x86/sev-es: Harden runtime #VC handler for exceptions from user-space

2020-02-11 Thread Andy Lutomirski
On Tue, Feb 11, 2020 at 5:53 AM Joerg Roedel  wrote:
>
> From: Joerg Roedel 
>
> Send SIGBUS to the user-space process that caused the #VC exception
> instead of killing the machine. Also ratelimit the error messages so
> that user-space can't flood the kernel log.

What would cause this?  CPUID?  Something else?

--Andy


Re: [PATCH 35/62] x86/sev-es: Setup per-cpu GHCBs for the runtime handler

2020-02-11 Thread Andy Lutomirski
On Tue, Feb 11, 2020 at 5:53 AM Joerg Roedel  wrote:
>
> From: Tom Lendacky 
>
> The runtime handler needs a GHCB per CPU. Set them up and map them
> unencrypted.
>
> Signed-off-by: Tom Lendacky 
> Signed-off-by: Joerg Roedel 
> ---
>  arch/x86/include/asm/mem_encrypt.h |  2 ++
>  arch/x86/kernel/sev-es.c   | 25 -
>  arch/x86/kernel/traps.c|  3 +++
>  3 files changed, 29 insertions(+), 1 deletion(-)
>
> diff --git a/arch/x86/include/asm/mem_encrypt.h b/arch/x86/include/asm/mem_encrypt.h
> index 6f61bb93366a..d48e7be9bb49 100644
> --- a/arch/x86/include/asm/mem_encrypt.h
> +++ b/arch/x86/include/asm/mem_encrypt.h
> @@ -48,6 +48,7 @@ int __init early_set_memory_encrypted(unsigned long vaddr, unsigned long size);
>  void __init mem_encrypt_init(void);
>  void __init mem_encrypt_free_decrypted_mem(void);
>
> +void __init encrypted_state_init_ghcbs(void);
>  bool sme_active(void);
>  bool sev_active(void);
>  bool sev_es_active(void);
> @@ -71,6 +72,7 @@ static inline void __init sme_early_init(void) { }
>  static inline void __init sme_encrypt_kernel(struct boot_params *bp) { }
>  static inline void __init sme_enable(struct boot_params *bp) { }
>
> +static inline void encrypted_state_init_ghcbs(void) { }
>  static inline bool sme_active(void) { return false; }
>  static inline bool sev_active(void) { return false; }
>  static inline bool sev_es_active(void) { return false; }
> diff --git a/arch/x86/kernel/sev-es.c b/arch/x86/kernel/sev-es.c
> index 0e0b28477627..9a5530857db7 100644
> --- a/arch/x86/kernel/sev-es.c
> +++ b/arch/x86/kernel/sev-es.c
> @@ -8,8 +8,11 @@
>   */
>
>  #include  /* For show_regs() */
> -#include 
> +#include 
> +#include 
>  #include 
> +#include 
> +#include 
>  #include 
>
>  #include 
> @@ -28,6 +31,9 @@ struct ghcb boot_ghcb_page __bss_decrypted __aligned(PAGE_SIZE);
>   */
>  struct ghcb __initdata *boot_ghcb;
>
> +/* Runtime GHCBs */
> +static DEFINE_PER_CPU_DECRYPTED(struct ghcb, ghcb_page) __aligned(PAGE_SIZE);

Hmm.  This is a largeish amount of memory on large non-SEV-ES systems.
Maybe store a pointer instead?  It would be even better if it could be
DEFINE_PER_CPU like this but be discarded if we don't need it, but I
don't think we have the infrastructure for that.

--Andy
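(A sketch of the pointer variant - allocate only when SEV-ES is active, and
map the page shared with set_memory_decrypted(); names are hypothetical:)

    static DEFINE_PER_CPU(struct ghcb *, ghcb_ptr);

    void __init encrypted_state_init_ghcbs(void)
    {
            int cpu;

            if (!sev_es_active())
                    return;

            for_each_possible_cpu(cpu) {
                    unsigned long addr = __get_free_page(GFP_KERNEL | __GFP_ZERO);

                    if (!addr)
                            panic("Cannot allocate GHCB for CPU %d\n", cpu);

                    /* The hypervisor has to read this page: map it shared. */
                    set_memory_decrypted(addr, 1);
                    per_cpu(ghcb_ptr, cpu) = (struct ghcb *)addr;
            }
    }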


Re: [PATCH 30/62] x86/head/64: Move early exception dispatch to C code

2020-02-11 Thread Andy Lutomirski
On Tue, Feb 11, 2020 at 5:53 AM Joerg Roedel  wrote:
>
> From: Joerg Roedel 
>
> Move the assembly coded dispatch between page-faults and all other
> exceptions to C code to make it easier to maintain and extend.
>
> Signed-off-by: Joerg Roedel 
> ---
>  arch/x86/kernel/head64.c  | 20 
>  arch/x86/kernel/head_64.S | 11 +--
>  2 files changed, 21 insertions(+), 10 deletions(-)
>
> diff --git a/arch/x86/kernel/head64.c b/arch/x86/kernel/head64.c
> index 7cdfb7113811..d83c62ebaa85 100644
> --- a/arch/x86/kernel/head64.c
> +++ b/arch/x86/kernel/head64.c
> @@ -36,6 +36,8 @@
>  #include 
>  #include 
>  #include 
> +#include 
> +#include 
>
>  /*
>   * Manage page tables very early on.
> @@ -377,6 +379,24 @@ int __init early_make_pgtable(unsigned long address)
> return __early_make_pgtable(address, pmd);
>  }
>
> +void __init early_exception(struct pt_regs *regs, int trapnr)
> +{
> +   unsigned long cr2;
> +   int r;

How about int (or bool) handled?  Or just if (!early_make_pgtable) return?
This would also be nicer if you inverted the return value so that true
means "I handled it".

--Andy
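(I.e., something like this; early_make_pgtable() here is the inverted
variant suggested above, returning true when it handled the fault:)

    void __init early_exception(struct pt_regs *regs, int trapnr)
    {
            /* A #PF this early is handled by building the page-table entry. */
            if (trapnr == X86_TRAP_PF && early_make_pgtable(native_read_cr2()))
                    return;

            early_fixup_exception(regs, trapnr);
    }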


Re: [PATCH 23/62] x86/idt: Move IDT to data segment

2020-02-11 Thread Andy Lutomirski
On Tue, Feb 11, 2020 at 5:53 AM Joerg Roedel  wrote:
>
> From: Joerg Roedel 
>
> With SEV-ES, exception handling is needed very early, even before the
> kernel has cleared the bss segment. In order to prevent clearing the
> currently used IDT, move the IDT to the data segment.

Ugh.  At the very least this needs a comment in the code.

I had a patch to fix the kernel ELF loader to clear BSS, which would
fix this problem once and for all, but it didn't work due to the messy
way that the decompressor handles memory.  I never got around to
fixing this, sadly.


Re: [PATCH 25/62] x86/head/64: Install boot GDT

2020-02-11 Thread Andy Lutomirski
On Tue, Feb 11, 2020 at 5:53 AM Joerg Roedel  wrote:
>
> From: Joerg Roedel 
>
> Handling exceptions during boot requires a working GDT. The kernel GDT
> is not yet ready for use, so install a temporary boot GDT.
>
> Signed-off-by: Joerg Roedel 
> ---
>  arch/x86/kernel/head_64.S | 26 ++
>  1 file changed, 26 insertions(+)
>
> diff --git a/arch/x86/kernel/head_64.S b/arch/x86/kernel/head_64.S
> index 4bbc770af632..5a3cde971cb7 100644
> --- a/arch/x86/kernel/head_64.S
> +++ b/arch/x86/kernel/head_64.S
> @@ -72,6 +72,20 @@ SYM_CODE_START_NOALIGN(startup_64)
> /* Set up the stack for verify_cpu(), similar to initial_stack below */
> leaq    (__end_init_task - SIZEOF_PTREGS)(%rip), %rsp
>
> +   /* Setup boot GDT descriptor and load boot GDT */
> +   leaq    boot_gdt(%rip), %rax
> +   movq    %rax, boot_gdt_base(%rip)
> +   lgdt    boot_gdt_descr(%rip)
> +
> +   /* GDT loaded - switch to __KERNEL_CS so IRET works reliably */
> +   pushq   $__KERNEL_CS
> +   leaq    .Lon_kernel_cs(%rip), %rax
> +   pushq   %rax
> +   lretq
> +
> +.Lon_kernel_cs:
> +   UNWIND_HINT_EMPTY

I would suggest fixing at least SS as well.


Re: [PATCH 19/62] x86/sev-es: Add support for handling IOIO exceptions

2020-02-11 Thread Andy Lutomirski
On Tue, Feb 11, 2020 at 5:53 AM Joerg Roedel  wrote:
>
> From: Tom Lendacky 
>
> Add support for decoding and handling #VC exceptions for IOIO events.
>
> Signed-off-by: Tom Lendacky 
> [ jroe...@suse.de: Adapted code to #VC handling framework ]
> Co-developed-by: Joerg Roedel 
> Signed-off-by: Joerg Roedel 

It would be nice if this could reuse the existing in-kernel
instruction decoder.  Is there some reason it can't?

--Andy
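(The common decoder would make the IOIO operand parsing mostly free - a
sketch, assuming a pre-fetched instruction buffer:)

    #include <asm/insn.h>

    static int vc_decode_insn(struct insn *insn,
                              unsigned char *buf, int buf_len)
    {
            insn_init(insn, buf, buf_len, 1 /* x86_64 */);
            insn_get_length(insn);  /* decodes prefixes, opcode and operands */

            /* insn->opcode / insn->immediate now describe the IN/OUT variant */
            return insn_complete(insn) ? 0 : -EINVAL;
    }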


Re: [PATCH 18/62] x86/boot/compressed/64: Setup GHCB Based VC Exception handler

2020-02-11 Thread Andy Lutomirski
On Tue, Feb 11, 2020 at 5:53 AM Joerg Roedel  wrote:
>
> From: Joerg Roedel 
>
> Install an exception handler for #VC exception that uses a GHCB. Also
> add the infrastructure for handling different exit-codes by decoding
> the instruction that caused the exception and error handling.
>

> diff --git a/arch/x86/boot/compressed/sev-es.c b/arch/x86/boot/compressed/sev-es.c
> index 8d13121a8cf2..02fb6f57128b 100644
> --- a/arch/x86/boot/compressed/sev-es.c
> +++ b/arch/x86/boot/compressed/sev-es.c
> @@ -8,12 +8,16 @@
>  #include 
>
>  #include 
> +#include 
>  #include 
>  #include 
>  #include 
>
>  #include "misc.h"
>
> +struct ghcb boot_ghcb_page __aligned(PAGE_SIZE);
> +struct ghcb *boot_ghcb;
> +
>  static inline u64 read_ghcb_msr(void)
>  {
> unsigned long low, high;
> @@ -35,8 +39,95 @@ static inline void write_ghcb_msr(u64 val)
> "a"(low), "d" (high) : "memory");
>  }
>
> +static enum es_result es_fetch_insn_byte(struct es_em_ctxt *ctxt,
> +unsigned int offset,
> +char *buffer)
> +{
> +   char *rip = (char *)ctxt->regs->ip;
> +
> +   buffer[offset] = rip[offset];
> +
> +   return ES_OK;
> +}
> +
> +static enum es_result es_write_mem(struct es_em_ctxt *ctxt,
> +  void *dst, char *buf, size_t size)
> +{
> +   memcpy(dst, buf, size);
> +
> +   return ES_OK;
> +}
> +
> +static enum es_result es_read_mem(struct es_em_ctxt *ctxt,
> + void *src, char *buf, size_t size)
> +{
> +   memcpy(buf, src, size);
> +
> +   return ES_OK;
> +}


What are all these abstractions for?


Re: [PATCH 14/62] x86/boot/compressed/64: Add stage1 #VC handler

2020-02-11 Thread Andy Lutomirski
On Tue, Feb 11, 2020 at 5:53 AM Joerg Roedel  wrote:
>
> From: Joerg Roedel 
>
> Add the first handler for #VC exceptions. At stage 1 there is no GHCB
> yet because we might still be on the EFI page table and thus can't map
> memory unencrypted.
>
> The stage 1 handler is limited to the MSR based protocol to talk to
> the hypervisor and can only support CPUID exit-codes, but that is
> enough to get to stage 2.
>
> Signed-off-by: Joerg Roedel 
> ---
>  arch/x86/boot/compressed/Makefile  |  1 +
>  arch/x86/boot/compressed/idt_64.c  |  4 ++
>  arch/x86/boot/compressed/idt_handlers_64.S |  4 ++
>  arch/x86/boot/compressed/misc.h|  1 +
>  arch/x86/boot/compressed/sev-es.c  | 42 ++
>  arch/x86/include/asm/msr-index.h   |  1 +
>  arch/x86/include/asm/sev-es.h  | 45 +++
>  arch/x86/include/asm/trap_defs.h   |  1 +
>  arch/x86/kernel/sev-es-shared.c| 66 ++
>  9 files changed, 165 insertions(+)
>  create mode 100644 arch/x86/boot/compressed/sev-es.c
>  create mode 100644 arch/x86/include/asm/sev-es.h
>  create mode 100644 arch/x86/kernel/sev-es-shared.c
>
> diff --git a/arch/x86/boot/compressed/Makefile b/arch/x86/boot/compressed/Makefile
> index e6b3e0fc48de..583678c78e1b 100644
> --- a/arch/x86/boot/compressed/Makefile
> +++ b/arch/x86/boot/compressed/Makefile
> @@ -84,6 +84,7 @@ ifdef CONFIG_X86_64
> vmlinux-objs-y += $(obj)/idt_64.o $(obj)/idt_handlers_64.o
> vmlinux-objs-y += $(obj)/mem_encrypt.o
> vmlinux-objs-y += $(obj)/pgtable_64.o
> +   vmlinux-objs-$(CONFIG_AMD_MEM_ENCRYPT) += $(obj)/sev-es.o
>  endif
>
>  vmlinux-objs-$(CONFIG_ACPI) += $(obj)/acpi.o
> diff --git a/arch/x86/boot/compressed/idt_64.c b/arch/x86/boot/compressed/idt_64.c
> index 84ba57d9d436..bdd20dfd1fd0 100644
> --- a/arch/x86/boot/compressed/idt_64.c
> +++ b/arch/x86/boot/compressed/idt_64.c
> @@ -31,6 +31,10 @@ void load_stage1_idt(void)
>  {
> boot_idt_desc.address = (unsigned long)boot_idt;
>
> +#ifdef CONFIG_AMD_MEM_ENCRYPT
> +   set_idt_entry(X86_TRAP_VC, boot_stage1_vc_handler);
> +#endif
> +
> load_boot_idt(&boot_idt_desc);
>  }
>
> diff --git a/arch/x86/boot/compressed/idt_handlers_64.S b/arch/x86/boot/compressed/idt_handlers_64.S
> index f7f1ea66dcbf..330eb4e5c8b3 100644
> --- a/arch/x86/boot/compressed/idt_handlers_64.S
> +++ b/arch/x86/boot/compressed/idt_handlers_64.S
> @@ -71,3 +71,7 @@ SYM_FUNC_END(\name)
> .code64
>
>  EXCEPTION_HANDLER  boot_pf_handler do_boot_page_fault error_code=1
> +
> +#ifdef CONFIG_AMD_MEM_ENCRYPT
> +EXCEPTION_HANDLER  boot_stage1_vc_handler no_ghcb_vc_handler error_code=1
> +#endif
> diff --git a/arch/x86/boot/compressed/misc.h b/arch/x86/boot/compressed/misc.h
> index 4e5bc688f467..0e3508c5c15c 100644
> --- a/arch/x86/boot/compressed/misc.h
> +++ b/arch/x86/boot/compressed/misc.h
> @@ -141,5 +141,6 @@ extern struct desc_ptr boot_idt_desc;
>
>  /* IDT Entry Points */
>  void boot_pf_handler(void);
> +void boot_stage1_vc_handler(void);
>
>  #endif /* BOOT_COMPRESSED_MISC_H */
> diff --git a/arch/x86/boot/compressed/sev-es.c b/arch/x86/boot/compressed/sev-es.c
> new file mode 100644
> index ..8d13121a8cf2
> --- /dev/null
> +++ b/arch/x86/boot/compressed/sev-es.c
> @@ -0,0 +1,42 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * AMD Encrypted Register State Support
> + *
> + * Author: Joerg Roedel 
> + */
> +
> +#include 
> +
> +#include 
> +#include 
> +#include 
> +#include 
> +
> +#include "misc.h"
> +
> +static inline u64 read_ghcb_msr(void)
> +{
> +   unsigned long low, high;
> +
> +   asm volatile("rdmsr\n" : "=a" (low), "=d" (high) :
> +   "c" (MSR_AMD64_SEV_ES_GHCB));
> +
> +   return ((high << 32) | low);
> +}
> +
> +static inline void write_ghcb_msr(u64 val)
> +{
> +   u32 low, high;
> +
> +   low  = val & 0xffffffffUL;
> +   high = val >> 32;
> +
> +   asm volatile("wrmsr\n" : : "c" (MSR_AMD64_SEV_ES_GHCB),
> +   "a"(low), "d" (high) : "memory");
> +}
> +
> +#undef __init
> +#define __init
> +
> +/* Include code for early handlers */
> +#include "../../kernel/sev-es-shared.c"
> diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
> index ebe1685e92dd..b6139b70db54 100644
> --- a/arch/x86/include/asm/msr-index.h
> +++ b/arch/x86/include/asm/msr-index.h
> @@ -432,6 +432,7 @@
>  #define MSR_AMD64_IBSBRTARGET  0xc001103b
>  #define MSR_AMD64_IBSOPDATA4   0xc001103d
> #define MSR_AMD64_IBS_REG_COUNT_MAX    8 /* includes MSR_AMD64_IBSBRTARGET */
> +#define MSR_AMD64_SEV_ES_GHCB  0xc0010130
>  #define MSR_AMD64_SEV  0xc0010131
>  #define MSR_AMD64_SEV_ENABLED_BIT  0
>  #define MSR_AMD64_SEV_ENABLED  BIT_ULL(MSR_AMD64_SEV_ENABLED_BIT)
> diff --git a/arch/x86/include/asm/sev-es.h b/arch/x86/include/asm/sev-es.h
> new file mode 100644
> index 

Re: [PATCH 08/62] x86/boot/compressed/64: Add IDT Infrastructure

2020-02-11 Thread Andy Lutomirski
On Tue, Feb 11, 2020 at 5:53 AM Joerg Roedel  wrote:
>
> From: Joerg Roedel 
>
> Add code needed to setup an IDT in the early pre-decompression
> boot-code. The IDT is loaded first in startup_64, which is after
> EfiExitBootServices() has been called, and later reloaded when the
> kernel image has been relocated to the end of the decompression area.
>
> This allows to setup different IDT handlers before and after the
> relocation.
>

> diff --git a/arch/x86/boot/compressed/idt_64.c b/arch/x86/boot/compressed/idt_64.c
> new file mode 100644
> index ..46ecea671b90
> --- /dev/null
> +++ b/arch/x86/boot/compressed/idt_64.c
> @@ -0,0 +1,43 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +#include 
> +#include 
> +#include "misc.h"
> +
> +static void set_idt_entry(int vector, void (*handler)(void))
> +{
> +   unsigned long address = (unsigned long)handler;
> +   gate_desc entry;
> +
> +   memset(&entry, 0, sizeof(entry));
> +
> +   entry.offset_low    = (u16)(address & 0xffff);
> +   entry.segment   = __KERNEL_CS;
> +   entry.bits.type = GATE_TRAP;

^^^

I realize we're not running a real kernel here, but GATE_TRAP is
madness.  Please use GATE_INTERRUPT.

> +   entry.bits.p= 1;
> +   entry.offset_middle = (u16)((address >> 16) & 0xffff);
> +   entry.offset_high   = (u32)(address >> 32);
> +
> +   memcpy(&boot_idt[vector], &entry, sizeof(entry));
> +}
> +
> +/* Have this here so we don't need to include  */
> +static void load_boot_idt(const struct desc_ptr *dtr)
> +{
> +   asm volatile("lidt %0"::"m" (*dtr));
> +}
> +
> +/* Setup IDT before kernel jumping to  .Lrelocated */
> +void load_stage1_idt(void)
> +{
> +   boot_idt_desc.address = (unsigned long)boot_idt;
> +
> +   load_boot_idt(&boot_idt_desc);
> +}
> +
> +/* Setup IDT after kernel jumping to  .Lrelocated */
> +void load_stage2_idt(void)
> +{
> +   boot_idt_desc.address = (unsigned long)boot_idt;
> +
> +   load_boot_idt(&boot_idt_desc);
> +}
> diff --git a/arch/x86/boot/compressed/idt_handlers_64.S b/arch/x86/boot/compressed/idt_handlers_64.S
> new file mode 100644
> index ..0b2b6cf747d2
> --- /dev/null
> +++ b/arch/x86/boot/compressed/idt_handlers_64.S
> @@ -0,0 +1,71 @@
> +/* SPDX-License-Identifier: GPL-2.0-only */
> +/*
> + * Early IDT handler entry points
> + *
> + * Copyright (C) 2019 SUSE
> + *
> + * Author: Joerg Roedel 
> + */
> +
> +#include 
> +
> +.macro EXCEPTION_HANDLER name function error_code=0
> +SYM_FUNC_START(\name)
> +
> +   /* Build pt_regs */
> +   .if \error_code == 0
> +   pushq   $0
> +   .endif

cld

> +
> +   pushq   %rdi
> +   pushq   %rsi
> +   pushq   %rdx
> +   pushq   %rcx
> +   pushq   %rax
> +   pushq   %r8
> +   pushq   %r9
> +   pushq   %r10
> +   pushq   %r11
> +   pushq   %rbx
> +   pushq   %rbp
> +   pushq   %r12
> +   pushq   %r13
> +   pushq   %r14
> +   pushq   %r15
> +
> +   /* Call handler with pt_regs */
> +   movq    %rsp, %rdi
> +   call    \function
> +
> +   /* Restore regs */
> +   popq%r15
> +   popq%r14
> +   popq%r13
> +   popq%r12
> +   popq%rbp
> +   popq%rbx
> +   popq%r11
> +   popq%r10
> +   popq%r9
> +   popq%r8
> +   popq%rax
> +   popq%rcx
> +   popq%rdx
> +   popq%rsi
> +   popq%rdi

if error_code?

> +
> +   /* Remove error code and return */
> +   addq$8, %rsp
> +
> +   /*
> +* Make sure we return to __KERNEL_CS - the CS selector on
> +* the IRET frame might still be from an old BIOS GDT
> +*/
> +   movq    $__KERNEL_CS, 8(%rsp)
> +

If this actually happens, you have a major bug.  Please sanitize all
the segment registers after installing the GDT rather than hacking
around it here.


Re: [PATCH 07/62] x86/boot/compressed/64: Disable red-zone usage

2020-02-11 Thread Andy Lutomirski
On Tue, Feb 11, 2020 at 5:53 AM Joerg Roedel  wrote:
>
> From: Joerg Roedel 
>
> The x86-64 ABI defines a red-zone on the stack:
>
>   The 128-byte area beyond the location pointed to by %rsp is
>   considered to be reserved and shall not be modified by signal or
>   interrupt handlers. Therefore, functions may use this area for
>   temporary data that is not needed across function calls. In
>   particular, leaf functions may use this area for their entire stack
>   frame, rather than adjusting the stack pointer in the prologue and
>   epilogue. This area is known as the red zone.
>
> This is not compatible with exception handling, so disable it for the
> pre-decompression boot code.

Acked-by: Andy Lutomirski 

I admit that I thought we already supported exceptions this early.  At
least I seem to remember writing this code.  Maybe it never got
upstreamed?

--Andy


Re: [RFC PATCH 00/62] Linux as SEV-ES Guest Support

2020-02-11 Thread Andy Lutomirski
On Tue, Feb 11, 2020 at 7:43 AM Joerg Roedel  wrote:
>
> On Tue, Feb 11, 2020 at 03:50:08PM +0100, Peter Zijlstra wrote:
>
> > Oh gawd; so instead of improving the whole NMI situation, AMD went and
> > made it worse still ?!?
>
> Well, depends on how you want to see it. Under SEV-ES an IRET will not
> re-open the NMI window, but the guest has to tell the hypervisor
> explicitly when it is ready to receive new NMIs via the NMI_COMPLETE
> message.  NMIs stay blocked even when an exception happens in the
> handler, so this could also be seen as a (slight) improvement.
>

I don't get it.  VT-x has a VMCS bit "Interruptibility
state"."Blocking by NMI" that tracks the NMI masking state.  Would it
have killed AMD to solve the problem they same way to retain
architectural behavior inside a SEV-ES VM?

--Andy


Re: [PATCH v2] x86/paravirt: Drop {read,write}_cr8() hooks

2019-07-15 Thread Andy Lutomirski
On Mon, Jul 15, 2019 at 4:30 PM Andrew Cooper  wrote:
>
> On 15/07/2019 19:17, Nadav Amit wrote:
> >> On Jul 15, 2019, at 8:16 AM, Andrew Cooper  
> >> wrote:
> >>
> >> There is a lot of infrastructure for functionality which is used
> >> exclusively in __{save,restore}_processor_state() on the suspend/resume
> >> path.
> >>
> >> cr8 is an alias of APIC_TASKPRI, and APIC_TASKPRI is saved/restored by
> >> lapic_{suspend,resume}().  Saving and restoring cr8 independently of the
> >> rest of the Local APIC state isn't a clever thing to be doing.
> >>
> >> Delete the suspend/resume cr8 handling, which shrinks the size of struct
> >> saved_context, and allows for the removal of both PVOPS.
> > I think removing the interface for CR8 writes is also good to avoid
> > potential correctness issues, as the SDM says (10.8.6.1 "Interaction of Task
> > Priorities between CR8 and APIC”):
> >
> > "Operating software should implement either direct APIC TPR updates or CR8
> > style TPR updates but not mix them. Software can use a serializing
> > instruction (for example, CPUID) to serialize updates between MOV CR8 and
> > stores to the APIC.”
> >
> > And native_write_cr8() did not even issue a serializing instruction.
> >
>
> Given its location, the one write_cr8() is bounded by two serialising
> operations, so is safe in practice.
>
> However, I agree with the statement in the manual.  I could submit a v3
> with an updated commit message, or let it be fixed on commit.  Whichever
> is easiest.
>

I don't see anything wrong with the message.  If we actually used CR8
for interrupt priorities, we wouldn't want it to serialize.  The bug
is that the code that did the write_cr8() should have had a comment as
to how it serialized against lapic_restore().  But that doesn't seem
worth mentioning in the message, since, as noted, the real problem was
that it nonsensically restored just TPR without restoring everything
else.

Re: [PATCH 0/2] Remove 32-bit Xen PV guest support

2019-07-15 Thread Andy Lutomirski
On Mon, Jul 15, 2019 at 9:34 AM Andi Kleen  wrote:
>
> Juergen Gross  writes:
>
> > The long term plan has been to replace Xen PV guests by PVH. The first
> > victim of that plan are now 32-bit PV guests, as those are used only
> > rather seldom these days. Xen on x86 requires 64-bit support and with
> > Grub2 now supporting PVH officially since version 2.04 there is no
> > need to keep 32-bit PV guest support alive in the Linux kernel.
> > Additionally Meltdown mitigation is not available in the kernel running
> > as 32-bit PV guest, so dropping this mode makes sense from security
> > point of view, too.
>
> Normally we have a deprecation period for feature removals like this.
> You would make the kernel print a warning for some releases, and when
> no user complains you can then remove. If a user complains you can't.
>

As I understand it, the kernel rules do allow changes like this even
if there's a complaint: this is a patch that removes what is
effectively hardware support.  If the maintenance cost exceeds the
value, then removal is fair game.  (Obviously we weight the value to
preserving compatibility quite highly, but in this case, Xen dropped
32-bit hardware support a long time ago.  If the Xen hypervisor says
that 32-bit PV guest support is deprecated, it's deprecated.)

That being said, a warning might not be a bad idea.  What's the
current status of this in upstream Xen?
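(The warning itself could be a one-liner somewhere in the 32-bit Xen PV
init path - placement hypothetical:)

    if (IS_ENABLED(CONFIG_X86_32))
            pr_warn("xen: 32-bit PV guest support is deprecated and will be removed\n");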


Re: [PATCH] x86/paravirt: Drop {read,write}_cr8() hooks

2019-07-15 Thread Andy Lutomirski
On Mon, Jul 15, 2019 at 6:23 AM Juergen Gross  wrote:
>
> On 15.07.19 15:00, Andrew Cooper wrote:
> > There is a lot of infrastructure for functionality which is used
> > exclusively in __{save,restore}_processor_state() on the suspend/resume
> > path.
> >
> > cr8 is an alias of APIC_TASKPRI, and APIC_TASKPRI is saved/restored
> > independently by lapic_{suspend,resume}().
>
> Aren't those called only with CONFIG_PM set?
>


Unless I'm missing something, we only build any of the restore code
(including the write_cr8() call) if CONFIG_PM_SLEEP is set, and that
selects CONFIG_PM, so we should be fine, I think.


Re: [PATCH 4/9] x86/mm/tlb: Flush remote and local TLBs concurrently

2019-06-25 Thread Andy Lutomirski
On Tue, Jun 25, 2019 at 8:48 PM Nadav Amit  wrote:
>
> > On Jun 25, 2019, at 8:36 PM, Andy Lutomirski  wrote:
> >
> > On Wed, Jun 12, 2019 at 11:49 PM Nadav Amit  wrote:
> >> To improve TLB shootdown performance, flush the remote and local TLBs
> >> concurrently. Introduce flush_tlb_multi() that does so. The current
> >> flush_tlb_others() interface is kept, since paravirtual interfaces need
> >> to be adapted first before it can be removed. This is left for future
> >> work. In such PV environments, TLB flushes are not performed, at this
> >> time, concurrently.
> >
> > Would it be straightforward to have a default PV flush_tlb_multi()
> > that uses flush_tlb_others() under the hood?
>
> I prefer not to have a default PV implementation that should anyhow go away.
>
> I can create unoptimized untested versions for Xen and Hyper-V, if you want.
>

I think I prefer that approach.  We should be able to get the
maintainers to test it.  I don't love having legacy paths in there,
ahem, UV.


Re: [PATCH 4/9] x86/mm/tlb: Flush remote and local TLBs concurrently

2019-06-25 Thread Andy Lutomirski
On Wed, Jun 12, 2019 at 11:49 PM Nadav Amit  wrote:
>
> To improve TLB shootdown performance, flush the remote and local TLBs
> concurrently. Introduce flush_tlb_multi() that does so. The current
> flush_tlb_others() interface is kept, since paravirtual interfaces need
> to be adapted first before it can be removed. This is left for future
> work. In such PV environments, TLB flushes are not performed, at this
> time, concurrently.

Would it be straightforward to have a default PV flush_tlb_multi()
that uses flush_tlb_others() under the hood?
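(Roughly this, as a sketch: flush_tlb_multi() includes the local CPU in the
mask while flush_tlb_others() excludes it, so a fallback wrapper has to add
the local flush back - done here as a full flush for simplicity:)

    static void pv_flush_tlb_multi_fallback(const struct cpumask *cpumask,
                                            const struct flush_tlb_info *info)
    {
            /* Remote CPUs go through the existing PV hook... */
            flush_tlb_others(cpumask, info);

            /* ...plus the local CPU, which flush_tlb_others() skips. */
            if (cpumask_test_cpu(smp_processor_id(), cpumask))
                    local_flush_tlb();
    }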


Re: [patch 00/11] x86/vdso: Cleanups, simplifications and CLOCK_TAI support

2018-10-11 Thread Andy Lutomirski
On Thu, Oct 11, 2018 at 3:28 PM Marcelo Tosatti  wrote:
>
> On Tue, Oct 09, 2018 at 01:09:42PM -0700, Andy Lutomirski wrote:
> > On Tue, Oct 9, 2018 at 8:28 AM Marcelo Tosatti  wrote:
> > >
> > > On Mon, Oct 08, 2018 at 10:38:22AM -0700, Andy Lutomirski wrote:
> > > > On Mon, Oct 8, 2018 at 8:27 AM Marcelo Tosatti  
> > > > wrote:
> >
> > > > I read the comment three more times and even dug through the git
> > > > history.  It seems like what you're saying is that, under certain
> > > > conditions (which arguably would be bugs in the core Linux timing
> > > > code),
> > >
> > > I don't see that as a bug. It's just a side effect of reading two
> > > different clocks (one is CLOCK_MONOTONIC and the other is TSC),
> > > and using those two clocks as a "base + offset".
> > >
> > > As the comment explains, if you do that, you can't guarantee monotonicity.
> > >
> > > > actually calling ktime_get_boot_ns() could be non-monotonic
> > > > with respect to the kvmclock timing.  But get_kvmclock_ns() isn't used
> > > > for VM timing as such -- it's used for the IOCTL interfaces for
> > > > updating the time offset.  So can you explain how my patch is
> > > > incorrect?
> > >
> > > ktime_get_boot_ns() has frequency correction applied, while
> > > reading masterclock + TSC offset does not.
> > >
> > > So the clock reads differ.
> > >
> >
> > Ah, okay, I finally think I see what's going on.  In the kvmclock data
> > exposed to the guest, tsc_shift and tsc_to_system_mul come from
> > tgt_tsc_khz, whereas master_kernel_ns and master_cycle_now come from
> > CLOCK_BOOTTIME.  So the kvmclock and kernel clock drift apart at a
> > rate given by the frequency shift and then suddenly agree again every
> > time the pvclock data is updated.
>
> Yes.
>
> > Is there a reason to do it this way?
>
> Since pvclock updates which update system_timestamp are expensive (must stop 
> all vcpus),
> they should be avoided.
>

Fair enough.

> So only HW TSC counts

makes sense.

>, and used as offset against vcpu's tsc_timestamp.
>

Why don't you just expose CLOCK_MONOTONIC_RAW or CLOCK_MONOTONIC_RAW
plus suspend time, though?  Then you would actually be tracking a real
kernel timekeeping mode, and you wouldn't need all this complicated
offsetting work to avoid accidentally going backwards.


Re: [patch 00/11] x86/vdso: Cleanups, simplifications and CLOCK_TAI support

2018-10-09 Thread Andy Lutomirski
On Tue, Oct 9, 2018 at 8:28 AM Marcelo Tosatti  wrote:
>
> On Mon, Oct 08, 2018 at 10:38:22AM -0700, Andy Lutomirski wrote:
> > On Mon, Oct 8, 2018 at 8:27 AM Marcelo Tosatti  wrote:

> > I read the comment three more times and even dug through the git
> > history.  It seems like what you're saying is that, under certain
> > conditions (which arguably would be bugs in the core Linux timing
> > code),
>
> I don't see that as a bug. It's just a side effect of reading two
> different clocks (one is CLOCK_MONOTONIC and the other is TSC),
> and using those two clocks as a "base + offset".
>
> As the comment explains, if you do that, you can't guarantee monotonicity.
>
> > actually calling ktime_get_boot_ns() could be non-monotonic
> > with respect to the kvmclock timing.  But get_kvmclock_ns() isn't used
> > for VM timing as such -- it's used for the IOCTL interfaces for
> > updating the time offset.  So can you explain how my patch is
> > incorrect?
>
> ktime_get_boot_ns() has frequency correction applied, while
> reading masterclock + TSC offset does not.
>
> So the clock reads differ.
>

Ah, okay, I finally think I see what's going on.  In the kvmclock data
exposed to the guest, tsc_shift and tsc_to_system_mul come from
tgt_tsc_khz, whereas master_kernel_ns and master_cycle_now come from
CLOCK_BOOTTIME.  So the kvmclock and kernel clock drift apart at a
rate given by the frequency shift and then suddenly agree again every
time the pvclock data is updated.

Is there a reason to do it this way?
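(For reference, the guest-side read boils down to the following - simplified
from pvclock_clocksource_read(), ignoring the version seqcount and flags.
The mult/shift pair is the one derived from tgt_tsc_khz, the base pair is
sampled from CLOCK_BOOTTIME at update time, so any disagreement between the
two rates accumulates until the next pvclock update snaps them back
together:)

    static u64 pvclock_read_sketch(const struct pvclock_vcpu_time_info *src)
    {
            u64 delta = rdtsc_ordered() - src->tsc_timestamp;

            if (src->tsc_shift >= 0)
                    delta <<= src->tsc_shift;
            else
                    delta >>= -src->tsc_shift;

            /* 64x32->64 multiply-shift, as in pvclock's scale_delta() */
            delta = mul_u64_u32_shr(delta, src->tsc_to_system_mul, 32);

            return src->system_time + delta;
    }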


Re: [patch 00/11] x86/vdso: Cleanups, simplifications and CLOCK_TAI support

2018-10-08 Thread Andy Lutomirski
On Mon, Oct 8, 2018 at 8:27 AM Marcelo Tosatti  wrote:
>
> On Sat, Oct 06, 2018 at 03:28:05PM -0700, Andy Lutomirski wrote:
> > On Sat, Oct 6, 2018 at 1:29 PM Marcelo Tosatti  wrote:
> > >
> > > On Thu, Oct 04, 2018 at 03:15:32PM -0700, Andy Lutomirski wrote:
> > > > For better or for worse, I'm trying to understand this code.  So far,
> > > > I've come up with this patch:
> > > >
> > > > https://git.kernel.org/pub/scm/linux/kernel/git/luto/linux.git/commit/?h=x86/vdso-tglx&id=14fd71e12b1c4492a06f368f75041f263e6862bf
> > > >
> > > > Is it correct, or am I missing some subtlety?
> > >
> > > The master clock, when initialized, has a pair
> > >
> masterclock values = (TSC value, time-of-day data).
> > >
> > > When updating the guest clock, we only update relative to (TSC value)
> > > that was read on masterclock initialization.
> >
> > I don't see the problem.  The masterclock data is updated here:
> >
> > host_tsc_clocksource = kvm_get_time_and_clockread(
> > >master_kernel_ns,
> > >master_cycle_now);
> >
> > kvm_get_time_and_clockread() gets those values from
> > do_monotonic_boot(), which, barring bugs, should cause
> > get_kvmclock_ns() to return exactly the same thing as
> > ktime_get_boot_ns() + ka->kvmclock_offset, albeit in a rather
> > roundabout manner.
> >
> > So what am I missing?  Is there actually something wrong with my patch?
>
> For the bug mentioned in the comment not to happen, you must only read
> TSC and add it as offset to (TSC value, time-of-day data).
>
> Its more than "a roundabout manner".
>
> Read the comment again.
>

I read the comment three more times and even dug through the git
history.  It seems like what you're saying is that, under certain
conditions (which arguably would be bugs in the core Linux timing
code), actually calling ktime_get_boot_ns() could be non-monotonic
with respect to the kvmclock timing.  But get_kvmclock_ns() isn't used
for VM timing as such -- it's used for the IOCTL interfaces for
updating the time offset.  So can you explain how my patch is
incorrect?

--Andy


Re: [patch 00/11] x86/vdso: Cleanups, simplifications and CLOCK_TAI support

2018-10-06 Thread Andy Lutomirski
On Sat, Oct 6, 2018 at 1:29 PM Marcelo Tosatti  wrote:
>
> On Thu, Oct 04, 2018 at 03:15:32PM -0700, Andy Lutomirski wrote:
> > For better or for worse, I'm trying to understand this code.  So far,
> > I've come up with this patch:
> >
> > https://git.kernel.org/pub/scm/linux/kernel/git/luto/linux.git/commit/?h=x86/vdso-tglx&id=14fd71e12b1c4492a06f368f75041f263e6862bf
> >
> > Is it correct, or am I missing some subtlety?
>
> The master clock, when initialized, has a pair
>
> masterclock values = (TSC value, time-of-day data).
>
> When updating the guest clock, we only update relative to (TSC value)
> that was read on masterclock initialization.

I don't see the problem.  The masterclock data is updated here:

host_tsc_clocksource = kvm_get_time_and_clockread(
                                        &ka->master_kernel_ns,
                                        &ka->master_cycle_now);

kvm_get_time_and_clockread() gets those values from
do_monotonic_boot(), which, barring bugs, should cause
get_kvmclock_ns() to return exactly the same thing as
ktime_get_boot_ns() + ka->kvmclock_offset, albeit in a rather
roundabout manner.

So what am I missing?  Is there actually something wrong with my patch?


>
> See the following comment on x86.c:

I read that comment, and it's not obvious to me how it's related.


Re: [patch 00/11] x86/vdso: Cleanups, simplifications and CLOCK_TAI support

2018-10-04 Thread Andy Lutomirski
For better or for worse, I'm trying to understand this code.  So far,
I've come up with this patch:

https://git.kernel.org/pub/scm/linux/kernel/git/luto/linux.git/commit/?h=x86/vdso-tglx&id=14fd71e12b1c4492a06f368f75041f263e6862bf

Is it correct, or am I missing some subtlety?


Re: [patch 00/11] x86/vdso: Cleanups, simplifications and CLOCK_TAI support

2018-10-04 Thread Andy Lutomirski
On Fri, Sep 14, 2018 at 5:52 AM Thomas Gleixner  wrote:
>
> Matt attempted to add CLOCK_TAI support to the VDSO clock_gettime()
> implementation, which extended the clockid switch case and added yet
> another slightly different copy of the same code.
>
> Especially the extended switch case is problematic as the compiler tends to
> generate a jump table which then requires to use retpolines. If jump tables
> are disabled it adds yet another conditional to the existing maze.
>
> This series takes a different approach by consolidating the almost
> identical functions into one implementation for high resolution clocks and
> one for the coarse grained clock ids by storing the base data for each
> clock id in an array which is indexed by the clock id.
>
> This completely eliminates the switch case and allows further
> simplifications of the code base, which at the end all together gain a few
> cycles performance or at least stay on par with todays code. The resulting
> performance depends heavily on the micro architecture and the compiler.

tglx, please consider this whole series Acked-by: Andy Lutomirski


Please feel free to push it to tip:x86/vdso, as long as you pull in
tip/x86/urgent first.  Once it lands, I'll email out a couple of
follow-up patches.

--Andy


Re: [patch 00/11] x86/vdso: Cleanups, simplifications and CLOCK_TAI support

2018-10-04 Thread Andy Lutomirski


> On Oct 4, 2018, at 12:31 PM, Peter Zijlstra  wrote:
> 
> On Thu, Oct 04, 2018 at 07:00:45AM -0700, Andy Lutomirski wrote:
>>> On Oct 4, 2018, at 1:11 AM, Peter Zijlstra  wrote:
>>> 
>>>> On Thu, Oct 04, 2018 at 09:54:45AM +0200, Vitaly Kuznetsov wrote:
>>>> I was hoping to hear this from you :-) If I am to suggest how we can
>>>> move forward I'd propose:
>>>> - Check if pure TSC can be used on SkyLake+ systems (where TSC scaling
>>>> is supported).
>>>> - Check if non-masterclock mode is still needed. E.g. HyperV's TSC page
>>>> clocksource is a single page for the whole VM, not a per-cpu thing. Can
>>>> we think that all the buggy hardware is already gone?
>>> 
>>> No, and it is not the hardware you have to worry about (mostly), it is
>>> the frigging PoS firmware people put on it.
>>> 
>>> Ever since Nehalem TSC is stable (unless you get to >4 socket systems,
>>> after which it still can be, but bets are off). But even relatively
>>> recent systems fail the TSC sync test because firmware messes it up by
>>> writing to either MSR_TSC or MSR_TSC_ADJUST.
>>> 
>>> But the thing is, if the TSC is not synced, you cannot use it for
>>> timekeeping, full stop. So having a single page is fine, it either
>>> contains a mult/shift that is valid, or it indicates TSC is messed up
>>> and you fall back to something else.
>>> 
>>> There is no inbetween there.
>>> 
>>> For sched_clock we can still use the global page, because the rate will
>>> still be the same for each cpu, it's just offset between CPUs and the
>>> code compensates for that.
>> 
>> But if we’re in a KVM guest, then the clock will jump around on the
>> same *vCPU* when the vCPU migrates.
> 
> Urgh yes..
> 
>> But I don’t see how kvmclock helps here, since I don’t think it’s used
>> for sched_clock.
> 
> I get hits on kvm_sched_clock, but haven't looked further.

Ok, so KVM is using the per-vCPU pvclock data for sched_clock. Which hopefully 
does something intelligent.

Regardless of any TSC syncing issues, a paravirt clock should presumably be 
used for sched_clock to account for time that the vCPU was stopped.

It would be fantastic if this stuff were documented much better, both in terms 
of the data structures and the Linux code.


Re: [patch 00/11] x86/vdso: Cleanups, simplifications and CLOCK_TAI support

2018-10-04 Thread Andy Lutomirski
On Thu, Oct 4, 2018 at 9:43 AM Marcelo Tosatti  wrote:
>
> On Wed, Oct 03, 2018 at 03:32:08PM -0700, Andy Lutomirski wrote:
> > On Wed, Oct 3, 2018 at 12:01 PM Marcelo Tosatti  wrote:
> > >
> > > On Tue, Oct 02, 2018 at 10:15:49PM -0700, Andy Lutomirski wrote:
> > > > Hi Vitaly, Paolo, Radim, etc.,
> > > >
> > > > On Fri, Sep 14, 2018 at 5:52 AM Thomas Gleixner  
> > > > wrote:
> > > > >
> > > > > Matt attempted to add CLOCK_TAI support to the VDSO clock_gettime()
> > > > > implementation, which extended the clockid switch case and added yet
> > > > > another slightly different copy of the same code.
> > > > >
> > > > > Especially the extended switch case is problematic as the compiler 
> > > > > tends to
> > > > > generate a jump table which then requires to use retpolines. If jump 
> > > > > tables
> > > > > are disabled it adds yet another conditional to the existing maze.
> > > > >
> > > > > This series takes a different approach by consolidating the almost
> > > > > identical functions into one implementation for high resolution 
> > > > > clocks and
> > > > > one for the coarse grained clock ids by storing the base data for each
> > > > > clock id in an array which is indexed by the clock id.
> > > > >
> > > >
> > > > I was trying to understand more of the implications of this patch
> > > > series, and I was again reminded that there is an entire extra copy of
> > > > the vclock reading code in arch/x86/kvm/x86.c.  And the purpose of
> > > > that code is very, very opaque.
> > > >
> > > > Can one of you explain what the code is even doing?  From a couple of
> > > > attempts to read through it, it's a whole bunch of
> > > > probably-extremely-buggy code that,
> > >
> > > Yes, probably.
> > >
> > > > drumroll please, tries to atomically read the TSC value and the time.  
> > > > And decide whether the
> > > > result is "based on the TSC".
> > >
> > > I think "based on the TSC" refers to whether TSC clocksource is being
> > > used.
> > >
> > > > And then synthesizes a TSC-to-ns
> > > > multiplier and shift, based on *something other than the actual
> > > > multiply and shift used*.
> > > >
> > > > IOW, unless I'm totally misunderstanding it, the code digs into the
> > > > private arch clocksource data intended for the vDSO, uses a poorly
> > > > maintained copy of the vDSO code to read the time (instead of doing
> > > > the sane thing and using the kernel interfaces for this), and
> > > > propagates a totally made up copy to the guest.
> > >
> > > I posted kernel interfaces for this, and it was suggested to
> > > instead write an "in-kernel user of pvclock data".
> > >
> > > If you can get kernel interfaces to replace that, go for it. I prefer
> > > kernel interfaces as well.
> > >
> > > >  And gets it entirely
> > > > wrong when doing nested virt, since, unless there's some secret in
> > > > this maze, it doesn't acutlaly use the scaling factor from the host
> > > > when it tells the guest what to do.
> > > >
> > > > I am really, seriously tempted to send a patch to simply delete all
> > > > this code.
> > >
> > > If your patch which deletes the code gets the necessary features right,
> > > sure, go for it.
> > >
> > > > The correct way to do it is to hook
> > >
> > > Can you expand on the correct way to do it?
> > >
> > > > And I don't see how it's even possible to pass kvmclock correctly to
> > > > the L2 guest when L0 is hyperv.  KVM could pass *hyperv's* clock, but
> > > > L1 isn't notified when the data structure changes, so how the heck is
> > > > it supposed to update the kvmclock structure?
> > >
> > > I don't parse your question.
> >
> > Let me ask it more intelligently: when the "reenlightenment" IRQ
> > happens, what tells KVM to do its own update for its guests?
>
> Update of what, and why it needs to update anything from IRQ?
>
> The update i can think of is from host kernel clocksource,
> which there is a notifier for.
>
>

Unless I've missed some serious magic, L2 guests see kvmclock, not hv.
So we have the following sequence of events:

 - L0 migrates the whole VM.  Starting now, RDTSC is emulated to match
the old host, which applies in L1 and L2.

 - An IRQ is queued to L1.

 - L1 acknowledges that it noticed the TSC change.  RDTSC stops being
emulated for L1 and L2.

 - L2 reads the TSC.  It has no idea that anything changed, and it
gets the wrong answer.

 - At some point, kvm clock updates.

What prevents this?  Vitaly, am I missing some subtlety of what
actually happens?


Re: [patch 00/11] x86/vdso: Cleanups, simmplifications and CLOCK_TAI support

2018-10-04 Thread Andy Lutomirski

> On Oct 4, 2018, at 5:00 AM, Paolo Bonzini  wrote:
> 
>> On 04/10/2018 09:54, Vitaly Kuznetsov wrote:
>> - Check if pure TSC can be used on SkyLake+ systems (where TSC scaling
>> is supported).
> 
> Not if you want to migrate to pre-Skylake systems.
> 
>> - Check if non-masterclock mode is still needed. E.g. HyperV's TSC page
>> clocksource is a single page for the whole VM, not a per-cpu thing. Can
>> we think that all the buggy hardware is already gone?
> 
> No. :(  We still get reports whenever we break 2007-2008 hardware.
> 
> 

Does the KVM non-masterclock mode actually help?  It’s not clear to me exactly 
how it’s supposed to work, but it seems like it’s trying to expose per-vCPU 
adjustments to the guest. Which is dubious at best, since the guest can’t 
validly use them for anything other than sched_clock, since they aren’t fully 
corrected by anything KVM can do.

Re: [patch 00/11] x86/vdso: Cleanups, simmplifications and CLOCK_TAI support

2018-10-04 Thread Andy Lutomirski


> On Oct 4, 2018, at 1:11 AM, Peter Zijlstra  wrote:
> 
>> On Thu, Oct 04, 2018 at 09:54:45AM +0200, Vitaly Kuznetsov wrote:
>> I was hoping to hear this from you :-) If I am to suggest how we can
>> move forward I'd propose:
>> - Check if pure TSC can be used on SkyLake+ systems (where TSC scaling
>> is supported).
>> - Check if non-masterclock mode is still needed. E.g. HyperV's TSC page
>> clocksource is a single page for the whole VM, not a per-cpu thing. Can
>> we think that all the buggy hardware is already gone?
> 
> No, and it is not the hardware you have to worry about (mostly), it is
> the frigging PoS firmware people put on it.
> 
> Ever since Nehalem TSC is stable (unless you get to >4 socket systems,
> after which it still can be, but bets are off). But even relatively
> recent systems fail the TSC sync test because firmware messes it up by
> writing to either MSR_TSC or MSR_TSC_ADJUST.
> 
> But the thing is, if the TSC is not synced, you cannot use it for
> timekeeping, full stop. So having a single page is fine, it either
> contains a mult/shift that is valid, or it indicates TSC is messed up
> and you fall back to something else.
> 
> There is no inbetween there.
> 
> For sched_clock we can still use the global page, because the rate will
> still be the same for each cpu, it's just offset between CPUs and the
> code compensates for that.

But if we’re in a KVM guest, then the clock will jump around on the same *vCPU* 
when the vCPU migrates.

But I don’t see how kvmclock helps here, since I don’t think it’s used for 
sched_clock.


Re: [patch 00/11] x86/vdso: Cleanups, simmplifications and CLOCK_TAI support

2018-10-03 Thread Andy Lutomirski
On Wed, Oct 3, 2018 at 12:01 PM Marcelo Tosatti  wrote:
>
> On Tue, Oct 02, 2018 at 10:15:49PM -0700, Andy Lutomirski wrote:
> > Hi Vitaly, Paolo, Radim, etc.,
> >
> > On Fri, Sep 14, 2018 at 5:52 AM Thomas Gleixner  wrote:
> > >
> > > Matt attempted to add CLOCK_TAI support to the VDSO clock_gettime()
> > > implementation, which extended the clockid switch case and added yet
> > > another slightly different copy of the same code.
> > >
> > > Especially the extended switch case is problematic as the compiler tends 
> > > to
> > > generate a jump table which then requires to use retpolines. If jump 
> > > tables
> > > are disabled it adds yet another conditional to the existing maze.
> > >
> > > This series takes a different approach by consolidating the almost
> > > identical functions into one implementation for high resolution clocks and
> > > one for the coarse grained clock ids by storing the base data for each
> > > clock id in an array which is indexed by the clock id.
> > >
> >
> > I was trying to understand more of the implications of this patch
> > series, and I was again reminded that there is an entire extra copy of
> > the vclock reading code in arch/x86/kvm/x86.c.  And the purpose of
> > that code is very, very opaque.
> >
> > Can one of you explain what the code is even doing?  From a couple of
> > attempts to read through it, it's a whole bunch of
> > probably-extremely-buggy code that,
>
> Yes, probably.
>
> > drumroll please, tries to atomically read the TSC value and the time.  And 
> > decide whether the
> > result is "based on the TSC".
>
> I think "based on the TSC" refers to whether TSC clocksource is being
> used.
>
> > And then synthesizes a TSC-to-ns
> > multiplier and shift, based on *something other than the actual
> > multiply and shift used*.
> >
> > IOW, unless I'm totally misunderstanding it, the code digs into the
> > private arch clocksource data intended for the vDSO, uses a poorly
> > maintained copy of the vDSO code to read the time (instead of doing
> > the sane thing and using the kernel interfaces for this), and
> > propagates a totally made up copy to the guest.
>
> I posted kernel interfaces for this, and it was suggested to
> instead write a "in-kernel user of pvclock data".
>
> If you can get kernel interfaces to replace that, go for it. I prefer
> kernel interfaces as well.
>
> >  And gets it entirely
> > wrong when doing nested virt, since, unless there's some secret in
> > this maze, it doesn't actually use the scaling factor from the host
> > when it tells the guest what to do.
> >
> > I am really, seriously tempted to send a patch to simply delete all
> > this code.
>
> If your patch which deletes the code gets the necessary features right,
> sure, go for it.
>
> > The correct way to do it is to hook
>
> Can you expand on the correct way to do it?
>
> > And I don't see how it's even possible to pass kvmclock correctly to
> > the L2 guest when L0 is hyperv.  KVM could pass *hyperv's* clock, but
> > L1 isn't notified when the data structure changes, so how the heck is
> > it supposed to update the kvmclock structure?
>
> I don't parse your question.

Let me ask it more intelligently: when the "reenlightenment" IRQ
happens, what tells KVM to do its own update for its guests?


Re: [patch 00/11] x86/vdso: Cleanups, simmplifications and CLOCK_TAI support

2018-10-03 Thread Andy Lutomirski
On Wed, Oct 3, 2018 at 8:10 AM Thomas Gleixner  wrote:
>
> On Wed, 3 Oct 2018, Andy Lutomirski wrote:
> > > On Oct 3, 2018, at 5:01 AM, Vitaly Kuznetsov  wrote:
> > > Not all Hyper-V hosts support reenlightenment notifications (and, if I'm
> > > not mistaken, you need to enable nesting for the VM to get the feature -
> > > and most VMs don't have this) so I think we'll have to keep Hyper-V
> > > vclock for the time being.
> > >
> > But this does suggest that the correct way to pass a clock through to an
> > L2 guest where L0 is HV is to make L1 use the “tsc” clock and L2 use
> > kvmclock (or something newer and better).  This would require adding
> > support for atomic frequency changes all the way through the timekeeping
> > and arch code.
> >
> > John, tglx, would that be okay or crazy?
>
> Not sure what you mean. I think I lost you somewhere on the way.
>

What I mean is: currently we have a clocksource called
""hyperv_clocksource_tsc_page".  Reading it does:

static u64 read_hv_clock_tsc(struct clocksource *arg)
{
        u64 current_tick = hv_read_tsc_page(tsc_pg);

        if (current_tick == U64_MAX)
                rdmsrl(HV_X64_MSR_TIME_REF_COUNT, current_tick);

        return current_tick;
}

From Vitaly's email, it sounds like, on most (all?) hyperv systems
with nesting enabled, this clock is better behaved than it appears.
It sounds like the read behavior is that current_tick will never be
U64_MAX -- instead, the clock always works and, more importantly, the
actual scaling factor and offset only change observably on *guest*
request.

So why don't we improve the actual "tsc" clocksource to understand
this?  ISTM the best model would be where the
__clocksource_update_freq_xyz() mechanism gets called so we can use it
like this:

clocksource_begin_update();
clocksource_update_mult_shift();
tell_hv_that_we_reenlightened();
clocksource_end_update();

Where clocksource_begin_update() bumps the seqcount for the vDSO and
takes all the locks, clocksource_update_mult_shift() updates
everything, and clocksource_end_update() makes the updated parameters
usable.
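
In other words, something like this inside the timekeeping core (a pure
sketch -- none of these helpers exist today, and the locking is only
indicative of where the begin/end points would go):

/* Pure sketch: timekeeper_lock and tk_core are the real internals of
 * kernel/time/timekeeping.c; hv_ack_reenlightenment() is an invented
 * name for the "tell the hypervisor we're done" step. */
static void clocksource_update_freq_atomic(struct clocksource *cs,
                                           u32 mult, u32 shift)
{
        unsigned long flags;

        raw_spin_lock_irqsave(&timekeeper_lock, flags);
        write_seqcount_begin(&tk_core.seq);     /* vDSO readers spin from here */

        cs->mult = mult;                        /* clocksource_update_mult_shift() */
        cs->shift = shift;
        timekeeping_update(&tk_core.timekeeper, TK_CLOCK_WAS_SET);

        hv_ack_reenlightenment();               /* tell_hv_that_we_reenlightened() */

        write_seqcount_end(&tk_core.seq);       /* updated parameters usable */
        raw_spin_unlock_irqrestore(&timekeeper_lock, flags);
}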

(AFAICT there are currently no clocksources at all in the entire
kernel that update their frequency on the fly using
__clocksource_update_xyz().  Unless I'm missing something, the x86 tsc
cpufreq hooks don't call into the core timekeeping at all, so I'm
assuming that the tsc clocksource is just unusable as a clocksource on
systems that change its frequency.)

Or we could keep the hyperv_clocksource_tsc_page clocksource but make
it use VCLOCK_TSC and a similar update mechanism.

I don't personally want to do this, because the timekeeping code is
subtle and I'm unfamiliar with it.  And I don't have *that* many spare
cycles :)

Re: [patch 00/11] x86/vdso: Cleanups, simmplifications and CLOCK_TAI support

2018-10-03 Thread Andy Lutomirski


> On Oct 3, 2018, at 5:01 AM, Vitaly Kuznetsov  wrote:
> 
> Andy Lutomirski  writes:
> 
>>> On Oct 3, 2018, at 2:22 AM, Vitaly Kuznetsov  wrote:
>>> 
>>> Andy Lutomirski  writes:
>>> 
>>>> Hi Vitaly, Paolo, Radim, etc.,
>>>> 
>>> The notification you're talking about exists, it is called
>>> Reenlightenment, see 0092e4346f49 "x86/kvm: Support Hyper-V
>>> reenlightenment"). When TSC page changes (and this only happens when L1
>>> is migrated to a different host with a different TSC frequency and TSC
>>> scaling is not supported by the CPU) we receive an interrupt in L1 (at
>>> this moment all TSC accesses are emulated which guarantees the
>>> correctness of the readings), pause all L2 guests, update their kvmclock
>>> structures with new data (we already know the new TSC frequency) and
>>> then tell L0 that we're done and it can stop emulating TSC accesses.
>> 
>> That’s delightful!  Does the emulation magic also work for L1 user
>> mode?
> 
> As far as I understand - yes, all rdtsc* calls will trap into L0.
> 
>> If so, couldn’t we drop the HyperV vclock entirely and just
>> fold the adjustment into the core timekeeping data?  (Preferably the
>> actual core data, which would require core changes, but it could
>> plausibly be done in arch code, too.)
> 
> Not all Hyper-V hosts support reenlightenment notifications (and, if I'm
> not mistaken, you need to enable nesting for the VM to get the feature -
> and most VMs don't have this) so I think we'll have to keep Hyper-V
> vclock for the time being.
> 
> 

But this does suggest that the correct way to pass a clock through to an L2 
guest where L0 is HV is to make L1 use the “tsc” clock and L2 use kvmclock (or 
something newer and better).  This would require adding support for atomic 
frequency changes all the way through the timekeeping and arch code.

John, tglx, would that be okay or crazy?

Re: [patch 00/11] x86/vdso: Cleanups, simmplifications and CLOCK_TAI support

2018-10-03 Thread Andy Lutomirski


> On Oct 3, 2018, at 2:22 AM, Vitaly Kuznetsov  wrote:
> 
> Andy Lutomirski  writes:
> 
>> Hi Vitaly, Paolo, Radim, etc.,
>> 
>>> On Fri, Sep 14, 2018 at 5:52 AM Thomas Gleixner  wrote:
>>> 
>>> Matt attempted to add CLOCK_TAI support to the VDSO clock_gettime()
>>> implementation, which extended the clockid switch case and added yet
>>> another slightly different copy of the same code.
>>> 
>>> Especially the extended switch case is problematic as the compiler tends to
>>> generate a jump table which then requires to use retpolines. If jump tables
>>> are disabled it adds yet another conditional to the existing maze.
>>> 
>>> This series takes a different approach by consolidating the almost
>>> identical functions into one implementation for high resolution clocks and
>>> one for the coarse grained clock ids by storing the base data for each
>>> clock id in an array which is indexed by the clock id.
>>> 
>> 
>> I was trying to understand more of the implications of this patch
>> series, and I was again reminded that there is an entire extra copy of
>> the vclock reading code in arch/x86/kvm/x86.c.  And the purpose of
>> that code is very, very opaque.
>> 
>> Can one of you explain what the code is even doing?  From a couple of
>> attempts to read through it, it's a whole bunch of
>> probably-extremely-buggy code that, drumroll please, tries to
>> atomically read the TSC value and the time.  And decide whether the
>> result is "based on the TSC".  And then synthesizes a TSC-to-ns
>> multiplier and shift, based on *something other than the actual
>> multiply and shift used*.
>> 
>> IOW, unless I'm totally misunderstanding it, the code digs into the
>> private arch clocksource data intended for the vDSO, uses a poorly
>> maintained copy of the vDSO code to read the time (instead of doing
>> the sane thing and using the kernel interfaces for this), and
>> propagates a totally made up copy to the guest.  And gets it entirely
>> wrong when doing nested virt, since, unless there's some secret in
>> this maze, it doesn't actually use the scaling factor from the host
>> when it tells the guest what to do.
>> 
>> I am really, seriously tempted to send a patch to simply delete all
>> this code.  The correct way to do it is to hook
> 
> "I have discovered a truly marvelous proof of this, which this margin is
> too narrow to contain" :-)
> 
> There is a very long history of different (hardware) issues Marcelo was
> fighting with and the current code is the survived Frankenstein. E.g. it
> is very, very unclear what "catchup", "always catchup" and
> masterclock-less mode in general are and if we still need them.
> 
> That said I'm all for simplification. I'm not sure if we still need to
> care about buggy hardware though.
> 
>> 
>> And I don't see how it's even possible to pass kvmclock correctly to
>> the L2 guest when L0 is hyperv.  KVM could pass *hyperv's* clock, but
>> L1 isn't notified when the data structure changes, so how the heck is
>> it supposed to update the kvmclock structure?
> 
> Well, this kind of works in the following way:
> L1's clocksource is 'tsc_page' which is, basically, a complement to TSC:
> two numbers provided by L0: offset and scale, and KVM was taught to treat
> this clocksource as a good one (see b0c39dc68e3b "x86/kvm: Pass stable
> clocksource to guests when running nested on Hyper-V").
> 
> The notification you're talking about exists, it is called
> Reenlightenment, see 0092e4346f49 "x86/kvm: Support Hyper-V
> reenlightenment"). When TSC page changes (and this only happens when L1
> is migrated to a different host with a different TSC frequency and TSC
> scaling is not supported by the CPU) we receive an interrupt in L1 (at
> this moment all TSC accesses are emulated which guarantees the
> correctness of the readings), pause all L2 guests, update their kvmclock
> structures with new data (we already know the new TSC frequency) and
> then tell L0 that we're done and it can stop emulating TSC accesses.

That’s delightful!  Does the emulation magic also work for L1 user mode?  If 
so, couldn’t we drop the HyperV vclock entirely and just fold the adjustment 
into the core timekeeping data?  (Preferably the actual core data, which would 
require core changes, but it could plausibly be done in arch code, too.)
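
(For reference, the L1 flow described above comes down to roughly this --
a sketch with approximate helper names; see 0092e4346f49 for the real
code:)

/* Sketch of the L1 reenlightenment flow; names approximate.  TSC
 * accesses are still emulated by L0 while this runs, so every read
 * here is correct by construction. */
static void tsc_reenlightenment_handler(void)
{
        pause_all_l2_guests();                  /* approximate name */
        update_kvmclock_for_all_l2();           /* new TSC frequency is known */
        resume_all_l2_guests();

        /* Tell L0 we're done so it stops emulating TSC accesses. */
        wrmsrl(HV_X64_MSR_TSC_EMULATION_STATUS, 0);
}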

> 
> (Nothing like this exists for KVM-on-KVM, by the way, when L1's
> clocksource is 'kvmclock' L2s won't get a stable kvmclock clocksource.)
> 
> 

Re: [patch 00/11] x86/vdso: Cleanups, simmplifications and CLOCK_TAI support

2018-10-02 Thread Andy Lutomirski
Hi Vitaly, Paolo, Radim, etc.,

On Fri, Sep 14, 2018 at 5:52 AM Thomas Gleixner  wrote:
>
> Matt attempted to add CLOCK_TAI support to the VDSO clock_gettime()
> implementation, which extended the clockid switch case and added yet
> another slightly different copy of the same code.
>
> Especially the extended switch case is problematic as the compiler tends to
> generate a jump table which then requires to use retpolines. If jump tables
> are disabled it adds yet another conditional to the existing maze.
>
> This series takes a different approach by consolidating the almost
> identical functions into one implementation for high resolution clocks and
> one for the coarse grained clock ids by storing the base data for each
> clock id in an array which is indexed by the clock id.
>

I was trying to understand more of the implications of this patch
series, and I was again reminded that there is an entire extra copy of
the vclock reading code in arch/x86/kvm/x86.c.  And the purpose of
that code is very, very opaque.

Can one of you explain what the code is even doing?  From a couple of
attempts to read through it, it's a whole bunch of
probably-extremely-buggy code that, drumroll please, tries to
atomically read the TSC value and the time.  And decide whether the
result is "based on the TSC".  And then synthesizes a TSC-to-ns
multiplier and shift, based on *something other than the actual
multiply and shift used*.

IOW, unless I'm totally misunderstanding it, the code digs into the
private arch clocksource data intended for the vDSO, uses a poorly
maintained copy of the vDSO code to read the time (instead of doing
the sane thing and using the kernel interfaces for this), and
propagates a totally made up copy to the guest.  And gets it entirely
wrong when doing nested virt, since, unless there's some secret in
this maze, it doesn't actually use the scaling factor from the host
when it tells the guest what to do.

I am really, seriously tempted to send a patch to simply delete all
this code.  The correct way to do it is to hook

And I don't see how it's even possible to pass kvmclock correctly to
the L2 guest when L0 is hyperv.  KVM could pass *hyperv's* clock, but
L1 isn't notified when the data structure changes, so how the heck is
it supposed to update the kvmclock structure?

--Andy


Re: [patch 09/11] x86/vdso: Simplify the invalid vclock case

2018-09-27 Thread Andy Lutomirski


> On Sep 27, 2018, at 7:36 AM, Thomas Gleixner  wrote:
> 
>> On Wed, 19 Sep 2018, Thomas Gleixner wrote:
>> On Tue, 18 Sep 2018, Andy Lutomirski wrote:
>>>> On Sep 18, 2018, at 3:46 PM, Thomas Gleixner  wrote:
>>>>> On Tue, 18 Sep 2018, Andy Lutomirski wrote:
>>>>> Do we do better if we use signed arithmetic for the whole calculation?
>>>>> Then a small backwards movement would result in a small backwards result.
>>>>> Or we could offset everything so that we’d have to go back several
>>>>> hundred ms before we cross zero.
>>>> 
>>>> That would be probably the better solution as signed math would be
>>>> problematic when the resulting ns value becomes negative. As the delta is
>>>> really small, otherwise the TSC sync check would have caught it, the caller
>>>> should never be able to observe time going backwards.
>>>> 
>>>> I'll have a look into that. It needs some thought vs. the fractional part
>>>> of the base time, but it should be not rocket science to get that
>>>> correct. Famous last words...
>>>> 
>>> 
>>> It’s also fiddly to tune. If you offset it too much, then the fancy
>>> divide-by-repeated-subtraction loop will hurt more than the comparison to
>>> last.
>> 
>> Not really. It's sufficient to offset it by at max. 1000 cycles or so. That
>> won't hurt the magic loop, but it will definitely cover that slight offset
>> case.
> 
> I got it working, but first of all the gain is close to 0.
> 
> There is this other subtle issue that we've seen TSCs slowly drifting apart
> which is caught by the TSC watchdog eventually, but if it exceeds the offset
> _before_ the watchdog triggers, we're back to square one.
> 
> So I rather stay on the safe side and just accept that we have to deal with
> that. Sigh.
> 
> 

Seems okay to me. Oh well. 

Re: [patch 09/11] x86/vdso: Simplify the invalid vclock case

2018-09-18 Thread Andy Lutomirski


> On Sep 18, 2018, at 3:46 PM, Thomas Gleixner  wrote:
> 
> On Tue, 18 Sep 2018, Andy Lutomirski wrote:
>>> On Sep 18, 2018, at 12:52 AM, Thomas Gleixner  wrote:
>>> 
>>>>> On Mon, 17 Sep 2018, John Stultz wrote:
>>>>> On Mon, Sep 17, 2018 at 12:25 PM, Andy Lutomirski  wrote:
>>>>> Also, I'm not entirely convinced that this "last" thing is needed at
>>>>> all.  John, what's the scenario under which we need it?
>>>> 
>>>> So my memory is probably a bit foggy, but I recall that as we
>>>> accelerated gettimeofday, we found that even on systems that claimed
>>>> to have synced TSCs, they were actually just slightly out of sync.
>>>> Enough that right after cycles_last had been updated, a read on
>>>> another cpu could come in just behind cycles_last, resulting in a
>>>> negative interval causing lots of havoc.
>>>> 
>>>> So the sanity check is needed to avoid that case.
>>> 
>>> Your memory serves you right. That's indeed observable on CPUs which
>>> lack TSC_ADJUST.
>>> 
>>> @Andy: Welcome to the wonderful world of TSC.
>>> 
>> 
>> Do we do better if we use signed arithmetic for the whole calculation?
>> Then a small backwards movement would result in a small backwards result.
>> Or we could offset everything so that we’d have to go back several
>> hundred ms before we cross zero.
> 
> That would be probably the better solution as signed math would be
> problematic when the resulting ns value becomes negative. As the delta is
> really small, otherwise the TSC sync check would have caught it, the caller
> should never be able to observe time going backwards.
> 
> I'll have a look into that. It needs some thought vs. the fractional part
> of the base time, but it should be not rocket science to get that
> correct. Famous last words...
> 

It’s also fiddly to tune. If you offset it too much, then the fancy 
divide-by-repeated-subtraction loop will hurt more than the comparison to last.

Re: [patch 09/11] x86/vdso: Simplify the invalid vclock case

2018-09-18 Thread Andy Lutomirski

> On Sep 18, 2018, at 12:52 AM, Thomas Gleixner  wrote:
> 
>> On Mon, 17 Sep 2018, John Stultz wrote:
>>> On Mon, Sep 17, 2018 at 12:25 PM, Andy Lutomirski  wrote:
>>> Also, I'm not entirely convinced that this "last" thing is needed at
>>> all.  John, what's the scenario under which we need it?
>> 
>> So my memory is probably a bit foggy, but I recall that as we
>> accelerated gettimeofday, we found that even on systems that claimed
>> to have synced TSCs, they were actually just slightly out of sync.
>> Enough that right after cycles_last had been updated, a read on
>> another cpu could come in just behind cycles_last, resulting in a
>> negative interval causing lots of havoc.
>> 
>> So the sanity check is needed to avoid that case.
> 
> Your memory serves you right. That's indeed observable on CPUs which
> lack TSC_ADJUST.
> 
> @Andy: Welcome to the wonderful world of TSC.
> 

Do we do better if we use signed arithmetic for the whole calculation? Then a 
small backwards movement would result in a small backwards result.  Or we could 
offset everything so that we’d have to go back several hundred ms before we 
cross zero.

Re: [patch 09/11] x86/vdso: Simplify the invalid vclock case

2018-09-17 Thread Andy Lutomirski
On Fri, Sep 14, 2018 at 5:50 AM, Thomas Gleixner  wrote:
> The code flow for the vclocks is convoluted as it requires the vclocks
> which can be invalidated separately from the vsyscall_gtod_data sequence to
> store the fact in a separate variable. That's inefficient.
>

>  notrace static int do_hres(clockid_t clk, struct timespec *ts)
>  {
>         struct vgtod_ts *base = &gtod->basetime[clk];
>         unsigned int seq;
> -       int mode;
> -       u64 ns;
> +       u64 cycles, ns;
>
>         do {
>                 seq = gtod_read_begin(gtod);
> -               mode = gtod->vclock_mode;
>                 ts->tv_sec = base->sec;
>                 ns = base->nsec;
> -               ns += vgetsns(&mode);
> +               cycles = vgetcyc(gtod->vclock_mode);
> +               if (unlikely((s64)cycles < 0))
> +                       return vdso_fallback_gettime(clk, ts);

I was contemplating this, and I would suggest one of two optimizations:

1. have all the helpers return a struct containing a u64 and a bool,
and use that bool.  The compiler should do okay.  (See the sketch after
option 2 below.)

2. Be sneaky.  Later in the series, you do:

        if (unlikely((s64)cycles < 0))
                return vdso_fallback_gettime(clk, ts);
-       ns += (cycles - gtod->cycle_last) * gtod->mult;
+       if (cycles > last)
+               ns += (cycles - last) * gtod->mult;

How about:

        if (unlikely((s64)cycles <= last)) {
                if (cycles < 0) [or cycles == -1 or whatever]
                        return vdso_fallback_gettime;
        } else {
                ns += (cycles - last) * gtod->mult;
        }

which should actually make this whole mess be essentially free.
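
And for option 1, the shape would be something like this (sketch; the
struct and its field names are invented):

/* Option 1 sketch: invented struct, untested. */
struct vclock_cycles {
        u64 cycles;
        bool valid;
};

notrace static inline struct vclock_cycles vgetcyc(int mode)
{
        struct vclock_cycles ret = { .valid = true };

        if (mode == VCLOCK_TSC)
                ret.cycles = vread_tsc();
        else if (mode == VCLOCK_PVCLOCK)
                ret.cycles = vread_pvclock();
        else
                ret.valid = false;      /* caller takes the fallback path */

        return ret;
}

do_hres() would then branch on ret.valid instead of sign-casting the
cycle count.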

Also, I'm not entirely convinced that this "last" thing is needed at
all.  John, what's the scenario under which we need it?

--Andy

--Andy


Re: [patch 11/11] x66/vdso: Add CLOCK_TAI support

2018-09-14 Thread Andy Lutomirski


> On Sep 14, 2018, at 7:27 AM, Thomas Gleixner  wrote:
> 
> On Fri, 14 Sep 2018, Andy Lutomirski wrote:
>>> On Sep 14, 2018, at 5:50 AM, Thomas Gleixner  wrote:
>>> 
>>> With the storage array in place it's now trivial to support CLOCK_TAI in
>>> the vdso. Instead of extending the array to accommodate CLOCK_TAI, make use
>>> of the fact that:
>>> 
>>> - CLOCK ids are set in stone
>>> - CLOCK_THREAD_CPUTIME is never going to be supported in the VDSO so
>>>  the array slot 3 is unused
>>> - CLOCK_TAI is id 11 which results in 3 when masked with 0x3
>>> 
>>> Add the mask to the basetime array lookup and set up the CLOCK_TAI base
>>> time in update_vsyscall().
>> 
>> That’s... horrible. In an amazing way. Can you add BUILD_BUG_ON somewhere
>> to assert that this actually works?
> 
> Sure, but changing any of the clock ids will cause more wreckage than that.
> 

I’m more concerned that we add a new one and break the magic masking. Maybe two 
start sharing the same slot. 
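
Something along these lines would at least pin the aliasing assumptions
down at build time (sketch; where exactly it lives doesn't matter much):

/* Sketch: build-time checks for the slot-aliasing trick. */
BUILD_BUG_ON(VGTOD_TAI != CLOCK_THREAD_CPUTIME_ID);     /* TAI reuses slot 3 */
BUILD_BUG_ON((CLOCK_REALTIME & VGTOD_HRES_MASK) != CLOCK_REALTIME);
BUILD_BUG_ON((CLOCK_MONOTONIC & VGTOD_HRES_MASK) != CLOCK_MONOTONIC);
BUILD_BUG_ON((CLOCK_TAI & VGTOD_HRES_MASK) == CLOCK_REALTIME);
BUILD_BUG_ON((CLOCK_TAI & VGTOD_HRES_MASK) == CLOCK_MONOTONIC);

That still wouldn't flag a future clock id that happens to mask into
slot 0 or 1, which is exactly the sharing worry, but at least the
assumption would be written down where the compiler can check it.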

> Thanks,
> 
>tglx

Re: [patch 11/11] x66/vdso: Add CLOCK_TAI support

2018-09-14 Thread Andy Lutomirski


> On Sep 14, 2018, at 5:50 AM, Thomas Gleixner  wrote:
> 
> With the storage array in place it's now trivial to support CLOCK_TAI in
> the vdso. Instead of extending the array to accommodate CLOCK_TAI, make use
> of the fact that:
> 
> - CLOCK ids are set in stone
> - CLOCK_THREAD_CPUTIME is never going to be supported in the VDSO so
>   the array slot 3 is unused
> - CLOCK_TAI is id 11 which results in 3 when masked with 0x3
> 
> Add the mask to the basetime array lookup and set up the CLOCK_TAI base
> time in update_vsyscall().

That’s... horrible. In an amazing way. Can you add BUILD_BUG_ON somewhere to 
assert that this actually works?

> 
> The performance impact of the mask operation is within the noise.
> 
> Signed-off-by: Thomas Gleixner 
> ---
> arch/x86/entry/vdso/vclock_gettime.c    |    2 +-
> arch/x86/entry/vsyscall/vsyscall_gtod.c |    4 ++++
> arch/x86/include/asm/vgtod.h            |    6 +++++-
> 3 files changed, 10 insertions(+), 2 deletions(-)
> 
> --- a/arch/x86/entry/vdso/vclock_gettime.c
> +++ b/arch/x86/entry/vdso/vclock_gettime.c
> @@ -140,7 +140,7 @@ notrace static inline u64 vgetcyc(int mo
> 
> notrace static int do_hres(clockid_t clk, struct timespec *ts)
> {
> -       struct vgtod_ts *base = &gtod->basetime[clk];
> +       struct vgtod_ts *base = &gtod->basetime[clk & VGTOD_HRES_MASK];
>        u64 cycles, last, ns;
>        unsigned int seq;
> 
> --- a/arch/x86/entry/vsyscall/vsyscall_gtod.c
> +++ b/arch/x86/entry/vsyscall/vsyscall_gtod.c
> @@ -51,6 +51,10 @@ void update_vsyscall(struct timekeeper *
>        base->sec = tk->xtime_sec;
>        base->nsec = tk->tkr_mono.xtime_nsec;
> 
> +       base = &gtod->basetime[VGTOD_TAI];
> +       base->sec = tk->xtime_sec + (s64)tk->tai_offset;
> +       base->nsec = tk->tkr_mono.xtime_nsec;
> +
>        base = &gtod->basetime[CLOCK_MONOTONIC];
>        base->sec = tk->xtime_sec + tk->wall_to_monotonic.tv_sec;
>        nsec = tk->tkr_mono.xtime_nsec;
> --- a/arch/x86/include/asm/vgtod.h
> +++ b/arch/x86/include/asm/vgtod.h
> @@ -19,9 +19,13 @@ struct vgtod_ts {
> };
> 
> #define VGTOD_BASES     (CLOCK_MONOTONIC_COARSE + 1)
> -#define VGTOD_HRES      (BIT(CLOCK_REALTIME) | BIT(CLOCK_MONOTONIC))
> +#define VGTOD_HRES      (BIT(CLOCK_REALTIME) | BIT(CLOCK_MONOTONIC) | BIT(CLOCK_TAI))
> #define VGTOD_COARSE    (BIT(CLOCK_REALTIME_COARSE) | BIT(CLOCK_MONOTONIC_COARSE))
> 
> +/* Abuse CLOCK_THREAD_CPUTIME_ID for VGTOD CLOCK TAI */
> +#define VGTOD_HRES_MASK 0x3
> +#define VGTOD_TAI       (CLOCK_TAI & VGTOD_HRES_MASK)
> +
> /*
>  * vsyscall_gtod_data will be accessed by 32 and 64 bit code at the same time
>  * so be carefull by modifying this structure.
> 
> 

Re: [PATCH v1 06/27] x86/entry/64: Adapt assembly for PIE support

2017-10-20 Thread Andy Lutomirski


> On Oct 20, 2017, at 5:20 PM, Ingo Molnar  wrote:
> 
> 
> * Thomas Garnier  wrote:
> 
   */
 - cmpq    $.Lentry_SYSCALL_64_after_fastpath_call, (%rsp)
 + leaq    .Lentry_SYSCALL_64_after_fastpath_call(%rip), %r11
 + cmpq    %r11, (%rsp)
   jne     1f
> 
>>> This patch seems to add extra overhead to the syscall fast-path even when 
>>> PIE is
>>> disabled, right?
>> 
>> It does add extra instructions when one is not possible, I preferred
>> that over ifdefing but I can change it.
> 
> So my problem is, this pattern repeats in many other places as well, but 
> sprinkling
> various pieces of assembly code with #ifdefs would be very bad as well.
> 
> I have no good idea how to solve this.
> 
> Thanks,

Ugh, brain was off.  This is a bit messy. We could use a macro for this, too, I 
suppose.
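
Something like this, say (untested sketch; CONFIG_X86_PIE stands in for
whatever config symbol the series ends up using):

/* Untested sketch -- CONFIG_X86_PIE is a placeholder name.  Compare a
 * code label against the value at (%rsp) without paying for the extra
 * leaq in non-PIE builds. */
.macro CMP_LABEL_RSP label:req, scratch:req
#ifdef CONFIG_X86_PIE
        leaq    \label(%rip), \scratch
        cmpq    \scratch, (%rsp)
#else
        cmpq    $\label, (%rsp)
#endif
.endm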

> 
>Ingo


Re: [PATCH v1 06/27] x86/entry/64: Adapt assembly for PIE support

2017-10-20 Thread Andy Lutomirski


> On Oct 20, 2017, at 5:20 PM, Ingo Molnar  wrote:
> 
> 
> * Thomas Garnier  wrote:
> 
   */
 - cmpq    $.Lentry_SYSCALL_64_after_fastpath_call, (%rsp)
 + leaq    .Lentry_SYSCALL_64_after_fastpath_call(%rip), %r11
 + cmpq    %r11, (%rsp)
   jne     1f
> 
>>> This patch seems to add extra overhead to the syscall fast-path even when 
>>> PIE is
>>> disabled, right?
>> 
>> It does add extra instructions when one is not possible, I preferred
>> that over ifdefing but I can change it.
> 
> So my problem is, this pattern repeats in many other places as well, but 
> sprinkling
> various pieces of assembly code with #ifdefs would be very bad as well.
> 
> I have no good idea how to solve this.
> 

How about:

.macro JMP_TO_LABEL ...


> Thanks,
> 
>Ingo


Re: [PATCH v2 0/3] x86/vdso: Add Hyper-V TSC page clocksource support

2017-02-17 Thread Andy Lutomirski
On Fri, Feb 17, 2017 at 2:14 AM, Vitaly Kuznetsov  wrote:
> Thomas Gleixner  writes:
>
>> On Wed, 15 Feb 2017, Vitaly Kuznetsov wrote:
>>> Actually, we already have an implementation of TSC page update in KVM
>>> (see arch/x86/kvm/hyperv.c, kvm_hv_setup_tsc_page()) and the update does
>>> the following:
>>>
>>> 0) stash seq into seq_prev
>>> 1) seq = 0 making all reads from the page invalid
>>> 2) smp_wmb()
>>> 3) update tsc_scale, tsc_offset
>>> 4) smp_wmb()
>>> 5) set seq = seq_prev + 1
>>
>> I hope they handle the case where seq_prev overflows and becomes 0 :)
>>
>>> As far as I understand this helps with situations you described above as
>>> guest will notice either invalid value of 0 or seq change. In case the
>>> implementation in real Hyper-V is the same we're safe with compile
>>> barriers only.
>>
>> On x86 that's correct. smp_rmb() resolves to barrier(), but you certainly
>> need the smp_wmb() on the writer side.
>>
>> Now looking at the above your reader side code is bogus:
>>
>> +   while (1) {
>> +   sequence = tsc_pg->tsc_sequence;
>> +   if (!sequence)
>> +   break;
>>
>> Why would you break out of the loop when seq is 0? The 0 is just telling
>> you that there is an update in progress.
>
> Not only. As far as I understand (and I *think* K. Y. pointed this out)
> when VM is migrating to another host TSC page clocksource is disabled for
> extended period of time so we're better off reading from MSR than
> looping here. With regards to VDSO this means reverting to doing normal
> syscall.

That's a crappy design.  The guest really ought to be able to
distinguish "busy, try again" from "bail and use MSR".

Thomas, I can see valid reasons why the hypervisor might want to
temporarily disable shared page-based timing, but I think it's silly
that it's conflated with indicating "retry".

But if this is indeed the ABI, we're stuck with it.
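
Purely hypothetically, a saner ABI would reserve two sequence values
instead of overloading one -- to be clear, this is NOT what Hyper-V
actually does:

/* Purely hypothetical -- NOT the real Hyper-V ABI. */
#define HV_TSC_SEQ_UPDATING     0       /* transient: spin and retry */
#define HV_TSC_SEQ_DISABLED     1       /* sticky: fall back to the MSR */

static u32 next_valid_seq(u32 seq)
{
        seq++;
        return seq < 2 ? 2 : seq;       /* skip reserved values on wrap */
}

static void hv_publish_tsc_page(struct ms_hyperv_tsc_page *pg,
                                u64 scale, s64 offset)
{
        u32 seq = READ_ONCE(pg->tsc_sequence);

        WRITE_ONCE(pg->tsc_sequence, HV_TSC_SEQ_UPDATING);
        smp_wmb();                      /* order vs. the payload stores */
        WRITE_ONCE(pg->tsc_scale, scale);
        WRITE_ONCE(pg->tsc_offset, offset);
        smp_wmb();                      /* payload visible before new seq */
        WRITE_ONCE(pg->tsc_sequence, next_valid_seq(seq));
}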


Re: [PATCH v2 0/3] x86/vdso: Add Hyper-V TSC page clocksource support

2017-02-14 Thread Andy Lutomirski
On Tue, Feb 14, 2017 at 7:50 AM, Vitaly Kuznetsov <vkuzn...@redhat.com> wrote:
> Thomas Gleixner <t...@linutronix.de> writes:
>
>> On Tue, 14 Feb 2017, Vitaly Kuznetsov wrote:
>>
>>> Hi,
>>>
>>> while we're still waiting for a definitive ACK from Microsoft that the
>>> algorithm is good for SMP case (as we can't prevent the code in vdso from
>>> migrating between CPUs) I'd like to send v2 with some modifications to keep
>>> the discussion going.
>>
>> Migration is irrelevant. The TSC page is guest global so updates will
>> happen on some (random) host CPU and therefor you need the usual barriers
>> like we have them in our seqcounts unless an access to the sequence will
>> trap into the host, which would defeat the whole purpose of the TSC page.
>>
>
> KY Srinivasan <k...@microsoft.com> writes:
>
>> I checked with the folks on the Hyper-V side and they have confirmed that we 
>> need to
>> add memory barriers in the guest code to ensure the various reads from the 
>> TSC page are
>> correctly ordered - especially, the initial read of the sequence counter 
>> must have acquire
>> semantics. We should ensure that other reads from the TSC page are completed 
>> before the
>> second read of the sequence counter. I am working with the Windows team to 
>> correctly
>> reflect this algorithm in the Hyper-V specification.
>
>
> Thank you,
>
> do I get it right that combining the above I only need to replace
> virt_rmb() barriers with plain rmb() to get 'lfence' in hv_read_tsc_page
> (PATCH 2)? As members of struct ms_hyperv_tsc_page are volatile we don't
> need READ_ONCE(), compilers are not allowed to merge accesses. The
> resulting code looks good to me:

No, on multiple counts, unfortunately.

1. LFENCE is basically useless except for IO and for (Intel only)
rdtsc_ordered().  AFAIK there is literally no scenario under which
LFENCE is useful for access to normal memory.

2. The problem here has little to do with barriers.  You're doing:

read seq;
read var1;
read var2;
read tsc;
read seq again;

If the hypervisor updates things between reading var1 and var2 or
between reading var2 and tsc, you aren't guaranteed to notice unless
something fancy is going on regardless of barriers.  Similarly, if the
hypervisor starts updating, then you start and finish the whole
function, and then the hypervisor finishes, what guarantees you
notice?

This really needs a spec from the hypervisor folks as to what's going
on.  KVM does this horrible thing in which it sometimes freezes all
vCPUs when updating, for better or for worse.  Mostly for worse.  If
MS does the same thing, they should document it.

3. You need rdtsc_ordered().  Plain RDTSC is not ordered wrt other
memory access, and it can observably screw up as a result.

Also, Linux tries to avoid volatile variables, so please use READ_ONCE().
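
Putting those together, I'd expect the reader to look more like this
(sketch; whether !seq should mean "retry" or "give up" is exactly the
unanswered question in point 2):

static u64 hv_read_tsc_page(const struct ms_hyperv_tsc_page *tsc_pg)
{
        u64 scale, offset, tsc;
        u32 seq;

        do {
                seq = READ_ONCE(tsc_pg->tsc_sequence);
                if (!seq)
                        return U64_MAX; /* caller falls back to the MSR */

                scale  = READ_ONCE(tsc_pg->tsc_scale);
                offset = READ_ONCE(tsc_pg->tsc_offset);
                tsc    = rdtsc_ordered();       /* not plain RDTSC */
        } while (READ_ONCE(tsc_pg->tsc_sequence) != seq);

        return mul_u64_u64_shr(tsc, scale, 64) + offset;
}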

>
> (gdb) disassemble read_hv_clock_tsc
> Dump of assembler code for function read_hv_clock_tsc:
>    0xffffffff8102ca60 <+0>:     callq  0xffffffff816c7500 <__fentry__>
>    0xffffffff8102ca65 <+5>:     mov    0xf67974(%rip),%rcx        # 0xffffffff81f943e0
>    0xffffffff8102ca6c <+12>:    jmp    0xffffffff8102ca87 <read_hv_clock_tsc+39>
>    0xffffffff8102ca6e <+14>:    lfence
>    0xffffffff8102ca71 <+17>:    mov    0x8(%rcx),%r9
>    0xffffffff8102ca75 <+21>:    mov    0x10(%rcx),%r8
>    0xffffffff8102ca79 <+25>:    nop
>    0xffffffff8102ca7a <+26>:    nop
>    0xffffffff8102ca7b <+27>:    nop
>    0xffffffff8102ca7c <+28>:    rdtsc
>    0xffffffff8102ca7e <+30>:    lfence
>    0xffffffff8102ca81 <+33>:    mov    (%rcx),%edi
>    0xffffffff8102ca83 <+35>:    cmp    %edi,%esi
>    0xffffffff8102ca85 <+37>:    je     0xffffffff8102caa3 <read_hv_clock_tsc+67>
>    0xffffffff8102ca87 <+39>:    mov    (%rcx),%esi
>    0xffffffff8102ca89 <+41>:    test   %esi,%esi
>    0xffffffff8102ca8b <+43>:    jne    0xffffffff8102ca6e <read_hv_clock_tsc+14>
>    0xffffffff8102ca8d <+45>:    push   %rbp
>    0xffffffff8102ca8e <+46>:    mov    $0x40000020,%edi
>    0xffffffff8102ca93 <+51>:    mov    %rsp,%rbp
>    0xffffffff8102ca96 <+54>:    and    $0xfffffffffffffff0,%rsp
>    0xffffffff8102ca9a <+58>:    callq  *0xffffffff81c36330
>    0xffffffff8102caa1 <+65>:    leaveq
>    0xffffffff8102caa2 <+66>:    retq
>    0xffffffff8102caa3 <+67>:    shl    $0x20,%rdx
>    0xffffffff8102caa7 <+71>:    or     %rdx,%rax
>    0xffffffff8102caaa <+74>:    mul    %r9
>    0xffffffff8102caad <+77>:    mov    %rdx,%rax
>    0xffffffff8102cab0 <+80>:    add    %r8,%rax
>    0xffffffff8102cab3 <+83>:    cmp    $0xffffffffffffffff,%rax
>    0xffffffff8102cab7 <+87>:    je     0xffffffff8102ca8d <read_hv_clock_tsc+45>
>    0xffffffff8102cab9 <+89>:    repz retq
> End of assembler dump.
>
> --
>   Vitaly



-- 
Andy Lutomirski
AMA Capital Management, LLC


Re: [PATCH 2/2] x86/vdso: Add VCLOCK_HVCLOCK vDSO clock read method

2017-02-13 Thread Andy Lutomirski
On Sun, Feb 12, 2017 at 11:49 PM, Dexuan Cui  wrote:
>> From: Thomas Gleixner [mailto:t...@linutronix.de]
>> Sent: Saturday, February 11, 2017 02:02
>>  ...
>> That's important if the stuff happens cross CPU. If the update happens on
>> the same CPU then this is a different story and as there are VMexits
>> involved they might provide the required ordering already. But I can't tell
>> as I have no idea how that host side thing is done.
>>
>>   tglx
>
> IMO Hyper-V TSC page clocksource here seems pretty similar to KVM's pvclock,
> So I would guess "the structure is only updated just before reentering the 
> guest
> after some VM event" (https://rwmj.wordpress.com/2010/10/15/kvm-pvclock/),
> that is, the update should happen on the same CPU, I guess.

If the patch is correct, there is one of these shared by all vCPUs, so
this is not a sufficient explanation.


Re: [PATCH 2/2] x86/vdso: Add VCLOCK_HVCLOCK vDSO clock read method

2017-02-09 Thread Andy Lutomirski
On Thu, Feb 9, 2017 at 12:45 PM, KY Srinivasan <k...@microsoft.com> wrote:
>
>
>> -Original Message-
>> From: Thomas Gleixner [mailto:t...@linutronix.de]
>> Sent: Thursday, February 9, 2017 9:08 AM
>> To: Vitaly Kuznetsov <vkuzn...@redhat.com>
>> Cc: x...@kernel.org; Andy Lutomirski <l...@amacapital.net>; Ingo Molnar
>> <mi...@redhat.com>; H. Peter Anvin <h...@zytor.com>; KY Srinivasan
>> <k...@microsoft.com>; Haiyang Zhang <haiya...@microsoft.com>; Stephen
>> Hemminger <sthem...@microsoft.com>; Dexuan Cui
>> <de...@microsoft.com>; linux-ker...@vger.kernel.org;
>> de...@linuxdriverproject.org; virtualization@lists.linux-foundation.org
>> Subject: Re: [PATCH 2/2] x86/vdso: Add VCLOCK_HVCLOCK vDSO clock read
>> method
>>
>> On Thu, 9 Feb 2017, Vitaly Kuznetsov wrote:
>> > +#ifdef CONFIG_HYPERV_TSCPAGE
>> > +static notrace u64 vread_hvclock(int *mode)
>> > +{
>> > +   const struct ms_hyperv_tsc_page *tsc_pg =
>> > +   (const struct ms_hyperv_tsc_page *)&tsc_page;
>> > +   u64 sequence, scale, offset, current_tick, cur_tsc;
>> > +
>> > +   while (1) {
>> > +   sequence = READ_ONCE(tsc_pg->tsc_sequence);
>> > +   if (!sequence)
>> > +   break;
>> > +
>> > +   scale = READ_ONCE(tsc_pg->tsc_scale);
>> > +   offset = READ_ONCE(tsc_pg->tsc_offset);
>> > +   rdtscll(cur_tsc);
>> > +
>> > +   current_tick = mul_u64_u64_shr(cur_tsc, scale, 64) + offset;
>> > +
>> > +   if (READ_ONCE(tsc_pg->tsc_sequence) == sequence)
>> > +   return current_tick;
>>
>> That sequence stuff lacks still a sensible explanation. It's fundamentally
>> different from the sequence counting we do in the kernel, so documentation
>> for it is really required.
>
> The host is updating multiple fields in this shared TSC page and the sequence 
> number is
> used to ensure that the guest sees a consistent set of values published. If I 
> remember
> correctly, Xen has a similar mechanism.

So what's the actual protocol?  When the hypervisor updates the page,
does it freeze all guest cpus?  If not, how does it maintain
atomicity?

--Andy


Re: [PATCH RFC 2/2] x86/vdso: Add VCLOCK_HVCLOCK vDSO clock read method

2017-02-08 Thread Andy Lutomirski
On Wed, Feb 8, 2017 at 9:07 AM, Vitaly Kuznetsov  wrote:
> Hyper-V TSC page clocksource is suitable for vDSO, however, the protocol
> defined by the hypervisor is different from VCLOCK_PVCLOCK. Implement the
> required support re-using pvclock_page VVAR as VCLOCK_PVCLOCK is mutually
> exclusive with VCLOCK_HVCLOCK at run time.
>
> Signed-off-by: Vitaly Kuznetsov 
> ---
>  arch/x86/entry/vdso/vclock_gettime.c | 48 ++++++++++++++++++++++++++++++++
>  arch/x86/entry/vdso/vma.c            | 26 +++++++++++++++++++-------
>  arch/x86/hyperv/hv_init.c            |  3 +++
>  arch/x86/include/asm/clocksource.h   |  3 ++-
>  4 files changed, 72 insertions(+), 8 deletions(-)
>
> diff --git a/arch/x86/entry/vdso/vclock_gettime.c b/arch/x86/entry/vdso/vclock_gettime.c
> index 9d4d6e1..93e9dcd 100644
> --- a/arch/x86/entry/vdso/vclock_gettime.c
> +++ b/arch/x86/entry/vdso/vclock_gettime.c
> @@ -17,6 +17,7 @@
>  #include 
>  #include 
>  #include 
> +#include <asm/mshyperv.h>
>  #include 
>  #include 
>  #include 
> @@ -141,6 +142,49 @@ static notrace u64 vread_pvclock(int *mode)
> return last;
>  }
>  #endif
> +#ifdef CONFIG_HYPERV_CLOCK
> +/* (a * b) >> 64 implementation */
> +static inline u64 mul64x64_hi(u64 a, u64 b)
> +{
> +   u64 a_lo, a_hi, b_lo, b_hi, p1, p2;
> +
> +   a_lo = (u32)a;
> +   a_hi = a >> 32;
> +   b_lo = (u32)b;
> +   b_hi = b >> 32;
> +   p1 = a_lo * b_hi;
> +   p2 = a_hi * b_lo;
> +
> +   return a_hi * b_hi + (p1 >> 32) + (p2 >> 32) +
> +   ((((a_lo * b_lo) >> 32) + (u32)p1 + (u32)p2) >> 32);
> +
> +}

Unless GCC is waaay more clever than I think, this is hugely
suboptimal on 64-bit.  x86 can do this in a single instruction, and
gcc can express it cleanly using __uint128_t.  I wouldn't be terribly
surprised if the 32-bit generated code was fine, too.
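
i.e., on 64-bit the whole helper can be (sketch):

/* gcc turns this into a single mulq on x86-64. */
static inline u64 mul64x64_hi(u64 a, u64 b)
{
        return ((__uint128_t)a * b) >> 64;
}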

> +
> +static notrace u64 vread_hvclock(int *mode)
> +{
> +   const struct ms_hyperv_tsc_page *tsc_pg =
> +   (const struct ms_hyperv_tsc_page *)&tsc_page;
> +   u64 sequence, scale, offset, current_tick, cur_tsc;
> +
> +   while (1) {
> +   sequence = READ_ONCE(tsc_pg->tsc_sequence);
> +   if (!sequence)
> +   break;
> +
> +   scale = READ_ONCE(tsc_pg->tsc_scale);
> +   offset = READ_ONCE(tsc_pg->tsc_offset);
> +   rdtscll(cur_tsc);
> +
> +   current_tick = mul64x64_hi(cur_tsc, scale) + offset;
> +
> +   if (READ_ONCE(tsc_pg->tsc_sequence) == sequence)
> +   return current_tick;
> +   }

Can you explain better what's going on here?  What protocol does the
hypervisor use to update this bizarro seqlock?  What exactly does
sequence==0 mean?

> +
> +   *mode = VCLOCK_NONE;
> +   return 0;
> +}
> +#endif
>
>  notrace static u64 vread_tsc(void)
>  {
> @@ -173,6 +217,10 @@ notrace static inline u64 vgetsns(int *mode)
> else if (gtod->vclock_mode == VCLOCK_PVCLOCK)
> cycles = vread_pvclock(mode);
>  #endif
> +#ifdef CONFIG_HYPERV_CLOCK
> +   else if (gtod->vclock_mode == VCLOCK_HVCLOCK)
> +   cycles = vread_hvclock(mode);
> +#endif
> else
> return 0;
> v = (cycles - gtod->cycle_last) & gtod->mask;
> diff --git a/arch/x86/entry/vdso/vma.c b/arch/x86/entry/vdso/vma.c
> index 10820f6..4b9d90c 100644
> --- a/arch/x86/entry/vdso/vma.c
> +++ b/arch/x86/entry/vdso/vma.c
> @@ -21,6 +21,7 @@
>  #include 
>  #include 
>  #include 
> +#include <asm/mshyperv.h>
>
>  #if defined(CONFIG_X86_64)
>  unsigned int __read_mostly vdso64_enabled = 1;
> @@ -112,13 +113,24 @@ static int vvar_fault(const struct vm_special_mapping *sm,
> ret = vm_insert_pfn(vma, vmf->address,
> __pa_symbol(&__vvar_page) >> PAGE_SHIFT);
> } else if (sym_offset == image->sym_pvclock_page) {
> -   struct pvclock_vsyscall_time_info *pvti =
> -   pvclock_pvti_cpu0_va();
> -   if (pvti && vclock_was_used(VCLOCK_PVCLOCK)) {
> -   ret = vm_insert_pfn(
> -   vma,
> -   vmf->address,
> -   __pa(pvti) >> PAGE_SHIFT);
> +   if (vclock_was_used(VCLOCK_PVCLOCK)) {
> +   struct pvclock_vsyscall_time_info *pvti =
> +   pvclock_pvti_cpu0_va();
> +   if (pvti) {
> +   ret = vm_insert_pfn(
> +   vma,
> +   vmf->address,
> +   __pa(pvti) >> PAGE_SHIFT);
> +   }
> +   } else if (vclock_was_used(VCLOCK_HVCLOCK)) {
> +   struct ms_hyperv_tsc_page *tsc_pg =
> +   hv_get_tsc_page();
> +   if (tsc_pg) {
> +

Re: [PATCH v2 2/2] vring: Force use of DMA API for ARM-based systems

2017-01-11 Thread Andy Lutomirski
On Wed, Jan 11, 2017 at 2:01 AM, Will Deacon <will.dea...@arm.com> wrote:
> On Wed, Jan 11, 2017 at 01:33:31AM +0200, Michael S. Tsirkin wrote:
>> On Tue, Jan 10, 2017 at 05:51:18PM +, Robin Murphy wrote:
>> > From: Will Deacon <will.dea...@arm.com>
>> >
>> > Booting Linux on an ARM fastmodel containing an SMMU emulation results
>> > in an unexpected I/O page fault from the legacy virtio-blk PCI device:
>> >
>> > [1.211721] arm-smmu-v3 2b40.smmu: event 0x10 received:
>> > [1.211800] arm-smmu-v3 2b40.smmu:   0xf010
>> > [1.211880] arm-smmu-v3 2b40.smmu:   0x0208
>> > [1.211959] arm-smmu-v3 2b40.smmu:   0x0008fa081002
>> > [1.212075] arm-smmu-v3 2b40.smmu:   0x
>> > [1.212155] arm-smmu-v3 2b40.smmu: event 0x10 received:
>> > [1.212234] arm-smmu-v3 2b40.smmu:   0xf010
>> > [1.212314] arm-smmu-v3 2b40.smmu:   0x0208
>> > [1.212394] arm-smmu-v3 2b40.smmu:   0x0008fa081000
>> > [1.212471] arm-smmu-v3 2b40.smmu:   0x
>> >
>> > 
>> >
>> > This is because the virtio-blk is behind an SMMU, so we have consequently
>> > swizzled its DMA ops and configured the SMMU to translate accesses. This
>> > then requires the vring code to use the DMA API to establish translations,
>> > otherwise all transactions will result in fatal faults and termination.
>> >
>> > Given that ARM-based systems only see an SMMU if one is really present
>> > (the topology is all described by firmware tables such as device-tree or
>> > IORT), then we can safely use the DMA API for all virtio devices.
>> >
>> > Cc: Andy Lutomirski <l...@kernel.org>
>> > Cc: Michael S. Tsirkin <m...@redhat.com>
>> > Signed-off-by: Will Deacon <will.dea...@arm.com>
>>
>> I'd like to better understand then need for this one.
>> Can't the device in question just set VIRTIO_F_IOMMU_PLATFORM ?
>>
>> I'd rather we avoided need for more hacks and just
>> have everyone switch to that.
>
> There are a couple of problems with VIRTIO_F_IOMMU_PLATFORM:
>
> 1. It doesn't exist for legacy devices, which are all we have on the
>platform in question.
>
> 2. It's not documented in the virtio sp^H^HSTOP PRESS. I see you applied
>my patch ;). Thanks.
>
> In which case, for non-legacy devices we should definitely be using
> VIRTIO_F_IOMMU_PLATFORM, but since this platform hasn't yet moved to the
> world of flying cars, could we unconditionally set the DMA ops on ARM
> for legacy devices? The alternative is disabling the SMMU altogether,
> but that's less than ideal because there are non-virtio devices on the
> same PCI bus.

Also, on ARM, using the DMA API appears to *always* be the correct
approach.  Why not do it all the time, then?  The non-DMA-API path is
a legacy thing that is needed because a few platforms incorrectly
enumerate their IOMMUs.  ARM gets it right, so I don't see why ARM
should be subject to the legacy mess.

Even on x86, it should be possible to get the code into a state where
using DMA ops is always correct.
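
Roughly, the policy hook could end up looking like this (sketch; the
quirk-test name below is approximate):

/* Sketch of the policy being argued for.  Platforms whose firmware
 * describes the IOMMU topology truthfully (like ARM) can always say
 * yes; everyone else keeps guessing around the legacy quirk. */
static bool vring_use_dma_api(struct virtio_device *vdev)
{
        if (IS_ENABLED(CONFIG_ARM) || IS_ENABLED(CONFIG_ARM64))
                return true;    /* enumeration is trustworthy here */

        return !virtio_has_iommu_quirk(vdev);   /* name approximate */
}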

--Andy


Re: [RFC PATCH] vring: Force use of DMA API for ARM-based systems

2017-01-06 Thread Andy Lutomirski
On Fri, Jan 6, 2017 at 10:32 AM, Robin Murphy  wrote:
> On 06/01/17 17:48, Jean-Philippe Brucker wrote:
>> Hi Will,
>>
>> On 20/12/16 15:14, Will Deacon wrote:
>>> Booting Linux on an ARM fastmodel containing an SMMU emulation results
>>> in an unexpected I/O page fault from the legacy virtio-blk PCI device:
>>>
>>> [1.211721] arm-smmu-v3 2b40.smmu: event 0x10 received:
>>> [1.211800] arm-smmu-v3 2b40.smmu:0xf010
>>> [1.211880] arm-smmu-v3 2b40.smmu:0x0208
>>> [1.211959] arm-smmu-v3 2b40.smmu:0x0008fa081002
>>> [1.212075] arm-smmu-v3 2b40.smmu:0x
>>> [1.212155] arm-smmu-v3 2b40.smmu: event 0x10 received:
>>> [1.212234] arm-smmu-v3 2b40.smmu:0xf010
>>> [1.212314] arm-smmu-v3 2b40.smmu:0x0208
>>> [1.212394] arm-smmu-v3 2b40.smmu:0x0008fa081000
>>> [1.212471] arm-smmu-v3 2b40.smmu:0x
>>>
>>> 
>>>
>>> This is because the virtio-blk is behind an SMMU, so we have consequently
>>> swizzled its DMA ops and configured the SMMU to translate accesses. This
>>> then requires the vring code to use the DMA API to establish translations,
>>> otherwise all transactions will result in fatal faults and termination.
>>>
>>> Given that ARM-based systems only see an SMMU if one is really present
>>> (the topology is all described by firmware tables such as device-tree or
>>> IORT), then we can safely use the DMA API for all virtio devices.
>>
>> There is a problem with the platform block device on that same model.
>> Since it's not behind the SMMU, the DMA ops fall back to swiotlb, which
>> limits the number of mappings.
>>
>> It used to work with 4.9, but since 9491ae4 ("mm: don't cap request size
>> based on read-ahead setting") unlocked read-ahead, we quickly run into
>> the limit of swiotlb and panic:
>>
>> [5.382359] virtio-mmio 1c13.virtio_block: swiotlb buffer is full
>> (sz: 491520 bytes)
>> [5.382452] virtio-mmio 1c13.virtio_block: DMA: Out of SW-IOMMU
>> space for 491520 bytes
>> [5.382531] Kernel panic - not syncing: DMA: Random memory could be
>> DMA written
>> ...
>> [5.383148] [] swiotlb_map_page+0x194/0x1a0
>> [5.383226] [] __swiotlb_map_page+0x20/0x88
>> [5.383320] [] vring_map_one_sg.isra.1+0x70/0x88
>> [5.383417] [] virtqueue_add_sgs+0x2ec/0x4e8
>> [5.383505] [] __virtblk_add_req+0x9c/0x1a8
>> ...
>> [5.384449] [] ondemand_readahead+0xfc/0x2b8
>>
>> Commit 9491ae4 caps the read-ahead request to a limit set by the backing
>> device. For virtio-blk, it is infinite (as set by the call to
>> blk_queue_max_hw_sectors in virtblk_probe).
>>
>> I'm not sure how to fix this. Setting an arbitrary sector limit in the
>> virtio-blk driver seems unfair to other users. Maybe we should check if
>> the device is behind a hardware IOMMU before using the DMA API?
>
> Hmm, this looks more like the virtio_block device simply has the wrong
> DMA mask to begin with. For virtio-pci we set the streaming DMA mask to
> 64 bits - should a platform device not be similarly capable?

If it's not, then turning off DMA API will cause random corruption.
ISTM one way or another the bug is in either the DMA ops or in the
driver initialization.

--Andy


[PATCH] virtio-net: Fix DMA-from-the-stack in virtnet_set_mac_address()

2016-12-05 Thread Andy Lutomirski
With CONFIG_VMAP_STACK=y, virtnet_set_mac_address() can be passed a
pointer to the stack and it will OOPS.  Copy the address to the heap
to prevent the crash.

Cc: Michael S. Tsirkin <m...@redhat.com>
Cc: Jason Wang <jasow...@redhat.com>
Cc: Laura Abbott <labb...@redhat.com>
Reported-by: zbys...@in.waw.pl
Signed-off-by: Andy Lutomirski <l...@kernel.org>
---

Very lightly tested.

 drivers/net/virtio_net.c | 19 ++++++++++++++-----
 1 file changed, 14 insertions(+), 5 deletions(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 7276d5a95bd0..cbf1c613c67a 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -969,12 +969,17 @@ static int virtnet_set_mac_address(struct net_device *dev, void *p)
struct virtnet_info *vi = netdev_priv(dev);
struct virtio_device *vdev = vi->vdev;
int ret;
-   struct sockaddr *addr = p;
+   struct sockaddr *addr;
struct scatterlist sg;
 
-   ret = eth_prepare_mac_addr_change(dev, p);
+   addr = kmalloc(sizeof(*addr), GFP_KERNEL);
+   if (!addr)
+   return -ENOMEM;
+   memcpy(addr, p, sizeof(*addr));
+
+   ret = eth_prepare_mac_addr_change(dev, addr);
if (ret)
-   return ret;
+   goto out;
 
if (virtio_has_feature(vdev, VIRTIO_NET_F_CTRL_MAC_ADDR)) {
                sg_init_one(&sg, addr->sa_data, dev->addr_len);
@@ -982,7 +987,8 @@ static int virtnet_set_mac_address(struct net_device *dev, void *p)
                           VIRTIO_NET_CTRL_MAC_ADDR_SET, &sg)) {
                        dev_warn(&vdev->dev,
 "Failed to set mac address by vq command.\n");
-   return -EINVAL;
+   ret = -EINVAL;
+   goto out;
}
} else if (virtio_has_feature(vdev, VIRTIO_NET_F_MAC) &&
   !virtio_has_feature(vdev, VIRTIO_F_VERSION_1)) {
@@ -996,8 +1002,11 @@ static int virtnet_set_mac_address(struct net_device *dev, void *p)
}
 
eth_commit_mac_addr_change(dev, p);
+   ret = 0;
 
-   return 0;
+out:
+   kfree(addr);
+   return ret;
 }
 
 static struct rtnl_link_stats64 *virtnet_stats(struct net_device *dev,
-- 
2.9.3



Re: [PATCH 2/2] virtio_ring: fix complaint by sparse

2016-11-22 Thread Andy Lutomirski
On Tue, Nov 22, 2016 at 7:16 AM, Thomas Huth <th...@redhat.com> wrote:
> On 22.11.2016 16:04, Michael S. Tsirkin wrote:
>> On Tue, Nov 22, 2016 at 01:51:50PM +0800, Gonglei wrote:
>>>  # make C=2 CF="-D__CHECK_ENDIAN__" ./drivers/virtio/
>>>
>>> drivers/virtio/virtio_ring.c:423:19: warning: incorrect type in assignment 
>>> (different base types)
>>> drivers/virtio/virtio_ring.c:423:19:expected unsigned int [unsigned] 
>>> [assigned] i
>>> drivers/virtio/virtio_ring.c:423:19:got restricted __virtio16 
>>> [usertype] next
>>> drivers/virtio/virtio_ring.c:423:19: warning: incorrect type in assignment 
>>> (different base types)
>>> drivers/virtio/virtio_ring.c:423:19:expected unsigned int [unsigned] 
>>> [assigned] i
>>> drivers/virtio/virtio_ring.c:423:19:got restricted __virtio16 
>>> [usertype] next
>>> drivers/virtio/virtio_ring.c:423:19: warning: incorrect type in assignment 
>>> (different base types)
>>> drivers/virtio/virtio_ring.c:423:19:expected unsigned int [unsigned] 
>>> [assigned] i
>>> drivers/virtio/virtio_ring.c:423:19:got restricted __virtio16 
>>> [usertype] next
>>> drivers/virtio/virtio_ring.c:604:39: warning: incorrect type in initializer 
>>> (different base types)
>>> drivers/virtio/virtio_ring.c:604:39:expected unsigned short [unsigned] 
>>> [usertype] nextflag
>>> drivers/virtio/virtio_ring.c:604:39:got restricted __virtio16
>>> drivers/virtio/virtio_ring.c:612:33: warning: restricted __virtio16 
>>> degrades to integer
>>>
>>> Signed-off-by: Gonglei <arei.gong...@huawei.com>
>>> ---
>>>  drivers/virtio/virtio_ring.c | 4 ++--
>>>  1 file changed, 2 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
>>> index 489bfc6..d2863c3 100644
>>> --- a/drivers/virtio/virtio_ring.c
>>> +++ b/drivers/virtio/virtio_ring.c
>>> @@ -420,7 +420,7 @@ static inline int virtqueue_add(struct virtqueue *_vq,
>>>  if (i == err_idx)
>>>  break;
>>>  vring_unmap_one(vq, &desc[i]);
>>> -i = vq->vring.desc[i].next;
>>> +i = virtio16_to_cpu(_vq->vdev, vq->vring.desc[i].next);
>>>  }
>>>
>>>  vq->vq.num_free += total_sg;
> [...]
>> Wow are you saying endian-ness is all wrong for the next field?
>> How do things ever work then?
>
> The above code is only in the error cleanup path (after the
> "unmap_release" label, introduced by commit 780bc7903), so it likely has
> never been exercised in the field.
> I think Gonglei's patch is right, there should be a virtio16_to_cpu() here.

Agreed.
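
For readers not fluent in sparse annotations, the point is that descriptor
fields are stored in virtio byte order, so every access must go through the
virtio16 helpers.  A minimal illustrative sketch (not part of the patch):

#include <linux/virtio.h>
#include <linux/virtio_config.h>
#include <linux/virtio_ring.h>

/* Illustrative only: __virtio16 is little-endian for modern devices and
 * guest-native-endian for legacy ones; vdev tells the helpers which
 * convention the device negotiated, and sparse rejects any shortcut. */
static u16 desc_next(struct virtio_device *vdev, const struct vring_desc *desc)
{
	return virtio16_to_cpu(vdev, desc->next);
}

static void desc_set_next(struct virtio_device *vdev,
			  struct vring_desc *desc, u16 next)
{
	desc->next = cpu_to_virtio16(vdev, next);
}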

>
>  Thomas
>
>



-- 
Andy Lutomirski
AMA Capital Management, LLC


[PATCH] virtio_console: Stop doing DMA on the stack

2016-08-30 Thread Andy Lutomirski
virtio_console uses a small DMA buffer for control requests.  Move
that buffer into heap memory.

Doing virtio DMA on the stack is normally okay on non-DMA-API virtio
systems (which is currently most of them), but it breaks completely
if the stack is virtually mapped.

Tested by typing both directions using picocom aimed at /dev/hvc0.

Signed-off-by: Andy Lutomirski <l...@kernel.org>
---

Hi all-

This is currently broken in tip:x86/asm.  If you (Amit) like this patch,
would it make sense for Ingo to add it to -tip?

Thanks,
Andy

 drivers/char/virtio_console.c | 23 +++++++++++++++--------
 1 file changed, 15 insertions(+), 8 deletions(-)

diff --git a/drivers/char/virtio_console.c b/drivers/char/virtio_console.c
index d2406fe25533..5da47e26a012 100644
--- a/drivers/char/virtio_console.c
+++ b/drivers/char/virtio_console.c
@@ -165,6 +165,12 @@ struct ports_device {
 */
struct virtqueue *c_ivq, *c_ovq;
 
+   /*
+* A control packet buffer for guest->host requests, protected
+* by c_ovq_lock.
+*/
+   struct virtio_console_control cpkt;
+
/* Array of per-port IO virtqueues */
struct virtqueue **in_vqs, **out_vqs;
 
@@ -560,28 +566,29 @@ static ssize_t __send_control_msg(struct ports_device *portdev, u32 port_id,
  unsigned int event, unsigned int value)
 {
struct scatterlist sg[1];
-   struct virtio_console_control cpkt;
struct virtqueue *vq;
unsigned int len;
 
if (!use_multiport(portdev))
return 0;
 
-   cpkt.id = cpu_to_virtio32(portdev->vdev, port_id);
-   cpkt.event = cpu_to_virtio16(portdev->vdev, event);
-   cpkt.value = cpu_to_virtio16(portdev->vdev, value);
-
vq = portdev->c_ovq;
 
-   sg_init_one(sg, &cpkt, sizeof(cpkt));
-
spin_lock(&portdev->c_ovq_lock);
-   if (virtqueue_add_outbuf(vq, sg, 1, &cpkt, GFP_ATOMIC) == 0) {
+
+   portdev->cpkt.id = cpu_to_virtio32(portdev->vdev, port_id);
+   portdev->cpkt.event = cpu_to_virtio16(portdev->vdev, event);
+   portdev->cpkt.value = cpu_to_virtio16(portdev->vdev, value);
+
+   sg_init_one(sg, &portdev->cpkt, sizeof(struct virtio_console_control));
+
+   if (virtqueue_add_outbuf(vq, sg, 1, &portdev->cpkt, GFP_ATOMIC) == 0) {
virtqueue_kick(vq);
while (!virtqueue_get_buf(vq, &len)
&& !virtqueue_is_broken(vq))
cpu_relax();
}
+
spin_unlock(&portdev->c_ovq_lock);
return 0;
 }
-- 
2.7.4



DMA from stack in virtio_net and virtio_console

2016-06-24 Thread Andy Lutomirski
virtio_net does DMA on the stack when it calls sg_init_one in
virtnet_set_queues, virtnet_vlan_rx_add_vid, and
virtnet_vlan_rx_kill_vid.  Michael, I think these are examples we
missed somehow when fixing these issues earlier on.

virtio_console does it here:

sg_init_one(sg, &cpkt, sizeof(cpkt));

This will cause problems on some architectures (Xen at the very least,
and it'll cause more subtle problems on other architectures if they
start using the DMA API), and it will blow up horribly with virtually
mapped stacks.
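
To make the failure mode concrete, a hedged sketch of the pattern in
question and the usual shape of the fix (simplified: my_ctrl is a stand-in
struct, and freeing the buffer on completion is elided):

#include <linux/scatterlist.h>
#include <linux/slab.h>
#include <linux/virtio.h>

struct my_ctrl { __virtio16 cmd; };	/* stand-in for the real packet */

/* Broken: the scatterlist points into the kernel stack.  With
 * CONFIG_VMAP_STACK the stack lives in vmalloc space, so deriving a
 * physical/DMA address from it is bogus. */
static void send_ctrl_broken(struct virtqueue *vq)
{
	struct my_ctrl pkt = {};		/* on the stack */
	struct scatterlist sg;

	sg_init_one(&sg, &pkt, sizeof(pkt));
	virtqueue_add_outbuf(vq, &sg, 1, &pkt, GFP_ATOMIC);
}

/* Fixed: heap memory is always valid for DMA mapping.  The buffer must
 * be freed when the device is done with it; that part is elided. */
static int send_ctrl_fixed(struct virtqueue *vq)
{
	struct my_ctrl *pkt = kmalloc(sizeof(*pkt), GFP_ATOMIC);
	struct scatterlist sg;

	if (!pkt)
		return -ENOMEM;
	sg_init_one(&sg, pkt, sizeof(*pkt));
	return virtqueue_add_outbuf(vq, &sg, 1, pkt, GFP_ATOMIC);
}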

Could you fix these, please?

Thanks,
Andy


Re: [PATCH V2 RFC] fixup! virtio: convert to use DMA api

2016-04-27 Thread Andy Lutomirski
On Wed, Apr 27, 2016 at 7:54 AM, Michael S. Tsirkin <m...@redhat.com> wrote:
> On Wed, Apr 27, 2016 at 07:43:07AM -0700, Andy Lutomirski wrote:
>> On Wed, Apr 27, 2016 at 7:38 AM, Michael S. Tsirkin <m...@redhat.com> wrote:
>> > On Wed, Apr 27, 2016 at 07:31:43AM -0700, Andy Lutomirski wrote:
>> >> On Wed, Apr 27, 2016 at 7:23 AM, Joerg Roedel <j...@8bytes.org> wrote:
>> >> > On Wed, Apr 27, 2016 at 04:37:04PM +0300, Michael S. Tsirkin wrote:
>> >> >> One correction: it's a feature of the device in the system.
>> >> >> There could be a mix of devices bypassing and not
>> >> >> bypassing the IOMMU.
>> >> >
>> >> > No, it really is not. A device can't choose to bypass the IOMMU. But the
>> >> > IOMMU can choose to let the device bypass. So any fix here belongs
>> >> > into the platform/iommu code too and not into some driver.
>> >> >
>> >> >> Sounds good. And a way to detect appropriate devices could
>> >> >> be by looking at the feature flag, perhaps?
>> >> >
>> >> > Again, no! The way to detect that is to look into the iommu description
>> >> > structures provided by the firmware. They provide everything necessary
>> >> > to tell the iommu code which devices are not translated.
>> >> >
>> >>
>> >> Except on PPC and SPARC.  As far as I know, those are the only
>> >> problematic platforms.
>> >>
>> >> Is it too late to *disable* QEMU's q35-iommu thingy until it can be
>> >> fixed to report correct data in the DMAR tables?
>> >>
>> >> --Andy
>> >
>> > Meaning virtio or assigned devices?
>> > For virtio - it's way too late since these are working configurations.
>> > For assigned devices - they don't work on x86 so it doesn't have
>> > to be disabled, it's safe to ignore.
>>
>> I mean actually prevent QEMU from running in q35-iommu mode with any
>> virtio devices attached or maybe even turn off q35-iommu mode entirely
>> [1].  Doesn't it require that the user literally pass the word
>> "experimental" into QEMU right now?  It did at some point IIRC.
>>
>> The reason I'm asking is that, other than q35-iommu, QEMU's virtio
>> devices *don't* bypass the IOMMU except on PPC and SPARC, simply
>> because there is no other configuration AFAICT that has virtio and and
>> IOMMU.  So maybe the right solution is to fix q35-iommu to use DMAR
>> correctly (thus breaking q35-iommu users with older guest kernels,
>> which hopefully don't actually exist) and to come up with a PPC- and
>> SPARC-specific solution, or maybe OpenFirmware-specific solution, to
>> handle PPC and SPARC down the road.
>>
>> [1] I'm pretty sure I emailed the QEMU list before q35-iommu ever
>> showed up in a release asking the QEMU team to please not do that
>> until this issue was resolved.  Sadly, that email was ignored :(
>>
>> --Andy
>
> Sorry, I didn't make myself clear.
> Point is, QEMU is not the only virtio implementation out there.
> So we can't know that no virtio implementation has an IOMMU as long as
> Linux supports this IOMMU.
> virtio always used physical addresses since it was born and if it
> changes that it must do this in a way that does not break existing
> users.

Is there any non-QEMU virtio implementation that can provide an
IOMMU-bypassing virtio device on a platform that has a nontrivial
IOMMU?

--Andy


Re: [PATCH V2 RFC] fixup! virtio: convert to use DMA api

2016-04-27 Thread Andy Lutomirski
On Wed, Apr 27, 2016 at 7:38 AM, Michael S. Tsirkin <m...@redhat.com> wrote:
> On Wed, Apr 27, 2016 at 07:31:43AM -0700, Andy Lutomirski wrote:
>> On Wed, Apr 27, 2016 at 7:23 AM, Joerg Roedel <j...@8bytes.org> wrote:
>> > On Wed, Apr 27, 2016 at 04:37:04PM +0300, Michael S. Tsirkin wrote:
>> >> One correction: it's a feature of the device in the system.
>> >> There could be a mix of devices bypassing and not
>> >> bypassing the IOMMU.
>> >
>> > No, it really is not. A device can't choose to bypass the IOMMU. But the
>> > IOMMU can choose to let the device bypass. So any fix here belongs
>> > into the platform/iommu code too and not into some driver.
>> >
>> >> Sounds good. And a way to detect appropriate devices could
>> >> be by looking at the feature flag, perhaps?
>> >
>> > Again, no! The way to detect that is to look into the iommu description
>> > structures provided by the firmware. They provide everything necessary
>> > to tell the iommu code which devices are not translated.
>> >
>>
>> Except on PPC and SPARC.  As far as I know, those are the only
>> problematic platforms.
>>
>> Is it too late to *disable* QEMU's q35-iommu thingy until it can be
>> fixed to report correct data in the DMAR tables?
>>
>> --Andy
>
> Meaning virtio or assigned devices?
> For virtio - it's way too late since these are working configurations.
> For assigned devices - they don't work on x86 so it doesn't have
> to be disabled, it's safe to ignore.

I mean actually prevent QEMU from running in q35-iommu mode with any
virtio devices attached or maybe even turn off q35-iommu mode entirely
[1].  Doesn't it require that the user literally pass the word
"experimental" into QEMU right now?  It did at some point IIRC.

The reason I'm asking is that, other than q35-iommu, QEMU's virtio
devices *don't* bypass the IOMMU except on PPC and SPARC, simply
because there is no other configuration AFAICT that has both virtio and an
IOMMU.  So maybe the right solution is to fix q35-iommu to use DMAR
correctly (thus breaking q35-iommu users with older guest kernels,
which hopefully don't actually exist) and to come up with a PPC- and
SPARC-specific solution, or maybe OpenFirmware-specific solution, to
handle PPC and SPARC down the road.

[1] I'm pretty sure I emailed the QEMU list before q35-iommu ever
showed up in a release asking the QEMU team to please not do that
until this issue was resolved.  Sadly, that email was ignored :(

--Andy


Re: [PATCH V2 RFC] fixup! virtio: convert to use DMA api

2016-04-27 Thread Andy Lutomirski
On Wed, Apr 27, 2016 at 7:23 AM, Joerg Roedel <j...@8bytes.org> wrote:
> On Wed, Apr 27, 2016 at 04:37:04PM +0300, Michael S. Tsirkin wrote:
>> One correction: it's a feature of the device in the system.
>> There could be a mix of devices bypassing and not
>> bypassing the IOMMU.
>
> No, it really is not. A device can't choose to bypass the IOMMU. But the
> IOMMU can choose to let the device bypass. So any fix here belongs
> into the platform/iommu code too and not into some driver.
>
>> Sounds good. And a way to detect appropriate devices could
>> be by looking at the feature flag, perhaps?
>
> Again, no! The way to detect that is to look into the iommu description
> structures provided by the firmware. They provide everything necessary
> to tell the iommu code which devices are not translated.
>

Except on PPC and SPARC.  As far as I know, those are the only
problematic platforms.

Is it too late to *disable* QEMU's q35-iommu thingy until it can be
fixed to report correct data in the DMAR tables?

--Andy


Re: [PATCH RFC] fixup! virtio: convert to use DMA api

2016-04-20 Thread Andy Lutomirski
On Apr 20, 2016 6:14 AM, "Michael S. Tsirkin" <m...@redhat.com> wrote:
>
> On Tue, Apr 19, 2016 at 02:07:01PM -0700, Andy Lutomirski wrote:
> > On Tue, Apr 19, 2016 at 1:54 PM, Michael S. Tsirkin <m...@redhat.com> wrote:
> > > On Tue, Apr 19, 2016 at 01:27:29PM -0700, Andy Lutomirski wrote:
> > >> On Tue, Apr 19, 2016 at 1:16 PM, Michael S. Tsirkin <m...@redhat.com> 
> > >> wrote:
> > >> > On Tue, Apr 19, 2016 at 11:01:38AM -0700, Andy Lutomirski wrote:
> > >> >> On Tue, Apr 19, 2016 at 10:49 AM, Michael S. Tsirkin 
> > >> >> <m...@redhat.com> wrote:
> > >> >> > On Tue, Apr 19, 2016 at 12:26:44PM -0400, David Woodhouse wrote:
> > >> >> >> On Tue, 2016-04-19 at 19:20 +0300, Michael S. Tsirkin wrote:
> > >> >> >> >
> > >> >> >> > > I thought that PLATFORM served that purpose.  Wouldn't the host
> > >> >> >> > > advertise PLATFORM support and, if the guest doesn't ack it, 
> > >> >> >> > > the host
> > >> >> >> > > device would skip translation?  Or is that problematic for 
> > >> >> >> > > vfio?
> > >> >> >> >
> > >> >> >> > Exactly that's problematic for security.
> > >> >> >> > You can't allow guest driver to decide whether device skips 
> > >> >> >> > security.
> > >> >> >>
> > >> >> >> Right. Because fundamentally, this *isn't* a property of the 
> > >> >> >> endpoint
> > >> >> >> device, and doesn't live in virtio itself.
> > >> >> >>
> > >> >> >> It's a property of the platform IOMMU, and lives there.
> > >> >> >
> > >> >> > It's a property of the hypervisor virtio implementation, and lives 
> > >> >> > there.
> > >> >>
> > >> >> It is now, but QEMU could, in principle, change the way it thinks
> > >> >> about it so that virtio devices would use the QEMU DMA API but ask
> > >> >> QEMU to pass everything through 1:1.  This would be entirely invisible
> > >> >> to guests but would make it be a property of the IOMMU implementation.
> > >> >> At that point, maybe QEMU could find a (platform dependent) way to
> > >> >> tell the guest what's going on.
> > >> >>
> > >> >> FWIW, as far as I can tell, PPC and SPARC really could, in principle,
> > >> >> set up 1:1 mappings in the guest so that the virtio devices would work
> > >> >> regardless of whether QEMU is ignoring the IOMMU or not -- I think the
> > >> >> only obstacle is that the PPC and SPARC 1:1 mappings are currently set
> > >> >> up with an offset.  I don't know too much about those platforms, but
> > >> >> presumably the layout could be changed so that 1:1 really was 1:1.
> > >> >>
> > >> >> --Andy
> > >> >
> > >> > Sure. Do you see any reason why the decision to do this can't be
> > >> > keyed off the virtio feature bit?
> > >>
> > >> I can think of three types of virtio host:
> > >>
> > >> a) virtio always bypasses the IOMMU.
> > >>
> > >> b) virtio never bypasses the IOMMU (unless DMAR tables or similar say
> > >> it does) -- i.e. virtio works like any other device.
> > >>
> > >> c) virtio may bypass the IOMMU depending on what the guest asks it to do.
> > >
> > > d) some virtio devices bypass the IOMMU and some don't,
> > > e.g. it's harder to support IOMMU with vhost.
> > >
> > >
> > >> If this is keyed off a virtio feature bit and anyone tries to
> > >> implement (c), then vfio is going to have a problem.  And, if it's
> > >> keyed off a virtio feature bit, then (a) won't work on Xen or similar
> > >> setups unless the Xen hypervisor adds a giant and probably unreliable
> > >> kludge to support it.  Meanwhile, 4.6-rc works fine under Xen on a
> > >> default x86 QEMU configuration, and I'd really like to keep it that
> > >> way.
> > >>
> > >> What could plausibly work using a virtio feature bit is for a device
> > >> to say "hey, I'm a new device and I support the platform-defined 

Re: [PATCH RFC] fixup! virtio: convert to use DMA api

2016-04-19 Thread Andy Lutomirski
On Tue, Apr 19, 2016 at 1:54 PM, Michael S. Tsirkin <m...@redhat.com> wrote:
> On Tue, Apr 19, 2016 at 01:27:29PM -0700, Andy Lutomirski wrote:
>> On Tue, Apr 19, 2016 at 1:16 PM, Michael S. Tsirkin <m...@redhat.com> wrote:
>> > On Tue, Apr 19, 2016 at 11:01:38AM -0700, Andy Lutomirski wrote:
>> >> On Tue, Apr 19, 2016 at 10:49 AM, Michael S. Tsirkin <m...@redhat.com> 
>> >> wrote:
>> >> > On Tue, Apr 19, 2016 at 12:26:44PM -0400, David Woodhouse wrote:
>> >> >> On Tue, 2016-04-19 at 19:20 +0300, Michael S. Tsirkin wrote:
>> >> >> >
>> >> >> > > I thought that PLATFORM served that purpose.  Wouldn't the host
>> >> >> > > advertise PLATFORM support and, if the guest doesn't ack it, the 
>> >> >> > > host
>> >> >> > > device would skip translation?  Or is that problematic for vfio?
>> >> >> >
>> >> >> > Exactly that's problematic for security.
>> >> >> > You can't allow guest driver to decide whether device skips security.
>> >> >>
>> >> >> Right. Because fundamentally, this *isn't* a property of the endpoint
>> >> >> device, and doesn't live in virtio itself.
>> >> >>
>> >> >> It's a property of the platform IOMMU, and lives there.
>> >> >
>> >> > It's a property of the hypervisor virtio implementation, and lives 
>> >> > there.
>> >>
>> >> It is now, but QEMU could, in principle, change the way it thinks
>> >> about it so that virtio devices would use the QEMU DMA API but ask
>> >> QEMU to pass everything through 1:1.  This would be entirely invisible
>> >> to guests but would make it be a property of the IOMMU implementation.
>> >> At that point, maybe QEMU could find a (platform dependent) way to
>> >> tell the guest what's going on.
>> >>
>> >> FWIW, as far as I can tell, PPC and SPARC really could, in principle,
>> >> set up 1:1 mappings in the guest so that the virtio devices would work
>> >> regardless of whether QEMU is ignoring the IOMMU or not -- I think the
>> >> only obstacle is that the PPC and SPARC 1:1 mappings are currently set
>> >> up with an offset.  I don't know too much about those platforms, but
>> >> presumably the layout could be changed so that 1:1 really was 1:1.
>> >>
>> >> --Andy
>> >
>> > Sure. Do you see any reason why the decision to do this can't be
>> > keyed off the virtio feature bit?
>>
>> I can think of three types of virtio host:
>>
>> a) virtio always bypasses the IOMMU.
>>
>> b) virtio never bypasses the IOMMU (unless DMAR tables or similar say
>> it does) -- i.e. virtio works like any other device.
>>
>> c) virtio may bypass the IOMMU depending on what the guest asks it to do.
>
> d) some virtio devices bypass the IOMMU and some don't,
> e.g. it's harder to support IOMMU with vhost.
>
>
>> If this is keyed off a virtio feature bit and anyone tries to
>> implement (c), then vfio is going to have a problem.  And, if it's
>> keyed off a virtio feature bit, then (a) won't work on Xen or similar
>> setups unless the Xen hypervisor adds a giant and probably unreliable
>> kludge to support it.  Meanwhile, 4.6-rc works fine under Xen on a
>> default x86 QEMU configuration, and I'd really like to keep it that
>> way.
>>
>> What could plausibly work using a virtio feature bit is for a device
>> to say "hey, I'm a new device and I support the platform-defined IOMMU
>> mechanism".  This bit would be *set* on default IOMMU-less QEMU
>> configurations and on physical virtio PCI cards.
>
> And clear on xen.

How?  QEMU has no idea that the guest is running Xen.


Re: [PATCH RFC] fixup! virtio: convert to use DMA api

2016-04-19 Thread Andy Lutomirski
On Tue, Apr 19, 2016 at 1:16 PM, Michael S. Tsirkin <m...@redhat.com> wrote:
> On Tue, Apr 19, 2016 at 11:01:38AM -0700, Andy Lutomirski wrote:
>> On Tue, Apr 19, 2016 at 10:49 AM, Michael S. Tsirkin <m...@redhat.com> wrote:
>> > On Tue, Apr 19, 2016 at 12:26:44PM -0400, David Woodhouse wrote:
>> >> On Tue, 2016-04-19 at 19:20 +0300, Michael S. Tsirkin wrote:
>> >> >
>> >> > > I thought that PLATFORM served that purpose.  Woudn't the host
>> >> > > advertise PLATFORM support and, if the guest doesn't ack it, the host
>> >> > > device would skip translation?  Or is that problematic for vfio?
>> >> >
>> >> > Exactly that's problematic for security.
>> >> > You can't allow guest driver to decide whether device skips security.
>> >>
>> >> Right. Because fundamentally, this *isn't* a property of the endpoint
>> >> device, and doesn't live in virtio itself.
>> >>
>> >> It's a property of the platform IOMMU, and lives there.
>> >
>> > It's a property of the hypervisor virtio implementation, and lives there.
>>
>> It is now, but QEMU could, in principle, change the way it thinks
>> about it so that virtio devices would use the QEMU DMA API but ask
>> QEMU to pass everything through 1:1.  This would be entirely invisible
>> to guests but would make it be a property of the IOMMU implementation.
>> At that point, maybe QEMU could find a (platform dependent) way to
>> tell the guest what's going on.
>>
>> FWIW, as far as I can tell, PPC and SPARC really could, in principle,
>> set up 1:1 mappings in the guest so that the virtio devices would work
>> regardless of whether QEMU is ignoring the IOMMU or not -- I think the
>> only obstacle is that the PPC and SPARC 1:1 mappings are currently set
>> up with an offset.  I don't know too much about those platforms, but
>> presumably the layout could be changed so that 1:1 really was 1:1.
>>
>> --Andy
>
> Sure. Do you see any reason why the decision to do this can't be
> keyed off the virtio feature bit?

I can think of three types of virtio host:

a) virtio always bypasses the IOMMU.

b) virtio never bypasses the IOMMU (unless DMAR tables or similar say
it does) -- i.e. virtio works like any other device.

c) virtio may bypass the IOMMU depending on what the guest asks it to do.

If this is keyed off a virtio feature bit and anyone tries to
implement (c), then vfio is going to have a problem.  And, if it's
keyed off a virtio feature bit, then (a) won't work on Xen or similar
setups unless the Xen hypervisor adds a giant and probably unreliable
kludge to support it.  Meanwhile, 4.6-rc works fine under Xen on a
default x86 QEMU configuration, and I'd really like to keep it that
way.

What could plausibly work using a virtio feature bit is for a device
to say "hey, I'm a new device and I support the platform-defined IOMMU
mechanism".  This bit would be *set* on default IOMMU-less QEMU
configurations and on physical virtio PCI cards.  The guest could
operate accordingly.  I'm not sure I see a good way for feature
negotiation to work the other direction, though.

PPC and SPARC could only set this bit on emulated devices if they know
that new guest kernels are in use.
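
As a sketch of what keying off such a bit could look like on the guest
side; the name and bit number are invented here (this thread predates the
feature that later entered the spec along these lines):

#include <linux/virtio.h>
#include <linux/virtio_config.h>

/* Hypothetical: VIRTIO_F_REAL_IOMMU is an invented name and value for
 * the "I honour the platform IOMMU" bit discussed above. */
#define VIRTIO_F_REAL_IOMMU	33

static bool virtio_obeys_platform_iommu(struct virtio_device *vdev)
{
	/* New device: promises to go through platform DMA translation,
	 * so the guest must map its buffers via the DMA API. */
	if (virtio_has_feature(vdev, VIRTIO_F_REAL_IOMMU))
		return true;

	/* Otherwise assume the legacy contract: raw guest-physical
	 * addresses, no translation. */
	return false;
}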

--Andy


Re: [PATCH RFC] fixup! virtio: convert to use DMA api

2016-04-19 Thread Andy Lutomirski
On Tue, Apr 19, 2016 at 10:49 AM, Michael S. Tsirkin <m...@redhat.com> wrote:
> On Tue, Apr 19, 2016 at 12:26:44PM -0400, David Woodhouse wrote:
>> On Tue, 2016-04-19 at 19:20 +0300, Michael S. Tsirkin wrote:
>> >
>> > > I thought that PLATFORM served that purpose.  Wouldn't the host
>> > > advertise PLATFORM support and, if the guest doesn't ack it, the host
>> > > device would skip translation?  Or is that problematic for vfio?
>> >
>> > Exactly that's problematic for security.
>> > You can't allow guest driver to decide whether device skips security.
>>
>> Right. Because fundamentally, this *isn't* a property of the endpoint
>> device, and doesn't live in virtio itself.
>>
>> It's a property of the platform IOMMU, and lives there.
>
> It's a property of the hypervisor virtio implementation, and lives there.

It is now, but QEMU could, in principle, change the way it thinks
about it so that virtio devices would use the QEMU DMA API but ask
QEMU to pass everything through 1:1.  This would be entirely invisible
to guests but would make it be a property of the IOMMU implementation.
At that point, maybe QEMU could find a (platform dependent) way to
tell the guest what's going on.

FWIW, as far as I can tell, PPC and SPARC really could, in principle,
set up 1:1 mappings in the guest so that the virtio devices would work
regardless of whether QEMU is ignoring the IOMMU or not -- I think the
only obstacle is that the PPC and SPARC 1:1 mappings are currently set
up with an offset.  I don't know too much about those platforms, but
presumably the layout could be changed so that 1:1 really was 1:1.

--Andy


Re: [PATCH RFC] fixup! virtio: convert to use DMA api

2016-04-19 Thread Andy Lutomirski
On Tue, Apr 19, 2016 at 9:09 AM, Michael S. Tsirkin <m...@redhat.com> wrote:
> On Tue, Apr 19, 2016 at 09:02:14AM -0700, Andy Lutomirski wrote:
>> On Tue, Apr 19, 2016 at 3:27 AM, Michael S. Tsirkin <m...@redhat.com> wrote:
>> > On Mon, Apr 18, 2016 at 12:24:15PM -0700, Andy Lutomirski wrote:
>> >> On Mon, Apr 18, 2016 at 11:29 AM, David Woodhouse <dw...@infradead.org> 
>> >> wrote:
>> >> > For x86, you *can* enable virtio-behind-IOMMU if your DMAR tables tell
>> >> > the truth, and even legacy kernels ought to cope with that.
>> >> > FSVO 'ought to' where I suspect some of them will actually crash with a
>> >> > NULL pointer dereference if there's no "catch-all" DMAR unit in the
>> >> > tables, which puts it back into the same camp as ARM and Power.
>> >>
>> >> I think x86 may get a bit of a free pass here.  AFAIK the QEMU IOMMU
>> >> implementation on x86 has always been "experimental", so it just might
>> >> be okay to change it in a way that causes some older kernels to OOPS.
>> >>
>> >> --Andy
>> >
>> > Since it's experimental, it might be OK to change *guest kernels*
>> > such that they oops on old QEMU.
>> > But guest kernels were not experimental - so we need a QEMU mode that
>> > makes them work fine. The more functionality is available in this QEMU
>> > mode, the better, because it's going to be the default for a while. For
>> > the same reason, it is preferable to also have new kernels not crash in
>> > this mode.
>> >
>>
>> People add QEMU features that need new guest kernels all the time.
>> If you enable virtio-scsi and try to boot a guest that's too old, it
>> won't work.  So I don't see anything fundamentally wrong with saying
>> that the non-experimental QEMU Q35 IOMMU mode won't boot if the guest
>> kernel is too old.  It might be annoying, since old kernels do work on
>> actual Q35 hardware, but it at least seems to me that it might be
>> okay.
>>
>> --Andy
>
> Yes but we need a mode that makes both old and new kernels work, and
> that should be the default for a while.  this is what the
> IOMMU_PASSTHROUGH flag was about: old kernels ignore it and bypass DMA
> API, new kernels go "oh compatibility mode" and bypass the IOMMU
> within DMA API.

I thought that PLATFORM served that purpose.  Wouldn't the host
advertise PLATFORM support and, if the guest doesn't ack it, the host
device would skip translation?  Or is that problematic for vfio?

>
> --
> MST



-- 
Andy Lutomirski
AMA Capital Management, LLC


Re: [PATCH RFC] fixup! virtio: convert to use DMA api

2016-04-19 Thread Andy Lutomirski
On Tue, Apr 19, 2016 at 3:27 AM, Michael S. Tsirkin <m...@redhat.com> wrote:
> On Mon, Apr 18, 2016 at 12:24:15PM -0700, Andy Lutomirski wrote:
>> On Mon, Apr 18, 2016 at 11:29 AM, David Woodhouse <dw...@infradead.org> 
>> wrote:
>> > For x86, you *can* enable virtio-behind-IOMMU if your DMAR tables tell
>> > the truth, and even legacy kernels ought to cope with that.
>> > FSVO 'ought to' where I suspect some of them will actually crash with a
>> > NULL pointer dereference if there's no "catch-all" DMAR unit in the
>> > tables, which puts it back into the same camp as ARM and Power.
>>
>> I think x86 may get a bit of a free pass here.  AFAIK the QEMU IOMMU
>> implementation on x86 has always been "experimental", so it just might
>> be okay to change it in a way that causes some older kernels to OOPS.
>>
>> --Andy
>
> Since it's experimental, it might be OK to change *guest kernels*
> such that they oops on old QEMU.
> But guest kernels were not experimental - so we need a QEMU mode that
> makes them work fine. The more functionality is available in this QEMU
> mode, the better, because it's going to be the default for a while. For
> the same reason, it is preferable to also have new kernels not crash in
> this mode.
>

People add QEMU features that need new guest kernels all the time.
If you enable virtio-scsi and try to boot a guest that's too old, it
won't work.  So I don't see anything fundamentally wrong with saying
that the non-experimental QEMU Q35 IOMMU mode won't boot if the guest
kernel is too old.  It might be annoying, since old kernels do work on
actual Q35 hardware, but it at least seems to be that it might be
okay.

--Andy


Re: [PATCH RFC] fixup! virtio: convert to use DMA api

2016-04-19 Thread Andy Lutomirski
On Apr 19, 2016 2:13 AM, "Michael S. Tsirkin" <m...@redhat.com> wrote:
>
>
> I guess you are right in that we should split this part out.
> What I wanted is really the combination
> PASSTHROUGH && !PLATFORM so that we can say "ok we don't
> need to guess, this device actually bypasses the IOMMU".

What happens when you use a device like this on Xen or with a similar
software translation layer?

I think that a "please bypass IOMMU" feature would be better in the
PCI, IOMMU, or platform code.  For Xen, virtio would still want to use
the DMA API, just without translating at the DMAR or hardware level.
Doing it in virtio is awkward, because virtio is involved at the
device level and the driver level, but the translation might be
entirely in between.
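
For contrast, the series under discussion keys the decision off the
platform rather than off the device; roughly (a simplified sketch, not a
verbatim quote of the patch):

#include <linux/virtio.h>
#include <xen/xen.h>

/* Simplified sketch: on Xen the translation layer (grant tables or
 * swiotlb-xen) sits between device and driver, so no device feature
 * bit can describe it; the guest just checks where it is running. */
static bool vring_use_dma_api(struct virtio_device *vdev)
{
	if (xen_domain())
		return true;

	return false;	/* default: legacy physical addressing */
}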

I think a nicer long-term approach would be to have a way to ask the
guest to set up a full 1:1 mapping for performance, but to still
handle the case where the guest refuses to do so or where there's more
than one translation layer involved.

But I agree that this part shouldn't delay the other part of your series.

--Andy


Re: [PATCH RFC] fixup! virtio: convert to use DMA api

2016-04-18 Thread Andy Lutomirski
On Mon, Apr 18, 2016 at 11:29 AM, David Woodhouse <dw...@infradead.org> wrote:
> For x86, you *can* enable virtio-behind-IOMMU if your DMAR tables tell
> the truth, and even legacy kernels ought to cope with that.
> FSVO 'ought to' where I suspect some of them will actually crash with a
> NULL pointer dereference if there's no "catch-all" DMAR unit in the
> tables, which puts it back into the same camp as ARM and Power.

I think x86 may get a bit of a free pass here.  AFAIK the QEMU IOMMU
implementation on x86 has always been "experimental", so it just might
be okay to change it in a way that causes some older kernels to OOPS.

--Andy


Re: [PATCH v7 0/9] virtio DMA API, yet again

2016-02-16 Thread Andy Lutomirski
On Tue, Feb 2, 2016 at 9:46 PM, Andy Lutomirski <l...@kernel.org> wrote:
> This switches virtio to use the DMA API on Xen and if requested by
> module option.

Michael, any update on this?

--Andy


Re: [PATCH v7 9/9] vring: Use the DMA API on Xen

2016-02-04 Thread Andy Lutomirski
On Wed, Feb 3, 2016 at 1:49 AM, David Vrabel <david.vra...@citrix.com> wrote:
> On 03/02/16 05:46, Andy Lutomirski wrote:
>> Signed-off-by: Andy Lutomirski <l...@kernel.org>
>
> You forgot the previous Reviewed-by tags.

Whoops.  If I send another version, they'll be there.

>
> David



-- 
Andy Lutomirski
AMA Capital Management, LLC


Re: [PATCH v7 5/9] virtio_ring: Support DMA APIs

2016-02-03 Thread Andy Lutomirski
On Feb 3, 2016 5:52 AM, "Michael S. Tsirkin" <m...@redhat.com> wrote:
>
> On Tue, Feb 02, 2016 at 09:46:36PM -0800, Andy Lutomirski wrote:
> > virtio_ring currently sends the device (usually a hypervisor)
> > physical addresses of its I/O buffers.  This is okay when DMA
> > addresses and physical addresses are the same thing, but this isn't
> > always the case.  For example, this never works on Xen guests, and
> > it is likely to fail if a physical "virtio" device ever ends up
> > behind an IOMMU or swiotlb.
> >
> > The immediate use case for me is to enable virtio on Xen guests.
> > For that to work, we need vring to support DMA address translation
> > as well as a corresponding change to virtio_pci or to another
> > driver.
> >
> > Signed-off-by: Andy Lutomirski <l...@kernel.org>
> > ---
> >  drivers/virtio/Kconfig   |   2 +-
> >  drivers/virtio/virtio_ring.c | 200 +++++++++++++++++++++++++++++++++------
> >  tools/virtio/linux/dma-mapping.h |  17 ++++
> >  3 files changed, 183 insertions(+), 36 deletions(-)
> >  create mode 100644 tools/virtio/linux/dma-mapping.h
> >
> > diff --git a/drivers/virtio/Kconfig b/drivers/virtio/Kconfig
> > index cab9f3f63a38..77590320d44c 100644
> > --- a/drivers/virtio/Kconfig
> > +++ b/drivers/virtio/Kconfig
> > @@ -60,7 +60,7 @@ config VIRTIO_INPUT
> >
> >   config VIRTIO_MMIO
> >   tristate "Platform bus driver for memory mapped virtio devices"
> > - depends on HAS_IOMEM
> > + depends on HAS_IOMEM && HAS_DMA
> >   select VIRTIO
> >   ---help---
> >This drivers provides support for memory mapped virtio
>
> What's this chunk doing here btw? Should be part of the mmio patch?
>

IIRC it was deliberate.  Making virtio depend on HAS_DMA didn't work
right because kconfig doesn't propagate dependencies through select
intelligently.  This patch makes core virtio depend on HAS_DMA, so I
added the dependency here, too.

--Andy


Re: [PATCH v6 6/9] virtio: Add improved queue allocation API

2016-02-02 Thread Andy Lutomirski
On Tue, Feb 2, 2016 at 3:25 AM, Michael S. Tsirkin <m...@redhat.com> wrote:
> On Mon, Feb 01, 2016 at 10:00:56AM -0800, Andy Lutomirski wrote:
>> This leaves vring_new_virtqueue alone for compatbility, but it
>> adds two new improved APIs:
>>
>> vring_create_virtqueue: Creates a virtqueue backed by automatically
>> allocated coherent memory.  (Some day it this could be extended to
>> support non-coherent memory, too, if there ends up being a platform
>> on which it's worthwhile.)
>>
>> __vring_new_virtqueue: Creates a virtqueue with a manually-specified
>> layout.  This should allow mic_virtio to work much more cleanly.
>>
>> Signed-off-by: Andy Lutomirski <l...@kernel.org>
>> ---
>>  drivers/virtio/virtio_ring.c | 178 ++++++++++++++++++++++++++++++++-------
>>  include/linux/virtio.h   |  23 ++++++++++++-
>>  include/linux/virtio_ring.h  |  35 +++++++++++
>>  3 files changed, 204 insertions(+), 32 deletions(-)
>>
>> diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
>> index 2f621e96b9ff..cf2840c7e500 100644
>> --- a/drivers/virtio/virtio_ring.c
>> +++ b/drivers/virtio/virtio_ring.c
>> @@ -95,6 +95,11 @@ struct vring_virtqueue {
>>   /* How to notify other side. FIXME: commonalize hcalls! */
>>   bool (*notify)(struct virtqueue *vq);
>>
>> + /* DMA, allocation, and size information */
>> + bool we_own_ring;
>> + size_t queue_size_in_bytes;
>> + dma_addr_t queue_dma_addr;
>> +
>>  #ifdef DEBUG
>>   /* They're supposed to lock for us. */
>>   unsigned int in_use;
>> @@ -878,36 +883,31 @@ irqreturn_t vring_interrupt(int irq, void *_vq)
>>  }
>>  EXPORT_SYMBOL_GPL(vring_interrupt);
>>
>> -struct virtqueue *vring_new_virtqueue(unsigned int index,
>> -   unsigned int num,
>> -   unsigned int vring_align,
>> -   struct virtio_device *vdev,
>> -   bool weak_barriers,
>> -   void *pages,
>> -   bool (*notify)(struct virtqueue *),
>> -   void (*callback)(struct virtqueue *),
>> -   const char *name)
>> +struct virtqueue *__vring_new_virtqueue(unsigned int index,
>> + struct vring vring,
>> + struct virtio_device *vdev,
>> + bool weak_barriers,
>> + bool (*notify)(struct virtqueue *),
>> + void (*callback)(struct virtqueue *),
>> + const char *name)
>>  {
>> - struct vring_virtqueue *vq;
>>   unsigned int i;
>> + struct vring_virtqueue *vq;
>>
>> - /* We assume num is a power of 2. */
>> - if (num & (num - 1)) {
>> - dev_warn(&vdev->dev, "Bad virtqueue length %u\n", num);
>> - return NULL;
>> - }
>> -
>> - vq = kmalloc(sizeof(*vq) + num * sizeof(struct vring_desc_state),
>> + vq = kmalloc(sizeof(*vq) + vring.num * sizeof(struct vring_desc_state),
>>GFP_KERNEL);
>>   if (!vq)
>>   return NULL;
>>
>> - vring_init(&vq->vring, num, pages, vring_align);
>> + vq->vring = vring;
>>   vq->vq.callback = callback;
>>   vq->vq.vdev = vdev;
>>   vq->vq.name = name;
>> - vq->vq.num_free = num;
>> + vq->vq.num_free = vring.num;
>>   vq->vq.index = index;
>> + vq->we_own_ring = false;
>> + vq->queue_dma_addr = 0;
>> + vq->queue_size_in_bytes = 0;
>>   vq->notify = notify;
>>   vq->weak_barriers = weak_barriers;
>>   vq->broken = false;
>> @@ -932,18 +932,105 @@ struct virtqueue *vring_new_virtqueue(unsigned int index,
>>
>>   /* Put everything in free lists. */
>>   vq->free_head = 0;
>> - for (i = 0; i < num-1; i++)
>> + for (i = 0; i < vring.num-1; i++)
>>   vq->vring.desc[i].next = cpu_to_virtio16(vdev, i + 1);
>> - memset(vq->desc_state, 0, num * sizeof(struct vring_desc_state));
>> + memset(vq->desc_state, 0, vring.num * sizeof(struct vring_desc_state));
>>
>>   return &vq->vq;
>>  }
>> +EXPORT_SYMBOL_GPL(__vring_new_virtqueue);
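
For orientation, a usage sketch of the new API described above, assuming
the signature from this version of the series; the notify and callback
helpers are invented for illustration:

#include <linux/virtio.h>
#include <linux/virtio_ring.h>

static bool my_notify(struct virtqueue *vq)
{
	/* kick the device; transport-specific in real drivers */
	return true;
}

static void my_callback(struct virtqueue *vq)
{
	/* consume used buffers */
}

static struct virtqueue *my_create_queue(struct virtio_device *vdev)
{
	/* The ring and its coherent backing memory are allocated for
	 * us; 256 entries and PAGE_SIZE alignment are arbitrary. */
	return vring_create_virtqueue(0 /* index */, 256 /* num */,
				      PAGE_SIZE /* vring_align */, vdev,
				      true /* weak_barriers */,
				      true /* may_reduce_num */,
				      my_notify, my_callback, "requests");
}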
