Re: [PATCH 4/8] x86/traps: Demand-populate PASID MSR via #GP

2021-09-29 Thread Andy Lutomirski



On Wed, Sep 29, 2021, at 10:07 AM, Luck, Tony wrote:
> On Wed, Sep 29, 2021 at 09:58:22AM -0700, Andy Lutomirski wrote:
>> On 9/28/21 16:10, Luck, Tony wrote:
>> > Moving beyond pseudo-code and into compiles-but-probably-broken-code.
>> > 
>> > 
>> > The intent of the functions below is that Fenghua should be able to
>> > do:
>> > 
>> > void fpu__pasid_write(u32 pasid)
>> > {
>> >    u64 msr_val = pasid | MSR_IA32_PASID_VALID;
>> >    struct ia32_pasid_state *addr;
>> > 
>> >    addr = begin_update_one_xsave_feature(current, XFEATURE_PASID, true);
>> >    addr->pasid = msr_val;
>> >    finish_update_one_xsave_feature(current);
>> > }
>> > 
>> 
>> This gets gnarly because we would presumably like to optimize the case where
>> we can do the update directly in registers.  I wonder if we can do it with a
>> bit of macro magic in a somewhat generic way:
>
> Can we defer the optimizations until there is a use case that
> cares? The existing use case (fixing up the #GP fault by setting
> the PASID MSR) isn't performance critical.
>
> Let's just have something that is clear (or as clear as any xsave
> code can be) and correct.
>
> 

The goal would be to use the same code for CET and PKRU, I think.
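
For illustration, a minimal and deliberately unoptimized version of the two
helpers named above could look like the following. It always writes the
in-memory xstate image and invalidates any register copy, so there is no
registers-vs-memory fast path to get wrong; the locking details and helper
names here are assumptions for the sketch, not the posted implementation:

/* Sketch only: always update the memory image, never the live registers. */
static void *begin_update_one_xsave_feature(struct task_struct *tsk,
                                            enum xfeature xfeature, bool full)
{
    struct xregs_state *xsave = &tsk->thread.fpu.state.xsave;

    /* This sketch assumes tsk == current and ignores @full. */
    fpregs_lock();

    /* If the registers are live, spill them so the buffer is current. */
    if (!test_thread_flag(TIF_NEED_FPU_LOAD)) {
        save_fpregs_to_fpstate(&tsk->thread.fpu);
        set_thread_flag(TIF_NEED_FPU_LOAD);
    }

    /* Mark the feature in use so it is restored from the buffer later. */
    xsave->header.xfeatures |= BIT_ULL(xfeature);
    return get_xsave_addr(xsave, xfeature);
}

static void finish_update_one_xsave_feature(struct task_struct *tsk)
{
    /* Any copy still cached in registers must not be reused. */
    __fpu_invalidate_fpregs_state(&tsk->thread.fpu);
    fpregs_unlock();
}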


Re: [PATCH 5/8] x86/mmu: Add mm-based PASID refcounting

2021-09-29 Thread Andy Lutomirski



On Wed, Sep 29, 2021, at 10:41 AM, Luck, Tony wrote:
> On Wed, Sep 29, 2021 at 07:15:53PM +0200, Thomas Gleixner wrote:
>> On Wed, Sep 29 2021 at 09:59, Andy Lutomirski wrote:
>> > On 9/29/21 05:28, Thomas Gleixner wrote:
>> >> Looking at that patch again, none of this muck in fpu__pasid_write() is
>> >> required at all. The whole exception fixup is:
>> >> 
>> >>  if (!user_mode(regs))
>> >>   return false;
>> >> 
>> >>  if (!current->mm->pasid)
>> >>   return false;
>> >> 
>> >>  if (current->pasid_activated)
>> >>return false;
>> >
>> > <-- preemption or BH here: kaboom.
>> 
>> Sigh, this obviously had to run in the early portion of #GP, i.e. before
>> enabling interrupts.
>
> Like this? Obviously with some comment about why this is being done.
>
> diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
> index a58800973aed..a848a59291e7 100644
> --- a/arch/x86/kernel/traps.c
> +++ b/arch/x86/kernel/traps.c
> @@ -536,6 +536,12 @@ DEFINE_IDTENTRY_ERRORCODE(exc_general_protection)
>   unsigned long gp_addr;
>   int ret;
> 
> + if (user_mode(regs) && current->mm->pasid && !current->pasid_activated) {
> + current->pasid_activated = 1;
> + wrmsrl(MSR_IA32_PASID, current->mm->pasid | MSR_IA32_PASID_VALID);
> + return;
> + }
> +

This could do with a WARN_ON_ONCE(TIF_NEED_FPU_LOAD) imo.

Is instrumentation allowed to touch FPU state?

>   cond_local_irq_enable(regs);
> 
>   if (static_cpu_has(X86_FEATURE_UMIP)) {
>
> -Tony
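
Folding that suggestion in, the early fixup could read roughly as follows
(illustrative only; interrupts are still disabled at this point, which is what
makes the direct MSR write safe):

    if (user_mode(regs) && current->mm->pasid && !current->pasid_activated) {
        /*
         * The mm has a PASID but this task has not had the MSR populated
         * yet.  We arrived here straight from user mode with interrupts
         * still off, so the FPU registers are live and cannot be switched
         * out under us.
         */
        WARN_ON_ONCE(test_thread_flag(TIF_NEED_FPU_LOAD));
        current->pasid_activated = 1;
        wrmsrl(MSR_IA32_PASID, current->mm->pasid | MSR_IA32_PASID_VALID);
        return;
    }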


Re: [PATCH 5/8] x86/mmu: Add mm-based PASID refcounting

2021-09-29 Thread Andy Lutomirski

On 9/29/21 05:28, Thomas Gleixner wrote:

On Wed, Sep 29 2021 at 11:54, Peter Zijlstra wrote:

On Fri, Sep 24, 2021 at 04:03:53PM -0700, Andy Lutomirski wrote:

I think the perfect and the good are a bit confused here. If we go for
"good", then we have an mm owning a PASID for its entire lifetime.  If
we want "perfect", then we should actually do it right: teach the
kernel to update an entire mm's PASID setting all at once.  This isn't
*that* hard -- it involves two things:

1. The context switch code needs to resync PASID.  Unfortunately, this
adds some overhead to every context switch, although a static_branch
could minimize it for non-PASID users.



2. A change to an mm's PASID needs to send an IPI, but that IPI can't
touch FPU state.  So instead the IPI should use task_work_add() to
make sure PASID gets resynced.


What do we need 1 for? Any PASID change can be achieved using 2 no?

Basically, call task_work_add() on all relevant tasks [1], then IPI
spray the current running of those and presto.

[1] it is nigh on impossible to find all tasks sharing an mm in any sane
way due to CLONE_VM && !CLONE_THREAD.


Why would we want any of that at all?

   Process starts, no PASID assigned.

   bind to device -> PASID is allocated and assigned to the mm

   some task of the process issues ENQCMD -> #GP -> write PASID MSR

   After that the PASID is saved and restored as part of the XSTATE and
   there is no extra overhead in context switch or return to user space.

   All tasks of the process which never used ENQCMD don't care and their
   PASID xstate is in init state.

There is absolutely no point in enforcing that all tasks of the process
have the PASID activated immediately when it is assigned. If they need
it they get it via the #GP fixup and everything just works.

Looking at that patch again, none of this muck in fpu__pasid_write() is
required at all. The whole exception fixup is:

 if (!user_mode(regs))
  return false;

 if (!current->mm->pasid)
  return false;

 if (current->pasid_activated)
 return false;


<-- preemption or BH here: kaboom.



 wrmsrl(MSR_IA32_PASID, current->mm->pasid);


This needs the actual sane fpstate writing helper -- see other email.


Re: [PATCH 4/8] x86/traps: Demand-populate PASID MSR via #GP

2021-09-29 Thread Andy Lutomirski

On 9/28/21 16:10, Luck, Tony wrote:

Moving beyond pseudo-code and into compiles-but-probably-broken-code.


The intent of the functions below is that Fenghua should be able to
do:

void fpu__pasid_write(u32 pasid)
{
    u64 msr_val = pasid | MSR_IA32_PASID_VALID;
    struct ia32_pasid_state *addr;

    addr = begin_update_one_xsave_feature(current, XFEATURE_PASID, true);
    addr->pasid = msr_val;
    finish_update_one_xsave_feature(current);
}



This gets gnarly because we would presumably like to optimize the case 
where we can do the update directly in registers.  I wonder if we can do 
it with a bit of macro magic in a somewhat generic way:


typedef u32 fpu__pasid_type;

static inline void fpu__set_pasid_in_register(const u32 *value)
{
    wrmsr(...);
}

#define DEFINE_FPU_HELPER(name) \
static inline void fpu__set_##name(const fpu__##name##_type *val) \
{ \
    fpregs_lock(); \
    if (should write in memory) { \
        ->xfeatures |= XFEATURE_##name; \
        ptr = get_xsave_addr(...); \
        memcpy(ptr, val, sizeof(*val)); \
        __fpu_invalidate_fpregs_state(...); \
    } else { \
        fpu__set_##name##_in_register(val); \
    } \
}
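
Spelling out the "should write in memory" placeholder for the PASID case, one
plausible expansion, consistent with the rule that you either write the live
registers or write memory and then invalidate any register copy, would be
(sketch, not the posted code):

static inline void fpu__set_pasid(const u32 *val)
{
    struct xregs_state *xsave = &current->thread.fpu.state.xsave;

    fpregs_lock();
    if (!test_thread_flag(TIF_NEED_FPU_LOAD)) {
        /* Registers are the one and only source of truth. */
        fpu__set_pasid_in_register(val);
    } else {
        /*
         * Memory is the source of truth: write it, then make sure no
         * stale register copy can be reused on the way back to user.
         */
        xsave->header.xfeatures |= XFEATURE_MASK_PASID;
        memcpy(get_xsave_addr(xsave, XFEATURE_PASID), val, sizeof(*val));
        __fpu_invalidate_fpregs_state(&current->thread.fpu);
    }
    fpregs_unlock();
}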


Re: [PATCH 5/8] x86/mmu: Add mm-based PASID refcounting

2021-09-24 Thread Andy Lutomirski



On Fri, Sep 24, 2021, at 9:12 AM, Luck, Tony wrote:
> On Fri, Sep 24, 2021 at 03:18:12PM +0200, Thomas Gleixner wrote:
>> On Thu, Sep 23 2021 at 19:48, Thomas Gleixner wrote:
>> > On Thu, Sep 23 2021 at 09:40, Tony Luck wrote:
>> >
>> > fpu_write_task_pasid() can just grab the pasid from current->mm->pasid
>> > and be done with it.
>> >
>> > The task exit code can just call iommu_pasid_put_task_ref() from the
>> > generic code and not from within x86.
>> 
>> But OTOH why do you need a per task reference count on the PASID at all?
>> 
>> The PASID is fundamentaly tied to the mm and the mm can't go away before
>> the threads have gone away unless this magically changed after I checked
>> that ~20 years ago.
>
> It would be possible to avoid a per-task reference to the PASID by
> taking an extra reference when mm->pasid is first allocated using
> the CONFIG_PASID_TASK_REFS you proposed yesterday to define a function
> to take the extra reference, and another to drop it when the mm is
> finally freed ... with stubs to do nothing on architectures that
> don't create per-task PASID context.
>
> This solution works, but is functionally different from Fenghua's
> proposal for this case:
>
>   Process clones a task
>   task binds a device
>   task accesses device using PASID
>   task unbinds device
>   task exits (but process lives on)
>
> Fenghua will free the PASID because the reference count goes
> back to zero. The "take an extra reference and keep until the
> mm is freed" option would needlessly hold onto the PASID.
>
> This seems like an odd usage case ... even if it does exist, a process
> that does this may spawn another task that does the same thing.
>
> If you think it is sufficiently simpler to use the "extra reference"
> option (invoking the "perfect is the enemy of good enough" rule) then we
> can change.

I think the perfect and the good are a bit confused here. If we go for "good", 
then we have an mm owning a PASID for its entire lifetime.  If we want 
"perfect", then we should actually do it right: teach the kernel to update an 
entire mm's PASID setting all at once.  This isn't *that* hard -- it involves 
two things:

1. The context switch code needs to resync PASID.  Unfortunately, this adds 
some overhead to every context switch, although a static_branch could minimize 
it for non-PASID users.
2. A change to an mm's PASID needs to send an IPI, but that IPI can't touch FPU 
state.  So instead the IPI should use task_work_add() to make sure PASID gets 
resynced.

And there is still no per-task refcounting.

After all, the not so perfect attempt at perfect in this patch set won't 
actually work if a thread pool is involved.
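
Concretely, step 2 above could be wired up with task_work so that the IPI
itself never touches FPU state. A rough sketch follows, where the work-item
field and the requeue guard are assumptions, fpu__pasid_write() is the helper
from the patch under discussion, and the CLONE_VM-without-CLONE_THREAD caveat
raised elsewhere in the thread is ignored:

/* IPI callback: must not touch FPU state, so defer to task_work. */
static void pasid_resync_ipi(void *info)
{
    struct mm_struct *mm = info;

    /* A real version must avoid queueing the same callback twice. */
    if (current->mm == mm)
        task_work_add(current, &current->pasid_resync_work, TWA_RESUME);
}

/* Runs in task context on the way back to user mode. */
static void pasid_resync_fn(struct callback_head *head)
{
    fpu__pasid_write(current->mm->pasid);  /* we own our FPU state here */
}

static void set_mm_pasid(struct mm_struct *mm, u32 pasid)
{
    WRITE_ONCE(mm->pasid, pasid);
    /* Poke tasks currently running on other CPUs to resync. */
    on_each_cpu_mask(mm_cpumask(mm), pasid_resync_ipi, mm, true);
}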


Re: [PATCH 5/8] x86/mmu: Add mm-based PASID refcounting

2021-09-23 Thread Andy Lutomirski



On Thu, Sep 23, 2021, at 4:22 PM, Luck, Tony wrote:
> On Thu, Sep 23, 2021 at 04:09:18PM -0700, Andy Lutomirski wrote:
>> On Mon, Sep 20, 2021, at 12:23 PM, Fenghua Yu wrote:
>
>> I think this is unnecessarily complicated because it's buying in to the
>> existing ISA misconception that PASID has anything to do with a task.
>> A PASID belongs to an mm, full stop.  Now the ISA is nasty and we have
>> tasks that have *noticed* that their mm has a PASID and tasks that have
>> not noticed this fact, but that should be irrelevant to essentially
>> everything except the fault handler.
>> 
>> So just refcount the thing the obvious way: take a reference when you
>> stick the PASID in the mm_struct and drop the reference in __mmdrop().
>> Problem solved.  You could probably drop it more aggressively in
>> __mmput(), and the comment explaining why is left as an exercise to the
>> reader -- if a kernel thread starts doing ENQCMD, we have worse things
>> to worry about :)
>
> That doesn't match well with the non-x86 usage of PASIDs. The code there
> bumps the reference count on each device bind, and decrements on each
> device unbind.

Can you elaborate on how that works?  Is there an architecture where there is a 
bona fide per task PASID?

>
> If we don't keep a reference count for each task that has IA32_PASID
> set up we could have this sequence
>
> 1) Process binds to a PASID capable device

Okay, so the mm has that PASID set up and a reference is taken.

> 2) Task uses ENQCMD, so PASID MSR is set up.

Yep.

> 3) Process unbinds the device, reference count on PASID
>goes to zero. PASID is freed. PASID is reallocated to
>another task.

It had better not.  We had an entire phone call in which we agreed that the 
entire lazy-MSR-setup approach only makes any sense if everyone pinky swears 
that an mm will *never* change its PASID once it has a PASID.

> 4) Task from step #2 uses ENQCMD to submit a descriptor
>and device now processes virtual addresses based on mappings
>in the new task.
>
> Now you might say that at step 3 we need to hunt down all the
> tasks that have PASID enabled and disabled ... but that's the
> same painful code that we avoided when we said that we would
> not make Linux hand out a PASID to all existing tasks in a
> process on the first bind operation.
>

Exactly.  Which means that the mm ought to pin that PASID for as long as it 
exists.  What am I missing?

Sure, one can invent a situation in which you start two threads, and one of 
those threads binds a device, does ENQCMD, unbinds the device, and exits.  Then 
the other thread *in the same mm* binds another device and gets a new PASID.  
And it all works.  But I really don't think this special case is worth 
optimizing for.

> -Tony


Re: [PATCH 4/8] x86/traps: Demand-populate PASID MSR via #GP

2021-09-23 Thread Andy Lutomirski



On Thu, Sep 23, 2021, at 7:56 PM, Fenghua Yu wrote:
> Hi, Andy,
>
> On Thu, Sep 23, 2021 at 04:17:05PM -0700, Andy Lutomirski wrote:
>> On Mon, Sep 20, 2021, at 12:23 PM, Fenghua Yu wrote:
>> > ENQCMD requires the IA32_PASID MSR has a valid PASID value which was
>> > allocated to the process during bind. The MSR could be populated eagerly
>> > by an IPI after the PASID is allocated in bind. But the method was
>> > disabled in commit 9bfecd058339 ("x86/cpufeatures: Force disable
>> > X86_FEATURE_ENQCMD and remove update_pasid()") due to locking and other
>> > issues.
>> >
>> > Since the MSR was cleared in fork()/clone(), the first ENQCMD will
>> > generate a #GP fault. The #GP fault handler will initialize the MSR
>> > if a PASID has been allocated for this process.
>> >
>> > The lazy enabling of the PASID MSR in the #GP handler is not an elegant
>> > solution. But it has the least complexity that fits with h/w behavior.
>> >
>> > Signed-off-by: Fenghua Yu 
>> > Reviewed-by: Tony Luck 
>> > ---
>> >  arch/x86/include/asm/fpu/api.h |  6 
>> >  arch/x86/include/asm/iommu.h   |  2 ++
>> >  arch/x86/kernel/fpu/xstate.c   | 59 ++
>> >  arch/x86/kernel/traps.c| 12 +++
>> >  drivers/iommu/intel/svm.c  | 32 ++
>> >  5 files changed, 111 insertions(+)
>> >
>> > diff --git a/arch/x86/include/asm/fpu/api.h 
>> > b/arch/x86/include/asm/fpu/api.h
>> > index ca4d0dee1ecd..f146849e5c8c 100644
>> > --- a/arch/x86/include/asm/fpu/api.h
>> > +++ b/arch/x86/include/asm/fpu/api.h
>> > @@ -106,4 +106,10 @@ extern int cpu_has_xfeatures(u64 xfeatures_mask, 
>> > const char **feature_name);
>> >   */
>> >  #define PASID_DISABLED 0
>> > 
>> > +#ifdef CONFIG_INTEL_IOMMU_SVM
>> > +void fpu__pasid_write(u32 pasid);
>> > +#else
>> > +static inline void fpu__pasid_write(u32 pasid) { }
>> > +#endif
>> > +
>> >  #endif /* _ASM_X86_FPU_API_H */
>> > diff --git a/arch/x86/include/asm/iommu.h b/arch/x86/include/asm/iommu.h
>> > index bf1ed2ddc74b..9c4bf9b0702f 100644
>> > --- a/arch/x86/include/asm/iommu.h
>> > +++ b/arch/x86/include/asm/iommu.h
>> > @@ -26,4 +26,6 @@ arch_rmrr_sanity_check(struct acpi_dmar_reserved_memory 
>> > *rmrr)
>> >return -EINVAL;
>> >  }
>> > 
>> > +bool __fixup_pasid_exception(void);
>> > +
>> >  #endif /* _ASM_X86_IOMMU_H */
>> > diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c
>> > index c8def1b7f8fb..8a89b2cecd77 100644
>> > --- a/arch/x86/kernel/fpu/xstate.c
>> > +++ b/arch/x86/kernel/fpu/xstate.c
>> > @@ -1289,3 +1289,62 @@ int proc_pid_arch_status(struct seq_file *m, 
>> > struct pid_namespace *ns,
>> >return 0;
>> >  }
>> >  #endif /* CONFIG_PROC_PID_ARCH_STATUS */
>> > +
>> > +#ifdef CONFIG_INTEL_IOMMU_SVM
>> > +/**
>> > + * fpu__pasid_write - Write the current task's PASID state/MSR.
>> > + * @pasid: the PASID
>> > + *
>> > + * The PASID is written to the IA32_PASID MSR directly if the MSR is 
>> > active.
>> > + * Otherwise it's written to the PASID. The IA32_PASID MSR should 
>> > contain
>> > + * the PASID after returning to the user.
>> > + *
>> > + * This is called only when ENQCMD is enabled.
>> > + */
>> > +void fpu__pasid_write(u32 pasid)
>> > +{
>> > +  struct xregs_state *xsave = &current->thread.fpu.state.xsave;
>> > +  u64 msr_val = pasid | MSR_IA32_PASID_VALID;
>> > +  struct fpu *fpu = &current->thread.fpu;
>> > +
>> > +  /*
>> > +   * ENQCMD always uses the compacted XSAVE format. Ensure the buffer
>> > +   * has space for the PASID.
>> > +   */
>> > +  BUG_ON(!(xsave->header.xcomp_bv & XFEATURE_MASK_PASID));
>> > +
>> > +  fpregs_lock();
>> > +
>> > +  /*
>> > +   * If the task's FPU doesn't need to be loaded or is valid, directly
>> > +   * write the IA32_PASID MSR. Otherwise, write the PASID state and
>> > +   * the MSR will be loaded from the PASID state before returning to
>> > +   * the user.
>> > +   */
>> > +  if (!test_thread_flag(TIF_NEED_FPU_LOAD) ||
>> > +  fpregs_state_valid(fpu, smp_processor_id())) {
>> > +  wrmsrl(MSR_IA32_PASID, msr_val);
>> 

Re: [PATCH 4/8] x86/traps: Demand-populate PASID MSR via #GP

2021-09-23 Thread Andy Lutomirski



On Mon, Sep 20, 2021, at 12:23 PM, Fenghua Yu wrote:
> ENQCMD requires the IA32_PASID MSR has a valid PASID value which was
> allocated to the process during bind. The MSR could be populated eagerly
> by an IPI after the PASID is allocated in bind. But the method was
> disabled in commit 9bfecd058339 ("x86/cpufeatures: Force disable
> X86_FEATURE_ENQCMD and remove update_pasid()") due to locking and other
> issues.
>
> Since the MSR was cleared in fork()/clone(), the first ENQCMD will
> generate a #GP fault. The #GP fault handler will initialize the MSR
> if a PASID has been allocated for this process.
>
> The lazy enabling of the PASID MSR in the #GP handler is not an elegant
> solution. But it has the least complexity that fits with h/w behavior.
>
> Signed-off-by: Fenghua Yu 
> Reviewed-by: Tony Luck 
> ---
>  arch/x86/include/asm/fpu/api.h |  6 
>  arch/x86/include/asm/iommu.h   |  2 ++
>  arch/x86/kernel/fpu/xstate.c   | 59 ++
>  arch/x86/kernel/traps.c| 12 +++
>  drivers/iommu/intel/svm.c  | 32 ++
>  5 files changed, 111 insertions(+)
>
> diff --git a/arch/x86/include/asm/fpu/api.h 
> b/arch/x86/include/asm/fpu/api.h
> index ca4d0dee1ecd..f146849e5c8c 100644
> --- a/arch/x86/include/asm/fpu/api.h
> +++ b/arch/x86/include/asm/fpu/api.h
> @@ -106,4 +106,10 @@ extern int cpu_has_xfeatures(u64 xfeatures_mask, 
> const char **feature_name);
>   */
>  #define PASID_DISABLED   0
> 
> +#ifdef CONFIG_INTEL_IOMMU_SVM
> +void fpu__pasid_write(u32 pasid);
> +#else
> +static inline void fpu__pasid_write(u32 pasid) { }
> +#endif
> +
>  #endif /* _ASM_X86_FPU_API_H */
> diff --git a/arch/x86/include/asm/iommu.h b/arch/x86/include/asm/iommu.h
> index bf1ed2ddc74b..9c4bf9b0702f 100644
> --- a/arch/x86/include/asm/iommu.h
> +++ b/arch/x86/include/asm/iommu.h
> @@ -26,4 +26,6 @@ arch_rmrr_sanity_check(struct acpi_dmar_reserved_memory 
> *rmrr)
>   return -EINVAL;
>  }
> 
> +bool __fixup_pasid_exception(void);
> +
>  #endif /* _ASM_X86_IOMMU_H */
> diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c
> index c8def1b7f8fb..8a89b2cecd77 100644
> --- a/arch/x86/kernel/fpu/xstate.c
> +++ b/arch/x86/kernel/fpu/xstate.c
> @@ -1289,3 +1289,62 @@ int proc_pid_arch_status(struct seq_file *m, 
> struct pid_namespace *ns,
>   return 0;
>  }
>  #endif /* CONFIG_PROC_PID_ARCH_STATUS */
> +
> +#ifdef CONFIG_INTEL_IOMMU_SVM
> +/**
> + * fpu__pasid_write - Write the current task's PASID state/MSR.
> + * @pasid:   the PASID
> + *
> + * The PASID is written to the IA32_PASID MSR directly if the MSR is 
> active.
> + * Otherwise it's written to the PASID. The IA32_PASID MSR should 
> contain
> + * the PASID after returning to the user.
> + *
> + * This is called only when ENQCMD is enabled.
> + */
> +void fpu__pasid_write(u32 pasid)
> +{
> + struct xregs_state *xsave = &current->thread.fpu.state.xsave;
> + u64 msr_val = pasid | MSR_IA32_PASID_VALID;
> + struct fpu *fpu = &current->thread.fpu;
> +
> + /*
> +  * ENQCMD always uses the compacted XSAVE format. Ensure the buffer
> +  * has space for the PASID.
> +  */
> + BUG_ON(!(xsave->header.xcomp_bv & XFEATURE_MASK_PASID));
> +
> + fpregs_lock();
> +
> + /*
> +  * If the task's FPU doesn't need to be loaded or is valid, directly
> +  * write the IA32_PASID MSR. Otherwise, write the PASID state and
> +  * the MSR will be loaded from the PASID state before returning to
> +  * the user.
> +  */
> + if (!test_thread_flag(TIF_NEED_FPU_LOAD) ||
> + fpregs_state_valid(fpu, smp_processor_id())) {
> + wrmsrl(MSR_IA32_PASID, msr_val);

Let me try to decode this.

If the current task's FPU state is live or if the state is in memory but the 
CPU regs just happen to match the copy in memory, then write the MSR.  Else 
write the value to memory.

This is wrong.  If !TIF_NEED_FPU_LOAD && fpregs_state_valid, you MUST NOT 
MODIFY FPU STATE.  This is not negotiable -- you will break coherence between 
CPU regs and the memory image.  The way you modify the current task's state is 
either you modify it in CPU regs (if the kernel knows that the CPU regs are the 
one and only source of truth) OR you modify it in memory and invalidate any 
preserved copies (by zapping last_cpu). 

In any event, that particular bit of logic really doesn't belong in here -- it 
belongs in some helper that gets it right, once.

> +
> +/*
> + * Try to figure out if there is a PASID MSR value to propagate to the
> + * thread taking the #GP.
> + */
> +bool __fixup_pasid_exception(void)
> +{
> + u32 pasid;
> +
> + /*
> +  * This function is called only when this #GP was triggered from user
> +  * space. So the mm cannot be NULL.
> +  */
> + pasid = current->mm->pasid;
> +
> + /* If no PASID is allocated, there is nothing to propagate. */
> + if (pasid == PASID_DISABLED)
> + return false;
> +
> + /*
> +  * If the 

Re: [PATCH 5/8] x86/mmu: Add mm-based PASID refcounting

2021-09-23 Thread Andy Lutomirski
On Mon, Sep 20, 2021, at 12:23 PM, Fenghua Yu wrote:
> PASIDs are fundamentally hardware resources in a shared address space.
> There is a limited number of them for use with ENQCMD on a shared workqueue.
> They must be shared and managed. They can not, for instance, be
> statically allocated to processes.
>
> Free PASID eagerly by sending IPIs in unbind was disabled due to locking
> and other issues in commit 9bfecd058339 ("x86/cpufeatures: Force disable
> X86_FEATURE_ENQCMD and remove update_pasid()").
>
> Lazy PASID free is implemented in order to re-enable the ENQCMD feature.
> PASIDs are currently reference counted and are centered around device
> usage. To support lazy PASID free, reference counts are tracked in the
> following scenarios:
>
> 1. The PASID's reference count is initialized as 1 when the PASID is first
>allocated in bind. This is already implemented.
> 2. A reference is taken when a device is bound to the mm and dropped
>when the device is unbound from the mm. This reference tracks device
>usage of the PASID. This is already implemented.
> 3. A reference is taken when a task's IA32_PASID MSR is initialized in
>#GP fix up and dropped when the task exits. This reference tracks
>the task usage of the PASID. It is implemented here.

I think this is unnecessarily complicated because it's buying in to the 
existing ISA misconception that PASID has anything to do with a task.  A PASID 
belongs to an mm, full stop.  Now the ISA is nasty and we have tasks that have 
*noticed* that their mm has a PASID and tasks that have not noticed this fact, 
but that should be irrelevant to essentially everything except the fault 
handler.

So just refcount the thing the obvious way: take a reference when you stick the 
PASID in the mm_struct and drop the reference in __mmdrop().  Problem solved.  
You could probably drop it more aggressively in __mmput(), and the comment 
explaining why is left as an exercise to the reader -- if a kernel thread 
starts doing ENQCMD, we have worse things to worry about :)

--Andy
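
In code terms that amounts to something like this (a sketch; the ioasid calls
are the obvious candidates, but treat the exact API and the set/range
arguments as assumptions):

    /* bind path: the first allocation is the mm's own reference */
    if (!mm->pasid)
        mm->pasid = ioasid_alloc(pasid_set, min, max, mm);  /* refcount == 1 */

    /* in __mmdrop() (or __mmput(), per the note above): */
    if (mm->pasid)
        ioasid_put(mm->pasid);  /* drop the mm's reference */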


Re: [PATCH v2 12/12] iommu: Do not allow IOMMU passthrough with Secure Launch

2021-06-21 Thread Andy Lutomirski
On Mon, Jun 21, 2021 at 10:51 AM Ross Philipson wrote:
>
> On 6/18/21 2:32 PM, Robin Murphy wrote:
> > On 2021-06-18 17:12, Ross Philipson wrote:
> >> The IOMMU should always be set to default translated type after
> >> the PMRs are disabled to protect the MLE from DMA.
> >>
> >> Signed-off-by: Ross Philipson 
> >> ---
> >>   drivers/iommu/intel/iommu.c | 5 +
> >>   drivers/iommu/iommu.c   | 6 +-
> >>   2 files changed, 10 insertions(+), 1 deletion(-)
> >>
> >> diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
> >> index be35284..4f0256d 100644
> >> --- a/drivers/iommu/intel/iommu.c
> >> +++ b/drivers/iommu/intel/iommu.c
> >> @@ -41,6 +41,7 @@
> >>   #include 
> >>   #include 
> >>   #include 
> >> +#include 
> >>   #include 
> >>   #include 
> >>   #include 
> >> @@ -2877,6 +2878,10 @@ static bool device_is_rmrr_locked(struct device
> >> *dev)
> >>*/
> >>   static int device_def_domain_type(struct device *dev)
> >>   {
> >> +/* Do not allow identity domain when Secure Launch is configured */
> >> +if (slaunch_get_flags() & SL_FLAG_ACTIVE)
> >> +return IOMMU_DOMAIN_DMA;
> >
> > Is this specific to Intel? It seems like it could easily be done
> > commonly like the check for untrusted external devices.
>
> It is currently Intel only but that will change. I will look into what
> you suggest.
>
> >
> >> +
> >>   if (dev_is_pci(dev)) {
> >>   struct pci_dev *pdev = to_pci_dev(dev);
> >>   diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
> >> index 808ab70d..d49b7dd 100644
> >> --- a/drivers/iommu/iommu.c
> >> +++ b/drivers/iommu/iommu.c
> >> @@ -23,6 +23,7 @@
> >>   #include 
> >>   #include 
> >>   #include 
> >> +#include 
> >>   #include 
> >> static struct kset *iommu_group_kset;
> >> @@ -2761,7 +2762,10 @@ void iommu_set_default_passthrough(bool cmd_line)
> >>   {
> >>   if (cmd_line)
> >>   iommu_cmd_line |= IOMMU_CMD_LINE_DMA_API;
> >> -iommu_def_domain_type = IOMMU_DOMAIN_IDENTITY;
> >> +
> >> +/* Do not allow identity domain when Secure Launch is configured */
> >> +if (!(slaunch_get_flags() & SL_FLAG_ACTIVE))
> >> +iommu_def_domain_type = IOMMU_DOMAIN_IDENTITY;
> >
> > Quietly ignoring the setting and possibly leaving iommu_def_domain_type
> > uninitialised (note that 0 is not actually a usable type) doesn't seem
> > great. AFAICS this probably warrants similar treatment to the
>
> Ok so I guess it would be better to set it to IOMMU_DOMAIN_DMA event
> though passthrough was requested. Or perhaps something more is needed here?
>
> > mem_encrypt_active() case - there doesn't seem a great deal of value in
> > trying to save users from themselves if they care about measured boot
> > yet explicitly pass options which may compromise measured boot. If you
> > really want to go down that route there's at least the sysfs interface
> > you'd need to nobble as well, not to mention the various ways of
> > completely disabling IOMMUs...
>
> Doing a secure launch with the kernel is not a general purpose user use
> case. A lot of work is done to secure the environment. Allowing
> passthrough mode would leave the secure launch kernel exposed to DMA. I
> think what we are trying to do here is what we intend though there may
> be a better way or perhaps it is incomplete as you suggest.
>

I don't really like all these special cases.  Generically, what you're
trying to do is (AFAICT) to get the kernel to run in a mode in which
it does its best not to trust attached devices.  Nothing about this is
specific to Secure Launch.  There are plenty of scenarios in which
this is the case:

 - Virtual devices in a VM host outside the TCB, e.g. VDUSE, Xen
device domains (did I get the name right), whatever tricks QEMU has,
etc.
 - SRTM / DRTM technologies (including but not limited to Secure
Launch -- plain old Secure Boot can work like this too).
 - Secure guest technologies, including but not limited to TDX and SEV.
 - Any computer with a USB-C port or other external DMA-capable port.
 - Regular computers in which the admin wants to enable this mode for
whatever reason.

Can you folks all please agree on a coordinated way for a Linux kernel
to configure itself appropriately?  Or to be configured via initramfs,
boot option, or some other trusted source of configuration supplied at
boot time?  We don't need a whole bunch of if (TDX), if (SEV), if
(secure launch), if (I have a USB-C port with PCIe exposed), if
(running on Xen), and similar checks all over the place.


Re: [PATCH] x86/cpufeatures: Force disable X86_FEATURE_ENQCMD and remove update_pasid()

2021-06-09 Thread Andy Lutomirski
On 6/9/21 10:32 AM, Luck, Tony wrote:
>>> I've told Fenghua to dig out the previous iteration of this patch where
>>> the plan was to lazily fix the PASID_MSR in other threads in the #GP
>>> handler.
>>
>> Blech.  Also this won't work for other PASID-like features.
>>
>> I have a half-written patch to fix this up for real.  Stay tuned.
> 
> Andy: Any update on this?
> 
> -Tony
> 

Let me try to merge my pile with tglx's pile and come up with something
halfway sane.


Re: [PATCH v1 1/8] virtio: Force only split mode with protected guest

2021-06-03 Thread Andy Lutomirski
On 6/3/21 4:32 PM, Andi Kleen wrote:
> 
>> We do not need an increasing pile of kludges
> 
> Do you mean disabling features is a kludge?
> 
> If yes I disagree with that characterization.
> 
> 
>> to make TDX and SEV “secure”.  We need the actual loaded driver to be
>> secure.  The virtio architecture is full of legacy nonsense,
>> and there is no good reason for SEV and TDX to be a giant special case.
> 
> I don't know where you see a "giant special case". Except for the
> limited feature negotiation all the changes are common, and the
> disabling of features (which is not new BTW, but already done e.g. with
> forcing DMA API in some cases) can be of course used by all these other
> technologies too. But it just cannot be done by default for everything
> because it would break compatibility. So every technology with such
> requirements has to explicitly opt-in.
> 
> 
>>
>> As I said before, real PCIe (Thunderbolt/USB-C or anything else) has
>> the exact same problem.  The fact that TDX has encrypted memory is, at
>> best, a poor proxy for the actual condition.  The actual condition is
>> that the host does not trust the device to implement the virtio
>> protocol correctly.
> 
> Right they can do similar limitations of feature sets. But again it
> cannot be default.

Let me try again.

For most Linux drivers, a report that a misbehaving device can corrupt
host memory is a bug, not a feature.  If a USB device can corrupt kernel
memory, that's a serious bug.  If a USB-C device can corrupt kernel
memory, that's also a serious bug, although, sadly, we probably have
lots of these bugs.  If a Firewire device can corrupt kernel memory,
news at 11.  If a Bluetooth or WiFi peer can corrupt kernel memory,
people write sonnets about it and give it clever names.  Why is virtio
special?

If, for some reason, the virtio driver cannot be fixed so that it is
secure and compatible [1], then I think that the limited cases that are
secure should be accessible to anyone, with or without TDX.  Have a
virtio.secure_mode module option or a udev-controllable parameter or an
alternative driver name or *something*.  An alternative driver name
would allow userspace to prevent the insecure mode from auto-binding to
devices.  And make whatever system configures encrypted guests for
security use this mode.  (Linux is not going to be magically secure just
by booting it in TDX.  There's a whole process of unsealing or remote
attestation, something needs to prevent the hypervisor from connecting a
virtual keyboard and typing init=/bin/bash, something needs to provision
an SSH key, etc.)
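
(Purely to illustrate the opt-in idea -- virtio has no such parameter today,
and the names below are made up:)

/* Hypothetical opt-in knob, not an existing virtio_ring parameter. */
static bool secure_mode;
module_param(secure_mode, bool, 0444);
MODULE_PARM_DESC(secure_mode,
                 "Only accept ring layouts/features hardened against a malicious device");

static bool vring_mode_allowed(struct virtio_device *vdev)
{
    if (!secure_mode)
        return true;
    /* e.g. split rings only, no indirect descriptors, DMA API always on */
    return !virtio_has_feature(vdev, VIRTIO_F_RING_PACKED) &&
           !virtio_has_feature(vdev, VIRTIO_RING_F_INDIRECT_DESC);
}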

In my opinion, it is not so great to identify bugs in the driver and
then say that they're only being fixed for TDX and SEV.

Keep in mind that, as I understand it, there is nothing virt specific
about virtio.  There are real physical devices that speak virtio.

[1] The DMA quirk is nasty.  Fortunately, it's the only case I'm aware
of in which the virtio driver genuinely cannot be made secure and
compatible at the same time.  Also, fortunately, most real deployments
except on powerpc work just fine with the DMA quirk unquirked.

> 
> 
>>
>>>
>>> TDX and SEV use the arch hook to enforce DMA API, so that part is also
>>> solved.
>>>
>> Can you point me to the code you’re referring to?
> 
> See 4/8 in this patch kit. It uses an existing hook which is already
> used in tree by s390.

This one:

int arch_has_restricted_virtio_memory_access(void)
+{
+   return is_tdx_guest();
+}

I'm looking at a fairly recent kernel, and I don't see anything for s390
wired up in vring_use_dma_api.


Re: [PATCH v1 1/8] virtio: Force only split mode with protected guest

2021-06-03 Thread Andy Lutomirski


On Thu, Jun 3, 2021, at 12:53 PM, Andi Kleen wrote:
> 
> > Tell that to every crypto downgrade attack ever.
> 
> That's exactly what this patch addresses.
> 
> >
> > I see two credible solutions:
> >
> > 1. Actually harden the virtio driver.
> That's exactly what this patchkit, and the alternative approaches, like 
> Jason's, are doing.
> >
> > 2. Have a new virtio-modern driver and use it for modern use cases. Maybe 
> > rename the old driver virtio-legacy or virtio-insecure.  They can share 
> > code.
> 
> In most use cases the legacy driver is not insecure because there is no 
> memory protection anyways.
> 
> Yes maybe such a split would be a good idea for maintenance and maybe 
> performance reasons, but at least from the security perspective I don't 
> see any need for it.


Please reread my email.

We do not need an increasing pile of kludges to make TDX and SEV “secure”.  We 
need the actual loaded driver to be secure.  The virtio architecture is full of 
legacy nonsense, and there is no good reason for SEV and TDX to be a giant 
special case.

As I said before, real PCIe (Thunderbolt/USB-C or anything else) has the exact 
same problem.  The fact that TDX has encrypted memory is, at best, a poor proxy 
for the actual condition.  The actual condition is that the host does not trust 
the device to implement the virtio protocol correctly.

> 
> >
> > Another snag you may hit: virtio’s heuristic for whether to use proper DMA 
> > ops or to bypass them is a giant kludge. I’m very slightly optimistic that 
> > getting the heuristic wrong will make the driver fail to operate but won’t 
> > allow the host to take over the guest, but I’m not really convinced. And I 
> > wrote that code!  A virtio-modern mode probably should not have a 
> > heuristic, and the various iommu-bypassing modes should be fixed to work at 
> > the bus level, not the device level
> 
> TDX and SEV use the arch hook to enforce DMA API, so that part is also 
> solved.
> 

Can you point me to the code you’re referring to?

> 
> -Andi
> 
> 

Re: [PATCH v1 1/8] virtio: Force only split mode with protected guest

2021-06-03 Thread Andy Lutomirski


On Thu, Jun 3, 2021, at 11:00 AM, Andi Kleen wrote:
> 
> On 6/3/2021 10:33 AM, Andy Lutomirski wrote:
> > On 6/2/21 5:41 PM, Andi Kleen wrote:
> >> Only allow split mode when in a protected guest. Follow-on
> >> patches harden the split mode code paths, and we don't want
> >> a malicious host to force anything else. Also disallow
> >> indirect mode for similar reasons.
> > I read this as "the virtio driver is buggy.  Let's disable most of the
> > buggy code in one special case in which we need a driver without bugs.
> > In all the other cases (e.g. hardware virtio device connected over
> > USB-C), driver bugs are still allowed."
> 
> My understanding is most of the other modes (except for split with 
> separate descriptors) are obsolete and just there for compatibility. As 
> long as they're deprecated they won't harm anyone.
> 
>

Tell that to every crypto downgrade attack ever.

I see two credible solutions:

1. Actually harden the virtio driver.

2. Have a new virtio-modern driver and use it for modern use cases. Maybe 
rename the old driver virtio-legacy or virtio-insecure.  They can share code.

Another snag you may hit: virtio’s heuristic for whether to use proper DMA ops 
or to bypass them is a giant kludge. I’m very slightly optimistic that getting 
the heuristic wrong will make the driver fail to operate but won’t allow the 
host to take over the guest, but I’m not really convinced. And I wrote that 
code!  A virtio-modern mode probably should not have a heuristic, and the 
various iommu-bypassing modes should be fixed to work at the bus level, not the 
device level.

Re: [PATCH v1 1/8] virtio: Force only split mode with protected guest

2021-06-03 Thread Andy Lutomirski
On 6/2/21 5:41 PM, Andi Kleen wrote:
> Only allow split mode when in a protected guest. Follow-on
> patches harden the split mode code paths, and we don't want
> a malicious host to force anything else. Also disallow
> indirect mode for similar reasons.

I read this as "the virtio driver is buggy.  Let's disable most of the
buggy code in one special case in which we need a driver without bugs.
In all the other cases (e.g. hardware virtio device connected over
USB-C), driver bugs are still allowed."

Can we just fix the driver without special cases?

--Andy


Re: [PATCH] x86/cpufeatures: Force disable X86_FEATURE_ENQCMD and remove update_pasid()

2021-06-03 Thread Andy Lutomirski
On 6/2/21 1:37 PM, Luck, Tony wrote:
>>> ... so on a PASID system, your trivial reproducer would theoretically
>>> fire the same way and corrupt FPU state just as well.
>>
>> This is worse and you can't selftest it because the IPI can just hit in
>> the middle of _any_ FPU state operation and corrupt state.
> 
> That sounds like we should abandon the "IPI all the other threads
> to force enable the PASID for them" approach. It would just be a
> nightmare of papering over cracks when the IPI was delivered at
> some inconvenient moment when the recipient was in the middle
> of touching xsave state.
> 
> I've told Fenghua to dig out the previous iteration of this patch where
> the plan was to lazily fix the PASID_MSR in other threads in the #GP
> handler.

Blech.  Also this won't work for other PASID-like features.

I have a half-written patch to fix this up for real.  Stay tuned.

> Seems like a better direction than trying to fix the IPI method. The 
> virtualization
> folks will like this way more because IPI in guest causes a couple of VMEXIT
> so is somewhat expensive.

It happens at most once per PASID-using process.



Re: [RFC PATCH v3 1/2] mempinfd: Add new syscall to provide memory pin

2021-02-07 Thread Andy Lutomirski


> On Feb 7, 2021, at 12:31 AM, Zhou Wang  wrote:
> 
> SVA (shared virtual address) offers a way for a device to share a process's
> virtual address space safely, which makes user-space device driver coding
> more convenient. However, IO page faults may happen when doing DMA
> operations. As the latency of an IO page fault is relatively big, DMA
> performance will be affected severely when there are IO page faults.
> From a long-term view, DMA performance will not be stable.
> 
> In high-performance I/O cases, accelerators might want to perform
> I/O on memory without IO page faults, which can otherwise result in
> dramatically increased latency. Current memory-related APIs cannot
> achieve this requirement, e.g. mlock can only keep memory from being
> swapped to a backing device; page migration can still trigger IO page faults.
> 
> Various drivers working under traditional non-SVA mode are using
> their own specific ioctl to do pin. Such ioctl can be seen in v4l2,
> gpu, infiniband, media, vfio, etc. Drivers are usually doing dma
> mapping while doing pin.
> 
> But in SVA mode, pinning could be a common need which isn't necessarily
> bound to any driver, and DMA mapping isn't needed by drivers either,
> since devices use the CPU's virtual addresses. Thus, it is better
> to introduce a new common syscall for it.
> 
> This patch leverages the design of userfaultfd and adds mempinfd for
> pinning, to avoid messing up mm_struct. An fd is obtained via mempinfd,
> then user space can pin/unpin pages via ioctls on this fd; all pages
> pinned under one file will be unpinned when the file is released. Like
> pin page cases in other places, can_do_mlock is used to check permission
> and input parameters.


Can you document what the syscall does?

Userfaultfd is an fd because one program controls another.  Is mempinfd like 
this?

Re: [PATCH v6 12/12] x86/traps: Fix up invalid PASID

2020-08-03 Thread Andy Lutomirski



> On Aug 3, 2020, at 10:34 AM, Dave Hansen  wrote:
> 
> On 8/3/20 10:16 AM, Andy Lutomirski wrote:
>> - TILE: genuinely per-thread, but it's expensive so it's
>> lazy-loadable.  But the lazy-load mechanism reuses #NM, and it's not
>> fully disambiguated from the other use of #NM.  So it sort of works,
>> but it's gross.
> 
> For those playing along at home, there's a new whitepaper out from Intel
> about some new CPU features which are going to be fun:
> 
>> https://software.intel.com/content/dam/develop/public/us/en/documents/architecture-instruction-set-extensions-programming-reference.pdf
> 
> Which part were you worried about?  I thought it was fully disambiguated
> from this:
> 
>> When XFD causes an instruction to generate #NM, the processor loads
>> the IA32_XFD_ERR MSR to identify the disabled state component(s).
>> Specifically, the MSR is loaded with the logical AND of the IA32_XFD
>> MSR and the bitmap corresponding to the state components required by
>> the faulting instruction.
>> 
>> Device-not-available exceptions that are not due to XFD — those 
>> resulting from setting CR0.TS to 1 — do not modify the IA32_XFD_ERR
>> MSR.
> 
> So if you always make sure to *clear* IA32_XFD_ERR after handling an XFD
> exception, any #NM's with a clear IA32_XFD_ERR are from "legacy"
> CR0.TS=1.  Any bits set in IA32_XFD_ERR mean a new-style XFD exception.
> 
> Am I missing something?

I don’t think you’re missing anything, but this mechanism seems to be solidly 
in the category of “just barely works”, kind of like #DB and DR6, which also 
just barely works.

And this PASID vs #GP mess is just sad. We’re trying to engineer around an 
issue that has no need to exist in the first place.  For some reason we have 
two lazy-loading-fault mechanisms showing up in x86 in rapid succession, one of 
them is maybe 3/4-baked, and the other is totally separate and maybe 1/4 baked.

If ENQCMD instead raised #NM, this would be trivial. (And it even makes sense — 
the error is, literally, “an instruction tried to use something in XSTATE that 
isn’t initialized”.). Or if someone had noticed that, kind of like PKRU, PASID 
didn’t really belong in XSTATE, we wouldn’t have this mess.

Yes, obviously Linux can get all this stuff to work, but maybe Intel could 
aspire to design features that are straightforward to use well instead of 
merely possible to use well?
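
For reference, the disambiguation Dave describes above boils down to something
like this in the #NM handler (a sketch; the two handler functions are
hypothetical and the MSR constant follows the whitepaper's naming):

    u64 xfd_err;

    rdmsrl(MSR_IA32_XFD_ERR, xfd_err);

    if (xfd_err) {
        /* XFD-induced #NM: some xfeature is disabled for this task. */
        wrmsrl(MSR_IA32_XFD_ERR, 0);   /* keep legacy #NM unambiguous */
        handle_xfd_event(xfd_err);     /* hypothetical: allocate/enable the state */
        return;
    }

    /* xfd_err == 0: legacy CR0.TS-style device-not-available (#NM). */
    handle_legacy_fpu_trap();          /* hypothetical */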

Re: [PATCH v6 12/12] x86/traps: Fix up invalid PASID

2020-08-03 Thread Andy Lutomirski
On Mon, Aug 3, 2020 at 9:37 AM Dave Hansen  wrote:
>
> On 8/3/20 8:12 AM, Andy Lutomirski wrote:
> > I could easily be convinced that the PASID fixup is so trivial and so
> > obviously free of misfiring in a way that causes an infinite loop that
> > this code is fine.  But I think we first need to answer the bigger
> > question of why we're doing a lazy fixup in the first place.
>
> There was an (internal to Intel) implementation of this about a year ago
> that used smp_call_function_many() to force the MSR state into all
> threads of a process.  I took one look at it, decided there was a 0%
> chance of it actually functioning and recommended we find another way.
> While I'm sure it could be done more efficiently, the implementation I
> looked at took ~200 lines of code and comments.  It started to look too
> much like another instance of mm->cpumask for my comfort.

If I were implementing this, I would try making switch_mm_irqs_off()
do, roughly:

void load_mm_pasid(...) {
  if (cpu_feature_enabled(X86_FEATURE_ENQCMD))
tsk->xstate[offset] = READ_ONCE(next->context.pasid);
}

This costs one cache miss, although the cache line in question is
about to be read anyway.  It might be faster to, instead, do:

void load_mm_pasid(...) {
  u32 pasid = READ_ONCE(next->context.pasid);

  if (tsk->xstate[offset] != pasid)
    tsk->xstate[offset] = pasid;
}

so we don't dirty the cache line in the common case.  The actual
generated code ought to be pretty good -- surely the offset of PASID
in XSTATE is an entry in an array somewhere that can be found with a
single read, right?

The READ_ONCE is because this could race against a write to
context.pasid, so this code needs to be at the end of the function
where it's protected by mm_cpumask.  With all this done, the pasid
update is just on_each_cpu_mask(mm_cpumask(mm), load_mm_pasid, mm,
true).

This looks like maybe 20 lines of code.  As an added bonus, it lets us
free PASIDs early if we ever decide we want to.




May I take this opportunity to ask Intel to please put some real
thought into future pieces of CPU state?  Here's a summary of some
things we have:

- Normal extended state (FPU, XMM, etc): genuinely per thread and only
ever used explicitly.  Actually makes sense with XSAVE(C/S).

- PKRU: affects CPL0-originated memory accesses, so it needs to be
eagerly loaded in the kernel.  Does not make sense with XRSTOR(C/S),
but it's in there anyway.

- CR3: per-mm state.  Makes some sense in CR3, but it's tangled up
with CR4 in nasty ways.

- LDTR: per-mm on Linux and mostly obsolete everywhere.  In its own
register, so it's not a big deal.

- PASID: per-mm state (surely Intel always intended it to be per-mm,
since it's for shared _virtual memory_!).  But for some reason it's in
an MSR (which is slow), and it's cleverly, but not that cleverly,
accessible with XSAVES/XRSTORS.  Doesn't actually make sense.  Also,
PASID is lazy-loadable, but the mechanism for telling the kernel that
a lazy load is needed got flubbed.

- TILE: genuinely per-thread, but it's expensive so it's
lazy-loadable.  But the lazy-load mechanism reuses #NM, and it's not
fully disambiguated from the other use of #NM.  So it sort of works,
but it's gross.

- "KERNEL_GS_BASE", i.e. the shadow GS base.  This is logically
per-user-thread state, but it's only accessible in MSRs.  For some
reason this is *not* in XSAVES/XRSTORS state, nor is there any
efficient way to access it at all.

- Segment registers: can't be properly saved except by hypervisors,
and can almost, but not quite, be properly loaded (assuming the state
was sane to begin with) by normal kernels.  Just don't try to load 1,
2, or 3 into any of them.

Sometimes I think that this is all intended to be as confusing as
possible and that it's all a ploy to keep context switches slow and
complicated.  Maybe Intel doesn't actually want to compete with other
architectures that can context switch quickly?


It would be really nice if we had a clean way to load per-mm state
(see my private emails about this), a clean way to load CPL3 register
state, and a clean way to load per-user-thread *kernel* register state
(e.g. PKRU and probably PKRS).  And there should be an exception that
says "user code accessed a lazy-loaded resource that isn't loaded, and
this is the resource it tried to access".


Re: [PATCH v6 12/12] x86/traps: Fix up invalid PASID

2020-08-03 Thread Andy Lutomirski
On Mon, Aug 3, 2020 at 8:03 AM Dave Hansen  wrote:
>
> On 7/31/20 4:34 PM, Andy Lutomirski wrote:
> >> Thomas suggested to provide a reason for the #GP caused by executing ENQCMD
> >> without a valid PASID value programmed. #GP error codes are 16 bits and all
> >> 16 bits are taken. Refer to SDM Vol 3, Chapter 16.13 for details. The other
> >> choice was to reflect the error code in an MSR. ENQCMD can also cause #GP
> >> when loading from the source operand, so it's not fully comprehending all
> >> the reasons. Rather than special case the ENQCMD, in future Intel may
> >> choose a different fault mechanism for such cases if recovery is needed on
> >> #GP.
> > Decoding the user instruction is ugly and sets a bad architecture
> > precedent, but we already do it in #GP for UMIP.  So I'm unconvinced.
>
> I'll try to do one more bit of convincing. :)
>
> In the end, we need a way to figure out if the #GP was from a known "OK"
> source that we can fix up.  You're right that we could fire up the
> instruction decoder to help answer that question.  But, it (also)
> doesn't easily yield a perfect answer as to the source of the #GP, it
> always involves a user copy, and it's a larger code impact than what
> we've got.
>
> I think I went and looked at fixup_umip_exception(), and compared it to
> the alternative which is essentially just these three lines of code:
>
> > + /*
> > +  * If the current task already has a valid PASID in the MSR,
> > +  * the #GP must be for some other reason.
> > +  */
> > + if (current->has_valid_pasid)
> > + return false;
> ...> +  /* Now the current task has a valid PASID in the MSR. */
> > + current->has_valid_pasid = 1;
>
> and *I* was convinced that instruction decoding wasn't worth it.
>
> There's a lot of stuff that fixup_umip_exception() does which we don't
> have to duplicate, but it's going to be really hard to get it anywhere
> near as compact as what we've got.
>

I could easily be convinced that the PASID fixup is so trivial and so
obviously free of misfiring in a way that causes an infinite loop that
this code is fine.  But I think we first need to answer the bigger
question of why we're doing a lazy fixup in the first place.

--Andy


Re: [PATCH v6 09/12] x86/process: Clear PASID state for a newly forked/cloned thread

2020-07-31 Thread Andy Lutomirski
On Mon, Jul 13, 2020 at 4:48 PM Fenghua Yu  wrote:
>
> The PASID state has to be cleared on forks, since the child has a
> different address space. The PASID is also cleared for thread clone. While
> it would be correct to inherit the PASID in this case, it is unknown
> whether the new task will use ENQCMD. Giving it the PASID "just in case"
> would have the downside of increased context switch overhead of setting
> the PASID MSR.
>
> Since #GP faults have to be handled on any threads that were created before
> the PASID was assigned to the mm of the process, newly created threads
> might as well be treated in a consistent way.

Just how much context switch overhead are we talking about here?
Unless the CPU works differently than I expect, I would guess you are
saving exactly zero cycles.  What am I missing?

--Andy


>
> Suggested-by: Thomas Gleixner 
> Signed-off-by: Fenghua Yu 
> Reviewed-by: Tony Luck 
> ---
> v2:
> - Modify init_task_pasid().
>
>  arch/x86/kernel/process.c | 18 ++
>  1 file changed, 18 insertions(+)
>
> diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
> index f362ce0d5ac0..1b1492e337a6 100644
> --- a/arch/x86/kernel/process.c
> +++ b/arch/x86/kernel/process.c
> @@ -121,6 +121,21 @@ static int set_new_tls(struct task_struct *p, unsigned 
> long tls)
> return do_set_thread_area_64(p, ARCH_SET_FS, tls);
>  }
>
> +/* Initialize the PASID state for the forked/cloned thread. */
> +static void init_task_pasid(struct task_struct *task)
> +{
> +   struct ia32_pasid_state *ppasid;
> +
> +   /*
> +* Initialize the PASID state so that the PASID MSR will be
> +* initialized to its initial state (0) by XRSTORS when the task is
> +* scheduled for the first time.
> +*/
> +   ppasid = get_xsave_addr(&task->thread.fpu.state.xsave, XFEATURE_PASID);
> +   if (ppasid)
> +   ppasid->pasid = INIT_PASID;
> +}
> +
>  int copy_thread_tls(unsigned long clone_flags, unsigned long sp,
> unsigned long arg, struct task_struct *p, unsigned long 
> tls)
>  {
> @@ -174,6 +189,9 @@ int copy_thread_tls(unsigned long clone_flags, unsigned 
> long sp,
> task_user_gs(p) = get_user_gs(current_pt_regs());
>  #endif
>
> +   if (static_cpu_has(X86_FEATURE_ENQCMD))
> +   init_task_pasid(p);
> +
> /* Set a new TLS for the child thread? */
> if (clone_flags & CLONE_SETTLS)
> ret = set_new_tls(p, tls);
> --
> 2.19.1
>


Re: [PATCH v6 12/12] x86/traps: Fix up invalid PASID

2020-07-31 Thread Andy Lutomirski
On Mon, Jul 13, 2020 at 4:48 PM Fenghua Yu  wrote:
>
> A #GP fault is generated when ENQCMD instruction is executed without
> a valid PASID value programmed in the current thread's PASID MSR. The
> #GP fault handler will initialize the MSR if a PASID has been allocated
> for this process.

Let's take a step back here.  Why are we trying to avoid IPIs?  If you
call munmap(), you IPI other CPUs running tasks in the current mm.  If
you do perf_event_open() and thus acquire RDPMC permission, you IPI
other CPUs running tasks in the current mm.  If you call modify_ldt(),
you IPI other CPUs running tasks in the current mm.  These events can
all happen more than once per process.

Now we have ENQCMD.  An mm can be assigned a PASID *once* in the model
that these patches support.  Why not just send an IPI using
essentially identical code to the LDT sync or the CR4.PCE sync?

--Andy


Re: [PATCH v6 12/12] x86/traps: Fix up invalid PASID

2020-07-31 Thread Andy Lutomirski
On Mon, Jul 13, 2020 at 4:48 PM Fenghua Yu  wrote:
>
> A #GP fault is generated when ENQCMD instruction is executed without
> a valid PASID value programmed in the current thread's PASID MSR. The
> #GP fault handler will initialize the MSR if a PASID has been allocated
> for this process.
>
> Decoding the user instruction is ugly and sets a bad architecture
> precedent. It may not function if the faulting instruction is modified
> after #GP.
>
> Thomas suggested to provide a reason for the #GP caused by executing ENQCMD
> without a valid PASID value programmed. #GP error codes are 16 bits and all
> 16 bits are taken. Refer to SDM Vol 3, Chapter 16.13 for details. The other
> choice was to reflect the error code in an MSR. ENQCMD can also cause #GP
> when loading from the source operand, so it's not fully comprehending all
> the reasons. Rather than special case the ENQCMD, in future Intel may
> choose a different fault mechanism for such cases if recovery is needed on
> #GP.

Decoding the user instruction is ugly and sets a bad architecture
precedent, but we already do it in #GP for UMIP.  So I'm unconvinced.

Memo to Intel, though: you REALLY need to start thinking about what
the heck an OS is supposed to do with all these new faults you're
coming up with.  The new #NM for TILE is utterly nonsensical.  Sure,
it works for an OS that does not use CR0.TS and as long as no one
tries to extend the same mechanism for some new optional piece of
state, but as soon as Intel tries to use the same mechanism for
anything else, it falls apart.

Please do better.

> +
> +/*
> + * Write the current task's PASID MSR/state. This is called only when PASID
> + * is enabled.
> + */
> +static void fpu__pasid_write(u32 pasid)
> +{
> +   u64 msr_val = pasid | MSR_IA32_PASID_VALID;
> +
> +   fpregs_lock();
> +
> +   /*
> +* If the MSR is active and owned by the current task's FPU, it can
> +* be directly written.
> +*
> +* Otherwise, write the fpstate.
> +*/
> +   if (!test_thread_flag(TIF_NEED_FPU_LOAD)) {
> +   wrmsrl(MSR_IA32_PASID, msr_val);
> +   } else {
> +   struct ia32_pasid_state *ppasid_state;
> +
> +   ppasid_state = get_xsave_addr(&current->thread.fpu.state.xsave, XFEATURE_PASID);
> +   /*
> +* ppasid_state shouldn't be NULL because XFEATURE_PASID
> +* is enabled.
> +*/
> +   WARN_ON_ONCE(!ppasid_state);
> +   ppasid_state->pasid = msr_val;

WARN instead of BUG is nice, but you'll immediately oops if this fails.
How about:

if (!WARN_ON_ONCE(!ppasid_state))
  ppasid_state->pasid = msr_val;
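
i.e. the fpstate branch would end up looking something like this (a sketch of the suggestion above folded into the quoted code, nothing more):

        } else {
                struct ia32_pasid_state *ppasid_state;

                ppasid_state = get_xsave_addr(&current->thread.fpu.state.xsave,
                                              XFEATURE_PASID);
                /*
                 * XFEATURE_PASID is enabled, so this should never be NULL.
                 * If it somehow is, skip the write instead of oopsing.
                 */
                if (!WARN_ON_ONCE(!ppasid_state))
                        ppasid_state->pasid = msr_val;
        }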
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH v3 01/13] iommu: Change type of pasid to unsigned int

2020-06-17 Thread Andy Lutomirski
On Wed, Jun 17, 2020 at 12:39 PM Luck, Tony  wrote:
>
> > Is PASID in the uapi at all?
>
>
> $ grep pasid include/uapi/linux/iommu.h
> * @pasid: Process Address Space ID
> __u32   pasid;
>  * @pasid: Process Address Space ID
> __u32   pasid;
>  * @pasid: Process Address Space ID
> __u32   pasid;
>  * - If the PASID bit is set, the @pasid field is populated and the 
> invalidation
>  * @pasid: process address space ID
> __u64   pasid;
>  * struct iommu_inv_pasid_info - PASID Selective Invalidation Structure
>  * - If the PASID bit is set, the @pasid field is populated and the 
> invalidation
>  * @pasid: process address space ID
> struct iommu_inv_pasid_info {
> __u64   pasid;
>  * @pasid_info: invalidation data when @granularity is %IOMMU_INV_GRANU_PASID
> struct iommu_inv_pasid_info pasid_info;
>  * struct iommu_gpasid_bind_data_vtd - Intel VT-d specific data on device and 
> guest
> struct iommu_gpasid_bind_data_vtd {
>  * struct iommu_gpasid_bind_data - Information about device and guest PASID 
> binding
>  * @hpasid: Process address space ID used for the guest mm in host IOMMU
>  * @gpasid: Process address space ID used for the guest mm in guest IOMMU
> struct iommu_gpasid_bind_data {
> __u64 hpasid;
> __u64 gpasid;
> struct iommu_gpasid_bind_data_vtd vtd;


Aha.  I guess this is considerably older than I thought.
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH v3 01/13] iommu: Change type of pasid to unsigned int

2020-06-17 Thread Andy Lutomirski
On Wed, Jun 17, 2020 at 11:24 AM Fenghua Yu  wrote:
>
> PASID is defined as a few different types in iommu including "int",
> "u32", and "unsigned int". To be consistent and to match with ioasid's
> type, define PASID and its variations (e.g. max PASID) as "unsigned int".
>
> No PASID type change in uapi.

Is PASID in the uapi at all?

--Andy
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH v2 12/12] x86/traps: Fix up invalid PASID

2020-06-15 Thread Andy Lutomirski

> On Jun 15, 2020, at 1:17 PM, Fenghua Yu  wrote:
> 
> Hi, Peter,
> 
>> On Mon, Jun 15, 2020 at 09:09:28PM +0200, Peter Zijlstra wrote:
>>> On Mon, Jun 15, 2020 at 11:55:29AM -0700, Fenghua Yu wrote:
>>> 
>>> Or do you suggest to add a random new flag in struct thread_info instead
>>> of a TIF flag?
>> 
>> Why thread_info? What's wrong with something simple like the below. It
>> takes a bit from the 'strictly current' flags word.
>> 
>> 
>> diff --git a/include/linux/sched.h b/include/linux/sched.h
>> index b62e6aaf28f0..fca830b97055 100644
>> --- a/include/linux/sched.h
>> +++ b/include/linux/sched.h
>> @@ -801,6 +801,9 @@ struct task_struct {
>>        /* Stalled due to lack of memory */
>>        unsigned                in_memstall:1;
>> #endif
>> +#ifdef CONFIG_PCI_PASID
>> +        unsigned                has_valid_pasid:1;
>> +#endif
>> 
>>        unsigned long           atomic_flags; /* Flags requiring atomic access. */
>> 
>> diff --git a/kernel/fork.c b/kernel/fork.c
>> index 142b23645d82..10b3891be99e 100644
>> --- a/kernel/fork.c
>> +++ b/kernel/fork.c
>> @@ -955,6 +955,10 @@ static struct task_struct *dup_task_struct(struct task_struct *orig, int node)
>>        tsk->use_memdelay = 0;
>> #endif
>> 
>> +#ifdef CONFIG_PCI_PASID
>> +        tsk->has_valid_pasid = 0;
>> +#endif
>> +
>> #ifdef CONFIG_MEMCG
>>        tsk->active_memcg = NULL;
>> #endif
> 
> The PASID MSR is x86 specific although PASID is PCIe concept and per-mm.
> Checking if the MSR has valid PASID (bit31=1) is an x86 specifc work.
> The flag should be cleared in cloned()/forked() and is only set and
> read in fixup() in x86 #GP for heuristic. It's not used anywhere outside
> of x86.
> 
> That's why we think the flag should be in x86 struct thread_info instead
> of in generice struct task_struct.
> 

Are we planning to keep PASID live once a task has used it once or are we going 
to swap it lazily?  If the latter, a percpu variable might be better.
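
(By "percpu variable" I mean something like the following, names invented: track what the MSR on each CPU currently holds rather than whether a given task has ever been fixed up.)

/* lazy model: remember the PASID currently loaded in this CPU's MSR */
DEFINE_PER_CPU(u32, pasid_in_msr);      /* 0 == MSR not initialized on this CPU */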

> Please advice.
> 
> Thanks.
> 
> -Fenghua
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

Re: [PATCH v2 12/12] x86/traps: Fix up invalid PASID

2020-06-15 Thread Andy Lutomirski


> On Jun 15, 2020, at 1:56 PM, Luck, Tony  wrote:
> 
> 
>> 
>> Are we planning to keep PASID live once a task has used it once or are we 
>> going to swap it lazily?  If the latter, a percpu variable might be better.
> 
> Current plan is "touch it once and the task owns it until exit(2)"
> 
> Maybe someday in the future when we have data on how applications
> actually use accelerators we could look at something more complex
> if usage patterns look like it would be beneficial.
> 
> 

So what’s the RDMSR for?  Surely you
have some state somewhere that says “this task has a PASID.”  Can’t you just 
make sure that stays in sync with the MSR?  Then, on #GP, if the task already 
has a PASID, you know the MSR is set.
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

Re: [RFC PATCH v9 03/13] mm: Add support for eXclusive Page Frame Ownership (XPFO)

2019-04-17 Thread Andy Lutomirski
On Wed, Apr 17, 2019 at 5:00 PM Linus Torvalds
 wrote:
>
> On Wed, Apr 17, 2019 at 4:42 PM Thomas Gleixner  wrote:
> >
> > On Wed, 17 Apr 2019, Linus Torvalds wrote:
> >
> > > With SMEP, user space pages are always NX.
> >
> > We talk past each other. The user space page in the ring3 valid virtual
> > address space (non negative) is of course protected by SMEP.
> >
> > The attack utilizes the kernel linear mapping of the physical
> > memory. I.e. user space address 0x43210 has a kernel equivalent at
> > 0xfxx. So if the attack manages to trick the kernel to that valid
> > kernel address and that is mapped X --> game over. SMEP does not help
> > there.
>
> Oh, agreed.
>
> But that would simply be a kernel bug. We should only map kernel pages
> executable when we have kernel code in them, and we should certainly
> not allow those pages to be mapped writably in user space.
>
> That kind of "executable in kernel, writable in user" would be a
> horrendous and major bug.
>
> So i think it's a non-issue.
>
> > From the top of my head I'd say this is a non issue as those kernel address
> > space mappings _should_ be NX, but we got bitten by _should_ in the past:)
>
> I do agree that bugs can happen, obviously, and we might have missed 
> something.
>
> But in the context of XPFO, I would argue (*very* strongly) that the
> likelihood of the above kind of bug is absolutely *miniscule* compared
> to the likelihood that we'd have something wrong in the software
> implementation of XPFO.
>
> So if the argument is "we might have bugs in software", then I think
> that's an argument _against_ XPFO rather than for it.
>

I don't think this type of NX goof was ever the argument for XPFO.
The main argument I've heard is that a malicious user program writes a
ROP payload into user memory (regular anonymous user memory) and then
gets the kernel to erroneously set RSP (*not* RIP) to point there.

I find this argument fairly weak for a couple reasons.  First, if
we're worried about this, let's do in-kernel CFI, not XPFO, to
mitigate it.  Second, I don't see why the exact same attack can't be
done using, say, page cache, and unless I'm missing something, XPFO
doesn't protect page cache.  Or network buffers, or pipe buffers, etc.
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [RFC PATCH v9 03/13] mm: Add support for eXclusive Page Frame Ownership (XPFO)

2019-04-17 Thread Andy Lutomirski
On Wed, Apr 17, 2019 at 10:33 AM Khalid Aziz  wrote:
>
> On 4/17/19 11:09 AM, Ingo Molnar wrote:
> >
> > * Khalid Aziz  wrote:
> >
> >>> I.e. the original motivation of the XPFO patches was to prevent execution
> >>> of direct kernel mappings. Is this motivation still present if those
> >>> mappings are non-executable?
> >>>
> >>> (Sorry if this has been asked and answered in previous discussions.)
> >>
> >> Hi Ingo,
> >>
> >> That is a good question. Because of the cost of XPFO, we have to be very
> >> sure we need this protection. The paper from Vasileios, Michalis and
> >> Angelos - ,
> >> does go into how ret2dir attacks can bypass SMAP/SMEP in sections 6.1
> >> and 6.2.
> >
> > So it would be nice if you could generally summarize external arguments
> > when defending a patchset, instead of me having to dig through a PDF
> > which not only causes me to spend time that you probably already spent
> > reading that PDF, but I might also interpret it incorrectly. ;-)
>
> Sorry, you are right. Even though that paper explains it well, a summary
> is always useful.
>
> >
> > The PDF you cited says this:
> >
> >   "Unfortunately, as shown in Table 1, the W^X prop-erty is not enforced
> >in many platforms, including x86-64.  In our example, the content of
> >user address 0xBEEF000 is also accessible through kernel address
> >0x87FF9F08 as plain, executable code."
> >
> > Is this actually true of modern x86-64 kernels? We've locked down W^X
> > protections in general.
> >
> > I.e. this conclusion:
> >
> >   "Therefore, by simply overwriting kfptr with 0x87FF9F08 and
> >triggering the kernel to dereference it, an attacker can directly
> >execute shell code with kernel privileges."
> >
> > ... appears to be predicated on imperfect W^X protections on the x86-64
> > kernel.
> >
> > Do such holes exist on the latest x86-64 kernel? If yes, is there a
> > reason to believe that these W^X holes cannot be fixed, or that any fix
> > would be more expensive than XPFO?
>
> Even if physmap is not executable, return-oriented programming (ROP) can
> still be used to launch an attack. Instead of placing executable code at
> user address 0xBEEF000, attacker can place an ROP payload there. kfptr
> is then overwritten to point to a stack-pivoting gadget. Using the
> physmap address aliasing, the ROP payload becomes kernel-mode stack. The
> execution can then be hijacked upon execution of ret instruction. This
> is a gist of the subsection titled "Non-executable physmap" under
> section 6.2 and it looked convincing enough to me. If you have a
> different take on this, I am very interested in your point of view.

My issue with all this is that XPFO is really very expensive.  I think
that, if we're going to seriously consider upstreaming expensive
exploit mitigations like this, we should consider others first, in
particular CFI techniques.  grsecurity's RAP would be a great start.
I also proposed using a gcc plugin (or upstream gcc feature) to add
some instrumentation to any code that pops RSP to verify that the
resulting (unsigned) change in RSP is between 0 and THREAD_SIZE bytes.
This will make ROP quite a bit harder.
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [RFC PATCH v9 12/13] xpfo, mm: Defer TLB flushes for non-current CPUs (x86 only)

2019-04-05 Thread Andy Lutomirski


> On Apr 5, 2019, at 9:56 AM, Tycho Andersen  wrote:
> 
>> On Fri, Apr 05, 2019 at 09:24:50AM -0600, Andy Lutomirski wrote:
>> 
>> 
>>> On Apr 5, 2019, at 8:44 AM, Dave Hansen  wrote:
>>> 
>>> On 4/5/19 12:17 AM, Thomas Gleixner wrote:
>>>>> process. Is that an acceptable trade-off?
>>>> You are not seriously asking whether creating a user controllable ret2dir
>>>> attack window is a acceptable trade-off? April 1st was a few days ago.
>>> 
>>> Well, let's not forget that this set at least takes us from "always
>>> vulnerable to ret2dir" to a choice between:
>>> 
>>> 1. fast-ish and "vulnerable to ret2dir for a user-controllable window"
>>> 2. slow and "mitigated against ret2dir"
>>> 
>>> Sounds like we need a mechanism that will do the deferred XPFO TLB
>>> flushes whenever the kernel is entered, and not _just_ at context switch
>>> time.  This permits an app to run in userspace with stale kernel TLB
>>> entries as long as it wants... that's harmless.
>> 
>> I don’t think this is good enough. The bad guys can enter the kernel and 
>> arrange for the kernel to wait, *in kernel*, for long enough to set up the 
>> attack.  userfaultfd is the most obvious way, but there are plenty. I 
>> suppose we could do the flush at context switch *and* entry.  I bet that 
>> performance still utterly sucks, though — on many workloads, this turns 
>> every entry into a full flush, and we already know exactly how much that 
>> sucks — it’s identical to KPTI without PCID.  (And yes, if we go this route, 
>> we need to merge this logic together — we shouldn’t write CR3 twice on 
>> entry).
>> 
>> I feel like this whole approach is misguided. ret2dir is not such a game 
>> changer that fixing it is worth huge slowdowns. I think all this effort 
>> should be spent on some kind of sensible CFI. For example, we should be able 
>> to mostly squash ret2anything by inserting a check that the high bits of RSP 
>> match the value on the top of the stack before any code that pops RSP.  On 
>> an FPO build, there aren’t all that many hot POP RSP instructions, I think.
>> 
>> (Actually, checking the bits is suboptimal. Do:
>> 
>> unsigned long offset = *rsp - rsp;
>> offset >>= THREAD_SHIFT;
>> if (unlikely(offset))
>> BUG();
>> POP RSP;
> 
> This is a neat trick, and definitely prevents going random places in
> the heap. But,
> 
>> This means that it’s also impossible to trick a function to return into a 
>> buffer that is on that function’s stack.)
> 
> Why is this true? All you're checking is that you can't shift the
> "location" of the stack. If you can inject stuff into a stack buffer,
> can't you just inject the right frame to return to your code as well,
> so you don't have to shift locations?
> 
> 

But the injected ROP payload will be *below* RSP, so you’ll need a gadget that 
can decrement RSP.  This makes the attack a good deal harder.

Something like RAP on top, or CET, will make this even harder.

> 
> Tycho
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

Re: [RFC PATCH v9 12/13] xpfo, mm: Defer TLB flushes for non-current CPUs (x86 only)

2019-04-05 Thread Andy Lutomirski


> On Apr 5, 2019, at 10:01 AM, Dave Hansen  wrote:
> 
> On 4/5/19 8:24 AM, Andy Lutomirski wrote:
>>> Sounds like we need a mechanism that will do the deferred XPFO TLB 
>>> flushes whenever the kernel is entered, and not _just_ at context
>>> switch time.  This permits an app to run in userspace with stale
>>> kernel TLB entries as long as it wants... that's harmless.
> ...
>> I suppose we could do the flush at context switch *and*
>> entry.  I bet that performance still utterly sucks, though — on many
>> workloads, this turns every entry into a full flush, and we already
>> know exactly how much that sucks — it’s identical to KPTI without
>> PCID.  (And yes, if we go this route, we need to merge this logic
>> together — we shouldn’t write CR3 twice on entry).
> 
> Yeah, probably true.
> 
> Just eyeballing this, it would mean mapping the "cpu needs deferred
> flush" variable into the cpu_entry_area, which doesn't seem too awful.
> 
> I think the basic overall concern is that the deferred flush leaves too
> many holes and by the time we close them sufficiently, performance will
> suck again.  Seems like a totally valid concern, but my crystal ball is
> hazy on whether it will be worth it in the end to many folks
> 
> ...
>> In other words, I think that ret2dir is an insufficient justification
>> for XPFO.
> 
> Yeah, other things that it is good for have kinda been lost in the
> noise.  I think I first started looking at this long before Meltdown and
> L1TF were public.
> 
> There are hypervisors out there that simply don't (persistently) map
> user data.  They can't leak user data because they don't even have
> access to it in their virtual address space.  Those hypervisors had a
> much easier time with L1TF mitigation than we did.  Basically, they
> could flush the L1 after user data was accessible instead of before
> untrusted guest code runs.
> 
> My hope is that XPFO could provide us similar protection.  But,
> somebody's got to poke at it for a while to see how far they can push it.
> 
> IMNHO, XPFO is *always* going to be painful for kernel compiles.  But,
> cloud providers aren't doing a lot of kernel compiles on their KVM
> hosts, and they deeply care about leaking their users' data.

At the risk of asking stupid questions: we already have a mechanism for this: 
highmem.  Can we enable highmem on x86_64, maybe with some heuristics to make 
it work well?
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

Re: [RFC PATCH v9 12/13] xpfo, mm: Defer TLB flushes for non-current CPUs (x86 only)

2019-04-05 Thread Andy Lutomirski


> On Apr 5, 2019, at 8:44 AM, Dave Hansen  wrote:
> 
> On 4/5/19 12:17 AM, Thomas Gleixner wrote:
>>> process. Is that an acceptable trade-off?
>> You are not seriously asking whether creating a user controllable ret2dir
>> attack window is a acceptable trade-off? April 1st was a few days ago.
> 
> Well, let's not forget that this set at least takes us from "always
> vulnerable to ret2dir" to a choice between:
> 
> 1. fast-ish and "vulnerable to ret2dir for a user-controllable window"
> 2. slow and "mitigated against ret2dir"
> 
> Sounds like we need a mechanism that will do the deferred XPFO TLB
> flushes whenever the kernel is entered, and not _just_ at context switch
> time.  This permits an app to run in userspace with stale kernel TLB
> entries as long as it wants... that's harmless.

I don’t think this is good enough. The bad guys can enter the kernel and 
arrange for the kernel to wait, *in kernel*, for long enough to set up the 
attack.  userfaultfd is the most obvious way, but there are plenty. I suppose 
we could do the flush at context switch *and* entry.  I bet that performance 
still utterly sucks, though — on many workloads, this turns every entry into a 
full flush, and we already know exactly how much that sucks — it’s identical to 
KPTI without PCID.  (And yes, if we go this route, we need to merge this logic 
together — we shouldn’t write CR3 twice on entry).

I feel like this whole approach is misguided. ret2dir is not such a game 
changer that fixing it is worth huge slowdowns. I think all this effort should 
be spent on some kind of sensible CFI. For example, we should be able to mostly 
squash ret2anything by inserting a check that the high bits of RSP match the 
value on the top of the stack before any code that pops RSP.  On an FPO build, 
there aren’t all that many hot POP RSP instructions, I think.

(Actually, checking the bits is suboptimal. Do:

unsigned long offset = *rsp - rsp;
offset >>= THREAD_SHIFT;
if (unlikely(offset))
BUG();
POP RSP;

This means that it’s also impossible to trick a function to return into a 
buffer that is on that function’s stack.)
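
As a very rough sketch, the per-site instrumentation could be as small as this (AT&T syntax, clobbers one scratch register; purely illustrative, not generated by any existing plugin):

        movq    (%rsp), %rax            /* value POP RSP is about to load */
        subq    %rsp, %rax              /* unsigned distance from current RSP */
        shrq    $THREAD_SHIFT, %rax     /* zero iff the target stays on this stack */
        jz      1f
        ud2                             /* stack pivot detected: die */
1:      popq    %rsp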

In other words, I think that ret2dir is an insufficient justification for XPFO.
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

Re: [RFC PATCH v9 12/13] xpfo, mm: Defer TLB flushes for non-current CPUs (x86 only)

2019-04-05 Thread Andy Lutomirski


>>> On Apr 4, 2019, at 4:55 PM, Khalid Aziz  wrote:
>>> 
>>> On 4/3/19 10:10 PM, Andy Lutomirski wrote:
>>> On Wed, Apr 3, 2019 at 10:36 AM Khalid Aziz  wrote:
>>> 
>>> XPFO flushes kernel space TLB entries for pages that are now mapped
>>> in userspace on not only the current CPU but also all other CPUs
>>> synchronously. Processes on each core allocating pages causes a
>>> flood of IPI messages to all other cores to flush TLB entries.
>>> Many of these messages are to flush the entire TLB on the core if
>>> the number of entries being flushed from local core exceeds
>>> tlb_single_page_flush_ceiling. The cost of TLB flush caused by
>>> unmapping pages from physmap goes up dramatically on machines with
>>> high core count.
>>> 
>>> This patch flushes relevant TLB entries for current process or
>>> entire TLB depending upon number of entries for the current CPU
>>> and posts a pending TLB flush on all other CPUs when a page is
>>> unmapped from kernel space and mapped in userspace. Each core
>>> checks the pending TLB flush flag for itself on every context
>>> switch, flushes its TLB if the flag is set and clears it.
>>> This patch potentially aggregates multiple TLB flushes into one.
>>> This has very significant impact especially on machines with large
>>> core counts.
>> 
>> Why is this a reasonable strategy?
> 
> Ideally when pages are unmapped from physmap, all CPUs would be sent IPI
> synchronously to flush TLB entry for those pages immediately. This may
> be ideal from correctness and consistency point of view, but it also
> results in IPI storm and repeated TLB flushes on all processors. Any
> time a page is allocated to userspace, we are going to go through this
> and it is very expensive. On a 96-core server, performance degradation
> is 26x!!

Indeed. XPFO is expensive.

> 
> When xpfo unmaps a page from physmap only (after mapping the page in
> userspace in response to an allocation request from userspace) on one
> processor, there is a small window of opportunity for ret2dir attack on
> other cpus until the TLB entry in physmap for the unmapped pages on
> other cpus is cleared.

Why do you think this window is small? Intervals of seconds to months between 
context switches aren’t unheard of.

And why is a small window like this even helpful?  For a ret2dir attack, you 
just need to get CPU A to allocate a page and write the ret2dir payload and 
then get CPU B to return to it before context switching.  This should be doable 
quite reliably.

So I don’t really have a suggestion, but I think that a 44% regression to get a 
weak defense like this doesn’t seem worthwhile.  I bet that any of a number of 
CFI techniques (RAP-like or otherwise) will be cheaper and protect against 
ret2dir better.  And they’ll also protect against using other kernel memory as 
a stack buffer.  There are plenty of those — think pipe buffers, network 
buffers, any page cache not covered by XPFO, XMM/YMM saved state, etc.
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

Re: [RFC PATCH v9 02/13] x86: always set IF before oopsing from page fault

2019-04-04 Thread Andy Lutomirski

> On Apr 4, 2019, at 10:28 AM, Thomas Gleixner  wrote:
> 
>> On Thu, 4 Apr 2019, Tycho Andersen wrote:
>>        leaq    -PTREGS_SIZE(%rax), %rsp
>>        UNWIND_HINT_FUNC sp_offset=PTREGS_SIZE
>> 
>> +        /*
>> +         * If we oopsed in an interrupt handler, interrupts may be off. Let's turn
>> +         * them back on before going back to "normal" code.
>> +         */
>> +        sti
> 
> That breaks the paravirt muck and tracing/lockdep.
> 
> ENABLE_INTERRUPTS() is what you want plus TRACE_IRQ_ON to keep the tracer
> and lockdep happy.
> 
> 

I’m sure we’ll find some other thing we forgot to reset eventually, so let’s do 
this in C.  Change the call do_exit to call __finish_rewind_stack_do_exit and 
add the latter as a C function that does local_irq_enable() and do_exit().
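
A minimal sketch of that helper (the name comes from the suggestion above, it is not an existing symbol):

/* called from the asm in rewind_stack_do_exit() once the stack is sane */
void __noreturn __finish_rewind_stack_do_exit(long signr)
{
        /*
         * The stack has been rewound, so it's finally safe to turn
         * interrupts back on (and to reset whatever else we find later).
         */
        local_irq_enable();
        do_exit(signr);
}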
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

Re: [RFC PATCH v9 02/13] x86: always set IF before oopsing from page fault

2019-04-03 Thread Andy Lutomirski
On Wed, Apr 3, 2019 at 6:42 PM Tycho Andersen  wrote:
>
> On Wed, Apr 03, 2019 at 05:12:56PM -0700, Andy Lutomirski wrote:
> > On Wed, Apr 3, 2019 at 10:36 AM Khalid Aziz  wrote:
> > >
> > > From: Tycho Andersen 
> > >
> > > Oopsing might kill the task, via rewind_stack_do_exit() at the bottom, and
> > > that might sleep:
> > >
> >
> >
> > > diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
> > > index 9d5c75f02295..7891add0913f 100644
> > > --- a/arch/x86/mm/fault.c
> > > +++ b/arch/x86/mm/fault.c
> > > @@ -858,6 +858,12 @@ no_context(struct pt_regs *regs, unsigned long 
> > > error_code,
> > > /* Executive summary in case the body of the oops scrolled away */
> > > printk(KERN_DEFAULT "CR2: %016lx\n", address);
> > >
> > > +   /*
> > > +* We're about to oops, which might kill the task. Make sure we're
> > > +* allowed to sleep.
> > > +*/
> > > +   flags |= X86_EFLAGS_IF;
> > > +
> > > oops_end(flags, regs, sig);
> > >  }
> > >
> >
> >
> > NAK.  If there's a bug in rewind_stack_do_exit(), please fix it in
> > rewind_stack_do_exit().
>
> [I trimmed the CC list since google rejected it with E2BIG :)]
>
> I guess the problem is really that do_exit() (or really
> exit_signals()) might sleep. Maybe we should put an irq_enable() at
> the beginning of do_exit() instead and fix this problem for all
> arches?
>

Hmm.  do_exit() isn't really meant to be "try your best to leave the
system somewhat usable without returning" -- it's a function that,
other than in OOPSes, is called from a well-defined state.  So I think
rewind_stack_do_exit() is probably a better spot.  But we need to
rewind the stack and *then* turn on IRQs, since we otherwise risk
exploding quite badly.
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [RFC PATCH v9 12/13] xpfo, mm: Defer TLB flushes for non-current CPUs (x86 only)

2019-04-03 Thread Andy Lutomirski
On Wed, Apr 3, 2019 at 10:36 AM Khalid Aziz  wrote:
>
> XPFO flushes kernel space TLB entries for pages that are now mapped
> in userspace on not only the current CPU but also all other CPUs
> synchronously. Processes on each core allocating pages causes a
> flood of IPI messages to all other cores to flush TLB entries.
> Many of these messages are to flush the entire TLB on the core if
> the number of entries being flushed from local core exceeds
> tlb_single_page_flush_ceiling. The cost of TLB flush caused by
> unmapping pages from physmap goes up dramatically on machines with
> high core count.
>
> This patch flushes relevant TLB entries for current process or
> entire TLB depending upon number of entries for the current CPU
> and posts a pending TLB flush on all other CPUs when a page is
> unmapped from kernel space and mapped in userspace. Each core
> checks the pending TLB flush flag for itself on every context
> switch, flushes its TLB if the flag is set and clears it.
> This patch potentially aggregates multiple TLB flushes into one.
> This has very significant impact especially on machines with large
> core counts.

Why is this a reasonable strategy?

> +void xpfo_flush_tlb_kernel_range(unsigned long start, unsigned long end)
> +{
> +   struct cpumask tmp_mask;
> +
> +   /*
> +* Balance as user space task's flush, a bit conservative.
> +* Do a local flush immediately and post a pending flush on all
> +* other CPUs. Local flush can be a range flush or full flush
> +* depending upon the number of entries to be flushed. Remote
> +* flushes will be done by individual processors at the time of
> +* context switch and this allows multiple flush requests from
> +* other CPUs to be batched together.
> +*/

I don't like this function at all.  A core function like this is a
contract of sorts between the caller and the implementation.  There is
no such thing as an "xpfo" flush, and this function's behavior isn't
at all well defined.  For flush_tlb_kernel_range(), I can tell you
exactly what that function does, and the implementation is either
correct or incorrect.  With this function, I have no idea what is
actually required, and I can't possibly tell whether it's correct.

As far as I can see, xpfo_flush_tlb_kernel_range() actually means
"flush this range on this CPU right now, and flush it on remote CPUs
eventually".  It would be valid, but probably silly, to flush locally
and to never flush at all on remote CPUs.  This makes me wonder what
the point is.
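
If the intended contract really is "flush this range here and now, full-flush everybody else at their next context switch", then at least spell it out so it can be reviewed as such, e.g. (names invented, purely to illustrate the semantics described in the changelog):

DEFINE_PER_CPU(bool, xpfo_flush_pending);

/* flush the range on this CPU now; all other CPUs just get a pending flush */
void xpfo_flush_tlb_kernel_range_lazy(unsigned long start, unsigned long end)
{
        unsigned long addr;
        int cpu;

        for (addr = start; addr < end; addr += PAGE_SIZE)
                __flush_tlb_one_kernel(addr);

        for_each_online_cpu(cpu)
                if (cpu != raw_smp_processor_id())
                        per_cpu(xpfo_flush_pending, cpu) = true;
}

/* called from the context switch path on every CPU */
void xpfo_flush_if_pending(void)
{
        if (this_cpu_read(xpfo_flush_pending)) {
                this_cpu_write(xpfo_flush_pending, false);
                __flush_tlb_all();
        }
}

That at least makes the window explicit instead of hiding it behind a name that sounds like a normal TLB flush.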
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [RFC PATCH v9 02/13] x86: always set IF before oopsing from page fault

2019-04-03 Thread Andy Lutomirski
On Wed, Apr 3, 2019 at 10:36 AM Khalid Aziz  wrote:
>
> From: Tycho Andersen 
>
> Oopsing might kill the task, via rewind_stack_do_exit() at the bottom, and
> that might sleep:
>


> diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
> index 9d5c75f02295..7891add0913f 100644
> --- a/arch/x86/mm/fault.c
> +++ b/arch/x86/mm/fault.c
> @@ -858,6 +858,12 @@ no_context(struct pt_regs *regs, unsigned long 
> error_code,
> /* Executive summary in case the body of the oops scrolled away */
> printk(KERN_DEFAULT "CR2: %016lx\n", address);
>
> +   /*
> +* We're about to oops, which might kill the task. Make sure we're
> +* allowed to sleep.
> +*/
> +   flags |= X86_EFLAGS_IF;
> +
> oops_end(flags, regs, sig);
>  }
>


NAK.  If there's a bug in rewind_stack_do_exit(), please fix it in
rewind_stack_do_exit().
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH v7 11/36] x86/mm: Add SME support for read_cr3_pa()

2017-06-20 Thread Andy Lutomirski
On Fri, Jun 16, 2017 at 11:51 AM, Tom Lendacky  wrote:
> The cr3 register entry can contain the SME encryption mask that indicates
> the PGD is encrypted.  The encryption mask should not be used when
> creating a virtual address from the cr3 register, so remove the SME
> encryption mask in the read_cr3_pa() function.
>
> During early boot SME will need to use a native version of read_cr3_pa(),
> so create native_read_cr3_pa().
>
> Signed-off-by: Tom Lendacky 
> ---
>  arch/x86/include/asm/processor-flags.h |3 ++-
>  arch/x86/include/asm/processor.h   |5 +
>  2 files changed, 7 insertions(+), 1 deletion(-)
>
> diff --git a/arch/x86/include/asm/processor-flags.h 
> b/arch/x86/include/asm/processor-flags.h
> index 79aa2f9..cb6999c 100644
> --- a/arch/x86/include/asm/processor-flags.h
> +++ b/arch/x86/include/asm/processor-flags.h
> @@ -2,6 +2,7 @@
>  #define _ASM_X86_PROCESSOR_FLAGS_H
>
>  #include 
> +#include 
>
>  #ifdef CONFIG_VM86
>  #define X86_VM_MASKX86_EFLAGS_VM
> @@ -33,7 +34,7 @@
>   */
>  #ifdef CONFIG_X86_64
>  /* Mask off the address space ID bits. */
> -#define CR3_ADDR_MASK 0x7000ull
> +#define CR3_ADDR_MASK __sme_clr(0x7000ull)

Can you update the comment one line above, too?
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH v6 10/34] x86, x86/mm, x86/xen, olpc: Use __va() against just the physical address in cr3

2017-06-09 Thread Andy Lutomirski
On Thu, Jun 8, 2017 at 3:38 PM, Tom Lendacky <thomas.lenda...@amd.com> wrote:
> On 6/8/2017 1:05 AM, Andy Lutomirski wrote:
>>
>> On Wed, Jun 7, 2017 at 12:14 PM, Tom Lendacky <thomas.lenda...@amd.com>
>> wrote:
>>>
>>> The cr3 register entry can contain the SME encryption bit that indicates
>>> the PGD is encrypted.  The encryption bit should not be used when
>>> creating
>>> a virtual address for the PGD table.
>>>
>>> Create a new function, read_cr3_pa(), that will extract the physical
>>> address from the cr3 register. This function is then used where a virtual
>>> address of the PGD needs to be created/used from the cr3 register.
>>
>>
>> This is going to conflict with:
>>
>>
>> https://git.kernel.org/pub/scm/linux/kernel/git/luto/linux.git/commit/?h=x86/pcid=555c81e5d01a62b629ec426a2f50d27e2127c1df
>>
>> We're both encountering the fact that CR3 munges the page table PA
>> with some other stuff, and some readers want to see the actual CR3
>> value and other readers just want the PA.  The thing I prefer about my
>> patch is that I get rid of read_cr3() entirely, forcing the patch to
>> update every single reader, making review and conflict resolution much
>> safer.
>>
>> I'd be willing to send a patch tomorrow that just does the split into
>> __read_cr3() and read_cr3_pa() (I like your name better) and then we
>> can both base on top of it.  Would that make sense?
>
>
> That makes sense to me.

Draft patch:

https://git.kernel.org/pub/scm/linux/kernel/git/luto/linux.git/commit/?h=x86/read_cr3=9adebbc1071f066421a27b4f6e040190f1049624

>
>>
>> Also:
>>
>>> +static inline unsigned long read_cr3_pa(void)
>>> +{
>>> +   return (read_cr3() & PHYSICAL_PAGE_MASK);
>>> +}
>>
>>
>> Is there any guarantee that the magic encryption bit is masked out in
>> PHYSICAL_PAGE_MASK?  The docs make it sound like it could be any bit.
>> (But if it's one of the low 12 bits, that would be quite confusing.)
>
>
> Right now it's bit 47 and we're steering away from any of the currently
> reserved bits so we should be safe.

Should the SME init code check that it's a usable bit (i.e. outside
our physical address mask and not one of the bottom twelve bits)?  If
some future CPU daftly picks, say, bit 12, we'll regret it if we
enable SME.
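
Something like this at SME feature-detection time, perhaps (sketch only; sme_bit_pos and usable_phys_bits are made-up names for the bit position the CPU reports and the number of physical address bits that remain usable with SME enabled):

        /*
         * The encryption bit has to sit above every usable physical address
         * bit and must not be one of the low page-offset bits, otherwise
         * masking it into page table entries / CR3 corrupts addresses.
         */
        if (sme_bit_pos < PAGE_SHIFT || sme_bit_pos < usable_phys_bits) {
                pr_err("SME: unusable encryption bit position %u, not enabling\n",
                       sme_bit_pos);
                return;
        }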

--Andy
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH v6 10/34] x86, x86/mm, x86/xen, olpc: Use __va() against just the physical address in cr3

2017-06-08 Thread Andy Lutomirski
On Wed, Jun 7, 2017 at 12:14 PM, Tom Lendacky  wrote:
> The cr3 register entry can contain the SME encryption bit that indicates
> the PGD is encrypted.  The encryption bit should not be used when creating
> a virtual address for the PGD table.
>
> Create a new function, read_cr3_pa(), that will extract the physical
> address from the cr3 register. This function is then used where a virtual
> address of the PGD needs to be created/used from the cr3 register.

This is going to conflict with:

https://git.kernel.org/pub/scm/linux/kernel/git/luto/linux.git/commit/?h=x86/pcid=555c81e5d01a62b629ec426a2f50d27e2127c1df

We're both encountering the fact that CR3 munges the page table PA
with some other stuff, and some readers want to see the actual CR3
value and other readers just want the PA.  The thing I prefer about my
patch is that I get rid of read_cr3() entirely, forcing the patch to
update every single reader, making review and conflict resolution much
safer.

I'd be willing to send a patch tomorrow that just does the split into
__read_cr3() and read_cr3_pa() (I like your name better) and then we
can both base on top of it.  Would that make sense?

Also:

> +static inline unsigned long read_cr3_pa(void)
> +{
> +   return (read_cr3() & PHYSICAL_PAGE_MASK);
> +}

Is there any guarantee that the magic encryption bit is masked out in
PHYSICAL_PAGE_MASK?  The docs make it sound like it could be any bit.
(But if it's one of the low 12 bits, that would be quite confusing.)

--Andy
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [RFC PATCH v2 11/20] mm: Access BOOT related data in the clear

2016-09-12 Thread Andy Lutomirski
On Aug 22, 2016 6:53 PM, "Tom Lendacky"  wrote:
>
> BOOT data (such as EFI related data) is not encrypted when the system is
> booted and needs to be accessed as non-encrypted.  Add support to the
> early_memremap API to identify the type of data being accessed so that
> the proper encryption attribute can be applied.  Currently, two types
> of data are defined, KERNEL_DATA and BOOT_DATA.

What happens when you memremap boot services data outside of early
boot?  Matt just added code that does this.

IMO this API is not so great.  It scatters a specialized consideration
all over the place.  Could early_memremap not look up the PA to figure
out what to do?
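
Something along these lines, where the decision is keyed off the physical address itself (sketch; memremap_is_boot_data() is a made-up helper that would consult the EFI/boot memory map, and _PAGE_ENC stands for whatever this series ends up calling the encryption attribute):

static pgprot_t early_memremap_pgprot_adjust(resource_size_t phys_addr,
                                             unsigned long size,
                                             pgprot_t prot)
{
        /* boot/EFI data was written unencrypted, so map it that way */
        if (memremap_is_boot_data(phys_addr, size))
                prot = __pgprot(pgprot_val(prot) & ~_PAGE_ENC);

        return prot;
}

Then none of the callers need to know who "owns" the range.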

--Andy

[leaving the rest here for Matt's benefit]

>  unsigned long size,
> +   enum memremap_owner owner,
> +   pgprot_t prot)
> +{
> +   return prot;
> +}
> +
>  void __init early_ioremap_reset(void)
>  {
> early_ioremap_shutdown();
> @@ -213,16 +221,23 @@ early_ioremap(resource_size_t phys_addr, unsigned long 
> size)
>
>  /* Remap memory */
>  void __init *
> -early_memremap(resource_size_t phys_addr, unsigned long size)
> +early_memremap(resource_size_t phys_addr, unsigned long size,
> +  enum memremap_owner owner)
>  {
> -   return (__force void *)__early_ioremap(phys_addr, size,
> -  FIXMAP_PAGE_NORMAL);
> +   pgprot_t prot = early_memremap_pgprot_adjust(phys_addr, size, owner,
> +FIXMAP_PAGE_NORMAL);
> +
> +   return (__force void *)__early_ioremap(phys_addr, size, prot);
>  }
>  #ifdef FIXMAP_PAGE_RO
>  void __init *
> -early_memremap_ro(resource_size_t phys_addr, unsigned long size)
> +early_memremap_ro(resource_size_t phys_addr, unsigned long size,
> + enum memremap_owner owner)
>  {
> -   return (__force void *)__early_ioremap(phys_addr, size, 
> FIXMAP_PAGE_RO);
> +   pgprot_t prot = early_memremap_pgprot_adjust(phys_addr, size, owner,
> +FIXMAP_PAGE_RO);
> +
> +   return (__force void *)__early_ioremap(phys_addr, size, prot);
>  }
>  #endif
>
> @@ -236,7 +251,8 @@ early_memremap_prot(resource_size_t phys_addr, unsigned 
> long size,
>
>  #define MAX_MAP_CHUNK  (NR_FIX_BTMAPS << PAGE_SHIFT)
>
> -void __init copy_from_early_mem(void *dest, phys_addr_t src, unsigned long 
> size)
> +void __init copy_from_early_mem(void *dest, phys_addr_t src, unsigned long 
> size,
> +   enum memremap_owner owner)
>  {
> unsigned long slop, clen;
> char *p;
> @@ -246,7 +262,7 @@ void __init copy_from_early_mem(void *dest, phys_addr_t 
> src, unsigned long size)
> clen = size;
> if (clen > MAX_MAP_CHUNK - slop)
> clen = MAX_MAP_CHUNK - slop;
> -   p = early_memremap(src & PAGE_MASK, clen + slop);
> +   p = early_memremap(src & PAGE_MASK, clen + slop, owner);
> memcpy(dest, p + slop, clen);
> early_memunmap(p, clen + slop);
> dest += clen;
> @@ -265,12 +281,14 @@ early_ioremap(resource_size_t phys_addr, unsigned long 
> size)
>
>  /* Remap memory */
>  void __init *
> -early_memremap(resource_size_t phys_addr, unsigned long size)
> +early_memremap(resource_size_t phys_addr, unsigned long size,
> +  enum memremap_owner owner)
>  {
> return (void *)phys_addr;
>  }
>  void __init *
> -early_memremap_ro(resource_size_t phys_addr, unsigned long size)
> +early_memremap_ro(resource_size_t phys_addr, unsigned long size,
> + enum memremap_owner owner)
>  {
> return (void *)phys_addr;
>  }
>
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [RFC PATCH v2 04/20] x86: Secure Memory Encryption (SME) support

2016-08-30 Thread Andy Lutomirski
On Aug 30, 2016 6:34 AM, "Tom Lendacky"  wrote:
>
> On 08/25/2016 08:04 AM, Thomas Gleixner wrote:
> > On Mon, 22 Aug 2016, Tom Lendacky wrote:
> >
> >> Provide support for Secure Memory Encryption (SME). This initial support
> >> defines the memory encryption mask as a variable for quick access and an
> >> accessor for retrieving the number of physical addressing bits lost if
> >> SME is enabled.
> >
> > What is the reason that this needs to live in assembly code?
>
> In later patches this code is expanded and deals with a lot of page
> table manipulation, cpuid/rdmsr instructions, etc. and so I thought it
> was best to do it this way.

None of that sounds like it needs to be in asm, though.

I, at least, have a strong preference for minimizing the amount of asm
in the low-level arch code.

--Andy
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [RFC PATCH v1 00/18] x86: Secure Memory Encryption (AMD)

2016-05-02 Thread Andy Lutomirski
On Wed, Apr 27, 2016 at 1:10 PM, Tom Lendacky <thomas.lenda...@amd.com> wrote:
> On 04/27/2016 09:39 AM, Andy Lutomirski wrote:
>> On Tue, Apr 26, 2016 at 3:55 PM, Tom Lendacky <thomas.lenda...@amd.com> 
>> wrote:
>>> This RFC patch series provides support for AMD's new Secure Memory
>>> Encryption (SME) feature.
>>>
>>> SME can be used to mark individual pages of memory as encrypted through the
>>> page tables. A page of memory that is marked encrypted will be automatically
>>> decrypted when read from DRAM and will be automatically encrypted when
>>> written to DRAM. Details on SME can found in the links below.
>>
>> Having read through the docs briefly, some questions:
>>
>> 1. How does the crypto work?  Is it straight AES-ECB?  Is it a
>> tweakable mode?  If so, what goes into the tweak?  For example, if I
>> swap the ciphertext of two pages, does the plaintext of the pages get
>> swapped?  If not, why not?
>
> The AES crypto uses a tweak such that two identical plaintexts at
> different locations will have different ciphertext. So swapping the
> ciphertext of two pages will not result in the plaintext being swapped.

OK, makes sense.

>
>>
>> 2. In SEV mode, how does the hypervisor relocate a physical backing
>> page?  Does it simply move it and update the 2nd-level page tables?
>> If so, is the result of decryption guaranteed to be garbage if it
>> relocates a page and re-inserts it at the wrong guest physical
>> address?
>
> For SEV mode, relocating a physical backing page takes extra steps.
> There are APIs that are used to have the AMD Secure Processor create a
> transportable encrypted page that can then be moved to a new location
> in memory. After moving it to the new location the APIs are used to
> have the AMD Secure Processor re-encrypt the page for use with the
> guests SEV key. Based on #1 above, just moving a page without invoking
> the necessary APIs will result in the decryption returning garbage.
>
>>
>> 3. In SEV mode, does anything prevent the hypervisor from resuming a
>> guest with the wrong ASID, or is this all relying on the resulting
>> corruption of the guest code and data to cause a crash?
>
> There is nothing that prevents resuming a guest with the wrong ASID.
> This relies on the resulting corruption of the guest code/data to
> cause a crash.

This all seems somewhat useful, but I almost guarantee that if there
is ever anything economically important (or important for national
security reasons, or simply something that sounds fun for an
enterprising kid to break) that it *will* be broken in many creative
ways.

Someone will break it by replaying old data through the VM, either to
confuse control flow or to use some part of the VM code as an oracle
with which to attack another part.

Someone else will break it by installing a #UD / #PF handler and using
the resulting exceptions as an oracle.

A third clever person will break it by carefully constructing a
scenario in which randomizing 16 bytes of data has a high probability
of letting then pwn your system.  (For example, what if the secured VM
creates an RSA key and you can carefully interrupt it right after
generating p and q.  Replace 16 bytes from the middle of both p and q
(32 bytes total) with random garbage.  With reasonably high
probability, the resulting p and q will no longer be prime.)

Depending on how strong your ASID protection is, a fourth clever
person will break it by replacing a bunch of the VM out from under the
target while leaving the sensitive data in place and then will use
some existing exploit or design issue to gain code execution in the
modified VM.

Also, I really hope that your tweakable cipher mode is at least CCA2
secure, because attackers can absolutely hit it with adaptive chosen
ciphertext attacks.  (Actually, attackers can alternate between
adaptive chosen ciphertext and adaptive chosen plaintext.)

And did the SEV implementation remember to encrypt the guest register
state?  Because, if not, everything of importance will leak out
through the VMCB and/or GPRs.

But I guess it's better than nothing.

--Andy
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [RFC PATCH v1 01/18] x86: Set the write-protect cache mode for AMD processors

2016-04-27 Thread Andy Lutomirski
On Wed, Apr 27, 2016 at 8:31 AM, Borislav Petkov <b...@alien8.de> wrote:
> On Wed, Apr 27, 2016 at 08:12:56AM -0700, Andy Lutomirski wrote:
>> I think there are some errata
>
> Isn't that addressed by the first branch of the if-test in pat_init():
>
> if ((c->x86_vendor == X86_VENDOR_INTEL) &&
> (((c->x86 == 0x6) && (c->x86_model <= 0xd)) ||
>  ((c->x86 == 0xf) && (c->x86_model <= 0x6 {
>

That's the intent, but I'm unconvinced that it's complete.  The reason
that WT is in slot 7 is that if it accidentally ends up using the slot
3 entry instead of 7 (e.g. if a 2M page gets confused due to an
erratum we didn't handle or similar), then it falls back to UC, which
is safe.

But this is mostly moot in this case.  There is no safe fallback for
WP, but it doesn't really matter, because no one will actually try to
use it except on a system will full PAT support anyway.  So I'm not
really concerned.

>
> --
> Regards/Gruss,
> Boris.
>
> ECO tip #101: Trim your mails when you reply.



-- 
Andy Lutomirski
AMA Capital Management, LLC
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [RFC PATCH v1 01/18] x86: Set the write-protect cache mode for AMD processors

2016-04-27 Thread Andy Lutomirski
On Wed, Apr 27, 2016 at 8:05 AM, Tom Lendacky <thomas.lenda...@amd.com> wrote:
> On 04/27/2016 09:47 AM, Andy Lutomirski wrote:
>> On Wed, Apr 27, 2016 at 7:44 AM, Tom Lendacky <thomas.lenda...@amd.com> 
>> wrote:
>>> On 04/27/2016 09:33 AM, Andy Lutomirski wrote:
>>>> On Tue, Apr 26, 2016 at 3:56 PM, Tom Lendacky <thomas.lenda...@amd.com> 
>>>> wrote:
>>>>> For AMD processors that support PAT, set the write-protect cache mode
>>>>> (_PAGE_CACHE_MODE_WP) entry to the actual write-protect value (x05).
>>>>
>>>> What's the purpose of using the WP memory type?
>>>
>>> The WP memory type is used for encrypting or decrypting data "in place".
>>> The use of the WP on the source data will prevent any of the source
>>> data from being cached.  Refer to section 7.10.8 "Encrypt-in-Place" in
>>> the AMD64 APM link provided in the cover letter.
>>>
>>> This memory type will be used in subsequent patches for this purpose.
>>
>> OK.
>>
>> Why AMD-only?  I thought Intel supported WP, too.
>
> Just me being conservative. If there aren't any objections from the
> Intel folks about it we can remove the vendor check and just set it.

I think there are some errata that will cause high PAT references to
incorrectly reference the low parts of the table, but I don't recall
any that go the other way around.  So merely setting WP in a high
entry should be harmless unless something tries to use it.

>
> Thanks,
> Tom
>
>>
>> --Andy
>>



-- 
Andy Lutomirski
AMA Capital Management, LLC
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH V2 RFC] fixup! virtio: convert to use DMA api

2016-04-27 Thread Andy Lutomirski
On Wed, Apr 27, 2016 at 7:54 AM, Michael S. Tsirkin <m...@redhat.com> wrote:
> On Wed, Apr 27, 2016 at 07:43:07AM -0700, Andy Lutomirski wrote:
>> On Wed, Apr 27, 2016 at 7:38 AM, Michael S. Tsirkin <m...@redhat.com> wrote:
>> > On Wed, Apr 27, 2016 at 07:31:43AM -0700, Andy Lutomirski wrote:
>> >> On Wed, Apr 27, 2016 at 7:23 AM, Joerg Roedel <j...@8bytes.org> wrote:
>> >> > On Wed, Apr 27, 2016 at 04:37:04PM +0300, Michael S. Tsirkin wrote:
>> >> >> One correction: it's a feature of the device in the system.
>> >> >> There could be a mix of devices bypassing and not
>> >> >> bypassing the IOMMU.
>> >> >
>> >> > No, it really is not. A device can't chose to bypass the IOMMU. But the
>> >> > IOMMU can chose to let the device bypass. So any fix here belongs
>> >> > into the platform/iommu code too and not into some driver.
>> >> >
>> >> >> Sounds good. And a way to detect appropriate devices could
>> >> >> be by looking at the feature flag, perhaps?
>> >> >
>> >> > Again, no! The way to detect that is to look into the iommu description
>> >> > structures provided by the firmware. They provide everything necessary
>> >> > to tell the iommu code which devices are not translated.
>> >> >
>> >>
>> >> Except on PPC and SPARC.  As far as I know, those are the only
>> >> problematic platforms.
>> >>
>> >> Is it too late to *disable* QEMU's q35-iommu thingy until it can be
>> >> fixed to report correct data in the DMAR tables?
>> >>
>> >> --Andy
>> >
>> > Meaning virtio or assigned devices?
>> > For virtio - it's way too late since these are working configurations.
>> > For assigned devices - they don't work on x86 so it doesn't have
>> > to be disabled, it's safe to ignore.
>>
>> I mean actually prevent QEMU from running in q35-iommu mode with any
>> virtio devices attached or maybe even turn off q35-iommu mode entirely
>> [1].  Doesn't it require that the user literally pass the word
>> "experimental" into QEMU right now?  It did at some point IIRC.
>>
>> The reason I'm asking is that, other than q35-iommu, QEMU's virtio
>> devices *don't* bypass the IOMMU except on PPC and SPARC, simply
>> because there is no other configuration AFAICT that has virtio and an
>> IOMMU.  So maybe the right solution is to fix q35-iommu to use DMAR
>> correctly (thus breaking q35-iommu users with older guest kernels,
>> which hopefully don't actually exist) and to come up with a PPC- and
>> SPARC-specific solution, or maybe OpenFirmware-specific solution, to
>> handle PPC and SPARC down the road.
>>
>> [1] I'm pretty sure I emailed the QEMU list before q35-iommu ever
>> showed up in a release asking the QEMU team to please not do that
>> until this issue was resolved.  Sadly, that email was ignored :(
>>
>> --Andy
>
> Sorry, I didn't make myself clear.
> Point is, QEMU is not the only virtio implementation out there.
> So we can't know no virtio implementations have an IOMMU as long as
> linux supports this IOMMU.
> virtio always used physical addresses since it was born and if it
> changes that it must do this in a way that does not break existing
> users.

Is there any non-QEMU virtio implementation that can provide an
IOMMU-bypassing virtio device on a platform that has a nontrivial
IOMMU?

--Andy
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH V2 RFC] fixup! virtio: convert to use DMA api

2016-04-27 Thread Andy Lutomirski
On Wed, Apr 27, 2016 at 7:38 AM, Michael S. Tsirkin <m...@redhat.com> wrote:
> On Wed, Apr 27, 2016 at 07:31:43AM -0700, Andy Lutomirski wrote:
>> On Wed, Apr 27, 2016 at 7:23 AM, Joerg Roedel <j...@8bytes.org> wrote:
>> > On Wed, Apr 27, 2016 at 04:37:04PM +0300, Michael S. Tsirkin wrote:
>> >> One correction: it's a feature of the device in the system.
>> >> There could be a mix of devices bypassing and not
>> >> bypassing the IOMMU.
>> >
>> > No, it really is not. A device can't chose to bypass the IOMMU. But the
>> > IOMMU can chose to let the device bypass. So any fix here belongs
>> > into the platform/iommu code too and not into some driver.
>> >
>> >> Sounds good. And a way to detect appropriate devices could
>> >> be by looking at the feature flag, perhaps?
>> >
>> > Again, no! The way to detect that is to look into the iommu description
>> > structures provided by the firmware. They provide everything necessary
>> > to tell the iommu code which devices are not translated.
>> >
>>
>> Except on PPC and SPARC.  As far as I know, those are the only
>> problematic platforms.
>>
>> Is it too late to *disable* QEMU's q35-iommu thingy until it can be
>> fixed to report correct data in the DMAR tables?
>>
>> --Andy
>
> Meaning virtio or assigned devices?
> For virtio - it's way too late since these are working configurations.
> For assigned devices - they don't work on x86 so it doesn't have
> to be disabled, it's safe to ignore.

I mean actually prevent QEMU from running in q35-iommu mode with any
virtio devices attached or maybe even turn off q35-iommu mode entirely
[1].  Doesn't it require that the user literally pass the word
"experimental" into QEMU right now?  It did at some point IIRC.

The reason I'm asking is that, other than q35-iommu, QEMU's virtio
devices *don't* bypass the IOMMU except on PPC and SPARC, simply
because there is no other configuration AFAICT that has virtio and an
IOMMU.  So maybe the right solution is to fix q35-iommu to use DMAR
correctly (thus breaking q35-iommu users with older guest kernels,
which hopefully don't actually exist) and to come up with a PPC- and
SPARC-specific solution, or maybe OpenFirmware-specific solution, to
handle PPC and SPARC down the road.

[1] I'm pretty sure I emailed the QEMU list before q35-iommu ever
showed up in a release asking the QEMU team to please not do that
until this issue was resolved.  Sadly, that email was ignored :(

--Andy
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [RFC PATCH v1 00/18] x86: Secure Memory Encryption (AMD)

2016-04-27 Thread Andy Lutomirski
On Tue, Apr 26, 2016 at 3:55 PM, Tom Lendacky  wrote:
> This RFC patch series provides support for AMD's new Secure Memory
> Encryption (SME) feature.
>
> SME can be used to mark individual pages of memory as encrypted through the
> page tables. A page of memory that is marked encrypted will be automatically
> decrypted when read from DRAM and will be automatically encrypted when
> written to DRAM. Details on SME can found in the links below.

Having read through the docs briefly, some questions:

1. How does the crypto work?  Is it straight AES-ECB?  Is it a
tweakable mode?  If so, what goes into the tweak?  For example, if I
swap the ciphertext of two pages, does the plaintext of the pages get
swapped?  If not, why not?

2. In SEV mode, how does the hypervisor relocate a physical backing
page?  Does it simply move it and update the 2nd-level page tables?
If so, is the result of decryption guaranteed to be garbage if it
relocates a page and re-inserts it at the wrong guest physical
address?

3. In SEV mode, does anything prevent the hypervisor from resuming a
guest with the wrong ASID, or is this all relying on the resulting
corruption of the guest code and data to cause a crash?

4. As I understand it, the caches are all unencrypted, and they're
tagged with the physical address, *including* the SME bit (bit 47).
In SEV mode, are they also tagged with the ASID?  I.e. if I have a
page in cache for ASID 1 and I try to read it with ASID 2, will I get
a fresh copy decrypted with ASID 2's key?  If so, will the old ASID 1
copy be evicted, or will it stay in cache and be non-coherent?

--Andy
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [RFC PATCH v1 01/18] x86: Set the write-protect cache mode for AMD processors

2016-04-27 Thread Andy Lutomirski
On Tue, Apr 26, 2016 at 3:56 PM, Tom Lendacky  wrote:
> For AMD processors that support PAT, set the write-protect cache mode
> (_PAGE_CACHE_MODE_WP) entry to the actual write-protect value (x05).

What's the purpose of using the WP memory type?

--Andy
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH 2/5] vfio: introduce the VFIO_DMA_MAP_FLAG_NOEXEC flag

2014-10-20 Thread Andy Lutomirski
On Mon, Oct 13, 2014 at 6:09 AM, Antonios Motakis
a.mota...@virtualopensystems.com wrote:
> We introduce the VFIO_DMA_MAP_FLAG_NOEXEC flag to the VFIO dma map call,
> and expose its availability via the capability VFIO_DMA_NOEXEC_IOMMU.
> This way the user can control whether the XN flag will be set on the
> requested mappings. The IOMMU_NOEXEC flag needs to be available for all
> the IOMMUs of the container used.

Since you sent this to the linux-api list, I'll bite: what's the XN
flag?  I know what PROT_EXEC does when you mmap something, and I
presume that vfio is mmappable, but I don't actually have any clue
what this patch does.

I assume that this does not have anything to do with a non-CPU DMA
master executing code in main memory, because that makes rather little
sense.  (Or maybe it really does, in which case: weird.)

--Andy
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu