On 12/08/2019 09:00, Jan Beulich wrote:
> On 09.08.2019 19:16, Andrew Cooper wrote:
>> --- a/docs/misc/xen-command-line.pandoc
>> +++ b/docs/misc/xen-command-line.pandoc
>> @@ -1914,7 +1914,7 @@ By default SSBD will be mitigated at runtime (i.e `ssbd=runtime`).
>>   ### spec-ctrl (x86)
>>   > `= List of [ <bool>, xen=<bool>, {pv,hvm,msr-sc,rsb,md-clear}=<bool>,
>>   >              bti-thunk=retpoline|lfence|jmp, {ibrs,ibpb,ssbd,eager-fpu,
>> ->              l1d-flush,l1tf-barrier}=<bool> ]`
>> +>              l1d-flush,stale-seg-clear,l1tf-barrier}=<bool> ]`
>>     Controls for speculative execution sidechannel mitigations.  By default, Xen
>>   will pick the most appropriate mitigations based on compiled in support,
>> @@ -1986,6 +1986,12 @@ Irrespective of Xen's setting, the feature is virtualised for HVM guests to
>>   use.  By default, Xen will enable this mitigation on hardware believed to be
>>   vulnerable to L1TF.
>> 
>> +On all hardware, the `stale-seg-clear=` option can be used to force or prevent
>> +Xen from clearing the microarchitectural segment register copies on context
>> +switch.  By default, Xen will choose to use stale segment clearing on affected
>> +hardware.  The clearing logic is tuned to microarchitectural details of the
>> +affected CPUs.
>> +
>>   On hardware vulnerable to L1TF, the `l1tf-barrier=` option can be used to force
>>   or prevent Xen from protecting evaluations inside the hypervisor with a barrier
>>   instruction to not load potentially secret information into L1 cache.  By
>
> Purely out of curiosity: Is there a reason both insertions happen between
> two pre-existing sub-options, not after all of them?

Easier backportability.  That, and l1tf-barrier is not going to stay in
this form by the time 4.13 gets released.

>
>> --- a/xen/arch/x86/domain.c
>> +++ b/xen/arch/x86/domain.c
>> @@ -1328,6 +1328,29 @@ static void load_segments(struct vcpu *n)
>>       dirty_segment_mask = per_cpu(dirty_segment_mask, cpu);
>>       per_cpu(dirty_segment_mask, cpu) = 0;
>> 
>> +    /*
>> +     * CPUs which suffer from stale segment register leakage have two copies
>> +     * of each segment register [Correct at the time of writing - Aug 2019].
>> +     *
>> +     * We must write to both of them to scrub state from the previous vcpu.
>> +     * However, two writes in quick succession stall the pipeline, as only one
>> +     * write per segment can be speculatively outstanding.
>> +     *
>> +     * Split the two sets of writes to each register to maximise the chance
>> +     * that these writes have retired before the second set starts, thus
>> +     * reducing the chance of stalling.
>> +     */
>> +    if ( opt_stale_seg_clear )
>> +    {
>> +        asm volatile ( "mov %[sel], %%ds;"
>> +                       "mov %[sel], %%es;"
>> +                       "mov %[sel], %%fs;"
>> +                       "mov %[sel], %%gs;"
>> +                       :: [sel] "r" (0) );
>> +        /* Force a reload of all segments to be the second scrubbing write. */
>> +        dirty_segment_mask = ~0;
>> +    }
>
> Having the command line option, do we care about people actually using
> it on AMD hardware?

I care that it can be forced on to test.  Beyond that, I expect the
overhead will be large on Atom and AMD, which is why the command line
doc was worded a bit more forcefully than other options, and is hinting
at "don't try using this on non-affected hardware".

> For %ds and %es this would now lead to up to three
> selector register loads each, with the one above being completely
> pointless (due to not clearing the segment bases anyway). Question of
> course is whether adding yet another conditional (to the added code
> above or to preload_segment()) would be any better than having this
> extra selector register write.

preload_segment() needs to change anyway given Rome's behaviour, but I
need to wait for feedback from AMD as to whether they are happy with my
CPUID bit and erratum proposal.

>
> Furthermore, if we cared, shouldn't SVM code also honor the flag and
> gain extra loads, albeit perhaps with unlikely() used in the if()?

If you observe, svm_ctxt_switch_to() already has this unconditionally.

This was added in f9caf5ffaba, but I'm not convinced it can be correct.
I don't see equivalent logic in KVM, or any description to this effect
in the APM.  It might be working around an erratum on an early
SVM-capable processor.
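
For context, the unconditional reload in svm_ctxt_switch_to() is
roughly of this shape (paraphrased from memory for illustration - the
exact selectors and the justifying comment are in f9caf5ffaba, not
here):

    static void svm_ctxt_switch_to(struct vcpu *v)
    {
        /* Unconditional selector reloads ahead of VMRUN (shape only). */
        asm volatile ( "mov %[sel], %%ds;"
                       "mov %[sel], %%es;"
                       "mov %[sel], %%ss;"
                       :: [sel] "r" (0) );
        ...
    }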

>
> As to your choice of loading a nul selector: For the VERW change iirc
> we've been told that using a nul selector is a bad choice from
> performance pov. Are nul selectors being treated better for selector
> register writes?

VERW and segment loads are very different.

VERW is microcoded, unconditionally references memory, and doesn't have
a plausible use case on the 286 and later where you would call it with a NUL
selector (until the MDS side effect came along).  As a result, handling
NUL is the slowpath in microcode.

Loading a segment selector with NUL has a specific meaning, and is the
common case in 64bit code.  It does not reference memory.

In practice, my microperf tests show no difference in net time between
using a real selector and NUL.  The majority of samples come in
cycle-identical.

This might mean that NUL loads take the slowpath anyway (despite having
no memory read to perform), but it more likely means that other aspects
of segment loading dominate.
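
For reference, the kind of harness I mean is a standalone user-space
loop of the following shape (illustrative only - the helper names,
iteration count and structure here are mine, not the actual microperf
test):

    #include <stdint.h>
    #include <stdio.h>

    static inline uint64_t rdtscp_cycles(void)
    {
        uint32_t lo, hi, aux;

        /* RDTSCP waits for prior instructions to retire. */
        asm volatile ( "rdtscp" : "=a" (lo), "=d" (hi), "=c" (aux) );
        (void)aux;

        return ((uint64_t)hi << 32) | lo;
    }

    static uint64_t time_seg_loads(uint16_t sel, unsigned int iters)
    {
        uint64_t start = rdtscp_cycles();

        for ( unsigned int i = 0; i < iters; i++ )
            /* Two selector loads per iteration, mirroring the scrub. */
            asm volatile ( "mov %[s], %%ds; mov %[s], %%es"
                           :: [s] "r" (sel) : "memory" );

        return rdtscp_cycles() - start;
    }

    int main(void)
    {
        uint16_t data_sel;

        /*
         * Use the current %ss as the "real selector" case - on 64bit,
         * %ds itself is typically NUL already in userspace.
         */
        asm volatile ( "mov %%ss, %[s]" : [s] "=r" (data_sel) );

        printf("NUL:  %llu cycles\n",
               (unsigned long long)time_seg_loads(0, 1000));
        printf("real: %llu cycles\n",
               (unsigned long long)time_seg_loads(data_sel, 1000));

        return 0;
    }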

>
>> @@ -872,11 +873,38 @@ static void vmx_ctxt_switch_from(struct vcpu *v)
>> 
>>   static void vmx_ctxt_switch_to(struct vcpu *v)
>>   {
>> +    /*
>> +     * CPUs which suffer from stale segment register leakage have two copies
>> +     * of each segment register [Correct at the time of writing - Aug 2019].
>> +     *
>> +     * We must write to both of them to scrub state from the previous vcpu.
>> +     * However, two writes in quick succession stall the pipeline, as only one
>> +     * write per segment can be speculatively outstanding.
>> +     *
>> +     * Split the two sets of writes to each register to maximise the chance
>> +     * that these writes have retired before the second set starts, thus
>> +     * reducing the chance of stalling.
>> +     */
>> +    if ( opt_stale_seg_clear )
>> +        asm volatile ( "mov %[sel], %%ds;"
>> +                       "mov %[sel], %%es;"
>> +                       "mov %[sel], %%fs;"
>> +                       "mov %[sel], %%gs;"
>> +                       :: [sel] "r" (0) );
>> +
>>       vmx_restore_guest_msrs(v);
>>       vmx_restore_dr(v);
>>         if ( v->domain->arch.hvm.pi_ops.flags & PI_CSW_TO )
>>           vmx_pi_switch_to(v);
>> +
>> +    /* Should be last in the function.  See above. */
>> +    if ( opt_stale_seg_clear )
>> +        asm volatile ( "mov %[sel], %%ds;"
>> +                       "mov %[sel], %%es;"
>> +                       "mov %[sel], %%fs;"
>> +                       "mov %[sel], %%gs;"
>> +                       :: [sel] "r" (0) );
>>   }
>
> Why two instances? Aren't the selector register loads during VM
> entry sufficient to clear both instances? (The question is
> rhetorical in a way, because I think I know the answer, but
> neither the comment here nor the patch description provide it.)

This really depends on how much information Intel are willing to make
public on the subject.

I don't think it is reasonable to presume that VMEntry works as a
sequence of architectural instructions (it provably doesn't when it
comes to control registers), which is why I commented the logic in this way.

>
>> @@ -111,6 +114,7 @@ static int __init parse_spec_ctrl(const char *s)
>>               opt_ibpb = false;
>>               opt_ssbd = false;
>>               opt_l1d_flush = 0;
>> +            opt_stale_seg_clear = 0;
>>           }
>
> Is it really correct for this to be affected by both "spec-ctrl=no"
> and "spec-ctrl=no-xen"? It would seem to me that it would belong
> above the "disable_common" label, as the control also is meant to
> guard against guest->guest leaks.

spec-ctrl=no-xen is ill-defined.

As stated, it says "don't do anything in Xen except enough to virtualise
capabilities for guests".

However, an argument could be made to say that this is conceptually
similar to eager-fpu.

Either way, I think a clarification of what no-xen means is needed first.
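
To illustrate what Jan is proposing, the placement would look roughly
like this (sketch only - the surrounding parse_spec_ctrl() structure is
abbreviated, and the split between the two sections is exactly the
point under discussion):

    if ( !val )                      /* "spec-ctrl=no" - blanket disable */
    {
        ...
        opt_stale_seg_clear = 0;     /* guest->guest protection off too */

    disable_common:                  /* also reached via "spec-ctrl=no-xen" */
        ...
        opt_l1d_flush = 0;
        /* opt_stale_seg_clear deliberately not touched here. */
    }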

>
>> @@ -864,6 +871,83 @@ static __init void mds_calculations(uint64_t caps)
>>       }
>>   }
>> 
>> +/* Calculate whether this CPU leaks segment registers between contexts. */
>> +static void __init stale_segment_calculations(void)
>> +{
>> +    /*
>> +     * Assume all unrecognised processors are ok.  This is only known to
>> +     * affect Intel Family 6 processors.
>> +     */
>> +    if ( boot_cpu_data.x86_vendor != X86_VENDOR_INTEL ||
>> +         boot_cpu_data.x86 != 6 )
>> +        return;
>
> The comment above here and ...
>
>> +    switch ( boot_cpu_data.x86_model )
>> +    {
>> +        /*
>> +         * Core processors since at least Nehalem are vulnerable.
>> +         */
>> +    case 0x1e: /* Nehalem */
>> +    case 0x1f: /* Auburndale / Havendale */
>> +    case 0x1a: /* Nehalem EP */
>> +    case 0x2e: /* Nehalem EX */
>> +    case 0x25: /* Westmere */
>> +    case 0x2c: /* Westmere EP */
>> +    case 0x2f: /* Westmere EX */
>> +    case 0x2a: /* SandyBridge */
>> +    case 0x2d: /* SandyBridge EP/EX */
>> +    case 0x3a: /* IvyBridge */
>> +    case 0x3e: /* IvyBridge EP/EX */
>> +    case 0x3c: /* Haswell */
>> +    case 0x3f: /* Haswell EX/EP */
>> +    case 0x45: /* Haswell D */
>> +    case 0x46: /* Haswell H */
>> +    case 0x3d: /* Broadwell */
>> +    case 0x47: /* Broadwell H */
>> +    case 0x4f: /* Broadwell EP/EX */
>> +    case 0x56: /* Broadwell D */
>> +    case 0x4e: /* Skylake M */
>> +    case 0x55: /* Skylake X */
>> +    case 0x5e: /* Skylake D */
>> +    case 0x66: /* Cannonlake */
>> +    case 0x67: /* Cannonlake? */
>> +    case 0x8e: /* Kabylake M */
>> +    case 0x9e: /* Kabylake D */
>> +        cpu_has_bug_stale_seg = true;
>> +        break;
>> +
>> +        /*
>> +         * Atom processors are not vulnerable.
>> +         */
>> +    case 0x1c: /* Pineview */
>> +    case 0x26: /* Lincroft */
>> +    case 0x27: /* Penwell */
>> +    case 0x35: /* Cloverview */
>> +    case 0x36: /* Cedarview */
>> +    case 0x37: /* Baytrail / Valleyview (Silvermont) */
>> +    case 0x4d: /* Avaton / Rangely (Silvermont) */
>> +    case 0x4c: /* Cherrytrail / Brasswell */
>> +    case 0x4a: /* Merrifield */
>> +    case 0x5a: /* Moorefield */
>> +    case 0x5c: /* Goldmont */
>> +    case 0x5f: /* Denverton */
>> +    case 0x7a: /* Gemini Lake */
>> +        break;
>> +
>> +        /*
>> +         * Knights processors are not vulnerable.
>> +         */
>> +    case 0x57: /* Knights Landing */
>> +    case 0x85: /* Knights Mill */
>> +        break;
>> +
>> +    default:
>> +        printk("Unrecognised CPU model %#x - assuming vulnerable to
>> StaleSeg\n",
>> +               boot_cpu_data.x86_model);
>> +        break;
>
> ... the plain "break" here are not in line with the log message text.
> Did you mean to add "not"?

No.  I intended this path to set cpu_has_bug_stale_seg.
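
i.e. the default case was intended to end up as something like:

    default:
        printk("Unrecognised CPU model %#x - assuming vulnerable to StaleSeg\n",
               boot_cpu_data.x86_model);
        cpu_has_bug_stale_seg = true;
        break;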

~Andrew
