Re: [patch V2 04/25] x86/apic: Make apic_pending_intr_clear() more robust
On Sun, 7 Jul 2019, Thomas Gleixner wrote:
> On Fri, 5 Jul 2019, Paolo Bonzini wrote:
> > On 05/07/19 22:25, Thomas Gleixner wrote:
> > > The more interesting question is whether this is all relevant. If I
> > > understood the issue correctly then this is mitigated by proper interrupt
> > > remapping.
> >
> > Yes, and for Linux we're good I think. VFIO by default refuses to use
> > the IOMMU if interrupt remapping is absent or disabled, and KVM's own
>
> Confused. If it does not use IOMMU, what does it do? Hand in the device as
> is and let the guest fiddle with it unconstrained or does it actually
> refuse to pass through?

Read through it and it refuses to attach unless the allow_unsafe_interrupts
option is set, but again we can't protect against wilful ignorance.

So the default prevents abuse on systems without IOMMU and irq remapping,
so there is not much to worry about AFAICT.

Setting TPR to 1 and fixing the comments/changelogs still makes sense
independently.

Thanks,

	tglx
Re: [patch V2 04/25] x86/apic: Make apic_pending_intr_clear() more robust
On Fri, 5 Jul 2019, Paolo Bonzini wrote:
> On 05/07/19 22:25, Thomas Gleixner wrote:
> > The more interesting question is whether this is all relevant. If I
> > understood the issue correctly then this is mitigated by proper interrupt
> > remapping.
>
> Yes, and for Linux we're good I think. VFIO by default refuses to use
> the IOMMU if interrupt remapping is absent or disabled, and KVM's own

Confused. If it does not use IOMMU, what does it do? Hand in the device as
is and let the guest fiddle with it unconstrained or does it actually
refuse to pass through?

> (pre-VFIO) IOMMU support was removed a couple years ago. I guess the
> secure boot lockdown patches should outlaw VFIO's
> allow_unsafe_interrupts option, but that's it.

I'm not worried too much about command line options. The important thing
is the default behaviour. If an admin decides to do something stupid, so
be it.

Thanks,

	tglx
Re: [patch V2 04/25] x86/apic: Make apic_pending_intr_clear() more robust
On Fri, 5 Jul 2019, Andy Lutomirski wrote:
> On Fri, Jul 5, 2019 at 1:36 PM Thomas Gleixner wrote:
> > No. We can map the APIC into the user space visible page tables for PTI
> > without compromising the PTI isolation and it can be read very early on
> > before SWAPGS. All you need is a register to clobber, not more. If the
> > ISR is set, then go into an error path, yell loudly, issue EOI and
> > return.
> >
> > The only issue I can see is: It's slow :)
>
> I think this will be really extremely slow. If we can restrict this
> to x2apic machines, then maybe it's not so awful.

x2apic machines have working iommu/interrupt remapping.

> FWIW, if we just patch up the GS thing, then we are still vulnerable:
> the bad guy can arrange for a privileged process to have register
> state corresponding to a dangerous syscall and then send an int $0x80
> via the APIC.

Right, that's why you want to read the APIC:ISR to check where that thing
came from.

Thanks,

	tglx
Re: [patch V2 04/25] x86/apic: Make apic_pending_intr_clear() more robust
On 05/07/2019 21:49, Paolo Bonzini wrote:
> On 05/07/19 22:25, Thomas Gleixner wrote:
>> In practice, this makes Linux vulnerable to CVE-2011-1898 / XSA-3, which
>> I'm disappointed to see wasn't shared with other software vendors at the
>> time.
> Oh, that brings back memories. At the time I was working on Xen, so I
> remember that CVE. IIRC there was some mitigation but the fix was
> basically to print a very scary error message if you used VT-d without
> interrupt remapping. Maybe force the user to add something on the Xen
> command line too?

It was before my time. I have no public comment on how the other aspects
of it were handled.

>> Is there any serious usage of virtualization w/o interrupt remapping left
>> or have the machines which are not capable been retired already?
> I think they were already starting to disappear in 2011, as I don't
> remember much worry about customers that were using systems without it.

ISTR Nehalem/Westmere era systems were the first to support interrupt
remapping, but were totally crippled with errata to the point of needing
to turn a prerequisite feature (Queued Invalidation) off. I believe later
systems have it working to a first approximation.

As to the original question, whether people should be using such systems
is a different question to whether they actually are.

~Andrew
Re: [patch V2 04/25] x86/apic: Make apic_pending_intr_clear() more robust
On 05/07/19 22:25, Thomas Gleixner wrote:
> In practice, this makes Linux vulnerable to CVE-2011-1898 / XSA-3, which
> I'm disappointed to see wasn't shared with other software vendors at the
> time.

Oh, that brings back memories. At the time I was working on Xen, so I
remember that CVE. IIRC there was some mitigation but the fix was
basically to print a very scary error message if you used VT-d without
interrupt remapping. Maybe force the user to add something on the Xen
command line too?

> The more interesting question is whether this is all relevant. If I
> understood the issue correctly then this is mitigated by proper interrupt
> remapping.

Yes, and for Linux we're good I think. VFIO by default refuses to use
the IOMMU if interrupt remapping is absent or disabled, and KVM's own
(pre-VFIO) IOMMU support was removed a couple years ago. I guess the
secure boot lockdown patches should outlaw VFIO's
allow_unsafe_interrupts option, but that's it.

> Is there any serious usage of virtualization w/o interrupt remapping left
> or have the machines which are not capable been retired already?

I think they were already starting to disappear in 2011, as I don't
remember much worry about customers that were using systems without it.

Paolo
Re: [patch V2 04/25] x86/apic: Make apic_pending_intr_clear() more robust
On 05/07/2019 20:19, Nadav Amit wrote:
>> On Jul 5, 2019, at 8:47 AM, Andrew Cooper wrote:
>>
>> On 04/07/2019 16:51, Thomas Gleixner wrote:
>>> 2) The loop termination logic is interesting at best.
>>>
>>>    If the machine has no TSC or cpu_khz is not known yet it tries 1
>>>    million times to ack stale IRR/ISR bits. What?
>>>
>>>    With TSC it uses the TSC to calculate the loop termination. It takes a
>>>    timestamp at entry and terminates the loop when:
>>>
>>>        (rdtsc() - start_timestamp) >= (cpu_khz << 10)
>>>
>>>    That's roughly one second.
>>>
>>>    Both methods are problematic. The APIC has 256 vectors, which means
>>>    that in theory max. 256 IRR/ISR bits can be set. In practice this is
>>>    impossible as the first 32 vectors are reserved and not affected and
>>>    the chance that more than a few bits are set is close to zero.
>>
>> [Disclaimer. I talked to Thomas in private first, and he asked me to
>> post this publicly as the CVE is almost a decade old already.]
>>
>> I'm afraid that this isn't quite true.
>>
>> In terms of IDT vectors, the first 32 are reserved for exceptions, but
>> only the first 16 are reserved in the LAPIC. Vectors 16-31 are fair
>> game for incoming IPIs (SDM Vol3, 10.5.2 Valid Interrupt Vectors).
>>
>> In practice, this makes Linux vulnerable to CVE-2011-1898 / XSA-3, which
>> I'm disappointed to see wasn't shared with other software vendors at the
>> time.
> IIRC (and from skimming the CVE again) the basic problem in Xen was that
> MSIs can be used when devices are assigned to generate IRQs with arbitrary
> vectors. The mitigation was to require interrupt remapping to be enabled in
> the IOMMU when IOMMU is used for DMA remapping (i.e., device assignment).
>
> Are you concerned about this case, additional concrete ones, or is it about
> security hardening? (or am I missing something?)

The phrase "impossible as the first 32 vectors are reserved" stuck out,
because it's not true. That generally means that any logic derived from it
is also false. :)

In practice, I was thinking more about robustness against buggy
conditions. Setting TPR to 1 at start of day is very easy. Some of the
other protections, less so.

When it comes to virtualisation, security is an illusion when a guest
kernel has a real piece of hardware in its hands. Anyone who is under the
misapprehension otherwise should try talking to an IOMMU hardware
engineer and see the reaction on their face.

IOMMUs were designed to isolate devices when all controlling software was
of the same privilege level. They don't magically make the system safe
against a hostile guest device driver, which in the most basic case, can
still mount a DoS attempt with deliberately bad DMA.

~Andrew
Re: [patch V2 04/25] x86/apic: Make apic_pending_intr_clear() more robust
On Fri, Jul 5, 2019 at 1:36 PM Thomas Gleixner wrote:
> On Fri, 5 Jul 2019, Andy Lutomirski wrote:
> > On Fri, Jul 5, 2019 at 8:47 AM Andrew Cooper wrote:
> > > Because TPR is 0, an incoming IPI can trigger #AC, #CP, #VC or #SX
> > > without an error code on the stack, which results in a corrupt pt_regs
> > > in the exception handler, and a stack underflow on the way back out,
> > > most likely with a fault on IRET.
> > >
> > > These can be addressed by setting TPR to 0x10, which will inhibit
> > > delivery of any errant IPIs in this range, but some extra sanity logic
> > > may not go amiss. An error code on a 64bit stack can be spotted with
> > > `testb $8, %spl` due to %rsp being aligned before pushing the exception
> > > frame.
> >
> > Several years ago, I remember having a discussion with someone (Jan
> > Beulich, maybe?) about how to efficiently make the entry code figure
> > out the error code situation automatically. I suspect it was on IRC
> > and I can't find the logs. I'm thinking that maybe we should just
> > make Linux's idtentry code do something like this.
> >
> > If nothing else, we could make idtentry do:
> >
> >     testl $8, %esp  /* shorter than testb IIRC */
> >     jz 1f           /* or jnz -- too lazy to figure it out */
> >     pushq $-1
> > 1:
>
> Errm, no. We should not silently paper over it. If we detect that this
> came in with a wrong stack frame, i.e. not from a CPU originated
> exception, then we truly should yell loud. Also in that case you want to
> check the APIC:ISR and issue an EOI to clear it.

It gives us the option to replace idtentry with something table-driven. I
don't think I love it, but it's not an awful idea.

> > > Another interesting problem is an IPI with vector 0x80. A cunning
> > > attacker can use this to simulate system calls from unsuspecting
> > > positions in userspace, or for interrupting kernel context. At the very
> > > least the int0x80 path does an unconditional swapgs, so will try to run
> > > with the user gs, and I expect things will explode quickly from there.
> >
> > At least SMAP helps here on non-FSGSBASE systems. With FSGSBASE, I
>
> How does it help? It still crashes the kernel.
>
> > suppose we could harden this by adding a special check to int $0x80 to
> > validate GSBASE.
>
> > > One option here is to look at ISR and complain if it is found to be set.
> >
> > Barring some real hackery, we're toast long before we get far enough to
> > do that.
>
> No. We can map the APIC into the user space visible page tables for PTI
> without compromising the PTI isolation and it can be read very early on
> before SWAPGS. All you need is a register to clobber, not more. If the
> ISR is set, then go into an error path, yell loudly, issue EOI and
> return.
>
> The only issue I can see is: It's slow :)

I think this will be really extremely slow. If we can restrict this to
x2apic machines, then maybe it's not so awful.

FWIW, if we just patch up the GS thing, then we are still vulnerable:
the bad guy can arrange for a privileged process to have register
state corresponding to a dangerous syscall and then send an int $0x80
via the APIC.
Re: [patch V2 04/25] x86/apic: Make apic_pending_intr_clear() more robust
On Fri, Jul 5, 2019 at 1:25 PM Thomas Gleixner wrote:
> Andrew,
>
> > These can be addressed by setting TPR to 0x10, which will inhibit
>
> Right, that's easy and obvious.

This boots:

diff --git a/arch/x86/kernel/apic/apic.c b/arch/x86/kernel/apic/apic.c
index 177aa8ef2afa..5257c40bde6c 100644
--- a/arch/x86/kernel/apic/apic.c
+++ b/arch/x86/kernel/apic/apic.c
@@ -1531,11 +1531,14 @@ static void setup_local_APIC(void)
 #endif

 	/*
-	 * Set Task Priority to 'accept all'. We never change this
-	 * later on.
+	 * Set Task Priority to 'accept all except vectors 0-31'. An APIC
+	 * vector in the 16-31 range can be delivered otherwise, but we'll
+	 * think it's an exception and terrible things will happen.
+	 * We never change this later on.
 	 */
 	value = apic_read(APIC_TASKPRI);
 	value &= ~APIC_TPRI_MASK;
+	value |= 0x10;
 	apic_write(APIC_TASKPRI, value);

 	apic_pending_intr_clear();
Re: [patch V2 04/25] x86/apic: Make apic_pending_intr_clear() more robust
On Fri, 5 Jul 2019, Andy Lutomirski wrote:
> On Fri, Jul 5, 2019 at 8:47 AM Andrew Cooper wrote:
> > Because TPR is 0, an incoming IPI can trigger #AC, #CP, #VC or #SX
> > without an error code on the stack, which results in a corrupt pt_regs
> > in the exception handler, and a stack underflow on the way back out,
> > most likely with a fault on IRET.
> >
> > These can be addressed by setting TPR to 0x10, which will inhibit
> > delivery of any errant IPIs in this range, but some extra sanity logic
> > may not go amiss. An error code on a 64bit stack can be spotted with
> > `testb $8, %spl` due to %rsp being aligned before pushing the exception
> > frame.
>
> Several years ago, I remember having a discussion with someone (Jan
> Beulich, maybe?) about how to efficiently make the entry code figure
> out the error code situation automatically. I suspect it was on IRC
> and I can't find the logs. I'm thinking that maybe we should just
> make Linux's idtentry code do something like this.
>
> If nothing else, we could make idtentry do:
>
>     testl $8, %esp  /* shorter than testb IIRC */
>     jz 1f           /* or jnz -- too lazy to figure it out */
>     pushq $-1
> 1:

Errm, no. We should not silently paper over it. If we detect that this
came in with a wrong stack frame, i.e. not from a CPU originated
exception, then we truly should yell loud. Also in that case you want to
check the APIC:ISR and issue an EOI to clear it.

> > Another interesting problem is an IPI with vector 0x80. A cunning
> > attacker can use this to simulate system calls from unsuspecting
> > positions in userspace, or for interrupting kernel context. At the very
> > least the int0x80 path does an unconditional swapgs, so will try to run
> > with the user gs, and I expect things will explode quickly from there.
>
> At least SMAP helps here on non-FSGSBASE systems. With FSGSBASE, I

How does it help? It still crashes the kernel.

> suppose we could harden this by adding a special check to int $0x80 to
> validate GSBASE.

> > One option here is to look at ISR and complain if it is found to be set.
>
> Barring some real hackery, we're toast long before we get far enough to
> do that.

No. We can map the APIC into the user space visible page tables for PTI
without compromising the PTI isolation and it can be read very early on
before SWAPGS. All you need is a register to clobber, not more. If the
ISR is set, then go into an error path, yell loudly, issue EOI and return.

The only issue I can see is: It's slow :)

Thanks,

	tglx
Re: [patch V2 04/25] x86/apic: Make apic_pending_intr_clear() more robust
Andrew,

On Fri, 5 Jul 2019, Andrew Cooper wrote:
> On 04/07/2019 16:51, Thomas Gleixner wrote:
> > 2) The loop termination logic is interesting at best.
> >
> >    If the machine has no TSC or cpu_khz is not known yet it tries 1
> >    million times to ack stale IRR/ISR bits. What?
> >
> >    With TSC it uses the TSC to calculate the loop termination. It takes a
> >    timestamp at entry and terminates the loop when:
> >
> >        (rdtsc() - start_timestamp) >= (cpu_khz << 10)
> >
> >    That's roughly one second.
> >
> >    Both methods are problematic. The APIC has 256 vectors, which means
> >    that in theory max. 256 IRR/ISR bits can be set. In practice this is
> >    impossible as the first 32 vectors are reserved and not affected and
> >    the chance that more than a few bits are set is close to zero.
>
> [Disclaimer. I talked to Thomas in private first, and he asked me to
> post this publicly as the CVE is almost a decade old already.]

thanks for bringing this up!

> I'm afraid that this isn't quite true.
>
> In terms of IDT vectors, the first 32 are reserved for exceptions, but
> only the first 16 are reserved in the LAPIC. Vectors 16-31 are fair
> game for incoming IPIs (SDM Vol3, 10.5.2 Valid Interrupt Vectors).

Indeed.

> In practice, this makes Linux vulnerable to CVE-2011-1898 / XSA-3, which
> I'm disappointed to see wasn't shared with other software vendors at the
> time.

No comment.

> Because TPR is 0, an incoming IPI can trigger #AC, #CP, #VC or #SX
> without an error code on the stack, which results in a corrupt pt_regs
> in the exception handler, and a stack underflow on the way back out,
> most likely with a fault on IRET.
>
> These can be addressed by setting TPR to 0x10, which will inhibit

Right, that's easy and obvious.

> delivery of any errant IPIs in this range, but some extra sanity logic
> may not go amiss. An error code on a 64bit stack can be spotted with
> `testb $8, %spl` due to %rsp being aligned before pushing the exception
> frame.

The question is what we do with that information :)

> Another interesting problem is an IPI with vector 0x80. A cunning
> attacker can use this to simulate system calls from unsuspecting
> positions in userspace, or for interrupting kernel context. At the very
> least the int0x80 path does an unconditional swapgs, so will try to run
> with the user gs, and I expect things will explode quickly from there.

Cute.

> One option here is to look at ISR and complain if it is found to be set.

That's slow, but could at least provide an option to do so.

> Another option, which I've only just remembered, is that AMD hardware
> has the Interrupt Enable Register in its extended APIC space, which may
> or may not be good enough to prohibit delivery of 0x80. There isn't
> enough information in the APM to be clear, but the name suggests it is
> worth experimenting with.

I doubt it. Clearing a bit in the IER takes the interrupt out of the
priority decision logic. That's an SVM feature so interrupts directed
directly to guests cannot block other interrupts if they are not
serviced. It's grossly misnomed and won't help with the int80 issue.

The more interesting question is whether this is all relevant. If I
understood the issue correctly then this is mitigated by proper interrupt
remapping.

Is there any serious usage of virtualization w/o interrupt remapping left
or have the machines which are not capable been retired already?

Thanks,

	tglx
Re: [patch V2 04/25] x86/apic: Make apic_pending_intr_clear() more robust
On 05/07/2019 20:06, Andy Lutomirski wrote:
> On Fri, Jul 5, 2019 at 8:47 AM Andrew Cooper wrote:
>> On 04/07/2019 16:51, Thomas Gleixner wrote:
>>> 2) The loop termination logic is interesting at best.
>>>
>>>    If the machine has no TSC or cpu_khz is not known yet it tries 1
>>>    million times to ack stale IRR/ISR bits. What?
>>>
>>>    With TSC it uses the TSC to calculate the loop termination. It takes a
>>>    timestamp at entry and terminates the loop when:
>>>
>>>        (rdtsc() - start_timestamp) >= (cpu_khz << 10)
>>>
>>>    That's roughly one second.
>>>
>>>    Both methods are problematic. The APIC has 256 vectors, which means
>>>    that in theory max. 256 IRR/ISR bits can be set. In practice this is
>>>    impossible as the first 32 vectors are reserved and not affected and
>>>    the chance that more than a few bits are set is close to zero.
>> [Disclaimer. I talked to Thomas in private first, and he asked me to
>> post this publicly as the CVE is almost a decade old already.]
>>
>> I'm afraid that this isn't quite true.
>>
>> In terms of IDT vectors, the first 32 are reserved for exceptions, but
>> only the first 16 are reserved in the LAPIC. Vectors 16-31 are fair
>> game for incoming IPIs (SDM Vol3, 10.5.2 Valid Interrupt Vectors).
>>
>> In practice, this makes Linux vulnerable to CVE-2011-1898 / XSA-3, which
>> I'm disappointed to see wasn't shared with other software vendors at the
>> time.
>>
>> Because TPR is 0, an incoming IPI can trigger #AC, #CP, #VC or #SX
>> without an error code on the stack, which results in a corrupt pt_regs
>> in the exception handler, and a stack underflow on the way back out,
>> most likely with a fault on IRET.
>>
>> These can be addressed by setting TPR to 0x10, which will inhibit
>> delivery of any errant IPIs in this range, but some extra sanity logic
>> may not go amiss. An error code on a 64bit stack can be spotted with
>> `testb $8, %spl` due to %rsp being aligned before pushing the exception
>> frame.
> Several years ago, I remember having a discussion with someone (Jan
> Beulich, maybe?) about how to efficiently make the entry code figure
> out the error code situation automatically. I suspect it was on IRC
> and I can't find the logs.

It was on IRC, but I don't remember exactly when, either.

> I'm thinking that maybe we should just
> make Linux's idtentry code do something like this.
>
> If nothing else, we could make idtentry do:
>
>     testl $8, %esp  /* shorter than testb IIRC */

Sadly not. test (unlike cmp and the basic mutative opcodes) doesn't have
a sign-extendable imm8 encoding. The two options are:

  f7 c4 08 00 00 00    test $0x8,%esp
  40 f6 c4 08          test $0x8,%spl

>     jz 1f           /* or jnz -- too lazy to figure it out */
>     pushq $-1
> 1:

It is jz, and Xen does use this sequence for reserved/unimplemented
vectors, but we expect those codepaths never to be executed.

> instead of the current hardcoded push. The cost of a mispredicted
> branch here will be smallish compared to the absurdly large cost of
> the entry itself. But I thought I had something more clever than
> this. This sequence works, but it still feels like it should be
> possible to do better:
>
> .macro PUSH_ERROR_IF_NEEDED
> 	/*
> 	 * Before the IRET frame is pushed, RSP is aligned to a 16-byte
> 	 * boundary. After SS .. RIP and the error code are pushed, RSP is
> 	 * once again aligned. Pushing -1 will put -1 in the error code slot
> 	 * (regs->orig_ax) if there was no error code.
> 	 */
>
> 	pushq	$-1	/* orig_ax = -1, maybe */
> 	/* now RSP points to orig_ax (aligned) or di (misaligned) */
> 	pushq	$0
> 	/* now RSP points to di (misaligned) or si (aligned) */
> 	orq	$8, %rsp
> 	/* now RSP points to di */
> 	addq	$8, %rsp
> 	/* now RSP points to orig_ax, and we're in good shape */
> .endm
>
> Is there a better sequence for this?

The only aspect I can think of is whether mixing the push/pops with
explicit updates to %rsp is better or worse than a very well predicted
branch, given that various frontends have special tracking to reduce
instruction dependencies on %rsp. I'll have to defer to the CPU
microarchitects as to which of the two options is the lesser evil.

That said, both Intel and AMD's Optimisation guides have stack alignment
suggestions which mix push/sub/and on function prolog, so I expect this
is as optimised as it can reasonably be in the pipelines.

>> Another interesting problem is an IPI with vector 0x80. A cunning
>> attacker can use this to simulate system calls from unsuspecting
>> positions in userspace, or for interrupting kernel context. At the very
>> least the int0x80 path does an unconditional swapgs, so will try to run
>> with the user gs, and I expect things will explode quickly from there.
> At least SMAP helps here on non-FSGSBASE systems. With FSGSBASE, I
> suppose we could harden this by adding a special check to int $0x80 to
> validate GSBASE.
Re: [patch V2 04/25] x86/apic: Make apic_pending_intr_clear() more robust
> On Jul 5, 2019, at 8:47 AM, Andrew Cooper wrote:
>
> On 04/07/2019 16:51, Thomas Gleixner wrote:
>> 2) The loop termination logic is interesting at best.
>>
>>    If the machine has no TSC or cpu_khz is not known yet it tries 1
>>    million times to ack stale IRR/ISR bits. What?
>>
>>    With TSC it uses the TSC to calculate the loop termination. It takes a
>>    timestamp at entry and terminates the loop when:
>>
>>        (rdtsc() - start_timestamp) >= (cpu_khz << 10)
>>
>>    That's roughly one second.
>>
>>    Both methods are problematic. The APIC has 256 vectors, which means
>>    that in theory max. 256 IRR/ISR bits can be set. In practice this is
>>    impossible as the first 32 vectors are reserved and not affected and
>>    the chance that more than a few bits are set is close to zero.
>
> [Disclaimer. I talked to Thomas in private first, and he asked me to
> post this publicly as the CVE is almost a decade old already.]
>
> I'm afraid that this isn't quite true.
>
> In terms of IDT vectors, the first 32 are reserved for exceptions, but
> only the first 16 are reserved in the LAPIC. Vectors 16-31 are fair
> game for incoming IPIs (SDM Vol3, 10.5.2 Valid Interrupt Vectors).
>
> In practice, this makes Linux vulnerable to CVE-2011-1898 / XSA-3, which
> I'm disappointed to see wasn't shared with other software vendors at the
> time.

IIRC (and from skimming the CVE again) the basic problem in Xen was that
MSIs can be used when devices are assigned to generate IRQs with arbitrary
vectors. The mitigation was to require interrupt remapping to be enabled
in the IOMMU when the IOMMU is used for DMA remapping (i.e., device
assignment).

Are you concerned about this case, additional concrete ones, or is it
about security hardening? (or am I missing something?)
Re: [patch V2 04/25] x86/apic: Make apic_pending_intr_clear() more robust
On Fri, Jul 5, 2019 at 8:47 AM Andrew Cooper wrote:
>
> On 04/07/2019 16:51, Thomas Gleixner wrote:
> > 2) The loop termination logic is interesting at best.
> >
> >    If the machine has no TSC or cpu_khz is not known yet it tries 1
> >    million times to ack stale IRR/ISR bits. What?
> >
> >    With TSC it uses the TSC to calculate the loop termination. It takes a
> >    timestamp at entry and terminates the loop when:
> >
> >        (rdtsc() - start_timestamp) >= (cpu_khz << 10)
> >
> >    That's roughly one second.
> >
> >    Both methods are problematic. The APIC has 256 vectors, which means
> >    that in theory max. 256 IRR/ISR bits can be set. In practice this is
> >    impossible as the first 32 vectors are reserved and not affected and
> >    the chance that more than a few bits are set is close to zero.
>
> [Disclaimer. I talked to Thomas in private first, and he asked me to
> post this publicly as the CVE is almost a decade old already.]
>
> I'm afraid that this isn't quite true.
>
> In terms of IDT vectors, the first 32 are reserved for exceptions, but
> only the first 16 are reserved in the LAPIC. Vectors 16-31 are fair
> game for incoming IPIs (SDM Vol3, 10.5.2 Valid Interrupt Vectors).
>
> In practice, this makes Linux vulnerable to CVE-2011-1898 / XSA-3, which
> I'm disappointed to see wasn't shared with other software vendors at the
> time.
>
> Because TPR is 0, an incoming IPI can trigger #AC, #CP, #VC or #SX
> without an error code on the stack, which results in a corrupt pt_regs
> in the exception handler, and a stack underflow on the way back out,
> most likely with a fault on IRET.
>
> These can be addressed by setting TPR to 0x10, which will inhibit
> delivery of any errant IPIs in this range, but some extra sanity logic
> may not go amiss. An error code on a 64bit stack can be spotted with
> `testb $8, %spl` due to %rsp being aligned before pushing the exception
> frame.

Several years ago, I remember having a discussion with someone (Jan
Beulich, maybe?) about how to efficiently make the entry code figure
out the error code situation automatically. I suspect it was on IRC
and I can't find the logs. I'm thinking that maybe we should just
make Linux's idtentry code do something like this.

If nothing else, we could make idtentry do:

	testl $8, %esp	/* shorter than testb IIRC */
	jz 1f		/* or jnz -- too lazy to figure it out */
	pushq $-1
1:

instead of the current hardcoded push. The cost of a mispredicted
branch here will be smallish compared to the absurdly large cost of
the entry itself. But I thought I had something more clever than
this. This sequence works, but it still feels like it should be
possible to do better:

.macro PUSH_ERROR_IF_NEEDED
	/*
	 * Before the IRET frame is pushed, RSP is aligned to a 16-byte
	 * boundary. After SS .. RIP and the error code are pushed, RSP is
	 * once again aligned. Pushing -1 will put -1 in the error code slot
	 * (regs->orig_ax) if there was no error code.
	 */

	pushq	$-1	/* orig_ax = -1, maybe */
	/* now RSP points to orig_ax (aligned) or di (misaligned) */
	pushq	$0
	/* now RSP points to di (misaligned) or si (aligned) */
	orq	$8, %rsp
	/* now RSP points to di */
	addq	$8, %rsp
	/* now RSP points to orig_ax, and we're in good shape */
.endm

Is there a better sequence for this?

A silly downside here is that the ORC annotations need to know the offset
to the IRET frame. Josh, how ugly would it be to teach the unwinder that
UNWIND_HINT_IRET_REGS actually means "hey, maybe I'm 8 bytes off --
please realign RSP when doing your calculation"?

FWIW, the entry code is rather silly here in that it actually only uses
the orig_ax slot as a temporary dumping ground for the error code and
then it replaces it with -1 later on. I don't remember whether anything
still cares about the -1. Once upon a time, there was some code that
assumed that -1 meant "not in a syscall" and anything else meant "in a
syscall", but, if so, I suspect we should just kill that code regardless.
> Another interesting problem is an IPI with vector 0x80. A cunning
> attacker can use this to simulate system calls from unsuspecting
> positions in userspace, or for interrupting kernel context. At the very
> least the int0x80 path does an unconditional swapgs, so will try to run
> with the user gs, and I expect things will explode quickly from there.

At least SMAP helps here on non-FSGSBASE systems. With FSGSBASE, I
suppose we could harden this by adding a special check to int $0x80 to
validate GSBASE.

> One option here is to look at ISR and complain if it is found to be set.

Barring some real hackery, we're toast long before we get far enough to
do that.
Re: [patch V2 04/25] x86/apic: Make apic_pending_intr_clear() more robust
On 04/07/2019 16:51, Thomas Gleixner wrote:
> 2) The loop termination logic is interesting at best.
>
>    If the machine has no TSC or cpu_khz is not known yet it tries 1
>    million times to ack stale IRR/ISR bits. What?
>
>    With TSC it uses the TSC to calculate the loop termination. It takes a
>    timestamp at entry and terminates the loop when:
>
>        (rdtsc() - start_timestamp) >= (cpu_khz << 10)
>
>    That's roughly one second.
>
>    Both methods are problematic. The APIC has 256 vectors, which means
>    that in theory max. 256 IRR/ISR bits can be set. In practice this is
>    impossible as the first 32 vectors are reserved and not affected and
>    the chance that more than a few bits are set is close to zero.

[Disclaimer. I talked to Thomas in private first, and he asked me to
post this publicly as the CVE is almost a decade old already.]

I'm afraid that this isn't quite true.

In terms of IDT vectors, the first 32 are reserved for exceptions, but
only the first 16 are reserved in the LAPIC. Vectors 16-31 are fair
game for incoming IPIs (SDM Vol3, 10.5.2 Valid Interrupt Vectors).

In practice, this makes Linux vulnerable to CVE-2011-1898 / XSA-3, which
I'm disappointed to see wasn't shared with other software vendors at the
time.

Because TPR is 0, an incoming IPI can trigger #AC, #CP, #VC or #SX
without an error code on the stack, which results in a corrupt pt_regs
in the exception handler, and a stack underflow on the way back out,
most likely with a fault on IRET.

These can be addressed by setting TPR to 0x10, which will inhibit
delivery of any errant IPIs in this range, but some extra sanity logic
may not go amiss. An error code on a 64bit stack can be spotted with
`testb $8, %spl` due to %rsp being aligned before pushing the exception
frame.

Another interesting problem is an IPI with vector 0x80. A cunning
attacker can use this to simulate system calls from unsuspecting
positions in userspace, or for interrupting kernel context. At the very
least the int0x80 path does an unconditional swapgs, so will try to run
with the user gs, and I expect things will explode quickly from there.

One option here is to look at ISR and complain if it is found to be set.

Another option, which I've only just remembered, is that AMD hardware
has the Interrupt Enable Register in its extended APIC space, which may
or may not be good enough to prohibit delivery of 0x80. There isn't
enough information in the APM to be clear, but the name suggests it is
worth experimenting with.

~Andrew