Re: Avoid speculative indirect calls in kernel
Patchmeister Torvalds: "Or is Intel basically saying 'we are committed to selling you shit forever and ever, and never fixing anything'?"

Back in the Celeron days, Intel was popular because you could overclock the smaller-cache Celeron 300 MHz to ~500 MHz. Everybody knew then not to get anything pricier. Yet Intel still sells Xeons at 3x the price for little noticeable gain?

Basically, I did research on philosophy, and indeed the main concept of a culture is what determines that culture's behaviour. Even the internal design of the CPU seems to be inspired by "God". I have tried the zén-realized version, "Zün", instead: God of absolute reality. Because regressions in philosophy are regressions in computing, which is ultimately Bill Gates talking about fecal water.

--
Peaceful regards,
Ywe Cærlyn
Re: Avoid speculative indirect calls in kernel
On Jan 5, 12:12pm, Alan Cox wrote:
} Subject: Re: Avoid speculative indirect calls in kernel

Good morning to everyone, a bit behind on mail given everything which
has been going on.

> On Fri, 5 Jan 2018 01:54:13 +0100 (CET)
> Thomas Gleixner <t...@linutronix.de> wrote:
>
> > On Thu, 4 Jan 2018, Jon Masters wrote:
> > > P.S. I've an internal document where I've been tracking "nice to haves"
> > > for later, and one of them is whether it makes sense to tag binaries as
> > > "trusted" (e.g. extended attribute, label, whatever). It was something I
> > > wanted to bring up at some point as potentially worth considering.
> >
> > Scratch that. There is no such thing as a trusted binary.
>
> There is if you are using signing and the like. I'm sure SELinux and
> friends will grow the ability to set per process policy but that's
> certainly not a priority.
>
> However the question is wrong. 'trusted' is a binary operator not a
> unary one.

Alan's observations are correct.

In our autonomous introspection work we apply the notion that 'trusted'
is a binary characteristic of a context of execution (COE). Its value
expresses whether or not the information exchange events it has been
involved in have deviated from the desired execution trajectory of the
system.

It is a decidedly different way of thinking about things. Most
importantly, it is a namespaceable characteristic.

We have already written the futuristic LSM that Alan alludes to, in
order to implement per-COE security policies and forensics for
actors/COEs that have gone over to the 'dark side'.

> Alan

Have a good weekend.

Dr. Greg

}-- End of excerpt from Alan Cox

As always,
Dr. G.W. Wettstein, Ph.D.
Enjellic Systems Development, LLC.
4206 N. 19th Ave., Fargo, ND 58102
Specializing in information infra-structure development.
PH: 701-281-1686
FAX: 701-281-3949
EMAIL: g...@enjellic.com

"Given a choice between a complex, difficult-to-understand, disconcerting
explanation and a simplistic, comforting one, many prefer simplistic
comfort if it's remotely plausible, especially if it involves blaming
someone else for their problems."
                                -- Bob Lewis, _Infoworld_
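Alan's remark that 'trusted' is a binary operator rather than a unary one can be made concrete as a relation: the question is never "is this binary trusted?" but "does subject S trust object O?". The following is a minimal, hypothetical C sketch; the `trust_edge` table, its labels, and the `trusts()` helper are illustrative assumptions and do not correspond to SELinux or any other existing LSM interface.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical sketch: trust as a relation trusts(subject, object)
 * instead of a property attached to a single binary. */
struct trust_edge {
    const char *subject; /* the party doing the trusting */
    const char *object;  /* what is being trusted, e.g. a signing key */
};

/* Illustrative policy; the labels are made up for this example. */
static const struct trust_edge policy[] = {
    { "init",            "signed:vendor-key" },
    { "package-manager", "signed:vendor-key" },
    { "sandboxed-app",   "signed:app-store-key" },
};

/* Returns true only if an explicit (subject, object) edge exists. */
static bool trusts(const char *subject, const char *object)
{
    for (size_t i = 0; i < sizeof(policy) / sizeof(policy[0]); i++) {
        if (strcmp(policy[i].subject, subject) == 0 &&
            strcmp(policy[i].object, object) == 0)
            return true;
    }
    return false; /* default deny: no edge means no trust */
}
```

With this shape, the same vendor-signed binary is trusted by `init` but not by `sandboxed-app`; a real policy engine would derive the edges from signatures and labels rather than from a static table.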
Re: Avoid speculative indirect calls in kernel
On Tue, 9 Jan 2018, Dave Hansen wrote:
> On 01/09/2018 04:45 PM, Thomas Gleixner wrote:
> > On Mon, 8 Jan 2018, Andrea Arcangeli wrote:
> >> On Mon, Jan 08, 2018 at 09:53:02PM +0100, Thomas Gleixner wrote:
> >> Did my best to do the cleanest patch for tip, but I now figured Dave's
> >> original comment was spot on: a _PAGE_NX clear then becomes necessary
> >> also after pud_alloc not only after p4d_alloc.
> >>
> >> pmd_alloc would run into the same with x86 32bit non-PAE too.
>
> non-PAE doesn't have an NX bit. :)
>
> But we #define _PAGE_NX down to 0 there so it's harmless.
>
> >> So there are two choices, either going back to one single _PAGE_NX
> >> clear from Dave's original patch as below, or to add multiple
> >> clears after each level, which was my objective and is more
> >> robust, but it may be overkill in this case. As long as it was one
> >> line it looked a clear improvement.
> >>
> >> Considering the caller in both cases is going to abort I guess we can
> >> use the one liner approach as Dave and Jiri did originally.
> >
> > Dave ?
>
> I agree with Andrea. The patch in -tip potentially misses the pgd
> clearing if pud_alloc() sets a PGD. It would also be nice to have that
> comment back.
>
> Note that the -tip commit probably works in *practice* because for two
> adjacent calls to map_tboot_page() that share a PGD entry, the first
> will clear NX, *then* allocate and set the PGD (without NX clear). The
> second call will *not* allocate but will clear the NX bit.
>
> The patch I think we want is attached.

Color me confused. I have queued the one below in tip. It lacks the
comment and does the !NX at a different place.

Thanks,

	tglx

8<-
commit 262b6b30087246abf09d6275eb0c0dc421bcbe38
Author: Dave Hansen
Date:   Sat Jan 6 18:41:14 2018 +0100

    x86/tboot: Unbreak tboot with PTI enabled

    This is another case similar to what EFI does: create a new set of
    page tables, map some code at a low address, and jump to it.  PTI
    mistakes this low address for userspace and mistakenly marks it
    non-executable in an effort to make it unusable for userspace.  Undo
    the poison to allow execution.

    Fixes: 385ce0ea4c07 ("x86/mm/pti: Add Kconfig")
    Signed-off-by: Dave Hansen
    Signed-off-by: Andrea Arcangeli
    Signed-off-by: Thomas Gleixner
    Cc: Alan Cox
    Cc: Tim Chen
    Cc: Jon Masters
    Cc: Dave Hansen
    Cc: Andi Kleen
    Cc: Jeff Law
    Cc: Paolo Bonzini
    Cc: Linus Torvalds
    Cc: Greg Kroah-Hartman
    Cc: David"
    Cc: Nick Clifton
    Cc: sta...@vger.kernel.org
    Link: https://lkml.kernel.org/r/20180108102805.gk25...@redhat.com

diff --git a/arch/x86/kernel/tboot.c b/arch/x86/kernel/tboot.c
index a4eb27918ceb..75869a4b6c41 100644
--- a/arch/x86/kernel/tboot.c
+++ b/arch/x86/kernel/tboot.c
@@ -127,6 +127,7 @@ static int map_tboot_page(unsigned long vaddr, unsigned long pfn,
 	p4d = p4d_alloc(&tboot_mm, pgd, vaddr);
 	if (!p4d)
 		return -1;
+	pgd->pgd &= ~_PAGE_NX;
 	pud = pud_alloc(&tboot_mm, p4d, vaddr);
 	if (!pud)
 		return -1;
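The commit message above says PTI treats the low tboot mapping as userspace and poisons it with NX. A simplified user-space C model of that behaviour follows. It assumes the x86 bit positions for present (bit 0), user (bit 2) and NX (bit 63), is only loosely modeled on the PTI logic in arch/x86/mm/pti.c (it ignores conditions the kernel also checks, such as whether the CPU supports NX), and its function names are hypothetical. On 32-bit non-PAE kernels `_PAGE_NX` is defined as 0, so the corresponding clear is a no-op there.

```c
#include <stdbool.h>
#include <stdint.h>

/* x86 page-table entry bits (architectural bit positions). */
#define PAGE_PRESENT (UINT64_C(1) << 0)
#define PAGE_USER    (UINT64_C(1) << 2)
#define PAGE_NX      (UINT64_C(1) << 63)

/* Hypothetical model of PTI poisoning: when a present, user-accessible
 * top-level entry in the userspace half of the address space is written
 * into the kernel page tables, the kernel copy gets NX so kernel-mode
 * execution through it faults. */
static uint64_t pti_poison_pgd(uint64_t pgd_val, bool maps_userspace)
{
    if (!maps_userspace)
        return pgd_val;
    if ((pgd_val & (PAGE_USER | PAGE_PRESENT)) == (PAGE_USER | PAGE_PRESENT))
        pgd_val |= PAGE_NX;
    return pgd_val;
}

/* What the tboot fix does: remove the poison so code at the low
 * address can be executed again. */
static uint64_t unpoison_pgd(uint64_t pgd_val)
{
    return pgd_val & ~PAGE_NX;
}
```

Clearing only `PAGE_NX` leaves every other bit of the entry untouched, which is why the one-line `pgd->pgd &= ~_PAGE_NX;` is sufficient once it runs after the entry has been populated.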
Re: Avoid speculative indirect calls in kernel
On 01/09/2018 04:45 PM, Thomas Gleixner wrote:
> On Mon, 8 Jan 2018, Andrea Arcangeli wrote:
>> On Mon, Jan 08, 2018 at 09:53:02PM +0100, Thomas Gleixner wrote:
>> Did my best to do the cleanest patch for tip, but I now figured Dave's
>> original comment was spot on: a _PAGE_NX clear then becomes necessary
>> also after pud_alloc not only after p4d_alloc.
>>
>> pmd_alloc would run into the same with x86 32bit non-PAE too.

non-PAE doesn't have an NX bit. :)

But we #define _PAGE_NX down to 0 there so it's harmless.

>> So there are two choices, either going back to one single _PAGE_NX
>> clear from Dave's original patch as below, or to add multiple
>> clears after each level, which was my objective and is more
>> robust, but it may be overkill in this case. As long as it was one
>> line it looked a clear improvement.
>>
>> Considering the caller in both cases is going to abort I guess we can
>> use the one liner approach as Dave and Jiri did originally.
>
> Dave ?

I agree with Andrea. The patch in -tip potentially misses the pgd
clearing if pud_alloc() sets a PGD. It would also be nice to have that
comment back.

Note that the -tip commit probably works in *practice* because for two
adjacent calls to map_tboot_page() that share a PGD entry, the first
will clear NX, *then* allocate and set the PGD (without NX clear). The
second call will *not* allocate but will clear the NX bit.

The patch I think we want is attached.

From: Dave Hansen

This is another case similar to what EFI does: create a new set of
page tables, map some code at a low address, and jump to it.  PTI
mistakes this low address for userspace and mistakenly marks it
non-executable in an effort to make it unusable for userspace.  Undo
the poison to allow execution.

Signed-off-by: Dave Hansen
Cc: Ning Sun
Cc: Thomas Gleixner
Cc: Ingo Molnar
Cc: "H. Peter Anvin"
Cc: x...@kernel.org
Cc: tboot-de...@lists.sourceforge.net
Cc: linux-kernel@vger.kernel.org
---

 b/arch/x86/kernel/tboot.c |   11 +++++++++++
 1 file changed, 11 insertions(+)

diff -puN arch/x86/kernel/tboot.c~pti-tboot-fix arch/x86/kernel/tboot.c
--- a/arch/x86/kernel/tboot.c~pti-tboot-fix	2018-01-05 21:50:55.74960 -0800
+++ b/arch/x86/kernel/tboot.c	2018-01-05 23:51:41.368536890 -0800
@@ -138,6 +138,17 @@ static int map_tboot_page(unsigned long
 		return -1;
 	set_pte_at(&tboot_mm, vaddr, pte, pfn_pte(pfn, prot));
 	pte_unmap(pte);
+
+	/*
+	 * PTI poisons low addresses in the kernel page tables in the
+	 * name of making them unusable for userspace.  To execute
+	 * code at such a low address, the poison must be cleared.
+	 *
+	 * Note: 'pgd' actually gets set in p4d_alloc() _or_
+	 * pud_alloc() depending on 4/5-level paging.
+	 */
+	pgd->pgd &= ~_PAGE_NX;
+
 	return 0;
 }

_
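Dave's explanation of why the -tip placement "probably works in practice" is about ordering. The following hypothetical model assumes 4-level paging, where the P4D level is folded, so the PGD entry is only written (and poisoned by PTI) during `pud_alloc()`. The global `pgd_slot` stands in for one PGD entry shared by two adjacent `map_tboot_page()` calls, and all function names are illustrative; this is a sketch, not kernel code.

```c
#include <stdint.h>

#define PAGE_PRESENT (UINT64_C(1) << 0)
#define PAGE_USER    (UINT64_C(1) << 2)
#define PAGE_NX      (UINT64_C(1) << 63)
#define PAGE_TABLE   (PAGE_PRESENT | PAGE_USER) /* simplified _PAGE_TABLE */

/* One PGD entry shared by adjacent low addresses (physical address bits
 * are not modeled). */
static uint64_t pgd_slot;

/* Models pud_alloc() with a folded P4D level: the first call populates
 * the PGD entry, and PTI poisons it with NX; later calls find it present. */
static void model_pud_alloc(void)
{
    if (!(pgd_slot & PAGE_PRESENT))
        pgd_slot = PAGE_TABLE | PAGE_NX;
}

/* -tip placement: clear NX right after p4d_alloc(), i.e. before
 * pud_alloc() has had a chance to populate the entry. */
static void map_page_tip_order(void)
{
    /* p4d_alloc() does nothing here because the level is folded. */
    pgd_slot &= ~PAGE_NX;
    model_pud_alloc();
}

/* Placement in Dave's patch: clear NX once, after the whole walk. */
static void map_page_end_order(void)
{
    model_pud_alloc();
    pgd_slot &= ~PAGE_NX;
}
```

With the -tip ordering the first call leaves the freshly populated entry poisoned and only the second call clears it, matching Dave's description; clearing after the walk removes the poison on the first call.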
Re: Avoid speculative indirect calls in kernel
On Mon, 8 Jan 2018, Andrea Arcangeli wrote:
> On Mon, Jan 08, 2018 at 09:53:02PM +0100, Thomas Gleixner wrote:
> > Thanks for resending it.
>
> Thanks to you for the PTI improvements!
>
> Did my best to do the cleanest patch for tip, but I now figured Dave's
> original comment was spot on: a _PAGE_NX clear then becomes necessary
> also after pud_alloc not only after p4d_alloc.
>
> pmd_alloc would run into the same with x86 32bit non-PAE too.
>
> So there are two choices, either going back to one single _PAGE_NX
> clear from Dave's original patch as below, or to add multiple
> clears after each level, which was my objective and is more
> robust, but it may be overkill in this case. As long as it was one
> line it looked a clear improvement.
>
> Considering the caller in both cases is going to abort I guess we can
> use the one liner approach as Dave and Jiri did originally.

Dave ?

>
> It's up to you, doing it at each level would be more resilient in case
> the caller is changed.
>
> For the efi_64 same issue, the current tip patch will work better, but
> it can still be cleaned up with pgd_efi instead of pgd_offset_k().
>
> I got partly fooled because it worked great with 4 levels, but it
> wasn't ok anyway for 32bit non-PAE. Sometimes it's the simpler stuff
> that gets more subtle.
>
> Andrea
>
> >From 391517951e904cdd231dda9943c36a25a7bf01b9 Mon Sep 17 00:00:00 2001
> From: Dave Hansen
> Date: Sat, 6 Jan 2018 18:41:14 +0100
> Subject: [PATCH 1/1] x86/kaiser/efi: unbreak tboot
>
> This is another case similar to what EFI does: create a new set of
> page tables, map some code at a low address, and jump to it. PTI
> mistakes this low address for userspace and mistakenly marks it
> non-executable in an effort to make it unusable for userspace. Undo
> the poison to allow execution.
>
> Signed-off-by: Dave Hansen
> Cc: Ning Sun
> Cc: Thomas Gleixner
> Cc: Ingo Molnar
> Cc: "H. Peter Anvin"
> Cc: x...@kernel.org
> Cc: tboot-de...@lists.sourceforge.net
> Cc: linux-kernel@vger.kernel.org
> Signed-off-by: Andrea Arcangeli
> ---
>  arch/x86/kernel/tboot.c | 11 +++++++++++
>  1 file changed, 11 insertions(+)
>
> diff --git a/arch/x86/kernel/tboot.c b/arch/x86/kernel/tboot.c
> index a4eb27918ceb..a2486f444073 100644
> --- a/arch/x86/kernel/tboot.c
> +++ b/arch/x86/kernel/tboot.c
> @@ -138,6 +138,17 @@ static int map_tboot_page(unsigned long vaddr, unsigned long pfn,
>  		return -1;
>  	set_pte_at(&tboot_mm, vaddr, pte, pfn_pte(pfn, prot));
>  	pte_unmap(pte);
> +
> +	/*
> +	 * PTI poisons low addresses in the kernel page tables in the
> +	 * name of making them unusable for userspace. To execute
> +	 * code at such a low address, the poison must be cleared.
> +	 *
> +	 * Note: 'pgd' actually gets set in p4d_alloc() _or_
> +	 * pud_alloc() depending on 4/5-level paging.
> +	 */
> +	pgd->pgd &= ~_PAGE_NX;
> +
>  	return 0;
>  }
Re: Avoid speculative indirect calls in kernel
Alan Cox writes:
> On Fri, 5 Jan 2018 01:54:13 +0100 (CET)
> Thomas Gleixner wrote:
>
>> On Thu, 4 Jan 2018, Jon Masters wrote:
>> > P.S. I've an internal document where I've been tracking "nice to haves"
>> > for later, and one of them is whether it makes sense to tag binaries as
>> > "trusted" (e.g. extended attribute, label, whatever). It was something I
>> > wanted to bring up at some point as potentially worth considering.
>>
>> Scratch that. There is no such thing as a trusted binary.
>
> There is if you are using signing and the like. I'm sure SELinux and
> friends will grow the ability to set per process policy but that's
> certainly not a priority.

There was a proposed security module providing such a per-process
policy. When a process wants to execute a specific networking syscall
for a specific transport protocol, the security module catches the
syscall at the LSM hook and asks the user for a "verdict" (authorized or
not?).

Verdicts are put inside "tickets" (a struct holding information about
the authorization). Verdicts can have a timeout or live forever. They
are managed by a hashtable.

The policy can be defined by attaching tickets to processes with a
userspace tool. The interface between the userspace command tool and the
kernel uses the netlink protocol.

I managed to do the same for processes and memory: memory access
requires a process to deliver an available ticket. Sharing memory is
like "process A has the ticket required to access the memory of
process B".

Of course, direct assignment, through asm code or an operation like

	buffer[x] = y;

is impossible to catch at this level. It requires hooks at the asm
level.

As I understand it, Willy needs to build such a tool to classify
"trusted" binaries from others.

This is just the tip of the iceberg, because after starting to mark
processes as "trusted" or not, an architecture is needed to track such
operations, evaluate incoherences, and evaluate the convergence of such
a classification across thousands of binaries in a lot of contexts.
This was the big part of the job.

The last series I proposed, years ago, was:
[RFC,v3,00/10] snet: Security for NETwork syscalls
and particularly:
[RFC,v3,08/10] snet: introduce snet_ticket
http://patchwork.ozlabs.org/patch/93808/

thanks;
sam
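The ticket scheme described above (a verdict per process and syscall/protocol, with an optional timeout, kept in a hashtable) can be sketched in C. This is an illustrative assumption, not the data structure from the snet patches: the field names, the open-addressing table, and the `ticket_attach()`/`ticket_lookup()` helpers are hypothetical, and an in-kernel version would also need locking and eviction of expired tickets.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <time.h>

/* Hypothetical ticket layout; not the snet_ticket struct. */
enum verdict { VERDICT_NONE, VERDICT_ACCEPT, VERDICT_DENY };

struct ticket {
    int32_t pid;          /* process the ticket is attached to */
    int32_t syscall;      /* identifier of the networking syscall */
    int32_t protocol;     /* transport protocol, e.g. 6 for TCP */
    enum verdict verdict;
    time_t expires;       /* 0 means the ticket lives forever */
    bool used;            /* slot is occupied */
};

#define TICKET_BUCKETS 64

/* Open-addressing hash table with linear probing. */
static struct ticket table[TICKET_BUCKETS];

static size_t ticket_hash(int32_t pid, int32_t syscall, int32_t protocol)
{
    uint32_t h = (uint32_t)pid * 31u + (uint32_t)syscall;
    h = h * 31u + (uint32_t)protocol;
    return h % TICKET_BUCKETS;
}

/* Insert or replace the ticket for (pid, syscall, protocol). */
static bool ticket_attach(int32_t pid, int32_t syscall, int32_t protocol,
                          enum verdict v, time_t expires)
{
    size_t i = ticket_hash(pid, syscall, protocol);
    for (size_t n = 0; n < TICKET_BUCKETS; n++, i = (i + 1) % TICKET_BUCKETS) {
        struct ticket *t = &table[i];
        if (!t->used || (t->pid == pid && t->syscall == syscall &&
                         t->protocol == protocol)) {
            *t = (struct ticket){ pid, syscall, protocol, v, expires, true };
            return true;
        }
    }
    return false; /* table full */
}

/* Returns the cached verdict, or VERDICT_NONE when no valid ticket exists,
 * in which case the module described above would ask the user. */
static enum verdict ticket_lookup(int32_t pid, int32_t syscall,
                                  int32_t protocol, time_t now)
{
    size_t i = ticket_hash(pid, syscall, protocol);
    for (size_t n = 0; n < TICKET_BUCKETS; n++, i = (i + 1) % TICKET_BUCKETS) {
        const struct ticket *t = &table[i];
        if (!t->used)
            return VERDICT_NONE;
        if (t->pid == pid && t->syscall == syscall && t->protocol == protocol) {
            if (t->expires != 0 && now >= t->expires)
                return VERDICT_NONE; /* ticket timed out */
            return t->verdict;
        }
    }
    return VERDICT_NONE;
}
```

Caching the verdict keyed on the full (process, syscall, protocol) tuple means the user is only asked once per combination until the ticket expires.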
Re: Avoid speculative indirect calls in kernel
On Mon, Jan 08, 2018 at 09:53:02PM +0100, Thomas Gleixner wrote:
> Thanks for resending it.

Thanks to you for the PTI improvements!

Did my best to do the cleanest patch for tip, but I now figured Dave's
original comment was spot on: a _PAGE_NX clear then becomes necessary
also after pud_alloc not only after p4d_alloc.

pmd_alloc would run into the same with x86 32bit non-PAE too.

So there are two choices, either going back to one single _PAGE_NX
clear from Dave's original patch as below, or to add multiple clears
after each level, which was my objective and is more robust, but it may
be overkill in this case. As long as it was one line it looked a clear
improvement.

Considering the caller in both cases is going to abort I guess we can
use the one liner approach as Dave and Jiri did originally.

It's up to you, doing it at each level would be more resilient in case
the caller is changed.

For the efi_64 same issue, the current tip patch will work better, but
it can still be cleaned up with pgd_efi instead of pgd_offset_k().

I got partly fooled because it worked great with 4 levels, but it
wasn't ok anyway for 32bit non-PAE. Sometimes it's the simpler stuff
that gets more subtle.

Andrea

>From 391517951e904cdd231dda9943c36a25a7bf01b9 Mon Sep 17 00:00:00 2001
From: Dave Hansen
Date: Sat, 6 Jan 2018 18:41:14 +0100
Subject: [PATCH 1/1] x86/kaiser/efi: unbreak tboot

This is another case similar to what EFI does: create a new set of
page tables, map some code at a low address, and jump to it. PTI
mistakes this low address for userspace and mistakenly marks it
non-executable in an effort to make it unusable for userspace. Undo
the poison to allow execution.

Signed-off-by: Dave Hansen
Cc: Ning Sun
Cc: Thomas Gleixner
Cc: Ingo Molnar
Cc: "H. Peter Anvin"
Cc: x...@kernel.org
Cc: tboot-de...@lists.sourceforge.net
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Andrea Arcangeli
---
 arch/x86/kernel/tboot.c | 11 +++++++++++
 1 file changed, 11 insertions(+)

diff --git a/arch/x86/kernel/tboot.c b/arch/x86/kernel/tboot.c
index a4eb27918ceb..a2486f444073 100644
--- a/arch/x86/kernel/tboot.c
+++ b/arch/x86/kernel/tboot.c
@@ -138,6 +138,17 @@ static int map_tboot_page(unsigned long vaddr, unsigned long pfn,
 		return -1;
 	set_pte_at(&tboot_mm, vaddr, pte, pfn_pte(pfn, prot));
 	pte_unmap(pte);
+
+	/*
+	 * PTI poisons low addresses in the kernel page tables in the
+	 * name of making them unusable for userspace. To execute
+	 * code at such a low address, the poison must be cleared.
+	 *
+	 * Note: 'pgd' actually gets set in p4d_alloc() _or_
+	 * pud_alloc() depending on 4/5-level paging.
+	 */
+	pgd->pgd &= ~_PAGE_NX;
+
 	return 0;
 }
Re: Avoid speculative indirect calls in kernel
On Mon, 8 Jan 2018, Andrea Arcangeli wrote:
> On Fri, Jan 05, 2018 at 10:59:28AM +0100, Thomas Gleixner wrote:
> I sent you a better version of the efi_64.c fix from Jiri privately,
> and you're still missing the tboot fix in linux-tip, so you still have a
> boot failure to fix there.

Missed that in the pile ...

> This is incremental with
> https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git/log/?h=WIP.x86/pti
> where the "Unbreak EFI old_memmap" fix is applied.
>
> I respinned it after doing the more correct fix in this case too (same
> as the efi_64.c improvement) while leaving the attribution to the fix
> to Dave as he did the hard part.

Thanks for resending it.

> >From 0c480d1eeabd56379144a4ed6b6fb24f3b84e40e Mon Sep 17 00:00:00 2001
> From: Dave Hansen
> Date: Sat, 6 Jan 2018 18:41:14 +0100
> Subject: [PATCH 1/1] x86/kaiser/efi: unbreak tboot
>
> This is another case similar to what EFI does: create a new set of
> page tables, map some code at a low address, and jump to it. PTI
> mistakes this low address for userspace and mistakenly marks it
> non-executable in an effort to make it unusable for userspace. Undo
> the poison to allow execution.
>
> Signed-off-by: Dave Hansen
> Cc: Ning Sun
> Cc: Thomas Gleixner
> Cc: Ingo Molnar
> Cc: "H. Peter Anvin"
> Cc: x...@kernel.org
> Cc: tboot-de...@lists.sourceforge.net
> Cc: linux-kernel@vger.kernel.org
> Signed-off-by: Andrea Arcangeli
> ---
>  arch/x86/kernel/tboot.c | 1 +
>  1 file changed, 1 insertion(+)
>
> diff --git a/arch/x86/kernel/tboot.c b/arch/x86/kernel/tboot.c
> index a4eb27918ceb..75869a4b6c41 100644
> --- a/arch/x86/kernel/tboot.c
> +++ b/arch/x86/kernel/tboot.c
> @@ -127,6 +127,7 @@ static int map_tboot_page(unsigned long vaddr, unsigned long pfn,
>  	p4d = p4d_alloc(&tboot_mm, pgd, vaddr);
>  	if (!p4d)
>  		return -1;
> +	pgd->pgd &= ~_PAGE_NX;
>  	pud = pud_alloc(&tboot_mm, p4d, vaddr);
>  	if (!pud)
>  		return -1;
>
> If I can help and assist in any other way, let me know.
>
> Thanks,
> Andrea
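The one-line variant quoted above clears NX right after p4d_alloc(), while Dave's comment in the 11-line version notes that the pgd entry is actually written by p4d_alloc() or pud_alloc() depending on 4/5-level paging. The toy model below (not kernel code) illustrates why clearing NX after every allocation level is the more resilient choice. It assumes that exactly one allocation step writes the top-level entry and that writing it for a low address applies the PTI NX poison; which real configuration takes which path is deliberately left as a parameter, and all names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy stand-ins (hypothetical, not kernel definitions). */
#define SKETCH_PAGE_PRESENT (UINT64_C(1) << 0)
#define SKETCH_PAGE_NX      (UINT64_C(1) << 63)

/* Allocation steps in the order map_tboot_page() performs them. */
enum sketch_step { STEP_P4D, STEP_PUD, STEP_PMD, NSTEPS };

/* Toy assumption: the step 'populating_step' writes the top-level ("pgd")
 * entry and, because the address is low, PTI poisons it with NX.
 * clear_every_step == false models the one-liner (clear only after the
 * p4d step); true models clearing after each level. */
static uint64_t sketch_run(enum sketch_step populating_step, bool clear_every_step)
{
	uint64_t top = 0; /* empty top-level entry in a fresh mm */

	for (int s = STEP_P4D; s < NSTEPS; s++) {
		if (s == (int)populating_step)
			top = UINT64_C(0x1000) | SKETCH_PAGE_PRESENT | SKETCH_PAGE_NX;
		if (clear_every_step || s == STEP_P4D)
			top &= ~SKETCH_PAGE_NX;
	}
	return top;
}
```

In this model, a single clear placed after the p4d step only works when that step is the one writing the entry; clearing after each level leaves NX clear whichever step writes it, which is the robustness argument made in the thread.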
Re: Avoid speculative indirect calls in kernel
On Mon, Jan 08, 2018 at 05:22:41PM +0100, Borislav Petkov wrote:
> On Sun, Jan 07, 2018 at 11:10:38PM +0100, Willy Tarreau wrote:
> > I just want to be clear that the big drop some of us are facing is
> > not an option *at all* for certain processes in certain environments
> > and that we'll either continue to run with pti=off or with pti=on + a
> > finer grained setting ASAP.
>
> And that's all I'm saying: do pti=off in that case. The finer-grained
> "solution" is just silly.

I disagree, because I want to make sure, as much as possible, that
occasional unprivileged local users can't exploit it, and pti=off gives
them full access. The finer-grained solution ensures that only a few
processes, which work together with the kernel to deliver the service,
share the same risk as the kernel. And that's what I've implemented in a
patch series I sent in another thread :-)

  https://www.mail-archive.com/linux-kernel@vger.kernel.org/msg1580131.html

Cheers,
Willy
Re: Avoid speculative indirect calls in kernel
On Sun, Jan 07, 2018 at 11:10:38PM +0100, Willy Tarreau wrote:
> I just want to be clear that the big drop some of us are facing is
> not an option *at all* for certain processes in certain environments
> and that we'll either continue to run with pti=off or with pti=on + a
> finer grained setting ASAP.

And that's all I'm saying: do pti=off in that case. The finer-grained
"solution" is just silly.

-- 
Regards/Gruss,
    Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.
Re: Avoid speculative indirect calls in kernel
On Fri, Jan 05, 2018 at 10:59:28AM +0100, Thomas Gleixner wrote:
> I've seen the insanities which were crammed into the distro kernels, which
> have sysctls and whatever, but at the same time these kernels shipped in a

Debugfs tunables only; there are no sysctls. Quoting Greg:

http://lkml.kernel.org/r/20180107082026.ga11...@kroah.com

"It's a debugfs api, it can be changed at any time, to be anything we
want, and all is fine :)"

> haste do not even boot on a specific class of machines. [..]

If you are referring to the two efi_64.c and tboot.c corner-case boot
failures found over the last weekend, those affected upstream 4.15-rc,
4.14.12 and all PTI branches in linux-tip too (perhaps less reproducible
there because of differences in old_memmap handling).

I sent you a better version of the efi_64.c fix from Jiri privately,
and you're still missing the tboot fix in linux-tip, so you still have a
boot failure to fix there.

This is incremental with
https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git/log/?h=WIP.x86/pti
where the "Unbreak EFI old_memmap" fix is applied.

I respinned it after doing the more correct fix in this case too (same
as the efi_64.c improvement) while leaving the attribution to the fix
to Dave as he did the hard part.

>From 0c480d1eeabd56379144a4ed6b6fb24f3b84e40e Mon Sep 17 00:00:00 2001
From: Dave Hansen
Date: Sat, 6 Jan 2018 18:41:14 +0100
Subject: [PATCH 1/1] x86/kaiser/efi: unbreak tboot

This is another case similar to what EFI does: create a new set of
page tables, map some code at a low address, and jump to it. PTI
mistakes this low address for userspace and mistakenly marks it
non-executable in an effort to make it unusable for userspace. Undo
the poison to allow execution.

Signed-off-by: Dave Hansen
Cc: Ning Sun
Cc: Thomas Gleixner
Cc: Ingo Molnar
Cc: "H. Peter Anvin"
Cc: x...@kernel.org
Cc: tboot-de...@lists.sourceforge.net
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Andrea Arcangeli
---
 arch/x86/kernel/tboot.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/x86/kernel/tboot.c b/arch/x86/kernel/tboot.c
index a4eb27918ceb..75869a4b6c41 100644
--- a/arch/x86/kernel/tboot.c
+++ b/arch/x86/kernel/tboot.c
@@ -127,6 +127,7 @@ static int map_tboot_page(unsigned long vaddr, unsigned long pfn,
 	p4d = p4d_alloc(&tboot_mm, pgd, vaddr);
 	if (!p4d)
 		return -1;
+	pgd->pgd &= ~_PAGE_NX;
 	pud = pud_alloc(&tboot_mm, p4d, vaddr);
 	if (!pud)
 		return -1;

If I can help and assist in any other way, let me know.

Thanks,
Andrea
Re: Avoid speculative indirect calls in kernel
Hi Thomas,

On Mon, Jan 08, 2018 at 10:18:09AM +0100, Thomas Gleixner wrote:
> On Sun, 7 Jan 2018, Willy Tarreau wrote:
> > On Sun, Jan 07, 2018 at 07:55:11PM +0100, Borislav Petkov wrote:
> > > > Just like you have to trust your plane's pilot even though you don't
> > > > know him personally.
> > >
> > > Funny you should make that analogy. Remember that Germanwings pilot?
> > > People trusted him too.
> > >
> > > Now imagine if the plane had protection against insane pilots... some of
> > > those people might still be alive, who knows...
> >
> > Sure, but despite this case many people continue to take the plane because
> > it's their only option to cross half of the world in a reasonable time.
> >
> > Boris, I'm *not* contesting the performance resulting from the fixes,
> > and I would never have been able to produce them myself had I to, so
> > I'm really glad we have them. I just want to be clear that the big drop
> > some of us are facing is not an option *at all* for certain processes
> > in certain environments and that we'll either continue to run with
> > pti=off or with pti=on + a finer grained setting ASAP.
>
> No argument about that. We've looked into per process PTI very early and
> decided not to go that route because of the time pressure and the risk. I'm
> glad that we managed to pull it off at all without breaking the world
> completely. It's surely doable and we all know that it has to be done, just
> not right now as we have to fast track at least the basic protections for
> the other two attack vectors.

I know that most people with the skills to do it are very busy, which is
why I started to take a look at it myself: I'm not involved in this at
all, but I'm interested in seeing it done. For me the road is long, as I'm
progressively discovering ASID/PCID etc. in the code, so you can guess I
won't come up with something testable any time soon ;-)

My idea would be to use a privileged prctl() call to set a new TIF_NOPTI
flag on the task, and to find where to check it so as to avoid switching
to the user-only PGD when returning to userspace. I have no idea whether
this is doable at all, nor whether it would be sufficient (I hope so), but
reading the code to try to figure out whether it makes sense cannot hurt.

> You can be sure that all people involved hate it more than you do.

I'm definitely convinced about this. We're all proud to save one CPU cycle
here and there from time to time, and having to suddenly flush TLBs and
throw hundreds or thousands of cycles down the drain at once must be a
very hard decision to make. And by the way, I don't hate what was done,
because there's a config option and I still have the choice. Users of
other OSes probably don't even have this choice, so thanks to all involved
for this!

Willy
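A toy user-space model of the idea above, for illustration only: TIF_NOPTI is a proposal in this message, not an existing kernel flag, and every name below (sketch_task, sketch_prctl_set_nopti(), the sketch_pgd values) is hypothetical. The privilege check is a stand-in boolean rather than a real capability test, and the real return-to-userspace path is far more involved than a single branch.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical thread-info flag; the name comes from the proposal. */
#define SKETCH_TIF_NOPTI (1u << 0)

/* Hypothetical, heavily simplified task. */
struct sketch_task {
	unsigned int flags;
	bool privileged;   /* stand-in for a real capability check */
};

/* Which page tables the task runs on after returning to userspace. */
enum sketch_pgd {
	SKETCH_FULL_PGD,   /* kernel + user mappings: PTI skipped for this task */
	SKETCH_USER_PGD    /* user-only PGD: normal PTI behaviour */
};

/* Hypothetical privileged prctl()-style handler toggling the flag. */
static int sketch_prctl_set_nopti(struct sketch_task *t, bool on)
{
	if (!t->privileged)
		return -EPERM;
	if (on)
		t->flags |= SKETCH_TIF_NOPTI;
	else
		t->flags &= ~SKETCH_TIF_NOPTI;
	return 0;
}

/* The check on the return-to-userspace path: skip the switch to the
 * user-only PGD when the task has opted out. */
static enum sketch_pgd sketch_pgd_on_return_to_user(const struct sketch_task *t)
{
	return (t->flags & SKETCH_TIF_NOPTI) ? SKETCH_FULL_PGD : SKETCH_USER_PGD;
}
```

Keeping the opt-out as a per-task flag set by a privileged call means an unprivileged user cannot turn it on, which matches the goal of protecting against occasional local users.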
Re: Avoid speculative indirect calls in kernel
On Sun, 7 Jan 2018, Willy Tarreau wrote:
> On Sun, Jan 07, 2018 at 07:55:11PM +0100, Borislav Petkov wrote:
> > > Just like you have to trust your plane's pilot even though you don't
> > > know him personally.
> >
> > Funny you should make that analogy. Remember that Germanwings pilot?
> > People trusted him too.
> >
> > Now imagine if the plane had protection against insane pilots... some of
> > those people might still be alive, who knows...
>
> Sure, but despite this case many people continue to take the plane because
> it's their only option to cross half of the world in a reasonable time.
>
> Boris, I'm *not* contesting the performance resulting from the fixes,
> and I would never have been able to produce them myself had I to, so
> I'm really glad we have them. I just want to be clear that the big drop
> some of us are facing is not an option *at all* for certain processes
> in certain environments and that we'll either continue to run with
> pti=off or with pti=on + a finer grained setting ASAP.

No argument about that. We've looked into per process PTI very early and
decided not to go that route because of the time pressure and the risk. I'm
glad that we managed to pull it off at all without breaking the world
completely. It's surely doable and we all know that it has to be done, just
not right now as we have to fast track at least the basic protections for
the other two attack vectors.

You can be sure that all people involved hate it more than you do.

Thanks,

	tglx
Re: Avoid speculative indirect calls in kernel
On Sun, Jan 07, 2018 at 07:55:11PM +0100, Borislav Petkov wrote:
> > Just like you have to trust your plane's pilot even though you don't
> > know him personally.
>
> Funny you should make that analogy. Remember that Germanwings pilot?
> People trusted him too.
>
> Now imagine if the plane had protection against insane pilots... some of
> those people might still be alive, who knows...

Sure, but despite this case many people continue to take the plane because
it's their only option to cross half of the world in a reasonable time.

Boris, I'm *not* contesting the performance resulting from the fixes,
and I would never have been able to produce them myself had I to, so
I'm really glad we have them. I just want to be clear that the big drop
some of us are facing is not an option *at all* for certain processes
in certain environments and that we'll either continue to run with
pti=off or with pti=on + a finer grained setting ASAP.

I mean, the kernel is not the only sensitive part of a system (and
sometimes it's not even sensitive at all). The kernel and userland
processes deliver a service together, each in its own role. Breaking one
or the other can have similar consequences, or sometimes one can be worse
than the other. But in some situations, the two working properly together
is critical, and even a kernel compromise could be a detail compared to
the impact of something crashing at full load. Sometimes a userspace
compromise would already be critical enough that accepting the same risk
for the kernel doesn't make things any worse.

In my specific case, on LB appliances, I don't really care what happens
once haproxy has already been compromised: it's too late, game over, all
sensitive information is already disclosed at this point. What I'd rather
avoid, however, is the occasional sysop who has an account on the machine
to retrieve some stats once in a while suddenly being able to get more
than these stats. That's where I draw the line for *this* use case.
Plenty of others will see it differently, and that's fine.

Cheers,
Willy
Re: Avoid speculative indirect calls in kernel
On Sun, Jan 07, 2018 at 06:44:51PM +0100, Willy Tarreau wrote:
> Exactly, but there's much more to gain by owning this process anyway in
> certain cases than just dumping a few hundred kernel bytes.

A few hundred? It is *all* machine bytes.

> That's where I consider that "trusted" is more "critical" than "safe":
> if it dies, we all die anyway.

No, not die. Exploit it and, since it is "trusted", use it to dump all
memory. All your memories belongs to us.

> Just like you have to trust your plane's pilot even though you don't
> know him personally.

Funny you should make that analogy. Remember that Germanwings pilot?
People trusted him too.

Now imagine if the plane had protection against insane pilots... some of
those people might still be alive, who knows...

-- 
Regards/Gruss,
    Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.
Re: Avoid speculative indirect calls in kernel
On Sun, Jan 07, 2018 at 09:21:44AM -0800, David Lang wrote:
> The point is that in many cases, if someone exploits the "trusted" process,
> they already have everything that the machine is able to do anyway.

...and then we don't need the per-process complication anyway.

-- 
Regards/Gruss,
    Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.
Re: Avoid speculative indirect calls in kernel
On Sun, 2018-01-07 at 21:01 +0300, Ivan Ivanov wrote:
> Make sure that your patches do not affect AMD CPUs,
> because they are unaffected by the Meltdown vulnerability
> for which this "30% slowdown Intel patch" is required.

These patches *do* affect AMD CPUs, because they address one of the
issues for which AMD CPUs are also vulnerable.
Re: Avoid speculative indirect calls in kernel
Make sure that your patches do not affect AMD CPUs, because they are
unaffected by the Meltdown vulnerability for which this "30% slowdown
Intel patch" is required.

All your security patches regarding Meltdown should be like:
*) if it's Intel, it is "cpu_insecure" ==> take a safe and slow route
*) if it's AMD, it is "secure cpu" ==> take a normal route

AMD users should not suffer because of Intel screwups. If Intel is
responsible, they should accept CPU returns.

Best regards,
Ivan Ivanov, coreboot developer and open source enthusiast

2018-01-07 20:47 GMT+03:00 Willy Tarreau:
> On Sun, Jan 07, 2018 at 02:01:38PM +, Alan Cox wrote:
>> > I disagree. When there are patches that slow execution down up to 30%,
>> > I want to be able to mark a binary as "trusted" so that I can run it
>>
>> It's not a binary that is trusted - it's a binary in a given use case.
>> You could easily have the same binary being run in two situations on the
>> same box at the same time and run just one of them 'trusted'.
>
> That's what I like about the prctl approach. This can end up as a config
> option in the application itself. At least I'd see it like this in
> haproxy. Basically:
>   - start it with enough privileges (always the case to warrant chroot()
>     then setuid())
>
>   - if config option "disable-kpti" is set, run prctl() to disable it.
>
> It is sufficiently inconvenient to ensure that it's only done where
> relevant and regardless of the executable itself (i.e. it should not be
> an xattr on the FS, for example).
>
> Willy
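A minimal sketch of the vendor check proposed above, assuming the CPUID vendor string has already been read ("GenuineIntel" and "AuthenticAMD" are the real CPUID leaf 0 vendor strings; the function name is hypothetical). Treating any vendor other than AMD as affected is a conservative choice made here, not something stated in the message. As David points out in his reply in this thread, such a check can only gate the Meltdown/PTI mitigation, not the indirect-call mitigations that AMD CPUs also need.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Returns true when the "safe and slow" Meltdown/PTI route should be taken.
 * vendor: the 12-character CPUID leaf 0 vendor string. */
static bool sketch_cpu_needs_pti(const char *vendor)
{
	/* AMD: per the proposal, unaffected by Meltdown, take the normal route. */
	if (strcmp(vendor, "AuthenticAMD") == 0)
		return false;
	/* Intel ("GenuineIntel") and, conservatively, any other vendor. */
	return true;
}
```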
Re: Avoid speculative indirect calls in kernel
On Sun, Jan 07, 2018 at 02:01:38PM +, Alan Cox wrote: > > I disagree. When there are patches that slow execution down up to 30%, > > I want to be able to mark a binary as "trusted" so that I can run it > > It's not a binary that is trusted - it's a binary in a given use case. > You could easily have the same binary being run in two situations on the > same box at the same time and run just one of them 'trusted'. That's what I like about the prctl approach. This can end up as a config option in the application itself. At least that's how I'd see it in haproxy. Basically: - start it with enough privileges (always the case anyway, to be able to chroot() then setuid()) - if config option "disable-kpti" is set, run prctl() to disable it. It is sufficiently inconvenient to ensure that it's only done where relevant and regardless of the executable itself (i.e. it should not be an xattr on the FS, for example). Willy
Re: Avoid speculative indirect calls in kernel
On Sun, Jan 07, 2018 at 03:14:10PM +0100, Borislav Petkov wrote: > On Fri, Jan 05, 2018 at 08:13:33AM +0100, Willy Tarreau wrote: > > I'm not fond of running the mitigations, but given that a few sysops can > > connect to the machine to collect stats or counters, I think it would be > > better to ensure these people can't happily play with the exploits to > > dump stuff they shouldn't have access to. > > So if someone exploits the "trusted" process, and then dumps all memory, > you have practically lost. Exactly, but in certain cases there's much more to gain by owning this process anyway than by just dumping a few hundred bytes of kernel memory. That's why I consider "trusted" to mean "critical" rather than "safe": if it dies, we all die anyway. Just like you have to trust your plane's pilot even though you don't know him personally. Willy
Re: Avoid speculative indirect calls in kernel
The point is that in many cases, if someone exploits the "trusted" process, they already have everything that the machine is able to do anyway.
Re: Avoid speculative indirect calls in kernel
On Fri, Jan 05, 2018 at 08:13:33AM +0100, Willy Tarreau wrote: > I'm not fond of running the mitigations, but given that a few sysops can > connect to the machine to collect stats or counters, I think it would be > better to ensure these people can't happily play with the exploits to > dump stuff they shouldn't have access to. So if someone exploits the "trusted" process, and then dumps all memory, you have practically lost. -- Regards/Gruss, Boris. Good mailing practices for 400: avoid top-posting and trim the reply.
Re: Avoid speculative indirect calls in kernel
> I disagree. When there are patches that slow execution down up to 30%, > I want to be able to mark a binary as "trusted" so that I can run it It's not a binary that is trusted - it's a binary in a given use case. You could easily have the same binary being run in two situations on the same box at the same time and run just one of them 'trusted'.
Re: Avoid speculative indirect calls in kernel
On Fri, Jan 5, 2018 at 5:40 AM, Woodhouse, David wrote: > On Thu, 2018-01-04 at 21:01 -0500, james harvey wrote: >> >> >> I understand the GCC patches being discussed will fix the >> vulnerability because newly compiled kernels will be compiled with a >> GCC with these patches. > > The GCC patches work by eliminating all indirect branches, thus > avoiding 'variant 2' of the three problems which have been discovered. > > Note that we also need to eliminate all the indirect branches which > occurred in native assembler code too, and provide the 'thunks' that > GCC uses instead, which is why there's a series of kernel patches to go > with it. > > But building a kernel this way is *only* sufficient to protect the > kernel. Attacks between userspace processes are still possible — you > need the updated microcode, with branch-predictor flushes/restrictions, > to protect existing userspace processes from each other. > >> But, are the GCC patches being discussed also expected to fix the >> vulnerability because user binaries will be compiled using them? > > It would be possible to do that. Sensitive userspace processes could be > built this way, rendering them invulnerable to 'variant 2' attacks > without the kernel having to use the microcode features. > >> In such case, a binary could be maliciously changed back, or a custom GCC >> made with the patches reverted. > > If the attacker can replace the sensitive binary, or replace the > compiler with which the sensitive binary was compiled, then we have > other problems. I'm not going to lose sleep over that. > > > Note that *none* of this addresses 'variant 1'. There's a separate > patch series which addresses likely 'variant 1 gadgets' in the kernel, > which I haven't seen posted in public yet. And I'm not sure what we do > about that for userspace except extending the existing Coverity ruleset > and teaching GCC to emit barriers automatically in the right places, > which is a bit far-fetched right now. Elena? 
I could have written more clearly. What I'm getting at is whether any of the GCC patches are intended to prevent an exploit from being attempted at all, rather than making binaries immune to attacks. So, I don't mean being able to modify a sensitive binary or the system's compiler. I mean someone has non-root SSH access, compiles GCC with whatever patches end up being written for variants 1-3 reverted (or uses an old version), leaves their custom-compiled GCC in their user directory to compile malicious code, and executes the compiled malicious code from their user directory. (Or compiles a malicious program on their own system, compatible in architecture, kernel and library versions, using a GCC without the new patches, and just copies the binary over to their user directory to execute.) Or, a malicious customer with a VM on a shared machine, who has root access within their VM, reverts the GCC patches in an attempt to see into the host or other VMs on the machine.
Re: Avoid speculative indirect calls in kernel
On Fri, 5 Jan 2018 01:54:13 +0100 (CET) Thomas Gleixner wrote: > On Thu, 4 Jan 2018, Jon Masters wrote: > > P.S. I've an internal document where I've been tracking "nice to haves" > > for later, and one of them is whether it makes sense to tag binaries as > > "trusted" (e.g. extended attribute, label, whatever). It was something I > > wanted to bring up at some point as potentially worth considering. > > Scratch that. There is no such thing as a trusted binary. There is if you are using signing and the like. I'm sure SELinux and friends will grow the ability to set per-process policy, but that's certainly not a priority. However, the question is wrong. 'trusted' is a binary operator, not a unary one. The question that matters is: if I am executing A and about to switch to B, does B trust A? If B trusts A (which in Linuxspeak is 'can A ptrace B'), then there's not much point worrying about protection between them, because what you are trying to prevent is already expressly permitted. It's even more important if there is a cost to imposing the barrier, because not only can you skip it sometimes, but your scheduler can also schedule considering that cost, just as it does cache eviction costs. Alan
Re: Avoid speculative indirect calls in kernel
> But, are the GCC patches being discussed also expected to fix the > vulnerability because user binaries will be compiled using them? In If you have a system with just a few user binaries where you are concerned about such a thing you might go that way. > such case, a binary could be maliciously changed back, or a custom GCC > made with the patches reverted. If I can change your gcc or your binary then instead of removing the speculation protection I can make it encrypt all your files instead. Much simpler. At the point I can do this you already lost. Alan
Re: Avoid speculative indirect calls in kernel
On Thu, 4 Jan 2018, Jon Masters wrote: > On 01/04/2018 07:54 PM, Thomas Gleixner wrote: > > On Thu, 4 Jan 2018, Jon Masters wrote: > >> P.S. I've an internal document where I've been tracking "nice to haves" > >> for later, and one of them is whether it makes sense to tag binaries as > >> "trusted" (e.g. extended attribute, label, whatever). It was something I > >> wanted to bring up at some point as potentially worth considering. > > > > Scratch that. There is no such thing as a trusted binary. > > I agree with your sentiment, but for those mitigations that carry a > significant performance overhead (for example IBRS at the moment, and on > some other architectures where we might not end up with retpolines) > there /could/ be some value in leaving them on by default but allowing a > sysadmin to decide to trust a given application/container and accept the > risk. Sure, it's selectively weakened security, I get that. I am not > necessarily advocating this, just suggesting it be discussed. > > [ I also totally get that you can extend variant 2 to have any > application that interacts with another abuse it (even over a pipe or a > socket, etc. provided they share the same cache and take untrusted data > that can lead to some kind of load within a speculation window), and > there are a ton of ways to still cause an attack in that case. ] Correct. We don't have the basic mitigations in place, nor has anyone fully understood the implications and possible further issues. So can we please all sit back and fix the problems at hand in a sane way before we start discussing things like selective trust or whatever? I've seen the insanities which were crammed into the distro kernels, which have sysctls and whatever, but at the same time these kernels, shipped in haste, do not even boot on a specific class of machines. Great engineering work.
The thing which sits between the ears is not an acronym for: Big Revenue All Intelligence Nuked. But it seems that in some ways it has been degraded to exactly that, or do you have a sane explanation for why quite a few of the chip vendors ignored the textbooks from the 90s about speculative execution, which clearly say that speculation has to stop at domain borders and permission violations? We have already lost a lot of precious time due to other, even more disgusting big-corporate games, and many of us haven't had a quiet moment in the past two months. So can we please fix the stuff based on the oldest and most important principle of engineering, "Correctness first", and then, once that's done, think about ways to optimize it without digging yet another hole. Thanks, tglx
Re: Avoid speculative indirect calls in kernel
On Thu, Jan 04, 2018 at 10:57:19PM -0800, Dave Hansen wrote: > On 01/04/2018 10:49 PM, Willy Tarreau wrote: > > On Fri, Jan 05, 2018 at 01:54:13AM +0100, Thomas Gleixner wrote: > >> On Thu, 4 Jan 2018, Jon Masters wrote: > >>> P.S. I've an internal document where I've been tracking "nice to haves" > >>> for later, and one of them is whether it makes sense to tag binaries as > >>> "trusted" (e.g. extended attribute, label, whatever). It was something I > >>> wanted to bring up at some point as potentially worth considering. > >> Scratch that. There is no such thing as a trusted binary. > > I disagree with you on this Thomas. "trusted" means "we agree to share the > > risk this binary takes because it's critical to our service". When you > > build a load balancing appliance on which 100% of the service is assured > > by a single executable and the rest is just config management, you'd better > > trust that process. > > So you want to run this "one binary" as fast as possible and without > mitigations in place? But, you want mitigations *available* on that > system at the same time? For what? If there's only one binary, why not > just disable the mitigations entirely? I'm not fond of running the mitigations, but given that a few sysops can connect to the machine to collect stats or counters, I think it would be better to ensure these people can't happily play with the exploits to dump stuff they shouldn't have access to. It's even easier to understand on a database or key-value server for example, where you may expect the highest performance the CPU can bring for a specific process and the rest can be mitigated and will never ever notice any performance impact at all. That's why I was saying in another thread that it would be nice over the long term if we could 1) make the mitigation dynamic, and 2) make it possible for an admin to disable it for certain processes/programs. 
Don't get me wrong, I'm perfectly aware that it's far from being simple and for now we need to get a reliable mitigation. I'm just saying that the performance impact is a huge loss for certain use cases and that once things settle down we should start to work on ways to recover what was lost. Regards, Willy
Re: Avoid speculative indirect calls in kernel
On 01/04/2018 10:49 PM, Willy Tarreau wrote: > On Fri, Jan 05, 2018 at 01:54:13AM +0100, Thomas Gleixner wrote: >> On Thu, 4 Jan 2018, Jon Masters wrote: >>> P.S. I've an internal document where I've been tracking "nice to haves" >>> for later, and one of them is whether it makes sense to tag binaries as >>> "trusted" (e.g. extended attribute, label, whatever). It was something I >>> wanted to bring up at some point as potentially worth considering. >> Scratch that. There is no such thing as a trusted binary. > I disagree with you on this Thomas. "trusted" means "we agree to share the > risk this binary takes because it's critical to our service". When you > build a load balancing appliance on which 100% of the service is assured > by a single executable and the rest is just config management, you'd better > trust that process. So you want to run this "one binary" as fast as possible and without mitigations in place? But, you want mitigations *available* on that system at the same time? For what? If there's only one binary, why not just disable the mitigations entirely?
Re: Avoid speculative indirect calls in kernel
On Fri, Jan 05, 2018 at 01:54:13AM +0100, Thomas Gleixner wrote: > On Thu, 4 Jan 2018, Jon Masters wrote: > > P.S. I've an internal document where I've been tracking "nice to haves" > > for later, and one of them is whether it makes sense to tag binaries as > > "trusted" (e.g. extended attribute, label, whatever). It was something I > > wanted to bring up at some point as potentially worth considering. > > Scratch that. There is no such thing as a trusted binary. I disagree with you on this, Thomas. "trusted" means "we agree to share the risk this binary takes because it's critical to our service". When you build a load-balancing appliance on which 100% of the service is provided by a single executable and the rest is just config management, you'd better trust that process. If the binary or process cannot be trusted, the product is dead anyway. It doesn't mean the binary is safe. It just means that for the product there's nothing worse than its compromise or failure. And when it suffers from the performance impact of workarounds meant to protect the whole device against this process's possible abuses, you can easily see how the situation becomes ridiculous. We still need to think about performance a lot. There's already an ongoing trend of kernel-bypass mechanisms in the wild for performance reasons, and the new increase in syscall costs will necessarily amplify this desire to avoid the kernel. I personally don't want to see the kernel reduced to booting and executing SSH to manage the machines. Willy
Re: Avoid speculative indirect calls in kernel
On 01/04/2018 07:54 PM, Thomas Gleixner wrote: > On Thu, 4 Jan 2018, Jon Masters wrote: >> P.S. I've an internal document where I've been tracking "nice to haves" >> for later, and one of them is whether it makes sense to tag binaries as >> "trusted" (e.g. extended attribute, label, whatever). It was something I >> wanted to bring up at some point as potentially worth considering. > > Scratch that. There is no such thing as a trusted binary. I agree with your sentiment, but for those mitigations that carry a significant performance overhead (for example IBRS at the moment, and on some other architectures where we might not end up with retpolines) there /could/ be some value in leaving them on by default but allowing a sysadmin to decide to trust a given application/container and accept the risk. Sure, it's selectively weakened security, I get that. I am not necessarily advocating this, just suggesting it be discussed. [ I also totally get that you can extend variant 2 to have any application that interacts with another abuse it (even over a pipe or a socket, etc. provided they share the same cache and take untrusted data that can lead to some kind of load within a speculation window), and there are a ton of ways to still cause an attack in that case. ] Jon. -- Computer Architect | Sent from my Fedora powered laptop
Re: Avoid speculative indirect calls in kernel
On Wed, Jan 3, 2018 at 7:19 PM, Jiri Kosina wrote:
> On Wed, 3 Jan 2018, Andi Kleen wrote:
>
>> > It should be a CPU_BUG bit as we have for the other mess. And that can be
>> > used for patching.
>>
>> It has to be done at compile time because it requires a compiler option.
>
> If gcc annotates indirect calls/jumps in a way that we could patch them
> using alternatives at runtime, that'd be enough.
>
> --
> Jiri Kosina
> SUSE Labs

I understand that the GCC patches being discussed will fix the vulnerability in the kernel, because newly built kernels will be compiled with a GCC that includes them. But are the GCC patches also expected to fix the vulnerability in user binaries compiled with them? In that case, a binary could be maliciously changed back, or a custom GCC could be built with the patches reverted.

Please forgive me if my ignorance of the related GCC patches makes this a stupid question.
Re: Avoid speculative indirect calls in kernel
On Thu, 4 Jan 2018, Jon Masters wrote:
> P.S. I've an internal document where I've been tracking "nice to haves"
> for later, and one of them is whether it makes sense to tag binaries as
> "trusted" (e.g. extended attribute, label, whatever). It was something I
> wanted to bring up at some point as potentially worth considering.

Scratch that. There is no such thing as a trusted binary.
Re: Avoid speculative indirect calls in kernel
On 01/04/2018 02:57 PM, Jon Masters wrote:
> + Jeff Law, Nick Clifton
>
> On 01/04/2018 03:20 AM, Woodhouse, David wrote:
>> On Thu, 2018-01-04 at 03:11 +0100, Paolo Bonzini wrote:
>>> On 04/01/2018 02:59, Alan Cox wrote:
>>>>> But then, exactly because the retpoline approach adds quite some cruft
>>>>> and leaves something to be desired, why even bother?
>>>>
>>>> Performance
>>>
>>> Dunno. If I care about mitigating this threat, I wouldn't stop at
>>> retpolines even if the full solution has pretty bad performance (it's
>>> roughly in the same ballpark as PTI). But if I don't care, I wouldn't
>>> want retpolines either, since they do introduce a small slowdown (10-20
>>> cycles per indirect branch, meaning that after a thousand such papercuts
>>> they become slower than the full solution).
>>>
>>> A couple manually written asm retpolines may be good as mitigation to
>>> block the simplest PoCs (Linus may disagree), but patching the compiler,
>>> getting alternatives right, etc. will take a while. The only redeeming
>>> grace of retpolines is that they don't require a microcode update, but
>>> the microcode will be out there long before these patches are included
>>> and trickle down to distros... I just don't see the point in starting
>>> from retpolines or drawing the line there.
>>
>> No, really. The full mitigation with the microcode update and IBRS
>> support is *slow*. Horribly slow.
>
> It is horribly slow, though the story changes with CPU generation as
> others noted (and what needs disabling in the microcode). We did various
> analysis of the retpoline patches, including benchmarks, and we decided
> that the fastest and safest approach for Tue^W yesterday was to use the
> new MSRs. Especially in light of the corner cases we would need to
> address for an empty RSB, etc. I'm adding Jeff Law because he and the
> tools team have done analysis on this and he may have thoughts.
>
> There's also a cross-architecture concern here in that different
> solutions are needed across architectures. Retpolines are not endorsed
> or recommended by every architecture vendor at this time. It's important
> to make sure the necessary cross-vendor discussion happens now that it
> can happen in the open.
>
> Longer term, it'll be good to see BTBs tagged using the full address
> space (including any address space IDs...) in future silicon.

P.S. I've an internal document where I've been tracking "nice to haves" for later, and one of them is whether it makes sense to tag binaries as "trusted" (e.g. extended attribute, label, whatever). It was something I wanted to bring up at some point as potentially worth considering.

Jon.
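Paolo's estimate in the quoted text can be made concrete with a back-of-the-envelope calculation. It uses only his own figures (10-20 cycles per retpolined indirect branch, break-even at about a thousand such branches); the implied cost of the full microcode/IBRS mitigation is an inference from those numbers, not a measurement:

```latex
\begin{align*}
\text{overhead}(N) &= N \cdot c, \qquad c \in [10,\ 20]\ \text{cycles per indirect branch} \\
\text{overhead}(1000) &\in [1000 \cdot 10,\ 1000 \cdot 20] = [10^{4},\ 2 \times 10^{4}]\ \text{cycles}
\end{align*}
```

On those numbers, the full solution would cost roughly 10,000 to 20,000 cycles on the same code path, and a path executing more than about a thousand retpolined branches would come out slower than it.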
Re: Avoid speculative indirect calls in kernel
On 1/4/2018 5:47 PM, Tom Lendacky wrote:
> On 1/4/2018 2:05 PM, David Woodhouse wrote:
>> On Thu, 2018-01-04 at 14:00 -0600, Tom Lendacky wrote:
>>> Yes, lfence is sufficient. As long as the target is in the register
>>> before the lfence and we jump through the register all is good, i.e.:
>>
>> Thanks. Can I have a Reviewed-by: for this then please:
>
> Reviewed-by: Tom Lendacky
>
> While this works, a more efficient way to do the lfence support would be
> to not use the retpoline in this case. Changing the indirect jumps to
> do the "mov [rax], rax; lfence; jmp *rax" sequence would be quicker. I'm
> not sure if this is feasible given the need to do a retpoline if you can't
> use lfence, though.
>
> Thanks,
> Tom
>

I do need to send the patches that make lfence a serializing instruction for AMD. I'll get those out as soon as I can.

Thanks,
Tom

>>
>> http://git.infradead.org/users/dwmw2/linux-retpoline.git/commitdiff/08d9eda03
>>
>> From: David Woodhouse
>> Date: Thu, 4 Jan 2018 20:01:53 +
>> Subject: [PATCH] x86/retpoline: Simplify AMD variant of retpoline thunk
>>
>> On AMD (which is X86_FEATURE_K8), just the lfence is sufficient.
>>
>> Signed-off-by: David Woodhouse
>> ---
>>  arch/x86/lib/retpoline.S | 2 +-
>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/arch/x86/lib/retpoline.S b/arch/x86/lib/retpoline.S
>> index bbdda5cc136e..26070976bff0 100644
>> --- a/arch/x86/lib/retpoline.S
>> +++ b/arch/x86/lib/retpoline.S
>> @@ -11,7 +11,7 @@
>>
>>  ENTRY(__x86.indirect_thunk.\reg)
>>  	CFI_STARTPROC
>> -	ALTERNATIVE "call 2f", __stringify(jmp *%\reg), X86_BUG_NO_RETPOLINE
>> +	ALTERNATIVE_2 "call 2f", __stringify(lfence;jmp *%\reg), X86_FEATURE_K8, __stringify(jmp *%\reg), X86_BUG_NO_RETPOLINE
>>  1:
>>  	lfence
>>  	jmp 1b
>> --
>> 2.14.3
Re: Avoid speculative indirect calls in kernel
On 04/01/2018 23:47, Tom Lendacky wrote:
> On 1/4/2018 2:05 PM, David Woodhouse wrote:
>> On Thu, 2018-01-04 at 14:00 -0600, Tom Lendacky wrote:
>>> Yes, lfence is sufficient. As long as the target is in the register
>>> before the lfence and we jump through the register all is good, i.e.:
>> Thanks. Can I have a Reviewed-by: for this then please:
> Reviewed-by: Tom Lendacky
>
> While this works, a more efficient way to do the lfence support would be
> to not use the retpoline in this case. Changing the indirect jumps to
> do the "mov [rax], rax; lfence; jmp *rax" sequence would be quicker. I'm
> not sure if this is feasible given the need to do a retpoline if you can't
> use lfence, though.

That would be most efficient for AMD, but it isn't compatible with having a single binary which can mitigate itself most efficiently wherever it was booted. On most hardware, we'll want to dynamically choose between retpoline and lfence depending on vendor.

One option would be to teach GCC/Clang/Other to output alternative patch-point data for indirect branches in a format Linux/Xen could consume, and feed this into the alternatives framework.

The practical option to actually deploy in the timeframe is to use __x86.indirect_thunk.%reg and alternate between retpoline and lfence in 15 locations, which does add an unconditional call/jmp over the most efficient alternative, but allows us to switch the thunk in use at boot time.

~Andrew
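The "15 locations" follows from there being one thunk per register that can hold a branch target. The sketch below assumes that mapping: the 16 x86-64 general-purpose registers minus %rsp (which the thunk itself uses as the stack pointer), with names following the `__x86.indirect_thunk.\reg` pattern from the patch. The helper thunk_name is hypothetical, for illustration only.

```c
#include <stdio.h>
#include <string.h>

/* x86-64 general-purpose registers that can carry an indirect branch
 * target; %rsp is left out, giving 16 - 1 = 15 thunks. */
static const char *const thunk_regs[] = {
    "rax", "rbx", "rcx", "rdx", "rsi", "rdi", "rbp",
    "r8", "r9", "r10", "r11", "r12", "r13", "r14", "r15",
};

#define NUM_THUNKS (sizeof(thunk_regs) / sizeof(thunk_regs[0]))

/* Hypothetical helper: write the thunk symbol for register index i into
 * buf. Returns snprintf's length, or -1 for an out-of-range index. */
static int thunk_name(size_t i, char *buf, size_t len)
{
    if (i >= NUM_THUNKS)
        return -1;
    return snprintf(buf, len, "__x86.indirect_thunk.%s", thunk_regs[i]);
}
```

Because every call site goes through one of these 15 symbols, switching between retpoline and lfence only means patching 15 thunk bodies at boot, not every indirect branch in the image.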
Re: Avoid speculative indirect calls in kernel
On 1/4/2018 2:05 PM, David Woodhouse wrote:
> On Thu, 2018-01-04 at 14:00 -0600, Tom Lendacky wrote:
>> Yes, lfence is sufficient. As long as the target is in the register
>> before the lfence and we jump through the register all is good, i.e.:
>
> Thanks. Can I have a Reviewed-by: for this then please:

Reviewed-by: Tom Lendacky

While this works, a more efficient way to do the lfence support would be to not use the retpoline in this case. Changing the indirect jumps to do the "mov [rax], rax; lfence; jmp *rax" sequence would be quicker. I'm not sure if this is feasible given the need to do a retpoline if you can't use lfence, though.

Thanks,
Tom

>
> http://git.infradead.org/users/dwmw2/linux-retpoline.git/commitdiff/08d9eda03
>
> From: David Woodhouse
> Date: Thu, 4 Jan 2018 20:01:53 +
> Subject: [PATCH] x86/retpoline: Simplify AMD variant of retpoline thunk
>
> On AMD (which is X86_FEATURE_K8), just the lfence is sufficient.
>
> Signed-off-by: David Woodhouse
> ---
>  arch/x86/lib/retpoline.S | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/arch/x86/lib/retpoline.S b/arch/x86/lib/retpoline.S
> index bbdda5cc136e..26070976bff0 100644
> --- a/arch/x86/lib/retpoline.S
> +++ b/arch/x86/lib/retpoline.S
> @@ -11,7 +11,7 @@
>
>  ENTRY(__x86.indirect_thunk.\reg)
>  	CFI_STARTPROC
> -	ALTERNATIVE "call 2f", __stringify(jmp *%\reg), X86_BUG_NO_RETPOLINE
> +	ALTERNATIVE_2 "call 2f", __stringify(lfence;jmp *%\reg), X86_FEATURE_K8, __stringify(jmp *%\reg), X86_BUG_NO_RETPOLINE
>  1:
>  	lfence
>  	jmp 1b
> --
> 2.14.3
Re: Avoid speculative indirect calls in kernel
On 01/04/2018 01:33 PM, Linus Torvalds wrote:
> On Thu, Jan 4, 2018 at 3:26 AM, Pavel Machek wrote:
>> On Wed 2018-01-03 15:51:35, Linus Torvalds wrote:
>>>
>>> A *competent* CPU engineer would fix this by making sure speculation
>>> doesn't happen across protection domains. Maybe even a L1 I$ that is
>>> keyed by CPL.
>>
>> Would that be enough?
>
> No, you'd need to add the CPL to the branch target buffer itself, not the I$
> L1.
>
> And as somebody pointed out, that only helps the user space messing
> with the kernel. It doesn't help the "one user context fools another
> user context to mispredict". (Where the user contexts might be a
> JIT'ed JS vs the rest of the web browser).
>
> So you really would want to just make sure the full address is used to
> index (or at least verify) the BTB lookup, and even then you'd then
> need to invalidate the BTB on context switches so that one context
> can't fill in data for another context.

IMO the correct hardware fix is to index the BTB using the full VA including the ASID/PCID, and to guarantee (as is already the case) that there is no live conflict between the address space identifiers attached to entries. The sad thing is that even the latest academic courses recommend "optimizing" branch predictors by using only a few low-order bits (e.g. 31 in Intel's case, various others for different vendors).

The fix for variant 3 is similarly not that difficult in new hardware: don't allow the speculated load to happen, by enforcing the permission check at the right time. The last several editions of Computer Architecture spell this out in Appendix B (page 37 or thereabouts).

Jon.
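The difference between a predictor keyed by a few low-order address bits and one verified against the full VA plus ASID can be shown with a toy model. This is an illustration under stated assumptions, not any vendor's microarchitecture: the 512-entry size, the field names, and the all-or-nothing tag check are invented for clarity.

```c
#include <stdbool.h>
#include <stdint.h>

#define BTB_ENTRIES 512u   /* toy size: index uses the low 9 address bits */

struct btb_entry {
    bool     valid;
    uint64_t tag_va;   /* full branch address recorded at training time */
    uint16_t asid;     /* address space ID (ASID/PCID) of the trainer */
    uint64_t target;   /* predicted target */
};

struct btb {
    struct btb_entry e[BTB_ENTRIES];
    bool full_tag;     /* false: low index bits only, no tag check;
                          true: entry must match full VA and ASID */
};

static unsigned btb_index(uint64_t va) { return (unsigned)(va % BTB_ENTRIES); }

/* Record a resolved indirect branch (what a mistraining attacker does). */
static void btb_train(struct btb *b, uint64_t va, uint16_t asid, uint64_t target)
{
    struct btb_entry *x = &b->e[btb_index(va)];
    x->valid = true;
    x->tag_va = va;
    x->asid = asid;
    x->target = target;
}

/* Returns true and sets *target if the toy BTB supplies a prediction. */
static bool btb_predict(const struct btb *b, uint64_t va, uint16_t asid,
                        uint64_t *target)
{
    const struct btb_entry *x = &b->e[btb_index(va)];
    if (!x->valid)
        return false;
    if (b->full_tag && (x->tag_va != va || x->asid != asid))
        return false;  /* trained by another address or address space */
    *target = x->target;
    return true;
}
```

With full_tag false, a branch trained at a user address in one address space also steers a kernel-looking address in another whenever the low bits collide; with full_tag true, that cross-context prediction disappears, which is the property Jon and Linus are asking of future silicon.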
Re: Avoid speculative indirect calls in kernel
On Thu, 2018-01-04 at 14:00 -0600, Tom Lendacky wrote:
> Yes, lfence is sufficient. As long as the target is in the register
> before the lfence and we jump through the register all is good, i.e.:

Thanks. Can I have a Reviewed-by: for this then please:

http://git.infradead.org/users/dwmw2/linux-retpoline.git/commitdiff/08d9eda03

From: David Woodhouse
Date: Thu, 4 Jan 2018 20:01:53 +
Subject: [PATCH] x86/retpoline: Simplify AMD variant of retpoline thunk

On AMD (which is X86_FEATURE_K8), just the lfence is sufficient.

Signed-off-by: David Woodhouse
---
 arch/x86/lib/retpoline.S | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/lib/retpoline.S b/arch/x86/lib/retpoline.S
index bbdda5cc136e..26070976bff0 100644
--- a/arch/x86/lib/retpoline.S
+++ b/arch/x86/lib/retpoline.S
@@ -11,7 +11,7 @@
 
 ENTRY(__x86.indirect_thunk.\reg)
 	CFI_STARTPROC
-	ALTERNATIVE "call 2f", __stringify(jmp *%\reg), X86_BUG_NO_RETPOLINE
+	ALTERNATIVE_2 "call 2f", __stringify(lfence;jmp *%\reg), X86_FEATURE_K8, __stringify(jmp *%\reg), X86_BUG_NO_RETPOLINE
 1:
 	lfence
 	jmp 1b
-- 
2.14.3
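The ALTERNATIVE_2 line in the patch encodes a three-way, boot-time choice. The toy model below restates that choice in C so the precedence is explicit. It assumes, as with the kernel's alternatives mechanism, that alternatives are applied in order, so the second one wins when both feature bits are set; the type and function names are hypothetical stand-ins, not kernel code.

```c
#include <stdbool.h>

/* The three thunk bodies the patch can select between. */
enum thunk_kind {
    THUNK_RETPOLINE,   /* original: "call 2f" retpoline sequence */
    THUNK_LFENCE_JMP,  /* first alternative: "lfence; jmp *%reg" */
    THUNK_BARE_JMP,    /* second alternative: plain "jmp *%reg" */
};

/* Stand-ins for the two feature/bug bits named in the patch. */
struct cpu_caps {
    bool feature_k8;        /* models X86_FEATURE_K8 (AMD) */
    bool bug_no_retpoline;  /* models X86_BUG_NO_RETPOLINE */
};

/* Mirror in-order patching: start from the original instruction,
 * then let each alternative whose bit is set overwrite it. */
static enum thunk_kind select_thunk(const struct cpu_caps *c)
{
    enum thunk_kind k = THUNK_RETPOLINE;
    if (c->feature_k8)
        k = THUNK_LFENCE_JMP;
    if (c->bug_no_retpoline)
        k = THUNK_BARE_JMP;   /* applied last, so it takes precedence */
    return k;
}
```

Under this model, a CPU with neither bit keeps the retpoline, an AMD CPU gets the lfence variant, and a CPU flagged as not needing retpolines gets the bare jump regardless of vendor.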
Re: Avoid speculative indirect calls in kernel
On 1/4/2018 10:15 AM, David Woodhouse wrote:
> On Thu, 2018-01-04 at 15:29 +, Woodhouse, David wrote:
>>
>>> With the GCC -mindirect-branch=thunk-extern support, and microcode,
>>> Xen will make a boot-time choice between using Retpoline, Lfence (which
>>> is the better AMD option, and more performant than retpoline), or IBRS
>>> on Skylake and newer processors where it is strictly necessary, as well
>>> as using IBPB whenever available.
>>
>> I need to pull in the AMD lfence alternative for retpoline, giving us a
>> 3-way choice of the existing retpoline thunk, "lfence; jmp *%\reg", and
>> a bare "jmp *%\reg".
>
> I think I can abuse X86_FEATURE_SYSCALL for that, right? So it would
> look something like this:
>
> --- a/arch/x86/lib/retpoline.S
> +++ b/arch/x86/lib/retpoline.S
> @@ -12,7 +12,7 @@
>
>  ENTRY(__x86.indirect_thunk.\reg)
>  	CFI_STARTPROC
> -	ALTERNATIVE "call 2f", __stringify(jmp *%\reg), X86_BUG_NO_RETPOLINE
> +	ALTERNATIVE_2 "call 2f", __stringify(lfence;jmp *%\reg), X86_FEATURE_SYSCALL, __stringify(jmp *%\reg), X86_BUG_NO_RETPOLINE
>  1:
>  	lfence
>  	ASM_UNREACHABLE
>
>
> However, I would very much like to see a categorical statement from AMD
> that the lfence is sufficient in all cases. Remember, Intel were saying
> that too for a while, before finding that it was not *quite* good
> enough.

Yes, lfence is sufficient. As long as the target is in the register before the lfence and we jump through the register, all is good, i.e.:

Include a dispatch serializing instruction after the load of an indirect branch target. For instance, change this code:

1: jmp *[rax]          ; jump to address pointed to by RAX

To this:

1: mov [rax], rax      ; load target address
2: lfence              ; dispatch serializing instruction
3: jmp *rax

The processor will stop dispatching instructions until all older instructions have returned their results and are capable of being retired by the processor. At this point the branch target will be in the general purpose register (rax in this example) and available at dispatch for execution, such that the speculative execution window is not large enough to be exploited.

Thanks,
Tom
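The ordering in Tom's sequence (load the target, serialize dispatch, branch through the loaded value) can be sketched in C with the SSE2 `_mm_lfence()` intrinsic. This is an illustration of the ordering only, for x86 builds: a C compiler is free to choose how it materializes the call, which is why the kernel does this in assembly thunks, and per Tom's note lfence is only dispatch-serializing on AMD once the corresponding setup is in place. The names handler_fn, call_after_lfence and double_it are hypothetical.

```c
#include <emmintrin.h>   /* _mm_lfence (SSE2, x86/x86-64 only) */

typedef int (*handler_fn)(int);

/* Example branch target for the sketch. */
static int double_it(int x) { return 2 * x; }

/* Load the target, fence, then branch through the loaded value.
 * Mirrors steps 1-3 of the sequence above; it is not a guaranteed
 * mitigation, since codegen for the final call is up to the compiler. */
static int call_after_lfence(handler_fn const *slot, int arg)
{
    handler_fn target = *slot;   /* 1: load target address */
    _mm_lfence();                /* 2: wait for older instructions */
    return target(arg);          /* 3: indirect call through the value */
}
```

Usage: with `handler_fn slot = double_it;`, the call `call_after_lfence(&slot, 21)` returns 42; the fence changes timing, not the result.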
Re: Avoid speculative indirect calls in kernel
+ Jeff Law, Nick Clifton

On 01/04/2018 03:20 AM, Woodhouse, David wrote:
> On Thu, 2018-01-04 at 03:11 +0100, Paolo Bonzini wrote:
>> On 04/01/2018 02:59, Alan Cox wrote:
>>>> But then, exactly because the retpoline approach adds quite some cruft
>>>> and leaves something to be desired, why even bother?
>>>
>>> Performance
>>
>> Dunno. If I care about mitigating this threat, I wouldn't stop at
>> retpolines even if the full solution has pretty bad performance (it's
>> roughly in the same ballpark as PTI). But if I don't care, I wouldn't
>> want retpolines either, since they do introduce a small slowdown (10-20
>> cycles per indirect branch, meaning that after a thousand such papercuts
>> they become slower than the full solution).
>>
>> A couple manually written asm retpolines may be good as mitigation to
>> block the simplest PoCs (Linus may disagree), but patching the compiler,
>> getting alternatives right, etc. will take a while. The only redeeming
>> grace of retpolines is that they don't require a microcode update, but
>> the microcode will be out there long before these patches are included
>> and trickle down to distros... I just don't see the point in starting
>> from retpolines or drawing the line there.
>
> No, really. The full mitigation with the microcode update and IBRS
> support is *slow*. Horribly slow.

It is horribly slow, though the story changes with CPU generation as others noted (and what needs disabling in the microcode). We did various analysis of the retpoline patches, including benchmarks, and we decided that the fastest and safest approach for Tue^W yesterday was to use the new MSRs. Especially in light of the corner cases we would need to address for an empty RSB, etc. I'm adding Jeff Law because he and the tools team have done analysis on this and he may have thoughts.

There's also a cross-architecture concern here in that different solutions are needed across architectures. Retpolines are not endorsed or recommended by every architecture vendor at this time. It's important to make sure the necessary cross-vendor discussion happens now that it can happen in the open.

Longer term, it'll be good to see BTBs tagged using the full address space (including any address space IDs...) in future silicon.

Jon.
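Jon's "corner cases ... for an empty RSB" refers to the return stack buffer. Retpolines deliberately steer speculation through the return predictor, so they rely on it having a valid entry. The toy model here assumes a depth of 16 and uses hypothetical names. It shows how a call chain deeper than the buffer leaves the outermost returns with no RSB prediction; on some Intel parts (reportedly Skylake and newer, the generations for which IBRS is described earlier in this thread as strictly necessary) the CPU then falls back to other predictors such as the BTB.

```c
#include <stdbool.h>
#include <stdint.h>

#define RSB_DEPTH 16   /* assumed depth for this toy model */

/* Circular stack of predicted return addresses. When full, a new call
 * overwrites the oldest entry, so deep call chains lose their bottom. */
struct rsb {
    uint64_t slot[RSB_DEPTH];
    unsigned count;   /* valid entries, at most RSB_DEPTH */
    unsigned top;     /* index of the next free slot */
};

static void rsb_call(struct rsb *r, uint64_t return_addr)
{
    r->slot[r->top] = return_addr;
    r->top = (r->top + 1) % RSB_DEPTH;
    if (r->count < RSB_DEPTH)
        r->count++;
}

/* Returns true with the predicted return address, or false on underflow,
 * which is the case where a real CPU may consult another predictor. */
static bool rsb_ret(struct rsb *r, uint64_t *predicted)
{
    if (r->count == 0)
        return false;
    r->top = (r->top + RSB_DEPTH - 1) % RSB_DEPTH;
    r->count--;
    *predicted = r->slot[r->top];
    return true;
}
```

After 20 nested calls, the first 16 returns are predicted from the buffer (innermost first) and the remaining 4 underflow. That is why "RSB stuffing" (refilling the buffer with benign entries) comes up alongside retpolines on such CPUs.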
Re: Avoid speculative indirect calls in kernel
On Thu, Jan 4, 2018 at 3:26 AM, Pavel Machek wrote:
> On Wed 2018-01-03 15:51:35, Linus Torvalds wrote:
>>
>> A *competent* CPU engineer would fix this by making sure speculation
>> doesn't happen across protection domains. Maybe even a L1 I$ that is
>> keyed by CPL.
>
> Would that be enough?

No, you'd need to add the CPL to the branch target buffer itself, not the I$ L1.

And as somebody pointed out, that only helps against user space messing with the kernel. It doesn't help the "one user context fools another user context to mispredict" case. (Where the user contexts might be a JIT'ed JS vs the rest of the web browser.)

So you really would want to just make sure the full address is used to index (or at least verify) the BTB lookup, and even then you'd need to invalidate the BTB on context switches so that one context can't fill in data for another context.

Linus
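Linus's point can be illustrated with a toy model of a branch target buffer. The sketch below is not how any real BTB works (real indexing schemes are undocumented); it assumes a hypothetical 256-entry table indexed by the low 8 bits of the branch address. Without a full-address tag, a branch trained at one address supplies the prediction for an unrelated branch that aliases to the same slot; verifying the full address stops that, but an entry left by another context at the same address still hits until the table is flushed, which is why invalidation on context switch is also needed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Toy branch target buffer (BTB) model, for illustration only. */
#define BTB_ENTRIES 256u            /* indexed by the low 8 bits of the PC */

struct btb_entry {
    bool     valid;
    uint64_t tag;                   /* full branch address that trained it */
    uint64_t target;                /* predicted branch target */
};

struct btb {
    bool full_tag;                  /* verify the full address on lookup? */
    struct btb_entry e[BTB_ENTRIES];
};

/* Record that the branch at pc went to target. */
static void btb_train(struct btb *b, uint64_t pc, uint64_t target)
{
    struct btb_entry *ent = &b->e[pc % BTB_ENTRIES];
    ent->valid = true;
    ent->tag = pc;
    ent->target = target;
}

/* Returns true and sets *target if the BTB supplies a prediction for pc. */
static bool btb_predict(const struct btb *b, uint64_t pc, uint64_t *target)
{
    const struct btb_entry *ent = &b->e[pc % BTB_ENTRIES];
    if (!ent->valid)
        return false;
    if (b->full_tag && ent->tag != pc)
        return false;               /* entry belongs to an aliasing branch */
    *target = ent->target;
    return true;
}

/* Invalidate every entry, e.g. on a context switch. */
static void btb_flush(struct btb *b)
{
    memset(b->e, 0, sizeof(b->e));
}
```

For example, training the hypothetical address 0x1010 with an attacker-chosen target makes the partial-index model predict that target for a victim branch at 0x4010, while the full-tag model returns no prediction until the same address is trained again.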
Re: Avoid speculative indirect calls in kernel
Hi!

> ps: BTW^2 (and this is of course not about you, David) I'm disappointed
> that for "Spectre" there was no discussion between upstream developers,
> or between Linux vendors, or in fact hardly any discussion beyond "these
> are the patches". I understand that (unlike PTI) there was no back
> story to cover up the actual vulnerability, but... grow up, folks.
> Seriously, "these are the patches" won't fly with either upstream or
> distros. I still hope that the discussion will now

Or is someone still under the impression that the embargo is in place?

Pavel

-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
Re: Avoid speculative indirect calls in kernel
Hello,

On Thu, Jan 04, 2018 at 06:15:01PM +0100, Paolo Bonzini wrote:
> On 04/01/2018 18:13, Dave Hansen wrote:
> > On 01/04/2018 08:25 AM, Andrea Arcangeli wrote:
> >> It's only where SPEC_CTRL is missing and only IBPB_SUPPORT is
> >> available, that ibrs 0 ibpb 2 is the only option to fix variant#2 for
> >> good.
> >
> > Could you help us decode what "ibrs 0 ibpb 2" means to you?
>
> IBRS 0 = disabled
> IBRS 1 = only kernel sets IBRS=1
> IBRS 2 = indirect branch prediction fully disabled, or do the right
> thing on future processors
>
> IBPB 0 = disabled
> IBPB 1 = on context switch
> IBPB 2 = on every kernel or hypervisor entry

Yes. ibrs 0 ibpb 2 = IBPB on kernel entry and vmexit. If set, ibpb 2 forces ibrs to 0 (it shares the same branch in the kernel entry points, and it wouldn't make sense to enable ibrs with ibpb 2 anyway).

By default, ibrs 0 ibpb 2 is only ever activated if SPEC_CTRL is missing but IBPB_SUPPORT is present. It does the same kind of job as stuff_RSB: think of it as a stuff_IBP issued where stuff_RSB is already called.
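Andrea's rule that ibpb 2 overrides ibrs can be written as a one-step normalization of the two knob values. A minimal C sketch; the struct and function names are hypothetical, and only the override rule itself comes from the message above.

```c
#include <assert.h>

/* Hypothetical pair of knob values, numbered as in Paolo's table. */
struct spec_cfg {
    int ibrs;   /* 0, 1 or 2 */
    int ibpb;   /* 0, 1 or 2 */
};

/* ibpb 2 shares the kernel-entry path with IBRS and makes IBRS pointless,
 * so selecting ibpb 2 forces ibrs to 0; other combinations pass through. */
static struct spec_cfg spec_cfg_normalize(struct spec_cfg c)
{
    assert(c.ibrs >= 0 && c.ibrs <= 2);
    assert(c.ibpb >= 0 && c.ibpb <= 2);
    if (c.ibpb == 2)
        c.ibrs = 0;
    return c;
}
```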
Re: Avoid speculative indirect calls in kernel
Hi Alan,

On Thu, Jan 04, 2018 at 05:04:42PM +, Alan Cox wrote:
> > If you run lots of syscalls ibrs 1 ibpb 1 is much faster. If you do
> > infrequent syscalls computing a lot in kernel like I/O with large
> > buffers getting copied, ibrs 0 ibpb 2 is much faster than ibrs 1 ibpb
> > 1 (on those microcodes where ibrs 1 reduces performance a lot, not all
> > microcodes implementing SPEC_CTRL are inefficient like that).
>
> Have you looked at whether you can measure activity and switch
> automatically between the two (or by task). It seems silly to leave
> something the machine can accurately assess to a human?

We didn't, but it would definitely be reasonable to investigate, and it's a good idea for those CPUs where the updated microcode has to shut down much more than just indirect branch prediction speculation to achieve the ibrs 1 semantics. If the workload changes from frequent syscalls to reasonably large reads/writes with less frequent syscalls, or to lots of interrupts on idle CPUs, it would work well to switch between ibrs 1 ibpb 1 and ibrs 0 ibpb 2 automatically. As long as the pattern keeps repeating for a while... that is the question ;).

Thanks!
Andrea
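Alan's suggestion, together with Andrea's caveat that switching only pays off if the pattern keeps repeating, maps naturally onto a rate-based selector with hysteresis. The C sketch below is hypothetical throughout: nobody in the thread implemented it, the kernel-entry rate is assumed to come from some per-CPU counter sampled once per window, and the thresholds are illustrative rather than measured.

```c
#include <assert.h>

/* Hypothetical auto-switching policy between the two variant#2 modes. */
enum spec_mode {
    MODE_IBRS1_IBPB1,   /* cheap per entry: suits syscall-heavy loads */
    MODE_IBRS0_IBPB2,   /* barrier per entry: suits few, long syscalls */
};

struct mode_selector {
    enum spec_mode mode;        /* currently active mode */
    unsigned long high_rate;    /* entries/sec at or above which ibrs 1 wins */
    unsigned long low_rate;     /* entries/sec at or below which ibpb 2 wins */
    unsigned int  hold;         /* windows a trend must persist to switch */
    unsigned int  streak;       /* consecutive windows favouring a switch */
};

/* Feed one sampling window's kernel-entry rate; returns the mode to use. */
static enum spec_mode selector_update(struct mode_selector *s,
                                      unsigned long entries_per_sec)
{
    enum spec_mode want = s->mode;  /* dead band: keep the current mode */

    if (entries_per_sec >= s->high_rate)
        want = MODE_IBRS1_IBPB1;
    else if (entries_per_sec <= s->low_rate)
        want = MODE_IBRS0_IBPB2;

    if (want == s->mode) {
        s->streak = 0;              /* trend broken or already in that mode */
        return s->mode;
    }
    if (++s->streak >= s->hold) {   /* pattern kept repeating: switch */
        s->mode = want;
        s->streak = 0;
    }
    return s->mode;
}
```

The dead band between the two thresholds plus the hold count keeps a bursty workload from flipping the mode on every sample; that is the design choice aimed at the "as long as the pattern keeps repeating" question.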
Re: Avoid speculative indirect calls in kernel
On 04/01/2018 18:13, Dave Hansen wrote:
> On 01/04/2018 08:25 AM, Andrea Arcangeli wrote:
>> It's only where SPEC_CTRL is missing and only IBPB_SUPPORT is
>> available, that ibrs 0 ibpb 2 is the only option to fix variant#2 for
>> good.
>
> Could you help us decode what "ibrs 0 ibpb 2" means to you?

IBRS 0 = disabled
IBRS 1 = only kernel sets IBRS=1
IBRS 2 = indirect branch prediction fully disabled, or do the right
thing on future processors

IBPB 0 = disabled
IBPB 1 = on context switch
IBPB 2 = on every kernel or hypervisor entry

Thanks,

Paolo
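Paolo's table can be read as a mapping from each knob value to the places where the hardware control is applied. A C sketch of that reading; the enum and function names are hypothetical, treating IBRS 2 as "IBRS=1 in both kernel and user mode" is an assumption (the table also allows "do the right thing on future processors"), and IBPB 2 is modelled literally as a barrier on kernel entry and vmexit only.

```c
#include <assert.h>
#include <stdbool.h>

/* Knob values numbered as in Paolo's table. */
enum ibrs_mode { IBRS_OFF = 0, IBRS_KERNEL = 1, IBRS_ALWAYS = 2 };
enum ibpb_mode { IBPB_OFF = 0, IBPB_CTXSW = 1, IBPB_ENTRY = 2 };

enum spec_event { EV_KERNEL_ENTRY, EV_VMEXIT, EV_CONTEXT_SWITCH };

/* Should an indirect branch prediction barrier be issued for this event? */
static bool ibpb_on_event(enum ibpb_mode m, enum spec_event ev)
{
    switch (m) {
    case IBPB_CTXSW:
        return ev == EV_CONTEXT_SWITCH;
    case IBPB_ENTRY:
        return ev == EV_KERNEL_ENTRY || ev == EV_VMEXIT;
    default:
        return false;               /* IBPB_OFF */
    }
}

/* Is IBRS=1 in effect while running kernel code? */
static bool ibrs_in_kernel(enum ibrs_mode m)
{
    return m == IBRS_KERNEL || m == IBRS_ALWAYS;
}

/* Is IBRS=1 in effect while running user code? */
static bool ibrs_in_user(enum ibrs_mode m)
{
    return m == IBRS_ALWAYS;
}
```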
Re: Avoid speculative indirect calls in kernel
On 01/04/2018 08:25 AM, Andrea Arcangeli wrote:
> It's only where SPEC_CTRL is missing and only IBPB_SUPPORT is
> available, that ibrs 0 ibpb 2 is the only option to fix variant#2 for
> good.

Could you help us decode what "ibrs 0 ibpb 2" means to you?
Re: Avoid speculative indirect calls in kernel
> If you run lots of syscalls ibrs 1 ibpb 1 is much faster. If you do
> infrequent syscalls computing a lot in kernel like I/O with large
> buffers getting copied, ibrs 0 ibpb 2 is much faster than ibrs 1 ibpb
> 1 (on those microcodes where ibrs 1 reduces performance a lot, not all
> microcodes implementing SPEC_CTRL are inefficient like that).

Have you looked at whether you can measure activity and switch automatically between the two (or by task)? It seems silly to leave something the machine can accurately assess to a human.

Alan
Re: Avoid speculative indirect calls in kernel
On Thu, Jan 04, 2018 at 03:29:37PM +, Woodhouse, David wrote:
> On Thu, 2018-01-04 at 14:51 +, Andrew Cooper wrote:
> >
> > > * never turn off indirect branch prediction, but use a branch prediction
> > > barrier on every mode switch (needed for current AMD microcode)
> >
> > Where have you got this idea from? Using IBPB on every mode switch
> > would be an insane overhead to take, and isn't necessary.
>
> AMD *only* has IBPB and not IBRS, but IIRC you don't need to do it on

AMD families 0x10, 0x12 and 0x16 basically have IBRS and no IBPB; those work perfectly fine in ibrs 2 ibpb 1 mode, with variant#2 fixed and zero overhead.

> every context switch into the kernel; only when switching between
> VMs/processes?

Some AMD CPUs only have IBPB and no IBRS; there, IBPB has to be issued on every kernel entry or vmexit to give the same security as ibrs 1 ibpb 1 (modulo SMT/HT, but that's not the Spectre PoC, and you can also rule that out mathematically by simply using CPU pinning as you already do, or by disabling SMT if you care that much).

Note that ibrs 1 ibpb 1 also won't cover HT effects of guest/user mode vs guest/user mode, so CPU pinning may be advisable anyway in your case (even with ibrs 1 ibpb 1, no difference).

Of course everything can be trivially opted out of at runtime and all measurable performance restored, but by default it boots in the most secure config available, and it will make the Spectre variant#2 attack impossible even with only IBPB available.

> I need to pull in the AMD lfence alternative for retpoline, giving us a
> 3-way choice of the existing retpoline thunk, "lfence; jmp *%\reg", and
> a bare "jmp *%\reg".
>
> Then the IBRS bits can be added on top.

"AMD lfence and retpoline" in the same sentence sounds like somebody else also cares about Spectre variant#2 on AMD. Retpoline only ever makes sense in the Spectre variant#2 context, so either ibrs 0 ibpb 2 mode makes some sense too, or a special lfence retpoline for AMD should not be worth mentioning in the first place.
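David's three-way choice of indirect-branch sequences can be summarized as a selection function. In this C sketch the register is fixed to %r11 for readability (the patch uses a macro parameter, \reg), and choosing between the sequences from two booleans, "mitigation enabled" and "AMD CPU", is an assumption made for illustration, not the patch's actual alternatives logic.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* The three indirect-branch sequences listed in the quoted message. */
enum thunk_kind { THUNK_RETPOLINE, THUNK_LFENCE_JMP, THUNK_BARE_JMP };

/* Hypothetical selection policy; real kernels decide via CPU feature bits. */
static enum thunk_kind choose_thunk(bool mitigate, bool is_amd)
{
    if (!mitigate)
        return THUNK_BARE_JMP;      /* no overhead, no protection */
    if (is_amd)
        return THUNK_LFENCE_JMP;    /* serialize before the indirect jump */
    return THUNK_RETPOLINE;         /* call/ret trampoline */
}

/* Assembly text for each choice, with %r11 standing in for \reg. */
static const char *thunk_asm(enum thunk_kind k)
{
    switch (k) {
    case THUNK_LFENCE_JMP:
        return "lfence; jmp *%r11";
    case THUNK_BARE_JMP:
        return "jmp *%r11";
    case THUNK_RETPOLINE:
    default:
        /* speculation of the ret is trapped in the pause/lfence loop;
         * the real return address is overwritten with the target */
        return "call 1f; 2: pause; lfence; jmp 2b; 1: mov %r11, (%rsp); ret";
    }
}
```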
Re: Avoid speculative indirect calls in kernel
Hello,

On Thu, Jan 04, 2018 at 04:32:01PM +0100, Paolo Bonzini wrote:
> On 04/01/2018 15:51, Andrew Cooper wrote:
> > Where have you got this idea from? Using IBPB on every mode switch
> > would be an insane overhead to take, and isn't necessary.

It's only on kernel entry and vmexit.

> IIRC it started as a paranoia mode for AMD, but then we found out it was
> actually faster than IBRS on some Intel processor where IBRS performance
> was horrible. But I don't remember the details of the performance
> testing, sorry.

Yes, which one is faster depends on the workload. ibrs 0 ibpb 2 can in fact be used on CPUs with SPEC_CTRL too. It's only where SPEC_CTRL is missing and only IBPB_SUPPORT is available that ibrs 0 ibpb 2 is the only option to fix variant#2 for good.

If you run lots of syscalls, ibrs 1 ibpb 1 is much faster. If you do infrequent syscalls that compute a lot in the kernel, like I/O with large buffers getting copied, ibrs 0 ibpb 2 is much faster than ibrs 1 ibpb 1 (on those microcodes where ibrs 1 reduces performance a lot; not all microcodes implementing SPEC_CTRL are inefficient like that). If SPEC_CTRL is available, ibrs 1 ibpb 1 should be preferred even if it may not be faster in every workload.

The AMD website (https://www.amd.com/en/corporate/speculative-execution) says: "Differences in AMD architecture mean there is a near zero risk of exploitation of this variant."

ibrs 0 ibpb 2 brings the probability down to zero even when SPEC_CTRL is missing and only IBPB_SUPPORT is available in microcode, if you need that kind of peace of mind. What exactly would be the point of shipping fixes for variant#2 if we left Spectre variant#2 unfixed in cases where we could have fixed it? The problem is that, while it's very unlikely, if somebody does manage to set up and mount such an attack, Spectre variant#2 becomes a problem almost as bad as Spectre variant#1, and your hypervisor guest/host isolation is fully compromised.

It's not up to us to decide whether to leave something with "near zero risk" unfixed by default, so for now we provided a fix that brings the probability of such a Spectre variant#2 attack to zero whenever possible, so that the attack becomes impossible (not just "near zero risk").

Of course, we made sure the performance comes back at runtime no matter what after running this:

echo 0 >/sys/kernel/debug/x86/ibpb_enabled
echo 0 >/sys/kernel/debug/x86/ibrs_enabled

Or, if you prefer, at boot time with "noibrs noibpb". Not everyone will necessarily care about that kind of variant#2 attack, of course.

NOTE: if those two tunables both read as 0, the fix for variant#2 isn't activated by the running kernel, and you need to contact your CPU manufacturer for a microcode update providing SPEC_CTRL or at least IBPB_SUPPORT (in the latter case the fix will generally tend to perform worse, and ibrs 0 ibpb 2 mode will auto-engage).

For Meltdown variant#3, same thing: if you want to disable the fix because it's a guest kernel running a single microservice with a single app (similar to a unikernel) or something like that, you can do so at boot with "nopti" or at runtime with:

echo 0 >/sys/kernel/debug/x86/pti_enabled

The same applies if it's a bare-metal host running a single app that doesn't store secure data in kernel space, etc. There's always an option to disable the fixes. Only the Spectre variant#1 fix is always on, as there's no performance overhead to it. By default it boots in the most secure setting possible, so that Spectre variant#1, Spectre variant#2 and Meltdown variant#3 are all fixed.

Thanks,
Andrea
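The default-selection rules and the NOTE in this message can be condensed into a small decision function. A C sketch; the booleans stand in for the SPEC_CTRL and IBPB_SUPPORT microcode features and for the "noibrs noibpb" boot options, all names are hypothetical, and the sketch assumes ibpb 1 is only selected when IBPB is actually available.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical pair of knob values (ibrs_enabled / ibpb_enabled). */
struct v2_mode { int ibrs, ibpb; };

/* Default variant#2 mode following the rules described in the message. */
static struct v2_mode v2_default(bool has_spec_ctrl, bool has_ibpb,
                                 bool disabled_on_cmdline)
{
    struct v2_mode m = { 0, 0 };

    if (disabled_on_cmdline)
        return m;                   /* "noibrs noibpb": both knobs read 0 */
    if (has_spec_ctrl) {
        m.ibrs = 1;                 /* preferred whenever SPEC_CTRL exists */
        m.ibpb = has_ibpb ? 1 : 0;  /* IBPB on context switch if available */
    } else if (has_ibpb) {
        m.ibpb = 2;                 /* IBPB on every kernel entry and vmexit */
    }
    return m;                       /* {0, 0}: microcode update needed */
}

/* Both knobs reading 0 means no variant#2 fix is active. */
static bool v2_fix_active(struct v2_mode m)
{
    return m.ibrs != 0 || m.ibpb != 0;
}
```

Note that, in this sketch, both knobs reading 0 can also come from the boot options, not only from missing microcode support.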