Re: [PATCH V2] fork: Improve error message for corrupted page tables
On Tue, 2019-08-06 at 10:36 +0200, Michal Hocko wrote:
> On Mon 05-08-19 20:05:27, Sai Praneeth Prakhya wrote:
> > When a user process exits, the kernel cleans up the mm_struct of the user
> > process and during cleanup, check_mm() checks the page tables of the user
> > process for corruption (E.g: unexpected page flags set/cleared). For
> > corrupted page tables, the error message printed by check_mm() isn't very
> > clear as it prints the loop index instead of page table type (E.g: Resident
> > file mapping pages vs Resident shared memory pages). The loop index in
> > check_mm() is used to index rss_stat[] which represents individual memory
> > type stats. Hence, instead of printing index, print memory type, thereby
> > improving error message.
> >
> > Without patch:
> > --
> > [ 204.836425] mm/pgtable-generic.c:29: bad p4d 89eb4e92(80025f941467)
> > [ 204.836544] BUG: Bad rss-counter state mm:f75895ea idx:0 val:2
> > [ 204.836615] BUG: Bad rss-counter state mm:f75895ea idx:1 val:5
> > [ 204.836685] BUG: non-zero pgtables_bytes on freeing mm: 20480
> >
> > With patch:
> > ---
> > [ 69.815453] mm/pgtable-generic.c:29: bad p4d 84653642(80025ca37467)
> > [ 69.815872] BUG: Bad rss-counter state mm:014a6c03 type:MM_FILEPAGES val:2
> > [ 69.815962] BUG: Bad rss-counter state mm:014a6c03 type:MM_ANONPAGES val:5
> > [ 69.816050] BUG: non-zero pgtables_bytes on freeing mm: 20480
>
> I like this. On any occasion I am investigating an issue with an rss
> imbalance I have to go back to kernel sources to see which pte type that
> is.

Hopefully, this patch will be useful to you the next time you run into any rss
imbalance issues.

> > Also, change print function (from printk(KERN_ALERT, ..) to pr_alert()) so
> > that it matches the other print statement.
>
> good change as well. Maybe we should also lower the loglevel (in a
> separate patch) as well. While this is not nice because we are
> apparently leaking memory behind it shouldn't be really critical enough
> to jump on normal consoles.

Ya.. I think, probably could be lowered to pr_err() or pr_warn().

Regards,
Sai
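As a rough illustration of the loglevel point in this exchange, here is a small userspace sketch. It assumes the conventional printk numbering (alert = 1, err = 3, warning = 4) and the rule that a message is shown on the console only when its level is numerically lower than the console loglevel. The helper name reaches_console() and the thresholds used in the note below (7 as a common default, 4 for a "quiet" boot) are assumptions made for illustration; none of this is part of the patch.

```c
#include <assert.h>
#include <stdbool.h>

/* printk severity numbers; a lower number means a more severe message. */
enum loglevel {
	LOGLEVEL_ALERT   = 1,	/* what pr_alert() uses */
	LOGLEVEL_ERR     = 3,	/* what pr_err() uses   */
	LOGLEVEL_WARNING = 4,	/* what pr_warn() uses  */
};

/*
 * Hypothetical helper (not a kernel function): a message is printed on
 * the console only if its level is numerically below the threshold.
 */
static bool reaches_console(int msg_level, int console_loglevel)
{
	return msg_level < console_loglevel;
}
```

Under these assumptions, with a threshold of 4 an alert or an error still lands on the console while a warning does not; with a threshold of 7 all three are shown. That is the practical difference between the pr_err() and pr_warn() options mentioned in the reply.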
[PATCH V3] fork: Improve error message for corrupted page tables
When a user process exits, the kernel cleans up the mm_struct of the user
process and during cleanup, check_mm() checks the page tables of the user
process for corruption (E.g: unexpected page flags set/cleared). For
corrupted page tables, the error message printed by check_mm() isn't very
clear as it prints the loop index instead of page table type (E.g: Resident
file mapping pages vs Resident shared memory pages). The loop index in
check_mm() is used to index rss_stat[] which represents individual memory
type stats. Hence, instead of printing index, print memory type, thereby
improving error message.

Without patch:
--
[ 204.836425] mm/pgtable-generic.c:29: bad p4d 89eb4e92(80025f941467)
[ 204.836544] BUG: Bad rss-counter state mm:f75895ea idx:0 val:2
[ 204.836615] BUG: Bad rss-counter state mm:f75895ea idx:1 val:5
[ 204.836685] BUG: non-zero pgtables_bytes on freeing mm: 20480

With patch:
---
[ 69.815453] mm/pgtable-generic.c:29: bad p4d 84653642(80025ca37467)
[ 69.815872] BUG: Bad rss-counter state mm:014a6c03 type:MM_FILEPAGES val:2
[ 69.815962] BUG: Bad rss-counter state mm:014a6c03 type:MM_ANONPAGES val:5
[ 69.816050] BUG: non-zero pgtables_bytes on freeing mm: 20480

Also, change print function (from printk(KERN_ALERT, ..) to pr_alert()) so
that it matches the other print statement.

Cc: Ingo Molnar
Cc: Peter Zijlstra
Cc: Andrew Morton
Acked-by: Michal Hocko
Acked-by: Vlastimil Babka
Acked-by: Dave Hansen
Suggested-by: Dave Hansen
Reviewed-by: Anshuman Khandual
Signed-off-by: Sai Praneeth Prakhya
---

Changes from V2 to V3:
--
1. Add comment that suggests to update resident_page_types[] if there are any
   changes to existing page types in <linux/mm_types_task.h>
2. Add a build check to enforce resident_page_types[] is always in sync
3. Use a macro to populate elements of resident_page_types[]

Changes from V1 to V2:
--
1. Move struct definition from header file to fork.c file, so that it won't be
   included in every compilation unit. As this struct is used *only* in fork.c,
   include the definition in fork.c itself.
2. Index the struct to match respective macros.
3. Mention about print function change in commit message.

 include/linux/mm_types_task.h |  4 ++++
 kernel/fork.c                 | 16 ++++++++++++++--
 2 files changed, 18 insertions(+), 2 deletions(-)

diff --git a/include/linux/mm_types_task.h b/include/linux/mm_types_task.h
index d7016dcb245e..c1bc6731125c 100644
--- a/include/linux/mm_types_task.h
+++ b/include/linux/mm_types_task.h
@@ -36,6 +36,10 @@ struct vmacache {
 	struct vm_area_struct *vmas[VMACACHE_SIZE];
 };
 
+/*
+ * When updating this, please also update struct resident_page_types[] in
+ * kernel/fork.c
+ */
 enum {
 	MM_FILEPAGES,	/* Resident file mapping pages */
 	MM_ANONPAGES,	/* Resident anonymous pages */
diff --git a/kernel/fork.c b/kernel/fork.c
index d8ae0f1b4148..7583e0fde0ed 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -125,6 +125,15 @@ int nr_threads;		/* The idle threads do not count.. */
 
 static int max_threads;		/* tunable limit on nr_threads */
 
+#define NAMED_ARRAY_INDEX(x)	[x] = __stringify(x)
+
+static const char * const resident_page_types[] = {
+	NAMED_ARRAY_INDEX(MM_FILEPAGES),
+	NAMED_ARRAY_INDEX(MM_ANONPAGES),
+	NAMED_ARRAY_INDEX(MM_SWAPENTS),
+	NAMED_ARRAY_INDEX(MM_SHMEMPAGES),
+};
+
 DEFINE_PER_CPU(unsigned long, process_counts) = 0;
 
 __cacheline_aligned DEFINE_RWLOCK(tasklist_lock);  /* outer */
@@ -645,12 +654,15 @@ static void check_mm(struct mm_struct *mm)
 {
 	int i;
 
+	BUILD_BUG_ON_MSG(ARRAY_SIZE(resident_page_types) != NR_MM_COUNTERS,
+			 "Please make sure 'struct resident_page_types[]' is updated as well");
+
 	for (i = 0; i < NR_MM_COUNTERS; i++) {
 		long x = atomic_long_read(&mm->rss_stat.count[i]);
 
 		if (unlikely(x))
-			printk(KERN_ALERT "BUG: Bad rss-counter state "
-					  "mm:%p idx:%d val:%ld\n", mm, i, x);
+			pr_alert("BUG: Bad rss-counter state mm:%p type:%s val:%ld\n",
+				 mm, resident_page_types[i], x);
 	}
 
 	if (mm_pgtables_bytes(mm))
--
2.7.4
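The name-table technique in this patch can be tried outside the kernel. The following is a minimal userspace sketch, not the kernel code itself: STRINGIFY() and ARRAY_SIZE() are local stand-ins for the kernel's __stringify() and ARRAY_SIZE(), C11 _Static_assert stands in for BUILD_BUG_ON_MSG(), and counter_name() is a hypothetical accessor added only so the table can be exercised.

```c
#include <assert.h>
#include <string.h>

/* Userspace stand-ins for the kernel helpers used by the patch. */
#define STRINGIFY(x)		#x
#define ARRAY_SIZE(a)		(sizeof(a) / sizeof((a)[0]))
#define NAMED_ARRAY_INDEX(x)	[x] = STRINGIFY(x)

enum {
	MM_FILEPAGES,
	MM_ANONPAGES,
	MM_SWAPENTS,
	MM_SHMEMPAGES,
	NR_MM_COUNTERS
};

/* No explicit size: the length follows the highest initialized index. */
static const char * const resident_page_types[] = {
	NAMED_ARRAY_INDEX(MM_FILEPAGES),
	NAMED_ARRAY_INDEX(MM_ANONPAGES),
	NAMED_ARRAY_INDEX(MM_SWAPENTS),
	NAMED_ARRAY_INDEX(MM_SHMEMPAGES),
};

/* Compile-time check standing in for BUILD_BUG_ON_MSG(). */
_Static_assert(ARRAY_SIZE(resident_page_types) == NR_MM_COUNTERS,
	       "resident_page_types[] is out of sync with the enum");

/* Hypothetical accessor, used here only to exercise the table. */
static const char *counter_name(int idx)
{
	return resident_page_types[idx];
}
```

Because the enumerators are not macros, stringifying them yields the identifier itself, so a renamed enumerator changes the printed string without any manual edit to the table.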
Re: [PATCH V2] fork: Improve error message for corrupted page tables
On Tue, 2019-08-06 at 09:30 -0700, Dave Hansen wrote:
> On 8/5/19 8:05 PM, Sai Praneeth Prakhya wrote:
> > +static const char * const resident_page_types[NR_MM_COUNTERS] = {
> > +	[MM_FILEPAGES]		= "MM_FILEPAGES",
> > +	[MM_ANONPAGES]		= "MM_ANONPAGES",
> > +	[MM_SWAPENTS]		= "MM_SWAPENTS",
> > +	[MM_SHMEMPAGES]		= "MM_SHMEMPAGES",
> > +};
>
> One trick to ensure that this gets updated if the names are ever
> updated. You can do:
>
> #define NAMED_ARRAY_INDEX(x)	[x] = __stringify(x),
>
> and
>
> static const char * const resident_page_types[NR_MM_COUNTERS] = {
> 	NAMED_ARRAY_INDEX(MM_FILE_PAGES),
> 	NAMED_ARRAY_INDEX(MM_SHMEMPAGES),
> 	...
> };

Thanks for the suggestion Dave. I will add this in V3. Even with this, (if ever)
anyone who changes the name of page types or adds a new entry would still need
to update struct resident_page_types[].

So, I will add the comment as suggested by Vlastimil.

> That makes sure that any name changes make it into the strings. Then
> stick a:
>
> 	BUILD_BUG_ON(NR_MM_COUNTERS != ARRAY_SIZE(resident_page_types));
>
> somewhere. That makes sure that any new array indexes get a string
> added in the array. Otherwise you get nice, early, compile-time errors.

Sure! this sounds good and a small nit-bit :)
For the BUILD_BUG_ON() to work, the definition of struct should be changed as
below

static const char * const resident_page_types[] = {
	...
}

i.e. we should not specify the size of array.

Regards,
Sai
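The closing remark about not specifying the array size can be seen in a small userspace sketch. The enum and array names below are hypothetical and chosen only for illustration; ARRAY_SIZE() is a local stand-in for the kernel macro. Both tables deliberately omit the entry for the last enumerator, as would happen if a new counter were added without updating the table.

```c
#include <assert.h>
#include <stddef.h>

#define ARRAY_SIZE(a)	(sizeof(a) / sizeof((a)[0]))

/* Hypothetical counter types; TYPE_C plays the role of a newly added one. */
enum { TYPE_A, TYPE_B, TYPE_C, NR_TYPES };

/*
 * Explicit size: the array always has NR_TYPES slots, so a size check
 * can never fail. The forgotten TYPE_C slot is silently NULL.
 */
static const char * const sized_names[NR_TYPES] = {
	[TYPE_A] = "TYPE_A",
	[TYPE_B] = "TYPE_B",
};

/*
 * No explicit size: the length is the highest initialized index plus
 * one, so the forgotten TYPE_C entry makes the array one element short.
 */
static const char * const unsized_names[] = {
	[TYPE_A] = "TYPE_A",
	[TYPE_B] = "TYPE_B",
};
```

With the unsized form, a compile-time comparison against NR_TYPES would fail here, which is the behaviour the reply asks for. Note that this catches a forgotten entry only when it has the highest index; a hole in the middle of the table would leave the length unchanged.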
Re: [PATCH] fork: Improve error message for corrupted page tables
On Mon, 2019-08-05 at 15:28 +0200, Vlastimil Babka wrote:
> On 8/2/19 8:46 AM, Prakhya, Sai Praneeth wrote:
> > > > > > +static const char * const resident_page_types[NR_MM_COUNTERS] = {
> > > > > > +	"MM_FILEPAGES",
> > > > > > +	"MM_ANONPAGES",
> > > > > > +	"MM_SWAPENTS",
> > > > > > +	"MM_SHMEMPAGES",
> > > > > > +};
> > > > >
> > > > > But please let's not put this in a header file. We're asking the
> > > > > compiler to put a copy of all of this into every compilation unit
> > > > > which includes the header. Presumably the compiler is smart enough
> > > > > not to do that, but it's not good practice.
> > > >
> > > > Thanks for the explanation. Makes sense to me.
> > > >
> > > > Just wanted to check before sending V2, Is it OK if I add this to
> > > > kernel/fork.c? or do you have something else in mind?
> > >
> > > I was thinking somewhere like mm/util.c so the array could be used by
> > > other code. But it seems there is no such code. Perhaps it's best to
> > > just leave fork.c as it is now.
> >
> > Ok, so does that mean have the struct in header file itself?
>
> If the struct definition (including the string values) was in mm/util.c,
> there would have to be a declaration in a header. If it's in fork.c with
> the only users, there doesn't need to be separate declaration in a header.

Makes sense.

> > Sorry! for too many questions. I wanted to check with you before changing
> > because it's *the* fork.c file (I presume random changes will not be
> > encouraged here)
> >
> > I am not yet clear on what's the right thing to do here :(
> > So, could you please help me in deciding.
>
> fork.c should be fine, IMHO

I was leaning to add struct definition in fork.c as well but just wanted to
check with Andrew before posting V2.

Thanks for the reply though :)

Regards,
Sai
[PATCH V2] fork: Improve error message for corrupted page tables
When a user process exits, the kernel cleans up the mm_struct of the user
process and during cleanup, check_mm() checks the page tables of the user
process for corruption (E.g: unexpected page flags set/cleared). For
corrupted page tables, the error message printed by check_mm() isn't very
clear as it prints the loop index instead of page table type (E.g: Resident
file mapping pages vs Resident shared memory pages). The loop index in
check_mm() is used to index rss_stat[] which represents individual memory
type stats. Hence, instead of printing index, print memory type, thereby
improving error message.

Without patch:
--
[ 204.836425] mm/pgtable-generic.c:29: bad p4d 89eb4e92(80025f941467)
[ 204.836544] BUG: Bad rss-counter state mm:f75895ea idx:0 val:2
[ 204.836615] BUG: Bad rss-counter state mm:f75895ea idx:1 val:5
[ 204.836685] BUG: non-zero pgtables_bytes on freeing mm: 20480

With patch:
---
[ 69.815453] mm/pgtable-generic.c:29: bad p4d 84653642(80025ca37467)
[ 69.815872] BUG: Bad rss-counter state mm:014a6c03 type:MM_FILEPAGES val:2
[ 69.815962] BUG: Bad rss-counter state mm:014a6c03 type:MM_ANONPAGES val:5
[ 69.816050] BUG: non-zero pgtables_bytes on freeing mm: 20480

Also, change print function (from printk(KERN_ALERT, ..) to pr_alert()) so
that it matches the other print statement.

Cc: Ingo Molnar
Cc: Vlastimil Babka
Cc: Peter Zijlstra
Cc: Andrew Morton
Cc: Anshuman Khandual
Acked-by: Dave Hansen
Suggested-by: Dave Hansen
Signed-off-by: Sai Praneeth Prakhya
---

Changes from V1 to V2:
--
1. Move struct definition from header file to fork.c file, so that it won't be
   included in every compilation unit. As this struct is used *only* in fork.c,
   include the definition in fork.c itself.
2. Index the struct to match respective macros.
3. Mention about print function change in commit message.

 kernel/fork.c | 11 +++++++++--
 1 file changed, 9 insertions(+), 2 deletions(-)

diff --git a/kernel/fork.c b/kernel/fork.c
index d8ae0f1b4148..f34f441c50c0 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -125,6 +125,13 @@ int nr_threads;		/* The idle threads do not count.. */
 
 static int max_threads;		/* tunable limit on nr_threads */
 
+static const char * const resident_page_types[NR_MM_COUNTERS] = {
+	[MM_FILEPAGES]		= "MM_FILEPAGES",
+	[MM_ANONPAGES]		= "MM_ANONPAGES",
+	[MM_SWAPENTS]		= "MM_SWAPENTS",
+	[MM_SHMEMPAGES]		= "MM_SHMEMPAGES",
+};
+
 DEFINE_PER_CPU(unsigned long, process_counts) = 0;
 
 __cacheline_aligned DEFINE_RWLOCK(tasklist_lock);  /* outer */
@@ -649,8 +656,8 @@ static void check_mm(struct mm_struct *mm)
 		long x = atomic_long_read(&mm->rss_stat.count[i]);
 
 		if (unlikely(x))
-			printk(KERN_ALERT "BUG: Bad rss-counter state "
-					  "mm:%p idx:%d val:%ld\n", mm, i, x);
+			pr_alert("BUG: Bad rss-counter state mm:%p type:%s val:%ld\n",
+				 mm, resident_page_types[i], x);
 	}
 
 	if (mm_pgtables_bytes(mm))
--
2.7.4
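Item 2 of the V1 to V2 changelog ("Index the struct to match respective macros") refers to C designated initializers. As a minimal userspace sketch of why that matters, the two tables below list their strings in an order that deliberately does not match the enum. The array names positional and designated are hypothetical; only the enumerator names come from the patch.

```c
#include <assert.h>
#include <string.h>

enum {
	MM_FILEPAGES,
	MM_ANONPAGES,
	MM_SWAPENTS,
	MM_SHMEMPAGES,
	NR_MM_COUNTERS
};

/* Positional: slot N simply holds whichever string is written Nth. */
static const char * const positional[NR_MM_COUNTERS] = {
	"MM_ANONPAGES",		/* ends up at index 0, i.e. MM_FILEPAGES */
	"MM_FILEPAGES",		/* ends up at index 1, i.e. MM_ANONPAGES */
	"MM_SWAPENTS",
	"MM_SHMEMPAGES",
};

/* Designated: each string is tied to its enumerator, whatever the order. */
static const char * const designated[NR_MM_COUNTERS] = {
	[MM_ANONPAGES]	= "MM_ANONPAGES",
	[MM_FILEPAGES]	= "MM_FILEPAGES",
	[MM_SWAPENTS]	= "MM_SWAPENTS",
	[MM_SHMEMPAGES]	= "MM_SHMEMPAGES",
};
```

The positional table reports the wrong name as soon as the listing order and the enum order diverge, while the designated table keeps each name attached to its index.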
Re: [PATCH] fork: Improve error message for corrupted page tables
> > With patch:
> > ---
> > [ 69.815453] mm/pgtable-generic.c:29: bad p4d 84653642(80025ca37467)
> > [ 69.815872] BUG: Bad rss-counter state mm:014a6c03 type:MM_FILEPAGES val:2
> > [ 69.815962] BUG: Bad rss-counter state mm:014a6c03 type:MM_ANONPAGES val:5
> > [ 69.816050] BUG: non-zero pgtables_bytes on freeing mm: 20480
>
> Seems useful.
>
> > --- a/include/linux/mm_types_task.h
> > +++ b/include/linux/mm_types_task.h
> > @@ -44,6 +44,13 @@ enum {
> >  	NR_MM_COUNTERS
> >  };
> >
> > +static const char * const resident_page_types[NR_MM_COUNTERS] = {
> > +	"MM_FILEPAGES",
> > +	"MM_ANONPAGES",
> > +	"MM_SWAPENTS",
> > +	"MM_SHMEMPAGES",
> > +};
>
> But please let's not put this in a header file. We're asking the
> compiler to put a copy of all of this into every compilation unit which
> includes the header. Presumably the compiler is smart enough not to
> do that, but it's not good practice.

Thanks for the explanation. Makes sense to me.

Just wanted to check before sending V2, Is it OK if I add this to kernel/fork.c?
or do you have something else in mind?

Regards,
Sai
[PATCH] fork: Improve error message for corrupted page tables
When a user process exits, the kernel cleans up the mm_struct of the user
process and during cleanup, check_mm() checks the page tables of the user
process for corruption (E.g: unexpected page flags set/cleared). For
corrupted page tables, the error message printed by check_mm() isn't very
clear as it prints the loop index instead of page table type (E.g: Resident
file mapping pages vs Resident shared memory pages). Hence, improve the
error message so that it's more informative.

Without patch:
--
[ 204.836425] mm/pgtable-generic.c:29: bad p4d 89eb4e92(80025f941467)
[ 204.836544] BUG: Bad rss-counter state mm:f75895ea idx:0 val:2
[ 204.836615] BUG: Bad rss-counter state mm:f75895ea idx:1 val:5
[ 204.836685] BUG: non-zero pgtables_bytes on freeing mm: 20480

With patch:
---
[ 69.815453] mm/pgtable-generic.c:29: bad p4d 84653642(80025ca37467)
[ 69.815872] BUG: Bad rss-counter state mm:014a6c03 type:MM_FILEPAGES val:2
[ 69.815962] BUG: Bad rss-counter state mm:014a6c03 type:MM_ANONPAGES val:5
[ 69.816050] BUG: non-zero pgtables_bytes on freeing mm: 20480

Cc: Ingo Molnar
Cc: Peter Zijlstra
Cc: Andrew Morton
Suggested-by/Acked-by: Dave Hansen
Signed-off-by: Sai Praneeth Prakhya
---
 include/linux/mm_types_task.h | 7 +++++++
 kernel/fork.c                 | 4 ++--
 2 files changed, 9 insertions(+), 2 deletions(-)

diff --git a/include/linux/mm_types_task.h b/include/linux/mm_types_task.h
index d7016dcb245e..881f4ea3a1b5 100644
--- a/include/linux/mm_types_task.h
+++ b/include/linux/mm_types_task.h
@@ -44,6 +44,13 @@ enum {
 	NR_MM_COUNTERS
 };
 
+static const char * const resident_page_types[NR_MM_COUNTERS] = {
+	"MM_FILEPAGES",
+	"MM_ANONPAGES",
+	"MM_SWAPENTS",
+	"MM_SHMEMPAGES",
+};
+
 #if USE_SPLIT_PTE_PTLOCKS && defined(CONFIG_MMU)
 #define SPLIT_RSS_COUNTING
 /* per-thread cached information, */
diff --git a/kernel/fork.c b/kernel/fork.c
index 2852d0e76ea3..6aef5842d4e0 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -649,8 +649,8 @@ static void check_mm(struct mm_struct *mm)
 		long x = atomic_long_read(&mm->rss_stat.count[i]);
 
 		if (unlikely(x))
-			printk(KERN_ALERT "BUG: Bad rss-counter state "
-					  "mm:%p idx:%d val:%ld\n", mm, i, x);
+			pr_alert("BUG: Bad rss-counter state mm:%p type:%s val:%ld\n",
+				 mm, resident_page_types[i], x);
 	}
 
 	if (mm_pgtables_bytes(mm))
--
2.19.1
[tip:efi/core] x86/efi: Mark can_free_region() as an __init function
Commit-ID:  8fe55212aacfce9b7718de7964b3a3096ec30919
Gitweb:     https://git.kernel.org/tip/8fe55212aacfce9b7718de7964b3a3096ec30919
Author:     Sai Praneeth Prakhya
AuthorDate: Sat, 2 Feb 2019 10:41:10 +0100
Committer:  Ingo Molnar
CommitDate: Mon, 4 Feb 2019 08:19:22 +0100

x86/efi: Mark can_free_region() as an __init function

can_free_region() is called only once during boot, by
efi_reserve_boot_services().

Hence, mark it as an __init function.

Signed-off-by: Sai Praneeth Prakhya
Signed-off-by: Ard Biesheuvel
Cc: AKASHI Takahiro
Cc: Alexander Graf
Cc: Bjorn Andersson
Cc: Borislav Petkov
Cc: Heinrich Schuchardt
Cc: Jeffrey Hugo
Cc: Lee Jones
Cc: Leif Lindholm
Cc: Linus Torvalds
Cc: Matt Fleming
Cc: Peter Jones
Cc: Peter Zijlstra
Cc: Thomas Gleixner
Cc: linux-...@vger.kernel.org
Link: http://lkml.kernel.org/r/20190202094119.13230-2-ard.biesheu...@linaro.org
Signed-off-by: Ingo Molnar
---
 arch/x86/platform/efi/quirks.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/platform/efi/quirks.c b/arch/x86/platform/efi/quirks.c
index 17456a1d3f04..9ce85e605052 100644
--- a/arch/x86/platform/efi/quirks.c
+++ b/arch/x86/platform/efi/quirks.c
@@ -304,7 +304,7 @@ void __init efi_arch_mem_reserve(phys_addr_t addr, u64 size)
  * - Not within any part of the kernel
  * - Not the BIOS reserved area (E820_TYPE_RESERVED, E820_TYPE_NVS, etc)
  */
-static bool can_free_region(u64 start, u64 size)
+static __init bool can_free_region(u64 start, u64 size)
 {
 	if (start + size > __pa_symbol(_text) && start <= __pa_symbol(_end))
 		return false;
[tip:efi/core] x86/efi: Don't unmap EFI boot services code/data regions for EFI_OLD_MEMMAP and EFI_MIXED_MODE
Commit-ID:  1debf0958fa27b7c469dbf22754929ec59a7c0e7
Gitweb:     https://git.kernel.org/tip/1debf0958fa27b7c469dbf22754929ec59a7c0e7
Author:     Sai Praneeth Prakhya
AuthorDate: Fri, 21 Dec 2018 18:22:34 -0800
Committer:  Ingo Molnar
CommitDate: Sat, 22 Dec 2018 20:58:30 +0100

x86/efi: Don't unmap EFI boot services code/data regions for EFI_OLD_MEMMAP and EFI_MIXED_MODE

The following commit:

  d5052a7130a6 ("x86/efi: Unmap EFI boot services code/data regions from efi_pgd")

forgets to take two EFI modes into consideration, namely EFI_OLD_MEMMAP and
EFI_MIXED_MODE:

- EFI_OLD_MEMMAP is a legacy way of mapping EFI regions into swapper_pg_dir
  using ioremap() and init_memory_mapping(). This feature can be enabled by
  passing "efi=old_map" as kernel command line argument. But,
  efi_unmap_pages() unmaps EFI boot services code/data regions *only* from
  efi_pgd and hence cannot be used for unmapping EFI boot services code/data
  regions from swapper_pg_dir.

  Introduce a temporary fix to not unmap EFI boot services code/data regions
  when EFI_OLD_MEMMAP is enabled while working on a real fix.

- EFI_MIXED_MODE is another feature where a 64-bit kernel runs on a
  64-bit platform crippled by a 32-bit firmware. To support EFI_MIXED_MODE,
  all RAM (i.e. namely EFI regions like EFI_CONVENTIONAL_MEMORY,
  EFI_LOADER_<CODE/DATA>, EFI_BOOT_SERVICES_<CODE/DATA> and
  EFI_RUNTIME_CODE/DATA regions) is mapped into efi_pgd all the time to
  facilitate EFI runtime calls access its arguments in 1:1 mode.

  Hence, don't unmap EFI boot services code/data regions when booted in mixed
  mode.

Signed-off-by: Sai Praneeth Prakhya
Acked-by: Ard Biesheuvel
Cc: Andy Lutomirski
Cc: Bhupesh Sharma
Cc: Borislav Petkov
Cc: Dave Hansen
Cc: Dave Hansen
Cc: H. Peter Anvin
Cc: Linus Torvalds
Cc: Peter Zijlstra
Cc: Rik van Riel
Cc: Thomas Gleixner
Cc: linux-...@vger.kernel.org
Link: http://lkml.kernel.org/r/20181222022234.7573-1-sai.praneeth.prak...@intel.com
Signed-off-by: Ingo Molnar
---
 arch/x86/platform/efi/quirks.c | 16 ++++++++++++++++
 1 file changed, 16 insertions(+)

diff --git a/arch/x86/platform/efi/quirks.c b/arch/x86/platform/efi/quirks.c
index 09e811b9da26..17456a1d3f04 100644
--- a/arch/x86/platform/efi/quirks.c
+++ b/arch/x86/platform/efi/quirks.c
@@ -380,6 +380,22 @@ static void __init efi_unmap_pages(efi_memory_desc_t *md)
 	u64 pa = md->phys_addr;
 	u64 va = md->virt_addr;
 
+	/*
+	 * To Do: Remove this check after adding functionality to unmap EFI boot
+	 * services code/data regions from direct mapping area because
+	 * "efi=old_map" maps EFI regions in swapper_pg_dir.
+	 */
+	if (efi_enabled(EFI_OLD_MEMMAP))
+		return;
+
+	/*
+	 * EFI mixed mode has all RAM mapped to access arguments while making
+	 * EFI runtime calls, hence don't unmap EFI boot services code/data
+	 * regions.
+	 */
+	if (!efi_is_native())
+		return;
+
 	if (kernel_unmap_pages_in_pgd(pgd, pa, md->num_pages))
 		pr_err("Failed to unmap 1:1 mapping for 0x%llx\n", pa);
 
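The two early returns added to efi_unmap_pages() amount to a small decision rule. The sketch below models only that rule in userspace C; should_unmap_boot_services() is a hypothetical name, and its two boolean parameters stand in for the results of efi_enabled(EFI_OLD_MEMMAP) and efi_is_native() in the committed code.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical model of the guards at the top of efi_unmap_pages():
 * boot services regions are unmapped from efi_pgd only on a native
 * (non mixed mode) boot that does not use the old memory map.
 */
static bool should_unmap_boot_services(bool old_memmap, bool native)
{
	if (old_memmap)		/* regions live in swapper_pg_dir instead */
		return false;

	if (!native)		/* mixed mode keeps all RAM mapped in efi_pgd */
		return false;

	return true;
}
```

Only one of the four combinations leads to unmapping, which matches the commit message: both "efi=old_map" and mixed mode are excluded.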
[PATCH] x86/efi: Don't unmap EFI boot services code/data regions for EFI_OLD_MEMMAP and EFI_MIXED_MODE
Commit d5052a7130a6 ("x86/efi: Unmap EFI boot services code/data regions
from efi_pgd") forgets to take two EFI modes into consideration namely
EFI_OLD_MEMMAP and EFI_MIXED_MODE.

EFI_OLD_MEMMAP is a legacy way of mapping EFI regions into swapper_pg_dir
using ioremap() and init_memory_mapping(). This feature can be enabled by
passing "efi=old_map" as kernel command line argument. But,
efi_unmap_pages() unmaps EFI boot services code/data regions *only* from
efi_pgd and hence cannot be used for unmapping EFI boot services code/data
regions from swapper_pg_dir.

Introduce a temporary fix to not unmap EFI boot services code/data regions
when EFI_OLD_MEMMAP is enabled while working on a real fix.

EFI_MIXED_MODE is another feature where a 64-bit kernel runs on a
64-bit platform crippled by a 32-bit firmware. To support EFI_MIXED_MODE,
all RAM (i.e. namely EFI regions like EFI_CONVENTIONAL_MEMORY,
EFI_LOADER_<CODE/DATA>, EFI_BOOT_SERVICES_<CODE/DATA> and
EFI_RUNTIME_CODE/DATA regions) is mapped into efi_pgd all the time to
facilitate EFI runtime calls access its arguments in 1:1 mode.

Hence, don't unmap EFI boot services code/data regions when booted in mixed
mode.

Signed-off-by: Sai Praneeth Prakhya
Cc: Borislav Petkov
Cc: Ingo Molnar
Cc: Andy Lutomirski
Cc: Dave Hansen
Cc: Bhupesh Sharma
Cc: Peter Zijlstra
Cc: Thomas Gleixner
Cc: Ard Biesheuvel
---
 arch/x86/platform/efi/quirks.c | 16 ++++++++++++++++
 1 file changed, 16 insertions(+)

diff --git a/arch/x86/platform/efi/quirks.c b/arch/x86/platform/efi/quirks.c
index 09e811b9da26..9c34230aaeae 100644
--- a/arch/x86/platform/efi/quirks.c
+++ b/arch/x86/platform/efi/quirks.c
@@ -380,6 +380,22 @@ static void __init efi_unmap_pages(efi_memory_desc_t *md)
 	u64 pa = md->phys_addr;
 	u64 va = md->virt_addr;
 
+	/*
+	 * To Do: Remove this check after adding functionality to unmap EFI boot
+	 * services code/data regions from direct mapping area because
+	 * "efi=old_map" maps EFI regions in swapper_pg_dir.
+	 */
+	if (efi_enabled(EFI_OLD_MEMMAP))
+		return;
+
+	/*
+	 * EFI mixed mode has all RAM mapped to access arguments while making
+	 * EFI runtime calls, hence don't unmap EFI boot services code/data
+	 * regions.
+	 */
+	if (!efi_is_native() && IS_ENABLED(CONFIG_EFI_MIXED))
+		return;
+
 	if (kernel_unmap_pages_in_pgd(pgd, pa, md->num_pages))
 		pr_err("Failed to unmap 1:1 mapping for 0x%llx\n", pa);
 
--
2.19.1
[tip:efi/core] x86/efi: Move efi_<reserve/free>_boot_services() to arch/x86
Commit-ID:  47c33a095e1fae376d74b4160a0d73c1a4e73969
Gitweb:     https://git.kernel.org/tip/47c33a095e1fae376d74b4160a0d73c1a4e73969
Author:     Sai Praneeth Prakhya
AuthorDate: Thu, 29 Nov 2018 18:12:25 +0100
Committer:  Ingo Molnar
CommitDate: Fri, 30 Nov 2018 09:10:31 +0100

x86/efi: Move efi_<reserve/free>_boot_services() to arch/x86

efi_<reserve/free>_boot_services() are x86 specific quirks and as such
should be in asm/efi.h, so move them from linux/efi.h. Also, call
efi_free_boot_services() from __efi_enter_virtual_mode() as it is x86
specific call and ideally shouldn't be part of init/main.c

Signed-off-by: Sai Praneeth Prakhya
Signed-off-by: Ard Biesheuvel
Acked-by: Thomas Gleixner
Cc: Andy Lutomirski
Cc: Arend van Spriel
Cc: Bhupesh Sharma
Cc: Borislav Petkov
Cc: Dave Hansen
Cc: Eric Snowberg
Cc: Hans de Goede
Cc: Joe Perches
Cc: Jon Hunter
Cc: Julien Thierry
Cc: Linus Torvalds
Cc: Marc Zyngier
Cc: Matt Fleming
Cc: Nathan Chancellor
Cc: Peter Zijlstra
Cc: Sedat Dilek
Cc: YiFei Zhu
Cc: linux-...@vger.kernel.org
Link: http://lkml.kernel.org/r/20181129171230.18699-7-ard.biesheu...@linaro.org
Signed-off-by: Ingo Molnar
---
 arch/x86/include/asm/efi.h  | 2 ++
 arch/x86/platform/efi/efi.c | 2 ++
 include/linux/efi.h         | 3 ---
 init/main.c                 | 4 ----
 4 files changed, 4 insertions(+), 7 deletions(-)

diff --git a/arch/x86/include/asm/efi.h b/arch/x86/include/asm/efi.h
index eea40d52ca78..d1e64ac80b9c 100644
--- a/arch/x86/include/asm/efi.h
+++ b/arch/x86/include/asm/efi.h
@@ -141,6 +141,8 @@ extern int __init efi_reuse_config(u64 tables, int nr_tables);
 extern void efi_delete_dummy_variable(void);
 extern void efi_switch_mm(struct mm_struct *mm);
 extern void efi_recover_from_page_fault(unsigned long phys_addr);
+extern void efi_free_boot_services(void);
+extern void efi_reserve_boot_services(void);
 
 struct efi_setup_data {
 	u64 fw_vendor;
diff --git a/arch/x86/platform/efi/efi.c b/arch/x86/platform/efi/efi.c
index 7ae939e353cd..e1cb01a22fa8 100644
--- a/arch/x86/platform/efi/efi.c
+++ b/arch/x86/platform/efi/efi.c
@@ -993,6 +993,8 @@ static void __init __efi_enter_virtual_mode(void)
 		panic("EFI call to SetVirtualAddressMap() failed!");
 	}
 
+	efi_free_boot_services();
+
 	/*
 	 * Now that EFI is in virtual mode, update the function
 	 * pointers in the runtime service table to the new virtual addresses.
diff --git a/include/linux/efi.h b/include/linux/efi.h
index 100ce4a4aff6..2b3b33c83b05 100644
--- a/include/linux/efi.h
+++ b/include/linux/efi.h
@@ -1000,13 +1000,11 @@ extern void efi_memmap_walk (efi_freemem_callback_t callback, void *arg);
 extern void efi_gettimeofday (struct timespec64 *ts);
 extern void efi_enter_virtual_mode (void);	/* switch EFI to virtual mode, if possible */
 #ifdef CONFIG_X86
-extern void efi_free_boot_services(void);
 extern efi_status_t efi_query_variable_store(u32 attributes,
 					     unsigned long size,
 					     bool nonblocking);
 extern void efi_find_mirror(void);
 #else
-static inline void efi_free_boot_services(void) {}
 
 static inline efi_status_t efi_query_variable_store(u32 attributes,
 						    unsigned long size,
@@ -1046,7 +1044,6 @@ extern void efi_mem_reserve(phys_addr_t addr, u64 size);
 extern int efi_mem_reserve_persistent(phys_addr_t addr, u64 size);
 extern void efi_initialize_iomem_resources(struct resource *code_resource,
 		struct resource *data_resource, struct resource *bss_resource);
-extern void efi_reserve_boot_services(void);
 extern int efi_get_fdt_params(struct efi_fdt_params *params);
 extern struct kobject *efi_kobj;
 
diff --git a/init/main.c b/init/main.c
index ee147103ba1b..ccefcd8e855f 100644
--- a/init/main.c
+++ b/init/main.c
@@ -737,10 +737,6 @@ asmlinkage __visible void __init start_kernel(void)
 	arch_post_acpi_subsys_init();
 	sfi_init_late();
 
-	if (efi_enabled(EFI_RUNTIME_SERVICES)) {
-		efi_free_boot_services();
-	}
-
 	/* Do the rest non-__init'ed, we're now alive */
 	arch_call_rest_init();
 }
[tip:efi/core] x86/efi: Unmap EFI boot services code/data regions from efi_pgd
Commit-ID: 08cfb38f3ef49cfd1bba11a00401451606477d80
Gitweb: https://git.kernel.org/tip/08cfb38f3ef49cfd1bba11a00401451606477d80
Author: Sai Praneeth Prakhya
AuthorDate: Thu, 29 Nov 2018 18:12:24 +0100
Committer: Ingo Molnar
CommitDate: Fri, 30 Nov 2018 09:10:30 +0100

x86/efi: Unmap EFI boot services code/data regions from efi_pgd

efi_free_boot_services(), as the name suggests, frees EFI boot services
code/data regions but forgets to unmap these regions from efi_pgd. This
means that any code running in the efi_pgd address space (e.g. any EFI
runtime service) would still be able to access these regions, even though
their contents would long since have been overwritten by someone else.
So, it's important to unmap these regions. Hence, introduce
efi_unmap_pages() to unmap these regions from efi_pgd.

After unmapping EFI boot services code/data regions, any illegal access
by buggy firmware to these regions would result in a page fault, which
will be handled by the EFI specific fault handler.

Signed-off-by: Sai Praneeth Prakhya
Signed-off-by: Ard Biesheuvel
Acked-by: Thomas Gleixner
Cc: Andy Lutomirski
Cc: Arend van Spriel
Cc: Bhupesh Sharma
Cc: Borislav Petkov
Cc: Dave Hansen
Cc: Eric Snowberg
Cc: Hans de Goede
Cc: Joe Perches
Cc: Jon Hunter
Cc: Julien Thierry
Cc: Linus Torvalds
Cc: Marc Zyngier
Cc: Matt Fleming
Cc: Nathan Chancellor
Cc: Peter Zijlstra
Cc: Sedat Dilek
Cc: YiFei Zhu
Cc: linux-...@vger.kernel.org
Link: http://lkml.kernel.org/r/20181129171230.18699-6-ard.biesheu...@linaro.org
Signed-off-by: Ingo Molnar
---
 arch/x86/platform/efi/quirks.c | 25 +
 1 file changed, 25 insertions(+)

diff --git a/arch/x86/platform/efi/quirks.c b/arch/x86/platform/efi/quirks.c
index 95e77a667ba5..09e811b9da26 100644
--- a/arch/x86/platform/efi/quirks.c
+++ b/arch/x86/platform/efi/quirks.c
@@ -369,6 +369,24 @@ void __init efi_reserve_boot_services(void)
 	}
 }
 
+/*
+ * Apart from having VA mappings for EFI boot services code/data regions,
+ * (duplicate) 1:1 mappings were also created as a quirk for buggy firmware. So,
+ * unmap both 1:1 and VA mappings.
+ */
+static void __init efi_unmap_pages(efi_memory_desc_t *md)
+{
+	pgd_t *pgd = efi_mm.pgd;
+	u64 pa = md->phys_addr;
+	u64 va = md->virt_addr;
+
+	if (kernel_unmap_pages_in_pgd(pgd, pa, md->num_pages))
+		pr_err("Failed to unmap 1:1 mapping for 0x%llx\n", pa);
+
+	if (kernel_unmap_pages_in_pgd(pgd, va, md->num_pages))
+		pr_err("Failed to unmap VA mapping for 0x%llx\n", va);
+}
+
 void __init efi_free_boot_services(void)
 {
 	phys_addr_t new_phys, new_size;
@@ -393,6 +411,13 @@ void __init efi_free_boot_services(void)
 			continue;
 		}
 
+		/*
+		 * Before calling set_virtual_address_map(), EFI boot services
+		 * code/data regions were mapped as a quirk for buggy firmware.
+		 * Unmap them from efi_pgd before freeing them up.
+		 */
+		efi_unmap_pages(md);
+
 		/*
 		 * Nasty quirk: if all sub-1MB memory is used for boot
 		 * services, we can get here without having allocated the
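For readers who want to poke at the idea outside the kernel, here is a minimal userspace sketch of what efi_unmap_pages() does: the same region is reachable through two mappings (1:1 and VA), so both ranges have to be torn down. Everything prefixed toy_/TOY_ is hypothetical and only stands in for the kernel objects named in the patch (efi_pgd, efi_memory_desc_t, kernel_unmap_pages_in_pgd()); it does not model real page tables or TLB flushing.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define TOY_PAGE_SHIFT 12
#define TOY_MAX_PAGES  64	/* toy address space: 64 pages of 4 KiB */

/* Hypothetical stand-in for efi_pgd: one "present" flag per page. */
struct toy_pgd {
	bool present[TOY_MAX_PAGES];
};

/* Hypothetical stand-in for the efi_memory_desc_t fields the patch uses. */
struct toy_md {
	uint64_t phys_addr;
	uint64_t virt_addr;
	uint64_t num_pages;
};

/* Set or clear the present flag for num_pages pages starting at addr. */
int toy_set_range(struct toy_pgd *pgd, uint64_t addr, uint64_t num_pages,
		  bool present)
{
	uint64_t first = addr >> TOY_PAGE_SHIFT;

	if (first >= TOY_MAX_PAGES || num_pages > TOY_MAX_PAGES - first)
		return -1;	/* range falls outside the toy address space */

	for (uint64_t i = 0; i < num_pages; i++)
		pgd->present[first + i] = present;
	return 0;
}

bool toy_page_present(const struct toy_pgd *pgd, uint64_t addr)
{
	uint64_t idx = addr >> TOY_PAGE_SHIFT;

	return idx < TOY_MAX_PAGES && pgd->present[idx];
}

/*
 * Mirrors the shape of efi_unmap_pages(): try both the 1:1 and the VA
 * mapping, report each failure, and keep going. Returns the failure count.
 */
int toy_efi_unmap_pages(struct toy_pgd *pgd, const struct toy_md *md)
{
	int failures = 0;

	if (toy_set_range(pgd, md->phys_addr, md->num_pages, false)) {
		fprintf(stderr, "Failed to unmap 1:1 mapping for 0x%llx\n",
			(unsigned long long)md->phys_addr);
		failures++;
	}
	if (toy_set_range(pgd, md->virt_addr, md->num_pages, false)) {
		fprintf(stderr, "Failed to unmap VA mapping for 0x%llx\n",
			(unsigned long long)md->virt_addr);
		failures++;
	}
	return failures;
}
```

A caller would map both ranges with toy_set_range(..., true), call toy_efi_unmap_pages(), and then expect toy_page_present() to return false for addresses in either range.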
[tip:efi/core] x86/mm/pageattr: Introduce helper function to unmap EFI boot services
Commit-ID: 7e0dabd3010d6041ee0a952c1146b2150a11f1be
Gitweb: https://git.kernel.org/tip/7e0dabd3010d6041ee0a952c1146b2150a11f1be
Author: Sai Praneeth Prakhya
AuthorDate: Thu, 29 Nov 2018 18:12:23 +0100
Committer: Ingo Molnar
CommitDate: Fri, 30 Nov 2018 09:10:30 +0100

x86/mm/pageattr: Introduce helper function to unmap EFI boot services

Ideally, after the kernel assumes control of the platform, firmware
shouldn't access EFI boot services code/data regions. But this has been
observed not to hold on many x86 platforms. Hence, during boot, the kernel
reserves EFI boot services code/data regions [1] and maps [2] them to
efi_pgd so that the call to set_virtual_address_map() doesn't fail. After
returning from set_virtual_address_map(), the kernel frees the reserved
regions [3] but they still remain mapped. Hence, introduce
kernel_unmap_pages_in_pgd() which will later be used to unmap EFI boot
services code/data regions.

While at it, modify kernel_map_pages_in_pgd() by:

1. Adding the __init modifier because it's always used *only* during boot.
2. Adding a warning if it's used after SMP is initialized because it uses
   __flush_tlb_all() which flushes mappings only on the current CPU.

Unmapping EFI boot services code/data regions will result in clearing the
PAGE_PRESENT bit, and it shouldn't bother L1TF cases because that is
already handled by protnone_mask() at
arch/x86/include/asm/pgtable-invert.h.

[1] efi_reserve_boot_services()
[2] efi_map_region() -> __map_region() -> kernel_map_pages_in_pgd()
[3] efi_free_boot_services()

Signed-off-by: Sai Praneeth Prakhya
Signed-off-by: Ard Biesheuvel
Reviewed-by: Thomas Gleixner
Cc: Andy Lutomirski
Cc: Arend van Spriel
Cc: Bhupesh Sharma
Cc: Borislav Petkov
Cc: Dave Hansen
Cc: Eric Snowberg
Cc: Hans de Goede
Cc: Joe Perches
Cc: Jon Hunter
Cc: Julien Thierry
Cc: Linus Torvalds
Cc: Marc Zyngier
Cc: Matt Fleming
Cc: Nathan Chancellor
Cc: Peter Zijlstra
Cc: Sedat Dilek
Cc: YiFei Zhu
Cc: linux-...@vger.kernel.org
Link: http://lkml.kernel.org/r/20181129171230.18699-5-ard.biesheu...@linaro.org
Signed-off-by: Ingo Molnar
---
 arch/x86/include/asm/pgtable_types.h | 8 ++--
 arch/x86/mm/pageattr.c | 40 ++--
 2 files changed, 44 insertions(+), 4 deletions(-)

diff --git a/arch/x86/include/asm/pgtable_types.h b/arch/x86/include/asm/pgtable_types.h
index 106b7d0e2dae..d6ff0bbdb394 100644
--- a/arch/x86/include/asm/pgtable_types.h
+++ b/arch/x86/include/asm/pgtable_types.h
@@ -564,8 +564,12 @@ extern pte_t *lookup_address_in_pgd(pgd_t *pgd, unsigned long address,
 				    unsigned int *level);
 extern pmd_t *lookup_pmd_address(unsigned long address);
 extern phys_addr_t slow_virt_to_phys(void *__address);
-extern int kernel_map_pages_in_pgd(pgd_t *pgd, u64 pfn, unsigned long address,
-				   unsigned numpages, unsigned long page_flags);
+extern int __init kernel_map_pages_in_pgd(pgd_t *pgd, u64 pfn,
+					  unsigned long address,
+					  unsigned numpages,
+					  unsigned long page_flags);
+extern int __init kernel_unmap_pages_in_pgd(pgd_t *pgd, unsigned long address,
+					    unsigned long numpages);
 #endif /* !__ASSEMBLY__ */
 
 #endif /* _ASM_X86_PGTABLE_DEFS_H */
diff --git a/arch/x86/mm/pageattr.c b/arch/x86/mm/pageattr.c
index db7a10082238..bac35001d896 100644
--- a/arch/x86/mm/pageattr.c
+++ b/arch/x86/mm/pageattr.c
@@ -2338,8 +2338,8 @@ bool kernel_page_present(struct page *page)
 
 #endif /* CONFIG_DEBUG_PAGEALLOC */
 
-int kernel_map_pages_in_pgd(pgd_t *pgd, u64 pfn, unsigned long address,
-			    unsigned numpages, unsigned long page_flags)
+int __init kernel_map_pages_in_pgd(pgd_t *pgd, u64 pfn, unsigned long address,
+				   unsigned numpages, unsigned long page_flags)
 {
 	int retval = -EINVAL;
 
@@ -2353,6 +2353,8 @@ int kernel_map_pages_in_pgd(pgd_t *pgd, u64 pfn, unsigned long address,
 		.flags = 0,
 	};
 
+	WARN_ONCE(num_online_cpus() > 1, "Don't call after initializing SMP");
+
 	if (!(__supported_pte_mask & _PAGE_NX))
 		goto out;
 
@@ -2374,6 +2376,40 @@ out:
 	return retval;
 }
 
+/*
+ * __flush_tlb_all() flushes mappings only on current CPU and hence this
+ * function shouldn't be used in an SMP environment. Presently, it's used only
+ * during boot (way before smp_init()) by EFI subsystem and hence is ok.
+ */
+int __init kernel_unmap_pages_in_pgd(pgd_t *pgd, unsigned long address,
+				     unsigned long numpages)
+{
+	int retval;
+
+	/*
+	 * The typical sequence for unmapping is to find a pte through
+	 * lookup_address_in_pgd() (ideally, it should never return NULL because
+	 * the address is already mapped) and change i
[PATCH V5 0/2] Add efi page fault handler to recover from page
From: Sai Praneeth

There may exist some buggy UEFI firmware implementations that access efi
memory regions other than EFI_RUNTIME_SERVICES_ even after the kernel
has assumed control of the platform. This violates the UEFI specification.
Hence, provide an efi specific page fault handler which recovers from page
faults caused by buggy firmware.

Page faults triggered by firmware happen at ring 0 and, if unhandled, hang
the kernel. So, provide an efi specific page fault handler to:
1. Avoid panics/hangs caused by buggy firmware.
2. Shout loud that the firmware is buggy and hence this is not a kernel bug.

The efi page fault handler will check if the access is by
efi_reset_system().
1. If so, then the efi page fault handler will reboot the machine
   through BIOS and not through efi_reset_system().
2. If not, then the efi page fault handler will freeze efi_rts_wq and
   schedule a new process.

This issue was reported by Al Stone when he saw that reboot via EFI hangs
the machine. Upon debugging, I found that it's efi_reset_system() that's
touching memory regions which it shouldn't. To reproduce the same
behavior, I have hacked OVMF and made efi_reset_system() buggy. Along
with efi_reset_system(), I have also modified get_next_high_mono_count()
and set_virtual_address_map(). They illegally access both boot time and
other efi regions.

Testing the patch set:
----------------------
1. Download buggy firmware from here [1].
2. Run a qemu instance with this buggy BIOS and boot the mainline kernel.
   Add reboot=efi to the kernel command line arguments and, after the kernel
   is up and running, type "reboot". The kernel should hang while rebooting.
3. With the same setup, boot the kernel after applying the patches and the
   reboot should work fine. Also note the warning/error messages printed by
   the kernel.

Changes from RFC to V1:
-----------------------
1. Drop "long jump" technique of dealing with illegal access and instead
   use scheduling away from efi_rts_wq.

Changes from V1 to V2:
----------------------
1. Shortened config name to CONFIG_EFI_WARN_ON_ILLEGAL_ACCESS from
   CONFIG_EFI_WARN_ON_ILLEGAL_ACCESSES.
2. Made the config option available only to expert users.
3. efi_free_boot_services() should be called only when
   CONFIG_EFI_WARN_ON_ILLEGAL_ACCESS is not enabled. Previously, this was
   part of init/main.c file. As it is an architecture agnostic code, moved
   the change to arch/x86/platform/efi/quirks.c file.

Changes from V2 to V3:
----------------------
1. Drop treating illegal access to EFI_BOOT_SERVICES_ regions separately
   from illegal accesses to other regions like EFI_CONVENTIONAL_MEMORY or
   EFI_LOADER_. In previous versions, illegal accesses to
   EFI_BOOT_SERVICES_ regions were handled by mapping the requested region
   to efi_pgd, but from V3 they are handled like illegal accesses to other
   regions, i.e. by freezing efi_rts_wq and scheduling a new process.
2. Change __efi_init_fixup attribute to __efi_init.

Changes from V3 to V4:
----------------------
1. Drop saving original memory map passed by kernel. It also means fewer
   checks in the efi page fault handler.
2. Change the config name to EFI_PAGE_FAULT_HANDLER to reflect its
   functionality more appropriately.

Changes from V4 to V5:
----------------------
1. Drop config option that enables efi page fault handler, instead make
   it default.
2. Call schedule() in an infinite loop to account for spurious wake ups.
3. Introduce "NONE" as an efi runtime service function identifier so that
   it could be used in efi_recover_from_page_fault() to check if the page
   fault was indeed triggered by an efi runtime service.

Note:
-----
Patch set based on "next" branch in efi tree.

[1] https://drive.google.com/drive/folders/1VozKTms92ifyVHAT0ZDQe55ZYL1UE5wt

Sai Praneeth (2):
  efi: Make efi_rts_work accessible to efi page fault handler
  x86/efi: Add efi page fault handler to recover from page faults caused
    by the firmware

 arch/x86/include/asm/efi.h | 1 +
 arch/x86/mm/fault.c | 9 
 arch/x86/platform/efi/quirks.c | 78 +
 drivers/firmware/efi/runtime-wrappers.c | 61 +++---
 include/linux/efi.h | 42 ++
 5 files changed, 147 insertions(+), 44 deletions(-)

Tested-by: Bhupesh Sharma
Suggested-by: Matt Fleming
Based-on-code-from: Ricardo Neri
Signed-off-by: Sai Praneeth Prakhya
Cc: Al Stone
Cc: Borislav Petkov
Cc: Ingo Molnar
Cc: Andy Lutomirski
Cc: Bhupesh Sharma
Cc: Thomas Gleixner
Cc: Peter Zijlstra
Cc: Ard Biesheuvel
--
2.7.4
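The decision flow described in the cover letter can be sketched in plain C as below. This is only an illustration of the three outcomes ("not an EFI fault", "reboot through BIOS", "freeze efi_rts_wq"), not the code from the patches: the enum names, the toy_ prefix and the function itself are hypothetical, and the real handler acts on the machine instead of returning a value.

```c
/* Hypothetical identifiers for the runtime service currently executing. */
enum toy_rts_id {
	TOY_RTS_NONE,		/* no EFI runtime service is in flight */
	TOY_RTS_GET_TIME,
	TOY_RTS_GET_NEXT_HIGH_MONO_COUNT,
	TOY_RTS_RESET_SYSTEM,
};

/* What the sketch decides to do about a page fault. */
enum toy_recovery {
	TOY_NOT_AN_EFI_FAULT,	/* let the regular fault path handle it */
	TOY_REBOOT_VIA_BIOS,	/* faulting service was reset_system */
	TOY_FREEZE_RTS_WQ,	/* park the workqueue, schedule someone else */
};

/*
 * Classify a fault given the runtime service that was executing when it
 * happened. The "NONE" identifier lets the handler tell firmware faults
 * apart from ordinary kernel faults before doing anything drastic.
 */
enum toy_recovery toy_recover_from_page_fault(enum toy_rts_id in_flight)
{
	if (in_flight == TOY_RTS_NONE)
		return TOY_NOT_AN_EFI_FAULT;

	/*
	 * The machine is going down anyway, so fall back to a different
	 * reboot method rather than trusting the buggy service again.
	 */
	if (in_flight == TOY_RTS_RESET_SYSTEM)
		return TOY_REBOOT_VIA_BIOS;

	/* Any other service: stop using runtime services, keep running. */
	return TOY_FREEZE_RTS_WQ;
}
```

Checking for "NONE" first is the design point worth noting: without it, any ring 0 fault taken while no runtime service is running could be misattributed to firmware.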
[PATCH V3] x86/speculation: Support Enhanced IBRS on future CPUs
From: Sai Praneeth

Future Intel processors will support "Enhanced IBRS" which is an "always
on" mode i.e. IBRS bit in SPEC_CTRL MSR is enabled once and never
disabled.

From the specification [1]:

  "With enhanced IBRS, the predicted targets of indirect branches
   executed cannot be controlled by software that was executed in a less
   privileged predictor mode or on another logical processor. As a
   result, software operating on a processor with enhanced IBRS need not
   use WRMSR to set IA32_SPEC_CTRL.IBRS after every transition to a more
   privileged predictor mode. Software can isolate predictor modes
   effectively simply by setting the bit once. Software need not disable
   enhanced IBRS prior to entering a sleep state such as MWAIT or HLT."

If Enhanced IBRS is supported by the processor then use it as the
preferred spectre v2 mitigation mechanism instead of Retpoline. Intel's
Retpoline white paper [2] states:

  "Retpoline is known to be an effective branch target injection (Spectre
   variant 2) mitigation on Intel processors belonging to family 6
   (enumerated by the CPUID instruction) that do not have support for
   enhanced IBRS. On processors that support enhanced IBRS, it should be
   used for mitigation instead of retpoline."

The reason why Enhanced IBRS is the recommended mitigation on processors
which support it is that these processors also support CET which
provides a defense against ROP attacks. Retpoline is very similar to ROP
techniques and might trigger false positives in the CET defense.

If Enhanced IBRS is selected as the mitigation technique for spectre v2,
the IBRS bit in SPEC_CTRL MSR is set once at boot time and never
cleared. Kernel also has to make sure that IBRS bit remains set after
VMEXIT because the guest might have cleared the bit. This is already
covered by the existing x86_spec_ctrl_set_guest() and
x86_spec_ctrl_restore_host() speculation control functions.

Enhanced IBRS still requires IBPB for full mitigation.

[1] Speculative-Execution-Side-Channel-Mitigations.pdf
[2] Retpoline-A-Branch-Target-Injection-Mitigation.pdf
Both the documents are available at:
https://bugzilla.kernel.org/show_bug.cgi?id=199511

Signed-off-by: Sai Praneeth Prakhya
Originally-by: David Woodhouse
Cc: Ingo Molnar
Cc: Tim C Chen
Cc: Dave Hansen
Cc: Thomas Gleixner
Cc: Ravi Shankar
---
 arch/x86/include/asm/cpufeatures.h | 1 +
 arch/x86/include/asm/nospec-branch.h | 2 +-
 arch/x86/kernel/cpu/bugs.c | 23 +--
 arch/x86/kernel/cpu/common.c | 3 +++
 4 files changed, 26 insertions(+), 3 deletions(-)

Changes from V2 to V3:
1. Improve commit message as suggested by Thomas i.e.
   a. Use indentation when quoting from specification.
   b. Refrain from using "this patch" and "we".
   c. Restructuring and enhancing information on the real reason for
      using Enhanced IBRS as the default spectre V2 mitigation technique.
2. Remove "ibrs_enhanced" feature string as it's not needed.
3. Remove unnecessary WARN_ON_ONCE().
4. Add explicit wrmsrl() after setting IBRS bit in x86_spec_ctrl_base.

Changes from V1 to V2:
1. Explicitly spell out in the change log, the reason for using Enhanced
   IBRS as the default spectre V2 mitigation technique instead of
   Retpoline.

diff --git a/arch/x86/include/asm/cpufeatures.h b/arch/x86/include/asm/cpufeatures.h
index 5701f5cecd31..568fa20254f7 100644
--- a/arch/x86/include/asm/cpufeatures.h
+++ b/arch/x86/include/asm/cpufeatures.h
@@ -219,6 +219,7 @@
 #define X86_FEATURE_IBPB		( 7*32+26) /* Indirect Branch Prediction Barrier */
 #define X86_FEATURE_STIBP		( 7*32+27) /* Single Thread Indirect Branch Predictors */
 #define X86_FEATURE_ZEN			( 7*32+28) /* "" CPU is AMD family 0x17 (Zen) */
+#define X86_FEATURE_IBRS_ENHANCED	( 7*32+29) /* Enhanced IBRS */
 
 /* Virtualization flags: Linux defined, word 8 */
 #define X86_FEATURE_TPR_SHADOW		( 8*32+ 0) /* Intel TPR Shadow */
diff --git a/arch/x86/include/asm/nospec-branch.h b/arch/x86/include/asm/nospec-branch.h
index f6f6c63da62f..fd2a8c1b88bc 100644
--- a/arch/x86/include/asm/nospec-branch.h
+++ b/arch/x86/include/asm/nospec-branch.h
@@ -214,7 +214,7 @@ enum spectre_v2_mitigation {
 	SPECTRE_V2_RETPOLINE_MINIMAL_AMD,
 	SPECTRE_V2_RETPOLINE_GENERIC,
 	SPECTRE_V2_RETPOLINE_AMD,
-	SPECTRE_V2_IBRS,
+	SPECTRE_V2_IBRS_ENHANCED,
 };
 
 /* The Speculative Store Bypass disable variants */
diff --git a/arch/x86/kernel/cpu/bugs.c b/arch/x86/kernel/cpu/bugs.c
index 5c0ea39311fe..4e4be8512a77 100644
--- a/arch/x86/kernel/cpu/bugs.c
+++ b/arch/x86/kernel/cpu/bugs.c
@@ -130,6 +130,7 @@ static const char *spectre_v2_strings[] = {
 	[SPECTRE_V2_RETPOLINE_MINIMAL_AMD]	= "Vulnerable: Minimal AMD ASM retpoline",
 	[SPECTRE_V2_RETPOLINE_GENERIC]		= "Mitigation: Full gener
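Since the diff above is cut short, here is a rough userspace sketch of the selection logic the commit message describes: prefer Enhanced IBRS when the CPU advertises it, set the IBRS bit once in the cached base value and write it out, and restore the host value after a guest has run. The toy_ names, the struct and the plain variable standing in for the MSR are assumptions made for illustration; only the fact that IBRS is bit 0 of IA32_SPEC_CTRL is taken from the architecture, and none of this is the code from the patch.

```c
#include <stdbool.h>
#include <stdint.h>

#define TOY_SPEC_CTRL_IBRS (1ULL << 0)	/* IBRS is bit 0 of IA32_SPEC_CTRL */

enum toy_mitigation {
	TOY_MITIGATION_NONE,
	TOY_MITIGATION_RETPOLINE,
	TOY_MITIGATION_IBRS_ENHANCED,
};

/* Hypothetical model of one CPU: feature flags plus the MSR contents. */
struct toy_cpu {
	bool has_ibrs_enhanced;
	bool has_retpoline;
	uint64_t spec_ctrl_base;	/* value the host wants in the MSR */
	uint64_t spec_ctrl_msr;		/* stands in for the real MSR */
};

/* Pick a mitigation, preferring Enhanced IBRS over retpoline. */
enum toy_mitigation toy_select_spectre_v2(struct toy_cpu *cpu)
{
	if (cpu->has_ibrs_enhanced) {
		/* Set the bit once in the base value and write it out. */
		cpu->spec_ctrl_base |= TOY_SPEC_CTRL_IBRS;
		cpu->spec_ctrl_msr = cpu->spec_ctrl_base;
		return TOY_MITIGATION_IBRS_ENHANCED;
	}
	if (cpu->has_retpoline)
		return TOY_MITIGATION_RETPOLINE;
	return TOY_MITIGATION_NONE;
}

/* A guest is free to clear IBRS while it runs. */
void toy_guest_clears_ibrs(struct toy_cpu *cpu)
{
	cpu->spec_ctrl_msr &= ~TOY_SPEC_CTRL_IBRS;
}

/* After VMEXIT the host value is restored, so IBRS is set again. */
void toy_restore_host(struct toy_cpu *cpu)
{
	cpu->spec_ctrl_msr = cpu->spec_ctrl_base;
}
```

The point of keeping a separate base value is that the "set once" promise survives guests that clear the bit: the restore step only has to copy the base back.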
[PATCH V2] x86/speculation: Support Enhanced IBRS on future CPUs
From: Sai Praneeth

Some future Intel processors may support "Enhanced IBRS", which is an
"always on" mode, i.e. the IBRS bit in the SPEC_CTRL MSR is enabled once and
never disabled.

[With enhanced IBRS, the predicted targets of indirect branches executed
cannot be controlled by software that was executed in a less privileged
predictor mode or on another logical processor. As a result, software
operating on a processor with enhanced IBRS need not use WRMSR to set
IA32_SPEC_CTRL.IBRS after every transition to a more privileged predictor
mode. Software can isolate predictor modes effectively simply by setting
the bit once. Software need not disable enhanced IBRS prior to entering a
sleep state such as MWAIT or HLT.] - Specification [1]

Even with enhanced IBRS, we still need to make sure that the IBRS bit in the
SPEC_CTRL MSR is always set, i.e. while booting, if we detect support for
Enhanced IBRS, we enable the IBRS bit in the SPEC_CTRL MSR and we should also
make sure that it always remains set. In other words, if the guest has
cleared the IBRS bit, upon VMEXIT the bit should still be set.

Fortunately, the kernel already has the infrastructure ready. kvm/vmx.c does
x86_spec_ctrl_set_guest() before entering the guest and
x86_spec_ctrl_restore_host() after leaving the guest. So, the guest view of
the SPEC_CTRL MSR is restored before entering the guest and the host view of
the SPEC_CTRL MSR is restored before entering the host, and hence IBRS will
be set after VMEXIT.

Intel's white paper on Retpoline [2] says that "Retpoline is known to be an
effective branch target injection (Spectre variant 2) mitigation on Intel
processors belonging to family 6 (enumerated by the CPUID instruction) that
do not have support for enhanced IBRS. On processors that support enhanced
IBRS, it should be used for mitigation instead of retpoline."

This means Intel recommends using Enhanced IBRS over Retpoline where
available, and it also means that retpoline provides less mitigation on
processors with enhanced IBRS compared to those without. Hence, on
processors that support Enhanced IBRS, this patch makes Enhanced IBRS the
default Spectre V2 mitigation technique instead of retpoline. Also, we still
need IBPB even with enhanced IBRS.

[1] https://software.intel.com/sites/default/files/managed/c5/63/336996-Speculative-Execution-Side-Channel-Mitigations.pdf
[2] https://software.intel.com/sites/default/files/managed/1d/46/Retpoline-A-Branch-Target-Injection-Mitigation.pdf

Signed-off-by: Sai Praneeth Prakhya
Originally-by: David Woodhouse
Cc: Tim C Chen
Cc: Dave Hansen
Cc: Thomas Gleixner
Cc: Ravi Shankar
Cc: Ingo Molnar
---
 arch/x86/include/asm/cpufeatures.h   |  1 +
 arch/x86/include/asm/nospec-branch.h |  2 +-
 arch/x86/kernel/cpu/bugs.c           | 29 +++--
 arch/x86/kernel/cpu/common.c         |  3 +++
 4 files changed, 32 insertions(+), 3 deletions(-)

Changes from V1 to V2:
1. Explicitly spell out in the change log the reason for using Enhanced
   IBRS as the default Spectre V2 mitigation technique instead of Retpoline.

diff --git a/arch/x86/include/asm/cpufeatures.h b/arch/x86/include/asm/cpufeatures.h
index 5701f5cecd31..f75815b1dbee 100644
--- a/arch/x86/include/asm/cpufeatures.h
+++ b/arch/x86/include/asm/cpufeatures.h
@@ -219,6 +219,7 @@
 #define X86_FEATURE_IBPB		( 7*32+26) /* Indirect Branch Prediction Barrier */
 #define X86_FEATURE_STIBP		( 7*32+27) /* Single Thread Indirect Branch Predictors */
 #define X86_FEATURE_ZEN			( 7*32+28) /* "" CPU is AMD family 0x17 (Zen) */
+#define X86_FEATURE_IBRS_ENHANCED	( 7*32+29) /* "ibrs_enhanced" Use Enhanced IBRS in kernel */

 /* Virtualization flags: Linux defined, word 8 */
 #define X86_FEATURE_TPR_SHADOW		( 8*32+ 0) /* Intel TPR Shadow */
diff --git a/arch/x86/include/asm/nospec-branch.h b/arch/x86/include/asm/nospec-branch.h
index f6f6c63da62f..fd2a8c1b88bc 100644
--- a/arch/x86/include/asm/nospec-branch.h
+++ b/arch/x86/include/asm/nospec-branch.h
@@ -214,7 +214,7 @@ enum spectre_v2_mitigation {
 	SPECTRE_V2_RETPOLINE_MINIMAL_AMD,
 	SPECTRE_V2_RETPOLINE_GENERIC,
 	SPECTRE_V2_RETPOLINE_AMD,
-	SPECTRE_V2_IBRS,
+	SPECTRE_V2_IBRS_ENHANCED,
 };

 /* The Speculative Store Bypass disable variants */
diff --git a/arch/x86/kernel/cpu/bugs.c b/arch/x86/kernel/cpu/bugs.c
index 5c0ea39311fe..a66517de1301 100644
--- a/arch/x86/kernel/cpu/bugs.c
+++ b/arch/x86/kernel/cpu/bugs.c
@@ -130,6 +130,7 @@ static const char *spectre_v2_strings[] = {
 	[SPECTRE_V2_RETPOLINE_MINIMAL_AMD]	= "Vulnerable: Minimal AMD ASM retpoline",
 	[SPECTRE_V2_RETPOLINE_GENERIC]		= "Mitigation: Full generic retpoline",
 	[SPECTRE_V2_RETPOLINE_AMD]		= "Mitigation: Full AMD retpoline",
+	[SPECTRE_V2_IBRS_ENHANCED]		= "Mitigation: Enhanced IBRS",
 };

 #undef pr_fm
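
The guest/host handling of SPEC_CTRL described in the change log can be
illustrated with a small user-space model. This is only a sketch under stated
assumptions: fake_msr, host_spec_ctrl and the sketch_* helpers are
hypothetical stand-ins, not the kernel's x86_spec_ctrl_set_guest() or
x86_spec_ctrl_restore_host(), and the "MSR" is just a variable. The only
hardware fact relied on is that IBRS is bit 0 of IA32_SPEC_CTRL.

```c
#include <assert.h>
#include <stdint.h>

#define SPEC_CTRL_IBRS (1ULL << 0)	/* IBRS is bit 0 of IA32_SPEC_CTRL */

/* Hypothetical stand-in for the SPEC_CTRL MSR; IBRS was set once at "boot". */
static uint64_t fake_msr = SPEC_CTRL_IBRS;

/* Host view: with Enhanced IBRS the host wants IBRS set at all times. */
static const uint64_t host_spec_ctrl = SPEC_CTRL_IBRS;

static void write_spec_ctrl(uint64_t val)
{
	fake_msr = val;
}

/* Before entering the guest: load the guest view if it differs from the host view. */
static void sketch_set_guest(uint64_t guest_val)
{
	if (guest_val != host_spec_ctrl)
		write_spec_ctrl(guest_val);
}

/* After leaving the guest: put the host view back if the guest view differed. */
static void sketch_restore_host(uint64_t guest_val)
{
	if (guest_val != host_spec_ctrl)
		write_spec_ctrl(host_spec_ctrl);

	/* Whatever the guest did, the host runs with IBRS set again. */
	assert(fake_msr & SPEC_CTRL_IBRS);
}
```

In this model a guest that cleared IBRS runs with the bit clear, and the bit
is set again as soon as sketch_restore_host() has run, which is the property
the change log asks for.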
[PATCH] x86/speculation: Support Enhanced IBRS on future CPUs
From: Sai Praneeth

Some future Intel processors may support "Enhanced IBRS", which is an
"always on" mode, i.e. the IBRS bit in the SPEC_CTRL MSR is enabled once and
never disabled. According to the specification [1], this should simplify
software enabling and improve performance.

[With enhanced IBRS, the predicted targets of indirect branches executed
cannot be controlled by software that was executed in a less privileged
predictor mode or on another logical processor. As a result, software
operating on a processor with enhanced IBRS need not use WRMSR to set
IA32_SPEC_CTRL.IBRS after every transition to a more privileged predictor
mode. Software can isolate predictor modes effectively simply by setting
the bit once. Software need not disable enhanced IBRS prior to entering a
sleep state such as MWAIT or HLT.] - Specification

Even with enhanced IBRS, we still need to make sure that the IBRS bit in the
SPEC_CTRL MSR is always set, i.e. while booting, if we detect support for
Enhanced IBRS, we enable the IBRS bit in the SPEC_CTRL MSR and we should also
make sure that it always remains set. In other words, if the guest has
cleared the IBRS bit, upon VMEXIT the bit should still be set.

Fortunately, the kernel already has the infrastructure ready. kvm/vmx.c does
x86_spec_ctrl_set_guest() before entering the guest and
x86_spec_ctrl_restore_host() after leaving the guest. So, the guest view of
the SPEC_CTRL MSR is restored before entering the guest and the host view of
the SPEC_CTRL MSR is restored before entering the host, and hence IBRS will
be set after VMEXIT.

For Intel CPUs that support Enhanced IBRS, this patch also makes Enhanced
IBRS the default Spectre V2 mitigation technique instead of retpoline. Also,
we still need IBPB even with enhanced IBRS.

[1] https://software.intel.com/sites/default/files/managed/c5/63/336996-Speculative-Execution-Side-Channel-Mitigations.pdf

Signed-off-by: Sai Praneeth Prakhya
Originally-by: David Woodhouse
Cc: Tim C Chen
Cc: Dave Hansen
Cc: Thomas Gleixner
Cc: Ravi Shankar
Cc: Ingo Molnar
---
 arch/x86/include/asm/cpufeatures.h   |  1 +
 arch/x86/include/asm/nospec-branch.h |  2 +-
 arch/x86/kernel/cpu/bugs.c           | 29 +++--
 arch/x86/kernel/cpu/common.c         |  3 +++
 4 files changed, 32 insertions(+), 3 deletions(-)

diff --git a/arch/x86/include/asm/cpufeatures.h b/arch/x86/include/asm/cpufeatures.h
index 5701f5cecd31..f75815b1dbee 100644
--- a/arch/x86/include/asm/cpufeatures.h
+++ b/arch/x86/include/asm/cpufeatures.h
@@ -219,6 +219,7 @@
 #define X86_FEATURE_IBPB		( 7*32+26) /* Indirect Branch Prediction Barrier */
 #define X86_FEATURE_STIBP		( 7*32+27) /* Single Thread Indirect Branch Predictors */
 #define X86_FEATURE_ZEN			( 7*32+28) /* "" CPU is AMD family 0x17 (Zen) */
+#define X86_FEATURE_IBRS_ENHANCED	( 7*32+29) /* "ibrs_enhanced" Use Enhanced IBRS in kernel */

 /* Virtualization flags: Linux defined, word 8 */
 #define X86_FEATURE_TPR_SHADOW		( 8*32+ 0) /* Intel TPR Shadow */
diff --git a/arch/x86/include/asm/nospec-branch.h b/arch/x86/include/asm/nospec-branch.h
index f6f6c63da62f..fd2a8c1b88bc 100644
--- a/arch/x86/include/asm/nospec-branch.h
+++ b/arch/x86/include/asm/nospec-branch.h
@@ -214,7 +214,7 @@ enum spectre_v2_mitigation {
 	SPECTRE_V2_RETPOLINE_MINIMAL_AMD,
 	SPECTRE_V2_RETPOLINE_GENERIC,
 	SPECTRE_V2_RETPOLINE_AMD,
-	SPECTRE_V2_IBRS,
+	SPECTRE_V2_IBRS_ENHANCED,
 };

 /* The Speculative Store Bypass disable variants */
diff --git a/arch/x86/kernel/cpu/bugs.c b/arch/x86/kernel/cpu/bugs.c
index 5c0ea39311fe..a66517de1301 100644
--- a/arch/x86/kernel/cpu/bugs.c
+++ b/arch/x86/kernel/cpu/bugs.c
@@ -130,6 +130,7 @@ static const char *spectre_v2_strings[] = {
 	[SPECTRE_V2_RETPOLINE_MINIMAL_AMD]	= "Vulnerable: Minimal AMD ASM retpoline",
 	[SPECTRE_V2_RETPOLINE_GENERIC]		= "Mitigation: Full generic retpoline",
 	[SPECTRE_V2_RETPOLINE_AMD]		= "Mitigation: Full AMD retpoline",
+	[SPECTRE_V2_IBRS_ENHANCED]		= "Mitigation: Enhanced IBRS",
 };

 #undef pr_fmt
@@ -349,6 +350,8 @@ static void __init spectre_v2_select_mitigation(void)
 	case SPECTRE_V2_CMD_FORCE:
 	case SPECTRE_V2_CMD_AUTO:
+		if (boot_cpu_has(X86_FEATURE_IBRS_ENHANCED))
+			goto skip_retpoline_enable_ibrs;
 		if (IS_ENABLED(CONFIG_RETPOLINE))
 			goto retpoline_auto;
 		break;
@@ -385,7 +388,22 @@ static void __init spectre_v2_select_mitigation(void)
 					 SPECTRE_V2_RETPOLINE_MINIMAL;
 		setup_force_cpu_cap(X86_FEATURE_RETPOLINE);
 	}
+	goto enable_other_mitigations;
+skip_retpoline_enable_ibrs:
+	mode = SPECTRE_V2_IBRS_ENHANCED;
+
+	/*
+	 * As we don't use IBRS in kernel, nobody should
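
The selection order introduced for the FORCE/AUTO case (Enhanced IBRS first,
retpoline only as a fallback) can be sketched outside the kernel as follows.
This is a simplified model, not the patch's code: the sketch_* names are
hypothetical, the two boolean parameters stand in for
boot_cpu_has(X86_FEATURE_IBRS_ENHANCED) and IS_ENABLED(CONFIG_RETPOLINE),
and the AMD and minimal retpoline variants are left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Reduced, hypothetical version of enum spectre_v2_mitigation. */
enum sketch_mitigation {
	SKETCH_NONE,
	SKETCH_RETPOLINE,
	SKETCH_IBRS_ENHANCED,
};

static const char *const sketch_strings[] = {
	[SKETCH_NONE]		= "Vulnerable",
	[SKETCH_RETPOLINE]	= "Mitigation: Full generic retpoline",
	[SKETCH_IBRS_ENHANCED]	= "Mitigation: Enhanced IBRS",
};

/* Prefer Enhanced IBRS when the CPU has it; otherwise fall back to retpoline. */
static enum sketch_mitigation sketch_select(bool has_enhanced_ibrs,
					    bool retpoline_built)
{
	if (has_enhanced_ibrs)
		return SKETCH_IBRS_ENHANCED;
	if (retpoline_built)
		return SKETCH_RETPOLINE;
	return SKETCH_NONE;
}

/* Map a mode to its description string. */
static const char *sketch_describe(enum sketch_mitigation mode)
{
	assert((size_t)mode < sizeof(sketch_strings) / sizeof(sketch_strings[0]));
	return sketch_strings[mode];
}
```

Note that with both inputs true the result is Enhanced IBRS, matching the
change log's statement that it becomes the default instead of retpoline.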
[PATCH 4/6] x86/efi: Free existing memory map before installing new memory map
From: Sai Praneeth

efi_memmap_install() unmaps the existing memory map and installs a new
memory map but doesn't free the memory allocated to the existing memory map.
Fortunately, the details about the existing memory map (like the physical
address, number of entries and type of memory) are stored in efi.memmap.
Hence, use them to free the memory.

In __efi_enter_virtual_mode(), we don't use efi_memmap_install() to install
a new memory map; instead we use efi_memmap_init_late(). Hence, free the
existing memory map there too before installing a new memory map.

Generally, memory for a new memory map is allocated using efi_memmap_alloc(),
but in __efi_enter_virtual_mode() it's done using realloc_pages() [please see
efi_map_regions()]. So, it's OK to free this memory using efi_memmap_free()
in efi_free_boot_services().

Also, note that the first time efi_memmap_free() is called, either from
efi_fake_memmap() or efi_arch_mem_reserve() [depending on the boot
sequence], we are actually freeing memblock_reserved memory which isn't
allocated by efi_memmap_alloc().

So, there are two outliers where we use efi_memmap_free() to free memory
allocated through other sources rather than efi_memmap_alloc().

Signed-off-by: Sai Praneeth Prakhya
Suggested-by: Ard Biesheuvel
Cc: Lee Chun-Yi
Cc: Dave Young
Cc: Borislav Petkov
Cc: Laszlo Ersek
Cc: Jan Kiszka
Cc: Dave Hansen
Cc: Bhupesh Sharma
Cc: Nicolai Stange
Cc: Naresh Bhat
Cc: Ricardo Neri
Cc: Peter Zijlstra
Cc: Taku Izumi
Cc: Ravi Shankar
Cc: Matt Fleming
Cc: Dan Williams
Cc: Ard Biesheuvel
---
 arch/x86/platform/efi/efi.c     | 3 +++
 arch/x86/platform/efi/quirks.c  | 6 ++
 drivers/firmware/efi/fake_mem.c | 3 +++
 3 files changed, 12 insertions(+)

diff --git a/arch/x86/platform/efi/efi.c b/arch/x86/platform/efi/efi.c
index cda54abf25a6..7756426e93b5 100644
--- a/arch/x86/platform/efi/efi.c
+++ b/arch/x86/platform/efi/efi.c
@@ -952,6 +952,9 @@ static void __init __efi_enter_virtual_mode(void)
 	 * firmware via SetVirtualAddressMap().
 	 */
 	efi_memmap_unmap();
+	/* Free existing memory map before installing new memory map */
+	efi_memmap_free(efi.memmap.phys_map, efi.memmap.nr_map,
+			efi.memmap.alloc_type);

 	if (efi_memmap_init_late(pa, efi.memmap.desc_size * count)) {
 		pr_err("Failed to remap late EFI memory map\n");
diff --git a/arch/x86/platform/efi/quirks.c b/arch/x86/platform/efi/quirks.c
index 11fa6ac9f0c2..11800f3cbb93 100644
--- a/arch/x86/platform/efi/quirks.c
+++ b/arch/x86/platform/efi/quirks.c
@@ -292,6 +292,9 @@ void __init efi_arch_mem_reserve(phys_addr_t addr, u64 size)
 	efi_memmap_insert(, new, );
 	early_memunmap(new, new_size);

+	/* Free existing memory map before installing new memory map */
+	efi_memmap_free(efi.memmap.phys_map, efi.memmap.nr_map,
+			efi.memmap.alloc_type);
 	efi_memmap_install(new_phys, num_entries, alloc_type);
 }

@@ -452,6 +455,9 @@ void __init efi_free_boot_services(void)

 	memunmap(new);

+	/* Free existing memory map before installing new memory map */
+	efi_memmap_free(efi.memmap.phys_map, efi.memmap.nr_map,
+			efi.memmap.alloc_type);
 	if (efi_memmap_install(new_phys, num_entries, alloc_type)) {
 		pr_err("Could not install new EFI memmap\n");
 		return;
diff --git a/drivers/firmware/efi/fake_mem.c b/drivers/firmware/efi/fake_mem.c
index 82dcfa1c340b..a47754efb796 100644
--- a/drivers/firmware/efi/fake_mem.c
+++ b/drivers/firmware/efi/fake_mem.c
@@ -90,6 +90,9 @@ void __init efi_fake_memmap(void)
 	/* swap into new EFI memmap */
 	early_memunmap(new_memmap, efi.memmap.desc_size * new_nr_map);

+	/* Free existing memory map before installing new memory map */
+	efi_memmap_free(efi.memmap.phys_map, efi.memmap.nr_map,
+			efi.memmap.alloc_type);
 	efi_memmap_install(new_memmap_phy, new_nr_map, alloc_type);

 	/* print new EFI memmap */
--
2.7.4
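
The "free the existing map before installing the new one" pattern that this
patch applies at each call site can be modelled in user space roughly as
below. It is a sketch only: struct sketch_memmap and the sketch_* functions
are hypothetical, calloc()/free() stand in for the kernel allocators, and the
alloc_type bookkeeping that the real efi_memmap_free() call takes as an
argument is omitted.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for the bookkeeping kept in efi.memmap. */
struct sketch_memmap {
	void *map;	/* backing storage of the currently installed map */
	size_t nr_map;	/* number of entries in that map */
};

static struct sketch_memmap current_map;
static int sketch_frees;	/* how many old maps were actually released */

/* Release whatever map is currently recorded, if any. */
static void sketch_memmap_free(struct sketch_memmap *m)
{
	if (m->map) {
		free(m->map);
		sketch_frees++;
	}
	m->map = NULL;
	m->nr_map = 0;
}

/* Record the new map; like the original install step, this frees nothing. */
static void sketch_memmap_install(void *new_map, size_t nr_entries)
{
	current_map.map = new_map;
	current_map.nr_map = nr_entries;
}

/* Caller side: build the new map, free the old one, then install. */
static int sketch_replace_map(size_t nr_entries, size_t desc_size)
{
	void *new_map;

	assert(nr_entries > 0 && desc_size > 0);
	new_map = calloc(nr_entries, desc_size);
	if (!new_map)
		return -1;

	/* Free existing memory map before installing new memory map */
	sketch_memmap_free(&current_map);
	sketch_memmap_install(new_map, nr_entries);
	return 0;
}
```

Without the sketch_memmap_free() call in sketch_replace_map(), every
replacement would leak the previous allocation, which is the leak the patch
description refers to.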
[PATCH V5 3/3] efi: Use efi_rts_wq to invoke EFI Runtime Services
From: Sai Praneeth

Presently, when a user process requests the kernel to execute any
efi_runtime_service(), the kernel switches the page directory (%cr3) from
swapper_pgd to efi_pgd. Other subsystems in the kernel aren't aware of this
switch and they might think user space is still valid (i.e. that the user
space mappings are still pointing to the process that requested to run
efi_runtime_service()), but in reality it is not so.

A solution for this issue is to use a kthread to run efi_runtime_service().
When a user process requests the kernel to execute any
efi_runtime_service(), the kernel queues the work to efi_rts_wq; a kthread
comes along, switches to efi_pgd and executes efi_runtime_service() in
kthread context. Anything that tries to touch user space addresses while in
a kthread is terminally broken.

Implementation summary:
-----------------------
1. When a user/kernel thread requests to execute efi_runtime_service(),
   enqueue work to efi_rts_wq.
2. The caller thread waits for completion until the work is finished,
   because it's dependent on the return status of efi_runtime_service().

Semantics to pack arguments in efi_runtime_work (has void pointers):
1. If the argument is a pointer (of any type), pass it as is.
2. If the argument is a value (of any type), the address of the value is
   passed.

Introduce a handler function (called efi_call_rts()) that
1. Understands efi_runtime_work and
2. Invokes the appropriate efi_runtime_service() with the appropriate
   arguments.

Semantics followed by efi_call_rts() to understand efi_runtime_work:
1. If the argument was a pointer, recast it from void pointer to the
   original pointer type.
2. If the argument was a value, recast it from void pointer to the original
   pointer type and dereference it.

The non-blocking variants of set_variable() and query_variable_info() should
be used while in atomic context. Use of the blocking variants, set_variable()
and query_variable_info(), while in atomic context will issue a warning
("scheduling while atomic") and print a stack trace. Presently, pstore uses
the non-blocking variants and hence works fine.

Signed-off-by: Sai Praneeth Prakhya
Suggested-by: Andy Lutomirski
Cc: Lee Chun-Yi
Cc: Borislav Petkov
Cc: Tony Luck
Cc: Will Deacon
Cc: Dave Hansen
Cc: Mark Rutland
Cc: Bhupesh Sharma
Cc: Naresh Bhat
Cc: Ricardo Neri
Cc: Peter Zijlstra
Cc: Ravi Shankar
Cc: Matt Fleming
Cc: Dan Williams
Cc: Ard Biesheuvel
Cc: Miguel Ojeda
---
 drivers/firmware/efi/runtime-wrappers.c | 135
 1 file changed, 119 insertions(+), 16 deletions(-)

diff --git a/drivers/firmware/efi/runtime-wrappers.c b/drivers/firmware/efi/runtime-wrappers.c
index cf3bae42a752..127d4de00403 100644
--- a/drivers/firmware/efi/runtime-wrappers.c
+++ b/drivers/firmware/efi/runtime-wrappers.c
@@ -173,13 +173,104 @@ void efi_call_virt_check_flags(unsigned long flags, const char *call)
  */
 static DEFINE_SEMAPHORE(efi_runtime_lock);

+/*
+ * Calls the appropriate efi_runtime_service() with the appropriate
+ * arguments.
+ *
+ * Semantics followed by efi_call_rts() to understand efi_runtime_work:
+ * 1. If argument was a pointer, recast it from void pointer to original
+ * pointer type.
+ * 2. If argument was a value, recast it from void pointer to original
+ * pointer type and dereference it.
+ */
+static void efi_call_rts(struct work_struct *work)
+{
+	struct efi_runtime_work *efi_rts_work;
+	void *arg1, *arg2, *arg3, *arg4, *arg5;
+	efi_status_t status = EFI_NOT_FOUND;
+
+	efi_rts_work = container_of(work, struct efi_runtime_work, work);
+	arg1 = efi_rts_work->arg1;
+	arg2 = efi_rts_work->arg2;
+	arg3 = efi_rts_work->arg3;
+	arg4 = efi_rts_work->arg4;
+	arg5 = efi_rts_work->arg5;
+
+	switch (efi_rts_work->efi_rts_id) {
+	case GET_TIME:
+		status = efi_call_virt(get_time, (efi_time_t *)arg1,
+				       (efi_time_cap_t *)arg2);
+		break;
+	case SET_TIME:
+		status = efi_call_virt(set_time, (efi_time_t *)arg1);
+		break;
+	case GET_WAKEUP_TIME:
+		status = efi_call_virt(get_wakeup_time, (efi_bool_t *)arg1,
+				       (efi_bool_t *)arg2, (efi_time_t *)arg3);
+		break;
+	case SET_WAKEUP_TIME:
+		status = efi_call_virt(set_wakeup_time, *(efi_bool_t *)arg1,
+				       (efi_time_t *)arg2);
+		break;
+	case GET_VARIABLE:
+		status = efi_call_virt(get_variable, (efi_char16_t *)arg1,
+				       (efi_guid_t *)arg2, (u32 *)arg3,
+				       (unsigned long *)arg4, (void *)arg5);
+		break;
+	case GET_NEXT_VARIABLE:
+		status = efi_call_virt(get_next_variable, (unsigned long *)arg1,
+				       (efi_char16_t *)arg2,
+				       (efi_guid_t *)
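
The argument packing rules from the change log (pointers are passed as is,
values are passed by address and dereferenced in the handler) can be tried
out in a small user-space model. Everything here is hypothetical and only
loosely mirrors the patch: struct sketch_work stands in for
efi_runtime_work, the fake_* functions stand in for firmware services, and
the handler is called directly instead of being queued to a workqueue and
waited on.

```c
#include <assert.h>
#include <stddef.h>

enum sketch_rts_id { SKETCH_GET_TIME, SKETCH_SET_WAKEUP_TIME };

struct sketch_time { int year, month, day; };

/* Hypothetical stand-in for efi_runtime_work: every argument is a void pointer. */
struct sketch_work {
	enum sketch_rts_id id;
	void *arg1;
	void *arg2;
	int status;
};

/* Fake "runtime services" so the sketch has something to dispatch to. */
static int fake_get_time(struct sketch_time *t)
{
	t->year = 2018;
	t->month = 1;
	t->day = 1;
	return 0;
}

static int fake_set_wakeup_time(unsigned char enabled, struct sketch_time *t)
{
	return (enabled && t->year > 0) ? 0 : -1;
}

/* Handler: undo the packing according to the two rules. */
static void sketch_call_rts(struct sketch_work *w)
{
	assert(w != NULL);

	switch (w->id) {
	case SKETCH_GET_TIME:
		/* Rule 1: the argument was a pointer, so only recast it. */
		w->status = fake_get_time((struct sketch_time *)w->arg1);
		break;
	case SKETCH_SET_WAKEUP_TIME:
		/* Rule 2: arg1 was a value, so recast and dereference it. */
		w->status = fake_set_wakeup_time(*(unsigned char *)w->arg1,
						 (struct sketch_time *)w->arg2);
		break;
	}
}

/* Caller side: the value is packed by address, the pointer is packed as is. */
static int sketch_set_wakeup_time(unsigned char enabled, struct sketch_time *t)
{
	struct sketch_work w = {
		.id = SKETCH_SET_WAKEUP_TIME,
		.arg1 = &enabled,
		.arg2 = t,
		.status = -1,
	};

	/* The patch queues the work and waits for completion at this point. */
	sketch_call_rts(&w);
	return w.status;
}
```

Passing the address of a stack variable is safe here for the same reason the
change log gives: the caller does not return until the work has finished, so
the value outlives the handler's use of it.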
[PATCH V5 1/3] x86/efi: Make efi_delete_dummy_variable() use set_variable_nonblocking() instead of set_variable()
From: Sai Praneeth

Presently, efi_delete_dummy_variable() uses set_variable(), which might
block, and hence the kernel prints a stack trace with the warning "bad:
scheduling from the idle thread!". So, make efi_delete_dummy_variable() use
set_variable_nonblocking(), which, as the name suggests, doesn't block.

Signed-off-by: Sai Praneeth Prakhya
Suggested-by: Andy Lutomirski
Cc: Lee Chun-Yi
Cc: Borislav Petkov
Cc: Tony Luck
Cc: Will Deacon
Cc: Dave Hansen
Cc: Mark Rutland
Cc: Bhupesh Sharma
Cc: Naresh Bhat
Cc: Ricardo Neri
Cc: Peter Zijlstra
Cc: Ravi Shankar
Cc: Matt Fleming
Cc: Dan Williams
Cc: Ard Biesheuvel
Cc: Miguel Ojeda
---
 arch/x86/platform/efi/quirks.c | 11 +--
 1 file changed, 5 insertions(+), 6 deletions(-)

diff --git a/arch/x86/platform/efi/quirks.c b/arch/x86/platform/efi/quirks.c
index 36c1f8b9f7e0..6af39dc40325 100644
--- a/arch/x86/platform/efi/quirks.c
+++ b/arch/x86/platform/efi/quirks.c
@@ -105,12 +105,11 @@ early_param("efi_no_storage_paranoia", setup_storage_paranoia);
  */
 void efi_delete_dummy_variable(void)
 {
-	efi.set_variable((efi_char16_t *)efi_dummy_name,
-			 _DUMMY_GUID,
-			 EFI_VARIABLE_NON_VOLATILE |
-			 EFI_VARIABLE_BOOTSERVICE_ACCESS |
-			 EFI_VARIABLE_RUNTIME_ACCESS,
-			 0, NULL);
+	efi.set_variable_nonblocking((efi_char16_t *)efi_dummy_name,
+				     _DUMMY_GUID,
+				     EFI_VARIABLE_NON_VOLATILE |
+				     EFI_VARIABLE_BOOTSERVICE_ACCESS |
+				     EFI_VARIABLE_RUNTIME_ACCESS, 0, NULL);
 }

 /*
--
2.7.4
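
The patch does not show how the non-blocking variant avoids sleeping. One
common way to get that behaviour is a try-lock that gives up immediately
instead of waiting, and the sketch below models only that idea. It is an
assumption for illustration, not the kernel's implementation: the sketch_*
names and return codes are hypothetical, and a C11 atomic_flag stands in for
whatever lock protects the runtime services.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define SKETCH_SUCCESS		0
#define SKETCH_NOT_READY	1	/* hypothetical "busy, try again later" code */

static atomic_flag sketch_lock = ATOMIC_FLAG_INIT;
static int sketch_store;	/* stands in for the variable being written */

/* Returns true if the lock was taken; never waits. */
static bool sketch_trylock(void)
{
	return !atomic_flag_test_and_set(&sketch_lock);
}

static void sketch_unlock(void)
{
	atomic_flag_clear(&sketch_lock);
}

/*
 * Non-blocking setter: if the lock is busy, report that to the caller
 * instead of sleeping, so it is safe where scheduling is not allowed.
 */
static int sketch_set_variable_nonblocking(int value)
{
	if (!sketch_trylock())
		return SKETCH_NOT_READY;

	sketch_store = value;
	sketch_unlock();
	return SKETCH_SUCCESS;
}
```

The trade-off of such a design is that the caller has to tolerate a "not
ready" result, whereas a blocking setter would simply wait for the lock.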
[PATCH V5 3/3] efi: Use efi_rts_wq to invoke EFI Runtime Services
From: Sai Praneeth Presently, when a user process requests the kernel to execute any efi_runtime_service(), kernel switches the page directory (%cr3) from swapper_pgd to efi_pgd. Other subsystems in the kernel aren't aware of this switch and they might think, user space is still valid (i.e. the user space mappings are still pointing to the process that requested to run efi_runtime_service()) but in reality it is not so. A solution for this issue is to use kthread to run efi_runtime_service(). When a user process requests the kernel to execute any efi_runtime_service(), kernel queues the work to efi_rts_wq, a kthread comes along, switches to efi_pgd and executes efi_runtime_service() in kthread context. Anything that tries to touch user space addresses while in kthread is terminally broken. Implementation summary: --- 1. When user/kernel thread requests to execute efi_runtime_service(), enqueue work to efi_rts_wq. 2. Caller thread waits for completion until the work is finished because it's dependent on the return status of efi_runtime_service(). Semantics to pack arguments in efi_runtime_work (has void pointers): 1. If argument is a pointer (of any type), pass it as is. 2. If argument is a value (of any type), address of the value is passed. Introduce a handler function (called efi_call_rts()) that 1. Understands efi_runtime_work and 2. Invokes the appropriate efi_runtime_service() with the appropriate arguments Semantics followed by efi_call_rts() to understand efi_runtime_work: 1. If argument was a pointer, recast it from void pointer to original pointer type. 2. If argument was a value, recast it from void pointer to original pointer type and dereference it. The non-blocking variants of set_variable() and query_variable_info() should be used while in atomic context. Use of blocking variants like set_variable() and query_variable_info() while in atomic will issue a warning ("scheduling wile in atomic") and prints stack trace. 
Presently, pstore uses non-blocking variants and hence works fine.

Signed-off-by: Sai Praneeth Prakhya
Suggested-by: Andy Lutomirski
Cc: Lee Chun-Yi
Cc: Borislav Petkov
Cc: Tony Luck
Cc: Will Deacon
Cc: Dave Hansen
Cc: Mark Rutland
Cc: Bhupesh Sharma
Cc: Naresh Bhat
Cc: Ricardo Neri
Cc: Peter Zijlstra
Cc: Ravi Shankar
Cc: Matt Fleming
Cc: Dan Williams
Cc: Ard Biesheuvel
Cc: Miguel Ojeda
---
 drivers/firmware/efi/runtime-wrappers.c | 135
 1 file changed, 119 insertions(+), 16 deletions(-)

diff --git a/drivers/firmware/efi/runtime-wrappers.c b/drivers/firmware/efi/runtime-wrappers.c
index cf3bae42a752..127d4de00403 100644
--- a/drivers/firmware/efi/runtime-wrappers.c
+++ b/drivers/firmware/efi/runtime-wrappers.c
@@ -173,13 +173,104 @@ void efi_call_virt_check_flags(unsigned long flags, const char *call)
  */
 static DEFINE_SEMAPHORE(efi_runtime_lock);
+/*
+ * Calls the appropriate efi_runtime_service() with the appropriate
+ * arguments.
+ *
+ * Semantics followed by efi_call_rts() to understand efi_runtime_work:
+ * 1. If argument was a pointer, recast it from void pointer to original
+ *    pointer type.
+ * 2. If argument was a value, recast it from void pointer to original
+ *    pointer type and dereference it.
+ */
+static void efi_call_rts(struct work_struct *work)
+{
+	struct efi_runtime_work *efi_rts_work;
+	void *arg1, *arg2, *arg3, *arg4, *arg5;
+	efi_status_t status = EFI_NOT_FOUND;
+
+	efi_rts_work = container_of(work, struct efi_runtime_work, work);
+	arg1 = efi_rts_work->arg1;
+	arg2 = efi_rts_work->arg2;
+	arg3 = efi_rts_work->arg3;
+	arg4 = efi_rts_work->arg4;
+	arg5 = efi_rts_work->arg5;
+
+	switch (efi_rts_work->efi_rts_id) {
+	case GET_TIME:
+		status = efi_call_virt(get_time, (efi_time_t *)arg1,
+				       (efi_time_cap_t *)arg2);
+		break;
+	case SET_TIME:
+		status = efi_call_virt(set_time, (efi_time_t *)arg1);
+		break;
+	case GET_WAKEUP_TIME:
+		status = efi_call_virt(get_wakeup_time, (efi_bool_t *)arg1,
+				       (efi_bool_t *)arg2, (efi_time_t *)arg3);
+		break;
+	case SET_WAKEUP_TIME:
+		status = efi_call_virt(set_wakeup_time, *(efi_bool_t *)arg1,
+				       (efi_time_t *)arg2);
+		break;
+	case GET_VARIABLE:
+		status = efi_call_virt(get_variable, (efi_char16_t *)arg1,
+				       (efi_guid_t *)arg2, (u32 *)arg3,
+				       (unsigned long *)arg4, (void *)arg5);
+		break;
+	case GET_NEXT_VARIABLE:
+		status = efi_call_virt(get_next_variable, (unsigned long *)arg1,
+				       (efi_char16_t *)arg2,
+				       (efi_guid_t *)
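The packing and unpacking rules described in the commit message above can be tried out in user space. What follows is only a minimal sketch under stated assumptions, not the kernel code from this patch: every demo_* name, the two stand-in "services" and the int status type are hypothetical and exist only for illustration. It shows a value argument travelling as the address of the value, a pointer argument travelling as is, and a handler that recasts each slot accordingly.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical service identifiers, standing in for enum efi_rts_ids. */
enum demo_rts_id { DEMO_SET_FLAG, DEMO_GET_TIME };

/* Same shape as efi_runtime_work: untyped argument slots plus a status. */
struct demo_work {
	void *arg1;
	void *arg2;
	int status;
	enum demo_rts_id id;
};

/* Hypothetical service taking one value and one pointer argument. */
static int demo_set_flag(int enabled, long *counter)
{
	*counter += enabled ? 1 : 0;
	return 0;
}

/* Hypothetical service that fills a caller-provided buffer. */
static int demo_get_time(long *out)
{
	*out = 1234;
	return 0;
}

/* Handler: recast every slot back to its original type before the call. */
static void demo_call_rts(struct demo_work *work)
{
	assert(work != NULL);

	switch (work->id) {
	case DEMO_SET_FLAG:
		/* arg1 was a value: recast and dereference. arg2 was a pointer: recast only. */
		work->status = demo_set_flag(*(int *)work->arg1,
					     (long *)work->arg2);
		break;
	case DEMO_GET_TIME:
		work->status = demo_get_time((long *)work->arg1);
		break;
	default:
		/* Unknown identifier: report failure instead of crashing. */
		work->status = -1;
		break;
	}
}

/* Caller side: pack a value by address and a pointer as is. */
static int demo_pack_and_call_set_flag(int enabled, long *counter)
{
	struct demo_work work = {
		.arg1 = &enabled,	/* value: pass its address */
		.arg2 = counter,	/* pointer: pass it unchanged */
		.status = -1,
		.id = DEMO_SET_FLAG,
	};

	demo_call_rts(&work);
	return work.status;
}
```

The address of a value argument stays valid here only because the caller does not return before the handler has run, which is the same reason the caller thread in the patch has to wait for the work to finish.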
[PATCH V5 1/3] x86/efi: Make efi_delete_dummy_variable() use set_variable_nonblocking() instead of set_variable()
From: Sai Praneeth

Presently, efi_delete_dummy_variable() uses set_variable(), which might
block, and hence the kernel prints a stack trace with the warning "bad:
scheduling from the idle thread!". So, make efi_delete_dummy_variable() use
set_variable_nonblocking(), which, as the name suggests, doesn't block.

Signed-off-by: Sai Praneeth Prakhya
Suggested-by: Andy Lutomirski
Cc: Lee Chun-Yi
Cc: Borislav Petkov
Cc: Tony Luck
Cc: Will Deacon
Cc: Dave Hansen
Cc: Mark Rutland
Cc: Bhupesh Sharma
Cc: Naresh Bhat
Cc: Ricardo Neri
Cc: Peter Zijlstra
Cc: Ravi Shankar
Cc: Matt Fleming
Cc: Dan Williams
Cc: Ard Biesheuvel
Cc: Miguel Ojeda
---
 arch/x86/platform/efi/quirks.c | 11 +--
 1 file changed, 5 insertions(+), 6 deletions(-)

diff --git a/arch/x86/platform/efi/quirks.c b/arch/x86/platform/efi/quirks.c
index 36c1f8b9f7e0..6af39dc40325 100644
--- a/arch/x86/platform/efi/quirks.c
+++ b/arch/x86/platform/efi/quirks.c
@@ -105,12 +105,11 @@ early_param("efi_no_storage_paranoia", setup_storage_paranoia);
  */
 void efi_delete_dummy_variable(void)
 {
-	efi.set_variable((efi_char16_t *)efi_dummy_name,
-			 _DUMMY_GUID,
-			 EFI_VARIABLE_NON_VOLATILE |
-			 EFI_VARIABLE_BOOTSERVICE_ACCESS |
-			 EFI_VARIABLE_RUNTIME_ACCESS,
-			 0, NULL);
+	efi.set_variable_nonblocking((efi_char16_t *)efi_dummy_name,
+				     _DUMMY_GUID,
+				     EFI_VARIABLE_NON_VOLATILE |
+				     EFI_VARIABLE_BOOTSERVICE_ACCESS |
+				     EFI_VARIABLE_RUNTIME_ACCESS, 0, NULL);
 }

 /*
--
2.7.4
[PATCH V5 0/3] Use efi_rts_wq to invoke EFI Runtime Services
Patches are based on Linus's kernel v4.17-rc7 [1]

Backup: Detailing efi_pgd:
--
efi_pgd has mappings for EFI Runtime Code/Data (on x86, plus EFI Boot time
Code/Data) regions. Due to the nature of these mappings, they fall in user
space address ranges and they are not the same as swapper.

[On arm64, the EFI mappings are in the VA range usually used for user space.
The two halves of the address space are managed by separate tables, TTBR0
and TTBR1. We always map the kernel in TTBR1, and we map user space or EFI
runtime mappings in TTBR0.] - Mark Rutland

Changes from V4 to V5:
--
1. As suggested by Ard, don't use efi_rts_wq for non-blocking variants.
   Non-blocking variants are supposed to not block and using a workqueue
   does exactly the opposite, hence refrain from using it.
2. Use non-blocking variants in efi_delete_dummy_variable(). Use of blocking
   variants means that we have to call efi_delete_dummy_variable() after
   efi_rts_wq has been created.
3. Remove in_atomic() check in set_variable<>() and query_variable_info<>().
   Any caller wishing to use set_variable() and query_variable_info() in
   atomic context should use their non-blocking variants.

Changes from V3 to V4:
--
1. As suggested by Peter, use completions instead of flush_work() as the
   former is cheaper
2. Call efi_delete_dummy_variable() from efisubsys_init(). Sorry! Ard,
   wasn't able to find a better alternative to keep this change local to
   arch/x86.

Changes from V2 to V3:
--
1. Rewrite the cover letter to clearly state the problem. What we are
   fixing and what we are not fixing.
2. Make efi_delete_dummy_variable() change local to x86.
3. Avoid using BUG(), instead, print error message and exit gracefully.
4. Move struct efi_runtime_work to runtime-wrappers.c file.
5. Give enum a name (efi_rts_ids) and use it in efi_runtime_work.
6. Add Naresh (maintainer of LUV for ARM) and Miguel to the CC list.

Changes from V1 to V2:
--
1. Remove unnecessary include of asm/efi.h file - Fixes build error on
   ia64, reported by 0-day
2. Use enum to identify efi_runtime_services()
3. Use alloc_ordered_workqueue() to create efi_rts_wq as
   create_workqueue() is scheduled for deprecation.
4. Make efi_call_rts() static, as it has no callers outside
   runtime-wrappers.c
5. Use BUG(), when we are unable to queue work or unable to identify
   requested efi_runtime_service() - Because these two situations should
   *never* happen.

Sai Praneeth (3):
  x86/efi: Make efi_delete_dummy_variable() use
    set_variable_nonblocking() instead of set_variable()
  efi: Create efi_rts_wq and efi_queue_work() to invoke all
    efi_runtime_services()
  efi: Use efi_rts_wq to invoke EFI Runtime Services

 arch/x86/platform/efi/quirks.c          |  11 +-
 drivers/firmware/efi/efi.c              |  14 ++
 drivers/firmware/efi/runtime-wrappers.c | 218 +---
 include/linux/efi.h                     |   3 +
 4 files changed, 224 insertions(+), 22 deletions(-)

Signed-off-by: Sai Praneeth Prakhya
Suggested-by: Andy Lutomirski
Cc: Lee Chun-Yi
Cc: Borislav Petkov
Cc: Tony Luck
Cc: Will Deacon
Cc: Dave Hansen
Cc: Mark Rutland
Cc: Bhupesh Sharma
Cc: Naresh Bhat
Cc: Ricardo Neri
Cc: Peter Zijlstra
Cc: Ravi Shankar
Cc: Matt Fleming
Cc: Dan Williams
Cc: Ard Biesheuvel
Cc: Miguel Ojeda
--
2.7.4
[PATCH V5 2/3] efi: Create efi_rts_wq and efi_queue_work() to invoke all efi_runtime_services()
From: Sai Praneeth

When a process requests the kernel to execute any efi_runtime_service(), the
requested efi_runtime_service (represented as an identifier) and its
arguments are packed into a struct named efi_runtime_work and queued onto a
workqueue named efi_rts_wq. The caller then waits until the work is
completed.

Introduce some infrastructure:
1. Creating workqueue named efi_rts_wq
2. A macro (efi_queue_work()) that
   a. Populates efi_runtime_work
   b. Queues work onto efi_rts_wq and
   c. Waits until worker thread completes

The caller thread has to wait until the worker thread completes, because it
depends on the return status of efi_runtime_service() and, in specific
cases, the arguments populated by efi_runtime_service(). Some
efi_runtime_services() take a pointer to a buffer as an argument and fill
the buffer with the requested data, for instance efi_get_variable() and
efi_get_next_variable(). Hence, the caller process cannot just post the work
and get going.

Some facts about efi_runtime_services():
1. A quick look at all the efi_runtime_services() shows that any
   efi_runtime_service() has five or fewer arguments.
2. An argument of efi_runtime_service() can be a value (of any type) or a
   pointer (of any type).
Hence, efi_runtime_work has five void pointers to store these arguments.
Signed-off-by: Sai Praneeth Prakhya
Suggested-by: Andy Lutomirski
Cc: Lee Chun-Yi
Cc: Borislav Petkov
Cc: Tony Luck
Cc: Will Deacon
Cc: Dave Hansen
Cc: Mark Rutland
Cc: Bhupesh Sharma
Cc: Naresh Bhat
Cc: Ricardo Neri
Cc: Peter Zijlstra
Cc: Ravi Shankar
Cc: Matt Fleming
Cc: Dan Williams
Cc: Ard Biesheuvel
Cc: Miguel Ojeda
---
 drivers/firmware/efi/efi.c              | 14 ++
 drivers/firmware/efi/runtime-wrappers.c | 83 +
 include/linux/efi.h                     |  3 ++
 3 files changed, 100 insertions(+)

diff --git a/drivers/firmware/efi/efi.c b/drivers/firmware/efi/efi.c
index 232f4915223b..1379a375dfa8 100644
--- a/drivers/firmware/efi/efi.c
+++ b/drivers/firmware/efi/efi.c
@@ -84,6 +84,8 @@ struct mm_struct efi_mm = {
 	.mmlist = LIST_HEAD_INIT(efi_mm.mmlist),
 };

+struct workqueue_struct *efi_rts_wq;
+
 static bool disable_runtime;
 static int __init setup_noefi(char *arg)
 {
@@ -337,6 +339,18 @@ static int __init efisubsys_init(void)
 	if (!efi_enabled(EFI_BOOT))
 		return 0;

+	/*
+	 * Since we process only one efi_runtime_service() at a time, an
+	 * ordered workqueue (which creates only one execution context)
+	 * should suffice all our needs.
+	 */
+	efi_rts_wq = alloc_ordered_workqueue("efi_rts_wq", 0);
+	if (!efi_rts_wq) {
+		pr_err("Creating efi_rts_wq failed, EFI runtime services disabled.\n");
+		clear_bit(EFI_RUNTIME_SERVICES, );
+		return 0;
+	}
+
 	/* We register the efi directory at /sys/firmware/efi */
 	efi_kobj = kobject_create_and_add("efi", firmware_kobj);
 	if (!efi_kobj) {
diff --git a/drivers/firmware/efi/runtime-wrappers.c b/drivers/firmware/efi/runtime-wrappers.c
index ae54870b2788..cf3bae42a752 100644
--- a/drivers/firmware/efi/runtime-wrappers.c
+++ b/drivers/firmware/efi/runtime-wrappers.c
@@ -1,6 +1,15 @@
 /*
  * runtime-wrappers.c - Runtime Services function call wrappers
  *
+ * Implementation summary:
+ * ---
+ * 1. When user/kernel thread requests to execute efi_runtime_service(),
+ *    enqueue work to efi_rts_wq.
+ * 2. Caller thread waits for completion until the work is finished
+ *    because it's dependent on the return status and execution of
+ *    efi_runtime_service().
+ *    For instance, get_variable() and get_next_variable().
+ *
  * Copyright (C) 2014 Linaro Ltd.
  *
  * Split off from arch/x86/platform/efi/efi.c
@@ -22,6 +31,9 @@
 #include
 #include
 #include
+#include
+#include
+
 #include

 /*
@@ -33,6 +45,77 @@
 #define __efi_call_virt(f, args...) \
 	__efi_call_virt_pointer(efi.systab->runtime, f, args)
+/* efi_runtime_service() function identifiers */
+enum efi_rts_ids {
+	GET_TIME,
+	SET_TIME,
+	GET_WAKEUP_TIME,
+	SET_WAKEUP_TIME,
+	GET_VARIABLE,
+	GET_NEXT_VARIABLE,
+	SET_VARIABLE,
+	QUERY_VARIABLE_INFO,
+	GET_NEXT_HIGH_MONO_COUNT,
+	RESET_SYSTEM,
+	UPDATE_CAPSULE,
+	QUERY_CAPSULE_CAPS,
+};
+
+/*
+ * efi_runtime_work:	Details of EFI Runtime Service work
+ * @arg<1-5>:		EFI Runtime Service function arguments
+ * @status:		Status of executing EFI Runtime Service
+ * @efi_rts_id:		EFI Runtime Service function identifier
+ * @efi_rts_comp:	Struct used for handling completions
+ */
+struct efi_runtime_work {
+	void *arg1;
+	void *arg2;
+	void *arg3;
+	void *arg4;
+	void *arg5;
+	efi_status_t status;
+	struct work_struct work;
+	enum
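The "populate, queue, wait" sequence that efi_queue_work() is described as performing can be illustrated with a user-space analogue. This is a rough sketch under stated assumptions and not the macro from this patch (its body is cut off in this archive): POSIX threads stand in for the workqueue, a mutex and condition variable stand in for a kernel completion, one thread is spawned per request instead of reusing a single ordered worker, and every demo_* name is hypothetical.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical stand-in for a kernel completion. */
struct demo_completion {
	pthread_mutex_t lock;
	pthread_cond_t cond;
	bool done;
};

static void demo_init_completion(struct demo_completion *c)
{
	pthread_mutex_init(&c->lock, NULL);
	pthread_cond_init(&c->cond, NULL);
	c->done = false;
}

/* Worker side: mark the work as finished and wake the waiter. */
static void demo_complete(struct demo_completion *c)
{
	pthread_mutex_lock(&c->lock);
	c->done = true;
	pthread_cond_signal(&c->cond);
	pthread_mutex_unlock(&c->lock);
}

/* Caller side: sleep until the worker has signalled completion. */
static void demo_wait_for_completion(struct demo_completion *c)
{
	pthread_mutex_lock(&c->lock);
	while (!c->done)
		pthread_cond_wait(&c->cond, &c->lock);
	pthread_mutex_unlock(&c->lock);
}

/* Hypothetical request: an input, an output buffer, a status, a completion. */
struct demo_request {
	int input;
	int *output;
	int status;
	struct demo_completion comp;
};

/* Runs in the worker thread: do the work, store the status, then complete. */
static void *demo_worker(void *data)
{
	struct demo_request *req = data;

	*req->output = req->input * 2;	/* the "service" fills the caller's buffer */
	req->status = 0;
	demo_complete(&req->comp);
	return NULL;
}

/* Populate the request, hand it to a worker, and wait for the result. */
static int demo_queue_and_wait(int input, int *output)
{
	struct demo_request req = { .input = input, .output = output, .status = -1 };
	pthread_t tid;
	bool started;

	assert(output != NULL);
	demo_init_completion(&req.comp);

	started = pthread_create(&tid, NULL, demo_worker, &req) == 0;
	if (started) {
		demo_wait_for_completion(&req.comp);
		pthread_join(tid, NULL);
	}

	pthread_cond_destroy(&req.comp.cond);
	pthread_mutex_destroy(&req.comp.lock);
	return started ? req.status : -1;
}
```

Because the request lives on the caller's stack and the worker writes both the status and the output buffer, the caller cannot return before the worker signals completion. That is the same constraint the commit message gives for not letting the caller "just post the work and get going".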
[PATCH V4 1/3] x86/efi: Call efi_delete_dummy_variable() during efi subsystem initialization
From: Sai Praneeth <sai.praneeth.prak...@intel.com>

Invoking efi_runtime_services() through efi_rts_wq (efi runtime services
workqueue) means all accesses to efi_runtime_services() should be done after
efi_rts_wq has been created. efi_delete_dummy_variable() calls
set_variable(), hence efi_delete_dummy_variable() should be called after
efi_rts_wq has been created.

Presently, efi_delete_dummy_variable() is called from
efi_enter_virtual_mode(), which is early in the boot phase (efi_rts_wq isn't
created yet), so call efi_delete_dummy_variable() later in the boot phase.

Another, and the most important, reason for calling
efi_delete_dummy_variable() late in the boot process is that, if it is
called before rest_init(), the kernel prints a stack trace with the warning
"bad: scheduling from the idle thread!". Hence, call it from
efisubsys_init(), which is called during rest_init().

Signed-off-by: Sai Praneeth Prakhya <sai.praneeth.prak...@intel.com>
Suggested-by: Andy Lutomirski <l...@kernel.org>
Cc: Lee Chun-Yi <j...@suse.com>
Cc: Borislav Petkov <b...@alien8.de>
Cc: Tony Luck <tony.l...@intel.com>
Cc: Will Deacon <will.dea...@arm.com>
Cc: Dave Hansen <dave.han...@intel.com>
Cc: Mark Rutland <mark.rutl...@arm.com>
Cc: Bhupesh Sharma <bhsha...@redhat.com>
Cc: Naresh Bhat <naresh.b...@linaro.org>
Cc: Ricardo Neri <ricardo.n...@intel.com>
Cc: Peter Zijlstra <pet...@infradead.org>
Cc: Ravi Shankar <ravi.v.shan...@intel.com>
Cc: Matt Fleming <m...@codeblueprint.co.uk>
Cc: Dan Williams <dan.j.willi...@intel.com>
Cc: Ard Biesheuvel <ard.biesheu...@linaro.org>
Cc: Miguel Ojeda <miguel.ojeda.sando...@gmail.com>
---
 arch/x86/include/asm/efi.h  | 1 -
 arch/x86/platform/efi/efi.c | 6 --
 drivers/firmware/efi/efi.c  | 6 ++
 include/linux/efi.h         | 3 +++
 4 files changed, 9 insertions(+), 7 deletions(-)

diff --git a/arch/x86/include/asm/efi.h b/arch/x86/include/asm/efi.h
index cec5fae23eb3..0e61b771b93d 100644
--- a/arch/x86/include/asm/efi.h
+++ b/arch/x86/include/asm/efi.h
@@ -138,7 +138,6 @@ extern void __init efi_runtime_update_mappings(void);
 extern void __init efi_dump_pagetable(void);
 extern void __init efi_apply_memmap_quirks(void);
 extern int __init efi_reuse_config(u64 tables, int nr_tables);
-extern void efi_delete_dummy_variable(void);
 extern void efi_switch_mm(struct mm_struct *mm);

 struct efi_setup_data {
diff --git a/arch/x86/platform/efi/efi.c b/arch/x86/platform/efi/efi.c
index 9061babfbc83..a3169d14583f 100644
--- a/arch/x86/platform/efi/efi.c
+++ b/arch/x86/platform/efi/efi.c
@@ -893,9 +893,6 @@ static void __init kexec_enter_virtual_mode(void)

 	if (efi_enabled(EFI_OLD_MEMMAP) && (__supported_pte_mask & _PAGE_NX))
 		runtime_code_page_mkexec();
-
-	/* clean DUMMY object */
-	efi_delete_dummy_variable();
 #endif
 }

@@ -1015,9 +1012,6 @@ static void __init __efi_enter_virtual_mode(void)
 	 * necessary relocation fixups for the new virtual addresses.
 	 */
 	efi_runtime_update_mappings();
-
-	/* clean DUMMY object */
-	efi_delete_dummy_variable();
 }

 void __init efi_enter_virtual_mode(void)
diff --git a/drivers/firmware/efi/efi.c b/drivers/firmware/efi/efi.c
index 232f4915223b..1176af664013 100644
--- a/drivers/firmware/efi/efi.c
+++ b/drivers/firmware/efi/efi.c
@@ -337,6 +337,12 @@ static int __init efisubsys_init(void)
 	if (!efi_enabled(EFI_BOOT))
 		return 0;

+	/*
+	 * Clean DUMMY object calls EFI Runtime Service, set_variable(), so
+	 * it should be invoked only after efi_rts_wq is ready.
+	 */
+	efi_delete_dummy_variable();
+
 	/* We register the efi directory at /sys/firmware/efi */
 	efi_kobj = kobject_create_and_add("efi", firmware_kobj);
 	if (!efi_kobj) {
diff --git a/include/linux/efi.h b/include/linux/efi.h
index 3016d8c456bc..1b79939d0b1e 100644
--- a/include/linux/efi.h
+++ b/include/linux/efi.h
@@ -994,6 +994,7 @@ extern efi_status_t efi_query_variable_store(u32 attributes,
 					     unsigned long size,
 					     bool nonblocking);
 extern void efi_find_mirror(void);
+extern void efi_delete_dummy_variable(void);
 #else
 static inline void efi_late_init(void) {}
 static inline void efi_free_boot_services(void) {}
@@ -1004,6 +1005,8 @@ static inline efi_status_t efi_query_variable_store(u32 attributes,
 {
 	return EFI_SUCCESS;
 }
+
+static inline void efi_delete_dummy_variable(void) {}
 #endif

 extern void __iomem *efi_lookup_mapped_addr(u64 phys_addr);
--
2.7.4
[PATCH V4 0/3] Use efi_rts_wq to invoke EFI Runtime Services
comments and concerns.

Note:
-
Patches are based on Linus's kernel v4.17-rc6 [1]

Backup: Detailing efi_pgd:
--
efi_pgd has mappings for EFI Runtime Code/Data (on x86, plus EFI Boot time
Code/Data) regions. Due to the nature of these mappings, they fall in user
space address ranges and they are not the same as swapper.

[On arm64, the EFI mappings are in the VA range usually used for user space.
The two halves of the address space are managed by separate tables, TTBR0
and TTBR1. We always map the kernel in TTBR1, and we map user space or EFI
runtime mappings in TTBR0.] - Mark Rutland

Changes from V3 to V4:
--
1. As suggested by Peter, use completions instead of flush_work() as the
   former is cheaper
2. Call efi_delete_dummy_variable() from efisubsys_init(). Sorry! Ard,
   wasn't able to find a better alternative to keep this change local to
   arch/x86.

Changes from V2 to V3:
--
1. Rewrite the cover letter to clearly state the problem. What we are
   fixing and what we are not fixing.
2. Make efi_delete_dummy_variable() change local to x86.
3. Avoid using BUG(), instead, print error message and exit gracefully.
4. Move struct efi_runtime_work to runtime-wrappers.c file.
5. Give enum a name (efi_rts_ids) and use it in efi_runtime_work.
6. Add Naresh (maintainer of LUV for ARM) and Miguel to the CC list.

Changes from V1 to V2:
--
1. Remove unnecessary include of asm/efi.h file - Fixes build error on
   ia64, reported by 0-day
2. Use enum to identify efi_runtime_services()
3. Use alloc_ordered_workqueue() to create efi_rts_wq as
   create_workqueue() is scheduled for deprecation.
4. Make efi_call_rts() static, as it has no callers outside
   runtime-wrappers.c
5. Use BUG(), when we are unable to queue work or unable to identify
   requested efi_runtime_service() - Because these two situations should
   *never* happen.

Sai Praneeth (3):
  x86/efi: Call efi_delete_dummy_variable() during efi subsystem
    initialization
  efi: Create efi_rts_wq and efi_queue_work() to invoke all
    efi_runtime_services()
  efi: Use efi_rts_wq to invoke EFI Runtime Services

 arch/x86/include/asm/efi.h              |   1 -
 arch/x86/platform/efi/efi.c             |   6 -
 drivers/firmware/efi/efi.c              |  20 +++
 drivers/firmware/efi/runtime-wrappers.c | 256 +---
 include/linux/efi.h                     |   6 +
 5 files changed, 262 insertions(+), 27 deletions(-)

Signed-off-by: Sai Praneeth Prakhya <sai.praneeth.prak...@intel.com>
Suggested-by: Andy Lutomirski <l...@kernel.org>
Cc: Lee Chun-Yi <j...@suse.com>
Cc: Borislav Petkov <b...@alien8.de>
Cc: Tony Luck <tony.l...@intel.com>
Cc: Will Deacon <will.dea...@arm.com>
Cc: Dave Hansen <dave.han...@intel.com>
Cc: Mark Rutland <mark.rutl...@arm.com>
Cc: Bhupesh Sharma <bhsha...@redhat.com>
Cc: Naresh Bhat <naresh.b...@linaro.org>
Cc: Ricardo Neri <ricardo.n...@intel.com>
Cc: Peter Zijlstra <pet...@infradead.org>
Cc: Ravi Shankar <ravi.v.shan...@intel.com>
Cc: Matt Fleming <m...@codeblueprint.co.uk>
Cc: Dan Williams <dan.j.willi...@intel.com>
Cc: Ard Biesheuvel <ard.biesheu...@linaro.org>
Cc: Miguel Ojeda <miguel.ojeda.sando...@gmail.com>
--
2.7.4
[PATCH V4 3/3] efi: Use efi_rts_wq to invoke EFI Runtime Services
From: Sai Praneeth <sai.praneeth.prak...@intel.com>

Presently, when a user process requests the kernel to execute any
efi_runtime_service(), the kernel switches the page directory (%cr3) from
swapper_pgd to efi_pgd. Other subsystems in the kernel aren't aware of
this switch and they might assume that user space is still valid (i.e.
that the user space mappings still point to the process that requested
efi_runtime_service()), but in reality it is not so.

A solution for this issue is to use a kthread to run
efi_runtime_service(). When a user process requests the kernel to execute
any efi_runtime_service(), the kernel queues the work to efi_rts_wq, a
kthread comes along, switches to efi_pgd and executes
efi_runtime_service() in kthread context. Anything that tries to touch
user space addresses while in kthread is terminally broken.

Implementation summary:
-----------------------
1. When user/kernel thread requests to execute efi_runtime_service(),
   enqueue work to efi_rts_wq.
2. Caller thread waits for completion until the work is finished because
   it's dependent on the return status of efi_runtime_service().

Semantics to pack arguments in efi_runtime_work (has void pointers):
1. If argument is a pointer (of any type), pass it as is.
2. If argument is a value (of any type), address of the value is passed.

Introduce a handler function (called efi_call_rts()) that
1. Understands efi_runtime_work and
2. Invokes the appropriate efi_runtime_service() with the appropriate
   arguments

Semantics followed by efi_call_rts() to understand efi_runtime_work:
1. If argument was a pointer, recast it from void pointer to original
   pointer type.
2. If argument was a value, recast it from void pointer to original
   pointer type and dereference it.

pstore writes could potentially be invoked in atomic context and it uses
set_variable<>() and query_variable_info<>() to store logs. If we invoke
efi_runtime_services() through efi_rts_wq while in atomic context, the
kernel issues a warning ("scheduling while atomic") and prints a stack
trace. One way to overcome this is to not make the caller process wait
for the worker thread to finish. This approach breaks pstore, i.e. the
log messages aren't written to efi variables. Hence, pstore calls
efi_runtime_services() without using efi_rts_wq; in other words,
efi_rts_wq will be used unconditionally for all the
efi_runtime_services() except set_variable<>() and
query_variable_info<>().

Signed-off-by: Sai Praneeth Prakhya <sai.praneeth.prak...@intel.com>
Suggested-by: Andy Lutomirski <l...@kernel.org>
Cc: Lee Chun-Yi <j...@suse.com>
Cc: Borislav Petkov <b...@alien8.de>
Cc: Tony Luck <tony.l...@intel.com>
Cc: Will Deacon <will.dea...@arm.com>
Cc: Dave Hansen <dave.han...@intel.com>
Cc: Mark Rutland <mark.rutl...@arm.com>
Cc: Bhupesh Sharma <bhsha...@redhat.com>
Cc: Naresh Bhat <naresh.b...@linaro.org>
Cc: Ricardo Neri <ricardo.n...@intel.com>
Cc: Peter Zijlstra <pet...@infradead.org>
Cc: Ravi Shankar <ravi.v.shan...@intel.com>
Cc: Matt Fleming <m...@codeblueprint.co.uk>
Cc: Dan Williams <dan.j.willi...@intel.com>
Cc: Ard Biesheuvel <ard.biesheu...@linaro.org>
Cc: Miguel Ojeda <miguel.ojeda.sando...@gmail.com>
---
 drivers/firmware/efi/runtime-wrappers.c | 171
 1 file changed, 151 insertions(+), 20 deletions(-)

diff --git a/drivers/firmware/efi/runtime-wrappers.c b/drivers/firmware/efi/runtime-wrappers.c
index 534bd348feca..26bb6645ff59 100644
--- a/drivers/firmware/efi/runtime-wrappers.c
+++ b/drivers/firmware/efi/runtime-wrappers.c
@@ -175,13 +175,108 @@ void efi_call_virt_check_flags(unsigned long flags, const char *call)
  */
 static DEFINE_SEMAPHORE(efi_runtime_lock);
+/*
+ * Calls the appropriate efi_runtime_service() with the appropriate
+ * arguments.
+ *
+ * Semantics followed by efi_call_rts() to understand efi_runtime_work:
+ * 1. If argument was a pointer, recast it from void pointer to original
+ *    pointer type.
+ * 2. If argument was a value, recast it from void pointer to original
+ *    pointer type and dereference it.
+ */
+static void efi_call_rts(struct work_struct *work)
+{
+	struct efi_runtime_work *efi_rts_work;
+	void *arg1, *arg2, *arg3, *arg4, *arg5;
+	efi_status_t status = EFI_NOT_FOUND;
+
+	efi_rts_work = container_of(work, struct efi_runtime_work, work);
+	arg1 = efi_rts_work->arg1;
+	arg2 = efi_rts_work->arg2;
+	arg3 = efi_rts_work->arg3;
+	arg4 = efi_rts_work->arg4;
+	arg5 = efi_rts_work->arg5;
+
+	switch (efi_rts_work->efi_rts_id) {
+	case GET_TIME:
+		status = efi_call_virt(get_time, (efi_time_t *)arg1,
+				       (efi_time_cap_t *)arg2);
+		break;
+	case SET_TIME:
+		status = efi_call_virt(set_time, (efi_time_t *)arg1);
+		break;
+	case GET_WAKEUP_TIME:
+		status = efi_call_virt(get_wakeup_time, (efi_bool_t *)arg1,
+				       (efi_bool_t *)arg2, (efi_time_t *)arg3);
+		break;
+	case SET_WAKEUP_TIME:
+		status = efi_call_virt(set_wakeup_time, *(efi_bool_t *)arg1,
+				       (efi_time_t *)arg2);
+		break;
+	case GET_VARIABLE:
+		status = efi_call_virt(get_variable, (efi_char16_t *)arg1,
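The packing convention described in this patch (pointers passed as is, values passed by address, and a handler that recasts and dereferences accordingly) can be sketched in plain userspace C. This is only an illustration under stated assumptions: `rts_work`, `call_rts()`, `get_count()`, `set_flag()` and the `RTS_*` identifiers are hypothetical stand-ins, not the kernel's `efi_runtime_work` or `efi_call_rts()`.

```c
#include <stddef.h>

/* Hypothetical service identifiers; stand-ins for enum efi_rts_ids. */
enum rts_id { RTS_GET_COUNT, RTS_SET_FLAG };

/* Mirrors the idea of efi_runtime_work: every argument is a void pointer. */
struct rts_work {
	enum rts_id id;
	void *arg1;
	int status;
};

static int stored_flag;

/* Service taking a pointer argument: it fills the caller's buffer. */
static int get_count(unsigned long *out)
{
	*out = 42;
	return 0;
}

/* Service taking a value argument. */
static int set_flag(int enabled)
{
	stored_flag = enabled;
	return 0;
}

/* Handler: recast each void pointer; dereference only by-value arguments. */
static void call_rts(struct rts_work *w)
{
	switch (w->id) {
	case RTS_GET_COUNT:
		/* Argument was a pointer: recast to the original pointer type. */
		w->status = get_count((unsigned long *)w->arg1);
		break;
	case RTS_SET_FLAG:
		/* Argument was a value: recast, then dereference. */
		w->status = set_flag(*(int *)w->arg1);
		break;
	default:
		w->status = -1;
		break;
	}
}

/* Caller side: a pointer argument is stored as is. */
static int demo_get_count(unsigned long *out)
{
	struct rts_work w = { .id = RTS_GET_COUNT, .arg1 = out, .status = -1 };

	call_rts(&w);
	return w.status;
}

/* Caller side: a value argument is stored by address. */
static int demo_set_flag(int enabled)
{
	struct rts_work w = { .id = RTS_SET_FLAG, .arg1 = &enabled, .status = -1 };

	call_rts(&w);
	return w.status;
}
```

Because the caller in this sketch blocks until the handler has run, the address of a by-value argument living on the caller's stack stays valid for the whole call; the same reasoning is why the patch makes the caller wait.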
[PATCH V4 2/3] efi: Create efi_rts_wq and efi_queue_work() to invoke all efi_runtime_services()
From: Sai Praneeth <sai.praneeth.prak...@intel.com>

When a process requests the kernel to execute any efi_runtime_service(),
the requested efi_runtime_service (represented as an identifier) and its
arguments are packed into a struct named efi_runtime_work and queued onto
a work queue named efi_rts_wq. The caller then waits until the work is
completed.

Introduce some infrastructure:
1. Creating workqueue named efi_rts_wq
2. A macro (efi_queue_work()) that
   a. Populates efi_runtime_work
   b. Queues work onto efi_rts_wq and
   c. Waits until worker thread completes

The caller thread has to wait until the worker thread completes, because
it depends on the return status of efi_runtime_service() and, in specific
cases, the arguments populated by efi_runtime_service(). Some
efi_runtime_services() take a pointer to a buffer as an argument and fill
up the buffer with the requested data, for instance efi_get_variable()
and efi_get_next_variable(). Hence, the caller process cannot just post
the work and get going.

Some facts about efi_runtime_services():
1. A quick look at all the efi_runtime_services() shows that any
   efi_runtime_service() has five or fewer arguments.
2. An argument of efi_runtime_service() can be a value (of any type) or a
   pointer (of any type).
Hence, efi_runtime_work has five void pointers to store these arguments.

Signed-off-by: Sai Praneeth Prakhya <sai.praneeth.prak...@intel.com>
Suggested-by: Andy Lutomirski <l...@kernel.org>
Cc: Lee Chun-Yi <j...@suse.com>
Cc: Borislav Petkov <b...@alien8.de>
Cc: Tony Luck <tony.l...@intel.com>
Cc: Will Deacon <will.dea...@arm.com>
Cc: Dave Hansen <dave.han...@intel.com>
Cc: Mark Rutland <mark.rutl...@arm.com>
Cc: Bhupesh Sharma <bhsha...@redhat.com>
Cc: Naresh Bhat <naresh.b...@linaro.org>
Cc: Ricardo Neri <ricardo.n...@intel.com>
Cc: Peter Zijlstra <pet...@infradead.org>
Cc: Ravi Shankar <ravi.v.shan...@intel.com>
Cc: Matt Fleming <m...@codeblueprint.co.uk>
Cc: Dan Williams <dan.j.willi...@intel.com>
Cc: Ard Biesheuvel <ard.biesheu...@linaro.org>
Cc: Miguel Ojeda <miguel.ojeda.sando...@gmail.com>
---
 drivers/firmware/efi/efi.c              | 14 ++
 drivers/firmware/efi/runtime-wrappers.c | 85 +
 include/linux/efi.h                     |  3 ++
 3 files changed, 102 insertions(+)

diff --git a/drivers/firmware/efi/efi.c b/drivers/firmware/efi/efi.c
index 1176af664013..2632294eb33f 100644
--- a/drivers/firmware/efi/efi.c
+++ b/drivers/firmware/efi/efi.c
@@ -84,6 +84,8 @@ struct mm_struct efi_mm = {
 	.mmlist			= LIST_HEAD_INIT(efi_mm.mmlist),
 };
+struct workqueue_struct *efi_rts_wq;
+
 static bool disable_runtime;
 static int __init setup_noefi(char *arg)
 {
@@ -338,6 +340,18 @@ static int __init efisubsys_init(void)
 		return 0;
 	/*
+	 * Since we process only one efi_runtime_service() at a time, an
+	 * ordered workqueue (which creates only one execution context)
+	 * should suffice for all our needs.
+	 */
+	efi_rts_wq = alloc_ordered_workqueue("efi_rts_wq", 0);
+	if (!efi_rts_wq) {
+		pr_err("Creating efi_rts_wq failed, EFI runtime services disabled.\n");
+		clear_bit(EFI_RUNTIME_SERVICES, &efi.flags);
+		return 0;
+	}
+
+	/*
 	 * Clean DUMMY object calls EFI Runtime Service, set_variable(), so
 	 * it should be invoked only after efi_rts_wq is ready.
 	 */
diff --git a/drivers/firmware/efi/runtime-wrappers.c b/drivers/firmware/efi/runtime-wrappers.c
index ae54870b2788..534bd348feca 100644
--- a/drivers/firmware/efi/runtime-wrappers.c
+++ b/drivers/firmware/efi/runtime-wrappers.c
@@ -1,6 +1,15 @@
 /*
  * runtime-wrappers.c - Runtime Services function call wrappers
  *
+ * Implementation summary:
+ * -----------------------
+ * 1. When user/kernel thread requests to execute efi_runtime_service(),
+ *    enqueue work to efi_rts_wq.
+ * 2. Caller thread waits for completion until the work is finished
+ *    because it's dependent on the return status and execution of
+ *    efi_runtime_service().
+ *    For instance, get_variable() and get_next_variable().
+ *
  * Copyright (C) 2014 Linaro Ltd. <ard.biesheu...@linaro.org>
  *
  * Split off from arch/x86/platform/efi/efi.c
@@ -22,6 +31,9 @@
 #include
 #include
 #include
+#include
+#include
+
 #include
 /*
@@ -33,6 +45,79 @@
 #define __efi_call_virt(f, args...) \
 	__efi_call_virt_pointer(efi.systab->runtime, f, args)
+/* efi_runtime_service() function identifiers */
+enum efi_rts_ids {
+	GET_TIME,
+	SET_TIME,
+	GET_WAKEUP_TIME,
+	SET_WAKEUP_TIME,
+	GET_VARIABLE,
+	GET_NEXT_VARIABLE,
+	SET_VARIABLE,
+	SET_VARIABLE_NONBLOCKING,
+	QUERY_VARIABLE_INFO,
+	QUERY_VARIABLE_INFO_NONBLOCKING,
+	GET_NEXT_HIGH_MONO_COUNT,
+	RESET_SYSTEM,
+	UPDATE_CAPSULE,
+	QUERY_CAPSULE_CAPS,
+};
+
+/*
+ * efi_runtime_work:	Details of EFI Runtime Service work
+ * @arg<1-5>:		EFI Runtime Service function arguments
+ * @status:		Status of executing EFI Runtime Service
+ * @efi_rts_id:		EFI Runtime Service function identifier
+ * @efi_rts_comp:	Struct used for handling completions
+ */
+struct efi_runtime_work {
+	void *arg1;
+	void *arg2;
+	void *arg3;
+	void *arg4;
+	void *arg5;
+	efi_status_t status;
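The "populate, queue, wait for completion" sequence described in this patch can be illustrated with a small userspace analogue. The sketch below assumes POSIX threads; `struct completion`, `init_completion()`, `complete()` and `wait_for_completion()` here are hand-rolled imitations built on a mutex and a condition variable, not the kernel API, and `work_item`, `worker()` and `queue_work_and_wait()` are hypothetical names.

```c
#include <pthread.h>
#include <stdbool.h>

/* Hand-rolled userspace analogue of a completion (not the kernel type). */
struct completion {
	pthread_mutex_t lock;
	pthread_cond_t cond;
	bool done;
};

static void init_completion(struct completion *c)
{
	pthread_mutex_init(&c->lock, NULL);
	pthread_cond_init(&c->cond, NULL);
	c->done = false;
}

/* Worker side: mark the work as done and wake the waiter. */
static void complete(struct completion *c)
{
	pthread_mutex_lock(&c->lock);
	c->done = true;
	pthread_cond_signal(&c->cond);
	pthread_mutex_unlock(&c->lock);
}

/* Caller side: sleep until the worker has signalled completion. */
static void wait_for_completion(struct completion *c)
{
	pthread_mutex_lock(&c->lock);
	while (!c->done)
		pthread_cond_wait(&c->cond, &c->lock);
	pthread_mutex_unlock(&c->lock);
}

struct work_item {
	int input;
	int *result;		/* buffer owned by the caller, filled by the worker */
	int status;
	struct completion comp;
};

/* Stands in for the worker thread that executes the queued service. */
static void *worker(void *arg)
{
	struct work_item *w = arg;

	*w->result = w->input * 2;
	w->status = 0;
	complete(&w->comp);
	return NULL;
}

/* Populate the work item, hand it to a worker, and wait for the result. */
static int queue_work_and_wait(int input, int *result)
{
	struct work_item w = { .input = input, .result = result, .status = -1 };
	pthread_t tid;

	init_completion(&w.comp);
	if (pthread_create(&tid, NULL, worker, &w) != 0)
		return -1;
	wait_for_completion(&w.comp);
	/* Join before w leaves scope so the worker never touches a dead stack frame. */
	pthread_join(tid, NULL);
	pthread_cond_destroy(&w.comp.cond);
	pthread_mutex_destroy(&w.comp.lock);
	return w.status;
}
```

The caller cannot "post the work and get going" here for the same reason given above: both the status and the filled buffer are only meaningful once the worker has finished.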
[PATCH V3 1/3] x86/efi: Call efi_delete_dummy_variable() after creating efi_rts_wq
From: Sai Praneeth <sai.praneeth.prak...@intel.com>

Create a workqueue named efi_rts_wq (efi runtime services workqueue), so
that all efi_runtime_services() are executed in kthread context.

Invoking efi_runtime_services() through efi_rts_wq means all accesses to
efi_runtime_services() should be done after efi_rts_wq has been created.
efi_delete_dummy_variable() calls set_variable(), hence
efi_delete_dummy_variable() should be called after efi_rts_wq has been
created.

Signed-off-by: Sai Praneeth Prakhya <sai.praneeth.prak...@intel.com>
Suggested-by: Andy Lutomirski <l...@kernel.org>
Cc: Lee Chun-Yi <j...@suse.com>
Cc: Borislav Petkov <b...@alien8.de>
Cc: Tony Luck <tony.l...@intel.com>
Cc: Will Deacon <will.dea...@arm.com>
Cc: Dave Hansen <dave.han...@intel.com>
Cc: Mark Rutland <mark.rutl...@arm.com>
Cc: Bhupesh Sharma <bhsha...@redhat.com>
Cc: Naresh Bhat <naresh.b...@linaro.org>
Cc: Ricardo Neri <ricardo.n...@intel.com>
Cc: Peter Zijlstra <pet...@infradead.org>
Cc: Ravi Shankar <ravi.v.shan...@intel.com>
Cc: Matt Fleming <m...@codeblueprint.co.uk>
Cc: Dan Williams <dan.j.willi...@intel.com>
Cc: Ard Biesheuvel <ard.biesheu...@linaro.org>
Cc: Miguel Ojeda <miguel.ojeda.sando...@gmail.com>
---
 arch/x86/platform/efi/efi.c        | 15 +--
 drivers/firmware/efi/arm-runtime.c |  3 +++
 drivers/firmware/efi/efi.c         | 25 +
 include/linux/efi.h                |  4
 4 files changed, 41 insertions(+), 6 deletions(-)

diff --git a/arch/x86/platform/efi/efi.c b/arch/x86/platform/efi/efi.c
index 9061babfbc83..adcc55cd25ce 100644
--- a/arch/x86/platform/efi/efi.c
+++ b/arch/x86/platform/efi/efi.c
@@ -893,9 +893,6 @@ static void __init kexec_enter_virtual_mode(void)
 	if (efi_enabled(EFI_OLD_MEMMAP) && (__supported_pte_mask & _PAGE_NX))
 		runtime_code_page_mkexec();
-
-	/* clean DUMMY object */
-	efi_delete_dummy_variable();
 #endif
 }
@@ -1015,9 +1012,6 @@ static void __init __efi_enter_virtual_mode(void)
 	 * necessary relocation fixups for the new virtual addresses.
 	 */
 	efi_runtime_update_mappings();
-
-	/* clean DUMMY object */
-	efi_delete_dummy_variable();
 }
 void __init efi_enter_virtual_mode(void)
@@ -1031,6 +1025,15 @@ void __init efi_enter_virtual_mode(void)
 		__efi_enter_virtual_mode();
 	efi_dump_pagetable();
+
+	if (!efi_create_rts_wq())
+		return;
+
+	/*
+	 * Clean DUMMY object calls EFI Runtime Service, set_variable(), so
+	 * it should be invoked only after efi_rts_wq is ready.
+	 */
+	efi_delete_dummy_variable();
 }
 static int __init arch_parse_efi_cmdline(char *str)
diff --git a/drivers/firmware/efi/arm-runtime.c b/drivers/firmware/efi/arm-runtime.c
index 5889cbea60b8..6fb06130b53f 100644
--- a/drivers/firmware/efi/arm-runtime.c
+++ b/drivers/firmware/efi/arm-runtime.c
@@ -139,6 +139,9 @@ static int __init arm_enable_runtime_services(void)
 		return -ENOMEM;
 	}
+	if (!efi_create_rts_wq())
+		return 0;
+
 	/* Set up runtime services function pointers */
 	efi_native_runtime_setup();
 	set_bit(EFI_RUNTIME_SERVICES, &efi.flags);
diff --git a/drivers/firmware/efi/efi.c b/drivers/firmware/efi/efi.c
index 232f4915223b..b9103caa03b4 100644
--- a/drivers/firmware/efi/efi.c
+++ b/drivers/firmware/efi/efi.c
@@ -84,6 +84,8 @@ struct mm_struct efi_mm = {
 	.mmlist			= LIST_HEAD_INIT(efi_mm.mmlist),
 };
+struct workqueue_struct *efi_rts_wq;
+
 static bool disable_runtime;
 static int __init setup_noefi(char *arg)
 {
@@ -337,6 +339,13 @@ static int __init efisubsys_init(void)
 	if (!efi_enabled(EFI_BOOT))
 		return 0;
+	/*
+	 * If we failed to create efi_rts_wq, EFI_RUNTIME_SERVICES would
+	 * have been cleared, check for that condition.
+	 */
+	if (!efi_enabled(EFI_RUNTIME_SERVICES))
+		return 0;
+
 	/* We register the efi directory at /sys/firmware/efi */
 	efi_kobj = kobject_create_and_add("efi", firmware_kobj);
 	if (!efi_kobj) {
@@ -971,3 +980,19 @@ static int register_update_efi_random_seed(void)
 }
 late_initcall(register_update_efi_random_seed);
 #endif
+
+bool __init efi_create_rts_wq(void)
+{
+	/*
+	 * Since we process only one efi_runtime_service() at a time, an
+	 * ordered workqueue (which creates only one execution context)
+	 * should suffice for all our needs.
+	 */
+	efi_rts_wq = alloc_ordered_workqueue("efi_rts_wq", 0);
+	if (!efi_rts_wq) {
+		pr_err("Creating efi_rts_wq failed, EFI runtime services disabled.\n");
+		clear_bit(EFI_RUNTIME_SERVICES, &efi.flags);
+		return false;
+	}
+	return true;
+}
diff --git a/include/linux/efi.h b/include/linux/efi.h
index 3016d8c456bc..565955010b18 100644
--- a/include/linux/efi.h
+++ b/include/linux/efi.h
@@ -987,6 +987,7 @@ extern void efi_map_pal_code (void);
 extern void efi_memmap_walk (efi_freemem_callback_t callback, void *arg);
 extern void efi_gettimeofday (struct timespec64 *ts);
 extern void efi_enter_virtual_mode (void);	/* switch EFI to virtual mode, if possible */
+extern bool __init efi_create_rts_wq(void);
 #ifdef CONFIG_X86
 extern void efi_late_init(void);
 extern void efi_free_boot_se
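The ordering constraint in this patch (create the workqueue first, clear the runtime-services flag if that fails, and only then call anything that depends on it) can be modelled in a few lines of userspace C. Everything in this sketch is a hypothetical stand-in: `RUNTIME_SERVICES_BIT`, `create_rts_wq()`, `enter_virtual_mode()` and `subsys_init()` only imitate the control flow, and the `simulate_failure` parameter exists purely so that the failure path can be exercised.

```c
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-in for the EFI_RUNTIME_SERVICES flag bit. */
#define RUNTIME_SERVICES_BIT 0x1UL

static unsigned long flags = RUNTIME_SERVICES_BIT;
static void *rts_wq;		/* stands in for the workqueue pointer */
static int dummy_deleted;	/* set when the "delete dummy variable" step ran */

/* Create the queue; on failure, disable runtime services as a whole. */
static bool create_rts_wq(bool simulate_failure)
{
	rts_wq = simulate_failure ? NULL : malloc(1);
	if (!rts_wq) {
		flags &= ~RUNTIME_SERVICES_BIT;
		return false;
	}
	return true;
}

/* The dependent call runs only after the queue exists. */
static void enter_virtual_mode(bool simulate_failure)
{
	if (!create_rts_wq(simulate_failure))
		return;
	dummy_deleted = 1;	/* stands in for efi_delete_dummy_variable() */
}

/* Later initialisation bails out early if runtime services were disabled. */
static int subsys_init(void)
{
	if (!(flags & RUNTIME_SERVICES_BIT))
		return 0;	/* nothing registered */
	return 1;		/* would go on to register further interfaces */
}
```

Clearing the flag once at the point of failure lets every later consumer make a single check instead of testing the queue pointer individually.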
[PATCH V3 3/3] efi: Use efi_rts_wq to invoke EFI Runtime Services
From: Sai Praneeth <sai.praneeth.prak...@intel.com>

Presently, when a user process requests the kernel to execute any
efi_runtime_service(), the kernel switches the page directory (%cr3) from
swapper_pgd to efi_pgd. Other subsystems in the kernel aren't aware of
this switch and they might assume that user space is still valid (i.e.
that the user space mappings still point to the process that requested
efi_runtime_service()), but in reality it is not so.

A solution for this issue is to use a kthread to run
efi_runtime_service(). When a user process requests the kernel to execute
any efi_runtime_service(), the kernel queues the work to efi_rts_wq, a
kthread comes along, switches to efi_pgd and executes
efi_runtime_service() in kthread context. Anything that tries to touch
user space addresses while in kthread is terminally broken.

Implementation summary:
-----------------------
1. When user/kernel thread requests to execute efi_runtime_service(),
   enqueue work to efi_rts_wq.
2. Caller thread waits until the work is finished because it's dependent
   on the return status of efi_runtime_service().

Semantics to pack arguments in efi_runtime_work (has void pointers):
1. If argument is a pointer (of any type), pass it as is.
2. If argument is a value (of any type), address of the value is passed.

Introduce a handler function (called efi_call_rts()) that
1. Understands efi_runtime_work and
2. Invokes the appropriate efi_runtime_service() with the appropriate
   arguments

Semantics followed by efi_call_rts() to understand efi_runtime_work:
1. If argument was a pointer, recast it from void pointer to original
   pointer type.
2. If argument was a value, recast it from void pointer to original
   pointer type and dereference it.

pstore writes could potentially be invoked in atomic context and it uses
set_variable<>() and query_variable_info<>() to store logs. If we invoke
efi_runtime_services() through efi_rts_wq while in atomic context, the
kernel issues a warning ("scheduling while atomic") and prints a stack
trace. One way to overcome this is to not make the caller process wait
for the worker thread to finish. This approach breaks pstore, i.e. the
log messages aren't written to efi variables. Hence, pstore calls
efi_runtime_services() without using efi_rts_wq; in other words,
efi_rts_wq will be used unconditionally for all the
efi_runtime_services() except set_variable<>() and
query_variable_info<>().

Signed-off-by: Sai Praneeth Prakhya <sai.praneeth.prak...@intel.com>
Suggested-by: Andy Lutomirski <l...@kernel.org>
Cc: Lee Chun-Yi <j...@suse.com>
Cc: Borislav Petkov <b...@alien8.de>
Cc: Tony Luck <tony.l...@intel.com>
Cc: Will Deacon <will.dea...@arm.com>
Cc: Dave Hansen <dave.han...@intel.com>
Cc: Mark Rutland <mark.rutl...@arm.com>
Cc: Bhupesh Sharma <bhsha...@redhat.com>
Cc: Naresh Bhat <naresh.b...@linaro.org>
Cc: Ricardo Neri <ricardo.n...@intel.com>
Cc: Peter Zijlstra <pet...@infradead.org>
Cc: Ravi Shankar <ravi.v.shan...@intel.com>
Cc: Matt Fleming <m...@codeblueprint.co.uk>
Cc: Dan Williams <dan.j.willi...@intel.com>
Cc: Ard Biesheuvel <ard.biesheu...@linaro.org>
Cc: Miguel Ojeda <miguel.ojeda.sando...@gmail.com>
---
 drivers/firmware/efi/runtime-wrappers.c | 170
 1 file changed, 150 insertions(+), 20 deletions(-)

diff --git a/drivers/firmware/efi/runtime-wrappers.c b/drivers/firmware/efi/runtime-wrappers.c
index a9866045ed52..23ff128fcb2f 100644
--- a/drivers/firmware/efi/runtime-wrappers.c
+++ b/drivers/firmware/efi/runtime-wrappers.c
@@ -170,13 +170,107 @@ void efi_call_virt_check_flags(unsigned long flags, const char *call)
  */
 static DEFINE_SEMAPHORE(efi_runtime_lock);
+/*
+ * Calls the appropriate efi_runtime_service() with the appropriate
+ * arguments.
+ *
+ * Semantics followed by efi_call_rts() to understand efi_runtime_work:
+ * 1. If argument was a pointer, recast it from void pointer to original
+ *    pointer type.
+ * 2. If argument was a value, recast it from void pointer to original
+ *    pointer type and dereference it.
+ */
+static void efi_call_rts(struct work_struct *work)
+{
+	struct efi_runtime_work *efi_rts_work;
+	void *arg1, *arg2, *arg3, *arg4, *arg5;
+	efi_status_t status = EFI_NOT_FOUND;
+
+	efi_rts_work = container_of(work, struct efi_runtime_work, work);
+	arg1 = efi_rts_work->arg1;
+	arg2 = efi_rts_work->arg2;
+	arg3 = efi_rts_work->arg3;
+	arg4 = efi_rts_work->arg4;
+	arg5 = efi_rts_work->arg5;
+
+	switch (efi_rts_work->efi_rts_id) {
+	case GET_TIME:
+		status = efi_call_virt(get_time, (efi_time_t *)arg1,
+				       (efi_time_cap_t *)arg2);
+		break;
+	case SET_TIME:
+		status = efi_call_virt(set_time, (efi_time_t *)arg1);
+		break;
+	case GET
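The pstore exception described in this commit message comes down to having two entry points for the same service: one that may sleep and one that never does. The sketch below is a userspace illustration assuming POSIX threads; `runtime_lock`, `set_variable()`, `set_variable_nonblocking()` and the `STATUS_*` values are hypothetical, and returning a "not ready" status when the lock is busy is an assumption of this sketch, not something stated in the patch.

```c
#include <pthread.h>

#define STATUS_OK        0
#define STATUS_NOT_READY 1	/* hypothetical "try again later" status */

static pthread_mutex_t runtime_lock = PTHREAD_MUTEX_INITIALIZER;
static int variable_store;

/* The underlying service; stands in for the firmware call. */
static int do_set_variable(int value)
{
	variable_store = value;
	return STATUS_OK;
}

/*
 * Blocking path: may sleep while waiting for the lock, so it is only
 * safe where sleeping is allowed. This is the kind of path the patch
 * routes through the workqueue.
 */
static int set_variable(int value)
{
	int status;

	pthread_mutex_lock(&runtime_lock);
	status = do_set_variable(value);
	pthread_mutex_unlock(&runtime_lock);
	return status;
}

/*
 * Non-blocking path: never sleeps. It either takes the lock at once and
 * calls the service directly, or gives up immediately.
 */
static int set_variable_nonblocking(int value)
{
	int status;

	if (pthread_mutex_trylock(&runtime_lock) != 0)
		return STATUS_NOT_READY;
	status = do_set_variable(value);
	pthread_mutex_unlock(&runtime_lock);
	return status;
}
```

A caller in a context that must not sleep would use only the second entry point and accept that the write may be refused.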
[PATCH V3 2/3] efi: Introduce efi_queue_work() to queue any efi_runtime_service() on efi_rts_wq
From: Sai Praneeth <sai.praneeth.prak...@intel.com>

When a process requests the kernel to execute any efi_runtime_service(), the requested efi_runtime_service (represented as an identifier) and its arguments are packed into a struct named efi_runtime_work and queued onto a work queue named efi_rts_wq. The caller then waits until the work is completed.

Introduce efi_queue_work() that
1. Populates efi_runtime_work
2. Queues work onto efi_rts_wq and
3. Waits until worker thread returns.

The caller thread has to wait until the worker thread returns, because it depends on the return status of efi_runtime_service() and, in specific cases, the arguments populated by efi_runtime_service(). Some efi_runtime_services() take a pointer to a buffer as an argument and fill up the buffer with the requested data, for instance, efi_get_variable() and efi_get_next_variable(). Hence, the caller process cannot just post the work and get going.

Some facts about efi_runtime_services():
1. A quick look at all the efi_runtime_services() shows that any efi_runtime_service() has five or fewer arguments.
2. An argument of efi_runtime_service() can be a value (of any type) or a pointer (of any type).
Hence, efi_runtime_work has five void pointers to store these arguments.

Signed-off-by: Sai Praneeth Prakhya <sai.praneeth.prak...@intel.com>
Suggested-by: Andy Lutomirski <l...@kernel.org>
Cc: Lee Chun-Yi <j...@suse.com>
Cc: Borislav Petkov <b...@alien8.de>
Cc: Tony Luck <tony.l...@intel.com>
Cc: Will Deacon <will.dea...@arm.com>
Cc: Dave Hansen <dave.han...@intel.com>
Cc: Mark Rutland <mark.rutl...@arm.com>
Cc: Bhupesh Sharma <bhsha...@redhat.com>
Cc: Naresh Bhat <naresh.b...@linaro.org>
Cc: Ricardo Neri <ricardo.n...@intel.com>
Cc: Peter Zijlstra <pet...@infradead.org>
Cc: Ravi Shankar <ravi.v.shan...@intel.com>
Cc: Matt Fleming <m...@codeblueprint.co.uk>
Cc: Dan Williams <dan.j.willi...@intel.com>
Cc: Ard Biesheuvel <ard.biesheu...@linaro.org>
Cc: Miguel Ojeda <miguel.ojeda.sando...@gmail.com>
---
 drivers/firmware/efi/runtime-wrappers.c | 80 +
 1 file changed, 80 insertions(+)

diff --git a/drivers/firmware/efi/runtime-wrappers.c b/drivers/firmware/efi/runtime-wrappers.c
index ae54870b2788..a9866045ed52 100644
--- a/drivers/firmware/efi/runtime-wrappers.c
+++ b/drivers/firmware/efi/runtime-wrappers.c
@@ -1,6 +1,14 @@
 /*
  * runtime-wrappers.c - Runtime Services function call wrappers
  *
+ * Implementation summary:
+ * ---
+ * 1. When user/kernel thread requests to execute efi_runtime_service(),
+ *    enqueue work to efi_rts_wq.
+ * 2. Caller thread waits until the work is finished because it's
+ *    dependent on the return status and execution of efi_runtime_service().
+ *    For instance, get_variable() and get_next_variable().
+ *
  * Copyright (C) 2014 Linaro Ltd. <ard.biesheu...@linaro.org>
  *
  * Split off from arch/x86/platform/efi/efi.c
@@ -22,6 +30,8 @@
 #include
 #include
 #include
+#include
+
 #include
 
 /*
@@ -33,6 +43,76 @@
 #define __efi_call_virt(f, args...) \
 	__efi_call_virt_pointer(efi.systab->runtime, f, args)
 
+/* efi_runtime_service() function identifiers */
+enum efi_rts_ids {
+	GET_TIME,
+	SET_TIME,
+	GET_WAKEUP_TIME,
+	SET_WAKEUP_TIME,
+	GET_VARIABLE,
+	GET_NEXT_VARIABLE,
+	SET_VARIABLE,
+	SET_VARIABLE_NONBLOCKING,
+	QUERY_VARIABLE_INFO,
+	QUERY_VARIABLE_INFO_NONBLOCKING,
+	GET_NEXT_HIGH_MONO_COUNT,
+	RESET_SYSTEM,
+	UPDATE_CAPSULE,
+	QUERY_CAPSULE_CAPS,
+};
+
+/*
+ * efi_runtime_work:	Details of EFI Runtime Service work
+ * @func:		EFI Runtime Service function identifier
+ * @arg<1-5>:		EFI Runtime Service function arguments
+ * @status:		Status of executing EFI Runtime Service
+ */
+struct efi_runtime_work {
+	void *arg1;
+	void *arg2;
+	void *arg3;
+	void *arg4;
+	void *arg5;
+	efi_status_t status;
+	struct work_struct work;
+	enum efi_rts_ids efi_rts_id;
+};
+
+/*
+ * efi_queue_work:	Queue efi_runtime_service() and wait until it's done
+ * @rts:		efi_runtime_service() function identifier
+ * @rts_arg<1-5>:	efi_runtime_service() function arguments
+ *
+ * Accesses to efi_runtime_services() are serialized by a binary
+ * semaphore (efi_runtime_lock) and caller waits until the work is
+ * finished, hence _only_ one work is queued at a time and the queued
+ * work gets flushed.
+ */
+#define efi_queue_work(_rts, _arg1, _arg2, _arg3, _arg4, _arg5)	\
+({									\
+	struct efi_runtime_work efi_rts_work;				\
+	efi_rts_work.status = EFI_ABORTED;				\
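The populate, queue, and wait sequence that the commit message lists for efi_queue_work() can be modelled in user space. The sketch below is an analogy only, under clearly stated assumptions: it uses a POSIX thread in place of the kernel work queue, `pthread_join()` in place of flushing the work, and the names `demo_work`, `demo_queue_work()`, and `demo_fill_buffer()` are hypothetical. It is not the kernel implementation.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical user-space analogue of efi_runtime_work. */
struct demo_work {
	void *arg1, *arg2, *arg3, *arg4, *arg5;
	long status;
	long (*fn)(struct demo_work *);
};

/* Worker entry point: run the packed request and record its status. */
static void *demo_worker(void *p)
{
	struct demo_work *w = p;

	assert(w->fn != NULL);
	w->status = w->fn(w);
	return NULL;
}

/*
 * 1. Populate the work item, 2. hand it to a worker thread,
 * 3. wait until the worker returns.
 */
static long demo_queue_work(long (*fn)(struct demo_work *),
			    void *a1, void *a2, void *a3, void *a4, void *a5)
{
	struct demo_work w = {
		.arg1 = a1, .arg2 = a2, .arg3 = a3, .arg4 = a4, .arg5 = a5,
		.status = -1,	/* stays -1 if the work never runs */
		.fn = fn,
	};
	pthread_t tid;

	if (pthread_create(&tid, NULL, demo_worker, &w) != 0)
		return w.status;

	/* The work item lives on this stack frame, so we must wait here. */
	pthread_join(tid, NULL);
	return w.status;
}

/*
 * Example request: fills the caller's buffer, in the way the commit
 * message describes for services that take a buffer pointer.
 */
static long demo_fill_buffer(struct demo_work *w)
{
	char *buf = w->arg1;		/* pointer argument, used as-is */
	size_t len = *(size_t *)w->arg2;	/* value argument, passed by address */
	size_t i;

	for (i = 0; i + 1 < len; i++)
		buf[i] = 'x';
	if (len > 0)
		buf[len - 1] = '\0';
	return 0;
}
```

A call such as `demo_queue_work(demo_fill_buffer, buf, &len, NULL, NULL, NULL)` returns only after the buffer has been filled, which is why the caller cannot simply post the work and continue. Building the sketch may require `-pthread`, depending on the toolchain.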
[PATCH V3 0/3] Use efi_rts_wq to invoke EFI Runtime Services
comments and concerns.

Note:
-----
Patches are based on Linus's kernel v4.17-rc6

[1] Backup: Detailing efi_pgd:
------------------------------
efi_pgd has mappings for EFI Runtime Code/Data (on x86, plus EFI Boot time Code/Data) regions. Due to the nature of these mappings, they fall in user space address ranges and they are not the same as swapper.

[On arm64, the EFI mappings are in the VA range usually used for user space. The two halves of the address space are managed by separate tables, TTBR0 and TTBR1. We always map the kernel in TTBR1, and we map user space or EFI runtime mappings in TTBR0.] - Mark Rutland

Changes from V2 to V3:
----------------------
1. Rewrite the cover letter to clearly state the problem: what we are fixing and what we are not fixing.
2. Make efi_delete_dummy_variable() change local to x86.
3. Avoid using BUG(); instead, print an error message and exit gracefully.
4. Move struct efi_runtime_work to runtime-wrappers.c file.
5. Give enum a name (efi_rts_ids) and use it in efi_runtime_work.
6. Add Naresh (maintainer of LUV for ARM) and Miguel to the CC list.

Changes from V1 to V2:
----------------------
1. Remove unnecessary include of asm/efi.h file - Fixes build error on ia64, reported by 0-day
2. Use enum to identify efi_runtime_services()
3. Use alloc_ordered_workqueue() to create efi_rts_wq as create_workqueue() is scheduled for deprecation.
4. Make efi_call_rts() static, as it has no callers outside runtime-wrappers.c
5. Use BUG(), when we are unable to queue work or unable to identify requested efi_runtime_service() - Because these two situations should *never* happen.

Sai Praneeth (3):
  x86/efi: Call efi_delete_dummy_variable() after creating efi_rts_wq
  efi: Introduce efi_queue_work() to queue any efi_runtime_service() on efi_rts_wq
  efi: Use efi_rts_wq to invoke EFI Runtime Services

 arch/x86/platform/efi/efi.c | 15 +-
 drivers/firmware/efi/arm-runtime.c | 3 +
 drivers/firmware/efi/efi.c | 25
 drivers/firmware/efi/runtime-wrappers.c | 250 +---
 include/linux/efi.h | 4 +
 5 files changed, 271 insertions(+), 26 deletions(-)

Signed-off-by: Sai Praneeth Prakhya <sai.praneeth.prak...@intel.com>
Suggested-by: Andy Lutomirski <l...@kernel.org>
Cc: Lee Chun-Yi <j...@suse.com>
Cc: Borislav Petkov <b...@alien8.de>
Cc: Tony Luck <tony.l...@intel.com>
Cc: Will Deacon <will.dea...@arm.com>
Cc: Dave Hansen <dave.han...@intel.com>
Cc: Mark Rutland <mark.rutl...@arm.com>
Cc: Bhupesh Sharma <bhsha...@redhat.com>
Cc: Naresh Bhat <naresh.b...@linaro.org>
Cc: Ricardo Neri <ricardo.n...@intel.com>
Cc: Peter Zijlstra <pet...@infradead.org>
Cc: Ravi Shankar <ravi.v.shan...@intel.com>
Cc: Matt Fleming <m...@codeblueprint.co.uk>
Cc: Dan Williams <dan.j.willi...@intel.com>
Cc: Ard Biesheuvel <ard.biesheu...@linaro.org>
Cc: Miguel Ojeda <miguel.ojeda.sando...@gmail.com>
--
2.7.4
[PATCH] x86: Use boot_cpu_has() instead of this_cpu_has() in build_cr3_noflush()
From: Sai Praneeth <sai.praneeth.prak...@intel.com>

When the platform supports PCID and CONFIG_DEBUG_VM is enabled, build_cr3_noflush() (called via switch_mm()) does a sanity check to see if X86_FEATURE_PCID is set. Presently, build_cr3_noflush() uses "this_cpu_has(X86_FEATURE_PCID)" to perform the check, but this_cpu_has() works only after SMP is initialized (i.e. the per-CPU cpu_info structures must be populated), and this happens very late in the boot process (during rest_init).

As efi_runtime_services() are called during (early) kernel boot time and at run time, modify build_cr3_noflush() to use boot_cpu_has() all the time. As suggested by Dave, this should be OK because all CPUs have the same capabilities anyway (for x86).

Without this change we see the warning below during kernel boot.

WARNING: CPU: 0 PID: 0 at arch/x86/include/asm/tlbflush.h:134 load_new_mm_cr3+0x114/0x170
Modules linked in:
CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.16.0-02277-gbc16d4052f1a #1
Hardware name: System manufacturer System Product Name/Z170-K, BIOS 3301 02/08/2017
RIP: 0010:load_new_mm_cr3+0x114/0x170
RSP: :9b203e38 EFLAGS: 00010046
RAX: RBX: 9b26f5a0 RCX:
RDX: RSI: RDI: 9b20a000
RBP: 9b203e90 R08: R09: 0f63eb29
R10: 9b203ea8 R11: c3292018 R12:
R13: 9b2e1180 R14: 0001ee80 R15:
FS: () GS:968df6c0() knlGS:
CS: 0010 DS: ES: CR0: 80050033
CR2: 968df6fff000 CR3: 0004261e6002 CR4: 000606b0
DR0: DR1: DR2:
DR3: DR6: fffe0ff0 DR7: 0400
Call Trace:
 switch_mm_irqs_off+0x267/0x590
 switch_mm+0xe/0x20
 efi_switch_mm+0x3e/0x50
 efi_enter_virtual_mode+0x43f/0x4da
 start_kernel+0x3bf/0x458
 secondary_startup_64+0xa5/0xb0

Dave also suggested that we put a warning in this_cpu_has() if it's used early in the boot process. This is still work in progress as it affects MCE.

Signed-off-by: Sai Praneeth Prakhya <sai.praneeth.prak...@intel.com>
Reported-by: Linus Torvalds <torva...@linux-foundation.org>
Cc: Lee Chun-Yi <j...@suse.com>
Cc: Borislav Petkov <b...@alien8.de>
Cc: Tony Luck <tony.l...@intel.com>
Cc: Andy Lutomirski <l...@kernel.org>
Cc: Michael S. Tsirkin <m...@redhat.com>
Cc: Ricardo Neri <ricardo.n...@intel.com>
Cc: Matt Fleming <m...@codeblueprint.co.uk>
Cc: Ard Biesheuvel <ard.biesheu...@linaro.org>
Cc: Ravi Shankar <ravi.v.shan...@intel.com>
Cc: Ingo Molnar <mi...@kernel.org>
Cc: Thomas Gleixner <t...@linutronix.de>
Cc: Peter Zijlstra <a.p.zijls...@chello.nl>
Cc: Andrew Morton <a...@linux-foundation.org>
Cc: Dave Hansen <dave.han...@intel.com>
---
 arch/x86/include/asm/tlbflush.h | 7 ++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index 84137c22fdfa..42e040859067 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -131,7 +131,12 @@ static inline unsigned long build_cr3(pgd_t *pgd, u16 asid)
 static inline unsigned long build_cr3_noflush(pgd_t *pgd, u16 asid)
 {
 	VM_WARN_ON_ONCE(asid > MAX_ASID_AVAILABLE);
-	VM_WARN_ON_ONCE(!this_cpu_has(X86_FEATURE_PCID));
+	/*
+	 * Use boot_cpu_has() instead of this_cpu_has() as this function
+	 * might be called during early boot. This should work even after
+	 * boot because all cpu's have same capabilities anyways.
+	 */
+	VM_WARN_ON_ONCE(!boot_cpu_has(X86_FEATURE_PCID));
 	return __sme_pa(pgd) | kern_pcid(asid) | CR3_NOFLUSH;
 }
 
--
2.7.4
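The ordering problem described in this commit message can be shown with a small user-space model. This is a hedged sketch, not kernel code: it assumes, as the message states, that boot-CPU data is available early while per-CPU data is populated only once SMP is up. All `demo_` names and the feature bit value are hypothetical and do not reflect the kernel's real data layout.

```c
#include <assert.h>
#include <stdbool.h>

#define DEMO_NR_CPUS		4
#define DEMO_FEATURE_PCID	(1u << 0)	/* hypothetical bit, not the kernel's encoding */

struct demo_cpuinfo {
	unsigned int caps;
};

/* Filled in very early during boot. */
static struct demo_cpuinfo demo_boot_cpu_data;
/* Filled in only when SMP bring-up runs, i.e. late in boot. */
static struct demo_cpuinfo demo_cpu_info[DEMO_NR_CPUS];

static void demo_early_init(void)
{
	demo_boot_cpu_data.caps = DEMO_FEATURE_PCID;
}

static void demo_smp_init(void)
{
	int cpu;

	/* Model assumption: every CPU has the boot CPU's capabilities. */
	for (cpu = 0; cpu < DEMO_NR_CPUS; cpu++)
		demo_cpu_info[cpu] = demo_boot_cpu_data;
}

static bool demo_boot_cpu_has(unsigned int feature)
{
	return (demo_boot_cpu_data.caps & feature) != 0;
}

static bool demo_this_cpu_has(int cpu, unsigned int feature)
{
	assert(cpu >= 0 && cpu < DEMO_NR_CPUS);
	return (demo_cpu_info[cpu].caps & feature) != 0;
}
```

In this model, a check made between `demo_early_init()` and `demo_smp_init()` succeeds through `demo_boot_cpu_has()` but fails through `demo_this_cpu_has()`, which mirrors the spurious warning seen when efi_enter_virtual_mode() switches page tables during early boot.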
[PATCH V2 1/3] x86/efi: Call efi_delete_dummy_variable() during efi subsystem initialization
From: Sai Praneeth <sai.praneeth.prak...@intel.com>

Invoking efi_runtime_services() through efi_workqueue means all accesses to efi_runtime_services() should be done after efi_rts_wq has been created. efi_delete_dummy_variable() calls set_variable(), hence efi_delete_dummy_variable() should be called after efi_rts_wq has been created.

efi_delete_dummy_variable() is called from efi_enter_virtual_mode() which is early in the boot phase (efi_rts_wq isn't created yet), so call efi_delete_dummy_variable() later in the boot phase i.e. while initializing efi subsystem. In the next patch, this is the place where we create efi_rts_wq and all the efi_runtime_services() will be called using efi_rts_wq.

Signed-off-by: Sai Praneeth Prakhya <sai.praneeth.prak...@intel.com>
Suggested-by: Andy Lutomirski <l...@kernel.org>
Cc: Lee, Chun-Yi <j...@suse.com>
Cc: Borislav Petkov <b...@alien8.de>
Cc: Tony Luck <tony.l...@intel.com>
Cc: Will Deacon <will.dea...@arm.com>
Cc: Dave Hansen <dave.han...@intel.com>
Cc: Mark Rutland <mark.rutl...@arm.com>
Cc: Bhupesh Sharma <bhsha...@redhat.com>
Cc: Ricardo Neri <ricardo.n...@intel.com>
Cc: Ravi Shankar <ravi.v.shan...@intel.com>
Cc: Matt Fleming <m...@codeblueprint.co.uk>
Cc: Peter Zijlstra <peter.zijls...@intel.com>
Cc: Ard Biesheuvel <ard.biesheu...@linaro.org>
Cc: Dan Williams <dan.j.willi...@intel.com>
---
 arch/x86/include/asm/efi.h | 1 -
 arch/x86/platform/efi/efi.c | 6 --
 drivers/firmware/efi/efi.c | 6 ++
 include/linux/efi.h | 3 +++
 4 files changed, 9 insertions(+), 7 deletions(-)

diff --git a/arch/x86/include/asm/efi.h b/arch/x86/include/asm/efi.h
index a399c1ebf6f0..43009e3f821b 100644
--- a/arch/x86/include/asm/efi.h
+++ b/arch/x86/include/asm/efi.h
@@ -143,7 +143,6 @@ extern void __init efi_runtime_update_mappings(void);
 extern void __init efi_dump_pagetable(void);
 extern void __init efi_apply_memmap_quirks(void);
 extern int __init efi_reuse_config(u64 tables, int nr_tables);
-extern void efi_delete_dummy_variable(void);
 
 struct efi_setup_data {
 	u64 fw_vendor;
diff --git a/arch/x86/platform/efi/efi.c b/arch/x86/platform/efi/efi.c
index 9061babfbc83..a3169d14583f 100644
--- a/arch/x86/platform/efi/efi.c
+++ b/arch/x86/platform/efi/efi.c
@@ -893,9 +893,6 @@ static void __init kexec_enter_virtual_mode(void)
 
 	if (efi_enabled(EFI_OLD_MEMMAP) && (__supported_pte_mask & _PAGE_NX))
 		runtime_code_page_mkexec();
-
-	/* clean DUMMY object */
-	efi_delete_dummy_variable();
 #endif
 }
 
@@ -1015,9 +1012,6 @@ static void __init __efi_enter_virtual_mode(void)
 	 * necessary relocation fixups for the new virtual addresses.
 	 */
 	efi_runtime_update_mappings();
-
-	/* clean DUMMY object */
-	efi_delete_dummy_variable();
 }
 
 void __init efi_enter_virtual_mode(void)
diff --git a/drivers/firmware/efi/efi.c b/drivers/firmware/efi/efi.c
index cd42f66a7c85..838b8efe639c 100644
--- a/drivers/firmware/efi/efi.c
+++ b/drivers/firmware/efi/efi.c
@@ -328,6 +328,12 @@ static int __init efisubsys_init(void)
 	if (!efi_enabled(EFI_BOOT))
 		return 0;
 
+	/*
+	 * Clean DUMMY object calls EFI Runtime Service, set_variable(), so
+	 * it should be invoked only after efi_rts_workqueue is ready.
+	 */
+	efi_delete_dummy_variable();
+
 	/* We register the efi directory at /sys/firmware/efi */
 	efi_kobj = kobject_create_and_add("efi", firmware_kobj);
 	if (!efi_kobj) {
diff --git a/include/linux/efi.h b/include/linux/efi.h
index f5083aa72eae..c4efb3ef0dfa 100644
--- a/include/linux/efi.h
+++ b/include/linux/efi.h
@@ -992,6 +992,7 @@ extern efi_status_t efi_query_variable_store(u32 attributes,
 					     unsigned long size,
 					     bool nonblocking);
 extern void efi_find_mirror(void);
+extern void efi_delete_dummy_variable(void);
 #else
 static inline void efi_late_init(void) {}
 static inline void efi_free_boot_services(void) {}
@@ -1002,6 +1003,8 @@ static inline efi_status_t efi_query_variable_store(u32 attributes,
 {
 	return EFI_SUCCESS;
 }
+
+static inline void efi_delete_dummy_variable(void) {}
 #endif
 
 extern void __iomem *efi_lookup_mapped_addr(u64 phys_addr);
--
2.7.4
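The include/linux/efi.h hunk above relies on a common C idiom: when a feature is compiled out, an empty static inline stub with the same name lets generic code call the function unconditionally. A minimal sketch of that idiom follows; the macro `DEMO_CONFIG_X86` and all `demo_` names are hypothetical and stand in for whatever condition guards the real declaration, which is not visible in the hunk.

```c
/* Counts how often the real implementation ran (for illustration only). */
static int demo_deleted;

#ifdef DEMO_CONFIG_X86
/* Real implementation, built only when the hypothetical option is set. */
static void demo_delete_dummy_variable(void)
{
	demo_deleted++;
}
#else
/* Empty stub: callers compile and link, and the call costs nothing. */
static inline void demo_delete_dummy_variable(void) {}
#endif

/* Generic init code calls the function without any #ifdef of its own. */
static int demo_subsys_init(void)
{
	demo_delete_dummy_variable();
	return 0;
}
```

Compiled without `-DDEMO_CONFIG_X86`, `demo_subsys_init()` still succeeds and `demo_deleted` stays at zero. The design choice is to keep the conditional in one header rather than at every call site.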
[PATCH V2 0/3] Use efi_rts_workqueue to invoke EFI Runtime Services
From: Sai Praneeth <sai.praneeth.prak...@intel.com>

This patch set is an outcome of the discussion at https://lkml.org/lkml/2017/8/21/607

Presently, efi_runtime_services() are executed by firmware in process context. To execute efi_runtime_service(), kernel switches the page directory from swapper_pgd to efi_pgd. However, efi_pgd doesn't have any user space mappings. A potential issue could be, for instance, an NMI interrupt (like perf) trying to profile some user data while in efi_pgd.

A solution for this issue could be to use kthread to run efi_runtime_service(). When a user/kernel thread requests to execute efi_runtime_service(), kernel off-loads this work to kthread which in turn uses efi_pgd. Anything that tries to touch user space addresses while in kthread is terminally broken. This patch set adds support to the efi subsystem to handle all calls to efi_runtime_services() using a work queue (which in turn uses kthread).

Implementation summary:
-----------------------
1. When a user/kernel thread requests to execute efi_runtime_service(), enqueue work to a work queue, efi_rts_workqueue.
2. The caller thread waits until the work is finished because it's dependent on the return status of efi_runtime_service() and, in specific cases, the arguments populated by efi_runtime_service(). Some efi_runtime_services() take a pointer to a buffer as an argument and fill up the buffer with the requested data, for instance, efi_get_variable() and efi_get_next_variable(). Hence, the caller process cannot just post the work and get going; it has to wait for results from firmware.

Caveat: efi_rts_workqueue shouldn't be used to run efi_runtime_services() while in atomic context, because the caller thread might sleep. Presently, pstore code doesn't use efi_rts_workqueue.

Tested using LUV (Linux UEFI Validation) for x86_64 and x86_32. Builds fine for arm and arm64. I would appreciate it if someone could test the patches on real ARM/ARM64 machines.
LUV: https://01.org/linux-uefi-validation

Thanks to Ricardo and Dan for initial reviews and suggestions. Please feel free to pour in your comments and concerns.

Note: Patches are based on Linus's kernel v4.16-rc4

Changes from V1 to V2:
----------------------
1. Remove unnecessary include of asm/efi.h file - Fixes build error on ia64, reported by 0-day
2. Use enum to identify efi_runtime_services()
3. Use alloc_ordered_workqueue() to create efi_rts_wq as create_workqueue() is scheduled for deprecation.
4. Make efi_call_rts() static, as it has no callers outside runtime-wrappers.c
5. Use BUG(), when we are unable to queue work or unable to identify requested efi_runtime_service() - Because these two situations should *never* happen.

Sai Praneeth (3):
  x86/efi: Call efi_delete_dummy_variable() during efi subsystem initialization
  efi: Introduce efi_rts_workqueue and some infrastructure to invoke all efi_runtime_services()
  efi: Use efi_rts_workqueue to invoke EFI Runtime Services

 arch/x86/include/asm/efi.h | 1 -
 arch/x86/platform/efi/efi.c | 6 -
 drivers/firmware/efi/efi.c | 21 +++
 drivers/firmware/efi/runtime-wrappers.c | 229 +---
 include/linux/efi.h | 23
 5 files changed, 253 insertions(+), 27 deletions(-)

Signed-off-by: Sai Praneeth Prakhya <sai.praneeth.prak...@intel.com>
Suggested-by: Andy Lutomirski <l...@kernel.org>
Cc: Lee, Chun-Yi <j...@suse.com>
Cc: Borislav Petkov <b...@alien8.de>
Cc: Tony Luck <tony.l...@intel.com>
Cc: Will Deacon <will.dea...@arm.com>
Cc: Dave Hansen <dave.han...@intel.com>
Cc: Mark Rutland <mark.rutl...@arm.com>
Cc: Bhupesh Sharma <bhsha...@redhat.com>
Cc: Ricardo Neri <ricardo.n...@intel.com>
Cc: Ravi Shankar <ravi.v.shan...@intel.com>
Cc: Matt Fleming <m...@codeblueprint.co.uk>
Cc: Peter Zijlstra <peter.zijls...@intel.com>
Cc: Ard Biesheuvel <ard.biesheu...@linaro.org>
Cc: Dan Williams <dan.j.willi...@intel.com>
--
2.7.4
[PATCH V2 0/3] Use efi_rts_workqueue to invoke EFI Runtime Services
From: Sai Praneeth

This patch set is an outcome of the discussion at
https://lkml.org/lkml/2017/8/21/607

Presently, efi_runtime_services() are executed by firmware in process
context. To execute efi_runtime_service(), kernel switches the page
directory from swapper_pgd to efi_pgd. However, efi_pgd doesn't have any
user space mappings. A potential issue could be, for instance, an NMI
interrupt (like perf) trying to profile some user data while in efi_pgd.

A solution for this issue could be to use kthread to run
efi_runtime_service(). When a user/kernel thread requests to execute
efi_runtime_service(), kernel off-loads this work to kthread which in
turn uses efi_pgd. Anything that tries to touch user space addresses
while in kthread is terminally broken. This patch set adds support to
the efi subsystem to handle all calls to efi_runtime_services() using a
work queue (which in turn uses kthread).

Implementation summary:
---
1. When a user/kernel thread requests to execute efi_runtime_service(),
   enqueue work to a work queue, efi_rts_workqueue.
2. The caller thread waits until the work is finished because it's
   dependent on the return status of efi_runtime_service() and, in
   specific cases, the arguments populated by efi_runtime_service().
   Some efi_runtime_services() take a pointer to a buffer as an argument
   and fill up the buffer with the requested data, for instance
   efi_get_variable() and efi_get_next_variable(). Hence, the caller
   process cannot just post the work and get going; it has to wait for
   results from firmware.

Caveat: efi_rts_workqueue to run efi_runtime_services() shouldn't be
used while in atomic context, because the caller thread might sleep.
Presently, pstore code doesn't use efi_rts_workqueue.

Tested using LUV (Linux UEFI Validation) for x86_64 and x86_32. Builds
fine for arm and arm64. I would appreciate it if someone could test the
patches on real ARM/ARM64 machines.

LUV: https://01.org/linux-uefi-validation

Thanks to Ricardo and Dan for initial reviews and suggestions. Please
feel free to pour in your comments and concerns.

Note: Patches are based on Linus's kernel v4.16-rc4

Changes from V1 to V2:
--
1. Remove unnecessary include of asm/efi.h file - Fixes build error on
   ia64, reported by 0-day
2. Use enum to identify efi_runtime_services()
3. Use alloc_ordered_workqueue() to create efi_rts_wq as
   create_workqueue() is scheduled for deprecation.
4. Make efi_call_rts() static, as it has no callers outside
   runtime-wrappers.c
5. Use BUG(), when we are unable to queue work or unable to identify
   requested efi_runtime_service() - Because these two situations should
   *never* happen.

Sai Praneeth (3):
  x86/efi: Call efi_delete_dummy_variable() during efi subsystem
    initialization
  efi: Introduce efi_rts_workqueue and some infrastructure to invoke all
    efi_runtime_services()
  efi: Use efi_rts_workqueue to invoke EFI Runtime Services

 arch/x86/include/asm/efi.h              | 1 -
 arch/x86/platform/efi/efi.c             | 6 -
 drivers/firmware/efi/efi.c              | 21 +++
 drivers/firmware/efi/runtime-wrappers.c | 229 +---
 include/linux/efi.h                     | 23
 5 files changed, 253 insertions(+), 27 deletions(-)

Signed-off-by: Sai Praneeth Prakhya
Suggested-by: Andy Lutomirski
Cc: Lee, Chun-Yi
Cc: Borislav Petkov
Cc: Tony Luck
Cc: Will Deacon
Cc: Dave Hansen
Cc: Mark Rutland
Cc: Bhupesh Sharma
Cc: Ricardo Neri
Cc: Ravi Shankar
Cc: Matt Fleming
Cc: Peter Zijlstra
Cc: Ard Biesheuvel
Cc: Dan Williams
-- 
2.7.4
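The "post the work, then wait" flow from the implementation summary can be modelled in plain userspace C. This is only a sketch under loud assumptions: every name prefixed with demo_ is hypothetical, a single-slot queue stands in for the ordered workqueue, and "waiting" is modelled by running the queued item synchronously rather than by a kernel thread.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical work item; not the efi_runtime_work from the patches. */
struct demo_work {
	int done;		/* set once the stand-in service has run */
	int status;		/* return status the caller depends on */
	char *buf;		/* caller-owned buffer the service fills in */
	size_t buf_len;
};

/* Single-slot queue standing in for an ordered workqueue. */
static struct demo_work *demo_pending;

/* Stand-in for a runtime service that fills a caller buffer. */
static void demo_service(struct demo_work *w)
{
	const char *data = "value";

	if (w->buf_len < strlen(data) + 1) {
		w->status = -1;		/* buffer too small */
	} else {
		strcpy(w->buf, data);
		w->status = 0;
	}
	w->done = 1;
}

/* Post the work; returns 0 if another item is already queued. */
static int demo_queue_work(struct demo_work *w)
{
	if (demo_pending)
		return 0;
	w->done = 0;
	demo_pending = w;
	return 1;
}

/* Run whatever is queued; models the caller blocking until completion. */
static void demo_flush(void)
{
	if (demo_pending) {
		demo_service(demo_pending);
		demo_pending = NULL;
	}
}
```

Until demo_flush() has run, neither the status nor the buffer contents are valid, which is the reason the caller cannot simply post the work and continue.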
[PATCH V2 2/3] efi: Introduce efi_rts_workqueue and some infrastructure to invoke all efi_runtime_services()
From: Sai Praneeth

When a process requests the kernel to execute any efi_runtime_service(),
the requested efi_runtime_service (represented as an identifier) and its
arguments are packed into a struct named efi_runtime_work and queued
onto a work queue named efi_rts_wq. The caller then waits until the work
is completed.

Introduce some infrastructure:
1. Creating workqueue named efi_rts_wq
2. A macro (efi_queue_work()) that
   a. populates efi_runtime_work
   b. queues work onto efi_rts_wq and
   c. waits until worker thread returns

The caller thread has to wait until the worker thread returns, because
it's dependent on the return status of efi_runtime_service() and, in
specific cases, the arguments populated by efi_runtime_service(). Some
efi_runtime_services() take a pointer to a buffer as an argument and
fill up the buffer with the requested data, for instance
efi_get_variable() and efi_get_next_variable(). Hence, the caller
process cannot just post the work and get going.

Some facts about efi_runtime_services():
1. A quick look at all the efi_runtime_services() shows that any
   efi_runtime_service() has five or fewer arguments.
2. An argument of efi_runtime_service() can be a value (of any type) or
   a pointer (of any type).
Hence, efi_runtime_work has five void pointers to store these arguments.

Signed-off-by: Sai Praneeth Prakhya
Suggested-by: Andy Lutomirski
Cc: Lee, Chun-Yi
Cc: Borislav Petkov
Cc: Tony Luck
Cc: Will Deacon
Cc: Dave Hansen
Cc: Mark Rutland
Cc: Bhupesh Sharma
Cc: Ricardo Neri
Cc: Ravi Shankar
Cc: Matt Fleming
Cc: Peter Zijlstra
Cc: Ard Biesheuvel
Cc: Dan Williams
---
 drivers/firmware/efi/efi.c              | 15
 drivers/firmware/efi/runtime-wrappers.c | 61 +
 include/linux/efi.h                     | 20 +++
 3 files changed, 96 insertions(+)

diff --git a/drivers/firmware/efi/efi.c b/drivers/firmware/efi/efi.c
index 838b8efe639c..04b46c62f3ce 100644
--- a/drivers/firmware/efi/efi.c
+++ b/drivers/firmware/efi/efi.c
@@ -75,6 +75,8 @@ static unsigned long *efi_tables[] = {
 	_attr_table,
 };
 
+struct workqueue_struct *efi_rts_wq;
+
 static bool disable_runtime;
 static int __init setup_noefi(char *arg)
 {
@@ -329,6 +331,19 @@ static int __init efisubsys_init(void)
 		return 0;
 
 	/*
+	 * Since we process only one efi_runtime_service() at a time, an
+	 * ordered workqueue (which creates only one execution context)
+	 * should suffice all our needs.
+	 */
+	efi_rts_wq = alloc_ordered_workqueue("efi_rts_workqueue", 0);
+	if (!efi_rts_wq) {
+		pr_err("Failed to create efi_rts_workqueue, EFI runtime services "
+		       "disabled.\n");
+		clear_bit(EFI_RUNTIME_SERVICES, );
+		return 0;
+	}
+
+	/*
 	 * Clean DUMMY object calls EFI Runtime Service, set_variable(), so
 	 * it should be invoked only after efi_rts_workqueue is ready.
 	 */
diff --git a/drivers/firmware/efi/runtime-wrappers.c b/drivers/firmware/efi/runtime-wrappers.c
index ae54870b2788..649763171439 100644
--- a/drivers/firmware/efi/runtime-wrappers.c
+++ b/drivers/firmware/efi/runtime-wrappers.c
@@ -1,6 +1,14 @@
 /*
  * runtime-wrappers.c - Runtime Services function call wrappers
  *
+ * Implementation summary:
+ * ---
+ * 1. When user/kernel thread requests to execute efi_runtime_service(),
+ * enqueue work to efi_rts_workqueue.
+ * 2. Caller thread waits until the work is finished because it's
+ * dependent on the return status and execution of efi_runtime_service().
+ * For instance, get_variable() and get_next_variable().
+ *
  * Copyright (C) 2014 Linaro Ltd.
  *
  * Split off from arch/x86/platform/efi/efi.c
@@ -22,6 +30,8 @@
 #include
 #include
 #include
+#include
+
 #include
 
 /*
@@ -33,6 +43,57 @@
 #define __efi_call_virt(f, args...) \
 	__efi_call_virt_pointer(efi.systab->runtime, f, args)
 
+/* efi_runtime_service() function identifiers */
+enum {
+	GET_TIME,
+	SET_TIME,
+	GET_WAKEUP_TIME,
+	SET_WAKEUP_TIME,
+	GET_VARIABLE,
+	GET_NEXT_VARIABLE,
+	SET_VARIABLE,
+	SET_VARIABLE_NONBLOCKING,
+	QUERY_VARIABLE_INFO,
+	QUERY_VARIABLE_INFO_NONBLOCKING,
+	GET_NEXT_HIGH_MONO_COUNT,
+	RESET_SYSTEM,
+	UPDATE_CAPSULE,
+	QUERY_CAPSULE_CAPS,
+};
+
+/*
+ * efi_queue_work: Queue efi_runtime_service() and wait until it's done
+ * @rts: efi_runtime_service() function identifier
+ * @rts_arg<1-5>: efi_runtime_service() function arguments
+ *
+ * Accesses to efi_runtime_services() are serialized by a binary
+ * semaphore (efi_runtime_lock) and caller waits until the work is
+ * finished, hence _only_ one work is queued at a time and the queued
+ * work gets flushed.
+ */
+#define efi_queue_work(
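As a rough illustration of "an identifier plus five void pointers", the following userspace sketch packs a request into a struct. The definition of efi_runtime_work is not visible in this excerpt, so the layout, the pack_ prefixed names and the PACK_INIT_WORK() macro are all assumptions made for illustration, not the patch's code.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical identifiers, mirroring the idea of the enum in the patch. */
enum pack_rts_id { PACK_GET_TIME, PACK_SET_WAKEUP_TIME };

/* Assumed layout: one identifier, five untyped argument slots, a status. */
struct pack_rts_work {
	enum pack_rts_id id;
	void *arg1, *arg2, *arg3, *arg4, *arg5;
	int status;
};

/* Populate a work item; unused argument slots are passed as NULL. */
#define PACK_INIT_WORK(_w, _id, _a1, _a2, _a3, _a4, _a5)	\
do {								\
	(_w)->id = (_id);					\
	(_w)->arg1 = (_a1);					\
	(_w)->arg2 = (_a2);					\
	(_w)->arg3 = (_a3);					\
	(_w)->arg4 = (_a4);					\
	(_w)->arg5 = (_a5);					\
	(_w)->status = -1;	/* not run yet */		\
} while (0)
```

Five slots are enough only because, as the commit message notes, no runtime service takes more than five arguments; services with fewer arguments leave the remaining slots NULL.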
[PATCH V2 3/3] efi: Use efi_rts_workqueue to invoke EFI Runtime Services
From: Sai Praneeth

Presently, efi_runtime_services() are executed by firmware in process
context. To execute efi_runtime_service(), kernel switches the page
directory from swapper_pgd to efi_pgd. However, efi_pgd doesn't have any
user space mappings. A potential issue could be, for instance, an NMI
interrupt (like perf) trying to profile some user data while in efi_pgd.

A solution for this issue could be to use kthread to run
efi_runtime_service(). When a user/kernel thread requests to execute
efi_runtime_service(), kernel off-loads this work to kthread which in
turn uses efi_pgd. Anything that tries to touch user space addresses
while in kthread is terminally broken. This patch adds support to efi
subsystem to handle all calls to efi_runtime_services() using a work
queue (which in turn uses kthread).

Implementation summary:
---
1. When user/kernel thread requests to execute efi_runtime_service(),
   enqueue work to efi_rts_workqueue.
2. Caller thread waits until the work is finished because it's dependent
   on the return status of efi_runtime_service().

Semantics to pack arguments in efi_runtime_work (has void pointers):
1. If argument is a pointer (of any type), pass it as is.
2. If argument is a value (of any type), address of the value is passed.

Introduce a handler function (called efi_call_rts()) that
a. understands efi_runtime_work and
b. invokes the appropriate efi_runtime_service() with the appropriate
   arguments

Semantics followed by efi_call_rts() to understand efi_runtime_work:
1. If argument was a pointer, recast it from void pointer to original
   pointer type.
2. If argument was a value, recast it from void pointer to original
   pointer type and dereference it.

pstore writes could potentially be invoked in interrupt context and it
uses set_variable<>() and query_variable_info<>() to store logs. If we
invoke efi_runtime_services() through efi_rts_wq while in atomic
context, the kernel issues a warning ("scheduling while in atomic") and
prints a stack trace. One way to overcome this is to not make the caller
process wait for the worker thread to finish. This approach breaks
pstore i.e. the log messages aren't written to efi variables. Hence,
pstore calls efi_runtime_services() without using efi_rts_wq or in other
words efi_rts_wq will be used unconditionally for all the
efi_runtime_services() except set_variable<>() and
query_variable_info<>().

Signed-off-by: Sai Praneeth Prakhya
Suggested-by: Andy Lutomirski
Cc: Lee, Chun-Yi
Cc: Borislav Petkov
Cc: Tony Luck
Cc: Will Deacon
Cc: Dave Hansen
Cc: Mark Rutland
Cc: Bhupesh Sharma
Cc: Ricardo Neri
Cc: Ravi Shankar
Cc: Matt Fleming
Cc: Peter Zijlstra
Cc: Ard Biesheuvel
Cc: Dan Williams
---
 drivers/firmware/efi/runtime-wrappers.c | 168
 1 file changed, 148 insertions(+), 20 deletions(-)

diff --git a/drivers/firmware/efi/runtime-wrappers.c b/drivers/firmware/efi/runtime-wrappers.c
index 649763171439..eff443bf942c 100644
--- a/drivers/firmware/efi/runtime-wrappers.c
+++ b/drivers/firmware/efi/runtime-wrappers.c
@@ -151,13 +151,105 @@ void efi_call_virt_check_flags(unsigned long flags, const char *call)
  */
 static DEFINE_SEMAPHORE(efi_runtime_lock);
 
+/*
+ * Calls the appropriate efi_runtime_service() with the appropriate
+ * arguments.
+ *
+ * Semantics followed by efi_call_rts() to understand efi_runtime_work:
+ * 1. If argument was a pointer, recast it from void pointer to original
+ * pointer type.
+ * 2. If argument was a value, recast it from void pointer to original
+ * pointer type and dereference it.
+ */
+static void efi_call_rts(struct work_struct *work)
+{
+	struct efi_runtime_work *efi_rts_work;
+	void *arg1, *arg2, *arg3, *arg4, *arg5;
+	efi_status_t status = EFI_NOT_FOUND;
+
+	efi_rts_work = container_of(work, struct efi_runtime_work, work);
+	arg1 = efi_rts_work->arg1;
+	arg2 = efi_rts_work->arg2;
+	arg3 = efi_rts_work->arg3;
+	arg4 = efi_rts_work->arg4;
+	arg5 = efi_rts_work->arg5;
+
+	switch (efi_rts_work->func) {
+	case GET_TIME:
+		status = efi_call_virt(get_time, (efi_time_t *)arg1,
+				       (efi_time_cap_t *)arg2);
+		break;
+	case SET_TIME:
+		status = efi_call_virt(set_time, (efi_time_t *)arg1);
+		break;
+	case GET_WAKEUP_TIME:
+		status = efi_call_virt(get_wakeup_time, (efi_bool_t *)arg1,
+				       (efi_bool_t *)arg2, (efi_time_t *)arg3);
+		break;
+	case SET_WAKEUP_TIME:
+		status = efi_call_virt(set_wakeup_time, *(efi_bool_t *)arg1,
+				       (efi_time_t *)arg2);
+		break;
+	case GET_VARIABLE:
+		status = efi_call_virt(get_variable, (efi_char16_t *)arg1,
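The two decoding rules (recast a pointer argument; recast and dereference a value argument) can be tried out in a small userspace sketch. The disp_ prefixed names and the two stand-in services are hypothetical; only the recast-versus-dereference distinction is taken from the commit message above.

```c
#include <assert.h>
#include <stddef.h>

enum disp_id { DISP_GET_COUNT, DISP_SET_ENABLED };

/* Reduced, hypothetical work item: two slots suffice for this sketch. */
struct disp_work {
	enum disp_id id;
	void *arg1, *arg2;
	int status;
};

static int disp_enabled;	/* state touched by the stand-in services */

/* Stand-in service taking a pointer: writes its result through it. */
static int disp_get_count(unsigned int *count)
{
	*count = 7;
	return 0;
}

/* Stand-in service taking a value. */
static int disp_set_enabled(int enabled)
{
	disp_enabled = enabled;
	return 0;
}

/* Decode the void pointers according to the function identifier. */
static void disp_call(struct disp_work *w)
{
	switch (w->id) {
	case DISP_GET_COUNT:
		/* pointer argument: recast to the original pointer type */
		w->status = disp_get_count((unsigned int *)w->arg1);
		break;
	case DISP_SET_ENABLED:
		/* value argument: recast, then dereference */
		w->status = disp_set_enabled(*(int *)w->arg1);
		break;
	default:
		w->status = -1;	/* unknown identifier */
		break;
	}
}
```

Because a value is passed by address, the variable holding it has to stay alive until the handler has run; that holds here only because the caller waits for completion.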
[PATCH V1 1/3] x86/efi: Call efi_delete_dummy_variable() during efi subsystem initialization
From: Sai Praneeth <sai.praneeth.prak...@intel.com>

Invoking efi_runtime_services() through efi_workqueue means all accesses
to efi_runtime_services() should be done after efi_rts_wq has been
created. efi_delete_dummy_variable() calls set_variable(), hence
efi_delete_dummy_variable() should be called after efi_rts_wq has been
created.

efi_delete_dummy_variable() is called from efi_enter_virtual_mode()
which is early in the boot phase (efi_rts_wq isn't created yet), so call
efi_delete_dummy_variable() later in the boot phase i.e. while
initializing efi subsystem. In the next patch, this is the place where
we create efi_rts_wq and all the efi_runtime_services() will be called
using efi_rts_wq.

Signed-off-by: Sai Praneeth Prakhya <sai.praneeth.prak...@intel.com>
Suggested-by: Andy Lutomirski <l...@kernel.org>
Cc: Lee, Chun-Yi <j...@suse.com>
Cc: Borislav Petkov <b...@alien8.de>
Cc: Tony Luck <tony.l...@intel.com>
Cc: Will Deacon <will.dea...@arm.com>
Cc: Dave Hansen <dave.han...@intel.com>
Cc: Mark Rutland <mark.rutl...@arm.com>
Cc: Bhupesh Sharma <bhsha...@redhat.com>
Cc: Ricardo Neri <ricardo.n...@intel.com>
Cc: Ravi Shankar <ravi.v.shan...@intel.com>
Cc: Matt Fleming <m...@codeblueprint.co.uk>
Cc: Peter Zijlstra <peter.zijls...@intel.com>
Cc: Ard Biesheuvel <ard.biesheu...@linaro.org>
Cc: Dan Williams <dan.j.willi...@intel.com>
---
 arch/x86/include/asm/efi.h  | 1 -
 arch/x86/platform/efi/efi.c | 6 --
 drivers/firmware/efi/efi.c  | 7 +++
 include/linux/efi.h         | 3 +++
 4 files changed, 10 insertions(+), 7 deletions(-)

diff --git a/arch/x86/include/asm/efi.h b/arch/x86/include/asm/efi.h
index 85f6ccb80b91..34b03440a80f 100644
--- a/arch/x86/include/asm/efi.h
+++ b/arch/x86/include/asm/efi.h
@@ -130,7 +130,6 @@ extern void __init efi_runtime_update_mappings(void);
 extern void __init efi_dump_pagetable(void);
 extern void __init efi_apply_memmap_quirks(void);
 extern int __init efi_reuse_config(u64 tables, int nr_tables);
-extern void efi_delete_dummy_variable(void);
 
 struct efi_setup_data {
 	u64 fw_vendor;
diff --git a/arch/x86/platform/efi/efi.c b/arch/x86/platform/efi/efi.c
index 9061babfbc83..a3169d14583f 100644
--- a/arch/x86/platform/efi/efi.c
+++ b/arch/x86/platform/efi/efi.c
@@ -893,9 +893,6 @@ static void __init kexec_enter_virtual_mode(void)
 
 	if (efi_enabled(EFI_OLD_MEMMAP) && (__supported_pte_mask & _PAGE_NX))
 		runtime_code_page_mkexec();
-
-	/* clean DUMMY object */
-	efi_delete_dummy_variable();
 #endif
 }
 
@@ -1015,9 +1012,6 @@ static void __init __efi_enter_virtual_mode(void)
 	 * necessary relocation fixups for the new virtual addresses.
 	 */
 	efi_runtime_update_mappings();
-
-	/* clean DUMMY object */
-	efi_delete_dummy_variable();
 }
 
 void __init efi_enter_virtual_mode(void)
diff --git a/drivers/firmware/efi/efi.c b/drivers/firmware/efi/efi.c
index cd42f66a7c85..ac5db5f8dbbf 100644
--- a/drivers/firmware/efi/efi.c
+++ b/drivers/firmware/efi/efi.c
@@ -33,6 +33,7 @@
 #include
 #include
+#include
 
 struct efi __read_mostly efi = {
 	.mps			= EFI_INVALID_TABLE_ADDR,
@@ -328,6 +329,12 @@ static int __init efisubsys_init(void)
 	if (!efi_enabled(EFI_BOOT))
 		return 0;
 
+	/*
+	 * Clean DUMMY object calls EFI Runtime Service, set_variable(), so
+	 * it should be invoked only after efi_rts_workqueue is ready.
+	 */
+	efi_delete_dummy_variable();
+
 	/* We register the efi directory at /sys/firmware/efi */
 	efi_kobj = kobject_create_and_add("efi", firmware_kobj);
 	if (!efi_kobj) {
diff --git a/include/linux/efi.h b/include/linux/efi.h
index f5083aa72eae..c4efb3ef0dfa 100644
--- a/include/linux/efi.h
+++ b/include/linux/efi.h
@@ -992,6 +992,7 @@ extern efi_status_t efi_query_variable_store(u32 attributes,
 					     unsigned long size,
 					     bool nonblocking);
 extern void efi_find_mirror(void);
+extern void efi_delete_dummy_variable(void);
 #else
 static inline void efi_late_init(void) {}
 static inline void efi_free_boot_services(void) {}
@@ -1002,6 +1003,8 @@ static inline efi_status_t efi_query_variable_store(u32 attributes,
 {
 	return EFI_SUCCESS;
 }
+
+static inline void efi_delete_dummy_variable(void) {}
 #endif
 
 extern void __iomem *efi_lookup_mapped_addr(u64 phys_addr);
-- 
2.1.4
[PATCH V1 3/3] efi: Use efi_rts_workqueue to invoke EFI Runtime Services
From: Sai Praneeth <sai.praneeth.prak...@intel.com>

Presently, efi_runtime_services() are executed by firmware in process
context. To execute efi_runtime_service(), kernel switches the page
directory from swapper_pgd to efi_pgd. However, efi_pgd doesn't have any
user space mappings. A potential issue could be, for instance, an NMI
interrupt (like perf) trying to profile some user data while in efi_pgd.

A solution for this issue could be to use kthread to run
efi_runtime_service(). When a user/kernel thread requests to execute
efi_runtime_service(), kernel off-loads this work to kthread which in
turn uses efi_pgd. Anything that tries to touch user space addresses
while in kthread is terminally broken. This patch adds support to efi
subsystem to handle all calls to efi_runtime_services() using a work
queue (which in turn uses kthread).

Implementation summary:
---
1. When user/kernel thread requests to execute efi_runtime_service(),
   enqueue work to efi_rts_workqueue.
2. Caller thread waits until the work is finished because it's dependent
   on the return status of efi_runtime_service().

pstore writes could potentially be invoked in interrupt context and it
uses set_variable<>() and query_variable_info<>() to store logs. If we
invoke efi_runtime_services() through efi_rts_wq while in atomic
context, the kernel issues a warning ("scheduling while in atomic") and
prints a stack trace. One way to overcome this is to not make the caller
process wait for the worker thread to finish. This approach breaks
pstore i.e. the log messages aren't written to efi variables. Hence,
pstore calls efi_runtime_services() without using efi_rts_wq or in other
words efi_rts_wq will be used unconditionally for all the
efi_runtime_services() except set_variable<>() and
query_variable_info<>().

Semantics to pack arguments in efi_runtime_work (has void pointers):
1. If argument is a pointer (of any type), pass it as is.
2. If argument is a value (of any type), address of the value is passed.

Signed-off-by: Sai Praneeth Prakhya <sai.praneeth.prak...@intel.com>
Suggested-by: Andy Lutomirski <l...@kernel.org>
Cc: Lee, Chun-Yi <j...@suse.com>
Cc: Borislav Petkov <b...@alien8.de>
Cc: Tony Luck <tony.l...@intel.com>
Cc: Will Deacon <will.dea...@arm.com>
Cc: Dave Hansen <dave.han...@intel.com>
Cc: Mark Rutland <mark.rutl...@arm.com>
Cc: Bhupesh Sharma <bhsha...@redhat.com>
Cc: Ricardo Neri <ricardo.n...@intel.com>
Cc: Ravi Shankar <ravi.v.shan...@intel.com>
Cc: Matt Fleming <m...@codeblueprint.co.uk>
Cc: Peter Zijlstra <peter.zijls...@intel.com>
Cc: Ard Biesheuvel <ard.biesheu...@linaro.org>
Cc: Dan Williams <dan.j.willi...@intel.com>
---
 drivers/firmware/efi/runtime-wrappers.c | 86 +
 1 file changed, 66 insertions(+), 20 deletions(-)

diff --git a/drivers/firmware/efi/runtime-wrappers.c b/drivers/firmware/efi/runtime-wrappers.c
index 5cdb787da5d3..531d077aac70 100644
--- a/drivers/firmware/efi/runtime-wrappers.c
+++ b/drivers/firmware/efi/runtime-wrappers.c
@@ -68,6 +68,16 @@
  * semaphore (efi_runtime_lock) and caller waits until the work is
  * finished, hence _only_ one work is queued at a time. So, queue_work()
  * should never fail.
+ *
+ * efi_rts_workqueue to run efi_runtime_services() shouldn't be used
+ * while in atomic, because caller thread might sleep. pstore writes
+ * could potentially be invoked in interrupt context and it uses
+ * set_variable<>() and query_variable_info<>(), so pstore code doesn't
+ * use efi_rts_workqueue.
+ *
+ * Semantics that caller function should follow while passing arguments:
+ * 1. If argument is a pointer (of any type), pass it as is.
+ * 2. If argument is a value (of any type), address of the value is passed.
  */
 #define efi_queue_work(_rts, _arg1, _arg2, _arg3, _arg4, _arg5) \
 ({ \
@@ -150,7 +160,7 @@ static efi_status_t virt_efi_get_time(efi_time_t *tm, efi_time_cap_t *tc)
 
 	if (down_interruptible(&efi_runtime_lock))
 		return EFI_ABORTED;
-	status = efi_call_virt(get_time, tm, tc);
+	status = efi_queue_work(GET_TIME, tm, tc, NULL, NULL, NULL);
 	up(&efi_runtime_lock);
 	return status;
 }
@@ -161,7 +171,7 @@ static efi_status_t virt_efi_set_time(efi_time_t *tm)
 
 	if (down_interruptible(&efi_runtime_lock))
 		return EFI_ABORTED;
-	status = efi_call_virt(set_time, tm);
+	status = efi_queue_work(SET_TIME, tm, NULL, NULL, NULL, NULL);
 	up(&efi_runtime_lock);
 	return status;
 }
@@ -174,7 +184,8 @@ static efi_status_t virt_efi_get_wakeup_time(efi_bool_t *enabled,
 
 	if (down_interruptible(&efi_runtime_lock))
 		return EFI_ABORTED;
-	status = efi_call_virt(get_wakeup_time, enabled, pending, tm);
+	status = efi_queue_work(GET_
[PATCH V1 3/3] efi: Use efi_rts_workqueue to invoke EFI Runtime Services
From: Sai Praneeth <sai.praneeth.prak...@intel.com>

Presently, efi_runtime_services() are executed by firmware in process
context. To execute efi_runtime_service(), kernel switches the page
directory from swapper_pgd to efi_pgd. However, efi_pgd doesn't have any
user space mappings. A potential issue could be, for instance, an NMI
interrupt (like perf) trying to profile some user data while in efi_pgd.

A solution for this issue could be to use kthread to run
efi_runtime_service(). When a user/kernel thread requests to execute
efi_runtime_service(), kernel off-loads this work to kthread which in
turn uses efi_pgd. Anything that tries to touch user space addresses
while in kthread is terminally broken. This patch adds support to efi
subsystem to handle all calls to efi_runtime_services() using a work
queue (which in turn uses kthread).

Implementation summary:
-----------------------
1. When user/kernel thread requests to execute efi_runtime_service(),
   enqueue work to efi_rts_workqueue.
2. Caller thread waits until the work is finished because it's
   dependent on the return status of efi_runtime_service().

pstore writes could potentially be invoked in interrupt context and it
uses set_variable<>() and query_variable_info<>() to store logs. If we
invoke efi_runtime_services() through efi_rts_wq while in atomic
context, the kernel issues a warning ("scheduling while atomic") and
prints a stack trace. One way to overcome this is to not make the caller
process wait for the worker thread to finish. This approach breaks
pstore, i.e. the log messages aren't written to efi variables. Hence,
pstore calls efi_runtime_services() without using efi_rts_wq, or in
other words efi_rts_wq will be used unconditionally for all the
efi_runtime_services() except set_variable<>() and
query_variable_info<>().

Semantics to pack arguments in efi_runtime_work (has void pointers):
1. If argument is a pointer (of any type), pass it as is.
2. If argument is a value (of any type), address of the value is passed.

Signed-off-by: Sai Praneeth Prakhya <sai.praneeth.prak...@intel.com>
Suggested-by: Andy Lutomirski <l...@kernel.org>
Cc: Lee, Chun-Yi <j...@suse.com>
Cc: Borislav Petkov <b...@alien8.de>
Cc: Tony Luck <tony.l...@intel.com>
Cc: Will Deacon <will.dea...@arm.com>
Cc: Dave Hansen <dave.han...@intel.com>
Cc: Mark Rutland <mark.rutl...@arm.com>
Cc: Bhupesh Sharma <bhsha...@redhat.com>
Cc: Ricardo Neri <ricardo.n...@intel.com>
Cc: Ravi Shankar <ravi.v.shan...@intel.com>
Cc: Matt Fleming <m...@codeblueprint.co.uk>
Cc: Peter Zijlstra <peter.zijls...@intel.com>
Cc: Ard Biesheuvel <ard.biesheu...@linaro.org>
Cc: Dan Williams <dan.j.willi...@intel.com>
---
 drivers/firmware/efi/runtime-wrappers.c | 86 +
 1 file changed, 66 insertions(+), 20 deletions(-)

diff --git a/drivers/firmware/efi/runtime-wrappers.c b/drivers/firmware/efi/runtime-wrappers.c
index 5cdb787da5d3..531d077aac70 100644
--- a/drivers/firmware/efi/runtime-wrappers.c
+++ b/drivers/firmware/efi/runtime-wrappers.c
@@ -68,6 +68,16 @@
  * semaphore (efi_runtime_lock) and caller waits until the work is
  * finished, hence _only_ one work is queued at a time. So, queue_work()
  * should never fail.
+ *
+ * efi_rts_workqueue to run efi_runtime_services() shouldn't be used
+ * while in atomic, because caller thread might sleep. pstore writes
+ * could potentially be invoked in interrupt context and it uses
+ * set_variable<>() and query_variable_info<>(), so pstore code doesn't
+ * use efi_rts_workqueue.
+ *
+ * Semantics that caller function should follow while passing arguments:
+ * 1. If argument is a pointer (of any type), pass it as is.
+ * 2. If argument is a value (of any type), address of the value is passed.
  */
 #define efi_queue_work(_rts, _arg1, _arg2, _arg3, _arg4, _arg5)	\
 ({	\
@@ -150,7 +160,7 @@ static efi_status_t virt_efi_get_time(efi_time_t *tm, efi_time_cap_t *tc)
 	if (down_interruptible(&efi_runtime_lock))
 		return EFI_ABORTED;
-	status = efi_call_virt(get_time, tm, tc);
+	status = efi_queue_work(GET_TIME, tm, tc, NULL, NULL, NULL);
 	up(&efi_runtime_lock);
 	return status;
 }
@@ -161,7 +171,7 @@ static efi_status_t virt_efi_set_time(efi_time_t *tm)
 	if (down_interruptible(&efi_runtime_lock))
 		return EFI_ABORTED;
-	status = efi_call_virt(set_time, tm);
+	status = efi_queue_work(SET_TIME, tm, NULL, NULL, NULL, NULL);
 	up(&efi_runtime_lock);
 	return status;
 }
@@ -174,7 +184,8 @@ static efi_status_t virt_efi_get_wakeup_time(efi_bool_t *enabled,
 	if (down_interruptible(&efi_runtime_lock))
 		return EFI_ABORTED;
-	status = efi_call_virt(get_wakeup_time, enabled, pending, tm);
+	status = efi_queue_work(GET_WAKEUP_TIME, enabled, pending, tm, NULL,
+				NULL);
 	up(&efi_runtime_lock);
 	return status;
 }
@@ -185,7 +196,8 @@ static efi_status_t virt_efi_set_wakeup_time(efi_bool_t enabled, efi_time_t *tm)
 	if (down_interruptible(&efi_runtime_lock))
 		return EFI_ABORTED;
-	status = efi_call_virt(set_wakeup_time, enabled, tm);
+	status = efi_queue_work(SET_WAKEUP_TIME, &enabled, tm, NULL, NULL,
+
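The two packing rules from the commit message (pointers go in as is, values go in by address) can be tried out in user space. The following is only a sketch and not code from the patch: struct rts_work, struct wake_time, fake_set_wakeup_time() and the demo_/seen_ names are hypothetical stand-ins, and the handler runs synchronously instead of being queued.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for efi_runtime_work: five untyped argument slots. */
struct rts_work {
	void *arg[5];
};

/* Hypothetical stand-in for efi_time_t. */
struct wake_time {
	int hour;
	int minute;
};

/* What the pretend firmware call received, recorded for inspection. */
static int seen_enabled = -1;
static int seen_hour = -1;

/* Pretend runtime service with the same shape as set_wakeup_time(). */
static int fake_set_wakeup_time(unsigned char enabled, struct wake_time *tm)
{
	seen_enabled = enabled;
	seen_hour = tm->hour;
	return 0;
}

/* Handler side: recast each slot; dereference the one that carried a value. */
static int unpack_and_call(const struct rts_work *w)
{
	assert(w->arg[0] != NULL && w->arg[1] != NULL);

	unsigned char enabled = *(unsigned char *)w->arg[0]; /* rule 2: value */
	struct wake_time *tm = w->arg[1];                    /* rule 1: pointer */

	return fake_set_wakeup_time(enabled, tm);
}

/* Caller side: pack the value by address and the pointer as is. */
static int demo_set_wakeup_time(unsigned char enabled, struct wake_time *tm)
{
	struct rts_work w = { .arg = { &enabled, tm, NULL, NULL, NULL } };

	/*
	 * &enabled points into this stack frame, so the work has to be
	 * finished before this function returns. Here that holds trivially
	 * because the handler runs synchronously.
	 */
	return unpack_and_call(&w);
}
```

Calling demo_set_wakeup_time(1, &t) with t.hour set to 6 leaves seen_enabled at 1 and seen_hour at 6. The comment in the caller is the point worth noting: passing the address of a by-value parameter is only safe because the caller blocks until the work item has run.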
[PATCH V1 2/3] efi: Introduce efi_rts_workqueue and necessary infrastructure to invoke all efi_runtime_services()
From: Sai Praneeth <sai.praneeth.prak...@intel.com>

When a process requests the kernel to execute any efi_runtime_service(),
the requested efi_runtime_service (represented as an identifier) and its
arguments are packed into a struct named efi_runtime_work and queued
onto work queue named efi_rts_wq. The caller then waits until the work
is completed.

Introduce necessary infrastructure:
1. Creating workqueue named efi_rts_wq
2. A macro (efi_queue_work()) that
   a. populates efi_runtime_work
   b. queues work onto efi_rts_wq and
   c. waits until worker thread returns
3. A handler function that
   a. understands efi_runtime_work and
   b. invokes the appropriate efi_runtime_service() with the
      appropriate arguments

The caller thread has to wait until the worker thread returns, because
it's dependent on the return status of efi_runtime_service() and, in
specific cases, the arguments populated by efi_runtime_service(). Some
efi_runtime_services() take a pointer to buffer as an argument and fill
up the buffer with requested data. For instance, efi_get_variable() and
efi_get_next_variable(). Hence, caller process cannot just post the work
and get going.

Some facts about efi_runtime_services():
1. A quick look at all the efi_runtime_services() shows that any
   efi_runtime_service() has five or fewer arguments.
2. An argument of efi_runtime_service() can be a value (of any type)
   or a pointer (of any type).
Hence, efi_runtime_work has five void pointers to store these arguments.

Semantics followed by efi_call_rts() to understand efi_runtime_work:
1. If argument was a pointer, recast it from void pointer to original
   pointer type.
2. If argument was a value, recast it from void pointer to original
   pointer type and dereference it.

Signed-off-by: Sai Praneeth Prakhya <sai.praneeth.prak...@intel.com>
Suggested-by: Andy Lutomirski <l...@kernel.org>
Cc: Lee, Chun-Yi <j...@suse.com>
Cc: Borislav Petkov <b...@alien8.de>
Cc: Tony Luck <tony.l...@intel.com>
Cc: Will Deacon <will.dea...@arm.com>
Cc: Dave Hansen <dave.han...@intel.com>
Cc: Mark Rutland <mark.rutl...@arm.com>
Cc: Bhupesh Sharma <bhsha...@redhat.com>
Cc: Ricardo Neri <ricardo.n...@intel.com>
Cc: Ravi Shankar <ravi.v.shan...@intel.com>
Cc: Matt Fleming <m...@codeblueprint.co.uk>
Cc: Peter Zijlstra <peter.zijls...@intel.com>
Cc: Ard Biesheuvel <ard.biesheu...@linaro.org>
Cc: Dan Williams <dan.j.willi...@intel.com>
---
 drivers/firmware/efi/efi.c              |  11 +++
 drivers/firmware/efi/runtime-wrappers.c | 143
 include/linux/efi.h                     |  23 +
 3 files changed, 177 insertions(+)

diff --git a/drivers/firmware/efi/efi.c b/drivers/firmware/efi/efi.c
index ac5db5f8dbbf..4714b305ca90 100644
--- a/drivers/firmware/efi/efi.c
+++ b/drivers/firmware/efi/efi.c
@@ -76,6 +76,8 @@ static unsigned long *efi_tables[] = {
 	&efi.mem_attr_table,
 };
+struct workqueue_struct *efi_rts_wq;
+
 static bool disable_runtime;
 static int __init setup_noefi(char *arg)
 {
@@ -329,6 +331,15 @@ static int __init efisubsys_init(void)
 	if (!efi_enabled(EFI_BOOT))
 		return 0;
+	/* Create a work queue to run EFI Runtime Services */
+	efi_rts_wq = create_workqueue("efi_rts_workqueue");
+	if (!efi_rts_wq) {
+		pr_err("Failed to create efi_rts_workqueue, EFI runtime services "
+		       "disabled.\n");
+		clear_bit(EFI_RUNTIME_SERVICES, &efi.flags);
+		return 0;
+	}
+
 	/*
 	 * Clean DUMMY object calls EFI Runtime Service, set_variable(), so
 	 * it should be invoked only after efi_rts_workqueue is ready.
diff --git a/drivers/firmware/efi/runtime-wrappers.c b/drivers/firmware/efi/runtime-wrappers.c
index ae54870b2788..5cdb787da5d3 100644
--- a/drivers/firmware/efi/runtime-wrappers.c
+++ b/drivers/firmware/efi/runtime-wrappers.c
@@ -1,6 +1,14 @@
 /*
  * runtime-wrappers.c - Runtime Services function call wrappers
  *
+ * Implementation summary:
+ * -----------------------
+ * 1. When user/kernel thread requests to execute efi_runtime_service(),
+ * enqueue work to efi_rts_workqueue.
+ * 2. Caller thread waits until the work is finished because it's
+ * dependent on the return status and execution of efi_runtime_service().
+ * For instance, get_variable() and get_next_variable().
+ *
  * Copyright (C) 2014 Linaro Ltd. <ard.biesheu...@linaro.org>
  *
  * Split off from arch/x86/platform/efi/efi.c
@@ -22,6 +30,8 @@
 #include
 #include
 #include
+#include
+
 #include
 /*
@@ -33,6 +43,50 @@
 #define __efi_call_virt(f, args...) \
 	__efi_call_virt_pointer(efi.systab->runtime, f, args)
+/* Each EFI Runtime Service is represented with a unique number */
+#define GET_TIME				0
+#define SET_TIME				1
+#define GET_WAKEUP_TIME				2
+#define SET_WAKEUP_TIME				3
+#define GET_VARIABLE				4
+#define GET_NEXT_VARIABLE			5
+#define SET_VARIABLE				6
+#define SET_VARIABLE_NONBLOCKING		7
+#define QUERY_VARIABLE_INFO			8
+#define QUERY_VARIABLE_INFO_NONBLOCKING		9
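To make the handler side concrete, here is a small user-space model of dispatching on the service identifier and returning both a status and filled-in buffers to the caller. It is a sketch under assumptions, not the patch's efi_call_rts(): struct demo_work (including its status field), the DEMO_ status codes and the fake_ services are made up for illustration, and the work item is run directly where the kernel would queue it and wait.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Identifiers, numbered as in the patch. Only two are modelled here. */
#define GET_TIME	0
#define GET_VARIABLE	4

/* Hypothetical status codes standing in for efi_status_t values. */
#define DEMO_SUCCESS		0
#define DEMO_BUFFER_TOO_SMALL	1
#define DEMO_UNSUPPORTED	2

/* Simplified efi_runtime_work; the status field is an assumption. */
struct demo_work {
	int id;
	void *arg[5];
	int status;
};

/* Pretend firmware service that writes through a pointer argument. */
static int fake_get_time(int *hour)
{
	*hour = 12;
	return DEMO_SUCCESS;
}

/* Pretend firmware service that fills a caller buffer and updates its size. */
static int fake_get_variable(const char *name, size_t *size, char *data)
{
	const char value[] = "demo";	/* 5 bytes including the NUL */

	(void)name;
	if (*size < sizeof(value)) {
		*size = sizeof(value);	/* report the size that is needed */
		return DEMO_BUFFER_TOO_SMALL;
	}
	memcpy(data, value, sizeof(value));
	*size = sizeof(value);
	return DEMO_SUCCESS;
}

/* Handler: pick the service by identifier and recast the slots it uses. */
static void demo_call_rts(struct demo_work *w)
{
	assert(w != NULL);

	switch (w->id) {
	case GET_TIME:
		w->status = fake_get_time((int *)w->arg[0]);
		break;
	case GET_VARIABLE:
		w->status = fake_get_variable((const char *)w->arg[0],
					      (size_t *)w->arg[1],
					      (char *)w->arg[2]);
		break;
	default:
		w->status = DEMO_UNSUPPORTED;
		break;
	}
}

/* Caller: fill the work item, run it, then read status and outputs. */
static int demo_get_variable(const char *name, size_t *size, char *data)
{
	struct demo_work w = {
		.id = GET_VARIABLE,
		.arg = { (void *)name, size, data, NULL, NULL },
	};

	demo_call_rts(&w);	/* the kernel would queue this and wait */
	return w.status;
}
```

With an 8-byte buffer, demo_get_variable() returns DEMO_SUCCESS and the buffer holds "demo"; with a 2-byte size it returns DEMO_BUFFER_TOO_SMALL and updates the size to 5. Both the status and the out-parameters are only meaningful after the handler has run, which is the reason given above for the caller not being able to post the work and move on.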
[PATCH V1 0/3] Use efi_rts_workqueue to invoke EFI Runtime Services
From: Sai Praneeth <sai.praneeth.prak...@intel.com>

This patch set is an outcome of the discussion at
https://lkml.org/lkml/2017/8/21/607

Presently, efi_runtime_services() are executed by firmware in process
context. To execute efi_runtime_service(), kernel switches the page
directory from swapper_pgd to efi_pgd. However, efi_pgd doesn't have any
user space mappings. A potential issue could be, for instance, an NMI
interrupt (like perf) trying to profile some user data while in efi_pgd.

A solution for this issue could be to use kthread to run
efi_runtime_service(). When a user/kernel thread requests to execute
efi_runtime_service(), kernel off-loads this work to kthread which in
turn uses efi_pgd. Anything that tries to touch user space addresses
while in kthread is terminally broken. This patch set adds support to
the efi subsystem to handle all calls to efi_runtime_services() using a
work queue (which in turn uses kthread).

Implementation summary:
-----------------------
1. When a user/kernel thread requests to execute efi_runtime_service(),
   enqueue work to a work queue, efi_rts_workqueue.
2. The caller thread waits until the work is finished because it's
   dependent on the return status of efi_runtime_service() and, in
   specific cases, the arguments populated by efi_runtime_service().
   Some efi_runtime_services() take a pointer to buffer as an argument
   and fill up the buffer with requested data. For instance,
   efi_get_variable() and efi_get_next_variable(). Hence, the caller
   process cannot just post the work and get going; it has to wait for
   results from firmware.

Caveat: efi_rts_workqueue to run efi_runtime_services() shouldn't be
used while in atomic, because caller thread might sleep. Presently,
pstore code doesn't use efi_rts_workqueue.

Tested using LUV (Linux UEFI Validation) for x86_64 and x86_32. Builds
fine for arm and arm64. Will appreciate the effort if someone could test
the patches on ARM (although I was able to boot with LUV for ARM).
LUV: https://01.org/linux-uefi-validation

Thanks to Ricardo and Dan for initial reviews and suggestions. Please
feel free to pour in your comments and concerns.

Note: Patches are based on Linus's kernel v4.16-rc2

Sai Praneeth (3):
  x86/efi: Call efi_delete_dummy_variable() during efi subsystem
    initialization
  efi: Introduce efi_rts_workqueue and necessary infrastructure to
    invoke all efi_runtime_services()
  efi: Use efi_rts_workqueue to invoke EFI Runtime Services

 arch/x86/include/asm/efi.h              |   1 -
 arch/x86/platform/efi/efi.c             |   6 -
 drivers/firmware/efi/efi.c              |  18 +++
 drivers/firmware/efi/runtime-wrappers.c | 229 +---
 include/linux/efi.h                     |  26
 5 files changed, 253 insertions(+), 27 deletions(-)

Signed-off-by: Sai Praneeth Prakhya <sai.praneeth.prak...@intel.com>
Suggested-by: Andy Lutomirski <l...@kernel.org>
Cc: Lee, Chun-Yi <j...@suse.com>
Cc: Borislav Petkov <b...@alien8.de>
Cc: Tony Luck <tony.l...@intel.com>
Cc: Will Deacon <will.dea...@arm.com>
Cc: Dave Hansen <dave.han...@intel.com>
Cc: Mark Rutland <mark.rutl...@arm.com>
Cc: Bhupesh Sharma <bhsha...@redhat.com>
Cc: Ricardo Neri <ricardo.n...@intel.com>
Cc: Ravi Shankar <ravi.v.shan...@intel.com>
Cc: Matt Fleming <m...@codeblueprint.co.uk>
Cc: Peter Zijlstra <peter.zijls...@intel.com>
Cc: Ard Biesheuvel <ard.biesheu...@linaro.org>
Cc: Dan Williams <dan.j.willi...@intel.com>
--
2.1.4
[PATCH V4 3/3] x86/efi: Use efi_switch_mm() rather than manually twiddling with %cr3
From: Sai Praneeth <sai.praneeth.prak...@intel.com>

Use helper function (efi_switch_mm()) to switch to/from efi_mm. We
switch to efi_mm before
1. calling efi_set_virtual_address_map() and
2. invoking any efi_runtime_service().

Likewise, we need to switch back to previous mm (mm context stolen by
efi_mm) after the above calls return successfully. We can use
efi_switch_mm() helper function only with x86_64 kernel and
"efi=old_map" disabled because x86_32 and efi=old_map don't use efi_pgd,
rather they use swapper_pg_dir.

Signed-off-by: Sai Praneeth Prakhya <sai.praneeth.prak...@intel.com>
Cc: Lee, Chun-Yi <j...@suse.com>
Cc: Borislav Petkov <b...@alien8.de>
Cc: Tony Luck <tony.l...@intel.com>
Cc: Andy Lutomirski <l...@kernel.org>
Cc: Michael S. Tsirkin <m...@redhat.com>
Cc: Bhupesh Sharma <bhsha...@redhat.com>
Cc: Ricardo Neri <ricardo.n...@intel.com>
Cc: Matt Fleming <m...@codeblueprint.co.uk>
Cc: Ard Biesheuvel <ard.biesheu...@linaro.org>
Cc: Ravi Shankar <ravi.v.shan...@intel.com>
Tested-by: Bhupesh Sharma <bhsha...@redhat.com>
---
 arch/x86/include/asm/efi.h           | 25 +-
 arch/x86/platform/efi/efi_64.c       | 40 +++-
 arch/x86/platform/efi/efi_thunk_64.S |  2 +-
 3 files changed, 32 insertions(+), 35 deletions(-)

diff --git a/arch/x86/include/asm/efi.h b/arch/x86/include/asm/efi.h
index 00f977ddd718..cda9940bed7a 100644
--- a/arch/x86/include/asm/efi.h
+++ b/arch/x86/include/asm/efi.h
@@ -62,14 +62,13 @@ extern asmlinkage u64 efi_call(void *fp, ...);
 #define efi_call_phys(f, args...)	efi_call((f), args)
 /*
- * Scratch space used for switching the pagetable in the EFI stub
+ * struct efi_scratch - Scratch space used while switching to/from efi_mm
+ * @phys_stack: stack used during EFI Mixed Mode
+ * @prev_mm:    store/restore stolen mm_struct while switching to/from efi_mm
  */
 struct efi_scratch {
-	u64	r15;
-	u64	prev_cr3;
-	pgd_t	*efi_pgt;
-	bool	use_pgd;
-	u64	phys_stack;
+	u64			phys_stack;
+	struct mm_struct	*prev_mm;
 } __packed;
 #define arch_efi_call_virt_setup()	\
@@ -78,11 +77,8 @@ struct efi_scratch {
 	preempt_disable();	\
 	__kernel_fpu_begin();	\
 	\
-	if (efi_scratch.use_pgd) {	\
-		efi_scratch.prev_cr3 = __read_cr3();	\
-		write_cr3((unsigned long)efi_scratch.efi_pgt);	\
-		__flush_tlb_all();	\
-	}	\
+	if (!efi_enabled(EFI_OLD_MEMMAP))	\
+		efi_switch_mm(&efi_mm);	\
 })
 #define arch_efi_call_virt(p, f, args...)	\
@@ -90,10 +86,8 @@ struct efi_scratch {
 #define arch_efi_call_virt_teardown()	\
 ({	\
-	if (efi_scratch.use_pgd) {	\
-		write_cr3(efi_scratch.prev_cr3);	\
-		__flush_tlb_all();	\
-	}	\
+	if (!efi_enabled(EFI_OLD_MEMMAP))	\
+		efi_switch_mm(efi_scratch.prev_mm);	\
 	\
 	__kernel_fpu_end();	\
 	preempt_enable();	\
@@ -135,6 +129,7 @@ extern void __init efi_dump_pagetable(void);
 extern void __init efi_apply_memmap_quirks(void);
 extern int __init efi_reuse_config(u64 tables, int nr_tables);
 extern void efi_delete_dummy_variable(void);
+extern void efi_switch_mm(struct mm_struct *mm);
 struct efi_setup_data {
 	u64 fw_vendor;
diff --git a/arch/x86/platform/efi/efi_64.c b/arch/x86/platform/efi/efi_64.c
index c93f59731608..d6892ad2a693 100644
--- a/arch/x86/platform/efi/efi_64.c
+++ b/arch/x86/platform/efi/efi_64.c
@@ -82,9 +82,8 @@ pgd_t * __init efi_call_phys_prolog(void)
 	int n_pgds, i, j;
 	if (!efi_enabled(EFI_OLD_MEMMAP)) {
-		save_pgd = (pgd_t *)__read_cr3();
-		write_cr3((unsigned long)efi_scratch.efi_pgt);
-		goto out;
+		efi_switch_mm(&efi_mm);
+		return NULL;
 	}
 	early_code_mapping_set_exec(1);
@@ -156,8 +155,7 @@ void __init efi_call_phys_epilog(pgd_t *save_pgd)
 	pud_t *pud;
 	if (!efi_enabled(EFI_OLD_MEMMAP)) {
-		write_cr3((unsigned long)save_pgd);
-		__flush_tlb_all();
+		efi_switch_mm(efi_scratch.prev_mm);
 		return;
 	}
@@ -346,13 +344,6 @@ int __init efi_setup_page_tables(unsigned long pa_memmap, unsigned num_pages)
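The setup/teardown pairing in the hunks above (switch to efi_mm, call, switch back to efi_scratch.prev_mm) can be modelled in user space as follows. The body of efi_switch_mm() is not part of the quoted diff, so the save-then-install behaviour of demo_switch_mm() is an assumption inferred from the @prev_mm comment, and every demo_ name is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Minimal stand-in for struct mm_struct; only identity matters here. */
struct demo_mm {
	const char *name;
};

static struct demo_mm demo_efi_mm = { "efi_mm" };
static struct demo_mm demo_user_mm = { "user" };

/* Stand-in for the currently active mm of the running task. */
static struct demo_mm *demo_active_mm = &demo_user_mm;

/* Stand-in for efi_scratch.prev_mm. */
static struct demo_mm *demo_prev_mm;

/* Remember the mm being displaced, then install the new one. */
static void demo_switch_mm(struct demo_mm *mm)
{
	assert(mm != NULL);
	demo_prev_mm = demo_active_mm;
	demo_active_mm = mm;
}

/* Run fn() under the EFI mm and restore the caller's mm afterwards. */
static int demo_call_virt(int (*fn)(void))
{
	int ret;

	demo_switch_mm(&demo_efi_mm);	/* setup */
	ret = fn();
	demo_switch_mm(demo_prev_mm);	/* teardown */
	return ret;
}

/* Example service: reports whether it ran under the EFI mm. */
static int demo_service(void)
{
	return demo_active_mm == &demo_efi_mm;
}
```

demo_call_virt(demo_service) returns 1 and leaves demo_active_mm pointing at demo_user_mm again. A single saved pointer is enough only because nothing else can switch in between; in the quoted macros that is what preempt_disable() before the switch and preempt_enable() after the restore provide.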
[PATCH V4 1/3] efi: Use efi_mm in x86 as well as ARM
From: Sai Praneeth <sai.praneeth.prak...@intel.com>

Presently, only ARM uses mm_struct to manage efi page tables and efi
runtime region mappings. As this is the preferred approach, let's make
this data structure common across architectures. Especially for x86,
using this data structure improves code maintainability and
readability.

Signed-off-by: Sai Praneeth Prakhya <sai.praneeth.prak...@intel.com>
Cc: Lee, Chun-Yi <j...@suse.com>
Cc: Borislav Petkov <b...@alien8.de>
Cc: Tony Luck <tony.l...@intel.com>
Cc: Andy Lutomirski <l...@kernel.org>
Cc: Michael S. Tsirkin <m...@redhat.com>
Cc: Ricardo Neri <ricardo.n...@intel.com>
Cc: Matt Fleming <m...@codeblueprint.co.uk>
Cc: Ard Biesheuvel <ard.biesheu...@linaro.org>
Cc: Ravi Shankar <ravi.v.shan...@intel.com>
Tested-by: Bhupesh Sharma <bhsha...@redhat.com>
---
 arch/x86/include/asm/efi.h         | 4
 arch/x86/platform/efi/efi_64.c     | 3 +++
 drivers/firmware/efi/arm-runtime.c | 9 -
 drivers/firmware/efi/efi.c         | 9 +
 include/linux/efi.h                | 2 ++
 5 files changed, 18 insertions(+), 9 deletions(-)

diff --git a/arch/x86/include/asm/efi.h b/arch/x86/include/asm/efi.h
index 85f6ccb80b91..00f977ddd718 100644
--- a/arch/x86/include/asm/efi.h
+++ b/arch/x86/include/asm/efi.h
@@ -2,10 +2,14 @@
 #ifndef _ASM_X86_EFI_H
 #define _ASM_X86_EFI_H
+#include
+#include
+
 #include
 #include
 #include
 #include
+#include
 /*
  * We map the EFI regions needed for runtime services non-contiguously,
diff --git a/arch/x86/platform/efi/efi_64.c b/arch/x86/platform/efi/efi_64.c
index 2dd15e967c3f..c9f8e6924df7 100644
--- a/arch/x86/platform/efi/efi_64.c
+++ b/arch/x86/platform/efi/efi_64.c
@@ -232,6 +232,9 @@ int __init efi_alloc_page_tables(void)
 		return -ENOMEM;
 	}
+	mm_init_cpumask(&efi_mm);
+	init_new_context(NULL, &efi_mm);
+
 	return 0;
 }
diff --git a/drivers/firmware/efi/arm-runtime.c b/drivers/firmware/efi/arm-runtime.c
index 1cc41c3d6315..d6b26534812b 100644
--- a/drivers/firmware/efi/arm-runtime.c
+++ b/drivers/firmware/efi/arm-runtime.c
@@ -31,15 +31,6 @@
 extern u64 efi_system_table;
-static struct mm_struct efi_mm = {
-	.mm_rb			= RB_ROOT,
-	.mm_users		= ATOMIC_INIT(2),
-	.mm_count		= ATOMIC_INIT(1),
-	.mmap_sem		= __RWSEM_INITIALIZER(efi_mm.mmap_sem),
-	.page_table_lock	= __SPIN_LOCK_UNLOCKED(efi_mm.page_table_lock),
-	.mmlist			= LIST_HEAD_INIT(efi_mm.mmlist),
-};
-
 #ifdef CONFIG_ARM64_PTDUMP_DEBUGFS
 #include
diff --git a/drivers/firmware/efi/efi.c b/drivers/firmware/efi/efi.c
index 557a47829d03..760260b933b6 100644
--- a/drivers/firmware/efi/efi.c
+++ b/drivers/firmware/efi/efi.c
@@ -74,6 +74,15 @@ static unsigned long *efi_tables[] = {
 	&efi.mem_attr_table,
 };
+struct mm_struct efi_mm = {
+	.mm_rb			= RB_ROOT,
+	.mm_users		= ATOMIC_INIT(2),
+	.mm_count		= ATOMIC_INIT(1),
+	.mmap_sem		= __RWSEM_INITIALIZER(efi_mm.mmap_sem),
+	.page_table_lock	= __SPIN_LOCK_UNLOCKED(efi_mm.page_table_lock),
+	.mmlist			= LIST_HEAD_INIT(efi_mm.mmlist),
+};
+
 static bool disable_runtime;
 static int __init setup_noefi(char *arg)
 {
diff --git a/include/linux/efi.h b/include/linux/efi.h
index 29fdf8029cf6..d79f1cc4c8bb 100644
--- a/include/linux/efi.h
+++ b/include/linux/efi.h
@@ -930,6 +930,8 @@ extern struct efi {
 	unsigned long flags;
 } efi;
+extern struct mm_struct efi_mm;
+
 static inline int efi_guidcmp (efi_guid_t left, efi_guid_t right)
 {
--
2.1.4
[PATCH V4 2/3] x86/efi: Replace efi_pgd with efi_mm.pgd
From: Sai Praneeth <sai.praneeth.prak...@intel.com>

Since the previous patch added support for efi_mm, let's handle efi_pgd
through efi_mm and remove the global variable efi_pgd.

Signed-off-by: Sai Praneeth Prakhya <sai.praneeth.prak...@intel.com>
Cc: Lee, Chun-Yi <j...@suse.com>
Cc: Borislav Petkov <b...@alien8.de>
Cc: Tony Luck <tony.l...@intel.com>
Cc: Andy Lutomirski <l...@kernel.org>
Cc: Michael S. Tsirkin <m...@redhat.com>
Cc: Bhupesh Sharma <bhsha...@redhat.com>
Cc: Ricardo Neri <ricardo.n...@intel.com>
Cc: Matt Fleming <m...@codeblueprint.co.uk>
Cc: Ard Biesheuvel <ard.biesheu...@linaro.org>
Cc: Ravi Shankar <ravi.v.shan...@intel.com>
Tested-by: Bhupesh Sharma <bhsha...@redhat.com>
---
 arch/x86/platform/efi/efi_64.c | 17 ++++++++---------
 1 file changed, 8 insertions(+), 9 deletions(-)

diff --git a/arch/x86/platform/efi/efi_64.c b/arch/x86/platform/efi/efi_64.c
index c9f8e6924df7..c93f59731608 100644
--- a/arch/x86/platform/efi/efi_64.c
+++ b/arch/x86/platform/efi/efi_64.c
@@ -191,8 +191,6 @@ void __init efi_call_phys_epilog(pgd_t *save_pgd)
 	early_code_mapping_set_exec(0);
 }

-static pgd_t *efi_pgd;
-
 /*
  * We need our own copy of the higher levels of the page tables
  * because we want to avoid inserting EFI region mappings (EFI_VA_END
@@ -204,7 +202,7 @@ static pgd_t *efi_pgd;
  */
 int __init efi_alloc_page_tables(void)
 {
-	pgd_t *pgd;
+	pgd_t *pgd, *efi_pgd;
 	p4d_t *p4d;
 	pud_t *pud;
 	gfp_t gfp_mask;
@@ -232,6 +230,7 @@ int __init efi_alloc_page_tables(void)
 		return -ENOMEM;
 	}

+	efi_mm.pgd = efi_pgd;
 	mm_init_cpumask(&efi_mm);
 	init_new_context(NULL, &efi_mm);

@@ -247,6 +246,7 @@ void efi_sync_low_kernel_mappings(void)
 	pgd_t *pgd_k, *pgd_efi;
 	p4d_t *p4d_k, *p4d_efi;
 	pud_t *pud_k, *pud_efi;
+	pgd_t *efi_pgd = efi_mm.pgd;

 	if (efi_enabled(EFI_OLD_MEMMAP))
 		return;
@@ -340,7 +340,7 @@ int __init efi_setup_page_tables(unsigned long pa_memmap, unsigned num_pages)
 	unsigned long pfn, text, pf;
 	struct page *page;
 	unsigned npages;
-	pgd_t *pgd;
+	pgd_t *pgd = efi_mm.pgd;

 	if (efi_enabled(EFI_OLD_MEMMAP))
 		return 0;
@@ -350,8 +350,7 @@ int __init efi_setup_page_tables(unsigned long pa_memmap, unsigned num_pages)
 	 * this value is loaded into cr3 the PGD will be decrypted during
 	 * the pagetable walk.
 	 */
-	efi_scratch.efi_pgt = (pgd_t *)__sme_pa(efi_pgd);
-	pgd = efi_pgd;
+	efi_scratch.efi_pgt = (pgd_t *)__sme_pa(pgd);

 	/*
 	 * It can happen that the physical address of new_memmap lands in memory
@@ -421,7 +420,7 @@ static void __init __map_region(efi_memory_desc_t *md, u64 va)
 {
 	unsigned long flags = _PAGE_RW;
 	unsigned long pfn;
-	pgd_t *pgd = efi_pgd;
+	pgd_t *pgd = efi_mm.pgd;

 	if (!(md->attribute & EFI_MEMORY_WB))
 		flags |= _PAGE_PCD;
@@ -525,7 +524,7 @@ void __init parse_efi_setup(u64 phys_addr, u32 data_len)
 static int __init efi_update_mappings(efi_memory_desc_t *md, unsigned long pf)
 {
 	unsigned long pfn;
-	pgd_t *pgd = efi_pgd;
+	pgd_t *pgd = efi_mm.pgd;
 	int err1, err2;

 	/* Update the 1:1 mapping */
@@ -622,7 +621,7 @@ void __init efi_dump_pagetable(void)
 	if (efi_enabled(EFI_OLD_MEMMAP))
 		ptdump_walk_pgd_level(NULL, swapper_pg_dir);
 	else
-		ptdump_walk_pgd_level(NULL, efi_pgd);
+		ptdump_walk_pgd_level(NULL, efi_mm.pgd);
 #endif
 }

--
2.1.4
[PATCH V4 0/3] Use mm_struct and switch_mm() instead of manually
From: Sai Praneeth <sai.praneeth.prak...@intel.com>

Presently, in x86, to invoke any efi function like
efi_set_virtual_address_map() or any efi_runtime_service(), the code path
typically involves read_cr3() (save previous pgd), write_cr3() (write
efi_pgd) and calling the efi function. Likewise, after returning from the
efi function, the code path typically involves read_cr3() (save efi_pgd)
and write_cr3() (write previous pgd). We do this a couple of times in the
efi subsystem of the Linux kernel; instead, we can use the helper function
efi_switch_mm() to do this. This improves readability and maintainability.
Also, instead of maintaining a separate struct "efi_scratch" to
store/restore efi_pgd, we can use mm_struct to do this.

I have tested this patch set against LUV (Linux UEFI Validation), so I
think I didn't break any existing configurations. I have tested this patch
set for
1. x86_64,
2. x86_32,
3. Mixed mode
with efi=old_map and for kexec kernel. Please let me know if I have missed
any other configurations.

Changes in V2:
1. Resolve mm_dropping() issue by not mm_dropping()/mm_grabbing() any mm,
   as we are not losing/creating any references.

Changes in V3:
1. When CPUMASK_OFFSTACK is enabled, switch_mm_irqs_off() sets cpumask by
   calling cpumask_set_cpu(). This panics the kernel as efi_mm is not
   initialized; therefore initialize efi_mm in efi_alloc_page_tables().

Changes in V4:
1. Restore the local_irq_restore(flags) call that was unintentionally
   removed (in the 3rd patch). IRQ flags should be restored after
   switching to the original mm.

Note: This patch set is based on Linus's tree v4.15-rc8

Sai Praneeth (3):
  efi: Use efi_mm in x86 as well as ARM
  x86/efi: Replace efi_pgd with efi_mm.pgd
  x86/efi: Use efi_switch_mm() rather than manually twiddling with %cr3

 arch/x86/include/asm/efi.h           | 29 +-
 arch/x86/platform/efi/efi_64.c       | 58 +++-
 arch/x86/platform/efi/efi_thunk_64.S |  2 +-
 drivers/firmware/efi/arm-runtime.c   |  9 --
 drivers/firmware/efi/efi.c           |  9 ++
 include/linux/efi.h                  |  2 ++
 6 files changed, 57 insertions(+), 52 deletions(-)

Signed-off-by: Sai Praneeth Prakhya <sai.praneeth.prak...@intel.com>
Cc: Lee, Chun-Yi <j...@suse.com>
Cc: Borislav Petkov <b...@alien8.de>
Cc: Tony Luck <tony.l...@intel.com>
Cc: Andy Lutomirski <l...@kernel.org>
Cc: Michael S. Tsirkin <m...@redhat.com>
Cc: Ricardo Neri <ricardo.n...@intel.com>
Cc: Matt Fleming <m...@codeblueprint.co.uk>
Cc: Ard Biesheuvel <ard.biesheu...@linaro.org>
Cc: Ravi Shankar <ravi.v.shan...@intel.com>
Tested-by: Bhupesh Sharma <bhsha...@redhat.com>
--
2.1.4
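For readers skimming the cover letter, the save/switch/call/restore sequence it describes can be sketched in plain userspace C. This is a rough model under stated assumptions, not the helper from patch 3/3: struct mm_model, active_mm, cr3_model, scratch_model, efi_switch_mm_model() and call_with_efi_mm() are all hypothetical names, and the real helper presumably also has to deal with locking and TLB state, which the model ignores.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, reduced model of an address space: just a page-table root. */
struct mm_model {
	unsigned long pgd;
};

static struct mm_model *active_mm;	/* stands in for the task's active mm */
static unsigned long cr3_model;		/* stands in for the %cr3 register */

/* Reduced model of efi_scratch: remembers the mm we switched away from. */
static struct {
	struct mm_model *prev_mm;
} scratch_model;

/* One helper replaces each open-coded "read cr3, write cr3" pair. */
static void efi_switch_mm_model(struct mm_model *mm)
{
	scratch_model.prev_mm = active_mm;
	active_mm = mm;
	cr3_model = mm->pgd;
}

/* Run fn() with the EFI mm loaded, then switch back to the previous mm. */
static unsigned long call_with_efi_mm(struct mm_model *efi_mm,
				      unsigned long (*fn)(void))
{
	unsigned long ret;

	efi_switch_mm_model(efi_mm);
	ret = fn();
	efi_switch_mm_model(scratch_model.prev_mm);
	return ret;
}

/* Example "firmware call": reports which page-table root is loaded. */
static unsigned long read_cr3_model(void)
{
	return cr3_model;
}
```

The point of the model is that both directions go through the same helper, so the caller no longer carries a saved pgd around; the previous mm is recorded as a side effect of the switch.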
[PATCH 2/3] x86/efi: Replace efi_pgd with efi_mm.pgd
From: Sai Praneeth <sai.praneeth.prak...@intel.com>

Since the previous patch added support for efi_mm, let's handle efi_pgd
through efi_mm and remove the global variable efi_pgd.

Signed-off-by: Sai Praneeth Prakhya <sai.praneeth.prak...@intel.com>
Cc: Lee, Chun-Yi <j...@suse.com>
Cc: Borislav Petkov <b...@alien8.de>
Cc: Tony Luck <tony.l...@intel.com>
Cc: Andy Lutomirski <l...@kernel.org>
Cc: Michael S. Tsirkin <m...@redhat.com>
Cc: Bhupesh Sharma <bhsha...@redhat.com>
Cc: Ricardo Neri <ricardo.n...@intel.com>
Cc: Matt Fleming <m...@codeblueprint.co.uk>
Cc: Ard Biesheuvel <ard.biesheu...@linaro.org>
Cc: Ravi Shankar <ravi.v.shan...@intel.com>
Tested-by: Bhupesh Sharma <bhsha...@redhat.com>
---
 arch/x86/platform/efi/efi_64.c | 17 ++++++++---------
 1 file changed, 8 insertions(+), 9 deletions(-)

diff --git a/arch/x86/platform/efi/efi_64.c b/arch/x86/platform/efi/efi_64.c
index ccf5239923e8..6b541bdbda5f 100644
--- a/arch/x86/platform/efi/efi_64.c
+++ b/arch/x86/platform/efi/efi_64.c
@@ -189,8 +189,6 @@ void __init efi_call_phys_epilog(pgd_t *save_pgd)
 	early_code_mapping_set_exec(0);
 }

-static pgd_t *efi_pgd;
-
 /*
  * We need our own copy of the higher levels of the page tables
  * because we want to avoid inserting EFI region mappings (EFI_VA_END
@@ -199,7 +197,7 @@ static pgd_t *efi_pgd;
  */
 int __init efi_alloc_page_tables(void)
 {
-	pgd_t *pgd;
+	pgd_t *pgd, *efi_pgd;
 	p4d_t *p4d;
 	pud_t *pud;
 	gfp_t gfp_mask;
@@ -227,6 +225,7 @@ int __init efi_alloc_page_tables(void)
 		return -ENOMEM;
 	}

+	efi_mm.pgd = efi_pgd;
 	mm_init_cpumask(&efi_mm);
 	init_new_context(NULL, &efi_mm);

@@ -242,6 +241,7 @@ void efi_sync_low_kernel_mappings(void)
 	pgd_t *pgd_k, *pgd_efi;
 	p4d_t *p4d_k, *p4d_efi;
 	pud_t *pud_k, *pud_efi;
+	pgd_t *efi_pgd = efi_mm.pgd;

 	if (efi_enabled(EFI_OLD_MEMMAP))
 		return;
@@ -335,7 +335,7 @@ int __init efi_setup_page_tables(unsigned long pa_memmap, unsigned num_pages)
 	unsigned long pfn, text, pf;
 	struct page *page;
 	unsigned npages;
-	pgd_t *pgd;
+	pgd_t *pgd = efi_mm.pgd;

 	if (efi_enabled(EFI_OLD_MEMMAP))
 		return 0;
@@ -345,8 +345,7 @@ int __init efi_setup_page_tables(unsigned long pa_memmap, unsigned num_pages)
 	 * this value is loaded into cr3 the PGD will be decrypted during
 	 * the pagetable walk.
 	 */
-	efi_scratch.efi_pgt = (pgd_t *)__sme_pa(efi_pgd);
-	pgd = efi_pgd;
+	efi_scratch.efi_pgt = (pgd_t *)__sme_pa(pgd);

 	/*
 	 * It can happen that the physical address of new_memmap lands in memory
@@ -416,7 +415,7 @@ static void __init __map_region(efi_memory_desc_t *md, u64 va)
 {
 	unsigned long flags = _PAGE_RW;
 	unsigned long pfn;
-	pgd_t *pgd = efi_pgd;
+	pgd_t *pgd = efi_mm.pgd;

 	if (!(md->attribute & EFI_MEMORY_WB))
 		flags |= _PAGE_PCD;
@@ -520,7 +519,7 @@ void __init parse_efi_setup(u64 phys_addr, u32 data_len)
 static int __init efi_update_mappings(efi_memory_desc_t *md, unsigned long pf)
 {
 	unsigned long pfn;
-	pgd_t *pgd = efi_pgd;
+	pgd_t *pgd = efi_mm.pgd;
 	int err1, err2;

 	/* Update the 1:1 mapping */
@@ -617,7 +616,7 @@ void __init efi_dump_pagetable(void)
 	if (efi_enabled(EFI_OLD_MEMMAP))
 		ptdump_walk_pgd_level(NULL, swapper_pg_dir);
 	else
-		ptdump_walk_pgd_level(NULL, efi_pgd);
+		ptdump_walk_pgd_level(NULL, efi_mm.pgd);
 #endif
 }

--
2.1.4
[PATCH 3/3] x86/efi: Use efi_switch_mm() rather than manually twiddling with %cr3
From: Sai Praneeth <sai.praneeth.prak...@intel.com>

Use the helper function efi_switch_mm() to switch to/from efi_mm. We
switch to efi_mm before
1. calling efi_set_virtual_address_map(), and
2. invoking any efi_runtime_service().

Likewise, we need to switch back to the previous mm (the mm context stolen
by efi_mm) after the above calls return successfully. We can use the
efi_switch_mm() helper only on an x86_64 kernel with "efi=old_map"
disabled, because x86_32 and efi=old_map don't use efi_pgd; they use
swapper_pg_dir instead.

Signed-off-by: Sai Praneeth Prakhya <sai.praneeth.prak...@intel.com>
Cc: Lee, Chun-Yi <j...@suse.com>
Cc: Borislav Petkov <b...@alien8.de>
Cc: Tony Luck <tony.l...@intel.com>
Cc: Andy Lutomirski <l...@kernel.org>
Cc: Michael S. Tsirkin <m...@redhat.com>
Cc: Bhupesh Sharma <bhsha...@redhat.com>
Cc: Ricardo Neri <ricardo.n...@intel.com>
Cc: Matt Fleming <m...@codeblueprint.co.uk>
Cc: Ard Biesheuvel <ard.biesheu...@linaro.org>
Cc: Ravi Shankar <ravi.v.shan...@intel.com>
Tested-by: Bhupesh Sharma <bhsha...@redhat.com>
---
 arch/x86/include/asm/efi.h           | 25 +-
 arch/x86/platform/efi/efi_64.c       | 41 ++--
 arch/x86/platform/efi/efi_thunk_64.S |  2 +-
 3 files changed, 32 insertions(+), 36 deletions(-)

diff --git a/arch/x86/include/asm/efi.h b/arch/x86/include/asm/efi.h
index 00f977ddd718..cda9940bed7a 100644
--- a/arch/x86/include/asm/efi.h
+++ b/arch/x86/include/asm/efi.h
@@ -62,14 +62,13 @@ extern asmlinkage u64 efi_call(void *fp, ...);
 #define efi_call_phys(f, args...)		efi_call((f), args)

 /*
- * Scratch space used for switching the pagetable in the EFI stub
+ * struct efi_scratch - Scratch space used while switching to/from efi_mm
+ * @phys_stack: stack used during EFI Mixed Mode
+ * @prev_mm:    store/restore stolen mm_struct while switching to/from efi_mm
  */
 struct efi_scratch {
-	u64	r15;
-	u64	prev_cr3;
-	pgd_t	*efi_pgt;
-	bool	use_pgd;
-	u64	phys_stack;
+	u64			phys_stack;
+	struct mm_struct	*prev_mm;
 } __packed;

 #define arch_efi_call_virt_setup()					\
@@ -78,11 +77,8 @@ struct efi_scratch {
 	preempt_disable();						\
 	__kernel_fpu_begin();						\
 									\
-	if (efi_scratch.use_pgd) {					\
-		efi_scratch.prev_cr3 = __read_cr3();			\
-		write_cr3((unsigned long)efi_scratch.efi_pgt);		\
-		__flush_tlb_all();					\
-	}								\
+	if (!efi_enabled(EFI_OLD_MEMMAP))				\
+		efi_switch_mm(&efi_mm);					\
 })

 #define arch_efi_call_virt(p, f, args...)				\
@@ -90,10 +86,8 @@ struct efi_scratch {

 #define arch_efi_call_virt_teardown()					\
 ({									\
-	if (efi_scratch.use_pgd) {					\
-		write_cr3(efi_scratch.prev_cr3);			\
-		__flush_tlb_all();					\
-	}								\
+	if (!efi_enabled(EFI_OLD_MEMMAP))				\
+		efi_switch_mm(efi_scratch.prev_mm);			\
 									\
 	__kernel_fpu_end();						\
 	preempt_enable();						\
@@ -135,6 +129,7 @@ extern void __init efi_dump_pagetable(void);
 extern void __init efi_apply_memmap_quirks(void);
 extern int __init efi_reuse_config(u64 tables, int nr_tables);
 extern void efi_delete_dummy_variable(void);
+extern void efi_switch_mm(struct mm_struct *mm);

 struct efi_setup_data {
 	u64 fw_vendor;
diff --git a/arch/x86/platform/efi/efi_64.c b/arch/x86/platform/efi/efi_64.c
index 6b541bdbda5f..c325b1cc4d1a 100644
--- a/arch/x86/platform/efi/efi_64.c
+++ b/arch/x86/platform/efi/efi_64.c
@@ -82,9 +82,8 @@ pgd_t * __init efi_call_phys_prolog(void)
 	int n_pgds, i, j;

 	if (!efi_enabled(EFI_OLD_MEMMAP)) {
-		save_pgd = (pgd_t *)__read_cr3();
-		write_cr3((unsigned long)efi_scratch.efi_pgt);
-		goto out;
+		efi_switch_mm(&efi_mm);
+		return NULL;
 	}

 	early_code_mapping_set_exec(1);
@@ -154,8 +153,7 @@ void __init efi_call_phys_epilog(pgd_t *save_pgd)
 	pud_t *pud;

 	if (!efi_enabled(EFI_OLD_MEMMAP)) {
-		write_cr3((unsigned long)save_pgd);
-		__flush_tlb_all();
+		efi_switch_mm(efi_scratch.prev_mm);
 		return;
 	}

@@ -341,13 +339,6 @@ int __init efi_setup_page_tables(unsigned long pa_memmap, unsigned num_pages)
[PATCH 1/3] efi: Use efi_mm in x86 as well as ARM
From: Sai Praneeth <sai.praneeth.prak...@intel.com>

Presently, only ARM uses mm_struct to manage efi page tables and efi
runtime region mappings. As this is the preferred approach, let's make
this data structure common across architectures. Especially for x86,
using this data structure improves code maintainability and readability.

Signed-off-by: Sai Praneeth Prakhya <sai.praneeth.prak...@intel.com>
Cc: Lee, Chun-Yi <j...@suse.com>
Cc: Borislav Petkov <b...@alien8.de>
Cc: Tony Luck <tony.l...@intel.com>
Cc: Andy Lutomirski <l...@kernel.org>
Cc: Michael S. Tsirkin <m...@redhat.com>
Cc: Ricardo Neri <ricardo.n...@intel.com>
Cc: Matt Fleming <m...@codeblueprint.co.uk>
Cc: Ard Biesheuvel <ard.biesheu...@linaro.org>
Cc: Ravi Shankar <ravi.v.shan...@intel.com>
Tested-by: Bhupesh Sharma <bhsha...@redhat.com>
---
 arch/x86/include/asm/efi.h         | 4 ++++
 arch/x86/platform/efi/efi_64.c     | 3 +++
 drivers/firmware/efi/arm-runtime.c | 9 ---------
 drivers/firmware/efi/efi.c         | 9 +++++++++
 include/linux/efi.h                | 2 ++
 5 files changed, 18 insertions(+), 9 deletions(-)

diff --git a/arch/x86/include/asm/efi.h b/arch/x86/include/asm/efi.h
index 85f6ccb80b91..00f977ddd718 100644
--- a/arch/x86/include/asm/efi.h
+++ b/arch/x86/include/asm/efi.h
@@ -2,10 +2,14 @@
 #ifndef _ASM_X86_EFI_H
 #define _ASM_X86_EFI_H

+#include
+#include
+
 #include
 #include
 #include
 #include
+#include

 /*
  * We map the EFI regions needed for runtime services non-contiguously,
diff --git a/arch/x86/platform/efi/efi_64.c b/arch/x86/platform/efi/efi_64.c
index 6a151ce70e86..ccf5239923e8 100644
--- a/arch/x86/platform/efi/efi_64.c
+++ b/arch/x86/platform/efi/efi_64.c
@@ -227,6 +227,9 @@ int __init efi_alloc_page_tables(void)
 		return -ENOMEM;
 	}

+	mm_init_cpumask(&efi_mm);
+	init_new_context(NULL, &efi_mm);
+
 	return 0;
 }

diff --git a/drivers/firmware/efi/arm-runtime.c b/drivers/firmware/efi/arm-runtime.c
index 1cc41c3d6315..d6b26534812b 100644
--- a/drivers/firmware/efi/arm-runtime.c
+++ b/drivers/firmware/efi/arm-runtime.c
@@ -31,15 +31,6 @@

 extern u64 efi_system_table;

-static struct mm_struct efi_mm = {
-	.mm_rb			= RB_ROOT,
-	.mm_users		= ATOMIC_INIT(2),
-	.mm_count		= ATOMIC_INIT(1),
-	.mmap_sem		= __RWSEM_INITIALIZER(efi_mm.mmap_sem),
-	.page_table_lock	= __SPIN_LOCK_UNLOCKED(efi_mm.page_table_lock),
-	.mmlist			= LIST_HEAD_INIT(efi_mm.mmlist),
-};
-
 #ifdef CONFIG_ARM64_PTDUMP_DEBUGFS
 #include

diff --git a/drivers/firmware/efi/efi.c b/drivers/firmware/efi/efi.c
index 557a47829d03..760260b933b6 100644
--- a/drivers/firmware/efi/efi.c
+++ b/drivers/firmware/efi/efi.c
@@ -74,6 +74,15 @@ static unsigned long *efi_tables[] = {
 	_attr_table,
 };

+struct mm_struct efi_mm = {
+	.mm_rb			= RB_ROOT,
+	.mm_users		= ATOMIC_INIT(2),
+	.mm_count		= ATOMIC_INIT(1),
+	.mmap_sem		= __RWSEM_INITIALIZER(efi_mm.mmap_sem),
+	.page_table_lock	= __SPIN_LOCK_UNLOCKED(efi_mm.page_table_lock),
+	.mmlist			= LIST_HEAD_INIT(efi_mm.mmlist),
+};
+
 static bool disable_runtime;
 static int __init setup_noefi(char *arg)
 {
diff --git a/include/linux/efi.h b/include/linux/efi.h
index d813f7b04da7..6745f4dbbcc1 100644
--- a/include/linux/efi.h
+++ b/include/linux/efi.h
@@ -928,6 +928,8 @@ extern struct efi {
 	unsigned long flags;
 } efi;

+extern struct mm_struct efi_mm;
+
 static inline int
 efi_guidcmp (efi_guid_t left, efi_guid_t right)
 {
--
2.1.4
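The structural change in this patch (a file-local static definition becoming one shared definition plus an extern declaration in a common header) can be sketched outside the kernel as follows. This is an illustrative userspace model only: struct mm_model, efi_mm_model and list_is_empty_model() are hypothetical names, and the list_head type and LIST_HEAD_INIT macro are minimal re-creations written for this sketch, not the kernel headers.

```c
#include <assert.h>
#include <stddef.h>

/* Minimal userspace stand-in for a doubly linked list head. */
struct list_head {
	struct list_head *next, *prev;
};
#define LIST_HEAD_INIT(name) { &(name), &(name) }

/* Hypothetical, heavily reduced model of struct mm_struct. */
struct mm_model {
	int mm_users;
	int mm_count;
	void *pgd;			/* left NULL by the static initializer */
	struct list_head mmlist;
};

/* What the shared header would carry: a declaration visible to all users. */
extern struct mm_model efi_mm_model;

/*
 * The single definition, no longer 'static'. The initializer may refer to
 * the object's own members because their addresses are link-time constants.
 */
struct mm_model efi_mm_model = {
	.mm_users	= 2,
	.mm_count	= 1,
	.mmlist		= LIST_HEAD_INIT(efi_mm_model.mmlist),
};

/* An empty list is one whose head points back at itself. */
static int list_is_empty_model(const struct list_head *head)
{
	return head->next == head;
}
```

Note that pgd is not set by the initializer in this model; that matches the series, where the page table root is only assigned later, once the page tables have been allocated.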
[PATCH 0/3] Use mm_struct and switch_mm() instead of manually
From: Sai Praneeth <sai.praneeth.prak...@intel.com>

Presently, on x86, invoking any efi function like
efi_set_virtual_address_map() or any efi_runtime_service() typically
involves read_cr3() (save previous pgd), write_cr3() (write efi_pgd) and
then calling the efi function. Likewise, after returning from the efi
function, the code path typically involves read_cr3() (save efi_pgd) and
write_cr3() (write previous pgd). We do this a couple of times in the efi
subsystem of the Linux kernel; instead, we can use the helper function
efi_switch_mm() to do this. This improves readability and
maintainability. Also, instead of maintaining a separate struct
"efi_scratch" to store/restore efi_pgd, we can use mm_struct to do this.

I have tested this patch set against LUV (Linux UEFI Validation), so I
think I didn't break any existing configurations. I have tested this
patch set for
1. x86_64,
2. x86_32,
3. Mixed mode
with efi=old_map and for kexec kernel. Please let me know if I have
missed any other configurations.

Changes in V2:
1. Resolve mm_dropping() issue by not mm_dropping()/mm_grabbing() any mm,
   as we are not losing/creating any references.

Changes in V3:
1. When CPUMASK_OFFSTACK is enabled, switch_mm_irqs_off() sets cpumask by
   calling cpumask_set_cpu(). This panics the kernel as efi_mm is not
   initialized; therefore, initialize efi_mm in efi_alloc_page_tables().

Note: This patch set is based on Linus's tree v4.15-rc3

Sai Praneeth (3):
  efi: Use efi_mm in x86 as well as ARM
  x86/efi: Replace efi_pgd with efi_mm.pgd
  x86/efi: Use efi_switch_mm() rather than manually twiddling with %cr3

 arch/x86/include/asm/efi.h           | 29 +-
 arch/x86/platform/efi/efi_64.c       | 59 +++-
 arch/x86/platform/efi/efi_thunk_64.S | 2 +-
 drivers/firmware/efi/arm-runtime.c   | 9 --
 drivers/firmware/efi/efi.c           | 9 ++
 include/linux/efi.h                  | 2 ++
 6 files changed, 57 insertions(+), 53 deletions(-)

Signed-off-by: Sai Praneeth Prakhya <sai.praneeth.prak...@intel.com>
Cc: Lee, Chun-Yi <j...@suse.com>
Cc: Borislav Petkov <b...@alien8.de>
Cc: Tony Luck <tony.l...@intel.com>
Cc: Andy Lutomirski <l...@kernel.org>
Cc: Michael S. Tsirkin <m...@redhat.com>
Cc: Ricardo Neri <ricardo.n...@intel.com>
Cc: Matt Fleming <m...@codeblueprint.co.uk>
Cc: Ard Biesheuvel <ard.biesheu...@linaro.org>
Cc: Ravi Shankar <ravi.v.shan...@intel.com>
Tested-by: Bhupesh Sharma <bhsha...@redhat.com>
-- 
2.1.4
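The save/switch/call/restore pattern described in the cover letter can be sketched in plain userspace C. This is only an illustration under stated assumptions: struct mm_model, model_switch_mm(), model_efi_call() and model_service() are hypothetical stand-ins invented here, not the kernel's mm_struct, switch_mm() or efi_switch_mm(), and no real page tables or %cr3 are touched.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for struct mm_struct: only the page-table
 * root matters for this illustration. */
struct mm_model {
	unsigned long pgd;
};

/* Models the address space currently loaded on this CPU. */
static struct mm_model *active_mm;

/* Records which page-table root the example service ran under. */
static unsigned long seen_pgd;

/* Stand-in for switch_mm(): "loads" next and returns the mm it replaced. */
static struct mm_model *model_switch_mm(struct mm_model *next)
{
	struct mm_model *prev = active_mm;

	assert(next != NULL);
	active_mm = next;
	return prev;
}

/* Example "runtime service": notes the address space it was called in. */
static int model_service(void)
{
	seen_pgd = active_mm->pgd;
	return 0;
}

/* Runs fn with the efi address space loaded, then restores the caller's. */
static int model_efi_call(struct mm_model *efi, int (*fn)(void))
{
	struct mm_model *saved = model_switch_mm(efi);
	int status = fn();

	model_switch_mm(saved);
	return status;
}
```

With active_mm pointing at a caller's mm, model_efi_call(&efi, model_service) runs the service under the efi root and leaves active_mm pointing back at the caller's mm. The point of the design is that the save and restore live in one helper rather than being repeated at each call site.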
Re: [PATCH v4 00/10] PCID and improved laziness
>
>
> Hi Andy,
>
> I have booted Linus's tree (8fac2f96ab86b0e14ec4e42851e21e9b518bdc55) on
> Skylake server and noticed that it reboots automatically.
>
> When I booted the same kernel with command line arg "nopcid" it works
> fine. Please find below a snippet of dmesg. Please let me know if you
> need more info to debug.
>
> [0.00] Kernel command line: BOOT_IMAGE=/boot/vmlinuz-4.13.0+
> root=UUID=3b8e9636-6e23-4785-a4e2-5954bfe86fd9 ro console=tty0
> console=ttyS0,115200n8
> [0.00] log_buf_len individual max cpu contribution: 4096 bytes
> [0.00] log_buf_len total cpu_extra contributions: 258048 bytes
> [0.00] log_buf_len min size: 262144 bytes
> [0.00] log_buf_len: 524288 bytes
> [0.00] early log buf free: 212560(81%)
> [0.00] PID hash table entries: 4096 (order: 3, 32768 bytes)
> [0.00] ------------[ cut here ]------------
> [0.00] WARNING: CPU: 0 PID: 0 at arch/x86/mm/tlb.c:245
> initialize_tlbstate_and_flush+0x6c/0xf0
> [0.00] Modules linked in:
> [0.00] CPU: 0 PID: 0 Comm: swapper Not tainted 4.13.0+ #5
> [0.00] task: 8960f480 task.stack: 8960
> [0.00] RIP: 0010:initialize_tlbstate_and_flush+0x6c/0xf0
> [0.00] RSP: :89603e60 EFLAGS: 00010046
> [0.00] RAX: 000406b0 RBX: 9f1700a17880 RCX:
> 8965de60
> [0.00] RDX: 008383a0a000 RSI: 0960a000 RDI:
> 008383a0a000
> [0.00] RBP: 89603e60 R08: R09:
>
> [0.00] R10: 89603ee8 R11: R12:
>
> [0.00] R13: 9f1700a0c3e0 R14: 8960f480 R15:
>
> [0.00] FS: () GS:9f1700a0()
> knlGS:
> [0.00] CS: 0010 DS: ES: CR0: 80050033
> [0.00] CR2: 9fa7b000 CR3: 008383a0a000 CR4:
> 000406b0
> [0.00] Call Trace:
> [0.00] cpu_init+0x206/0x4f0
> [0.00] ? __set_pte_vaddr+0x1d/0x30
> [0.00] trap_init+0x3e/0x50
> [0.00] ? trap_init+0x3e/0x50
> [0.00] start_kernel+0x1e2/0x3f2
> [0.00] x86_64_start_reservations+0x24/0x26
> [0.00] x86_64_start_kernel+0x6f/0x72
> [0.00] secondary_startup_64+0xa5/0xa5
> [0.00] Code: de 00 48 01 f0 48 39 c7 0f 85 92 00 00 00 48 8b 05
> ee e2 ee 00 a9 00 00 02 00 74 11 65 48 8b 05 8b 9d 7c 77 a9 00 00 02 00
> 75 02 <0f> ff 48 81 e2 00 f0 ff ff 0f 22 da 65 66 c7 05 66 9d 7c 77 00
> [0.00] ---[ end trace c258f2d278fe031f ]---
> [0.00] Memory: 791050356K/803934656K available (9585K kernel
> code, 1313K rwdata, 3000K rodata, 1176K init, 680K bss, 12884300K
> reserved, 0K cma-reserved)
> [0.00] SLUB: HWalign=64, Order=0-3, MinObjects=0, CPUs=64,
> Nodes=4
> [0.00] Hierarchical RCU implementation.
> [0.00]RCU event tracing is enabled.
> [0.00] NR_IRQS: 4352, nr_irqs: 3928, preallocated irqs: 16
> [0.00] Console: colour dummy device 80x25
> [0.00] console [tty0] enabled
> [0.00] console [ttyS0] enabled
> [0.00] clocksource: hpet: mask: 0x max_cycles:
> 0x, max_idle_ns: 79635855245 ns
> [0.001000] tsc: Detected 2000.000 MHz processor
> [0.002000] Calibrating delay loop (skipped), value calculated using
> timer frequency.. 4000.00 BogoMIPS (lpj=200)
> [0.003003] pid_max: default: 65536 minimum: 512
> [0.004030] ACPI: Core revision 20170728
> [0.091853] ACPI: 6 ACPI AML tables successfully acquired and loaded
> [0.094143] Security Framework initialized
> [0.095004] SELinux: Initializing.
> [0.145612] Dentry cache hash table entries: 33554432 (order: 16,
> 268435456 bytes)
> [0.170544] Inode-cache hash table entries: 16777216 (order: 15,
> 134217728 bytes)
> [0.172699] Mount-cache hash table entries: 524288 (order: 10,
> 4194304 bytes)
> [0.174441] Mountpoint-cache hash table entries: 524288 (order: 10,
> 4194304 bytes)
> [0.176351] CPU: Physical Processor ID: 0
> [0.177003] CPU: Processor Core ID: 0
> [0.178007] ENERGY_PERF_BIAS: Set to 'normal', was 'performance'
> [0.179003] ENERGY_PERF_BIAS: View and update with
> x86_energy_perf_policy(8)
> [0.180013] mce: CPU supports 20 MCE banks
> [0.181018] CPU0: Thermal monitoring enabled (TM1)
> [0.182057] process: using mwait in idle threads
> [0.183005] Last level iTLB entries: 4KB 64, 2MB 8, 4MB 8
> [0.184003] Last level dTLB entries: 4KB 64, 2MB 0, 4MB 0, 1GB 4
> [0.185223] Freeing SMP alternatives memory: 36K
> [0.193912] smpboot: Max logical packages: 8
> [0.194017] Switched APIC routing to physical flat.
> [0.196496] ..TIMER: vector=0x30 apic1=0 pin1=2 apic2=-1 pin2=-1
> [0.206252] smpboot: CPU0: Intel(R) Xeon(R) Platinum 8164 CPU @
> 2.00GHz (family: 0x6, model: 0x55, stepping: 0x4)
> [
Re: [PATCH v4 00/10] PCID and improved laziness
> From: Andy Lutomirski
> Date: Thu, Jun 29, 2017 at 8:53 AM
> Subject: [PATCH v4 00/10] PCID and improved laziness
> To: x...@kernel.org
> Cc: linux-kernel@vger.kernel.org, Borislav Petkov, Linus Torvalds,
> Andrew Morton, Mel Gorman, "linux...@kvack.org", Nadav Amit,
> Rik van Riel, Dave Hansen, Arjan van de Ven, Peter Zijlstra,
> Andy Lutomirski
>
>
> *** Ingo, even if this misses 4.13, please apply the first patch before
> *** the merge window.
>
> There are three performance benefits here:
>
> 1. TLB flushing is slow. (I.e. the flush itself takes a while.)
>    This avoids many of them when switching tasks by using PCID. In
>    a stupid little benchmark I did, it saves about 100ns on my laptop
>    per context switch. I'll try to improve that benchmark.
>
> 2. Mms that have been used recently on a given CPU might get to keep
>    their TLB entries alive across process switches with this patch
>    set. TLB fills are pretty fast on modern CPUs, but they're even
>    faster when they don't happen.
>
> 3. Lazy TLB is way better. We used to do two stupid things when we
>    ran kernel threads: we'd send IPIs to flush user contexts on their
>    CPUs and then we'd write to CR3 for no particular reason as an excuse
>    to stop further IPIs. With this patch, we do neither.
>
> This will, in general, perform suboptimally if paravirt TLB flushing
> is in use (currently just Xen, I think, but Hyper-V is in the works).
> The code is structured so we could fix it in one of two ways: we
> could take a spinlock when touching the percpu state so we can update
> it remotely after a paravirt flush, or we could be more careful about
> exactly how we access the state and use cmpxchg16b to do atomic
> remote updates. (On SMP systems without cmpxchg16b, we'd just skip
> the optimization entirely.)
>
> This is still missing a final comment-only patch to add overall
> documentation for the whole thing, but I didn't want to block sending
> the maybe-hopefully-final code on that.
>
> This is based on tip:x86/mm. The branch is here if you want to play:
> https://git.kernel.org/pub/scm/linux/kernel/git/luto/linux.git/log/?h=x86/pcid
>
> In general, performance seems to exceed my expectations. Here are
> some performance numbers copy-and-pasted from the changelogs for
> "Rework lazy TLB mode and TLB freshness" and "Try to preserve old
> TLB entries using PCID":
>
>

Hi Andy,

I have booted Linus's tree (8fac2f96ab86b0e14ec4e42851e21e9b518bdc55) on
a Skylake server and noticed that it reboots automatically.

When I boot the same kernel with the command line arg "nopcid", it works
fine. Please find below a snippet of dmesg. Please let me know if you
need more info to debug.

[0.00] Kernel command line: BOOT_IMAGE=/boot/vmlinuz-4.13.0+ root=UUID=3b8e9636-6e23-4785-a4e2-5954bfe86fd9 ro console=tty0 console=ttyS0,115200n8
[0.00] log_buf_len individual max cpu contribution: 4096 bytes
[0.00] log_buf_len total cpu_extra contributions: 258048 bytes
[0.00] log_buf_len min size: 262144 bytes
[0.00] log_buf_len: 524288 bytes
[0.00] early log buf free: 212560(81%)
[0.00] PID hash table entries: 4096 (order: 3, 32768 bytes)
[0.00] ------------[ cut here ]------------
[0.00] WARNING: CPU: 0 PID: 0 at arch/x86/mm/tlb.c:245 initialize_tlbstate_and_flush+0x6c/0xf0
[0.00] Modules linked in:
[0.00] CPU: 0 PID: 0 Comm: swapper Not tainted 4.13.0+ #5
[0.00] task: 8960f480 task.stack: 8960
[0.00] RIP: 0010:initialize_tlbstate_and_flush+0x6c/0xf0
[0.00] RSP: :89603e60 EFLAGS: 00010046
[0.00] RAX: 000406b0 RBX: 9f1700a17880 RCX: 8965de60
[0.00] RDX: 008383a0a000 RSI: 0960a000 RDI: 008383a0a000
[0.00] RBP: 89603e60 R08: R09:
[0.00] R10: 89603ee8 R11: R12:
[0.00] R13: 9f1700a0c3e0 R14: 8960f480 R15:
[0.00] FS: () GS:9f1700a0() knlGS:
[0.00] CS: 0010 DS: ES: CR0: 80050033
[0.00] CR2: 9fa7b000 CR3: 008383a0a000 CR4: 000406b0
[0.00] Call Trace:
[0.00] cpu_init+0x206/0x4f0
[0.00] ? __set_pte_vaddr+0x1d/0x30
[0.00] trap_init+0x3e/0x50
[0.00] ? trap_init+0x3e/0x50
[0.00] start_kernel+0x1e2/0x3f2
[0.00] x86_64_start_reservations+0x24/0x26
[0.00] x86_64_start_kernel+0x6f/0x72
[0.00] secondary_startup_64+0xa5/0xa5
[0.00] Code: de 00 48 01 f0 48 39 c7 0f 85 92 00 00 00 48 8b 05 ee e2 ee 00 a9 00 00 02 00 74 11 65 48 8b 05 8b 9d 7c 77 a9 00 00 02 00 75 02 <0f> ff 48 81 e2 00 f0 ff ff 0f 22 da 65 66 c7 05 66 9d 7c 77 00
[0.00] ---[ end trace c258f2d278fe031f ]---
[0.00] Memory:
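For readers following the PCID discussion, the bit layout involved can be sketched in plain C. On x86-64, CR4 bit 17 (PCIDE) enables PCIDs, CR3 bits 11:0 then carry the PCID, and bit 63 of the value written to CR3 asks the CPU not to flush that PCID's TLB entries. The helper names below (pcid_enabled(), build_cr3()) are made up for this sketch and are not kernel functions. Assuming the CR4 value printed in the dmesg snippet (000406b0) is complete, the PCIDE bit would be clear in it; the report itself does not state a cause, so this is only an observation about the bits.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define CR4_PCIDE     (UINT64_C(1) << 17) /* CR4.PCIDE: enable PCIDs */
#define CR3_PCID_MASK UINT64_C(0xfff)     /* CR3[11:0]: PCID when PCIDE=1 */
#define CR3_NOFLUSH   (UINT64_C(1) << 63) /* bit 63 of the MOV-to-CR3 operand */

/* True if the given CR4 value has PCIDs enabled. */
static bool pcid_enabled(uint64_t cr4)
{
	return (cr4 & CR4_PCIDE) != 0;
}

/* Compose a CR3 value from a page-aligned pgd physical address, a PCID,
 * and an optional "do not flush" request. */
static uint64_t build_cr3(uint64_t pgd_pa, uint16_t pcid, bool noflush)
{
	uint64_t cr3;

	assert((pgd_pa & CR3_PCID_MASK) == 0); /* low 12 bits must be free */
	assert(pcid <= CR3_PCID_MASK);

	cr3 = pgd_pa | pcid;
	if (noflush)
		cr3 |= CR3_NOFLUSH;
	return cr3;
}
```

For example, pcid_enabled(0x406b0) is false, while the same value with bit 17 set reports true.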
Re: [PATCH V2 0/3] Use mm_struct and switch_mm() instead of manually
On Tue, 2017-09-05 at 19:21 -0700, Sai Praneeth Prakhya wrote:
> > I get a similar crash on Qemu with linus's master branch and the V2
> > applied on top of it. Here are the details of my test environment:
> >
> > 1. I use the OVMF (EDK2) EFI firmware to launch the kernel:
> > edk2.git/ovmf-x64
> >
> > 2. I used linus's master branch (HEAD - commit:
> > b1b6f83ac938d176742c85757960dec2cf10e468) and applied your v2 on top
> > of the same.
> >
> > 3. I use the following qemu command line to launch the test:
> >
> > # /usr/local/bin/qemu-system-x86_64 --version
> > QEMU emulator version 2.9.50 (v2.9.0-526-g76d20ea)
> > Copyright (c) 2003-2017 Fabrice Bellard and the QEMU Project developers
> >
> > # /usr/local/bin/qemu-system-x86_64 -enable-kvm -net nic -net tap -m
> > $MEMSIZE -nographic -drive file=$DISK_IMAGE,if=virtio,format=qcow2
> > -vga std -boot c -cpu host -kernel $KERNEL -append
> > "crashkernel=$CRASH_MEMSIZE console=ttyS0,115200n81" -initrd
> > $INITRAMFS -bios $OVMF_FW_PATH
> >
> > And here is the crash log:
> >
> > [0.006054] general protection fault: [#1] SMP
> > [0.006459] Modules linked in:
> > [0.006711] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.13.0+ #3
> > [0.007000] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
> > BIOS 0.0.0 02/06/2015
> > [0.007000] task: 81e0f480 task.stack: 81e0
> > [0.007000] RIP: 0010:switch_mm_irqs_off+0x1bc/0x440
> > [0.007000] RSP: :81e03d80 EFLAGS: 00010086
> > [0.007000] RAX: 80007d084000 RBX: RCX:
> > 77ff8000
> > [0.007000] RDX: 7d084000 RSI: 8000 RDI:
> > 00019a00
> > [0.007000] RBP: 81e03dc0 R08: R09:
> > 88007d085000
> > [0.007000] R10: 81e03dd8 R11: 7d095063 R12:
> > 81e5c6a0
> > [0.007000] R13: 81ed4f40 R14: 0030 R15:
> > 0001
> > [0.007000] FS: () GS:88007d40()
> > knlGS:
> > [0.007000] CS: 0010 DS: ES: CR0: 80050033
> > [0.007000] CR2: 88007d754000 CR3: 0220a000 CR4:
> > 000406b0
> > [0.007000] Call Trace:
> > [0.007000] switch_mm+0xd/0x20
> > [0.007000] ? switch_mm+0xd/0x20
> > [0.007000] efi_switch_mm+0x3e/0x4a
> > [0.007000] efi_call_phys_prolog+0x28/0x1ac
> > [0.007000] efi_enter_virtual_mode+0x35a/0x48f
> > [0.007000] start_kernel+0x332/0x3b8
> > [0.007000] x86_64_start_reservations+0x2a/0x2c
> > [0.007000] x86_64_start_kernel+0x178/0x18b
> > [0.007000] secondary_startup_64+0xa5/0xa5
> > [0.007000] ? secondary_startup_64+0xa5/0xa5
> > [0.007000] Code: 00 00 00 80 49 03 55 50 0f 82 7f 02 00 00 48 b9
> > 00 00 00 80 ff 77 00 00 48 be 00 00 00 00 00 00 00 80 48 01 ca 48 09
> > f0 48 09 d0 <0f> 22 d8 0f 1f 44 00 00 e9 47 ff ff ff 65 8b 05 b8 87 fb
> > 7e 89
> > [0.007000] RIP: switch_mm_irqs_off+0x1bc/0x440 RSP: 81e03d80
> > [0.007000] ---[ end trace bfa55bf4e4765255 ]---
> > [0.007000] Kernel panic - not syncing: Attempted to kill the idle task!
> > [0.007000] ---[ end Kernel panic - not syncing: Attempted to kill
> > the idle task!
> >
> > 4. Note though that if I use the EFI_MIXED mode (i.e. 32-bit ovmf
> > firmware and 64-bit x86 kernel) with your patches, the primary kernel
> > boots fine on Qemu:
> >
> > ovmf firmware used in this case - edk2.git/ovmf-ia32
> >
> > 5. Also, if I append 'efi=old_map' to the bootargs (for the failing
> > case in point 3 above), I see the primary kernel boots fine on Qemu as
> > well.
> >
> > Regards,
> > Bhupesh
>
> Hi Bhupesh,
>
> Thanks a lot for the detailed explanation. They are helpful to reproduce
> the issue quickly. From my initial debug, I think that AMD SME +
> efi_mm_struct patches + -cpu host (in qemu) are required to reproduce
> the issue on qemu.
>
> I have tried the following combinations (all tests are on qemu):
> On Linus's tree:
> 1. With SME and efi_mm and -cpu host -> panics
> 2. With SME and efi_mm and !-cpu host -> boots
> 3. With SME and !efi_mm and -cpu host -> boots
> 4. With SME and !efi_mm and !-cpu host -> boots
> 5. With !SME and efi_mm and -cpu host -> boots
> 6. With !SME and efi_mm and !-cpu host -> boots
> 7. With !SME and !efi_mm and -cpu host -> boots
> 8. With !SME and
Re: [PATCH V2 0/3] Use mm_struct and switch_mm() instead of manually
> I get a similar crash on Qemu with linus's master branch and the V2
> applied on top of it. Here are the details of my test environment:
>
> 1. I use the OVMF (EDK2) EFI firmware to launch the kernel:
> edk2.git/ovmf-x64
>
> 2. I used linus's master branch (HEAD - commit:
> b1b6f83ac938d176742c85757960dec2cf10e468) and applied your v2 on top
> of the same.
>
> 3. I use the following qemu command line to launch the test:
>
> # /usr/local/bin/qemu-system-x86_64 --version
> QEMU emulator version 2.9.50 (v2.9.0-526-g76d20ea)
> Copyright (c) 2003-2017 Fabrice Bellard and the QEMU Project developers
>
> # /usr/local/bin/qemu-system-x86_64 -enable-kvm -net nic -net tap -m
> $MEMSIZE -nographic -drive file=$DISK_IMAGE,if=virtio,format=qcow2
> -vga std -boot c -cpu host -kernel $KERNEL -append
> "crashkernel=$CRASH_MEMSIZE console=ttyS0,115200n81" -initrd
> $INITRAMFS -bios $OVMF_FW_PATH
>
> And here is the crash log:
>
> [0.006054] general protection fault: [#1] SMP
> [0.006459] Modules linked in:
> [0.006711] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.13.0+ #3
> [0.007000] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
> BIOS 0.0.0 02/06/2015
> [0.007000] task: 81e0f480 task.stack: 81e0
> [0.007000] RIP: 0010:switch_mm_irqs_off+0x1bc/0x440
> [0.007000] RSP: :81e03d80 EFLAGS: 00010086
> [0.007000] RAX: 80007d084000 RBX: RCX:
> 77ff8000
> [0.007000] RDX: 7d084000 RSI: 8000 RDI:
> 00019a00
> [0.007000] RBP: 81e03dc0 R08: R09:
> 88007d085000
> [0.007000] R10: 81e03dd8 R11: 7d095063 R12:
> 81e5c6a0
> [0.007000] R13: 81ed4f40 R14: 0030 R15:
> 0001
> [0.007000] FS: () GS:88007d40()
> knlGS:
> [0.007000] CS: 0010 DS: ES: CR0: 80050033
> [0.007000] CR2: 88007d754000 CR3: 0220a000 CR4:
> 000406b0
> [0.007000] Call Trace:
> [0.007000] switch_mm+0xd/0x20
> [0.007000] ? switch_mm+0xd/0x20
> [0.007000] efi_switch_mm+0x3e/0x4a
> [0.007000] efi_call_phys_prolog+0x28/0x1ac
> [0.007000] efi_enter_virtual_mode+0x35a/0x48f
> [0.007000] start_kernel+0x332/0x3b8
> [0.007000] x86_64_start_reservations+0x2a/0x2c
> [0.007000] x86_64_start_kernel+0x178/0x18b
> [0.007000] secondary_startup_64+0xa5/0xa5
> [0.007000] ? secondary_startup_64+0xa5/0xa5
> [0.007000] Code: 00 00 00 80 49 03 55 50 0f 82 7f 02 00 00 48 b9
> 00 00 00 80 ff 77 00 00 48 be 00 00 00 00 00 00 00 80 48 01 ca 48 09
> f0 48 09 d0 <0f> 22 d8 0f 1f 44 00 00 e9 47 ff ff ff 65 8b 05 b8 87 fb
> 7e 89
> [0.007000] RIP: switch_mm_irqs_off+0x1bc/0x440 RSP: 81e03d80
> [0.007000] ---[ end trace bfa55bf4e4765255 ]---
> [0.007000] Kernel panic - not syncing: Attempted to kill the idle task!
> [0.007000] ---[ end Kernel panic - not syncing: Attempted to kill
> the idle task!
>
> 4. Note though that if I use the EFI_MIXED mode (i.e. 32-bit ovmf
> firmware and 64-bit x86 kernel) with your patches, the primary kernel
> boots fine on Qemu:
>
> ovmf firmware used in this case - edk2.git/ovmf-ia32
>
> 5. Also, if I append 'efi=old_map' to the bootargs (for the failing
> case in point 3 above), I see the primary kernel boots fine on Qemu as
> well.
>
> Regards,
> Bhupesh

Hi Bhupesh,

Thanks a lot for the detailed explanation. The details are helpful to
reproduce the issue quickly. From my initial debug, I think that AMD SME +
efi_mm_struct patches + -cpu host (in qemu) are required to reproduce
the issue on qemu.

I have tried the following combinations (all tests are on qemu):

On Linus's tree:
1. With SME and efi_mm and -cpu host -> panics
2. With SME and efi_mm and !-cpu host -> boots
3. With SME and !efi_mm and -cpu host -> boots
4. With SME and !efi_mm and !-cpu host -> boots
5. With !SME and efi_mm and -cpu host -> boots
6. With !SME and efi_mm and !-cpu host -> boots
7. With !SME and !efi_mm and -cpu host -> boots
8. With !SME and !efi_mm and !-cpu host -> boots

On Matt's tree (no SME):
1. With efi_mm and -cpu host -> boots
2. With efi_mm and !-cpu host -> boots
3. With !efi_mm and -cpu host -> boots
4. With !efi_mm and !-cpu host -> boots

Summary:
On Matt's tree (next branch), I am unable to reproduce the issue because
it doesn't have the SME patches. On Linus's tree, with the SME patches
(b1b6f83ac938d176742c85757960dec2cf10e468), my patches and the -cpu host
switch enabled in qemu, I was able to reproduce the issue.

Could you please confirm if you are seeing the same behavior? Especially
on real machines (I think this is equivalent to -cpu host on qemu),
because in earlier mails you mentioned that you were able to reproduce
this on Matt's tree, but according to my theory that shouldn't be the
case because Matt's tree doesn't have the SME patches.
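The working theory behind the test matrix above can be written down as a small predicate. This is only a sketch of the hypothesis stated in the mail (panic when SME, the efi_mm patches and -cpu host are all present); the function names `expect_panic` and `count_expected_panics` are made up for illustration and are not part of any patch.

```c
#include <stdbool.h>

/* Hypothesis only: the boot panics when all three conditions hold at once. */
bool expect_panic(bool sme, bool efi_mm, bool cpu_host)
{
	return sme && efi_mm && cpu_host;
}

/* Count how many of the 2^3 combinations the hypothesis marks as a panic. */
int count_expected_panics(void)
{
	int panics = 0;

	for (int mask = 0; mask < 8; mask++) {
		bool sme = mask & 4;
		bool efi_mm = mask & 2;
		bool cpu_host = mask & 1;

		if (expect_panic(sme, efi_mm, cpu_host))
			panics++;
	}
	return panics;
}
```

Under this hypothesis exactly one of the eight Linus's-tree combinations panics, which matches the list above; a reproduction on a tree without the SME patches would contradict it.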
[PATCH V2 1/3] efi: Use efi_mm in x86 as well as ARM
From: Sai Praneeth <sai.praneeth.prak...@intel.com>

Presently, only ARM uses mm_struct to manage efi page tables and efi
runtime region mappings. As this is the preferred approach, let's make
this data structure common across architectures. Especially for x86,
using this data structure improves code maintainability and readability.

Signed-off-by: Sai Praneeth Prakhya <sai.praneeth.prak...@intel.com>
Cc: Lee, Chun-Yi <j...@suse.com>
Cc: Borislav Petkov <b...@alien8.de>
Cc: Tony Luck <tony.l...@intel.com>
Cc: Andy Lutomirski <l...@kernel.org>
Cc: Michael S. Tsirkin <m...@redhat.com>
Cc: Ricardo Neri <ricardo.n...@intel.com>
Cc: Matt Fleming <m...@codeblueprint.co.uk>
Cc: Ard Biesheuvel <ard.biesheu...@linaro.org>
Cc: Ravi Shankar <ravi.v.shan...@intel.com>
---
 drivers/firmware/efi/arm-runtime.c | 9 -
 drivers/firmware/efi/efi.c | 9 +
 include/linux/efi.h | 2 ++
 3 files changed, 11 insertions(+), 9 deletions(-)

diff --git a/drivers/firmware/efi/arm-runtime.c b/drivers/firmware/efi/arm-runtime.c
index 1cc41c3d6315..d6b26534812b 100644
--- a/drivers/firmware/efi/arm-runtime.c
+++ b/drivers/firmware/efi/arm-runtime.c
@@ -31,15 +31,6 @@

 extern u64 efi_system_table;

-static struct mm_struct efi_mm = {
-	.mm_rb = RB_ROOT,
-	.mm_users = ATOMIC_INIT(2),
-	.mm_count = ATOMIC_INIT(1),
-	.mmap_sem = __RWSEM_INITIALIZER(efi_mm.mmap_sem),
-	.page_table_lock = __SPIN_LOCK_UNLOCKED(efi_mm.page_table_lock),
-	.mmlist = LIST_HEAD_INIT(efi_mm.mmlist),
-};
-
 #ifdef CONFIG_ARM64_PTDUMP_DEBUGFS
 #include

diff --git a/drivers/firmware/efi/efi.c b/drivers/firmware/efi/efi.c
index b372aad3b449..3abbb25602bc 100644
--- a/drivers/firmware/efi/efi.c
+++ b/drivers/firmware/efi/efi.c
@@ -55,6 +55,15 @@ struct efi __read_mostly efi = {
 };
 EXPORT_SYMBOL(efi);

+struct mm_struct efi_mm = {
+	.mm_rb = RB_ROOT,
+	.mm_users = ATOMIC_INIT(2),
+	.mm_count = ATOMIC_INIT(1),
+	.mmap_sem = __RWSEM_INITIALIZER(efi_mm.mmap_sem),
+	.page_table_lock = __SPIN_LOCK_UNLOCKED(efi_mm.page_table_lock),
+	.mmlist = LIST_HEAD_INIT(efi_mm.mmlist),
+};
+
 static bool disable_runtime;
 static int __init setup_noefi(char *arg)
 {
diff --git a/include/linux/efi.h b/include/linux/efi.h
index 8269bcb8ccf7..d1f261d2ce69 100644
--- a/include/linux/efi.h
+++ b/include/linux/efi.h
@@ -927,6 +927,8 @@ extern struct efi {
 	unsigned long flags;
 } efi;

+extern struct mm_struct efi_mm;
+
 static inline int
 efi_guidcmp (efi_guid_t left, efi_guid_t right)
 {
--
2.1.4
[PATCH V2 3/3] x86/efi: Use efi_switch_mm() rather than manually twiddling with %cr3
From: Sai Praneeth

Use helper function (efi_switch_mm()) to switch to/from efi_mm. We
switch to efi_mm before calling
1. efi_set_virtual_address_map() and
2. Invoking any efi_runtime_service()

Likewise, we need to switch back to previous mm (mm context stolen by
efi_mm) after the above calls return successfully. We can use the
efi_switch_mm() helper function only with an x86_64 kernel and
"efi=old_map" disabled, because x86_32 and efi=old_map don't use
efi_pgd; rather, they use swapper_pg_dir.

Signed-off-by: Sai Praneeth Prakhya
Cc: Lee, Chun-Yi
Cc: Borislav Petkov
Cc: Tony Luck
Cc: Andy Lutomirski
Cc: Michael S. Tsirkin
Cc: Ricardo Neri
Cc: Matt Fleming
Cc: Ard Biesheuvel
Cc: Ravi Shankar
---
 arch/x86/include/asm/efi.h | 29 ++---
 arch/x86/platform/efi/efi_64.c | 36 +---
 arch/x86/platform/efi/efi_thunk_64.S | 2 +-
 3 files changed, 36 insertions(+), 31 deletions(-)

diff --git a/arch/x86/include/asm/efi.h b/arch/x86/include/asm/efi.h
index 2f77bcefe6b4..23b2137a95e5 100644
--- a/arch/x86/include/asm/efi.h
+++ b/arch/x86/include/asm/efi.h
@@ -1,10 +1,14 @@
 #ifndef _ASM_X86_EFI_H
 #define _ASM_X86_EFI_H

+#include
+#include
+
 #include
 #include
 #include
 #include
+#include

 /*
  * We map the EFI regions needed for runtime services non-contiguously,
@@ -57,14 +61,13 @@ extern u64 asmlinkage efi_call(void *fp, ...);
 #define efi_call_phys(f, args...) efi_call((f), args)

 /*
- * Scratch space used for switching the pagetable in the EFI stub
+ * struct efi_scratch - Scratch space used while switching to/from efi_mm
+ * @phys_stack: stack used during EFI Mixed Mode
+ * @prev_mm: store/restore stolen mm_struct while switching to/from efi_mm
  */
 struct efi_scratch {
-	u64 r15;
-	u64 prev_cr3;
-	pgd_t *efi_pgt;
-	bool use_pgd;
-	u64 phys_stack;
+	u64 phys_stack;
+	struct mm_struct *prev_mm;
 } __packed;

 #define arch_efi_call_virt_setup() \
@@ -73,11 +76,8 @@ struct efi_scratch {
 	preempt_disable(); \
 	__kernel_fpu_begin(); \
 	\
-	if (efi_scratch.use_pgd) { \
-		efi_scratch.prev_cr3 = read_cr3(); \
-		write_cr3((unsigned long)efi_scratch.efi_pgt); \
-		__flush_tlb_all(); \
-	} \
+	if (!efi_enabled(EFI_OLD_MEMMAP)) \
+		efi_switch_mm(&efi_mm); \
 })

 #define arch_efi_call_virt(p, f, args...) \
@@ -85,10 +85,8 @@ struct efi_scratch {

 #define arch_efi_call_virt_teardown() \
 ({ \
-	if (efi_scratch.use_pgd) { \
-		write_cr3(efi_scratch.prev_cr3); \
-		__flush_tlb_all(); \
-	} \
+	if (!efi_enabled(EFI_OLD_MEMMAP)) \
+		efi_switch_mm(efi_scratch.prev_mm); \
 	\
 	__kernel_fpu_end(); \
 	preempt_enable(); \
@@ -130,6 +128,7 @@ extern void __init efi_dump_pagetable(void);
 extern void __init efi_apply_memmap_quirks(void);
 extern int __init efi_reuse_config(u64 tables, int nr_tables);
 extern void efi_delete_dummy_variable(void);
+extern void efi_switch_mm(struct mm_struct *mm);

 struct efi_setup_data {
 	u64 fw_vendor;
diff --git a/arch/x86/platform/efi/efi_64.c b/arch/x86/platform/efi/efi_64.c
index 0bb98c35e178..e0545f56d703 100644
--- a/arch/x86/platform/efi/efi_64.c
+++ b/arch/x86/platform/efi/efi_64.c
@@ -80,9 +80,8 @@ pgd_t * __init efi_call_phys_prolog(void)
 	int n_pgds, i, j;

 	if (!efi_enabled(EFI_OLD_MEMMAP)) {
-		save_pgd = (pgd_t *)read_cr3();
-		write_cr3((unsigned long)efi_scratch.efi_pgt);
-		goto out;
+		efi_switch_mm(&efi_mm);
+		return NULL;
 	}

 	early_code_mapping_set_exec(1);
@@ -152,8 +151,7 @@ void __init efi_call_phys_epilog(pgd_t *save_pgd)
 	pud_t *pud;

 	if (!efi_enabled(EFI_OLD_MEMMAP)) {
-		write_cr3((unsigned long)save_pgd);
-		__flush_tlb_all();
+		efi_switch_mm(efi_scrat
[PATCH V2 0/3] Use mm_struct and switch_mm() instead of manually
From: Sai Praneeth

Presently, in x86, to invoke any efi function like
efi_set_virtual_address_map() or any efi_runtime_service() the code path
typically involves read_cr3() (save previous pgd), write_cr3() (write
efi_pgd) and calling the efi function. Likewise, after returning from the
efi function the code path typically involves read_cr3() (save efi_pgd),
write_cr3() (write previous pgd). We do this a couple of times in the efi
subsystem of the Linux kernel; instead, we can use the helper function
efi_switch_mm() to do this. This improves readability and
maintainability. Also, instead of maintaining a separate struct
"efi_scratch" to store/restore efi_pgd, we can use mm_struct to do this.

I have tested this patch set against LUV (Linux UEFI Validation), so I
think I didn't break any existing configurations. I have tested this
patch set for 1. x86_64, 2. x86_32, 3. Mixed mode with efi=old_map and
for kexec kernel. Please let me know if I have missed any other
configurations.

Changes in V2:
1. Resolve mm_dropping() issue by not mm_dropping()/mm_grabbing() any mm,
   as we are not losing/creating any references.

Sai Praneeth (3):
  efi: Use efi_mm in x86 as well as ARM
  x86/efi: Replace efi_pgd with efi_mm.pgd
  x86/efi: Use efi_switch_mm() rather than manually twiddling with %cr3

 arch/x86/include/asm/efi.h | 29 ++--
 arch/x86/platform/efi/efi_64.c | 52
 arch/x86/platform/efi/efi_thunk_64.S | 2 +-
 drivers/firmware/efi/arm-runtime.c | 9 ---
 drivers/firmware/efi/efi.c | 9 +++
 include/linux/efi.h | 2 ++
 6 files changed, 55 insertions(+), 48 deletions(-)

--
2.1.4
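The save/switch/restore pattern described in this cover letter can be modelled in plain user-space C. This is a minimal sketch, not the code from the patch set: the body of efi_switch_mm() does not appear in the quoted diffs, and every `mock_` name below is a hypothetical stand-in (a variable in place of %cr3, a one-field struct in place of mm_struct).

```c
/* Hypothetical stand-in for mm_struct; only the page-table root is modelled. */
struct mock_mm {
	unsigned long pgd;
};

struct mock_mm *mock_active_mm;	/* models the currently active mm */
struct mock_mm *mock_prev_mm;	/* models efi_scratch.prev_mm */
unsigned long mock_cr3;		/* models the %cr3 register */

/* Models switch_mm(): make 'next' active and load its page-table root. */
void mock_switch_mm(struct mock_mm *next)
{
	mock_active_mm = next;
	mock_cr3 = next->pgd;
}

/* Models the helper: remember the mm being replaced, then switch. */
void mock_efi_switch_mm(struct mock_mm *mm)
{
	mock_prev_mm = mock_active_mm;
	mock_switch_mm(mm);
}

/*
 * Models one runtime-service call: switch to the EFI mm, run the call,
 * switch back. Returns the page-table root the call would have seen.
 */
unsigned long mock_call_in_efi_mm(struct mock_mm *efi_mm)
{
	unsigned long seen;

	mock_efi_switch_mm(efi_mm);		/* setup */
	seen = mock_cr3;			/* firmware call would run here */
	mock_efi_switch_mm(mock_prev_mm);	/* teardown */
	return seen;
}
```

The point of the design is that callers no longer read and write the register themselves; they name the mm they want, and the previous one is recorded in a single place so that teardown can hand it back.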
[PATCH V2 2/3] x86/efi: Replace efi_pgd with efi_mm.pgd
From: Sai Praneeth <sai.praneeth.prak...@intel.com>

Since the previous patch added support for efi_mm, let's handle efi_pgd
through efi_mm and remove global variable efi_pgd.

Signed-off-by: Sai Praneeth Prakhya <sai.praneeth.prak...@intel.com>
Cc: Lee, Chun-Yi <j...@suse.com>
Cc: Borislav Petkov <b...@alien8.de>
Cc: Tony Luck <tony.l...@intel.com>
Cc: Andy Lutomirski <l...@kernel.org>
Cc: Michael S. Tsirkin <m...@redhat.com>
Cc: Ricardo Neri <ricardo.n...@intel.com>
Cc: Matt Fleming <m...@codeblueprint.co.uk>
Cc: Ard Biesheuvel <ard.biesheu...@linaro.org>
Cc: Ravi Shankar <ravi.v.shan...@intel.com>
---
 arch/x86/platform/efi/efi_64.c | 18 +-
 1 file changed, 9 insertions(+), 9 deletions(-)

diff --git a/arch/x86/platform/efi/efi_64.c b/arch/x86/platform/efi/efi_64.c
index 8ff1f95627f9..0bb98c35e178 100644
--- a/arch/x86/platform/efi/efi_64.c
+++ b/arch/x86/platform/efi/efi_64.c
@@ -187,8 +187,6 @@ void __init efi_call_phys_epilog(pgd_t *save_pgd)
 	early_code_mapping_set_exec(0);
 }

-static pgd_t *efi_pgd;
-
 /*
  * We need our own copy of the higher levels of the page tables
  * because we want to avoid inserting EFI region mappings (EFI_VA_END
@@ -197,7 +195,7 @@ static pgd_t *efi_pgd;
  */
 int __init efi_alloc_page_tables(void)
 {
-	pgd_t *pgd;
+	pgd_t *pgd, *efi_pgd;
 	p4d_t *p4d;
 	pud_t *pud;
 	gfp_t gfp_mask;
@@ -225,6 +223,8 @@ int __init efi_alloc_page_tables(void)
 		return -ENOMEM;
 	}

+	efi_mm.pgd = efi_pgd;
+
 	return 0;
 }

@@ -237,6 +237,7 @@ void efi_sync_low_kernel_mappings(void)
 	pgd_t *pgd_k, *pgd_efi;
 	p4d_t *p4d_k, *p4d_efi;
 	pud_t *pud_k, *pud_efi;
+	pgd_t *efi_pgd = efi_mm.pgd;

 	if (efi_enabled(EFI_OLD_MEMMAP))
 		return;
@@ -330,13 +331,12 @@ int __init efi_setup_page_tables(unsigned long pa_memmap, unsigned num_pages)
 	unsigned long pfn, text;
 	struct page *page;
 	unsigned npages;
-	pgd_t *pgd;
+	pgd_t *pgd = efi_mm.pgd;

 	if (efi_enabled(EFI_OLD_MEMMAP))
 		return 0;

-	efi_scratch.efi_pgt = (pgd_t *)__pa(efi_pgd);
-	pgd = efi_pgd;
+	efi_scratch.efi_pgt = (pgd_t *)__pa(pgd);

 	/*
 	 * It can happen that the physical address of new_memmap lands in memory
@@ -400,7 +400,7 @@ static void __init __map_region(efi_memory_desc_t *md, u64 va)
 {
 	unsigned long flags = _PAGE_RW;
 	unsigned long pfn;
-	pgd_t *pgd = efi_pgd;
+	pgd_t *pgd = efi_mm.pgd;

 	if (!(md->attribute & EFI_MEMORY_WB))
 		flags |= _PAGE_PCD;
@@ -501,7 +501,7 @@ void __init parse_efi_setup(u64 phys_addr, u32 data_len)
 static int __init efi_update_mappings(efi_memory_desc_t *md, unsigned long pf)
 {
 	unsigned long pfn;
-	pgd_t *pgd = efi_pgd;
+	pgd_t *pgd = efi_mm.pgd;
 	int err1, err2;

 	/* Update the 1:1 mapping */
@@ -592,7 +592,7 @@ void __init efi_dump_pagetable(void)
 	if (efi_enabled(EFI_OLD_MEMMAP))
 		ptdump_walk_pgd_level(NULL, swapper_pg_dir);
 	else
-		ptdump_walk_pgd_level(NULL, efi_pgd);
+		ptdump_walk_pgd_level(NULL, efi_mm.pgd);
 #endif
 }

--
2.1.4