RE: Why does memblock only refer to E820 table and not EFI Memory Map?
> > On x86 platforms, there are two sources through which kernel learns > > about physical memory in the system namely E820 table and EFI Memory > > Map. Each table describes which regions of system memory is usable by > > kernel and which regions should be preserved (i.e. reserved regions > > that typically have BIOS code/data) so that no other component in the > > system could read/write to these regions. I think they are duplicating > > the information and hence I have couple of questions regarding these > > > > 1. I see that only E820 table is being consumed by kernel [1] (i.e. > > memblock subsystem in kernel) to distinguish between "usable" vs "reserved" > regions. > > Assume someone has called memblock_alloc(), the memblock subsystem > > would service the caller by allocating memory from "usable" regions > > and it knows this *only* from E820 table [2] (it does not check if EFI > > Memory Map also says that this region is usable as well). So, why > > isn't the kernel taking EFI Memory Map into consideration? (I see that > > it does happen only when "add_efi_memmap" kernel command line arg is > > passed i.e. passing this argument updates E820 table based on EFI > > Memory Map) [3]. The problem I see with memblock not taking EFI Memory > > Map into consideration is that, we are ignoring the main purpose for which > > EFI > Memory Map exists. > > https://blog.fpmurphy.com/2012/08/uefi-memory-v-e820-memory.html > Probably above blog can explain some background. Thanks a lot! Dave. The link was helpful, it did explain that Linus and HPA were not very happy with EFI and things were going good with E820 and hence it was given more preference compared to EFI. But sadly, I am not 100% convinced yet :( (just my thoughts) This decision was made a decade ago when EFI wasn't stable. Now that UEFI is the defacto on most of the x86 platforms (and since I believe UEFI is getting better) I am still unable to digest that kernel throws away EFI Memory Map (unless explicitly asked by "add_efi_memap") Regards, Sai
RE: [PATCH v2] x86/efi: fix a -Wtype-limits compilation warning
> Compiling a kernel with W=1 generates this warning, > > arch/x86/platform/efi/quirks.c:731:16: warning: comparison of unsigned > expression >= 0 is always true [-Wtype-limits] > > Fixes: 3425d934fc03 ("efi/x86: Handle page faults occurring while running EFI > runtime services") > Signed-off-by: Qian Cai > --- > > v2: Add a "Fixes" tag. Makes sense. Thanks for the fix Qian Cai. Regards, Sai
RE: kexec crash on OVMF i386 + x86_64 kernel (Re: [PATCH v4] x86/boot: Use efi_setup_data for searching RSDP on kexec-ed kernel)
> Added efi people. > > I remember previously Sai did some efi32 tests for kexec, but I'm not sure if > he > tested EFI32 + 64bit kernel. > > Kexec status is not certain because I'm not sure anyone tesed and reported > issues for that. No.. I didn't test kexec on EFI32 + 64bit kernel and I also haven't tested kexec'ing from kexec'd kernel. Regards, Sai
RE: EFI reboot vs. ACPI reboot (was: Re: [tip:x86/urgent] x86/reboot, efi: Use EFI reboot for Acer TravelMate X514-51T)
> > >> That seems counter-intuitive to me: if no full ACPI hardware is > > >> implemented then we should assume reduced ACPI functionality, i.e. > > >> if > > >the > > >> EFI runtime is otherwise available we should default to it. > > > > > >It's a bit confusing, but my loose understanding is that previous > > >versions of the ACPI spec required system implementors to implement > > >the whole thing; but that's increasingly impractical today, e.g. with > > >ARM systems coming along, which do not gel well with some of the > > >historical x86-rooted design aspects that spilled over into ACPI. The > > >V5 spec introduces reduced mode as an opt-in new feature, but for > > >compatibility with pre-V5 implementations it needs to consider "full > > >hardware" mode as the default. > > > > > >> Feel free to send a patch that makes EFI reboot the default one > > >> under these circumstances, > > > > > >Just to check, you mean: EFI reboot (and shutdown) become the default > > >methods when the machine is booted in EFI mode, and EFI stuff has not > > >been disabled with a kernel parameter? > > >Even when running in full hardware ACPI mode. > > > > > >Thanks > > >Daniel > > > > This, I believe, is known to not work. > > It definitely used to be the case that EFI reboot was unreliable, but I don't > know > the details. Sorry! Ard. I don't have the enough historical knowledge to state that EFI reboot is unreliable on most of the x86 systems but I did come across two systems that had buggy implementation of EFI reboot and hence I have added code to deal with EFI runtime services that cause page faults. I will ask around internally to check if it is a good idea to make EFI reboot the default method when the machine is booted in EFI mode. > I have added Sai to cc, he may be able to provide a bit more context > here, since he added the code to deal with page faults during EFI runtime > service > invocations (which was inspired by a EFI reboot issue IIRC). That feature may > have improved the situation, but I am not confident at all that it fixes all > systems. I believe, the code I added should improve the situation but I am hesitant to say that it fixes *all* the systems because the code addresses only one type of bug namely EFI Reboot accessing illegal memory and I have no idea about other buggy implementations. Regards, Sai
RE: [PATCH 02/10] x86/efi: Return error status if mapping EFI regions fail
> > > efi_map_region() creates VA mappings for an given EFI region using > > > any one of the two helper functions (namely __map_region() and > old_map_region()). > > > These helper functions *could* fail while creating mappings and > > > presently their return value is not checked. Not checking for the > > > return value of these functions might create issues because after > > > these functions return "md->virt_addr" is set to the requested > > > virtual address (so it's assumed that these functions always succeed > > > which is not quite true). This assumption leads to "md->virt_addr" > > > having invalid mapping should any of > > > __map_region() or old_map_region() fail. > > > > > > Hence, check for the return value of these functions and if indeed > > > they fail, turn off EFI Runtime Services forever because kernel > > > cannot prioritize among EFI regions. > > > [...] > > > -void __init old_map_region(efi_memory_desc_t *md) > > > +int __init old_map_region(efi_memory_desc_t *md) > > > { > > > u64 start_pfn, end_pfn, end; > > > unsigned long size; > > > @@ -601,10 +601,14 @@ void __init old_map_region(efi_memory_desc_t > *md) > > > va = efi_ioremap(md->phys_addr, size, > > >md->type, md->attribute); > > > > > > - md->virt_addr = (u64) (unsigned long) va; > > > - if (!va) > > > + if (!va) { > > > pr_err("ioremap of 0x%llX failed!\n", > > > (unsigned long long)md->phys_addr); > > > + return -ENOMEM; > > > + } > > > + > > > + md->virt_addr = (u64)(unsigned long)va; > > > + return 0; > > > > Just wondering, shouldn't the failure path set ->virt_addr to > > something safe, just in case a caller doesn't check the error and relies on > > it? > > > > That's because in this commit we've now changed it from 0 to undefined. > > > > Indeed. We don't usually rely on the value of ->virt_addr when > EFI_RUNTIME_SERVICES is unset, but there is some sysfs code, and perhaps > some other places where we do reference ->virt_addr, and not assigning it at > all > is obviously wrong, and potentially hazardous. > Ok.. makes sense. Do you think md->virt_addr = 0 for fail case is ok? > > > +int __init efi_map_region_fixed(efi_memory_desc_t *md) { return 0; > > > +} > > > > Inline functions should be marked inline ... > > > > > if (efi_va < EFI_VA_END) { > > > - pr_warn(FW_WARN "VA address range overflow!\n"); > > > - return; > > > + pr_err(FW_WARN "VA address range overflow!\n"); > > > + return -ENOMEM; > > > } > > > > > > /* Do the VA map */ > > > - __map_region(md, efi_va); > > > + if (__map_region(md, efi_va)) > > > + return -ENOMEM; > > > + > > > md->virt_addr = efi_va; > > > + return 0; > > > > Same error return problem of leaving ->virt_addr undefined. > > Sure! Will fix it in V2. > > Note that I also fixed up the grammar and readability of the changelog > > - see the updated version below. Thanks for fixing :) > > > > Thanks, > > > > Ingo > > > > => > > Subject: x86/efi: Return error status if mapping of EFI regions fails > > From: Ard Biesheuvel > > Date: Sat, 2 Feb 2019 10:41:11 +0100 > > > > From: Sai Praneeth Prakhya > > > > efi_map_region() creates VA mappings for a given EFI region using one > > of the two helper functions (namely __map_region() and old_map_region()). > > > > These helper functions could fail while creating mappings and > > presently their return value is not checked. > > > > Not checking for the return value of these functions might create > > bugs, because after these functions return "md->virt_addr" is set to > > the requested virtual address (so it's assumed that these functions > > always succeed which is not quite true). This assumption leads to > > "md->virt_addr" having invalid mapping, should any of __map_region() > > or old_map_region() fail. > > > > Hence, check for the return value of these functions and if indeed > > they fail, turn off EFI Runtime Services forever because kernel cannot > > prioritize among EFI regions. > > > > This also fixes the comment "FIXME: add error handling" in > > kexec_enter_virtual_mode(). > > > > Thanks Ingo. > > Sai, could you please respin this and use Ingo's updated version of the commit > log? Sure! I will send a V2 with the mentioned changes. Regards, Sai
RE: [PATCH] x86/efi: Don't unmap EFI boot services code/data regions for EFI_OLD_MEMMAP and EFI_MIXED_MODE
> > Acked-by: Ard Biesheuvel > > > > with the sidenote that I won't be able to test this myself until > > monday at the earliest. > > OK, so I have tested both efi=old_map and mixed mode before and after > applying this patch, using QEMU/KVM with 64-bit and 32-bit builds of OVMF > [respectively] > > efi=old_map is indeed broken before and needs this patch. > > Mixed mode works just fine both before and after applying the patch, but I > suggest we keep this patch as-is and address mixed mode later if needed (I > spotted a couple of things in the boot log that may need some attention but > I'm > not sure if the issue is in Linux or in OVMF) Thanks for testing this Ard. For testing Mixed mode, I have used QEMU/KVM with 64-bit and 32-bit OVMF and mixed mode is broken for me without this patch. I am using an older OVMF build and probably that might be the reason for us seeing different behaviors. Also, kernel config file might have some role to play.. I guess. Regards, Sai
RE: [PATCH] efi: memattr: don't bail on zero VA if it equals the region's PA
> The EFI memory attributes code cross-references the EFI memory map with the > more granular EFI memory attributes table to ensure that they are in sync > before > applying the strict permissions to the regions it describes. > > Since we always install virtual mappings for the EFI runtime regions to which > these strict permissions apply, we currently perform a sanity check on the EFI > memory descriptor, and ensure that the EFI_MEMORY_RUNTIME bit is set, and > that the virtual address has been assigned. > > However, in cases where a runtime region exists at physical address 0x0, and > the > virtual mapping equals the physical mapping, e.g., when running in mixed mode > on x86, we encounter a memory descriptor with the runtime attribute and > virtual address 0x0, and incorrectly draw the conclusion that a runtime region > exists for which no virtual mapping was installed, and give up altogether. The > consequence of this is that firmware mappings retain their read-write-execute > permissions, making the system more vulnerable to attacks. > > So let's only bail if the virtual address of 0x0 has been assigned to a > physical > region that does not reside at address 0x0. > > Signed-off-by: Ard Biesheuvel > --- > drivers/firmware/efi/memattr.c | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/drivers/firmware/efi/memattr.c b/drivers/firmware/efi/memattr.c > index 8f289dbc237c..58452fde92cc 100644 > --- a/drivers/firmware/efi/memattr.c > +++ b/drivers/firmware/efi/memattr.c > @@ -91,7 +91,7 @@ static bool entry_is_valid(const efi_memory_desc_t *in, > efi_memory_desc_t *out) > > if (!(md->attribute & EFI_MEMORY_RUNTIME)) > continue; > - if (md->virt_addr == 0) { > + if (md->virt_addr == 0 && md->phys_addr != 0) { > /* no virtual mapping has been installed by the stub */ > break; > } Thanks for the explanation Ard. This change looks good to me. Presently, on x86, even when the mapping fails we don't have any scenario where in md->phys_addr != 0 && md->virt_addr == 0 Regards, Sai
RE: [PATCH] x86/efi: Don't unmap EFI boot services code/data regions for EFI_OLD_MEMMAP and EFI_MIXED_MODE
> > * Sai Praneeth Prakhya wrote: > > > Commit d5052a7130a6 ("x86/efi: Unmap EFI boot services code/data > > regions from efi_pgd") forgets to take two EFI modes into > > consideration namely EFI_OLD_MEMMAP and EFI_MIXED_MODE. > > So the commit sha1 ended up being this one in tip:efi/core: > > 08cfb38f3ef4: x86/efi: Unmap EFI boot services code/data regions from > efi_pgd Ya.. sorry! I took the sha1 from efi tree. > > > EFI_OLD_MEMMAP is a legacy way of mapping EFI regions into > > swapper_pg_dir using ioremap() and init_memory_mapping(). This feature > > can be enabled by passing "efi=old_map" as kernel command line > > argument. But, > > efi_unmap_pages() unmaps EFI boot services code/data regions *only* > > from efi_pgd and hence cannot be used for unmapping EFI boot services > > code/data regions from swapper_pg_dir. [.] > > + /* > > +* EFI mixed mode has all RAM mapped to access arguments while > making > > +* EFI runtime calls, hence don't unmap EFI boot services code/data > > +* regions. > > +*/ > > + if (!efi_is_native() && IS_ENABLED(CONFIG_EFI_MIXED)) > > + return; > > I suppose old_mmap and mixed mode stopped working altogether after the > unmapping changes? Yes.. that's true. > What are the symptoms, instant reboots, crasher, or some > more benign behavior like non-working runtime EFI functionality? I noticed a kernel panic due to page fault. AFAIK, it happens because efi_pgd is not defined for old_map and mixed mode. Hence efi_unmap_pages() which calls __change_page_attr() unmaps mappings from swapper_pg_dir (please note that this my theory but I haven't checked if this is really true). > > If Ard acks this I'll apply it immediately, as these bugs look like > show-stoppers for > merging the EFI tree into v4.21. > > Thanks, > > Ingo
RE: [PATCH V2 0/3] Fix EFI memory map leaks
> > Although majority of the changes are made to > > drivers/firmware/efi/memmap.c file (which is common across > > architectures), this bug is only limited to x86_64 machines and hence this > > patch > set shouldn't effect any other architectures. > > I will give this a test run on the arm64 test hardware available at my end. I > have > shared some minor comments on the patchset - see respective patch threads for > the same. Thanks a lot! for the comments and taking time to test the patches :) Really appreciate your effort :) Regards, Sai
RE: [PATCH V2 2/3] x86/efi: Fix EFI memory map leaks
> > -static void __init efi_clean_memmap(void) > > +void __init efi_clean_memmap(void) > > { > > - efi_memory_desc_t *out = efi.memmap.map; > > - const efi_memory_desc_t *in = out; > > - const efi_memory_desc_t *end = efi.memmap.map_end; > > - int i, n_removal; > > - > > - for (i = n_removal = 0; in < end; i++) { > > - if (efi_memmap_entry_valid(in, i)) { > > - if (out != in) > > - memcpy(out, in, efi.memmap.desc_size); > > - out = (void *)out + efi.memmap.desc_size; > > - } else { > > + void *out; > > + efi_memory_desc_t *md; > > + unsigned int i = 0, n_removal = 0; > > Perhaps we can use 'unsigned int i = n_removal = 0;', which appears more > readable. Hmm.. that actually throws an compilation error saying that "n_removal is undeclared". So, let's stick with unsigned int i = 0, n_removal = 0; > > > + struct efi_memory_map new_memmap; > > + > > + for_each_efi_memory_desc(md) { > > + if (!efi_memmap_entry_valid(md, i)) > > n_removal++; > > - } > > - in = (void *)in + efi.memmap.desc_size; > > } > > > > - if (n_removal > 0) { > > - u64 size = efi.memmap.nr_map - n_removal; > > + if (n_removal == 0) > > and 'if (!n_removal)' above, to stay in sync with the rest of the code in > this file. Sure! Thanks for the catch. I will fix it. Regards, Sai
RE: [PATCH V2 1/3] efi: Introduce efi_memmap_free() and efi_memmap_unmap_and_free()
> > +/** > > + * efi_memmap_free - Free memory pointed by new_memmap.map > > + * @new_memmap: Structure that describes EFI memory map. > > + * > > + * Memory is freed depending on the type of allocation performed. > > + */ > > +static void __init efi_memmap_free(struct efi_memory_map new_memmap) > > ^^ Its not very clear what you mean by the term 'new_memmap' here. > Also can we pass a pointer to struct efi_memory_map to this function rather > than passing the entire struct. Sure! I will change it. > > +static void __init __efi_memmap_unmap(struct efi_memory_map > > +new_memmap) { > > + if (!new_memmap.late) { > > + unsigned long size; > > + > > + size = new_memmap.desc_size * new_memmap.nr_map; > > + early_memunmap(new_memmap.map, size); > > Nitpick: May be we can avoid the extra local variable size and directly pass > 'new_memmap.desc_size * new_memmap.nr_map' while calling > early_memunmap(). I think the code would still be readable and the calculation > seems self-explanatory. Yes, we could do that but that would make early_memunmap() exceed 80 characters limit. Personally, I feel single line code is more readable when compared to split lines and this local variable would anyways be gone once the function returns. > > +/** > > + * Unmap and free existing EFI Memory Map i.e. efi.memmap */ void > > +__init efi_memmap_unmap_and_free(void) { > > + if (!efi_enabled(EFI_MEMMAP)) { > > + WARN(1, "EFI_MEMMAP is not enabled"); > > Nitpick: Do we really need a WARN() here? May be we can do away with the > same. My thought behind adding a WARN() is to let the developer know that he's using the API in wrong scenario i.e. trying to unmap/free memory map when it's not present. If we just return, the developer might think that his code isn't buggy when he tries to unmap an non-existent EFI memory map. Regards, Sai
RE: [PATCH V2 0/3] Fix EFI memory map leaks
> Since we're being pedantic, it also makes sense to decide now whether 'area' > refers to all [discontiguous] regions or just one of them. I'd say the > former, and > use 'region' for the latter, i.e., an area may be made up of several regions, > but > only one exists of each type. > > > Parameters and local variables should be named unambiguously following > > these concepts as well. There should be no opaque 'new' variables > > *especially* where the primary role of the efi_memmap_insert() API > > call is to add a new *entry* - it should be 'new_efi_map_table' (or > > 'new_table' where it's unambiguous) instead, and new entries added > > should be 'new_entry' - etc. > > > > Once this is reorganized and clarified it should be a lot more easy to > > review 'table freeing' patches. I can help out with patches if there's > > no conceptual objections. I have a question, As this table freeing patches do some cleanup of some efi functions, namely, efi_clean_memmap(), efi_fake_memmap(), efi_arch_mem_reserve(), efi_memmap_install() and __efi_enter_virtual_mode(), I am not very sure if you want to do the name changing now and later review the patches or review the patches and later do the name changing. As you pointed out that reviewing them now is pain, which I agree, but my concern is you might be looking at code and might as well change names that might eventually be gone because of these table freeing patches. Regards, Sai
RE: [PATCH V2 2/3] x86/efi: Fix EFI memory map leaks
> > if (efi_enabled(EFI_MEMMAP)) { > > + /* > > +* efi_clean_memmap() uses memblock_phys_alloc() to allocate > > +* memory for new EFI memmap and hence will work only after > > +* e820__memblock_setup() > > +*/ > > + efi_clean_memmap(); > > efi_fake_memmap(); > > efi_find_mirror(); > > efi_esrt_init(); > > I'd also suggest a namespace cleanup before we do any material > modifications: > > 'efi_esrt_init()' is the proper pattern to follow, it's prefixed by efi_, then > followed by the more generic subsystem (_esrt) and then by the functionality > (_init). This is a good, hierarchical, top-down nomenclature that makes it > easy to > grep for ESRT functionality by typing 'git grep esrt'. > > The same is not true of the memmap functionality: 'git grep efi_memmap_' > doesn't do the right thing. > > So I think this should be renamed to: > > efi_memmap_clean() > efi_memmap_insert() > efi_memmap_free() > efi_memmap_print() > efi_memmap_fake() > etc. > > While at it, there's another area that could be improved: > > - efi_memmap_fake() is a bit weird naming: it's not really 'fake', >presumably the specified table is still very much real, otherwise it >won't result in a working kernel. > >What that functionality *really* is about is user-defined entries. Why >not name it in such a fashion? efi_memmap_init_user_defined() or so? >From the example you gave about efi_esrt_init(), the naming of efi memmap related functions does look messy to me now.. and yes, a namespace clean up might be good. Regards, Sai
RE: [PATCH V2 1/3] efi: Introduce efi_memmap_free() and efi_memmap_unmap_and_free()
> > +static void __init efi_memmap_free(struct efi_memory_map new_memmap) > > +{ > > + phys_addr_t start, end; > > + unsigned long size = new_memmap.nr_map * new_memmap.desc_size; > > + unsigned int order = get_order(size); > > + > > + start = new_memmap.phys_map; > > + end = start + size; > > + if (new_memmap.late) { > > + __free_pages(pfn_to_page(PHYS_PFN(start)), order); > > + return; > > + } > > + > > + if (memblock_free(start, size)) > > + pr_err("Failed to free mem from %pa to %pa\n", , ); > > Why is this rather large structure passed in by value and not by reference? My bad.. I will change it to pass by reference. > Also, 'new_memmap' naming is confusing - by this time this is a rather old > memmap table we are going to free, right? So naming it 'memmap' would be > fine, right? Yes, that should be fine. I will change it. Regards, Sai
RE: [GIT PULL 00/11] EFI updates
> On Thu, 29 Nov 2018 at 19:27, Prakhya, Sai Praneeth > wrote: > > > > Hi Ard, > > > > While building from next branch of efi tree, I noticed the below warning. > > Could > you please check the same on your side? > > > > CC lib/list_debug.o > > drivers/firmware/efi/efi.c: In function 'efi_mem_reserve_persistent': > > drivers/firmware/efi/efi.c:1000:6: warning: unused variable 'rsvsize' [- > Wunused-variable] > > int rsvsize = EFI_MEMRESERVE_SIZE(1); > > ^~~ > > CC lib/bitrev.o > > CC net/core/sock_reuseport.o > > > > Thanks Sai > > Ingo spotted it as well and fixed it up before merging. Great! :) Regards, Sai
RE: [GIT PULL 00/11] EFI updates
Hi Ard, While building from next branch of efi tree, I noticed the below warning. Could you please check the same on your side? CC lib/list_debug.o drivers/firmware/efi/efi.c: In function 'efi_mem_reserve_persistent': drivers/firmware/efi/efi.c:1000:6: warning: unused variable 'rsvsize' [-Wunused-variable] int rsvsize = EFI_MEMRESERVE_SIZE(1); ^~~ CC lib/bitrev.o CC net/core/sock_reuseport.o > -Original Message- > From: Ard Biesheuvel [mailto:ard.biesheu...@linaro.org] > Sent: Thursday, November 29, 2018 9:12 AM > To: linux-efi@vger.kernel.org; Ingo Molnar ; Thomas > Gleixner > Cc: Ard Biesheuvel ; linux-ker...@vger.kernel.org; > Andy Lutomirski ; Arend van Spriel > ; Bhupesh Sharma ; > Borislav Petkov ; Hansen, Dave ; Eric > Snowberg ; Hans de Goede > ; Joe Perches ; Jon Hunter > ; Julien Thierry ; Marc > Zyngier ; Nathan Chancellor > ; Peter Zijlstra ; Prakhya, > Sai Praneeth ; Sedat Dilek > ; YiFei Zhu > Subject: [GIT PULL 00/11] EFI updates > > The following changes since commit > 976b489120cdab2b1b3a41ffa14661db43d58190: > > efi: Prevent GICv3 WARN() by mapping the memreserve table before first use > (2018-11-27 13:50:20 +0100) > > are available in the Git repository at: > > git://git.kernel.org/pub/scm/linux/kernel/git/efi/efi.git tags/efi-next > > for you to fetch changes up to 1d29afdbf7ae878a23627ebee81efcd213f11749: > > efi/x86: earlyprintk - Fix infinite loop on some screen widths (2018-11-28 > 17:58:42 +0100) > Regards, Sai
RE: [PATCH V3 1/3] x86/mm/pageattr: Introduce helper function to unmap EFI boot services
> Ideally, after kernel assumes control of the platform, firmware shouldn't > access > EFI boot services code/data regions. But, it's noticed that this is not so > true in > many x86 platforms. Hence, during boot, kernel reserves EFI boot services > code/data regions [1] and maps [2] them to efi_pgd so that call to > set_virtual_address_map() doesn't fail. > After returning from set_virtual_address_map(), kernel frees the reserved > regions [3] but they still remain mapped. Hence, introduce > kernel_unmap_pages_in_pgd() which will later be used to unmap EFI boot > services code/data regions. > > While at it modify kernel_map_pages_in_pgd() by 1. Adding __init modifier > because it's always used *only* during boot. > 2. Add a warning if it's used after SMP is initialized because it uses >__flush_tlb_all() which flushes mappings only on current CPU. > > Unmapping EFI boot services code/data regions will result in clearing > PAGE_PRESENT bit and it shouldn't bother L1TF cases because it's already > handled by protnone_mask() at arch/x86/include/asm/pgtable-invert.h. > > [1] efi_reserve_boot_services() > [2] efi_map_region() -> __map_region() -> kernel_map_pages_in_pgd() [3] > efi_free_boot_services() > > Signed-off-by: Sai Praneeth Prakhya > Cc: Borislav Petkov > Cc: Ingo Molnar > Cc: Andy Lutomirski > Cc: Dave Hansen > Cc: Bhupesh Sharma > Cc: Thomas Gleixner > Cc: Peter Zijlstra > Cc: Ard Biesheuvel Wasn't sure (and hence didn't) if I should add Acked-by Ingo because this patch changed from V2. Regards, Sai
RE: [PATCH V2 1/2] x86/efi: Unmap EFI boot services code/data regions from efi_pgd
> > /* > > * The testcases use internal knowledge of the implementation that > > shouldn't > > * be exposed to the rest of the kernel. Include these directly here. > > diff --git a/arch/x86/platform/efi/quirks.c > > b/arch/x86/platform/efi/quirks.c index 669babcaf245..fb1c44b11235 > > 100644 > > --- a/arch/x86/platform/efi/quirks.c > > +++ b/arch/x86/platform/efi/quirks.c > > While you are at it, can you please split the EFI part out into a separate > patch? Sure! I will do it in V3. Regards, Sai
RE: [PATCH V2 1/2] x86/efi: Unmap EFI boot services code/data regions from efi_pgd
> > In general it's good to do on list because others can learn from the > > answers as well. > > So you somehow managed to keep everyone and the lists in CC nevertheless. > Did not pay attention to the CC list until I got my reply on the list :) Sorry! my bad.. yes, the conversation did happen on the list and it's by accident. > > > > > > > + retval = __change_page_attr_set_clr(, 0); [] > > > So some questions I have are, > > > 1. What did Peter Z mean here? "How is that not a TLB invalidation bug ?" > > > > __flush_tlb_all() flushes only the mapping on the current CPU. So in a > > SMP environment it's not sufficient. Thanks for the explanation. The issue makes sense to me now. [.] > > I'll have a second look at the whole thing and reply on list. > > So actually for the problem at hand __flush_tlb_all() works, because that is > called way before the secondary CPUs are up. > > Ditto for kernel_map_pages_in_pgd(). Yes, you are right. Both the functions are *always* called before SMP initialization and as you rightly pointed out they should be marked with __init. > But both functions want to be marked > __init and aside of that they both should contain a warning when they are > called > after the secondary CPUs have been brought up. Makes sense too.. I will include these changes in V3. Regards, Sai
RE: [PATCH V2 1/2] x86/efi: Unmap EFI boot services code/data regions from efi_pgd
Hi Thomas and Peter, [off the mailing list because I wasn't sure if it's a good idea to spam others with my questions] > > > + /* > > > + * The typical sequence for unmapping is to find a pte through > > > + * lookup_address_in_pgd() (ideally, it should never return NULL > because > > > + * the address is already mapped) and change it's protections. > > > + * As pfn is the *target* of a mapping, it's not useful while unmapping. > > > + */ > > > + struct cpa_data cpa = { > > > + .vaddr = , > > > + .pgd= pgd, > > > + .numpages = numpages, > > > + .mask_set = __pgprot(0), > > > + .mask_clr = __pgprot(_PAGE_PRESENT | _PAGE_RW), > > > + .flags = 0, > > > + }; > > > + > > > + retval = __change_page_attr_set_clr(, 0); > > > + __flush_tlb_all(); > > > > So this looks like you copied it from kernel_map_pages_in_pgd() which > > has been discussed before to be not sufficient, but it can't be > > changed right now due to locking issues. > > Managed to confuse myself. The place which cannot be changed is a different > one, but still for your call site __flush_tlb_all() might not be sufficient. Since the mappings are changed, I thought a flush_tlb() might be needed. But, (to be honest), I don't completely understand the x86/mm code. So, could you please elaborate the issue more for my better understanding? So some questions I have are, 1. What did Peter Z mean here? "How is that not a TLB invalidation bug ?" 2. I assumed kernel_map_pages_in_pgd() to be a good code but you said that it has some locking issues. So, could you please elaborate more on that. Or, could you please provide me some pointers in the source code that I can take a look at so that I could understand the issue much better. Regards, Sai
RE: [PATCH V2 1/2] x86/efi: Unmap EFI boot services code/data regions from efi_pgd
> > +int kernel_unmap_pages_in_pgd(pgd_t *pgd, unsigned long address, > > + unsigned long numpages) > > +{ > > + int retval; > > + > > + /* > > +* The typical sequence for unmapping is to find a pte through > > +* lookup_address_in_pgd() (ideally, it should never return NULL > because > > +* the address is already mapped) and change it's protections. > > +* As pfn is the *target* of a mapping, it's not useful while unmapping. > > +*/ > > + struct cpa_data cpa = { > > + .vaddr = , > > + .pgd= pgd, > > + .numpages = numpages, > > + .mask_set = __pgprot(0), > > + .mask_clr = __pgprot(_PAGE_PRESENT | _PAGE_RW), > > + .flags = 0, > > + }; > > + > > + retval = __change_page_attr_set_clr(, 0); > > + __flush_tlb_all(); > > So, just to clarify, 'pfn' is kept at 0 here? Might make sense to write it out > explicitly like 'flags', even if it's not used by > __change_page_attr_set_clr(). Sure! Makes sense. I will add it. Regards, Sai
RE: [PATCH] efi: Fix debugobjects warning on efi_rts_work
> -Original Message- > From: Waiman Long [mailto:long...@redhat.com] > Sent: Tuesday, October 23, 2018 7:18 AM > To: Ard Biesheuvel > Cc: linux-efi@vger.kernel.org; linux-ker...@vger.kernel.org; Prakhya, Sai > Praneeth ; Waiman Long > > Subject: [PATCH] efi: Fix debugobjects warning on efi_rts_work > > The commit 9dbbedaa6171 ("efi: Make efi_rts_work accessible to efi page fault > handler") converted efi_rts_work from an auto variable to a global variable. > However, when submitting the work, INIT_WORK_ONSTACK() was still used > causing the following complaint from debugobjects: > > ODEBUG: object ed27b500 is NOT on stack c7d38760, but > annotated. > > Change the macro to just INIT_WORK() to eliminate the warning. > > Fixes: 9dbbedaa6171 ("efi: Make efi_rts_work accessible to efi page fault > handler") Thanks for fixing this :) Looks good to me. Regards, Sai
RE: [PATCH 1/2] x86/efi: Unmap efi boot services code/data regions from efi_pgd
> On 10/21/2018 06:57 PM, Ingo Molnar wrote: > > Does the CPU _ever_ look look at the PFN if the page is > > !_PAGE_PRESENT, for example speculatively? If yes then what is the > > recommended value for the pfn - zero perhaps? > > I'll never say never. :) > > For L1TF[1], we know the CPU did exactly this; it ignored the _PAGE_PRESENT > bit when fetching data from the L1. That's what is worked around with the > gunk > in arch/x86/include/asm/pgtable-invert.h. > > I think Andi plugged the code in here at a low enough level in the page table > manipulation that pageattr.c should inherit it without doing anything > explicit. > But, Sai, you might want to double-check this. > > 1. > https://www.intel.com/content/www/us/en/architecture-and- > technology/l1tf.html Sure! Dave. Thanks for the link. I will take a look at it. Regards, Sai
RE: [PATCH 1/2] x86/efi: Unmap efi boot services code/data regions from efi_pgd
> > The one bit that is odd is the cpa->pfn field - for unmapped pages > > that's totally uninteresting and I'm wondering whether setting it to 0 > > wouldn't be better. > > > > Does the CPU _ever_ look look at the PFN if the page is > > !_PAGE_PRESENT, for example speculatively? If yes then what is the > > recommended value for the pfn - zero perhaps? > > > > This is L1TF, right? Isn't all ones the recommended value? > > I would love to see EFI start treating its mm just like any other mm at some > point, though. That is, it should not modify any mappings in the kernel > range, > and it could use the regular mm modification APIs for the user range. But > maybe > this is a pipe dream. Sounds good to me. I have some more fixes for EFI code and as soon as they are done, I will start looking into this. Regards, Sai
RE: [PATCH 1/2] x86/efi: Unmap efi boot services code/data regions from efi_pgd
> > +int kernel_unmap_pages_in_pgd(pgd_t *pgd, u64 pfn, unsigned long address, > > + unsigned long numpages) > > +{ > > + int retval; > > + > > + struct cpa_data cpa = { > > + .vaddr = , > > + .pfn = pfn, > > + .pgd = pgd, > > + .numpages = numpages, > > + .mask_set = __pgprot(0), > > + .mask_clr = __pgprot(_PAGE_PRESENT | _PAGE_RW), > > + .flags = 0, > > + }; > > + > > + retval = __change_page_attr_set_clr(, 0); > > + __flush_tlb_all(); > > + > > + return retval; > > +} > > That's certainly a creative use of __change_page_attr_set_clr() by EFI used > for > mapping in pages so far (kernel_map_pages_in_pgd()), and now used for > unmapping as well. Doesn't look wrong, just a bit weird as part of CPA. > Haha.. yes.. I copied from kernel_map_pages_in_pgd() > Could you please write the initializer in an easier to read fashion: > > struct cpa_data cpa = { > .vaddr = , > .pfn= pfn, > .pgd= pgd, > .numpages = numpages, > .mask_set = __pgprot(0), > .mask_clr = __pgprot(_PAGE_PRESENT | _PAGE_RW), > .flags = 0, > }; > > ? Sure! > > The one bit that is odd is the cpa->pfn field - for unmapped pages that's > totally > uninteresting and I'm wondering whether setting it to 0 wouldn't be better. > > Does the CPU _ever_ look look at the PFN if the page is !_PAGE_PRESENT, for > example speculatively? If yes then what is the recommended value for the pfn - > zero perhaps? > > Also note that if for whatever reason the PFN range of the EFI boot area gets > hot-unplugged, we'd have outright invalid PFNs - although this is probably > very > unlikely from a platform perspective. > > > +/* > > + * Apart from having VA mappings for efi boot services code/data > > +regions, > > + * (duplicate) 1:1 mappings were also created as a catch for buggy > > +firmware. So, > > + * unmap both 1:1 and VA mappings. > > + */ > > Speling nits: > > - please capitalize 'EFI' consistently. > - s/catch/quirk ? > Sure! I will fix them > BTW., are the 1:1 'boot mappings' a buggy firmware quirk, or something > required by the EFI spec? (or both? ;-) > It's a quirk for buggy firmware. According to EFI spec, EFI Boot Services code/data regions shouldn't be accessed after calling exit_boot_services(). This call is typically performed by bootloader (grub) or efi_stub. > > +static void __init efi_unmap_pages(efi_memory_desc_t *md) { > > + pgd_t *pgd = efi_mm.pgd; > > + u64 pfn = md->phys_addr >> PAGE_SHIFT; > > Note that this md->phys_addr isn't really meaningful once it gets unmapped. > Yes, makes sense. In efi_free_boot_services(), after freeing up the memory and unmapping, a new memory map is created (which has only EFI Runtime regions) and hence we can safely assume that this memory descriptor and md->phys_addr would never be used. > > + > > + if (kernel_unmap_pages_in_pgd(pgd, pfn, md->phys_addr, md- > >num_pages)) > > + pr_err("Failed to unmap 1:1 mapping: PA 0x%llx -> VA > 0x%llx!\n", > > + md->phys_addr, md->virt_addr); > > + > > + if (kernel_unmap_pages_in_pgd(pgd, pfn, md->virt_addr, md- > >num_pages)) > > + pr_err("Failed to unmap VA mapping: PA 0x%llx -> VA > 0x%llx!\n", > > + md->phys_addr, md->virt_addr); > > Please keep pr_err()'s in a single line. (and ignore checkpatch.) > Sure! > > +} > > + > > void __init efi_free_boot_services(void) { > > phys_addr_t new_phys, new_size; > > @@ -415,6 +434,13 @@ void __init efi_free_boot_services(void) > > } > > > > free_bootmem_late(start, size); > > + > > + /* > > +* Before calling set_virtual_address_map(), boot services > > +* code/data regions were mapped as a catch for buggy > firmware. > > +* Unmap them from efi_pgd as they have already been freed. > > +*/ > > + efi_unmap_pages(md); > > Ditto. > > BTW., the ordering here is wrong: we should unmap any virtual aliases from > pagetables _before_ we free the underlying memory. The ordering is probably > harmless in this case but overall a good practice. Sure! Makes sense. I will fix it in V2. Regards, Sai
RE: [PATCH V6 2/2] x86/efi: Add efi page fault handler to recover from page faults caused by the firmware
> On 12 September 2018 at 20:56, Thomas Gleixner wrote: > > On Tue, 11 Sep 2018, Sai Praneeth Prakhya wrote: > >> diff --git a/drivers/firmware/efi/runtime-wrappers.c > b/drivers/firmware/efi/runtime-wrappers.c > >> index b18b2d864c2c..7455277a3b65 100644 > >> --- a/drivers/firmware/efi/runtime-wrappers.c > >> +++ b/drivers/firmware/efi/runtime-wrappers.c > >> @@ -61,6 +61,11 @@ struct efi_runtime_work efi_rts_work; > >> ({ \ > >> efi_rts_work.status = EFI_ABORTED; \ > >> \ > >> + if (!efi_enabled(EFI_RUNTIME_SERVICES)) { \ > >> + pr_info("Aborting! EFI Runtime Services disabled\n"); \ > > > > This probaby wants to be pr_info_ince(). > > Yes.. that makes sense. > > I can fix that up when applying. > Thanks a lot! Ard. > > Other than that: > > > > Reviewed-by: Thomas Gleixner > > Thanks! I will queue this up in efi/next
RE: [PATCH V6 0/2] Add efi page fault handler to recover from page
> > This issue was reported by Al Stone when he saw that reboot via EFI > > hangs the machine. Upon debugging, I found that it's > > efi_reset_system() that's touching memory regions which it shouldn't. > > To reproduce the same behavior, I have hacked OVMF and made > > efi_reset_system() buggy. Along with efi_reset_system(), I have also > > modified get_next_high_mono_count() and set_virtual_address_map(). > > They illegally access both boot time and other efi regions. > > > > Testing the patch set: > > -- > > 1. Download buggy firmware from here [1]. > > 2. Run a qemu instance with this buggy BIOS and boot mainline kernel. > > Add reboot=efi to the kernel command line arguments and after the > > kernel is up and running, type "reboot". The kernel should hang while > rebooting. > > 3. With the same setup, boot kernel after applying patches and the > > reboot should work fine. Also please notice warning/error messages > > printed by kernel. > > > > Did you test these patches with other buggy runtime services? Yes, I did. I have modified efi runtime service GetNextHighMonotonicCount and made it buggy, when invoked from FWTS test suites the efi page fault handler works as expected (i.e. freezing efi_rts_wq and disabling efi runtime services forever). Regards, Sai
RE: [PATCH V5 0/2] Add efi page fault handler to recover from page
Hi All, > Changes from V4 to V5: > -- > 1. Drop config option that enables efi page fault handler, instead make >it default. > 2. Call schedule() in an infinite loop to account for spurious wake ups. > 3. Introduce "NONE" as an efi runtime service function identifier so that >it could be used in efi_recover_from_page_fault() to check if the page >fault was indeed triggered by an efi runtime service. Please note that apart from dropping the config knob and accounting for spurious wake ups, I have added a third change. I realized that when kernel command line option efi=old_map is passed, the efi page fault handler in V4 wouldn't work because, it checks for efi_mm. In the efi page fault handler, I have also added an explicit check for x86_64 because AFAIK, x86_32 bit machines do not suffer from this problem (Boris, please correct me if I am wrong). Also, I was unable to trigger page faults with buggy OVMF for x86_32. Just wanted to state the third change explicitly here so that it doesn't escape reviewers eyes. Regards, Sai
RE: [PATCH V4 0/3] Add efi page fault handler to recover from page
> Thanks Sai for this work. I think this a step in the right direction. > I tested this on qemu x86_64 with OVMF firmware modified to access some > random address in the EFI_Reserved_Region. I was able to reboot the qemu > instance successfully with the patches (see logs below) while without the > patchset, reboot earlier used to get stuck. > > So, feel free to add: > Tested-by: Bhupesh Sharma > Thanks a lot Bhupesh, for trying the patches and as you said, the patches need a lot more testing on real machines. > Qemu Console Logs: > --- > > # reboot > > > > [ 11.44] [ cut here ] > [ 11.400137] [Firmware Bug]: Page fault caused by firmware at PA: 0x7e924100 > [ 11.400484] WARNING: CPU: 0 PID: at > arch/x86/platform/efi/quirks.c:691 > efi_recover_from_page_fault+0x3b/0xf0 > [ 11.400751] Modules linked in: > [ 11.400992] CPU: 0 PID: Comm: init Not tainted 4.18.0-rc5+ #1 > [ 11.401146] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), > BIOS 0.0.0 02/06/2015 > [ 11.401397] RIP: 0010:efi_recover_from_page_fault+0x3b/0xf0 [snipped stack trace] > [ 11.410378] ? entry_SYSCALL_64_after_hwframe+0x44/0xa9 > [ 11.410554] ---[ end trace ad3d0a220a88a45b ]--- > [ 11.410742] efi: efi_reset_system() buggy! Reboot through BIOS Thanks for the log, it looks good to me.
RE: [PATCH V4 3/3] x86/efi: Introduce EFI_PAGE_FAULT_HANDLER
> > > Why make this optional? > > > > I made it as a config option in RFC because the page fault handler was > > complicated and touching many parts (it had lots of code change and I > > didn't want to break any existing functionality). Now that it's > > simple, I don't think we need the config option. > > > > Without efi page fault handler, any page fault caused by firmware > > should panic kernel but with this patch I think we are just improving > > existing > condition (ideally). > > > > So, if Thomas, Ingo, Andy, Ard and Boris are ok.. I will make it as > > default (i.e. without config). > > > > Regards, > > Sai > > > Also, some distributions already have specific ways to handle buggy firmwares > which can be at times dependent on the underlying hardware and the firmware > versions. > > I would suggest that we enable this under a CONFIG for the first round and > once > it is tested with wider variety of x86 machines which have buggy or orphaned > firmware and linux (and reboot) works fine with them, we can drop the CONFIG > option in future and enable this by default. Sounds fair to me, but, I would like to wait for someone experienced to make the final call. Regards, Sai
RE: [PATCH V4 0/3] Add efi page fault handler to recover from page
> > The efi page fault handler will check if the access is by > > efi_reset_system(). > > 1. If so, then the efi page fault handler will reboot the machine > >through BIOS and not through efi_reset_system(). > > 2. If not, then the efi page fault handler will freeze efi_rts_wq and > >schedules a new process. > > > > Thanks Sai! I am pretty happy how this patch set turned out. It still > requires the > blessing of the x86 maintainers, of course, but from my pov, this is good to > go > (but I will fold patch #3 into #2) Hopefully, patch #3 goes away too.. in V5. Regards, Sai
RE: [PATCH V4 2/3] x86/efi: Add efi page fault handler to recover from page faults caused by the firmware
> > + if (efi_rts_work.efi_rts_id == RESET_SYSTEM) { > > + pr_info("efi_reset_system() buggy! Reboot through BIOS\n"); > > + machine_real_restart(MRR_BIOS); > > + return 0; > > + } > > + > > + /* Firmware has caused page fault, hence, freeze efi_rts_wq. */ > > + set_current_state(TASK_UNINTERRUPTIBLE); > > This doesn't freeze it, as such, it just sets the state. True! Thanks for pointing it out. I will update the comment. > > + > > + /* > > +* Before calling EFI Runtime Service, the kernel has switched the > > +* calling process to efi_mm. Hence, switch back to task_mm. > > +*/ > > + arch_efi_call_virt_teardown(); > > + > > + /* Signal error status to the efi caller process */ > > + efi_rts_work.status = EFI_ABORTED; > > + complete(_rts_work.efi_rts_comp); > > + > > + clear_bit(EFI_RUNTIME_SERVICES, ); > > + pr_info("Froze efi_rts_wq and disabled EFI Runtime Services\n"); > > > + schedule(); > > So what happens when we get a spurious wakeup and return from this? > > Quite possibly you want something like: > > for (;;) { > set_current_state(TASK_IDLE); > schedule(); > } > > here. The TASK_UNINTERRUPTIBLE thing will cause the load-avg to spike; is that > what you want? Yes, makes sense. TASK_IDLE seems more appropriate. I will change it. Regards, Sai
RE: [PATCH V4 3/3] x86/efi: Introduce EFI_PAGE_FAULT_HANDLER
> > There may exist some buggy UEFI firmware implementations that might > > access efi regions other than EFI_RUNTIME_SERVICES_ even > > after the kernel has assumed control of the platform. This violates > > UEFI specification. > > > > If selected, this debug option will print a warning message if the > > UEFI firmware tries to access any memory region which it shouldn't. > > Along with the warning, the efi page fault handler will also try to > > recover from the page fault triggered by the firmware so that the > > machine doesn't hang. > > Why make this optional? I made it as a config option in RFC because the page fault handler was complicated and touching many parts (it had lots of code change and I didn't want to break any existing functionality). Now that it's simple, I don't think we need the config option. Without efi page fault handler, any page fault caused by firmware should panic kernel but with this patch I think we are just improving existing condition (ideally). So, if Thomas, Ingo, Andy, Ard and Boris are ok.. I will make it as default (i.e. without config). Regards, Sai
RE: [PATCH V3 3/5] x86/efi: Permanently save the EFI_MEMORY_MAP passed by the firmware
> >> I agree. Keep it simple. If the EFI crap fails, then assist with the > >> reboot and otherwise just kill it. > > > > The reasons for saving old memory map are (in my view, these are the > > less important ones because they are very unlikely to happen) > > > > 1. Make sure that a memory descriptor exists for the physical address > > that was faulted on (EFI Memory Map could sometime have holes). > > Assuming a case that the physical address that caused page fault > > doesn't have a valid efi memory descriptor, the efi page fault handler > > shouldn't > take any action because it hasn't triaged the problem yet. > > > > 2. Make sure that the faulted physical address is _not_ efi runtime service > code/data region. > > Efi runtime service code/data regions are always mapped by kernel in > > efi_pgd and accesses to these regions should _never_ page fault. > > Assuming that something like this happens, efi page fault handler > > shouldn't take any action because it's not any illegal access by firmware > > but it's > a kernel bug. > > > > What about attempts to modify code regions or attempts to execute data > regions? What kind of fault will that trigger, and are they being handled at > the > moment? AFAIK, at least in the x86 world, attempts to write to read only regions or attempts to execute XP (execute protected) pages will result in page fault and I don't think we are handling them. > > As I pointed out, EFI runtime services code may legally access the stack or > dereference pointer arguments, but could still contain bugs that result in > out of > bounds accesses or writes to read-only regions. Yes, agreed. In fact, I did see these bugs. > So I don't really care about the address of the illegal access, any fault that > occurs while running in the firmware should be treated the same. Ok.. makes sense. > In fact, cross > referencing the value of IP with RuntimeServicesCode regions may be more > useful This is to verify that firmware is indeed executing code from RuntimeServicesCode regions when it faulted. Is that correct? Or did you mean something else? > > That said, a more important reason (in my view) is to print out the > > memory descriptor that we faulted on. This is a *proof* showing that > > it's buggy firmware that caused page fault and hence is not a kernel > > bug. This proof is important because whenever a stack trace is printed > > with some efi function, kernel is the usual suspect and hence we need to > > show > that it's not kernel fault. It could also help firmware engineers to fix the > bug > easily. > > > > dmesg would show something like this when buggy efi_reset_system() > accesses reserved region: > > > > [ 296.141511] efi: EFI Memory Descriptor for offending PA is: > > [ 296.141844] efi: [Reserved | | | | | | | | > > |WB|WT|WC|UC] > range=[0x7e915000-0x7e933fff] (0MB) > > [ 296.142522] efi: efi_reset_system() buggy! Reboot through BIOS > > > > So, I would be concerned if we miss this proof. > > > > You can dump the entire memory map by putting efi=debug on the kernel > command line, so all we need to do is report the physical address, and you can > easily figure out for yourself which memory map entry covers it. That's true. In fact, that's how I debugged this issue and hence thought that it might be useful to have all that info at one place (i.e. in efi page fault handler). But, as you said, to make the code look simpler, I will roll out a V4 without saving original memory map. Regards, Sai
RE: [PATCH V3 3/5] x86/efi: Permanently save the EFI_MEMORY_MAP passed by the firmware
> On Wed, 5 Sep 2018, Ard Biesheuvel wrote: > > On 5 September 2018 at 14:56, Peter Zijlstra wrote: > > > On Wed, Sep 05, 2018 at 02:27:49PM +0200, Ard Biesheuvel wrote: > > >> Would we still need to preserve the old memory map in that case? > > > > > > I thought the reason for having this was being able to know the > > > fault is in an EFI area. But of course, I'm not wel versed in this > > > whole EFI crapola. > > > > I'm not entirely sure whether that really matters. The EFI services > > access the stack and can access byref/pointer arguments which are not > > covered by the EFI memory map as runtime services code/data, and so > > they can trigger page faults by running off the vmapped stack or > > writing to const byref arguments. > > > > EFI runtime services using boot services regions after they are no > > longer available are a known source of headaches, but I don't see why > > we should restrict ourselves to such cases if we bother to wire up > > fault handling specifically for EFI services calls. > > > > So any page or permission fault occurring in the context of a EFI > > runtime services invocation should be treated the same, I think. > > I agree. Keep it simple. If the EFI crap fails, then assist with the reboot > and > otherwise just kill it. The reasons for saving old memory map are (in my view, these are the less important ones because they are very unlikely to happen) 1. Make sure that a memory descriptor exists for the physical address that was faulted on (EFI Memory Map could sometime have holes). Assuming a case that the physical address that caused page fault doesn't have a valid efi memory descriptor, the efi page fault handler shouldn't take any action because it hasn't triaged the problem yet. 2. Make sure that the faulted physical address is _not_ efi runtime service code/data region. Efi runtime service code/data regions are always mapped by kernel in efi_pgd and accesses to these regions should _never_ page fault. Assuming that something like this happens, efi page fault handler shouldn't take any action because it's not any illegal access by firmware but it's a kernel bug. Generally, the above two scenarios should never happen. I am just being paranoid and wanted to make sure that the efi page fault handler is fixing the right firmware bug that I came across and not something else. I also agree that, we could make the patch set and efi page fault handler much simpler by not saving old memory map. So, I am OK if we are not checking for the above two scenarios. If they are really needed, we could add them later. That said, a more important reason (in my view) is to print out the memory descriptor that we faulted on. This is a *proof* showing that it's buggy firmware that caused page fault and hence is not a kernel bug. This proof is important because whenever a stack trace is printed with some efi function, kernel is the usual suspect and hence we need to show that it's not kernel fault. It could also help firmware engineers to fix the bug easily. dmesg would show something like this when buggy efi_reset_system() accesses reserved region: [ 296.141511] efi: EFI Memory Descriptor for offending PA is: [ 296.141844] efi: [Reserved | | | | | | | | |WB|WT|WC|UC] range=[0x7e915000-0x7e933fff] (0MB) [ 296.142522] efi: efi_reset_system() buggy! Reboot through BIOS So, I would be concerned if we miss this proof. Regards, Sai
RE: [PATCH V3 2/5] efi: Introduce __efi_init attribute
Hi Boris and Ard, > Buggy firmware could illegally access some efi regions even after the kernel > has > assumed control of the platform. When > "CONFIG_EFI_WARN_ON_ILLEGAL_ACCESS" is enabled, the efi page fault > handler will detect and recover from these illegal accesses. > efi_md_typeattr_format() and memory_type_name are used by the efi page > fault handler to print information about memory descriptor that was illegally > accessed. As the page fault handler is present during/after kernel boot it > doesn't > have an __init attribute, but > efi_md_typeattr_format() has it and thus during kernel build, "WARNING: > modpost: Found * section mismatch(es)" build warning is observed. To fix it, > remove __init attribute for efi_md_typeattr_format(). > > In order to not keep efi_md_typeattr_format() and memory_type_name > needlessly when "CONFIG_EFI_WARN_ON_ILLEGAL_ACCESS" is not selected, > add a new __efi_init attribute whose value changes based on whether the config > option is selected or not. In previous versions (i.e. up to V2), where we handled EFI_BOOT_SERVICES_ regions differently, it made sense to have a separate attribute like __efi_init because many function definitions were modified. From V3, do you think it's still OK to have __efi_init or should I just remove __init attribute (and not have __efi_init) for efi_md_typeattr_format() and memory_type_name because we are just modifying two. Regards, Sai
RE: [PATCH V2 4/6] x86/efi: Add efi page fault handler to fixup/recover from page faults caused by firmware
> > The efi specific page fault handler offers us two advantages: > > 1. Avoid panics/hangs caused by buggy firmware. > > 2. Shout loud that the firmware is buggy and hence is not a kernel bug. > > > > Finally, this new mapping will not impact a reboot from kexec, as > > kexec is only concerned about runtime memory regions. > > No. This is just a horrible hack to make completely bogus firmware work and > never fixed. > Yes, that's true. > The proper thing to do is to have a minimal page fault handler which does: > > 1) Yell loudly if that ever happens > > 2) Handles the reboot request gracefully > > 3) Freeze and disable the EFI mess for all other cases > > That does not require any hackery to make these mappings work from atomic > context and keeps the mess confined to the EFI code where it belongs. > > Ideally we just blacklist the offending system and be done with it. This makes sense to me. I will implement the above suggested and as said should avoid the need for making mappings work from atomic context. Regards, Sai
RE: [PATCH V2 5/6] x86/mm: If in_atomic(), allocate pages without sleeping
> Thanks for CC'ing me. I wonder why the world and some more is on CC, but > x...@kernel.org is NOT. > Sorry! about that. I will CC you and x86 from V3. > > On Sun, Sep 02, 2018 at 02:46:33AM -0700, Sai Praneeth Prakhya wrote: > > > > @@ -926,7 +926,13 @@ static void unmap_pud_range(p4d_t *p4d, > > > unsigned long start, unsigned long end) > > > > > > static int alloc_pte_page(pmd_t *pmd) { > > > - pte_t *pte = (pte_t *)get_zeroed_page(GFP_KERNEL); > > > + pte_t *pte; > > > + > > > + if (in_atomic()) > > > + pte = (pte_t *)get_zeroed_page(GFP_ATOMIC); > > > + else > > > + pte = (pte_t *)get_zeroed_page(GFP_KERNEL); > > > + > > > if (!pte) > > > return -1; > > > > > > > This looks like tinkering to me.. > > Yes it is and it's not going to happen. Haha.. ok.. I will take your suggestion in the 4th patch and that should avoid this. Regards, Sai
RE: [PATCH V2 2/6] x86/efi: Remove __init attribute from memory mapping functions
> On Mon, Sep 03, 2018 at 05:03:56AM +0000, Prakhya, Sai Praneeth wrote: > > Hmm.. thought that __efi_init might be confusing with the normal > > __init attribute > > How would that be confusing? It has "__efi" prepended?! > > All I'm saying is, if you're going to define your own function attributes, do > them > generic and short. "_fixup" is too specific IMO. It also enlarges function > definitions unnecessarily. > Yes.. agreed that the function definitions did enlarge. > With "__efi_init" you already denote that it is a special attribute which has > relevance in the EFI code only. What you do about it - the > *fixup* - is the thing you do with the attribute. But you don't have to have > the > "what you do" in the attribute name too. Ok.. makes sense. Will roll a V3 with __efi_init. Regards, Sai
RE: [PATCH V2 2/6] x86/efi: Remove __init attribute from memory mapping functions
> On Sun, Sep 02, 2018 at 02:46:30AM -0700, Sai Praneeth Prakhya wrote: > > In order to not keep these functions needlessly when > > "CONFIG_EFI_WARN_ON_ILLEGAL_ACCESS" is not selected, add a new > > __efi_init_fixup attribute whose value changes based on whether the > > Why not just __efi_init{,data} ? Hmm.. thought that __efi_init might be confusing with the normal __init attribute and hence added "fixup", so that it's explicit that we are trying to fixup the normal __init attribute. __efi_init_data doesn't seem appropriate because we are fixing both functions and data. Regards, Sai
RE: [PATCH V1 6/6] x86/efi: Introduce EFI_WARN_ON_ILLEGAL_ACCESSES
> > +config EFI_WARN_ON_ILLEGAL_ACCESSES > > How about shortening the name to just 'EFI_WARN_ON_ILLEGAL_ACCESS'? > Makes sense.. I will shorten it. > > + bool "Warn about illegal memory accesses by firmware" > > + depends on EFI > > From a distribution p-o-v, I would suggest that we set this CONFIG option > only if > we are in the EXPERT mode, as this need more thrashing with the behaviour we > see on old, buggy EFI firmwares available on very old x86 machines. So > something like: >bool "Warn about illegal memory accesses by firmware" if EXPERT > Agreed. Although the feature is intended to warn about all these buggy firmware's instead of silently working around (as we are doing presently), it needs more testing. Hence, as you said, I think it's safe to have it as an EXPERT config option. > > + help > > + Enable this debug feature so that the kernel can detect illegal > > + memory accesses by firmware and issue a warning. Also, > > + 1. If the illegally accessed region is > > EFI_BOOT_SERVICES_, > > +the kernel fixes it up by mapping the requested region. [] > > diff --git a/init/main.c b/init/main.c index > > 3b4ada11ed52..dce0520861a1 100644 > > --- a/init/main.c > > +++ b/init/main.c > > @@ -730,7 +730,8 @@ asmlinkage __visible void __init start_kernel(void) > > arch_post_acpi_subsys_init(); > > sfi_init_late(); > > > > - if (efi_enabled(EFI_RUNTIME_SERVICES)) { > > + if (efi_enabled(EFI_RUNTIME_SERVICES) && > > + !IS_ENABLED(CONFIG_EFI_WARN_ON_ILLEGAL_ACCESSES)) { > > Since this is an arch agnostic code leg, do we really want to introduce a > generic > check for CONFIG_EFI_WARN_ON_ILLEGAL_ACCESSES > here, without checking for whether we are actually running this on a > x86 hardware, or alternatively we can consider moving > CONFIG_EFI_WARN_ON_ILLEGAL_ACCESSES out of 'arch/x86/Kconfig' to > 'arch/Kconfig' so that later it can be used on other archs like ARM64 as well. > > I think the later would be more useful.. Thanks for bringing this up. I see your point. I will refrain from polluting architecture agnostic code. As we don't need efi_free_boot_services() at all, if EFI_WARN_ON_ILLEGAL_ACCESSES is enabled, probably making changes to include/linux/efi.h would be better, I guess.. Moving this to arch/Kconfig could be too futuristic.. I guess.. because I am not sure if this problem exists on ARM machines, if so, probably Ard could suggest something here. Regards, Sai
RE: [PATCH RFC 0/8] Add efi page fault handler to fix/recover from
> > 1. Since we don't use "efi_rts_wq" for efi_reset_system(), the above > > solution doesn't work if efi_reset_system() is buggy. We cannot > > neglect it either because that's the issue faced by Al Stone and hence > > I started the patch set. So, I am thinking to use long jump technique > > *only* for > efi_reset_system(). Would you be OK with that? > > Or please propose a new solution. > > The long jump thing is super ugly. I'd be okay with it if it were somehow > factored out into a testable mechanism, but it's a nasty complicated bit of > assembly, it will only be used very very rarely, and it more or less > duplicates > switch_to(). Yes, I agree that it's complicated assembly. But I tested it by hacking OVMF. And.. you are correct that it's very very rarely used.. so probably there is no point adding complicated code for it. > > Could you just do a fallback reboot from the fault handler if > efi_reset_system() fails? > Sure! I will try that and will update you on how that goes. Hopefully, well. > > > > 2. Also, as suggested, I haven’t used a dedicated "efi_kthread", I got > > this working with the existing "efi_rts_wq" setup. Please note that, > > the existing "efi_rts_wq", has only one worker thread. Could you > > please brief me on why we should have a dedicated kthread and not leverage > "efi_rts_wq" (I am unclear on this). > > Two reasons: > > 1. If you used a dedicated thread, then you could probably just set that > thread's - > >mm to the EFI mm rather than manually switching the mm. That means, we don't need efi_switch_mm() anymore and could just get away by changing efi_kthread ->mm = efi_mm. Is that what you meant? Just wanted to make sure, I understood you correctly. > > 2. If you used a dedicated thread, you would be a lot closer to being able to > call > into EFI at CPL 3. In x86 Linux (and probably most or all other arches), the > kernel > never calls into CPL 3. Instead it > *returns* into CPL 3. So you would set up a thread and make that thread's > task_pt_regs() have the correct context for a little stub that calls into EFI > and > then does "syscall". Then you would return all the way out of the kernel > into the > stub. I got it.. not 100% though.. but I will surely work on this because it sounds very interesting to me. But still, I have a question, this article https://lwn.net/Articles/23634/ says that "each workqueue has one or more dedicated worker threads (one per CPU, by default) associated with it" So, do you think we still need our own efi dedicated kthread? Regards, Sai N�r��y���b�X��ǧv�^�){.n�+{�y��^n�r��z���h&���G���h�(�階�ݢj"���m�z�ޖ���f���h���~�m�
RE: [PATCH RFC 0/8] Add efi page fault handler to fix/recover from
> > As discussed in previous mails, I have implemented the below: > > > > If kernel detects an illegal access by firmware to any efi region > > other than efi boot time code/data, a. It freezes "efi_rts_wq" (efi > > runtime services work queue). > > b. Signals the requested process that an error occurred. > > c. Disables EFI Runtime Services forever. > > d. Explicitly calls the scheduler to schedule another process. > > > > But, I have couple of questions: > > > > 1. Since we don't use "efi_rts_wq" for efi_reset_system(), the above > > solution doesn't work if efi_reset_system() is buggy. We cannot > > neglect it either because that's the issue faced by Al Stone and hence > > I started the patch set. So, I am thinking to use long jump technique > > *only* for > efi_reset_system(). Would you be OK with that? > > Or please propose a new solution. > > > > What would be the point of that? Can you reasonably expect the kernel to still > run reliably after it has taken itself down, called > efi_reset_system() and failed? > The problem with efi_reset_system() being buggy is, If user calls "reboot", kernel would call efi_reset_system() and if unhandled, kernel hangs and never reboots. But, if handled and returning error (either using long jump or some other technique), kernel could continue rebooting through other fall backs. I think, this is true only for x86 though. Please note that native_machine_emergency_restart() would call "BOOT_BIOS", if efi_reboot() returns in arch/x86/kernel/reboot > > 2. Also, as suggested, I haven’t used a dedicated "efi_kthread", I got > > this working with the existing "efi_rts_wq" setup. Please note that, > > the existing "efi_rts_wq", has only one worker thread. Could you > > please brief me on why we should have a dedicated kthread and not leverage > "efi_rts_wq" (I am unclear on this). > > > > Does every work queue have its own dedicated kernel thread? Hmm.. I think so.. I referred to the below articles while working on "efi_rts_wq" They say 1. Each workqueue has one or more dedicated worker threads (one per CPU, by default) associated with it. 2. Tasks running out of a special-purpose workqueue can sleep indefinitely, since they will not interfere with tasks in any other queue. [1] https://lwn.net/Articles/23634/ [2] https://lwn.net/Articles/11360/ Regards, Sai
RE: [PATCH RFC 0/8] Add efi page fault handler to fix/recover from
Hi All, > > >> The main problem is that we have just merged Sai's code to use a > > >> work queue for invoking EFI services, and killing kworker threads > > >> is obviously not going to make EFI any new friends. > > >> > > >> So I guess we should probably rework that code to use a dedicated > > >> kthread, and just freeze it when it performs an illegal memory > > >> access, and signal the completion in a way that makes the calling > > >> thread understand that a) the call failed and b) runtime services > > >> are no longer > > available. > > > > > > Yes, this makes sense to me. > > > Initially I did use a dedicated kthread for efi but moved to work > > > queues later so > > that the synchronization is well handled. I am ok to rework on the > > patches, could we ask Ingo to hold onto efi_workqueue patches? > > > > > > > I am fine with keeping them. We will have a different approach in > > v4.19 than in subsequent kernels, but the workqueue approach is still > > better than nothing at all. As discussed in previous mails, I have implemented the below: If kernel detects an illegal access by firmware to any efi region other than efi boot time code/data, a. It freezes "efi_rts_wq" (efi runtime services work queue). b. Signals the requested process that an error occurred. c. Disables EFI Runtime Services forever. d. Explicitly calls the scheduler to schedule another process. But, I have couple of questions: 1. Since we don't use "efi_rts_wq" for efi_reset_system(), the above solution doesn't work if efi_reset_system() is buggy. We cannot neglect it either because that's the issue faced by Al Stone and hence I started the patch set. So, I am thinking to use long jump technique *only* for efi_reset_system(). Would you be OK with that? Or please propose a new solution. 2. Also, as suggested, I haven’t used a dedicated "efi_kthread", I got this working with the existing "efi_rts_wq" setup. Please note that, the existing "efi_rts_wq", has only one worker thread. Could you please brief me on why we should have a dedicated kthread and not leverage "efi_rts_wq" (I am unclear on this). Regards, Sai N�r��y���b�X��ǧv�^�){.n�+{�y��^n�r��z���h&���G���h�(�階�ݢj"���m�z�ޖ���f���h���~�m�
RE: [PATCH RFC 0/8] Add efi page fault handler to fix/recover from
> >> The main problem is that we have just merged Sai's code to use a work > >> queue for invoking EFI services, and killing kworker threads is > >> obviously not going to make EFI any new friends. > >> > >> So I guess we should probably rework that code to use a dedicated > >> kthread, and just freeze it when it performs an illegal memory > >> access, and signal the completion in a way that makes the calling > >> thread understand that a) the call failed and b) runtime services are no > >> longer > available. > > > > Yes, this makes sense to me. > > Initially I did use a dedicated kthread for efi but moved to work queues > > later so > that the synchronization is well handled. I am ok to rework on the patches, > could > we ask Ingo to hold onto efi_workqueue patches? > > > > I am fine with keeping them. We will have a different approach in > v4.19 than in subsequent kernels, but the workqueue approach is still better > than > nothing at all. Makes sense to me.
RE: [PATCH RFC 0/8] Add efi page fault handler to fix/recover from
> > The basic idea is that, when you get an exception from a context that > > is busted (i.e. it wasn't user code with a signal handler or kernel > > code with a fixup), you can't return from the exception handler. That > > leaves two choices: > > > > 1. Kill the task without returning. You could call do_exit(), or, > > roughly equivalently, let yourself OOPS (fall through to die(), etc). > > Then you have to make sure that the code that called into the thread > > can handle it by signaling some completion first or whatever you need > > to do. > > > > 2. Sleep forever. For example, set the task state to > > TASK_UNINTERRUPTIBLE and reschedule. > > > > If the thread is a real kthread, (1) may be a bad idea. I’m not sure a > > kthread can tolerate do_exit(), since the kthread core plays awful > > games involving storing important data structures on the stack. Also, > > with (2), it might be possible to enable some after-the-fact > > debugging, since the full crashing state is still there. > > > > The main problem is that we have just merged Sai's code to use a work queue > for invoking EFI services, and killing kworker threads is obviously not going > to > make EFI any new friends. > > So I guess we should probably rework that code to use a dedicated kthread, and > just freeze it when it performs an illegal memory access, and signal the > completion in a way that makes the calling thread understand that a) the call > failed and b) runtime services are no longer available. Yes, this makes sense to me. Initially I did use a dedicated kthread for efi but moved to work queues later so that the synchronization is well handled. I am ok to rework on the patches, could we ask Ingo to hold onto efi_workqueue patches? Regards, Sai N�r��y���b�X��ǧv�^�){.n�+{�y��^n�r��z���h&���G���h�(�階�ݢj"���m�z�ޖ���f���h���~�m�
RE: [PATCH RFC 0/8] Add efi page fault handler to fix/recover from
I have added some x86/intel folks to cc. I am fine with these patches, and I think it is useful to be able to detect and recover from buggy UEFI implementations that use boot time regions at runtime. However, I need help from the x86 maintainers/developers to review this so please cc them on these patches. I'm okay with the general concept, but I'm not really thrilled by the longjmp-like approach. Wasn't there a bunch of talk of having an actual kernel thread (kefid?) that makes runtime services calls? Did that actually get implemented? IMO a much nicer approach would be to handle the page fault by killing the thread, more or less like how we kill unruly user threads. (And it's yet another step toward calling EFI runtime services at CPL 3!) Hi Andy, Thanks for the feedback ☺. We have efi_kthread implemented and I did briefly think about killing the efi_kthread approach, but I thought it might not be possible (I might be wrong) because, we are in page fault handler and if we kill efi_kthread, the page fault handler still returns to firmware (because a firmware instruction caused page fault) and firmware will try to perform illegal access again thinking that the page fault handler might have fixed the fault. So, I took this approach of jumping out of firmware. Please let me know If you think I missed something. Regards, Sai
RE: [PATCH RFC 0/8] Add efi page fault handler to fix/recover from
> > AFAIK, efi runtime services are not reentrant. With this in mind, if > > something > like above happens, I have completely turned off EFI runtime services in > kernel. > Is that OK? Or should we keep them enabled hoping to catch further illegal > accesses (assuming that this feature is not used in production kernels). > > > > I think it is reasonable to turn off services after that. The only problem is > that > distros will never be able to enable this, given that it may break systems > that are > working fine today. Actually these patches shouldn't break any existing behavior. Below are the possible illegal accesses. 1. If the illegal access was to boot time region, presently, it works during kernel boot but not after kernel boot, because we free boot time regions after set_virtual_address_map() is called. Please see efi_reserve/free_boot_services(). With the patches, we save boot time regions forever and hence illegal access could be fixed even after kernel boot. So, distros shouldn't see anything different here. 2. If the illegal access was to any other region except boot time region, presently, kernel panics both during and after kernel boot (this is the case reported by Al Stone). With these patches, we exit firmware context and hence fixup page fault handler. So, distros here, instead of seeing a kernel panic would see EFI Runtime Services disabled. Regards, Sai
RE: [PATCH RFC 0/8] Add efi page fault handler to fix/recover from
> I have added some x86/intel folks to cc. > > I am fine with these patches, and I think it is useful to be able to detect > and > recover from buggy UEFI implementations that use boot time regions at > runtime. > > However, I need help from the x86 maintainers/developers to review this so > please cc them on these patches. Hi Ard, Sure! I will keep them cc'ed. Could you also please let me know you thoughts on this approach If the illegal access occurs to any EFI region other than EFI boot time regions (Eg: EFI conventional memory or EFI loader code/data), these patches will exit firmware context and return to kernel i.e. we are adjusting RIP and RSP in efi page fault handler and leaving runtime service execution abruptly. Is that OK? This code in "[PATCH RFC 4/8] x86/efi: Add page fault handler to fixup/recover from page faults caused by firmware" + regs->sp = xmm_regs_rsp; + regs->ip = exit_fw_ctx_rip; + exited_fw_ctx = true; + clear_bit(EFI_RUNTIME_SERVICES, ); + pr_info("Exited Firmware context and disabled EFI Runtime Services\n"); AFAIK, efi runtime services are not reentrant. With this in mind, if something like above happens, I have completely turned off EFI runtime services in kernel. Is that OK? Or should we keep them enabled hoping to catch further illegal accesses (assuming that this feature is not used in production kernels). Regards, Sai N�r��y���b�X��ǧv�^�){.n�+{�y��^n�r��z���h&���G���h�(�階�ݢj"���m�z�ޖ���f���h���~�m�
RE: [RFC PATCH] x86/efi: drop task_lock() from efi_switch_mm()
> > Because local_save_flags() does not disable interrupts?? > > now that you say so, it does make sense… > > > > Anyway, I would still like to get rid of task_lock() in efi_switch_mm(). > > > Any objections to that? > > > > No, not at all. > okay. I don’t have any either and thanks for the clear commit message. It makes sense why task_lock()/task_unlock() should go away. Regards, Sai N�r��y���b�X��ǧv�^�){.n�+{�y��^n�r��z���h&���G���h�(�階�ݢj"���m�z�ޖ���f���h���~�m�
RE: [PATCH 2/8] efi/x86: Use non-blocking SetVariable() for efi_delete_dummy_variable()
> > diff --git a/arch/x86/platform/efi/quirks.c > > b/arch/x86/platform/efi/quirks.c index 36c1f8b9f7e0..6af39dc40325 > > 100644 > > --- a/arch/x86/platform/efi/quirks.c > > +++ b/arch/x86/platform/efi/quirks.c > > @@ -105,12 +105,11 @@ early_param("efi_no_storage_paranoia", > > setup_storage_paranoia); */ void efi_delete_dummy_variable(void) { > > - efi.set_variable((efi_char16_t *)efi_dummy_name, > > -_DUMMY_GUID, > > -EFI_VARIABLE_NON_VOLATILE | > > -EFI_VARIABLE_BOOTSERVICE_ACCESS | > > -EFI_VARIABLE_RUNTIME_ACCESS, > > -0, NULL); > > + efi.set_variable_nonblocking((efi_char16_t *)efi_dummy_name, > > +_DUMMY_GUID, > > +EFI_VARIABLE_NON_VOLATILE | > > +EFI_VARIABLE_BOOTSERVICE_ACCESS | > > +EFI_VARIABLE_RUNTIME_ACCESS, 0, NULL); > > } > > Just wondering, what is the full stack trace of the splat? It sounds a bit > surprising > to me that such type of EFI code is used from the idle thread. Sorry! for the confusing commit message. Kernel warns about scheduling from idle thread only when "efi_rts_wq" is used to invoke efi_runtime_services(). So, presently, this doesn't happen on mainline kernel. Support for "efi_rts_wq" is added by commit 3eb420e70d87 (efi: Use a work queue to invoke EFI Runtime Services). With v4.18-rc5 kernel, the stack trace looks as below: Please note that it's not just a warning but a kernel panic due to NULL pointer dereference. If I remember correctly, I noticed "bad: scheduling from the idle thread!" warning during development phase (probably with v4.15 or v4.16 kernels). [0.075052] BUG: unable to handle kernel NULL pointer dereference at 01c2 [0.076000] PGD 0 P4D 0 [0.076000] Oops: [#1] SMP PTI [0.076000] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.18.0-rc5-efitest+ #216 [0.076000] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 0.0.0 02/06/2015 [0.076000] RIP: 0010:__queue_work+0x41/0x5f0 [0.076000] Code: fd 48 83 ec 10 8b 35 2e e2 79 01 89 7c 24 04 85 f6 74 17 65 48 8b 04 25 40 4f 01 00 8b 88 54 0c 00 00 85 c9 0f 84 b5 02 00 00 <41> f6 84 24 c2 01 00 00 01 0f 85 f7 03 00 00 48 bd eb 83 b5 80 46 [0.076000] RSP: :82603cf0 EFLAGS: 00010046 [0.076000] RAX: 8262a7c0 RBX: 0246 RCX: [0.076000] RDX: RSI: 0001 RDI: 2000 [0.076000] RBP: 82603da0 R08: R09: 0001 [0.076000] R10: R11: R12: [0.076000] R13: 2000 R14: 82603da0 R15: [0.076000] FS: () GS:88007e00() knlGS: [0.076000] CS: 0010 DS: ES: CR0: 80050033 [0.076000] CR2: 01c2 CR3: 05a24001 CR4: 000606b0 [0.076000] Call Trace: [0.076000] queue_work_on+0x33/0x70 [0.076000] virt_efi_set_variable+0x11f/0x160 [0.076000] ? efi_call_virt_check_flags+0x80/0x80 [0.076000] efi_delete_dummy_variable+0x8c/0xb0 [0.076000] ? efi_enter_virtual_mode+0x42c/0x4e0 [0.076000] efi_enter_virtual_mode+0x42c/0x4e0 [0.076000] start_kernel+0x456/0x4f4 [0.076000] secondary_startup_64+0xa5/0xb0 [0.076000] Modules linked in: [0.076000] CR2: 01c2 [0.076000] ---[ end trace 5a03876c3be00272 ]--- [0.076000] RIP: 0010:__queue_work+0x41/0x5f0 [0.076000] Code: fd 48 83 ec 10 8b 35 2e e2 79 01 89 7c 24 04 85 f6 74 17 65 48 8b 04 25 40 4f 01 00 8b 88 54 0c 00 00 85 c9 0f 84 b5 02 00 00 <41> f6 84 24 c2 01 00 00 01 0f 85 f7 03 00 00 48 bd eb 83 b5 80 46 [0.076000] RSP: :82603cf0 EFLAGS: 00010046 [0.076000] RAX: 8262a7c0 RBX: 0246 RCX: [0.076000] RDX: RSI: 0001 RDI: 2000 [0.076000] RBP: 82603da0 R08: R09: 0001 [0.076000] R10: R11: R12: [0.076000] R13: 2000 R14: 82603da0 R15: [0.076000] FS: () GS:88007e00() knlGS: [0.076000] CS: 0010 DS: ES: CR0: 80050033 [0.076000] CR2: 01c2 CR3: 05a24001 CR4: 000606b0 [0.076000] Kernel panic - not syncing: Attempted to kill the idle task! [0.076000] ---[ end Kernel panic - not syncing: Attempted to kill the idle task! ]--- Regards, Sai -- To unsubscribe from this list: send the line "unsubscribe linux-efi" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: [PATCH 0/6] efi/x86 mixed mode cleanups
> >> Patch #6 helps unused code paths to be optimized away on > >> configurations that don't need them (32-bit only and 64-bit only) > > > > I have tested this patch set on qemu and I see mixed mode kernel not > > booting. > > My test setup is: > > Running qemu-system-x86_64 with 32 bit OVMF and x86_64 kernel built with > Mixed mode enabled. > > I am using elilinux as bootloader. > > > > Upon further investigation, I found that the issue isn't related to this > > patch set. > > It's introduced with commit 2e6eb40ca5eb ("Fix incorrect invocation of > > PciIO->Attributes") Reverting the change does help in booting mixed mode > kernel. > > > > Hello Sai, > > This is likely due to the fact that you are missing this patch > > https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git/commit/?h=efi/urgen > t=e296701800f30d260a66f8aa1971b5b1bc3d2f81 > > Could you please double check? Thanks. Thanks for updating me on this. I now see why my testing broke. I think, the above patch is only in TIP tree and didn't yet make to Linus's tree and is also not part of efi tree next branch (these were the trees I am testing on). Regards, Sai
RE: [PATCH 1/6] efi/x86: prevent reentrant firmware calls in mixed mode
> The UEFI spec does not permit runtime services to be called reentrantly, and > so > it is up to the OS to provide proper locking around such calls. > > For the native case, this was fixed a long time ago, but for the mixed mode > case, > no locking is done whatsoever. Note that the calls are made with preemption > and interrupts disabled, so only SMP configurations are affected by this > issue. > > So add a spinlock and grab it when invoking a UEFI runtime service in mixed > mode. We will also need to provide non-blocking versions of SetVariable() and > QueryVariableInfo(), so add those as well. > > Signed-off-by: Ard Biesheuvel FWIW, this patch looks good to me. Regards, Sai -- To unsubscribe from this list: send the line "unsubscribe linux-efi" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: [PATCH 0/6] efi/x86 mixed mode cleanups
Hi Ard, > This series contains some fixes and cleanups for mixed-mode UEFI on x86. > > Patch #1 adds the missing locking in the runtime service wrapper code for > mixed > mode. This was found by inspection rather than having been reported but could > be a candidate for -stable nonetheless. > > Patch #2 merges the remaining 32/64-bit specific parts of the setup_efi_pci > routine. > > Patches #3 and #4 do the same for the UGA draw protocol discovery routines. > > Patch #5 fixes a latent bug in the UGA draw code. > > Patch #6 helps unused code paths to be optimized away on configurations that > don't need them (32-bit only and 64-bit only) I have tested this patch set on qemu and I see mixed mode kernel not booting. My test setup is: Running qemu-system-x86_64 with 32 bit OVMF and x86_64 kernel built with Mixed mode enabled. I am using elilinux as bootloader. Upon further investigation, I found that the issue isn't related to this patch set. It's introduced with commit 2e6eb40ca5eb ("Fix incorrect invocation of PciIO->Attributes") Reverting the change does help in booting mixed mode kernel. Regards, Sai -- To unsubscribe from this list: send the line "unsubscribe linux-efi" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: [PATCH 1/6] efi: Introduce efi_memmap_free() to free memory allocated by efi_memmap_alloc()
> > Signed-off-by: Sai Praneeth Prakhya > > Suggested-by: Ard Biesheuvel > > Cc: Lee Chun-Yi > > Cc: Dave Young > > Cc: Borislav Petkov > > Cc: Laszlo Ersek > > Cc: Jan Kiszka > > Cc: Dave Hansen > > Cc: Bhupesh Sharma > > Cc: Nicolai Stange > > Cc: Naresh Bhat > > Cc: Ricardo Neri > > Cc: Peter Zijlstra > > Cc: Taku Izumi > > Cc: Ravi Shankar > > Cc: Matt Fleming > > Cc: Dan Williams > > Cc: Ard Biesheuvel > > Do you really think you should cc all these people? > Sorry! I get that.. > > --- > > drivers/firmware/efi/memmap.c | 28 > > include/linux/efi.h | 8 > > 2 files changed, 36 insertions(+) > > [snip] > I am now thinking we should perhaps hide the allocation type from the callers > rather than passing around implementation details of the allocator, and have > explicit alloc/free operations. > > Could we instead have a single 'efi_realloc_memmap()' function that takes care > of all of this? Sure! I will give it a try. Regards, Sai
RE: [PATCH] efi: Free existing memory map before installing new memory map
> > Also, could you please clarify if there is any specific reason why > > memory allocated using memblock_reserve() shouldn't be freed. I mean, > > not with memblock_free() but I think we could make it _available_ > > using free_bootmem() (or something similar, please correct me if this is not > the right API). > > On arm64, the memory map is provided to the core kernel by the stub, and after > kexec, a pointer to the same memory map will be passed to the next kernel. So > the kernel does not 'own' that allocation, and it should not free it or > overwrite > it. Thanks for the reply. It confirms that the issue is only on x86 systems. I see that arm64 doesn't call efi_memmap_alloc() and hence there is no concept of allocating memory for new memory map and installing it (so no memory leak). Regards, Sai
RE: [kbuild-all] [PATCH] efi: Free existing memory map before installing new memory map
> >Since efi_memmap_free() is a recently introduced API [1], please note > >that the patch is based on efi tree > >@https://git.kernel.org/pub/scm/linux/kernel/git/efi/efi.git > > I noticed the official efi tree recored in MAINTAINERS is > git://git.kernel.org/pub/scm/linux/kernel/git/mfleming/efi.git, > are they identical? > > EXTENSIBLE FIRMWARE INTERFACE (EFI) > M: Matt Fleming > L: linux-efi@vger.kernel.org > T: git git://git.kernel.org/pub/scm/linux/kernel/git/mfleming/efi.git > S: Maintained > F: Documentation/efi-stub.txt > F: arch/ia64/kernel/efi.c > F: arch/x86/boot/compressed/eboot.[ch] > F: arch/x86/include/asm/efi.h > F: arch/x86/platform/efi/* > F: drivers/firmware/efi/* > F: include/linux/efi*.h Hi Xiaolong, Ard replaced Matt as EFI maintainer. Please see updated MAINTAINERS file for more details. https://elixir.bootlin.com/linux/v4.18-rc2/source/MAINTAINERS EXTENSIBLE FIRMWARE INTERFACE (EFI) M: Ard Biesheuvel L: linux-efi@vger.kernel.org T: git git://git.kernel.org/pub/scm/linux/kernel/git/efi/efi.git S: Maintained F: Documentation/efi-stub.txt F: arch/*/kernel/efi.c F: arch/x86/boot/compressed/eboot.[ch] F: arch/*/include/asm/efi.h F: arch/x86/platform/efi/ F: drivers/firmware/efi/ F: include/linux/efi*.h F: arch/arm/boot/compressed/efi-header.S F: arch/arm64/kernel/efi-entry.S Regards, Sai -- To unsubscribe from this list: send the line "unsubscribe linux-efi" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: [PATCH] efi: Free existing memory map before installing new memory map
> > + /* Free the memory allocated to the existing memory map */ > > + efi_memmap_free(efi.memmap.phys_map, efi.memmap.nr_map, > > + efi.memmap.late); > > + > > data.phys_map = addr; > > data.size = efi.memmap.desc_size * nr_map; > > data.desc_version = efi.memmap.desc_version; > > -- > > 2.7.4 > > > > If only it were so simple :-) > > At this point, efi.memmap.phys_map could point to memory that was allocated > early, allocated late or simply passed to the OS at boot time by the stub (in > which case it was memblock_reserve()d but not memblock_alloc()d, and it > should not be freed) > Yes, completely agree that there could be three types of allocations for memmap. I thought, efi_memmap_free(efi.memmap.phys_map, efi.memmap.nr_map, efi.memmap.late); should work because the previous type of allocation should have been recorded in efi.memmap.late. But, now I see this will fail for memblock_reserved() memory because it will be mistaken to memblock_alloced() (I assumed both are almost similar :(). > So only allocations made with efi_memmap_alloc() should be freed here. Makes sense and I think that also means efi_memmap_free() should be called from function that called efi_memmap_alloc() and not efi_memmap_install(). > I'm not sure /how/ we should keep track of that: perhaps it is simply a > matter of > replacing the boolean 'late' with an enum that describes where the memory > came from that phys_map points to. I did try changing boolean late to enum and it seems to be working fine. I will do more testing/clean up and will submit a patch for review. Also, could you please clarify if there is any specific reason why memory allocated using memblock_reserve() shouldn't be freed. I mean, not with memblock_free() but I think we could make it _available_ using free_bootmem() (or something similar, please correct me if this is not the right API). If we allocate and install a new memory map (as in case with efi_fake_memmap()), I think we should free the memory used by memory map originally passed by EFI stub, because, at any point of time there should only be one active memory map. If we don't free the original memory map passed by EFI stub, we will be having two and hence will be leaking memory. Regards, Sai
RE: [PATCH] efi: Free existing memory map before installing new memory map
> > Hi Sai, > > Thank you for the patch! Yet something to improve: > > [auto build test ERROR on linus/master] > [also build test ERROR on v4.18-rc2 next-20180625] [if your patch is applied > to > the wrong git tree, please drop us a note to help improve the system] Since efi_memmap_free() is a recently introduced API [1], please note that the patch is based on efi tree @https://git.kernel.org/pub/scm/linux/kernel/git/efi/efi.git [1] https://lkml.org/lkml/2018/6/22/39 Regards, Sai -- To unsubscribe from this list: send the line "unsubscribe linux-efi" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: [PATCH V5 0/3] Use efi_rts_wq to invoke EFI Runtime Services
> >> I have tested V4 series on ThunderX2 Val1 and Saber platform by > >> integrating the patches with LUV project. > >> I will test V5 version soon and let you know the outcome. > > > > Thanks a lot! for taking time to test the patch set, Naresh. Will be > > waiting to > see the result. > > Sai, I did upgrade LUV kernel to v4.17, apply and test your patches by running > LUV on QEMU and ThunderX2 hardware. All efi test results looks ok. Thanks a lot! for the update :) > But as Ard pointed, I also noticed hang issue after executing reboot command > on console > on real hardware. Ard has fixed the issue before applying and the latest patches can be found at: https://git.kernel.org/pub/scm/linux/kernel/git/efi/efi.git/log/?h=next Regards, Sai
RE: [PATCH V5 3/3] efi: Use efi_rts_wq to invoke EFI Runtime Services
> >> I noticed that -unsurprisingly- reboot no longer works with these changes. > >> > >> I will fix up the patch, and revert the efi_reset_system() change, > >> both here and below. > > > > Could you please let me know what the bug is here? I am unable to see > > it right away :( I have tested reboot on qemu x86_64 by passing > > "reboot=efi" as command line arg and saw that reboot is working fine. > > > > My arm64 hangs at reboot or power off, unless i revert the ResetSystem() part. > That's interesting. Not sure how the behavior could be different between x86 and arm64. Maybe worth debugging further? As you said, reverting ResetSystem() part makes things work, we cannot suspect a buggy efi_reset_system() (I came across buggy efi_reset_system() implementations on some x86 systems). > But given that it is both risky (relying on a kthread running a workqueue in > the > shutdown path) and unnecessary (ResetSystem() is not supposed to return, and > is > only called by the kernel when the whole system has already been torn down), I > think the main reason is simply that there is no reason to add it. Makes sense. I think, before calling firmware ResetSystem() kernel has already shut down all other parts and hence as the control will not transfer back to kernel (other than ResetSystem() failure), we could safely assume that no kernel code would access user space while in efi_reset_system() and hence we don't need to take the work queue path. Regards, Sai N�r��y���b�X��ǧv�^�){.n�+{�y��^n�r��z���h&���G���h�(�階�ݢj"���m�z�ޖ���f���h���~�m�
RE: [PATCH V5 3/3] efi: Use efi_rts_wq to invoke EFI Runtime Services
> > + case RESET_SYSTEM: > > + __efi_call_virt(reset_system, *(int *)arg1, > > + *(efi_status_t *)arg2, > > + *(unsigned long *)arg3, > > + (efi_char16_t *)arg4); > > + break; > > I noticed that -unsurprisingly- reboot no longer works with these changes. > > I will fix up the patch, and revert the efi_reset_system() change, both here > and > below. Could you please let me know what the bug is here? I am unable to see it right away :( I have tested reboot on qemu x86_64 by passing "reboot=efi" as command line arg and saw that reboot is working fine. > > @@ -340,7 +441,8 @@ static void virt_efi_reset_system(int reset_type, > > "could not get exclusive access to the firmware\n"); > > return; > > } > > - __efi_call_virt(reset_system, reset_type, status, data_size, data); > > + efi_queue_work(RESET_SYSTEM, _type, , _size, data, > > + NULL); > > up(_runtime_lock); > > } Regards, Sai N�r��y���b�X��ǧv�^�){.n�+{�y��^n�r��z���h&���G���h�(�階�ݢj"���m�z�ޖ���f���h���~�m�
RE: [PATCH V5 0/3] Use efi_rts_wq to invoke EFI Runtime Services
> > Testing: > > > > Tested using LUV (Linux UEFI Validation) for x86_64, x86_32 and arm64 > > (qemu only). Will appreciate the effort if someone could test the > > patches on real ARM/ARM64 machines. > > LUV: https://01.org/linux-uefi-validation > > I have tested V4 series on ThunderX2 Val1 and Saber platform by integrating > the > patches with LUV project. > I will test V5 version soon and let you know the outcome. Thanks a lot! for taking time to test the patch set, Naresh. Will be waiting to see the result. Regards, Sai N�r��y���b�X��ǧv�^�){.n�+{�y��^n�r��z���h&���G���h�(�階�ݢj"���m�z�ޖ���f���h���~�m�
RE: [PATCH V5 0/3] Use efi_rts_wq to invoke EFI Runtime Services
> >> Testing: > >> > >> Tested using LUV (Linux UEFI Validation) for x86_64, x86_32 and arm64 > >> (qemu only). Will appreciate the effort if someone could test the > >> patches on real ARM/ARM64 machines. > > I would give the latest v5 a try on my ARM64 qualcomm board as well. > WIll get back with the test results soon. > Sure! Testing is always welcomed :) and if you get a chance please try on UV300 machine too.. (probably it might find issues as happened with mm_struct patches). Regards, Sai
RE: [PATCH V4 0/3] Use efi_rts_wq to invoke EFI Runtime Services
> > Another follow on question is, does every firmware support both > > blocking and non-blocking variants (specially legacy EFI firmware)? I > > am worried about this because, presently efi_delete_dummy_variable() > > uses set_variable() and > > query_variable_info() but if I change efi_delete_dummy_variable() to > > use non-blocking variants and if they aren’t supported, then, I guess, > > efi_delete_dummy_variable() might fail :( > > > > So, could you please clarify on that? > > > > I don't follow. Why should it make any difference to the firmware whether the > OS routines blocks or gives up? We always honor the mutual exclusion between > different invocations of runtime services, and the firmware itself has no > awareness of the kind of scheduling the OS needs to do to ensure this. Sorry! my bad.. I thought firmware (with EFI System table revision > 2.0 ) offers two types of efi run time services, a blocking variant and a non-blocking variant. But, now I noticed in the spec that there is only set_variable() but _no_ set_variable_nonblocking(). Same with query_variable_info(). The same is also seen in runtime-wrappers.c file. Both the blocking and non-blocking variants call the same efi runtime service. I see that non-blocking() variants are just an additional feature (API) offered by OS. Regards, Sai
RE: [PATCH V4 0/3] Use efi_rts_wq to invoke EFI Runtime Services
> > One more question again, if we are sure that non-blocking variants > > will _always_ be called in atomic context, then, we got it covered. > > Because, in > > set_variable() and query_variable_info() (both blocking and > > non-blocking) we check for in_atomic() and if so, we don't use efi_rts_wq > (please refer to patch 3). > > > > If you think, there might be a probability of calling non-blocking > > efi_rts out of atomic context, then, sure! Let's make them never use > efi_rts_wq. > > > > This is not about what happens to be the current situation. It is about the > API. > > The non-blocking functions should never block, period. They either fail > gracefully > or perform their duties without sleeping. Yes, that makes sense. > > In this particular case, I think it is useful to have a guaranteed > non-blocking > version, not only to delete the dummy EFI variable, but potentially in other > future cases as well, given that they can be called much earlier in the boot > (when > the perf/%cr3 issue is not a concern to begin with) Thanks for making it more clear :) I will change the non-blocking variants _not_ to use efi_rts_wq and as you suggested make efi_delete_dummy_variable() use non-blocking variants (that should also make it local to arch/x86). Another follow on question is, does every firmware support both blocking and non-blocking variants (specially legacy EFI firmware)? I am worried about this because, presently efi_delete_dummy_variable() uses set_variable() and query_variable_info() but if I change efi_delete_dummy_variable() to use non-blocking variants and if they aren’t supported, then, I guess, efi_delete_dummy_variable() might fail :( So, could you please clarify on that? Regards, Sai
RE: [PATCH V4 0/3] Use efi_rts_wq to invoke EFI Runtime Services
> > Assume some user requested to execute some non-blocking variant of > > efi_rts and the kernel hasn't called efi_call_virt() yet, but was > > scheduled out. IOW, even though user requests for non-blocking efi call, we > might still block. Am I right? > > > > No, that is the whole point. These functions may be called from atomic > context, > which is why they trylock() and give up rather than block on the semaphore if > a rt > services call is already in progress. E.g., > > /* > * efivar_entry_set_nonblocking - call set_variable_nonblocking() > * > * This function is guaranteed to not block and is suitable for calling > * from crash/panic handlers. > * > * Crucially, this function will not block if it cannot acquire > * efivars_lock. Instead, it returns -EBUSY. > */ > One more question again, if we are sure that non-blocking variants will _always_ be called in atomic context, then, we got it covered. Because, in set_variable() and query_variable_info() (both blocking and non-blocking) we check for in_atomic() and if so, we don't use efi_rts_wq (please refer to patch 3). If you think, there might be a probability of calling non-blocking efi_rts out of atomic context, then, sure! Let's make them never use efi_rts_wq. Regards, Sai N�r��yb�X��ǧv�^�){.n�+{�y^n�r���z���h�&���G���h�(�階�ݢj"���m��z�ޖ���f���h���~�m�
RE: [PATCH V4 0/3] Use efi_rts_wq to invoke EFI Runtime Services
> > Changes from V3 to V4: > > -- > > 1. As suggested by Peter, use completions instead of flush_work() as the > > former is cheaper > > 2. Call efi_delete_dummy_variable() from efisubsys_init(). Sorry! Ard, > > wasn't able to find a better alternative to keep this change local to > > arch/x86. > > > > Two questions: > - Should the non-blocking variants of the query and set_variable_store use the > work queue? Doesn't that make them blocking? That's a good question . I think you are right, calling non-blocking variants of efi_rts using work queues makes them blocking. But, I have a follow on question. Assume some user requested to execute some non-blocking variant of efi_rts and the kernel hasn't called efi_call_virt() yet, but was scheduled out. IOW, even though user requests for non-blocking efi call, we might still block. Am I right? With efi_rts_wq, I think, I have increased the window of getting blocked. With efi_rts_wq, kernel should explicitly call schedule() to run firmware and the chances of getting blocked are much more. Expect this increased window, I think firmware should be executed as before. So, can you please explain me the difference between blocking and non-blocking variants from kernel perspective? (the way we get locks are different down_interruptible() vs down_trylock()) > - If the non-blocking set_variable() does not use the work queue, can we just > call > it from efi_delete_dummy_variable(), and keep the calls where they are? Yes, I think we can do that (if we don't use efi_rts_wq for non-blocking variants). Regards, Sai
RE: [PATCH V3 2/3] efi: Introduce efi_queue_work() to queue any efi_runtime_service() on efi_rts_wq
> On Mon, May 21, 2018 at 08:13:03PM -0700, Sai Praneeth Prakhya wrote: > > + /* \ > > +* queue_work() returns 0 if work was already on queue, \ > > +* _ideally_ this should never happen. \ > > +*/ \ > > + if (queue_work(efi_rts_wq, _rts_work.work)) > \ > > + flush_work(_rts_work.work); > \ > > Since you're _always_ going to wait for it, it is _much_ cheaper to put a > completion in your actual work and wait for that. Sure! I will change it. I also noticed that flush_work() in turn calls wait_for_completion(). Will also wait for a couple of days before posting V4. Regards, Sai -- To unsubscribe from this list: send the line "unsubscribe linux-efi" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: Query regarding SetVirtualAddressMap()
> > AFAIK, ExitBootServices() means that boot services are no longer > > needed by OS/bootloader and hence firmware can terminate them. Does it > > also mean that the system is in runtime mode..? (I don't think so, as, I > > didn't > find it in UEFI spec). > > > > Yes > > > Also, could you please enlighten me on, Why calling > > SetVirtualAddressMap() is optional, why not make it mandatory? > > > > Because it depends on the OS whether it needs it or not. Also, there will > always > be a window after calling EBS() and before calling SVAM() where we are at > runtime and no virtual mapping is installed yet, so mandating SVAM() is a bit > ambiguous anyway. Thanks for clarifying my doubts :) Regards, Sai N�r��yb�X��ǧv�^�){.n�+{�y^n�r���z���h�&���G���h�(�階�ݢj"���m��z�ޖ���f���h���~�m�
RE: Query regarding SetVirtualAddressMap()
+ Ricardo > > > I recently noticed in UEFI spec v2.7, section 2.3.4, Calling > > > Conventions for X64, that “All selectors set to be flat with virtual > > > = physical address. If the UEFI OS loader or OS used > > > SetVirtualAddressMap() to relocate the runtime services in a virtual > > > address space, then this condition does not have to be met. My > > > understanding of it is as shown below. Please correct me if it’s wrong. > > > > > > 1. It’s not mandatory for the OS to call SetVirtualAddressMap() > > > > > > 2. If it’s not called, the OS can still use EFI Runtime Services in > > > physical mode (i.e. PA == VA). > > > > > > > > > > > > For x86_64, I did a quick test with qemu and OVMF. I have commented > > > out call to SetVirtualAddressMap() from kernel and noticed that > > > everything works fine i.e. efi variables have the normal > > > functionality and I ran LUV (Linux UEFI Validation). Not calling > > > SetVirtualAddressMap() doesn’t break anything because on x86 we have > > > 1:1 mappings. So, could someone please let me know why we might > > > still need > > to call SetVirtualAddressMap()? > > > > > > > You are right, and the same applies to arm64. However, Apple Mac EFI > > is known not to tolerate PA==VA at runtime, Some doubts again, When SetVirtualAddressMap() isn't called, how does Apple Mac EFI realize that the system is in runtime mode and how does it realize that it should now start using VA instead of PA? Extending it further, if SetVirtualAddressMap() isn't called, any platform firmware shouldn't be aware that the system is in runtime mode.. right? Or does the firmware behavior change on call to ExitBootServices()? AFAIK, ExitBootServices() means that boot services are no longer needed by OS/bootloader and hence firmware can terminate them. Does it also mean that the system is in runtime mode..? (I don't think so, as, I didn't find it in UEFI spec). Also, could you please enlighten me on, Why calling SetVirtualAddressMap() is optional, why not make it mandatory? > > and there may be other x86 > > firmware implementations that don't work correctly in this case, since > > they are only tested with MS Windows, which also calls > SetVirtualAddressMap(). Regards, Sai N�r��yb�X��ǧv�^�){.n�+{�y^n�r���z���h�&���G���h�(�階�ݢj"���m��z�ޖ���f���h���~�m�
RE: Query regarding SetVirtualAddressMap()
> > I recently noticed in UEFI spec v2.7, section 2.3.4, Calling > > Conventions for X64, that “All selectors set to be flat with virtual = > > physical address. If the UEFI OS loader or OS used > > SetVirtualAddressMap() to relocate the runtime services in a virtual > > address space, then this condition does not have to be met. My > > understanding of it is as shown below. Please correct me if it’s wrong. > > > > 1. It’s not mandatory for the OS to call SetVirtualAddressMap() > > > > 2. If it’s not called, the OS can still use EFI Runtime Services in > > physical mode (i.e. PA == VA). > > > > > > > > For x86_64, I did a quick test with qemu and OVMF. I have commented > > out call to SetVirtualAddressMap() from kernel and noticed that > > everything works fine i.e. efi variables have the normal functionality > > and I ran LUV (Linux UEFI Validation). Not calling > > SetVirtualAddressMap() doesn’t break anything because on x86 we have > > 1:1 mappings. So, could someone please let me know why we might still need > to call SetVirtualAddressMap()? > > > > You are right, and the same applies to arm64. However, Apple Mac EFI is known > not to tolerate PA==VA at runtime, and there may be other x86 firmware > implementations that don't work correctly in this case, since they are only > tested > with MS Windows, which also calls SetVirtualAddressMap(). > > So the safe option is to still call it. Thanks for the explanation Ard, it makes sense now :) Regards, Sai
RE: [PATCH V2 2/3] efi: Introduce efi_rts_workqueue and some infrastructure to invoke all efi_runtime_services()
> > That's true! AFAIK, we don't have any issues handling NMI while in efi_pgd. > > We might have issues only when, we are already in efi_pgd, NMI comes > > along > > Can you trigger this? Or is it something hypothetical? > AFAIK, it's hypothetical. I did try to trigger the issue, but failed [1]. Maybe, I need to have some more constraints [2]. [1] https://lkml.org/lkml/2017/8/23/715 [2] https://lkml.org/lkml/2017/8/25/469 > > and NMI handler tries to touch the regions that are not mapped in > > efi_pgd > > If it is not hypothetical, the NMI handler should learn to look at CR3 first > and > return if CR3 has the efi pgd. This solution and it's variants were discussed here [1], [2] and for varied reasons the community had decided to go with "Everything EFI as kthread" approach [3] [4]. Although the discussions were off my understanding, the present issue I see is, (and also the motivation for me to do the patch is) when a thread tries to execute any efi_runtime_service() we switch to efi_pgd (which doesn't have user space mappings) and all other subsystems in kernel aren't aware of this switch. This looks like a perfect case for kthread. Kthread by definition doesn’t have user space mappings and if we run efi_runtime_services() in a kthread context and if any other subsystem tries to access user space mappings while in efi_kthread, it's terminally broken [5]. There were several issues Andy, Peter and Mark raised. One such (hypothetical) case is accessing user space from the back of an interrupt (NMI). Others include 1. Issue specific to ARM because it runs efi_runtime_services() with interrupts enabled [6] 2. Interrupt taken while mmap_sem() is held for write that tries to access user memory [7] 3. If EFI were to have IO memory mapped at a "user" address, perf could end up reading it [8] [1] https://lkml.org/lkml/2017/8/15/757 [2] https://lkml.org/lkml/2017/8/16/487 [3] https://lkml.org/lkml/2017/8/21/573 [4] https://lkml.org/lkml/2017/8/16/540 [5] https://lkml.org/lkml/2017/8/17/667 [6] https://lkml.org/lkml/2017/8/16/176 [7] https://lkml.org/lkml/2017/8/17/667 [8] https://lkml.org/lkml/2017/8/21/427 Regards, Sai
RE: [PATCH 07/12] efi: Use efi_mm in x86 as well as ARM
> > diff --git a/include/linux/efi.h b/include/linux/efi.h index > > f5083aa72eae..f1b7d68ac460 100644 > > --- a/include/linux/efi.h > > +++ b/include/linux/efi.h > > @@ -966,6 +966,8 @@ extern struct efi { > > unsigned long flags; > > } efi; > > > > +extern struct mm_struct efi_mm; > > + > > static inline int > > efi_guidcmp (efi_guid_t left, efi_guid_t right) { > > Ugh, I can see three problems with this patch: > > 1) > > Why is the low level asm/efi.h header polluted with two of the biggest header > files in existence, to add a type to _another_ header (efi.h)? > > 2) > > Why is included if what is being relied on is mm_struct? > > 3) > > But even looks unnecessary in efi.h, a simple forward > declaration of mm_struct would do ... > > The high level MM and sched headers should be added to the actual .c files > that > make use of them. Ok, makes sense. Sorry! for that. I will fix the issues. Regards, Sai -- To unsubscribe from this list: send the line "unsubscribe linux-efi" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: [PATCH V2 2/3] efi: Introduce efi_rts_workqueue and some infrastructure to invoke all efi_runtime_services()
> > Another warning by checkpatch is "use of in_atomic() in drivers code" > > I'm assuming it warns because you're touching files in drivers/ but the efi > fun is > not really a driver... True! That makes sense :) > > But looking at patch 3, that thing looks like a real mess. Some of the things > - > pstore, it seems - do stuff in atomic context and yet you want to do efi > stuff in a > workqueue which doesn't stomach atomic context to begin with. > > So if you wanna do workqueue, you should make sure all efi stuff gets delayed > to process context and queued properly. For example, we log MCEs from atomic > context by putting them on a lockless buffer and then kicking irq_work to > queue > the work when we return to process context. > Can you do something like that? > I think we can do this, it's is a good idea. I looked at this approach and saw that in oops_end() function, part of arch/x86/kernel/dumpstack, between oops_exit() and panic() (here we are not in atomic context, so, I think we can use work queues) we could have something like efi_flush_buffer() which will flush the buffer and queue the work to efi_rts_wq. But, I guess, we have some downsides with this design: 1. We are doing this to have "no exceptions to use efi_rts_wq", but we will be making the common case complicated. i.e. When a user requests to write some efi variable, we will first write it to a buffer and then flush the buffer using efi_rts_wq. Instead, we could have written the variable directly. Maybe, you meant, we should use this buffer only while pstore and not during normal case (which sounds reasonably OK). 2. It doesn't look rational that, when we are already going down, we schedule (because we will be invoking efi_runtime_services() through work queue) to log some stuff. Not, sure if that's happening in other parts of kernel or if it's OK to do that. I will try the suggested approach and will keep this thread posted. > "Hence, pstore calls efi_runtime_services() without using efi_rts_wq" - that > doesn't sound like optimal design to me. I would try to shove them all through > the workqueue - not have exceptions. > Alternatively, instead of playing around with in_atomic(), we could have wrapper functions like efi_write_var_non_wq() which will only be used by pstore. This function will not use efi_rts_wq and directly invoke efi_runtime_service. Just an attempt to make the code not look messy. > Then this: > > > A potential issue could be, for instance, an NMI interrupt (like perf) > > trying to profile some user data while in efi_pgd. > > I can't understand. > > How did we handle this until now and why is it a problem all of a sudden? > > Because I don't recall being unable to run perf while efi runtime services are > happening. > That's true! AFAIK, we don't have any issues handling NMI while in efi_pgd. We might have issues only when, we are already in efi_pgd, NMI comes along and NMI handler tries to touch the regions that are not mapped in efi_pgd (Eg: User space part of process address space) and using kthread inherently means that we will not have any user space. Regards, Sai
RE: [PATCH V2 2/3] efi: Introduce efi_rts_workqueue and some infrastructure to invoke all efi_runtime_services()
/* > >> > +* Since we process only one efi_runtime_service() at a time, an > >> > +* ordered workqueue (which creates only one execution context) > >> > +* should suffice all our needs. > >> > +*/ > >> > + efi_rts_wq = alloc_ordered_workqueue("efi_rts_workqueue", > >> > + 0); > >> > >> efi_rts_wq or efi_rts_workqueue? > >> > >> > + if (!efi_rts_wq) { > >> > + pr_err("Failed to create efi_rts_workqueue, EFI runtime > >> > services " > >> > >> Same here. > > > > Sure! I will make it consistent with "efi_rts_wq". Just tried to be > > more verbose with names :) > > > > It is not a big deal, but using the exact same name is better for the > purposes of > grepping and things like that :-) Yes, that makes sense. > By the way, check the commit title/message, there are some others there too. Sure! I will make changes to commit/title messages too. Regards, Sai
RE: [PATCH V2 1/3] x86/efi: Call efi_delete_dummy_variable() during efi subsystem initialization
> > void __init efi_enter_virtual_mode(void) diff --git > > a/drivers/firmware/efi/efi.c b/drivers/firmware/efi/efi.c index > > cd42f66a7c85..838b8efe639c 100644 > > --- a/drivers/firmware/efi/efi.c > > +++ b/drivers/firmware/efi/efi.c > > @@ -328,6 +328,12 @@ static int __init efisubsys_init(void) > > if (!efi_enabled(EFI_BOOT)) > > return 0; > > > > + /* > > +* Clean DUMMY object calls EFI Runtime Service, set_variable(), so > > +* it should be invoked only after efi_rts_workqueue is ready. > > +*/ > > + efi_delete_dummy_variable(); > > + > > Is there any way to keep this local to arch/x86? > I think, we can definitely do that. I will retake a look at the possibilities and will update this thread. > > /* We register the efi directory at /sys/firmware/efi */ > > efi_kobj = kobject_create_and_add("efi", firmware_kobj); > > if (!efi_kobj) { > > diff --git a/include/linux/efi.h b/include/linux/efi.h index > > f5083aa72eae..c4efb3ef0dfa 100644 > > --- a/include/linux/efi.h > > +++ b/include/linux/efi.h > > @@ -992,6 +992,7 @@ extern efi_status_t efi_query_variable_store(u32
RE: [PATCH V2 2/3] efi: Introduce efi_rts_workqueue and some infrastructure to invoke all efi_runtime_services()
+Cc Miguel Ojeda > > > +({ > > > \ > > > + struct efi_runtime_work efi_rts_work; \ > > > + \ > > > + INIT_WORK_ONSTACK(_rts_work.work, efi_call_rts);\ > > > + efi_rts_work.func = _rts; \ > > > + efi_rts_work.arg1 = _arg1; \ > > > + efi_rts_work.arg2 = _arg2; \ > > > + efi_rts_work.arg3 = _arg3; \ > > > + efi_rts_work.arg4 = _arg4; \ > > > + efi_rts_work.arg5 = _arg5; \ > > > + /* \ > > > + * queue_work() returns 0 if work was already on queue, \ > > > + * _ideally_ this should never happen. \ > > > + */ \ > > > + if (queue_work(efi_rts_wq, _rts_work.work)) > > \ > > > + flush_work(_rts_work.work); > > \ > > > + else\ > > > + BUG(); \ > > > > So failure to queue that work is such a critical problem that we need > > to BUG() and can't possibly continue and shoult not attempt recovery at all? > > > > I think it's not critical, we can just return error status. > I think the problem in itself is not at all critical but when I initially > thought about > why the problem could have occurred, it sounded like one i.e. ideally (if the > system is running fine) we should always be able to queue work. Failure to > queue > means that the previous work is already on queue and that shouldn't be the > case. > So, thought, maybe something bad had happened already (just doubtful). > > But, I see your point. BUG() sounds more like an over kill. Instead of fixing > an > existing problem, this patch could completely take down the system. > > > IOW, we should always strive to fail gracefully and not shit in pants > > at the first sign of trouble. > > > > Yes, that makes sense. I will remove BUG() in V3 (in the two places that I > introduced). > > > Even checkpatch warns here: > > > > WARNING: Avoid crashing the kernel - try using WARN_ON & recovery code > > rather than BUG() or BUG_ON() > > #184: FILE: drivers/firmware/efi/runtime-wrappers.c:92: > > + BUG(); \ > > > > Sure! I will fix this > > > > > and by looking at the other output, you should run your patches > > through checkpatch. Some of the things make sense like: > > > > WARNING: quoted string split across lines > > #97: FILE: drivers/firmware/efi/efi.c:341: > > + pr_err("Failed to create efi_rts_workqueue, EFI runtime > > services " > > + "disabled.\n"); > > > > for example. > > > > I will fix this one too. > > Another warning by checkpatch is "use of in_atomic() in drivers code" > Do you think it's OK to check if were are "in_atomic()" in drivers code. > I wasn't able to decide on other alternative, if possible, could you please > suggest > one? > > Regards, > Sai N�r��yb�X��ǧv�^�){.n�+{�y^n�r���z���h�&���G���h�(�階�ݢj"���m��z�ޖ���f���h���~�m�
RE: [PATCH V2 2/3] efi: Introduce efi_rts_workqueue and some infrastructure to invoke all efi_runtime_services()
> > +({ \ > > + struct efi_runtime_work efi_rts_work; \ > > + \ > > + INIT_WORK_ONSTACK(_rts_work.work, efi_call_rts);\ > > + efi_rts_work.func = _rts; \ > > + efi_rts_work.arg1 = _arg1; \ > > + efi_rts_work.arg2 = _arg2; \ > > + efi_rts_work.arg3 = _arg3; \ > > + efi_rts_work.arg4 = _arg4; \ > > + efi_rts_work.arg5 = _arg5; \ > > + /* \ > > +* queue_work() returns 0 if work was already on queue, \ > > +* _ideally_ this should never happen. \ > > +*/ \ > > + if (queue_work(efi_rts_wq, _rts_work.work)) > \ > > + flush_work(_rts_work.work); > \ > > + else\ > > + BUG(); \ > > So failure to queue that work is such a critical problem that we need to BUG() > and can't possibly continue and shoult not attempt recovery at all? > I think it's not critical, we can just return error status. I think the problem in itself is not at all critical but when I initially thought about why the problem could have occurred, it sounded like one i.e. ideally (if the system is running fine) we should always be able to queue work. Failure to queue means that the previous work is already on queue and that shouldn't be the case. So, thought, maybe something bad had happened already (just doubtful). But, I see your point. BUG() sounds more like an over kill. Instead of fixing an existing problem, this patch could completely take down the system. > IOW, we should always strive to fail gracefully and not shit in pants at the > first > sign of trouble. > Yes, that makes sense. I will remove BUG() in V3 (in the two places that I introduced). > Even checkpatch warns here: > > WARNING: Avoid crashing the kernel - try using WARN_ON & recovery code > rather than BUG() or BUG_ON() > #184: FILE: drivers/firmware/efi/runtime-wrappers.c:92: > + BUG(); \ > Sure! I will fix this > > and by looking at the other output, you should run your patches through > checkpatch. Some of the things make sense like: > > WARNING: quoted string split across lines > #97: FILE: drivers/firmware/efi/efi.c:341: > + pr_err("Failed to create efi_rts_workqueue, EFI runtime > services " > + "disabled.\n"); > > for example. > I will fix this one too. Another warning by checkpatch is "use of in_atomic() in drivers code" Do you think it's OK to check if were are "in_atomic()" in drivers code. I wasn't able to decide on other alternative, if possible, could you please suggest one? Regards, Sai N�r��yb�X��ǧv�^�){.n�+{�y^n�r���z���h�&���G���h�(�階�ݢj"���m��z�ޖ���f���h���~�m�
RE: [PATCH V2 3/3] efi: Use efi_rts_workqueue to invoke EFI Runtime Services
> >> > pstore writes could potentially be invoked in interrupt context and > >> > it uses set_variable<>() and query_variable_info<>() to store logs. > >> > If we invoke efi_runtime_services() through efi_rts_wq while in > >> > atomic() kernel issues a warning ("scheduling wile in atomic") and > >> > prints stack trace. One way to overcome this is to not make the > >> > caller process wait for the worker thread to finish. This approach > >> > breaks pstore i.e. the log messages aren't written to efi > >> > variables. Hence, pstore calls > >> > efi_runtime_services() without using efi_rts_wq or in other words > >> > efi_rts_wq will be used unconditionally for all the > >> > efi_runtime_services() except set_variable<>() and > >> > query_variable_info<>() > >> > >> Can't NMIs still come in during this? > >> > >> ... or do we assume that since something has already gone wrong, this > >> doesn't matter? > >> > > > > I think they can come. AFAIK, pstore (if enabled) runs only when we crashed. > > Since, we are still executing stuff to log messages and NMI's can't be > > masked, there is still a possibility for NMI's to occur (please correct me > > if I am > wrong). > > But, as you said earlier, I guess it doesn't matter because anyways we are > going down. > > The problem is that we are not always in a "already going down" > condition for typical set_variable and query_variable_info calls. That's correct. > So are we actually fixing anything with this patchset? When we are _not_ in interrupt context (eg: process context) we still use efi_rts_wq to invoke *all* efi_runtime_services(). This solves *someone trying to access user space* while in EFI runtime services mapping problem. A instance could be, some user space process requests us to execute efi_runtime_service(), so, kernel switches to efi_pgd (which doesn’t have user space part of process address space) and while executing efi_runtime_service() perf NMI comes along to profile user data. > In other words if the NMI vs EFI > mapping problem requires the workqueue context then we can't have any EFI > calls outside of that context. Am I missing how this scheme addresses that > problem? I think so, because, we are not trying to solve NMI vs EFI Runtime Service mappings. AFAIK, they both work together (for x86_64). The problem is, as stated above, "someone trying to access user space while executing EFI runtime services". The problem exists because EFI runtime services mappings don’t have user space part of process address space. I think the problem still persists when we are already in interrupt context and then we were requested to execute some efi_runtime_service() and then NMI happens and that NMI handler touches user space. Please let me know if my explanation didn’t make sense or if I misunderstood your question. Ard and Matt, please correct me if I stated something that is incorrect.
RE: [PATCH V2 2/3] efi: Introduce efi_rts_workqueue and some infrastructure to invoke all efi_runtime_services()
> > +struct workqueue_struct *efi_rts_wq; > > + > > static bool disable_runtime; > > static int __init setup_noefi(char *arg) { @@ -329,6 +331,19 @@ > > static int __init efisubsys_init(void) > > return 0; > > > > /* > > +* Since we process only one efi_runtime_service() at a time, an > > +* ordered workqueue (which creates only one execution context) > > +* should suffice all our needs. > > +*/ > > + efi_rts_wq = alloc_ordered_workqueue("efi_rts_workqueue", 0); > > efi_rts_wq or efi_rts_workqueue? > > > + if (!efi_rts_wq) { > > + pr_err("Failed to create efi_rts_workqueue, EFI runtime > > services " > > Same here. Sure! I will make it consistent with "efi_rts_wq". Just tried to be more verbose with names :) [...] > > +#define efi_queue_work(_rts, _arg1, _arg2, _arg3, _arg4, _arg5) > > \ > > +({ \ > > + struct efi_runtime_work efi_rts_work; \ > > + \ > > + INIT_WORK_ONSTACK(_rts_work.work, efi_call_rts);\ > > + efi_rts_work.func = _rts; \ > > + efi_rts_work.arg1 = _arg1; \ > > + efi_rts_work.arg2 = _arg2; \ > > + efi_rts_work.arg3 = _arg3; \ > > + efi_rts_work.arg4 = _arg4; \ > > + efi_rts_work.arg5 = _arg5; \ > > + /* \ > > +* queue_work() returns 0 if work was already on queue, \ > > +* _ideally_ this should never happen. \ > > +*/ \ > > + if (queue_work(efi_rts_wq, _rts_work.work)) \ > > + flush_work(_rts_work.work); \ > > + else\ > > + BUG(); \ > > Thanks for the change! One remark, I would just do: Sorry! but I am planning to remove BUG(). Looks like it could defeat the purpose of patch. Please see Boris comments on the other thread. [...] > > +/* > > + * efi_runtime_work: Details of EFI Runtime Service work > > + * @func: EFI Runtime Service function identifier > > + * @arg<1-5>: EFI Runtime Service function arguments > > + * @status:Status of executing EFI Runtime Service > > + */ > > +struct efi_runtime_work { > > + u8 func; > > + void *arg1; > > + void *arg2; > > + void *arg3; > > + void *arg4; > > + void *arg5; > > + efi_status_t status; > > + struct work_struct work; > > +}; > > Why is efi_runtime_work in the .h at all? > Thanks for the catch. I will move it to runtime-wrappers.c file and will make it static too. It isn't being used in any other place. > Please CC me for the next version! :-) Sure! Sorry for that. I should have done in V2. Regards, Sai
RE: [PATCH V2 3/3] efi: Use efi_rts_workqueue to invoke EFI Runtime Services
05, 2018 at 03:23:10PM -0800, Sai Praneeth Prakhya wrote: > > From: Sai Praneeth> > > > Presently, efi_runtime_services() are executed by firmware in process > > context. To execute efi_runtime_service(), kernel switches the page > > directory from swapper_pgd to efi_pgd. However, efi_pgd doesn't have > > any user space mappings. A potential issue could be, for instance, an > > NMI interrupt (like perf) trying to profile some user data while in efi_pgd. > > It might be worth pointing out that this could result in disaster (e.g. > if the frame pointer happens to point at MMIO in the EFI runtime services > mappings). > Sorry! I didn't get it. I would like to add this point, so could you please explain it more? > [...] > > > pstore writes could potentially be invoked in interrupt context and it > > uses set_variable<>() and query_variable_info<>() to store logs. If we > > invoke efi_runtime_services() through efi_rts_wq while in atomic() > > kernel issues a warning ("scheduling wile in atomic") and prints stack > > trace. One way to overcome this is to not make the caller process wait > > for the worker thread to finish. This approach breaks pstore i.e. the > > log messages aren't written to efi variables. Hence, pstore calls > > efi_runtime_services() without using efi_rts_wq or in other words > > efi_rts_wq will be used unconditionally for all the > > efi_runtime_services() except set_variable<>() and > > query_variable_info<>() > > Can't NMIs still come in during this? > > ... or do we assume that since something has already gone wrong, this doesn't > matter? > I think they can come. AFAIK, pstore (if enabled) runs only when we crashed. Since, we are still executing stuff to log messages and NMI's can't be masked, there is still a possibility for NMI's to occur (please correct me if I am wrong). But, as you said earlier, I guess it doesn't matter because anyways we are going down. > Otherwise, this doesn't seem to break basic stuff on arm64 platforms. I can > boot > up, read the EFI RTC, and reboot. I ahd lockdep and KASAN enabled, I received > no splats from either. > > FWIW: > > Tested-by: Mark Rutland > Thanks, > Mark. -- To unsubscribe from this list: send the line "unsubscribe linux-efi" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: [PATCH V2 2/3] efi: Introduce efi_rts_workqueue and some infrastructure to invoke all efi_runtime_services()
-0800, Sai Praneeth Prakhya wrote: > > @@ -329,6 +331,19 @@ static int __init efisubsys_init(void) > > return 0; > > > > /* > > +* Since we process only one efi_runtime_service() at a time, an > > +* ordered workqueue (which creates only one execution context) > > +* should suffice all our needs. > > +*/ > > + efi_rts_wq = alloc_ordered_workqueue("efi_rts_workqueue", 0); > > + if (!efi_rts_wq) { > > + pr_err("Failed to create efi_rts_workqueue, EFI runtime services > " > > + "disabled.\n"); > > + clear_bit(EFI_RUNTIME_SERVICES, ); > > + return 0; > > + } > > I'm a little worried that something might sample this flag between it being > set in > an early_initcall (arm_enable_runtime_services), and cleared in a > subsys_initcall > here. > > However, nothing seems to do that so far, so maybe that's ok... > Thanks for raising this. I will take a look at initcalls. > [...] > > > +/* efi_runtime_service() function identifiers */ enum { > > + GET_TIME, > > + SET_TIME, > > + GET_WAKEUP_TIME, > > + SET_WAKEUP_TIME, > > + GET_VARIABLE, > > + GET_NEXT_VARIABLE, > > + SET_VARIABLE, > > + SET_VARIABLE_NONBLOCKING, > > + QUERY_VARIABLE_INFO, > > + QUERY_VARIABLE_INFO_NONBLOCKING, > > + GET_NEXT_HIGH_MONO_COUNT, > > + RESET_SYSTEM, > > + UPDATE_CAPSULE, > > + QUERY_CAPSULE_CAPS, > > +}; > > Can we please give this enum a name Sure! Added in V3. > > [...] > > > +/* > > + * efi_runtime_work: Details of EFI Runtime Service work > > + * @func: EFI Runtime Service function identifier > > + * @arg<1-5>: EFI Runtime Service function arguments > > + * @status:Status of executing EFI Runtime Service > > + */ > > +struct efi_runtime_work { > > + u8 func; > > ... and use it here rather than an opaque u8? I realise that means placing the > enum in . > Actually, with Miguel comments, I am considering making this struct static and moving it to runtime-wrappers.c, since "struct efi_runtime_work" isn't really being used anywhere except runtime-wrappers.c. Please see in V3. > > + void *arg1; > > + void *arg2; > > + void *arg3; > > + void *arg4; > > + void *arg5; > > + efi_status_t status; > > + struct work_struct work; > > +}; > > Thanks, > Mark. -- To unsubscribe from this list: send the line "unsubscribe linux-efi" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: [PATCH V2 3/3] efi: Use efi_rts_workqueue to invoke EFI Runtime Services
> > Presently, efi_runtime_services() are executed by firmware in process > > context. To execute efi_runtime_service(), kernel switches the page > > directory from swapper_pgd to efi_pgd. However, efi_pgd doesn't have > > any user space mappings. A potential issue could be, for instance, an > > NMI interrupt (like perf) trying to profile some user data while in efi_pgd. > > > > A solution for this issue could be to use kthread to run > > efi_runtime_service(). When a user/kernel thread requests to execute > > efi_runtime_service(), kernel off-loads this work to kthread which in > > turn uses efi_pgd. Anything that tries to touch user space addresses > > while in kthread is terminally broken. This patch adds support to efi > > subsystem to handle all calls to efi_runtime_services() using a work > > queue (which in turn uses kthread). > > > > Implementation summary: > > --- > > 1. When user/kernel thread requests to execute efi_runtime_service(), > > enqueue work to efi_rts_workqueue. > > 2. Caller thread waits until the work is finished because it's > > dependent on the return status of efi_runtime_service(). > > > > Semantics to pack arguments in efi_runtime_work (has void pointers): > > 1. If argument is a pointer (of any type), pass it as is. > > 2. If argument is a value (of any type), address of the value is > > passed. > > > > Introduce a handler function (called efi_call_rts()) that > > a. understands efi_runtime_work and > > b. invokes the appropriate efi_runtime_service() with the > > appropriate arguments > > > > Semantics followed by efi_call_rts() to understand efi_runtime_work: > > 1. If argument was a pointer, recast it from void pointer to original > > pointer type. > > 2. If argument was a value, recast it from void pointer to original > > pointer type and dereference it. > > > > pstore writes could potentially be invoked in interrupt context and it > > uses set_variable<>() and query_variable_info<>() to store logs. If we > > invoke efi_runtime_services() through efi_rts_wq while in atomic() > > kernel issues a warning ("scheduling wile in atomic") and prints stack > > trace. One way to overcome this is to not make the caller process wait > > for the worker thread to finish. This approach breaks pstore i.e. the > > log messages aren't written to efi variables. Hence, pstore calls > > efi_runtime_services() without using efi_rts_wq or in other words > > efi_rts_wq will be used unconditionally for all the > > efi_runtime_services() except set_variable<>() and > > query_variable_info<>() > > > Is there a place in the system reboot path where we can try to flush these > asynchronous pstore writes from interrupt context? I don't think so because, the issue is not with the pstore writes but with pstore using efi as backing store. Anything could register as pstore backend, eg: RAM, ACPI-ERST etc.. and AFAIK, they don’t use work queues to store logs. Now that efi_runtime_services() uses work queues, we unfortunately have to have this hack. > It seems unfortunate that > we need to have this wide exception for all > set_variable() calls. True, basically any efi_runtime_service() that might get called in interrupt context. I am not very happy to have the hack too, but didn’t find other way. Either that or switch to an explicit "emergency mode" where > we stop caring about protecting the system from EFI runtime code because > we're already crashing. Should we care about extra warning (scheduling while in atomic) when we are already crashing? This sounds kind of debatable. I will wait for feedback from community it they think it's OK or maybe a better solution. Regards, Sai
RE: [PATCH V1 2/3] efi: Introduce efi_rts_workqueue and necessary infrastructure to invoke all efi_runtime_services()
> > diff --git a/drivers/firmware/efi/efi.c b/drivers/firmware/efi/efi.c > > index ac5db5f8dbbf..4714b305ca90 100644 > > --- a/drivers/firmware/efi/efi.c > > +++ b/drivers/firmware/efi/efi.c > > @@ -76,6 +76,8 @@ static unsigned long *efi_tables[] = { > > _attr_table, > > }; > > > > +struct workqueue_struct *efi_rts_wq; > > + > > static bool disable_runtime; > > static int __init setup_noefi(char *arg) { @@ -329,6 +331,15 @@ > > static int __init efisubsys_init(void) > > if (!efi_enabled(EFI_BOOT)) > > return 0; > > > > + /* Create a work queue to run EFI Runtime Services */ > > + efi_rts_wq = create_workqueue("efi_rts_workqueue"); > > Please consider alloc_workqueue() instead with the appropriate flags, since > create_workqueue() and friends are deprecated. Sure! Will change in V2. > > > + if (!efi_rts_wq) { > > + pr_err("Failed to create efi_rts_workqueue, EFI runtime > > services " > > + "disabled.\n"); > > + clear_bit(EFI_RUNTIME_SERVICES, ); > > + return 0; > > + } > > + > > /* > > * Clean DUMMY object calls EFI Runtime Service, set_variable(), so > > * it should be invoked only after efi_rts_workqueue is ready. > > diff --git a/drivers/firmware/efi/runtime-wrappers.c > > b/drivers/firmware/efi/runtime-wrappers.c > > index ae54870b2788..5cdb787da5d3 100644 > > --- a/drivers/firmware/efi/runtime-wrappers.c > > +++ b/drivers/firmware/efi/runtime-wrappers.c > > @@ -1,6 +1,14 @@ > > /* > > * runtime-wrappers.c - Runtime Services function call wrappers > > * > > + * Implementation summary: > > + * --- > > + * 1. When user/kernel thread requests to execute > > + efi_runtime_service(), > > + * enqueue work to efi_rts_workqueue. > > + * 2. Caller thread waits until the work is finished because it's > > + * dependent on the return status and execution of efi_runtime_service(). > > + * For instance, get_variable() and get_next_variable(). > > + * > > * Copyright (C) 2014 Linaro Ltd.> > * > > * Split off from arch/x86/platform/efi/efi.c @@ -22,6 +30,8 @@ > > #include #include #include > > > > +#include > > + > > #include > > > > /* > > @@ -33,6 +43,50 @@ > > #define __efi_call_virt(f, args...) \ > > __efi_call_virt_pointer(efi.systab->runtime, f, args) > > > > +/* Each EFI Runtime Service is represented with a unique number */ > > +#define GET_TIME 0 > > +#define SET_TIME 1 > > +#define GET_WAKEUP_TIME2 > > +#define SET_WAKEUP_TIME3 > > +#define GET_VARIABLE 4 > > +#define GET_NEXT_VARIABLE 5 > > +#define SET_VARIABLE 6 > > +#define SET_VARIABLE_NONBLOCKING 7 > > +#define QUERY_VARIABLE_INFO8 > > +#define QUERY_VARIABLE_INFO_NONBLOCKING9 > > +#define GET_NEXT_HIGH_MONO_COUNT 10 > > +#define RESET_SYSTEM 11 > > +#define UPDATE_CAPSULE 12 > > +#define QUERY_CAPSULE_CAPS 13 > > An enum would be better, given these are just internal, contiguous IDs. > Makes sense. > > + > > +/* > > + * efi_queue_work: Queue efi_runtime_service() and wait until it's done > > + * @rts: efi_runtime_service() function identifier > > + * @rts_arg<1-5>: efi_runtime_service() function arguments > > + * > > + * Accesses to efi_runtime_services() are serialized by a binary > > + * semaphore (efi_runtime_lock) and caller waits until the work is > > + * finished, hence _only_ one work is queued at a time. So, > > +queue_work() > > + * should never fail. > > + */ > > +#define efi_queue_work(_rts, _arg1, _arg2, _arg3, _arg4, _arg5) > > \ > > +({ \ > > + struct efi_runtime_work efi_rts_work; \ > > + efi_rts_work.status = EFI_ABORTED; \ > > + \ > > + INIT_WORK_ONSTACK(_rts_work.work, efi_call_rts);\ > > + efi_rts_work.func = _rts; \ > > + efi_rts_work.arg1 = _arg1; \ > > + efi_rts_work.arg2 = _arg2; \ > > + efi_rts_work.arg3 = _arg3; \ > > + efi_rts_work.arg4 = _arg4; \ > > + efi_rts_work.arg5 = _arg5; \ > > + if (queue_work(efi_rts_wq, _rts_work.work))
RE: [PATCH 0/3] Use mm_struct and switch_mm() instead of manually
> > Changes in V3: > > 1. When CPUMASK_OFFSTACK is enabled, switch_mm_irqs_off() sets cpumask > > by calling cpumask_set_cpu(). This panics kernel as efi_mm is not > > initialized, therefore initialize efi_mm in efi_alloc_page_tables(). > > Thanks for the v3. > > I confirmed that the issue I saw with the v2 when I enabled 'efi=debug' on the > sgi-uv 300 machine (i.e the NULL pointer access while accessing > mm_cpumask(next), in the function call > 'switch_mm_irqs_off') is fixed in the v3. > > Also as I noted during the v2 review, introducing the 'efi_switch_mm() ' > helper > instead of manually twiddling with %cr3 seems more cleaner (having personally > debugged this leg several times on the underlying x86 EFI machines). > > So in addition to me testing this on the sgi-uv300 and Dell Optiplex EFI > enabled > machine, please feel free to add: > > Reviewed-by: Bhupesh Sharma> Hi Bhupesh, Thanks for the review and re-iterating the usefulness of this patch set. Regards, Sai
RE: [PATCH V2 0/3] Use mm_struct and switch_mm() instead of manually
> So, in summary it seems that the primary kernel boot _fails_ with your > v2 patchset on the real hardware for me irrespective of whether I use Matt's > tree or Linus's tree: > > a) I would suggest that you perform some more checks on real hardware as > qemu boot tests sometimes do not expose the problems we might see when > booting a kernel on efi capable hardware. > > b) Also do note that both Matt's tree and Linus's tree work fine on this > hardware > for me (with the 'nopcid' added to the bootargs) > Hi Bhupesh, Thanks a lot! for the detailed explanation. I will address these issues in V3. Regards, Sai
RE: [PATCH V2 0/3] Use mm_struct and switch_mm() instead of manually
> -Original Message- > From: Sai Praneeth Prakhya [mailto:sai.praneeth.prak...@intel.com] > Sent: Tuesday, September 5, 2017 7:43 PM > To: Bhupesh Sharma> Cc: linux-efi@vger.kernel.org; linux-ker...@vger.kernel.org; Matt Fleming > ; Ard Biesheuvel ; > j...@suse.com; Borislav Petkov ; Luck, Tony > ; l...@kernel.org; m...@redhat.com; Neri, Ricardo > ; Shankar, Ravi V > Subject: Re: [PATCH V2 0/3] Use mm_struct and switch_mm() instead of > manually > > On Tue, 2017-09-05 at 19:21 -0700, Sai Praneeth Prakhya wrote: > > > I get a similar crash on Qemu with linus's master branch and the V2 > > > applied on top of it. Here are the details of my test environment: > > > > > > 1. I use the OVMF (EDK2) EFI firmware to launch the kernel: > > > edk2.git/ovmf-x64 > > > > > > 2. I used linus's master branch (HEAD - commit: > > > b1b6f83ac938d176742c85757960dec2cf10e468) and applied your v2 on top > > > of the same. > > > > > > 3. I use the following qemu command line to launch the test: > > > > > > # /usr/local/bin/qemu-system-x86_64 --version QEMU emulator version > > > 2.9.50 (v2.9.0-526-g76d20ea) Copyright (c) 2003-2017 Fabrice Bellard > > > and the QEMU Project developers > > > > > > # /usr/local/bin/qemu-system-x86_64 -enable-kvm -net nic -net tap > > > -m $MEMSIZE -nographic -drive > > > file=$DISK_IMAGE,if=virtio,format=qcow2 > > > -vga std -boot c -cpu host -kernel $KERNEL -append > > > "crashkernel=$CRASH_MEMSIZE console=ttyS0,115200n81" -initrd > > > $INITRAMFS -bios $OVMF_FW_PATH > > > > > > And here is the crash log: > > > > > > [0.006054] general protection fault: [#1] SMP > > > [0.006459] Modules linked in: > > > [0.006711] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.13.0+ #3 > > > [0.007000] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), > > > BIOS 0.0.0 02/06/2015 > > > [0.007000] task: 81e0f480 task.stack: 81e0 > > > [0.007000] RIP: 0010:switch_mm_irqs_off+0x1bc/0x440 > > > [0.007000] RSP: :81e03d80 EFLAGS: 00010086 > > > [0.007000] RAX: 80007d084000 RBX: RCX: > 77ff8000 > > > [0.007000] RDX: 7d084000 RSI: 8000 RDI: > 00019a00 > > > [0.007000] RBP: 81e03dc0 R08: R09: > 88007d085000 > > > [0.007000] R10: 81e03dd8 R11: 7d095063 R12: > 81e5c6a0 > > > [0.007000] R13: 81ed4f40 R14: 0030 R15: > 0001 > > > [0.007000] FS: () GS:88007d40() > > > knlGS: > > > [0.007000] CS: 0010 DS: ES: CR0: 80050033 > > > [0.007000] CR2: 88007d754000 CR3: 0220a000 CR4: > 000406b0 > > > [0.007000] Call Trace: > > > [0.007000] switch_mm+0xd/0x20 > > > [0.007000] ? switch_mm+0xd/0x20 > > > [0.007000] efi_switch_mm+0x3e/0x4a > > > [0.007000] efi_call_phys_prolog+0x28/0x1ac > > > [0.007000] efi_enter_virtual_mode+0x35a/0x48f > > > [0.007000] start_kernel+0x332/0x3b8 > > > [0.007000] x86_64_start_reservations+0x2a/0x2c > > > [0.007000] x86_64_start_kernel+0x178/0x18b > > > [0.007000] secondary_startup_64+0xa5/0xa5 > > > [0.007000] ? secondary_startup_64+0xa5/0xa5 > > > [0.007000] Code: 00 00 00 80 49 03 55 50 0f 82 7f 02 00 00 48 b9 > > > 00 00 00 80 ff 77 00 00 48 be 00 00 00 00 00 00 00 80 48 01 ca 48 09 > > > f0 48 09 d0 <0f> 22 d8 0f 1f 44 00 00 e9 47 ff ff ff 65 8b 05 b8 87 > > > fb 7e 89 > > > [0.007000] RIP: switch_mm_irqs_off+0x1bc/0x440 RSP: 81e03d80 > > > [0.007000] ---[ end trace bfa55bf4e4765255 ]--- > > > [0.007000] Kernel panic - not syncing: Attempted to kill the idle > > > task! > > > [0.007000] ---[ end Kernel panic - not syncing: Attempted to kill > > > the idle task! > > > > > > 4. Note though that if I use the EFI_MIXED mode (i.e. 32-bit ovmf > > > firmware and 64-bit x86 kernel) with your patches, the primary > > > kernel boots fine on Qemu: > > > > > > ovmf firmware used in this case - edk2.git/ovmf-ia32 > > > > > > 5. Also, if I append 'efi=old_map' to the bootargs (for the failing > > > case in point 3 above), I see the primary kernel boots fine on Qemu > > > as well. > > > > > > Regards, > > > Bhupesh > > > > Hi Bhupesh, > > > > Thanks a lot for the detailed explanation. They are helpful to > > reproduce the issue quickly. From my initial debug, I think that AMD > > SME + efi_mm_struct patches + -cpu host (in qemu) are required to > > reproduce the issue on qemu. > > > > I have tried the following combinations (all tests are on qemu): > > On Linus's tree: > > 1. With SME and efi_mm and -cpu host -> panics 2. With SME and > > efi_mm and !-cpu host -> boots 3. With SME and !efi_mm and -cpu host > > -> boots 4. With SME
RE: [PATCH V2 0/3] Use mm_struct and switch_mm() instead of manually
> > > > Thanks for this v2. > > Introducing the 'efi_switch_mm() ' helper instead of manually > > twiddling with %cr3 seems more cleaner. > > > > I have tested this patchset on a x86_64 machine with the following > > configurations: > > > > 1. Primary kernel boot with efi=old_map 2. Primary kernel boot with > > new efi map > > > > While it seems that efi=old_map passes, the new efi map boot fails for > > me on both the two x86 machine (Dell 3050MT and a SGI - UV300 machine. > > > > It seems we are hitting a NULL pointer deference in > > 'efi_call_phys_prolog' while accessing '_mm'. > > > > Please see the log below for details: > > > > [0.020006] BUG: unable to handle kernel NULL pointer dereference > > at (null) > > [0.021000] IP: switch_mm_irqs_off+0x44/0x270 > > [0.021000] Call Trace: > > [0.021000] switch_mm+0x20/0x30 > > [0.021000] efi_switch_mm+0x49/0x60 > > [0.021000] efi_call_phys_prolog+0x56/0x1b3 > > [0.021000] efi_enter_virtual_mode+0x3a9/0x520 > > [0.021000] start_kernel+0x424/0x4c8 > > [0.021000] ? set_init_arg+0x5a/0x5a > > [0.021000] ? early_idt_handler_array+0x120/0x120 > > [0.021000] x86_64_start_reservations+0x29/0x2b > > [0.021000] x86_64_start_kernel+0x151/0x174 > > [0.021000] secondary_startup_64+0x9f/0x9f > > [0.021000] Code: 2d 82 51 d9 4f 65 c7 05 0f 65 da 4f 01 00 00 00 > > 48 39 f7 0f 84 14 01 00 00 65 48 89 35 f6 64 da 4f 48 8b 86 e8 02 00 > > 00 45 89 ed 4c 0f ab 28 bf 00 00 00 80 48 03 7e 50 48 8b 05 27 b0 > > b9 00 > > [0.021000] RIP: switch_mm_irqs_off+0x44/0x270 RSP: b0e035d0 > > [0.021000] CR2: > > [0.021000] ---[ end trace fb94349305e1cb8b ]--- > > [0.021000] Kernel panic - not syncing: Fatal exception > > [0.021000] ---[ end Kernel panic - not syncing: Fatal exception > > > > And I forgot to mention that I tried the patchset both with the efi/next and > linus's trees and saw the same result. > > I would be happy to help in case you need further details of the test > environment > or need help in testing this crash further. > > Regards, > Bhupesh Hi Bhupesh, Thanks for trying the patches and raising the concern. Could you also please also give a try on qemu? (if reproducible, we will be having a common platform to start debugging) I have tested this patch set on qemu and real machines (different from one's you tried) in our lab and didn’t notice this issue. Regards, Sai N�r��yb�X��ǧv�^�){.n�+{�y^n�r���z���h�&���G���h�(�階�ݢj"���m��z�ޖ���f���h���~�m�
RE: [PATCH] x86/efi-bgrt: Fix kernel panic when mapping BGRT data
>>On Thu, Dec 10, 2015 at 10:27:01AM -0800, Sai Praneeth Prakhya wrote: >> From: Sai Praneeth>> >> Starting with this commit 35eb8b81edd4 ("x86/efi: Build our own page >> table structures") efi regions have a separate page directory called >> "efi_pgd". In order to access any efi region we have to first shift >>%cr3 to this page table. In the bgrt code we are trying to copy >> bgrt_header and image, but these regions fall under "EFI_BOOT_SERVICES_DATA" >> and to access these regions we have to shift %cr3 to efi_pgd and not >> doing so will cause page fault as shown below. >> >> [0.251599] Last level dTLB entries: 4KB 64, 2MB 0, 4MB 0, 1GB 4 >> [0.259126] Freeing SMP alternatives memory: 32K (8230e000 - >> 82316000) >> [0.271803] BUG: unable to handle kernel paging request at >> fffefce35002 >> [0.279740] IP: [] efi_bgrt_init+0x144/0x1fd >> [0.286383] PGD 300f067 PUD 0 >> [0.289879] Oops: [#1] SMP >> [0.293566] Modules linked in: >> [0.297039] CPU: 0 PID: 0 Comm: swapper/0 Not tainted >> 4.4.0-rc1-eywa-eywa-built-in-47041+ #2 >> [0.306619] Hardware name: Intel Corporation Skylake Client >> platform/Skylake Y LPDDR3 RVP3, BIOS SKLSE2R1.R00.B104.B01.150114 >> 11/11/2015 >> [0.320925] task: 820134c0 ti: 8200 task.ti: >> 8200 >> [0.329420] RIP: 0010:[] [] >> efi_bgrt_init+0x144/0x1fd >> [0.338821] RSP: :82003f18 EFLAGS: 00010246 >> [0.344852] RAX: fffefce35000 RBX: fffefce35000 RCX: >> fffefce2b000 >> [0.352952] RDX: 8a82b000 RSI: 8235bb80 RDI: >> 8a835000 >> [0.361050] RBP: 82003f30 R08: 8a865000 R09: >> ff202850 >> [0.369149] R10: 811ad62f R11: R12: >> >> [0.377248] R13: 88016dbaea40 R14: 822622c0 R15: >> 82003fb0 >> [0.385348] FS: () GS:88016d80() >> knlGS: >> [0.394533] CS: 0010 DS: ES: CR0: 80050033 >> [0.401054] CR2: fffefce35002 CR3: 0300c000 CR4: >> 003406f0 >> [0.409153] DR0: DR1: DR2: >> >> [0.417252] DR3: DR6: fffe0ff0 DR7: >> 0400 >> [0.425350] Stack: >> [0.427638] 82256900 88016dbaea40 >> 82003f40 >> [0.436086] 821bbce0 82003f88 8219c0c2 >> >> [0.444533] 8219ba4a 822622c0 00083000 >> >> [0.452978] Call Trace: >> [0.455763] [] efi_late_init+0x9/0xb >> [0.461697] [] start_kernel+0x463/0x47f >> [0.467928] [] ? set_init_arg+0x55/0x55 >> [0.474159] [] ? early_idt_handler_array+0x120/0x120 >> [0.481669] [] x86_64_start_reservations+0x2a/0x2c >> [0.488982] [] x86_64_start_kernel+0x13d/0x14c >> [0.495897] Code: 00 41 b4 01 48 8b 78 28 e8 09 36 01 00 48 85 c0 48 89 >> c3 75 13 48 c7 c7 f8 ac d3 81 31 c0 e8 d7 3b fb fe e9 b5 00 00 00 45 84 e4 >> <44> 8b 6b 02 74 0d be 06 00 00 00 48 89 df e8 ae 34 0$ >> [0.518151] RIP [] efi_bgrt_init+0x144/0x1fd >> [0.524888] RSP >> [0.528851] CR2: fffefce35002 >> [0.532615] ---[ end trace 7b06521e6ebf2aea ]--- >> [0.537852] Kernel panic - not syncing: Attempted to kill the idle task! >> >> As said above one way to fix this bug is to shift %cr3 to efi_pgd but >> we are not doing that way because it leaks inner details of how we >> switch to EFI page tables into a new call site and it also adds duplicate >> code. >> Instead, we remove the call to efi_lookup_mapped_addr() and always >> perform early_mem*() instead of early_io*() because we want to remap >> RAM regions and not I/O regions. >> >> We also delete efi_lookup_mapped_addr() because it is impossible to >> use it without also doing the page table switch to efi_pgd. >The original motivation for efi_lookup_mapped_addr came from early_ioremap >printing a warning if used on an address range >already mapped as RAM. Does >early_mem* handle that case correctly without a warning? Thanks a lot Josh for letting me know that. I don't think early_memremap() does that because early_memremap() and early_ioremap() both use __early_ioremap() but with different page protections (and I am not sure how those protections effect warning, but I will check that). Waiting for comments from Matt and Boris. >Because not all firmware places the BGRT image in boot services memory; some >firmware places the BGRT image variously in BIOS >reserved memory, ACPI >reclaim >space, or other strange places. > >- Josh Triplett I think we should not support buggy firmware implementations because it's same as encouraging them, instead we could let user know that he has got a buggy firmware and we skip bgrt code as if