Re: [PATCH v2 1/2] x86/boot/64: Make level2_kernel_pgt pages invalid outside kernel area.

2019-09-24 Thread Dave Hansen
On 9/24/19 1:04 PM, Steve Wahl wrote:
> Given that, would you feel better with A) the way I have it, B) your
> rewrite, or C) with an inline comment for each part of the loop:
> 
>         pmd = fixup_pointer(level2_kernel_pgt, physaddr);
> 
>         /* invalidate pages before the kernel image */
>         for (i = 0; i < pmd_index((unsigned long)_text); i++)
>                 pmd[i] &= ~_PAGE_PRESENT;
> 
>         /* fixup pages that are part of the kernel image */
>         for (; i <= pmd_index((unsigned long)_end); i++)
>                 if (pmd[i] & _PAGE_PRESENT)
>                         pmd[i] += load_delta;
> 
>         /* invalidate pages after the kernel image */
>         for (; i < PTRS_PER_PMD; i++)
>                 pmd[i] &= ~_PAGE_PRESENT;

I don't feel super strongly either way, but I'd prefer doing B or C over
nothing.


Re: [PATCH v2 1/2] x86/boot/64: Make level2_kernel_pgt pages invalid outside kernel area.

2019-09-24 Thread Steve Wahl
On Mon, Sep 23, 2019 at 02:19:46PM -0700, Dave Hansen wrote:
> On 9/23/19 11:15 AM, Steve Wahl wrote:
> >         pmd = fixup_pointer(level2_kernel_pgt, physaddr);
> > -       for (i = 0; i < PTRS_PER_PMD; i++) {
> > +       for (i = 0; i < pmd_index((unsigned long)_text); i++)
> > +               pmd[i] &= ~_PAGE_PRESENT;
> > +
> > +       for (; i <= pmd_index((unsigned long)_end); i++)
> >                 if (pmd[i] & _PAGE_PRESENT)
> >                         pmd[i] += load_delta;
> > -       }
> > +
> > +       for (; i < PTRS_PER_PMD; i++)
> > +               pmd[i] &= ~_PAGE_PRESENT;
> 
> This is looking a bunch better.  The broken-up loop could probably use
> some comments, or you could combine it back to a single loop like this:
> 
>       int text_start_pmd_index = pmd_index((unsigned long)_text);
>       int text_end_pmd_index   = pmd_index((unsigned long)_end);
> 
>       for (i = 0; i < PTRS_PER_PMD; i++) {
>               if ((i < text_start_pmd_index) ||
>                   (i > text_end_pmd_index)) {
>                       /* Unmap entries not mapping the kernel image */
>                       pmd[i] &= ~_PAGE_PRESENT;
>               } else if (pmd[i] & _PAGE_PRESENT) {
>                       pmd[i] += load_delta;
>               }
>       }

That's funny, I wrote it very close to that way initially, and
re-wrote it in the broken-up loop style because I thought it better
conveyed my intention.  (Mark pages before the kernel image as
invalid, fix up the pages that are in the kernel image range, mark
pages after the kernel image as invalid.)

Given that, would you feel better with A) the way I have it, B) your
rewrite, or C) with an inline comment for each part of the loop:

        pmd = fixup_pointer(level2_kernel_pgt, physaddr);

        /* invalidate pages before the kernel image */
        for (i = 0; i < pmd_index((unsigned long)_text); i++)
                pmd[i] &= ~_PAGE_PRESENT;

        /* fixup pages that are part of the kernel image */
        for (; i <= pmd_index((unsigned long)_end); i++)
                if (pmd[i] & _PAGE_PRESENT)
                        pmd[i] += load_delta;

        /* invalidate pages after the kernel image */
        for (; i < PTRS_PER_PMD; i++)
                pmd[i] &= ~_PAGE_PRESENT;
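
(For reference, a sketch of the index arithmetic the loops rely on;
pmd_index(), PMD_SHIFT, and PTRS_PER_PMD are the kernel's own
definitions, but the example _text address below is hypothetical:)

        unsigned long text_idx;

        /* pmd_index() selects one of the PTRS_PER_PMD (512) entries,
         * each mapping a 2 MiB page, so level2_kernel_pgt covers
         * exactly 512 * 2 MiB = 1 GiB: */
        text_idx = ((unsigned long)_text >> PMD_SHIFT) & (PTRS_PER_PMD - 1);
        /* e.g. for a hypothetical _text of 0xffffffff81000000 this is
         * 8, so the first loop clears _PAGE_PRESENT on entries 0-7. */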

> Although I'd prefer it get commented or rewritten, it's passable like
> this, so:
> 
> Reviewed-by: Dave Hansen 

Thanks for your input!  

--> Steve Wahl

-- 
Steve Wahl, Hewlett Packard Enterprise


Re: [PATCH v2 1/2] x86/boot/64: Make level2_kernel_pgt pages invalid outside kernel area.

2019-09-24 Thread Steve Wahl
On Mon, Sep 23, 2019 at 11:49:44AM -0700, h...@zytor.com wrote:
> On September 23, 2019 11:15:20 AM PDT, Steve Wahl  wrote:
> >Our hardware (UV aka Superdome Flex) has address ranges marked
> >reserved by the BIOS. Access to these ranges is caught as an error,
> >causing the BIOS to halt the system.
> >
> >Initial page tables mapped a large range of physical addresses that
> >were not checked against the list of BIOS reserved addresses, and
> >sometimes included reserved addresses in part of the mapped range.
> >Including the reserved range in the map allowed processor speculative
> >accesses to the reserved range, triggering a BIOS halt.
> >
> >Used early in booting, the page table level2_kernel_pgt addresses 1
> >GiB divided into 2 MiB pages, and it was set up to linearly map a full
> >1 GiB of physical addresses that included the physical address range
> >of the kernel image, as chosen by KASLR.  But this also included a
> >large range of unused addresses on either side of the kernel image.
> >And unlike the kernel image's physical address range, this extra
> >mapped space was not checked against the BIOS tables of usable RAM
> >addresses.  So there were times when the addresses chosen by KASLR
> >would result in processor accessible mappings of BIOS reserved
> >physical addresses.
> >
> >The kernel code did not directly access any of this extra mapped
> >space, but having it mapped allowed the processor to issue speculative
> >accesses into reserved memory, causing system halts.
> >
> >This was encountered somewhat rarely on a normal system boot, and much
> >more often when starting the crash kernel if "crashkernel=512M,high"
> >was specified on the command line (this heavily restricts the physical
> >address of the crash kernel, in our case usually within 1 GiB of
> >reserved space).
> >
> >The solution is to invalidate the pages of this table outside the
> >kernel image's space before the page table is activated.  This patch
> >has been validated to fix this problem on our hardware.
> >
> >Signed-off-by: Steve Wahl 
> >Cc: sta...@vger.kernel.org
> >---
> >Changes since v1:
> >  * Added comment.
> >  * Reworked changelog text.
> >
> > arch/x86/kernel/head64.c | 18 ++++++++++++++++--
> > 1 file changed, 16 insertions(+), 2 deletions(-)
> >
> >diff --git a/arch/x86/kernel/head64.c b/arch/x86/kernel/head64.c
> >index 29ffa495bd1c..ee9d0e3e0c46 100644
> >--- a/arch/x86/kernel/head64.c
> >+++ b/arch/x86/kernel/head64.c
> >@@ -222,13 +222,27 @@ unsigned long __head __startup_64(unsigned long physaddr,
> >  * we might write invalid pmds, when the kernel is relocated
> >  * cleanup_highmap() fixes this up along with the mappings
> >  * beyond _end.
> >+ *
> >+ * Only the region occupied by the kernel image has so far
> >+ * been checked against the table of usable memory regions
> >+ * provided by the firmware, so invalidate pages outside that
> >+ * region.  A page table entry that maps to a reserved area of
> >+ * memory would allow processor speculation into that area,
> >+ * and on some hardware (particularly the UV platform) even
> >+ * speculative access to some reserved areas is caught as an
> >+ * error, causing the BIOS to halt the system.
> >  */
> > 
> >         pmd = fixup_pointer(level2_kernel_pgt, physaddr);
> >-        for (i = 0; i < PTRS_PER_PMD; i++) {
> >+        for (i = 0; i < pmd_index((unsigned long)_text); i++)
> >+                pmd[i] &= ~_PAGE_PRESENT;
> >+
> >+        for (; i <= pmd_index((unsigned long)_end); i++)
> >                 if (pmd[i] & _PAGE_PRESENT)
> >                         pmd[i] += load_delta;
> >-        }
> >+
> >+        for (; i < PTRS_PER_PMD; i++)
> >+                pmd[i] &= ~_PAGE_PRESENT;
> > 
> > /*
> >  * Fixup phys_base - remove the memory encryption mask to obtain
> 
> What does your MTRR setup look like, and what memory map do you
> present, in exact detail?

We set the MTRR default to writeback cacheable, and then use variable
MTRRs to mark certain parts of the address map as uncacheable.  The
uncacheable regions are at the top of the address map (AEP mailboxes,
PCIe config segments other than segment zero, MMRs, 64-bit MMIO) and
from 2GB to 4GB.  There are also uncacheable ranges below 1MB
controlled by the fixed MTRRs.
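
(For reference, a minimal sketch of how that default type can be read
back from kernel code; rdmsrl() and MSR_MTRRdefType are the standard
kernel interfaces, and the pr_info() is purely illustrative:)

        u64 deftype;

        /* Bits 7:0 of IA32_MTRR_DEF_TYPE hold the default memory type
         * (6 == write-back); bit 11 is the global MTRR enable. */
        rdmsrl(MSR_MTRRdefType, deftype);
        pr_info("MTRR default type %llu, MTRRs %sabled\n",
                deftype & 0xff, (deftype & (1 << 11)) ? "en" : "dis");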

Details from the e820 map:

[    0.000000] BIOS-provided physical RAM map:
[    0.000000] BIOS-e820: [mem 0x0000000000000000-0x000000000005efff] usable
[    0.000000] BIOS-e820: [mem 0x000000000005f000-0x000000000005ffff] reserved
[    0.000000] BIOS-e820: [mem 0x0000000000060000-0x000000000009ffff] usable
[    0.000000] BIOS-e820: [mem 0x00000000000a0000-0x00000000000fffff] reserved
[    0.000000] BIOS-e820: [mem 0x0000000000100000-0x0000000072c4dfff] usable
[    0.000000] BIOS-e820: [mem 0x0000000072c4e000-0x0000000073a2ffff] reserved
[    0.000000] BIOS-e820: [mem 0x0000000073a30000-0x000000007445ffff] ACPI NVS
[    0.000000] BIOS-e820: [mem 0x0000000074460000-0x000000007599ffff] ACPI data
[

Re: [PATCH v2 1/2] x86/boot/64: Make level2_kernel_pgt pages invalid outside kernel area.

2019-09-23 Thread Dave Hansen
On 9/23/19 11:15 AM, Steve Wahl wrote:
>         pmd = fixup_pointer(level2_kernel_pgt, physaddr);
> -       for (i = 0; i < PTRS_PER_PMD; i++) {
> +       for (i = 0; i < pmd_index((unsigned long)_text); i++)
> +               pmd[i] &= ~_PAGE_PRESENT;
> +
> +       for (; i <= pmd_index((unsigned long)_end); i++)
>                 if (pmd[i] & _PAGE_PRESENT)
>                         pmd[i] += load_delta;
> -       }
> +
> +       for (; i < PTRS_PER_PMD; i++)
> +               pmd[i] &= ~_PAGE_PRESENT;

This is looking a bunch better.  The broken-up loop could probably use
some comments, or you could combine it back to a single loop like this:

        int text_start_pmd_index = pmd_index((unsigned long)_text);
        int text_end_pmd_index   = pmd_index((unsigned long)_end);

        for (i = 0; i < PTRS_PER_PMD; i++) {
                if ((i < text_start_pmd_index) ||
                    (i > text_end_pmd_index)) {
                        /* Unmap entries not mapping the kernel image */
                        pmd[i] &= ~_PAGE_PRESENT;
                } else if (pmd[i] & _PAGE_PRESENT) {
                        pmd[i] += load_delta;
                }
        }

Although I'd prefer it get commented or rewritten, it's passable like
this, so:

Reviewed-by: Dave Hansen 


Re: [PATCH v2 1/2] x86/boot/64: Make level2_kernel_pgt pages invalid outside kernel area.

2019-09-23 Thread hpa
On September 23, 2019 11:15:20 AM PDT, Steve Wahl  wrote:
>Our hardware (UV aka Superdome Flex) has address ranges marked
>reserved by the BIOS. Access to these ranges is caught as an error,
>causing the BIOS to halt the system.
>
>Initial page tables mapped a large range of physical addresses that
>were not checked against the list of BIOS reserved addresses, and
>sometimes included reserved addresses in part of the mapped range.
>Including the reserved range in the map allowed processor speculative
>accesses to the reserved range, triggering a BIOS halt.
>
>Used early in booting, the page table level2_kernel_pgt addresses 1
>GiB divided into 2 MiB pages, and it was set up to linearly map a full
>1 GiB of physical addresses that included the physical address range
>of the kernel image, as chosen by KASLR.  But this also included a
>large range of unused addresses on either side of the kernel image.
>And unlike the kernel image's physical address range, this extra
>mapped space was not checked against the BIOS tables of usable RAM
>addresses.  So there were times when the addresses chosen by KASLR
>would result in processor accessible mappings of BIOS reserved
>physical addresses.
>
>The kernel code did not directly access any of this extra mapped
>space, but having it mapped allowed the processor to issue speculative
>accesses into reserved memory, causing system halts.
>
>This was encountered somewhat rarely on a normal system boot, and much
>more often when starting the crash kernel if "crashkernel=512M,high"
>was specified on the command line (this heavily restricts the physical
>address of the crash kernel, in our case usually within 1 GiB of
>reserved space).
>
>The solution is to invalidate the pages of this table outside the
>kernel image's space before the page table is activated.  This patch
>has been validated to fix this problem on our hardware.
>
>Signed-off-by: Steve Wahl 
>Cc: sta...@vger.kernel.org
>---
>Changes since v1:
>  * Added comment.
>  * Reworked changelog text.
>
> arch/x86/kernel/head64.c | 18 ++++++++++++++++--
> 1 file changed, 16 insertions(+), 2 deletions(-)
>
>diff --git a/arch/x86/kernel/head64.c b/arch/x86/kernel/head64.c
>index 29ffa495bd1c..ee9d0e3e0c46 100644
>--- a/arch/x86/kernel/head64.c
>+++ b/arch/x86/kernel/head64.c
>@@ -222,13 +222,27 @@ unsigned long __head __startup_64(unsigned long physaddr,
>* we might write invalid pmds, when the kernel is relocated
>* cleanup_highmap() fixes this up along with the mappings
>* beyond _end.
>+   *
>+   * Only the region occupied by the kernel image has so far
>+   * been checked against the table of usable memory regions
>+   * provided by the firmware, so invalidate pages outside that
>+   * region.  A page table entry that maps to a reserved area of
>+   * memory would allow processor speculation into that area,
>+   * and on some hardware (particularly the UV platform) even
>+   * speculative access to some reserved areas is caught as an
>+   * error, causing the BIOS to halt the system.
>*/
> 
>        pmd = fixup_pointer(level2_kernel_pgt, physaddr);
>-       for (i = 0; i < PTRS_PER_PMD; i++) {
>+       for (i = 0; i < pmd_index((unsigned long)_text); i++)
>+               pmd[i] &= ~_PAGE_PRESENT;
>+
>+       for (; i <= pmd_index((unsigned long)_end); i++)
>                if (pmd[i] & _PAGE_PRESENT)
>                        pmd[i] += load_delta;
>-       }
>+
>+       for (; i < PTRS_PER_PMD; i++)
>+               pmd[i] &= ~_PAGE_PRESENT;
> 
>   /*
>* Fixup phys_base - remove the memory encryption mask to obtain

What does your MTRR setup look like, and what memory map do you present, in 
exact detail? The BIOS is normally expected to mark the relevant ranges as UC 
in the MTRRs (that is the remaining, legitimate usage of MTRRs.)

I'm somewhat sceptical that chopping off potentially several megabytes is a 
good thing. We also have the memory type interfaces which can be used to map 
these as UC in the page tables.
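
(For illustration, a hedged sketch of the interface being alluded to;
set_memory_uc() is the real x86 API, but the variable names and page
count below are made up, and note that __startup_64() runs long before
this interface is usable, which is part of the tradeoff under
discussion:)

        #include <asm/set_memory.h>

        /* Hypothetical: mark 16 pages at vaddr uncacheable in the
         * kernel page tables instead of unmapping them entirely. */
        int ret = set_memory_uc((unsigned long)vaddr, 16);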
-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.