[PATCH] x86/setup: Extend low identity map to cover whole kernel range

2015-10-16 Thread Matt Fleming
From: Paolo Bonzini 

On 32-bit systems, the initial_page_table is reused by
efi_call_phys_prolog as an identity map to call
SetVirtualAddressMap.  efi_call_phys_prolog takes care of
converting the current CPU's GDT to a physical address too.

For PAE kernels the identity mapping is achieved by aliasing the
first PDPE for the kernel memory mapping into the first PDPE
of initial_page_table.  This makes the EFI stub's trick "just work".

However, for non-PAE kernels there is no guarantee that the identity
mapping in the initial_page_table extends as far as the GDT; in this
case, accesses to the GDT will cause a page fault (which quickly becomes
a triple fault).  Fix this by copying the kernel mappings from
swapper_pg_dir to initial_page_table twice, both at PAGE_OFFSET and at
identity mapping.

For some reason, this is only reproducible with QEMU's dynamic translation
mode, and not for example with KVM.  However, even under KVM one can clearly
see that the page table is bogus:

$ qemu-system-i386 -pflash OVMF.fd -M q35 vmlinuz0 -s -S -daemonize
$ gdb
(gdb) target remote localhost:1234
(gdb) hb *0x02858f6f
Hardware assisted breakpoint 1 at 0x2858f6f
(gdb) c
Continuing.

Breakpoint 1, 0x02858f6f in ?? ()
(gdb) monitor info registers
...
GDT= 0724e000 00ff
IDT= fffbb000 07ff
CR0=0005003b CR2=ff896000 CR3=032b7000 CR4=0690
...

The page directory is sane:

(gdb) x/4wx 0x32b7000
0x32b7000:  0x03398063  0x03399063  0x0339a063  0x0339b063
(gdb) x/4wx 0x3398000
0x3398000:  0x0163  0x1163  0x2163  0x3163
(gdb) x/4wx 0x3399000
0x3399000:  0x0043  0x00401003  0x00402003  0x00403003

but our particular page directory entry is empty:

(gdb) x/1wx 0x32b7000 + (0x724e000 >> 22) * 4
0x32b7070:  0x

[ It appears that you can skate past this issue if you don't receive
  any interrupts while the bogus GDT pointer is loaded, or if you avoid
  reloading the segment registers in general.

  Andy Lutomirski provides some additional insight:

   "AFAICT it's entirely permissible for the GDTR and/or LDT
descriptor to point to unmapped memory.  Any attempt to use them
(segment loads, interrupts, IRET, etc) will try to access that memory
as if the access came from CPL 0 and, if the access fails, will
generate a valid page fault with CR2 pointing into the GDT or
LDT."

  Up until commit 23a0d4e8fa6d ("efi: Disable interrupts around EFI
  calls, not in the epilog/prolog calls") interrupts were disabled
  around the prolog and epilog calls, and the functional GDT was
  re-installed before interrupts were re-enabled.

  Which explains why no one has hit this issue until now. ]

Signed-off-by: Paolo Bonzini 
Reported-by: Laszlo Ersek 
Cc: 
Cc: Borislav Petkov 
Cc: "H. Peter Anvin" 
Cc: Thomas Gleixner 
Cc: Ingo Molnar 
Cc: Andy Lutomirski 
Signed-off-by: Matt Fleming 
[ Updated changelog. ]
---
 arch/x86/kernel/setup.c | 8 
 1 file changed, 8 insertions(+)

diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index fdb7f2a2d328..a3cccbfc5f77 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -1173,6 +1173,14 @@ void __init setup_arch(char **cmdline_p)
clone_pgd_range(initial_page_table + KERNEL_PGD_BOUNDARY,
swapper_pg_dir + KERNEL_PGD_BOUNDARY,
KERNEL_PGD_PTRS);
+
+   /*
+* sync back low identity map too.  It is used for example
+* in the 32-bit EFI stub.
+*/
+   clone_pgd_range(initial_page_table,
+   swapper_pg_dir + KERNEL_PGD_BOUNDARY,
+   KERNEL_PGD_PTRS);
 #endif
 
tboot_probe();
-- 
2.1.0

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH] x86/setup: Extend low identity map to cover whole kernel range

2015-10-16 Thread Matt Fleming
From: Paolo Bonzini 

On 32-bit systems, the initial_page_table is reused by
efi_call_phys_prolog as an identity map to call
SetVirtualAddressMap.  efi_call_phys_prolog takes care of
converting the current CPU's GDT to a physical address too.

For PAE kernels the identity mapping is achieved by aliasing the
first PDPE for the kernel memory mapping into the first PDPE
of initial_page_table.  This makes the EFI stub's trick "just work".

However, for non-PAE kernels there is no guarantee that the identity
mapping in the initial_page_table extends as far as the GDT; in this
case, accesses to the GDT will cause a page fault (which quickly becomes
a triple fault).  Fix this by copying the kernel mappings from
swapper_pg_dir to initial_page_table twice, both at PAGE_OFFSET and at
identity mapping.

For some reason, this is only reproducible with QEMU's dynamic translation
mode, and not for example with KVM.  However, even under KVM one can clearly
see that the page table is bogus:

$ qemu-system-i386 -pflash OVMF.fd -M q35 vmlinuz0 -s -S -daemonize
$ gdb
(gdb) target remote localhost:1234
(gdb) hb *0x02858f6f
Hardware assisted breakpoint 1 at 0x2858f6f
(gdb) c
Continuing.

Breakpoint 1, 0x02858f6f in ?? ()
(gdb) monitor info registers
...
GDT= 0724e000 00ff
IDT= fffbb000 07ff
CR0=0005003b CR2=ff896000 CR3=032b7000 CR4=0690
...

The page directory is sane:

(gdb) x/4wx 0x32b7000
0x32b7000:  0x03398063  0x03399063  0x0339a063  0x0339b063
(gdb) x/4wx 0x3398000
0x3398000:  0x0163  0x1163  0x2163  0x3163
(gdb) x/4wx 0x3399000
0x3399000:  0x0043  0x00401003  0x00402003  0x00403003

but our particular page directory entry is empty:

(gdb) x/1wx 0x32b7000 + (0x724e000 >> 22) * 4
0x32b7070:  0x

[ It appears that you can skate past this issue if you don't receive
  any interrupts while the bogus GDT pointer is loaded, or if you avoid
  reloading the segment registers in general.

  Andy Lutomirski provides some additional insight:

   "AFAICT it's entirely permissible for the GDTR and/or LDT
descriptor to point to unmapped memory.  Any attempt to use them
(segment loads, interrupts, IRET, etc) will try to access that memory
as if the access came from CPL 0 and, if the access fails, will
generate a valid page fault with CR2 pointing into the GDT or
LDT."

  Up until commit 23a0d4e8fa6d ("efi: Disable interrupts around EFI
  calls, not in the epilog/prolog calls") interrupts were disabled
  around the prolog and epilog calls, and the functional GDT was
  re-installed before interrupts were re-enabled.

  Which explains why no one has hit this issue until now. ]

Signed-off-by: Paolo Bonzini 
Reported-by: Laszlo Ersek 
Cc: 
Cc: Borislav Petkov 
Cc: "H. Peter Anvin" 
Cc: Thomas Gleixner 
Cc: Ingo Molnar 
Cc: Andy Lutomirski 
Signed-off-by: Matt Fleming 
[ Updated changelog. ]
---
 arch/x86/kernel/setup.c | 8 
 1 file changed, 8 insertions(+)

diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index fdb7f2a2d328..a3cccbfc5f77 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -1173,6 +1173,14 @@ void __init setup_arch(char **cmdline_p)
clone_pgd_range(initial_page_table + KERNEL_PGD_BOUNDARY,
swapper_pg_dir + KERNEL_PGD_BOUNDARY,
KERNEL_PGD_PTRS);
+
+   /*
+* sync back low identity map too.  It is used for example
+* in the 32-bit EFI stub.
+*/
+   clone_pgd_range(initial_page_table,
+   swapper_pg_dir + KERNEL_PGD_BOUNDARY,
+   KERNEL_PGD_PTRS);
 #endif
 
tboot_probe();
-- 
2.1.0

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] x86: setup: extend low identity map to cover whole kernel range

2015-10-15 Thread H. Peter Anvin
On October 14, 2015 2:39:58 PM PDT, Andy Lutomirski  wrote:
>On Wed, Oct 14, 2015 at 2:00 PM, Matt Fleming
> wrote:
>> On Wed, 14 Oct, at 09:22:03AM, Andy Lutomirski wrote:
>>> On Wed, Oct 14, 2015 at 6:52 AM, Matt Fleming
> wrote:
>>> > (Pulling in luto for low-level x86 fu)
>>> >
>>> > On Wed, 14 Oct, at 01:30:45PM, Paolo Bonzini wrote:
>>> >> On 32-bit systems, the initial_page_table is reused by
>>> >> efi_call_phys_prolog as an identity map to call
>>> >> SetVirtualAddressMap.  efi_call_phys_prolog takes care of
>>> >> converting the current CPU's GDT to a physical address too.
>>> >>
>>> >> For PAE kernels the identity mapping is achieved by aliasing the
>>> >> first PDPE for the kernel memory mapping into the first PDPE
>>> >> of initial_page_table.  This makes the EFI stub's trick "just
>work".
>>> >>
>>> >> However, for non-PAE kernels there is no guarantee that the
>identity
>>> >> mapping in the initial_page_table extends as far as the GDT; in
>this
>>> >> case, accesses to the GDT will cause a page fault (which quickly
>becomes
>>> >> a triple fault).  Fix this by copying the kernel mappings from
>>> >> swapper_pg_dir to initial_page_table twice, both at PAGE_OFFSET
>and at
>>> >> identity mapping.
>>> >
>>> > Oops, good catch guys. This is clearly a bug, but...
>>> >
>>> >> For some reason, this is only reproducible with QEMU's dynamic
>translation
>>> >> mode, and not for example with KVM.  However, even under KVM one
>can clearly
>>> >> see that the page table is bogus:
>>>
>>> I haven't looked at the code, but it wouldn't surprise me if this is
>>> some kind of TLB issue.  With the hardware TLB (which is in use on
>>> KVM), it seems quite likely that the GDT is pretty much always in
>the
>>> TLB and, if nothing flushes global mappings, then it'll probably
>stick
>>> around.
>>
>> From some quick experiments it appears that you can skate past this
>> issue if you don't receive any interrupts while the bogus GDT pointer
>> is loaded, or if you avoid reloading the segment registers in
>general.
>> Which is interesting because I assumed that writing to GDTR took
>> immediate effect.
>
>Trivia for your amusement:
>
>AFAICT it's entirely permissible for the GDTR and/or LDT descriptor to
>point to unmapped memory.  Any attempt to use them (segment loads,
>interrupts, IRET, etc) will try to access that memory as if the access
>came from CPL 0 and, if the access fails, will generate a valid page
>fault with CR2 pointing into the GDT or LDT.
>
>Xen is nuts^Wclever and actually uses this.
>
>Of course, if your #PF vector references a GDT or LDT descriptor and
>trying to load that descriptor results in a page fault, you get a
>double fault.
>
>I learned this while trying to puzzle out why v1 of my LDT
>synchronization patch caused random faults on Xen.
>
>--Andy

There is no "if"... you can't get to an interrupt vector without going through 
the GDT or LDT.  That being said, the GDT or LDT can be partially mapped.
-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] x86: setup: extend low identity map to cover whole kernel range

2015-10-15 Thread Matt Fleming
On Wed, 14 Oct, at 02:39:58PM, Andy Lutomirski wrote:
> 
> Trivia for your amusement:
> 
> AFAICT it's entirely permissible for the GDTR and/or LDT descriptor to
> point to unmapped memory.  Any attempt to use them (segment loads,
> interrupts, IRET, etc) will try to access that memory as if the access
> came from CPL 0 and, if the access fails, will generate a valid page
> fault with CR2 pointing into the GDT or LDT.
> 
> Xen is nuts^Wclever and actually uses this.
> 
> Of course, if your #PF vector references a GDT or LDT descriptor and
> trying to load that descriptor results in a page fault, you get a
> double fault.
> 
> I learned this while trying to puzzle out why v1 of my LDT
> synchronization patch caused random faults on Xen.

Ha, interesting! Thanks Andy, it's good to get confirmation.

OK, I think we understand this issue well enough to call this fixed.
I've queued the following patch up in my urgent tree and I'll send a
pull request to the tip folks tomorrow.

---

>From 72ac9f242fad7c3bf3e06461e34c9a546d8cb411 Mon Sep 17 00:00:00 2001
From: Paolo Bonzini 
Date: Wed, 14 Oct 2015 13:30:45 +0200
Subject: [PATCH] x86/setup: Extend low identity map to cover whole kernel
 range

On 32-bit systems, the initial_page_table is reused by
efi_call_phys_prolog as an identity map to call
SetVirtualAddressMap.  efi_call_phys_prolog takes care of
converting the current CPU's GDT to a physical address too.

For PAE kernels the identity mapping is achieved by aliasing the
first PDPE for the kernel memory mapping into the first PDPE
of initial_page_table.  This makes the EFI stub's trick "just work".

However, for non-PAE kernels there is no guarantee that the identity
mapping in the initial_page_table extends as far as the GDT; in this
case, accesses to the GDT will cause a page fault (which quickly becomes
a triple fault).  Fix this by copying the kernel mappings from
swapper_pg_dir to initial_page_table twice, both at PAGE_OFFSET and at
identity mapping.

For some reason, this is only reproducible with QEMU's dynamic translation
mode, and not for example with KVM.  However, even under KVM one can clearly
see that the page table is bogus:

$ qemu-system-i386 -pflash OVMF.fd -M q35 vmlinuz0 -s -S -daemonize
$ gdb
(gdb) target remote localhost:1234
(gdb) hb *0x02858f6f
Hardware assisted breakpoint 1 at 0x2858f6f
(gdb) c
Continuing.

Breakpoint 1, 0x02858f6f in ?? ()
(gdb) monitor info registers
...
GDT= 0724e000 00ff
IDT= fffbb000 07ff
CR0=0005003b CR2=ff896000 CR3=032b7000 CR4=0690
...

The page directory is sane:

(gdb) x/4wx 0x32b7000
0x32b7000:  0x03398063  0x03399063  0x0339a063  0x0339b063
(gdb) x/4wx 0x3398000
0x3398000:  0x0163  0x1163  0x2163  0x3163
(gdb) x/4wx 0x3399000
0x3399000:  0x0043  0x00401003  0x00402003  0x00403003

but our particular page directory entry is empty:

(gdb) x/1wx 0x32b7000 + (0x724e000 >> 22) * 4
0x32b7070:  0x

[ It appears that you can skate past this issue if you don't receive
  any interrupts while the bogus GDT pointer is loaded, or if you avoid
  reloading the segment registers in general.

  Andy Lutomirski provides some additional insight:

   "AFAICT it's entirely permissible for the GDTR and/or LDT
descriptor to point to unmapped memory.  Any attempt to use them
(segment loads, interrupts, IRET, etc) will try to access that memory
as if the access came from CPL 0 and, if the access fails, will
generate a valid page fault with CR2 pointing into the GDT or
LDT."

  Up until commit 23a0d4e8fa6d ("efi: Disable interrupts around EFI
  calls, not in the epilog/prolog calls") interrupts were disabled
  around the prolog and epilog calls, and the functional GDT was
  re-installed before interrupts were re-enabled.

  Which explains why no one has hit this issue until now. ]

Signed-off-by: Paolo Bonzini 
Reported-by: Laszlo Ersek 
Cc: 
Cc: Borislav Petkov 
Cc: "H. Peter Anvin" 
Cc: Thomas Gleixner 
Cc: Ingo Molnar 
Cc: Andy Lutomirski 
Signed-off-by: Matt Fleming 
[ Updated changelog. ]
---
 arch/x86/kernel/setup.c | 8 
 1 file changed, 8 insertions(+)

diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index fdb7f2a2d328..a3cccbfc5f77 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -1173,6 +1173,14 @@ void __init setup_arch(char **cmdline_p)
clone_pgd_range(initial_page_table + KERNEL_PGD_BOUNDARY,
swapper_pg_dir + KERNEL_PGD_BOUNDARY,
KERNEL_PGD_PTRS);
+
+   /*
+* sync back low identity map too.  It is used for example
+* in the 32-bit EFI stub.
+*/
+   clone_pgd_range(initial_page_table,
+   swapper_pg_dir + KERNEL_PGD_BOUN

Re: [PATCH] x86: setup: extend low identity map to cover whole kernel range

2015-10-15 Thread H. Peter Anvin
On October 14, 2015 2:39:58 PM PDT, Andy Lutomirski  wrote:
>On Wed, Oct 14, 2015 at 2:00 PM, Matt Fleming
> wrote:
>> On Wed, 14 Oct, at 09:22:03AM, Andy Lutomirski wrote:
>>> On Wed, Oct 14, 2015 at 6:52 AM, Matt Fleming
> wrote:
>>> > (Pulling in luto for low-level x86 fu)
>>> >
>>> > On Wed, 14 Oct, at 01:30:45PM, Paolo Bonzini wrote:
>>> >> On 32-bit systems, the initial_page_table is reused by
>>> >> efi_call_phys_prolog as an identity map to call
>>> >> SetVirtualAddressMap.  efi_call_phys_prolog takes care of
>>> >> converting the current CPU's GDT to a physical address too.
>>> >>
>>> >> For PAE kernels the identity mapping is achieved by aliasing the
>>> >> first PDPE for the kernel memory mapping into the first PDPE
>>> >> of initial_page_table.  This makes the EFI stub's trick "just
>work".
>>> >>
>>> >> However, for non-PAE kernels there is no guarantee that the
>identity
>>> >> mapping in the initial_page_table extends as far as the GDT; in
>this
>>> >> case, accesses to the GDT will cause a page fault (which quickly
>becomes
>>> >> a triple fault).  Fix this by copying the kernel mappings from
>>> >> swapper_pg_dir to initial_page_table twice, both at PAGE_OFFSET
>and at
>>> >> identity mapping.
>>> >
>>> > Oops, good catch guys. This is clearly a bug, but...
>>> >
>>> >> For some reason, this is only reproducible with QEMU's dynamic
>translation
>>> >> mode, and not for example with KVM.  However, even under KVM one
>can clearly
>>> >> see that the page table is bogus:
>>>
>>> I haven't looked at the code, but it wouldn't surprise me if this is
>>> some kind of TLB issue.  With the hardware TLB (which is in use on
>>> KVM), it seems quite likely that the GDT is pretty much always in
>the
>>> TLB and, if nothing flushes global mappings, then it'll probably
>stick
>>> around.
>>
>> From some quick experiments it appears that you can skate past this
>> issue if you don't receive any interrupts while the bogus GDT pointer
>> is loaded, or if you avoid reloading the segment registers in
>general.
>> Which is interesting because I assumed that writing to GDTR took
>> immediate effect.
>
>Trivia for your amusement:
>
>AFAICT it's entirely permissible for the GDTR and/or LDT descriptor to
>point to unmapped memory.  Any attempt to use them (segment loads,
>interrupts, IRET, etc) will try to access that memory as if the access
>came from CPL 0 and, if the access fails, will generate a valid page
>fault with CR2 pointing into the GDT or LDT.
>
>Xen is nuts^Wclever and actually uses this.
>
>Of course, if your #PF vector references a GDT or LDT descriptor and
>trying to load that descriptor results in a page fault, you get a
>double fault.
>
>I learned this while trying to puzzle out why v1 of my LDT
>synchronization patch caused random faults on Xen.
>
>--Andy

There is no "if"... you can't get to an interrupt vector without going through 
the GDT or LDT.  That being said, the GDT or LDT can be partially mapped.
-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] x86: setup: extend low identity map to cover whole kernel range

2015-10-15 Thread Matt Fleming
On Wed, 14 Oct, at 02:39:58PM, Andy Lutomirski wrote:
> 
> Trivia for your amusement:
> 
> AFAICT it's entirely permissible for the GDTR and/or LDT descriptor to
> point to unmapped memory.  Any attempt to use them (segment loads,
> interrupts, IRET, etc) will try to access that memory as if the access
> came from CPL 0 and, if the access fails, will generate a valid page
> fault with CR2 pointing into the GDT or LDT.
> 
> Xen is nuts^Wclever and actually uses this.
> 
> Of course, if your #PF vector references a GDT or LDT descriptor and
> trying to load that descriptor results in a page fault, you get a
> double fault.
> 
> I learned this while trying to puzzle out why v1 of my LDT
> synchronization patch caused random faults on Xen.

Ha, interesting! Thanks Andy, it's good to get confirmation.

OK, I think we understand this issue well enough to call this fixed.
I've queued the following patch up in my urgent tree and I'll send a
pull request to the tip folks tomorrow.

---

>From 72ac9f242fad7c3bf3e06461e34c9a546d8cb411 Mon Sep 17 00:00:00 2001
From: Paolo Bonzini <pbonz...@redhat.com>
Date: Wed, 14 Oct 2015 13:30:45 +0200
Subject: [PATCH] x86/setup: Extend low identity map to cover whole kernel
 range

On 32-bit systems, the initial_page_table is reused by
efi_call_phys_prolog as an identity map to call
SetVirtualAddressMap.  efi_call_phys_prolog takes care of
converting the current CPU's GDT to a physical address too.

For PAE kernels the identity mapping is achieved by aliasing the
first PDPE for the kernel memory mapping into the first PDPE
of initial_page_table.  This makes the EFI stub's trick "just work".

However, for non-PAE kernels there is no guarantee that the identity
mapping in the initial_page_table extends as far as the GDT; in this
case, accesses to the GDT will cause a page fault (which quickly becomes
a triple fault).  Fix this by copying the kernel mappings from
swapper_pg_dir to initial_page_table twice, both at PAGE_OFFSET and at
identity mapping.

For some reason, this is only reproducible with QEMU's dynamic translation
mode, and not for example with KVM.  However, even under KVM one can clearly
see that the page table is bogus:

$ qemu-system-i386 -pflash OVMF.fd -M q35 vmlinuz0 -s -S -daemonize
$ gdb
(gdb) target remote localhost:1234
(gdb) hb *0x02858f6f
Hardware assisted breakpoint 1 at 0x2858f6f
(gdb) c
Continuing.

Breakpoint 1, 0x02858f6f in ?? ()
(gdb) monitor info registers
...
GDT= 0724e000 00ff
IDT= fffbb000 07ff
CR0=0005003b CR2=ff896000 CR3=032b7000 CR4=0690
...

The page directory is sane:

(gdb) x/4wx 0x32b7000
0x32b7000:  0x03398063  0x03399063  0x0339a063  0x0339b063
(gdb) x/4wx 0x3398000
0x3398000:  0x0163  0x1163  0x2163  0x3163
(gdb) x/4wx 0x3399000
0x3399000:  0x0043  0x00401003  0x00402003  0x00403003

but our particular page directory entry is empty:

(gdb) x/1wx 0x32b7000 + (0x724e000 >> 22) * 4
0x32b7070:  0x

[ It appears that you can skate past this issue if you don't receive
  any interrupts while the bogus GDT pointer is loaded, or if you avoid
  reloading the segment registers in general.

  Andy Lutomirski provides some additional insight:

   "AFAICT it's entirely permissible for the GDTR and/or LDT
descriptor to point to unmapped memory.  Any attempt to use them
(segment loads, interrupts, IRET, etc) will try to access that memory
as if the access came from CPL 0 and, if the access fails, will
generate a valid page fault with CR2 pointing into the GDT or
LDT."

  Up until commit 23a0d4e8fa6d ("efi: Disable interrupts around EFI
  calls, not in the epilog/prolog calls") interrupts were disabled
  around the prolog and epilog calls, and the functional GDT was
  re-installed before interrupts were re-enabled.

  Which explains why no one has hit this issue until now. ]

Signed-off-by: Paolo Bonzini <pbonz...@redhat.com>
Reported-by: Laszlo Ersek <ler...@redhat.com>
Cc: <sta...@vger.kernel.org>
Cc: Borislav Petkov <b...@alien8.de>
Cc: "H. Peter Anvin" <h...@zytor.com>
Cc: Thomas Gleixner <t...@linutronix.de>
Cc: Ingo Molnar <mi...@kernel.org>
Cc: Andy Lutomirski <l...@amacapital.net>
Signed-off-by: Matt Fleming <matt.flem...@intel.com>
[ Updated changelog. ]
---
 arch/x86/kernel/setup.c | 8 
 1 file changed, 8 insertions(+)

diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index fdb7f2a2d328..a3cccbfc5f77 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -1173,6 +1173,14 @@ void __init setup_arch(char **cmdline_p)
clone_pgd_range(initial_page_table + KERNEL_PGD_BOUNDARY,
swapper_pg_dir + KERNEL_PGD_BOUND

Re: [PATCH] x86: setup: extend low identity map to cover whole kernel range

2015-10-14 Thread Andy Lutomirski
On Wed, Oct 14, 2015 at 2:00 PM, Matt Fleming  wrote:
> On Wed, 14 Oct, at 09:22:03AM, Andy Lutomirski wrote:
>> On Wed, Oct 14, 2015 at 6:52 AM, Matt Fleming  
>> wrote:
>> > (Pulling in luto for low-level x86 fu)
>> >
>> > On Wed, 14 Oct, at 01:30:45PM, Paolo Bonzini wrote:
>> >> On 32-bit systems, the initial_page_table is reused by
>> >> efi_call_phys_prolog as an identity map to call
>> >> SetVirtualAddressMap.  efi_call_phys_prolog takes care of
>> >> converting the current CPU's GDT to a physical address too.
>> >>
>> >> For PAE kernels the identity mapping is achieved by aliasing the
>> >> first PDPE for the kernel memory mapping into the first PDPE
>> >> of initial_page_table.  This makes the EFI stub's trick "just work".
>> >>
>> >> However, for non-PAE kernels there is no guarantee that the identity
>> >> mapping in the initial_page_table extends as far as the GDT; in this
>> >> case, accesses to the GDT will cause a page fault (which quickly becomes
>> >> a triple fault).  Fix this by copying the kernel mappings from
>> >> swapper_pg_dir to initial_page_table twice, both at PAGE_OFFSET and at
>> >> identity mapping.
>> >
>> > Oops, good catch guys. This is clearly a bug, but...
>> >
>> >> For some reason, this is only reproducible with QEMU's dynamic translation
>> >> mode, and not for example with KVM.  However, even under KVM one can 
>> >> clearly
>> >> see that the page table is bogus:
>>
>> I haven't looked at the code, but it wouldn't surprise me if this is
>> some kind of TLB issue.  With the hardware TLB (which is in use on
>> KVM), it seems quite likely that the GDT is pretty much always in the
>> TLB and, if nothing flushes global mappings, then it'll probably stick
>> around.
>
> From some quick experiments it appears that you can skate past this
> issue if you don't receive any interrupts while the bogus GDT pointer
> is loaded, or if you avoid reloading the segment registers in general.
> Which is interesting because I assumed that writing to GDTR took
> immediate effect.

Trivia for your amusement:

AFAICT it's entirely permissible for the GDTR and/or LDT descriptor to
point to unmapped memory.  Any attempt to use them (segment loads,
interrupts, IRET, etc) will try to access that memory as if the access
came from CPL 0 and, if the access fails, will generate a valid page
fault with CR2 pointing into the GDT or LDT.

Xen is nuts^Wclever and actually uses this.

Of course, if your #PF vector references a GDT or LDT descriptor and
trying to load that descriptor results in a page fault, you get a
double fault.

I learned this while trying to puzzle out why v1 of my LDT
synchronization patch caused random faults on Xen.

--Andy
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] x86: setup: extend low identity map to cover whole kernel range

2015-10-14 Thread Matt Fleming
On Wed, 14 Oct, at 04:29:33PM, Paolo Bonzini wrote:
> 
> On 14/10/2015 15:52, Matt Fleming wrote:
> >> > However, for non-PAE kernels there is no guarantee that the identity
> >> > mapping in the initial_page_table extends as far as the GDT; in this
> >> > case, accesses to the GDT will cause a page fault (which quickly becomes
> >> > a triple fault).  Fix this by copying the kernel mappings from
> >> > swapper_pg_dir to initial_page_table twice, both at PAGE_OFFSET and at
> >> > identity mapping.
> >  
> > Oops, good catch guys. This is clearly a bug, but...
> > 
> > ... I'm a little surprised you managed to trigger this at all, because
> > the GDT we load in efi_call_phys_prolog() is part of the per-cpu data
> > section and therefore part of the kernel image.
> 
> Only until setup_percpu, which is earlier than SetVirtualAddressMap.
> For example, I get:
> 
>   setup_percpu: NR_CPUS:8 nr_cpumask_bits:8 nr_cpu_ids:1 nr_node_ids:1
>   PERCPU: Embedded 18 pages/cpu @c728e000 s41800 r0 d31928 u73728
>   ^^^
> but the kernel image ends at 0x037f.
>
> The GDT is 0xc728e000 in this run, so the GDT is at the beginning of the
> relocated percpu area.

Ah, good point. I completely missed that the percpu sections get
relocated.

-- 
Matt Fleming, Intel Open Source Technology Center
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] x86: setup: extend low identity map to cover whole kernel range

2015-10-14 Thread Matt Fleming
On Wed, 14 Oct, at 09:22:03AM, Andy Lutomirski wrote:
> On Wed, Oct 14, 2015 at 6:52 AM, Matt Fleming  
> wrote:
> > (Pulling in luto for low-level x86 fu)
> >
> > On Wed, 14 Oct, at 01:30:45PM, Paolo Bonzini wrote:
> >> On 32-bit systems, the initial_page_table is reused by
> >> efi_call_phys_prolog as an identity map to call
> >> SetVirtualAddressMap.  efi_call_phys_prolog takes care of
> >> converting the current CPU's GDT to a physical address too.
> >>
> >> For PAE kernels the identity mapping is achieved by aliasing the
> >> first PDPE for the kernel memory mapping into the first PDPE
> >> of initial_page_table.  This makes the EFI stub's trick "just work".
> >>
> >> However, for non-PAE kernels there is no guarantee that the identity
> >> mapping in the initial_page_table extends as far as the GDT; in this
> >> case, accesses to the GDT will cause a page fault (which quickly becomes
> >> a triple fault).  Fix this by copying the kernel mappings from
> >> swapper_pg_dir to initial_page_table twice, both at PAGE_OFFSET and at
> >> identity mapping.
> >
> > Oops, good catch guys. This is clearly a bug, but...
> >
> >> For some reason, this is only reproducible with QEMU's dynamic translation
> >> mode, and not for example with KVM.  However, even under KVM one can 
> >> clearly
> >> see that the page table is bogus:
> 
> I haven't looked at the code, but it wouldn't surprise me if this is
> some kind of TLB issue.  With the hardware TLB (which is in use on
> KVM), it seems quite likely that the GDT is pretty much always in the
> TLB and, if nothing flushes global mappings, then it'll probably stick
> around.

>From some quick experiments it appears that you can skate past this
issue if you don't receive any interrupts while the bogus GDT pointer
is loaded, or if you avoid reloading the segment registers in general.
Which is interesting because I assumed that writing to GDTR took
immediate effect.

Up until commit 23a0d4e8fa6d ("efi: Disable interrupts around EFI
calls, not in the epilog/prolog calls") interrupts were disabled
around the prolog and epilog calls, and the functional GDT was
re-installed before interrupts were re-enabled. 

That does explain why no one has complained about this issue before.

-- 
Matt Fleming, Intel Open Source Technology Center
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] x86: setup: extend low identity map to cover whole kernel range

2015-10-14 Thread Andy Lutomirski
On Wed, Oct 14, 2015 at 6:52 AM, Matt Fleming  wrote:
> (Pulling in luto for low-level x86 fu)
>
> On Wed, 14 Oct, at 01:30:45PM, Paolo Bonzini wrote:
>> On 32-bit systems, the initial_page_table is reused by
>> efi_call_phys_prolog as an identity map to call
>> SetVirtualAddressMap.  efi_call_phys_prolog takes care of
>> converting the current CPU's GDT to a physical address too.
>>
>> For PAE kernels the identity mapping is achieved by aliasing the
>> first PDPE for the kernel memory mapping into the first PDPE
>> of initial_page_table.  This makes the EFI stub's trick "just work".
>>
>> However, for non-PAE kernels there is no guarantee that the identity
>> mapping in the initial_page_table extends as far as the GDT; in this
>> case, accesses to the GDT will cause a page fault (which quickly becomes
>> a triple fault).  Fix this by copying the kernel mappings from
>> swapper_pg_dir to initial_page_table twice, both at PAGE_OFFSET and at
>> identity mapping.
>
> Oops, good catch guys. This is clearly a bug, but...
>
>> For some reason, this is only reproducible with QEMU's dynamic translation
>> mode, and not for example with KVM.  However, even under KVM one can clearly
>> see that the page table is bogus:

I haven't looked at the code, but it wouldn't surprise me if this is
some kind of TLB issue.  With the hardware TLB (which is in use on
KVM), it seems quite likely that the GDT is pretty much always in the
TLB and, if nothing flushes global mappings, then it'll probably stick
around.

--Andy
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] x86: setup: extend low identity map to cover whole kernel range

2015-10-14 Thread Paolo Bonzini


On 14/10/2015 15:52, Matt Fleming wrote:
>> > However, for non-PAE kernels there is no guarantee that the identity
>> > mapping in the initial_page_table extends as far as the GDT; in this
>> > case, accesses to the GDT will cause a page fault (which quickly becomes
>> > a triple fault).  Fix this by copying the kernel mappings from
>> > swapper_pg_dir to initial_page_table twice, both at PAGE_OFFSET and at
>> > identity mapping.
>  
> Oops, good catch guys. This is clearly a bug, but...
> 
> ... I'm a little surprised you managed to trigger this at all, because
> the GDT we load in efi_call_phys_prolog() is part of the per-cpu data
> section and therefore part of the kernel image.

Only until setup_percpu, which is earlier than SetVirtualAddressMap.
For example, I get:

  setup_percpu: NR_CPUS:8 nr_cpumask_bits:8 nr_cpu_ids:1 nr_node_ids:1
  PERCPU: Embedded 18 pages/cpu @c728e000 s41800 r0 d31928 u73728
  ^^^
but the kernel image ends at 0x037f.

The GDT is 0xc728e000 in this run, so the GDT is at the beginning of the
relocated percpu area.

In the above run, the FS base that switch_to_new_gdt loads is 0x551C000.
 You have 0x728E000 - 0x551C000 = 0x1D72000, and from tracing I see that
one of the GDT values that is loaded very early is exactly 0xC1D72000.
That _is_ inside the kernel image of course when you remove PAGE_OFFSET.

Paolo
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] x86: setup: extend low identity map to cover whole kernel range

2015-10-14 Thread Matt Fleming
(Pulling in luto for low-level x86 fu)

On Wed, 14 Oct, at 01:30:45PM, Paolo Bonzini wrote:
> On 32-bit systems, the initial_page_table is reused by
> efi_call_phys_prolog as an identity map to call
> SetVirtualAddressMap.  efi_call_phys_prolog takes care of
> converting the current CPU's GDT to a physical address too.
> 
> For PAE kernels the identity mapping is achieved by aliasing the
> first PDPE for the kernel memory mapping into the first PDPE
> of initial_page_table.  This makes the EFI stub's trick "just work".
> 
> However, for non-PAE kernels there is no guarantee that the identity
> mapping in the initial_page_table extends as far as the GDT; in this
> case, accesses to the GDT will cause a page fault (which quickly becomes
> a triple fault).  Fix this by copying the kernel mappings from
> swapper_pg_dir to initial_page_table twice, both at PAGE_OFFSET and at
> identity mapping.
 
Oops, good catch guys. This is clearly a bug, but...

> For some reason, this is only reproducible with QEMU's dynamic translation
> mode, and not for example with KVM.  However, even under KVM one can clearly
> see that the page table is bogus:
> 
> $ qemu-system-i386 -pflash OVMF.fd -M q35 vmlinuz0 -s -S -daemonize
> $ gdb
> (gdb) target remote localhost:1234
> (gdb) hb *0x02858f6f
> Hardware assisted breakpoint 1 at 0x2858f6f
> (gdb) c
> Continuing.
> 
> Breakpoint 1, 0x02858f6f in ?? ()
> (gdb) monitor info registers
> ...
> GDT= 0724e000 00ff
> IDT= fffbb000 07ff
> CR0=0005003b CR2=ff896000 CR3=032b7000 CR4=0690
> ...
> 
> The page directory is sane:
> 
> (gdb) x/4wx 0x32b7000
> 0x32b7000:0x03398063  0x03399063  0x0339a063  
> 0x0339b063
> (gdb) x/4wx 0x3398000
> 0x3398000:0x0163  0x1163  0x2163  
> 0x3163
> (gdb) x/4wx 0x3399000
> 0x3399000:0x0043  0x00401003  0x00402003  
> 0x00403003
> 
> but our particular page directory entry is empty:
> 
> (gdb) x/1wx 0x32b7000 + (0x724e000 >> 22) * 4
> 0x32b7070:0x
 
... I'm a little surprised you managed to trigger this at all, because
the GDT we load in efi_call_phys_prolog() is part of the per-cpu data
section and therefore part of the kernel image.

The kernel image *is* mapped in the identity range, even for non-PAE
kernels.

So yes, you're right this is a bug but I'm not sure how you're
actually triggering it.

> Reported-by: Laszlo Ersek 
> Cc: sta...@vger.kernel.org
> Signed-off-by: Paolo Bonzini 
> ---
>  arch/x86/kernel/setup.c | 8 
>  1 file changed, 8 insertions(+)
> 
> diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
> index 80f874bf999e..24154bd12307 100644
> --- a/arch/x86/kernel/setup.c
> +++ b/arch/x86/kernel/setup.c
> @@ -1198,6 +1198,14 @@ void __init setup_arch(char **cmdline_p)
>   clone_pgd_range(initial_page_table + KERNEL_PGD_BOUNDARY,
>   swapper_pg_dir + KERNEL_PGD_BOUNDARY,
>   KERNEL_PGD_PTRS);
> +
> + /*
> +  * sync back low identity map too.  It is used for example
> +  * in the 32-bit EFI stub.
> +  */
> + clone_pgd_range(initial_page_table,
> + swapper_pg_dir + KERNEL_PGD_BOUNDARY,
> + KERNEL_PGD_PTRS);
>  #endif
>  
>   tboot_probe();
> -- 
> 2.5.0
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-efi" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

-- 
Matt Fleming, Intel Open Source Technology Center
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH] x86: setup: extend low identity map to cover whole kernel range

2015-10-14 Thread Paolo Bonzini
On 32-bit systems, the initial_page_table is reused by
efi_call_phys_prolog as an identity map to call
SetVirtualAddressMap.  efi_call_phys_prolog takes care of
converting the current CPU's GDT to a physical address too.

For PAE kernels the identity mapping is achieved by aliasing the
first PDPE for the kernel memory mapping into the first PDPE
of initial_page_table.  This makes the EFI stub's trick "just work".

However, for non-PAE kernels there is no guarantee that the identity
mapping in the initial_page_table extends as far as the GDT; in this
case, accesses to the GDT will cause a page fault (which quickly becomes
a triple fault).  Fix this by copying the kernel mappings from
swapper_pg_dir to initial_page_table twice, both at PAGE_OFFSET and at
identity mapping.

For some reason, this is only reproducible with QEMU's dynamic translation
mode, and not for example with KVM.  However, even under KVM one can clearly
see that the page table is bogus:

$ qemu-system-i386 -pflash OVMF.fd -M q35 vmlinuz0 -s -S -daemonize
$ gdb
(gdb) target remote localhost:1234
(gdb) hb *0x02858f6f
Hardware assisted breakpoint 1 at 0x2858f6f
(gdb) c
Continuing.

Breakpoint 1, 0x02858f6f in ?? ()
(gdb) monitor info registers
...
GDT= 0724e000 00ff
IDT= fffbb000 07ff
CR0=0005003b CR2=ff896000 CR3=032b7000 CR4=0690
...

The page directory is sane:

(gdb) x/4wx 0x32b7000
0x32b7000:  0x03398063  0x03399063  0x0339a063  0x0339b063
(gdb) x/4wx 0x3398000
0x3398000:  0x0163  0x1163  0x2163  0x3163
(gdb) x/4wx 0x3399000
0x3399000:  0x0043  0x00401003  0x00402003  0x00403003

but our particular page directory entry is empty:

(gdb) x/1wx 0x32b7000 + (0x724e000 >> 22) * 4
0x32b7070:  0x

Reported-by: Laszlo Ersek 
Cc: sta...@vger.kernel.org
Signed-off-by: Paolo Bonzini 
---
 arch/x86/kernel/setup.c | 8 
 1 file changed, 8 insertions(+)

diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index 80f874bf999e..24154bd12307 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -1198,6 +1198,14 @@ void __init setup_arch(char **cmdline_p)
clone_pgd_range(initial_page_table + KERNEL_PGD_BOUNDARY,
swapper_pg_dir + KERNEL_PGD_BOUNDARY,
KERNEL_PGD_PTRS);
+
+   /*
+* sync back low identity map too.  It is used for example
+* in the 32-bit EFI stub.
+*/
+   clone_pgd_range(initial_page_table,
+   swapper_pg_dir + KERNEL_PGD_BOUNDARY,
+   KERNEL_PGD_PTRS);
 #endif
 
tboot_probe();
-- 
2.5.0

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] x86: setup: extend low identity map to cover whole kernel range

2015-10-14 Thread Paolo Bonzini


On 14/10/2015 15:52, Matt Fleming wrote:
>> > However, for non-PAE kernels there is no guarantee that the identity
>> > mapping in the initial_page_table extends as far as the GDT; in this
>> > case, accesses to the GDT will cause a page fault (which quickly becomes
>> > a triple fault).  Fix this by copying the kernel mappings from
>> > swapper_pg_dir to initial_page_table twice, both at PAGE_OFFSET and at
>> > identity mapping.
>  
> Oops, good catch guys. This is clearly a bug, but...
> 
> ... I'm a little surprised you managed to trigger this at all, because
> the GDT we load in efi_call_phys_prolog() is part of the per-cpu data
> section and therefore part of the kernel image.

Only until setup_percpu, which is earlier than SetVirtualAddressMap.
For example, I get:

  setup_percpu: NR_CPUS:8 nr_cpumask_bits:8 nr_cpu_ids:1 nr_node_ids:1
  PERCPU: Embedded 18 pages/cpu @c728e000 s41800 r0 d31928 u73728
  ^^^
but the kernel image ends at 0x037f.

The GDT is 0xc728e000 in this run, so the GDT is at the beginning of the
relocated percpu area.

In the above run, the FS base that switch_to_new_gdt loads is 0x551C000.
 You have 0x728E000 - 0x551C000 = 0x1D72000, and from tracing I see that
one of the GDT values that is loaded very early is exactly 0xC1D72000.
That _is_ inside the kernel image of course when you remove PAGE_OFFSET.

Paolo
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] x86: setup: extend low identity map to cover whole kernel range

2015-10-14 Thread Andy Lutomirski
On Wed, Oct 14, 2015 at 6:52 AM, Matt Fleming  wrote:
> (Pulling in luto for low-level x86 fu)
>
> On Wed, 14 Oct, at 01:30:45PM, Paolo Bonzini wrote:
>> On 32-bit systems, the initial_page_table is reused by
>> efi_call_phys_prolog as an identity map to call
>> SetVirtualAddressMap.  efi_call_phys_prolog takes care of
>> converting the current CPU's GDT to a physical address too.
>>
>> For PAE kernels the identity mapping is achieved by aliasing the
>> first PDPE for the kernel memory mapping into the first PDPE
>> of initial_page_table.  This makes the EFI stub's trick "just work".
>>
>> However, for non-PAE kernels there is no guarantee that the identity
>> mapping in the initial_page_table extends as far as the GDT; in this
>> case, accesses to the GDT will cause a page fault (which quickly becomes
>> a triple fault).  Fix this by copying the kernel mappings from
>> swapper_pg_dir to initial_page_table twice, both at PAGE_OFFSET and at
>> identity mapping.
>
> Oops, good catch guys. This is clearly a bug, but...
>
>> For some reason, this is only reproducible with QEMU's dynamic translation
>> mode, and not for example with KVM.  However, even under KVM one can clearly
>> see that the page table is bogus:

I haven't looked at the code, but it wouldn't surprise me if this is
some kind of TLB issue.  With the hardware TLB (which is in use on
KVM), it seems quite likely that the GDT is pretty much always in the
TLB and, if nothing flushes global mappings, then it'll probably stick
around.

--Andy
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] x86: setup: extend low identity map to cover whole kernel range

2015-10-14 Thread Matt Fleming
On Wed, 14 Oct, at 04:29:33PM, Paolo Bonzini wrote:
> 
> On 14/10/2015 15:52, Matt Fleming wrote:
> >> > However, for non-PAE kernels there is no guarantee that the identity
> >> > mapping in the initial_page_table extends as far as the GDT; in this
> >> > case, accesses to the GDT will cause a page fault (which quickly becomes
> >> > a triple fault).  Fix this by copying the kernel mappings from
> >> > swapper_pg_dir to initial_page_table twice, both at PAGE_OFFSET and at
> >> > identity mapping.
> >  
> > Oops, good catch guys. This is clearly a bug, but...
> > 
> > ... I'm a little surprised you managed to trigger this at all, because
> > the GDT we load in efi_call_phys_prolog() is part of the per-cpu data
> > section and therefore part of the kernel image.
> 
> Only until setup_percpu, which is earlier than SetVirtualAddressMap.
> For example, I get:
> 
>   setup_percpu: NR_CPUS:8 nr_cpumask_bits:8 nr_cpu_ids:1 nr_node_ids:1
>   PERCPU: Embedded 18 pages/cpu @c728e000 s41800 r0 d31928 u73728
>   ^^^
> but the kernel image ends at 0x037f.
>
> The GDT is 0xc728e000 in this run, so the GDT is at the beginning of the
> relocated percpu area.

Ah, good point. I completely missed that the percpu sections get
relocated.

-- 
Matt Fleming, Intel Open Source Technology Center
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] x86: setup: extend low identity map to cover whole kernel range

2015-10-14 Thread Andy Lutomirski
On Wed, Oct 14, 2015 at 2:00 PM, Matt Fleming  wrote:
> On Wed, 14 Oct, at 09:22:03AM, Andy Lutomirski wrote:
>> On Wed, Oct 14, 2015 at 6:52 AM, Matt Fleming  
>> wrote:
>> > (Pulling in luto for low-level x86 fu)
>> >
>> > On Wed, 14 Oct, at 01:30:45PM, Paolo Bonzini wrote:
>> >> On 32-bit systems, the initial_page_table is reused by
>> >> efi_call_phys_prolog as an identity map to call
>> >> SetVirtualAddressMap.  efi_call_phys_prolog takes care of
>> >> converting the current CPU's GDT to a physical address too.
>> >>
>> >> For PAE kernels the identity mapping is achieved by aliasing the
>> >> first PDPE for the kernel memory mapping into the first PDPE
>> >> of initial_page_table.  This makes the EFI stub's trick "just work".
>> >>
>> >> However, for non-PAE kernels there is no guarantee that the identity
>> >> mapping in the initial_page_table extends as far as the GDT; in this
>> >> case, accesses to the GDT will cause a page fault (which quickly becomes
>> >> a triple fault).  Fix this by copying the kernel mappings from
>> >> swapper_pg_dir to initial_page_table twice, both at PAGE_OFFSET and at
>> >> identity mapping.
>> >
>> > Oops, good catch guys. This is clearly a bug, but...
>> >
>> >> For some reason, this is only reproducible with QEMU's dynamic translation
>> >> mode, and not for example with KVM.  However, even under KVM one can 
>> >> clearly
>> >> see that the page table is bogus:
>>
>> I haven't looked at the code, but it wouldn't surprise me if this is
>> some kind of TLB issue.  With the hardware TLB (which is in use on
>> KVM), it seems quite likely that the GDT is pretty much always in the
>> TLB and, if nothing flushes global mappings, then it'll probably stick
>> around.
>
> From some quick experiments it appears that you can skate past this
> issue if you don't receive any interrupts while the bogus GDT pointer
> is loaded, or if you avoid reloading the segment registers in general.
> Which is interesting because I assumed that writing to GDTR took
> immediate effect.

Trivia for your amusement:

AFAICT it's entirely permissible for the GDTR and/or LDT descriptor to
point to unmapped memory.  Any attempt to use them (segment loads,
interrupts, IRET, etc) will try to access that memory as if the access
came from CPL 0 and, if the access fails, will generate a valid page
fault with CR2 pointing into the GDT or LDT.

Xen is nuts^Wclever and actually uses this.

Of course, if your #PF vector references a GDT or LDT descriptor and
trying to load that descriptor results in a page fault, you get a
double fault.

I learned this while trying to puzzle out why v1 of my LDT
synchronization patch caused random faults on Xen.

--Andy
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] x86: setup: extend low identity map to cover whole kernel range

2015-10-14 Thread Matt Fleming
On Wed, 14 Oct, at 09:22:03AM, Andy Lutomirski wrote:
> On Wed, Oct 14, 2015 at 6:52 AM, Matt Fleming  
> wrote:
> > (Pulling in luto for low-level x86 fu)
> >
> > On Wed, 14 Oct, at 01:30:45PM, Paolo Bonzini wrote:
> >> On 32-bit systems, the initial_page_table is reused by
> >> efi_call_phys_prolog as an identity map to call
> >> SetVirtualAddressMap.  efi_call_phys_prolog takes care of
> >> converting the current CPU's GDT to a physical address too.
> >>
> >> For PAE kernels the identity mapping is achieved by aliasing the
> >> first PDPE for the kernel memory mapping into the first PDPE
> >> of initial_page_table.  This makes the EFI stub's trick "just work".
> >>
> >> However, for non-PAE kernels there is no guarantee that the identity
> >> mapping in the initial_page_table extends as far as the GDT; in this
> >> case, accesses to the GDT will cause a page fault (which quickly becomes
> >> a triple fault).  Fix this by copying the kernel mappings from
> >> swapper_pg_dir to initial_page_table twice, both at PAGE_OFFSET and at
> >> identity mapping.
> >
> > Oops, good catch guys. This is clearly a bug, but...
> >
> >> For some reason, this is only reproducible with QEMU's dynamic translation
> >> mode, and not for example with KVM.  However, even under KVM one can 
> >> clearly
> >> see that the page table is bogus:
> 
> I haven't looked at the code, but it wouldn't surprise me if this is
> some kind of TLB issue.  With the hardware TLB (which is in use on
> KVM), it seems quite likely that the GDT is pretty much always in the
> TLB and, if nothing flushes global mappings, then it'll probably stick
> around.

>From some quick experiments it appears that you can skate past this
issue if you don't receive any interrupts while the bogus GDT pointer
is loaded, or if you avoid reloading the segment registers in general.
Which is interesting because I assumed that writing to GDTR took
immediate effect.

Up until commit 23a0d4e8fa6d ("efi: Disable interrupts around EFI
calls, not in the epilog/prolog calls") interrupts were disabled
around the prolog and epilog calls, and the functional GDT was
re-installed before interrupts were re-enabled. 

That does explain why no one has complained about this issue before.

-- 
Matt Fleming, Intel Open Source Technology Center
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH] x86: setup: extend low identity map to cover whole kernel range

2015-10-14 Thread Paolo Bonzini
On 32-bit systems, the initial_page_table is reused by
efi_call_phys_prolog as an identity map to call
SetVirtualAddressMap.  efi_call_phys_prolog takes care of
converting the current CPU's GDT to a physical address too.

For PAE kernels the identity mapping is achieved by aliasing the
first PDPE for the kernel memory mapping into the first PDPE
of initial_page_table.  This makes the EFI stub's trick "just work".

However, for non-PAE kernels there is no guarantee that the identity
mapping in the initial_page_table extends as far as the GDT; in this
case, accesses to the GDT will cause a page fault (which quickly becomes
a triple fault).  Fix this by copying the kernel mappings from
swapper_pg_dir to initial_page_table twice, both at PAGE_OFFSET and at
identity mapping.

For some reason, this is only reproducible with QEMU's dynamic translation
mode, and not for example with KVM.  However, even under KVM one can clearly
see that the page table is bogus:

$ qemu-system-i386 -pflash OVMF.fd -M q35 vmlinuz0 -s -S -daemonize
$ gdb
(gdb) target remote localhost:1234
(gdb) hb *0x02858f6f
Hardware assisted breakpoint 1 at 0x2858f6f
(gdb) c
Continuing.

Breakpoint 1, 0x02858f6f in ?? ()
(gdb) monitor info registers
...
GDT= 0724e000 00ff
IDT= fffbb000 07ff
CR0=0005003b CR2=ff896000 CR3=032b7000 CR4=0690
...

The page directory is sane:

(gdb) x/4wx 0x32b7000
0x32b7000:  0x03398063  0x03399063  0x0339a063  0x0339b063
(gdb) x/4wx 0x3398000
0x3398000:  0x0163  0x1163  0x2163  0x3163
(gdb) x/4wx 0x3399000
0x3399000:  0x0043  0x00401003  0x00402003  0x00403003

but our particular page directory entry is empty:

(gdb) x/1wx 0x32b7000 + (0x724e000 >> 22) * 4
0x32b7070:  0x

Reported-by: Laszlo Ersek 
Cc: sta...@vger.kernel.org
Signed-off-by: Paolo Bonzini 
---
 arch/x86/kernel/setup.c | 8 
 1 file changed, 8 insertions(+)

diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index 80f874bf999e..24154bd12307 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -1198,6 +1198,14 @@ void __init setup_arch(char **cmdline_p)
clone_pgd_range(initial_page_table + KERNEL_PGD_BOUNDARY,
swapper_pg_dir + KERNEL_PGD_BOUNDARY,
KERNEL_PGD_PTRS);
+
+   /*
+* sync back low identity map too.  It is used for example
+* in the 32-bit EFI stub.
+*/
+   clone_pgd_range(initial_page_table,
+   swapper_pg_dir + KERNEL_PGD_BOUNDARY,
+   KERNEL_PGD_PTRS);
 #endif
 
tboot_probe();
-- 
2.5.0

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] x86: setup: extend low identity map to cover whole kernel range

2015-10-14 Thread Matt Fleming
(Pulling in luto for low-level x86 fu)

On Wed, 14 Oct, at 01:30:45PM, Paolo Bonzini wrote:
> On 32-bit systems, the initial_page_table is reused by
> efi_call_phys_prolog as an identity map to call
> SetVirtualAddressMap.  efi_call_phys_prolog takes care of
> converting the current CPU's GDT to a physical address too.
> 
> For PAE kernels the identity mapping is achieved by aliasing the
> first PDPE for the kernel memory mapping into the first PDPE
> of initial_page_table.  This makes the EFI stub's trick "just work".
> 
> However, for non-PAE kernels there is no guarantee that the identity
> mapping in the initial_page_table extends as far as the GDT; in this
> case, accesses to the GDT will cause a page fault (which quickly becomes
> a triple fault).  Fix this by copying the kernel mappings from
> swapper_pg_dir to initial_page_table twice, both at PAGE_OFFSET and at
> identity mapping.
 
Oops, good catch guys. This is clearly a bug, but...

> For some reason, this is only reproducible with QEMU's dynamic translation
> mode, and not for example with KVM.  However, even under KVM one can clearly
> see that the page table is bogus:
> 
> $ qemu-system-i386 -pflash OVMF.fd -M q35 vmlinuz0 -s -S -daemonize
> $ gdb
> (gdb) target remote localhost:1234
> (gdb) hb *0x02858f6f
> Hardware assisted breakpoint 1 at 0x2858f6f
> (gdb) c
> Continuing.
> 
> Breakpoint 1, 0x02858f6f in ?? ()
> (gdb) monitor info registers
> ...
> GDT= 0724e000 00ff
> IDT= fffbb000 07ff
> CR0=0005003b CR2=ff896000 CR3=032b7000 CR4=0690
> ...
> 
> The page directory is sane:
> 
> (gdb) x/4wx 0x32b7000
> 0x32b7000:0x03398063  0x03399063  0x0339a063  
> 0x0339b063
> (gdb) x/4wx 0x3398000
> 0x3398000:0x0163  0x1163  0x2163  
> 0x3163
> (gdb) x/4wx 0x3399000
> 0x3399000:0x0043  0x00401003  0x00402003  
> 0x00403003
> 
> but our particular page directory entry is empty:
> 
> (gdb) x/1wx 0x32b7000 + (0x724e000 >> 22) * 4
> 0x32b7070:0x
 
... I'm a little surprised you managed to trigger this at all, because
the GDT we load in efi_call_phys_prolog() is part of the per-cpu data
section and therefore part of the kernel image.

The kernel image *is* mapped in the identity range, even for non-PAE
kernels.

So yes, you're right this is a bug but I'm not sure how you're
actually triggering it.

> Reported-by: Laszlo Ersek 
> Cc: sta...@vger.kernel.org
> Signed-off-by: Paolo Bonzini 
> ---
>  arch/x86/kernel/setup.c | 8 
>  1 file changed, 8 insertions(+)
> 
> diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
> index 80f874bf999e..24154bd12307 100644
> --- a/arch/x86/kernel/setup.c
> +++ b/arch/x86/kernel/setup.c
> @@ -1198,6 +1198,14 @@ void __init setup_arch(char **cmdline_p)
>   clone_pgd_range(initial_page_table + KERNEL_PGD_BOUNDARY,
>   swapper_pg_dir + KERNEL_PGD_BOUNDARY,
>   KERNEL_PGD_PTRS);
> +
> + /*
> +  * sync back low identity map too.  It is used for example
> +  * in the 32-bit EFI stub.
> +  */
> + clone_pgd_range(initial_page_table,
> + swapper_pg_dir + KERNEL_PGD_BOUNDARY,
> + KERNEL_PGD_PTRS);
>  #endif
>  
>   tboot_probe();
> -- 
> 2.5.0
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-efi" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

-- 
Matt Fleming, Intel Open Source Technology Center
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/