Re: [PATCH] x86: Use entire page for the per-cpu GDT only if paravirt-enabled
No, it is a natural result of an implemention which treats setting the A bit as an abnormal flow (e.g. in microcode as opposed to hardware). On September 29, 2015 7:11:59 PM PDT, ebied...@xmission.com wrote: >"H. Peter Anvin" writes: > >> On 09/29/2015 06:20 PM, Eric W. Biederman wrote: >>> Linus Torvalds writes: >>> On Tue, Sep 29, 2015 at 1:35 PM, Andy Lutomirski > wrote: > > Does anyone know what happens if you stick a non-accessed segment >in > the GDT, map the GDT RO, and access it? You should get a #PF, as you guess, but go ahead and test it if you want to make sure. >>> >>> I tested this by accident once when workinng on what has become >known >>> as coreboot. Early in boot with your GDT in a EEPROM switching from >>> real mode to 32bit protected mode causes a write and locks up the >>> machine when the hardware declines the write to the GDT to set the >>> accessed bit. As I recall the write kept being retried and retried >and >>> retried... >>> >>> Setting the access bit in the GDT cleared up the problem and I did >not >>> look back. >>> >>> Way up in 64bit mode something might be different, but I don't know >why >>> cpu designeres would waste the silicon. >>> >> >> This is totally different from a TLB violation. In your case, the >write >> goes through as far as the CPU is concerned, but when the data is >> fetched back, it hasn't changed. A write to a TLB-protected location >> will #PF. > >The key point is that a write is generated when the cpu needs to set >the >access bit. I agree the failure points are different. A TLB fault vs >a >case where the hardware did not accept the write. > >The idea of a cpu reading back data (and not trusting it's cache >coherency controls) to verify the access bit gets set seems mind >boggling. That is slow, stupid, racy and incorrect. Incorrect as the >cpu should not only set the access bit once per segment register load. > >In my case I am pretty certain it was something very weird with the >hardware not acceppting the write and either not acknowledging the bus >transaction or cancelling it. In which case the cpu knew the write had >not made it to the ``memory'' and was trying to cope. > >Eric -- Sent from my Android device with K-9 Mail. Please excuse my brevity. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] x86: Use entire page for the per-cpu GDT only if paravirt-enabled
"H. Peter Anvin" writes: > On 09/29/2015 06:20 PM, Eric W. Biederman wrote: >> Linus Torvalds writes: >> >>> On Tue, Sep 29, 2015 at 1:35 PM, Andy Lutomirski >>> wrote: Does anyone know what happens if you stick a non-accessed segment in the GDT, map the GDT RO, and access it? >>> >>> You should get a #PF, as you guess, but go ahead and test it if you >>> want to make sure. >> >> I tested this by accident once when workinng on what has become known >> as coreboot. Early in boot with your GDT in a EEPROM switching from >> real mode to 32bit protected mode causes a write and locks up the >> machine when the hardware declines the write to the GDT to set the >> accessed bit. As I recall the write kept being retried and retried and >> retried... >> >> Setting the access bit in the GDT cleared up the problem and I did not >> look back. >> >> Way up in 64bit mode something might be different, but I don't know why >> cpu designeres would waste the silicon. >> > > This is totally different from a TLB violation. In your case, the write > goes through as far as the CPU is concerned, but when the data is > fetched back, it hasn't changed. A write to a TLB-protected location > will #PF. The key point is that a write is generated when the cpu needs to set the access bit. I agree the failure points are different. A TLB fault vs a case where the hardware did not accept the write. The idea of a cpu reading back data (and not trusting it's cache coherency controls) to verify the access bit gets set seems mind boggling. That is slow, stupid, racy and incorrect. Incorrect as the cpu should not only set the access bit once per segment register load. In my case I am pretty certain it was something very weird with the hardware not acceppting the write and either not acknowledging the bus transaction or cancelling it. In which case the cpu knew the write had not made it to the ``memory'' and was trying to cope. Eric -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] x86: Use entire page for the per-cpu GDT only if paravirt-enabled
On 09/29/2015 06:20 PM, Eric W. Biederman wrote: > Linus Torvalds writes: > >> On Tue, Sep 29, 2015 at 1:35 PM, Andy Lutomirski wrote: >>> >>> Does anyone know what happens if you stick a non-accessed segment in >>> the GDT, map the GDT RO, and access it? >> >> You should get a #PF, as you guess, but go ahead and test it if you >> want to make sure. > > I tested this by accident once when workinng on what has become known > as coreboot. Early in boot with your GDT in a EEPROM switching from > real mode to 32bit protected mode causes a write and locks up the > machine when the hardware declines the write to the GDT to set the > accessed bit. As I recall the write kept being retried and retried and > retried... > > Setting the access bit in the GDT cleared up the problem and I did not > look back. > > Way up in 64bit mode something might be different, but I don't know why > cpu designeres would waste the silicon. > This is totally different from a TLB violation. In your case, the write goes through as far as the CPU is concerned, but when the data is fetched back, it hasn't changed. A write to a TLB-protected location will #PF. -hpa -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] x86: Use entire page for the per-cpu GDT only if paravirt-enabled
Linus Torvalds writes: > On Tue, Sep 29, 2015 at 1:35 PM, Andy Lutomirski wrote: >> >> Does anyone know what happens if you stick a non-accessed segment in >> the GDT, map the GDT RO, and access it? > > You should get a #PF, as you guess, but go ahead and test it if you > want to make sure. I tested this by accident once when workinng on what has become known as coreboot. Early in boot with your GDT in a EEPROM switching from real mode to 32bit protected mode causes a write and locks up the machine when the hardware declines the write to the GDT to set the accessed bit. As I recall the write kept being retried and retried and retried... Setting the access bit in the GDT cleared up the problem and I did not look back. Way up in 64bit mode something might be different, but I don't know why cpu designeres would waste the silicon. Eric -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] x86: Use entire page for the per-cpu GDT only if paravirt-enabled
On 09/29/2015 11:02 AM, Andy Lutomirski wrote: > On Tue, Sep 29, 2015 at 10:50 AM, Linus Torvalds > wrote: >> On Tue, Sep 29, 2015 at 1:35 PM, Andy Lutomirski wrote: >>> >>> Does anyone know what happens if you stick a non-accessed segment in >>> the GDT, map the GDT RO, and access it? >> >> You should get a #PF, as you guess, but go ahead and test it if you >> want to make sure. > > Then I think that, if we do this, the patch series should first make > it percpu and fixmapped but RW and then flip it RO as a separate patch > in case we need to revert the actual RO bit. I don't want to break > Wine or The Witcher 2 because of this, and we might need various > fixups. I really hope that no one uses get_thread_area to check > whether TLS has been accessed. > Of course. -hpa -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] x86: Use entire page for the per-cpu GDT only if paravirt-enabled
On Tue, Sep 29, 2015 at 11:18 AM, H. Peter Anvin wrote: > SGDT would be easy to use, and it is logical that it is faster since it reads > an internal register. SIDT does too but unlike the GDT has a secondary limit > (it can never be larger than 4096 bytes) and so all limits in the range > 4095-65535 are exactly equivalent. > Using the IDT limit would have been a great ideal if Intel hadn't decided to clobber it on every VM exit. --Andy -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] x86: Use entire page for the per-cpu GDT only if paravirt-enabled
Ugh. Didn't realize that. On September 29, 2015 11:22:04 AM PDT, Andy Lutomirski wrote: >On Tue, Sep 29, 2015 at 11:18 AM, H. Peter Anvin wrote: >> SGDT would be easy to use, and it is logical that it is faster since >it reads an internal register. SIDT does too but unlike the GDT has a >secondary limit (it can never be larger than 4096 bytes) and so all >limits in the range 4095-65535 are exactly equivalent. >> > >Using the IDT limit would have been a great ideal if Intel hadn't >decided to clobber it on every VM exit. > >--Andy -- Sent from my Android device with K-9 Mail. Please excuse my brevity. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] x86: Use entire page for the per-cpu GDT only if paravirt-enabled
SGDT would be easy to use, and it is logical that it is faster since it reads an internal register. SIDT does too but unlike the GDT has a secondary limit (it can never be larger than 4096 bytes) and so all limits in the range 4095-65535 are exactly equivalent. Anything that causes a write to the GDT will #PF if read-only. So yes, we need to force the accessed bit to set. This shouldn't be a problem and in fact ought to be a performance improvement. On September 29, 2015 10:35:38 AM PDT, Andy Lutomirski wrote: >On Sep 29, 2015 2:01 AM, "Ingo Molnar" wrote: >> >> >> * Denys Vlasenko wrote: >> >> > On 09/28/2015 09:58 AM, Ingo Molnar wrote: >> > > >> > > * Denys Vlasenko wrote: >> > > >> > >> On 09/26/2015 09:50 PM, H. Peter Anvin wrote: >> > >>> NAK. We really should map the GDT read-only on all 64 bit >systems, >> > >>> since we can't hide the address from SLDT. Same with the IDT. >> > >> >> > >> Sorry, I don't understand your point. >> > > >> > > So the problem is that right now the SGDT instruction (which is >unprivileged) >> > > leaks the real address of the kernel image: >> > > >> > > fomalhaut:~> ./sgdt >> > > SGDT: 88303fd89000 / 007f >> > > >> > > that '88303fd89000' is a kernel address. >> > >> > Thank you. >> > I do know that SGDT and friends are unprivileged on x86 >> > and thus they allow userspace (and guest kernels in paravirt) >> > learn things they don't need to know. >> > >> > I don't see how making GDT page-aligned and page-sized >> > changes anything in this regard. SGDT will still work, >> > and still leak GDT address. >> >> Well, as I try to explain it in the other part of my mail, doing so >enables us to >> remap the GDT to a less security sensitive virtual address that does >not leak the >> kernel's randomized address: >> >> > > Your observation in the changelog and your patch: >> > > >> > It is page-sized because of paravirt. [...] >> > > >> > > ... conflicts with the intention to mark (remap) the primary GDT >address read-only >> > > on native kernels as well. >> > > >> > > So what we should do instead is to use the page alignment >properly and remap the >> > > GDT to a read-only location, and load that one. >> > >> > If we'd have a small GDT (i.e. what my patch does), we still can >remap the >> > entire page which contains small GDT, and simply don't care that >some other data >> > is also visible through that RO page. >> >> That's generally considered fragile: suppose an attacker has a >limited information >> leak that can read absolute addresses with system privilege but he >doesn't know >> the kernel's randomized base offset. With a 'partial page' mapping >there could be >> function pointers near the GDT, part of the page the GDT happens to >be on, that >> leak this information. >> >> (Same goes for crypto keys or other critical information (like canary >information, >> salts, etc.) accidentally ending up nearby.) >> >> Arguably it's a bit tenuous, but when playing remapping games it's >generally >> considered good to be page aligned and page sized, with zero padding. >> >> > > This would have a couple of advantages: >> > > >> > > - This would give kernel address randomization more teeth on >x86. >> > > >> > > - An additional advantage would be that rootkits overwriting the >GDT would have >> > >a bit more work to do. >> > > >> > > - A third advantage would be that for NUMA systems we could >'mirror' the GDT into >> > >node-local memory and load those. This makes GDT load >cache-misses a bit less >> > >expensive. >> > >> > GDT is per-cpu. Isn't per-cpu memory already NUMA-local? >> >> Indeed it is: >> >> fomalhaut:~> for ((cpu=1; cpu<9; cpu++)); do taskset $cpu ./sgdt ; >done >> SGDT: 88103fa09000 / 007f >> SGDT: 88103fa29000 / 007f >> SGDT: 88103fa29000 / 007f >> SGDT: 88103fa49000 / 007f >> SGDT: 88103fa49000 / 007f >> SGDT: 88103fa49000 / 007f >> SGDT: 88103fa29000 / 007f >> SGDT: 88103fa69000 / 007f >> >> I confused it with the IDT, which is still global. >> >> This also means that the GDT in itself does not leak kernel addresses >at the >> moment, except it leaks the layout of the percpu area. >> >> So my suggestion would be to: >> >> - make the GDT unconditionally page aligned and sized, then remap it >to a >>read-only address unconditionally as well, like we do it for the >IDT. > >Does anyone know what happens if you stick a non-accessed segment in >the GDT, map the GDT RO, and access it? The docs are extremely vague >on the interplay between segmentation and paging on the segmentation >structures themselves. My guess is that it causes #PF. This might >break set_thread_area users unless we change set_thread_area to force >the accessed bit on. > >There's a possible worse failure mode: if someone pokes an un-accessed >segment into SS or CS using sigreturn, then it's within the realm of >possibility that IRET would generate #PF (hey Intel and AMD, please >document this!). I don't think th
Re: [PATCH] x86: Use entire page for the per-cpu GDT only if paravirt-enabled
On Tue, Sep 29, 2015 at 10:50 AM, Linus Torvalds wrote: > On Tue, Sep 29, 2015 at 1:35 PM, Andy Lutomirski wrote: >> >> Does anyone know what happens if you stick a non-accessed segment in >> the GDT, map the GDT RO, and access it? > > You should get a #PF, as you guess, but go ahead and test it if you > want to make sure. > Then I think that, if we do this, the patch series should first make it percpu and fixmapped but RW and then flip it RO as a separate patch in case we need to revert the actual RO bit. I don't want to break Wine or The Witcher 2 because of this, and we might need various fixups. I really hope that no one uses get_thread_area to check whether TLS has been accessed. --Andy -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] x86: Use entire page for the per-cpu GDT only if paravirt-enabled
On Tue, Sep 29, 2015 at 1:35 PM, Andy Lutomirski wrote: > > Does anyone know what happens if you stick a non-accessed segment in > the GDT, map the GDT RO, and access it? You should get a #PF, as you guess, but go ahead and test it if you want to make sure. We do something very similar for the old Pentium F0 0F bug - we mark the IDT read-only, which causes the (bogus) locked read of the IDT entry that the F00F bug resulted in to be caught as a page fault instead. Linus -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] x86: Use entire page for the per-cpu GDT only if paravirt-enabled
On Sep 29, 2015 2:01 AM, "Ingo Molnar" wrote: > > > * Denys Vlasenko wrote: > > > On 09/28/2015 09:58 AM, Ingo Molnar wrote: > > > > > > * Denys Vlasenko wrote: > > > > > >> On 09/26/2015 09:50 PM, H. Peter Anvin wrote: > > >>> NAK. We really should map the GDT read-only on all 64 bit systems, > > >>> since we can't hide the address from SLDT. Same with the IDT. > > >> > > >> Sorry, I don't understand your point. > > > > > > So the problem is that right now the SGDT instruction (which is > > > unprivileged) > > > leaks the real address of the kernel image: > > > > > > fomalhaut:~> ./sgdt > > > SGDT: 88303fd89000 / 007f > > > > > > that '88303fd89000' is a kernel address. > > > > Thank you. > > I do know that SGDT and friends are unprivileged on x86 > > and thus they allow userspace (and guest kernels in paravirt) > > learn things they don't need to know. > > > > I don't see how making GDT page-aligned and page-sized > > changes anything in this regard. SGDT will still work, > > and still leak GDT address. > > Well, as I try to explain it in the other part of my mail, doing so enables > us to > remap the GDT to a less security sensitive virtual address that does not leak > the > kernel's randomized address: > > > > Your observation in the changelog and your patch: > > > > > It is page-sized because of paravirt. [...] > > > > > > ... conflicts with the intention to mark (remap) the primary GDT address > > > read-only > > > on native kernels as well. > > > > > > So what we should do instead is to use the page alignment properly and > > > remap the > > > GDT to a read-only location, and load that one. > > > > If we'd have a small GDT (i.e. what my patch does), we still can remap the > > entire page which contains small GDT, and simply don't care that some other > > data > > is also visible through that RO page. > > That's generally considered fragile: suppose an attacker has a limited > information > leak that can read absolute addresses with system privilege but he doesn't > know > the kernel's randomized base offset. With a 'partial page' mapping there > could be > function pointers near the GDT, part of the page the GDT happens to be on, > that > leak this information. > > (Same goes for crypto keys or other critical information (like canary > information, > salts, etc.) accidentally ending up nearby.) > > Arguably it's a bit tenuous, but when playing remapping games it's generally > considered good to be page aligned and page sized, with zero padding. > > > > This would have a couple of advantages: > > > > > > - This would give kernel address randomization more teeth on x86. > > > > > > - An additional advantage would be that rootkits overwriting the GDT > > > would have > > >a bit more work to do. > > > > > > - A third advantage would be that for NUMA systems we could 'mirror' the > > > GDT into > > >node-local memory and load those. This makes GDT load cache-misses a > > > bit less > > >expensive. > > > > GDT is per-cpu. Isn't per-cpu memory already NUMA-local? > > Indeed it is: > > fomalhaut:~> for ((cpu=1; cpu<9; cpu++)); do taskset $cpu ./sgdt ; done > SGDT: 88103fa09000 / 007f > SGDT: 88103fa29000 / 007f > SGDT: 88103fa29000 / 007f > SGDT: 88103fa49000 / 007f > SGDT: 88103fa49000 / 007f > SGDT: 88103fa49000 / 007f > SGDT: 88103fa29000 / 007f > SGDT: 88103fa69000 / 007f > > I confused it with the IDT, which is still global. > > This also means that the GDT in itself does not leak kernel addresses at the > moment, except it leaks the layout of the percpu area. > > So my suggestion would be to: > > - make the GDT unconditionally page aligned and sized, then remap it to a >read-only address unconditionally as well, like we do it for the IDT. Does anyone know what happens if you stick a non-accessed segment in the GDT, map the GDT RO, and access it? The docs are extremely vague on the interplay between segmentation and paging on the segmentation structures themselves. My guess is that it causes #PF. This might break set_thread_area users unless we change set_thread_area to force the accessed bit on. There's a possible worse failure mode: if someone pokes an un-accessed segment into SS or CS using sigreturn, then it's within the realm of possibility that IRET would generate #PF (hey Intel and AMD, please document this!). I don't think that would be rootable, but at the very least we'd want to make sure it doesn't OOPS by either making it impossible or adding an explicit test to sigreturn.c. hpa pointed out in another thread that the GDT *must* be writable on 32-bit kernels because we use a task gate for NMI and jumping through a task gate writes to the GDT. On another note, SGDT is considerably faster than LSL, at least on Sandy Bridge. The vdso might be able to take advantage of that for getcpu. --Andy -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord.
Re: [PATCH] x86: Use entire page for the per-cpu GDT only if paravirt-enabled
* Denys Vlasenko wrote: > On 09/28/2015 09:58 AM, Ingo Molnar wrote: > > > > * Denys Vlasenko wrote: > > > >> On 09/26/2015 09:50 PM, H. Peter Anvin wrote: > >>> NAK. We really should map the GDT read-only on all 64 bit systems, > >>> since we can't hide the address from SLDT. Same with the IDT. > >> > >> Sorry, I don't understand your point. > > > > So the problem is that right now the SGDT instruction (which is > > unprivileged) > > leaks the real address of the kernel image: > > > > fomalhaut:~> ./sgdt > > SGDT: 88303fd89000 / 007f > > > > that '88303fd89000' is a kernel address. > > Thank you. > I do know that SGDT and friends are unprivileged on x86 > and thus they allow userspace (and guest kernels in paravirt) > learn things they don't need to know. > > I don't see how making GDT page-aligned and page-sized > changes anything in this regard. SGDT will still work, > and still leak GDT address. Well, as I try to explain it in the other part of my mail, doing so enables us to remap the GDT to a less security sensitive virtual address that does not leak the kernel's randomized address: > > Your observation in the changelog and your patch: > > > It is page-sized because of paravirt. [...] > > > > ... conflicts with the intention to mark (remap) the primary GDT address > > read-only > > on native kernels as well. > > > > So what we should do instead is to use the page alignment properly and > > remap the > > GDT to a read-only location, and load that one. > > If we'd have a small GDT (i.e. what my patch does), we still can remap the > entire page which contains small GDT, and simply don't care that some other > data > is also visible through that RO page. That's generally considered fragile: suppose an attacker has a limited information leak that can read absolute addresses with system privilege but he doesn't know the kernel's randomized base offset. With a 'partial page' mapping there could be function pointers near the GDT, part of the page the GDT happens to be on, that leak this information. (Same goes for crypto keys or other critical information (like canary information, salts, etc.) accidentally ending up nearby.) Arguably it's a bit tenuous, but when playing remapping games it's generally considered good to be page aligned and page sized, with zero padding. > > This would have a couple of advantages: > > > > - This would give kernel address randomization more teeth on x86. > > > > - An additional advantage would be that rootkits overwriting the GDT would > > have > >a bit more work to do. > > > > - A third advantage would be that for NUMA systems we could 'mirror' the > > GDT into > >node-local memory and load those. This makes GDT load cache-misses a bit > > less > >expensive. > > GDT is per-cpu. Isn't per-cpu memory already NUMA-local? Indeed it is: fomalhaut:~> for ((cpu=1; cpu<9; cpu++)); do taskset $cpu ./sgdt ; done SGDT: 88103fa09000 / 007f SGDT: 88103fa29000 / 007f SGDT: 88103fa29000 / 007f SGDT: 88103fa49000 / 007f SGDT: 88103fa49000 / 007f SGDT: 88103fa49000 / 007f SGDT: 88103fa29000 / 007f SGDT: 88103fa69000 / 007f I confused it with the IDT, which is still global. This also means that the GDT in itself does not leak kernel addresses at the moment, except it leaks the layout of the percpu area. So my suggestion would be to: - make the GDT unconditionally page aligned and sized, then remap it to a read-only address unconditionally as well, like we do it for the IDT. - make the IDT per CPU as well, for performance reasons. Thanks, Ingo -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] x86: Use entire page for the per-cpu GDT only if paravirt-enabled
On 09/28/2015 09:58 AM, Ingo Molnar wrote: > > * Denys Vlasenko wrote: > >> On 09/26/2015 09:50 PM, H. Peter Anvin wrote: >>> NAK. We really should map the GDT read-only on all 64 bit systems, >>> since we can't hide the address from SLDT. Same with the IDT. >> >> Sorry, I don't understand your point. > > So the problem is that right now the SGDT instruction (which is unprivileged) > leaks the real address of the kernel image: > > fomalhaut:~> ./sgdt > SGDT: 88303fd89000 / 007f > > that '88303fd89000' is a kernel address. Thank you. I do know that SGDT and friends are unprivileged on x86 and thus they allow userspace (and guest kernels in paravirt) learn things they don't need to know. I don't see how making GDT page-aligned and page-sized changes anything in this regard. SGDT will still work, and still leak GDT address. > Your observation in the changelog and your patch: > It is page-sized because of paravirt. [...] > > ... conflicts with the intention to mark (remap) the primary GDT address > read-only > on native kernels as well. > > So what we should do instead is to use the page alignment properly and remap > the > GDT to a read-only location, and load that one. If we'd have a small GDT (i.e. what my patch does), we still can remap the entire page which contains small GDT, and simply don't care that some other data is also visible through that RO page. > This would have a couple of advantages: > > - This would give kernel address randomization more teeth on x86. > > - An additional advantage would be that rootkits overwriting the GDT would > have >a bit more work to do. > > - A third advantage would be that for NUMA systems we could 'mirror' the GDT > into >node-local memory and load those. This makes GDT load cache-misses a bit > less >expensive. GDT is per-cpu. Isn't per-cpu memory already NUMA-local? -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] x86: Use entire page for the per-cpu GDT only if paravirt-enabled
* Denys Vlasenko wrote: > On 09/26/2015 09:50 PM, H. Peter Anvin wrote: > > NAK. We really should map the GDT read-only on all 64 bit systems, > > since we can't hide the address from SLDT. Same with the IDT. > > Sorry, I don't understand your point. So the problem is that right now the SGDT instruction (which is unprivileged) leaks the real address of the kernel image: fomalhaut:~> ./sgdt SGDT: 88303fd89000 / 007f that '88303fd89000' is a kernel address. fomalhaut:~> cat sgdt.c #include #include int main(void) { struct gdt_desc { unsigned short limit; unsigned long addr; } __attribute__((packed)) gdt_desc = { -1, -1 }; asm volatile("sgdt %0": "=m" (gdt_desc)); printf("SGDT: %016lx / %04x\n", gdt_desc.addr, gdt_desc.limit); return 0; } Your observation in the changelog and your patch: > >> It is page-sized because of paravirt. [...] ... conflicts with the intention to mark (remap) the primary GDT address read-only on native kernels as well. So what we should do instead is to use the page alignment properly and remap the GDT to a read-only location, and load that one. This would have a couple of advantages: - This would give kernel address randomization more teeth on x86. - An additional advantage would be that rootkits overwriting the GDT would have a bit more work to do. - A third advantage would be that for NUMA systems we could 'mirror' the GDT into node-local memory and load those. This makes GDT load cache-misses a bit less expensive. The IDT is already remapped: fomalhaut:~> ./sidt Sidt: ff57b000 / 0fff fomalhaut:~> cat sidt.c #include #include int main(void) { struct idt_desc { unsigned short limit; unsigned long addr; } __attribute__((packed)) idt_desc = { -1, -1 }; asm volatile("sidt %0": "=m" (idt_desc)); printf("Sidt: %016lx / %04x\n", idt_desc.addr, idt_desc.limit); return 0; } Thanks, Ingo -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] x86: Use entire page for the per-cpu GDT only if paravirt-enabled
On 09/26/2015 09:50 PM, H. Peter Anvin wrote: > NAK. We really should map the GDT read-only on all 64 bit systems, > since we can't hide the address from SLDT. Same with the IDT. Sorry, I don't understand your point. > On September 26, 2015 11:00:40 AM PDT, Denys Vlasenko > wrote: >> We have our GDT in a page-sized per-cpu structure, gdt_page. >> >> On x86_64 kernel, GDT is 128 bytes - only ~3% of that page is used. >> >> It is page-sized because of paravirt. Hypervisors need to know when >> GDT is changed, so they remap it read-only and handle write faults. >> If it's not in its own page, other writes nearby will cause >> those faults too. >> >> In other words, we need GDT to live in a separate page >> only if CONFIG_HYPERVISOR_GUEST=y. >> >> This patch reduces GDT alignment to cacheline-aligned >> if CONFIG_HYPERVISOR_GUEST is not set. >> >> Patch also renames gdt_page to cpu_gdt (mimicking naming of existing >> cpu_tss per-cpu variable), since now it is not always a full page. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] x86: Use entire page for the per-cpu GDT only if paravirt-enabled
NAK. We really should map the GDT read-only on all 64 bit systems, since we can't hide the address from SLDT. Same with the IDT. On September 26, 2015 11:00:40 AM PDT, Denys Vlasenko wrote: >We have our GDT in a page-sized per-cpu structure, gdt_page. > >On x86_64 kernel, GDT is 128 bytes - only ~3% of that page is used. > >It is page-sized because of paravirt. Hypervisors need to know when >GDT is changed, so they remap it read-only and handle write faults. >If it's not in its own page, other writes nearby will cause >those faults too. > >In other words, we need GDT to live in a separate page >only if CONFIG_HYPERVISOR_GUEST=y. > >This patch reduces GDT alignment to cacheline-aligned >if CONFIG_HYPERVISOR_GUEST is not set. > >Patch also renames gdt_page to cpu_gdt (mimicking naming of existing >cpu_tss per-cpu variable), since now it is not always a full page. > >$ objdump -x vmlinux | grep .data..percpu | sort >Before: >(offset)(size) > wO .data..percpu 4000 >irq_stack_union >4000 wO .data..percpu 5000 >exception_stacks >9000 wO .data..percpu 1000 gdt_page <<< >HERE > a000 wO .data..percpu 0008 espfix_waddr > a008 wO .data..percpu 0008 espfix_stack >... > 00019398 g .data..percpu __per_cpu_end >After: > wO .data..percpu 4000 >irq_stack_union >4000 wO .data..percpu 5000 >exception_stacks > 9000 wO .data..percpu 0008 espfix_waddr > 9008 wO .data..percpu 0008 espfix_stack >... >00013c80 wO .data..percpu 0040 cyc2ns >00013cc0 wO .data..percpu 22c0 cpu_tss >00015f80 wO .data..percpu 0080 cpu_gdt <<< >HERE > 00016000 wO .data..percpu 0018 cpu_tlbstate >... > 00018418 g .data..percpu __per_cpu_end > >Run-tested on a 144 CPU machine. > >Signed-off-by: Denys Vlasenko >CC: Ingo Molnar >CC: H. Peter Anvin >CC: Konrad Rzeszutek Wilk >CC: Boris Ostrovsky >CC: David Vrabel >CC: Joerg Roedel >CC: Gleb Natapov >CC: Paolo Bonzini >CC: k...@vger.kernel.org >CC: x...@kernel.org >CC: linux-kernel@vger.kernel.org >--- > arch/x86/entry/entry_32.S| 2 +- > arch/x86/include/asm/desc.h | 16 +++- > arch/x86/kernel/cpu/common.c | 10 -- > arch/x86/kernel/cpu/perf_event.c | 2 +- > arch/x86/kernel/head_32.S| 4 ++-- > arch/x86/kernel/head_64.S| 2 +- > arch/x86/kernel/vmlinux.lds.S| 2 +- > arch/x86/tools/relocs.c | 2 +- > arch/x86/xen/enlighten.c | 4 ++-- > 9 files changed, 28 insertions(+), 16 deletions(-) > >diff --git a/arch/x86/entry/entry_32.S b/arch/x86/entry/entry_32.S >index b2909bf..bc6ae1c 100644 >--- a/arch/x86/entry/entry_32.S >+++ b/arch/x86/entry/entry_32.S >@@ -429,7 +429,7 @@ ldt_ss: > * compensating for the offset by changing to the ESPFIX segment with > * a base address that matches for the difference. > */ >-#define GDT_ESPFIX_SS PER_CPU_VAR(gdt_page) + (GDT_ENTRY_ESPFIX_SS * >8) >+#define GDT_ESPFIX_SS PER_CPU_VAR(cpu_gdt) + (GDT_ENTRY_ESPFIX_SS * 8) > mov %esp, %edx /* load kernel esp */ > mov PT_OLDESP(%esp), %eax /* load userspace esp */ > mov %dx, %ax/* eax: new kernel esp */ >diff --git a/arch/x86/include/asm/desc.h b/arch/x86/include/asm/desc.h >index 4e10d73..76de300 100644 >--- a/arch/x86/include/asm/desc.h >+++ b/arch/x86/include/asm/desc.h >@@ -39,15 +39,21 @@ extern gate_desc idt_table[]; > extern struct desc_ptr debug_idt_descr; > extern gate_desc debug_idt_table[]; > >-struct gdt_page { >+struct cpu_gdt { > struct desc_struct gdt[GDT_ENTRIES]; >-} __attribute__((aligned(PAGE_SIZE))); >- >-DECLARE_PER_CPU_PAGE_ALIGNED(struct gdt_page, gdt_page); >+} >+#ifdef CONFIG_HYPERVISOR_GUEST >+/* Xen et al want GDT to have its own page. They remap it read-only */ >+__attribute__((aligned(PAGE_SIZE))); >+DECLARE_PER_CPU_PAGE_ALIGNED(struct cpu_gdt, cpu_gdt); >+#else >+cacheline_aligned; >+DECLARE_PER_CPU_ALIGNED(struct cpu_gdt, cpu_gdt); >+#endif > > static inline struct desc_struct *get_cpu_gdt_table(unsigned int cpu) > { >- return per_cpu(gdt_page, cpu).gdt; >+ return per_cpu(cpu_gdt, cpu).gdt; > } > > #ifdef CONFIG_X86_64 >diff --git a/arch/x86/kernel/cpu/common.c >b/arch/x86/kernel/cpu/common.c >index de22ea7..6b90785 100644 >--- a/arch/x86/kernel/cpu/common.c >+++ b/arch/x86/kernel/cpu/common.c >@@ -92,7 +92,13 @@ static const struct cpu_dev default_cpu = { > > static const struct cpu_dev *this_cpu = &default_cpu; > >-DEFINE_PER_CPU_PAGE_ALIGNED(struct gdt_page, gdt_page) = { .gdt = { >+#ifdef CONFIG_HY
[PATCH] x86: Use entire page for the per-cpu GDT only if paravirt-enabled
We have our GDT in a page-sized per-cpu structure, gdt_page. On x86_64 kernel, GDT is 128 bytes - only ~3% of that page is used. It is page-sized because of paravirt. Hypervisors need to know when GDT is changed, so they remap it read-only and handle write faults. If it's not in its own page, other writes nearby will cause those faults too. In other words, we need GDT to live in a separate page only if CONFIG_HYPERVISOR_GUEST=y. This patch reduces GDT alignment to cacheline-aligned if CONFIG_HYPERVISOR_GUEST is not set. Patch also renames gdt_page to cpu_gdt (mimicking naming of existing cpu_tss per-cpu variable), since now it is not always a full page. $ objdump -x vmlinux | grep .data..percpu | sort Before: (offset)(size) wO .data..percpu 4000 irq_stack_union 4000 wO .data..percpu 5000 exception_stacks 9000 wO .data..percpu 1000 gdt_page <<< HERE a000 wO .data..percpu 0008 espfix_waddr a008 wO .data..percpu 0008 espfix_stack ... 00019398 g .data..percpu __per_cpu_end After: wO .data..percpu 4000 irq_stack_union 4000 wO .data..percpu 5000 exception_stacks 9000 wO .data..percpu 0008 espfix_waddr 9008 wO .data..percpu 0008 espfix_stack ... 00013c80 wO .data..percpu 0040 cyc2ns 00013cc0 wO .data..percpu 22c0 cpu_tss 00015f80 wO .data..percpu 0080 cpu_gdt <<< HERE 00016000 wO .data..percpu 0018 cpu_tlbstate ... 00018418 g .data..percpu __per_cpu_end Run-tested on a 144 CPU machine. Signed-off-by: Denys Vlasenko CC: Ingo Molnar CC: H. Peter Anvin CC: Konrad Rzeszutek Wilk CC: Boris Ostrovsky CC: David Vrabel CC: Joerg Roedel CC: Gleb Natapov CC: Paolo Bonzini CC: k...@vger.kernel.org CC: x...@kernel.org CC: linux-kernel@vger.kernel.org --- arch/x86/entry/entry_32.S| 2 +- arch/x86/include/asm/desc.h | 16 +++- arch/x86/kernel/cpu/common.c | 10 -- arch/x86/kernel/cpu/perf_event.c | 2 +- arch/x86/kernel/head_32.S| 4 ++-- arch/x86/kernel/head_64.S| 2 +- arch/x86/kernel/vmlinux.lds.S| 2 +- arch/x86/tools/relocs.c | 2 +- arch/x86/xen/enlighten.c | 4 ++-- 9 files changed, 28 insertions(+), 16 deletions(-) diff --git a/arch/x86/entry/entry_32.S b/arch/x86/entry/entry_32.S index b2909bf..bc6ae1c 100644 --- a/arch/x86/entry/entry_32.S +++ b/arch/x86/entry/entry_32.S @@ -429,7 +429,7 @@ ldt_ss: * compensating for the offset by changing to the ESPFIX segment with * a base address that matches for the difference. */ -#define GDT_ESPFIX_SS PER_CPU_VAR(gdt_page) + (GDT_ENTRY_ESPFIX_SS * 8) +#define GDT_ESPFIX_SS PER_CPU_VAR(cpu_gdt) + (GDT_ENTRY_ESPFIX_SS * 8) mov %esp, %edx /* load kernel esp */ mov PT_OLDESP(%esp), %eax /* load userspace esp */ mov %dx, %ax/* eax: new kernel esp */ diff --git a/arch/x86/include/asm/desc.h b/arch/x86/include/asm/desc.h index 4e10d73..76de300 100644 --- a/arch/x86/include/asm/desc.h +++ b/arch/x86/include/asm/desc.h @@ -39,15 +39,21 @@ extern gate_desc idt_table[]; extern struct desc_ptr debug_idt_descr; extern gate_desc debug_idt_table[]; -struct gdt_page { +struct cpu_gdt { struct desc_struct gdt[GDT_ENTRIES]; -} __attribute__((aligned(PAGE_SIZE))); - -DECLARE_PER_CPU_PAGE_ALIGNED(struct gdt_page, gdt_page); +} +#ifdef CONFIG_HYPERVISOR_GUEST +/* Xen et al want GDT to have its own page. They remap it read-only */ +__attribute__((aligned(PAGE_SIZE))); +DECLARE_PER_CPU_PAGE_ALIGNED(struct cpu_gdt, cpu_gdt); +#else +cacheline_aligned; +DECLARE_PER_CPU_ALIGNED(struct cpu_gdt, cpu_gdt); +#endif static inline struct desc_struct *get_cpu_gdt_table(unsigned int cpu) { - return per_cpu(gdt_page, cpu).gdt; + return per_cpu(cpu_gdt, cpu).gdt; } #ifdef CONFIG_X86_64 diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c index de22ea7..6b90785 100644 --- a/arch/x86/kernel/cpu/common.c +++ b/arch/x86/kernel/cpu/common.c @@ -92,7 +92,13 @@ static const struct cpu_dev default_cpu = { static const struct cpu_dev *this_cpu = &default_cpu; -DEFINE_PER_CPU_PAGE_ALIGNED(struct gdt_page, gdt_page) = { .gdt = { +#ifdef CONFIG_HYPERVISOR_GUEST +/* Xen et al want GDT to have its own page. They remap it read-only */ +DEFINE_PER_CPU_PAGE_ALIGNED(struct cpu_gdt, cpu_gdt) = +#else +DEFINE_PER_CPU_ALIGNED(struct cpu_gdt, cpu_gdt) = +#endif +{ .gdt = { #ifdef CONFIG_X86_64 /* * We need valid