Re: [RFC PATCH v2 17/19] heki: x86: Update permissions counters during text patching

2023-11-30 Thread Edgecombe, Rick P
On Wed, 2023-11-29 at 15:07 -0600, Madhavan T. Venkataraman wrote:
> Threat Model
> 
> 
> In the threat model in Heki, the attacker is a user space attacker
> who exploits
> a kernel vulnerability to gain more privileges or bypass the kernel's
> access
> control and self-protection mechanisms. 
> 
> In the context of the guest page table, one of the things that the
> threat model translates
> to is a hacker gaining access to a guest page with RWX permissions.
> E.g., by adding execute
> permissions to a writable page or by adding write permissions to an
> executable page.
> 
> Today, the permissions for a guest page in the extended page table
> are RWX by
> default. So, if a hacker manages to establish RWX for a page in the
> guest page
> table, then that is all he needs to do some damage.

I had a few random comments from watching the plumbers talk online:

Is there really a big difference between a page that is RWX, and a RW
page that is about to become RX? I realize that there is an addition of
timing, but when executable code is getting loaded it can be written to
then and later executed. I think that gap could be addressed in two
different ways, both pretty difficult:
 1. Verifying the loaded code before it gets marked 
executable. This is difficult because the kernel does lots of 
tweaks on the code it is loading (alternatives, etc). It can't 
just check a signature.
 2. Loading the code in a protected environment. In this model the 
(for example) module signature would be checked, then the code 
would be loaded in some sort of protected environment. This way 
integrity of the loaded code would be enforced. But extracting 
module loading into a separate domain would be difficult. 
Various scattered features all have their hands in the loading.

Secondly, I wonder if another way to look at the memory parts of HEKI
could be that this is a way to protect certain page table bits from
stay writes. The RWX bits in the EPT are not directly writable, so more
steps are needed to change things than just a stray write (instead the
helpers involved in the operations need to be called). If that is a
fair way of looking at it, then I wonder how HEKI compares to a
solution like this security-wise:
https://lore.kernel.org/lkml/20210830235927.6443-1-rick.p.edgeco...@intel.com/

Functional-wise it had the benefit of working on bare metal and
supporting the normal kernel features.


Re: [PATCH v9 02/42] mm: Move pte/pmd_mkwrite() callers with no VMA to _novma()

2023-06-14 Thread Edgecombe, Rick P
On Tue, 2023-06-13 at 19:00 +0200, David Hildenbrand wrote:
> On 13.06.23 18:19, Edgecombe, Rick P wrote:
> > On Tue, 2023-06-13 at 10:44 +0300, Mike Rapoport wrote:
> > > > Previous patches have done the first step, so next move the
> > > > callers
> > > > that
> > > > don't have a VMA to pte_mkwrite_novma(). Also do the same for
> > > 
> > > I hear x86 maintainers asking to drop "previous patches" ;-)
> > > 
> > > Maybe
> > > This is the second step of the conversion that moves the callers
> > > ...
> > 
> > Really? I've not heard that. Just a strong aversion to "this
> > patch".
> > I've got feedback to say "previous patches" and not "the last
> > patch" so
> > it doesn't get stale. I guess it could be "previous changes".
> 
> Talking about patches make sense when discussing literal patches sent
> to 
> the mailing list. In the git log, it's commit, and "future commits"
> or 
> "follow-up work".
> 
> Yes, we use "patches" all of the time in commit logs, especially when
> we 
>   include the cover letter in the commit message (as done frequently
> in 
> the -mm tree).

I think I'll switch over to talking about "changes". If you talk about
commits it doesn't make as much sense when they are still just patches.
Thanks.


Re: [PATCH v9 02/42] mm: Move pte/pmd_mkwrite() callers with no VMA to _novma()

2023-06-13 Thread Edgecombe, Rick P
On Tue, 2023-06-13 at 10:44 +0300, Mike Rapoport wrote:
> > Previous patches have done the first step, so next move the callers
> > that
> > don't have a VMA to pte_mkwrite_novma(). Also do the same for
> 
> I hear x86 maintainers asking to drop "previous patches" ;-)
> 
> Maybe
> This is the second step of the conversion that moves the callers ...

Really? I've not heard that. Just a strong aversion to "this patch".
I've got feedback to say "previous patches" and not "the last patch" so
it doesn't get stale. I guess it could be "previous changes".

> 
> > pmd_mkwrite(). This will be ok for the shadow stack feature, as
> > these
> > callers are on kernel memory which will not need to be made shadow
> > stack,
> > and the other architectures only currently support one type of
> > memory
> > in pte_mkwrite()
> > 
> > Cc: linux-...@vger.kernel.org
> > Cc: linux-arm-ker...@lists.infradead.org
> > Cc: linux-s...@vger.kernel.org
> > Cc: xen-devel@lists.xenproject.org
> > Cc: linux-a...@vger.kernel.org
> > Cc: linux...@kvack.org
> > Signed-off-by: Rick Edgecombe 
> 
> Reviewed-by: Mike Rapoport (IBM) 

Thanks!


Re: [PATCH v9 02/42] mm: Move pte/pmd_mkwrite() callers with no VMA to _novma()

2023-06-13 Thread Edgecombe, Rick P
On Tue, 2023-06-13 at 14:27 +0200, David Hildenbrand wrote:
> Acked-by: David Hildenbrand 

Thanks!


Re: [RFC PATCH v1 0/9] Hypervisor-Enforced Kernel Integrity

2023-05-30 Thread Edgecombe, Rick P
On Fri, 2023-05-26 at 17:22 +0200, Mickaël Salaün wrote:
> > > Can the guest kernel ask the host VMM's emulated devices to DMA
> > > into
> > > the protected data? It should go through the host userspace
> > > mappings I
> > > think, which don't care about EPT permissions. Or did I miss
> > > where you
> > > are protecting that another way? There are a lot of easy ways to
> > > ask
> > > the host to write to guest memory that don't involve the EPT. You
> > > probably need to protect the host userspace mappings, and also
> > > the
> > > places in KVM that kmap a GPA provided by the guest.
> > 
> > Good point, I'll check this confused deputy attack. Extended KVM
> > protections should indeed handle all ways to map guests' memory.
> > I'm
> > wondering if current VMMs would gracefully handle such new
> > restrictions
> > though.
> 
> I guess the host could map arbitrary data to the guest, so that need
> to 
> be handled, but how could the VMM (not the host kernel) bypass/update
> EPT initially used for the guest (and potentially later mapped to the
> host)?

Well traditionally both QEMU and KVM accessed guest memory via host
mappings instead of the EPT. So I'm wondering what is stopping the
guest from passing a protected gfn when setting up the DMA, and QEMU
being enticed to write to it? The emulator as well would use these host
userspace mappings and not consult the EPT IIRC.

I think Sean was suggesting host userspace should be more involved in
this process, so perhaps it could protect its own alias of the
protected memory, for example mprotect() it as read-only.

There is (was?) some KVM PV features that accessed guest memory via the
host direct map as well. I would think mprotect() should protect this
at the get_user_pages() stage, but it looks like the details have
changed since I last understood it.


Re: [RFC PATCH v1 0/9] Hypervisor-Enforced Kernel Integrity

2023-05-25 Thread Edgecombe, Rick P
On Thu, 2023-05-25 at 09:07 -0700, Sean Christopherson wrote:
> On Thu, May 25, 2023, Rick P Edgecombe wrote:
> > I wonder if it might be a good idea to POC the guest side before
> > settling on the KVM interface. Then you can also look at the whole
> > thing and judge how much usage it would get for the different
> > options
> > of restrictions.
> 
> As I said earlier[*], IMO the control plane logic needs to live in
> host userspace.
> I think any attempt to have KVM providen anything but the low level
> plumbing will
> suffer the same fate as CR4 pinning and XO memory.  Iterating on an
> imperfect
> solution to incremently improve security is far, far easier to do in
> userspace,
> and far more likely to get merged.
> 
> [*] https://lore.kernel.org/all/zfuyhpuhtmbyd...@google.com

Sure, I should have put it more generally. I just meant people are not
going to want to maintain host-based features that guests can't
effectively use.

My takeaway from the CR pinning was that the guest kernel integration
was surprisingly tricky due to the one-way nature of the interface. XO
was more flexible than CR pinning in that respect, because the guest
could turn it off (and indeed, in the XO kernel text patches it had to
do this a lot).



Re: [RFC PATCH v1 0/9] Hypervisor-Enforced Kernel Integrity

2023-05-25 Thread Edgecombe, Rick P
On Thu, 2023-05-25 at 15:59 +0200, Mickaël Salaün wrote:
[ snip ]

> > The kernel often creates writable aliases in order to write to
> > protected data (kernel text, etc). Some of this is done right as
> > text
> > is being first written out (alternatives for example), and some
> > happens
> > way later (jump labels, etc). So for verification, I wonder what
> > stage
> > you would be verifying? If you want to verify the end state, you
> > would
> > have to maintain knowledge in the verifier of all the touch-ups the
> > kernel does. I think it would get very tricky.
> 
> For now, in the static kernel case, all rodata and text GPA is 
> restricted, so aliasing such memory in a writable way before or after
> the KVM enforcement would still restrict write access to this memory,
> which could be an issue but not a security one. Do you have such 
> examples in mind?
> 

On x86, look at all the callers of the text_poke() family. In
arch/x86/include/asm/text-patching.h.

> 
> > 
> > It also seems it will be a decent ask for the guest kernel to keep
> > track of GPA permissions as well as normal virtual memory
> > pemirssions,
> > if this thing is not widely used.
> 
> This would indeed be required to properly handle the dynamic cases.
> 
> 
> > 
> > So I wondering if you could go in two directions with this:
> > 1. Make this a feature only for super locked down kernels (no
> > modules,
> > etc). Forbid any configurations that might modify text. But eBPF is
> > used for seccomp, so you might be turning off some security
> > protections
> > to get this.
> 
> Good idea. For "super locked down kernels" :) , we should disable all
> kernel executable changes with the related kernel build configuration
> (e.g. eBPF JIT, kernel module, kprobes…) to make sure there is no
> such 
> legitimate access. This looks like an acceptable initial feature.

How many users do you think will want this protection but not
protections that would have to be disabled? The main one that came to
mind for me is cBPF seccomp stuff.

But also, the alternative to JITing cBPF is the eBPF interpreter which
AFAIU is considered a juicy enough target for speculative attacks that
they created an option to compile it out. And leaving an interpreter in
the kernel means any data could be "executed" in the normal non-
speculative scenario, kind of working around the hypervisor executable
protections. Dropping e/cBPF entirely would be an option, but then I
wonder how many users you have left. Hopefully that is all correct,
it's hard to keep track with the pace of BPF development.

I wonder if it might be a good idea to POC the guest side before
settling on the KVM interface. Then you can also look at the whole
thing and judge how much usage it would get for the different options
of restrictions.

> 
> 
> > 2. Loosen the rules to allow the protections to not be so one-way
> > enable. Get less security, but used more widely.
> 
> This is our goal. I think both static and dynamic cases are
> legitimate 
> and have value according to the level of security sought. This should
> be 
> a build-time configuration.

Yea, the proper way to do this is probably to move all text handling
stuff into a separate domain of some sort, like you mentioned
elsewhere. It would be quite a job.


Re: [RFC PATCH v1 0/9] Hypervisor-Enforced Kernel Integrity

2023-05-24 Thread Edgecombe, Rick P
On Fri, 2023-05-05 at 17:20 +0200, Mickaël Salaün wrote:
> # How does it work?
> 
> This implementation mainly leverages KVM capabilities to control the
> Second
> Layer Address Translation (or the Two Dimensional Paging e.g.,
> Intel's EPT or
> AMD's RVI/NPT) and Mode Based Execution Control (Intel's MBEC)
> introduced with
> the Kaby Lake (7th generation) architecture. This allows to set
> permissions on
> memory pages in a complementary way to the guest kernel's managed
> memory
> permissions. Once these permissions are set, they are locked and
> there is no
> way back.
> 
> A first KVM_HC_LOCK_MEM_PAGE_RANGES hypercall enables the guest
> kernel to lock
> a set of its memory page ranges with either the HEKI_ATTR_MEM_NOWRITE
> or the
> HEKI_ATTR_MEM_EXEC attribute. The first one denies write access to a
> specific
> set of pages (allow-list approach), and the second only allows kernel
> execution
> for a set of pages (deny-list approach).
> 
> The current implementation sets the whole kernel's .rodata (i.e., any
> const or
> __ro_after_init variables, which includes critical security data such
> as LSM
> parameters) and .text sections as non-writable, and the .text section
> is the
> only one where kernel execution is allowed. This is possible thanks
> to the new
> MBEC support also brough by this series (otherwise the vDSO would
> have to be
> executable). Thanks to this hardware support (VT-x, EPT and MBEC),
> the
> performance impact of such guest protection is negligible.
> 
> The second KVM_HC_LOCK_CR_UPDATE hypercall enables guests to pin some
> of its
> CPU control register flags (e.g., X86_CR0_WP, X86_CR4_SMEP,
> X86_CR4_SMAP),
> which is another complementary hardening mechanism.
> 
> Heki can be enabled with the heki=1 boot command argument.
> 
> 

Can the guest kernel ask the host VMM's emulated devices to DMA into
the protected data? It should go through the host userspace mappings I
think, which don't care about EPT permissions. Or did I miss where you
are protecting that another way? There are a lot of easy ways to ask
the host to write to guest memory that don't involve the EPT. You
probably need to protect the host userspace mappings, and also the
places in KVM that kmap a GPA provided by the guest.

[ snip ]

> 
> # Current limitations
> 
> The main limitation of this patch series is the statically enforced
> permissions. This is not an issue for kernels without module but this
> needs to
> be addressed.  Mechanisms that dynamically impact kernel executable
> memory are
> not handled for now (e.g., kernel modules, tracepoints, eBPF JIT),
> and such
> code will need to be authenticated.  Because the hypervisor is highly
> privileged and critical to the security of all the VMs, we don't want
> to
> implement a code authentication mechanism in the hypervisor itself
> but delegate
> this verification to something much less privileged. We are thinking
> of two
> ways to solve this: implement this verification in the VMM or spawn a
> dedicated
> special VM (similar to Windows's VBS). There are pros on cons to each
> approach:
> complexity, verification code ownership (guest's or VMM's), access to
> guest
> memory (i.e., confidential computing).

The kernel often creates writable aliases in order to write to
protected data (kernel text, etc). Some of this is done right as text
is being first written out (alternatives for example), and some happens
way later (jump labels, etc). So for verification, I wonder what stage
you would be verifying? If you want to verify the end state, you would
have to maintain knowledge in the verifier of all the touch-ups the
kernel does. I think it would get very tricky.

It also seems it will be a decent ask for the guest kernel to keep
track of GPA permissions as well as normal virtual memory pemirssions,
if this thing is not widely used.

So I wondering if you could go in two directions with this:
1. Make this a feature only for super locked down kernels (no modules,
etc). Forbid any configurations that might modify text. But eBPF is
used for seccomp, so you might be turning off some security protections
to get this.
2. Loosen the rules to allow the protections to not be so one-way
enable. Get less security, but used more widely.

There were similar dilemmas with the PV CR pinning stuff.


Re: [PATCH v6 13/41] mm: Make pte_mkwrite() take a VMA

2023-02-20 Thread Edgecombe, Rick P
On Mon, 2023-02-20 at 12:23 +0100, David Hildenbrand wrote:
> That looks painful but IMHO worth it :)
> 
> Acked-by: David Hildenbrand 

Thanks. Yes it was not the most fun, but I agree - worth it.


Re: [PATCH v6 13/41] mm: Make pte_mkwrite() take a VMA

2023-02-20 Thread Edgecombe, Rick P
On Mon, 2023-02-20 at 12:00 +1100, Michael Ellerman wrote:
> Acked-by: Michael Ellerman  (powerpc)

Thanks!


Re: [PATCH v2 0/8] x86/mtrr: fix handling with PAT but without MTRR

2023-02-13 Thread Edgecombe, Rick P
On Mon, 2023-02-13 at 07:12 +0100, Juergen Gross wrote:
> 
> Thanks for the report.
> 
> I'll have a look. Probably I'll need to re-add the check for WB in
> patch 7.

Sure, let me know if you need any more details about by setup.


Re: [PATCH v2 0/8] x86/mtrr: fix handling with PAT but without MTRR

2023-02-10 Thread Edgecombe, Rick P
On Thu, 2023-02-09 at 08:22 +0100, Juergen Gross wrote:
> This series tries to fix the rather special case of PAT being
> available
> without having MTRRs (either due to CONFIG_MTRR being not set, or
> because the feature has been disabled e.g. by a hypervisor).

debug_vm_pgtable fails in a KVM guest with CONFIG_MTRR=y. CONFIG_MTRR=n
succeeds.

[0.830280] debug_vm_pgtable: [debug_vm_pgtable ]:
Validating architecture page table helpers
[0.831906] [ cut here ]
[0.832711] WARNING: CPU: 0 PID: 1 at mm/debug_vm_pgtable.c:461
debug_vm_pgtable+0xb9a/0xe16
[0.833998] Modules linked in:
[0.834450] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 6.2.0-rc7+
#2366
[0.835462] RIP: 0010:debug_vm_pgtable+0xb9a/0xe16
[0.836217] Code: e2 3a 73 4a 48 c7 00 00 00 00 00 48 8b b4 24 a0 00
00 00 48 8b 54 24 60 48 8b 7c 24 20 48 c4
[0.839068] RSP: :c9013de0 EFLAGS: 00010246
[0.839735] RAX:  RBX: 888100048868 RCX:
bff0
[0.840646] RDX:  RSI: 4000 RDI:

[0.841661] RBP: 88810004d140 R08:  R09:
888100280880
[0.842625] R10: 0001 R11: 0001 R12:
888103810298
[0.843574] R13: 888100048780 R14: 8282e099 R15:

[0.844524] FS:  () GS:88813bc0()
knlGS:
[0.845706] CS:  0010 DS:  ES:  CR0: 80050033
[0.846499] CR2: 88813000 CR3: 0222d001 CR4:
00370ef0
[0.847464] DR0:  DR1:  DR2:

[0.848432] DR3:  DR6: fffe0ff0 DR7:
0400
[0.849371] Call Trace:
[0.849699]  
[0.849997]  ? destroy_args+0x131/0x131
[0.850487]  do_one_initcall+0x61/0x250
[0.850983]  ? rdinit_setup+0x2c/0x2c
[0.851451]  kernel_init_freeable+0x18e/0x1d8
[0.852033]  ? rest_init+0x130/0x130
[0.852533]  kernel_init+0x16/0x120
[0.853035]  ret_from_fork+0x1f/0x30
[0.853507]  
[0.853803] ---[ end trace  ]---
[0.854421] [ cut here ]
[0.855027] WARNING: CPU: 0 PID: 1 at mm/debug_vm_pgtable.c:462
debug_vm_pgtable+0xbaa/0xe16
[0.856115] Modules linked in:
[0.856517] CPU: 0 PID: 1 Comm: swapper/0 Tainted:
GW  6.2.0-rc7+ #2366
[0.857562] RIP: 0010:debug_vm_pgtable+0xbaa/0xe16
[0.858186] Code: 00 00 00 48 8b 54 24 60 48 8b 7c 24 20 48 c1 e6 0c
e8 79 18 7f fe 85 c0 75 02 0f 0b 48 8b 7b
[0.860778] RSP: :c9013de0 EFLAGS: 00010246
[0.861519] RAX:  RBX: 888100048868 RCX:
bff0
[0.862530] RDX:  RSI: 4000 RDI:
88810380e7f8
[0.863522] RBP: 88810004d140 R08:  R09:
888100280880
[0.864449] R10: 0001 R11: 0001 R12:
888103810298
[0.865454] R13: 888100048780 R14: 8282e099 R15:

[0.866401] FS:  () GS:88813bc0()
knlGS:
[0.867438] CS:  0010 DS:  ES:  CR0: 80050033
[0.868181] CR2: 88813000 CR3: 0222d001 CR4:
00370ef0
[0.869097] DR0:  DR1:  DR2:

[0.870026] DR3:  DR6: fffe0ff0 DR7:
0400
[0.870943] Call Trace:
[0.871259]  
[0.871537]  ? destroy_args+0x131/0x131
[0.872030]  do_one_initcall+0x61/0x250
[0.872521]  ? rdinit_setup+0x2c/0x2c
[0.873005]  kernel_init_freeable+0x18e/0x1d8
[0.873607]  ? rest_init+0x130/0x130
[0.874116]  kernel_init+0x16/0x120
[0.874618]  ret_from_fork+0x1f/0x30
[0.875123]  
[0.875411] ---[ end trace  ]---