Re: [PATCH v38 10/24] mm: Add vm_ops->mprotect()

2020-09-24 Thread Dave Hansen
On 9/24/20 12:28 PM, Sean Christopherson wrote: > On Thu, Sep 24, 2020 at 02:11:37PM -0500, Haitao Huang wrote: >> On Wed, 23 Sep 2020 08:50:56 -0500, Jarkko Sakkinen >> wrote: >>> I'll categorically deny noexec in the next patch set version. >>> >>> /Jarkko >> There are use cases supported curren

Re: [PATCH v38 10/24] mm: Add vm_ops->mprotect()

2020-09-24 Thread Dave Hansen
On 9/23/20 7:33 AM, Jarkko Sakkinen wrote: > The consequence is that enclaves are best created with an ioctl API and the > access control can be based only to the origin of the source file for the > enclave data, i.e. on VMA file pointer and page permissions. For example, > this could be done with

Re: [PATCH v12 8/8] x86: Disallow vsyscall emulation when CET is enabled

2020-09-23 Thread Dave Hansen
On 9/23/20 3:47 PM, Andy Lutomirski wrote: > On Wed, Sep 23, 2020 at 3:20 PM Yu, Yu-cheng wrote: >> On 9/23/2020 3:08 PM, Dave Hansen wrote: >>> On 9/23/20 3:06 PM, Yu, Yu-cheng wrote: >>>> I think I'll add a check here for (r + 8) >= TASK_SIZE_MAX.

[PATCH] media: fix Omnivision Intel MAINTAINERS entry

2020-09-23 Thread Dave Hansen
From: Dave Hansen Tianshu Qiu has three MAINTAINERS entries, and one typo. After being notified of the typo a few months ago, they didn't act, so here's a patch. Tianshu, an ack would be appreciated. Signed-off-by: Dave Hansen Cc: Tianshu Qiu Cc: Shawn Tu Cc: Bingbu Cao Cc

Re: [RFC PATCH v2] tools/x86: add kcpuid tool to show raw CPU features

2020-09-22 Thread Dave Hansen
On 9/21/20 10:27 PM, Feng Tang wrote: > +static void parse_text(void) > +{ > + FILE *file; > + char *line = NULL; > + size_t len = 0; > + int ret; > + > + file = fopen("cpuid.txt", "r"); > + if (!file) { > + printf("Error in opening 'cpuid.txt'\n"); > +
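The quoted parse_text() opens a hardcoded "cpuid.txt" and reads it with getline(). Below is a minimal sketch of the usual POSIX getline() loop. It takes a FILE * so the caller picks the path, and it frees the buffer that getline() allocates. The '#' comment convention and the helper name count_entries() are assumptions made for illustration, not the tool's actual format.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>

/*
 * Count the non-blank, non-comment lines in a CPUID description file.
 * Assumption (hypothetical format): lines starting with '#' are comments.
 * Returns the count, or -1 if a read error occurred on the stream.
 */
static long count_entries(FILE *file)
{
	char *line = NULL;	/* getline() allocates and grows this buffer */
	size_t len = 0;
	long entries = 0;

	while (getline(&line, &len, file) != -1) {
		/* Skip leading whitespace so indented comments are handled too. */
		const char *p = line + strspn(line, " \t");

		if (*p == '#' || *p == '\n' || *p == '\0')
			continue;
		entries++;
	}
	/* getline() returns -1 on both EOF and error; tell them apart. */
	if (ferror(file))
		entries = -1;
	free(line);		/* one free() covers every getline() call */
	return entries;
}
```

In a tool like this, main() would fopen() whatever path the user asked for and pass the handle in, instead of baking "cpuid.txt" into the parser.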

Re: [PATCH v38 10/24] mm: Add vm_ops->mprotect()

2020-09-22 Thread Dave Hansen
On 9/22/20 5:58 AM, Jarkko Sakkinen wrote: > Intel Sofware Guard eXtensions (SGX) allows creation of executable blobs > called enclaves, of which page permissions are defined when the enclave "of which" => "for which" > is first loaded. Once an enclave is loaded and initialized, it can be > mappe

Re: [PATCH v12 1/8] x86/cet/ibt: Add Kconfig option for user-mode Indirect Branch Tracking

2020-09-21 Thread Dave Hansen
On 9/21/20 3:30 PM, Yu, Yu-cheng wrote: > +config X86_INTEL_BRANCH_TRACKING_USER > +prompt "Intel Indirect Branch Tracking for user-mode" Take the "Intel " and "INTEL_" out, please. It will only cause us all pain later if some of our x86 compatriots decide to implement this. > If the kernel

Re: [PATCH v12 8/8] x86: Disallow vsyscall emulation when CET is enabled

2020-09-18 Thread Dave Hansen
> Here is my CET slides for LPC 2020: > > https://gitlab.com/cet-software/cet-smoke-test/-/wikis/uploads/09431a51248858e6f716a59065d732e2/CET-LPC-2020.pdf > > which may have answers for most questions. Hi H.J., I know you're not super familiar with our kernel development process, which might be

Re: [PATCH v12 8/8] x86: Disallow vsyscall emulation when CET is enabled

2020-09-18 Thread Dave Hansen
On 9/18/20 2:06 PM, H.J. Lu wrote: > On Fri, Sep 18, 2020 at 2:00 PM Pavel Machek wrote: >> On Fri 2020-09-18 12:32:57, Dave Hansen wrote: >>> On 9/18/20 12:23 PM, Yu-cheng Yu wrote: >>>> Emulation of the legacy vsyscall page is required by some programs >>&

Re: [PATCH v12 8/8] x86: Disallow vsyscall emulation when CET is enabled

2020-09-18 Thread Dave Hansen
On 9/18/20 12:23 PM, Yu-cheng Yu wrote: > Emulation of the legacy vsyscall page is required by some programs > built before 2013. Newer programs after 2013 don't use it. > Disable vsyscall emulation when Control-flow Enforcement (CET) is > enabled to enhance security. How does this "enhance secur

Re: [PATCH V7 1/4] perf/core: Add PERF_SAMPLE_DATA_PAGE_SIZE

2020-09-17 Thread Dave Hansen
On 9/17/20 2:58 PM, Liang, Kan wrote: > The user space perf tool looks like a better place for this kind of > warning. The perf tool knows the total number of the samples. It also > knows the number of the page size 0 samples. We can set a threshold, > e.g., 90%. If 90% of the samples have the page
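Kan's proposal is for the perf tool, which knows both the total sample count and how many samples reported a page size of 0, to warn once that ratio crosses a threshold such as 90%. Here is a minimal sketch of that check in integer arithmetic. The function name and the inclusive ">=" comparison are assumptions.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Return true when at least threshold_pct percent of the samples reported
 * a data page size of 0 (size could not be determined).
 * zero/total >= pct/100 is rewritten as zero * 100 >= pct * total
 * to stay in integer math.
 */
static bool page_size_warn(uint64_t zero_samples, uint64_t total_samples,
			   unsigned int threshold_pct)
{
	if (total_samples == 0)
		return false;	/* nothing sampled, nothing to warn about */
	return zero_samples * 100 >= (uint64_t)threshold_pct * total_samples;
}
```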

Re: [PATCH V7 1/4] perf/core: Add PERF_SAMPLE_DATA_PAGE_SIZE

2020-09-17 Thread Dave Hansen
On 9/17/20 6:52 AM, kan.li...@linux.intel.com wrote: > + mm = current->mm; > + if (!mm) { > + /* > + * For kernel threads and the like, use init_mm so that > + * we can find kernel memory. > + */ > + mm = &init_mm; > + } I

Re: [PATCH v37 13/24] x86/sgx: Add SGX_IOC_ENCLAVE_ADD_PAGES

2020-09-15 Thread Dave Hansen
On 9/15/20 3:17 AM, Jarkko Sakkinen wrote: > OK, spotted the regression, sorry about this. I'll fix it for v38, which > I'm sending soon given the email server issues with v37. I'm going to cry uncle on the mail quantity too. Someone is going to think the mail relays are mining bitcoin. Especial

Re: [NEEDS-REVIEW] Re: [PATCH v11 25/25] x86/cet/shstk: Add arch_prctl functions for shadow stack

2020-09-15 Thread Dave Hansen
On 9/15/20 12:08 PM, Yu-cheng Yu wrote: > On Mon, 2020-09-14 at 17:12 -0700, Yu, Yu-cheng wrote: >> On 9/14/2020 7:50 AM, Dave Hansen wrote: >>> On 9/11/20 3:59 PM, Yu-cheng Yu wrote: >>> ... >>>> Here are the changes if we take the mprotect(PROT_SHSTK)

Re: [RFC PATCH 00/24] mm/hugetlb: Free some vmemmap pages of hugetlb page

2020-09-15 Thread Dave Hansen
On 9/15/20 7:32 AM, Matthew Wilcox wrote: > On Tue, Sep 15, 2020 at 08:59:23PM +0800, Muchun Song wrote: >> This patch series will free some vmemmap pages(struct page structures) >> associated with each hugetlbpage when preallocated to save memory. > It would be lovely to be able to do this. Unfor

Re: [PATCH v11 25/25] x86/cet/shstk: Add arch_prctl functions for shadow stack

2020-09-14 Thread Dave Hansen
On 9/14/20 11:31 AM, Andy Lutomirski wrote: > No matter what we do, the effects of calling vfork() are going to be a > bit odd with SHSTK enabled. I suppose we could disallow this, but > that seems likely to cause its own issues. What's odd about it? If you're a vfork()'d child, you can't touch

Re: [NEEDS-REVIEW] Re: [PATCH v11 25/25] x86/cet/shstk: Add arch_prctl functions for shadow stack

2020-09-14 Thread Dave Hansen
On 9/11/20 3:59 PM, Yu-cheng Yu wrote: ... > Here are the changes if we take the mprotect(PROT_SHSTK) approach. > Any comments/suggestions? I still don't like it. :) I'll also be much happier when there's a proper changelog to accompany this which also spells out the alternatives any why they suc

Re: [PATCH 2/4 v3] x86: AMD: Add hardware-enforced cache coherency as a CPUID feature

2020-09-11 Thread Dave Hansen
On 9/11/20 1:10 PM, Krish Sadhukhan wrote: ... >>> +#define X86_FEATURE_HW_CACHE_COHERENCY (11*32+ 7) /* AMD >>> hardware-enforced cache coherency */ >> That's an awfully generic name.  We generally have "hardware-enforced >> cache coherency" already everywhere. :) >> >> This probably needs to say
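An X86_FEATURE_* value such as (11*32+ 7) packs a 32-bit capability word index and a bit number within that word. The sketch below shows the decoding. The helper names and the local macro name are hypothetical. The kernel uses its own cpufeature macros for this.

```c
#include <stdint.h>

/* Same encoding as the quoted define: word 11, bit 7. */
#define HW_CACHE_COHERENCY_FEATURE (11*32 + 7)

/* Hypothetical helpers: split a feature number into word index and bit. */
static inline unsigned int feature_word(unsigned int feature) { return feature / 32; }
static inline unsigned int feature_bit(unsigned int feature)  { return feature % 32; }

/* Test a feature against an array of 32-bit capability words. */
static inline int has_feature(const uint32_t *caps, unsigned int feature)
{
	return (caps[feature_word(feature)] >> feature_bit(feature)) & 1;
}
```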

Re: [PATCH 2/4 v3] x86: AMD: Add hardware-enforced cache coherency as a CPUID feature

2020-09-11 Thread Dave Hansen
On 9/11/20 12:25 PM, Krish Sadhukhan wrote: > > diff --git a/arch/x86/include/asm/cpufeatures.h > b/arch/x86/include/asm/cpufeatures.h > index 81335e6fe47d..0e5b27ee5931 100644 > --- a/arch/x86/include/asm/cpufeatures.h > +++ b/arch/x86/include/asm/cpufeatures.h > @@ -293,6 +293,7 @@ > #define X

Re: Ways to deprecate /sys/devices/system/memory/memoryX/phys_device ?

2020-09-11 Thread Dave Hansen
On 9/11/20 3:09 AM, David Hildenbrand wrote: > Maybe we can derive the actual DIMMs from some ACPI tables (SRAT?), > instead of relying on e820/"System RAM resources" - I have no clue. It's actually really hard to map a DIMM to a physical address. Interleaving can mean that one page actually spans
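Interleaving is why a single page can span several DIMMs. The simplified model below assumes plain round-robin interleave across `ways` DIMMs at a fixed granularity. Real memory controllers often hash address bits, so treat this as illustrative only. It counts how many distinct DIMMs back one page, and it assumes a page-aligned start and a granularity that is a power of two.

```c
#include <stdint.h>

/*
 * Simplified round-robin interleave model (assumption; real controllers
 * may hash address bits): consecutive `granularity`-byte chunks rotate
 * across `ways` DIMMs.
 */
static unsigned int dimm_for_addr(uint64_t paddr, uint64_t granularity,
				  unsigned int ways)
{
	return (unsigned int)((paddr / granularity) % ways);
}

/* Count distinct DIMMs touched by the page at page_addr (ways <= 64). */
static unsigned int dimms_per_page(uint64_t page_addr, uint64_t page_size,
				   uint64_t granularity, unsigned int ways)
{
	uint64_t seen = 0;	/* bitmap of DIMMs already counted */
	unsigned int count = 0;

	for (uint64_t off = 0; off < page_size; off += granularity) {
		unsigned int d = dimm_for_addr(page_addr + off, granularity, ways);

		if (!(seen & (1ULL << d))) {
			seen |= 1ULL << d;
			count++;
		}
	}
	return count;
}
```

With 256-byte, 4-way interleave, one 4 KiB page touches all four DIMMs. With 4 KiB granularity it stays on one.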

Re: Ways to deprecate /sys/devices/system/memory/memoryX/phys_device ?

2020-09-10 Thread Dave Hansen
On 9/10/20 3:20 AM, David Hildenbrand wrote: > While I'd love to rip it out completely, I think it would break old > lsmem/chmem completely - and I assume that's not acceptable. I was > wondering what would be considered safe to do now/in the future: > > 1. Make it always return 0 (just as if "scl

Re: Ways to deprecate /sys/devices/system/memory/memoryX/phys_device ?

2020-09-10 Thread Dave Hansen
On 9/10/20 3:20 AM, David Hildenbrand wrote: > I was just exploring how /sys/devices/system/memory/memoryX/phys_device > is/was used. It's one of these interfaces that most probably never > should have been added but now we are stuck with it. While I'm all for cleanups, what specific problems is p

Re: [PATCH v11 25/25] x86/cet/shstk: Add arch_prctl functions for shadow stack

2020-09-09 Thread Dave Hansen
On 9/9/20 4:25 PM, Yu, Yu-cheng wrote: > On 9/9/2020 4:11 PM, Dave Hansen wrote: >> On 9/9/20 4:07 PM, Yu, Yu-cheng wrote: >>> What if a writable mapping is passed to madvise(MADV_SHSTK)?  Should >>> that be rejected? >> >> It doesn't matter to me. 

Re: [PATCH v11 25/25] x86/cet/shstk: Add arch_prctl functions for shadow stack

2020-09-09 Thread Dave Hansen
On 9/9/20 4:07 PM, Yu, Yu-cheng wrote: > What if a writable mapping is passed to madvise(MADV_SHSTK)?  Should > that be rejected? It doesn't matter to me. Even if it's readable, it _stops_ being even directly readable after it's a shadow stack, right? I don't think writes are special in any way.

Re: [PATCH v11 25/25] x86/cet/shstk: Add arch_prctl functions for shadow stack

2020-09-09 Thread Dave Hansen
On 9/9/20 3:08 PM, Yu, Yu-cheng wrote: > After looking at this more, I found the changes are more similar to > mprotect() than madvise().  We are going to change an anonymous mapping > to a read-only mapping, and add the VM_SHSTK flag to it.  Would an > x86-specific mprotect(PROT_SHSTK) make more s

Re: [RFC PATCH v2 1/3] mm/gup: fix gup_fast with dynamic page table folding

2020-09-09 Thread Dave Hansen
On 9/9/20 5:29 AM, Gerald Schaefer wrote: > This only works well as long there are real pagetable pointers involved, > that can also be used for iteration. For gup_fast, or any other future > pagetable walkers using the READ_ONCE logic w/o lock, that is not true. > There are pointers involved to lo

Re: [RFC PATCH v2 1/3] mm/gup: fix gup_fast with dynamic page table folding

2020-09-08 Thread Dave Hansen
On 9/7/20 11:00 AM, Gerald Schaefer wrote: > Commit 1a42010cdc26 ("s390/mm: convert to the generic get_user_pages_fast > code") introduced a subtle but severe bug on s390 with gup_fast, due to > dynamic page table folding. Would it be fair to say that the "fake" page table entries s390 allocates o

Re: [RFC PATCH v2 2/3] mm: make pXd_addr_end() functions page-table entry aware

2020-09-08 Thread Dave Hansen
On 9/7/20 11:00 AM, Gerald Schaefer wrote: > x86: > add/remove: 0/0 grow/shrink: 2/0 up/down: 10/0 (10) > Function old new delta > vmemmap_populate 587 592 +5 > munlock_vma_pages_range 556 561
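This patch changes the pXd_addr_end() helpers. For reference, the generic versions clamp a page-table walk to the next boundary at that level or to `end`, whichever comes first. Below is a user-space transcription of the PMD variant. It assumes x86-64's 2 MiB PMD coverage. Comparing `boundary - 1` against `end - 1` keeps the result correct when a value wraps to 0 at the top of the address space.

```c
#define PMD_SHIFT 21			/* assumption: x86-64, 2 MiB per PMD entry */
#define PMD_SIZE  (1UL << PMD_SHIFT)
#define PMD_MASK  (~(PMD_SIZE - 1))

/* Return the end of the PMD-sized region containing addr, clamped to end. */
static unsigned long pmd_addr_end(unsigned long addr, unsigned long end)
{
	unsigned long boundary = (addr + PMD_SIZE) & PMD_MASK;

	return (boundary - 1 < end - 1) ? boundary : end;
}
```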

Re: [PATCH v11 25/25] x86/cet/shstk: Add arch_prctl functions for shadow stack

2020-09-08 Thread Dave Hansen
On 9/8/20 10:50 AM, Yu, Yu-cheng wrote: > What about this: > > - Do not add any new syscall or arch_prctl for creating a new shadow stack. > > - Add a new arch_prctl that can turn an anonymous mapping to a shadow > stack mapping. > > This allows the application to do whatever is necessary.  It c

Re: [PATCH RFC 09/10] kfence, Documentation: add KFENCE documentation

2020-09-08 Thread Dave Hansen
On 9/7/20 6:40 AM, Marco Elver wrote: > +The most important parameter is KFENCE's sample interval, which can be set > via > +the kernel boot parameter ``kfence.sample_interval`` in milliseconds. The > +sample interval determines the frequency with which heap allocations will be > +guarded by KFENC

Re: [PATCH v11 6/9] x86/cet: Add PTRACE interface for CET

2020-09-03 Thread Dave Hansen
On 9/3/20 9:32 AM, Andy Lutomirski wrote: >> Taking the config register out of the init state is illogical, as is >> writing to SSP while the config register is in its init state. > What's so special about the INIT state? It's optimized by XSAVES, but > it's just a number, right? So taking the re

Re: [PATCH v11 6/9] x86/cet: Add PTRACE interface for CET

2020-09-03 Thread Dave Hansen
On 9/3/20 9:15 AM, Andy Lutomirski wrote: > On Thu, Sep 3, 2020 at 9:12 AM Dave Hansen wrote: >> >> On 9/3/20 9:09 AM, Yu, Yu-cheng wrote: >>> If the debugger is going to write an MSR, only in the third case would >>> this make a slight sense. For example, if th

Re: [PATCH v11 6/9] x86/cet: Add PTRACE interface for CET

2020-09-03 Thread Dave Hansen
On 9/3/20 9:09 AM, Yu, Yu-cheng wrote: > If the debugger is going to write an MSR, only in the third case would > this make a slight sense.  For example, if the system has CET enabled, > but the task does not have CET enabled, and GDB is writing to a CET MSR. >  But still, this is strange to me. I

Re: [PATCH v11 6/9] x86/cet: Add PTRACE interface for CET

2020-09-03 Thread Dave Hansen
On 9/2/20 9:35 PM, Andy Lutomirski wrote: >> + fpu__prepare_read(fpu); >> + cetregs = get_xsave_addr(&fpu->state.xsave, XFEATURE_CET_USER); >> + if (!cetregs) >> + return -EFAULT; > Can this branch ever be hit without a kernel bug? If yes, I think

Re: [RFC PATCH] tools/x86: add kcpuid tool to show raw CPU features

2020-09-02 Thread Dave Hansen
On 9/2/20 9:52 AM, Borislav Petkov wrote: >> I was *really* hoping that we could eventually feed kcpuid and the >> X86_FEATURE_* bits from the same source. > But X86_FEATURE_* won't be all bits in all CPUID leafs - only the ones the > kernel has enabled/use for/needs/... > > Also you have CPUID fi

Re: [RFC PATCH] tools/x86: add kcpuid tool to show raw CPU features

2020-09-02 Thread Dave Hansen
On 9/2/20 9:45 AM, pet...@infradead.org wrote: > On Thu, Aug 27, 2020 at 03:49:03PM +0800, Feng Tang wrote: >> End users frequently want to know what features their processor >> supports, independent of what the kernel supports. >> >> /proc/cpuinfo is great. It is omnipresent and since it is provid

Re: [RFC PATCH] tools/x86: add kcpuid tool to show raw CPU features

2020-09-02 Thread Dave Hansen
On 9/2/20 8:40 AM, Borislav Petkov wrote: > When you need to add a new leaf, you simply extend the text file and the > tool parses it anew and has its all CPUID info uptodate. This way you > won't even have to recompile it. Adding new CPUID leafs would be adding new > lines to the file. > > For ex

Re: [PATCH v11 25/25] x86/cet/shstk: Add arch_prctl functions for shadow stack

2020-09-01 Thread Dave Hansen
On 9/1/20 10:45 AM, Andy Lutomirski wrote: >>> For arm64 (and sparc etc.) we continue to use the regular mmap/mprotect >>> family of calls. One or two additional arch-specific mmap flags are >>> sufficient for now. >>> >>> Is x86 definitely not going to fit within those calls? >> That can work for

Re: [PATCH 05/17] virt: acrn: Introduce ACRN HSM basic driver

2020-08-29 Thread Dave Hansen
On 8/29/20 3:46 AM, Shuo A Liu wrote: > On Fri 28.Aug'20 at 12:25:59 +0200, Greg Kroah-Hartman wrote: >> On Tue, Aug 25, 2020 at 10:45:05AM +0800, shuo.a@intel.com wrote: >>> +static long acrn_dev_ioctl(struct file *filp, unsigned int cmd, >>> +   unsigned long ioctl_param) >>> +{ >

Re: [PATCH v11 25/25] x86/cet/shstk: Add arch_prctl functions for shadow stack

2020-08-26 Thread Dave Hansen
On 8/26/20 11:49 AM, Yu, Yu-cheng wrote: >> I would expect things like Go and various JITs to call it directly. >> >> If we wanted to be fancy and add a potentially more widely useful >> syscall, how about: >> >> mmap_special(void *addr, size_t length, int prot, int flags, int type); >> >> Where ty

Re: [PATCH v11 25/25] x86/cet/shstk: Add arch_prctl functions for shadow stack

2020-08-25 Thread Dave Hansen
On 8/25/20 2:04 PM, Yu, Yu-cheng wrote: >>> I think this is more arch-specific.  Even if it becomes a new syscall, >>> we still need to pass the same parameters. >> >> Right, but without the copying in and out of memory. >> > Linux-api is already on the Cc list.  Do we need to add more people to >

Re: [PATCH v11 25/25] x86/cet/shstk: Add arch_prctl functions for shadow stack

2020-08-25 Thread Dave Hansen
On 8/25/20 11:43 AM, Yu, Yu-cheng wrote: >>> arch_prctl(ARCH_X86_CET_MMAP_SHSTK, u64 *args) >>> Allocate a new shadow stack. >>> >>> The parameter 'args' is a pointer to a user buffer. >>> >>> *args = desired size >>> *(args + 1) = MAP_32BIT or MAP_POPULATE >>> >>> On retur

Re: TDX #VE in SYSCALL gap (was: [RFD] x86: Curing the exception and syscall trainwreck in hardware)

2020-08-25 Thread Dave Hansen
On 8/25/20 10:59 AM, Andrew Cooper wrote: > If I've read the TDX spec/whitepaper properly, the main hypervisor can > write to all the encrypted pages.  This will destroy data, break the > MAC, and yields #PF inside the SEAM hypervisor, or the TD when the cache > line is next referenced. I think yo

Re: TDX #VE in SYSCALL gap (was: [RFD] x86: Curing the exception and syscall trainwreck in hardware)

2020-08-25 Thread Dave Hansen
On 8/24/20 9:39 PM, Sean Christopherson wrote: > +Andy > > On Mon, Aug 24, 2020 at 02:52:01PM +0100, Andrew Cooper wrote: >> And to help with coordination, here is something prepared (slightly) >> earlier. >> >> https://docs.google.com/document/d/1hWejnyDkjRRAW-JEsRjA5c9CKLOPc6VKJQsuvODlQEI/edit?u

Re: FSGSBASE causing panic on 5.9-rc1

2020-08-20 Thread Dave Hansen
On 8/20/20 12:05 PM, Tom Lendacky wrote: >> I added a quick hack to save TSC_AUX to a new variable in the SVM >> struct and then restore it right after VMEXIT (just after where GS is >> restored in svm_vcpu_enter_exit()) and my guest is no longer crashing. > > Sorry, I mean my host is no longer cr

Re: [RFC][PATCH 5/9] mm/migrate: demote pages during reclaim

2020-08-20 Thread Dave Hansen
On 8/20/20 1:06 AM, Huang, Ying wrote: >> +/* Migrate pages selected for demotion */ >> +nr_reclaimed += demote_page_list(&ret_pages, &demote_pages, pgdat, sc); >> + >> pgactivate = stat->nr_activate[0] + stat->nr_activate[1]; >> >> mem_cgroup_uncharge_list(&free_pages); >> _ >

[RFC][PATCH 9/9] mm/migrate: new zone_reclaim_mode to enable reclaim migration

2020-08-18 Thread Dave Hansen
From: Dave Hansen Some method is obviously needed to enable reclaim-based migration. Just like traditional autonuma, there will be some workloads that will benefit like workloads with more "static" configurations where hot pages stay hot and cold pages stay cold. If pages come a

[RFC][PATCH 2/9] mm/numa: automatically generate node migration order

2020-08-18 Thread Dave Hansen
From: Dave Hansen When memory fills up on a node, memory contents can be automatically migrated to another node. The biggest problems are knowing when to migrate and to where the migration should be targeted. The most straightforward way to generate the "to where" list would be to

[RFC][PATCH 6/9] mm/vmscan: add page demotion counter

2020-08-18 Thread Dave Hansen
ges() ] Signed-off-by: Yang Shi Signed-off-by: Dave Hansen Cc: David Rientjes Cc: Huang Ying Cc: Dan Williams --- b/include/linux/vm_event_item.h |2 ++ b/mm/vmscan.c |6 ++ b/mm/vmstat.c |2 ++ 3 files changed, 10 insertions(+) diff -

[RFC][PATCH 5/9] mm/migrate: demote pages during reclaim

2020-08-18 Thread Dave Hansen
From: Dave Hansen This is mostly derived from a patch from Yang Shi: https://lore.kernel.org/linux-mm/1560468577-101178-10-git-send-email-yang@linux.alibaba.com/ Add code to the reclaim path (shrink_page_list()) to "demote" data to another NUMA node instead of disc

[RFC][PATCH 3/9] mm/migrate: update migration order during on hotplug events

2020-08-18 Thread Dave Hansen
From: Dave Hansen Reclaim-based migration is attempting to optimize data placement in memory based on the system topology. If the system changes, so must the migration ordering. The implementation here is pretty simple and entirely unoptimized. On any memory or CPU hotplug events, assume

[RFC][PATCH 8/9] mm/vmscan: never demote for memcg reclaim

2020-08-18 Thread Dave Hansen
From: Dave Hansen Global reclaim aims to reduce the amount of memory used on a given node or set of nodes. Migrating pages to another node serves this purpose. memcg reclaim is different. Its goal is to reduce the total memory consumption of the entire memcg, across all nodes. Migration

[RFC][PATCH 4/9] mm/migrate: make migrate_pages() return nr_succeeded

2020-08-18 Thread Dave Hansen
account how many pages are reclaimed (demoted) since page reclaim behavior depends on this. Add *nr_succeeded parameter to make migrate_pages() return how many pages are demoted successfully for all cases. Signed-off-by: Yang Shi Signed-off-by: Dave Hansen Cc: David Rientjes Cc: Huang Ying Cc

[RFC][PATCH 7/9] mm/vmscan: Consider anonymous pages without swap

2020-08-18 Thread Dave Hansen
-by: Dave Hansen Cc: Yang Shi Cc: David Rientjes Cc: Huang Ying Cc: Dan Williams -- Changes from Dave 06/2020: * rename reclaim_anon_pages()->can_reclaim_anon_pages() Note: Keith's Intel SoB is commented out because he is no longer at Intel and his @intel.com mail will bouncee ---

[RFC][PATCH 0/9] [v3] Migrate Pages in lieu of discard

2020-08-18 Thread Dave Hansen
e migrations? * Migration failures will result in pages being unreclaimable. Need to be able to fall back to normal reclaim. Cc: Yang Shi Cc: David Rientjes Cc: Huang Ying Cc: Dan Williams -- Dave Hansen (5): mm/numa: node demotion data structure and lookup mm/vmscan: At

[RFC][PATCH 1/9] mm/numa: node demotion data structure and lookup

2020-08-18 Thread Dave Hansen
From: Dave Hansen Prepare for the kernel to auto-migrate pages to other memory nodes with a user defined node migration table. This allows creating single migration target for each NUMA node to enable the kernel to do NUMA page migrations instead of simply reclaiming colder pages. A node with

Re: [RFC PATCH] mm: extend memfd with ability to create "secret" memory areas

2020-08-14 Thread Dave Hansen
On 8/14/20 10:46 AM, Andy Lutomirski wrote: > I'm a little unconvinced about the security benefits. As far as I > know, UC memory will not end up in cache by any means (unless > aliased), but it's going to be tough to do much with UC data with > anything resembling reasonable performance without d

[PATCH] [v2] Documentation: clarify driver licensing rules

2020-08-14 Thread Dave Hansen
From: Dave Hansen Greg has challenged some recent driver submitters on their license choices. He was correct to do so, as the choices in these instances did not always advance the aims of the submitters. But, this left submitters (and the folks who help them pick licenses) a bit confused

Re: [PATCH] Documentation: clarify driver licensing rules

2020-08-12 Thread Dave Hansen
On 8/12/20 1:23 AM, Greg KH wrote: > On Tue, Aug 11, 2020 at 10:17:48AM -0700, Dave Hansen wrote: >> But, this left submitters (and the folks who help them pick licenses) >> a bit confused. They have read things like >> Documentation/process/license-rules.rst which says:

Re: [PATCH V6 01/16] perf/core: Add PERF_SAMPLE_DATA_PAGE_SIZE

2020-08-12 Thread Dave Hansen
On 8/12/20 6:39 AM, Liang, Kan wrote: > I searched the vma_mmu_pagesize(). It seems that PowerPC is the only > one that defines a 'strong' function. In other words, the MMUPageSize > and KerelPageSize are the same for X86. However, it seems not true > for the above compound page cases. Is it a bug

[PATCH] Documentation: clarify driver licensing rules

2020-08-11 Thread Dave Hansen
Resend. Something appears to have eaten this on the way to LKML (at least) the last time. -- From: Dave Hansen Greg has challenged some recent driver submitters on their license choices. He was correct to do so, as the choices in these instances did not always advance the aims of the

Re: [PATCH V6 01/16] perf/core: Add PERF_SAMPLE_DATA_PAGE_SIZE

2020-08-10 Thread Dave Hansen
On 8/10/20 2:24 PM, Kan Liang wrote: > +static u64 __perf_get_page_size(struct mm_struct *mm, unsigned long addr) > +{ > + struct page *page; > + pgd_t *pgd; > + p4d_t *p4d; > + pud_t *pud; > + pmd_t *pmd; > + pte_t *pte; > + > + pgd = pgd_offset(mm, addr); > + if (p
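__perf_get_page_size() walks pgd, p4d, pud, pmd and pte, and the page size is determined by whichever level holds the leaf entry. The sketch below assumes x86-64 4-level paging: 4 KiB base pages, 9 index bits per level, and the P4D folded into the PGD. It shows which address bits index each level and the page size a leaf at that level implies.

```c
#include <stdint.h>

/* x86-64 4-level paging (assumed): 9 index bits per level, 4 KiB pages. */
enum pt_level { LVL_PGD, LVL_PUD, LVL_PMD, LVL_PTE };

static const unsigned int level_shift[] = {
	[LVL_PGD] = 39, [LVL_PUD] = 30, [LVL_PMD] = 21, [LVL_PTE] = 12,
};

/* Index into the table at `level` for virtual address addr. */
static unsigned int pt_index(uint64_t addr, enum pt_level level)
{
	return (unsigned int)((addr >> level_shift[level]) & 0x1ff);
}

/*
 * Size of the page mapped when the walk stops at a leaf on `level`:
 * 1 GiB at the PUD, 2 MiB at the PMD, 4 KiB at the PTE.
 * x86-64 has no PGD-level leaf pages, so that case returns 0.
 */
static uint64_t leaf_page_size(enum pt_level level)
{
	if (level == LVL_PGD)
		return 0;
	return 1ULL << level_shift[level];
}
```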

Re: [PATCH] x86/mm/64: Do not dereference non-present PGD entries

2020-08-10 Thread Dave Hansen
... adding Kirill On 8/7/20 1:40 AM, Joerg Roedel wrote: > + lvl = "p4d"; > + p4d = p4d_alloc(&init_mm, pgd, addr); > + if (!p4d) > + goto failed; > > + /* > + * With 5-level paging the P4D level is not folded. So t

Re: [PATCH v3] x86/cpu: Use SERIALIZE in sync_core() when available

2020-08-06 Thread Dave Hansen
On 8/6/20 4:04 PM, Ricardo Neri wrote: >* CPUID is the conventional way, but it's nasty: it doesn't >* exist on some 486-like CPUs, and it usually exits to a >* hypervisor. >* >* The SERIALIZE instruction is the most straightforward way to >* do this

Re: [PATCH v3] x86/cpu: Use SERIALIZE in sync_core() when available

2020-08-06 Thread Dave Hansen
On 8/6/20 12:25 PM, Ricardo Neri wrote: > static inline void sync_core(void) > { > /* > - * There are quite a few ways to do this. IRET-to-self is nice > + * Hardware can do this for us if SERIALIZE is available. Otherwise, > + * there are quite a few ways to do this. IRET-
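The patch makes sync_core() use SERIALIZE when the CPU enumerates it. From user space, you can check the same enumeration bit with the <cpuid.h> header shipped by GCC and Clang. SERIALIZE is reported in CPUID leaf 7, subleaf 0, EDX bit 14. This is a sketch only. Non-x86 builds simply report false, and the helper name is made up.

```c
#include <stdbool.h>
#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>
#endif

/* CPUID.(EAX=7,ECX=0):EDX[14] enumerates the SERIALIZE instruction. */
#define CPUID7_EDX_SERIALIZE (1u << 14)

static bool cpu_has_serialize(void)
{
#if defined(__x86_64__) || defined(__i386__)
	unsigned int eax, ebx, ecx, edx;

	/* Returns 0 if leaf 7 is beyond the maximum supported basic leaf. */
	if (!__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx))
		return false;
	return edx & CPUID7_EDX_SERIALIZE;
#else
	return false;	/* not an x86 CPU */
#endif
}
```

Inside the kernel, this check goes through the cpufeature machinery rather than raw CPUID.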

Re: [PATCH v6 12/12] x86/traps: Fix up invalid PASID

2020-08-03 Thread Dave Hansen
On 8/3/20 10:16 AM, Andy Lutomirski wrote: > - TILE: genuinely per-thread, but it's expensive so it's > lazy-loadable. But the lazy-load mechanism reuses #NM, and it's not > fully disambiguated from the other use of #NM. So it sort of works, > but it's gross. For those playing along at home, the

Re: [PATCH v6 12/12] x86/traps: Fix up invalid PASID

2020-08-03 Thread Dave Hansen
On 8/3/20 8:12 AM, Andy Lutomirski wrote: > I could easily be convinced that the PASID fixup is so trivial and so > obviously free of misfiring in a way that causes an infinite loop that > this code is fine. But I think we first need to answer the bigger > question of why we're doing a lazy fixup

Re: [PATCH v6 12/12] x86/traps: Fix up invalid PASID

2020-08-03 Thread Dave Hansen
On 7/31/20 4:34 PM, Andy Lutomirski wrote: >> Thomas suggested to provide a reason for the #GP caused by executing ENQCMD >> without a valid PASID value programmed. #GP error codes are 16 bits and all >> 16 bits are taken. Refer to SDM Vol 3, Chapter 16.13 for details. The other >> choice was to re

Re: [PATCH 0/3] Drop unused MAX_PHYSADDR_BITS

2020-07-24 Thread Dave Hansen
On 7/23/20 4:15 PM, Arvind Sankar wrote: > This #define is not used anywhere, and has the wrong value on x86_64. Yeah, it certainly is unused. > I tried digging into the history a bit, but it seems to have been unused > even in the initial merge of sparsemem in v2.6.13, when it was first > define

Re: [PATCH v10 00/26] Control-flow Enforcement: Shadow Stack

2020-07-23 Thread Dave Hansen
On 7/23/20 9:56 AM, Sean Christopherson wrote: > On Thu, Jul 23, 2020 at 09:41:37AM -0700, Dave Hansen wrote: >> On 7/23/20 9:25 AM, Sean Christopherson wrote: >>> How would people feel about taking the above two patches (02 and 03 in the >>> series) through

Re: [PATCH RFC V2 17/17] x86/entry: Preserve PKRS MSR across exceptions

2020-07-23 Thread Dave Hansen
On 7/23/20 10:08 AM, Andy Lutomirski wrote: > Suppose some kernel code (a syscall or kernel thread) changes PKRS > then takes a page fault. The page fault handler needs a fresh PKRS. > Then the page fault handler (say a VMA’s .fault handler) changes > PKRS. The we get an interrupt. The interrupt *
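Andy's scenario requires each exception entry to do three things: save the interrupted context's PKRS, start the handler with a fresh value, and restore the saved value on exit, so that nesting unwinds correctly. Below is a user-space simulation of that idea. The `pkrs` variable stands in for the MSR and `struct entry_state` for per-entry storage such as idtentry_state. The default value and all names are assumptions.

```c
#include <stdint.h>

#define PKRS_DEFAULT 0u			/* hypothetical "fresh" value for handlers */

static uint32_t pkrs = PKRS_DEFAULT;	/* stands in for the PKRS MSR */

/* Per-exception saved state; in the kernel this would live on the entry stack. */
struct entry_state {
	uint32_t saved_pkrs;
};

static void exception_enter(struct entry_state *state)
{
	state->saved_pkrs = pkrs;	/* remember the interrupted context's value */
	pkrs = PKRS_DEFAULT;		/* handler starts with a fresh PKRS */
}

static void exception_exit(const struct entry_state *state)
{
	pkrs = state->saved_pkrs;	/* interrupted context gets its value back */
}
```

Each nesting level keeps its own saved copy. A single saved slot per thread would be overwritten by the nested entry unless something else preserved the outer value.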

Re: [PATCH v10 00/26] Control-flow Enforcement: Shadow Stack

2020-07-23 Thread Dave Hansen
On 7/23/20 9:25 AM, Sean Christopherson wrote: > How would people feel about taking the above two patches (02 and 03 in the > series) through the KVM tree to enable KVM virtualization of CET before the > kernel itself gains CET support? I.e. add the MSR and feature bits, along > with the XSAVES co

Re: [PATCH RFC V2 17/17] x86/entry: Preserve PKRS MSR across exceptions

2020-07-23 Thread Dave Hansen
On 7/23/20 9:18 AM, Fenghua Yu wrote: > The PKRS MSR has been preserved in thread_info during kernel entry. We > don't need to preserve it in another place (i.e. idtentry_state). I'm missing how the PKRS MSR gets preserved in thread_info. Could you explain the mechanism by which this happens and

Re: Random shadow stack pointer corruption

2020-07-18 Thread Dave Hansen
On 7/18/20 11:24 AM, Yu-cheng Yu wrote: > On Sat, 2020-07-18 at 11:00 -0700, Andy Lutomirski wrote: >> On Sat, Jul 18, 2020 at 10:58 AM Yu-cheng Yu wrote: >>> Hi, >>> >>> My shadow stack tests start to have random shadow stack pointer corruption >>> after >>> v5.7 (excluding). The symptom looks

Re: [PATCH RFC V2 02/17] x86/fpu: Refactor arch_set_user_pkey_access() for PKS support

2020-07-17 Thread Dave Hansen
On 7/17/20 1:54 AM, Peter Zijlstra wrote: > This is unbelievable junk... Ouch! This is from the original user pkeys implementation. > How about something like: > > u32 update_pkey_reg(u32 pk_reg, int pkey, unsigned int flags) > { > int pkey_shift = pkey * PKR_BITS_PER_PKEY; > > pk_
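Peter's suggested helper computes `pkey_shift = pkey * PKR_BITS_PER_PKEY` and then updates that key's bits. The sketch below fills out the idea. It assumes the PKRU/PKRS layout of two bits per key, with access-disable in the low bit and write-disable in the high bit. The flag values match the Linux uapi PKEY_DISABLE_ACCESS (0x1) and PKEY_DISABLE_WRITE (0x2).

```c
#include <stdint.h>

#define PKR_BITS_PER_PKEY 2
#define PKR_AD_BIT        0x1u	/* access disable */
#define PKR_WD_BIT        0x2u	/* write disable */

/* Flag values as in the Linux pkey_alloc()/pkey_mprotect() uapi. */
#define PKEY_DISABLE_ACCESS 0x1u
#define PKEY_DISABLE_WRITE  0x2u

/* Return pk_reg with pkey's two bits replaced according to flags. */
static uint32_t update_pkey_reg(uint32_t pk_reg, int pkey, unsigned int flags)
{
	int pkey_shift = pkey * PKR_BITS_PER_PKEY;
	uint32_t bits = 0;

	if (flags & PKEY_DISABLE_ACCESS)
		bits |= PKR_AD_BIT;
	if (flags & PKEY_DISABLE_WRITE)
		bits |= PKR_WD_BIT;

	pk_reg &= ~((PKR_AD_BIT | PKR_WD_BIT) << pkey_shift);	/* clear old bits */
	return pk_reg | (bits << pkey_shift);			/* set new bits */
}
```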

Re: [PATCH] x86/bugs/multihit: Fix mitigation reporting when KVM is not in use

2020-07-15 Thread Dave Hansen
On 7/14/20 5:51 PM, Sean Christopherson wrote: > To do the above table, KVM will also need to update > itlb_multihit_kvm_mitigation > when it is unloaded, which seems rather silly. That's partly why I suggested > keying off CR4.VMXE as it doesn't require poking directly into KVM. E.g. the > enti

Re: [PATCH] x86/bugs/multihit: Fix mitigation reporting when KVM is not in use

2020-07-14 Thread Dave Hansen
On 7/14/20 2:04 PM, Pawan Gupta wrote: >> I see three inputs and four possible states (sorry for the ugly table, >> it was this or a spreadsheet :): >> >> X86_FEATURE_VMX CONFIG_KVM_*hpage split ResultReason >> N x xNot Affected No VMX

Re: [PATCH] x86/bugs/multihit: Fix mitigation reporting when KVM is not in use

2020-07-14 Thread Dave Hansen
On 7/14/20 12:17 PM, Pawan Gupta wrote: > On Tue, Jul 14, 2020 at 07:57:53AM -0700, Dave Hansen wrote: >> Let's stick to things which are at least static per reboot. Checking >> for X86_FEATURE_VMX or even CONFIG_KVM_INTEL seems like a good stopping >> point. "

Re: [RFC PATCH 12/15] kmap: Add stray write protection for device pages

2020-07-14 Thread Dave Hansen
On 7/14/20 12:29 PM, Peter Zijlstra wrote: > On Tue, Jul 14, 2020 at 12:06:16PM -0700, Ira Weiny wrote: >> On Tue, Jul 14, 2020 at 10:44:51AM +0200, Peter Zijlstra wrote: >>> So, if I followed along correctly, you're proposing to do a WRMSR per >>> k{,un}map{_atomic}(), sounds like excellent perfor

Re: [RFC PATCH 04/15] x86/pks: Preserve the PKRS MSR on context switch

2020-07-14 Thread Dave Hansen
On 7/14/20 11:53 AM, Ira Weiny wrote: >>> The PKRS MSR is defined as a per-core register. Just to be clear, PKRS is a per-logical-processor register, just like PKRU. The "per-core" thing here is a typo.

Re: [PATCH] x86/bugs/multihit: Fix mitigation reporting when KVM is not in use

2020-07-14 Thread Dave Hansen
On 7/13/20 6:45 PM, Sean Christopherson wrote: > This is all kinds of backwards. Virtualization being disabled in hardware > is very, very different than KVM not being loaded. One requires at the > very least a kernel reboot to change, the other does not. That's a very good point. It's a pretty

Re: [tip: perf/core] x86/cpufeatures: Add Architectural LBRs feature bit

2020-07-09 Thread Dave Hansen
On 7/8/20 2:51 AM, tip-bot2 for Kan Liang wrote: > diff --git a/arch/x86/include/asm/cpufeatures.h > b/arch/x86/include/asm/cpufeatures.h > index 02dabc9..72ba4c5 100644 > --- a/arch/x86/include/asm/cpufeatures.h > +++ b/arch/x86/include/asm/cpufeatures.h > @@ -366,6 +366,7 @@ > #define X86_FEATU

Re: [PATCH 2/4] KVM: x86: Introduce paravirt feature CR0/CR4 pinning

2020-07-09 Thread Dave Hansen
On 7/9/20 9:07 AM, Andy Lutomirski wrote: > On Thu, Jul 9, 2020 at 8:56 AM Dave Hansen wrote: >> On 7/9/20 8:44 AM, Andersen, John wrote: >>> Bits which are allowed to be pinned default to WP for CR0 and SMEP, >>> SMAP, and UMIP for CR4. >> I

Re: [PATCH 2/4] KVM: x86: Introduce paravirt feature CR0/CR4 pinning

2020-07-09 Thread Dave Hansen
On 7/9/20 8:44 AM, Andersen, John wrote: > > Bits which are allowed to be pinned default to WP for CR0 and SMEP, > SMAP, and UMIP for CR4. I think it also makes sense to have FSGSBASE in this set. I know it hasn't been tested, but I think we should do the legwork to test it. If
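Pinning, as the quoted patch describes it, lets the guest lock bits such as CR0.WP and CR4.SMEP/SMAP/UMIP, with FSGSBASE suggested as an addition. Below is a sketch of the check a hypervisor could apply on a control-register write. It assumes "pinned" means "may not be cleared once set". The bit positions are the architectural ones, and the helper name is hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Architectural bit positions. */
#define X86_CR0_WP       (1ULL << 16)
#define X86_CR4_UMIP     (1ULL << 11)
#define X86_CR4_FSGSBASE (1ULL << 16)
#define X86_CR4_SMEP     (1ULL << 20)
#define X86_CR4_SMAP     (1ULL << 21)

/*
 * Allow a control-register write only if it does not clear a pinned bit
 * that is currently set. Setting bits, or changing unpinned bits, passes.
 */
static bool cr_write_allowed(uint64_t old_val, uint64_t new_val, uint64_t pinned)
{
	uint64_t cleared = old_val & ~new_val;	/* bits going 1 -> 0 */

	return !(cleared & pinned);
}
```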

Re: [PATCH 2/4] KVM: x86: Introduce paravirt feature CR0/CR4 pinning

2020-07-07 Thread Dave Hansen
On 7/7/20 2:12 PM, Sean Christopherson wrote: Let's say Intel loses its marbles and adds a CR4 bit that lets userspace write to kernel memory. Linux won't set it, but an attacker would go after it, first thing. > That's an orthogonal to pinning. KVM never lets the guest set CR4 bit

Re: [PATCH v3 6/6] dmaengine: idxd: add ABI documentation for shared wq

2020-07-06 Thread Dave Hansen
On 7/6/20 11:22 AM, Dave Jiang wrote: > +What:/sys/bus/dsa/devices/dsa/pasid_enabled > +Date:Jul 5, 2020 > +KernelVersion: 5.9.0 > +Contact: dmaeng...@vger.kernel.org > +Description: To indicate if PASID (process address space identifier) is > +

Re: [PATCH 1/3] mm/vmscan: restore zone_reclaim_mode ABI

2020-07-02 Thread Dave Hansen
On 7/2/20 4:28 AM, Huang, Ying wrote: >> But, when the bit was removed (bit 0) the _other_ bit locations also >> got changed. That's not OK because the bit values are documented to >> mean one specific thing and users surely rely on them meaning that one >> thing and not changing from kernel to ke

Re: [PATCH 3/3] mm/vmscan: replace implicit RECLAIM_ZONE checks with explicit checks

2020-07-01 Thread Dave Hansen
On 7/1/20 1:04 PM, Ben Widawsky wrote: >> +static inline bool node_reclaim_enabled(void) >> +{ >> +/* Is any node_reclaim_mode bit set? */ >> +return node_reclaim_mode & (RECLAIM_ZONE|RECLAIM_WRITE|RECLAIM_UNMAP); >> +} >> + >> extern void check_move_unevictable_pages(struct pagevec *pvec)

Re: [RFC][PATCH 5/8] mm/numa: automatically generate node migration order

2020-07-01 Thread Dave Hansen
On 6/30/20 1:22 AM, Huang, Ying wrote: >> +/* >> + * To avoid cycles in the migration "graph", ensure >> + * that migration sources are not future targets by >> + * setting them in 'used_targets'. >> + * >> + * But, do this only once per pass so that multiple >> + * sour

Re: [RFC][PATCH 3/8] mm/vmscan: Attempt to migrate page in lieu of discard

2020-07-01 Thread Dave Hansen
On 7/1/20 1:54 AM, Huang, Ying wrote: > Why can not we just bind the memory of the application to node 0, 2, 3 > via mbind() or cpuset.mems? Then the application can allocate memory > directly from PMEM. And if we bind the memory of the application via > mbind() to node 0, we can only allocate me

Re: [RFC][PATCH 3/8] mm/vmscan: Attempt to migrate page in lieu of discard

2020-07-01 Thread Dave Hansen
On 6/30/20 5:47 PM, David Rientjes wrote: > On Mon, 29 Jun 2020, Dave Hansen wrote: >> From: Dave Hansen >> >> If a memory node has a preferred migration path to demote cold pages, >> attempt to move those inactive pages to that migration node before >> rec

Re: [RFC][PATCH 8/8] mm/numa: new reclaim mode to enable reclaim-based migration

2020-07-01 Thread Dave Hansen
On 6/30/20 10:50 AM, Yang Shi wrote: > So, I'm supposed you need check if node_reclaim is enabled before doing > migration in shrink_page_list() and also need make node reclaim to adopt > the new mode. > > Please refer to > https://lore.kernel.org/linux-mm/1560468577-101178-6-git-send-email-yang..

Re: [PATCH 2/3] mm/vmscan: move RECLAIM* bits to uapi header

2020-07-01 Thread Dave Hansen
On 7/1/20 8:46 AM, Ben Widawsky wrote: >> +/* >> + * These bit locations are exposed in the vm.zone_reclaim_mode sysctl >> + * ABI. New bits are OK, but existing bits can never change. >> + */ >> +#define RECLAIM_ZONE (1<<0)/* Run shrink_inactive_list on the zone >> */ >> +#define RECLAI

Re: [PATCH v4 23/26] mm/x86: Use general page fault accounting

2020-07-01 Thread Dave Hansen
On 6/30/20 1:46 PM, Peter Xu wrote: > Use the general page fault accounting by passing regs into handle_mm_fault(). ... > - /* > - * Major/minor page fault accounting. If any of the events > - * returned VM_FAULT_MAJOR, we account it as a major fault. > - */ > - if (major) {

[PATCH 3/3] mm/vmscan: replace implicit RECLAIM_ZONE checks with explicit checks

2020-07-01 Thread Dave Hansen
From: Dave Hansen RECLAIM_ZONE was assumed to be unused because it was never explicitly used in the kernel. However, there were a number of places where it was checked implicitly by checking 'node_reclaim_mode' for a zero value. These zero checks are not great because it is not ob

[PATCH 0/3] [v2] Repair and clean up vm.zone_reclaim_mode sysctl ABI

2020-07-01 Thread Dave Hansen
A previous cleanup accidentally changed the vm.zone_reclaim_mode ABI. This series restores the ABI and then reorganizes the code to make the ABI more obvious. Since the single-patch v1[1], I've: * Restored the RECLAIM_ZONE naming, comment and Documentation now that the implicit checks for it

[PATCH 2/3] mm/vmscan: move RECLAIM* bits to uapi header

2020-07-01 Thread Dave Hansen
From: Dave Hansen It is currently not obvious that the RECLAIM_* bits are part of the uapi since they are defined in vmscan.c. Move them to a uapi header to make it obvious. This should have no functional impact. Signed-off-by: Dave Hansen Cc: Ben Widawsky Cc: Alex Shi Cc: Daniel Wagner

[PATCH 1/3] mm/vmscan: restore zone_reclaim_mode ABI

2020-07-01 Thread Dave Hansen
From: Dave Hansen I went to go add a new RECLAIM_* mode for the zone_reclaim_mode sysctl. Like a good kernel developer, I also went to go update the documentation. I noticed that the bits in the documentation didn't match the bits in the #defines. The VM never explicitly check

Re: [PATCH] mm/vmscan: restore zone_reclaim_mode ABI

2020-07-01 Thread Dave Hansen
On 6/30/20 7:47 PM, Andrew Morton wrote: >> Oh, that's a very good point. There are a couple of those around. Let >> me circle back and update the documentation and the variable name. I'll >> send out another version. > Was the omission of cc:stable deliberate? Nope, it was an accidental sta...

Re: [RFC][PATCH 3/8] mm/vmscan: Attempt to migrate page in lieu of discard

2020-07-01 Thread Dave Hansen
On 6/30/20 10:41 PM, David Rientjes wrote: > Maybe the strongest advantage of the node abstraction is the ability to > use autonuma and migrate_pages()/move_pages() API for moving pages > explicitly? Mempolicies could be used for migration to "top-tier" memory, > i.e. ZONE_NORMAL or ZONE_MOVABL
