Re: [PATCH v1 1/1] x86/tdx: Handle MWAIT, MONITOR and WBINVD

2021-03-30 Thread Dave Hansen
On 3/30/21 8:00 AM, Andi Kleen wrote: >>> + /* MWAIT is not supported in TDX platform, so suppress it */ >>> + setup_clear_cpu_cap(X86_FEATURE_MWAIT); >> In fact, MWAIT bit returned by CPUID instruction is zero for TD guest. This >> is enforced by SEAM module. > Good point. >> Do we still need

Re: [PATCH v3 1/1] x86/tdx: Handle MWAIT, MONITOR and WBINVD

2021-03-29 Thread Dave Hansen
On 3/29/21 4:16 PM, Kuppuswamy Sathyanarayanan wrote: > In non-root TDX guest mode, MWAIT, MONITOR and WBINVD instructions > are not supported. So handle #VE due to these instructions > appropriately. This misses a key detail: "are not supported" ... and other patches have prevented a gue

Re: [PATCH v2 1/1] x86/tdx: Handle MWAIT, MONITOR and WBINVD

2021-03-29 Thread Dave Hansen
On 3/29/21 3:09 PM, Kuppuswamy, Sathyanarayanan wrote: > +    case EXIT_REASON_MWAIT_INSTRUCTION: > +    /* MWAIT is suppressed, not supposed to reach here. */ > +    WARN(1, "MWAIT unexpected #VE Exception\n"); > +    return -EFAULT; How is MWAIT "suppressed"?

Re: [PATCH v2 1/1] x86/tdx: Handle MWAIT, MONITOR and WBINVD

2021-03-29 Thread Dave Hansen
On 3/29/21 2:55 PM, Kuppuswamy, Sathyanarayanan wrote: >> >> MONITOR is a privileged instruction, right?  So we can only end up in >> here if the kernel screws up and isn't reading CPUID correctly, right? >> >> That doesn't seem to me like something we want to suppress.  This needs >> a warning, at

Re: I915 CI-run with kfence enabled, issues found

2021-03-29 Thread Dave Hansen
On 3/29/21 10:45 AM, Marco Elver wrote: > On Mon, 29 Mar 2021 at 19:32, Dave Hansen wrote: > Doing it to all CPUs is too expensive, and we can tolerate this being > approximate (nothing bad will happen, KFENCE might just miss a bug and > that's ok). ... >> BTW

Re: I915 CI-run with kfence enabled, issues found

2021-03-29 Thread Dave Hansen
On 3/29/21 9:40 AM, Marco Elver wrote: > It looks like the code path from flush_tlb_one_kernel() to > invalidate_user_asid()'s this_cpu_ptr() has several feature checks, so > probably some feature difference between systems where it triggers and > it doesn't. > > As far as I'm aware, there is no r

Re: [PATCH v2 1/1] x86/tdx: Handle MWAIT, MONITOR and WBINVD

2021-03-29 Thread Dave Hansen
On 3/27/21 3:54 PM, Kuppuswamy Sathyanarayanan wrote: > + /* > + * Per Guest-Host-Communication Interface (GHCI) for Intel Trust > + * Domain Extensions (Intel TDX) specification, sec 2.4, > + * some instructions that unconditionally cause #VE (such as WBINVD, > + * MONITOR,

Re: Candidate Linux ABI for Intel AMX and hypothetical new related features

2021-03-29 Thread Dave Hansen
On 3/27/21 5:53 PM, Thomas Gleixner wrote: > Making it solely depend on XCR0 and fault if not requested upfront is > bringing you into the situation that you broke 'legacy code' which > relied on the CPUID bit and that worked until now which gets you > in the no-regression trap. Trying to find the

Re: [PATCH v3 05/25] x86/sgx: Introduce virtual EPC for use by KVM guests

2021-03-26 Thread Dave Hansen
On 3/26/21 8:29 AM, Borislav Petkov wrote: > On Fri, Mar 26, 2021 at 08:17:38AM -0700, Dave Hansen wrote: >> We're working on a cgroup controller just for enclave pages that will >> apply to guest use and bare metal. It would have been nice to have up >> front, but

Re: [PATCH v3 05/25] x86/sgx: Introduce virtual EPC for use by KVM guests

2021-03-26 Thread Dave Hansen
On 3/26/21 8:03 AM, Borislav Petkov wrote: > Let's say all guests start using enclaves and baremetal cannot start any > new ones anymore due to no more memory. Are we ok with that? Yes, for now. > What if baremetal creates a big fat enclave and starves guests all of a > sudden. Are we ok with tha

Re: [PATCH v4 22/22] x86/fpu/xstate: Introduce boot-parameters to control state component support

2021-03-25 Thread Dave Hansen
On 3/25/21 3:59 PM, Len Brown wrote: > We call AMX a "simple state feature" -- it actually requires NO KERNEL > ENABLING > above the generic state save/restore to fully support userspace AMX > applications. > > While not all ISA extensions can be simple state features, we do expect > future featu

Re: [RFC Part2 PATCH 07/30] mm: add support to split the large THP based on RMP violation

2021-03-25 Thread Dave Hansen
On 3/25/21 8:24 AM, Brijesh Singh wrote: > On 3/25/21 9:48 AM, Dave Hansen wrote: >> On 3/24/21 10:04 AM, Brijesh Singh wrote: >>> When SEV-SNP is enabled globally in the system, a write from the hypervisor >>> can raise an RMP violation. We can resolve the RMP vio

Re: [RFC Part2 PATCH 01/30] x86: Add the host SEV-SNP initialization support

2021-03-25 Thread Dave Hansen
On 3/25/21 8:31 AM, Brijesh Singh wrote: > > On 3/25/21 9:58 AM, Dave Hansen wrote: >>> +static int __init mem_encrypt_snp_init(void) >>> +{ >>> + if (!boot_cpu_has(X86_FEATURE_SEV_SNP)) >>> + return 1; >>> + >>> +

Re: [RFC Part2 PATCH 04/30] x86/mm: split the physmap when adding the page in RMP table

2021-03-25 Thread Dave Hansen
On 3/24/21 10:04 AM, Brijesh Singh wrote: > The splitting of the physmap is a temporary solution until we work to > improve the kernel page fault handler to split the pages on demand. > One of the disadvantages of splitting is that eventually, we will end up > breaking down the entire physmap unless w

Re: [RFC Part2 PATCH 01/30] x86: Add the host SEV-SNP initialization support

2021-03-25 Thread Dave Hansen
> +static int __init mem_encrypt_snp_init(void) > +{ > + if (!boot_cpu_has(X86_FEATURE_SEV_SNP)) > + return 1; > + > + if (rmptable_init()) { > + setup_clear_cpu_cap(X86_FEATURE_SEV_SNP); > + return 1; > + } > + > + static_branch_enable(&snp_enabl

Re: [RFC Part2 PATCH 07/30] mm: add support to split the large THP based on RMP violation

2021-03-25 Thread Dave Hansen
On 3/24/21 10:04 AM, Brijesh Singh wrote: > When SEV-SNP is enabled globally in the system, a write from the hypervisor > can raise an RMP violation. We can resolve the RMP violation by splitting > the virtual address to a lower page level. > > e.g > - guest made a page shared in the RMP entry so

Re: [RFC Part2 PATCH 05/30] x86: define RMP violation #PF error code

2021-03-25 Thread Dave Hansen
On 3/25/21 7:32 AM, Brijesh Singh wrote: >>> enum x86_pf_error_code { >>> X86_PF_PROT = 1 << 0, >>> @@ -21,6 +22,7 @@ enum x86_pf_error_code { >>> X86_PF_INSTR= 1 << 4, >>> X86_PF_PK = 1 << 5, >>> X86_PF_SGX =

Re: [RFC Part2 PATCH 07/30] mm: add support to split the large THP based on RMP violation

2021-03-25 Thread Dave Hansen
On 3/24/21 10:04 AM, Brijesh Singh wrote: > @@ -1377,6 +1442,22 @@ void do_user_addr_fault(struct pt_regs *regs, > if (hw_error_code & X86_PF_INSTR) > flags |= FAULT_FLAG_INSTRUCTION; > > + /* > + * If its an RMP violation, see if we can resolve it. > + */ > +

Re: [PATCH v4 14/22] x86/fpu/xstate: Expand the xstate buffer on the first use of dynamic user state

2021-03-24 Thread Dave Hansen
On 3/24/21 2:42 PM, Andy Lutomirski wrote: 3. user space always uses fully uncompacted XSAVE buffers. >>> There is no reason we have to do this for new states. Arguably we >>> shouldn’t for AMX to avoid yet another altstack explosion. >> The thing that's worried me is that the list of OS-

Re: [PATCH v4 14/22] x86/fpu/xstate: Expand the xstate buffer on the first use of dynamic user state

2021-03-24 Thread Dave Hansen
On 3/24/21 2:26 PM, Andy Lutomirski wrote: >> 3. user space always uses fully uncompacted XSAVE buffers. >> > There is no reason we have to do this for new states. Arguably we > shouldn’t for AMX to avoid yet another altstack explosion. The thing that's worried me is that the list of OS-enabled s

Re: [RFC PATCH 1/2] mm,drm/ttm: Block fast GUP to TTM huge pages

2021-03-24 Thread Dave Hansen
On 3/24/21 1:22 PM, Thomas Hellström (Intel) wrote: >> We also have not been careful at *all* about how _PAGE_BIT_SOFTW* are >> used.  It's quite possible we can encode another use even in the >> existing bits. >> >> Personally, I'd just try: >> >> #define _PAGE_BIT_SOFTW5    57  /* availab

Re: [RFC Part2 PATCH 05/30] x86: define RMP violation #PF error code

2021-03-24 Thread Dave Hansen
> diff --git a/arch/x86/include/asm/trap_pf.h b/arch/x86/include/asm/trap_pf.h > index 10b1de500ab1..107f9d947e8d 100644 > --- a/arch/x86/include/asm/trap_pf.h > +++ b/arch/x86/include/asm/trap_pf.h > @@ -12,6 +12,7 @@ > * bit 4 ==1: fault was an instruction > f

Re: [RFC PATCH 1/2] mm,drm/ttm: Block fast GUP to TTM huge pages

2021-03-24 Thread Dave Hansen
On 3/24/21 3:05 AM, Thomas Hellström (Intel) wrote: > Yes, I agree. Seems like the special (SW1) is available also for huge > page table entries on x86 AFAICT, although just not implemented. > Otherwise the SW bits appear completely used up. Although the _PAGE_BIT_SOFTW* bits are used up, there's

Re: [PATCH v4 14/22] x86/fpu/xstate: Expand the xstate buffer on the first use of dynamic user state

2021-03-24 Thread Dave Hansen
On 3/23/21 2:52 PM, Bae, Chang Seok wrote: >> "System software may disable use of Intel AMX by clearing XCR0[18:17], by >> clearing CR4.OSXSAVE, or by setting IA32_XFD[18]. It is recommended that >> system software initialize AMX state (e.g., by executing TILERELEASE) >> before doing so. Thi

[tip: x86/sgx] selftests/sgx: Improve error detection and messages

2021-03-19 Thread tip-bot2 for Dave Hansen
The following commit has been merged into the x86/sgx branch of tip: Commit-ID: 4284f7acb78bfb0e0c26a2b78e2b2c3d68fccd6f Gitweb: https://git.kernel.org/tip/4284f7acb78bfb0e0c26a2b78e2b2c3d68fccd6f Author: Dave Hansen AuthorDate: Thu, 18 Mar 2021 12:43:01 -07:00 Committer

Re: [PATCH v1 1/1] x86/tdx: Add tdcall() and tdvmcall() helper functions

2021-03-19 Thread Dave Hansen
On 3/19/21 10:42 AM, Kuppuswamy, Sathyanarayanan wrote: >>> @@ -4,6 +4,58 @@ >>>   #include >>>   #include >>>   +void tdcall(u64 leafid, struct tdcall_regs *regs) >>> +{ >>> +    asm volatile( >>> +    /* RAX = leafid (TDCALL LEAF ID) */ >>> +    "  movq %0, %%rax;" >>> +

Re: [PATCH] x86/sgx: Avoid returning NULL in __sgx_alloc_epc_page()

2021-03-19 Thread Dave Hansen
On 3/19/21 8:52 AM, Borislav Petkov wrote: > On Fri, Mar 19, 2021 at 05:22:56PM +0200, Jarkko Sakkinen wrote: >> I did misread it for the first time. >> >> So let's sanity: you *are* going to squash the patches together because >> that way it's factors easier to backport the whole thing? >> >> Is t

Re: [tip: x86/sgx] selftests/sgx: Improve error detection and messages

2021-03-19 Thread Dave Hansen
On 3/19/21 7:58 AM, Borislav Petkov wrote: > On Fri, Mar 19, 2021 at 11:38:44AM -, tip-bot2 for Dave Hansen wrote: >> tools/testing/selftests/sgx/load.c | 66 ++--- >> tools/testing/selftests/sgx/main.c | 2 +- >> 2 files changed, 53 insert

[tip: x86/sgx] selftests/sgx: Improve error detection and messages

2021-03-19 Thread tip-bot2 for Dave Hansen
The following commit has been merged into the x86/sgx branch of tip: Commit-ID: 79713a1fa1b9cd9d650b1ff0657ddbadc5dbbeaa Gitweb: https://git.kernel.org/tip/79713a1fa1b9cd9d650b1ff0657ddbadc5dbbeaa Author: Dave Hansen AuthorDate: Thu, 18 Mar 2021 12:43:01 -07:00 Committer

[tip: x86/sgx] x86/sgx: Fix uninitialized 'nid' variable

2021-03-19 Thread tip-bot2 for Dave Hansen
The following commit has been merged into the x86/sgx branch of tip: Commit-ID: 262e88b63f55e3d2bacdf629874a0af486775572 Gitweb: https://git.kernel.org/tip/262e88b63f55e3d2bacdf629874a0af486775572 Author: Dave Hansen AuthorDate: Thu, 18 Mar 2021 14:49:33 -07:00 Committer

[PATCH] x86/sgx: fix uninitialized 'nid' variable

2021-03-18 Thread Dave Hansen
s. My gcc does not detect it. Fixes: 5b8719504e3a ("x86/sgx: Add a basic NUMA allocation scheme to sgx_alloc_epc_page()") Reported-by: kernel test robot Signed-off-by: Dave Hansen Cc: Jarkko Sakkinen Cc: Borislav Petkov Cc: x...@kernel.org Cc: linux-...@vger.kernel.org --- arc

[PATCH] selftests/sgx: improve error detection and messages

2021-03-18 Thread Dave Hansen
From: Dave Hansen The SGX device file (/dev/sgx_enclave) is unusual in that it requires execute permissions. It has to be both "chmod +x" *and* be on a filesystem without 'noexec'. In the future, udev and systemd should get updates to set up systems automatically. Bu
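For anyone wanting to poke at the two conditions described above by hand, here is a minimal userspace sketch. It is not the code from the selftest patch; the helper name probe_exec_mapping() is made up for illustration. It only assumes what mmap(2) documents: asking for PROT_EXEC on a file that lives on a 'noexec' mount fails with EPERM.

```c
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Hypothetical helper (not from the selftest): probe whether 'path' can be
 * mapped executable. Returns 0 on success, -1 on failure, and prints a hint
 * about the likely cause to stderr.
 */
int probe_exec_mapping(const char *path)
{
	size_t len = (size_t)sysconf(_SC_PAGESIZE);
	struct stat sb;
	void *ptr;
	int fd;

	fd = open(path, O_RDONLY);
	if (fd < 0) {
		fprintf(stderr, "cannot open %s: %s\n", path, strerror(errno));
		return -1;
	}

	/* Check 1: does the file mode carry any execute bit at all? */
	if (fstat(fd, &sb) == 0 &&
	    !(sb.st_mode & (S_IXUSR | S_IXGRP | S_IXOTH)))
		fprintf(stderr, "%s has no execute bits; try 'chmod +x'\n", path);

	/* Check 2: an executable mapping is refused (EPERM) on a 'noexec' mount. */
	ptr = mmap(NULL, len, PROT_READ | PROT_EXEC, MAP_SHARED, fd, 0);
	if (ptr == MAP_FAILED) {
		if (errno == EPERM)
			fprintf(stderr, "%s: is the filesystem mounted 'noexec'?\n", path);
		else
			fprintf(stderr, "mmap of %s failed: %s\n", path, strerror(errno));
		close(fd);
		return -1;
	}

	munmap(ptr, len);
	close(fd);
	return 0;
}
```

Calling probe_exec_mapping("/dev/sgx_enclave") on a misconfigured system should at least say which of the two knobs to look at; the exact messages are illustrative only.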

Re: [PATCH 1/2] x86/sgx: Replace section->init_laundry_list with sgx_dirty_page_list

2021-03-18 Thread Dave Hansen
On 3/18/21 10:40 AM, Borislav Petkov wrote: > So both patches look ok to me but the sgx test case fails on -rc3 with and > without those patches on my box: > > ./test_sgx > 0x 0x2000 0x03 > 0x2000 0x1000 0x05 > 0x3000 0x3

Re: [PATCH v2] mm: Move mem_init_print_info() into mm_init()

2021-03-17 Thread Dave Hansen
On 3/16/21 6:52 PM, Kefeng Wang wrote: > mem_init_print_info() is called in mem_init() on each architecture, > and pass NULL argument, so using void argument and move it into mm_init(). > > Acked-by: Dave Hansen It's not a big deal but you might want to say something like

Re: [PATCH] mm: Move mem_init_print_info() into mm_init()

2021-03-16 Thread Dave Hansen
em_init_print_info(), so this patch will change the location of the mem_init_print_info(), but I think it's actually for the better, since it will be pushed later in boot. As long as the x86 pieces stay the same: Acked-by: Dave Hansen

Re: [PATCH v1 00/14] Multigenerational LRU

2021-03-16 Thread Dave Hansen
On 3/16/21 1:30 PM, Yu Zhao wrote: > On Tue, Mar 16, 2021 at 07:50:23AM -0700, Dave Hansen wrote: >> I think it would also be very worthwhile to include some research in >> this series about why the kernel moved away from page table scanning. >> What has changed? Are

Re: [PATCH v23 6/9] x86/entry: Introduce ENDBR macro

2021-03-16 Thread Dave Hansen
On 3/16/21 10:44 AM, Yu, Yu-cheng wrote: >> Also, Boris asked for two *different* macros for 32 and 64-bit: >> >> https://lore.kernel.org/linux-api/20210310231731.gk23...@zn.tnic/ >> >> Could you do that in the next version, please? > > Yes, we can do two macros, probably in arch/x86/include/asm/v

Re: [PATCH v23 6/9] x86/entry: Introduce ENDBR macro

2021-03-16 Thread Dave Hansen
On 3/16/21 10:12 AM, Yu, Yu-cheng wrote: > On 3/16/2021 8:49 AM, Dave Hansen wrote: ... >> Is "#ifdef __i386__" the right thing to use here?  I guess ENDBR only >> ends up getting used in the VDSO, but there's a lot of >> non-userspace-exposed stuff in call

Re: [PATCH v23 6/9] x86/entry: Introduce ENDBR macro

2021-03-16 Thread Dave Hansen
On 3/16/21 8:13 AM, Yu-cheng Yu wrote: > --- a/arch/x86/entry/calling.h > +++ b/arch/x86/entry/calling.h > @@ -392,3 +392,21 @@ For 32-bit we have the following conventions - kernel is > built with > .endm > > #endif /* CONFIG_SMP */ > +/* > + * ENDBR is an instruction for the Indirect Branch

Re: [PATCH v1 00/14] Multigenerational LRU

2021-03-16 Thread Dave Hansen
On 3/15/21 7:24 PM, Yu Zhao wrote: > On Mon, Mar 15, 2021 at 11:00:06AM -0700, Dave Hansen wrote: >> How bad does this scanning get in the worst case if there's a lot of >> sharing? > > Actually the improvement is larger when there is more sharing, i.e., > higher

Re: [PATCH v4 2/3] x86/sgx: Replace section local dirty page lists with a global list

2021-03-15 Thread Dave Hansen
On 3/15/21 12:14 PM, Jarkko Sakkinen wrote: > On Mon, Mar 15, 2021 at 09:03:21AM -0700, Dave Hansen wrote: >> On 3/13/21 8:01 AM, Jarkko Sakkinen wrote: >>> Reset initialized EPC pages in sgx_dirty_page_list to uninitialized state, >>> and free them using sgx_free_epc_pa

Re: [PATCH v1 00/14] Multigenerational LRU

2021-03-15 Thread Dave Hansen
On 3/12/21 11:57 PM, Yu Zhao wrote: > Background > == > DRAM is a major factor in total cost of ownership, and improving > memory overcommit brings a high return on investment. Over the past > decade of research and experimentation in memory overcommit, we > observed a distinct trend across

Re: [PATCH v4 3/3] x86/sgx: Add a basic NUMA allocation scheme to sgx_alloc_epc_page()

2021-03-15 Thread Dave Hansen
On 3/13/21 8:01 AM, Jarkko Sakkinen wrote: > Background > == > > EPC section is covered by one or more SRAT entries that are associated with > one and only one PXM (NUMA node). The motivation behind this patch is to > provide basic elements of building allocation scheme based on this premi

Re: [PATCH v4 2/3] x86/sgx: Replace section local dirty page lists with a global list

2021-03-15 Thread Dave Hansen
On 3/13/21 8:01 AM, Jarkko Sakkinen wrote: > Reset initialized EPC pages in sgx_dirty_page_list to uninitialized state, > and free them using sgx_free_epc_page(). Do two passes, as for SECS pages > the first round can fail, if all child pages have not yet been removed. > The driver puts all pages o
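To illustrate why the quoted changelog needs two passes, here is a toy userspace model. It is only a sketch under stated assumptions: every name (toy_page, toy_remove(), toy_sanitize_pass()) is hypothetical and none of it is the SGX driver code. The one property it models is the one the changelog states: removing a SECS-like parent fails until all of its child pages are gone.

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy model, all names hypothetical: a page is a child or a SECS-like parent. */
struct toy_page {
	bool dirty;               /* still needs to be reset */
	struct toy_page *parent;  /* NULL for a parent page */
	int live_children;        /* only meaningful for parents */
};

/* Stand-in for the removal step: a parent cannot go while children remain. */
static bool toy_remove(struct toy_page *page)
{
	if (page->live_children > 0)
		return false;
	if (page->parent)
		page->parent->live_children--;
	page->dirty = false;
	return true;
}

/* One pass over the list; returns how many pages are still dirty afterwards. */
size_t toy_sanitize_pass(struct toy_page *pages, size_t n)
{
	size_t left = 0;

	for (size_t i = 0; i < n; i++) {
		if (pages[i].dirty && !toy_remove(&pages[i]))
			left++;
	}
	return left;
}
```

If a parent happens to sit ahead of its children on the list, the first pass skips it and clears the children, and the second pass can then clear the parent. With only one level of parent/child nesting, two passes are enough in this model.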

Re: [PATCH v4 1/3] x86/sgx: Use sgx_free_epc_page() in sgx_reclaim_pages()

2021-03-15 Thread Dave Hansen
On 3/13/21 8:01 AM, Jarkko Sakkinen wrote: > Replace the ad-hoc code with a sgx_free_epc_page(), in order to make sure > that all the relevant checks and book keeping is done, while freeing a > borrowed EPC page, and remove redundant code. EREMOVE inside > sgx_free_epc_page() does not change the se

Re: [PATCH v1 06/14] mm, x86: support the access bit on non-leaf PMD entries

2021-03-14 Thread Dave Hansen
On 3/12/21 11:57 PM, Yu Zhao wrote: > Some architectures support the accessed bit on non-leaf PMD entries > (parents) in addition to leaf PTE entries (children) where pages are > mapped, e.g., x86_64 sets the accessed bit on a parent when using it > as part of linear-address translation [1]. Page t

Re: [PATCH v22 8/8] x86/vdso: Add ENDBR64 to __vdso_sgx_enter_enclave

2021-03-12 Thread Dave Hansen
On 3/12/21 8:55 AM, Jarkko Sakkinen wrote: >> ENDBR is a special new instruction for the Indirect Branch Tracking >> (IBT) component of CET. IBT prevents attacks by ensuring that (most) >> indirect branches and function calls may only land at ENDBR >> instructions. Branches that don't follow the

Re: [PATCH v6 3/4] x86/vmemmap: Handle unpopulated sub-pmd ranges

2021-03-11 Thread Dave Hansen
On 3/9/21 1:40 PM, Oscar Salvador wrote: > +static void __meminit vmemmap_use_new_sub_pmd(unsigned long start, unsigned > long end) > +{ > + /* > + * Could be our memmap page is filled with PAGE_UNUSED already from a > + * previous remove. Make sure to reset it. > + */ > + v

Re: [PATCH v22 8/8] x86/vdso: Add ENDBR64 to __vdso_sgx_enter_enclave

2021-03-10 Thread Dave Hansen
On 3/10/21 2:55 PM, Yu, Yu-cheng wrote: > On 3/10/2021 2:39 PM, Jarkko Sakkinen wrote: >> On Wed, Mar 10, 2021 at 02:05:19PM -0800, Yu-cheng Yu wrote: >>> When CET is enabled, __vdso_sgx_enter_enclave() needs an endbr64 >>> in the beginning of the function. >> >> OK. >> >> What you should do is to

Re: [PATCH V2 16/25] perf/x86: Register hybrid PMUs

2021-03-10 Thread Dave Hansen
On 3/10/21 8:37 AM, kan.li...@linux.intel.com wrote: > - err = perf_pmu_register(&pmu, "cpu", PERF_TYPE_RAW); > - if (err) > - goto out2; > + if (!is_hybrid()) { > + err = perf_pmu_register(&pmu, "cpu", PERF_TYPE_RAW); > + if (err) > +

Re: [PATCH v3 2/5] x86/sgx: Use sgx_free_epc_page() in sgx_reclaim_pages()

2021-03-10 Thread Dave Hansen
On 3/10/21 7:11 AM, Jarkko Sakkinen wrote: >>> - section = &sgx_epc_sections[epc_page->section]; >>> - spin_lock(&section->lock); >>> - list_add_tail(&epc_page->list, &section->page_list); >>> - section->free_cnt++; >>> - spin_unlock(&section->lock); >>> +

Re: [PATCH v3 5/5] x86/sgx: Add a basic NUMA allocation scheme to sgx_alloc_epc_page()

2021-03-10 Thread Dave Hansen
>>> + * node. >>> + */ >>> +static struct sgx_numa_node *sgx_numa_nodes; >>> + >>> +/* >>> + * sgx_free_epc_page() uses this to find out the correct struct >>> sgx_numa_node, >>> + * to put the page in. >>> + */ >>> +static int sgx_section_to_numa_node_id[SGX_MAX_EPC_SECTIONS]; >> >> If this is pe

Re: [PATCH 08/10] mm/vmscan: Consider anonymous pages without swap

2021-03-09 Thread Dave Hansen
On 3/8/21 4:17 PM, Yang Shi wrote: >> Reclaim anonymous pages if a migration path is available now that >> demotion provides a non-swap recourse for reclaiming anon pages. >> >> Note that this check is subtly different from the >> anon_should_be_aged() checks. This mechanism checks whether a >> sp

Re: [PATCH 05/10] mm/migrate: demote pages during reclaim

2021-03-09 Thread Dave Hansen
On 3/8/21 4:10 PM, Yang Shi wrote: >> +static struct page *alloc_demote_page(struct page *page, unsigned long node) >> +{ >> + struct migration_target_control mtc = { >> + /* >> +* Fail the allocation quickly and quietly. When this >> +* happens,

Re: [PATCH 03/10] mm/migrate: update node demotion order during on hotplug events

2021-03-09 Thread Dave Hansen
On 3/8/21 4:03 PM, Yang Shi wrote: >> +static int __meminit migrate_on_reclaim_callback(struct notifier_block >> *self, >> +unsigned long action, void >> *arg) >> +{ >> + switch (action) { >> + case MEM_GOING_OFFLINE: >> +

Re: [PATCH 10/10] mm/migrate: new zone_reclaim_mode to enable reclaim migration

2021-03-09 Thread Dave Hansen
On 3/8/21 4:24 PM, Yang Shi wrote: >> Once this is enabled page demotion may move data to a NUMA node >> that does not fall into the cpuset of the allocating process. >> This could be construed to violate the guarantees of cpusets. >> However, since this is an opt-in mechanism, the assumption is >>

Re: [PATCH 00/10] [v6] Migrate Pages in lieu of discard

2021-03-09 Thread Dave Hansen
... >> == Open Issues == >> >> * For cpusets and memory policies that restrict allocations >>to PMEM, is it OK to demote to PMEM? Do we need a cgroup- >>level API to opt-in or opt-out of these migrations? > > I'm wondering if such usecases, which don't want to have memory > allocate on p

Re: [PATCH v5 4/4] x86/vmemmap: Optimize for consecutive sections in partial populated PMDs

2021-03-09 Thread Dave Hansen
w the previous one, we > know we can memset [unused_pmd_start, PMD_BOUNDARY) with PAGE_UNUSE. > > This patch is based on a similar patch by David Hildenbrand: > > https://lore.kernel.org/linux-mm/20200722094558.9828-10-da...@redhat.com/ > > Signed-off-by: Oscar Salvador This is much more clear now. Thanks! Acked-by: Dave Hansen

Re: [PATCH v5 3/4] x86/vmemmap: Handle unpopulated sub-pmd ranges

2021-03-09 Thread Dave Hansen
tch by David Hildenbrand: > > https://lore.kernel.org/linux-mm/20200722094558.9828-9-da...@redhat.com/ Looks good now. It's much easier to read without the optimization. Acked-by: Dave Hansen

Re: [PATCH v5 2/4] x86/vmemmap: Drop handling of 1GB vmemmap ranges

2021-03-09 Thread Dave Hansen
hat David Hildenbrand said in the v4 thread about this patch. Basically, we don't have code to allocate 1G mappings because it isn't clear that it would be worth the complexity, and it might also waste memory. I'm fine with the code, but I would appreciate a beefed-up changelog: Acked-by: Dave Hansen

Re: [PATCH v4 3/3] x86/vmemmap: Handle unpopulated sub-pmd ranges

2021-03-09 Thread Dave Hansen
On 3/9/21 12:25 AM, Oscar Salvador wrote: > > I think the confusion comes from the name. > "vmemmap_pmd_is_unused" might be a better fit? > > What do you think? Do you feel strong about moving the log in there > regardless of the name? No, not really. The name is probably worth adjusting, but I

Re: [PATCH v4 1/3] x86/vmemmap: Drop handling of 4K unaligned vmemmap range

2021-03-08 Thread Dave Hansen
On 3/8/21 10:20 AM, Oscar Salvador wrote: > On Thu, Mar 04, 2021 at 07:50:10AM -0800, Dave Hansen wrote: >> On 3/1/21 12:32 AM, Oscar Salvador wrote: >>> remove_pte_table() is prepared to handle the case where either the >>> start or the end of the range is not PA

[tip: x86/cleanups] x86: Remove duplicate TSC DEADLINE MSR definitions

2021-03-08 Thread tip-bot2 for Dave Hansen
The following commit has been merged into the x86/cleanups branch of tip: Commit-ID: 09141ec0e4efede4fb5e2aa68cb819fba974325c Gitweb: https://git.kernel.org/tip/09141ec0e4efede4fb5e2aa68cb819fba974325c Author: Dave Hansen AuthorDate: Thu, 05 Mar 2020 09:47:06 -08:00

Re: [RFC][PATCH 1/2] x86: remove duplicate TSC DEADLINE MSR definitions

2021-03-06 Thread Dave Hansen
On 3/5/20 9:47 AM, Dave Hansen wrote: > There are two definitions for the TSC deadline MSR in msr-index.h, > one with an underscore and one without. Axe one of them and move > all the references over to the other one. > > Cc: x...@kernel.org > Cc: Peter Zijlstra Better late t

Re: [PATCH v3 RFC 14/14] mm: speedup page alloc for MPOL_PREFERRED_MANY by adding a NO_SLOWPATH gfp bit

2021-03-05 Thread Dave Hansen
On 3/3/21 8:31 AM, Ben Widawsky wrote: >> I haven't got to the whole series yet. The real question is whether the >> first attempt to enforce the preferred mask is a general win. I would >> argue that it resembles the existing single node preferred memory policy >> because that one doesn't push hea

Re: [PATCH v4 1/3] x86/vmemmap: Drop handling of 4K unaligned vmemmap range

2021-03-05 Thread Dave Hansen
the code that you modified in remove_pte_table(). I assume this was because vmemmap_free() is the only (indirect) caller of remove_pte_table(). Otherwise, this looks fine to me: Acked-by: Dave Hansen

Re: [PATCH 02/25] x86/cpufeatures: Add SGX1 and SGX2 sub-features

2021-03-05 Thread Dave Hansen
On 3/2/21 7:48 AM, Haitao Huang wrote: > > Hi Haitao, Jarkko, > > Do you have more concrete use case of needing "sgx2" in /proc/cpuinfo? Kai, please remove it from your series. I'm not hearing any arguments remotely close enough to what Boris would require in order to keep it.

Re: [PATCH v4 3/3] x86/vmemmap: Handle unpopulated sub-pmd ranges

2021-03-05 Thread Dave Hansen
On 3/4/21 9:02 AM, Dave Hansen wrote: >> +#define PAGE_UNUSED 0xFD >> +/* >> + * The unused vmemmap range, which was not yet memset(PAGE_UNUSED) ranges >> + * from unused_pmd_start to next PMD_SIZE boundary. >> + */ >> +static unsigned long unused_p

Re: [PATCH v3 2/5] x86/sgx: Use sgx_free_epc_page() in sgx_reclaim_pages()

2021-03-05 Thread Dave Hansen
On 3/3/21 7:03 AM, Jarkko Sakkinen wrote: > diff --git a/arch/x86/kernel/cpu/sgx/main.c b/arch/x86/kernel/cpu/sgx/main.c > index 52d070fb4c9a..ed99c60024dc 100644 > --- a/arch/x86/kernel/cpu/sgx/main.c > +++ b/arch/x86/kernel/cpu/sgx/main.c > @@ -305,7 +305,6 @@ static void sgx_reclaim_pages(void)

Re: [PATCH v3 1/5] x86/sgx: Fix a resource leak in sgx_init()

2021-03-05 Thread Dave Hansen
On 3/3/21 7:03 AM, Jarkko Sakkinen wrote: > If sgx_page_cache_init() fails in the middle, a trivial return > statement causes unused memory and virtual address space reserved for > the EPC section, not freed. Fix this by using the same rollback, as > when sgx_page_reclaimer_init() fails. ... > @@ -

Re: [PATCH v3 5/5] x86/sgx: Add a basic NUMA allocation scheme to sgx_alloc_epc_page()

2021-03-05 Thread Dave Hansen
What changed from the last patch? On 3/3/21 7:03 AM, Jarkko Sakkinen wrote: > Background > == > > EPC section is covered by one or more SRAT entries that are associated with > one and only one PXM (NUMA node). The motivation behind this patch is to > provide basic elements of building all

Re: [PATCH v4 3/3] x86/vmemmap: Handle unpopulated sub-pmd ranges

2021-03-05 Thread Dave Hansen
On 3/1/21 12:32 AM, Oscar Salvador wrote: > When the size of a struct page is not multiple of 2MB, sections do > not span a PMD anymore and so when populating them some parts of the > PMD will remain unused. Multiples of 2MB are 2MB, 4MB, 6MB, etc... I think you meant here that 2MB must be a mult

Re: [PATCH v4 2/3] x86/vmemmap: Drop handling of 1GB vmemmap ranges

2021-03-05 Thread Dave Hansen
On 3/1/21 12:32 AM, Oscar Salvador wrote: > We never get to allocate 1GB pages when mapping the vmemmap range. > Drop the dead code both for the aligned and unaligned cases and leave > only the direct map handling. Could you elaborate a bit on why 1GB pages are never used? It is just unlikely to

Re: [PATCH v3 3/5] x86/sgx: Replace section->init_laundry_list with a temp list

2021-03-05 Thread Dave Hansen
... > -static void sgx_sanitize_section(struct sgx_epc_section *section) > +static void sgx_sanitize_section(struct list_head *laundry) > { Does this need a better function name now that it's not literally dealing with sections at *all*? sgx_sanitize_pages() perhaps. > struct sgx

Re: [PATCH v3 4/5] x86/sgx: Replace section->page_list with a global free page list

2021-03-05 Thread Dave Hansen
On 3/3/21 7:03 AM, Jarkko Sakkinen wrote: > Background > == > > EPC section is covered by one or more SRAT entries that are associated with > one and only one PXM (NUMA node). The current implementation overheats a Overheats? > single NUMA node, because sgx_alloc_epc_page() always starts

[PATCH 10/10] mm/migrate: new zone_reclaim_mode to enable reclaim migration

2021-03-04 Thread Dave Hansen
From: Dave Hansen Some method is obviously needed to enable reclaim-based migration. Just like traditional autonuma, there will be some workloads that will benefit like workloads with more "static" configurations where hot pages stay hot and cold pages stay cold. If pages come a

[PATCH 06/10] mm/vmscan: add page demotion counter

2021-03-04 Thread Dave Hansen
ges() ] Signed-off-by: Yang Shi Signed-off-by: Dave Hansen Cc: David Rientjes Cc: Huang Ying Cc: Dan Williams Cc: David Hildenbrand Cc: osalvador -- Changes since 202010: * remove unused scan-control 'demoted' field --- b/include/linux/vm_event_item.h |2 ++

[PATCH 09/10] mm/vmscan: never demote for memcg reclaim

2021-03-04 Thread Dave Hansen
From: Dave Hansen Global reclaim aims to reduce the amount of memory used on a given node or set of nodes. Migrating pages to another node serves this purpose. memcg reclaim is different. Its goal is to reduce the total memory consumption of the entire memcg, across all nodes. Migration

[PATCH 08/10] mm/vmscan: Consider anonymous pages without swap

2021-03-04 Thread Dave Hansen
context *can* actually be reclaimed, given current swap space and cgroup limits. anon_should_be_aged() is a much simpler and more preliminary check which just says whether there is a possibility of future reclaim. #Signed-off-by: Keith Busch Cc: Keith Busch Signed-off-by: Dave Hansen Cc: Yang Shi

[PATCH 07/10] mm/vmscan: add helper for querying ability to age anonymous pages

2021-03-04 Thread Dave Hansen
From: Dave Hansen Anonymous pages are kept on their own LRU(s). These lists could theoretically always be scanned and maintained. But, without swap, there is currently nothing the kernel can *do* with the results of a scanned, sorted LRU for anonymous pages. A check for '!total_swap_

[PATCH 00/10] [v6] Migrate Pages in lieu of discard

2021-03-04 Thread Dave Hansen
The full series is also available here: https://github.com/hansendc/linux/tree/automigrate-20210304 which also includes some vm.zone_reclaim_mode sysctl ABI fixup prerequisites. The meat of this patch is in: [PATCH 05/10] mm/migrate: demote pages during reclaim Which also has

[PATCH 01/10] mm/numa: node demotion data structure and lookup

2021-03-04 Thread Dave Hansen
From: Dave Hansen Prepare for the kernel to auto-migrate pages to other memory nodes with a user defined node migration table. This allows creating single migration target for each NUMA node to enable the kernel to do NUMA page migrations instead of simply reclaiming colder pages. A node with
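The "single migration target for each NUMA node" idea above can be pictured as a plain lookup table. What follows is a rough userspace sketch, not the patch itself: the table contents, the 4-node limit and the names toy_node_demotion[] and toy_next_demotion_node() are all assumptions made for illustration.

```c
#define TOY_MAX_NODES 4
#define TOY_NO_NODE   (-1)   /* hypothetical marker: no demotion target */

/*
 * Hypothetical table: index is the source node, value is its one and only
 * demotion target. Here nodes 0 and 1 demote to nodes 2 and 3, which are
 * terminal and have nowhere further to go.
 */
static int toy_node_demotion[TOY_MAX_NODES] = {
	[0] = 2,
	[1] = 3,
	[2] = TOY_NO_NODE,
	[3] = TOY_NO_NODE,
};

/* Look up where pages from 'node' would be migrated, if anywhere. */
int toy_next_demotion_node(int node)
{
	if (node < 0 || node >= TOY_MAX_NODES)
		return TOY_NO_NODE;
	return toy_node_demotion[node];
}
```

In this sketch a caller that gets TOY_NO_NODE back would presumably fall back to ordinary reclaim for that node; that fallback is my reading of the description, not something shown in the snippet.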

[PATCH 04/10] mm/migrate: make migrate_pages() return nr_succeeded

2021-03-04 Thread Dave Hansen
account how many pages are reclaimed (demoted) since page reclaim behavior depends on this. Add *nr_succeeded parameter to make migrate_pages() return how many pages are demoted successfully for all cases. Signed-off-by: Yang Shi Signed-off-by: Dave Hansen Cc: David Rientjes Cc: Huang Ying Cc

[PATCH 05/10] mm/migrate: demote pages during reclaim

2021-03-04 Thread Dave Hansen
From: Dave Hansen This is mostly derived from a patch from Yang Shi: https://lore.kernel.org/linux-mm/1560468577-101178-10-git-send-email-yang@linux.alibaba.com/ Add code to the reclaim path (shrink_page_list()) to "demote" data to another NUMA node instead of disc

[PATCH 03/10] mm/migrate: update node demotion order during on hotplug events

2021-03-04 Thread Dave Hansen
From: Dave Hansen Reclaim-based migration is attempting to optimize data placement in memory based on the system topology. If the system changes, so must the migration ordering. The implementation is conceptually simple and entirely unoptimized. On any memory or CPU hotplug events, assume

[PATCH 02/10] mm/numa: automatically generate node migration order

2021-03-04 Thread Dave Hansen
From: Dave Hansen When memory fills up on a node, memory contents can be automatically migrated to another node. The biggest problems are knowing when to migrate and to where the migration should be targeted. The most straightforward way to generate the "to where" list would be to

Re: [PATCH 0/3] Introduce version array structure: sgx_va

2021-02-24 Thread Dave Hansen
On 2/24/21 2:20 PM, Jarkko Sakkinen wrote: > The use of sgx_va can be later on extended to the following use cases: > > - A global VA for reclaimed SECS pages. > - A global VA for reclaimed VA pages. ... > arch/x86/kernel/cpu/sgx/driver.c | 3 +- > arch/x86/kernel/cpu/sgx/encl.c | 180 +++

Re: [PATCH v2] x86,sched: Update the Intel SNC CPU list that allows shared LLCs

2021-02-24 Thread Dave Hansen
On 2/16/21 11:58 AM, Alison Schofield wrote: > arch/x86/kernel/smpboot.c | 2 ++ > 1 file changed, 2 insertions(+) > > diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c > index 02813a7f3a7c..de8c598dc3b9 100644 > --- a/arch/x86/kernel/smpboot.c > +++ b/arch/x86/kernel/smpboot.c >

Re: [PATCH] x86/sgx: Add a basic NUMA allocation scheme to sgx_alloc_epc_page()

2021-02-23 Thread Dave Hansen
This doesn't look like it addresses all of the suggestions that I made two days ago. Is that coming in v3?

Re: [PATCH] x86/sgx: Add a basic NUMA allocation scheme to sgx_alloc_epc_page()

2021-02-23 Thread Dave Hansen
On 2/23/21 11:17 AM, Jarkko Sakkinen wrote: > Instead, let's just: > > 1. Have a global sgx_free_epc_list and remove sgx_epc_section. >Pages from this are allocated from this in LIFO fashion. > 2. Instead add struct list_head node_list and use that for node >associated pages. > 3. Replace

Re: [PATCH] x86/sgx: Add a basic NUMA allocation scheme to sgx_alloc_epc_page()

2021-02-23 Thread Dave Hansen
On 2/21/21 4:54 PM, Dave Hansen wrote: > Instead of having a for-each-section loop, I'd make it for-each-node -> > for-each-section. Something like: > > for (i = 0; i < num_possible_nodes(); i++) { > node = (numa_node_id()

Re: [PATCH] x86/sgx: Add a basic NUMA allocation scheme to sgx_alloc_epc_page()

2021-02-21 Thread Dave Hansen
> +/* Nodes with one or more EPC sections. */ > +static nodemask_t sgx_numa_mask; I'd also add that this is for optimization only. > +/* Array of lists of EPC sections for each NUMA node. */ > +struct list_head *sgx_numa_nodes; I'd much prefer: /* * Array with one list_head for each possible N

Re: [PATCH RFC] mm/madvise: introduce MADV_POPULATE to prefault/prealloc memory

2021-02-17 Thread Dave Hansen
On 2/17/21 7:48 AM, David Hildenbrand wrote: > While MADV_DONTNEED and FALLOC_FL_PUNCH_HOLE provide us ways to reliably > discard memory, there is no generic approach to populate ("preallocate") > memory. > > Although mmap() supports MAP_POPULATE, it is not applicable to the concept > of sparse me

Re: [RFC v1 05/26] x86/traps: Add #VE support for TDX guest

2021-02-12 Thread Dave Hansen
On 2/12/21 1:47 PM, Andy Lutomirski wrote: >> What about adding a property to the TD, e.g. via a flag set during TD >> creation, >> that controls whether unaccepted accesses cause #VE or are, for all intents >> and >> purposes, fatal? That would allow Linux to pursue treating EPT #VEs for >> pr

Re: [RFC v1 05/26] x86/traps: Add #VE support for TDX guest

2021-02-12 Thread Dave Hansen
On 2/12/21 12:54 PM, Sean Christopherson wrote: > Ah, I see what you're thinking. > > Treating an EPT #VE as fatal was also considered as an option. IIUC it was > thought that finding every nook and cranny that could access a page, without > forcing the kernel to pre-accept huge swaths of memory,

Re: [RFC v1 05/26] x86/traps: Add #VE support for TDX guest

2021-02-12 Thread Dave Hansen
On 2/12/21 12:37 PM, Sean Christopherson wrote: > There needs to be a mechanism for lazy/deferred/on-demand acceptance of pages. > E.g. pre-accepting every page in a VM with hundreds of GB of memory will be > ridiculously slow. > > #VE is the best option to do that: > > - Relatively sane re-ent

Re: [RFC v1 05/26] x86/traps: Add #VE support for TDX guest

2021-02-12 Thread Dave Hansen
On 2/12/21 12:06 PM, Sean Christopherson wrote: >> What happens if the guest attempts to access a secure GPA that is not >> ACCEPTed? For example, suppose the VMM does THH.MEM.PAGE.REMOVE on a secure >> address and the guest accesses it, via instruction fetch or data access. >> What happens? > Wel

Re: [RFC v1 05/26] x86/traps: Add #VE support for TDX guest

2021-02-12 Thread Dave Hansen
On 2/5/21 3:38 PM, Kuppuswamy Sathyanarayanan wrote: > More details on cases where #VE exceptions are allowed/not-allowed: > > The #VE exception do not occur in the paranoid entry paths, like NMIs. > While other operations during an NMI might cause #VE, these are in the > NMI code that can handle

Re: Memory keys and io_uring.

2021-02-11 Thread Dave Hansen
On 2/11/21 10:59 PM, Aneesh Kumar K.V wrote: > A read syscall do fail with EFAULT. But we allow read via io_uring > syscalls. Is that ok? In short, yes. As much as I'd like to apply pkey permissions to all accesses, when we don't have the CPU registers around, we don't have a choice: we have to

Re: [RFC 1/9] mm, arm64: Update PR_SET/GET_TAGGED_ADDR_CTRL interface

2021-02-11 Thread Dave Hansen
Hi Catalin, I noticed there are some ELF bits for ARM's BTI feature: GNU_PROPERTY_AARCH64_FEATURE_1_BTI > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/include/uapi/linux/elf.h#n453 There's been talk of needing a similar set of bits on x86 for tagged pointers (
