On 3/30/21 8:00 AM, Andi Kleen wrote:
>>> + /* MWAIT is not supported on the TDX platform, so suppress it */
>>> + setup_clear_cpu_cap(X86_FEATURE_MWAIT);
>> In fact, the MWAIT bit returned by the CPUID instruction is zero for a TD
>> guest. This is enforced by the SEAM module.
> Good point.
>> Do we still need
On 3/29/21 4:16 PM, Kuppuswamy Sathyanarayanan wrote:
> In non-root TDX guest mode, MWAIT, MONITOR and WBINVD instructions
> are not supported. So handle #VE due to these instructions
> appropriately.
This misses a key detail:
"are not supported" ... and other patches have prevented a gue
On 3/29/21 3:09 PM, Kuppuswamy, Sathyanarayanan wrote:
> + case EXIT_REASON_MWAIT_INSTRUCTION:
> + /* MWAIT is suppressed, not supposed to reach here. */
> + WARN(1, "MWAIT unexpected #VE Exception\n");
> + return -EFAULT;
How is MWAIT "suppressed"?
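The pattern being argued over above can be sketched in plain C. This is an illustrative userspace model, not the kernel's #VE handler: the exit-reason constants mirror the VMX `EXIT_REASON_*` names but are assumptions here, and `-14`/`-22` stand in for `-EFAULT`/`-EINVAL`. The point it demonstrates is the review feedback: instructions whose CPUID bits were cleared at boot should never reach the handler, so hitting one deserves a loud warning rather than silent emulation.

```c
#include <stdio.h>

/* Hypothetical exit-reason values, modeled on the VMX EXIT_REASON_*
 * constants; treat them as placeholders, not authoritative. */
#define EXIT_REASON_MWAIT_INSTRUCTION   36
#define EXIT_REASON_MONITOR_INSTRUCTION 39
#define EXIT_REASON_WBINVD              54

/* Dispatch sketch: MWAIT/MONITOR were suppressed via CPUID filtering,
 * so a #VE for them means the kernel ignored its own feature checks. */
static int handle_ve_exit(unsigned int exit_reason)
{
	switch (exit_reason) {
	case EXIT_REASON_MWAIT_INSTRUCTION:
	case EXIT_REASON_MONITOR_INSTRUCTION:
		/* Suppressed feature: warn loudly, fail the access. */
		fprintf(stderr, "unexpected #VE: exit reason %u\n",
			exit_reason);
		return -14; /* stand-in for -EFAULT */
	case EXIT_REASON_WBINVD:
		/* WBINVD can be treated as a no-op in a TD guest. */
		return 0;
	default:
		return -22; /* stand-in for -EINVAL */
	}
}
```

The WARN-then-`-EFAULT` shape matches the quoted patch; the disagreement in the thread is only about whether the case should exist at all once CPUID suppression is in place.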
On 3/29/21 2:55 PM, Kuppuswamy, Sathyanarayanan wrote:
>>
>> MONITOR is a privileged instruction, right? So we can only end up in
>> here if the kernel screws up and isn't reading CPUID correctly, right?
>>
>> That doesn't seem to me like something we want to suppress. This needs
>> a warning, at
On 3/29/21 10:45 AM, Marco Elver wrote:
> On Mon, 29 Mar 2021 at 19:32, Dave Hansen wrote:
> Doing it to all CPUs is too expensive, and we can tolerate this being
> approximate (nothing bad will happen, KFENCE might just miss a bug and
> that's ok).
...
>> BTW
On 3/29/21 9:40 AM, Marco Elver wrote:
> It looks like the code path from flush_tlb_one_kernel() to
> invalidate_user_asid()'s this_cpu_ptr() has several feature checks, so
> probably some feature difference between systems where it triggers and
> it doesn't.
>
> As far as I'm aware, there is no r
On 3/27/21 3:54 PM, Kuppuswamy Sathyanarayanan wrote:
> + /*
> + * Per Guest-Host-Communication Interface (GHCI) for Intel Trust
> + * Domain Extensions (Intel TDX) specification, sec 2.4,
> + * some instructions that unconditionally cause #VE (such as WBINVD,
> + * MONITOR,
On 3/27/21 5:53 PM, Thomas Gleixner wrote:
> Making it solely depend on XCR0 and fault if not requested upfront is
> bringing you into the situation that you broke 'legacy code' which
> relied on the CPUID bit and that worked until now which gets you
> in the no-regression trap.
Trying to find the
On 3/26/21 8:29 AM, Borislav Petkov wrote:
> On Fri, Mar 26, 2021 at 08:17:38AM -0700, Dave Hansen wrote:
>> We're working on a cgroup controller just for enclave pages that will
>> apply to guest use and bare metal. It would have been nice to have up
>> front, but
On 3/26/21 8:03 AM, Borislav Petkov wrote:
> Let's say all guests start using enclaves and baremetal cannot start any
> new ones anymore due to no more memory. Are we ok with that?
Yes, for now.
> What if baremetal creates a big fat enclave and starves guests all of a
> sudden. Are we ok with tha
On 3/25/21 3:59 PM, Len Brown wrote:
> We call AMX a "simple state feature" -- it actually requires NO KERNEL
> ENABLING
> above the generic state save/restore to fully support userspace AMX
> applications.
>
> While not all ISA extensions can be simple state features, we do expect
> future featu
On 3/25/21 8:24 AM, Brijesh Singh wrote:
> On 3/25/21 9:48 AM, Dave Hansen wrote:
>> On 3/24/21 10:04 AM, Brijesh Singh wrote:
>>> When SEV-SNP is enabled globally in the system, a write from the hypervisor
>>> can raise an RMP violation. We can resolve the RMP vio
On 3/25/21 8:31 AM, Brijesh Singh wrote:
>
> On 3/25/21 9:58 AM, Dave Hansen wrote:
>>> +static int __init mem_encrypt_snp_init(void)
>>> +{
>>> + if (!boot_cpu_has(X86_FEATURE_SEV_SNP))
>>> + return 1;
>>> +
>>> +
On 3/24/21 10:04 AM, Brijesh Singh wrote:
> The splitting of the physmap is a temporary solution until we work to
> improve the kernel page fault handler to split the pages on demand.
> One of the disadvantages of splitting is that eventually, we will end up
> breaking down the entire physmap unless w
> +static int __init mem_encrypt_snp_init(void)
> +{
> + if (!boot_cpu_has(X86_FEATURE_SEV_SNP))
> + return 1;
> +
> + if (rmptable_init()) {
> + setup_clear_cpu_cap(X86_FEATURE_SEV_SNP);
> + return 1;
> + }
> +
> + static_branch_enable(&snp_enabl
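The rollback pattern in the quoted initcall can be modeled in a few lines of standalone C. Everything below is a toy stand-in: the variable names mirror the patch (`rmptable_init()`, the SEV-SNP feature bit, the static branch), but the implementations are invented for illustration.

```c
#include <stdbool.h>

/* Toy stand-ins for the kernel primitives in the quoted patch. */
static bool sev_snp_supported = true; /* boot_cpu_has(X86_FEATURE_SEV_SNP) */
static bool rmptable_ok = false;      /* pretend rmptable_init() fails     */
static bool snp_enabled;              /* static_branch_enable(&snp_enabled) */

static int rmptable_init(void)
{
	return rmptable_ok ? 0 : -1;
}

/* The pattern under review: if the RMP table cannot be set up, clear
 * the feature bit so the rest of the kernel never sees SEV-SNP
 * half-enabled, then report the initcall as done. */
static int mem_encrypt_snp_init(void)
{
	if (!sev_snp_supported)
		return 1;

	if (rmptable_init()) {
		sev_snp_supported = false; /* setup_clear_cpu_cap(...) */
		return 1;
	}

	snp_enabled = true;
	return 0;
}
```

The useful property is that every failure path leaves the feature bit consistent with reality, so later `boot_cpu_has()`-style checks cannot observe a partially initialized state.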
On 3/24/21 10:04 AM, Brijesh Singh wrote:
> When SEV-SNP is enabled globally in the system, a write from the hypervisor
> can raise an RMP violation. We can resolve the RMP violation by splitting
> the virtual address to a lower page level.
>
> e.g
> - guest made a page shared in the RMP entry so
On 3/25/21 7:32 AM, Brijesh Singh wrote:
>>> enum x86_pf_error_code {
>>> X86_PF_PROT = 1 << 0,
>>> @@ -21,6 +22,7 @@ enum x86_pf_error_code {
> >>> X86_PF_INSTR = 1 << 4,
>>> X86_PF_PK = 1 << 5,
>>> X86_PF_SGX =
On 3/24/21 10:04 AM, Brijesh Singh wrote:
> @@ -1377,6 +1442,22 @@ void do_user_addr_fault(struct pt_regs *regs,
> if (hw_error_code & X86_PF_INSTR)
> flags |= FAULT_FLAG_INSTRUCTION;
>
> + /*
> + * If it's an RMP violation, see if we can resolve it.
> + */
> +
On 3/24/21 2:42 PM, Andy Lutomirski wrote:
3. user space always uses fully uncompacted XSAVE buffers.
>>> There is no reason we have to do this for new states. Arguably we
>>> shouldn’t for AMX to avoid yet another altstack explosion.
>> The thing that's worried me is that the list of OS-
On 3/24/21 2:26 PM, Andy Lutomirski wrote:
>> 3. user space always uses fully uncompacted XSAVE buffers.
>>
> There is no reason we have to do this for new states. Arguably we
> shouldn’t for AMX to avoid yet another altstack explosion.
The thing that's worried me is that the list of OS-enabled s
On 3/24/21 1:22 PM, Thomas Hellström (Intel) wrote:
>> We also have not been careful at *all* about how _PAGE_BIT_SOFTW* are
>> used. It's quite possible we can encode another use even in the
>> existing bits.
>>
>> Personally, I'd just try:
>>
>> #define _PAGE_BIT_SOFTW5 57 /* availab
> diff --git a/arch/x86/include/asm/trap_pf.h b/arch/x86/include/asm/trap_pf.h
> index 10b1de500ab1..107f9d947e8d 100644
> --- a/arch/x86/include/asm/trap_pf.h
> +++ b/arch/x86/include/asm/trap_pf.h
> @@ -12,6 +12,7 @@
> * bit 4 ==1: fault was an instruction
> f
On 3/24/21 3:05 AM, Thomas Hellström (Intel) wrote:
> Yes, I agree. Seems like the special (SW1) is available also for huge
> page table entries on x86 AFAICT, although just not implemented.
> Otherwise the SW bits appear completely used up.
Although the _PAGE_BIT_SOFTW* bits are used up, there's
On 3/23/21 2:52 PM, Bae, Chang Seok wrote:
>> "System software may disable use of Intel AMX by clearing XCR0[18:17], by
>> clearing CR4.OSXSAVE, or by setting IA32_XFD[18]. It is recommended that
>> system software initialize AMX state (e.g., by executing TILERELEASE)
>> before doing so. Thi
The following commit has been merged into the x86/sgx branch of tip:
Commit-ID: 4284f7acb78bfb0e0c26a2b78e2b2c3d68fccd6f
Gitweb:
https://git.kernel.org/tip/4284f7acb78bfb0e0c26a2b78e2b2c3d68fccd6f
Author: Dave Hansen
AuthorDate: Thu, 18 Mar 2021 12:43:01 -07:00
Committer
On 3/19/21 10:42 AM, Kuppuswamy, Sathyanarayanan wrote:
>>> @@ -4,6 +4,58 @@
>>> #include
>>> #include
>>> +void tdcall(u64 leafid, struct tdcall_regs *regs)
>>> +{
>>> + asm volatile(
>>> + /* RAX = leafid (TDCALL LEAF ID) */
>>> + " movq %0, %%rax;"
>>> +
On 3/19/21 8:52 AM, Borislav Petkov wrote:
> On Fri, Mar 19, 2021 at 05:22:56PM +0200, Jarkko Sakkinen wrote:
>> I did misread it for the first time.
>>
>> So let's sanity: you *are* going to squash the patches together because
>> that way it's factors easier to backport the whole thing?
>>
>> Is t
On 3/19/21 7:58 AM, Borislav Petkov wrote:
> On Fri, Mar 19, 2021 at 11:38:44AM -, tip-bot2 for Dave Hansen wrote:
>> tools/testing/selftests/sgx/load.c | 66 ++---
>> tools/testing/selftests/sgx/main.c | 2 +-
>> 2 files changed, 53 insert
The following commit has been merged into the x86/sgx branch of tip:
Commit-ID: 79713a1fa1b9cd9d650b1ff0657ddbadc5dbbeaa
Gitweb:
https://git.kernel.org/tip/79713a1fa1b9cd9d650b1ff0657ddbadc5dbbeaa
Author: Dave Hansen
AuthorDate: Thu, 18 Mar 2021 12:43:01 -07:00
Committer
The following commit has been merged into the x86/sgx branch of tip:
Commit-ID: 262e88b63f55e3d2bacdf629874a0af486775572
Gitweb:
https://git.kernel.org/tip/262e88b63f55e3d2bacdf629874a0af486775572
Author: Dave Hansen
AuthorDate: Thu, 18 Mar 2021 14:49:33 -07:00
Committer
s. My gcc
does not detect it.
Fixes: 5b8719504e3a ("x86/sgx: Add a basic NUMA allocation scheme to
sgx_alloc_epc_page()")
Reported-by: kernel test robot
Signed-off-by: Dave Hansen
Cc: Jarkko Sakkinen
Cc: Borislav Petkov
Cc: x...@kernel.org
Cc: linux-...@vger.kernel.org
---
arc
From: Dave Hansen
The SGX device file (/dev/sgx_enclave) is unusual in that it requires
execute permissions. It has to be both "chmod +x" *and* be on a
filesystem without 'noexec'.
In the future, udev and systemd should get updates to set up systems
automatically. Bu
On 3/18/21 10:40 AM, Borislav Petkov wrote:
> So both patches look ok to me but the sgx test case fails on -rc3 with and
> without those patches on my box:
>
> ./test_sgx
> 0x 0x2000 0x03
> 0x2000 0x1000 0x05
> 0x3000 0x3
On 3/16/21 6:52 PM, Kefeng Wang wrote:
> mem_init_print_info() is called in mem_init() on each architecture,
> and pass NULL argument, so using void argument and move it into mm_init().
>
> Acked-by: Dave Hansen
It's not a big deal but you might want to say something like
mem_init_print_info(), so this patch will change the
location of the mem_init_print_info(), but I think it's actually for the
better, since it will be pushed later in boot. As long as the x86
pieces stay the same:
Acked-by: Dave Hansen
On 3/16/21 1:30 PM, Yu Zhao wrote:
> On Tue, Mar 16, 2021 at 07:50:23AM -0700, Dave Hansen wrote:
>> I think it would also be very worthwhile to include some research in
>> this series about why the kernel moved away from page table scanning.
>> What has changed? Are
On 3/16/21 10:44 AM, Yu, Yu-cheng wrote:
>> Also, Boris asked for two *different* macros for 32 and 64-bit:
>>
>> https://lore.kernel.org/linux-api/20210310231731.gk23...@zn.tnic/
>>
>> Could you do that in the next version, please?
>
> Yes, we can do two macros, probably in arch/x86/include/asm/v
On 3/16/21 10:12 AM, Yu, Yu-cheng wrote:
> On 3/16/2021 8:49 AM, Dave Hansen wrote:
...
>> Is "#ifdef __i386__" the right thing to use here? I guess ENDBR only
>> ends up getting used in the VDSO, but there's a lot of
>> non-userspace-exposed stuff in call
On 3/16/21 8:13 AM, Yu-cheng Yu wrote:
> --- a/arch/x86/entry/calling.h
> +++ b/arch/x86/entry/calling.h
> @@ -392,3 +392,21 @@ For 32-bit we have the following conventions - kernel is
> built with
> .endm
>
> #endif /* CONFIG_SMP */
> +/*
> + * ENDBR is an instruction for the Indirect Branch
On 3/15/21 7:24 PM, Yu Zhao wrote:
> On Mon, Mar 15, 2021 at 11:00:06AM -0700, Dave Hansen wrote:
>> How bad does this scanning get in the worst case if there's a lot of
>> sharing?
>
> Actually the improvement is larger when there is more sharing, i.e.,
> higher
On 3/15/21 12:14 PM, Jarkko Sakkinen wrote:
> On Mon, Mar 15, 2021 at 09:03:21AM -0700, Dave Hansen wrote:
>> On 3/13/21 8:01 AM, Jarkko Sakkinen wrote:
>>> Reset initialized EPC pages in sgx_dirty_page_list to uninitialized state,
>>> and free them using sgx_free_epc_pa
On 3/12/21 11:57 PM, Yu Zhao wrote:
> Background
> ==
> DRAM is a major factor in total cost of ownership, and improving
> memory overcommit brings a high return on investment. Over the past
> decade of research and experimentation in memory overcommit, we
> observed a distinct trend across
On 3/13/21 8:01 AM, Jarkko Sakkinen wrote:
> Background
> ==
>
> EPC section is covered by one or more SRAT entries that are associated with
> one and only one PXM (NUMA node). The motivation behind this patch is to
> provide basic elements of building allocation scheme based on this premi
On 3/13/21 8:01 AM, Jarkko Sakkinen wrote:
> Reset initialized EPC pages in sgx_dirty_page_list to uninitialized state,
> and free them using sgx_free_epc_page(). Do two passes, as for SECS pages
> the first round can fail, if all child pages have not yet been removed.
> The driver puts all pages o
On 3/13/21 8:01 AM, Jarkko Sakkinen wrote:
> Replace the ad-hoc code with a sgx_free_epc_page(), in order to make sure
> that all the relevant checks and book keeping is done, while freeing a
> borrowed EPC page, and remove redundant code. EREMOVE inside
> sgx_free_epc_page() does not change the se
On 3/12/21 11:57 PM, Yu Zhao wrote:
> Some architectures support the accessed bit on non-leaf PMD entries
> (parents) in addition to leaf PTE entries (children) where pages are
> mapped, e.g., x86_64 sets the accessed bit on a parent when using it
> as part of linear-address translation [1]. Page t
On 3/12/21 8:55 AM, Jarkko Sakkinen wrote:
>> ENDBR is a special new instruction for the Indirect Branch Tracking
>> (IBT) component of CET. IBT prevents attacks by ensuring that (most)
>> indirect branches and function calls may only land at ENDBR
>> instructions. Branches that don't follow the
On 3/9/21 1:40 PM, Oscar Salvador wrote:
> +static void __meminit vmemmap_use_new_sub_pmd(unsigned long start, unsigned
> long end)
> +{
> + /*
> + * Could be our memmap page is filled with PAGE_UNUSED already from a
> + * previous remove. Make sure to reset it.
> + */
> + v
On 3/10/21 2:55 PM, Yu, Yu-cheng wrote:
> On 3/10/2021 2:39 PM, Jarkko Sakkinen wrote:
>> On Wed, Mar 10, 2021 at 02:05:19PM -0800, Yu-cheng Yu wrote:
>>> When CET is enabled, __vdso_sgx_enter_enclave() needs an endbr64
>>> in the beginning of the function.
>>
>> OK.
>>
>> What you should do is to
On 3/10/21 8:37 AM, kan.li...@linux.intel.com wrote:
> - err = perf_pmu_register(&pmu, "cpu", PERF_TYPE_RAW);
> - if (err)
> - goto out2;
> + if (!is_hybrid()) {
> + err = perf_pmu_register(&pmu, "cpu", PERF_TYPE_RAW);
> + if (err)
> +
On 3/10/21 7:11 AM, Jarkko Sakkinen wrote:
>>> - section = &sgx_epc_sections[epc_page->section];
> >>> - spin_lock(&section->lock);
> >>> - list_add_tail(&epc_page->list, &section->page_list);
> >>> - section->free_cnt++;
> >>> - spin_unlock(&section->lock);
>>> +
>>> + * node.
>>> + */
>>> +static struct sgx_numa_node *sgx_numa_nodes;
>>> +
>>> +/*
>>> + * sgx_free_epc_page() uses this to find out the correct struct
>>> sgx_numa_node,
>>> + * to put the page in.
>>> + */
>>> +static int sgx_section_to_numa_node_id[SGX_MAX_EPC_SECTIONS];
>>
>> If this is pe
On 3/8/21 4:17 PM, Yang Shi wrote:
>> Reclaim anonymous pages if a migration path is available now that
>> demotion provides a non-swap recourse for reclaiming anon pages.
>>
>> Note that this check is subtly different from the
>> anon_should_be_aged() checks. This mechanism checks whether a
>> sp
On 3/8/21 4:10 PM, Yang Shi wrote:
>> +static struct page *alloc_demote_page(struct page *page, unsigned long node)
>> +{
>> + struct migration_target_control mtc = {
>> + /*
>> +* Fail the allocation quickly and quietly. When this
>> +* happens,
On 3/8/21 4:03 PM, Yang Shi wrote:
>> +static int __meminit migrate_on_reclaim_callback(struct notifier_block
>> *self,
>> +unsigned long action, void
>> *arg)
>> +{
>> + switch (action) {
>> + case MEM_GOING_OFFLINE:
>> +
On 3/8/21 4:24 PM, Yang Shi wrote:
>> Once this is enabled page demotion may move data to a NUMA node
>> that does not fall into the cpuset of the allocating process.
>> This could be construed to violate the guarantees of cpusets.
>> However, since this is an opt-in mechanism, the assumption is
>>
...
>> == Open Issues ==
>>
>> * For cpusets and memory policies that restrict allocations
>>to PMEM, is it OK to demote to PMEM? Do we need a cgroup-
>>level API to opt-in or opt-out of these migrations?
>
> I'm wondering if such usecases, which don't want to have memory
> allocate on p
w the previous one, we
> know we can memset [unused_pmd_start, PMD_BOUNDARY) with PAGE_UNUSED.
>
> This patch is based on a similar patch by David Hildenbrand:
>
> https://lore.kernel.org/linux-mm/20200722094558.9828-10-da...@redhat.com/
>
> Signed-off-by: Oscar Salvador
This is much more clear now. Thanks!
Acked-by: Dave Hansen
tch by David Hildenbrand:
>
> https://lore.kernel.org/linux-mm/20200722094558.9828-9-da...@redhat.com/
Looks good now. It's much easier to read without the optimization.
Acked-by: Dave Hansen
hat David Hildenbrand said in the v4 thread about this patch.
Basically, we don't have code to allocate 1G mappings because it isn't
clear that it would be worth the complexity, and it might also waste memory.
I'm fine with the code, but I would appreciate a beefed-up changelog:
Acked-by: Dave Hansen
On 3/9/21 12:25 AM, Oscar Salvador wrote:
>
> I think the confusion comes from the name.
> "vmemmap_pmd_is_unused" might be a better fit?
>
> What do you think? Do you feel strong about moving the log in there
> regardless of the name?
No, not really. The name is probably worth adjusting, but I
On 3/8/21 10:20 AM, Oscar Salvador wrote:
> On Thu, Mar 04, 2021 at 07:50:10AM -0800, Dave Hansen wrote:
>> On 3/1/21 12:32 AM, Oscar Salvador wrote:
>>> remove_pte_table() is prepared to handle the case where either the
>>> start or the end of the range is not PA
The following commit has been merged into the x86/cleanups branch of tip:
Commit-ID: 09141ec0e4efede4fb5e2aa68cb819fba974325c
Gitweb:
https://git.kernel.org/tip/09141ec0e4efede4fb5e2aa68cb819fba974325c
Author: Dave Hansen
AuthorDate: Thu, 05 Mar 2020 09:47:06 -08:00
On 3/5/20 9:47 AM, Dave Hansen wrote:
> There are two definitions for the TSC deadline MSR in msr-index.h,
> one with an underscore and one without. Axe one of them and move
> all the references over to the other one.
>
> Cc: x...@kernel.org
> Cc: Peter Zijlstra
Better late t
On 3/3/21 8:31 AM, Ben Widawsky wrote:
>> I haven't got to the whole series yet. The real question is whether the
>> first attempt to enforce the preferred mask is a general win. I would
>> argue that it resembles the existing single node preferred memory policy
>> because that one doesn't push hea
the code that you modified in remove_pte_table(). I assume
this was because vmemmap_free() is the only (indirect) caller of
remove_pte_table().
Otherwise, this looks fine to me:
Acked-by: Dave Hansen
On 3/2/21 7:48 AM, Haitao Huang wrote:
>
> Hi Haitao, Jarkko,
>
> Do you have more concrete use case of needing "sgx2" in /proc/cpuinfo?
Kai, please remove it from your series. I'm not hearing any arguments
remotely close enough to what Boris would require in order to keep it.
On 3/4/21 9:02 AM, Dave Hansen wrote:
>> +#define PAGE_UNUSED 0xFD
>> +/*
>> + * The unused vmemmap range, which was not yet memset(PAGE_UNUSED) ranges
>> + * from unused_pmd_start to next PMD_SIZE boundary.
>> + */
>> +static unsigned long unused_p
On 3/3/21 7:03 AM, Jarkko Sakkinen wrote:
> diff --git a/arch/x86/kernel/cpu/sgx/main.c b/arch/x86/kernel/cpu/sgx/main.c
> index 52d070fb4c9a..ed99c60024dc 100644
> --- a/arch/x86/kernel/cpu/sgx/main.c
> +++ b/arch/x86/kernel/cpu/sgx/main.c
> @@ -305,7 +305,6 @@ static void sgx_reclaim_pages(void)
On 3/3/21 7:03 AM, Jarkko Sakkinen wrote:
> If sgx_page_cache_init() fails in the middle, a trivial return
> statement leaves the memory and virtual address space reserved for
> the EPC section unfreed. Fix this by using the same rollback as when
> sgx_page_reclaimer_init() fails.
...
> @@ -
What changed from the last patch?
On 3/3/21 7:03 AM, Jarkko Sakkinen wrote:
> Background
> ==
>
> EPC section is covered by one or more SRAT entries that are associated with
> one and only one PXM (NUMA node). The motivation behind this patch is to
> provide basic elements of building all
On 3/1/21 12:32 AM, Oscar Salvador wrote:
> When the size of a struct page is not multiple of 2MB, sections do
> not span a PMD anymore and so when populating them some parts of the
> PMD will remain unused.
Multiples of 2MB are 2MB, 4MB, 6MB, etc...
I think you meant here that 2MB must be a mult
On 3/1/21 12:32 AM, Oscar Salvador wrote:
> We never get to allocate 1GB pages when mapping the vmemmap range.
> Drop the dead code both for the aligned and unaligned cases and leave
> only the direct map handling.
Could you elaborate a bit on why 1GB pages are never used? It is just
unlikely to
...
> -static void sgx_sanitize_section(struct sgx_epc_section *section)
> +static void sgx_sanitize_section(struct list_head *laundry)
> {
Does this need a better function name now that it's not literally
dealing with sections at *all*?
sgx_sanitize_pages()
perhaps.
> struct sgx
On 3/3/21 7:03 AM, Jarkko Sakkinen wrote:
> Background
> ==
>
> EPC section is covered by one or more SRAT entries that are associated with
> one and only one PXM (NUMA node). The current implementation overheats a
Overheats?
> single NUMA node, because sgx_alloc_epc_page() always starts
From: Dave Hansen
Some method is obviously needed to enable reclaim-based migration.
Just like traditional autonuma, there will be some workloads that
will benefit like workloads with more "static" configurations where
hot pages stay hot and cold pages stay cold. If pages come a
ges()
]
Signed-off-by: Yang Shi
Signed-off-by: Dave Hansen
Cc: David Rientjes
Cc: Huang Ying
Cc: Dan Williams
Cc: David Hildenbrand
Cc: osalvador
--
Changes since 202010:
* remove unused scan-control 'demoted' field
---
b/include/linux/vm_event_item.h |2 ++
From: Dave Hansen
Global reclaim aims to reduce the amount of memory used on
a given node or set of nodes. Migrating pages to another
node serves this purpose.
memcg reclaim is different. Its goal is to reduce the
total memory consumption of the entire memcg, across all
nodes. Migration
context *can* actually be reclaimed, given
current swap space and cgroup limits
anon_should_be_aged() is a much simpler and more preliminary check
which just says whether there is a possibility of future reclaim.
#Signed-off-by: Keith Busch
Cc: Keith Busch
Signed-off-by: Dave Hansen
Cc: Yang Shi
From: Dave Hansen
Anonymous pages are kept on their own LRU(s). These lists could
theoretically always be scanned and maintained. But, without swap,
there is currently nothing the kernel can *do* with the results of a
scanned, sorted LRU for anonymous pages.
A check for '!total_swap_
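The check this entry describes, that the anon LRUs are only worth aging when reclaim could actually do something with the result, can be sketched as a tiny predicate. The names and signature below are invented for illustration; in the series the swap side is the `!total_swap_pages` test and the second condition comes from the demotion path.

```c
#include <stdbool.h>

/* Illustrative sketch: historically the anon LRUs were only worth
 * scanning when swap existed. With reclaim-based demotion, having a
 * migration target is a second reason to age them even swapless. */
static bool anon_can_be_aged(long total_swap_pages, bool has_demotion_target)
{
	return total_swap_pages > 0 || has_demotion_target;
}
```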
The full series is also available here:
https://github.com/hansendc/linux/tree/automigrate-20210304
which also includes some vm.zone_reclaim_mode sysctl ABI fixup
prerequisites.
The meat of this patch is in:
[PATCH 05/10] mm/migrate: demote pages during reclaim
Which also has
From: Dave Hansen
Prepare for the kernel to auto-migrate pages to other memory nodes
with a user defined node migration table. This allows creating single
migration target for each NUMA node to enable the kernel to do NUMA
page migrations instead of simply reclaiming colder pages. A node
with
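The "single migration target for each NUMA node" table described above can be sketched as a plain array indexed by node id. The topology below (two DRAM nodes, two PMEM nodes) and the exact names are assumptions for illustration; the shape, at most one target per node with terminal nodes demoting nowhere, is what the preview describes.

```c
#define MAX_NUMNODES 4
#define NUMA_NO_NODE (-1)

/* Hypothetical topology: DRAM nodes 0/1 demote to PMEM nodes 2/3,
 * and the PMEM nodes are terminal (no further demotion target). */
static int node_demotion[MAX_NUMNODES] = {
	[0] = 2,
	[1] = 3,
	[2] = NUMA_NO_NODE,
	[3] = NUMA_NO_NODE,
};

/* One lookup per reclaim decision: where, if anywhere, do pages from
 * this node get demoted? */
static int next_demotion_node(int node)
{
	return node_demotion[node];
}
```

Keeping it a one-target-per-node array (rather than a mask of candidates) is what makes the hotplug rebuild described elsewhere in the series cheap: the whole table can be recomputed from the topology on any memory or CPU hotplug event.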
account how many pages are reclaimed (demoted) since page
reclaim behavior depends on this. Add *nr_succeeded parameter to make
migrate_pages() return how many pages are demoted successfully for all
cases.
Signed-off-by: Yang Shi
Signed-off-by: Dave Hansen
Cc: David Rientjes
Cc: Huang Ying
Cc
From: Dave Hansen
This is mostly derived from a patch from Yang Shi:
https://lore.kernel.org/linux-mm/1560468577-101178-10-git-send-email-yang@linux.alibaba.com/
Add code to the reclaim path (shrink_page_list()) to "demote" data
to another NUMA node instead of disc
From: Dave Hansen
Reclaim-based migration is attempting to optimize data placement in
memory based on the system topology. If the system changes, so must
the migration ordering.
The implementation is conceptually simple and entirely unoptimized.
On any memory or CPU hotplug events, assume
From: Dave Hansen
When memory fills up on a node, memory contents can be
automatically migrated to another node. The biggest problems are
knowing when to migrate and to where the migration should be
targeted.
The most straightforward way to generate the "to where" list
would be to
On 2/24/21 2:20 PM, Jarkko Sakkinen wrote:
> The use of sgx_va can be later on extended to the following use cases:
>
> - A global VA for reclaimed SECS pages.
> - A global VA for reclaimed VA pages.
...
> arch/x86/kernel/cpu/sgx/driver.c | 3 +-
> arch/x86/kernel/cpu/sgx/encl.c | 180 +++
On 2/16/21 11:58 AM, Alison Schofield wrote:
> arch/x86/kernel/smpboot.c | 2 ++
> 1 file changed, 2 insertions(+)
>
> diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
> index 02813a7f3a7c..de8c598dc3b9 100644
> --- a/arch/x86/kernel/smpboot.c
> +++ b/arch/x86/kernel/smpboot.c
>
This doesn't look like it addresses all of the suggestions that I made
two days ago. Is that coming in v3?
On 2/23/21 11:17 AM, Jarkko Sakkinen wrote:
> Instead, let's just:
>
> 1. Have a global sgx_free_epc_list and remove sgx_epc_section.
>Pages from this are allocated from this in LIFO fashion.
> 2. Instead add struct list_head node_list and use that for node
>associated pages.
> 3. Replace
On 2/21/21 4:54 PM, Dave Hansen wrote:
> Instead of having a for-each-section loop, I'd make it for-each-node ->
> for-each-section. Something like:
>
> for (i = 0; i < num_possible_nodes(); i++) {
> node = (numa_node_id()
> +/* Nodes with one or more EPC sections. */
> +static nodemask_t sgx_numa_mask;
I'd also add that this is for optimization only.
> +/* Array of lists of EPC sections for each NUMA node. */
> +struct list_head *sgx_numa_nodes;
I'd much prefer:
/*
* Array with one list_head for each possible N
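The layout being suggested, one free list per possible NUMA node, indexed by node id, can be sketched in standalone C. A singly-linked list stands in for the kernel's `struct list_head`, and `MAX_NUMNODES`, the function names, and the fallback policy are illustrative assumptions, not the SGX driver's actual code.

```c
#include <stddef.h>

#define MAX_NUMNODES 2

struct epc_page {
	int node;              /* NUMA node of the page's EPC section */
	struct epc_page *next; /* free-list link (kernel uses list_head) */
};

/* One free-list head per possible node, indexed by node id. */
static struct epc_page *free_list[MAX_NUMNODES];

static void free_epc_page(struct epc_page *page)
{
	page->next = free_list[page->node];
	free_list[page->node] = page;
}

/* Prefer the caller's node, then scan the remaining nodes in order;
 * the real allocator's fallback policy may differ. */
static struct epc_page *alloc_epc_page(int preferred_node)
{
	for (int i = 0; i < MAX_NUMNODES; i++) {
		int node = (preferred_node + i) % MAX_NUMNODES;
		struct epc_page *page = free_list[node];

		if (page) {
			free_list[node] = page->next;
			return page;
		}
	}
	return NULL;
}
```

Indexing by node id rather than by EPC section is the point of the review comment: it removes the section-to-node translation from the allocation fast path.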
On 2/17/21 7:48 AM, David Hildenbrand wrote:
> While MADV_DONTNEED and FALLOC_FL_PUNCH_HOLE provide us ways to reliably
> discard memory, there is no generic approach to populate ("preallocate")
> memory.
>
> Although mmap() supports MAP_POPULATE, it is not applicable to the concept
> of sparse me
On 2/12/21 1:47 PM, Andy Lutomirski wrote:
>> What about adding a property to the TD, e.g. via a flag set during TD
>> creation,
>> that controls whether unaccepted accesses cause #VE or are, for all intents
>> and
>> purposes, fatal? That would allow Linux to pursue treating EPT #VEs for
>> pr
On 2/12/21 12:54 PM, Sean Christopherson wrote:
> Ah, I see what you're thinking.
>
> Treating an EPT #VE as fatal was also considered as an option. IIUC it was
> thought that finding every nook and cranny that could access a page, without
> forcing the kernel to pre-accept huge swaths of memory,
On 2/12/21 12:37 PM, Sean Christopherson wrote:
> There needs to be a mechanism for lazy/deferred/on-demand acceptance of pages.
> E.g. pre-accepting every page in a VM with hundreds of GB of memory will be
> ridiculously slow.
>
> #VE is the best option to do that:
>
> - Relatively sane re-ent
On 2/12/21 12:06 PM, Sean Christopherson wrote:
>> What happens if the guest attempts to access a secure GPA that is not
>> ACCEPTed? For example, suppose the VMM does THH.MEM.PAGE.REMOVE on a secure
>> address and the guest accesses it, via instruction fetch or data access.
>> What happens?
> Wel
On 2/5/21 3:38 PM, Kuppuswamy Sathyanarayanan wrote:
> More details on cases where #VE exceptions are allowed/not-allowed:
>
> The #VE exception does not occur in the paranoid entry paths, like NMIs.
> While other operations during an NMI might cause #VE, these are in the
> NMI code that can handle
On 2/11/21 10:59 PM, Aneesh Kumar K.V wrote:
> A read syscall do fail with EFAULT. But we allow read via io_uring
> syscalls. Is that ok?
In short, yes.
As much as I'd like to apply pkey permissions to all accesses, when we
don't have the CPU registers around, we don't have a choice: we have to
Hi Catalin,
I noticed there are some ELF bits for ARM's BTI feature:
GNU_PROPERTY_AARCH64_FEATURE_1_BTI
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/include/uapi/linux/elf.h#n453
There's been talk of needing a similar set of bits on x86 for tagged
pointers (