Re: [PATCH 12/22] x86/fpu: Only write PKRU if it is different from current

2019-02-25 Thread Dave Hansen
On 2/21/19 3:50 AM, Sebastian Andrzej Siewior wrote: > @@ -111,6 +111,12 @@ static inline void __write_pkru(u32 pkru) > { > u32 ecx = 0, edx = 0; > > + /* > + * WRPKRU is relatively expensive compared to RDPKRU. > + * Avoid WRPKRU when it would not change the value. > +
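
The comment in the quoted hunk describes a compare-before-write pattern. Below is a minimal userspace sketch of that idea, not the kernel code: the `sim_*` names and the write counter are hypothetical stand-ins for RDPKRU/WRPKRU, used only to show that a redundant write is skipped.

```c
#include <stdint.h>

/* Hypothetical stand-ins for the PKRU register and its accessors. */
uint32_t sim_pkru;              /* simulated register contents */
unsigned int sim_wrpkru_count;  /* number of simulated WRPKRU executions */

/* Simulated RDPKRU: the cheap read. */
uint32_t sim_read_pkru(void)
{
	return sim_pkru;
}

/* Simulated WRPKRU: the expensive write, counted so it can be observed. */
void sim_wrpkru(uint32_t pkru)
{
	sim_pkru = pkru;
	sim_wrpkru_count++;
}

/* Only issue the expensive write when it would change the value. */
void sim_write_pkru(uint32_t pkru)
{
	if (pkru == sim_read_pkru())
		return;
	sim_wrpkru(pkru);
}
```

Calling `sim_write_pkru()` twice in a row with the same value bumps the counter once. Whether the extra read pays off on real hardware depends on how often the value actually changes, which this sketch does not measure.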

Re: [PATCH 2/6] mm/memblock: make full utilization of numa info

2019-02-25 Thread Dave Hansen
On 2/24/19 4:34 AM, Pingfan Liu wrote: > +/* > + * build_node_order() relies on cpumask_of_node(), hence arch should > + * set up cpumask before calling this func. > + */ Whenever I see comments like this, I wonder what happens if the arch doesn't do this? Do we just crash in early boot in

Re: [PATCH 5/6] x86/numa: push forward the setup of node to cpumask map

2019-02-25 Thread Dave Hansen
On 2/24/19 4:34 AM, Pingfan Liu wrote: > At present the node to cpumask map is set up until the secondary > cpu boot up. But it is too late for the purpose of building node fall back > list at early boot stage. Considering that init_cpu_to_node() already owns > cpu to node map, it is a good place

Re: [PATCH 3/6] x86/numa: define numa_init_array() conditional on CONFIG_NUMA

2019-02-25 Thread Dave Hansen
On 2/24/19 4:34 AM, Pingfan Liu wrote: > +#ifdef CONFIG_NUMA > /* > * There are unfortunately some poorly designed mainboards around that > * only connect memory to a single CPU. This breaks the 1:1 cpu->node > @@ -618,6 +619,9 @@ static void __init numa_init_array(void) > rr =

Re: [PATCH v10 04/12] mm, arm64: untag user pointers passed to memory syscalls

2019-02-22 Thread Dave Hansen
On 2/22/19 4:53 AM, Andrey Konovalov wrote: > --- a/mm/mprotect.c > +++ b/mm/mprotect.c > @@ -578,6 +578,7 @@ static int do_mprotect_pkey(unsigned long start, size_t > len, > SYSCALL_DEFINE3(mprotect, unsigned long, start, size_t, len, > unsigned long, prot) > { > + start =

Re: [PATCH v10 07/12] fs, arm64: untag user pointers in fs/userfaultfd.c

2019-02-22 Thread Dave Hansen
On 2/22/19 4:53 AM, Andrey Konovalov wrote: > userfaultfd_register() and userfaultfd_unregister() use provided user > pointers for vma lookups, which can only be done with untagged pointers. So, we have to patch all these sites before the tagged values get to the point of hitting the vma lookup

Re: [PATCH v10 06/12] fs, arm64: untag user pointers in copy_mount_options

2019-02-22 Thread Dave Hansen
On 2/22/19 4:53 AM, Andrey Konovalov wrote: > --- a/fs/namespace.c > +++ b/fs/namespace.c > @@ -2730,7 +2730,7 @@ void *copy_mount_options(const void __user * data) >* the remainder of the page. >*/ > /* copy_from_user cannot cross TASK_SIZE ! */ > - size = TASK_SIZE -

Re: [PATCH v10 00/12] arm64: untag user pointers passed to the kernel

2019-02-22 Thread Dave Hansen
On 2/22/19 4:53 AM, Andrey Konovalov wrote: > The following testing approaches have been taken to find potential issues > with user pointer untagging: > > 1. Static testing (with sparse [3] and separately with a custom static >analyzer based on Clang) to track casts of __user pointers to

Re: question about page tables in DAX/FS/PMEM case

2019-02-21 Thread Dave Hansen
On 2/21/19 2:58 PM, Larry Bassel wrote: > AFAIK there is no hardware benefit from sharing the page table > directory within different page table. So the only benefit is the > amount of memory we save. The hardware benefit from schemes like this is that the CPU caches are better utilized. If two

Re: [LSF/MM TOPIC] Page Cache Flexibility for NVM

2019-02-21 Thread Dave Hansen
On 2/21/19 3:11 PM, Adam Manzanares wrote: > I am proposing that as an alternative to using NVMs as a NUMA node > we expose the NVM through the page cache or a viable alternative and > have userspace applications mmap the NVM and hand out memory with > their favorite userspace memory allocator.

Re: [PATCHv6 07/10] acpi/hmat: Register processor domain to its memory

2019-02-20 Thread Dave Hansen
On 2/20/19 2:02 PM, Rafael J. Wysocki wrote: >> diff --git a/drivers/acpi/hmat/Kconfig b/drivers/acpi/hmat/Kconfig >> index c9637e2e7514..08e972ead159 100644 >> --- a/drivers/acpi/hmat/Kconfig >> +++ b/drivers/acpi/hmat/Kconfig >> @@ -2,6 +2,7 @@ >> config ACPI_HMAT >> bool "ACPI

Re: [PATCH v3] hugetlb: allow to free gigantic pages regardless of the configuration

2019-02-15 Thread Dave Hansen
> -#if (defined(CONFIG_MEMORY_ISOLATION) && defined(CONFIG_COMPACTION)) || > defined(CONFIG_CMA) > +#ifdef CONFIG_CONTIG_ALLOC > /* The below functions must be run on a range from a single zone. */ > extern int alloc_contig_range(unsigned long start, unsigned long end, >

Re: [PATCH 13/13] x86: mm: Convert dump_pagetables to use walk_page_range

2019-02-15 Thread Dave Hansen
On 2/15/19 9:02 AM, Steven Price wrote: > arch/x86/mm/dump_pagetables.c | 281 ++ > 1 file changed, 146 insertions(+), 135 deletions(-) I'll look through this in more detail in a bit. But, I'm a bit bummed out by the diffstat. When I see patches add a bunch of

Re: [RFC PATCH v8 13/14] xpfo, mm: Defer TLB flushes for non-current CPUs (x86 only)

2019-02-14 Thread Dave Hansen
> #endif > + > + /* If there is a pending TLB flush for this CPU due to XPFO > + * flush, do it now. > + */ Don't forget CodingStyle in all this, please. > + if (cpumask_test_and_clear_cpu(cpu, _xpfo_flush)) { > + count_vm_tlb_event(NR_TLB_REMOTE_FLUSH_RECEIVED); >

Re: [PATCH v2] hugetlb: allow to free gigantic pages regardless of the configuration

2019-02-13 Thread Dave Hansen
> -#if (defined(CONFIG_MEMORY_ISOLATION) && defined(CONFIG_COMPACTION)) || > defined(CONFIG_CMA) > +#ifdef CONFIG_COMPACTION_CORE > static __init int gigantic_pages_init(void) > { > /* With compaction or CMA we can allocate gigantic pages at runtime */ > diff --git a/fs/Kconfig

Re: [PATCH v2 06/10] x86/cpu: Add Icelake to Intel family

2019-02-13 Thread Dave Hansen
On 2/13/19 8:35 AM, Bhardwaj, Rajneesh wrote: > I sure did, perhaps it wasn't clear in my response. I can remove > coreboot link in next version but please clarify whether i should keep > other link that i mentioned or just keep the commit without any link? I think we're hearing loud and clear

Re: [PATCH v3 2/2] mm: be more verbose about zonelist initialization

2019-02-13 Thread Dave Hansen
On 2/13/19 1:43 AM, Michal Hocko wrote: > > We have seen several bugs where zonelists have not been initialized > properly and it is not really straightforward to track those bugs down. > One way to help a bit at least is to dump zonelists of each node when > they are (re)initialized. Were you

Re: [PATCH 5/5] dax: "Hotplug" persistent memory for use like normal RAM

2019-02-12 Thread Dave Hansen
On 2/9/19 3:00 AM, Brice Goglin wrote: > I've used your patches on fake hardware (memmap=xx!yy) with an older > nvdimm-pending branch (without Keith's patches). It worked fine. This > time I am running on real Intel hardware. Any idea where to look ? I've run them on real Intel hardware too.

Re: [PATCH v3 08/10] x86/setcpuid: Add kernel option setcpuid

2019-02-12 Thread Dave Hansen
On 2/12/19 8:48 AM, Peter Zijlstra wrote: >>> IFF clearcpuid lives, it should also employ CPUID faulting and clear it >>> for userspace too. >> We have it already, > D'0h right, I thought that was introduced here, but that's just > extending it to multiple arguments. ... and making it take

Re: [RFC PATCH 3/4] kvm: Add guest side support for free memory hints

2019-02-11 Thread Dave Hansen
On 2/11/19 9:58 AM, Michael S. Tsirkin wrote: >>> Really it seems we want a virtio ring so we can pass a batch of these. >>> E.g. 256 entries, 2M each - that's more like it. >> That only makes sense for a system that's doing high-frequency, >> discontiguous frees of 2M pages. Right now, a 2M

Re: [RFC PATCH 3/4] kvm: Add guest side support for free memory hints

2019-02-11 Thread Dave Hansen
On 2/9/19 4:49 PM, Michael S. Tsirkin wrote: > On Mon, Feb 04, 2019 at 10:15:52AM -0800, Alexander Duyck wrote: >> From: Alexander Duyck >> >> Add guest support for providing free memory hints to the KVM hypervisor for >> freed pages huge TLB size or larger. I am restricting the size to >> huge

Re: [RFC PATCH 2/4] kvm: Add host side support for free memory hints

2019-02-11 Thread Dave Hansen
On 2/9/19 4:44 PM, Michael S. Tsirkin wrote: > So the policy should not leak into host/guest interface. > Instead it is better to just keep the pages pinned and > ignore the hint for now. It does seem a bit silly to have guests forever hinting about freed memory when the host never has a hope of

Re: [PATCH 1/3] x86/mpx: tweak header name

2019-02-08 Thread Dave Hansen
On 2/8/19 1:43 PM, Michael S. Tsirkin wrote: > On Fri, Feb 08, 2019 at 04:42:39PM +0100, Borislav Petkov wrote: >> On Fri, Feb 08, 2019 at 01:02:53AM -0500, Michael S. Tsirkin wrote: >>> Use linux/mman.h to make sure we get all mmap flags we need. >> Why, asm/mman.h is not enough or is this fixing

Re: [GIT PULL] x86/mm changes for v4.21

2019-02-06 Thread Dave Hansen
On 2/6/19 4:17 PM, Luck, Tony wrote: > [ 93.491692] RAX: RBX: RCX: > 99623f2c3f70 > [ 93.499658] RDX: 2e6b58da0121 RSI: RDI: > 7fff9981feeab000 ... > Potentially the problem might be a non-canonical address passed down > by the

Re: [PATCH v3 08/10] x86/setcpuid: Add kernel option setcpuid

2019-02-05 Thread Dave Hansen
On 2/4/19 10:18 PM, Borislav Petkov wrote: > On Mon, Feb 04, 2019 at 03:24:23PM -0800, Dave Hansen wrote: >> Actually, there's one part of all this that I forgot. Will split lock >> detection be enumerated _widely_? > > You never know what users will do. The moment i

Re: [PATCH v3 08/10] x86/setcpuid: Add kernel option setcpuid

2019-02-05 Thread Dave Hansen
On 2/5/19 12:51 AM, Peter Zijlstra wrote: > On Mon, Feb 04, 2019 at 01:09:12PM -0800, Fenghua Yu wrote: > >> Intel SDM published TODAY does have IA32_CORE_CAPABILITY MSR enumeration >> bit CPUID.0x7.0:EDX[30] now. Please check today's SDM for the bit: >>

Re: [PATCH v3 08/10] x86/setcpuid: Add kernel option setcpuid

2019-02-05 Thread Dave Hansen
On 2/5/19 12:48 AM, Peter Zijlstra wrote: > On Mon, Feb 04, 2019 at 12:46:30PM -0800, Dave Hansen wrote: >> So, the compromise we reached in this case is that Intel will fully >> document the future silicon architecture, and then write the kernel >> implementation to _that_.

Re: [PATCH v3 08/10] x86/setcpuid: Add kernel option setcpuid

2019-02-04 Thread Dave Hansen
On 2/4/19 1:40 PM, Borislav Petkov wrote: >> Then, for the weirdo deployments where this feature is not enumerated, >> we have the setcpuid= to fake the enumeration in software. >> >> The reason I'm pushing for setcpuid= instead of a one-off is that I >> don't expect this to be the last time Intel

Re: [PATCH v3 08/10] x86/setcpuid: Add kernel option setcpuid

2019-02-04 Thread Dave Hansen
On 2/4/19 11:57 AM, Borislav Petkov wrote: > On Mon, Feb 04, 2019 at 11:05:52AM -0800, Dave Hansen wrote: >> But, we're not being very persuasive because we kinda forgot about the >> "if and only if" condition that you mentioned. > But why does it have to be a cmdline

Re: [RFC PATCH 3/4] kvm: Add guest side support for free memory hints

2019-02-04 Thread Dave Hansen
On 2/4/19 10:15 AM, Alexander Duyck wrote: > +#ifdef CONFIG_KVM_GUEST > +#include > +extern struct static_key_false pv_free_page_hint_enabled; > + > +#define HAVE_ARCH_FREE_PAGE > +void __arch_free_page(struct page *page, unsigned int order); > +static inline void arch_free_page(struct page

Re: [RFC PATCH 4/4] mm: Add merge page notifier

2019-02-04 Thread Dave Hansen
> +void __arch_merge_page(struct zone *zone, struct page *page, > +unsigned int order) > +{ > + /* > + * The merging logic has merged a set of buddies up to the > + * KVM_PV_UNUSED_PAGE_HINT_MIN_ORDER. Since that is the case, take > + * advantage of this

Re: [PATCH v3 08/10] x86/setcpuid: Add kernel option setcpuid

2019-02-04 Thread Dave Hansen
On 2/4/19 9:49 AM, Thomas Gleixner wrote: > On Fri, 1 Feb 2019, Fenghua Yu wrote: >> This option behaves like existing kernel option clearcpuid. > > No it does NOT. clearcpuid allows to disable things. > > This allows to enable random CPUID bits without any sanity checking. Not > going to

Re: [PATCH v3 09/10] x86/split_lock: Define #AC for split lock feature

2019-02-04 Thread Dave Hansen
On 2/4/19 10:45 AM, Fenghua Yu wrote: > On Mon, Feb 04, 2019 at 10:41:40AM -0800, Dave Hansen wrote: >> On 2/1/19 9:14 PM, Fenghua Yu wrote: >>> --- a/arch/x86/include/asm/cpufeatures.h >>> +++ b/arch/x86/include/asm/cpufeatures.h >>> @@ -221,6 +2

Re: [PATCH v3 09/10] x86/split_lock: Define #AC for split lock feature

2019-02-04 Thread Dave Hansen
On 2/1/19 9:14 PM, Fenghua Yu wrote: > --- a/arch/x86/include/asm/cpufeatures.h > +++ b/arch/x86/include/asm/cpufeatures.h > @@ -221,6 +221,7 @@ > #define X86_FEATURE_ZEN ( 7*32+28) /* "" CPU is AMD > family 0x17 (Zen) */ > #define X86_FEATURE_L1TF_PTEINV (

Re: [PATCH 0/5] [v4] Allow persistent memory to be used like normal RAM

2019-01-28 Thread Dave Hansen
On 1/28/19 3:09 AM, Balbir Singh wrote: >> This is intended for Intel-style NVDIMMs (aka. Intel Optane DC >> persistent memory) NVDIMMs. These DIMMs are physically persistent, >> more akin to flash than traditional RAM. They are also expected to >> be more cost-effective than using RAM, which is

Re: [PATCH 2/5] mm/resource: move HMM pr_debug() deeper into resource code

2019-01-25 Thread Dave Hansen
On 1/25/19 1:18 PM, Bjorn Helgaas wrote: > On Thu, Jan 24, 2019 at 5:21 PM Dave Hansen > wrote: >> diff -puN kernel/resource.c~move-request_region-check kernel/resource.c >> --- a/kernel/resource.c~move-request_region-check 2019-01-24 >> 15:13:14.453199539 -0800

Re: [PATCH 1/5] mm/resource: return real error codes from walk failures

2019-01-25 Thread Dave Hansen
On 1/25/19 1:02 PM, Bjorn Helgaas wrote: >> @@ -453,7 +453,7 @@ int walk_system_ram_range(unsigned long >> unsigned long flags; >> struct resource res; >> unsigned long pfn, end_pfn; >> - int ret = -1; >> + int ret = -EINVAL; > Can you either make a similar

[PATCH 2/5] mm/resource: move HMM pr_debug() deeper into resource code

2019-01-24 Thread Dave Hansen
From: Dave Hansen HMM consumes physical address space for its own use, even though nothing is mapped or accessible there. It uses a special resource description (IORES_DESC_DEVICE_PRIVATE_MEMORY) to uniquely identify these areas. When HMM consumes address space, it makes a best guess about

[PATCH 3/5] mm/memory-hotplug: allow memory resources to be children

2019-01-24 Thread Dave Hansen
From: Dave Hansen The mm/resource.c code is used to manage the physical address space. The current resource configuration can be viewed in /proc/iomem. An example of this is at the bottom of this description. The nvdimm subsystem "owns" the physical address resources which map to

[PATCH 1/5] mm/resource: return real error codes from walk failures

2019-01-24 Thread Dave Hansen
From: Dave Hansen walk_system_ram_range() can return an error code either because *it* failed, or because the 'func' that it calls returned an error. The memory hotplug does the following: ret = walk_system_ram_range(..., func); if (ret) return ret; and 'ret
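
The two failure sources described above can be shown with a small userspace sketch. This is an illustration under assumptions, not the kernel's walk_system_ram_range(): `walk_ranges()`, `struct ram_range`, and both callbacks are hypothetical names. The point is that the walker's own default is a real errno value (-EINVAL, as in the patch) instead of a bare -1, which a caller treating it as an errno would read as -EPERM.

```c
#include <errno.h>

/* Hypothetical description of one RAM range. */
struct ram_range {
	unsigned long start_pfn;
	unsigned long nr_pages;
};

typedef int (*walk_fn)(unsigned long start_pfn, unsigned long nr_pages,
		       void *arg);

/*
 * Hypothetical walker: calls func() on each range and stops at the first
 * non-zero return. If no range is visited at all, the walker itself
 * reports -EINVAL.
 */
int walk_ranges(const struct ram_range *ranges, int nr, walk_fn func,
		void *arg)
{
	int ret = -EINVAL;
	int i;

	for (i = 0; i < nr; i++) {
		ret = func(ranges[i].start_pfn, ranges[i].nr_pages, arg);
		if (ret)
			break;
	}
	return ret;
}

/* Example callback that succeeds and accumulates a page count in *arg. */
int count_pages(unsigned long start_pfn, unsigned long nr_pages, void *arg)
{
	(void)start_pfn;
	*(unsigned long *)arg += nr_pages;
	return 0;
}

/* Example callback that always fails with its own error code. */
int fail_busy(unsigned long start_pfn, unsigned long nr_pages, void *arg)
{
	(void)start_pfn;
	(void)nr_pages;
	(void)arg;
	return -EBUSY;
}
```

With this shape, a caller that simply does `if (ret) return ret;` passes on either the callback's error or the walker's -EINVAL, and both are meaningful errno values.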

[PATCH 5/5] dax: "Hotplug" persistent memory for use like normal RAM

2019-01-24 Thread Dave Hansen
From: Dave Hansen This is intended for use with NVDIMMs that are physically persistent (physically like flash) so that they can be used as a cost-effective RAM replacement. Intel Optane DC persistent memory is one implementation of this kind of NVDIMM. Currently, a persistent memory region

[PATCH 4/5] dax/kmem: let walk_system_ram_range() search child resources

2019-01-24 Thread Dave Hansen
From: Dave Hansen In the process of onlining memory, we use walk_system_ram_range() to find the actual RAM areas inside of the area being onlined. However, it currently only finds memory resources which are "top-level" iomem_resources. Children are not currently searched wh

[PATCH 0/5] [v4] Allow persistent memory to be used like normal RAM

2019-01-24 Thread Dave Hansen
v3 spurred a bunch of really good discussion. Thanks to everybody that made comments and suggestions! I would still love some Acks on this from the folks on cc, even if it is on just the patch touching your area. Note: these are based on commit d2f33c19644 in:

Re: [RFC PATCH] x86, numa: always initialize all possible nodes

2019-01-24 Thread Dave Hansen
On 1/24/19 6:17 AM, Michal Hocko wrote: > and nr_cpus set to 4. The underlying reason is that the device is bound > to node 2 which doesn't have any memory and init_cpu_to_node only > initializes memory-less nodes for possible cpus which nr_cpus restricts. > This in turn means that proper zonelists

Re: [PATCH 2/4] mm/memory-hotplug: allow memory resources to be children

2019-01-23 Thread Dave Hansen
On 1/16/19 3:38 PM, Jerome Glisse wrote: > So right now i would rather that we keep properly reporting this > hazard so that at least we know it failed because of that. This > also include making sure that we can not register private memory > as a child of an un-busy resource that does exist but

Re: [PATCH 12/22] x86/fpu: Only write PKRU if it is different from current

2019-01-23 Thread Dave Hansen
On 1/9/19 3:47 AM, Sebastian Andrzej Siewior wrote: > +static inline void __write_pkru(u32 pkru) > +{ > + /* > + * Writting PKRU is expensive. Only write the PKRU value if it is > + * different from the current one. > + */ I'd say: WRPKRU is relatively expensive

Re: [PATCH 18/22] x86/fpu: Update xstate's PKRU value on write_pkru()

2019-01-23 Thread Dave Hansen
On 1/9/19 3:47 AM, Sebastian Andrzej Siewior wrote: > + pk = get_xsave_addr(>thread.fpu.state.xsave, XFEATURE_PKRU); > + /* > + * The PKRU value in xstate needs to be in sync with the value that is > + * written to the CPU. The FPU restore on return to userland would > + *

Re: [PATCH 05/22] x86/fpu: Remove fpu->initialized usage in copy_fpstate_to_sigframe()

2019-01-18 Thread Dave Hansen
On 1/18/19 1:37 PM, Sebastian Andrzej Siewior wrote: > On 2019-01-18 13:17:28 [-0800], Dave Hansen wrote: >> On 1/18/19 1:14 PM, Sebastian Andrzej Siewior wrote: >>> The kernel saves task's FPU registers on user's signal stack before >>> entering the signal handler.

Re: [PATCH 05/22] x86/fpu: Remove fpu->initialized usage in copy_fpstate_to_sigframe()

2019-01-18 Thread Dave Hansen
On 1/18/19 1:14 PM, Sebastian Andrzej Siewior wrote: > The kernel saves task's FPU registers on user's signal stack before > entering the signal handler. Can we avoid that and have in-kernel memory > for that? Does someone rely on the FPU registers from the task in the > signal handler? This is

Re: [PATCH 2/4] mm/memory-hotplug: allow memory resources to be children

2019-01-18 Thread Dave Hansen
On 1/16/19 11:16 AM, Jerome Glisse wrote: >> We *could* also simply truncate the existing top-level >> "Persistent Memory" resource and take over the released address >> space. But, this means that if we ever decide to hot-unplug the >> "RAM" and give it back, we need to recreate the original

Re: [PATCH 4/4] dax: "Hotplug" persistent memory for use like normal RAM

2019-01-18 Thread Dave Hansen
On 1/17/19 11:47 PM, Yanmin Zhang wrote: > a chance for kernel to allocate PMEM as DMA buffer. > Some super speed devices like 10Giga NIC, USB (SSIC connecting modem), > might not work well if DMA buffer is in PMEM as it's slower than DRAM. > > Should your patchset consider it? No, I don't think

Re: [PATCH 0/4] Allow persistent memory to be used like normal RAM

2019-01-17 Thread Dave Hansen
On 1/17/19 8:29 AM, Jeff Moyer wrote: >> Persistent memory is cool. But, currently, you have to rewrite >> your applications to use it. Wouldn't it be cool if you could >> just have it show up in your system like normal RAM and get to >> it like a slow blob of memory? Well... have I got the

Re: [PATCH 4/4] dax: "Hotplug" persistent memory for use like normal RAM

2019-01-17 Thread Dave Hansen
On 1/17/19 12:19 AM, Yanmin Zhang wrote: >> > I didn't try pmem and I am wondering it's slower than DRAM. > Should a flag, such like _GFP_PMEM, be added to distinguish it from > DRAM? Absolutely not. :) We already have performance-differentiated memory, and lots of ways to enumerate and select

Re: [PATCH 2/4] mm/memory-hotplug: allow memory resources to be children

2019-01-16 Thread Dave Hansen
On 1/16/19 11:16 AM, Jerome Glisse wrote: >> We also rework the old error message a bit since we do not get >> the conflicting entry back: only an indication that we *had* a >> conflict. > We should keep the device private check (moving it in __request_region) > as device private can try to

Re: [PATCH 4/4] dax: "Hotplug" persistent memory for use like normal RAM

2019-01-16 Thread Dave Hansen
On 1/16/19 1:16 PM, Bjorn Helgaas wrote: >> + /* >> +* Set flags appropriate for System RAM. Leave ..._BUSY clear >> +* so that add_memory() can add a child resource. >> +*/ >> + new_res->flags = IORESOURCE_SYSTEM_RAM; > IIUC, new_res->flags was set to

Re: [PATCH 4/4] dax: "Hotplug" persistent memory for use like normal RAM

2019-01-16 Thread Dave Hansen
On 1/16/19 1:16 PM, Bjorn Helgaas wrote: > On Wed, Jan 16, 2019 at 12:25 PM Dave Hansen > wrote: >> From: Dave Hansen >> Currently, a persistent memory region is "owned" by a device driver, >> either the "Direct DAX" or "Filesystem DAX" drive

[PATCH 2/4] mm/memory-hotplug: allow memory resources to be children

2019-01-16 Thread Dave Hansen
From: Dave Hansen The mm/resource.c code is used to manage the physical address space. We can view the current resource configuration in /proc/iomem. An example of this is at the bottom of this description. The nvdimm subsystem "owns" the physical address resources which map to

[PATCH 1/4] mm/resource: return real error codes from walk failures

2019-01-16 Thread Dave Hansen
From: Dave Hansen walk_system_ram_range() can return an error code either because *it* failed, or because the 'func' that it calls returned an error. The memory hotplug does the following: ret = walk_system_ram_range(..., func); if (ret) return ret; and 'ret

[PATCH 0/4] Allow persistent memory to be used like normal RAM

2019-01-16 Thread Dave Hansen
I would like to get this queued up to get merged. Since most of the churn is in the nvdimm code, and it also depends on some refactoring that only exists in the nvdimm tree, it seems like putting it in *via* the nvdimm tree is the best path. But, this series makes non-trivial changes to the

[PATCH 4/4] dax: "Hotplug" persistent memory for use like normal RAM

2019-01-16 Thread Dave Hansen
From: Dave Hansen Currently, a persistent memory region is "owned" by a device driver, either the "Direct DAX" or "Filesystem DAX" drivers. These drivers allow applications to explicitly use persistent memory, generally by being modified to use special, new li

[PATCH 3/4] dax/kmem: let walk_system_ram_range() search child resources

2019-01-16 Thread Dave Hansen
From: Dave Hansen In the process of onlining memory, we use walk_system_ram_range() to find the actual RAM areas inside of the area being onlined. However, it currently only finds memory resources which are "top-level" iomem_resources. Children are not currently searched wh

Re: [PATCH v6] x86: load FPU registers on return to userland

2019-01-15 Thread Dave Hansen
On 1/15/19 12:26 PM, Andy Lutomirski wrote: > I don't think we'd ever want kernel_fpu_end() to restore anything, > right? I'm a bit confused as to when this optimization would actually > be useful. Using AVX-512 as an example... Let's say there was AVX-512 state, and a kernel_fpu_begin() user

Re: [PATCH v6] x86: load FPU registers on return to userland

2019-01-15 Thread Dave Hansen
On 1/15/19 4:44 AM, David Laight wrote: > Once this is done it might be worth while adding a parameter to > kernel_fpu_begin() to request the registers only when they don't > need saving. > This would benefit code paths where the gains are reasonable but not massive. > > The return value from

[tip:x86/urgent] x86/selftests/pkeys: Fork() to check for state being preserved

2019-01-15 Thread tip-bot for Dave Hansen
Commit-ID: e1812933b17be7814f51b6c310c5d1ced7a9a5f5 Gitweb: https://git.kernel.org/tip/e1812933b17be7814f51b6c310c5d1ced7a9a5f5 Author: Dave Hansen AuthorDate: Wed, 2 Jan 2019 13:56:57 -0800 Committer: Thomas Gleixner CommitDate: Tue, 15 Jan 2019 10:33:45 +0100 x86/selftests/pkeys

[tip:x86/urgent] x86/pkeys: Properly copy pkey state at fork()

2019-01-15 Thread tip-bot for Dave Hansen
Commit-ID: a31e184e4f69965c99c04cc5eb8a4920e0c63737 Gitweb: https://git.kernel.org/tip/a31e184e4f69965c99c04cc5eb8a4920e0c63737 Author: Dave Hansen AuthorDate: Wed, 2 Jan 2019 13:56:55 -0800 Committer: Thomas Gleixner CommitDate: Tue, 15 Jan 2019 10:33:45 +0100 x86/pkeys: Properly

Re: [PATCHv2 6/7] x86/mm: remove bottom-up allocation style for x86_64

2019-01-14 Thread Dave Hansen
On 1/10/19 9:12 PM, Pingfan Liu wrote: > Although kaslr-kernel can avoid to stain the movable node. [1] Can you explain what staining is, or perhaps try to use some more standard nomenclature? There are exactly 0 instances of the word "stain" in arch/x86/ or mm/. > But the > pgtable can still

Re: [PATCHv2 2/7] acpi: change the topo of acpi_table_upgrade()

2019-01-14 Thread Dave Hansen
On 1/10/19 9:12 PM, Pingfan Liu wrote: > The current acpi_table_upgrade() relies on initrd_start, but this var is "var" meaning variable? Could you please go back and try to ensure you spell out all the words you are intending to write? I think "topo" probably means "topology", but it's a

Re: [PATCHv2 0/7] x86_64/mm: remove bottom-up allocation style by pushing forward the parsing of mem hotplug info

2019-01-14 Thread Dave Hansen
On 1/10/19 9:12 PM, Pingfan Liu wrote: > Background > When kaslr kernel can be guaranteed to sit inside unmovable node > after [1]. What does this "[1]" refer to? Also, can you clarify your terminology here a bit. By "kaslr kernel", do you mean the base address? > But if kaslr kernel is

Re: [RFC PATCH v7 00/16] Add support for eXclusive Page Frame Ownership

2019-01-11 Thread Dave Hansen
>> The second process could easily have the page's old TLB entry. It could >> abuse that entry as long as that CPU doesn't context switch >> (switch_mm_irqs_off()) or otherwise flush the TLB entry. > > That is an interesting scenario. Working through this scenario, physmap > TLB entry for a page

Re: [RFC PATCH v7 00/16] Add support for eXclusive Page Frame Ownership

2019-01-10 Thread Dave Hansen
First of all, thanks for picking this back up. It looks to be going in a very positive direction! On 1/10/19 1:09 PM, Khalid Aziz wrote: > I implemented a solution to reduce performance penalty and > that has had large impact. When XPFO code flushes stale TLB entries, > it does so for all CPUs

Re: [RFC PATCH 4/4] x86/mm: remove bottom-up allocation style for x86_64

2019-01-08 Thread Dave Hansen
On 1/7/19 10:13 PM, Pingfan Liu wrote: > On Tue, Jan 8, 2019 at 1:42 AM Dave Hansen wrote: >> Why is this 0x10 open-coded? Why is this needed *now*? >> > > Memory under 1MB should be used by BIOS. For x86_64, after > e820__memblock_setup(), the memblock allocato

Re: [PATCH] drop_caches: Allow unmapping pages

2019-01-07 Thread Dave Hansen
On 1/7/19 6:15 AM, Matthew Wilcox wrote: > You're going to get data corruption doing this. try_to_unmap_one() > does: > > /* Move the dirty bit to the page. Now the pte is gone. */ > if (pte_dirty(pteval)) > set_page_dirty(page); > > so PageDirty() can be

Re: [RFC PATCH 4/4] x86/mm: remove bottom-up allocation style for x86_64

2019-01-07 Thread Dave Hansen
On 1/7/19 12:24 AM, Pingfan Liu wrote: > There are two achievements by this patch. > -1st. keep the subtree of pgtable away from movable node. > Background about the defect of the current bottom-up allocation style, take > the following scenario: > | unmovable node | movable node

Re: [RFC PATCH 2/4] x86/setup: parse acpi to get hotplug info before init_mem_mapping()

2019-01-07 Thread Dave Hansen
On 1/7/19 12:24 AM, Pingfan Liu wrote: > At present, memblock bottom-up allocation can help us against stamping over > movable node in very high probability. Is this what you are fixing? Making a "high probability", a certainty? Is this the problem? > diff --git a/arch/x86/kernel/setup.c

Re: [RFC PATCH 0/4] x86_64/mm: remove bottom-up allocation style by pushing forward the parsing of mem hotplug info

2019-01-07 Thread Dave Hansen
On 1/7/19 12:24 AM, Pingfan Liu wrote: > Background about the defect of the current bottom-up allocation style, take > the following scenario: > | unmovable node | movable node | > | kaslr-kernel |subtree of pgtable for phy<->virt | > > Although kaslr-kernel

Re: [PATCH 2/2] x86/selftests/pkeys: fork() to check for state being preserved

2019-01-03 Thread Dave Hansen
On 1/3/19 5:52 AM, Sasha Levin wrote: > This commit has been processed because it contains a -stable tag. > The stable tag indicates that it's relevant for the following trees: all > > The bot has tested the following trees: v4.20.0, v4.19.13, v4.14.91, > v4.9.148, v4.4.169, v3.18.131, > >

Re: [PATCH 1/2] x86/pkeys: properly copy pkey state at fork()

2019-01-03 Thread Dave Hansen
On 1/3/19 5:52 AM, Sasha Levin wrote: > This commit has been processed because it contains a "Fixes:" tag, > fixing commit: e8c24d3a23a4 x86/pkeys: Allocation/free syscalls. > > The bot has tested the following trees: v4.20.0, v4.19.13, v4.14.91, > v4.9.148, > > v4.20.0: Build OK! > v4.19.13:

[PATCH 0/2] x86/mm/pkeys: fix user-visible pkey state destruction at fork()

2019-01-02 Thread Dave Hansen
Hi x86 maintainers, This is an important fix that I believe needs to be merged for 4.21. Without it, applications calling fork() can potentially double-allocate a protection key, causing lots of strange problems. Thomas's Reviewed-by is on the actual fix, but not the selftest. I would also

[PATCH 1/2] x86/pkeys: properly copy pkey state at fork()

2019-01-02 Thread Dave Hansen
From: Dave Hansen Memory protection key behavior should be the same in a child as it was in the parent before a fork. But, there is a bug that resets the state in the child at fork instead of preserving it. Our creation of new mm's is a bit convoluted. At fork(), the code does: 1

[PATCH 2/2] x86/selftests/pkeys: fork() to check for state being preserved

2019-01-02 Thread Dave Hansen
From: Dave Hansen There was a bug where the per-mm pkey state was not being preserved across fork() in the child. fork() is performed in the pkey selftests, but all of our pkey activity is performed in the parent. The child does not perform any actions sensitive to pkey state. To make

Re: [RFC][PATCH v2 00/21] PMEM NUMA node and hotness accounting/migration

2019-01-02 Thread Dave Hansen
On 12/28/18 12:41 AM, Michal Hocko wrote: >> >> It can be done in kernel page reclaim path, near the anonymous page >> swap out point. Instead of swapping out, we now have the option to >> migrate cold pages to PMEM NUMA nodes. > OK, this makes sense to me except I am not sure this is something

Re: [RFC][PATCH v2 11/21] kvm: allocate page table pages from DRAM

2019-01-02 Thread Dave Hansen
On 12/26/18 5:14 AM, Fengguang Wu wrote: > +static unsigned long __get_dram_free_pages(gfp_t gfp_mask) > +{ > + struct page *page; > + > + page = __alloc_pages(GFP_KERNEL_ACCOUNT, 0, numa_node_id()); > + if (!page) > +return 0; > + return (unsigned long)

Re: [PATCH] x86/cpu: sort cpuinfo flags

2018-12-21 Thread Dave Hansen
On 12/21/18 5:04 AM, Borislav Petkov wrote: > > $ grep -m 1 flags /proc/cpuinfo | tr " " "\n" | sort | xargs > > and there probably is even a simpler way to do that. > > Or add a shell alias for that or a small script or ... I don't always look at these through the shell. I got a screenshot

Re: "x86: Remove Intel MPX" is wrong (Re: linux-next: manual merge of the kvm tree with the tip tree)

2018-12-19 Thread Dave Hansen
On 12/19/18 1:00 PM, Paolo Bonzini wrote: > On 19/12/18 21:54, Dave Hansen wrote: >> I should have called this out in the changelog, but I removed *all* the >> support because I assumed that guests don't need MPX because no other OS >> supported it that I know of. > >

Re: "x86: Remove Intel MPX" is wrong (Re: linux-next: manual merge of the kvm tree with the tip tree)

2018-12-19 Thread Dave Hansen
On 12/19/18 12:32 PM, Paolo Bonzini wrote: > On 19/12/18 05:12, Stephen Rothwell wrote: >> I fixed it up (the former removed some code updated by the latter, so I >> did that) and can carry the fix as necessary. This is now fixed as far as >> linux-next is concerned, but any non trivial conflicts

[PATCH] x86/cpu: sort cpuinfo flags

2018-12-19 Thread Dave Hansen
From: Dave Hansen I frequently find myself contemplating my life choices as I try to find 3-character entries in the 1,000-character, unsorted "flags:" field of /proc/cpuinfo. Sort that field, giving me hours back in my day. This eats up ~1200 bytes (NCAPINTS*2*32) of space for

Re: [PATCH v6 1/3] x86/fpu: track AVX-512 usage of tasks

2018-12-18 Thread Dave Hansen
On 12/18/18 1:38 PM, Andi Kleen wrote: >> I misunderstood, you mean 32bit kernel, not 32bit machine. Theoretically >> 32bit >> kernel can use AVX512, but not sure if anyone use it like this. >> get_jiffies_64() >> includes jiffies_lock ops so not good in context switch. So I want to use raw >>

Re: [PATCH v6 1/3] x86/fpu: track AVX-512 usage of tasks

2018-12-18 Thread Dave Hansen
On 12/18/18 7:32 AM, Thomas Gleixner wrote: > What exactly prevents a 32bit kernel from having the AVX512 feature bit > set? And if it cannot be set on 32bit, then why are you compiling that code > in at all? There are three different AVX-512 states (and three bits) which Aubrey's patch checks.

[tip:x86/mpx] x86: Remove Intel MPX

2018-12-18 Thread tip-bot for Dave Hansen
Commit-ID: eb012ef3b4e331ae479dd7cd9378041d9b7f851c Gitweb: https://git.kernel.org/tip/eb012ef3b4e331ae479dd7cd9378041d9b7f851c Author: Dave Hansen AuthorDate: Fri, 31 Aug 2018 14:14:16 -0700 Committer: Thomas Gleixner CommitDate: Tue, 18 Dec 2018 14:24:38 +0100 x86: Remove Intel MPX

Re: [PATCH v17 18/23] platform/x86: Intel SGX driver

2018-12-17 Thread Dave Hansen
On 12/17/18 12:10 PM, Andy Lutomirski wrote: >> There's no 'struct page' for enclave memory as it stands. That means no >> page cache, and that means there's no 'struct address_space *mapping' in >> the first place. >> >> Basically, the choice was made a long time ago to have SGX's memory >>

Re: [PATCH v17 18/23] platform/x86: Intel SGX driver

2018-12-17 Thread Dave Hansen
On 12/17/18 11:55 AM, Andy Lutomirski wrote: >> You're effectively rebuilding reverse-mapping infrastructure here. It's >> a frequent thing for the core VM to need to go from 'struct page' back >> to the page tables mapping it. For that we go (logically) >>

Re: [PATCH v17 18/23] platform/x86: Intel SGX driver

2018-12-17 Thread Dave Hansen
On 12/17/18 11:49 AM, Jarkko Sakkinen wrote: >> Yeah, the code is built to have one VMA and only one VMA per enclave. >> You need to go over the origin of this restriction and what enforces this. > It is before ECREATE but after that you can split it with mprotect(). > > Lets take an example. I'm

Re: [PATCH v17 18/23] platform/x86: Intel SGX driver

2018-12-17 Thread Dave Hansen
On 12/17/18 11:37 AM, Jarkko Sakkinen wrote: >> Suggestion: >> >> It looks like you only expect one VMA per enclave. Things go bonkers if >> this is not true. So, instead of storing encl->mm, don't. You can get >> the mm from vma->vm_mm and you could just store encl->vma instead. > The code

Re: [PATCH v17 18/23] platform/x86: Intel SGX driver

2018-12-17 Thread Dave Hansen
On 12/17/18 11:12 AM, Andy Lutomirski wrote: > So I'm not saying that you shouldn't do it the way you are now, but I > do think that the changelog or at least some emails should explain > *why* the enclave needs to keep a pointer to the creating process's > mm. And, if you do keep the current

Re: [PATCH v17 18/23] platform/x86: Intel SGX driver

2018-12-17 Thread Dave Hansen
On 12/17/18 10:48 AM, Sean Christopherson wrote: > We can't set mm to NULL as we need it to unregister the notifier, and > I'm fairly certain attempting to unregister in the release callback > will deadlock. Suggestion: It looks like you only expect one VMA per enclave. Things go bonkers if

Re: [PATCH v17 18/23] platform/x86: Intel SGX driver

2018-12-17 Thread Dave Hansen
On 12/17/18 10:43 AM, Jarkko Sakkinen wrote: > On Mon, Dec 17, 2018 at 10:36:13AM -0800, Sean Christopherson wrote: >> I'm pretty sure doing mmget() would result in circular dependencies and >> a zombie enclave. In the do_exit() case where a task is abruptly killed: >> >> - __mmput() is never

Re: [PATCH v17 18/23] platform/x86: Intel SGX driver

2018-12-17 Thread Dave Hansen
On 12/17/18 10:01 AM, Jarkko Sakkinen wrote: >>> + encl->mm = current->mm; <-> + >>> encl->base = secs->base; >>> + encl->size = secs->size; >>> + encl->ssaframesize = secs->ssa_frame_size; >>> + encl->backing = backing; >>> + >>> + return encl; >>> +}

Re: [PATCH v17 18/23] platform/x86: Intel SGX driver

2018-12-17 Thread Dave Hansen
> +struct sgx_encl *sgx_encl_alloc(struct sgx_secs *secs) > +{ ... > + kref_init(&encl->refcount); > + INIT_LIST_HEAD(&encl->add_page_reqs); > + INIT_RADIX_TREE(&encl->page_tree, GFP_KERNEL); > + mutex_init(&encl->lock); > + INIT_WORK(&encl->add_page_work, sgx_add_page_worker); > + > + encl->mm =

Re: [PATCH 1/2] mm: introduce put_user_page*(), placeholder versions

2018-12-14 Thread Dave Hansen
On 12/14/18 11:48 AM, Matthew Wilcox wrote: > I think we can do better than a proxy object with bit 0 set. I'd go > for allocating something like this: > > struct dynamic_page { > struct page; > unsigned long vaddr; > unsigned long pfn; > ... > }; > > and use a bit in

Re: [PATCH v4 1/2] x86/fpu: track AVX-512 usage of tasks

2018-12-11 Thread Dave Hansen
On 12/11/18 4:59 PM, Li, Aubrey wrote: >> maybe instead of a 1/0 bit, it's useful to store the timestamp of the last >> time we found the task to use avx? (need to find a good time unit) >> >> > Are you suggesting kernel does not do any translation, just provide a fact > to the user space tool and
