Re: [PATCH] x86/mm: Fix leak of pmd ptlock

2021-01-05 Thread Dave Hansen
On 12/2/20 10:28 PM, Dan Williams wrote:
> Commit 28ee90fe6048 ("x86/mm: implement free pmd/pte page interfaces")
> introduced a new location where a pmd was released, but neglected to run
> the pmd page destructor. In fact, this happened previously for a
> different pmd release path and was fixed by commit:
> 
> c283610e44ec ("x86, mm: do not leak page->ptl for pmd page tables").
> 
> This issue was hidden until recently because the failure mode is silent,
> but commit:

Looks sane.  Thanks as always for the thorough changelog and the
investigation into why we're suddenly seeing this now.

I agree that ridding ourselves of open-coded free_page()'s is a good
idea, but this patch itself needs to be around for stable anyway.  So,

Acked-by: Dave Hansen 


Re: [PATCH V3 04/10] x86/pks: Preserve the PKRS MSR on context switch

2020-12-18 Thread Dave Hansen
On 12/18/20 11:42 AM, Ira Weiny wrote:
> Another problem would be if the kmap and kunmap happened in different
> contexts...  :-/  I don't think that is done either but I don't know for
> certain.

It would be really nice to put together some surveillance patches to
help become more certain about these things.  Even a per-task counter
would be better than nothing.

On kmap:
	current->kmaps++;
On kunmap:
	current->kmaps--;
	WARN_ON(current->kmaps < 0);
On exit:
	WARN_ON(current->kmaps);

That would at least find imbalances.  You could take it even further by
having a little array, say:

struct one_kmap {
	struct page *page;
	depot_stack_handle_t handle;
};

Then:

 struct task_struct {
...
+   struct one_kmap kmaps[10];
 };

On kmap() you make a new entry in current->kmaps[], and on kunmap() you
try to find the corresponding entry.  If you can't find one, in the
current task you can even go search all the other tasks and see who
might be responsible.  If something goes and does more than 10
simultaneous kmap()s in one thread, dump a warning and give up.  Or,
dynamically allocate the kmaps[] array.

Then you can dump out the stack of the kmap() culprit if it exits after
a kmap() but without a corresponding kunmap().
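
A rough sketch of that bookkeeping, using the struct one_kmap above (the
kmaps[] array and both helpers are invented here purely for illustration;
none of this exists in the tree, and it is untested):

#include <linux/sched.h>
#include <linux/stackdepot.h>
#include <linux/stacktrace.h>

/* called from kmap(); assumes task_struct grew 'struct one_kmap kmaps[10]' */
static void track_kmap(struct page *page)
{
	unsigned long entries[16];
	unsigned int nr, i;

	for (i = 0; i < ARRAY_SIZE(current->kmaps); i++) {
		if (current->kmaps[i].page)
			continue;
		nr = stack_trace_save(entries, ARRAY_SIZE(entries), 2);
		current->kmaps[i].page = page;
		current->kmaps[i].handle = stack_depot_save(entries, nr,
							    GFP_ATOMIC);
		return;
	}
	WARN_ONCE(1, "more than 10 simultaneous kmap()s in one thread\n");
}

/* called from kunmap(); no matching entry means an unbalanced kunmap() */
static void track_kunmap(struct page *page)
{
	unsigned int i;

	for (i = 0; i < ARRAY_SIZE(current->kmaps); i++) {
		if (current->kmaps[i].page == page) {
			current->kmaps[i].page = NULL;
			return;
		}
	}
	WARN_ONCE(1, "kunmap() without a matching kmap()\n");
}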

Something like that should be low overhead enough to get it into things
like the 0day debug kernel.  It should be way cheaper than something
like lockdep.


Re: [NEEDS-REVIEW] [PATCH V3 04/10] x86/pks: Preserve the PKRS MSR on context switch

2020-12-18 Thread Dave Hansen
On 12/17/20 8:10 PM, Ira Weiny wrote:
> On Thu, Dec 17, 2020 at 12:41:50PM -0800, Dave Hansen wrote:
>> On 11/6/20 3:29 PM, ira.we...@intel.com wrote:
>>>  void disable_TSC(void)
>>> @@ -644,6 +668,8 @@ void __switch_to_xtra(struct task_struct *prev_p, 
>>> struct task_struct *next_p)
>>>  
>>> if ((tifp ^ tifn) & _TIF_SLD)
>>> switch_to_sld(tifn);
>>> +
>>> +   pks_sched_in();
>>>  }
>>
>> Does the selftest for this ever actually schedule()?
> 
> At this point I'm not sure.  This code has been in since the beginning.  So
> it's seen a lot of soak time.

Think about it another way.  Let's say this didn't get called on the
first context switch away from the PKS-using task.  Would anyone notice?
 How likely is this to happen?

The function tracers or kprobes tend to be a great tool for this, at
least for testing whether the code path you expect to hit is getting hit.
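
If it helps, a throwaway kprobe module is only a few lines.  This is just a
sketch (it assumes pks_sched_in() survives as a real, non-inlined symbol,
which may not be true of the patch as posted):

#include <linux/kprobes.h>
#include <linux/module.h>
#include <linux/sched.h>
#include <linux/smp.h>

static int hit_pre(struct kprobe *p, struct pt_regs *regs)
{
	/* log every time the probed function is actually reached */
	pr_info("%s hit on CPU %d (%s)\n", p->symbol_name,
		smp_processor_id(), current->comm);
	return 0;
}

static struct kprobe kp = {
	.symbol_name	= "pks_sched_in",
	.pre_handler	= hit_pre,
};

static int __init hit_init(void)
{
	return register_kprobe(&kp);
}

static void __exit hit_exit(void)
{
	unregister_kprobe(&kp);
}

module_init(hit_init);
module_exit(hit_exit);
MODULE_LICENSE("GPL");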


Re: [PATCH V3 10/10] x86/pks: Add PKS test code

2020-12-17 Thread Dave Hansen
On 11/6/20 3:29 PM, ira.we...@intel.com wrote:
> + /* Arm for context switch test */
> + write(fd, "1", 1);
> +
> + /* Context switch out... */
> + sleep(4);
> +
> + /* Check msr restored */
> + write(fd, "2", 1);

These are always tricky.  What you ideally want here is:

1. Switch away from this task to a non-PKS task, or
2. Switch from this task to a PKS-using task, but one which has a
   different PKS value

then, switch back to this task and make sure PKS maintained its value.

*But*, there's no absolute guarantee that another task will run.  It
would not be totally unreasonable to have the kernel just sit in a loop
without context switching here if no other tasks can run.

The only way you *know* there is a context switch is by having two tasks
bound to the same logical CPU and make sure they run one after another.
 This just gets itself into a state where it *CAN* context switch and
prays that one will happen.

You can also run a bunch of these in parallel bound to a single CPU.
That would also give you higher levels of assurance that *some* context
switch happens at sleep().
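
A minimal userspace sketch of the "two tasks pinned to one CPU" idea (not
the actual selftest; the PKS arm/check steps are only marked as comments):

#define _GNU_SOURCE
#include <sched.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/wait.h>

int main(void)
{
	cpu_set_t set;
	volatile unsigned long spin;
	pid_t pid;

	CPU_ZERO(&set);
	CPU_SET(0, &set);
	/* parent and (inherited across fork) child both pinned to CPU 0 */
	if (sched_setaffinity(0, sizeof(set), &set))
		return 1;

	pid = fork();
	if (pid == 0) {
		/* child: burn CPU so the parent *must* be switched out */
		for (spin = 0; spin < 500000000UL; spin++)
			;
		exit(0);
	}

	/* parent: arm the PKS state here (e.g. write "1" to run_pks) ... */
	sched_yield();			/* hand the CPU to the child */
	waitpid(pid, NULL, 0);		/* at least one switch away and back */
	/* ... then check that the PKRS value survived (write "2" to run_pks) */
	return 0;
}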

One critical thing with these tests is to sabotage the kernel and then
run them and make *sure* they fail.  Basically, if you screw up, do they
actually work to catch it?


Re: [NEEDS-REVIEW] [PATCH V3 04/10] x86/pks: Preserve the PKRS MSR on context switch

2020-12-17 Thread Dave Hansen
On 11/6/20 3:29 PM, ira.we...@intel.com wrote:
>  void disable_TSC(void)
> @@ -644,6 +668,8 @@ void __switch_to_xtra(struct task_struct *prev_p, struct 
> task_struct *next_p)
>  
>   if ((tifp ^ tifn) & _TIF_SLD)
>   switch_to_sld(tifn);
> +
> + pks_sched_in();
>  }

Does the selftest for this ever actually schedule()?

I see it talking about context switching, but I don't immediately see
how it would.


Re: [PATCH RFC V3 7/9] x86/entry: Preserve PKRS MSR across exceptions

2020-10-14 Thread Dave Hansen
On 10/14/20 8:46 PM, Ira Weiny wrote:
> On Tue, Oct 13, 2020 at 11:52:32AM -0700, Dave Hansen wrote:
>> On 10/9/20 12:42 PM, ira.we...@intel.com wrote:
>>> @@ -341,6 +341,9 @@ noinstr void irqentry_enter(struct pt_regs *regs, 
>>> irqentry_state_t *state)
>>> /* Use the combo lockdep/tracing function */
>>> trace_hardirqs_off();
>>> instrumentation_end();
>>> +
>>> +done:
>>> +   irq_save_pkrs(state);
>>>  }
>> One nit: This saves *and* sets PKRS.  It's not obvious from the call
>> here that PKRS is altered at this site.  Seems like there could be a
>> better name.
>>
>> Even if we did:
>>
>>  irq_save_set_pkrs(state, INIT_VAL);
>>
>> It would probably compile down to the same thing, but be *really*
>> obvious what's going on.
> I suppose that is true.  But I think it is odd having a parameter which is the
> same for every call site.

Well, it depends on what you optimize for.  I'm trying to optimize for
the code being understood quickly the first time someone reads it.  To
me, that's more important than minimizing the number of function
parameters (which are essentially free).
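
For what it's worth, if the wrapper is an inline, the constant argument
evaporates at compile time.  A sketch of the shape I mean (reusing
write_pkrs() and the per-cpu pkrs_cache from the series; exactly what gets
saved should stay whatever irq_save_pkrs() already does):

static __always_inline void irq_save_set_pkrs(irqentry_state_t *state, u32 val)
{
	state->pkrs = this_cpu_read(pkrs_cache);	/* save the interrupted value */
	write_pkrs(val);				/* install the new one, visibly */
}

	/* ...and every call site then documents what it installs: */
	irq_save_set_pkrs(state, INIT_VAL);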


Re: [PATCH RFC V3 9/9] x86/pks: Add PKS test code

2020-10-13 Thread Dave Hansen
On 10/9/20 12:42 PM, ira.we...@intel.com wrote:
>  #ifdef CONFIG_X86_32
>   /*
>* We can fault-in kernel-space virtual memory on-demand. The
> diff --git a/include/linux/pkeys.h b/include/linux/pkeys.h
> index cc3510cde64e..f9552bd9341f 100644
> --- a/include/linux/pkeys.h
> +++ b/include/linux/pkeys.h
> @@ -47,7 +47,6 @@ static inline bool arch_pkeys_enabled(void)
>  static inline void copy_init_pkru_to_fpregs(void)
>  {
>  }
> -
>  #endif /* ! CONFIG_ARCH_HAS_PKEYS */

^ Whitespace damage

>  #ifndef CONFIG_ARCH_HAS_SUPERVISOR_PKEYS
> diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
> index 0c781f912f9f..f015c09ba5a1 100644
> --- a/lib/Kconfig.debug
> +++ b/lib/Kconfig.debug
> @@ -2400,6 +2400,18 @@ config HYPERV_TESTING
>   help
> Select this option to enable Hyper-V vmbus testing.
>  
> +config PKS_TESTING
> + bool "PKey(S)upervisor testing"

Seems like we need a space in there somewhere.

> + pid = fork();
> + if (pid == 0) {
> + fd = open("/sys/kernel/debug/x86/run_pks", O_RDWR);
> + if (fd < 0) {
> + printf("cannot open file\n");
> + return -1;
> + }
> +

Will this return code make anybody mad?  Should we have a nicer return
code for when this is running on non-PKS hardware?
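
For example, a sketch using the usual kselftest convention (exit code 4
means "skip"; the debugfs path is the one from the patch):

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>

#define KSFT_SKIP 4	/* kselftest "not applicable" exit code */

static int open_run_pks(void)
{
	int fd = open("/sys/kernel/debug/x86/run_pks", O_RDWR);

	if (fd < 0) {
		/* no PKS support: skip instead of reporting a failure */
		printf("SKIP: PKS not available\n");
		exit(KSFT_SKIP);
	}
	return fd;
}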

I'm not going to be too picky about this.  I'll just ask one question:
Has this found real bugs for you?

Reviewed-by: Dave Hansen 


Re: [PATCH RFC V3 8/9] x86/fault: Report the PKRS state on fault

2020-10-13 Thread Dave Hansen
> @@ -548,6 +549,11 @@ show_fault_oops(struct pt_regs *regs, unsigned long 
> error_code, unsigned long ad
>(error_code & X86_PF_PK)? "protection keys violation" :
>  "permissions violation");
>  
> +#ifdef CONFIG_ARCH_HAS_SUPERVISOR_PKEYS
> + if (irq_state && (error_code & X86_PF_PK))
> + pr_alert("PKRS: 0x%x\n", irq_state->pkrs);
> +#endif

This means everyone will see 'PKRS: 0x0', even if they're on non-PKS
hardware.  I think I'd rather have this only show PKRS when we're on
cpu_feature_enabled(PKS) hardware.
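
Something along these lines, once the feature check can stand alone
(sketch):

	if (cpu_feature_enabled(X86_FEATURE_PKS) && (error_code & X86_PF_PK))
		pr_alert("PKRS: 0x%x\n", irq_state->pkrs);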

...
> @@ -1148,14 +1156,15 @@ static int fault_in_kernel_space(unsigned long 
> address)
>   */
>  static void
>  do_kern_addr_fault(struct pt_regs *regs, unsigned long hw_error_code,
> -unsigned long address)
> +unsigned long address, irqentry_state_t *irq_state)
>  {
>   /*
> -  * Protection keys exceptions only happen on user pages.  We
> -  * have no user pages in the kernel portion of the address
> -  * space, so do not expect them here.
> +  * If protection keys are not enabled for kernel space
> +  * do not expect Pkey errors here.
>*/

Let's fix the double-negative:

/*
 * PF_PK is only expected on kernel addresses when
 * supervisor pkeys are enabled:
 */

> - WARN_ON_ONCE(hw_error_code & X86_PF_PK);
> + if (!IS_ENABLED(CONFIG_ARCH_HAS_SUPERVISOR_PKEYS) ||
> + !cpu_feature_enabled(X86_FEATURE_PKS))
> + WARN_ON_ONCE(hw_error_code & X86_PF_PK);

Yeah, please stick X86_FEATURE_PKS in disabled-features.h so you can use
cpu_feature_enabled(X86_FEATURE_PKS) by itself here.
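
For reference, the disabled-features.h pattern looks roughly like this
(sketch only; the DISABLED_MASK word has to match wherever X86_FEATURE_PKS
lands in cpufeatures.h):

/* arch/x86/include/asm/disabled-features.h */
#ifdef CONFIG_ARCH_HAS_SUPERVISOR_PKEYS
# define DISABLE_PKS	0
#else
# define DISABLE_PKS	(1 << (X86_FEATURE_PKS & 31))
#endif

/* ...then OR DISABLE_PKS into the DISABLED_MASK* word for that leaf. */

With that in place, cpu_feature_enabled(X86_FEATURE_PKS) compiles down to 0
when the config is off, and the explicit IS_ENABLED()/#ifdef checks go away.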


Re: [PATCH RFC V3 7/9] x86/entry: Preserve PKRS MSR across exceptions

2020-10-13 Thread Dave Hansen
On 10/9/20 12:42 PM, ira.we...@intel.com wrote:
> @@ -341,6 +341,9 @@ noinstr void irqentry_enter(struct pt_regs *regs, 
> irqentry_state_t *state)
>   /* Use the combo lockdep/tracing function */
>   trace_hardirqs_off();
>   instrumentation_end();
> +
> +done:
> + irq_save_pkrs(state);
>  }

One nit: This saves *and* sets PKRS.  It's not obvious from the call
here that PKRS is altered at this site.  Seems like there could be a
better name.

Even if we did:

irq_save_set_pkrs(state, INIT_VAL);

It would probably compile down to the same thing, but be *really*
obvious what's going on.

>  void irqentry_exit_cond_resched(void)
> @@ -362,7 +365,12 @@ noinstr void irqentry_exit(struct pt_regs *regs, 
> irqentry_state_t *state)
>   /* Check whether this returns to user mode */
>   if (user_mode(regs)) {
>   irqentry_exit_to_user_mode(regs);
> - } else if (!regs_irqs_disabled(regs)) {
> + return;
> + }
> +
> + irq_restore_pkrs(state);
> +
> + if (!regs_irqs_disabled(regs)) {
>   /*
>* If RCU was not watching on entry this needs to be done
>* carefully and needs the same ordering of lockdep/tracing
> 


Re: [PATCH RFC V3 5/9] x86/pks: Add PKS kernel API

2020-10-13 Thread Dave Hansen
> +static inline void pks_update_protection(int pkey, unsigned long protection)
> +{
> + current->thread.saved_pkrs = update_pkey_val(current->thread.saved_pkrs,
> +  pkey, protection);
> + preempt_disable();
> + write_pkrs(current->thread.saved_pkrs);
> + preempt_enable();
> +}

Why does this need preempt count manipulation in addition to the
get/put_cpu_var() inside of write_pkrs()?

> +/**
> + * PKS access control functions
> + *
> + * Change the access of the domain specified by the pkey.  These are global
> + * updates.  They only affects the current running thread.  It is undefined 
> and
> + * a bug for users to call this without having allocated a pkey and using it 
> as
> + * pkey here.
> + *
> + * pks_mknoaccess()
> + * Disable all access to the domain
> + * pks_mkread()
> + * Make the domain Read only
> + * pks_mkrdwr()
> + * Make the domain Read/Write
> + *
> + * @pkey the pkey for which the access should change.
> + *
> + */
> +void pks_mknoaccess(int pkey)
> +{
> + pks_update_protection(pkey, PKEY_DISABLE_ACCESS);
> +}
> +EXPORT_SYMBOL_GPL(pks_mknoaccess);

These are named like PTE manipulation functions, which is kinda weird.

What's wrong with: pks_disable_access(pkey) ?

> +void pks_mkread(int pkey)
> +{
> + pks_update_protection(pkey, PKEY_DISABLE_WRITE);
> +}
> +EXPORT_SYMBOL_GPL(pks_mkread);

I really don't like this name.  It doesn't make the memory readable, or even
read-only, *especially* if it was already access-disabled.

> +static const char pks_key_user0[] = "kernel";
> +
> +/* Store names of allocated keys for debug.  Key 0 is reserved for the 
> kernel.  */
> +static const char *pks_key_users[PKS_NUM_KEYS] = {
> + pks_key_user0
> +};
> +
> +/*
> + * Each key is represented by a bit.  Bit 0 is set for key 0 and reserved for
> + * its use.  We use ulong for the bit operations but only 16 bits are used.
> + */
> +static unsigned long pks_key_allocation_map = 1 << PKS_KERN_DEFAULT_KEY;
> +
> +/*
> + * pks_key_alloc - Allocate a PKS key
> + *
> + * @pkey_user: String stored for debugging of key exhaustion.  The caller is
> + * responsible to maintain this memory until pks_key_free().
> + */
> +int pks_key_alloc(const char * const pkey_user)
> +{
> + int nr;
> +
> + if (!cpu_feature_enabled(X86_FEATURE_PKS))
> + return -EINVAL;

I'm not sure I like -EINVAL for this.  I thought we returned -ENOSPC for
this case for user pkeys.

> + while (1) {
> + nr = find_first_zero_bit(&pks_key_allocation_map, PKS_NUM_KEYS);
> + if (nr >= PKS_NUM_KEYS) {
> + pr_info("Cannot allocate supervisor key for %s.\n",
> + pkey_user);
> + return -ENOSPC;
> + }
> + if (!test_and_set_bit_lock(nr, &pks_key_allocation_map))
> + break;
> + }
> +
> + /* for debugging key exhaustion */
> + pks_key_users[nr] = pkey_user;
> +
> + return nr;
> +}
> +EXPORT_SYMBOL_GPL(pks_key_alloc);
> +
> +/*
> + * pks_key_free - Free a previously allocate PKS key
> + *
> + * @pkey: Key to be free'ed
> + */
> +void pks_key_free(int pkey)
> +{
> + if (!cpu_feature_enabled(X86_FEATURE_PKS))
> + return;
> +
> + if (pkey >= PKS_NUM_KEYS || pkey <= PKS_KERN_DEFAULT_KEY)
> + return;

This seems worthy of a WARN_ON_ONCE() at least.  It's essentially
corrupt data coming into a kernel API.
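
Something like this (sketch):

	if (WARN_ON_ONCE(pkey >= PKS_NUM_KEYS || pkey <= PKS_KERN_DEFAULT_KEY))
		return;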


Re: [PATCH RFC V3 4/9] x86/pks: Preserve the PKRS MSR on context switch

2020-10-13 Thread Dave Hansen
On 10/9/20 12:42 PM, ira.we...@intel.com wrote:
> From: Ira Weiny 
> 
> The PKRS MSR is defined as a per-logical-processor register.  This
> isolates memory access by logical CPU.  Unfortunately, the MSR is not
> managed by XSAVE.  Therefore, tasks must save/restore the MSR value on
> context switch.
> 
> Define a saved PKRS value in the task struct, as well as a cached
> per-logical-processor MSR value which mirrors the MSR value of the
> current CPU.  Initialize all tasks with the default MSR value.  Then, on
> schedule in, check the saved task MSR vs the per-cpu value.  If
> different proceed to write the MSR.  If not avoid the overhead of the
> MSR write and continue.

It's probably nice to note how the WRMSR is special here, in addition to
the comments below.

>  #endif /*_ASM_X86_PKEYS_INTERNAL_H */
> diff --git a/arch/x86/include/asm/processor.h 
> b/arch/x86/include/asm/processor.h
> index 97143d87994c..da2381136b2d 100644
> --- a/arch/x86/include/asm/processor.h
> +++ b/arch/x86/include/asm/processor.h
> @@ -18,6 +18,7 @@ struct vm86;
>  #include 
>  #include 
>  #include 
> +#include 
>  #include 
>  #include 
>  #include 
> @@ -542,6 +543,11 @@ struct thread_struct {
>  
>   unsigned intsig_on_uaccess_err:1;
>  
> +#ifdef   CONFIG_ARCH_HAS_SUPERVISOR_PKEYS
> + /* Saved Protection key register for supervisor mappings */
> + u32 saved_pkrs;
> +#endif

Could you take a look around thread_struct and see if there are some
other MSRs near which you can stash this?  This seems like a bit of a
lonely place.

...
>  void flush_thread(void)
>  {
>   struct task_struct *tsk = current;
> @@ -195,6 +212,8 @@ void flush_thread(void)
>   memset(tsk->thread.tls_array, 0, sizeof(tsk->thread.tls_array));
>  
> + fpu__clear_all(&tsk->thread.fpu);
> +
> + pks_init_task(tsk);
>  }
>  
>  void disable_TSC(void)
> @@ -644,6 +663,8 @@ void __switch_to_xtra(struct task_struct *prev_p, struct 
> task_struct *next_p)
>  
>   if ((tifp ^ tifn) & _TIF_SLD)
>   switch_to_sld(tifn);
> +
> + pks_sched_in();
>  }
>  
>  /*
> diff --git a/arch/x86/mm/pkeys.c b/arch/x86/mm/pkeys.c
> index 3cf8f775f36d..30f65dd3d0c5 100644
> --- a/arch/x86/mm/pkeys.c
> +++ b/arch/x86/mm/pkeys.c
> @@ -229,3 +229,31 @@ u32 update_pkey_val(u32 pk_reg, int pkey, unsigned int 
> flags)
>  
>   return pk_reg;
>  }
> +
> +DEFINE_PER_CPU(u32, pkrs_cache);
> +
> +/**
> + * It should also be noted that the underlying WRMSR(MSR_IA32_PKRS) is not
> + * serializing but still maintains ordering properties similar to WRPKRU.
> + * The current SDM section on PKRS needs updating but should be the same as
> + * that of WRPKRU.  So to quote from the WRPKRU text:
> + *
> + *   WRPKRU will never execute transiently. Memory accesses
> + *   affected by PKRU register will not execute (even transiently)
> + *   until all prior executions of WRPKRU have completed execution
> + *   and updated the PKRU register.
> + */
> +void write_pkrs(u32 new_pkrs)
> +{
> + u32 *pkrs;
> +
> + if (!static_cpu_has(X86_FEATURE_PKS))
> + return;
> +
> + pkrs = get_cpu_ptr(&pkrs_cache);
> + if (*pkrs != new_pkrs) {
> + *pkrs = new_pkrs;
> + wrmsrl(MSR_IA32_PKRS, new_pkrs);
> + }
> + put_cpu_ptr(pkrs);
> +}
> 

It bugs me a *bit* that this is being called in a preempt-disabled
region, but we still bother with the get/put_cpu jazz.  Are there other
future call-sites for this that aren't in preempt-disabled regions?
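
For contrast, if every caller really does run with preemption disabled, the
whole thing could shrink to something like this (sketch only -- it changes
the calling convention, so it is only valid if that assumption holds
everywhere):

void write_pkrs(u32 new_pkrs)
{
	if (!static_cpu_has(X86_FEATURE_PKS))
		return;

	/* callers are assumed to have preemption disabled already */
	if (this_cpu_read(pkrs_cache) != new_pkrs) {
		this_cpu_write(pkrs_cache, new_pkrs);
		wrmsrl(MSR_IA32_PKRS, new_pkrs);
	}
}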


Re: [PATCH RFC V3 3/9] x86/pks: Enable Protection Keys Supervisor (PKS)

2020-10-13 Thread Dave Hansen
On 10/9/20 12:42 PM, ira.we...@intel.com wrote:
> +/*
> + * PKS is independent of PKU and either or both may be supported on a CPU.
> + * Configure PKS if the cpu supports the feature.
> + */

Let's at least be consistent about CPU vs. cpu in a single comment. :)

> +static void setup_pks(void)
> +{
> + if (!IS_ENABLED(CONFIG_ARCH_HAS_SUPERVISOR_PKEYS))
> + return;
> + if (!cpu_feature_enabled(X86_FEATURE_PKS))
> + return;

If you put X86_FEATURE_PKS in disabled-features.h, you can get rid of
the explicit CONFIG_ check.

> + cr4_set_bits(X86_CR4_PKS);
> +}
> +
>  /*
>   * This does the hard work of actually picking apart the CPU stuff...
>   */
> @@ -1544,6 +1558,7 @@ static void identify_cpu(struct cpuinfo_x86 *c)
>  
>   x86_init_rdrand(c);
>   setup_pku(c);
> + setup_pks();
>  
>   /*
>* Clear/Set all flags overridden by options, need do it
> diff --git a/mm/Kconfig b/mm/Kconfig
> index 6c974888f86f..1b9bc004d9bc 100644
> --- a/mm/Kconfig
> +++ b/mm/Kconfig
> @@ -822,6 +822,8 @@ config ARCH_USES_HIGH_VMA_FLAGS
>   bool
>  config ARCH_HAS_PKEYS
>   bool
> +config ARCH_HAS_SUPERVISOR_PKEYS
> + bool
>  
>  config PERCPU_STATS
>   bool "Collect percpu memory statistics"
> 


Re: [PATCH RFC V3 2/9] x86/fpu: Refactor arch_set_user_pkey_access() for PKS support

2020-10-13 Thread Dave Hansen
On 10/9/20 12:42 PM, ira.we...@intel.com wrote:
> +/*
> + * Update the pk_reg value and return it.

How about:

Replace disable bits for @pkey with values from @flags.

> + * Kernel users use the same flags as user space:
> + * PKEY_DISABLE_ACCESS
> + * PKEY_DISABLE_WRITE
> + */
> +u32 update_pkey_val(u32 pk_reg, int pkey, unsigned int flags)
> +{
> + int pkey_shift = pkey * PKR_BITS_PER_PKEY;
> +
> + pk_reg &= ~(((1 << PKR_BITS_PER_PKEY) - 1) << pkey_shift);
> +
> + if (flags & PKEY_DISABLE_ACCESS)
> + pk_reg |= PKR_AD_BIT << pkey_shift;
> + if (flags & PKEY_DISABLE_WRITE)
> + pk_reg |= PKR_WD_BIT << pkey_shift;

I still think this deserves two lines of comments:

/* Mask out old bit values */

/* Or in new values */
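
That is, the same body as quoted above, just with the intent spelled out
(sketch):

u32 update_pkey_val(u32 pk_reg, int pkey, unsigned int flags)
{
	int pkey_shift = pkey * PKR_BITS_PER_PKEY;

	/* Mask out old bit values */
	pk_reg &= ~(((1 << PKR_BITS_PER_PKEY) - 1) << pkey_shift);

	/* Or in new values */
	if (flags & PKEY_DISABLE_ACCESS)
		pk_reg |= PKR_AD_BIT << pkey_shift;
	if (flags & PKEY_DISABLE_WRITE)
		pk_reg |= PKR_WD_BIT << pkey_shift;

	return pk_reg;
}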


Re: [PATCH RFC V3 1/9] x86/pkeys: Create pkeys_common.h

2020-10-13 Thread Dave Hansen
On 10/9/20 12:42 PM, ira.we...@intel.com wrote:
> Protection Keys User (PKU) and Protection Keys Supervisor (PKS) work
> in similar fashions and can share common defines.

Could we be a bit less abstract?  PKS and PKU each have:
1. A single control register
2. The same number of keys
3. The same number of bits in the register per key
4. Access and Write disable in the same bit locations

That means that we can share all the macros that synthesize and
manipulate register values between the two features.

> +++ b/arch/x86/include/asm/pkeys_common.h
> @@ -0,0 +1,11 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#ifndef _ASM_X86_PKEYS_INTERNAL_H
> +#define _ASM_X86_PKEYS_INTERNAL_H
> +
> +#define PKR_AD_BIT 0x1
> +#define PKR_WD_BIT 0x2
> +#define PKR_BITS_PER_PKEY 2
> +
> +#define PKR_AD_KEY(pkey) (PKR_AD_BIT << ((pkey) * PKR_BITS_PER_PKEY))

Now that this has moved away from its use-site, it's a bit less
self-documenting.  Let's add a comment:

/*
 * Generate an Access-Disable mask for the given pkey.  Several of these
 * can be OR'd together to generate pkey register values.
 */

Once that's in place, along with the updated changelog:

Reviewed-by: Dave Hansen 


Re: [PATCH RFC PKS/PMEM 22/58] fs/f2fs: Utilize new kmap_thread()

2020-10-12 Thread Dave Hansen
On 10/12/20 9:19 AM, Eric Biggers wrote:
> On Sun, Oct 11, 2020 at 11:56:35PM -0700, Ira Weiny wrote:
>>> And I still don't really understand.  After this patchset, there is still
>>> code nearly identical to the above (doing a temporary mapping just for a
>>> memcpy) that would still be using kmap_atomic().
>> I don't understand.  You mean there would be other call sites calling:
>>
>> kmap_atomic()
>> memcpy()
>> kunmap_atomic()
> Yes, there are tons of places that do this.  Try 'git grep -A6 kmap_atomic'
> and look for memcpy().
> 
> Hence why I'm asking what will be the "recommended" way to do this...
> kunmap_thread() or kmap_atomic()?

kmap_atomic() is always preferred over kmap()/kmap_thread().
kmap_atomic() is _much_ more lightweight since its TLB invalidation is
always CPU-local and never broadcast.

So, basically, unless you *must* sleep while the mapping is in place,
kmap_atomic() is preferred.
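
Concretely, the pattern in question is just (sketch):

#include <linux/highmem.h>
#include <linux/string.h>

/* must not sleep between the map and unmap; the mapping is CPU-local */
static void copy_into_page(struct page *page, const void *src, size_t len)
{
	void *dst = kmap_atomic(page);

	memcpy(dst, src, len);
	kunmap_atomic(dst);
}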


Re: [PATCH v6 5/6] mm: secretmem: use PMD-size pages to amortize direct map fragmentation

2020-09-29 Thread Dave Hansen
On 9/29/20 7:12 AM, Peter Zijlstra wrote:
>>  |  1G|  2M|  4K
>>--+++-
>>   ssd, mitigations=on| 308.75 | 317.37 | 314.9
>>   ssd, mitigations=off   | 305.25 | 295.32 | 304.92
>>   ram, mitigations=on| 301.58 | 322.49 | 306.54
>>   ram, mitigations=off   | 299.32 | 288.44 | 310.65
> These results lack error data, but assuming the reults are significant,
> then this very much makes a case for 1G mappings. 5s on a kernel builds
> is pretty good.

Is something like secretmem all or nothing?

This seems like a similar situation to the side-channel mitigations.  We
know what the most "secure" thing to do is.  But, folks also disagree
about how much pain that security is worth.

That seems to indicate we're never going to come up with a
one-size-fits-all solution to this.  Apps are going to have to live
without secretmem being around if they want to run on old kernels
anyway, so it seems like something we should be able to enable or
disable without ABI concerns.

Do we just include it, but disable it by default so it doesn't eat
performance?  But, allow it to be reenabled by the folks who generally
prioritize hardening over performance, like Chromebooks for instance.


Re: [PATCH v5 0/5] mm: introduce memfd_secret system call to create "secret" memory areas

2020-09-16 Thread Dave Hansen
On 9/16/20 11:49 AM, Andy Lutomirski wrote:
> I still have serious concerns with uncached mappings.  I'm not saying
> I can't be convinced, but I'm not currently convinced that we should
> allow user code to create UC mappings on x86.

There's another widely-used OS that has a "NOCACHE" flag to be specified
to on what I *think* is the equivalent of anonymous memory mappings
(VirtualAlloc()):

> https://docs.microsoft.com/en-us/windows/win32/memory/memory-protection-constants

I'll offer this not as a reason we should or shouldn't allow it.  But,
we should at least be able to figure out if it's caused problems for
them or not.  I think it's been around a long time.


Re: [PATCH RFC V2 17/17] x86/entry: Preserve PKRS MSR across exceptions

2020-07-23 Thread Dave Hansen
On 7/23/20 10:08 AM, Andy Lutomirski wrote:
> Suppose some kernel code (a syscall or kernel thread) changes PKRS
> then takes a page fault. The page fault handler needs a fresh PKRS.
> Then the page fault handler (say a VMA’s .fault handler) changes
> PKRS.  The we get an interrupt. The interrupt *also* needs a fresh
> PKRS and the page fault value needs to be saved somewhere.
> 
> So we have more than one saved value per thread, and thread_struct
> isn’t going to solve this problem.

Taking a step back...  This is all true only if we decide that we want
protection keys to provide protection during exceptions and interrupts.
 Right now, the code supports nesting:

kmap(foo);
kmap(bar);
kunmap(bar);
kunmap(foo);

with a reference count.  So, the nested kmap() will see the count
elevated and do nothing.

I'm generally OK with this getting merged without extending PKS
protection to interrupts and exceptions.  Ira and Fenghua should
certainly give it a go, but I'd like to see it as an add-on feature and
we can judge the benefits versus complexity separately.

Ira was looking at adding it because it _looked_ simple.  Andy has me
really scared about it now. :)


Re: [PATCH RFC V2 17/17] x86/entry: Preserve PKRS MSR across exceptions

2020-07-23 Thread Dave Hansen
On 7/23/20 9:18 AM, Fenghua Yu wrote:
> The PKRS MSR has been preserved in thread_info during kernel entry. We
> don't need to preserve it in another place (i.e. idtentry_state).

I'm missing how the PKRS MSR gets preserved in thread_info.  Could you
explain the mechanism by which this happens and point to the code
implementing it, please?


Re: [PATCH RFC V2 02/17] x86/fpu: Refactor arch_set_user_pkey_access() for PKS support

2020-07-17 Thread Dave Hansen
On 7/17/20 1:54 AM, Peter Zijlstra wrote:
> This is unbelievable junk...

Ouch!

This is from the original user pkeys implementation.

> How about something like:
> 
> u32 update_pkey_reg(u32 pk_reg, int pkey, unsigned int flags)
> {
>   int pkey_shift = pkey * PKR_BITS_PER_PKEY;
> 
>   pk_reg &= ~(((1 << PKR_BITS_PER_PKEY) - 1) << pkey_shift);
> 
>   if (flags & PKEY_DISABLE_ACCESS)
>   pk_reg |= PKR_AD_BIT << pkey_shift;
>   if (flags & PKEY_DISABLE_WRITE)
>   pk_reg |= PKR_WD_BIT << pkey_shift;
> 
>   return pk_reg;
> }
> 
> Then we at least have a little clue wtf the thing does.. Yes I started
> with a rename and then got annoyed at the implementation too.

That's fine, if some comments get added.  It looks correct to me but
probably compiles down to pretty much the same thing as what was there.
 FWIW, I prefer the explicit masking off of two bit values to implicit
masking off with a mask generated from PKR_BITS_PER_PKEY.  It's
certainly more compact, but I usually don't fret over the lines of code.
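
In other words, the difference between these two (they compute the same
mask for 2-bit keys; sketch):

	/* explicit: mask off the two per-key bits by name */
	pk_reg &= ~((PKR_AD_BIT | PKR_WD_BIT) << pkey_shift);

	/* generated: mask built from the per-key width */
	pk_reg &= ~(((1 << PKR_BITS_PER_PKEY) - 1) << pkey_shift);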


Re: [RFC PATCH 12/15] kmap: Add stray write protection for device pages

2020-07-14 Thread Dave Hansen
On 7/14/20 12:29 PM, Peter Zijlstra wrote:
> On Tue, Jul 14, 2020 at 12:06:16PM -0700, Ira Weiny wrote:
>> On Tue, Jul 14, 2020 at 10:44:51AM +0200, Peter Zijlstra wrote:
>>> So, if I followed along correctly, you're proposing to do a WRMSR per
>>> k{,un}map{_atomic}(), sounds like excellent performance all-round :-(
>> Only to pages which have this additional protection, ie not DRAM.
>>
>> User mappings of this memory is not affected (would be covered by User PKeys 
>> if
>> desired).  User mappings to persistent memory are the primary use case and 
>> the
>> performant path.
> Because performance to non-volatile memory doesn't matter? I think Dave
> has a better answer here ...

So, these WRMSRs are less evil than normal.  They're architecturally
non-serializing instructions, just like the others in the SDM WRMSR
documentation:

Note that WRMSR to the IA32_TSC_DEADLINE MSR (MSR index 6E0H)
and the X2APIC MSRs (MSR indices 802H to 83FH) are  not
serializing.

This section of the SDM needs to be updated for the PKRS.  Also note
that the PKRS WRMSR is similar in its ordering properties to WRPKRU:

WRPKRU will never execute speculatively. Memory accesses
affected by PKRU register will not execute (even speculatively)
until all prior executions of WRPKRU have completed execution
and updated the PKRU register.

Which means we don't have to do silliness like LFENCE before WRMSR to
get ordering *back*.  This is another tidbit that needs to get added to
the SDM.  It should probably also get captured in the changelog.

But, either way, this *will* make accessing PMEM more expensive from the
kernel.  No escaping that.  But, we've also got customers saying they
won't deploy PMEM until we mitigate this stray write issue.  Those folks
are quite willing to pay the increased in-kernel cost for increased
protection from stray kernel writes.  Intel is also quite motivated
because we really like increasing the number of PMEM deployments. :)

Ira, can you make sure this all gets pulled into the changelogs somewhere?


Re: [RFC PATCH 04/15] x86/pks: Preserve the PKRS MSR on context switch

2020-07-14 Thread Dave Hansen
On 7/14/20 11:53 AM, Ira Weiny wrote:
>>> The PKRS MSR is defined as a per-core register.

Just to be clear, PKRS is a per-logical-processor register, just like
PKRU.  The "per-core" thing here is a typo.


Re: [PATCH v2 0/2] Replace and improve "mcsafe" with copy_safe()

2020-05-01 Thread Dave Hansen
On 5/1/20 11:28 AM, Linus Torvalds wrote:
> Plus on x86 you can't reasonably even have different code sequences
> for that case, because CLAC/STAC don't have a "enable users read
> accesses" vs "write accesses" case. It's an all-or-nothing "enable
> user faults".
> 
> We _used_ to have a difference on x86, back when we did the whole "fs
> segment points to user space".

Protection keys might give us _some_ of this back.  If we're doing a
copy_from_user(), we could (logically) do:

stac()
save_pkru()
pkru |= ~0x
... do userspace read
restore_pkru()
clac()

That *should* generate a fault if we try to write to userspace in there
because PKRU affects all user *addresses* (PTEs with _PAGE_USER set) not
user-mode *accesses*.

Properly stashing the value off and context switching it correctly would
be fun, but probably not impossible to pull off.  You actually wouldn't
even technically need to restore PKRU in this path.  It would just need
to be restored before the thread runs userspace or hits a copy_to_user()
equivalent.
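
Very roughly, the shape would be something like this (purely conceptual
sketch: the helper names and the 0xAAAAAAAA all-write-disable mask are my
own shorthand, and none of the stash/context-switch plumbing is shown):

unsigned long pkru_guarded_copy_from_user(void *to, const void __user *from,
					  unsigned long n)
{
	u32 old_pkru;
	unsigned long ret;

	stac();				/* open the SMAP window for user reads */
	old_pkru = read_pkru();
	/* set every write-disable bit: user pages readable, not writable */
	write_pkru(old_pkru | 0xAAAAAAAA);

	ret = raw_copy_from_user(to, from, n);

	write_pkru(old_pkru);		/* or defer, as noted above */
	clac();
	return ret;
}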

I can't imagine this would all be worth the trouble, but there are
crazier people out there than me.


Re: [PATCH v2 2/3] mm/memory_hotplug: Introduce MHP_NO_FIRMWARE_MEMMAP

2020-04-30 Thread Dave Hansen
On 4/30/20 8:52 AM, David Hildenbrand wrote:
>> Justifying behavior by documentation that does not consider memory
>> hotplug is bad thinking.
> Are you maybe confusing this patch series with the arm64 approach? This
> is not about ordinary hotplugged DIMMs.
> 
> I'd love to get Dan's, Dave's and Michal's opinion.

The impact on kexec from the DAX "kmem" driver's use of hotplug was
inadvertent and unfortunate.

The problem statement and solution seem pretty sane to me.


Re: [PATCH v2 3/3] device-dax: Add system ram (add_memory()) with MHP_NO_FIRMWARE_MEMMAP

2020-04-30 Thread Dave Hansen
On 4/30/20 3:29 AM, David Hildenbrand wrote:
> Currently, when adding memory, we create entries in /sys/firmware/memmap/
> as "System RAM". This does not reflect the reality and will lead to
> kexec-tools to add that memory to the fixed-up initial memmap for a
> kexec kernel (loaded via kexec_load()). The memory will be considered
> initial System RAM by the kexec kernel.
> 
> We should let the kexec kernel decide how to use that memory - just as
> we do during an ordinary reboot.
...
> - rc = add_memory(numa_node, new_res->start, resource_size(new_res), 0);
> + rc = add_memory(numa_node, new_res->start, resource_size(new_res),
> + MHP_NO_FIRMWARE_MEMMAP);

Looks fine.  But, if you send another revision, could you add a comment
about the actual goal of MHP_NO_FIRMWARE_MEMMAP?  Maybe:

/*
 * MHP_NO_FIRMWARE_MEMMAP ensures that future
 * kexec'd kernels will not treat this as RAM.
     */

Not a biggie, though.

Acked-by: Dave Hansen 


Re: [PATCH v3 03/10] efi: Enumerate EFI_MEMORY_SP

2019-06-07 Thread Dave Hansen
On 6/7/19 1:03 PM, Dan Williams wrote:
>> Separate from these patches, should we have a runtime file that dumps
>> out the same info?  dmesg isn't always available, and hotplug could
>> change this too, I'd imagine.
> Perhaps, but I thought /proc/iomem was that runtime file. Given that
> x86/Linux only seems to care about the the EFI to E820 translation of
> the map and the E820 map is directly reflected in /proc/iomem, do we
> need another file?

Probably not.

I'm just trying to think of ways that we can debug systems where someone
"loses" a bunch of memory, especially if they're moving from an old
kernel to a new one with these patches.  From their perspective, they
just lost a bunch of expensive memory.

Do we owe a pr_info(), perhaps?  Or even a /proc/meminfo entry for how
much memory these devices own?


Re: [PATCH v3 00/10] EFI Specific Purpose Memory Support

2019-06-07 Thread Dave Hansen
On 6/7/19 12:27 PM, Dan Williams wrote:
> In support of optionally allowing either application-exclusive and
> core-kernel-mm managed access to differentiated memory, claim
> EFI_MEMORY_SP ranges for exposure as device-dax instances by default.
> Such instances can be directly owned / mapped by a
> platform-topology-aware application. Alternatively, with the new kmem
> facility [4], the administrator has the option to instead designate that
> those memory ranges be hot-added to the core-kernel-mm as a unique
> memory numa-node. In short, allow for the decision about what software
> agent manages specific-purpose memory to be made at runtime.

It's probably worth noting that the reason the memory lands into the
state of being controlled by device-dax by default is that device-dax is
nice.  It's actually willing and able to give up ownership of the memory
when we ask.  If we added to the core-mm, we'd almost certainly not be
able to get it back reliably.

Anyway, thanks for doing these, and I really hope that the world's
BIOSes actually use this flag.  For the series:

Reviewed-by: Dave Hansen 


Re: [PATCH v3 08/10] device-dax: Add a driver for "hmem" devices

2019-06-07 Thread Dave Hansen
On 6/7/19 12:27 PM, Dan Williams wrote:
> This consumes "hmem" devices the producer of "hmem" devices is saved for
> a follow-on patch so that it can reference the new CONFIG_DEV_DAX_HMEM
> symbol to gate performing the enumeration work.

Do these literally show up as /dev/hmemX?


Re: [PATCH v3 03/10] efi: Enumerate EFI_MEMORY_SP

2019-06-07 Thread Dave Hansen
On 6/7/19 12:27 PM, Dan Williams wrote:
> @@ -848,15 +848,16 @@ char * __init efi_md_typeattr_format(char *buf, size_t 
> size,
>   if (attr & ~(EFI_MEMORY_UC | EFI_MEMORY_WC | EFI_MEMORY_WT |
>EFI_MEMORY_WB | EFI_MEMORY_UCE | EFI_MEMORY_RO |
>EFI_MEMORY_WP | EFI_MEMORY_RP | EFI_MEMORY_XP |
> -  EFI_MEMORY_NV |
> +  EFI_MEMORY_NV | EFI_MEMORY_SP |
>EFI_MEMORY_RUNTIME | EFI_MEMORY_MORE_RELIABLE))
>   snprintf(pos, size, "|attr=0x%016llx]",
>(unsigned long long)attr);
>   else
>   snprintf(pos, size,
> -  "|%3s|%2s|%2s|%2s|%2s|%2s|%2s|%3s|%2s|%2s|%2s|%2s]",
> +  "|%3s|%2s|%2s|%2s|%2s|%2s|%2s|%2s|%3s|%2s|%2s|%2s|%2s]",
>attr & EFI_MEMORY_RUNTIME ? "RUN" : "",
>attr & EFI_MEMORY_MORE_RELIABLE ? "MR" : "",
> +  attr & EFI_MEMORY_SP  ? "SP"  : "",
>attr & EFI_MEMORY_NV  ? "NV"  : "",
>attr & EFI_MEMORY_XP  ? "XP"  : "",
>attr & EFI_MEMORY_RP  ? "RP"  : "",

Haha, I went digging in sysfs to find out where this gets dumped out.
The joke was on me because it seems to only go to dmesg.

Separate from these patches, should we have a runtime file that dumps
out the same info?  dmesg isn't always available, and hotplug could
change this too, I'd imagine.


Re: [ndctl PATCH 0/8] daxctl: add a new reconfigure-device command

2019-05-06 Thread Dave Hansen
This all looks quite nice to me.  Thanks, Vishal!

One minor nit: for those of us new to daxctl and friends, they can be a
bit hard to get started with.  Could you maybe add a few example
invocations to the Documentation, or even this cover letter to help us
newbies get started?


Re: [v5 2/3] mm/hotplug: make remove_memory() interface useable

2019-05-06 Thread Dave Hansen
On 5/6/19 11:01 AM, Dan Williams wrote:
>>> +void __remove_memory(int nid, u64 start, u64 size)
>>>  {
>>> +
>>> + /*
>>> +  * trigger BUG() is some memory is not offlined prior to calling this
>>> +  * function
>>> +  */
>>> + if (try_remove_memory(nid, start, size))
>>> + BUG();
>>> +}
>> Could we call this remove_offline_memory()?  That way, it makes _some_
>> sense why we would BUG() if the memory isn't offline.
> Please WARN() instead of BUG() because failing to remove memory should
> not be system fatal.

That is my preference as well.  But, the existing code BUG()s, so I'm
OK-ish with this staying for the moment until we have a better handle on
what all the callers do if this fails.
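
For the record, the WARN() flavor being discussed would look something like
this (sketch; whether the callers can actually tolerate the failure is the
open question):

void __remove_memory(int nid, u64 start, u64 size)
{
	/*
	 * All memory blocks must have been offlined by the caller;
	 * failing that is a caller bug, but it should not kill the box.
	 */
	WARN(try_remove_memory(nid, start, size),
	     "failed to remove memory [%#llx-%#llx]\n",
	     (unsigned long long)start,
	     (unsigned long long)(start + size - 1));
}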


Re: [v5 2/3] mm/hotplug: make remove_memory() interface useable

2019-05-06 Thread Dave Hansen
> -static inline void remove_memory(int nid, u64 start, u64 size) {}
> +static inline bool remove_memory(int nid, u64 start, u64 size)
> +{
> + return -EBUSY;
> +}

This seems like an appropriate place for a WARN_ONCE(), if someone
manages to call remove_memory() with hotplug disabled.

BTW, I looked and can't think of a better errno, but -EBUSY probably
isn't the best error code, right?

> -void remove_memory(int nid, u64 start, u64 size)
> +/**
> + * remove_memory
> + * @nid: the node ID
> + * @start: physical address of the region to remove
> + * @size: size of the region to remove
> + *
> + * NOTE: The caller must call lock_device_hotplug() to serialize hotplug
> + * and online/offline operations before this call, as required by
> + * try_offline_node().
> + */
> +void __remove_memory(int nid, u64 start, u64 size)
>  {
> +
> + /*
> +  * trigger BUG() is some memory is not offlined prior to calling this
> +  * function
> +  */
> + if (try_remove_memory(nid, start, size))
> + BUG();
> +}

Could we call this remove_offline_memory()?  That way, it makes _some_
sense why we would BUG() if the memory isn't offline.



Re: [v3 2/2] device-dax: "Hotremove" persistent memory that is used like normal RAM

2019-04-25 Thread Dave Hansen
Hi Pavel,

Thanks for doing this!  I knew we'd have to get to it eventually, but
sounds like you needed it sooner rather than later.

...
>  static inline struct dev_dax *to_dev_dax(struct device *dev)
> diff --git a/drivers/dax/kmem.c b/drivers/dax/kmem.c
> index 4c0131857133..6f1640462df9 100644
> --- a/drivers/dax/kmem.c
> +++ b/drivers/dax/kmem.c
> @@ -71,21 +71,107 @@ int dev_dax_kmem_probe(struct device *dev)
>   kfree(new_res);
>   return rc;
>   }
> + dev_dax->dax_kmem_res = new_res;
>  
>   return 0;
>  }
>  
> +#ifdef CONFIG_MEMORY_HOTREMOVE

Instead of this #ifdef, is there any downside to doing

if (!IS_ENABLED(CONFIG_MEMORY_HOTREMOVE)) {
/*
 * Without hotremove, purposely leak ...
 */
return 0;
}


> +/*
> + * Check that device-dax's memory_blocks are offline. If a memory_block is 
> not
> + * offline a warning is printed and an error is returned. dax hotremove can
> + * succeed only when every memory_block is offlined beforehand.
> + */

I'd much rather see comments inline with the code than all piled at the
top of a function like this.

One thing that would be helpful, though, is a reminder about needing the
device hotplug lock.

> +static int
> +check_memblock_offlined_cb(struct memory_block *mem, void *arg)
> +{
> + struct device *mem_dev = &mem->dev;
> + bool is_offline;
> +
> + device_lock(mem_dev);
> + is_offline = mem_dev->offline;
> + device_unlock(mem_dev);
> +
> + if (!is_offline) {
> + struct device *dev = (struct device *)arg;

The two devices confused me for a bit here.  Seems worth a comment to
remind the reader what this device _is_ versus 'mem_dev'.

> + unsigned long spfn = section_nr_to_pfn(mem->start_section_nr);
> + unsigned long epfn = section_nr_to_pfn(mem->end_section_nr);
> + phys_addr_t spa = spfn << PAGE_SHIFT;
> + phys_addr_t epa = epfn << PAGE_SHIFT;
> +
> + dev_warn(dev, "memory block [%pa-%pa] is not offline\n",
> +  &spa, &epa);

I thought we had a magic resource printk %something.  Could we just
print one of the device resources here to save all the section/pfn/paddr
calculations?

Also, should we consider a slightly scarier message?  This path has a
permanent, user-visible effect (we can never try to unbind again).

> + return -EBUSY;
> + }
> +
> + return 0;
> +}

Even though they're static, I'd prefer that we not create two versions
of check_memblock_offlined_cb() in the kernel.  Can we give this a
better, non-conflicting name?

> +static int dev_dax_kmem_remove(struct device *dev)
> +{
> + struct dev_dax *dev_dax = to_dev_dax(dev);
> + struct resource *res = dev_dax->dax_kmem_res;
> + resource_size_t kmem_start;
> + resource_size_t kmem_size;
> + unsigned long start_pfn;
> + unsigned long end_pfn;
> + int rc;
> +
> + /*
> +  * dax kmem resource does not exist, means memory was never hotplugged.
> +  * So, nothing to do here.
> +  */
> + if (!res)
> + return 0;

How could that happen?  I can't think of any obvious scenarios.

> + kmem_start = res->start;
> + kmem_size = resource_size(res);
> + start_pfn = kmem_start >> PAGE_SHIFT;
> + end_pfn = start_pfn + (kmem_size >> PAGE_SHIFT) - 1;
> +
> + /*
> +  * Walk and check that every singe memory_block of dax region is
> +  * offline
> +  */
> + lock_device_hotplug();
> + rc = walk_memory_range(start_pfn, end_pfn, dev,
> +check_memblock_offlined_cb);

Does lock_device_hotplug() also lock memory online/offline?  Otherwise,
isn't this offline check racy?  If not, can you please spell that out in
a comment?

Also, could you compare this a bit to the walk_memory_range() use in
__remove_memory()?  Why do we need two walks looking for offline blocks?



Re: [v3 1/2] device-dax: fix memory and resource leak if hotplug fails

2019-04-25 Thread Dave Hansen
On 4/25/19 10:54 AM, Pavel Tatashin wrote:
>   rc = add_memory(numa_node, new_res->start, resource_size(new_res));
> - if (rc)
> + if (rc) {
> + release_resource(new_res);
> + kfree(new_res);
>   return rc;
> + }

Looks good to me:

Reviewed-by: Dave Hansen 


[PATCH 0/5] [v5] Allow persistent memory to be used like normal RAM

2019-02-25 Thread Dave Hansen
This is a relatively small delta from v4.  The review comments seem
to be settling down, so it seems like we should start thinking about
how this might get merged.  Are there any objections to taking it in
via the nvdimm tree?

Dan Williams, our intrepid nvdimm maintainer has said he would
appreciate acks on these from relevant folks before merging them.
Reviews/acks on any in the series would be welcome, but the last
two especially are lacking any non-Intel acks:

mm/resource: let walk_system_ram_range() search child resources
dax: "Hotplug" persistent memory for use like normal RAM

Note: these are based on commit b20a7bfc2f9 in:

git://git.kernel.org/pub/scm/linux/kernel/git/djbw/nvdimm.git 
for-5.1/devdax

Changes since v4:
 * Update powerpc resource return codes too, not just generic
 * Update dev_dax_kmem_remove() comment about resource "leaks"
 * Make HMM-related warning in __request_region() use %pR's
 * Add GPL export for memory_block_size_bytes() to fix build error
 * Add a FAQ in the documentation
 * Make resource.c hmm output a pr_warn() instead of pr_debug()
 * Minor CodingStyle tweaks
 * Allow device removal by making dev_dax_kmem_remove() return 0.
   Note that although the device goes away, the memory stays
   in-use, assigned and reserved.

Changes since v3:
 * Move HMM-related resource warning instead of removing it
 * Use __request_resource() directly instead of devm.
 * Create a separate DAX_PMEM Kconfig option, complete with help text
 * Update patch descriptions and cover letter to give a better
   overview of use-cases and hardware where this might be useful.

Changes since v2:
 * Updates to dev_dax_kmem_probe() in patch 5:
   * Reject probes for devices with bad NUMA nodes.  Keeps slow
 memory from being added to node 0.
   * Use raw request_mem_region()
   * Add comments about permanent reservation
   * use dev_*() instead of printk's
 * Add references to nvdimm documentation in descriptions
 * Remove unneeded GPL export
 * Add Kconfig prompt and help text

Changes since v1:
 * Now based on git://git.kernel.org/pub/scm/linux/kernel/git/djbw/nvdimm.git
 * Use binding/unbinding from "dax bus" code
 * Move over to a "dax bus" driver from being an nvdimm driver

--

Persistent memory is cool.  But, currently, you have to rewrite
your applications to use it.  Wouldn't it be cool if you could
just have it show up in your system like normal RAM and get to
it like a slow blob of memory?  Well... have I got the patch
series for you!

== Background / Use Cases ==

Persistent Memory (aka Non-Volatile DIMMs / NVDIMMS) themselves
are described in detail in Documentation/nvdimm/nvdimm.txt.
However, this documentation focuses on actually using them as
storage.  This set is focused on using NVDIMMs as DRAM replacement.

This is intended for Intel-style NVDIMMs (aka. Intel Optane DC
persistent memory) NVDIMMs.  These DIMMs are physically persistent,
more akin to flash than traditional RAM.  They are also expected to
be more cost-effective than using RAM, which is why folks want this
set in the first place.

This set is not intended for RAM-based NVDIMMs.  Those are not
cost-effective vs. plain RAM, and this using them here would simply
be a waste.  However, if you have a user of a system that does not
care about persistence and wants all the RAM they can get, this
could allow you to use a RAM-based NVDIMM where it would otherwise
go unused.

But, why would you bother with this approach?  Intel itself [1]
has announced a hardware feature that does something very similar:
"Memory Mode" which turns DRAM into a cache in front of persistent
memory, which is then as a whole used as normal "RAM".

Here are a few reasons:
1. The capacity of memory mode is the size of your persistent
   memory that you dedicate.  DRAM capacity is "lost" because it
   is used for cache.  With this, you get PMEM+DRAM capacity for
   memory.
2. DRAM acts as a cache with memory mode, and caches can lead to
   unpredictable latencies.  Since memory mode is all-or-nothing
   (either all your DRAM is used as cache or none is), your entire
   memory space is exposed to these unpredictable latencies.  This
   solution lets you guarantee DRAM latencies if you need them.
3. The new "tier" of memory is exposed to software.  That means
   that you can build tiered applications or infrastructure.  A
   cloud provider could sell cheaper VMs that use more PMEM and
   more expensive ones that use DRAM.  That's impossible with
   memory mode.

Don't take this as criticism of memory mode.  Memory mode is
awesome, and doesn't strictly require *any* software changes (we
have software changes proposed for optimizing it though).  It has
tons of other advantages over *this* approach.  Basically, we
believe that the approach in these patches is complementary to
memory mode and that both can live side-by-side in harmony.

== Patch Set Overview ==

This series adds a new "driver" to which pmem devices can be
attached.  

[PATCH 2/5] mm/resource: move HMM pr_debug() deeper into resource code

2019-02-25 Thread Dave Hansen


From: Dave Hansen 

HMM consumes physical address space for its own use, even
though nothing is mapped or accessible there.  It uses a
special resource description (IORES_DESC_DEVICE_PRIVATE_MEMORY)
to uniquely identify these areas.

When HMM consumes address space, it makes a best guess about
what to consume.  However, it is possible that a future memory
or device hotplug can collide with the reserved area.  In the
case of these conflicts, there is an error message in
register_memory_resource().

Later patches in this series move register_memory_resource()
from using request_resource_conflict() to __request_region().
Unfortunately, __request_region() does not return the conflict
like the previous function did, which makes it impossible to
check for IORES_DESC_DEVICE_PRIVATE_MEMORY in a conflicting
resource.

Instead of warning in register_memory_resource(), move the
check into the core resource code itself (__request_region())
where the conflicting resource _is_ available.  This has the
added bonus of producing a warning in case of HMM conflicts
with devices *or* RAM address space, as opposed to the RAM-
only warnings that were there previously.

Signed-off-by: Dave Hansen 
Reviewed-by: Jerome Glisse 
Cc: Dan Williams 
Cc: Dave Jiang 
Cc: Ross Zwisler 
Cc: Vishal Verma 
Cc: Tom Lendacky 
Cc: Andrew Morton 
Cc: Michal Hocko 
Cc: linux-nvdimm@lists.01.org
Cc: linux-ker...@vger.kernel.org
Cc: linux...@kvack.org
Cc: Huang Ying 
Cc: Fengguang Wu 
Cc: Keith Busch 
---

 b/kernel/resource.c   |9 +
 b/mm/memory_hotplug.c |5 -
 2 files changed, 9 insertions(+), 5 deletions(-)

diff -puN kernel/resource.c~move-request_region-check kernel/resource.c
--- a/kernel/resource.c~move-request_region-check   2019-02-25 
10:56:48.581908031 -0800
+++ b/kernel/resource.c 2019-02-25 10:56:48.588908031 -0800
@@ -1132,6 +1132,15 @@ struct resource * __request_region(struc
conflict = __request_resource(parent, res);
if (!conflict)
break;
+   /*
+* mm/hmm.c reserves physical addresses which then
+* become unavailable to other users.  Conflicts are
+* not expected.  Warn to aid debugging if encountered.
+*/
+   if (conflict->desc == IORES_DESC_DEVICE_PRIVATE_MEMORY) {
+   pr_warn("Unaddressable device %s %pR conflicts with %pR",
+   conflict->name, conflict, res);
+   }
if (conflict != parent) {
if (!(conflict->flags & IORESOURCE_BUSY)) {
parent = conflict;
diff -puN mm/memory_hotplug.c~move-request_region-check mm/memory_hotplug.c
--- a/mm/memory_hotplug.c~move-request_region-check 2019-02-25 
10:56:48.583908031 -0800
+++ b/mm/memory_hotplug.c   2019-02-25 10:56:48.588908031 -0800
@@ -111,11 +111,6 @@ static struct resource *register_memory_
res->flags = IORESOURCE_SYSTEM_RAM | IORESOURCE_BUSY;
conflict = request_resource_conflict(&iomem_resource, res);
if (conflict) {
-   if (conflict->desc == IORES_DESC_DEVICE_PRIVATE_MEMORY) {
-   pr_debug("Device unaddressable memory block "
-"memory hotplug at %#010llx !\n",
-(unsigned long long)start);
-   }
pr_debug("System RAM resource %pR cannot be added\n", res);
kfree(res);
return ERR_PTR(-EEXIST);
_


[PATCH 4/5] mm/resource: let walk_system_ram_range() search child resources

2019-02-25 Thread Dave Hansen


From: Dave Hansen 

In the process of onlining memory, we use walk_system_ram_range()
to find the actual RAM areas inside of the area being onlined.

However, it currently only finds memory resources which are
"top-level" iomem_resources.  Children are not currently
searched which causes it to skip System RAM in areas like this
(in the format of /proc/iomem):

a000-bfff : Persistent Memory (legacy)
  a000-afff : System RAM

Changing the true->false here allows children to be searched
as well.  We need this because we add a new "System RAM"
resource underneath the "persistent memory" resource when
we use persistent memory in a volatile mode.

Signed-off-by: Dave Hansen 
Cc: Keith Busch 
Cc: Dan Williams 
Cc: Dave Jiang 
Cc: Ross Zwisler 
Cc: Vishal Verma 
Cc: Tom Lendacky 
Cc: Andrew Morton 
Cc: Michal Hocko 
Cc: linux-nvdimm@lists.01.org
Cc: linux-ker...@vger.kernel.org
Cc: linux...@kvack.org
Cc: Huang Ying 
Cc: Fengguang Wu 
Cc: Borislav Petkov 
Cc: Bjorn Helgaas 
Cc: Yaowei Bai 
Cc: Takashi Iwai 
Cc: Jerome Glisse 
---

 b/kernel/resource.c |5 -
 1 file changed, 4 insertions(+), 1 deletion(-)

diff -puN kernel/resource.c~mm-walk_system_ram_range-search-child-resources 
kernel/resource.c
--- a/kernel/resource.c~mm-walk_system_ram_range-search-child-resources 
2019-02-25 10:56:50.750908026 -0800
+++ b/kernel/resource.c 2019-02-25 10:56:50.754908026 -0800
@@ -454,6 +454,9 @@ int walk_mem_res(u64 start, u64 end, voi
  * This function calls the @func callback against all memory ranges of type
  * System RAM which are marked as IORESOURCE_SYSTEM_RAM and IORESOUCE_BUSY.
  * It is to be used only for System RAM.
+ *
+ * This will find System RAM ranges that are children of top-level resources
+ * in addition to top-level System RAM resources.
  */
 int walk_system_ram_range(unsigned long start_pfn, unsigned long nr_pages,
  void *arg, int (*func)(unsigned long, unsigned long, 
void *))
@@ -469,7 +472,7 @@ int walk_system_ram_range(unsigned long
flags = IORESOURCE_SYSTEM_RAM | IORESOURCE_BUSY;
while (start < end &&
   !find_next_iomem_res(start, end, flags, IORES_DESC_NONE,
-   true, &res)) {
+   false, &res)) {
pfn = (res.start + PAGE_SIZE - 1) >> PAGE_SHIFT;
end_pfn = (res.end + 1) >> PAGE_SHIFT;
if (end_pfn > pfn)
_
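
For reference, here is a minimal sketch (not part of the series; count_ram_pages()
and ram_pages_in() are made-up names) of a walk_system_ram_range() caller, to show
what the child search changes: with it, the callback also runs over "System RAM"
nested under "Persistent Memory (legacy)".

#include <linux/ioport.h>

static int count_ram_pages(unsigned long start_pfn, unsigned long nr_pages,
                           void *arg)
{
        unsigned long *total = arg;

        *total += nr_pages;
        return 0;
}

static unsigned long ram_pages_in(unsigned long start_pfn, unsigned long nr_pages)
{
        unsigned long total = 0;

        /* Visits every "System RAM" piece of the range, children included. */
        walk_system_ram_range(start_pfn, nr_pages, &total, count_ram_pages);
        return total;
}
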
___
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm


[PATCH 1/5] mm/resource: return real error codes from walk failures

2019-02-25 Thread Dave Hansen


From: Dave Hansen 

walk_system_ram_range() can return an error code either because
*it* failed, or because the 'func' that it calls returned an
error.  The memory hotplug does the following:

ret = walk_system_ram_range(..., func);
if (ret)
return ret;

and 'ret' makes it out to userspace, eventually.  The problem
is, walk_system_ram_range() failures that result from *it* failing
(as opposed to 'func') return -1.  That leads to a very odd
-EPERM (-1) return code out to userspace.

Make walk_system_ram_range() return -EINVAL for internal
failures to keep userspace less confused.

This return code is compatible with all the callers that I
audited.

This changes both the generic mm/ and powerpc-specific
implementations to have the same return value.

Signed-off-by: Dave Hansen 
Reviewed-by: Bjorn Helgaas 
Acked-by: Michael Ellerman  (powerpc)
Cc: Dan Williams 
Cc: Dave Jiang 
Cc: Ross Zwisler 
Cc: Vishal Verma 
Cc: Tom Lendacky 
Cc: Andrew Morton 
Cc: Michal Hocko 
Cc: linux-nvdimm@lists.01.org
Cc: linux-ker...@vger.kernel.org
Cc: linux...@kvack.org
Cc: Huang Ying 
Cc: Fengguang Wu 
Cc: Borislav Petkov 
Cc: Yaowei Bai 
Cc: Takashi Iwai 
Cc: Jerome Glisse 
Cc: Benjamin Herrenschmidt 
Cc: Paul Mackerras 
Cc: linuxppc-...@lists.ozlabs.org
Cc: Keith Busch 
---

 b/arch/powerpc/mm/mem.c |2 +-
 b/kernel/resource.c |4 ++--
 2 files changed, 3 insertions(+), 3 deletions(-)

diff -puN 
arch/powerpc/mm/mem.c~memory-hotplug-walk_system_ram_range-returns-neg-1 
arch/powerpc/mm/mem.c
--- a/arch/powerpc/mm/mem.c~memory-hotplug-walk_system_ram_range-returns-neg-1  
2019-02-25 10:56:47.452908034 -0800
+++ b/arch/powerpc/mm/mem.c 2019-02-25 10:56:47.458908034 -0800
@@ -189,7 +189,7 @@ walk_system_ram_range(unsigned long star
struct memblock_region *reg;
unsigned long end_pfn = start_pfn + nr_pages;
unsigned long tstart, tend;
-   int ret = -1;
+   int ret = -EINVAL;
 
for_each_memblock(memory, reg) {
tstart = max(start_pfn, memblock_region_memory_base_pfn(reg));
diff -puN kernel/resource.c~memory-hotplug-walk_system_ram_range-returns-neg-1 
kernel/resource.c
--- a/kernel/resource.c~memory-hotplug-walk_system_ram_range-returns-neg-1  
2019-02-25 10:56:47.454908034 -0800
+++ b/kernel/resource.c 2019-02-25 10:56:47.459908034 -0800
@@ -382,7 +382,7 @@ static int __walk_iomem_res_desc(resourc
 int (*func)(struct resource *, void *))
 {
struct resource res;
-   int ret = -1;
+   int ret = -EINVAL;
 
while (start < end &&
   !find_next_iomem_res(start, end, flags, desc, first_lvl, &res)) {
@@ -462,7 +462,7 @@ int walk_system_ram_range(unsigned long
unsigned long flags;
struct resource res;
unsigned long pfn, end_pfn;
-   int ret = -1;
+   int ret = -EINVAL;
 
start = (u64) start_pfn << PAGE_SHIFT;
end = ((u64)(start_pfn + nr_pages) << PAGE_SHIFT) - 1;
_
___
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm


[PATCH 3/5] mm/memory-hotplug: allow memory resources to be children

2019-02-25 Thread Dave Hansen


From: Dave Hansen 

The mm/resource.c code is used to manage the physical address
space.  The current resource configuration can be viewed in
/proc/iomem.  An example of this is at the bottom of this
description.

The nvdimm subsystem "owns" the physical address resources which
map to persistent memory and has resources inserted for them as
"Persistent Memory".  The best way to repurpose this for volatile
use is to leave the existing resource in place, but add a "System
RAM" resource underneath it. This clearly communicates the
ownership relationship of this memory.

The request_resource_conflict() API only deals with the
top-level resources.  Replace it with __request_region() which
will search for !IORESOURCE_BUSY areas lower in the resource
tree than the top level.

We *could* also simply truncate the existing top-level
"Persistent Memory" resource and take over the released address
space.  But, this means that if we ever decide to hot-unplug the
"RAM" and give it back, we need to recreate the original setup,
which may mean going back to the BIOS tables.

This should have no real effect on the existing collision
detection because the areas that truly conflict should be marked
IORESOURCE_BUSY.

-0fff : Reserved
1000-0009fbff : System RAM
0009fc00-0009 : Reserved
000a-000b : PCI Bus :00
000c-000c97ff : Video ROM
000c9800-000ca5ff : Adapter ROM
000f-000f : Reserved
  000f-000f : System ROM
0010-9fff : System RAM
  0100-01e071d0 : Kernel code
  01e071d1-027dfdff : Kernel data
  02dc6000-0305dfff : Kernel bss
a000-afff : Persistent Memory (legacy)
  a000-a7ff : System RAM
b000-bffd : System RAM
bffe-bfff : Reserved
c000-febf : PCI Bus :00

Signed-off-by: Dave Hansen 
Reviewed-by: Dan Williams 
Reviewed-by: Vishal Verma 
Cc: Dave Jiang 
Cc: Ross Zwisler 
Cc: Vishal Verma 
Cc: Tom Lendacky 
Cc: Andrew Morton 
Cc: Michal Hocko 
Cc: linux-nvdimm@lists.01.org
Cc: linux-ker...@vger.kernel.org
Cc: linux...@kvack.org
Cc: Huang Ying 
Cc: Fengguang Wu 
Cc: Borislav Petkov 
Cc: Bjorn Helgaas 
Cc: Yaowei Bai 
Cc: Takashi Iwai 
Cc: Jerome Glisse 
Cc: Keith Busch 
---

 b/mm/memory_hotplug.c |   26 ++
 1 file changed, 14 insertions(+), 12 deletions(-)

diff -puN 
mm/memory_hotplug.c~mm-memory-hotplug-allow-memory-resource-to-be-child 
mm/memory_hotplug.c
--- a/mm/memory_hotplug.c~mm-memory-hotplug-allow-memory-resource-to-be-child   
2019-02-25 10:56:49.707908029 -0800
+++ b/mm/memory_hotplug.c   2019-02-25 10:56:49.711908029 -0800
@@ -100,19 +100,21 @@ void mem_hotplug_done(void)
 /* add this memory to iomem resource */
 static struct resource *register_memory_resource(u64 start, u64 size)
 {
-   struct resource *res, *conflict;
-   res = kzalloc(sizeof(struct resource), GFP_KERNEL);
-   if (!res)
-   return ERR_PTR(-ENOMEM);
+   struct resource *res;
+   unsigned long flags =  IORESOURCE_SYSTEM_RAM | IORESOURCE_BUSY;
+   char *resource_name = "System RAM";
 
-   res->name = "System RAM";
-   res->start = start;
-   res->end = start + size - 1;
-   res->flags = IORESOURCE_SYSTEM_RAM | IORESOURCE_BUSY;
-   conflict =  request_resource_conflict(&iomem_resource, res);
-   if (conflict) {
-   pr_debug("System RAM resource %pR cannot be added\n", res);
-   kfree(res);
+   /*
+* Request ownership of the new memory range.  This might be
+* a child of an existing resource that was present but
+* not marked as busy.
+*/
+   res = __request_region(&iomem_resource, start, size,
+  resource_name, flags);
+
+   if (!res) {
+   pr_debug("Unable to reserve System RAM region: 
%016llx->%016llx\n",
+   start, start + size);
return ERR_PTR(-EEXIST);
}
return res;
_
___
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm


[PATCH 5/5] dax: "Hotplug" persistent memory for use like normal RAM

2019-02-25 Thread Dave Hansen


From: Dave Hansen 

This is intended for use with NVDIMMs that are physically persistent
(physically like flash) so that they can be used as a cost-effective
RAM replacement.  Intel Optane DC persistent memory is one
implementation of this kind of NVDIMM.

Currently, a persistent memory region is "owned" by a device driver,
either the "Direct DAX" or "Filesystem DAX" drivers.  These drivers
allow applications to explicitly use persistent memory, generally
by being modified to use special, new libraries. (DIMM-based
persistent memory hardware/software is described in great detail
here: Documentation/nvdimm/nvdimm.txt).

However, this limits persistent memory use to applications which
*have* been modified.  To make it more broadly usable, this driver
"hotplugs" memory into the kernel, to be managed and used just like
normal RAM would be.

To make this work, management software must remove the device from
being controlled by the "Device DAX" infrastructure:

echo dax0.0 > /sys/bus/dax/drivers/device_dax/unbind

and then tell the new driver that it can bind to the device:

echo dax0.0 > /sys/bus/dax/drivers/kmem/new_id

After this, there will be a number of new memory sections visible
in sysfs that can be onlined, or that may get onlined by existing
udev-initiated memory hotplug rules.

This rebinding procedure is currently a one-way trip.  Once memory
is bound to "kmem", it's there permanently and can not be
unbound and assigned back to device_dax.

The kmem driver will never bind to a dax device unless the device
is *explicitly* bound to the driver.  There are two reasons for
this: One, since it is a one-way trip, it can not be undone if
bound incorrectly.  Two, the kmem driver destroys data on the
device.  Think of if you had good data on a pmem device.  It
would be catastrophic if you compile-in "kmem", but leave out
the "device_dax" driver.  kmem would take over the device and
write volatile data all over your good data.

This inherits any existing NUMA information for the newly-added
memory from the persistent memory device that came from the
firmware.  On Intel platforms, the firmware has guarantees that
require each socket's persistent memory to be in a separate
memory-only NUMA node.  That means that this patch is not expected
to create NUMA nodes, but will simply hotplug memory into existing
nodes.

Because NUMA nodes are created, the existing NUMA APIs and tools
are sufficient to create policies for applications or memory areas
to have affinity for or an aversion to using this memory.

There is currently some metadata at the beginning of pmem regions.
The section-size memory hotplug restrictions, plus this small
reserved area can cause the "loss" of a section or two of capacity.
This should be fixable in follow-on patches.  But, as a first step,
losing 256MB of memory (worst case) out of hundreds of gigabytes
is a good tradeoff vs. the required code to fix this up precisely.
This calculation is also the reason we export
memory_block_size_bytes().
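
A rough worked example of that calculation (plain userspace C with made-up
numbers, not part of the patch): with 128MB memory blocks, rounding the start
up and the size down costs at most two blocks.

#include <stdio.h>

int main(void)
{
        /* Illustrative values only; "block" stands in for
         * memory_block_size_bytes(), typically 128MB on x86. */
        unsigned long long block = 128ULL << 20;
        unsigned long long start = (4ULL << 30) + (2ULL << 20); /* 2MB into a block */
        unsigned long long size  = 16ULL << 30;

        unsigned long long kmem_start = (start + block - 1) & ~(block - 1);
        unsigned long long kmem_size  = size - (kmem_start - start);

        kmem_size &= ~(block - 1);      /* align the size down to whole blocks */

        /* Here: 128MB lost; the worst case approaches two blocks (256MB). */
        printf("lost %llu MB of %llu MB\n",
               (size - kmem_size) >> 20, size >> 20);
        return 0;
}
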

Signed-off-by: Dave Hansen 
Reviewed-by: Dan Williams 
Reviewed-by: Keith Busch 
Cc: Dave Jiang 
Cc: Ross Zwisler 
Cc: Vishal Verma 
Cc: Tom Lendacky 
Cc: Andrew Morton 
Cc: Michal Hocko 
Cc: linux-nvdimm@lists.01.org
Cc: linux-ker...@vger.kernel.org
Cc: linux...@kvack.org
Cc: Huang Ying 
Cc: Fengguang Wu 
Cc: Borislav Petkov 
Cc: Bjorn Helgaas 
Cc: Yaowei Bai 
Cc: Takashi Iwai 
Cc: Jerome Glisse 
---

 b/drivers/base/memory.c |1 
 b/drivers/dax/Kconfig   |   16 +++
 b/drivers/dax/Makefile  |1 
 b/drivers/dax/kmem.c|  108 
 4 files changed, 126 insertions(+)

diff -puN drivers/base/memory.c~dax-kmem-try-4 drivers/base/memory.c
--- a/drivers/base/memory.c~dax-kmem-try-4  2019-02-25 10:56:51.791908023 
-0800
+++ b/drivers/base/memory.c 2019-02-25 10:56:51.800908023 -0800
@@ -88,6 +88,7 @@ unsigned long __weak memory_block_size_b
 {
return MIN_MEMORY_BLOCK_SIZE;
 }
+EXPORT_SYMBOL_GPL(memory_block_size_bytes);
 
 static unsigned long get_memory_block_size(void)
 {
diff -puN drivers/dax/Kconfig~dax-kmem-try-4 drivers/dax/Kconfig
--- a/drivers/dax/Kconfig~dax-kmem-try-42019-02-25 10:56:51.793908023 
-0800
+++ b/drivers/dax/Kconfig   2019-02-25 10:56:51.800908023 -0800
@@ -32,6 +32,22 @@ config DEV_DAX_PMEM
 
  Say M if unsure
 
+config DEV_DAX_KMEM
+   tristate "KMEM DAX: volatile-use of persistent memory"
+   default DEV_DAX
+   depends on DEV_DAX
+   depends on MEMORY_HOTPLUG # for add_memory() and friends
+   help
+ Support access to persistent memory as if it were RAM.  This
+ allows easier use of persistent memory by unmodified
+ applications.
+
+ To use this feature, a DAX device must be unbound from the
+ device_dax driver (PMEM DAX) and bound to this kmem driver
+ on

Re: question about page tables in DAX/FS/PMEM case

2019-02-21 Thread Dave Hansen
On 2/21/19 2:58 PM, Larry Bassel wrote:
> AFAIK there is no hardware benefit from sharing the page table
> directory within different page table. So the only benefit is the
> amount of memory we save.

The hardware benefit from schemes like this is that the CPU caches are
better utilized.  If two processes share page tables, they don't share
TLB entries, but they *do* share the contents of the CPU's caches.  That
will make TLB misses faster.

It probably doesn't matter *that* much in practice because the page
walker doing TLB fills does a pretty good job of hiding all the latency,
but it might matter in extreme cases.
___
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm


Re: [PATCH 5/5] dax: "Hotplug" persistent memory for use like normal RAM

2019-02-12 Thread Dave Hansen
On 2/9/19 3:00 AM, Brice Goglin wrote:
> I've used your patches on fake hardware (memmap=xx!yy) with an older
> nvdimm-pending branch (without Keith's patches). It worked fine. This
> time I am running on real Intel hardware. Any idea where to look ?

I've run them on real Intel hardware too.

Could you share the exact sequence of commands you're issuing to
reproduce the hang?  My guess would be that there's some odd interaction
between Dan's latest branch and my now (slightly) stale patches.

I'll refresh them this week and see if I can reproduce what you're seeing.
___
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm


Re: [PATCH 0/5] [v4] Allow persistent memory to be used like normal RAM

2019-01-28 Thread Dave Hansen
On 1/28/19 3:09 AM, Balbir Singh wrote:
>> This is intended for Intel-style NVDIMMs (aka. Intel Optane DC
>> persistent memory) NVDIMMs.  These DIMMs are physically persistent,
>> more akin to flash than traditional RAM.  They are also expected to
>> be more cost-effective than using RAM, which is why folks want this
>> set in the first place.
> What variant of NVDIMM's F/P or both?

I'd expect this to get used in any cases where the NVDIMM is
cost-effective vs. DRAM.  Today, I think that's only NVDIMM-P.  At least
from what Wikipedia tells me about F vs. P vs. N:

https://en.wikipedia.org/wiki/NVDIMM

>> == Patch Set Overview ==
>>
>> This series adds a new "driver" to which pmem devices can be
>> attached.  Once attached, the memory "owned" by the device is
>> hot-added to the kernel and managed like any other memory.  On
>> systems with an HMAT (a new ACPI table), each socket (roughly)
>> will have a separate NUMA node for its persistent memory so
>> this newly-added memory can be selected by its unique NUMA
>> node.
> 
> NUMA is distance based topology, does HMAT solve these problems?

NUMA is no longer just distance-based.  Any memory with different
properties, like different memory-side caches or bandwidth properties
can be in its own, discrete NUMA node.

> How do we prevent fallback nodes of normal nodes being pmem nodes?

NUMA policies.
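
As one illustration (a sketch, not part of the series; the node numbers are
made up), a task that must stay on DRAM can bind itself with the existing
libnuma calls, built with -lnuma:

#include <numa.h>

int main(void)
{
        /* "0-1" is a made-up DRAM-only node list for illustration. */
        struct bitmask *dram_nodes;

        if (numa_available() < 0)
                return 1;
        dram_nodes = numa_parse_nodestring("0-1");
        if (!dram_nodes)
                return 1;
        /* Future allocations in this task may only come from the DRAM nodes. */
        numa_set_membind(dram_nodes);
        numa_free_nodemask(dram_nodes);
        return 0;
}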

> On an unexpected crash/failure is there a scrubbing mechanism
> or do we rely on the allocator to do the right thing prior to
> reallocating any memory.

Yes, but this is not unique to persistent memory.  On a kexec-based
crash, there might be old, sensitive data in *RAM* when the kernel comes
up.  We depend on the allocator to zero things there.  We also just
plain depend on the allocator to zero things so we don't leak
information when recycling pages in the first place.

I can't think of a scenario where some kind of "leak" of old data
wouldn't also be a bug with normal, volatile RAM.

> Will frequent zero'ing hurt NVDIMM/pmem's life times?

Everybody reputable that sells things with limited endurance quantifies
the endurance.  I'd suggest that folks know what the endurance of their
media is before enabling this.
___
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm


Re: [PATCH 2/5] mm/resource: move HMM pr_debug() deeper into resource code

2019-01-25 Thread Dave Hansen
On 1/25/19 1:18 PM, Bjorn Helgaas wrote:
> On Thu, Jan 24, 2019 at 5:21 PM Dave Hansen  
> wrote:
>> diff -puN kernel/resource.c~move-request_region-check kernel/resource.c
>> --- a/kernel/resource.c~move-request_region-check   2019-01-24 
>> 15:13:14.453199539 -0800
>> +++ b/kernel/resource.c 2019-01-24 15:13:14.458199539 -0800
>> @@ -1123,6 +1123,16 @@ struct resource * __request_region(struc
>> conflict = __request_resource(parent, res);
>> if (!conflict)
>> break;
>> +   /*
>> +* mm/hmm.c reserves physical addresses which then
>> +* become unavailable to other users.  Conflicts are
>> +* not expected.  Be verbose if one is encountered.
>> +*/
>> +   if (conflict->desc == IORES_DESC_DEVICE_PRIVATE_MEMORY) {
>> +   pr_debug("Resource conflict with unaddressable "
>> +"device memory at %#010llx !\n",
>> +(unsigned long long)start);
> 
> I don't object to the change, but are you really OK with this being a
> pr_debug() message that is only emitted when enabled via either the
> dynamic debug mechanism or DEBUG being defined?  From the comments, it
> seems more like a KERN_INFO sort of message.

I left it consistent with the original message that was in the code.
I'm happy to change it, though, if the consumers of it (Jerome,
basically) want something different.

> Also, maybe the message would be more useful if it included the
> conflicting resource as well as the region you're requesting?  Many of
> the other callers of request_resource_conflict() have something like
> this:
> 
>   dev_err(dev, "resource collision: %pR conflicts with %s %pR\n",
> new, conflict->name, conflict);

Seems sane.  I was just trying to change as little as possible.

___
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm


Re: [PATCH 1/5] mm/resource: return real error codes from walk failures

2019-01-25 Thread Dave Hansen
On 1/25/19 1:02 PM, Bjorn Helgaas wrote:
>> @@ -453,7 +453,7 @@ int walk_system_ram_range(unsigned long
>> unsigned long flags;
>> struct resource res;
>> unsigned long pfn, end_pfn;
>> -   int ret = -1;
>> +   int ret = -EINVAL;
> Can you either make a similar change to the powerpc version of
> walk_system_ram_range() in arch/powerpc/mm/mem.c or explain why it's
> not needed?  It *seems* like we'd want both versions of
> walk_system_ram_range() to behave similarly in this respect.

Sure.  A quick grep shows powerpc being the only other implementation.
I'll just add this hunk:

> diff -puN 
> arch/powerpc/mm/mem.c~memory-hotplug-walk_system_ram_range-returns-neg-1 
> arch/powerpc/mm/mem.c
> --- 
> a/arch/powerpc/mm/mem.c~memory-hotplug-walk_system_ram_range-returns-neg-1  
> 2019-01-25 12:57:00.04446 -0800
> +++ b/arch/powerpc/mm/mem.c 2019-01-25 12:58:13.215004263 -0800 
> @@ -188,7 +188,7 @@ walk_system_ram_range(unsigned long star 
> struct memblock_region *reg; 
> unsigned long end_pfn = start_pfn + nr_pages; 
> unsigned long tstart, tend; 
> -   int ret = -1; 
> +   int ret = -EINVAL; 

I'll also dust off the ol' cross-compiler and make sure I didn't
fat-finger anything.
___
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm


[PATCH 5/5] dax: "Hotplug" persistent memory for use like normal RAM

2019-01-24 Thread Dave Hansen


From: Dave Hansen 

This is intended for use with NVDIMMs that are physically persistent
(physically like flash) so that they can be used as a cost-effective
RAM replacement.  Intel Optane DC persistent memory is one
implementation of this kind of NVDIMM.

Currently, a persistent memory region is "owned" by a device driver,
either the "Direct DAX" or "Filesystem DAX" drivers.  These drivers
allow applications to explicitly use persistent memory, generally
by being modified to use special, new libraries. (DIMM-based
persistent memory hardware/software is described in great detail
here: Documentation/nvdimm/nvdimm.txt).

However, this limits persistent memory use to applications which
*have* been modified.  To make it more broadly usable, this driver
"hotplugs" memory into the kernel, to be managed and used just like
normal RAM would be.

To make this work, management software must remove the device from
being controlled by the "Device DAX" infrastructure:

echo -n dax0.0 > /sys/bus/dax/drivers/device_dax/remove_id
echo -n dax0.0 > /sys/bus/dax/drivers/device_dax/unbind

and then bind it to this new driver:

echo -n dax0.0 > /sys/bus/dax/drivers/kmem/new_id
echo -n dax0.0 > /sys/bus/dax/drivers/kmem/bind

After this, there will be a number of new memory sections visible
in sysfs that can be onlined, or that may get onlined by existing
udev-initiated memory hotplug rules.

This rebinding procedure is currently a one-way trip.  Once memory
is bound to "kmem", it's there permanently and can not be
unbound and assigned back to device_dax.

The kmem driver will never bind to a dax device unless the device
is *explicitly* bound to the driver.  There are two reasons for
this: One, since it is a one-way trip, it can not be undone if
bound incorrectly.  Two, the kmem driver destroys data on the
device.  Think of if you had good data on a pmem device.  It
would be catastrophic if you compile-in "kmem", but leave out
the "device_dax" driver.  kmem would take over the device and
write volatile data all over your good data.

This inherits any existing NUMA information for the newly-added
memory from the persistent memory device that came from the
firmware.  On Intel platforms, the firmware has guarantees that
require each socket's persistent memory to be in a separate
memory-only NUMA node.  That means that this patch is not expected
to create NUMA nodes, but will simply hotplug memory into existing
nodes.

Because NUMA nodes are created, the existing NUMA APIs and tools
are sufficient to create policies for applications or memory areas
to have affinity for or an aversion to using this memory.

There is currently some metadata at the beginning of pmem regions.
The section-size memory hotplug restrictions, plus this small
reserved area can cause the "loss" of a section or two of capacity.
This should be fixable in follow-on patches.  But, as a first step,
losing 256MB of memory (worst case) out of hundreds of gigabytes
is a good tradeoff vs. the required code to fix this up precisely.
This calculation is also the reason we export
memory_block_size_bytes().

Signed-off-by: Dave Hansen 
Cc: Dan Williams 
Cc: Dave Jiang 
Cc: Ross Zwisler 
Cc: Vishal Verma 
Cc: Tom Lendacky 
Cc: Andrew Morton 
Cc: Michal Hocko 
Cc: linux-nvdimm@lists.01.org
Cc: linux-ker...@vger.kernel.org
Cc: linux...@kvack.org
Cc: Huang Ying 
Cc: Fengguang Wu 
Cc: Borislav Petkov 
Cc: Bjorn Helgaas 
Cc: Yaowei Bai 
Cc: Takashi Iwai 
Cc: Jerome Glisse 
---

 b/drivers/base/memory.c |1 
 b/drivers/dax/Kconfig   |   16 +++
 b/drivers/dax/Makefile  |1 
 b/drivers/dax/kmem.c|  108 
 4 files changed, 126 insertions(+)

diff -puN drivers/base/memory.c~dax-kmem-try-4 drivers/base/memory.c
--- a/drivers/base/memory.c~dax-kmem-try-4  2019-01-24 15:13:15.987199535 
-0800
+++ b/drivers/base/memory.c 2019-01-24 15:13:15.994199535 -0800
@@ -88,6 +88,7 @@ unsigned long __weak memory_block_size_b
 {
return MIN_MEMORY_BLOCK_SIZE;
 }
+EXPORT_SYMBOL_GPL(memory_block_size_bytes);
 
 static unsigned long get_memory_block_size(void)
 {
diff -puN drivers/dax/Kconfig~dax-kmem-try-4 drivers/dax/Kconfig
--- a/drivers/dax/Kconfig~dax-kmem-try-42019-01-24 15:13:15.988199535 
-0800
+++ b/drivers/dax/Kconfig   2019-01-24 15:13:15.994199535 -0800
@@ -32,6 +32,22 @@ config DEV_DAX_PMEM
 
  Say M if unsure
 
+config DEV_DAX_KMEM
+   tristate "KMEM DAX: volatile-use of persistent memory"
+   default DEV_DAX
+   depends on DEV_DAX
+   depends on MEMORY_HOTPLUG # for add_memory() and friends
+   help
+ Support access to persistent memory as if it were RAM.  This
+ allows easier use of persistent memory by unmodified
+ applications.
+
+ To use this feature, a DAX device must be unbound from th

[PATCH 3/5] mm/memory-hotplug: allow memory resources to be children

2019-01-24 Thread Dave Hansen


From: Dave Hansen 

The mm/resource.c code is used to manage the physical address
space.  The current resource configuration can be viewed in
/proc/iomem.  An example of this is at the bottom of this
description.

The nvdimm subsystem "owns" the physical address resources which
map to persistent memory and has resources inserted for them as
"Persistent Memory".  The best way to repurpose this for volatile
use is to leave the existing resource in place, but add a "System
RAM" resource underneath it. This clearly communicates the
ownership relationship of this memory.

The request_resource_conflict() API only deals with the
top-level resources.  Replace it with __request_region() which
will search for !IORESOURCE_BUSY areas lower in the resource
tree than the top level.

We *could* also simply truncate the existing top-level
"Persistent Memory" resource and take over the released address
space.  But, this means that if we ever decide to hot-unplug the
"RAM" and give it back, we need to recreate the original setup,
which may mean going back to the BIOS tables.

This should have no real effect on the existing collision
detection because the areas that truly conflict should be marked
IORESOURCE_BUSY.

-0fff : Reserved
1000-0009fbff : System RAM
0009fc00-0009 : Reserved
000a-000b : PCI Bus :00
000c-000c97ff : Video ROM
000c9800-000ca5ff : Adapter ROM
000f-000f : Reserved
  000f-000f : System ROM
0010-9fff : System RAM
  0100-01e071d0 : Kernel code
  01e071d1-027dfdff : Kernel data
  02dc6000-0305dfff : Kernel bss
a000-afff : Persistent Memory (legacy)
  a000-a7ff : System RAM
b000-bffd : System RAM
bffe-bfff : Reserved
c000-febf : PCI Bus :00

Signed-off-by: Dave Hansen 
Cc: Dan Williams 
Cc: Dave Jiang 
Cc: Ross Zwisler 
Cc: Vishal Verma 
Cc: Tom Lendacky 
Cc: Andrew Morton 
Cc: Michal Hocko 
Cc: linux-nvdimm@lists.01.org
Cc: linux-ker...@vger.kernel.org
Cc: linux...@kvack.org
Cc: Huang Ying 
Cc: Fengguang Wu 
Cc: Borislav Petkov 
Cc: Bjorn Helgaas 
Cc: Yaowei Bai 
Cc: Takashi Iwai 
Cc: Jerome Glisse 
---

 b/mm/memory_hotplug.c |   26 ++
 1 file changed, 14 insertions(+), 12 deletions(-)

diff -puN 
mm/memory_hotplug.c~mm-memory-hotplug-allow-memory-resource-to-be-child 
mm/memory_hotplug.c
--- a/mm/memory_hotplug.c~mm-memory-hotplug-allow-memory-resource-to-be-child   
2019-01-24 15:13:14.979199537 -0800
+++ b/mm/memory_hotplug.c   2019-01-24 15:13:14.983199537 -0800
@@ -98,19 +98,21 @@ void mem_hotplug_done(void)
 /* add this memory to iomem resource */
 static struct resource *register_memory_resource(u64 start, u64 size)
 {
-   struct resource *res, *conflict;
-   res = kzalloc(sizeof(struct resource), GFP_KERNEL);
-   if (!res)
-   return ERR_PTR(-ENOMEM);
+   struct resource *res;
+   unsigned long flags =  IORESOURCE_SYSTEM_RAM | IORESOURCE_BUSY;
+   char *resource_name = "System RAM";
 
-   res->name = "System RAM";
-   res->start = start;
-   res->end = start + size - 1;
-   res->flags = IORESOURCE_SYSTEM_RAM | IORESOURCE_BUSY;
-   conflict =  request_resource_conflict(&iomem_resource, res);
-   if (conflict) {
-   pr_debug("System RAM resource %pR cannot be added\n", res);
-   kfree(res);
+   /*
+* Request ownership of the new memory range.  This might be
+* a child of an existing resource that was present but
+* not marked as busy.
+*/
+   res = __request_region(&iomem_resource, start, size,
+  resource_name, flags);
+
+   if (!res) {
+   pr_debug("Unable to reserve System RAM region: 
%016llx->%016llx\n",
+   start, start + size);
return ERR_PTR(-EEXIST);
}
return res;
_
___
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm


[PATCH 4/5] dax/kmem: let walk_system_ram_range() search child resources

2019-01-24 Thread Dave Hansen


From: Dave Hansen 

In the process of onlining memory, we use walk_system_ram_range()
to find the actual RAM areas inside of the area being onlined.

However, it currently only finds memory resources which are
"top-level" iomem_resources.  Children are not currently
searched which causes it to skip System RAM in areas like this
(in the format of /proc/iomem):

a000-bfff : Persistent Memory (legacy)
  a000-afff : System RAM

Changing the true->false here allows children to be searched
as well.  We need this because we add a new "System RAM"
resource underneath the "persistent memory" resource when
we use persistent memory in a volatile mode.

Signed-off-by: Dave Hansen 
Cc: Dan Williams 
Cc: Dave Jiang 
Cc: Ross Zwisler 
Cc: Vishal Verma 
Cc: Tom Lendacky 
Cc: Andrew Morton 
Cc: Michal Hocko 
Cc: linux-nvdimm@lists.01.org
Cc: linux-ker...@vger.kernel.org
Cc: linux...@kvack.org
Cc: Huang Ying 
Cc: Fengguang Wu 
Cc: Borislav Petkov 
Cc: Bjorn Helgaas 
Cc: Yaowei Bai 
Cc: Takashi Iwai 
Cc: Jerome Glisse 
---

 b/kernel/resource.c |5 -
 1 file changed, 4 insertions(+), 1 deletion(-)

diff -puN kernel/resource.c~mm-walk_system_ram_range-search-child-resources 
kernel/resource.c
--- a/kernel/resource.c~mm-walk_system_ram_range-search-child-resources 
2019-01-24 15:13:15.482199536 -0800
+++ b/kernel/resource.c 2019-01-24 15:13:15.486199536 -0800
@@ -445,6 +445,9 @@ int walk_mem_res(u64 start, u64 end, voi
  * This function calls the @func callback against all memory ranges of type
  * System RAM which are marked as IORESOURCE_SYSTEM_RAM and IORESOUCE_BUSY.
  * It is to be used only for System RAM.
+ *
+ * This will find System RAM ranges that are children of top-level resources
+ * in addition to top-level System RAM resources.
  */
 int walk_system_ram_range(unsigned long start_pfn, unsigned long nr_pages,
  void *arg, int (*func)(unsigned long, unsigned long, 
void *))
@@ -460,7 +463,7 @@ int walk_system_ram_range(unsigned long
flags = IORESOURCE_SYSTEM_RAM | IORESOURCE_BUSY;
while (start < end &&
   !find_next_iomem_res(start, end, flags, IORES_DESC_NONE,
-   true, &res)) {
+   false, &res)) {
pfn = (res.start + PAGE_SIZE - 1) >> PAGE_SHIFT;
end_pfn = (res.end + 1) >> PAGE_SHIFT;
if (end_pfn > pfn)
_
___
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm


[PATCH 0/5] [v4] Allow persistent memory to be used like normal RAM

2019-01-24 Thread Dave Hansen
v3 spurred a bunch of really good discussion.  Thanks to everybody
that made comments and suggestions!

I would still love some Acks on this from the folks on cc, even if it
is on just the patch touching your area.

Note: these are based on commit d2f33c19644 in:

git://git.kernel.org/pub/scm/linux/kernel/git/djbw/nvdimm.git 
libnvdimm-pending

Changes since v3:
 * Move HMM-related resource warning instead of removing it
 * Use __request_resource() directly instead of devm.
 * Create a separate DAX_PMEM Kconfig option, complete with help text
 * Update patch descriptions and cover letter to give a better
   overview of use-cases and hardware where this might be useful.

Changes since v2:
 * Updates to dev_dax_kmem_probe() in patch 5:
   * Reject probes for devices with bad NUMA nodes.  Keeps slow
 memory from being added to node 0.
   * Use raw request_mem_region()
   * Add comments about permanent reservation
   * use dev_*() instead of printk's
 * Add references to nvdimm documentation in descriptions
 * Remove unneeded GPL export
 * Add Kconfig prompt and help text

Changes since v1:
 * Now based on git://git.kernel.org/pub/scm/linux/kernel/git/djbw/nvdimm.git
 * Use binding/unbinding from "dax bus" code
 * Move over to a "dax bus" driver from being an nvdimm driver

--

Persistent memory is cool.  But, currently, you have to rewrite
your applications to use it.  Wouldn't it be cool if you could
just have it show up in your system like normal RAM and get to
it like a slow blob of memory?  Well... have I got the patch
series for you!

== Background / Use Cases ==

Persistent Memory (aka Non-Volatile DIMMs / NVDIMMS) themselves
are described in detail in Documentation/nvdimm/nvdimm.txt.
However, this documentation focuses on actually using them as
storage.  This set is focused on using NVDIMMs as DRAM replacement.

This is intended for Intel-style NVDIMMs (aka. Intel Optane DC
persistent memory) NVDIMMs.  These DIMMs are physically persistent,
more akin to flash than traditional RAM.  They are also expected to
be more cost-effective than using RAM, which is why folks want this
set in the first place.

This set is not intended for RAM-based NVDIMMs.  Those are not
cost-effective vs. plain RAM, and this using them here would simply
be a waste.

But, why would you bother with this approach?  Intel itself [1]
has announced a hardware feature that does something very similar:
"Memory Mode" which turns DRAM into a cache in front of persistent
memory, which is then as a whole used as normal "RAM"?

Here are a few reasons:
1. The capacity of memory mode is the size of your persistent
   memory that you dedicate.  DRAM capacity is "lost" because it
   is used for cache.  With this, you get PMEM+DRAM capacity for
   memory.
2. DRAM acts as a cache with memory mode, and caches can lead to
   unpredictable latencies.  Since memory mode is all-or-nothing
   (either all your DRAM is used as cache or none is), your entire
   memory space is exposed to these unpredictable latencies.  This
   solution lets you guarantee DRAM latencies if you need them.
3. The new "tier" of memory is exposed to software.  That means
   that you can build tiered applications or infrastructure.  A
   cloud provider could sell cheaper VMs that use more PMEM and
   more expensive ones that use DRAM.  That's impossible with
   memory mode.

Don't take this as criticism of memory mode.  Memory mode is
awesome, and doesn't strictly require *any* software changes (we
have software changes proposed for optimizing it though).  It has
tons of other advantages over *this* approach.  Basically, we
believe that the approach in these patches is complementary to
memory mode and that both can live side-by-side in harmony.

== Patch Set Overview ==

This series adds a new "driver" to which pmem devices can be
attached.  Once attached, the memory "owned" by the device is
hot-added to the kernel and managed like any other memory.  On
systems with an HMAT (a new ACPI table), each socket (roughly)
will have a separate NUMA node for its persistent memory so
this newly-added memory can be selected by its unique NUMA
node.

== Testing Overview ==

Here's how I set up a system to test this thing:

1. Boot qemu with lots of memory: "-m 4096", for instance
2. Reserve 512MB of physical memory.  Reserving a spot at 2GB
   physical seems to work: memmap=512M!0x8000
   This will end up looking like a pmem device at boot.
3. When booted, convert fsdax device to "device dax":
ndctl create-namespace -fe namespace0.0 -m dax
4. See patch 4 for instructions on binding the kmem driver
   to a device.
5. Now, online the new memory sections.  Perhaps:

grep ^MemTotal /proc/meminfo
for f in `grep -vl online /sys/devices/system/memory/*/state`; do
echo $f: `cat $f`
echo online_movable > $f
grep ^MemTotal /proc/meminfo
done

1. 
https://itpeernetwork.intel.com/intel-optane-dc-persistent-memory-operating-modes/#gs.RKG7BeIu

Cc: 

[PATCH 2/5] mm/resource: move HMM pr_debug() deeper into resource code

2019-01-24 Thread Dave Hansen


From: Dave Hansen 

HMM consumes physical address space for its own use, even
though nothing is mapped or accessible there.  It uses a
special resource description (IORES_DESC_DEVICE_PRIVATE_MEMORY)
to uniquely identify these areas.

When HMM consumes address space, it makes a best guess about
what to consume.  However, it is possible that a future memory
or device hotplug can collide with the reserved area.  In the
case of these conflicts, there is an error message in
register_memory_resource().

Later patches in this series move register_memory_resource()
from using request_resource_conflict() to __request_region().
Unfortunately, __request_region() does not return the conflict
like the previous function did, which makes it impossible to
check for IORES_DESC_DEVICE_PRIVATE_MEMORY in a conflicting
resource.

Instead of warning in register_memory_resource(), move the
check into the core resource code itself (__request_region())
where the conflicting resource _is_ available.  This has the
added bonus of producing a warning in case of HMM conflicts
with devices *or* RAM address space, as opposed to the RAM-
only warnings that were there previously.

Signed-off-by: Dave Hansen 
Cc: Dan Williams 
Cc: Dave Jiang 
Cc: Ross Zwisler 
Cc: Vishal Verma 
Cc: Tom Lendacky 
Cc: Andrew Morton 
Cc: Michal Hocko 
Cc: linux-nvdimm@lists.01.org
Cc: linux-ker...@vger.kernel.org
Cc: linux...@kvack.org
Cc: Huang Ying 
Cc: Fengguang Wu 
Cc: Jerome Glisse 
---

 b/kernel/resource.c   |   10 ++
 b/mm/memory_hotplug.c |5 -
 2 files changed, 10 insertions(+), 5 deletions(-)

diff -puN kernel/resource.c~move-request_region-check kernel/resource.c
--- a/kernel/resource.c~move-request_region-check   2019-01-24 
15:13:14.453199539 -0800
+++ b/kernel/resource.c 2019-01-24 15:13:14.458199539 -0800
@@ -1123,6 +1123,16 @@ struct resource * __request_region(struc
conflict = __request_resource(parent, res);
if (!conflict)
break;
+   /*
+* mm/hmm.c reserves physical addresses which then
+* become unavailable to other users.  Conflicts are
+* not expected.  Be verbose if one is encountered.
+*/
+   if (conflict->desc == IORES_DESC_DEVICE_PRIVATE_MEMORY) {
+   pr_debug("Resource conflict with unaddressable "
+"device memory at %#010llx !\n",
+(unsigned long long)start);
+   }
if (conflict != parent) {
if (!(conflict->flags & IORESOURCE_BUSY)) {
parent = conflict;
diff -puN mm/memory_hotplug.c~move-request_region-check mm/memory_hotplug.c
--- a/mm/memory_hotplug.c~move-request_region-check 2019-01-24 
15:13:14.455199539 -0800
+++ b/mm/memory_hotplug.c   2019-01-24 15:13:14.459199539 -0800
@@ -109,11 +109,6 @@ static struct resource *register_memory_
res->flags = IORESOURCE_SYSTEM_RAM | IORESOURCE_BUSY;
conflict =  request_resource_conflict(&iomem_resource, res);
if (conflict) {
-   if (conflict->desc == IORES_DESC_DEVICE_PRIVATE_MEMORY) {
-   pr_debug("Device unaddressable memory block "
-"memory hotplug at %#010llx !\n",
-(unsigned long long)start);
-   }
pr_debug("System RAM resource %pR cannot be added\n", res);
kfree(res);
return ERR_PTR(-EEXIST);
_
___
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm


[PATCH 1/5] mm/resource: return real error codes from walk failures

2019-01-24 Thread Dave Hansen


From: Dave Hansen 

walk_system_ram_range() can return an error code either because *it*
failed, or because the 'func' that it calls returned an error.  The
memory hotplug does the following:

ret = walk_system_ram_range(..., func);
if (ret)
return ret;

and 'ret' makes it out to userspace, eventually.  The problem is,
walk_system_ram_range() failures that result from *it* failing (as
opposed to 'func') return -1.  That leads to a very odd -EPERM (-1)
return code out to userspace.

Make walk_system_ram_range() return -EINVAL for internal failures to
keep userspace less confused.

This return code is compatible with all the callers that I audited.

Signed-off-by: Dave Hansen 
Cc: Dan Williams 
Cc: Dave Jiang 
Cc: Ross Zwisler 
Cc: Vishal Verma 
Cc: Tom Lendacky 
Cc: Andrew Morton 
Cc: Michal Hocko 
Cc: linux-nvdimm@lists.01.org
Cc: linux-ker...@vger.kernel.org
Cc: linux...@kvack.org
Cc: Huang Ying 
Cc: Fengguang Wu 
Cc: Borislav Petkov 
Cc: Bjorn Helgaas 
Cc: Yaowei Bai 
Cc: Takashi Iwai 
Cc: Jerome Glisse 
---

 b/kernel/resource.c |4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff -puN kernel/resource.c~memory-hotplug-walk_system_ram_range-returns-neg-1 
kernel/resource.c
--- a/kernel/resource.c~memory-hotplug-walk_system_ram_range-returns-neg-1  
2019-01-24 15:13:13.950199540 -0800
+++ b/kernel/resource.c 2019-01-24 15:13:13.954199540 -0800
@@ -375,7 +375,7 @@ static int __walk_iomem_res_desc(resourc
 int (*func)(struct resource *, void *))
 {
struct resource res;
-   int ret = -1;
+   int ret = -EINVAL;
 
while (start < end &&
   !find_next_iomem_res(start, end, flags, desc, first_lvl, &res)) {
@@ -453,7 +453,7 @@ int walk_system_ram_range(unsigned long
unsigned long flags;
struct resource res;
unsigned long pfn, end_pfn;
-   int ret = -1;
+   int ret = -EINVAL;
 
start = (u64) start_pfn << PAGE_SHIFT;
end = ((u64)(start_pfn + nr_pages) << PAGE_SHIFT) - 1;
_
___
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm


Re: [PATCH 2/4] mm/memory-hotplug: allow memory resources to be children

2019-01-23 Thread Dave Hansen
On 1/16/19 3:38 PM, Jerome Glisse wrote:
> So right now i would rather that we keep properly reporting this
> hazard so that at least we know it failed because of that. This
> also include making sure that we can not register private memory
> as a child of an un-busy resource that does exist but might not
> have yet been claim by its rightful owner.

I can definitely keep the warning in.  But, I don't think there's a
chance of HMM registering a IORES_DESC_DEVICE_PRIVATE_MEMORY region as
the child of another.  The region_intersects() check *should* find that:

> for (; addr > size && addr >= iomem_resource.start; addr -= size) {
> ret = region_intersects(addr, size, 0, IORES_DESC_NONE);
> if (ret != REGION_DISJOINT)
> continue;

___
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm


Re: [PATCH 2/4] mm/memory-hotplug: allow memory resources to be children

2019-01-18 Thread Dave Hansen
On 1/16/19 11:16 AM, Jerome Glisse wrote:
>> We *could* also simply truncate the existing top-level
>> "Persistent Memory" resource and take over the released address
>> space.  But, this means that if we ever decide to hot-unplug the
>> "RAM" and give it back, we need to recreate the original setup,
>> which may mean going back to the BIOS tables.
>>
>> This should have no real effect on the existing collision
>> detection because the areas that truly conflict should be marked
>> IORESOURCE_BUSY.
> 
> Still i am worrying that this might allow device private to register
> itself as a child of some un-busy resource as this patch obviously
> change the behavior of register_memory_resource()
> 
> What about instead explicitly providing parent resource to add_memory()
> and then to register_memory_resource() so if it is provided as an
> argument (!NULL) then you can __request_region(arg_res, ...) otherwise
> you keep existing code intact ?

We don't have the locking to do this, do we?  For instance, all the
locking is done below register_memory_resource(), so any previous
resource lookup is invalid by the time we get to register_memory_resource().
___
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm


Re: [PATCH 4/4] dax: "Hotplug" persistent memory for use like normal RAM

2019-01-18 Thread Dave Hansen
On 1/17/19 11:47 PM, Yanmin Zhang wrote:
> a chance for kernel to allocate PMEM as DMA buffer.
> Some super speed devices like 10Giga NIC, USB (SSIC connecting modem),
> might not work well if DMA buffer is in PMEM as it's slower than DRAM.
> 
> Should your patchset consider it?

No, I don't think so.

They can DMA to persistent memory whether this patch set exists or not.
 So, if the hardware falls over, that's a separate problem.

If an app wants memory that performs in a particular way, then I would
suggest those app find the NUMA nodes on the system that match their
needs with these patches:

> http://lkml.kernel.org/r/20190116175804.30196-1-keith.bu...@intel.com

and use the existing NUMA APIs to select that memory.
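
As a sketch of that (not part of the series; node 0 is standing in for a DRAM
node), a single buffer can be pinned with mbind() while the rest of the process
remains unconstrained (build with -lnuma):

#include <numaif.h>
#include <sys/mman.h>
#include <stdio.h>

int main(void)
{
        size_t len = 1 << 20;
        void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        unsigned long nodemask = 1UL << 0;      /* made-up DRAM node 0 */

        if (buf == MAP_FAILED)
                return 1;
        /* Pages faulted into this mapping later will come from node 0. */
        if (mbind(buf, len, MPOL_BIND, &nodemask, sizeof(nodemask) * 8, 0))
                perror("mbind");
        return 0;
}
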
___
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm


Re: [PATCH 0/4] Allow persistent memory to be used like normal RAM

2019-01-17 Thread Dave Hansen
On 1/17/19 8:29 AM, Jeff Moyer wrote:
>> Persistent memory is cool.  But, currently, you have to rewrite
>> your applications to use it.  Wouldn't it be cool if you could
>> just have it show up in your system like normal RAM and get to
>> it like a slow blob of memory?  Well... have I got the patch
>> series for you!
> So, isn't that what memory mode is for?
>   
> https://itpeernetwork.intel.com/intel-optane-dc-persistent-memory-operating-modes/
> 
> Why do we need this code in the kernel?

So, my bad for not mentioning memory mode.  This patch set existed
before we could talk about it publicly, so it simply ignores its
existence.  It's a pretty glaring omission at this point, sorry.

I'll add this to the patches, but here are a few reasons you might want
this instead of memory mode:
1. Memory mode is all-or-nothing.  Either 100% of your persistent memory
   is used for memory mode, or nothing is.  With this set, you can
   (theoretically) have very granular (128MB) assignment of PMEM to
   either volatile or persistent uses.  We have a few practical matters
   to fix to get us down to that 128MB value, but we can get there.
2. The capacity of memory mode is the size of your persistent memory.
   DRAM capacity is "lost" because it is used for cache.  With this,
   you get PMEM+DRAM capacity for memory.
3. DRAM acts as a cache with memory mode, and caches can lead to
   unpredictable latencies.  Since memory mode is all-or-nothing, your
   entire memory space is exposed to these unpredictable latencies.
   This solution lets you guarantee DRAM latencies if you need them.
4. The new "tier" of memory is exposed to software.  That means that you
   can build tiered applications or infrastructure.  A cloud provider
   could sell cheaper VMs that use more PMEM and more expensive ones
   that use DRAM.  That's impossible with memory mode.

Don't take this as criticism of memory mode.  Memory mode is awesome,
and doesn't strictly require *any* software changes (we have software
changes proposed for optimizing it though).  It has tons of other
advantages over *this* approach.  Basically, they are complementary
enough that we think both can live side-by-side.
___
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm


Re: [PATCH 4/4] dax: "Hotplug" persistent memory for use like normal RAM

2019-01-17 Thread Dave Hansen
On 1/17/19 12:19 AM, Yanmin Zhang wrote:
>>
> I didn't try pmem and I am wondering it's slower than DRAM.
> Should a flag, such like _GFP_PMEM, be added to distinguish it from
> DRAM?

Absolutely not. :)

We already have performance-differentiated memory, and lots of ways to
enumerate and select it in the kernel (all of our NUMA infrastructure).

PMEM is also just the first of many "kinds" of memory that folks want to
build in systems and use a "RAM".  We literally don't have space to put
a flag in for each type.

___
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm


Re: [PATCH 2/4] mm/memory-hotplug: allow memory resources to be children

2019-01-16 Thread Dave Hansen
On 1/16/19 11:16 AM, Jerome Glisse wrote:
>> We also rework the old error message a bit since we do not get
>> the conflicting entry back: only an indication that we *had* a
>> conflict.
> We should keep the device private check (moving it in __request_region)
> as device private can try to register un-use physical address (un-use
> at time of device private registration) that latter can block valid
> physical address the error message you are removing report such event.

If a resource can't support having a child, shouldn't it just be marked
IORESOURCE_BUSY, rather than trying to somehow special-case
IORES_DESC_DEVICE_PRIVATE_MEMORY behavior?
___
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm


Re: [PATCH 4/4] dax: "Hotplug" persistent memory for use like normal RAM

2019-01-16 Thread Dave Hansen
On 1/16/19 1:16 PM, Bjorn Helgaas wrote:
>> +   /*
>> +* Set flags appropriate for System RAM.  Leave ..._BUSY clear
>> +* so that add_memory() can add a child resource.
>> +*/
>> +   new_res->flags = IORESOURCE_SYSTEM_RAM;
> IIUC, new_res->flags was set to "IORESOURCE_MEM | ..." in the
> devm_request_mem_region() path.  I think you should keep at least
> IORESOURCE_MEM so the iomem_resource tree stays consistent.

I went to look at fixing this.  It looks like "IORESOURCE_SYSTEM_RAM"
includes IORESOURCE_MEM:

> #define IORESOURCE_SYSTEM_RAM   (IORESOURCE_MEM|IORESOURCE_SYSRAM)

Did you want the patch to expand this #define, or did you just want to
ensure that IORESORUCE_MEM got in there somehow?
___
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm


Re: [PATCH 4/4] dax: "Hotplug" persistent memory for use like normal RAM

2019-01-16 Thread Dave Hansen
On 1/16/19 1:16 PM, Bjorn Helgaas wrote:
> On Wed, Jan 16, 2019 at 12:25 PM Dave Hansen
>  wrote:
>> From: Dave Hansen 
>> Currently, a persistent memory region is "owned" by a device driver,
>> either the "Direct DAX" or "Filesystem DAX" drivers.  These drivers
>> allow applications to explicitly use persistent memory, generally
>> by being modified to use special, new libraries.
> 
> Is there any documentation about exactly what persistent memory is?
> In Documentation/, I see references to pstore and pmem, which sound
> sort of similar, but maybe not quite the same?

One instance of persistent memory is nonvolatile DIMMS.  They're
described in great detail here: Documentation/nvdimm/nvdimm.txt

>> +config DEV_DAX_KMEM
>> +   def_bool y
> 
> Is "y" the right default here?  I periodically see Linus complain
> about new things defaulting to "on", but I admit I haven't paid enough
> attention to know whether that would apply here.
> 
>> +   depends on DEV_DAX_PMEM   # Needs DEV_DAX_PMEM infrastructure
>> +   depends on MEMORY_HOTPLUG # for add_memory() and friends

Well, it doesn't default to "on for everyone".  It inherits the state of
DEV_DAX_PMEM so it's only foisted on folks who have already opted in to
generic pmem support.

>> +int dev_dax_kmem_probe(struct device *dev)
>> +{
>> +   struct dev_dax *dev_dax = to_dev_dax(dev);
>> +   struct resource *res = &dev_dax->region->res;
>> +   resource_size_t kmem_start;
>> +   resource_size_t kmem_size;
>> +   struct resource *new_res;
>> +   int numa_node;
>> +   int rc;
>> +
>> +   /* Hotplug starting at the beginning of the next block: */
>> +   kmem_start = ALIGN(res->start, memory_block_size_bytes());
>> +
>> +   kmem_size = resource_size(res);
>> +   /* Adjust the size down to compensate for moving up kmem_start: */
>> +kmem_size -= kmem_start - res->start;
>> +   /* Align the size down to cover only complete blocks: */
>> +   kmem_size &= ~(memory_block_size_bytes() - 1);
>> +
>> +   new_res = devm_request_mem_region(dev, kmem_start, kmem_size,
>> + dev_name(dev));
>> +
>> +   if (!new_res) {
>> +   printk("could not reserve region %016llx -> %016llx\n",
>> +   kmem_start, kmem_start+kmem_size);
> 
> 1) It'd be nice to have some sort of module tag in the output that
> ties it to this driver.

Good point.  That should probably be a dev_printk().

> 2) It might be nice to print the range in the same format as %pR,
> i.e., "[mem %#010x-%#010x]" with the end included (start + size -1 ).

Sure, that sounds like a sane thing to do as well.
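
Something along these lines (just a sketch of the reworked error path, not the
actual respin) would cover both points:

        /* Sketch only: devm_request_mem_region() returned NULL here, so
         * DEFINE_RES_MEM() builds a throwaway resource purely so the failed
         * range can be printed with %pR via dev_warn(). */
        struct resource tmp = DEFINE_RES_MEM(kmem_start, kmem_size);

        dev_warn(dev, "could not reserve region %pR\n", &tmp);
        return -EBUSY;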

>> +   return -EBUSY;
>> +   }
>> +
>> +   /*
>> +* Set flags appropriate for System RAM.  Leave ..._BUSY clear
>> +* so that add_memory() can add a child resource.
>> +*/
>> +   new_res->flags = IORESOURCE_SYSTEM_RAM;
> 
> IIUC, new_res->flags was set to "IORESOURCE_MEM | ..." in the
> devm_request_mem_region() path.  I think you should keep at least
> IORESOURCE_MEM so the iomem_resource tree stays consistent.
> 
>> +   new_res->name = dev_name(dev);
>> +
>> +   numa_node = dev_dax->target_node;
>> +   if (numa_node < 0) {
>> +   pr_warn_once("bad numa_node: %d, forcing to 0\n", numa_node);
> 
> It'd be nice to again have a module tag and an indication of what
> range is affected, e.g., %pR of new_res.
> 
> You don't save the new_res pointer anywhere, which I guess you intend
> for now since there's no remove or anything else to do with this
> resource?  I thought maybe devm_request_mem_region() would implicitly
> save it, but it doesn't; it only saves the parent (iomem_resource, the
> start (kmem_start), and the size (kmem_size)).

Yeah, that's the intention: removal is currently not supported.  I'll
add a comment to clarify.

>> +   numa_node = 0;
>> +   }
>> +
>> +   rc = add_memory(numa_node, new_res->start, resource_size(new_res));
>> +   if (rc)
>> +   return rc;
>> +
>> +   return 0;
> 
> Doesn't this mean "return rc" or even just "return add_memory(...)"?

Yeah, all of those are equivalent.  I guess I just prefer the explicit
error handling path.
___
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm


[PATCH 1/4] mm/resource: return real error codes from walk failures

2019-01-16 Thread Dave Hansen


From: Dave Hansen 

walk_system_ram_range() can return an error code either because *it*
failed, or because the 'func' that it calls returned an error.  The
memory hotplug does the following:

ret = walk_system_ram_range(..., func);
if (ret)
return ret;

and 'ret' makes it out to userspace, eventually.  The problem is,
walk_system_ram_range() failures that result from *it* failing (as
opposed to 'func') return -1.  That leads to a very odd -EPERM (-1)
return code out to userspace.

Make walk_system_ram_range() return -EINVAL for internal failures to
keep userspace less confused.

This return code is compatible with all the callers that I audited.

Cc: Dan Williams 
Cc: Dave Jiang 
Cc: Ross Zwisler 
Cc: Vishal Verma 
Cc: Tom Lendacky 
Cc: Andrew Morton 
Cc: Michal Hocko 
Cc: linux-nvdimm@lists.01.org
Cc: linux-ker...@vger.kernel.org
Cc: linux...@kvack.org
Cc: Huang Ying 
Cc: Fengguang Wu 


Signed-off-by: Dave Hansen 
---

 b/kernel/resource.c |4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff -puN kernel/resource.c~memory-hotplug-walk_system_ram_range-returns-neg-1 
kernel/resource.c
--- a/kernel/resource.c~memory-hotplug-walk_system_ram_range-returns-neg-1  
2018-12-20 11:48:41.810771934 -0800
+++ b/kernel/resource.c 2018-12-20 11:48:41.814771934 -0800
@@ -375,7 +375,7 @@ static int __walk_iomem_res_desc(resourc
 int (*func)(struct resource *, void *))
 {
struct resource res;
-   int ret = -1;
+   int ret = -EINVAL;
 
while (start < end &&
   !find_next_iomem_res(start, end, flags, desc, first_lvl, &res)) {
@@ -453,7 +453,7 @@ int walk_system_ram_range(unsigned long
unsigned long flags;
struct resource res;
unsigned long pfn, end_pfn;
-   int ret = -1;
+   int ret = -EINVAL;
 
start = (u64) start_pfn << PAGE_SHIFT;
end = ((u64)(start_pfn + nr_pages) << PAGE_SHIFT) - 1;
_
___
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm


[PATCH 4/4] dax: "Hotplug" persistent memory for use like normal RAM

2019-01-16 Thread Dave Hansen


From: Dave Hansen 

Currently, a persistent memory region is "owned" by a device driver,
either the "Direct DAX" or "Filesystem DAX" drivers.  These drivers
allow applications to explicitly use persistent memory, generally
by being modified to use special, new libraries.

However, this limits persistent memory use to applications which
*have* been modified.  To make it more broadly usable, this driver
"hotplugs" memory into the kernel, to be managed ad used just like
normal RAM would be.

To make this work, management software must remove the device from
being controlled by the "Device DAX" infrastructure:

echo -n dax0.0 > /sys/bus/dax/drivers/device_dax/remove_id
echo -n dax0.0 > /sys/bus/dax/drivers/device_dax/unbind

and then bind it to this new driver:

echo -n dax0.0 > /sys/bus/dax/drivers/kmem/new_id
echo -n dax0.0 > /sys/bus/dax/drivers/kmem/bind

After this, there will be a number of new memory sections visible
in sysfs that can be onlined, or that may get onlined by existing
udev-initiated memory hotplug rules.

Note: this inherits any existing NUMA information for the newly-
added memory from the persistent memory device that came from the
firmware.  On Intel platforms, the firmware has guarantees that
require each socket's persistent memory to be in a separate
memory-only NUMA node.  That means that this patch is not expected
to create NUMA nodes, but will simply hotplug memory into existing
nodes.

There is currently some metadata at the beginning of pmem regions.
The section-size memory hotplug restrictions, plus this small
reserved area can cause the "loss" of a section or two of capacity.
This should be fixable in follow-on patches.  But, as a first step,
losing 256MB of memory (worst case) out of hundreds of gigabytes
is a good tradeoff vs. the required code to fix this up precisely.

Signed-off-by: Dave Hansen 
Cc: Dan Williams 
Cc: Dave Jiang 
Cc: Ross Zwisler 
Cc: Vishal Verma 
Cc: Tom Lendacky 
Cc: Andrew Morton 
Cc: Michal Hocko 
Cc: linux-nvdimm@lists.01.org
Cc: linux-ker...@vger.kernel.org
Cc: linux...@kvack.org
Cc: Huang Ying 
Cc: Fengguang Wu 
Cc: Borislav Petkov 
Cc: Bjorn Helgaas 
Cc: Yaowei Bai 
Cc: Takashi Iwai 
---

 b/drivers/dax/Kconfig  |5 ++
 b/drivers/dax/Makefile |1 
 b/drivers/dax/kmem.c   |   93 +
 3 files changed, 99 insertions(+)

diff -puN drivers/dax/Kconfig~dax-kmem-try-4 drivers/dax/Kconfig
--- a/drivers/dax/Kconfig~dax-kmem-try-42019-01-08 09:54:44.051694874 
-0800
+++ b/drivers/dax/Kconfig   2019-01-08 09:54:44.056694874 -0800
@@ -32,6 +32,11 @@ config DEV_DAX_PMEM
 
  Say M if unsure
 
+config DEV_DAX_KMEM
+   def_bool y
+   depends on DEV_DAX_PMEM   # Needs DEV_DAX_PMEM infrastructure
+   depends on MEMORY_HOTPLUG # for add_memory() and friends
+
 config DEV_DAX_PMEM_COMPAT
tristate "PMEM DAX: support the deprecated /sys/class/dax interface"
depends on DEV_DAX_PMEM
diff -puN /dev/null drivers/dax/kmem.c
--- /dev/null   2018-12-03 08:41:47.355756491 -0800
+++ b/drivers/dax/kmem.c2019-01-08 09:54:44.056694874 -0800
@@ -0,0 +1,93 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright(c) 2016-2018 Intel Corporation. All rights reserved. */
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include "dax-private.h"
+#include "bus.h"
+
+int dev_dax_kmem_probe(struct device *dev)
+{
+   struct dev_dax *dev_dax = to_dev_dax(dev);
+   struct resource *res = _dax->region->res;
+   resource_size_t kmem_start;
+   resource_size_t kmem_size;
+   struct resource *new_res;
+   int numa_node;
+   int rc;
+
+   /* Hotplug starting at the beginning of the next block: */
+   kmem_start = ALIGN(res->start, memory_block_size_bytes());
+
+   kmem_size = resource_size(res);
+   /* Adjust the size down to compensate for moving up kmem_start: */
+kmem_size -= kmem_start - res->start;
+   /* Align the size down to cover only complete blocks: */
+   kmem_size &= ~(memory_block_size_bytes() - 1);
+
+   new_res = devm_request_mem_region(dev, kmem_start, kmem_size,
+ dev_name(dev));
+
+   if (!new_res) {
+   printk("could not reserve region %016llx -> %016llx\n",
+   kmem_start, kmem_start+kmem_size);
+   return -EBUSY;
+   }
+
+   /*
+* Set flags appropriate for System RAM.  Leave ..._BUSY clear
+* so that add_memory() can add a child resource.
+*/
+   new_res->flags = IORESOURCE_SYSTEM_RAM;
+   new_res->name = dev_name(dev);
+
+   numa_node = dev_dax->target_node;
+   if (numa_node < 0) {
+   pr_warn_onc

[PATCH 0/4] Allow persistent memory to be used like normal RAM

2019-01-16 Thread Dave Hansen
I would like to get this queued up to get merged.  Since most of the
churn is in the nvdimm code, and it also depends on some refactoring
that only exists in the nvdimm tree, it seems like putting it in *via*
the nvdimm tree is the best path.

But, this series makes non-trivial changes to the "resource" code and
memory hotplug.  I'd really like to get some acks from folks on the
first three patches which affect those areas.

Borislav and Bjorn, you seem to be the most active in the resource code.

Michal, I'd really appreciate a look at all of this from a mem hotplug
perspective.

Note: these are based on commit d2f33c19644 in:

git://git.kernel.org/pub/scm/linux/kernel/git/djbw/nvdimm.git 
libnvdimm-pending

Changes since v1:
 * Now based on git://git.kernel.org/pub/scm/linux/kernel/git/djbw/nvdimm.git
 * Use binding/unbinding from "dax bus" code
 * Move over to a "dax bus" driver from being an nvdimm driver

--

Persistent memory is cool.  But, currently, you have to rewrite
your applications to use it.  Wouldn't it be cool if you could
just have it show up in your system like normal RAM and get to
it like a slow blob of memory?  Well... have I got the patch
series for you!

This series adds a new "driver" to which pmem devices can be
attached.  Once attached, the memory "owned" by the device is
hot-added to the kernel and managed like any other memory.  On
systems with an HMAT (a new ACPI table), each socket (roughly)
will have a separate NUMA node for its persistent memory so
this newly-added memory can be selected by its unique NUMA
node.

Here's how I set up a system to test this thing:

1. Boot qemu with lots of memory: "-m 4096", for instance
2. Reserve 512MB of physical memory.  Reserving a spot at 2GB
   physical seems to work: memmap=512M!0x8000
   This will end up looking like a pmem device at boot.
3. When booted, convert fsdax device to "device dax":
ndctl create-namespace -fe namespace0.0 -m dax
4. See patch 4 for instructions on binding the kmem driver
   to a device.
5. Now, online the new memory sections.  Perhaps:

grep ^MemTotal /proc/meminfo
for f in `grep -vl online /sys/devices/system/memory/*/state`; do
echo $f: `cat $f`
echo online_movable > $f
grep ^MemTotal /proc/meminfo
done

Cc: Dan Williams 
Cc: Dave Jiang 
Cc: Ross Zwisler 
Cc: Vishal Verma 
Cc: Tom Lendacky 
Cc: Andrew Morton 
Cc: Michal Hocko 
Cc: linux-nvdimm@lists.01.org
Cc: linux-ker...@vger.kernel.org
Cc: linux...@kvack.org
Cc: Huang Ying 
Cc: Fengguang Wu 
Cc: Borislav Petkov 
Cc: Bjorn Helgaas 
Cc: Yaowei Bai 
Cc: Takashi Iwai 


[PATCH 2/4] mm/memory-hotplug: allow memory resources to be children

2019-01-16 Thread Dave Hansen


From: Dave Hansen 

The mm/resource.c code is used to manage the physical address
space.  We can view the current resource configuration in
/proc/iomem.  An example of this is at the bottom of this
description.

The nvdimm subsystem "owns" the physical address resources which
map to persistent memory and has resources inserted for them as
"Persistent Memory".  We want to use this persistent memory, but
as volatile memory, just like RAM.  The best way to do this is
to leave the existing resource in place, but add a "System RAM"
resource underneath it. This clearly communicates the ownership
relationship of this memory.

The request_resource_conflict() API only deals with the
top-level resources.  Replace it with __request_region() which
will search for !IORESOURCE_BUSY areas lower in the resource
tree than the top level.
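
To make the difference concrete (sketch only, not code from this patch):
with the old call, the overlapping-but-not-busy "Persistent Memory"
resource is reported as a conflict and the hotplug fails; with the new
call, the "System RAM" entry becomes a child of it:

/* old: only top-level resources are considered, so the (not busy)
 * "Persistent Memory" entry comes back as a conflict */
conflict = request_resource_conflict(&iomem_resource, res);

/* new: descends into !IORESOURCE_BUSY resources and nests the new
 * "System RAM" entry under the "Persistent Memory" parent */
res = __request_region(&iomem_resource, start, size, "System RAM",
                       IORESOURCE_SYSTEM_RAM | IORESOURCE_BUSY);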

We also rework the old error message a bit since we do not get
the conflicting entry back: only an indication that we *had* a
conflict.

We *could* also simply truncate the existing top-level
"Persistent Memory" resource and take over the released address
space.  But, this means that if we ever decide to hot-unplug the
"RAM" and give it back, we need to recreate the original setup,
which may mean going back to the BIOS tables.

This should have no real effect on the existing collision
detection because the areas that truly conflict should be marked
IORESOURCE_BUSY.

-0fff : Reserved
1000-0009fbff : System RAM
0009fc00-0009 : Reserved
000a-000b : PCI Bus :00
000c-000c97ff : Video ROM
000c9800-000ca5ff : Adapter ROM
000f-000f : Reserved
  000f-000f : System ROM
0010-9fff : System RAM
  0100-01e071d0 : Kernel code
  01e071d1-027dfdff : Kernel data
  02dc6000-0305dfff : Kernel bss
a000-afff : Persistent Memory (legacy)
  a000-a7ff : System RAM
b000-bffd : System RAM
bffe-bfff : Reserved
c000-febf : PCI Bus :00

Cc: Dan Williams 
Cc: Dave Jiang 
Cc: Ross Zwisler 
Cc: Vishal Verma 
Cc: Tom Lendacky 
Cc: Andrew Morton 
Cc: Michal Hocko 
Cc: linux-nvdimm@lists.01.org
Cc: linux-ker...@vger.kernel.org
Cc: linux...@kvack.org
Cc: Huang Ying 
Cc: Fengguang Wu 

Signed-off-by: Dave Hansen 
---

 b/mm/memory_hotplug.c |   31 ++-
 1 file changed, 14 insertions(+), 17 deletions(-)

diff -puN 
mm/memory_hotplug.c~mm-memory-hotplug-allow-memory-resource-to-be-child 
mm/memory_hotplug.c
--- a/mm/memory_hotplug.c~mm-memory-hotplug-allow-memory-resource-to-be-child   
2018-12-20 11:48:42.317771933 -0800
+++ b/mm/memory_hotplug.c   2018-12-20 11:48:42.322771933 -0800
@@ -98,24 +98,21 @@ void mem_hotplug_done(void)
 /* add this memory to iomem resource */
 static struct resource *register_memory_resource(u64 start, u64 size)
 {
-   struct resource *res, *conflict;
-   res = kzalloc(sizeof(struct resource), GFP_KERNEL);
-   if (!res)
-   return ERR_PTR(-ENOMEM);
+   struct resource *res;
+   unsigned long flags =  IORESOURCE_SYSTEM_RAM | IORESOURCE_BUSY;
+   char *resource_name = "System RAM";
 
-   res->name = "System RAM";
-   res->start = start;
-   res->end = start + size - 1;
-   res->flags = IORESOURCE_SYSTEM_RAM | IORESOURCE_BUSY;
-   conflict =  request_resource_conflict(_resource, res);
-   if (conflict) {
-   if (conflict->desc == IORES_DESC_DEVICE_PRIVATE_MEMORY) {
-   pr_debug("Device unaddressable memory block "
-"memory hotplug at %#010llx !\n",
-(unsigned long long)start);
-   }
-   pr_debug("System RAM resource %pR cannot be added\n", res);
-   kfree(res);
+   /*
+* Request ownership of the new memory range.  This might be
+* a child of an existing resource that was present but
+* not marked as busy.
+*/
+   res = __request_region(_resource, start, size,
+  resource_name, flags);
+
+   if (!res) {
+   pr_debug("Unable to reserve System RAM region: 
%016llx->%016llx\n",
+   start, start + size);
return ERR_PTR(-EEXIST);
}
return res;
_


[PATCH 3/4] dax/kmem: let walk_system_ram_range() search child resources

2019-01-16 Thread Dave Hansen


From: Dave Hansen 

In the process of onlining memory, we use walk_system_ram_range()
to find the actual RAM areas inside of the area being onlined.

However, it currently only finds memory resources which are
"top-level" iomem_resources.  Children are not currently
searched which causes it to skip System RAM in areas like this
(in the format of /proc/iomem):

a000-bfff : Persistent Memory (legacy)
  a000-afff : System RAM

Changing the true->false here allows children to be searched
as well.  We need this because we add a new "System RAM"
resource underneath the "persistent memory" resource when
we use persistent memory in a volatile mode.

Cc: Dan Williams 
Cc: Dave Jiang 
Cc: Ross Zwisler 
Cc: Vishal Verma 
Cc: Tom Lendacky 
Cc: Andrew Morton 
Cc: Michal Hocko 
Cc: linux-nvdimm@lists.01.org
Cc: linux-ker...@vger.kernel.org
Cc: linux...@kvack.org
Cc: Huang Ying 
Cc: Fengguang Wu 

Signed-off-by: Dave Hansen 
---

 b/kernel/resource.c |5 -
 1 file changed, 4 insertions(+), 1 deletion(-)

diff -puN kernel/resource.c~mm-walk_system_ram_range-search-child-resources 
kernel/resource.c
--- a/kernel/resource.c~mm-walk_system_ram_range-search-child-resources 
2018-12-20 11:48:42.824771932 -0800
+++ b/kernel/resource.c 2018-12-20 11:48:42.827771932 -0800
@@ -445,6 +445,9 @@ int walk_mem_res(u64 start, u64 end, voi
  * This function calls the @func callback against all memory ranges of type
  * System RAM which are marked as IORESOURCE_SYSTEM_RAM and IORESOUCE_BUSY.
  * It is to be used only for System RAM.
+ *
+ * This will find System RAM ranges that are children of top-level resources
+ * in addition to top-level System RAM resources.
  */
 int walk_system_ram_range(unsigned long start_pfn, unsigned long nr_pages,
  void *arg, int (*func)(unsigned long, unsigned long, 
void *))
@@ -460,7 +463,7 @@ int walk_system_ram_range(unsigned long
flags = IORESOURCE_SYSTEM_RAM | IORESOURCE_BUSY;
while (start < end &&
   !find_next_iomem_res(start, end, flags, IORES_DESC_NONE,
-   true, )) {
+   false, )) {
pfn = (res.start + PAGE_SIZE - 1) >> PAGE_SHIFT;
end_pfn = (res.end + 1) >> PAGE_SHIFT;
if (end_pfn > pfn)
_


Re: [PATCH 0/9] Allow persistent memory to be used like normal RAM

2018-12-03 Thread Dave Hansen
On 12/3/18 1:22 AM, Brice Goglin wrote:
> On 22/10/2018 at 22:13, Dave Hansen wrote:
> What happens on systems without an HMAT? Does this new memory get merged
> into existing NUMA nodes?

It gets merged into the persistent memory device's node, as told by the
firmware.  Intel's persistent memory should always be in its own node,
separate from DRAM.

> Also, do you plan to have a way for applications to find out which NUMA
> nodes are "real DRAM" while others are "pmem-backed"? (something like a
> new attribute in /sys/devices/system/node/nodeX/) Or should we use HMAT
> performance attributes for this?

The best way is to use the sysfs-generic interfaces to the HMAT that
Keith Busch is pushing.  In the end, we really think folks will only
care about the memory's performance properties rather than whether it's
*actually* persistent memory or not.


Re: [PATCH 0/9] Allow persistent memory to be used like normal RAM

2018-10-26 Thread Dave Hansen
On 10/26/18 1:03 AM, Xishi Qiu wrote:
> How about let the BIOS report a new type for kmem in e820 table?
> e.g.
> #define E820_PMEM 7
> #define E820_KMEM 8

It would be best if the BIOS just did this all for us.  But, what you're
describing would take years to get from concept to showing up in
someone's hands.  I'd rather not wait.

Plus, doing it the way I suggested gives the OS the most control.  The
BIOS isn't in the critical path to do the right thing.


Re: [PATCH 0/9] Allow persistent memory to be used like normal RAM

2018-10-23 Thread Dave Hansen
>> This series adds a new "driver" to which pmem devices can be
>> attached.  Once attached, the memory "owned" by the device is
>> hot-added to the kernel and managed like any other memory.  On
> 
> Would this memory be considered volatile (with the driver initializing
> it to zeros), or persistent (contents are presented unchanged,
> applications may guarantee persistence by using cache flush
> instructions, fence instructions, and writing to flush hint addresses
> per the persistent memory programming model)?

Volatile.

>> I expect udev can automate this by setting up a rule to watch for
>> device-dax instances by UUID and call a script to do the detach /
>> reattach dance.
> 
> Where would that rule be stored? Storing it on another device
> is problematic. If that rule is lost, it could confuse other
> drivers trying to grab device DAX devices for use as persistent
> memory.

Well, we do lots of things like stable device naming from udev scripts.
 We depend on them not being lost.  At least this "fails safe" so we'll
default to persistence instead of defaulting to "eat your data".



[PATCH 9/9] dax/kmem: actually enable the code in Makefile

2018-10-22 Thread Dave Hansen


Most of the new code was dead up to this point.  Now that
all the pieces are in place, enable it.

Cc: Dan Williams 
Cc: Dave Jiang 
Cc: Ross Zwisler 
Cc: Vishal Verma 
Cc: Tom Lendacky 
Cc: Andrew Morton 
Cc: Michal Hocko 
Cc: linux-nvdimm@lists.01.org
Cc: linux-ker...@vger.kernel.org
Cc: linux...@kvack.org
Cc: Huang Ying 
Cc: Fengguang Wu 

---

 b/drivers/dax/Makefile |2 ++
 1 file changed, 2 insertions(+)

diff -puN drivers/dax/Makefile~dax-kmem-makefile drivers/dax/Makefile
--- a/drivers/dax/Makefile~dax-kmem-makefile2018-10-22 13:12:25.068930384 
-0700
+++ b/drivers/dax/Makefile  2018-10-22 13:12:25.071930384 -0700
@@ -2,7 +2,9 @@
 obj-$(CONFIG_DAX) += dax.o
 obj-$(CONFIG_DEV_DAX) += device_dax.o
 obj-$(CONFIG_DEV_DAX_PMEM) += dax_pmem.o
+obj-$(CONFIG_DEV_DAX_PMEM) += dax_kmem.o
 
 dax-y := super.o
 dax_pmem-y := pmem.o
+dax_kmem-y := kmem.o
 device_dax-y := device.o
_


[PATCH 6/9] mm/memory-hotplug: allow memory resources to be children

2018-10-22 Thread Dave Hansen


The mm/resource.c code is used to manage the physical address
space.  We can view the current resource configuration in
/proc/iomem.  An example of this is at the bottom of this
description.

The nvdimm subsystem "owns" the physical address resources which
map to persistent memory and has resources inserted for them as
"Persistent Memory".  We want to use this persistent memory, but
as volatile memory, just like RAM.  The best way to do this is
to leave the existing resource in place, but add a "System RAM"
resource underneath it. This clearly communicates the ownership
relationship of this memory.

The request_resource_conflict() API only deals with the
top-level resources.  Replace it with __request_region() which
will search for !IORESOURCE_BUSY areas lower in the resource
tree than the top level.

We also rework the old error message a bit since we do not get
the conflicting entry back: only an indication that we *had* a
conflict.

We *could* also simply truncate the existing top-level
"Persistent Memory" resource and take over the released address
space.  But, this means that if we ever decide to hot-unplug the
"RAM" and give it back, we need to recreate the original setup,
which may mean going back to the BIOS tables.

This should have no real effect on the existing collision
detection because the areas that truly conflict should be marked
IORESOURCE_BUSY.

-0fff : Reserved
1000-0009fbff : System RAM
0009fc00-0009 : Reserved
000a-000b : PCI Bus :00
000c-000c97ff : Video ROM
000c9800-000ca5ff : Adapter ROM
000f-000f : Reserved
  000f-000f : System ROM
0010-9fff : System RAM
  0100-01e071d0 : Kernel code
  01e071d1-027dfdff : Kernel data
  02dc6000-0305dfff : Kernel bss
a000-afff : Persistent Memory (legacy)
  a000-a7ff : System RAM
b000-bffd : System RAM
bffe-bfff : Reserved
c000-febf : PCI Bus :00

Cc: Dan Williams 
Cc: Dave Jiang 
Cc: Ross Zwisler 
Cc: Vishal Verma 
Cc: Tom Lendacky 
Cc: Andrew Morton 
Cc: Michal Hocko 
Cc: linux-nvdimm@lists.01.org
Cc: linux-ker...@vger.kernel.org
Cc: linux...@kvack.org
Cc: Huang Ying 
Cc: Fengguang Wu 

---

 b/mm/memory_hotplug.c |   31 ++-
 1 file changed, 14 insertions(+), 17 deletions(-)

diff -puN 
mm/memory_hotplug.c~mm-memory-hotplug-allow-memory-resource-to-be-child 
mm/memory_hotplug.c
--- a/mm/memory_hotplug.c~mm-memory-hotplug-allow-memory-resource-to-be-child   
2018-10-22 13:12:23.570930388 -0700
+++ b/mm/memory_hotplug.c   2018-10-22 13:12:23.573930388 -0700
@@ -99,24 +99,21 @@ void mem_hotplug_done(void)
 /* add this memory to iomem resource */
 static struct resource *register_memory_resource(u64 start, u64 size)
 {
-   struct resource *res, *conflict;
-   res = kzalloc(sizeof(struct resource), GFP_KERNEL);
-   if (!res)
-   return ERR_PTR(-ENOMEM);
+   struct resource *res;
+   unsigned long flags =  IORESOURCE_SYSTEM_RAM | IORESOURCE_BUSY;
+   char resource_name[] = "System RAM";
 
-   res->name = "System RAM";
-   res->start = start;
-   res->end = start + size - 1;
-   res->flags = IORESOURCE_SYSTEM_RAM | IORESOURCE_BUSY;
-   conflict =  request_resource_conflict(_resource, res);
-   if (conflict) {
-   if (conflict->desc == IORES_DESC_DEVICE_PRIVATE_MEMORY) {
-   pr_debug("Device unaddressable memory block "
-"memory hotplug at %#010llx !\n",
-(unsigned long long)start);
-   }
-   pr_debug("System RAM resource %pR cannot be added\n", res);
-   kfree(res);
+   /*
+* Request ownership of the new memory range.  This might be
+* a child of an existing resource that was present but
+* not marked as busy.
+*/
+   res = __request_region(_resource, start, size,
+  resource_name, flags);
+
+   if (!res) {
+   pr_debug("Unable to reserve System RAM region: 
%016llx->%016llx\n",
+   start, start + size);
return ERR_PTR(-EEXIST);
}
return res;
_


[PATCH 5/9] dax/kmem: add more nd dax kmem infrastructure

2018-10-22 Thread Dave Hansen


Each DAX mode has a set of wrappers and helpers.  Add them
for the kmem mode.

Cc: Dan Williams 
Cc: Dave Jiang 
Cc: Ross Zwisler 
Cc: Vishal Verma 
Cc: Tom Lendacky 
Cc: Andrew Morton 
Cc: Michal Hocko 
Cc: linux-nvdimm@lists.01.org
Cc: linux-ker...@vger.kernel.org
Cc: linux...@kvack.org
Cc: Huang Ying 
Cc: Fengguang Wu 

---

 b/drivers/nvdimm/bus.c  |2 ++
 b/drivers/nvdimm/dax_devs.c |   35 +++
 b/drivers/nvdimm/nd.h   |6 ++
 3 files changed, 43 insertions(+)

diff -puN drivers/nvdimm/bus.c~dax-kmem-try-again-2018-4-bus-dev-type-kmem 
drivers/nvdimm/bus.c
--- a/drivers/nvdimm/bus.c~dax-kmem-try-again-2018-4-bus-dev-type-kmem  
2018-10-22 13:12:23.024930389 -0700
+++ b/drivers/nvdimm/bus.c  2018-10-22 13:12:23.031930389 -0700
@@ -46,6 +46,8 @@ static int to_nd_device_type(struct devi
return ND_DEVICE_REGION_BLK;
else if (is_nd_dax(dev))
return ND_DEVICE_DAX_PMEM;
+   else if (is_nd_dax_kmem(dev))
+   return ND_DEVICE_DAX_KMEM;
else if (is_nd_region(dev->parent))
return nd_region_to_nstype(to_nd_region(dev->parent));
 
diff -puN drivers/nvdimm/dax_devs.c~dax-kmem-try-again-2018-4-bus-dev-type-kmem 
drivers/nvdimm/dax_devs.c
--- a/drivers/nvdimm/dax_devs.c~dax-kmem-try-again-2018-4-bus-dev-type-kmem 
2018-10-22 13:12:23.026930389 -0700
+++ b/drivers/nvdimm/dax_devs.c 2018-10-22 13:12:23.031930389 -0700
@@ -51,6 +51,41 @@ struct nd_dax *to_nd_dax(struct device *
 }
 EXPORT_SYMBOL(to_nd_dax);
 
+/* nd_dax_kmem */
+static void nd_dax_kmem_release(struct device *dev)
+{
+   struct nd_region *nd_region = to_nd_region(dev->parent);
+   struct nd_dax_kmem *nd_dax_kmem = to_nd_dax_kmem(dev);
+   struct nd_pfn *nd_pfn = _dax_kmem->nd_pfn;
+
+   dev_dbg(dev, "trace\n");
+   nd_detach_ndns(dev, _pfn->ndns);
+   ida_simple_remove(_region->dax_ida, nd_pfn->id);
+   kfree(nd_pfn->uuid);
+   kfree(nd_dax_kmem);
+}
+
+static struct device_type nd_dax_kmem_device_type = {
+   .name = "nd_dax_kmem",
+   .release = nd_dax_kmem_release,
+};
+
+bool is_nd_dax_kmem(struct device *dev)
+{
+   return dev ? dev->type == _dax_kmem_device_type : false;
+}
+EXPORT_SYMBOL(is_nd_dax_kmem);
+
+struct nd_dax_kmem *to_nd_dax_kmem(struct device *dev)
+{
+   struct nd_dax_kmem *nd_dax_kmem = container_of(dev, struct nd_dax_kmem, 
nd_pfn.dev);
+
+   WARN_ON(!is_nd_dax_kmem(dev));
+   return nd_dax_kmem;
+}
+EXPORT_SYMBOL(to_nd_dax_kmem);
+/* end nd_dax_kmem */
+
 static const struct attribute_group *nd_dax_attribute_groups[] = {
_pfn_attribute_group,
_device_attribute_group,
diff -puN drivers/nvdimm/nd.h~dax-kmem-try-again-2018-4-bus-dev-type-kmem 
drivers/nvdimm/nd.h
--- a/drivers/nvdimm/nd.h~dax-kmem-try-again-2018-4-bus-dev-type-kmem   
2018-10-22 13:12:23.027930389 -0700
+++ b/drivers/nvdimm/nd.h   2018-10-22 13:12:23.031930389 -0700
@@ -215,6 +215,10 @@ struct nd_dax {
struct nd_pfn nd_pfn;
 };
 
+struct nd_dax_kmem {
+   struct nd_pfn nd_pfn;
+};
+
 enum nd_async_mode {
ND_SYNC,
ND_ASYNC,
@@ -318,9 +322,11 @@ static inline int nd_pfn_validate(struct
 #endif
 
 struct nd_dax *to_nd_dax(struct device *dev);
+struct nd_dax_kmem *to_nd_dax_kmem(struct device *dev);
 #if IS_ENABLED(CONFIG_NVDIMM_DAX)
 int nd_dax_probe(struct device *dev, struct nd_namespace_common *ndns);
 bool is_nd_dax(struct device *dev);
+bool is_nd_dax_kmem(struct device *dev);
 struct device *nd_dax_create(struct nd_region *nd_region);
 #else
 static inline int nd_dax_probe(struct device *dev,
_


[PATCH 7/9] dax/kmem: actually perform memory hotplug

2018-10-22 Thread Dave Hansen


This is the meat of this whole series.  When the "kmem" device's
probe function is called and we know we have a good persistent
memory device, hotplug the memory back into the main kernel.
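
Condensed, and with error handling omitted (see the diff below for the
real code), the probe path amounts to:

/* reserve the range, but leave it !IORESOURCE_BUSY so that
 * add_memory() can create a "System RAM" child underneath it */
new_res = devm_request_mem_region(dev, nsio->res.start,
                                  resource_size(&nsio->res),
                                  "System RAM (pmem)");
new_res->flags = IORESOURCE_SYSTEM_RAM;

/* devices with no NUMA affinity get dumped into node 0 */
numa_node = dev_to_node(dev);
if (numa_node < 0)
        numa_node = 0;

/* the actual hotplug; the new sections then show up in sysfs and
 * can be onlined like any other hotplugged memory */
rc = add_memory(numa_node, nsio->res.start, resource_size(&nsio->res));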

Cc: Dan Williams 
Cc: Dave Jiang 
Cc: Ross Zwisler 
Cc: Vishal Verma 
Cc: Tom Lendacky 
Cc: Andrew Morton 
Cc: Michal Hocko 
Cc: linux-nvdimm@lists.01.org
Cc: linux-ker...@vger.kernel.org
Cc: linux...@kvack.org
Cc: Huang Ying 
Cc: Fengguang Wu 

---

 b/drivers/dax/kmem.c |   28 +---
 1 file changed, 25 insertions(+), 3 deletions(-)

diff -puN drivers/dax/kmem.c~dax-kmem-hotplug drivers/dax/kmem.c
--- a/drivers/dax/kmem.c~dax-kmem-hotplug   2018-10-22 13:12:24.069930387 
-0700
+++ b/drivers/dax/kmem.c2018-10-22 13:12:24.072930387 -0700
@@ -55,10 +55,12 @@ static void dax_kmem_percpu_kill(void *d
 static int dax_kmem_probe(struct device *dev)
 {
void *addr;
+   int numa_node;
struct resource res;
int rc, id, region_id;
struct nd_pfn_sb *pfn_sb;
struct dev_dax *dev_dax;
+   struct resource *new_res;
struct dax_kmem *dax_kmem;
struct nd_namespace_io *nsio;
struct dax_region *dax_region;
@@ -86,13 +88,30 @@ static int dax_kmem_probe(struct device
 
pfn_sb = nd_pfn->pfn_sb;
 
-   if (!devm_request_mem_region(dev, nsio->res.start,
-   resource_size(>res),
-   dev_name(>dev))) {
+   new_res = devm_request_mem_region(dev, nsio->res.start,
+ resource_size(>res),
+ "System RAM (pmem)");
+   if (!new_res) {
dev_warn(dev, "could not reserve region %pR\n", >res);
return -EBUSY;
}
 
+   /*
+* Set flags appropriate for System RAM.  Leave ..._BUSY clear
+* so that add_memory() can add a child resource.
+*/
+   new_res->flags = IORESOURCE_SYSTEM_RAM;
+
+   numa_node = dev_to_node(dev);
+   if (numa_node < 0) {
+   pr_warn_once("bad numa_node: %d, forcing to 0\n", numa_node);
+   numa_node = 0;
+   }
+
+   rc = add_memory(numa_node, nsio->res.start, resource_size(>res));
+   if (rc)
+   return rc;
+
dax_kmem->dev = dev;
init_completion(_kmem->cmp);
rc = percpu_ref_init(_kmem->ref, dax_kmem_percpu_release, 0,
@@ -106,6 +125,9 @@ static int dax_kmem_probe(struct device
return rc;
 
dax_kmem->pgmap.ref = _kmem->ref;
+
+   dax_kmem->pgmap.res.name = "name_kmem_override2";
+
addr = devm_memremap_pages(dev, _kmem->pgmap);
if (IS_ERR(addr))
return PTR_ERR(addr);
_


[PATCH 3/9] dax: add more kmem device infrastructure

2018-10-22 Thread Dave Hansen


The previous patch is a simple copy of the pmem driver.  This
makes it easy while this is in development to keep the pmem
and kmem code in sync.

This actually adds some necessary infrastructure for the new
driver to compile.

Cc: Dan Williams 
Cc: Dave Jiang 
Cc: Ross Zwisler 
Cc: Vishal Verma 
Cc: Tom Lendacky 
Cc: Andrew Morton 
Cc: Michal Hocko 
Cc: linux-nvdimm@lists.01.org
Cc: linux-ker...@vger.kernel.org
Cc: linux...@kvack.org
Cc: Huang Ying 
Cc: Fengguang Wu 

---

 b/drivers/dax/kmem.c |   10 +-
 b/include/uapi/linux/ndctl.h |2 ++
 2 files changed, 7 insertions(+), 5 deletions(-)

diff -puN drivers/dax/kmem.c~dax-kmem-try-again-2018-2-header drivers/dax/kmem.c
--- a/drivers/dax/kmem.c~dax-kmem-try-again-2018-2-header   2018-10-22 
13:12:22.000930392 -0700
+++ b/drivers/dax/kmem.c2018-10-22 13:12:22.005930392 -0700
@@ -27,7 +27,7 @@ static struct dax_kmem *to_dax_kmem(stru
 
 static void dax_kmem_percpu_release(struct percpu_ref *ref)
 {
-   struct dax_kmem *dax_kmem = to_dax_pmem(ref);
+   struct dax_kmem *dax_kmem = to_dax_kmem(ref);
 
dev_dbg(dax_kmem->dev, "trace\n");
complete(_kmem->cmp);
@@ -36,7 +36,7 @@ static void dax_kmem_percpu_release(stru
 static void dax_kmem_percpu_exit(void *data)
 {
struct percpu_ref *ref = data;
-   struct dax_kmem *dax_kmem = to_dax_pmem(ref);
+   struct dax_kmem *dax_kmem = to_dax_kmem(ref);
 
dev_dbg(dax_kmem->dev, "trace\n");
wait_for_completion(_kmem->cmp);
@@ -46,7 +46,7 @@ static void dax_kmem_percpu_exit(void *d
 static void dax_kmem_percpu_kill(void *data)
 {
struct percpu_ref *ref = data;
-   struct dax_kmem *dax_kmem = to_dax_pmem(ref);
+   struct dax_kmem *dax_kmem = to_dax_kmem(ref);
 
dev_dbg(dax_kmem->dev, "trace\n");
percpu_ref_kill(ref);
@@ -142,11 +142,11 @@ static struct nd_device_driver dax_kmem_
.drv = {
.name = "dax_kmem",
},
-   .type = ND_DRIVER_DAX_PMEM,
+   .type = ND_DRIVER_DAX_KMEM,
 };
 
 module_nd_driver(dax_kmem_driver);
 
 MODULE_LICENSE("GPL v2");
 MODULE_AUTHOR("Intel Corporation");
-MODULE_ALIAS_ND_DEVICE(ND_DEVICE_DAX_PMEM);
+MODULE_ALIAS_ND_DEVICE(ND_DEVICE_DAX_KMEM);
diff -puN include/uapi/linux/ndctl.h~dax-kmem-try-again-2018-2-header 
include/uapi/linux/ndctl.h
--- a/include/uapi/linux/ndctl.h~dax-kmem-try-again-2018-2-header   
2018-10-22 13:12:22.002930392 -0700
+++ b/include/uapi/linux/ndctl.h2018-10-22 13:12:22.005930392 -0700
@@ -197,6 +197,7 @@ static inline const char *nvdimm_cmd_nam
 #define ND_DEVICE_NAMESPACE_PMEM 5  /* PMEM namespace (may alias with BLK) */
 #define ND_DEVICE_NAMESPACE_BLK 6   /* BLK namespace (may alias with PMEM) */
 #define ND_DEVICE_DAX_PMEM 7/* Device DAX interface to pmem */
+#define ND_DEVICE_DAX_KMEM 8/* Normal kernel-managed system memory */
 
 enum nd_driver_flags {
ND_DRIVER_DIMM= 1 << ND_DEVICE_DIMM,
@@ -206,6 +207,7 @@ enum nd_driver_flags {
ND_DRIVER_NAMESPACE_PMEM  = 1 << ND_DEVICE_NAMESPACE_PMEM,
ND_DRIVER_NAMESPACE_BLK   = 1 << ND_DEVICE_NAMESPACE_BLK,
ND_DRIVER_DAX_PMEM= 1 << ND_DEVICE_DAX_PMEM,
+   ND_DRIVER_DAX_KMEM= 1 << ND_DEVICE_DAX_KMEM,
 };
 
 enum {
_


[PATCH 4/9] dax/kmem: allow PMEM devices to bind to KMEM driver

2018-10-22 Thread Dave Hansen


Currently, a persistent memory device's mode must be coordinated
with the driver to which it needs to bind.  To change it from the
fsdax to the device-dax driver, you first change the mode of the
device itself.

Instead of adding a new device mode, allow the PMEM mode to also
bind to the KMEM driver.

As I write this, I'm realizing that it might have just been
better to add a new device mode, rather than hijacking the PMEM
mode.  If this is the case, please speak up, NVDIMM folks.  :)

Cc: Dan Williams 
Cc: Dave Jiang 
Cc: Ross Zwisler 
Cc: Vishal Verma 
Cc: Tom Lendacky 
Cc: Andrew Morton 
Cc: Michal Hocko 
Cc: linux-nvdimm@lists.01.org
Cc: linux-ker...@vger.kernel.org
Cc: linux...@kvack.org
Cc: Huang Ying 
Cc: Fengguang Wu 

---

 b/drivers/nvdimm/bus.c |   15 ++-
 1 file changed, 14 insertions(+), 1 deletion(-)

diff -puN drivers/nvdimm/bus.c~dax-kmem-try-again-2018-3-bus-match-override 
drivers/nvdimm/bus.c
--- a/drivers/nvdimm/bus.c~dax-kmem-try-again-2018-3-bus-match-override 
2018-10-22 13:12:22.522930391 -0700
+++ b/drivers/nvdimm/bus.c  2018-10-22 13:12:22.525930391 -0700
@@ -464,11 +464,24 @@ static struct nd_device_driver nd_bus_dr
 static int nvdimm_bus_match(struct device *dev, struct device_driver *drv)
 {
struct nd_device_driver *nd_drv = to_nd_device_driver(drv);
+   bool match;
 
if (is_nvdimm_bus(dev) && nd_drv == _bus_driver)
return true;
 
-   return !!test_bit(to_nd_device_type(dev), _drv->type);
+   match = !!test_bit(to_nd_device_type(dev), _drv->type);
+
+   /*
+* We allow PMEM devices to be bound to the KMEM driver.
+* Force a match if we detect a PMEM device type but
+* a KMEM device driver.
+*/
+   if (!match &&
+   (to_nd_device_type(dev) == ND_DEVICE_DAX_PMEM) &&
+   (nd_drv->type == ND_DRIVER_DAX_KMEM))
+   match = true;
+
+   return match;
 }
 
 static ASYNC_DOMAIN_EXCLUSIVE(nd_async_domain);
_


[PATCH 0/9] Allow persistent memory to be used like normal RAM

2018-10-22 Thread Dave Hansen
Persistent memory is cool.  But, currently, you have to rewrite
your applications to use it.  Wouldn't it be cool if you could
just have it show up in your system like normal RAM and get to
it like a slow blob of memory?  Well... have I got the patch
series for you!

This series adds a new "driver" to which pmem devices can be
attached.  Once attached, the memory "owned" by the device is
hot-added to the kernel and managed like any other memory.  On
systems with an HMAT (a new ACPI table), each socket (roughly)
will have a separate NUMA node for its persistent memory so
this newly-added memory can be selected by its unique NUMA
node.

This is highly RFC, and I really want the feedback from the
nvdimm/pmem folks about whether this is a viable long-term
perversion of their code and device mode.  It's insufficiently
documented and probably not bisectable either.

Todo:
1. The device re-binding hacks are ham-fisted at best.  We
   need a better way of doing this, especially so the kmem
   driver does not get in the way of normal pmem devices.
2. When the device has no proper node, we default it to
   NUMA node 0.  Is that OK?
3. We muck with the 'struct resource' code quite a bit. It
   definitely needs a once-over from folks more familiar
   with it than I.
4. Is there a better way to do this than starting with a
   copy of pmem.c?

Here's how I set up a system to test this thing:

1. Boot qemu with lots of memory: "-m 4096", for instance
2. Reserve 512MB of physical memory.  Reserving a spot at 2GB
   physical seems to work: memmap=512M!0x8000
   This will end up looking like a pmem device at boot.
3. When booted, convert fsdax device to "device dax":
ndctl create-namespace -fe namespace0.0 -m dax
4. In the background, the kmem driver will probably bind to the
   new device.
5. Now, online the new memory sections.  Perhaps:

grep ^MemTotal /proc/meminfo
for f in `grep -vl online /sys/devices/system/memory/*/state`; do
echo $f: `cat $f`
echo online > $f
grep ^MemTotal /proc/meminfo
done

Cc: Dan Williams 
Cc: Dave Jiang 
Cc: Ross Zwisler 
Cc: Vishal Verma 
Cc: Tom Lendacky 
Cc: Andrew Morton 
Cc: Michal Hocko 
Cc: linux-nvdimm@lists.01.org
Cc: linux-ker...@vger.kernel.org
Cc: linux...@kvack.org
Cc: Huang Ying 
Cc: Fengguang Wu 


[PATCH 1/9] mm/resource: return real error codes from walk failures

2018-10-22 Thread Dave Hansen


walk_system_ram_range() can return an error code either because *it*
failed, or because the 'func' that it calls returned an error.  The
memory hotplug code does the following:

ret = walk_system_ram_range(..., func);
if (ret)
return ret;

and 'ret' makes it out to userspace, eventually.  The problem is,
walk_system_ram_range() failures that result from *it* failing (as
opposed to 'func') return -1.  That leads to a very odd -EPERM (-1)
return code out to userspace.

Make walk_system_ram_range() return -EINVAL for internal failures to
keep userspace less confused.

This return code is compatible with all the callers that I audited.

Cc: Dan Williams 
Cc: Dave Jiang 
Cc: Ross Zwisler 
Cc: Vishal Verma 
Cc: Tom Lendacky 
Cc: Andrew Morton 
Cc: Michal Hocko 
Cc: linux-nvdimm@lists.01.org
Cc: linux-ker...@vger.kernel.org
Cc: linux...@kvack.org
Cc: Huang Ying 
Cc: Fengguang Wu 


---

 b/kernel/resource.c |4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff -puN kernel/resource.c~memory-hotplug-walk_system_ram_range-returns-neg-1 
kernel/resource.c
--- a/kernel/resource.c~memory-hotplug-walk_system_ram_range-returns-neg-1  
2018-10-22 13:12:21.000930395 -0700
+++ b/kernel/resource.c 2018-10-22 13:12:21.003930395 -0700
@@ -375,7 +375,7 @@ static int __walk_iomem_res_desc(resourc
 int (*func)(struct resource *, void *))
 {
struct resource res;
-   int ret = -1;
+   int ret = -EINVAL;
 
while (start < end &&
   !find_next_iomem_res(start, end, flags, desc, first_lvl, )) {
@@ -453,7 +453,7 @@ int walk_system_ram_range(unsigned long
unsigned long flags;
struct resource res;
unsigned long pfn, end_pfn;
-   int ret = -1;
+   int ret = -EINVAL;
 
start = (u64) start_pfn << PAGE_SHIFT;
end = ((u64)(start_pfn + nr_pages) << PAGE_SHIFT) - 1;
_


[PATCH 8/9] dax/kmem: let walk_system_ram_range() search child resources

2018-10-22 Thread Dave Hansen


In the process of onlining memory, we use walk_system_ram_range()
to find the actual RAM areas inside of the area being onlined.

However, it currently only finds memory resources which are
"top-level" iomem_resources.  Children are not currently
searched which causes it to skip System RAM in areas like this
(in the format of /proc/iomem):

a000-bfff : Persistent Memory (legacy)
  a000-afff : System RAM

Changing the true->false here allows children to be searched
as well.  We need this because we add a new "System RAM"
resource underneath the "persistent memory" resource when
we use persistent memory in a volatile mode.

Cc: Dan Williams 
Cc: Dave Jiang 
Cc: Ross Zwisler 
Cc: Vishal Verma 
Cc: Tom Lendacky 
Cc: Andrew Morton 
Cc: Michal Hocko 
Cc: linux-nvdimm@lists.01.org
Cc: linux-ker...@vger.kernel.org
Cc: linux...@kvack.org
Cc: Huang Ying 
Cc: Fengguang Wu 

---

 b/kernel/resource.c |5 -
 1 file changed, 4 insertions(+), 1 deletion(-)

diff -puN kernel/resource.c~mm-walk_system_ram_range-search-child-resources 
kernel/resource.c
--- a/kernel/resource.c~mm-walk_system_ram_range-search-child-resources 
2018-10-22 13:12:24.565930386 -0700
+++ b/kernel/resource.c 2018-10-22 13:12:24.572930386 -0700
@@ -445,6 +445,9 @@ int walk_mem_res(u64 start, u64 end, voi
  * This function calls the @func callback against all memory ranges of type
  * System RAM which are marked as IORESOURCE_SYSTEM_RAM and IORESOUCE_BUSY.
  * It is to be used only for System RAM.
+ *
+ * This will find System RAM ranges that are children of top-level resources
+ * in addition to top-level System RAM resources.
  */
 int walk_system_ram_range(unsigned long start_pfn, unsigned long nr_pages,
  void *arg, int (*func)(unsigned long, unsigned long, 
void *))
@@ -460,7 +463,7 @@ int walk_system_ram_range(unsigned long
flags = IORESOURCE_SYSTEM_RAM | IORESOURCE_BUSY;
while (start < end &&
   !find_next_iomem_res(start, end, flags, IORES_DESC_NONE,
-   true, )) {
+   false, )) {
pfn = (res.start + PAGE_SIZE - 1) >> PAGE_SHIFT;
end_pfn = (res.end + 1) >> PAGE_SHIFT;
if (end_pfn > pfn)
_


[PATCH 2/9] dax: kernel memory driver for mm ownership of DAX

2018-10-22 Thread Dave Hansen


Add the actual driver which will own the DAX range.  This
allows very nice parity with the other possible "owners" of
a DAX region: device DAX and filesystem DAX.  It also greatly
simplifies the process of handing off control of the memory
between the different owners since it's just a matter of
unbinding and rebinding the device to different drivers.

I tried to do this all internally to the kernel and the
locking and "self-destruction" of the old device context was
a nightmare.  Having userspace drive it is a wonderful
simplification.

Cc: Dan Williams 
Cc: Dave Jiang 
Cc: Ross Zwisler 
Cc: Vishal Verma 
Cc: Tom Lendacky 
Cc: Andrew Morton 
Cc: Michal Hocko 
Cc: linux-nvdimm@lists.01.org
Cc: linux-ker...@vger.kernel.org
Cc: linux...@kvack.org
Cc: Huang Ying 
Cc: Fengguang Wu 

---

 b/drivers/dax/kmem.c |  152 +++
 1 file changed, 152 insertions(+)

diff -puN /dev/null drivers/dax/kmem.c
--- /dev/null   2018-09-18 12:39:53.059362935 -0700
+++ b/drivers/dax/kmem.c2018-10-22 13:12:21.502930393 -0700
@@ -0,0 +1,152 @@
+// this just just a copy of drivers/dax/pmem.c with
+// s/dax_pmem/dax_kmem' for now.
+//
+// need real license
+/*
+ * Copyright(c) 2016-2018 Intel Corporation. All rights reserved.
+ */
+#include 
+#include 
+#include 
+#include 
+#include "../nvdimm/pfn.h"
+#include "../nvdimm/nd.h"
+#include "device-dax.h"
+
+struct dax_kmem {
+   struct device *dev;
+   struct percpu_ref ref;
+   struct dev_pagemap pgmap;
+   struct completion cmp;
+};
+
+static struct dax_kmem *to_dax_kmem(struct percpu_ref *ref)
+{
+   return container_of(ref, struct dax_kmem, ref);
+}
+
+static void dax_kmem_percpu_release(struct percpu_ref *ref)
+{
+   struct dax_kmem *dax_kmem = to_dax_pmem(ref);
+
+   dev_dbg(dax_kmem->dev, "trace\n");
+   complete(_kmem->cmp);
+}
+
+static void dax_kmem_percpu_exit(void *data)
+{
+   struct percpu_ref *ref = data;
+   struct dax_kmem *dax_kmem = to_dax_pmem(ref);
+
+   dev_dbg(dax_kmem->dev, "trace\n");
+   wait_for_completion(_kmem->cmp);
+   percpu_ref_exit(ref);
+}
+
+static void dax_kmem_percpu_kill(void *data)
+{
+   struct percpu_ref *ref = data;
+   struct dax_kmem *dax_kmem = to_dax_pmem(ref);
+
+   dev_dbg(dax_kmem->dev, "trace\n");
+   percpu_ref_kill(ref);
+}
+
+static int dax_kmem_probe(struct device *dev)
+{
+   void *addr;
+   struct resource res;
+   int rc, id, region_id;
+   struct nd_pfn_sb *pfn_sb;
+   struct dev_dax *dev_dax;
+   struct dax_kmem *dax_kmem;
+   struct nd_namespace_io *nsio;
+   struct dax_region *dax_region;
+   struct nd_namespace_common *ndns;
+   struct nd_dax *nd_dax = to_nd_dax(dev);
+   struct nd_pfn *nd_pfn = _dax->nd_pfn;
+
+   ndns = nvdimm_namespace_common_probe(dev);
+   if (IS_ERR(ndns))
+   return PTR_ERR(ndns);
+   nsio = to_nd_namespace_io(>dev);
+
+   dax_kmem = devm_kzalloc(dev, sizeof(*dax_kmem), GFP_KERNEL);
+   if (!dax_kmem)
+   return -ENOMEM;
+
+   /* parse the 'pfn' info block via ->rw_bytes */
+   rc = devm_nsio_enable(dev, nsio);
+   if (rc)
+   return rc;
+   rc = nvdimm_setup_pfn(nd_pfn, _kmem->pgmap);
+   if (rc)
+   return rc;
+   devm_nsio_disable(dev, nsio);
+
+   pfn_sb = nd_pfn->pfn_sb;
+
+   if (!devm_request_mem_region(dev, nsio->res.start,
+   resource_size(>res),
+   dev_name(>dev))) {
+   dev_warn(dev, "could not reserve region %pR\n", >res);
+   return -EBUSY;
+   }
+
+   dax_kmem->dev = dev;
+   init_completion(_kmem->cmp);
+   rc = percpu_ref_init(_kmem->ref, dax_kmem_percpu_release, 0,
+   GFP_KERNEL);
+   if (rc)
+   return rc;
+
+   rc = devm_add_action_or_reset(dev, dax_kmem_percpu_exit,
+   _kmem->ref);
+   if (rc)
+   return rc;
+
+   dax_kmem->pgmap.ref = _kmem->ref;
+   addr = devm_memremap_pages(dev, _kmem->pgmap);
+   if (IS_ERR(addr))
+   return PTR_ERR(addr);
+
+   rc = devm_add_action_or_reset(dev, dax_kmem_percpu_kill,
+   _kmem->ref);
+   if (rc)
+   return rc;
+
+   /* adjust the dax_region resource to the start of data */
+   memcpy(, _kmem->pgmap.res, sizeof(res));
+   res.start += le64_to_cpu(pfn_sb->dataoff);
+
+   rc = sscanf(dev_name(>dev), "namespace%d.%d", _id, );
+   if (rc != 2)
+   return -EINVAL;
+
+   dax_region = alloc_dax_region(dev, region_id, ,
+   le32_to_cpu(pfn_sb->align), addr, PFN_DEV|PFN_MAP);
+   if (!dax_region)
+   return -ENOMEM;
+
+   /* TODO: support for subdividing a dax region... */
+   dev_dax = 

Re: [PATCH v5 2/4] mm: Provide kernel parameter to allow disabling page init poisoning

2018-09-26 Thread Dave Hansen
On 09/26/2018 08:24 AM, Alexander Duyck wrote:
> With no options it works just like slub_debug and enables all
> available options. So in our case it is a NOP since we wanted the
> debugging enabled by default.

Yeah, but slub_debug is different.

First, nobody uses the slub_debug=- option because *that* is only used
when you have SLUB_DEBUG=y *and* CONFIG_SLUB_DEBUG_ON=y, which not even
Fedora does.

slub_debug is *primarily* for *adding* debug features.  For this, we
need to turn them off.

It sounds like following slub_debug was a bad idea, especially following
its semantics too closely when it doesn't make sense.


Re: [PATCH v5 2/4] mm: Provide kernel parameter to allow disabling page init poisoning

2018-09-26 Thread Dave Hansen
On 09/26/2018 12:38 AM, Michal Hocko wrote:
> Why cannot you simply go with [no]vm_page_poison[=on/off]?

I was trying to look to the future a bit, if we end up with five or six
more other options we want to allow folks to enable/disable.  I don't
want to end up in a situation where we have a bunch of different knobs
to turn all this stuff off at runtime.

I'd really like to have one stop shopping so that folks who have a
system that's behaving well and don't need any debugging can get some of
their performance back.

But, the *primary* thing we want here is a nice, quick way to turn as
much debugging off as we can.  A nice-to-have is a future-proof,
slub-style option that will centralize things.

Alex's patch fails at the primary goal, IMNHO because "vm_debug=-" is so
weird.  I'd much rather have "vm_debug=off" (the primary goal) and throw
away the nice-to-have (future-proof fine-grained on/off).

I think we can have both, but I guess the onus is on me to go and add a
strcmp(..., "off"). :)
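
Rough sketch of what I have in mind (hypothetical code, names made up):
an early_param() handler that accepts "off" as well as per-letter flags:

static bool vm_debug_page_init_poison __read_mostly = true;

static int __init setup_vm_debug(char *str)
{
        if (!str)
                return 0;

        /* "vm_debug=off" turns off all runtime VM debug features */
        if (!strcmp(str, "off")) {
                vm_debug_page_init_poison = false;
                return 0;
        }

        /* otherwise, each letter enables exactly one feature */
        vm_debug_page_init_poison = false;
        for (; *str; str++) {
                switch (*str) {
                case 'P':       /* 'struct page' init poisoning */
                        vm_debug_page_init_poison = true;
                        break;
                default:
                        pr_err("vm_debug option '%c' unknown\n", *str);
                }
        }
        return 0;
}
early_param("vm_debug", setup_vm_debug);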


Re: [PATCH v5 2/4] mm: Provide kernel parameter to allow disabling page init poisoning

2018-09-25 Thread Dave Hansen
On 09/25/2018 01:38 PM, Alexander Duyck wrote:
> On 9/25/2018 1:26 PM, Dave Hansen wrote:
>> On 09/25/2018 01:20 PM, Alexander Duyck wrote:
>>> +    vm_debug[=options]    [KNL] Available with CONFIG_DEBUG_VM=y.
>>> +    May slow down system boot speed, especially when
>>> +    enabled on systems with a large amount of memory.
>>> +    All options are enabled by default, and this
>>> +    interface is meant to allow for selectively
>>> +    enabling or disabling specific virtual memory
>>> +    debugging features.
>>> +
>>> +    Available options are:
>>> +  P    Enable page structure init time poisoning
>>> +  -    Disable all of the above options
>>
>> Can we have vm_debug=off for turning things off, please?  That seems to
>> be pretty standard.
> 
> No. The simple reason for that is that you had requested this work like
> the slub_debug. If we are going to do that then each individual letter
> represents a feature. That is why the "-" represents off. We cannot have
> letters represent flags, and letters put together into words. For
> example slub_debug=OFF would turn on sanity checks and turn off
> debugging for caches that would have causes higher minimum slab orders.

We don't have to have the same letters mean the same things for both
options.  We also can live without 'o' and 'f' being valid.  We can
*also* just say "don't do 'off'" if you want to enable things.

I'd much rather have vm_debug=off do the right thing than have
per-feature enable/disable.  I know I'll *never* remember vm_debug=- and
doing it this way will subject me to innumerable trips to Documentation/
during my few remaining years.

Surely you can make vm_debug=off happen. :)

>> we need to document the defaults.  I think the default is "all
>> debug options are enabled", but it would be nice to document that.
> 
> In the description I call out "All options are enabled by default, and this
> interface is meant to allow for selectively enabling or disabling".

I found "all options are enabled by default" really confusing.  Maybe:

"Control debug features which become available when CONFIG_DEBUG_VM=y.
When this option is not specified, all debug features are enabled.  Use
this option to enable a specific subset."

Then, let's actually say what the options do, and what their impact is:

P   Enable 'struct page' poisoning at initialization.
(Slows down boot time).



Re: [PATCH v5 2/4] mm: Provide kernel parameter to allow disabling page init poisoning

2018-09-25 Thread Dave Hansen
On 09/25/2018 01:20 PM, Alexander Duyck wrote:
> + vm_debug[=options]  [KNL] Available with CONFIG_DEBUG_VM=y.
> + May slow down system boot speed, especially when
> + enabled on systems with a large amount of memory.
> + All options are enabled by default, and this
> + interface is meant to allow for selectively
> + enabling or disabling specific virtual memory
> + debugging features.
> +
> + Available options are:
> +   P Enable page structure init time poisoning
> +   - Disable all of the above options

Can we have vm_debug=off for turning things off, please?  That seems to
be pretty standard.

Also, we need to document the defaults.  I think the default is "all
debug options are enabled", but it would be nice to document that.


Re: [PATCH 1/4] mm: Provide kernel parameter to allow disabling page init poisoning

2018-09-12 Thread Dave Hansen
On 09/12/2018 09:36 AM, Alexander Duyck wrote:
>> vm_debug =  [KNL] Available with CONFIG_DEBUG_VM=y.
>> May slow down boot speed, especially on larger-
>> memory systems when enabled.
>> off: turn off all runtime VM debug features
>> all: turn on all debug features (default)
> This would introduce a significant amount of code change if we do it
> as a parameter that has control over everything.

Sure, but don't do that now.  Just put page poisoning under it now.


Re: [PATCH 1/4] mm: Provide kernel parameter to allow disabling page init poisoning

2018-09-12 Thread Dave Hansen
On 09/12/2018 07:49 AM, Alexander Duyck wrote:
>>> + page_init_poison=   [KNL] Boot-time parameter changing the
>>> + state of poisoning of page structures during early
>>> + boot. Used to verify page metadata is not accessed
>>> + prior to initialization. Available with
>>> + CONFIG_DEBUG_VM=y.
>>> + off: turn off poisoning
>>> + on: turn on poisoning (default)
>>> +
>> what about the following wording or something along those lines
>>
>> Boot-time parameter to control struct page poisoning which is a
>> debugging feature to catch unitialized struct page access. This option
>> is available only for CONFIG_DEBUG_VM=y and it affects boot time
>> (especially on large systems). If there are no poisoning bugs reported
>> on the particular system and workload it should be safe to disable it to
>> speed up the boot time.
> That works for me. I will update it for the next release.

FWIW, I rather liked Dan's idea of wrapping this under
vm_debug=.  We've got a zoo of boot options and it's really
hard to _remember_ what does what.  For this case, we're creating one
that's only available under a specific debug option and I think it makes
total sense to name the boot option accordingly.

For now, I think it makes total sense to do vm_debug=all/off.  If, in
the future, we get more options, we can do things like slab does and do
vm_debug=P (for Page poison) for this feature specifically.

vm_debug =  [KNL] Available with CONFIG_DEBUG_VM=y.
May slow down boot speed, especially on larger-
memory systems when enabled.
off: turn off all runtime VM debug features
all: turn on all debug features (default)


Re: [PATCH V4 4/4] kvm: add a check if pfn is from NVDIMM pmem.

2018-08-30 Thread Dave Hansen
On 08/22/2018 03:58 AM, Zhang Yi wrote:
>  bool kvm_is_reserved_pfn(kvm_pfn_t pfn)
>  {
> - if (pfn_valid(pfn))
> - return PageReserved(pfn_to_page(pfn));
> + struct page *page;
> +
> + if (pfn_valid(pfn)) {
> + page = pfn_to_page(pfn);
> + return PageReserved(page) && !is_dax_page(page);
> + }

This is in desperate need of commenting about what it is doing and why.

The changelog alone doesn't cut it.
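
For example, something along these lines -- this is only a sketch of the
*kind* of comment that's needed, and the rationale wording should come
from whoever understands the DAX case best:

if (pfn_valid(pfn)) {
        page = pfn_to_page(pfn);
        /*
         * DAX pages are marked PageReserved by the device-dax/pmem
         * code, but they still back ordinary, byte-addressable memory
         * that a guest maps and uses like RAM, so do not let KVM
         * treat them as "reserved" pfns.
         */
        return PageReserved(page) && !is_dax_page(page);
}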


Re: [PATCH v2 00/14] mm: Asynchronous + multithreaded memmap init for ZONE_DEVICE

2018-07-23 Thread Dave Hansen
On 07/23/2018 04:09 AM, Michal Hocko wrote:
> On Thu 19-07-18 11:41:10, Dave Hansen wrote:
>> Are you looking for the actual end-user reports?  This was more of a
>> case of the customer plugging in some persistent memory DIMMs, noticing
>> the boot delta and calling the folks who sold them the DIMMs (Intel).
> But this doesn't sound like something to rush a solution for in the
> upcoming merge windown, does it?

No, we should not rush it.  We'll try to rework it properly.


Re: [PATCH v2 00/14] mm: Asynchronous + multithreaded memmap init for ZONE_DEVICE

2018-07-19 Thread Dave Hansen
On 07/18/2018 05:05 AM, Michal Hocko wrote:
> On Tue 17-07-18 10:32:32, Dan Williams wrote:
>> On Tue, Jul 17, 2018 at 8:50 AM Michal Hocko  wrote:
> [...]
>>> Is there any reason that this work has to target the next merge window?
>>> The changelog is not really specific about that.
>>
>> Same reason as any other change in this space, hardware availability
>> continues to increase. These patches are a direct response to end user
>> reports of unacceptable init latency with current kernels.
> 
> Do you have any reference please?

Are you looking for the actual end-user reports?  This was more of a
case of the customer plugging in some persistent memory DIMMs, noticing
the boot delta and calling the folks who sold them the DIMMs (Intel).
We can get you more details if you'd like but there are no public
reports that we can share.

>>> There no numbers or
>>> anything that would make this sound as a high priority stuff.
>>
>> From the end of the cover letter:
>>
>> "With this change an 8 socket system was observed to initialize pmem
>> namespaces in ~4 seconds whereas it was previously taking ~4 minutes."
> 
> Well, yeah, it sounds like a nice to have thing to me. 4 minutes doesn't
> sounds excesive for a single init time operation. Machines are booting
> tens of minutes these days...

It was excessive enough for the customer to complain. :)


Re: [PATCH v3 0/3] create sysfs representation of ACPI HMAT

2017-12-22 Thread Dave Hansen
On 12/21/2017 07:09 PM, Anshuman Khandual wrote:
> I had presented a proposal for NUMA redesign in the Plumbers Conference this
> year where various memory devices with different kind of memory attributes
> can be represented in the kernel and be used explicitly from the user space.
> Here is the link to the proposal if you feel interested. The proposal is
> very intrusive and also I dont have a RFC for it yet for discussion here.

I think that's the best reason to "re-use NUMA" for this: it's _not_
intrusive.

Also, from an x86 perspective, these HMAT systems *will* be out there.
Old versions of Linux *will* see different types of memory as separate
NUMA nodes.  So, if we are going to do something different, it's going
to be interesting to un-teach those systems about using the NUMA APIs
for this.  That ship has sailed.


Re: [PATCH v3 0/3] create sysfs representation of ACPI HMAT

2017-12-20 Thread Dave Hansen
On 12/20/2017 10:19 AM, Matthew Wilcox wrote:
> I don't know what the right interface is, but my laptop has a set of
> /sys/devices/system/memory/memoryN/ directories.  Perhaps this is the
> right place to expose write_bw (etc).

Those directories are already too redundant and wasteful.  I think we'd
really rather not add to them.  In addition, it's technically possible
to have a memory section span NUMA nodes and have different performance
properties, which makes it impossible to represent there.

In any case, ACPI PXM's (Proximity Domains) are guaranteed to have
uniform performance properties in the HMAT, and we just so happen to
always create one NUMA node per PXM.  So, NUMA nodes really are a good fit.


Re: [RFC v2 0/5] surface heterogeneous memory performance information

2017-07-19 Thread Dave Hansen
On 07/19/2017 02:48 AM, Bob Liu wrote:
>> Option 2: Provide the user with HMAT performance data directly in
>> sysfs, allowing applications to directly access it without the need
>> for the library and daemon.
>> 
> Is it possible to do the memory allocation automatically by the
> kernel and transparent to users? It sounds like unreasonable that
> most users should aware this detail memory topology.

It's possible, but I'm not sure this is something we automatically want
to see added to the kernel.

I look at it like NUMA.  We have lots of details available about how
things are connected.  But, "most users" are totally unaware of this.
We give them decent default policies and the ones that need more can do
so with the NUMA APIs.

These patches provide the framework to help users/apps who *do* care and
want to make intelligent, topology-aware decisions.
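
For example (userspace sketch, nothing here is from these patches), an
application that does care can already bind a mapping to one particular
node with the existing NUMA API:

#include <numaif.h>
#include <sys/mman.h>

static void *alloc_on_node(size_t len, int node)
{
        unsigned long nodemask = 1UL << node;
        void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

        if (p == MAP_FAILED)
                return NULL;

        /* MPOL_BIND: pages for this range come only from 'node' */
        if (mbind(p, len, MPOL_BIND, &nodemask, sizeof(nodemask) * 8, 0)) {
                munmap(p, len);
                return NULL;
        }
        return p;
}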


Re: [RFC v2 0/5] surface heterogeneous memory performance information

2017-07-07 Thread Dave Hansen
On 07/06/2017 11:27 PM, Balbir Singh wrote:
> On Thu, 2017-07-06 at 15:52 -0600, Ross Zwisler wrote:
>>   # grep . mem_tgt2/* mem_tgt2/local_init/* 2>/dev/null
>>   mem_tgt2/firmware_id:1

This is here for folks that know their platform and know exactly the
firmware ID (PXM in ACPI parlance) of a given piece of memory.  Without
this, we might be stuck with requiring the NUMA node ID that the kernel
uses to be bound 1:1 with the firmware ID.

>>   mem_tgt2/is_cached:0

This tells whether the memory is cached by some other memory.  MCDRAM is
an example of this.  It can be used as a high-bandwidth cache in front
of the lower-bandwidth DRAM.

This is referred to as "Memory Side Cache Information Structure" in the
ACPI spec: www.uefi.org/sites/default/files/resources/ACPI_6_2.pdf

>>   mem_tgt2/is_enabled:1
>>   mem_tgt2/is_isolated:0

This one is described in detail in the ACPI spec.  It's called
"Reservation hint" in there.

>>   mem_tgt2/phys_addr_base:0x0
>>   mem_tgt2/phys_length_bytes:0x8
>>   mem_tgt2/local_init/read_bw_MBps:30720
>>   mem_tgt2/local_init/read_lat_nsec:100
>>   mem_tgt2/local_init/write_bw_MBps:30720
>>   mem_tgt2/local_init/write_lat_nsec:100
> 
> How to these numbers compare to normal system memory?

They're made up in this instance.  But, it's safe to expect 10x swings
in both bandwidth and latency, up and down.


Re: [RFC v2 0/5] surface heterogeneous memory performance information

2017-07-06 Thread Dave Hansen
On 07/06/2017 04:08 PM, Jerome Glisse wrote:
>> So, for applications that need to differentiate between memory ranges based
>> on their performance, what option would work best for you?  Is the local
>> (initiator,target) performance provided by patch 5 enough, or do you
>> require performance information for all possible (initiator,target)
>> pairings?
> 
> Am I right in assuming that HBM or any faster memory will be relatively
> small (1GB - 8GB, maybe 16GB?) and of a fixed amount (i.e. the size will
> depend on the exact CPU model you have)?

For HBM, that's certainly consistent with the Xeon Phi MCDRAM.

But, please remember that this patch set is for fast memory *and* slow
memory (vs. plain DRAM).

> If so, I am wondering if we should not restrict NUMA placement policy for
> such nodes to VMAs only, and forbid any policy that would prefer those
> nodes globally at the thread/process level. This would keep a wide thread
> policy from exhausting this smaller pool of memory.

You would like to take the NUMA APIs and bifurcate them?  Make some of
them able to work on this memory, and others not?  So, set_mempolicy()
would work if you passed it one of these "special" nodes with
MPOL_F_ADDR, but would fail otherwise?
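
A rough userspace sketch of that split, in current API terms (using
mbind() for the per-address-range side; node 2, MPOL_BIND and the
sizes are purely illustrative assumptions, not part of the proposal):

	#include <numaif.h>
	#include <sys/mman.h>
	#include <stdio.h>

	#define FAST_NODE 2UL	/* hypothetical fast-memory node */

	int main(void)
	{
		unsigned long mask = 1UL << FAST_NODE;
		size_t len = 2UL << 20;
		void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
				 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
		if (buf == MAP_FAILED)
			return 1;

		/* Per-address-range policy: still allowed for the special node. */
		if (mbind(buf, len, MPOL_BIND, &mask, FAST_NODE + 2, 0))
			perror("mbind");

		/* Thread-wide policy: the call that would be refused. */
		if (set_mempolicy(MPOL_BIND, &mask, FAST_NODE + 2))
			perror("set_mempolicy");

		return 0;
	}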


> The drawback of doing so would be that existing applications would not
> benefit from it. So workloads where it is acceptable to exhaust such
> memory wouldn't benefit until their applications are updated.

I think the guys running 40-year-old fortran binaries might not be so
keen on this restriction.  I bet there are a pretty substantial number
of folks out there that would love to get new hardware and just do:

numactl --membind=fast-node ./old-binary

If I were working for a hardware company, I'd sure like to just be able
to sell somebody some fancy new hardware and have their existing
software "just work" with a minimal wrapper.


Re: [PATCH] mm,x86: fix SMP x86 32bit build for native_pud_clear()

2017-02-16 Thread Dave Hansen
On 02/15/2017 12:31 PM, Dave Jiang wrote:
> The fix introduced by e4decc90 addressed the UP case for 32-bit x86, but
> it broke the SMP case that was working previously. Add an ifdef so the
> dummy function only shows up for the 32-bit UP case.

Could you elaborate a bit on how it broke things?

> Fix: e4decc90 mm,x86: native_pud_clear missing on i386 build

Which tree is that in, btw?


Re: Enabling peer to peer device transactions for PCIe devices

2016-11-23 Thread Dave Hansen
On 11/22/2016 11:49 PM, Daniel Vetter wrote:
> Yes, agreed. My idea with exposing vram sections using numa nodes wasn't
> to reuse all the existing allocation policies directly, those won't work.
> So at boot-up your default numa policy would exclude any vram nodes.
> 
> But I think (as an -mm layman) that numa gives us a lot of the tools and
> policy interface that we need to implement what we want for gpus.

Are you suggesting creating NUMA nodes for video RAM (I assume that's
what you mean by vram) where that RAM is not at all CPU-accessible?


Re: [PATCH] mm: add ZONE_DEVICE statistics to smaps

2016-11-15 Thread Dave Hansen
On 11/10/2016 02:11 PM, Dan Williams wrote:
> @@ -774,6 +778,8 @@ static int show_smap(struct seq_file *m, void *v, int 
> is_pid)
>  "ShmemPmdMapped: %8lu kB\n"
>  "Shared_Hugetlb: %8lu kB\n"
>  "Private_Hugetlb: %7lu kB\n"
> +"Device: %8lu kB\n"
> +"DeviceHugePages: %7lu kB\n"
>  "Swap:   %8lu kB\n"
>  "SwapPss:%8lu kB\n"
>  "KernelPageSize: %8lu kB\n"

So, a couple of nits...

smaps is getting a bit big, and the fields that get added in this patch
are going to be pretty infrequently used.  Is it OK if smaps grows
forever, even if most of the items are "0 kB"?

IOW, could we make it output Device* only for DAX VMAs?  All the parsers
have to handle that field being there or not (for old kernels).
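
For what it's worth, a tolerant parser isn't much code.  A sketch (the
"Device:" field name here is the one added by this patch, so it only
means anything if/when that lands):

	#include <stdio.h>

	int main(void)
	{
		FILE *f = fopen("/proc/self/smaps", "r");
		char line[256];
		unsigned long kb, total = 0;
		int seen = 0;

		if (!f)
			return 1;
		while (fgets(line, sizeof(line), f)) {
			/* Unknown or absent fields are simply skipped. */
			if (sscanf(line, "Device: %lu kB", &kb) == 1) {
				total += kb;
				seen = 1;
			}
		}
		fclose(f);

		if (seen)
			printf("Device-backed: %lu kB\n", total);
		else
			printf("no Device fields (older kernel)\n");
		return 0;
	}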

The other thing missing for DAX is the page size.  DAX mappings support
mixed page sizes, so MMUPageSize in this context is pretty worthless.
What will we do in here for 1GB DAX pages?