Re: [PATCH 0/5] [v4] Allow persistent memory to be used like normal RAM

2019-01-28 Thread Balbir Singh
On Thu, Jan 24, 2019 at 03:14:41PM -0800, Dave Hansen wrote:
> v3 spurred a bunch of really good discussion.  Thanks to everybody
> that made comments and suggestions!
> 
> I would still love some Acks on this from the folks on cc, even if it
> is on just the patch touching your area.
> 
> Note: these are based on commit d2f33c19644 in:
> 
>   git://git.kernel.org/pub/scm/linux/kernel/git/djbw/nvdimm.git 
> libnvdimm-pending
> 
> Changes since v3:
>  * Move HMM-related resource warning instead of removing it
>  * Use __request_resource() directly instead of devm.
>  * Create a separate DAX_PMEM Kconfig option, complete with help text
>  * Update patch descriptions and cover letter to give a better
>overview of use-cases and hardware where this might be useful.
> 
> Changes since v2:
>  * Updates to dev_dax_kmem_probe() in patch 5:
>* Reject probes for devices with bad NUMA nodes.  Keeps slow
>  memory from being added to node 0.
>* Use raw request_mem_region()
>* Add comments about permanent reservation
>* use dev_*() instead of printk's
>  * Add references to nvdimm documentation in descriptions
>  * Remove unneeded GPL export
>  * Add Kconfig prompt and help text
> 
> Changes since v1:
>  * Now based on git://git.kernel.org/pub/scm/linux/kernel/git/djbw/nvdimm.git
>  * Use binding/unbinding from "dax bus" code
>  * Move over to a "dax bus" driver from being an nvdimm driver
> 
> --
> 
> Persistent memory is cool.  But, currently, you have to rewrite
> your applications to use it.  Wouldn't it be cool if you could
> just have it show up in your system like normal RAM and get to
> it like a slow blob of memory?  Well... have I got the patch
> series for you!
> 
> == Background / Use Cases ==
> 
> Persistent Memory (aka Non-Volatile DIMMs / NVDIMMS) themselves
> are described in detail in Documentation/nvdimm/nvdimm.txt.
> However, this documentation focuses on actually using them as
> storage.  This set is focused on using NVDIMMs as DRAM replacement.
> 
> This is intended for Intel-style NVDIMMs (aka. Intel Optane DC
> persistent memory) NVDIMMs.  These DIMMs are physically persistent,
> more akin to flash than traditional RAM.  They are also expected to
> be more cost-effective than using RAM, which is why folks want this
> set in the first place.

What variant of NVDIMM's F/P or both?

> 
> This set is not intended for RAM-based NVDIMMs.  Those are not
> cost-effective vs. plain RAM, and this using them here would simply
> be a waste.
> 

Sounds like NVDIMM (P)

> But, why would you bother with this approach?  Intel itself [1]
> has announced a hardware feature that does something very similar:
> "Memory Mode" which turns DRAM into a cache in front of persistent
> memory, which is then as a whole used as normal "RAM"?
> 
> Here are a few reasons:
> 1. The capacity of memory mode is the size of your persistent
>memory that you dedicate.  DRAM capacity is "lost" because it
>is used for cache.  With this, you get PMEM+DRAM capacity for
>memory.
> 2. DRAM acts as a cache with memory mode, and caches can lead to
>unpredictable latencies.  Since memory mode is all-or-nothing
>(either all your DRAM is used as cache or none is), your entire
>memory space is exposed to these unpredictable latencies.  This
>solution lets you guarantee DRAM latencies if you need them.
> 3. The new "tier" of memory is exposed to software.  That means
>that you can build tiered applications or infrastructure.  A
>cloud provider could sell cheaper VMs that use more PMEM and
>more expensive ones that use DRAM.  That's impossible with
>memory mode.
> 
> Don't take this as criticism of memory mode.  Memory mode is
> awesome, and doesn't strictly require *any* software changes (we
> have software changes proposed for optimizing it though).  It has
> tons of other advantages over *this* approach.  Basically, we
> believe that the approach in these patches is complementary to
> memory mode and that both can live side-by-side in harmony.
> 
> == Patch Set Overview ==
> 
> This series adds a new "driver" to which pmem devices can be
> attached.  Once attached, the memory "owned" by the device is
> hot-added to the kernel and managed like any other memory.  On
> systems with an HMAT (a new ACPI table), each socket (roughly)
> will have a separate NUMA node for its persistent memory so
> this newly-added memory can be selected by its unique NUMA
> node.


NUMA is distance based topology, does HMAT solve these problems?
How do we prevent fallback nodes of normal nodes being pmem nodes?
On an unexpected crash/failure is there a scrubbing mechanism
or do we rely on the allocator to do the right thing prior to
reallocating any memory. Will frequent zero'ing hurt NVDIMM/pmem's
life times?

Balbir Singh.
___
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm


Re: [RESEND 2/3] powerpc/memcpy: Add memcpy_mcsafe for pmem

2018-05-02 Thread Balbir Singh
On Wed, May 2, 2018 at 6:57 AM, Dan Williams <dan.j.willi...@intel.com> wrote:
> On Thu, Apr 5, 2018 at 8:00 AM, Dan Williams <dan.j.willi...@intel.com> wrote:
>> On Wed, Apr 4, 2018 at 11:45 PM, Nicholas Piggin <npig...@gmail.com> wrote:
> [,,]
>>> What's the problem with just counting bytes copied like usercopy --
>>> why is that harder than cacheline accuracy?
>>>
>>>> I'd rather implement the existing interface and port/support the new 
>>>> interface
>>>> as it becomes available
>>>
>>> Fair enough.
>>
>> I have patches already in progress to change the interface. My
>> preference is to hold off on adding a new implementation that will
>> need to be immediately reworked. When I say "immediate" I mean that
>> should be able to post what I have for review within the next few
>> days.
>>
>> Whether this is all too late for 4.17 is another question...
>
> Here is the x86 version of a 'bytes remaining' memcpy_mcsafe() implemenation:
>
> https://lists.01.org/pipermail/linux-nvdimm/2018-May/015548.html

Thanks for the heads up! I'll work on the implementation for powerpc.

Balbir Singh.
___
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm


Re: [PATCH v3 1/4] libnvdimm: Add of_node to region and bus descriptors

2018-04-07 Thread Balbir Singh
On Sat, Apr 7, 2018 at 4:28 AM, Dan Williams <dan.j.willi...@intel.com> wrote:
> On Thu, Apr 5, 2018 at 10:21 PM, Oliver O'Halloran <ooh...@gmail.com> wrote:
>> We want to be able to cross reference the region and bus devices
>> with the device tree node that they were spawned from. libNVDIMM
>> handles creating the actual devices for these internally, so we
>> need to pass in a pointer to the relevant node in the descriptor.
>>
>> Signed-off-by: Oliver O'Halloran <ooh...@gmail.com>
>> Acked-by: Dan Williams <dan.j.willi...@intel.com>
>
> These look good to me. I'll get them applied today and let them soak
> over the weekend for a pull request on Monday.
>
> If anyone wants to add Acks or Reviews I can append them to the merge
> tag. If there are any NAKs please speak up now, but as far as I know
> there are no pending device-tree design concerns.

Hi, Dan

I can ack Oliver's work, will do so in each patch

Overall

Acked-by: Balbir Singh <bsinghar...@gmail.com>

Balbir Singh
___
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm


Re: [RESEND 2/3] powerpc/memcpy: Add memcpy_mcsafe for pmem

2018-04-06 Thread Balbir Singh
On Fri, Apr 6, 2018 at 11:26 AM, Nicholas Piggin <npig...@gmail.com> wrote:
> On Thu, 05 Apr 2018 16:40:26 -0400
> Jeff Moyer <jmo...@redhat.com> wrote:
>
>> Nicholas Piggin <npig...@gmail.com> writes:
>>
>> > On Thu, 5 Apr 2018 15:53:07 +1000
>> > Balbir Singh <bsinghar...@gmail.com> wrote:
>> >> I'm thinking about it, I wonder what "bytes remaining" mean in pmem 
>> >> context
>> >> in the context of a machine check exception. Also, do we want to be byte
>> >> accurate or cache-line accurate for the bytes remaining? The former is 
>> >> much
>> >> easier than the latter :)
>> >
>> > The ideal would be a linear measure of how much of your copy reached
>> > (or can reach) non-volatile storage with nothing further copied. You
>> > may have to allow for some relaxing of the semantics depending on
>> > what the architecture can support.
>>
>> I think you've got that backwards.  memcpy_mcsafe is used to copy *from*
>> persistent memory.  The idea is to catch errors when reading pmem, not
>> writing to it.
>>

I know the comment in x86 says posted writes and cares for only loads, but I
don't see why both sides should not be handled.

>> > What's the problem with just counting bytes copied like usercopy --
>> > why is that harder than cacheline accuracy?
>>
>> He said the former (i.e. bytes) is easier.  So, I think you're on the
>> same page.  :)
>
> Oh well that makes a lot more sense in my mind now, thanks :)

I thought the cache-aligned might make sense, since usually we'd expect the
failure to be at a cache-line level, but our copy_tofrom_user does accurate
accounting

Balbir Singh.
___
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm


Re: [PATCH v2 2/3] powerpc/memcpy: Add memcpy_mcsafe for pmem

2018-04-05 Thread Balbir Singh
On Thu, Apr 5, 2018 at 9:26 PM, Oliver <ooh...@gmail.com> wrote:
> On Thu, Apr 5, 2018 at 5:14 PM, Balbir Singh <bsinghar...@gmail.com> wrote:
>> The pmem infrastructure uses memcpy_mcsafe in the pmem
>> layer so as to convert machine check excpetions into
>> a return value on failure in case a machine check
>> exception is encoutered during the memcpy.
>>
>
> Would it be possible to move the bulk of the copyuser code into a
> seperate file which can be #included once the these err macros are
> defined? Anton's memcpy is pretty hairy and I don't think anyone wants
> to have multiple copies of it in the tree, even in a cut down form.
>

I've split it out for now, in the future that might be a good thing to do.
The copy_tofrom_user_power7 falls backs on __copy_tofrom_user_base
to track exactly how much is left over. Adding these changes there would
create a larger churn and need way more testing. I've taken this short-cut
for now with a promise to fix that as the semantics of memcpy_mcsafe()
change to do more accurate tracking of how much was copied over.

Balbir Singh.
___
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm


[PATCH v2 3/3] powerpc/mce: Handle memcpy_mcsafe

2018-04-05 Thread Balbir Singh
Add a blocking notifier callback to be called in real-mode
on machine check exceptions for UE (ld/st) errors only.
The patch registers a callback on boot to be notified
of machine check exceptions and returns a NOTIFY_STOP when
a page of interest is seen as the source of the machine
check exception. This page of interest is a ZONE_DEVICE
page and hence for now, for memcpy_mcsafe to work, the page
needs to belong to ZONE_DEVICE and memcpy_mcsafe should be
used to access the memory.

The patch also modifies the NIP of the exception context
to go back to the fixup handler (in memcpy_mcsafe) and does
not print any error message as the error is treated as
returned via a return value and handled.

Signed-off-by: Balbir Singh <bsinghar...@gmail.com>
---
 arch/powerpc/include/asm/mce.h |  3 +-
 arch/powerpc/kernel/mce.c  | 77 --
 2 files changed, 77 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/include/asm/mce.h b/arch/powerpc/include/asm/mce.h
index 3a1226e9b465..a76638e3e47e 100644
--- a/arch/powerpc/include/asm/mce.h
+++ b/arch/powerpc/include/asm/mce.h
@@ -125,7 +125,8 @@ struct machine_check_event {
enum MCE_UeErrorType ue_error_type:8;
uint8_t effective_address_provided;
uint8_t physical_address_provided;
-   uint8_t reserved_1[5];
+   uint8_t error_return;
+   uint8_t reserved_1[4];
uint64_teffective_address;
uint64_tphysical_address;
uint8_t reserved_2[8];
diff --git a/arch/powerpc/kernel/mce.c b/arch/powerpc/kernel/mce.c
index efdd16a79075..b9e4881fa8c5 100644
--- a/arch/powerpc/kernel/mce.c
+++ b/arch/powerpc/kernel/mce.c
@@ -28,7 +28,9 @@
 #include 
 #include 
 #include 
+#include 
 
+#include 
 #include 
 #include 
 
@@ -54,6 +56,52 @@ static struct irq_work mce_event_process_work = {
 
 DECLARE_WORK(mce_ue_event_work, machine_process_ue_event);
 
+static BLOCKING_NOTIFIER_HEAD(mce_notifier_list);
+
+int register_mce_notifier(struct notifier_block *nb)
+{
+   return blocking_notifier_chain_register(_notifier_list, nb);
+}
+EXPORT_SYMBOL_GPL(register_mce_notifier);
+
+int unregister_mce_notifier(struct notifier_block *nb)
+{
+   return blocking_notifier_chain_unregister(_notifier_list, nb);
+}
+EXPORT_SYMBOL_GPL(unregister_mce_notifier);
+
+
+static int check_memcpy_mcsafe(struct notifier_block *nb,
+   unsigned long val, void *data)
+{
+   /*
+* val contains the physical_address of the bad address
+*/
+   unsigned long pfn = val >> PAGE_SHIFT;
+   struct page *page = realmode_pfn_to_page(pfn);
+   int rc = NOTIFY_DONE;
+
+   if (!page)
+   goto out;
+
+   if (is_zone_device_page(page))  /* for HMM and PMEM */
+   rc = NOTIFY_STOP;
+out:
+   return rc;
+}
+
+struct notifier_block memcpy_mcsafe_nb = {
+   .priority = 0,
+   .notifier_call = check_memcpy_mcsafe,
+};
+
+int  mce_mcsafe_register(void)
+{
+   register_mce_notifier(_mcsafe_nb);
+   return 0;
+}
+arch_initcall(mce_mcsafe_register);
+
 static void mce_set_error_info(struct machine_check_event *mce,
   struct mce_error_info *mce_err)
 {
@@ -151,9 +199,31 @@ void save_mce_event(struct pt_regs *regs, long handled,
mce->u.ue_error.effective_address_provided = true;
mce->u.ue_error.effective_address = addr;
if (phys_addr != ULONG_MAX) {
+   int rc;
+   const struct exception_table_entry *entry;
+
+   /*
+* Once we have the physical address, we check to
+* see if the current nip has a fixup entry.
+* Having a fixup entry plus the notifier stating
+* that it can handle the exception is an indication
+* that we should return to the fixup entry and
+* return an error from there
+*/
mce->u.ue_error.physical_address_provided = true;
mce->u.ue_error.physical_address = phys_addr;
-   machine_check_ue_event(mce);
+
+   rc = blocking_notifier_call_chain(_notifier_list,
+   phys_addr, NULL);
+   if (rc & NOTIFY_STOP_MASK) {
+   entry = search_exception_tables(regs->nip);
+   if (entry != NULL) {
+   mce->u.ue_error.error_return = 1;
+   regs->nip = extable_fixup(entry);
+

[PATCH v2 1/3] powerpc/mce: Bug fixes for MCE handling in kernel space

2018-04-05 Thread Balbir Singh
The code currently assumes PAGE_SHIFT as the shift value of
the pfn, this works correctly (mostly) for user space pages,
but the correct thing to do is

1. Extract the shift value returned via the pte-walk API's
2. Use the shift value to access the instruction address.

Note, the final physical address still use PAGE_SHIFT for
computation. handle_ierror() is not modified and handle_derror()
is modified just for extracting the correct instruction
address.

This is largely due to __find_linux_pte() returning pfn's
shifted by pdshift. The code is much more generic and can
handle shift values returned.

Fixes: ba41e1e1ccb9 ("powerpc/mce: Hookup derror (load/store) UE errors")

Signed-off-by: Balbir Singh <bsinghar...@gmail.com>
---
 arch/powerpc/kernel/mce_power.c | 26 --
 1 file changed, 16 insertions(+), 10 deletions(-)

diff --git a/arch/powerpc/kernel/mce_power.c b/arch/powerpc/kernel/mce_power.c
index fe6fc63251fe..bd9754def479 100644
--- a/arch/powerpc/kernel/mce_power.c
+++ b/arch/powerpc/kernel/mce_power.c
@@ -36,7 +36,8 @@
  * Convert an address related to an mm to a PFN. NOTE: we are in real
  * mode, we could potentially race with page table updates.
  */
-static unsigned long addr_to_pfn(struct pt_regs *regs, unsigned long addr)
+static unsigned long addr_to_pfn(struct pt_regs *regs, unsigned long addr,
+   unsigned int *shift)
 {
pte_t *ptep;
unsigned long flags;
@@ -49,13 +50,15 @@ static unsigned long addr_to_pfn(struct pt_regs *regs, 
unsigned long addr)
 
local_irq_save(flags);
if (mm == current->mm)
-   ptep = find_current_mm_pte(mm->pgd, addr, NULL, NULL);
+   ptep = find_current_mm_pte(mm->pgd, addr, NULL, shift);
else
-   ptep = find_init_mm_pte(addr, NULL);
+   ptep = find_init_mm_pte(addr, shift);
local_irq_restore(flags);
if (!ptep || pte_special(*ptep))
return ULONG_MAX;
-   return pte_pfn(*ptep);
+   if (!*shift)
+   *shift = PAGE_SHIFT;
+   return (pte_val(*ptep) & PTE_RPN_MASK) >> *shift;
 }
 
 /* flush SLBs and reload */
@@ -353,15 +356,16 @@ static int mce_find_instr_ea_and_pfn(struct pt_regs 
*regs, uint64_t *addr,
unsigned long pfn, instr_addr;
struct instruction_op op;
struct pt_regs tmp = *regs;
+   unsigned int shift;
 
-   pfn = addr_to_pfn(regs, regs->nip);
+   pfn = addr_to_pfn(regs, regs->nip, );
if (pfn != ULONG_MAX) {
-   instr_addr = (pfn << PAGE_SHIFT) + (regs->nip & ~PAGE_MASK);
+   instr_addr = (pfn << shift) + (regs->nip & ((1 << shift) - 1));
instr = *(unsigned int *)(instr_addr);
if (!analyse_instr(, , instr)) {
-   pfn = addr_to_pfn(regs, op.ea);
+   pfn = addr_to_pfn(regs, op.ea, );
*addr = op.ea;
-   *phys_addr = (pfn << PAGE_SHIFT);
+   *phys_addr = (pfn << shift);
return 0;
}
/*
@@ -435,12 +439,14 @@ static int mce_handle_ierror(struct pt_regs *regs,
if (mce_err->severity == MCE_SEV_ERROR_SYNC &&
table[i].error_type == MCE_ERROR_TYPE_UE) {
unsigned long pfn;
+   unsigned int shift;
 
if (get_paca()->in_mce < MAX_MCE_DEPTH) {
-   pfn = addr_to_pfn(regs, regs->nip);
+   pfn = addr_to_pfn(regs, regs->nip,
+   );
if (pfn != ULONG_MAX) {
*phys_addr =
-   (pfn << PAGE_SHIFT);
+   (pfn << shift);
handled = 1;
}
}
-- 
2.13.6

___
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm


[PATCH v2 2/3] powerpc/memcpy: Add memcpy_mcsafe for pmem

2018-04-05 Thread Balbir Singh
The pmem infrastructure uses memcpy_mcsafe in the pmem
layer so as to convert machine check excpetions into
a return value on failure in case a machine check
exception is encoutered during the memcpy.

This patch largely borrows from the copyuser_power7
logic and does not add the VMX optimizations, largely
to keep the patch simple. If needed those optimizations
can be folded in.

Signed-off-by: Balbir Singh <bsinghar...@gmail.com>
Acked-by: Nicholas Piggin <npig...@gmail.com>
---
 arch/powerpc/include/asm/string.h   |   2 +
 arch/powerpc/lib/Makefile   |   2 +-
 arch/powerpc/lib/memcpy_mcsafe_64.S | 212 
 3 files changed, 215 insertions(+), 1 deletion(-)
 create mode 100644 arch/powerpc/lib/memcpy_mcsafe_64.S

diff --git a/arch/powerpc/include/asm/string.h 
b/arch/powerpc/include/asm/string.h
index 9b8cedf618f4..b7e872a64726 100644
--- a/arch/powerpc/include/asm/string.h
+++ b/arch/powerpc/include/asm/string.h
@@ -30,7 +30,9 @@ extern void * memcpy_flushcache(void *,const void 
*,__kernel_size_t);
 #ifdef CONFIG_PPC64
 #define __HAVE_ARCH_MEMSET32
 #define __HAVE_ARCH_MEMSET64
+#define __HAVE_ARCH_MEMCPY_MCSAFE
 
+extern int memcpy_mcsafe(void *dst, const void *src, __kernel_size_t sz);
 extern void *__memset16(uint16_t *, uint16_t v, __kernel_size_t);
 extern void *__memset32(uint32_t *, uint32_t v, __kernel_size_t);
 extern void *__memset64(uint64_t *, uint64_t v, __kernel_size_t);
diff --git a/arch/powerpc/lib/Makefile b/arch/powerpc/lib/Makefile
index 3c29c9009bbf..048afee9f518 100644
--- a/arch/powerpc/lib/Makefile
+++ b/arch/powerpc/lib/Makefile
@@ -24,7 +24,7 @@ endif
 
 obj64-y+= copypage_64.o copyuser_64.o mem_64.o hweight_64.o \
   copyuser_power7.o string_64.o copypage_power7.o memcpy_power7.o \
-  memcpy_64.o memcmp_64.o pmem.o
+  memcpy_64.o memcmp_64.o pmem.o memcpy_mcsafe_64.o
 
 obj64-$(CONFIG_SMP)+= locks.o
 obj64-$(CONFIG_ALTIVEC)+= vmx-helper.o
diff --git a/arch/powerpc/lib/memcpy_mcsafe_64.S 
b/arch/powerpc/lib/memcpy_mcsafe_64.S
new file mode 100644
index ..e7eaa9b6cded
--- /dev/null
+++ b/arch/powerpc/lib/memcpy_mcsafe_64.S
@@ -0,0 +1,212 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Copyright (C) IBM Corporation, 2011
+ * Derived from copyuser_power7.s by Anton Blanchard <an...@au.ibm.com>
+ * Author - Balbir Singh <bsinghar...@gmail.com>
+ */
+#include 
+#include 
+
+   .macro err1
+100:
+   EX_TABLE(100b,.Ldo_err1)
+   .endm
+
+   .macro err2
+200:
+   EX_TABLE(200b,.Ldo_err2)
+   .endm
+
+.Ldo_err2:
+   ld  r22,STK_REG(R22)(r1)
+   ld  r21,STK_REG(R21)(r1)
+   ld  r20,STK_REG(R20)(r1)
+   ld  r19,STK_REG(R19)(r1)
+   ld  r18,STK_REG(R18)(r1)
+   ld  r17,STK_REG(R17)(r1)
+   ld  r16,STK_REG(R16)(r1)
+   ld  r15,STK_REG(R15)(r1)
+   ld  r14,STK_REG(R14)(r1)
+   addir1,r1,STACKFRAMESIZE
+.Ldo_err1:
+   li  r3,-EFAULT
+   blr
+
+
+_GLOBAL(memcpy_mcsafe)
+   cmpldi  r5,16
+   blt .Lshort_copy
+
+.Lcopy:
+   /* Get the source 8B aligned */
+   neg r6,r4
+   mtocrf  0x01,r6
+   clrldi  r6,r6,(64-3)
+
+   bf  cr7*4+3,1f
+err1;  lbz r0,0(r4)
+   addir4,r4,1
+err1;  stb r0,0(r3)
+   addir3,r3,1
+
+1: bf  cr7*4+2,2f
+err1;  lhz r0,0(r4)
+   addir4,r4,2
+err1;  sth r0,0(r3)
+   addir3,r3,2
+
+2: bf  cr7*4+1,3f
+err1;  lwz r0,0(r4)
+   addir4,r4,4
+err1;  stw r0,0(r3)
+   addir3,r3,4
+
+3: sub r5,r5,r6
+   cmpldi  r5,128
+   blt 5f
+
+   mflrr0
+   stdur1,-STACKFRAMESIZE(r1)
+   std r14,STK_REG(R14)(r1)
+   std r15,STK_REG(R15)(r1)
+   std r16,STK_REG(R16)(r1)
+   std r17,STK_REG(R17)(r1)
+   std r18,STK_REG(R18)(r1)
+   std r19,STK_REG(R19)(r1)
+   std r20,STK_REG(R20)(r1)
+   std r21,STK_REG(R21)(r1)
+   std r22,STK_REG(R22)(r1)
+   std r0,STACKFRAMESIZE+16(r1)
+
+   srdir6,r5,7
+   mtctr   r6
+
+   /* Now do cacheline (128B) sized loads and stores. */
+   .align  5
+4:
+err2;  ld  r0,0(r4)
+err2;  ld  r6,8(r4)
+err2;  ld  r7,16(r4)
+err2;  ld  r8,24(r4)
+err2;  ld  r9,32(r4)
+err2;  ld  r10,40(r4)
+err2;  ld  r11,48(r4)
+err2;  ld  r12,56(r4)
+err2;  ld  r14,64(r4)
+err2;  ld  r15,72(r4)
+err2;  ld  r16,80(r4)
+err2;  ld  r17,88(r4)
+err2;  ld  r18,96(r4)
+err2;  ld  r19,104(r4)
+err2;  ld  r20,112(r4)
+err2;  ld  r21,120(r4)
+   addir4,r4,128
+err2;  std r0,0(r3)
+err2;  std r6,8(r3)
+err2;  std r7,16(r3)
+err2;  std r8,24(r3)
+err2;  std r9,32(r3)
+err2;  std r10,40(r3)
+err2;  std r11,48(r3)
+err2;  std r12,56(r3)
+err2;  std r14,64(r3)
+err2;  std r15,72(r3)
+err2;  std   

Re: [RESEND 2/3] powerpc/memcpy: Add memcpy_mcsafe for pmem

2018-04-04 Thread Balbir Singh
On Thu, 5 Apr 2018 15:04:05 +1000
Nicholas Piggin <npig...@gmail.com> wrote:

> On Wed, 4 Apr 2018 20:00:52 -0700
> Dan Williams <dan.j.willi...@intel.com> wrote:
> 
> > [ adding Matthew, Christoph, and Tony  ]
> > 
> > On Wed, Apr 4, 2018 at 4:57 PM, Nicholas Piggin <npig...@gmail.com> wrote:  
> > > On Thu,  5 Apr 2018 09:19:42 +1000
> > > Balbir Singh <bsinghar...@gmail.com> wrote:
> > >
> > >> The pmem infrastructure uses memcpy_mcsafe in the pmem
> > >> layer so as to convert machine check excpetions into
> > >> a return value on failure in case a machine check
> > >> exception is encoutered during the memcpy.
> > >>
> > >> This patch largely borrows from the copyuser_power7
> > >> logic and does not add the VMX optimizations, largely
> > >> to keep the patch simple. If needed those optimizations
> > >> can be folded in.
> > >
> > > So memcpy_mcsafe doesn't return number of bytes copied?
> > > Huh, well that makes it simple.
> > 
> > Well, not in current kernels, but we need to add that support or
> > remove the direct call to copy_to_iter() in fs/dax.c. I'm looking
> > right now to add "bytes remaining" support to the x86 memcpy_mcsafe(),
> > but for copy_to_user we also need to handle bytes remaining for write
> > faults. That fix is hopefully something that can land in an early
> > 4.17-rc, but it won't be ready for -rc1.  
> 
> I wonder if the powerpc implementation should just go straight to
> counting bytes. Backporting to this interface would be trivial, but
> it would just mean there's only one variant of the code to support.
> That's up to Balbir though.
> 

I'm thinking about it, I wonder what "bytes remaining" mean in pmem context
in the context of a machine check exception. Also, do we want to be byte
accurate or cache-line accurate for the bytes remaining? The former is much
easier than the latter :)


I'd rather implement the existing interface and port/support the new interface
as it becomes available

Balbir Singh.

___
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm


Re: [RESEND 1/3] powerpc/mce: Bug fixes for MCE handling in kernel space

2018-04-04 Thread Balbir Singh
On Thu, 5 Apr 2018 09:49:00 +1000
Nicholas Piggin <npig...@gmail.com> wrote:

> On Thu,  5 Apr 2018 09:19:41 +1000
> Balbir Singh <bsinghar...@gmail.com> wrote:
> 
> > The code currently assumes PAGE_SHIFT as the shift value of
> > the pfn, this works correctly (mostly) for user space pages,
> > but the correct thing to do is  
> 
> It would be good to actually explain the problem in the
> changelog. I would have thought pte_pfn returns a
> PAGE_SIZE based pfn value?
>

The issue is hidden inside of hugepte_offset() as invoked by __find_linux_pte().
I will send a new version because the code needs to do
<< (shift - PAGE_SHIFT) for instruction address.

> > 
> > 1. Extrace the shift value returned via the pte-walk API's  
> 
>   ^^^ extract?

Thanks, yes, typo!

Balbir Singh.
___
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm


[RESEND 3/3] powerpc/mce: Handle memcpy_mcsafe

2018-04-04 Thread Balbir Singh
Add a blocking notifier callback to be called in real-mode
on machine check exceptions for UE (ld/st) errors only.
The patch registers a callback on boot to be notified
of machine check exceptions and returns a NOTIFY_STOP when
a page of interest is seen as the source of the machine
check exception. This page of interest is a ZONE_DEVICE
page and hence for now, for memcpy_mcsafe to work, the page
needs to belong to ZONE_DEVICE and memcpy_mcsafe should be
used to access the memory.

The patch also modifies the NIP of the exception context
to go back to the fixup handler (in memcpy_mcsafe) and does
not print any error message as the error is treated as
returned via a return value and handled.

Signed-off-by: Balbir Singh <bsinghar...@gmail.com>
---
 arch/powerpc/include/asm/mce.h |  3 +-
 arch/powerpc/kernel/mce.c  | 77 --
 2 files changed, 77 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/include/asm/mce.h b/arch/powerpc/include/asm/mce.h
index 3a1226e9b465..a76638e3e47e 100644
--- a/arch/powerpc/include/asm/mce.h
+++ b/arch/powerpc/include/asm/mce.h
@@ -125,7 +125,8 @@ struct machine_check_event {
enum MCE_UeErrorType ue_error_type:8;
uint8_t effective_address_provided;
uint8_t physical_address_provided;
-   uint8_t reserved_1[5];
+   uint8_t error_return;
+   uint8_t reserved_1[4];
uint64_teffective_address;
uint64_tphysical_address;
uint8_t reserved_2[8];
diff --git a/arch/powerpc/kernel/mce.c b/arch/powerpc/kernel/mce.c
index efdd16a79075..b9e4881fa8c5 100644
--- a/arch/powerpc/kernel/mce.c
+++ b/arch/powerpc/kernel/mce.c
@@ -28,7 +28,9 @@
 #include 
 #include 
 #include 
+#include 
 
+#include 
 #include 
 #include 
 
@@ -54,6 +56,52 @@ static struct irq_work mce_event_process_work = {
 
 DECLARE_WORK(mce_ue_event_work, machine_process_ue_event);
 
+static BLOCKING_NOTIFIER_HEAD(mce_notifier_list);
+
+int register_mce_notifier(struct notifier_block *nb)
+{
+   return blocking_notifier_chain_register(_notifier_list, nb);
+}
+EXPORT_SYMBOL_GPL(register_mce_notifier);
+
+int unregister_mce_notifier(struct notifier_block *nb)
+{
+   return blocking_notifier_chain_unregister(_notifier_list, nb);
+}
+EXPORT_SYMBOL_GPL(unregister_mce_notifier);
+
+
+static int check_memcpy_mcsafe(struct notifier_block *nb,
+   unsigned long val, void *data)
+{
+   /*
+* val contains the physical_address of the bad address
+*/
+   unsigned long pfn = val >> PAGE_SHIFT;
+   struct page *page = realmode_pfn_to_page(pfn);
+   int rc = NOTIFY_DONE;
+
+   if (!page)
+   goto out;
+
+   if (is_zone_device_page(page))  /* for HMM and PMEM */
+   rc = NOTIFY_STOP;
+out:
+   return rc;
+}
+
+struct notifier_block memcpy_mcsafe_nb = {
+   .priority = 0,
+   .notifier_call = check_memcpy_mcsafe,
+};
+
+int  mce_mcsafe_register(void)
+{
+   register_mce_notifier(_mcsafe_nb);
+   return 0;
+}
+arch_initcall(mce_mcsafe_register);
+
 static void mce_set_error_info(struct machine_check_event *mce,
   struct mce_error_info *mce_err)
 {
@@ -151,9 +199,31 @@ void save_mce_event(struct pt_regs *regs, long handled,
mce->u.ue_error.effective_address_provided = true;
mce->u.ue_error.effective_address = addr;
if (phys_addr != ULONG_MAX) {
+   int rc;
+   const struct exception_table_entry *entry;
+
+   /*
+* Once we have the physical address, we check to
+* see if the current nip has a fixup entry.
+* Having a fixup entry plus the notifier stating
+* that it can handle the exception is an indication
+* that we should return to the fixup entry and
+* return an error from there
+*/
mce->u.ue_error.physical_address_provided = true;
mce->u.ue_error.physical_address = phys_addr;
-   machine_check_ue_event(mce);
+
+   rc = blocking_notifier_call_chain(_notifier_list,
+   phys_addr, NULL);
+   if (rc & NOTIFY_STOP_MASK) {
+   entry = search_exception_tables(regs->nip);
+   if (entry != NULL) {
+   mce->u.ue_error.error_return = 1;
+   regs->nip = extable_fixup(entry);
+

[RESEND 2/3] powerpc/memcpy: Add memcpy_mcsafe for pmem

2018-04-04 Thread Balbir Singh
The pmem infrastructure uses memcpy_mcsafe in the pmem
layer so as to convert machine check excpetions into
a return value on failure in case a machine check
exception is encoutered during the memcpy.

This patch largely borrows from the copyuser_power7
logic and does not add the VMX optimizations, largely
to keep the patch simple. If needed those optimizations
can be folded in.

Signed-off-by: Balbir Singh <bsinghar...@gmail.com>
---
 arch/powerpc/include/asm/string.h   |   2 +
 arch/powerpc/lib/Makefile   |   2 +-
 arch/powerpc/lib/memcpy_mcsafe_64.S | 212 
 3 files changed, 215 insertions(+), 1 deletion(-)
 create mode 100644 arch/powerpc/lib/memcpy_mcsafe_64.S

diff --git a/arch/powerpc/include/asm/string.h 
b/arch/powerpc/include/asm/string.h
index 9b8cedf618f4..b7e872a64726 100644
--- a/arch/powerpc/include/asm/string.h
+++ b/arch/powerpc/include/asm/string.h
@@ -30,7 +30,9 @@ extern void * memcpy_flushcache(void *,const void 
*,__kernel_size_t);
 #ifdef CONFIG_PPC64
 #define __HAVE_ARCH_MEMSET32
 #define __HAVE_ARCH_MEMSET64
+#define __HAVE_ARCH_MEMCPY_MCSAFE
 
+extern int memcpy_mcsafe(void *dst, const void *src, __kernel_size_t sz);
 extern void *__memset16(uint16_t *, uint16_t v, __kernel_size_t);
 extern void *__memset32(uint32_t *, uint32_t v, __kernel_size_t);
 extern void *__memset64(uint64_t *, uint64_t v, __kernel_size_t);
diff --git a/arch/powerpc/lib/Makefile b/arch/powerpc/lib/Makefile
index 3c29c9009bbf..048afee9f518 100644
--- a/arch/powerpc/lib/Makefile
+++ b/arch/powerpc/lib/Makefile
@@ -24,7 +24,7 @@ endif
 
 obj64-y+= copypage_64.o copyuser_64.o mem_64.o hweight_64.o \
   copyuser_power7.o string_64.o copypage_power7.o memcpy_power7.o \
-  memcpy_64.o memcmp_64.o pmem.o
+  memcpy_64.o memcmp_64.o pmem.o memcpy_mcsafe_64.o
 
 obj64-$(CONFIG_SMP)+= locks.o
 obj64-$(CONFIG_ALTIVEC)+= vmx-helper.o
diff --git a/arch/powerpc/lib/memcpy_mcsafe_64.S 
b/arch/powerpc/lib/memcpy_mcsafe_64.S
new file mode 100644
index ..e7eaa9b6cded
--- /dev/null
+++ b/arch/powerpc/lib/memcpy_mcsafe_64.S
@@ -0,0 +1,212 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Copyright (C) IBM Corporation, 2011
+ * Derived from copyuser_power7.s by Anton Blanchard <an...@au.ibm.com>
+ * Author - Balbir Singh <bsinghar...@gmail.com>
+ */
+#include 
+#include 
+
+   .macro err1
+100:
+   EX_TABLE(100b,.Ldo_err1)
+   .endm
+
+   .macro err2
+200:
+   EX_TABLE(200b,.Ldo_err2)
+   .endm
+
+.Ldo_err2:
+   ld  r22,STK_REG(R22)(r1)
+   ld  r21,STK_REG(R21)(r1)
+   ld  r20,STK_REG(R20)(r1)
+   ld  r19,STK_REG(R19)(r1)
+   ld  r18,STK_REG(R18)(r1)
+   ld  r17,STK_REG(R17)(r1)
+   ld  r16,STK_REG(R16)(r1)
+   ld  r15,STK_REG(R15)(r1)
+   ld  r14,STK_REG(R14)(r1)
+   addir1,r1,STACKFRAMESIZE
+.Ldo_err1:
+   li  r3,-EFAULT
+   blr
+
+
+_GLOBAL(memcpy_mcsafe)
+   cmpldi  r5,16
+   blt .Lshort_copy
+
+.Lcopy:
+   /* Get the source 8B aligned */
+   neg r6,r4
+   mtocrf  0x01,r6
+   clrldi  r6,r6,(64-3)
+
+   bf  cr7*4+3,1f
+err1;  lbz r0,0(r4)
+   addir4,r4,1
+err1;  stb r0,0(r3)
+   addir3,r3,1
+
+1: bf  cr7*4+2,2f
+err1;  lhz r0,0(r4)
+   addir4,r4,2
+err1;  sth r0,0(r3)
+   addir3,r3,2
+
+2: bf  cr7*4+1,3f
+err1;  lwz r0,0(r4)
+   addir4,r4,4
+err1;  stw r0,0(r3)
+   addir3,r3,4
+
+3: sub r5,r5,r6
+   cmpldi  r5,128
+   blt 5f
+
+   mflrr0
+   stdur1,-STACKFRAMESIZE(r1)
+   std r14,STK_REG(R14)(r1)
+   std r15,STK_REG(R15)(r1)
+   std r16,STK_REG(R16)(r1)
+   std r17,STK_REG(R17)(r1)
+   std r18,STK_REG(R18)(r1)
+   std r19,STK_REG(R19)(r1)
+   std r20,STK_REG(R20)(r1)
+   std r21,STK_REG(R21)(r1)
+   std r22,STK_REG(R22)(r1)
+   std r0,STACKFRAMESIZE+16(r1)
+
+   srdir6,r5,7
+   mtctr   r6
+
+   /* Now do cacheline (128B) sized loads and stores. */
+   .align  5
+4:
+err2;  ld  r0,0(r4)
+err2;  ld  r6,8(r4)
+err2;  ld  r7,16(r4)
+err2;  ld  r8,24(r4)
+err2;  ld  r9,32(r4)
+err2;  ld  r10,40(r4)
+err2;  ld  r11,48(r4)
+err2;  ld  r12,56(r4)
+err2;  ld  r14,64(r4)
+err2;  ld  r15,72(r4)
+err2;  ld  r16,80(r4)
+err2;  ld  r17,88(r4)
+err2;  ld  r18,96(r4)
+err2;  ld  r19,104(r4)
+err2;  ld  r20,112(r4)
+err2;  ld  r21,120(r4)
+   addir4,r4,128
+err2;  std r0,0(r3)
+err2;  std r6,8(r3)
+err2;  std r7,16(r3)
+err2;  std r8,24(r3)
+err2;  std r9,32(r3)
+err2;  std r10,40(r3)
+err2;  std r11,48(r3)
+err2;  std r12,56(r3)
+err2;  std r14,64(r3)
+err2;  std r15,72(r3)
+err2;  std r16,80(r3)
+err2;  std r17,88(r3)
+err2;  std   

[RESEND 0/3] Add support for memcpy_mcsafe

2018-04-04 Thread Balbir Singh
memcpy_mcsafe() is an API currently used by the pmem subsystem to convert
errors while doing a memcpy (machine check exception errors) to a return
value. This patchset consists of three patches

1. The first patch is a bug fix to handle machine check errors correctly
while walking the page tables in kernel mode, due to huge pmd/pud sizes
2. The second patch adds memcpy_mcsafe() support, this is largely derived
from existing code
3. The third patch registers for callbacks on machine check exceptions and
in them uses specialized knowledge of the type of page to decide whether
to handle the MCE as is or to return to a fixup address present in
memcpy_mcsafe(). If a fixup address is used, then we return an error
value of -EFAULT to the caller.

Testing

A large part of the testing was done under a simulator by selectively
inserting machine check exceptions in a test driver doing memcpy_mcsafe
via ioctls.

Balbir Singh (3):
  powerpc/mce: Bug fixes for MCE handling in kernel space
  powerpc/memcpy: Add memcpy_mcsafe for pmem
  powerpc/mce: Handle memcpy_mcsafe

 arch/powerpc/include/asm/mce.h  |   3 +-
 arch/powerpc/include/asm/string.h   |   2 +
 arch/powerpc/kernel/mce.c   |  76 +++-
 arch/powerpc/kernel/mce_power.c |  17 +--
 arch/powerpc/lib/Makefile   |   2 +-
 arch/powerpc/lib/memcpy_mcsafe_64.S | 225 
 6 files changed, 314 insertions(+), 11 deletions(-)
 create mode 100644 arch/powerpc/lib/memcpy_mcsafe_64.S

-- 
2.13.6

___
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm


Re: [RESEND v2 4/4] powerpc/powernv: Create platform devs for nvdimm buses

2018-04-04 Thread Balbir Singh
On Wed,  4 Apr 2018 00:24:15 +1000
Oliver O'Halloran <ooh...@gmail.com> wrote:

> Scan the devicetree for an nvdimm-bus compatible and create
> a platform device for them.
> 
> Signed-off-by: Oliver O'Halloran <ooh...@gmail.com>
> ---

Acked-by: Balbir Singh <bsinghar...@gmail.com>

___
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm


Re: [RESEND v2 1/4] libnvdimm: Add of_node to region and bus descriptors

2018-04-04 Thread Balbir Singh
On Wed,  4 Apr 2018 00:24:12 +1000
Oliver O'Halloran <ooh...@gmail.com> wrote:

> We want to be able to cross reference the region and bus devices
> with the device tree node that they were spawned from. libNVDIMM
> handles creating the actual devices for these internally, so we
> need to pass in a pointer to the relevant node in the descriptor.
> 
> Signed-off-by: Oliver O'Halloran <ooh...@gmail.com>
> Acked-by: Dan Williams <dan.j.willi...@intel.com>
> ---
Acked-by: Balbir Singh <bsinghar...@gmail.com>

___
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm


Re: [PATCH 3/6] libnvdimm: Add device-tree based driver

2018-03-25 Thread Balbir Singh
NXIO;
> +
> + pr_err("registering region for %pOF\n", np);
> +
> + if (of_address_to_resource(np, 0, _res)) {
> + pr_warn("Unable to parse reg[0] for %pOF\n", np);
> + return -ENXIO;
> + }
> +
> + memset(_desc, 0, sizeof(ndr_desc));
> + ndr_desc.res = _res;
> + ndr_desc.of_node = np;
> + ndr_desc.attr_groups = region_attr_groups;
> + ndr_desc.numa_node = of_node_to_nid(np);

We probably need an ndr_desc.provider

> + set_bit(ND_REGION_PAGEMAP, _desc.flags);
> +
> + /*
> +  * NB: libnvdimm copies the data from ndr_desc into it's own structures
> +  * so passing stack pointers is fine.
> +  */
> + if (of_get_property(np, "volatile", NULL))
> + region = nvdimm_volatile_region_create(bus, _desc);
> + else
> + region = nvdimm_pmem_region_create(bus, _desc);
> +
> + pr_warn("registered pmem region %px\n", region);
> + if (!region)
> + return -ENXIO;
> +
> + platform_set_drvdata(pdev, region);
> +
> + return 0;
> +}
> +
> +static int of_nd_region_remove(struct platform_device *pdev)
> +{
> + struct nd_region *r = platform_get_drvdata(pdev);
> +
> + nd_region_destroy(r);
> +
> + return 0;
> +}
> +
> +static const struct of_device_id of_nd_region_match[] = {
> + { .compatible = "nvdimm-region" },
> + { },
> +};
> +
> +static struct platform_driver of_nd_region_driver = {
> + .probe = of_nd_region_probe,
> + .remove = of_nd_region_remove,
> + .driver = {
> + .name = "of_nd_region",
> + .owner = THIS_MODULE,
> + .of_match_table = of_nd_region_match,
> + },
> +};
> +
> +/* bus wrangling */
> +
> +static int __init of_nvdimm_init(void)
> +{
> + /* register  */
> + bus_desc.attr_groups = bus_attr_groups;
> + bus_desc.provider_name = "of_nvdimm";
> + bus_desc.module = THIS_MODULE;
> +
> + /* does parent == NULL work? */
> + bus = nvdimm_bus_register(NULL, _desc);
> + if (!bus)
> + return -ENODEV;
> +
> + platform_driver_register(_nd_region_driver);
> +
> + return 0;
> +}
> +module_init(of_nvdimm_init);
> +
> +static void __init of_nvdimm_exit(void)
> +{
> + nvdimm_bus_unregister(bus);
> + platform_driver_unregister(_nd_region_driver);
> +}
> +module_exit(of_nvdimm_exit);
> +
> +MODULE_DEVICE_TABLE(of, of_nd_region_match);
> +MODULE_LICENSE("GPL");
> +MODULE_AUTHOR("IBM Corporation");

Balbir Singh.
___
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm


Re: [PATCH 2/6] libnvdimm: Add nd_region_destroy()

2018-03-25 Thread Balbir Singh
On Fri, 23 Mar 2018 19:12:05 +1100
Oliver O'Halloran <ooh...@gmail.com> wrote:

> Currently there's no way to remove a region from and nvdimm_bus without
> tearing down the whole bus. This patch adds an API for removing a single
> region from the bus so that we can implement a sensible unbind operation
> for the of_nd_region platform driver.
> 
> Signed-off-by: Oliver O'Halloran <ooh...@gmail.com>
> ---
>  drivers/nvdimm/region_devs.c | 6 ++
>  include/linux/libnvdimm.h| 1 +
>  2 files changed, 7 insertions(+)
> 
> diff --git a/drivers/nvdimm/region_devs.c b/drivers/nvdimm/region_devs.c
> index 2f1d5771100e..76f46fd1fae0 100644
> --- a/drivers/nvdimm/region_devs.c
> +++ b/drivers/nvdimm/region_devs.c
> @@ -1039,6 +1039,12 @@ struct nd_region *nvdimm_blk_region_create(struct 
> nvdimm_bus *nvdimm_bus,
>  }
>  EXPORT_SYMBOL_GPL(nvdimm_blk_region_create);
>  
> +void nd_region_destroy(struct nd_region *region)
> +{
> + nd_device_unregister(>dev, ND_SYNC);

child_unregister seems to do the same thing, but is expected to be used
as a callback from device_for_each_child()

I'd suggest we merge the two and rename child_unregister and 
nd_region_unregister

Balbir Singh.

___
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm


Re: [PATCH 1/6] libnvdimm: Add of_node to region and bus descriptors

2018-03-25 Thread Balbir Singh
On Fri, 23 Mar 2018 19:12:04 +1100
Oliver O'Halloran <ooh...@gmail.com> wrote:

> We want to be able to cross reference the region and bus devices
> with the device tree node that they were spawned from. libNVDIMM
> handles creating the actual devices for these internally, so we
> need to pass in a pointer to the relevant node in the descriptor.
> 
> Signed-off-by: Oliver O'Halloran <ooh...@gmail.com>
> Acked-by: Dan Williams <dan.j.willi...@intel.com>
> ---

Acked-by: Balbir Singh <bsinghar...@gmail.com>
___
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm


Re: [PATCH 8/9] powerpc/mm: Wire up hpte_removebolted for powernv

2017-04-11 Thread Balbir Singh
On Wed, 2017-04-12 at 03:42 +1000, Oliver O'Halloran wrote:
> From: Rashmica Gupta <rashmic...@gmail.com>
> 
> Adds support for removing bolted (i.e kernel linear mapping) mappings on
> powernv. This is needed to support memory hot unplug operations which
> are required for the teardown of DAX/PMEM devices.
> 
> Cc: Rashmica Gupta <rashmic...@gmail.com>
> Cc: Anton Blanchard <an...@samba.org>
> Signed-off-by: Oliver O'Halloran <ooh...@gmail.com>
> ---
> Could the original author of this add their S-o-b? I pulled it out of
> Rashmica's memtrace patch, but I remember someone saying Anton wrote
> it originally.
> ---
>  arch/powerpc/mm/hash_native_64.c | 31 +++
>  1 file changed, 31 insertions(+)
> 
> diff --git a/arch/powerpc/mm/hash_native_64.c 
> b/arch/powerpc/mm/hash_native_64.c
> index 65bb8f33b399..9ba91d4905a4 100644
> --- a/arch/powerpc/mm/hash_native_64.c
> +++ b/arch/powerpc/mm/hash_native_64.c
> @@ -407,6 +407,36 @@ static void native_hpte_updateboltedpp(unsigned long 
> newpp, unsigned long ea,
>   tlbie(vpn, psize, psize, ssize, 0);
>  }
>  
> +/*
> + * Remove a bolted kernel entry. Memory hotplug uses this.
> + *
> + * No need to lock here because we should be the only user.

As long as this is after the necessary isolation and is called from
arch_remove_memory(), I think we should be fine

> + */
> +static int native_hpte_removebolted(unsigned long ea, int psize, int ssize)
> +{
> + unsigned long vpn;
> + unsigned long vsid;
> + long slot;
> + struct hash_pte *hptep;
> +
> + vsid = get_kernel_vsid(ea, ssize);
> + vpn = hpt_vpn(ea, vsid, ssize);
> +
> + slot = native_hpte_find(vpn, psize, ssize);
> + if (slot == -1)
> + return -ENOENT;

If slot == -1, it means someone else removed the HPTE entry? Are we racing?
I suspect we should never hit this situation during hotunplug, specifically
since this is bolted.

> +
> + hptep = htab_address + slot;
> +
> + /* Invalidate the hpte */
> + hptep->v = 0;

Under DEBUG or otherwise, I would add more checks like

1. was hpte_v & HPTE_V_VALID and BOLTED set? If not, we've already invalidated
that hpte and we can skip the tlbie. Since this was bolted you might be right
that it is always valid and bolted



> +
> +     /* Invalidate the TLB */
> + tlbie(vpn, psize, psize, ssize, 0);

The API also does not clear linear_map_hash_slots[] under DEBUG_PAGEALLOC

> + return 0;
> +}
> +
> +

Balbir Singh.
___
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm


Re: [PATCH 6/9] powerpc, mm: Enable ZONE_DEVICE on powerpc

2017-04-11 Thread Balbir Singh
On Wed, 2017-04-12 at 03:42 +1000, Oliver O'Halloran wrote:
> Flip the switch. Running around and screaming "IT'S ALIVE" is optional,
> but recommended.
> 
> Signed-off-by: Oliver O'Halloran <ooh...@gmail.com>
> ---
>  mm/Kconfig | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/mm/Kconfig b/mm/Kconfig
> index 43d000e44424..d696af58f97f 100644
> --- a/mm/Kconfig
> +++ b/mm/Kconfig
> @@ -724,7 +724,7 @@ config ZONE_DEVICE
>   depends on MEMORY_HOTPLUG
>   depends on MEMORY_HOTREMOVE
>   depends on SPARSEMEM_VMEMMAP
> - depends on X86_64 #arch_add_memory() comprehends device memory
> + depends on (X86_64 || PPC_BOOK3S_64)  #arch_add_memory() comprehends 
> device memory

Reviewed-by: Balbir Singh <bsinghar...@gmail.com>
___
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm


Re: [PATCH 5/9] powerpc/vmemmap: Add altmap support

2017-04-11 Thread Balbir Singh
On Wed, 2017-04-12 at 03:42 +1000, Oliver O'Halloran wrote:
> Adds support to powerpc for the altmap feature of ZONE_DEVICE memory. An
> altmap is a driver provided region that is used to provide the backing
> storage for the struct pages of ZONE_DEVICE memory. In situations where
> large amount of ZONE_DEVICE memory is being added to the system the
> altmap reduces pressure on main system memory by allowing the mm/
> metadata to be stored on the device itself rather in main memory.
> 
> Signed-off-by: Oliver O'Halloran <ooh...@gmail.com>
> ---

Reviewed-by: Balbir Singh <bsinghar...@gmail.com>
___
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm