Re: [driver-core PATCH v5 5/9] driver core: Establish clear order of operations for deferred probe and remove

2018-11-05 Thread kbuild test robot
Hi Alexander,

Thank you for the patch! Perhaps something to improve:

[auto build test WARNING on driver-core/master]

url:
https://github.com/0day-ci/linux/commits/Alexander-Duyck/Add-NUMA-aware-async_schedule-calls/20181106-093800
reproduce: make htmldocs

All warnings (new ones prefixed by >>):

   include/net/mac80211.h:1001: warning: Function parameter or member 'status.is_valid_ack_signal' not described in 'ieee80211_tx_info'
   include/net/mac80211.h:1001: warning: Function parameter or member 'status.status_driver_data' not described in 'ieee80211_tx_info'
   include/net/mac80211.h:1001: warning: Function parameter or member 'driver_rates' not described in 'ieee80211_tx_info'
   include/net/mac80211.h:1001: warning: Function parameter or member 'pad' not described in 'ieee80211_tx_info'
   include/net/mac80211.h:1001: warning: Function parameter or member 'rate_driver_data' not described in 'ieee80211_tx_info'
   include/net/mac80211.h:477: warning: cannot understand function prototype: 'struct ieee80211_ftm_responder_params '
   [the above warning is repeated verbatim throughout the rest of the report, which is truncated in this archive]

fsdax memory error handling regression

2018-11-05 Thread Williams, Dan J
Hi Willy,

I'm seeing the following warning with v4.20-rc1 and the "dax.sh" test
from the ndctl repository:

[   69.962873] EXT4-fs (pmem0): DAX enabled. Warning: EXPERIMENTAL, use at your 
own risk
[   69.969522] EXT4-fs (pmem0): mounted filesystem with ordered data mode. 
Opts: dax
[   70.028571] Injecting memory failure for pfn 0x208900 at process virtual 
address 0x7efe87b0
[   70.032384] Memory failure: 0x208900: Killing dax-pmd:7066 due to hardware 
memory corruption
[   70.034420] Memory failure: 0x208900: recovery action for dax page: Recovered
[   70.038878] WARNING: CPU: 37 PID: 7066 at fs/dax.c:464 
dax_insert_entry+0x30b/0x330
[   70.040675] Modules linked in: ebtable_nat(E) ebtable_broute(E) bridge(E) 
stp(E) llc(E) ip6table_mangle(E) ip6table_raw(E) ip6table_security(E) 
iptable_mangle(E) iptable_raw(E) iptable_security(E) nf_conntrack(E) 
nf_defrag_ipv6(E) nf_defrag_ipv4(E) ebtable_filter(E) ebtables(E) 
ip6table_filter(E) ip6_tables(E) crct10dif_pclmul(E) crc32_pclmul(E) 
dax_pmem(OE) crc32c_intel(E) device_dax(OE) ghash_clmulni_intel(E) nd_pmem(OE) 
nd_btt(OE) serio_raw(E) nd_e820(OE) nfit(OE) libnvdimm(OE) nfit_test_iomap(OE)
[   70.049936] CPU: 37 PID: 7066 Comm: dax-pmd Tainted: G   OE 
4.19.0-rc5+ #2589
[   70.051726] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
rel-1.11.1-0-g0551a4be2c-prebuilt.qemu-project.org 04/01/2014
[   70.055215] RIP: 0010:dax_insert_entry+0x30b/0x330
[   70.056769] Code: 84 b7 fe ff ff 48 81 e6 00 00 e0 ff e9 b2 fe ff ff 48 8b 
3c 24 48 89 ee 31 d2 e8 10 eb ff ff 49 8b 7d 00 31 f6 e9 99 fe ff ff <0f> 0b e9 
f8 fe ff ff 0f 0b e9 e2 fd ff ff e8 82 f1 f4 ff e9 9c fe
[   70.062086] RSP: :c900086bfb20 EFLAGS: 00010082
[   70.063726] RAX:  RBX:  RCX: ea000822
[   70.065755] RDX:  RSI: 00208800 RDI: 00208800
[   70.067784] RBP: 880327870bb0 R08: 00208801 R09: 00208a00
[   70.069813] R10: 00208801 R11: 0001 R12: 880327870bb8
[   70.071837] R13:  R14: 04110003 R15: 0009
[   70.073867] FS:  7efe8859d540() GS:88033ea8() 
knlGS:
[   70.076547] CS:  0010 DS:  ES:  CR0: 80050033
[   70.078294] CR2: 7efe87a0 CR3: 000334564003 CR4: 00160ee0
[   70.080326] Call Trace:
[   70.081404]  ? dax_iomap_pfn+0xb4/0x100
[   70.082770]  dax_iomap_pte_fault+0x648/0xd60
[   70.084222]  dax_iomap_fault+0x230/0xba0
[   70.085596]  ? lock_acquire+0x9e/0x1a0
[   70.086940]  ? ext4_dax_huge_fault+0x5e/0x200
[   70.088406]  ext4_dax_huge_fault+0x78/0x200
[   70.089840]  ? up_read+0x1c/0x70
[   70.091071]  __do_fault+0x1f/0x136
[   70.092344]  __handle_mm_fault+0xd2b/0x11c0
[   70.093790]  handle_mm_fault+0x198/0x3a0
[   70.095166]  __do_page_fault+0x279/0x510
[   70.096546]  do_page_fault+0x32/0x200
[   70.097884]  ? async_page_fault+0x8/0x30
[   70.099256]  async_page_fault+0x1e/0x30

I tried to get this test going on -next before the merge window, but
-next was not bootable for me. Bisection points to:

9f32d221301c dax: Convert dax_lock_mapping_entry to XArray

At first glance I think we need the old "always retry if we slept"
behavior. Otherwise this failure seems similar to the issue fixed by
Ross' change to always retry on any potential collision:

b1f382178d15 ext4: close race between direct IO and ext4_break_layouts()

I'll take a closer look tomorrow to see if that guess is plausible.
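
For reference, the general shape of the "always retry if we slept" behavior
referred to above, as an illustrative sketch only; the helper names here are
hypothetical and this is not the actual fs/dax.c code:

	for (;;) {
		xas_lock_irq(&xas);
		entry = xas_load(&xas);
		if (!entry_is_locked(entry))	/* hypothetical predicate */
			break;
		/*
		 * Waiting drops xa_lock, so anything learned about the entry
		 * before sleeping is stale on wakeup: retry unconditionally
		 * rather than assuming the old entry is still valid.
		 */
		wait_for_entry_unlocked(&xas, entry);	/* hypothetical, drops xa_lock */
	}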


Re: [driver-core PATCH v5 4/9] driver core: Move async_synchronize_full call

2018-11-05 Thread Bart Van Assche
On Mon, 2018-11-05 at 13:11 -0800, Alexander Duyck wrote:
> This patch moves the async_synchronize_full call out of
> __device_release_driver and into driver_detach.
> 
> The idea behind this is that the async_synchronize_full call will only
> guarantee that any existing async operations are flushed. This doesn't do
> anything to guarantee that a hotplug event that may occur while we are
> doing the release of the driver will not be asynchronously scheduled.
> 
> By moving this into the driver_detach path we can avoid potential deadlocks
> as we aren't holding the device lock at this point and we should not have
> the driver we want to flush loaded so the flush will take care of any
> asynchronous events the driver we are detaching might have scheduled.
> 
> Signed-off-by: Alexander Duyck 
> ---
>  drivers/base/dd.c |6 +++---
>  1 file changed, 3 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/base/dd.c b/drivers/base/dd.c
> index 76c40fe69463..e74cefeb5b69 100644
> --- a/drivers/base/dd.c
> +++ b/drivers/base/dd.c
> @@ -975,9 +975,6 @@ static void __device_release_driver(struct device *dev, 
> struct device *parent)
>  
>   drv = dev->driver;
>   if (drv) {
> - if (driver_allows_async_probing(drv))
> - async_synchronize_full();
> -
>   while (device_links_busy(dev)) {
>   __device_driver_unlock(dev, parent);
>  
> @@ -1087,6 +1084,9 @@ void driver_detach(struct device_driver *drv)
>   struct device_private *dev_prv;
>   struct device *dev;
>  
> + if (driver_allows_async_probing(drv))
> + async_synchronize_full();
> +
>   for (;;) {
>   spin_lock(&drv->p->klist_devices.k_lock);
>   if (list_empty(&drv->p->klist_devices.k_list)) {

Have you considered moving that async_synchronize_full() call into
bus_remove_driver()? Verifying the correctness of this patch requires
checking whether the async_synchronize_full() comes after the
klist_remove(&drv->p->knode_bus) call. That verification is easier when
the async_synchronize_full() call occurs in bus_remove_driver() instead
of in driver_detach().
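
For illustration, roughly what that placement would look like; this is only a
sketch, with the surrounding body approximated from the current
bus_remove_driver(), not taken from the posted patch:

	void bus_remove_driver(struct device_driver *drv)
	{
		if (!drv->bus)
			return;

		/* ... sysfs attribute and bind-file cleanup ... */

		klist_remove(&drv->p->knode_bus);

		/*
		 * The flush now trivially comes after klist_remove(), so no
		 * new device can find this driver on the bus list and queue
		 * further asynchronous work against it before the flush.
		 */
		if (driver_allows_async_probing(drv))
			async_synchronize_full();

		pr_debug("bus: '%s': remove driver %s\n", drv->bus->name, drv->name);
		driver_detach(drv);
		module_remove_driver(drv);
		kobject_put(&drv->p->kobj);
		bus_put(drv->bus);
	}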

Thanks,

Bart.


Re: [driver-core PATCH v5 0/9] Add NUMA aware async_schedule calls

2018-11-05 Thread Bart Van Assche
On Mon, 2018-11-05 at 13:11 -0800, Alexander Duyck wrote:
> This patch set provides functionality that will help to improve the
> locality of the async_schedule calls used to provide deferred
> initialization.

Hi Alexander,

Is this patch series perhaps available in a public git tree? That would make
it easier for me to test my sd changes on top of your patches than having to
apply this patch series myself on top of kernel v4.20-rc1.

Thanks,

Bart.


Re: [driver-core PATCH v5 1/9] workqueue: Provide queue_work_node to queue work near a given NUMA node

2018-11-05 Thread Bart Van Assche
On Mon, 2018-11-05 at 13:11 -0800, Alexander Duyck wrote:
> +/**
> + * workqueue_select_cpu_near - Select a CPU based on NUMA node
> + * @node: NUMA node ID that we want to bind a CPU from
  
  select?
> + /* If CPU is valid return that, otherwise just defer */
> + return (cpu < nr_cpu_ids) ? cpu : WORK_CPU_UNBOUND;

Please leave out the superfluous parentheses if this patch series has to be
reposted.
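
For context, a minimal sketch of how a caller might use the new helper, with a
hypothetical driver structure and the signature as proposed in this patch:

	struct my_dev {			/* hypothetical driver structure */
		struct device dev;
		struct work_struct setup_work;
	};

	/* queue deferred setup work on a CPU near the device's NUMA node */
	static void my_driver_kick_setup(struct my_dev *md)
	{
		int node = dev_to_node(&md->dev);

		/* returns false if the work item was already pending */
		if (!queue_work_node(node, system_unbound_wq, &md->setup_work))
			dev_dbg(&md->dev, "setup work already queued\n");
	}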

Anyway:

Reviewed-by: Bart Van Assche 




(桓扬) Why do our sales results always fall short?

2018-11-05 Thread 桓扬
Hello, linux-nvdimm!
Detailed course materials:
please see the attachment for the contents.
8:26:42


扶召 | Want to become your leader's right-hand assistant, but don't know how to excel at it?

2018-11-05 Thread 扶召
linux-nvdimm@lists.01.org
Please find the contents in the attachment.


[mm PATCH v5 7/7] mm: Use common iterator for deferred_init_pages and deferred_free_pages

2018-11-05 Thread Alexander Duyck
This patch creates a common iterator to be used by both deferred_init_pages
and deferred_free_pages. By doing this we can cut down a bit on code
overhead as they will likely both be inlined into the same function anyway.

This new approach allows deferred_init_pages to make use of
__init_pageblock. By doing this we can cut down on the code size by sharing
code between both the hotplug and deferred memory init code paths.

An additional benefit to this approach is that we improve the cache locality
of the memory init as we can focus on the memory areas related to
identifying if a given PFN is valid and keep that warm in the cache until
we transition to a region of a different type. So we will stream through a
chunk of valid blocks before we turn to initializing page structs.

On my x86_64 test system with 384GB of memory per node I saw a reduction in
initialization time from 1.38s to 1.06s as a result of this patch.

Signed-off-by: Alexander Duyck 
---
 mm/page_alloc.c |  134 +++
 1 file changed, 65 insertions(+), 69 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 9eb993a9be99..521b94eb02a0 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1484,32 +1484,6 @@ void clear_zone_contiguous(struct zone *zone)
 }
 
 #ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT
-static void __init deferred_free_range(unsigned long pfn,
-  unsigned long nr_pages)
-{
-   struct page *page;
-   unsigned long i;
-
-   if (!nr_pages)
-   return;
-
-   page = pfn_to_page(pfn);
-
-   /* Free a large naturally-aligned chunk if possible */
-   if (nr_pages == pageblock_nr_pages &&
-   (pfn & (pageblock_nr_pages - 1)) == 0) {
-   set_pageblock_migratetype(page, MIGRATE_MOVABLE);
-   __free_pages_core(page, pageblock_order);
-   return;
-   }
-
-   for (i = 0; i < nr_pages; i++, page++, pfn++) {
-   if ((pfn & (pageblock_nr_pages - 1)) == 0)
-   set_pageblock_migratetype(page, MIGRATE_MOVABLE);
-   __free_pages_core(page, 0);
-   }
-}
-
 /* Completion tracking for deferred_init_memmap() threads */
 static atomic_t pgdat_init_n_undone __initdata;
 static __initdata DECLARE_COMPLETION(pgdat_init_all_done_comp);
@@ -1521,48 +1495,77 @@ static inline void __init 
pgdat_init_report_one_done(void)
 }
 
 /*
- * Returns true if page needs to be initialized or freed to buddy allocator.
+ * Returns count if page range needs to be initialized or freed
  *
- * First we check if pfn is valid on architectures where it is possible to have
- * holes within pageblock_nr_pages. On systems where it is not possible, this
- * function is optimized out.
+ * First, we check if a current large page is valid by only checking the
+ * validity of the head pfn.
  *
- * Then, we check if a current large page is valid by only checking the 
validity
- * of the head pfn.
+ * Then we check if the contiguous pfns are valid on architectures where it
+ * is possible to have holes within pageblock_nr_pages. On systems where it
+ * is not possible, this function is optimized out.
  */
-static inline bool __init deferred_pfn_valid(unsigned long pfn)
+static unsigned long __next_pfn_valid_range(unsigned long *i,
+   unsigned long end_pfn)
 {
-   if (!pfn_valid_within(pfn))
-   return false;
-   if (!(pfn & (pageblock_nr_pages - 1)) && !pfn_valid(pfn))
-   return false;
-   return true;
+   unsigned long pfn = *i;
+   unsigned long count;
+
+   while (pfn < end_pfn) {
+   unsigned long t = ALIGN(pfn + 1, pageblock_nr_pages);
+   unsigned long pageblock_pfn = min(t, end_pfn);
+
+#ifndef CONFIG_HOLES_IN_ZONE
+   count = pageblock_pfn - pfn;
+   pfn = pageblock_pfn;
+   if (!pfn_valid(pfn))
+   continue;
+#else
+   for (count = 0; pfn < pageblock_pfn; pfn++) {
+   if (pfn_valid_within(pfn)) {
+   count++;
+   continue;
+   }
+
+   if (count)
+   break;
+   }
+
+   if (!count)
+   continue;
+#endif
+   *i = pfn;
+   return count;
+   }
+
+   return 0;
 }
 
+#define for_each_deferred_pfn_valid_range(i, start_pfn, end_pfn, pfn, count) \
+   for (i = (start_pfn),\
+count = __next_pfn_valid_range(&i, (end_pfn));  \
+count && ({ pfn = i - count; 1; }); \
+count = __next_pfn_valid_range(&i, (end_pfn)))
 /*
  * Free pages to buddy allocator. Try to free aligned pages in
  * pageblock_nr_pages sizes.
  */
-static void __init 

[mm PATCH v5 1/7] mm: Use mm_zero_struct_page from SPARC on all 64b architectures

2018-11-05 Thread Alexander Duyck
This change makes it so that we use the same approach that was already in
use on Sparc on all the architectures that support a 64b long.

This is mostly motivated by the fact that 7 to 10 store/move instructions
are likely always going to be faster than having to call into a function
that is not specialized for handling page init.

An added advantage to doing it this way is that the compiler can get away
with combining writes in the __init_single_page call. As a result the
memset call will be reduced to only about 4 write operations, or at least
that is what I am seeing with GCC 6.2, as the flags, LRU pointers, and
count/mapcount seem to be cancelling out at least 4 of the 8 assignments on
my system.

One change I had to make to the function was to reduce the minimum struct
page size to 56 to support some powerpc64 configurations.

This change should introduce no change on SPARC since it already had this
code. In the case of x86_64 I saw a reduction from 3.75s to 2.80s when
initializing 384GB of RAM per node. Pavel Tatashin tested on a system with
Broadcom's Stingray CPU and 48GB of RAM and found that __init_single_page()
takes 19.30ns / 64-byte struct page before this patch and with this patch
it takes 17.33ns / 64-byte struct page. Mike Rapoport ran a similar test on
an OpenPower (S812LC 8348-21C) with a Power8 processor and 128GB of RAM. His
results per 64-byte struct page were 4.68ns before, and 4.59ns after this
patch.

Reviewed-by: Pavel Tatashin 
Acked-by: Michal Hocko 
Signed-off-by: Alexander Duyck 
---
 arch/sparc/include/asm/pgtable_64.h |   30 --
 include/linux/mm.h  |   41 ---
 2 files changed, 38 insertions(+), 33 deletions(-)

diff --git a/arch/sparc/include/asm/pgtable_64.h 
b/arch/sparc/include/asm/pgtable_64.h
index 1393a8ac596b..22500c3be7a9 100644
--- a/arch/sparc/include/asm/pgtable_64.h
+++ b/arch/sparc/include/asm/pgtable_64.h
@@ -231,36 +231,6 @@ extern unsigned long _PAGE_ALL_SZ_BITS;
 extern struct page *mem_map_zero;
 #define ZERO_PAGE(vaddr)   (mem_map_zero)
 
-/* This macro must be updated when the size of struct page grows above 80
- * or reduces below 64.
- * The idea that compiler optimizes out switch() statement, and only
- * leaves clrx instructions
- */
-#definemm_zero_struct_page(pp) do {
\
-   unsigned long *_pp = (void *)(pp);  \
-   \
-/* Check that struct page is either 64, 72, or 80 bytes */ \
-   BUILD_BUG_ON(sizeof(struct page) & 7);  \
-   BUILD_BUG_ON(sizeof(struct page) < 64); \
-   BUILD_BUG_ON(sizeof(struct page) > 80); \
-   \
-   switch (sizeof(struct page)) {  \
-   case 80:\
-   _pp[9] = 0; /* fallthrough */   \
-   case 72:\
-   _pp[8] = 0; /* fallthrough */   \
-   default:\
-   _pp[7] = 0; \
-   _pp[6] = 0; \
-   _pp[5] = 0; \
-   _pp[4] = 0; \
-   _pp[3] = 0; \
-   _pp[2] = 0; \
-   _pp[1] = 0; \
-   _pp[0] = 0; \
-   }   \
-} while (0)
-
 /* PFNs are real physical page numbers.  However, mem_map only begins to record
  * per-page information starting at pfn_base.  This is to handle systems where
  * the first physical page in the machine is at some huge physical address,
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 5411de93a363..288c407c08fc 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -98,10 +98,45 @@ extern int mmap_rnd_compat_bits __read_mostly;
 
 /*
  * On some architectures it is expensive to call memset() for small sizes.
- * Those architectures should provide their own implementation of "struct page"
- * zeroing by defining this macro in .
+ * If an architecture decides to implement their own version of
+ * mm_zero_struct_page they should wrap the defines below in a #ifndef and
+ * define their own version of this macro in 
  */
-#ifndef mm_zero_struct_page
+#if BITS_PER_LONG == 64
+/* This function must be updated when the size of struct page grows above 80
+ * 
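
The include/linux/mm.h hunk above is truncated in this archive. The following
is a sketch of the generic 64-bit helper it describes, reconstructed from the
description; the code as actually posted may differ in detail:

	#if BITS_PER_LONG == 64
	/*
	 * The switch below is expected to be optimized away by the compiler,
	 * leaving only plain store instructions for the struct page words.
	 */
	#define mm_zero_struct_page(pp) __mm_zero_struct_page(pp)
	static inline void __mm_zero_struct_page(struct page *page)
	{
		unsigned long *_pp = (unsigned long *)page;

		/* Check that struct page is either 56, 64, 72, or 80 bytes */
		BUILD_BUG_ON(sizeof(struct page) & 7);
		BUILD_BUG_ON(sizeof(struct page) < 56);
		BUILD_BUG_ON(sizeof(struct page) > 80);

		switch (sizeof(struct page)) {
		case 80:
			_pp[9] = 0;	/* fallthrough */
		case 72:
			_pp[8] = 0;	/* fallthrough */
		case 64:
			_pp[7] = 0;	/* fallthrough */
		case 56:
			_pp[6] = 0;
			_pp[5] = 0;
			_pp[4] = 0;
			_pp[3] = 0;
			_pp[2] = 0;
			_pp[1] = 0;
			_pp[0] = 0;
		}
	}
	#else
	#define mm_zero_struct_page(pp) ((void)memset((pp), 0, sizeof(struct page)))
	#endif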

[mm PATCH v5 5/7] mm: Move hot-plug specific memory init into separate functions and optimize

2018-11-05 Thread Alexander Duyck
This patch goes through memmap_init_zone and memmap_init_zone_device and
combines the bits that are related to hotplug into a single function
called __memmap_init_hotplug.

I also took the opportunity to integrate __init_single_page's functionality
into this function. In doing so I can get rid of some of the redundancy
such as the LRU pointers versus the pgmap.

Signed-off-by: Alexander Duyck 
---
 mm/page_alloc.c |  214 +--
 1 file changed, 144 insertions(+), 70 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 3466a01ed90a..dbe00c1a0e23 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1192,6 +1192,92 @@ static void __meminit __init_single_page(struct page 
*page, unsigned long pfn,
 #endif
 }
 
+static void __meminit __init_pageblock(unsigned long start_pfn,
+  unsigned long nr_pages,
+  unsigned long zone, int nid,
+  struct dev_pagemap *pgmap)
+{
+   unsigned long nr_pgmask = pageblock_nr_pages - 1;
+   struct page *start_page = pfn_to_page(start_pfn);
+   unsigned long pfn = start_pfn + nr_pages - 1;
+#ifdef WANT_PAGE_VIRTUAL
+   bool is_highmem = is_highmem_idx(zone);
+#endif
+   struct page *page;
+
+   /*
+* Enforce the following requirements:
+* size > 0
+* size < pageblock_nr_pages
+* start_pfn -> pfn does not cross pageblock_nr_pages boundary
+*/
+   VM_BUG_ON(((start_pfn ^ pfn) | (nr_pages - 1)) > nr_pgmask);
+
+   /*
+* Work from highest page to lowest, this way we will still be
+* warm in the cache when we call set_pageblock_migratetype
+* below.
+*
+* The loop is based around the page pointer as the main index
+* instead of the pfn because pfn is not used inside the loop if
+* the section number is not in page flags and WANT_PAGE_VIRTUAL
+* is not defined.
+*/
+   for (page = start_page + nr_pages; page-- != start_page; pfn--) {
+   mm_zero_struct_page(page);
+
+   /*
+* We use the start_pfn instead of pfn in the set_page_links
+* call because of the fact that the pfn number is used to
+* get the section_nr and this function should not be
+* spanning more than a single section.
+*/
+   set_page_links(page, zone, nid, start_pfn);
+   init_page_count(page);
+   page_mapcount_reset(page);
+   page_cpupid_reset_last(page);
+
+   /*
+* We can use the non-atomic __set_bit operation for setting
+* the flag as we are still initializing the pages.
+*/
+   __SetPageReserved(page);
+
+   /*
+* ZONE_DEVICE pages union ->lru with a ->pgmap back
+* pointer and hmm_data.  It is a bug if a ZONE_DEVICE
+* page is ever freed or placed on a driver-private list.
+*/
+   page->pgmap = pgmap;
+   if (!pgmap)
+   INIT_LIST_HEAD(&page->lru);
+
+#ifdef WANT_PAGE_VIRTUAL
+   /* The shift won't overflow because ZONE_NORMAL is below 4G. */
+   if (!is_highmem)
+   set_page_address(page, __va(pfn << PAGE_SHIFT));
+#endif
+   }
+
+   /*
+* Mark the block movable so that blocks are reserved for
+* movable at startup. This will force kernel allocations
+* to reserve their blocks rather than leaking throughout
+* the address space during boot when many long-lived
+* kernel allocations are made.
+*
+* bitmap is created for zone's valid pfn range. but memmap
+* can be created for invalid pages (for alignment)
+* check here not to call set_pageblock_migratetype() against
+* pfn out of zone.
+*
+* Please note that MEMMAP_HOTPLUG path doesn't clear memmap
+* because this is done early in sparse_add_one_section
+*/
+   if (!(start_pfn & nr_pgmask))
+   set_pageblock_migratetype(start_page, MIGRATE_MOVABLE);
+}
+
 #ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT
 static void __meminit init_reserved_page(unsigned long pfn)
 {
@@ -5513,6 +5599,25 @@ overlap_memmap_init(unsigned long zone, unsigned long 
*pfn)
return false;
 }
 
+static void __meminit __memmap_init_hotplug(unsigned long size, int nid,
+   unsigned long zone,
+   unsigned long start_pfn,
+   struct dev_pagemap *pgmap)
+{
+   unsigned long pfn = start_pfn + size;
+
+   while (pfn != start_pfn) {
+   unsigned long stride = pfn;
+
+   pfn = max(ALIGN_DOWN(pfn - 1, 

[mm PATCH v5 6/7] mm: Add reserved flag setting to set_page_links

2018-11-05 Thread Alexander Duyck
This patch modifies the set_page_links function to include the setting of
the reserved flag via a simple AND and OR operation. The motivation for
this is the fact that the existing __set_bit call still seems to have
effects on performance as replacing the call with the AND and OR can reduce
initialization time.

Looking over the assembly code before and after the change the main
difference between the two is that the reserved bit is stored in a value
that is generated outside of the main initialization loop and is then
written with the other flags field values in one write to the page->flags
value. Previously the generated value was written and then a btsq
instruction was issued.

On my x86_64 test system with 3TB of persistent memory per node I saw the
persistent memory initialization time on average drop from 23.49s to
19.12s per node.

Signed-off-by: Alexander Duyck 
---
 include/linux/mm.h |9 -
 mm/page_alloc.c|   29 +++--
 2 files changed, 27 insertions(+), 11 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 288c407c08fc..de6535a98e45 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1171,11 +1171,18 @@ static inline void set_page_node(struct page *page, 
unsigned long node)
page->flags |= (node & NODES_MASK) << NODES_PGSHIFT;
 }
 
+static inline void set_page_reserved(struct page *page, bool reserved)
+{
+   page->flags &= ~(1ul << PG_reserved);
+   page->flags |= (unsigned long)(!!reserved) << PG_reserved;
+}
+
 static inline void set_page_links(struct page *page, enum zone_type zone,
-   unsigned long node, unsigned long pfn)
+   unsigned long node, unsigned long pfn, bool reserved)
 {
set_page_zone(page, zone);
set_page_node(page, node);
+   set_page_reserved(page, reserved);
 #ifdef SECTION_IN_PAGE_FLAGS
set_page_section(page, pfn_to_section_nr(pfn));
 #endif
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index dbe00c1a0e23..9eb993a9be99 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1179,7 +1179,7 @@ static void __meminit __init_single_page(struct page 
*page, unsigned long pfn,
unsigned long zone, int nid)
 {
mm_zero_struct_page(page);
-   set_page_links(page, zone, nid, pfn);
+   set_page_links(page, zone, nid, pfn, false);
init_page_count(page);
page_mapcount_reset(page);
page_cpupid_reset_last(page);
@@ -1195,7 +1195,8 @@ static void __meminit __init_single_page(struct page 
*page, unsigned long pfn,
 static void __meminit __init_pageblock(unsigned long start_pfn,
   unsigned long nr_pages,
   unsigned long zone, int nid,
-  struct dev_pagemap *pgmap)
+  struct dev_pagemap *pgmap,
+  bool is_reserved)
 {
unsigned long nr_pgmask = pageblock_nr_pages - 1;
struct page *start_page = pfn_to_page(start_pfn);
@@ -1231,18 +1232,15 @@ static void __meminit __init_pageblock(unsigned long 
start_pfn,
 * call because of the fact that the pfn number is used to
 * get the section_nr and this function should not be
 * spanning more than a single section.
+*
+* We can use a non-atomic operation for setting the
+* PG_reserved flag as we are still initializing the pages.
 */
-   set_page_links(page, zone, nid, start_pfn);
+   set_page_links(page, zone, nid, start_pfn, is_reserved);
init_page_count(page);
page_mapcount_reset(page);
page_cpupid_reset_last(page);
 
-   /*
-* We can use the non-atomic __set_bit operation for setting
-* the flag as we are still initializing the pages.
-*/
-   __SetPageReserved(page);
-
/*
 * ZONE_DEVICE pages union ->lru with a ->pgmap back
 * pointer and hmm_data.  It is a bug if a ZONE_DEVICE
@@ -5612,7 +5610,18 @@ static void __meminit __memmap_init_hotplug(unsigned 
long size, int nid,
pfn = max(ALIGN_DOWN(pfn - 1, pageblock_nr_pages), start_pfn);
stride -= pfn;
 
-   __init_pageblock(pfn, stride, zone, nid, pgmap);
+   /*
+* The last argument of __init_pageblock is a boolean
+* value indicating if the page will be marked as reserved.
+*
+* Mark page reserved as it will need to wait for onlining
+* phase for it to be fully associated with a zone.
+*
+* Under certain circumstances ZONE_DEVICE pages may not
+* need to be marked as reserved, however there is still
+* code that is 

[mm PATCH v5 0/7] Deferred page init improvements

2018-11-05 Thread Alexander Duyck
This patchset is essentially a refactor of the page initialization logic
that is meant to provide for better code reuse while providing a
significant improvement in deferred page initialization performance.

In my testing on an x86_64 system with 384GB of RAM and 3TB of persistent
memory per node I have seen the following. In the case of regular memory
initialization the deferred init time was decreased from 3.75s to 1.06s on
average. For the persistent memory the initialization time dropped from
24.17s to 19.12s on average. This amounts to a 253% improvement for the
deferred memory initialization performance, and a 26% improvement in the
persistent memory initialization performance.

I have called out the improvement observed with each patch.

v1->v2:
Fixed build issue on PowerPC due to page struct size being 56
Added new patch that removed __SetPageReserved call for hotplug
v2->v3:
Rebased on latest linux-next
Removed patch that had removed __SetPageReserved call from init
Added patch that folded __SetPageReserved into set_page_links
Tweaked __init_pageblock to use start_pfn to get section_nr instead of pfn
v3->v4:
Updated patch description and comments for mm_zero_struct_page patch
Replaced "default" with "case 64"
Removed #ifndef mm_zero_struct_page
Fixed typo in comment that omitted "_from" in kerneldoc for iterator
Added Reviewed-by for patches reviewed by Pavel
Added Acked-by from Michal Hocko
Added deferred init times for patches that affect init performance
Swapped patches 5 & 6, pulled some code/comments from 4 into 5
v4->v5:
Updated Acks/Reviewed-by
Rebased on latest linux-next
Split core bits of zone iterator patch from MAX_ORDER_NR_PAGES init

---

Alexander Duyck (7):
  mm: Use mm_zero_struct_page from SPARC on all 64b architectures
  mm: Drop meminit_pfn_in_nid as it is redundant
  mm: Implement new zone specific memblock iterator
  mm: Initialize MAX_ORDER_NR_PAGES at a time instead of doing larger 
sections
  mm: Move hot-plug specific memory init into separate functions and 
optimize
  mm: Add reserved flag setting to set_page_links
  mm: Use common iterator for deferred_init_pages and deferred_free_pages


 arch/sparc/include/asm/pgtable_64.h |   30 --
 include/linux/memblock.h|   38 ++
 include/linux/mm.h  |   50 +++
 mm/memblock.c   |   63 
 mm/page_alloc.c |  567 +--
 5 files changed, 492 insertions(+), 256 deletions(-)

--


[mm PATCH v5 2/7] mm: Drop meminit_pfn_in_nid as it is redundant

2018-11-05 Thread Alexander Duyck
As best as I can tell the meminit_pfn_in_nid call is completely redundant.
The deferred memory initialization is already making use of
for_each_free_mem_range which in turn will call into __next_mem_range which
will only return a memory range if it matches the node ID provided assuming
it is not NUMA_NO_NODE.

I am operating on the assumption that there are no zones or pgdata_t
structures that have a NUMA node of NUMA_NO_NODE associated with them. If
that is the case then __next_mem_range will never return a memory range
that doesn't match the zone's node ID and as such the check is redundant.

So one piece I would like to verify on this is if this works for ia64.
Technically it was using a different approach to get the node ID, but it
seems to have the node ID also encoded into the memblock. So I am
assuming this is okay, but would like to get confirmation on that.

On my x86_64 test system with 384GB of memory per node I saw a reduction in
initialization time from 2.80s to 1.85s as a result of this patch.

Reviewed-by: Pavel Tatashin 
Acked-by: Michal Hocko 
Signed-off-by: Alexander Duyck 
---
 mm/page_alloc.c |   51 ++-
 1 file changed, 14 insertions(+), 37 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index ae31839874b8..be1197c120a8 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1301,36 +1301,22 @@ int __meminit early_pfn_to_nid(unsigned long pfn)
 #endif
 
 #ifdef CONFIG_NODES_SPAN_OTHER_NODES
-static inline bool __meminit __maybe_unused
-meminit_pfn_in_nid(unsigned long pfn, int node,
-  struct mminit_pfnnid_cache *state)
+/* Only safe to use early in boot when initialisation is single-threaded */
+static inline bool __meminit early_pfn_in_nid(unsigned long pfn, int node)
 {
int nid;
 
-   nid = __early_pfn_to_nid(pfn, state);
+   nid = __early_pfn_to_nid(pfn, &early_pfnnid_cache);
if (nid >= 0 && nid != node)
return false;
return true;
 }
 
-/* Only safe to use early in boot when initialisation is single-threaded */
-static inline bool __meminit early_pfn_in_nid(unsigned long pfn, int node)
-{
-   return meminit_pfn_in_nid(pfn, node, &early_pfnnid_cache);
-}
-
 #else
-
 static inline bool __meminit early_pfn_in_nid(unsigned long pfn, int node)
 {
return true;
 }
-static inline bool __meminit  __maybe_unused
-meminit_pfn_in_nid(unsigned long pfn, int node,
-  struct mminit_pfnnid_cache *state)
-{
-   return true;
-}
 #endif
 
 
@@ -1459,21 +1445,13 @@ static inline void __init 
pgdat_init_report_one_done(void)
  *
  * Then, we check if a current large page is valid by only checking the 
validity
  * of the head pfn.
- *
- * Finally, meminit_pfn_in_nid is checked on systems where pfns can interleave
- * within a node: a pfn is between start and end of a node, but does not belong
- * to this memory node.
  */
-static inline bool __init
-deferred_pfn_valid(int nid, unsigned long pfn,
-  struct mminit_pfnnid_cache *nid_init_state)
+static inline bool __init deferred_pfn_valid(unsigned long pfn)
 {
if (!pfn_valid_within(pfn))
return false;
if (!(pfn & (pageblock_nr_pages - 1)) && !pfn_valid(pfn))
return false;
-   if (!meminit_pfn_in_nid(pfn, nid, nid_init_state))
-   return false;
return true;
 }
 
@@ -1481,15 +1459,14 @@ deferred_pfn_valid(int nid, unsigned long pfn,
  * Free pages to buddy allocator. Try to free aligned pages in
  * pageblock_nr_pages sizes.
  */
-static void __init deferred_free_pages(int nid, int zid, unsigned long pfn,
+static void __init deferred_free_pages(unsigned long pfn,
   unsigned long end_pfn)
 {
-   struct mminit_pfnnid_cache nid_init_state = { };
unsigned long nr_pgmask = pageblock_nr_pages - 1;
unsigned long nr_free = 0;
 
for (; pfn < end_pfn; pfn++) {
-   if (!deferred_pfn_valid(nid, pfn, &nid_init_state)) {
+   if (!deferred_pfn_valid(pfn)) {
deferred_free_range(pfn - nr_free, nr_free);
nr_free = 0;
} else if (!(pfn & nr_pgmask)) {
@@ -1509,17 +1486,18 @@ static void __init deferred_free_pages(int nid, int 
zid, unsigned long pfn,
  * by performing it only once every pageblock_nr_pages.
  * Return number of pages initialized.
  */
-static unsigned long  __init deferred_init_pages(int nid, int zid,
+static unsigned long  __init deferred_init_pages(struct zone *zone,
 unsigned long pfn,
 unsigned long end_pfn)
 {
-   struct mminit_pfnnid_cache nid_init_state = { };
unsigned long nr_pgmask = pageblock_nr_pages - 1;
+   int nid = zone_to_nid(zone);
unsigned long nr_pages = 0;
+   int zid = zone_idx(zone);
struct page *page = NULL;
 
for (; pfn < 

[mm PATCH v5 3/7] mm: Implement new zone specific memblock iterator

2018-11-05 Thread Alexander Duyck
This patch introduces a new iterator for_each_free_mem_pfn_range_in_zone.

This iterator will take care of making sure a given memory range provided
is in fact contained within a zone. It takes care of all the bounds checking
we were doing in deferred_grow_zone, and deferred_init_memmap. In addition
it should help to speed up the search a bit by iterating until the end of a
range is greater than the start of the zone pfn range, and will exit
completely if the start is beyond the end of the zone.

Signed-off-by: Alexander Duyck 
---
 include/linux/memblock.h |   22 
 mm/memblock.c|   63 ++
 mm/page_alloc.c  |   31 +--
 3 files changed, 97 insertions(+), 19 deletions(-)

diff --git a/include/linux/memblock.h b/include/linux/memblock.h
index aee299a6aa76..413623dc96a3 100644
--- a/include/linux/memblock.h
+++ b/include/linux/memblock.h
@@ -248,6 +248,28 @@ void __next_mem_pfn_range(int *idx, int nid, unsigned long 
*out_start_pfn,
 i >= 0; __next_mem_pfn_range(&i, nid, p_start, p_end, p_nid))
 #endif /* CONFIG_HAVE_MEMBLOCK_NODE_MAP */
 
+#ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT
+void __next_mem_pfn_range_in_zone(u64 *idx, struct zone *zone,
+ unsigned long *out_spfn,
+ unsigned long *out_epfn);
+/**
+ * for_each_free_mem_range_in_zone - iterate through zone specific free
+ * memblock areas
+ * @i: u64 used as loop variable
+ * @zone: zone in which all of the memory blocks reside
+ * @p_start: ptr to phys_addr_t for start address of the range, can be %NULL
+ * @p_end: ptr to phys_addr_t for end address of the range, can be %NULL
+ *
+ * Walks over free (memory && !reserved) areas of memblock in a specific
+ * zone. Available as soon as memblock is initialized.
+ */
+#define for_each_free_mem_pfn_range_in_zone(i, zone, p_start, p_end)   \
+   for (i = 0, \
+__next_mem_pfn_range_in_zone(&i, zone, p_start, p_end);\
+i != (u64)ULLONG_MAX;  \
+__next_mem_pfn_range_in_zone(&i, zone, p_start, p_end))
+#endif /* CONFIG_DEFERRED_STRUCT_PAGE_INIT */
+
 /**
  * for_each_free_mem_range - iterate through free memblock areas
  * @i: u64 used as loop variable
diff --git a/mm/memblock.c b/mm/memblock.c
index 7df468c8ebc8..f1d1fbfd1ae7 100644
--- a/mm/memblock.c
+++ b/mm/memblock.c
@@ -1239,6 +1239,69 @@ int __init_memblock memblock_set_node(phys_addr_t base, 
phys_addr_t size,
return 0;
 }
 #endif /* CONFIG_HAVE_MEMBLOCK_NODE_MAP */
+#ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT
+/**
+ * __next_mem_pfn_range_in_zone - iterator for for_each_*_range_in_zone()
+ *
+ * @idx: pointer to u64 loop variable
+ * @zone: zone in which all of the memory blocks reside
+ * @out_start: ptr to ulong for start pfn of the range, can be %NULL
+ * @out_end: ptr to ulong for end pfn of the range, can be %NULL
+ *
+ * This function is meant to be a zone/pfn specific wrapper for the
+ * for_each_mem_range type iterators. Specifically they are used in the
+ * deferred memory init routines and as such we were duplicating much of
+ * this logic throughout the code. So instead of having it in multiple
+ * locations it seemed like it would make more sense to centralize this to
+ * one new iterator that does everything they need.
+ */
+void __init_memblock
+__next_mem_pfn_range_in_zone(u64 *idx, struct zone *zone,
+unsigned long *out_spfn, unsigned long *out_epfn)
+{
+   int zone_nid = zone_to_nid(zone);
+   phys_addr_t spa, epa;
+   int nid;
+
+   __next_mem_range(idx, zone_nid, MEMBLOCK_NONE,
+&memblock.memory, &memblock.reserved,
+&spa, &epa, &nid);
+
+   while (*idx != ULLONG_MAX) {
+   unsigned long epfn = PFN_DOWN(epa);
+   unsigned long spfn = PFN_UP(spa);
+
+   /*
+* Verify the end is at least past the start of the zone and
+* that we have at least one PFN to initialize.
+*/
+   if (zone->zone_start_pfn < epfn && spfn < epfn) {
+   /* if we went too far just stop searching */
+   if (zone_end_pfn(zone) <= spfn)
+   break;
+
+   if (out_spfn)
+   *out_spfn = max(zone->zone_start_pfn, spfn);
+   if (out_epfn)
+   *out_epfn = min(zone_end_pfn(zone), epfn);
+
+   return;
+   }
+
+   __next_mem_range(idx, zone_nid, MEMBLOCK_NONE,
+&memblock.memory, &memblock.reserved,
+&spa, &epa, &nid);
+   }
+
+   /* signal end of iteration */
+   *idx = ULLONG_MAX;
+   if (out_spfn)
+   *out_spfn = ULONG_MAX;
+   if (out_epfn)
+   *out_epfn = 0;
+}
+
+#endif 
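
The mm/page_alloc.c portion of this diff is truncated above. A minimal sketch
of how a caller can walk a zone with the new iterator, illustrative only, with
function names following the deferred-init code in the rest of this series:

	/* illustrative caller, not part of the posted patch */
	unsigned long spfn, epfn, nr_pages = 0;
	u64 i;

	for_each_free_mem_pfn_range_in_zone(i, zone, &spfn, &epfn) {
		/* spfn/epfn come back clamped to the zone's pfn range */
		nr_pages += deferred_init_pages(zone, spfn, epfn);
	}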

[mm PATCH v5 4/7] mm: Initialize MAX_ORDER_NR_PAGES at a time instead of doing larger sections

2018-11-05 Thread Alexander Duyck
This patch adds yet another iterator, for_each_free_mem_range_in_zone_from.
It then uses that iterator to support initializing and freeing pages in groups no
larger than MAX_ORDER_NR_PAGES. By doing this we can greatly improve the
cache locality of the pages while we do several loops over them in the init
and freeing process.

We are able to tighten the loops as a result since we only really need the
checks for first_init_pfn in our first iteration and after that we can
assume that all future values will be greater than this. So I have added a
function called deferred_init_mem_pfn_range_in_zone that primes the
iterators and if it fails we can just exit.

On my x86_64 test system with 384GB of memory per node I saw a reduction in
initialization time from 1.85s to 1.38s as a result of this patch.

Signed-off-by: Alexander Duyck 
---
 include/linux/memblock.h |   16 +
 mm/page_alloc.c  |  162 ++
 2 files changed, 134 insertions(+), 44 deletions(-)

diff --git a/include/linux/memblock.h b/include/linux/memblock.h
index 413623dc96a3..5ba52a7878a0 100644
--- a/include/linux/memblock.h
+++ b/include/linux/memblock.h
@@ -268,6 +268,22 @@ void __next_mem_pfn_range_in_zone(u64 *idx, struct zone 
*zone,
 __next_mem_pfn_range_in_zone(&i, zone, p_start, p_end);\
 i != (u64)ULLONG_MAX;  \
 __next_mem_pfn_range_in_zone(&i, zone, p_start, p_end))
+
+/**
+ * for_each_free_mem_range_in_zone_from - iterate through zone specific
+ * free memblock areas from a given point
+ * @i: u64 used as loop variable
+ * @zone: zone in which all of the memory blocks reside
+ * @p_start: ptr to phys_addr_t for start address of the range, can be %NULL
+ * @p_end: ptr to phys_addr_t for end address of the range, can be %NULL
+ *
+ * Walks over free (memory && !reserved) areas of memblock in a specific
+ * zone, continuing from current position. Available as soon as memblock is
+ * initialized.
+ */
+#define for_each_free_mem_pfn_range_in_zone_from(i, zone, p_start, p_end) \
+   for (; i != (u64)ULLONG_MAX;  \
+__next_mem_pfn_range_in_zone(&i, zone, p_start, p_end))
 #endif /* CONFIG_DEFERRED_STRUCT_PAGE_INIT */
 
 /**
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 5cfd3ebe10d1..3466a01ed90a 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1512,16 +1512,102 @@ static unsigned long  __init 
deferred_init_pages(struct zone *zone,
return (nr_pages);
 }
 
+/*
+ * This function is meant to pre-load the iterator for the zone init.
+ * Specifically it walks through the ranges until we are caught up to the
+ * first_init_pfn value and exits there. If we never encounter the value we
+ * return false indicating there are no valid ranges left.
+ */
+static bool __init
+deferred_init_mem_pfn_range_in_zone(u64 *i, struct zone *zone,
+   unsigned long *spfn, unsigned long *epfn,
+   unsigned long first_init_pfn)
+{
+   u64 j;
+
+   /*
+* Start out by walking through the ranges in this zone that have
+* already been initialized. We don't need to do anything with them
+* so we just need to flush them out of the system.
+*/
+   for_each_free_mem_pfn_range_in_zone(j, zone, spfn, epfn) {
+   if (*epfn <= first_init_pfn)
+   continue;
+   if (*spfn < first_init_pfn)
+   *spfn = first_init_pfn;
+   *i = j;
+   return true;
+   }
+
+   return false;
+}
+
+/*
+ * Initialize and free pages. We do it in two loops: first we initialize
+ * struct page, than free to buddy allocator, because while we are
+ * freeing pages we can access pages that are ahead (computing buddy
+ * page in __free_one_page()).
+ *
+ * In order to try and keep some memory in the cache we have the loop
+ * broken along max page order boundaries. This way we will not cause
+ * any issues with the buddy page computation.
+ */
+static unsigned long __init
+deferred_init_maxorder(u64 *i, struct zone *zone, unsigned long *start_pfn,
+  unsigned long *end_pfn)
+{
+   unsigned long mo_pfn = ALIGN(*start_pfn + 1, MAX_ORDER_NR_PAGES);
+   unsigned long spfn = *start_pfn, epfn = *end_pfn;
+   unsigned long nr_pages = 0;
+   u64 j = *i;
+
+   /* First we loop through and initialize the page values */
+   for_each_free_mem_pfn_range_in_zone_from(j, zone, &spfn, &epfn) {
+   unsigned long t;
+
+   if (mo_pfn <= spfn)
+   break;
+
+   t = min(mo_pfn, epfn);
+   nr_pages += deferred_init_pages(zone, spfn, t);
+
+   if (mo_pfn <= epfn)
+   break;
+   }
+
+   /* Reset values and now loop through freeing pages as needed */
+   j = *i;
+
+   

[driver-core PATCH v5 8/9] PM core: Use new async_schedule_dev command

2018-11-05 Thread Alexander Duyck
This change makes it so that we use the device specific version of the
async_schedule commands to defer various tasks related to power management.
By doing this we should see a slight improvement in performance as any
device that is sensitive to latency/locality in the setup will now be
initializing on the node closest to the device.

Reviewed-by: Rafael J. Wysocki 
Signed-off-by: Alexander Duyck 
---
 drivers/base/power/main.c |   12 ++--
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/drivers/base/power/main.c b/drivers/base/power/main.c
index a690fd400260..ebb8b61b52e9 100644
--- a/drivers/base/power/main.c
+++ b/drivers/base/power/main.c
@@ -726,7 +726,7 @@ void dpm_noirq_resume_devices(pm_message_t state)
reinit_completion(&dev->power.completion);
if (is_async(dev)) {
get_device(dev);
-   async_schedule(async_resume_noirq, dev);
+   async_schedule_dev(async_resume_noirq, dev);
}
}
 
@@ -883,7 +883,7 @@ void dpm_resume_early(pm_message_t state)
reinit_completion(&dev->power.completion);
if (is_async(dev)) {
get_device(dev);
-   async_schedule(async_resume_early, dev);
+   async_schedule_dev(async_resume_early, dev);
}
}
 
@@ -1047,7 +1047,7 @@ void dpm_resume(pm_message_t state)
reinit_completion(&dev->power.completion);
if (is_async(dev)) {
get_device(dev);
-   async_schedule(async_resume, dev);
+   async_schedule_dev(async_resume, dev);
}
}
 
@@ -1366,7 +1366,7 @@ static int device_suspend_noirq(struct device *dev)
 
if (is_async(dev)) {
get_device(dev);
-   async_schedule(async_suspend_noirq, dev);
+   async_schedule_dev(async_suspend_noirq, dev);
return 0;
}
return __device_suspend_noirq(dev, pm_transition, false);
@@ -1569,7 +1569,7 @@ static int device_suspend_late(struct device *dev)
 
if (is_async(dev)) {
get_device(dev);
-   async_schedule(async_suspend_late, dev);
+   async_schedule_dev(async_suspend_late, dev);
return 0;
}
 
@@ -1833,7 +1833,7 @@ static int device_suspend(struct device *dev)
 
if (is_async(dev)) {
get_device(dev);
-   async_schedule(async_suspend, dev);
+   async_schedule_dev(async_suspend, dev);
return 0;
}
 



[driver-core PATCH v5 9/9] libnvdimm: Schedule device registration on node local to the device

2018-11-05 Thread Alexander Duyck
This patch is meant to force the device registration for nvdimm devices to
be closer to the actual device. This is achieved by using either the NUMA
node ID of the region, or of the parent. By doing this we can have
everything above the region based on the region, and everything below the
region based on the nvdimm bus.

By guaranteeing NUMA locality I see an improvement of as high as 25% for
per-node init of a system with 12TB of persistent memory.

Signed-off-by: Alexander Duyck 
---
 drivers/nvdimm/bus.c |   11 ---
 1 file changed, 8 insertions(+), 3 deletions(-)

diff --git a/drivers/nvdimm/bus.c b/drivers/nvdimm/bus.c
index f1fb39921236..b1e193541874 100644
--- a/drivers/nvdimm/bus.c
+++ b/drivers/nvdimm/bus.c
@@ -23,6 +23,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -513,11 +514,15 @@ void __nd_device_register(struct device *dev)
set_dev_node(dev, to_nd_region(dev)->numa_node);
 
dev->bus = &nvdimm_bus_type;
-   if (dev->parent)
+   if (dev->parent) {
get_device(dev->parent);
+   if (dev_to_node(dev) == NUMA_NO_NODE)
+   set_dev_node(dev, dev_to_node(dev->parent));
+   }
get_device(dev);
-   async_schedule_domain(nd_async_device_register, dev,
-   &nd_async_domain);
+
+   async_schedule_dev_domain(nd_async_device_register, dev,
+ &nd_async_domain);
 }
 
 void nd_device_register(struct device *dev)



[driver-core PATCH v5 7/9] driver core: Attach devices on CPU local to device node

2018-11-05 Thread Alexander Duyck
This change makes it so that we call the asynchronous probe routines on a
CPU local to the device node. By doing this we should be able to improve
our initialization time significantly as we can avoid having to access the
device from a remote node which may introduce higher latency.

For example, in the case of initializing memory for NVDIMM this can have a
significant impact as initializing 3TB on a remote node can take up to 39
seconds while initializing it on a local node only takes 23 seconds. It is
situations like this where we will see the biggest improvement.

Signed-off-by: Alexander Duyck 
---
 drivers/base/dd.c |4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/base/dd.c b/drivers/base/dd.c
index 2fdfe45bb6ea..cf7681309ee3 100644
--- a/drivers/base/dd.c
+++ b/drivers/base/dd.c
@@ -834,7 +834,7 @@ static int __device_attach(struct device *dev, bool 
allow_async)
dev_dbg(dev, "scheduling asynchronous probe\n");
get_device(dev);
dev->async_probe = true;
-   async_schedule(__device_attach_async_helper, dev);
+   async_schedule_dev(__device_attach_async_helper, dev);
} else {
pm_request_idle(dev);
}
@@ -992,7 +992,7 @@ static int __driver_attach(struct device *dev, void *data)
if (!dev->driver) {
get_device(dev);
dev_set_drv_async(dev, drv);
-   async_schedule(__driver_attach_async_helper, dev);
+   async_schedule_dev(__driver_attach_async_helper, dev);
}
device_unlock(dev);
return 0;



[driver-core PATCH v5 6/9] driver core: Probe devices asynchronously instead of the driver

2018-11-05 Thread Alexander Duyck
This change makes it so that we probe devices asynchronously instead of the
driver. This results in us seeing the same behavior if the device is
registered before the driver or after. This way we can avoid serializing
the initialization should the driver not be loaded until after the devices
have already been added.

The motivation behind this is that if we have a set of devices that
take a significant amount of time to load we can greatly reduce the time to
load by processing them in parallel instead of one at a time. In addition,
each device can exist on a different node so placing a single thread on one
CPU to initialize all of the devices for a given driver can result in poor
performance on a system with multiple nodes.

I am using the driver_data member of the device struct to store the driver
pointer while we wait on the deferred probe call. This should be safe to do
as the value will either be set to NULL on a failed probe or driver load
followed by unload, or the driver value itself will be set on a successful
driver load. In addition I have used the async_probe flag to add additional
protection as it will be cleared if someone overwrites the driver_data
member as a part of loading the driver.

Signed-off-by: Alexander Duyck 
---
 drivers/base/bus.c |   23 +++--
 drivers/base/dd.c  |   52 
 include/linux/device.h |   26 +++-
 3 files changed, 80 insertions(+), 21 deletions(-)

diff --git a/drivers/base/bus.c b/drivers/base/bus.c
index 8a630f9bd880..0cd2eadd0816 100644
--- a/drivers/base/bus.c
+++ b/drivers/base/bus.c
@@ -606,17 +606,6 @@ static ssize_t uevent_store(struct device_driver *drv, 
const char *buf,
 }
 static DRIVER_ATTR_WO(uevent);
 
-static void driver_attach_async(void *_drv, async_cookie_t cookie)
-{
-   struct device_driver *drv = _drv;
-   int ret;
-
-   ret = driver_attach(drv);
-
-   pr_debug("bus: '%s': driver %s async attach completed: %d\n",
-drv->bus->name, drv->name, ret);
-}
-
 /**
  * bus_add_driver - Add a driver to the bus.
  * @drv: driver.
@@ -649,15 +638,9 @@ int bus_add_driver(struct device_driver *drv)
 
klist_add_tail(&priv->knode_bus, &bus->p->klist_drivers);
if (drv->bus->p->drivers_autoprobe) {
-   if (driver_allows_async_probing(drv)) {
-   pr_debug("bus: '%s': probing driver %s 
asynchronously\n",
-   drv->bus->name, drv->name);
-   async_schedule(driver_attach_async, drv);
-   } else {
-   error = driver_attach(drv);
-   if (error)
-   goto out_unregister;
-   }
+   error = driver_attach(drv);
+   if (error)
+   goto out_unregister;
}
module_add_driver(drv->owner, drv);
 
diff --git a/drivers/base/dd.c b/drivers/base/dd.c
index ed19cf0d6f9a..2fdfe45bb6ea 100644
--- a/drivers/base/dd.c
+++ b/drivers/base/dd.c
@@ -808,6 +808,7 @@ static int __device_attach(struct device *dev, bool 
allow_async)
ret = 1;
else {
dev->driver = NULL;
+   dev_set_drvdata(dev, NULL);
ret = 0;
}
} else {
@@ -925,6 +926,32 @@ int device_driver_attach(struct device_driver *drv, struct 
device *dev)
return ret;
 }
 
+static void __driver_attach_async_helper(void *_dev, async_cookie_t cookie)
+{
+   struct device *dev = _dev;
+   struct device_driver *drv;
+
+   __device_driver_lock(dev, dev->parent);
+
+   /*
+* If someone attempted to bind a driver either successfully or
+* unsuccessfully before we got here we should just skip the driver
+* probe call.
+*/
+   drv = dev_get_drv_async(dev);
+   if (drv && !dev->driver)
+   driver_probe_device(drv, dev);
+
+   /* We made our attempt at an async_probe, clear the flag */
+   dev->async_probe = false;
+
+   __device_driver_unlock(dev, dev->parent);
+
+   put_device(dev);
+
+   dev_dbg(dev, "async probe completed\n");
+}
+
 static int __driver_attach(struct device *dev, void *data)
 {
struct device_driver *drv = data;
@@ -952,6 +979,25 @@ static int __driver_attach(struct device *dev, void *data)
return ret;
} /* ret > 0 means positive match */
 
+   if (driver_allows_async_probing(drv)) {
+   /*
+* Instead of probing the device synchronously we will
+* probe it asynchronously to allow for more parallelism.
+*
+* We only take the device lock here in order to guarantee
+* that the dev->driver and driver_data fields are protected
+*/
+   dev_dbg(dev, "scheduling asynchronous probe\n");
+   
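
The include/linux/device.h hunk of this patch is truncated above. Based on the
description, the two helpers plausibly look something like the sketch below;
this is not the exact posted code:

	static inline void dev_set_drv_async(struct device *dev,
					     struct device_driver *drv)
	{
		/*
		 * Stash the candidate driver in driver_data and flag that an
		 * asynchronous probe is outstanding for this device.
		 */
		dev->driver_data = drv;
		dev->async_probe = true;
	}

	static inline struct device_driver *dev_get_drv_async(struct device *dev)
	{
		/*
		 * Only trust driver_data while async_probe is still set; once
		 * a driver has bound (or the probe was cancelled) the field
		 * may have been repurposed as ordinary driver data.
		 */
		return dev->async_probe ? dev->driver_data : NULL;
	}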

[driver-core PATCH v5 5/9] driver core: Establish clear order of operations for deferred probe and remove

2018-11-05 Thread Alexander Duyck
This patch adds an additional bit to the device struct named async_probe.
This additional bit allows us to guarantee ordering between probe and
remove operations.

This allows us to guarantee that if we execute a remove operation or a
driver load attempt on a given interface it will not attempt to update the
driver member asynchronously following the earlier operation. Previously
this guarantee was not present and could result in us attempting to remove
a driver from an interface only to have it show up later when it is
asynchronously loaded.

One additional change I made is replacing the "bool X:1" bitfield definition
with a "u8 X:1" setup in order to resolve some checkpatch warnings.

Signed-off-by: Alexander Duyck 
---
 drivers/base/dd.c  |  104 +++-
 include/linux/device.h |9 ++--
 2 files changed, 64 insertions(+), 49 deletions(-)

diff --git a/drivers/base/dd.c b/drivers/base/dd.c
index e74cefeb5b69..ed19cf0d6f9a 100644
--- a/drivers/base/dd.c
+++ b/drivers/base/dd.c
@@ -472,6 +472,8 @@ static int really_probe(struct device *dev, struct 
device_driver *drv)
 drv->bus->name, __func__, drv->name, dev_name(dev));
WARN_ON(!list_empty(&dev->devres_head));
 
+   /* clear async_probe flag as we are no longer deferring driver load */
+   dev->async_probe = false;
 re_probe:
dev->driver = drv;
 
@@ -771,6 +773,10 @@ static void __device_attach_async_helper(void *_dev, 
async_cookie_t cookie)
 
device_lock(dev);
 
+   /* nothing to do if async_probe has been cleared */
+   if (!dev->async_probe)
+   goto out_unlock;
+
if (dev->parent)
pm_runtime_get_sync(dev->parent);
 
@@ -781,7 +787,7 @@ static void __device_attach_async_helper(void *_dev, async_cookie_t cookie)
 
if (dev->parent)
pm_runtime_put(dev->parent);
-
+out_unlock:
device_unlock(dev);
 
put_device(dev);
@@ -826,6 +832,7 @@ static int __device_attach(struct device *dev, bool allow_async)
 */
dev_dbg(dev, "scheduling asynchronous probe\n");
get_device(dev);
+   dev->async_probe = true;
async_schedule(__device_attach_async_helper, dev);
} else {
pm_request_idle(dev);
@@ -971,62 +978,69 @@ EXPORT_SYMBOL_GPL(driver_attach);
  */
 static void __device_release_driver(struct device *dev, struct device *parent)
 {
-   struct device_driver *drv;
+   struct device_driver *drv = dev->driver;
 
-   drv = dev->driver;
-   if (drv) {
-   while (device_links_busy(dev)) {
-   __device_driver_unlock(dev, parent);
+   /*
+* In the event that we are asked to release the driver on an
+* interface that is still waiting on a probe we can just terminate
+* the probe by setting async_probe to false. When the async call
+* is finally completed it will see this state and just exit.
+*/
+   dev->async_probe = false;
+   if (!drv)
+   return;
 
-   device_links_unbind_consumers(dev);
+   while (device_links_busy(dev)) {
+   __device_driver_unlock(dev, parent);
 
-   __device_driver_lock(dev, parent);
-   /*
-* A concurrent invocation of the same function might
-* have released the driver successfully while this one
-* was waiting, so check for that.
-*/
-   if (dev->driver != drv)
-   return;
-   }
+   device_links_unbind_consumers(dev);
 
-   pm_runtime_get_sync(dev);
-   pm_runtime_clean_up_links(dev);
+   __device_driver_lock(dev, parent);
+   /*
+* A concurrent invocation of the same function might
+* have released the driver successfully while this one
+* was waiting, so check for that.
+*/
+   if (dev->driver != drv)
+   return;
+   }
 
-   driver_sysfs_remove(dev);
+   pm_runtime_get_sync(dev);
+   pm_runtime_clean_up_links(dev);
 
-   if (dev->bus)
-   blocking_notifier_call_chain(&dev->bus->p->bus_notifier,
-BUS_NOTIFY_UNBIND_DRIVER,
-dev);
+   driver_sysfs_remove(dev);
 
-   pm_runtime_put_sync(dev);
+   if (dev->bus)
+   blocking_notifier_call_chain(&dev->bus->p->bus_notifier,
+BUS_NOTIFY_UNBIND_DRIVER,
+dev);
 
-   if (dev->bus && dev->bus->remove)
- 

[driver-core PATCH v5 4/9] driver core: Move async_synchronize_full call

2018-11-05 Thread Alexander Duyck
This patch moves the async_synchronize_full call out of
__device_release_driver and into driver_detach.

The idea behind this is that the async_synchronize_full call only
guarantees that any existing async operations are flushed. It does nothing
to guarantee that a hotplug event occurring while we are releasing the
driver will not be asynchronously scheduled.

By moving this into the driver_detach path we can avoid potential
deadlocks, as we are not holding the device lock at that point, and we
should no longer have the driver we want to flush available for loading, so
the flush will take care of any asynchronous events the driver we are
detaching might have scheduled.

Signed-off-by: Alexander Duyck 
---
 drivers/base/dd.c |6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/base/dd.c b/drivers/base/dd.c
index 76c40fe69463..e74cefeb5b69 100644
--- a/drivers/base/dd.c
+++ b/drivers/base/dd.c
@@ -975,9 +975,6 @@ static void __device_release_driver(struct device *dev, struct device *parent)
 
drv = dev->driver;
if (drv) {
-   if (driver_allows_async_probing(drv))
-   async_synchronize_full();
-
while (device_links_busy(dev)) {
__device_driver_unlock(dev, parent);
 
@@ -1087,6 +1084,9 @@ void driver_detach(struct device_driver *drv)
struct device_private *dev_prv;
struct device *dev;
 
+   if (driver_allows_async_probing(drv))
+   async_synchronize_full();
+
for (;;) {
spin_lock(&drv->p->klist_devices.k_lock);
if (list_empty(&drv->p->klist_devices.k_list)) {



[driver-core PATCH v5 1/9] workqueue: Provide queue_work_node to queue work near a given NUMA node

2018-11-05 Thread Alexander Duyck
This patch provides a new function, queue_work_node, which is meant to
schedule work on a "random" CPU of the requested NUMA node. The main
motivation for this is to help asynchronous init improve boot times for
devices that are local to a specific node.

For now we just default to the first CPU in the intersection of the node's
cpumask and the online cpumask. The only exception is when the current CPU
is already local to the node, in which case we simply use the current CPU.
This should work for our purposes, as we are currently only using this for
unbound work, so the CPU will be translated back to a node anyway rather
than being used directly.

As we are only using the first CPU to represent the NUMA node for now, I am
limiting the scope of the function so that it can only be used with unbound
workqueues.
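
As a usage sketch (a hypothetical caller, not part of this patch, assuming
<linux/workqueue.h> and <linux/device.h>): queueing unbound work near a
device's NUMA node could look like the following, with queue_work_node()
falling back to normal queue_work() behaviour when the node is invalid or
offline.

/* Hypothetical helper name for illustration only. */
static void example_schedule_near_device(struct device *dev,
					 struct work_struct *work)
{
	int node = dev_to_node(dev);	/* may be NUMA_NO_NODE */

	/* Only valid for unbound workqueues, per the WARN_ON_ONCE in
	 * queue_work_node() below. */
	queue_work_node(node, system_unbound_wq, work);
}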

Acked-by: Tejun Heo 
Signed-off-by: Alexander Duyck 
---
 include/linux/workqueue.h |2 +
 kernel/workqueue.c|   84 +
 2 files changed, 86 insertions(+)

diff --git a/include/linux/workqueue.h b/include/linux/workqueue.h
index 60d673e15632..1f50c1e586e7 100644
--- a/include/linux/workqueue.h
+++ b/include/linux/workqueue.h
@@ -463,6 +463,8 @@ int workqueue_set_unbound_cpumask(cpumask_var_t cpumask);
 
 extern bool queue_work_on(int cpu, struct workqueue_struct *wq,
struct work_struct *work);
+extern bool queue_work_node(int node, struct workqueue_struct *wq,
+   struct work_struct *work);
 extern bool queue_delayed_work_on(int cpu, struct workqueue_struct *wq,
struct delayed_work *work, unsigned long delay);
 extern bool mod_delayed_work_on(int cpu, struct workqueue_struct *wq,
diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 0280deac392e..6ed7c2eb84b0 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -1492,6 +1492,90 @@ bool queue_work_on(int cpu, struct workqueue_struct *wq,
 }
 EXPORT_SYMBOL(queue_work_on);
 
+/**
+ * workqueue_select_cpu_near - Select a CPU based on NUMA node
+ * @node: NUMA node ID that we want to bind a CPU from
+ *
+ * This function will attempt to find a "random" cpu available on a given
+ * node. If there are no CPUs available on the given node it will return
+ * WORK_CPU_UNBOUND indicating that we should just schedule to any
+ * available CPU if we need to schedule this work.
+ */
+static int workqueue_select_cpu_near(int node)
+{
+   int cpu;
+
+   /* No point in doing this if NUMA isn't enabled for workqueues */
+   if (!wq_numa_enabled)
+   return WORK_CPU_UNBOUND;
+
+   /* Delay binding to CPU if node is not valid or not online */
+   if (node < 0 || node >= MAX_NUMNODES || !node_online(node))
+   return WORK_CPU_UNBOUND;
+
+   /* Use local node/cpu if we are already there */
+   cpu = raw_smp_processor_id();
+   if (node == cpu_to_node(cpu))
+   return cpu;
+
+   /* Use "random" otherwise known as "first" online CPU of node */
+   cpu = cpumask_any_and(cpumask_of_node(node), cpu_online_mask);
+
+   /* If CPU is valid return that, otherwise just defer */
+   return (cpu < nr_cpu_ids) ? cpu : WORK_CPU_UNBOUND;
+}
+
+/**
+ * queue_work_node - queue work on a "random" cpu for a given NUMA node
+ * @node: NUMA node that we are targeting the work for
+ * @wq: workqueue to use
+ * @work: work to queue
+ *
+ * We queue the work to a "random" CPU within a given NUMA node. The basic
+ * idea here is to provide a way to somehow associate work with a given
+ * NUMA node.
+ *
+ * This function will only make a best effort attempt at getting this onto
+ * the right NUMA node. If no node is requested or the requested node is
+ * offline then we just fall back to standard queue_work behavior.
+ *
+ * Currently the "random" CPU ends up being the first available CPU in the
+ * intersection of cpu_online_mask and the cpumask of the node, unless we
+ * are running on the node. In that case we just use the current CPU.
+ *
+ * Return: %false if @work was already on a queue, %true otherwise.
+ */
+bool queue_work_node(int node, struct workqueue_struct *wq,
+struct work_struct *work)
+{
+   unsigned long flags;
+   bool ret = false;
+
+   /*
+* This current implementation is specific to unbound workqueues.
+* Specifically we only return the first available CPU for a given
+* node instead of cycling through individual CPUs within the node.
+*
+* If this is used with a per-cpu workqueue then the logic in
+* workqueue_select_cpu_near would need to be updated to allow for
+* some round robin type logic.
+*/
+   WARN_ON_ONCE(!(wq->flags & WQ_UNBOUND));
+
+   local_irq_save(flags);
+
+   if (!test_and_set_bit(WORK_STRUCT_PENDING_BIT, work_data_bits(work))) {
+   int cpu = workqueue_select_cpu_near(node);
+
+   __queue_work(cpu, wq, work);
+   

[driver-core PATCH v5 3/9] device core: Consolidate locking and unlocking of parent and device

2018-11-05 Thread Alexander Duyck
This patch consolidates all of the locking and unlocking of both the parent
and the device when attaching or removing a driver from a given device.

To do that I first consolidated the lock pattern into two functions,
__device_driver_lock and __device_driver_unlock. I then added functions
specific to attaching and detaching the driver that take these locks. This
reduces the number of spots where we touch need_parent_lock from 12 down
to 4.
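
For reference, the attach side ends up pairing the two helpers around
driver_probe_device() roughly like this (a paraphrased sketch, since the
full function body is cut off later in this archive):

int device_driver_attach(struct device_driver *drv, struct device *dev)
{
	int ret = 0;

	__device_driver_lock(dev, dev->parent);

	/* Skip the probe if another driver was bound while we waited. */
	if (!dev->driver)
		ret = driver_probe_device(drv, dev);

	__device_driver_unlock(dev, dev->parent);

	return ret;
}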

Reviewed-by: Bart Van Assche 
Reviewed-by: Rafael J. Wysocki 
Signed-off-by: Alexander Duyck 
---
 drivers/base/base.h |2 +
 drivers/base/bus.c  |   23 ++---
 drivers/base/dd.c   |   91 ---
 3 files changed, 77 insertions(+), 39 deletions(-)

diff --git a/drivers/base/base.h b/drivers/base/base.h
index 7a419a7a6235..3f22ebd6117a 100644
--- a/drivers/base/base.h
+++ b/drivers/base/base.h
@@ -124,6 +124,8 @@ extern int driver_add_groups(struct device_driver *drv,
 const struct attribute_group **groups);
 extern void driver_remove_groups(struct device_driver *drv,
 const struct attribute_group **groups);
+int device_driver_attach(struct device_driver *drv, struct device *dev);
+void device_driver_detach(struct device *dev);
 
 extern char *make_class_name(const char *name, struct kobject *kobj);
 
diff --git a/drivers/base/bus.c b/drivers/base/bus.c
index 8bfd27ec73d6..8a630f9bd880 100644
--- a/drivers/base/bus.c
+++ b/drivers/base/bus.c
@@ -184,11 +184,7 @@ static ssize_t unbind_store(struct device_driver *drv, const char *buf,
 
dev = bus_find_device_by_name(bus, NULL, buf);
if (dev && dev->driver == drv) {
-   if (dev->parent && dev->bus->need_parent_lock)
-   device_lock(dev->parent);
-   device_release_driver(dev);
-   if (dev->parent && dev->bus->need_parent_lock)
-   device_unlock(dev->parent);
+   device_driver_detach(dev);
err = count;
}
put_device(dev);
@@ -211,13 +207,7 @@ static ssize_t bind_store(struct device_driver *drv, const char *buf,
 
dev = bus_find_device_by_name(bus, NULL, buf);
if (dev && dev->driver == NULL && driver_match_device(drv, dev)) {
-   if (dev->parent && bus->need_parent_lock)
-   device_lock(dev->parent);
-   device_lock(dev);
-   err = driver_probe_device(drv, dev);
-   device_unlock(dev);
-   if (dev->parent && bus->need_parent_lock)
-   device_unlock(dev->parent);
+   err = device_driver_attach(drv, dev);
 
if (err > 0) {
/* success */
@@ -769,13 +759,8 @@ EXPORT_SYMBOL_GPL(bus_rescan_devices);
  */
 int device_reprobe(struct device *dev)
 {
-   if (dev->driver) {
-   if (dev->parent && dev->bus->need_parent_lock)
-   device_lock(dev->parent);
-   device_release_driver(dev);
-   if (dev->parent && dev->bus->need_parent_lock)
-   device_unlock(dev->parent);
-   }
+   if (dev->driver)
+   device_driver_detach(dev);
return bus_rescan_devices_helper(dev, NULL);
 }
 EXPORT_SYMBOL_GPL(device_reprobe);
diff --git a/drivers/base/dd.c b/drivers/base/dd.c
index 169412ee4ae8..76c40fe69463 100644
--- a/drivers/base/dd.c
+++ b/drivers/base/dd.c
@@ -864,6 +864,60 @@ void device_initial_probe(struct device *dev)
__device_attach(dev, true);
 }
 
+/*
+ * __device_driver_lock - acquire locks needed to manipulate dev->drv
+ * @dev: Device we will update driver info for
+ * @parent: Parent device. Needed if the bus requires parent lock
+ *
+ * This function will take the required locks for manipulating dev->drv.
+ * Normally this will just be the @dev lock, but when called for a USB
+ * interface, @parent lock will be held as well.
+ */
+static void __device_driver_lock(struct device *dev, struct device *parent)
+{
+   if (parent && dev->bus->need_parent_lock)
+   device_lock(parent);
+   device_lock(dev);
+}
+
+/*
+ * __device_driver_unlock - release locks needed to manipulate dev->drv
+ * @dev: Device we will update driver info for
+ * @parent: Parent device. Needed if the bus requires parent lock
+ *
+ * This function will release the required locks for manipulating dev->drv.
+ * Normally this will just be the @dev lock, but when called for a
+ * USB interface, @parent lock will be released as well.
+ */
+static void __device_driver_unlock(struct device *dev, struct device *parent)
+{
+   device_unlock(dev);
+   if (parent && dev->bus->need_parent_lock)
+   device_unlock(parent);
+}
+
+/**
+ * device_driver_attach - attach a specific driver to a specific device
+ * @drv: Driver to attach
+ * @dev: Device to attach it 

[driver-core PATCH v5 2/9] async: Add support for queueing on specific NUMA node

2018-11-05 Thread Alexander Duyck
This patch introduces four new variants of the async_schedule_ functions
that allow scheduling on a specific NUMA node.

The first two functions are async_schedule_node and
async_schedule_node_domain, which map to async_schedule and
async_schedule_domain but provide NUMA node specific functionality. They
replace the original functions, which were moved to inline definitions that
call the new functions while passing NUMA_NO_NODE.

The second two functions are async_schedule_dev and
async_schedule_dev_domain, which provide NUMA specific functionality when
the data argument passed is a device and that device has a NUMA node other
than NUMA_NO_NODE.

The main motivation behind this is to address the need to be able to
schedule device specific init work on specific NUMA nodes in order to
improve performance of memory initialization.
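
A usage sketch (hypothetical driver code, not part of the patch, assuming
<linux/async.h> and <linux/device.h>): deferring a device's init to a
callback that runs near the device's NUMA node.

/* example_async_init()/example_defer_init() are made-up names. */
static void example_async_init(void *data, async_cookie_t cookie)
{
	struct device *dev = data;

	dev_info(dev, "deferred init running near node %d\n", dev_to_node(dev));
}

static void example_defer_init(struct device *dev)
{
	/* Equivalent to async_schedule_node(example_async_init, dev,
	 * dev_to_node(dev)). */
	async_schedule_dev(example_async_init, dev);
}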

Signed-off-by: Alexander Duyck 
---
 include/linux/async.h |   84 +++--
 kernel/async.c|   53 +--
 2 files changed, 110 insertions(+), 27 deletions(-)

diff --git a/include/linux/async.h b/include/linux/async.h
index 6b0226bdaadc..98a94e6e367d 100644
--- a/include/linux/async.h
+++ b/include/linux/async.h
@@ -14,6 +14,8 @@
 
#include <linux/types.h>
#include <linux/list.h>
+#include <linux/numa.h>
+#include <linux/device.h>
 
 typedef u64 async_cookie_t;
 typedef void (*async_func_t) (void *data, async_cookie_t cookie);
@@ -37,9 +39,85 @@ struct async_domain {
struct async_domain _name = { .pending = LIST_HEAD_INIT(_name.pending), \
			      .registered = 0 }
 
-extern async_cookie_t async_schedule(async_func_t func, void *data);
-extern async_cookie_t async_schedule_domain(async_func_t func, void *data,
-   struct async_domain *domain);
+async_cookie_t async_schedule_node(async_func_t func, void *data,
+  int node);
+async_cookie_t async_schedule_node_domain(async_func_t func, void *data,
+ int node,
+ struct async_domain *domain);
+
+/**
+ * async_schedule - schedule a function for asynchronous execution
+ * @func: function to execute asynchronously
+ * @data: data pointer to pass to the function
+ *
+ * Returns an async_cookie_t that may be used for checkpointing later.
+ * Note: This function may be called from atomic or non-atomic contexts.
+ */
+static inline async_cookie_t async_schedule(async_func_t func, void *data)
+{
+   return async_schedule_node(func, data, NUMA_NO_NODE);
+}
+
+/**
+ * async_schedule_domain - schedule a function for asynchronous execution within a certain domain
+ * @func: function to execute asynchronously
+ * @data: data pointer to pass to the function
+ * @domain: the domain
+ *
+ * Returns an async_cookie_t that may be used for checkpointing later.
+ * @domain may be used in the async_synchronize_*_domain() functions to
+ * wait within a certain synchronization domain rather than globally.  A
+ * synchronization domain is specified via @domain.
+ * Note: This function may be called from atomic or non-atomic contexts.
+ */
+static inline async_cookie_t
+async_schedule_domain(async_func_t func, void *data,
+ struct async_domain *domain)
+{
+   return async_schedule_node_domain(func, data, NUMA_NO_NODE, domain);
+}
+
+/**
+ * async_schedule_dev - A device specific version of async_schedule
+ * @func: function to execute asynchronously
+ * @dev: device argument to be passed to function
+ *
+ * Returns an async_cookie_t that may be used for checkpointing later.
+ * @dev is used as both the argument for the function and to provide NUMA
+ * context for where to run the function. By doing this we can try to
+ * provide for the best possible outcome by operating on the device on the
+ * CPUs closest to the device.
+ * Note: This function may be called from atomic or non-atomic contexts.
+ */
+static inline async_cookie_t
+async_schedule_dev(async_func_t func, struct device *dev)
+{
+   return async_schedule_node(func, dev, dev_to_node(dev));
+}
+
+/**
+ * async_schedule_dev_domain - A device specific version of async_schedule_domain
+ * @func: function to execute asynchronously
+ * @dev: device argument to be passed to function
+ * @domain: the domain
+ *
+ * Returns an async_cookie_t that may be used for checkpointing later.
+ * @dev is used as both the argument for the function and to provide NUMA
+ * context for where to run the function. By doing this we can try to
+ * provide for the best possible outcome by operating on the device on the
+ * CPUs closest to the device.
+ * @domain may be used in the async_synchronize_*_domain() functions to
+ * wait within a certain synchronization domain rather than globally.  A
+ * synchronization domain is specified via @domain.
+ * Note: This function may be called from atomic or non-atomic contexts.
+ */
+static inline async_cookie_t

[driver-core PATCH v5 0/9] Add NUMA aware async_schedule calls

2018-11-05 Thread Alexander Duyck
This patch set provides functionality that will help to improve the
locality of the async_schedule calls used to provide deferred
initialization.

This patch set originally started out focused on just the one call to
async_schedule_domain in the nvdimm tree that was being used to defer the
device_add call. However, after doing some digging I realized the scope of
this was much broader than I had originally planned. As such, I went
through and reworked the underlying infrastructure, down to replacing the
queue_work call itself with a function of my own, and opted to provide a
NUMA aware solution that would work for a broader audience.

RFC->v1:
Dropped nvdimm patch to submit later.
It relies on code in libnvdimm development tree.
Simplified queue_work_near to just convert node into a CPU.
Split up drivers core and PM core patches.
v1->v2:
Renamed queue_work_near to queue_work_node
Added WARN_ON_ONCE if we use queue_work_node with per-cpu workqueue
v2->v3:
Added Acked-by for queue_work_node patch
Continued rename from _near to _node to be consistent with queue_work_node
Renamed async_schedule_near_domain to async_schedule_node_domain
Renamed async_schedule_near to async_schedule_node
Added kerneldoc for new async_schedule_XXX functions
Updated patch description for patch 4 to include data on potential gains
v3->v4:
Added patch to consolidate use of need_parent_lock
Make asynchronous driver probing explicit about use of drvdata
v4->v5:
Added patch to move async_synchronize_full to address deadlock
Added bit async_probe to act as mutex for probe/remove calls
Added back nvdimm patch as code it relies on is now in Linus's tree
Incorporated review comments on parent & device locking consolidation
Rebased on latest linux-next

---

Alexander Duyck (9):
  workqueue: Provide queue_work_node to queue work near a given NUMA node
  async: Add support for queueing on specific NUMA node
  device core: Consolidate locking and unlocking of parent and device
  driver core: Move async_synchronize_full call
  driver core: Establish clear order of operations for deferred probe and 
remove
  driver core: Probe devices asynchronously instead of the driver
  driver core: Attach devices on CPU local to device node
  PM core: Use new async_schedule_dev command
  libnvdimm: Schedule device registration on node local to the device


 drivers/base/base.h   |2 
 drivers/base/bus.c|   46 +---
 drivers/base/dd.c |  249 +
 drivers/base/power/main.c |   12 +-
 drivers/nvdimm/bus.c  |   11 +-
 include/linux/async.h |   84 +++
 include/linux/device.h|   35 +-
 include/linux/workqueue.h |2 
 kernel/async.c|   53 +-
 kernel/workqueue.c|   84 +++
 10 files changed, 432 insertions(+), 146 deletions(-)

--


RE: [ndctl PATCH v13 2/5] ndctl, monitor: add main ndctl monitor configuration file

2018-11-05 Thread qi.f...@fujitsu.com
> > Hi,
> >
> > I have a future cleanup request. I've just come across ciniparser
> > [1]. It would be great if we could drop the open coded implementation
> > and just use that library.
> > We use ccan modules for other portions of ndctl.
> >
> > [1]: https://ccodearchive.net/info/ciniparser.html
> 
> Hi Dan,
> 
> It sounds good.
> I will study it a bit and then make a cleanup patch.
> 
> Thank you very much.
>  Qi

Hi,

I have finished importing the ccan/ciniparser module and refactoring
read_config_file() in my local branch.
However, since more than one configuration file might be added to ndctl in
the future, I would like to add a common parse_config_file() function under
ndctl/util/ to replace read_config_file().
If you agree, I will try to make a patch.
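
Something along these lines, as a rough sketch only -- it assumes the
ccan/ciniparser module keeps the iniparser-style entry points
(ciniparser_load(), ciniparser_getstring(), ciniparser_freedict()), and the
callback shape and the "monitor:log" key are purely illustrative:

#include <stdio.h>
#include <ccan/ciniparser/ciniparser.h>

/* Proposed common helper: load an ini-style file and hand the parsed
 * dictionary to a per-consumer callback. */
typedef int (*config_parse_cb)(dictionary *d, void *arg);

static int parse_config_file(const char *path, config_parse_cb cb, void *arg)
{
	dictionary *d = ciniparser_load(path);
	int rc;

	if (!d) {
		fprintf(stderr, "failed to parse %s\n", path);
		return -1;
	}

	rc = cb(d, arg);	/* each consumer pulls out its own keys */
	ciniparser_freedict(d);
	return rc;
}

/* Example consumer for the monitor configuration. */
static int monitor_cb(dictionary *d, void *arg)
{
	const char *log = ciniparser_getstring(d, "monitor:log", "syslog");

	(void)arg;
	printf("monitor log target: %s\n", log);
	return 0;
}

int main(int argc, char **argv)
{
	return parse_config_file(argc > 1 ? argv[1] : "monitor.conf",
				 monitor_cb, NULL);
}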

Thank you very much.
 QI