Re: [PATCH 07/10] mm/vmscan: add helper for querying ability to age anonymous pages

2021-03-19 Thread Greg Thelen
Dave Hansen  wrote:

> From: Dave Hansen 
>
> Anonymous pages are kept on their own LRU(s).  These lists could
> theoretically always be scanned and maintained.  But, without swap,
> there is currently nothing the kernel can *do* with the results of a
> scanned, sorted LRU for anonymous pages.
>
> A check for '!total_swap_pages' currently serves as a valid check as
> to whether anonymous LRUs should be maintained.  However, another
> method will be added shortly: page demotion.
>
> Abstract out the 'total_swap_pages' checks into a helper, give it a
> logically significant name, and check for the possibility of page
> demotion.

Reviewed-by: Greg Thelen 

> Signed-off-by: Dave Hansen 
> Cc: David Rientjes 
> Cc: Huang Ying 
> Cc: Dan Williams 
> Cc: David Hildenbrand 
> Cc: osalvador 
> ---
>
>  b/mm/vmscan.c |   28 +---
>  1 file changed, 25 insertions(+), 3 deletions(-)
>
> diff -puN mm/vmscan.c~mm-vmscan-anon-can-be-aged mm/vmscan.c
> --- a/mm/vmscan.c~mm-vmscan-anon-can-be-aged  2021-03-04 15:35:58.935806422 
> -0800
> +++ b/mm/vmscan.c 2021-03-04 15:35:58.942806422 -0800
> @@ -2517,6 +2517,26 @@ out:
>   }
>  }
>  
> +/*
> + * Anonymous LRU management is a waste if there is
> + * ultimately no way to reclaim the memory.
> + */
> +bool anon_should_be_aged(struct lruvec *lruvec)

Should this be static?
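
I.e., assuming there are no callers outside mm/vmscan.c, something like this
(just a sketch of the suggestion, not a tested change):

  /*
   * Sketch of the suggestion (untested): with no users outside mm/vmscan.c
   * the helper can be static; the logic is the same as in the patch.
   */
  static bool anon_should_be_aged(struct lruvec *lruvec)
  {
          struct pglist_data *pgdat = lruvec_pgdat(lruvec);

          return total_swap_pages > 0 ||
                 next_demotion_node(pgdat->node_id) >= 0;
  }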

> +{
> + struct pglist_data *pgdat = lruvec_pgdat(lruvec);
> +
> + /* Aging the anon LRU is valuable if swap is present: */
> + if (total_swap_pages > 0)
> + return true;
> +
> + /* Also valuable if anon pages can be demoted: */
> + if (next_demotion_node(pgdat->node_id) >= 0)
> + return true;
> +
> + /* No way to reclaim anon pages.  Should not age anon LRUs: */
> + return false;
> +}
> +
>  static void shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
>  {
>   unsigned long nr[NR_LRU_LISTS];
> @@ -2626,7 +2646,8 @@ static void shrink_lruvec(struct lruvec
>* Even if we did not try to evict anon pages at all, we want to
>* rebalance the anon lru active/inactive ratio.
>*/
> - if (total_swap_pages && inactive_is_low(lruvec, LRU_INACTIVE_ANON))
> + if (anon_should_be_aged(lruvec) &&
> + inactive_is_low(lruvec, LRU_INACTIVE_ANON))
>   shrink_active_list(SWAP_CLUSTER_MAX, lruvec,
>  sc, LRU_ACTIVE_ANON);
>  }
> @@ -3455,10 +3476,11 @@ static void age_active_anon(struct pglis
>   struct mem_cgroup *memcg;
>   struct lruvec *lruvec;
>  
> - if (!total_swap_pages)
> + lruvec = mem_cgroup_lruvec(NULL, pgdat);
> +
> + if (!anon_should_be_aged(lruvec))
>   return;
>  
> - lruvec = mem_cgroup_lruvec(NULL, pgdat);
>   if (!inactive_is_low(lruvec, LRU_INACTIVE_ANON))
>   return;
>  
> _


Re: + revert-mm-filemap-add-static-for-function-__add_to_page_cache_locked.patch added to -mm tree

2020-12-07 Thread Greg Thelen
a...@linux-foundation.org wrote:

> The patch titled
>  Subject: revert "mm/filemap: add static for function 
> __add_to_page_cache_locked"
> has been added to the -mm tree.  Its filename is
>  
> revert-mm-filemap-add-static-for-function-__add_to_page_cache_locked.patch
>
> This patch should soon appear at
> 
> https://ozlabs.org/~akpm/mmots/broken-out/revert-mm-filemap-add-static-for-function-__add_to_page_cache_locked.patch
> and later at
> 
> https://ozlabs.org/~akpm/mmotm/broken-out/revert-mm-filemap-add-static-for-function-__add_to_page_cache_locked.patch
>
> Before you just go and hit "reply", please:
>a) Consider who else should be cc'ed
>b) Prefer to cc a suitable mailing list as well
>c) Ideally: find the original patch on the mailing list and do a
>   reply-to-all to that, adding suitable additional cc's
>
> *** Remember to use Documentation/process/submit-checklist.rst when testing 
> your code ***
>
> The -mm tree is included into linux-next and is updated
> there every 3-4 working days
>
> --
> From: Andrew Morton 
> Subject: revert "mm/filemap: add static for function 
> __add_to_page_cache_locked"
>
> Revert 3351b16af494 ("mm/filemap: add static for function
> __add_to_page_cache_locked") due to build issues with
> CONFIG_DEBUG_INFO_BTF=y.
>
> Link: 
> https://lkml.kernel.org/r/caadnvqj6tmzbxvtrobueh6qa0h+q7yaskxrvvvxhqr3kbzd...@mail.gmail.com
> Cc: Michal Kubecek 
> Cc: Justin Forbes 
> Cc: Alex Shi 
> Cc: Souptick Joarder 
> Cc: Alexei Starovoitov 
> Cc: Daniel Borkmann 
> Cc: Josef Bacik 
> Signed-off-by: Andrew Morton 

Thanks.

Tested-by: Greg Thelen 

> ---
>
>  mm/filemap.c |2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> --- 
> a/mm/filemap.c~revert-mm-filemap-add-static-for-function-__add_to_page_cache_locked
> +++ a/mm/filemap.c
> @@ -827,7 +827,7 @@ int replace_page_cache_page(struct page
>  }
>  EXPORT_SYMBOL_GPL(replace_page_cache_page);
>  
> -static noinline int __add_to_page_cache_locked(struct page *page,
> +noinline int __add_to_page_cache_locked(struct page *page,
>   struct address_space *mapping,
>   pgoff_t offset, gfp_t gfp,
>   void **shadowp)
> _
>
> Patches currently in -mm which might be from a...@linux-foundation.org are
>
> revert-mm-filemap-add-static-for-function-__add_to_page_cache_locked.patch
> kthread_worker-document-cpu-hotplug-handling-fix.patch
> mm.patch
> mm-remove-the-unuseful-else-after-a-return-checkpatch-fixes.patch
> mm-prevent-gup_fast-from-racing-with-cow-during-fork-checkpatch-fixes.patch
> mm-swap_state-skip-meaningless-swap-cache-readahead-when-ra_infowin-==-0-fix.patch
> mm-vmallocc-__vmalloc_area_node-avoid-32-bit-overflow.patch
> mm-cma-improve-pr_debug-log-in-cma_release-fix.patch
> mm-vmstat-fix-proc-sys-vm-stat_refresh-generating-false-warnings-fix-2.patch
> lib-cmdline_kunit-add-a-new-test-suite-for-cmdline-api-fix.patch
> ilog2-improve-ilog2-for-constant-arguments-checkpatch-fixes.patch
> lib-test_bitmapc-add-for_each_set_clump-test-cases-checkpatch-fixes.patch
> checkpatch-fix-typo_spelling-check-for-words-with-apostrophe-fix.patch
> resource-fix-kernel-doc-markups-checkpatch-fixes.patch
> linux-next-rejects.patch
> kmap-stupid-hacks-to-make-it-compile.patch
> init-kconfig-dont-assume-scripts-lld-versionsh-is-executable.patch
> set_memory-allow-set_direct_map__noflush-for-multiple-pages-fix.patch
> arch-mm-wire-up-memfd_secret-system-call-were-relevant-fix.patch
> kernel-forkc-export-kernel_thread-to-modules.patch


Re: [PATCH] mm/filemap: add static for function __add_to_page_cache_locked

2020-12-07 Thread Greg Thelen
Alex Shi  wrote:

> On 2020/11/11 at 3:50 AM, Andrew Morton wrote:
>> On Tue, 10 Nov 2020 08:39:24 +0530 Souptick Joarder  
>> wrote:
>> 
>>> On Fri, Nov 6, 2020 at 4:55 PM Alex Shi  wrote:

 Otherwise it causes a gcc warning:
   ../mm/filemap.c:830:14: warning: no previous prototype for
   ‘__add_to_page_cache_locked’ [-Wmissing-prototypes]
    noinline int __add_to_page_cache_locked(struct page *page,
                 ^~~~~~~~~~~~~~~~~~~~~~~~~~
>>>
>>> Is CONFIG_DEBUG_INFO_BTF enabled in your .config ?
>> 
>> hm, yes.
>
> When the config is enabled, compiling looks good until the pahole tool is
> used to get BTF info, but I still failed even with a proper pahole version
> (> 1.16). Sorry.

I'm seeing an issue with this patch.  My build system has pahole v1.17,
but I don't think the pahole version is key.

$ git checkout 3351b16af494  # recently added to linus/master
$ make defconfig
$ make menuconfig  # set CONFIG_BPF_SYSCALL and CONFIG_DEBUG_INFO_BTF
$ make V=1
+ ./tools/bpf/resolve_btfids/resolve_btfids vmlinux
FAILED unresolved symbol __add_to_page_cache_locked

Reverting 3351b16af494 ("mm/filemap: add static for function
__add_to_page_cache_locked") fixes the issue.

I don't see the warning which motivated this patch, but maybe it
requires a particular .config or gcc version.  Perhaps adding a
__add_to_page_cache_locked() prototype would avoid it.  But I haven't
studied BTF enough to know if there's a better answer.
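
For example, the declaration might look like the following (an untested
sketch; where such a prototype would best live is an open question):

  /*
   * Untested sketch: declare the symbol so -Wmissing-prototypes is satisfied
   * while keeping it non-static for BTF_ID()/resolve_btfids.  Signature taken
   * from the current mm/filemap.c definition.
   */
  int __add_to_page_cache_locked(struct page *page, struct address_space *mapping,
                                 pgoff_t offset, gfp_t gfp, void **shadowp);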

 Signed-off-by: Alex Shi 
 Cc: Andrew Morton 
 Cc: linux...@kvack.org
 Cc: linux-kernel@vger.kernel.org
 ---
  mm/filemap.c | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)

 diff --git a/mm/filemap.c b/mm/filemap.c
 index d90614f501da..249cf489f5df 100644
 --- a/mm/filemap.c
 +++ b/mm/filemap.c
 @@ -827,7 +827,7 @@ int replace_page_cache_page(struct page *old, struct 
 page *new, gfp_t gfp_mask)
  }
  EXPORT_SYMBOL_GPL(replace_page_cache_page);

 -noinline int __add_to_page_cache_locked(struct page *page,
 +static noinline int __add_to_page_cache_locked(struct page *page,
 struct address_space *mapping,
 pgoff_t offset, gfp_t gfp,
 void **shadowp)
>> 
>> It's unclear to me whether BTF_ID() requires that the target symbol be
>> non-static.  It doesn't actually reference the symbol:
>> 
>> #define BTF_ID(prefix, name) \
>> __BTF_ID(__ID(__BTF_ID__##prefix##__##name##__))
>> 
>
> The above usage made me think BTF doesn't require the symbol, though
> the symbol still exists in vmlinux with 'static'.
>
> So any comments on this, Alexei?
>
>> Alexei, can you please comment?
>> 


Re: [PATCH] mm: mmap_lock: fix use-after-free race and css ref leak in tracepoints

2020-12-01 Thread Greg Thelen
Axel Rasmussen  wrote:

> On Mon, Nov 30, 2020 at 5:34 PM Shakeel Butt  wrote:
>>
>> On Mon, Nov 30, 2020 at 3:43 PM Axel Rasmussen  
>> wrote:
>> >
>> > syzbot reported[1] a use-after-free introduced in 0f818c4bc1f3. The bug
>> > is that an ongoing trace event might race with the tracepoint being
>> > disabled (and therefore the _unreg() callback being called). Consider
>> > this ordering:
>> >
>> > T1: trace event fires, get_mm_memcg_path() is called
>> > T1: get_memcg_path_buf() returns a buffer pointer
>> > T2: trace_mmap_lock_unreg() is called, buffers are freed
>> > T1: cgroup_path() is called with the now-freed buffer
>>
>> Any reason to use the cgroup_path instead of the cgroup_ino? There are
>> other examples of trace points using cgroup_ino and no need to
>> allocate buffers. Also cgroup namespace might complicate the path
>> usage.
>
> Hmm, so in general I would love to use a numeric identifier instead of a 
> string.
>
> I did some reading, and it looks like the cgroup_ino() mainly has to
> do with writeback, instead of being just a general identifier?
> https://www.kernel.org/doc/Documentation/cgroup-v2.txt
>
> There is cgroup_id() which I think is almost what I'd want, but there
> are a couple problems with it:
>
> - I don't know of a way for userspace to translate IDs -> paths, to
> make them human readable?

The id => name map can be built from user space with a tree walk.
Example:

$ find /sys/fs/cgroup/memory -type d -printf '%i %P\n'  
20387 init.scope 
31 system.slice

> - Also I think the ID implementation we use for this is "dense",
> meaning if a cgroup is removed, its ID is likely to be quickly reused.
>
>>
>> >
>> > The solution in this commit is to modify trace_mmap_lock_unreg() to
>> > first stop new buffers from being handed out, and then to wait (spin)
>> > until any existing buffer references are dropped (i.e., those trace
>> > events complete).
>> >
>> > I have a simple reproducer program which spins up two pools of threads,
>> > doing the following in a tight loop:
>> >
>> >   Pool 1:
>> >   mmap(NULL, 4096, PROT_READ | PROT_WRITE,
>> >MAP_PRIVATE | MAP_ANONYMOUS, -1, 0)
>> >   munmap()
>> >
>> >   Pool 2:
>> >   echo 1 > /sys/kernel/debug/tracing/events/mmap_lock/enable
>> >   echo 0 > /sys/kernel/debug/tracing/events/mmap_lock/enable
>> >
>> > This triggers the use-after-free very quickly. With this patch, I let it
>> > run for an hour without any BUGs.
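
For anyone who wants to try it, a rough equivalent of the reproducer described
above could look like this (my own sketch, not Axel's program; the tracefs
path and thread counts are assumptions):

  #include <fcntl.h>
  #include <pthread.h>
  #include <sys/mman.h>
  #include <unistd.h>

  #define TOGGLE "/sys/kernel/debug/tracing/events/mmap_lock/enable"

  /* Pool 1: map and unmap a page in a tight loop to fire the trace events. */
  static void *mmap_pool(void *arg)
  {
          (void)arg;
          for (;;) {
                  void *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                                 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
                  if (p != MAP_FAILED)
                          munmap(p, 4096);
          }
          return NULL;
  }

  /* Pool 2: repeatedly enable and disable the mmap_lock tracepoints. */
  static void *toggle_pool(void *arg)
  {
          (void)arg;
          for (;;) {
                  int fd = open(TOGGLE, O_WRONLY);

                  if (fd < 0)
                          return NULL;
                  write(fd, "1", 1);
                  write(fd, "0", 1);
                  close(fd);
          }
          return NULL;
  }

  int main(void)
  {
          pthread_t t[4];
          int i;

          for (i = 0; i < 2; i++)
                  pthread_create(&t[i], NULL, mmap_pool, NULL);
          for (; i < 4; i++)
                  pthread_create(&t[i], NULL, toggle_pool, NULL);
          for (i = 0; i < 4; i++)
                  pthread_join(t[i], NULL);
          return 0;
  }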
>> >
>> > While fixing this, I also noticed and fixed a css ref leak. Previously
>> > we called get_mem_cgroup_from_mm(), but we never called css_put() to
>> > release that reference. get_mm_memcg_path() now does this properly.
>> >
>> > [1]: https://syzkaller.appspot.com/bug?extid=19e6dd9943972fa1c58a
>> >
>> > Fixes: 0f818c4bc1f3 ("mm: mmap_lock: add tracepoints around lock 
>> > acquisition")
>>
>> The original patch is in mm tree, so the SHA1 is not stabilized.
>> Usually Andrew squash the fixes into the original patches.
>
> Ah, I added this because it also shows up in linux-next, under the
> next-20201130 tag. I'll remove it in v2, squashing is fine. :)
>
>>
>> > Signed-off-by: Axel Rasmussen 
>> > ---
>> >  mm/mmap_lock.c | 100 +
>> >  1 file changed, 85 insertions(+), 15 deletions(-)
>> >
>> > diff --git a/mm/mmap_lock.c b/mm/mmap_lock.c
>> > index 12af8f1b8a14..be38dc58278b 100644
>> > --- a/mm/mmap_lock.c
>> > +++ b/mm/mmap_lock.c
>> > @@ -3,6 +3,7 @@
>> >  #include 
>> >
>> >  #include 
>> > +#include 
>> >  #include 
>> >  #include 
>> >  #include 
>> > @@ -18,13 +19,28 @@ EXPORT_TRACEPOINT_SYMBOL(mmap_lock_released);
>> >  #ifdef CONFIG_MEMCG
>> >
>> >  /*
>> > - * Our various events all share the same buffer (because we don't want or 
>> > need
>> > - * to allocate a set of buffers *per event type*), so we need to protect 
>> > against
>> > - * concurrent _reg() and _unreg() calls, and count how many _reg() calls 
>> > have
>> > - * been made.
>> > + * This is unfortunately complicated... _reg() and _unreg() may be called
>> > + * in parallel, separately for each of our three event types. To save 
>> > memory,
>> > + * all of the event types share the same buffers. Furthermore, trace 
>> > events
>> > + * might happen in parallel with _unreg(); we need to ensure we don't 
>> > free the
>> > + * buffers before all inflights have finished. Because these events happen
>> > + * "frequently", we also want to prevent new inflights from starting once 
>> > the
>> > + * _unreg() process begins. And, for performance reasons, we want to 
>> > avoid any
>> > + * locking in the trace event path.
>> > + *
>> > + * So:
>> > + *
>> > + * - Use a spinlock to serialize _reg() and _unreg() calls.
>> > + * - Keep track of nested _reg() calls with a lock-protected counter.
>> > + * - Define a flag indicating whether or not 
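
(The quoted comment is cut off above.)  Roughly, the scheme it describes could
be sketched as below, illustrative only, with made-up names; this is not the
code that was merged:

  /* Illustrative sketch (not the merged code): trace events take a reference
   * on the shared buffers; _unreg() stops new users, then spins until all
   * in-flight users have dropped their references.  A real implementation
   * also needs memory barriers, plus a spinlock and nesting counter to
   * serialize the _reg()/_unreg() calls themselves. */
  static atomic_t inflight = ATOMIC_INIT(0);   /* trace events using buffers */
  static bool buffers_enabled;                 /* may new events grab a buffer? */

  static bool get_buf_ref(void)                /* called from the trace event */
  {
          if (!READ_ONCE(buffers_enabled))
                  return false;
          atomic_inc(&inflight);
          if (!READ_ONCE(buffers_enabled)) {   /* re-check: _unreg() may have raced */
                  atomic_dec(&inflight);
                  return false;
          }
          return true;
  }

  static void put_buf_ref(void)
  {
          atomic_dec(&inflight);
  }

  static void disable_and_wait(void)           /* called from _unreg() */
  {
          WRITE_ONCE(buffers_enabled, false);
          while (atomic_read(&inflight))
                  cpu_relax();
  }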

Re: [PATCH] selftests: more general make nesting support

2020-08-05 Thread Greg Thelen
Shuah Khan  wrote:

> On 8/5/20 1:36 PM, Greg Thelen wrote:
>> On Tue, Jul 28, 2020 at 12:32 AM Greg Thelen  wrote:
>>>
>>> selftests can be built from the toplevel kernel makefile (e.g. make
>>> kselftest-all) or directly (make -C tools/testing/selftests all).
>>>
>>> The toplevel kernel makefile explicitly disables implicit rules with
>>> "MAKEFLAGS += -rR", which is passed to tools/testing/selftests.  Some
>>> selftest makefiles require implicit make rules, which is why
>>> commit 67d8712dcc70 ("selftests: Fix build failures when invoked from
>>> kselftest target") reenables implicit rules by clearing MAKEFLAGS if
>>> MAKELEVEL=1.
>>>
>>> So far so good.  However, if the toplevel makefile is called from an
>>> outer makefile then MAKELEVEL will be elevated, which breaks the
>>> MAKELEVEL equality test.
>>> Example wrapped makefile error:
>>>$ cat ~/Makefile
>>>all:
>>>  $(MAKE) defconfig
>>>  $(MAKE) kselftest-all
>>>$ make -sf ~/Makefile
>>>  futex_wait_timeout.c /src/tools/testing/selftests/kselftest_harness.h  
>>>  /src/tools/testing/selftests/kselftest.h ../include/futextest.h 
>>> ../include/atomic.h ../include/logging.h -lpthread -lrt -o 
>>> /src/tools/testing/selftests/futex/functional/futex_wait_timeout
>>>make[4]: futex_wait_timeout.c: Command not found
>>>
>>> Rather than checking $(MAKELEVEL), check for $(LINK.c), which is a more
>>> direct side effect of "make -R".  This enables arbitrary makefile
>>> nesting.
>>>
>>> Signed-off-by: Greg Thelen 
>>> ---
>>>   tools/testing/selftests/Makefile | 8 
>>>   1 file changed, 4 insertions(+), 4 deletions(-)
>>>
>>> diff --git a/tools/testing/selftests/Makefile 
>>> b/tools/testing/selftests/Makefile
>>> index 1195bd85af38..289a2e4b3f6f 100644
>>> --- a/tools/testing/selftests/Makefile
>>> +++ b/tools/testing/selftests/Makefile
>>> @@ -84,10 +84,10 @@ endif
>>>   # of the targets gets built.
>>>   FORCE_TARGETS ?=
>>>
>>> -# Clear LDFLAGS and MAKEFLAGS if called from main
>>> -# Makefile to avoid test build failures when test
>>> -# Makefile doesn't have explicit build rules.
>>> -ifeq (1,$(MAKELEVEL))
>>> +# Clear LDFLAGS and MAKEFLAGS when implicit rules are missing.  This 
>>> provides
>>> +# implicit rules to sub-test Makefiles which avoids build failures in test
>>> +# Makefile that don't have explicit build rules.
>>> +ifeq (,$(LINK.c))
>>>   override LDFLAGS =
>>>   override MAKEFLAGS =
>>>   endif
>>> --
>>> 2.28.0.rc0.142.g3c755180ce-goog
>> 
>> Is there any feedback on this patch?  It's not high priority but something 
>> that
>> will help me make more use of selftests.
>> 
>
> No issues with the patch. I will apply it once the merge window ends.
>
> thanks,
> -- Shuah

Great.  Thank you.


Re: [PATCH] selftests: more general make nesting support

2020-08-05 Thread Greg Thelen
On Tue, Jul 28, 2020 at 12:32 AM Greg Thelen  wrote:
>
> selftests can be built from the toplevel kernel makefile (e.g. make
> kselftest-all) or directly (make -C tools/testing/selftests all).
>
> The toplevel kernel makefile explicitly disables implicit rules with
> "MAKEFLAGS += -rR", which is passed to tools/testing/selftests.  Some
> selftest makefiles require implicit make rules, which is why
> commit 67d8712dcc70 ("selftests: Fix build failures when invoked from
> kselftest target") reenables implicit rules by clearing MAKEFLAGS if
> MAKELEVEL=1.
>
> So far so good.  However, if the toplevel makefile is called from an
> outer makefile then MAKELEVEL will be elevated, which breaks the
> MAKELEVEL equality test.
> Example wrapped makefile error:
>   $ cat ~/Makefile
>   all:
> $(MAKE) defconfig
> $(MAKE) kselftest-all
>   $ make -sf ~/Makefile
> futex_wait_timeout.c /src/tools/testing/selftests/kselftest_harness.h   
> /src/tools/testing/selftests/kselftest.h ../include/futextest.h 
> ../include/atomic.h ../include/logging.h -lpthread -lrt -o 
> /src/tools/testing/selftests/futex/functional/futex_wait_timeout
>   make[4]: futex_wait_timeout.c: Command not found
>
> Rather than checking $(MAKELEVEL), check for $(LINK.c), which is a more
> direct side effect of "make -R".  This enables arbitrary makefile
> nesting.
>
> Signed-off-by: Greg Thelen 
> ---
>  tools/testing/selftests/Makefile | 8 
>  1 file changed, 4 insertions(+), 4 deletions(-)
>
> diff --git a/tools/testing/selftests/Makefile 
> b/tools/testing/selftests/Makefile
> index 1195bd85af38..289a2e4b3f6f 100644
> --- a/tools/testing/selftests/Makefile
> +++ b/tools/testing/selftests/Makefile
> @@ -84,10 +84,10 @@ endif
>  # of the targets gets built.
>  FORCE_TARGETS ?=
>
> -# Clear LDFLAGS and MAKEFLAGS if called from main
> -# Makefile to avoid test build failures when test
> -# Makefile doesn't have explicit build rules.
> -ifeq (1,$(MAKELEVEL))
> +# Clear LDFLAGS and MAKEFLAGS when implicit rules are missing.  This provides
> +# implicit rules to sub-test Makefiles which avoids build failures in test
> +# Makefile that don't have explicit build rules.
> +ifeq (,$(LINK.c))
>  override LDFLAGS =
>  override MAKEFLAGS =
>  endif
> --
> 2.28.0.rc0.142.g3c755180ce-goog

Is there any feedback on this patch?  It's not high priority but something that
will help me make more use of selftests.


[PATCH] selftests: more general make nesting support

2020-07-28 Thread Greg Thelen
selftests can be built from the toplevel kernel makefile (e.g. make
kselftest-all) or directly (make -C tools/testing/selftests all).

The toplevel kernel makefile explicitly disables implicit rules with
"MAKEFLAGS += -rR", which is passed to tools/testing/selftests.  Some
selftest makefiles require implicit make rules, which is why
commit 67d8712dcc70 ("selftests: Fix build failures when invoked from
kselftest target") reenables implicit rules by clearing MAKEFLAGS if
MAKELEVEL=1.

So far so good.  However, if the toplevel makefile is called from an
outer makefile then MAKELEVEL will be elevated, which breaks the
MAKELEVEL equality test.
Example wrapped makefile error:
  $ cat ~/Makefile
  all:
$(MAKE) defconfig
$(MAKE) kselftest-all
  $ make -sf ~/Makefile
futex_wait_timeout.c /src/tools/testing/selftests/kselftest_harness.h   
/src/tools/testing/selftests/kselftest.h ../include/futextest.h 
../include/atomic.h ../include/logging.h -lpthread -lrt -o 
/src/tools/testing/selftests/futex/functional/futex_wait_timeout
  make[4]: futex_wait_timeout.c: Command not found

Rather than checking $(MAKELEVEL), check for $(LINK.c), which is a more
direct side effect of "make -R".  This enables arbitrary makefile
nesting.

Signed-off-by: Greg Thelen 
---
 tools/testing/selftests/Makefile | 8 
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/tools/testing/selftests/Makefile b/tools/testing/selftests/Makefile
index 1195bd85af38..289a2e4b3f6f 100644
--- a/tools/testing/selftests/Makefile
+++ b/tools/testing/selftests/Makefile
@@ -84,10 +84,10 @@ endif
 # of the targets gets built.
 FORCE_TARGETS ?=
 
-# Clear LDFLAGS and MAKEFLAGS if called from main
-# Makefile to avoid test build failures when test
-# Makefile doesn't have explicit build rules.
-ifeq (1,$(MAKELEVEL))
+# Clear LDFLAGS and MAKEFLAGS when implicit rules are missing.  This provides
+# implicit rules to sub-test Makefiles which avoids build failures in test
+# Makefile that don't have explicit build rules.
+ifeq (,$(LINK.c))
 override LDFLAGS =
 override MAKEFLAGS =
 endif
-- 
2.28.0.rc0.142.g3c755180ce-goog



Re: [PATCH v18 06/14] mm/damon: Implement callbacks for the virtual memory address spaces

2020-07-27 Thread Greg Thelen
SeongJae Park  wrote:

> From: SeongJae Park 
>
> This commit introduces a reference implementation of the address space
> specific low level primitives for the virtual address space, so that
> users of DAMON can easily monitor the data accesses on virtual address
> spaces of specific processes by simply configuring the implementation to
> be used by DAMON.
>
> The low level primitives for the fundamental access monitoring are
> defined in two parts:
> 1. Identification of the monitoring target address range for the address
> space.
> 2. Access check of specific address range in the target space.
>
> The reference implementation for the virtual address space provided by
> this commit is designed as below.
>
> PTE Accessed-bit Based Access Check
> ---
>
> The implementation uses the PTE Accessed-bit for basic access checks.  That
> is, it clears the bit for the next sampling target page and checks whether
> it is set again after one sampling period.  To avoid disturbing other
> Accessed bit users such as the reclamation logic, the implementation
> adjusts the ``PG_Idle`` and ``PG_Young`` flags appropriately, the same as
> 'Idle Page Tracking'.
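
As a concrete illustration of one sampling round, this is my own sketch using
generic pgtable/page-idle helpers, not DAMON's actual code; it assumes the
caller has already walked to the PTE and holds the page table lock:

  /* Before the sampling period: clear the Accessed bit, preserving the
   * information for reclaim via the page_idle "young" flag. */
  static void sample_prepare(struct vm_area_struct *vma, unsigned long addr,
                             pte_t *pte)
  {
          if (ptep_test_and_clear_young(vma, addr, pte))
                  set_page_young(pte_page(*pte));
  }

  /* After the sampling period: was the page accessed in between? */
  static bool sample_check(pte_t *pte)
  {
          return pte_young(*pte);
  }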
>
> VMA-based Target Address Range Construction
> ---
>
> Only small parts in the super-huge virtual address space of the
> processes are mapped to physical memory and accessed.  Thus, tracking
> the unmapped address regions is just wasteful.  However, because DAMON
> can deal with some level of noise using the adaptive regions adjustment
> mechanism, tracking every mapping is not strictly required but could
> even incur a high overhead in some cases.  That said, excessively large
> unmapped areas inside the monitoring target should be removed so that the
> adaptive mechanism does not waste time on them.
>
> For the reason, this implementation converts the complex mappings to
> three distinct regions that cover every mapped area of the address
> space.  Also, the two gaps between the three regions are the two biggest
> unmapped areas in the given address space.  The two biggest unmapped
> areas would be the gap between the heap and the uppermost mmap()-ed
> region, and the gap between the lowermost mmap()-ed region and the stack
> in most of the cases.  Because these gaps are exceptionally huge in
> usual address spaces, excluding these will be sufficient to make a
> reasonable trade-off.  The following shows this in detail::
>
> <heap>
> <BIG UNMAPPED REGION 1>
> <uppermost mmap()-ed region>
> (small mmap()-ed regions and munmap()-ed regions)
> <lowermost mmap()-ed region>
> <BIG UNMAPPED REGION 2>
> <stack>
>
> Signed-off-by: SeongJae Park 
> Reviewed-by: Leonard Foerster 
> ---
>  include/linux/damon.h |   6 +
>  mm/damon.c| 474 ++
>  2 files changed, 480 insertions(+)
>
> diff --git a/include/linux/damon.h b/include/linux/damon.h
> index 3c0b92a679e8..310d36d123b3 100644
> --- a/include/linux/damon.h
> +++ b/include/linux/damon.h
> @@ -144,6 +144,12 @@ struct damon_ctx {
>   void (*aggregate_cb)(struct damon_ctx *context);
>  };
>  
> +/* Reference callback implementations for virtual memory */
> +void kdamond_init_vm_regions(struct damon_ctx *ctx);
> +void kdamond_update_vm_regions(struct damon_ctx *ctx);
> +void kdamond_prepare_vm_access_checks(struct damon_ctx *ctx);
> +unsigned int kdamond_check_vm_accesses(struct damon_ctx *ctx);
> +
>  int damon_set_pids(struct damon_ctx *ctx, int *pids, ssize_t nr_pids);
>  int damon_set_attrs(struct damon_ctx *ctx, unsigned long sample_int,
>   unsigned long aggr_int, unsigned long regions_update_int,
> diff --git a/mm/damon.c b/mm/damon.c
> index b844924b9fdb..386780739007 100644
> --- a/mm/damon.c
> +++ b/mm/damon.c
> @@ -9,6 +9,9 @@
>   * This file is constructed in below parts.
>   *
>   * - Functions and macros for DAMON data structures
> + * - Functions for the initial monitoring target regions construction
> + * - Functions for the dynamic monitoring target regions update
> + * - Functions for the access checking of the regions
>   * - Functions for DAMON core logics and features
>   * - Functions for the DAMON programming interface
>   * - Functions for the module loading/unloading
> @@ -196,6 +199,477 @@ static unsigned long damon_region_sz_limit(struct 
> damon_ctx *ctx)
>   return sz;
>  }
>  
> +/*
> + * Get the mm_struct of the given task
> + *
> + * Caller _must_ put the mm_struct after use, unless it is NULL.
> + *
> + * Returns the mm_struct of the task on success, NULL on failure
> + */
> +static struct mm_struct *damon_get_mm(struct damon_task *t)
> +{
> + struct task_struct *task;
> + struct mm_struct *mm;
> +
> + task = damon_get_task_struct(t);
> + if (!task)
> + return NULL;
> +
> + mm = get_task_mm(task);
> + put_task_struct(task);
> + return mm;
> +}
> +
> +/*
> + * Functions for the initial monitoring target regions construction
> + */
> +
> +/*
> + * Size-evenly split a region into 'nr_pieces' small regions
> + *
> + * Returns 0 on success, 

Re: [PATCH v18 11/14] Documentation: Add documents for DAMON

2020-07-27 Thread Greg Thelen
SeongJae Park  wrote:

> From: SeongJae Park 
>
> This commit adds documents for DAMON under
> `Documentation/admin-guide/mm/damon/` and `Documentation/vm/damon/`.
>
> Signed-off-by: SeongJae Park 
> ---
>  Documentation/admin-guide/mm/damon/guide.rst | 157 ++
>  Documentation/admin-guide/mm/damon/index.rst |  15 +
>  Documentation/admin-guide/mm/damon/plans.rst |  29 ++
>  Documentation/admin-guide/mm/damon/start.rst |  98 ++
>  Documentation/admin-guide/mm/damon/usage.rst | 298 +++
>  Documentation/admin-guide/mm/index.rst   |   1 +
>  Documentation/vm/damon/api.rst   |  20 ++
>  Documentation/vm/damon/eval.rst  | 222 ++
>  Documentation/vm/damon/faq.rst   |  59 
>  Documentation/vm/damon/index.rst |  32 ++
>  Documentation/vm/damon/mechanisms.rst| 165 ++
>  Documentation/vm/index.rst   |   1 +
>  12 files changed, 1097 insertions(+)
>  create mode 100644 Documentation/admin-guide/mm/damon/guide.rst
>  create mode 100644 Documentation/admin-guide/mm/damon/index.rst
>  create mode 100644 Documentation/admin-guide/mm/damon/plans.rst
>  create mode 100644 Documentation/admin-guide/mm/damon/start.rst
>  create mode 100644 Documentation/admin-guide/mm/damon/usage.rst
>  create mode 100644 Documentation/vm/damon/api.rst
>  create mode 100644 Documentation/vm/damon/eval.rst
>  create mode 100644 Documentation/vm/damon/faq.rst
>  create mode 100644 Documentation/vm/damon/index.rst
>  create mode 100644 Documentation/vm/damon/mechanisms.rst
>
> diff --git a/Documentation/admin-guide/mm/damon/guide.rst 
> b/Documentation/admin-guide/mm/damon/guide.rst
> new file mode 100644
> index ..c51fb843efaa
> --- /dev/null
> +++ b/Documentation/admin-guide/mm/damon/guide.rst
> @@ -0,0 +1,157 @@
> +.. SPDX-License-Identifier: GPL-2.0
> +
> +==
> +Optimization Guide
> +==
> +
> +This document helps you estimate the amount of benefit that you could get
> +from DAMON-based optimizations, and describes how you could achieve it.  You
> +are assumed to have already read :doc:`start`.
> +
> +
> +Check The Signs
> +===
> +
> +No optimization can provide the same extent of benefit in every case.  Therefore
> +you should first estimate how much improvement you could get using DAMON.  If
> +some of the below conditions match your situation, you could consider using 
> DAMON.
> +
> +- *Low IPC and High Cache Miss Ratios.*  Low IPC means most of the CPU time 
> is
> +  spent waiting for the completion of time-consuming operations such as 
> memory
> +  access, while high cache miss ratios mean the caches don't help it well.
> +  DAMON is not for cache level optimization, but DRAM level.  However,
> +  improving DRAM management will also help this case by reducing the memory
> +  operation latency.
> +- *Memory Over-commitment and Unknown Users.*  If you are doing memory
> +  overcommitment and you cannot control every user of your system, a memory
> +  bank run could happen at any time.  You can estimate when it will happen
> +  based on DAMON's monitoring results and act earlier to avoid or deal better
> +  with the crisis.
> +- *Frequent Memory Pressure.*  Frequent memory pressure means your system has
> +  wrong configurations or memory hogs.  DAMON will help you find the right
> +  configuration and/or the criminals.
> +- *Heterogeneous Memory System.*  If your system is utilizing memory devices
> +  that are placed between DRAM and traditional hard disks, such as non-volatile
> +  memory or fast SSDs, DAMON could help you utilizing the devices more
> +  efficiently.
> +
> +
> +Profile
> +===
> +
> +If you found some positive signals, you could start by profiling your 
> workloads
> +using DAMON.  Find major workloads on your systems and analyze their data
> +access pattern to find something that is wrong or can be improved.  The DAMON user
> +space tool (``damo``) will be useful for this.
> +
> +We recommend you to start with a working set size distribution check using 
> ``damo
> +report wss``.  If the distribution is non-uniform or quite different from what
> +you estimated, you could consider `Memory Configuration`_ optimization.
> +
> +Then, review the overall access pattern in heatmap form using ``damo report
> +heats``.  If it shows a simple pattern consisting of a small number of memory
> +regions with a high contrast of access temperature, you could consider manual
> +`Program Modification`_.
> +
> +If you still want to absorb more benefits, you should develop `Personalized
> +DAMON Application`_ for your special case.
> +
> +You don't need to take only one approach among the above plans, but you could
> +use multiple of the above approaches to maximize the benefit.
> +
> +
> +Optimize
> +
> +
> +If the profiling result also says it's worth trying some optimization, you
> +could consider below approaches.  Note that some of the below approaches 
> assume
> +that 

Re: [PATCH v2] powerpc/powernv/pci: use ifdef to avoid dead code

2020-07-18 Thread Greg Thelen
Oliver O'Halloran  wrote:

> On Mon, Jun 15, 2020 at 9:33 AM Greg Thelen  wrote:
>>
>> Commit dc3d8f85bb57 ("powerpc/powernv/pci: Re-work bus PE
>> configuration") removed a couple pnv_ioda_setup_bus_dma() calls.  The
>> only remaining calls are behind CONFIG_IOMMU_API.  Thus builds without
>> CONFIG_IOMMU_API see:
>>   arch/powerpc/platforms/powernv/pci-ioda.c:1888:13: error: 
>> 'pnv_ioda_setup_bus_dma' defined but not used
>>
>> Move pnv_ioda_setup_bus_dma() under CONFIG_IOMMU_API to avoid dead code.
>
> Doh! Thanks for the fix.
>
> Reviewed-by: Oliver O'Halloran 

Is there anything else needed from me on this patch?
Given that it fixes a 5.8 commit I figured it'd be 5.8 material.


Re: [RFC][PATCH 2/8] mm/migrate: Defer allocating new page until needed

2020-07-01 Thread Greg Thelen
Dave Hansen  wrote:

> From: Keith Busch 
>
> Migrating pages had been allocating the new page before it was actually
> needed. Subsequent operations may still fail, which would have to handle
> cleaning up the newly allocated page when it was never used.
>
> Defer allocating the page until we are actually ready to make use of
> it, after locking the original page. This simplifies error handling,
> but should not have any functional change in behavior. This is just
> refactoring page migration so the main part can more easily be reused
> by other code.

Is there any concern that the src page is now held PG_locked over the
dst page allocation, which might wander into
reclaim/cond_resched/oom_kill?  I don't have a deadlock in mind.  I'm
just wondering about the additional latency imposed on unrelated threads
that want to access the src page.

> #Signed-off-by: Keith Busch 

Is commented Signed-off-by intentional?  Same applies to later patches.

> Signed-off-by: Dave Hansen 
> Cc: Keith Busch 
> Cc: Yang Shi 
> Cc: David Rientjes 
> Cc: Huang Ying 
> Cc: Dan Williams 
> ---
>
>  b/mm/migrate.c |  148 
> -
>  1 file changed, 75 insertions(+), 73 deletions(-)
>
> diff -puN mm/migrate.c~0007-mm-migrate-Defer-allocating-new-page-until-needed 
> mm/migrate.c
> --- a/mm/migrate.c~0007-mm-migrate-Defer-allocating-new-page-until-needed 
> 2020-06-29 16:34:37.896312607 -0700
> +++ b/mm/migrate.c2020-06-29 16:34:37.900312607 -0700
> @@ -1014,56 +1014,17 @@ out:
>   return rc;
>  }
>  
> -static int __unmap_and_move(struct page *page, struct page *newpage,
> - int force, enum migrate_mode mode)
> +static int __unmap_and_move(new_page_t get_new_page,
> + free_page_t put_new_page,
> + unsigned long private, struct page *page,
> + enum migrate_mode mode,
> + enum migrate_reason reason)
>  {
>   int rc = -EAGAIN;
>   int page_was_mapped = 0;
>   struct anon_vma *anon_vma = NULL;
>   bool is_lru = !__PageMovable(page);
> -
> - if (!trylock_page(page)) {
> - if (!force || mode == MIGRATE_ASYNC)
> - goto out;
> -
> - /*
> -  * It's not safe for direct compaction to call lock_page.
> -  * For example, during page readahead pages are added locked
> -  * to the LRU. Later, when the IO completes the pages are
> -  * marked uptodate and unlocked. However, the queueing
> -  * could be merging multiple pages for one bio (e.g.
> -  * mpage_readpages). If an allocation happens for the
> -  * second or third page, the process can end up locking
> -  * the same page twice and deadlocking. Rather than
> -  * trying to be clever about what pages can be locked,
> -  * avoid the use of lock_page for direct compaction
> -  * altogether.
> -  */
> - if (current->flags & PF_MEMALLOC)
> - goto out;
> -
> - lock_page(page);
> - }
> -
> - if (PageWriteback(page)) {
> - /*
> -  * Only in the case of a full synchronous migration is it
> -  * necessary to wait for PageWriteback. In the async case,
> -  * the retry loop is too short and in the sync-light case,
> -  * the overhead of stalling is too much
> -  */
> - switch (mode) {
> - case MIGRATE_SYNC:
> - case MIGRATE_SYNC_NO_COPY:
> - break;
> - default:
> - rc = -EBUSY;
> - goto out_unlock;
> - }
> - if (!force)
> - goto out_unlock;
> - wait_on_page_writeback(page);
> - }
> + struct page *newpage;
>  
>   /*
>* By try_to_unmap(), page->mapcount goes down to 0 here. In this case,
> @@ -1082,6 +1043,12 @@ static int __unmap_and_move(struct page
>   if (PageAnon(page) && !PageKsm(page))
>   anon_vma = page_get_anon_vma(page);
>  
> + newpage = get_new_page(page, private);
> + if (!newpage) {
> + rc = -ENOMEM;
> + goto out;
> + }
> +
>   /*
>* Block others from accessing the new page when we get around to
>* establishing additional references. We are usually the only one
> @@ -1091,11 +1058,11 @@ static int __unmap_and_move(struct page
>* This is much like races on refcount of oldpage: just don't BUG().
>*/
>   if (unlikely(!trylock_page(newpage)))
> - goto out_unlock;
> + goto out_put;
>  
>   if (unlikely(!is_lru)) {
>   rc = move_to_new_page(newpage, page, mode);
> - goto out_unlock_both;
> + goto out_unlock;
>   }
>  
>   /*
> @@ -1114,7 +1081,7 @@ 

[PATCH v2] powerpc/powernv/pci: use ifdef to avoid dead code

2020-06-14 Thread Greg Thelen
Commit dc3d8f85bb57 ("powerpc/powernv/pci: Re-work bus PE
configuration") removed a couple pnv_ioda_setup_bus_dma() calls.  The
only remaining calls are behind CONFIG_IOMMU_API.  Thus builds without
CONFIG_IOMMU_API see:
  arch/powerpc/platforms/powernv/pci-ioda.c:1888:13: error: 
'pnv_ioda_setup_bus_dma' defined but not used

Move pnv_ioda_setup_bus_dma() under CONFIG_IOMMU_API to avoid dead code.

Fixes: dc3d8f85bb57 ("powerpc/powernv/pci: Re-work bus PE configuration")
Signed-off-by: Greg Thelen 
---
 arch/powerpc/platforms/powernv/pci-ioda.c | 26 +++
 1 file changed, 13 insertions(+), 13 deletions(-)

diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c 
b/arch/powerpc/platforms/powernv/pci-ioda.c
index 73a63efcf855..743d840712da 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -1885,19 +1885,6 @@ static bool pnv_pci_ioda_iommu_bypass_supported(struct 
pci_dev *pdev,
return false;
 }
 
-static void pnv_ioda_setup_bus_dma(struct pnv_ioda_pe *pe, struct pci_bus *bus)
-{
-   struct pci_dev *dev;
-
-   list_for_each_entry(dev, &bus->devices, bus_list) {
-   set_iommu_table_base(&dev->dev, pe->table_group.tables[0]);
-   dev->dev.archdata.dma_offset = pe->tce_bypass_base;
-
-   if ((pe->flags & PNV_IODA_PE_BUS_ALL) && dev->subordinate)
-   pnv_ioda_setup_bus_dma(pe, dev->subordinate);
-   }
-}
-
 static inline __be64 __iomem *pnv_ioda_get_inval_reg(struct pnv_phb *phb,
 bool real_mode)
 {
@@ -2501,6 +2488,19 @@ static long pnv_pci_ioda2_unset_window(struct 
iommu_table_group *table_group,
 #endif
 
 #ifdef CONFIG_IOMMU_API
+static void pnv_ioda_setup_bus_dma(struct pnv_ioda_pe *pe, struct pci_bus *bus)
+{
+   struct pci_dev *dev;
+
+   list_for_each_entry(dev, &bus->devices, bus_list) {
+   set_iommu_table_base(&dev->dev, pe->table_group.tables[0]);
+   dev->dev.archdata.dma_offset = pe->tce_bypass_base;
+
+   if ((pe->flags & PNV_IODA_PE_BUS_ALL) && dev->subordinate)
+   pnv_ioda_setup_bus_dma(pe, dev->subordinate);
+   }
+}
+
 unsigned long pnv_pci_ioda2_get_table_size(__u32 page_shift,
__u64 window_size, __u32 levels)
 {
-- 
2.27.0.290.gba653c62da-goog



[PATCH] e1000e: add ifdef to avoid dead code

2020-06-14 Thread Greg Thelen
Commit e086ba2fccda ("e1000e: disable s0ix entry and exit flows for ME
systems") added e1000e_check_me() but it's only called from
CONFIG_PM_SLEEP protected code.  Thus builds without CONFIG_PM_SLEEP
see:
  drivers/net/ethernet/intel/e1000e/netdev.c:137:13: warning: 'e1000e_check_me' 
defined but not used [-Wunused-function]

Add CONFIG_PM_SLEEP ifdef guard to avoid dead code.

Fixes: e086ba2fccda ("e1000e: disable s0ix entry and exit flows for ME systems")
Signed-off-by: Greg Thelen 
---
 drivers/net/ethernet/intel/e1000e/netdev.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/drivers/net/ethernet/intel/e1000e/netdev.c 
b/drivers/net/ethernet/intel/e1000e/netdev.c
index a279f4fa9962..165f0aea22c9 100644
--- a/drivers/net/ethernet/intel/e1000e/netdev.c
+++ b/drivers/net/ethernet/intel/e1000e/netdev.c
@@ -107,6 +107,7 @@ static const struct e1000_reg_info e1000_reg_info_tbl[] = {
{0, NULL}
 };
 
+#ifdef CONFIG_PM_SLEEP
 struct e1000e_me_supported {
u16 device_id;  /* supported device ID */
 };
@@ -145,6 +146,7 @@ static bool e1000e_check_me(u16 device_id)
 
return false;
 }
+#endif /* CONFIG_PM_SLEEP */
 
 /**
  * __ew32_prepare - prepare to write to MAC CSR register on certain parts
-- 
2.27.0.290.gba653c62da-goog



[PATCH] powerpc/powernv/pci: add ifdef to avoid dead code

2020-06-13 Thread Greg Thelen
Commit dc3d8f85bb57 ("powerpc/powernv/pci: Re-work bus PE
configuration") removed a couple pnv_ioda_setup_bus_dma() calls.  The
only remaining calls are behind CONFIG_IOMMU_API.  Thus builds without
CONFIG_IOMMU_API see:
  arch/powerpc/platforms/powernv/pci-ioda.c:1888:13: error: 
'pnv_ioda_setup_bus_dma' defined but not used

Add CONFIG_IOMMU_API ifdef guard to avoid dead code.

Fixes: dc3d8f85bb57 ("powerpc/powernv/pci: Re-work bus PE configuration")
Signed-off-by: Greg Thelen 
---
 arch/powerpc/platforms/powernv/pci-ioda.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c 
b/arch/powerpc/platforms/powernv/pci-ioda.c
index 73a63efcf855..f7762052b7c4 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -1885,6 +1885,7 @@ static bool pnv_pci_ioda_iommu_bypass_supported(struct 
pci_dev *pdev,
return false;
 }
 
+#ifdef CONFIG_IOMMU_API
 static void pnv_ioda_setup_bus_dma(struct pnv_ioda_pe *pe, struct pci_bus *bus)
 {
struct pci_dev *dev;
@@ -1897,6 +1898,7 @@ static void pnv_ioda_setup_bus_dma(struct pnv_ioda_pe 
*pe, struct pci_bus *bus)
pnv_ioda_setup_bus_dma(pe, dev->subordinate);
}
 }
+#endif
 
 static inline __be64 __iomem *pnv_ioda_get_inval_reg(struct pnv_phb *phb,
 bool real_mode)
-- 
2.27.0.290.gba653c62da-goog



Re: [PATCH] shmem, memcg: enable memcg aware shrinker

2020-06-04 Thread Greg Thelen
Yang Shi  wrote:

> On Sun, May 31, 2020 at 8:22 PM Greg Thelen  wrote:
>>
>> Since v4.19 commit b0dedc49a2da ("mm/vmscan.c: iterate only over charged
>> shrinkers during memcg shrink_slab()") a memcg aware shrinker is only
>> called when the per-memcg per-node shrinker_map indicates that the
>> shrinker may have objects to release to the memcg and node.
>>
>> shmem_unused_huge_count and shmem_unused_huge_scan support the per-tmpfs
>> shrinker which advertises per memcg and numa awareness.  The shmem
>> shrinker releases memory by splitting hugepages that extend beyond
>> i_size.
>>
>> Shmem does not currently set bits in shrinker_map.  So, starting with
>> b0dedc49a2da, memcg reclaim avoids calling the shmem shrinker under
>> pressure.  This leads to undeserved memcg OOM kills.
>> Example that reliably sees memcg OOM kill in unpatched kernel:
>>   FS=/tmp/fs
>>   CONTAINER=/cgroup/memory/tmpfs_shrinker
>>   mkdir -p $FS
>>   mount -t tmpfs -o huge=always nodev $FS
>>   # Create 1000 MB container, which shouldn't suffer OOM.
>>   mkdir $CONTAINER
>>   echo 1000M > $CONTAINER/memory.limit_in_bytes
>>   echo $BASHPID >> $CONTAINER/cgroup.procs
>>   # Create 4000 files.  Ideally each file uses 4k data page + a little
>>   # metadata.  Assume 8k total per-file, 32MB (4000*8k) should easily
>>   # fit within container's 1000 MB.  But if data pages use 2MB
>>   # hugepages (due to aggressive huge=always) then files consume 8GB,
>>   # which hits memcg 1000 MB limit.
>>   for i in {1..4000}; do
>> echo . > $FS/$i
>>   done
>
> It looks like all the inodes which have tail THPs beyond i_size are on one
> single list, and the shrinker actually just splits the first
> nr_to_scan inodes. But since the list is not memcg aware, it seems
> it may split THPs which are not charged to the victim memcg and
> the victim memcg may still suffer from a premature oom, right?

Correct.  shmem_unused_huge_shrink() is not memcg aware.  In response to
memcg pressure it will split the post-i_size tails of nr_to_scan tmpfs
inodes regardless of whether they're charged to the under-pressure memcg.
do_shrink_slab() looks like it'll repeatedly call
shmem_unused_huge_shrink().  So it will split the tails of many inodes, and
I think it'll avoid the oom by over-shrinking.  This is not ideal, but
it seems better than an undeserved oom kill.

I think the solution (as Kirill Tkhai suggested) is a memcg-aware index,
which would solve both:
1) avoid premature oom by registering the shrinker to respond to memcg
   pressure
2) avoid shrinking/splitting inodes unrelated to the under-pressure
   memcg

I can certainly look into that (thanks Kirill for the pointers).  In the
short term I'm still interested in avoiding premature OOMs with the
original thread (i.e. restore pre-4.19 behavior to shmem shrinker for
memcg pressure).  I plan to test and repost v2.


Re: [PATCH] shmem, memcg: enable memcg aware shrinker

2020-06-01 Thread Greg Thelen
Kirill Tkhai  wrote:

> Hi, Greg,
>
> good finding. See comments below.
>
> On 01.06.2020 06:22, Greg Thelen wrote:
>> Since v4.19 commit b0dedc49a2da ("mm/vmscan.c: iterate only over charged
>> shrinkers during memcg shrink_slab()") a memcg aware shrinker is only
>> called when the per-memcg per-node shrinker_map indicates that the
>> shrinker may have objects to release to the memcg and node.
>> 
>> shmem_unused_huge_count and shmem_unused_huge_scan support the per-tmpfs
>> shrinker which advertises per memcg and numa awareness.  The shmem
>> shrinker releases memory by splitting hugepages that extend beyond
>> i_size.
>> 
>> Shmem does not currently set bits in shrinker_map.  So, starting with
>> b0dedc49a2da, memcg reclaim avoids calling the shmem shrinker under
>> pressure.  This leads to undeserved memcg OOM kills.
>> Example that reliably sees memcg OOM kill in unpatched kernel:
>>   FS=/tmp/fs
>>   CONTAINER=/cgroup/memory/tmpfs_shrinker
>>   mkdir -p $FS
>>   mount -t tmpfs -o huge=always nodev $FS
>>   # Create 1000 MB container, which shouldn't suffer OOM.
>>   mkdir $CONTAINER
>>   echo 1000M > $CONTAINER/memory.limit_in_bytes
>>   echo $BASHPID >> $CONTAINER/cgroup.procs
>>   # Create 4000 files.  Ideally each file uses 4k data page + a little
>>   # metadata.  Assume 8k total per-file, 32MB (4000*8k) should easily
>>   # fit within container's 1000 MB.  But if data pages use 2MB
>>   # hugepages (due to aggressive huge=always) then files consume 8GB,
>>   # which hits memcg 1000 MB limit.
>>   for i in {1..4000}; do
>> echo . > $FS/$i
>>   done
>> 
>> v5.4 commit 87eaceb3faa5 ("mm: thp: make deferred split shrinker memcg
>> aware") maintains the per-node per-memcg shrinker bitmap for THP
>> shrinker.  But there's no such logic in shmem.  Make shmem set the
>> per-memcg per-node shrinker bits when it modifies inodes to have
>> shrinkable pages.
>> 
>> Fixes: b0dedc49a2da ("mm/vmscan.c: iterate only over charged shrinkers 
>> during memcg shrink_slab()")
>> Cc:  # 4.19+
>> Signed-off-by: Greg Thelen 
>> ---
>>  mm/shmem.c | 61 +++---
>>  1 file changed, 35 insertions(+), 26 deletions(-)
>> 
>> diff --git a/mm/shmem.c b/mm/shmem.c
>> index bd8840082c94..e11090f78cb5 100644
>> --- a/mm/shmem.c
>> +++ b/mm/shmem.c
>> @@ -1002,6 +1002,33 @@ static int shmem_getattr(const struct path *path, 
>> struct kstat *stat,
>>  return 0;
>>  }
>>  
>> +/*
>> + * Expose inode and optional page to shrinker as having a possibly 
>> splittable
>> + * hugepage that reaches beyond i_size.
>> + */
>> +static void shmem_shrinker_add(struct shmem_sb_info *sbinfo,
>> +   struct inode *inode, struct page *page)
>> +{
>> +struct shmem_inode_info *info = SHMEM_I(inode);
>> +
>> +spin_lock(&sbinfo->shrinklist_lock);
>> +/*
>> + * _careful to defend against unlocked access to ->shrink_list in
>> + * shmem_unused_huge_shrink()
>> + */
>> +if (list_empty_careful(&info->shrinklist)) {
>> +list_add_tail(&info->shrinklist, &sbinfo->shrinklist);
>> +sbinfo->shrinklist_len++;
>> +}
>> +spin_unlock(&sbinfo->shrinklist_lock);
>> +
>> +#ifdef CONFIG_MEMCG
>> +if (page && PageTransHuge(page))
>> +memcg_set_shrinker_bit(page->mem_cgroup, page_to_nid(page),
>> +   inode->i_sb->s_shrink.id);
>> +#endif
>> +}
>> +
>>  static int shmem_setattr(struct dentry *dentry, struct iattr *attr)
>>  {
>>  struct inode *inode = d_inode(dentry);
>> @@ -1048,17 +1075,13 @@ static int shmem_setattr(struct dentry *dentry, 
>> struct iattr *attr)
>>   * to shrink under memory pressure.
>>   */
>>  if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE)) {
>> -spin_lock(&sbinfo->shrinklist_lock);
>> -/*
>> - * _careful to defend against unlocked access to
>> - * ->shrink_list in shmem_unused_huge_shrink()
>> - */
>> -if (list_empty_careful(&info->shrinklist)) {
>> -list_add_tail(&info->shrinklist,
>> -&sbinfo->shrinklist);
>> -  

[PATCH] shmem, memcg: enable memcg aware shrinker

2020-05-31 Thread Greg Thelen
Since v4.19 commit b0dedc49a2da ("mm/vmscan.c: iterate only over charged
shrinkers during memcg shrink_slab()") a memcg aware shrinker is only
called when the per-memcg per-node shrinker_map indicates that the
shrinker may have objects to release to the memcg and node.

shmem_unused_huge_count and shmem_unused_huge_scan support the per-tmpfs
shrinker which advertises per memcg and numa awareness.  The shmem
shrinker releases memory by splitting hugepages that extend beyond
i_size.

Shmem does not currently set bits in shrinker_map.  So, starting with
b0dedc49a2da, memcg reclaim avoids calling the shmem shrinker under
pressure.  This leads to undeserved memcg OOM kills.
Example that reliably sees memcg OOM kill in unpatched kernel:
  FS=/tmp/fs
  CONTAINER=/cgroup/memory/tmpfs_shrinker
  mkdir -p $FS
  mount -t tmpfs -o huge=always nodev $FS
  # Create 1000 MB container, which shouldn't suffer OOM.
  mkdir $CONTAINER
  echo 1000M > $CONTAINER/memory.limit_in_bytes
  echo $BASHPID >> $CONTAINER/cgroup.procs
  # Create 4000 files.  Ideally each file uses 4k data page + a little
  # metadata.  Assume 8k total per-file, 32MB (4000*8k) should easily
  # fit within container's 1000 MB.  But if data pages use 2MB
  # hugepages (due to aggressive huge=always) then files consume 8GB,
  # which hits memcg 1000 MB limit.
  for i in {1..4000}; do
echo . > $FS/$i
  done

v5.4 commit 87eaceb3faa5 ("mm: thp: make deferred split shrinker memcg
aware") maintains the per-node per-memcg shrinker bitmap for THP
shrinker.  But there's no such logic in shmem.  Make shmem set the
per-memcg per-node shrinker bits when it modifies inodes to have
shrinkable pages.

Fixes: b0dedc49a2da ("mm/vmscan.c: iterate only over charged shrinkers during 
memcg shrink_slab()")
Cc:  # 4.19+
Signed-off-by: Greg Thelen 
---
 mm/shmem.c | 61 +++---
 1 file changed, 35 insertions(+), 26 deletions(-)

diff --git a/mm/shmem.c b/mm/shmem.c
index bd8840082c94..e11090f78cb5 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1002,6 +1002,33 @@ static int shmem_getattr(const struct path *path, struct 
kstat *stat,
return 0;
 }
 
+/*
+ * Expose inode and optional page to shrinker as having a possibly splittable
+ * hugepage that reaches beyond i_size.
+ */
+static void shmem_shrinker_add(struct shmem_sb_info *sbinfo,
+  struct inode *inode, struct page *page)
+{
+   struct shmem_inode_info *info = SHMEM_I(inode);
+
+   spin_lock(&sbinfo->shrinklist_lock);
+   /*
+* _careful to defend against unlocked access to ->shrink_list in
+* shmem_unused_huge_shrink()
+*/
+   if (list_empty_careful(&info->shrinklist)) {
+   list_add_tail(&info->shrinklist, &sbinfo->shrinklist);
+   sbinfo->shrinklist_len++;
+   }
+   spin_unlock(&sbinfo->shrinklist_lock);
+
+#ifdef CONFIG_MEMCG
+   if (page && PageTransHuge(page))
+   memcg_set_shrinker_bit(page->mem_cgroup, page_to_nid(page),
+  inode->i_sb->s_shrink.id);
+#endif
+}
+
 static int shmem_setattr(struct dentry *dentry, struct iattr *attr)
 {
struct inode *inode = d_inode(dentry);
@@ -1048,17 +1075,13 @@ static int shmem_setattr(struct dentry *dentry, struct 
iattr *attr)
 * to shrink under memory pressure.
 */
if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE)) {
-   spin_lock(&sbinfo->shrinklist_lock);
-   /*
-* _careful to defend against unlocked access to
-* ->shrink_list in shmem_unused_huge_shrink()
-*/
-   if (list_empty_careful(&info->shrinklist)) {
-   list_add_tail(&info->shrinklist,
-   &sbinfo->shrinklist);
-   sbinfo->shrinklist_len++;
-   }
-   spin_unlock(&sbinfo->shrinklist_lock);
+   struct page *page;
+
+   page = find_get_page(inode->i_mapping,
+   (newsize & HPAGE_PMD_MASK) >> 
PAGE_SHIFT);
+   shmem_shrinker_add(sbinfo, inode, page);
+   if (page)
+   put_page(page);
}
}
}
@@ -1889,21 +1912,7 @@ static int shmem_getpage_gfp(struct inode *inode, 
pgoff_t index,
if (PageTransHuge(page) &&
DIV_ROUND_UP(i_size_read(inode), PAGE_SIZE) <
hindex + HPAGE_PMD_NR - 1) {
-   /*
-* Part of the huge page is beyond i_siz

[PATCH] kbuild: clean compressed initramfs image

2019-07-22 Thread Greg Thelen
Since commit 9e3596b0c653 ("kbuild: initramfs cleanup, set target from
Kconfig") "make clean" leaves behind compressed initramfs images.
Example:
  $ make defconfig
  $ sed -i 
's|CONFIG_INITRAMFS_SOURCE=""|CONFIG_INITRAMFS_SOURCE="/tmp/ir.cpio"|' .config
  $ make olddefconfig
  $ make -s
  $ make -s clean
  $ git clean -ndxf | grep initramfs
  Would remove usr/initramfs_data.cpio.gz

clean rules do not have CONFIG_* context so they do not know which
compression format was used.  Thus they don't know which files to
delete.

Tell clean to delete all possible compression formats.

Once patched usr/initramfs_data.cpio.gz and friends are deleted by
"make clean".

Fixes: 9e3596b0c653 ("kbuild: initramfs cleanup, set target from Kconfig")
Signed-off-by: Greg Thelen 
---
 usr/Makefile | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/usr/Makefile b/usr/Makefile
index 6a89eb019275..e6f7cb2f81db 100644
--- a/usr/Makefile
+++ b/usr/Makefile
@@ -11,6 +11,9 @@ datafile_y = initramfs_data.cpio$(suffix_y)
 datafile_d_y = .$(datafile_y).d
 AFLAGS_initramfs_data.o += -DINITRAMFS_IMAGE="usr/$(datafile_y)"
 
+# clean rules do not have CONFIG_INITRAMFS_COMPRESSION.  So clean up after all
+# possible compression formats.
+clean-files += initramfs_data.cpio*
 
 # Generate builtin.o based on initramfs_data.o
 obj-$(CONFIG_BLK_DEV_INITRD) := initramfs_data.o
-- 
2.22.0.657.g960e92d24f-goog



Re: [PATCH v4 0/7] mm: reparent slab memory on cgroup removal

2019-06-05 Thread Greg Thelen
Roman Gushchin  wrote:

> # Why do we need this?
>
> We've noticed that the number of dying cgroups is steadily growing on most
> of our hosts in production. The following investigation revealed an issue
> in userspace memory reclaim code [1], accounting of kernel stacks [2],
> and also the main reason: slab objects.
>
> The underlying problem is quite simple: any page charged
> to a cgroup holds a reference to it, so the cgroup can't be reclaimed unless
> all charged pages are gone. If a slab object is actively used by other 
> cgroups,
> it won't be reclaimed, and will prevent the origin cgroup from being 
> reclaimed.
>
> Slab objects, and first of all the vfs cache, are shared between cgroups which
> are using the same underlying fs, and what's even more important, they're shared
> between multiple generations of the same workload. So if something is running
> periodically every time in a new cgroup (like how systemd works), we do
> accumulate multiple dying cgroups.
>
> Strictly speaking pagecache isn't different here, but there is a key 
> difference:
> we disable protection and apply some extra pressure on LRUs of dying cgroups,
> and these LRUs contain all charged pages.
> My experiments show that with the disabled kernel memory accounting the number
> of dying cgroups stabilizes at a relatively small number (~100, depends on
> memory pressure and cgroup creation rate), and with kernel memory accounting
> it grows pretty steadily up to several thousands.
>
> Memory cgroups are quite complex and big objects (mostly due to percpu stats),
> so it leads to noticeable memory losses. Memory occupied by dying cgroups
> is measured in hundreds of megabytes. I've even seen a host with more than 
> 100Gb
> of memory wasted for dying cgroups. It leads to a degradation of performance
> with the uptime, and generally limits the usage of cgroups.
>
> My previous attempt [3] to fix the problem by applying extra pressure on slab
> shrinker lists caused a regressions with xfs and ext4, and has been reverted 
> [4].
> The following attempts to find the right balance [5, 6] were not successful.
>
> So instead of trying to find a maybe non-existing balance, let's do reparent
> the accounted slabs to the parent cgroup on cgroup removal.
>
>
> # Implementation approach
>
> There is however a significant problem with reparenting of slab memory:
> there is no list of charged pages. Some of them are in shrinker lists,
> but not all. Introducing of a new list is really not an option.
>
> But fortunately there is a way forward: every slab page has a stable pointer
> to the corresponding kmem_cache. So the idea is to reparent kmem_caches
> instead of slab pages.
>
> It's actually simpler and cheaper, but requires some underlying changes:
> 1) Make kmem_caches to hold a single reference to the memory cgroup,
>instead of a separate reference per every slab page.
> 2) Stop setting page->mem_cgroup pointer for memcg slab pages and use
>page->kmem_cache->memcg indirection instead. It's used only on
>slab page release, so it shouldn't be a big issue.
> 3) Introduce a refcounter for non-root slab caches. It's required to
>be able to destroy kmem_caches when they become empty and release
>the associated memory cgroup.
>
> There is a bonus: currently we do release empty kmem_caches on cgroup
> removal, however all other are waiting for the releasing of the memory cgroup.
> These refactorings allow kmem_caches to be released as soon as they
> become inactive and free.
>
> Some additional implementation details are provided in corresponding
> commit messages.
>
> # Results
>
> Below is the average number of dying cgroups on two groups of our production
> hosts. They do run some sort of web frontend workload, the memory pressure
> is moderate. As we can see, with the kernel memory reparenting the number
> stabilizes in 60s range; however with the original version it grows almost
> linearly and doesn't show any signs of plateauing. The difference in slab
> and percpu usage between patched and unpatched versions also grows linearly.
> In 7 days it exceeded 200Mb.
>
> day              0    1    2    3    4    5    6    7
> original        56  362  628  752 1070 1250 1490 1560
> patched         23   46   51   55   60   57   67   69
> mem diff(Mb)    22   74  123  152  164  182  214  241

No objection to the idea, but a question...

In the patched kernel, does slabinfo (or similar) show the list of reparented
slab caches?  A pile of zombie kmem_caches is certainly better than a
pile of zombie mem_cgroup.  But it still seems like it might cause
degradation - does cache_reap() walk an ever growing set of zombie
caches?

We've found it useful to add a slabinfo_full file which includes zombie
kmem_cache with their memcg_name.  This can help hunt down zombies.

> # History
>
> v4:
>   1) removed excessive memcg != parent check in memcg_deactivate_kmem_caches()
>   2) fixed rcu_read_lock() usage in memcg_charge_slab()
>   3) fixed synchronization 

Re: [PATCH] writeback: sum memcg dirty counters as needed

2019-03-29 Thread Greg Thelen
Johannes Weiner  wrote:

> On Thu, Mar 07, 2019 at 08:56:32AM -0800, Greg Thelen wrote:
>> --- a/mm/memcontrol.c
>> +++ b/mm/memcontrol.c
>> @@ -3880,6 +3880,7 @@ struct wb_domain *mem_cgroup_wb_domain(struct 
>> bdi_writeback *wb)
>>   * @pheadroom: out parameter for number of allocatable pages according to 
>> memcg
>>   * @pdirty: out parameter for number of dirty pages
>>   * @pwriteback: out parameter for number of pages under writeback
>> + * @exact: determines whether exact counters are required, indicating more work.
>>   *
>>   * Determine the numbers of file, headroom, dirty, and writeback pages in
>>   * @wb's memcg.  File, dirty and writeback are self-explanatory.  Headroom
>> @@ -3890,18 +3891,29 @@ struct wb_domain *mem_cgroup_wb_domain(struct 
>> bdi_writeback *wb)
>>   * ancestors.  Note that this doesn't consider the actual amount of
>>   * available memory in the system.  The caller should further cap
>>   * *@pheadroom accordingly.
>> + *
>> + * Return value is the error precision associated with *@pdirty
>> + * and *@pwriteback.  When @exact is set this is a minimal value.
>>   */
>> -void mem_cgroup_wb_stats(struct bdi_writeback *wb, unsigned long 
>> *pfilepages,
>> - unsigned long *pheadroom, unsigned long *pdirty,
>> - unsigned long *pwriteback)
>> +unsigned long
>> +mem_cgroup_wb_stats(struct bdi_writeback *wb, unsigned long *pfilepages,
>> +unsigned long *pheadroom, unsigned long *pdirty,
>> +unsigned long *pwriteback, bool exact)
>>  {
>>  struct mem_cgroup *memcg = mem_cgroup_from_css(wb->memcg_css);
>>  struct mem_cgroup *parent;
>> +unsigned long precision;
>>  
>> -*pdirty = memcg_page_state(memcg, NR_FILE_DIRTY);
>> -
>> +if (exact) {
>> +precision = 0;
>> +*pdirty = memcg_exact_page_state(memcg, NR_FILE_DIRTY);
>> +*pwriteback = memcg_exact_page_state(memcg, NR_WRITEBACK);
>> +} else {
>> +precision = MEMCG_CHARGE_BATCH * num_online_cpus();
>> +*pdirty = memcg_page_state(memcg, NR_FILE_DIRTY);
>> +*pwriteback = memcg_page_state(memcg, NR_WRITEBACK);
>> +}
>>  /* this should eventually include NR_UNSTABLE_NFS */
>> -*pwriteback = memcg_page_state(memcg, NR_WRITEBACK);
>>  *pfilepages = mem_cgroup_nr_lru_pages(memcg, (1 << LRU_INACTIVE_FILE) |
>>   (1 << LRU_ACTIVE_FILE));
>>  *pheadroom = PAGE_COUNTER_MAX;
>> @@ -3913,6 +3925,8 @@ void mem_cgroup_wb_stats(struct bdi_writeback *wb, 
>> unsigned long *pfilepages,
>>  *pheadroom = min(*pheadroom, ceiling - min(ceiling, used));
>>  memcg = parent;
>>  }
>> +
>> +return precision;
>
> Have you considered unconditionally using the exact version here?
>
> It does for_each_online_cpu(), but until very, very recently we did
> this per default for all stats, for years. It only became a problem in
> conjunction with the for_each_memcg loops when frequently reading
> memory stats at the top of a very large hierarchy.
>
> balance_dirty_pages() is called against memcgs that actually own the
> inodes/memory and doesn't do the additional recursive tree collection.
>
> It's also not *that* hot of a function, and in the io path...
>
> It would simplify this patch immensely.

Good idea.  Done in -v2 of the patch:
https://lore.kernel.org/lkml/20190329174609.164344-1-gthe...@google.com/
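
For the record, v2 drops the @exact argument and the precision return value;
mem_cgroup_wb_stats() just always collects the exact counters.  Roughly (a
sketch of the v2 shape, not the literal hunk):

  *pdirty = memcg_exact_page_state(memcg, NR_FILE_DIRTY);
  /* this should eventually include NR_UNSTABLE_NFS */
  *pwriteback = memcg_exact_page_state(memcg, NR_WRITEBACK);

with memcg_exact_page_state() moved into mm/memcontrol.c as an out-of-line
helper.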


Re: [PATCH] writeback: sum memcg dirty counters as needed

2019-03-29 Thread Greg Thelen
Andrew Morton  wrote:

> On Thu,  7 Mar 2019 08:56:32 -0800 Greg Thelen  wrote:
>
>> Since commit a983b5ebee57 ("mm: memcontrol: fix excessive complexity in
>> memory.stat reporting") memcg dirty and writeback counters are managed
>> as:
>> 1) per-memcg per-cpu values in range of [-32..32]
>> 2) per-memcg atomic counter
>> When a per-cpu counter cannot fit in [-32..32] it's flushed to the
>> atomic.  Stat readers only check the atomic.
>> Thus readers such as balance_dirty_pages() may see a nontrivial error
>> margin: 32 pages per cpu.
>> Assuming 100 cpus:
>>4k x86 page_size:  13 MiB error per memcg
>>   64k ppc page_size: 200 MiB error per memcg
>> Considering that dirty+writeback are used together for some decisions
>> the errors double.
>> 
>> This inaccuracy can lead to undeserved oom kills.  One nasty case is
>> when all per-cpu counters hold positive values offsetting an atomic
>> negative value (i.e. per_cpu[*]=32, atomic=n_cpu*-32).
>> balance_dirty_pages() only consults the atomic and does not consider
>> throttling the next n_cpu*32 dirty pages.  If the file_lru is in the
>> 13..200 MiB range then there's absolutely no dirty throttling, which
>> burdens vmscan with only dirty+writeback pages thus resorting to oom
>> kill.
>> 
>> It could be argued that tiny containers are not supported, but it's more
>> subtle.  It's the amount of space available for file lru that matters.
>> If a container has memory.max-200MiB of non reclaimable memory, then it
>> will also suffer such oom kills on a 100 cpu machine.
>> 
>> ...
>> 
>> Make balance_dirty_pages() and wb_over_bg_thresh() work harder to
>> collect exact per memcg counters when a memcg is close to the
>> throttling/writeback threshold.  This avoids the aforementioned oom
>> kills.
>> 
>> This does not affect the overhead of memory.stat, which still reads the
>> single atomic counter.
>> 
>> Why not use percpu_counter?  memcg already handles cpus going offline,
>> so no need for that overhead from percpu_counter.  And the
>> percpu_counter spinlocks are more heavyweight than is required.
>> 
>> It probably also makes sense to include exact dirty and writeback
>> counters in memcg oom reports.  But that is saved for later.
>
> Nice changelog, thanks.
>
>> Signed-off-by: Greg Thelen 
>
> Did you consider cc:stable for this?  We may as well - the stablebots
> backport everything which might look slightly like a fix anyway :(

Good idea.  Done in -v2 of the patch.

>> --- a/include/linux/memcontrol.h
>> +++ b/include/linux/memcontrol.h
>> @@ -573,6 +573,22 @@ static inline unsigned long memcg_page_state(struct 
>> mem_cgroup *memcg,
>>  return x;
>>  }
>>  
>> +/* idx can be of type enum memcg_stat_item or node_stat_item */
>> +static inline unsigned long
>> +memcg_exact_page_state(struct mem_cgroup *memcg, int idx)
>> +{
>> +long x = atomic_long_read(&memcg->stat[idx]);
>> +#ifdef CONFIG_SMP
>> +int cpu;
>> +
>> +for_each_online_cpu(cpu)
>> +x += per_cpu_ptr(memcg->stat_cpu, cpu)->count[idx];
>> +if (x < 0)
>> +x = 0;
>> +#endif
>> +return x;
>> +}
>
> This looks awfully heavyweight for an inline function.  Why not make it
> a regular function and avoid the bloat and i-cache consumption?

Done in -v2.

> Also, did you instead consider making this spill the percpu counters
> into memcg->stat[idx]?  That might be more useful for potential future
> callers.  It would become a little more expensive though.

I looked at that approach, but couldn't convince myself it was safe.  I
kept staring at "Remote [...] Write accesses can cause unique problems
due to the relaxed synchronization requirements for this_cpu
operations." from this_cpu_ops.txt.  So I'd like to delay this possible
optimization for a later patch.
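
For illustration, the sort of spill I was worried about would be something
like this (hypothetical sketch, not proposed code):

  int cpu;

  for_each_online_cpu(cpu) {
          long *pc = &per_cpu_ptr(memcg->stat_cpu, cpu)->count[idx];

          /* racy: the owning cpu may be mid __this_cpu_add() on *pc */
          atomic_long_add(xchg(pc, 0), &memcg->stat[idx]);
  }

The remote xchg() itself is atomic, but the owning cpu's this_cpu update is a
plain read-modify-write, so counts can be lost or double counted.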


[PATCH v2] writeback: use exact memcg dirty counts

2019-03-29 Thread Greg Thelen
Since commit a983b5ebee57 ("mm: memcontrol: fix excessive complexity in
memory.stat reporting") memcg dirty and writeback counters are managed
as:
1) per-memcg per-cpu values in range of [-32..32]
2) per-memcg atomic counter
When a per-cpu counter cannot fit in [-32..32] it's flushed to the
atomic.  Stat readers only check the atomic.
Thus readers such as balance_dirty_pages() may see a nontrivial error
margin: 32 pages per cpu.
Assuming 100 cpus:
   4k x86 page_size:  13 MiB error per memcg
  64k ppc page_size: 200 MiB error per memcg
Considering that dirty+writeback are used together for some decisions
the errors double.
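
For reference, the arithmetic behind those figures:
  32 pages/cpu * 100 cpus = 3200 pages
  3200 pages *  4 KiB/page = 12.5 MiB (~13 MiB)
  3200 pages * 64 KiB/page = 200 MiB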

This inaccuracy can lead to undeserved oom kills.  One nasty case is
when all per-cpu counters hold positive values offsetting an atomic
negative value (i.e. per_cpu[*]=32, atomic=n_cpu*-32).
balance_dirty_pages() only consults the atomic and does not consider
throttling the next n_cpu*32 dirty pages.  If the file_lru is in the
13..200 MiB range then there's absolutely no dirty throttling, which
burdens vmscan with only dirty+writeback pages thus resorting to oom
kill.

It could be argued that tiny containers are not supported, but it's more
subtle.  It's the amount of space available for file lru that matters.
If a container has memory.max-200MiB of non reclaimable memory, then it
will also suffer such oom kills on a 100 cpu machine.

The following test reliably ooms without this patch.  This patch avoids
oom kills.

  $ cat test
  mount -t cgroup2 none /dev/cgroup
  cd /dev/cgroup
  echo +io +memory > cgroup.subtree_control
  mkdir test
  cd test
  echo 10M > memory.max
  (echo $BASHPID > cgroup.procs && exec /memcg-writeback-stress /foo)
  (echo $BASHPID > cgroup.procs && exec dd if=/dev/zero of=/foo bs=2M count=100)

  $ cat memcg-writeback-stress.c
  /*
   * Dirty pages from all but one cpu.
   * Clean pages from the non dirtying cpu.
   * This is to stress per cpu counter imbalance.
   * On a 100 cpu machine:
   * - per memcg per cpu dirty count is 32 pages for each of 99 cpus
   * - per memcg atomic is -99*32 pages
   * - thus the complete dirty limit: sum of all counters 0
   * - balance_dirty_pages() only sees atomic count -99*32 pages, which
   *   it max()s to 0.
   * - So a workload can dirty -99*32 pages before balance_dirty_pages()
   *   cares.
   */
  #define _GNU_SOURCE
  #include <err.h>
  #include <fcntl.h>
  #include <sched.h>
  #include <stdio.h>
  #include <stdlib.h>
  #include <string.h>
  #include <sys/sysinfo.h>
  #include <sys/types.h>
  #include <unistd.h>

  static char *buf;
  static int bufSize;

  static void set_affinity(int cpu)
  {
cpu_set_t affinity;

CPU_ZERO(&affinity);
CPU_SET(cpu, &affinity);
if (sched_setaffinity(0, sizeof(affinity), &affinity))
err(1, "sched_setaffinity");
  }

  static void dirty_on(int output_fd, int cpu)
  {
int i, wrote;

set_affinity(cpu);
for (i = 0; i < 32; i++) {
for (wrote = 0; wrote < bufSize; ) {
int ret = write(output_fd, buf+wrote, bufSize-wrote);
if (ret == -1)
err(1, "write");
wrote += ret;
}
}
  }

  int main(int argc, char **argv)
  {
int cpu, flush_cpu = 1, output_fd;
const char *output;

if (argc != 2)
errx(1, "usage: output_file");

output = argv[1];
bufSize = getpagesize();
buf = malloc(getpagesize());
if (buf == NULL)
errx(1, "malloc failed");

output_fd = open(output, O_CREAT|O_RDWR);
if (output_fd == -1)
err(1, "open(%s)", output);

for (cpu = 0; cpu < get_nprocs(); cpu++) {
if (cpu != flush_cpu)
dirty_on(output_fd, cpu);
}

set_affinity(flush_cpu);
if (fsync(output_fd))
err(1, "fsync(%s)", output);
if (close(output_fd))
err(1, "close(%s)", output);
free(buf);
  }

Make balance_dirty_pages() and wb_over_bg_thresh() work harder to
collect exact per memcg counters.  This avoids the aforementioned oom
kills.

This does not affect the overhead of memory.stat, which still reads the
single atomic counter.

Why not use percpu_counter?  memcg already handles cpus going offline,
so no need for that overhead from percpu_counter.  And the
percpu_counter spinlocks are more heavyweight than is required.

It probably also makes sense to use exact dirty and writeback counters
in memcg oom reports.  But that is saved for later.

Cc: sta...@vger.kernel.org # v4.16+
Signed-off-by: Greg Thelen 
---
Changelog since v1:
- Move memcg_exact_page_state() into memcontrol.c.
- Unconditionally gather exact (per cpu) counters in mem_cgroup_wb_stats(), it's
  not called in performance sensitive paths.
- Unconditionally check for underflow regardless of CONFIG_SMP.  It's just
  easier th

Re: [PATCH] writeback: sum memcg dirty counters as needed

2019-03-27 Thread Greg Thelen
On Fri, Mar 22, 2019 at 11:15 AM Roman Gushchin  wrote:
>
> On Thu, Mar 07, 2019 at 08:56:32AM -0800, Greg Thelen wrote:
> > Since commit a983b5ebee57 ("mm: memcontrol: fix excessive complexity in
> > memory.stat reporting") memcg dirty and writeback counters are managed
> > as:
> > 1) per-memcg per-cpu values in range of [-32..32]
> > 2) per-memcg atomic counter
> > When a per-cpu counter cannot fit in [-32..32] it's flushed to the
> > atomic.  Stat readers only check the atomic.
> > Thus readers such as balance_dirty_pages() may see a nontrivial error
> > margin: 32 pages per cpu.
> > Assuming 100 cpus:
> >4k x86 page_size:  13 MiB error per memcg
> >   64k ppc page_size: 200 MiB error per memcg
> > Considering that dirty+writeback are used together for some decisions
> > the errors double.
> >
> > This inaccuracy can lead to undeserved oom kills.  One nasty case is
> > when all per-cpu counters hold positive values offsetting an atomic
> > negative value (i.e. per_cpu[*]=32, atomic=n_cpu*-32).
> > balance_dirty_pages() only consults the atomic and does not consider
> > throttling the next n_cpu*32 dirty pages.  If the file_lru is in the
> > 13..200 MiB range then there's absolutely no dirty throttling, which
> > burdens vmscan with only dirty+writeback pages thus resorting to oom
> > kill.
> >
> > It could be argued that tiny containers are not supported, but it's more
> > subtle.  It's the amount of space available for file lru that matters.
> > If a container has memory.max-200MiB of non reclaimable memory, then it
> > will also suffer such oom kills on a 100 cpu machine.
> >
> > The following test reliably ooms without this patch.  This patch avoids
> > oom kills.
> >
> > ...
> >
> > Make balance_dirty_pages() and wb_over_bg_thresh() work harder to
> > collect exact per memcg counters when a memcg is close to the
> > throttling/writeback threshold.  This avoids the aforementioned oom
> > kills.
> >
> > This does not affect the overhead of memory.stat, which still reads the
> > single atomic counter.
> >
> > Why not use percpu_counter?  memcg already handles cpus going offline,
> > so no need for that overhead from percpu_counter.  And the
> > percpu_counter spinlocks are more heavyweight than is required.
> >
> > It probably also makes sense to include exact dirty and writeback
> > counters in memcg oom reports.  But that is saved for later.
> >
> > Signed-off-by: Greg Thelen 
> > ---
> >  include/linux/memcontrol.h | 33 +
> >  mm/memcontrol.c| 26 --
> >  mm/page-writeback.c| 27 +--
> >  3 files changed, 66 insertions(+), 20 deletions(-)
> >
> > diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> > index 83ae11cbd12c..6a133c90138c 100644
> > --- a/include/linux/memcontrol.h
> > +++ b/include/linux/memcontrol.h
> > @@ -573,6 +573,22 @@ static inline unsigned long memcg_page_state(struct 
> > mem_cgroup *memcg,
> >   return x;
> >  }
>
> Hi Greg!
>
> Thank you for the patch, definitely a good problem to be fixed!
>
> >
> > +/* idx can be of type enum memcg_stat_item or node_stat_item */
> > +static inline unsigned long
> > +memcg_exact_page_state(struct mem_cgroup *memcg, int idx)
> > +{
> > + long x = atomic_long_read(&memcg->stat[idx]);
> > +#ifdef CONFIG_SMP
>
> I doubt that this #ifdef is correct without corresponding changes
> in __mod_memcg_state(). As now, we do use per-cpu buffer which spills
> to an atomic value even if !CONFIG_SMP. It's probably something
> that we want to change, but as now, #ifdef CONFIG_SMP should protect
> only "if (x < 0)" part.

Ack.  I'll fix it.

> > + int cpu;
> > +
> > + for_each_online_cpu(cpu)
> > + x += per_cpu_ptr(memcg->stat_cpu, cpu)->count[idx];
> > + if (x < 0)
> > + x = 0;
> > +#endif
> > + return x;
> > +}
>
> Also, isn't it worth it to generalize memcg_page_state() instead?
> By adding an bool exact argument? I believe dirty balance is not
> the only place, where we need a better accuracy.

Nod.  I'll provide a more general version of memcg_page_state().  I'm
testing updated (forthcoming v2) patch set now with feedback from
Andrew and Roman.
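
Roughly what I have in mind for the generalized helper (sketch only; the
naming, placement, and whether it replaces the existing inline
memcg_page_state() are still to be settled in v2):

  /* idx can be of type enum memcg_stat_item or node_stat_item */
  unsigned long memcg_page_state(struct mem_cgroup *memcg, int idx, bool exact)
  {
          long x = atomic_long_read(&memcg->stat[idx]);

          if (exact) {
                  int cpu;

                  for_each_online_cpu(cpu)
                          x += per_cpu_ptr(memcg->stat_cpu, cpu)->count[idx];
          }
          if (x < 0)
                  x = 0;
          return x;
  }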


[PATCH] writeback: sum memcg dirty counters as needed

2019-03-07 Thread Greg Thelen
Since commit a983b5ebee57 ("mm: memcontrol: fix excessive complexity in
memory.stat reporting") memcg dirty and writeback counters are managed
as:
1) per-memcg per-cpu values in range of [-32..32]
2) per-memcg atomic counter
When a per-cpu counter cannot fit in [-32..32] it's flushed to the
atomic.  Stat readers only check the atomic.
Thus readers such as balance_dirty_pages() may see a nontrivial error
margin: 32 pages per cpu.
Assuming 100 cpus:
   4k x86 page_size:  13 MiB error per memcg
  64k ppc page_size: 200 MiB error per memcg
Considering that dirty+writeback are used together for some decisions
the errors double.

This inaccuracy can lead to undeserved oom kills.  One nasty case is
when all per-cpu counters hold positive values offsetting an atomic
negative value (i.e. per_cpu[*]=32, atomic=n_cpu*-32).
balance_dirty_pages() only consults the atomic and does not consider
throttling the next n_cpu*32 dirty pages.  If the file_lru is in the
13..200 MiB range then there's absolutely no dirty throttling, which
burdens vmscan with only dirty+writeback pages thus resorting to oom
kill.

It could be argued that tiny containers are not supported, but it's more
subtle.  It's the amount of space available for file lru that matters.
If a container has memory.max-200MiB of non reclaimable memory, then it
will also suffer such oom kills on a 100 cpu machine.

The following test reliably ooms without this patch.  This patch avoids
oom kills.

  $ cat test
  mount -t cgroup2 none /dev/cgroup
  cd /dev/cgroup
  echo +io +memory > cgroup.subtree_control
  mkdir test
  cd test
  echo 10M > memory.max
  (echo $BASHPID > cgroup.procs && exec /memcg-writeback-stress /foo)
  (echo $BASHPID > cgroup.procs && exec dd if=/dev/zero of=/foo bs=2M count=100)

  $ cat memcg-writeback-stress.c
  /*
   * Dirty pages from all but one cpu.
   * Clean pages from the non dirtying cpu.
   * This is to stress per cpu counter imbalance.
   * On a 100 cpu machine:
   * - per memcg per cpu dirty count is 32 pages for each of 99 cpus
   * - per memcg atomic is -99*32 pages
   * - thus the complete dirty limit: sum of all counters 0
   * - balance_dirty_pages() only sees atomic count -99*32 pages, which
   *   it max()s to 0.
   * - So a workload can dirty -99*32 pages before balance_dirty_pages()
   *   cares.
   */
  #define _GNU_SOURCE
  #include <err.h>
  #include <fcntl.h>
  #include <sched.h>
  #include <stdio.h>
  #include <stdlib.h>
  #include <string.h>
  #include <sys/sysinfo.h>
  #include <sys/types.h>
  #include <unistd.h>

  static char *buf;
  static int bufSize;

  static void set_affinity(int cpu)
  {
cpu_set_t affinity;

CPU_ZERO(&affinity);
CPU_SET(cpu, &affinity);
if (sched_setaffinity(0, sizeof(affinity), &affinity))
err(1, "sched_setaffinity");
  }

  static void dirty_on(int output_fd, int cpu)
  {
int i, wrote;

set_affinity(cpu);
for (i = 0; i < 32; i++) {
for (wrote = 0; wrote < bufSize; ) {
int ret = write(output_fd, buf+wrote, bufSize-wrote);
if (ret == -1)
err(1, "write");
wrote += ret;
}
}
  }

  int main(int argc, char **argv)
  {
int cpu, flush_cpu = 1, output_fd;
const char *output;

if (argc != 2)
errx(1, "usage: output_file");

output = argv[1];
bufSize = getpagesize();
buf = malloc(getpagesize());
if (buf == NULL)
errx(1, "malloc failed");

output_fd = open(output, O_CREAT|O_RDWR);
if (output_fd == -1)
err(1, "open(%s)", output);

for (cpu = 0; cpu < get_nprocs(); cpu++) {
if (cpu != flush_cpu)
dirty_on(output_fd, cpu);
}

set_affinity(flush_cpu);
if (fsync(output_fd))
err(1, "fsync(%s)", output);
if (close(output_fd))
err(1, "close(%s)", output);
free(buf);
  }

Make balance_dirty_pages() and wb_over_bg_thresh() work harder to
collect exact per memcg counters when a memcg is close to the
throttling/writeback threshold.  This avoids the aforementioned oom
kills.

This does not affect the overhead of memory.stat, which still reads the
single atomic counter.

Why not use percpu_counter?  memcg already handles cpus going offline,
so no need for that overhead from percpu_counter.  And the
percpu_counter spinlocks are more heavyweight than is required.

It probably also makes sense to include exact dirty and writeback
counters in memcg oom reports.  But that is saved for later.

Signed-off-by: Greg Thelen 
---
 include/linux/memcontrol.h | 33 +
 mm/memcontrol.c| 26 --
 mm/page-writeback.c| 27 +--
 3 files changed, 66 insertions(+), 20 deletions(-)

diff --git 

[PATCH] writeback: fix inode cgroup switching comment

2019-03-04 Thread Greg Thelen
Commit 682aa8e1a6a1 ("writeback: implement unlocked_inode_to_wb
transaction and use it for stat updates") refers to
inode_switch_wb_work_fn() which never got merged.  Switch the comments
to inode_switch_wbs_work_fn().

Fixes: 682aa8e1a6a1 ("writeback: implement unlocked_inode_to_wb transaction and 
use it for stat updates")
Signed-off-by: Greg Thelen 
---
 include/linux/backing-dev.h | 2 +-
 include/linux/fs.h  | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
index c28a47cbe355..f9b029180241 100644
--- a/include/linux/backing-dev.h
+++ b/include/linux/backing-dev.h
@@ -365,7 +365,7 @@ unlocked_inode_to_wb_begin(struct inode *inode, struct 
wb_lock_cookie *cookie)
rcu_read_lock();
 
/*
-* Paired with store_release in inode_switch_wb_work_fn() and
+* Paired with store_release in inode_switch_wbs_work_fn() and
 * ensures that we see the new wb if we see cleared I_WB_SWITCH.
 */
cookie->locked = smp_load_acquire(&inode->i_state) & I_WB_SWITCH;
diff --git a/include/linux/fs.h b/include/linux/fs.h
index fd423fec8d83..08f26046233e 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2091,7 +2091,7 @@ static inline void init_sync_kiocb(struct kiocb *kiocb, 
struct file *filp)
  * I_WB_SWITCH Cgroup bdi_writeback switching in progress.  Used to
  * synchronize competing switching instances and to tell
  * wb stat updates to grab the i_pages lock.  See
- * inode_switch_wb_work_fn() for details.
+ * inode_switch_wbs_work_fn() for details.
  *
  * I_OVL_INUSE Used by overlayfs to get exclusive ownership on upper
  * and work dirs among overlayfs mounts.
-- 
2.21.0.352.gf09ad66450-goog



Re: [RFC PATCH 0/3] mm: memcontrol: delayed force empty

2019-01-04 Thread Greg Thelen
Yang Shi  wrote:

> On 1/3/19 11:23 AM, Michal Hocko wrote:
>> On Thu 03-01-19 11:10:00, Yang Shi wrote:
>>>
>>> On 1/3/19 10:53 AM, Michal Hocko wrote:
 On Thu 03-01-19 10:40:54, Yang Shi wrote:
> On 1/3/19 10:13 AM, Michal Hocko wrote:
>> [...]
>> Is there any reason for your scripts to be strictly sequential here? In
>> other words why cannot you offload those expensive operations to a
>> detached context in _userspace_?
> I would say it has not to be strictly sequential. The above script is just
> an example to illustrate the pattern. But, sometimes it may hit such 
> pattern
> due to the complicated cluster scheduling and container scheduling in the
> production environment, for example the creation process might be 
> scheduled
> to the same CPU which is doing force_empty. I have to say I don't know too
> much about the internals of the container scheduling.
 In that case I do not see a strong reason to implement the offloding
 into the kernel. It is an additional code and semantic to maintain.
>>> Yes, it does introduce some additional code and semantic, but IMHO, it is
>>> quite simple and very straight forward, isn't it? Just utilize the existing
>>> css offline worker. And, that a couple of lines of code do improve some
>>> throughput issues for some real usecases.
>> I do not really care it is few LOC. It is more important that it is
>> conflating force_empty into offlining logic. There was a good reason to
>> remove reparenting/emptying the memcg during the offline. Considering
>> that you can offload force_empty from userspace trivially then I do not
>> see any reason to implement it in the kernel.
>
> Er, I may not articulate in the earlier email, force_empty can not be 
> offloaded from userspace *trivially*. IOWs the container scheduler may 
> unexpectedly overcommit something due to the stall of synchronous force 
> empty, which can't be figured out by userspace before it actually 
> happens. The scheduler doesn't know how long force_empty would take. If 
> the force_empty could be offloaded by kernel, it would make scheduler's 
> life much easier. This is not something userspace could do.

If kernel workqueues are doing more work (i.e. force_empty processing),
then it seems like the time to offline could grow.  I'm not sure if
that's important.

I assume that if we make force_empty an async side effect of rmdir then
user space scheduler would not be able to immediately assume the
rmdir'd container memory is available without subjecting a new container
to direct reclaim.  So it seems like user space would use a mechanism to
wait for reclaim: either the existing sync force_empty or polling
meminfo/etc waiting for free memory to appear.

 I think it is more important to discuss whether we want to introduce
 force_empty in cgroup v2.
>>> We would prefer have it in v2 as well.
>> Then bring this up in a separate email thread please.
>
> Sure. Will prepare the patches later.
>
> Thanks,
> Yang


Re: [PATCH] memcg, oom: move out_of_memory back to the charge path

2018-07-03 Thread Greg Thelen
Michal Hocko  wrote:

> On Tue 03-07-18 00:08:05, Greg Thelen wrote:
>> Michal Hocko  wrote:
>> 
>> > On Fri 29-06-18 11:59:04, Greg Thelen wrote:
>> >> Michal Hocko  wrote:
>> >> 
>> >> > On Thu 28-06-18 16:19:07, Greg Thelen wrote:
>> >> >> Michal Hocko  wrote:
>> >> > [...]
>> >> >> > +if (mem_cgroup_out_of_memory(memcg, mask, order))
>> >> >> > +return OOM_SUCCESS;
>> >> >> > +
>> >> >> > +WARN(1,"Memory cgroup charge failed because of no reclaimable 
>> >> >> > memory! "
>> >> >> > +"This looks like a misconfiguration or a kernel bug.");
>> >> >> 
>> >> >> I'm not sure here if the warning should here or so strongly worded.  It
>> >> >> seems like the current task could be oom reaped with MMF_OOM_SKIP and
>> >> >> thus mem_cgroup_out_of_memory() will return false.  So there's nothing
>> >> >> alarming in that case.
>> >> >
>> >> > If the task is reaped then its charges should be released as well and
>> >> > that means that we should get below the limit. Sure there is some room
>> >> > for races but this should be still unlikely. Maybe I am just
>> >> > underestimating though.
>> >> >
>> >> > What would you suggest instead?
>> >> 
>> >> I suggest checking MMF_OOM_SKIP or deleting the warning.
>> >
>> > So what do you do when you have MMF_OOM_SKIP task? Do not warn? Checking
>> > for all the tasks would be quite expensive and remembering that from the
>> > task selection not nice either. Why do you think it would help much?
>> 
>> I assume we could just check current's MMF_OOM_SKIP - no need to check
>> all tasks.
>
> I still do not follow. If you are after a single task memcg then we
> should be ok. try_charge has a runaway for oom victims
>   if (unlikely(tsk_is_oom_victim(current) ||
>fatal_signal_pending(current) ||
>current->flags & PF_EXITING))
>   goto force;
>
> regardless of MMF_OOM_SKIP. So if there is a single process in the
> memcg, we kill it and the oom reaper kicks in and sets MMF_OOM_SKIP then
> we should bail out there. Or do I miss your intention?

For a single task memcg it seems that racing process cgroup migration
could trigger the new warning (I have attempted to reproduce this):

Processes A,B in memcg M1,M2.  M1 is oom.

  Process A[M1]   Process B[M2]

  M1 is oom
  try_charge(M1)
  Move A M1=>M2
  mem_cgroup_oom()
  mem_cgroup_out_of_memory()
out_of_memory()
  select_bad_process()
sees nothing in M1
  return 0
return 0
  WARN()


Another variant might be possible, this time with global oom:

Processes A,B in memcg M1,M2.  M1 is oom.

  Process A[M1]   Process B[M2]

  try_charge()
  trigger global oom
  reaper sets A.MMF_OOM_SKIP
  mem_cgroup_oom()
  mem_cgroup_out_of_memory()
out_of_memory()
  select_bad_process()
sees nothing in M1
  return 0
return 0
  WARN()


These seem unlikely, so I'm fine with taking a wait-and-see approach.


Re: [PATCH] memcg, oom: move out_of_memory back to the charge path

2018-07-03 Thread Greg Thelen
Michal Hocko  wrote:

> On Fri 29-06-18 11:59:04, Greg Thelen wrote:
>> Michal Hocko  wrote:
>> 
>> > On Thu 28-06-18 16:19:07, Greg Thelen wrote:
>> >> Michal Hocko  wrote:
>> > [...]
>> >> > +   if (mem_cgroup_out_of_memory(memcg, mask, order))
>> >> > +   return OOM_SUCCESS;
>> >> > +
>> >> > +   WARN(1,"Memory cgroup charge failed because of no reclaimable 
>> >> > memory! "
>> >> > +   "This looks like a misconfiguration or a kernel bug.");
>> >> 
>> >> I'm not sure here if the warning should here or so strongly worded.  It
>> >> seems like the current task could be oom reaped with MMF_OOM_SKIP and
>> >> thus mem_cgroup_out_of_memory() will return false.  So there's nothing
>> >> alarming in that case.
>> >
>> > If the task is reaped then its charges should be released as well and
>> > that means that we should get below the limit. Sure there is some room
>> > for races but this should be still unlikely. Maybe I am just
>> > underestimating though.
>> >
>> > What would you suggest instead?
>> 
>> I suggest checking MMF_OOM_SKIP or deleting the warning.
>
> So what do you do when you have MMF_OOM_SKIP task? Do not warn? Checking
> for all the tasks would be quite expensive and remembering that from the
> task selection not nice either. Why do you think it would help much?

I assume we could just check current's MMF_OOM_SKIP - no need to check
all tasks.  My only (minor) objection is that the warning text suggests
misconfiguration or kernel bug, when there may be neither.
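
Concretely, something along these lines is what I mean (untested sketch;
assumes current->mm is still valid at this point in the charge path):

  if (mem_cgroup_out_of_memory(memcg, mask, order))
          return OOM_SUCCESS;

  /* an already-reaped victim can get here with nothing left to kill */
  if (current->mm && test_bit(MMF_OOM_SKIP, &current->mm->flags))
          return OOM_SUCCESS;

  WARN(1, "Memory cgroup charge failed because of no reclaimable memory! "
          "This looks like a misconfiguration or a kernel bug.");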

> I feel strongly that we have to warn when bypassing the charge limit
> during the corner case because it can lead to unexpected behavior and
> users should be aware of this fact. I am open to the wording or some
> optimizations. I would prefer the latter on top with a clear description
> how it helped in a particular case though. I would rather not over
> optimize now without any story to back it.

I'm fine with the warning.  I know enough to look at dmesg logs to take
an educated guess that the race occurred.  We can refine it later if/when the
reports start rolling in.  No change needed.


Re: [PATCH] memcg, oom: move out_of_memory back to the charge path

2018-06-29 Thread Greg Thelen
Michal Hocko  wrote:

> On Thu 28-06-18 16:19:07, Greg Thelen wrote:
>> Michal Hocko  wrote:
> [...]
>> > +  if (mem_cgroup_out_of_memory(memcg, mask, order))
>> > +  return OOM_SUCCESS;
>> > +
>> > +  WARN(1,"Memory cgroup charge failed because of no reclaimable memory! "
>> > +  "This looks like a misconfiguration or a kernel bug.");
>> 
>> I'm not sure here if the warning should here or so strongly worded.  It
>> seems like the current task could be oom reaped with MMF_OOM_SKIP and
>> thus mem_cgroup_out_of_memory() will return false.  So there's nothing
>> alarming in that case.
>
> If the task is reaped then its charges should be released as well and
> that means that we should get below the limit. Sure there is some room
> for races but this should be still unlikely. Maybe I am just
> underestimating though.
>
> What would you suggest instead?

I suggest checking MMF_OOM_SKIP or deleting the warning.  But I don't
feel strongly.


Re: [PATCH] memcg, oom: move out_of_memory back to the charge path

2018-06-28 Thread Greg Thelen
Michal Hocko  wrote:

> From: Michal Hocko 
>
> 3812c8c8f395 ("mm: memcg: do not trap chargers with full callstack on OOM")
> has changed the ENOMEM semantic of memcg charges. Rather than invoking
> the oom killer from the charging context it delays the oom killer to the
> page fault path (pagefault_out_of_memory). This in turn means that many
> users (e.g. slab or g-u-p) will get ENOMEM when the corresponding memcg
> hits the hard limit and the memcg is OOM. This behavior is
> inconsistent with !memcg case where the oom killer is invoked from the
> allocation context and the allocator keeps retrying until it succeeds.
>
> The difference in the behavior is user visible. mmap(MAP_POPULATE) might
> result in not fully populated ranges while the mmap return code doesn't
> tell that to the userspace. Random syscalls might fail with ENOMEM etc.
>
> The primary motivation of the different memcg oom semantic was the
> deadlock avoidance. Things have changed since then, though. We have
> an async oom teardown by the oom reaper now and so we do not have to
> rely on the victim to tear down its memory anymore. Therefore we can
> return to the original semantic as long as the memcg oom killer is not
> handed over to the users space.
>
> There is still one thing to be careful about here though. If the oom
> killer is not able to make any forward progress - e.g. because there is
> no eligible task to kill - then we have to bail out of the charge path
> to prevent from same class of deadlocks. We have basically two options
> here. Either we fail the charge with ENOMEM or force the charge and
> allow overcharge. The first option has been considered more harmful than
> useful because rare inconsistencies in the ENOMEM behavior is hard to
> test for and error prone. Basically the same reason why the page
> allocator doesn't fail allocations under such conditions. The latter
> might allow runaways but those should be really unlikely unless somebody
> misconfigures the system. E.g. allowing to migrate tasks away from the
> memcg to a different unlimited memcg with move_charge_at_immigrate
> disabled.
>
> Changes since rfc v1
> - s@memcg_may_oom@in_user_fault@ suggested by Johannes. It is much more
>   clear what is the purpose of the flag now
> - s@mem_cgroup_oom_enable@mem_cgroup_enter_user_fault@g
>   s@mem_cgroup_oom_disable@mem_cgroup_exit_user_fault@g as per Johannes
> - make oom_kill_disable an exceptional case because it should be rare
>   and the normal oom handling a core of the function - per Johannes
>
> Signed-off-by: Michal Hocko 

Acked-by: Greg Thelen 

Thanks!  One comment below.

> ---
>
> Hi,
> I've posted this as an RFC previously [1]. There was no fundamental
> disagreement so I've integrated all the suggested changes and tested it.
> mmap(MAP_POPULATE) hits the oom killer again rather than silently fails
> to populate the mapping on the hard limit excess. On the other hand
> g-u-p and other charge path keep the ENOMEM semantic when the memcg oom
> killer is disabled. All the forward progress guarantee relies on the oom
> reaper.
>
> Unless there are objections I think this is ready to go to mmotm and
> ready for the next merge window
>
> [1] http://lkml.kernel.org/r/20180620103736.13880-1-mho...@kernel.org
>  include/linux/memcontrol.h | 16 
>  include/linux/sched.h  |  2 +-
>  mm/memcontrol.c| 75 ++
>  mm/memory.c|  4 +-
>  4 files changed, 71 insertions(+), 26 deletions(-)
>
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 6c6fb116e925..5a69bb4026f6 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -494,16 +494,16 @@ unsigned long mem_cgroup_get_max(struct mem_cgroup 
> *memcg);
>  void mem_cgroup_print_oom_info(struct mem_cgroup *memcg,
>   struct task_struct *p);
>  
> -static inline void mem_cgroup_oom_enable(void)
> +static inline void mem_cgroup_enter_user_fault(void)
>  {
> - WARN_ON(current->memcg_may_oom);
> - current->memcg_may_oom = 1;
> + WARN_ON(current->in_user_fault);
> + current->in_user_fault = 1;
>  }
>  
> -static inline void mem_cgroup_oom_disable(void)
> +static inline void mem_cgroup_exit_user_fault(void)
>  {
> - WARN_ON(!current->memcg_may_oom);
> - current->memcg_may_oom = 0;
> + WARN_ON(!current->in_user_fault);
> + current->in_user_fault = 0;
>  }
>  
>  static inline bool task_in_memcg_oom(struct task_struct *p)
> @@ -924,11 +924,11 @@ static inline void mem_cgroup_handle_over_high(void)
>  {
>  }
>  
> -static inline void mem_cgroup_oom_enable(void)


[PATCH] writeback: update stale account_page_redirty() comment

2018-06-25 Thread Greg Thelen
commit 93f78d882865 ("writeback: move backing_dev_info->bdi_stat[] into
bdi_writeback") replaced BDI_DIRTIED with WB_DIRTIED in
account_page_redirty().  Update comment to track that change.
  BDI_DIRTIED => WB_DIRTIED
  BDI_WRITTEN => WB_WRITTEN

Signed-off-by: Greg Thelen 
---
 mm/page-writeback.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 337c6afb3345..6551d3b0dc30 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -2490,8 +2490,8 @@ EXPORT_SYMBOL(__set_page_dirty_nobuffers);
 
 /*
  * Call this whenever redirtying a page, to de-account the dirty counters
- * (NR_DIRTIED, BDI_DIRTIED, tsk->nr_dirtied), so that they match the written
- * counters (NR_WRITTEN, BDI_WRITTEN) in long term. The mismatches will lead to
+ * (NR_DIRTIED, WB_DIRTIED, tsk->nr_dirtied), so that they match the written
+ * counters (NR_WRITTEN, WB_WRITTEN) in long term. The mismatches will lead to
  * systematic errors in balanced_dirty_ratelimit and the dirty pages position
  * control.
  */
-- 
2.18.0.rc2.346.g013aa6912e-goog



Re: [PATCH v2] mm: condense scan_control

2018-06-20 Thread Greg Thelen
On Tue, May 29, 2018 at 11:12 PM Greg Thelen  wrote:
>
> Use smaller scan_control fields for order, priority, and reclaim_idx.
> Convert fields from int => s8.  All easily fit within a byte:
> * allocation order range: 0..MAX_ORDER(64?)
> * priority range: 0..12(DEF_PRIORITY)
> * reclaim_idx range:  0..6(__MAX_NR_ZONES)
>
> Since commit 6538b8ea886e ("x86_64: expand kernel stack to 16K") x86_64
> stack overflows are not an issue.  But it's inefficient to use ints.
>
> Use s8 (signed byte) rather than u8 to allow for loops like:
> do {
> ...
> } while (--sc.priority >= 0);
>
> Add BUILD_BUG_ON to verify that s8 is capable of storing max values.
>
> This reduces sizeof(struct scan_control):
> * 96 => 80 bytes (x86_64)
> * 68 => 56 bytes (i386)
>
> scan_control structure field order is changed to utilize padding.
> After this patch there is 1 bit of scan_control padding.
>
> Signed-off-by: Greg Thelen 
> Suggested-by: Matthew Wilcox 

Is there any interest in this?  Less stack usage could mean less
dcache and dtlb pressure.  But I understand if the complexity is
distasteful.

> ---
>  mm/vmscan.c | 32 
>  1 file changed, 20 insertions(+), 12 deletions(-)
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 9b697323a88c..42731faea306 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -65,12 +65,6 @@ struct scan_control {
> /* How many pages shrink_list() should reclaim */
> unsigned long nr_to_reclaim;
>
> -   /* This context's GFP mask */
> -   gfp_t gfp_mask;
> -
> -   /* Allocation order */
> -   int order;
> -
> /*
>  * Nodemask of nodes allowed by the caller. If NULL, all nodes
>  * are scanned.
> @@ -83,12 +77,6 @@ struct scan_control {
>  */
> struct mem_cgroup *target_mem_cgroup;
>
> -   /* Scan (total_size >> priority) pages at once */
> -   int priority;
> -
> -   /* The highest zone to isolate pages for reclaim from */
> -   enum zone_type reclaim_idx;
> -
> /* Writepage batching in laptop mode; RECLAIM_WRITE */
> unsigned int may_writepage:1;
>
> @@ -111,6 +99,18 @@ struct scan_control {
> /* One of the zones is ready for compaction */
> unsigned int compaction_ready:1;
>
> +   /* Allocation order */
> +   s8 order;
> +
> +   /* Scan (total_size >> priority) pages at once */
> +   s8 priority;
> +
> +   /* The highest zone to isolate pages for reclaim from */
> +   s8 reclaim_idx;
> +
> +   /* This context's GFP mask */
> +   gfp_t gfp_mask;
> +
> /* Incremented by the number of inactive pages that were scanned */
> unsigned long nr_scanned;
>
> @@ -3047,6 +3047,14 @@ unsigned long try_to_free_pages(struct zonelist 
> *zonelist, int order,
> .may_swap = 1,
> };
>
> +   /*
> +* scan_control uses s8 fields for order, priority, and reclaim_idx.
> +* Confirm they are large enough for max values.
> +*/
> +   BUILD_BUG_ON(MAX_ORDER > S8_MAX);
> +   BUILD_BUG_ON(DEF_PRIORITY > S8_MAX);
> +   BUILD_BUG_ON(MAX_NR_ZONES > S8_MAX);
> +
> /*
>  * Do not enter reclaim if fatal signal was delivered while throttled.
>  * 1 is returned so that the page allocator does not OOM kill at this
> --
> 2.17.0.921.gf22659ad46-goog
>


[PATCH] trace: fix SKIP_STACK_VALIDATION=1 build

2018-06-08 Thread Greg Thelen
Non gcc-5 builds with CONFIG_STACK_VALIDATION=y and
SKIP_STACK_VALIDATION=1 fail.
Example output:
  /bin/sh: init/.tmp_main.o: Permission denied

commit 96f60dfa5819 ("trace: Use -mcount-record for dynamic ftrace")
added a mismatched endif.  This causes cmd_objtool to be mistakenly set.

Relocate endif to balance the newly added -record-mcount check.

Fixes: 96f60dfa5819 ("trace: Use -mcount-record for dynamic ftrace")
Signed-off-by: Greg Thelen 
---
 scripts/Makefile.build | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/scripts/Makefile.build b/scripts/Makefile.build
index 34d9e9ce97c2..e7889f486ca1 100644
--- a/scripts/Makefile.build
+++ b/scripts/Makefile.build
@@ -239,6 +239,7 @@ cmd_record_mcount = 
\
 "$(CC_FLAGS_FTRACE)" ]; then   \
$(sub_cmd_record_mcount)\
fi;
+endif # -record-mcount
 endif # CONFIG_FTRACE_MCOUNT_RECORD
 
 ifdef CONFIG_STACK_VALIDATION
@@ -263,7 +264,6 @@ ifneq ($(RETPOLINE_CFLAGS),)
   objtool_args += --retpoline
 endif
 endif
-endif
 
 
 ifdef CONFIG_MODVERSIONS
-- 
2.18.0.rc1.242.g61856ae69a-goog



Re: [PATCH] iw_cxgb4: add INFINIBAND_ADDR_TRANS dependency

2018-06-07 Thread Greg Thelen
On Mon, Jun 4, 2018 at 4:07 PM Jason Gunthorpe  wrote:
>
> On Thu, May 31, 2018 at 02:40:59PM -0400, Doug Ledford wrote:
> > On Wed, 2018-05-30 at 21:03 -0700, Greg Thelen wrote:
> > > On Wed, May 30, 2018 at 4:01 PM Jason Gunthorpe  wrote:
> > >
> > > > On Thu, May 31, 2018 at 12:40:54AM +0200, Arnd Bergmann wrote:
> > > > > > On 5/30/2018 5:25 PM, Jason Gunthorpe wrote:
> > > > > > > On Wed, May 30, 2018 at 05:10:35PM -0500, Steve Wise wrote:
> > > > > > > >
> > > > > > > > On 5/30/2018 5:04 PM, Jason Gunthorpe wrote:
> > > > > > > > > On Wed, May 30, 2018 at 11:58:18PM +0200, Arnd Bergmann wrote:
> > > > > > > > > > The newly added fill_res_ep_entry function fails to link if
> > > > > > > > > > CONFIG_INFINIBAND_ADDR_TRANS is not set:
> > > > > > > > > >
> > > > > > > > > > drivers/infiniband/hw/cxgb4/restrack.o: In function `fill_res_ep_entry':
> > > > > > > > > > restrack.c:(.text+0x3cc): undefined reference to 
> > > > > > > > > > `rdma_res_to_id'
> > > > > > > > > > restrack.c:(.text+0x3d0): undefined reference to 
> > > > > > > > > > `rdma_iw_cm_id'
> > > > > > > > > >
> > > > > > > > > > This adds a Kconfig dependency for the driver.
> > > > > > > > > >
> > > > > > > > > > Fixes: 116aeb887371 ("iw_cxgb4: provide detailed provider-specific CM_ID information")
> > > > > > > > > > Signed-off-by: Arnd Bergmann 
> > > > > > > > > >  drivers/infiniband/hw/cxgb4/Kconfig | 1 +
> > > > > > > > > >  1 file changed, 1 insertion(+)
> > > > > > > > >
> > > > > > > > > Oh, I think we need to solve this with maybe a header fill null stub instead..
> > > > > > > > >
> > > > > > > > > We don't want to disable drivers just because a user 
> > > > > > > > > interface is
> > > > > > > > > disabled.
> > > > > > > > >
> > > > > > > >
> > > > > > > > Why does CONFIG_INFINIBAND_ADDR_TRANS disable building rdma_cm.ko?
> > > > > > > > That is not correct.
> > > > > > >
> > > > > > > That seems like a reasonable thing to do..
> > > > > >
> > > > > > rdma_ucm.ko is for usermode users, rdma_cm.ko is for kernel users, 
> > > > > > and
> > > > > > is required for iwarp drivers.  It seems rdma_cm.ko is not being
> > > > > > compiled if ADDR_TRANS is not set.
> > > > I think the intention was to completely disable rdma-cm, including all
> > > > support for rx'ing remote packets? Greg?
> > >
> > > Yes.  That's my goal when INFINIBAND_ADDR_TRANS is unset.
> > >
> > > > If this is required for iwarp then Arnd's patch is probably the right
> > > > way to go..
> > > > Jason
> > >
> > > Agreed.
> > > Acked-by: Greg Thelen 
> >
> > If that's the case, then there should be a NOTE: in the Kconfig that
> > disabling the connection manager completely disables iWARP hardware.
> >
> > I don't really think I like this approach though.  At a minimum if you
> > are going to make iWARP totally dependent on rdmacm, then there is zero
> > reason that iw_cm.o should be part of the obj-$(CONFIG_INFINIBAND)
> > Makefile recipe when ADDR_TRANS is disabled.
>
> This makes sense, Greg, can you send a followup patch with these
> items? I know this series has been tough..
>
> Thanks,
> Jason

Ack.  Sorry for delay.  I'll take a look at this in a day or two.  Let
me know if it's more urgent.


Re: [PATCH] iw_cxgb4: add INFINIBAND_ADDR_TRANS dependency

2018-05-30 Thread Greg Thelen
On Wed, May 30, 2018 at 4:01 PM Jason Gunthorpe  wrote:

> On Thu, May 31, 2018 at 12:40:54AM +0200, Arnd Bergmann wrote:
> > > On 5/30/2018 5:25 PM, Jason Gunthorpe wrote:
> > >> On Wed, May 30, 2018 at 05:10:35PM -0500, Steve Wise wrote:
> > >>>
> > >>> On 5/30/2018 5:04 PM, Jason Gunthorpe wrote:
> > >>>> On Wed, May 30, 2018 at 11:58:18PM +0200, Arnd Bergmann wrote:
> > >>>>> The newly added fill_res_ep_entry function fails to link if
> > >>>>> CONFIG_INFINIBAND_ADDR_TRANS is not set:
> > >>>>>
> > >>>>> drivers/infiniband/hw/cxgb4/restrack.o: In function `fill_res_ep_entry':
> > >>>>> restrack.c:(.text+0x3cc): undefined reference to `rdma_res_to_id'
> > >>>>> restrack.c:(.text+0x3d0): undefined reference to `rdma_iw_cm_id'
> > >>>>>
> > >>>>> This adds a Kconfig dependency for the driver.
> > >>>>>
> > >>>>> Fixes: 116aeb887371 ("iw_cxgb4: provide detailed provider-specific CM_ID information")
> > >>>>> Signed-off-by: Arnd Bergmann 
> > >>>>>  drivers/infiniband/hw/cxgb4/Kconfig | 1 +
> > >>>>>  1 file changed, 1 insertion(+)
> > >>>> Oh, I think we need to solve this with maybe a header fill null stub instead..
> > >>>>
> > >>>> We don't want to disable drivers just because a user interface is
> > >>>> disabled.
> > >>>>
> > >>> Why does CONFIG_INFINIBAND_ADDR_TRANS disable building rdma_cm.ko?
> > >>> That is not correct.
> > >> That seems like a reasonable thing to do..
> > > rdma_ucm.ko is for usermode users, rdma_cm.ko is for kernel users, and
> > > is required for iwarp drivers.  It seems rdma_cm.ko is not being
> > > compiled if ADDR_TRANS is not set.

> I think the intention was to completely disable rdma-cm, including all
> support for rx'ing remote packets? Greg?

Yes.  That's my goal when INFINIBAND_ADDR_TRANS is unset.

> If this is required for iwarp then Arnd's patch is probably the right
> way to go..

> Jason

Agreed.
Acked-by: Greg Thelen 


[PATCH v2] mm: condense scan_control

2018-05-30 Thread Greg Thelen
Use smaller scan_control fields for order, priority, and reclaim_idx.
Convert fields from int => s8.  All easily fit within a byte:
* allocation order range: 0..MAX_ORDER(64?)
* priority range: 0..12(DEF_PRIORITY)
* reclaim_idx range:  0..6(__MAX_NR_ZONES)

Since commit 6538b8ea886e ("x86_64: expand kernel stack to 16K") x86_64
stack overflows are not an issue.  But it's inefficient to use ints.

Use s8 (signed byte) rather than u8 to allow for loops like:
do {
...
} while (--sc.priority >= 0);
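
To see why the sign matters, here is a self-contained userspace sketch
(int8_t stands in for the kernel's s8 and "prio" for sc.priority; this is
only an illustration, not kernel code):

#include <stdint.h>
#include <stdio.h>

int main(void)
{
        int8_t prio = 12;       /* stand-in for sc.priority = DEF_PRIORITY */
        int iterations = 0;

        do {
                iterations++;   /* a reclaim pass would run here */
        } while (--prio >= 0);

        /* Prints "iterations=13 prio=-1": priorities 12..0 each visited once.
         * Had prio been uint8_t, --prio at 0 would wrap to 255, ">= 0" would
         * always hold, and the loop would never terminate. */
        printf("iterations=%d prio=%d\n", iterations, (int)prio);
        return 0;
}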

Add BUILD_BUG_ON to verify that s8 is capable of storing max values.

This reduces sizeof(struct scan_control):
* 96 => 80 bytes (x86_64)
* 68 => 56 bytes (i386)

scan_control structure field order is changed to utilize padding.
After this patch there is 1 bit of scan_control padding.

Signed-off-by: Greg Thelen 
Suggested-by: Matthew Wilcox 
---
 mm/vmscan.c | 32 
 1 file changed, 20 insertions(+), 12 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 9b697323a88c..42731faea306 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -65,12 +65,6 @@ struct scan_control {
/* How many pages shrink_list() should reclaim */
unsigned long nr_to_reclaim;
 
-   /* This context's GFP mask */
-   gfp_t gfp_mask;
-
-   /* Allocation order */
-   int order;
-
/*
 * Nodemask of nodes allowed by the caller. If NULL, all nodes
 * are scanned.
@@ -83,12 +77,6 @@ struct scan_control {
 */
struct mem_cgroup *target_mem_cgroup;
 
-   /* Scan (total_size >> priority) pages at once */
-   int priority;
-
-   /* The highest zone to isolate pages for reclaim from */
-   enum zone_type reclaim_idx;
-
/* Writepage batching in laptop mode; RECLAIM_WRITE */
unsigned int may_writepage:1;
 
@@ -111,6 +99,18 @@ struct scan_control {
/* One of the zones is ready for compaction */
unsigned int compaction_ready:1;
 
+   /* Allocation order */
+   s8 order;
+
+   /* Scan (total_size >> priority) pages at once */
+   s8 priority;
+
+   /* The highest zone to isolate pages for reclaim from */
+   s8 reclaim_idx;
+
+   /* This context's GFP mask */
+   gfp_t gfp_mask;
+
/* Incremented by the number of inactive pages that were scanned */
unsigned long nr_scanned;
 
@@ -3047,6 +3047,14 @@ unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
.may_swap = 1,
};
 
+   /*
+* scan_control uses s8 fields for order, priority, and reclaim_idx.
+* Confirm they are large enough for max values.
+*/
+   BUILD_BUG_ON(MAX_ORDER > S8_MAX);
+   BUILD_BUG_ON(DEF_PRIORITY > S8_MAX);
+   BUILD_BUG_ON(MAX_NR_ZONES > S8_MAX);
+
/*
 * Do not enter reclaim if fatal signal was delivered while throttled.
 * 1 is returned so that the page allocator does not OOM kill at this
-- 
2.17.0.921.gf22659ad46-goog



Re: [PATCH] mm: convert scan_control.priority int => byte

2018-05-30 Thread Greg Thelen
Matthew Wilcox  wrote:

> On Mon, May 28, 2018 at 07:40:25PM -0700, Greg Thelen wrote:
>> Reclaim priorities range from 0..12(DEF_PRIORITY).
>> scan_control.priority is a 4 byte int, which is overkill.
>> 
>> Since commit 6538b8ea886e ("x86_64: expand kernel stack to 16K") x86_64
>> stack overflows are not an issue.  But it's inefficient to use 4 bytes
>> for priority.
>
> If you're looking to shave a few more bytes, allocation order can fit
> in a u8 too (can't be more than 6 bits, and realistically won't be more
> than 4 bits).  reclaim_idx likewise will fit in a u8, and actually won't
> be more than 3 bits.

Nod.  Good tip.  Included in ("[PATCH v2] mm: condense scan_control").

> I am sceptical that nr_to_reclaim should really be an unsigned long; I
> don't think we should be trying to free 4 billion pages in a single call.
> nr_scanned might be over 4 billion (!) but nr_reclaimed can probably
> shrink to unsigned int along with nr_to_reclaim.

Agreed.  For patch simplicity, I'll pass on this for now.


[PATCH] mm: convert scan_control.priority int => byte

2018-05-28 Thread Greg Thelen
Reclaim priorities range from 0..12(DEF_PRIORITY).
scan_control.priority is a 4 byte int, which is overkill.

Since commit 6538b8ea886e ("x86_64: expand kernel stack to 16K") x86_64
stack overflows are not an issue.  But it's inefficient to use 4 bytes
for priority.

Use s8 (signed byte) rather than u8 to allow for loops like:
do {
...
} while (--sc.priority >= 0);

This reduces sizeof(struct scan_control) from 96 => 88 bytes (x86_64),
which saves some stack.

scan_control.priority field order is changed to occupy otherwise unused
padding.
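
For intuition, a small standalone sketch of how member width and placement
affect sizeof on a common LP64 target (the structs and fields below are made
up for illustration; they are not scan_control itself):

#include <stdio.h>

struct wide {                   /* int-sized knobs between long-sized fields */
        unsigned long nr_to_reclaim;
        int order;
        int priority;
        int reclaim_idx;
        unsigned long nr_scanned;
};

struct narrow {                 /* same data, byte-sized knobs pack together */
        unsigned long nr_to_reclaim;
        signed char order;
        signed char priority;
        signed char reclaim_idx;
        unsigned long nr_scanned;
};

int main(void)
{
        /* Typically prints "wide=32 narrow=24" on x86_64: three ints plus
         * padding cost 16 bytes, three bytes plus padding cost 8. */
        printf("wide=%zu narrow=%zu\n",
               sizeof(struct wide), sizeof(struct narrow));
        return 0;
}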

Signed-off-by: Greg Thelen 
---
 mm/vmscan.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 9b697323a88c..541c334bd176 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -83,9 +83,6 @@ struct scan_control {
 */
struct mem_cgroup *target_mem_cgroup;
 
-   /* Scan (total_size >> priority) pages at once */
-   int priority;
-
/* The highest zone to isolate pages for reclaim from */
enum zone_type reclaim_idx;
 
@@ -111,6 +108,9 @@ struct scan_control {
/* One of the zones is ready for compaction */
unsigned int compaction_ready:1;
 
+   /* Scan (total_size >> priority) pages at once */
+   s8 priority;
+
/* Incremented by the number of inactive pages that were scanned */
unsigned long nr_scanned;
 
-- 
2.17.0.921.gf22659ad46-goog



Re: [PATCH] IB: Revert "remove redundant INFINIBAND kconfig dependencies"

2018-05-28 Thread Greg Thelen

Jason Gunthorpe wrote:

> On Fri, May 25, 2018 at 05:32:52PM -0700, Greg Thelen wrote:
> > On Fri, May 25, 2018 at 2:32 PM Arnd Bergmann wrote:
> >
> > > Several subsystems depend on INFINIBAND_ADDR_TRANS, which in turn depends
> > > on INFINIBAND. However, when with CONFIG_INIFIBAND=m, this leads to a
> > > link error when another driver using it is built-in. The
> > > INFINIBAND_ADDR_TRANS dependency is insufficient here as this is
> > > a 'bool' symbol that does not force anything to be a module in turn.
> > >
> > > fs/cifs/smbdirect.o: In function `smbd_disconnect_rdma_work':
> > > smbdirect.c:(.text+0x1e4): undefined reference to `rdma_disconnect'
> > > net/9p/trans_rdma.o: In function `rdma_request':
> > > trans_rdma.c:(.text+0x7bc): undefined reference to `rdma_disconnect'
> > > net/9p/trans_rdma.o: In function `rdma_destroy_trans':
> > > trans_rdma.c:(.text+0x830): undefined reference to `ib_destroy_qp'
> > > trans_rdma.c:(.text+0x858): undefined reference to `ib_dealloc_pd'
> > >
> > > Fixes: 9533b292a7ac ("IB: remove redundant INFINIBAND kconfig dependencies")
> > > Signed-off-by: Arnd Bergmann
> >
> > Acked-by: Greg Thelen
> >
> > Sorry for the 9533b292a7ac problem.
> > At this point in the release cycle, I think Arnd's revert is best.
> >
> > If there is interest, I've put a little thought into an alternative fix:
> > making INFINIBAND_ADDR_TRANS tristate.  But it's nontrivial.
> > So I prefer this simple revert for now.
>
> Is that a normal thing to do?

For me: no, it's not normal.  In my use case I merely want to disable
INFINIBAND_ADDR_TRANS while continuing to use INFINIBAND.  This is
supported with f7cb7b85be55 ("IB: make INFINIBAND_ADDR_TRANS
configurable").

During f7cb7b85be55 development https://lkml.org/lkml/2018/4/30/1073
suggested that we drop dependency on both INFINIBAND and
INFINIBAND_ADDR_TRANS.  Thus 9533b292a7ac ("IB: remove redundant
INFINIBAND kconfig dependencies").

But 9533b292a7ac led to the randconfig build errors reported and thus
("IB: Revert "remove redundant INFINIBAND kconfig dependencies"").

So I see no need to do anything more than apply ("IB: Revert "remove
redundant INFINIBAND kconfig dependencies"").

> > Doug: do you need anything from me on this?
>
> I can take the revert..
>
> Jason

Thanks.


Re: [PATCH] IB: Revert "remove redundant INFINIBAND kconfig dependencies"

2018-05-25 Thread Greg Thelen
On Fri, May 25, 2018 at 2:32 PM Arnd Bergmann <a...@arndb.de> wrote:

> Several subsystems depend on INFINIBAND_ADDR_TRANS, which in turn depends
> on INFINIBAND. However, when with CONFIG_INIFIBAND=m, this leads to a
> link error when another driver using it is built-in. The
> INFINIBAND_ADDR_TRANS dependency is insufficient here as this is
> a 'bool' symbol that does not force anything to be a module in turn.

> fs/cifs/smbdirect.o: In function `smbd_disconnect_rdma_work':
> smbdirect.c:(.text+0x1e4): undefined reference to `rdma_disconnect'
> net/9p/trans_rdma.o: In function `rdma_request':
> trans_rdma.c:(.text+0x7bc): undefined reference to `rdma_disconnect'
> net/9p/trans_rdma.o: In function `rdma_destroy_trans':
> trans_rdma.c:(.text+0x830): undefined reference to `ib_destroy_qp'
> trans_rdma.c:(.text+0x858): undefined reference to `ib_dealloc_pd'

> Fixes: 9533b292a7ac ("IB: remove redundant INFINIBAND kconfig dependencies")
> Signed-off-by: Arnd Bergmann <a...@arndb.de>

Acked-by: Greg Thelen <gthe...@google.com>

Sorry for the 9533b292a7ac problem.
At this point in the release cycle, I think Arnd's revert is best.

If there is interest, I've put a little thought into an alternative fix:
making INFINIBAND_ADDR_TRANS tristate.  But it's nontrivial.
So I prefer this simple revert for now.

Doug: do you need anything from me on this?

> ---
> The patch that introduced the problem has been queued in the
> rdma-fixes/for-rc tree. Please revert the patch before sending
> the branch to Linus.
> ---
> drivers/infiniband/ulp/srpt/Kconfig | 2 +-
> drivers/nvme/host/Kconfig   | 2 +-
> drivers/nvme/target/Kconfig | 2 +-
> drivers/staging/lustre/lnet/Kconfig | 2 +-
> fs/cifs/Kconfig | 2 +-
> net/9p/Kconfig  | 2 +-
> net/rds/Kconfig | 2 +-
> net/sunrpc/Kconfig  | 2 +-
> 8 files changed, 8 insertions(+), 8 deletions(-)

> diff --git a/drivers/infiniband/ulp/srpt/Kconfig b/drivers/infiniband/ulp/srpt/Kconfig
> index 25bf6955b6d0..fb8b7182f05e 100644
> --- a/drivers/infiniband/ulp/srpt/Kconfig
> +++ b/drivers/infiniband/ulp/srpt/Kconfig
> @@ -1,6 +1,6 @@
> config INFINIBAND_SRPT
>tristate "InfiniBand SCSI RDMA Protocol target support"
> -   depends on INFINIBAND_ADDR_TRANS && TARGET_CORE
> +   depends on INFINIBAND && INFINIBAND_ADDR_TRANS && TARGET_CORE
>---help---

>  Support for the SCSI RDMA Protocol (SRP) Target driver. The
> diff --git a/drivers/nvme/host/Kconfig b/drivers/nvme/host/Kconfig
> index dbb7464c018c..88a8b5916624 100644
> --- a/drivers/nvme/host/Kconfig
> +++ b/drivers/nvme/host/Kconfig
> @@ -27,7 +27,7 @@ config NVME_FABRICS

> config NVME_RDMA
>tristate "NVM Express over Fabrics RDMA host driver"
> -   depends on INFINIBAND_ADDR_TRANS && BLOCK
> +   depends on INFINIBAND && INFINIBAND_ADDR_TRANS && BLOCK
>select NVME_CORE
>select NVME_FABRICS
>select SG_POOL
> diff --git a/drivers/nvme/target/Kconfig b/drivers/nvme/target/Kconfig
> index 7595664ee753..3c7b61ddb0d1 100644
> --- a/drivers/nvme/target/Kconfig
> +++ b/drivers/nvme/target/Kconfig
> @@ -27,7 +27,7 @@ config NVME_TARGET_LOOP

> config NVME_TARGET_RDMA
>tristate "NVMe over Fabrics RDMA target support"
> -   depends on INFINIBAND_ADDR_TRANS
> +   depends on INFINIBAND && INFINIBAND_ADDR_TRANS
>depends on NVME_TARGET
>select SGL_ALLOC
>help
> diff --git a/drivers/staging/lustre/lnet/Kconfig b/drivers/staging/lustre/lnet/Kconfig
> index f3b1ad4bd3dc..ad049e6f24e4 100644
> --- a/drivers/staging/lustre/lnet/Kconfig
> +++ b/drivers/staging/lustre/lnet/Kconfig
> @@ -34,7 +34,7 @@ config LNET_SELFTEST

> config LNET_XPRT_IB
>tristate "LNET infiniband support"
> -   depends on LNET && PCI && INFINIBAND_ADDR_TRANS
> +   depends on LNET && PCI && INFINIBAND && INFINIBAND_ADDR_TRANS
>default LNET && INFINIBAND
>help
>  This option allows the LNET users to use infiniband as an
> diff --git a/fs/cifs/Kconfig b/fs/cifs/Kconfig
> index d61e2de8d0eb..5f132d59dfc2 100644
> --- a/fs/cifs/Kconfig
> +++ b/fs/cifs/Kconfig
> @@ -197,7 +197,7 @@ config CIFS_SMB311

> config CIFS_SMB_DIRECT
>bool "SMB Direct support (Experimental)"
> -   depends on CIFS=m && INFINIBAND_ADDR_TRANS || CIFS=y && INFINIBAND_ADDR_TRANS=y
> +   depends on CIF


Re: [PATCH 3/5] ib_srpt: depend on INFINIBAND_ADDR_TRANS

2018-05-03 Thread Greg Thelen
On Tue, May 1, 2018 at 1:48 PM Jason Gunthorpe <j...@ziepe.ca> wrote:

> On Tue, May 01, 2018 at 03:08:57AM +0000, Greg Thelen wrote:
> > On Mon, Apr 30, 2018 at 4:35 PM Jason Gunthorpe <j...@ziepe.ca> wrote:
> >
> > > On Wed, Apr 25, 2018 at 03:33:39PM -0700, Greg Thelen wrote:
> > > > INFINIBAND_SRPT code depends on INFINIBAND_ADDR_TRANS provided symbols.
> > > > So declare the kconfig dependency.  This is necessary to allow for
> > > > enabling INFINIBAND without INFINIBAND_ADDR_TRANS.
> > > >
> > > > Signed-off-by: Greg Thelen <gthe...@google.com>
> > > > Cc: Tarick Bedeir <tar...@google.com>
> > > >  drivers/infiniband/ulp/srpt/Kconfig | 2 +-
> > > >  1 file changed, 1 insertion(+), 1 deletion(-)
> > > >
> > > > diff --git a/drivers/infiniband/ulp/srpt/Kconfig b/drivers/infiniband/ulp/srpt/Kconfig
> > > > index 31ee83d528d9..fb8b7182f05e 100644
> > > > +++ b/drivers/infiniband/ulp/srpt/Kconfig
> > > > @@ -1,6 +1,6 @@
> > > >  config INFINIBAND_SRPT
> > > >   tristate "InfiniBand SCSI RDMA Protocol target support"
> > > > - depends on INFINIBAND && TARGET_CORE
> > > > + depends on INFINIBAND && INFINIBAND_ADDR_TRANS && TARGET_CORE
> >
> > > Isn't INFINIBAND && INFINIBAND_ADDR_TRANS a bit redundant? Can't have
> > > INFINIBAND_ADDR_TRANS without INFINIBAND.
> >
> > By kconfig INFINIBAND_ADDR_TRANS depends on INFINIBAND.  So yes, it seems
> > redundant.  I don't know if anyone has designs to break this dependency and
> > allow for ADDR_TRANS without INFINIBAND.  Assuming not, I'd be willing to
> > amend my series removing redundant INFINIBAND and a followup series to
> > remove it from similar depends.  Though I'm not familiar with rdma dev tree
> > lifecycle.  Is rdma/for-rc a throw away branch (akin to linux-next), or
> > will it be merged into linus/master?  If throwaway, then we can amend
> > its patches, otherwise followups will be needed.
> >
> > Let me know what you'd prefer.  Thanks.

> I think a single update patch to fix all of these, and the
> pre-existing ones would be great

I just posted the cleanup: https://lkml.org/lkml/2018/5/3/1045


[PATCH] IB: remove redundant INFINIBAND kconfig dependencies

2018-05-03 Thread Greg Thelen
INFINIBAND_ADDR_TRANS depends on INFINIBAND.  So there's no need for
options which depend on INFINIBAND_ADDR_TRANS to also depend on INFINIBAND.
Remove the unnecessary INFINIBAND depends.

Signed-off-by: Greg Thelen <gthe...@google.com>
---
 drivers/infiniband/ulp/srpt/Kconfig | 2 +-
 drivers/nvme/host/Kconfig   | 2 +-
 drivers/nvme/target/Kconfig | 2 +-
 drivers/staging/lustre/lnet/Kconfig | 2 +-
 fs/cifs/Kconfig | 2 +-
 net/9p/Kconfig  | 2 +-
 net/rds/Kconfig | 2 +-
 net/sunrpc/Kconfig  | 2 +-
 8 files changed, 8 insertions(+), 8 deletions(-)

diff --git a/drivers/infiniband/ulp/srpt/Kconfig b/drivers/infiniband/ulp/srpt/Kconfig
index fb8b7182f05e..25bf6955b6d0 100644
--- a/drivers/infiniband/ulp/srpt/Kconfig
+++ b/drivers/infiniband/ulp/srpt/Kconfig
@@ -1,6 +1,6 @@
 config INFINIBAND_SRPT
tristate "InfiniBand SCSI RDMA Protocol target support"
-   depends on INFINIBAND && INFINIBAND_ADDR_TRANS && TARGET_CORE
+   depends on INFINIBAND_ADDR_TRANS && TARGET_CORE
---help---
 
  Support for the SCSI RDMA Protocol (SRP) Target driver. The
diff --git a/drivers/nvme/host/Kconfig b/drivers/nvme/host/Kconfig
index 88a8b5916624..dbb7464c018c 100644
--- a/drivers/nvme/host/Kconfig
+++ b/drivers/nvme/host/Kconfig
@@ -27,7 +27,7 @@ config NVME_FABRICS
 
 config NVME_RDMA
tristate "NVM Express over Fabrics RDMA host driver"
-   depends on INFINIBAND && INFINIBAND_ADDR_TRANS && BLOCK
+   depends on INFINIBAND_ADDR_TRANS && BLOCK
select NVME_CORE
select NVME_FABRICS
select SG_POOL
diff --git a/drivers/nvme/target/Kconfig b/drivers/nvme/target/Kconfig
index 3c7b61ddb0d1..7595664ee753 100644
--- a/drivers/nvme/target/Kconfig
+++ b/drivers/nvme/target/Kconfig
@@ -27,7 +27,7 @@ config NVME_TARGET_LOOP
 
 config NVME_TARGET_RDMA
tristate "NVMe over Fabrics RDMA target support"
-   depends on INFINIBAND && INFINIBAND_ADDR_TRANS
+   depends on INFINIBAND_ADDR_TRANS
depends on NVME_TARGET
select SGL_ALLOC
help
diff --git a/drivers/staging/lustre/lnet/Kconfig b/drivers/staging/lustre/lnet/Kconfig
index ad049e6f24e4..f3b1ad4bd3dc 100644
--- a/drivers/staging/lustre/lnet/Kconfig
+++ b/drivers/staging/lustre/lnet/Kconfig
@@ -34,7 +34,7 @@ config LNET_SELFTEST
 
 config LNET_XPRT_IB
tristate "LNET infiniband support"
-   depends on LNET && PCI && INFINIBAND && INFINIBAND_ADDR_TRANS
+   depends on LNET && PCI && INFINIBAND_ADDR_TRANS
default LNET && INFINIBAND
help
  This option allows the LNET users to use infiniband as an
diff --git a/fs/cifs/Kconfig b/fs/cifs/Kconfig
index 5f132d59dfc2..d61e2de8d0eb 100644
--- a/fs/cifs/Kconfig
+++ b/fs/cifs/Kconfig
@@ -197,7 +197,7 @@ config CIFS_SMB311
 
 config CIFS_SMB_DIRECT
bool "SMB Direct support (Experimental)"
-   depends on CIFS=m && INFINIBAND && INFINIBAND_ADDR_TRANS || CIFS=y && INFINIBAND=y && INFINIBAND_ADDR_TRANS=y
+   depends on CIFS=m && INFINIBAND_ADDR_TRANS || CIFS=y && INFINIBAND_ADDR_TRANS=y
help
  Enables SMB Direct experimental support for SMB 3.0, 3.02 and 3.1.1.
  SMB Direct allows transferring SMB packets over RDMA. If unsure,
diff --git a/net/9p/Kconfig b/net/9p/Kconfig
index e6014e0e51f7..46c39f7da444 100644
--- a/net/9p/Kconfig
+++ b/net/9p/Kconfig
@@ -32,7 +32,7 @@ config NET_9P_XEN
 
 
 config NET_9P_RDMA
-   depends on INET && INFINIBAND && INFINIBAND_ADDR_TRANS
+   depends on INET && INFINIBAND_ADDR_TRANS
tristate "9P RDMA Transport (Experimental)"
help
  This builds support for an RDMA transport.
diff --git a/net/rds/Kconfig b/net/rds/Kconfig
index bffde4b46c5d..1a31502ee7db 100644
--- a/net/rds/Kconfig
+++ b/net/rds/Kconfig
@@ -8,7 +8,7 @@ config RDS
 
 config RDS_RDMA
tristate "RDS over Infiniband"
-   depends on RDS && INFINIBAND && INFINIBAND_ADDR_TRANS
+   depends on RDS && INFINIBAND_ADDR_TRANS
---help---
  Allow RDS to use Infiniband as a transport.
  This transport supports RDMA operations.
diff --git a/net/sunrpc/Kconfig b/net/sunrpc/Kconfig
index ac09ca803296..6358e5271070 100644
--- a/net/sunrpc/Kconfig
+++ b/net/sunrpc/Kconfig
@@ -50,7 +50,7 @@ config SUNRPC_DEBUG
 
 config SUNRPC_XPRT_RDMA
tristate "RPC-over-RDMA transport"
-   depends on SUNRPC && INFINIBAND && INFINIBAND_ADDR_TRANS
+   depends on SUNRPC && INFINIBAND_ADDR_TRANS
default SUNRPC && INFINIBAND
select SG_POOL
help
-- 
2.17.0.441.gb46fe60e1d-goog



[PATCH] memcg: mark memcg1_events static const

2018-05-03 Thread Greg Thelen
Mark memcg1_events static: it's only used by memcontrol.c.
And mark it const: it's not modified.
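
As a rough standalone illustration of the effect (the table below is made
up; it is not the kernel's memcg1_events): 'static' gives the array internal
linkage, so no global symbol is exported and the name cannot clash with other
translation units, while 'const' lets it be placed in read-only data.

#include <stdio.h>

/* Visible only inside this file and not writable at runtime. */
static const unsigned int example_events[] = { 10, 20, 30 };

int main(void)
{
        for (size_t i = 0; i < sizeof(example_events) / sizeof(example_events[0]); i++)
                printf("%u\n", example_events[i]);
        return 0;
}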

Signed-off-by: Greg Thelen <gthe...@google.com>
---
 mm/memcontrol.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 2bd3df3d101a..c9c7e5ea0e2f 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -3083,7 +3083,7 @@ static int memcg_numa_stat_show(struct seq_file *m, void *v)
 #endif /* CONFIG_NUMA */
 
 /* Universal VM events cgroup1 shows, original sort order */
-unsigned int memcg1_events[] = {
+static const unsigned int memcg1_events[] = {
PGPGIN,
PGPGOUT,
PGFAULT,
-- 
2.17.0.441.gb46fe60e1d-goog



Re: [PATCH 3/5] ib_srpt: depend on INFINIBAND_ADDR_TRANS

2018-04-30 Thread Greg Thelen
On Mon, Apr 30, 2018 at 4:35 PM Jason Gunthorpe <j...@ziepe.ca> wrote:

> On Wed, Apr 25, 2018 at 03:33:39PM -0700, Greg Thelen wrote:
> > INFINIBAND_SRPT code depends on INFINIBAND_ADDR_TRANS provided symbols.
> > So declare the kconfig dependency.  This is necessary to allow for
> > enabling INFINIBAND without INFINIBAND_ADDR_TRANS.
> >
> > Signed-off-by: Greg Thelen <gthe...@google.com>
> > Cc: Tarick Bedeir <tar...@google.com>
> >  drivers/infiniband/ulp/srpt/Kconfig | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/drivers/infiniband/ulp/srpt/Kconfig b/drivers/infiniband/ulp/srpt/Kconfig
> > index 31ee83d528d9..fb8b7182f05e 100644
> > +++ b/drivers/infiniband/ulp/srpt/Kconfig
> > @@ -1,6 +1,6 @@
> >  config INFINIBAND_SRPT
> >   tristate "InfiniBand SCSI RDMA Protocol target support"
> > - depends on INFINIBAND && TARGET_CORE
> > + depends on INFINIBAND && INFINIBAND_ADDR_TRANS && TARGET_CORE

> Isn't INFINIBAND && INFINIBAND_ADDR_TRANS a bit redundant? Can't have
> INFINIBAND_ADDR_TRANS without INFINIBAND.

By kconfig INFINIBAND_ADDR_TRANS depends on INFINIBAND.  So yes, it seems
redundant.  I don't know if anyone has designs to break this dependency and
allow for ADDR_TRANS without INFINIBAND.  Assuming not, I'd be willing to
amend my series removing redundant INFINIBAND and a followup series to
remove it from similar depends.  Though I'm not familiar with rdma dev tree
lifecycle.  Is rdma/for-rc a throw away branch (akin to linux-next), or
will it be merged into linus/master?  If throwaway, then we can amend
its patches, otherwise followups will be needed.

Let me know what you'd prefer.  Thanks.

FYI from v4.17-rc3:
drivers/staging/lustre/lnet/Kconfig:  depends on LNET && PCI && INFINIBAND && INFINIBAND_ADDR_TRANS
net/9p/Kconfig:   depends on INET && INFINIBAND && INFINIBAND_ADDR_TRANS
net/rds/Kconfig:  depends on RDS && INFINIBAND && INFINIBAND_ADDR_TRANS
net/sunrpc/Kconfig:   depends on SUNRPC && INFINIBAND && INFINIBAND_ADDR_TRANS


[PATCH 6/6] IB: make INFINIBAND_ADDR_TRANS configurable

2018-04-26 Thread Greg Thelen
Allow INFINIBAND without INFINIBAND_ADDR_TRANS because fuzzing has been
finding a fair number of CM bugs.  So provide an option to disable it.

Signed-off-by: Greg Thelen <gthe...@google.com>
Cc: Tarick Bedeir <tar...@google.com>
---
 drivers/infiniband/Kconfig | 5 -
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/drivers/infiniband/Kconfig b/drivers/infiniband/Kconfig
index ee270e065ba9..2a972ed6851b 100644
--- a/drivers/infiniband/Kconfig
+++ b/drivers/infiniband/Kconfig
@@ -61,9 +61,12 @@ config INFINIBAND_ON_DEMAND_PAGING
  pages on demand instead.
 
 config INFINIBAND_ADDR_TRANS
-   bool
+   bool "RDMA/CM"
depends on INFINIBAND
default y
+   ---help---
+ Support for RDMA communication manager (CM).
+ This allows for a generic connection abstraction over RDMA.
 
 config INFINIBAND_ADDR_TRANS_CONFIGFS
bool
-- 
2.17.0.484.g0c8726318c-goog



[PATCH 5/6] ib_srp: depend on INFINIBAND_ADDR_TRANS

2018-04-26 Thread Greg Thelen
INFINIBAND_SRP code depends on INFINIBAND_ADDR_TRANS provided symbols.
So declare the kconfig dependency.  This is necessary to allow for
enabling INFINIBAND without INFINIBAND_ADDR_TRANS.

Signed-off-by: Greg Thelen <gthe...@google.com>
Cc: Tarick Bedeir <tar...@google.com>
---
 drivers/infiniband/ulp/srp/Kconfig | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/infiniband/ulp/srp/Kconfig 
b/drivers/infiniband/ulp/srp/Kconfig
index c74ee9633041..99db8fe5173a 100644
--- a/drivers/infiniband/ulp/srp/Kconfig
+++ b/drivers/infiniband/ulp/srp/Kconfig
@@ -1,6 +1,6 @@
 config INFINIBAND_SRP
tristate "InfiniBand SCSI RDMA Protocol"
-   depends on SCSI
+   depends on SCSI && INFINIBAND_ADDR_TRANS
select SCSI_SRP_ATTRS
---help---
  Support for the SCSI RDMA Protocol over InfiniBand.  This
-- 
2.17.0.484.g0c8726318c-goog



[PATCH 4/6] cifs: smbd: depend on INFINIBAND_ADDR_TRANS

2018-04-26 Thread Greg Thelen
CIFS_SMB_DIRECT code depends on INFINIBAND_ADDR_TRANS provided symbols.
So declare the kconfig dependency.  This is necessary to allow for
enabling INFINIBAND without INFINIBAND_ADDR_TRANS.

Signed-off-by: Greg Thelen <gthe...@google.com>
Cc: Tarick Bedeir <tar...@google.com>
Reviewed-by: Long Li <lon...@microsoft.com>
---
 fs/cifs/Kconfig | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/cifs/Kconfig b/fs/cifs/Kconfig
index 741749a98614..5f132d59dfc2 100644
--- a/fs/cifs/Kconfig
+++ b/fs/cifs/Kconfig
@@ -197,7 +197,7 @@ config CIFS_SMB311
 
 config CIFS_SMB_DIRECT
bool "SMB Direct support (Experimental)"
-   depends on CIFS=m && INFINIBAND || CIFS=y && INFINIBAND=y
+   depends on CIFS=m && INFINIBAND && INFINIBAND_ADDR_TRANS || CIFS=y && 
INFINIBAND=y && INFINIBAND_ADDR_TRANS=y
help
  Enables SMB Direct experimental support for SMB 3.0, 3.02 and 3.1.1.
  SMB Direct allows transferring SMB packets over RDMA. If unsure,
-- 
2.17.0.484.g0c8726318c-goog



[PATCH 1/6] nvme: depend on INFINIBAND_ADDR_TRANS

2018-04-26 Thread Greg Thelen
NVME_RDMA code depends on INFINIBAND_ADDR_TRANS provided symbols.  So
declare the kconfig dependency.  This is necessary to allow for enabling
INFINIBAND without INFINIBAND_ADDR_TRANS.

Signed-off-by: Greg Thelen <gthe...@google.com>
Cc: Tarick Bedeir <tar...@google.com>
---
 drivers/nvme/host/Kconfig | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/nvme/host/Kconfig b/drivers/nvme/host/Kconfig
index b979cf3bce65..88a8b5916624 100644
--- a/drivers/nvme/host/Kconfig
+++ b/drivers/nvme/host/Kconfig
@@ -27,7 +27,7 @@ config NVME_FABRICS
 
 config NVME_RDMA
tristate "NVM Express over Fabrics RDMA host driver"
-   depends on INFINIBAND && BLOCK
+   depends on INFINIBAND && INFINIBAND_ADDR_TRANS && BLOCK
select NVME_CORE
select NVME_FABRICS
select SG_POOL
-- 
2.17.0.484.g0c8726318c-goog



[PATCH 3/6] ib_srpt: depend on INFINIBAND_ADDR_TRANS

2018-04-26 Thread Greg Thelen
INFINIBAND_SRPT code depends on INFINIBAND_ADDR_TRANS provided symbols.
So declare the kconfig dependency.  This is necessary to allow for
enabling INFINIBAND without INFINIBAND_ADDR_TRANS.

Signed-off-by: Greg Thelen <gthe...@google.com>
Cc: Tarick Bedeir <tar...@google.com>
---
 drivers/infiniband/ulp/srpt/Kconfig | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/infiniband/ulp/srpt/Kconfig 
b/drivers/infiniband/ulp/srpt/Kconfig
index 31ee83d528d9..fb8b7182f05e 100644
--- a/drivers/infiniband/ulp/srpt/Kconfig
+++ b/drivers/infiniband/ulp/srpt/Kconfig
@@ -1,6 +1,6 @@
 config INFINIBAND_SRPT
tristate "InfiniBand SCSI RDMA Protocol target support"
-   depends on INFINIBAND && TARGET_CORE
+   depends on INFINIBAND && INFINIBAND_ADDR_TRANS && TARGET_CORE
---help---
 
  Support for the SCSI RDMA Protocol (SRP) Target driver. The
-- 
2.17.0.484.g0c8726318c-goog



[PATCH 2/6] nvmet-rdma: depend on INFINIBAND_ADDR_TRANS

2018-04-26 Thread Greg Thelen
NVME_TARGET_RDMA code depends on INFINIBAND_ADDR_TRANS provided symbols.
So declare the kconfig dependency.  This is necessary to allow for
enabling INFINIBAND without INFINIBAND_ADDR_TRANS.

Signed-off-by: Greg Thelen <gthe...@google.com>
Cc: Tarick Bedeir <tar...@google.com>
---
 drivers/nvme/target/Kconfig | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/nvme/target/Kconfig b/drivers/nvme/target/Kconfig
index 5f4f8b16685f..3c7b61ddb0d1 100644
--- a/drivers/nvme/target/Kconfig
+++ b/drivers/nvme/target/Kconfig
@@ -27,7 +27,7 @@ config NVME_TARGET_LOOP
 
 config NVME_TARGET_RDMA
tristate "NVMe over Fabrics RDMA target support"
-   depends on INFINIBAND
+   depends on INFINIBAND && INFINIBAND_ADDR_TRANS
depends on NVME_TARGET
select SGL_ALLOC
help
-- 
2.17.0.484.g0c8726318c-goog



[PATCH 0/6] IB: make INFINIBAND_ADDR_TRANS configurable

2018-04-26 Thread Greg Thelen
_ADDR_TRANS"
drivers/nvme/host/rdma.c
  - depends on NVME_RDMA => INFINIBAND_ADDR_TRANS
per this series' "nvme: depend on INFINIBAND_ADDR_TRANS"
drivers/nvme/target/rdma.c
  - depends on NVME_TARGET_RDMA => INFINIBAND_ADDR_TRANS
per this series' "nvmet-rdma: depend on INFINIBAND_ADDR_TRANS"
drivers/staging/lustre/lnet/klnds/o2iblnd/o2iblnd.c
drivers/staging/lustre/lnet/klnds/o2iblnd/o2iblnd.h
drivers/staging/lustre/lnet/klnds/o2iblnd/o2iblnd_cb.c
  - depends on LNET_XPRT_IB => INFINIBAND_ADDR_TRANS
fs/cifs/smbdirect.c
  - depends on CIFS_SMB_DIRECT => INFINIBAND_ADDR_TRANS
per this series' "cifs: smbd: depend on INFINIBAND_ADDR_TRANS"
net/9p/trans_rdma.c
  - depends on NET_9P_RDMA => INFINIBAND_ADDR_TRANS
net/rds/ib.c
net/rds/ib_cm.c
net/rds/rdma_transport.c
  - depends on RDS_RDMA => INFINIBAND_ADDR_TRANS
net/sunrpc/xprtrdma/svc_rdma_transport.c
net/sunrpc/xprtrdma/transport.c
net/sunrpc/xprtrdma/verbs.c
  - depends on SUNRPC_XPRT_RDMA => INFINIBAND_ADDR_TRANS

Greg Thelen (6):
  nvme: depend on INFINIBAND_ADDR_TRANS
  nvmet-rdma: depend on INFINIBAND_ADDR_TRANS
  ib_srpt: depend on INFINIBAND_ADDR_TRANS
  cifs: smbd: depend on INFINIBAND_ADDR_TRANS
  ib_srp: depend on INFINIBAND_ADDR_TRANS
  IB: make INFINIBAND_ADDR_TRANS configurable

 drivers/infiniband/Kconfig  | 5 -
 drivers/infiniband/ulp/srp/Kconfig  | 2 +-
 drivers/infiniband/ulp/srpt/Kconfig | 2 +-
 drivers/nvme/host/Kconfig   | 2 +-
 drivers/nvme/target/Kconfig | 2 +-
 fs/cifs/Kconfig | 2 +-
 6 files changed, 9 insertions(+), 6 deletions(-)

-- 
2.17.0.484.g0c8726318c-goog



Re: [PATCH 5/5] IB: make INFINIBAND_ADDR_TRANS configurable

2018-04-25 Thread Greg Thelen
On Wed, Apr 25, 2018 at 7:13 PM Bart Van Assche <bart.vanass...@wdc.com>
wrote:

> On Wed, 2018-04-25 at 15:34 -0700, Greg Thelen wrote:
> > Allow INFINIBAND without INFINIBAND_ADDR_TRANS because fuzzing has been
> > finding fair number of CM bugs.  So provide option to disable it.
> >
> > Signed-off-by: Greg Thelen <gthe...@google.com>
> > Cc: Tarick Bedeir <tar...@google.com>
> > ---
> >  drivers/infiniband/Kconfig | 5 -
> >  1 file changed, 4 insertions(+), 1 deletion(-)
> >
> > diff --git a/drivers/infiniband/Kconfig b/drivers/infiniband/Kconfig
> > index ee270e065ba9..2a972ed6851b 100644
> > --- a/drivers/infiniband/Kconfig
> > +++ b/drivers/infiniband/Kconfig
> > @@ -61,9 +61,12 @@ config INFINIBAND_ON_DEMAND_PAGING
> > pages on demand instead.
> >
> >  config INFINIBAND_ADDR_TRANS
> > - bool
> > + bool "RDMA/CM"
> >   depends on INFINIBAND
> >   default y
> > + ---help---
> > +   Support for RDMA communication manager (CM).
> > +   This allows for a generic connection abstraction over RDMA.

> Hello Greg,

> Please provide a cover letter when posting a patch series. Such a cover
> letter is not only informative but also makes it easy for people who want
> to comment on a patch series as a whole. I have a question that applies
> to the entire patch series. The RDMA/CM code defines functions like
> rdma_create_id() and rdma_create_qp(). If I search through the kernel tree
> for callers of these functions then I find several more kernel modules than
> the ones that are modified by this patch series:

> $ git grep -lE '[[:blank:]](rdma_create_id|rdma_create_qp)\('
> drivers/infiniband/core/cma.c
> drivers/infiniband/ulp/iser/iser_verbs.c
> drivers/infiniband/ulp/isert/ib_isert.c
> drivers/infiniband/ulp/srp/ib_srp.c
> drivers/infiniband/ulp/srpt/ib_srpt.c
> drivers/nvme/host/rdma.c
> drivers/nvme/target/rdma.c
> drivers/staging/lustre/lnet/klnds/o2iblnd/o2iblnd.c
> drivers/staging/lustre/lnet/klnds/o2iblnd/o2iblnd.h
> fs/cifs/smbdirect.c
> include/rdma/rdma_cm.h
> net/9p/trans_rdma.c
> net/rds/ib.c
> net/rds/ib_cm.c
> net/rds/rdma_transport.c
> net/sunrpc/xprtrdma/svc_rdma_transport.c
> net/sunrpc/xprtrdma/verbs.c

> Are you sure that this patch series is complete?

I'll check your cited files.  I'll resend with cover letter.

FYI: I already ran this series through the 0-day builder, which presumably
would've caught most config dep issues.  But I'll research your list before
reposting.


[PATCH 5/5] IB: make INFINIBAND_ADDR_TRANS configurable

2018-04-25 Thread Greg Thelen
Allow INFINIBAND without INFINIBAND_ADDR_TRANS because fuzzing has been
finding a fair number of CM bugs.  So provide an option to disable it.

Signed-off-by: Greg Thelen <gthe...@google.com>
Cc: Tarick Bedeir <tar...@google.com>
---
 drivers/infiniband/Kconfig | 5 -
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/drivers/infiniband/Kconfig b/drivers/infiniband/Kconfig
index ee270e065ba9..2a972ed6851b 100644
--- a/drivers/infiniband/Kconfig
+++ b/drivers/infiniband/Kconfig
@@ -61,9 +61,12 @@ config INFINIBAND_ON_DEMAND_PAGING
  pages on demand instead.
 
 config INFINIBAND_ADDR_TRANS
-   bool
+   bool "RDMA/CM"
depends on INFINIBAND
default y
+   ---help---
+ Support for RDMA communication manager (CM).
+ This allows for a generic connection abstraction over RDMA.
 
 config INFINIBAND_ADDR_TRANS_CONFIGFS
bool
-- 
2.17.0.441.gb46fe60e1d-goog



[PATCH 4/5] cifs: smbd: depend on INFINIBAND_ADDR_TRANS

2018-04-25 Thread Greg Thelen
CIFS_SMB_DIRECT code depends on INFINIBAND_ADDR_TRANS provided symbols.
So declare the kconfig dependency.  This is necessary to allow for
enabling INFINIBAND without INFINIBAND_ADDR_TRANS.

Signed-off-by: Greg Thelen <gthe...@google.com>
Cc: Tarick Bedeir <tar...@google.com>
---
 fs/cifs/Kconfig | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/cifs/Kconfig b/fs/cifs/Kconfig
index 741749a98614..5f132d59dfc2 100644
--- a/fs/cifs/Kconfig
+++ b/fs/cifs/Kconfig
@@ -197,7 +197,7 @@ config CIFS_SMB311
 
 config CIFS_SMB_DIRECT
bool "SMB Direct support (Experimental)"
-   depends on CIFS=m && INFINIBAND || CIFS=y && INFINIBAND=y
+   depends on CIFS=m && INFINIBAND && INFINIBAND_ADDR_TRANS || CIFS=y && 
INFINIBAND=y && INFINIBAND_ADDR_TRANS=y
help
  Enables SMB Direct experimental support for SMB 3.0, 3.02 and 3.1.1.
  SMB Direct allows transferring SMB packets over RDMA. If unsure,
-- 
2.17.0.441.gb46fe60e1d-goog



[PATCH 3/5] ib_srpt: depend on INFINIBAND_ADDR_TRANS

2018-04-25 Thread Greg Thelen
INFINIBAND_SRPT code depends on INFINIBAND_ADDR_TRANS provided symbols.
So declare the kconfig dependency.  This is necessary to allow for
enabling INFINIBAND without INFINIBAND_ADDR_TRANS.

Signed-off-by: Greg Thelen <gthe...@google.com>
Cc: Tarick Bedeir <tar...@google.com>
---
 drivers/infiniband/ulp/srpt/Kconfig | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/infiniband/ulp/srpt/Kconfig 
b/drivers/infiniband/ulp/srpt/Kconfig
index 31ee83d528d9..fb8b7182f05e 100644
--- a/drivers/infiniband/ulp/srpt/Kconfig
+++ b/drivers/infiniband/ulp/srpt/Kconfig
@@ -1,6 +1,6 @@
 config INFINIBAND_SRPT
tristate "InfiniBand SCSI RDMA Protocol target support"
-   depends on INFINIBAND && TARGET_CORE
+   depends on INFINIBAND && INFINIBAND_ADDR_TRANS && TARGET_CORE
---help---
 
  Support for the SCSI RDMA Protocol (SRP) Target driver. The
-- 
2.17.0.441.gb46fe60e1d-goog



[PATCH 2/5] nvmet-rdma: depend on INFINIBAND_ADDR_TRANS

2018-04-25 Thread Greg Thelen
NVME_TARGET_RDMA code depends on INFINIBAND_ADDR_TRANS provided symbols.
So declare the kconfig dependency.  This is necessary to allow for
enabling INFINIBAND without INFINIBAND_ADDR_TRANS.

Signed-off-by: Greg Thelen <gthe...@google.com>
Cc: Tarick Bedeir <tar...@google.com>
---
 drivers/nvme/target/Kconfig | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/nvme/target/Kconfig b/drivers/nvme/target/Kconfig
index 5f4f8b16685f..3c7b61ddb0d1 100644
--- a/drivers/nvme/target/Kconfig
+++ b/drivers/nvme/target/Kconfig
@@ -27,7 +27,7 @@ config NVME_TARGET_LOOP
 
 config NVME_TARGET_RDMA
tristate "NVMe over Fabrics RDMA target support"
-   depends on INFINIBAND
+   depends on INFINIBAND && INFINIBAND_ADDR_TRANS
depends on NVME_TARGET
select SGL_ALLOC
help
-- 
2.17.0.441.gb46fe60e1d-goog



[PATCH 1/5] nvme: depend on INFINIBAND_ADDR_TRANS

2018-04-25 Thread Greg Thelen
NVME_RDMA code depends on INFINIBAND_ADDR_TRANS provided symbols.  So
declare the kconfig dependency.  This is necessary to allow for enabling
INFINIBAND without INFINIBAND_ADDR_TRANS.

Signed-off-by: Greg Thelen <gthe...@google.com>
Cc: Tarick Bedeir <tar...@google.com>
---
 drivers/nvme/host/Kconfig | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/nvme/host/Kconfig b/drivers/nvme/host/Kconfig
index b979cf3bce65..88a8b5916624 100644
--- a/drivers/nvme/host/Kconfig
+++ b/drivers/nvme/host/Kconfig
@@ -27,7 +27,7 @@ config NVME_FABRICS
 
 config NVME_RDMA
tristate "NVM Express over Fabrics RDMA host driver"
-   depends on INFINIBAND && BLOCK
+   depends on INFINIBAND && INFINIBAND_ADDR_TRANS && BLOCK
select NVME_CORE
select NVME_FABRICS
select SG_POOL
-- 
2.17.0.441.gb46fe60e1d-goog



Re: [RFC PATCH 0/2] memory.low,min reclaim

2018-04-23 Thread Greg Thelen
On Mon, Apr 23, 2018 at 3:38 AM Roman Gushchin <g...@fb.com> wrote:

> Hi, Greg!

> On Sun, Apr 22, 2018 at 01:26:10PM -0700, Greg Thelen wrote:
> > Roman's previously posted memory.low,min patches add per memcg effective
> > low limit to detect overcommitment of parental limits.  But if we flip
> > low,min reclaim to bail if usage<{low,min} at any level, then we don't need
> > an effective low limit, which makes the code simpler.  When parent limits
> > are overcommited memory.min will oom kill, which is more drastic but makes
> > the memory.low a simpler concept.  If memcg a/b wants oom kill before
> > reclaim, then give it to them.  It seems a bit strange for a/b/memory.low's
> > behaviour to depend on a/c/memory.low (i.e. a/b.low is strong unless
> > a/b.low+a/c.low exceed a.low).

> It's actually not strange: a/b and a/c are sharing a common resource:
> a/memory.low.

> Exactly as a/b/memory.max and a/c/memory.max are sharing a/memory.max.
> If there are sibling cgroups which are consuming memory, a cgroup can't
> exceed parent's memory.max, even if its memory.max is grater.

> >
> > I think there might be a simpler way (ableit it doesn't yet include
> > Documentation):
> > - memcg: fix memory.low
> > - memcg: add memory.min
> >  3 files changed, 75 insertions(+), 6 deletions(-)
> >
> > The idea of this alternate approach is for memory.low,min to avoid reclaim
> > if any portion of under-consideration memcg ancestry is under respective
> > limit.

> This approach has a significant downside: it breaks hierarchical constraints
> for memory.low/min. There are two important outcomes:

> 1) Any leaf's memory.low/min value is respected, even if parent's value
>   is lower or even 0. It's not possible anymore to limit the amount of
>   protected memory for a sub-tree.
>   This is especially bad in case of delegation.

As someone who has been using something like memory.min for a while, I have
cases where it needs to be a strong protection.  Such jobs prefer oom kill
to reclaim.  These jobs know they need X MB of memory.  But I guess it's on
me to avoid configuring machines which overcommit memory.min at such cgroup
levels all the way to the root.

> 2) If a cgroup has an ancestor with the usage under its memory.low/min,
>   it becomes protection, even if its memory.low/min is 0. So it becomes
>   impossible to have unprotected cgroups in protected sub-tree.

Fair point.

One use case is a non-trivial job which has several memory accounting
subcontainers.  Is there a way to only set memory.low at the top and have
that offer protection to the whole job?
The case I'm thinking of is:
% cd /cgroup
% echo +memory > cgroup.subtree_control
% mkdir top
% echo +memory > top/cgroup.subtree_control
% mkdir top/part1 top/part2
% echo 1GB > top/memory.min
% (echo $BASHPID > top/part1/cgroup.procs && part1)
% (echo $BASHPID > top/part2/cgroup.procs && part2)

Empirically it's been measured that the entire workload (/top) needs 1GB to
perform well.  But we don't care how the memory is distributed between
part1 and part2.  Is the strategy for that to set /top/memory.min,
/top/part1/memory.min, and /top/part2/memory.min all to 1GB?
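
In other words, the fallback I can see is something like this (minimal
sketch, same hierarchy as above; 1G is just the empirically measured figure,
repeated on each child because we don't care about the split):
% echo 1G > top/memory.min
% echo 1G > top/part1/memory.min
% echo 1G > top/part2/memory.min
Having to repeat the full value on every child is the part that feels
awkward to keep in sync.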

What do you think about exposing emin and elow to user space?  I think that
would reduce admin/user confusion in situations where memory.min is
internally discounted.

(tangent) Delegation in v2 isn't something I've been able to fully
internalize yet.
The "no interior processes" rule challenges my notion of subdelegation.
My current model is where a system controller creates a container C with
C.min and then starts client manager process M in C.  Then M can choose
to further divide C's resources (e.g. C/S).  This doesn't seem possible
because v2 doesn't allow for interior processes.  So the system manager
would need to create C, set C.low, create C/sub_manager, create
C/sub_resources, set C/sub_manager.low, set C/sub_resources.low, then start
M in C/sub_manager.  Then sub_manager can create and manage
C/sub_resources/S.
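
Spelled out, that's roughly the following (rough sketch only: it assumes
+memory is already enabled at the root as in the earlier example, uses
made-up 1G values and a placeholder $MANAGER_PID, and ignores the
ownership/permission side of delegation):
% cd /cgroup
% mkdir C
% echo 1G > C/memory.low
% echo +memory > C/cgroup.subtree_control
% mkdir C/sub_manager C/sub_resources
% echo 1G > C/sub_manager/memory.low
% echo 1G > C/sub_resources/memory.low
% echo $MANAGER_PID > C/sub_manager/cgroup.procs
# later, M (running inside C/sub_manager) manages the delegated part:
% echo +memory > C/sub_resources/cgroup.subtree_control
% mkdir C/sub_resources/S
That's noticeably more ceremony than just starting M directly in C.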

PS: Thanks for the memory.low and memory.min work.  Regardless of how we
proceed it's better than the upstream memory.soft_limit_in_bytes!


[RFC PATCH 2/2] memcg: add memory.min

2018-04-22 Thread Greg Thelen
The new memory.min limit is similar to memory.low, just with no bypassing of
it when reclaim is desperate.  Prefer oom kills over reclaiming memory below
memory.min.  Sharing more code with mem_cgroup_low() is possible, but the
idea is posted here for simplicity.

Signed-off-by: Greg Thelen <gthe...@google.com>
---
 include/linux/memcontrol.h |  8 ++
 mm/memcontrol.c| 58 ++
 mm/vmscan.c|  3 ++
 3 files changed, 69 insertions(+)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index c46016bb25eb..22bb4a88653a 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -178,6 +178,7 @@ struct mem_cgroup {
struct page_counter tcpmem;
 
/* Normal memory consumption range */
+   unsigned long min;
unsigned long low;
unsigned long high;
 
@@ -281,6 +282,7 @@ static inline bool mem_cgroup_disabled(void)
return !cgroup_subsys_enabled(memory_cgrp_subsys);
 }
 
+bool mem_cgroup_min(struct mem_cgroup *root, struct mem_cgroup *memcg);
 bool mem_cgroup_low(struct mem_cgroup *root, struct mem_cgroup *memcg);
 
 int mem_cgroup_try_charge(struct page *page, struct mm_struct *mm,
@@ -726,6 +728,12 @@ static inline void mem_cgroup_event(struct mem_cgroup *memcg,
 {
 }
 
+static inline bool mem_cgroup_min(struct mem_cgroup *root,
+ struct mem_cgroup *memcg)
+{
+   return false;
+}
+
 static inline bool mem_cgroup_low(struct mem_cgroup *root,
  struct mem_cgroup *memcg)
 {
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 9668f620203a..b2aaed4003b4 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -5058,6 +5058,36 @@ static u64 memory_current_read(struct cgroup_subsys_state *css,
	return (u64)page_counter_read(&memcg->memory) * PAGE_SIZE;
 }
 
+static int memory_min_show(struct seq_file *m, void *v)
+{
+   struct mem_cgroup *memcg = mem_cgroup_from_css(seq_css(m));
+   unsigned long min = READ_ONCE(memcg->min);
+
+   if (min == PAGE_COUNTER_MAX)
+   seq_puts(m, "max\n");
+   else
+   seq_printf(m, "%llu\n", (u64)min * PAGE_SIZE);
+
+   return 0;
+}
+
+static ssize_t memory_min_write(struct kernfs_open_file *of,
+   char *buf, size_t nbytes, loff_t off)
+{
+   struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
+   unsigned long min;
+   int err;
+
+   buf = strstrip(buf);
+   err = page_counter_memparse(buf, "max", &min);
+   if (err)
+   return err;
+
+   memcg->min = min;
+
+   return nbytes;
+}
+
 static int memory_low_show(struct seq_file *m, void *v)
 {
struct mem_cgroup *memcg = mem_cgroup_from_css(seq_css(m));
@@ -5288,6 +5318,12 @@ static struct cftype memory_files[] = {
.flags = CFTYPE_NOT_ON_ROOT,
.read_u64 = memory_current_read,
},
+   {
+   .name = "min",
+   .flags = CFTYPE_NOT_ON_ROOT,
+   .seq_show = memory_min_show,
+   .write = memory_min_write,
+   },
{
.name = "low",
.flags = CFTYPE_NOT_ON_ROOT,
@@ -5336,6 +5372,28 @@ struct cgroup_subsys memory_cgrp_subsys = {
.early_init = 0,
 };
 
+/**
+ * mem_cgroup_min returns true for a memcg below its min limit.  Such memcg are
+ * exempt from reclaim.
+ */
+bool mem_cgroup_min(struct mem_cgroup *root, struct mem_cgroup *memcg)
+{
+   if (mem_cgroup_disabled())
+   return false;
+
+   if (!root)
+   root = root_mem_cgroup;
+
+   if (memcg == root)
+   return false;
+
+   for (; memcg != root; memcg = parent_mem_cgroup(memcg)) {
+   if (page_counter_read(&memcg->memory) <= memcg->min)
+   return true; /* protect */
+   }
+   return false; /* !protect */
+}
+
 /**
  * mem_cgroup_low - check if memory consumption is below the normal range
  * @root: the top ancestor of the sub-tree being checked
diff --git a/mm/vmscan.c b/mm/vmscan.c
index cd5dc3faaa57..15ae19a38ad5 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2539,6 +2539,9 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc)
unsigned long reclaimed;
unsigned long scanned;
 
+   if (mem_cgroup_min(root, memcg))
+   continue;
+
if (mem_cgroup_low(root, memcg)) {
if (!sc->memcg_low_reclaim) {
sc->memcg_low_skipped = 1;
-- 
2.17.0.484.g0c8726318c-goog



[RFC PATCH 0/2] memory.low,min reclaim

2018-04-22 Thread Greg Thelen
4k => 4203908k
  # F/g  : 2102016k => 2101992k
  # F/H  : 2101948k => 2101916k
  # i: 4204408k => 4203988k
  # i/j  : 2101984k => 2101936k
  # i/K  : 2102424k => 2102052k
  # L: 4189960k => 3886296k
  # L/m  : 2101968k => 2838704k
  # L/N  : 2087992k => 1047592k

  set -o errexit
  set -o nounset
  set -o pipefail
  
  LIM="$1"; shift
  ANTAGONIST="$1"; shift
  CGPATH=/tmp/cgroup
  
  vmtouch2() {
rm -f "$2"
(echo $BASHPID > "${CGPATH}/$1/cgroup.procs" && exec /tmp/mmap --loop 1 --file "$2" "$3")
  }
  
  vmtouch() {
# twice to ensure slab caches are warmed up and all objs are charged to cgroup.
vmtouch2 "$1" "$2" "$3"
vmtouch2 "$1" "$2" "$3"
  }
  
  dump() {
for i in A A/b A/C A/D A/e F F/g F/H i i/j i/K L L/m L/N; do
  printf "%-5s: %sk\n" $i $(($(cat "${CGPATH}/${i}/memory.current")/1024))
done
  }
  
  rm -f /file_?
  if [[ -e "${CGPATH}/A" ]]; then
rmdir ${CGPATH}/?/? ${CGPATH}/?
  fi
  echo "+memory" > "${CGPATH}/cgroup.subtree_control"
  mkdir "${CGPATH}/A" "${CGPATH}/F" "${CGPATH}/i" "${CGPATH}/L"
  echo "+memory" > "${CGPATH}/A/cgroup.subtree_control"
  echo "+memory" > "${CGPATH}/F/cgroup.subtree_control"
  echo "+memory" > "${CGPATH}/i/cgroup.subtree_control"
  echo "+memory" > "${CGPATH}/L/cgroup.subtree_control"
  mkdir "${CGPATH}/A/b" "${CGPATH}/A/C" "${CGPATH}/A/D" "${CGPATH}/A/e"
  mkdir "${CGPATH}/F/g" "${CGPATH}/F/H"
  mkdir "${CGPATH}/i/j" "${CGPATH}/i/K"
  mkdir "${CGPATH}/L/m" "${CGPATH}/L/N"
  
  echo 2G > "${CGPATH}/A/memory.${LIM}"
  echo 3G > "${CGPATH}/A/b/memory.${LIM}"
  echo 1G > "${CGPATH}/A/C/memory.${LIM}"
  echo 0 > "${CGPATH}/A/D/memory.${LIM}"
  echo 10G > "${CGPATH}/A/e/memory.${LIM}"
  
  echo 2G > "${CGPATH}/F/memory.${LIM}"
  echo 3G > "${CGPATH}/F/g/memory.${LIM}"
  echo 1G > "${CGPATH}/F/H/memory.${LIM}"
  
  echo 5G > "${CGPATH}/i/memory.${LIM}"
  echo 3G > "${CGPATH}/i/j/memory.${LIM}"
  echo 1G > "${CGPATH}/i/K/memory.${LIM}"
  
  echo 2G > "${CGPATH}/L/memory.${LIM}"
  echo 4G > "${CGPATH}/L/memory.max"
  echo 3G > "${CGPATH}/L/m/memory.${LIM}"
  echo 1G > "${CGPATH}/L/N/memory.${LIM}"
  
  vmtouch A/b /file_b 2G
  vmtouch A/C /file_C 2G
  vmtouch A/D /file_D 2G
  
  vmtouch F/g /file_g 2G
  vmtouch F/H /file_H 2G
  
  vmtouch i/j /file_j 2G
  vmtouch i/K /file_K 2G
  
  vmtouch L/m /file_m 2G
  vmtouch L/N /file_N 2G
  
  vmtouch2 "$ANTAGONIST" /file_ant 150G
  echo
  echo "after $ANTAGONIST antagonist"
  dump
  
  rmdir "${CGPATH}/A/b" "${CGPATH}/A/C" "${CGPATH}/A/D" "${CGPATH}/A/e"
  rmdir "${CGPATH}/F/g" "${CGPATH}/F/H"
  rmdir "${CGPATH}/i/j" "${CGPATH}/i/K"
  rmdir "${CGPATH}/L/m" "${CGPATH}/L/N"
  rmdir "${CGPATH}/A" "${CGPATH}/F" "${CGPATH}/i" "${CGPATH}/L"
  rm /file_ant /file_b /file_C /file_D /file_g /file_H /file_j /file_K

Greg Thelen (2):
  memcg: fix memory.low
  memcg: add memory.min

 include/linux/memcontrol.h |  8 +
 mm/memcontrol.c| 70 ++
 mm/vmscan.c|  3 ++
 3 files changed, 75 insertions(+), 6 deletions(-)

-- 
2.17.0.484.g0c8726318c-goog



[RFC PATCH 1/2] memcg: fix memory.low

2018-04-22 Thread Greg Thelen
When targeting reclaim to a memcg, protect that memcg from reclaim if
memory consumption of any level is below its respective memory.low.

Signed-off-by: Greg Thelen <gthe...@google.com>
---
 mm/memcontrol.c | 12 ++--
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 670e99b68aa6..9668f620203a 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -5341,8 +5341,8 @@ struct cgroup_subsys memory_cgrp_subsys = {
  * @root: the top ancestor of the sub-tree being checked
  * @memcg: the memory cgroup to check
  *
- * Returns %true if memory consumption of @memcg, and that of all
- * ancestors up to (but not including) @root, is below the normal range.
+ * Returns %true if memory consumption of @memcg, or any of its ancestors
+ * up to (but not including) @root, is below the normal range.
  *
  * @root is exclusive; it is never low when looked at directly and isn't
  * checked when traversing the hierarchy.
@@ -5379,12 +5379,12 @@ bool mem_cgroup_low(struct mem_cgroup *root, struct mem_cgroup *memcg)
if (memcg == root)
return false;
 
+   /* If any level is under, then protect @memcg from reclaim */
for (; memcg != root; memcg = parent_mem_cgroup(memcg)) {
-   if (page_counter_read(&memcg->memory) >= memcg->low)
-   return false;
+   if (page_counter_read(&memcg->memory) <= memcg->low)
+   return true; /* protect from reclaim */
}
-
-   return true;
+   return false; /* not protected from reclaim */
 }
 
 /**
-- 
2.17.0.484.g0c8726318c-goog



Re: [PATCH] IB: make INFINIBAND_ADDR_TRANS configurable

2018-04-15 Thread Greg Thelen
On Sun, Apr 15, 2018 at 5:06 AM Christoph Hellwig <h...@infradead.org> wrote:

> On Fri, Apr 13, 2018 at 12:06:44AM -0700, Greg Thelen wrote:
> > Allow INFINIBAND without INFINIBAND_ADDR_TRANS.

> Why?  We are pushing everyone heavily to use RDMA/CM, so making it
> optional seems rather counter-intuitive.

Bugs.  We don't currently use CM.  But fuzz testing has been finding a
fair number of CM bugs, so we were thinking of disabling it until we need it.

> You'll also have to fix tons of ULPs to explicitly depend on
> INFINIBAND_ADDR_TRANS, or make code conditional on it.

I think I've identified the set of options which use
INFINIBAND_ADDR_TRANS without a kconfig depends:
* CIFS_SMB_DIRECT
* INFINIBAND_SRPT
* NVME_RDMA
* NVME_TARGET_RDMA

I have patches for the above, but need to finish the commit logs.  Let me
know if they'll be nacked and I'll just patch my kernel and forget
upstreaming.


[PATCH v3] IB: make INFINIBAND_ADDR_TRANS configurable

2018-04-14 Thread Greg Thelen
Allow INFINIBAND without INFINIBAND_ADDR_TRANS.

Signed-off-by: Greg Thelen <gthe...@google.com>
Cc: Tarick Bedeir <tar...@google.com>
---
 drivers/infiniband/Kconfig | 5 -
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/drivers/infiniband/Kconfig b/drivers/infiniband/Kconfig
index ee270e065ba9..2a972ed6851b 100644
--- a/drivers/infiniband/Kconfig
+++ b/drivers/infiniband/Kconfig
@@ -61,9 +61,12 @@ config INFINIBAND_ON_DEMAND_PAGING
  pages on demand instead.
 
 config INFINIBAND_ADDR_TRANS
-   bool
+   bool "RDMA/CM"
depends on INFINIBAND
default y
+   ---help---
+ Support for RDMA communication manager (CM).
+ This allows for a generic connection abstraction over RDMA.
 
 config INFINIBAND_ADDR_TRANS_CONFIGFS
bool
-- 
2.17.0.484.g0c8726318c-goog


