Re: [PATCH 07/10] mm/vmscan: add helper for querying ability to age anonymous pages
Dave Hansen wrote: > From: Dave Hansen > > Anonymous pages are kept on their own LRU(s). These lists could > theoretically always be scanned and maintained. But, without swap, > there is currently nothing the kernel can *do* with the results of a > scanned, sorted LRU for anonymous pages. > > A check for '!total_swap_pages' currently serves as a valid check as > to whether anonymous LRUs should be maintained. However, another > method will be added shortly: page demotion. > > Abstract out the 'total_swap_pages' checks into a helper, give it a > logically significant name, and check for the possibility of page > demotion. Reviewed-by: Greg Thelen > Signed-off-by: Dave Hansen > Cc: David Rientjes > Cc: Huang Ying > Cc: Dan Williams > Cc: David Hildenbrand > Cc: osalvador > --- > > b/mm/vmscan.c | 28 +--- > 1 file changed, 25 insertions(+), 3 deletions(-) > > diff -puN mm/vmscan.c~mm-vmscan-anon-can-be-aged mm/vmscan.c > --- a/mm/vmscan.c~mm-vmscan-anon-can-be-aged 2021-03-04 15:35:58.935806422 > -0800 > +++ b/mm/vmscan.c 2021-03-04 15:35:58.942806422 -0800 > @@ -2517,6 +2517,26 @@ out: > } > } > > +/* > + * Anonymous LRU management is a waste if there is > + * ultimately no way to reclaim the memory. > + */ > +bool anon_should_be_aged(struct lruvec *lruvec) Should this be static? > +{ > + struct pglist_data *pgdat = lruvec_pgdat(lruvec); > + > + /* Aging the anon LRU is valuable if swap is present: */ > + if (total_swap_pages > 0) > + return true; > + > + /* Also valuable if anon pages can be demoted: */ > + if (next_demotion_node(pgdat->node_id) >= 0) > + return true; > + > + /* No way to reclaim anon pages. Should not age anon LRUs: */ > + return false; > +} > + > static void shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc) > { > unsigned long nr[NR_LRU_LISTS]; > @@ -2626,7 +2646,8 @@ static void shrink_lruvec(struct lruvec >* Even if we did not try to evict anon pages at all, we want to >* rebalance the anon lru active/inactive ratio. 
>*/ > - if (total_swap_pages && inactive_is_low(lruvec, LRU_INACTIVE_ANON)) > + if (anon_should_be_aged(lruvec) && > + inactive_is_low(lruvec, LRU_INACTIVE_ANON)) > shrink_active_list(SWAP_CLUSTER_MAX, lruvec, > sc, LRU_ACTIVE_ANON); > } > @@ -3455,10 +3476,11 @@ static void age_active_anon(struct pglis > struct mem_cgroup *memcg; > struct lruvec *lruvec; > > - if (!total_swap_pages) > + lruvec = mem_cgroup_lruvec(NULL, pgdat); > + > + if (!anon_should_be_aged(lruvec)) > return; > > - lruvec = mem_cgroup_lruvec(NULL, pgdat); > if (!inactive_is_low(lruvec, LRU_INACTIVE_ANON)) > return; > > _
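The helper above is a pure predicate over per-node state, which makes its intended semantics easy to model outside the kernel. Below is a minimal userspace sketch of that logic — `total_swap_pages` and `next_demotion_node()` here are stand-ins for the kernel symbols, with illustrative values, not the real implementations:

```c
#include <stdbool.h>

/* Stand-ins for kernel state; values are illustrative only. */
static long total_swap_pages;
static int demotion_target = -1;	/* -1 plays the role of NUMA_NO_NODE */

static int next_demotion_node(int node_id)
{
	(void)node_id;
	return demotion_target;
}

/*
 * Mirrors the patch's logic: anon LRU aging is worthwhile only if the
 * pages can eventually be reclaimed, either to swap or by demotion to
 * another node.
 */
static bool anon_should_be_aged(int node_id)
{
	/* Aging the anon LRU is valuable if swap is present: */
	if (total_swap_pages > 0)
		return true;

	/* Also valuable if anon pages can be demoted: */
	if (next_demotion_node(node_id) >= 0)
		return true;

	/* No way to reclaim anon pages; should not age anon LRUs. */
	return false;
}
```

With neither swap nor a demotion target the helper returns false, which is exactly the condition the old `!total_swap_pages` checks approximated at both call sites.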
Re: + revert-mm-filemap-add-static-for-function-__add_to_page_cache_locked.patch added to -mm tree
a...@linux-foundation.org wrote: > The patch titled > Subject: revert "mm/filemap: add static for function > __add_to_page_cache_locked" > has been added to the -mm tree. Its filename is > > revert-mm-filemap-add-static-for-function-__add_to_page_cache_locked.patch > > This patch should soon appear at > > https://ozlabs.org/~akpm/mmots/broken-out/revert-mm-filemap-add-static-for-function-__add_to_page_cache_locked.patch > and later at > > https://ozlabs.org/~akpm/mmotm/broken-out/revert-mm-filemap-add-static-for-function-__add_to_page_cache_locked.patch > > Before you just go and hit "reply", please: >a) Consider who else should be cc'ed >b) Prefer to cc a suitable mailing list as well >c) Ideally: find the original patch on the mailing list and do a > reply-to-all to that, adding suitable additional cc's > > *** Remember to use Documentation/process/submit-checklist.rst when testing > your code *** > > The -mm tree is included into linux-next and is updated > there every 3-4 working days > > -- > From: Andrew Morton > Subject: revert "mm/filemap: add static for function > __add_to_page_cache_locked" > > Revert 3351b16af494 ("mm/filemap: add static for function > __add_to_page_cache_locked") due to build issues with > CONFIG_DEBUG_INFO_BTF=y. > > Link: > https://lkml.kernel.org/r/caadnvqj6tmzbxvtrobueh6qa0h+q7yaskxrvvvxhqr3kbzd...@mail.gmail.com > Cc: Michal Kubecek > Cc: Justin Forbes > Cc: Alex Shi > Cc: Souptick Joarder > Cc: Alexei Starovoitov > Cc: Daniel Borkmann > Cc: Josef Bacik > Signed-off-by: Andrew Morton Thanks. 
Tested-by: Greg Thelen > --- > > mm/filemap.c |2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > --- > a/mm/filemap.c~revert-mm-filemap-add-static-for-function-__add_to_page_cache_locked > +++ a/mm/filemap.c > @@ -827,7 +827,7 @@ int replace_page_cache_page(struct page > } > EXPORT_SYMBOL_GPL(replace_page_cache_page); > > -static noinline int __add_to_page_cache_locked(struct page *page, > +noinline int __add_to_page_cache_locked(struct page *page, > struct address_space *mapping, > pgoff_t offset, gfp_t gfp, > void **shadowp) > _ > > Patches currently in -mm which might be from a...@linux-foundation.org are > > revert-mm-filemap-add-static-for-function-__add_to_page_cache_locked.patch > kthread_worker-document-cpu-hotplug-handling-fix.patch > mm.patch > mm-remove-the-unuseful-else-after-a-return-checkpatch-fixes.patch > mm-prevent-gup_fast-from-racing-with-cow-during-fork-checkpatch-fixes.patch > mm-swap_state-skip-meaningless-swap-cache-readahead-when-ra_infowin-==-0-fix.patch > mm-vmallocc-__vmalloc_area_node-avoid-32-bit-overflow.patch > mm-cma-improve-pr_debug-log-in-cma_release-fix.patch > mm-vmstat-fix-proc-sys-vm-stat_refresh-generating-false-warnings-fix-2.patch > lib-cmdline_kunit-add-a-new-test-suite-for-cmdline-api-fix.patch > ilog2-improve-ilog2-for-constant-arguments-checkpatch-fixes.patch > lib-test_bitmapc-add-for_each_set_clump-test-cases-checkpatch-fixes.patch > checkpatch-fix-typo_spelling-check-for-words-with-apostrophe-fix.patch > resource-fix-kernel-doc-markups-checkpatch-fixes.patch > linux-next-rejects.patch > kmap-stupid-hacks-to-make-it-compile.patch > init-kconfig-dont-assume-scripts-lld-versionsh-is-executable.patch > set_memory-allow-set_direct_map__noflush-for-multiple-pages-fix.patch > arch-mm-wire-up-memfd_secret-system-call-were-relevant-fix.patch > kernel-forkc-export-kernel_thread-to-modules.patch
Re: [PATCH] mm/filemap: add static for function __add_to_page_cache_locked
Alex Shi wrote: > On 2020/11/11 3:50 AM, Andrew Morton wrote: >> On Tue, 10 Nov 2020 08:39:24 +0530 Souptick Joarder >> wrote: >> >>> On Fri, Nov 6, 2020 at 4:55 PM Alex Shi wrote: Otherwise it causes a gcc warning: ^~~ ../mm/filemap.c:830:14: warning: no previous prototype for ‘__add_to_page_cache_locked’ [-Wmissing-prototypes] noinline int __add_to_page_cache_locked(struct page *page, ^~ >>> >>> Is CONFIG_DEBUG_INFO_BTF enabled in your .config ? >> >> hm, yes. > > When the config is enabled, compiling looks good until the pahole tool > is used to get BTF info, but I still failed with a right pahole version >> 1.16. Sorry. I'm seeing an issue with this patch. My build system has pahole v1.17, but I don't think the pahole version is key. $ git checkout 3351b16af494 # recently added to linus/master $ make defconfig $ make menuconfig # set CONFIG_BPF_SYSCALL and CONFIG_DEBUG_INFO_BTF $ make V=1 + ./tools/bpf/resolve_btfids/resolve_btfids vmlinux FAILED unresolved symbol __add_to_page_cache_locked Reverting 3351b16af494 ("mm/filemap: add static for function __add_to_page_cache_locked") fixes the issue. I don't see the warning which motivated this patch, but maybe it requires a particular .config or gcc version. Perhaps adding a __add_to_page_cache_locked() prototype would avoid it. But I haven't studied BTF enough to know if there's a better answer. 
Signed-off-by: Alex Shi Cc: Andrew Morton Cc: linux...@kvack.org Cc: linux-kernel@vger.kernel.org --- mm/filemap.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/mm/filemap.c b/mm/filemap.c index d90614f501da..249cf489f5df 100644 --- a/mm/filemap.c +++ b/mm/filemap.c @@ -827,7 +827,7 @@ int replace_page_cache_page(struct page *old, struct page *new, gfp_t gfp_mask) } EXPORT_SYMBOL_GPL(replace_page_cache_page); -noinline int __add_to_page_cache_locked(struct page *page, +static noinline int __add_to_page_cache_locked(struct page *page, struct address_space *mapping, pgoff_t offset, gfp_t gfp, void **shadowp) >> >> It's unclear to me whether BTF_ID() requires that the target symbol be >> non-static. It doesn't actually reference the symbol: >> >> #define BTF_ID(prefix, name) \ >> __BTF_ID(__ID(__BTF_ID__##prefix##__##name##__)) >> > > The above usage make me thought BTF don't require the symbol, though > the symbol still exist in vmlinux with 'static'. > > So any comments of this, Alexei? > >> Alexei, can you please comment? >>
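For reference, the -Wmissing-prototypes warning that motivated the original patch can be silenced without making the function static, as suggested above: give the function a previous prototype. This is a hypothetical standalone illustration of that pattern, not the actual mm/filemap.c fix:

```c
/*
 * Compiling this file with `gcc -Wmissing-prototypes -c` emits no
 * warning: the declaration below gives the external function a previous
 * prototype, so it can stay non-static (and thus remain a resolvable
 * symbol for BTF) without tripping -Wmissing-prototypes.
 */
int add_to_cache_locked(int page_id);	/* would normally live in a header */

int add_to_cache_locked(int page_id)
{
	/* placeholder body; -22 mimics -EINVAL for a bad argument */
	return page_id >= 0 ? 0 : -22;
}
```

The function name and body here are made up for the example; the point is only that a visible prototype satisfies the warning while keeping external linkage.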
Re: [PATCH] mm: mmap_lock: fix use-after-free race and css ref leak in tracepoints
Axel Rasmussen wrote: > On Mon, Nov 30, 2020 at 5:34 PM Shakeel Butt wrote: >> >> On Mon, Nov 30, 2020 at 3:43 PM Axel Rasmussen >> wrote: >> > >> > syzbot reported[1] a use-after-free introduced in 0f818c4bc1f3. The bug >> > is that an ongoing trace event might race with the tracepoint being >> > disabled (and therefore the _unreg() callback being called). Consider >> > this ordering: >> > >> > T1: trace event fires, get_mm_memcg_path() is called >> > T1: get_memcg_path_buf() returns a buffer pointer >> > T2: trace_mmap_lock_unreg() is called, buffers are freed >> > T1: cgroup_path() is called with the now-freed buffer >> >> Any reason to use the cgroup_path instead of the cgroup_ino? There are >> other examples of trace points using cgroup_ino and no need to >> allocate buffers. Also cgroup namespace might complicate the path >> usage. > > Hmm, so in general I would love to use a numeric identifier instead of a > string. > > I did some reading, and it looks like the cgroup_ino() mainly has to > do with writeback, instead of being just a general identifier? > https://www.kernel.org/doc/Documentation/cgroup-v2.txt > > There is cgroup_id() which I think is almost what I'd want, but there > are a couple problems with it: > > - I don't know of a way for userspace to translate IDs -> paths, to > make them human readable? The id => name map can be built from user space with a tree walk. Example: $ find /sys/fs/cgroup/memory -type d -printf '%i %P\n' # ~ [main] 20387 init.scope 31 system.slice > - Also I think the ID implementation we use for this is "dense", > meaning if a cgroup is removed, its ID is likely to be quickly reused. > >> >> > >> > The solution in this commit is to modify trace_mmap_lock_unreg() to >> > first stop new buffers from being handed out, and then to wait (spin) >> > until any existing buffer references are dropped (i.e., those trace >> > events complete). 
>> > >> > I have a simple reproducer program which spins up two pools of threads, >> > doing the following in a tight loop: >> > >> > Pool 1: >> > mmap(NULL, 4096, PROT_READ | PROT_WRITE, >> >MAP_PRIVATE | MAP_ANONYMOUS, -1, 0) >> > munmap() >> > >> > Pool 2: >> > echo 1 > /sys/kernel/debug/tracing/events/mmap_lock/enable >> > echo 0 > /sys/kernel/debug/tracing/events/mmap_lock/enable >> > >> > This triggers the use-after-free very quickly. With this patch, I let it >> > run for an hour without any BUGs. >> > >> > While fixing this, I also noticed and fixed a css ref leak. Previously >> > we called get_mem_cgroup_from_mm(), but we never called css_put() to >> > release that reference. get_mm_memcg_path() now does this properly. >> > >> > [1]: https://syzkaller.appspot.com/bug?extid=19e6dd9943972fa1c58a >> > >> > Fixes: 0f818c4bc1f3 ("mm: mmap_lock: add tracepoints around lock >> > acquisition") >> >> The original patch is in mm tree, so the SHA1 is not stabilized. >> Usually Andrew squash the fixes into the original patches. > > Ah, I added this because it also shows up in linux-next, under the > next-20201130 tag. I'll remove it in v2, squashing is fine. :) > >> >> > Signed-off-by: Axel Rasmussen >> > --- >> > mm/mmap_lock.c | 100 + >> > 1 file changed, 85 insertions(+), 15 deletions(-) >> > >> > diff --git a/mm/mmap_lock.c b/mm/mmap_lock.c >> > index 12af8f1b8a14..be38dc58278b 100644 >> > --- a/mm/mmap_lock.c >> > +++ b/mm/mmap_lock.c >> > @@ -3,6 +3,7 @@ >> > #include >> > >> > #include >> > +#include >> > #include >> > #include >> > #include >> > @@ -18,13 +19,28 @@ EXPORT_TRACEPOINT_SYMBOL(mmap_lock_released); >> > #ifdef CONFIG_MEMCG >> > >> > /* >> > - * Our various events all share the same buffer (because we don't want or >> > need >> > - * to allocate a set of buffers *per event type*), so we need to protect >> > against >> > - * concurrent _reg() and _unreg() calls, and count how many _reg() calls >> > have >> > - * been made. 
>> > + * This is unfortunately complicated... _reg() and _unreg() may be called >> > + * in parallel, separately for each of our three event types. To save >> > memory, >> > + * all of the event types share the same buffers. Furthermore, trace >> > events >> > + * might happen in parallel with _unreg(); we need to ensure we don't >> > free the >> > + * buffers before all inflights have finished. Because these events happen >> > + * "frequently", we also want to prevent new inflights from starting once >> > the >> > + * _unreg() process begins. And, for performance reasons, we want to >> > avoid any >> > + * locking in the trace event path. >> > + * >> > + * So: >> > + * >> > + * - Use a spinlock to serialize _reg() and _unreg() calls. >> > + * - Keep track of nested _reg() calls with a lock-protected counter. >> > + * - Define a flag indicating whether or not
Re: [PATCH] selftests: more general make nesting support
Shuah Khan wrote: > On 8/5/20 1:36 PM, Greg Thelen wrote: >> On Tue, Jul 28, 2020 at 12:32 AM Greg Thelen wrote: >>> >>> selftests can be built from the toplevel kernel makefile (e.g. make >>> kselftest-all) or directly (make -C tools/testing/selftests all). >>> >>> The toplevel kernel makefile explicitly disables implicit rules with >>> "MAKEFLAGS += -rR", which is passed to tools/testing/selftests. Some >>> selftest makefiles require implicit make rules, which is why >>> commit 67d8712dcc70 ("selftests: Fix build failures when invoked from >>> kselftest target") reenables implicit rules by clearing MAKEFLAGS if >>> MAKELEVEL=1. >>> >>> So far so good. However, if the toplevel makefile is called from an >>> outer makefile then MAKELEVEL will be elevated, which breaks the >>> MAKELEVEL equality test. >>> Example wrapped makefile error: >>>$ cat ~/Makefile >>>all: >>> $(MAKE) defconfig >>> $(MAKE) kselftest-all >>>$ make -sf ~/Makefile >>> futex_wait_timeout.c /src/tools/testing/selftests/kselftest_harness.h >>> /src/tools/testing/selftests/kselftest.h ../include/futextest.h >>> ../include/atomic.h ../include/logging.h -lpthread -lrt -o >>> /src/tools/testing/selftests/futex/functional/futex_wait_timeout >>>make[4]: futex_wait_timeout.c: Command not found >>> >>> Rather than checking $(MAKELEVEL), check for $(LINK.c), which is a more >>> direct side effect of "make -R". This enables arbitrary makefile >>> nesting. >>> >>> Signed-off-by: Greg Thelen >>> --- >>> tools/testing/selftests/Makefile | 8 >>> 1 file changed, 4 insertions(+), 4 deletions(-) >>> >>> diff --git a/tools/testing/selftests/Makefile >>> b/tools/testing/selftests/Makefile >>> index 1195bd85af38..289a2e4b3f6f 100644 >>> --- a/tools/testing/selftests/Makefile >>> +++ b/tools/testing/selftests/Makefile >>> @@ -84,10 +84,10 @@ endif >>> # of the targets gets built. 
>>> FORCE_TARGETS ?= >>> >>> -# Clear LDFLAGS and MAKEFLAGS if called from main >>> -# Makefile to avoid test build failures when test >>> -# Makefile doesn't have explicit build rules. >>> -ifeq (1,$(MAKELEVEL)) >>> +# Clear LDFLAGS and MAKEFLAGS when implicit rules are missing. This >>> provides >>> +# implicit rules to sub-test Makefiles which avoids build failures in test >>> +# Makefile that don't have explicit build rules. >>> +ifeq (,$(LINK.c)) >>> override LDFLAGS = >>> override MAKEFLAGS = >>> endif >>> -- >>> 2.28.0.rc0.142.g3c755180ce-goog >> >> Is there any feedback on this patch? It's not high priority but something >> that >> will help me make more use of selftests. >> > > No issues with the patch. I will apply it once the merge window ends. > > thanks, > -- Shuah Great. Thank you.
Re: [PATCH] selftests: more general make nesting support
On Tue, Jul 28, 2020 at 12:32 AM Greg Thelen wrote: > > selftests can be built from the toplevel kernel makefile (e.g. make > kselftest-all) or directly (make -C tools/testing/selftests all). > > The toplevel kernel makefile explicitly disables implicit rules with > "MAKEFLAGS += -rR", which is passed to tools/testing/selftests. Some > selftest makefiles require implicit make rules, which is why > commit 67d8712dcc70 ("selftests: Fix build failures when invoked from > kselftest target") reenables implicit rules by clearing MAKEFLAGS if > MAKELEVEL=1. > > So far so good. However, if the toplevel makefile is called from an > outer makefile then MAKELEVEL will be elevated, which breaks the > MAKELEVEL equality test. > Example wrapped makefile error: > $ cat ~/Makefile > all: > $(MAKE) defconfig > $(MAKE) kselftest-all > $ make -sf ~/Makefile > futex_wait_timeout.c /src/tools/testing/selftests/kselftest_harness.h > /src/tools/testing/selftests/kselftest.h ../include/futextest.h > ../include/atomic.h ../include/logging.h -lpthread -lrt -o > /src/tools/testing/selftests/futex/functional/futex_wait_timeout > make[4]: futex_wait_timeout.c: Command not found > > Rather than checking $(MAKELEVEL), check for $(LINK.c), which is a more > direct side effect of "make -R". This enables arbitrary makefile > nesting. > > Signed-off-by: Greg Thelen > --- > tools/testing/selftests/Makefile | 8 > 1 file changed, 4 insertions(+), 4 deletions(-) > > diff --git a/tools/testing/selftests/Makefile > b/tools/testing/selftests/Makefile > index 1195bd85af38..289a2e4b3f6f 100644 > --- a/tools/testing/selftests/Makefile > +++ b/tools/testing/selftests/Makefile > @@ -84,10 +84,10 @@ endif > # of the targets gets built. > FORCE_TARGETS ?= > > -# Clear LDFLAGS and MAKEFLAGS if called from main > -# Makefile to avoid test build failures when test > -# Makefile doesn't have explicit build rules. > -ifeq (1,$(MAKELEVEL)) > +# Clear LDFLAGS and MAKEFLAGS when implicit rules are missing. 
This provides > +# implicit rules to sub-test Makefiles which avoids build failures in test > +# Makefile that don't have explicit build rules. > +ifeq (,$(LINK.c)) > override LDFLAGS = > override MAKEFLAGS = > endif > -- > 2.28.0.rc0.142.g3c755180ce-goog Is there any feedback on this patch? It's not high priority but something that will help me make more use of selftests.
[PATCH] selftests: more general make nesting support
selftests can be built from the toplevel kernel makefile (e.g. make kselftest-all) or directly (make -C tools/testing/selftests all). The toplevel kernel makefile explicitly disables implicit rules with "MAKEFLAGS += -rR", which is passed to tools/testing/selftests. Some selftest makefiles require implicit make rules, which is why commit 67d8712dcc70 ("selftests: Fix build failures when invoked from kselftest target") reenables implicit rules by clearing MAKEFLAGS if MAKELEVEL=1. So far so good. However, if the toplevel makefile is called from an outer makefile then MAKELEVEL will be elevated, which breaks the MAKELEVEL equality test. Example wrapped makefile error: $ cat ~/Makefile all: $(MAKE) defconfig $(MAKE) kselftest-all $ make -sf ~/Makefile futex_wait_timeout.c /src/tools/testing/selftests/kselftest_harness.h /src/tools/testing/selftests/kselftest.h ../include/futextest.h ../include/atomic.h ../include/logging.h -lpthread -lrt -o /src/tools/testing/selftests/futex/functional/futex_wait_timeout make[4]: futex_wait_timeout.c: Command not found Rather than checking $(MAKELEVEL), check for $(LINK.c), which is a more direct side effect of "make -R". This enables arbitrary makefile nesting. Signed-off-by: Greg Thelen --- tools/testing/selftests/Makefile | 8 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/tools/testing/selftests/Makefile b/tools/testing/selftests/Makefile index 1195bd85af38..289a2e4b3f6f 100644 --- a/tools/testing/selftests/Makefile +++ b/tools/testing/selftests/Makefile @@ -84,10 +84,10 @@ endif # of the targets gets built. FORCE_TARGETS ?= -# Clear LDFLAGS and MAKEFLAGS if called from main -# Makefile to avoid test build failures when test -# Makefile doesn't have explicit build rules. -ifeq (1,$(MAKELEVEL)) +# Clear LDFLAGS and MAKEFLAGS when implicit rules are missing. This provides +# implicit rules to sub-test Makefiles which avoids build failures in test +# Makefile that don't have explicit build rules. 
+ifeq (,$(LINK.c)) override LDFLAGS = override MAKEFLAGS = endif -- 2.28.0.rc0.142.g3c755180ce-goog
Re: [PATCH v18 06/14] mm/damon: Implement callbacks for the virtual memory address spaces
SeongJae Park wrote: > From: SeongJae Park > > This commit introduces a reference implementation of the address space > specific low level primitives for the virtual address space, so that > users of DAMON can easily monitor the data accesses on virtual address > spaces of specific processes by simply configuring the implementation to > be used by DAMON. > > The low level primitives for the fundamental access monitoring are > defined in two parts: > 1. Identification of the monitoring target address range for the address > space. > 2. Access check of specific address range in the target space. > > The reference implementation for the virtual address space provided by > this commit is designed as below. > > PTE Accessed-bit Based Access Check > --- > > The implementation uses the PTE Accessed-bit for basic access checks. That > is, it clears the bit for the next sampling target page and checks whether > it is set again after one sampling period. To avoid disturbing other > Accessed bit users such as the reclamation logic, the implementation > adjusts the ``PG_Idle`` and ``PG_Young`` appropriately, the same as > 'Idle Page Tracking' does. > > VMA-based Target Address Range Construction > --- > > Only small parts in the super-huge virtual address space of the > processes are mapped to physical memory and accessed. Thus, tracking > the unmapped address regions is just wasteful. However, because DAMON > can deal with some level of noise using the adaptive regions adjustment > mechanism, tracking every mapping is not strictly required but could > even incur a high overhead in some cases. That said, very large unmapped > areas inside the monitoring target should be removed so they do not > consume the adaptive mechanism's time. > > For this reason, this implementation converts the complex mappings to > three distinct regions that cover every mapped area of the address > space. Also, the two gaps between the three regions are the two biggest > unmapped areas in the given address space. 
The two biggest unmapped > areas would be the gap between the heap and the uppermost mmap()-ed > region, and the gap between the lowermost mmap()-ed region and the stack > in most cases. Because these gaps are exceptionally huge in > usual address spaces, excluding these will be sufficient to make a > reasonable trade-off. Below shows this in detail:: > > > > > (small mmap()-ed regions and munmap()-ed regions) > > > > > Signed-off-by: SeongJae Park > Reviewed-by: Leonard Foerster > --- > include/linux/damon.h | 6 + > mm/damon.c| 474 ++ > 2 files changed, 480 insertions(+) > > diff --git a/include/linux/damon.h b/include/linux/damon.h > index 3c0b92a679e8..310d36d123b3 100644 > --- a/include/linux/damon.h > +++ b/include/linux/damon.h > @@ -144,6 +144,12 @@ struct damon_ctx { > void (*aggregate_cb)(struct damon_ctx *context); > }; > > +/* Reference callback implementations for virtual memory */ > +void kdamond_init_vm_regions(struct damon_ctx *ctx); > +void kdamond_update_vm_regions(struct damon_ctx *ctx); > +void kdamond_prepare_vm_access_checks(struct damon_ctx *ctx); > +unsigned int kdamond_check_vm_accesses(struct damon_ctx *ctx); > + > int damon_set_pids(struct damon_ctx *ctx, int *pids, ssize_t nr_pids); > int damon_set_attrs(struct damon_ctx *ctx, unsigned long sample_int, > unsigned long aggr_int, unsigned long regions_update_int, > diff --git a/mm/damon.c b/mm/damon.c > index b844924b9fdb..386780739007 100644 > --- a/mm/damon.c > +++ b/mm/damon.c > @@ -9,6 +9,9 @@ > * This file is constructed in below parts. 
> * > * - Functions and macros for DAMON data structures > + * - Functions for the initial monitoring target regions construction > + * - Functions for the dynamic monitoring target regions update > + * - Functions for the access checking of the regions > * - Functions for DAMON core logics and features > * - Functions for the DAMON programming interface > * - Functions for the module loading/unloading > @@ -196,6 +199,477 @@ static unsigned long damon_region_sz_limit(struct > damon_ctx *ctx) > return sz; > } > > +/* > + * Get the mm_struct of the given task > + * > + * Caller _must_ put the mm_struct after use, unless it is NULL. > + * > + * Returns the mm_struct of the task on success, NULL on failure > + */ > +static struct mm_struct *damon_get_mm(struct damon_task *t) > +{ > + struct task_struct *task; > + struct mm_struct *mm; > + > + task = damon_get_task_struct(t); > + if (!task) > + return NULL; > + > + mm = get_task_mm(task); > + put_task_struct(task); > + return mm; > +} > + > +/* > + * Functions for the initial monitoring target regions construction > + */ > + > +/* > + * Size-evenly split a region into 'nr_pieces' small regions > + * > + * Returns 0 on success,
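The VMA-based construction described in the changelog reduces to a simple computation: among the gaps between sorted mapped ranges, find the two largest, and cover everything else with three regions. A toy userspace version is below — the real code walks actual VMAs under mmap_lock and is more careful; this sketch assumes at least three sorted, non-overlapping input ranges:

```c
#include <stddef.h>

struct range { unsigned long start, end; };

/*
 * Drop the two largest gaps between the mapped ranges and cover the
 * remainder with three regions, as the changelog describes (heap /
 * mmap()-ed regions / stack in the common case).
 */
static void three_regions(const struct range *maps, size_t n,
			  struct range out[3])
{
	size_t g1 = 0, g2 = 0;		/* indices of the two largest gaps */
	unsigned long s1 = 0, s2 = 0;	/* and their sizes */
	size_t i;

	for (i = 0; i + 1 < n; i++) {
		unsigned long gap = maps[i + 1].start - maps[i].end;

		if (gap > s1) {
			s2 = s1; g2 = g1;
			s1 = gap; g1 = i;
		} else if (gap > s2) {
			s2 = gap; g2 = i;
		}
	}
	if (g1 > g2) {			/* order the gaps by address */
		size_t t = g1; g1 = g2; g2 = t;
	}
	out[0] = (struct range){ maps[0].start, maps[g1].end };
	out[1] = (struct range){ maps[g1 + 1].start, maps[g2].end };
	out[2] = (struct range){ maps[g2 + 1].start, maps[n - 1].end };
}
```

Small gaps inside each of the three output regions are deliberately kept; only the two biggest gaps are excluded, matching the "reasonable trade-off" rationale above.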
Re: [PATCH v18 11/14] Documentation: Add documents for DAMON
SeongJae Park wrote: > From: SeongJae Park > > This commit adds documents for DAMON under > `Documentation/admin-guide/mm/damon/` and `Documentation/vm/damon/`. > > Signed-off-by: SeongJae Park > --- > Documentation/admin-guide/mm/damon/guide.rst | 157 ++ > Documentation/admin-guide/mm/damon/index.rst | 15 + > Documentation/admin-guide/mm/damon/plans.rst | 29 ++ > Documentation/admin-guide/mm/damon/start.rst | 98 ++ > Documentation/admin-guide/mm/damon/usage.rst | 298 +++ > Documentation/admin-guide/mm/index.rst | 1 + > Documentation/vm/damon/api.rst | 20 ++ > Documentation/vm/damon/eval.rst | 222 ++ > Documentation/vm/damon/faq.rst | 59 > Documentation/vm/damon/index.rst | 32 ++ > Documentation/vm/damon/mechanisms.rst| 165 ++ > Documentation/vm/index.rst | 1 + > 12 files changed, 1097 insertions(+) > create mode 100644 Documentation/admin-guide/mm/damon/guide.rst > create mode 100644 Documentation/admin-guide/mm/damon/index.rst > create mode 100644 Documentation/admin-guide/mm/damon/plans.rst > create mode 100644 Documentation/admin-guide/mm/damon/start.rst > create mode 100644 Documentation/admin-guide/mm/damon/usage.rst > create mode 100644 Documentation/vm/damon/api.rst > create mode 100644 Documentation/vm/damon/eval.rst > create mode 100644 Documentation/vm/damon/faq.rst > create mode 100644 Documentation/vm/damon/index.rst > create mode 100644 Documentation/vm/damon/mechanisms.rst > > diff --git a/Documentation/admin-guide/mm/damon/guide.rst > b/Documentation/admin-guide/mm/damon/guide.rst > new file mode 100644 > index ..c51fb843efaa > --- /dev/null > +++ b/Documentation/admin-guide/mm/damon/guide.rst > @@ -0,0 +1,157 @@ > +.. SPDX-License-Identifier: GPL-2.0 > + > +== > +Optimization Guide > +== > + > +This document helps you estimate the amount of benefit that you could get > +from DAMON-based optimizations, and describes how you could achieve it. You > +are assumed to have already read :doc:`start`. 
> + > + > +Check The Signs > +=== > + > +No optimization can provide the same extent of benefit in every case. Therefore > +you should first guess how much improvement you could get using DAMON. If > +some of the below conditions match your situation, you could consider using > DAMON. > + > +- *Low IPC and High Cache Miss Ratios.* Low IPC means most of the CPU time > is > + spent waiting for the completion of time-consuming operations such as > memory > + access, while high cache miss ratios mean the caches don't help much. > + DAMON is not for cache-level optimization, but for the DRAM level. However, > + improving DRAM management will also help this case by reducing the memory > + operation latency. > +- *Memory Over-commitment and Unknown Users.* If you are doing memory > + overcommitment and you cannot control every user of your system, a memory > + bank run could happen at any time. You can estimate when it will happen > + based on DAMON's monitoring results and act earlier to avoid or deal better > + with the crisis. > +- *Frequent Memory Pressure.* Frequent memory pressure means your system has > + a wrong configuration or memory hogs. DAMON will help you find the right > + configuration and/or the criminals. > +- *Heterogeneous Memory System.* If your system is utilizing memory devices > + that are placed between DRAM and traditional hard disks, such as non-volatile > + memory or fast SSDs, DAMON could help you utilize the devices more > + efficiently. > + > + > +Profile > +=== > + > +If you found some positive signals, you could start by profiling your > workloads > +using DAMON. Find major workloads on your systems and analyze their data > +access pattern to find something that is wrong or could be improved. The DAMON user > +space tool (``damo``) will be useful for this. > + > +We recommend starting with a working set size distribution check using ``damo > +report wss``. 
If the distribution is non-uniform or quite different from what > +you estimated, you could consider `Memory Configuration`_ optimization. > + > +Then, review the overall access pattern in heatmap form using ``damo report > +heats``. If it shows a simple pattern consisting of a small number of memory > +regions with a high contrast in access temperature, you could consider manual > +`Program Modification`_. > + > +If you still want to absorb more benefits, you should develop a `Personalized > +DAMON Application`_ for your special case. > + > +You don't need to take only one of the above approaches; you could > +use several of them together to maximize the benefit. > + > + > +Optimize > + > + > +If the profiling result also says it's worth trying some optimization, you > +could consider the below approaches. Note that some of the below approaches > assume > +that
Re: [PATCH v2] powerpc/powernv/pci: use ifdef to avoid dead code
Oliver O'Halloran wrote: > On Mon, Jun 15, 2020 at 9:33 AM Greg Thelen wrote: >> >> Commit dc3d8f85bb57 ("powerpc/powernv/pci: Re-work bus PE >> configuration") removed a couple pnv_ioda_setup_bus_dma() calls. The >> only remaining calls are behind CONFIG_IOMMU_API. Thus builds without >> CONFIG_IOMMU_API see: >> arch/powerpc/platforms/powernv/pci-ioda.c:1888:13: error: >> 'pnv_ioda_setup_bus_dma' defined but not used >> >> Move pnv_ioda_setup_bus_dma() under CONFIG_IOMMU_API to avoid dead code. > > Doh! Thanks for the fix. > > Reviewed-by: Oliver O'Halloran Is there anything else needed from me on this patch? Given that it fixes a 5.8 commit I figured it'd be 5.8 material.
Re: [RFC][PATCH 2/8] mm/migrate: Defer allocating new page until needed
Dave Hansen wrote:
> From: Keith Busch
>
> Migrating pages had been allocating the new page before it was actually
> needed. Subsequent operations may still fail, which would have to handle
> cleaning up the newly allocated page when it was never used.
>
> Defer allocating the page until we are actually ready to make use of
> it, after locking the original page. This simplifies error handling,
> but should not have any functional change in behavior. This is just
> refactoring page migration so the main part can more easily be reused
> by other code.

Is there any concern that the src page is now held PG_locked over the dst
page allocation, which might wander into reclaim/cond_resched/oom_kill? I
don't have a deadlock in mind. I'm just wondering about the additional
latency imposed on unrelated threads that want access to the src page.

> #Signed-off-by: Keith Busch

Is the commented-out Signed-off-by intentional? The same applies to later
patches.

> Signed-off-by: Dave Hansen
> Cc: Keith Busch
> Cc: Yang Shi
> Cc: David Rientjes
> Cc: Huang Ying
> Cc: Dan Williams
> ---
>
> b/mm/migrate.c | 148 +-
> 1 file changed, 75 insertions(+), 73 deletions(-)
>
> diff -puN mm/migrate.c~0007-mm-migrate-Defer-allocating-new-page-until-needed mm/migrate.c
> --- a/mm/migrate.c~0007-mm-migrate-Defer-allocating-new-page-until-needed	2020-06-29 16:34:37.896312607 -0700
> +++ b/mm/migrate.c	2020-06-29 16:34:37.900312607 -0700
> @@ -1014,56 +1014,17 @@ out:
>  	return rc;
>  }
>
> -static int __unmap_and_move(struct page *page, struct page *newpage,
> -			    int force, enum migrate_mode mode)
> +static int __unmap_and_move(new_page_t get_new_page,
> +			    free_page_t put_new_page,
> +			    unsigned long private, struct page *page,
> +			    enum migrate_mode mode,
> +			    enum migrate_reason reason)
>  {
>  	int rc = -EAGAIN;
>  	int page_was_mapped = 0;
>  	struct anon_vma *anon_vma = NULL;
>  	bool is_lru = !__PageMovable(page);
> -
> -	if (!trylock_page(page)) {
> -		if (!force || mode == MIGRATE_ASYNC)
> -			goto out;
> -
> -		/*
> -		 * It's not safe for direct compaction to call lock_page.
> -		 * For example, during page readahead pages are added locked
> -		 * to the LRU. Later, when the IO completes the pages are
> -		 * marked uptodate and unlocked. However, the queueing
> -		 * could be merging multiple pages for one bio (e.g.
> -		 * mpage_readpages). If an allocation happens for the
> -		 * second or third page, the process can end up locking
> -		 * the same page twice and deadlocking. Rather than
> -		 * trying to be clever about what pages can be locked,
> -		 * avoid the use of lock_page for direct compaction
> -		 * altogether.
> -		 */
> -		if (current->flags & PF_MEMALLOC)
> -			goto out;
> -
> -		lock_page(page);
> -	}
> -
> -	if (PageWriteback(page)) {
> -		/*
> -		 * Only in the case of a full synchronous migration is it
> -		 * necessary to wait for PageWriteback. In the async case,
> -		 * the retry loop is too short and in the sync-light case,
> -		 * the overhead of stalling is too much
> -		 */
> -		switch (mode) {
> -		case MIGRATE_SYNC:
> -		case MIGRATE_SYNC_NO_COPY:
> -			break;
> -		default:
> -			rc = -EBUSY;
> -			goto out_unlock;
> -		}
> -		if (!force)
> -			goto out_unlock;
> -		wait_on_page_writeback(page);
> -	}
> +	struct page *newpage;
>
>  	/*
>  	 * By try_to_unmap(), page->mapcount goes down to 0 here. In this case,
> @@ -1082,6 +1043,12 @@ static int __unmap_and_move(struct page
>  	if (PageAnon(page) && !PageKsm(page))
>  		anon_vma = page_get_anon_vma(page);
>
> +	newpage = get_new_page(page, private);
> +	if (!newpage) {
> +		rc = -ENOMEM;
> +		goto out;
> +	}
> +
>  	/*
>  	 * Block others from accessing the new page when we get around to
>  	 * establishing additional references. We are usually the only one
> @@ -1091,11 +1058,11 @@ static int __unmap_and_move(struct page
>  	 * This is much like races on refcount of oldpage: just don't BUG().
>  	 */
>  	if (unlikely(!trylock_page(newpage)))
> -		goto out_unlock;
> +		goto out_put;
>
>  	if (unlikely(!is_lru)) {
>  		rc = move_to_new_page(newpage, page, mode);
> -		goto out_unlock_both;
> +		goto out_unlock;
>  	}
>
>  	/*
> @@ -1114,7 +1081,7 @@
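The error-handling win of the patch above can be sketched in plain userspace C. This is a hypothetical model, not kernel API: `resource`, `try_lock_resource`, and friends are illustrative names. The point is that once the allocation happens only after the source is locked, every early failure path needs no cleanup of a never-used destination.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdlib.h>

/* Illustrative stand-in for a page; not a kernel structure. */
struct resource {
	bool locked;
	char *data;
};

static int allocations;	/* counts live allocations to show none leak */

static bool try_lock_resource(struct resource *r)
{
	if (r->locked)
		return false;
	r->locked = true;
	return true;
}

static void unlock_resource(struct resource *r)
{
	r->locked = false;
}

/*
 * Before the patch: dst was allocated first, so every early failure had
 * to free it. After (modeled here): lock src first, allocate dst only
 * once we are committed, so early failures need no cleanup at all.
 */
static int migrate(struct resource *src, struct resource **dst_out)
{
	struct resource *dst;

	if (!try_lock_resource(src))
		return -EAGAIN;		/* fails before any allocation */

	dst = calloc(1, sizeof(*dst));
	if (!dst) {
		unlock_resource(src);
		return -ENOMEM;
	}
	allocations++;
	*dst_out = dst;
	unlock_resource(src);
	return 0;
}
```

A failed `migrate()` on a busy source leaves `allocations` untouched, which is exactly the simplification the commit message claims.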
[PATCH v2] powerpc/powernv/pci: use ifdef to avoid dead code
Commit dc3d8f85bb57 ("powerpc/powernv/pci: Re-work bus PE configuration")
removed a couple pnv_ioda_setup_bus_dma() calls. The only remaining calls
are behind CONFIG_IOMMU_API. Thus builds without CONFIG_IOMMU_API see:
  arch/powerpc/platforms/powernv/pci-ioda.c:1888:13: error: 'pnv_ioda_setup_bus_dma' defined but not used

Move pnv_ioda_setup_bus_dma() under CONFIG_IOMMU_API to avoid dead code.

Fixes: dc3d8f85bb57 ("powerpc/powernv/pci: Re-work bus PE configuration")
Signed-off-by: Greg Thelen
---
 arch/powerpc/platforms/powernv/pci-ioda.c | 26 +++++++++++++-------------
 1 file changed, 13 insertions(+), 13 deletions(-)

diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
index 73a63efcf855..743d840712da 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -1885,19 +1885,6 @@ static bool pnv_pci_ioda_iommu_bypass_supported(struct pci_dev *pdev,
 	return false;
 }
 
-static void pnv_ioda_setup_bus_dma(struct pnv_ioda_pe *pe, struct pci_bus *bus)
-{
-	struct pci_dev *dev;
-
-	list_for_each_entry(dev, &bus->devices, bus_list) {
-		set_iommu_table_base(&dev->dev, pe->table_group.tables[0]);
-		dev->dev.archdata.dma_offset = pe->tce_bypass_base;
-
-		if ((pe->flags & PNV_IODA_PE_BUS_ALL) && dev->subordinate)
-			pnv_ioda_setup_bus_dma(pe, dev->subordinate);
-	}
-}
-
 static inline __be64 __iomem *pnv_ioda_get_inval_reg(struct pnv_phb *phb,
 						     bool real_mode)
 {
@@ -2501,6 +2488,19 @@ static long pnv_pci_ioda2_unset_window(struct iommu_table_group *table_group,
 #endif
 
 #ifdef CONFIG_IOMMU_API
+static void pnv_ioda_setup_bus_dma(struct pnv_ioda_pe *pe, struct pci_bus *bus)
+{
+	struct pci_dev *dev;
+
+	list_for_each_entry(dev, &bus->devices, bus_list) {
+		set_iommu_table_base(&dev->dev, pe->table_group.tables[0]);
+		dev->dev.archdata.dma_offset = pe->tce_bypass_base;
+
+		if ((pe->flags & PNV_IODA_PE_BUS_ALL) && dev->subordinate)
+			pnv_ioda_setup_bus_dma(pe, dev->subordinate);
+	}
+}
+
 unsigned long pnv_pci_ioda2_get_table_size(__u32 page_shift,
 		__u64 window_size, __u32 levels)
 {
-- 
2.27.0.290.gba653c62da-goog
[PATCH] e1000e: add ifdef to avoid dead code
Commit e086ba2fccda ("e1000e: disable s0ix entry and exit flows for ME
systems") added e1000e_check_me() but it's only called from
CONFIG_PM_SLEEP protected code. Thus builds without CONFIG_PM_SLEEP see:
  drivers/net/ethernet/intel/e1000e/netdev.c:137:13: warning: 'e1000e_check_me' defined but not used [-Wunused-function]

Add CONFIG_PM_SLEEP ifdef guard to avoid dead code.

Fixes: e086ba2fccda ("e1000e: disable s0ix entry and exit flows for ME systems")
Signed-off-by: Greg Thelen
---
 drivers/net/ethernet/intel/e1000e/netdev.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/drivers/net/ethernet/intel/e1000e/netdev.c b/drivers/net/ethernet/intel/e1000e/netdev.c
index a279f4fa9962..165f0aea22c9 100644
--- a/drivers/net/ethernet/intel/e1000e/netdev.c
+++ b/drivers/net/ethernet/intel/e1000e/netdev.c
@@ -107,6 +107,7 @@ static const struct e1000_reg_info e1000_reg_info_tbl[] = {
 	{0, NULL}
 };
 
+#ifdef CONFIG_PM_SLEEP
 struct e1000e_me_supported {
 	u16 device_id;		/* supported device ID */
 };
@@ -145,6 +146,7 @@ static bool e1000e_check_me(u16 device_id)
 	return false;
 }
+#endif /* CONFIG_PM_SLEEP */
 
 /**
  * __ew32_prepare - prepare to write to MAC CSR register on certain parts
-- 
2.27.0.290.gba653c62da-goog
[PATCH] powerpc/powernv/pci: add ifdef to avoid dead code
Commit dc3d8f85bb57 ("powerpc/powernv/pci: Re-work bus PE configuration")
removed a couple pnv_ioda_setup_bus_dma() calls. The only remaining calls
are behind CONFIG_IOMMU_API. Thus builds without CONFIG_IOMMU_API see:
  arch/powerpc/platforms/powernv/pci-ioda.c:1888:13: error: 'pnv_ioda_setup_bus_dma' defined but not used

Add CONFIG_IOMMU_API ifdef guard to avoid dead code.

Fixes: dc3d8f85bb57 ("powerpc/powernv/pci: Re-work bus PE configuration")
Signed-off-by: Greg Thelen
---
 arch/powerpc/platforms/powernv/pci-ioda.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
index 73a63efcf855..f7762052b7c4 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -1885,6 +1885,7 @@ static bool pnv_pci_ioda_iommu_bypass_supported(struct pci_dev *pdev,
 	return false;
 }
 
+#ifdef CONFIG_IOMMU_API
 static void pnv_ioda_setup_bus_dma(struct pnv_ioda_pe *pe, struct pci_bus *bus)
 {
 	struct pci_dev *dev;
@@ -1897,6 +1898,7 @@ static void pnv_ioda_setup_bus_dma(struct pnv_ioda_pe *pe, struct pci_bus *bus)
 			pnv_ioda_setup_bus_dma(pe, dev->subordinate);
 	}
 }
+#endif
 
 static inline __be64 __iomem *pnv_ioda_get_inval_reg(struct pnv_phb *phb,
 						     bool real_mode)
-- 
2.27.0.290.gba653c62da-goog
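The pattern both of these dead-code patches apply can be sketched in a few lines of plain C. `HAVE_FEATURE` is a hypothetical stand-in for a Kconfig symbol like CONFIG_IOMMU_API: a static helper whose only caller is conditionally compiled must sit under the same guard, or `-Wunused-function` (fatal with `-Werror`) trips when the option is off.

```c
/*
 * Sketch: keep a static helper under the same #ifdef as its only caller.
 * HAVE_FEATURE models a kernel CONFIG_* option; here it is enabled so the
 * guarded code compiles and runs.
 */
#define HAVE_FEATURE 1

#ifdef HAVE_FEATURE
/* Only defined when the feature (and thus the only caller) is built in. */
static int feature_helper(int x)
{
	return x * 2;
}

int feature_entry(int x)
{
	return feature_helper(x);
}
#endif /* HAVE_FEATURE */
```

With `HAVE_FEATURE` undefined, both functions disappear together, so no "defined but not used" warning is possible.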
Re: [PATCH] shmem, memcg: enable memcg aware shrinker
Yang Shi wrote:
> On Sun, May 31, 2020 at 8:22 PM Greg Thelen wrote:
>>
>> Since v4.19 commit b0dedc49a2da ("mm/vmscan.c: iterate only over charged
>> shrinkers during memcg shrink_slab()") a memcg aware shrinker is only
>> called when the per-memcg per-node shrinker_map indicates that the
>> shrinker may have objects to release to the memcg and node.
>>
>> shmem_unused_huge_count and shmem_unused_huge_scan support the per-tmpfs
>> shrinker which advertises per memcg and numa awareness. The shmem
>> shrinker releases memory by splitting hugepages that extend beyond
>> i_size.
>>
>> Shmem does not currently set bits in shrinker_map. So, starting with
>> b0dedc49a2da, memcg reclaim avoids calling the shmem shrinker under
>> pressure. This leads to undeserved memcg OOM kills.
>> Example that reliably sees memcg OOM kill in unpatched kernel:
>>   FS=/tmp/fs
>>   CONTAINER=/cgroup/memory/tmpfs_shrinker
>>   mkdir -p $FS
>>   mount -t tmpfs -o huge=always nodev $FS
>>   # Create 1000 MB container, which shouldn't suffer OOM.
>>   mkdir $CONTAINER
>>   echo 1000M > $CONTAINER/memory.limit_in_bytes
>>   echo $BASHPID >> $CONTAINER/cgroup.procs
>>   # Create 4000 files. Ideally each file uses 4k data page + a little
>>   # metadata. Assume 8k total per-file, 32MB (4000*8k) should easily
>>   # fit within container's 1000 MB. But if data pages use 2MB
>>   # hugepages (due to aggressive huge=always) then files consume 8GB,
>>   # which hits memcg 1000 MB limit.
>>   for i in {1..4000}; do
>>     echo . > $FS/$i
>>   done
>
> It looks all the inodes which have tail THP beyond i_size are on one
> single list, then the shrinker actually just splits the first
> nr_to_scan inodes. But since the list is not memcg aware, so it seems
> it may split the THPs which are not charged to the victim memcg and
> the victim memcg still may suffer from pre-mature oom, right?

Correct. shmem_unused_huge_shrink() is not memcg aware.
In response to memcg pressure it will split the post-i_size tails of
nr_to_scan tmpfs inodes regardless of whether they're charged to the
under-pressure memcg. do_shrink_slab() looks like it'll repeatedly call
shmem_unused_huge_shrink(), so it will split the tails of many inodes. I
think it'll avoid the oom by over-shrinking. This is not ideal, but it
seems better than an undeserved oom kill.

I think the solution (as Kirill Tkhai suggested) of a memcg-aware index
would solve both:
1) avoid premature oom by registering the shrinker to respond to memcg
   pressure
2) avoid shrinking/splitting inodes unrelated to the under-pressure memcg

I can certainly look into that (thanks Kirill for the pointers). In the
short term I'm still interested in avoiding premature OOMs with the
original patch (i.e. restore pre-4.19 behavior to the shmem shrinker for
memcg pressure). I plan to test and repost v2.
Re: [PATCH] shmem, memcg: enable memcg aware shrinker
Kirill Tkhai wrote:
> Hi, Greg,
>
> good finding. See comments below.
>
> On 01.06.2020 06:22, Greg Thelen wrote:
>> Since v4.19 commit b0dedc49a2da ("mm/vmscan.c: iterate only over charged
>> shrinkers during memcg shrink_slab()") a memcg aware shrinker is only
>> called when the per-memcg per-node shrinker_map indicates that the
>> shrinker may have objects to release to the memcg and node.
>>
>> shmem_unused_huge_count and shmem_unused_huge_scan support the per-tmpfs
>> shrinker which advertises per memcg and numa awareness. The shmem
>> shrinker releases memory by splitting hugepages that extend beyond
>> i_size.
>>
>> Shmem does not currently set bits in shrinker_map. So, starting with
>> b0dedc49a2da, memcg reclaim avoids calling the shmem shrinker under
>> pressure. This leads to undeserved memcg OOM kills.
>> Example that reliably sees memcg OOM kill in unpatched kernel:
>>   FS=/tmp/fs
>>   CONTAINER=/cgroup/memory/tmpfs_shrinker
>>   mkdir -p $FS
>>   mount -t tmpfs -o huge=always nodev $FS
>>   # Create 1000 MB container, which shouldn't suffer OOM.
>>   mkdir $CONTAINER
>>   echo 1000M > $CONTAINER/memory.limit_in_bytes
>>   echo $BASHPID >> $CONTAINER/cgroup.procs
>>   # Create 4000 files. Ideally each file uses 4k data page + a little
>>   # metadata. Assume 8k total per-file, 32MB (4000*8k) should easily
>>   # fit within container's 1000 MB. But if data pages use 2MB
>>   # hugepages (due to aggressive huge=always) then files consume 8GB,
>>   # which hits memcg 1000 MB limit.
>>   for i in {1..4000}; do
>>     echo . > $FS/$i
>>   done
>>
>> v5.4 commit 87eaceb3faa5 ("mm: thp: make deferred split shrinker memcg
>> aware") maintains the per-node per-memcg shrinker bitmap for THP
>> shrinker. But there's no such logic in shmem. Make shmem set the
>> per-memcg per-node shrinker bits when it modifies inodes to have
>> shrinkable pages.
>>
>> Fixes: b0dedc49a2da ("mm/vmscan.c: iterate only over charged shrinkers
>> during memcg shrink_slab()")
>> Cc: # 4.19+
>> Signed-off-by: Greg Thelen
>> ---
>>  mm/shmem.c | 61 +++---
>>  1 file changed, 35 insertions(+), 26 deletions(-)
>>
>> diff --git a/mm/shmem.c b/mm/shmem.c
>> index bd8840082c94..e11090f78cb5 100644
>> --- a/mm/shmem.c
>> +++ b/mm/shmem.c
>> @@ -1002,6 +1002,33 @@ static int shmem_getattr(const struct path *path, struct kstat *stat,
>>  	return 0;
>>  }
>>
>> +/*
>> + * Expose inode and optional page to shrinker as having a possibly splittable
>> + * hugepage that reaches beyond i_size.
>> + */
>> +static void shmem_shrinker_add(struct shmem_sb_info *sbinfo,
>> +			       struct inode *inode, struct page *page)
>> +{
>> +	struct shmem_inode_info *info = SHMEM_I(inode);
>> +
>> +	spin_lock(&sbinfo->shrinklist_lock);
>> +	/*
>> +	 * _careful to defend against unlocked access to ->shrink_list in
>> +	 * shmem_unused_huge_shrink()
>> +	 */
>> +	if (list_empty_careful(&info->shrinklist)) {
>> +		list_add_tail(&info->shrinklist, &sbinfo->shrinklist);
>> +		sbinfo->shrinklist_len++;
>> +	}
>> +	spin_unlock(&sbinfo->shrinklist_lock);
>> +
>> +#ifdef CONFIG_MEMCG
>> +	if (page && PageTransHuge(page))
>> +		memcg_set_shrinker_bit(page->mem_cgroup, page_to_nid(page),
>> +				       inode->i_sb->s_shrink.id);
>> +#endif
>> +}
>> +
>>  static int shmem_setattr(struct dentry *dentry, struct iattr *attr)
>>  {
>>  	struct inode *inode = d_inode(dentry);
>> @@ -1048,17 +1075,13 @@ static int shmem_setattr(struct dentry *dentry, struct iattr *attr)
>>  	 * to shrink under memory pressure.
>>  	 */
>>  	if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE)) {
>> -		spin_lock(&sbinfo->shrinklist_lock);
>> -		/*
>> -		 * _careful to defend against unlocked access to
>> -		 * ->shrink_list in shmem_unused_huge_shrink()
>> -		 */
>> -		if (list_empty_careful(&info->shrinklist)) {
>> -			list_add_tail(&info->shrinklist,
>> -				      &sbinfo->shrinklist);
>> -
[PATCH] shmem, memcg: enable memcg aware shrinker
Since v4.19 commit b0dedc49a2da ("mm/vmscan.c: iterate only over charged
shrinkers during memcg shrink_slab()") a memcg aware shrinker is only
called when the per-memcg per-node shrinker_map indicates that the
shrinker may have objects to release to the memcg and node.

shmem_unused_huge_count and shmem_unused_huge_scan support the per-tmpfs
shrinker which advertises per memcg and numa awareness. The shmem
shrinker releases memory by splitting hugepages that extend beyond
i_size.

Shmem does not currently set bits in shrinker_map. So, starting with
b0dedc49a2da, memcg reclaim avoids calling the shmem shrinker under
pressure. This leads to undeserved memcg OOM kills.
Example that reliably sees memcg OOM kill in unpatched kernel:
  FS=/tmp/fs
  CONTAINER=/cgroup/memory/tmpfs_shrinker
  mkdir -p $FS
  mount -t tmpfs -o huge=always nodev $FS
  # Create 1000 MB container, which shouldn't suffer OOM.
  mkdir $CONTAINER
  echo 1000M > $CONTAINER/memory.limit_in_bytes
  echo $BASHPID >> $CONTAINER/cgroup.procs
  # Create 4000 files. Ideally each file uses 4k data page + a little
  # metadata. Assume 8k total per-file, 32MB (4000*8k) should easily
  # fit within container's 1000 MB. But if data pages use 2MB
  # hugepages (due to aggressive huge=always) then files consume 8GB,
  # which hits memcg 1000 MB limit.
  for i in {1..4000}; do
    echo . > $FS/$i
  done

v5.4 commit 87eaceb3faa5 ("mm: thp: make deferred split shrinker memcg
aware") maintains the per-node per-memcg shrinker bitmap for THP
shrinker. But there's no such logic in shmem. Make shmem set the
per-memcg per-node shrinker bits when it modifies inodes to have
shrinkable pages.
Fixes: b0dedc49a2da ("mm/vmscan.c: iterate only over charged shrinkers
during memcg shrink_slab()")
Cc: # 4.19+
Signed-off-by: Greg Thelen
---
 mm/shmem.c | 61 +++---
 1 file changed, 35 insertions(+), 26 deletions(-)

diff --git a/mm/shmem.c b/mm/shmem.c
index bd8840082c94..e11090f78cb5 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1002,6 +1002,33 @@ static int shmem_getattr(const struct path *path, struct kstat *stat,
 	return 0;
 }
 
+/*
+ * Expose inode and optional page to shrinker as having a possibly splittable
+ * hugepage that reaches beyond i_size.
+ */
+static void shmem_shrinker_add(struct shmem_sb_info *sbinfo,
+			       struct inode *inode, struct page *page)
+{
+	struct shmem_inode_info *info = SHMEM_I(inode);
+
+	spin_lock(&sbinfo->shrinklist_lock);
+	/*
+	 * _careful to defend against unlocked access to ->shrink_list in
+	 * shmem_unused_huge_shrink()
+	 */
+	if (list_empty_careful(&info->shrinklist)) {
+		list_add_tail(&info->shrinklist, &sbinfo->shrinklist);
+		sbinfo->shrinklist_len++;
+	}
+	spin_unlock(&sbinfo->shrinklist_lock);
+
+#ifdef CONFIG_MEMCG
+	if (page && PageTransHuge(page))
+		memcg_set_shrinker_bit(page->mem_cgroup, page_to_nid(page),
+				       inode->i_sb->s_shrink.id);
+#endif
+}
+
 static int shmem_setattr(struct dentry *dentry, struct iattr *attr)
 {
 	struct inode *inode = d_inode(dentry);
@@ -1048,17 +1075,13 @@ static int shmem_setattr(struct dentry *dentry, struct iattr *attr)
 	 * to shrink under memory pressure.
 	 */
 	if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE)) {
-		spin_lock(&sbinfo->shrinklist_lock);
-		/*
-		 * _careful to defend against unlocked access to
-		 * ->shrink_list in shmem_unused_huge_shrink()
-		 */
-		if (list_empty_careful(&info->shrinklist)) {
-			list_add_tail(&info->shrinklist,
-				      &sbinfo->shrinklist);
-			sbinfo->shrinklist_len++;
-		}
-		spin_unlock(&sbinfo->shrinklist_lock);
+		struct page *page;
+
+		page = find_get_page(inode->i_mapping,
+				     (newsize & HPAGE_PMD_MASK) >> PAGE_SHIFT);
+		shmem_shrinker_add(sbinfo, inode, page);
+		if (page)
+			put_page(page);
 		}
 	}
 }
@@ -1889,21 +1912,7 @@ static int shmem_getpage_gfp(struct inode *inode, pgoff_t index,
 	if (PageTransHuge(page) &&
 	    DIV_ROUND_UP(i_size_read(inode), PAGE_SIZE) <
 	    hindex + HPAGE_PMD_NR - 1) {
-		/*
-		 * Part of the huge page is beyond i_siz
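The `list_empty_careful()` guard that shmem_shrinker_add() factors out above is an "add at most once" idiom. A minimal userspace model, with simplified names and no real locking (the kernel takes `shrinklist_lock` around this), shows why repeated calls neither corrupt the list nor inflate the length counter:

```c
#include <stdbool.h>

/* Userspace model of the shrink-list add-once logic; illustrative only. */
struct node {
	struct node *next, *prev;
};

struct shrink_list {
	struct node head;	/* circular doubly linked list head */
	int len;		/* models sbinfo->shrinklist_len */
};

static void list_init(struct node *n)
{
	n->next = n->prev = n;	/* self-linked == "empty" / not queued */
}

/* Models list_empty_careful(&info->shrinklist): queued iff not self-linked. */
static bool on_list(struct node *n)
{
	return n->next != n;
}

static void shrink_list_init(struct shrink_list *l)
{
	list_init(&l->head);
	l->len = 0;
}

static void shrink_list_add(struct shrink_list *l, struct node *n)
{
	if (on_list(n))
		return;			/* already queued: idempotent */
	n->next = &l->head;		/* tail insert, like list_add_tail() */
	n->prev = l->head.prev;
	l->head.prev->next = n;
	l->head.prev = n;
	l->len++;
}
```

Calling `shrink_list_add()` twice for the same node leaves `len` at 1, mirroring how an inode repeatedly hitting shmem_setattr() joins the superblock shrink list only once.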
[PATCH] kbuild: clean compressed initramfs image
Since commit 9e3596b0c653 ("kbuild: initramfs cleanup, set target from
Kconfig") "make clean" leaves behind compressed initramfs images. Example:
  $ make defconfig
  $ sed -i 's|CONFIG_INITRAMFS_SOURCE=""|CONFIG_INITRAMFS_SOURCE="/tmp/ir.cpio"|' .config
  $ make olddefconfig
  $ make -s
  $ make -s clean
  $ git clean -ndxf | grep initramfs
  Would remove usr/initramfs_data.cpio.gz

clean rules do not have CONFIG_* context so they do not know which
compression format was used. Thus they don't know which files to delete.
Tell clean to delete all possible compression formats.

Once patched usr/initramfs_data.cpio.gz and friends are deleted by
"make clean".

Fixes: 9e3596b0c653 ("kbuild: initramfs cleanup, set target from Kconfig")
Signed-off-by: Greg Thelen
---
 usr/Makefile | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/usr/Makefile b/usr/Makefile
index 6a89eb019275..e6f7cb2f81db 100644
--- a/usr/Makefile
+++ b/usr/Makefile
@@ -11,6 +11,9 @@ datafile_y = initramfs_data.cpio$(suffix_y)
 datafile_d_y = .$(datafile_y).d
 AFLAGS_initramfs_data.o += -DINITRAMFS_IMAGE="usr/$(datafile_y)"
 
+# clean rules do not have CONFIG_INITRAMFS_COMPRESSION. So clean up after all
+# possible compression formats.
+clean-files += initramfs_data.cpio*
 
 # Generate builtin.o based on initramfs_data.o
 obj-$(CONFIG_BLK_DEV_INITRD) := initramfs_data.o
-- 
2.22.0.657.g960e92d24f-goog
Re: [PATCH v4 0/7] mm: reparent slab memory on cgroup removal
Roman Gushchin wrote:
> # Why do we need this?
>
> We've noticed that the number of dying cgroups is steadily growing on most
> of our hosts in production. The following investigation revealed an issue
> in userspace memory reclaim code [1], accounting of kernel stacks [2],
> and also the main reason: slab objects.
>
> The underlying problem is quite simple: any page charged to a cgroup holds
> a reference to it, so the cgroup can't be reclaimed unless all charged
> pages are gone. If a slab object is actively used by other cgroups, it
> won't be reclaimed, and will prevent the origin cgroup from being
> reclaimed.
>
> Slab objects, and first of all vfs cache, is shared between cgroups, which
> are using the same underlying fs, and what's even more important, it's
> shared between multiple generations of the same workload. So if something
> is running periodically every time in a new cgroup (like how systemd
> works), we do accumulate multiple dying cgroups.
>
> Strictly speaking pagecache isn't different here, but there is a key
> difference: we disable protection and apply some extra pressure on LRUs
> of dying cgroups, and these LRUs contain all charged pages.
> My experiments show that with the disabled kernel memory accounting the
> number of dying cgroups stabilizes at a relatively small number (~100,
> depends on memory pressure and cgroup creation rate), and with kernel
> memory accounting it grows pretty steadily up to several thousands.
>
> Memory cgroups are quite complex and big objects (mostly due to percpu
> stats), so it leads to noticeable memory losses. Memory occupied by dying
> cgroups is measured in hundreds of megabytes. I've even seen a host with
> more than 100Gb of memory wasted for dying cgroups. It leads to a
> degradation of performance with the uptime, and generally limits the
> usage of cgroups.
> > My previous attempt [3] to fix the problem by applying extra pressure on slab > shrinker lists caused a regressions with xfs and ext4, and has been reverted > [4]. > The following attempts to find the right balance [5, 6] were not successful. > > So instead of trying to find a maybe non-existing balance, let's do reparent > the accounted slabs to the parent cgroup on cgroup removal. > > > # Implementation approach > > There is however a significant problem with reparenting of slab memory: > there is no list of charged pages. Some of them are in shrinker lists, > but not all. Introducing of a new list is really not an option. > > But fortunately there is a way forward: every slab page has a stable pointer > to the corresponding kmem_cache. So the idea is to reparent kmem_caches > instead of slab pages. > > It's actually simpler and cheaper, but requires some underlying changes: > 1) Make kmem_caches to hold a single reference to the memory cgroup, >instead of a separate reference per every slab page. > 2) Stop setting page->mem_cgroup pointer for memcg slab pages and use >page->kmem_cache->memcg indirection instead. It's used only on >slab page release, so it shouldn't be a big issue. > 3) Introduce a refcounter for non-root slab caches. It's required to >be able to destroy kmem_caches when they become empty and release >the associated memory cgroup. > > There is a bonus: currently we do release empty kmem_caches on cgroup > removal, however all other are waiting for the releasing of the memory cgroup. > These refactorings allow kmem_caches to be released as soon as they > become inactive and free. > > Some additional implementation details are provided in corresponding > commit messages. > > # Results > > Below is the average number of dying cgroups on two groups of our production > hosts. They do run some sort of web frontend workload, the memory pressure > is moderate. 
As we can see, with the kernel memory reparenting the number
> stabilizes in the 60s range; however with the original version it grows
> almost linearly and doesn't show any signs of plateauing. The difference
> in slab and percpu usage between patched and unpatched versions also
> grows linearly. In 7 days it exceeded 200Mb.
>
>   day           0    1    2    3    4    5    6    7
>   original     56  362  628  752 1070 1250 1490 1560
>   patched      23   46   51   55   60   57   67   69
>   mem diff(Mb) 22   74  123  152  164  182  214  241

No objection to the idea, but a question...

In the patched kernel, does slabinfo (or similar) show the reparented
slab caches? A pile of zombie kmem_caches is certainly better than a pile
of zombie mem_cgroups. But it still seems like it might cause degradation
- does cache_reap() walk an ever-growing set of zombie caches?

We've found it useful to add a slabinfo_full file which includes zombie
kmem_caches with their memcg_name. This can help hunt down zombies.

> # History
>
> v4:
> 1) removed excessive memcg != parent check in memcg_deactivate_kmem_caches()
> 2) fixed rcu_read_lock() usage in memcg_charge_slab()
> 3) fixed synchronization
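The series' core move, one reference per kmem_cache rather than per slab page, plus a pointer switch on cgroup removal, can be modeled in a few lines. This is a hypothetical sketch, not the kernel implementation; `memcg`, `cache_init`, and `reparent_cache` are illustrative names:

```c
#include <stddef.h>

/* Toy model of reparenting accounted slab caches on cgroup removal. */
struct memcg {
	struct memcg *parent;
	int refs;		/* references held by live kmem_caches */
};

struct kmem_cache {
	struct memcg *memcg;	/* stable owner pointer used at slab release */
};

static void memcg_get(struct memcg *m) { m->refs++; }
static void memcg_put(struct memcg *m) { m->refs--; }

static void cache_init(struct kmem_cache *c, struct memcg *m)
{
	c->memcg = m;
	memcg_get(m);		/* one ref per cache, not per slab page */
}

/* On cgroup offline: hand the cache to the parent memcg. */
static void reparent_cache(struct kmem_cache *c)
{
	struct memcg *old = c->memcg;

	c->memcg = old->parent;
	memcg_get(old->parent);
	memcg_put(old);		/* dying memcg no longer pinned by this cache */
}
```

After `reparent_cache()`, the dying memcg's refcount can reach zero even while the cache (and its shared objects) lives on, which is exactly why the dying-cgroup count plateaus in the table above.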
Re: [PATCH] writeback: sum memcg dirty counters as needed
Johannes Weiner wrote:
> On Thu, Mar 07, 2019 at 08:56:32AM -0800, Greg Thelen wrote:
>> --- a/mm/memcontrol.c
>> +++ b/mm/memcontrol.c
>> @@ -3880,6 +3880,7 @@ struct wb_domain *mem_cgroup_wb_domain(struct bdi_writeback *wb)
>>   * @pheadroom: out parameter for number of allocatable pages according to memcg
>>   * @pdirty: out parameter for number of dirty pages
>>   * @pwriteback: out parameter for number of pages under writeback
>> + * @exact: determines exact counters are required, indicates more work.
>>   *
>>   * Determine the numbers of file, headroom, dirty, and writeback pages in
>>   * @wb's memcg. File, dirty and writeback are self-explanatory. Headroom
>> @@ -3890,18 +3891,29 @@ struct wb_domain *mem_cgroup_wb_domain(struct bdi_writeback *wb)
>>   * ancestors. Note that this doesn't consider the actual amount of
>>   * available memory in the system. The caller should further cap
>>   * *@pheadroom accordingly.
>> + *
>> + * Return value is the error precision associated with *@pdirty
>> + * and *@pwriteback. When @exact is set this a minimal value.
>>   */
>> -void mem_cgroup_wb_stats(struct bdi_writeback *wb, unsigned long *pfilepages,
>> -			 unsigned long *pheadroom, unsigned long *pdirty,
>> -			 unsigned long *pwriteback)
>> +unsigned long
>> +mem_cgroup_wb_stats(struct bdi_writeback *wb, unsigned long *pfilepages,
>> +		    unsigned long *pheadroom, unsigned long *pdirty,
>> +		    unsigned long *pwriteback, bool exact)
>>  {
>>  	struct mem_cgroup *memcg = mem_cgroup_from_css(wb->memcg_css);
>>  	struct mem_cgroup *parent;
>> +	unsigned long precision;
>>
>> -	*pdirty = memcg_page_state(memcg, NR_FILE_DIRTY);
>> -
>> +	if (exact) {
>> +		precision = 0;
>> +		*pdirty = memcg_exact_page_state(memcg, NR_FILE_DIRTY);
>> +		*pwriteback = memcg_exact_page_state(memcg, NR_WRITEBACK);
>> +	} else {
>> +		precision = MEMCG_CHARGE_BATCH * num_online_cpus();
>> +		*pdirty = memcg_page_state(memcg, NR_FILE_DIRTY);
>> +		*pwriteback = memcg_page_state(memcg, NR_WRITEBACK);
>> +	}
>>  	/* this should eventually include NR_UNSTABLE_NFS */
>> -	*pwriteback = memcg_page_state(memcg, NR_WRITEBACK);
>>  	*pfilepages = mem_cgroup_nr_lru_pages(memcg, (1 << LRU_INACTIVE_FILE) |
>>  					      (1 << LRU_ACTIVE_FILE));
>>  	*pheadroom = PAGE_COUNTER_MAX;
>> @@ -3913,6 +3925,8 @@ void mem_cgroup_wb_stats(struct bdi_writeback *wb, unsigned long *pfilepages,
>>  		*pheadroom = min(*pheadroom, ceiling - min(ceiling, used));
>>  		memcg = parent;
>>  	}
>> +
>> +	return precision;
>
> Have you considered unconditionally using the exact version here?
>
> It does for_each_online_cpu(), but until very, very recently we did
> this per default for all stats, for years. It only became a problem in
> conjunction with the for_each_memcg loops when frequently reading
> memory stats at the top of a very large hierarchy.
>
> balance_dirty_pages() is called against memcgs that actually own the
> inodes/memory and doesn't do the additional recursive tree collection.
>
> It's also not *that* hot of a function, and in the io path...
>
> It would simplify this patch immensely.

Good idea.
Done in -v2 of the patch: https://lore.kernel.org/lkml/20190329174609.164344-1-gthe...@google.com/
Re: [PATCH] writeback: sum memcg dirty counters as needed
Andrew Morton wrote: > On Thu, 7 Mar 2019 08:56:32 -0800 Greg Thelen wrote: > >> Since commit a983b5ebee57 ("mm: memcontrol: fix excessive complexity in >> memory.stat reporting") memcg dirty and writeback counters are managed >> as: >> 1) per-memcg per-cpu values in range of [-32..32] >> 2) per-memcg atomic counter >> When a per-cpu counter cannot fit in [-32..32] it's flushed to the >> atomic. Stat readers only check the atomic. >> Thus readers such as balance_dirty_pages() may see a nontrivial error >> margin: 32 pages per cpu. >> Assuming 100 cpus: >>4k x86 page_size: 13 MiB error per memcg >> 64k ppc page_size: 200 MiB error per memcg >> Considering that dirty+writeback are used together for some decisions >> the errors double. >> >> This inaccuracy can lead to undeserved oom kills. One nasty case is >> when all per-cpu counters hold positive values offsetting an atomic >> negative value (i.e. per_cpu[*]=32, atomic=n_cpu*-32). >> balance_dirty_pages() only consults the atomic and does not consider >> throttling the next n_cpu*32 dirty pages. If the file_lru is in the >> 13..200 MiB range then there's absolutely no dirty throttling, which >> burdens vmscan with only dirty+writeback pages thus resorting to oom >> kill. >> >> It could be argued that tiny containers are not supported, but it's more >> subtle. It's the amount the space available for file lru that matters. >> If a container has memory.max-200MiB of non reclaimable memory, then it >> will also suffer such oom kills on a 100 cpu machine. >> >> ... >> >> Make balance_dirty_pages() and wb_over_bg_thresh() work harder to >> collect exact per memcg counters when a memcg is close to the >> throttling/writeback threshold. This avoids the aforementioned oom >> kills. >> >> This does not affect the overhead of memory.stat, which still reads the >> single atomic counter. >> >> Why not use percpu_counter? memcg already handles cpus going offline, >> so no need for that overhead from percpu_counter. 
And the
>> percpu_counter spinlocks are more heavyweight than is required.
>>
>> It probably also makes sense to include exact dirty and writeback
>> counters in memcg oom reports. But that is saved for later.
>
> Nice changelog, thanks.
>
>> Signed-off-by: Greg Thelen
>
> Did you consider cc:stable for this? We may as well - the stablebots
> backport everything which might look slightly like a fix anyway :(

Good idea. Done in -v2 of the patch.

>> --- a/include/linux/memcontrol.h
>> +++ b/include/linux/memcontrol.h
>> @@ -573,6 +573,22 @@ static inline unsigned long memcg_page_state(struct mem_cgroup *memcg,
>>  	return x;
>>  }
>>
>> +/* idx can be of type enum memcg_stat_item or node_stat_item */
>> +static inline unsigned long
>> +memcg_exact_page_state(struct mem_cgroup *memcg, int idx)
>> +{
>> +	long x = atomic_long_read(&memcg->stat[idx]);
>> +#ifdef CONFIG_SMP
>> +	int cpu;
>> +
>> +	for_each_online_cpu(cpu)
>> +		x += per_cpu_ptr(memcg->stat_cpu, cpu)->count[idx];
>> +	if (x < 0)
>> +		x = 0;
>> +#endif
>> +	return x;
>> +}
>
> This looks awfully heavyweight for an inline function. Why not make it
> a regular function and avoid the bloat and i-cache consumption?

Done in -v2.

> Also, did you instead consider making this spill the percpu counters
> into memcg->stat[idx]? That might be more useful for potential future
> callers. It would become a little more expensive though.

I looked at that approach, but couldn't convince myself it was safe. I
kept staring at "Remote [...] Write accesses can cause unique problems
due to the relaxed synchronization requirements for this_cpu operations."
from this_cpu_ops.txt. So I'd like to delay this possible optimization
for a later patch.
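The batching scheme under discussion can be demonstrated outside the kernel. The sketch below is illustrative (NCPUS, BATCH, and the function names are stand-ins, and there is no real concurrency): writers accumulate in a per-cpu delta and flush to the shared counter only when the delta leaves the batch window, so a reader of the shared counter alone can be off by up to BATCH * NCPUS pages, which is what memcg_exact_page_state() corrects.

```c
/* Model of the [-32..32] per-cpu batching described in the thread. */
#define BATCH 32
#define NCPUS 100

static long atomic_count;		/* the per-memcg atomic counter */
static long percpu_delta[NCPUS];	/* per-cpu unflushed deltas */

static void mod_state(int cpu, long val)
{
	long x = percpu_delta[cpu] + val;

	if (x > BATCH || x < -BATCH) {	/* flush once outside [-BATCH..BATCH] */
		atomic_count += x;
		x = 0;
	}
	percpu_delta[cpu] = x;
}

/* What memcg_page_state() sees: the atomic only, clamped at zero. */
static long approx_state(void)
{
	return atomic_count < 0 ? 0 : atomic_count;
}

/* What memcg_exact_page_state() sees: atomic plus every per-cpu delta. */
static long exact_state(void)
{
	long x = atomic_count;

	for (int cpu = 0; cpu < NCPUS; cpu++)
		x += percpu_delta[cpu];
	return x < 0 ? 0 : x;
}
```

If 99 of the 100 cpus each dirty 32 pages, every delta stays inside the batch window: the approximate reader still reports 0 while the exact reader reports 3168 pages, the exact blind spot the changelog's oom scenario exploits.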
[PATCH v2] writeback: use exact memcg dirty counts
Since commit a983b5ebee57 ("mm: memcontrol: fix excessive complexity in
memory.stat reporting") memcg dirty and writeback counters are managed as:
1) per-memcg per-cpu values in range of [-32..32]
2) per-memcg atomic counter
When a per-cpu counter cannot fit in [-32..32] it's flushed to the
atomic.  Stat readers only check the atomic.  Thus readers such as
balance_dirty_pages() may see a nontrivial error margin: 32 pages per
cpu.
Assuming 100 cpus:
   4k x86 page_size:  13 MiB error per memcg
  64k ppc page_size: 200 MiB error per memcg
Considering that dirty+writeback are used together for some decisions
the errors double.

This inaccuracy can lead to undeserved oom kills.  One nasty case is
when all per-cpu counters hold positive values offsetting an atomic
negative value (i.e. per_cpu[*]=32, atomic=n_cpu*-32).
balance_dirty_pages() only consults the atomic and does not consider
throttling the next n_cpu*32 dirty pages.  If the file_lru is in the
13..200 MiB range then there's absolutely no dirty throttling, which
burdens vmscan with only dirty+writeback pages thus resorting to oom
kill.

It could be argued that tiny containers are not supported, but it's more
subtle.  It's the amount of space available for the file lru that
matters.  If a container has memory.max-200MiB of non reclaimable
memory, then it will also suffer such oom kills on a 100 cpu machine.

The following test reliably ooms without this patch.  This patch avoids
oom kills.

  $ cat test
  mount -t cgroup2 none /dev/cgroup
  cd /dev/cgroup
  echo +io +memory > cgroup.subtree_control
  mkdir test
  cd test
  echo 10M > memory.max
  (echo $BASHPID > cgroup.procs && exec /memcg-writeback-stress /foo) &
  (echo $BASHPID > cgroup.procs && exec dd if=/dev/zero of=/foo bs=2M count=100)

  $ cat memcg-writeback-stress.c
  /*
   * Dirty pages from all but one cpu.
   * Clean pages from the non dirtying cpu.
   * This is to stress per cpu counter imbalance.
   * On a 100 cpu machine:
   * - per memcg per cpu dirty count is 32 pages for each of 99 cpus
   * - per memcg atomic is -99*32 pages
   * - thus the complete dirty limit: sum of all counters 0
   * - balance_dirty_pages() only sees atomic count -99*32 pages, which
   *   it max()s to 0.
   * - So a workload can dirty -99*32 pages before balance_dirty_pages()
   *   cares.
   */
  #define _GNU_SOURCE
  #include <err.h>
  #include <fcntl.h>
  #include <sched.h>
  #include <stdio.h>
  #include <stdlib.h>
  #include <string.h>
  #include <sys/sysinfo.h>
  #include <sys/types.h>
  #include <unistd.h>

  static char *buf;
  static int bufSize;

  static void set_affinity(int cpu)
  {
  	cpu_set_t affinity;

  	CPU_ZERO(&affinity);
  	CPU_SET(cpu, &affinity);
  	if (sched_setaffinity(0, sizeof(affinity), &affinity))
  		err(1, "sched_setaffinity");
  }

  static void dirty_on(int output_fd, int cpu)
  {
  	int i, wrote;

  	set_affinity(cpu);
  	for (i = 0; i < 32; i++) {
  		for (wrote = 0; wrote < bufSize; ) {
  			int ret = write(output_fd, buf+wrote, bufSize-wrote);
  			if (ret == -1)
  				err(1, "write");
  			wrote += ret;
  		}
  	}
  }

  int main(int argc, char **argv)
  {
  	int cpu, flush_cpu = 1, output_fd;
  	const char *output;

  	if (argc != 2)
  		errx(1, "usage: output_file");
  	output = argv[1];
  	bufSize = getpagesize();
  	buf = malloc(getpagesize());
  	if (buf == NULL)
  		errx(1, "malloc failed");
  	output_fd = open(output, O_CREAT|O_RDWR);
  	if (output_fd == -1)
  		err(1, "open(%s)", output);

  	for (cpu = 0; cpu < get_nprocs(); cpu++) {
  		if (cpu != flush_cpu)
  			dirty_on(output_fd, cpu);
  	}

  	set_affinity(flush_cpu);
  	if (fsync(output_fd))
  		err(1, "fsync(%s)", output);
  	if (close(output_fd))
  		err(1, "close(%s)", output);
  	free(buf);
  }

Make balance_dirty_pages() and wb_over_bg_thresh() work harder to
collect exact per memcg counters.  This avoids the aforementioned oom
kills.

This does not affect the overhead of memory.stat, which still reads the
single atomic counter.

Why not use percpu_counter?  memcg already handles cpus going offline,
so no need for that overhead from percpu_counter.  And the
percpu_counter spinlocks are more heavyweight than is required.

It probably also makes sense to use exact dirty and writeback counters
in memcg oom reports.
But that is saved for later.

Cc: sta...@vger.kernel.org # v4.16+
Signed-off-by: Greg Thelen
---
Changelog since v1:
- Move memcg_exact_page_state() into memcontrol.c.
- Unconditionally gather exact (per cpu) counters in
  mem_cgroup_wb_stats(); it's not called in performance sensitive paths.
- Unconditionally check for underflow regardless of CONFIG_SMP. It's
  just easier th
Re: [PATCH] writeback: sum memcg dirty counters as needed
On Fri, Mar 22, 2019 at 11:15 AM Roman Gushchin wrote: > > On Thu, Mar 07, 2019 at 08:56:32AM -0800, Greg Thelen wrote: > > Since commit a983b5ebee57 ("mm: memcontrol: fix excessive complexity in > > memory.stat reporting") memcg dirty and writeback counters are managed > > as: > > 1) per-memcg per-cpu values in range of [-32..32] > > 2) per-memcg atomic counter > > When a per-cpu counter cannot fit in [-32..32] it's flushed to the > > atomic. Stat readers only check the atomic. > > Thus readers such as balance_dirty_pages() may see a nontrivial error > > margin: 32 pages per cpu. > > Assuming 100 cpus: > >4k x86 page_size: 13 MiB error per memcg > > 64k ppc page_size: 200 MiB error per memcg > > Considering that dirty+writeback are used together for some decisions > > the errors double. > > > > This inaccuracy can lead to undeserved oom kills. One nasty case is > > when all per-cpu counters hold positive values offsetting an atomic > > negative value (i.e. per_cpu[*]=32, atomic=n_cpu*-32). > > balance_dirty_pages() only consults the atomic and does not consider > > throttling the next n_cpu*32 dirty pages. If the file_lru is in the > > 13..200 MiB range then there's absolutely no dirty throttling, which > > burdens vmscan with only dirty+writeback pages thus resorting to oom > > kill. > > > > It could be argued that tiny containers are not supported, but it's more > > subtle. It's the amount the space available for file lru that matters. > > If a container has memory.max-200MiB of non reclaimable memory, then it > > will also suffer such oom kills on a 100 cpu machine. > > > > The following test reliably ooms without this patch. This patch avoids > > oom kills. > > > > ... > > > > Make balance_dirty_pages() and wb_over_bg_thresh() work harder to > > collect exact per memcg counters when a memcg is close to the > > throttling/writeback threshold. This avoids the aforementioned oom > > kills. 
> >
> > This does not affect the overhead of memory.stat, which still reads the
> > single atomic counter.
> >
> > Why not use percpu_counter? memcg already handles cpus going offline,
> > so no need for that overhead from percpu_counter. And the
> > percpu_counter spinlocks are more heavyweight than is required.
> >
> > It probably also makes sense to include exact dirty and writeback
> > counters in memcg oom reports. But that is saved for later.
> >
> > Signed-off-by: Greg Thelen
> > ---
> >  include/linux/memcontrol.h | 33 +
> >  mm/memcontrol.c            | 26 -
> >  mm/page-writeback.c        | 27 +-
> >  3 files changed, 66 insertions(+), 20 deletions(-)
> >
> > diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> > index 83ae11cbd12c..6a133c90138c 100644
> > --- a/include/linux/memcontrol.h
> > +++ b/include/linux/memcontrol.h
> > @@ -573,6 +573,22 @@ static inline unsigned long memcg_page_state(struct mem_cgroup *memcg,
> >  	return x;
> >  }

Hi Greg!

Thank you for the patch, definitely a good problem to be fixed!

> >
> > +/* idx can be of type enum memcg_stat_item or node_stat_item */
> > +static inline unsigned long
> > +memcg_exact_page_state(struct mem_cgroup *memcg, int idx)
> > +{
> > +	long x = atomic_long_read(&memcg->stat[idx]);
> > +#ifdef CONFIG_SMP

I doubt that this #ifdef is correct without corresponding changes
in __mod_memcg_state(). As now, we do use per-cpu buffer which spills
to an atomic value even if !CONFIG_SMP. It's probably something
that we want to change, but as now, #ifdef CONFIG_SMP should protect
only "if (x < 0)" part.

Ack.  I'll fix it.

> > +	int cpu;
> > +
> > +	for_each_online_cpu(cpu)
> > +		x += per_cpu_ptr(memcg->stat_cpu, cpu)->count[idx];
> > +	if (x < 0)
> > +		x = 0;
> > +#endif
> > +	return x;
> > +}

Also, isn't it worth it to generalize memcg_page_state() instead?
By adding a bool exact argument? I believe dirty balance is not
the only place, where we need a better accuracy.

Nod.
I'll provide a more general version of memcg_page_state().  I'm testing
an updated (forthcoming v2) patch set now with feedback from Andrew and
Roman.
[PATCH] writeback: sum memcg dirty counters as needed
Since commit a983b5ebee57 ("mm: memcontrol: fix excessive complexity in
memory.stat reporting") memcg dirty and writeback counters are managed as:
1) per-memcg per-cpu values in range of [-32..32]
2) per-memcg atomic counter
When a per-cpu counter cannot fit in [-32..32] it's flushed to the
atomic.  Stat readers only check the atomic.  Thus readers such as
balance_dirty_pages() may see a nontrivial error margin: 32 pages per
cpu.
Assuming 100 cpus:
   4k x86 page_size:  13 MiB error per memcg
  64k ppc page_size: 200 MiB error per memcg
Considering that dirty+writeback are used together for some decisions
the errors double.

This inaccuracy can lead to undeserved oom kills.  One nasty case is
when all per-cpu counters hold positive values offsetting an atomic
negative value (i.e. per_cpu[*]=32, atomic=n_cpu*-32).
balance_dirty_pages() only consults the atomic and does not consider
throttling the next n_cpu*32 dirty pages.  If the file_lru is in the
13..200 MiB range then there's absolutely no dirty throttling, which
burdens vmscan with only dirty+writeback pages thus resorting to oom
kill.

It could be argued that tiny containers are not supported, but it's more
subtle.  It's the amount of space available for the file lru that
matters.  If a container has memory.max-200MiB of non reclaimable
memory, then it will also suffer such oom kills on a 100 cpu machine.

The following test reliably ooms without this patch.  This patch avoids
oom kills.

  $ cat test
  mount -t cgroup2 none /dev/cgroup
  cd /dev/cgroup
  echo +io +memory > cgroup.subtree_control
  mkdir test
  cd test
  echo 10M > memory.max
  (echo $BASHPID > cgroup.procs && exec /memcg-writeback-stress /foo) &
  (echo $BASHPID > cgroup.procs && exec dd if=/dev/zero of=/foo bs=2M count=100)

  $ cat memcg-writeback-stress.c
  /*
   * Dirty pages from all but one cpu.
   * Clean pages from the non dirtying cpu.
   * This is to stress per cpu counter imbalance.
   * On a 100 cpu machine:
   * - per memcg per cpu dirty count is 32 pages for each of 99 cpus
   * - per memcg atomic is -99*32 pages
   * - thus the complete dirty limit: sum of all counters 0
   * - balance_dirty_pages() only sees atomic count -99*32 pages, which
   *   it max()s to 0.
   * - So a workload can dirty -99*32 pages before balance_dirty_pages()
   *   cares.
   */
  #define _GNU_SOURCE
  #include <err.h>
  #include <fcntl.h>
  #include <sched.h>
  #include <stdio.h>
  #include <stdlib.h>
  #include <string.h>
  #include <sys/sysinfo.h>
  #include <sys/types.h>
  #include <unistd.h>

  static char *buf;
  static int bufSize;

  static void set_affinity(int cpu)
  {
  	cpu_set_t affinity;

  	CPU_ZERO(&affinity);
  	CPU_SET(cpu, &affinity);
  	if (sched_setaffinity(0, sizeof(affinity), &affinity))
  		err(1, "sched_setaffinity");
  }

  static void dirty_on(int output_fd, int cpu)
  {
  	int i, wrote;

  	set_affinity(cpu);
  	for (i = 0; i < 32; i++) {
  		for (wrote = 0; wrote < bufSize; ) {
  			int ret = write(output_fd, buf+wrote, bufSize-wrote);
  			if (ret == -1)
  				err(1, "write");
  			wrote += ret;
  		}
  	}
  }

  int main(int argc, char **argv)
  {
  	int cpu, flush_cpu = 1, output_fd;
  	const char *output;

  	if (argc != 2)
  		errx(1, "usage: output_file");
  	output = argv[1];
  	bufSize = getpagesize();
  	buf = malloc(getpagesize());
  	if (buf == NULL)
  		errx(1, "malloc failed");
  	output_fd = open(output, O_CREAT|O_RDWR);
  	if (output_fd == -1)
  		err(1, "open(%s)", output);

  	for (cpu = 0; cpu < get_nprocs(); cpu++) {
  		if (cpu != flush_cpu)
  			dirty_on(output_fd, cpu);
  	}

  	set_affinity(flush_cpu);
  	if (fsync(output_fd))
  		err(1, "fsync(%s)", output);
  	if (close(output_fd))
  		err(1, "close(%s)", output);
  	free(buf);
  }

Make balance_dirty_pages() and wb_over_bg_thresh() work harder to
collect exact per memcg counters when a memcg is close to the
throttling/writeback threshold.  This avoids the aforementioned oom
kills.

This does not affect the overhead of memory.stat, which still reads the
single atomic counter.

Why not use percpu_counter?  memcg already handles cpus going offline,
so no need for that overhead from percpu_counter.  And the
percpu_counter spinlocks are more heavyweight than is required.
It probably also makes sense to include exact dirty and writeback counters in memcg oom reports. But that is saved for later. Signed-off-by: Greg Thelen --- include/linux/memcontrol.h | 33 + mm/memcontrol.c| 26 -- mm/page-writeback.c| 27 +-- 3 files changed, 66 insertions(+), 20 deletions(-) diff --git
[PATCH] writeback: fix inode cgroup switching comment
Commit 682aa8e1a6a1 ("writeback: implement unlocked_inode_to_wb transaction
and use it for stat updates") refers to inode_switch_wb_work_fn() which
never got merged.  Switch the comments to inode_switch_wbs_work_fn().

Fixes: 682aa8e1a6a1 ("writeback: implement unlocked_inode_to_wb transaction and use it for stat updates")
Signed-off-by: Greg Thelen
---
 include/linux/backing-dev.h | 2 +-
 include/linux/fs.h          | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
index c28a47cbe355..f9b029180241 100644
--- a/include/linux/backing-dev.h
+++ b/include/linux/backing-dev.h
@@ -365,7 +365,7 @@ unlocked_inode_to_wb_begin(struct inode *inode, struct wb_lock_cookie *cookie)
 	rcu_read_lock();
 
 	/*
-	 * Paired with store_release in inode_switch_wb_work_fn() and
+	 * Paired with store_release in inode_switch_wbs_work_fn() and
 	 * ensures that we see the new wb if we see cleared I_WB_SWITCH.
 	 */
 	cookie->locked = smp_load_acquire(&inode->i_state) & I_WB_SWITCH;

diff --git a/include/linux/fs.h b/include/linux/fs.h
index fd423fec8d83..08f26046233e 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2091,7 +2091,7 @@ static inline void init_sync_kiocb(struct kiocb *kiocb, struct file *filp)
  * I_WB_SWITCH		Cgroup bdi_writeback switching in progress.  Used to
  *			synchronize competing switching instances and to tell
  *			wb stat updates to grab the i_pages lock.  See
- *			inode_switch_wb_work_fn() for details.
+ *			inode_switch_wbs_work_fn() for details.
  *
  * I_OVL_INUSE		Used by overlayfs to get exclusive ownership on upper
  *			and work dirs among overlayfs mounts.
-- 
2.21.0.352.gf09ad66450-goog
Re: [RFC PATCH 0/3] mm: memcontrol: delayed force empty
Yang Shi wrote: > On 1/3/19 11:23 AM, Michal Hocko wrote: >> On Thu 03-01-19 11:10:00, Yang Shi wrote: >>> >>> On 1/3/19 10:53 AM, Michal Hocko wrote: On Thu 03-01-19 10:40:54, Yang Shi wrote: > On 1/3/19 10:13 AM, Michal Hocko wrote: >> [...] >> Is there any reason for your scripts to be strictly sequential here? In >> other words why cannot you offload those expensive operations to a >> detached context in _userspace_? > I would say it has not to be strictly sequential. The above script is just > an example to illustrate the pattern. But, sometimes it may hit such > pattern > due to the complicated cluster scheduling and container scheduling in the > production environment, for example the creation process might be > scheduled > to the same CPU which is doing force_empty. I have to say I don't know too > much about the internals of the container scheduling. In that case I do not see a strong reason to implement the offloding into the kernel. It is an additional code and semantic to maintain. >>> Yes, it does introduce some additional code and semantic, but IMHO, it is >>> quite simple and very straight forward, isn't it? Just utilize the existing >>> css offline worker. And, that a couple of lines of code do improve some >>> throughput issues for some real usecases. >> I do not really care it is few LOC. It is more important that it is >> conflating force_empty into offlining logic. There was a good reason to >> remove reparenting/emptying the memcg during the offline. Considering >> that you can offload force_empty from userspace trivially then I do not >> see any reason to implement it in the kernel. > > Er, I may not articulate in the earlier email, force_empty can not be > offloaded from userspace *trivially*. IOWs the container scheduler may > unexpectedly overcommit something due to the stall of synchronous force > empty, which can't be figured out by userspace before it actually > happens. The scheduler doesn't know how long force_empty would take. 
If
> the force_empty could be offloaded by kernel, it would make scheduler's
> life much easier. This is not something userspace could do.

If kernel workqueues are doing more work (i.e. force_empty processing),
then it seems like the time to offline could grow.  I'm not sure if
that's important.

I assume that if we make force_empty an async side effect of rmdir then
the user space scheduler would be unable to immediately assume the
rmdir'd container memory is available without subjecting a new container
to direct reclaim.  So it seems like user space would use a mechanism to
wait for reclaim: either the existing sync force_empty or polling
meminfo/etc waiting for free memory to appear.

I think it is more important to discuss whether we want to introduce
force_empty in cgroup v2.
>>> We would prefer have it in v2 as well.
>> Then bring this up in a separate email thread please.
>
> Sure.  Will prepare the patches later.
>
> Thanks,
> Yang
Re: [PATCH] memcg, oom: move out_of_memory back to the charge path
Michal Hocko wrote: > On Tue 03-07-18 00:08:05, Greg Thelen wrote: >> Michal Hocko wrote: >> >> > On Fri 29-06-18 11:59:04, Greg Thelen wrote: >> >> Michal Hocko wrote: >> >> >> >> > On Thu 28-06-18 16:19:07, Greg Thelen wrote: >> >> >> Michal Hocko wrote: >> >> > [...] >> >> >> > +if (mem_cgroup_out_of_memory(memcg, mask, order)) >> >> >> > +return OOM_SUCCESS; >> >> >> > + >> >> >> > +WARN(1,"Memory cgroup charge failed because of no reclaimable >> >> >> > memory! " >> >> >> > +"This looks like a misconfiguration or a kernel bug."); >> >> >> >> >> >> I'm not sure here if the warning should here or so strongly worded. It >> >> >> seems like the current task could be oom reaped with MMF_OOM_SKIP and >> >> >> thus mem_cgroup_out_of_memory() will return false. So there's nothing >> >> >> alarming in that case. >> >> > >> >> > If the task is reaped then its charges should be released as well and >> >> > that means that we should get below the limit. Sure there is some room >> >> > for races but this should be still unlikely. Maybe I am just >> >> > underestimating though. >> >> > >> >> > What would you suggest instead? >> >> >> >> I suggest checking MMF_OOM_SKIP or deleting the warning. >> > >> > So what do you do when you have MMF_OOM_SKIP task? Do not warn? Checking >> > for all the tasks would be quite expensive and remembering that from the >> > task selection not nice either. Why do you think it would help much? >> >> I assume we could just check current's MMF_OOM_SKIP - no need to check >> all tasks. > > I still do not follow. If you are after a single task memcg then we > should be ok. try_charge has a runaway for oom victims > if (unlikely(tsk_is_oom_victim(current) || >fatal_signal_pending(current) || >current->flags & PF_EXITING)) > goto force; > > regardless of MMF_OOM_SKIP. So if there is a single process in the > memcg, we kill it and the oom reaper kicks in and sets MMF_OOM_SKIP then > we should bail out there. Or do I miss your intention? 
For a single task memcg it seems that racing process cgroup migration
could trigger the new warning (I have attempted to reproduce this):

  Processes A, B in memcgs M1, M2.  M1 is oom.

  Process A [M1]                      Process B [M2]
  ----------------------------------  ------------------
  M1 is oom
  try_charge(M1)
                                      Move A M1 => M2
  mem_cgroup_oom()
    mem_cgroup_out_of_memory()
      out_of_memory()
        select_bad_process()
          sees nothing in M1
        return 0
      return 0
  WARN()

Another variant might be possible, this time with global oom:

  Processes A, B in memcgs M1, M2.  M1 is oom.

  Process A [M1]                      Process B [M2]
  ----------------------------------  ------------------
  try_charge()
                                      trigger global oom
  reaper sets A.MMF_OOM_SKIP
  mem_cgroup_oom()
    mem_cgroup_out_of_memory()
      out_of_memory()
        select_bad_process()
          sees nothing in M1
        return 0
      return 0
  WARN()

These seem unlikely, so I'm fine with taking a wait-and-see approach.
Re: [PATCH] memcg, oom: move out_of_memory back to the charge path
Michal Hocko wrote:
> On Fri 29-06-18 11:59:04, Greg Thelen wrote:
>> Michal Hocko wrote:
>>
>> > On Thu 28-06-18 16:19:07, Greg Thelen wrote:
>> >> Michal Hocko wrote:
>> > [...]
>> >> > +	if (mem_cgroup_out_of_memory(memcg, mask, order))
>> >> > +		return OOM_SUCCESS;
>> >> > +
>> >> > +	WARN(1,"Memory cgroup charge failed because of no reclaimable memory! "
>> >> > +		"This looks like a misconfiguration or a kernel bug.");
>> >>
>> >> I'm not sure here if the warning should here or so strongly worded.  It
>> >> seems like the current task could be oom reaped with MMF_OOM_SKIP and
>> >> thus mem_cgroup_out_of_memory() will return false.  So there's nothing
>> >> alarming in that case.
>> >
>> > If the task is reaped then its charges should be released as well and
>> > that means that we should get below the limit.  Sure there is some room
>> > for races but this should be still unlikely.  Maybe I am just
>> > underestimating though.
>> >
>> > What would you suggest instead?
>>
>> I suggest checking MMF_OOM_SKIP or deleting the warning.
>
> So what do you do when you have MMF_OOM_SKIP task?  Do not warn?  Checking
> for all the tasks would be quite expensive and remembering that from the
> task selection not nice either.  Why do you think it would help much?

I assume we could just check current's MMF_OOM_SKIP - no need to check
all tasks.  My only (minor) objection is that the warning text suggests
misconfiguration or kernel bug, when there may be neither.

> I feel strongly that we have to warn when bypassing the charge limit
> during the corner case because it can lead to unexpected behavior and
> users should be aware of this fact.  I am open to the wording or some
> optimizations.  I would prefer the latter on top with a clear description
> how it helped in a particular case though.  I would rather not over
> optimize now without any story to back it.

I'm fine with the warning.  I know enough to look at dmesg logs and
make an educated guess that the race occurred.
We can refine it later if/when the reports start rolling in. No change needed.
Re: [PATCH] memcg, oom: move out_of_memory back to the charge path
Michal Hocko wrote: > On Thu 28-06-18 16:19:07, Greg Thelen wrote: >> Michal Hocko wrote: > [...] >> > + if (mem_cgroup_out_of_memory(memcg, mask, order)) >> > + return OOM_SUCCESS; >> > + >> > + WARN(1,"Memory cgroup charge failed because of no reclaimable memory! " >> > + "This looks like a misconfiguration or a kernel bug."); >> >> I'm not sure here if the warning should here or so strongly worded. It >> seems like the current task could be oom reaped with MMF_OOM_SKIP and >> thus mem_cgroup_out_of_memory() will return false. So there's nothing >> alarming in that case. > > If the task is reaped then its charges should be released as well and > that means that we should get below the limit. Sure there is some room > for races but this should be still unlikely. Maybe I am just > underestimating though. > > What would you suggest instead? I suggest checking MMF_OOM_SKIP or deleting the warning. But I don't feel strongly.
Re: [PATCH] memcg, oom: move out_of_memory back to the charge path
Michal Hocko wrote:
> From: Michal Hocko
>
> 3812c8c8f395 ("mm: memcg: do not trap chargers with full callstack on OOM")
> has changed the ENOMEM semantic of memcg charges. Rather than invoking
> the oom killer from the charging context it delays the oom killer to the
> page fault path (pagefault_out_of_memory). This in turn means that many
> users (e.g. slab or g-u-p) will get ENOMEM when the corresponding memcg
> hits the hard limit and the memcg is OOM. This behavior is
> inconsistent with the !memcg case where the oom killer is invoked from
> the allocation context and the allocator keeps retrying until it
> succeeds.
>
> The difference in the behavior is user visible. mmap(MAP_POPULATE) might
> result in not fully populated ranges while the mmap return code doesn't
> tell that to the userspace. Random syscalls might fail with ENOMEM etc.
>
> The primary motivation of the different memcg oom semantic was the
> deadlock avoidance. Things have changed since then, though. We have
> an async oom teardown by the oom reaper now and so we do not have to
> rely on the victim to tear down its memory anymore. Therefore we can
> return to the original semantic as long as the memcg oom killer is not
> handed over to the users space.
>
> There is still one thing to be careful about here though. If the oom
> killer is not able to make any forward progress - e.g. because there is
> no eligible task to kill - then we have to bail out of the charge path
> to prevent from same class of deadlocks. We have basically two options
> here. Either we fail the charge with ENOMEM or force the charge and
> allow overcharge. The first option has been considered more harmful than
> useful because rare inconsistencies in the ENOMEM behavior is hard to
> test for and error prone. Basically the same reason why the page
> allocator doesn't fail allocations under such conditions. The latter
> might allow runaways but those should be really unlikely unless somebody
> misconfigures the system. E.g.
allowing to migrate tasks away from the > memcg to a different unlimited memcg with move_charge_at_immigrate > disabled. > > Changes since rfc v1 > - s@memcg_may_oom@in_user_fault@ suggested by Johannes. It is much more > clear what is the purpose of the flag now > - s@mem_cgroup_oom_enable@mem_cgroup_enter_user_fault@g > s@mem_cgroup_oom_disable@mem_cgroup_exit_user_fault@g as per Johannes > - make oom_kill_disable an exceptional case because it should be rare > and the normal oom handling a core of the function - per Johannes > > Signed-off-by: Michal Hocko Acked-by: Greg Thelen Thanks! One comment below. > --- > > Hi, > I've posted this as an RFC previously [1]. There was no fundamental > disagreement so I've integrated all the suggested changes and tested it. > mmap(MAP_POPULATE) hits the oom killer again rather than silently fails > to populate the mapping on the hard limit excess. On the other hand > g-u-p and other charge path keep the ENOMEM semantic when the memcg oom > killer is disabled. All the forward progress guarantee relies on the oom > reaper. 
> > Unless there are objections I think this is ready to go to mmotm and > ready for the next merge window > > [1] http://lkml.kernel.org/r/20180620103736.13880-1-mho...@kernel.org > include/linux/memcontrol.h | 16 > include/linux/sched.h | 2 +- > mm/memcontrol.c| 75 ++ > mm/memory.c| 4 +- > 4 files changed, 71 insertions(+), 26 deletions(-) > > diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h > index 6c6fb116e925..5a69bb4026f6 100644 > --- a/include/linux/memcontrol.h > +++ b/include/linux/memcontrol.h > @@ -494,16 +494,16 @@ unsigned long mem_cgroup_get_max(struct mem_cgroup > *memcg); > void mem_cgroup_print_oom_info(struct mem_cgroup *memcg, > struct task_struct *p); > > -static inline void mem_cgroup_oom_enable(void) > +static inline void mem_cgroup_enter_user_fault(void) > { > - WARN_ON(current->memcg_may_oom); > - current->memcg_may_oom = 1; > + WARN_ON(current->in_user_fault); > + current->in_user_fault = 1; > } > > -static inline void mem_cgroup_oom_disable(void) > +static inline void mem_cgroup_exit_user_fault(void) > { > - WARN_ON(!current->memcg_may_oom); > - current->memcg_may_oom = 0; > + WARN_ON(!current->in_user_fault); > + current->in_user_fault = 0; > } > > static inline bool task_in_memcg_oom(struct task_struct *p) > @@ -924,11 +924,11 @@ static inline void mem_cgroup_handle_over_high(void) > { > } > > -static inline void mem_cgroup_oom_enable(void)
Re: [PATCH] memcg, oom: move out_of_memory back to the charge path
Michal Hocko wrote: > From: Michal Hocko > > 3812c8c8f395 ("mm: memcg: do not trap chargers with full callstack on OOM") > has changed the ENOMEM semantic of memcg charges. Rather than invoking > the oom killer from the charging context it delays the oom killer to the > page fault path (pagefault_out_of_memory). This in turn means that many > users (e.g. slab or g-u-p) will get ENOMEM when the corresponding memcg > hits the hard limit and the memcg is is OOM. This is behavior is > inconsistent with !memcg case where the oom killer is invoked from the > allocation context and the allocator keeps retrying until it succeeds. > > The difference in the behavior is user visible. mmap(MAP_POPULATE) might > result in not fully populated ranges while the mmap return code doesn't > tell that to the userspace. Random syscalls might fail with ENOMEM etc. > > The primary motivation of the different memcg oom semantic was the > deadlock avoidance. Things have changed since then, though. We have > an async oom teardown by the oom reaper now and so we do not have to > rely on the victim to tear down its memory anymore. Therefore we can > return to the original semantic as long as the memcg oom killer is not > handed over to the users space. > > There is still one thing to be careful about here though. If the oom > killer is not able to make any forward progress - e.g. because there is > no eligible task to kill - then we have to bail out of the charge path > to prevent from same class of deadlocks. We have basically two options > here. Either we fail the charge with ENOMEM or force the charge and > allow overcharge. The first option has been considered more harmful than > useful because rare inconsistencies in the ENOMEM behavior is hard to > test for and error prone. Basically the same reason why the page > allocator doesn't fail allocations under such conditions. The later > might allow runaways but those should be really unlikely unless somebody > misconfigures the system. E.g. 
allowing tasks to migrate away from the > memcg to a different unlimited memcg with move_charge_at_immigrate > disabled. > > Changes since rfc v1 > - s@memcg_may_oom@in_user_fault@ suggested by Johannes. It is much clearer > what the purpose of the flag is now > - s@mem_cgroup_oom_enable@mem_cgroup_enter_user_fault@g > s@mem_cgroup_oom_disable@mem_cgroup_exit_user_fault@g as per Johannes > - make oom_kill_disable an exceptional case because it should be rare > and the normal oom handling a core of the function - per Johannes > > Signed-off-by: Michal Hocko Acked-by: Greg Thelen Thanks! One comment below. > --- > > Hi, > I've posted this as an RFC previously [1]. There was no fundamental > disagreement so I've integrated all the suggested changes and tested it. > mmap(MAP_POPULATE) hits the oom killer again rather than silently failing > to populate the mapping on the hard limit excess. On the other hand > g-u-p and other charge paths keep the ENOMEM semantic when the memcg oom > killer is disabled. All the forward progress guarantee relies on the oom > reaper. 
> > Unless there are objections I think this is ready to go to mmotm and > ready for the next merge window > > [1] http://lkml.kernel.org/r/20180620103736.13880-1-mho...@kernel.org > include/linux/memcontrol.h | 16 > include/linux/sched.h | 2 +- > mm/memcontrol.c| 75 ++ > mm/memory.c| 4 +- > 4 files changed, 71 insertions(+), 26 deletions(-) > > diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h > index 6c6fb116e925..5a69bb4026f6 100644 > --- a/include/linux/memcontrol.h > +++ b/include/linux/memcontrol.h > @@ -494,16 +494,16 @@ unsigned long mem_cgroup_get_max(struct mem_cgroup > *memcg); > void mem_cgroup_print_oom_info(struct mem_cgroup *memcg, > struct task_struct *p); > > -static inline void mem_cgroup_oom_enable(void) > +static inline void mem_cgroup_enter_user_fault(void) > { > - WARN_ON(current->memcg_may_oom); > - current->memcg_may_oom = 1; > + WARN_ON(current->in_user_fault); > + current->in_user_fault = 1; > } > > -static inline void mem_cgroup_oom_disable(void) > +static inline void mem_cgroup_exit_user_fault(void) > { > - WARN_ON(!current->memcg_may_oom); > - current->memcg_may_oom = 0; > + WARN_ON(!current->in_user_fault); > + current->in_user_fault = 0; > } > > static inline bool task_in_memcg_oom(struct task_struct *p) > @@ -924,11 +924,11 @@ static inline void mem_cgroup_handle_over_high(void) > { > } > > -static inline void mem_cgroup_oom_enable(void)
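The post-patch charge semantics can be summarized with a small userspace model. This is an illustrative sketch only: `charge_on_limit()` and the `CHARGE_*` results are invented names, not kernel API; only the `in_user_fault` flag comes from the patch, and the real charge path has retries and throttling this model omits.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Userspace model of the memcg hard-limit decision -- not kernel code. */

struct task {
	bool in_user_fault;	/* set between enter/exit_user_fault() */
};

enum charge_result {
	CHARGE_OK,	/* oom killer made room, charge can retry */
	CHARGE_ENOMEM,	/* caller sees -ENOMEM */
	CHARGE_FORCED,	/* overcharge rather than risk a deadlock */
};

/* Simplified decision when a charge hits the hard limit. */
static enum charge_result charge_on_limit(const struct task *t,
					  bool oom_kill_disabled,
					  bool oom_can_make_progress)
{
	if (oom_kill_disabled) {
		/*
		 * OOM handling is delegated to user space: keep the
		 * ENOMEM semantic, except user faults which go through
		 * the pagefault OOM path.
		 */
		return t->in_user_fault ? CHARGE_FORCED : CHARGE_ENOMEM;
	}
	if (oom_can_make_progress)
		return CHARGE_OK;	/* invoke oom killer from charge context */
	/*
	 * No eligible victim: force the charge instead of returning a
	 * hard-to-test-for ENOMEM (mirrors page allocator behavior).
	 */
	return CHARGE_FORCED;
}
```

The key difference from the pre-patch behavior is the middle branch: the OOM killer is invoked from the charging context again rather than deferred to `pagefault_out_of_memory()`.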
[PATCH] writeback: update stale account_page_redirty() comment
commit 93f78d882865 ("writeback: move backing_dev_info->bdi_stat[] into bdi_writeback") replaced BDI_DIRTIED with WB_DIRTIED in account_page_redirty(). Update comment to track that change. BDI_DIRTIED => WB_DIRTIED BDI_WRITTEN => WB_WRITTEN Signed-off-by: Greg Thelen --- mm/page-writeback.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/mm/page-writeback.c b/mm/page-writeback.c index 337c6afb3345..6551d3b0dc30 100644 --- a/mm/page-writeback.c +++ b/mm/page-writeback.c @@ -2490,8 +2490,8 @@ EXPORT_SYMBOL(__set_page_dirty_nobuffers); /* * Call this whenever redirtying a page, to de-account the dirty counters - * (NR_DIRTIED, BDI_DIRTIED, tsk->nr_dirtied), so that they match the written - * counters (NR_WRITTEN, BDI_WRITTEN) in long term. The mismatches will lead to + * (NR_DIRTIED, WB_DIRTIED, tsk->nr_dirtied), so that they match the written + * counters (NR_WRITTEN, WB_WRITTEN) in long term. The mismatches will lead to * systematic errors in balanced_dirty_ratelimit and the dirty pages position * control. */ -- 2.18.0.rc2.346.g013aa6912e-goog
Re: [PATCH v2] mm: condense scan_control
On Tue, May 29, 2018 at 11:12 PM Greg Thelen wrote: > > Use smaller scan_control fields for order, priority, and reclaim_idx. > Convert fields from int => s8. All easily fit within a byte: > * allocation order range: 0..MAX_ORDER(64?) > * priority range: 0..12(DEF_PRIORITY) > * reclaim_idx range: 0..6(__MAX_NR_ZONES) > > Since commit 6538b8ea886e ("x86_64: expand kernel stack to 16K") x86_64 > stack overflows are not an issue. But it's inefficient to use ints. > > Use s8 (signed byte) rather than u8 to allow for loops like: > do { > ... > } while (--sc.priority >= 0); > > Add BUILD_BUG_ON to verify that s8 is capable of storing max values. > > This reduces sizeof(struct scan_control): > * 96 => 80 bytes (x86_64) > * 68 => 56 bytes (i386) > > scan_control structure field order is changed to utilize padding. > After this patch there is 1 bit of scan_control padding. > > Signed-off-by: Greg Thelen > Suggested-by: Matthew Wilcox Is there any interest in this? Less stack usage could mean less dcache and dtlb pressure. But I understand if the complexity is distasteful. > --- > mm/vmscan.c | 32 > 1 file changed, 20 insertions(+), 12 deletions(-) > > diff --git a/mm/vmscan.c b/mm/vmscan.c > index 9b697323a88c..42731faea306 100644 > --- a/mm/vmscan.c > +++ b/mm/vmscan.c > @@ -65,12 +65,6 @@ struct scan_control { > /* How many pages shrink_list() should reclaim */ > unsigned long nr_to_reclaim; > > - /* This context's GFP mask */ > - gfp_t gfp_mask; > - > - /* Allocation order */ > - int order; > - > /* > * Nodemask of nodes allowed by the caller. If NULL, all nodes > * are scanned. 
> @@ -83,12 +77,6 @@ struct scan_control { > */ > struct mem_cgroup *target_mem_cgroup; > > - /* Scan (total_size >> priority) pages at once */ > - int priority; > - > - /* The highest zone to isolate pages for reclaim from */ > - enum zone_type reclaim_idx; > - > /* Writepage batching in laptop mode; RECLAIM_WRITE */ > unsigned int may_writepage:1; > > @@ -111,6 +99,18 @@ struct scan_control { > /* One of the zones is ready for compaction */ > unsigned int compaction_ready:1; > > + /* Allocation order */ > + s8 order; > + > + /* Scan (total_size >> priority) pages at once */ > + s8 priority; > + > + /* The highest zone to isolate pages for reclaim from */ > + s8 reclaim_idx; > + > + /* This context's GFP mask */ > + gfp_t gfp_mask; > + > /* Incremented by the number of inactive pages that were scanned */ > unsigned long nr_scanned; > > @@ -3047,6 +3047,14 @@ unsigned long try_to_free_pages(struct zonelist > *zonelist, int order, > .may_swap = 1, > }; > > + /* > +* scan_control uses s8 fields for order, priority, and reclaim_idx. > +* Confirm they are large enough for max values. > +*/ > + BUILD_BUG_ON(MAX_ORDER > S8_MAX); > + BUILD_BUG_ON(DEF_PRIORITY > S8_MAX); > + BUILD_BUG_ON(MAX_NR_ZONES > S8_MAX); > + > /* > * Do not enter reclaim if fatal signal was delivered while throttled. > * 1 is returned so that the page allocator does not OOM kill at this > -- > 2.17.0.921.gf22659ad46-goog >
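The layout change is easy to model outside the kernel. The sketch below uses a guessed subset of the fields, userspace stand-ins for `s8` and `S8_MAX`, and C11 `_Static_assert` in place of the patch's `BUILD_BUG_ON` checks; the constants are illustrative, not the kernel's actual values.

```c
#include <assert.h>
#include <stdio.h>

typedef signed char s8;		/* kernel's s8: one signed byte */

#define S8_MAX		127
#define MAX_ORDER	11	/* illustrative */
#define DEF_PRIORITY	12
#define MAX_NR_ZONES	7	/* illustrative */

/* Reordered layout: byte-wide fields grouped after the bitfields so
 * they pack into what was previously padding. */
struct scan_control_sketch {
	unsigned long nr_to_reclaim;
	unsigned int may_writepage:1;
	unsigned int compaction_ready:1;
	s8 order;		/* 0..MAX_ORDER fits in a byte */
	s8 priority;		/* 0..DEF_PRIORITY */
	s8 reclaim_idx;		/* 0..MAX_NR_ZONES-1 */
	unsigned int gfp_mask;	/* stand-in for gfp_t */
	unsigned long nr_scanned;
};

/* Compile-time equivalents of the patch's BUILD_BUG_ON checks. */
_Static_assert(MAX_ORDER <= S8_MAX, "order must fit in s8");
_Static_assert(DEF_PRIORITY <= S8_MAX, "priority must fit in s8");
_Static_assert(MAX_NR_ZONES <= S8_MAX, "reclaim_idx must fit in s8");
```

If any of the max values ever outgrew a signed byte, the build would fail at this point rather than silently truncating at runtime.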
[PATCH] trace: fix SKIP_STACK_VALIDATION=1 build
Non-gcc-5 builds with CONFIG_STACK_VALIDATION=y and SKIP_STACK_VALIDATION=1 fail. Example output: /bin/sh: init/.tmp_main.o: Permission denied commit 96f60dfa5819 ("trace: Use -mcount-record for dynamic ftrace") added a mismatched endif. This causes cmd_objtool to get mistakenly set. Relocate the endif to balance the newly added -record-mcount check. Fixes: 96f60dfa5819 ("trace: Use -mcount-record for dynamic ftrace") Signed-off-by: Greg Thelen --- scripts/Makefile.build | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/scripts/Makefile.build b/scripts/Makefile.build index 34d9e9ce97c2..e7889f486ca1 100644 --- a/scripts/Makefile.build +++ b/scripts/Makefile.build @@ -239,6 +239,7 @@ cmd_record_mcount = \ "$(CC_FLAGS_FTRACE)" ]; then \ $(sub_cmd_record_mcount)\ fi; +endif # -record-mcount endif # CONFIG_FTRACE_MCOUNT_RECORD ifdef CONFIG_STACK_VALIDATION @@ -263,7 +264,6 @@ ifneq ($(RETPOLINE_CFLAGS),) objtool_args += --retpoline endif endif -endif ifdef CONFIG_MODVERSIONS -- 2.18.0.rc1.242.g61856ae69a-goog
Re: [PATCH] iw_cxgb4: add INFINIBAND_ADDR_TRANS dependency
On Mon, Jun 4, 2018 at 4:07 PM Jason Gunthorpe wrote: > > On Thu, May 31, 2018 at 02:40:59PM -0400, Doug Ledford wrote: > > On Wed, 2018-05-30 at 21:03 -0700, Greg Thelen wrote: > > > On Wed, May 30, 2018 at 4:01 PM Jason Gunthorpe wrote: > > > > > > > On Thu, May 31, 2018 at 12:40:54AM +0200, Arnd Bergmann wrote: > > > > > > On 5/30/2018 5:25 PM, Jason Gunthorpe wrote: > > > > > > > On Wed, May 30, 2018 at 05:10:35PM -0500, Steve Wise wrote: > > > > > > > > > > > > > > > > On 5/30/2018 5:04 PM, Jason Gunthorpe wrote: > > > > > > > > > On Wed, May 30, 2018 at 11:58:18PM +0200, Arnd Bergmann wrote: > > > > > > > > > > The newly added fill_res_ep_entry function fails to link if > > > > > > > > > > CONFIG_INFINIBAND_ADDR_TRANS is not set: > > > > > > > > > > > > > > > > > > > > drivers/infiniband/hw/cxgb4/restrack.o: In function > > > > > > `fill_res_ep_entry': > > > > > > > > > > restrack.c:(.text+0x3cc): undefined reference to > > > > > > > > > > `rdma_res_to_id' > > > > > > > > > > restrack.c:(.text+0x3d0): undefined reference to > > > > > > > > > > `rdma_iw_cm_id' > > > > > > > > > > > > > > > > > > > > This adds a Kconfig dependency for the driver. > > > > > > > > > > > > > > > > > > > > Fixes: 116aeb887371 ("iw_cxgb4: provide detailed > > > > > > provider-specific CM_ID information") > > > > > > > > > > Signed-off-by: Arnd Bergmann > > > > > > > > > > drivers/infiniband/hw/cxgb4/Kconfig | 1 + > > > > > > > > > > 1 file changed, 1 insertion(+) > > > > > > > > > > > > > > > > > > Oh, I think we need to solve this with maybe a header fill > > > > > > > > > null > > > > > > stub > > > > > > > > > instead.. > > > > > > > > > > > > > > > > > > We don't want to disable drivers just because a user > > > > > > > > > interface is > > > > > > > > > disabled. > > > > > > > > > > > > > > > > > > > > > > > > > Why does CONFIG_INFINIBAND_ADDR_TRANS disable building > > > > > > > > rdma_cm.ko? > > > > > > That > > > > > > > > is not correct. 
> > > > > > > > > > > > > > That seems like a reasonable thing to do.. > > > > > > > > > > > > rdma_ucm.ko is for usermode users, rdma_cm.ko is for kernel users, > > > > > > and > > > > > > is required for iwarp drivers. It seems rdma_cm.ko is not being > > > > > > compiled if ADDR_TRANS is not set. > > > > I think the intention was to completely disable rdma-cm, including all > > > > support for rx'ing remote packets? Greg? > > > > > > Yes. That's my goal when INFINIBAND_ADDR_TRANS is unset. > > > > > > > If this is required for iwarp then Arnd's patch is probably the right > > > > way to go.. > > > > Jason > > > > > > Agreed. > > > Acked-by: Greg Thelen > > > > If that's the case, then there should be a NOTE: in the Kconfig that > > disabling the connection manager completely disables iWARP hardware. > > > > I don't really think I like this approach though. At a minimum if you > > are going to make iWARP totally dependent on rdmacm, then there is zero > > reason that iw_cm.o should be part of the obj-$(CONFIG_INFINIBAND) > > Makefile recipe when ADDR_TRANS is disabled. > > This makes sense, Greg, can you send a followup patch with these > items? I know this series has been tough.. > > Thanks, > Jason Ack. Sorry for delay. I'll take a look at this in a day or two. Let me know if it's more urgent.
Re: [PATCH] iw_cxgb4: add INFINIBAND_ADDR_TRANS dependency
On Wed, May 30, 2018 at 4:01 PM Jason Gunthorpe wrote: > On Thu, May 31, 2018 at 12:40:54AM +0200, Arnd Bergmann wrote: > > > On 5/30/2018 5:25 PM, Jason Gunthorpe wrote: > > >> On Wed, May 30, 2018 at 05:10:35PM -0500, Steve Wise wrote: > > >>> > > >>> On 5/30/2018 5:04 PM, Jason Gunthorpe wrote: > > >>>> On Wed, May 30, 2018 at 11:58:18PM +0200, Arnd Bergmann wrote: > > >>>>> The newly added fill_res_ep_entry function fails to link if > > >>>>> CONFIG_INFINIBAND_ADDR_TRANS is not set: > > >>>>> > > >>>>> drivers/infiniband/hw/cxgb4/restrack.o: In function `fill_res_ep_entry': > > >>>>> restrack.c:(.text+0x3cc): undefined reference to `rdma_res_to_id' > > >>>>> restrack.c:(.text+0x3d0): undefined reference to `rdma_iw_cm_id' > > >>>>> > > >>>>> This adds a Kconfig dependency for the driver. > > >>>>> > > >>>>> Fixes: 116aeb887371 ("iw_cxgb4: provide detailed provider-specific CM_ID information") > > >>>>> Signed-off-by: Arnd Bergmann > > >>>>> drivers/infiniband/hw/cxgb4/Kconfig | 1 + > > >>>>> 1 file changed, 1 insertion(+) > > >>>> Oh, I think we need to solve this with maybe a header fill null stub > > >>>> instead.. > > >>>> > > >>>> We don't want to disable drivers just because a user interface is > > >>>> disabled. > > >>>> > > >>> Why does CONFIG_INFINIBAND_ADDR_TRANS disable building rdma_cm.ko? That > > >>> is not correct. > > >> That seems like a reasonable thing to do.. > > > rdma_ucm.ko is for usermode users, rdma_cm.ko is for kernel users, and > > > is required for iwarp drivers. It seems rdma_cm.ko is not being > > > compiled if ADDR_TRANS is not set. > I think the intention was to completely disable rdma-cm, including all > support for rx'ing remote packets? Greg? Yes. That's my goal when INFINIBAND_ADDR_TRANS is unset. > If this is required for iwarp then Arnd's patch is probably the right > way to go.. > Jason Agreed. Acked-by: Greg Thelen
[PATCH v2] mm: condense scan_control
Use smaller scan_control fields for order, priority, and reclaim_idx. Convert fields from int => s8. All easily fit within a byte: * allocation order range: 0..MAX_ORDER(64?) * priority range: 0..12(DEF_PRIORITY) * reclaim_idx range: 0..6(__MAX_NR_ZONES) Since commit 6538b8ea886e ("x86_64: expand kernel stack to 16K") x86_64 stack overflows are not an issue. But it's inefficient to use ints. Use s8 (signed byte) rather than u8 to allow for loops like: do { ... } while (--sc.priority >= 0); Add BUILD_BUG_ON to verify that s8 is capable of storing max values. This reduces sizeof(struct scan_control): * 96 => 80 bytes (x86_64) * 68 => 56 bytes (i386) scan_control structure field order is changed to utilize padding. After this patch there is 1 bit of scan_control padding. Signed-off-by: Greg Thelen Suggested-by: Matthew Wilcox --- mm/vmscan.c | 32 1 file changed, 20 insertions(+), 12 deletions(-) diff --git a/mm/vmscan.c b/mm/vmscan.c index 9b697323a88c..42731faea306 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -65,12 +65,6 @@ struct scan_control { /* How many pages shrink_list() should reclaim */ unsigned long nr_to_reclaim; - /* This context's GFP mask */ - gfp_t gfp_mask; - - /* Allocation order */ - int order; - /* * Nodemask of nodes allowed by the caller. If NULL, all nodes * are scanned. 
@@ -83,12 +77,6 @@ struct scan_control { */ struct mem_cgroup *target_mem_cgroup; - /* Scan (total_size >> priority) pages at once */ - int priority; - - /* The highest zone to isolate pages for reclaim from */ - enum zone_type reclaim_idx; - /* Writepage batching in laptop mode; RECLAIM_WRITE */ unsigned int may_writepage:1; @@ -111,6 +99,18 @@ struct scan_control { /* One of the zones is ready for compaction */ unsigned int compaction_ready:1; + /* Allocation order */ + s8 order; + + /* Scan (total_size >> priority) pages at once */ + s8 priority; + + /* The highest zone to isolate pages for reclaim from */ + s8 reclaim_idx; + + /* This context's GFP mask */ + gfp_t gfp_mask; + /* Incremented by the number of inactive pages that were scanned */ unsigned long nr_scanned; @@ -3047,6 +3047,14 @@ unsigned long try_to_free_pages(struct zonelist *zonelist, int order, .may_swap = 1, }; + /* +* scan_control uses s8 fields for order, priority, and reclaim_idx. +* Confirm they are large enough for max values. +*/ + BUILD_BUG_ON(MAX_ORDER > S8_MAX); + BUILD_BUG_ON(DEF_PRIORITY > S8_MAX); + BUILD_BUG_ON(MAX_NR_ZONES > S8_MAX); + /* * Do not enter reclaim if fatal signal was delivered while throttled. * 1 is returned so that the page allocator does not OOM kill at this -- 2.17.0.921.gf22659ad46-goog
Re: [PATCH] mm: convert scan_control.priority int => byte
Matthew Wilcox wrote: > On Mon, May 28, 2018 at 07:40:25PM -0700, Greg Thelen wrote: >> Reclaim priorities range from 0..12(DEF_PRIORITY). >> scan_control.priority is a 4 byte int, which is overkill. >> >> Since commit 6538b8ea886e ("x86_64: expand kernel stack to 16K") x86_64 >> stack overflows are not an issue. But it's inefficient to use 4 bytes >> for priority. > > If you're looking to shave a few more bytes, allocation order can fit > in a u8 too (can't be more than 6 bits, and realistically won't be more > than 4 bits). reclaim_idx likewise will fit in a u8, and actually won't > be more than 3 bits. Nod. Good tip. Included in ("[PATCH v2] mm: condense scan_control"). > I am sceptical that nr_to_reclaim should really be an unsigned long; I > don't think we should be trying to free 4 billion pages in a single call. > nr_scanned might be over 4 billion (!) but nr_reclaimed can probably > shrink to unsigned int along with nr_to_reclaim. Agreed. For patch simplicity, I'll pass on this for now.
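The s8-versus-u8 point above can be demonstrated standalone. The `count_scan_rounds()` helper below is hypothetical, not kernel code; it only shows why the countdown idiom in the patch needs a signed type — an unsigned byte would wrap from 0 to 255 instead of going negative, and the loop would never terminate.

```c
#include <assert.h>
#include <stdio.h>

typedef signed char s8;		/* kernel's s8 */

#define DEF_PRIORITY 12

/* Mimics the reclaim priority countdown: one pass per priority level,
 * from DEF_PRIORITY down to 0 inclusive. */
static int count_scan_rounds(void)
{
	s8 priority = DEF_PRIORITY;
	int rounds = 0;

	do {
		rounds++;	/* one reclaim pass at this priority */
	} while (--priority >= 0);	/* exits when priority hits -1 */

	return rounds;	/* DEF_PRIORITY + 1 passes */
}
```

With `u8 priority` the `--priority >= 0` test would be always true, which is exactly why the patch chose s8 over u8 despite only needing 4 bits of range.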
[PATCH] mm: convert scan_control.priority int => byte
Reclaim priorities range from 0..12(DEF_PRIORITY). scan_control.priority is a 4 byte int, which is overkill. Since commit 6538b8ea886e ("x86_64: expand kernel stack to 16K") x86_64 stack overflows are not an issue. But it's inefficient to use 4 bytes for priority. Use s8 (signed byte) rather than u8 to allow for loops like: do { ... } while (--sc.priority >= 0); This reduces sizeof(struct scan_control) from 96 => 88 bytes (x86_64), which saves some stack. scan_control.priority field order is changed to occupy otherwise unused padding. Signed-off-by: Greg Thelen --- mm/vmscan.c | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/mm/vmscan.c b/mm/vmscan.c index 9b697323a88c..541c334bd176 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -83,9 +83,6 @@ struct scan_control { */ struct mem_cgroup *target_mem_cgroup; - /* Scan (total_size >> priority) pages at once */ - int priority; - /* The highest zone to isolate pages for reclaim from */ enum zone_type reclaim_idx; @@ -111,6 +108,9 @@ struct scan_control { /* One of the zones is ready for compaction */ unsigned int compaction_ready:1; + /* Scan (total_size >> priority) pages at once */ + s8 priority; + /* Incremented by the number of inactive pages that were scanned */ unsigned long nr_scanned; -- 2.17.0.921.gf22659ad46-goog
Re: [PATCH] IB: Revert "remove redundant INFINIBAND kconfig dependencies"
Jason Gunthorpe wrote: On Fri, May 25, 2018 at 05:32:52PM -0700, Greg Thelen wrote: On Fri, May 25, 2018 at 2:32 PM Arnd Bergmann wrote: > Several subsystems depend on INFINIBAND_ADDR_TRANS, which in turn depends > on INFINIBAND. However, with CONFIG_INFINIBAND=m, this leads to a > link error when another driver using it is built-in. The > INFINIBAND_ADDR_TRANS dependency is insufficient here as this is > a 'bool' symbol that does not force anything to be a module in turn. > fs/cifs/smbdirect.o: In function `smbd_disconnect_rdma_work': > smbdirect.c:(.text+0x1e4): undefined reference to `rdma_disconnect' > net/9p/trans_rdma.o: In function `rdma_request': > trans_rdma.c:(.text+0x7bc): undefined reference to `rdma_disconnect' > net/9p/trans_rdma.o: In function `rdma_destroy_trans': > trans_rdma.c:(.text+0x830): undefined reference to `ib_destroy_qp' > trans_rdma.c:(.text+0x858): undefined reference to `ib_dealloc_pd' > Fixes: 9533b292a7ac ("IB: remove redundant INFINIBAND kconfig dependencies") > Signed-off-by: Arnd Bergmann Acked-by: Greg Thelen Sorry for the 9533b292a7ac problem. At this point in the release cycle, I think Arnd's revert is best. If there is interest, I've put a little thought into an alternative fix: making INFINIBAND_ADDR_TRANS tristate. But it's nontrivial. So I prefer this simple revert for now. Is that a normal thing to do? For me: no, it's not normal. In my use case I merely want to disable INFINIBAND_ADDR_TRANS while continuing to use INFINIBAND. This is supported with f7cb7b85be55 ("IB: make INFINIBAND_ADDR_TRANS configurable"). During f7cb7b85be55 development, https://lkml.org/lkml/2018/4/30/1073 suggested that we drop the dependency on both INFINIBAND and INFINIBAND_ADDR_TRANS. Thus 9533b292a7ac ("IB: remove redundant INFINIBAND kconfig dependencies"). But 9533b292a7ac led to the randconfig build errors reported, and thus ("IB: Revert "remove redundant INFINIBAND kconfig dependencies""). 
So I see no need to do anything more than apply ("IB: Revert "remove redundant INFINIBAND kconfig dependencies""). Doug: do you need anything from me on this? I can take the revert.. Jason Thanks.
Re: [PATCH] IB: Revert "remove redundant INFINIBAND kconfig dependencies"
On Fri, May 25, 2018 at 2:32 PM Arnd Bergmann <a...@arndb.de> wrote: > Several subsystems depend on INFINIBAND_ADDR_TRANS, which in turn depends > on INFINIBAND. However, with CONFIG_INFINIBAND=m, this leads to a > link error when another driver using it is built-in. The > INFINIBAND_ADDR_TRANS dependency is insufficient here as this is > a 'bool' symbol that does not force anything to be a module in turn. > fs/cifs/smbdirect.o: In function `smbd_disconnect_rdma_work': > smbdirect.c:(.text+0x1e4): undefined reference to `rdma_disconnect' > net/9p/trans_rdma.o: In function `rdma_request': > trans_rdma.c:(.text+0x7bc): undefined reference to `rdma_disconnect' > net/9p/trans_rdma.o: In function `rdma_destroy_trans': > trans_rdma.c:(.text+0x830): undefined reference to `ib_destroy_qp' > trans_rdma.c:(.text+0x858): undefined reference to `ib_dealloc_pd' > Fixes: 9533b292a7ac ("IB: remove redundant INFINIBAND kconfig dependencies") > Signed-off-by: Arnd Bergmann <a...@arndb.de> Acked-by: Greg Thelen <gthe...@google.com> Sorry for the 9533b292a7ac problem. At this point in the release cycle, I think Arnd's revert is best. If there is interest, I've put a little thought into an alternative fix: making INFINIBAND_ADDR_TRANS tristate. But it's nontrivial. So I prefer this simple revert for now. Doug: do you need anything from me on this? > --- > The patch that introduced the problem has been queued in the > rdma-fixes/for-rc tree. Please revert the patch before sending > the branch to Linus. 
> --- > drivers/infiniband/ulp/srpt/Kconfig | 2 +- > drivers/nvme/host/Kconfig | 2 +- > drivers/nvme/target/Kconfig | 2 +- > drivers/staging/lustre/lnet/Kconfig | 2 +- > fs/cifs/Kconfig | 2 +- > net/9p/Kconfig | 2 +- > net/rds/Kconfig | 2 +- > net/sunrpc/Kconfig | 2 +- > 8 files changed, 8 insertions(+), 8 deletions(-) > diff --git a/drivers/infiniband/ulp/srpt/Kconfig b/drivers/infiniband/ulp/srpt/Kconfig > index 25bf6955b6d0..fb8b7182f05e 100644 > --- a/drivers/infiniband/ulp/srpt/Kconfig > +++ b/drivers/infiniband/ulp/srpt/Kconfig > @@ -1,6 +1,6 @@ > config INFINIBAND_SRPT >tristate "InfiniBand SCSI RDMA Protocol target support" > - depends on INFINIBAND_ADDR_TRANS && TARGET_CORE > + depends on INFINIBAND && INFINIBAND_ADDR_TRANS && TARGET_CORE >---help--- > Support for the SCSI RDMA Protocol (SRP) Target driver. The > diff --git a/drivers/nvme/host/Kconfig b/drivers/nvme/host/Kconfig > index dbb7464c018c..88a8b5916624 100644 > --- a/drivers/nvme/host/Kconfig > +++ b/drivers/nvme/host/Kconfig > @@ -27,7 +27,7 @@ config NVME_FABRICS > config NVME_RDMA >tristate "NVM Express over Fabrics RDMA host driver" > - depends on INFINIBAND_ADDR_TRANS && BLOCK > + depends on INFINIBAND && INFINIBAND_ADDR_TRANS && BLOCK >select NVME_CORE >select NVME_FABRICS >select SG_POOL > diff --git a/drivers/nvme/target/Kconfig b/drivers/nvme/target/Kconfig > index 7595664ee753..3c7b61ddb0d1 100644 > --- a/drivers/nvme/target/Kconfig > +++ b/drivers/nvme/target/Kconfig > @@ -27,7 +27,7 @@ config NVME_TARGET_LOOP > config NVME_TARGET_RDMA >tristate "NVMe over Fabrics RDMA target support" > - depends on INFINIBAND_ADDR_TRANS > + depends on INFINIBAND && INFINIBAND_ADDR_TRANS >depends on NVME_TARGET >select SGL_ALLOC >help > diff --git a/drivers/staging/lustre/lnet/Kconfig b/drivers/staging/lustre/lnet/Kconfig > index f3b1ad4bd3dc..ad049e6f24e4 100644 > --- a/drivers/staging/lustre/lnet/Kconfig > +++ b/drivers/staging/lustre/lnet/Kconfig > @@ -34,7 +34,7 @@ config LNET_SELFTEST > config 
LNET_XPRT_IB >tristate "LNET infiniband support" > - depends on LNET && PCI && INFINIBAND_ADDR_TRANS > + depends on LNET && PCI && INFINIBAND && INFINIBAND_ADDR_TRANS >default LNET && INFINIBAND >help > This option allows the LNET users to use infiniband as an > diff --git a/fs/cifs/Kconfig b/fs/cifs/Kconfig > index d61e2de8d0eb..5f132d59dfc2 100644 > --- a/fs/cifs/Kconfig > +++ b/fs/cifs/Kconfig > @@ -197,7 +197,7 @@ config CIFS_SMB311 > config CIFS_SMB_DIRECT >bool "SMB Direct support (Experimental)" > - depends on CIFS=m && INFINIBAND_ADDR_TRANS || CIFS=y && INFINIBAND_ADDR_TRANS=y > + depends on CIF
Re: [PATCH 3/5] ib_srpt: depend on INFINIBAND_ADDR_TRANS
On Tue, May 1, 2018 at 1:48 PM Jason Gunthorpe <j...@ziepe.ca> wrote: > On Tue, May 01, 2018 at 03:08:57AM +0000, Greg Thelen wrote: > > On Mon, Apr 30, 2018 at 4:35 PM Jason Gunthorpe <j...@ziepe.ca> wrote: > > > > > On Wed, Apr 25, 2018 at 03:33:39PM -0700, Greg Thelen wrote: > > > > INFINIBAND_SRPT code depends on INFINIBAND_ADDR_TRANS provided symbols. > > > > So declare the kconfig dependency. This is necessary to allow for > > > > enabling INFINIBAND without INFINIBAND_ADDR_TRANS. > > > > > > > > Signed-off-by: Greg Thelen <gthe...@google.com> > > > > Cc: Tarick Bedeir <tar...@google.com> > > > > drivers/infiniband/ulp/srpt/Kconfig | 2 +- > > > > 1 file changed, 1 insertion(+), 1 deletion(-) > > > > > > > > diff --git a/drivers/infiniband/ulp/srpt/Kconfig > > b/drivers/infiniband/ulp/srpt/Kconfig > > > > index 31ee83d528d9..fb8b7182f05e 100644 > > > > +++ b/drivers/infiniband/ulp/srpt/Kconfig > > > > @@ -1,6 +1,6 @@ > > > > config INFINIBAND_SRPT > > > > tristate "InfiniBand SCSI RDMA Protocol target support" > > > > - depends on INFINIBAND && TARGET_CORE > > > > + depends on INFINIBAND && INFINIBAND_ADDR_TRANS && TARGET_CORE > > > > > Isn't INFINIBAND && INFINIBAND_ADDR_TRANS a bit redundant? Can't have > > > INFINIBAND_ADDR_TRANS without INFINIBAND. > > > > By kconfig INFINIBAND_ADDR_TRANS depends on INFINIBAND. So yes, it seems > > redundant. I don't know if anyone has designs to break this dependency and > > allow for ADDR_TRANS without INFINIBAND. Assuming not, I'd be willing to > > amend my series removing redundant INFINIBAND and a followup series to > > remove it from similar depends. Though I'm not familiar with rdma dev tree > > lifecycle. Is rdma/for-rc a throw away branch (akin to linux-next), or > > will it be merged into linus/master? If throwaway, then we can amend > > its patches, otherwise followups will be needed. > > > > Let me know what you'd prefer. Thanks. 
> I think a single update patch to fix all of these, and the > pre-existing ones would be great I just posted the cleanup: https://lkml.org/lkml/2018/5/3/1045
[PATCH] IB: remove redundant INFINIBAND kconfig dependencies
INFINIBAND_ADDR_TRANS depends on INFINIBAND. So there's no need for options which depend on INFINIBAND_ADDR_TRANS to also depend on INFINIBAND. Remove the unnecessary INFINIBAND depends. Signed-off-by: Greg Thelen <gthe...@google.com> --- drivers/infiniband/ulp/srpt/Kconfig | 2 +- drivers/nvme/host/Kconfig | 2 +- drivers/nvme/target/Kconfig | 2 +- drivers/staging/lustre/lnet/Kconfig | 2 +- fs/cifs/Kconfig | 2 +- net/9p/Kconfig | 2 +- net/rds/Kconfig | 2 +- net/sunrpc/Kconfig | 2 +- 8 files changed, 8 insertions(+), 8 deletions(-) diff --git a/drivers/infiniband/ulp/srpt/Kconfig b/drivers/infiniband/ulp/srpt/Kconfig index fb8b7182f05e..25bf6955b6d0 100644 --- a/drivers/infiniband/ulp/srpt/Kconfig +++ b/drivers/infiniband/ulp/srpt/Kconfig @@ -1,6 +1,6 @@ config INFINIBAND_SRPT tristate "InfiniBand SCSI RDMA Protocol target support" - depends on INFINIBAND && INFINIBAND_ADDR_TRANS && TARGET_CORE + depends on INFINIBAND_ADDR_TRANS && TARGET_CORE ---help--- Support for the SCSI RDMA Protocol (SRP) Target driver. 
The diff --git a/drivers/nvme/host/Kconfig b/drivers/nvme/host/Kconfig index 88a8b5916624..dbb7464c018c 100644 --- a/drivers/nvme/host/Kconfig +++ b/drivers/nvme/host/Kconfig @@ -27,7 +27,7 @@ config NVME_FABRICS config NVME_RDMA tristate "NVM Express over Fabrics RDMA host driver" - depends on INFINIBAND && INFINIBAND_ADDR_TRANS && BLOCK + depends on INFINIBAND_ADDR_TRANS && BLOCK select NVME_CORE select NVME_FABRICS select SG_POOL diff --git a/drivers/nvme/target/Kconfig b/drivers/nvme/target/Kconfig index 3c7b61ddb0d1..7595664ee753 100644 --- a/drivers/nvme/target/Kconfig +++ b/drivers/nvme/target/Kconfig @@ -27,7 +27,7 @@ config NVME_TARGET_LOOP config NVME_TARGET_RDMA tristate "NVMe over Fabrics RDMA target support" - depends on INFINIBAND && INFINIBAND_ADDR_TRANS + depends on INFINIBAND_ADDR_TRANS depends on NVME_TARGET select SGL_ALLOC help diff --git a/drivers/staging/lustre/lnet/Kconfig b/drivers/staging/lustre/lnet/Kconfig index ad049e6f24e4..f3b1ad4bd3dc 100644 --- a/drivers/staging/lustre/lnet/Kconfig +++ b/drivers/staging/lustre/lnet/Kconfig @@ -34,7 +34,7 @@ config LNET_SELFTEST config LNET_XPRT_IB tristate "LNET infiniband support" - depends on LNET && PCI && INFINIBAND && INFINIBAND_ADDR_TRANS + depends on LNET && PCI && INFINIBAND_ADDR_TRANS default LNET && INFINIBAND help This option allows the LNET users to use infiniband as an diff --git a/fs/cifs/Kconfig b/fs/cifs/Kconfig index 5f132d59dfc2..d61e2de8d0eb 100644 --- a/fs/cifs/Kconfig +++ b/fs/cifs/Kconfig @@ -197,7 +197,7 @@ config CIFS_SMB311 config CIFS_SMB_DIRECT bool "SMB Direct support (Experimental)" - depends on CIFS=m && INFINIBAND && INFINIBAND_ADDR_TRANS || CIFS=y && INFINIBAND=y && INFINIBAND_ADDR_TRANS=y + depends on CIFS=m && INFINIBAND_ADDR_TRANS || CIFS=y && INFINIBAND_ADDR_TRANS=y help Enables SMB Direct experimental support for SMB 3.0, 3.02 and 3.1.1. SMB Direct allows transferring SMB packets over RDMA. 
If unsure, diff --git a/net/9p/Kconfig b/net/9p/Kconfig index e6014e0e51f7..46c39f7da444 100644 --- a/net/9p/Kconfig +++ b/net/9p/Kconfig @@ -32,7 +32,7 @@ config NET_9P_XEN config NET_9P_RDMA - depends on INET && INFINIBAND && INFINIBAND_ADDR_TRANS + depends on INET && INFINIBAND_ADDR_TRANS tristate "9P RDMA Transport (Experimental)" help This builds support for an RDMA transport. diff --git a/net/rds/Kconfig b/net/rds/Kconfig index bffde4b46c5d..1a31502ee7db 100644 --- a/net/rds/Kconfig +++ b/net/rds/Kconfig @@ -8,7 +8,7 @@ config RDS config RDS_RDMA tristate "RDS over Infiniband" - depends on RDS && INFINIBAND && INFINIBAND_ADDR_TRANS + depends on RDS && INFINIBAND_ADDR_TRANS ---help--- Allow RDS to use Infiniband as a transport. This transport supports RDMA operations. diff --git a/net/sunrpc/Kconfig b/net/sunrpc/Kconfig index ac09ca803296..6358e5271070 100644 --- a/net/sunrpc/Kconfig +++ b/net/sunrpc/Kconfig @@ -50,7 +50,7 @@ config SUNRPC_DEBUG config SUNRPC_XPRT_RDMA tristate "RPC-over-RDMA transport" - depends on SUNRPC && INFINIBAND && INFINIBAND_ADDR_TRANS + depends on SUNRPC && INFINIBAND_ADDR_TRANS default SUNRPC && INFINIBAND select SG_POOL help -- 2.17.0.441.gb46fe60e1d-goog
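Why these "redundant" depends later had to be restored (per the revert thread earlier in this archive) can be sketched with a hypothetical Kconfig fragment; BASE, BRIDGE, and USER are made-up symbols for illustration, not real kernel config options:

```kconfig
# Hypothetical symbols, for illustration only.
config BASE
	tristate "base driver (think INFINIBAND)"

config BRIDGE
	bool "bridge feature (think INFINIBAND_ADDR_TRANS)"
	depends on BASE
	default y

config USER
	tristate "consumer of BRIDGE-provided symbols"
	# 'depends on BRIDGE' alone looks sufficient, since BRIDGE
	# already depends on BASE.  But BRIDGE is a bool: with BASE=m
	# and BRIDGE=y, USER can still be set to y (built-in) and then
	# fails to link against symbols that live in the BASE module.
	# The extra 'depends on BASE' limits USER to m when BASE=m.
	depends on BASE && BRIDGE
```

This is the pitfall Arnd's revert commit message describes: a 'bool' dependency does not force anything to be a module in turn.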
[PATCH] memcg: mark memcg1_events static const
Mark memcg1_events static: it's only used by memcontrol.c. And mark it const: it's not modified. Signed-off-by: Greg Thelen <gthe...@google.com> --- mm/memcontrol.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 2bd3df3d101a..c9c7e5ea0e2f 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -3083,7 +3083,7 @@ static int memcg_numa_stat_show(struct seq_file *m, void *v) #endif /* CONFIG_NUMA */ /* Universal VM events cgroup1 shows, original sort order */ -unsigned int memcg1_events[] = { +static const unsigned int memcg1_events[] = { PGPGIN, PGPGOUT, PGFAULT, -- 2.17.0.441.gb46fe60e1d-goog
Re: [PATCH 3/5] ib_srpt: depend on INFINIBAND_ADDR_TRANS
On Mon, Apr 30, 2018 at 4:35 PM Jason Gunthorpe <j...@ziepe.ca> wrote: > On Wed, Apr 25, 2018 at 03:33:39PM -0700, Greg Thelen wrote: > > INFINIBAND_SRPT code depends on INFINIBAND_ADDR_TRANS provided symbols. > > So declare the kconfig dependency. This is necessary to allow for > > enabling INFINIBAND without INFINIBAND_ADDR_TRANS. > > > > Signed-off-by: Greg Thelen <gthe...@google.com> > > Cc: Tarick Bedeir <tar...@google.com> > > drivers/infiniband/ulp/srpt/Kconfig | 2 +- > > 1 file changed, 1 insertion(+), 1 deletion(-) > > > > diff --git a/drivers/infiniband/ulp/srpt/Kconfig b/drivers/infiniband/ulp/srpt/Kconfig > > index 31ee83d528d9..fb8b7182f05e 100644 > > +++ b/drivers/infiniband/ulp/srpt/Kconfig > > @@ -1,6 +1,6 @@ > > config INFINIBAND_SRPT > > tristate "InfiniBand SCSI RDMA Protocol target support" > > - depends on INFINIBAND && TARGET_CORE > > + depends on INFINIBAND && INFINIBAND_ADDR_TRANS && TARGET_CORE > Isn't INFINIBAND && INFINIBAND_ADDR_TRANS a bit redundant? Can't have > INFINIBAND_ADDR_TRANS without INFINIBAND. By kconfig INFINIBAND_ADDR_TRANS depends on INFINIBAND. So yes, it seems redundant. I don't know if anyone has designs to break this dependency and allow for ADDR_TRANS without INFINIBAND. Assuming not, I'd be willing to amend my series removing redundant INFINIBAND and a followup series to remove it from similar depends. Though I'm not familiar with rdma dev tree lifecycle. Is rdma/for-rc a throw away branch (akin to linux-next), or will it be merged into linus/master? If throwaway, then we can amend its patches, otherwise followups will be needed. Let me know what you'd prefer. Thanks. FYI from v4.17-rc3: drivers/staging/lustre/lnet/Kconfig: depends on LNET && PCI && INFINIBAND && INFINIBAND_ADDR_TRANS net/9p/Kconfig: depends on INET && INFINIBAND && INFINIBAND_ADDR_TRANS net/rds/Kconfig: depends on RDS && INFINIBAND && INFINIBAND_ADDR_TRANS net/sunrpc/Kconfig: depends on SUNRPC && INFINIBAND && INFINIBAND_ADDR_TRANS
[PATCH 6/6] IB: make INFINIBAND_ADDR_TRANS configurable
Allow INFINIBAND without INFINIBAND_ADDR_TRANS because fuzzing has been finding a fair number of CM bugs. So provide an option to disable it. Signed-off-by: Greg Thelen <gthe...@google.com> Cc: Tarick Bedeir <tar...@google.com> --- drivers/infiniband/Kconfig | 5 - 1 file changed, 4 insertions(+), 1 deletion(-) diff --git a/drivers/infiniband/Kconfig b/drivers/infiniband/Kconfig index ee270e065ba9..2a972ed6851b 100644 --- a/drivers/infiniband/Kconfig +++ b/drivers/infiniband/Kconfig @@ -61,9 +61,12 @@ config INFINIBAND_ON_DEMAND_PAGING pages on demand instead. config INFINIBAND_ADDR_TRANS - bool + bool "RDMA/CM" depends on INFINIBAND default y + ---help--- + Support for RDMA communication manager (CM). + This allows for a generic connection abstraction over RDMA. config INFINIBAND_ADDR_TRANS_CONFIGFS bool -- 2.17.0.484.g0c8726318c-goog
[PATCH 5/6] ib_srp: depend on INFINIBAND_ADDR_TRANS
INFINIBAND_SRP code depends on INFINIBAND_ADDR_TRANS provided symbols. So declare the kconfig dependency. This is necessary to allow for enabling INFINIBAND without INFINIBAND_ADDR_TRANS. Signed-off-by: Greg Thelen <gthe...@google.com> Cc: Tarick Bedeir <tar...@google.com> --- drivers/infiniband/ulp/srp/Kconfig | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/infiniband/ulp/srp/Kconfig b/drivers/infiniband/ulp/srp/Kconfig index c74ee9633041..99db8fe5173a 100644 --- a/drivers/infiniband/ulp/srp/Kconfig +++ b/drivers/infiniband/ulp/srp/Kconfig @@ -1,6 +1,6 @@ config INFINIBAND_SRP tristate "InfiniBand SCSI RDMA Protocol" - depends on SCSI + depends on SCSI && INFINIBAND_ADDR_TRANS select SCSI_SRP_ATTRS ---help--- Support for the SCSI RDMA Protocol over InfiniBand. This -- 2.17.0.484.g0c8726318c-goog
[PATCH 4/6] cifs: smbd: depend on INFINIBAND_ADDR_TRANS
CIFS_SMB_DIRECT code depends on INFINIBAND_ADDR_TRANS provided symbols. So declare the kconfig dependency. This is necessary to allow for enabling INFINIBAND without INFINIBAND_ADDR_TRANS. Signed-off-by: Greg Thelen <gthe...@google.com> Cc: Tarick Bedeir <tar...@google.com> Reviewed-by: Long Li <lon...@microsoft.com> --- fs/cifs/Kconfig | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/fs/cifs/Kconfig b/fs/cifs/Kconfig index 741749a98614..5f132d59dfc2 100644 --- a/fs/cifs/Kconfig +++ b/fs/cifs/Kconfig @@ -197,7 +197,7 @@ config CIFS_SMB311 config CIFS_SMB_DIRECT bool "SMB Direct support (Experimental)" - depends on CIFS=m && INFINIBAND || CIFS=y && INFINIBAND=y + depends on CIFS=m && INFINIBAND && INFINIBAND_ADDR_TRANS || CIFS=y && INFINIBAND=y && INFINIBAND_ADDR_TRANS=y help Enables SMB Direct experimental support for SMB 3.0, 3.02 and 3.1.1. SMB Direct allows transferring SMB packets over RDMA. If unsure, -- 2.17.0.484.g0c8726318c-goog
[PATCH 1/6] nvme: depend on INFINIBAND_ADDR_TRANS
NVME_RDMA code depends on INFINIBAND_ADDR_TRANS provided symbols. So declare the kconfig dependency. This is necessary to allow for enabling INFINIBAND without INFINIBAND_ADDR_TRANS. Signed-off-by: Greg Thelen <gthe...@google.com> Cc: Tarick Bedeir <tar...@google.com> --- drivers/nvme/host/Kconfig | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/nvme/host/Kconfig b/drivers/nvme/host/Kconfig index b979cf3bce65..88a8b5916624 100644 --- a/drivers/nvme/host/Kconfig +++ b/drivers/nvme/host/Kconfig @@ -27,7 +27,7 @@ config NVME_FABRICS config NVME_RDMA tristate "NVM Express over Fabrics RDMA host driver" - depends on INFINIBAND && BLOCK + depends on INFINIBAND && INFINIBAND_ADDR_TRANS && BLOCK select NVME_CORE select NVME_FABRICS select SG_POOL -- 2.17.0.484.g0c8726318c-goog
[PATCH 3/6] ib_srpt: depend on INFINIBAND_ADDR_TRANS
INFINIBAND_SRPT code depends on INFINIBAND_ADDR_TRANS provided symbols. So declare the kconfig dependency. This is necessary to allow for enabling INFINIBAND without INFINIBAND_ADDR_TRANS. Signed-off-by: Greg Thelen <gthe...@google.com> Cc: Tarick Bedeir <tar...@google.com> --- drivers/infiniband/ulp/srpt/Kconfig | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/infiniband/ulp/srpt/Kconfig b/drivers/infiniband/ulp/srpt/Kconfig index 31ee83d528d9..fb8b7182f05e 100644 --- a/drivers/infiniband/ulp/srpt/Kconfig +++ b/drivers/infiniband/ulp/srpt/Kconfig @@ -1,6 +1,6 @@ config INFINIBAND_SRPT tristate "InfiniBand SCSI RDMA Protocol target support" - depends on INFINIBAND && TARGET_CORE + depends on INFINIBAND && INFINIBAND_ADDR_TRANS && TARGET_CORE ---help--- Support for the SCSI RDMA Protocol (SRP) Target driver. The -- 2.17.0.484.g0c8726318c-goog
[PATCH 2/6] nvmet-rdma: depend on INFINIBAND_ADDR_TRANS
NVME_TARGET_RDMA code depends on INFINIBAND_ADDR_TRANS provided symbols. So declare the kconfig dependency. This is necessary to allow for enabling INFINIBAND without INFINIBAND_ADDR_TRANS. Signed-off-by: Greg Thelen <gthe...@google.com> Cc: Tarick Bedeir <tar...@google.com> --- drivers/nvme/target/Kconfig | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/nvme/target/Kconfig b/drivers/nvme/target/Kconfig index 5f4f8b16685f..3c7b61ddb0d1 100644 --- a/drivers/nvme/target/Kconfig +++ b/drivers/nvme/target/Kconfig @@ -27,7 +27,7 @@ config NVME_TARGET_LOOP config NVME_TARGET_RDMA tristate "NVMe over Fabrics RDMA target support" - depends on INFINIBAND + depends on INFINIBAND && INFINIBAND_ADDR_TRANS depends on NVME_TARGET select SGL_ALLOC help -- 2.17.0.484.g0c8726318c-goog
[PATCH 0/6] IB: make INFINIBAND_ADDR_TRANS configurable
_ADDR_TRANS" drivers/nvme/host/rdma.c - depends on NVME_RDMA => INFINIBAND_ADDR_TRANS per this series' "nvme: depend on INFINIBAND_ADDR_TRANS" drivers/nvme/target/rdma.c - depends on NVME_TARGET_RDMA => INFINIBAND_ADDR_TRANS per this series' "nvmet-rdma: depend on INFINIBAND_ADDR_TRANS" drivers/staging/lustre/lnet/klnds/o2iblnd/o2iblnd.c drivers/staging/lustre/lnet/klnds/o2iblnd/o2iblnd.h drivers/staging/lustre/lnet/klnds/o2iblnd/o2iblnd_cb.c - depends on LNET_XPRT_IB => INFINIBAND_ADDR_TRANS fs/cifs/smbdirect.c - depends on CIFS_SMB_DIRECT => INFINIBAND_ADDR_TRANS per this series' "cifs: smbd: depend on INFINIBAND_ADDR_TRANS" net/9p/trans_rdma.c - depends on NET_9P_RDMA => INFINIBAND_ADDR_TRANS net/rds/ib.c net/rds/ib_cm.c net/rds/rdma_transport.c - depends on RDS_RDMA => INFINIBAND_ADDR_TRANS net/sunrpc/xprtrdma/svc_rdma_transport.c net/sunrpc/xprtrdma/transport.c net/sunrpc/xprtrdma/verbs.c - depends on SUNRPC_XPRT_RDMA => INFINIBAND_ADDR_TRANS Greg Thelen (6): nvme: depend on INFINIBAND_ADDR_TRANS nvmet-rdma: depend on INFINIBAND_ADDR_TRANS ib_srpt: depend on INFINIBAND_ADDR_TRANS cifs: smbd: depend on INFINIBAND_ADDR_TRANS ib_srp: depend on INFINIBAND_ADDR_TRANS IB: make INFINIBAND_ADDR_TRANS configurable drivers/infiniband/Kconfig | 5 - drivers/infiniband/ulp/srp/Kconfig | 2 +- drivers/infiniband/ulp/srpt/Kconfig | 2 +- drivers/nvme/host/Kconfig | 2 +- drivers/nvme/target/Kconfig | 2 +- fs/cifs/Kconfig | 2 +- 6 files changed, 9 insertions(+), 6 deletions(-) -- 2.17.0.484.g0c8726318c-goog
Re: [PATCH 5/5] IB: make INFINIBAND_ADDR_TRANS configurable
On Wed, Apr 25, 2018 at 7:13 PM Bart Van Assche <bart.vanass...@wdc.com> wrote: > On Wed, 2018-04-25 at 15:34 -0700, Greg Thelen wrote: > > Allow INFINIBAND without INFINIBAND_ADDR_TRANS because fuzzing has been > > finding fair number of CM bugs. So provide option to disable it. > > > > Signed-off-by: Greg Thelen <gthe...@google.com> > > Cc: Tarick Bedeir <tar...@google.com> > > --- > > drivers/infiniband/Kconfig | 5 - > > 1 file changed, 4 insertions(+), 1 deletion(-) > > > > diff --git a/drivers/infiniband/Kconfig b/drivers/infiniband/Kconfig > > index ee270e065ba9..2a972ed6851b 100644 > > --- a/drivers/infiniband/Kconfig > > +++ b/drivers/infiniband/Kconfig > > @@ -61,9 +61,12 @@ config INFINIBAND_ON_DEMAND_PAGING > > pages on demand instead. > > > > config INFINIBAND_ADDR_TRANS > > - bool > > + bool "RDMA/CM" > > depends on INFINIBAND > > default y > > + ---help--- > > + Support for RDMA communication manager (CM). > > + This allows for a generic connection abstraction over RDMA. > Hello Greg, > Please provide a cover letter when posting a patch series. Such a cover > letter is not only informative but also makes it easy for people who want > to comment on a patch series as a whole. I have a question that applies > to the entire patch series. The RDMA/CM code defines functions like > rdma_create_id() and rdma_create_qp(). 
If I search through the kernel tree > for callers of these functions then I find several more kernel modules than > the ones that are modified by this patch series: > $ git grep -lE '[[:blank:]](rdma_create_id|rdma_create_qp)\(' > drivers/infiniband/core/cma.c > drivers/infiniband/ulp/iser/iser_verbs.c > drivers/infiniband/ulp/isert/ib_isert.c > drivers/infiniband/ulp/srp/ib_srp.c > drivers/infiniband/ulp/srpt/ib_srpt.c > drivers/nvme/host/rdma.c > drivers/nvme/target/rdma.c > drivers/staging/lustre/lnet/klnds/o2iblnd/o2iblnd.c > drivers/staging/lustre/lnet/klnds/o2iblnd/o2iblnd.h > fs/cifs/smbdirect.c > include/rdma/rdma_cm.h > net/9p/trans_rdma.c > net/rds/ib.c > net/rds/ib_cm.c > net/rds/rdma_transport.c > net/sunrpc/xprtrdma/svc_rdma_transport.c > net/sunrpc/xprtrdma/verbs.c > Are you sure that this patch series is complete? I'll check your cited files. I'll resend with a cover letter. FYI: I already ran this series through the 0-day builder, which presumably would've caught most config dep issues. But I'll research your list before reposting.
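Bart's git grep can be extended into a rough completeness check: find the files calling the RDMA/CM entry points, then flag any whose sibling Kconfig never mentions INFINIBAND_ADDR_TRANS. Below is a hedged sketch against a miniature illustrative tree; since the governing Kconfig symbol can live further up the directory hierarchy, this is only a first-pass heuristic, not something from the thread.

```shell
set -eu
tree=$(mktemp -d)
mkdir -p "$tree/net/9p" "$tree/drivers/nvme/host"
# Toy caller with no ADDR_TRANS dependency declared in its Kconfig.
printf 'void setup(void) { rdma_create_id(); }\n' > "$tree/net/9p/trans_rdma.c"
printf 'config NET_9P_RDMA\n\tdepends on INET && INFINIBAND\n' > "$tree/net/9p/Kconfig"
# Toy caller whose Kconfig already declares the dependency.
printf 'void setup(void) { rdma_create_qp(); }\n' > "$tree/drivers/nvme/host/rdma.c"
printf 'config NVME_RDMA\n\tdepends on INFINIBAND_ADDR_TRANS\n' > "$tree/drivers/nvme/host/Kconfig"
missing=""
for f in $(grep -rlE 'rdma_create_(id|qp)\(' "$tree"); do
	k=$(dirname "$f")/Kconfig
	# Flag callers whose adjacent Kconfig lacks the ADDR_TRANS dependency.
	if [ -f "$k" ] && ! grep -q INFINIBAND_ADDR_TRANS "$k"; then
		missing="$missing $f"
	fi
done
echo "missing deps:$missing"
```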
[PATCH 5/5] IB: make INFINIBAND_ADDR_TRANS configurable
Allow INFINIBAND without INFINIBAND_ADDR_TRANS because fuzzing has been finding a fair number of CM bugs. So provide an option to disable it. Signed-off-by: Greg Thelen <gthe...@google.com> Cc: Tarick Bedeir <tar...@google.com> --- drivers/infiniband/Kconfig | 5 - 1 file changed, 4 insertions(+), 1 deletion(-) diff --git a/drivers/infiniband/Kconfig b/drivers/infiniband/Kconfig index ee270e065ba9..2a972ed6851b 100644 --- a/drivers/infiniband/Kconfig +++ b/drivers/infiniband/Kconfig @@ -61,9 +61,12 @@ config INFINIBAND_ON_DEMAND_PAGING pages on demand instead. config INFINIBAND_ADDR_TRANS - bool + bool "RDMA/CM" depends on INFINIBAND default y + ---help--- + Support for RDMA communication manager (CM). + This allows for a generic connection abstraction over RDMA. config INFINIBAND_ADDR_TRANS_CONFIGFS bool -- 2.17.0.441.gb46fe60e1d-goog
[PATCH 4/5] cifs: smbd: depend on INFINIBAND_ADDR_TRANS
CIFS_SMB_DIRECT code depends on INFINIBAND_ADDR_TRANS provided symbols. So declare the kconfig dependency. This is necessary to allow for enabling INFINIBAND without INFINIBAND_ADDR_TRANS. Signed-off-by: Greg Thelen <gthe...@google.com> Cc: Tarick Bedeir <tar...@google.com> --- fs/cifs/Kconfig | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/fs/cifs/Kconfig b/fs/cifs/Kconfig index 741749a98614..5f132d59dfc2 100644 --- a/fs/cifs/Kconfig +++ b/fs/cifs/Kconfig @@ -197,7 +197,7 @@ config CIFS_SMB311 config CIFS_SMB_DIRECT bool "SMB Direct support (Experimental)" - depends on CIFS=m && INFINIBAND || CIFS=y && INFINIBAND=y + depends on CIFS=m && INFINIBAND && INFINIBAND_ADDR_TRANS || CIFS=y && INFINIBAND=y && INFINIBAND_ADDR_TRANS=y help Enables SMB Direct experimental support for SMB 3.0, 3.02 and 3.1.1. SMB Direct allows transferring SMB packets over RDMA. If unsure, -- 2.17.0.441.gb46fe60e1d-goog
[PATCH 3/5] ib_srpt: depend on INFINIBAND_ADDR_TRANS
INFINIBAND_SRPT code depends on INFINIBAND_ADDR_TRANS provided symbols. So declare the kconfig dependency. This is necessary to allow for enabling INFINIBAND without INFINIBAND_ADDR_TRANS. Signed-off-by: Greg Thelen <gthe...@google.com> Cc: Tarick Bedeir <tar...@google.com> --- drivers/infiniband/ulp/srpt/Kconfig | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/infiniband/ulp/srpt/Kconfig b/drivers/infiniband/ulp/srpt/Kconfig index 31ee83d528d9..fb8b7182f05e 100644 --- a/drivers/infiniband/ulp/srpt/Kconfig +++ b/drivers/infiniband/ulp/srpt/Kconfig @@ -1,6 +1,6 @@ config INFINIBAND_SRPT tristate "InfiniBand SCSI RDMA Protocol target support" - depends on INFINIBAND && TARGET_CORE + depends on INFINIBAND && INFINIBAND_ADDR_TRANS && TARGET_CORE ---help--- Support for the SCSI RDMA Protocol (SRP) Target driver. The -- 2.17.0.441.gb46fe60e1d-goog
[PATCH 2/5] nvmet-rdma: depend on INFINIBAND_ADDR_TRANS
NVME_TARGET_RDMA code depends on INFINIBAND_ADDR_TRANS provided symbols. So declare the kconfig dependency. This is necessary to allow for enabling INFINIBAND without INFINIBAND_ADDR_TRANS. Signed-off-by: Greg Thelen <gthe...@google.com> Cc: Tarick Bedeir <tar...@google.com> --- drivers/nvme/target/Kconfig | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/nvme/target/Kconfig b/drivers/nvme/target/Kconfig index 5f4f8b16685f..3c7b61ddb0d1 100644 --- a/drivers/nvme/target/Kconfig +++ b/drivers/nvme/target/Kconfig @@ -27,7 +27,7 @@ config NVME_TARGET_LOOP config NVME_TARGET_RDMA tristate "NVMe over Fabrics RDMA target support" - depends on INFINIBAND + depends on INFINIBAND && INFINIBAND_ADDR_TRANS depends on NVME_TARGET select SGL_ALLOC help -- 2.17.0.441.gb46fe60e1d-goog
[PATCH 1/5] nvme: depend on INFINIBAND_ADDR_TRANS
NVME_RDMA code depends on INFINIBAND_ADDR_TRANS provided symbols. So declare the kconfig dependency. This is necessary to allow for enabling INFINIBAND without INFINIBAND_ADDR_TRANS. Signed-off-by: Greg Thelen <gthe...@google.com> Cc: Tarick Bedeir <tar...@google.com> --- drivers/nvme/host/Kconfig | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/nvme/host/Kconfig b/drivers/nvme/host/Kconfig index b979cf3bce65..88a8b5916624 100644 --- a/drivers/nvme/host/Kconfig +++ b/drivers/nvme/host/Kconfig @@ -27,7 +27,7 @@ config NVME_FABRICS config NVME_RDMA tristate "NVM Express over Fabrics RDMA host driver" - depends on INFINIBAND && BLOCK + depends on INFINIBAND && INFINIBAND_ADDR_TRANS && BLOCK select NVME_CORE select NVME_FABRICS select SG_POOL -- 2.17.0.441.gb46fe60e1d-goog
Re: [RFC PATCH 0/2] memory.low,min reclaim
On Mon, Apr 23, 2018 at 3:38 AM Roman Gushchin <g...@fb.com> wrote: > Hi, Greg! > On Sun, Apr 22, 2018 at 01:26:10PM -0700, Greg Thelen wrote: > > Roman's previously posted memory.low,min patches add per memcg effective > > low limit to detect overcommitment of parental limits. But if we flip > > low,min reclaim to bail if usage<{low,min} at any level, then we don't need > > an effective low limit, which makes the code simpler. When parent limits > > are overcommitted memory.min will oom kill, which is more drastic but makes > > the memory.low a simpler concept. If memcg a/b wants oom kill before > > reclaim, then give it to them. It seems a bit strange for a/b/memory.low's > > behaviour to depend on a/c/memory.low (i.e. a/b.low is strong unless > > a/b.low+a/c.low exceed a.low). > It's actually not strange: a/b and a/c are sharing a common resource: > a/memory.low. > Exactly as a/b/memory.max and a/c/memory.max are sharing a/memory.max. > If there are sibling cgroups which are consuming memory, a cgroup can't > exceed parent's memory.max, even if its memory.max is greater. > > > > I think there might be a simpler way (albeit it doesn't yet include > > Documentation): > > - memcg: fix memory.low > > - memcg: add memory.min > > 3 files changed, 75 insertions(+), 6 deletions(-) > > > > The idea of this alternate approach is for memory.low,min to avoid reclaim > > if any portion of under-consideration memcg ancestry is under respective > > limit. > This approach has a significant downside: it breaks hierarchical constraints > for memory.low/min. There are two important outcomes: > 1) Any leaf's memory.low/min value is respected, even if parent's value > is lower or even 0. It's not possible anymore to limit the amount of > protected memory for a sub-tree. > This is especially bad in case of delegation. As someone who has been using something like memory.min for a while, I have cases where it needs to be a strong protection. Such jobs prefer oom kill to reclaim. 
These jobs know they need X MB of memory. But I guess it's on me to avoid configuring machines which overcommit memory.min at such cgroup levels all the way to the root. > 2) If a cgroup has an ancestor with the usage under its memory.low/min, > it becomes protection, even if its memory.low/min is 0. So it becomes > impossible to have unprotected cgroups in protected sub-tree. Fair point. One use case is a non-trivial job which has several memory accounting subcontainers. Is there a way to only set memory.low at the top and have it offer protection to the job? The case I'm thinking of is: % cd /cgroup % echo +memory > cgroup.subtree_control % mkdir top % echo +memory > top/cgroup.subtree_control % mkdir top/part1 top/part2 % echo 1GB > top/memory.min % (echo $BASHPID > top/part1/cgroup.procs && part1) % (echo $BASHPID > top/part2/cgroup.procs && part2) Empirically it's been measured that the entire workload (/top) needs 1GB to perform well. But we don't care how the memory is distributed between part1 and part2. Is the strategy for that to set /top, /top/part1.min, and /top/part2.min to 1GB? What do you think about exposing emin and elow to user space? I think that would reduce admin/user confusion in situations where memory.min is internally discounted. (tangent) Delegation in v2 isn't something I've been able to fully internalize yet. The "no interior processes" rule challenges my notion of subdelegation. My current model is that a system controller creates a container C with C.min and then starts client manager process M in C. Then M can choose to further divide C's resources (e.g. C/S). This doesn't seem possible because v2 doesn't allow for interior processes. So the system manager would need to create C, set C.low, create C/sub_manager, create C/sub_resources, set C/sub_manager.low, set C/sub_resources.low, then start M in C/sub_manager. Then sub_manager can create and manage C/sub_resources/S. PS: Thanks for the memory.low and memory.min work. 
Regardless of how we proceed it's better than the upstream memory.soft_limit_in_bytes!
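The sub_manager/sub_resources layout described above can be sketched as a script. This is an assumption-laden illustration, not an interface from the thread: CGROOT defaults to a scratch directory so the shape is visible without a mounted cgroup2 hierarchy; on a real system it would be /sys/fs/cgroup, the writes would need privileges, and memory.low accepts memparse-style values like 1G.

```shell
set -eu
CGROOT=${CGROOT:-$(mktemp -d)}   # stand-in for /sys/fs/cgroup
mkdir -p "$CGROOT/C/sub_manager" "$CGROOT/C/sub_resources"
# Enable the memory controller for C's children (just creates a file here).
echo "+memory" > "$CGROOT/C/cgroup.subtree_control"
echo 1G > "$CGROOT/C/memory.low"               # protect the container as a whole
echo 0 > "$CGROOT/C/sub_manager/memory.low"    # manager M gets no guarantee
echo 1G > "$CGROOT/C/sub_resources/memory.low" # the workload's share
# Manager M would be started in C/sub_manager; it is then free to create
# C/sub_resources/S children for the jobs it supervises, which keeps every
# process in a leaf and satisfies the "no interior processes" rule.
ls "$CGROOT/C"
```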
Re: [RFC PATCH 0/2] memory.low,min reclaim
On Mon, Apr 23, 2018 at 3:38 AM Roman Gushchin wrote: > Hi, Greg! > On Sun, Apr 22, 2018 at 01:26:10PM -0700, Greg Thelen wrote: > > Roman's previously posted memory.low,min patches add per memcg effective > > low limit to detect overcommitment of parental limits. But if we flip > > low,min reclaim to bail if usage<{low,min} at any level, then we don't need > > an effective low limit, which makes the code simpler. When parent limits > > are overcommited memory.min will oom kill, which is more drastic but makes > > the memory.low a simpler concept. If memcg a/b wants oom kill before > > reclaim, then give it to them. It seems a bit strange for a/b/memory.low's > > behaviour to depend on a/c/memory.low (i.e. a/b.low is strong unless > > a/b.low+a/c.low exceed a.low). > It's actually not strange: a/b and a/c are sharing a common resource: > a/memory.low. > Exactly as a/b/memory.max and a/c/memory.max are sharing a/memory.max. > If there are sibling cgroups which are consuming memory, a cgroup can't > exceed parent's memory.max, even if its memory.max is grater. > > > > I think there might be a simpler way (ableit it doesn't yet include > > Documentation): > > - memcg: fix memory.low > > - memcg: add memory.min > > 3 files changed, 75 insertions(+), 6 deletions(-) > > > > The idea of this alternate approach is for memory.low,min to avoid reclaim > > if any portion of under-consideration memcg ancestry is under respective > > limit. > This approach has a significant downside: it breaks hierarchical constraints > for memory.low/min. There are two important outcomes: > 1) Any leaf's memory.low/min value is respected, even if parent's value > is lower or even 0. It's not possible anymore to limit the amount of > protected memory for a sub-tree. > This is especially bad in case of delegation. As someone who has been using something like memory.min for a while, I have cases where it needs to be a strong protection. Such jobs prefer oom kill to reclaim. 
These jobs know they need X MB of memory. But I guess it's on me to
avoid configuring machines which overcommit memory.min at such cgroup
levels all the way to the root.

> 2) If a cgroup has an ancestor with the usage under its memory.low/min,
> it becomes protected, even if its memory.low/min is 0. So it becomes
> impossible to have unprotected cgroups in a protected sub-tree.

Fair point. One use case is a non-trivial job with several memory
accounting subcontainers. Is there a way to only set memory.low at the
top and have it offer protection to the whole job? The case I'm
thinking of is:

 % cd /cgroup
 % echo +memory > cgroup.subtree_control
 % mkdir top
 % echo +memory > top/cgroup.subtree_control
 % mkdir top/part1 top/part2
 % echo 1GB > top/memory.min
 % (echo $BASHPID > top/part1/cgroup.procs && part1)
 % (echo $BASHPID > top/part2/cgroup.procs && part2)

Empirically it's been measured that the entire workload (/top) needs 1GB
to perform well. But we don't care how the memory is distributed between
part1 and part2. Is the strategy for that to set /top, /top/part1.min,
and /top/part2.min to 1GB?

What do you think about exposing emin and elow to user space? I think
that would reduce admin/user confusion in situations where memory.min is
internally discounted.

(tangent) Delegation in v2 isn't something I've been able to fully
internalize yet. The "no interior processes" rule challenges my notion
of subdelegation. My current model is where a system controller creates
a container C with C.min and then starts client manager process M in C.
Then M can choose to further divide C's resources (e.g. C/S). This
doesn't seem possible because v2 doesn't allow for interior processes.
So the system manager would need to create C, set C.low, create
C/sub_manager, create C/sub_resources, set C/sub_manager.low, set
C/sub_resources.low, then start M in C/sub_manager. Then sub_manager can
create and manage C/sub_resources/S.

PS: Thanks for the memory.low and memory.min work.
Regardless of how we proceed it's better than the upstream memory.soft_limit_in_bytes!
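The disagreement above is easier to see in miniature. Below is a toy Python model of the proposed "bail if usage is at or under low at any level" walk; the names (`Memcg`, `low_protected`) are invented for illustration and this is not kernel code:

```python
# Toy userspace model of the proposed memory.low semantics.
# Not kernel code; class/function names are made up for illustration.

class Memcg:
    def __init__(self, usage, low, parent=None):
        self.usage = usage    # current charge (arbitrary units)
        self.low = low        # memory.low (same units)
        self.parent = parent  # None for the hierarchy root

def low_protected(root, memcg):
    """Proposed policy: walk from memcg up to (but excluding) root;
    protect if *any* level is at or under its own memory.low."""
    while memcg is not root:
        if memcg.usage <= memcg.low:
            return True
        memcg = memcg.parent
    return False

# Outcome (1): a/b under its own low is protected even though the
# parent 'a' exceeds its low -- the leaf's value is always respected.
a = Memcg(usage=3000, low=2000)
a_b = Memcg(usage=1000, low=3000, parent=a)
assert low_protected(a, a_b)

# Outcome (2): a leaf with low=0 under a protected middle level is
# still protected, because some ancestor is under its low.
top = Memcg(usage=5000, low=0)
mid = Memcg(usage=1000, low=2000, parent=top)
leaf = Memcg(usage=800, low=0, parent=mid)
assert low_protected(top, leaf)
```

This matches Roman's two outcomes: leaf limits always hold, and a protected interior level protects everything beneath it.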
[RFC PATCH 2/2] memcg: add memory.min
The new memory.min limit is similar to memory.low, just with no
bypassing of it when reclaim is desperate. Prefer oom kills over
reclaiming memory below memory.min. Sharing more code with
mem_cgroup_low() is possible, but the idea is posted here for
simplicity.

Signed-off-by: Greg Thelen <gthe...@google.com>
---
 include/linux/memcontrol.h |  8 ++
 mm/memcontrol.c            | 58 ++
 mm/vmscan.c                |  3 ++
 3 files changed, 69 insertions(+)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index c46016bb25eb..22bb4a88653a 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -178,6 +178,7 @@ struct mem_cgroup {
 	struct page_counter tcpmem;
 
 	/* Normal memory consumption range */
+	unsigned long min;
 	unsigned long low;
 	unsigned long high;
@@ -281,6 +282,7 @@ static inline bool mem_cgroup_disabled(void)
 	return !cgroup_subsys_enabled(memory_cgrp_subsys);
 }
 
+bool mem_cgroup_min(struct mem_cgroup *root, struct mem_cgroup *memcg);
 bool mem_cgroup_low(struct mem_cgroup *root, struct mem_cgroup *memcg);
 
 int mem_cgroup_try_charge(struct page *page, struct mm_struct *mm,
@@ -726,6 +728,12 @@ static inline void mem_cgroup_event(struct mem_cgroup *memcg,
 {
 }
 
+static inline bool mem_cgroup_min(struct mem_cgroup *root,
+				  struct mem_cgroup *memcg)
+{
+	return false;
+}
+
 static inline bool mem_cgroup_low(struct mem_cgroup *root,
 				  struct mem_cgroup *memcg)
 {
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 9668f620203a..b2aaed4003b4 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -5058,6 +5058,36 @@ static u64 memory_current_read(struct cgroup_subsys_state *css,
 	return (u64)page_counter_read(&memcg->memory) * PAGE_SIZE;
 }
 
+static int memory_min_show(struct seq_file *m, void *v)
+{
+	struct mem_cgroup *memcg = mem_cgroup_from_css(seq_css(m));
+	unsigned long min = READ_ONCE(memcg->min);
+
+	if (min == PAGE_COUNTER_MAX)
+		seq_puts(m, "max\n");
+	else
+		seq_printf(m, "%llu\n", (u64)min * PAGE_SIZE);
+
+	return 0;
+}
+
+static ssize_t memory_min_write(struct kernfs_open_file *of,
+				char *buf, size_t nbytes, loff_t off)
+{
+	struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
+	unsigned long min;
+	int err;
+
+	buf = strstrip(buf);
+	err = page_counter_memparse(buf, "max", &min);
+	if (err)
+		return err;
+
+	memcg->min = min;
+
+	return nbytes;
+}
+
 static int memory_low_show(struct seq_file *m, void *v)
 {
 	struct mem_cgroup *memcg = mem_cgroup_from_css(seq_css(m));
@@ -5288,6 +5318,12 @@ static struct cftype memory_files[] = {
 		.flags = CFTYPE_NOT_ON_ROOT,
 		.read_u64 = memory_current_read,
 	},
+	{
+		.name = "min",
+		.flags = CFTYPE_NOT_ON_ROOT,
+		.seq_show = memory_min_show,
+		.write = memory_min_write,
+	},
 	{
 		.name = "low",
 		.flags = CFTYPE_NOT_ON_ROOT,
@@ -5336,6 +5372,28 @@ struct cgroup_subsys memory_cgrp_subsys = {
 	.early_init = 0,
 };
 
+/**
+ * mem_cgroup_min returns true for a memcg below its min limit. Such memcgs
+ * are exempt from reclaim.
+ */
+bool mem_cgroup_min(struct mem_cgroup *root, struct mem_cgroup *memcg)
+{
+	if (mem_cgroup_disabled())
+		return false;
+
+	if (!root)
+		root = root_mem_cgroup;
+
+	if (memcg == root)
+		return false;
+
+	for (; memcg != root; memcg = parent_mem_cgroup(memcg)) {
+		if (page_counter_read(&memcg->memory) <= memcg->min)
+			return true;	/* protect */
+	}
+	return false;	/* !protect */
+}
+
 /**
  * mem_cgroup_low - check if memory consumption is below the normal range
  * @root: the top ancestor of the sub-tree being checked
diff --git a/mm/vmscan.c b/mm/vmscan.c
index cd5dc3faaa57..15ae19a38ad5 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2539,6 +2539,9 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc)
 			unsigned long reclaimed;
 			unsigned long scanned;
 
+			if (mem_cgroup_min(root, memcg))
+				continue;
+
 			if (mem_cgroup_low(root, memcg)) {
 				if (!sc->memcg_low_reclaim) {
 					sc->memcg_low_skipped = 1;
-- 
2.17.0.484.g0c8726318c-goog
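The ordering in the vmscan hunk above — memory.min checked first and never bypassed, memory.low bypassed once sc->memcg_low_reclaim is set — can be sketched as a small decision function. This is a hypothetical Python model with invented names, not kernel code:

```python
# Toy model of the shrink_node() skip logic in the patch above.
# Function and parameter names are invented for illustration.

def skip_reclaim(under_min, under_low, memcg_low_reclaim):
    if under_min:
        return True   # memory.min: prefer oom kill, never reclaim
    if under_low and not memcg_low_reclaim:
        return True   # memory.low: skipped only on the first pass
    return False      # otherwise eligible for reclaim

# min protection holds even when low protection is being bypassed:
assert skip_reclaim(True, True, True)
# low protection is dropped once sc->memcg_low_reclaim is set:
assert skip_reclaim(False, True, False)
assert not skip_reclaim(False, True, True)
```

The key asymmetry: low is a soft preference that desperate reclaim overrides, while min is absolute and reclaim moves on to the next memcg.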
[RFC PATCH 0/2] memory.low,min reclaim
4k => 4203908k
# F/g : 2102016k => 2101992k
# F/H : 2101948k => 2101916k
# i: 4204408k => 4203988k
# i/j : 2101984k => 2101936k
# i/K : 2102424k => 2102052k
# L: 4189960k => 3886296k
# L/m : 2101968k => 2838704k
# L/N : 2087992k => 1047592k

set -o errexit
set -o nounset
set -o pipefail

LIM="$1"; shift
ANTAGONIST="$1"; shift
CGPATH=/tmp/cgroup

vmtouch2() {
  rm -f "$2"
  (echo $BASHPID > "${CGPATH}/$1/cgroup.procs" && exec /tmp/mmap --loop 1 --file "$2" "$3")
}

vmtouch() {
  # twice to ensure slab caches are warmed up and all objs are charged to cgroup.
  vmtouch2 "$1" "$2" "$3"
  vmtouch2 "$1" "$2" "$3"
}

dump() {
  for i in A A/b A/C A/D A/e F F/g F/H i i/j i/K L L/m L/N; do
    printf "%-5s: %sk\n" $i $(($(cat "${CGPATH}/${i}/memory.current")/1024))
  done
}

rm -f /file_?
if [[ -e "${CGPATH}/A" ]]; then
  rmdir ${CGPATH}/?/? ${CGPATH}/?
fi
echo "+memory" > "${CGPATH}/cgroup.subtree_control"
mkdir "${CGPATH}/A" "${CGPATH}/F" "${CGPATH}/i" "${CGPATH}/L"
echo "+memory" > "${CGPATH}/A/cgroup.subtree_control"
echo "+memory" > "${CGPATH}/F/cgroup.subtree_control"
echo "+memory" > "${CGPATH}/i/cgroup.subtree_control"
echo "+memory" > "${CGPATH}/L/cgroup.subtree_control"
mkdir "${CGPATH}/A/b" "${CGPATH}/A/C" "${CGPATH}/A/D" "${CGPATH}/A/e"
mkdir "${CGPATH}/F/g" "${CGPATH}/F/H"
mkdir "${CGPATH}/i/j" "${CGPATH}/i/K"
mkdir "${CGPATH}/L/m" "${CGPATH}/L/N"
echo 2G > "${CGPATH}/A/memory.${LIM}"
echo 3G > "${CGPATH}/A/b/memory.${LIM}"
echo 1G > "${CGPATH}/A/C/memory.${LIM}"
echo 0 > "${CGPATH}/A/D/memory.${LIM}"
echo 10G > "${CGPATH}/A/e/memory.${LIM}"
echo 2G > "${CGPATH}/F/memory.${LIM}"
echo 3G > "${CGPATH}/F/g/memory.${LIM}"
echo 1G > "${CGPATH}/F/H/memory.${LIM}"
echo 5G > "${CGPATH}/i/memory.${LIM}"
echo 3G > "${CGPATH}/i/j/memory.${LIM}"
echo 1G > "${CGPATH}/i/K/memory.${LIM}"
echo 2G > "${CGPATH}/L/memory.${LIM}"
echo 4G > "${CGPATH}/L/memory.max"
echo 3G > "${CGPATH}/L/m/memory.${LIM}"
echo 1G > "${CGPATH}/L/N/memory.${LIM}"
vmtouch A/b /file_b 2G
vmtouch A/C /file_C 2G
vmtouch A/D /file_D 2G
vmtouch F/g /file_g 2G
vmtouch F/H /file_H 2G
vmtouch i/j /file_j 2G
vmtouch i/K /file_K 2G
vmtouch L/m /file_m 2G
vmtouch L/N /file_N 2G
vmtouch2 "$ANTAGONIST" /file_ant 150G
echo
echo "after $ANTAGONIST antagonist"
dump
rmdir "${CGPATH}/A/b" "${CGPATH}/A/C" "${CGPATH}/A/D" "${CGPATH}/A/e"
rmdir "${CGPATH}/F/g" "${CGPATH}/F/H"
rmdir "${CGPATH}/i/j" "${CGPATH}/i/K"
rmdir "${CGPATH}/L/m" "${CGPATH}/L/N"
rmdir "${CGPATH}/A" "${CGPATH}/F" "${CGPATH}/i" "${CGPATH}/L"
rm /file_ant /file_b /file_C /file_D /file_g /file_H /file_j /file_K

Greg Thelen (2):
  memcg: fix memory.low
  memcg: add memory.min

 include/linux/memcontrol.h |  8 +
 mm/memcontrol.c            | 70 ++
 mm/vmscan.c                |  3 ++
 3 files changed, 75 insertions(+), 6 deletions(-)
-- 
2.17.0.484.g0c8726318c-goog
[RFC PATCH 1/2] memcg: fix memory.low
When targeting reclaim to a memcg, protect that memcg from reclaim if
memory consumption of any level is below its respective memory.low.

Signed-off-by: Greg Thelen <gthe...@google.com>
---
 mm/memcontrol.c | 12 ++--
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 670e99b68aa6..9668f620203a 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -5341,8 +5341,8 @@ struct cgroup_subsys memory_cgrp_subsys = {
  * @root: the top ancestor of the sub-tree being checked
  * @memcg: the memory cgroup to check
  *
- * Returns %true if memory consumption of @memcg, and that of all
- * ancestors up to (but not including) @root, is below the normal range.
+ * Returns %true if memory consumption of @memcg, or any of its ancestors
+ * up to (but not including) @root, is below the normal range.
  *
  * @root is exclusive; it is never low when looked at directly and isn't
  * checked when traversing the hierarchy.
@@ -5379,12 +5379,12 @@ bool mem_cgroup_low(struct mem_cgroup *root, struct mem_cgroup *memcg)
 	if (memcg == root)
 		return false;
 
+	/* If any level is under, then protect @memcg from reclaim */
 	for (; memcg != root; memcg = parent_mem_cgroup(memcg)) {
-		if (page_counter_read(&memcg->memory) >= memcg->low)
-			return false;
+		if (page_counter_read(&memcg->memory) <= memcg->low)
+			return true;	/* protect from reclaim */
 	}
-
-	return true;
+	return false;	/* not protected from reclaim */
 }
 
 /**
-- 
2.17.0.484.g0c8726318c-goog
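To make the semantic flip in this patch concrete, here is a toy Python contrast of the loop before and after the change. Function names are invented for illustration; `levels` stands for the (usage, low) pairs from @memcg up to, but excluding, @root:

```python
# Toy contrast of mem_cgroup_low() before and after the fix.
# Not kernel code; names and representation are made up for illustration.

def low_before_fix(levels):
    """Old loop: bail with False as soon as any level's usage >= low,
    so protection requires *every* level to be strictly under its low."""
    return all(usage < low for usage, low in levels)

def low_after_fix(levels):
    """Fixed loop: return True as soon as any level's usage <= low,
    so *one* protected level is enough."""
    return any(usage <= low for usage, low in levels)

# Child under its low, parent over its low: the old code denied
# protection, the fixed code grants it.
levels = [(1000, 3000), (3000, 2000)]  # (usage, low), leaf first
assert not low_before_fix(levels)
assert low_after_fix(levels)
```

Note the comparison also flips from strict `>=` to inclusive `<=`, so a memcg sitting exactly at its low is now treated as protected.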
Re: [PATCH] IB: make INFINIBAND_ADDR_TRANS configurable
On Sun, Apr 15, 2018 at 5:06 AM Christoph Hellwig <h...@infradead.org> wrote:
> On Fri, Apr 13, 2018 at 12:06:44AM -0700, Greg Thelen wrote:
> > Allow INFINIBAND without INFINIBAND_ADDR_TRANS.
>
> Why? We are pushing everyone heavily to use RDMA/CM, so making it
> optional seems rather counter-intuitive.

Bugs. We don't currently use CM. But fuzz testing has been finding a
fair number of CM bugs, so we were thinking of disabling it until we
need it.

> You'll also have to fix tons of ULPs to explicitly depend on
> INFINIBAND_ADDR_TRANS, or make code conditional on it.

I think I've identified the set of options which use
INFINIBAND_ADDR_TRANS without a kconfig dependency on it:
* CIFS_SMB_DIRECT
* INFINIBAND_SRPT
* NVME_RDMA
* NVME_TARGET_RDMA

I have patches for the above, but need to finish the commit logs. Let me
know if they'll be nacked and I'll just patch my kernel and forget about
upstreaming.
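For illustration, each of the four options named above would gain an explicit dependency. A hypothetical Kconfig sketch for one of them follows — the surrounding option text and select lines come from each driver's real Kconfig and may differ; only the added INFINIBAND_ADDR_TRANS term is the point:

```kconfig
config NVME_RDMA
	tristate "NVM Express over Fabrics RDMA host driver"
	depends on INFINIBAND && INFINIBAND_ADDR_TRANS && BLOCK
	select NVME_FABRICS
	select SG_POOL
```

With the dependency in place, disabling RDMA/CM simply hides these ULPs instead of producing link failures.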
[PATCH v3] IB: make INFINIBAND_ADDR_TRANS configurable
Allow INFINIBAND without INFINIBAND_ADDR_TRANS.

Signed-off-by: Greg Thelen <gthe...@google.com>
Cc: Tarick Bedeir <tar...@google.com>
---
 drivers/infiniband/Kconfig | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/drivers/infiniband/Kconfig b/drivers/infiniband/Kconfig
index ee270e065ba9..2a972ed6851b 100644
--- a/drivers/infiniband/Kconfig
+++ b/drivers/infiniband/Kconfig
@@ -61,9 +61,12 @@ config INFINIBAND_ON_DEMAND_PAGING
 	  pages on demand instead.
 
 config INFINIBAND_ADDR_TRANS
-	bool
+	bool "RDMA/CM"
 	depends on INFINIBAND
 	default y
+	---help---
+	  Support for RDMA communication manager (CM).
+	  This allows for a generic connection abstraction over RDMA.
 
 config INFINIBAND_ADDR_TRANS_CONFIGFS
 	bool
-- 
2.17.0.484.g0c8726318c-goog