Re: [RFC 0/1] add support for reclaiming priorities per mem cgroup

2017-03-30 Thread Shakeel Butt
> A more useful metric for memory pressure at this point is quantifying
> that time you spend thrashing: time the job spends in direct reclaim
> and on the flipside time the job waits for recently evicted pages to
> come back. Combined, that gives you a good measure of overhead from
> memory pressure; putting that in relation to a useful baseline of
> meaningful work done gives you a portable scale of how effectively
> your job is running.
>
> I'm working on that right now, hopefully I'll have something useful
> soon.

Johannes, is the work you are doing only about file pages, or will it
apply equally to anon pages as well?


Re: [PATCH 7/9] net: use kvmalloc with __GFP_REPEAT rather than open coded variant

2017-03-30 Thread Shakeel Butt
On Mon, Mar 6, 2017 at 2:33 AM, Michal Hocko  wrote:
> From: Michal Hocko 
>
> fq_alloc_node, alloc_netdev_mqs and netif_alloc* open code kmalloc
> with vmalloc fallback. Use the kvmalloc variant instead. Keep the
> __GFP_REPEAT flag based on explanation from Eric:
> "
> At the time, tests on the hardware I had in my labs showed that
> vmalloc() could deliver pages spread all over the memory and that was a
> small penalty (once memory is fragmented enough, not at boot time)
> "
>
> The way how the code is constructed means, however, that we prefer to go
> and hit the OOM killer before we fall back to the vmalloc for requests
> <=32kB (with 4kB pages) in the current code. This is rather disruptive for
> something that can be achieved with the fallback. On the other hand
> __GFP_REPEAT doesn't have any useful semantic for these requests. So the
> effect of this patch is that requests smaller than 64kB will fallback to

I am a bit confused about this 64kB; shouldn't it be <=32kB (with 4kB
pages & PAGE_ALLOC_COSTLY_ORDER = 3)?
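For reference, the arithmetic behind the question, written out as a tiny
user-space sketch (the 4kB page size and PAGE_ALLOC_COSTLY_ORDER = 3 are
assumptions stated above, not values taken from this patch):

#include <stdio.h>

/* Illustration only: with 4kB pages and a costly-order threshold of 3,
 * the largest "non-costly" request is 2^3 pages = 32kB, which is where
 * the <=32kB in the question comes from. */
int main(void)
{
        unsigned long page_size = 4096;         /* assumed 4kB pages */
        unsigned int costly_order = 3;          /* PAGE_ALLOC_COSTLY_ORDER */
        unsigned long max_noncostly = page_size << costly_order;

        printf("largest non-costly request: %lu kB\n", max_noncostly / 1024);
        return 0;
}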

> vmalloc easier now.
>
> Cc: Eric Dumazet 
> Cc: net...@vger.kernel.org
> Acked-by: Vlastimil Babka 
> Signed-off-by: Michal Hocko 
> ---


Re: [PATCH] mm/zswap: fix potential deadlock in zswap_frontswap_store()

2017-03-31 Thread Shakeel Butt
On Fri, Mar 31, 2017 at 8:30 AM, Andrey Ryabinin
 wrote:
> zswap_frontswap_store() is called during memory reclaim from
> __frontswap_store() from swap_writepage() from shrink_page_list().
> This may happen in NOFS context, thus zswap shouldn't use __GFP_FS,
> otherwise we may re-enter fs code and deadlock.
> zswap_frontswap_store() also shouldn't use __GFP_IO to avoid recursion
> into itself.
>

Is it possible to enter fs code (or IO) from zswap_frontswap_store()
other than through recursive memory reclaim? Recursive memory reclaim
is already guarded by the PF_MEMALLOC task flag. The change seems fine
but IMHO the reasoning needs an update. Adding Michal for an expert
opinion.
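For reference, a minimal sketch of the flag sets being compared here
(the GFP_KERNEL composition is quoted from memory of include/linux/gfp.h
of this era, so treat it as an assumption rather than part of the patch):

/* GFP_KERNEL permits direct reclaim which may recurse into FS and IO:
 *   GFP_KERNEL = __GFP_RECLAIM | __GFP_IO | __GFP_FS
 * whereas the mask this patch uses throughout the store path only wakes
 * kswapd and never enters direct reclaim, FS or IO from this context: */
gfp_t gfp = __GFP_NORETRY | __GFP_NOWARN | __GFP_KSWAPD_RECLAIM;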

> zswap_frontswap_store() call zpool_malloc() with __GFP_NORETRY |
> __GFP_NOWARN | __GFP_KSWAPD_RECLAIM, so let's use the same flags for
> zswap_entry_cache_alloc() as well, instead of GFP_KERNEL.
>
> Signed-off-by: Andrey Ryabinin 
> ---
>  mm/zswap.c | 7 +++
>  1 file changed, 3 insertions(+), 4 deletions(-)
>
> diff --git a/mm/zswap.c b/mm/zswap.c
> index eedc278..12ad7e9 100644
> --- a/mm/zswap.c
> +++ b/mm/zswap.c
> @@ -966,6 +966,7 @@ static int zswap_frontswap_store(unsigned type, pgoff_t 
> offset,
> struct zswap_tree *tree = zswap_trees[type];
> struct zswap_entry *entry, *dupentry;
> struct crypto_comp *tfm;
> +   gfp_t gfp = __GFP_NORETRY | __GFP_NOWARN | __GFP_KSWAPD_RECLAIM;
> int ret;
> unsigned int dlen = PAGE_SIZE, len;
> unsigned long handle;
> @@ -989,7 +990,7 @@ static int zswap_frontswap_store(unsigned type, pgoff_t 
> offset,
> }
>
> /* allocate entry */
> -   entry = zswap_entry_cache_alloc(GFP_KERNEL);
> +   entry = zswap_entry_cache_alloc(gfp);
> if (!entry) {
> zswap_reject_kmemcache_fail++;
> ret = -ENOMEM;
> @@ -1017,9 +1018,7 @@ static int zswap_frontswap_store(unsigned type, pgoff_t 
> offset,
>
> /* store */
> len = dlen + sizeof(struct zswap_header);
> -   ret = zpool_malloc(entry->pool->zpool, len,
> -  __GFP_NORETRY | __GFP_NOWARN | __GFP_KSWAPD_RECLAIM,
> -  &handle);
> +   ret = zpool_malloc(entry->pool->zpool, len, gfp, &handle);
> if (ret == -ENOSPC) {
> zswap_reject_compress_poor++;
> goto put_dstmem;
> --
> 2.10.2
>


[PATCH v3] mm: fix condition for throttle_direct_reclaim

2017-03-14 Thread Shakeel Butt
Since "mm: fix 100% CPU kswapd busyloop on unreclaimable nodes" kswapd
has been modified to give up after MAX_RECLAIM_RETRIES number of
unsucessful iterations. Before going to sleep, kswapd thread will
unconditionally wakeup all threads sleeping on pfmemalloc_wait.
However the awoken threads will recheck the watermarks and wake the
kswapd thread and sleep again on pfmemalloc_wait. There is a chance
that the system might end up in livelock between unsuccessful kswapd
and direct reclaimers because all direct reclaimer might end up in
throttle_direct_reclaim and there is nobody to make a forward
progress. So, add kswapd_failures check on the throttle_direct_reclaim
condition.

Signed-off-by: Shakeel Butt <shake...@google.com>
Suggested-by: Michal Hocko <mho...@suse.com>
Suggested-by: Johannes Weiner <han...@cmpxchg.org>
Acked-by: Hillf Danton <hillf...@alibaba-inc.com>
Acked-by: Michal Hocko <mho...@suse.com>
---
v3:
Commit message updated.

v2:
Instead of separate helper function for checking kswapd_failures,
added the check into pfmemalloc_watermark_ok() and renamed that
function.

 mm/vmscan.c | 15 +--
 1 file changed, 9 insertions(+), 6 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index bae698484e8e..afa5b20ab6d8 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2783,7 +2783,7 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
return 0;
 }
 
-static bool pfmemalloc_watermark_ok(pg_data_t *pgdat)
+static bool allow_direct_reclaim(pg_data_t *pgdat)
 {
struct zone *zone;
unsigned long pfmemalloc_reserve = 0;
@@ -2791,6 +2791,9 @@ static bool pfmemalloc_watermark_ok(pg_data_t *pgdat)
int i;
bool wmark_ok;
 
+   if (pgdat->kswapd_failures >= MAX_RECLAIM_RETRIES)
+   return true;
+
for (i = 0; i <= ZONE_NORMAL; i++) {
zone = &pgdat->node_zones[i];
if (!managed_zone(zone))
@@ -2873,7 +2876,7 @@ static bool throttle_direct_reclaim(gfp_t gfp_mask, struct zonelist *zonelist,
 
/* Throttle based on the first usable node */
pgdat = zone->zone_pgdat;
-   if (pfmemalloc_watermark_ok(pgdat))
+   if (allow_direct_reclaim(pgdat))
goto out;
break;
}
@@ -2895,14 +2898,14 @@ static bool throttle_direct_reclaim(gfp_t gfp_mask, struct zonelist *zonelist,
 */
if (!(gfp_mask & __GFP_FS)) {
wait_event_interruptible_timeout(pgdat->pfmemalloc_wait,
-   pfmemalloc_watermark_ok(pgdat), HZ);
+   allow_direct_reclaim(pgdat), HZ);
 
goto check_pending;
}
 
/* Throttle until kswapd wakes the process */
wait_event_killable(zone->zone_pgdat->pfmemalloc_wait,
-   pfmemalloc_watermark_ok(pgdat));
+   allow_direct_reclaim(pgdat));
 
 check_pending:
if (fatal_signal_pending(current))
@@ -3102,7 +3105,7 @@ static bool prepare_kswapd_sleep(pg_data_t *pgdat, int order, int classzone_idx)
 {
/*
 * The throttled processes are normally woken up in balance_pgdat() as
-* soon as pfmemalloc_watermark_ok() is true. But there is a potential
+* soon as allow_direct_reclaim() is true. But there is a potential
 * race between when kswapd checks the watermarks and a process gets
 * throttled. There is also a potential race if processes get
 * throttled, kswapd wakes, a large process exits thereby balancing the
@@ -3271,7 +3274,7 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx)
 * able to safely make forward progress. Wake them
 */
if (waitqueue_active(&pgdat->pfmemalloc_wait) &&
-   pfmemalloc_watermark_ok(pgdat))
+   allow_direct_reclaim(pgdat))
wake_up_all(&pgdat->pfmemalloc_wait);
 
/* Check if kswapd should be suspending */
-- 
2.12.0.367.g23dc2f6d3c-goog



Re: [PATCH v2 RFC] mm/vmscan: more restrictive condition for retry in do_try_to_free_pages

2017-03-16 Thread Shakeel Butt
On Thu, Mar 16, 2017 at 12:57 PM, Johannes Weiner <han...@cmpxchg.org> wrote:
> On Sat, Mar 11, 2017 at 09:52:15AM -0800, Shakeel Butt wrote:
>> On Sat, Mar 11, 2017 at 5:51 AM, Yisheng Xie <ys...@foxmail.com> wrote:
>> > @@ -2808,7 +2826,7 @@ static unsigned long do_try_to_free_pages(struct 
>> > zonelist *zonelist,
>> > return 1;
>> >
>> > /* Untapped cgroup reserves?  Don't OOM, retry. */
>> > -   if (!sc->may_thrash) {
>> > +   if (!may_thrash(sc)) {
>>
>> Thanks Yisheng. The name of the function may_thrash() is confusing in
>> the sense that it is returning exactly the opposite of what its name
>> implies. How about reversing the condition of may_thrash() function
>> and change the scan_control's field may_thrash to thrashed?
>
> How so?
>
> The user sets memory.low to a minimum below which the application will
> thrash. Hence, being allowed to break that minimum and causing the app
> to thrash, means you "may thrash".
>
Basically, I interpreted may_thrash() as "may I thrash", i.e. may I
reclaim memory from memcgs which are already below memory.low. So, if
it returns true, we go for a second pass with the authorization to
reclaim memory even from memcgs with usage below memory.low.

> OTOH, I'm not sure what "thrashed" would mean.
By 'thrashed' I wanted to say: I have already tried to reclaim memory
from memcgs below their memory.low, so there is no need to try again.
But Yisheng correctly pointed out that it would cause confusion in
shrink_node().

Sorry for confusion.


Re: [PATCH v2 RFC] mm/vmscan: more restrictive condition for retry in do_try_to_free_pages

2017-03-11 Thread Shakeel Butt
On Sat, Mar 11, 2017 at 5:51 AM, Yisheng Xie <ys...@foxmail.com> wrote:
> From: Yisheng Xie <xieyishe...@huawei.com>
>
> When we enter do_try_to_free_pages, the may_thrash is always clear, and
> it will retry shrink zones to tap cgroup's reserves memory by setting
> may_thrash when the former shrink_zones reclaim nothing.
>
> However, when memcg is disabled or on legacy hierarchy, it should not do
> this useless retry at all, for we do not have any cgroup's reserves
> memory to tap, and we have already done hard work but made no progress.
>
> To avoid this time costly and useless retrying, add a stub function
> may_thrash and return true when memcg is disabled or on legacy
> hierarchy.
>
> Signed-off-by: Yisheng Xie <xieyishe...@huawei.com>
> Suggested-by: Shakeel Butt <shake...@google.com>
> ---
> v2:
>  - more restrictive condition for retry of shrink_zones (restricting
>cgroup_disabled=memory boot option and cgroup legacy hierarchy) - Shakeel
>
>  - add a stub function may_thrash() to avoid compile error or warning.
>
>  - rename subject from "donot retry shrink zones when memcg is disable"
>to "more restrictive condition for retry in do_try_to_free_pages"
>
> Any comment is more than welcome!
>
> Thanks
> Yisheng Xie
>
>  mm/vmscan.c | 20 +++-
>  1 file changed, 19 insertions(+), 1 deletion(-)
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index bc8031e..415f800 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -184,6 +184,19 @@ static bool sane_reclaim(struct scan_control *sc)
>  #endif
> return false;
>  }
> +
> +static bool may_thrash(struct scan_control *sc)
> +{
> +   /*
> +* When memcg is disabled or on legacy hierarchy, there is no cgroup
> +* reserves memory to tap.
> +*/
> +   if (!cgroup_subsys_enabled(memory_cgrp_subsys) ||
> +   !cgroup_subsys_on_dfl(memory_cgrp_subsys))
> +   return true;
> +
> +   return sc->may_thrash;
> +}
>  #else
>  static bool global_reclaim(struct scan_control *sc)
>  {
> @@ -194,6 +207,11 @@ static bool sane_reclaim(struct scan_control *sc)
>  {
> return true;
>  }
> +
> +static bool may_thrash(struct scan_control *sc)
> +{
> +   return true;
> +}
>  #endif
>
>  /*
> @@ -2808,7 +2826,7 @@ static unsigned long do_try_to_free_pages(struct 
> zonelist *zonelist,
> return 1;
>
> /* Untapped cgroup reserves?  Don't OOM, retry. */
> -   if (!sc->may_thrash) {
> +   if (!may_thrash(sc)) {

Thanks Yisheng. The name of the function may_thrash() is confusing in
the sense that it returns exactly the opposite of what its name
implies. How about reversing the condition of may_thrash() and
renaming scan_control's may_thrash field to thrashed?

> sc->priority = initial_priority;
> sc->may_thrash = 1;
> goto retry;
> --
> 1.9.1
>


Re: [PATCH v3 RFC] mm/vmscan: more restrictive condition for retry of shrink_zones

2017-03-13 Thread Shakeel Butt
On Mon, Mar 13, 2017 at 1:33 AM, Michal Hocko  wrote:
> Please do not post new version after a single feedback and try to wait
> for more review to accumulate. This is in the 3rd version and it is not
> clear why it is still an RFC.
>
> On Sun 12-03-17 19:06:10, Yisheng Xie wrote:
>> From: Yisheng Xie 
>>
>> When we enter do_try_to_free_pages, the may_thrash is always clear, and
>> it will retry shrink zones to tap cgroup's reserves memory by setting
>> may_thrash when the former shrink_zones reclaim nothing.
>>
>> However, when memcg is disabled or on legacy hierarchy, it should not do
>> this useless retry at all, for we do not have any cgroup's reserves
>> memory to tap, and we have already done hard work but made no progress.
>>
>> To avoid this time costly and useless retrying, add a stub function
>> mem_cgroup_thrashed() and return true when memcg is disabled or on
>> legacy hierarchy.
>
> Have you actually seen this as a bad behavior? On which workload? Or
> have spotted this by the code review?
>
> Please note that more than _what_ it is more interesting _why_ the patch
> has been prepared.
>
> I agree the current additional round of reclaim is just lame because we
> are trying hard to control the retry logic from the page allocator which
> is a sufficient justification to fix this IMO. But I really hate the
> name. At this point we do not have any idea that the memcg is trashing
> as the name of the function suggests.
>
> All of them simply might not have any reclaimable pages. So I would
> suggest either a better name e.g. memcg_allow_lowmem_reclaim() or,
> preferably, fix this properly. E.g. something like the following.
> ---
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index bae698484e8e..989ba9761921 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -99,6 +99,9 @@ struct scan_control {
> /* Can cgroups be reclaimed below their normal consumption range? */
> unsigned int may_thrash:1;
>
> +   /* Did we have any memcg protected by the low limit */
> +   unsigned int memcg_low_protection:1;
> +
> unsigned int hibernation_mode:1;
>
> /* One of the zones is ready for compaction */
> @@ -2513,6 +2516,7 @@ static bool shrink_node(pg_data_t *pgdat, struct 
> scan_control *sc)
> if (mem_cgroup_low(root, memcg)) {
> if (!sc->may_thrash)
> continue;
> +   sc->memcg_low_protection = true;

I think you wanted to put this statement before the continue;
otherwise it will just disable the sc->may_thrash (second reclaim
pass) altogether.
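A minimal sketch of the ordering I mean, based on the draft above
(untested, just to illustrate the point):

        if (mem_cgroup_low(root, memcg)) {
                /* record the low-protected memcg even on the first
                 * pass, before the continue skips it */
                sc->memcg_low_protection = true;
                if (!sc->may_thrash)
                        continue;
                mem_cgroup_events(memcg, MEMCG_LOW, 1);
        }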

> mem_cgroup_events(memcg, MEMCG_LOW, 1);
> }
>
> @@ -2774,7 +2778,7 @@ static unsigned long do_try_to_free_pages(struct 
> zonelist *zonelist,
> return 1;
>
> /* Untapped cgroup reserves?  Don't OOM, retry. */
> -   if (!sc->may_thrash) {
> +   if ( sc->memcg_low_protection && !sc->may_thrash) {
> sc->priority = initial_priority;
> sc->may_thrash = 1;
> goto retry;
> --
> Michal Hocko
> SUSE Labs


Re: [PATCH] mm: fix condition for throttle_direct_reclaim

2017-03-13 Thread Shakeel Butt
On Mon, Mar 13, 2017 at 2:02 AM, Michal Hocko <mho...@kernel.org> wrote:
> On Fri 10-03-17 11:46:20, Shakeel Butt wrote:
>> Recently kswapd has been modified to give up after MAX_RECLAIM_RETRIES
>> number of unsucessful iterations. Before going to sleep, kswapd thread
>> will unconditionally wakeup all threads sleeping on pfmemalloc_wait.
>> However the awoken threads will recheck the watermarks and wake the
>> kswapd thread and sleep again on pfmemalloc_wait. There is a chance
>> of continuous back and forth between kswapd and direct reclaiming
>> threads if the kswapd keep failing and thus defeat the purpose of
>> adding backoff mechanism to kswapd. So, add kswapd_failures check
>> on the throttle_direct_reclaim condition.
>
> I have to say I really do not like this. kswapd_failures shouldn't
> really be checked outside of the kswapd context. The
> pfmemalloc_watermark_ok/throttle_direct_reclaim is quite complex even
> without putting another variable into it. I wish we rather replace this
> throttling by something else. Johannes had an idea to throttle by the
> number of reclaimers.
>
Do you suspect a race in accessing kswapd_failures in non-kswapd
context? Please do let me know more about replacing this throttling.

> Anyway, I am wondering whether we can hit this issue in
> practice? Have you seen it happening or is this a result of the code
> review? I would assume that that !zone_reclaimable_pages check in
> pfmemalloc_watermark_ok should help to some degree.
>
Yes, I have seen this issue going on for more than one hour in my
test. It was a simple test where a number of processes, in the
presence of swap, try to allocate more memory than RAM. The number of
processes is equal to the number of cores and each process is pinned
to an individual core. I suspect the !zone_reclaimable_pages() check
did not help.
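For completeness, a rough user-space sketch of that test (a hypothetical
reconstruction, not the exact program I ran): one anonymous-memory hog
per core, each pinned to its own core, together exceeding RAM.

#define _GNU_SOURCE
#include <sched.h>
#include <stdlib.h>
#include <sys/sysinfo.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
        int ncpus = get_nprocs();
        size_t per_proc = (size_t)2 << 30;      /* assumed: 2GB per process */

        for (int cpu = 0; cpu < ncpus; cpu++) {
                if (fork() == 0) {
                        cpu_set_t set;

                        CPU_ZERO(&set);
                        CPU_SET(cpu, &set);
                        sched_setaffinity(0, sizeof(set), &set);

                        char *buf = malloc(per_proc);
                        if (!buf)
                                _exit(1);
                        /* touch the anon pages forever to keep reclaim busy */
                        for (size_t off = 0; ; off = (off + 4096) % per_proc)
                                buf[off] = 1;
                }
        }
        while (wait(NULL) > 0)          /* children never exit on their own */
                ;
        return 0;
}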

>> Signed-off-by: Shakeel Butt <shake...@google.com>
>> ---
>>  mm/vmscan.c | 12 +---
>>  1 file changed, 9 insertions(+), 3 deletions(-)
>>
>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>> index bae698484e8e..b2d24cc7a161 100644
>> --- a/mm/vmscan.c
>> +++ b/mm/vmscan.c
>> @@ -2819,6 +2819,12 @@ static bool pfmemalloc_watermark_ok(pg_data_t *pgdat)
>>   return wmark_ok;
>>  }
>>
>> +static bool should_throttle_direct_reclaim(pg_data_t *pgdat)
>> +{
>> + return (pgdat->kswapd_failures < MAX_RECLAIM_RETRIES &&
>> + !pfmemalloc_watermark_ok(pgdat));
>> +}
>> +
>>  /*
>>   * Throttle direct reclaimers if backing storage is backed by the network
>>   * and the PFMEMALLOC reserve for the preferred node is getting dangerously
>> @@ -2873,7 +2879,7 @@ static bool throttle_direct_reclaim(gfp_t gfp_mask, 
>> struct zonelist *zonelist,
>>
>>   /* Throttle based on the first usable node */
>>   pgdat = zone->zone_pgdat;
>> - if (pfmemalloc_watermark_ok(pgdat))
>> + if (!should_throttle_direct_reclaim(pgdat))
>>   goto out;
>>   break;
>>   }
>> @@ -2895,14 +2901,14 @@ static bool throttle_direct_reclaim(gfp_t gfp_mask, 
>> struct zonelist *zonelist,
>>*/
>>   if (!(gfp_mask & __GFP_FS)) {
>>   wait_event_interruptible_timeout(pgdat->pfmemalloc_wait,
>> - pfmemalloc_watermark_ok(pgdat), HZ);
>> + !should_throttle_direct_reclaim(pgdat), HZ);
>>
>>   goto check_pending;
>>   }
>>
>>   /* Throttle until kswapd wakes the process */
>>   wait_event_killable(zone->zone_pgdat->pfmemalloc_wait,
>> - pfmemalloc_watermark_ok(pgdat));
>> + !should_throttle_direct_reclaim(pgdat));
>>
>>  check_pending:
>>   if (fatal_signal_pending(current))
>> --
>> 2.12.0.246.ga2ecc84866-goog
>>
>
> --
> Michal Hocko
> SUSE Labs


Re: [PATCH] mm: fix condition for throttle_direct_reclaim

2017-03-13 Thread Shakeel Butt
On Mon, Mar 13, 2017 at 8:46 AM, Michal Hocko <mho...@kernel.org> wrote:
> On Mon 13-03-17 08:07:15, Shakeel Butt wrote:
>> On Mon, Mar 13, 2017 at 2:02 AM, Michal Hocko <mho...@kernel.org> wrote:
>> > On Fri 10-03-17 11:46:20, Shakeel Butt wrote:
>> >> Recently kswapd has been modified to give up after MAX_RECLAIM_RETRIES
>> >> number of unsucessful iterations. Before going to sleep, kswapd thread
>> >> will unconditionally wakeup all threads sleeping on pfmemalloc_wait.
>> >> However the awoken threads will recheck the watermarks and wake the
>> >> kswapd thread and sleep again on pfmemalloc_wait. There is a chance
>> >> of continuous back and forth between kswapd and direct reclaiming
>> >> threads if the kswapd keep failing and thus defeat the purpose of
>> >> adding backoff mechanism to kswapd. So, add kswapd_failures check
>> >> on the throttle_direct_reclaim condition.
>> >
>> > I have to say I really do not like this. kswapd_failures shouldn't
>> > really be checked outside of the kswapd context. The
>> > pfmemalloc_watermark_ok/throttle_direct_reclaim is quite complex even
>> > without putting another variable into it. I wish we rather replace this
>> > throttling by something else. Johannes had an idea to throttle by the
>> > number of reclaimers.
>> >
>>
>> Do you suspect race in accessing kswapd_failures in non-kswapd
>> context?
>
> No, this is not about race conditions. It is more about the logic of the
> code. kswapd_failures is the private thing to the kswapd daemon. Direct
> reclaimers shouldn't have any business in it - well except resetting it.
>
>> Please do let me know more about replacing this throttling.
>
> The idea behind a different throttling would be to not allow too many
> direct reclaimers on the same set of nodes/zones. Johannes would tell
> you more.
>
>> > Anyway, I am wondering whether we can hit this issue in
>> > practice? Have you seen it happening or is this a result of the code
>> > review? I would assume that that !zone_reclaimable_pages check in
>> > pfmemalloc_watermark_ok should help to some degree.
>> >
>> Yes, I have seen this issue going on for more than one hour on my
>> test. It was a simple test where the number of processes, in the
>> presence of swap, try to allocate memory more than RAM.
>
> this is an anonymous memory, right?
>
Yes.

>> The number of
>> processes are equal to the number of cores and are pinned to each
>> individual core. I am suspecting that !zone_reclaimable_pages() check
>> did not help.
>
> Hmm, interesting! I would expect the OOM killer triggering but I guess
> I see what is going on. kswapd couldn't reclaim a single page and ran
> out of its kswapd_failures attempts while no direct reclaimers could
> reclaim a single page either until we reached the throttling point when
> we are basically livelocked because neither kswapd nor _all_ direct
> reclaimers can make a forward progress. Although this sounds quite
> unlikely I think it is quite possible to happen. So we cannot really
> throttle _all_ direct reclaimers when the kswapd is out of game which I
> haven't fully realized when reviewing "mm: fix 100% CPU kswapd busyloop
> on unreclaimable nodes".
>
> The simplest thing to do would be something like you have proposed and
> do not throttle if kswapd is out of game.
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index bae698484e8e..d34b1afc781a 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2791,6 +2791,9 @@ static bool pfmemalloc_watermark_ok(pg_data_t *pgdat)
> int i;
> bool wmark_ok;
>
> +   if (pgdat->kswapd_failures >= MAX_RECLAIM_RETRIES)
> +   return true;
> +
> for (i = 0; i <= ZONE_NORMAL; i++) {
> zone = &pgdat->node_zones[i];
> if (!managed_zone(zone))
>
> I do not like this as I've already said but it would allow to merge
> "mm: fix 100% CPU kswapd busyloop on unreclaimable nodes" without too
> many additional changes.
>
> Another option would be to cap the waiting time same as we do for
> GFP_NOFS. Not ideal either because I suspect we would just get herds
> of direct reclaimers that way.
>
> The best option would be to rethink the throttling and move it out of
> the direct reclaim path somehow.
>
Agreed.

> Thanks and sorry for not spotting the potential lockup previously.
> --
> Michal Hocko
> SUSE Labs


Re: [PATCH v3 RFC] mm/vmscan: more restrictive condition for retry of shrink_zones

2017-03-12 Thread Shakeel Butt
On Sun, Mar 12, 2017 at 4:06 AM, Yisheng Xie <ys...@foxmail.com> wrote:
> From: Yisheng Xie <xieyishe...@huawei.com>
>
> When we enter do_try_to_free_pages, the may_thrash is always clear, and
> it will retry shrink zones to tap cgroup's reserves memory by setting
> may_thrash when the former shrink_zones reclaim nothing.
>
> However, when memcg is disabled or on legacy hierarchy, it should not do
> this useless retry at all, for we do not have any cgroup's reserves
> memory to tap, and we have already done hard work but made no progress.
>
> To avoid this time costly and useless retrying, add a stub function
> mem_cgroup_thrashed() and return true when memcg is disabled or on
> legacy hierarchy.
>
> Signed-off-by: Yisheng Xie <xieyishe...@huawei.com>
> Suggested-by: Shakeel Butt <shake...@google.com>

Thanks.
Reviewed-by: Shakeel Butt <shake...@google.com>

> ---
> v3:
>  - rename function may_thrash() to mem_cgroup_thrashed() to avoid confusing.
>
> v2:
>  - more restrictive condition for retry of shrink_zones (restricting
>cgroup_disabled=memory boot option and cgroup legacy hierarchy) - Shakeel
>
>  - add a stub function may_thrash() to avoid compile error or warning.
>
>  - rename subject from "donot retry shrink zones when memcg is disable"
>to "more restrictive condition for retry in do_try_to_free_pages"
>
> Any comment is more than welcome!
>
> Thanks
> Yisheng Xie
>
>  mm/vmscan.c | 20 +++-
>  1 file changed, 19 insertions(+), 1 deletion(-)
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index bc8031e..a76475af 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -184,6 +184,19 @@ static bool sane_reclaim(struct scan_control *sc)
>  #endif
> return false;
>  }
> +
> +static bool mem_cgroup_thrashed(struct scan_control *sc)
> +{
> +   /*
> +* When memcg is disabled or on legacy hierarchy, there is no cgroup
> +* reserves memory to tap. So fake it as thrashed.
> +*/
> +   if (!cgroup_subsys_enabled(memory_cgrp_subsys) ||
> +   !cgroup_subsys_on_dfl(memory_cgrp_subsys))
> +   return true;
> +
> +   return sc->may_thrash;
> +}
>  #else
>  static bool global_reclaim(struct scan_control *sc)
>  {
> @@ -194,6 +207,11 @@ static bool sane_reclaim(struct scan_control *sc)
>  {
> return true;
>  }
> +
> +static bool mem_cgroup_thrashed(struct scan_control *sc)
> +{
> +   return true;
> +}
>  #endif
>
>  /*
> @@ -2808,7 +2826,7 @@ static unsigned long do_try_to_free_pages(struct 
> zonelist *zonelist,
> return 1;
>
> /* Untapped cgroup reserves?  Don't OOM, retry. */
> -   if (!sc->may_thrash) {
> +   if (!mem_cgroup_thrashed(sc)) {
> sc->priority = initial_priority;
> sc->may_thrash = 1;
> goto retry;
> --
> 1.9.1
>


[PATCH] mm: fix condition for throttle_direct_reclaim

2017-03-10 Thread Shakeel Butt
Recently kswapd has been modified to give up after MAX_RECLAIM_RETRIES
number of unsuccessful iterations. Before going to sleep, the kswapd
thread will unconditionally wake up all threads sleeping on
pfmemalloc_wait. However the awoken threads will recheck the
watermarks, wake the kswapd thread and sleep again on pfmemalloc_wait.
There is a chance of continuous back and forth between kswapd and the
direct reclaiming threads if kswapd keeps failing, which defeats the
purpose of adding a backoff mechanism to kswapd. So, add a
kswapd_failures check to the throttle_direct_reclaim condition.

Signed-off-by: Shakeel Butt <shake...@google.com>
---
 mm/vmscan.c | 12 +---
 1 file changed, 9 insertions(+), 3 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index bae698484e8e..b2d24cc7a161 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2819,6 +2819,12 @@ static bool pfmemalloc_watermark_ok(pg_data_t *pgdat)
return wmark_ok;
 }
 
+static bool should_throttle_direct_reclaim(pg_data_t *pgdat)
+{
+   return (pgdat->kswapd_failures < MAX_RECLAIM_RETRIES &&
+   !pfmemalloc_watermark_ok(pgdat));
+}
+
 /*
  * Throttle direct reclaimers if backing storage is backed by the network
  * and the PFMEMALLOC reserve for the preferred node is getting dangerously
@@ -2873,7 +2879,7 @@ static bool throttle_direct_reclaim(gfp_t gfp_mask, struct zonelist *zonelist,
 
/* Throttle based on the first usable node */
pgdat = zone->zone_pgdat;
-   if (pfmemalloc_watermark_ok(pgdat))
+   if (!should_throttle_direct_reclaim(pgdat))
goto out;
break;
}
@@ -2895,14 +2901,14 @@ static bool throttle_direct_reclaim(gfp_t gfp_mask, struct zonelist *zonelist,
 */
if (!(gfp_mask & __GFP_FS)) {
wait_event_interruptible_timeout(pgdat->pfmemalloc_wait,
-   pfmemalloc_watermark_ok(pgdat), HZ);
+   !should_throttle_direct_reclaim(pgdat), HZ);
 
goto check_pending;
}
 
/* Throttle until kswapd wakes the process */
wait_event_killable(zone->zone_pgdat->pfmemalloc_wait,
-   pfmemalloc_watermark_ok(pgdat));
+   !should_throttle_direct_reclaim(pgdat));
 
 check_pending:
if (fatal_signal_pending(current))
-- 
2.12.0.246.ga2ecc84866-goog



Re: [PATCH RFC] mm/vmscan: donot retry shrink zones when memcg is disabled

2017-03-10 Thread Shakeel Butt
On Fri, Mar 10, 2017 at 6:19 PM, Yisheng Xie  wrote:
> From: Yisheng Xie 
>
> When we enter do_try_to_free_pages, the may_thrash is always clear, and
> it will retry shrink zones to tap cgroup's reserves memory by setting
> may_thrash when the former shrink_zones reclaim nothing.
>
> However, if CONFIG_MEMCG=n, it should not do this useless retry at all,
> for we do not have any cgroup's reserves memory to tap, and we have
> already done hard work and made no progress.
>
> Signed-off-by: Yisheng Xie 
> ---
>  mm/vmscan.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index bc8031e..b03ccc1 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2808,7 +2808,7 @@ static unsigned long do_try_to_free_pages(struct 
> zonelist *zonelist,
> return 1;
>
> /* Untapped cgroup reserves?  Don't OOM, retry. */
> -   if (!sc->may_thrash) {
> +   if (!sc->may_thrash && IS_ENABLED(CONFIG_MEMCG)) {

In my opinion it should be even more restrictive (also covering the
cgroup_disabled=memory boot option and the cgroup legacy hierarchy).
So, instead of IS_ENABLED(CONFIG_MEMCG), the check should be something
like (cgroup_subsys_enabled(memory_cgrp_subsys) &&
cgroup_subsys_on_dfl(memory_cgrp_subsys)).

> sc->priority = initial_priority;
> sc->may_thrash = 1;
> goto retry;
> --
> 1.9.1
>
>
>


[PATCH v2] mm: fix condition for throttle_direct_reclaim

2017-03-13 Thread Shakeel Butt
Recently kswapd has been modified to give up after MAX_RECLAIM_RETRIES
number of unsuccessful iterations. Before going to sleep, the kswapd
thread will unconditionally wake up all threads sleeping on
pfmemalloc_wait. However the awoken threads will recheck the
watermarks, wake the kswapd thread and sleep again on pfmemalloc_wait.
There is a chance of continuous back and forth between kswapd and the
direct reclaiming threads if kswapd keeps failing, which defeats the
purpose of adding a backoff mechanism to kswapd. So, add a
kswapd_failures check to the throttle_direct_reclaim condition.

Signed-off-by: Shakeel Butt <shake...@google.com>
Suggested-by: Michal Hocko <mho...@suse.com>
Suggested-by: Johannes Weiner <han...@cmpxchg.org>
---
v2:
Instead of separate helper function for checking kswapd_failures,
added the check into pfmemalloc_watermark_ok() and renamed that
function.

 mm/vmscan.c | 15 +--
 1 file changed, 9 insertions(+), 6 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index bae698484e8e..afa5b20ab6d8 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2783,7 +2783,7 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
return 0;
 }
 
-static bool pfmemalloc_watermark_ok(pg_data_t *pgdat)
+static bool allow_direct_reclaim(pg_data_t *pgdat)
 {
struct zone *zone;
unsigned long pfmemalloc_reserve = 0;
@@ -2791,6 +2791,9 @@ static bool pfmemalloc_watermark_ok(pg_data_t *pgdat)
int i;
bool wmark_ok;
 
+   if (pgdat->kswapd_failures >= MAX_RECLAIM_RETRIES)
+   return true;
+
for (i = 0; i <= ZONE_NORMAL; i++) {
zone = &pgdat->node_zones[i];
if (!managed_zone(zone))
@@ -2873,7 +2876,7 @@ static bool throttle_direct_reclaim(gfp_t gfp_mask, struct zonelist *zonelist,
 
/* Throttle based on the first usable node */
pgdat = zone->zone_pgdat;
-   if (pfmemalloc_watermark_ok(pgdat))
+   if (allow_direct_reclaim(pgdat))
goto out;
break;
}
@@ -2895,14 +2898,14 @@ static bool throttle_direct_reclaim(gfp_t gfp_mask, struct zonelist *zonelist,
 */
if (!(gfp_mask & __GFP_FS)) {
wait_event_interruptible_timeout(pgdat->pfmemalloc_wait,
-   pfmemalloc_watermark_ok(pgdat), HZ);
+   allow_direct_reclaim(pgdat), HZ);
 
goto check_pending;
}
 
/* Throttle until kswapd wakes the process */
wait_event_killable(zone->zone_pgdat->pfmemalloc_wait,
-   pfmemalloc_watermark_ok(pgdat));
+   allow_direct_reclaim(pgdat));
 
 check_pending:
if (fatal_signal_pending(current))
@@ -3102,7 +3105,7 @@ static bool prepare_kswapd_sleep(pg_data_t *pgdat, int order, int classzone_idx)
 {
/*
 * The throttled processes are normally woken up in balance_pgdat() as
-* soon as pfmemalloc_watermark_ok() is true. But there is a potential
+* soon as allow_direct_reclaim() is true. But there is a potential
 * race between when kswapd checks the watermarks and a process gets
 * throttled. There is also a potential race if processes get
 * throttled, kswapd wakes, a large process exits thereby balancing the
@@ -3271,7 +3274,7 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx)
 * able to safely make forward progress. Wake them
 */
if (waitqueue_active(&pgdat->pfmemalloc_wait) &&
-   pfmemalloc_watermark_ok(pgdat))
+   allow_direct_reclaim(pgdat))
wake_up_all(&pgdat->pfmemalloc_wait);
 
/* Check if kswapd should be suspending */
-- 
2.12.0.246.ga2ecc84866-goog



Re: [PATCH] mm: fix condition for throttle_direct_reclaim

2017-03-13 Thread Shakeel Butt
On Mon, Mar 13, 2017 at 12:58 PM, Johannes Weiner <han...@cmpxchg.org> wrote:
> Hi Shakeel,
>
> On Fri, Mar 10, 2017 at 11:46:20AM -0800, Shakeel Butt wrote:
>> Recently kswapd has been modified to give up after MAX_RECLAIM_RETRIES
>> number of unsucessful iterations. Before going to sleep, kswapd thread
>> will unconditionally wakeup all threads sleeping on pfmemalloc_wait.
>> However the awoken threads will recheck the watermarks and wake the
>> kswapd thread and sleep again on pfmemalloc_wait. There is a chance
>> of continuous back and forth between kswapd and direct reclaiming
>> threads if the kswapd keep failing and thus defeat the purpose of
>> adding backoff mechanism to kswapd. So, add kswapd_failures check
>> on the throttle_direct_reclaim condition.
>>
>> Signed-off-by: Shakeel Butt <shake...@google.com>
>
> You're right, the way it works right now is kind of lame. Did you
> observe continued kswapd spinning because of the wakeup ping-pong?
>

Yes, I did observe kswapd spinning for more than an hour.

>> +static bool should_throttle_direct_reclaim(pg_data_t *pgdat)
>> +{
>> + return (pgdat->kswapd_failures < MAX_RECLAIM_RETRIES &&
>> + !pfmemalloc_watermark_ok(pgdat));
>> +}
>> +
>>  /*
>>   * Throttle direct reclaimers if backing storage is backed by the network
>>   * and the PFMEMALLOC reserve for the preferred node is getting dangerously
>> @@ -2873,7 +2879,7 @@ static bool throttle_direct_reclaim(gfp_t gfp_mask, 
>> struct zonelist *zonelist,
>>
>>   /* Throttle based on the first usable node */
>>   pgdat = zone->zone_pgdat;
>> - if (pfmemalloc_watermark_ok(pgdat))
>> + if (!should_throttle_direct_reclaim(pgdat))
>>   goto out;
>
> Instead of a second helper function, could you rename
> pfmemalloc_watermark_ok() and add the kswapd_failure check at the very
> beginning of that function?
>

Sure, Michal also suggested the same.

> Because that check fits nicely with the comment about kswapd having to
> be awake, too. We need kswapd operational when throttling reclaimers.
>
> Thanks


Re: [PATCH 1/9] mm: fix 100% CPU kswapd busyloop on unreclaimable nodes

2017-03-02 Thread Shakeel Butt
On Tue, Feb 28, 2017 at 1:39 PM, Johannes Weiner  wrote:
> Jia He reports a problem with kswapd spinning at 100% CPU when
> requesting more hugepages than memory available in the system:
>
> $ echo 4000 >/proc/sys/vm/nr_hugepages
>
> top - 13:42:59 up  3:37,  1 user,  load average: 1.09, 1.03, 1.01
> Tasks:   1 total,   1 running,   0 sleeping,   0 stopped,   0 zombie
> %Cpu(s):  0.0 us, 12.5 sy,  0.0 ni, 85.5 id,  2.0 wa,  0.0 hi,  0.0 si,  0.0 
> st
> KiB Mem:  31371520 total, 30915136 used,   456384 free,  320 buffers
> KiB Swap:  6284224 total,   115712 used,  6168512 free.48192 cached Mem
>
>   PID USER  PR  NIVIRTRESSHR S  %CPU  %MEM TIME+ COMMAND
>76 root  20   0   0  0  0 R 100.0 0.000 217:17.29 kswapd3
>
> At that time, there are no reclaimable pages left in the node, but as
> kswapd fails to restore the high watermarks it refuses to go to sleep.
>
> Kswapd needs to back away from nodes that fail to balance. Up until
> 1d82de618ddd ("mm, vmscan: make kswapd reclaim in terms of nodes")
> kswapd had such a mechanism. It considered zones whose theoretically
> reclaimable pages it had reclaimed six times over as unreclaimable and
> backed away from them. This guard was erroneously removed as the patch
> changed the definition of a balanced node.
>
> However, simply restoring this code wouldn't help in the case reported
> here: there *are* no reclaimable pages that could be scanned until the
> threshold is met. Kswapd would stay awake anyway.
>
> Introduce a new and much simpler way of backing off. If kswapd runs
> through MAX_RECLAIM_RETRIES (16) cycles without reclaiming a single
> page, make it back off from the node. This is the same number of shots
> direct reclaim takes before declaring OOM. Kswapd will go to sleep on
> that node until a direct reclaimer manages to reclaim some pages, thus
> proving the node reclaimable again.
>

Should the condition of wait_event_killable in throttle_direct_reclaim
be changed to (pfmemalloc_watermark_ok(pgdat) ||
pgdat->kswapd_failures >= MAX_RECLAIM_RETRIES)?

> v2: move MAX_RECLAIM_RETRIES to mm/internal.h (Michal)
>
> Reported-by: Jia He 
> Signed-off-by: Johannes Weiner 
> Tested-by: Jia He 
> Acked-by: Michal Hocko 
> ---
>  include/linux/mmzone.h |  2 ++
>  mm/internal.h  |  6 ++
>  mm/page_alloc.c|  9 ++---
>  mm/vmscan.c| 27 ---
>  mm/vmstat.c|  2 +-
>  5 files changed, 31 insertions(+), 15 deletions(-)
>
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index 8e02b3750fe0..d2c50ab6ae40 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -630,6 +630,8 @@ typedef struct pglist_data {
> int kswapd_order;
> enum zone_type kswapd_classzone_idx;
>
> +   int kswapd_failures;/* Number of 'reclaimed == 0' runs */
> +
>  #ifdef CONFIG_COMPACTION
> int kcompactd_max_order;
> enum zone_type kcompactd_classzone_idx;
> diff --git a/mm/internal.h b/mm/internal.h
> index ccfc2a2969f4..aae93e3fd984 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -81,6 +81,12 @@ static inline void set_page_refcounted(struct page *page)
>  extern unsigned long highest_memmap_pfn;
>
>  /*
> + * Maximum number of reclaim retries without progress before the OOM
> + * killer is consider the only way forward.
> + */
> +#define MAX_RECLAIM_RETRIES 16
> +
> +/*
>   * in mm/vmscan.c:
>   */
>  extern int isolate_lru_page(struct page *page);
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 614cd0397ce3..f50e36e7b024 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -3516,12 +3516,6 @@ bool gfp_pfmemalloc_allowed(gfp_t gfp_mask)
>  }
>
>  /*
> - * Maximum number of reclaim retries without any progress before OOM killer
> - * is consider as the only way to move forward.
> - */
> -#define MAX_RECLAIM_RETRIES 16
> -
> -/*
>   * Checks whether it makes sense to retry the reclaim to make a forward 
> progress
>   * for the given allocation request.
>   * The reclaim feedback represented by did_some_progress (any progress during
> @@ -4527,7 +4521,8 @@ void show_free_areas(unsigned int filter, nodemask_t 
> *nodemask)
> K(node_page_state(pgdat, NR_WRITEBACK_TEMP)),
> K(node_page_state(pgdat, NR_UNSTABLE_NFS)),
> node_page_state(pgdat, NR_PAGES_SCANNED),
> -   !pgdat_reclaimable(pgdat) ? "yes" : "no");
> +   pgdat->kswapd_failures >= MAX_RECLAIM_RETRIES ?
> +   "yes" : "no");
> }
>
> for_each_populated_zone(zone) {
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 26c3b405ef34..407b27831ff7 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2626,6 +2626,15 @@ static bool shrink_node(pg_data_t *pgdat, struct 
> scan_control *sc)

Re: [PATCH 2/9] mm: support __GFP_REPEAT in kvmalloc_node for >32kB

2017-04-06 Thread Shakeel Butt
On Mon, Mar 6, 2017 at 2:30 AM, Michal Hocko  wrote:
> From: Michal Hocko 
>
> vhost code uses __GFP_REPEAT when allocating vhost_virtqueue resp.
> vhost_vsock because it would really like to prefer kmalloc to the
> vmalloc fallback - see 23cc5a991c7a ("vhost-net: extend device
> allocation to vmalloc") for more context. Michael Tsirkin has also
> noted:
> "
> __GFP_REPEAT overhead is during allocation time.  Using vmalloc means all
> accesses are slowed down.  Allocation is not on data path, accesses are.
> "
>
> The similar applies to other vhost_kvzalloc users.
>
> Let's teach kvmalloc_node to handle __GFP_REPEAT properly. There are two
> things to be careful about. First we should prevent from the OOM killer
> and so have to involve __GFP_NORETRY by default and secondly override
> __GFP_REPEAT for !costly order requests as the __GFP_REPEAT is ignored
> for !costly orders.
>
> Supporting __GFP_REPEAT like semantic for !costly request is possible
> it would require changes in the page allocator. This is out of scope of
> this patch.
>
> This patch shouldn't introduce any functional change.
>
> Acked-by: Vlastimil Babka 
> Acked-by: Michael S. Tsirkin 
> Signed-off-by: Michal Hocko 
> ---
>  drivers/vhost/net.c   |  9 +++--
>  drivers/vhost/vhost.c | 15 +++
>  drivers/vhost/vsock.c |  9 +++--
>  mm/util.c | 20 
>  4 files changed, 25 insertions(+), 28 deletions(-)
>

There is a kzalloc/vzalloc call in
drivers/vhost/scsi.c:vhost_scsi_open() which is not converted to
kvzalloc(). Was that intentional?
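For clarity, this is the kind of conversion I am asking about (a sketch
based on my reading of vhost_scsi_open(); the exact code and flags there
may differ):

/* open-coded today: kmalloc first, fall back to vmalloc by hand */
vs = kzalloc(sizeof(*vs), GFP_KERNEL | __GFP_NOWARN | __GFP_REPEAT);
if (!vs)
        vs = vzalloc(sizeof(*vs));

/* with this series it could presumably become: */
vs = kvzalloc(sizeof(*vs), GFP_KERNEL | __GFP_REPEAT);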


Re: [RFC PATCH] mm: fadvise: avoid fadvise for fs without backing device

2017-08-18 Thread Shakeel Butt
On Fri, Aug 18, 2017 at 2:34 PM, Andrew Morton
<a...@linux-foundation.org> wrote:
> On Thu, 17 Aug 2017 18:20:17 -0700 Shakeel Butt <shake...@google.com> wrote:
>
>> +linux-mm, linux-kernel
>>
>> On Thu, Aug 17, 2017 at 6:10 PM, Shakeel Butt <shake...@google.com> wrote:
>> > The fadvise() manpage is silent on fadvise()'s effect on
>> > memory-based filesystems (shmem, hugetlbfs & ramfs) and pseudo
>> > file systems (procfs, sysfs, kernfs). The current implementaion
>> > of fadvise is mostly a noop for such filesystems except for
>> > FADV_DONTNEED which will trigger expensive remote LRU cache
>> > draining. This patch makes the noop of fadvise() on such file
>> > systems very explicit.
>> >
>> > However this change has two side effects for ramfs and one for
>> > tmpfs. First fadvise(FADV_DONTNEED) can remove the unmapped clean
>> > zero'ed pages of ramfs (allocated through read, readahead & read
>> > fault) and tmpfs (allocated through read fault). Also
>> > fadvise(FADV_WILLNEED) on create such clean zero'ed pages for
>> > ramfs. This change removes these two interfaces.
>> >
>
> It doesn't sound like a risky change to me, although perhaps someone is
> depending on the current behaviour for obscure reasons, who knows.
>
> What are the reasons for this change?  Is the current behaviour causing
> some sort of problem for someone?

Yes, one of our generic libraries does fadvise(FADV_DONTNEED). Recently
we observed high latency in fadvise() and noticed that users have
started using tmpfs files, and the latency was due to the expensive
remote LRU cache draining. For normal tmpfs files (which have data
written to them), fadvise(FADV_DONTNEED) will always trigger this
unneeded remote cache draining.
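For context, the library call in question boils down to the pattern
below (an illustrative sketch, not our actual library code). On tmpfs
the FADV_DONTNEED path still pays for draining the per-cpu LRU caches
on all remote CPUs even though there is nothing to write back, which is
the latency we are seeing.

#include <fcntl.h>

/* hint that the file's page cache can be dropped after it is consumed */
static int drop_cache_hint(int fd)
{
        return posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);
}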

>
>


Re: [RFC PATCH] mm: fadvise: avoid fadvise for fs without backing device

2017-08-17 Thread Shakeel Butt
+linux-mm, linux-kernel

On Thu, Aug 17, 2017 at 6:10 PM, Shakeel Butt <shake...@google.com> wrote:
> The fadvise() manpage is silent on fadvise()'s effect on
> memory-based filesystems (shmem, hugetlbfs & ramfs) and pseudo
> file systems (procfs, sysfs, kernfs). The current implementation
> of fadvise is mostly a noop for such filesystems except for
> FADV_DONTNEED which will trigger expensive remote LRU cache
> draining. This patch makes the noop of fadvise() on such file
> systems very explicit.
>
> However this change has two side effects for ramfs and one for
> tmpfs. First, fadvise(FADV_DONTNEED) can remove the unmapped clean
> zero'ed pages of ramfs (allocated through read, readahead & read
> fault) and tmpfs (allocated through read fault). Also,
> fadvise(FADV_WILLNEED) can create such clean zero'ed pages for
> ramfs. This change removes these two interfaces.
>
> Signed-off-by: Shakeel Butt <shake...@google.com>
> ---
>  mm/fadvise.c | 6 +++---
>  1 file changed, 3 insertions(+), 3 deletions(-)
>
> diff --git a/mm/fadvise.c b/mm/fadvise.c
> index a43013112581..702f239cd6db 100644
> --- a/mm/fadvise.c
> +++ b/mm/fadvise.c
> @@ -52,7 +52,9 @@ SYSCALL_DEFINE4(fadvise64_64, int, fd, loff_t, offset, 
> loff_t, len, int, advice)
> goto out;
> }
>
> -   if (IS_DAX(inode)) {
> +   bdi = inode_to_bdi(mapping->host);
> +
> +   if (IS_DAX(inode) || (bdi == &noop_backing_dev_info)) {
> switch (advice) {
> case POSIX_FADV_NORMAL:
> case POSIX_FADV_RANDOM:
> @@ -75,8 +77,6 @@ SYSCALL_DEFINE4(fadvise64_64, int, fd, loff_t, offset, 
> loff_t, len, int, advice)
> else
> endbyte--;  /* inclusive */
>
> -   bdi = inode_to_bdi(mapping->host);
> -
> switch (advice) {
> case POSIX_FADV_NORMAL:
> f.file->f_ra.ra_pages = bdi->ra_pages;
> --
> 2.14.1.480.gb18f417b89-goog
>


Re: [RFC PATCH] mm: fadvise: avoid fadvise for fs without backing device

2017-08-22 Thread Shakeel Butt
>> It doesn't sound like a risky change to me, although perhaps someone is
>> depending on the current behaviour for obscure reasons, who knows.
>>
>> What are the reasons for this change?  Is the current behaviour causing
>> some sort of problem for someone?
>
> Yes, one of our generic library does fadvise(FADV_DONTNEED). Recently
> we observed high latency in fadvise() and notice that the users have
> started using tmpfs files and the latency was due to expensive remote
> LRU cache draining. For normal tmpfs files (have data written on
> them), fadvise(FADV_DONTNEED) will always trigger the un-needed remote
> cache draining.
>

Hi Andrew, do you have more comments or concerns?

>>
>>


Re: [PATCH] mm: respect the __GFP_NOWARN flag when warning about stalls

2017-09-13 Thread Shakeel Butt
>
> We would have to consider (instead of jiffies) the time the process was
> either running, or waiting on something that's related to memory
> allocation/reclaim (page lock etc.). I.e. deduct the time the process
> was runable but there was no available cpu. I expect however that such
> level of detail wouldn't be feasible here, though?
>

Johannes' memdelay work (once merged) might be useful here. I think
memdelay can differentiate between an allocating process getting
delayed due to preemption and one delayed due to unsuccessful
reclaim/compaction. If the delay is due to unsuccessful
reclaim/compaction then we should warn here.


Re: [PATCH] fs, mm: account filp and names caches to kmemcg

2017-10-06 Thread Shakeel Butt
>>   names_cachep = kmem_cache_create("names_cache", PATH_MAX, 0,
>> - SLAB_HWCACHE_ALIGN|SLAB_PANIC, NULL);
>> + SLAB_HWCACHE_ALIGN|SLAB_PANIC|SLAB_ACCOUNT, NULL);
>
> I might be wrong but isn't name cache only holding temporary objects
> used for path resolution which are not stored anywhere?
>

Even though they're temporary, many containers can together use a
significant amount of transient uncharged memory. We've seen machines
with 100s of MiBs in names_cache.

>>   filp_cachep = kmem_cache_create("filp", sizeof(struct file), 0,
>> - SLAB_HWCACHE_ALIGN | SLAB_PANIC, NULL);
>> + SLAB_HWCACHE_ALIGN | SLAB_PANIC | SLAB_ACCOUNT, NULL);
>>   percpu_counter_init(&nr_files, 0, GFP_KERNEL);
>>  }
>
> Don't we have a limit for the maximum number of open files?
>

Yes, there is a system-wide limit on the maximum number of open files.
However this limit is shared between different users on the system and
one user can hog this resource. To cater for that, we set the maximum
limit very high and let the memory limit of each user bound the number
of files they can open.


Re: [v8 0/4] cgroup-aware OOM killer

2017-10-02 Thread Shakeel Butt
(Replying again as format of previous reply got messed up).

On Mon, Oct 2, 2017 at 1:00 PM, Tim Hockin  wrote:
> In the example above:
>
>root
>/\
>  A  D
>  / \
>B   C
>
> Does oom_group allow me to express "compare A and D; if A is chosen
> compare B and C; kill the loser" ?  As I understand the proposal (from
> reading thread, not patch) it does not.

It will let you compare A and D and if A is chosen then kill A, B and C.


Re: [v8 0/4] cgroup-aware OOM killer

2017-10-02 Thread Shakeel Butt
On Mon, Oct 2, 2017 at 12:56 PM, Michal Hocko <mho...@kernel.org> wrote:
> On Mon 02-10-17 12:45:18, Shakeel Butt wrote:
>> > I am sorry to cut the rest of your proposal because it simply goes over
>> > the scope of the proposed solution while the usecase you are mentioning
>> > is still possible. If we want to compare intermediate nodes (which seems
>> > to be the case) then we can always provide a knob to opt-in - be it your
>> > oom_gang or others.
>>
>> In the Roman's proposed solution we can already force the comparison
>> of intermediate nodes using 'oom_group', I am just requesting to
>> separate the killall semantics from it.
>
> oom_group _is_ about killall semantic.  And comparing killable entities
> is just a natural thing to do. So I am not sure what you mean
>

I am saying we should decouple the notion of comparable entities from that of killable entities.


[PATCH] epoll: account epitem and eppoll_entry to kmemcg

2017-10-02 Thread Shakeel Butt
The user space application can directly trigger the allocations
from eventpoll_epi and eventpoll_pwq slabs. A buggy or malicious
application can consume a significant amount of system memory by
triggering such allocations. Indeed we have seen in production a
buggy application leaking epoll references and causing a burst of
eventpoll_epi and eventpoll_pwq slab allocations. This patch opts in
the eventpoll_epi and eventpoll_pwq slabs for kmemcg charging.
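To illustrate the user-triggered allocation path (a user-space sketch,
not part of the patch): every successful EPOLL_CTL_ADD allocates a
struct epitem from the eventpoll_epi slab, so a single process watching
many fds can grow these slabs arbitrarily.

#include <stdio.h>
#include <sys/epoll.h>
#include <sys/eventfd.h>

int main(void)
{
        int epfd = epoll_create1(0);
        struct epoll_event ev = { .events = EPOLLIN };

        for (int i = 0; i < 100000; i++) {
                int efd = eventfd(0, 0);        /* any pollable fd will do */

                if (efd < 0)
                        break;                  /* probably hit the fd limit */
                ev.data.fd = efd;
                epoll_ctl(epfd, EPOLL_CTL_ADD, efd, &ev);
        }
        getchar();  /* hold the watches; see eventpoll_epi in /proc/slabinfo */
        return 0;
}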

Signed-off-by: Shakeel Butt <shake...@google.com>
---
 fs/eventpoll.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/fs/eventpoll.c b/fs/eventpoll.c
index 2fabd19cdeea..a45360444895 100644
--- a/fs/eventpoll.c
+++ b/fs/eventpoll.c
@@ -2329,11 +2329,11 @@ static int __init eventpoll_init(void)
 
/* Allocates slab cache used to allocate "struct epitem" items */
epi_cache = kmem_cache_create("eventpoll_epi", sizeof(struct epitem),
-   0, SLAB_HWCACHE_ALIGN | SLAB_PANIC, NULL);
+   0, SLAB_HWCACHE_ALIGN|SLAB_PANIC|SLAB_ACCOUNT, NULL);
 
/* Allocates slab cache used to allocate "struct eppoll_entry" */
pwq_cache = kmem_cache_create("eventpoll_pwq",
-   sizeof(struct eppoll_entry), 0, SLAB_PANIC, NULL);
+   sizeof(struct eppoll_entry), 0, SLAB_PANIC|SLAB_ACCOUNT, NULL);
 
return 0;
 }
-- 
2.14.2.822.g60be5d43e6-goog



[PATCH] kvm, mm: account kvm related kmem slabs to kmemcg

2017-10-05 Thread Shakeel Butt
The kvm slabs can consume a significant amount of system memory and
indeed in our production environment we have observed that a lot of
machines spend a significant amount of memory on them, which should
not be left unaccounted as system memory overhead. Also the
allocations from these slabs can be triggered directly by user space
applications which have access to kvm, and thus a buggy application
can leak such memory. So, these caches should be accounted to kmemcg.

Signed-off-by: Shakeel Butt <shake...@google.com>
---
 arch/x86/kvm/mmu.c  | 4 ++--
 virt/kvm/kvm_main.c | 2 +-
 2 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index eca30c1eb1d9..87c5db9e644d 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -5475,13 +5475,13 @@ int kvm_mmu_module_init(void)
 
pte_list_desc_cache = kmem_cache_create("pte_list_desc",
sizeof(struct pte_list_desc),
-   0, 0, NULL);
+   0, SLAB_ACCOUNT, NULL);
if (!pte_list_desc_cache)
goto nomem;
 
mmu_page_header_cache = kmem_cache_create("kvm_mmu_page_header",
  sizeof(struct kvm_mmu_page),
- 0, 0, NULL);
+ 0, SLAB_ACCOUNT, NULL);
if (!mmu_page_header_cache)
goto nomem;
 
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 9deb5a245b83..3d73299e05f2 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -4010,7 +4010,7 @@ int kvm_init(void *opaque, unsigned vcpu_size, unsigned vcpu_align,
if (!vcpu_align)
vcpu_align = __alignof__(struct kvm_vcpu);
kvm_vcpu_cache = kmem_cache_create("kvm_vcpu", vcpu_size, vcpu_align,
-  0, NULL);
+  SLAB_ACCOUNT, NULL);
if (!kvm_vcpu_cache) {
r = -ENOMEM;
goto out_free_3;
-- 
2.14.2.920.gcf0c67979c-goog



Re: [PATCH] kvm, mm: account kvm related kmem slabs to kmemcg

2017-10-06 Thread Shakeel Butt
On Thu, Oct 5, 2017 at 9:28 PM, Anshuman Khandual
<khand...@linux.vnet.ibm.com> wrote:
> On 10/06/2017 06:37 AM, Shakeel Butt wrote:
>> The kvm slabs can consume a significant amount of system memory
>> and indeed in our production environment we have observed that
>> a lot of machines are spending significant amount of memory that
>> can not be left as system memory overhead. Also the allocations
>> from these slabs can be triggered directly by user space applications
>> which has access to kvm and thus a buggy application can leak
>> such memory. So, these caches should be accounted to kmemcg.
>
> But there may be other situations like this where user space can
> trigger allocation from various SLAB objects inside the kernel
> which are accounted as system memory. So how we draw the line
> which ones should be accounted for memcg. Just being curious.
>
Yes, there are indeed other slabs where user space can trigger
allocations. IMO selecting which kmem caches to account is a
workload- and user-specific decision. The ones I am converting were
selected based on data gathered from our production environment.
However I think they would be useful in general.


Re: [PATCH] mm: memcontrol: use per-cpu stocks for socket memory uncharging

2017-09-07 Thread Shakeel Butt
On Thu, Sep 7, 2017 at 11:47 AM, Roman Gushchin <g...@fb.com> wrote:
> On Thu, Sep 07, 2017 at 11:44:12AM -0700, Shakeel Butt wrote:
>> >> As far as other types of pages go: page cache and anon are already
>> >> batched pretty well, but I think kmem might benefit from this
>> >> too. Have you considered using the stock in memcg_kmem_uncharge()?
>> >
>> > Good idea!
>> > I'll try to find an appropriate testcase and check if it really
>> > brings any benefits. If so, I'll master a patch.
>> >
>>
>> Hi Roman, did you get the chance to try this on memcg_kmem_uncharge()?
>
> Hi Shakeel!
>
> Not yet, I'll try to it asap.
> Do you have an example when it's costly?
>

Nah, I was also curious about such examples.


Re: [PATCH] mm: memcontrol: use per-cpu stocks for socket memory uncharging

2017-09-07 Thread Shakeel Butt
>> As far as other types of pages go: page cache and anon are already
>> batched pretty well, but I think kmem might benefit from this
>> too. Have you considered using the stock in memcg_kmem_uncharge()?
>
> Good idea!
> I'll try to find an appropriate testcase and check if it really
> brings any benefits. If so, I'll master a patch.
>

Hi Roman, did you get the chance to try this on memcg_kmem_uncharge()?


Re: [v8 0/4] cgroup-aware OOM killer

2017-10-01 Thread Shakeel Butt
>
> Going back to Michal's example, say the user configured the following:
>
>root
>   /\
>  A  D
> / \
>B   C
>
> A global OOM event happens and we find this:
> - A > D
> - B, C, D are oomgroups
>
> What the user is telling us is that B, C, and D are compound memory
> consumers. They cannot be divided into their task parts from a memory
> point of view.
>
> However, the user doesn't say the same for A: the A subtree summarizes
> and controls aggregate consumption of B and C, but without groupoom
> set on A, the user says that A is in fact divisible into independent
> memory consumers B and C.
>
> If we don't have to kill all of A, but we'd have to kill all of D,
> does it make sense to compare the two?
>

I think Tim has given a very clear explanation of why comparing A & D
makes perfect sense. However I think the above example, a single user
system where the user has designed and created the whole hierarchy and
then attaches different jobs/applications to different nodes in that
hierarchy, is also a valid scenario. One solution I can think of, to
cater to both scenarios, is to introduce a notion of 'bypass oom': do
not include such a memcg in the oom comparison and instead include its
children in the comparison.

So, in the same above example:
        root
       /    \
     A(b)    D
    /  \
   B    C

A is marked as bypass and thus B and C are to be compared to D. So,
for the single user scenario, all the internal nodes are marked
'bypass oom comparison' and oom_priority of the leaves has to be set
to the same value.

Below is pseudo code for select_victim_memcg() based on this idea and
David's previous pseudo code. The calculation of a memcg's size is not
fully baked here yet. I am working on it and plan to send a patch
based on Roman's v9 "mm, oom: cgroup-aware OOM killer" patch.


struct mem_cgroup *memcg = root_mem_cgroup;
struct mem_cgroup *selected_memcg = root_mem_cgroup;
struct mem_cgroup *low_memcg;
unsigned long low_priority;
unsigned long prev_badness = memcg_oom_badness(memcg); // Roman's code
LIST_HEAD(queue);

next_level:
    low_memcg = NULL;
    low_priority = ULONG_MAX;

next:
    for_each_child_of_memcg(it, memcg) {
        unsigned long prio = it->oom_priority;
        unsigned long badness = 0;

        if (it->bypass_oom && !it->oom_group &&
            memcg_has_children(it)) {
            list_add(&it->oom_queue, &queue);
            continue;
        }

        if (prio > low_priority)
            continue;

        if (prio == low_priority) {
            badness = mem_cgroup_usage(it); // for simplicity, need more thinking
            if (badness < prev_badness)
                continue;
        }

        low_memcg = it;
        low_priority = prio;
        prev_badness = badness ?: mem_cgroup_usage(it); // for simplicity
    }
    if (!list_empty(&queue)) {
        memcg = list_last_entry(&queue, struct mem_cgroup, oom_queue);
        list_del(&memcg->oom_queue);
        goto next;
    }
    if (low_memcg) {
        selected_memcg = memcg = low_memcg;
        prev_badness = 0;
        if (!low_memcg->oom_group)
            goto next_level;
    }
    if (selected_memcg->oom_group)
        oom_kill_memcg(selected_memcg);
    else
        oom_kill_process_from_memcg(selected_memcg);


Re: [v8 0/4] cgroup-aware OOM killer

2017-10-02 Thread Shakeel Butt
> Yes and nobody is disputing that, really. I guess the main disconnect
> here is that different people want to have more detailed control over
> the victim selection while the patchset tries to handle the most
> simplistic scenario when a no userspace control over the selection is
> required. And I would claim that this will be a last majority of setups
> and we should address it first.

IMHO the disconnect/disagreement is which memcgs should be compared
with each other for oom victim selection. Let's forget about oom
priority and just take size into account. Should the oom selection
algorithm compare the leaves of the hierarchy or should it compare
siblings? For the single user system, comparing leaves makes sense
while in a multi user system, siblings should be compared for victim
selection.

Coming back to the same example:

   root
   /\
 A  D
 / \
   B   C

Let's view it as a multi user system and some central job scheduler
has asked a node controller on this system to start two jobs 'A' &
'D'. 'A' then went on to create sub-containers. Now, on system oom,
IMO the simplest sensible thing to do from the semantic point of view
is to compare 'A' and 'D' and, if 'A''s usage is higher, kill all of
'A' if oom_group is set, or else recursively find the victim memcg
taking 'A' as the root.

I have noted before that for single user systems, comparing 'B', 'C' &
'D' is the most sensible thing to do.

Now, in the multi user system, I can kind of force the comparison of
'A' & 'D' by setting oom_group on 'A'. IMO that is an abuse of
'oom_group' as it would then carry two meanings/semantics:
comparison leader and killall. I would humbly suggest having two
separate notions instead. Let's say oom_gang (if you prefer just
'oom_group' is fine too) and killall.

For the single user system example, 'B', 'C' and 'D' will have
'oom_gang' set and if the user wants killall semantics too, he can set
it separately.

For the multi user, 'A' and 'D' will have 'oom_gang' set. Now, lets
say 'A' was selected on system oom, if 'killall' was set on 'A' then
'A' will be selected as victim otherwise the oom selection algorithm
will recursively take 'A' as root and try to find victim memcg.

Another major semantic of 'oom_gang' is that the leaves will always be
treated as 'oom_gang'.
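
To make this concrete, below is a rough sketch of the selection with
the two separate knobs ('oom_gang' and 'killall' are hypothetical
names, and the 'bypass' queue handling from the earlier pseudo code is
omitted for brevity):

    memcg = root_mem_cgroup;
    for (;;) {
        struct mem_cgroup *victim = NULL;

        for_each_child_of_memcg(it, memcg) {
            /* leaves are implicitly treated as oom_gang */
            if (!it->oom_gang && memcg_has_children(it))
                continue;
            if (!victim ||
                mem_cgroup_usage(it) > mem_cgroup_usage(victim))
                victim = it;
        }
        if (!victim)
            break;
        memcg = victim;
        if (memcg->killall) {
            oom_kill_memcg(memcg);  /* killall semantics */
            return;
        }
        if (!memcg_has_children(memcg))
            break;
        /* otherwise recurse, taking the selected memcg as the root */
    }
    oom_kill_process_from_memcg(memcg);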


Re: [v8 0/4] cgroup-aware OOM killer

2017-10-02 Thread Shakeel Butt
> I am sorry to cut the rest of your proposal because it simply goes over
> the scope of the proposed solution while the usecase you are mentioning
> is still possible. If we want to compare intermediate nodes (which seems
> to be the case) then we can always provide a knob to opt-in - be it your
> oom_gang or others.

In the Roman's proposed solution we can already force the comparison
of intermediate nodes using 'oom_group', I am just requesting to
separate the killall semantics from it.


Re: [RFC PATCH] mm: fadvise: avoid fadvise for fs without backing device

2017-08-25 Thread Shakeel Butt
On Fri, Aug 25, 2017 at 2:49 PM, Andrew Morton
<a...@linux-foundation.org> wrote:
> On Thu, 17 Aug 2017 18:20:17 -0700 Shakeel Butt <shake...@google.com> wrote:
>
>> +linux-mm, linux-kernel
>>
>> On Thu, Aug 17, 2017 at 6:10 PM, Shakeel Butt <shake...@google.com> wrote:
>> > The fadvise() manpage is silent on fadvise()'s effect on
>> > memory-based filesystems (shmem, hugetlbfs & ramfs) and pseudo
>> > file systems (procfs, sysfs, kernfs). The current implementation
>> > of fadvise is mostly a noop for such filesystems except for
>> > FADV_DONTNEED which will trigger expensive remote LRU cache
>> > draining. This patch makes the noop of fadvise() on such file
>> > systems very explicit.
>> >
>> > However this change has two side effects for ramfs and one for
>> > tmpfs. First fadvise(FADV_DONTNEED) can remove the unmapped clean
>> > zero'ed pages of ramfs (allocated through read, readahead & read
>> > fault) and tmpfs (allocated through read fault). Also
>> > fadvise(FADV_WILLNEED) on create such clean zero'ed pages for
>> > ramfs.
>
> That sentence makes no sense.  I assume "fadvise(FADV_WILLNEED) will
> create"?
>

Sorry about that, it should be "fadvise(FADV_WILLNEED) can create".

>


Re: [v7 5/5] mm, oom: cgroup v2 mount option to disable cgroup-aware OOM killer

2017-09-04 Thread Shakeel Butt
On Mon, Sep 4, 2017 at 7:21 AM, Roman Gushchin  wrote:
> Introducing of cgroup-aware OOM killer changes the victim selection
> algorithm used by default: instead of picking the largest process,
> it will pick the largest memcg and then the largest process inside.
>
> This affects only cgroup v2 users.
>
> To provide a way to use cgroups v2 if the old OOM victim selection
> algorithm is preferred for some reason, the nogroupoom mount option
> is added.

Is this a mount option or a boot parameter? From the code, it seems
like a boot parameter.


Re: [PATCH] epoll: account epitem and eppoll_entry to kmemcg

2017-10-04 Thread Shakeel Butt
>
> I am not objecting to the patch I would just like to understand the
> runaway case. ep_insert seems to limit the maximum number of watches to
> max_user_watches which should be ~4% of lowmem if I am following the
> code properly. pwq_cache should be bound by the number of watches as
> well, or am I misunderstanding the code?
>

You are absolutely right that there is a per-user limit (~4% of total
memory if no highmem) on these caches. I think it is too generous
particularly in the scenario where jobs of multiple users are running
on the system and the administrator is reducing cost by overcommitting
the memory. This is unaccounted kernel memory and will not be
considered by the oom-killer. I think by accounting it to kmemcg, for
systems with kmem accounting enabled, we can provide better isolation
between jobs of different users.
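
For context, the change I have in mind is roughly just opting the two
epoll caches into kmem accounting. A sketch (assuming the existing
cache creation sites in fs/eventpoll.c; the exact flags there may
differ):

    epi_cache = kmem_cache_create("eventpoll_epi", sizeof(struct epitem),
            0, SLAB_HWCACHE_ALIGN|SLAB_PANIC|SLAB_ACCOUNT, NULL);

    pwq_cache = kmem_cache_create("eventpoll_pwq",
            sizeof(struct eppoll_entry), 0, SLAB_PANIC|SLAB_ACCOUNT, NULL);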


Re: [v10 3/6] mm, oom: cgroup-aware OOM killer

2017-10-04 Thread Shakeel Butt
> +
> +static void select_victim_memcg(struct mem_cgroup *root, struct oom_control *oc)
> +{
> +   struct mem_cgroup *iter;
> +
> +   oc->chosen_memcg = NULL;
> +   oc->chosen_points = 0;
> +
> +   /*
> +* The oom_score is calculated for leaf memory cgroups (including
> +* the root memcg).
> +*/
> +   rcu_read_lock();
> +   for_each_mem_cgroup_tree(iter, root) {
> +   long score;
> +
> +   if (memcg_has_children(iter))
> +   continue;

&& iter != root_mem_cgroup ?

> +
> +   score = oom_evaluate_memcg(iter, oc->nodemask, oc->totalpages);
> +
> +   /*
> +* Ignore empty and non-eligible memory cgroups.
> +*/
> +   if (score == 0)
> +   continue;
> +
> +   /*
> +* If there are inflight OOM victims, we don't need
> +* to look further for new victims.
> +*/
> +   if (score == -1) {
> +   oc->chosen_memcg = INFLIGHT_VICTIM;
> +   mem_cgroup_iter_break(root, iter);
> +   break;
> +   }
> +

Shouldn't there be a CSS_ONLINE check? Also, instead of the css_get at
the end, why not css_tryget_online() here and a css_put for the
previously selected one?
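
Something like the following is what I have in mind (just a sketch of
the suggestion, not a tested patch):

    if (score > oc->chosen_points) {
        if (!css_tryget_online(&iter->css))
            continue;
        /* drop the reference on the previously selected memcg */
        if (oc->chosen_memcg && oc->chosen_memcg != INFLIGHT_VICTIM)
            css_put(&oc->chosen_memcg->css);
        oc->chosen_points = score;
        oc->chosen_memcg = iter;
    }

and then the css_get() after the loop would no longer be needed.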

> +   if (score > oc->chosen_points) {
> +   oc->chosen_points = score;
> +   oc->chosen_memcg = iter;
> +   }
> +   }
> +
> +   if (oc->chosen_memcg && oc->chosen_memcg != INFLIGHT_VICTIM)
> +   css_get(&oc->chosen_memcg->css);
> +
> +   rcu_read_unlock();
> +}
> +


Re: [v10 3/6] mm, oom: cgroup-aware OOM killer

2017-10-04 Thread Shakeel Butt
>> > +   if (memcg_has_children(iter))
>> > +   continue;
>>
>> && iter != root_mem_cgroup ?
>
> Oh, sure. I had a stupid bug in my test script, which prevented me from
> catching this. Thanks!
>
> This should fix the problem.
> --
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 2e82625bd354..b3848bce4c86 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -2807,7 +2807,8 @@ static void select_victim_memcg(struct mem_cgroup 
> *root, struct oom_control *oc)
>  * We don't consider non-leaf non-oom_group memory cgroups
>  * as OOM victims.
>  */
> -   if (memcg_has_children(iter) && !mem_cgroup_oom_group(iter))
> +   if (memcg_has_children(iter) && iter != root_mem_cgroup &&
> +   !mem_cgroup_oom_group(iter))
> continue;

I think you are mixing the 3rd and 4th patches. The root_mem_cgroup
check should be in the 3rd while the oom_group stuff should be in the 4th.


>>
>> Shouldn't there be a CSS_ONLINE check? Also instead of css_get at the
>> end why not css_tryget_online() here and css_put for the previous
>> selected one.
>
> Hm, why do we need to check this? I do not see, how we can choose
> an OFFLINE memcg as a victim, tbh. Please, explain the problem.
>

Sorry about the confusion. There are two things. First, should we do a
css_get on the newly selected memcg within the for loop when we still
have a reference to it?

Second, for OFFLINE memcgs, you are right that oom_evaluate_memcg()
will return 0 for them. Maybe there is no need to call
oom_evaluate_memcg() for offlined memcgs at all.
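
i.e. something like this at the top of the loop (just a sketch):

    for_each_mem_cgroup_tree(iter, root) {
        if (!mem_cgroup_online(iter))
            continue;
        ...
    }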


[PATCH] fs, mm: account filp and names caches to kmemcg

2017-10-05 Thread Shakeel Butt
The allocations from filp and names kmem caches can be directly
triggered by user space applications. A buggy application can
consume a significant amount of unaccounted system memory. Though
we have not noticed such buggy applications in our production
environment, upon close inspection we found that a lot of machines
spend a very significant amount of memory on these caches. So, these
caches should be accounted to kmemcg.

Signed-off-by: Shakeel Butt <shake...@google.com>
---
 fs/dcache.c | 2 +-
 fs/file_table.c | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/fs/dcache.c b/fs/dcache.c
index f90141387f01..fb3449161063 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -3642,7 +3642,7 @@ void __init vfs_caches_init_early(void)
 void __init vfs_caches_init(void)
 {
names_cachep = kmem_cache_create("names_cache", PATH_MAX, 0,
-   SLAB_HWCACHE_ALIGN|SLAB_PANIC, NULL);
+   SLAB_HWCACHE_ALIGN|SLAB_PANIC|SLAB_ACCOUNT, NULL);
 
dcache_init();
inode_init();
diff --git a/fs/file_table.c b/fs/file_table.c
index 61517f57f8ef..567888cdf7d3 100644
--- a/fs/file_table.c
+++ b/fs/file_table.c
@@ -312,7 +312,7 @@ void put_filp(struct file *file)
 void __init files_init(void)
 {
filp_cachep = kmem_cache_create("filp", sizeof(struct file), 0,
-   SLAB_HWCACHE_ALIGN | SLAB_PANIC, NULL);
+   SLAB_HWCACHE_ALIGN | SLAB_PANIC | SLAB_ACCOUNT, NULL);
	percpu_counter_init(&nr_files, 0, GFP_KERNEL);
 }
 
-- 
2.14.2.920.gcf0c67979c-goog



[PATCH v2] fs, mm: account filp cache to kmemcg

2017-10-11 Thread Shakeel Butt
The allocations from filp cache can be directly triggered by user
space applications. A buggy application can consume a significant
amount of unaccounted system memory. Though we have not noticed
such buggy applications in our production environment, upon close
inspection we found that a lot of machines spend a very significant
amount of memory on these caches.

One way to limit allocations from filp cache is to set system level
limit of maximum number of open files. However this limit is shared
between different users on the system and one user can hog this
resource. To cater to that, we can charge filp to kmemcg, set the
system maximum very high, and let the memory limit of each user bound
the number of files they can open, indirectly limiting their
allocations from the filp cache.

One side effect of this change is that it will allow _sysctl() to
return ENOMEM while the man page of _sysctl() does not specify that.
However, the man page also discourages using _sysctl() at all.

Signed-off-by: Shakeel Butt <shake...@google.com>
---

Changelog since v1:
- removed names_cache charging to kmemcg

 fs/file_table.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/file_table.c b/fs/file_table.c
index 61517f57f8ef..567888cdf7d3 100644
--- a/fs/file_table.c
+++ b/fs/file_table.c
@@ -312,7 +312,7 @@ void put_filp(struct file *file)
 void __init files_init(void)
 {
filp_cachep = kmem_cache_create("filp", sizeof(struct file), 0,
-   SLAB_HWCACHE_ALIGN | SLAB_PANIC, NULL);
+   SLAB_HWCACHE_ALIGN | SLAB_PANIC | SLAB_ACCOUNT, NULL);
	percpu_counter_init(&nr_files, 0, GFP_KERNEL);
 }
 
-- 
2.15.0.rc0.271.g36b669edcc-goog



Re: [PATCH] fs, mm: account filp and names caches to kmemcg

2017-10-10 Thread Shakeel Butt
On Sun, Oct 8, 2017 at 11:24 PM, Michal Hocko <mho...@kernel.org> wrote:
> On Fri 06-10-17 12:33:03, Shakeel Butt wrote:
>> >>   names_cachep = kmem_cache_create("names_cache", PATH_MAX, 0,
>> >> - SLAB_HWCACHE_ALIGN|SLAB_PANIC, NULL);
>> >> + SLAB_HWCACHE_ALIGN|SLAB_PANIC|SLAB_ACCOUNT, NULL);
>> >
>> > I might be wrong but isn't name cache only holding temporary objects
>> > used for path resolution which are not stored anywhere?
>> >
>>
>> Even though they're temporary, many containers can together use a
>> significant amount of transient uncharged memory. We've seen machines
>> with 100s of MiBs in names_cache.
>
> Yes that might be possible but are we prepared for random ENOMEM from
> vfs calls which need to allocate a temporary name?
>

I looked at all the syscalls which invoke allocations from
'names_cache' and tried to narrow down which of them have man pages
that do not mention that they can return ENOMEM. I found a couple of
syscalls, like truncate(), readdir() & getdents(), whose man pages do
not mention ENOMEM but which this patch can make return ENOMEM.

>>
>> >>   filp_cachep = kmem_cache_create("filp", sizeof(struct file), 0,
>> >> - SLAB_HWCACHE_ALIGN | SLAB_PANIC, NULL);
>> >> + SLAB_HWCACHE_ALIGN | SLAB_PANIC | SLAB_ACCOUNT, 
>> >> NULL);
>> >>   percpu_counter_init(&nr_files, 0, GFP_KERNEL);
>> >>  }
>> >
>> > Don't we have a limit for the maximum number of open files?
>> >
>>
>> Yes, there is a system-wide limit on the maximum number of open files.
>> However this limit is shared between different users on the system and
>> one user can hog this resource. To cater to that, we set the maximum
>> limit very high and let the memory limit of each user limit the number
>> of files they can open.
>
> Similarly here. Are all syscalls allocating a fd prepared to return
> ENOMEM?

For filp, I found _sysctl(). However the man page says not to use it.

On Tue, Oct 10, 2017 at 2:10 AM, Michal Hocko <mho...@kernel.org> wrote:
> On Mon 09-10-17 20:17:54, Michal Hocko wrote:
>> the primary concern for this patch was whether we really need/want to
>> charge short therm objects which do not outlive a single syscall.
>
> Let me expand on this some more. What is the benefit of kmem accounting
> of such an object? It cannot stop any runaway as a syscall lifetime
> allocations are bound to number of processes which we kind of contain by
> other means.

We can contain this by limiting the number of processes or threads,
but for us applications having thousands of threads are very common.
So, limiting the number of threads/processes will not work.

> If we do account then we put a memory pressure due to
> something that cannot be reclaimed by no means. Even the memcg OOM
> killer would simply kick a single path while there might be others
> to consume the same type of memory.
>
> So what is the actual point in accounting these? Does it help to contain
> any workload better? What kind of workload?
>

I think the benefits will be isolation and more accurate billing. As I
have said before, we have observed 100s of MiBs in names_cache on many
machines and the cumulative amount is not something we can ignore as
just memory overhead.

> Or am I completely wrong and name objects can outlive a syscall
> considerably?
>

No, I didn't find any instance of the name objects outliving the syscall.

Anyways, we can discuss names_cache further; do you have any objection
to charging filp?


[PATCH] mm: mlock: remove lru_add_drain_all()

2017-10-18 Thread Shakeel Butt
Recently we have observed high latency in mlock() in our generic
library and noticed that users have started using tmpfs files even
without swap and the latency was due to expensive remote LRU cache
draining.

Is lru_add_drain_all() required by mlock()? The answer is no and the
reason it is still in mlock() is to rapidly move mlocked pages to
unevictable LRU. Without lru_add_drain_all() the mlocked pages which
were on pagevec at mlock() time will be moved to evictable LRUs but
will eventually be moved back to unevictable LRU by reclaim. So, we
can safely remove lru_add_drain_all() from mlock(). Also there is no
need for local lru_add_drain() as it will be called deep inside
__mm_populate() (in follow_page_pte()).

Signed-off-by: Shakeel Butt <shake...@google.com>
---
 mm/mlock.c | 5 -
 1 file changed, 5 deletions(-)

diff --git a/mm/mlock.c b/mm/mlock.c
index dfc6f1912176..3ceb2935d1e0 100644
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -669,8 +669,6 @@ static __must_check int do_mlock(unsigned long start, 
size_t len, vm_flags_t fla
if (!can_do_mlock())
return -EPERM;
 
-   lru_add_drain_all();/* flush pagevec */
-
len = PAGE_ALIGN(len + (offset_in_page(start)));
start &= PAGE_MASK;
 
@@ -797,9 +795,6 @@ SYSCALL_DEFINE1(mlockall, int, flags)
if (!can_do_mlock())
return -EPERM;
 
-   if (flags & MCL_CURRENT)
-   lru_add_drain_all();/* flush pagevec */
-
lock_limit = rlimit(RLIMIT_MEMLOCK);
lock_limit >>= PAGE_SHIFT;
 
-- 
2.15.0.rc1.287.g2b38de12cc-goog



Re: [PATCH] mm: mlock: remove lru_add_drain_all()

2017-10-19 Thread Shakeel Butt
On Wed, Oct 18, 2017 at 8:18 PM, Balbir Singh <bsinghar...@gmail.com> wrote:
> On Wed, 18 Oct 2017 16:17:30 -0700
> Shakeel Butt <shake...@google.com> wrote:
>
>> Recently we have observed high latency in mlock() in our generic
>> library and noticed that users have started using tmpfs files even
>> without swap and the latency was due to expensive remote LRU cache
>> draining.
>>
>> Is lru_add_drain_all() required by mlock()? The answer is no and the
>> reason it is still in mlock() is to rapidly move mlocked pages to
>> unevictable LRU. Without lru_add_drain_all() the mlocked pages which
>> were on pagevec at mlock() time will be moved to evictable LRUs but
>> will eventually be moved back to unevictable LRU by reclaim. So, we
>> can safely remove lru_add_drain_all() from mlock(). Also there is no
>> need for local lru_add_drain() as it will be called deep inside
>> __mm_populate() (in follow_page_pte()).
>>
>> Signed-off-by: Shakeel Butt <shake...@google.com>
>> ---
>
> Does this perturb statistics around LRU pages in cgroups and meminfo
> about where the pages actually belong?
>

Yes, it would, because the page can be on an evictable LRU until
reclaim moves it back to the unevictable LRU. However, even with the
draining there is a chance that the same thing can happen. For
example, after mlock drains all caches and before getting mmap_sem,
another cpu swaps in the page which the mlock syscall wants to mlock.
Without the draining the chance of this scenario will increase, but in
the worst case mlock() can fail to move at most PAGEVEC_SIZE *
(number of cpus - 1) pages to the unevictable LRU.
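To put a rough number on it: assuming PAGEVEC_SIZE is 14 (which I
believe is the current value) and 4KiB pages, even on a 56-cpu machine
that is at most 14 * 55 = 770 pages, i.e. roughly 3MiB of pages
temporarily sitting on the wrong LRU.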


Re: [PATCH] mm: mlock: remove lru_add_drain_all()

2017-10-19 Thread Shakeel Butt
> [...]
>>
>> Sorry for the confusion. I wanted to say that if the pages which are
>> being mlocked are on caches of remote cpus then lru_add_drain_all will
>> move them to their corresponding LRUs and then remaining functionality
>> of mlock will move them again from their evictable LRUs to unevictable
>> LRU.
>
> yes, but the point is that we are draining pages which might be not
> directly related to pages which _will_ be mlocked by the syscall. In
> fact those will stay on the cache. This is the primary reason why this
> draining doesn't make much sense.
>
> Or am I still misunderstanding what you are saying here?
>

lru_add_drain_all() will drain everything irrespective of whether
those pages are being mlocked or not.


Re: [PATCH] mm: mlock: remove lru_add_drain_all()

2017-10-19 Thread Shakeel Butt
On Thu, Oct 19, 2017 at 5:32 AM, Michal Hocko <mho...@kernel.org> wrote:
> On Wed 18-10-17 16:17:30, Shakeel Butt wrote:
>> Recently we have observed high latency in mlock() in our generic
>> library and noticed that users have started using tmpfs files even
>> without swap and the latency was due to expensive remote LRU cache
>> draining.
>
> some numbers would be really nice
>

On a production workload, customers complained that a single mlock()
call took around 10 seconds on mapped tmpfs files and the perf profile
showed lru_add_drain_all as the culprit.

I wasn't able to replicate the workload on my test machine, but a
simple workload of calling mlock() many times on an idle machine shows
a significant difference. Other than the workload, the machine size
(number of cores) also matters.

>> Is lru_add_drain_all() required by mlock()? The answer is no and the
>> reason it is still in mlock() is to rapidly move mlocked pages to
>> unevictable LRU.
>
> Is this really true? lru_add_drain_all will flush the previously cached
> LRU pages. We are not flushing after the pages have been faulted in so
> this might not do anything wrt. mlocked pages, right?
>

Sorry for the confusion. I wanted to say that if the pages which are
being mlocked are on caches of remote cpus then lru_add_drain_all will
move them to their corresponding LRUs and then remaining functionality
of mlock will move them again from their evictable LRUs to unevictable
LRU.

>> Without lru_add_drain_all() the mlocked pages which
>> were on pagevec at mlock() time will be moved to evictable LRUs but
>> will eventually be moved back to unevictable LRU by reclaim. So, we
>> can safely remove lru_add_drain_all() from mlock(). Also there is no
>> need for local lru_add_drain() as it will be called deep inside
>> __mm_populate() (in follow_page_pte()).
>
> Anyway, I do agree that lru_add_drain_all here is pointless. Either we
> should drain after the memory has been faulted in and mlocked or not at
> all. So the patch looks good to me I am just not sure about the
> changelog.
>
>> Signed-off-by: Shakeel Butt <shake...@google.com>
>> ---
>>  mm/mlock.c | 5 -
>>  1 file changed, 5 deletions(-)
>>
>> diff --git a/mm/mlock.c b/mm/mlock.c
>> index dfc6f1912176..3ceb2935d1e0 100644
>> --- a/mm/mlock.c
>> +++ b/mm/mlock.c
>> @@ -669,8 +669,6 @@ static __must_check int do_mlock(unsigned long start, 
>> size_t len, vm_flags_t fla
>>   if (!can_do_mlock())
>>   return -EPERM;
>>
>> - lru_add_drain_all();/* flush pagevec */
>> -
>>   len = PAGE_ALIGN(len + (offset_in_page(start)));
>>   start &= PAGE_MASK;
>>
>> @@ -797,9 +795,6 @@ SYSCALL_DEFINE1(mlockall, int, flags)
>>   if (!can_do_mlock())
>>   return -EPERM;
>>
>> - if (flags & MCL_CURRENT)
>> - lru_add_drain_all();/* flush pagevec */
>> -
>>   lock_limit = rlimit(RLIMIT_MEMLOCK);
>>   lock_limit >>= PAGE_SHIFT;
>>
>> --
>> 2.15.0.rc1.287.g2b38de12cc-goog
>>
>
> --
> Michal Hocko
> SUSE Labs


Re: [PATCH] mm: mlock: remove lru_add_drain_all()

2017-10-19 Thread Shakeel Butt
On Wed, Oct 18, 2017 at 11:24 PM, Anshuman Khandual
<khand...@linux.vnet.ibm.com> wrote:
> On 10/19/2017 04:47 AM, Shakeel Butt wrote:
>> Recently we have observed high latency in mlock() in our generic
>> library and noticed that users have started using tmpfs files even
>> without swap and the latency was due to expensive remote LRU cache
>> draining.
>
> With and without this I patch I dont see much difference in number
> of instructions executed in the kernel for mlock() system call on
> POWER8 platform just after reboot (all the pagevecs might not been
> filled by then though). There is an improvement but its very less.
>
> Could you share your latency numbers and how this patch is making
> them better.
>

The latency is very dependent on the workload and the number of cores
on the machine. On a production workload, the customers were
complaining that a single mlock() was taking around 10 seconds on
tmpfs files which were already in memory.

>>
>> Is lru_add_drain_all() required by mlock()? The answer is no and the
>> reason it is still in mlock() is to rapidly move mlocked pages to
>> unevictable LRU. Without lru_add_drain_all() the mlocked pages which
>> were on pagevec at mlock() time will be moved to evictable LRUs but
>> will eventually be moved back to unevictable LRU by reclaim. So, we
>
> Wont this affect the performance during reclaim ?
>

Yes, but reclaim is already a slow path, and to seriously impact
reclaim we would need a very antagonistic workload which is very
hard to trigger (i.e. for each mlock on a cpu, the pages being mlocked
happen to be on the caches of other cpus).

>> can safely remove lru_add_drain_all() from mlock(). Also there is no
>> need for local lru_add_drain() as it will be called deep inside
>> __mm_populate() (in follow_page_pte()).
>
> The following commit which originally added lru_add_drain_all()
> during mlock() and mlockall() has similar explanation.
>
> 8891d6da ("mm: remove lru_add_drain_all() from the munlock path")
>
> "In addition, this patch add lru_add_drain_all() to sys_mlock()
> and sys_mlockall().  it isn't must.  but it reduce the failure
> of moving to unevictable list.  its failure can rescue in
> vmscan later.  but reducing is better."
>
> Which sounds like either we have to handle the active to inactive
> LRU movement during reclaim or it can be done here to speed up
> reclaim later on.
>


Re: [PATCH] mm: mlock: remove lru_add_drain_all()

2017-10-19 Thread Shakeel Butt
On Thu, Oct 19, 2017 at 3:18 AM, Kirill A. Shutemov
<kir...@shutemov.name> wrote:
> On Wed, Oct 18, 2017 at 04:17:30PM -0700, Shakeel Butt wrote:
>> Recently we have observed high latency in mlock() in our generic
>> library and noticed that users have started using tmpfs files even
>> without swap and the latency was due to expensive remote LRU cache
>> draining.
>
> Hm. Isn't the point of mlock() to pay price upfront and make execution
> smoother after this?
>
> With this you shift latency onto reclaim (and future memory allocation).
>
> I'm not sure if it's a win.
>

It will not shift latency to fast path memory allocation. Reclaim
happens on the slow path and only reclaim may see more mlocked pages.
Please note that the very antagonistic workload, i.e. for each mlock
on a cpu the pages being mlocked happen to be on the caches of other
cpus, is very hard to trigger.


Re: [PATCH] mm: mlock: remove lru_add_drain_all()

2017-10-19 Thread Shakeel Butt
On Thu, Oct 19, 2017 at 1:13 PM, Michal Hocko <mho...@kernel.org> wrote:
> On Thu 19-10-17 12:46:50, Shakeel Butt wrote:
>> > [...]
>> >>
>> >> Sorry for the confusion. I wanted to say that if the pages which are
>> >> being mlocked are on caches of remote cpus then lru_add_drain_all will
>> >> move them to their corresponding LRUs and then remaining functionality
>> >> of mlock will move them again from their evictable LRUs to unevictable
>> >> LRU.
>> >
>> > yes, but the point is that we are draining pages which might be not
>> > directly related to pages which _will_ be mlocked by the syscall. In
>> > fact those will stay on the cache. This is the primary reason why this
>> > draining doesn't make much sense.
>> >
>> > Or am I still misunderstanding what you are saying here?
>> >
>>
>> lru_add_drain_all() will drain everything irrespective of whether
>> those pages are being mlocked or not.
>
> yes, let me be more specific. lru_add_drain_all will drain everything
> that has been cached at the time mlock is called. And that is not really
> related to the memory which will be faulted in (and cached) and mlocked
> by the syscall itself. Does it make more sense now?
>

Yes, you are absolutely right. Sorry for the confusion.


Re: [PATCH 1/2] mm,vmscan: Kill global shrinker lock.

2017-11-14 Thread Shakeel Butt
On Tue, Nov 14, 2017 at 4:56 PM, Minchan Kim  wrote:
> On Tue, Nov 14, 2017 at 06:37:42AM +0900, Tetsuo Handa wrote:
>> When shrinker_rwsem was introduced, it was assumed that
>> register_shrinker()/unregister_shrinker() are really unlikely paths
>> which are called during initialization and tear down. But nowadays,
>> register_shrinker()/unregister_shrinker() might be called regularly.
>> This patch prepares for allowing parallel registration/unregistration
>> of shrinkers.
>>
>> Since do_shrink_slab() can reschedule, we cannot protect shrinker_list
>> using one RCU section. But using atomic_inc()/atomic_dec() for each
>> do_shrink_slab() call will not impact so much.
>>
>> This patch uses polling loop with short sleep for unregister_shrinker()
>> rather than wait_on_atomic_t(), for we can save reader's cost (plain
>> atomic_dec() compared to atomic_dec_and_test()), we can expect that
>> do_shrink_slab() of unregistering shrinker likely returns shortly, and
>> we can avoid khungtaskd warnings when do_shrink_slab() of unregistering
>> shrinker unexpectedly took so long.
>>
>> Signed-off-by: Tetsuo Handa 
>
> Before reviewing this patch, can't we solve the problem with more
> simple way? Like this.
>
> Shakeel, What do you think?
>

Seems simple enough. I will run my test (running a fork bomb in one
memcg and separately timing a mount operation) and update if the
numbers differ significantly.

> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 13d711dd8776..cbb624cb9baa 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -498,6 +498,14 @@ static unsigned long shrink_slab(gfp_t gfp_mask, int nid,
> sc.nid = 0;
>
>               freed += do_shrink_slab(&sc, shrinker, nr_scanned, nr_eligible);
> +   /*
> +* bail out if someone want to register a new shrinker to prevent
> +* long time stall by parallel ongoing shrinking.
> +*/
> +   if (rwsem_is_contended(&shrinker_rwsem)) {
> +   freed = 1;

freed = freed ?: 1;

> +   break;
> +   }
> }
>
>       up_read(&shrinker_rwsem);


Re: [PATCH 1/2] mm,vmscan: Kill global shrinker lock.

2017-11-15 Thread Shakeel Butt
On Wed, Nov 15, 2017 at 4:46 PM, Minchan Kim <minc...@kernel.org> wrote:
> On Tue, Nov 14, 2017 at 10:28:10PM -0800, Shakeel Butt wrote:
>> On Tue, Nov 14, 2017 at 4:56 PM, Minchan Kim <minc...@kernel.org> wrote:
>> > On Tue, Nov 14, 2017 at 06:37:42AM +0900, Tetsuo Handa wrote:
>> >> When shrinker_rwsem was introduced, it was assumed that
>> >> register_shrinker()/unregister_shrinker() are really unlikely paths
>> >> which are called during initialization and tear down. But nowadays,
>> >> register_shrinker()/unregister_shrinker() might be called regularly.
>> >> This patch prepares for allowing parallel registration/unregistration
>> >> of shrinkers.
>> >>
>> >> Since do_shrink_slab() can reschedule, we cannot protect shrinker_list
>> >> using one RCU section. But using atomic_inc()/atomic_dec() for each
>> >> do_shrink_slab() call will not impact so much.
>> >>
>> >> This patch uses polling loop with short sleep for unregister_shrinker()
>> >> rather than wait_on_atomic_t(), for we can save reader's cost (plain
>> >> atomic_dec() compared to atomic_dec_and_test()), we can expect that
>> >> do_shrink_slab() of unregistering shrinker likely returns shortly, and
>> >> we can avoid khungtaskd warnings when do_shrink_slab() of unregistering
>> >> shrinker unexpectedly took so long.
>> >>
>> >> Signed-off-by: Tetsuo Handa <penguin-ker...@i-love.sakura.ne.jp>
>> >
>> > Before reviewing this patch, can't we solve the problem with more
>> > simple way? Like this.
>> >
>> > Shakeel, What do you think?
>> >
>>
>> Seems simple enough. I will run my test (running a fork bomb in one
>> memcg and separately timing a mount operation) and update if the
>> numbers differ significantly.
>
> Thanks.
>
>>
>> > diff --git a/mm/vmscan.c b/mm/vmscan.c
>> > index 13d711dd8776..cbb624cb9baa 100644
>> > --- a/mm/vmscan.c
>> > +++ b/mm/vmscan.c
>> > @@ -498,6 +498,14 @@ static unsigned long shrink_slab(gfp_t gfp_mask, int 
>> > nid,
>> > sc.nid = 0;
>> >
>> >            freed += do_shrink_slab(&sc, shrinker, nr_scanned, nr_eligible);
>> > +   /*
>> > +* bail out if someone want to register a new shrinker to prevent
>> > +* long time stall by parallel ongoing shrinking.
>> > +*/
>> > +   if (rwsem_is_contended(&shrinker_rwsem)) {
>> > +   freed = 1;
>>
>> freed = freed ?: 1;
>
> Yub.

Thanks Minchan, you can add

Reviewed-and-tested-by: Shakeel Butt <shake...@google.com>


Re: [PATCH] mm, mlock, vmscan: no more skipping pagevecs

2017-11-15 Thread Shakeel Butt
Ping, I would really appreciate comments on this patch.

On Sat, Nov 4, 2017 at 3:43 PM, Shakeel Butt <shake...@google.com> wrote:
> When a thread mlocks an address space backed by file, a new
> page is allocated (assuming file page is not in memory), added
> to the local pagevec (lru_add_pvec), I/O is triggered and the
> thread then sleeps on the page. On I/O completion, the thread
> can wake on a different CPU, the mlock syscall will then set
> the PageMlocked() bit of the page but will not be able to put
> that page in unevictable LRU as the page is on the pagevec of
> a different CPU. Even on drain, that page will go to evictable
> LRU because the PageMlocked() bit is not checked on pagevec
> drain.
>
> The page will eventually go to the right LRU on reclaim but the
> LRU stats will remain skewed for a long time.
>
> However, this issue does not happen for anon pages on swap
> because unlike file pages, anon pages are not added to pagevec
> until they have been fully swapped in. Also the fault handler
> uses vm_flags to set the PageMlocked() bit of such anon pages
> even before returning to mlock() syscall and mlocked pages will
> skip pagevecs and directly be put into unevictable LRU. No such
> luck for file pages.
>
> One way to resolve this issue, is to somehow plumb vm_flags from
> filemap_fault() to add_to_page_cache_lru() which will then skip
> the pagevec for pages of VM_LOCKED vma and directly put them to
> unevictable LRU. However this patch took a different approach.
>
> All the pages, even unevictable, will be added to the pagevecs
> and on the drain, the pages will be added on their LRUs correctly
> by checking their evictability. This resolves the mlocked file
> pages on pagevec of other CPUs issue because when those pagevecs
> will be drained, the mlocked file pages will go to unevictable
> LRU. Also this makes the race with munlock easier to resolve
> because the pagevec drains happen in LRU lock.
>
> There is one (good) side effect though. Without this patch, the
> pages allocated for System V shared memory segment are added to
> evictable LRUs even after shmctl(SHM_LOCK) on that segment. This
> patch will correctly put such pages to unevictable LRU.
>
> Signed-off-by: Shakeel Butt <shake...@google.com>
> ---
>  include/linux/swap.h |  2 --
>  mm/swap.c| 68 
> +---
>  mm/vmscan.c  | 59 +
>  3 files changed, 34 insertions(+), 95 deletions(-)
>
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index f02fb5db8914..9b31d04914eb 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -326,8 +326,6 @@ extern void deactivate_file_page(struct page *page);
>  extern void mark_page_lazyfree(struct page *page);
>  extern void swap_setup(void);
>
> -extern void add_page_to_unevictable_list(struct page *page);
> -
>  extern void lru_cache_add_active_or_unevictable(struct page *page,
> struct vm_area_struct *vma);
>
> diff --git a/mm/swap.c b/mm/swap.c
> index a77d68f2c1b6..776fb33e81d3 100644
> --- a/mm/swap.c
> +++ b/mm/swap.c
> @@ -445,30 +445,6 @@ void lru_cache_add(struct page *page)
> __lru_cache_add(page);
>  }
>
> -/**
> - * add_page_to_unevictable_list - add a page to the unevictable list
> - * @page:  the page to be added to the unevictable list
> - *
> - * Add page directly to its zone's unevictable list.  To avoid races with
> - * tasks that might be making the page evictable, through eg. munlock,
> - * munmap or exit, while it's not on the lru, we want to add the page
> - * while it's locked or otherwise "invisible" to other tasks.  This is
> - * difficult to do when using the pagevec cache, so bypass that.
> - */
> -void add_page_to_unevictable_list(struct page *page)
> -{
> -   struct pglist_data *pgdat = page_pgdat(page);
> -   struct lruvec *lruvec;
> -
> -   spin_lock_irq(&pgdat->lru_lock);
> -   lruvec = mem_cgroup_page_lruvec(page, pgdat);
> -   ClearPageActive(page);
> -   SetPageUnevictable(page);
> -   SetPageLRU(page);
> -   add_page_to_lru_list(page, lruvec, LRU_UNEVICTABLE);
> -   spin_unlock_irq(&pgdat->lru_lock);
> -}
> -
>  /**
>   * lru_cache_add_active_or_unevictable
>   * @page:  the page to be added to LRU
> @@ -484,13 +460,9 @@ void lru_cache_add_active_or_unevictable(struct page 
> *page,
>  {
> VM_BUG_ON_PAGE(PageLRU(page), page);
>
> -   if (likely((vma->vm_flags & (VM_LOCKED | VM_SPECIAL)) != VM_LOCKED)) {
> +   if (likely((vma->vm_flags & (VM_LOCKED | VM_SPECIAL)) != VM_LOCKED))
> 

[PATCH] mm, memcg: fix mem_cgroup_swapout() for THPs

2017-11-28 Thread Shakeel Butt
The commit d6810d730022 ("memcg, THP, swap: make mem_cgroup_swapout()
support THP") changed mem_cgroup_swapout() to support transparent huge
page (THP). However the patch missed one location which should be
changed for correctly handling THPs. The resulting bug will cause the
memory cgroups whose THPs were swapped out to become zombies on
deletion.

Fixes: d6810d730022 ("memcg, THP, swap: make mem_cgroup_swapout() support THP")
Signed-off-by: Shakeel Butt <shake...@google.com>
Cc: sta...@vger.kernel.org
---
 mm/memcontrol.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 50e6906314f8..ac2ffd5e02b9 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -6044,7 +6044,7 @@ void mem_cgroup_swapout(struct page *page, swp_entry_t 
entry)
memcg_check_events(memcg, page);
 
if (!mem_cgroup_is_root(memcg))
-   css_put(&memcg->css);
+   css_put_many(&memcg->css, nr_entries);
 }
 
 /**
-- 
2.15.0.417.g466bffb3ac-goog



Re: [PATCH] mm, memcg: fix mem_cgroup_swapout() for THPs

2017-11-28 Thread Shakeel Butt
On Tue, Nov 28, 2017 at 12:00 PM, Michal Hocko <mho...@kernel.org> wrote:
> On Tue 28-11-17 08:19:41, Shakeel Butt wrote:
>> The commit d6810d730022 ("memcg, THP, swap: make mem_cgroup_swapout()
>> support THP") changed mem_cgroup_swapout() to support transparent huge
>> page (THP). However the patch missed one location which should be
>> changed for correctly handling THPs. The resulting bug will cause the
>> memory cgroups whose THPs were swapped out to become zombies on
>> deletion.
>
> Very well spotted! Have you seen this triggering or you found it by the
> code inspection?
>

By code inspection.


Re: XArray documentation

2017-11-24 Thread Shakeel Butt
On Fri, Nov 24, 2017 at 10:01 AM, Martin Steigerwald
 wrote:
> Hi Matthew.
>
> Matthew Wilcox - 24.11.17, 18:03:
>> On Fri, Nov 24, 2017 at 05:50:41PM +0100, Martin Steigerwald wrote:
>> > Matthew Wilcox - 24.11.17, 02:16:
>> > > ==
>> > > XArray
>> > > ==
>> > >
>> > > Overview
>> > > 
>> > >
>> > > The XArray is an array of ULONG_MAX entries.  Each entry can be either
>> > > a pointer, or an encoded value between 0 and LONG_MAX.  It is efficient
>> > > when the indices used are densely clustered; hashing the object and
>> > > using the hash as the index will not perform well.  A
>> > > freshly-initialised
>> > > XArray contains a NULL pointer at every index.  There is no difference
>> > > between an entry which has never been stored to and an entry which has
>> > > most
>> > > recently had NULL stored to it.
>> >
>> > I am no kernel developer (just provided a tiny bit of documentation a long
>> > time ago)… but on reading into this, I missed:
>> >
>> > What is it about? And what is it used for?
>> >
>> > "Overview" appears to be already a description of the actual
>> > implementation
>> > specifics, instead of… well an overview.
>> >
>> > Of course, I am sure you all know what it is for… but someone who wants to
>> > learn about the kernel is likely to be confused by such a start.
> […]
>> Thank you for your comment.  I'm clearly too close to it because even
>> after reading your useful critique, I'm not sure what to change.  Please
>> help me!
>
> And likely I am too far away to and do not understand enough of it to provide
> more concrete suggestions, but let me try. (I do understand some programming
> stuff like what an array is, what a pointer what an linked list or a tree is
> or… so I am not completely novice here. I think the documentation should not
> cover any of these basics.)
>
>> Maybe it's that I've described the abstraction as if it's the
>> implementation and put too much detail into the overview.  This might
>> be clearer?
>>
>> The XArray is an abstract data type which behaves like an infinitely
>> large array of pointers.  The index into the array is an unsigned long.
>> A freshly-initialised XArray contains a NULL pointer at every index.
>
> Yes, I think this is clearer already.
>
> Maybe with a few sentences on "Why does the kernel provide this?", "Where is
> it used?" (if already known), "What use case is it suitable for – if I want to
> implement something into the kernel (or in user space?) ?" and probably "How
> does it differ from user data structures the kernel provides?"
>

Adding on to Martin's questions: basically, what is the motivation
behind it? It seems like a replacement for the radix tree, so it would
be good to write why the radix tree was not good enough or which use
cases it could not solve, and how the XArray solves those
issues/use-cases. Also, if you know, which scenarios or use-cases the
XArray will not be an optimal solution for.


Re: [PATCH] mm: Make count list_lru_one::nr_items lockless

2017-11-29 Thread Shakeel Butt
On Fri, Sep 29, 2017 at 1:15 AM, Kirill Tkhai  wrote:
> On 29.09.2017 00:02, Andrew Morton wrote:
>> On Thu, 28 Sep 2017 10:48:55 +0300 Kirill Tkhai  wrote:
>>
> This patch aims to make super_cache_count() (and other functions,
> which count LRU nr_items) more effective.
> It allows list_lru_node::memcg_lrus to be RCU-accessed, and makes
> __list_lru_count_one() count nr_items lockless to minimize
> overhead introduced by locking operation, and to make parallel
> reclaims more scalable.

 And...  what were the effects of the patch?  Did you not run the same
 performance tests after applying it?
>>>
>>> I've just detected the such high usage of shrink slab on production node. 
>>> It's rather
>>> difficult to make it use another kernel, than it uses, only kpatches are 
>>> possible.
>>> So, I haven't estimated how it acts on node's performance.
>>> On test node I see, that the patch obviously removes raw_spin_lock from 
>>> perf profile.
>>> So, it's a little bit untested in this way.
>>
>> Well that's a problem.  The patch increases list_lru.o text size by a
>> lot (4800->5696) which will have a cost.  And we don't have proof that
>> any benefit is worth that cost.  It shouldn't be too hard to cook up a
>> synthetic test to trigger memcg slab reclaim and then run a
>> before-n-after benchmark?
>
> Ok, then, please, ignore this for a while, I'll try to do it a little bit 
> later.
>

I rebased this patch on Linus' tree (replacing kfree_rcu with call_rcu
as there is no kvfree_rcu) and did some experiments. I think the patch
is worth including.

Setup: running a fork-bomb in a memcg of 200MiB on an 8GiB, 4-vcpu VM
and recording the trace with 'perf record -g -a'.

The trace without the patch:

+  34.19% fb.sh  [kernel.kallsyms]  [k] queued_spin_lock_slowpath
+  30.77% fb.sh  [kernel.kallsyms]  [k] _raw_spin_lock
+   3.53% fb.sh  [kernel.kallsyms]  [k] list_lru_count_one
+   2.26% fb.sh  [kernel.kallsyms]  [k] super_cache_count
+   1.68% fb.sh  [kernel.kallsyms]  [k] shrink_slab
+   0.59% fb.sh  [kernel.kallsyms]  [k] down_read_trylock
+   0.48% fb.sh  [kernel.kallsyms]  [k] _raw_spin_unlock_irqrestore
+   0.38% fb.sh  [kernel.kallsyms]  [k] shrink_node_memcg
+   0.32% fb.sh  [kernel.kallsyms]  [k] queue_work_on
+   0.26% fb.sh  [kernel.kallsyms]  [k] count_shadow_nodes

With the patch:

+   0.16% swapper  [kernel.kallsyms][k] default_idle
+   0.13% oom_reaper  [kernel.kallsyms][k] mutex_spin_on_owner
+   0.05% perf  [kernel.kallsyms][k] copy_user_generic_string
+   0.05% init.real  [kernel.kallsyms][k] wait_consider_task
+   0.05% kworker/0:0  [kernel.kallsyms][k] finish_task_switch
+   0.04% kworker/2:1  [kernel.kallsyms][k] finish_task_switch
+   0.04% kworker/3:1  [kernel.kallsyms][k] finish_task_switch
+   0.04% kworker/1:0  [kernel.kallsyms][k] finish_task_switch
+   0.03% binary  [kernel.kallsyms][k] copy_page


Kirill, can you resend your patch with this info or do you want me to
send the rebased patch?


Re: [RFC PATCH] mm: memcontrol: memory+swap accounting for cgroup-v2

2017-12-19 Thread Shakeel Butt
On Tue, Dec 19, 2017 at 4:49 AM, Michal Hocko <mho...@kernel.org> wrote:
> On Mon 18-12-17 16:01:31, Shakeel Butt wrote:
>> The memory controller in cgroup v1 provides the memory+swap (memsw)
>> interface to account to the combined usage of memory and swap of the
>> jobs. The memsw interface allows the users to limit or view the
>> consistent memory usage of their jobs irrespective of the presence of
>> swap on the system (consistent OOM and memory reclaim behavior). The
>> memory+swap accounting makes the job easier for centralized systems
>> doing resource usage monitoring, prediction or anomaly detection.
>>
>> In cgroup v2, the 'memsw' interface was dropped and a new 'swap'
>> interface has been introduced which allows to limit the actual usage of
>> swap by the job. For the systems where swap is a limited resource,
>> 'swap' interface can be used to fairly distribute the swap resource
>> between different jobs. There is no easy way to limit the swap usage
>> using the 'memsw' interface.
>>
>> However for the systems where the swap is cheap and can be increased
>> dynamically (like remote swap and swap on zram), the 'memsw' interface
>> is much more appropriate as it makes swap transparent to the jobs and
>> gives consistent memory usage history to centralized monitoring systems.
>>
>> This patch adds memsw interface to cgroup v2 memory controller behind a
>> mount option 'memsw'. The memsw interface is mutually exclusive with
>> the existing swap interface. When 'memsw' is enabled, reading or writing
>> to 'swap' interface files will return -ENOTSUPP and vice versa. Enabling
>> or disabling memsw through remounting cgroup v2, will only be effective
>> if there are no descendants of the root cgroup.
>>
>> When memsw accounting is enabled then "memory.high" is compared with
>> memory+swap usage. So, when the allocating job's memsw usage hits its
>> high mark, the job will be throttled by triggering memory reclaim.
>
> From a quick look, this looks like a mess.

The main motivation behind this patch is to show that memsw has
genuine use-cases. How to provide memsw is still at the RFC stage.
Suggestions and comments are welcome.

> We have agreed to go with
> the current scheme for some good reasons.

Yes I agree, when the swap is a limited resource the current 'swap'
interface should be used to fairly distribute it between different
jobs.

> There are cons/pros for both
> approaches but I am not convinced we should convolute the user API for
> the usecase you describe.
>

Yes, there are pros & cons, therefore we should give users the option
to select the API that is better suited for their use-cases and
environment. Both approaches are not interchangeable. We use memsw
internally for use-cases I mentioned in commit message. This is one of
the main blockers for us to even consider cgroup-v2 for memory
controller.

>> Signed-off-by: Shakeel Butt <shake...@google.com>


Re: [RFC PATCH] mm: memcontrol: memory+swap accounting for cgroup-v2

2017-12-19 Thread Shakeel Butt
On Tue, Dec 19, 2017 at 7:24 AM, Tejun Heo <t...@kernel.org> wrote:
> Hello,
>
> On Tue, Dec 19, 2017 at 07:12:19AM -0800, Shakeel Butt wrote:
>> Yes, there are pros & cons, therefore we should give users the option
>> to select the API that is better suited for their use-cases and
>
> Heh, that's not how API decisions should be made.  The long term
> outcome would be really really bad.
>
>> environment. Both approaches are not interchangeable. We use memsw
>> internally for use-cases I mentioned in commit message. This is one of
>> the main blockers for us to even consider cgroup-v2 for memory
>> controller.
>
> Let's concentrate on the use case.  I couldn't quite understand what
> was missing from your description.  You said that it'd make things
> easier for the centralized monitoring system which isn't really a
> description of a use case.  Can you please go into more details
> focusing on the eventual goals (rather than what's currently
> implemented)?
>

The goal is an interface that provides:

1. Consistent memory usage history
2. Consistent memory limit enforcement behavior

By consistent I mean, the environment should not affect the usage
history. For example, the presence or absence of swap or memory
pressure on the system should not affect the memory usage history i.e.
making environment an invariant. Similarly, the environment should not
affect the memcg OOM or memcg memory reclaim behavior.

To provide consistent memory usage history using the current
cgroup-v2's 'swap' interface, an additional metric expressing the
intersection of memory and swap has to be exposed. Basically memsw is
the union of memory and swap, so with that additional metric the union
could be computed from the existing counters. However, for consistent
memory limit enforcement, I don't think there is an easy way to use
the current 'swap' interface.
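
To spell out the union point as a sketch: with a hypothetical counter
for the overlap (anon pages currently charged to both memory and
swap), one could compute

  memsw.current = memory.current + memory.swap.current - overlap

from the existing counters.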


Re: [RFC PATCH] mm: memcontrol: memory+swap accounting for cgroup-v2

2017-12-19 Thread Shakeel Butt
On Tue, Dec 19, 2017 at 1:41 PM, Tejun Heo <t...@kernel.org> wrote:
> Hello,
>
> On Tue, Dec 19, 2017 at 10:25:12AM -0800, Shakeel Butt wrote:
>> Making the runtime environment an invariant is very critical for
>> easing the management of a job whose instances run on different
>> clusters across the world. Some clusters might have different types of
>> swap installed while some might not have any at all, and the
>> availability of swap can be dynamic (i.e. swap medium outage).
>>
>> So, if users want to run multiple instances of a job across multiple
>> clusters, they should be able to specify the limits of their jobs
>> irrespective of any knowledge of the clusters. The best case would be
>> that they just submit their jobs without any config and the system
>> figures out the right limit and enforces it. And to figure out the
>> right limit and enforce it, consistent memory usage history and
>> consistent memory limit enforcement are very critical.
>
> I'm having a hard time extracting anything concrete from your
> explanation on why memsw is required.  Can you please ELI5 with some
> examples?
>

Suppose a user wants to run multiple instances of a specific job in
different datacenters and s/he has a budget of 100MiB for each
instance. The instances are scheduled on the requested datacenters and
the scheduler has set the memory limit of those instances to 100MiB.
Now, some datacenters have swap deployed, so there, let's say, the
swap limit of those instances is set according to swap medium
availability. In this setting the user will see inconsistent memcg OOM
behavior. Some of the instances see OOMs at 100MiB usage (suppose only
anon memory) while some will see OOMs way above 100MiB due to swap.
So, the user is required to have internal knowledge of the datacenters
(like which have swap or not, and the swap type) and has to set the
limits accordingly, which increases the chance of config bugs.

Also, different types and sizes of swap mediums in the datacenters
will further complicate the configuration. One datacenter might have
SSD as swap, another might be doing swap on zram and a third might be
doing swap on nvdimm. Each can have a different size and can be
assigned to jobs differently. So, it is possible that instances of the
same job might be assigned different swap limits in different
datacenters.


Re: [RFC PATCH] mm: memcontrol: memory+swap accounting for cgroup-v2

2017-12-19 Thread Shakeel Butt
On Tue, Dec 19, 2017 at 9:33 AM, Tejun Heo <t...@kernel.org> wrote:
> Hello,
>
> On Tue, Dec 19, 2017 at 09:23:29AM -0800, Shakeel Butt wrote:
>> To provide consistent memory usage history using the current
>> cgroup-v2's 'swap' interface, an additional metric expressing the
>> intersection of memory and swap has to be exposed. Basically memsw is
>> the union of memory and swap. So, if that additional metric can be
>
> Exposing anonymous pages with swap backing sounds pretty trivial.
>
>> used to find the union. However for consistent memory limit
>> enforcement, I don't think there is an easy way to use current 'swap'
>> interface.
>
> Can you please go into details on why this is important?  I get that
> you can't do it as easily w/o memsw but I don't understand why this is
> a critical feature.  Why is that?
>

Making the runtime environment an invariant is very critical for
easing the management of a job whose instances run on different
clusters across the world. Some clusters might have different types of
swap installed while some might not have any at all, and the
availability of swap can be dynamic (i.e. swap medium outage).

So, if users want to run multiple instances of a job across multiple
clusters, they should be able to specify the limits of their jobs
irrespective of any knowledge of the clusters. The best case would be
that they just submit their jobs without any config and the system
figures out the right limit and enforces it. And to figure out the
right limit and enforce it, consistent memory usage history and
consistent memory limit enforcement are very critical.

thanks,
Shakeel


[RFC PATCH] mm: memcontrol: memory+swap accounting for cgroup-v2

2017-12-18 Thread Shakeel Butt
The memory controller in cgroup v1 provides the memory+swap (memsw)
interface to account to the combined usage of memory and swap of the
jobs. The memsw interface allows the users to limit or view the
consistent memory usage of their jobs irrespective of the presence of
swap on the system (consistent OOM and memory reclaim behavior). The
memory+swap accounting makes the job easier for centralized systems
doing resource usage monitoring, prediction or anomaly detection.

In cgroup v2, the 'memsw' interface was dropped and a new 'swap'
interface has been introduced which allows to limit the actual usage of
swap by the job. For the systems where swap is a limited resource,
'swap' interface can be used to fairly distribute the swap resource
between different jobs. There is no easy way to limit the swap usage
using the 'memsw' interface.

However for the systems where the swap is cheap and can be increased
dynamically (like remote swap and swap on zram), the 'memsw' interface
is much more appropriate as it makes swap transparent to the jobs and
gives consistent memory usage history to centralized monitoring systems.

This patch adds memsw interface to cgroup v2 memory controller behind a
mount option 'memsw'. The memsw interface is mutually exclusive with
the existing swap interface. When 'memsw' is enabled, reading or writing
to 'swap' interface files will return -ENOTSUPP and vice versa. Enabling
or disabling memsw through remounting cgroup v2, will only be effective
if there are no descendants of the root cgroup.

When memsw accounting is enabled then "memory.high" is compared with
memory+swap usage. So, when the allocating job's memsw usage hits its
high mark, the job will be throttled by triggering memory reclaim.

Signed-off-by: Shakeel Butt <shake...@google.com>
---
 Documentation/cgroup-v2.txt |  69 
 include/linux/cgroup-defs.h |   5 +++
 kernel/cgroup/cgroup.c  |  12 +
 mm/memcontrol.c | 107 +---
 4 files changed, 157 insertions(+), 36 deletions(-)

diff --git a/Documentation/cgroup-v2.txt b/Documentation/cgroup-v2.txt
index 9a4f2e54a97d..1cbc51203b00 100644
--- a/Documentation/cgroup-v2.txt
+++ b/Documentation/cgroup-v2.txt
@@ -169,6 +169,12 @@ cgroup v2 currently supports the following mount options.
ignored on non-init namespace mounts.  Please refer to the
Delegation section for details.
 
+  memsw
+
+   Allows the enforcement of a memory+swap limit on cgroups. This
+   option is system wide: it can only be set at mount time and can
+   only be modified through a remount from the init namespace, and
+   only if the root cgroup has no children.
 
 Organizing Processes and Threads
 
@@ -1020,6 +1026,10 @@ PAGE_SIZE multiple when read back.
Going over the high limit never invokes the OOM killer and
under extreme conditions the limit may be breached.
 
+   If memsw (memory+swap) enforcement is enabled then the
+   cgroup's memory+swap usage is checked against memory.high
+   instead of just memory.
+
   memory.max
A read-write single value file which exists on non-root
cgroups.  The default is "max".
@@ -1207,18 +1217,39 @@ PAGE_SIZE multiple when read back.
 
   memory.swap.current
A read-only single value file which exists on non-root
-   cgroups.
+   cgroups. If memsw is enabled then reading this file will return
+   -ENOTSUPP.
 
The total amount of swap currently being used by the cgroup
and its descendants.
 
   memory.swap.max
A read-write single value file which exists on non-root
-   cgroups.  The default is "max".
+   cgroups.  The default is "max". Accessing this file will return
+   -ENOTSUPP if memsw enforcement is enabled.
 
Swap usage hard limit.  If a cgroup's swap usage reaches this
limit, anonymous memory of the cgroup will not be swapped out.
 
+  memory.memsw.current
+   A read-only single value file which exists on non-root
+   cgroups. -ENOTSUPP will be returned on read if memsw is not
+   enabled.
+
+   The total amount of memory+swap currently being used by the cgroup
+   and its descendants.
+
+  memory.memsw.max
+   A read-write single value file which exists on non-root
+   cgroups.  The default is "max". -ENOTSUPP will be returned on
+   access if memsw is not enabled.
+
+   Memory+swap usage hard limit. If a cgroup's memory+swap usage
+   reaches this limit and  can't be reduced, the OOM killer is
+   invoked in the cgroup. Under certain circumstances, the usage
+   may go over the limit temporarily.
+
+
 
 Usage Guidelines
 
@@ -1243,6 +1274,23 @@ memory - is necessary to determine whether a workload 
needs more
 memory; unfortunately, memory pressure monitoring mechanism isn't
 implemented yet.
 
+

Re: [PATCH] mm/shmem: set default tmpfs size according to memcg limit

2017-11-17 Thread Shakeel Butt
On Fri, Nov 17, 2017 at 9:41 AM, Yafang Shao <laoar.s...@gmail.com> wrote:
> 2017-11-18 1:35 GMT+08:00 Shakeel Butt <shake...@google.com>:
>> On Fri, Nov 17, 2017 at 9:09 AM, Yafang Shao <laoar.s...@gmail.com> wrote:
>>> 2017-11-18 0:45 GMT+08:00 Roman Gushchin <g...@fb.com>:
>>>> On Sat, Nov 18, 2017 at 12:20:40AM +0800, Yafang Shao wrote:
>>>>> 2017-11-17 23:55 GMT+08:00 Roman Gushchin <g...@fb.com>:
>>>>> > On Thu, Nov 16, 2017 at 08:43:17PM -0800, Shakeel Butt wrote:
>>>>> >> On Thu, Nov 16, 2017 at 7:09 PM, Yafang Shao <laoar.s...@gmail.com> 
>>>>> >> wrote:
>>>>> >> > Currently the default tmpfs size is totalram_pages / 2 if mount tmpfs
>>>>> >> > without "-o size=XXX".
>>>>> >> > When we mount tmpfs in a container(i.e. docker), it is also
>>>>> >> > totalram_pages / 2 regardless of the memory limit on this container.
>>>>> >> > That may easily cause OOM if tmpfs occupied too much memory when 
>>>>> >> > swap is
>>>>> >> > off.
>>>>> >> > So when we mount tmpfs in a memcg, the default size should be 
>>>>> >> > limited by
>>>>> >> > the memcg memory.limit.
>>>>> >> >
>>>>> >>
>>>>> >> The pages of the tmpfs files are charged to the memcgs of the
>>>>> >> allocators, which can be different from the memcg in which the mount
>>>>> >> operation happened. So, tying the size of a tmpfs mount to the memcg
>>>>> >> where it was mounted does not make much sense.
>>>>> >
>>>>> > Also, memory limit is adjustable,
>>>>>
>>>>> Yes. But that's irrelevant.
>>>>>
>>>>> > and using a particular limit value
>>>>> > at a moment of tmpfs mounting doesn't provide any warranties further.
>>>>> >
>>>>>
>>>>> I can not agree.
>>>>> The default size of tmpfs is totalram / 2, the reason we do this is to
>>>>> provide any warranties further IMHO.
>>>>>
>>>>> > Is there a reason why the userspace app which is mounting tmpfs can't
>>>>> > set the size based on memory.limit?
>>>>>
>>>>> That's because of misuse.
>>>>> The application should set size with "-o size=" when mount tmpfs, but
>>>>> not all applications do this.
>>>>> As we can't guarantee that all applications will do this, we should
>>>>> give them a proper default value.
>>>>
>>>> The value you're suggesting is proper only if an app which is mounting
>>>> tmpfs resides in the same memcg
>>>
>>> Yes.
>>> But maybe that's mostly used today?
>>>
>>>> and the memory limit will not be adjusted
>>>> significantly later.
>>>
>>> There's a similar issue for physical memory adjusted by memory hotplug.
>>> So what will happen if the physical memory adjusted significantly later ?
>>>
>>>> Otherwise you can end up with a default value, which
>>>> is worse than totalram/2, for instance, if tmpfs is mounted by some helper,
>>>> which is located in a separate and very limited memcg.
>>>
>>> That may happen.
>>> Maybe we could improve the solution to handle this issue ?
>>>
>>>
>>
>> Let's backtrack, what is the actual concern? If a user/process inside
>> a memcg is allocating pages for a file on a tmpfs mounted without size
>> parameter, you want the OS to return ENOSPC (if allocation is done by
>> write syscall) earlier to not cause the user/process's memcg to OOM.
>> Is that right?
>>
>
> Right.
>
>> First, there is no guarantee to not cause OOM by restricting tmpfs to
>> half the size of memcg limit due to the presence of other memory
>> charged to that memcg. The memcg can OOM even before the tmpfs hits
>> its size.
>>
>
> Just guarantee that the OOM not caused by misuse of tmpfs.
>
>> Second, the users who really care to avoid such scenario should just
>> set the size parameter of tmpfs.
>
> Of couse that is the best way.
> But we can not ensue all applications will do it.
> That's why I introduce a proper defalut value for them.
>

I think we disagree on how to get a proper default value. Unless you
can guarantee that all the memory allocated for a tmpfs mount will be
charged to a specific memcg, you should not just pick the limit of the
memcg of the process mounting the tmpfs to set the default of the tmpfs
mount. If you can restrict tmpfs charging to a specific memcg then the
limit of that memcg should be used to set the default of the tmpfs
mount. However, this feature is not present in the upstream kernel at
the moment (we have this feature in our local kernel and I am planning
to upstream it).
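
For what it's worth, a job loader that already knows which memcg the
tmpfs usage will be charged to can simply pass an explicit size at
mount time. A rough sketch of what I mean (hypothetical; the cgroup
path "/sys/fs/cgroup/memory/job" and the mount point "/mnt/job-tmp" are
examples only, and it needs CAP_SYS_ADMIN):

/* Mount tmpfs with a size derived from the job's memcg limit instead of
 * relying on the totalram/2 default. */
#include <stdio.h>
#include <sys/mount.h>

int main(void)
{
	unsigned long long limit;
	char opts[64];
	FILE *f = fopen("/sys/fs/cgroup/memory/job/memory.limit_in_bytes", "r");

	if (!f || fscanf(f, "%llu", &limit) != 1) {
		fprintf(stderr, "could not read memcg limit\n");
		return 1;
	}
	fclose(f);

	/* Cap the tmpfs at half of the job's memory limit, mirroring the
	 * kernel's totalram/2 default. */
	snprintf(opts, sizeof(opts), "size=%llu", limit / 2);

	if (mount("tmpfs", "/mnt/job-tmp", "tmpfs", 0, opts)) {
		perror("mount");
		return 1;
	}
	return 0;
}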


Re: [PATCH] mm/shmem: set default tmpfs size according to memcg limit

2017-11-17 Thread Shakeel Butt
On Fri, Nov 17, 2017 at 9:09 AM, Yafang Shao <laoar.s...@gmail.com> wrote:
> 2017-11-18 0:45 GMT+08:00 Roman Gushchin <g...@fb.com>:
>> On Sat, Nov 18, 2017 at 12:20:40AM +0800, Yafang Shao wrote:
>>> 2017-11-17 23:55 GMT+08:00 Roman Gushchin <g...@fb.com>:
>>> > On Thu, Nov 16, 2017 at 08:43:17PM -0800, Shakeel Butt wrote:
>>> >> On Thu, Nov 16, 2017 at 7:09 PM, Yafang Shao <laoar.s...@gmail.com> 
>>> >> wrote:
>>> >> > Currently the default tmpfs size is totalram_pages / 2 if mount tmpfs
>>> >> > without "-o size=XXX".
>>> >> > When we mount tmpfs in a container(i.e. docker), it is also
>>> >> > totalram_pages / 2 regardless of the memory limit on this container.
>>> >> > That may easily cause OOM if tmpfs occupied too much memory when swap 
>>> >> > is
>>> >> > off.
>>> >> > So when we mount tmpfs in a memcg, the default size should be limited 
>>> >> > by
>>> >> > the memcg memory.limit.
>>> >> >
>>> >>
>>> >> The pages of the tmpfs files are charged to the memcgs of the
>>> >> allocators, which can be different from the memcg in which the mount
>>> >> operation happened. So, tying the size of a tmpfs mount to the memcg
>>> >> where it was mounted does not make much sense.
>>> >
>>> > Also, memory limit is adjustable,
>>>
>>> Yes. But that's irrelevant.
>>>
>>> > and using a particular limit value
>>> > at a moment of tmpfs mounting doesn't provide any warranties further.
>>> >
>>>
>>> I can not agree.
>>> The default size of tmpfs is totalram / 2, the reason we do this is to
>>> provide any warranties further IMHO.
>>>
>>> > Is there a reason why the userspace app which is mounting tmpfs can't
>>> > set the size based on memory.limit?
>>>
>>> That's because of misuse.
>>> The application should set size with "-o size=" when mount tmpfs, but
>>> not all applications do this.
>>> As we can't guarantee that all applications will do this, we should
>>> give them a proper default value.
>>
>> The value you're suggesting is proper only if an app which is mounting
>> tmpfs resides in the same memcg
>
> Yes.
> But maybe that's mostly used today?
>
>> and the memory limit will not be adjusted
>> significantly later.
>
> There's a similar issue for physical memory adjusted by memory hotplug.
> So what will happen if the physical memory adjusted significantly later ?
>
>> Otherwise you can end up with a default value, which
>> is worse than totalram/2, for instance, if tmpfs is mounted by some helper,
>> which is located in a separate and very limited memcg.
>
> That may happen.
> Maybe we could improve the solution to handle this issue ?
>
>

Let's backtrack, what is the actual concern? If a user/process inside
a memcg is allocating pages for a file on a tmpfs mounted without size
parameter, you want the OS to return ENOSPC (if allocation is done by
write syscall) earlier to not cause the user/process's memcg to OOM.
Is that right?

First, there is no guarantee to not cause OOM by restricting tmpfs to
half the size of memcg limit due to the presence of other memory
charged to that memcg. The memcg can OOM even before the tmpfs hits
its size.

Second, the users who really care to avoid such scenario should just
set the size parameter of tmpfs.


Re: [PATCH 1/2] mm,vmscan: Kill global shrinker lock.

2017-11-17 Thread Shakeel Butt
On Fri, Nov 17, 2017 at 9:41 AM, Shakeel Butt <shake...@google.com> wrote:
> On Fri, Nov 17, 2017 at 9:35 AM, Christoph Hellwig <h...@infradead.org> wrote:
>> On Tue, Nov 14, 2017 at 06:37:42AM +0900, Tetsuo Handa wrote:
>>> Since do_shrink_slab() can reschedule, we cannot protect shrinker_list
>>> using one RCU section. But using atomic_inc()/atomic_dec() for each
>>> do_shrink_slab() call will not impact so much.
>>
>> But you could use SRCU..
>
> I looked into that but was advised to not go through that route due to
> SRCU being behind CONFIG_SRCU. However now I see the precedent of
> "#ifdef CONFIG_SRCU" in drivers/base/core.c and I think if we can take
> that route if even after Minchan's patch the issue persists.

Too many 'ifs' in the last sentence. I just wanted to say we can
consider SRCU if the issue persists even after Minchan's patch.


Re: [PATCH 1/2] mm,vmscan: Kill global shrinker lock.

2017-11-17 Thread Shakeel Butt
On Fri, Nov 17, 2017 at 9:35 AM, Christoph Hellwig  wrote:
> On Tue, Nov 14, 2017 at 06:37:42AM +0900, Tetsuo Handa wrote:
>> Since do_shrink_slab() can reschedule, we cannot protect shrinker_list
>> using one RCU section. But using atomic_inc()/atomic_dec() for each
>> do_shrink_slab() call will not impact so much.
>
> But you could use SRCU..

I looked into that but was advised to not go through that route due to
SRCU being behind CONFIG_SRCU. However now I see the precedent of
"#ifdef CONFIG_SRCU" in drivers/base/core.c and I think if we can take
that route if even after Minchan's patch the issue persists.


Re: [PATCH v2] mm, shrinker: make shrinker_list lockless

2017-11-10 Thread Shakeel Butt
On Thu, Nov 9, 2017 at 1:46 PM, Tetsuo Handa
<penguin-ker...@i-love.sakura.ne.jp> wrote:
> Shakeel Butt wrote:
>> > If you can accept serialized register_shrinker()/unregister_shrinker(),
>> > I think that something like shown below can do it.
>>
>> If we assume that we will never do register_shrinker and
>> unregister_shrinker on the same object in parallel then do we still
>> need to do msleep & synchronize_rcu() within mutex?
>
> Doing register_shrinker() and unregister_shrinker() on the same object
> in parallel is wrong. This mutex is to ensure that we do not need to
> worry about ->list.next field. synchronize_rcu() should not be slow.
> If you want to avoid msleep() with mutex held, you can also apply
>
>> > If you want parallel register_shrinker()/unregister_shrinker(), something 
>> > like
>> > shown below on top of shown above will do it.
>
> change.

Thanks for the explanation. Can you post the patch for others to
review without parallel register/unregister and SHRINKER_PERMANENT (we
can add those when we need them)? You can use the motivation I
mentioned in my patch for this one instead.


Re: [PATCH] mm, mlock, vmscan: no more skipping pagevecs

2017-11-21 Thread Shakeel Butt
On Tue, Nov 21, 2017 at 7:32 AM, Johannes Weiner <han...@cmpxchg.org> wrote:
> On Sat, Nov 04, 2017 at 03:43:12PM -0700, Shakeel Butt wrote:
>> When a thread mlocks an address space backed by file, a new
>> page is allocated (assuming file page is not in memory), added
>> to the local pagevec (lru_add_pvec), I/O is triggered and the
>> thread then sleeps on the page. On I/O completion, the thread
>> can wake on a different CPU, the mlock syscall will then set
>> the PageMlocked() bit of the page but will not be able to put
>> that page in unevictable LRU as the page is on the pagevec of
>> a different CPU. Even on drain, that page will go to evictable
>> LRU because the PageMlocked() bit is not checked on pagevec
>> drain.
>>
>> The page will eventually go to right LRU on reclaim but the
>> LRU stats will remain skewed for a long time.
>>
>> However, this issue does not happen for anon pages on swap
>> because unlike file pages, anon pages are not added to pagevec
>> until they have been fully swapped in.
>
> How so? __read_swap_cache_async() is the core function that allocates
> the page, and that always puts the page on the pagevec before IO is
> initiated.
>
>> Also the fault handler uses vm_flags to set the PageMlocked() bit of
>> such anon pages even before returning to mlock() syscall and mlocked
>> pages will skip pagevecs and directly be put into unevictable LRU.
>
> Where does the swap fault path set PageMlocked()?
>
> I might just be missing something.

No, you are right. I got confused by
lru_cache_add_active_or_unevictable() in do_swap_page() but missed the
preceding comment that says "ksm created a completely new copy". I
will fix the commit message as well.


Re: [PATCH] mm, mlock, vmscan: no more skipping pagevecs

2017-11-21 Thread Shakeel Butt
On Tue, Nov 21, 2017 at 7:06 AM, Johannes Weiner <han...@cmpxchg.org> wrote:
> On Tue, Nov 21, 2017 at 01:39:57PM +0100, Vlastimil Babka wrote:
>> On 11/04/2017 11:43 PM, Shakeel Butt wrote:
>> > When a thread mlocks an address space backed by file, a new
>> > page is allocated (assuming file page is not in memory), added
>> > to the local pagevec (lru_add_pvec), I/O is triggered and the
>> > thread then sleeps on the page. On I/O completion, the thread
>> > can wake on a different CPU, the mlock syscall will then set
>> > the PageMlocked() bit of the page but will not be able to put
>> > that page in unevictable LRU as the page is on the pagevec of
>> > a different CPU. Even on drain, that page will go to evictable
>> > LRU because the PageMlocked() bit is not checked on pagevec
>> > drain.
>> >
>> > The page will eventually go to right LRU on reclaim but the
>> > LRU stats will remain skewed for a long time.
>> >
>> > However, this issue does not happen for anon pages on swap
>> > because unlike file pages, anon pages are not added to pagevec
>> > until they have been fully swapped in. Also the fault handler
>> > uses vm_flags to set the PageMlocked() bit of such anon pages
>> > even before returning to mlock() syscall and mlocked pages will
>> > skip pagevecs and directly be put into unevictable LRU. No such
>> > luck for file pages.
>> >
>> > One way to resolve this issue, is to somehow plumb vm_flags from
>> > filemap_fault() to add_to_page_cache_lru() which will then skip
>> > the pagevec for pages of VM_LOCKED vma and directly put them to
>> > unevictable LRU. However this patch took a different approach.
>> >
>> > All the pages, even unevictable, will be added to the pagevecs
>> > and on the drain, the pages will be added on their LRUs correctly
>> > by checking their evictability. This resolves the mlocked file
>> > pages on pagevec of other CPUs issue because when those pagevecs
>> > will be drained, the mlocked file pages will go to unevictable
>> > LRU. Also this makes the race with munlock easier to resolve
>> > because the pagevec drains happen in LRU lock.
>> >
>> > There is one (good) side effect though. Without this patch, the
>> > pages allocated for System V shared memory segment are added to
>> > evictable LRUs even after shmctl(SHM_LOCK) on that segment. This
>> > patch will correctly put such pages to unevictable LRU.
>> >
>> > Signed-off-by: Shakeel Butt <shake...@google.com>
>>
>> I like the approach in general, as it seems to make the code simpler,
>> and the diffstats support that. I found no bugs, but I can't say that
>> with certainty that there aren't any, though. This code is rather
>> tricky. But it should be enough for an ack, so.
>>
>> Acked-by: Vlastimil Babka <vba...@suse.cz>
>>
>> A question below, though.
>>
>> ...
>>
>> > @@ -883,15 +855,41 @@ void lru_add_page_tail(struct page *page, struct 
>> > page *page_tail,
>> >  static void __pagevec_lru_add_fn(struct page *page, struct lruvec *lruvec,
>> >  void *arg)
>> >  {
>> > -   int file = page_is_file_cache(page);
>> > -   int active = PageActive(page);
>> > -   enum lru_list lru = page_lru(page);
>> > +   enum lru_list lru;
>> > +   int was_unevictable = TestClearPageUnevictable(page);
>> >
>> > VM_BUG_ON_PAGE(PageLRU(page), page);
>> >
>> > SetPageLRU(page);
>> > +   /*
>> > +* Page becomes evictable in two ways:
>> > +* 1) Within LRU lock [munlock_vma_pages() and __munlock_pagevec()].
>> > +* 2) Before acquiring LRU lock to put the page to correct LRU and then
>> > +*   a) do PageLRU check with lock [check_move_unevictable_pages]
>> > +*   b) do PageLRU check before lock [isolate_lru_page]
>> > +*
>> > +* (1) & (2a) are ok as LRU lock will serialize them. For (2b), if the
>> > +* other thread does not observe our setting of PG_lru and fails
>> > +* isolation, the following page_evictable() check will make us put
>> > +* the page in correct LRU.
>> > +*/
>> > +   smp_mb();
>>
>> Could you elaborate on the purpose of smp_mb() here? Previously there
>> was "The other side is TestClearPageMlocked() or shmem_lock()" in
>> putback_lru_page(), which seems rather unclear to me (neither has an
>> explic

[PATCH v2] mm, mlock, vmscan: no more skipping pagevecs

2017-11-21 Thread Shakeel Butt
When a thread mlocks an address space backed either by file
pages which are currently not present in memory or swapped
out anon pages (not in swapcache), a new page is allocated
and added to the local pagevec (lru_add_pvec), I/O is triggered
and the thread then sleeps on the page. On I/O completion, the
thread can wake on a different CPU, the mlock syscall will then
set the PageMlocked() bit of the page but will not be able to
put that page in unevictable LRU as the page is on the pagevec
of a different CPU. Even on drain, that page will go to evictable
LRU because the PageMlocked() bit is not checked on pagevec
drain.

The page will eventually go to right LRU on reclaim but the
LRU stats will remain skewed for a long time.

This patch puts all the pages, even unevictable, on the pagevecs
and on the drain, the pages will be added on their LRUs correctly
by checking their evictability. This resolves the mlocked pages
on pagevec of other CPUs issue because when those pagevecs will
be drained, the mlocked file pages will go to unevictable LRU.
Also this makes the race with munlock easier to resolve because
the pagevec drains happen in LRU lock.

However there is still one place which makes a page evictable
and does the PageLRU check on that page without the LRU lock and needs
special attention: TestClearPageMlocked() and isolate_lru_page()
in clear_page_mlock().

#0: __pagevec_lru_add_fn             #1: clear_page_mlock

SetPageLRU()                         if (!TestClearPageMlocked())
                                       return
smp_mb() // <--required
                                     // inside does PageLRU
if (!PageMlocked())                  if (isolate_lru_page())
  move to evictable LRU                putback_lru_page()
else
  move to unevictable LRU

In '#1', TestClearPageMlocked() provides full memory barrier
semantics and thus the PageLRU check (inside isolate_lru_page)
can not be reordered before it.

In '#0', without explicit memory barrier, the PageMlocked() check
can be reordered before SetPageLRU(). If that happens, '#0' can
put a page in unevictable LRU and '#1' might have just cleared the
Mlocked bit of that page but fails to isolate as PageLRU fails as
'#0' still hasn't set PageLRU bit of that page. That page will be
stranded on the unevictable LRU.

There is one (good) side effect though. Without this patch, the
pages allocated for System V shared memory segment are added to
evictable LRUs even after shmctl(SHM_LOCK) on that segment. This
patch will correctly put such pages to unevictable LRU.

Signed-off-by: Shakeel Butt <shake...@google.com>
Acked-by: Vlastimil Babka <vba...@suse.cz>

---
Changelog since v1:
- Fixed the commit message.
- Added more comments based on Johannes's suggestion.
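
For reference, a minimal userspace sequence that exercises the path
described above (mlock of a file-backed mapping whose pages are not yet
resident, followed by a look at the meminfo counters) could look like
the sketch below. It is only an illustration, not part of the patch,
and it assumes a file named "testfile" of at least 4MB exists:

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
	size_t len = 4 << 20;
	int fd = open("testfile", O_RDONLY);
	void *p;

	if (fd < 0) {
		perror("open");
		return 1;
	}
	p = mmap(NULL, len, PROT_READ, MAP_SHARED, fd, 0);
	if (p == MAP_FAILED) {
		perror("mmap");
		return 1;
	}
	/* Faults the pages in and should move them to the unevictable LRU. */
	if (mlock(p, len)) {
		perror("mlock");
		return 1;
	}
	/* Without the fix, pages stranded on another CPU's pagevec can show
	 * up on the evictable LRUs for a while instead. */
	system("grep -E 'Unevictable|Mlocked' /proc/meminfo");

	munlock(p, len);
	munmap(p, len);
	close(fd);
	return 0;
}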

 include/linux/swap.h |  2 --
 mm/mlock.c   |  6 
 mm/swap.c| 82 ++--
 mm/vmscan.c  | 59 +
 4 files changed, 54 insertions(+), 95 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index c2b8128799c1..4aebd2b760c8 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -338,8 +338,6 @@ extern void deactivate_file_page(struct page *page);
 extern void mark_page_lazyfree(struct page *page);
 extern void swap_setup(void);
 
-extern void add_page_to_unevictable_list(struct page *page);
-
 extern void lru_cache_add_active_or_unevictable(struct page *page,
struct vm_area_struct *vma);
 
diff --git a/mm/mlock.c b/mm/mlock.c
index 30472d438794..c3d2d0d3d73b 100644
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -64,6 +64,12 @@ void clear_page_mlock(struct page *page)
mod_zone_page_state(page_zone(page), NR_MLOCK,
-hpage_nr_pages(page));
count_vm_event(UNEVICTABLE_PGCLEARED);
+   /*
+* The previous TestClearPageMlocked() corresponds to the smp_mb()
+* in __pagevec_lru_add_fn().
+*
+* See __pagevec_lru_add_fn for more explanation.
+*/
if (!isolate_lru_page(page)) {
putback_lru_page(page);
} else {
diff --git a/mm/swap.c b/mm/swap.c
index 38e1b6374a97..31869a543270 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -445,30 +445,6 @@ void lru_cache_add(struct page *page)
__lru_cache_add(page);
 }
 
-/**
- * add_page_to_unevictable_list - add a page to the unevictable list
- * @page:  the page to be added to the unevictable list
- *
- * Add page directly to its zone's unevictable list.  To avoid races with
- * tasks that might be making the page evictable, through eg. munlock,
- * munmap or exit, while it's not on the lru, we want to add the page
- * while it's locked or otherwise "invisible" to other tasks.  This is
- * difficult to do when using the pagevec cache, so bypass that.
- */
-void add_page_to_unevictable_list(struct page *page)
-{
-   

Re: [PATCH] mm/shmem: set default tmpfs size according to memcg limit

2017-11-16 Thread Shakeel Butt
On Thu, Nov 16, 2017 at 7:09 PM, Yafang Shao  wrote:
> Currently the default tmpfs size is totalram_pages / 2 if mount tmpfs
> without "-o size=XXX".
> When we mount tmpfs in a container(i.e. docker), it is also
> totalram_pages / 2 regardless of the memory limit on this container.
> That may easily cause OOM if tmpfs occupied too much memory when swap is
> off.
> So when we mount tmpfs in a memcg, the default size should be limited by
> the memcg memory.limit.
>

The pages of the tmpfs files are charged to the memcgs of the
allocators, which can be different from the memcg in which the mount
operation happened. So, tying the size of a tmpfs mount to the memcg
where it was mounted does not make much sense.

Also, the mount operation, which requires CAP_SYS_ADMIN, is usually
performed by a node controller (or job loader) which doesn't necessarily
run in the memcg of the actual job.

> Signed-off-by: Yafang Shao 
> ---
>  include/linux/memcontrol.h |  1 +
>  mm/memcontrol.c|  2 +-
>  mm/shmem.c | 20 +++-
>  3 files changed, 21 insertions(+), 2 deletions(-)
>
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 69966c4..79c6709 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -265,6 +265,7 @@ struct mem_cgroup {
> /* WARNING: nodeinfo must be the last member here */
>  };
>
> +extern struct mutex memcg_limit_mutex;
>  extern struct mem_cgroup *root_mem_cgroup;
>
>  static inline bool mem_cgroup_disabled(void)
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 661f046..ad32f3c 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -2464,7 +2464,7 @@ static inline int 
> mem_cgroup_move_swap_account(swp_entry_t entry,
>  }
>  #endif
>
> -static DEFINE_MUTEX(memcg_limit_mutex);
> +DEFINE_MUTEX(memcg_limit_mutex);

This mutex is only needed for updating the limit.

>
>  static int mem_cgroup_resize_limit(struct mem_cgroup *memcg,
>unsigned long limit)
> diff --git a/mm/shmem.c b/mm/shmem.c
> index 07a1d22..1c320dd 100644
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -35,6 +35,7 @@
>  #include 
>  #include 
>  #include 
> +#include <linux/memcontrol.h>
>
>  #include <asm/tlbflush.h> /* for arch/microblaze update_mmu_cache() */
>
> @@ -108,7 +109,24 @@ struct shmem_falloc {
>  #ifdef CONFIG_TMPFS
>  static unsigned long shmem_default_max_blocks(void)
>  {
> -   return totalram_pages / 2;
> +   unsigned long size;
> +
> +#ifdef CONFIG_MEMCG
> +   struct mem_cgroup *memcg = mem_cgroup_from_task(current);
> +
> +   if (memcg == NULL || memcg == root_mem_cgroup)
> +   size = totalram_pages / 2;
> +   else {
> +   mutex_lock(&memcg_limit_mutex);
> +   size = memcg->memory.limit > totalram_pages ?
> +totalram_pages / 2 : memcg->memory.limit / 2;
> +   mutex_unlock(&memcg_limit_mutex);
> +   }
> +#else
> +   size = totalram_pages / 2;
> +#endif
> +
> +   return size;
>  }
>
>  static unsigned long shmem_default_max_inodes(void)
> --
> 1.8.3.1
>
> --
> To unsubscribe from this list: send the line "unsubscribe cgroups" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/2] mm,vmscan: Kill global shrinker lock.

2017-11-13 Thread Shakeel Butt
On Mon, Nov 13, 2017 at 1:37 PM, Tetsuo Handa
<penguin-ker...@i-love.sakura.ne.jp> wrote:
> When shrinker_rwsem was introduced, it was assumed that
> register_shrinker()/unregister_shrinker() are really unlikely paths
> which are called during initialization and tear down. But nowadays,
> register_shrinker()/unregister_shrinker() might be called regularly.
> This patch prepares for allowing parallel registration/unregistration
> of shrinkers.
>
> Since do_shrink_slab() can reschedule, we cannot protect shrinker_list
> using one RCU section. But using atomic_inc()/atomic_dec() for each
> do_shrink_slab() call will not impact so much.
>
> This patch uses polling loop with short sleep for unregister_shrinker()
> rather than wait_on_atomic_t(), for we can save reader's cost (plain
> atomic_dec() compared to atomic_dec_and_test()), we can expect that
> do_shrink_slab() of unregistering shrinker likely returns shortly, and
> we can avoid khungtaskd warnings when do_shrink_slab() of unregistering
> shrinker unexpectedly took so long.
>
> Signed-off-by: Tetsuo Handa <penguin-ker...@i-love.sakura.ne.jp>

Reviewed-and-tested-by: Shakeel Butt <shake...@google.com>

> ---
>  include/linux/shrinker.h |  3 ++-
>  mm/vmscan.c  | 41 +++--
>  2 files changed, 21 insertions(+), 23 deletions(-)
>
> diff --git a/include/linux/shrinker.h b/include/linux/shrinker.h
> index 388ff29..333a1d0 100644
> --- a/include/linux/shrinker.h
> +++ b/include/linux/shrinker.h
> @@ -62,9 +62,10 @@ struct shrinker {
>
> int seeks;  /* seeks to recreate an obj */
> long batch; /* reclaim batch size, 0 = default */
> -   unsigned long flags;
> +   unsigned int flags;
>
> /* These are for internal use */
> +   atomic_t nr_active;
> struct list_head list;
> /* objs pending delete, per node */
> atomic_long_t *nr_deferred;
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 1c1bc95..c8996e8 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -157,7 +157,7 @@ struct scan_control {
>  unsigned long vm_total_pages;
>
>  static LIST_HEAD(shrinker_list);
> -static DECLARE_RWSEM(shrinker_rwsem);
> +static DEFINE_MUTEX(shrinker_lock);
>
>  #ifdef CONFIG_MEMCG
>  static bool global_reclaim(struct scan_control *sc)
> @@ -285,9 +285,10 @@ int register_shrinker(struct shrinker *shrinker)
> if (!shrinker->nr_deferred)
> return -ENOMEM;
>
> -   down_write(&shrinker_rwsem);
> -   list_add_tail(&shrinker->list, &shrinker_list);
> -   up_write(&shrinker_rwsem);
> +   atomic_set(&shrinker->nr_active, 0);
> +   mutex_lock(&shrinker_lock);
> +   list_add_tail_rcu(&shrinker->list, &shrinker_list);
> +   mutex_unlock(&shrinker_lock);
> return 0;
>  }
>  EXPORT_SYMBOL(register_shrinker);
> @@ -297,9 +298,13 @@ int register_shrinker(struct shrinker *shrinker)
>   */
>  void unregister_shrinker(struct shrinker *shrinker)
>  {
> -   down_write(&shrinker_rwsem);
> -   list_del(&shrinker->list);
> -   up_write(&shrinker_rwsem);
> +   mutex_lock(&shrinker_lock);
> +   list_del_rcu(&shrinker->list);
> +   synchronize_rcu();
> +   while (atomic_read(&shrinker->nr_active))
> +   schedule_timeout_uninterruptible(1);
> +   synchronize_rcu();
> +   mutex_unlock(&shrinker_lock);
> kfree(shrinker->nr_deferred);
>  }
>  EXPORT_SYMBOL(unregister_shrinker);
> @@ -468,18 +473,8 @@ static unsigned long shrink_slab(gfp_t gfp_mask, int nid,
> if (nr_scanned == 0)
> nr_scanned = SWAP_CLUSTER_MAX;
>
> -   if (!down_read_trylock(&shrinker_rwsem)) {
> -   /*
> -* If we would return 0, our callers would understand that we
> -* have nothing else to shrink and give up trying. By 
> returning
> -* 1 we keep it going and assume we'll be able to shrink next
> -* time.
> -*/
> -   freed = 1;
> -   goto out;
> -   }
> -
> -   list_for_each_entry(shrinker, &shrinker_list, list) {
> +   rcu_read_lock();
> +   list_for_each_entry_rcu(shrinker, &shrinker_list, list) {
> struct shrink_control sc = {
> .gfp_mask = gfp_mask,
> .nid = nid,
> @@ -498,11 +493,13 @@ static unsigned long shrink_slab(gfp_t gfp_mask, int 
> nid,
> if (!(shrinker->flags & SHRINKER_NUMA_AWARE))
> sc.nid = 0;
>
> +   atomic_inc(&shrinker->nr_active);
> +   rcu_read_unlock();
> freed += do_shrink_slab(&sc, shrinker, nr_scanned, 
> nr_eligible);
> +   rcu_read_lock();
> +   atomic_dec(&shrinker->nr_active);
> }
> -
> -   up_read(&shrinker_rwsem);
> -out:
> +   rcu_read_unlock();
> cond_resched();
> return freed;
>  }
> --
> 1.8.3.1
>


[PATCH] vfs: remove might_sleep() from clear_inode()

2017-11-07 Thread Shakeel Butt
Commit 7994e6f72543 ("vfs: Move waiting for inode writeback from
end_writeback() to evict_inode()") removed inode_sync_wait() from
end_writeback() and commit dbd5768f87ff ("vfs: Rename end_writeback()
to clear_inode()") renamed end_writeback() to clear_inode(). After
these patches there is no sleeping operation in clear_inode(). So,
remove might_sleep() from it.

Signed-off-by: Shakeel Butt <shake...@google.com>
---
 fs/inode.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/fs/inode.c b/fs/inode.c
index d1e35b53bb23..528f3159b928 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -497,7 +497,6 @@ EXPORT_SYMBOL(__remove_inode_hash);
 
 void clear_inode(struct inode *inode)
 {
-   might_sleep();
/*
 * We have to cycle tree_lock here because reclaim can be still in the
 * process of removing the last page (in __delete_from_page_cache())
-- 
2.15.0.403.gc27cc4dac6-goog



Re: [PATCH] mm, shrinker: make shrinker_list lockless

2017-11-07 Thread Shakeel Butt
> if (next_deferred >= scanned)
> @@ -468,18 +487,9 @@ static unsigned long shrink_slab(gfp_t gfp_mask, int nid,
> if (nr_scanned == 0)
> nr_scanned = SWAP_CLUSTER_MAX;
>
> -   if (!down_read_trylock(&shrinker_rwsem)) {
> -   /*
> -* If we would return 0, our callers would understand that we
> -* have nothing else to shrink and give up trying. By 
> returning
> -* 1 we keep it going and assume we'll be able to shrink next
> -* time.
> -*/
> -   freed = 1;
> -   goto out;
> -   }
> +   rcu_read_lock();

Sorry, the rcu_read_lock() will not work. I am currently testing with
srcu_read_lock() and will see if it gives any error.

>
> -   list_for_each_entry(shrinker, &shrinker_list, list) {
> +   list_for_each_entry_rcu(shrinker, &shrinker_list, list) {
> struct shrink_control sc = {
> .gfp_mask = gfp_mask,
> .nid = nid,
> @@ -498,11 +508,13 @@ static unsigned long shrink_slab(gfp_t gfp_mask, int 
> nid,
> if (!(shrinker->flags & SHRINKER_NUMA_AWARE))
> sc.nid = 0;
>
> +   get_shrinker(shrinker);
> freed += do_shrink_slab(&sc, shrinker, nr_scanned, 
> nr_eligible);
> +   put_shrinker(shrinker);
> }
>
> -   up_read(&shrinker_rwsem);
> -out:
> +   rcu_read_unlock();
> +
> cond_resched();
> return freed;
>  }
> --
> 2.15.0.403.gc27cc4dac6-goog
>


[PATCH] mm, shrinker: make shrinker_list lockless

2017-11-07 Thread Shakeel Butt
In our production, we have observed that the job loader gets stuck for
10s of seconds while doing mount operation. It turns out that it was
stuck in register_shrinker() and some unrelated job was under memory
pressure and spending time in shrink_slab(). Our machines have a lot
of shrinkers registered and jobs under memory pressure have to traverse
all of those memcg-aware shrinkers and thus affect unrelated jobs which
want to register their own shrinkers.

This patch has made the shrinker_list traversal lockless and shrinker
register remain fast. For the shrinker unregister, atomic counter
has been introduced to avoid synchronize_rcu() call. The fields of
struct shrinker have been rearranged to make sure that the size does
not increase.

Signed-off-by: Shakeel Butt <shake...@google.com>
---
 include/linux/shrinker.h |  4 +++-
 mm/vmscan.c  | 54 +---
 2 files changed, 36 insertions(+), 22 deletions(-)

diff --git a/include/linux/shrinker.h b/include/linux/shrinker.h
index 388ff2936a87..434b76ef9367 100644
--- a/include/linux/shrinker.h
+++ b/include/linux/shrinker.h
@@ -60,14 +60,16 @@ struct shrinker {
unsigned long (*scan_objects)(struct shrinker *,
  struct shrink_control *sc);
 
+   unsigned int flags;
int seeks;  /* seeks to recreate an obj */
long batch; /* reclaim batch size, 0 = default */
-   unsigned long flags;
 
/* These are for internal use */
struct list_head list;
/* objs pending delete, per node */
atomic_long_t *nr_deferred;
+   /* Number of active do_shrink_slab calls to this shrinker */
+   atomic_t nr_active;
 };
 #define DEFAULT_SEEKS 2 /* A good number if you don't know better. */
 
diff --git a/mm/vmscan.c b/mm/vmscan.c
index eb2f0315b8c0..f58c0d973bc4 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -157,7 +157,7 @@ int vm_swappiness = 60;
 unsigned long vm_total_pages;
 
 static LIST_HEAD(shrinker_list);
-static DECLARE_RWSEM(shrinker_rwsem);
+static DEFINE_SPINLOCK(shrinker_lock);
 
 #ifdef CONFIG_MEMCG
 static bool global_reclaim(struct scan_control *sc)
@@ -285,21 +285,40 @@ int register_shrinker(struct shrinker *shrinker)
if (!shrinker->nr_deferred)
return -ENOMEM;
 
-   down_write(&shrinker_rwsem);
-   list_add_tail(&shrinker->list, &shrinker_list);
-   up_write(&shrinker_rwsem);
+   atomic_set(&shrinker->nr_active, 0);
+   spin_lock(&shrinker_lock);
+   list_add_tail_rcu(&shrinker->list, &shrinker_list);
+   spin_unlock(&shrinker_lock);
return 0;
 }
 EXPORT_SYMBOL(register_shrinker);
 
+static void get_shrinker(struct shrinker *shrinker)
+{
+   atomic_inc(&shrinker->nr_active);
+}
+
+static void put_shrinker(struct shrinker *shrinker)
+{
+   if (!atomic_dec_return(&shrinker->nr_active))
+   wake_up_atomic_t(&shrinker->nr_active);
+}
+
+static int shrinker_wait_atomic_t(atomic_t *p)
+{
+   schedule();
+   return 0;
+}
 /*
  * Remove one
  */
 void unregister_shrinker(struct shrinker *shrinker)
 {
-   down_write(&shrinker_rwsem);
-   list_del(&shrinker->list);
-   up_write(&shrinker_rwsem);
+   spin_lock(&shrinker_lock);
+   list_del_rcu(&shrinker->list);
+   spin_unlock(&shrinker_lock);
+   wait_on_atomic_t(&shrinker->nr_active, shrinker_wait_atomic_t,
+TASK_UNINTERRUPTIBLE);
kfree(shrinker->nr_deferred);
 }
 EXPORT_SYMBOL(unregister_shrinker);
@@ -404,7 +423,7 @@ static unsigned long do_shrink_slab(struct shrink_control 
*shrinkctl,
total_scan -= shrinkctl->nr_scanned;
scanned += shrinkctl->nr_scanned;
 
-   cond_resched();
+   cond_resched_rcu();
}
 
if (next_deferred >= scanned)
@@ -468,18 +487,9 @@ static unsigned long shrink_slab(gfp_t gfp_mask, int nid,
if (nr_scanned == 0)
nr_scanned = SWAP_CLUSTER_MAX;
 
-   if (!down_read_trylock(&shrinker_rwsem)) {
-   /*
-* If we would return 0, our callers would understand that we
-* have nothing else to shrink and give up trying. By returning
-* 1 we keep it going and assume we'll be able to shrink next
-* time.
-*/
-   freed = 1;
-   goto out;
-   }
+   rcu_read_lock();
 
-   list_for_each_entry(shrinker, &shrinker_list, list) {
+   list_for_each_entry_rcu(shrinker, &shrinker_list, list) {
struct shrink_control sc = {
.gfp_mask = gfp_mask,
.nid = nid,
@@ -498,11 +508,13 @@ static unsigned long shrink_slab(gfp_t gfp_mask, int nid,
if (!(shrinker->flags & SHRINKER_NUMA_AWARE))
sc.nid = 0;
 
+   get_shrinker(shrinker);
freed += do_shrink_slab(&sc, shrinker, nr_scanned, nr_eligible);
+   put_shrinker(shrinker);
}
 
-   up_read(&shrinker_rwsem);
-out:
+   rcu_read_unlock();
+
cond_resched();
return freed;
 }
-- 
2.15.0.403.gc27cc4dac6-goog



Re: [PATCH v2] mm, shrinker: make shrinker_list lockless

2017-11-09 Thread Shakeel Butt
>
> If you can accept serialized register_shrinker()/unregister_shrinker(),
> I think that something like shown below can do it.
>

Thanks.

> --
> diff --git a/include/linux/shrinker.h b/include/linux/shrinker.h
> index 388ff29..e2272dd 100644
> --- a/include/linux/shrinker.h
> +++ b/include/linux/shrinker.h
> @@ -62,9 +62,10 @@ struct shrinker {
>
> int seeks;  /* seeks to recreate an obj */
> long batch; /* reclaim batch size, 0 = default */
> -   unsigned long flags;
> +   unsigned int flags;
>
> /* These are for internal use */
> +   atomic_t nr_active; /* Counted only if !SHRINKER_PERMANENT */
> struct list_head list;
> /* objs pending delete, per node */
> atomic_long_t *nr_deferred;
> @@ -74,6 +75,7 @@ struct shrinker {
>  /* Flags */
>  #define SHRINKER_NUMA_AWARE(1 << 0)
>  #define SHRINKER_MEMCG_AWARE   (1 << 1)
> +#define SHRINKER_PERMANENT (1 << 2)
>
>  extern int register_shrinker(struct shrinker *);
>  extern void unregister_shrinker(struct shrinker *);
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 1c1bc95..e963359 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -157,7 +157,7 @@ struct scan_control {
>  unsigned long vm_total_pages;
>
>  static LIST_HEAD(shrinker_list);
> -static DECLARE_RWSEM(shrinker_rwsem);
> +static DEFINE_MUTEX(shrinker_lock);
>
>  #ifdef CONFIG_MEMCG
>  static bool global_reclaim(struct scan_control *sc)
> @@ -285,9 +285,10 @@ int register_shrinker(struct shrinker *shrinker)
> if (!shrinker->nr_deferred)
> return -ENOMEM;
>
> -   down_write(&shrinker_rwsem);
> -   list_add_tail(&shrinker->list, &shrinker_list);
> -   up_write(&shrinker_rwsem);
> +   atomic_set(&shrinker->nr_active, 0);
> +   mutex_lock(&shrinker_lock);
> +   list_add_tail_rcu(&shrinker->list, &shrinker_list);
> +   mutex_unlock(&shrinker_lock);
> return 0;
>  }
>  EXPORT_SYMBOL(register_shrinker);
> @@ -297,9 +298,14 @@ int register_shrinker(struct shrinker *shrinker)
>   */
>  void unregister_shrinker(struct shrinker *shrinker)
>  {
> -   down_write(&shrinker_rwsem);
> -   list_del(&shrinker->list);
> -   up_write(&shrinker_rwsem);
> +   BUG_ON(shrinker->flags & SHRINKER_PERMANENT);
> +   mutex_lock(&shrinker_lock);
> +   list_del_rcu(&shrinker->list);
> +   synchronize_rcu();
> +   while (atomic_read(&shrinker->nr_active))
> +   msleep(1);

If we assume that we will never do register_shrinker and
unregister_shrinker on the same object in parallel then do we still
need to do msleep & synchronize_rcu() within mutex?

> +   synchronize_rcu();

I was hoping to not put any delay for the normal case (no memory
pressure and no reclaimers).

> +   mutex_unlock(&shrinker_lock);
> kfree(shrinker->nr_deferred);
>  }
>  EXPORT_SYMBOL(unregister_shrinker);
> @@ -468,18 +474,9 @@ static unsigned long shrink_slab(gfp_t gfp_mask, int nid,
> if (nr_scanned == 0)
> nr_scanned = SWAP_CLUSTER_MAX;
>
> -   if (!down_read_trylock(&shrinker_rwsem)) {
> -   /*
> -* If we would return 0, our callers would understand that we
> -* have nothing else to shrink and give up trying. By 
> returning
> -* 1 we keep it going and assume we'll be able to shrink next
> -* time.
> -*/
> -   freed = 1;
> -   goto out;
> -   }
> -
> -   list_for_each_entry(shrinker, &shrinker_list, list) {
> +   rcu_read_lock();
> +   list_for_each_entry_rcu(shrinker, &shrinker_list, list) {
> +   bool permanent;
> struct shrink_control sc = {
> .gfp_mask = gfp_mask,
> .nid = nid,
> @@ -498,11 +495,16 @@ static unsigned long shrink_slab(gfp_t gfp_mask, int 
> nid,
> if (!(shrinker->flags & SHRINKER_NUMA_AWARE))
> sc.nid = 0;
>
> +   permanent = (shrinker->flags & SHRINKER_PERMANENT);
> +   if (!permanent)
> +   atomic_inc(&shrinker->nr_active);
> +   rcu_read_unlock();
> freed += do_shrink_slab(&sc, shrinker, nr_scanned, 
> nr_eligible);
> +   rcu_read_lock();
> +   if (!permanent)
> +   atomic_dec(&shrinker->nr_active);
> }
> -
> -   up_read(&shrinker_rwsem);
> -out:
> +   rcu_read_unlock();
> cond_resched();
> return freed;
>  }
> --
>
> If you want parallel register_shrinker()/unregister_shrinker(), something like
> shown below on top of shown above will do it.
>
> --
> diff --git a/include/linux/shrinker.h b/include/linux/shrinker.h
> index e2272dd..471b2f6 100644
> --- a/include/linux/shrinker.h
> +++ b/include/linux/shrinker.h
> @@ -67,6 +67,7 @@ struct shrinker {
> /* These are for internal use */
> atomic_t nr_active; /* Counted only if !SHRINKER_PERMANENT */
> struct list_head list;
> +   struct list_head gc_list;
> /* objs pending delete, per node */
> atomic_long_t 

[PATCH v2] mm, shrinker: make shrinker_list lockless

2017-11-08 Thread Shakeel Butt
In our production, we have observed that the job loader gets stuck for
10s of seconds while doing mount operation. It turns out that it was
stuck in register_shrinker() and some unrelated job was under memory
pressure and spending time in shrink_slab(). Our machines have a lot
of shrinkers registered and jobs under memory pressure have to traverse
all of those memcg-aware shrinkers and thus affect unrelated jobs which
want to register their own shrinkers.

This patch has made the shrinker_list traversal lockless and shrinker
register remain fast. For the shrinker unregister, atomic counter
has been introduced to avoid synchronize_rcu() call. The fields of
struct shrinker have been rearranged to make sure that the size does
not increase for x86_64.

The shrinker functions are allowed to reschedule() and thus can not
be called with rcu read lock. One way to resolve that is to use
srcu read lock but then ifdefs have to be used as SRCU is behind
CONFIG_SRCU. Another way is to just release the rcu read lock before
calling the shrinker and reacquire on the return. The atomic counter
will make sure that the shrinker entry will not be freed under us.

Signed-off-by: Shakeel Butt <shake...@google.com>
---
Changelog since v1:
- release and reacquire rcu lock across shrinker call.

 include/linux/shrinker.h |  4 +++-
 mm/vmscan.c  | 54 ++--
 2 files changed, 37 insertions(+), 21 deletions(-)

diff --git a/include/linux/shrinker.h b/include/linux/shrinker.h
index 388ff2936a87..434b76ef9367 100644
--- a/include/linux/shrinker.h
+++ b/include/linux/shrinker.h
@@ -60,14 +60,16 @@ struct shrinker {
unsigned long (*scan_objects)(struct shrinker *,
  struct shrink_control *sc);
 
+   unsigned int flags;
int seeks;  /* seeks to recreate an obj */
long batch; /* reclaim batch size, 0 = default */
-   unsigned long flags;
 
/* These are for internal use */
struct list_head list;
/* objs pending delete, per node */
atomic_long_t *nr_deferred;
+   /* Number of active do_shrink_slab calls to this shrinker */
+   atomic_t nr_active;
 };
 #define DEFAULT_SEEKS 2 /* A good number if you don't know better. */
 
diff --git a/mm/vmscan.c b/mm/vmscan.c
index eb2f0315b8c0..6cec46ac6d95 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -157,7 +157,7 @@ int vm_swappiness = 60;
 unsigned long vm_total_pages;
 
 static LIST_HEAD(shrinker_list);
-static DECLARE_RWSEM(shrinker_rwsem);
+static DEFINE_SPINLOCK(shrinker_lock);
 
 #ifdef CONFIG_MEMCG
 static bool global_reclaim(struct scan_control *sc)
@@ -285,21 +285,42 @@ int register_shrinker(struct shrinker *shrinker)
if (!shrinker->nr_deferred)
return -ENOMEM;
 
-   down_write(&shrinker_rwsem);
-   list_add_tail(&shrinker->list, &shrinker_list);
-   up_write(&shrinker_rwsem);
+   atomic_set(&shrinker->nr_active, 0);
+   spin_lock(&shrinker_lock);
+   list_add_tail_rcu(&shrinker->list, &shrinker_list);
+   spin_unlock(&shrinker_lock);
return 0;
 }
 EXPORT_SYMBOL(register_shrinker);
 
+static void get_shrinker(struct shrinker *shrinker)
+{
+   atomic_inc(&shrinker->nr_active);
+   rcu_read_unlock();
+}
+
+static void put_shrinker(struct shrinker *shrinker)
+{
+   rcu_read_lock();
+   if (!atomic_dec_return(&shrinker->nr_active))
+   wake_up_atomic_t(&shrinker->nr_active);
+}
+
+static int shrinker_wait_atomic_t(atomic_t *p)
+{
+   schedule();
+   return 0;
+}
 /*
  * Remove one
  */
 void unregister_shrinker(struct shrinker *shrinker)
 {
-   down_write(&shrinker_rwsem);
-   list_del(&shrinker->list);
-   up_write(&shrinker_rwsem);
+   spin_lock(&shrinker_lock);
+   list_del_rcu(&shrinker->list);
+   spin_unlock(&shrinker_lock);
+   wait_on_atomic_t(&shrinker->nr_active, shrinker_wait_atomic_t,
+TASK_UNINTERRUPTIBLE);
kfree(shrinker->nr_deferred);
 }
 EXPORT_SYMBOL(unregister_shrinker);
@@ -468,18 +489,9 @@ static unsigned long shrink_slab(gfp_t gfp_mask, int nid,
if (nr_scanned == 0)
nr_scanned = SWAP_CLUSTER_MAX;
 
-   if (!down_read_trylock(&shrinker_rwsem)) {
-   /*
-* If we would return 0, our callers would understand that we
-* have nothing else to shrink and give up trying. By returning
-* 1 we keep it going and assume we'll be able to shrink next
-* time.
-*/
-   freed = 1;
-   goto out;
-   }
+   rcu_read_lock();
 
-   list_for_each_entry(shrinker, &shrinker_list, list) {
+   list_for_each_entry_rcu(shrinker, &shrinker_list, list) {
struct shrink_control sc = {
.gfp_mask = gfp_mask,
.nid = nid,
@@ -498,11 +510,13 @@ static unsigned long shrink_slab(gfp_t gfp_mask, int nid,
if (!(shrinker->flags & SHRINKER_NUMA_AWARE))
sc.nid = 0;
 
+   get_shrinker(shri

Re: [PATCH v2] mm, shrinker: make shrinker_list lockless

2017-11-08 Thread Shakeel Butt
On Wed, Nov 8, 2017 at 4:07 PM, Minchan Kim <minc...@kernel.org> wrote:
> Hi,
>
> On Wed, Nov 08, 2017 at 09:37:40AM -0800, Shakeel Butt wrote:
>> In our production, we have observed that the job loader gets stuck for
>> 10s of seconds while doing mount operation. It turns out that it was
>> stuck in register_shrinker() and some unrelated job was under memory
>> pressure and spending time in shrink_slab(). Our machines have a lot
>> of shrinkers registered and jobs under memory pressure have to traverse
>> all of those memcg-aware shrinkers and thus affect unrelated jobs which
>> want to register their own shrinkers.
>>
>> This patch has made the shrinker_list traversal lockless and shrinker
>> register remain fast. For the shrinker unregister, atomic counter
>> has been introduced to avoid synchronize_rcu() call. The fields of
>
> So, do you want to enhance unregister shrinker path as well as registering?
>

Yes, I don't want to add delay to unregister_shrinker for the normal
case where there aren't any readers (i.e. an unconditional
synchronize_rcu).

>> struct shrinker has been rearraged to make sure that the size does
>> not increase for x86_64.
>>
>> The shrinker functions are allowed to reschedule() and thus can not
>> be called with rcu read lock. One way to resolve that is to use
>> srcu read lock but then ifdefs have to be used as SRCU is behind
>> CONFIG_SRCU. Another way is to just release the rcu read lock before
>> calling the shrinker and reacquire on the return. The atomic counter
>> will make sure that the shrinker entry will not be freed under us.
>
> Instead of adding new lock, could we simply release shrinker_rwsem read-side
> lock in list traveral periodically to give a chance to hold a write-side
> lock?
>

Greg has already pointed out that this patch is still not right/safe
and now I am coming to the opinion that without changing the shrinker
API, it might not be possible to do lockless shrinker traversal and
unregister shrinker without synchronize_rcu().

Regarding your suggestion, do you mean periodically releasing the lock
and reacquiring it using down_read_trylock() or down_read()?


[PATCH] mm, mlock, vmscan: no more skipping pagevecs

2017-11-04 Thread Shakeel Butt
When a thread mlocks an address space backed by file, a new
page is allocated (assuming file page is not in memory), added
to the local pagevec (lru_add_pvec), I/O is triggered and the
thread then sleeps on the page. On I/O completion, the thread
can wake on a different CPU, the mlock syscall will then set
the PageMlocked() bit of the page but will not be able to put
that page in unevictable LRU as the page is on the pagevec of
a different CPU. Even on drain, that page will go to evictable
LRU because the PageMlocked() bit is not checked on pagevec
drain.

The page will eventually go to right LRU on reclaim but the
LRU stats will remain skewed for a long time.

However, this issue does not happen for anon pages on swap
because unlike file pages, anon pages are not added to pagevec
until they have been fully swapped in. Also the fault handler
uses vm_flags to set the PageMlocked() bit of such anon pages
even before returning to mlock() syscall and mlocked pages will
skip pagevecs and directly be put into unevictable LRU. No such
luck for file pages.

One way to resolve this issue, is to somehow plumb vm_flags from
filemap_fault() to add_to_page_cache_lru() which will then skip
the pagevec for pages of VM_LOCKED vma and directly put them to
unevictable LRU. However this patch took a different approach.

All the pages, even unevictable, will be added to the pagevecs
and on the drain, the pages will be added on their LRUs correctly
by checking their evictability. This resolves the mlocked file
pages on pagevec of other CPUs issue because when those pagevecs
will be drained, the mlocked file pages will go to unevictable
LRU. Also this makes the race with munlock easier to resolve
because the pagevec drains happen in LRU lock.

There is one (good) side effect though. Without this patch, the
pages allocated for System V shared memory segment are added to
evictable LRUs even after shmctl(SHM_LOCK) on that segment. This
patch will correctly put such pages to unevictable LRU.

Signed-off-by: Shakeel Butt <shake...@google.com>
---
 include/linux/swap.h |  2 --
 mm/swap.c| 68 +---
 mm/vmscan.c  | 59 +
 3 files changed, 34 insertions(+), 95 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index f02fb5db8914..9b31d04914eb 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -326,8 +326,6 @@ extern void deactivate_file_page(struct page *page);
 extern void mark_page_lazyfree(struct page *page);
 extern void swap_setup(void);
 
-extern void add_page_to_unevictable_list(struct page *page);
-
 extern void lru_cache_add_active_or_unevictable(struct page *page,
struct vm_area_struct *vma);
 
diff --git a/mm/swap.c b/mm/swap.c
index a77d68f2c1b6..776fb33e81d3 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -445,30 +445,6 @@ void lru_cache_add(struct page *page)
__lru_cache_add(page);
 }
 
-/**
- * add_page_to_unevictable_list - add a page to the unevictable list
- * @page:  the page to be added to the unevictable list
- *
- * Add page directly to its zone's unevictable list.  To avoid races with
- * tasks that might be making the page evictable, through eg. munlock,
- * munmap or exit, while it's not on the lru, we want to add the page
- * while it's locked or otherwise "invisible" to other tasks.  This is
- * difficult to do when using the pagevec cache, so bypass that.
- */
-void add_page_to_unevictable_list(struct page *page)
-{
-   struct pglist_data *pgdat = page_pgdat(page);
-   struct lruvec *lruvec;
-
-   spin_lock_irq(&pgdat->lru_lock);
-   lruvec = mem_cgroup_page_lruvec(page, pgdat);
-   ClearPageActive(page);
-   SetPageUnevictable(page);
-   SetPageLRU(page);
-   add_page_to_lru_list(page, lruvec, LRU_UNEVICTABLE);
-   spin_unlock_irq(&pgdat->lru_lock);
-}
-
 /**
  * lru_cache_add_active_or_unevictable
  * @page:  the page to be added to LRU
@@ -484,13 +460,9 @@ void lru_cache_add_active_or_unevictable(struct page *page,
 {
VM_BUG_ON_PAGE(PageLRU(page), page);
 
-   if (likely((vma->vm_flags & (VM_LOCKED | VM_SPECIAL)) != VM_LOCKED)) {
+   if (likely((vma->vm_flags & (VM_LOCKED | VM_SPECIAL)) != VM_LOCKED))
SetPageActive(page);
-   lru_cache_add(page);
-   return;
-   }
-
-   if (!TestSetPageMlocked(page)) {
+   else if (!TestSetPageMlocked(page)) {
/*
 * We use the irq-unsafe __mod_zone_page_stat because this
 * counter is not modified from interrupt context, and the pte
@@ -500,7 +472,7 @@ void lru_cache_add_active_or_unevictable(struct page *page,
hpage_nr_pages(page));
count_vm_event(UNEVICTABLE_PGMLOCKED);
}
-   add_page_to_unevictable_list(page);
+   lru_cache_add(page);

Re: [RESEND v12 3/6] mm, oom: cgroup-aware OOM killer

2017-10-31 Thread Shakeel Butt
On Tue, Oct 31, 2017 at 9:40 AM, Johannes Weiner <han...@cmpxchg.org> wrote:
> On Tue, Oct 31, 2017 at 08:04:19AM -0700, Shakeel Butt wrote:
>> > +
>> > +static void select_victim_memcg(struct mem_cgroup *root, struct 
>> > oom_control *oc)
>> > +{
>> > +   struct mem_cgroup *iter;
>> > +
>> > +   oc->chosen_memcg = NULL;
>> > +   oc->chosen_points = 0;
>> > +
>> > +   /*
>> > +* The oom_score is calculated for leaf memory cgroups (including
>> > +* the root memcg).
>> > +*/
>> > +   rcu_read_lock();
>> > +   for_each_mem_cgroup_tree(iter, root) {
>> > +   long score;
>> > +
>> > +   if (memcg_has_children(iter) && iter != root_mem_cgroup)
>> > +   continue;
>> > +
>>
>> Cgroup v2 does not support charge migration between memcgs. So, there
>> can be intermediate nodes which may contain the major charge of the
>> processes in their leaf descendants. Skipping such intermediate nodes
>> will in effect protect such processes from the oom-killer (lower on
>> the list to be killed). Is it ok to not handle such a scenario? If
>> yes, shouldn't we document it?
>
> Tasks cannot be in intermediate nodes, so the only way you can end up
> in a situation like this is to start tasks fully, let them fault in
> their full workingset, then create child groups and move them there.
>
> That has attribution problems much wider than the OOM killer: any
> local limits you would set on a leaf cgroup like this ALSO won't
> control the memory of its tasks - as it's all sitting in the parent.
>
> We created the "no internal competition" rule exactly to prevent this
> situation.

Rather than the "no internal competition" restriction, I think "charge
migration" would have resolved that situation. Also, the "no internal
competition" restriction (I am assuming 'no internal competition' means
no tasks in internal nodes, please correct me if I am wrong) has made
"charge migration" hard to implement and thus it was not added in
cgroup v2.

I know this is a parallel discussion and excuse my ignorance, but what
are the other reasons behind "no internal competition", specifically
for the memory controller?

> To be consistent with that rule, we might want to disallow
> the creation of child groups once a cgroup has local memory charges.
>
> It's trivial to change the setup sequence to create the leaf cgroup
> first, then launch the workload from within.
>

Only if the cgroup hierarchy is centrally controlled and each task's
whole hierarchy is known in advance.

> Either way, this is nothing specific about the OOM killer.


Re: [RESEND v12 3/6] mm, oom: cgroup-aware OOM killer

2017-10-31 Thread Shakeel Butt
> +
> +static void select_victim_memcg(struct mem_cgroup *root, struct oom_control 
> *oc)
> +{
> +   struct mem_cgroup *iter;
> +
> +   oc->chosen_memcg = NULL;
> +   oc->chosen_points = 0;
> +
> +   /*
> +* The oom_score is calculated for leaf memory cgroups (including
> +* the root memcg).
> +*/
> +   rcu_read_lock();
> +   for_each_mem_cgroup_tree(iter, root) {
> +   long score;
> +
> +   if (memcg_has_children(iter) && iter != root_mem_cgroup)
> +   continue;
> +

Cgroup v2 does not support charge migration between memcgs. So, there
can be intermediate nodes which may contain the major charge of the
processes in their leaf descendants. Skipping such intermediate nodes
will in effect protect such processes from the oom-killer (lower on
the list to be killed). Is it ok to not handle such a scenario? If yes,
shouldn't we document it?

> +   score = oom_evaluate_memcg(iter, oc->nodemask, 
> oc->totalpages);
> +
> +   /*
> +* Ignore empty and non-eligible memory cgroups.
> +*/
> +   if (score == 0)
> +   continue;
> +
> +   /*
> +* If there are inflight OOM victims, we don't need
> +* to look further for new victims.
> +*/
> +   if (score == -1) {
> +   oc->chosen_memcg = INFLIGHT_VICTIM;
> +   mem_cgroup_iter_break(root, iter);
> +   break;
> +   }
> +
> +   if (score > oc->chosen_points) {
> +   oc->chosen_points = score;
> +   oc->chosen_memcg = iter;
> +   }
> +   }
> +
> +   if (oc->chosen_memcg && oc->chosen_memcg != INFLIGHT_VICTIM)
> +   css_get(&oc->chosen_memcg->css);
> +
> +   rcu_read_unlock();
> +}


Re: [PATCH] fs, mm: account filp and names caches to kmemcg

2017-10-30 Thread Shakeel Butt
On Mon, Oct 30, 2017 at 1:29 AM, Michal Hocko <mho...@kernel.org> wrote:
> On Fri 27-10-17 13:50:47, Shakeel Butt wrote:
>> > Why is OOM-disabling a thing? Why isn't this simply a "kill everything
>> > else before you kill me"? It's crashing the kernel in trying to
>> > protect a userspace application. How is that not insane?
>>
>> In parallel to other discussion, I think we should definitely move
>> from "completely oom-disabled" semantics to something similar to "kill
>> me last" semantics. Is there any objection to this idea?
>
> Could you be more specific what you mean?
>

I get the impression that the main reason behind the complexity of the
oom-killer is allowing processes to be protected from it, i.e.
disabling oom-killing of a process by setting
/proc/[pid]/oom_score_adj to -1000. So, instead of oom-disabling, add
an interface which will let users/admins mark a process to be
oom-killed only as a last resort.
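
Purely as an illustration of what such "kill me last" semantics could
look like (hypothetical sketch, none of these names exist today), the
victim selection could do two passes and only consider flagged tasks
in the second, last-resort pass:

/* Hypothetical sketch: oom_kill_last is a made-up per-process flag. */
static bool oom_candidate(struct task_struct *p, bool last_resort_pass)
{
    /* Skip "kill me last" tasks on the first pass over the tasklist. */
    return last_resort_pass || !p->signal->oom_kill_last;
}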


Re: [PATCH 1/2] mm/memcg: try harder to decrease [memory,memsw].limit_in_bytes

2017-12-20 Thread Shakeel Butt
On Wed, Dec 20, 2017 at 3:34 AM, Michal Hocko  wrote:
> On Wed 20-12-17 14:32:19, Andrey Ryabinin wrote:
>> On 12/20/2017 01:33 PM, Michal Hocko wrote:
>> > On Wed 20-12-17 13:24:28, Andrey Ryabinin wrote:
>> >> mem_cgroup_resize_[memsw]_limit() tries to free only 32 (SWAP_CLUSTER_MAX)
>> >> pages on each iteration. This makes practically impossible to decrease
>> >> limit of memory cgroup. Tasks could easily allocate back 32 pages,
>> >> so we can't reduce memory usage, and once retry_count reaches zero we 
>> >> return
>> >> -EBUSY.
>> >>
>> >> It's easy to reproduce the problem by running the following commands:
>> >>
>> >>   mkdir /sys/fs/cgroup/memory/test
>> >>   echo $$ >> /sys/fs/cgroup/memory/test/tasks
>> >>   cat big_file > /dev/null &
>> >>   sleep 1 && echo $((100*1024*1024)) > 
>> >> /sys/fs/cgroup/memory/test/memory.limit_in_bytes
>> >>   -bash: echo: write error: Device or resource busy
>> >>
>> >> Instead of trying to free small amount of pages, it's much more
>> >> reasonable to free 'usage - limit' pages.
>> >
>> > But that only makes the issue less probable. It doesn't fix it because
>> > if (curusage >= oldusage)
>> > retry_count--;
>> > can still be true because allocator might be faster than the reclaimer.
>> > Wouldn't it be more reasonable to simply remove the retry count and keep
>> > trying until interrupted or we manage to update the limit.
>>
>> But does it make sense to continue reclaiming even if reclaimer can't
>> make any progress? I'd say no. "Allocator is faster than reclaimer"
>> may be not the only reason for failed reclaim. E.g. we could try to
>> set limit lower than amount of mlock()ed memory in cgroup, retrying
>> reclaim would be just a waste of machine's resources.  Or we simply
>> don't have any swap, and anon > new_limit. Should we burn the cpu in
>> that case?
>
> We can check the number of reclaimed pages and go EBUSY if it is 0.
>
>> > Another option would be to commit the new limit and allow temporal 
>> > overcommit
>> > of the hard limit. New allocations and the limit update paths would
>> > reclaim to the hard limit.
>> >
>>
>> It sounds a bit fragile and tricky to me. I wouldn't go that way
>> without unless we have a very good reason for this.
>
> I haven't explored this, to be honest, so there may be dragons that way.
> I've just mentioned that option for completness.
>

We already do this for cgroup-v2's memory.max. So, I don't think it is
fragile or tricky.
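
For reference, the cgroup-v2 memory.max write path already does roughly
the following (paraphrased from memory_max_write() in mm/memcontrol.c;
exact field and function names vary across kernel versions, and
shrink_to_new_max() is just an illustrative wrapper name):

static int shrink_to_new_max(struct mem_cgroup *memcg, unsigned long max)
{
    unsigned int nr_reclaims = MEM_CGROUP_RECLAIM_RETRIES;
    bool drained = false;
    int err = 0;

    xchg(&memcg->memory.limit, max);    /* commit the new limit first */

    for (;;) {
        unsigned long nr_pages = page_counter_read(&memcg->memory);

        if (nr_pages <= max)
            break;

        if (signal_pending(current)) {
            err = -EINTR;
            break;
        }

        if (!drained) {
            drain_all_stock(memcg);
            drained = true;
            continue;
        }

        if (nr_reclaims) {
            if (!try_to_free_mem_cgroup_pages(memcg, nr_pages - max,
                                              GFP_KERNEL, true))
                nr_reclaims--;
            continue;
        }

        /* out of reclaim retries: invoke the memcg OOM killer */
        mem_cgroup_event(memcg, MEMCG_OOM);
        if (!mem_cgroup_out_of_memory(memcg, GFP_KERNEL, 0))
            break;
    }
    return err;
}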


[PATCH] mm: memcontrol: drain stocks on resize limit

2018-05-04 Thread Shakeel Butt
Resizing the memcg limit for cgroup-v2 drains the stocks before
triggering the memcg reclaim. Do the same for cgroup-v1 to make the
behavior consistent.

Signed-off-by: Shakeel Butt <shake...@google.com>
---
 mm/memcontrol.c | 7 +++
 1 file changed, 7 insertions(+)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 25b148c2d222..e2d33a37f971 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2463,6 +2463,7 @@ static int mem_cgroup_resize_max(struct mem_cgroup *memcg,
 unsigned long max, bool memsw)
 {
bool enlarge = false;
+   bool drained = false;
int ret;
bool limits_invariant;
struct page_counter *counter = memsw ? &memcg->memsw : &memcg->memory;
@@ -2493,6 +2494,12 @@ static int mem_cgroup_resize_max(struct mem_cgroup 
*memcg,
if (!ret)
break;
 
+   if (!drained) {
+   drain_all_stock(memcg);
+   drained = true;
+   continue;
+   }
+
if (!try_to_free_mem_cgroup_pages(memcg, 1,
GFP_KERNEL, !memsw)) {
ret = -EBUSY;
-- 
2.17.0.441.gb46fe60e1d-goog



[PATCH] mm: memcontrol: drain memcg stock on force_empty

2018-05-07 Thread Shakeel Butt
From: Junaid Shahid <juna...@google.com>

The per-cpu memcg stock can retain a charge of up to 32 pages. On a
machine with a large number of CPUs, this can amount to a decent amount
of memory. Additionally, the force_empty interface might be triggering
unneeded memcg reclaim.

Signed-off-by: Junaid Shahid <juan...@google.com>
Signed-off-by: Shakeel Butt <shake...@google.com>
---
 mm/memcontrol.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index e2d33a37f971..2c3c69524b49 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2841,6 +2841,9 @@ static int mem_cgroup_force_empty(struct mem_cgroup 
*memcg)
 
/* we call try-to-free pages for make this cgroup empty */
lru_add_drain_all();
+
+   drain_all_stock(memcg);
+
/* try to free all pages in this cgroup */
while (nr_retries && page_counter_read(&memcg->memory)) {
int progress;
-- 
2.17.0.441.gb46fe60e1d-goog



Re: [PATCH] mm: memcontrol: drain memcg stock on force_empty

2018-05-07 Thread Shakeel Butt
On Mon, May 7, 2018 at 1:16 PM Shakeel Butt <shake...@google.com> wrote:

> From: Junaid Shahid <juna...@google.com>

> The per-cpu memcg stock can retain a charge of up to 32 pages. On a
> machine with a large number of CPUs, this can amount to a decent amount
> of memory. Additionally, the force_empty interface might be triggering
> unneeded memcg reclaim.

> Signed-off-by: Junaid Shahid <juan...@google.com>

This should be "Junaid Shahid <juna...@google.com>".

> Signed-off-by: Shakeel Butt <shake...@google.com>
> ---
>   mm/memcontrol.c | 3 +++
>   1 file changed, 3 insertions(+)

> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index e2d33a37f971..2c3c69524b49 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -2841,6 +2841,9 @@ static int mem_cgroup_force_empty(struct mem_cgroup
*memcg)

>  /* we call try-to-free pages for make this cgroup empty */
>  lru_add_drain_all();
> +
> +   drain_all_stock(memcg);
> +
>  /* try to free all pages in this cgroup */
>  while (nr_retries && page_counter_read(&memcg->memory)) {
>  int progress;
> --
> 2.17.0.441.gb46fe60e1d-goog


Re: [PATCH v4 01/13] mm: Assign id to every memcg-aware shrinker

2018-05-09 Thread Shakeel Butt
On Wed, May 9, 2018 at 3:55 PM Andrew Morton 
wrote:

> On Wed, 09 May 2018 14:56:55 +0300 Kirill Tkhai 
wrote:

> > The patch introduces shrinker::id number, which is used to enumerate
> > memcg-aware shrinkers. The number start from 0, and the code tries
> > to maintain it as small as possible.
> >
> > This will be used as to represent a memcg-aware shrinkers in memcg
> > shrinkers map.
> >
> > ...
> >
> > --- a/fs/super.c
> > +++ b/fs/super.c
> > @@ -248,6 +248,9 @@ static struct super_block *alloc_super(struct
file_system_type *type, int flags,
> >   s->s_time_gran = 1000000000;
> >   s->cleancache_poolid = CLEANCACHE_NO_POOL;
> >
> > +#if defined(CONFIG_MEMCG) && !defined(CONFIG_SLOB)

> It would be more conventional to do this logic in Kconfig - define a
> new MEMCG_SHRINKER which equals MEMCG && !SLOB.

> This ifdef occurs a distressing number of times in the patchset :( I
> wonder if there's something we can do about that.

> Also, why doesn't it work with slob?  Please describe the issue in the
> changelogs somewhere.

> It's a pretty big patchset.  I *could* merge it up in the hope that
> someone is planning do do a review soon.  But is there such a person?


Hi Andrew, a couple of these patches are being reviewed by Vladimir and I
plan to review them too by next week. I think we can merge them into the
mm tree for more testing and I will also test this patch series internally
(though I have to backport it to our kernel for more extensive testing).

thanks,
Shakeel


[PATCH] mm: fix race between kmem_cache destroy, create and deactivate

2018-05-21 Thread Shakeel Butt
The memcg kmem cache creation and deactivation (SLUB only) is
asynchronous. If a root kmem cache is destroyed whose memcg cache is in
the process of creation or deactivation, the kernel may crash.

Example of one such crash:
general protection fault: 0000 [#1] SMP PTI
CPU: 1 PID: 1721 Comm: kworker/14:1 Not tainted 4.17.0-smp
...
Workqueue: memcg_kmem_cache kmemcg_deactivate_workfn
RIP: 0010:has_cpu_slab
...
Call Trace:
? on_each_cpu_cond
__kmem_cache_shrink
kmemcg_cache_deact_after_rcu
kmemcg_deactivate_workfn
process_one_work
worker_thread
kthread
ret_from_fork+0x35/0x40

This issue is due to the lack of reference counting for the root
kmem_caches. There exists a refcount in kmem_cache but it is actually a
count of aliases, i.e. the number of kmem_caches merged together.

This patch makes the alias count explicit and adds reference counting to
the root kmem_caches. The reference of a root kmem cache is elevated on
merge and while its memcg kmem_cache is in the process of creation or
deactivation.

Signed-off-by: Shakeel Butt <shake...@google.com>
---
 include/linux/slab.h |  2 +
 include/linux/slab_def.h |  3 +-
 include/linux/slub_def.h |  3 +-
 mm/memcontrol.c  |  7 
 mm/slab.c|  4 +-
 mm/slab.h|  5 ++-
 mm/slab_common.c | 84 ++--
 mm/slub.c| 14 ---
 8 files changed, 90 insertions(+), 32 deletions(-)

diff --git a/include/linux/slab.h b/include/linux/slab.h
index 9ebe659bd4a5..4c28f2483a22 100644
--- a/include/linux/slab.h
+++ b/include/linux/slab.h
@@ -674,6 +674,8 @@ struct memcg_cache_params {
 };
 
 int memcg_update_all_caches(int num_memcgs);
+bool kmem_cache_tryget(struct kmem_cache *s);
+void kmem_cache_put(struct kmem_cache *s);
 
 /**
  * kmalloc_array - allocate memory for an array.
diff --git a/include/linux/slab_def.h b/include/linux/slab_def.h
index d9228e4d0320..4bb22c89a740 100644
--- a/include/linux/slab_def.h
+++ b/include/linux/slab_def.h
@@ -41,7 +41,8 @@ struct kmem_cache {
 /* 4) cache creation/removal */
const char *name;
struct list_head list;
-   int refcount;
+   refcount_t refcount;
+   int alias_count;
int object_size;
int align;
 
diff --git a/include/linux/slub_def.h b/include/linux/slub_def.h
index 3773e26c08c1..532d4b6f83ed 100644
--- a/include/linux/slub_def.h
+++ b/include/linux/slub_def.h
@@ -97,7 +97,8 @@ struct kmem_cache {
struct kmem_cache_order_objects max;
struct kmem_cache_order_objects min;
gfp_t allocflags;   /* gfp flags to use on each alloc */
-   int refcount;   /* Refcount for slab cache destroy */
+   refcount_t refcount;/* Refcount for slab cache destroy */
+   int alias_count;/* Number of root kmem caches merged */
void (*ctor)(void *);
unsigned int inuse; /* Offset to metadata */
unsigned int align; /* Alignment */
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index bdb8028c806c..ab5673dbfc4e 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2185,6 +2185,7 @@ static void memcg_kmem_cache_create_func(struct 
work_struct *w)
memcg_create_kmem_cache(memcg, cachep);
 
css_put(&memcg->css);
+   kmem_cache_put(cachep);
kfree(cw);
 }
 
@@ -2200,6 +2201,12 @@ static void __memcg_schedule_kmem_cache_create(struct 
mem_cgroup *memcg,
if (!cw)
return;
 
+   /* Make sure root kmem cache does not get destroyed in the middle */
+   if (!kmem_cache_tryget(cachep)) {
+   kfree(cw);
+   return;
+   }
+
css_get(&memcg->css);
 
cw->memcg = memcg;
diff --git a/mm/slab.c b/mm/slab.c
index c1fe8099b3cd..080732f5f20d 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -1883,8 +1883,8 @@ __kmem_cache_alias(const char *name, unsigned int size, 
unsigned int align,
struct kmem_cache *cachep;
 
cachep = find_mergeable(size, align, flags, name, ctor);
-   if (cachep) {
-   cachep->refcount++;
+   if (cachep && kmem_cache_tryget(cachep)) {
+   cachep->alias_count++;
 
/*
 * Adjust the object sizes so that we clear
diff --git a/mm/slab.h b/mm/slab.h
index 68bdf498da3b..25962ab75ec1 100644
--- a/mm/slab.h
+++ b/mm/slab.h
@@ -25,7 +25,8 @@ struct kmem_cache {
unsigned int useroffset;/* Usercopy region offset */
unsigned int usersize;  /* Usercopy region size */
const char *name;   /* Slab name for sysfs */
-   int refcount;   /* Use counter */
+   refcount_t refcount;/* Use counter */
+   int alias_count;
void (*ctor)(void *);   /* Called on object slot creation */
struct list_head list;  /* List of all slab caches on the system */
 };
@@

Re: [PATCH] mm: fix race between kmem_cache destroy, create and deactivate

2018-05-21 Thread Shakeel Butt
On Mon, May 21, 2018 at 11:42 AM Andrew Morton <a...@linux-foundation.org>
wrote:

> On Mon, 21 May 2018 10:41:16 -0700 Shakeel Butt <shake...@google.com>
wrote:

> > The memcg kmem cache creation and deactivation (SLUB only) is
> > asynchronous. If a root kmem cache is destroyed whose memcg cache is in
> > the process of creation or deactivation, the kernel may crash.
> >
> > Example of one such crash:
> >   general protection fault:  [#1] SMP PTI
> >   CPU: 1 PID: 1721 Comm: kworker/14:1 Not tainted 4.17.0-smp
> >   ...
> >   Workqueue: memcg_kmem_cache kmemcg_deactivate_workfn
> >   RIP: 0010:has_cpu_slab
> >   ...
> >   Call Trace:
> >   ? on_each_cpu_cond
> >   __kmem_cache_shrink
> >   kmemcg_cache_deact_after_rcu
> >   kmemcg_deactivate_workfn
> >   process_one_work
> >   worker_thread
> >   kthread
> >   ret_from_fork+0x35/0x40
> >
> > This issue is due to the lack of reference counting for the root
> > kmem_caches. There exists a refcount in kmem_cache but it is actually a
> > count of aliases, i.e. the number of kmem_caches merged together.
> >
> > This patch makes the alias count explicit and adds reference counting to
> > the root kmem_caches. The reference of a root kmem cache is elevated on
> > merge and while its memcg kmem_cache is in the process of creation or
> > deactivation.
> >

> The patch seems depressingly complex.

> And a bit underdocumented...


I will add more documentation to the code.

> > --- a/include/linux/slab.h
> > +++ b/include/linux/slab.h
> > @@ -674,6 +674,8 @@ struct memcg_cache_params {
> >  };
> >
> >  int memcg_update_all_caches(int num_memcgs);
> > +bool kmem_cache_tryget(struct kmem_cache *s);
> > +void kmem_cache_put(struct kmem_cache *s);
> >
> >  /**
> >   * kmalloc_array - allocate memory for an array.
> > diff --git a/include/linux/slab_def.h b/include/linux/slab_def.h
> > index d9228e4d0320..4bb22c89a740 100644
> > --- a/include/linux/slab_def.h
> > +++ b/include/linux/slab_def.h
> > @@ -41,7 +41,8 @@ struct kmem_cache {
> >  /* 4) cache creation/removal */
> >   const char *name;
> >   struct list_head list;
> > - int refcount;
> > + refcount_t refcount;
> > + int alias_count;

> The semantic meaning of these two?  What locking protects alias_count?

SLAB and SLUB allow reusing existing root kmem caches. The alias_count of a
kmem cache tells how many times this kmem cache has been reused (maybe
shared_count or reused_count are better names). Basically, if there were 5
root kmem cache creation requests and SLAB/SLUB decided to reuse the first
kmem cache created for the next 4 requests, then this count will be 5 and
all 5 will be pointing to the same kmem_cache object.

Before this patch, alias_count (previously named refcount) was modified
only under slab_mutex but could be read outside it. It was conflated into
multiple things like shared count, reference count and unmergeable flag (if
negative). This patch decouples the reference counting from this field and
there is no need to protect alias_count with locks.
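
With that split the new helpers are essentially thin wrappers, something
like the following sketch (locking and the actual release path are
simplified away):

bool kmem_cache_tryget(struct kmem_cache *s)
{
    /* Fails once the last reference is gone, i.e. the cache is dying. */
    return refcount_inc_not_zero(&s->refcount);
}

void kmem_cache_put(struct kmem_cache *s)
{
    /* Last reference dropped: now the root cache can really go away. */
    if (refcount_dec_and_test(&s->refcount))
        shutdown_cache(s);    /* real teardown path, simplified here */
}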

> >   int object_size;
> >   int align;
> >
> > diff --git a/include/linux/slub_def.h b/include/linux/slub_def.h
> > index 3773e26c08c1..532d4b6f83ed 100644
> > --- a/include/linux/slub_def.h
> > +++ b/include/linux/slub_def.h
> > @@ -97,7 +97,8 @@ struct kmem_cache {
> >   struct kmem_cache_order_objects max;
> >   struct kmem_cache_order_objects min;
> >   gfp_t allocflags;   /* gfp flags to use on each alloc */
> > - int refcount;   /* Refcount for slab cache destroy */
> > + refcount_t refcount;/* Refcount for slab cache destroy */
> > + int alias_count;/* Number of root kmem caches merged */

> "merged" what with what in what manner?

shared or reused might be better words here.

> >   void (*ctor)(void *);
> >   unsigned int inuse; /* Offset to metadata */
> >   unsigned int align; /* Alignment */
> >
> > ...
> >
> > --- a/mm/slab.h
> > +++ b/mm/slab.h
> > @@ -25,7 +25,8 @@ struct kmem_cache {
> >   unsigned int useroffset;/* Usercopy region offset */
> >   unsigned int usersize;  /* Usercopy region size */
> >   const char *name;   /* Slab name for sysfs */
> > - int refcount;   /* Use counter */
> > + refcount_t refcount;/* Use counter */
> > + int alias_count;

> Semantic meaning/usage of alias_count?  Locking for i

[PATCH v2] mm: fix race between kmem_cache destroy, create and deactivate

2018-05-22 Thread Shakeel Butt
The memcg kmem cache creation and deactivation (SLUB only) is
asynchronous. If a root kmem cache is destroyed whose memcg cache is in
the process of creation or deactivation, the kernel may crash.

Example of one such crash:
general protection fault: 0000 [#1] SMP PTI
CPU: 1 PID: 1721 Comm: kworker/14:1 Not tainted 4.17.0-smp
...
Workqueue: memcg_kmem_cache kmemcg_deactivate_workfn
RIP: 0010:has_cpu_slab
...
Call Trace:
? on_each_cpu_cond
__kmem_cache_shrink
kmemcg_cache_deact_after_rcu
kmemcg_deactivate_workfn
process_one_work
worker_thread
kthread
ret_from_fork+0x35/0x40

This issue is due to the lack of real reference counting for the root
kmem_caches. Currently kmem_cache does have a field named refcount which
has been used for multiple purposes, i.e. shared count, reference count
and noshare flag. Due to its conflated nature, it cannot be used for
reference counting by other subsystems.

This patch decouples the reference counting from the shared count and
noshare flag. The new field 'shared_count' represents the shared count
and noshare flag, while 'refcount' is converted into a real reference
counter.

The reference counting is only implemented for root kmem_caches for
simplicity. The reference of a root kmem_cache is elevated on sharing or
while its memcg kmem_cache creation or deactivation request is in
flight, which makes sure that the root kmem_cache is not destroyed in
the middle. As the reference of the kmem_cache is elevated on sharing,
'shared_count' does not need any locking protection; at worst it can be
outdated for a small window, which is tolerable.

Signed-off-by: Shakeel Butt <shake...@google.com>
---
Changelog since v1:
- Added more documentation to the code
- Renamed fields to be more readable

 include/linux/slab.h |   2 +
 include/linux/slab_def.h |   5 +-
 include/linux/slub_def.h |   3 +-
 mm/memcontrol.c  |   7 +++
 mm/slab.c|   4 +-
 mm/slab.h|   5 +-
 mm/slab_common.c | 122 ---
 mm/slub.c|  14 +++--
 8 files changed, 130 insertions(+), 32 deletions(-)

diff --git a/include/linux/slab.h b/include/linux/slab.h
index 9ebe659bd4a5..4c28f2483a22 100644
--- a/include/linux/slab.h
+++ b/include/linux/slab.h
@@ -674,6 +674,8 @@ struct memcg_cache_params {
 };
 
 int memcg_update_all_caches(int num_memcgs);
+bool kmem_cache_tryget(struct kmem_cache *s);
+void kmem_cache_put(struct kmem_cache *s);
 
 /**
  * kmalloc_array - allocate memory for an array.
diff --git a/include/linux/slab_def.h b/include/linux/slab_def.h
index d9228e4d0320..a3aa2c7c1fcd 100644
--- a/include/linux/slab_def.h
+++ b/include/linux/slab_def.h
@@ -41,7 +41,10 @@ struct kmem_cache {
 /* 4) cache creation/removal */
const char *name;
struct list_head list;
-   int refcount;
+   /* Refcount for root kmem caches */
+   refcount_t refcount;
+   /* Number of root kmem caches sharing this cache */
+   int shared_count;
int object_size;
int align;
 
diff --git a/include/linux/slub_def.h b/include/linux/slub_def.h
index 3773e26c08c1..a2d53cd28bb6 100644
--- a/include/linux/slub_def.h
+++ b/include/linux/slub_def.h
@@ -97,7 +97,8 @@ struct kmem_cache {
struct kmem_cache_order_objects max;
struct kmem_cache_order_objects min;
gfp_t allocflags;   /* gfp flags to use on each alloc */
-   int refcount;   /* Refcount for slab cache destroy */
+   refcount_t refcount;/* Refcount for root kmem cache */
+   int shared_count;   /* Number of kmem caches sharing this cache */
void (*ctor)(void *);
unsigned int inuse; /* Offset to metadata */
unsigned int align; /* Alignment */
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index bdb8028c806c..ab5673dbfc4e 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2185,6 +2185,7 @@ static void memcg_kmem_cache_create_func(struct 
work_struct *w)
memcg_create_kmem_cache(memcg, cachep);
 
css_put(&memcg->css);
+   kmem_cache_put(cachep);
kfree(cw);
 }
 
@@ -2200,6 +2201,12 @@ static void __memcg_schedule_kmem_cache_create(struct 
mem_cgroup *memcg,
if (!cw)
return;
 
+   /* Make sure root kmem cache does not get destroyed in the middle */
+   if (!kmem_cache_tryget(cachep)) {
+   kfree(cw);
+   return;
+   }
+
css_get(&memcg->css);
 
cw->memcg = memcg;
diff --git a/mm/slab.c b/mm/slab.c
index c1fe8099b3cd..7e494b83bcef 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -1883,8 +1883,8 @@ __kmem_cache_alias(const char *name, unsigned int size, 
unsigned int align,
struct kmem_cache *cachep;
 
cachep = find_mergeable(size, align, flags, name, ctor);
-   if (cachep) {
-   c

[PATCH] memcg: force charge kmem counter too

2018-05-25 Thread Shakeel Butt
Based on several conditions the kernel can decide to force charge an
allocation for a memcg, i.e. overcharge the memcg->memory and memcg->memsw
counters. Do the same for the memcg->kmem counter too. In cgroup-v1, this
bug can cause a __GFP_NOFAIL kmem allocation to fail if an explicit limit
on the kmem counter is set and reached.

Signed-off-by: Shakeel Butt <shake...@google.com>
---
 mm/memcontrol.c | 21 +++--
 1 file changed, 19 insertions(+), 2 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index ab5673dbfc4e..0a88f824c550 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1893,6 +1893,18 @@ void mem_cgroup_handle_over_high(void)
current->memcg_nr_pages_over_high = 0;
 }
 
+/*
+ * Based on try_charge() force charge conditions.
+ */
+static inline bool should_force_charge(gfp_t gfp_mask)
+{
+   return (unlikely(tsk_is_oom_victim(current) ||
+fatal_signal_pending(current) ||
+current->flags & PF_EXITING ||
+current->flags & PF_MEMALLOC ||
+gfp_mask & __GFP_NOFAIL));
+}
+
 static int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
  unsigned int nr_pages)
 {
@@ -2008,6 +2020,8 @@ static int try_charge(struct mem_cgroup *memcg, gfp_t 
gfp_mask,
 * The allocation either can't fail or will lead to more memory
 * being freed very soon.  Allow memory usage go over the limit
 * temporarily by force charging it.
+*
+* NOTE: Please keep the should_force_charge() conditions in sync.
 */
page_counter_charge(&memcg->memory, nr_pages);
if (do_memsw_account())
@@ -2331,8 +2345,11 @@ int memcg_kmem_charge_memcg(struct page *page, gfp_t 
gfp, int order,
 
if (!cgroup_subsys_on_dfl(memory_cgrp_subsys) &&
!page_counter_try_charge(&memcg->kmem, nr_pages, &counter)) {
-   cancel_charge(memcg, nr_pages);
-   return -ENOMEM;
+   if (!should_force_charge(gfp)) {
+   cancel_charge(memcg, nr_pages);
+   return -ENOMEM;
+   }
+   page_counter_charge(&memcg->kmem, nr_pages);
}
 
page->mem_cgroup = memcg;
-- 
2.17.0.921.gf22659ad46-goog



Re: [PATCH] doc: document scope NOFS, NOIO APIs

2018-05-24 Thread Shakeel Butt
On Thu, May 24, 2018 at 4:43 AM, Michal Hocko  wrote:
> From: Michal Hocko 
>
> Although the api is documented in the source code Ted has pointed out
> that there is no mention in the core-api Documentation and there are
> people looking there to find answers how to use a specific API.
>
> Cc: "Darrick J. Wong" 
> Cc: David Sterba 
> Requested-by: "Theodore Y. Ts'o" 
> Signed-off-by: Michal Hocko 
> ---
>
> Hi Johnatan,
> Ted has proposed this at LSFMM and then we discussed that briefly on the
> mailing list [1]. I received some useful feedback from Darrick and Dave
> which has been (hopefully) integrated. Then the thing fall off my radar
> rediscovering it now when doing some cleanup. Could you take the patch
> please?
>
> [1] http://lkml.kernel.org/r/20180424183536.gf30...@thunk.org
>  .../core-api/gfp_mask-from-fs-io.rst  | 55 +++
>  1 file changed, 55 insertions(+)
>  create mode 100644 Documentation/core-api/gfp_mask-from-fs-io.rst
>
> diff --git a/Documentation/core-api/gfp_mask-from-fs-io.rst 
> b/Documentation/core-api/gfp_mask-from-fs-io.rst
> new file mode 100644
> index ..e8b2678e959b
> --- /dev/null
> +++ b/Documentation/core-api/gfp_mask-from-fs-io.rst
> @@ -0,0 +1,55 @@
> +=
> +GFP masks used from FS/IO context
> +=
> +
> +:Date: May, 2018
> +:Author: Michal Hocko 
> +
> +Introduction
> +
> +
> +Code paths in the filesystem and IO stacks must be careful when
> +allocating memory to prevent recursion deadlocks caused by direct
> +memory reclaim calling back into the FS or IO paths and blocking on
> +already held resources (e.g. locks - most commonly those used for the
> +transaction context).
> +
> +The traditional way to avoid this deadlock problem is to clear __GFP_FS
> +resp. __GFP_IO (note the later implies clearing the first as well) in

Is resp. == respectively? Why not use the full word (here and below)?

> +the gfp mask when calling an allocator. GFP_NOFS resp. GFP_NOIO can be
> +used as shortcut. It turned out though that above approach has led to
> +abuses when the restricted gfp mask is used "just in case" without a
> +deeper consideration which leads to problems because an excessive use
> +of GFP_NOFS/GFP_NOIO can lead to memory over-reclaim or other memory
> +reclaim issues.
> +
> +New API
> +
> +
> +Since 4.12 we do have a generic scope API for both NOFS and NOIO context
> +``memalloc_nofs_save``, ``memalloc_nofs_restore`` resp. 
> ``memalloc_noio_save``,
> +``memalloc_noio_restore`` which allow to mark a scope to be a critical
> +section from the memory reclaim recursion into FS/IO POV. Any allocation
> +from that scope will inherently drop __GFP_FS resp. __GFP_IO from the given
> +mask so no memory allocation can recurse back in the FS/IO.
> +
> +FS/IO code then simply calls the appropriate save function right at the
> +layer where a lock taken from the reclaim context (e.g. shrinker) and
> +the corresponding restore function when the lock is released. All that
> +ideally along with an explanation what is the reclaim context for easier
> +maintenance.
> +
> +What about __vmalloc(GFP_NOFS)
> +==
> +
> +vmalloc doesn't support GFP_NOFS semantic because there are hardcoded
> +GFP_KERNEL allocations deep inside the allocator which are quite non-trivial
> +to fix up. That means that calling ``vmalloc`` with GFP_NOFS/GFP_NOIO is
> +almost always a bug. The good news is that the NOFS/NOIO semantic can be
> +achieved by the scope api.
> +
> +In the ideal world, upper layers should already mark dangerous contexts
> +and so no special care is required and vmalloc should be called without
> +any problems. Sometimes if the context is not really clear or there are
> +layering violations then the recommended way around that is to wrap 
> ``vmalloc``
> +by the scope API with a comment explaining the problem.
> --
> 2.17.0
>
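
For what it's worth, a typical use of the scope API looks roughly like
this sketch (assumes linux/sched/mm.h; alloc_under_fs_lock() is just an
illustrative name):

static void *alloc_under_fs_lock(size_t size)
{
    unsigned int nofs_flags;
    void *p;

    nofs_flags = memalloc_nofs_save();
    /* any allocation in this scope implicitly loses __GFP_FS */
    p = kmalloc(size, GFP_KERNEL);
    memalloc_nofs_restore(nofs_flags);

    return p;
}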


Re: [PATCH] mm: save two stranding bit in gfp_mask

2018-05-16 Thread Shakeel Butt
On Wed, May 16, 2018 at 1:41 PM Vlastimil Babka <vba...@suse.cz> wrote:
> On 05/16/2018 10:20 PM, Shakeel Butt wrote:
> > ___GFP_COLD and ___GFP_OTHER_NODE were removed but their bits were
> > stranded. Slide existing gfp masks to make those two bits available.
> Well, there are already available for hypothetical new flags. Is there
> anything that benefits from a smaller __GFP_BITS_SHIFT?

I am prototyping passing along the type of kmem allocation, e.g. page
table, vmalloc, stack, etc. (still very preliminary).


[PATCH] mm: save two stranding bit in gfp_mask

2018-05-16 Thread Shakeel Butt
___GFP_COLD and ___GFP_OTHER_NODE were removed but their bits were
stranded. Slide existing gfp masks to make those two bits available.

Signed-off-by: Shakeel Butt <shake...@google.com>
---
 include/linux/gfp.h | 42 +-
 1 file changed, 21 insertions(+), 21 deletions(-)

diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 1a4582b44d32..8edf72d32411 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -16,31 +16,31 @@ struct vm_area_struct;
  */
 
 /* Plain integer GFP bitmasks. Do not use this directly. */
-#define ___GFP_DMA 0x01u
-#define ___GFP_HIGHMEM 0x02u
-#define ___GFP_DMA32   0x04u
-#define ___GFP_MOVABLE 0x08u
+#define ___GFP_DMA 0x1u
+#define ___GFP_HIGHMEM 0x2u
+#define ___GFP_DMA32   0x4u
+#define ___GFP_MOVABLE 0x8u
 #define ___GFP_RECLAIMABLE 0x10u
 #define ___GFP_HIGH0x20u
 #define ___GFP_IO  0x40u
 #define ___GFP_FS  0x80u
-#define ___GFP_NOWARN  0x200u
-#define ___GFP_RETRY_MAYFAIL   0x400u
-#define ___GFP_NOFAIL  0x800u
-#define ___GFP_NORETRY 0x1000u
-#define ___GFP_MEMALLOC0x2000u
-#define ___GFP_COMP0x4000u
-#define ___GFP_ZERO0x8000u
-#define ___GFP_NOMEMALLOC  0x1u
-#define ___GFP_HARDWALL0x2u
-#define ___GFP_THISNODE0x4u
-#define ___GFP_ATOMIC  0x8u
-#define ___GFP_ACCOUNT 0x10u
-#define ___GFP_DIRECT_RECLAIM  0x40u
-#define ___GFP_WRITE   0x80u
-#define ___GFP_KSWAPD_RECLAIM  0x100u
+#define ___GFP_NOWARN  0x100u
+#define ___GFP_RETRY_MAYFAIL   0x200u
+#define ___GFP_NOFAIL  0x400u
+#define ___GFP_NORETRY 0x800u
+#define ___GFP_MEMALLOC0x1000u
+#define ___GFP_COMP0x2000u
+#define ___GFP_ZERO0x4000u
+#define ___GFP_NOMEMALLOC  0x8000u
+#define ___GFP_HARDWALL0x1u
+#define ___GFP_THISNODE0x2u
+#define ___GFP_ATOMIC  0x4u
+#define ___GFP_ACCOUNT 0x8u
+#define ___GFP_DIRECT_RECLAIM  0x10u
+#define ___GFP_WRITE   0x20u
+#define ___GFP_KSWAPD_RECLAIM  0x40u
 #ifdef CONFIG_LOCKDEP
-#define ___GFP_NOLOCKDEP   0x200u
+#define ___GFP_NOLOCKDEP   0x80u
 #else
 #define ___GFP_NOLOCKDEP   0
 #endif
@@ -205,7 +205,7 @@ struct vm_area_struct;
 #define __GFP_NOLOCKDEP ((__force gfp_t)___GFP_NOLOCKDEP)
 
 /* Room for N __GFP_FOO bits */
-#define __GFP_BITS_SHIFT (25 + IS_ENABLED(CONFIG_LOCKDEP))
+#define __GFP_BITS_SHIFT (23 + IS_ENABLED(CONFIG_LOCKDEP))
 #define __GFP_BITS_MASK ((__force gfp_t)((1 << __GFP_BITS_SHIFT) - 1))
 
 /*
-- 
2.17.0.441.gb46fe60e1d-goog



[PATCH v2] mm: save two stranding bit in gfp_mask

2018-05-16 Thread Shakeel Butt
___GFP_COLD and ___GFP_OTHER_NODE were removed but their bits were
stranded. Fill the gaps by moving the existing gfp masks around.

Signed-off-by: Shakeel Butt <shake...@google.com>
Suggested-by: Vlastimil Babka <vba...@suse.cz>
Acked-by: Michal Hocko <mho...@suse.com>
---
Changelog since v1:
- Moved couple of gfp masks instead of sliding all.

 include/linux/gfp.h | 10 +-
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 1a4582b44d32..036846fc00a6 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -24,6 +24,7 @@ struct vm_area_struct;
 #define ___GFP_HIGH0x20u
 #define ___GFP_IO  0x40u
 #define ___GFP_FS  0x80u
+#define ___GFP_WRITE   0x100u
 #define ___GFP_NOWARN  0x200u
 #define ___GFP_RETRY_MAYFAIL   0x400u
 #define ___GFP_NOFAIL  0x800u
@@ -36,11 +37,10 @@ struct vm_area_struct;
 #define ___GFP_THISNODE0x4u
 #define ___GFP_ATOMIC  0x8u
 #define ___GFP_ACCOUNT 0x10u
-#define ___GFP_DIRECT_RECLAIM  0x40u
-#define ___GFP_WRITE   0x80u
-#define ___GFP_KSWAPD_RECLAIM  0x100u
+#define ___GFP_DIRECT_RECLAIM  0x20u
+#define ___GFP_KSWAPD_RECLAIM  0x40u
 #ifdef CONFIG_LOCKDEP
-#define ___GFP_NOLOCKDEP   0x200u
+#define ___GFP_NOLOCKDEP   0x80u
 #else
 #define ___GFP_NOLOCKDEP   0
 #endif
@@ -205,7 +205,7 @@ struct vm_area_struct;
 #define __GFP_NOLOCKDEP ((__force gfp_t)___GFP_NOLOCKDEP)
 
 /* Room for N __GFP_FOO bits */
-#define __GFP_BITS_SHIFT (25 + IS_ENABLED(CONFIG_LOCKDEP))
+#define __GFP_BITS_SHIFT (23 + IS_ENABLED(CONFIG_LOCKDEP))
 #define __GFP_BITS_MASK ((__force gfp_t)((1 << __GFP_BITS_SHIFT) - 1))
 
 /*
-- 
2.17.0.441.gb46fe60e1d-goog



Re: [PATCH v3] mm: fix race between kmem_cache destroy, create and deactivate

2018-06-10 Thread Shakeel Butt
On Sun, Jun 10, 2018 at 9:32 AM Paul E. McKenney
 wrote:
>
> On Sun, Jun 10, 2018 at 07:52:50AM -0700, Shakeel Butt wrote:
> > On Sat, Jun 9, 2018 at 3:20 AM Vladimir Davydov  
> > wrote:
> > >
> > > On Tue, May 29, 2018 at 05:12:04PM -0700, Shakeel Butt wrote:
> > > > The memcg kmem cache creation and deactivation (SLUB only) is
> > > > asynchronous. If a root kmem cache is destroyed whose memcg cache is in
> > > > the process of creation or deactivation, the kernel may crash.
> > > >
> > > > Example of one such crash:
> > > >   general protection fault:  [#1] SMP PTI
> > > >   CPU: 1 PID: 1721 Comm: kworker/14:1 Not tainted 4.17.0-smp
> > > >   ...
> > > >   Workqueue: memcg_kmem_cache kmemcg_deactivate_workfn
> > > >   RIP: 0010:has_cpu_slab
> > > >   ...
> > > >   Call Trace:
> > > >   ? on_each_cpu_cond
> > > >   __kmem_cache_shrink
> > > >   kmemcg_cache_deact_after_rcu
> > > >   kmemcg_deactivate_workfn
> > > >   process_one_work
> > > >   worker_thread
> > > >   kthread
> > > >   ret_from_fork+0x35/0x40
> > > >
> > > > To fix this race, on root kmem cache destruction, mark the cache as
> > > > dying and flush the workqueue used for memcg kmem cache creation and
> > > > deactivation.
> > >
> > > > @@ -845,6 +862,8 @@ void kmem_cache_destroy(struct kmem_cache *s)
> > > >   if (unlikely(!s))
> > > >   return;
> > > >
> > > > + flush_memcg_workqueue(s);
> > > > +
> > >
> > > This should definitely help against async memcg_kmem_cache_create(),
> > > but I'm afraid it doesn't eliminate the race with async destruction,
> > > unfortunately, because the latter uses call_rcu_sched():
> > >
> > >   memcg_deactivate_kmem_caches
> > >__kmem_cache_deactivate
> > > slab_deactivate_memcg_cache_rcu_sched
> > >  call_rcu_sched
> > > kmem_cache_destroy
> > >  shutdown_memcg_caches
> > >   shutdown_cache
> > >   memcg_deactivate_rcufn
> > >
> > >
> > > Can we somehow flush those pending rcu requests?
> >
> > You are right and thanks for catching that. Now I am wondering if
> > synchronize_sched() just before flush_workqueue() should be enough.
> > Otherwise we might have to replace call_rcu_sched with
> > synchronize_sched() in kmemcg_deactivate_workfn, which I would not
> > prefer as that would hold up the kmem_cache workqueue.
> >
> > +Paul
> >
> > Paul, we have a situation something similar to the following pseudo code.
> >
> > CPU0:
> > lock(l)
> > if (!flag)
> >   call_rcu_sched(callback);
> > unlock(l)
> > --
> > CPU1:
> > lock(l)
> > flag = true
> > unlock(l)
> > synchronize_sched()
> > --
> >
> > If CPU0 has already called call_rcu_sched(callback) and CPU1 later
> > calls synchronize_sched(), is there any guarantee that on return from
> > synchronize_sched() the RCU callback scheduled by CPU0 has already
> > been executed?
>
> No.  There is no such guarantee.
>
> You instead want rcu_barrier_sched(), which waits for the callbacks from
> all prior invocations of call_rcu_sched() to be invoked.
>
> Please note that synchronize_sched() is -not- sufficient.  It is only
> guaranteed to wait for a grace period, not necessarily for all prior
> callbacks.  This goes both directions because if there are no callbacks
> in the system, then rcu_barrier_sched() is within its rights to return
> immediately.
>
> So please make sure you use each of synchronize_sched() and
> rcu_barrier_sched() to do the job that it was intended to do!  ;-)
>
> If your lock(l) is shorthand for spin_lock(), it looks to me like you
> actually only need rcu_barrier_sched():
>
> CPU0:
> spin_lock();
> if (!flag)
>   call_rcu_sched(callback);
> spin_unlock();
>
> CPU1:
> spin_lock();
> flag = true;
> spin_unlock();
> /* At this point, no more callbacks will be registered. */
> rcu_barrier_sched();
> /* At this point, all registered callbacks will have been invoked. */
>
> On the other hand, if your "lock(l)" was instead shorthand for
> rcu_read_lock_sched(), then you need -both- synchronize_sched() -and-
> rcu_barrier().  And even then, you will be broken in -rt kernels.
> (Which might or might not be a concern, depending on whether your code
> matters to -rt kernels.
>
> Make sense?
>

Thanks a lot, that was really helpful. The lock is actually a mutex,
so I think rcu_barrier_sched() should be sufficient.

Shakeel


Re: [PATCH v3] mm: fix race between kmem_cache destroy, create and deactivate

2018-06-10 Thread Shakeel Butt
On Sat, Jun 9, 2018 at 3:20 AM Vladimir Davydov  wrote:
>
> On Tue, May 29, 2018 at 05:12:04PM -0700, Shakeel Butt wrote:
> > The memcg kmem cache creation and deactivation (SLUB only) is
> > asynchronous. If a root kmem cache is destroyed whose memcg cache is in
> > the process of creation or deactivation, the kernel may crash.
> >
> > Example of one such crash:
> >   general protection fault:  [#1] SMP PTI
> >   CPU: 1 PID: 1721 Comm: kworker/14:1 Not tainted 4.17.0-smp
> >   ...
> >   Workqueue: memcg_kmem_cache kmemcg_deactivate_workfn
> >   RIP: 0010:has_cpu_slab
> >   ...
> >   Call Trace:
> >   ? on_each_cpu_cond
> >   __kmem_cache_shrink
> >   kmemcg_cache_deact_after_rcu
> >   kmemcg_deactivate_workfn
> >   process_one_work
> >   worker_thread
> >   kthread
> >   ret_from_fork+0x35/0x40
> >
> > To fix this race, on root kmem cache destruction, mark the cache as
> > dying and flush the workqueue used for memcg kmem cache creation and
> > deactivation.
>
> > @@ -845,6 +862,8 @@ void kmem_cache_destroy(struct kmem_cache *s)
> >   if (unlikely(!s))
> >   return;
> >
> > + flush_memcg_workqueue(s);
> > +
>
> This should definitely help against async memcg_kmem_cache_create(),
> but I'm afraid it doesn't eliminate the race with async destruction,
> unfortunately, because the latter uses call_rcu_sched():
>
>   memcg_deactivate_kmem_caches
>__kmem_cache_deactivate
> slab_deactivate_memcg_cache_rcu_sched
>  call_rcu_sched
> kmem_cache_destroy
>  shutdown_memcg_caches
>   shutdown_cache
>   memcg_deactivate_rcufn
>
>
> Can we somehow flush those pending rcu requests?

You are right and thanks for catching that. Now I am wondering if
synchronize_sched() just before flush_workqueue() should be enough.
Otherwise we might have to replace call_rcu_sched with
synchronize_sched() in kmemcg_deactivate_workfn, which I would not
prefer as that would hold up the kmem_cache workqueue.

+Paul

Paul, we have a situation something similar to the following pseudo code.

CPU0:
lock(l)
if (!flag)
  call_rcu_sched(callback);
unlock(l)
--
CPU1:
lock(l)
flag = true
unlock(l)
synchronize_sched()
--

If CPU0 has already called call_rcu_sched(callback) and CPU1 later
calls synchronize_sched(), is there any guarantee that on return from
synchronize_sched() the RCU callback scheduled by CPU0 has already
been executed?

thanks,
Shakeel


Re: [PATCH] mm: fix null pointer dereference in mem_cgroup_protected

2018-06-08 Thread Shakeel Butt
On Fri, Jun 8, 2018 at 10:06 AM Roman Gushchin  wrote:
>
> Shakeel reported a crash in mem_cgroup_protected(), which
> can be triggered by memcg reclaim if the legacy cgroup v1
> use_hierarchy=0 mode is used:
>
> [  226.060572] BUG: unable to handle kernel NULL pointer dereference
> at 0120
> [  226.068310] PGD 801ff55da067 P4D 801ff55da067 PUD 1fdc7df067 PMD 0
> [  226.075191] Oops:  [#4] SMP PTI
> [  226.078637] CPU: 0 PID: 15581 Comm: bash Tainted: G  D
>  4.17.0-smp-clean #5
> [  226.086635] Hardware name: ...
> [  226.094546] RIP: 0010:mem_cgroup_protected+0x54/0x130
> [  226.099533] Code: 4c 8b 8e 00 01 00 00 4c 8b 86 08 01 00 00 48 8d
> 8a 08 ff ff ff 48 85 d2 ba 00 00 00 00 48 0f 44 ca 48 39 c8 0f 84 cf
> 00 00 00 <48> 8b 81 20 01 00 00 4d 89 ca 4c 39 c8 4c 0f 46 d0 4d 85 d2
> 74 05
> [  226.118194] RSP: :abe64dfafa58 EFLAGS: 00010286
> [  226.123358] RAX: 9fb6ff03d000 RBX: 9fb6f5b1b000 RCX: 
> 
> [  226.130406] RDX:  RSI: 9fb6f5b1b000 RDI: 
> 9fb6f5b1b000
> [  226.137454] RBP: abe64dfafb08 R08:  R09: 
> 
> [  226.144503] R10:  R11: c800 R12: 
> abe64dfafb88
> [  226.151551] R13: 9fb6f5b1b000 R14: abe64dfafb88 R15: 
> 9fb77fffe000
> [  226.158602] FS:  7fed1f8ac700() GS:9fb6ff40()
> knlGS:
> [  226.166594] CS:  0010 DS:  ES:  CR0: 80050033
> [  226.172270] CR2: 0120 CR3: 001fdcf86003 CR4: 
> 001606f0
> [  226.179317] Call Trace:
> [  226.181732]  ? shrink_node+0x194/0x510
> [  226.185435]  do_try_to_free_pages+0xfd/0x390
> [  226.189653]  try_to_free_mem_cgroup_pages+0x123/0x210
> [  226.194643]  try_charge+0x19e/0x700
> [  226.198088]  mem_cgroup_try_charge+0x10b/0x1a0
> [  226.202478]  wp_page_copy+0x134/0x5b0
> [  226.206094]  do_wp_page+0x90/0x460
> [  226.209453]  __handle_mm_fault+0x8e3/0xf30
> [  226.213498]  handle_mm_fault+0xfe/0x220
> [  226.217285]  __do_page_fault+0x262/0x500
> [  226.221158]  do_page_fault+0x28/0xd0
> [  226.224689]  ? page_fault+0x8/0x30
> [  226.228048]  page_fault+0x1e/0x30
> [  226.231323] RIP: 0033:0x485b72
>
> The problem happens because parent_mem_cgroup() returns a NULL
> pointer, which is dereferenced later without a check.
>
> As cgroup v1 has no memory guarantee support, let's make
> mem_cgroup_protected() immediately return MEMCG_PROT_NONE,
> if the given cgroup has no parent (non-hierarchical mode is used).
>
> Reported-by: Shakeel Butt 
> Signed-off-by: Roman Gushchin 
> Cc: Johannes Weiner 
> Cc: Michal Hocko 
> Cc: Andrew Morton 

Tested-by: Shakeel Butt 

> ---
>  mm/memcontrol.c | 3 +++
>  1 file changed, 3 insertions(+)
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 6c9fb4e47be3..6205ba512928 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -5750,6 +5750,9 @@ enum mem_cgroup_protection mem_cgroup_protected(struct 
> mem_cgroup *root,
> elow = memcg->memory.low;
>
> parent = parent_mem_cgroup(memcg);
> +   if (!parent)
> +   return MEMCG_PROT_NONE;
> +
> if (parent == root_mem_cgroup)
> goto exit;
>
> --
> 2.14.3
>


Re: [PATCH v7 15/17] mm: Generalize shrink_slab() calls in shrink_node()

2018-06-08 Thread Shakeel Butt
On Tue, May 22, 2018 at 3:09 AM Kirill Tkhai  wrote:
>
> From: Vladimir Davydov 
>
> The patch makes shrink_slab() be called for root_mem_cgroup
> in the same way as it's called for the rest of cgroups.
> This simplifies the logic and improves the readability.
>
> Signed-off-by: Vladimir Davydov 
> ktkhai: Description written.
> Signed-off-by: Kirill Tkhai 
> ---
>  mm/vmscan.c |   21 ++---
>  1 file changed, 6 insertions(+), 15 deletions(-)
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index f26ca1e00efb..6dbc659db120 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -628,10 +628,8 @@ static unsigned long shrink_slab_memcg(gfp_t gfp_mask, 
> int nid,
>   * @nid is passed along to shrinkers with SHRINKER_NUMA_AWARE set,
>   * unaware shrinkers will receive a node id of 0 instead.
>   *
> - * @memcg specifies the memory cgroup to target. If it is not NULL,
> - * only shrinkers with SHRINKER_MEMCG_AWARE set will be called to scan
> - * objects from the memory cgroup specified. Otherwise, only unaware
> - * shrinkers are called.
> + * @memcg specifies the memory cgroup to target. Unaware shrinkers
> + * are called only if it is the root cgroup.
>   *
>   * @priority is sc->priority, we take the number of objects and >> by 
> priority
>   * in order to get the scan target.
> @@ -645,7 +643,7 @@ static unsigned long shrink_slab(gfp_t gfp_mask, int nid,
> struct shrinker *shrinker;
> unsigned long freed = 0;
>

Shouldn't there be a VM_BUG_ON(!memcg) here?

> -   if (memcg && !mem_cgroup_is_root(memcg))
> +   if (!mem_cgroup_is_root(memcg))
> return shrink_slab_memcg(gfp_mask, nid, memcg, priority);
>
> if (!down_read_trylock(&shrinker_rwsem))
> @@ -658,9 +656,6 @@ static unsigned long shrink_slab(gfp_t gfp_mask, int nid,
> .memcg = memcg,
> };
>
> -   if (!!memcg != !!(shrinker->flags & SHRINKER_MEMCG_AWARE))
> -   continue;
> -
> if (!(shrinker->flags & SHRINKER_NUMA_AWARE))
> sc.nid = 0;
>
> @@ -690,6 +685,7 @@ void drop_slab_node(int nid)
> struct mem_cgroup *memcg = NULL;
>
> freed = 0;
> +   memcg = mem_cgroup_iter(NULL, NULL, NULL);
> do {
> freed += shrink_slab(GFP_KERNEL, nid, memcg, 0);
> } while ((memcg = mem_cgroup_iter(NULL, memcg, NULL)) != 
> NULL);
> @@ -2709,9 +2705,8 @@ static bool shrink_node(pg_data_t *pgdat, struct 
> scan_control *sc)
> shrink_node_memcg(pgdat, memcg, sc, &lru_pages);
> node_lru_pages += lru_pages;
>
> -   if (memcg)
> -   shrink_slab(sc->gfp_mask, pgdat->node_id,
> -   memcg, sc->priority);
> +   shrink_slab(sc->gfp_mask, pgdat->node_id,
> +   memcg, sc->priority);
>
> /* Record the group's reclaim efficiency */
> vmpressure(sc->gfp_mask, memcg, false,
> @@ -2735,10 +2730,6 @@ static bool shrink_node(pg_data_t *pgdat, struct 
> scan_control *sc)
> }
> } while ((memcg = mem_cgroup_iter(root, memcg, &reclaim)));
>
> -   if (global_reclaim(sc))
> -   shrink_slab(sc->gfp_mask, pgdat->node_id, NULL,
> -   sc->priority);
> -
> if (reclaim_state) {
> sc->nr_reclaimed += reclaim_state->reclaimed_slab;
> reclaim_state->reclaimed_slab = 0;
>


Re: [PATCH 6/6] Convert intel uncore to struct_size

2018-06-07 Thread Shakeel Butt
On Thu, Jun 7, 2018 at 10:30 AM Ralph Campbell  wrote:
>
>
>
> On 06/07/2018 07:57 AM, Matthew Wilcox wrote:
> > From: Matthew Wilcox 
> >
> > Need to do a bit of rearranging to make this work.
> >
> > Signed-off-by: Matthew Wilcox 
> > ---
> >   arch/x86/events/intel/uncore.c | 19 ++-
> >   1 file changed, 10 insertions(+), 9 deletions(-)
> >
> > diff --git a/arch/x86/events/intel/uncore.c b/arch/x86/events/intel/uncore.c
> > index 15b07379e72d..e15cfad4f89b 100644
> > --- a/arch/x86/events/intel/uncore.c
> > +++ b/arch/x86/events/intel/uncore.c
> > @@ -865,8 +865,6 @@ static void uncore_types_exit(struct intel_uncore_type 
> > **types)
> >   static int __init uncore_type_init(struct intel_uncore_type *type, bool 
> > setid)
> >   {
> >   struct intel_uncore_pmu *pmus;
> > - struct attribute_group *attr_group;
> > - struct attribute **attrs;
> >   size_t size;
> >   int i, j;
> >
> > @@ -891,21 +889,24 @@ static int __init uncore_type_init(struct 
> > intel_uncore_type *type, bool setid)
> >   0, type->num_counters, 0, 0);
> >
> >   if (type->event_descs) {
> > + struct {
> > + struct attribute_group group;
> > + struct attribute *attrs[];
> > + } *attr_group;
> >   for (i = 0; type->event_descs[i].attr.attr.name; i++);
>
> What does this for loop do?
> Looks like nothing given the semicolon at the end.
>

It finds the first index 'i' where type->event_descs[i].attr.attr.name
is NULL, i.e. it counts the event descriptors, on the assumption that
one such NULL-named terminating entry definitely exists.
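
Written out more explicitly, it is just a counting loop:

    /* count the event descriptors; the array is terminated by a NULL name */
    for (i = 0; type->event_descs[i].attr.attr.name != NULL; i++)
        ;    /* empty body on purpose: 'i' ends up holding the count */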


[PATCH v4] mm: fix race between kmem_cache destroy, create and deactivate

2018-06-11 Thread Shakeel Butt
The memcg kmem cache creation and deactivation (SLUB only) is
asynchronous. If a root kmem cache is destroyed whose memcg cache is in
the process of creation or deactivation, the kernel may crash.

Example of one such crash:
general protection fault: 0000 [#1] SMP PTI
CPU: 1 PID: 1721 Comm: kworker/14:1 Not tainted 4.17.0-smp
...
Workqueue: memcg_kmem_cache kmemcg_deactivate_workfn
RIP: 0010:has_cpu_slab
...
Call Trace:
? on_each_cpu_cond
__kmem_cache_shrink
kmemcg_cache_deact_after_rcu
kmemcg_deactivate_workfn
process_one_work
worker_thread
kthread
ret_from_fork+0x35/0x40

To fix this race, on root kmem cache destruction, mark the cache as
dying and flush the workqueue used for memcg kmem cache creation and
deactivation. SLUB's memcg kmem cache deactivation also uses an RCU
callback, so make sure all previously registered RCU callbacks have
completed as well.

Signed-off-by: Shakeel Butt 
---
Changelog since v3:
- Handle the RCU callbacks for SLUB deactivation

Changelog since v2:
- Rewrote the patch and used workqueue flushing instead of refcount

Changelog since v1:
- Added more documentation to the code
- Renamed fields to be more readable

---
 include/linux/slab.h |  1 +
 mm/slab_common.c | 33 -
 2 files changed, 33 insertions(+), 1 deletion(-)

diff --git a/include/linux/slab.h b/include/linux/slab.h
index 9ebe659bd4a5..71c5467d99c1 100644
--- a/include/linux/slab.h
+++ b/include/linux/slab.h
@@ -658,6 +658,7 @@ struct memcg_cache_params {
struct memcg_cache_array __rcu *memcg_caches;
struct list_head __root_caches_node;
struct list_head children;
+   bool dying;
};
struct {
struct mem_cgroup *memcg;
diff --git a/mm/slab_common.c b/mm/slab_common.c
index b0dd9db1eb2f..890b1f04a03a 100644
--- a/mm/slab_common.c
+++ b/mm/slab_common.c
@@ -136,6 +136,7 @@ void slab_init_memcg_params(struct kmem_cache *s)
s->memcg_params.root_cache = NULL;
RCU_INIT_POINTER(s->memcg_params.memcg_caches, NULL);
INIT_LIST_HEAD(>memcg_params.children);
+   s->memcg_params.dying = false;
 }
 
 static int init_memcg_params(struct kmem_cache *s,
@@ -608,7 +609,7 @@ void memcg_create_kmem_cache(struct mem_cgroup *memcg,
 * The memory cgroup could have been offlined while the cache
 * creation work was pending.
 */
-   if (memcg->kmem_state != KMEM_ONLINE)
+   if (memcg->kmem_state != KMEM_ONLINE || root_cache->memcg_params.dying)
goto out_unlock;
 
idx = memcg_cache_id(memcg);
@@ -712,6 +713,9 @@ void slab_deactivate_memcg_cache_rcu_sched(struct 
kmem_cache *s,
WARN_ON_ONCE(s->memcg_params.deact_fn))
return;
 
+   if (s->memcg_params.root_cache->memcg_params.dying)
+   return;
+
/* pin memcg so that @s doesn't get destroyed in the middle */
css_get(&s->memcg_params.memcg->css);
 
@@ -823,11 +827,36 @@ static int shutdown_memcg_caches(struct kmem_cache *s)
return -EBUSY;
return 0;
 }
+
+static void flush_memcg_workqueue(struct kmem_cache *s)
+{
+   mutex_lock(&slab_mutex);
+   s->memcg_params.dying = true;
+   mutex_unlock(&slab_mutex);
+
+   /*
+* SLUB deactivates the kmem_caches through call_rcu_sched. Make
+* sure all registered rcu callbacks have been invoked.
+*/
+   if (IS_ENABLED(CONFIG_SLUB))
+   rcu_barrier_sched();
+
+   /*
+* SLAB and SLUB create memcg kmem_caches through workqueue and SLUB
+* deactivates the memcg kmem_caches through workqueue. Make sure all
+* previous workitems on workqueue are processed.
+*/
+   flush_workqueue(memcg_kmem_cache_wq);
+}
 #else
 static inline int shutdown_memcg_caches(struct kmem_cache *s)
 {
return 0;
 }
+
+static inline void flush_memcg_workqueue(struct kmem_cache *s)
+{
+}
 #endif /* CONFIG_MEMCG && !CONFIG_SLOB */
 
 void slab_kmem_cache_release(struct kmem_cache *s)
@@ -845,6 +874,8 @@ void kmem_cache_destroy(struct kmem_cache *s)
if (unlikely(!s))
return;
 
+   flush_memcg_workqueue(s);
+
get_online_cpus();
get_online_mems();
 
-- 
2.18.0.rc1.242.g61856ae69a-goog



Re: Possible regression in "slab, slub: skip unnecessary kasan_cache_shutdown()"

2018-06-19 Thread Shakeel Butt
On Tue, Jun 19, 2018 at 8:19 AM Jason A. Donenfeld  wrote:
>
> On Tue, Jun 19, 2018 at 5:08 PM Shakeel Butt  wrote:
> > > > Are you using SLAB or SLUB? We stress kernel pretty heavily, but with
> > > > SLAB, and I suspect Shakeel may also be using SLAB. So if you are
> > > > using SLUB, there is significant chance that it's a bug in the SLUB
> > > > part of the change.
> > >
> > > Nice intuition; I am indeed using SLUB rather than SLAB...
> > >
> >
> > Can you try once with SLAB? Just to make sure that it is SLUB specific.
>
> Sorry, I meant to mention that earlier. I tried with SLAB; the crash
> does not occur. This appears to be SLUB-specific.

Jason, can you try the following patch?

---
 mm/slub.c | 8 
 1 file changed, 8 insertions(+)

diff --git a/mm/slub.c b/mm/slub.c
index a3b8467c14af..746cfe4515c2 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -3673,9 +3673,17 @@ static void free_partial(struct kmem_cache *s, struct 
kmem_cache_node *n)
 
 bool __kmem_cache_empty(struct kmem_cache *s)
 {
+   int cpu;
int node;
struct kmem_cache_node *n;
 
+   for_each_online_cpu(cpu) {
+   struct kmem_cache_cpu *c = per_cpu_ptr(s->cpu_slab, cpu);
+
+   if (c->page || slub_percpu_partial(c))
+   return false;
+   }
+
for_each_kmem_cache_node(s, node, n)
if (n->nr_partial || slabs_node(s, node))
return false;
-- 
2.18.0.rc1.244.gcf134e6275-goog



Re: [PATCH v6 0/3] Directed kmem charging

2018-06-19 Thread Shakeel Butt
On Tue, Jun 19, 2018 at 9:09 AM Johannes Weiner  wrote:
>
> Hi Shakeel,
>
> this looks generally reasonable to me.
>
> However, patch 1 introduces API that isn't used until patch 2 and 3,
> which makes reviewing harder since you have to jump back and forth
> between emails. Please fold patch 1 and introduce API along with the
> users.
>

Thanks a lot for the review. Ack, I will do as you suggested in the next version.

> On Mon, Jun 18, 2018 at 10:13:24PM -0700, Shakeel Butt wrote:
> > This patchset introduces memcg variant memory allocation functions.  The
> > caller can explicitly pass the memcg to charge for kmem allocations.
> > Currently the kernel, for __GFP_ACCOUNT memory allocation requests,
> > extracts the memcg of the current task to charge for the kmem allocation.
> > This patch series introduces kmem allocation functions where the caller
> > can pass the pointer to the remote memcg.  The remote memcg will be
> > charged for the allocation instead of the memcg of the caller.  However
> > the caller must have a reference to the remote memcg.  This patch series
> > also introduces a scope API for targeted memcg charging. So, all the
> > __GFP_ACCOUNT allocations within the specified scope will be charged to
> > the given target memcg.
>
> Can you open with the rationale for the series, i.e. the problem
> statement (fsnotify and bh memory footprint), *then* follow with the
> proposed solution?
>

Sure.
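
For reference, the scope interface is meant to be used roughly like the
sketch below (illustrative only; the exact helper names in this posting
may differ, and remote_charged_alloc() is just a made-up wrapper):

static void *remote_charged_alloc(struct mem_cgroup *memcg, size_t size)
{
    void *p;

    /* the caller must already hold a reference on @memcg */
    memalloc_use_memcg(memcg);
    /* __GFP_ACCOUNT allocations in this scope are charged to @memcg */
    p = kmalloc(size, GFP_KERNEL_ACCOUNT);
    memalloc_unuse_memcg();

    return p;
}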

thanks,
Shakeel


  1   2   3   4   5   6   7   8   9   10   >