Re: [PATCH RFC] xfs, memcg: Call xfs_fs_nr_cached_objects() only in case of global reclaim

2018-03-26 Thread Kirill Tkhai
On 26.03.2018 05:37, Dave Chinner wrote:
> On Fri, Mar 23, 2018 at 03:39:53PM +0300, Kirill Tkhai wrote:
>> On 23.03.2018 02:46, Dave Chinner wrote:
>>> On Thu, Mar 22, 2018 at 07:52:37PM +0300, Kirill Tkhai wrote:
>>>> Here is the problem I'm solving: https://lkml.org/lkml/2018/3/21/365.
>>>
>>> Oh, finally you tell me what the problem is that you're trying to
>>> solve. I *asked this several times* and got no response. Thank you
>>> for wasting so much of my time.
>>>
>>>> The current shrinker is not scalable. When there are many memcgs and
>>>> mounts, the iteration in shrink_slab() during global reclaim can take
>>>> a long time. The timings of shrink_slab() are shown at the link. A node
>>>> with 200 containers may waste 4 seconds on global reclaim just to
>>>> iterate over all shrinkers for all cgroups, call shrinker::count_objects()
>>>> and receive 0 objects.
>>>
>>> So, your problem is the way the memcgs were tacked onto the side
>>> of the list_lru infrastructure and are iterated, which has basically
>>> nothing to do with the way the low level XFS inode shrinker behaves.
>>>
>>> /me looks at the patches
>>>
>>> /me shudders at the thought of external "cache has freeable items"
>>> control for the shrinking of vfs caches.
>>>
>>> Biggest problem I see with this is the scope for coherency bugs in
>>> the "memcg shrinker has freeable items" tracking. If that happens,
>>> there's no way of getting that memcg to run its shrinker ever
>>> again. That seems very, very fragile and likely to be an endless
>>> source of OOM bugs. The whole point of the shrinker polling
>>> infrastructure is that it is not susceptible to this sort of bug.
>>>
>>> Really, the problem is that there's no separate list of memcg aware
>>> shrinkers, so every registered shrinker has to be iterated just to
>>> find the one shrinker that is memcg aware.
>>
>> I don't think the logic is difficult. There are generic rules,
>> and the only task is to teach the memcg-aware shrinkers to follow them.
>> Currently, those are only the super block and workingset shrinkers, and
>> both of them are based on the generic list_lru infrastructure. The
>> shrinker-related bit is also cleared only in generic code (shrink_slab()),
>> and the algorithm doesn't allow it to be cleared without a double check.
>> The only principal modification I'm thinking about is that we should
>> clear the bit only when the shrinker is called with maximum
>> parameters: priority and GFP.
> 
> Lots of "simple logic" combined together makes for a complex mass of
> difficult to understand and debug code.
> 
> And, really, you're not suffering from a memcg problem - you're
> suffering from a "there are thousands of shrinkers" scalability
> issue because superblocks have per-superblock shrinker contexts and
> you have thousands of mounted filesystems.

Yes, but that's how the containers are arranged. There are several
filesystems in the kernel tagged FS_USERNS_MOUNT, and they carry this
flag because they are safe and widely used in containers. It's impossible
to unite all of them into a single partition.

Also, separate ext4/xfs/etc partitions may be useful for some setups. You
just can't know all the workloads people use.

>> There are a lot of performance-improving synchronizations in the kernel,
>> and had they been refused, the kernel would have remained in the age
>> of the big kernel lock.
> 
> That's false equivalence and hyperbole. The shrinkers are not
> limiting what every Linux user can do with their hardware. It's not
> a fundamental architectural limitation.  These sorts of arguments
> are not convincing - this is the second time I've told you this, so
> please stick to technical arguments and drop the dramatic
> "anti-progress" conspiracy theory bullshit.

Your view is limited to XFS, but the fact that you don't use this
functionality doesn't mean nobody needs it.
 
>>> So why not just do the simple thing which is create a separate
>>> "memcg-aware" shrinker list (i.e. create shrinker_list_memcg
>>> alongside shrinker_list) and only iterate the shrinker_list_memcg
>>> when a memcg is passed to shrink_slab()?
>>>
>>> That means we'll only run 2 shrinkers per memcg at most (superblock
>>> and working set) per memcg reclaim call. That's a simple 10-20 line
>>> change, not a whole mess of new code and issues...
>>
>> That was the first optimization that came to my mind, but there is
>> almost no performance profit, since the memcg-aware shrinkers are still
>> called for every memcg, and they are the biggest part of the shrinkers
>> in the system.
> 
> Sure, but a polling algorithm is not a fundamental performance
> limitation.
> 
> The problem here is that the memcg infrastructure has caused an
> exponential explosion in shrinker scanning.

Yeah, this is the reason I'm seeking a solution.

>>>> Can't we call shrink of shared objects only for the top memcg? Something
>>>> like this:
>>>
>>> What's a "shared object", and how is it any different to a normal
>>> slab cache object?
>>
>> 
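The double-check bit clearing described above can be sketched as a small
user-space model (all names here are illustrative, not the kernel API;
priority 0 stands for maximum reclaim pressure):

```c
#include <assert.h>
#include <stdbool.h>

#define NR_MEMCGS 4

static unsigned long nr_freeable[NR_MEMCGS]; /* objects on the memcg LRU */
static bool has_objects[NR_MEMCGS];          /* the tracking bit         */

static void memcg_add_object(int memcg)
{
	nr_freeable[memcg]++;
	has_objects[memcg] = true;           /* set on every insertion */
}

static unsigned long memcg_shrink(int memcg, int priority)
{
	unsigned long freed;

	if (!has_objects[memcg])
		return 0;                    /* cheap skip: nothing here */

	freed = nr_freeable[memcg];
	nr_freeable[memcg] = 0;

	/*
	 * Clear the bit only on a maximum-pressure call, and re-check
	 * the count so a concurrent insertion cannot leave a stale
	 * clear bit behind.
	 */
	if (priority == 0 && nr_freeable[memcg] == 0)
		has_objects[memcg] = false;

	return freed;
}
```

The point of the model: a low-pressure scan never clears the bit, and any
insertion re-arms it, so the memcg can always be shrunk again.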

Re: [PATCH RFC] xfs, memcg: Call xfs_fs_nr_cached_objects() only in case of global reclaim

2018-03-25 Thread Dave Chinner
On Fri, Mar 23, 2018 at 03:39:53PM +0300, Kirill Tkhai wrote:
> On 23.03.2018 02:46, Dave Chinner wrote:
> > On Thu, Mar 22, 2018 at 07:52:37PM +0300, Kirill Tkhai wrote:
> >> Here is the problem I'm solving: https://lkml.org/lkml/2018/3/21/365.
> > 
> > Oh, finally you tell me what the problem is that you're trying to
> > solve. I *asked this several times* and got no response. Thank you
> > for wasting so much of my time.
> > 
> >> The current shrinker is not scalable. When there are many memcgs and
> >> mounts, the iteration in shrink_slab() during global reclaim can take
> >> a long time. The timings of shrink_slab() are shown at the link. A node
> >> with 200 containers may waste 4 seconds on global reclaim just to
> >> iterate over all shrinkers for all cgroups, call shrinker::count_objects()
> >> and receive 0 objects.
> > 
> > So, your problem is the way the memcgs were tacked onto the side
> > of the list_lru infrastructure and are iterated, which has basically
> > nothing to do with the way the low level XFS inode shrinker behaves.
> > 
> > /me looks at the patches
> > 
> > /me shudders at the thought of external "cache has freeable items"
> > control for the shrinking of vfs caches.
> > 
> > Biggest problem I see with this is the scope for coherency bugs in
> > the "memcg shrinker has freeable items" tracking. If that happens,
> > there's no way of getting that memcg to run its shrinker ever
> > again. That seems very, very fragile and likely to be an endless
> > source of OOM bugs. The whole point of the shrinker polling
> > infrastructure is that it is not susceptible to this sort of bug.
> > 
> > Really, the problem is that there's no separate list of memcg aware
> > shrinkers, so every registered shrinker has to be iterated just to
> > find the one shrinker that is memcg aware.
> 
> I don't think the logic is difficult. There are generic rules,
> and the only task is to teach the memcg-aware shrinkers to follow them.
> Currently, those are only the super block and workingset shrinkers, and
> both of them are based on the generic list_lru infrastructure. The
> shrinker-related bit is also cleared only in generic code (shrink_slab()),
> and the algorithm doesn't allow it to be cleared without a double check.
> The only principal modification I'm thinking about is that we should
> clear the bit only when the shrinker is called with maximum
> parameters: priority and GFP.

Lots of "simple logic" combined together makes for a complex mass of
difficult to understand and debug code.

And, really, you're not suffering from a memcg problem - you're
suffering from a "there are thousands of shrinkers" scalability
issue because superblocks have per-superblock shrinker contexts and
you have thousands of mounted filesystems.

> There are a lot of performance-improving synchronizations in the kernel,
> and had they been refused, the kernel would have remained in the age
> of the big kernel lock.

That's false equivalence and hyperbole. The shrinkers are not
limiting what every Linux user can do with their hardware. It's not
a fundamental architectural limitation.  These sorts of arguments
are not convincing - this is the second time I've told you this, so
please stick to technical arguments and drop the dramatic
"anti-progress" conspiracy theory bullshit.

> > So why not just do the simple thing which is create a separate
> > "memcg-aware" shrinker list (i.e. create shrinker_list_memcg
> > alongside shrinker_list) and only iterate the shrinker_list_memcg
> > when a memcg is passed to shrink_slab()?
> > 
> > That means we'll only run 2 shrinkers per memcg at most (superblock
> > and working set) per memcg reclaim call. That's a simple 10-20 line
> > change, not a whole mess of new code and issues...
> 
> That was the first optimization that came to my mind, but there is
> almost no performance profit, since the memcg-aware shrinkers are still
> called for every memcg, and they are the biggest part of the shrinkers
> in the system.

Sure, but a polling algorithm is not a fundamental performance
limitation.

The problem here is that the memcg infrastructure has caused an
exponential explosion in shrinker scanning.

> >> Can't we call shrink of shared objects only for the top memcg? Something
> >> like this:
> > 
> > What's a "shared object", and how is it any different to a normal
> > slab cache object?
> 
> Sorry, that was an erratum. I'm speaking about cached objects. I mean
> something like the below. The patch makes cached objects be cleared
> outside the memcg iteration cycle (it makes no sense to call them for
> every memcg, since the cached-object logic just ignores the memcg).

The cached flag seems like a hack to me. It does nothing to address
the number of shrinker callout calls (it actually increases them!),
just tries to hack around something you want a specific shrinker to
avoid doing.

I've asked *repeatedly* for a description of the actual workload
problems the XFS shrinker behaviour is causing you. In the absence
of any workload description, I'm simply going to 

Re: [PATCH RFC] xfs, memcg: Call xfs_fs_nr_cached_objects() only in case of global reclaim

2018-03-23 Thread Kirill Tkhai
On 23.03.2018 02:46, Dave Chinner wrote:
> On Thu, Mar 22, 2018 at 07:52:37PM +0300, Kirill Tkhai wrote:
>> Here is the problem I'm solving: https://lkml.org/lkml/2018/3/21/365.
> 
> Oh, finally you tell me what the problem is that you're trying to
> solve. I *asked this several times* and got no response. Thank you
> for wasting so much of my time.
> 
>> The current shrinker is not scalable. When there are many memcgs and
>> mounts, the iteration in shrink_slab() during global reclaim can take
>> a long time. The timings of shrink_slab() are shown at the link. A node
>> with 200 containers may waste 4 seconds on global reclaim just to
>> iterate over all shrinkers for all cgroups, call shrinker::count_objects()
>> and receive 0 objects.
> 
> So, your problem is the way the memcgs were tacked onto the side
> of the list_lru infrastructure and are iterated, which has basically
> nothing to do with the way the low level XFS inode shrinker behaves.
> 
> /me looks at the patches
> 
> /me shudders at the thought of external "cache has freeable items"
> control for the shrinking of vfs caches.
> 
> Biggest problem I see with this is the scope for coherency bugs in
> the "memcg shrinker has freeable items" tracking. If that happens,
> there's no way of getting that memcg to run its shrinker ever
> again. That seems very, very fragile and likely to be an endless
> source of OOM bugs. The whole point of the shrinker polling
> infrastructure is that it is not susceptible to this sort of bug.
> 
> Really, the problem is that there's no separate list of memcg aware
> shrinkers, so every registered shrinker has to be iterated just to
> find the one shrinker that is memcg aware.

I don't think the logic is difficult. There are generic rules,
and the only task is to teach the memcg-aware shrinkers to follow them.
Currently, those are only the super block and workingset shrinkers, and
both of them are based on the generic list_lru infrastructure. The
shrinker-related bit is also cleared only in generic code (shrink_slab()),
and the algorithm doesn't allow it to be cleared without a double check.
The only principal modification I'm thinking about is that we should
clear the bit only when the shrinker is called with maximum
parameters: priority and GFP.

There are a lot of performance-improving synchronizations in the kernel,
and had they been refused, the kernel would have remained in the age
of the big kernel lock.

> So why not just do the simple thing which is create a separate
> "memcg-aware" shrinker list (i.e. create shrinker_list_memcg
> alongside shrinker_list) and only iterate the shrinker_list_memcg
> when a memcg is passed to shrink_slab()?
> 
> That means we'll only run 2 shrinkers per memcg at most (superblock
> and working set) per memcg reclaim call. That's a simple 10-20 line
> change, not a whole mess of new code and issues...

That was the first optimization that came to my mind, but there is
almost no performance profit, since the memcg-aware shrinkers are still
called for every memcg, and they are the biggest part of the shrinkers
in the system.
 
>> Can't we call shrink of shared objects only for the top memcg? Something
>> like this:
> 
> What's a "shared object", and how is it any different to a normal
> slab cache object?

Sorry, that was an erratum. I'm speaking about cached objects. I mean
something like the below. The patch makes cached objects be cleared
outside the memcg iteration cycle (it makes no sense to call them for
every memcg, since the cached-object logic just ignores the memcg).

---
diff --git a/fs/super.c b/fs/super.c
index 0660083427fa..c7321c42ab1f 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -61,8 +61,8 @@ static unsigned long super_cache_scan(struct shrinker *shrink,
 	long	fs_objects = 0;
 	long	total_objects;
 	long	freed = 0;
-	long	dentries;
-	long	inodes;
+	long	dentries = 0;
+	long	inodes = 0;
 
 	sb = container_of(shrink, struct super_block, s_shrink);
 
@@ -76,19 +76,27 @@ static unsigned long super_cache_scan(struct shrinker *shrink,
 	if (!trylock_super(sb))
 		return SHRINK_STOP;
 
-	if (sb->s_op->nr_cached_objects)
-		fs_objects = sb->s_op->nr_cached_objects(sb, sc);
+	if (sc->cached) {
+		if (sb->s_op->nr_cached_objects) {
+			fs_objects = sb->s_op->nr_cached_objects(sb, sc);
+			if (!fs_objects)
+				fs_objects = 1;
+
+			sc->nr_to_scan = fs_objects + 1;
+			freed = sb->s_op->free_cached_objects(sb, sc);
+		}
+		goto unlock;
+	}
 
 	inodes = list_lru_shrink_count(&sb->s_inode_lru, sc);
 	dentries = list_lru_shrink_count(&sb->s_dentry_lru, sc);
-	total_objects = dentries + inodes + fs_objects + 1;
+	total_objects = dentries + inodes + 1;
 	if (!total_objects)
 		total_objects = 1;
 
 	/* proportion the scan between the caches */
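As a rough user-space model of what the patch above intends (all names
here are hypothetical, not kernel code): with a separate sc->cached pass,
the ->nr_cached_objects() callout runs once per global reclaim instead of
once per memcg.

```c
#include <assert.h>
#include <stdbool.h>

static int nr_cached_callouts;

static void super_cache_scan_model(bool cached_pass)
{
	if (cached_pass) {
		/* models ->nr_cached_objects()/->free_cached_objects() */
		nr_cached_callouts++;
		return;
	}
	/* models shrinking only the memcg-aware dentry/inode LRUs */
}

static int global_reclaim_model(int nr_memcgs)
{
	nr_cached_callouts = 0;
	for (int i = 0; i < nr_memcgs; i++)
		super_cache_scan_model(false); /* per-memcg LRU passes */
	super_cache_scan_model(true);          /* single cached pass   */
	return nr_cached_callouts;
}
```

With 200 container memcgs the model still makes exactly one cached-object
callout per reclaim pass.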

Re: [PATCH RFC] xfs, memcg: Call xfs_fs_nr_cached_objects() only in case of global reclaim

2018-03-22 Thread Dave Chinner
On Thu, Mar 22, 2018 at 07:52:37PM +0300, Kirill Tkhai wrote:
> Here is the problem I'm solving: https://lkml.org/lkml/2018/3/21/365.

Oh, finally you tell me what the problem is that you're trying to
solve. I *asked this several times* and got no response. Thank you
for wasting so much of my time.

> The current shrinker is not scalable. When there are many memcgs and
> mounts, the iteration in shrink_slab() during global reclaim can take
> a long time. The timings of shrink_slab() are shown at the link. A node
> with 200 containers may waste 4 seconds on global reclaim just to
> iterate over all shrinkers for all cgroups, call shrinker::count_objects()
> and receive 0 objects.

So, your problem is the way the memcgs were tacked onto the side
of the list_lru infrastructure and are iterated, which has basically
nothing to do with the way the low level XFS inode shrinker behaves.

/me looks at the patches

/me shudders at the thought of external "cache has freeable items"
control for the shrinking of vfs caches.

Biggest problem I see with this is the scope for coherency bugs in
the "memcg shrinker has freeable items" tracking. If that happens,
there's no way of getting that memcg to run its shrinker ever
again. That seems very, very fragile and likely to be an endless
source of OOM bugs. The whole point of the shrinker polling
infrastructure is that it is not susceptible to this sort of bug.

Really, the problem is that there's no separate list of memcg aware
shrinkers, so every registered shrinker has to be iterated just to
find the one shrinker that is memcg aware.

So why not just do the simple thing which is create a separate
"memcg-aware" shrinker list (i.e. create shrinker_list_memcg
alongside shrinker_list) and only iterate the shrinker_list_memcg
when a memcg is passed to shrink_slab()?

That means we'll only run 2 shrinkers per memcg at most (superblock
and working set) per memcg reclaim call. That's a simple 10-20 line
change, not a whole mess of new code and issues...

> Can't we shrink shared objects only for the top memcg? Something like
> this:

What's a "shared object", and how is it any different to a normal
slab cache object?

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH RFC] xfs, memcg: Call xfs_fs_nr_cached_objects() only in case of global reclaim

2018-03-22 Thread Kirill Tkhai
On 22.03.2018 08:01, Dave Chinner wrote:
> On Wed, Mar 21, 2018 at 07:15:14PM +0300, Kirill Tkhai wrote:
>> On 20.03.2018 17:34, Dave Chinner wrote:
>>> On Tue, Mar 20, 2018 at 04:15:16PM +0300, Kirill Tkhai wrote:
 On 20.03.2018 03:18, Dave Chinner wrote:
> On Mon, Mar 19, 2018 at 02:06:01PM +0300, Kirill Tkhai wrote:
>> On 17.03.2018 00:39, Dave Chinner wrote:
> Actually, it is fair, because:
>
> /* proportion the scan between the caches */
> dentries = mult_frac(sc->nr_to_scan, dentries, total_objects);
> inodes = mult_frac(sc->nr_to_scan, inodes, total_objects);
> fs_objects = mult_frac(sc->nr_to_scan, fs_objects, total_objects);
>
> This means that if the number of objects in the memcg aware VFS
> caches are tiny compared to the global XFS inode cache, then they
> only get a tiny amount of the total scanning done by the shrinker.

 This is just wrong. If you have two memcgs, where the first is
 charged for xfs dentries and inodes while the second is not, the
 second will be responsible for xfs shrinking just as the first is.
>>>
>>> That makes no sense to me. Shrinkers are per-sb, so they only
>>> shrink per-filesystem dentries and inodes. If your memcgs are on
>>> different filesystems, then they'll behave according to the size of
>>> that filesystem's caches, not the cache of some other filesystem.
>>
>> But this breaks the isolation purpose of memcg. When someone places
>> different services in different pairs of memcg and cpu cgroup, they
>> expect those services not to do each other's work.
> 
> Too bad. Filesystems *break memcg isolation all the time*.
> 
> Think about it. The filesystem journal is owned by the filesystem,
> not the memcg that is writing a transaction to it.  If one memcg
> does lots of metadata modifications, it can use all the journal
> space and starve itself and all other memcgs of journal space while
> the filesystem does writeback to clear it out.
> 
> Another example: if one memcg is manipulating free space in a
> particular allocation group, other memcgs that need to manipulate
> space in those allocation groups will block and be prevented from
> operating until the other memcg finishes its work.
> 
> Another example: inode IO in XFS is done in clusters of 32. Read or
> write any inode in the cluster, and we read/write all the other
> inodes in that cluster, too. Hence if we need to read an inode and
> the cluster is busy because the filesystem journal needed flushing
> or another inode in the cluster is being unlinked (e.g. by a process
> in a different memcg), then the read in the first memcg will block
> until whatever is being done on the buffer is complete. IOWs, even
> at the physical on-disk inode level we violate memcg resource 
> isolation principles.
> 
> I can go on, but I think you get the picture: Filesystems are full
> of global structures whose access arbitration mechanisms
> fundamentally break the concept of operational memcg isolation.
> 
> With that in mind, we really don't care that the shrinker does
> global work and violates memcg isolation principles, because we
> violate them everywhere. IOWs, if we try to enforce memcg
> isolation in the shrinker, then we can't guarantee forwards progress
> in memory reclaim, because we end up with multi-dimensional memcg
> dependencies at the physical layers of the filesystem structure.

Here is the problem I'm solving: https://lkml.org/lkml/2018/3/21/365.
The current shrinker infrastructure is not scalable. When there are many
memcgs and mounts, a shrink_slab() iteration in case of global reclaim
can take a long time. Timings of shrink_slab() are given at the link.
A node with 200 containers may waste 4 seconds on global reclaim just to
iterate over all shrinkers for all cgroups, call shrinker::count_objects()
and receive 0 objects.

Can't we shrink shared objects only for the top memcg? Something like
this:

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 8fcd9f8d7390..13429366c276 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2569,6 +2569,10 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc)
 		}
 	} while ((memcg = mem_cgroup_iter(root, memcg, )));
 
+	/* Called only for top memcg */
+	shrink_shared_objects();
+
+
 	if (global_reclaim(sc))
 		shrink_slab(sc->gfp_mask, pgdat->node_id, NULL,
 			    sc->priority);

"Top memcg" means the memcg at which reclaim starts: it is
root_mem_cgroup in case of global reclaim, and the task's current memcg
in case of memcg reclaim.

> I don't expect people who know nothing about XFS or filesystems to
> understand the complexity of the interactions we are dealing with
> here. Everything is a compromise when it comes to the memory reclaim
> code as there are so many corner cases we have to handle. In this
> situation, perfect is the enemy of good

Re: [PATCH RFC] xfs, memcg: Call xfs_fs_nr_cached_objects() only in case of global reclaim

2018-03-21 Thread Dave Chinner
On Wed, Mar 21, 2018 at 07:15:14PM +0300, Kirill Tkhai wrote:
> On 20.03.2018 17:34, Dave Chinner wrote:
> > On Tue, Mar 20, 2018 at 04:15:16PM +0300, Kirill Tkhai wrote:
> >> On 20.03.2018 03:18, Dave Chinner wrote:
> >>> On Mon, Mar 19, 2018 at 02:06:01PM +0300, Kirill Tkhai wrote:
>  On 17.03.2018 00:39, Dave Chinner wrote:
> >>> Actually, it is fair, because:
> >>>
> >>> /* proportion the scan between the caches */
> >>> dentries = mult_frac(sc->nr_to_scan, dentries, total_objects);
> >>> inodes = mult_frac(sc->nr_to_scan, inodes, total_objects);
> >>> fs_objects = mult_frac(sc->nr_to_scan, fs_objects, total_objects);
> >>>
> >>> This means that if the number of objects in the memcg aware VFS
> >>> caches are tiny compared to the global XFS inode cache, then they
> >>> only get a tiny amount of the total scanning done by the shrinker.
> >>
> >> This is just wrong. If you have two memcgs, where the first is
> >> charged for xfs dentries and inodes while the second is not, the
> >> second will be responsible for xfs shrinking just as the first is.
> > 
> > That makes no sense to me. Shrinkers are per-sb, so they only
> > shrink per-filesystem dentries and inodes. If your memcgs are on
> > different filesystems, then they'll behave according to the size of
> > that filesystem's caches, not the cache of some other filesystem.
> 
> But this breaks the isolation purpose of memcg. When someone places
> different services in different pairs of memcg and cpu cgroup, they
> expect those services not to do each other's work.

Too bad. Filesystems *break memcg isolation all the time*.

Think about it. The filesystem journal is owned by the filesystem,
not the memcg that is writing a transaction to it.  If one memcg
does lots of metadata modifications, it can use all the journal
space and starve itself and all other memcgs of journal space while
the filesystem does writeback to clear it out.

Another example: if one memcg is manipulating free space in a
particular allocation group, other memcgs that need to manipulate
space in those allocation groups will block and be prevented from
operating until the other memcg finishes its work.

Another example: inode IO in XFS is done in clusters of 32. Read or
write any inode in the cluster, and we read/write all the other
inodes in that cluster, too. Hence if we need to read an inode and
the cluster is busy because the filesystem journal needed flushing
or another inode in the cluster is being unlinked (e.g. by a process
in a different memcg), then the read in the first memcg will block
until whatever is being done on the buffer is complete. IOWs, even
at the physical on-disk inode level we violate memcg resource 
isolation principles.

I can go on, but I think you get the picture: Filesystems are full
of global structures whose access arbitration mechanisms
fundamentally break the concept of operational memcg isolation.

With that in mind, we really don't care that the shrinker does
global work and violates memcg isolation principles, because we
violate them everywhere. IOWs, if we try to enforce memcg
isolation in the shrinker, then we can't guarantee forwards progress
in memory reclaim, because we end up with multi-dimensional memcg
dependencies at the physical layers of the filesystem structure.

I don't expect people who know nothing about XFS or filesystems to
understand the complexity of the interactions we are dealing with
here. Everything is a compromise when it comes to the memory reclaim
code as there are so many corner cases we have to handle. In this
situation, perfect is the enemy of good.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH RFC] xfs, memcg: Call xfs_fs_nr_cached_objects() only in case of global reclaim

2018-03-21 Thread Kirill Tkhai
On 20.03.2018 17:34, Dave Chinner wrote:
> On Tue, Mar 20, 2018 at 04:15:16PM +0300, Kirill Tkhai wrote:
>> On 20.03.2018 03:18, Dave Chinner wrote:
>>> On Mon, Mar 19, 2018 at 02:06:01PM +0300, Kirill Tkhai wrote:
 On 17.03.2018 00:39, Dave Chinner wrote:
>>> Actually, it is fair, because:
>>>
>>> /* proportion the scan between the caches */
>>> dentries = mult_frac(sc->nr_to_scan, dentries, total_objects);
>>> inodes = mult_frac(sc->nr_to_scan, inodes, total_objects);
>>> fs_objects = mult_frac(sc->nr_to_scan, fs_objects, total_objects);
>>>
>>> This means that if the number of objects in the memcg aware VFS
>>> caches are tiny compared to the global XFS inode cache, then they
>>> only get a tiny amount of the total scanning done by the shrinker.
>>
>> This is just wrong. If you have two memcgs, where the first is
>> charged for xfs dentries and inodes while the second is not, the
>> second will be responsible for xfs shrinking just as the first is.
> 
> That makes no sense to me. Shrinkers are per-sb, so they only
> shrink per-filesystem dentries and inodes. If your memcgs are on
> different filesystems, then they'll behave according to the size of
> that filesystem's caches, not the cache of some other filesystem.

But this breaks the isolation purpose of memcg. When someone places
different services in different pairs of memcg and cpu cgroup, they
expect those services not to do each other's work.

>> When we shrink the xfs partition of the second memcg, total_objects
>> in super_cache_count() will consist only of cached objects.
>>
>> Then, we will call super_cache_scan() in the loop, and
>>
>> total_objects = dentries + inodes + fs_objects + 1 = fs_objects
>> fs_objects = mult_frac(sc->nr_to_scan, fs_objects, total_objects),
>> which is almost equal to sc->nr_to_scan.
>>
>> So, if the first memcg is huge and has 10GB charged to cached objects,
>> the second memcg will shrink a significant part of these objects
>> (depending on do_shrink_slab() priority) when reclaim comes.
> 
> Do we really care which memcg does the global reclaim work as long
> as it all gets done in a timely manner? The fact is that when there
> is global memory pressure, shrinkers from every memcg will be doing
> work all the time and so it really does not matter which shrinker
> context does what work. In such cases, memory reclaim is all about
> bulk throughput, and we trade off instantaneous fairness for overall
> fairness and higher bulk throughput.

There are container workloads where services are executed in different
memcgs to control their resources. If there are a database and a mail
server in two containers, you want them to use their own resources, and
you don't want the database to shrink the mail server's inodes and vice
versa. You just want to allocate enough resources for both of them and
make them not affect each other.
 
> And, well, the bulk of XFS inode reclaim is not done by the shrinker
> contexts - memcg or otherwise - it's done by an asynchronous,
> non-blocking per-filesystem background work which is kicked by a
> shrinker scan being run on that filesystem. The only reason we keep
> the shrinkers doing direct work is to throttle direct reclaim to the
> overall background inode reclaim rate. It's hardly fair for the
> filesystem to automatically do the work the memcg should be doing
> and doing it faster than the memcg reclaim can do it, is it?

But is there another solution? Something like a workqueue or a kthread?

> FWIW, inodes are RCU objects, so they aren't freed until RCU grace
> periods expire and that means memcg reclaim does not actually free
> inodes immediately. They are always delayed to some time after the
> memcg shrinker has run and reported it freed those objects.  IOWs,
> it really doesn't matter which memcg context does the "freeing"
> work, because the freeing of inode memory in any given memcg
> actually occurs some time later from callbacks run in a global,
> non-memcg aware context
> 
>>> And because all of the memcg shrinkers walk the global inode cache,
>>> the overall scanning is shared between all memcgs under memory
>>> pressure evenly and so none are unfairly penalised by having to
>>> scan the global inode cache to trigger final garbage collection.
>>
>> So, if there is a single memcg using xfs, the rest of the system is
>> responsible for its shrinking. Completely unfair.
> 
> If there is only one memcg that is generating internal memory
> pressure, then only that memcg will be running the shrinker and
> reclaiming dentries, inodes and XFS inodes from that memcg. Nothing
> else will be running reclaim and hence only the memcg shrinker will
> be doing reclaim work. I don't see where you get the "rest of the
> system is responsible for it's shrinking" from - that just isn't the
> way the shrinker infrastructure works, regardless of whether XFS is
> involved or not.

Every shrinker is called for every memcg. So, if any other memcg meets
reclaim, it will also 

Re: [PATCH RFC] xfs, memcg: Call xfs_fs_nr_cached_objects() only in case of global reclaim

2018-03-20 Thread Dave Chinner
On Tue, Mar 20, 2018 at 04:15:16PM +0300, Kirill Tkhai wrote:
> On 20.03.2018 03:18, Dave Chinner wrote:
> > On Mon, Mar 19, 2018 at 02:06:01PM +0300, Kirill Tkhai wrote:
> >> On 17.03.2018 00:39, Dave Chinner wrote:
> > Actually, it is fair, because:
> > 
> > /* proportion the scan between the caches */
> > dentries = mult_frac(sc->nr_to_scan, dentries, total_objects);
> > inodes = mult_frac(sc->nr_to_scan, inodes, total_objects);
> > fs_objects = mult_frac(sc->nr_to_scan, fs_objects, total_objects);
> > 
> > This means that if the number of objects in the memcg aware VFS
> > caches are tiny compared to the global XFS inode cache, then they
> > only get a tiny amount of the total scanning done by the shrinker.
> 
> This is just wrong. If you have two memcgs, where the first is
> charged for xfs dentries and inodes while the second is not, the
> second will be responsible for xfs shrinking just as the first is.

That makes no sense to me. Shrinkers are per-sb, so they only
shrink per-filesystem dentries and inodes. If your memcgs are on
different filesystems, then they'll behave according to the size of
that filesystem's caches, not the cache of some other filesystem.

> When we shrink the xfs partition of the second memcg, total_objects
> in super_cache_count() will consist only of cached objects.
> 
> Then, we will call super_cache_scan() in the loop, and
> 
> total_objects = dentries + inodes + fs_objects + 1 = fs_objects
> fs_objects = mult_frac(sc->nr_to_scan, fs_objects, total_objects),
> which is almost equal to sc->nr_to_scan.
> 
> So, if the first memcg is huge and has 10GB charged to cached objects,
> the second memcg will shrink a significant part of these objects
> (depending on do_shrink_slab() priority) when reclaim comes.

Do we really care which memcg does the global reclaim work as long
as it all gets done in a timely manner? The fact is that when there
is global memory pressure, shrinkers from every memcg will be doing
work all the time and so it really does not matter which shrinker
context does what work. In such cases, memory reclaim is all about
bulk throughput, and we trade off instantaneous fairness for overall
fairness and higher bulk throughput.

And, well, the bulk of XFS inode reclaim is not done by the shrinker
contexts - memcg or otherwise - it's done by an asynchronous,
non-blocking per-filesystem background work which is kicked by a
shrinker scan being run on that filesystem. The only reason we keep
the shrinkers doing direct work is to throttle direct reclaim to the
overall background inode reclaim rate. It's hardly fair for the
filesystem to automatically do the work the memcg should be doing
and doing it faster than the memcg reclaim can do it, is it?

FWIW, inodes are RCU objects, so they aren't freed until RCU grace
periods expire and that means memcg reclaim does not actually free
inodes immediately. They are always delayed to some time after the
memcg shrinker has run and reported it freed those objects.  IOWs,
it really doesn't matter which memcg context does the "freeing"
work, because the freeing of inode memory in any given memcg
actually occurs some time later from callbacks run in a global,
non-memcg aware context.

> > And because all of the memcg shrinkers walk the global inode cache,
> > the overall scanning is shared between all memcgs under memory
> > pressure evenly and so none are unfairly penalised by having to
> > scan the global inode cache to trigger final garbage collection.
> 
> So, if there is a single memcg using xfs, the rest of the system is
> responsible for its shrinking. Completely unfair.

If there is only one memcg that is generating internal memory
pressure, then only that memcg will be running the shrinker and
reclaiming dentries, inodes and XFS inodes from that memcg. Nothing
else will be running reclaim and hence only the memcg shrinker will
be doing reclaim work. I don't see where you get the "rest of the
system is responsible for its shrinking" from - that just isn't the
way the shrinker infrastructure works, regardless of whether XFS is
involved or not.

> Isn't this easy to understand?

First rule of shrinkers: Shrinkers *aren't easy to understand.*
Second rule of shrinkers: See the first rule.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH RFC] xfs, memcg: Call xfs_fs_nr_cached_objects() only in case of global reclaim

2018-03-20 Thread Kirill Tkhai
On 20.03.2018 03:18, Dave Chinner wrote:
> On Mon, Mar 19, 2018 at 02:06:01PM +0300, Kirill Tkhai wrote:
>> On 17.03.2018 00:39, Dave Chinner wrote:
>>> On Fri, Mar 16, 2018 at 11:55:30AM +0300, Kirill Tkhai wrote:
 On 16.03.2018 02:03, Dave Chinner wrote:
> On Thu, Mar 15, 2018 at 10:28:43PM +0300, Kirill Tkhai wrote:
>> On 15.03.2018 20:49, Michal Hocko wrote:
>>> On Thu 15-03-18 18:01:34, Kirill Tkhai wrote:
 xfs_reclaim_inodes_count(XFS_M(sb)) does not care about memcg.
 So, it's called for memcg reclaim too, i.e. this list is shrunk
 disproportionately to other lists.

 This looks confusing, so I'm reporting about this.
 Consider this patch as RFC.
>>>
>>> Could you be more specific about the problem you are trying to solve?
>>> Because we do skip shrinkers which are not memcg aware by
>>> shrink_slab:
>>> /*
>>>  * If kernel memory accounting is disabled, we ignore
>>>  * SHRINKER_MEMCG_AWARE flag and call all shrinkers
>>>  * passing NULL for memcg.
>>>  */
>>> if (memcg_kmem_enabled() &&
>>> if (memcg_kmem_enabled() &&
>>> !!memcg != !!(shrinker->flags & SHRINKER_MEMCG_AWARE))
>>> continue;
>>>
>>> Or am I missing something?
>>
>> sb->s_op->nr_cached_objects is a sub-method of generic 
>> super_cache_count().
>> super_cache_count() is owned and only called by superblock's shrinker,
>> which does have SHRINKER_MEMCG_AWARE flag.
>
> Xfs inodes are accounted to memcgs when they are allocated.  All the
> memcg reclaim stuff is done at the VFS inode cache level - all the
> XFS inode cache shrinker does is clean up inodes that are not
> referenced by the VFS inode cache because the memcg aware reclaim
> has already freed them.
>
> i.e. what the XFS inode cache is doing is perfectly reasonable -
> memcg aware inode reclaim is occuring at the VFS level, but on XFS
> that does not free the inodes as they are still referenced
> internally by XFS. However, once the inode is removed from the VFS
> LRU, all memcg information about the inode is destroyed, so there's
> nothing in the XFS layers that cares about memcgs.

 So, after an inode is removed from the LRU, its memory still remains
 accounted to the memcg until it is actually freed. I personally don't
 care, just to mention.
  
> Hence when the XFS inode shrinker then called to run a
> garbage collection pass on unreferenced inodes - the inodes that
> are now unreferenced in the memcg due to the VFS inode shrinker pass
> - it frees inodes regardless of the memcg context it was called from
> because that information is no longer present in the inode cache.
> Hence we just ignore memcgs at this level.

 But xfs_fs_free_cached_objects() returns the number of these freed objects
 to super_cache_scan(), so the shrinker interprets them as related
 to a memcg, while they may be related to another memcg. This introduces
 a disproportion relative to the other shrinkers called for the memcg.
>>>
>>> In what way? All memcgs see the same values from the backing cache
>>> and so try to do the same amount of scanning work. The shrinker
>>> accounting simply doesn't care where the objects are scanned from,
>>> as long as it comes from the same place as the calculation of the
>>> number of objects in the cache it's about to scan.
>>
>> shrinker->count_objects() result is used as the number of objects
>> do_shrink_slab() should shrink during the call:
>>
>> freeable = shrinker->count_objects(shrinker, shrinkctl);
>>
>> Then shrinker takes part of this value:
>>
>> delta = freeable >> priority;
>> delta *= 4;
>> do_div(delta, shrinker->seeks);
>>
>> This is the number the shrinker tries to shrink during the call.
>> Let the priority be DEF_PRIORITY. Then the shrinker tries to shrink
>> freeable*4/(shrinker->seeks * 2^DEF_PRIORITY) entities from every
>> shrinker. An equal percentage for every memcg shrinker.
>>
>> When XFS shows the global number of cached objects in count_objects(),
>> the shrinker also tries to shrink the same percentage of global objects
>> as for other memcg shrinkers. So, when you have a small memcg
>> with 128Mb of memory allowed and a small number of tasks related to it,
>> you may meet 1Gb of cached objects, which were queued by another
>> big cgroup. So, the small cgroup may shrink more objects than its
>> own size. It's not fair. That's all I'm worried about in this
>> message.
> 
> Actually, it is fair, because:
> 
> /* proportion the scan between the caches */
> dentries = mult_frac(sc->nr_to_scan, dentries, total_objects);
> inodes = mult_frac(sc->nr_to_scan, inodes, total_objects);
> fs_objects = mult_frac(sc->nr_to_scan, fs_objects, 

Re: [PATCH RFC] xfs, memcg: Call xfs_fs_nr_cached_objects() only in case of global reclaim

2018-03-19 Thread Dave Chinner
On Mon, Mar 19, 2018 at 02:06:01PM +0300, Kirill Tkhai wrote:
> On 17.03.2018 00:39, Dave Chinner wrote:
> > On Fri, Mar 16, 2018 at 11:55:30AM +0300, Kirill Tkhai wrote:
> >> On 16.03.2018 02:03, Dave Chinner wrote:
> >>> On Thu, Mar 15, 2018 at 10:28:43PM +0300, Kirill Tkhai wrote:
>  On 15.03.2018 20:49, Michal Hocko wrote:
> > On Thu 15-03-18 18:01:34, Kirill Tkhai wrote:
> >> xfs_reclaim_inodes_count(XFS_M(sb)) does not care about memcg.
> >> So, it's called for memcg reclaim too, i.e. this list is shrunk
> >> disproportionately to other lists.
> >>
> >> This looks confusing, so I'm reporting about this.
> >> Consider this patch as RFC.
> >
> > Could you be more specific about the problem you are trying to solve?
> > Because we do skip shrinkers which are not memcg aware by
> > shrink_slab:
> > /*
> >  * If kernel memory accounting is disabled, we ignore
> >  * SHRINKER_MEMCG_AWARE flag and call all shrinkers
> >  * passing NULL for memcg.
> >  */
> > if (memcg_kmem_enabled() &&
> > !!memcg != !!(shrinker->flags & SHRINKER_MEMCG_AWARE))
> > continue;
> >
> > Or am I missing something?
> 
>  sb->s_op->nr_cached_objects is a sub-method of generic 
>  super_cache_count().
>  super_cache_count() is owned and only called by superblock's shrinker,
>  which does have SHRINKER_MEMCG_AWARE flag.
> >>>
> >>> Xfs inodes are accounted to memcgs when they are allocated.  All the
> >>> memcg reclaim stuff is done at the VFS inode cache level - all the
> >>> XFS inode cache shrinker does is clean up inodes that are not
> >>> referenced by the VFS inode cache because the memcg aware reclaim
> >>> has already freed them.
> >>>
> >>> i.e. what the XFS inode cache is doing is perfectly reasonable -
> >>> memcg aware inode reclaim is occuring at the VFS level, but on XFS
> >>> that does not free the inodes as they are still referenced
> >>> internally by XFS. However, once the inode is removed from the VFS
> >>> LRU, all memcg information about the inode is destroyed, so there's
> >>> nothing in the XFS layers that cares about memcgs.
> >>
> >> So, after an inode is removed from the LRU, its memory still remains
> >> accounted to the memcg until it is actually freed. I personally don't
> >> care, just to mention.
> >>  
> >>> Hence when the XFS inode shrinker then called to run a
> >>> garbage collection pass on unreferenced inodes - the inodes that
> >>> are now unreferenced in the memcg due to the VFS inode shrinker pass
> >>> - it frees inodes regardless of the memcg context it was called from
> >>> because that information is no longer present in the inode cache.
> >>> Hence we just ignore memcgs at this level.
> >>
> >> But xfs_fs_free_cached_objects() returns the number of these freed
> >> objects to super_cache_scan(), so the shrinker interprets them as related
> >> to a memcg, while they may be related to another memcg. This introduces
> >> a disproportion relative to the other shrinkers called for the memcg.
> > 
> > In what way? All memcgs see the same values from the backing cache
> > and so try to do the same amount of scanning work. The shrinker
> > accounting simply doesn't care where the objects are scanned from,
> > as long as it comes from the same place as the calculation of the
> > number of objects in the cache it's about to scan.
> 
> shrinker->count_objects() result is used as the number of objects
> do_shrink_slab() should shrink during the call:
> 
> freeable = shrinker->count_objects(shrinker, shrinkctl);
> 
> Then shrinker takes part of this value:
> 
> delta = freeable >> priority;
> delta *= 4;
> do_div(delta, shrinker->seeks);
> 
> This is the number the shrinker tries to shrink during the call.
> Let the priority be DEF_PRIORITY. Then the shrinker tries to shrink
> freeable*4/(shrinker->seeks * 2^DEF_PRIORITY) entities from every
> shrinker. An equal percentage for every memcg shrinker.
> 
> When XFS shows the global number of cached objects in count_objects(),
> the shrinker also tries to shrink the same percentage of global objects
> as for other memcg shrinkers. So, when you have a small memcg
> with 128Mb of memory allowed and a small number of tasks related to it,
> you may meet 1Gb of cached objects, which were queued by another
> big cgroup. So, the small cgroup may shrink more objects than its
> own size. It's not fair. That's all I'm worried about in this
> message.

Actually, it is fair, because:

/* proportion the scan between the caches */
dentries = mult_frac(sc->nr_to_scan, dentries, total_objects);
inodes = mult_frac(sc->nr_to_scan, inodes, total_objects);
fs_objects = mult_frac(sc->nr_to_scan, fs_objects, total_objects);

This means that if the number of objects in the memcg aware VFS

Re: [PATCH RFC] xfs, memcg: Call xfs_fs_nr_cached_objects() only in case of global reclaim

2018-03-19 Thread Kirill Tkhai
On 19.03.2018 14:06, Kirill Tkhai wrote:
> On 17.03.2018 00:39, Dave Chinner wrote:
>> On Fri, Mar 16, 2018 at 11:55:30AM +0300, Kirill Tkhai wrote:
>>> On 16.03.2018 02:03, Dave Chinner wrote:
 On Thu, Mar 15, 2018 at 10:28:43PM +0300, Kirill Tkhai wrote:
> On 15.03.2018 20:49, Michal Hocko wrote:
>> On Thu 15-03-18 18:01:34, Kirill Tkhai wrote:
>>> xfs_reclaim_inodes_count(XFS_M(sb)) does not care about memcg.
>>> So, it's called for memcg reclaim too, i.e. this list is shrunk
>>> disproportionately to other lists.
>>>
>>> This looks confusing, so I'm reporting about this.
>>> Consider this patch as RFC.
>>
>> Could you be more specific about the problem you are trying to solve?
>> Because we do skip shrinkers which are not memcg aware by
>> shrink_slab:
>>  /*
>>   * If kernel memory accounting is disabled, we ignore
>>   * SHRINKER_MEMCG_AWARE flag and call all shrinkers
>>   * passing NULL for memcg.
>>   */
>>  if (memcg_kmem_enabled() &&
>>  !!memcg != !!(shrinker->flags & SHRINKER_MEMCG_AWARE))
>>  continue;
>>
>> Or am I missing something?
>
> sb->s_op->nr_cached_objects is a sub-method of generic 
> super_cache_count().
> super_cache_count() is owned and only called by superblock's shrinker,
> which does have SHRINKER_MEMCG_AWARE flag.

 Xfs inodes are accounted to memcgs when they are allocated.  All the
 memcg reclaim stuff is done at the VFS inode cache level - all the
 XFS inode cache shrinker does is clean up inodes that are not
 referenced by the VFS inode cache because the memcg aware reclaim
 has already freed them.

 i.e. what the XFS inode cache is doing is perfectly reasonable -
 memcg aware inode reclaim is occuring at the VFS level, but on XFS
 that does not free the inodes as they are still referenced
 internally by XFS. However, once the inode is removed from the VFS
 LRU, all memcg information about the inode is destroyed, so there's
 nothing in the XFS layers that cares about memcgs.
>>>
>>> So, after an inode is removed from the LRU, its memory still remains
>>> accounted to the memcg until it is actually freed. I personally don't
>>> care, just to mention.
>>>  
 Hence when the XFS inode shrinker then called to run a
 garbage collection pass on unreferenced inodes - the inodes that
 are now unreferenced in the memcg due to the VFS inode shrinker pass
 - it frees inodes regardless of the memcg context it was called from
 because that information is no longer present in the inode cache.
 Hence we just ignore memcgs at this level.
>>>
>>> But xfs_fs_free_cached_objects() returns the number of these freed
>>> objects to super_cache_scan(), so the shrinker interprets them as related
>>> to a memcg, while they may be related to another memcg. This introduces
>>> a disproportion relative to the other shrinkers called for the memcg.
>>
>> In what way? All memcgs see the same values from the backing cache
>> and so try to do the same amount of scanning work. The shrinker
>> accounting simply doesn't care where the objects are scanned from,
>> as long as it comes from the same place as the calculation of the
>> number of objects in the cache it's about to scan.
> 
> shrinker->count_objects() result is used as the number of objects
> do_shrink_slab() should shrink during the call:
> 
> freeable = shrinker->count_objects(shrinker, shrinkctl);
> 
> Then shrinker takes part of this value:
> 
> delta = freeable >> priority;
> delta *= 4;
> do_div(delta, shrinker->seeks);
> 
> This is the number the shrinker tries to shrink during the call.
> Let the priority be DEF_PRIORITY. Then the shrinker tries to shrink
> freeable*4/(shrinker->seeks * 2^DEF_PRIORITY) entities from every
> shrinker. An equal percentage for every memcg shrinker.
> 
> When XFS shows the global number of cached objects in count_objects(),
> the shrinker also tries to shrink the same percentage of global objects
> as for other memcg shrinkers. So, when you have a small memcg
> with 128Mb of memory allowed and a small number of tasks related to it,
> you may meet 1Gb of cached objects, which were queued by another
> big cgroup. So, the small cgroup may shrink more objects than its
> own size. It's not fair. That's all I'm worried about in this
> message.
>  
>> The memcg accounting, OTOH, is completely divorced from the
>> shrinker, so if it's still got too much slab memory accounted to it,
>> it will simply run the shrinker again and do some more memory
>> reclaim.
> 
> This message is not about OOM, it's about imbalance. See above.
> 
>> XFS does this for IO efficiency purposes, not because it's ideal
>> behaviour for memcgs. If we start doing exact memcg-only inode
>> reclaim, we're going back to the bad old days where inode reclaim
>> causes really nasty 

Re: [PATCH RFC] xfs, memcg: Call xfs_fs_nr_cached_objects() only in case of global reclaim

2018-03-19 Thread Kirill Tkhai
On 17.03.2018 00:39, Dave Chinner wrote:
> On Fri, Mar 16, 2018 at 11:55:30AM +0300, Kirill Tkhai wrote:
>> On 16.03.2018 02:03, Dave Chinner wrote:
>>> On Thu, Mar 15, 2018 at 10:28:43PM +0300, Kirill Tkhai wrote:
 On 15.03.2018 20:49, Michal Hocko wrote:
> On Thu 15-03-18 18:01:34, Kirill Tkhai wrote:
>> xfs_reclaim_inodes_count(XFS_M(sb)) does not care about memcg.
>> So, it's called for memcg reclaim too, e.g. this list is shrinked
>> disproportionality to another lists.
>>
>> This looks confusing, so I'm reporting about this.
>> Consider this patch as RFC.
>
> Could you be more specific about the problem you are trying to solve?
> Because we do skip shrinkers which are not memcg aware by
> shrink_slab:
>   /*
>* If kernel memory accounting is disabled, we ignore
>* SHRINKER_MEMCG_AWARE flag and call all shrinkers
>* passing NULL for memcg.
>*/
>   if (memcg_kmem_enabled() &&
>   !!memcg != !!(shrinker->flags & SHRINKER_MEMCG_AWARE))
>   continue;
>
> Or am I missing something?

 sb->s_op->nr_cached_objects is a sub-method of generic super_cache_count().
 super_cache_count() is owned and only called by superblock's shrinker,
 which does have SHRINKER_MEMCG_AWARE flag.
>>>
>>> Xfs inodes are accounted to memcgs when they are allocated.  All the
>>> memcg reclaim stuff is done at the VFS inode cache level - all the
>>> XFS inode cache shrinker does is clean up inodes that are not
>>> referenced by the VFS inode cache because the memcg aware reclaim
>>> has already freed them.
>>>
>>> i.e. what the XFS inode cache is doing is perfectly reasonable -
>>> memcg aware inode reclaim is occuring at the VFS level, but on XFS
>>> that does not free the inodes as they are still referenced
>>> internally by XFS. However, once the inode is removed from the VFS
>>> LRU, all memcg information about the inode is destroyed, so there's
>>> nothing in the XFS layers that cares about memcgs.
>>
>> So, after inode is removed from LRU, memory still remains accounted
>> to memcg till the time they are actually freed. I personally don't
>> care, just to mention.
>>  
>>> Hence when the XFS inode shrinker then called to run a
>>> garbage collection pass on unreferenced inodes - the inodes that
>>> are now unreferenced in the memcg due to the VFS inode shrinker pass
>>> - it frees inodes regardless of the memcg context it was called from
>>> because that information is no longer present in the inode cache.
>>> Hence we just ignore memcgs at this level.
>>
>> But xfs_fs_free_cached_objects() returns number of these freed object
>> as result to super_cache_scan(), so shrinker interprets them as related
>> to a memcg, while they may be related to another memcg. This introduces
>> a disproportion relatively to another shrinkers called to memcg.
> 
> In what way? All memcgs see tha same values from the backing cache
> and so try to do the same amount of scanning work. The shrinker
> accounting simply doesn't care where the objects are scanned from,
> as long as it comes from the same place as the calculation of the
> number of objects in the cache it's about to scan.

shrinker->count_objects() determines the number of objects
do_shrink_slab() should shrink during the call:

freeable = shrinker->count_objects(shrinker, shrinkctl);

Then the shrinker takes a fraction of this value:

delta = freeable >> priority;
delta *= 4;
do_div(delta, shrinker->seeks);

This is the number the shrinker tries to shrink during the call.
Let the priority be DEF_PRIORITY. Then the shrinker tries to reclaim
freeable*4/(shrinker->seeks * 2^DEF_PRIORITY) entities from every
shrinker, i.e. an equal percentage from every memcg shrinker.

When XFS reports the global number of cached objects in count_objects(),
the shrinker also tries to shrink that same percentage of the global
objects, just as for the other memcg shrinkers. So, when you have a
small memcg limited to 128Mb of memory and a small number of tasks
attached to it, you may meet 1Gb of cached objects, which were queued
by another, big cgroup. So the small cgroup may shrink more objects
than it owns itself. It's not fair. That's all I'm worried about in
this message.
 
> The memcg accounting, OTOH, is completely divorced from the
> shrinker, so if it's still got too much slab memory accounted to it,
> it will simply run the shrinker again and do some more memory
> reclaim.

This message is not about OOM, it's about imbalance. See above.

> XFS does this for IO efficiency purposes, not because it's ideal
> behaviour for memcgs. If we start doing exact memcg-only inode
> reclaim, we're going back to the bad old days where inode reclaim
> causes really nasty small random write storms that essentially
> starve the storage subsystem of all other IO for seconds at a time.
> That is a *far worse problem* from a system performance 

Re: [PATCH RFC] xfs, memcg: Call xfs_fs_nr_cached_objects() only in case of global reclaim

2018-03-16 Thread Dave Chinner
On Fri, Mar 16, 2018 at 11:55:30AM +0300, Kirill Tkhai wrote:
> On 16.03.2018 02:03, Dave Chinner wrote:
> > On Thu, Mar 15, 2018 at 10:28:43PM +0300, Kirill Tkhai wrote:
> >> On 15.03.2018 20:49, Michal Hocko wrote:
> >>> On Thu 15-03-18 18:01:34, Kirill Tkhai wrote:
>  xfs_reclaim_inodes_count(XFS_M(sb)) does not care about memcg.
>  So, it's called for memcg reclaim too, e.g. this list is shrinked
>  disproportionality to another lists.
> 
>  This looks confusing, so I'm reporting about this.
>  Consider this patch as RFC.
> >>>
> >>> Could you be more specific about the problem you are trying to solve?
> >>> Because we do skip shrinkers which are not memcg aware by
> >>> shrink_slab:
> >>>   /*
> >>>* If kernel memory accounting is disabled, we ignore
> >>>* SHRINKER_MEMCG_AWARE flag and call all shrinkers
> >>>* passing NULL for memcg.
> >>>*/
> >>>   if (memcg_kmem_enabled() &&
> >>>   !!memcg != !!(shrinker->flags & SHRINKER_MEMCG_AWARE))
> >>>   continue;
> >>>
> >>> Or am I missing something?
> >>
> >> sb->s_op->nr_cached_objects is a sub-method of generic super_cache_count().
> >> super_cache_count() is owned and only called by superblock's shrinker,
> >> which does have SHRINKER_MEMCG_AWARE flag.
> > 
> > Xfs inodes are accounted to memcgs when they are allocated.  All the
> > memcg reclaim stuff is done at the VFS inode cache level - all the
> > XFS inode cache shrinker does is clean up inodes that are not
> > referenced by the VFS inode cache because the memcg aware reclaim
> > has already freed them.
> > 
> > i.e. what the XFS inode cache is doing is perfectly reasonable -
> > memcg aware inode reclaim is occuring at the VFS level, but on XFS
> > that does not free the inodes as they are still referenced
> > internally by XFS. However, once the inode is removed from the VFS
> > LRU, all memcg information about the inode is destroyed, so there's
> > nothing in the XFS layers that cares about memcgs.
> 
> So, after inode is removed from LRU, memory still remains accounted
> to memcg till the time they are actually freed. I personally don't
> care, just to mention.
>  
> > Hence when the XFS inode shrinker then called to run a
> > garbage collection pass on unreferenced inodes - the inodes that
> > are now unreferenced in the memcg due to the VFS inode shrinker pass
> > - it frees inodes regardless of the memcg context it was called from
> > because that information is no longer present in the inode cache.
> > Hence we just ignore memcgs at this level.
> 
> But xfs_fs_free_cached_objects() returns number of these freed object
> as result to super_cache_scan(), so shrinker interprets them as related
> to a memcg, while they may be related to another memcg. This introduces
> a disproportion relatively to another shrinkers called to memcg.

In what way? All memcgs see the same values from the backing cache
and so try to do the same amount of scanning work. The shrinker
accounting simply doesn't care where the objects are scanned from,
as long as they come from the same place as the calculation of the
number of objects in the cache it's about to scan.

The memcg accounting, OTOH, is completely divorced from the
shrinker, so if it's still got too much slab memory accounted to it,
it will simply run the shrinker again and do some more memory
reclaim.

XFS does this for IO efficiency purposes, not because it's ideal
behaviour for memcgs. If we start doing exact memcg-only inode
reclaim, we're going back to the bad old days where inode reclaim
causes really nasty small random write storms that essentially
starve the storage subsystem of all other IO for seconds at a time.
That is a *far worse problem* from a system performance POV than
having the memcg memory reclaim run a couple more shrinker passes
and do a little more work.

> Is there a problem? This is what my patch about.

You've described some theoretical issue, but not described any user
visible or performance impacting behaviour that users actually care
about. What reproducible, observable behaviour does it fix/change?

> > So, is there a problem you are actually trying to fix, or is this
> > simply a "I don't understand how the superblock shrinkers work,
> > please explain" patch?
> 
> I work on some shrinker changes, and just want to understand they don't
> touch anything. 

Oh, goody, another person about to learn the hard way that shrinkers
are far more varied and complex than page reclaim :P

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH RFC] xfs, memcg: Call xfs_fs_nr_cached_objects() only in case of global reclaim

2018-03-16 Thread Kirill Tkhai
On 16.03.2018 02:03, Dave Chinner wrote:
> On Thu, Mar 15, 2018 at 10:28:43PM +0300, Kirill Tkhai wrote:
>> On 15.03.2018 20:49, Michal Hocko wrote:
>>> On Thu 15-03-18 18:01:34, Kirill Tkhai wrote:
 xfs_reclaim_inodes_count(XFS_M(sb)) does not care about memcg.
 So, it's called for memcg reclaim too, e.g. this list is shrinked
 disproportionality to another lists.

 This looks confusing, so I'm reporting about this.
 Consider this patch as RFC.
>>>
>>> Could you be more specific about the problem you are trying to solve?
>>> Because we do skip shrinkers which are not memcg aware by
>>> shrink_slab:
>>> /*
>>>  * If kernel memory accounting is disabled, we ignore
>>>  * SHRINKER_MEMCG_AWARE flag and call all shrinkers
>>>  * passing NULL for memcg.
>>>  */
>>> if (memcg_kmem_enabled() &&
>>> !!memcg != !!(shrinker->flags & SHRINKER_MEMCG_AWARE))
>>> continue;
>>>
>>> Or am I missing something?
>>
>> sb->s_op->nr_cached_objects is a sub-method of generic super_cache_count().
>> super_cache_count() is owned and only called by superblock's shrinker,
>> which does have SHRINKER_MEMCG_AWARE flag.
> 
> Xfs inodes are accounted to memcgs when they are allocated.  All the
> memcg reclaim stuff is done at the VFS inode cache level - all the
> XFS inode cache shrinker does is clean up inodes that are not
> referenced by the VFS inode cache because the memcg aware reclaim
> has already freed them.
> 
> i.e. what the XFS inode cache is doing is perfectly reasonable -
> memcg aware inode reclaim is occuring at the VFS level, but on XFS
> that does not free the inodes as they are still referenced
> internally by XFS. However, once the inode is removed from the VFS
> LRU, all memcg information about the inode is destroyed, so there's
> nothing in the XFS layers that cares about memcgs.

So, after an inode is removed from the LRU, its memory still remains
accounted to the memcg until it is actually freed. I personally don't
care, just mentioning it.
 
> Hence when the XFS inode shrinker then called to run a
> garbage collection pass on unreferenced inodes - the inodes that
> are now unreferenced in the memcg due to the VFS inode shrinker pass
> - it frees inodes regardless of the memcg context it was called from
> because that information is no longer present in the inode cache.
> Hence we just ignore memcgs at this level.

But xfs_fs_free_cached_objects() returns the number of these freed
objects as its result to super_cache_scan(), so the shrinker interprets
them as belonging to the memcg under reclaim, while they may belong to
another memcg. This introduces a disproportion relative to the other
shrinkers called for the memcg.
Is this a problem? This is what my patch is about.

> So, is there a problem you are actually trying to fix, or is this
> simply a "I don't understand how the superblock shrinkers work,
> please explain" patch?

I'm working on some shrinker changes, and just want to make sure they
don't break anything.

Kirill


Re: [PATCH RFC] xfs, memcg: Call xfs_fs_nr_cached_objects() only in case of global reclaim

2018-03-15 Thread Dave Chinner
On Thu, Mar 15, 2018 at 10:28:43PM +0300, Kirill Tkhai wrote:
> On 15.03.2018 20:49, Michal Hocko wrote:
> > On Thu 15-03-18 18:01:34, Kirill Tkhai wrote:
> >> xfs_reclaim_inodes_count(XFS_M(sb)) does not care about memcg.
> >> So, it's called for memcg reclaim too, e.g. this list is shrinked
> >> disproportionality to another lists.
> >>
> >> This looks confusing, so I'm reporting about this.
> >> Consider this patch as RFC.
> > 
> > Could you be more specific about the problem you are trying to solve?
> > Because we do skip shrinkers which are not memcg aware by
> > shrink_slab:
> > /*
> >  * If kernel memory accounting is disabled, we ignore
> >  * SHRINKER_MEMCG_AWARE flag and call all shrinkers
> >  * passing NULL for memcg.
> >  */
> > if (memcg_kmem_enabled() &&
> > !!memcg != !!(shrinker->flags & SHRINKER_MEMCG_AWARE))
> > continue;
> > 
> > Or am I missing something?
> 
> sb->s_op->nr_cached_objects is a sub-method of generic super_cache_count().
> super_cache_count() is owned and only called by superblock's shrinker,
> which does have SHRINKER_MEMCG_AWARE flag.

Xfs inodes are accounted to memcgs when they are allocated.  All the
memcg reclaim stuff is done at the VFS inode cache level - all the
XFS inode cache shrinker does is clean up inodes that are not
referenced by the VFS inode cache because the memcg aware reclaim
has already freed them.

i.e. what the XFS inode cache is doing is perfectly reasonable -
memcg aware inode reclaim is occurring at the VFS level, but on XFS
that does not free the inodes as they are still referenced
internally by XFS. However, once the inode is removed from the VFS
LRU, all memcg information about the inode is destroyed, so there's
nothing in the XFS layers that cares about memcgs.

Hence when the XFS inode shrinker is then called to run a
garbage collection pass on unreferenced inodes - the inodes that
are now unreferenced in the memcg due to the VFS inode shrinker pass
- it frees inodes regardless of the memcg context it was called from
because that information is no longer present in the inode cache.
Hence we just ignore memcgs at this level.

So, is there a problem you are actually trying to fix, or is this
simply a "I don't understand how the superblock shrinkers work,
please explain" patch?

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH RFC] xfs, memcg: Call xfs_fs_nr_cached_objects() only in case of global reclaim

2018-03-15 Thread Kirill Tkhai
On 15.03.2018 22:32, Michal Hocko wrote:
> On Thu 15-03-18 22:28:43, Kirill Tkhai wrote:
>> On 15.03.2018 20:49, Michal Hocko wrote:
>>> On Thu 15-03-18 18:01:34, Kirill Tkhai wrote:
 xfs_reclaim_inodes_count(XFS_M(sb)) does not care about memcg.
 So, it's called for memcg reclaim too, e.g. this list is shrinked
 disproportionality to another lists.

 This looks confusing, so I'm reporting about this.
 Consider this patch as RFC.
>>>
>>> Could you be more specific about the problem you are trying to solve?
>>> Because we do skip shrinkers which are not memcg aware by
>>> shrink_slab:
>>> /*
>>>  * If kernel memory accounting is disabled, we ignore
>>>  * SHRINKER_MEMCG_AWARE flag and call all shrinkers
>>>  * passing NULL for memcg.
>>>  */
>>> if (memcg_kmem_enabled() &&
>>> !!memcg != !!(shrinker->flags & SHRINKER_MEMCG_AWARE))
>>> continue;
>>>
>>> Or am I missing something?
>>
>> sb->s_op->nr_cached_objects is a sub-method of generic super_cache_count().
>> super_cache_count() is owned and only called by superblock's shrinker,
>> which does have SHRINKER_MEMCG_AWARE flag.
> 
> Ohh, I see. I thought it was a standard "top-level" shrinker.

There is one more hook in addition (it has to go in this patch):

.free_cached_objects= xfs_fs_free_cached_objects

which is not memcg aware either. It needs the same fix,
if the xfs_reclaim_inodes_count() check is needed.
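
If the RFC's memcg check is the right approach, the analogous change to
the other hook would presumably look like this (an untested sketch, not
a posted patch):

```c
static long
xfs_fs_free_cached_objects(
	struct super_block	*sb,
	struct shrink_control	*sc)
{
	/* Skip the memcg-blind XFS inode garbage collection during
	 * memcg-targeted reclaim, mirroring the nr_cached_objects()
	 * check from the RFC patch.
	 */
	if (sc->memcg)
		return 0;
	return xfs_reclaim_inodes_nr(XFS_M(sb), sc->nr_to_scan);
}
```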

Kirill


Re: [PATCH RFC] xfs, memcg: Call xfs_fs_nr_cached_objects() only in case of global reclaim

2018-03-15 Thread Michal Hocko
On Thu 15-03-18 22:28:43, Kirill Tkhai wrote:
> On 15.03.2018 20:49, Michal Hocko wrote:
> > On Thu 15-03-18 18:01:34, Kirill Tkhai wrote:
> >> xfs_reclaim_inodes_count(XFS_M(sb)) does not care about memcg.
> >> So, it's called for memcg reclaim too, e.g. this list is shrinked
> >> disproportionality to another lists.
> >>
> >> This looks confusing, so I'm reporting about this.
> >> Consider this patch as RFC.
> > 
> > Could you be more specific about the problem you are trying to solve?
> > Because we do skip shrinkers which are not memcg aware by
> > shrink_slab:
> > /*
> >  * If kernel memory accounting is disabled, we ignore
> >  * SHRINKER_MEMCG_AWARE flag and call all shrinkers
> >  * passing NULL for memcg.
> >  */
> > if (memcg_kmem_enabled() &&
> > !!memcg != !!(shrinker->flags & SHRINKER_MEMCG_AWARE))
> > continue;
> > 
> > Or am I missing something?
> 
> sb->s_op->nr_cached_objects is a sub-method of generic super_cache_count().
> super_cache_count() is owned and only called by superblock's shrinker,
> which does have SHRINKER_MEMCG_AWARE flag.

Ohh, I see. I thought it was a standard "top-level" shrinker.
-- 
Michal Hocko
SUSE Labs


Re: [PATCH RFC] xfs, memcg: Call xfs_fs_nr_cached_objects() only in case of global reclaim

2018-03-15 Thread Kirill Tkhai
On 15.03.2018 20:49, Michal Hocko wrote:
> On Thu 15-03-18 18:01:34, Kirill Tkhai wrote:
>> xfs_reclaim_inodes_count(XFS_M(sb)) does not care about memcg.
>> So, it's called for memcg reclaim too, e.g. this list is shrinked
>> disproportionality to another lists.
>>
>> This looks confusing, so I'm reporting about this.
>> Consider this patch as RFC.
> 
> Could you be more specific about the problem you are trying to solve?
> Because we do skip shrinkers which are not memcg aware by
> shrink_slab:
>   /*
>* If kernel memory accounting is disabled, we ignore
>* SHRINKER_MEMCG_AWARE flag and call all shrinkers
>* passing NULL for memcg.
>*/
>   if (memcg_kmem_enabled() &&
>   !!memcg != !!(shrinker->flags & SHRINKER_MEMCG_AWARE))
>   continue;
> 
> Or am I missing something?

sb->s_op->nr_cached_objects is a sub-method of generic super_cache_count().
super_cache_count() is owned and only called by superblock's shrinker,
which does have SHRINKER_MEMCG_AWARE flag.
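
For reference, super_cache_count() in fs/super.c at the time looked
roughly like this (abridged) - the ->nr_cached_objects() result is
simply added on top of the memcg-aware list_lru counts, which is why
the fs-private count leaks into memcg reclaim:

```c
static unsigned long
super_cache_count(struct shrinker *shrink, struct shrink_control *sc)
{
	struct super_block *sb = container_of(shrink, struct super_block,
					      s_shrink);
	long total_objects = 0;

	/* fs-private count: not memcg-filtered, hence the RFC's check */
	if (sb->s_op && sb->s_op->nr_cached_objects)
		total_objects = sb->s_op->nr_cached_objects(sb, sc);

	/* these two respect sc->memcg via the list_lru infrastructure */
	total_objects += list_lru_shrink_count(&sb->s_dentry_lru, sc);
	total_objects += list_lru_shrink_count(&sb->s_inode_lru, sc);

	return vfs_pressure_ratio(total_objects);
}
```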

>  
>> Signed-off-by: Kirill Tkhai 
>> ---
>>  fs/xfs/xfs_super.c |2 ++
>>  1 file changed, 2 insertions(+)
>>
>> diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
>> index 951271f57d00..124568aefa94 100644
>> --- a/fs/xfs/xfs_super.c
>> +++ b/fs/xfs/xfs_super.c
>> @@ -1788,6 +1788,8 @@ xfs_fs_nr_cached_objects(
>>  struct super_block  *sb,
>>  struct shrink_control   *sc)
>>  {
>> +if (sc->memcg)
>> +return 0;
>>  return xfs_reclaim_inodes_count(XFS_M(sb));
>>  }
>>  
> 



Re: [PATCH RFC] xfs, memcg: Call xfs_fs_nr_cached_objects() only in case of global reclaim

2018-03-15 Thread Michal Hocko
On Thu 15-03-18 18:01:34, Kirill Tkhai wrote:
> xfs_reclaim_inodes_count(XFS_M(sb)) does not care about memcg.
> So, it's called for memcg reclaim too, e.g. this list is shrinked
> disproportionality to another lists.
> 
> This looks confusing, so I'm reporting about this.
> Consider this patch as RFC.

Could you be more specific about the problem you are trying to solve?
Because we do skip shrinkers which are not memcg aware by
shrink_slab:
/*
 * If kernel memory accounting is disabled, we ignore
 * SHRINKER_MEMCG_AWARE flag and call all shrinkers
 * passing NULL for memcg.
 */
if (memcg_kmem_enabled() &&
!!memcg != !!(shrinker->flags & SHRINKER_MEMCG_AWARE))
continue;

Or am I missing something?
 
> Signed-off-by: Kirill Tkhai 
> ---
>  fs/xfs/xfs_super.c |2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
> index 951271f57d00..124568aefa94 100644
> --- a/fs/xfs/xfs_super.c
> +++ b/fs/xfs/xfs_super.c
> @@ -1788,6 +1788,8 @@ xfs_fs_nr_cached_objects(
>   struct super_block  *sb,
>   struct shrink_control   *sc)
>  {
> + if (sc->memcg)
> + return 0;
>   return xfs_reclaim_inodes_count(XFS_M(sb));
>  }
>  

-- 
Michal Hocko
SUSE Labs


Re: [PATCH RFC] xfs, memcg: Call xfs_fs_nr_cached_objects() only in case of global reclaim

2018-03-15 Thread Kirill Tkhai
On 15.03.2018 18:53, Darrick J. Wong wrote:
> On Thu, Mar 15, 2018 at 06:01:34PM +0300, Kirill Tkhai wrote:
>> xfs_reclaim_inodes_count(XFS_M(sb)) does not care about memcg.
>> So, it's called for memcg reclaim too, e.g. this list is shrinked
>> disproportionality to another lists.
>>
>> This looks confusing, so I'm reporting about this.
>> Consider this patch as RFC.
> 
> So... I think the reasoning here is that xfs doesn't allocate inodes on
> behalf of any particular memcg (or put another way the cost of the
> inodes is shared by everything in the system) so if the shrinkers get
> called because a particular memcg is bumping up against its limits it
> makes no sense to try to purge xfs inodes?

Yes, since shrinking xfs cached objects doesn't free memcg kmem.
 
> Followup questions: does the same reasoning apply to the xfs buffer and
> quota shrinkers?

It's not needed, as those shrinkers don't have the SHRINKER_MEMCG_AWARE flag.
So, they are called only in case of global reclaim (memcg == NULL).

But they may require another type of change. They use list_lru,
so they may need the SHRINKER_MEMCG_AWARE flag. This depends
on how the objects linked to xfs_buftarg::bt_lru and xfs_quotainfo::qi_lru
are allocated. If they are accounted to a memcg (i.e., allocated with the
__GFP_ACCOUNT flag in their kmalloc/slab alloc/etc), the shrinkers have
to have this flag. Ideally, all user-initiated allocations should be
made with this flag.

If they already use __GFP_ACCOUNT while the shrinker is not marked
SHRINKER_MEMCG_AWARE, it never shrinks the objects related to memcgs.
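
A minimal sketch of what making such a list_lru-backed shrinker memcg
aware would involve (demo_* names, demo_isolate and demo_lru are
hypothetical; whether this is appropriate for the buffer/quota LRUs
depends on the __GFP_ACCOUNT question above):

```c
/* 1. Account the objects to the allocating memcg. */
obj = kmem_cache_alloc(demo_cache, GFP_KERNEL | __GFP_ACCOUNT);

/* 2. Initialize the LRU with per-memcg lists. */
err = list_lru_init_memcg(&demo_lru);

/* 3. Advertise memcg awareness and use the helpers that filter by
 *    sc->memcg (and sc->nid).
 */
static unsigned long
demo_count(struct shrinker *shrink, struct shrink_control *sc)
{
	return list_lru_shrink_count(&demo_lru, sc);
}

static unsigned long
demo_scan(struct shrinker *shrink, struct shrink_control *sc)
{
	/* walk only the sc->memcg/sc->nid slice of the LRU;
	 * demo_isolate is a hypothetical list_lru_walk_cb.
	 */
	return list_lru_shrink_walk(&demo_lru, sc, demo_isolate, NULL);
}

static struct shrinker demo_shrinker = {
	.count_objects	= demo_count,
	.scan_objects	= demo_scan,
	.seeks		= DEFAULT_SEEKS,
	.flags		= SHRINKER_MEMCG_AWARE | SHRINKER_NUMA_AWARE,
};
```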

> Do any filesystems associate their metadata memory
> allocations with a specific memcg?  If not, why not put this in
> super_cache_{scan,count}?

They do: the generic super_cache_{scan,count} deal with the sb's
s_dentry_lru and s_inode_lru, which use list_lru, and list_lru
supports memcg accounting.
 
> --D
> 
>> Signed-off-by: Kirill Tkhai 
>> ---
>>  fs/xfs/xfs_super.c |2 ++
>>  1 file changed, 2 insertions(+)
>>
>> diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
>> index 951271f57d00..124568aefa94 100644
>> --- a/fs/xfs/xfs_super.c
>> +++ b/fs/xfs/xfs_super.c
>> @@ -1788,6 +1788,8 @@ xfs_fs_nr_cached_objects(
>>  struct super_block  *sb,
>>  struct shrink_control   *sc)
>>  {
>> +if (sc->memcg)
>> +return 0;
>>  return xfs_reclaim_inodes_count(XFS_M(sb));
>>  }
>>  
>>
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
>> the body of a message to majord...@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH RFC] xfs, memcg: Call xfs_fs_nr_cached_objects() only in case of global reclaim

2018-03-15 Thread Darrick J. Wong
On Thu, Mar 15, 2018 at 06:01:34PM +0300, Kirill Tkhai wrote:
> xfs_reclaim_inodes_count(XFS_M(sb)) does not care about memcg.
> So, it's called for memcg reclaim too, e.g. this list is shrinked
> disproportionality to another lists.
> 
> This looks confusing, so I'm reporting about this.
> Consider this patch as RFC.

So... I think the reasoning here is that xfs doesn't allocate inodes on
behalf of any particular memcg (or put another way the cost of the
inodes is shared by everything in the system) so if the shrinkers get
called because a particular memcg is bumping up against its limits it
makes no sense to try to purge xfs inodes?

Followup questions: does the same reasoning apply to the xfs buffer and
quota shrinkers?  Do any filesystems associate their metadata memory
allocations with a specific memcg?  If not, why not put this in
super_cache_{scan,count}?

--D

> Signed-off-by: Kirill Tkhai 
> ---
>  fs/xfs/xfs_super.c |2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
> index 951271f57d00..124568aefa94 100644
> --- a/fs/xfs/xfs_super.c
> +++ b/fs/xfs/xfs_super.c
> @@ -1788,6 +1788,8 @@ xfs_fs_nr_cached_objects(
>   struct super_block  *sb,
>   struct shrink_control   *sc)
>  {
> + if (sc->memcg)
> + return 0;
>   return xfs_reclaim_inodes_count(XFS_M(sb));
>  }
>  
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH RFC] xfs, memcg: Call xfs_fs_nr_cached_objects() only in case of global reclaim

2018-03-15 Thread Kirill Tkhai
xfs_reclaim_inodes_count(XFS_M(sb)) does not care about memcg.
So, it's called for memcg reclaim too, i.e. this list is shrunk
disproportionately to the other lists.

This looks confusing, so I'm reporting it.
Consider this patch as RFC.

Signed-off-by: Kirill Tkhai 
---
 fs/xfs/xfs_super.c |2 ++
 1 file changed, 2 insertions(+)

diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index 951271f57d00..124568aefa94 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -1788,6 +1788,8 @@ xfs_fs_nr_cached_objects(
struct super_block  *sb,
struct shrink_control   *sc)
 {
+   if (sc->memcg)
+   return 0;
return xfs_reclaim_inodes_count(XFS_M(sb));
 }
 


