Re: [PATCH 17/19] drivers: convert shrinkers to new count/scan API

2013-06-07 Thread Konrad Rzeszutek Wilk
On Wed, Nov 28, 2012 at 10:14:44AM +1100, Dave Chinner wrote:
> From: Dave Chinner 
> 
> Convert the driver shrinkers to the new API. Most changes are
> compile tested only because I either don't have the hardware or it's
> staging stuff.

I presume that the i915, ttm_page_alloc and ttm_page_alloc_dma
were tested by you? They cover the most common graphics drivers.

> 
> FWIW, the md and android code is pretty good, but the rest of it
> makes me want to claw my eyes out.  The amount of broken code I just
> encountered is mind boggling.  I've added comments explaining what
> is broken, but I fear that some of the code would be best dealt with
> by being dragged behind the bike shed, buried in mud up to its
> neck and then run over repeatedly with a blunt lawn mower.
> 
> Special mention goes to the zcache/zcache2 drivers. They can't
> co-exist in the build at the same time, they are under different
> menu options in menuconfig, they only show up when you've got the
> right set of mm subsystem options configured and so even compile
> testing is an exercise in pulling teeth.  And that doesn't even take
> into account the horrible, broken code...

Hm, I was under the impression that there is only one zcache code?
Are you referring to ramster perhaps?

> 
> Signed-off-by: Dave Chinner 
> ---
>  drivers/gpu/drm/i915/i915_dma.c   |4 +-
>  drivers/gpu/drm/i915/i915_gem.c   |   64 +---
>  drivers/gpu/drm/ttm/ttm_page_alloc.c  |   48 ++---
>  drivers/gpu/drm/ttm/ttm_page_alloc_dma.c  |   55 +++-
>  drivers/md/dm-bufio.c |   65 +++--
>  drivers/staging/android/ashmem.c  |   44 ---
>  drivers/staging/android/lowmemorykiller.c |   60 +-
>  drivers/staging/ramster/zcache-main.c |   58 ++---
>  drivers/staging/zcache/zcache-main.c  |   40 ++
>  9 files changed, 297 insertions(+), 141 deletions(-)
> 
> diff --git a/drivers/gpu/drm/i915/i915_dma.c b/drivers/gpu/drm/i915/i915_dma.c
> index 61ae104..0ddec32 100644
> --- a/drivers/gpu/drm/i915/i915_dma.c
> +++ b/drivers/gpu/drm/i915/i915_dma.c
> @@ -1658,7 +1658,7 @@ int i915_driver_load(struct drm_device *dev, unsigned long flags)
>   return 0;
>  
>  out_gem_unload:
> - if (dev_priv->mm.inactive_shrinker.shrink)
> + if (dev_priv->mm.inactive_shrinker.scan_objects)
>   unregister_shrinker(&dev_priv->mm.inactive_shrinker);
>  
>   if (dev->pdev->msi_enabled)
> @@ -1695,7 +1695,7 @@ int i915_driver_unload(struct drm_device *dev)
>  
>   i915_teardown_sysfs(dev);
>  
> - if (dev_priv->mm.inactive_shrinker.shrink)
> + if (dev_priv->mm.inactive_shrinker.scan_objects)
>   unregister_shrinker(&dev_priv->mm.inactive_shrinker);
>  
>   mutex_lock(&dev->struct_mutex);
> diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
> index 107f09b..ceab752 100644
> --- a/drivers/gpu/drm/i915/i915_gem.c
> +++ b/drivers/gpu/drm/i915/i915_gem.c
> @@ -53,8 +53,10 @@ static void i915_gem_object_update_fence(struct drm_i915_gem_object *obj,
>struct drm_i915_fence_reg *fence,
>bool enable);
>  
> -static int i915_gem_inactive_shrink(struct shrinker *shrinker,
> +static long i915_gem_inactive_count(struct shrinker *shrinker,
>   struct shrink_control *sc);
> +static long i915_gem_inactive_scan(struct shrinker *shrinker,
> +struct shrink_control *sc);
>  static long i915_gem_purge(struct drm_i915_private *dev_priv, long target);
>  static void i915_gem_shrink_all(struct drm_i915_private *dev_priv);
>  static void i915_gem_object_truncate(struct drm_i915_gem_object *obj);
> @@ -4197,7 +4199,8 @@ i915_gem_load(struct drm_device *dev)
>  
>   dev_priv->mm.interruptible = true;
>  
> - dev_priv->mm.inactive_shrinker.shrink = i915_gem_inactive_shrink;
> + dev_priv->mm.inactive_shrinker.count_objects = i915_gem_inactive_count;
> + dev_priv->mm.inactive_shrinker.scan_objects = i915_gem_inactive_scan;
>   dev_priv->mm.inactive_shrinker.seeks = DEFAULT_SEEKS;
>   register_shrinker(&dev_priv->mm.inactive_shrinker);
>  }
> @@ -4407,35 +4410,64 @@ void i915_gem_release(struct drm_device *dev, struct drm_file *file)
>   spin_unlock(&file_priv->mm.lock);
>  }
>  
> -static int
> -i915_gem_inactive_shrink(struct shrinker *shrinker, struct shrink_control *sc)
> +/*
> + * XXX: (dchinner) This is one of the worst cases of shrinker abuse I've seen.
> + *
> + * i915_gem_purge() expects a byte count to be passed, and the minimum object
> + * size is PAGE_SIZE. The shrinker doesn't work on bytes - it works on
> + * *objects*. So it passes a nr_to_scan of 128 objects, which is interpreted
> + * here to mean "free 128 bytes". That means a 

Re: [PATCH 17/19] drivers: convert shrinkers to new count/scan API

2012-11-29 Thread Dave Chinner
On Thu, Nov 29, 2012 at 02:29:33PM +0400, Glauber Costa wrote:
> On 11/29/2012 01:28 AM, Dave Chinner wrote:
> > On Wed, Nov 28, 2012 at 12:21:54PM +0400, Glauber Costa wrote:
> >> On 11/28/2012 07:17 AM, Dave Chinner wrote:
> >>> On Wed, Nov 28, 2012 at 01:13:11AM +, Chris Wilson wrote:
> >>>> On Wed, 28 Nov 2012 10:14:44 +1100, Dave Chinner wrote:
> >>>>> The shrinker doesn't work on bytes - it works on
> >>>>> + * *objects*.
> >>>>
> >>>> And I thought you were reviewing the shrinker API to be useful where a
> >>>> single object may range between 4K and 4G.
> >>>
> >>> Which requires rewriting all the algorithms to not be dependent on
> >>> the subsystems using a fixed size object. The shrinker control
> >>> function is called shrink_slab() for a reason - it was expected to
> >>> be used to shrink caches of fixed sized objects allocated from slab
> >>> memory.
> >>>
> >>> It has no concept of the amount of memory that each object consumes,
> >>> just an idea of how much *IO* it takes to replace the object in
> >>> memory once it's been reclaimed. The DEFAULT_SEEKS is design to
> >>> encode the fact it generally takes 2 IOs to replace either a LRU
> >>> page or a filesystem slab object, and so balances the scanning based
> >>> on that value. i.e. the shrinker algorithms are solidly based around
> >>> fixed sized objects that have some relationship to the cost of
> >>> physical IO operations to replace them in the cache.
> >>
> >> One nit: It shouldn't take 2IOs to replace a slab object, right?
> >
> > A random dentry in a small directory will take one IO to read the
> > inode, then another to read the block the dirent sits in. To read an
> > inode from a cached dentry will generally take one IO to read the
> > inode, and another to read related, out of inode information (e.g.
> > attributes or extent/block maps). Sometimes it will only take one IO,
> > sometimes it might take 3 or, in the case of dirents, could take
> > hundreds of IOs if the directory structure is large enough.
> >
> > So a default of 2 seeks to replace any single dentry/inode in the
> > cache is a pretty good default to use.
> >
> >> This
> >> should be the cost of allocating a new page, that can contain, multiple
> >> objects.
> >>
> >> Once the page is in, a new object should be quite cheap to come up with.
> > 
> 
> Indeed. More on this in the next paragraph...

I'm not sure what you are trying to say here. Are you saying that
you think that the IO cost for replacing a slab cache object doesn't
matter?

> > It's not the cost of allocating the page (a couple of microseconds)
> > that is being considered - it the 3-4 orders of magnitude worse cost
> > of reading the object from disk (could be 20ms). The slab/page
> > allocation is lost in the noise compared to the time it takes to
> > fill the page cache page with data or a single slab object.
> > Essentially, slab pages with multiple objects in them are much more
> > expensive to replace in the cache than a page cache page
> > 
> >> This is a very wild thought, but now that I am diving deep in the
> >> shrinker API, and seeing things like this:
> >>
> >> if (reclaim_state) {
> >> sc->nr_reclaimed += reclaim_state->reclaimed_slab;
> >> reclaim_state->reclaimed_slab = 0;
> >> }
> > 
> > That's not part of the shrinker - that's part of the vmscan
> > code, external to the shrinker infrastructure. It's getting
> > information back from the slab caches behind the shrinkers, and it's
> > not the full picture because many shrinkers are not backed by slab
> > caches. It's a work around for not not having accurate feedback from
> > the shrink_slab() code about how many pages were freed.
> > 
> I know it is not part of the shrinkers, and that is precisely my point.
> vmscan needs to go through this kinds of hacks because our API is not
> strong enough to just give it back the answer that matters to the caller.

What matters is that the slab caches are shrunk in proportion to the
page cache. i.e. balanced reclaim. For dentry and inode caches, what
matters is the number of objects reclaimed because the shrinker
algorithm balances based on the relative cost of object replacement
in the cache.

e.g. if you have 1000 pages in the page LRUs, and 1000 objects in
the dentry cache, each takes 1 IO to replace, then if you reclaim 2
pages from the page LRUs and 2 pages from the dentry cache, it will
take 2 IOs to replace the pages in the LRU, but 36 IOs to replace
the objects in the dentry cache that were reclaimed.

This is why the shrinker balances "objects scanned" vs "LRU pages
scanned" - it treats each page as an object and the shrinker relates
that to the relative cost of objects in the slab cache being
reclaimed. i.e. the focus is on keeping a balance between caches,
not reclaiming an absolute number of pages.
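
The arithmetic in the example above can be written out as a small userspace sketch. It assumes 18 dentries per slab page, a figure inferred only from the "36 IOs for 2 pages" numbers quoted above; replace_cost_ios() and its parameters are illustrative names, not kernel API.

```c
#include <assert.h>

/*
 * IOs needed to bring back everything that lived on the reclaimed pages.
 * A page cache page holds one "object" (the page itself); a slab page
 * holds several, and each one needs its own IO to be re-read.
 */
static unsigned long replace_cost_ios(unsigned long pages_reclaimed,
                                      unsigned long objects_per_page,
                                      unsigned long ios_per_object)
{
    assert(objects_per_page > 0);
    return pages_reclaimed * objects_per_page * ios_per_object;
}
```

Under those assumptions, replace_cost_ios(2, 1, 1) gives 2 IOs for two LRU pages, while replace_cost_ios(2, 18, 1) gives 36 IOs for two pages of dentries.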

Note: I'm not saying this is perfect, what I'm trying to do is let
you know the "why" behind the current algorithm. i.e. why it mostly
works and why ignoring the princip

Re: [PATCH 17/19] drivers: convert shrinkers to new count/scan API

2012-11-29 Thread Glauber Costa
On 11/29/2012 01:28 AM, Dave Chinner wrote:
> On Wed, Nov 28, 2012 at 12:21:54PM +0400, Glauber Costa wrote:
>> On 11/28/2012 07:17 AM, Dave Chinner wrote:
>>> On Wed, Nov 28, 2012 at 01:13:11AM +, Chris Wilson wrote:
>>>> On Wed, 28 Nov 2012 10:14:44 +1100, Dave Chinner wrote:
>>>>> +/*
>>>>> + * XXX: (dchinner) This is one of the worst cases of shrinker abuse I've seen.
>>>>> + *
>>>>> + * i915_gem_purge() expects a byte count to be passed, and the minimum object
>>>>> + * size is PAGE_SIZE.
>>>>
>>>> No, purge() expects a count of pages to be freed. Each pass of the
>>>> shrinker therefore tries to free a minimum of 128 pages.
>>>
>>> Ah, I got the shifts mixed up. I'd been looking at way too much crap
>>> already when I saw this. But the fact this can be misunderstood says
>>> something about the level of documentation that the code has (i.e.
>>> none).
>>>
>>>>> The shrinker doesn't work on bytes - it works on
>>>>> + * *objects*.
>>>>
>>>> And I thought you were reviewing the shrinker API to be useful where a
>>>> single object may range between 4K and 4G.
>>>
>>> Which requires rewriting all the algorithms to not be dependent on
>>> the subsystems using a fixed size object. The shrinker control
>>> function is called shrink_slab() for a reason - it was expected to
>>> be used to shrink caches of fixed sized objects allocated from slab
>>> memory.
>>>
>>> It has no concept of the amount of memory that each object consumes,
>>> just an idea of how much *IO* it takes to replace the object in
>>> memory once it's been reclaimed. The DEFAULT_SEEKS is design to
>>> encode the fact it generally takes 2 IOs to replace either a LRU
>>> page or a filesystem slab object, and so balances the scanning based
>>> on that value. i.e. the shrinker algorithms are solidly based around
>>> fixed sized objects that have some relationship to the cost of
>>> physical IO operations to replace them in the cache.
>>
>> One nit: It shouldn't take 2IOs to replace a slab object, right?
>
> A random dentry in a small directory will take one IO to read the
> inode, then another to read the block the dirent sits in. To read an
> inode from a cached dentry will generally take one IO to read the
> inode, and another to read related, out of inode information (e.g.
> attributes or extent/block maps). Sometimes it will only take one IO,
> sometimes it might take 3 or, in the case of dirents, could take
> hundreds of IOs if the directory structure is large enough.
>
> So a default of 2 seeks to replace any single dentry/inode in the
> cache is a pretty good default to use.
>
>> This
>> should be the cost of allocating a new page, that can contain, multiple
>> objects.
>>
>> Once the page is in, a new object should be quite cheap to come up with.
> 

Indeed. More on this in the next paragraph...

> It's not the cost of allocating the page (a couple of microseconds)
> that is being considered - it the 3-4 orders of magnitude worse cost
> of reading the object from disk (could be 20ms). The slab/page
> allocation is lost in the noise compared to the time it takes to
> fill the page cache page with data or a single slab object.
> Essentially, slab pages with multiple objects in them are much more
> expensive to replace in the cache than a page cache page
> 
>> This is a very wild thought, but now that I am diving deep in the
>> shrinker API, and seeing things like this:
>>
>> if (reclaim_state) {
>> sc->nr_reclaimed += reclaim_state->reclaimed_slab;
>> reclaim_state->reclaimed_slab = 0;
>> }
> 
> That's not part of the shrinker - that's part of the vmscan
> code, external to the shrinker infrastructure. It's getting
> information back from the slab caches behind the shrinkers, and it's
> not the full picture because many shrinkers are not backed by slab
> caches. It's a work around for not not having accurate feedback from
> the shrink_slab() code about how many pages were freed.
> 
I know it is not part of the shrinkers, and that is precisely my point.
vmscan needs to go through these kinds of hacks because our API is not
strong enough to just give it back the answer that matters to the caller.

> Essentially, the problem is an impedance mismatch between the way
> the LRUs are scanned/balanced (in pages) and slab caches are managed
> (by objects). That's what needs unifying...
> 
So read my statement again, Dave: this is precisely what I am advocating!

The fact that you are so much more concerned with bringing the dentries back
from disk is just an obvious consequence of your FS background. The
problem I was more concerned with is when a user needs to allocate a page
for whatever reason. We're short on pages, and then we shrink. But the
shrink gives us nothing. If this is a user-page driven workload, it
should be better to do this, than to get rid of user pages - which we
may end up doing if the shrinkers do not release enough pages. This is
in contrast with a dcache-driven workload, where what you are saying
mak

Re: [PATCH 17/19] drivers: convert shrinkers to new count/scan API

2012-11-28 Thread Dave Chinner
On Wed, Nov 28, 2012 at 12:21:54PM +0400, Glauber Costa wrote:
> On 11/28/2012 07:17 AM, Dave Chinner wrote:
> > On Wed, Nov 28, 2012 at 01:13:11AM +, Chris Wilson wrote:
> >> On Wed, 28 Nov 2012 10:14:44 +1100, Dave Chinner wrote:
> >>> +/*
> >>> + * XXX: (dchinner) This is one of the worst cases of shrinker abuse I've seen.
> >>> + *
> >>> + * i915_gem_purge() expects a byte count to be passed, and the minimum object
> >>> + * size is PAGE_SIZE.
> >>
> >> No, purge() expects a count of pages to be freed. Each pass of the
> >> shrinker therefore tries to free a minimum of 128 pages.
> > 
> > Ah, I got the shifts mixed up. I'd been looking at way too much crap
> > already when I saw this. But the fact this can be misunderstood says
> > something about the level of documentation that the code has (i.e.
> > none).
> > 
> >>> The shrinker doesn't work on bytes - it works on
> >>> + * *objects*.
> >>
> >> And I thought you were reviewing the shrinker API to be useful where a
> >> single object may range between 4K and 4G.
> > 
> > Which requires rewriting all the algorithms to not be dependent on
> > the subsystems using a fixed size object. The shrinker control
> > function is called shrink_slab() for a reason - it was expected to
> > be used to shrink caches of fixed sized objects allocated from slab
> > memory.
> > 
> > It has no concept of the amount of memory that each object consumes,
> > just an idea of how much *IO* it takes to replace the object in
> > memory once it's been reclaimed. The DEFAULT_SEEKS is design to
> > encode the fact it generally takes 2 IOs to replace either a LRU
> > page or a filesystem slab object, and so balances the scanning based
> > on that value. i.e. the shrinker algorithms are solidly based around
> > fixed sized objects that have some relationship to the cost of
> > physical IO operations to replace them in the cache.
> 
> One nit: It shouldn't take 2IOs to replace a slab object, right?

A random dentry in a small directory will take one IO to read the
inode, then another to read the block the dirent sits in. To read an
inode from a cached dentry will generally take one IO to read the
inode, and another to read related, out of inode information (e.g.
attributes or extent/block maps). Sometimes it will only take one IO,
sometimes it might take 3 or, in the case of dirents, could take
hundreds of IOs if the directory structure is large enough.

So a default of 2 seeks to replace any single dentry/inode in the
cache is a pretty good default to use.

> This
> should be the cost of allocating a new page, that can contain, multiple
> objects.
>
> Once the page is in, a new object should be quite cheap to come up with.

It's not the cost of allocating the page (a couple of microseconds)
that is being considered - it is the 3-4 orders of magnitude worse cost
of reading the object from disk (could be 20ms). The slab/page
allocation is lost in the noise compared to the time it takes to
fill the page cache page with data or a single slab object.
Essentially, slab pages with multiple objects in them are much more
expensive to replace in the cache than a page cache page.

> This is a very wild thought, but now that I am diving deep in the
> shrinker API, and seeing things like this:
> 
> if (reclaim_state) {
> sc->nr_reclaimed += reclaim_state->reclaimed_slab;
> reclaim_state->reclaimed_slab = 0;
> }

That's not part of the shrinker - that's part of the vmscan
code, external to the shrinker infrastructure. It's getting
information back from the slab caches behind the shrinkers, and it's
not the full picture because many shrinkers are not backed by slab
caches. It's a workaround for not having accurate feedback from
the shrink_slab() code about how many pages were freed.

Essentially, the problem is an impedance mismatch between the way
the LRUs are scanned/balanced (in pages) and slab caches are managed
(by objects). That's what needs unifying...

> I am becoming more convinced that we should have a page-based mechanism,
> like the rest of vmscan.

Been thought about and considered before. Would you like to rewrite
the slab code?

> Also, if we are seeing pressure from someone requesting user pages, what
> good does it make to free, say, 35 Mb of memory, if this means we are
> freeing objects across 5k different pages, without actually releasing
> any of them? (still is TBD if this is a theoretical problem or a
> practical one). It would maybe be better to free objects that are
> moderately hot, but are on pages dominated by cold objects...

Yup, that's a problem, but now you're asking shrinker
implementations to know in great detail the physical locality of
objects and not just the temporal locality. The node-aware LRU list
does this at a coarse level, but to do page based reclaim you need
to track pages in SL*B that contain unreferenced objects as those
are the only ones that can be reclaimed.

If you have no pages with unreferenced objects, then you

Re: [PATCH 17/19] drivers: convert shrinkers to new count/scan API

2012-11-28 Thread Glauber Costa
On 11/28/2012 07:17 AM, Dave Chinner wrote:
> On Wed, Nov 28, 2012 at 01:13:11AM +, Chris Wilson wrote:
>> On Wed, 28 Nov 2012 10:14:44 +1100, Dave Chinner  wrote:
>>> +/*
>>> + * XXX: (dchinner) This is one of the worst cases of shrinker abuse I've seen.
>>> + *
>>> + * i915_gem_purge() expects a byte count to be passed, and the minimum object
>>> + * size is PAGE_SIZE.
>>
>> No, purge() expects a count of pages to be freed. Each pass of the
>> shrinker therefore tries to free a minimum of 128 pages.
> 
> Ah, I got the shifts mixed up. I'd been looking at way too much crap
> already when I saw this. But the fact this can be misunderstood says
> something about the level of documentation that the code has (i.e.
> none).
> 
>>> The shrinker doesn't work on bytes - it works on
>>> + * *objects*.
>>
>> And I thought you were reviewing the shrinker API to be useful where a
>> single object may range between 4K and 4G.
> 
> Which requires rewriting all the algorithms to not be dependent on
> the subsystems using a fixed size object. The shrinker control
> function is called shrink_slab() for a reason - it was expected to
> be used to shrink caches of fixed sized objects allocated from slab
> memory.
> 
> It has no concept of the amount of memory that each object consumes,
> just an idea of how much *IO* it takes to replace the object in
> memory once it's been reclaimed. The DEFAULT_SEEKS is design to
> encode the fact it generally takes 2 IOs to replace either a LRU
> page or a filesystem slab object, and so balances the scanning based
> on that value. i.e. the shrinker algorithms are solidly based around
> fixed sized objects that have some relationship to the cost of
> physical IO operations to replace them in the cache.

One nit: It shouldn't take 2IOs to replace a slab object, right? This
should be the cost of allocating a new page, that can contain, multiple
objects.

Once the page is in, a new object should be quite cheap to come up with.

This is a very wild thought, but now that I am diving deep in the
shrinker API, and seeing things like this:

if (reclaim_state) {
sc->nr_reclaimed += reclaim_state->reclaimed_slab;
reclaim_state->reclaimed_slab = 0;
}

I am becoming more convinced that we should have a page-based mechanism,
like the rest of vmscan.

Also, if we are seeing pressure from someone requesting user pages, what
good does it do to free, say, 35 MB of memory, if this means we are
freeing objects across 5k different pages, without actually releasing
any of them? (still is TBD if this is a theoretical problem or a
practical one). It would maybe be better to free objects that are
moderately hot, but are on pages dominated by cold objects...
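
The concern above can be illustrated with a toy userspace model. It is only a sketch: struct slab_page, OBJS_PER_PAGE and the two helpers are hypothetical and do not correspond to any SL*B data structure. It contrasts freeing the same number of objects spread across pages with freeing them page by page, and counts how many whole pages become free in each case.

```c
#include <assert.h>
#include <stddef.h>

#define OBJS_PER_PAGE 8   /* hypothetical density; real slabs vary by object size */

struct slab_page {
    unsigned int in_use;  /* live objects still pinning this page */
};

/* Mark every page as completely full. */
static void fill_pages(struct slab_page *pages, size_t nr_pages)
{
    for (size_t i = 0; i < nr_pages; i++)
        pages[i].in_use = OBJS_PER_PAGE;
}

/* Free objects round-robin across pages; return pages that became empty. */
static size_t free_spread(struct slab_page *pages, size_t nr_pages,
                          size_t nr_to_free)
{
    size_t released = 0;

    assert(nr_pages > 0);
    for (size_t i = 0; i < nr_to_free; i++) {
        struct slab_page *p = &pages[i % nr_pages];

        if (p->in_use == 0)
            continue;
        if (--p->in_use == 0)
            released++;
    }
    return released;
}

/* Free objects one page at a time; return pages that became empty. */
static size_t free_packed(struct slab_page *pages, size_t nr_pages,
                          size_t nr_to_free)
{
    size_t released = 0;

    for (size_t i = 0; i < nr_pages && nr_to_free > 0; i++) {
        while (pages[i].in_use > 0 && nr_to_free > 0) {
            pages[i].in_use--;
            nr_to_free--;
            if (pages[i].in_use == 0)
                released++;
        }
    }
    return released;
}
```

With four full pages, freeing 16 objects through free_spread() releases no pages, while freeing the same 16 objects through free_packed() releases two.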


> 
> The API change is the first step in the path to removing these built
> in assumptions. The current API is just insane and any attempt to
> build on it is going to be futile. 

Amen, brother!

> The way I see this developing is
> this:
> 
>   - make the shrink_slab count -> scan algorithm per node
> 
pages are per-node.

>   - add information about size of objects in the cache for
> fixed size object caches.
>   - the shrinker now has some idea of how many objects
> need to be freed to be able to free a page of
> memory, as well as the relative penalty for
> replacing them.
This is still guesswork; telling it how many pages it should free could
be a better idea.

>   - tells the shrinker the size of the cache
> in bytes so overall memory footprint of the caches
> can be taken into account

>   - add new count and scan operations for caches that are
> based on memory used, not object counts
>   - allows us to use the same count/scan algorithm for
> calculating how much pressure to put on caches
> with variable size objects.

IOW, pages.

> My care factor mostly ends here, as it will allow XFS to corectly
> balance the metadata buffer cache (variable size objects) against the
> inode, dentry and dquot caches which are object based. The next
> steps that I'm about to give you are based on some discussions with
> some MM people over bottles of red wine, so take it with a grain of
> salt...
> 
>   - calculate a "pressure" value for each cache controlled by a
> shrinker so that the relative memory pressure between
> caches can be compared. This allows the shrinkers to bias
> reclaim based on where the memory pressure is being
> generated
> 

Ok, if a cache is using a lot of memory, this would indicate it has the
dominant workload, right? Should we free from it, or should we free from
the others, so this one gets the pages it needs?

>   - start grouping shrinkers into a heirarchy, allowing
> related shrinkers (e.g. all the caches in a memcg) to be
> shrunk according resource limits that can be placed on the
> 

Re: [PATCH 17/19] drivers: convert shrinkers to new count/scan API

2012-11-27 Thread Dave Chinner
On Wed, Nov 28, 2012 at 01:13:11AM +, Chris Wilson wrote:
> On Wed, 28 Nov 2012 10:14:44 +1100, Dave Chinner  wrote:
> > +/*
> > + * XXX: (dchinner) This is one of the worst cases of shrinker abuse I've seen.
> > + *
> > + * i915_gem_purge() expects a byte count to be passed, and the minimum object
> > + * size is PAGE_SIZE.
> 
> No, purge() expects a count of pages to be freed. Each pass of the
> shrinker therefore tries to free a minimum of 128 pages.

Ah, I got the shifts mixed up. I'd been looking at way too much crap
already when I saw this. But the fact this can be misunderstood says
something about the level of documentation that the code has (i.e.
none).

> > The shrinker doesn't work on bytes - it works on
> > + * *objects*.
> 
> And I thought you were reviewing the shrinker API to be useful where a
> single object may range between 4K and 4G.

Which requires rewriting all the algorithms to not be dependent on
the subsystems using a fixed size object. The shrinker control
function is called shrink_slab() for a reason - it was expected to
be used to shrink caches of fixed sized objects allocated from slab
memory.

It has no concept of the amount of memory that each object consumes,
just an idea of how much *IO* it takes to replace the object in
memory once it's been reclaimed. The DEFAULT_SEEKS is designed to
encode the fact it generally takes 2 IOs to replace either a LRU
page or a filesystem slab object, and so balances the scanning based
on that value. i.e. the shrinker algorithms are solidly based around
fixed sized objects that have some relationship to the cost of
physical IO operations to replace them in the cache.
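
To make the seeks-based balancing concrete, here is a small userspace model of the arithmetic. It assumes the proportional-scan formula used by shrink_slab() in kernels of this era (4 * pages scanned / seeks, scaled by cache size over LRU size plus one); scan_target() and its parameter names are illustrative, not kernel API.

```c
#include <assert.h>
#include <stdint.h>

#define DEFAULT_SEEKS 2   /* the value discussed above */

/*
 * How many cache objects to scan, given how hard the page LRUs were
 * just scanned. A higher seeks value (objects that cost more IO to
 * replace) yields a smaller scan target for the same LRU pressure.
 */
static uint64_t scan_target(uint64_t nr_pages_scanned, uint64_t lru_pages,
                            uint64_t cache_objects, unsigned int seeks)
{
    uint64_t delta;

    assert(seeks > 0);
    delta = (4 * nr_pages_scanned) / seeks;
    delta *= cache_objects;
    delta /= lru_pages + 1;
    return delta;
}
```

For instance, with lru_pages = 9999 (the formula adds 1), scanning 1000 LRU pages asks a 1000-object cache with seeks = DEFAULT_SEEKS to scan 200 objects; with seeks = 4 the target drops to 100. Note that nothing in the calculation knows how large each object is.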

The API change is the first step in the path to removing these built
in assumptions. The current API is just insane and any attempt to
build on it is going to be futile. The way I see this developing is
this:

- make the shrink_slab count -> scan algorithm per node

- add information about size of objects in the cache for
  fixed size object caches.
- the shrinker now has some idea of how many objects
  need to be freed to be able to free a page of
  memory, as well as the relative penalty for
  replacing them.
- tells the shrinker the size of the cache
  in bytes so overall memory footprint of the caches
  can be taken into account

- add new count and scan operations for caches that are
  based on memory used, not object counts
- allows us to use the same count/scan algorithm for
  calculating how much pressure to put on caches
  with variable size objects.

My care factor mostly ends here, as it will allow XFS to correctly
balance the metadata buffer cache (variable size objects) against the
inode, dentry and dquot caches which are object based. The next
steps that I'm about to give you are based on some discussions with
some MM people over bottles of red wine, so take it with a grain of
salt...

- calculate a "pressure" value for each cache controlled by a
  shrinker so that the relative memory pressure between
  caches can be compared. This allows the shrinkers to bias
  reclaim based on where the memory pressure is being
  generated

- start grouping shrinkers into a hierarchy, allowing
  related shrinkers (e.g. all the caches in a memcg) to be
  shrunk according to resource limits that can be placed on the
  group. i.e. memory pressure is proportioned across
  groups rather than many individual shrinkers.

- comments have been made to the extent that with generic
  per-node lists and a node aware shrinker, all of the page
  scanning could be driven by the shrinker infrastructure,
  rather than the shrinkers being driven by how many pages
  in the page cache just got scanned for reclaim.

  IOWs, the main memory reclaim algorithm walks all the
  shrinker groups to calculate overall memory pressure,
  calculate how much reclaim is necessary, and then
  proportion reclaim across all the shrinker groups. i.e.
  everything is a shrinker.

This patch set is really just the start of a long process. Balance
between the page cache and VFS/filesystem shrinkers is critical to
the efficient operation of the OS under many, many workloads, so I'm
not about to change more than one little thing at a time. This API
change is just one little step. You'll get what you want eventually,
but you're not going to get it as a first step.
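
As a rough illustration of the count/scan split this patch introduces, here is a userspace mock. The two callback signatures follow the prototypes in the patch; the reduced struct definitions, the demo_* cache and the shrink_one() driver loop are hypothetical stand-ins and not the kernel's shrink_slab(). The batch size of 128 is the nr_to_scan figure mentioned in the patch comment.

```c
#include <assert.h>

struct shrink_control {
    long nr_to_scan;              /* how many objects to try to free */
};

struct shrinker {
    long (*count_objects)(struct shrinker *, struct shrink_control *);
    long (*scan_objects)(struct shrinker *, struct shrink_control *);
    int seeks;
};

static long demo_cache_size = 1000;   /* hypothetical cache population */

/* count: report how many objects could be freed, without freeing any */
static long demo_count(struct shrinker *s, struct shrink_control *sc)
{
    (void)s;
    (void)sc;
    return demo_cache_size;
}

/* scan: free up to nr_to_scan objects and report how many were freed */
static long demo_scan(struct shrinker *s, struct shrink_control *sc)
{
    long freed = sc->nr_to_scan < demo_cache_size ?
                 sc->nr_to_scan : demo_cache_size;

    (void)s;
    demo_cache_size -= freed;
    return freed;
}

/* Simplified caller: ask for a count first, then scan in batches of 128. */
static long shrink_one(struct shrinker *s, long target)
{
    struct shrink_control sc = { .nr_to_scan = 0 };
    long total = 0;

    assert(s->count_objects && s->scan_objects);
    while (target > 0 && s->count_objects(s, &sc) > 0) {
        long freed;

        sc.nr_to_scan = target < 128 ? target : 128;
        freed = s->scan_objects(s, &sc);
        if (freed <= 0)
            break;
        total += freed;
        target -= sc.nr_to_scan;
    }
    return total;
}
```

The point of the split is that the caller no longer has to overload a single callback for both "how big are you?" and "free this many"; each question has its own entry point and its own return value.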

> > + * But the craziest part comes when i915_gem_purge() has walked all the objects
> > + * and can't free any memory. That results in i915_gem_shrink_all() being
> > + * called, which idles the GPU and frees everything the driver has in its
> > + * active and inactive lists. It's basically hitt

Re: [PATCH 17/19] drivers: convert shrinkers to new count/scan API

2012-11-27 Thread Chris Wilson
On Wed, 28 Nov 2012 10:14:44 +1100, Dave Chinner  wrote:
> +/*
> + * XXX: (dchinner) This is one of the worst cases of shrinker abuse I've seen.
> + *
> + * i915_gem_purge() expects a byte count to be passed, and the minimum object
> + * size is PAGE_SIZE.

No, purge() expects a count of pages to be freed. Each pass of the
shrinker therefore tries to free a minimum of 128 pages.
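
The pages-versus-bytes distinction can be sketched in userspace as follows. This is a model only: struct gem_obj and purge_pages() are hypothetical stand-ins for the driver's objects and for i915_gem_purge(), and PAGE_SHIFT of 12 assumes 4 KiB pages. The target and the running count are both in pages, with each object's byte size shifted down before it is added.

```c
#include <assert.h>
#include <stddef.h>

#define PAGE_SHIFT 12   /* assumes 4 KiB pages */

struct gem_obj {
    size_t size;        /* object size in bytes, a multiple of the page size */
    int purgeable;      /* non-zero if the backing pages may be dropped */
};

/* Drop purgeable objects until at least `target` pages have been freed. */
static long purge_pages(struct gem_obj *objs, size_t n, long target)
{
    long count = 0;

    assert(target >= 0);
    for (size_t i = 0; i < n && count < target; i++) {
        if (!objs[i].purgeable)
            continue;
        count += (long)(objs[i].size >> PAGE_SHIFT);  /* bytes -> pages */
        objs[i].size = 0;
        objs[i].purgeable = 0;
    }
    return count;
}
```

Because objects are freed whole, a target of 128 pages is a minimum: a single 1 MiB object already accounts for 256 pages.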

> The shrinker doesn't work on bytes - it works on
> + * *objects*.

And I thought you were reviewing the shrinker API to be useful where a
single object may range between 4K and 4G.

> So it passes a nr_to_scan of 128 objects, which is interpreted
> + * here to mean "free 128 bytes". That means a single object will be freed, as
> + * the minimum object size is a page.
> + *
> + * But the craziest part comes when i915_gem_purge() has walked all the objects
> + * and can't free any memory. That results in i915_gem_shrink_all() being
> + * called, which idles the GPU and frees everything the driver has in its
> + * active and inactive lists. It's basically hitting the driver with a great big
> + * hammer because it was busy doing stuff when something else generated memory
> + * pressure. This doesn't seem particularly wise...
> + */

As opposed to triggering an OOM? The choice was between custom code for
a hopefully rare code path in a situation of last resort, or first
implementing the simplest code that stopped i915 from starving the
system of memory.
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre