Re: [Qemu-devel] [PATCH 8/9] mm: compaction: Cache if a pageblock was scanned and no pages were isolated

2012-09-27 Thread Mel Gorman
On Wed, Sep 26, 2012 at 09:49:30AM +0900, Minchan Kim wrote:
 On Tue, Sep 25, 2012 at 10:12:07AM +0100, Mel Gorman wrote:
  On Mon, Sep 24, 2012 at 02:26:44PM -0700, Andrew Morton wrote:
   On Mon, 24 Sep 2012 10:39:38 +0100
   Mel Gorman mgor...@suse.de wrote:
   
On Fri, Sep 21, 2012 at 02:36:56PM -0700, Andrew Morton wrote:

 Also, what has to be done to avoid the polling altogether?  eg/ie, zap
 a pageblock's PB_migrate_skip synchronously, when something was done to
 that pageblock which justifies repolling it?
 

The "something" event you are looking for is pages being freed or
allocated in the page allocator. A movable page being allocated in the
block or a page being freed should clear the PB_migrate_skip bit if it's
set. Unfortunately this would impact the alloc and free fast paths of
the page allocator. I felt that that was too high a price to pay.
   
   We already do a similar thing in the page allocator: clearing of
   ->all_unreclaimable and ->pages_scanned. 
  
  That is true but that is a simple write (shared cache line but still) to
  a struct zone. Worse, now that you point it out, that's pretty stupid. It
  should be checking if the value is non-zero before writing to it to avoid
  a cache line bounce.
  
  Clearing the PG_migrate_skip in this path to avoid the need to ever poll is
  not as cheap as it needs to be:
  
  set_pageblock_skip
    -> set_pageblock_flags_group
         -> page_zone
         -> page_to_pfn
         -> get_pageblock_bitmap
         -> pfn_to_bitidx
         -> __set_bit
  
   But that isn't on the fast
   path really - it happens once per pcp unload. 
  
  That's still an important enough path that I'm wary of making it fatter
  and that only covers the free path. To avoid the polling, the allocation
  side needs to be handled too. It could be shoved down into rmqueue() to
  put it into a slightly colder path but still, it's a price to pay to keep
  compaction happy.
  
   Can we do something
   like that?  Drop some hint into the zone without having to visit each
   page?
   
  
  Not without incurring a cost, but yes, it is possible to give a hint on when
  PG_migrate_skip should be cleared and move away from that time-based hammer.
  
  First, we'd introduce a variant of get_pageblock_migratetype() that returns
  all the bits for the pageblock flags and then helpers to extract either the
  migratetype or the PG_migrate_skip. We already are incurring the cost of
  get_pageblock_migratetype() so it will not be much more expensive than what
  is already there. If there is an allocation or free within a pageblock that
  has the PG_migrate_skip bit set then we increment a counter. When the counter
  reaches some to-be-decided threshold then compaction may clear all the
  bits. This would match the criteria of the clearing being based on activity.
  
  There are four potential problems with this
  
  1. The logic to retrieve all the bits and split them up will be a little
 convoluted but maybe it would not be that bad.
  
  2. The counter is a shared-writable cache line but obviously it could
 be moved to vmstat and incremented with inc_zone_page_state to offset
 the cost a little.
  
  3. The biggest weakness is that there is no way to know if the
 counter is incremented based on activity in a small subset of blocks.
  
  4. What should the threshold be?
  
  The first problem is minor but the other three are potentially a mess.
  Adding another vmstat counter is bad enough in itself but if the counter
  is incremented based on a small subset of pageblocks, the hint is
  potentially useless.
 
 Another idea is that we can add two bits (PG_check_migrate/PG_check_free)
 in pageblock_flags_group.
 In the allocation path, we can set PG_check_migrate in a pageblock.
 In the free path, we can set PG_check_free in a pageblock.
 They would be cleared by compaction's scan, as now.
 That way we could discard problems 3 and 4 at least.
 

Adding a second bit does not fix problem 3 or problem 4 at all. With two
bits, all activity could be concentrated on two blocks - one migrate and
one free. The threshold still has to be selected.

 Another idea is to cure it by fixing the fundamental problem:
 make the zone's locks more fine-grained.

Far easier said than done and only covers the contention problem. It
does nothing for the scanning problem.

 As time goes by, systems use more memory but our zone locks
 aren't scalable. Recently, lru_lock and zone->lock contention reports
 haven't been rare, so I think it's a good time to take the next step.
 

Lock contention on both those locks recently was due to compaction
rather than something more fundamental.

 How about defining a struct sub_zone per 2G or 4G?
 A zone could then have several sub_zones depending on its size; each
 sub_zone would take over the current zone's role and the zone would be
 just a container of sub_zones.
 Of course, it's not easy to implement but I think someday we should
 go that way. Is it really overkill?
 

One one 

Re: [Qemu-devel] [PATCH 8/9] mm: compaction: Cache if a pageblock was scanned and no pages were isolated

2012-09-27 Thread Mel Gorman
On Tue, Sep 25, 2012 at 01:03:52PM -0700, Andrew Morton wrote:
 On Tue, 25 Sep 2012 10:12:07 +0100
 Mel Gorman mgor...@suse.de wrote:
 
  First, we'd introduce a variant of get_pageblock_migratetype() that returns
  all the bits for the pageblock flags and then helpers to extract either the
  migratetype or the PG_migrate_skip. We already are incurring the cost of
  get_pageblock_migratetype() so it will not be much more expensive than what
  is already there. If there is an allocation or free within a pageblock that
  has the PG_migrate_skip bit set then we increment a counter. When the counter
  reaches some to-be-decided threshold then compaction may clear all the
  bits. This would match the criteria of the clearing being based on activity.
  
  There are four potential problems with this
  
  1. The logic to retrieve all the bits and split them up will be a little
 convoluted but maybe it would not be that bad.
  
  2. The counter is a shared-writable cache line but obviously it could
 be moved to vmstat and incremented with inc_zone_page_state to offset
 the cost a little.
  
  3. The biggest weakness is that there is no way to know if the
 counter is incremented based on activity in a small subset of blocks.
  
  4. What should the threshold be?
  
  The first problem is minor but the other three are potentially a mess.
  Adding another vmstat counter is bad enough in itself but if the counter
  is incremented based on a small subset of pageblocks, the hint is
  potentially useless.
  
  However, does this match what you have in mind or am I over-complicating
  things?
 
 Sounds complicated.
 
 Using wall time really does suck. 

I know, we spent a fair amount of effort getting rid of congestion_wait()
from paths it did not belong to for similar reasons.

 Are you sure you can't think of
 something more logical?
 

No, I'm not sure.

As a matter of general policy I should not encourage this but apparently
you can nag better code out of me; the patch is below :). I would rather it
were added on top rather than merged with the time-based series so it can
be reverted easily if necessary.

 How would we demonstrate the suckage?  What would be the observable
 downside of switching that 5 seconds to 5 hours?
 

Reduced allocation success rates.

  Let's take an unlikely case - a 128G single-node machine. That loop count
  on x86-64 would be 65536. It'll be fast enough, particularly in this
  path.
 
 That could easily exceed a millisecond.  Can/should we stick a
 cond_resched() in there?

Ok, I think it is very unlikely but not impossible. I posted a patch for
it already.
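
For reference, a minimal shape such a fix could take - this is a sketch
of the idea rather than the patch that was actually posted, and the
placement of the cond_resched() is an assumption:

	/* Walk the zone and mark every pageblock as suitable for isolation */
	for (pfn = start_pfn; pfn < end_pfn; pfn += pageblock_nr_pages) {
		struct page *page;

		cond_resched();	/* a 128G zone means 64K+ iterations */

		if (!pfn_valid(pfn))
			continue;

		page = pfn_to_page(pfn);
		if (zone != page_zone(page))
			continue;

		clear_pageblock_skip(page);
	}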

Here is a candidate patch that replaces the time heuristic with one that
is based on VM activity. My own testing indicates that scan rates are
slightly higher with this patch than with the time heuristic but well within
acceptable limits.

Richard, can you also test this patch and make sure your test case has
not regressed again please?

---8<---
mm: compaction: Clear PG_migrate_skip based on compaction and reclaim activity

Compaction caches if a pageblock was scanned and no pages were isolated
so that the pageblocks can be skipped in the future to reduce scanning.
This information is not cleared by the page allocator based on activity
due to the impact it would have on the page allocator fast paths. Hence
there is a requirement that something clear the cache or pageblocks will
be skipped forever. Currently the cache is cleared if there were a number
of recent allocation failures and it has not been cleared within the last
5 seconds. Time-based decisions like this are terrible as they have no
relationship to VM activity and are basically a big hammer.

Unfortunately, accurate heuristics would add cost to some hot paths so this
patch implements a rough heuristic. There are two cases where the cache
is cleared.

1. If a !kswapd process completes a compaction cycle (migrate and free
   scanner meet), the zone is marked compact_blockskip_flush. When kswapd
   goes to sleep, it will clear the cache. This is expected to be the
   common case where the cache is cleared. It does not really matter if
   kswapd happens to be asleep or going to sleep when the flag is set as
   it will be woken on the next allocation request.

2. If there have been multiple failures recently and compaction just
   finished being deferred then a process will clear the cache and start
   a full scan. This situation happens if there are multiple high-order
   allocation requests under heavy memory pressure.

The clearing of the PG_migrate_skip bits is inherently racy against
concurrent scans but the race is harmless. For allocations that can fail such as
THP, they will simply fail. For requests that cannot fail, they will
retry the allocation. Tests indicated that scanning rates were roughly
similar to when the time-based heuristic was used and the allocation
success rates were similar.
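
As a rough sketch of how the two triggers above fit together - the
compact_blockskip_flush field is named in this changelog, but the exact
code and the __reset_isolation_suitable() helper are assumptions since
the diff below is truncated:

	/* 1. In compact_finished(): the migrate and free scanners met */
	if (cc->free_pfn <= cc->migrate_pfn) {
		if (!current_is_kswapd())
			zone->compact_blockskip_flush = true;
		return COMPACT_COMPLETE;
	}

	/* ... and when kswapd goes to sleep, flush the cached hints */
	if (zone->compact_blockskip_flush)
		__reset_isolation_suitable(zone);

	/* 2. Repeated failures: compaction just finished being deferred */
	if (zone->compact_defer_shift == COMPACT_MAX_DEFER_SHIFT)
		__reset_isolation_suitable(zone);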

Signed-off-by: Mel Gorman mgor...@suse.de
---
 include/linux/compaction.h | 

Re: [Qemu-devel] [PATCH 8/9] mm: compaction: Cache if a pageblock was scanned and no pages were isolated

2012-09-25 Thread Mel Gorman
On Mon, Sep 24, 2012 at 02:26:44PM -0700, Andrew Morton wrote:
 On Mon, 24 Sep 2012 10:39:38 +0100
 Mel Gorman mgor...@suse.de wrote:
 
  On Fri, Sep 21, 2012 at 02:36:56PM -0700, Andrew Morton wrote:
  
   Also, what has to be done to avoid the polling altogether?  eg/ie, zap
   a pageblock's PB_migrate_skip synchronously, when something was done to
   that pageblock which justifies repolling it?
   
  
  The "something" event you are looking for is pages being freed or
  allocated in the page allocator. A movable page being allocated in the
  block or a page being freed should clear the PB_migrate_skip bit if it's
  set. Unfortunately this would impact the alloc and free fast paths of
  the page allocator. I felt that that was too high a price to pay.
 
 We already do a similar thing in the page allocator: clearing of
 ->all_unreclaimable and ->pages_scanned. 

That is true but that is a simple write (shared cache line but still) to
a struct zone. Worse, now that you point it out, that's pretty stupid. It
should be checking if the value is non-zero before writing to it to avoid
a cache line bounce.

Clearing the PG_migrate_skip in this path to avoid the need to ever poll is
not as cheap as it needs to be:

set_pageblock_skip
  -> set_pageblock_flags_group
       -> page_zone
       -> page_to_pfn
       -> get_pageblock_bitmap
       -> pfn_to_bitidx
       -> __set_bit
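
To make the cost concrete, a per-free clearing hook would look something
like this - a hypothetical sketch built on the get/clear helpers the
patch defines, not part of any posted patch:

	/*
	 * Hypothetical: clear the skip hint whenever a page in the block
	 * is freed or allocated. Every call walks page -> pfn -> zone ->
	 * pageblock bitmap and the write bounces a shared cache line,
	 * which is why it is too fat for the allocator fast paths.
	 */
	static inline void clear_skip_on_free(struct page *page)
	{
		if (get_pageblock_skip(page))
			clear_pageblock_skip(page);
	}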

 But that isn't on the fast
 path really - it happens once per pcp unload. 

That's still an important enough path that I'm wary of making it fatter
and that only covers the free path. To avoid the polling, the allocation
side needs to be handled too. It could be shoved down into rmqueue() to
put it into a slightly colder path but still, it's a price to pay to keep
compaction happy.

 Can we do something
 like that?  Drop some hint into the zone without having to visit each
 page?
 

Not without incurring a cost, but yes, it is possible to give a hint on when
PG_migrate_skip should be cleared and move away from that time-based hammer.

First, we'd introduce a variant of get_pageblock_migratetype() that returns
all the bits for the pageblock flags and then helpers to extract either the
migratetype or the PG_migrate_skip. We already are incurring the cost of
get_pageblock_migratetype() so it will not be much more expensive than what
is already there. If there is an allocation or free within a pageblock that
has the PG_migrate_skip bit set then we increment a counter. When the counter
reaches some to-be-decided threshold then compaction may clear all the
bits. This would match the criteria of the clearing being based on activity.
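
A rough sketch of that combined lookup, layered on the existing
get_pageblock_flags_group() - the helper names and the
NR_COMPACT_ACTIVITY vmstat item are illustrative assumptions, not part
of any posted patch:

	/* Read the migratetype and PG_migrate_skip in one bitmap walk */
	static inline unsigned long get_pageblock_flags_all(struct page *page)
	{
		return get_pageblock_flags_group(page, 0, NR_PAGEBLOCK_BITS - 1);
	}

	static inline int pageblock_flags_migratetype(unsigned long flags)
	{
		return flags & ((1UL << 3) - 1);	/* low 3 bits */
	}

	static inline bool pageblock_flags_skip(unsigned long flags)
	{
		return flags & (1UL << PB_migrate_skip);
	}

	/* In the allocator, where get_pageblock_migratetype() is called today */
	flags = get_pageblock_flags_all(page);
	if (pageblock_flags_skip(flags))
		inc_zone_page_state(page, NR_COMPACT_ACTIVITY);	/* hypothetical */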

There are four potential problems with this

1. The logic to retrieve all the bits and split them up will be a little
   convoluted but maybe it would not be that bad.

2. The counter is a shared-writable cache line but obviously it could
   be moved to vmstat and incremented with inc_zone_page_state to offset
   the cost a little.

3. The biggest weakness is that there is no way to know if the
   counter is incremented based on activity in a small subset of blocks.

4. What should the threshold be?

The first problem is minor but the other three are potentially a mess.
Adding another vmstat counter is bad enough in itself but if the counter
is incremented based on a small subset of pageblocks, the hint is
potentially useless.

However, does this match what you have in mind or am I over-complicating
things?

   
...
   
+static void reset_isolation_suitable(struct zone *zone)
+{
+	unsigned long start_pfn = zone->zone_start_pfn;
+	unsigned long end_pfn = zone->zone_start_pfn + zone->spanned_pages;
+	unsigned long pfn;
+
+	/*
+	 * Do not reset more than once every five seconds. If allocations are
+	 * failing sufficiently quickly to allow this to happen then continually
+	 * scanning for compaction is not going to help. The choice of five
+	 * seconds is arbitrary but will mitigate excessive scanning.
+	 */
+	if (time_before(jiffies, zone->compact_blockskip_expire))
+		return;
+	zone->compact_blockskip_expire = jiffies + (HZ * 5);
+
+	/* Walk the zone and mark every pageblock as suitable for isolation */
+	for (pfn = start_pfn; pfn < end_pfn; pfn += pageblock_nr_pages) {
+		struct page *page;
+		if (!pfn_valid(pfn))
+			continue;
+
+		page = pfn_to_page(pfn);
+		if (zone != page_zone(page))
+			continue;
+
+		clear_pageblock_skip(page);
+	}
   
   What's the worst-case loop count here?
   
  
  zone->spanned_pages >> pageblock_order
 
 What's the worst-case value of (zone->spanned_pages >> pageblock_order) :)

Let's take an unlikely case - a 128G single-node machine. That loop count
on x86-64 would 

Re: [Qemu-devel] [PATCH 8/9] mm: compaction: Cache if a pageblock was scanned and no pages were isolated

2012-09-25 Thread Andrew Morton
On Tue, 25 Sep 2012 10:12:07 +0100
Mel Gorman mgor...@suse.de wrote:

 First, we'd introduce a variant of get_pageblock_migratetype() that returns
 all the bits for the pageblock flags and then helpers to extract either the
 migratetype or the PG_migrate_skip. We already are incurring the cost of
 get_pageblock_migratetype() so it will not be much more expensive than what
 is already there. If there is an allocation or free within a pageblock that
 has the PG_migrate_skip bit set then we increment a counter. When the counter
 reaches some to-be-decided threshold then compaction may clear all the
 bits. This would match the criteria of the clearing being based on activity.
 
 There are four potential problems with this
 
 1. The logic to retrieve all the bits and split them up will be a little
convoluted but maybe it would not be that bad.
 
 2. The counter is a shared-writable cache line but obviously it could
be moved to vmstat and incremented with inc_zone_page_state to offset
the cost a little.
 
 3. The biggest weakness is that there is no way to know if the
counter is incremented based on activity in a small subset of blocks.
 
 4. What should the threshold be?
 
 The first problem is minor but the other three are potentially a mess.
 Adding another vmstat counter is bad enough in itself but if the counter
 is incremented based on a small subset of pageblocks, the hint is
 potentially useless.
 
 However, does this match what you have in mind or am I over-complicating
 things?

Sounds complicated.

Using wall time really does suck.  Are you sure you can't think of
something more logical?

How would we demonstrate the suckage?  What would be the observable downside of
switching that 5 seconds to 5 hours?

 +	for (pfn = start_pfn; pfn < end_pfn; pfn += pageblock_nr_pages) {
 +		struct page *page;
 +		if (!pfn_valid(pfn))
 +			continue;
 +
 +		page = pfn_to_page(pfn);
 +		if (zone != page_zone(page))
 +			continue;
 +
 +		clear_pageblock_skip(page);
 +	}

What's the worst-case loop count here?

   
   zone->spanned_pages >> pageblock_order
  
  What's the worst-case value of (zone->spanned_pages >> pageblock_order) :)
 
 Let's take an unlikely case - a 128G single-node machine. That loop count
 on x86-64 would be 65536. It'll be fast enough, particularly in this
 path.

That could easily exceed a millisecond.  Can/should we stick a
cond_resched() in there?
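
For reference, the arithmetic behind those numbers, assuming 4K pages and
pageblock_order = 9 on x86-64: 128G is 2^25 4K pages, and 2^25 >> 9 =
2^16 = 65536 pageblocks to visit. Even at only tens of nanoseconds per
pfn_valid()/page_zone()/bitmap-clear iteration, the full walk can
plausibly run over a millisecond, which is what motivates the
cond_resched() suggestion.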



Re: [Qemu-devel] [PATCH 8/9] mm: compaction: Cache if a pageblock was scanned and no pages were isolated

2012-09-25 Thread Minchan Kim
On Tue, Sep 25, 2012 at 10:12:07AM +0100, Mel Gorman wrote:
 On Mon, Sep 24, 2012 at 02:26:44PM -0700, Andrew Morton wrote:
  On Mon, 24 Sep 2012 10:39:38 +0100
  Mel Gorman mgor...@suse.de wrote:
  
   On Fri, Sep 21, 2012 at 02:36:56PM -0700, Andrew Morton wrote:
   
Also, what has to be done to avoid the polling altogether?  eg/ie, zap
a pageblock's PB_migrate_skip synchronously, when something was done to
that pageblock which justifies repolling it?

   
   The "something" event you are looking for is pages being freed or
   allocated in the page allocator. A movable page being allocated in the
   block or a page being freed should clear the PB_migrate_skip bit if it's
   set. Unfortunately this would impact the alloc and free fast paths of
   the page allocator. I felt that that was too high a price to pay.
  
  We already do a similar thing in the page allocator: clearing of
  ->all_unreclaimable and ->pages_scanned. 
 
 That is true but that is a simple write (shared cache line but still) to
 a struct zone. Worse, now that you point it out, that's pretty stupid. It
 should be checking if the value is non-zero before writing to it to avoid
 a cache line bounce.
 
 Clearing the PG_migrate_skip in this path to avoid the need to ever poll is
 not as cheap as it needs to be:
 
 set_pageblock_skip
   -> set_pageblock_flags_group
        -> page_zone
        -> page_to_pfn
        -> get_pageblock_bitmap
        -> pfn_to_bitidx
        -> __set_bit
 
  But that isn't on the fast
  path really - it happens once per pcp unload. 
 
 That's still an important enough path that I'm wary of making it fatter
 and that only covers the free path. To avoid the polling, the allocation
 side needs to be handled too. It could be shoved down into rmqueue() to
 put it into a slightly colder path but still, it's a price to pay to keep
 compaction happy.
 
  Can we do something
  like that?  Drop some hint into the zone without having to visit each
  page?
  
 
 Not without incurring a cost, but yes, it is possible to give a hint on when
 PG_migrate_skip should be cleared and move away from that time-based hammer.
 
 First, we'd introduce a variant of get_pageblock_migratetype() that returns
 all the bits for the pageblock flags and then helpers to extract either the
 migratetype or the PG_migrate_skip. We already are incurring the cost of
 get_pageblock_migratetype() so it will not be much more expensive than what
 is already there. If there is an allocation or free within a pageblock that
 has the PG_migrate_skip bit set then we increment a counter. When the counter
 reaches some to-be-decided threshold then compaction may clear all the
 bits. This would match the criteria of the clearing being based on activity.
 
 There are four potential problems with this
 
 1. The logic to retrieve all the bits and split them up will be a little
convoluted but maybe it would not be that bad.
 
 2. The counter is a shared-writable cache line but obviously it could
be moved to vmstat and incremented with inc_zone_page_state to offset
the cost a little.
 
 3. The biggest weakness is that there is no way to know if the
counter is incremented based on activity in a small subset of blocks.
 
 4. What should the threshold be?
 
 The first problem is minor but the other three are potentially a mess.
 Adding another vmstat counter is bad enough in itself but if the counter
 is incremented based on a small subset of pageblocks, the hint is
 potentially useless.

Another idea is that we can add two bits (PG_check_migrate/PG_check_free)
in pageblock_flags_group.
In the allocation path, we can set PG_check_migrate in a pageblock.
In the free path, we can set PG_check_free in a pageblock.
They would be cleared by compaction's scan, as now.
That way we could discard problems 3 and 4 at least.
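
A rough sketch of this suggestion, extending the pageblock bits defined
in the patch below - the check-bit names and placement are assumptions,
not posted code:

	enum pageblock_bits {
		PB_migrate,
		PB_migrate_end = PB_migrate + 3 - 1,
				/* 3 bits required for migrate types */
	#ifdef CONFIG_COMPACTION
		PB_migrate_skip,  /* If set the block is skipped by compaction */
		PB_check_migrate, /* Set on allocation within the block */
		PB_check_free,	  /* Set on free within the block */
	#endif /* CONFIG_COMPACTION */
		NR_PAGEBLOCK_BITS
	};

The scanners would clear the check bits for each block as they pass it.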

Another idea is to cure it by fixing the fundamental problem:
make the zone's locks more fine-grained.
As time goes by, systems use more memory but our zone locks
aren't scalable. Recently, lru_lock and zone->lock contention reports
haven't been rare, so I think it's a good time to take the next step.

How about defining a struct sub_zone per 2G or 4G?
A zone could then have several sub_zones depending on its size; each
sub_zone would take over the current zone's role and the zone would be
just a container of sub_zones.
Of course, it's not easy to implement but I think someday we should
go that way. Is it really overkill?

 
 However, does this match what you have in mind or am I over-complicating
 things?
 

 ...

 +static void reset_isolation_suitable(struct zone *zone)
 +{
 +	unsigned long start_pfn = zone->zone_start_pfn;
 +	unsigned long end_pfn = zone->zone_start_pfn + zone->spanned_pages;
 +	unsigned long pfn;
 +
 +	/*
 +	 * Do not reset more than once every five seconds. If allocations are
 +	 * failing sufficiently quickly to allow this to happen then
 +	 * continually scanning for compaction is not going to 

Re: [Qemu-devel] [PATCH 8/9] mm: compaction: Cache if a pageblock was scanned and no pages were isolated

2012-09-24 Thread Mel Gorman
On Fri, Sep 21, 2012 at 02:36:56PM -0700, Andrew Morton wrote:
 On Fri, 21 Sep 2012 11:46:22 +0100
 Mel Gorman mgor...@suse.de wrote:
 
  When compaction was implemented it was known that scanning could potentially
  be excessive. The ideal was that a counter be maintained for each pageblock
  but maintaining this information would incur a severe penalty due to a
  shared writable cache line. It has reached the point where the scanning
  costs are a serious problem, particularly on long-lived systems where a
  large process starts and allocates a large number of THPs at the same time.
  
  Instead of using a shared counter, this patch adds another bit to the
  pageblock flags called PG_migrate_skip. If a pageblock is scanned by
  either migrate or free scanner and 0 pages were isolated, the pageblock
  is marked to be skipped in the future. When scanning, this bit is checked
  before any scanning takes place and the block skipped if set.
  
  The main difficulty with a patch like this is deciding when to ignore the
  cached information. If it's ignored too often, the scanning rates will still
  be excessive. If the information is too stale then allocations will fail
  that might have otherwise succeeded. In this patch
  
  o CMA always ignores the information
  o If the migrate and free scanner meet then the cached information will
be discarded if it's at least 5 seconds since the last time the cache
was discarded
  o If there are a large number of allocation failures, discard the cache.
  
  The time-based heuristic is very clumsy but there are few choices for a
  better event. Depending solely on multiple allocation failures still allows
  excessive scanning when THP allocations are failing in quick succession
  due to memory pressure. Waiting until memory pressure is relieved would
  cause compaction to continually fail instead of using reclaim/compaction
  to try to allocate the page. The time-based mechanism is clumsy but a better
  option is not obvious.
 
 ick.
 

I know. I was being generous when I described it as clumsy.

 Wall time has sooo little relationship to what's happening in there. 
 If we *have* to use polling, cannot we clock the poll with some metric
 which is at least vaguely related to the amount of activity?

Initially I wanted to depend on just this

/* Clear pageblock skip if there are numerous alloc failures */
if (zone->compact_defer_shift == COMPACT_MAX_DEFER_SHIFT)
	reset_isolation_suitable(zone);

because this is at least related to activity but it's weak for two
reasons. One, it's depending on failures to make the decisions - i.e. when
it already is too late. Two, even this condition can be hit very quickly
and can result in many resets per second in the worst case.

 Number
 (or proportion) of pages scanned, for example?  Or reset everything on
 the Nth trip around the zone? 

For a full compaction failure we have scanned all pages in the zone so
there is no proportion to use there.

Resetting everything every Nth trip around the zone is similar to the
above check except it would look like

if (zone->compact_defer_shift == COMPACT_MAX_DEFER_SHIFT &&
    zone->compact_reset_laps == COMPACT_MAX_LAPS)
	reset_isolation_suitable(zone);

but it's weak for the same reasons - depending on failures to make decisions
and can happen too quickly.

I also considered using the PGFREE vmstat to reset if a pageblock's worth of
pages had been freed since the last reset but this happens very quickly
under memory pressure and would not throttle enough. I also considered
deferring until NR_FREE_PAGES was high enough but this would severely
impact allocation success rates under memory pressure.

 Or even a combination of one of these
 *and* of wall time, so the system will at least work harder when MM is
 under load.
 

We sort of do that now - we are depending on a number of failures and
time before clearing the bits.

 Also, what has to be done to avoid the polling altogether?  eg/ie, zap
 a pageblock's PB_migrate_skip synchronously, when something was done to
 that pageblock which justifies repolling it?
 

The "something" event you are looking for is pages being freed or
allocated in the page allocator. A movable page being allocated in the
block or a page being freed should clear the PB_migrate_skip bit if it's
set. Unfortunately this would impact the alloc and free fast paths of
the page allocator. I felt that that was too high a price to pay.

 
  ...
 
  +static void reset_isolation_suitable(struct zone *zone)
  +{
  +	unsigned long start_pfn = zone->zone_start_pfn;
  +	unsigned long end_pfn = zone->zone_start_pfn + zone->spanned_pages;
  +	unsigned long pfn;
  +
  +	/*
  +	 * Do not reset more than once every five seconds. If allocations are
  +	 * failing sufficiently quickly to allow this to happen then continually
  +	 * scanning for compaction is not going to help. The choice of five
  +	 * seconds is arbitrary but will 

Re: [Qemu-devel] [PATCH 8/9] mm: compaction: Cache if a pageblock was scanned and no pages were isolated

2012-09-24 Thread Andrew Morton
On Mon, 24 Sep 2012 10:39:38 +0100
Mel Gorman mgor...@suse.de wrote:

 On Fri, Sep 21, 2012 at 02:36:56PM -0700, Andrew Morton wrote:
 
  Also, what has to be done to avoid the polling altogether?  eg/ie, zap
  a pageblock's PB_migrate_skip synchronously, when something was done to
  that pageblock which justifies repolling it?
  
 
 The "something" event you are looking for is pages being freed or
 allocated in the page allocator. A movable page being allocated in the
 block or a page being freed should clear the PB_migrate_skip bit if it's
 set. Unfortunately this would impact the alloc and free fast paths of
 the page allocator. I felt that that was too high a price to pay.

We already do a similar thing in the page allocator: clearing of
->all_unreclaimable and ->pages_scanned.  But that isn't on the fast
path really - it happens once per pcp unload.  Can we do something
like that?  Drop some hint into the zone without having to visit each
page?

  
   ...
  
   +static void reset_isolation_suitable(struct zone *zone)
   +{
   +	unsigned long start_pfn = zone->zone_start_pfn;
   +	unsigned long end_pfn = zone->zone_start_pfn + zone->spanned_pages;
   +	unsigned long pfn;
   +
   +	/*
   +	 * Do not reset more than once every five seconds. If allocations are
   +	 * failing sufficiently quickly to allow this to happen then continually
   +	 * scanning for compaction is not going to help. The choice of five
   +	 * seconds is arbitrary but will mitigate excessive scanning.
   +	 */
   +	if (time_before(jiffies, zone->compact_blockskip_expire))
   +		return;
   +	zone->compact_blockskip_expire = jiffies + (HZ * 5);
   +
   +	/* Walk the zone and mark every pageblock as suitable for isolation */
   +	for (pfn = start_pfn; pfn < end_pfn; pfn += pageblock_nr_pages) {
   +		struct page *page;
   +		if (!pfn_valid(pfn))
   +			continue;
   +
   +		page = pfn_to_page(pfn);
   +		if (zone != page_zone(page))
   +			continue;
   +
   +		clear_pageblock_skip(page);
   +	}
  
  What's the worst-case loop count here?
  
 
 zone->spanned_pages >> pageblock_order

What's the worst-case value of (zone->spanned_pages >> pageblock_order) :)



[Qemu-devel] [PATCH 8/9] mm: compaction: Cache if a pageblock was scanned and no pages were isolated

2012-09-21 Thread Mel Gorman
When compaction was implemented it was known that scanning could potentially
be excessive. The ideal was that a counter be maintained for each pageblock
but maintaining this information would incur a severe penalty due to a
shared writable cache line. It has reached the point where the scanning
costs are a serious problem, particularly on long-lived systems where a
large process starts and allocates a large number of THPs at the same time.

Instead of using a shared counter, this patch adds another bit to the
pageblock flags called PG_migrate_skip. If a pageblock is scanned by
either migrate or free scanner and 0 pages were isolated, the pageblock
is marked to be skipped in the future. When scanning, this bit is checked
before any scanning takes place and the block skipped if set.

The main difficulty with a patch like this is deciding when to ignore the
cached information. If it's ignored too often, the scanning rates will still
be excessive. If the information is too stale then allocations will fail
that might have otherwise succeeded. In this patch

o CMA always ignores the information
o If the migrate and free scanner meet then the cached information will
  be discarded if it's at least 5 seconds since the last time the cache
  was discarded
o If there are a large number of allocation failures, discard the cache.

The time-based heuristic is very clumsy but there are few choices for a
better event. Depending solely on multiple allocation failures still allows
excessive scanning when THP allocations are failing in quick succession
due to memory pressure. Waiting until memory pressure is relieved would
cause compaction to continually fail instead of using reclaim/compaction
to try to allocate the page. The time-based mechanism is clumsy but a better
option is not obvious.

Signed-off-by: Mel Gorman mgor...@suse.de
Acked-by: Rik van Riel r...@redhat.com
---
 include/linux/mmzone.h  |3 ++
 include/linux/pageblock-flags.h |   19 +++-
 mm/compaction.c |   93 +--
 mm/internal.h   |1 +
 mm/page_alloc.c |1 +
 5 files changed, 111 insertions(+), 6 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 603d0b5..a456361 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -368,6 +368,9 @@ struct zone {
 */
spinlock_t  lock;
int all_unreclaimable; /* All pages pinned */
+#if defined CONFIG_COMPACTION || defined CONFIG_CMA
+   unsigned long   compact_blockskip_expire;
+#endif
 #ifdef CONFIG_MEMORY_HOTPLUG
/* see spanned/present_pages for more description */
seqlock_t   span_seqlock;
diff --git a/include/linux/pageblock-flags.h b/include/linux/pageblock-flags.h
index 19ef95d..eed27f4 100644
--- a/include/linux/pageblock-flags.h
+++ b/include/linux/pageblock-flags.h
@@ -30,6 +30,9 @@ enum pageblock_bits {
PB_migrate,
PB_migrate_end = PB_migrate + 3 - 1,
/* 3 bits required for migrate types */
+#ifdef CONFIG_COMPACTION
+   PB_migrate_skip,/* If set the block is skipped by compaction */
+#endif /* CONFIG_COMPACTION */
NR_PAGEBLOCK_BITS
 };
 
@@ -65,10 +68,22 @@ unsigned long get_pageblock_flags_group(struct page *page,
 void set_pageblock_flags_group(struct page *page, unsigned long flags,
int start_bitidx, int end_bitidx);
 
+#ifdef CONFIG_COMPACTION
+#define get_pageblock_skip(page) \
+   get_pageblock_flags_group(page, PB_migrate_skip, \
+   PB_migrate_skip + 1)
+#define clear_pageblock_skip(page) \
+   set_pageblock_flags_group(page, 0, PB_migrate_skip,  \
+   PB_migrate_skip + 1)
+#define set_pageblock_skip(page) \
+   set_pageblock_flags_group(page, 1, PB_migrate_skip,  \
+   PB_migrate_skip + 1)
+#endif /* CONFIG_COMPACTION */
+
 #define get_pageblock_flags(page) \
-   get_pageblock_flags_group(page, 0, NR_PAGEBLOCK_BITS-1)
+   get_pageblock_flags_group(page, 0, PB_migrate_end)
 #define set_pageblock_flags(page, flags) \
set_pageblock_flags_group(page, flags,  \
- 0, NR_PAGEBLOCK_BITS-1)
+ 0, PB_migrate_end)
 
 #endif /* PAGEBLOCK_FLAGS_H */
diff --git a/mm/compaction.c b/mm/compaction.c
index 9fc1b61..9276bc8 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -50,6 +50,64 @@ static inline bool migrate_async_suitable(int migratetype)
return is_migrate_cma(migratetype) || migratetype == MIGRATE_MOVABLE;
 }
 
+/* Returns true if the pageblock should be scanned for pages to isolate. */
+static inline bool 

Re: [Qemu-devel] [PATCH 8/9] mm: compaction: Cache if a pageblock was scanned and no pages were isolated

2012-09-21 Thread Rafael Aquini
On Fri, Sep 21, 2012 at 11:46:22AM +0100, Mel Gorman wrote:
 When compaction was implemented it was known that scanning could potentially
 be excessive. The ideal was that a counter be maintained for each pageblock
 but maintaining this information would incur a severe penalty due to a
 shared writable cache line. It has reached the point where the scanning
 costs are a serious problem, particularly on long-lived systems where a
 large process starts and allocates a large number of THPs at the same time.
 
 Instead of using a shared counter, this patch adds another bit to the
 pageblock flags called PG_migrate_skip. If a pageblock is scanned by
 either migrate or free scanner and 0 pages were isolated, the pageblock
 is marked to be skipped in the future. When scanning, this bit is checked
 before any scanning takes place and the block skipped if set.
 
 The main difficulty with a patch like this is deciding when to ignore the
 cached information. If it's ignored too often, the scanning rates will still
 be excessive. If the information is too stale then allocations will fail
 that might have otherwise succeeded. In this patch
 
 o CMA always ignores the information
 o If the migrate and free scanner meet then the cached information will
   be discarded if it's at least 5 seconds since the last time the cache
   was discarded
 o If there are a large number of allocation failures, discard the cache.
 
 The time-based heuristic is very clumsy but there are few choices for a
 better event. Depending solely on multiple allocation failures still allows
 excessive scanning when THP allocations are failing in quick succession
 due to memory pressure. Waiting until memory pressure is relieved would
 cause compaction to continually fail instead of using reclaim/compaction
 to try to allocate the page. The time-based mechanism is clumsy but a better
 option is not obvious.
 
 Signed-off-by: Mel Gorman mgor...@suse.de
 Acked-by: Rik van Riel r...@redhat.com
 ---

Acked-by: Rafael Aquini aqu...@redhat.com




Re: [Qemu-devel] [PATCH 8/9] mm: compaction: Cache if a pageblock was scanned and no pages were isolated

2012-09-21 Thread Andrew Morton
On Fri, 21 Sep 2012 11:46:22 +0100
Mel Gorman mgor...@suse.de wrote:

 When compaction was implemented it was known that scanning could potentially
 be excessive. The ideal was that a counter be maintained for each pageblock
 but maintaining this information would incur a severe penalty due to a
 shared writable cache line. It has reached the point where the scanning
 costs are a serious problem, particularly on long-lived systems where a
 large process starts and allocates a large number of THPs at the same time.
 
 Instead of using a shared counter, this patch adds another bit to the
 pageblock flags called PG_migrate_skip. If a pageblock is scanned by
 either migrate or free scanner and 0 pages were isolated, the pageblock
 is marked to be skipped in the future. When scanning, this bit is checked
 before any scanning takes place and the block skipped if set.
 
 The main difficulty with a patch like this is deciding when to ignore the
 cached information. If it's ignored too often, the scanning rates will still
 be excessive. If the information is too stale then allocations will fail
 that might have otherwise succeeded. In this patch
 
 o CMA always ignores the information
 o If the migrate and free scanner meet then the cached information will
   be discarded if it's at least 5 seconds since the last time the cache
   was discarded
 o If there are a large number of allocation failures, discard the cache.
 
 The time-based heuristic is very clumsy but there are few choices for a
 better event. Depending solely on multiple allocation failures still allows
 excessive scanning when THP allocations are failing in quick succession
 due to memory pressure. Waiting until memory pressure is relieved would
 cause compaction to continually fail instead of using reclaim/compaction
 to try to allocate the page. The time-based mechanism is clumsy but a better
 option is not obvious.

ick.

Wall time has sooo little relationship to what's happening in there. 
If we *have* to use polling, cannot we clock the poll with some metric
which is at least vaguely related to the amount of activity?  Number
(or proportion) of pages scanned, for example?  Or reset everything on
the Nth trip around the zone?  Or even a combination of one of these
*and* of wall time, so the system will at least work harder when MM is
under load.

Also, what has to be done to avoid the polling altogether?  eg/ie, zap
a pageblock's PB_migrate_skip synchronously, when something was done to
that pageblock which justifies repolling it?


 ...

 +static void reset_isolation_suitable(struct zone *zone)
 +{
 +	unsigned long start_pfn = zone->zone_start_pfn;
 +	unsigned long end_pfn = zone->zone_start_pfn + zone->spanned_pages;
 +	unsigned long pfn;
 +
 +	/*
 +	 * Do not reset more than once every five seconds. If allocations are
 +	 * failing sufficiently quickly to allow this to happen then continually
 +	 * scanning for compaction is not going to help. The choice of five
 +	 * seconds is arbitrary but will mitigate excessive scanning.
 +	 */
 +	if (time_before(jiffies, zone->compact_blockskip_expire))
 +		return;
 +	zone->compact_blockskip_expire = jiffies + (HZ * 5);
 +
 +	/* Walk the zone and mark every pageblock as suitable for isolation */
 +	for (pfn = start_pfn; pfn < end_pfn; pfn += pageblock_nr_pages) {
 +		struct page *page;
 +		if (!pfn_valid(pfn))
 +			continue;
 +
 +		page = pfn_to_page(pfn);
 +		if (zone != page_zone(page))
 +			continue;
 +
 +		clear_pageblock_skip(page);
 +	}

What's the worst-case loop count here?

 +}
 +

 ...