Re: [patch 2/2/4] mm: try to distribute dirty pages fairly across zones

2011-09-28 Thread Minchan Kim
On Wed, Sep 28, 2011 at 09:11:54AM +0200, Johannes Weiner wrote:
> On Wed, Sep 28, 2011 at 02:56:40PM +0900, Minchan Kim wrote:
> > On Fri, Sep 23, 2011 at 04:42:48PM +0200, Johannes Weiner wrote:
> > > The maximum number of dirty pages that exist in the system at any time
> > > is determined by a number of pages considered dirtyable and a
> > > user-configured percentage of those, or an absolute number in bytes.
> > 
> > It's an explanation of the old approach.
> 
> What do you mean?  This does not change with this patch.  We still
> have a number of dirtyable pages and a limit that is applied
> relative to this number.
> 
> > > This number of dirtyable pages is the sum of memory provided by all
> > > the zones in the system minus their lowmem reserves and high
> > > watermarks, so that the system can retain a healthy number of free
> > > pages without having to reclaim dirty pages.
> > 
> > It's an explanation of the new approach.
> 
> Same here, this aspect is also not changed with this patch!
> 
> > > But there is a flaw in that we have a zoned page allocator which does
> > > not care about the global state but rather the state of individual
> > > memory zones.  And right now there is nothing that prevents one zone
> > > from filling up with dirty pages while other zones are spared, which
> > > frequently leads to situations where kswapd, in order to restore the
> > > watermark of free pages, does indeed have to write pages from that
> > > zone's LRU list.  This can interfere so badly with IO from the flusher
> > > threads that major filesystems (btrfs, xfs, ext4) mostly ignore write
> > > requests from reclaim already, taking away the VM's only possibility
> > > to keep such a zone balanced, aside from hoping the flushers will soon
> > > clean pages from that zone.
> > 
> > It's an explanation of the old approach, again!
> > Shouldn't we move the above phrase about the new approach to below?
> 
> Everything above describes the current behaviour (as of this patch,
> so respecting lowmem_reserve, for example, is already part of the
> current behaviour) and its problems.  A description of how the patch
> tries to fix it follows below.

It seems that using the terms "old" and "new" was not a good choice.
Hannes, please ignore, it's not a biggie.

-- 
Kind regards,
Minchan Kim
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [patch 2/2/4] mm: try to distribute dirty pages fairly across zones

2011-09-28 Thread Mel Gorman
On Fri, Sep 23, 2011 at 04:42:48PM +0200, Johannes Weiner wrote:
> The maximum number of dirty pages that exist in the system at any time
> is determined by a number of pages considered dirtyable and a
> user-configured percentage of those, or an absolute number in bytes.
> 
> This number of dirtyable pages is the sum of memory provided by all
> the zones in the system minus their lowmem reserves and high
> watermarks, so that the system can retain a healthy number of free
> pages without having to reclaim dirty pages.
> 
> But there is a flaw in that we have a zoned page allocator which does
> not care about the global state but rather the state of individual
> memory zones.  And right now there is nothing that prevents one zone
> from filling up with dirty pages while other zones are spared, which
> frequently leads to situations where kswapd, in order to restore the
> watermark of free pages, does indeed have to write pages from that
> zone's LRU list.  This can interfere so badly with IO from the flusher
> threads that major filesystems (btrfs, xfs, ext4) mostly ignore write
> requests from reclaim already, taking away the VM's only possibility
> to keep such a zone balanced, aside from hoping the flushers will soon
> clean pages from that zone.
> 
> Enter per-zone dirty limits.  They are to a zone's dirtyable memory
> what the global limit is to the global amount of dirtyable memory, and
> try to make sure that no single zone receives more than its fair share
> of the globally allowed dirty pages in the first place.  As the number
> of pages considered dirtyable excludes the zones' lowmem reserves and
> high watermarks, the maximum number of dirty pages in a zone is such
> that the zone can always be balanced without requiring page cleaning.
> 
> As this is a placement decision in the page allocator and pages are
> dirtied only after the allocation, this patch allows allocators to
> pass __GFP_WRITE when they know in advance that the page will be
> written to and become dirty soon.  The page allocator will then
> attempt to allocate from the first zone of the zonelist - which on
> NUMA is determined by the task's NUMA memory policy - that has not
> exceeded its dirty limit.
> 
> At first glance, it would appear that the diversion to lower zones can
> increase pressure on them, but this is not the case.  With a full high
> zone, allocations will be diverted to lower zones eventually, so it is
> more of a shift in timing of the lower zone allocations.  Workloads
> that previously could fit their dirty pages completely in the higher
> zone may be forced to allocate from lower zones, but the number of
> pages that 'spill over' is itself limited by the lower zones' dirty
> constraints, and thus unlikely to become a problem.
> 
> For now, the problem of unfair dirty page distribution remains for
> NUMA configurations where the zones allowed for allocation are in sum
> not big enough to trigger the global dirty limits, wake up the flusher
> threads and remedy the situation.  Because of this, an allocation that
> could not succeed on any of the considered zones is allowed to ignore
> the dirty limits before going into direct reclaim or even failing the
> allocation, until a future patch changes the global dirty throttling
> and flusher thread activation so that they take individual zone states
> into account.
> 
> Signed-off-by: Johannes Weiner 

Acked-by: Mel Gorman 

-- 
Mel Gorman
SUSE Labs


Re: [patch 2/2/4] mm: try to distribute dirty pages fairly across zones

2011-09-28 Thread Johannes Weiner
On Wed, Sep 28, 2011 at 02:56:40PM +0900, Minchan Kim wrote:
> On Fri, Sep 23, 2011 at 04:42:48PM +0200, Johannes Weiner wrote:
> > The maximum number of dirty pages that exist in the system at any time
> > is determined by a number of pages considered dirtyable and a
> > user-configured percentage of those, or an absolute number in bytes.
> 
> It's an explanation of the old approach.

What do you mean?  This does not change with this patch.  We still
have a number of dirtyable pages and a limit that is applied
relative to this number.

> > This number of dirtyable pages is the sum of memory provided by all
> > the zones in the system minus their lowmem reserves and high
> > watermarks, so that the system can retain a healthy number of free
> > pages without having to reclaim dirty pages.
> 
> It's an explanation of the new approach.

Same here, this aspect is also not changed with this patch!

> > But there is a flaw in that we have a zoned page allocator which does
> > not care about the global state but rather the state of individual
> > memory zones.  And right now there is nothing that prevents one zone
> > from filling up with dirty pages while other zones are spared, which
> > frequently leads to situations where kswapd, in order to restore the
> > watermark of free pages, does indeed have to write pages from that
> > zone's LRU list.  This can interfere so badly with IO from the flusher
> > threads that major filesystems (btrfs, xfs, ext4) mostly ignore write
> > requests from reclaim already, taking away the VM's only possibility
> > to keep such a zone balanced, aside from hoping the flushers will soon
> > clean pages from that zone.
> 
> It's an explanation of the old approach, again!
> Shouldn't we move the above phrase about the new approach to below?

Everything above describes the current behaviour (as of this patch,
so respecting lowmem_reserve, for example, is already part of the
current behaviour) and its problems.  A description of how the patch
tries to fix it follows below.

> > Enter per-zone dirty limits.  They are to a zone's dirtyable memory
> > what the global limit is to the global amount of dirtyable memory, and
> > try to make sure that no single zone receives more than its fair share
> > of the globally allowed dirty pages in the first place.  As the number
> > > of pages considered dirtyable excludes the zones' lowmem reserves and
> > high watermarks, the maximum number of dirty pages in a zone is such
> > that the zone can always be balanced without requiring page cleaning.
> > 
> > As this is a placement decision in the page allocator and pages are
> > dirtied only after the allocation, this patch allows allocators to
> > pass __GFP_WRITE when they know in advance that the page will be
> > written to and become dirty soon.  The page allocator will then
> > attempt to allocate from the first zone of the zonelist - which on
> > NUMA is determined by the task's NUMA memory policy - that has not
> > exceeded its dirty limit.
> > 
> > At first glance, it would appear that the diversion to lower zones can
> > increase pressure on them, but this is not the case.  With a full high
> > zone, allocations will be diverted to lower zones eventually, so it is
> > more of a shift in timing of the lower zone allocations.  Workloads
> > that previously could fit their dirty pages completely in the higher
> > > zone may be forced to allocate from lower zones, but the number of
> > > pages that 'spill over' is itself limited by the lower zones' dirty
> > > constraints, and thus unlikely to become a problem.
> 
> That's a good justification.
> 
> > For now, the problem of unfair dirty page distribution remains for
> > NUMA configurations where the zones allowed for allocation are in sum
> > not big enough to trigger the global dirty limits, wake up the flusher
> > threads and remedy the situation.  Because of this, an allocation that
> > could not succeed on any of the considered zones is allowed to ignore
> > the dirty limits before going into direct reclaim or even failing the
> > allocation, until a future patch changes the global dirty throttling
> > and flusher thread activation so that they take individual zone states
> > into account.
> > 
> > Signed-off-by: Johannes Weiner 
> 
> Otherwise, looks good to me.
> Reviewed-by: Minchan Kim 

Thanks!


Re: [patch 2/2/4] mm: try to distribute dirty pages fairly across zones

2011-09-27 Thread Minchan Kim
On Fri, Sep 23, 2011 at 04:42:48PM +0200, Johannes Weiner wrote:
> The maximum number of dirty pages that exist in the system at any time
> is determined by a number of pages considered dirtyable and a
> user-configured percentage of those, or an absolute number in bytes.

It's an explanation of the old approach.

> 
> This number of dirtyable pages is the sum of memory provided by all
> the zones in the system minus their lowmem reserves and high
> watermarks, so that the system can retain a healthy number of free
> pages without having to reclaim dirty pages.

It's an explanation of the new approach.

> 
> But there is a flaw in that we have a zoned page allocator which does
> not care about the global state but rather the state of individual
> memory zones.  And right now there is nothing that prevents one zone
> from filling up with dirty pages while other zones are spared, which
> frequently leads to situations where kswapd, in order to restore the
> watermark of free pages, does indeed have to write pages from that
> zone's LRU list.  This can interfere so badly with IO from the flusher
> threads that major filesystems (btrfs, xfs, ext4) mostly ignore write
> requests from reclaim already, taking away the VM's only possibility
> to keep such a zone balanced, aside from hoping the flushers will soon
> clean pages from that zone.

It's an explanation of the old approach, again!
Shouldn't we move the above phrase about the new approach to below?

> 
> Enter per-zone dirty limits.  They are to a zone's dirtyable memory
> what the global limit is to the global amount of dirtyable memory, and
> try to make sure that no single zone receives more than its fair share
> of the globally allowed dirty pages in the first place.  As the number
> of pages considered dirtyable excludes the zones' lowmem reserves and
> high watermarks, the maximum number of dirty pages in a zone is such
> that the zone can always be balanced without requiring page cleaning.
> 
> As this is a placement decision in the page allocator and pages are
> dirtied only after the allocation, this patch allows allocators to
> pass __GFP_WRITE when they know in advance that the page will be
> written to and become dirty soon.  The page allocator will then
> attempt to allocate from the first zone of the zonelist - which on
> NUMA is determined by the task's NUMA memory policy - that has not
> exceeded its dirty limit.
> 
> At first glance, it would appear that the diversion to lower zones can
> increase pressure on them, but this is not the case.  With a full high
> zone, allocations will be diverted to lower zones eventually, so it is
> more of a shift in timing of the lower zone allocations.  Workloads
> that previously could fit their dirty pages completely in the higher
> zone may be forced to allocate from lower zones, but the number of
> pages that 'spill over' is itself limited by the lower zones' dirty
> constraints, and thus unlikely to become a problem.

That's a good justification.

> 
> For now, the problem of unfair dirty page distribution remains for
> NUMA configurations where the zones allowed for allocation are in sum
> not big enough to trigger the global dirty limits, wake up the flusher
> threads and remedy the situation.  Because of this, an allocation that
> could not succeed on any of the considered zones is allowed to ignore
> the dirty limits before going into direct reclaim or even failing the
> allocation, until a future patch changes the global dirty throttling
> and flusher thread activation so that they take individual zone states
> into account.
> 
> Signed-off-by: Johannes Weiner 

Otherwise, looks good to me.
Reviewed-by: Minchan Kim 

-- 
Kind regards,
Minchan Kim


[patch 2/2/4] mm: try to distribute dirty pages fairly across zones

2011-09-23 Thread Johannes Weiner
The maximum number of dirty pages that exist in the system at any time
is determined by a number of pages considered dirtyable and a
user-configured percentage of those, or an absolute number in bytes.

This number of dirtyable pages is the sum of memory provided by all
the zones in the system minus their lowmem reserves and high
watermarks, so that the system can retain a healthy number of free
pages without having to reclaim dirty pages.
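
As a rough illustration of the accounting described above, here is a
userspace sketch.  It is not the kernel code: struct zone_model, its
fields and the function names are made up for this example, and only
the percentage form of the limit is shown.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for per-zone state; not struct zone. */
struct zone_model {
	unsigned long pages;          /* memory the zone provides, in pages */
	unsigned long lowmem_reserve; /* pages held back as lowmem reserve */
	unsigned long high_wmark;     /* high watermark, in pages */
};

/* Dirtyable pages of one zone: its memory minus reserve and watermark. */
static unsigned long zone_dirtyable(const struct zone_model *z)
{
	unsigned long reserved = z->lowmem_reserve + z->high_wmark;

	/* Clamp at zero so a small zone cannot underflow. */
	return z->pages > reserved ? z->pages - reserved : 0;
}

/* Global dirtyable memory: the sum over all zones. */
static unsigned long global_dirtyable(const struct zone_model *zones, size_t nr)
{
	unsigned long sum = 0;
	size_t i;

	assert(zones != NULL || nr == 0);
	for (i = 0; i < nr; i++)
		sum += zone_dirtyable(&zones[i]);
	return sum;
}

/* Global dirty limit when configured as a percentage (0..100). */
static unsigned long dirty_limit_pages(unsigned long dirtyable, unsigned int ratio)
{
	assert(ratio <= 100);
	return dirtyable * ratio / 100;
}
```

For example, two zones of 1000 and 10 pages with reserves of 100+50
and 20+5 pages yield 850 dirtyable pages in total, and a ratio of 20
then allows 170 dirty pages.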

But there is a flaw in that we have a zoned page allocator which does
not care about the global state but rather the state of individual
memory zones.  And right now there is nothing that prevents one zone
from filling up with dirty pages while other zones are spared, which
frequently leads to situations where kswapd, in order to restore the
watermark of free pages, does indeed have to write pages from that
zone's LRU list.  This can interfere so badly with IO from the flusher
threads that major filesystems (btrfs, xfs, ext4) mostly ignore write
requests from reclaim already, taking away the VM's only possibility
to keep such a zone balanced, aside from hoping the flushers will soon
clean pages from that zone.

Enter per-zone dirty limits.  They are to a zone's dirtyable memory
what the global limit is to the global amount of dirtyable memory, and
try to make sure that no single zone receives more than its fair share
of the globally allowed dirty pages in the first place.  As the number
of pages considered dirtyable excludes the zones' lowmem reserves and
high watermarks, the maximum number of dirty pages in a zone is such
that the zone can always be balanced without requiring page cleaning.
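
The per-zone check could be modelled roughly like this.  It is a
simplified sketch under stated assumptions: the limit is taken as a
plain percentage of the zone's dirtyable pages, and struct zone_model
and the *_model names are hypothetical, not the functions added by
this patch.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-zone counters, all in pages. */
struct zone_model {
	unsigned long dirtyable; /* zone memory minus lowmem reserve and high watermark */
	unsigned long nr_dirty;  /* pages currently dirty in this zone */
};

/* The zone's share of the dirty limit: same ratio, applied per zone. */
static unsigned long zone_dirty_limit_model(const struct zone_model *z,
					    unsigned int dirty_ratio)
{
	assert(dirty_ratio <= 100);
	return z->dirtyable * dirty_ratio / 100;
}

/* True while the zone has not exceeded its own dirty limit. */
static bool zone_dirty_ok_model(const struct zone_model *z,
				unsigned int dirty_ratio)
{
	return z->nr_dirty <= zone_dirty_limit_model(z, dirty_ratio);
}
```

Because the reserve and the high watermark are already excluded from
the dirtyable count, a zone that stays within this limit still has
enough clean or free pages to be balanced.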

As this is a placement decision in the page allocator and pages are
dirtied only after the allocation, this patch allows allocators to
pass __GFP_WRITE when they know in advance that the page will be
written to and become dirty soon.  The page allocator will then
attempt to allocate from the first zone of the zonelist - which on
NUMA is determined by the task's NUMA memory policy - that has not
exceeded its dirty limit.
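
The placement decision can be sketched as a walk over the zonelist.
This is an illustrative model only: GFP_WRITE_MODEL stands in for
__GFP_WRITE, and struct zone_model and pick_zone() are hypothetical
names, not the allocator's real interfaces.

```c
#include <assert.h>
#include <stddef.h>

#define GFP_WRITE_MODEL 0x1u /* hypothetical stand-in for __GFP_WRITE */

/* Hypothetical per-zone counters, all in pages. */
struct zone_model {
	const char *name;
	unsigned long nr_free;
	unsigned long nr_dirty;
	unsigned long dirty_limit;
};

/*
 * Return the first zone in zonelist order that has free pages and,
 * for allocations flagged as soon-to-be-dirty, has not exceeded its
 * dirty limit.  Returns NULL if no zone qualifies.
 */
static const struct zone_model *
pick_zone(const struct zone_model *zonelist, size_t nr, unsigned int gfp)
{
	size_t i;

	assert(zonelist != NULL || nr == 0);
	for (i = 0; i < nr; i++) {
		const struct zone_model *z = &zonelist[i];

		/* Only writers are diverted away from zones over their limit. */
		if ((gfp & GFP_WRITE_MODEL) && z->nr_dirty > z->dirty_limit)
			continue;
		if (z->nr_free > 0)
			return z;
	}
	return NULL;
}
```

Allocations without the flag keep their usual zone preference; only
pages expected to become dirty are steered to a later zone.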

At first glance, it would appear that the diversion to lower zones can
increase pressure on them, but this is not the case.  With a full high
zone, allocations will be diverted to lower zones eventually, so it is
more of a shift in timing of the lower zone allocations.  Workloads
that previously could fit their dirty pages completely in the higher
zone may be forced to allocate from lower zones, but the number of
pages that 'spill over' is itself limited by the lower zones' dirty
constraints, and thus unlikely to become a problem.

For now, the problem of unfair dirty page distribution remains for
NUMA configurations where the zones allowed for allocation are in sum
not big enough to trigger the global dirty limits, wake up the flusher
threads and remedy the situation.  Because of this, an allocation that
could not succeed on any of the considered zones is allowed to ignore
the dirty limits before going into direct reclaim or even failing the
allocation, until a future patch changes the global dirty throttling
and flusher thread activation so that they take individual zone states
into account.
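
The fallback described above amounts to a second pass that drops the
dirty limit check.  A minimal sketch, assuming a simplified model in
which struct zone_model, scan_zonelist() and pick_zone_for_write()
are all hypothetical names:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-zone counters, all in pages. */
struct zone_model {
	unsigned long nr_free;
	unsigned long nr_dirty;
	unsigned long dirty_limit;
};

/* One pass over the zonelist, optionally skipping zones over their limit. */
static struct zone_model *
scan_zonelist(struct zone_model *zonelist, size_t nr, bool honor_dirty_limits)
{
	size_t i;

	for (i = 0; i < nr; i++) {
		struct zone_model *z = &zonelist[i];

		if (honor_dirty_limits && z->nr_dirty > z->dirty_limit)
			continue;
		if (z->nr_free > 0)
			return z;
	}
	return NULL;
}

/*
 * First try to respect the per-zone dirty limits.  If no allowed zone
 * qualifies, retry without them.  A NULL result means the caller would
 * have to fall back to direct reclaim or fail the allocation.
 */
static struct zone_model *
pick_zone_for_write(struct zone_model *zonelist, size_t nr)
{
	struct zone_model *z;

	assert(zonelist != NULL || nr == 0);

	z = scan_zonelist(zonelist, nr, true);
	if (!z)
		z = scan_zonelist(zonelist, nr, false);
	return z;
}
```

The second pass keeps a task that is restricted to a few small zones
from entering reclaim merely because those zones are over their dirty
limits.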

Signed-off-by: Johannes Weiner 
---
 include/linux/gfp.h       |    4 ++-
 include/linux/writeback.h |    1 +
 mm/page-writeback.c       |   83 +
 mm/page_alloc.c           |   29 
 4 files changed, 116 insertions(+), 1 deletions(-)

diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 3a76faf..50efc7e 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -36,6 +36,7 @@ struct vm_area_struct;
 #endif
 #define ___GFP_NO_KSWAPD   0x400000u
 #define ___GFP_OTHER_NODE  0x800000u
+#define ___GFP_WRITE       0x1000000u
 
 /*
  * GFP bitmasks..
@@ -85,6 +86,7 @@ struct vm_area_struct;
 
 #define __GFP_NO_KSWAPD ((__force gfp_t)___GFP_NO_KSWAPD)
 #define __GFP_OTHER_NODE ((__force gfp_t)___GFP_OTHER_NODE) /* On behalf of other node */
+#define __GFP_WRITE ((__force gfp_t)___GFP_WRITE) /* Allocator intends to dirty page */
 
 /*
  * This may seem redundant, but it's a way of annotating false positives vs.
@@ -92,7 +94,7 @@ struct vm_area_struct;
  */
 #define __GFP_NOTRACK_FALSE_POSITIVE (__GFP_NOTRACK)
 
-#define __GFP_BITS_SHIFT 24 /* Room for N __GFP_FOO bits */
+#define __GFP_BITS_SHIFT 25 /* Room for N __GFP_FOO bits */
 #define __GFP_BITS_MASK ((__force gfp_t)((1 << __GFP_BITS_SHIFT) - 1))
 
 /* This equals 0, but use constants in case they ever change */
diff --git a/include/linux/writeback.h b/include/linux/writeback.h
index a5f495f..c96ee0c 100644
--- a/include/linux/writeback.h
+++ b/include/linux/writeback.h
@@ -104,6 +104,7 @@ void laptop_mode_timer_fn(unsigned long data);
 static inline void laptop_sync_completion(void) { }
 #endif
 void throttle_vm_writeout(gfp_t gfp_mask);
+bool zone_dirty_ok(struct zone *zone);
 
 extern unsigned long global_dirty_limit;
 
d