[PATCH 4/5] mm: Stall movable allocations until kswapd progresses during serious external fragmentation event

2018-11-07 Thread Mel Gorman
An event that potentially causes external fragmentation problems has
already been described but there are degrees of severity.  A "serious"
event is defined as one that steals a contiguous range of pages of an order
lower than fragment_stall_order (PAGE_ALLOC_COSTLY_ORDER by default). If a
movable allocation request that is allowed to sleep needs to steal a small
block then it schedules until kswapd makes progress or a timeout passes.
The watermarks are also boosted slightly faster so that kswapd makes
greater effort to reclaim enough pages to avoid the fragmentation event.
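
The stall condition described above can be modelled as a simple predicate.
This is only an illustrative sketch and not the patch's code: the Request
class, its fields and should_stall() are hypothetical names, and the only
kernel fact relied on is that PAGE_ALLOC_COSTLY_ORDER is 3.

```python
from dataclasses import dataclass

# PAGE_ALLOC_COSTLY_ORDER is 3 in the Linux kernel; the changelog names it
# as the default value of fragment_stall_order.
PAGE_ALLOC_COSTLY_ORDER = 3


@dataclass
class Request:
    """Hypothetical stand-in for an allocation request; not a kernel type."""
    movable: bool    # request is for movable pages
    can_sleep: bool  # request may block (False for atomic allocations)


def should_stall(req: Request, stolen_order: int,
                 fragment_stall_order: int = PAGE_ALLOC_COSTLY_ORDER) -> bool:
    """Return True if the request would wait for kswapd before stealing.

    stolen_order is the order of the contiguous block that would be stolen.
    """
    serious = stolen_order < fragment_stall_order
    return serious and req.movable and req.can_sleep


if __name__ == "__main__":
    print(should_stall(Request(movable=True, can_sleep=True), stolen_order=0))
    print(should_stall(Request(movable=True, can_sleep=False), stolen_order=0))
```

The model omits the timeout and the kswapd wakeup. It only captures which
requests are candidates for stalling.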

This stall is not guaranteed to avoid serious fragmentation events.
If memory pressure is high enough, the pages freed by kswapd may be
reallocated or the free pages may not be in pageblocks that contain
only movable pages. Furthermore an allocation request that cannot stall
(e.g. atomic allocations) or unmovable/reclaimable allocations will still
proceed without stalling.

The worst-case scenario for stalling is a combination of high memory
pressure, where kswapd is having trouble keeping free pages over the
pfmemalloc_reserve, and movable allocations that are fragmenting memory.
In this case, an allocation request may sleep for longer. There are both
vmstats to identify that stalls are happening and a tracepoint to quantify
the stall durations. Note that the granularity of the stall detection is
a jiffy so the delay accounting is not precise.

1-socket Skylake machine
config-global-dhp__workload_thpfioscale XFS (no special madvise)
4 fio threads, 1 THP allocating thread
--

4.20-rc1 extfrag events < order 9:  1023463
4.20-rc1+patch:                      358574 (65% reduction)
4.20-rc1+patch1-3:                    19274 (98% reduction)
4.20-rc1+patch1-4:                     1351 (99.9% reduction)
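
For reference, the reduction percentages follow directly from the event
counts. A minimal sketch, where reduction_pct() is a hypothetical helper
name and the counts are the ones quoted above:

```python
def reduction_pct(baseline: int, patched: int) -> float:
    """Percentage drop in extfrag events relative to the unpatched kernel."""
    return 100.0 * (baseline - patched) / baseline


baseline = 1023463  # 4.20-rc1 extfrag events
for label, events in [("patch", 358574), ("patch1-3", 19274), ("patch1-4", 1351)]:
    print(f"4.20-rc1+{label}: {reduction_pct(baseline, events):.1f}% reduction")
```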

                              4.20.0-rc1            4.20.0-rc1
                              boost-v2r4            stall-v2r6
Amean fault-base-1     659.85 (   0.00%)     648.66 *   1.70%*
Amean fault-huge-1     172.19 (   0.00%)     167.79 (   2.56%)

thpfioscale Percentage Faults Huge
                              4.20.0-rc1            4.20.0-rc1
                              boost-v2r4            stall-v2r6
Percentage huge-1        1.68 (   0.00%)       1.16 ( -30.69%)

Fragmentation events are now reduced to negligible levels.

The latencies and allocation success rates are roughly similar.  Over the
course of 16 minutes, there were 100 stalls due to fragmentation avoidance
with a total stall time of 0.4 seconds.

1-socket Skylake machine
global-dhp__workload_thpfioscale-madvhugepage-xfs (MADV_HUGEPAGE)
-

4.20-rc1 extfrag events < order 9:  342549
4.20-rc1+patch:                     337890 ( 1% reduction)
4.20-rc1+patch1-3:                   12801 (96% reduction)
4.20-rc1+patch1-4:                    1117 (99.7% reduction)

                              4.20.0-rc1            4.20.0-rc1
                              boost-v2r4            stall-v2r6
Amean fault-base-1    1578.91 (   0.00%)  43404.60 (-2649.02%)
Amean fault-huge-1    1090.23 (   0.00%)    1424.32 * -30.64%*

                              4.20.0-rc1            4.20.0-rc1
                              boost-v2r4            stall-v2r6
Percentage huge-1       82.59 (   0.00%)      99.92 (  20.97%)

The fragmentation events were reduced but the latencies went a bit crazy.
The "problem" is that the allocation success rates were very high and
forward progress was being made. This put the system under further pressure
and while compactions were succeeding, the latencies were high in cases
where compaction failed. The THP allocation vm stats are illustrative in this
case:

                      4.20.0-rc1  4.20.0-rc1
                      boost-v2r4  stall-v2r6
THP fault alloc             4974        6016
THP fault fallback          1048           5
THP collapse alloc            65          56
THP collapse fail              4           4
THP split                      0        3719
THP split failed               0         224

Note the THP fault alloc stats where they almost all succeeded relative
to the baseline. While the latencies are much higher, it is the case that
the application specifically requested THP while the system was under
heavy memory pressure.
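
The fault alloc and fallback counters can be cross-checked against the
"Percentage huge" figures. The sketch below takes the THP fault alloc
counts as 4974 (boost-v2r4) and 6016 (stall-v2r6) and assumes that the
success rate is alloc / (alloc + fallback). That formula is an assumption
about how the benchmark reports the figure, and thp_success_pct() is a
hypothetical name.

```python
def thp_success_pct(fault_alloc: int, fault_fallback: int) -> float:
    """Share of THP faults that got a huge page, as a percentage."""
    return 100.0 * fault_alloc / (fault_alloc + fault_fallback)


baseline = thp_success_pct(4974, 1048)  # boost-v2r4
stalled = thp_success_pct(6016, 5)      # stall-v2r6
print(f"boost-v2r4: {baseline:.2f}%  stall-v2r6: {stalled:.2f}%")
```

Under that assumption the results land within a rounding error of the
82.59% and 99.92% reported by thpfioscale.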

There were 314 stalls over the course of 16 minutes for a total stall
time of roughly 11 seconds. The distribution of stalls (count, then
duration in microseconds) is as follows

205 4000
  1 8000
  1 20000
  1 32000
  1 36000
  6 40000
  1 56000
 98 100000

This is showing that 98 of the stalls waited until the timeout expired
at 25 jiffies, which is 100000 microseconds on this particular configuration.
If this is considered problematic, the timeout can be reduced to trade off
fault times against fragmentation avoidance.
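
As a sanity check, the stall count and total stall time can be recomputed
from the distribution. The sketch below takes the durations as
microseconds, with the eight buckets read as 4000, 8000, 20000, 32000,
36000, 40000, 56000 and 100000. HZ=250 is an assumption inferred from the
smallest bucket (one jiffy = 4000 microseconds) and the variable names are
hypothetical.

```python
HZ = 250                          # assumption: one jiffy is 4000 microseconds
USEC_PER_JIFFY = 1_000_000 // HZ
STALL_TIMEOUT_JIFFIES = 25        # timeout quoted in the changelog

# (number of stalls, stall duration in microseconds)
distribution = [
    (205, 4000), (1, 8000), (1, 20000), (1, 32000),
    (1, 36000), (6, 40000), (1, 56000), (98, 100000),
]

total_stalls = sum(count for count, _ in distribution)
total_usec = sum(count * usec for count, usec in distribution)
timeout_usec = STALL_TIMEOUT_JIFFIES * USEC_PER_JIFFY

print(f"{total_stalls} stalls, {total_usec / 1e6:.1f} seconds in total")
print(f"timeout: {timeout_usec} microseconds")
```

With these values, the 98 stalls that hit the timeout account for 9.8 of
the roughly 11 seconds.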

2-socket Haswell machine

[PATCH 4/5] mm: Stall movable allocations until kswapd progresses during serious external fragmentation event

2018-10-31 Thread Mel Gorman
Events that cause external fragmentation have already been described. A
serious external fragmentation causing event is defined as one that steals
a contiguous range of pages of an order lower than fragment_stall_order
(PAGE_ALLOC_COSTLY_ORDER by default). If fragmentation would steal a
block smaller than this, this patch causes a movable allocation request
that is allowed to sleep to stall until kswapd makes progress. As kswapd has
just been woken due to a boosted watermark, it's expected to return quickly.

This stall is not guaranteed to avoid serious fragmentation causing events.
If memory pressure is high enough, the pages freed by kswapd may still
be used or they may not be in pageblocks that contain only movable
pages. Furthermore, an allocation request that cannot stall (e.g. atomic
allocations) or that is for unmovable/reclaimable pages will still proceed
without stalling.

1-socket Skylake machine
config-global-dhp__workload_thpfioscale XFS (no special madvise)
4 fio threads, 1 THP allocating thread
--

4.19 extfrag events < order 0:  71227
4.19+patch1:                    36456 (49% reduction)
4.19+patch1-3:                   4510 (94% reduction)
4.19+patch1-4:                    548 (99% reduction)

Fragmentation events reduced further. The latency and allocation rates
were similar so are not included for brevity.

1-socket Skylake machine
global-dhp__workload_thpfioscale-madvhugepage-xfs (MADV_HUGEPAGE)
-

4.19 extfrag events < order 0:  40761
4.19+patch1:                    36085 (11% reduction)
4.19+patch1-3:                   1887 (95% reduction)
4.19+patch1-4:                    394 (99% reduction)

thpfioscale Fault Latencies
                                  4.19.0                4.19.0
                              boost-v1r5            stall-v1r6
Amean fault-base-1    1863.70 (   0.00%)    3943.28 *-111.58%*
Amean fault-huge-1     776.07 (   0.00%)    2739.80 *-253.03%*

                                  4.19.0                4.19.0
                              boost-v1r5            stall-v1r6
Percentage huge-1       86.92 (   0.00%)      98.55 (  13.39%)

Similar to the first case, the reduction in fragmentation events
is notable. However, on this occasion the latencies are much higher
but the allocation success rate is also much higher at 98%. This is
a case where the increased success rate is causing pressure elsewhere
but the reduced external fragmentation events mean that compaction is
more effective. This is a classic trade-off between fault latency and
allocation success rate but, if problematic, the behaviour can be tuned.
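
The trade-off can be put in numbers using the 1-socket MADV_HUGEPAGE
figures quoted above. The sign convention in this sketch is inferred from
the tables (negative means worse than the baseline) and
relative_gain_pct() is a hypothetical helper name.

```python
def relative_gain_pct(baseline: float, new: float, lower_is_better: bool) -> float:
    """Percentage change relative to baseline, positive when `new` is better."""
    delta = (baseline - new) if lower_is_better else (new - baseline)
    return 100.0 * delta / baseline


# Fault latencies: lower is better, so a higher latency is a negative gain.
latency_base = relative_gain_pct(1863.70, 3943.28, lower_is_better=True)
latency_huge = relative_gain_pct(776.07, 2739.80, lower_is_better=True)
# THP allocation success rate: higher is better.
success = relative_gain_pct(86.92, 98.55, lower_is_better=False)

print(f"fault-base-1 {latency_base:.2f}%")
print(f"fault-huge-1 {latency_huge:.2f}%")
print(f"huge-1       {success:.2f}%")
```

Small differences in the last digit against the tables are expected because
the inputs here are already rounded.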

2-socket Haswell machine
config-global-dhp__workload_thpfioscale XFS (no special madvise)
4 fio threads, 5 THP allocating threads


4.19 extfrag events < order 0:  882868
4.19+patch1:                    476937 (46% reduction)
4.19+patch1-3:                   29044 (97% reduction)
4.19+patch1-4:                   29290 (97% reduction)

There is little impact on fragmentation causing events and the
latency and allocation rates were similar.

2-socket Haswell machine
global-dhp__workload_thpfioscale-madvhugepage-xfs (MADV_HUGEPAGE)
-

4.19 extfrag events < order 0:  803099
4.19+patch1:                    654671 (23% reduction)
4.19+patch1-3:                   24352 (97% reduction)
4.19+patch1-4:                   16698 (98% reduction)

thpfioscale Fault Latencies
                                  4.19.0                4.19.0
                              boost-v1r5            stall-v1r6
Amean fault-base-5    5935.74 (   0.00%)    8649.60 * -45.72%*
Amean fault-huge-5    2611.69 (   0.00%)    2799.82 (  -7.20%)

                                  4.19.0                4.19.0
                              boost-v1r5            stall-v1r6
Percentage huge-5       66.18 (   0.00%)      77.80 (  17.56%)

Similar to the 1-socket case, the fragmentation events are reduced
but the higher THP allocation success rates also impact the latencies
as compaction goes to work.

This patch does reduce fragmentation rates overall but it's not free as
some allocations can stall for short periods of time. While it's within
acceptable limits for the adverse test case, there may be other workloads
that cannot tolerate the stalls. Either it can be tuned to disable the
feature or, more ideally, the test case is made available for analysis
to see if the stall behaviour can be reduced while still limiting the
fragmentation events. On the flip-side, it has been checked that setting
the fragment_stall_order to 9 eliminated fragmentation events entirely
on the 1-socket machine and reduced them by 99.71% on the 2-socket machine.

Signed-off-by: Mel Gorman 
---
 Documentation/sysctl/vm.txt | 23 +++
 include/linux/mm.h  |  
