Re: [PATCH] mm: make allocation counters per-order

2017-07-16 Thread Roman Gushchin
On Thu, Jul 06, 2017 at 04:47:05PM +0100, Mel Gorman wrote:
> On Thu, Jul 06, 2017 at 03:46:34PM +0100, Roman Gushchin wrote:
> > > The alloc counter updates are themselves a surprisingly heavy cost to
> > > the allocation path and this makes it worse for a debugging case that is
> > > relatively rare. I'm extremely reluctant for such a patch to be added
> > > given that the tracepoints can be used to assemble such a monitor even
> > > if it means running a userspace daemon to keep track of it. Would such a
> > > solution be suitable? Failing that if this is a severe issue, would it be
> > > possible to at least make this a compile-time or static tracepoint option?
> > > That way, only people that really need it have to take the penalty.
> > 
> > I've tried to measure the difference with my patch applied and without
> > any accounting at all (__count_alloc_event() redefined to an empty 
> > function),
> > and I wasn't able to find any measurable difference.
> > Can you please provide more details on what your scenario looked like
> > when the alloc counters were costly?
> > 
> 
> At the time I used a page allocator microbenchmark from mmtests to call
> the allocator directly without zeroing pages. Triggering allocations from
> userspace generally masks the overhead behind the zeroing costs. It's just a few
> cycles but given the budget for the page allocator in some circumstances
> is tiny, it was noticeable. perf was used to examine the cost.
> 
> > As the new counters replace an old one, and both are per-cpu counters,
> > I believe that the difference should be really small.
> > 
> 
> Minimally you add a new branch and a small number of computations. It's
> small but it's there. The cache footprint of the counters is also increased.
> That is hard to take given that it's overhead for everybody on the off-chance
> it can debug something.
> 
> It's not a strong objection and I won't nak it on this basis but given
> that the same information can be easily obtained using tracepoints
> (optionally lower overhead with systemtap), the information is rarely
> going to be useful (no latency information for example) and there is an
> increased maintenance cost then it does not seem to be that useful.

I ran the page allocator microbenchmark on raw 4.12 and on 4.12 + the change
on my hardware.

Here are the results (raw 4.12 and patched 4.12 correspondingly):
order  0 batch  1 alloc 1942 free 1041   order  0 batch  1 alloc  822 free  451
order  0 batch  1 alloc  596 free  306   order  0 batch  1 alloc  493 free  290
order  0 batch  1 alloc  417 free  245   order  0 batch  1 alloc  526 free  286
order  0 batch  1 alloc  419 free  243   order  0 batch  1 alloc  435 free  255
order  0 batch  1 alloc  411 free  243   order  0 batch  1 alloc  423 free  240
order  0 batch  1 alloc  417 free  241   order  0 batch  1 alloc  406 free  239
order  0 batch  1 alloc  376 free  225   order  0 batch  1 alloc  383 free  219
order  0 batch  1 alloc  416 free  222   order  0 batch  1 alloc  355 free  205
order  0 batch  1 alloc  312 free  183   order  0 batch  1 alloc  438 free  216
order  0 batch  1 alloc  315 free  181   order  0 batch  1 alloc  347 free  194
order  0 batch  1 alloc  305 free  181   order  0 batch  1 alloc  317 free  185
order  0 batch  1 alloc  307 free  179   order  0 batch  1 alloc  329 free  191
order  0 batch  1 alloc  308 free  178   order  0 batch  1 alloc  335 free  192
order  0 batch  1 alloc  314 free  180   order  0 batch  1 alloc  350 free  190
order  0 batch  1 alloc  301 free  180   order  0 batch  1 alloc  319 free  184
order  0 batch  1 alloc 1807 free 1002   order  0 batch  1 alloc  813 free  459
order  0 batch  1 alloc  633 free  302   order  0 batch  1 alloc  500 free  287
order  0 batch  1 alloc  331 free  194   order  0 batch  1 alloc  609 free  300
order  0 batch  1 alloc  332 free  194   order  0 batch  1 alloc  443 free  255
order  0 batch  1 alloc  330 free  194   order  0 batch  1 alloc  410 free  239
order  0 batch  1 alloc  331 free  194   order  0 batch  1 alloc  383 free  222
order  0 batch  1 alloc  386 free  214   order  0 batch  1 alloc  372 free  212
order  0 batch  1 alloc  370 free  212   order  0 batch  1 alloc  342 free  203
order  0 batch  1 alloc  360 free  208   order  0 batch  1 alloc  428 free  216
order  0 batch  1 alloc  324 free  186   order  0 batch  1 alloc  350 free  195
order  0 batch  1 alloc  298 free  179   order  0 batch  1 alloc  320 free  186
order  0 batch  1 alloc  293 free  173   order  0 batch  1 alloc  323 free  187
order  0 batch  1 alloc  296 free  173   order  0 batch  1 alloc  320 free  188
order  0 batch  1 alloc  294 free  173   order  0 batch  1 alloc  321 free  186
order  0 batch  1 alloc  312 free  174

Re: [PATCH] mm: make allocation counters per-order

2017-07-16 Thread Roman Gushchin
From 36c6f4d293469569ca8b53b89ab8eebd358a5fa5 Mon Sep 17 00:00:00 2001
From: Roman Gushchin 
Date: Mon, 3 Jul 2017 19:02:49 +0100
Subject: [v2] mm: make allocation counters per-order

High-order allocations are obviously more costly, and it's very useful
to know how many of them happen if there are any issues
(or suspicions) with memory fragmentation.

This commit changes existing per-zone allocation counters to be
per-zone per-order. These counters are displayed using a new
procfs interface (similar to /proc/buddyinfo):

$ cat /proc/allocinfo
 DMA  0  0  0  0  0 \
   0  0  0  0  0  0
   DMA32  3  0  1  0  0 \
   0  0  0  0  0  0
  Normal4997056  23594  10902  23686931 \
  23122786 17  1  0
 Movable  0  0  0  0  0 \
   0  0  0  0  0  0
  Device  0  0  0  0  0 \
   0  0  0  0  0  0

The existing vmstat interface remains untouched*, and still shows
the total number of single page allocations, so high-order allocations
are represented as a corresponding number of order-0 allocations.

$ cat /proc/vmstat | grep alloc
pgalloc_dma 0
pgalloc_dma32 7
pgalloc_normal 5461660
pgalloc_movable 0
pgalloc_device 0

* I've added device zone for consistency with other zones,
and to avoid messy exclusion of this zone in the code.
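
To make the folding concrete, here is a minimal userspace sketch (assuming
MAX_ORDER = 11, matching the eleven per-order columns above). The DMA32 row
above has three order-0 and one order-2 allocations, which fold to
3*1 + 1*4 = 7 pages, matching pgalloc_dma32 in the vmstat output:

#include <stdio.h>

#define MAX_ORDER 11

/* an order-N allocation accounts for 2^N order-0 pages */
static unsigned long fold_to_order0(const unsigned long *counts)
{
	unsigned long pages = 0;
	int order;

	for (order = 0; order < MAX_ORDER; order++)
		pages += counts[order] << order;
	return pages;
}

int main(void)
{
	/* the DMA32 row from the sample /proc/allocinfo above */
	unsigned long dma32[MAX_ORDER] = { 3, 0, 1 };

	printf("pgalloc_dma32 %lu\n", fold_to_order0(dma32));	/* prints 7 */
	return 0;
}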

v2:
The functionality can be enabled/disabled by the PER_ORDER_ALLOC_COUNTERS
config option.

Signed-off-by: Roman Gushchin 
Suggested-by: Johannes Weiner 
Cc: Debabrata Banerjee 
Cc: Andrew Morton 
Cc: Mel Gorman 
Cc: Johannes Weiner 
Cc: Michal Hocko 
Cc: Vladimir Davydov 
Cc: Rik van Riel 
Cc: kernel-t...@fb.com
Cc: linux...@kvack.org
Cc: linux-kernel@vger.kernel.org
---
 arch/s390/appldata/appldata_mem.c |  16 +
 include/linux/mmzone.h|   2 +
 include/linux/vm_event_item.h |  27 ++--
 include/linux/vmstat.h|  20 ++
 init/Kconfig  |   9 +++
 mm/page_alloc.c   |  11 +++-
 mm/vmstat.c   | 128 +++---
 7 files changed, 199 insertions(+), 14 deletions(-)

diff --git a/arch/s390/appldata/appldata_mem.c b/arch/s390/appldata/appldata_mem.c
index 598df57..79679d3 100644
--- a/arch/s390/appldata/appldata_mem.c
+++ b/arch/s390/appldata/appldata_mem.c
@@ -66,6 +66,21 @@ struct appldata_mem_data {
 
 } __packed;
 
+#ifdef CONFIG_PER_ORDER_ALLOC_COUNTERS
+static inline void sum_pgalloc_events(u64 *pgalloc, unsigned long *ev)
+{
+   int order;
+
+   for (order = 1; order < MAX_ORDER; ++order) {
+   *pgalloc += ev[PGALLOC_NORMAL + order * MAX_NR_ZONES] << order;
+   *pgalloc += ev[PGALLOC_DMA + order * MAX_NR_ZONES] << order;
+   }
+}
+#else
+static inline void sum_pgalloc_events(u64 *pgalloc, unsigned long *ev)
+{
+}
+#endif
 
 /*
  * appldata_get_mem_data()
@@ -92,6 +107,7 @@ static void appldata_get_mem_data(void *data)
mem_data->pswpout= ev[PSWPOUT];
mem_data->pgalloc= ev[PGALLOC_NORMAL];
mem_data->pgalloc+= ev[PGALLOC_DMA];
+   sum_pgalloc_events(&mem_data->pgalloc, ev);
mem_data->pgfault= ev[PGFAULT];
mem_data->pgmajfault = ev[PGMAJFAULT];
 
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index fc14b8b..406dfc4 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -66,6 +66,8 @@ enum migratetype {
 /* In mm/page_alloc.c; keep in sync also with show_migration_types() there */
 extern char * const migratetype_names[MIGRATE_TYPES];
 
+extern const char *zone_name(int idx);
+
 #ifdef CONFIG_CMA
 #  define is_migrate_cma(migratetype) unlikely((migratetype) == MIGRATE_CMA)
#  define is_migrate_cma_page(_page) (get_pageblock_migratetype(_page) == MIGRATE_CMA)
diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
index 37e8d31..da94618 100644
--- a/include/linux/vm_event_item.h
+++ b/include/linux/vm_event_item.h
@@ -19,12 +19,31 @@
 #define HIGHMEM_ZONE(xx)
 #endif
 
-#define FOR_ALL_ZONES(xx) DMA_ZONE(xx) DMA32_ZONE(xx) xx##_NORMAL, HIGHMEM_ZONE(xx) xx##_MOVABLE
+#ifdef CONFIG_ZONE_DEVICE
+#define DEVICE_ZONE(xx) xx##__DEVICE,
+#else
+#define DEVICE_ZONE(xx)
+#endif
+
+#define FOR_ALL_ZONES(xx) DMA_ZONE(xx) DMA32_ZONE(xx) xx##_NORMAL, HIGHMEM_ZONE(xx) xx##_MOVABLE, DEVICE_ZONE(xx)
+
+#ifdef CONFIG_PER_ORDER_ALLOC_COUNTERS
+#define PGALLOC_EVENTS_SIZE (MAX_NR_ZONES * MAX_ORDER)
+#define PGALLOC_EVENTS_CUT_SIZE (MAX_NR_ZONES * (MAX_ORDER - 1))
+#define PGALLOC_FIRST_ZONE (PGALLOC_NORMAL - ZONE_NORMAL)
+#else
+#define PGALLOC_EVENTS_SIZE MAX_NR_ZONES
+#define PGALLOC_EVENTS_CUT_SIZE 0
+#define PGALLOC_FIRST_ZONE (PGALLOC_NORMAL - ZONE_NORMAL)
+#endif
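
(For illustration: the macros above lay the counters out order-major, one
MAX_NR_ZONES-sized block per order, so the event for a given (zone, order)
pair sits at PGALLOC_FIRST_ZONE + zone + order * MAX_NR_ZONES, the same
indexing the s390 hunk above uses. A tiny sketch with a made-up
PGALLOC_FIRST_ZONE value:)

#include <stdio.h>

#define MAX_NR_ZONES 6
#define PGALLOC_FIRST_ZONE 4	/* hypothetical; depends on preceding events */

/* event index for a (zone, order) pair under the layout above */
static int pgalloc_event(int zone, int order)
{
	return PGALLOC_FIRST_ZONE + zone + order * MAX_NR_ZONES;
}

int main(void)
{
	printf("zone 2, order 0 -> event %d\n", pgalloc_event(2, 0));
	printf("zone 2, order 3 -> event %d\n", pgalloc_event(2, 3));
	return 0;
}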

Re: [PATCH] mm: make allocation counters per-order

2017-07-06 Thread kbuild test robot
Hi Roman,

[auto build test WARNING on mmotm/master]
[also build test WARNING on next-20170706]
[cannot apply to v4.12]
[if your patch is applied to the wrong git tree, please drop us a note to help 
improve the system]

url:
https://github.com/0day-ci/linux/commits/Roman-Gushchin/mm-make-allocation-counters-per-order/20170707-091630
base:   git://git.cmpxchg.org/linux-mmotm.git master
config: s390-default_defconfig (attached as .config)
compiler: s390x-linux-gnu-gcc (Debian 6.1.1-9) 6.1.1 20160705
reproduce:
wget https://raw.githubusercontent.com/01org/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
chmod +x ~/bin/make.cross
# save the attached .config to linux build tree
make.cross ARCH=s390 

All warnings (new ones prefixed by >>):

   mm/memcontrol.c: In function 'memory_stat_show':
>> mm/memcontrol.c:5262:1: warning: the frame size of 1136 bytes is larger than 
>> 1024 bytes [-Wframe-larger-than=]
}
^

vim +5262 mm/memcontrol.c

f10eb7534 Roman Gushchin  2017-06-30  5246  		   events[PGSCAN_DIRECT]);
f10eb7534 Roman Gushchin  2017-06-30  5247  	seq_printf(m, "pgsteal %lu\n", events[PGSTEAL_KSWAPD] +
f10eb7534 Roman Gushchin  2017-06-30  5248  		   events[PGSTEAL_DIRECT]);
f10eb7534 Roman Gushchin  2017-06-30  5249  	seq_printf(m, "pgactivate %lu\n", events[PGACTIVATE]);
f10eb7534 Roman Gushchin  2017-06-30  5250  	seq_printf(m, "pgdeactivate %lu\n", events[PGDEACTIVATE]);
f10eb7534 Roman Gushchin  2017-06-30  5251  	seq_printf(m, "pglazyfree %lu\n", events[PGLAZYFREE]);
f10eb7534 Roman Gushchin  2017-06-30  5252  	seq_printf(m, "pglazyfreed %lu\n", events[PGLAZYFREED]);
f10eb7534 Roman Gushchin  2017-06-30  5253  
2a2e48854 Johannes Weiner 2017-05-03  5254  	seq_printf(m, "workingset_refault %lu\n",
71cd31135 Johannes Weiner 2017-05-03  5255  		   stat[WORKINGSET_REFAULT]);
2a2e48854 Johannes Weiner 2017-05-03  5256  	seq_printf(m, "workingset_activate %lu\n",
71cd31135 Johannes Weiner 2017-05-03  5257  		   stat[WORKINGSET_ACTIVATE]);
2a2e48854 Johannes Weiner 2017-05-03  5258  	seq_printf(m, "workingset_nodereclaim %lu\n",
71cd31135 Johannes Weiner 2017-05-03  5259  		   stat[WORKINGSET_NODERECLAIM]);
2a2e48854 Johannes Weiner 2017-05-03  5260  
587d9f726 Johannes Weiner 2016-01-20  5261  	return 0;
587d9f726 Johannes Weiner 2016-01-20 @5262  }
587d9f726 Johannes Weiner 2016-01-20  5263  
241994ed8 Johannes Weiner 2015-02-11  5264  static struct cftype memory_files[] = {
241994ed8 Johannes Weiner 2015-02-11  5265  	{
241994ed8 Johannes Weiner 2015-02-11  5266  		.name = "current",
f5fc3c5d8 Johannes Weiner 2015-11-05  5267  		.flags = CFTYPE_NOT_ON_ROOT,
241994ed8 Johannes Weiner 2015-02-11  5268  		.read_u64 = memory_current_read,
241994ed8 Johannes Weiner 2015-02-11  5269  	},
241994ed8 Johannes Weiner 2015-02-11  5270  	{

:: The code at line 5262 was first introduced by commit
:: 587d9f726aaec52157e4156e50363dbe6cb82bdb mm: memcontrol: basic memory 
statistics in cgroup2 memory controller

:: TO: Johannes Weiner 
:: CC: Linus Torvalds 

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all   Intel Corporation


.config.gz
Description: application/gzip


Re: [PATCH] mm: make allocation counters per-order

2017-07-06 Thread kbuild test robot
Hi Roman,

[auto build test ERROR on mmotm/master]
[also build test ERROR on next-20170706]
[cannot apply to v4.12]
[if your patch is applied to the wrong git tree, please drop us a note to help 
improve the system]

url:
https://github.com/0day-ci/linux/commits/Roman-Gushchin/mm-make-allocation-counters-per-order/20170707-091630
base:   git://git.cmpxchg.org/linux-mmotm.git master
config: x86_64-randconfig-x019-201727 (attached as .config)
compiler: gcc-6 (Debian 6.2.0-3) 6.2.0 20160901
reproduce:
# save the attached .config to linux build tree
make ARCH=x86_64 

All errors (new ones prefixed by >>):

   mm/vmstat.c: In function 'allocinfo_show':
>> mm/vmstat.c:1513:2: error: implicit declaration of function 'sum_vm_events' 
>> [-Werror=implicit-function-declaration]
 sum_vm_events(allocs, PGALLOC_FIRST_ZONE, PGALLOC_EVENTS_SIZE);
 ^
   At top level:
   mm/vmstat.c:1491:13: warning: 'sum_alloc_events' defined but not used 
[-Wunused-function]
static void sum_alloc_events(unsigned long *v)
^~~~

Re: [PATCH] mm: make allocation counters per-order

2017-07-06 Thread Mel Gorman
On Thu, Jul 06, 2017 at 02:00:00PM -0400, Debabrata Banerjee wrote:
> On Thu, Jul 6, 2017 at 1:16 PM, Mel Gorman wrote:
> >
> > I'm still struggling to see how counters help when an agent that monitors
> > for high CPU usage could be activated
> >
> 
> I suspect Roman has the same problem set as us, the CPU usage is
> either always high, high and service critical likely when something
> interesting is happening. We'd like to collect data on 200k machines,
> and study the results statistically and with respect to time based on
> kernel versions, build configs, hardware types, process types, load
> patterns, etc, etc. Even finding good candidate machines and at the
> right time of day to manually debug with ftrace is problematic.
> Granted we could be utilizing existing counters like compact_fail
> better. Ultimately the data either leads to dealing with certain bad
> actors, different vm tunings, or patches to mm.

Same issue as described in the other mail. The number of high-order
allocations that happened in the past or even the recent past does not
give you useful information for debugging high-order allocation stalls
or fragmentation-related issues. If the high-order allocations are
steady then two machines running similar workloads can both have similar
allocation counts but only one of them may be experiencing high latency.
Similarly, with high CPU usage, it may be due to compaction or a whole
variety of other factors. Even doing a statistical analysis is not going
to be enough unless all the relevant variables are accounted for and the
raw allocation count in isolation is one of the weakest variables to
draw conclusions from.

Correlating allocstall with compaction activity from just /proc/vmstat gives
a much better hint as to whether high CPU activity is due to high-order
allocations. Combining it with top will indicate whether it's direct or
indirect costs. If it really is high-order allocations then ftrace to
identify the source of the high-order allocations becomes relevant and if
it's due to fragmentation, it's a case of tracing the allocator itself to
determine why the fragmentation occurred.

The proc file with allocation counts is such a tiny part of debugging this
class of problem that it's almost irrelevant, which is why I think it
should be behind Kconfig at an absolute minimum. If you want to activate
it across production machines then by all means go ahead and if so, I'd
be very interested in hearing what class of problem could be debugged and
either tuned or fixed without needing ftrace to gather more information. I
say "almost irrelevant" because technically, correlating high allocation
counts with a kernel version change may be a relevant factor if a kernel
introduced a new source of high-order allocations but I suspect that's
the exception. It would be much more interesting to correlate increased
latency with a kernel version because it's much more relevant. You may
be able to correlate high allocation counts with particular hardware
(particularly network hardware that cannot scatter/gather) *but* the
same proc file will not tell you if those increased requests are
actually a problem, so the usefulness is diminished.

I'm not saying that fragmentation and high-order allocation stalls are not a
problem because they can be, but the proc file is unlikely to help and even
an extremely basic systemtap script would give you the same information,
work on much older kernels and with a trivial amount of additional work
it can gather latency information as well as counts.

-- 
Mel Gorman
SUSE Labs


Re: [PATCH] mm: make allocation counters per-order

2017-07-06 Thread Debabrata Banerjee
On Thu, Jul 6, 2017 at 1:16 PM, Mel Gorman  wrote:
>
> I'm still struggling to see how counters help when an agent that monitors
> for high CPU usage could be activated
>

I suspect Roman has the same problem set as us, the CPU usage is
either always high, high and service critical likely when something
interesting is happening. We'd like to collect data on 200k machines,
and study the results statistically and with respect to time based on
kernel versions, build configs, hardware types, process types, load
patterns, etc, etc. Even finding good candidate machines and at the
right time of day to manually debug with ftrace is problematic.
Granted we could be utilizing existing counters like compact_fail
better. Ultimately the data either leads to dealing with certain bad
actors, different vm tunings, or patches to mm.


Re: [PATCH] mm: make allocation counters per-order

2017-07-06 Thread Mel Gorman
On Thu, Jul 06, 2017 at 05:43:04PM +0100, Roman Gushchin wrote:
> > At the time I used a page allocator microbenchmark from mmtests to call
> > the allocator directly without zeroing pages. Triggering allocations from
> > userspace generally masks the overhead behind the zeroing costs. It's just a few
> > cycles but given the budget for the page allocator in some circumstances
> > is tiny, it was noticeable. perf was used to examine the cost.
> 
> I'll try to measure the difference with mmtests.
> 
> I agree that it's not a feature worth a significant performance penalty,
> but if the cost is small even in a specialized benchmark, I'd say it's acceptable.
> 

Note that even if you keep the cycle overhead down, the CPU cache footprint
for such a large increase remains. That will be permanent and unfixable which
is why I would like a Kconfig option at the very least for the vast majority
of people that have no intention or ability to debug such a situation.

> > > As the new counters replace an old one, and both are per-cpu counters,
> > > I believe that the difference should be really small.
> > > 
> > 
> > Minimally you add a new branch and a small number of computations. It's
> > small but it's there. The cache footprint of the counters is also increased.
> > That is hard to take given that it's overhead for everybody on the 
> > off-chance
> > it can debug something.
> > 
> > It's not a strong objection and I won't nak it on this basis but given
> > that the same information can be easily obtained using tracepoints
> > (optionally lower overhead with systemtap), the information is rarely
> > going to be useful (no latency information for example) and there is an
> > increased maintenance cost then it does not seem to be that useful.
> 
> Tracepoints are good for investigations on one machine, but not so convenient
> when gathering stats from a fleet under production load.
> Unfortunately, some memory fragmentation issues are hard to reproduce on
> a single dev machine.
> 

Sure, but just knowing that some high-order allocations occurred in the
past doesn't help either.

> > Maybe it would be slightly more convincing if there was an example of
> > real problems in the field that can be debugged with this. For high-order
> > allocations, I previously found that it was the latency that was of the
> > most concern and not the absolute count that happened since the system
> > started.
> 
> We met an issue with compaction consuming too much CPU under some specific
> conditions, and one of the suspicions was a significant number of high-order
> allocations, requested by some third-party device drivers.
> 

Even if this was the suspicion, you would have to activate monitoring on
the machine under load at the time the problem is occurring to determine if
the high-order allocations are currently happening or happened in the past.
If you are continually logging this data then logging allocation stalls for
high-order allocations would give you similar information. If you have to
activate a monitor anyway (or an agent that monitors for high CPU usage),
then it might as well be ftrace based as well as anything else. Even a
basic systemtap script would be able to capture only stack traces for
allocation requests that take longer than a threshold to limit the amount
of data recorded. Even *if* you had these counters running on your grid,
they will tell you nothing about how long those allocations take or whether
compaction is involved, and that is what is key to begin debugging the issue.

A basic monitor of /proc/vmstat for the compact_* counters can be used as
an indication of excessive time spent in compaction, although you're back to
ftrace to quantify how much of a problem it is in terms of time. For example,
rapidly increasing compact_fail combined with rapidly increasing
compact_migrate_scanned and compact_free_scanned will tell you that
compaction is active and failing with a comparison of the ratio of
compact_fail to compact_success telling you if it's persistent or slow
progress. You'd need top to see whether it's the compaction daemon or
ordinary processes consuming all the CPU. If it's the daemon then that
points you in the direction of what potentially needs fixing. If it's
processes then there is a greater problem and ftrace needs to be used to
establish *what* is doing the high-allocation requests and whether they
can be reduced somehow or whether it's a general fragmentation problem
(in which case your day takes a turn for the worse).
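
(For what it's worth, the /proc/vmstat monitor described above fits in a few
lines of C; a rough sketch, with an arbitrary 10-second sampling interval:)

#include <stdio.h>
#include <string.h>
#include <unistd.h>

static const char *keys[] = {
	"compact_migrate_scanned", "compact_free_scanned",
	"compact_stall", "compact_fail", "compact_success",
};
#define NKEYS (sizeof(keys) / sizeof(keys[0]))

/* pull the compact_* counters out of /proc/vmstat */
static void sample(unsigned long *v)
{
	char name[64];
	unsigned long val;
	size_t i;
	FILE *f = fopen("/proc/vmstat", "r");

	if (!f)
		return;
	while (fscanf(f, "%63s %lu", name, &val) == 2)
		for (i = 0; i < NKEYS; i++)
			if (!strcmp(name, keys[i]))
				v[i] = val;
	fclose(f);
}

int main(void)
{
	unsigned long prev[NKEYS] = { 0 }, cur[NKEYS] = { 0 };
	size_t i;

	sample(prev);
	for (;;) {
		sleep(10);
		sample(cur);
		/* rapidly rising compact_fail plus scanner activity is the red flag */
		for (i = 0; i < NKEYS; i++)
			printf("%s/s %lu\n", keys[i], (cur[i] - prev[i]) / 10);
		memcpy(prev, cur, sizeof(prev));
	}
	return 0;
}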

What I'm trying to say is that, by itself, a high-order allocation count
doesn't help debug this class of problem as much as you'd think.
Hopefully the above information is more useful to you in helping debug
what's actually wrong.

> Knowing the number of allocations is especially helpful for comparing
> different kernel versions in such a case, as it's hard to distinguish whether
> changes in mm, changes in these drivers, or just workload/environment changes
> led to increased or decreased fragmentation.

Re: [PATCH] mm: make allocation counters per-order

2017-07-06 Thread Roman Gushchin
On Thu, Jul 06, 2017 at 04:47:05PM +0100, Mel Gorman wrote:
> On Thu, Jul 06, 2017 at 03:46:34PM +0100, Roman Gushchin wrote:
> > > The alloc counter updates are themselves a surprisingly heavy cost to
> > > the allocation path and this makes it worse for a debugging case that is
> > > relatively rare. I'm extremely reluctant for such a patch to be added
> > > given that the tracepoints can be used to assemble such a monitor even
> > > if it means running a userspace daemon to keep track of it. Would such a
> > > solution be suitable? Failing that if this is a severe issue, would it be
> > > possible to at least make this a compile-time or static tracepoint option?
> > > That way, only people that really need it have to take the penalty.
> > 
> > I've tried to measure the difference with my patch applied and without
> > any accounting at all (__count_alloc_event() redefined to an empty 
> > function),
> > and I wasn't able to find any measurable difference.
> > Can you please provide more details on what your scenario looked like
> > when the alloc counters were costly?
> > 
> 
> At the time I used a page allocator microbenchmark from mmtests to call
> the allocator directly without zeroing pages. Triggering allocations from
> userspace generally masks the overhead behind the zeroing costs. It's just a few
> cycles but given the budget for the page allocator in some circumstances
> is tiny, it was noticeable. perf was used to examine the cost.

I'll try to measure the difference with mmtests.

I agree that it's not a feature worth a significant performance penalty,
but if the cost is small even in a specialized benchmark, I'd say it's acceptable.

> > As the new counters replace an old one, and both are per-cpu counters,
> > I believe that the difference should be really small.
> > 
> 
> Minimally you add a new branch and a small number of computations. It's
> small but it's there. The cache footprint of the counters is also increased.
> That is hard to take given that it's overhead for everybody on the off-chance
> it can debug something.
> 
> It's not a strong objection and I won't nak it on this basis but given
> that the same information can be easily obtained using tracepoints
> (optionally lower overhead with systemtap), the information is rarely
> going to be useful (no latency information for example) and there is an
> increased maintenance cost then it does not seem to be that useful.

Tracepoints are good for investigations on one machine, but not so convenient
when gathering stats from a fleet under production load.
Unfortunately, some memory fragmentation issues are hard to reproduce on
a single dev machine.

> Maybe it would be slightly more convincing if there was an example of
> real problems in the field that can be debugged with this. For high-order
> allocations, I previously found that it was the latency that was of the
> most concern and not the absolute count that happened since the system
> started.

We met an issue with compaction consuming too much CPU under some specific
conditions, and one of the suspicions was a significant number of high-order
allocations, requested by some third-party device drivers.

Knowing the number of allocations is especially helpful for comparing
different kernel versions in such a case, as it's hard to distinguish whether
changes in mm, changes in these drivers, or just workload/environment changes
led to increased or decreased fragmentation.

Roman


Re: [PATCH] mm: make allocation counters per-order

2017-07-06 Thread Mel Gorman
On Thu, Jul 06, 2017 at 12:12:47PM -0400, Debabrata Banerjee wrote:
> On Thu, Jul 6, 2017 at 11:51 AM, Mel Gorman wrote:
> >
> > These counters do not actually help you solve that particular problem.
> > Knowing how many allocations happened since the system booted doesn't tell
> > you much about how many failed or why they failed. You don't even know
> > what frequency they occurred at unless you monitor it constantly so you're
> > back to square one whether this information is available from proc or not.
> > There even is a tracepoint that can be used to track information related
> > to events that degrade fragmentation (trace_mm_page_alloc_extfrag) although
> > the primary thing it tells you is that "the probability that an allocation
> > will fail due to fragmentation in the future is potentially higher".
> 
> I agree these counters don't have enough information, but they're a
> start toward a first-order approximation of the current state of memory.

That incurs a universal cost on the off-chance of debugging, and ultimately
the debugging is only useful in combination with developing kernel patches,
in which case it could be behind a Kconfig option.

> buddyinfo and pagetypeinfo basically show no information now, because

They can be used to calculate a fragmentation index at a given point in
time. Admittedly, building a bigger picture requires a full scan of memory
(and that's what was required when fragmentation avoidance was first
being implemented).
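
(For reference, a userspace sketch of the calculation mm/vmstat.c's
__fragmentation_index() performs on those free-list counts: 0 means an
allocation would fail for lack of memory, 1000 that it would fail due to
fragmentation, and a negative value that it would succeed. The sample row
is hypothetical:)

#include <stdio.h>

#define MAX_ORDER 11

static int fragmentation_index(int order, const unsigned long *free_count)
{
	unsigned long requested = 1UL << order;
	unsigned long long free_pages = 0;
	unsigned long free_blocks_total = 0, free_blocks_suitable = 0;
	int o;

	for (o = 0; o < MAX_ORDER; o++) {
		free_pages += (unsigned long long)free_count[o] << o;
		free_blocks_total += free_count[o];
		if (o >= order)
			free_blocks_suitable += free_count[o];
	}
	if (!free_blocks_total)
		return 0;
	/* the index only makes sense when the request would fail */
	if (free_blocks_suitable)
		return -1000;
	return 1000 - (int)((1000 + free_pages * 1000 / requested) /
			    free_blocks_total);
}

int main(void)
{
	/* hypothetical buddyinfo row: lots of order-0 pages, nothing above */
	unsigned long row[MAX_ORDER] = { 4096 };

	/* prints 875: plenty of memory, but badly fragmented for order-3 */
	printf("order-3 fragmentation index: %d\n",
	       fragmentation_index(3, row));
	return 0;
}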

> they only involve the small amount of free memory under the watermark
> and all our machines are in this state. As a second-order approximation,
> it would be nice to be able to get answers like "there are reclaimable
> high-order allocations of at least this order" and "none of the allocations
> of this order can become available due to unmovable and unreclaimable
> allocations".

Which this patch doesn't provide as what you are looking for requires
a full scan of memory to determine. I've done it in the past using a
severe abuse of systemtap to load a module that scans all of memory with
a variation of PAGE_OWNER to identify stack traces of pages that "don't
belong" within a pageblock.

Even *with* that information, your options for tuning an unmodified kernel
are basically limited to increasing min_free_kbytes, altering THP's level
of aggression when compacting or brute forcing with either drop_caches,
compact_node or both. All other options after that require kernel patches
-- altering annotations, altering fallback mechanisms, altering compaction,
improving support for pages that can be migrated etc.

-- 
Mel Gorman
SUSE Labs


Re: [PATCH] mm: make allocation counters per-order

2017-07-06 Thread Debabrata Banerjee
On Thu, Jul 6, 2017 at 11:51 AM, Mel Gorman  wrote:
>
> These counters do not actually help you solve that particular problem.
> Knowing how many allocations happened since the system booted doesn't tell
> you much about how many failed or why they failed. You don't even know
> what frequency they occurred at unless you monitor it constantly so you're
> back to square one whether this information is available from proc or not.
> There even is a tracepoint that can be used to track information related
> to events that degrade fragmentation (trace_mm_page_alloc_extfrag) although
> the primary thing it tells you is that "the probability that an allocation
> will fail due to fragmentation in the future is potentially higher".

I agree these counters don't have enough information, but they're a
start toward a first-order approximation of the current state of memory.
buddyinfo and pagetypeinfo basically show no information now, because
they only involve the small amount of free memory under the watermark
and all our machines are in this state. As a second-order approximation,
it would be nice to be able to get answers like "there are reclaimable
high-order allocations of at least this order" and "none of the allocations
of this order can become available due to unmovable and unreclaimable
allocations".


Re: [PATCH] mm: make allocation counters per-order

2017-07-06 Thread Mel Gorman
On Thu, Jul 06, 2017 at 10:54:24AM -0400, Debabrata Banerjee wrote:
> On Thu, Jul 6, 2017 at 9:19 AM, Mel Gorman wrote:
> 
> > The alloc counter updates are themselves a surprisingly heavy cost to
> > the allocation path and this makes it worse for a debugging case that is
> > relatively rare. I'm extremely reluctant for such a patch to be added
> > given that the tracepoints can be used to assemble such a monitor even
> > if it means running a userspace daemon to keep track of it. Would such a
> > solution be suitable? Failing that if this is a severe issue, would it be
> > possible to at least make this a compile-time or static tracepoint option?
> > That way, only people that really need it have to take the penalty.
> >
> > --
> > Mel Gorman
> 
> We (Akamai) have been struggling with memory fragmentation issues for
> years, and especially the inability to track positive or negative
> changes to fragmentation between allocator changes and kernels without
> simply looking for how many allocations are failing. We've had someone
> toying with trying to report the same data via scanning all pages at
> report time versus keeping running stats, although we don't have
> working code yet. If it did work it would avoid the runtime overhead.
> I don't believe tracepoints are a workable solution for us, since we
> would have to be collecting the data from boot, as well as continually
> processing the data in userspace at high cost. Ultimately the
> locations and other properties (merge-ability) of the allocations in
> the buddy groups are also important, which would be interesting to add
> on-top of Roman's patch.

These counters do not actually help you solve that particular problem.
Knowing how many allocations happened since the system booted doesn't tell
you much about how many failed or why they failed. You don't even know
what frequency they occurred at unless you monitor it constantly so you're
back to square one whether this information is available from proc or not.
There even is a tracepoint that can be used to track information related
to events that degrade fragmentation (trace_mm_page_alloc_extfrag) although
the primary thing it tells you is that "the probability that an allocation
will fail due to fragmentation in the future is potentially higher".

-- 
Mel Gorman
SUSE Labs


Re: [PATCH] mm: make allocation counters per-order

2017-07-06 Thread Mel Gorman
On Thu, Jul 06, 2017 at 03:46:34PM +0100, Roman Gushchin wrote:
> > The alloc counter updates are themselves a surprisingly heavy cost to
> > the allocation path and this makes it worse for a debugging case that is
> > relatively rare. I'm extremely reluctant for such a patch to be added
> > given that the tracepoints can be used to assemble such a monitor even
> > if it means running a userspace daemon to keep track of it. Would such a
> > solution be suitable? Failing that if this is a severe issue, would it be
> > possible to at least make this a compile-time or static tracepoint option?
> > That way, only people that really need it have to take the penalty.
> 
> I've tried to measure the difference with my patch applied and without
> any accounting at all (__count_alloc_event() redefined to an empty function),
> and I wasn't able to find any measurable difference.
> Can you please provide more details on what your scenario looked like
> when the alloc counters were costly?
> 

At the time I used a page allocator microbenchmark from mmtests to call
the allocator directly without zeroing pages. Triggering allocations from
userspace generally masks the overhead behind the zeroing costs. It's just a few
cycles but given the budget for the page allocator in some circumstances
is tiny, it was noticeable. perf was used to examine the cost.

> As the new counters replace an old one, and both are per-cpu counters,
> I believe that the difference should be really small.
> 

Minimally you add a new branch and a small number of computations. It's
small but it's there. The cache footprint of the counters is also increased.
That is hard to take given that it's overhead for everybody on the off-chance
it can debug something.

It's not a strong objection and I won't nak it on this basis but given
that the same information can be easily obtained using tracepoints
(optionally lower overhead with systemtap), the information is rarely
going to be useful (no latency information for example) and there is an
increased maintenance cost then it does not seem to be that useful.

Maybe it would be slightly more convincing if there was an example of
real problems in the field that can be debugged with this. For high-order
allocations, I previously found that it was the latency that was of the
most concern and not the absolute count that happened since the system
started. Granted, the same criticism could be leveled at the existing
alloc counters but at least by correlating that value with allocstall,
you can determine what percentage of allocations stalled recently and
optionally ftrace at that point to figure out why. The same steps would
indicate then if it's only high-order allocations that stall, add stack
tracing to figure out where they are coming from and go from there. Even if
the per-order counters exist, all the other debugging steps are necessary
so I'm struggling to see how I would use them properly.

-- 
Mel Gorman
SUSE Labs


Re: [PATCH] mm: make allocation counters per-order

2017-07-06 Thread Christoph Lameter
On Thu, 6 Jul 2017, Roman Gushchin wrote:

> +#define PGALLOC_EVENTS_SIZE (MAX_NR_ZONES * MAX_ORDER)
> +#define PGALLOC_EVENTS_CUT_SIZE (MAX_NR_ZONES * (MAX_ORDER - 1))
> +#define PGALLOC_FIRST_ZONE (PGALLOC_NORMAL - ZONE_NORMAL)


You are significantly increasing the per cpu counters (ZONES *
MAX_ORDER * cpus!!!). This will increase the cache footprint of critical
functions significantly and thus lead to regressions.

Typically counters for zones are placed in the zone structures but
you would also significantly increase the per zone counters ...
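
(To put rough numbers on that: assuming 6 zones with ZONE_DEVICE, MAX_ORDER
of 11, 8-byte counters and a 64-CPU machine -- exact figures depend on the
config -- the per-cpu event array grows by an order of magnitude:)

#include <stdio.h>

int main(void)
{
	const int zones = 6, max_order = 11, cpus = 64;
	const int bytes = 8;	/* sizeof(unsigned long) on 64-bit */

	/* before: one pgalloc counter per zone, per cpu */
	printf("before: %d bytes/cpu, %d bytes total\n",
	       zones * bytes, zones * bytes * cpus);
	/* after: one counter per (zone, order) pair, per cpu */
	printf("after:  %d bytes/cpu, %d bytes total\n",
	       zones * max_order * bytes, zones * max_order * bytes * cpus);
	return 0;
}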



Re: [PATCH] mm: make allocation counters per-order

2017-07-06 Thread Debabrata Banerjee
On Thu, Jul 6, 2017 at 9:19 AM, Mel Gorman  wrote:

> The alloc counter updates are themselves a surprisingly heavy cost to
> the allocation path and this makes it worse for a debugging case that is
> relatively rare. I'm extremely reluctant for such a patch to be added
> given that the tracepoints can be used to assemble such a monitor even
> if it means running a userspace daemon to keep track of it. Would such a
> solution be suitable? Failing that if this is a severe issue, would it be
> possible to at least make this a compile-time or static tracepoint option?
> That way, only people that really need it have to take the penalty.
>
> --
> Mel Gorman

We (Akamai) have been struggling with memory fragmentation issues for
years, and especially the inability to track positive or negative
changes to fragmentation between allocator changes and kernels without
simply looking for how many allocations are failing. We've had someone
toying with trying to report the same data via scanning all pages at
report time versus keeping running stats, although we don't have
working code yet. If it did work it would avoid the runtime overhead.
I don't believe tracepoints are a workable solution for us, since we
would have to be collecting the data from boot, as well as continually
processing the data in userspace at high cost. Ultimately the
locations and other properties (merge-ability) of the allocations in
the buddy groups are also important, which would be interesting to add
on-top of Roman's patch.


Re: [PATCH] mm: make allocation counters per-order

2017-07-06 Thread Roman Gushchin
On Thu, Jul 06, 2017 at 02:19:41PM +0100, Mel Gorman wrote:
> On Thu, Jul 06, 2017 at 02:04:31PM +0100, Roman Gushchin wrote:
> > High-order allocations are obviously more costly, and it's very useful
> > to know how many of them happen if there are any issues
> > (or suspicions) with memory fragmentation.
> > 
> > This commit changes existing per-zone allocation counters to be
> > per-zone per-order. These counters are displayed using a new
> > procfs interface (similar to /proc/buddyinfo):
> > 
> > $ cat /proc/allocinfo
> >  DMA  0  0  0  0  0 \
> >0  0  0  0  0  0
> >DMA32  3  0  1  0  0 \
> >0  0  0  0  0  0
> >   Normal4997056  23594  10902  23686931 \
> >   23122786 17  1  0
> >  Movable  0  0  0  0  0 \
> >0  0  0  0  0  0
> >   Device  0  0  0  0  0 \
> >0  0  0  0  0  0
> > 
> > The existing vmstat interface remains untouched*, and still shows
> > the total number of single page allocations, so high-order allocations
> > are represented as a corresponding number of order-0 allocations.
> > 
> > $ cat /proc/vmstat | grep alloc
> > pgalloc_dma 0
> > pgalloc_dma32 7
> > pgalloc_normal 5461660
> > pgalloc_movable 0
> > pgalloc_device 0
> > 
> > * I've added device zone for consistency with other zones,
> > and to avoid messy exclusion of this zone in the code.
> > 
> 
> The alloc counter updates are themselves a surprisingly heavy cost to
> the allocation path and this makes it worse for a debugging case that is
> relatively rare. I'm extremely reluctant for such a patch to be added
> given that the tracepoints can be used to assemble such a monitor even
> if it means running a userspace daemon to keep track of it. Would such a
> solution be suitable? Failing that if this is a severe issue, would it be
> possible to at least make this a compile-time or static tracepoint option?
> That way, only people that really need it have to take the penalty.

I've tried to measure the difference with my patch applied and without
any accounting at all (__count_alloc_event() redefined to an empty function),
and I wasn't able to find any measurable difference.
Can you please provide more details on what your scenario looked like
when the alloc counters were costly?

As the new counters replace an old one, and both are per-cpu counters,
I believe that the difference should be really small.

If there is a case where the difference is meaningful,
I'll of course make the whole thing a compile-time option.

Thank you!


Re: [PATCH] mm: make allocation counters per-order

2017-07-06 Thread Mel Gorman
On Thu, Jul 06, 2017 at 02:04:31PM +0100, Roman Gushchin wrote:
> High-order allocations are obviously more costly, and it's very useful
> to know how many of them happen if there are any issues
> (or suspicions) with memory fragmentation.
> 
> This commit changes existing per-zone allocation counters to be
> per-zone per-order. These counters are displayed using a new
> procfs interface (similar to /proc/buddyinfo):
> 
> $ cat /proc/allocinfo
>  DMA  0  0  0  0  0 \
>0  0  0  0  0  0
>DMA32  3  0  1  0  0 \
>0  0  0  0  0  0
>   Normal4997056  23594  10902  23686931 \
>   23122786 17  1  0
>  Movable  0  0  0  0  0 \
>0  0  0  0  0  0
>   Device  0  0  0  0  0 \
>0  0  0  0  0  0
> 
> The existing vmstat interface remains untouched*, and still shows
> the total number of single page allocations, so high-order allocations
> are represented as a corresponding number of order-0 allocations.
> 
> $ cat /proc/vmstat | grep alloc
> pgalloc_dma 0
> pgalloc_dma32 7
> pgalloc_normal 5461660
> pgalloc_movable 0
> pgalloc_device 0
> 
> * I've added device zone for consistency with other zones,
> and to avoid messy exclusion of this zone in the code.
> 

The alloc counter updates are themselves a surprisingly heavy cost to
the allocation path and this makes it worse for a debugging case that is
relatively rare. I'm extremely reluctant for such a patch to be added
given that the tracepoints can be used to assemble such a monitor even
if it means running a userspace daemon to keep track of it. Would such a
solution be suitable? Failing that if this is a severe issue, would it be
possible to at least make this a compile-time or static tracepoint option?
That way, only people that really need it have to take the penalty.
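
(A sketch of such a daemon, assuming tracefs at /sys/kernel/debug/tracing
and the existing kmem:mm_page_alloc tracepoint, whose output includes an
order= field; run as root:)

#include <stdio.h>
#include <string.h>

#define TRACEFS "/sys/kernel/debug/tracing"
#define MAX_ORDERS 16

static int write_file(const char *path, const char *val)
{
	FILE *f = fopen(path, "w");

	if (!f)
		return -1;
	fputs(val, f);
	fclose(f);
	return 0;
}

int main(void)
{
	char line[1024];
	unsigned long counts[MAX_ORDERS] = { 0 };
	FILE *pipe;
	char *p;
	int order, n = 0;

	if (write_file(TRACEFS "/events/kmem/mm_page_alloc/enable", "1"))
		return 1;
	pipe = fopen(TRACEFS "/trace_pipe", "r");
	if (!pipe)
		return 1;
	/* sample a bounded number of events, then report per-order totals */
	while (n < 100000 && fgets(line, sizeof(line), pipe)) {
		p = strstr(line, "order=");
		if (p && sscanf(p, "order=%d", &order) == 1 &&
		    order >= 0 && order < MAX_ORDERS) {
			counts[order]++;
			n++;
		}
	}
	for (order = 0; order < MAX_ORDERS; order++)
		if (counts[order])
			printf("order %d: %lu\n", order, counts[order]);
	return 0;
}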

-- 
Mel Gorman
SUSE Labs


[PATCH] mm: make allocation counters per-order

2017-07-06 Thread Roman Gushchin
High-order allocations are obviously more costly, and it's very useful
to know how many of them happen if there are any issues
(or suspicions) with memory fragmentation.

This commit changes existing per-zone allocation counters to be
per-zone per-order. These counters are displayed using a new
procfs interface (similar to /proc/buddyinfo):

$ cat /proc/allocinfo
 DMA  0  0  0  0  0 \
   0  0  0  0  0  0
   DMA32  3  0  1  0  0 \
   0  0  0  0  0  0
  Normal4997056  23594  10902  23686931 \
  23122786 17  1  0
 Movable  0  0  0  0  0 \
   0  0  0  0  0  0
  Device  0  0  0  0  0 \
   0  0  0  0  0  0

The existing vmstat interface remains untouched*, and still shows
the total number of single page allocations, so high-order allocations
are represented as a corresponding number of order-0 allocations.

$ cat /proc/vmstat | grep alloc
pgalloc_dma 0
pgalloc_dma32 7
pgalloc_normal 5461660
pgalloc_movable 0
pgalloc_device 0

* I've added device zone for consistency with other zones,
and to avoid messy exclusion of this zone in the code.

Signed-off-by: Roman Gushchin 
Suggested-by: Johannes Weiner 
Cc: Andrew Morton 
Cc: Mel Gorman 
Cc: Johannes Weiner 
Cc: Michal Hocko 
Cc: Vladimir Davydov 
Cc: Rik van Riel 
Cc: kernel-t...@fb.com
Cc: linux...@kvack.org
Cc: linux-kernel@vger.kernel.org
---
 arch/s390/appldata/appldata_mem.c |   7 +++
 include/linux/mmzone.h|   2 +
 include/linux/vm_event_item.h |  19 --
 include/linux/vmstat.h|  13 +
 mm/page_alloc.c   |  11 +++-
 mm/vmstat.c   | 120 +++---
 6 files changed, 158 insertions(+), 14 deletions(-)

diff --git a/arch/s390/appldata/appldata_mem.c b/arch/s390/appldata/appldata_mem.c
index 598df57..06216ff0 100644
--- a/arch/s390/appldata/appldata_mem.c
+++ b/arch/s390/appldata/appldata_mem.c
@@ -81,6 +81,7 @@ static void appldata_get_mem_data(void *data)
static struct sysinfo val;
unsigned long ev[NR_VM_EVENT_ITEMS];
struct appldata_mem_data *mem_data;
+   int order;
 
mem_data = data;
mem_data->sync_count_1++;
@@ -92,6 +93,12 @@ static void appldata_get_mem_data(void *data)
mem_data->pswpout= ev[PSWPOUT];
mem_data->pgalloc= ev[PGALLOC_NORMAL];
mem_data->pgalloc+= ev[PGALLOC_DMA];
+   for (order = 1; order < MAX_ORDER; ++order) {
+   mem_data->pgalloc +=
+   ev[PGALLOC_NORMAL + order * MAX_NR_ZONES] << order;
+   mem_data->pgalloc +=
+ev[PGALLOC_DMA + order * MAX_NR_ZONES] << order;
+   }
mem_data->pgfault= ev[PGFAULT];
mem_data->pgmajfault = ev[PGMAJFAULT];
 
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 16532fa..6598285 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -66,6 +66,8 @@ enum migratetype {
 /* In mm/page_alloc.c; keep in sync also with show_migration_types() there */
 extern char * const migratetype_names[MIGRATE_TYPES];
 
+extern const char *zone_name(int idx);
+
 #ifdef CONFIG_CMA
 #  define is_migrate_cma(migratetype) unlikely((migratetype) == MIGRATE_CMA)
#  define is_migrate_cma_page(_page) (get_pageblock_migratetype(_page) == MIGRATE_CMA)
diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
index 37e8d31..75bbac8 100644
--- a/include/linux/vm_event_item.h
+++ b/include/linux/vm_event_item.h
@@ -19,12 +19,23 @@
 #define HIGHMEM_ZONE(xx)
 #endif
 
-#define FOR_ALL_ZONES(xx) DMA_ZONE(xx) DMA32_ZONE(xx) xx##_NORMAL, HIGHMEM_ZONE(xx) xx##_MOVABLE
+#ifdef CONFIG_ZONE_DEVICE
+#define DEVICE_ZONE(xx) xx##__DEVICE,
+#else
+#define DEVICE_ZONE(xx)
+#endif
+
+#define FOR_ALL_ZONES(xx) DMA_ZONE(xx) DMA32_ZONE(xx) xx##_NORMAL, HIGHMEM_ZONE(xx) xx##_MOVABLE, DEVICE_ZONE(xx)
+
+#define PGALLOC_EVENTS_SIZE (MAX_NR_ZONES * MAX_ORDER)
+#define PGALLOC_EVENTS_CUT_SIZE (MAX_NR_ZONES * (MAX_ORDER - 1))
+#define PGALLOC_FIRST_ZONE (PGALLOC_NORMAL - ZONE_NORMAL)
 
 enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
-   FOR_ALL_ZONES(PGALLOC),
-   FOR_ALL_ZONES(ALLOCSTALL),
-   FOR_ALL_ZONES(PGSCAN_SKIP),
+   FOR_ALL_ZONES(PGALLOC)
+   __PGALLOC_LAST = PGALLOC_FIRST_ZONE + PGALLOC_EVENTS_SIZE - 1,
+   FOR_ALL_ZONES(ALLOCSTALL)
+   FOR_ALL_ZONES(PGSCAN_SKIP)
PGFREE, PGACTIVATE, PGDEACTIVATE, PGLAZYFREE,
PGFAULT, PGMAJFAULT,
PGLAZYFREED,
diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h