Re: [PATCH v14 4/5] mm: support reporting free page blocks

2017-08-21 Thread Michal Hocko
On Fri 18-08-17 20:23:05, Michael S. Tsirkin wrote:
> On Thu, Aug 17, 2017 at 11:26:55AM +0800, Wei Wang wrote:
[...]
> > +void walk_free_mem_block(void *opaque1,
> > +unsigned int min_order,
> > +void (*visit)(void *opaque2,
> 
> You can just avoid opaque2 completely I think, then opaque1 can
> be renamed opaque.
> 
> > +  unsigned long pfn,
> > +  unsigned long nr_pages))
> > +{
> > +   struct zone *zone;
> > +   struct page *page;
> > +   struct list_head *list;
> > +   unsigned int order;
> > +   enum migratetype mt;
> > +   unsigned long pfn, flags;
> > +
> > +   for_each_populated_zone(zone) {
> > +   for (order = MAX_ORDER - 1;
> > +order < MAX_ORDER && order >= min_order; order--) {
> > +   for (mt = 0; mt < MIGRATE_TYPES; mt++) {
> > +   spin_lock_irqsave(>lock, flags);
> > +   list = >free_area[order].free_list[mt];
> > +   list_for_each_entry(page, list, lru) {
> > +   pfn = page_to_pfn(page);
> > +   visit(opaque1, pfn, 1 << order);
> 
> My only concern here is inability of callback to
> 1. break out of list
> 2. remove page from the list

As I've said before this has to be a read only API. You cannot simply
fiddle with the page allocator internals under its feet.

> So I would make the callback bool, and I would use
> list_for_each_entry_safe.

If a bool would tell to break out of the loop then I agree. This sounds
useful.
-- 
Michal Hocko
SUSE Labs


Re: [PATCH v14 4/5] mm: support reporting free page blocks

2017-08-21 Thread Michal Hocko
On Fri 18-08-17 20:23:05, Michael S. Tsirkin wrote:
> On Thu, Aug 17, 2017 at 11:26:55AM +0800, Wei Wang wrote:
[...]
> > +void walk_free_mem_block(void *opaque1,
> > +unsigned int min_order,
> > +void (*visit)(void *opaque2,
> 
> You can just avoid opaque2 completely I think, then opaque1 can
> be renamed opaque.
> 
> > +  unsigned long pfn,
> > +  unsigned long nr_pages))
> > +{
> > +   struct zone *zone;
> > +   struct page *page;
> > +   struct list_head *list;
> > +   unsigned int order;
> > +   enum migratetype mt;
> > +   unsigned long pfn, flags;
> > +
> > +   for_each_populated_zone(zone) {
> > +   for (order = MAX_ORDER - 1;
> > +order < MAX_ORDER && order >= min_order; order--) {
> > +   for (mt = 0; mt < MIGRATE_TYPES; mt++) {
> > +   spin_lock_irqsave(>lock, flags);
> > +   list = >free_area[order].free_list[mt];
> > +   list_for_each_entry(page, list, lru) {
> > +   pfn = page_to_pfn(page);
> > +   visit(opaque1, pfn, 1 << order);
> 
> My only concern here is inability of callback to
> 1. break out of list
> 2. remove page from the list

As I've said before this has to be a read only API. You cannot simply
fiddle with the page allocator internals under its feet.

> So I would make the callback bool, and I would use
> list_for_each_entry_safe.

If a bool would tell to break out of the loop then I agree. This sounds
useful.
-- 
Michal Hocko
SUSE Labs


Re: [PATCH v14 4/5] mm: support reporting free page blocks

2017-08-21 Thread Michal Hocko
On Mon 21-08-17 14:12:47, Wei Wang wrote:
> On 08/18/2017 09:46 PM, Michal Hocko wrote:
[...]
> >>+/**
> >>+ * walk_free_mem_block - Walk through the free page blocks in the system
> >>+ * @opaque1: the context passed from the caller
> >>+ * @min_order: the minimum order of free lists to check
> >>+ * @visit: the callback function given by the caller
> >The original suggestion for using visit was motivated by a visit design
> >pattern but I can see how this can be confusing. Maybe a more explicit
> >name wold be better. What about report_free_range.
> 
> 
> I'm afraid that name would be too long to fit in nicely.
> How about simply naming it "report"?

I do not have a strong opinion on this. I wouldn't be afraid of using
slightly longer name here for the clarity sake, though.
 
> >>+ *
> >>+ * The function is used to walk through the free page blocks in the system,
> >>+ * and each free page block is reported to the caller via the @visit 
> >>callback.
> >>+ * Please note:
> >>+ * 1) The function is used to report hints of free pages, so the caller 
> >>should
> >>+ * not use those reported pages after the callback returns.
> >>+ * 2) The callback is invoked with the zone->lock being held, so it should 
> >>not
> >>+ * block and should finish as soon as possible.
> >I think that the explicit note about zone->lock is not really need. This
> >can change in future and I would even bet that somebody might rely on
> >the lock being held for some purpose and silently get broken with the
> >change. Instead I would much rather see something like the following:
> >"
> >Please note that there are no locking guarantees for the callback
> 
> Just a little confused with this one:
> 
> The callback is invoked within zone->lock, why would we claim it "no
> locking guarantees for the callback"?

Because we definitely do not want anybody to rely on that fact and
(ab)use it. This might change in future and it would be better to be
clear about that.
-- 
Michal Hocko
SUSE Labs


Re: [PATCH v14 4/5] mm: support reporting free page blocks

2017-08-21 Thread Michal Hocko
On Mon 21-08-17 14:12:47, Wei Wang wrote:
> On 08/18/2017 09:46 PM, Michal Hocko wrote:
[...]
> >>+/**
> >>+ * walk_free_mem_block - Walk through the free page blocks in the system
> >>+ * @opaque1: the context passed from the caller
> >>+ * @min_order: the minimum order of free lists to check
> >>+ * @visit: the callback function given by the caller
> >The original suggestion for using visit was motivated by a visit design
> >pattern but I can see how this can be confusing. Maybe a more explicit
> >name wold be better. What about report_free_range.
> 
> 
> I'm afraid that name would be too long to fit in nicely.
> How about simply naming it "report"?

I do not have a strong opinion on this. I wouldn't be afraid of using
slightly longer name here for the clarity sake, though.
 
> >>+ *
> >>+ * The function is used to walk through the free page blocks in the system,
> >>+ * and each free page block is reported to the caller via the @visit 
> >>callback.
> >>+ * Please note:
> >>+ * 1) The function is used to report hints of free pages, so the caller 
> >>should
> >>+ * not use those reported pages after the callback returns.
> >>+ * 2) The callback is invoked with the zone->lock being held, so it should 
> >>not
> >>+ * block and should finish as soon as possible.
> >I think that the explicit note about zone->lock is not really need. This
> >can change in future and I would even bet that somebody might rely on
> >the lock being held for some purpose and silently get broken with the
> >change. Instead I would much rather see something like the following:
> >"
> >Please note that there are no locking guarantees for the callback
> 
> Just a little confused with this one:
> 
> The callback is invoked within zone->lock, why would we claim it "no
> locking guarantees for the callback"?

Because we definitely do not want anybody to rely on that fact and
(ab)use it. This might change in future and it would be better to be
clear about that.
-- 
Michal Hocko
SUSE Labs


Re: [PATCH v14 4/5] mm: support reporting free page blocks

2017-08-21 Thread Wei Wang

On 08/18/2017 09:46 PM, Michal Hocko wrote:

On Thu 17-08-17 11:26:55, Wei Wang wrote:

This patch adds support to walk through the free page blocks in the
system and report them via a callback function. Some page blocks may
leave the free list after zone->lock is released, so it is the caller's
responsibility to either detect or prevent the use of such pages.

This could see more details to be honest. Especially the usecase you are
going to use this for. This will help us to understand the motivation
in future when the current user might be gone a new ones largely diverge
into a different usage. This wouldn't be the first time I have seen
something like that.


OK, I will more details here about how it's used to accelerate live 
migration.



Signed-off-by: Wei Wang 
Signed-off-by: Liang Li 
Cc: Michal Hocko 
Cc: Michael S. Tsirkin 
---
  include/linux/mm.h |  6 ++
  mm/page_alloc.c| 44 
  2 files changed, 50 insertions(+)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 46b9ac5..cd29b9f 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1835,6 +1835,12 @@ extern void free_area_init_node(int nid, unsigned long * 
zones_size,
unsigned long zone_start_pfn, unsigned long *zholes_size);
  extern void free_initmem(void);
  
+extern void walk_free_mem_block(void *opaque1,

+   unsigned int min_order,
+   void (*visit)(void *opaque2,
+ unsigned long pfn,
+ unsigned long nr_pages));
+
  /*
   * Free reserved pages within range [PAGE_ALIGN(start), end & PAGE_MASK)
   * into the buddy system. The freed pages will be poisoned with pattern
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 6d00f74..a721a35 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4762,6 +4762,50 @@ void show_free_areas(unsigned int filter, nodemask_t 
*nodemask)
show_swap_cache_info();
  }
  
+/**

+ * walk_free_mem_block - Walk through the free page blocks in the system
+ * @opaque1: the context passed from the caller
+ * @min_order: the minimum order of free lists to check
+ * @visit: the callback function given by the caller

The original suggestion for using visit was motivated by a visit design
pattern but I can see how this can be confusing. Maybe a more explicit
name wold be better. What about report_free_range.



I'm afraid that name would be too long to fit in nicely.
How about simply naming it "report"?





+ *
+ * The function is used to walk through the free page blocks in the system,
+ * and each free page block is reported to the caller via the @visit callback.
+ * Please note:
+ * 1) The function is used to report hints of free pages, so the caller should
+ * not use those reported pages after the callback returns.
+ * 2) The callback is invoked with the zone->lock being held, so it should not
+ * block and should finish as soon as possible.

I think that the explicit note about zone->lock is not really need. This
can change in future and I would even bet that somebody might rely on
the lock being held for some purpose and silently get broken with the
change. Instead I would much rather see something like the following:
"
Please note that there are no locking guarantees for the callback


Just a little confused with this one:

The callback is invoked within zone->lock, why would we claim it "no
locking guarantees for the callback"?


and
that the reported pfn range might be freed or disappear after the
callback returns so the caller has to be very careful how it is used.

The callback itself must not sleep or perform any operations which would
require any memory allocations directly (not even GFP_NOWAIT/GFP_ATOMIC)
or via any lock dependency. It is generally advisable to implement
the callback as simple as possible and defer any heavy lifting to a
different context.

There is no guarantee that each free range will be reported only once
during one walk_free_mem_block invocation.

pfn_to_page on the given range is strongly discouraged and if there is
an absolute need for that make sure to contact MM people to discuss
potential problems.

The function itself might sleep so it cannot be called from atomic
contexts.

In general low orders tend to be very volatile and so it makes more
sense to query larger ones for various optimizations which like
ballooning etc... This will reduce the overhead as well.
"


I think it looks quite comprehensive. Thanks.


Best,
Wei


Re: [PATCH v14 4/5] mm: support reporting free page blocks

2017-08-21 Thread Wei Wang

On 08/18/2017 09:46 PM, Michal Hocko wrote:

On Thu 17-08-17 11:26:55, Wei Wang wrote:

This patch adds support to walk through the free page blocks in the
system and report them via a callback function. Some page blocks may
leave the free list after zone->lock is released, so it is the caller's
responsibility to either detect or prevent the use of such pages.

This could see more details to be honest. Especially the usecase you are
going to use this for. This will help us to understand the motivation
in future when the current user might be gone a new ones largely diverge
into a different usage. This wouldn't be the first time I have seen
something like that.


OK, I will more details here about how it's used to accelerate live 
migration.



Signed-off-by: Wei Wang 
Signed-off-by: Liang Li 
Cc: Michal Hocko 
Cc: Michael S. Tsirkin 
---
  include/linux/mm.h |  6 ++
  mm/page_alloc.c| 44 
  2 files changed, 50 insertions(+)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 46b9ac5..cd29b9f 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1835,6 +1835,12 @@ extern void free_area_init_node(int nid, unsigned long * 
zones_size,
unsigned long zone_start_pfn, unsigned long *zholes_size);
  extern void free_initmem(void);
  
+extern void walk_free_mem_block(void *opaque1,

+   unsigned int min_order,
+   void (*visit)(void *opaque2,
+ unsigned long pfn,
+ unsigned long nr_pages));
+
  /*
   * Free reserved pages within range [PAGE_ALIGN(start), end & PAGE_MASK)
   * into the buddy system. The freed pages will be poisoned with pattern
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 6d00f74..a721a35 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4762,6 +4762,50 @@ void show_free_areas(unsigned int filter, nodemask_t 
*nodemask)
show_swap_cache_info();
  }
  
+/**

+ * walk_free_mem_block - Walk through the free page blocks in the system
+ * @opaque1: the context passed from the caller
+ * @min_order: the minimum order of free lists to check
+ * @visit: the callback function given by the caller

The original suggestion for using visit was motivated by a visit design
pattern but I can see how this can be confusing. Maybe a more explicit
name wold be better. What about report_free_range.



I'm afraid that name would be too long to fit in nicely.
How about simply naming it "report"?





+ *
+ * The function is used to walk through the free page blocks in the system,
+ * and each free page block is reported to the caller via the @visit callback.
+ * Please note:
+ * 1) The function is used to report hints of free pages, so the caller should
+ * not use those reported pages after the callback returns.
+ * 2) The callback is invoked with the zone->lock being held, so it should not
+ * block and should finish as soon as possible.

I think that the explicit note about zone->lock is not really need. This
can change in future and I would even bet that somebody might rely on
the lock being held for some purpose and silently get broken with the
change. Instead I would much rather see something like the following:
"
Please note that there are no locking guarantees for the callback


Just a little confused with this one:

The callback is invoked within zone->lock, why would we claim it "no
locking guarantees for the callback"?


and
that the reported pfn range might be freed or disappear after the
callback returns so the caller has to be very careful how it is used.

The callback itself must not sleep or perform any operations which would
require any memory allocations directly (not even GFP_NOWAIT/GFP_ATOMIC)
or via any lock dependency. It is generally advisable to implement
the callback as simple as possible and defer any heavy lifting to a
different context.

There is no guarantee that each free range will be reported only once
during one walk_free_mem_block invocation.

pfn_to_page on the given range is strongly discouraged and if there is
an absolute need for that make sure to contact MM people to discuss
potential problems.

The function itself might sleep so it cannot be called from atomic
contexts.

In general low orders tend to be very volatile and so it makes more
sense to query larger ones for various optimizations which like
ballooning etc... This will reduce the overhead as well.
"


I think it looks quite comprehensive. Thanks.


Best,
Wei


Re: [PATCH v14 4/5] mm: support reporting free page blocks

2017-08-18 Thread Michael S. Tsirkin
On Thu, Aug 17, 2017 at 11:26:55AM +0800, Wei Wang wrote:
> This patch adds support to walk through the free page blocks in the
> system and report them via a callback function. Some page blocks may
> leave the free list after zone->lock is released, so it is the caller's
> responsibility to either detect or prevent the use of such pages.
> 
> Signed-off-by: Wei Wang 
> Signed-off-by: Liang Li 
> Cc: Michal Hocko 
> Cc: Michael S. Tsirkin 
> ---
>  include/linux/mm.h |  6 ++
>  mm/page_alloc.c| 44 
>  2 files changed, 50 insertions(+)
> 
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 46b9ac5..cd29b9f 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -1835,6 +1835,12 @@ extern void free_area_init_node(int nid, unsigned long 
> * zones_size,
>   unsigned long zone_start_pfn, unsigned long *zholes_size);
>  extern void free_initmem(void);
>  
> +extern void walk_free_mem_block(void *opaque1,
> + unsigned int min_order,
> + void (*visit)(void *opaque2,
> +   unsigned long pfn,
> +   unsigned long nr_pages));
> +
>  /*
>   * Free reserved pages within range [PAGE_ALIGN(start), end & PAGE_MASK)
>   * into the buddy system. The freed pages will be poisoned with pattern
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 6d00f74..a721a35 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -4762,6 +4762,50 @@ void show_free_areas(unsigned int filter, nodemask_t 
> *nodemask)
>   show_swap_cache_info();
>  }
>  
> +/**
> + * walk_free_mem_block - Walk through the free page blocks in the system
> + * @opaque1: the context passed from the caller
> + * @min_order: the minimum order of free lists to check
> + * @visit: the callback function given by the caller
> + *
> + * The function is used to walk through the free page blocks in the system,
> + * and each free page block is reported to the caller via the @visit 
> callback.
> + * Please note:
> + * 1) The function is used to report hints of free pages, so the caller 
> should
> + * not use those reported pages after the callback returns.
> + * 2) The callback is invoked with the zone->lock being held, so it should 
> not
> + * block and should finish as soon as possible.
> + */
> +void walk_free_mem_block(void *opaque1,
> +  unsigned int min_order,
> +  void (*visit)(void *opaque2,

You can just avoid opaque2 completely I think, then opaque1 can
be renamed opaque.

> +unsigned long pfn,
> +unsigned long nr_pages))
> +{
> + struct zone *zone;
> + struct page *page;
> + struct list_head *list;
> + unsigned int order;
> + enum migratetype mt;
> + unsigned long pfn, flags;
> +
> + for_each_populated_zone(zone) {
> + for (order = MAX_ORDER - 1;
> +  order < MAX_ORDER && order >= min_order; order--) {
> + for (mt = 0; mt < MIGRATE_TYPES; mt++) {
> + spin_lock_irqsave(>lock, flags);
> + list = >free_area[order].free_list[mt];
> + list_for_each_entry(page, list, lru) {
> + pfn = page_to_pfn(page);
> + visit(opaque1, pfn, 1 << order);

My only concern here is inability of callback to
1. break out of list
2. remove page from the list

So I would make the callback bool, and I would use
list_for_each_entry_safe.


> + }
> + spin_unlock_irqrestore(>lock, flags);
> + }
> + }
> + }
> +}
> +EXPORT_SYMBOL_GPL(walk_free_mem_block);
> +
>  static void zoneref_set_zone(struct zone *zone, struct zoneref *zoneref)
>  {
>   zoneref->zone = zone;
> -- 
> 2.7.4


Re: [PATCH v14 4/5] mm: support reporting free page blocks

2017-08-18 Thread Michael S. Tsirkin
On Thu, Aug 17, 2017 at 11:26:55AM +0800, Wei Wang wrote:
> This patch adds support to walk through the free page blocks in the
> system and report them via a callback function. Some page blocks may
> leave the free list after zone->lock is released, so it is the caller's
> responsibility to either detect or prevent the use of such pages.
> 
> Signed-off-by: Wei Wang 
> Signed-off-by: Liang Li 
> Cc: Michal Hocko 
> Cc: Michael S. Tsirkin 
> ---
>  include/linux/mm.h |  6 ++
>  mm/page_alloc.c| 44 
>  2 files changed, 50 insertions(+)
> 
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 46b9ac5..cd29b9f 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -1835,6 +1835,12 @@ extern void free_area_init_node(int nid, unsigned long 
> * zones_size,
>   unsigned long zone_start_pfn, unsigned long *zholes_size);
>  extern void free_initmem(void);
>  
> +extern void walk_free_mem_block(void *opaque1,
> + unsigned int min_order,
> + void (*visit)(void *opaque2,
> +   unsigned long pfn,
> +   unsigned long nr_pages));
> +
>  /*
>   * Free reserved pages within range [PAGE_ALIGN(start), end & PAGE_MASK)
>   * into the buddy system. The freed pages will be poisoned with pattern
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 6d00f74..a721a35 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -4762,6 +4762,50 @@ void show_free_areas(unsigned int filter, nodemask_t 
> *nodemask)
>   show_swap_cache_info();
>  }
>  
> +/**
> + * walk_free_mem_block - Walk through the free page blocks in the system
> + * @opaque1: the context passed from the caller
> + * @min_order: the minimum order of free lists to check
> + * @visit: the callback function given by the caller
> + *
> + * The function is used to walk through the free page blocks in the system,
> + * and each free page block is reported to the caller via the @visit 
> callback.
> + * Please note:
> + * 1) The function is used to report hints of free pages, so the caller 
> should
> + * not use those reported pages after the callback returns.
> + * 2) The callback is invoked with the zone->lock being held, so it should 
> not
> + * block and should finish as soon as possible.
> + */
> +void walk_free_mem_block(void *opaque1,
> +  unsigned int min_order,
> +  void (*visit)(void *opaque2,

You can just avoid opaque2 completely I think, then opaque1 can
be renamed opaque.

> +unsigned long pfn,
> +unsigned long nr_pages))
> +{
> + struct zone *zone;
> + struct page *page;
> + struct list_head *list;
> + unsigned int order;
> + enum migratetype mt;
> + unsigned long pfn, flags;
> +
> + for_each_populated_zone(zone) {
> + for (order = MAX_ORDER - 1;
> +  order < MAX_ORDER && order >= min_order; order--) {
> + for (mt = 0; mt < MIGRATE_TYPES; mt++) {
> + spin_lock_irqsave(>lock, flags);
> + list = >free_area[order].free_list[mt];
> + list_for_each_entry(page, list, lru) {
> + pfn = page_to_pfn(page);
> + visit(opaque1, pfn, 1 << order);

My only concern here is inability of callback to
1. break out of list
2. remove page from the list

So I would make the callback bool, and I would use
list_for_each_entry_safe.


> + }
> + spin_unlock_irqrestore(>lock, flags);
> + }
> + }
> + }
> +}
> +EXPORT_SYMBOL_GPL(walk_free_mem_block);
> +
>  static void zoneref_set_zone(struct zone *zone, struct zoneref *zoneref)
>  {
>   zoneref->zone = zone;
> -- 
> 2.7.4


Re: [PATCH v14 4/5] mm: support reporting free page blocks

2017-08-18 Thread Michal Hocko
On Thu 17-08-17 11:26:55, Wei Wang wrote:
> This patch adds support to walk through the free page blocks in the
> system and report them via a callback function. Some page blocks may
> leave the free list after zone->lock is released, so it is the caller's
> responsibility to either detect or prevent the use of such pages.

This could see more details to be honest. Especially the usecase you are
going to use this for. This will help us to understand the motivation
in future when the current user might be gone a new ones largely diverge
into a different usage. This wouldn't be the first time I have seen
something like that.

> Signed-off-by: Wei Wang 
> Signed-off-by: Liang Li 
> Cc: Michal Hocko 
> Cc: Michael S. Tsirkin 
> ---
>  include/linux/mm.h |  6 ++
>  mm/page_alloc.c| 44 
>  2 files changed, 50 insertions(+)
> 
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 46b9ac5..cd29b9f 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -1835,6 +1835,12 @@ extern void free_area_init_node(int nid, unsigned long 
> * zones_size,
>   unsigned long zone_start_pfn, unsigned long *zholes_size);
>  extern void free_initmem(void);
>  
> +extern void walk_free_mem_block(void *opaque1,
> + unsigned int min_order,
> + void (*visit)(void *opaque2,
> +   unsigned long pfn,
> +   unsigned long nr_pages));
> +
>  /*
>   * Free reserved pages within range [PAGE_ALIGN(start), end & PAGE_MASK)
>   * into the buddy system. The freed pages will be poisoned with pattern
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 6d00f74..a721a35 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -4762,6 +4762,50 @@ void show_free_areas(unsigned int filter, nodemask_t 
> *nodemask)
>   show_swap_cache_info();
>  }
>  
> +/**
> + * walk_free_mem_block - Walk through the free page blocks in the system
> + * @opaque1: the context passed from the caller
> + * @min_order: the minimum order of free lists to check
> + * @visit: the callback function given by the caller

The original suggestion for using visit was motivated by a visit design
pattern but I can see how this can be confusing. Maybe a more explicit
name wold be better. What about report_free_range.

> + *
> + * The function is used to walk through the free page blocks in the system,
> + * and each free page block is reported to the caller via the @visit 
> callback.
> + * Please note:
> + * 1) The function is used to report hints of free pages, so the caller 
> should
> + * not use those reported pages after the callback returns.
> + * 2) The callback is invoked with the zone->lock being held, so it should 
> not
> + * block and should finish as soon as possible.

I think that the explicit note about zone->lock is not really need. This
can change in future and I would even bet that somebody might rely on
the lock being held for some purpose and silently get broken with the
change. Instead I would much rather see something like the following:
"
Please note that there are no locking guarantees for the callback and
that the reported pfn range might be freed or disappear after the
callback returns so the caller has to be very careful how it is used.

The callback itself must not sleep or perform any operations which would
require any memory allocations directly (not even GFP_NOWAIT/GFP_ATOMIC)
or via any lock dependency. It is generally advisable to implement
the callback as simple as possible and defer any heavy lifting to a
different context.

There is no guarantee that each free range will be reported only once
during one walk_free_mem_block invocation.

pfn_to_page on the given range is strongly discouraged and if there is
an absolute need for that make sure to contact MM people to discuss
potential problems.

The function itself might sleep so it cannot be called from atomic
contexts.

In general low orders tend to be very volatile and so it makes more
sense to query larger ones for various optimizations which like
ballooning etc... This will reduce the overhead as well.
"

> + */
> +void walk_free_mem_block(void *opaque1,
> +  unsigned int min_order,

make the order int and...
> +  void (*visit)(void *opaque2,
> +unsigned long pfn,
> +unsigned long nr_pages))
> +{
> + struct zone *zone;
> + struct page *page;
> + struct list_head *list;
> + unsigned int order;
> + enum migratetype mt;
> + unsigned long pfn, flags;
> +
> + for_each_populated_zone(zone) {
> + for (order = MAX_ORDER - 1;
> +  order < MAX_ORDER && order >= min_order; order--) {

you will not need the underflow check which is just 

Re: [PATCH v14 4/5] mm: support reporting free page blocks

2017-08-18 Thread Michal Hocko
On Thu 17-08-17 11:26:55, Wei Wang wrote:
> This patch adds support to walk through the free page blocks in the
> system and report them via a callback function. Some page blocks may
> leave the free list after zone->lock is released, so it is the caller's
> responsibility to either detect or prevent the use of such pages.

This could see more details to be honest. Especially the usecase you are
going to use this for. This will help us to understand the motivation
in future when the current user might be gone a new ones largely diverge
into a different usage. This wouldn't be the first time I have seen
something like that.

> Signed-off-by: Wei Wang 
> Signed-off-by: Liang Li 
> Cc: Michal Hocko 
> Cc: Michael S. Tsirkin 
> ---
>  include/linux/mm.h |  6 ++
>  mm/page_alloc.c| 44 
>  2 files changed, 50 insertions(+)
> 
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 46b9ac5..cd29b9f 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -1835,6 +1835,12 @@ extern void free_area_init_node(int nid, unsigned long 
> * zones_size,
>   unsigned long zone_start_pfn, unsigned long *zholes_size);
>  extern void free_initmem(void);
>  
> +extern void walk_free_mem_block(void *opaque1,
> + unsigned int min_order,
> + void (*visit)(void *opaque2,
> +   unsigned long pfn,
> +   unsigned long nr_pages));
> +
>  /*
>   * Free reserved pages within range [PAGE_ALIGN(start), end & PAGE_MASK)
>   * into the buddy system. The freed pages will be poisoned with pattern
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 6d00f74..a721a35 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -4762,6 +4762,50 @@ void show_free_areas(unsigned int filter, nodemask_t 
> *nodemask)
>   show_swap_cache_info();
>  }
>  
> +/**
> + * walk_free_mem_block - Walk through the free page blocks in the system
> + * @opaque1: the context passed from the caller
> + * @min_order: the minimum order of free lists to check
> + * @visit: the callback function given by the caller

The original suggestion for using visit was motivated by a visit design
pattern but I can see how this can be confusing. Maybe a more explicit
name wold be better. What about report_free_range.

> + *
> + * The function is used to walk through the free page blocks in the system,
> + * and each free page block is reported to the caller via the @visit 
> callback.
> + * Please note:
> + * 1) The function is used to report hints of free pages, so the caller 
> should
> + * not use those reported pages after the callback returns.
> + * 2) The callback is invoked with the zone->lock being held, so it should 
> not
> + * block and should finish as soon as possible.

I think that the explicit note about zone->lock is not really need. This
can change in future and I would even bet that somebody might rely on
the lock being held for some purpose and silently get broken with the
change. Instead I would much rather see something like the following:
"
Please note that there are no locking guarantees for the callback and
that the reported pfn range might be freed or disappear after the
callback returns so the caller has to be very careful how it is used.

The callback itself must not sleep or perform any operations which would
require any memory allocations directly (not even GFP_NOWAIT/GFP_ATOMIC)
or via any lock dependency. It is generally advisable to implement
the callback as simple as possible and defer any heavy lifting to a
different context.

There is no guarantee that each free range will be reported only once
during one walk_free_mem_block invocation.

pfn_to_page on the given range is strongly discouraged and if there is
an absolute need for that make sure to contact MM people to discuss
potential problems.

The function itself might sleep so it cannot be called from atomic
contexts.

In general low orders tend to be very volatile and so it makes more
sense to query larger ones for various optimizations which like
ballooning etc... This will reduce the overhead as well.
"

> + */
> +void walk_free_mem_block(void *opaque1,
> +  unsigned int min_order,

make the order int and...
> +  void (*visit)(void *opaque2,
> +unsigned long pfn,
> +unsigned long nr_pages))
> +{
> + struct zone *zone;
> + struct page *page;
> + struct list_head *list;
> + unsigned int order;
> + enum migratetype mt;
> + unsigned long pfn, flags;
> +
> + for_each_populated_zone(zone) {
> + for (order = MAX_ORDER - 1;
> +  order < MAX_ORDER && order >= min_order; order--) {

you will not need the underflow check which is just ugly

> + for (mt = 0; mt < MIGRATE_TYPES; mt++) {
> +