Re: [RFC PATCH v3 3/3] mm/compaction: enhance compaction finish condition

2015-02-02 Thread Zhang Yanfei
Hello,

At 2015/2/2 18:20, Vlastimil Babka wrote:
> On 02/02/2015 08:15 AM, Joonsoo Kim wrote:
>> Compaction has an anti-fragmentation algorithm: if we don't find any freepage
>> in the requested migratetype's buddy list, a freepage of at least pageblock
>> order must exist before compaction can finish. This mitigates fragmentation,
>> but it lacks any migratetype consideration and is too strict compared to the
>> page allocator's anti-fragmentation algorithm.
>>
>> Not considering migratetype can cause a premature finish of compaction.
>> For example, if the allocation request is for the unmovable migratetype, a
>> freepage with CMA migratetype doesn't help that allocation, so compaction
>> should not be stopped. But the current logic regards this situation as one
>> where compaction is no longer needed, and finishes the compaction.
> 
> This is only for order >= pageblock_order, right? Perhaps that should be
> stated explicitly.

I might be wrong, but if patch 1 is applied, then after the system runs for
some time there should be no MIGRATE_CMA free pages left in the system, right?
If so, the example above no longer exists.

> 
>> Secondly, the condition is too strict compared to the page allocator's logic.
>> In the page allocator we can steal a freepage from another migratetype and
>> change the pageblock's migratetype under more relaxed conditions. That logic
>> is designed to prevent fragmentation and we can use it here, too. Imposing a
>> hard constraint only on compaction doesn't help much in this case, since the
>> page allocator would cause fragmentation again.
>>
>> To solve these problems, this patch borrows the anti-fragmentation logic from
>> the page allocator. It reduces premature compaction finishes in some cases
>> and reduces excessive compaction work.
>>
>> The stress-highalloc test in mmtests with non-movable order-7 allocations
>> shows a considerable increase in compaction success rate.
>>
>> Compaction success rate (Compaction success * 100 / Compaction stalls, %)
>> 31.82 : 42.20
>>
>> Signed-off-by: Joonsoo Kim 
>> ---
>>  mm/compaction.c | 14 --
>>  mm/internal.h   |  2 ++
>>  mm/page_alloc.c | 12 
>>  3 files changed, 22 insertions(+), 6 deletions(-)
>>
>> diff --git a/mm/compaction.c b/mm/compaction.c
>> index 782772d..d40c426 100644
>> --- a/mm/compaction.c
>> +++ b/mm/compaction.c
>> @@ -1170,13 +1170,23 @@ static int __compact_finished(struct zone *zone, 
>> struct compact_control *cc,
>>  /* Direct compactor: Is a suitable page free? */
>>  for (order = cc->order; order < MAX_ORDER; order++) {
>>  struct free_area *area = &zone->free_area[order];
>> +bool can_steal;
>>  
>>  /* Job done if page is free of the right migratetype */
>>  if (!list_empty(&area->free_list[migratetype]))
>>  return COMPACT_PARTIAL;
>>  
>> -/* Job done if allocation would set block type */
>> -if (order >= pageblock_order && area->nr_free)
>> +/* MIGRATE_MOVABLE can fallback on MIGRATE_CMA */
>> +if (migratetype == MIGRATE_MOVABLE &&
>> +!list_empty(&area->free_list[MIGRATE_CMA]))
>> +return COMPACT_PARTIAL;
> 
> The above AFAICS needs #ifdef CMA otherwise won't compile without CMA.
> 
>> +
>> +/*
>> + * Job done if allocation would steal freepages from
>> + * other migratetype buddy lists.
>> + */
>> +if (find_suitable_fallback(area, order, migratetype,
>> +true, &can_steal) != -1)
>>  return COMPACT_PARTIAL;
>>  }
>>  
>> diff --git a/mm/internal.h b/mm/internal.h
>> index c4d6c9b..9640650 100644
>> --- a/mm/internal.h
>> +++ b/mm/internal.h
>> @@ -200,6 +200,8 @@ isolate_freepages_range(struct compact_control *cc,
>>  unsigned long
>>  isolate_migratepages_range(struct compact_control *cc,
>> unsigned long low_pfn, unsigned long end_pfn);
>> +int find_suitable_fallback(struct free_area *area, unsigned int order,
>> +int migratetype, bool only_stealable, bool *can_steal);
>>  
>>  #endif
>>  
>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>> index 6cb18f8..0a150f1 100644
>> --- a/mm/page_alloc.c
>> +++ b/mm/page_alloc.c
>> @@ -1177,8 +1177,8 @@ static void steal_suitable_fallback(struct zone *zone, 
>> struct page *page,
>>  set_pageblock_migratetype(page, start_type);
>>  }
>>  
>> -static int find_suitable_fallback(struct free_area *area, unsigned int 
>> order,
>> -int migratetype, bool *can_steal)
>> +int find_suitable_fallback(struct free_area *area, unsigned int order,
>> +int migratetype, bool only_stealable, bool *can_steal)
>>  {
>>  int i;
>>  int fallback_mt;
>> @@ -1198,7 +1198,11 @@ static int find_suitable_fallback(struct free_area 
>> *area, unsigned int order,
>>  if (can_steal_fallback(order, migratetype))
>>  *can_steal = true;
>>  
>> -return i;
>> +if (!only_stealable)
>> +return i;
>> +
>> +if (*can_steal)
>> +return i;

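A note on Vlastimil's CONFIG_CMA point above: MIGRATE_CMA is only defined when
CONFIG_CMA is enabled, so the new movable-can-fall-back-on-CMA check needs a
preprocessor guard to keep !CMA builds compiling. A minimal sketch of how the
check could be guarded (illustrative, not the posted patch):

#ifdef CONFIG_CMA
		/* MIGRATE_MOVABLE can fallback on MIGRATE_CMA */
		if (migratetype == MIGRATE_MOVABLE &&
			!list_empty(&area->free_list[MIGRATE_CMA]))
			return COMPACT_PARTIAL;
#endif

An IS_ENABLED(CONFIG_CMA) test would not be enough on its own here, because the
MIGRATE_CMA enumerator itself does not exist without CONFIG_CMA.
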
Re: [RFC PATCH v3 2/3] mm/page_alloc: factor out fallback freepage checking

2015-02-02 Thread Zhang Yanfei
Hello Joonsoo,

At 2015/2/2 15:15, Joonsoo Kim wrote:
> This is a preparation step for using the page allocator's anti-fragmentation
> logic in compaction. This patch just separates the fallback freepage checking
> part from the fallback freepage management part, so there is no functional
> change.
> 
> Signed-off-by: Joonsoo Kim 
> ---
>  mm/page_alloc.c | 128 
> +---
>  1 file changed, 76 insertions(+), 52 deletions(-)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index e64b260..6cb18f8 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1142,14 +1142,26 @@ static void change_pageblock_range(struct page 
> *pageblock_page,
>   * as fragmentation caused by those allocations polluting movable pageblocks
>   * is worse than movable allocations stealing from unmovable and reclaimable
>   * pageblocks.
> - *
> - * If we claim more than half of the pageblock, change pageblock's 
> migratetype
> - * as well.
>   */
> -static void try_to_steal_freepages(struct zone *zone, struct page *page,
> -   int start_type, int fallback_type)
> +static bool can_steal_fallback(unsigned int order, int start_mt)
> +{
> + if (order >= pageblock_order)
> + return true;

Is this test necessary? An order that is >= pageblock_order will always pass
the order >= pageblock_order / 2 test below.

Thanks.

> +
> + if (order >= pageblock_order / 2 ||
> + start_mt == MIGRATE_RECLAIMABLE ||
> + start_mt == MIGRATE_UNMOVABLE ||
> + page_group_by_mobility_disabled)
> + return true;
> +
> + return false;
> +}
> +
> +static void steal_suitable_fallback(struct zone *zone, struct page *page,
> +   int start_type)
>  {
>   int current_order = page_order(page);
> + int pages;
>  
>   /* Take ownership for orders >= pageblock_order */
>   if (current_order >= pageblock_order) {
> @@ -1157,19 +1169,39 @@ static void try_to_steal_freepages(struct zone *zone, 
> struct page *page,
>   return;
>   }
>  
> - if (current_order >= pageblock_order / 2 ||
> - start_type == MIGRATE_RECLAIMABLE ||
> - start_type == MIGRATE_UNMOVABLE ||
> - page_group_by_mobility_disabled) {
> - int pages;
> + pages = move_freepages_block(zone, page, start_type);
>  
> - pages = move_freepages_block(zone, page, start_type);
> + /* Claim the whole block if over half of it is free */
> + if (pages >= (1 << (pageblock_order-1)) ||
> + page_group_by_mobility_disabled)
> + set_pageblock_migratetype(page, start_type);
> +}
>  
> - /* Claim the whole block if over half of it is free */
> - if (pages >= (1 << (pageblock_order-1)) ||
> - page_group_by_mobility_disabled)
> - set_pageblock_migratetype(page, start_type);
> +static int find_suitable_fallback(struct free_area *area, unsigned int order,
> + int migratetype, bool *can_steal)
> +{
> + int i;
> + int fallback_mt;
> +
> + if (area->nr_free == 0)
> + return -1;
> +
> + *can_steal = false;
> + for (i = 0;; i++) {
> + fallback_mt = fallbacks[migratetype][i];
> + if (fallback_mt == MIGRATE_RESERVE)
> + break;
> +
> + if (list_empty(&area->free_list[fallback_mt]))
> + continue;
> +
> + if (can_steal_fallback(order, migratetype))
> + *can_steal = true;
> +
> + return i;
>   }
> +
> + return -1;
>  }
>  
>  /* Remove an element from the buddy allocator from the fallback list */
> @@ -1179,53 +1211,45 @@ __rmqueue_fallback(struct zone *zone, unsigned int 
> order, int start_migratetype)
>   struct free_area *area;
>   unsigned int current_order;
>   struct page *page;
> + int fallback_mt;
> + bool can_steal;
>  
>   /* Find the largest possible block of pages in the other list */
>   for (current_order = MAX_ORDER-1;
>   current_order >= order && current_order <= 
> MAX_ORDER-1;
>   --current_order) {
> - int i;
> - for (i = 0;; i++) {
> - int migratetype = fallbacks[start_migratetype][i];
> - int buddy_type = start_migratetype;
> -
> - /* MIGRATE_RESERVE handled later if necessary */
> - if (migratetype == MIGRATE_RESERVE)
> - break;
> -
> - area = &(zone->free_area[current_order]);
> - if (list_empty(&area->free_list[migratetype]))
> - continue;
> -
> - page = list_entry(area->free_list[migratetype].next,
> - struct page, lru);
> - area->nr_free--;
> + area = &(zone->free_area[current_order]);

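On Zhang's question above about the order >= pageblock_order test in
can_steal_fallback(): the test is indeed logically redundant, because any order
>= pageblock_order trivially satisfies order >= pageblock_order / 2, and in
that case the second check returns true regardless of start_mt. If the early
return is kept purely to document the take-the-whole-pageblock case, a comment
can make that explicit. A sketch (illustrative only, not the posted patch):

static bool can_steal_fallback(unsigned int order, int start_mt)
{
	/*
	 * Logically covered by the relaxed check below, but kept to make
	 * explicit that for order >= pageblock_order we always take
	 * ownership of the whole pageblock, whatever start_mt is.
	 */
	if (order >= pageblock_order)
		return true;

	if (order >= pageblock_order / 2 ||
			start_mt == MIGRATE_RECLAIMABLE ||
			start_mt == MIGRATE_UNMOVABLE ||
			page_group_by_mobility_disabled)
		return true;

	return false;
}
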
Re: [PATCH v2 4/4] mm/compaction: enhance compaction finish condition

2015-01-31 Thread Zhang Yanfei
At 2015/1/30 20:34, Joonsoo Kim wrote:
> From: Joonsoo 
> 
> Compaction has an anti-fragmentation algorithm: if we don't find any freepage
> in the requested migratetype's buddy list, a freepage of at least pageblock
> order must exist before compaction can finish. This mitigates fragmentation,
> but it lacks any migratetype consideration and is too strict compared to the
> page allocator's anti-fragmentation algorithm.
> 
> Not considering migratetype can cause a premature finish of compaction.
> For example, if the allocation request is for the unmovable migratetype, a
> freepage with CMA migratetype doesn't help that allocation, so compaction
> should not be stopped. But the current logic regards this situation as one
> where compaction is no longer needed, and finishes the compaction.
> 
> Secondly, the condition is too strict compared to the page allocator's logic.
> In the page allocator we can steal a freepage from another migratetype and
> change the pageblock's migratetype under more relaxed conditions. That logic
> is designed to prevent fragmentation and we can use it here, too. Imposing a
> hard constraint only on compaction doesn't help much in this case, since the
> page allocator would cause fragmentation again.

Changing both of these behaviours in compaction may change high-order
allocation behaviour in the buddy allocator slowpath, so, as Vlastimil
suggested, some data from the allocator side would be necessary and helpful,
IMHO.

Thanks. 

> 
> To solve these problems, this patch borrows the anti-fragmentation logic from
> the page allocator. It reduces premature compaction finishes in some cases and
> reduces excessive compaction work.
> 
> The stress-highalloc test in mmtests with non-movable order-7 allocations shows
> a considerable increase in compaction success rate.
> 
> Compaction success rate (Compaction success * 100 / Compaction stalls, %)
> 31.82 : 42.20
> 
> Signed-off-by: Joonsoo Kim 
> ---
>  include/linux/mmzone.h |  3 +++
>  mm/compaction.c| 30 --
>  mm/internal.h  |  1 +
>  mm/page_alloc.c|  5 ++---
>  4 files changed, 34 insertions(+), 5 deletions(-)
> 
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index f279d9c..a2906bc 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -63,6 +63,9 @@ enum {
>   MIGRATE_TYPES
>  };
>  
> +#define FALLBACK_MIGRATETYPES (4)
> +extern int fallbacks[MIGRATE_TYPES][FALLBACK_MIGRATETYPES];
> +
>  #ifdef CONFIG_CMA
>  #  define is_migrate_cma(migratetype) unlikely((migratetype) == MIGRATE_CMA)
>  #else
> diff --git a/mm/compaction.c b/mm/compaction.c
> index 782772d..0460e4b 100644
> --- a/mm/compaction.c
> +++ b/mm/compaction.c
> @@ -1125,6 +1125,29 @@ static isolate_migrate_t isolate_migratepages(struct 
> zone *zone,
>   return cc->nr_migratepages ? ISOLATE_SUCCESS : ISOLATE_NONE;
>  }
>  
> +static bool can_steal_fallbacks(struct free_area *area,
> + unsigned int order, int migratetype)
> +{
> + int i;
> + int fallback_mt;
> +
> + if (area->nr_free == 0)
> + return false;
> +
> + for (i = 0; i < FALLBACK_MIGRATETYPES; i++) {
> + fallback_mt = fallbacks[migratetype][i];
> + if (fallback_mt == MIGRATE_RESERVE)
> + break;
> +
> + if (list_empty(&area->free_list[fallback_mt]))
> + continue;
> +
> + if (can_steal_freepages(order, migratetype, fallback_mt))
> + return true;
> + }
> + return false;
> +}
> +
>  static int __compact_finished(struct zone *zone, struct compact_control *cc,
>   const int migratetype)
>  {
> @@ -1175,8 +1198,11 @@ static int __compact_finished(struct zone *zone, 
> struct compact_control *cc,
>   if (!list_empty(&area->free_list[migratetype]))
>   return COMPACT_PARTIAL;
>  
> - /* Job done if allocation would set block type */
> - if (order >= pageblock_order && area->nr_free)
> + /*
> +  * Job done if allocation would steal freepages from
> +  * other migratetype buddy lists.
> +  */
> + if (can_steal_fallbacks(area, order, migratetype))
>   return COMPACT_PARTIAL;
>   }
>  
> diff --git a/mm/internal.h b/mm/internal.h
> index c4d6c9b..0a89a14 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -201,6 +201,7 @@ unsigned long
>  isolate_migratepages_range(struct compact_control *cc,
>  unsigned long low_pfn, unsigned long end_pfn);
>  
> +bool can_steal_freepages(unsigned int order, int start_mt, int fallback_mt);
>  #endif
>  
>  /*
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index ef74750..4c3538b 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1026,7 +1026,7 @@ struct page *__rmqueue_smallest(struct zone *zone, 
> unsigned int order,
>   * This array describes the order lists are fallen back to when
>   * the free lists for the desirable migrate type are depleted
>   */

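For reference, both can_steal_fallbacks() above and find_suitable_fallback()
from the v3 series walk the fallbacks[] table until they hit the
MIGRATE_RESERVE sentinel. At the time, that table in mm/page_alloc.c looked
roughly like the following (reproduced from memory, so treat the exact rows as
illustrative); the 4/4 patch mainly drops the static so compaction can
reference it:

static int fallbacks[MIGRATE_TYPES][4] = {
	[MIGRATE_UNMOVABLE]   = { MIGRATE_RECLAIMABLE, MIGRATE_MOVABLE,     MIGRATE_RESERVE },
	[MIGRATE_RECLAIMABLE] = { MIGRATE_UNMOVABLE,   MIGRATE_MOVABLE,     MIGRATE_RESERVE },
#ifdef CONFIG_CMA
	[MIGRATE_MOVABLE]     = { MIGRATE_CMA,         MIGRATE_RECLAIMABLE, MIGRATE_UNMOVABLE, MIGRATE_RESERVE },
	[MIGRATE_CMA]         = { MIGRATE_RESERVE },	/* Never used */
#else
	[MIGRATE_MOVABLE]     = { MIGRATE_RECLAIMABLE, MIGRATE_UNMOVABLE,   MIGRATE_RESERVE },
#endif
	[MIGRATE_RESERVE]     = { MIGRATE_RESERVE },	/* Never used */
#ifdef CONFIG_MEMORY_ISOLATION
	[MIGRATE_ISOLATE]     = { MIGRATE_RESERVE },	/* Never used */
#endif
};

Every row ends in MIGRATE_RESERVE, which is why the loops can terminate on that
sentinel rather than iterate a fixed count; FALLBACK_MIGRATETYPES (4) is just
the row width.
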
Re: [PATCH v2 3/4] mm/page_alloc: separate steal decision from steal behaviour part

2015-01-31 Thread Zhang Yanfei
At 2015/1/30 20:34, Joonsoo Kim wrote:
> From: Joonsoo 
> 
> This is a preparation step for using the page allocator's anti-fragmentation
> logic in compaction. This patch just separates the steal-decision part from
> the actual steal-behaviour part, so there is no functional change.
> 
> Signed-off-by: Joonsoo Kim 
> ---
>  mm/page_alloc.c | 49 -
>  1 file changed, 32 insertions(+), 17 deletions(-)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 8d52ab1..ef74750 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1122,6 +1122,24 @@ static void change_pageblock_range(struct page 
> *pageblock_page,
>   }
>  }
>  
> +static bool can_steal_freepages(unsigned int order,
> + int start_mt, int fallback_mt)
> +{
> + if (is_migrate_cma(fallback_mt))
> + return false;
> +
> + if (order >= pageblock_order)
> + return true;
> +
> + if (order >= pageblock_order / 2 ||
> + start_mt == MIGRATE_RECLAIMABLE ||
> + start_mt == MIGRATE_UNMOVABLE ||
> + page_group_by_mobility_disabled)
> + return true;
> +
> + return false;
> +}

So a comment describing the cases in which we can or cannot steal freepages
from another migratetype is necessary, IMHO. Actually, we could just move some
of the comments from try_to_steal_freepages() here.

Thanks.

> +
>  /*
>   * When we are falling back to another migratetype during allocation, try to
>   * steal extra free pages from the same pageblocks to satisfy further
> @@ -1138,9 +1156,10 @@ static void change_pageblock_range(struct page 
> *pageblock_page,
>   * as well.
>   */
>  static void try_to_steal_freepages(struct zone *zone, struct page *page,
> -   int start_type, int fallback_type)
> +   int start_type)
>  {
>   int current_order = page_order(page);
> + int pages;
>  
>   /* Take ownership for orders >= pageblock_order */
>   if (current_order >= pageblock_order) {
> @@ -1148,19 +1167,12 @@ static void try_to_steal_freepages(struct zone *zone, 
> struct page *page,
>   return;
>   }
>  
> - if (current_order >= pageblock_order / 2 ||
> - start_type == MIGRATE_RECLAIMABLE ||
> - start_type == MIGRATE_UNMOVABLE ||
> - page_group_by_mobility_disabled) {
> - int pages;
> + pages = move_freepages_block(zone, page, start_type);
>  
> - pages = move_freepages_block(zone, page, start_type);
> -
> - /* Claim the whole block if over half of it is free */
> - if (pages >= (1 << (pageblock_order-1)) ||
> - page_group_by_mobility_disabled)
> - set_pageblock_migratetype(page, start_type);
> - }
> + /* Claim the whole block if over half of it is free */
> + if (pages >= (1 << (pageblock_order-1)) ||
> + page_group_by_mobility_disabled)
> + set_pageblock_migratetype(page, start_type);
>  }
>  
>  /* Remove an element from the buddy allocator from the fallback list */
> @@ -1170,6 +1182,7 @@ __rmqueue_fallback(struct zone *zone, unsigned int 
> order, int start_migratetype)
>   struct free_area *area;
>   unsigned int current_order;
>   struct page *page;
> + bool can_steal;
>  
>   /* Find the largest possible block of pages in the other list */
>   for (current_order = MAX_ORDER-1;
> @@ -1192,10 +1205,11 @@ __rmqueue_fallback(struct zone *zone, unsigned int 
> order, int start_migratetype)
>   struct page, lru);
>   area->nr_free--;
>  
> - if (!is_migrate_cma(migratetype)) {
> + can_steal = can_steal_freepages(current_order,
> + start_migratetype, migratetype);
> + if (can_steal) {
>   try_to_steal_freepages(zone, page,
> - start_migratetype,
> - migratetype);
> + start_migratetype);
>   } else {
>   /*
>* When borrowing from MIGRATE_CMA, we need to
> @@ -1203,7 +1217,8 @@ __rmqueue_fallback(struct zone *zone, unsigned int 
> order, int start_migratetype)
>* itself, and we do not try to steal extra
>* free pages.
>*/
> - buddy_type = migratetype;
> + if (is_migrate_cma(migratetype))
> + buddy_type = migratetype;
>   }
>  
>   /* Remove the page from the freelists */
> 

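One way Zhang's suggestion above could look is sketched below: the decision
helper carries the reasoning that used to sit above try_to_steal_freepages(),
so a reader of can_steal_freepages() sees why each case may steal. This is
illustrative only, not the posted patch; the comment text is adapted from the
existing comments quoted earlier in the thread:

/*
 * When falling back to another migratetype during allocation, stealing extra
 * free pages from the same pageblock is only worthwhile when it helps to
 * avoid future fragmentation:
 *
 * - MIGRATE_CMA pageblocks are never taken over; they must keep their
 *   migratetype so contiguous (CMA) allocations keep working.
 * - A reasonably large request (order >= pageblock_order / 2) makes claiming
 *   the rest of the pageblock worthwhile.
 * - Unmovable and reclaimable requests may always steal, because
 *   fragmentation caused by those allocations polluting movable pageblocks
 *   is worse than movable allocations stealing from unmovable and
 *   reclaimable pageblocks.
 * - With page_group_by_mobility_disabled, grouping is off anyway.
 */
static bool can_steal_freepages(unsigned int order,
				int start_mt, int fallback_mt)
{
	if (is_migrate_cma(fallback_mt))
		return false;

	/* Orders >= pageblock_order simply take ownership of the block */
	if (order >= pageblock_order)
		return true;

	if (order >= pageblock_order / 2 ||
			start_mt == MIGRATE_RECLAIMABLE ||
			start_mt == MIGRATE_UNMOVABLE ||
			page_group_by_mobility_disabled)
		return true;

	return false;
}
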
Re: [PATCH v2 2/4] mm/compaction: stop the isolation when we isolate enough freepage

2015-01-31 Thread Zhang Yanfei
At 2015/1/31 16:31, Vlastimil Babka wrote:
> On 01/31/2015 08:49 AM, Zhang Yanfei wrote:
>> Hello,
>>
>> At 2015/1/30 20:34, Joonsoo Kim wrote:
>>
>> Reviewed-by: Zhang Yanfei 
>>
>> IMHO, by making the free scanner move more slowly, the patch makes the two
>> scanners meet further along. Before this patch, if we isolated too many free
>> pages, then even after releasing the unneeded ones later the free scanner had
>> already advanced past them and would move forward again next time -- the free
>> scanner simply could not be moved back to grab the free pages we released, no
>> matter where those pages ended up, pcp or buddy.
> 
> It can be actually moved back. If we are releasing free pages, it means the
> current compaction is terminating, and it will set 
> zone->compact_cached_free_pfn
> back to the position of the released free page that was furthest back. The 
> next
> compaction will start from the cached free pfn.

Yeah, you are right. I missed the release_freepages(). Thanks!

> 
> It is however possible that another compaction runs in parallel and has
> progressed further and overwrites the cached free pfn.
> 

Hmm, maybe.

Thanks.

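To make the mechanism described above concrete: on termination, compaction
gives back any isolated-but-unused freepages and rewinds the cached
free-scanner position to the furthest-back (highest) released pfn. A sketch of
the idea only -- the open-coded loop stands in for release_freepages() and the
exact code in the tree under discussion differs:

	unsigned long highest_pfn = 0;
	struct page *page, *next;

	/* Return leftover isolated freepages to the buddy allocator */
	list_for_each_entry_safe(page, next, &cc->freepages, lru) {
		unsigned long pfn = page_to_pfn(page);

		if (pfn > highest_pfn)
			highest_pfn = pfn;
		list_del(&page->lru);
		__free_page(page);
	}
	cc->nr_freepages = 0;

	/* Only rewind: the free scanner moves from high pfn toward low pfn */
	if (highest_pfn > zone->compact_cached_free_pfn)
		zone->compact_cached_free_pfn =
			highest_pfn & ~(pageblock_nr_pages - 1);

As Vlastimil notes, a parallel compaction run can still overwrite the cached
pfn afterwards.
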

Re: [PATCH v2 2/4] mm/compaction: stop the isolation when we isolate enough freepage

2015-01-30 Thread Zhang Yanfei
Hello,

At 2015/1/30 20:34, Joonsoo Kim wrote:
> From: Joonsoo 
> 
> Currently, freepage isolation in one pageblock doesn't consider how many
> freepages we isolate. When I traced the flow of compaction, compaction
> sometimes isolated more than 256 freepages to migrate just 32 pages.
> 
> In this patch, freepage isolation is stopped at the point where we have more
> isolated freepages than isolated pages for migration. This slows down the
> free page scanner and makes the compaction success rate higher.
> 
> The stress-highalloc test in mmtests with non-movable order-7 allocations
> shows an increase in compaction success rate.
> 
> Compaction success rate (Compaction success * 100 / Compaction stalls, %)
> 27.13 : 31.82
> 
> pfn where both scanners meet when compaction completes
> (separate test due to enormous tracepoint buffer)
> (zone_start=4096, zone_end=1048576)
> 586034 : 654378
> 
> In fact, I don't fully understand why this patch gives such a good result.
> One guess was that unused freepages are released to the pcp list, and on the
> next compaction attempt we won't isolate them again, so the compaction success
> rate would decrease. To prevent this effect, I tested with pcp drain code
> added to release_freepages(), but it did not help.
> 
> Anyway, this patch reduces the time wasted isolating unneeded freepages, so it
> seems reasonable.

Reviewed-by: Zhang Yanfei 

IMHO, by making the free scanner move more slowly, the patch makes the two
scanners meet further along. Before this patch, if we isolated too many free
pages, then even after releasing the unneeded ones later the free scanner had
already advanced past them and would move forward again next time -- the free
scanner simply could not be moved back to grab the free pages we released, no
matter where those pages ended up, pcp or buddy.

> 
> Signed-off-by: Joonsoo Kim 
> ---
>  mm/compaction.c | 17 ++---
>  1 file changed, 10 insertions(+), 7 deletions(-)
> 
> diff --git a/mm/compaction.c b/mm/compaction.c
> index 4954e19..782772d 100644
> --- a/mm/compaction.c
> +++ b/mm/compaction.c
> @@ -490,6 +490,13 @@ static unsigned long isolate_freepages_block(struct 
> compact_control *cc,
>  
>   /* If a page was split, advance to the end of it */
>   if (isolated) {
> + cc->nr_freepages += isolated;
> + if (!strict &&
> + cc->nr_migratepages <= cc->nr_freepages) {
> + blockpfn += isolated;
> + break;
> + }
> +
>   blockpfn += isolated - 1;
>   cursor += isolated - 1;
>   continue;
> @@ -899,7 +906,6 @@ static void isolate_freepages(struct compact_control *cc)
>   unsigned long isolate_start_pfn; /* exact pfn we start at */
>   unsigned long block_end_pfn;/* end of current pageblock */
>   unsigned long low_pfn;   /* lowest pfn scanner is able to scan */
> - int nr_freepages = cc->nr_freepages;
>   struct list_head *freelist = &cc->freepages;
>  
>   /*
> @@ -924,11 +930,11 @@ static void isolate_freepages(struct compact_control 
> *cc)
>* pages on cc->migratepages. We stop searching if the migrate
>* and free page scanners meet or enough free pages are isolated.
>*/
> - for (; block_start_pfn >= low_pfn && cc->nr_migratepages > nr_freepages;
> + for (; block_start_pfn >= low_pfn &&
> + cc->nr_migratepages > cc->nr_freepages;
>   block_end_pfn = block_start_pfn,
>   block_start_pfn -= pageblock_nr_pages,
>   isolate_start_pfn = block_start_pfn) {
> - unsigned long isolated;
>  
>   /*
>* This can iterate a massively long zone without finding any
> @@ -953,9 +959,8 @@ static void isolate_freepages(struct compact_control *cc)
>   continue;
>  
>   /* Found a block suitable for isolating free pages from. */
> - isolated = isolate_freepages_block(cc, &isolate_start_pfn,
> + isolate_freepages_block(cc, &isolate_start_pfn,
>   block_end_pfn, freelist, false);
> - nr_freepages += isolated;
>  
>   /*
>* Remember where the free scanner should restart next time,
> @@ -987,8 +992,6 @@ static void isolate_freepages(struct compact_control *cc)
>*/
>   if (block_start_pfn < low_pfn)
>   cc->free_pfn = cc->migrate_pfn;
> -
> - cc->nr_freepages = nr_freepages;
>  }
>  
>  /*
> 


Re: [PATCH v2 1/4] mm/compaction: fix wrong order check in compact_finished()

2015-01-30 Thread Zhang Yanfei
Hello,

At 2015/1/30 20:34, Joonsoo Kim wrote:
> What we want to check here is whether there is a high-order freepage in the
> buddy list of another migratetype that we could steal without causing
> fragmentation. But the current code just checks cc->order, which is the
> allocation request order, so this is wrong.
> 
> Without this fix, non-movable synchronous compaction below pageblock order
> would not be stopped until compaction is complete, because the migratetype of
> most pageblocks is movable and high-order freepages made by compaction usually
> end up on the movable buddy list.
> 
> There is a report related to this bug; see the link below.
> 
> http://www.spinics.net/lists/linux-mm/msg81666.html
> 
> Although the affected system still has load spikes coming from compaction,
> this change makes the system completely stable and responsive according to
> his report.
> 
> The stress-highalloc test in mmtests with non-movable order-7 allocations
> doesn't show any notable difference in allocation success rate, but it does
> show a higher compaction success rate.
> 
> Compaction success rate (Compaction success * 100 / Compaction stalls, %)
> 18.47 : 28.94
> 
> Cc: 
> Acked-by: Vlastimil Babka 
> Signed-off-by: Joonsoo Kim 

Reviewed-by: Zhang Yanfei 

> ---
>  mm/compaction.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/mm/compaction.c b/mm/compaction.c
> index b68736c..4954e19 100644
> --- a/mm/compaction.c
> +++ b/mm/compaction.c
> @@ -1173,7 +1173,7 @@ static int __compact_finished(struct zone *zone, struct 
> compact_control *cc,
>   return COMPACT_PARTIAL;
>  
>   /* Job done if allocation would set block type */
> - if (cc->order >= pageblock_order && area->nr_free)
> + if (order >= pageblock_order && area->nr_free)
>   return COMPACT_PARTIAL;
>   }
>  
> 

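To spell out the distinction behind the one-line fix above: cc->order is the
order the direct compactor was asked for, while order is the free-list order
currently being examined by the loop in __compact_finished(); the
steal-a-whole-pageblock shortcut has to test the latter. A compressed view of
the loop as it looks with the fix applied (sketch, unrelated details omitted):

	for (order = cc->order; order < MAX_ORDER; order++) {
		struct free_area *area = &zone->free_area[order];

		/* Job done if page is free of the right migratetype */
		if (!list_empty(&area->free_list[migratetype]))
			return COMPACT_PARTIAL;

		/*
		 * Job done if allocation would set block type: test the
		 * free-list order being examined, not the request order.
		 */
		if (order >= pageblock_order && area->nr_free)
			return COMPACT_PARTIAL;
	}
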

Re: [PATCH v3] mm: incorporate read-only pages into transparent huge pages

2015-01-28 Thread Zhang Yanfei
Hello

At 2015/1/28 1:39, Ebru Akagunduz wrote:
> This patch aims to improve THP collapse rates, by allowing
> THP collapse in the presence of read-only ptes, like those
> left in place by do_swap_page after a read fault.
>
> Currently THP can collapse 4kB pages into a THP when
> there are up to khugepaged_max_ptes_none pte_none ptes
> in a 2MB range. This patch applies the same limit for
> read-only ptes.
>
> The patch was tested with a test program that allocates
> 800MB of memory, writes to it, and then sleeps. I force
> the system to swap out all but 190MB of the program by
> touching other memory. Afterwards, the test program does
> a mix of reads and writes to its memory, and the memory
> gets swapped back in.
>
> Without the patch, only the memory that did not get
> swapped out remained in THPs, which corresponds to 24% of
> the memory of the program. The percentage did not increase
> over time.
>
> With this patch, after 5 minutes of waiting khugepaged had
> collapsed 50% of the program's memory back into THPs.
>
> Signed-off-by: Ebru Akagunduz 
> Reviewed-by: Rik van Riel 
> Acked-by: Vlastimil Babka 

Please feel free to add:

Acked-by: Zhang Yanfei 

> ---
> Changes in v2:
>  - Remove extra code indent (Vlastimil Babka)
>  - Add comment line for check condition of page_count() (Vlastimil Babka)
>  - Add fast path optimistic check to
>__collapse_huge_page_isolate() (Andrea Arcangeli)
>  - Move check condition of page_count() below to trylock_page() (Andrea 
> Arcangeli)
>
> Changes in v3:
>  - Add a at-least-one-writable-pte check (Zhang Yanfei)
>  - Debug page count (Vlastimil Babka, Andrea Arcangeli)
>  - Increase read-only pte counter if pte is none (Andrea Arcangeli)
>
> I've written down test results:
> With the patch:
> After swapped out:
> cat /proc/pid/smaps:
> Anonymous:  100464 kB
> AnonHugePages:  100352 kB
> Swap:   699540 kB
> Fraction:   99,88
>
> cat /proc/meminfo:
> AnonPages:  1754448 kB
> AnonHugePages:  1716224 kB
> Fraction:   97,82
>
> After swapped in:
> In a few seconds:
> cat /proc/pid/smaps:
> Anonymous:  84 kB
> AnonHugePages:  145408 kB
> Swap:   0 kB
> Fraction:   18,17
>
> cat /proc/meminfo:
> AnonPages:  2455016 kB
> AnonHugePages:  1761280 kB
> Fraction:   71,74
>
> In 5 minutes:
> cat /proc/pid/smaps
> Anonymous:  84 kB
> AnonHugePages:  407552 kB
> Swap:   0 kB
> Fraction:   50,94
>
> cat /proc/meminfo:
> AnonPages:  2456872 kB
> AnonHugePages:  2023424 kB
> Fraction:   82,35
>
> Without the patch:
> After swapped out:
> cat /proc/pid/smaps:
> Anonymous:  190660 kB
> AnonHugePages:  190464 kB
> Swap:   609344 kB
> Fraction:   99,89
>
> cat /proc/meminfo:
> AnonPages:  1740456 kB
> AnonHugePages:  1667072 kB
> Fraction:   95,78
>
> After swapped in:
> cat /proc/pid/smaps:
> Anonymous:  84 kB
> AnonHugePages:  190464 kB
> Swap:   0 kB
> Fraction:   23,80
>
> cat /proc/meminfo:
> AnonPages:  2350032 kB
> AnonHugePages:  1667072 kB
> Fraction:   70,93
>
> I waited 10 minutes and the fractions
> did not change without the patch.
>
>  mm/huge_memory.c | 60 
> +---
>  1 file changed, 49 insertions(+), 11 deletions(-)
>
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 817a875..17d6e59 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -2148,17 +2148,18 @@ static int __collapse_huge_page_isolate(struct 
> vm_area_struct *vma,
>  {
>   struct page *page;
>   pte_t *_pte;
> - int referenced = 0, none = 0;
> + int referenced = 0, none = 0, ro = 0, writable = 0;
>   for (_pte = pte; _pte < pte+HPAGE_PMD_NR;
>_pte++, address += PAGE_SIZE) {
>   pte_t pteval = *_pte;
>   if (pte_none(pteval)) {
> + ro++;
>   if (++none <= khugepaged_max_ptes_none)
>   continue;
>   else
>   goto out;
>   }
> - if (!pte_present(pteval) || !pte_write(pteval))
> + if (!pte_present(pteval))
>   goto out;
>   page = vm_normal_page(vma, address, pteval);
>   if (unlikely(!page))
> @@ -2168,9 +2169,6 @@ static int __collapse_huge_page_isolate(struct 
> vm_area_struct *vma,
>   VM_BUG_ON_PAGE(!PageAnon(page), page);
>   VM_BUG_ON_PAGE(!PageSwapBacked(page), page);
>  
> - /* cannot u

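A minimal sketch of the kind of test program described in the changelog above
(not Ebru's actual test): allocate a large anonymous region, write to every
page, then sleep so that external memory pressure can push most of it to swap
before the mixed read/write phase:

#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define REGION_SIZE (800UL << 20)	/* 800MB, as in the description above */

int main(void)
{
	char *buf = malloc(REGION_SIZE);

	if (!buf)
		return 1;

	/* Write-fault every page so all ptes are present and dirty */
	memset(buf, 0x5a, REGION_SIZE);

	/* Give the harness time to generate pressure and swap most of us out */
	sleep(600);

	/*
	 * A real test would now perform a mix of reads and writes across buf
	 * and watch AnonHugePages in /proc/<pid>/smaps to see how much
	 * khugepaged collapses back into THPs.
	 */
	return 0;
}
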
Re: [PATCH v3] mm: incorporate read-only pages into transparent huge pages

2015-01-28 Thread Zhang Yanfei
Hello

At 2015/1/28 8:27, Andrea Arcangeli wrote:
> On Tue, Jan 27, 2015 at 07:39:13PM +0200, Ebru Akagunduz wrote:
>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>> index 817a875..17d6e59 100644
>> --- a/mm/huge_memory.c
>> +++ b/mm/huge_memory.c
>> @@ -2148,17 +2148,18 @@ static int __collapse_huge_page_isolate(struct 
>> vm_area_struct *vma,
>>  {
>>  struct page *page;
>>  pte_t *_pte;
>> -int referenced = 0, none = 0;
>> +int referenced = 0, none = 0, ro = 0, writable = 0;
> So your "writable" addition is enough and simpler/better than "ro"
> counting. Once "ro" is removed "writable" can actually start to make a
> difference (at the moment it does not).
>
> I'd suggest to remove "ro".
>
> The sysctl was there only to reduce the memory footprint but
> collapsing readonly swapcache won't reduce the memory footprint. So it
> may have been handy before but this new "writable" looks better now
> and keeping both doesn't help (keeping "ro" around prevents "writable"
> from making a difference).

Agreed.

>
>> @@ -2179,6 +2177,34 @@ static int __collapse_huge_page_isolate(struct 
>> vm_area_struct *vma,
>>   */
>>  if (!trylock_page(page))
>>  goto out;
>> +
>> +/*
>> + * cannot use mapcount: can't collapse if there's a gup pin.
>> + * The page must only be referenced by the scanned process
>> + * and page swap cache.
>> + */
>> +if (page_count(page) != 1 + !!PageSwapCache(page)) {
>> +unlock_page(page);
>> +goto out;
>> +}
>> +if (!pte_write(pteval)) {
>> +if (++ro > khugepaged_max_ptes_none) {
>> +unlock_page(page);
>> +goto out;
>> +}
>> +if (PageSwapCache(page) && !reuse_swap_page(page)) {
>> +unlock_page(page);
>> +goto out;
>> +}
>> +/*
>> + * Page is not in the swap cache, and page count is
>> + * one (see above). It can be collapsed into a THP.
>> + */
>> +VM_BUG_ON(page_count(page) != 1);
> In an earlier email I commented on this suggestion you received during
> previous code review: the VM_BUG_ON is not ok because it can generate
> false positives.
>
> It's perfectly ok if page_count is not 1 if the page is isolated by
> another CPU (another cpu calling isolate_lru_page).
>
> The page_count check there is to ensure there are no gup-pins, and
> that is achieved during the check. The VM may still mangle the
> page_count and it's ok (the page count taken by the VM running in
> another CPU doesn't need to be transferred to the collapsed THP).
>
> In short, the check "page_count(page) != 1 + !!PageSwapCache(page)"
> doesn't imply that the page_count cannot change. It only means at any
> given time there was no gup-pin at the very time of the check. It also
> means there were no other VM pin, but what we care about is only the
> gup-pin. The VM LRU pin can still be taken after the check and it's
> ok. The GUP pin cannot be taken because we stopped all gup so we're
> safe if the check passes.
>
> So you can simply delete the VM_BUG_ON, the earlier code there, was fine.

So IMO, the comment should also be removed or changed as it may
mislead someone again later.
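
In other words, that spot could simply read like this (a sketch with the
comment reworded so it no longer suggests the count stays pinned at 1; my
wording, not the actual patch):

	/*
	 * Cannot use mapcount: can't collapse if there's a gup pin.
	 * The page must only be referenced by the scanned process and the
	 * swap cache *at the time of this check*; transient references
	 * taken afterwards (e.g. isolate_lru_page() on another CPU) are
	 * fine, so the count is not re-asserted later.
	 */
	if (page_count(page) != 1 + !!PageSwapCache(page)) {
		unlock_page(page);
		goto out;
	}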

Thanks
Zhang

>
>> +} else {
>> +writable = 1;
>> +}
>> +
> I suggest to make writable a bool and use writable = false to init,
> and writable = true above.
>
> When a value can only be 0|1, bool is better (it can be cast and
> takes the same memory as an int, it just allows the compiler to be
> more strict and the fact it makes the code more self explanatory).
>
>> +if (++ro > khugepaged_max_ptes_none)
>> +goto out_unmap;
> As mentioned above the ro counting can go, and we can keep only
> your new writable addition, as mentioned above.
>
> Thanks,
> Andrea
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] mm: incorporate read-only pages into transparent huge pages

2015-01-25 Thread Zhang Yanfei
Hello

On 2015/1/25 17:25, Vlastimil Babka wrote:
> On 23.1.2015 20:18, Andrea Arcangeli wrote:
>>> >+if (!pte_write(pteval)) {
>>> >+if (++ro > khugepaged_max_ptes_none)
>>> >+goto out_unmap;
>>> >+}
>> It's true this is maxed out at 511, so there must be at least one
>> writable and not none pte (as results of the two "ro" and "none"
>> counters checks).
> 
> Hm, but if we consider ro and pte_none separately, both can be lower
> than 512, but the sum of the two can be 512, so we can actually be in
> read-only VMA?

Yes, I also think so.

So is it necessary to add an at-least-one-writable-pte check just like the
existing at-least-one-page-referenced check?
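
For illustration, such a check could be as small as the following at the end
of the pte scan (the writable flag is an assumed local, set whenever a
writable pte is seen; this is only a sketch, not the eventual patch):

	/* refuse to collapse if no pte in the range is writable */
	if (!writable)
		goto out_unmap;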

Thanks.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] CMA: treat free cma pages as non-free if not ALLOC_CMA on watermark checking

2015-01-20 Thread Zhang Yanfei
Hello Minchan,

How are you?

On 2015/1/19 14:55, Minchan Kim wrote:
> Hello,
> 
> On Sun, Jan 18, 2015 at 04:32:59PM +0800, Hui Zhu wrote:
>> From: Hui Zhu 
>>
>> The original of this patch [1] is part of Joonsoo's CMA patch series.
>> I made a patch [2] to fix the issue of this patch.  Joonsoo reminded me
>> that this issue affect current kernel too.  So made a new one for upstream.
> 
> Recently, we found many problems of CMA and Joonsoo tried to add more
> hooks into MM like aggressive allocation but I suggested adding a new zone

Just out of curiosity, "new zone"? Something like movable zone?

Thanks.

> would be more desirable than more hooks in mm fast path in various aspect.
> (ie, remove lots of hooks in hot path of MM, don't need reclaim hooks
>  for special CMA pages, don't need custom fair allocation for CMA).
> 
> Joonsoo is investigating the direction so please wait.
> If it turns out we have lots of hurdle to go that way,
> this direction(ie, putting more hooks) should be second plan.
> 
> Thanks.
> 

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC PATCH 4/5] mm, compaction: allow scanners to start at any pfn within the zone

2015-01-20 Thread Zhang Yanfei
Hello Vlastimil

On 2015/1/19 18:05, Vlastimil Babka wrote:
> Compaction employs two page scanners - migration scanner isolates pages to be
> the source of migration, free page scanner isolates pages to be the target of
> migration. Currently, migration scanner starts at the zone's first pageblock
> and progresses towards the last one. Free scanner starts at the last pageblock
> and progresses towards the first one. Within a pageblock, each scanner scans
> pages from the first to the last one. When the scanners meet within the same
> pageblock, compaction terminates.
> 
> One consequence of the current scheme, that turns out to be unfortunate, is
> that the migration scanner does not encounter the pageblocks which were
> scanned by the free scanner. In a test with stress-highalloc from mmtests,
> the scanners were observed to meet around the middle of the zone in first two
> phases (with background memory pressure) of the test when executed after fresh
> reboot. On further executions without reboot, the meeting point shifts to
> roughly third of the zone, and compaction activity as well as allocation
> success rates deteriorates compared to the run after fresh reboot.
> 
> It turns out that the deterioration is indeed due to the migration scanner
> processing only a small part of the zone. Compaction also keeps making this
> bias worse by its activity - by moving all migratable pages towards end of the
> zone, the free scanner has to scan a lot of full pageblocks to find more free
> pages. The beginning of the zone contains pageblocks that have been compacted
> as much as possible, but the free pages there cannot be further merged into
> larger orders due to unmovable pages. The rest of the zone might contain more
> suitable pageblocks, but the migration scanner will not reach them. It also
> isn't able to move movable pages out of unmovable pageblocks there, which
> affects fragmentation.
> 
> This patch is the first step to remove this bias. It allows the compaction
> scanners to start at arbitrary pfn (aligned to pageblock for practical
> purposes), called pivot, within the zone. The migration scanner starts at the
> exact pfn, the free scanner starts at the pageblock preceding the pivot. The
> direction of scanning is unaffected, but when the migration scanner reaches
> the last pageblock of the zone, or the free scanner reaches the first
> pageblock, they wrap and continue with the first or last pageblock,
> respectively. Compaction terminates when any of the scanners wrap and both
> meet within the same pageblock.
> 
> For easier bisection of potential regressions, this patch always uses the
> first zone's pfn as the pivot. That means the free scanner immediately wraps
> to the last pageblock and the operation of scanners is thus unchanged. The
> actual pivot changing is done by the next patch.
> 
> Signed-off-by: Vlastimil Babka 

I read through the whole patch, and you can feel free to add:

Acked-by: Zhang Yanfei 

I agree with you and the approach to improve the current scheme. One thing
I think should be carefully treated is how to avoid migrating back and forth
since the pivot pfn can be changed. I see patch 5 has introduced a policy to
change the pivot so we can have a careful observation on it.

(The changes in the patch make the code more difficult to understand now...
and I just found a tiny mistake; please see below)
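
To make the termination rule easier to picture, here is a rough sketch of
what the wrap-aware check could look like; the wrapped flags and the helper
name are my assumptions, not fields of the real struct compact_control:

	static bool compact_scanners_met_wrapped(struct compact_control *cc)
	{
		/* Before either scanner has wrapped they cannot have met. */
		if (!cc->migrate_wrapped && !cc->free_wrapped)
			return false;

		/* After a wrap the scanners converge again, so the usual
		 * pageblock-granularity test applies. */
		return (cc->free_pfn >> pageblock_order)
			<= (cc->migrate_pfn >> pageblock_order);
	}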

> Cc: Minchan Kim 
> Cc: Mel Gorman 
> Cc: Joonsoo Kim 
> Cc: Michal Nazarewicz 
> Cc: Naoya Horiguchi 
> Cc: Christoph Lameter 
> Cc: Rik van Riel 
> Cc: David Rientjes 
> ---
>  include/linux/mmzone.h |   2 +
>  mm/compaction.c| 204 
> +++--
>  mm/internal.h  |   1 +
>  3 files changed, 182 insertions(+), 25 deletions(-)
> 
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index 2f0856d..47aa181 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -503,6 +503,8 @@ struct zone {
>   unsigned long percpu_drift_mark;
>  
>  #if defined CONFIG_COMPACTION || defined CONFIG_CMA
> + /* pfn where compaction scanners have initially started last time */
> + unsigned long   compact_cached_pivot_pfn;
>   /* pfn where compaction free scanner should start */
>   unsigned long   compact_cached_free_pfn;
>   /* pfn where async and sync compaction migration scanner should start */
> diff --git a/mm/compaction.c b/mm/compaction.c
> index 5626220..abae89a 100644
> --- a/mm/compaction.c
> +++ b/mm/compaction.c
> @@ -123,11 +123,16 @@ static inline bool isolation_suitable(struct 
> compact_control *cc,
>   return !get_pageblock_skip(page);
>  }
>  
> +/*
> + * Invalidate cached compaction scanner positions, so tha

Re: [PATCH 3/5] mm, compaction: encapsulate resetting cached scanner positions

2015-01-20 Thread Zhang Yanfei
On 2015/1/19 18:05, Vlastimil Babka wrote:
> Resetting the cached compaction scanner positions is now done implicitly in
> __reset_isolation_suitable() and compact_finished(). Encapsulate the
> functionality in a new function reset_cached_positions() and call it
> explicitly where needed.
> 
> Signed-off-by: Vlastimil Babka 

Reviewed-by: Zhang Yanfei 

Should the new function be inline?

Thanks.

> Cc: Minchan Kim 
> Cc: Mel Gorman 
> Cc: Joonsoo Kim 
> Cc: Michal Nazarewicz 
> Cc: Naoya Horiguchi 
> Cc: Christoph Lameter 
> Cc: Rik van Riel 
> Cc: David Rientjes 
> ---
>  mm/compaction.c | 22 ++
>  1 file changed, 14 insertions(+), 8 deletions(-)
> 
> diff --git a/mm/compaction.c b/mm/compaction.c
> index 45799a4..5626220 100644
> --- a/mm/compaction.c
> +++ b/mm/compaction.c
> @@ -123,6 +123,13 @@ static inline bool isolation_suitable(struct 
> compact_control *cc,
>   return !get_pageblock_skip(page);
>  }
>  
> +static void reset_cached_positions(struct zone *zone)
> +{
> + zone->compact_cached_migrate_pfn[0] = zone->zone_start_pfn;
> + zone->compact_cached_migrate_pfn[1] = zone->zone_start_pfn;
> + zone->compact_cached_free_pfn = zone_end_pfn(zone);
> +}
> +
>  /*
>   * This function is called to clear all cached information on pageblocks that
>   * should be skipped for page isolation when the migrate and free page 
> scanner
> @@ -134,9 +141,6 @@ static void __reset_isolation_suitable(struct zone *zone)
>   unsigned long end_pfn = zone_end_pfn(zone);
>   unsigned long pfn;
>  
> - zone->compact_cached_migrate_pfn[0] = start_pfn;
> - zone->compact_cached_migrate_pfn[1] = start_pfn;
> - zone->compact_cached_free_pfn = end_pfn;
>   zone->compact_blockskip_flush = false;
>  
>   /* Walk the zone and mark every pageblock as suitable for isolation */
> @@ -166,8 +170,10 @@ void reset_isolation_suitable(pg_data_t *pgdat)
>   continue;
>  
>   /* Only flush if a full compaction finished recently */
> - if (zone->compact_blockskip_flush)
> + if (zone->compact_blockskip_flush) {
>   __reset_isolation_suitable(zone);
> + reset_cached_positions(zone);
> + }
>   }
>  }
>  
> @@ -1059,9 +1065,7 @@ static int compact_finished(struct zone *zone, struct 
> compact_control *cc,
>   /* Compaction run completes if the migrate and free scanner meet */
>   if (compact_scanners_met(cc)) {
>   /* Let the next compaction start anew. */
> - zone->compact_cached_migrate_pfn[0] = zone->zone_start_pfn;
> - zone->compact_cached_migrate_pfn[1] = zone->zone_start_pfn;
> - zone->compact_cached_free_pfn = zone_end_pfn(zone);
> + reset_cached_positions(zone);
>  
>   /*
>* Mark that the PG_migrate_skip information should be cleared
> @@ -1187,8 +1191,10 @@ static int compact_zone(struct zone *zone, struct 
> compact_control *cc)
>* is about to be retried after being deferred. kswapd does not do
>* this reset as it'll reset the cached information when going to sleep.
>*/
> - if (compaction_restarting(zone, cc->order) && !current_is_kswapd())
> + if (compaction_restarting(zone, cc->order) && !current_is_kswapd()) {
>   __reset_isolation_suitable(zone);
> + reset_cached_positions(zone);
> + }
>  
>   /*
>* Setup to move all movable pages to the end of the zone. Used cached
> 
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 2/5] mm, compaction: simplify handling restart position in free pages scanner

2015-01-20 Thread Zhang Yanfei
Hello,

On 2015/1/19 18:05, Vlastimil Babka wrote:
> Handling the position where compaction free scanner should restart (stored in
> cc->free_pfn) got more complex with commit e14c720efdd7 ("mm, compaction:
> remember position within pageblock in free pages scanner"). Currently the
> position is updated in each loop iteration of isolate_freepages(), although it's
> enough to update it only when exiting the loop when we have found enough free
> pages, or detected contention in async compaction. Then an extra check outside
> the loop updates the position in case we have met the migration scanner.
> 
> This can be simplified if we move the test for having isolated enough from
> for loop header next to the test for contention, and determining the restart
> position only in these cases. We can reuse the isolate_start_pfn variable for
> this instead of setting cc->free_pfn directly. Outside the loop, we can simply
> set cc->free_pfn to value of isolate_start_pfn without extra check.
> 
> We also add VM_BUG_ON to future-proof the code, in case somebody adds a new
> condition that terminates isolate_freepages_block() prematurely, which
> wouldn't be also considered in isolate_freepages().
> 
> Signed-off-by: Vlastimil Babka 
> Cc: Minchan Kim 
> Cc: Mel Gorman 
> Cc: Joonsoo Kim 
> Cc: Michal Nazarewicz 
> Cc: Naoya Horiguchi 
> Cc: Christoph Lameter 
> Cc: Rik van Riel 
> Cc: David Rientjes 
> ---
>  mm/compaction.c | 34 +++---
>  1 file changed, 19 insertions(+), 15 deletions(-)
> 
> diff --git a/mm/compaction.c b/mm/compaction.c
> index 5fdbdb8..45799a4 100644
> --- a/mm/compaction.c
> +++ b/mm/compaction.c
> @@ -849,7 +849,7 @@ static void isolate_freepages(struct compact_control *cc)
>* pages on cc->migratepages. We stop searching if the migrate
>* and free page scanners meet or enough free pages are isolated.
>*/
> - for (; block_start_pfn >= low_pfn && cc->nr_migratepages > nr_freepages;
> + for (; block_start_pfn >= low_pfn;
>   block_end_pfn = block_start_pfn,
>   block_start_pfn -= pageblock_nr_pages,
>   isolate_start_pfn = block_start_pfn) {
> @@ -883,6 +883,8 @@ static void isolate_freepages(struct compact_control *cc)
>   nr_freepages += isolated;
>  
>   /*
> +  * If we isolated enough freepages, or aborted due to async
> +  * compaction being contended, terminate the loop.
>* Remember where the free scanner should restart next time,
>* which is where isolate_freepages_block() left off.
>* But if it scanned the whole pageblock, isolate_start_pfn
> @@ -891,28 +893,30 @@ static void isolate_freepages(struct compact_control 
> *cc)
>* In that case we will however want to restart at the start
>* of the previous pageblock.
>*/
> - cc->free_pfn = (isolate_start_pfn < block_end_pfn) ?
> - isolate_start_pfn :
> - block_start_pfn - pageblock_nr_pages;
> -
> - /*
> -  * isolate_freepages_block() might have aborted due to async
> -  * compaction being contended
> -  */
> - if (cc->contended)
> + if ((nr_freepages > cc->nr_migratepages) || cc->contended) {

Shouldn't this be nr_freepages >= cc->nr_migratepages?
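
(For a concrete boundary case: if cc->nr_migratepages is 32 and a pageblock
brings nr_freepages to exactly 32, the old loop header condition
cc->nr_migratepages > nr_freepages stops the scan, while the new strict test
above does not break and scans one more pageblock. That is the case the
question is about, if I read the diff correctly.)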

Thanks

> + if (isolate_start_pfn >= block_end_pfn)
> + isolate_start_pfn =
> + block_start_pfn - pageblock_nr_pages;
>   break;
> + } else {
> + /*
> +  * isolate_freepages_block() should not terminate
> +  * prematurely unless contended, or isolated enough
> +  */
> + VM_BUG_ON(isolate_start_pfn < block_end_pfn);
> + }
>   }
>  
>   /* split_free_page does not map the pages */
>   map_pages(freelist);
>  
>   /*
> -  * If we crossed the migrate scanner, we want to keep it that way
> -  * so that compact_finished() may detect this
> +  * Record where the free scanner will restart next time. Either we
> +  * broke from the loop and set isolate_start_pfn based on the last
> +  * call to isolate_freepages_block(), or we met the migration scanner
> +  * and the loop terminated due to isolate_start_pfn < low_pfn
>*/
> - if (block_start_pfn < low_pfn)
> - cc->free_pfn = cc->migrate_pfn;
> -
> + cc->free_pfn = isolate_start_pfn;
>   cc->nr_freepages = nr_freepages;
>  }
>  
> 
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  

Re: [PATCH 1/5] mm, compaction: more robust check for scanners meeting

2015-01-20 Thread Zhang Yanfei
On 2015/1/19 18:05, Vlastimil Babka wrote:
> Compaction should finish when the migration and free scanner meet, i.e. they
> reach the same pageblock. Currently however, the test in compact_finished()
> simply just compares the exact pfns, which may yield a false negative when the
> free scanner position is in the middle of a pageblock and the migration
> scanner reaches the beginning of the same pageblock.
> 
> This hasn't been a problem until commit e14c720efdd7 ("mm, compaction:
> remember position within pageblock in free pages scanner") allowed the free
> scanner position to be in the middle of a pageblock between invocations.
> The hot-fix 1d5bfe1ffb5b ("mm, compaction: prevent infinite loop in
> compact_zone") prevented the issue by adding a special check in the migration
> scanner to satisfy the current detection of scanners meeting.
> 
> However, the proper fix is to make the detection more robust. This patch
> introduces the compact_scanners_met() function that returns true when the free
> scanner position is in the same or lower pageblock than the migration scanner.
> The special case in isolate_migratepages() introduced by 1d5bfe1ffb5b is
> removed.
> 
> Suggested-by: Joonsoo Kim 
> Signed-off-by: Vlastimil Babka 

Reviewed-by: Zhang Yanfei 
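
To make the pageblock comparison concrete, here is a small standalone
illustration (assuming pageblock_order == 9, i.e. 512-page pageblocks; the
pfn values are made up):

	#include <stdio.h>

	#define PAGEBLOCK_ORDER 9	/* assumed: 512 pages per pageblock */

	int main(void)
	{
		unsigned long migrate_pfn = 0x12200; /* first pfn of a pageblock */
		unsigned long free_pfn    = 0x12345; /* middle of the same pageblock */

		/* Raw pfn comparison (the old check): reports "not met". */
		printf("raw:       %d\n", free_pfn <= migrate_pfn);

		/* Pageblock comparison (compact_scanners_met): reports "met". */
		printf("pageblock: %d\n",
		       (free_pfn >> PAGEBLOCK_ORDER) <= (migrate_pfn >> PAGEBLOCK_ORDER));
		return 0;
	}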

> Cc: Minchan Kim 
> Cc: Mel Gorman 
> Cc: Joonsoo Kim 
> Cc: Michal Nazarewicz 
> Cc: Naoya Horiguchi 
> Cc: Christoph Lameter 
> Cc: Rik van Riel 
> Cc: David Rientjes 
> ---
>  mm/compaction.c | 22 ++
>  1 file changed, 14 insertions(+), 8 deletions(-)
> 
> diff --git a/mm/compaction.c b/mm/compaction.c
> index 546e571..5fdbdb8 100644
> --- a/mm/compaction.c
> +++ b/mm/compaction.c
> @@ -803,6 +803,16 @@ isolate_migratepages_range(struct compact_control *cc, 
> unsigned long start_pfn,
>  #endif /* CONFIG_COMPACTION || CONFIG_CMA */
>  #ifdef CONFIG_COMPACTION
>  /*
> + * Test whether the free scanner has reached the same or lower pageblock than
> + * the migration scanner, and compaction should thus terminate.
> + */
> +static inline bool compact_scanners_met(struct compact_control *cc)
> +{
> + return (cc->free_pfn >> pageblock_order)
> + <= (cc->migrate_pfn >> pageblock_order);
> +}
> +
> +/*
>   * Based on information in the current compact_control, find blocks
>   * suitable for isolating free pages from and then isolate them.
>   */
> @@ -1027,12 +1037,8 @@ static isolate_migrate_t isolate_migratepages(struct 
> zone *zone,
>   }
>  
>   acct_isolated(zone, cc);
> - /*
> -  * Record where migration scanner will be restarted. If we end up in
> -  * the same pageblock as the free scanner, make the scanners fully
> -  * meet so that compact_finished() terminates compaction.
> -  */
> - cc->migrate_pfn = (end_pfn <= cc->free_pfn) ? low_pfn : cc->free_pfn;
> + /* Record where migration scanner will be restarted. */
> + cc->migrate_pfn = low_pfn;
>  
>   return cc->nr_migratepages ? ISOLATE_SUCCESS : ISOLATE_NONE;
>  }
> @@ -1047,7 +1053,7 @@ static int compact_finished(struct zone *zone, struct 
> compact_control *cc,
>   return COMPACT_PARTIAL;
>  
>   /* Compaction run completes if the migrate and free scanner meet */
> - if (cc->free_pfn <= cc->migrate_pfn) {
> + if (compact_scanners_met(cc)) {
>   /* Let the next compaction start anew. */
>   zone->compact_cached_migrate_pfn[0] = zone->zone_start_pfn;
>   zone->compact_cached_migrate_pfn[1] = zone->zone_start_pfn;
> @@ -1238,7 +1244,7 @@ static int compact_zone(struct zone *zone, struct 
> compact_control *cc)
>* migrate_pages() may return -ENOMEM when scanners meet
>* and we want compact_finished() to detect it
>*/
> - if (err == -ENOMEM && cc->free_pfn > cc->migrate_pfn) {
> + if (err == -ENOMEM && !compact_scanners_met(cc)) {
>   ret = COMPACT_PARTIAL;
>   goto out;
>   }
> 
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 0/6] mm/hugetlb: gigantic hugetlb page pools shrink supporting

2014-08-21 Thread Zhang Yanfei
Hello Wanpeng

On 08/22/2014 07:37 AM, Wanpeng Li wrote:
> Hi Andi,
> On Fri, Apr 12, 2013 at 05:22:37PM +0200, Andi Kleen wrote:
>> On Fri, Apr 12, 2013 at 07:29:07AM +0800, Wanpeng Li wrote:
>>> Ping Andi,
>>> On Thu, Apr 04, 2013 at 05:09:08PM +0800, Wanpeng Li wrote:
>>>> order >= MAX_ORDER pages are only allocated at boot stage using the 
>>>> bootmem allocator with the "hugepages=xxx" option. These pages are never 
>>>> freed after boot by default since that would be a one-way street (>= MAX_ORDER
>>>> pages cannot be allocated later), but if the administrator confirms these 
>>>> gigantic pages are no longer needed, the pinned pages just waste memory,
>>>> since other users can't grab free pages from the gigantic hugetlb pool even
>>>> under OOM; it's not flexible.  This patchset adds support for shrinking the
>>>> gigantic hugetlb page pools. The administrator can enable a knob exported
>>>> in sysctl to permit shrinking the gigantic hugetlb pool.
>>
>>
>> I originally didn't allow this because it's only one way and it seemed
>> dubious.  I've been recently working on a new patchkit to allocate
>> GB pages from CMA. With that freeing actually makes sense, as 
>> the pages can be reallocated.
>>
> 
> More than one year has passed; has your patchkit to allocate GB pages from CMA been merged?

commit 944d9fec8d7aee3f2e16573e9b6a16634b33f403
Author: Luiz Capitulino 
Date:   Wed Jun 4 16:07:13 2014 -0700

hugetlb: add support for gigantic page allocation at runtime


> 
> Regards,
> Wanpeng Li 
> 
>> -Andi
>>
>> --
>> To unsubscribe, send a message with 'unsubscribe linux-mm' in
>> the body to majord...@kvack.org.  For more info on Linux MM,
>> see: http://www.linux-mm.org/ .
>> Don't email: em...@kvack.org
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/
> .
> 


-- 
Thanks.
Zhang Yanfei
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 1/5] mm/slab_common: move kmem_cache definition to internal header

2014-08-21 Thread Zhang Yanfei
stem */
> +};
> +
> +#endif /* CONFIG_SLOB */
> +
> +#ifdef CONFIG_SLAB
> +#include 
> +#endif
> +
> +#ifdef CONFIG_SLUB
> +#include 
> +#endif
> +
>  /*
>   * State of the slab allocator.
>   *
> diff --git a/mm/slab_common.c b/mm/slab_common.c
> index d319502..2088904 100644
> --- a/mm/slab_common.c
> +++ b/mm/slab_common.c
> @@ -30,6 +30,14 @@ LIST_HEAD(slab_caches);
>  DEFINE_MUTEX(slab_mutex);
>  struct kmem_cache *kmem_cache;
>  
> +/*
> + * Determine the size of a slab object
> + */
> +unsigned int kmem_cache_size(struct kmem_cache *s)
> +{
> + return s->object_size;
> +}
> +
>  #ifdef CONFIG_DEBUG_VM
>  static int kmem_cache_sanity_check(const char *name, size_t size)
>  {
> 


-- 
Thanks.
Zhang Yanfei
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v2 3/8] mm/page_alloc: fix pcp high, batch management

2014-08-06 Thread Zhang Yanfei
> + *output_batch = max(1, 1 * input_batch);
>  }
>  
> -/* a companion to pageset_set_high() */
> -static void pageset_set_batch(struct per_cpu_pageset *p, unsigned long batch)
> +static void pageset_get_values(struct zone *zone, int *high, int *batch)
>  {
> - pageset_update(&p->pcp, 6 * batch, max(1UL, 1 * batch));
> + if (percpu_pagelist_fraction) {
> + pageset_get_values_by_high(
> + (zone->managed_pages / percpu_pagelist_fraction),
> + high, batch);
> + } else
> + pageset_get_values_by_batch(zone_batchsize(zone), high, batch);
>  }
>  
>  static void pageset_init(struct per_cpu_pageset *p)
> @@ -4263,51 +4298,38 @@ static void pageset_init(struct per_cpu_pageset *p)
>   INIT_LIST_HEAD(&pcp->lists[migratetype]);
>  }
>  
> -static void setup_pageset(struct per_cpu_pageset *p, unsigned long batch)
> +/* Use this only in boot time, because it doesn't do any synchronization */
> +static void setup_pageset(struct per_cpu_pageset __percpu *pcp)
>  {
> - pageset_init(p);
> - pageset_set_batch(p, batch);
> -}
> -
> -/*
> - * pageset_set_high() sets the high water mark for hot per_cpu_pagelist
> - * to the value high for the pageset p.
> - */
> -static void pageset_set_high(struct per_cpu_pageset *p,
> - unsigned long high)
> -{
> - unsigned long batch = max(1UL, high / 4);
> - if ((high / 4) > (PAGE_SHIFT * 8))
> - batch = PAGE_SHIFT * 8;
> -
> - pageset_update(>pcp, high, batch);
> -}
> -
> -static void pageset_set_high_and_batch(struct zone *zone,
> -struct per_cpu_pageset *pcp)
> -{
> - if (percpu_pagelist_fraction)
> - pageset_set_high(pcp,
> - (zone->managed_pages /
> - percpu_pagelist_fraction));
> - else
> - pageset_set_batch(pcp, zone_batchsize(zone));
> -}
> + int cpu;
> + int high, batch;
> + struct per_cpu_pageset *p;
>  
> -static void __meminit zone_pageset_init(struct zone *zone, int cpu)
> -{
> - struct per_cpu_pageset *pcp = per_cpu_ptr(zone->pageset, cpu);
> + pageset_get_values_by_batch(0, , );
>  
> - pageset_init(pcp);
> - pageset_set_high_and_batch(zone, pcp);
> + for_each_possible_cpu(cpu) {
> + p = per_cpu_ptr(pcp, cpu);
> + pageset_init(p);
> + p->pcp.high = high;
> + p->pcp.batch = batch;
> + }
>  }
>  
>  static void __meminit setup_zone_pageset(struct zone *zone)
>  {
>   int cpu;
> + int high, batch;
> + struct per_cpu_pageset *p;
> +
> + pageset_get_values(zone, , );
> +
>   zone->pageset = alloc_percpu(struct per_cpu_pageset);
> - for_each_possible_cpu(cpu)
> - zone_pageset_init(zone, cpu);
> + for_each_possible_cpu(cpu) {
> + p = per_cpu_ptr(zone->pageset, cpu);
> + pageset_init(p);
> + p->pcp.high = high;
> + p->pcp.batch = batch;
> + }
>  }
>  
>  /*
> @@ -5928,11 +5950,10 @@ int percpu_pagelist_fraction_sysctl_handler(struct 
> ctl_table *table, int write,
>   goto out;
>  
>   for_each_populated_zone(zone) {
> - unsigned int cpu;
> + int high, batch;
>  
> - for_each_possible_cpu(cpu)
> - pageset_set_high_and_batch(zone,
> - per_cpu_ptr(zone->pageset, cpu));
> + pageset_get_values(zone, , );
> + pageset_update(zone, high, batch);
>   }
>  out:
>   mutex_unlock(_batch_high_lock);
> @@ -6455,11 +6476,11 @@ void free_contig_range(unsigned long pfn, unsigned 
> nr_pages)
>   */
>  void __meminit zone_pcp_update(struct zone *zone)
>  {
> - unsigned cpu;
> + int high, batch;
> +
>   mutex_lock(_batch_high_lock);
> - for_each_possible_cpu(cpu)
> - pageset_set_high_and_batch(zone,
> - per_cpu_ptr(zone->pageset, cpu));
> + pageset_get_values(zone, , );
> + pageset_update(zone, high, batch);
>   mutex_unlock(_batch_high_lock);
>  }
>  #endif
> 


-- 
Thanks.
Zhang Yanfei


Re: [PATCH v2 1/8] mm/page_alloc: correct to clear guard attribute in DEBUG_PAGEALLOC

2014-08-06 Thread Zhang Yanfei
On 08/06/2014 03:18 PM, Joonsoo Kim wrote:
> In __free_one_page(), we check the buddy page if it is guard page.
> And, if so, we should clear guard attribute on the buddy page. But,
> currently, we clear original page's order rather than buddy one's.
> This doesn't have any problem, because resetting buddy's order
> is useless and the original page's order is re-assigned soon.
> But, it is better to correct code.
> 
> Additionally, I change (set/clear)_page_guard_flag() to
> (set/clear)_page_guard() and makes these functions do all works
> needed for guard page. This may make code more understandable.
> 
> One more thing, I did in this patch, is that fixing freepage accounting.
> If we clear guard page and link it onto isolate buddy list, we should
> not increase freepage count.
> 
> Acked-by: Vlastimil Babka 
> Signed-off-by: Joonsoo Kim 

Reviewed-by: Zhang Yanfei 

> ---
>  mm/page_alloc.c |   29 -
>  1 file changed, 16 insertions(+), 13 deletions(-)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index b99643d4..e6fee4b 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -441,18 +441,28 @@ static int __init debug_guardpage_minorder_setup(char 
> *buf)
>  }
>  __setup("debug_guardpage_minorder=", debug_guardpage_minorder_setup);
>  
> -static inline void set_page_guard_flag(struct page *page)
> +static inline void set_page_guard(struct zone *zone, struct page *page,
> + unsigned int order, int migratetype)
>  {
>   __set_bit(PAGE_DEBUG_FLAG_GUARD, >debug_flags);
> + set_page_private(page, order);
> + /* Guard pages are not available for any usage */
> + __mod_zone_freepage_state(zone, -(1 << order), migratetype);
>  }
>  
> -static inline void clear_page_guard_flag(struct page *page)
> +static inline void clear_page_guard(struct zone *zone, struct page *page,
> + unsigned int order, int migratetype)
>  {
>   __clear_bit(PAGE_DEBUG_FLAG_GUARD, >debug_flags);
> + set_page_private(page, 0);
> + if (!is_migrate_isolate(migratetype))
> + __mod_zone_freepage_state(zone, (1 << order), migratetype);
>  }
>  #else
> -static inline void set_page_guard_flag(struct page *page) { }
> -static inline void clear_page_guard_flag(struct page *page) { }
> +static inline void set_page_guard(struct zone *zone, struct page *page,
> + unsigned int order, int migratetype) {}
> +static inline void clear_page_guard(struct zone *zone, struct page *page,
> + unsigned int order, int migratetype) {}
>  #endif
>  
>  static inline void set_page_order(struct page *page, unsigned int order)
> @@ -594,10 +604,7 @@ static inline void __free_one_page(struct page *page,
>* merge with it and move up one order.
>*/
>   if (page_is_guard(buddy)) {
> - clear_page_guard_flag(buddy);
> - set_page_private(page, 0);
> - __mod_zone_freepage_state(zone, 1 << order,
> -   migratetype);
> + clear_page_guard(zone, buddy, order, migratetype);
>   } else {
>   list_del(>lru);
>   zone->free_area[order].nr_free--;
> @@ -876,11 +883,7 @@ static inline void expand(struct zone *zone, struct page 
> *page,
>* pages will stay not present in virtual address space
>*/
>   INIT_LIST_HEAD([size].lru);
> - set_page_guard_flag([size]);
> - set_page_private([size], high);
> - /* Guard pages are not available for any usage */
> -         __mod_zone_freepage_state(zone, -(1 << high),
> -   migratetype);
> + set_page_guard(zone, [size], high, migratetype);
>   continue;
>   }
>  #endif
> 


-- 
Thanks.
Zhang Yanfei


Re: [PATCH v2 0/8] fix freepage count problems in memory isolation

2014-08-06 Thread Zhang Yanfei
Hi Joonsoo,

The first 3 patches in this patchset are in a bit of a mess.

On 08/06/2014 03:18 PM, Joonsoo Kim wrote:
> Hello,
> 
> This patchset aims at fixing problems during memory isolation found by
> testing my patchset [1].
> 
> These are really subtle problems so I can be wrong. If you find what I am
> missing, please let me know.
> 
> Before describing bugs itself, I first explain definition of freepage.
> 
> 1. pages on buddy list are counted as freepage.
> 2. pages on isolate migratetype buddy list are *not* counted as freepage.
> 3. pages on cma buddy list are counted as CMA freepage, too.
> 4. pages for guard are *not* counted as freepage.
> 
> Now, I describe problems and related patch.
> 
> Patch 1: If guard page are cleared and merged into isolate buddy list,
> we should not add freepage count.
> 
> Patch 4: There is race conditions that results in misplacement of free
> pages on buddy list. Then, it results in incorrect freepage count and
> un-availability of freepage.
> 
> Patch 5: To count freepage correctly, we should prevent freepage from
> being added to buddy list in some period of isolation. Without it, we
> cannot be sure if the freepage is counted or not and miscount number
> of freepage.
> 
> Patch 7: In spite of above fixes, there is one more condition for
> incorrect freepage count. pageblock isolation could be done in pageblock
> unit  so we can't prevent freepage from merging with page on next
> pageblock. To fix it, start_isolate_page_range() and
> undo_isolate_page_range() is modified to process whole range at one go.
> With this change, if input parameter of start_isolate_page_range() and
> undo_isolate_page_range() is properly aligned, there is no condition for
> incorrect merging.
> 
> Without patchset [1], above problem doesn't happens on my CMA allocation
> test, because CMA reserved pages aren't used at all. So there is no
> chance for above race.
> 
> With patchset [1], I did simple CMA allocation test and get below result.
> 
> - Virtual machine, 4 cpus, 1024 MB memory, 256 MB CMA reservation
> - run kernel build (make -j16) on background
> - 30 times CMA allocation(8MB * 30 = 240MB) attempts in 5 sec interval
> - Result: more than 5000 freepage count are missed
> 
> With patchset [1] and this patchset, I found that no freepage count are
> missed so that I conclude that problems are solved.
> 
> These problems can be possible on memory hot remove users, although
> I didn't check it further.
> 
> This patchset is based on linux-next-20140728.
> Please see individual patches for more information.
> 
> Thanks.
> 
> [1]: Aggressively allocate the pages on cma reserved memory
>  https://lkml.org/lkml/2014/5/30/291
> 
> Joonsoo Kim (8):
>   mm/page_alloc: correct to clear guard attribute in DEBUG_PAGEALLOC
>   mm/isolation: remove unstable check for isolated page
>   mm/page_alloc: fix pcp high, batch management
>   mm/isolation: close the two race problems related to pageblock
> isolation
>   mm/isolation: change pageblock isolation logic to fix freepage
> counting bugs
>   mm/isolation: factor out pre/post logic on
> set/unset_migratetype_isolate()
>   mm/isolation: fix freepage counting bug on
> start/undo_isolat_page_range()
>   mm/isolation: remove useless race handling related to pageblock
> isolation
> 
>  include/linux/page-isolation.h |2 +
>  mm/internal.h  |    5 +
>  mm/page_alloc.c|  223 +-
>  mm/page_isolation.c|  292 
> +++-
>  4 files changed, 368 insertions(+), 154 deletions(-)
> 
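
For reference, the four counting rules above all funnel through one small
helper; roughly as it reads in trees of this era (a sketch, slightly
simplified):

static inline void __mod_zone_freepage_state(struct zone *zone, int nr_pages,
					     int migratetype)
{
	/* rule 1: pages entering/leaving a buddy list move NR_FREE_PAGES */
	__mod_zone_page_state(zone, NR_FREE_PAGES, nr_pages);
	/* rule 3: CMA pages are additionally tracked as NR_FREE_CMA_PAGES */
	if (is_migrate_cma(migratetype))
		__mod_zone_page_state(zone, NR_FREE_CMA_PAGES, nr_pages);
}

Rules 2 and 4 hold only because callers are expected to skip this update for
pages going to the isolate buddy list and for guard pages, which is exactly
the accounting that patch 1 and patches 4/5 repair.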


-- 
Thanks.
Zhang Yanfei


Re: [PATCH] CMA/HOTPLUG: clear buffer-head lru before page migration

2014-07-18 Thread Zhang Yanfei
Hello,

On 07/18/2014 04:23 PM, Gioh Kim wrote:
> 
> 
> 2014-07-18 4:50 PM, Marek Szyprowski wrote:
>> Hello,
>>
>> On 2014-07-18 08:45, Gioh Kim wrote:
>>> For page migration of CMA, buffer-heads of lru should be dropped.
>>> Please refer to https://lkml.org/lkml/2014/7/4/101 for the history.
>>>
>>> I have two solution to drop bhs.
>>> One is invalidating entire lru.
>>> Another is searching the lru and dropping only one bh that Laura proposed
>>> at https://lkml.org/lkml/2012/8/31/313.
>>>
>>> I'm not sure which has better performance.
>>> So I did performance test on my cortex-a7 platform with Lmbench
>>> that has "File & VM system latencies" test.
>>> I am attaching the results.
>>> The first line is of invalidating entire lru and the second is dropping 
>>> selected bh.
>>>
>>> File & VM system latencies in microseconds - smaller is better
>>> ---
>>> Host OS   0K File  10K File MmapProt   Page   
>>> 100fd
>>>  Create Delete Create Delete Latency Fault  Fault  
>>> selct
>>> - - -- -- -- -- --- - --- 
>>> -
>>> 10.178.33 Linux 3.10.19   25.1   19.6   32.6   19.7  5098.0 0.666 3.45880 
>>> 6.506
>>> 10.178.33 Linux 3.10.19   24.9   19.5   32.3   19.4  5059.0 0.563 3.46380 
>>> 6.521
>>>
>>>
>>> I tried several times but the result tells that they are the same under 1% 
>>> gap
>>> except Protection Fault.
>>> But the latency of Protection Fault is very small and I think it has little 
>>> effect.
>>>
>>> Therefore we can choose anything but I choose invalidating entire lru.
>>> The try_to_free_buffers() which is calling drop_buffers() is called by many 
>>> filesystem code.
>>> So I think inserting codes in drop_buffers() can affect the system.
>>> And also we cannot distinguish migration type in drop_buffers().
>>>
>>> In alloc_contig_range() we can distinguish migration type and invalidate 
>>> lru if it needs.
>>> I think alloc_contig_range() is proper to deal with bh like following patch.
>>>
>>> Laura, can I have you name on Acked-by line?
>>> Please let me represent my thanks.
>>>
>>> Thanks for any feedback.
>>>
>>> --- 8< --
>>>
>>> >From 33c894b1bab9bc26486716f0c62c452d3a04d35d Mon Sep 17 00:00:00 2001
>>> From: Gioh Kim 
>>> Date: Fri, 18 Jul 2014 13:40:01 +0900
>>> Subject: [PATCH] CMA/HOTPLUG: clear buffer-head lru before page migration
>>>
>>> The bh must be free to migrate a page at which bh is mapped.
>>> The reference count of bh is increased when it is installed
>>> into lru so that the bh of lru must be freed before migrating the page.
>>>
>>> This frees every bh of lru. We could free only bh of migrating page.
>>> But searching lru costs more than invalidating entire lru.
>>>
>>> Signed-off-by: Gioh Kim 
>>> Acked-by: Laura Abbott 
>>> ---
>>>   mm/page_alloc.c |3 +++
>>>   1 file changed, 3 insertions(+)
>>>
>>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>>> index b99643d4..3b474e0 100644
>>> --- a/mm/page_alloc.c
>>> +++ b/mm/page_alloc.c
>>> @@ -6369,6 +6369,9 @@ int alloc_contig_range(unsigned long start, unsigned 
>>> long end,
>>>  if (ret)
>>>  return ret;
>>>
>>> +   if (migratetype == MIGRATE_CMA || migratetype == MIGRATE_MOVABLE)
>>
>> I'm not sure if it really makes sense to check the migratetype here. This 
>> check
>> doesn't add any new information to the code and make false impression that 
>> this
>> function can be called for other migratetypes than CMA or MOVABLE. Even if 
>> so,
>> then invalidating bh_lrus unconditionally will make more sense, IMHO.
> 
> I agree. I cannot understand why alloc_contig_range has an argument of 
> migratetype.
> Can the alloc_contig_range is called for other migrate type than CMA/MOVABLE?
> 
> What do you think about removing the argument of migratetype and
> checking migratetype (if (migratetype == MIGRATE_CMA || migratetype == 
> MIGRATE_MOVABLE))?
> 

Remove only the check; the migratetype argument itself is still needed, because
gigantic page allocation for hugetlb calls alloc_contig_range(.., MIGRATE_MOVABLE).

Thanks.
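
Concretely, with the check dropped the hunk in alloc_contig_range() would
reduce to something like this (sketch only; it assumes linux/buffer_head.h is
already pulled in by the patch):

	if (ret)
		return ret;

	/*
	 * Drop all cached buffer_heads unconditionally: the CMA and the
	 * gigantic-hugepage (MIGRATE_MOVABLE) callers both need the bh LRU
	 * emptied before their pages can be migrated, so no migratetype
	 * check is needed and the migratetype argument stays as it is.
	 */
	invalidate_bh_lrus();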

-- 
Thanks.
Zhang Yanfei


Re: [PATCH 0/5] memory-hotplug: suitable memory should go to ZONE_MOVABLE

2014-07-18 Thread Zhang Yanfei
Hello,

On 07/18/2014 03:55 PM, Wang Nan wrote:
> This series of patches fix a problem when adding memory in bad manner.
> For example: for a x86_64 machine booted with "mem=400M" and with 2GiB
> memory installed, following commands cause problem:
> 
>  # echo 0x4000 > /sys/devices/system/memory/probe
> [   28.613895] init_memory_mapping: [mem 0x4000-0x47ff]
>  # echo 0x4800 > /sys/devices/system/memory/probe
> [   28.693675] init_memory_mapping: [mem 0x4800-0x4fff]
>  # echo online_movable > /sys/devices/system/memory/memory9/state
>  # echo 0x5000 > /sys/devices/system/memory/probe 
> [   29.084090] init_memory_mapping: [mem 0x5000-0x57ff]
>  # echo 0x5800 > /sys/devices/system/memory/probe 
> [   29.151880] init_memory_mapping: [mem 0x5800-0x5fff]
>  # echo online_movable > /sys/devices/system/memory/memory11/state
>  # echo online> /sys/devices/system/memory/memory8/state
>  # echo online> /sys/devices/system/memory/memory10/state
>  # echo offline> /sys/devices/system/memory/memory9/state
> [   30.558819] Offlined Pages 32768
>  # free
>  total   used   free sharedbuffers cached
> Mem:780588 18014398509432020 830552  0  0  
> 51180
> -/+ buffers/cache: 18014398509380840 881732
> Swap:0  0  0
> 
> This is because the above commands probe higher memory after online a
> section with online_movable, which causes ZONE_HIGHMEM (or ZONE_NORMAL
> for systems without ZONE_HIGHMEM) overlaps ZONE_MOVABLE.

Yeah, this is rare in reality but can happen. Could you please also
include the free result and zoneinfo after applying your patch?

Thanks.

> 
> After the second online_movable, the problem can be observed from
> zoneinfo:
> 
>  # cat /proc/zoneinfo
> ...
> Node 0, zone  Movable
>   pages free 65491
> min  250
> low  312
> high 375
> scanned  0
> spanned  18446744073709518848
> present  65536
> managed  65536
> ...
> 
> This series of patches solve the problem by checking ZONE_MOVABLE when
> choosing zone for new memory. If new memory is inside or higher than
> ZONE_MOVABLE, makes it go there instead.
> 
> 
> Wang Nan (5):
>   memory-hotplug: x86_64: suitable memory should go to ZONE_MOVABLE
>   memory-hotplug: x86_32: suitable memory should go to ZONE_MOVABLE
>   memory-hotplug: ia64: suitable memory should go to ZONE_MOVABLE
>   memory-hotplug: sh: suitable memory should go to ZONE_MOVABLE
>   memory-hotplug: powerpc: suitable memory should go to ZONE_MOVABLE
> 
>  arch/ia64/mm/init.c   |  7 +++
>  arch/powerpc/mm/mem.c |  6 ++
>  arch/sh/mm/init.c | 13 -
>  arch/x86/mm/init_32.c |  6 ++
>  arch/x86/mm/init_64.c | 10 --
>  5 files changed, 35 insertions(+), 7 deletions(-)
> 
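
For illustration, the zone-selection rule described above amounts to something
like the following hypothetical helper (the name and placement are made up
here, not taken from the series; presumably the per-arch patches do the
equivalent check in their arch_add_memory() paths):

static struct zone *pick_zone_for_new_memory(pg_data_t *pgdat,
					     unsigned long start_pfn,
					     enum zone_type default_zone)
{
	struct zone *movable = &pgdat->node_zones[ZONE_MOVABLE];

	/*
	 * If the hot-added range starts inside or above the current
	 * ZONE_MOVABLE span, grow ZONE_MOVABLE instead of the default
	 * (ZONE_NORMAL/ZONE_HIGHMEM) zone, so the zones cannot overlap.
	 */
	if (!zone_is_empty(movable) && start_pfn >= movable->zone_start_pfn)
		return movable;

	return &pgdat->node_zones[default_zone];
}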


-- 
Thanks.
Zhang Yanfei


Re: [PATCH v11 2/7] x86: add pmd_[dirty|mkclean] for THP

2014-07-08 Thread Zhang Yanfei
On 07/08/2014 02:03 PM, Minchan Kim wrote:
> MADV_FREE needs pmd_dirty and pmd_mkclean for detecting recent
> overwrite of the contents since MADV_FREE syscall is called for
> THP page.
> 
> This patch adds pmd_dirty and pmd_mkclean for THP page MADV_FREE
> support.
> 
> Cc: Thomas Gleixner 
> Cc: Ingo Molnar 
> Cc: "H. Peter Anvin" 
> Cc: x...@kernel.org
> Acked-by: Kirill A. Shutemov 
> Signed-off-by: Minchan Kim 

Acked-by: Zhang Yanfei 

> ---
>  arch/x86/include/asm/pgtable.h | 10 ++
>  1 file changed, 10 insertions(+)
> 
> diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
> index 0ec056012618..329865799653 100644
> --- a/arch/x86/include/asm/pgtable.h
> +++ b/arch/x86/include/asm/pgtable.h
> @@ -104,6 +104,11 @@ static inline int pmd_young(pmd_t pmd)
>   return pmd_flags(pmd) & _PAGE_ACCESSED;
>  }
>  
> +static inline int pmd_dirty(pmd_t pmd)
> +{
> + return pmd_flags(pmd) & _PAGE_DIRTY;
> +}
> +
>  static inline int pte_write(pte_t pte)
>  {
>   return pte_flags(pte) & _PAGE_RW;
> @@ -267,6 +272,11 @@ static inline pmd_t pmd_mkold(pmd_t pmd)
>   return pmd_clear_flags(pmd, _PAGE_ACCESSED);
>  }
>  
> +static inline pmd_t pmd_mkclean(pmd_t pmd)
> +{
> + return pmd_clear_flags(pmd, _PAGE_DIRTY);
> +}
> +
>  static inline pmd_t pmd_wrprotect(pmd_t pmd)
>  {
>   return pmd_clear_flags(pmd, _PAGE_RW);
> 


-- 
Thanks.
Zhang Yanfei


Re: [PATCH v11 1/7] mm: support madvise(MADV_FREE)

2014-07-08 Thread Zhang Yanfei
On 07/08/2014 02:03 PM, Minchan Kim wrote:
> Linux doesn't have an ability to free pages lazy while other OS
> already have been supported that named by madvise(MADV_FREE).
> 
> The gain is clear that kernel can discard freed pages rather than
> swapping out or OOM if memory pressure happens.
> 
> Without memory pressure, freed pages would be reused by userspace
> without another additional overhead(ex, page fault + allocation
> + zeroing).
> 
> How to work is following as.
> 
> When madvise syscall is called, VM clears dirty bit of ptes of
> the range. If memory pressure happens, VM checks dirty bit of
> page table and if it found still "clean", it means it's a
> "lazyfree pages" so VM could discard the page instead of swapping out.
> Once there was store operation for the page before VM peek a page
> to reclaim, dirty bit is set so VM can swap out the page instead of
> discarding.
> 
> Firstly, heavy users would be general allocators(ex, jemalloc,
> tcmalloc and hope glibc supports it) and jemalloc/tcmalloc already
> have supported the feature for other OS(ex, FreeBSD)
> 
> barrios@blaptop:~/benchmark/ebizzy$ lscpu
> Architecture:  x86_64
> CPU op-mode(s):32-bit, 64-bit
> Byte Order:Little Endian
> CPU(s):4
> On-line CPU(s) list:   0-3
> Thread(s) per core:2
> Core(s) per socket:2
> Socket(s): 1
> NUMA node(s):  1
> Vendor ID: GenuineIntel
> CPU family:6
> Model: 42
> Stepping:  7
> CPU MHz:   2801.000
> BogoMIPS:  5581.64
> Virtualization:VT-x
> L1d cache: 32K
> L1i cache: 32K
> L2 cache:  256K
> L3 cache:  4096K
> NUMA node0 CPU(s): 0-3
> 
> ebizzy benchmark(./ebizzy -S 10 -n 512)
> 
>  vanilla-jemalloc MADV_free-jemalloc
> 
> 1 thread
> records:  10  records:  10
> avg:  7682.10 avg:  15306.10
> std:  62.35(0.81%)std:  347.99(2.27%)
> max:  7770.00 max:  15622.00
> min:  7598.00 min:  14772.00
> 
> 2 thread
> records:  10  records:  10
> avg:  12747.50avg:  24171.00
> std:  792.06(6.21%)   std:  895.18(3.70%)
> max:  13337.00max:  26023.00
> min:  10535.00min:  23152.00
> 
> 4 thread
> records:  10  records:  10
> avg:  16474.60avg:  33717.90
> std:  1496.45(9.08%)  std:  2008.97(5.96%)
> max:  17877.00max:  35958.00
> min:  12224.00min:  29565.00
> 
> 8 thread
> records:  10  records:  10
> avg:  16778.50avg:  33308.10
> std:  825.53(4.92%)   std:  1668.30(5.01%)
> max:  17543.00max:  36010.00
> min:  14576.00min:  29577.00
> 
> 16 thread
> records:  10  records:  10
> avg:  20614.40avg:  35516.30
> std:  602.95(2.92%)   std:  1283.65(3.61%)
> max:  21753.00max:  37178.00
> min:  19605.00min:  33217.00
> 
> 32 thread
> records:  10  records:  10
> avg:  22771.70avg:  36018.50
> std:  598.94(2.63%)   std:  1046.76(2.91%)
> max:  24035.00    max:      37266.00
> min:  22108.00min:  34149.00
> 
> In summary, MADV_FREE is about 2 time faster than MADV_DONTNEED.
> 
> Cc: Michael Kerrisk 
> Cc: Linux API 
> Cc: Hugh Dickins 
> Cc: Johannes Weiner 
> Cc: KOSAKI Motohiro 
> Cc: Mel Gorman 
> Cc: Jason Evans 
> Cc: Zhang Yanfei 
> Acked-by: Rik van Riel 
> Signed-off-by: Minchan Kim 

A quick respin, looks good to me now for this !THP part. And
looks neat with the Pagewalker.

Acked-by: Zhang Yanfei 

> ---
>  include/linux/rmap.h   |   9 ++-
>  include/linux/vm_event_item.h  |   1 +
>  include/uapi/asm-generic/mman-common.h |   1 +
>  mm/madvise.c   | 135 
> +
>  mm/rmap.c  |  42 +-
>  mm/vmscan.c|  40 --
>  mm/vmstat.c|   1 +
>  7 files changed, 217 insertions(+), 12 deletions(-)
> 
> diff --git a/include/linux/rmap.h b/include/linux/rmap.h
> index be574506e6a9..0ba377b97a38 100644
> --- a/include/linux/rmap.h
> +++ b/include/linux/rmap.h
> @@ -75,6 +75,7 @@ enum ttu_flags {
>   TTU_UNMAP = 1,  /* unmap mode */
>   TTU_MIGRATION = 2,  /* migration mode */
>   TTU_MUNLOCK = 4, 



Re: [PATCH v10 1/7] mm: support madvise(MADV_FREE)

2014-07-07 Thread Zhang Yanfei
Hi Minchan,

On 07/07/2014 08:53 AM, Minchan Kim wrote:
> Linux doesn't have an ability to free pages lazy while other OS
> already have been supported that named by madvise(MADV_FREE).
> 
> The gain is clear that kernel can discard freed pages rather than
> swapping out or OOM if memory pressure happens.
> 
> Without memory pressure, freed pages would be reused by userspace
> without another additional overhead(ex, page fault + allocation
> + zeroing).
> 
> How to work is following as.
> 
> When madvise syscall is called, VM clears dirty bit of ptes of
> the range. 

This should be updated because the implementation has changed: it now also
removes the page from the swap cache if it is there.

Thank you for your effort!
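
As a reminder of the intended userspace pattern, a minimal sketch (MADV_FREE
is assumed to be 8 here, the value the feature eventually shipped with; take
the real number from the patched uapi header when testing this series):

#define _GNU_SOURCE
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>

#ifndef MADV_FREE
#define MADV_FREE 8	/* assumed value; see the patched asm-generic/mman-common.h */
#endif

int main(void)
{
	size_t len = 64UL << 20;
	char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (buf == MAP_FAILED)
		return 1;

	memset(buf, 0xaa, len);		/* dirty the anonymous pages */

	/*
	 * Mark the contents disposable: the kernel cleans the pte dirty bits
	 * (and, per the updated implementation, drops swap cache entries),
	 * so reclaim may discard these pages instead of swapping them out.
	 */
	madvise(buf, len, MADV_FREE);

	buf[0] = 1;	/* reuse in place: either old data or a fresh zero page shows up */

	munmap(buf, len);
	return 0;
}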

-- 
Thanks.
Zhang Yanfei


Re: [PATCH v3 05/13] mm, compaction: report compaction as contended only due to lock contention

2014-06-23 Thread Zhang Yanfei
Hello Minchan,

Thank you for the explanation. Actually, I had been reading an older version
of the kernel; the latest upstream kernel behaves as you described below.
Oops, it has been a while since I last followed the buddy allocator changes.

Thanks.

On 06/24/2014 07:35 AM, Minchan Kim wrote:
>>> Anyway, most big concern is that you are changing current behavior as
>>> > > I said earlier.
>>> > > 
>>> > > Old behavior in THP page fault when it consumes own timeslot was just
>>> > > abort and fallback 4K page but with your patch, new behavior is
>>> > > take a rest when it founds need_resched and goes to another round with
>>> > > async, not sync compaction. I'm not sure we need another round with
>>> > > async compaction at the cost of increasing latency rather than fallback
>>> > > 4 page.
>> > 
>> > I don't see the new behavior works like what you said. If need_resched
>> > is true, it calls cond_resched() and after a rest it just breaks the loop.
>> > Why there is another round with async compact?
> One example goes
> 
> Old:
> page fault
> huge page allocation
> __alloc_pages_slowpath
> __alloc_pages_direct_compact
> compact_zone_order
> isolate_migratepages
> compact_checklock_irqsave
> need_resched is true
> cc->contended = true;
> return ISOLATE_ABORT
> return COMPACT_PARTIAL with *contented = cc.contended;
> COMPACTFAIL
> if (contended_compaction && gfp_mask & __GFP_NO_KSWAPD)
> goto nopage;
> 
> New:
> 
> page fault
> huge page allocation
> __alloc_pages_slowpath
> __alloc_pages_direct_compact
> compact_zone_order
> isolate_migratepages
> compact_unlock_should_abort
> need_resched is true
> cc->contended = COMPACT_CONTENDED_SCHED;
> return true;
> return ISOLATE_ABORT
> return COMPACT_PARTIAL with *contended = cc.contended == 
> COMPACT_CONTENDED_LOCK (1)
> COMPACTFAIL
> if (contended_compaction && gfp_mask & __GFP_NO_KSWAPD)
> no goto nopage because contended_compaction was false by (1)
> 
> __alloc_pages_direct_reclaim
> if (should_alloc_retry)
> else
> __alloc_pages_direct_compact again with ASYNC_MODE
> 
> 
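
To restate the key difference in code form (the enum names come from the
quoted patch; the small wrapper below is only an illustration, not the exact
upstream hunk):

enum compact_contended {
	COMPACT_CONTENDED_NONE,
	COMPACT_CONTENDED_SCHED,	/* abort was due to need_resched() */
	COMPACT_CONTENDED_LOCK,		/* abort was due to zone lock contention */
};

/*
 * Only genuine lock contention is reported back to the page allocator, so a
 * need_resched() abort no longer trips the "goto nopage" shortcut in the THP
 * fault path; the allocation falls through to reclaim and then to another
 * async compaction round, as the "New" trace above shows.
 */
static inline bool report_contended(enum compact_contended contended)
{
	return contended == COMPACT_CONTENDED_LOCK;
}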


-- 
Thanks.
Zhang Yanfei


Re: [PATCH v3 01/13] mm, THP: don't hold mmap_sem in khugepaged when allocating THP

2014-06-23 Thread Zhang Yanfei
On 06/23/2014 05:52 PM, Vlastimil Babka wrote:
> On 06/23/2014 07:39 AM, Zhang Yanfei wrote:
>> Hello
>>
>> On 06/21/2014 01:45 AM, Kirill A. Shutemov wrote:
>>> On Fri, Jun 20, 2014 at 05:49:31PM +0200, Vlastimil Babka wrote:
>>>> When allocating huge page for collapsing, khugepaged currently holds 
>>>> mmap_sem
>>>> for reading on the mm where collapsing occurs. Afterwards the read lock is
>>>> dropped before write lock is taken on the same mmap_sem.
>>>>
>>>> Holding mmap_sem during whole huge page allocation is therefore useless, 
>>>> the
>>>> vma needs to be rechecked after taking the write lock anyway. Furthemore, 
>>>> huge
>>>> page allocation might involve a rather long sync compaction, and thus block
>>>> any mmap_sem writers and i.e. affect workloads that perform frequent 
>>>> m(un)map
>>>> or mprotect oterations.
>>>>
>>>> This patch simply releases the read lock before allocating a huge page. It
>>>> also deletes an outdated comment that assumed vma must be stable, as it was
>>>> using alloc_hugepage_vma(). This is no longer true since commit 9f1b868a13
>>>> ("mm: thp: khugepaged: add policy for finding target node").
>>>
>>> There is no point in touching ->mmap_sem in khugepaged_alloc_page() at
>>> all. Please, move up_read() outside khugepaged_alloc_page().
>>>
> 
> Well there's also currently no point in passing several parameters to 
> khugepaged_alloc_page(). So I could clean it up as well, but I imagine later 
> we would perhaps reintroduce them back, as I don't think the current 
> situation is ideal for at least two reasons.
> 
> 1. If you read commit 9f1b868a13 ("mm: thp: khugepaged: add policy for 
> finding target node"), it's based on a report where somebody found that 
> mempolicy is not observed properly when collapsing THP's. But the 'policy' 
> introduced by the commit isn't based on real mempolicy, it might just under 
> certain conditions results in an interleave, which happens to be what the 
> reporter was trying.
> 
> So ideally, it should be making node allocation decisions based on where the 
> original 4KB pages are located. For example, allocate a THP only if all the 
> 4KB pages are on the same node. That would also automatically obey any policy 
> that has lead to the allocation of those 4KB pages.
> 
> And for this, it will need again the parameters and mmap_sem in read mode. It 
> would be however still a good idea to drop mmap_sem before the allocation 
> itself, since compaction/reclaim might take some time...
> 
> 2. (less related) I'd expect khugepaged to first allocate a hugepage and then 
> scan for collapsing. Yes there's khugepaged_prealloc_page, but that only does 
> something on !NUMA systems and these are not the future.
> Although I don't have the data, I expect allocating a hugepage is a bigger 
> issue than finding something that could be collapsed. So why scan for 
> collapsing if in the end I cannot allocate a hugepage? And if I really cannot 
> find something to collapse, would e.g. caching a single hugepage per node be 
> a big hit? Also, if there's really nothing to collapse, then it means 
> khugepaged won't compact. And since khugepaged is becoming the only source of 
> sync compaction that doesn't give up easily and tries to e.g. migrate movable 
> pages out of unmovable pageblocks, this might have bad effects on 
> fragmentation.
> I believe this could be done smarter.
> 
>> I might be wrong. If we up_read in khugepaged_scan_pmd(), then if we round 
>> again
>> do the for loop to get the next vma and handle it. Does we do this without 
>> holding
>> the mmap_sem in any mode?
>>
>> And if the loop end, we have another up_read in breakouterloop. What if we 
>> have
>> released the mmap_sem in collapse_huge_page()?
> 
> collapse_huge_page() is only called from khugepaged_scan_pmd() in the if 
> (ret) condition. And khugepaged_scan_mm_slot() has similar if (ret) for the 
> return value of khugepaged_scan_pmd() to break out of the loop (and not doing 
> up_read() again). So I think this is correct and moving up_read from 
> khugepaged_alloc_page() to collapse_huge_page() wouldn't
> change this?

Ah, right.

> 
> 
> .
> 
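
Coming back to point 1 above, the "collapse only if all the 4KB pages sit on
one node" idea could look roughly like the hypothetical helper below
(simplified, with locking and huge-pmd checks omitted; this is not code from
the series):

static int collapse_target_node(struct vm_area_struct *vma,
				unsigned long address, pmd_t *pmd)
{
	pte_t *pte = pte_offset_map(pmd, address);
	int i, target = NUMA_NO_NODE;

	for (i = 0; i < HPAGE_PMD_NR; i++, address += PAGE_SIZE) {
		struct page *page;

		if (!pte_present(pte[i])) {
			target = NUMA_NO_NODE;	/* range not fully populated */
			break;
		}
		page = vm_normal_page(vma, address, pte[i]);
		if (!page) {
			target = NUMA_NO_NODE;
			break;
		}
		if (target == NUMA_NO_NODE)
			target = page_to_nid(page);
		else if (page_to_nid(page) != target) {
			target = NUMA_NO_NODE;	/* pages spread across nodes */
			break;
		}
	}
	pte_unmap(pte);

	/* NUMA_NO_NODE would mean: fall back to the existing heuristic */
	return target;
}

This would also honour whatever mempolicy placed the small pages in the first
place, as point 1 argues.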


-- 
Thanks.
Zhang Yanfei


Re: [PATCH v3 11/13] mm, compaction: pass gfp mask to compact_control

2014-06-23 Thread Zhang Yanfei
On 06/20/2014 11:49 PM, Vlastimil Babka wrote:
> From: David Rientjes 
> 
> struct compact_control currently converts the gfp mask to a migratetype, but 
> we
> need the entire gfp mask in a follow-up patch.
> 
> Pass the entire gfp mask as part of struct compact_control.
> 
> Signed-off-by: David Rientjes 
> Signed-off-by: Vlastimil Babka 
> Cc: Minchan Kim 
> Cc: Mel Gorman 
> Cc: Joonsoo Kim 
> Cc: Michal Nazarewicz 
> Cc: Naoya Horiguchi 
> Cc: Christoph Lameter 
> Cc: Rik van Riel 

Reviewed-by: Zhang Yanfei 

> ---
>  mm/compaction.c | 12 +++-
>  mm/internal.h   |  2 +-
>  2 files changed, 8 insertions(+), 6 deletions(-)
> 
> diff --git a/mm/compaction.c b/mm/compaction.c
> index 32c768b..d4e0c13 100644
> --- a/mm/compaction.c
> +++ b/mm/compaction.c
> @@ -975,8 +975,8 @@ static isolate_migrate_t isolate_migratepages(struct zone 
> *zone,
>   return cc->nr_migratepages ? ISOLATE_SUCCESS : ISOLATE_NONE;
>  }
>  
> -static int compact_finished(struct zone *zone,
> - struct compact_control *cc)
> +static int compact_finished(struct zone *zone, struct compact_control *cc,
> + const int migratetype)
>  {
>   unsigned int order;
>   unsigned long watermark;
> @@ -1022,7 +1022,7 @@ static int compact_finished(struct zone *zone,
>   struct free_area *area = >free_area[order];
>  
>   /* Job done if page is free of the right migratetype */
> - if (!list_empty(>free_list[cc->migratetype]))
> + if (!list_empty(>free_list[migratetype]))
>   return COMPACT_PARTIAL;
>  
>   /* Job done if allocation would set block type */
> @@ -1088,6 +1088,7 @@ static int compact_zone(struct zone *zone, struct 
> compact_control *cc)
>   int ret;
>   unsigned long start_pfn = zone->zone_start_pfn;
>   unsigned long end_pfn = zone_end_pfn(zone);
> + const int migratetype = gfpflags_to_migratetype(cc->gfp_mask);
>   const bool sync = cc->mode != MIGRATE_ASYNC;
>  
>   ret = compaction_suitable(zone, cc->order);
> @@ -1130,7 +1131,8 @@ static int compact_zone(struct zone *zone, struct 
> compact_control *cc)
>  
>   migrate_prep_local();
>  
> - while ((ret = compact_finished(zone, cc)) == COMPACT_CONTINUE) {
> + while ((ret = compact_finished(zone, cc, migratetype)) ==
> + COMPACT_CONTINUE) {
>   int err;
>  
>   switch (isolate_migratepages(zone, cc)) {
> @@ -1185,7 +1187,7 @@ static unsigned long compact_zone_order(struct zone 
> *zone, int order,
>   .nr_freepages = 0,
>   .nr_migratepages = 0,
>   .order = order,
> - .migratetype = gfpflags_to_migratetype(gfp_mask),
> + .gfp_mask = gfp_mask,
>   .zone = zone,
>   .mode = mode,
>   };
> diff --git a/mm/internal.h b/mm/internal.h
> index 584cd69..dd17a40 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -149,7 +149,7 @@ struct compact_control {
>   bool finished_update_migrate;
>  
>   int order;  /* order a direct compactor needs */
> - int migratetype;/* MOVABLE, RECLAIMABLE etc */
> + const gfp_t gfp_mask;   /* gfp mask of a direct compactor */
>   struct zone *zone;
>   enum compact_contended contended; /* Signal need_sched() or lock
>  * contention detected during
> 


-- 
Thanks.
Zhang Yanfei


Re: [PATCH v3 09/13] mm, compaction: skip buddy pages by their order in the migrate scanner

2014-06-23 Thread Zhang Yanfei
On 06/20/2014 11:49 PM, Vlastimil Babka wrote:
> The migration scanner skips PageBuddy pages, but does not consider their order
> as checking page_order() is generally unsafe without holding the zone->lock,
> and acquiring the lock just for the check wouldn't be a good tradeoff.
> 
> Still, this could avoid some iterations over the rest of the buddy page, and
> if we are careful, the race window between PageBuddy() check and page_order()
> is small, and the worst thing that can happen is that we skip too much and 
> miss
> some isolation candidates. This is not that bad, as compaction can already 
> fail
> for many other reasons like parallel allocations, and those have much larger
> race window.
> 
> This patch therefore makes the migration scanner obtain the buddy page order
> and use it to skip the whole buddy page, if the order appears to be in the
> valid range.
> 
> It's important that the page_order() is read only once, so that the value used
> in the checks and in the pfn calculation is the same. But in theory the
> compiler can replace the local variable by multiple inlines of page_order().
> Therefore, the patch introduces page_order_unsafe() that uses ACCESS_ONCE to
> prevent this.
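
For reference, the mm/internal.h side of the patch (cut off in the truncated
diff below) boils down to a helper along these lines, as reconstructed from the
description above:

/*
 * Like page_order(), but for callers who cannot take zone->lock.  The single
 * ACCESS_ONCE read guarantees the checks and the pfn arithmetic use the same
 * (possibly stale) value.
 */
static inline unsigned long page_order_unsafe(struct page *page)
{
	return ACCESS_ONCE(page_private(page));
}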
> 
> Testing with stress-highalloc from mmtests shows a 15% reduction in number of
> pages scanned by migration scanner. This change is also a prerequisite for a
> later patch which is detecting when a cc->order block of pages contains
> non-buddy pages that cannot be isolated, and the scanner should thus skip to
> the next block immediately.
> 
> Signed-off-by: Vlastimil Babka 
> Cc: Minchan Kim 
> Cc: Mel Gorman 
> Cc: Joonsoo Kim 
> Cc: Michal Nazarewicz 
> Cc: Naoya Horiguchi 
> Cc: Christoph Lameter 
> Cc: Rik van Riel 
> Cc: David Rientjes 

Fair enough.

Reviewed-by: Zhang Yanfei 

> ---
>  mm/compaction.c | 36 +++-
>  mm/internal.h   | 16 +++-
>  2 files changed, 46 insertions(+), 6 deletions(-)
> 
> diff --git a/mm/compaction.c b/mm/compaction.c
> index 41c7005..df0961b 100644
> --- a/mm/compaction.c
> +++ b/mm/compaction.c
> @@ -270,8 +270,15 @@ static inline bool compact_should_abort(struct 
> compact_control *cc)
>  static bool suitable_migration_target(struct page *page)
>  {
>   /* If the page is a large free page, then disallow migration */
> - if (PageBuddy(page) && page_order(page) >= pageblock_order)
> - return false;
> + if (PageBuddy(page)) {
> + /*
> +  * We are checking page_order without zone->lock taken. But
> +  * the only small danger is that we skip a potentially suitable
> +  * pageblock, so it's not worth to check order for valid range.
> +  */
> + if (page_order_unsafe(page) >= pageblock_order)
> + return false;
> + }
>  
>   /* If the block is MIGRATE_MOVABLE or MIGRATE_CMA, allow migration */
>   if (migrate_async_suitable(get_pageblock_migratetype(page)))
> @@ -591,11 +598,23 @@ isolate_migratepages_range(struct zone *zone, struct 
> compact_control *cc,
>   valid_page = page;
>  
>   /*
> -  * Skip if free. page_order cannot be used without zone->lock
> -  * as nothing prevents parallel allocations or buddy merging.
> +  * Skip if free. We read page order here without zone lock
> +  * which is generally unsafe, but the race window is small and
> +  * the worst thing that can happen is that we skip some
> +  * potential isolation targets.
>*/
> - if (PageBuddy(page))
> + if (PageBuddy(page)) {
> + unsigned long freepage_order = page_order_unsafe(page);
> +
> + /*
> +  * Without lock, we cannot be sure that what we got is
> +  * a valid page order. Consider only values in the
> +  * valid order range to prevent low_pfn overflow.
> +  */
> + if (freepage_order > 0 && freepage_order < MAX_ORDER)
> + low_pfn += (1UL << freepage_order) - 1;
>   continue;
> + }
>  
>   /*
>* Check may be lockless but that's ok as we recheck later.
> @@ -683,6 +702,13 @@ next_pageblock:
>   low_pfn = ALIGN(low_pfn + 1, pageblock_nr_pages) - 1;
>   }
>  
> + /*
> +  * The PageBuddy() check could have potentially brought us outside
> +  * the range to be scanned.
> +  */
> + if (unlikely(lo

Re: [PATCH v3 08/13] mm, compaction: remember position within pageblock in free pages scanner

2014-06-23 Thread Zhang Yanfei
On 06/20/2014 11:49 PM, Vlastimil Babka wrote:
> Unlike the migration scanner, the free scanner remembers the beginning of the
> last scanned pageblock in cc->free_pfn. It might be therefore rescanning pages
> uselessly when called several times during single compaction. This might have
> been useful when pages were returned to the buddy allocator after a failed
> migration, but this is no longer the case.
> 
> This patch changes the meaning of cc->free_pfn so that if it points to a
> middle of a pageblock, that pageblock is scanned only from cc->free_pfn to the
> end. isolate_freepages_block() will record the pfn of the last page it looked
> at, which is then used to update cc->free_pfn.
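
The diff below shows the scanner side; on the caller side the idea amounts to
roughly this pattern in the free scanner (simplified sketch, not the literal
diff):

	/* Resume exactly where the previous call stopped in this pageblock */
	unsigned long isolate_start_pfn = cc->free_pfn;

	isolate_freepages_block(cc, &isolate_start_pfn, block_end_pfn,
				freelist, false);

	/*
	 * isolate_start_pfn now points past the last page looked at, so the
	 * next invocation continues there instead of rescanning the block.
	 */
	cc->free_pfn = isolate_start_pfn;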
> 
> In the mmtests stress-highalloc benchmark, this has resulted in lowering the
> ratio between pages scanned by both scanners, from 2.5 free pages per migrate
> page, to 2.25 free pages per migrate page, without affecting success rates.
> 
> Signed-off-by: Vlastimil Babka 
> Acked-by: David Rientjes 
> Cc: Minchan Kim 
> Cc: Mel Gorman 
> Cc: Joonsoo Kim 
> Cc: Michal Nazarewicz 
> Cc: Naoya Horiguchi 
> Cc: Christoph Lameter 
> Cc: Rik van Riel 
> Cc: Zhang Yanfei 

Reviewed-by: Zhang Yanfei 

> ---
>  mm/compaction.c | 40 +++-
>  1 file changed, 31 insertions(+), 9 deletions(-)
> 
> diff --git a/mm/compaction.c b/mm/compaction.c
> index 9f6e857..41c7005 100644
> --- a/mm/compaction.c
> +++ b/mm/compaction.c
> @@ -287,7 +287,7 @@ static bool suitable_migration_target(struct page *page)
>   * (even though it may still end up isolating some pages).
>   */
>  static unsigned long isolate_freepages_block(struct compact_control *cc,
> - unsigned long blockpfn,
> + unsigned long *start_pfn,
>   unsigned long end_pfn,
>   struct list_head *freelist,
>   bool strict)
> @@ -296,6 +296,7 @@ static unsigned long isolate_freepages_block(struct 
> compact_control *cc,
>   struct page *cursor, *valid_page = NULL;
>   unsigned long flags;
>   bool locked = false;
> + unsigned long blockpfn = *start_pfn;
>  
>   cursor = pfn_to_page(blockpfn);
>  
> @@ -369,6 +370,9 @@ isolate_fail:
>   break;
>   }
>  
> + /* Record how far we have got within the block */
> + *start_pfn = blockpfn;
> +
>   trace_mm_compaction_isolate_freepages(nr_scanned, total_isolated);
>  
>   /*
> @@ -413,6 +417,9 @@ isolate_freepages_range(struct compact_control *cc,
>   LIST_HEAD(freelist);
>  
>   for (pfn = start_pfn; pfn < end_pfn; pfn += isolated) {
> + /* Protect pfn from changing by isolate_freepages_block */
> + unsigned long isolate_start_pfn = pfn;
> +
>   if (!pfn_valid(pfn) || cc->zone != page_zone(pfn_to_page(pfn)))
>   break;
>  
> @@ -423,8 +430,8 @@ isolate_freepages_range(struct compact_control *cc,
>   block_end_pfn = ALIGN(pfn + 1, pageblock_nr_pages);
>   block_end_pfn = min(block_end_pfn, end_pfn);
>  
> - isolated = isolate_freepages_block(cc, pfn, block_end_pfn,
> -, true);
> + isolated = isolate_freepages_block(cc, _start_pfn,
> + block_end_pfn, , true);
>  
>   /*
>* In strict mode, isolate_freepages_block() returns 0 if
> @@ -708,6 +715,7 @@ static void isolate_freepages(struct zone *zone,
>  {
>   struct page *page;
>   unsigned long block_start_pfn;  /* start of current pageblock */
> + unsigned long isolate_start_pfn; /* exact pfn we start at */
>   unsigned long block_end_pfn;/* end of current pageblock */
>   unsigned long low_pfn;   /* lowest pfn scanner is able to scan */
>   int nr_freepages = cc->nr_freepages;
> @@ -716,14 +724,15 @@ static void isolate_freepages(struct zone *zone,
>   /*
>* Initialise the free scanner. The starting point is where we last
>* successfully isolated from, zone-cached value, or the end of the
> -  * zone when isolating for the first time. We need this aligned to
> -  * the pageblock boundary, because we do
> +  * zone when isolating for the first time. For looping we also need
> +  * this pfn aligned down to the pageblock boundary, because we do
>* block_start_pfn -= pageblock_nr_pages in the for loop.
>* For ending point, take care when isolating in last pageblock of a
>* a zone 

Re: [PATCH v3 07/13] mm, compaction: skip rechecks when lock was already held

2014-06-23 Thread Zhang Yanfei
On 06/20/2014 11:49 PM, Vlastimil Babka wrote:
> Compaction scanners try to lock zone locks as late as possible by checking
> many page or pageblock properties opportunistically without lock and skipping
> them if not unsuitable. For pages that pass the initial checks, some 
> properties
> have to be checked again safely under lock. However, if the lock was already
> held from a previous iteration in the initial checks, the rechecks are
> unnecessary.
> 
> This patch therefore skips the rechecks when the lock was already held. This 
> is
> now possible to do, since we don't (potentially) drop and reacquire the lock
> between the initial checks and the safe rechecks anymore.
> 
> Signed-off-by: Vlastimil Babka 
> Acked-by: Minchan Kim 
> Cc: Mel Gorman 
> Cc: Michal Nazarewicz 
> Cc: Naoya Horiguchi 
> Cc: Christoph Lameter 
> Cc: Rik van Riel 
> Acked-by: David Rientjes 

Reviewed-by: Zhang Yanfei 

> ---
>  mm/compaction.c | 53 +++--
>  1 file changed, 31 insertions(+), 22 deletions(-)
> 
> diff --git a/mm/compaction.c b/mm/compaction.c
> index 40da812..9f6e857 100644
> --- a/mm/compaction.c
> +++ b/mm/compaction.c
> @@ -324,22 +324,30 @@ static unsigned long isolate_freepages_block(struct 
> compact_control *cc,
>   goto isolate_fail;
>  
>   /*
> -  * The zone lock must be held to isolate freepages.
> -  * Unfortunately this is a very coarse lock and can be
> -  * heavily contended if there are parallel allocations
> -  * or parallel compactions. For async compaction do not
> -  * spin on the lock and we acquire the lock as late as
> -  * possible.
> +  * If we already hold the lock, we can skip some rechecking.
> +  * Note that if we hold the lock now, checked_pageblock was
> +  * already set in some previous iteration (or strict is true),
> +  * so it is correct to skip the suitable migration target
> +  * recheck as well.
>*/
> - if (!locked)
> + if (!locked) {
> + /*
> +  * The zone lock must be held to isolate freepages.
> +  * Unfortunately this is a very coarse lock and can be
> +  * heavily contended if there are parallel allocations
> +  * or parallel compactions. For async compaction do not
> +  * spin on the lock and we acquire the lock as late as
> +  * possible.
> +  */
>   locked = compact_trylock_irqsave(>zone->lock,
>   , cc);
> - if (!locked)
> - break;
> + if (!locked)
> + break;
>  
> - /* Recheck this is a buddy page under lock */
> - if (!PageBuddy(page))
> - goto isolate_fail;
> + /* Recheck this is a buddy page under lock */
> + if (!PageBuddy(page))
> + goto isolate_fail;
> + }
>  
>   /* Found a free page, break it into order-0 pages */
>   isolated = split_free_page(page);
> @@ -623,19 +631,20 @@ isolate_migratepages_range(struct zone *zone, struct 
> compact_control *cc,
>   page_count(page) > page_mapcount(page))
>   continue;
>  
> - /* If the lock is not held, try to take it */
> - if (!locked)
> + /* If we already hold the lock, we can skip some rechecking */
> + if (!locked) {
>   locked = compact_trylock_irqsave(>lru_lock,
>   , cc);
> - if (!locked)
> - break;
> + if (!locked)
> + break;
>  
> - /* Recheck PageLRU and PageTransHuge under lock */
> - if (!PageLRU(page))
> - continue;
> - if (PageTransHuge(page)) {
> - low_pfn += (1 << compound_order(page)) - 1;
> - continue;
> + /* Recheck PageLRU and PageTransHuge under lock */
> + if (!PageLRU(page))
> +     continue;
> + if (PageTransHuge(page)) {
> + low_pfn += (1 << compound_order(page)) - 1;
> + continue;
> + 

Re: [PATCH v3 06/13] mm, compaction: periodically drop lock and restore IRQs in scanners

2014-06-23 Thread Zhang Yanfei
On 06/20/2014 11:49 PM, Vlastimil Babka wrote:
> Compaction scanners regularly check for lock contention and need_resched()
> through the compact_checklock_irqsave() function. However, if there is no
> contention, the lock can be held and IRQ disabled for potentially long time.
> 
> This has been addressed by commit b2eef8c0d0 ("mm: compaction: minimise the
> time IRQs are disabled while isolating pages for migration") for the migration
> scanner. However, the refactoring done by commit 748446bb6b ("mm: compaction:
> acquire the zone->lru_lock as late as possible") has changed the conditions so
> that the lock is dropped only when there's contention on the lock or
> need_resched() is true. Also, need_resched() is checked only when the lock is
> already held. The comment "give a chance to irqs before checking need_resched"
> is therefore misleading, as IRQs remain disabled when the check is done.
> 
> This patch restores the behavior intended by commit b2eef8c0d0 and also tries
> to better balance and make more deterministic the time spent by checking for
> contention vs the time the scanners might run between the checks. It also
> avoids situations where checking has not been done often enough before. The
> result should be avoiding both too frequent and too infrequent contention
> checking, and especially the potentially long-running scans with IRQs disabled
> and no checking of need_resched() or for fatal signal pending, which can 
> happen
> when many consecutive pages or pageblocks fail the preliminary tests and do 
> not
> reach the later call site to compact_checklock_irqsave(), as explained below.
> 
> Before the patch:
> 
> In the migration scanner, compact_checklock_irqsave() was called each loop, if
> reached. If not reached, some lower-frequency checking could still be done if
> the lock was already held, but this would not result in aborting contended
> async compaction until reaching compact_checklock_irqsave() or end of
> pageblock. In the free scanner, it was similar but completely without the
> periodical checking, so lock can be potentially held until reaching the end of
> pageblock.
> 
> After the patch, in both scanners:
> 
> The periodical check is done as the first thing in the loop on each
> SWAP_CLUSTER_MAX aligned pfn, using the new compact_unlock_should_abort()
> function, which always unlocks the lock (if locked) and aborts async 
> compaction
> if scheduling is needed. It also aborts any type of compaction when a fatal
> signal is pending.
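
The new helper itself is not visible in the (truncated) diff below; from the
description it looks roughly like this reconstructed sketch:

static bool compact_unlock_should_abort(spinlock_t *lock,
		unsigned long flags, bool *locked, struct compact_control *cc)
{
	/* Always drop the lock (and re-enable IRQs) if we hold it */
	if (*locked) {
		spin_unlock_irqrestore(lock, flags);
		*locked = false;
	}

	/* Any type of compaction aborts on a fatal signal */
	if (fatal_signal_pending(current)) {
		cc->contended = COMPACT_CONTENDED_SCHED;
		return true;
	}

	/* Async compaction aborts instead of scheduling; sync just reschedules */
	if (need_resched()) {
		if (cc->mode == MIGRATE_ASYNC) {
			cc->contended = COMPACT_CONTENDED_SCHED;
			return true;
		}
		cond_resched();
	}

	return false;
}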
> 
> The compact_checklock_irqsave() function is replaced with a slightly different
> compact_trylock_irqsave(). The biggest difference is that the function is not
> called at all if the lock is already held. The periodical need_resched()
> checking is left solely to compact_unlock_should_abort(). The lock contention
> avoidance for async compaction is achieved by the periodical unlock by
> compact_unlock_should_abort() and by using trylock in 
> compact_trylock_irqsave()
> and aborting when trylock fails. Sync compaction does not use trylock.
> 
> Signed-off-by: Vlastimil Babka 
> Cc: Minchan Kim 
> Cc: Mel Gorman 
> Cc: Michal Nazarewicz 
> Cc: Naoya Horiguchi 
> Cc: Christoph Lameter 
> Cc: Rik van Riel 
> Cc: David Rientjes 

Reviewed-by: Zhang Yanfei 

> ---
>  mm/compaction.c | 114 
> 
>  1 file changed, 73 insertions(+), 41 deletions(-)
> 
> diff --git a/mm/compaction.c b/mm/compaction.c
> index e8cfac9..40da812 100644
> --- a/mm/compaction.c
> +++ b/mm/compaction.c
> @@ -180,54 +180,72 @@ static void update_pageblock_skip(struct 
> compact_control *cc,
>  }
>  #endif /* CONFIG_COMPACTION */
>  
> -enum compact_contended should_release_lock(spinlock_t *lock)
> +/*
> + * Compaction requires the taking of some coarse locks that are potentially
> + * very heavily contended. For async compaction, back out if the lock cannot
> + * be taken immediately. For sync compaction, spin on the lock if needed.
> + *
> + * Returns true if the lock is held
> + * Returns false if the lock is not held and compaction should abort
> + */
> +static bool compact_trylock_irqsave(spinlock_t *lock,
> + unsigned long *flags, struct compact_control *cc)
>  {
> - if (spin_is_contended(lock))
> - return COMPACT_CONTENDED_LOCK;
> - else if (need_resched())
> - return COMPACT_CONTENDED_SCHED;
> - else
> - return COMPACT_CONTENDED_NONE;
> + if (cc->mode == MIGRATE_ASYNC) {
> + if (!spin_trylock_irqsave(lock, *flags)) {
> + cc->contended = COMPACT_CONTENDED_LOCK;
> +   

Re: [PATCH v3 05/13] mm, compaction: report compaction as contended only due to lock contention

2014-06-23 Thread Zhang Yanfei
>> + * and we report if all zones that were tried were contended.
>> + */
>> +if (!*deferred) {
>>  count_compact_event(COMPACTSTALL);
>> +*contended = all_zones_contended;
> 
> Why don't you initialize contended as *false* in function's intro?
> 
>> +}
>>  
>>  return rc;
>>  }
>> diff --git a/mm/internal.h b/mm/internal.h
>> index a1b651b..2c187d2 100644
>> --- a/mm/internal.h
>> +++ b/mm/internal.h
>> @@ -117,6 +117,13 @@ extern int user_min_free_kbytes;
>>  
>>  #if defined CONFIG_COMPACTION || defined CONFIG_CMA
>>  
>> +/* Used to signal whether compaction detected need_sched() or lock 
>> contention */
>> +enum compact_contended {
>> +COMPACT_CONTENDED_NONE = 0, /* no contention detected */
>> +COMPACT_CONTENDED_SCHED,/* need_sched() was true */
>> +COMPACT_CONTENDED_LOCK, /* zone lock or lru_lock was contended */
>> +};
>> +
>>  /*
>>   * in mm/compaction.c
>>   */
>> @@ -144,10 +151,10 @@ struct compact_control {
>>  int order;  /* order a direct compactor needs */
>>  int migratetype;/* MOVABLE, RECLAIMABLE etc */
>>  struct zone *zone;
>> -bool contended; /* True if a lock was contended, or
>> - * need_resched() true during async
>> - * compaction
>> - */
>> +enum compact_contended contended; /* Signal need_sched() or lock
>> +   * contention detected during
>> +   * compaction
>> +   */
>>  };
>>  
>>  unsigned long
>> -- 
> 
> Anyway, my biggest concern is that you are changing the current behavior, as
> I said earlier.
> 
> The old behavior in THP page fault, when it consumed its own timeslot, was to
> just abort and fall back to a 4K page; but with your patch, the new behavior
> is to take a rest when it finds need_resched() and go another round with
> async, not sync, compaction. I'm not sure we need another round of
> async compaction at the cost of increased latency rather than falling back
> to a 4K page.

I don't see the new behavior working like what you said. If need_resched()
is true, it calls cond_resched() and after a rest it just breaks the loop.
Why would there be another round of async compaction?

Thanks.

> 
> It might be okay if the VMA has MADV_HUGEPAGE, which is a good hint that the
> VMA is not temporary, so the latency would be a trade-off; but that's not the
> case for temporary big memory allocations on a HUGEPAGE_ALWAYS system.
> 
> If you really want to go this way, could you show us numbers?
> 
> 1. How much more successful can direct compaction be with this patch?
> 2. How much latency do we add for temporary allocations
>    on a HUGEPAGE_ALWAYS system?
> 


-- 
Thanks.
Zhang Yanfei


Re: [PATCH v3 04/13] mm, compaction: move pageblock checks up from isolate_migratepages_range()

2014-06-23 Thread Zhang Yanfei
On 06/20/2014 11:49 PM, Vlastimil Babka wrote:
> isolate_migratepages_range() is the main function of the compaction scanner,
> called either on a single pageblock by isolate_migratepages() during regular
> compaction, or on an arbitrary range by CMA's __alloc_contig_migrate_range().
> It currently performs two pageblock-wide compaction suitability checks, and
> because of the CMA callpath, it tracks if it crossed a pageblock boundary in
> order to repeat those checks.
> 
> However, closer inspection shows that those checks are always true for CMA:
> - isolation_suitable() is true because CMA sets cc->ignore_skip_hint to true
> - migrate_async_suitable() check is skipped because CMA uses sync compaction
> 
> We can therefore move the checks to isolate_migratepages(), reducing variables
> and simplifying isolate_migratepages_range(). The update_pageblock_skip()
> function also no longer needs set_unsuitable parameter.
> 
> Furthermore, going back to compact_zone() and compact_finished() when 
> pageblock
> is unsuitable is wasteful - the checks are meant to skip pageblocks quickly.
> The patch therefore also introduces a simple loop into isolate_migratepages()
> so that it does not return immediately on pageblock checks, but keeps going
> until isolate_migratepages_range() gets called once. Similarly to
> isolate_freepages(), the function periodically checks if it needs to 
> reschedule
> or abort async compaction.
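
The new loop in isolate_migratepages() does not appear in the truncated diff
below; conceptually it is something like this (simplified sketch, with
pfn_valid/skip-hint details and the periodic abort checks left out):

	for (; low_pfn < cc->free_pfn; low_pfn = end_pfn) {
		end_pfn = ALIGN(low_pfn + 1, pageblock_nr_pages);
		page = pfn_to_page(low_pfn);

		/* Checks that used to live in isolate_migratepages_range() */
		if (!isolation_suitable(cc, page))
			continue;
		if (cc->mode == MIGRATE_ASYNC &&
		    !migrate_async_suitable(get_pageblock_migratetype(page)))
			continue;

		/* Suitable pageblock found: scan it once, then stop looping */
		low_pfn = isolate_migratepages_range(zone, cc, low_pfn,
						     end_pfn, false);
		break;
	}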
> 
> Signed-off-by: Vlastimil Babka 
> Cc: Minchan Kim 
> Cc: Mel Gorman 
> Cc: Joonsoo Kim 
> Cc: Michal Nazarewicz 
> Cc: Naoya Horiguchi 
> Cc: Christoph Lameter 
> Cc: Rik van Riel 
> Cc: David Rientjes 

I think this is a good clean-up to make code more clear.

Reviewed-by: Zhang Yanfei 

Only a tiny nit-pick below.

> ---
>  mm/compaction.c | 112 
> +---
>  1 file changed, 59 insertions(+), 53 deletions(-)
> 
> diff --git a/mm/compaction.c b/mm/compaction.c
> index 3064a7f..ebe30c9 100644
> --- a/mm/compaction.c
> +++ b/mm/compaction.c
> @@ -132,7 +132,7 @@ void reset_isolation_suitable(pg_data_t *pgdat)
>   */
>  static void update_pageblock_skip(struct compact_control *cc,
>   struct page *page, unsigned long nr_isolated,
> - bool set_unsuitable, bool migrate_scanner)
> + bool migrate_scanner)
>  {
>   struct zone *zone = cc->zone;
>   unsigned long pfn;
> @@ -146,12 +146,7 @@ static void update_pageblock_skip(struct compact_control 
> *cc,
>   if (nr_isolated)
>   return;
>  
> - /*
> -  * Only skip pageblocks when all forms of compaction will be known to
> -  * fail in the near future.
> -  */
> - if (set_unsuitable)
> - set_pageblock_skip(page);
> + set_pageblock_skip(page);
>  
>   pfn = page_to_pfn(page);
>  
> @@ -180,7 +175,7 @@ static inline bool isolation_suitable(struct 
> compact_control *cc,
>  
>  static void update_pageblock_skip(struct compact_control *cc,
>   struct page *page, unsigned long nr_isolated,
> - bool set_unsuitable, bool migrate_scanner)
> + bool migrate_scanner)
>  {
>  }
>  #endif /* CONFIG_COMPACTION */
> @@ -345,8 +340,7 @@ isolate_fail:
>  
>   /* Update the pageblock-skip if the whole pageblock was scanned */
>   if (blockpfn == end_pfn)
> - update_pageblock_skip(cc, valid_page, total_isolated, true,
> -   false);
> + update_pageblock_skip(cc, valid_page, total_isolated, false);
>  
>   count_compact_events(COMPACTFREE_SCANNED, nr_scanned);
>   if (total_isolated)
> @@ -474,14 +468,12 @@ unsigned long
>  isolate_migratepages_range(struct zone *zone, struct compact_control *cc,
>   unsigned long low_pfn, unsigned long end_pfn, bool unevictable)
>  {
> - unsigned long last_pageblock_nr = 0, pageblock_nr;
>   unsigned long nr_scanned = 0, nr_isolated = 0;
>   struct list_head *migratelist = >migratepages;
>   struct lruvec *lruvec;
>   unsigned long flags;
>   bool locked = false;
>   struct page *page = NULL, *valid_page = NULL;
> - bool set_unsuitable = true;
>   const isolate_mode_t mode = (cc->mode == MIGRATE_ASYNC ?
>   ISOLATE_ASYNC_MIGRATE : 0) |
>   (unevictable ? ISOLATE_UNEVICTABLE : 0);
> @@ -545,28 +537,6 @@ isolate_migratepages_range(struct zone *zone, struct 
> compact_control *cc,
>   if (!valid_page)
>   valid_page = page;
>  
> - /* If isolation re

Re: [PATCH v3 02/13] mm, compaction: defer each zone individually instead of preferred zone

2014-06-23 Thread Zhang Yanfei
On 06/20/2014 11:49 PM, Vlastimil Babka wrote:
> When direct sync compaction is often unsuccessful, it may become deferred for
> some time to avoid further useless attempts, both sync and async. Successful
> high-order allocations un-defer compaction, while further unsuccessful
> compaction attempts prolong the compaction deferred period.
> 
> Currently the checking and setting deferred status is performed only on the
> preferred zone of the allocation that invoked direct compaction. But 
> compaction
> itself is attempted on all eligible zones in the zonelist, so the behavior is
> suboptimal and may lead both to scenarios where 1) compaction is attempted
> uselessly, or 2) where it's not attempted despite good chances of succeeding,
> as shown on the examples below:
> 
> 1) A direct compaction with Normal preferred zone failed and set deferred
>compaction for the Normal zone. Another unrelated direct compaction with
>DMA32 as preferred zone will attempt to compact DMA32 zone even though
>the first compaction attempt also included DMA32 zone.
> 
>In another scenario, compaction with Normal preferred zone failed to 
> compact
>Normal zone, but succeeded in the DMA32 zone, so it will not defer
>compaction. In the next attempt, it will try Normal zone which will fail
>again, instead of skipping Normal zone and trying DMA32 directly.
> 
> 2) Kswapd will balance DMA32 zone and reset defer status based on watermarks
>looking good. A direct compaction with preferred Normal zone will skip
>compaction of all zones including DMA32 because Normal was still deferred.
>The allocation might have succeeded in DMA32, but won't.
> 
> This patch makes compaction deferring work on individual zone basis instead of
> preferred zone. For each zone, it checks compaction_deferred() to decide if 
> the
> zone should be skipped. If watermarks fail after compacting the zone,
> defer_compaction() is called. The zone where watermarks passed can still be
> deferred when the allocation attempt is unsuccessful. When allocation is
> successful, compaction_defer_reset() is called for the zone containing the
> allocated page. This approach should approximate calling defer_compaction()
> only on zones where compaction was attempted and did not yield an allocated
> page. There might be corner cases but that is inevitable as long as the decision
> to stop compacting does not guarantee that a page will be allocated.
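
The per-zone flow in try_to_compact_pages() then boils down to something like
this condensed sketch (error handling and the contended/deferred bookkeeping
trimmed; see the full patch for the exact code):

	for_each_zone_zonelist_nodemask(zone, z, zonelist, high_zoneidx,
								nodemask) {
		/* Skip zones whose compaction was recently deferred */
		if (compaction_deferred(zone, order))
			continue;

		status = compact_zone_order(zone, order, gfp_mask, mode,
					    &zone_contended);
		rc = max(status, rc);

		/* Watermarks look good now: report this zone and stop */
		if (zone_watermark_ok(zone, order, low_wmark_pages(zone),
				      0, 0)) {
			*candidate_zone = zone;
			break;
		}

		/*
		 * Compaction ran but the allocation would still fail, so
		 * defer further attempts for this zone only.
		 */
		defer_compaction(zone, order);
	}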
> 
> During testing on a two-node machine with a single very small Normal zone on
> node 1, this patch has improved success rates in stress-highalloc mmtests
> benchmark. The success rates here were previously made worse by commit 3a025760fc
> ("mm: page_alloc: spill to remote nodes before waking kswapd") as kswapd was
> no longer resetting often enough the deferred compaction for the Normal zone,
> and DMA32 zones on both nodes were thus not considered for compaction.
> 
> Signed-off-by: Vlastimil Babka 
> Cc: Minchan Kim 
> Cc: Mel Gorman 
> Cc: Joonsoo Kim 
> Cc: Michal Nazarewicz 
> Cc: Naoya Horiguchi 
> Cc: Christoph Lameter 
> Cc: Rik van Riel 
> Cc: David Rientjes 

Really good.

Reviewed-by: Zhang Yanfei 

> ---
>  include/linux/compaction.h |  6 --
>  mm/compaction.c| 29 -
>  mm/page_alloc.c| 33 ++---
>  3 files changed, 46 insertions(+), 22 deletions(-)
> 
> diff --git a/include/linux/compaction.h b/include/linux/compaction.h
> index 01e3132..76f9beb 100644
> --- a/include/linux/compaction.h
> +++ b/include/linux/compaction.h
> @@ -22,7 +22,8 @@ extern int sysctl_extfrag_handler(struct ctl_table *table, 
> int write,
>  extern int fragmentation_index(struct zone *zone, unsigned int order);
>  extern unsigned long try_to_compact_pages(struct zonelist *zonelist,
>   int order, gfp_t gfp_mask, nodemask_t *mask,
> - enum migrate_mode mode, bool *contended);
> + enum migrate_mode mode, bool *contended, bool *deferred,
> + struct zone **candidate_zone);
>  extern void compact_pgdat(pg_data_t *pgdat, int order);
>  extern void reset_isolation_suitable(pg_data_t *pgdat);
>  extern unsigned long compaction_suitable(struct zone *zone, int order);
> @@ -91,7 +92,8 @@ static inline bool compaction_restarting(struct zone *zone, 
> int order)
>  #else
>  static inline unsigned long try_to_compact_pages(struct zonelist *zonelist,
>   int order, gfp_t gfp_mask, nodemask_t *nodemask,
> - enum migrate_mode mode, bool *contended)
> + enum migrate_mode mode, bool *contended, bool *deferred,
> + struct zone

Re: [PATCH v3 05/13] mm, compaction: report compaction as contended only due to lock contention

2014-06-23 Thread Zhang Yanfei
Hello Minchan

Thank you for the explanation. Actually, I was reading an old version of the
kernel. The latest upstream kernel behaves as you described below. Oops, it has
been a while since I last followed the buddy allocator changes.

Thanks.

On 06/24/2014 07:35 AM, Minchan Kim wrote:
>>> Anyway, my biggest concern is that you are changing the current behavior, as
>>> I said earlier.
>>>
>>> The old behavior in THP page fault, when it consumed its own timeslot, was to
>>> just abort and fall back to a 4K page; but with your patch, the new behavior
>>> is to take a rest when it finds need_resched() and go another round with
>>> async, not sync, compaction. I'm not sure we need another round of
>>> async compaction at the cost of increased latency rather than falling back
>>> to a 4K page.
>>
>> I don't see the new behavior working like what you said. If need_resched()
>> is true, it calls cond_resched() and after a rest it just breaks the loop.
>> Why would there be another round of async compaction?
> One example goes:
> 
> Old:
> page fault
> huge page allocation
> __alloc_pages_slowpath
> __alloc_pages_direct_compact
> compact_zone_order
> isolate_migratepages
> compact_checklock_irqsave
> need_resched() is true
> cc->contended = true;
> return ISOLATE_ABORT
> return COMPACT_PARTIAL with *contended = cc.contended;
> COMPACTFAIL
> if (contended_compaction && gfp_mask & __GFP_NO_KSWAPD)
> goto nopage;
> 
> New:
> 
> page fault
> huge page allocation
> __alloc_pages_slowpath
> __alloc_pages_direct_compact
> compact_zone_order
> isolate_migratepages
> compact_unlock_should_abort
> need_resched() is true
> cc->contended = COMPACT_CONTENDED_SCHED;
> return true;
> return ISOLATE_ABORT
> return COMPACT_PARTIAL with *contended = cc.contended ==
> COMPACT_CONTENDED_LOCK (1)
> COMPACTFAIL
> if (contended_compaction && gfp_mask & __GFP_NO_KSWAPD)
> no goto nopage, because contended_compaction was false by (1)
> 
> __alloc_pages_direct_reclaim
> if (should_alloc_retry)
> else
> __alloc_pages_direct_compact again with ASYNC_MODE
> 
> 


-- 
Thanks.
Zhang Yanfei


Re: [PATCH v3 02/13] mm, compaction: defer each zone individually instead of preferred zone

2014-06-23 Thread Zhang Yanfei
On 06/20/2014 11:49 PM, Vlastimil Babka wrote:
 When direct sync compaction is often unsuccessful, it may become deferred for
 some time to avoid further useless attempts, both sync and async. Successful
 high-order allocations un-defer compaction, while further unsuccessful
 compaction attempts prolong the copmaction deferred period.
 
 Currently the checking and setting deferred status is performed only on the
 preferred zone of the allocation that invoked direct compaction. But 
 compaction
 itself is attempted on all eligible zones in the zonelist, so the behavior is
 suboptimal and may lead both to scenarios where 1) compaction is attempted
 uselessly, or 2) where it's not attempted despite good chances of succeeding,
 as shown on the examples below:
 
 1) A direct compaction with Normal preferred zone failed and set deferred
compaction for the Normal zone. Another unrelated direct compaction with
DMA32 as preferred zone will attempt to compact DMA32 zone even though
the first compaction attempt also included DMA32 zone.
 
In another scenario, compaction with Normal preferred zone failed to 
 compact
Normal zone, but succeeded in the DMA32 zone, so it will not defer
compaction. In the next attempt, it will try Normal zone which will fail
again, instead of skipping Normal zone and trying DMA32 directly.
 
 2) Kswapd will balance DMA32 zone and reset defer status based on watermarks
looking good. A direct compaction with preferred Normal zone will skip
compaction of all zones including DMA32 because Normal was still deferred.
The allocation might have succeeded in DMA32, but won't.
 
 This patch makes compaction deferring work on individual zone basis instead of
 preferred zone. For each zone, it checks compaction_deferred() to decide if 
 the
 zone should be skipped. If watermarks fail after compacting the zone,
 defer_compaction() is called. The zone where watermarks passed can still be
 deferred when the allocation attempt is unsuccessful. When allocation is
 successful, compaction_defer_reset() is called for the zone containing the
 allocated page. This approach should approximate calling defer_compaction()
 only on zones where compaction was attempted and did not yield allocated page.
 There might be corner cases but that is inevitable as long as the decision
 to stop compacting dues not guarantee that a page will be allocated.
 
 During testing on a two-node machine with a single very small Normal zone on
 node 1, this patch has improved success rates in stress-highalloc mmtests
 benchmark. The success here were previously made worse by commit 3a025760fc
 (mm: page_alloc: spill to remote nodes before waking kswapd) as kswapd was
 no longer resetting often enough the deferred compaction for the Normal zone,
 and DMA32 zones on both nodes were thus not considered for compaction.
 
 Signed-off-by: Vlastimil Babka vba...@suse.cz
 Cc: Minchan Kim minc...@kernel.org
 Cc: Mel Gorman mgor...@suse.de
 Cc: Joonsoo Kim iamjoonsoo@lge.com
 Cc: Michal Nazarewicz min...@mina86.com
 Cc: Naoya Horiguchi n-horigu...@ah.jp.nec.com
 Cc: Christoph Lameter c...@linux.com
 Cc: Rik van Riel r...@redhat.com
 Cc: David Rientjes rient...@google.com

Really good.

Reviewed-by: Zhang Yanfei zhangyan...@cn.fujitsu.com

 ---
  include/linux/compaction.h |  6 --
  mm/compaction.c| 29 -
  mm/page_alloc.c| 33 ++---
  3 files changed, 46 insertions(+), 22 deletions(-)
 
 diff --git a/include/linux/compaction.h b/include/linux/compaction.h
 index 01e3132..76f9beb 100644
 --- a/include/linux/compaction.h
 +++ b/include/linux/compaction.h
 @@ -22,7 +22,8 @@ extern int sysctl_extfrag_handler(struct ctl_table *table, 
 int write,
  extern int fragmentation_index(struct zone *zone, unsigned int order);
  extern unsigned long try_to_compact_pages(struct zonelist *zonelist,
   int order, gfp_t gfp_mask, nodemask_t *mask,
 - enum migrate_mode mode, bool *contended);
 + enum migrate_mode mode, bool *contended, bool *deferred,
 + struct zone **candidate_zone);
  extern void compact_pgdat(pg_data_t *pgdat, int order);
  extern void reset_isolation_suitable(pg_data_t *pgdat);
  extern unsigned long compaction_suitable(struct zone *zone, int order);
 @@ -91,7 +92,8 @@ static inline bool compaction_restarting(struct zone *zone, 
 int order)
  #else
  static inline unsigned long try_to_compact_pages(struct zonelist *zonelist,
   int order, gfp_t gfp_mask, nodemask_t *nodemask,
 - enum migrate_mode mode, bool *contended)
 + enum migrate_mode mode, bool *contended, bool *deferred,
 + struct zone **candidate_zone)
  {
   return COMPACT_CONTINUE;
  }
 diff --git a/mm/compaction.c b/mm/compaction.c
 index 5175019..7c491d0 100644
 --- a/mm/compaction.c
 +++ b/mm

Re: [PATCH v3 04/13] mm, compaction: move pageblock checks up from isolate_migratepages_range()

2014-06-23 Thread Zhang Yanfei
On 06/20/2014 11:49 PM, Vlastimil Babka wrote:
 isolate_migratepages_range() is the main function of the compaction scanner,
 called either on a single pageblock by isolate_migratepages() during regular
 compaction, or on an arbitrary range by CMA's __alloc_contig_migrate_range().
 It currently perfoms two pageblock-wide compaction suitability checks, and
 because of the CMA callpath, it tracks if it crossed a pageblock boundary in
 order to repeat those checks.
 
 However, closer inspection shows that those checks are always true for CMA:
 - isolation_suitable() is true because CMA sets cc-ignore_skip_hint to true
 - migrate_async_suitable() check is skipped because CMA uses sync compaction
 
 We can therefore move the checks to isolate_migratepages(), reducing variables
 and simplifying isolate_migratepages_range(). The update_pageblock_skip()
 function also no longer needs set_unsuitable parameter.
 
 Furthermore, going back to compact_zone() and compact_finished() when 
 pageblock
 is unsuitable is wasteful - the checks are meant to skip pageblocks quickly.
 The patch therefore also introduces a simple loop into isolate_migratepages()
 so that it does not return immediately on pageblock checks, but keeps going
 until isolate_migratepages_range() gets called once. Similarily to
 isolate_freepages(), the function periodically checks if it needs to 
 reschedule
 or abort async compaction.
 
 Signed-off-by: Vlastimil Babka vba...@suse.cz
 Cc: Minchan Kim minc...@kernel.org
 Cc: Mel Gorman mgor...@suse.de
 Cc: Joonsoo Kim iamjoonsoo@lge.com
 Cc: Michal Nazarewicz min...@mina86.com
 Cc: Naoya Horiguchi n-horigu...@ah.jp.nec.com
 Cc: Christoph Lameter c...@linux.com
 Cc: Rik van Riel r...@redhat.com
 Cc: David Rientjes rient...@google.com

I think this is a good clean-up to make code more clear.

Reviewed-by: Zhang Yanfei zhangyan...@cn.fujitsu.com

Only a tiny nit-pick below.

 ---
  mm/compaction.c | 112 
 +---
  1 file changed, 59 insertions(+), 53 deletions(-)
 
 diff --git a/mm/compaction.c b/mm/compaction.c
 index 3064a7f..ebe30c9 100644
 --- a/mm/compaction.c
 +++ b/mm/compaction.c
 @@ -132,7 +132,7 @@ void reset_isolation_suitable(pg_data_t *pgdat)
   */
  static void update_pageblock_skip(struct compact_control *cc,
   struct page *page, unsigned long nr_isolated,
 - bool set_unsuitable, bool migrate_scanner)
 + bool migrate_scanner)
  {
   struct zone *zone = cc-zone;
   unsigned long pfn;
 @@ -146,12 +146,7 @@ static void update_pageblock_skip(struct compact_control 
 *cc,
   if (nr_isolated)
   return;
  
 - /*
 -  * Only skip pageblocks when all forms of compaction will be known to
 -  * fail in the near future.
 -  */
 - if (set_unsuitable)
 - set_pageblock_skip(page);
 + set_pageblock_skip(page);
  
   pfn = page_to_pfn(page);
  
 @@ -180,7 +175,7 @@ static inline bool isolation_suitable(struct 
 compact_control *cc,
  
  static void update_pageblock_skip(struct compact_control *cc,
   struct page *page, unsigned long nr_isolated,
 - bool set_unsuitable, bool migrate_scanner)
 + bool migrate_scanner)
  {
  }
  #endif /* CONFIG_COMPACTION */
 @@ -345,8 +340,7 @@ isolate_fail:
  
   /* Update the pageblock-skip if the whole pageblock was scanned */
   if (blockpfn == end_pfn)
 - update_pageblock_skip(cc, valid_page, total_isolated, true,
 -   false);
 + update_pageblock_skip(cc, valid_page, total_isolated, false);
  
   count_compact_events(COMPACTFREE_SCANNED, nr_scanned);
   if (total_isolated)
 @@ -474,14 +468,12 @@ unsigned long
  isolate_migratepages_range(struct zone *zone, struct compact_control *cc,
   unsigned long low_pfn, unsigned long end_pfn, bool unevictable)
  {
 - unsigned long last_pageblock_nr = 0, pageblock_nr;
   unsigned long nr_scanned = 0, nr_isolated = 0;
   struct list_head *migratelist = cc-migratepages;
   struct lruvec *lruvec;
   unsigned long flags;
   bool locked = false;
   struct page *page = NULL, *valid_page = NULL;
 - bool set_unsuitable = true;
   const isolate_mode_t mode = (cc-mode == MIGRATE_ASYNC ?
   ISOLATE_ASYNC_MIGRATE : 0) |
   (unevictable ? ISOLATE_UNEVICTABLE : 0);
 @@ -545,28 +537,6 @@ isolate_migratepages_range(struct zone *zone, struct 
 compact_control *cc,
   if (!valid_page)
   valid_page = page;
  
 - /* If isolation recently failed, do not retry */
 - pageblock_nr = low_pfn  pageblock_order;
 - if (last_pageblock_nr != pageblock_nr) {
 - int mt;
 -
 - last_pageblock_nr = pageblock_nr

Re: [PATCH v3 05/13] mm, compaction: report compaction as contended only due to lock contention

2014-06-23 Thread Zhang Yanfei
 +   * compaction
 +   */
  };
  
  unsigned long
 -- 
 
 Anyway, most big concern is that you are changing current behavior as
 I said earlier.
 
 Old behavior in THP page fault when it consumes own timeslot was just
 abort and fallback 4K page but with your patch, new behavior is
 take a rest when it founds need_resched and goes to another round with
 async, not sync compaction. I'm not sure we need another round with
 async compaction at the cost of increasing latency rather than fallback
 4 page.

I don't see the new behavior works like what you said. If need_resched
is true, it calls cond_resched() and after a rest it just breaks the loop.
Why there is another round with async compact?

Thanks.

 
 It might be okay if the VMA has MADV_HUGEPAGE which is good hint to
 indicate non-temporal VMA so latency would be trade-off but it's not
 for temporal big memory allocation in HUGEPAGE_ALWAYS system.
 
 If you really want to go this, could you show us numbers?
 
 1. How many could we can be successful in direct compaction by this patch?
 2. How long could we increase latency for temporal allocation
for HUGEPAGE_ALWAYS system?
 


-- 
Thanks.
Zhang Yanfei
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v3 06/13] mm, compaction: periodically drop lock and restore IRQs in scanners

2014-06-23 Thread Zhang Yanfei
On 06/20/2014 11:49 PM, Vlastimil Babka wrote:
 Compaction scanners regularly check for lock contention and need_resched()
 through the compact_checklock_irqsave() function. However, if there is no
 contention, the lock can be held and IRQ disabled for potentially long time.
 
 This has been addressed by commit b2eef8c0d0 (mm: compaction: minimise the
 time IRQs are disabled while isolating pages for migration) for the migration
 scanner. However, the refactoring done by commit 748446bb6b (mm: compaction:
 acquire the zone-lru_lock as late as possible) has changed the conditions so
 that the lock is dropped only when there's contention on the lock or
 need_resched() is true. Also, need_resched() is checked only when the lock is
 already held. The comment give a chance to irqs before checking need_resched
 is therefore misleading, as IRQs remain disabled when the check is done.
 
 This patch restores the behavior intended by commit b2eef8c0d0 and also tries
 to better balance and make more deterministic the time spent by checking for
 contention vs the time the scanners might run between the checks. It also
 avoids situations where checking has not been done often enough before. The
 result should be avoiding both too frequent and too infrequent contention
 checking, and especially the potentially long-running scans with IRQs disabled
 and no checking of need_resched() or for fatal signal pending, which can 
 happen
 when many consecutive pages or pageblocks fail the preliminary tests and do 
 not
 reach the later call site to compact_checklock_irqsave(), as explained below.
 
 Before the patch:
 
 In the migration scanner, compact_checklock_irqsave() was called each loop, if
 reached. If not reached, some lower-frequency checking could still be done if
 the lock was already held, but this would not result in aborting contended
 async compaction until reaching compact_checklock_irqsave() or end of
 pageblock. In the free scanner, it was similar but completely without the
 periodical checking, so lock can be potentially held until reaching the end of
 pageblock.
 
 After the patch, in both scanners:
 
 The periodical check is done as the first thing in the loop on each
 SWAP_CLUSTER_MAX aligned pfn, using the new compact_unlock_should_abort()
 function, which always unlocks the lock (if locked) and aborts async 
 compaction
 if scheduling is needed. It also aborts any type of compaction when a fatal
 signal is pending.
 
 The compact_checklock_irqsave() function is replaced with a slightly different
 compact_trylock_irqsave(). The biggest difference is that the function is not
 called at all if the lock is already held. The periodical need_resched()
 checking is left solely to compact_unlock_should_abort(). The lock contention
 avoidance for async compaction is achieved by the periodical unlock by
 compact_unlock_should_abort() and by using trylock in 
 compact_trylock_irqsave()
 and aborting when trylock fails. Sync compaction does not use trylock.
 
 Signed-off-by: Vlastimil Babka vba...@suse.cz
 Cc: Minchan Kim minc...@kernel.org
 Cc: Mel Gorman mgor...@suse.de
 Cc: Michal Nazarewicz min...@mina86.com
 Cc: Naoya Horiguchi n-horigu...@ah.jp.nec.com
 Cc: Christoph Lameter c...@linux.com
 Cc: Rik van Riel r...@redhat.com
 Cc: David Rientjes rient...@google.com

Reviewed-by: Zhang Yanfei zhangyan...@cn.fujitsu.com

 ---
  mm/compaction.c | 114 
 
  1 file changed, 73 insertions(+), 41 deletions(-)
 
 diff --git a/mm/compaction.c b/mm/compaction.c
 index e8cfac9..40da812 100644
 --- a/mm/compaction.c
 +++ b/mm/compaction.c
 @@ -180,54 +180,72 @@ static void update_pageblock_skip(struct 
 compact_control *cc,
  }
  #endif /* CONFIG_COMPACTION */
  
 -enum compact_contended should_release_lock(spinlock_t *lock)
 +/*
 + * Compaction requires the taking of some coarse locks that are potentially
 + * very heavily contended. For async compaction, back out if the lock cannot
 + * be taken immediately. For sync compaction, spin on the lock if needed.
 + *
 + * Returns true if the lock is held
 + * Returns false if the lock is not held and compaction should abort
 + */
 +static bool compact_trylock_irqsave(spinlock_t *lock,
 + unsigned long *flags, struct compact_control *cc)
  {
 - if (spin_is_contended(lock))
 - return COMPACT_CONTENDED_LOCK;
 - else if (need_resched())
 - return COMPACT_CONTENDED_SCHED;
 - else
 - return COMPACT_CONTENDED_NONE;
 + if (cc-mode == MIGRATE_ASYNC) {
 + if (!spin_trylock_irqsave(lock, *flags)) {
 + cc-contended = COMPACT_CONTENDED_LOCK;
 + return false;
 + }
 + } else {
 + spin_lock_irqsave(lock, *flags);
 + }
 +
 + return true;
  }
  
  /*
   * Compaction requires the taking of some coarse locks that are potentially
 - * very heavily contended. Check if the process needs

Re: [PATCH v3 07/13] mm, compaction: skip rechecks when lock was already held

2014-06-23 Thread Zhang Yanfei
On 06/20/2014 11:49 PM, Vlastimil Babka wrote:
 Compaction scanners try to lock zone locks as late as possible by checking
 many page or pageblock properties opportunistically without lock and skipping
 them if not unsuitable. For pages that pass the initial checks, some 
 properties
 have to be checked again safely under lock. However, if the lock was already
 held from a previous iteration in the initial checks, the rechecks are
 unnecessary.
 
 This patch therefore skips the rechecks when the lock was already held. This 
 is
 now possible to do, since we don't (potentially) drop and reacquire the lock
 between the initial checks and the safe rechecks anymore.
 
 Signed-off-by: Vlastimil Babka vba...@suse.cz
 Acked-by: Minchan Kim minc...@kernel.org
 Cc: Mel Gorman mgor...@suse.de
 Cc: Michal Nazarewicz min...@mina86.com
 Cc: Naoya Horiguchi n-horigu...@ah.jp.nec.com
 Cc: Christoph Lameter c...@linux.com
 Cc: Rik van Riel r...@redhat.com
 Acked-by: David Rientjes rient...@google.com

Reviewed-by: Zhang Yanfei zhangyan...@cn.fujitsu.com

 ---
  mm/compaction.c | 53 +++--
  1 file changed, 31 insertions(+), 22 deletions(-)
 
 diff --git a/mm/compaction.c b/mm/compaction.c
 index 40da812..9f6e857 100644
 --- a/mm/compaction.c
 +++ b/mm/compaction.c
 @@ -324,22 +324,30 @@ static unsigned long isolate_freepages_block(struct 
 compact_control *cc,
   goto isolate_fail;
  
   /*
 -  * The zone lock must be held to isolate freepages.
 -  * Unfortunately this is a very coarse lock and can be
 -  * heavily contended if there are parallel allocations
 -  * or parallel compactions. For async compaction do not
 -  * spin on the lock and we acquire the lock as late as
 -  * possible.
 +  * If we already hold the lock, we can skip some rechecking.
 +  * Note that if we hold the lock now, checked_pageblock was
 +  * already set in some previous iteration (or strict is true),
 +  * so it is correct to skip the suitable migration target
 +  * recheck as well.
*/
 - if (!locked)
 + if (!locked) {
 + /*
 +  * The zone lock must be held to isolate freepages.
 +  * Unfortunately this is a very coarse lock and can be
 +  * heavily contended if there are parallel allocations
 +  * or parallel compactions. For async compaction do not
 +  * spin on the lock and we acquire the lock as late as
 +  * possible.
 +  */
   locked = compact_trylock_irqsave(cc-zone-lock,
   flags, cc);
 - if (!locked)
 - break;
 + if (!locked)
 + break;
  
 - /* Recheck this is a buddy page under lock */
 - if (!PageBuddy(page))
 - goto isolate_fail;
 + /* Recheck this is a buddy page under lock */
 + if (!PageBuddy(page))
 + goto isolate_fail;
 + }
  
   /* Found a free page, break it into order-0 pages */
   isolated = split_free_page(page);
 @@ -623,19 +631,20 @@ isolate_migratepages_range(struct zone *zone, struct 
 compact_control *cc,
   page_count(page)  page_mapcount(page))
   continue;
  
 - /* If the lock is not held, try to take it */
 - if (!locked)
 + /* If we already hold the lock, we can skip some rechecking */
 + if (!locked) {
   locked = compact_trylock_irqsave(zone-lru_lock,
   flags, cc);
 - if (!locked)
 - break;
 + if (!locked)
 + break;
  
 - /* Recheck PageLRU and PageTransHuge under lock */
 - if (!PageLRU(page))
 - continue;
 - if (PageTransHuge(page)) {
 - low_pfn += (1  compound_order(page)) - 1;
 - continue;
 + /* Recheck PageLRU and PageTransHuge under lock */
 + if (!PageLRU(page))
 + continue;
 + if (PageTransHuge(page)) {
 + low_pfn += (1  compound_order(page)) - 1;
 + continue;
 + }
   }
  
   lruvec = mem_cgroup_page_lruvec(page, zone);
 


-- 
Thanks.
Zhang Yanfei
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More

Re: [PATCH v3 08/13] mm, compaction: remember position within pageblock in free pages scanner

2014-06-23 Thread Zhang Yanfei
On 06/20/2014 11:49 PM, Vlastimil Babka wrote:
 Unlike the migration scanner, the free scanner remembers the beginning of the
 last scanned pageblock in cc-free_pfn. It might be therefore rescanning pages
 uselessly when called several times during single compaction. This might have
 been useful when pages were returned to the buddy allocator after a failed
 migration, but this is no longer the case.
 
 This patch changes the meaning of cc-free_pfn so that if it points to a
 middle of a pageblock, that pageblock is scanned only from cc-free_pfn to the
 end. isolate_freepages_block() will record the pfn of the last page it looked
 at, which is then used to update cc-free_pfn.
 
 In the mmtests stress-highalloc benchmark, this has resulted in lowering the
 ratio between pages scanned by both scanners, from 2.5 free pages per migrate
 page, to 2.25 free pages per migrate page, without affecting success rates.
 
 Signed-off-by: Vlastimil Babka vba...@suse.cz
 Acked-by: David Rientjes rient...@google.com
 Cc: Minchan Kim minc...@kernel.org
 Cc: Mel Gorman mgor...@suse.de
 Cc: Joonsoo Kim iamjoonsoo@lge.com
 Cc: Michal Nazarewicz min...@mina86.com
 Cc: Naoya Horiguchi n-horigu...@ah.jp.nec.com
 Cc: Christoph Lameter c...@linux.com
 Cc: Rik van Riel r...@redhat.com
 Cc: Zhang Yanfei zhangyan...@cn.fujitsu.com

Reviewed-by: Zhang Yanfei zhangyan...@cn.fujitsu.com

 ---
  mm/compaction.c | 40 +++-
  1 file changed, 31 insertions(+), 9 deletions(-)
 
 diff --git a/mm/compaction.c b/mm/compaction.c
 index 9f6e857..41c7005 100644
 --- a/mm/compaction.c
 +++ b/mm/compaction.c
 @@ -287,7 +287,7 @@ static bool suitable_migration_target(struct page *page)
   * (even though it may still end up isolating some pages).
   */
  static unsigned long isolate_freepages_block(struct compact_control *cc,
 - unsigned long blockpfn,
 + unsigned long *start_pfn,
   unsigned long end_pfn,
   struct list_head *freelist,
   bool strict)
 @@ -296,6 +296,7 @@ static unsigned long isolate_freepages_block(struct 
 compact_control *cc,
   struct page *cursor, *valid_page = NULL;
   unsigned long flags;
   bool locked = false;
 + unsigned long blockpfn = *start_pfn;
  
   cursor = pfn_to_page(blockpfn);
  
 @@ -369,6 +370,9 @@ isolate_fail:
   break;
   }
  
 + /* Record how far we have got within the block */
 + *start_pfn = blockpfn;
 +
   trace_mm_compaction_isolate_freepages(nr_scanned, total_isolated);
  
   /*
 @@ -413,6 +417,9 @@ isolate_freepages_range(struct compact_control *cc,
   LIST_HEAD(freelist);
  
   for (pfn = start_pfn; pfn  end_pfn; pfn += isolated) {
 + /* Protect pfn from changing by isolate_freepages_block */
 + unsigned long isolate_start_pfn = pfn;
 +
   if (!pfn_valid(pfn) || cc-zone != page_zone(pfn_to_page(pfn)))
   break;
  
 @@ -423,8 +430,8 @@ isolate_freepages_range(struct compact_control *cc,
   block_end_pfn = ALIGN(pfn + 1, pageblock_nr_pages);
   block_end_pfn = min(block_end_pfn, end_pfn);
  
 - isolated = isolate_freepages_block(cc, pfn, block_end_pfn,
 -freelist, true);
 + isolated = isolate_freepages_block(cc, isolate_start_pfn,
 + block_end_pfn, freelist, true);
  
   /*
* In strict mode, isolate_freepages_block() returns 0 if
 @@ -708,6 +715,7 @@ static void isolate_freepages(struct zone *zone,
  {
   struct page *page;
   unsigned long block_start_pfn;  /* start of current pageblock */
 + unsigned long isolate_start_pfn; /* exact pfn we start at */
   unsigned long block_end_pfn;/* end of current pageblock */
   unsigned long low_pfn;   /* lowest pfn scanner is able to scan */
   int nr_freepages = cc->nr_freepages;
 @@ -716,14 +724,15 @@ static void isolate_freepages(struct zone *zone,
   /*
* Initialise the free scanner. The starting point is where we last
* successfully isolated from, zone-cached value, or the end of the
 -  * zone when isolating for the first time. We need this aligned to
 -  * the pageblock boundary, because we do
 +  * zone when isolating for the first time. For looping we also need
 +  * this pfn aligned down to the pageblock boundary, because we do
* block_start_pfn -= pageblock_nr_pages in the for loop.
* For ending point, take care when isolating in last pageblock of a
* a zone which ends in the middle of a pageblock.
* The low boundary is the end of the pageblock the migration scanner
* is using.
*/
 + isolate_start_pfn = cc-free_pfn;
   block_start_pfn = cc->free_pfn & ~(pageblock_nr_pages

Re: [PATCH v3 09/13] mm, compaction: skip buddy pages by their order in the migrate scanner

2014-06-23 Thread Zhang Yanfei
On 06/20/2014 11:49 PM, Vlastimil Babka wrote:
 The migration scanner skips PageBuddy pages, but does not consider their order
 as checking page_order() is generally unsafe without holding the zone->lock,
 and acquiring the lock just for the check wouldn't be a good tradeoff.
 
 Still, this could avoid some iterations over the rest of the buddy page, and
 if we are careful, the race window between PageBuddy() check and page_order()
 is small, and the worst thing that can happen is that we skip too much and miss
 some isolation candidates. This is not that bad, as compaction can already fail
 for many other reasons like parallel allocations, and those have a much larger
 race window.
 
 This patch therefore makes the migration scanner obtain the buddy page order
 and use it to skip the whole buddy page, if the order appears to be in the
 valid range.
 
 It's important that the page_order() is read only once, so that the value used
 in the checks and in the pfn calculation is the same. But in theory the
 compiler can replace the local variable by multiple inlines of page_order().
 Therefore, the patch introduces page_order_unsafe() that uses ACCESS_ONCE to
 prevent this.
 
 Testing with stress-highalloc from mmtests shows a 15% reduction in number of
 pages scanned by migration scanner. This change is also a prerequisite for a
 later patch which is detecting when a cc->order block of pages contains
 non-buddy pages that cannot be isolated, and the scanner should thus skip to
 the next block immediately.
 
 Signed-off-by: Vlastimil Babka vba...@suse.cz
 Cc: Minchan Kim minc...@kernel.org
 Cc: Mel Gorman mgor...@suse.de
 Cc: Joonsoo Kim iamjoonsoo@lge.com
 Cc: Michal Nazarewicz min...@mina86.com
 Cc: Naoya Horiguchi n-horigu...@ah.jp.nec.com
 Cc: Christoph Lameter c...@linux.com
 Cc: Rik van Riel r...@redhat.com
 Cc: David Rientjes rient...@google.com

Fair enough.

Reviewed-by: Zhang Yanfei zhangyan...@cn.fujitsu.com
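
A small userspace sketch of the read-once-then-range-check pattern described
above, with a plain volatile read standing in for ACCESS_ONCE and MAX_ORDER
hard-coded; it is only an illustration, not the kernel implementation.

    #include <stdio.h>

    #define MAX_ORDER 11

    /* Read the order exactly once; a second read could see a different
     * value and make the checks and the skip arithmetic inconsistent. */
    #define READ_ONCE_UL(x) (*(volatile unsigned long *)&(x))

    static unsigned long shared_order;   /* updated concurrently in real life */

    static unsigned long skip_buddy(unsigned long low_pfn, unsigned long end_pfn)
    {
        unsigned long order = READ_ONCE_UL(shared_order);

        /* Only trust the snapshot if it is in the valid range. */
        if (order > 0 && order < MAX_ORDER)
            low_pfn += (1UL << order) - 1;

        /* The skip may overshoot; clamp like the patch does. */
        if (low_pfn > end_pfn)
            low_pfn = end_pfn;
        return low_pfn;
    }

    int main(void)
    {
        shared_order = 5;
        printf("%lu\n", skip_buddy(1000, 2000));   /* 1031 */
        printf("%lu\n", skip_buddy(1990, 2000));   /* clamped to 2000 */
        return 0;
    }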

 ---
  mm/compaction.c | 36 +++-
  mm/internal.h   | 16 +++-
  2 files changed, 46 insertions(+), 6 deletions(-)
 
 diff --git a/mm/compaction.c b/mm/compaction.c
 index 41c7005..df0961b 100644
 --- a/mm/compaction.c
 +++ b/mm/compaction.c
 @@ -270,8 +270,15 @@ static inline bool compact_should_abort(struct 
 compact_control *cc)
  static bool suitable_migration_target(struct page *page)
  {
   /* If the page is a large free page, then disallow migration */
  - if (PageBuddy(page) && page_order(page) >= pageblock_order)
 - return false;
 + if (PageBuddy(page)) {
 + /*
  +  * We are checking page_order without zone->lock taken. But
 +  * the only small danger is that we skip a potentially suitable
 +  * pageblock, so it's not worth to check order for valid range.
 +  */
  + if (page_order_unsafe(page) >= pageblock_order)
 + return false;
 + }
  
   /* If the block is MIGRATE_MOVABLE or MIGRATE_CMA, allow migration */
   if (migrate_async_suitable(get_pageblock_migratetype(page)))
 @@ -591,11 +598,23 @@ isolate_migratepages_range(struct zone *zone, struct 
 compact_control *cc,
   valid_page = page;
  
   /*
  -  * Skip if free. page_order cannot be used without zone->lock
 -  * as nothing prevents parallel allocations or buddy merging.
 +  * Skip if free. We read page order here without zone lock
 +  * which is generally unsafe, but the race window is small and
 +  * the worst thing that can happen is that we skip some
 +  * potential isolation targets.
*/
 - if (PageBuddy(page))
 + if (PageBuddy(page)) {
 + unsigned long freepage_order = page_order_unsafe(page);
 +
 + /*
 +  * Without lock, we cannot be sure that what we got is
 +  * a valid page order. Consider only values in the
 +  * valid order range to prevent low_pfn overflow.
 +  */
  + if (freepage_order > 0 && freepage_order < MAX_ORDER)
  + low_pfn += (1UL << freepage_order) - 1;
   continue;
 + }
  
   /*
* Check may be lockless but that's ok as we recheck later.
 @@ -683,6 +702,13 @@ next_pageblock:
   low_pfn = ALIGN(low_pfn + 1, pageblock_nr_pages) - 1;
   }
  
 + /*
 +  * The PageBuddy() check could have potentially brought us outside
 +  * the range to be scanned.
 +  */
  + if (unlikely(low_pfn > end_pfn))
 + low_pfn = end_pfn;
 +
   acct_isolated(zone, locked, cc);
  
   if (locked)
 diff --git a/mm/internal.h b/mm/internal.h
 index 2c187d2..584cd69 100644
 --- a/mm/internal.h
 +++ b/mm/internal.h
 @@ -171,7 +171,8 @@ isolate_migratepages_range(struct

Re: [PATCH v3 11/13] mm, compaction: pass gfp mask to compact_control

2014-06-23 Thread Zhang Yanfei
On 06/20/2014 11:49 PM, Vlastimil Babka wrote:
 From: David Rientjes rient...@google.com
 
 struct compact_control currently converts the gfp mask to a migratetype, but 
 we
 need the entire gfp mask in a follow-up patch.
 
 Pass the entire gfp mask as part of struct compact_control.
 
 Signed-off-by: David Rientjes rient...@google.com
 Signed-off-by: Vlastimil Babka vba...@suse.cz
 Cc: Minchan Kim minc...@kernel.org
 Cc: Mel Gorman mgor...@suse.de
 Cc: Joonsoo Kim iamjoonsoo@lge.com
 Cc: Michal Nazarewicz min...@mina86.com
 Cc: Naoya Horiguchi n-horigu...@ah.jp.nec.com
 Cc: Christoph Lameter c...@linux.com
 Cc: Rik van Riel r...@redhat.com

Reviewed-by: Zhang Yanfei zhangyan...@cn.fujitsu.com
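
A simplified, self-contained sketch of the idea: compact_control carries the
whole gfp mask, and the migratetype is derived from it once per compact_zone()
call and passed down. The flag values and the conversion helper below are
invented for the example and do not match the real gfpflags_to_migratetype().

    #include <stdio.h>

    typedef unsigned int gfp_t;

    /* Made-up flag bits, only for illustration. */
    #define __GFP_MOVABLE     0x1u
    #define __GFP_RECLAIMABLE 0x2u

    enum migratetype { MIGRATE_UNMOVABLE, MIGRATE_RECLAIMABLE, MIGRATE_MOVABLE };

    struct compact_control {
        int order;
        gfp_t gfp_mask;          /* full mask kept, as in the patch */
    };

    /* Stand-in for gfpflags_to_migratetype(). */
    static enum migratetype gfp_to_migratetype(gfp_t gfp)
    {
        if (gfp & __GFP_MOVABLE)
            return MIGRATE_MOVABLE;
        if (gfp & __GFP_RECLAIMABLE)
            return MIGRATE_RECLAIMABLE;
        return MIGRATE_UNMOVABLE;
    }

    static int compact_finished(const struct compact_control *cc,
                                enum migratetype migratetype)
    {
        (void)cc;                /* the real code walks the zone free lists */
        return (int)migratetype;
    }

    int main(void)
    {
        struct compact_control cc = { .order = 3, .gfp_mask = __GFP_MOVABLE };
        /* Convert once, reuse for every compact_finished() call. */
        enum migratetype mt = gfp_to_migratetype(cc.gfp_mask);

        printf("migratetype = %d\n", compact_finished(&cc, mt));
        return 0;
    }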

 ---
  mm/compaction.c | 12 +++-
  mm/internal.h   |  2 +-
  2 files changed, 8 insertions(+), 6 deletions(-)
 
 diff --git a/mm/compaction.c b/mm/compaction.c
 index 32c768b..d4e0c13 100644
 --- a/mm/compaction.c
 +++ b/mm/compaction.c
 @@ -975,8 +975,8 @@ static isolate_migrate_t isolate_migratepages(struct zone 
 *zone,
   return cc->nr_migratepages ? ISOLATE_SUCCESS : ISOLATE_NONE;
  }
  
 -static int compact_finished(struct zone *zone,
 - struct compact_control *cc)
 +static int compact_finished(struct zone *zone, struct compact_control *cc,
 + const int migratetype)
  {
   unsigned int order;
   unsigned long watermark;
 @@ -1022,7 +1022,7 @@ static int compact_finished(struct zone *zone,
   struct free_area *area = &zone->free_area[order];
  
   /* Job done if page is free of the right migratetype */
  - if (!list_empty(&area->free_list[cc->migratetype]))
  + if (!list_empty(&area->free_list[migratetype]))
   return COMPACT_PARTIAL;
  
   /* Job done if allocation would set block type */
 @@ -1088,6 +1088,7 @@ static int compact_zone(struct zone *zone, struct 
 compact_control *cc)
   int ret;
   unsigned long start_pfn = zone->zone_start_pfn;
   unsigned long end_pfn = zone_end_pfn(zone);
  + const int migratetype = gfpflags_to_migratetype(cc->gfp_mask);
   const bool sync = cc->mode != MIGRATE_ASYNC;
  
   ret = compaction_suitable(zone, cc->order);
 @@ -1130,7 +1131,8 @@ static int compact_zone(struct zone *zone, struct 
 compact_control *cc)
  
   migrate_prep_local();
  
 - while ((ret = compact_finished(zone, cc)) == COMPACT_CONTINUE) {
 + while ((ret = compact_finished(zone, cc, migratetype)) ==
 + COMPACT_CONTINUE) {
   int err;
  
   switch (isolate_migratepages(zone, cc)) {
 @@ -1185,7 +1187,7 @@ static unsigned long compact_zone_order(struct zone 
 *zone, int order,
   .nr_freepages = 0,
   .nr_migratepages = 0,
   .order = order,
 - .migratetype = gfpflags_to_migratetype(gfp_mask),
 + .gfp_mask = gfp_mask,
   .zone = zone,
   .mode = mode,
   };
 diff --git a/mm/internal.h b/mm/internal.h
 index 584cd69..dd17a40 100644
 --- a/mm/internal.h
 +++ b/mm/internal.h
 @@ -149,7 +149,7 @@ struct compact_control {
   bool finished_update_migrate;
  
   int order;  /* order a direct compactor needs */
 - int migratetype;/* MOVABLE, RECLAIMABLE etc */
 + const gfp_t gfp_mask;   /* gfp mask of a direct compactor */
   struct zone *zone;
   enum compact_contended contended; /* Signal need_sched() or lock
  * contention detected during
 


-- 
Thanks.
Zhang Yanfei


Re: [PATCH v3 01/13] mm, THP: don't hold mmap_sem in khugepaged when allocating THP

2014-06-23 Thread Zhang Yanfei
On 06/23/2014 05:52 PM, Vlastimil Babka wrote:
 On 06/23/2014 07:39 AM, Zhang Yanfei wrote:
 Hello

 On 06/21/2014 01:45 AM, Kirill A. Shutemov wrote:
 On Fri, Jun 20, 2014 at 05:49:31PM +0200, Vlastimil Babka wrote:
 When allocating huge page for collapsing, khugepaged currently holds 
 mmap_sem
 for reading on the mm where collapsing occurs. Afterwards the read lock is
 dropped before write lock is taken on the same mmap_sem.

 Holding mmap_sem during whole huge page allocation is therefore useless, the
 vma needs to be rechecked after taking the write lock anyway. Furthermore, huge
 page allocation might involve a rather long sync compaction, and thus block
 any mmap_sem writers and i.e. affect workloads that perform frequent m(un)map
 or mprotect operations.

 This patch simply releases the read lock before allocating a huge page. It
 also deletes an outdated comment that assumed vma must be stable, as it was
 using alloc_hugepage_vma(). This is no longer true since commit 9f1b868a13
 (mm: thp: khugepaged: add policy for finding target node).

 There is no point in touching ->mmap_sem in khugepaged_alloc_page() at
 all. Please, move up_read() outside khugepaged_alloc_page().

 
 Well there's also currently no point in passing several parameters to 
 khugepaged_alloc_page(). So I could clean it up as well, but I imagine later 
 we would perhaps reintroduce them back, as I don't think the current 
 situation is ideal for at least two reasons.
 
 1. If you read commit 9f1b868a13 (mm: thp: khugepaged: add policy for 
 finding target node), it's based on a report where somebody found that 
 mempolicy is not observed properly when collapsing THP's. But the 'policy' 
 introduced by the commit isn't based on real mempolicy, it might just under 
 certain conditions results in an interleave, which happens to be what the 
 reporter was trying.
 
 So ideally, it should be making node allocation decisions based on where the 
 original 4KB pages are located. For example, allocate a THP only if all the 
 4KB pages are on the same node. That would also automatically obey any policy 
 that has lead to the allocation of those 4KB pages.
 
 And for this, it will need again the parameters and mmap_sem in read mode. It 
 would be however still a good idea to drop mmap_sem before the allocation 
 itself, since compaction/reclaim might take some time...
 
 2. (less related) I'd expect khugepaged to first allocate a hugepage and then 
 scan for collapsing. Yes there's khugepaged_prealloc_page, but that only does 
 something on !NUMA systems and these are not the future.
 Although I don't have the data, I expect allocating a hugepage is a bigger 
 issue than finding something that could be collapsed. So why scan for 
 collapsing if in the end I cannot allocate a hugepage? And if I really cannot 
 find something to collapse, would e.g. caching a single hugepage per node be 
 a big hit? Also, if there's really nothing to collapse, then it means 
 khugepaged won't compact. And since khugepaged is becoming the only source of 
 sync compaction that doesn't give up easily and tries to e.g. migrate movable 
 pages out of unmovable pageblocks, this might have bad effects on 
 fragmentation.
 I believe this could be done smarter.
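
To make the "all base pages on one node" idea from point 1 above concrete, a
hypothetical userspace sketch (page_to_nid() is modeled by a plain array; this
is an illustration of the policy, not proposed kernel code):

    #include <stdio.h>

    #define HPAGE_PMD_NR 512   /* base pages per THP on x86_64 */

    /* Returns the common node id, or -1 if the pages span nodes. */
    static int common_node(const int *page_nid, int nr)
    {
        int node = page_nid[0];
        int i;

        for (i = 1; i < nr; i++)
            if (page_nid[i] != node)
                return -1;     /* spread across nodes: skip collapsing */
        return node;
    }

    int main(void)
    {
        int nid[HPAGE_PMD_NR] = { 0 };   /* pretend all pages sit on node 0 */
        int node = common_node(nid, HPAGE_PMD_NR);

        if (node >= 0)
            printf("allocate THP on node %d\n", node);
        else
            printf("pages span nodes, don't collapse\n");
        return 0;
    }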
 
 I might be wrong. If we up_read in khugepaged_scan_pmd(), then when we go round
 the for loop again to get the next vma and handle it, do we do this without
 holding the mmap_sem in any mode?

 And if the loop ends, we have another up_read in breakouterloop. What if we
 have released the mmap_sem in collapse_huge_page()?
 
 collapse_huge_page() is only called from khugepaged_scan_pmd() in the if 
 (ret) condition. And khugepaged_scan_mm_slot() has similar if (ret) for the 
 return value of khugepaged_scan_pmd() to break out of the loop (and not doing 
 up_read() again). So I think this is correct and moving up_read from 
 khugepaged_alloc_page() to collapse_huge_page() wouldn't
 change this?

Ah, right.

 
 
 .
 


-- 
Thanks.
Zhang Yanfei


Re: [PATCH v3 01/13] mm, THP: don't hold mmap_sem in khugepaged when allocating THP

2014-06-22 Thread Zhang Yanfei
Hello

On 06/21/2014 01:45 AM, Kirill A. Shutemov wrote:
> On Fri, Jun 20, 2014 at 05:49:31PM +0200, Vlastimil Babka wrote:
>> When allocating huge page for collapsing, khugepaged currently holds mmap_sem
>> for reading on the mm where collapsing occurs. Afterwards the read lock is
>> dropped before write lock is taken on the same mmap_sem.
>>
>> Holding mmap_sem during whole huge page allocation is therefore useless, the
>> vma needs to be rechecked after taking the write lock anyway. Furthemore, 
>> huge
>> page allocation might involve a rather long sync compaction, and thus block
>> any mmap_sem writers and i.e. affect workloads that perform frequent m(un)map
>> or mprotect oterations.
>>
>> This patch simply releases the read lock before allocating a huge page. It
>> also deletes an outdated comment that assumed vma must be stable, as it was
>> using alloc_hugepage_vma(). This is no longer true since commit 9f1b868a13
>> ("mm: thp: khugepaged: add policy for finding target node").
> 
> There is no point in touching ->mmap_sem in khugepaged_alloc_page() at
> all. Please, move up_read() outside khugepaged_alloc_page().
> 

I might be wrong. If we up_read in khugepaged_scan_pmd(), then when we go round
the for loop again to get the next vma and handle it, do we do this without
holding the mmap_sem in any mode?

And if the loop ends, we have another up_read in breakouterloop. What if we have
released the mmap_sem in collapse_huge_page()?
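
For reference, a simplified userspace sketch of the locking flow under
discussion, with a pthread rwlock standing in for mmap_sem and hypothetical
helpers for the scan/allocate/collapse steps; it is not the khugepaged code.

    #include <pthread.h>
    #include <stdbool.h>
    #include <stdio.h>

    static pthread_rwlock_t mmap_sem = PTHREAD_RWLOCK_INITIALIZER;

    static bool vma_still_valid(void) { return true; }      /* stand-in recheck */
    static void *alloc_hugepage(void) { return (void *)1; } /* stand-in, may be slow */

    static void collapse(void)
    {
        pthread_rwlock_rdlock(&mmap_sem);
        /* ... scan the pmd under the read lock ... */
        pthread_rwlock_unlock(&mmap_sem);    /* drop before the (slow) allocation */

        void *hpage = alloc_hugepage();      /* no mmap_sem held here */

        pthread_rwlock_wrlock(&mmap_sem);
        if (hpage && vma_still_valid()) {
            /* ... do the actual collapse under the write lock ... */
        }
        pthread_rwlock_unlock(&mmap_sem);
    }

    int main(void)
    {
        collapse();
        puts("done");
        return 0;
    }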

-- 
Thanks.
Zhang Yanfei


Re: [PATCH 0/8] mm: add page cache limit and reclaim feature

2014-06-16 Thread Zhang Yanfei
Hi,

On 06/16/2014 05:24 PM, Xishi Qiu wrote:
> When a system (e.g. a smart phone) has been running for a long time, the cache
> often takes a lot of memory; maybe the free memory is less than 50M, and then
> OOM will happen if an APP suddenly allocates large order pages and memory
> reclaim is too slow.

If there really are too many page cache pages and free memory is low, I think
the page allocator will enter the slowpath to free more memory for allocation.
And in the slowpath there is indeed a direct reclaim operation, so is that
really not enough to reclaim page cache?

> 
> Use "echo 3 > /proc/sys/vm/drop_caches" will drop the whole cache, this will
> affect the performance, so it is used for debugging only. 
> 
> suse has this feature, I tested it before, but it can not limit the page cache
> actually. So I rewrite the feature and add some parameters.
> 
> Christoph Lameter has written a patch "Limit the size of the pagecache"
> http://marc.info/?l=linux-mm=116959990228182=2
> It changes in zone fallback, this is not a good way.
> 
> The patchset is based on v3.15, it introduces two features, page cache limit
> and page cache reclaim in circles.
> 
> Add four parameters in /proc/sys/vm
> 
> 1) cache_limit_mbytes
> This is used to limit page cache amount.
> The input unit is MB, value range is from 0 to totalram_pages.
> If this is set to 0, it will not limit page cache.
> When written to the file, cache_limit_ratio will be updated too.
> The default value is 0.
> 
> 2) cache_limit_ratio
> This is used to limit page cache amount.
> The input unit is percent, value range is from 0 to 100.
> If this is set to 0, it will not limit page cache.
> When written to the file, cache_limit_mbytes will be updated too.
> The default value is 0.
> 
> 3) cache_reclaim_s
> This is used to reclaim page cache in circles.
> The input unit is second, the minimum value is 0.
> If this is set to 0, it will disable the feature.
> The default value is 0.
> 
> 4) cache_reclaim_weight
> This is used to speed up page cache reclaim.
> It depend on enabling cache_limit_mbytes/cache_limit_ratio or cache_reclaim_s.
> Value range is from 1(slow) to 100(fast).
> The default value is 1.
> 
> I tested the two features on my system(x86_64), it seems to work right.
> However, as it changes the hot path "add_to_page_cache_lru()", I don't know
> how much it will affect the performance,

Yeah, at a quick glance, for every invocation of add_to_page_cache_lru() there
is the newly added test:

if (vm_cache_limit_mbytes && page_cache_over_limit())

and if the test passes, shrink_page_cache()->do_try_to_free_pages() is called.
And this is a sync operation. IMO, it is better to make such an operation async.
(You've implemented an async operation but I doubt it is suitable to put the
sync operation here.)
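
To illustrate the async suggestion, a minimal userspace sketch: the hot path
only flags that the limit was exceeded, and a separate worker performs the
expensive shrink, so the add-to-page-cache path itself stays cheap. The helper
names are hypothetical and this is not a proposed implementation.

    #include <pthread.h>
    #include <stdatomic.h>
    #include <stdbool.h>
    #include <stdio.h>

    static atomic_bool reclaim_requested;

    /* Hot path: just request reclaim, never do it synchronously. */
    static void add_to_page_cache_hotpath(void)
    {
        /* stands in for: vm_cache_limit_mbytes && page_cache_over_limit() */
        int over_limit = 1;

        if (over_limit)
            atomic_store(&reclaim_requested, true);
    }

    /* Worker: does the expensive shrink outside the hot path. */
    static void *reclaim_worker(void *arg)
    {
        (void)arg;
        if (atomic_exchange(&reclaim_requested, false))
            puts("shrinking page cache in the background");
        return NULL;
    }

    int main(void)
    {
        pthread_t tid;

        add_to_page_cache_hotpath();
        pthread_create(&tid, NULL, reclaim_worker, NULL);
        pthread_join(tid, NULL);
        return 0;
    }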

Thanks.

 maybe there are some errors
> in the patches too, RFC.
> 
> 
> *** BLURB HERE ***
> 
> Xishi Qiu (8):
>   mm: introduce cache_limit_ratio and cache_limit_mbytes
>   mm: add shrink page cache core
>   mm: implement page cache limit feature
>   mm: introduce cache_reclaim_s
>   mm: implement page cache reclaim in circles
>   mm: introduce cache_reclaim_weight
>   mm: implement page cache reclaim speed
>   doc: update Documentation/sysctl/vm.txt
> 
>  Documentation/sysctl/vm.txt |   43 +++
>  include/linux/swap.h|   17 
>  kernel/sysctl.c |   35 +++
>  mm/filemap.c|3 +
>  mm/hugetlb.c|3 +
>  mm/page_alloc.c |   51 ++
>  mm/vmscan.c |   97 
> ++-
>  7 files changed, 248 insertions(+), 1 deletions(-)
> 
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/
> .
> 


-- 
Thanks.
Zhang Yanfei


Re: [PATCH v2 08/10] mm, cma: clean-up cma allocation error path

2014-06-12 Thread Zhang Yanfei
On 06/12/2014 11:21 AM, Joonsoo Kim wrote:
> We can remove one call site for clear_cma_bitmap() if we first
> call it before checking the error number.
> 
> Signed-off-by: Joonsoo Kim 

Reviewed-by: Zhang Yanfei 

> 
> diff --git a/mm/cma.c b/mm/cma.c
> index 1e1b017..01a0713 100644
> --- a/mm/cma.c
> +++ b/mm/cma.c
> @@ -282,11 +282,12 @@ struct page *cma_alloc(struct cma *cma, int count, 
> unsigned int align)
>   if (ret == 0) {
>   page = pfn_to_page(pfn);
>   break;
> - } else if (ret != -EBUSY) {
> - clear_cma_bitmap(cma, pfn, count);
> - break;
>   }
> +
>   clear_cma_bitmap(cma, pfn, count);
> + if (ret != -EBUSY)
> + break;
> +
>   pr_debug("%s(): memory range at %p is busy, retrying\n",
>__func__, pfn_to_page(pfn));
>   /* try again with a bit different memory target */
> 


-- 
Thanks.
Zhang Yanfei


Re: [PATCH v2 06/10] CMA: generalize CMA reserved area management functionality

2014-06-12 Thread Zhang Yanfei
On 06/12/2014 11:21 AM, Joonsoo Kim wrote:
> Currently, there are two users of the CMA functionality, one is the DMA
> subsystem and the other is the kvm on powerpc. They have their own code
> to manage the CMA reserved area even though they look really similar.
> From my guess, it is caused by some needs on bitmap management. The kvm side
> wants to maintain the bitmap not for 1 page, but for a larger size. Eventually it
> uses a bitmap where one bit represents 64 pages.
> 
> When I implement CMA related patches, I should change those two places
> to apply my change and it seems to be painful to me. I want to change
> this situation and reduce future code management overhead through
> this patch.
> 
> This change could also help developers who want to use CMA in their
> new feature development, since they can use CMA easily without
> copying & pasting this reserved area management code.
> 
> In previous patches, we have prepared some features to generalize
> CMA reserved area management and now it's time to do it. This patch
> moves core functions to mm/cma.c and change DMA APIs to use
> these functions.
> 
> There is no functional change in DMA APIs.
> 
> v2: There is no big change from v1 in mm/cma.c. Mostly renaming.
> 
> Acked-by: Michal Nazarewicz 
> Signed-off-by: Joonsoo Kim 

Acked-by: Zhang Yanfei 

> 
> diff --git a/drivers/base/Kconfig b/drivers/base/Kconfig
> index 00e13ce..4eac559 100644
> --- a/drivers/base/Kconfig
> +++ b/drivers/base/Kconfig
> @@ -283,16 +283,6 @@ config CMA_ALIGNMENT
>  
> If unsure, leave the default value "8".
>  
> -config CMA_AREAS
> - int "Maximum count of the CMA device-private areas"
> - default 7
> - help
> -   CMA allows to create CMA areas for particular devices. This parameter
> -   sets the maximum number of such device private CMA areas in the
> -   system.
> -
> -   If unsure, leave the default value "7".
> -
>  endif
>  
>  endmenu
> diff --git a/drivers/base/dma-contiguous.c b/drivers/base/dma-contiguous.c
> index 9bc9340..f177f73 100644
> --- a/drivers/base/dma-contiguous.c
> +++ b/drivers/base/dma-contiguous.c
> @@ -24,25 +24,10 @@
>  
>  #include 
>  #include 
> -#include 
> -#include 
> -#include 
>  #include 
> -#include 
> -#include 
> -#include 
>  #include 
>  #include 
> -
> -struct cma {
> - unsigned long   base_pfn;
> - unsigned long   count;
> - unsigned long   *bitmap;
> - int order_per_bit; /* Order of pages represented by one bit */
> - struct mutexlock;
> -};
> -
> -struct cma *dma_contiguous_default_area;
> +#include 
>  
>  #ifdef CONFIG_CMA_SIZE_MBYTES
>  #define CMA_SIZE_MBYTES CONFIG_CMA_SIZE_MBYTES
> @@ -50,6 +35,8 @@ struct cma *dma_contiguous_default_area;
>  #define CMA_SIZE_MBYTES 0
>  #endif
>  
> +struct cma *dma_contiguous_default_area;
> +
>  /*
>   * Default global CMA area size can be defined in kernel's .config.
>   * This is useful mainly for distro maintainers to create a kernel
> @@ -156,199 +143,13 @@ void __init dma_contiguous_reserve(phys_addr_t limit)
>   }
>  }
>  
> -static DEFINE_MUTEX(cma_mutex);
> -
> -static unsigned long cma_bitmap_aligned_mask(struct cma *cma, int 
> align_order)
> -{
> - return (1 << (align_order >> cma->order_per_bit)) - 1;
> -}
> -
> -static unsigned long cma_bitmap_maxno(struct cma *cma)
> -{
> - return cma->count >> cma->order_per_bit;
> -}
> -
> -static unsigned long cma_bitmap_pages_to_bits(struct cma *cma,
> - unsigned long pages)
> -{
> - return ALIGN(pages, 1 << cma->order_per_bit) >> cma->order_per_bit;
> -}
> -
> -static void clear_cma_bitmap(struct cma *cma, unsigned long pfn, int count)
> -{
> - unsigned long bitmapno, nr_bits;
> -
> - bitmapno = (pfn - cma->base_pfn) >> cma->order_per_bit;
> - nr_bits = cma_bitmap_pages_to_bits(cma, count);
> -
> - mutex_lock(&cma->lock);
> - bitmap_clear(cma->bitmap, bitmapno, nr_bits);
> - mutex_unlock(&cma->lock);
> -}
> -
> -static int __init cma_activate_area(struct cma *cma)
> -{
> - int bitmap_maxno = cma_bitmap_maxno(cma);
> - int bitmap_size = BITS_TO_LONGS(bitmap_maxno) * sizeof(long);
> - unsigned long base_pfn = cma->base_pfn, pfn = base_pfn;
> - unsigned i = cma->count >> pageblock_order;
> - struct zone *zone;
> -
> - pr_debug("%s()\n", __func__);
> -
> - cma->bitmap = kzalloc(bitmap_size, GFP_KERNEL);
> - if (!cma->bitmap)
> - 

Re: [PATCH v2 05/10] DMA, CMA: support arbitrary bitmap granularity

2014-06-12 Thread Zhang Yanfei
On 06/12/2014 11:21 AM, Joonsoo Kim wrote:
> ppc kvm's cma region management requires arbitrary bitmap granularity,
> since they want to reserve very large memory and manage this region
> with a bitmap where one bit covers several pages, to reduce management overheads.
> So support arbitrary bitmap granularity for the following generalization.
> 
> Signed-off-by: Joonsoo Kim 

Acked-by: Zhang Yanfei 
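
A standalone sketch of the granularity arithmetic being introduced: with
order_per_bit = N, one bitmap bit covers 2^N pages, so page counts and pfn
offsets are scaled before touching the bitmap. The struct and the values used
below are examples, not the dma-contiguous code.

    #include <stdio.h>

    struct cma_like {
        unsigned long base_pfn;
        int order_per_bit;     /* one bit covers 2^order_per_bit pages */
    };

    static unsigned long pages_to_bits(const struct cma_like *c, unsigned long pages)
    {
        unsigned long per_bit = 1UL << c->order_per_bit;

        /* round the page count up to whole bits */
        return (pages + per_bit - 1) >> c->order_per_bit;
    }

    static unsigned long pfn_to_bitno(const struct cma_like *c, unsigned long pfn)
    {
        return (pfn - c->base_pfn) >> c->order_per_bit;
    }

    int main(void)
    {
        struct cma_like c = { .base_pfn = 0x100000, .order_per_bit = 6 }; /* 64 pages/bit */

        printf("200 pages -> %lu bits\n", pages_to_bits(&c, 200));        /* 4 */
        printf("pfn 0x100080 -> bit %lu\n", pfn_to_bitno(&c, 0x100080));  /* 2 */
        return 0;
    }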

> 
> diff --git a/drivers/base/dma-contiguous.c b/drivers/base/dma-contiguous.c
> index bc4c171..9bc9340 100644
> --- a/drivers/base/dma-contiguous.c
> +++ b/drivers/base/dma-contiguous.c
> @@ -38,6 +38,7 @@ struct cma {
>   unsigned long   base_pfn;
>   unsigned long   count;
>   unsigned long   *bitmap;
> + int order_per_bit; /* Order of pages represented by one bit */
>   struct mutexlock;
>  };
>  
> @@ -157,9 +158,38 @@ void __init dma_contiguous_reserve(phys_addr_t limit)
>  
>  static DEFINE_MUTEX(cma_mutex);
>  
> +static unsigned long cma_bitmap_aligned_mask(struct cma *cma, int 
> align_order)
> +{
> + return (1 << (align_order >> cma->order_per_bit)) - 1;
> +}
> +
> +static unsigned long cma_bitmap_maxno(struct cma *cma)
> +{
> + return cma->count >> cma->order_per_bit;
> +}
> +
> +static unsigned long cma_bitmap_pages_to_bits(struct cma *cma,
> + unsigned long pages)
> +{
> + return ALIGN(pages, 1 << cma->order_per_bit) >> cma->order_per_bit;
> +}
> +
> +static void clear_cma_bitmap(struct cma *cma, unsigned long pfn, int count)
> +{
> + unsigned long bitmapno, nr_bits;
> +
> + bitmapno = (pfn - cma->base_pfn) >> cma->order_per_bit;
> + nr_bits = cma_bitmap_pages_to_bits(cma, count);
> +
> + mutex_lock(&cma->lock);
> + bitmap_clear(cma->bitmap, bitmapno, nr_bits);
> + mutex_unlock(&cma->lock);
> +}
> +
>  static int __init cma_activate_area(struct cma *cma)
>  {
> - int bitmap_size = BITS_TO_LONGS(cma->count) * sizeof(long);
> + int bitmap_maxno = cma_bitmap_maxno(cma);
> + int bitmap_size = BITS_TO_LONGS(bitmap_maxno) * sizeof(long);
>   unsigned long base_pfn = cma->base_pfn, pfn = base_pfn;
>   unsigned i = cma->count >> pageblock_order;
>   struct zone *zone;
> @@ -221,6 +251,7 @@ core_initcall(cma_init_reserved_areas);
>   * @base: Base address of the reserved area optional, use 0 for any
>   * @limit: End address of the reserved memory (optional, 0 for any).
>   * @alignment: Alignment for the contiguous memory area, should be power of 2
> + * @order_per_bit: Order of pages represented by one bit on bitmap.
>   * @res_cma: Pointer to store the created cma region.
>   * @fixed: hint about where to place the reserved area
>   *
> @@ -235,7 +266,7 @@ core_initcall(cma_init_reserved_areas);
>   */
>  static int __init __dma_contiguous_reserve_area(phys_addr_t size,
>   phys_addr_t base, phys_addr_t limit,
> - phys_addr_t alignment,
> + phys_addr_t alignment, int order_per_bit,
>   struct cma **res_cma, bool fixed)
>  {
>   struct cma *cma = _areas[cma_area_count];
> @@ -269,6 +300,8 @@ static int __init 
> __dma_contiguous_reserve_area(phys_addr_t size,
>   base = ALIGN(base, alignment);
>   size = ALIGN(size, alignment);
>   limit &= ~(alignment - 1);
> + /* size should be aligned with order_per_bit */
> + BUG_ON(!IS_ALIGNED(size >> PAGE_SHIFT, 1 << order_per_bit));
>  
>   /* Reserve memory */
>   if (base && fixed) {
> @@ -294,6 +327,7 @@ static int __init 
> __dma_contiguous_reserve_area(phys_addr_t size,
>*/
>   cma->base_pfn = PFN_DOWN(base);
>   cma->count = size >> PAGE_SHIFT;
> + cma->order_per_bit = order_per_bit;
>   *res_cma = cma;
>   cma_area_count++;
>  
> @@ -313,7 +347,7 @@ int __init dma_contiguous_reserve_area(phys_addr_t size, 
> phys_addr_t base,
>  {
>   int ret;
>  
> - ret = __dma_contiguous_reserve_area(size, base, limit, 0,
> + ret = __dma_contiguous_reserve_area(size, base, limit, 0, 0,
>   res_cma, fixed);
>   if (ret)
>   return ret;
> @@ -324,13 +358,6 @@ int __init dma_contiguous_reserve_area(phys_addr_t size, 
> phys_addr_t base,
>   return 0;
>  }
>  
> -static void clear_cma_bitmap(struct cma *cma, unsigned long pfn, int count)
> -{
> - mutex_lock(&cma->lock);
> - bitmap_clear(cma->bitmap, pfn - cma->base_pfn, count);
> - mutex_unlock(&g

Re: [PATCH v2 02/10] DMA, CMA: fix possible memory leak

2014-06-12 Thread Zhang Yanfei
On 06/12/2014 02:02 PM, Joonsoo Kim wrote:
> On Thu, Jun 12, 2014 at 02:25:43PM +0900, Minchan Kim wrote:
>> On Thu, Jun 12, 2014 at 12:21:39PM +0900, Joonsoo Kim wrote:
>>> We should free memory for bitmap when we find zone mis-match,
>>> otherwise this memory will leak.
>>
>> Then, -stable stuff?
> 
> I don't think so. This is just possible leak candidate, so we don't
> need to push this to stable tree.
> 
>>
>>>
>>> Additionally, I copy code comment from ppc kvm's cma code to notify
>>> why we need to check zone mis-match.
>>>
>>> Signed-off-by: Joonsoo Kim 
>>>
>>> diff --git a/drivers/base/dma-contiguous.c b/drivers/base/dma-contiguous.c
>>> index bd0bb81..fb0cdce 100644
>>> --- a/drivers/base/dma-contiguous.c
>>> +++ b/drivers/base/dma-contiguous.c
>>> @@ -177,14 +177,24 @@ static int __init cma_activate_area(struct cma *cma)
>>> base_pfn = pfn;
>>> for (j = pageblock_nr_pages; j; --j, pfn++) {
>>> WARN_ON_ONCE(!pfn_valid(pfn));
>>> +   /*
>>> +* alloc_contig_range requires the pfn range
>>> +* specified to be in the same zone. Make this
>>> +* simple by forcing the entire CMA resv range
>>> +* to be in the same zone.
>>> +*/
>>> if (page_zone(pfn_to_page(pfn)) != zone)
>>> -   return -EINVAL;
>>> +   goto err;
>>
>> At a first glance, I thought it would be better to handle such error
>> before activating.
>> So when I see the registration code (ie, dma_contiguous_reserve_area),
>> I realized it is impossible because we didn't set up zone yet. :(
>>
>> If so, when we detect to fail here, it would be better to report more
>> meaningful error message like what was successful zone and what is
>> new zone and failed pfn number?
> 
> What I want to do in early phase of this patchset is to make cma code
> on DMA APIs similar to ppc kvm's cma code. ppc kvm's cma code already
> has this error handling logic, so I make this patch.
> 
> If we think that we need more things, we can do that on general cma code
> after merging this patchset.
> 

Yeah, I also like the idea. After all, this patchset aims at general CMA
management; we could improve more after this patchset. So

Acked-by: Zhang Yanfei 
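
The fix follows the usual cleanup-label pattern: every failure after the bitmap
allocation jumps to one label that frees it. A generic, self-contained
illustration of that pattern (not the dma-contiguous code itself):

    #include <stdio.h>
    #include <stdlib.h>

    static int activate_area(int force_mismatch)
    {
        unsigned long *bitmap = calloc(16, sizeof(*bitmap));

        if (!bitmap)
            return -1;                 /* ENOMEM */

        if (force_mismatch)            /* stands in for the zone mismatch check */
            goto err;

        /* success: in the real code the area keeps the bitmap */
        printf("activated\n");
        free(bitmap);                  /* freed here only to keep the demo leak-free */
        return 0;

    err:
        free(bitmap);                  /* no leak on the error path */
        return -1;                     /* EINVAL */
    }

    int main(void)
    {
        activate_area(1);
        activate_area(0);
        return 0;
    }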

-- 
Thanks.
Zhang Yanfei


Re: [PATCH v2 01/10] DMA, CMA: clean-up log message

2014-06-12 Thread Zhang Yanfei
On 06/12/2014 11:21 AM, Joonsoo Kim wrote:
> We don't need explicit 'CMA:' prefix, since we already define prefix
> 'cma:' in pr_fmt. So remove it.
> 
> And, some logs print the function name and others don't. This looks
> bad to me, so I unify log format to print function name consistently.
> 
> Lastly, I add one more debug log on cma_activate_area().
> 
> Signed-off-by: Joonsoo Kim 

Reviewed-by: Zhang Yanfei 

> 
> diff --git a/drivers/base/dma-contiguous.c b/drivers/base/dma-contiguous.c
> index 83969f8..bd0bb81 100644
> --- a/drivers/base/dma-contiguous.c
> +++ b/drivers/base/dma-contiguous.c
> @@ -144,7 +144,7 @@ void __init dma_contiguous_reserve(phys_addr_t limit)
>   }
>  
>   if (selected_size && !dma_contiguous_default_area) {
> - pr_debug("%s: reserving %ld MiB for global area\n", __func__,
> + pr_debug("%s(): reserving %ld MiB for global area\n", __func__,
>(unsigned long)selected_size / SZ_1M);
>  
>   dma_contiguous_reserve_area(selected_size, selected_base,
> @@ -163,8 +163,9 @@ static int __init cma_activate_area(struct cma *cma)
>   unsigned i = cma->count >> pageblock_order;
>   struct zone *zone;
>  
> - cma->bitmap = kzalloc(bitmap_size, GFP_KERNEL);
> + pr_debug("%s()\n", __func__);
>  
> + cma->bitmap = kzalloc(bitmap_size, GFP_KERNEL);
>   if (!cma->bitmap)
>   return -ENOMEM;
>  
> @@ -234,7 +235,8 @@ int __init dma_contiguous_reserve_area(phys_addr_t size, 
> phys_addr_t base,
>  
>   /* Sanity checks */
>   if (cma_area_count == ARRAY_SIZE(cma_areas)) {
> - pr_err("Not enough slots for CMA reserved regions!\n");
> + pr_err("%s(): Not enough slots for CMA reserved regions!\n",
> + __func__);
>   return -ENOSPC;
>   }
>  
> @@ -274,14 +276,15 @@ int __init dma_contiguous_reserve_area(phys_addr_t 
> size, phys_addr_t base,
>   *res_cma = cma;
>   cma_area_count++;
>  
> - pr_info("CMA: reserved %ld MiB at %08lx\n", (unsigned long)size / SZ_1M,
> - (unsigned long)base);
> + pr_info("%s(): reserved %ld MiB at %08lx\n",
> + __func__, (unsigned long)size / SZ_1M, (unsigned long)base);
>  
>   /* Architecture specific contiguous memory fixup. */
>   dma_contiguous_early_fixup(base, size);
>   return 0;
>  err:
> - pr_err("CMA: failed to reserve %ld MiB\n", (unsigned long)size / SZ_1M);
> + pr_err("%s(): failed to reserve %ld MiB\n",
> + __func__, (unsigned long)size / SZ_1M);
>   return ret;
>  }
>  
> 


-- 
Thanks.
Zhang Yanfei


Re: [PATCH v2 05/10] DMA, CMA: support arbitrary bitmap granularity

2014-06-12 Thread Zhang Yanfei
ze, base, limit, 0, 0,
>>  res_cma, fixed);
>>  if (ret)
>>  return ret;
>> @@ -324,13 +358,6 @@ int __init dma_contiguous_reserve_area(phys_addr_t 
>> size, phys_addr_t base,
>>  return 0;
>>  }
>>  
>> -static void clear_cma_bitmap(struct cma *cma, unsigned long pfn, int count)
>> -{
>> -mutex_lock(&cma->lock);
>> -bitmap_clear(cma->bitmap, pfn - cma->base_pfn, count);
>> -mutex_unlock(&cma->lock);
>> -}
>> -
>>  /**
>>   * dma_alloc_from_contiguous() - allocate pages from contiguous area
>>   * @dev:   Pointer to device for which the allocation is performed.
>> @@ -345,7 +372,8 @@ static void clear_cma_bitmap(struct cma *cma, unsigned 
>> long pfn, int count)
>>  static struct page *__dma_alloc_from_contiguous(struct cma *cma, int count,
>> unsigned int align)
>>  {
>> -unsigned long mask, pfn, pageno, start = 0;
>> +unsigned long mask, pfn, start = 0;
>> +unsigned long bitmap_maxno, bitmapno, nr_bits;
> 
> Just Nit: bitmap_maxno, bitmap_no or something consistent.
> I know you love consistent when I read description in first patch
> in this patchset. ;-)
> 

Yeah, not only in this patchset, I saw Joonsoo trying to unify all
kinds of things in the MM. This is great for newbies, IMO.

-- 
Thanks.
Zhang Yanfei


Re: [PATCH v2 08/10] mm, cma: clean-up cma allocation error path

2014-06-12 Thread Zhang Yanfei
On 06/12/2014 11:21 AM, Joonsoo Kim wrote:
 We can remove one call site for clear_cma_bitmap() if we first
 call it before checking the error number.
 
 Signed-off-by: Joonsoo Kim iamjoonsoo@lge.com

Reviewed-by: Zhang Yanfei zhangyan...@cn.fujitsu.com

 
 diff --git a/mm/cma.c b/mm/cma.c
 index 1e1b017..01a0713 100644
 --- a/mm/cma.c
 +++ b/mm/cma.c
 @@ -282,11 +282,12 @@ struct page *cma_alloc(struct cma *cma, int count, 
 unsigned int align)
   if (ret == 0) {
   page = pfn_to_page(pfn);
   break;
 - } else if (ret != -EBUSY) {
 - clear_cma_bitmap(cma, pfn, count);
 - break;
   }
 +
   clear_cma_bitmap(cma, pfn, count);
 + if (ret != -EBUSY)
 + break;
 +
   pr_debug("%s(): memory range at %p is busy, retrying\n",
__func__, pfn_to_page(pfn));
   /* try again with a bit different memory target */
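
For what it's worth, the resulting control flow is easy to model outside the
kernel. In the sketch below try_alloc() and bitmap_clear_range() are invented
stand-ins; only the single clear-then-maybe-retry shape mirrors the patch:

#include <errno.h>
#include <stdbool.h>
#include <stdio.h>

static int try_alloc(int attempt)
{
        return attempt < 2 ? -EBUSY : 0;   /* pretend the range frees up later */
}

static void bitmap_clear_range(void) { puts("bitmap cleared"); }

int main(void)
{
        int attempt = 0;
        bool success = false;

        for (;;) {
                int ret = try_alloc(attempt++);
                if (ret == 0) {
                        success = true;
                        break;                  /* page = pfn_to_page(pfn) in the kernel */
                }
                bitmap_clear_range();           /* single call site, as in the patch */
                if (ret != -EBUSY)
                        break;
                puts("range busy, retrying");
        }
        printf("alloc %s after %d attempt(s)\n",
               success ? "succeeded" : "failed", attempt);
        return 0;
}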
 


-- 
Thanks.
Zhang Yanfei


Re: [PATCH 07/10] mm: rename allocflags_to_migratetype for clarity

2014-06-10 Thread Zhang Yanfei
On 06/11/2014 10:41 AM, Minchan Kim wrote:
> On Mon, Jun 09, 2014 at 11:26:19AM +0200, Vlastimil Babka wrote:
>> From: David Rientjes 
>>
>> The page allocator has gfp flags (like __GFP_WAIT) and alloc flags (like
>> ALLOC_CPUSET) that have separate semantics.
>>
>> The function allocflags_to_migratetype() actually takes gfp flags, not alloc
>> flags, and returns a migratetype.  Rename it to gfpflags_to_migratetype().
>>
>> Signed-off-by: David Rientjes 
>> Signed-off-by: Vlastimil Babka 
> 
> I was one of person who got confused sometime.

Some names in MM really make people confused, but coming up with an
appropriate name is also hard. I once wanted to rename
nr_free_zone_pages() and nr_free_buffer_pages(), but good alternatives
were hard to find, so in the end Andrew suggested just adding detailed
function descriptions to make them clear.
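
To make the gfp-vs-alloc confusion concrete, here is a tiny userspace model;
the flag values and the mapping below are invented, only the naming point
matters:

/*
 * gfp flags and alloc flags are different namespaces, so the converter's
 * name should say which one it consumes.  Values here are illustrative,
 * not the kernel's real encoding.
 */
#include <stdio.h>

#define __GFP_MOVABLE     0x1u   /* gfp namespace (made-up values) */
#define __GFP_RECLAIMABLE 0x2u
#define ALLOC_CPUSET      0x1u   /* alloc namespace - same bit, different meaning */

enum migratetype { MIGRATE_UNMOVABLE, MIGRATE_RECLAIMABLE, MIGRATE_MOVABLE };

static enum migratetype gfpflags_to_migratetype(unsigned int gfp_flags)
{
        if (gfp_flags & __GFP_MOVABLE)
                return MIGRATE_MOVABLE;
        if (gfp_flags & __GFP_RECLAIMABLE)
                return MIGRATE_RECLAIMABLE;
        return MIGRATE_UNMOVABLE;
}

int main(void)
{
        /* Passing an alloc flag here would compile but be meaningless -
         * the clearer name makes such mistakes easier to spot in review. */
        printf("migratetype = %d\n", gfpflags_to_migratetype(__GFP_MOVABLE));
        return 0;
}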

Reviewed-by: Zhang Yanfei 

> 
> Acked-by: Minchan Kim 
> 


-- 
Thanks.
Zhang Yanfei


Re: [PATCH 05/10] mm, compaction: remember position within pageblock in free pages scanner

2014-06-10 Thread Zhang Yanfei
On 06/09/2014 05:26 PM, Vlastimil Babka wrote:
> Unlike the migration scanner, the free scanner remembers the beginning of the
> last scanned pageblock in cc->free_pfn. It might be therefore rescanning pages
> uselessly when called several times during single compaction. This might have
> been useful when pages were returned to the buddy allocator after a failed
> migration, but this is no longer the case.
> 
> This patch changes the meaning of cc->free_pfn so that if it points to a
> middle of a pageblock, that pageblock is scanned only from cc->free_pfn to the
> end. isolate_freepages_block() will record the pfn of the last page it looked
> at, which is then used to update cc->free_pfn.
> 
> In the mmtests stress-highalloc benchmark, this has resulted in lowering the
> ratio between pages scanned by both scanners, from 2.5 free pages per migrate
> page, to 2.25 free pages per migrate page, without affecting success rates.
> 
> Signed-off-by: Vlastimil Babka 

Reviewed-by: Zhang Yanfei 
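
The resume-from-the-middle idea is simple enough to model in plain C;
pageblock size, budget and the scan below are all made up, only the
"remember where we stopped" logic follows the patch:

#include <stdio.h>

#define PAGEBLOCK_NR_PAGES 16UL

/* scan [*start_pfn, end_pfn) and record the last pfn looked at */
static void scan_block(unsigned long *start_pfn, unsigned long end_pfn,
                       unsigned long budget)
{
        unsigned long pfn;

        for (pfn = *start_pfn; pfn < end_pfn && budget; pfn++, budget--)
                *start_pfn = pfn;       /* analogue of updating cc->free_pfn */
}

int main(void)
{
        unsigned long block_start = 32;
        unsigned long block_end = block_start + PAGEBLOCK_NR_PAGES;
        unsigned long free_pfn = block_start;   /* cc->free_pfn analogue */

        /* first pass runs out of budget halfway through the block */
        scan_block(&free_pfn, block_end, 8);
        printf("first pass stopped at pfn %lu\n", free_pfn);

        /* second pass restarts from free_pfn instead of block_start */
        unsigned long resume = free_pfn < block_end ? free_pfn : block_start;
        printf("second pass resumes at pfn %lu, not %lu\n", resume, block_start);
        return 0;
}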

> Cc: Minchan Kim 
> Cc: Mel Gorman 
> Cc: Joonsoo Kim 
> Cc: Michal Nazarewicz 
> Cc: Naoya Horiguchi 
> Cc: Christoph Lameter 
> Cc: Rik van Riel 
> Cc: David Rientjes 
> ---
>  mm/compaction.c | 33 -
>  1 file changed, 28 insertions(+), 5 deletions(-)
> 
> diff --git a/mm/compaction.c b/mm/compaction.c
> index 83f72bd..58dfaaa 100644
> --- a/mm/compaction.c
> +++ b/mm/compaction.c
> @@ -297,7 +297,7 @@ static bool suitable_migration_target(struct page *page)
>   * (even though it may still end up isolating some pages).
>   */
>  static unsigned long isolate_freepages_block(struct compact_control *cc,
> - unsigned long blockpfn,
> + unsigned long *start_pfn,
>   unsigned long end_pfn,
>   struct list_head *freelist,
>   bool strict)
> @@ -306,6 +306,7 @@ static unsigned long isolate_freepages_block(struct 
> compact_control *cc,
>   struct page *cursor, *valid_page = NULL;
>   unsigned long flags;
>   bool locked = false;
> + unsigned long blockpfn = *start_pfn;
>  
>   cursor = pfn_to_page(blockpfn);
>  
> @@ -314,6 +315,9 @@ static unsigned long isolate_freepages_block(struct 
> compact_control *cc,
>   int isolated, i;
>   struct page *page = cursor;
>  
> + /* Record how far we have got within the block */
> + *start_pfn = blockpfn;
> +
>   /*
>* Periodically drop the lock (if held) regardless of its
>* contention, to give chance to IRQs. Abort async compaction
> @@ -424,6 +428,9 @@ isolate_freepages_range(struct compact_control *cc,
>   LIST_HEAD(freelist);
>  
>   for (pfn = start_pfn; pfn < end_pfn; pfn += isolated) {
> + /* Protect pfn from changing by isolate_freepages_block */
> + unsigned long isolate_start_pfn = pfn;
> +
>   if (!pfn_valid(pfn) || cc->zone != page_zone(pfn_to_page(pfn)))
>   break;
>  
> @@ -434,8 +441,8 @@ isolate_freepages_range(struct compact_control *cc,
>   block_end_pfn = ALIGN(pfn + 1, pageblock_nr_pages);
>   block_end_pfn = min(block_end_pfn, end_pfn);
>  
> - isolated = isolate_freepages_block(cc, pfn, block_end_pfn,
> -&freelist, true);
> + isolated = isolate_freepages_block(cc, &isolate_start_pfn,
> + block_end_pfn, &freelist, true);
>  
>   /*
>* In strict mode, isolate_freepages_block() returns 0 if
> @@ -774,6 +781,7 @@ static void isolate_freepages(struct zone *zone,
>   block_end_pfn = block_start_pfn,
>   block_start_pfn -= pageblock_nr_pages) {
>   unsigned long isolated;
> + unsigned long isolate_start_pfn;
>  
>   /*
>* This can iterate a massively long zone without finding any
> @@ -807,12 +815,27 @@ static void isolate_freepages(struct zone *zone,
>   continue;
>  
>   /* Found a block suitable for isolating free pages from */
> - cc->free_pfn = block_start_pfn;
> - isolated = isolate_freepages_block(cc, block_start_pfn,
> + isolate_start_pfn = block_start_pfn;
> +
> + /*
> +  * If we are restarting the free scanner in this block, do not
> +  * rescan the beginning of the block
> +  */
> + if (cc->free_pfn < block_end_pfn)

Re: [PATCH 01/10] mm, compaction: do not recheck suitable_migration_target under lock

2014-06-10 Thread Zhang Yanfei
On 06/09/2014 05:26 PM, Vlastimil Babka wrote:
> isolate_freepages_block() rechecks if the pageblock is suitable to be a target
> for migration after it has taken the zone->lock. However, the check has been
> optimized to occur only once per pageblock, and compact_checklock_irqsave()
> might be dropping and reacquiring lock, which means somebody else might have
> changed the pageblock's migratetype meanwhile.
> 
> Furthermore, nothing prevents the migratetype to change right after
> isolate_freepages_block() has finished isolating. Given how imperfect this is,
> it's simpler to just rely on the check done in isolate_freepages() without
> lock, and not pretend that the recheck under lock guarantees anything. It is
> just a heuristic after all.
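
Agreed that the recheck buys nothing real; the point generalizes to any check
whose result can go stale the moment the lock is dropped. A userspace toy
(compile with -pthread; the timing makes the outcome merely typical, not
guaranteed):

#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static int suitable = 1;

static void *modifier(void *arg)
{
        (void)arg;
        usleep(1000);
        pthread_mutex_lock(&lock);
        suitable = 0;                   /* e.g. pageblock migratetype changed */
        pthread_mutex_unlock(&lock);
        return NULL;
}

int main(void)
{
        pthread_t t;
        int seen, now;

        pthread_create(&t, NULL, modifier, NULL);

        pthread_mutex_lock(&lock);
        seen = suitable;                /* "recheck under lock"... */
        pthread_mutex_unlock(&lock);

        usleep(2000);                   /* ...but it can change before we act on it */

        pthread_mutex_lock(&lock);
        now = suitable;
        pthread_mutex_unlock(&lock);

        printf("checked %d under lock, value is now %d\n", seen, now);
        pthread_join(t, NULL);
        return 0;
}

Typically it prints "checked 1 under lock, value is now 0": holding the lock
for the check did not keep the property true afterwards, which is exactly why
treating it as a heuristic is the honest thing to do.
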
> 
> Signed-off-by: Vlastimil Babka 

Reviewed-by: Zhang Yanfei 

> Cc: Minchan Kim 
> Cc: Mel Gorman 
> Cc: Joonsoo Kim 
> Cc: Michal Nazarewicz 
> Cc: Naoya Horiguchi 
> Cc: Christoph Lameter 
> Cc: Rik van Riel 
> Cc: David Rientjes 
> ---
> I suggest folding mm-compactionc-isolate_freepages_block-small-tuneup.patch 
> into this
> 
>  mm/compaction.c | 13 -
>  1 file changed, 13 deletions(-)
> 
> diff --git a/mm/compaction.c b/mm/compaction.c
> index 5175019..b73b182 100644
> --- a/mm/compaction.c
> +++ b/mm/compaction.c
> @@ -276,7 +276,6 @@ static unsigned long isolate_freepages_block(struct 
> compact_control *cc,
>   struct page *cursor, *valid_page = NULL;
>   unsigned long flags;
>   bool locked = false;
> - bool checked_pageblock = false;
>  
>   cursor = pfn_to_page(blockpfn);
>  
> @@ -307,18 +306,6 @@ static unsigned long isolate_freepages_block(struct 
> compact_control *cc,
>   if (!locked)
>   break;
>  
> - /* Recheck this is a suitable migration target under lock */
> - if (!strict && !checked_pageblock) {
> - /*
> -  * We need to check suitability of pageblock only once
> -  * and this isolate_freepages_block() is called with
> -  * pageblock range, so just check once is sufficient.
> -  */
> - checked_pageblock = true;
> - if (!suitable_migration_target(page))
> - break;
> -     }
> -
>   /* Recheck this is a buddy page under lock */
>   if (!PageBuddy(page))
>   goto isolate_fail;
> 


-- 
Thanks.
Zhang Yanfei


Re: [PATCH 05/10] mm, compaction: remember position within pageblock in free pages scanner

2014-06-10 Thread Zhang Yanfei
On 06/09/2014 05:26 PM, Vlastimil Babka wrote:
 Unlike the migration scanner, the free scanner remembers the beginning of the
 last scanned pageblock in cc->free_pfn. It might be therefore rescanning pages
 uselessly when called several times during single compaction. This might have
 been useful when pages were returned to the buddy allocator after a failed
 migration, but this is no longer the case.
 
 This patch changes the meaning of cc->free_pfn so that if it points to a
 middle of a pageblock, that pageblock is scanned only from cc->free_pfn to the
 end. isolate_freepages_block() will record the pfn of the last page it looked
 at, which is then used to update cc->free_pfn.
 
 In the mmtests stress-highalloc benchmark, this has resulted in lowering the
 ratio between pages scanned by both scanners, from 2.5 free pages per migrate
 page, to 2.25 free pages per migrate page, without affecting success rates.
 
 Signed-off-by: Vlastimil Babka vba...@suse.cz

Reviewed-by: Zhang Yanfei zhangyan...@cn.fujitsu.com

 Cc: Minchan Kim minc...@kernel.org
 Cc: Mel Gorman mgor...@suse.de
 Cc: Joonsoo Kim iamjoonsoo@lge.com
 Cc: Michal Nazarewicz min...@mina86.com
 Cc: Naoya Horiguchi n-horigu...@ah.jp.nec.com
 Cc: Christoph Lameter c...@linux.com
 Cc: Rik van Riel r...@redhat.com
 Cc: David Rientjes rient...@google.com
 ---
  mm/compaction.c | 33 -
  1 file changed, 28 insertions(+), 5 deletions(-)
 
 diff --git a/mm/compaction.c b/mm/compaction.c
 index 83f72bd..58dfaaa 100644
 --- a/mm/compaction.c
 +++ b/mm/compaction.c
 @@ -297,7 +297,7 @@ static bool suitable_migration_target(struct page *page)
   * (even though it may still end up isolating some pages).
   */
  static unsigned long isolate_freepages_block(struct compact_control *cc,
 - unsigned long blockpfn,
 + unsigned long *start_pfn,
   unsigned long end_pfn,
   struct list_head *freelist,
   bool strict)
 @@ -306,6 +306,7 @@ static unsigned long isolate_freepages_block(struct 
 compact_control *cc,
   struct page *cursor, *valid_page = NULL;
   unsigned long flags;
   bool locked = false;
 + unsigned long blockpfn = *start_pfn;
  
   cursor = pfn_to_page(blockpfn);
  
 @@ -314,6 +315,9 @@ static unsigned long isolate_freepages_block(struct 
 compact_control *cc,
   int isolated, i;
   struct page *page = cursor;
  
 + /* Record how far we have got within the block */
 + *start_pfn = blockpfn;
 +
   /*
* Periodically drop the lock (if held) regardless of its
* contention, to give chance to IRQs. Abort async compaction
 @@ -424,6 +428,9 @@ isolate_freepages_range(struct compact_control *cc,
   LIST_HEAD(freelist);
  
   for (pfn = start_pfn; pfn < end_pfn; pfn += isolated) {
 + /* Protect pfn from changing by isolate_freepages_block */
 + unsigned long isolate_start_pfn = pfn;
 +
   if (!pfn_valid(pfn) || cc->zone != page_zone(pfn_to_page(pfn)))
   break;
  
 @@ -434,8 +441,8 @@ isolate_freepages_range(struct compact_control *cc,
   block_end_pfn = ALIGN(pfn + 1, pageblock_nr_pages);
   block_end_pfn = min(block_end_pfn, end_pfn);
  
 - isolated = isolate_freepages_block(cc, pfn, block_end_pfn,
 -&freelist, true);
 + isolated = isolate_freepages_block(cc, &isolate_start_pfn,
 + block_end_pfn, &freelist, true);
  
   /*
* In strict mode, isolate_freepages_block() returns 0 if
 @@ -774,6 +781,7 @@ static void isolate_freepages(struct zone *zone,
   block_end_pfn = block_start_pfn,
   block_start_pfn -= pageblock_nr_pages) {
   unsigned long isolated;
 + unsigned long isolate_start_pfn;
  
   /*
* This can iterate a massively long zone without finding any
 @@ -807,12 +815,27 @@ static void isolate_freepages(struct zone *zone,
   continue;
  
   /* Found a block suitable for isolating free pages from */
 - cc->free_pfn = block_start_pfn;
 - isolated = isolate_freepages_block(cc, block_start_pfn,
 + isolate_start_pfn = block_start_pfn;
 +
 + /*
 +  * If we are restarting the free scanner in this block, do not
 +  * rescan the beginning of the block
 +  */
 + if (cc->free_pfn < block_end_pfn)
 + isolate_start_pfn = cc->free_pfn;
 +
 + isolated = isolate_freepages_block(cc, &isolate_start_pfn,
   block_end_pfn, &freelist, false);
   nr_freepages += isolated


Re: [PATCH] mm/swap: cleanup *lru_cache_add* functions

2014-04-21 Thread Zhang Yanfei
On 04/21/2014 12:02 PM, Jianyu Zhan wrote:
> Hi,  Yanfei,
> 
> On Mon, Apr 21, 2014 at 9:00 AM, Zhang Yanfei
>  wrote:
>> What should be exported?
>>
>> lru_cache_add()
>> lru_cache_add_anon()
>> lru_cache_add_file()
>>
>> It seems you only export lru_cache_add_file() in the patch.
> 
> Right, lru_cache_add_anon() is only used by VM code, so it should not
> be exported.
> 
> lru_cache_add_file() and lru_cache_add() are supposed to be used by
> vfs and fs code.
> 
> But  now only lru_cache_add_file() is  used by CIFS and FUSE, which
> both could be
> built as module, so it must be exported;  and lru_cache_add() has now
> no module users,
> so as Rik suggests, it is unexported too.
> 

OK. So the sentence in the patch log confused me:

[ However, lru_cache_add() is supposed to
be used by vfs, or whatever others, but it is not exported.]

otherwise, 
Reviewed-by: Zhang Yanfei 

Thanks.

-- 
Thanks.
Zhang Yanfei


Re: [PATCH] mm/swap: cleanup *lru_cache_add* functions

2014-04-20 Thread Zhang Yanfei
Hi Jianyu

On 04/18/2014 11:39 PM, Jianyu Zhan wrote:
> Hi, Christoph Hellwig,
> 
>> There are no modular users of lru_cache_add, so please don't needlessly
>> export it.
> 
> yep, I re-checked and found there is no module user of neither 
> lru_cache_add() nor lru_cache_add_anon(), so don't export it.
> 
> Here is the renewed patch:
> ---
> 
> In mm/swap.c, __lru_cache_add() is exported, but actually there are
> no users outside this file. However, lru_cache_add() is supposed to
> be used by vfs, or whatever others, but it is not exported.
> 
> This patch unexports __lru_cache_add(), and makes it static.
> It also exports lru_cache_add_file(), as it is used by cifs, which
> can be loaded as a module.

What should be exported?

lru_cache_add()
lru_cache_add_anon()
lru_cache_add_file()

It seems you only export lru_cache_add_file() in the patch.

Thanks

> 
> Signed-off-by: Jianyu Zhan 
> ---
>  include/linux/swap.h | 19 ++-
>  mm/swap.c| 31 +++
>  2 files changed, 25 insertions(+), 25 deletions(-)
> 
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index 3507115..5a14b92 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -308,8 +308,9 @@ extern unsigned long nr_free_pagecache_pages(void);
>  
>  
>  /* linux/mm/swap.c */
> -extern void __lru_cache_add(struct page *);
>  extern void lru_cache_add(struct page *);
> +extern void lru_cache_add_anon(struct page *page);
> +extern void lru_cache_add_file(struct page *page);
>  extern void lru_add_page_tail(struct page *page, struct page *page_tail,
>struct lruvec *lruvec, struct list_head *head);
>  extern void activate_page(struct page *);
> @@ -323,22 +324,6 @@ extern void swap_setup(void);
>  
>  extern void add_page_to_unevictable_list(struct page *page);
>  
> -/**
> - * lru_cache_add: add a page to the page lists
> - * @page: the page to add
> - */
> -static inline void lru_cache_add_anon(struct page *page)
> -{
> - ClearPageActive(page);
> - __lru_cache_add(page);
> -}
> -
> -static inline void lru_cache_add_file(struct page *page)
> -{
> - ClearPageActive(page);
> - __lru_cache_add(page);
> -}
> -
>  /* linux/mm/vmscan.c */
>  extern unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
>   gfp_t gfp_mask, nodemask_t *mask);
> diff --git a/mm/swap.c b/mm/swap.c
> index ab3f508..c0cd7d0 100644
> --- a/mm/swap.c
> +++ b/mm/swap.c
> @@ -582,13 +582,7 @@ void mark_page_accessed(struct page *page)
>  }
>  EXPORT_SYMBOL(mark_page_accessed);
>  
> -/*
> - * Queue the page for addition to the LRU via pagevec. The decision on 
> whether
> - * to add the page to the [in]active [file|anon] list is deferred until the
> - * pagevec is drained. This gives a chance for the caller of 
> __lru_cache_add()
> - * have the page added to the active list using mark_page_accessed().
> - */
> -void __lru_cache_add(struct page *page)
> +static void __lru_cache_add(struct page *page)
>  {
>   struct pagevec *pvec = &get_cpu_var(lru_add_pvec);
>  
> @@ -598,11 +592,32 @@ void __lru_cache_add(struct page *page)
>   pagevec_add(pvec, page);
>   put_cpu_var(lru_add_pvec);
>  }
> -EXPORT_SYMBOL(__lru_cache_add);
> +
> +/**
> + * lru_cache_add: add a page to the page lists
> + * @page: the page to add
> + */
> +void lru_cache_add_anon(struct page *page)
> +{
> + ClearPageActive(page);
> + __lru_cache_add(page);
> +}
> +
> +void lru_cache_add_file(struct page *page)
> +{
> + ClearPageActive(page);
> + __lru_cache_add(page);
> +}
> +EXPORT_SYMBOL(lru_cache_add_file);
>  
>  /**
>   * lru_cache_add - add a page to a page list
>   * @page: the page to be added to the LRU.
> + *
> + * Queue the page for addition to the LRU via pagevec. The decision on 
> whether
> + * to add the page to the [in]active [file|anon] list is deferred until the
> + * pagevec is drained. This gives a chance for the caller of lru_cache_add()
> + * have the page added to the active list using mark_page_accessed().
>   */
>  void lru_cache_add(struct page *page)
>  {
> 


-- 
Thanks.
Zhang Yanfei


Re: [PATCH v4] mm: support madvise(MADV_FREE)

2014-04-14 Thread Zhang Yanfei
On 04/15/2014 12:46 PM, Minchan Kim wrote:
> Linux doesn't have an ability to free pages lazy while other OS
> already have been supported that named by madvise(MADV_FREE).
> 
> The gain is clear that kernel can discard freed pages rather than
> swapping out or OOM if memory pressure happens.
> 
> Without memory pressure, freed pages would be reused by userspace
> without another additional overhead(ex, page fault + allocation
> + zeroing).
> 
> How to work is following as.
> 
> When madvise syscall is called, VM clears dirty bit of ptes of
> the range. If memory pressure happens, VM checks dirty bit of
> page table and if it found still "clean", it means it's a
> "lazyfree pages" so VM could discard the page instead of swapping out.
> Once there was store operation for the page before VM peek a page
> to reclaim, dirty bit is set so VM can swap out the page instead of
> discarding.
> 
> Firstly, heavy users would be general allocators(ex, jemalloc,
> tcmalloc and hope glibc supports it) and jemalloc/tcmalloc already
> have supported the feature for other OS(ex, FreeBSD)

Reviewed-by: Zhang Yanfei 
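
For context, this is how an allocator would be expected to use the new advice
from userspace once the kernel supports it. The MADV_FREE value below matches
the proposed mman-common.h addition but is still an assumption on older
headers, and the fallback comment is mine, not part of the patch:

#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

#ifndef MADV_FREE
#define MADV_FREE 8             /* value proposed for asm-generic (assumed) */
#endif

int main(void)
{
        size_t len = 1 << 20;
        char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (buf == MAP_FAILED)
                return 1;

        memset(buf, 0xaa, len);                 /* pages become dirty */

        /* "free" the range: pages may be discarded under memory pressure... */
        if (madvise(buf, len, MADV_FREE) != 0)
                perror("madvise(MADV_FREE)");   /* older kernel: fall back to MADV_DONTNEED */

        /* ...but writing again simply re-dirties them, with no fault plus
         * zeroing cost if they were not reclaimed in between. */
        buf[0] = 1;

        munmap(buf, len);
        return 0;
}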

> 
> barrios@blaptop:~/benchmark/ebizzy$ lscpu
> Architecture:  x86_64
> CPU op-mode(s):32-bit, 64-bit
> Byte Order:Little Endian
> CPU(s):4
> On-line CPU(s) list:   0-3
> Thread(s) per core:2
> Core(s) per socket:2
> Socket(s): 1
> NUMA node(s):  1
> Vendor ID: GenuineIntel
> CPU family:6
> Model: 42
> Stepping:  7
> CPU MHz:   2801.000
> BogoMIPS:  5581.64
> Virtualization:VT-x
> L1d cache: 32K
> L1i cache: 32K
> L2 cache:  256K
> L3 cache:  4096K
> NUMA node0 CPU(s): 0-3
> 
> ebizzy benchmark(./ebizzy -S 10 -n 512)
> 
>  vanilla-jemalloc MADV_free-jemalloc
> 
> 1 thread
> records:  10  records:  10
> avg:  7436.70 avg:  15292.70
> std:  48.01(0.65%)std:  496.40(3.25%)
> max:  7542.00 max:  15944.00
> min:  7366.00 min:  14478.00
> 
> 2 thread
> records:  10  records:  10
> avg:  12190.50avg:  24975.50
> std:  1011.51(8.30%)  std:  1127.22(4.51%)
> max:  13012.00max:  26382.00
> min:  10192.00min:  23265.00
> 
> 4 thread
> records:  10  records:  10
> avg:  16875.30avg:  36320.90
> std:  562.59(3.33%)   std:  1503.75(4.14%)
> max:  17465.00max:  38314.00
> min:  15552.00min:  33863.00
> 
> 8 thread
> records:  10  records:  10
> avg:  16966.80avg:  35915.20
> std:  229.35(1.35%)   std:  2153.89(6.00%)
> max:  17456.00max:  37943.00
> min:  16742.00min:  29891.00
> 
> 16 thread
> records:  10  records:  10
> avg:  20590.90avg:  37388.40
> std:  362.33(1.76%)   std:  1282.59(3.43%)
> max:  20954.00max:  38911.00
> min:  19985.00min:  34928.00
> 
> 32 thread
> records:  10  records:  10
> avg:  22633.40avg:  37118.00
> std:  413.73(1.83%)   std:  766.36(2.06%)
> max:  23120.00max:  38328.00
> min:  22071.00min:  35557.00
> 
> In summary, MADV_FREE is about 2 time faster than MADV_DONTNEED.
> Patchset is based on 3.14
> 
> * From v3
>  * Add "how to work part" in description - Zhang
>  * Add page_discardable utility function - Zhang
>  * Clean up
> 
> * From v2
>  * Remove forceful dirty marking of swap-readed page - Johannes
>  * Remove deactivation logic of lazyfreed page
>  * Rebased on 3.14
>  * Remove RFC tag
> 
> * From v1
>  * Use custom page table walker for madvise_free - Johannes
>  * Remove PG_lazypage flag - Johannes
>  * Do madvise_dontneed instead of madvise_free in swapless system
> 
> Cc: Hugh Dickins 
> Cc: Johannes Weiner 
> Cc: Rik van Riel 
> Cc: KOSAKI Motohiro 
> Cc: Mel Gorman 
> Cc: Jason Evans 
> Signed-off-by: Minchan Kim 
> ---
>  include/linux/mm.h |   2 +
>  include/linux/rmap.h   |  21 -
>  include/linux/vm_event_item.h  |   1 +
>  include/uapi/asm-generic/mman-common.h |   1 +
>  mm/madvise.c   |  25 ++
>  mm/memory.c| 140 
> +
>  mm/rmap.c  |  82 +--
>  mm/vmscan.c

Re: [RFC PATCH v2] memory-hotplug: Update documentation to hide information about SECTIONS and remove end_phys_index

2014-04-14 Thread Zhang Yanfei
On 04/14/2014 04:43 PM, Li Zhong wrote:
> Seems we all agree that information about SECTION, e.g. section size,
> sections per memory block should be kept as kernel internals, and not
> exposed to userspace.
> 
> This patch updates Documentation/memory-hotplug.txt to refer to memory
> blocks instead of memory sections where appropriate and added a
> paragraph to explain that memory blocks are made of memory sections.
> The documentation update is mostly provided by Nathan.
> 
> Also, as end_phys_index in code is actually not the end section id, but
> the end memory block id, which should always be the same as phys_index.
> So it is removed here.
> 
> Signed-off-by: Li Zhong 

Reviewed-by: Zhang Yanfei 

The nitpick I mentioned before still stands, though.
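
On the documentation itself: the memoryXXX naming rule being clarified below
(block id = start physical address / block size) is easy to sanity-check from
userspace. The address in this sketch is arbitrary and error handling is
minimal:

#include <inttypes.h>
#include <stdio.h>

int main(void)
{
        uint64_t block_size = 0, addr = 0x100000000ULL;  /* example address */
        FILE *f = fopen("/sys/devices/system/memory/block_size_bytes", "r");

        if (!f) {
                perror("block_size_bytes");
                return 1;
        }
        if (fscanf(f, "%" SCNx64, &block_size) != 1 || block_size == 0) {
                fclose(f);
                fprintf(stderr, "unexpected block_size_bytes format\n");
                return 1;
        }
        fclose(f);

        /* sysfs reports the size in hex; the directory name is the quotient */
        printf("address 0x%" PRIx64 " belongs to memory%" PRIu64 "\n",
               addr, addr / block_size);
        return 0;
}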

> ---
>  Documentation/memory-hotplug.txt |  125 
> +++---
>  drivers/base/memory.c|   12 
>  2 files changed, 61 insertions(+), 76 deletions(-)
> 
> diff --git a/Documentation/memory-hotplug.txt 
> b/Documentation/memory-hotplug.txt
> index 58340d5..1aa239f 100644
> --- a/Documentation/memory-hotplug.txt
> +++ b/Documentation/memory-hotplug.txt
> @@ -88,16 +88,21 @@ phase by hand.
>  
>  1.3. Unit of Memory online/offline operation
>  
> -Memory hotplug uses SPARSEMEM memory model. SPARSEMEM divides the whole 
> memory
> -into chunks of the same size. The chunk is called a "section". The size of
> -a section is architecture dependent. For example, power uses 16MiB, ia64 uses
> -1GiB. The unit of online/offline operation is "one section". (see Section 3.)
> +Memory hotplug uses SPARSEMEM memory model which allows memory to be divided
> +into chunks of the same size. These chunks are called "sections". The size of
> +a memory section is architecture dependent. For example, power uses 16MiB, 
> ia64
> +uses 1GiB.
>  
> -To determine the size of sections, please read this file:
> +Memory sections are combined into chunks referred to as "memory blocks". The
> +size of a memory block is architecture dependent and represents the logical
> +unit upon which memory online/offline operations are to be performed. The
> +default size of a memory block is the same as memory section size unless an
> +architecture specifies otherwise. (see Section 3.)
> +
> +To determine the size (in bytes) of a memory block please read this file:
>  
>  /sys/devices/system/memory/block_size_bytes
>  
> -This file shows the size of sections in byte.
>  
>  ---
>  2. Kernel Configuration
> @@ -123,42 +128,35 @@ config options.
>  (CONFIG_ACPI_CONTAINER).
>  This option can be kernel module too.
>  
> +
>  
> -4 sysfs files for memory hotplug
> +3 sysfs files for memory hotplug
>  
> -All sections have their device information in sysfs.  Each section is part of
> -a memory block under /sys/devices/system/memory as
> +All memory blocks have their device information in sysfs.  Each memory block
> +is described under /sys/devices/system/memory as
>  
>  /sys/devices/system/memory/memoryXXX
> -(XXX is the section id.)
> +(XXX is the memory block id.)
>  
> -Now, XXX is defined as (start_address_of_section / section_size) of the first
> -section contained in the memory block.  The files 'phys_index' and
> -'end_phys_index' under each directory report the beginning and end section 
> id's
> -for the memory block covered by the sysfs directory.  It is expected that all
> +For the memory block covered by the sysfs directory.  It is expected that all
>  memory sections in this range are present and no memory holes exist in the
>  range. Currently there is no way to determine if there is a memory hole, but
>  the existence of one should not affect the hotplug capabilities of the memory
>  block.
>  
> -For example, assume 1GiB section size. A device for a memory starting at
> +For example, assume 1GiB memory block size. A device for a memory starting at
>  0x1 is /sys/device/system/memory/memory4
>  (0x1 / 1Gib = 4)
>  This device covers address range [0x1 ... 0x14000)
>  
> -Under each section, you can see 4 or 5 files, the end_phys_index file being
> -a recent addition and not present on older kernels.
> +Under each memory block, you can see 4 files:
>  
> -/sys/devices/system/memory/memoryXXX/start_phys_index
> -/sys/devices/system/memory/memoryXXX/end_phys_index
> +/sys/devices/system/memory/memoryXXX/phys_index
>  /sys/devices/system/memory/memoryXXX/phys_device
>  /sys/devices/system/memory/memoryXXX/state
>  /sys/devices/system/memory/memoryXXX/removable
>  
> -'phys_inde

Re: [PATCH v3 0/5] hugetlb: add support gigantic page allocation at runtime

2014-04-14 Thread Zhang Yanfei
Clear explanation and implementation!

Reviewed-by: Zhang Yanfei 

On 04/11/2014 01:58 AM, Luiz Capitulino wrote:
> [Full introduction right after the changelog]
> 
> Changelog
> -
> 
> v3
> 
> - Dropped unnecessary WARN_ON() call [Kirill]
> - Always check if the pfn range lies within a zone [Yasuaki]
> - Renamed some function arguments for consistency
> 
> v2
> 
> - Rewrote allocation loop to avoid scanning useless PFNs [Yasuaki]
> - Dropped incomplete multi-arch support [Naoya]
> - Added patch to drop __init from prep_compound_gigantic_page()
> - Restricted the feature to x86_64 (more details in patch 5/5)
> - Added review-bys plus minor changelog changes
> 
> Introduction
> 
> 
> The HugeTLB subsystem uses the buddy allocator to allocate hugepages during
> runtime. This means that hugepages allocation during runtime is limited to
> MAX_ORDER order. For archs supporting gigantic pages (that is, page sizes
> greater than MAX_ORDER), this in turn means that those pages can't be
> allocated at runtime.
> 
> HugeTLB supports gigantic page allocation during boottime, via the boot
> allocator. To this end the kernel provides the command-line options
> hugepagesz= and hugepages=, which can be used to instruct the kernel to
> allocate N gigantic pages during boot.
> 
> For example, x86_64 supports 2M and 1G hugepages, but only 2M hugepages can
> be allocated and freed at runtime. If one wants to allocate 1G gigantic pages,
> this has to be done at boot via the hugepagesz= and hugepages= command-line
> options.
> 
> Now, gigantic page allocation at boottime has two serious problems:
> 
>  1. Boottime allocation is not NUMA aware. On a NUMA machine the kernel
> evenly distributes boottime allocated hugepages among nodes.
> 
> For example, suppose you have a four-node NUMA machine and want
> to allocate four 1G gigantic pages at boottime. The kernel will
> allocate one gigantic page per node.
> 
> On the other hand, we do have users who want to be able to specify
> which NUMA node gigantic pages should allocated from. So that they
> can place virtual machines on a specific NUMA node.
> 
>  2. Gigantic pages allocated at boottime can't be freed
> 
> At this point it's important to observe that regular hugepages allocated
> at runtime don't have those problems. This is so because HugeTLB interface
> for runtime allocation in sysfs supports NUMA and runtime allocated pages
> can be freed just fine via the buddy allocator.
> 
> This series adds support for allocating gigantic pages at runtime. It does
> so by allocating gigantic pages via CMA instead of the buddy allocator.
> Releasing gigantic pages is also supported via CMA. As this series builds
> on top of the existing HugeTLB interface, it makes gigantic page allocation
> and releasing just like regular sized hugepages. This also means that NUMA
> support just works.
> 
> For example, to allocate two 1G gigantic pages on node 1, one can do:
> 
>  # echo 2 > \
>/sys/devices/system/node/node1/hugepages/hugepages-1048576kB/nr_hugepages
> 
> And, to release all gigantic pages on the same node:
> 
>  # echo 0 > \
>/sys/devices/system/node/node1/hugepages/hugepages-1048576kB/nr_hugepages
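
Just to connect this to the consumer side: once a node has gigantic pages, a
process maps one with MAP_HUGETLB plus the 1GB size hint. The fallback
#defines below are the values used on common architectures, but treat them as
assumptions; this is an illustration, not part of the series:

#include <stdio.h>
#include <sys/mman.h>

#ifndef MAP_HUGETLB
#define MAP_HUGETLB 0x40000             /* assumed value (x86) */
#endif
#ifndef MAP_HUGE_SHIFT
#define MAP_HUGE_SHIFT 26
#endif
#ifndef MAP_HUGE_1GB
#define MAP_HUGE_1GB (30 << MAP_HUGE_SHIFT)
#endif

int main(void)
{
        size_t len = 1UL << 30;         /* one 1GiB gigantic page */
        void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB | MAP_HUGE_1GB,
                       -1, 0);

        if (p == MAP_FAILED) {
                perror("mmap 1GB hugepage");    /* none allocated, or no support */
                return 1;
        }
        printf("mapped a 1GiB gigantic page at %p\n", p);
        munmap(p, len);
        return 0;
}
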
> 
> Please, refer to patch 5/5 for full technical details.
> 
> Finally, please note that this series is a follow up for a previous series
> that tried to extend the command-line options set to be NUMA aware:
> 
>  http://marc.info/?l=linux-mm&m=139593335312191&w=2
> 
> During the discussion of that series it was agreed that having runtime
> allocation support for gigantic pages was a better solution.
> 
> Luiz Capitulino (5):
>   hugetlb: prep_compound_gigantic_page(): drop __init marker
>   hugetlb: add hstate_is_gigantic()
>   hugetlb: update_and_free_page(): don't clear PG_reserved bit
>   hugetlb: move helpers up in the file
>   hugetlb: add support for gigantic page allocation at runtime
> 
>  include/linux/hugetlb.h |   5 +
>  mm/hugetlb.c| 336 
> ++--
>  2 files changed, 245 insertions(+), 96 deletions(-)
> 


-- 
Thanks.
Zhang Yanfei


Re: [PATCH v4] mm: support madvise(MADV_FREE)

2014-04-14 Thread Zhang Yanfei
On 04/15/2014 12:46 PM, Minchan Kim wrote:
 Linux doesn't have an ability to free pages lazy while other OS
 already have been supported that named by madvise(MADV_FREE).
 
 The gain is clear that kernel can discard freed pages rather than
 swapping out or OOM if memory pressure happens.
 
 Without memory pressure, freed pages would be reused by userspace
 without another additional overhead(ex, page fault + allocation
 + zeroing).
 
 How to work is following as.
 
 When madvise syscall is called, VM clears dirty bit of ptes of
 the range. If memory pressure happens, VM checks dirty bit of
 page table and if it found still "clean", it means it's a
 "lazyfree pages" so VM could discard the page instead of swapping out.
 Once there was store operation for the page before VM peek a page
 to reclaim, dirty bit is set so VM can swap out the page instead of
 discarding.
 
 Firstly, heavy users would be general allocators(ex, jemalloc,
 tcmalloc and hope glibc supports it) and jemalloc/tcmalloc already
 have supported the feature for other OS(ex, FreeBSD)

Reviewed-by: Zhang Yanfei zhangyan...@cn.fujitsu.com

 
 barrios@blaptop:~/benchmark/ebizzy$ lscpu
 Architecture:  x86_64
 CPU op-mode(s):32-bit, 64-bit
 Byte Order:Little Endian
 CPU(s):4
 On-line CPU(s) list:   0-3
 Thread(s) per core:2
 Core(s) per socket:2
 Socket(s): 1
 NUMA node(s):  1
 Vendor ID: GenuineIntel
 CPU family:6
 Model: 42
 Stepping:  7
 CPU MHz:   2801.000
 BogoMIPS:  5581.64
 Virtualization:VT-x
 L1d cache: 32K
 L1i cache: 32K
 L2 cache:  256K
 L3 cache:  4096K
 NUMA node0 CPU(s): 0-3
 
 ebizzy benchmark(./ebizzy -S 10 -n 512)
 
  vanilla-jemalloc MADV_free-jemalloc
 
 1 thread
 records:  10  records:  10
 avg:  7436.70 avg:  15292.70
 std:  48.01(0.65%)std:  496.40(3.25%)
 max:  7542.00 max:  15944.00
 min:  7366.00 min:  14478.00
 
 2 thread
 records:  10  records:  10
 avg:  12190.50avg:  24975.50
 std:  1011.51(8.30%)  std:  1127.22(4.51%)
 max:  13012.00max:  26382.00
 min:  10192.00min:  23265.00
 
 4 thread
 records:  10  records:  10
 avg:  16875.30avg:  36320.90
 std:  562.59(3.33%)   std:  1503.75(4.14%)
 max:  17465.00max:  38314.00
 min:  15552.00min:  33863.00
 
 8 thread
 records:  10  records:  10
 avg:  16966.80avg:  35915.20
 std:  229.35(1.35%)   std:  2153.89(6.00%)
 max:  17456.00max:  37943.00
 min:  16742.00min:  29891.00
 
 16 thread
 records:  10  records:  10
 avg:  20590.90avg:  37388.40
 std:  362.33(1.76%)   std:  1282.59(3.43%)
 max:  20954.00max:  38911.00
 min:  19985.00min:  34928.00
 
 32 thread
 records:  10  records:  10
 avg:  22633.40avg:  37118.00
 std:  413.73(1.83%)   std:  766.36(2.06%)
 max:  23120.00max:  38328.00
 min:  22071.00min:  35557.00
 
 In summary, MADV_FREE is about 2 time faster than MADV_DONTNEED.
 Patchset is based on 3.14
 
 * From v3
  * Add how to work part in description - Zhang
  * Add page_discardable utility function - Zhang
  * Clean up
 
 * From v2
  * Remove forceful dirty marking of swap-readed page - Johannes
  * Remove deactivation logic of lazyfreed page
  * Rebased on 3.14
  * Remove RFC tag
 
 * From v1
  * Use custom page table walker for madvise_free - Johannes
  * Remove PG_lazypage flag - Johannes
  * Do madvise_dontneed instead of madvise_free in swapless system
 
 Cc: Hugh Dickins hu...@google.com
 Cc: Johannes Weiner han...@cmpxchg.org
 Cc: Rik van Riel r...@redhat.com
 Cc: KOSAKI Motohiro kosaki.motoh...@jp.fujitsu.com
 Cc: Mel Gorman mgor...@suse.de
 Cc: Jason Evans j...@fb.com
 Signed-off-by: Minchan Kim minc...@kernel.org
 ---
  include/linux/mm.h |   2 +
  include/linux/rmap.h   |  21 -
  include/linux/vm_event_item.h  |   1 +
  include/uapi/asm-generic/mman-common.h |   1 +
  mm/madvise.c   |  25 ++
  mm/memory.c| 140 
 +
  mm/rmap.c  |  82 +--
  mm/vmscan.c|  29 ++-
  mm/vmstat.c|   1 +
  9 files changed, 290 insertions(+), 12 deletions(-)
 
 diff --git a/include/linux/mm.h b/include/linux/mm.h
 index c1b7414c7bef..79af90212c19 100644
 --- a/include/linux/mm.h
 +++ b/include/linux/mm.h
 @@ -1063,6 +1063,8 @@ int zap_vma_ptes(struct vm_area_struct *vma, unsigned 
 long address,
   unsigned long
