[PATCH v2] vhost-vdpa: account iommu allocations

2023-12-26 Thread Pasha Tatashin
iommu allocations should be accounted in order to allow admins to
monitor and limit the amount of iommu memory.

Signed-off-by: Pasha Tatashin 
Acked-by: Michael S. Tsirkin 
---
 drivers/vhost/vdpa.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

Changelog:

v1:
This patch is spun off from the series:
https://lore.kernel.org/all/20231128204938.1453583-1-pasha.tatas...@soleen.com

v2:
- Synced with v6.7-rc7
- Added Acked-by Michael S. Tsirkin.

diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c
index da7ec77cdaff..a51c69c078d9 100644
--- a/drivers/vhost/vdpa.c
+++ b/drivers/vhost/vdpa.c
@@ -968,7 +968,8 @@ static int vhost_vdpa_map(struct vhost_vdpa *v, struct vhost_iotlb *iotlb,
r = ops->set_map(vdpa, asid, iotlb);
} else {
r = iommu_map(v->domain, iova, pa, size,
- perm_to_iommu_flags(perm), GFP_KERNEL);
+ perm_to_iommu_flags(perm),
+ GFP_KERNEL_ACCOUNT);
}
if (r) {
vhost_iotlb_del_range(iotlb, iova, iova + size - 1);
-- 
2.43.0.472.g3155946c3a-goog
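
For context, GFP_KERNEL_ACCOUNT is GFP_KERNEL plus __GFP_ACCOUNT, which
charges the allocation to the calling task's memory cgroup so that admins
can monitor and limit it. A minimal sketch of the idea outside of
vhost-vdpa (alloc_accounted_buf() is a made-up example, not part of the
patch):

#include <linux/gfp.h>
#include <linux/slab.h>

/*
 * Any allocation that userspace can grow without bound is a candidate
 * for __GFP_ACCOUNT, so that it is charged to (and limited by) the
 * caller's memory cgroup.
 */
static void *alloc_accounted_buf(size_t len)
{
        /* GFP_KERNEL_ACCOUNT is GFP_KERNEL | __GFP_ACCOUNT. */
        return kzalloc(len, GFP_KERNEL_ACCOUNT);
}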




[PATCH] vhost-vdpa: account iommu allocations

2023-11-30 Thread Pasha Tatashin
iommu allocations should be accounted in order to allow admins to
monitor and limit the amount of iommu memory.

Signed-off-by: Pasha Tatashin 
---
 drivers/vhost/vdpa.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

This patch is spun off from the series:
https://lore.kernel.org/all/20231128204938.1453583-1-pasha.tatas...@soleen.com

diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c
index da7ec77cdaff..a51c69c078d9 100644
--- a/drivers/vhost/vdpa.c
+++ b/drivers/vhost/vdpa.c
@@ -968,7 +968,8 @@ static int vhost_vdpa_map(struct vhost_vdpa *v, struct vhost_iotlb *iotlb,
r = ops->set_map(vdpa, asid, iotlb);
} else {
r = iommu_map(v->domain, iova, pa, size,
- perm_to_iommu_flags(perm), GFP_KERNEL);
+ perm_to_iommu_flags(perm),
+ GFP_KERNEL_ACCOUNT);
}
if (r) {
vhost_iotlb_del_range(iotlb, iova, iova + size - 1);
-- 
2.43.0.rc2.451.g8631bc7472-goog




Re: [PATCH v1] virtio_pmem: populate numa information

2022-10-26 Thread Pasha Tatashin
On Mon, Oct 17, 2022 at 1:11 PM Michael Sammler  wrote:
>
> Compute the numa information for a virtio_pmem device from the memory
> range of the device. Previously, the target_node was always 0 since
> the ndr_desc.target_node field was never explicitly set. The code for
> computing the numa node is taken from cxl_pmem_region_probe in
> drivers/cxl/pmem.c.
>
> Signed-off-by: Michael Sammler 

This enables hot-plugging of virtio-pmem memory into the correct memory
nodes. It does not look like it affects FS_DAX.

Reviewed-by: Pasha Tatashin 

Thanks,
Pasha

> ---
>  drivers/nvdimm/virtio_pmem.c | 11 +--
>  1 file changed, 9 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/nvdimm/virtio_pmem.c b/drivers/nvdimm/virtio_pmem.c
> index 20da455d2ef6..a92eb172f0e7 100644
> --- a/drivers/nvdimm/virtio_pmem.c
> +++ b/drivers/nvdimm/virtio_pmem.c
> @@ -32,7 +32,6 @@ static int init_vq(struct virtio_pmem *vpmem)
>  static int virtio_pmem_probe(struct virtio_device *vdev)
>  {
> struct nd_region_desc ndr_desc = {};
> -   int nid = dev_to_node(&vdev->dev);
> struct nd_region *nd_region;
> struct virtio_pmem *vpmem;
> struct resource res;
> @@ -79,7 +78,15 @@ static int virtio_pmem_probe(struct virtio_device *vdev)
> dev_set_drvdata(&vdev->dev, vpmem->nvdimm_bus);
>
> ndr_desc.res = &res;
> -   ndr_desc.numa_node = nid;
> +
> +   ndr_desc.numa_node = memory_add_physaddr_to_nid(res.start);
> +   ndr_desc.target_node = phys_to_target_node(res.start);
> +   if (ndr_desc.target_node == NUMA_NO_NODE) {
> +   ndr_desc.target_node = ndr_desc.numa_node;
> +   dev_dbg(&vdev->dev, "changing target node from %d to %d",
> +   NUMA_NO_NODE, ndr_desc.target_node);
> +   }
> +
> ndr_desc.flush = async_pmem_flush;
> ndr_desc.provider_data = vdev;
> set_bit(ND_REGION_PAGEMAP, &ndr_desc.flags);
> --
> 2.38.0.413.g74048e4d9e-goog
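
For reference, the core of the quoted change can be summarized in a small
helper (an illustrative sketch distilled from the hunk above;
fill_pmem_numa_info() is a hypothetical name, not something the patch adds):

static void fill_pmem_numa_info(struct nd_region_desc *ndr_desc,
                                struct resource *res)
{
        /*
         * Derive the node from the device's physical range instead of
         * dev_to_node(), which was always 0 for virtio-pmem.
         */
        ndr_desc->numa_node = memory_add_physaddr_to_nid(res->start);

        /* target_node is the node the memory would belong to once onlined. */
        ndr_desc->target_node = phys_to_target_node(res->start);
        if (ndr_desc->target_node == NUMA_NO_NODE)
                ndr_desc->target_node = ndr_desc->numa_node;
}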



Re: [mm PATCH v4 3/6] mm: Use memblock/zone specific iterator for handling deferred page init

2018-10-31 Thread Pasha Tatashin


On 10/31/18 12:05 PM, Alexander Duyck wrote:
> On Wed, 2018-10-31 at 15:40 +0000, Pasha Tatashin wrote:
>>
>> On 10/17/18 7:54 PM, Alexander Duyck wrote:
>>> This patch introduces a new iterator for_each_free_mem_pfn_range_in_zone.
>>>
>>> This iterator will take care of making sure a given memory range provided
>>> is in fact contained within a zone. It takes care of all the bounds checking
>>> we were doing in deferred_grow_zone, and deferred_init_memmap. In addition
>>> it should help to speed up the search a bit by iterating until the end of a
>>> range is greater than the start of the zone pfn range, and will exit
>>> completely if the start is beyond the end of the zone.
>>>
>>> This patch adds yet another iterator called
>>> for_each_free_mem_range_in_zone_from and then uses it to support
>>> initializing and freeing pages in groups no larger than MAX_ORDER_NR_PAGES.
>>> By doing this we can greatly improve the cache locality of the pages while
>>> we do several loops over them in the init and freeing process.
>>>
>>> We are able to tighten the loops as a result since we only really need the
>>> checks for first_init_pfn in our first iteration and after that we can
>>> assume that all future values will be greater than this. So I have added a
>>> function called deferred_init_mem_pfn_range_in_zone that primes the
>>> iterators and if it fails we can just exit.
>>>
>>> On my x86_64 test system with 384GB of memory per node I saw a reduction in
>>> initialization time from 1.85s to 1.38s as a result of this patch.
>>>
>>> Signed-off-by: Alexander Duyck 
>>
>> Hi Alex,
>>
>> Could you please split this patch into two parts:
>>
>> 1. Add deferred_init_maxorder()
>> 2. Add memblock iterator?
>>
>> This would allow better bisecting in case of problems. Changing two
>> loops into deferred_init_maxorder(), while a good idea, is still
>> non-trivial and might lead to bugs.
>>
>> Thank you,
>> Pavel
> 
> I can do that, but I will need to flip the order. I will add the new
> iterator first and then deferred_init_maxorder. Otherwise the
> intermediate step ends up being too much throw-away code.

That sounds good.

Thank you,
Pavel

> 
> - Alex
> 

Re: [mm PATCH v4 3/6] mm: Use memblock/zone specific iterator for handling deferred page init

2018-10-31 Thread Pasha Tatashin


On 10/17/18 7:54 PM, Alexander Duyck wrote:
> This patch introduces a new iterator for_each_free_mem_pfn_range_in_zone.
> 
> This iterator will take care of making sure a given memory range provided
> is in fact contained within a zone. It takes care of all the bounds checking
> we were doing in deferred_grow_zone, and deferred_init_memmap. In addition
> it should help to speed up the search a bit by iterating until the end of a
> range is greater than the start of the zone pfn range, and will exit
> completely if the start is beyond the end of the zone.
> 
> This patch adds yet another iterator called
> for_each_free_mem_range_in_zone_from and then uses it to support
> initializing and freeing pages in groups no larger than MAX_ORDER_NR_PAGES.
> By doing this we can greatly improve the cache locality of the pages while
> we do several loops over them in the init and freeing process.
> 
> We are able to tighten the loops as a result since we only really need the
> checks for first_init_pfn in our first iteration and after that we can
> assume that all future values will be greater than this. So I have added a
> function called deferred_init_mem_pfn_range_in_zone that primes the
> iterators and if it fails we can just exit.
> 
> On my x86_64 test system with 384GB of memory per node I saw a reduction in
> initialization time from 1.85s to 1.38s as a result of this patch.
> 
> Signed-off-by: Alexander Duyck 

Hi Alex,

Could you please split this patch into two parts:

1. Add deferred_init_maxorder()
2. Add memblock iterator?

This would allow better bisecting in case of problems. Changing two
loops into deferred_init_maxorder(), while a good idea, is still
non-trivial and might lead to bugs.

Thank you,
Pavel

> ---
>  include/linux/memblock.h |   58 +++
>  mm/memblock.c|   63 
>  mm/page_alloc.c  |  176 --
>  3 files changed, 242 insertions(+), 55 deletions(-)
> 
> diff --git a/include/linux/memblock.h b/include/linux/memblock.h
> index aee299a6aa76..2ddd1bafdd03 100644
> --- a/include/linux/memblock.h
> +++ b/include/linux/memblock.h
> @@ -178,6 +178,25 @@ void __next_reserved_mem_region(u64 *idx, phys_addr_t *out_start,
> p_start, p_end, p_nid))
>  
>  /**
> + * for_each_mem_range_from - iterate through memblock areas from type_a and not
> + * included in type_b. Or just type_a if type_b is NULL.
> + * @i: u64 used as loop variable
> + * @type_a: ptr to memblock_type to iterate
> + * @type_b: ptr to memblock_type which excludes from the iteration
> + * @nid: node selector, %NUMA_NO_NODE for all nodes
> + * @flags: pick from blocks based on memory attributes
> + * @p_start: ptr to phys_addr_t for start address of the range, can be %NULL
> + * @p_end: ptr to phys_addr_t for end address of the range, can be %NULL
> + * @p_nid: ptr to int for nid of the range, can be %NULL
> + */
> +#define for_each_mem_range_from(i, type_a, type_b, nid, flags,   \
> +p_start, p_end, p_nid)   \
> + for (i = 0, __next_mem_range(&i, nid, flags, type_a, type_b,\
> +  p_start, p_end, p_nid);\
> +  i != (u64)ULLONG_MAX;  \
> +  __next_mem_range(&i, nid, flags, type_a, type_b,   \
> +   p_start, p_end, p_nid))
> +/**
>   * for_each_mem_range_rev - reverse iterate through memblock areas from
>   * type_a and not included in type_b. Or just type_a if type_b is NULL.
>   * @i: u64 used as loop variable
> @@ -248,6 +267,45 @@ void __next_mem_pfn_range(int *idx, int nid, unsigned long *out_start_pfn,
>i >= 0; __next_mem_pfn_range(&i, nid, p_start, p_end, p_nid))
>  #endif /* CONFIG_HAVE_MEMBLOCK_NODE_MAP */
>  
> +#ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT
> +void __next_mem_pfn_range_in_zone(u64 *idx, struct zone *zone,
> +   unsigned long *out_spfn,
> +   unsigned long *out_epfn);
> +/**
> + * for_each_free_mem_range_in_zone - iterate through zone specific free
> + * memblock areas
> + * @i: u64 used as loop variable
> + * @zone: zone in which all of the memory blocks reside
> + * @p_start: ptr to phys_addr_t for start address of the range, can be %NULL
> + * @p_end: ptr to phys_addr_t for end address of the range, can be %NULL
> + *
> + * Walks over free (memory && !reserved) areas of memblock in a specific
> + * zone. Available as soon as memblock is initialized.
> + */
> +#define for_each_free_mem_pfn_range_in_zone(i, zone, p_start, p_end) \
> + for (i = 0, \
> +  __next_mem_pfn_range_in_zone(&i, zone, p_start, p_end);\
> +  i != (u64)ULLONG_MAX;  \
> +  __next_mem_pfn_range_in_zone(&i, zone, p_start, p_end))
> +
> +/**
> + * f
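
To make the intent of the new zone iterator concrete, here is a usage
sketch based only on the macro and declaration quoted above
(deferred_count_free_pfns() is a made-up example, not part of the patch):

static unsigned long deferred_count_free_pfns(struct zone *zone)
{
        unsigned long spfn, epfn, nr_free = 0;
        u64 i;

        /* Walk free (memory && !reserved) memblock ranges clipped to @zone. */
        for_each_free_mem_pfn_range_in_zone(i, zone, &spfn, &epfn)
                nr_free += epfn - spfn;

        return nr_free;
}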

Re: [PATCH v4 3/5] mm: Defer ZONE_DEVICE page initialization to the point where we init pgmap

2018-09-21 Thread Pasha Tatashin


On 9/20/18 6:29 PM, Alexander Duyck wrote:
> The ZONE_DEVICE pages were being initialized in two locations. One was with
> the memory_hotplug lock held and another was outside of that lock. The
> problem with this is that it was nearly doubling the memory initialization
> time. Instead of doing this twice, once while holding a global lock and
> once without, I am opting to defer the initialization to the one outside of
> the lock. This allows us to avoid serializing the overhead for memory init
> and we can instead focus on per-node init times.
> 
> One issue I encountered is that devm_memremap_pages and
> hmm_devmmem_pages_create were initializing only the pgmap field the same
> way. One wasn't initializing hmm_data, and the other was initializing it to
> a poison value. Since this is something that is exposed to the driver in
> the case of hmm I am opting for a third option and just initializing
> hmm_data to 0 since this is going to be exposed to unknown third party
> drivers.
> 
> Signed-off-by: Alexander Duyck 

> +void __ref memmap_init_zone_device(struct zone *zone,
> +unsigned long start_pfn,
> +unsigned long size,
> +struct dev_pagemap *pgmap)
> +{
> + unsigned long pfn, end_pfn = start_pfn + size;
> + struct pglist_data *pgdat = zone->zone_pgdat;
> + unsigned long zone_idx = zone_idx(zone);
> + unsigned long start = jiffies;
> + int nid = pgdat->node_id;
> +
> + if (WARN_ON_ONCE(!pgmap || !is_dev_zone(zone)))
> + return;
> +
> + /*
> +  * The call to memmap_init_zone should have already taken care
> +  * of the pages reserved for the memmap, so we can just jump to
> +  * the end of that region and start processing the device pages.
> +  */
> + if (pgmap->altmap_valid) {
> + struct vmem_altmap *altmap = &pgmap->altmap;
> +
> + start_pfn = altmap->base_pfn + vmem_altmap_offset(altmap);
> + size = end_pfn - start_pfn;
> + }
> +
> + for (pfn = start_pfn; pfn < end_pfn; pfn++) {
> + struct page *page = pfn_to_page(pfn);
> +
> + __init_single_page(page, pfn, zone_idx, nid);
> +
> + /*
> +  * Mark page reserved as it will need to wait for onlining
> +  * phase for it to be fully associated with a zone.
> +  *
> +  * We can use the non-atomic __set_bit operation for setting
> +  * the flag as we are still initializing the pages.
> +  */
> + __SetPageReserved(page);
> +
> + /*
> +  * ZONE_DEVICE pages union ->lru with a ->pgmap back
> +  * pointer and hmm_data.  It is a bug if a ZONE_DEVICE
> +  * page is ever freed or placed on a driver-private list.
> +  */
> + page->pgmap = pgmap;
> + page->hmm_data = 0;

__init_single_page()
  mm_zero_struct_page()

Takes care of zeroing, no need to do another store here.


Looks good otherwise.

Reviewed-by: Pavel Tatashin 
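
For context, the zeroing path referenced above looks roughly like this
(an abridged sketch of __init_single_page() from that era, not a verbatim
copy):

static void __meminit __init_single_page(struct page *page, unsigned long pfn,
                                         unsigned long zone, int nid)
{
        /* Zeroes the whole struct page, hmm_data included. */
        mm_zero_struct_page(page);
        set_page_links(page, zone, nid, pfn);
        init_page_count(page);
        page_mapcount_reset(page);
        /* ... remaining per-page setup elided ... */
}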


Re: [PATCH v2 3/4] mm/memory_hotplug: Simplify node_states_check_changes_online

2018-09-21 Thread Pasha Tatashin


On 9/21/18 9:26 AM, Oscar Salvador wrote:
> From: Oscar Salvador 
> 
> While looking at node_states_check_changes_online, I stumbled
> upon some confusing things.
> 
> Right after entering the function, we find this:
> 
> if (N_MEMORY == N_NORMAL_MEMORY)
> zone_last = ZONE_MOVABLE;
> 
> This is wrong.
> N_MEMORY cannot really be equal to N_NORMAL_MEMORY.
> My guess is that this wanted to be something like:
> 
> if (N_NORMAL_MEMORY == N_HIGH_MEMORY)
> 
> to check if we have CONFIG_HIGHMEM.
> 
> Later on, in the CONFIG_HIGHMEM block, we have:
> 
> if (N_MEMORY == N_HIGH_MEMORY)
> zone_last = ZONE_MOVABLE;
> 
> Again, this is wrong, and will never be evaluated to true.
> 
> Besides removing these wrong if statements, I simplified
> the function a bit.
> 
> Signed-off-by: Oscar Salvador 
> Suggested-by: Pavel Tatashin 

Reviewed-by: Pavel Tatashin 

Re: [PATCH 5/5] mm/memory_hotplug: Clean up node_states_check_changes_offline

2018-09-20 Thread Pasha Tatashin
On 9/19/18 6:08 AM, Oscar Salvador wrote:
> From: Oscar Salvador 
> 
> This patch, as the previous one, gets rid of the wrong if statements.
> While at it, I realized that the comments are sometimes very confusing,
> to say the least, and wrong.
> For example:
> 
> ---
> zone_last = ZONE_MOVABLE;
> 
> /*
>  * check whether node_states[N_HIGH_MEMORY] will be changed
>  * If we try to offline the last present @nr_pages from the node,
>  * we can determind we will need to clear the node from
>  * node_states[N_HIGH_MEMORY].
>  */
> 
> for (; zt <= zone_last; zt++)
>   present_pages += pgdat->node_zones[zt].present_pages;
> if (nr_pages >= present_pages)
>   arg->status_change_nid = zone_to_nid(zone);
> else
>   arg->status_change_nid = -1;
> ---
> 
> In case the node gets empty, it must be removed from N_MEMORY.
> We already check N_HIGH_MEMORY a bit above within the CONFIG_HIGHMEM
> ifdef code.
> Not to say that status_change_nid is for N_MEMORY, and not for
> N_HIGH_MEMORY.
> 
> So I re-wrote some of the comments to what I think is better.
> 
> Signed-off-by: Oscar Salvador 
> ---
>  mm/memory_hotplug.c | 71 +
>  1 file changed, 28 insertions(+), 43 deletions(-)
> 
> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index ab3c1de18c5d..15ecf3d7a554 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -1485,51 +1485,36 @@ static void node_states_check_changes_offline(unsigned long nr_pages,
>  {
>   struct pglist_data *pgdat = zone->zone_pgdat;
>   unsigned long present_pages = 0;
> - enum zone_type zt, zone_last = ZONE_NORMAL;
> + enum zone_type zt;
>  
>   /*
> -  * If we have HIGHMEM or movable node, node_states[N_NORMAL_MEMORY]
> -  * contains nodes which have zones of 0...ZONE_NORMAL,
> -  * set zone_last to ZONE_NORMAL.
> -  *
> -  * If we don't have HIGHMEM nor movable node,
> -  * node_states[N_NORMAL_MEMORY] contains nodes which have zones of
> -  * 0...ZONE_MOVABLE, set zone_last to ZONE_MOVABLE.
> +  * Check whether node_states[N_NORMAL_MEMORY] will be changed.
> +  * If the memory to be offline is within the range
> +  * [0..ZONE_NORMAL], and it is the last present memory there,
> +  * the zones in that range will become empty after the offlining,
> +  * thus we can determine that we need to clear the node from
> +  * node_states[N_NORMAL_MEMORY].
>*/
> - if (N_MEMORY == N_NORMAL_MEMORY)
> - zone_last = ZONE_MOVABLE;
> -
> - /*
> -  * check whether node_states[N_NORMAL_MEMORY] will be changed.
> -  * If the memory to be offline is in a zone of 0...zone_last,
> -  * and it is the last present memory, 0...zone_last will
> -  * become empty after offline , thus we can determind we will
> -  * need to clear the node from node_states[N_NORMAL_MEMORY].
> -  */
> - for (zt = 0; zt <= zone_last; zt++)
> + for (zt = 0; zt <= ZONE_NORMAL; zt++)
>   present_pages += pgdat->node_zones[zt].present_pages;
> - if (zone_idx(zone) <= zone_last && nr_pages >= present_pages)
> + if (zone_idx(zone) <= ZONE_NORMAL && nr_pages >= present_pages)
>   arg->status_change_nid_normal = zone_to_nid(zone);
>   else
>   arg->status_change_nid_normal = -1;
>  
>  #ifdef CONFIG_HIGHMEM
>   /*
> -  * If we have movable node, node_states[N_HIGH_MEMORY]
> -  * contains nodes which have zones of 0...ZONE_HIGHMEM,
> -  * set zone_last to ZONE_HIGHMEM.
> -  *
> -  * If we don't have movable node, node_states[N_NORMAL_MEMORY]
> -  * contains nodes which have zones of 0...ZONE_MOVABLE,
> -  * set zone_last to ZONE_MOVABLE.
> +  * node_states[N_HIGH_MEMORY] contains nodes which
> +  * have normal memory or high memory.
> +  * Here we add the present_pages belonging to ZONE_HIGHMEM.
> +  * If the zone is within the range of [0..ZONE_HIGHMEM), and
> +  * we determine that the zones in that range become empty,
> +  * we need to clear the node for N_HIGH_MEMORY.
>*/
> - zone_last = ZONE_HIGHMEM;
> - if (N_MEMORY == N_HIGH_MEMORY)
> - zone_last = ZONE_MOVABLE;
> + zt = ZONE_HIGHMEM;
> + present_pages += pgdat->node_zones[zt].present_pages;
>  
> - for (; zt <= zone_last; zt++)
> - present_pages += pgdat->node_zones[zt].present_pages;
> - if (zone_idx(zone) <= zone_last && nr_pages >= present_pages)
> + if (zone_idx(zone) <= zt && nr_pages >= present_pages)
>   arg->status_change_nid_high = zone_to_nid(zone);
>   else
>   arg->status_change_nid_high = -1;
> @@ -1542,18 +1527,18 @@ static void node_states_check_changes_offline(unsigned long nr_pages,
>  #endif
>  
>   /*
> -  * node_states[N_HIGH_MEMORY] contains nodes which have 0...ZONE_MOVABLE
> +  * We have accounted the pages from [0..ZONE_NORMAL), and
> +  * in 

Re: [PATCH 4/5] mm/memory_hotplug: Simplify node_states_check_changes_online

2018-09-20 Thread Pasha Tatashin


On 9/19/18 6:08 AM, Oscar Salvador wrote:
> From: Oscar Salvador 
> 
> While looking at node_states_check_changes_online, I stumbled
> upon some confusing things.
> 
> Right after entering the function, we find this:
> 
> if (N_MEMORY == N_NORMAL_MEMORY)
> zone_last = ZONE_MOVABLE;
> 
> This is wrong.
> N_MEMORY cannot really be equal to N_NORMAL_MEMORY.
> My guess is that this wanted to be something like:
> 
> if (N_NORMAL_MEMORY == N_HIGH_MEMORY)
> 
> to check if we have CONFIG_HIGHMEM.
> 
> Later on, in the CONFIG_HIGHMEM block, we have:
> 
> if (N_MEMORY == N_HIGH_MEMORY)
> zone_last = ZONE_MOVABLE;
> 
> Again, this is wrong, and will never be evaluated to true.
> 
> Besides removing these wrong if statements, I simplified
> the function a bit.
> 
> Signed-off-by: Oscar Salvador 
> ---
>  mm/memory_hotplug.c | 71 +
>  1 file changed, 23 insertions(+), 48 deletions(-)
> 
> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index 131c08106d54..ab3c1de18c5d 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -687,61 +687,36 @@ static void node_states_check_changes_online(unsigned long nr_pages,
>   struct zone *zone, struct memory_notify *arg)
>  {
>   int nid = zone_to_nid(zone);
> - enum zone_type zone_last = ZONE_NORMAL;
>  
>   /*
> -  * If we have HIGHMEM or movable node, node_states[N_NORMAL_MEMORY]
> -  * contains nodes which have zones of 0...ZONE_NORMAL,
> -  * set zone_last to ZONE_NORMAL.
> -  *
> -  * If we don't have HIGHMEM nor movable node,
> -  * node_states[N_NORMAL_MEMORY] contains nodes which have zones of
> -  * 0...ZONE_MOVABLE, set zone_last to ZONE_MOVABLE.
> +  * zone_for_pfn_range() can only return a zone within
> +  * (0..ZONE_NORMAL] or ZONE_MOVABLE.

But what if that changes, will this function need to change as well?

> +  * If the zone is within the range (0..ZONE_NORMAL],
> +  * we need to check if:
> +  * 1) We need to set the node for N_NORMAL_MEMORY
> +  * 2) On CONFIG_HIGHMEM systems, we need to also set
> +  *the node for N_HIGH_MEMORY.
> +  * 3) On !CONFIG_HIGHMEM, we can disregard N_HIGH_MEMORY,
> +  *as N_HIGH_MEMORY falls back to N_NORMAL_MEMORY.
>*/
> - if (N_MEMORY == N_NORMAL_MEMORY)
> - zone_last = ZONE_MOVABLE;
>  
> - /*
> -  * if the memory to be online is in a zone of 0...zone_last, and
> -  * the zones of 0...zone_last don't have memory before online, we will
> -  * need to set the node to node_states[N_NORMAL_MEMORY] after
> -  * the memory is online.
> -  */
> - if (zone_idx(zone) <= zone_last && !node_state(nid, N_NORMAL_MEMORY))
> - arg->status_change_nid_normal = nid;
> - else
> - arg->status_change_nid_normal = -1;
> -
> -#ifdef CONFIG_HIGHMEM
> - /*
> -  * If we have movable node, node_states[N_HIGH_MEMORY]
> -  * contains nodes which have zones of 0...ZONE_HIGHMEM,
> -  * set zone_last to ZONE_HIGHMEM.
> -  *
> -  * If we don't have movable node, node_states[N_NORMAL_MEMORY]
> -  * contains nodes which have zones of 0...ZONE_MOVABLE,
> -  * set zone_last to ZONE_MOVABLE.
> -  */
> - zone_last = ZONE_HIGHMEM;
> - if (N_MEMORY == N_HIGH_MEMORY)
> - zone_last = ZONE_MOVABLE;
> + if (zone_idx(zone) <= ZONE_NORMAL) {
> + if (!node_state(nid, N_NORMAL_MEMORY))
> + arg->status_change_nid_normal = nid;
> + else
> + arg->status_change_nid_normal = -1;
>  
> - if (zone_idx(zone) <= zone_last && !node_state(nid, N_HIGH_MEMORY))
> - arg->status_change_nid_high = nid;
> - else
> - arg->status_change_nid_high = -1;
> -#else
> - /*
> -  * When !CONFIG_HIGHMEM, N_HIGH_MEMORY equals N_NORMAL_MEMORY
> -  * so setting the node for N_NORMAL_MEMORY is enough.
> -  */
> - arg->status_change_nid_high = -1;
> -#endif
> + if (IS_ENABLED(CONFIG_HIGHMEM)) {
> + if (!node_state(nid, N_HIGH_MEMORY))
> + arg->status_change_nid_high = nid;

Shouldn't we have:
else
arg->status_change_nid_high = -1; ?

> + } else
> + arg->status_change_nid_high = -1;


I prefer to have braces in the else part as well when the if has braces.

> + }
>  
>   /*
> -  * if the node don't have memory befor online, we will need to
> -  * set the node to node_states[N_MEMORY] after the memory
> -  * is online.
> +  * if the node doesn't have memory before onlining it, we will need
> +  * to set the node to node_states[N_MEMORY] after the memory
> +  * gets onlined.
>*/
>   if (!node_state(nid, N_MEMORY))
>   arg->status_change_nid = nid;
> 

I think it is simpler to have somethin
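
For illustration, one possible shape of the simplified check with both
review comments applied (a hypothetical sketch, not the code that was
eventually merged):

static void node_online_state_hints(struct zone *zone, int nid,
                                    struct memory_notify *arg)
{
        if (zone_idx(zone) <= ZONE_NORMAL) {
                if (!node_state(nid, N_NORMAL_MEMORY))
                        arg->status_change_nid_normal = nid;
                else
                        arg->status_change_nid_normal = -1;

                if (IS_ENABLED(CONFIG_HIGHMEM) &&
                    !node_state(nid, N_HIGH_MEMORY)) {
                        arg->status_change_nid_high = nid;
                } else {
                        arg->status_change_nid_high = -1;
                }
        }
}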

Re: [PATCH 3/5] mm/memory_hotplug: Tidy up node_states_clear_node

2018-09-20 Thread Pasha Tatashin


On 9/19/18 6:08 AM, Oscar Salvador wrote:
> From: Oscar Salvador 
> 
> node_states_clear has the following if statements:
> 
> if ((N_MEMORY != N_NORMAL_MEMORY) &&
> (arg->status_change_nid_high >= 0))
> ...
> 
> if ((N_MEMORY != N_HIGH_MEMORY) &&
> (arg->status_change_nid >= 0))
> ...
> 
> N_MEMORY can never be equal to neither N_NORMAL_MEMORY nor
> N_HIGH_MEMORY.
> 
> Similar problem was found in [1].
> Since this is wrong, let us get rid of it.
> 
> [1] https://patchwork.kernel.org/patch/10579155/
> 
> Signed-off-by: Oscar Salvador 

Reviewed-by: Pavel Tatashin 

> ---
>  mm/memory_hotplug.c | 6 ++
>  1 file changed, 2 insertions(+), 4 deletions(-)
> 
> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index c2c7359bd0a7..131c08106d54 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -1590,12 +1590,10 @@ static void node_states_clear_node(int node, struct memory_notify *arg)
>   if (arg->status_change_nid_normal >= 0)
>   node_clear_state(node, N_NORMAL_MEMORY);
>  
> - if ((N_MEMORY != N_NORMAL_MEMORY) &&
> - (arg->status_change_nid_high >= 0))
> + if (arg->status_change_nid_high >= 0)
>   node_clear_state(node, N_HIGH_MEMORY);
>  
> - if ((N_MEMORY != N_HIGH_MEMORY) &&
> - (arg->status_change_nid >= 0))
> + if (arg->status_change_nid >= 0)
>   node_clear_state(node, N_MEMORY);
>  }
>  
> 

Re: [PATCH 2/5] mm/memory_hotplug: Avoid node_set/clear_state(N_HIGH_MEMORY) when !CONFIG_HIGHMEM

2018-09-20 Thread Pasha Tatashin


On 9/19/18 6:08 AM, Oscar Salvador wrote:
> From: Oscar Salvador 
> 
> Currently, when !CONFIG_HIGHMEM, status_change_nid_high is being set
> to status_change_nid_normal, but on such systems N_HIGH_MEMORY falls
> back to N_NORMAL_MEMORY.
> That means that if status_change_nid_normal is not -1,
> we will perform two calls to node_set_state for the same memory type.
> 
> Set status_change_nid_high to -1 for !CONFIG_HIGHMEM, so we skip the
> double call in node_states_set_node.
> 
> The same goes for node_clear_state.
> 
> Signed-off-by: Oscar Salvador 

Reviewed-by: Pavel Tatashin 

This is a rare case where I think comments are unnecessary, as the code
is self-explanatory. So I would remove the comments before:

> + arg->status_change_nid_high = -1;

Pavel

> ---
>  mm/memory_hotplug.c | 12 ++--
>  1 file changed, 10 insertions(+), 2 deletions(-)
> 
> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index 63facfc57224..c2c7359bd0a7 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -731,7 +731,11 @@ static void node_states_check_changes_online(unsigned long nr_pages,
>   else
>   arg->status_change_nid_high = -1;
>  #else
> - arg->status_change_nid_high = arg->status_change_nid_normal;
> + /*
> +  * When !CONFIG_HIGHMEM, N_HIGH_MEMORY equals N_NORMAL_MEMORY
> +  * so setting the node for N_NORMAL_MEMORY is enough.
> +  */
> + arg->status_change_nid_high = -1;
>  #endif
>  
>   /*
> @@ -1555,7 +1559,11 @@ static void node_states_check_changes_offline(unsigned long nr_pages,
>   else
>   arg->status_change_nid_high = -1;
>  #else
> - arg->status_change_nid_high = arg->status_change_nid_normal;
> + /*
> +  * When !CONFIG_HIGHMEM, N_HIGH_MEMORY equals N_NORMAL_MEMORY
> +  * so clearing the node for N_NORMAL_MEMORY is enough.
> +  */
> + arg->status_change_nid_high = -1;
>  #endif
>  
>   /*
> 

Re: [PATCH 1/5] mm/memory_hotplug: Spare unnecessary calls to node_set_state

2018-09-20 Thread Pasha Tatashin


On 9/19/18 6:08 AM, Oscar Salvador wrote:
> From: Oscar Salvador 
> 
> In node_states_check_changes_online, we check if the node will
> have to be set for any of the N_*_MEMORY states after the pages
> have been onlined.
> 
> Later on, we perform the activation in node_states_set_node.
> Currently, in node_states_set_node we set the node to N_MEMORY
> unconditionally.
> This means that we call node_set_state for N_MEMORY every time
> pages go online, but we only need to do it if the node has not yet been
> set for N_MEMORY.
> 
> Fix this by checking status_change_nid.
> 
> Signed-off-by: Oscar Salvador 

Reviewed-by: Pavel Tatashin 

> ---
>  mm/memory_hotplug.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index 38d94b703e9d..63facfc57224 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -753,7 +753,8 @@ static void node_states_set_node(int node, struct memory_notify *arg)
>   if (arg->status_change_nid_high >= 0)
>   node_set_state(node, N_HIGH_MEMORY);
>  
> - node_set_state(node, N_MEMORY);
> + if (arg->status_change_nid >= 0)
> + node_set_state(node, N_MEMORY);
>  }
>  
>  static void __meminit resize_zone_range(struct zone *zone, unsigned long start_pfn,
> 

Re: [PATCH] memory_hotplug: fix the panic when memory end is not on the section boundary

2018-09-12 Thread Pasha Tatashin


On 9/12/18 10:27 AM, Gerald Schaefer wrote:
> On Wed, 12 Sep 2018 15:39:33 +0200
> Michal Hocko  wrote:
> 
>> On Wed 12-09-18 15:03:56, Gerald Schaefer wrote:
>> [...]
>>> BTW, those sysfs attributes are world-readable, so anyone can trigger
>>> the panic by simply reading them, or just run lsmem (also available for
>>> x86 since util-linux 2.32). OK, you need a special not-memory-block-aligned
>>> mem= parameter and DEBUG_VM for poison check, but w/o DEBUG_VM you would
>>> still access uninitialized struct pages. This sounds very wrong, and I
>>> think it really should be fixed.  
>>
>> Ohh, absolutely. Nobody is questioning that. The thing is that the
>> code has been likely always broken. We just haven't noticed because
>> those unitialized parts where zeroed previously. Now that the implicit
>> zeroying is gone it is just visible.
>>
>> All that I am arguing is that there are many places which assume
>> pageblocks to be fully initialized and plugging one place that blows up
>> at the time is just whack a mole. We need to address this much earlier.
>> E.g. by allowing only full pageblocks when adding a memory range.
> 
> Just to make sure we are talking about the same thing: when you say
> "pageblocks", do you mean the MAX_ORDER_NR_PAGES / pageblock_nr_pages
> unit of pages, or do you mean the memory (hotplug) block unit?

From the early discussion, it was about pageblock_nr_pages, not about
memory_block_size_bytes.

> 
> I do not see any issue here with MAX_ORDER_NR_PAGES / pageblock_nr_pages
> pageblocks, and if there was such an issue, of course you are right that
> this would affect many places. If there was such an issue, I would also
> assume that we would see the new page poison warning in many other places.
> 
> The bug that Mikhails patch would fix only affects code that operates
> on / iterates through memory (hotplug) blocks, and that does not happen
> in many places, only in the two functions that his patch fixes.

Just to be clear: the memory is pageblock_nr_pages aligned, yet
memory blocks are larger and the panic is still triggered?

I ask because 3075M is not 128M aligned.

> 
> When you say "address this much earlier", do you mean changing the way
> that free_area_init_core()/memmap_init() initialize struct pages, i.e.
> have them not use zone->spanned_pages as limit, but rather align that
> up to the memory block (not pageblock) boundary?
> 

This was my initial proposal: fix memmap_init() to initialize struct
pages beyond the "end", and before the "start", so that the whole
section is covered. But I think Michal suggested (and he might correct
me) to simply reject unaligned memory much earlier: anything that does
not align to the sparse section size is not added to the system at all.

I think Michal's proposal would simplify and strengthen memory mapping
overall.

Pavel
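
A rough sketch of the memmap_init() direction described above
(illustrative only; init_reserved_struct_pages() is a stand-in name for
"initialize and mark reserved the struct pages outside the usable range"):

static void __init cover_whole_sections(unsigned long start_pfn,
                                        unsigned long end_pfn)
{
        unsigned long spfn = SECTION_ALIGN_DOWN(start_pfn);
        unsigned long epfn = SECTION_ALIGN_UP(end_pfn);

        /* struct pages in the partial section before "start". */
        if (spfn < start_pfn)
                init_reserved_struct_pages(spfn, start_pfn);

        /* struct pages in the partial section beyond "end". */
        if (end_pfn < epfn)
                init_reserved_struct_pages(end_pfn, epfn);
}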

Re: [PATCH] memory_hotplug: fix the panic when memory end is not on the section boundary

2018-09-10 Thread Pasha Tatashin


On 9/10/18 10:41 AM, Michal Hocko wrote:
> On Mon 10-09-18 14:32:16, Pavel Tatashin wrote:
>> On Mon, Sep 10, 2018 at 10:19 AM Michal Hocko  wrote:
>>>
>>> On Mon 10-09-18 14:11:45, Pavel Tatashin wrote:
 Hi Michal,

 It is tricky, but probably can be done. Either change
 memmap_init_zone() or its caller to also cover the ends and starts of
 unaligned sections to initialize and reserve pages.

 The same thing would also need to be done in deferred_init_memmap() to
 cover the deferred init case.
>>>
>>> Well, I am not sure TBH. I have to think about that much more. Maybe it
>>> would be much more simple to make sure that we will never add incomplete
>>> memblocks and simply refuse them during the discovery. At least for now.
>>
>> On x86 memblocks can be up to 2G on machines with over 64G of RAM.
> 
> sorry I meant pageblock_nr_pages rather than memblocks.

OK. This sounds reasonable, but, to be honest, I am not sure how to
achieve this yet; I need to think more about it. In theory, with the
sparse memory model it makes sense to enforce memory alignment to
section sizes, which sounds a lot safer.

> 
>> Also, memory size is way too easy to change via qemu arguments when a VM
>> starts. If we simply disable unaligned trailing memblocks, I am sure
>> we would get tons of noise about missing memory.
>>
>> I think adding check_hotplug_memory_range() would work to fix the
>> immediate problem. But we do need to figure out a better solution.
>>
>> memblock design is based on archaic assumption that hotplug units are
>> physical dimms. VMs and hypervisors changed all of that, and we can
>> have much finer hotplug requests on machines with huge DIMMs. Yet, we
>> do not want to pollute sysfs with millions of tiny memory devices. I
>> am not sure what a long term proper solution for this problem should
>> be, but I see that linux hotplug/hotremove subsystems must be
>> redesigned based on the new requirements.
> 
> Not an easy task though. Anyway, sparse memory modely is highly based on
> memory sections so it makes some sense to have hotplug section based as
> well. Memblocks as a higher logical unit on top of that is kinda hack.
> The userspace API has never been properly thought through I am afraid.

I agree memoryblock is a hack; it fails to do both things it was
designed to do:

1. On bare metal you cannot free a physical DIMM of memory using
memoryblock granularity, because memory devices do not correspond to
physical DIMMs. Thus, if for some reason a particular DIMM must be
removed/replaced, memoryblock does not help us.

2. On machines with hypervisors it fails to provide an adequate
granularity to add/remove memory.

We should define a new user interface where memory can be added/removed
at a finer granularity (sparse section size), but without a memory
device for each section. We should also provide optional access to a
legacy interface where memory devices are exported, but each is of
section size.

So, when the legacy interface is enabled, the current way would work:

echo offline > /sys/devices/system/memory/memoryXXX/state

And the new interface would allow us to do something like this:

echo offline 256M > /sys/devices/system/node/nodeXXX/memory

With an optional start address for the offlined memory:
echo offline [start_pa] size > /sys/devices/system/node/nodeXXX/memory
where start_pa and size must be section size aligned (128M).

It would probably be a good discussion for the next MM Summit how to
solve the current memory hotplug interface limitations.

Pavel

Re: [PATCH] memory_hotplug: fix the panic when memory end is not on the section boundary

2018-09-10 Thread Pasha Tatashin
On Mon, Sep 10, 2018 at 10:19 AM Michal Hocko  wrote:
>
> On Mon 10-09-18 14:11:45, Pavel Tatashin wrote:
> > Hi Michal,
> >
> > It is tricky, but probably can be done. Either change
> > memmap_init_zone() or its caller to also cover the ends and starts of
> > unaligned sections to initialize and reserve pages.
> >
> > The same thing would also need to be done in deferred_init_memmap() to
> > cover the deferred init case.
>
> Well, I am not sure TBH. I have to think about that much more. Maybe it
> would be much more simple to make sure that we will never add incomplete
> memblocks and simply refuse them during the discovery. At least for now.

On x86 memblocks can be up to 2G on machines with over 64G of RAM.
Also, memory size is way too easy to change via qemu arguments when a VM
starts. If we simply disable unaligned trailing memblocks, I am sure
we would get tons of noise about missing memory.

I think adding check_hotplug_memory_range() would work to fix the
immediate problem. But we do need to figure out a better solution.

The memblock design is based on the archaic assumption that hotplug
units are physical DIMMs. VMs and hypervisors changed all of that, and
we can have much finer hotplug requests on machines with huge DIMMs.
Yet, we do not want to pollute sysfs with millions of tiny memory
devices. I am not sure what a proper long-term solution for this problem
should be, but I see that the Linux hotplug/hotremove subsystems must be
redesigned based on the new requirements.

Pavel

Re: [PATCH] memory_hotplug: fix the panic when memory end is not on the section boundary

2018-09-10 Thread Pasha Tatashin
Hi Michal,

It is tricky, but probably can be done. Either change
memmap_init_zone() or its caller to also cover the ends and starts of
unaligned sections to initialize and reserve pages.

The same thing would also need to be done in deferred_init_memmap() to
cover the deferred init case.

For hotplugged memory we do not need to worry, as that one is always
section aligned.

Do you think this would be a better approach?

Thank you,
Pavel
On Mon, Sep 10, 2018 at 10:00 AM Michal Hocko  wrote:
>
> On Mon 10-09-18 13:46:45, Pavel Tatashin wrote:
> >
> >
> > On 9/10/18 9:17 AM, Michal Hocko wrote:
> > > [Cc Pavel]
> > >
> > > On Mon 10-09-18 14:35:27, Mikhail Zaslonko wrote:
> > >> If memory end is not aligned with the linux memory section boundary, such
> > >> a section is only partly initialized. This may lead to VM_BUG_ON due to
> > >> uninitialized struct pages access from is_mem_section_removable() or
> > >> test_pages_in_a_zone() function.
> > >>
> > >> Here is one of the panic examples:
> > >>  CONFIG_DEBUG_VM_PGFLAGS=y
> > >>  kernel parameter mem=3075M
> > >
> > > OK, so the last memory section is not full and we have a partial memory
> > > block right?
> > >
> > >>  page dumped because: VM_BUG_ON_PAGE(PagePoisoned(p))
> > >
> > > OK, this means that the struct page is not fully initialized. Do you
> > > have a specific place which has triggered this assert?
> > >
> > >>  [ cut here ]
> > >>  Call Trace:
> > >>  ([<0039b8a4>] is_mem_section_removable+0xcc/0x1c0)
> > >>   [<009558ba>] show_mem_removable+0xda/0xe0
> > >>   [<009325fc>] dev_attr_show+0x3c/0x80
> > >>   [<0047e7ea>] sysfs_kf_seq_show+0xda/0x160
> > >>   [<003fc4e0>] seq_read+0x208/0x4c8
> > >>   [<003cb80e>] __vfs_read+0x46/0x180
> > >>   [<003cb9ce>] vfs_read+0x86/0x148
> > >>   [<003cc06a>] ksys_read+0x62/0xc0
> > >>   [<00c001c0>] system_call+0xdc/0x2d8
> > >>
> > >> This fix checks if the page lies within the zone boundaries before
> > >> accessing the struct page data. The check is added to both functions.
> > >> Actually similar check has already been present in
> > >> is_pageblock_removable_nolock() function but only after the struct page
> > >> is accessed.
> > >>
> > >
> > > Well, I am afraid this is not the proper solution. We are relying on the
> > > full pageblock worth of initialized struct pages at many other place. We
> > > used to do that in the past because we have initialized the full
> > > section but this has been changed recently. Pavel, do you have any ideas
> > > how to deal with this partial mem sections now?
> >
> > We have:
> >
> > remove_memory()
> >   BUG_ON(check_hotplug_memory_range(start, size))
> >
> > That supposed to safely check for this condition: if [start, start +
> > size) not block size aligned (and we know block size is section
> > aligned), hot remove is not allowed. The problem is this check is late,
> > and only happens when invalid range has already passed through previous
> > checks.
> >
> > We could add check_hotplug_memory_range() to is_mem_section_removable():
> >
> > is_mem_section_removable(start_pfn, nr_pages)
> >  if (check_hotplug_memory_range(PFN_PHYS(start_pfn), PFN_PHYS(nr_pages)))
> >   return false;
> >
> > I think it should work.
>
> I do not think we want to sprinkle these tests over all pfn walkers. Can
> we simply initialize those uninitialized holes as well and make them
> reserved without handing them over to the page allocator? That would be
> much more robust approach IMHO.
> --
> Michal Hocko
> SUSE Labs
>

Re: [PATCH] memory_hotplug: fix the panic when memory end is not on the section boundary

2018-09-10 Thread Pasha Tatashin


On 9/10/18 9:17 AM, Michal Hocko wrote:
> [Cc Pavel]
> 
> On Mon 10-09-18 14:35:27, Mikhail Zaslonko wrote:
>> If memory end is not aligned with the linux memory section boundary, such
>> a section is only partly initialized. This may lead to VM_BUG_ON due to
>> uninitialized struct pages access from is_mem_section_removable() or
>> test_pages_in_a_zone() function.
>>
>> Here is one of the panic examples:
>>  CONFIG_DEBUG_VM_PGFLAGS=y
>>  kernel parameter mem=3075M
> 
> OK, so the last memory section is not full and we have a partial memory
> block right?
> 
>>  page dumped because: VM_BUG_ON_PAGE(PagePoisoned(p))
> 
> OK, this means that the struct page is not fully initialized. Do you
> have a specific place which has triggered this assert?
> 
>>  [ cut here ]
>>  Call Trace:
>>  ([<0039b8a4>] is_mem_section_removable+0xcc/0x1c0)
>>   [<009558ba>] show_mem_removable+0xda/0xe0
>>   [<009325fc>] dev_attr_show+0x3c/0x80
>>   [<0047e7ea>] sysfs_kf_seq_show+0xda/0x160
>>   [<003fc4e0>] seq_read+0x208/0x4c8
>>   [<003cb80e>] __vfs_read+0x46/0x180
>>   [<003cb9ce>] vfs_read+0x86/0x148
>>   [<003cc06a>] ksys_read+0x62/0xc0
>>   [<00c001c0>] system_call+0xdc/0x2d8
>>
>> This fix checks if the page lies within the zone boundaries before
>> accessing the struct page data. The check is added to both functions.
>> Actually similar check has already been present in
>> is_pageblock_removable_nolock() function but only after the struct page
>> is accessed.
>>
> 
> Well, I am afraid this is not the proper solution. We are relying on the
> full pageblock worth of initialized struct pages at many other place. We
> used to do that in the past because we have initialized the full
> section but this has been changed recently. Pavel, do you have any ideas
> how to deal with this partial mem sections now?

We have:

remove_memory()
BUG_ON(check_hotplug_memory_range(start, size))

That is supposed to safely check for this condition: if [start, start +
size) is not block size aligned (and we know the block size is section
aligned), hot remove is not allowed. The problem is that this check is
late, and only happens when an invalid range has already passed through
previous checks.

We could add check_hotplug_memory_range() to is_mem_section_removable():

is_mem_section_removable(start_pfn, nr_pages)
 if (check_hotplug_memory_range(PFN_PHYS(start_pfn), PFN_PHYS(nr_pages)))
  return false;

I think it should work.
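
Spelled out a little more, the sketch above would look roughly like this
(illustrative only, untested):

bool is_mem_section_removable(unsigned long start_pfn, unsigned long nr_pages)
{
        /*
         * Bail out before touching struct pages in a range that hot-remove
         * would refuse anyway (not block/section size aligned).
         */
        if (check_hotplug_memory_range(PFN_PHYS(start_pfn), PFN_PHYS(nr_pages)))
                return false;

        /* ... the existing pageblock walk continues unchanged ... */
        return true;
}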

Pavel

> 
>> Signed-off-by: Mikhail Zaslonko 
>> Reviewed-by: Gerald Schaefer 
>> Cc: 
>> ---
>>  mm/memory_hotplug.c | 20 +++-
>>  1 file changed, 11 insertions(+), 9 deletions(-)
>>
>> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
>> index 9eea6e809a4e..8e20e8fcc3b0 100644
>> --- a/mm/memory_hotplug.c
>> +++ b/mm/memory_hotplug.c
>> @@ -1229,9 +1229,8 @@ static struct page *next_active_pageblock(struct page *page)
>>  return page + pageblock_nr_pages;
>>  }
>>  
>> -static bool is_pageblock_removable_nolock(struct page *page)
>> +static bool is_pageblock_removable_nolock(struct page *page, struct zone **zone)
>>  {
>> -struct zone *zone;
>>  unsigned long pfn;
>>  
>>  /*
>> @@ -1241,15 +1240,14 @@ static bool is_pageblock_removable_nolock(struct page *page)
>>   * We have to take care about the node as well. If the node is offline
>>   * its NODE_DATA will be NULL - see page_zone.
>>   */
>> -if (!node_online(page_to_nid(page)))
>> -return false;
>> -
>> -zone = page_zone(page);
>>  pfn = page_to_pfn(page);
>> -if (!zone_spans_pfn(zone, pfn))
>> +if (*zone && !zone_spans_pfn(*zone, pfn))
>>  return false;
>> +if (!node_online(page_to_nid(page)))
>> +return false;
>> +*zone = page_zone(page);
>>  
>> -return !has_unmovable_pages(zone, page, 0, MIGRATE_MOVABLE, true);
>> +return !has_unmovable_pages(*zone, page, 0, MIGRATE_MOVABLE, true);
>>  }
>>  
>>  /* Checks if this range of memory is likely to be hot-removable. */
>> @@ -1257,10 +1255,11 @@ bool is_mem_section_removable(unsigned long start_pfn, unsigned long nr_pages)
>>  {
>>  struct page *page = pfn_to_page(start_pfn);
>>  struct page *end_page = page + nr_pages;
>> +struct zone *zone = NULL;
>>  
>>  /* Check the starting page of each pageblock within the range */
>>  for (; page < end_page; page = next_active_pageblock(page)) {
>> -if (!is_pageblock_removable_nolock(page))
>> +if (!is_pageblock_removable_nolock(page, &zone))
>>  return false;
>>  cond_resched();
>>  }
>> @@ -1296,6 +1295,9 @@ int test_pages_in_a_zone(unsigned long start_pfn, unsigned long end_pfn,
>>  i++;
>>  if (i == MAX_ORDER_NR_PAGES || pfn + i >= end_pfn)
>>  continue;
>> +/* Check if we g

Re: [PATCH] Revert "x86/tsc: Consolidate init code"

2018-09-10 Thread Pasha Tatashin
Hi Ville,


The failure is surprising, because the commit is tiny, and almost does
not change the code logic.

From looking through the commit, the only functional difference this
commit makes is:

static_branch_enable(&__use_tsc) was called unconditionally from
tsc_init(), but after the commit only when tsc_khz == 0.

I wonder if on the P3 static_branch_enable(&__use_tsc) fails to get
enabled early, when it is supposed to? But I would first try to make
that unconditional call again, see if this fixes the problem, and then
figure out why it was not enabled when it was supposed to be.

So, in tsc_init(void)

First try to add this one line back:

cyc2ns_init_secondary_cpus();
-   static_branch_enable(&__use_tsc);


See if it fixes everything, and let's work from there.

Thank you,
Pavel

On 9/10/18 8:48 AM, Thomas Gleixner wrote:
> Ville,
> 
> On Mon, 10 Sep 2018, Ville Syrjala wrote:
> 
>> From: Ville Syrjälä 
>>
>> This reverts commit 608008a45798fe9e2aee04f99b5270ea57c1376f.
>>
>> It breaks wifi on my pentium 3 Fujitsu-Siemens Lifebook S6010
>> laptop. Scanning for APs doesn't seem to work most of the time,
>> and, even when it manages to find some APs it never manages to
>> authenticate successfully. dmesg is just littered with:
>> "wlan0: send auth to ... (try 1/3)
>>  wlan0: send auth to ... (try 2/3)
>>  wlan0: send auth to ... (try 3/3)
>>  wlan0: authentication with ... timed out"
> 
> I asked for that before and I really do not understand why you do not even
> make an attempt to report an issue first and allow the developers to work
> with you to figure out what exactly is the problem. All you do is to send
> an revert patch with a changelog which describes symptoms and probably
> breaks more than it cures. Not really helpful, really.
> 
> It's surely helpful to know that you bisected it to that commit and
> reverting it helps. Can you please provide more detailes information like
> dmesg of an good and a bad boot?
> 
> Thanks,
> 
>   tglx
> 
> 
> 

Re: [PATCH v2 1/2] mm: Move page struct poisoning to CONFIG_DEBUG_VM_PAGE_INIT_POISON

2018-09-06 Thread Pasha Tatashin


On 9/6/18 1:03 PM, Michal Hocko wrote:
> On Thu 06-09-18 08:41:52, Alexander Duyck wrote:
>> On Thu, Sep 6, 2018 at 8:13 AM Michal Hocko  wrote:
>>>
>>> On Thu 06-09-18 07:59:03, Dave Hansen wrote:
 On 09/05/2018 10:47 PM, Michal Hocko wrote:
> why do you have to keep DEBUG_VM enabled for workloads where the boot
> time matters so much that few seconds matter?

 There are a number of distributions that run with it enabled in the
 default build.  Fedora, for one.  We've basically assumed for a while
 that we have to live with it in production environments.

 So, where does leave us?  I think we either need a _generic_ debug
 option like:

   CONFIG_DEBUG_VM_SLOW_AS_HECK

 under which we can put this an other really slow VM debugging.  Or, we
 need some kind of boot-time parameter to trigger the extra checking
 instead of a new CONFIG option.
>>>
>>> I strongly suspect nobody will ever enable such a scary looking config
>>> TBH. Besides I am not sure what should go under that config option.
>>> Something that takes few cycles but it is called often or one time stuff
>>> that takes quite a long but less than aggregated overhead of the former?
>>>
>>> Just consider this particular case. It basically re-adds an overhead
>>> that has always been there before the struct page init optimization
>>> went it. The poisoning just returns it in a different form to catch
>>> potential left overs. And we would like to have as many people willing
>>> to run in debug mode to test those paths because they are
>>> basically impossible to review by code inspection. More importantly
>>> the major overhead is boot time so my question still stands. Is this
>>> worth a separate config option almost nobody is going to enable?
>>>
>>> Enabling DEBUG_VM by Fedora and others serves us a very good testing
>>> coverage and I appreciate that because it has generated some useful bug
>>> reports. Those people are paying quite a lot of overhead in runtime
>>> which can aggregate over time is it so much to ask about one time boot
>>> overhead?
>>
>> The kind of boot time add-on I saw as a result of this was about 170
>> seconds, or 2 minutes and 50 seconds on a 12TB system.
> 
> Just curious. How long does it take to get from power on to even reaach
> boot loader on that machine... ;)
> 
>> I spent a
>> couple minutes wondering if I had built a bad kernel or not as I was
>> staring at a dead console the entire time after the grub prompt since
>> I hit this so early in the boot. That is the reason why I am so eager
>> to slice this off and make it something separate. I could easily see
>> this as something that would get in the way of other debugging that is
>> going on in a system.
> 
> But you would get the same overhead a kernel release ago when the
> memmap init optimization was merged. So you are basically back to what
> we used to have for years. Unless I misremember.

You remember this correctly:

Commit 2f47a91f4dab1905cdcfced9dfcaf3f5257e has data from before vs. after
zeroing memory in the memblock allocator.

On SPARC with 15T we saved 55.4s, because SPARC has larger base pages and
thus fewer struct pages.

On x86 with 1T we saved 15.8s, which scales to 189.6s for the 12T machine
Alexander is using; that is close to the 170s he is seeing, but his CPUs
must be faster.

Pavel

Re: [PATCH v2 1/2] mm: Move page struct poisoning to CONFIG_DEBUG_VM_PAGE_INIT_POISON

2018-09-06 Thread Pasha Tatashin


On 9/6/18 11:41 AM, Alexander Duyck wrote:
> On Thu, Sep 6, 2018 at 8:13 AM Michal Hocko  wrote:
>>
>> On Thu 06-09-18 07:59:03, Dave Hansen wrote:
>>> On 09/05/2018 10:47 PM, Michal Hocko wrote:
 why do you have to keep DEBUG_VM enabled for workloads where the boot
 time matters so much that few seconds matter?
>>>
>>> There are a number of distributions that run with it enabled in the
>>> default build.  Fedora, for one.  We've basically assumed for a while
>>> that we have to live with it in production environments.
>>>
>>> So, where does leave us?  I think we either need a _generic_ debug
>>> option like:
>>>
>>>   CONFIG_DEBUG_VM_SLOW_AS_HECK
>>>
>>> under which we can put this an other really slow VM debugging.  Or, we
>>> need some kind of boot-time parameter to trigger the extra checking
>>> instead of a new CONFIG option.
>>
>> I strongly suspect nobody will ever enable such a scary looking config
>> TBH. Besides I am not sure what should go under that config option.
>> Something that takes few cycles but it is called often or one time stuff
>> that takes quite a long but less than aggregated overhead of the former?
>>
>> Just consider this particular case. It basically re-adds an overhead
>> that has always been there before the struct page init optimization
>> went it. The poisoning just returns it in a different form to catch
>> potential left overs. And we would like to have as many people willing
>> to run in debug mode to test those paths because they are
>> basically impossible to review by code inspection. More importantly
>> the major overhead is boot time so my question still stands. Is this
>> worth a separate config option almost nobody is going to enable?
>>
>> Enabling DEBUG_VM by Fedora and others serves us a very good testing
>> coverage and I appreciate that because it has generated some useful bug
>> reports. Those people are paying quite a lot of overhead in runtime
>> which can aggregate over time is it so much to ask about one time boot
>> overhead?
> 
> The kind of boot time add-on I saw as a result of this was about 170
> seconds, or 2 minutes and 50 seconds on a 12TB system. I spent a
> couple minutes wondering if I had built a bad kernel or not as I was
> staring at a dead console the entire time after the grub prompt since
> I hit this so early in the boot. That is the reason why I am so eager
> to slice this off and make it something separate. I could easily see
> this as something that would get in the way of other debugging that is
> going on in a system.
> 
> If we don't want to do a config option, then what about adding a
> kernel parameter to put a limit on how much memory we will initialize
> like this before we just start skipping it. We could put a default
> limit on it like 256GB and then once we cross that threshold we just
> don't bother poisoning any more memory. With that we would probably be
> able to at least cover most of the early memory init, and that value
> should cover most systems without getting into delays on the order of
> minutes.

I am OK with a boot parameter to optionally disable it when DEBUG_VM is
enabled. But I do not think it is a good idea to make that parameter
"smart": basically, always poison memory with DEBUG_VM unless booted with
a parameter that tells it not to poison memory.

CONFIG_DEBUG_VM is disabled on:

RedHat, Oracle Linux, CentOS, Ubuntu, Arch Linux, SUSE

Enabled on:

Fedora

Are there other distros where it is enabled? I think this could be
filed as a performance bug against the Fedora distro, and let them
decide what to do about it.

I do not want to make this feature less tested. Poisoning memory allowed
us to catch corner case bugs like these:

ab1e8d8960b68f54af42b6484b5950bd13a4054b
mm: don't allow deferred pages with NEED_PER_CPU_KM

e181ae0c5db9544de9c53239eb22bc012ce75033
mm: zero unavailable pages before memmap init

And several more that were fixed by other people.

For a very long time Linux relied on the assumption that boot memory is
zeroed, and I am sure we will continue to detect more bugs over time.

Thank you,
Pavel

> 
> - Alex
> 

Re: [PATCH v2 1/2] mm: Move page struct poisoning to CONFIG_DEBUG_VM_PAGE_INIT_POISON

2018-09-05 Thread Pasha Tatashin


On 9/5/18 5:29 PM, Alexander Duyck wrote:
> On Wed, Sep 5, 2018 at 2:22 PM Pasha Tatashin
>  wrote:
>>
>>
>>
>> On 9/5/18 5:13 PM, Alexander Duyck wrote:
>>> From: Alexander Duyck 
>>>
>>> On systems with a large amount of memory it can take a significant amount
>>> of time to initialize all of the page structs with the PAGE_POISON_PATTERN
>>> value. I have seen it take over 2 minutes to initialize a system with
>>> over 12GB of RAM.
>>>
>>> In order to work around the issue I had to disable CONFIG_DEBUG_VM and then
>>> the boot time returned to something much more reasonable as the
>>> arch_add_memory call completed in milliseconds versus seconds. However in
>>> doing that I had to disable all of the other VM debugging on the system.
>>>
>>> Instead of keeping the value in CONFIG_DEBUG_VM I am adding a new CONFIG
>>> value called CONFIG_DEBUG_VM_PAGE_INIT_POISON that will control the page
>>> poisoning independent of the CONFIG_DEBUG_VM option.
>>>
>>> Signed-off-by: Alexander Duyck 
>>> ---
>>>  include/linux/page-flags.h |8 
>>>  lib/Kconfig.debug  |   14 ++
>>>  mm/memblock.c  |5 ++---
>>>  mm/sparse.c|4 +---
>>>  4 files changed, 25 insertions(+), 6 deletions(-)
>>>
>>> diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
>>> index 74bee8cecf4c..0e95ca63375a 100644
>>> --- a/include/linux/page-flags.h
>>> +++ b/include/linux/page-flags.h
>>> @@ -13,6 +13,7 @@
>>>  #include 
>>>  #include 
>>>  #endif /* !__GENERATING_BOUNDS_H */
>>> +#include 
>>>
>>>  /*
>>>   * Various page->flags bits:
>>> @@ -162,6 +163,13 @@ static inline int PagePoisoned(const struct page *page)
>>>   return page->flags == PAGE_POISON_PATTERN;
>>>  }
>>>
>>> +static inline void page_init_poison(struct page *page, size_t size)
>>> +{
>>> +#ifdef CONFIG_DEBUG_VM_PAGE_INIT_POISON
>>> + memset(page, PAGE_POISON_PATTERN, size);
>>> +#endif
>>> +}
>>> +
>>>  /*
>>>   * Page flags policies wrt compound pages
>>>   *
>>> diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
>>> index 613316724c6a..3b1277c52fed 100644
>>> --- a/lib/Kconfig.debug
>>> +++ b/lib/Kconfig.debug
>>> @@ -637,6 +637,20 @@ config DEBUG_VM_PGFLAGS
>>>
>>> If unsure, say N.
>>>
>>> +config DEBUG_VM_PAGE_INIT_POISON
>>> + bool "Enable early page metadata poisoning"
>>> + default y
>>> + depends on DEBUG_VM
>>> + help
>>> +   Seed the page metadata with a poison pattern to improve the
>>> +   likelihood of detecting attempts to access the page prior to
>>> +   initialization by the memory subsystem.
>>> +
>>> +   This initialization can result in a longer boot time for systems
>>> +   with a large amount of memory.
>>
>> What happens when DEBUG_VM_PGFLAGS = y and
>> DEBUG_VM_PAGE_INIT_POISON = n ?
>>
>> We are testing for a pattern that was not set?
>>
>> I think DEBUG_VM_PAGE_INIT_POISON must depend on DEBUG_VM_PGFLAGS instead.
>>
>> Looks good otherwise.
>>
>> Thank you,
>> Pavel
> 
> The problem is that I then end up in the same situation I had in the
> last patch where you have to have DEBUG_VM_PGFLAGS on in order to do
> the seeding with poison.

OK

> 
> I can wrap the bit of code in PagePoisoned to just always return false
> if we didn't set the pattern. I figure there is value to be had for
> running DEBUG_VM_PGFLAGS regardless of the poison check, or
> DEBUG_VM_PAGE_INIT_POISON without the PGFLAGS check. That is why I
> wanted to leave them independent.
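
(What Alexander describes above might look roughly like the following,
assuming the CONFIG name from this patch; a sketch, not the posted code:)

static inline int PagePoisoned(const struct page *page)
{
#ifdef CONFIG_DEBUG_VM_PAGE_INIT_POISON
        return page->flags == PAGE_POISON_PATTERN;
#else
        return 0;       /* the pattern is never written, so never report it */
#endif
}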

How about:

Remove "depends on DEBUG_VM", but make DEBUG_VM_PGFLAGS depend on
both DEBUG_VM and DEBUG_VM_PAGE_INIT_POISON?

DEBUG_VM_PGFLAGS is already extremely slow, so having this extra
dependency is OK.

Thank you,
Pavel

> 
> - Alex
> 

Re: [PATCH v2 1/2] mm: Move page struct poisoning to CONFIG_DEBUG_VM_PAGE_INIT_POISON

2018-09-05 Thread Pasha Tatashin


On 9/5/18 5:13 PM, Alexander Duyck wrote:
> From: Alexander Duyck 
> 
> On systems with a large amount of memory it can take a significant amount
> of time to initialize all of the page structs with the PAGE_POISON_PATTERN
> value. I have seen it take over 2 minutes to initialize a system with
> over 12TB of RAM.
> 
> In order to work around the issue I had to disable CONFIG_DEBUG_VM and then
> the boot time returned to something much more reasonable as the
> arch_add_memory call completed in milliseconds versus seconds. However in
> doing that I had to disable all of the other VM debugging on the system.
> 
> Instead of keeping the value in CONFIG_DEBUG_VM I am adding a new CONFIG
> value called CONFIG_DEBUG_VM_PAGE_INIT_POISON that will control the page
> poisoning independent of the CONFIG_DEBUG_VM option.
> 
> Signed-off-by: Alexander Duyck 
> ---
>  include/linux/page-flags.h |8 
>  lib/Kconfig.debug  |   14 ++
>  mm/memblock.c  |5 ++---
>  mm/sparse.c|4 +---
>  4 files changed, 25 insertions(+), 6 deletions(-)
> 
> diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
> index 74bee8cecf4c..0e95ca63375a 100644
> --- a/include/linux/page-flags.h
> +++ b/include/linux/page-flags.h
> @@ -13,6 +13,7 @@
>  #include 
>  #include 
>  #endif /* !__GENERATING_BOUNDS_H */
> +#include 
>  
>  /*
>   * Various page->flags bits:
> @@ -162,6 +163,13 @@ static inline int PagePoisoned(const struct page *page)
>   return page->flags == PAGE_POISON_PATTERN;
>  }
>  
> +static inline void page_init_poison(struct page *page, size_t size)
> +{
> +#ifdef CONFIG_DEBUG_VM_PAGE_INIT_POISON
> + memset(page, PAGE_POISON_PATTERN, size);
> +#endif
> +}
> +
>  /*
>   * Page flags policies wrt compound pages
>   *
> diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
> index 613316724c6a..3b1277c52fed 100644
> --- a/lib/Kconfig.debug
> +++ b/lib/Kconfig.debug
> @@ -637,6 +637,20 @@ config DEBUG_VM_PGFLAGS
>  
> If unsure, say N.
>  
> +config DEBUG_VM_PAGE_INIT_POISON
> + bool "Enable early page metadata poisoning"
> + default y
> + depends on DEBUG_VM
> + help
> +   Seed the page metadata with a poison pattern to improve the
> +   likelihood of detecting attempts to access the page prior to
> +   initialization by the memory subsystem.
> +
> +   This initialization can result in a longer boot time for systems
> +   with a large amount of memory.

What happens when DEBUG_VM_PGFLAGS = y and
DEBUG_VM_PAGE_INIT_POISON = n ?

We are testing for a pattern that was not set?

I think DEBUG_VM_PAGE_INIT_POISON must depend on DEBUG_VM_PGFLAGS instead.

Looks good otherwise.

Thank you,
Pavel

> +
> +   If unsure, say Y.
> +
>  config ARCH_HAS_DEBUG_VIRTUAL
>   bool
>  
> diff --git a/mm/memblock.c b/mm/memblock.c
> index 237944479d25..a85315083b5a 100644
> --- a/mm/memblock.c
> +++ b/mm/memblock.c
> @@ -1444,10 +1444,9 @@ void * __init memblock_virt_alloc_try_nid_raw(
>  
>   ptr = memblock_virt_alloc_internal(size, align,
>  min_addr, max_addr, nid);
> -#ifdef CONFIG_DEBUG_VM
>   if (ptr && size > 0)
> - memset(ptr, PAGE_POISON_PATTERN, size);
> -#endif
> + page_init_poison(ptr, size);
> +
>   return ptr;
>  }
>  
> diff --git a/mm/sparse.c b/mm/sparse.c
> index 10b07eea9a6e..67ad061f7fb8 100644
> --- a/mm/sparse.c
> +++ b/mm/sparse.c
> @@ -696,13 +696,11 @@ int __meminit sparse_add_one_section(struct pglist_data 
> *pgdat,
>   goto out;
>   }
>  
> -#ifdef CONFIG_DEBUG_VM
>   /*
>* Poison uninitialized struct pages in order to catch invalid flags
>* combinations.
>*/
> - memset(memmap, PAGE_POISON_PATTERN, sizeof(struct page) * 
> PAGES_PER_SECTION);
> -#endif
> + page_init_poison(memmap, sizeof(struct page) * PAGES_PER_SECTION);
>  
>   section_mark_present(ms);
>   sparse_init_one_section(ms, section_nr, memmap, usemap);
> 

Re: [PATCH 2/2] mm: Create non-atomic version of SetPageReserved for init use

2018-09-05 Thread Pasha Tatashin


On 9/5/18 4:18 PM, Alexander Duyck wrote:
> On Tue, Sep 4, 2018 at 11:24 PM Michal Hocko  wrote:
>>
>> On Tue 04-09-18 11:33:45, Alexander Duyck wrote:
>>> From: Alexander Duyck 
>>>
>>> It doesn't make much sense to use the atomic SetPageReserved at init time
>>> when we are using memset to clear the memory and manipulating the page
>>> flags via simple "&=" and "|=" operations in __init_single_page.
>>>
>>> This patch adds a non-atomic version __SetPageReserved that can be used
>>> during page init and shows about a 10% improvement in initialization times
>>> on the systems I have available for testing.
>>
>> I agree with Dave about a comment is due. I am also quite surprised that
>> this leads to such a large improvement. Could you be more specific about
>> your test and machines you were testing on?
> 
> So my test case has been just initializing 4 3TB blocks of persistent
> memory with a few trace_printk values added to track total time in
> move_pfn_range_to_zone.
> 
> What I have been seeing is that the time needed for the call drops on
> average from 35-36 seconds down to around 31-32.

Just curious, why is there variance? During boot the timing is usually
pretty consistent, as there is only one thread and the system is in
pretty much the same state.

A dmesg output in the commit log would be helpful.

Thank you,
Pavel

> 
>> Other than that the patch makes sense to me.
>>
>>> Signed-off-by: Alexander Duyck 
>>
>> With the above addressed, feel free to add
>> Acked-by: Michal Hocko 
>>
>> Thanks!
> 
> As far as adding a comment are we just talking about why it is
> reserved, or do we need a description of the __SetPageReserved versus
> SetPageReserved. For now I was looking at adding a comment like:
> @@ -5517,8 +5517,13 @@ void __meminit memmap_init_zone(unsigned long
> size, int nid, unsigned long zone,
>  not_early:
> page = pfn_to_page(pfn);
> __init_single_page(page, pfn, zone, nid);
> +
> +   /*
> +* Mark page reserved as it will need to wait for onlining
> +* phase for it to be fully associated with a zone.
> +*/
> if (context == MEMMAP_HOTPLUG)
> -   SetPageReserved(page);
> +   __SetPageReserved(page);
> 
> /*
>  * Mark the block movable so that blocks are reserved for
> 
> Any thoughts on this?
> 
> Thanks.
> 
> - Alex
> 
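
(For readers skimming the thread: the difference between the two setters
is only atomic versus plain bit operations. Simplified below; the real
definitions are generated by the PAGEFLAG()/__SETPAGEFLAG() machinery in
page-flags.h.)

static inline void SetPageReserved(struct page *page)
{
        set_bit(PG_reserved, &page->flags);     /* atomic read-modify-write */
}

static inline void __SetPageReserved(struct page *page)
{
        __set_bit(PG_reserved, &page->flags);   /* plain op, init/single-owner only */
}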

Re: Plumbers 2018 - Performance and Scalability Microconference

2018-09-05 Thread Pasha Tatashin


On 9/5/18 2:38 AM, Mike Rapoport wrote:
> On Tue, Sep 04, 2018 at 05:28:13PM -0400, Daniel Jordan wrote:
>> Pavel Tatashin, Ying Huang, and I are excited to be organizing a performance 
>> and scalability microconference this year at Plumbers[*], which is happening 
>> in Vancouver this year.  The microconference is scheduled for the morning of 
>> the second day (Wed, Nov 14).
>>
>> We have a preliminary agenda and a list of confirmed and interested 
>> attendees (cc'ed), and are seeking more of both!
>>
>> Some of the items on the agenda as it stands now are:
>>
>>  - Promoting huge page usage:  With memory sizes becoming ever larger, huge 
>> pages are becoming more and more important to reduce TLB misses and the 
>> overhead of memory management itself--that is, to make the system scalable 
>> with the memory size.  But there are still some remaining gaps that prevent 
>> huge pages from being deployed in some situations, such as huge page 
>> allocation latency and memory fragmentation.
>>
>>  - Reducing the number of users of mmap_sem:  This semaphore is frequently 
>> used throughout the kernel.  In order to facilitate scaling this 
>> longstanding bottleneck, these uses should be documented and unnecessary 
>> users should be fixed.
>>
>>  - Parallelizing cpu-intensive kernel work:  Resolve problems of past 
>> approaches including extra threads interfering with other processes, playing 
>> well with power management, and proper cgroup accounting for the extra 
>> threads.  Bonus topic: proper accounting of workqueue threads running on 
>> behalf of cgroups.
>>
>>  - Preserving userland during kexec with a hibernation-like mechanism.
> 
> Just some crazy idea: have you considered using checkpoint-restore as a
> replacement or an addition to hibernation?

Hi Mike,

Yes, this is one way I was thinking about: use the kernel to pass the
application's stored state to the new kernel in pmem. The only problem is
that we waste memory: when there is not enough system memory to copy and
pass the application state to the new kernel, this scheme won't work.
Think about a DB that occupies 80% of system memory that we want to
checkpoint/restore.

So, we need another way, where the preserved memory is the memory that is
actually used by the applications, not a copy. One easy way is to give
each application that has a large, expensive-to-recreate state a
persistent memory device and let the application keep its state on that
device (say /dev/pmemN). The only problem is that memory on that device
must be accessible just as fast as regular memory, without any file
system overhead and hopefully without the need for DAX.
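
As a concrete picture of that model, an application could simply mmap()
its raw pmem device and keep its state there across the kexec. A
userspace toy, with the device name /dev/pmem0 assumed for illustration:

#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
        size_t len = 4096;
        int fd = open("/dev/pmem0", O_RDWR);

        if (fd < 0) {
                perror("open /dev/pmem0");
                return 1;
        }
        char *state = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        if (state == MAP_FAILED) {
                perror("mmap");
                return 1;
        }
        /* state written here is expected to survive the kexec reboot */
        snprintf(state, len, "app checkpoint, generation %d", 42);
        msync(state, len, MS_SYNC);
        munmap(state, len);
        close(fd);
        return 0;
}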

I just want to get some ideas of what people are thinking about this,
and what would be the best way to achieve it.

Pavel


>  
>> These center around our interests, but having lots of topics to choose from 
>> ensures we cover what's most important to the community, so we would like to 
>> hear about additional topics and extensions to those listed here.  This 
>> includes, but is certainly not limited to, work in progress that would 
>> benefit from in-person discussion, real-world performance problems, and 
>> experimental and academic work.
>>
>> If you haven't already done so, please let us know if you are interested in 
>> attending, or have suggestions for other attendees.
>>
>> Thanks,
>> Daniel
>>
>> [*] https://blog.linuxplumbersconf.org/2018/performance-mc/
>>
> 

Re: [PATCH 1/2] mm: Move page struct poisoning from CONFIG_DEBUG_VM to CONFIG_DEBUG_VM_PGFLAGS

2018-09-04 Thread Pasha Tatashin


On 9/4/18 5:13 PM, Alexander Duyck wrote:
> On Tue, Sep 4, 2018 at 1:07 PM Pasha Tatashin
>  wrote:
>>
>> Hi Alexander,
>>
>> This is a wrong way to do it. memblock_virt_alloc_try_nid_raw() does not
>> initialize allocated memory, and by setting memory to all ones in debug
>> build we ensure that no callers rely on this function to return zeroed
>> memory just by accident.
> 
> I get that, but setting this to all 1's is still just debugging code
> and that is adding significant overhead.

That's correct; it is debugging code on a debugging kernel.

> 
>> And, the accidents are frequent because most of the BIOSes and
>> hypervisors zero memory for us. The exception is kexec reboot.
>>
>> So, the fact that page flags checks this pattern, does not mean that
>> this is the only user. Memory that is returned by
>> memblock_virt_alloc_try_nid_raw() is used for page table as well, and
>> can be used in other places as well that don't want memblock to zero the
>> memory for them for performance reasons.
> 
> The logic behind this statement is confusing. You are saying they
> don't want memblock to zero the memory for performance reasons, yet
> you are setting it to all 1's for debugging reasons? I get that it is
> wrapped, but in my mind just using CONFIG_DEBUG_VM is too broad of a
> brush. Especially with distros like Fedora enabling it by default.

The idea is to not zero memory on the production kernel, and to ensure
that not zeroing memory does not cause any accidental bugs by having
debug code on the debug kernel.

> 
>> I am surprised that CONFIG_DEBUG_VM is used in production kernel, but if
>> so perhaps a new CONFIG should be added: CONFIG_DEBUG_MEMBLOCK
>>
>> Thank you,
>> Pavel
> 
> I don't know about production. I am running a Fedora kernel on my
> development system and it has it enabled. It looks like it has been
> that way for a while based on a FC20 Bugzilla
> (https://bugzilla.redhat.com/show_bug.cgi?id=1074710). A quick look at
> one of my CentOS systems shows that it doesn't have it set. I suspect
> it will vary from distro to distro. I just know it spooked me when I
> was stuck staring at a blank screen for three minutes when I was
> booting a system with 12TB of memory since this delay can hit you
> early in the boot.

I understand, this is the delay that I fixed when I removed the memset(0)
from the struct page initialization code. However, we still need to keep
this debug memset() to all ones in order to catch some bugs. And we do
catch them from time to time.

For far too long Linux expected that the memory returned by memblock and
the boot allocator is always zeroed.

> 
> I had considered adding a completely new CONFIG. The only thing is it
> doesn't make much sense to have the logic setting the value to all 1's
> without any logic to test for it. 

When memory is zeroed, a page table built from it works by accident, as
the entries look empty. However, when the entries are all ones and we
accidentally try to use that memory as a page table, the invalid VA in
the page table will crash the debug kernel (and in the past this has
helped find some bugs).

So, the testing is not only that uninitialized struct pages are not
accessed, but also that only explicitly initialized page tables are
accessed.
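
A userspace toy of that distinction, for illustration only: a zero entry
merely looks "not present" and hides the bug, while an all-ones entry is
obviously bogus and blows up as soon as it is used.

#include <stdint.h>
#include <stdio.h>

#define PTE_PRESENT     0x1ULL

static void walk_entry(uint64_t pte)
{
        if (!(pte & PTE_PRESENT)) {
                printf("0x%016llx: looks 'not mapped', bug stays hidden\n",
                       (unsigned long long)pte);
                return;
        }
        printf("0x%016llx: would be followed as a pointer -> instant crash\n",
               (unsigned long long)pte);
}

int main(void)
{
        walk_entry(0);          /* memory that happened to be zeroed */
        walk_entry(~0ULL);      /* poisoned, uninitialized memory */
        return 0;
}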


> That is why I thought it made more
> sense to just fold it into CONFIG_DEBUG_VM_PGFLAGS. I suppose I could
> look at something like CONFIG_DEBUG_PAGE_INIT if we want to go that
> route. I figure using something like MEMBLOCK probably wouldn't make
> sense since this also impacts sparse section init.

If distros are using CONFIG_DEBUG_VM in production kernels (as you
pointed out above), it makes sense to add CONFIG_DEBUG_MEMBLOCK.

Thank you,
Pavel

Re: [PATCH 1/2] mm: Move page struct poisoning from CONFIG_DEBUG_VM to CONFIG_DEBUG_VM_PGFLAGS

2018-09-04 Thread Pasha Tatashin
Hi Alexander,

This is a wrong way to do it. memblock_virt_alloc_try_nid_raw() does not
initialize allocated memory, and by setting memory to all ones in debug
build we ensure that no callers rely on this function to return zeroed
memory just by accident.

And, the accidents are frequent because most of the BIOSes and
hypervisors zero memory for us. The exception is kexec reboot.

So, the fact that page flags checks this pattern, does not mean that
this is the only user. Memory that is returned by
memblock_virt_alloc_try_nid_raw() is used for page table as well, and
can be used in other places as well that don't want memblock to zero the
memory for them for performance reasons.

I am surprised that CONFIG_DEBUG_VM is used in production kernel, but if
so perhaps a new CONFIG should be added: CONFIG_DEBUG_MEMBLOCK

Thank you,
Pavel

On 9/4/18 2:33 PM, Alexander Duyck wrote:
> From: Alexander Duyck 
> 
> On systems with a large amount of memory it can take a significant amount
> of time to initialize all of the page structs with the PAGE_POISON_PATTERN
> value. I have seen it take over 2 minutes to initialize a system with
> over 12TB of RAM.
> 
> In order to work around the issue I had to disable CONFIG_DEBUG_VM and then
> the boot time returned to something much more reasonable as the
> arch_add_memory call completed in milliseconds versus seconds. However in
> doing that I had to disable all of the other VM debugging on the system.
> 
> I did a bit of research and it seems like the only function that checks
> for this poison value is the PagePoisoned function, and it is only called
> in two spots. One is the PF_POISONED_CHECK macro that is only in use when
> CONFIG_DEBUG_VM_PGFLAGS is defined, and the other is as a part of the
> __dump_page function which is using the check to prevent a recursive
> failure in the event of discovering a poisoned page.
> 
> With this being the case I am opting to move the poisoning of the page
> structs from CONFIG_DEBUG_VM to CONFIG_DEBUG_VM_PGFLAGS so that we are
> only performing the memset if it will be used to test for failures.
> 
> Signed-off-by: Alexander Duyck 
> ---
>  mm/memblock.c |2 +-
>  mm/sparse.c   |2 +-
>  2 files changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/mm/memblock.c b/mm/memblock.c
> index 237944479d25..51e8ae927257 100644
> --- a/mm/memblock.c
> +++ b/mm/memblock.c
> @@ -1444,7 +1444,7 @@ void * __init memblock_virt_alloc_try_nid_raw(
>  
>   ptr = memblock_virt_alloc_internal(size, align,
>  min_addr, max_addr, nid);
> -#ifdef CONFIG_DEBUG_VM
> +#ifdef CONFIG_DEBUG_VM_PGFLAGS
>   if (ptr && size > 0)
>   memset(ptr, PAGE_POISON_PATTERN, size);
>  #endif
> diff --git a/mm/sparse.c b/mm/sparse.c
> index 10b07eea9a6e..0fd9ad5021b0 100644
> --- a/mm/sparse.c
> +++ b/mm/sparse.c
> @@ -696,7 +696,7 @@ int __meminit sparse_add_one_section(struct pglist_data 
> *pgdat,
>   goto out;
>   }
>  
> -#ifdef CONFIG_DEBUG_VM
> +#ifdef CONFIG_DEBUG_VM_PGFLAGS
>   /*
>* Poison uninitialized struct pages in order to catch invalid flags
>* combinations.
> 

[PATCH] mm: Disable deferred struct page for 32-bit arches

2018-08-31 Thread Pasha Tatashin
Deferred struct page init is needed only on systems with a large amount of
physical memory, to improve boot performance. 32-bit systems do not benefit
from this feature.

Jiri reported a problem where deferred struct pages do not work well with
x86-32:

[0.035162] Dentry cache hash table entries: 131072 (order: 7, 524288 bytes)
[0.035725] Inode-cache hash table entries: 65536 (order: 6, 262144 bytes)
[0.036269] Initializing CPU#0
[0.036513] Initializing HighMem for node 0 (00036ffe:0007ffe0)
[0.038459] page:f678 is uninitialized and poisoned
[0.038460] raw:       
 
[0.039509] page dumped because: VM_BUG_ON_PAGE(1 && PageCompound(page))
[0.040038] [ cut here ]
[0.040399] kernel BUG at include/linux/page-flags.h:293!
[0.040823] invalid opcode:  [#1] SMP PTI
[0.041166] CPU: 0 PID: 0 Comm: swapper Not tainted 4.19.0-rc1_pt_jiri #9
[0.041694] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
1.11.0-20171110_100015-anatol 04/01/2014
[0.042496] EIP: free_highmem_page+0x64/0x80
[0.042839] Code: 13 46 d8 c1 e8 18 5d 83 e0 03 8d 04 c0 c1 e0 06 ff 80 ec 
5f 44 d8 c3 8d b4 26 00 00 00 00 ba 08 65 28 d8 89 d8 e8 fc 71 02 00 <0f> 0b 8d 
76 00 8d bc 27 00 00 00 00 ba d0 b1 26 d8 89 d8 e8 e4 71
[0.044338] EAX: 003c EBX: f678 ECX:  EDX: d856cbe8
[0.044868] ESI: 0007ffe0 EDI: d838df20 EBP: d838df00 ESP: d838defc
[0.045372] DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068 EFLAGS: 00210086
[0.045913] CR0: 80050033 CR2:  CR3: 18556000 CR4: 00040690
[0.046413] DR0:  DR1:  DR2:  DR3: 
[0.046913] DR6: fffe0ff0 DR7: 0400
[0.047220] Call Trace:
[0.047419]  add_highpages_with_active_regions+0xbd/0x10d
[0.047854]  set_highmem_pages_init+0x5b/0x71
[0.048202]  mem_init+0x2b/0x1e8
[0.048460]  start_kernel+0x1d2/0x425
[0.048757]  i386_start_kernel+0x93/0x97
[0.049073]  startup_32_smp+0x164/0x168
[0.049379] Modules linked in:
[0.049626] ---[ end trace 337949378db0abbb ]---

We free highmem pages before their struct pages are initialized:

mem_init()
 set_highmem_pages_init()
  add_highpages_with_active_regions()
   free_highmem_page()
.. Access uninitialized struct page here..

Because there is no reason to have this feature on 32-bit systems, just
disable it.

Fixes: 2e3ca40f03bb ("mm: relax deferred struct page requirements")
Cc: sta...@vger.kernel.org

Reported-by: Jiri Slaby 
Signed-off-by: Pavel Tatashin 
---
 mm/Kconfig | 1 +
 1 file changed, 1 insertion(+)

diff --git a/mm/Kconfig b/mm/Kconfig
index a550635ea5c3..de64ea658716 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -637,6 +637,7 @@ config DEFERRED_STRUCT_PAGE_INIT
depends on NO_BOOTMEM
depends on SPARSEMEM
depends on !NEED_PER_CPU_KM
+   depends on 64BIT
help
  Ordinarily all struct pages are initialised during early boot in a
  single thread. On very large machines this can take a considerable
-- 
2.18.0


Re: [PATCH] mm/page_alloc: Clean up check_for_memory

2018-08-31 Thread Pasha Tatashin


On 8/31/18 8:24 AM, Oscar Salvador wrote:
> On Thu, Aug 30, 2018 at 01:55:29AM +0000, Pasha Tatashin wrote:
>> I would re-write the above function like this:
>> static void check_for_memory(pg_data_t *pgdat, int nid)
>> {
>> enum zone_type zone_type;
>>
>> for (zone_type = 0; zone_type < ZONE_MOVABLE; zone_type++) {
>> if (populated_zone(&pgdat->node_zones[zone_type])) { 
>> node_set_state(nid, zone_type <= ZONE_NORMAL ?
>>N_NORMAL_MEMORY: N_HIGH_MEMORY);
>> break;
>> }
>> }
>> }
> 
> Hi Pavel,
> 
> the above would not work fine.
> You set either N_NORMAL_MEMORY or N_HIGH_MEMORY, but a node can have both
> types of memory at the same time (on CONFIG_HIGHMEM systems).
> 
> N_HIGH_MEMORY stands for regular or high memory
> while N_NORMAL_MEMORY stands only for regular memory,
> that is why we set it only in case the zone is <= ZONE_NORMAL.

Hi Oscar,

Are you saying that the code in mainline is broken? Because we set
node_set_state(nid, N_NORMAL_MEMORY); even on a node with N_HIGH_MEMORY:

6826if (N_NORMAL_MEMORY != N_HIGH_MEMORY &&
6827zone_type <= ZONE_NORMAL)
6828node_set_state(nid, N_NORMAL_MEMORY);

Thank you,
Pavel

Re: [PATCH v1 5/5] mm/memory_hotplug: print only with DEBUG_VM in online/offline_pages()

2018-08-30 Thread Pasha Tatashin


On 8/20/18 6:46 AM, David Hildenbrand wrote:
> On 16.08.2018 12:06, David Hildenbrand wrote:
>> Let's try to minimze the noise.
>>
>> Signed-off-by: David Hildenbrand 
>> ---
>>  mm/memory_hotplug.c | 6 ++
>>  1 file changed, 6 insertions(+)
>>
>> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
>> index bbbd16f9d877..6fec2dc6a73d 100644
>> --- a/mm/memory_hotplug.c
>> +++ b/mm/memory_hotplug.c
>> @@ -966,9 +966,11 @@ int __ref online_pages(unsigned long pfn, unsigned long 
>> nr_pages, int online_typ
>>  return 0;
>>  
>>  failed_addition:
>> +#ifdef CONFIG_DEBUG_VM
>>  pr_debug("online_pages [mem %#010llx-%#010llx] failed\n",
>>   (unsigned long long) pfn << PAGE_SHIFT,
>>   (((unsigned long long) pfn + nr_pages) << PAGE_SHIFT) - 1);
>> +#endif
>>  memory_notify(MEM_CANCEL_ONLINE, &arg);
>>  return ret;
>>  }
>> @@ -1660,7 +1662,9 @@ int offline_pages(unsigned long start_pfn, unsigned 
>> long nr_pages)
>>  offlined_pages = check_pages_isolated(start_pfn, end_pfn);
>>  if (offlined_pages < 0)
>>  goto repeat;
>> +#ifdef CONFIG_DEBUG_VM
>>  pr_info("Offlined Pages %ld\n", offlined_pages);
>> +#endif
>>  /* Ok, all of our target is isolated.
>> We cannot do rollback at this point. */
>>  offline_isolated_pages(start_pfn, end_pfn);
>> @@ -1695,9 +1699,11 @@ int offline_pages(unsigned long start_pfn, unsigned 
>> long nr_pages)
>>  return 0;
>>  
>>  failed_removal:
>> +#ifdef CONFIG_DEBUG_VM
>>  pr_debug("memory offlining [mem %#010llx-%#010llx] failed\n",
>>   (unsigned long long) start_pfn << PAGE_SHIFT,
>>   ((unsigned long long) end_pfn << PAGE_SHIFT) - 1);
>> +#endif
>>  memory_notify(MEM_CANCEL_OFFLINE, &arg);
>>  /* pushback to free area */
>>  undo_isolate_page_range(start_pfn, end_pfn, MIGRATE_MOVABLE);
>>
> 
> I'll drop this patch for now, maybe the error messages are actually
> useful when debugging a crashdump of a system without CONFIG_DEBUG_VM.
> 

Yes, please drop it; there is no reason to reduce the number of these
messages. They are useful.

Thank you,
Pavel

Re: [PATCH v1 4/5] mm/memory_hotplug: onlining pages can only fail due to notifiers

2018-08-30 Thread Pasha Tatashin
LGTM

Reviewed-by: Pavel Tatashin 

On 8/16/18 6:06 AM, David Hildenbrand wrote:
> Onlining pages can only fail if a notifier reported a problem (e.g. -ENOMEM).
> online_pages_range() can never fail.
> 
> Signed-off-by: David Hildenbrand 
> ---
>  mm/memory_hotplug.c | 9 ++---
>  1 file changed, 2 insertions(+), 7 deletions(-)
> 
> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index 3dc6d2a309c2..bbbd16f9d877 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -933,13 +933,8 @@ int __ref online_pages(unsigned long pfn, unsigned long 
> nr_pages, int online_typ
>   setup_zone_pageset(zone);
>   }
>  
> - ret = walk_system_ram_range(pfn, nr_pages, &onlined_pages,
> - online_pages_range);
> - if (ret) {
> - if (need_zonelists_rebuild)
> - zone_pcp_reset(zone);
> - goto failed_addition;
> - }
> + walk_system_ram_range(pfn, nr_pages, &onlined_pages,
> +   online_pages_range);
>  
>   zone->present_pages += onlined_pages;
>  
> 

Re: [PATCH v1 3/5] mm/memory_hotplug: check if sections are already online/offline

2018-08-30 Thread Pasha Tatashin
On 8/16/18 7:00 AM, David Hildenbrand wrote:
> On 16.08.2018 12:47, Oscar Salvador wrote:
>> On Thu, Aug 16, 2018 at 12:06:26PM +0200, David Hildenbrand wrote:
>>
>>> +
>>> +/* check if all mem sections are offline */
>>> +bool mem_sections_offline(unsigned long pfn, unsigned long end_pfn)
>>> +{
>>> +   for (; pfn < end_pfn; pfn += PAGES_PER_SECTION) {
>>> +   unsigned long section_nr = pfn_to_section_nr(pfn);
>>> +
>>> +   if (WARN_ON(!valid_section_nr(section_nr)))
>>> +   continue;
>>> +   if (online_section_nr(section_nr))
>>> +   return false;
>>> +   }
>>> +   return true;
>>> +}
>>
>> AFAICS pages_correctly_probed will catch this first.
>> pages_correctly_probed checks for the section to be:
>>
>> - present
>> - valid
>> - !online
> 
> Right, I missed that function.
> 
>>
>> Maybe it makes sense to rename it, and write another pages_correctly_probed 
>> routine
>> for the offline case.
>>
>> So all checks would stay in memory_block_action level, and we would not need
>> the mem_sections_offline/online stuff.

I am OK with that, but will wait for a patch to review.

Pavel

Re: [PATCH v1 2/5] mm/memory_hotplug: enforce section alignment when onlining/offlining

2018-08-30 Thread Pasha Tatashin
Hi David,

I am not sure this is needed, because we already have a stricter checker:

check_hotplug_memory_range()

You could call it from online_pages(), if you think there is a reason to
do it, but other than that it is done from add_memory_resource() and
from remove_memory().
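
For context, the gist of that checker is an alignment test against the
memory block size; paraphrased from memory (not quoted), with start and
nr_pages being the hotplugged range:

        /* roughly what check_hotplug_memory_range() verifies */
        unsigned long block_nr_pages = memory_block_size_bytes() >> PAGE_SHIFT;

        if (!nr_pages ||
            !IS_ALIGNED(PFN_DOWN(start), block_nr_pages) ||
            !IS_ALIGNED(nr_pages, block_nr_pages))
                return -EINVAL;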

Thank you,
Pavel

On 8/16/18 6:06 AM, David Hildenbrand wrote:
> onlining/offlining code works on whole sections, so let's enforce that.
> Existing code only allows to add memory in memory block size. And only
> whole memory blocks can be onlined/offlined. Memory blocks are always
> aligned to sections, so this should not break anything.
> 
> online_pages/offline_pages will implicitly mark whole sections
> online/offline, so the code really can only handle such granularities.
> 
> (especially offlining code cannot deal with pageblock_nr_pages but
>  theoretically only MAX_ORDER-1)
> 
> Signed-off-by: David Hildenbrand 
> ---
>  mm/memory_hotplug.c | 10 +++---
>  1 file changed, 7 insertions(+), 3 deletions(-)
> 
> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index 090cf474de87..30d2fa42b0bb 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -897,6 +897,11 @@ int __ref online_pages(unsigned long pfn, unsigned long 
> nr_pages, int online_typ
>   struct memory_notify arg;
>   struct memory_block *mem;
>  
> + if (!IS_ALIGNED(pfn, PAGES_PER_SECTION))
> + return -EINVAL;
> + if (!IS_ALIGNED(nr_pages, PAGES_PER_SECTION))
> + return -EINVAL;
> +
>   /*
>* We can't use pfn_to_nid() because nid might be stored in struct page
>* which is not yet initialized. Instead, we find nid from memory block.
> @@ -1600,10 +1605,9 @@ int offline_pages(unsigned long start_pfn, unsigned 
> long nr_pages)
>   struct zone *zone;
>   struct memory_notify arg;
>  
> - /* at least, alignment against pageblock is necessary */
> - if (!IS_ALIGNED(start_pfn, pageblock_nr_pages))
> + if (!IS_ALIGNED(start_pfn, PAGES_PER_SECTION))
>   return -EINVAL;
> - if (!IS_ALIGNED(end_pfn, pageblock_nr_pages))
> + if (!IS_ALIGNED(nr_pages, PAGES_PER_SECTION))
>   return -EINVAL;
>   /* This makes hotplug much easier...and readable.
>  we assume this for now. .*/
> 

Re: [PATCH v1 1/5] mm/memory_hotplug: drop intermediate __offline_pages

2018-08-30 Thread Pasha Tatashin


On 8/30/18 4:17 PM, Pasha Tatashin wrote:
> I guess the wrapper was only there because of __ref, but there is no
> reason to keep it. So this looks good to me.
> 
> Reviewed-by: Pavel Tatashin 
>
> On 8/16/18 6:06 AM, David Hildenbrand wrote:
>> Let's avoid this indirection and just call the function offline_pages().
>>
>> Signed-off-by: David Hildenbrand 
>> ---
>>  mm/memory_hotplug.c | 13 +++--
>>  1 file changed, 3 insertions(+), 10 deletions(-)
>>
>> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
>> index 6a2726920ed2..090cf474de87 100644
>> --- a/mm/memory_hotplug.c
>> +++ b/mm/memory_hotplug.c
>> @@ -1589,10 +1589,10 @@ static void node_states_clear_node(int node, struct 
>> memory_notify *arg)
>>  node_clear_state(node, N_MEMORY);
>>  }
>>  
>> -static int __ref __offline_pages(unsigned long start_pfn,
>> -  unsigned long end_pfn)
>> +/* Must be protected by mem_hotplug_begin() or a device_lock */
>> +int offline_pages(unsigned long start_pfn, unsigned long nr_pages)
  ^^^
I meant to say keep the __ref, otherwise looks good.

Thank you,
Pavel

Re: [PATCH v1 1/5] mm/memory_hotplug: drop intermediate __offline_pages

2018-08-30 Thread Pasha Tatashin
I guess the wrapper was only there because of __ref, but there is no
reason to keep it. So this looks good to me.

Reviewed-by: Pavel Tatashin 

On 8/16/18 6:06 AM, David Hildenbrand wrote:
> Let's avoid this indirection and just call the function offline_pages().
> 
> Signed-off-by: David Hildenbrand 
> ---
>  mm/memory_hotplug.c | 13 +++--
>  1 file changed, 3 insertions(+), 10 deletions(-)
> 
> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index 6a2726920ed2..090cf474de87 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -1589,10 +1589,10 @@ static void node_states_clear_node(int node, struct 
> memory_notify *arg)
>   node_clear_state(node, N_MEMORY);
>  }
>  
> -static int __ref __offline_pages(unsigned long start_pfn,
> -   unsigned long end_pfn)
> +/* Must be protected by mem_hotplug_begin() or a device_lock */
> +int offline_pages(unsigned long start_pfn, unsigned long nr_pages)
>  {
> - unsigned long pfn, nr_pages;
> + unsigned long pfn, end_pfn = start_pfn + nr_pages;
>   long offlined_pages;
>   int ret, node;
>   unsigned long flags;
> @@ -1612,7 +1612,6 @@ static int __ref __offline_pages(unsigned long 
> start_pfn,
>  
>   zone = page_zone(pfn_to_page(valid_start));
>   node = zone_to_nid(zone);
> - nr_pages = end_pfn - start_pfn;
>  
>   /* set above range as isolated */
>   ret = start_isolate_page_range(start_pfn, end_pfn,
> @@ -1700,12 +1699,6 @@ static int __ref __offline_pages(unsigned long 
> start_pfn,
>   undo_isolate_page_range(start_pfn, end_pfn, MIGRATE_MOVABLE);
>   return ret;
>  }
> -
> -/* Must be protected by mem_hotplug_begin() or a device_lock */
> -int offline_pages(unsigned long start_pfn, unsigned long nr_pages)
> -{
> - return __offline_pages(start_pfn, start_pfn + nr_pages);
> -}
>  #endif /* CONFIG_MEMORY_HOTREMOVE */
>  
>  /**
> 

Re: [PATCH] mm/page_alloc: Clean up check_for_memory

2018-08-29 Thread Pasha Tatashin


On 8/28/18 5:01 PM, Oscar Salvador wrote:
> From: Oscar Salvador 
> 
> check_for_memory looks a bit confusing.
> First of all, we have this:
> 
> if (N_MEMORY == N_NORMAL_MEMORY)
>   return;
> 
> Checking the ENUM declaration, looks like N_MEMORY canot be equal to
> N_NORMAL_MEMORY.
> I could not find where N_MEMORY is set to N_NORMAL_MEMORY, or the other
> way around either, so unless I am missing something, this condition 
> will never evaluate to true.
> It makes sense to get rid of it.
> 
> Moving forward, the operations whithin the loop look a bit confusing
> as well.
> 
> We set N_HIGH_MEMORY unconditionally, and then we set N_NORMAL_MEMORY
> in case we have CONFIG_HIGHMEM (N_NORMAL_MEMORY != N_HIGH_MEMORY)
> and zone <= ZONE_NORMAL.
> (N_HIGH_MEMORY falls back to N_NORMAL_MEMORY on !CONFIG_HIGHMEM systems,
> and that is why we can just go ahead and set N_HIGH_MEMORY unconditionally)
> 
> Although this works, it is a bit subtle.
> 
> I think that this could be easier to follow:
> 
> First, we should only set N_HIGH_MEMORY in case we have
> CONFIG_HIGHMEM.
> And then we should set N_NORMAL_MEMORY in case zone <= ZONE_NORMAL,
> without further checking whether we have CONFIG_HIGHMEM or not.
> 
> Signed-off-by: Oscar Salvador 
> ---
>  mm/page_alloc.c | 9 +++--
>  1 file changed, 3 insertions(+), 6 deletions(-)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 839e0cc17f2c..6aa947f9e614 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -6819,15 +6819,12 @@ static void check_for_memory(pg_data_t *pgdat, int 
> nid)
>  {
>   enum zone_type zone_type;
>  
> - if (N_MEMORY == N_NORMAL_MEMORY)
> - return;
> -
>   for (zone_type = 0; zone_type <= ZONE_MOVABLE - 1; zone_type++) {
>   struct zone *zone = &pgdat->node_zones[zone_type];
>   if (populated_zone(zone)) {
> - node_set_state(nid, N_HIGH_MEMORY);
> - if (N_NORMAL_MEMORY != N_HIGH_MEMORY &&
> - zone_type <= ZONE_NORMAL)
> + if (IS_ENABLED(CONFIG_HIGHMEM))
> + node_set_state(nid, N_HIGH_MEMORY);
> + if (zone_type <= ZONE_NORMAL)
>   node_set_state(nid, N_NORMAL_MEMORY);
>   break;
>   }
> 

I would re-write the above function like this:
static void check_for_memory(pg_data_t *pgdat, int nid)
{
        enum zone_type zone_type;

        for (zone_type = 0; zone_type < ZONE_MOVABLE; zone_type++) {
                if (populated_zone(&pgdat->node_zones[zone_type])) {
                        node_set_state(nid, zone_type <= ZONE_NORMAL ?
                                            N_NORMAL_MEMORY : N_HIGH_MEMORY);
                        break;
                }
        }
}

zone_type <= ZONE_MOVABLE - 1
is the same as:
zone_type < ZONE_MOVABLE

If zone > ZONE_NORMAL, it means that CONFIG_HIGHMEM is enabled, no need to 
check for it.

Pavel

Re: [RFC v2 2/2] mm/memory_hotplug: Shrink spanned pages when offlining memory

2018-08-29 Thread Pasha Tatashin

On 8/17/18 11:41 AM, Oscar Salvador wrote:
> From: Oscar Salvador 
> 
> Currently, we decrement zone/node spanned_pages when we
> remove memory and not when we offline it.
> 
> This, besides of not being consistent with the current code,
> implies that we can access steal pages if we never get to online
> that memory.
> 
> In order to prevent that, we have to move all zone/pages stuff to
> the offlining memory stage.
> Removing memory path should only care about memory sections and memory
> blocks.
> 
> Another thing to notice here is that this is not so easy to be done
> as HMM/devm have a particular handling of memory-hotplug.
> They do not go through the common path, and so, they do not
> call either offline_pages() nor online_pages().
> 
> All they care about is to add the sections, move the pages to
> ZONE_DEVICE, and in some cases, to create the linear mapping.
> 
> In order to do this more smooth, two new functions are created
> to deal with these particular cases:
> 
> del_device_memory
> add_device_memory
> 
> add_device_memory is in charge of
> 
> a) calling either arch_add_memory() or add_pages(), depending on whether
>we want a linear mapping
> b) online the memory sections that correspond to the pfn range
> c) calling move_pfn_range_to_zone() being zone ZONE_DEVICE to
>expand zone/pgdat spanned pages and initialize its pages
> 
> del_device_memory, on the other hand, is in charge of
> 
> a) offline the memory sections that correspond to the pfn range
> b) calling shrink_pages(), which shrinks node/zone spanned pages.
> c) calling either arch_remove_memory() or __remove_pages(), depending on
>whether we need to tear down the linear mapping or not
> 
> These two functions are called from:
> 
> add_device_memory:
>   - devm_memremap_pages()
>   - hmm_devmem_pages_create()
> 
> del_device_memory:
>   - devm_memremap_pages_release()
>   - hmm_devmem_release()
> 
> I think that this will get easier as soon as [1] gets merged.
> 
> Finally, shrink_pages() is moved to offline_pages(), so now,
> all pages/zone handling is being taken care in online/offline_pages stage.
> 
> [1] https://lkml.org/lkml/2018/6/19/110
> 
> Signed-off-by: Oscar Salvador 
> ---
>  arch/ia64/mm/init.c|   4 +-
>  arch/powerpc/mm/mem.c  |  10 +--
>  arch/sh/mm/init.c  |   4 +-
>  arch/x86/mm/init_32.c  |   4 +-
>  arch/x86/mm/init_64.c  |   8 +--
>  include/linux/memory_hotplug.h |   9 ++-
>  kernel/memremap.c  |  14 ++--
>  kernel/resource.c  |  16 +
>  mm/hmm.c   |  32 -
>  mm/memory_hotplug.c| 143 
> +++--
>  mm/sparse.c|   4 +-
>  11 files changed, 145 insertions(+), 103 deletions(-)

Hi Oscar,

I have been studying this patch, and do not see anything bad about it
except that it begs to be split into smaller patches. I think you can
send this work as a series without RFC if this patch is split into 3 or
so patches. I will review that series.

Thank you,
Pavel

Re: [PATCH] memory_hotplug: fix kernel_panic on offline page processing

2018-08-28 Thread Pasha Tatashin


On 8/28/18 7:25 AM, Michal Hocko wrote:
> On Tue 28-08-18 11:05:39, Mikhail Zaslonko wrote:
>> Within show_valid_zones() the function test_pages_in_a_zone() should be
>> called for online memory blocks only. Otherwise it might lead to the
>> VM_BUG_ON due to uninitialized struct pages (when CONFIG_DEBUG_VM_PGFLAGS
>> kernel option is set):
>>
>>  page dumped because: VM_BUG_ON_PAGE(PagePoisoned(p))
>>  [ cut here ]
>>  Call Trace:
>>  ([<0038f91e>] test_pages_in_a_zone+0xe6/0x168)
>>   [<00923472>] show_valid_zones+0x5a/0x1a8
>>   [<00900284>] dev_attr_show+0x3c/0x78
>>   [<0046f6f0>] sysfs_kf_seq_show+0xd0/0x150
>>   [<003ef662>] seq_read+0x212/0x4b8
>>   [<003bf202>] __vfs_read+0x3a/0x178
>>   [<003bf3ca>] vfs_read+0x8a/0x148
>>   [<003bfa3a>] ksys_read+0x62/0xb8
>>   [<00bc2220>] system_call+0xdc/0x2d8
>>
>> That VM_BUG_ON was triggered by the page poisoning introduced in
>> mm/sparse.c with the git commit d0dc12e86b31 ("mm/memory_hotplug: optimize
>> memory hotplug")
>> With the same commit the new 'nid' field has been added to the struct
>> memory_block in order to store and later on derive the node id for offline
>> pages (instead of accessing struct page which might be uninitialized). But
>> one reference to nid in show_valid_zones() function has been overlooked.
>> Fixed with current commit.
>> Also, nr_pages will not be used any more after test_pages_in_a_zone() call,
>> do not update it.
>>
>> Fixes: d0dc12e86b31 ("mm/memory_hotplug: optimize memory hotplug")
>> Cc:  # v4.17+
>> Cc: Pavel Tatashin 
>> Signed-off-by: Mikhail Zaslonko 
> 
> Btw. this land mines which are basically impossible to find during the
> review are the reason why I was not all that happy about d0dc12e86b31.
> It added a margninal improvement but opened a can of warms. On the other

Hi Michal,

I agree, the hotplug code is very fragile. But that only means that it
requires improvements.

I specifically added the PagePoisoned() check with the hotplug
optimization changes in order to catch these kinds of bugs, and I am glad
we are catching them.

Reviewed-by: Pavel Tatashin 

Thank you,
Pavel

> hand maybe we just had to open that can one day...
> 
> Acked-by: Michal Hocko 
> 
> Thanks!
> 
>> ---
>>  drivers/base/memory.c | 20 +---
>>  1 file changed, 9 insertions(+), 11 deletions(-)
>>
>> diff --git a/drivers/base/memory.c b/drivers/base/memory.c
>> index f5e560188a18..622ab8edc035 100644
>> --- a/drivers/base/memory.c
>> +++ b/drivers/base/memory.c
>> @@ -416,26 +416,24 @@ static ssize_t show_valid_zones(struct device *dev,
>>  struct zone *default_zone;
>>  int nid;
>>  
>> -/*
>> - * The block contains more than one zone can not be offlined.
>> - * This can happen e.g. for ZONE_DMA and ZONE_DMA32
>> - */
>> -if (!test_pages_in_a_zone(start_pfn, start_pfn + nr_pages, 
>> &valid_start_pfn, &valid_end_pfn))
>> -return sprintf(buf, "none\n");
>> -
>> -start_pfn = valid_start_pfn;
>> -nr_pages = valid_end_pfn - start_pfn;
>> -
>>  /*
>>   * Check the existing zone. Make sure that we do that only on the
>>   * online nodes otherwise the page_zone is not reliable
>>   */
>>  if (mem->state == MEM_ONLINE) {
>> +/*
>> + * The block contains more than one zone can not be offlined.
>> + * This can happen e.g. for ZONE_DMA and ZONE_DMA32
>> + */
>> +if (!test_pages_in_a_zone(start_pfn, start_pfn + nr_pages,
>> +  &valid_start_pfn, &valid_end_pfn))
>> +return sprintf(buf, "none\n");
>> +start_pfn = valid_start_pfn;
>>  strcat(buf, page_zone(pfn_to_page(start_pfn))->name);
>>  goto out;
>>  }
>>  
>> -nid = pfn_to_nid(start_pfn);
>> +nid = mem->nid;
>>  default_zone = zone_for_pfn_range(MMOP_ONLINE_KEEP, nid, start_pfn, 
>> nr_pages);
>>  strcat(buf, default_zone->name);
>>  
>> -- 
>> 2.16.4
> 

Re: [PATCH v4 3/4] mm/memory_hotplug: Define nodemask_t as a stack variable

2018-08-28 Thread Pasha Tatashin


On 8/17/18 5:00 AM, Oscar Salvador wrote:
> From: Oscar Salvador 
> 
> Currently, unregister_mem_sect_under_nodes() tries to allocate a nodemask_t
> in order to check whithin the loop which nodes have already been unlinked,
> so we do not repeat the operation on them.
> 
> NODEMASK_ALLOC calls kmalloc() if NODES_SHIFT > 8, otherwise
> it just declares a nodemask_t variable whithin the stack.
> 
> Since kmalloc() can fail, we actually check whether NODEMASK_ALLOC failed
> or not, and we return -ENOMEM accordingly.
> remove_memory_section() does not check for the return value though.
> It is pretty rare that such a tiny allocation can fail, but if it does,
> we will be left with dangled symlinks under /sys/devices/system/node/,
> since the mem_blk's directories will be removed no matter what
> unregister_mem_sect_under_nodes() returns.
> 
> One way to solve this is to check whether unlinked_nodes is NULL or not.
> In the case it is not, we can just use it as before, but if it is NULL,
> we can just skip the node_test_and_set check, and call sysfs_remove_link()
> unconditionally.
> This is harmless as sysfs_remove_link() backs off somewhere down the chain
> in case the link has already been removed.
> This method was presented in v3 of the path [1].
> 
> But since the maximum number of nodes we can have is 1024,
> when NODES_SHIFT = 10, that gives us a nodemask_t of 128 bytes.
> Although everything depends on how deep the stack is, I think we can afford
> to define the nodemask_t variable whithin the stack.
> 
> That simplifies the code, and we do not need to worry about untested error
> code paths.
> 
> If we see that this causes troubles with the stack, we can always return to 
> [1].
> 
> Signed-off-by: Oscar Salvador 

LGTM:

Reviewed-by: Pavel Tatashin 

Thank you,
Pavel
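
As a quick sanity check of the 128-byte figure in the quoted message, a
userspace toy mirroring the kernel's bitmap-based nodemask_t (NODES_SHIFT
of 10 assumed):

#include <stdio.h>

#define NODES_SHIFT     10
#define MAX_NUMNODES    (1 << NODES_SHIFT)
#define BITS_PER_LONG   (8 * sizeof(unsigned long))

typedef struct {
        unsigned long bits[(MAX_NUMNODES + BITS_PER_LONG - 1) / BITS_PER_LONG];
} nodemask_t;

int main(void)
{
        /* prints 128 on an LP64 system, matching the figure quoted above */
        printf("sizeof(nodemask_t) = %zu bytes\n", sizeof(nodemask_t));
        return 0;
}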

Re: [PATCH 2/2] mm: zero remaining unavailable struct pages

2018-08-27 Thread Pasha Tatashin
On 8/23/18 2:25 PM, Masayoshi Mizuma wrote:
> From: Naoya Horiguchi 
> 
> There is a kernel panic that is triggered when reading /proc/kpageflags
> on the kernel booted with kernel parameter 'memmap=nn[KMG]!ss[KMG]':
> 
>   BUG: unable to handle kernel paging request at fffe
>   PGD 9b20e067 P4D 9b20e067 PUD 9b210067 PMD 0
>   Oops:  [#1] SMP PTI
>   CPU: 2 PID: 1728 Comm: page-types Not tainted 
> 4.17.0-rc6-mm1-v4.17-rc6-180605-0816-00236-g2dfb086ef02c+ #160
>   Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.11.0-2.fc28 
> 04/01/2014
>   RIP: 0010:stable_page_flags+0x27/0x3c0
>   Code: 00 00 00 0f 1f 44 00 00 48 85 ff 0f 84 a0 03 00 00 41 54 55 49 89 fc 
> 53 48 8b 57 08 48 8b 2f 48 8d 42 ff 83 e2 01 48 0f 44 c7 <48> 8b 00 f6 c4 01 
> 0f 84 10 03 00 00 31 db 49 8b 54 24 08 4c 89 e7
>   RSP: 0018:bbd44111fde0 EFLAGS: 00010202
>   RAX: fffe RBX: 7fffeff9 RCX: 
>   RDX: 0001 RSI: 0202 RDI: ed1182fff5c0
>   RBP:  R08: 0001 R09: 0001
>   R10: bbd44111fed8 R11:  R12: ed1182fff5c0
>   R13: 000bffd7 R14: 02fff5c0 R15: bbd44111ff10
>   FS:  7efc4335a500() GS:93a5bfc0() knlGS:
>   CS:  0010 DS:  ES:  CR0: 80050033
>   CR2: fffe CR3: b2a58000 CR4: 001406e0
>   Call Trace:
>kpageflags_read+0xc7/0x120
>proc_reg_read+0x3c/0x60
>__vfs_read+0x36/0x170
>vfs_read+0x89/0x130
>ksys_pread64+0x71/0x90
>do_syscall_64+0x5b/0x160
>entry_SYSCALL_64_after_hwframe+0x44/0xa9
>   RIP: 0033:0x7efc42e75e23
>   Code: 09 00 ba 9f 01 00 00 e8 ab 81 f4 ff 66 2e 0f 1f 84 00 00 00 00 00 90 
> 83 3d 29 0a 2d 00 00 75 13 49 89 ca b8 11 00 00 00 0f 05 <48> 3d 01 f0 ff ff 
> 73 34 c3 48 83 ec 08 e8 db d3 01 00 48 89 04 24
> 
> According to kernel bisection, this problem became visible due to commit
> f7f99100d8d9 which changes how struct pages are initialized.
> 
> Memblock layout affects the pfn ranges covered by node/zone. Consider
> that we have a VM with 2 NUMA nodes and each node has 4GB memory, and
> the default (no memmap= given) memblock layout is like below:
> 
>   MEMBLOCK configuration:
>memory size = 0x0001fff75c00 reserved size = 0x0300c000
>memory.cnt  = 0x4
>memory[0x0] [0x1000-0x0009efff], 
> 0x0009e000 bytes on node 0 flags: 0x0
>memory[0x1] [0x0010-0xbffd6fff], 
> 0xbfed7000 bytes on node 0 flags: 0x0
>memory[0x2] [0x0001-0x00013fff], 
> 0x4000 bytes on node 0 flags: 0x0
>memory[0x3] [0x00014000-0x00023fff], 
> 0x0001 bytes on node 1 flags: 0x0
>...
> 
> If you give memmap=1G!4G (so it just covers memory[0x2]),
> the range [0x1-0x13fff] is gone:
> 
>   MEMBLOCK configuration:
>memory size = 0x0001bff75c00 reserved size = 0x0300c000
>memory.cnt  = 0x3
>memory[0x0] [0x1000-0x0009efff], 
> 0x0009e000 bytes on node 0 flags: 0x0
>memory[0x1] [0x0010-0xbffd6fff], 
> 0xbfed7000 bytes on node 0 flags: 0x0
>memory[0x2] [0x00014000-0x00023fff], 
> 0x0001 bytes on node 1 flags: 0x0
>...
> 
> This causes shrinking node 0's pfn range because it is calculated by
> the address range of memblock.memory. So some of struct pages in the
> gap range are left uninitialized.
> 
> We have a function zero_resv_unavail() which does zeroing the struct
> pages outside memblock.memory, but currently it covers only the reserved
> unavailable range (i.e. memblock.memory && !memblock.reserved).
> This patch extends it to cover all unavailable range, which fixes
> the reported issue.
> 
> Fixes: f7f99100d8d9 ("mm: stop zeroing memory during allocation in vmemmap")
> Signed-off-by: Naoya Horiguchi 
> Tested-by: Oscar Salvador 
> Tested-by: Masayoshi Mizuma 

Reviewed-by: Pavel Tatashin 

Also, please review and add the following patch to this series:

From 6d23e66e979244734a06c1b636742c2568121b39 Mon Sep 17 00:00:00 2001
From: Pavel Tatashin 
Date: Mon, 27 Aug 2018 19:10:35 -0400
Subject: [PATCH] mm: return zero_resv_unavail optimization

When checking for valid pfns in zero_resv_unavail(), it is not necessary
to verify that every pfn within a pageblock_nr_pages range is valid; only
the first one needs to be checked. This is because memory for struct
pages is allocated in contiguous chunks that contain pageblock_nr_pages
struct pages.

Signed-off-by: Pavel Tatashin 
---
 mm/page_alloc.c | 46 ++
 1 file changed, 26 insertions(+), 20 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 650d8f16a67e..5dfc206db40e 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -6441,6 +6441,29 @@ void __init free_area_init_node(int 
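
(The hunk above is truncated in this archive. The change it describes
amounts to roughly the following shape inside the zero_resv_unavail()
loop; an approximation, not the verbatim patch:)

        for (pfn = PFN_DOWN(start); pfn < PFN_UP(end); pfn++) {
                /* only the first pfn of each pageblock needs the check */
                if (!pfn_valid(ALIGN_DOWN(pfn, pageblock_nr_pages)))
                        continue;
                mm_zero_struct_page(pfn_to_page(pfn));
                pgcnt++;
        }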

Re: [PATCH 1/2] Revert "x86/e820: put !E820_TYPE_RAM regions into memblock.reserved"

2018-08-27 Thread Pasha Tatashin
On 8/23/18 2:25 PM, Masayoshi Mizuma wrote:
> From: Masayoshi Mizuma 
> 
> commit 124049decbb1 ("x86/e820: put !E820_TYPE_RAM regions into
> memblock.reserved") breaks movable_node kernel option because it
> changed the memory gap range to reserved memblock. So, the node
> is marked as Normal zone even if the SRAT has Hot plaggable affinity.
> 
> =
> kernel: BIOS-e820: [mem 0x1800-0x180f] usable
> kernel: BIOS-e820: [mem 0x1c00-0x1c0f] usable
> ...
> kernel: reserved[0x12]#011[0x1810-0x1bff], 
> 0x03f0 bytes flags: 0x0
> ...
> kernel: ACPI: SRAT: Node 2 PXM 6 [mem 0x1800-0x1bff] 
> hotplug
> kernel: ACPI: SRAT: Node 3 PXM 7 [mem 0x1c00-0x1fff] 
> hotplug
> ...
> kernel: Movable zone start for each node
> kernel:  Node 3: 0x1c00
> kernel: Early memory node ranges
> ...
> =
> 
> Naoya's v1 patch [*] fixes the original issue and this movable_node
> issue doesn't occur.
> Let's revert commit 124049decbb1 ("x86/e820: put !E820_TYPE_RAM
> regions into memblock.reserved") and apply the v1 patch.
> 
> [*] https://lkml.org/lkml/2018/6/13/27
> 
> Signed-off-by: Masayoshi Mizuma 

Reviewed-by: Pavel Tatashin 

Re: [PATCH 1/2] Revert "x86/e820: put !E820_TYPE_RAM regions into memblock.reserved"

2018-08-27 Thread Pasha Tatashin
On Mon, Aug 27, 2018 at 8:31 AM Masayoshi Mizuma  wrote:
>
> Hi Pavel,
>
> I would appreciate if you could send the feedback for the patch.

I will study it today.

Pavel

>
> Thanks!
> Masa
>
> On 08/24/2018 04:29 AM, Michal Hocko wrote:
> > On Fri 24-08-18 00:03:25, Naoya Horiguchi wrote:
> >> (CCed related people)
> >
> > Fixup Pavel email.
> >
> >>
> >> Hi Mizuma-san,
> >>
> >> Thank you for the report.
> >> The mentioned patch was created based on feedbacks from 
> >> reviewers/maintainers,
> >> so I'd like to hear from them about how we should handle the issue.
> >>
> >> And one note is that there is a follow-up patch for "x86/e820: put 
> >> !E820_TYPE_RAM
> >> regions into memblock.reserved" which might be affected by your changes.
> >>
> >>> commit e181ae0c5db9544de9c53239eb22bc012ce75033
> >>> Author: Pavel Tatashin 
> >>> Date:   Sat Jul 14 09:15:07 2018 -0400
> >>>
> >>> mm: zero unavailable pages before memmap init
> >>
> >> Thanks,
> >> Naoya Horiguchi
> >>
> >> On Thu, Aug 23, 2018 at 02:25:12PM -0400, Masayoshi Mizuma wrote:
> >>> From: Masayoshi Mizuma 
> >>>
> >>> commit 124049decbb1 ("x86/e820: put !E820_TYPE_RAM regions into
> >>> memblock.reserved") breaks movable_node kernel option because it
> >>> changed the memory gap range to reserved memblock. So, the node
> >>> is marked as Normal zone even if the SRAT has Hot plaggable affinity.
> >>>
> >>> =
> >>> kernel: BIOS-e820: [mem 0x1800-0x180f] usable
> >>> kernel: BIOS-e820: [mem 0x1c00-0x1c0f] usable
> >>> ...
> >>> kernel: reserved[0x12]#011[0x1810-0x1bff], 
> >>> 0x03f0 bytes flags: 0x0
> >>> ...
> >>> kernel: ACPI: SRAT: Node 2 PXM 6 [mem 0x1800-0x1bff] 
> >>> hotplug
> >>> kernel: ACPI: SRAT: Node 3 PXM 7 [mem 0x1c00-0x1fff] 
> >>> hotplug
> >>> ...
> >>> kernel: Movable zone start for each node
> >>> kernel:  Node 3: 0x1c00
> >>> kernel: Early memory node ranges
> >>> ...
> >>> =
> >>>
> >>> Naoya's v1 patch [*] fixes the original issue and this movable_node
> >>> issue doesn't occur.
> >>> Let's revert commit 124049decbb1 ("x86/e820: put !E820_TYPE_RAM
> >>> regions into memblock.reserved") and apply the v1 patch.
> >>>
> >>> [*] https://lkml.org/lkml/2018/6/13/27
> >>>
> >>> Signed-off-by: Masayoshi Mizuma 
> >>> ---
> >>>  arch/x86/kernel/e820.c | 15 +++
> >>>  1 file changed, 3 insertions(+), 12 deletions(-)
> >>>
> >>> diff --git a/arch/x86/kernel/e820.c b/arch/x86/kernel/e820.c
> >>> index c88c23c658c1..d1f25c831447 100644
> >>> --- a/arch/x86/kernel/e820.c
> >>> +++ b/arch/x86/kernel/e820.c
> >>> @@ -1248,7 +1248,6 @@ void __init e820__memblock_setup(void)
> >>>  {
> >>> int i;
> >>> u64 end;
> >>> -   u64 addr = 0;
> >>>
> >>> /*
> >>>  * The bootstrap memblock region count maximum is 128 entries
> >>> @@ -1265,21 +1264,13 @@ void __init e820__memblock_setup(void)
> >>> struct e820_entry *entry = &e820_table->entries[i];
> >>>
> >>> end = entry->addr + entry->size;
> >>> -   if (addr < entry->addr)
> >>> -   memblock_reserve(addr, entry->addr - addr);
> >>> -   addr = end;
> >>> if (end != (resource_size_t)end)
> >>> continue;
> >>>
> >>> -   /*
> >>> -* all !E820_TYPE_RAM ranges (including gap ranges) are put
> >>> -* into memblock.reserved to make sure that struct pages in
> >>> -* such regions are not left uninitialized after bootup.
> >>> -*/
> >>> if (entry->type != E820_TYPE_RAM && entry->type != 
> >>> E820_TYPE_RESERVED_KERN)
> >>> -   memblock_reserve(entry->addr, entry->size);
> >>> -   else
> >>> -   memblock_add(entry->addr, entry->size);
> >>> +   continue;
> >>> +
> >>> +   memblock_add(entry->addr, entry->size);
> >>> }
> >>>
> >>> /* Throw away partial pages: */
> >>> --
> >>> 2.18.0
> >>>
> >>>
> >
>

Re: [RESEND PATCH v10 6/6] mm: page_alloc: reduce unnecessary binary search in early_pfn_valid()

2018-08-16 Thread Pasha Tatashin


On 7/6/18 5:01 AM, Jia He wrote:
> Commit b92df1de5d28 ("mm: page_alloc: skip over regions of invalid pfns
> where possible") optimized the loop in memmap_init_zone(). But there is
> still some room for improvement. E.g. in early_pfn_valid(), if pfn and
> pfn+1 are in the same memblock region, we can record the last returned
> memblock region index and check whether pfn++ is still in the same
> region.
> 
> Currently it only improve the performance on arm/arm64 and will have no
> impact on other arches.
> 
> For the performance improvement, after this set, I can see the time
> overhead of memmap_init() is reduced from 27956us to 13537us in my
> armv8a server(QDF2400 with 96G memory, pagesize 64k).

This series would be a lot simpler if patches 4, 5, and 6 were dropped.
The extra complexity does not make sense just to save 0.0001s/T during boot.

Patches 1-3 look OK, but without patches 4-5 the __init_memblock variable
should be made a local static, as I suggested earlier.

So, I think Jia should re-spin this series with only 3 patches. Or,
Andrew could remove them from linux-next before the merge.

Thank you,
Pavel
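
For reference, the region-index caching these patches build on is roughly
the following (a sketch against memblock internals, not the posted code):

static int early_region_idx __initdata = -1;

static bool __init pfn_in_cached_region(unsigned long pfn)
{
        struct memblock_type *type = &memblock.memory;
        phys_addr_t addr = PFN_PHYS(pfn);
        unsigned long i;

        /* fast path: most pfns fall into the same region as the previous one */
        if (early_region_idx >= 0) {
                struct memblock_region *r = &type->regions[early_region_idx];

                if (addr >= r->base && addr < r->base + r->size)
                        return true;
        }

        /* slow path: rescan the regions and remember where we landed */
        for (i = 0; i < type->cnt; i++) {
                struct memblock_region *r = &type->regions[i];

                if (addr >= r->base && addr < r->base + r->size) {
                        early_region_idx = i;
                        return true;
                }
        }
        return false;
}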

Re: [RESEND PATCH v10 3/6] mm: page_alloc: reduce unnecessary binary search in memblock_next_valid_pfn()

2018-08-16 Thread Pasha Tatashin


On 8/16/18 9:08 PM, Pavel Tatashin wrote:
> 
>> Signed-off-by: Jia He 
>> ---
>>  mm/memblock.c | 37 +
>>  1 file changed, 29 insertions(+), 8 deletions(-)
>>
>> diff --git a/mm/memblock.c b/mm/memblock.c
>> index ccad225..84f7fa7 100644
>> --- a/mm/memblock.c
>> +++ b/mm/memblock.c
>> @@ -1140,31 +1140,52 @@ int __init_memblock memblock_set_node(phys_addr_t 
>> base, phys_addr_t size,
>>  #endif /* CONFIG_HAVE_MEMBLOCK_NODE_MAP */
>>  
>>  #ifdef CONFIG_HAVE_MEMBLOCK_PFN_VALID
>> +static int early_region_idx __init_memblock = -1;
> 
> One comment:
> 
> This should be __initdata, but even better, bring it inside the function
> as a local static variable.

Disregard this comment, this global is used in the next commits. So,
everything is OK. No need for __initdata either.

> 
>>  ulong __init_memblock memblock_next_valid_pfn(ulong pfn)
>>  {
> 
> Otherwise looks good:
> 
> Reviewed-by: Pavel Tatashin 
> 
> 

Re: [RESEND PATCH v10 3/6] mm: page_alloc: reduce unnecessary binary search in memblock_next_valid_pfn()

2018-08-16 Thread Pasha Tatashin

> Signed-off-by: Jia He 
> ---
>  mm/memblock.c | 37 +
>  1 file changed, 29 insertions(+), 8 deletions(-)
> 
> diff --git a/mm/memblock.c b/mm/memblock.c
> index ccad225..84f7fa7 100644
> --- a/mm/memblock.c
> +++ b/mm/memblock.c
> @@ -1140,31 +1140,52 @@ int __init_memblock memblock_set_node(phys_addr_t 
> base, phys_addr_t size,
>  #endif /* CONFIG_HAVE_MEMBLOCK_NODE_MAP */
>  
>  #ifdef CONFIG_HAVE_MEMBLOCK_PFN_VALID
> +static int early_region_idx __init_memblock = -1;

One comment:

This should be __initdata, but even better, bring it inside the function
as a local static variable.

>  ulong __init_memblock memblock_next_valid_pfn(ulong pfn)
>  {

Otherwise looks good:

Reviewed-by: Pavel Tatashin 



Re: [RESEND PATCH v10 2/6] mm: page_alloc: remain memblock_next_valid_pfn() on arm/arm64

2018-08-16 Thread Pasha Tatashin
On 18-07-06 17:01:11, Jia He wrote:
> From: Jia He 
> 
> Commit b92df1de5d28 ("mm: page_alloc: skip over regions of invalid pfns
> where possible") optimized the loop in memmap_init_zone(). But it causes
> possible panic bug. So Daniel Vacek reverted it later.
> 
> But as suggested by Daniel Vacek, it is fine to using memblock to skip
> gaps and finding next valid frame with CONFIG_HAVE_ARCH_PFN_VALID.
> Daniel said:
> "On arm and arm64, memblock is used by default. But generic version of
> pfn_valid() is based on mem sections and memblock_next_valid_pfn() does
> not always return the next valid one but skips more resulting in some
> valid frames to be skipped (as if they were invalid). And that's why
> kernel was eventually crashing on some !arm machines."
> 
> About the performance consideration:
> As said by James in b92df1de5,
> "I have tested this patch on a virtual model of a Samurai CPU
> with a sparse memory map.  The kernel boot time drops from 109 to
> 62 seconds."
> 
> Thus it would be better if we remain memblock_next_valid_pfn on arm/arm64.
> 
> Suggested-by: Daniel Vacek 
> Signed-off-by: Jia He 

The version of this patch in linux-next has a few fixes; I reviewed that one
and it looks good to me.

Reviewed-by: Pavel Tatashin 

Re: [RESEND PATCH v10 0/6] optimize memblock_next_valid_pfn and early_pfn_valid on arm and arm64

2018-08-16 Thread Pasha Tatashin
On 18-08-15 15:34:56, Andrew Morton wrote:
> On Fri,  6 Jul 2018 17:01:09 +0800 Jia He  wrote:
> 
> > Commit b92df1de5d28 ("mm: page_alloc: skip over regions of invalid pfns
> > where possible") optimized the loop in memmap_init_zone(). But it causes
> > possible panic bug. So Daniel Vacek reverted it later.
> > 
> > But as suggested by Daniel Vacek, it is fine to using memblock to skip
> > gaps and finding next valid frame with CONFIG_HAVE_ARCH_PFN_VALID.
> > 
> > More from what Daniel said:
> > "On arm and arm64, memblock is used by default. But generic version of
> > pfn_valid() is based on mem sections and memblock_next_valid_pfn() does
> > not always return the next valid one but skips more resulting in some
> > valid frames to be skipped (as if they were invalid). And that's why
> > kernel was eventually crashing on some !arm machines."
> > 
> > About the performance consideration:
> > As said by James in b92df1de5,
> > "I have tested this patch on a virtual model of a Samurai CPU with a
> > sparse memory map.  The kernel boot time drops from 109 to 62 seconds."
> > Thus it would be better if we remain memblock_next_valid_pfn on arm/arm64.
> > 
> > Besides we can remain memblock_next_valid_pfn, there is still some room
> > for improvement. After this set, I can see the time overhead of memmap_init
> > is reduced from 27956us to 13537us in my armv8a server(QDF2400 with 96G
> > memory, pagesize 64k). I believe arm server will benefit more if memory is
> > larger than TBs
> 
> This patchset is basically unreviewed at this stage.  Could people
> please find some time to check it carefully?

Working on it.

Pavel

Re: [PATCH v3 4/4] mm/memory_hotplug: Drop node_online check in unregister_mem_sect_under_nodes

2018-08-16 Thread Pasha Tatashin
On 18-08-15 16:42:19, Oscar Salvador wrote:
> From: Oscar Salvador 
> 
> We are getting the nid from the pages that are not yet removed,
> but a node can only be offline when its memory/cpu's have been removed.
> Therefore, we know that the node is still online.

Reviewed-by: Pavel Tatashin 

> 
> Signed-off-by: Oscar Salvador 
> ---
>  drivers/base/node.c | 2 --
>  1 file changed, 2 deletions(-)
> 
> diff --git a/drivers/base/node.c b/drivers/base/node.c
> index 81b27b5b1f15..b23769e4fcbb 100644
> --- a/drivers/base/node.c
> +++ b/drivers/base/node.c
> @@ -465,8 +465,6 @@ void unregister_mem_sect_under_nodes(struct memory_block 
> *mem_blk,
>  
>   if (nid < 0)
>   continue;
> - if (!node_online(nid))
> - continue;
>   /*
>* It is possible that NODEMASK_ALLOC fails due to memory
>* pressure.
> -- 
> 2.13.6
> 

Re: [PATCH v3 3/4] mm/memory_hotplug: Refactor unregister_mem_sect_under_nodes

2018-08-16 Thread Pasha Tatashin
> > d) What's the maximum number of nodes, ever?  Perhaps we can always
> >fit a nodemask_t onto the stack, dunno.
> 
> Right now, we define the maximum as NODES_SHIFT = 10, so:
> 
> 1 << 10 = 1024 Maximum nodes.
> 
> Since this makes only 128 bytes, I wonder if we can just go ahead and define 
> a nodemask_t
> whithin the stack.
> 128 bytes is not that much, is it?

Yeah, use the stack here, 128b is tiny. This will also solve Andrew's point about
having an untested path when the alloc fails, and simplify the patch overall.

Thank you,
Pavel

Re: [PATCH v3 2/4] mm/memory_hotplug: Drop mem_blk check from unregister_mem_sect_under_nodes

2018-08-16 Thread Pasha Tatashin
On 18-08-15 16:42:17, Oscar Salvador wrote:
> From: Oscar Salvador 
> 
> Before calling to unregister_mem_sect_under_nodes(),
> remove_memory_section() already checks if we got a valid memory_block.
> 
> No need to check that again in unregister_mem_sect_under_nodes().
> 
> If more functions start using unregister_mem_sect_under_nodes() in the
> future, we can always place a WARN_ON to catch null mem_blk's so we can
> safely back off.
> 
> For now, let us keep the check in remove_memory_section() since it is the
> only function that uses it.
> 
> Signed-off-by: Oscar Salvador 
> Reviewed-by: Andrew Morton 

Reviewed-by: Pavel Tatashin 

> ---
>  drivers/base/node.c | 4 
>  1 file changed, 4 deletions(-)
> 
> diff --git a/drivers/base/node.c b/drivers/base/node.c
> index 1ac4c36e13bb..dd3bdab230b2 100644
> --- a/drivers/base/node.c
> +++ b/drivers/base/node.c
> @@ -455,10 +455,6 @@ int unregister_mem_sect_under_nodes(struct memory_block 
> *mem_blk,
>   NODEMASK_ALLOC(nodemask_t, unlinked_nodes, GFP_KERNEL);
>   unsigned long pfn, sect_start_pfn, sect_end_pfn;
>  
> - if (!mem_blk) {
> - NODEMASK_FREE(unlinked_nodes);
> - return -EFAULT;
> - }
>   if (!unlinked_nodes)
>   return -ENOMEM;
>   nodes_clear(*unlinked_nodes);
> -- 
> 2.13.6
> 

Re: [PATCH v3 1/4] mm/memory-hotplug: Drop unused args from remove_memory_section

2018-08-16 Thread Pasha Tatashin
On 18-08-15 16:42:16, Oscar Salvador wrote:
> From: Oscar Salvador 
> 
> unregister_memory_section() calls remove_memory_section()
> with three arguments:
> 
> * node_id
> * section
> * phys_device
> 
> Neither node_id nor phys_device are used.
> Let us drop them from the function.

Looks good:
Reviewed-by: Pavel Tatashin 

> 
> Signed-off-by: Oscar Salvador 
> Reviewed-by: David Hildenbrand 
> Reviewed-by: Andrew Morton 
> ---
>  drivers/base/memory.c | 5 ++---
>  1 file changed, 2 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/base/memory.c b/drivers/base/memory.c
> index c8a1cb0b6136..2c622a9a7490 100644
> --- a/drivers/base/memory.c
> +++ b/drivers/base/memory.c
> @@ -752,8 +752,7 @@ unregister_memory(struct memory_block *memory)
>   device_unregister(&memory->dev);
>  }
>  
> -static int remove_memory_section(unsigned long node_id,
> -struct mem_section *section, int phys_device)
> +static int remove_memory_section(struct mem_section *section)
>  {
>   struct memory_block *mem;
>  
> @@ -785,7 +784,7 @@ int unregister_memory_section(struct mem_section *section)
>   if (!present_section(section))
>   return -EINVAL;
>  
> - return remove_memory_section(0, section, 0);
> + return remove_memory_section(section);
>  }
>  #endif /* CONFIG_MEMORY_HOTREMOVE */
>  
> -- 
> 2.13.6
> 

Re: [PATCH v2 3/4] mm/memory_hotplug: Make register_mem_sect_under_node a cb of walk_memory_range

2018-08-16 Thread Pasha Tatashin
On 18-06-22 13:18:38, osalva...@techadventures.net wrote:
> From: Oscar Salvador 
> 
> link_mem_sections() and walk_memory_range() share most of the code,
> so we can use convert link_mem_sections() into a dummy function that calls
> walk_memory_range() with a callback to register_mem_sect_under_node().
> 
> This patch converts register_mem_sect_under_node() in order to
> match a walk_memory_range's callback, getting rid of the
> check_nid argument and checking instead if the system is still
> boothing, since we only have to check for the nid if the system
> is in such state.
> 
> Signed-off-by: Oscar Salvador 
> Suggested-by: Pavel Tatashin 

Reviewed-by: Pavel Tatashin 

Re: [PATCH v10 05/10] mm: zero reserved and unavailable struct pages

2017-10-06 Thread Pasha Tatashin

Hi Michal,



As I've said in other reply this should go in only if the scenario you
describe is real. I am somehow suspicious to be honest. I simply do not
see how those weird struct pages would be in a valid pfn range of any
zone.



There are examples of both: cases where unavailable memory is not part of any
zone, and cases where it is part of a zone.


I run Linux in kvm with these arguments:

qemu-system-x86_64
-enable-kvm
-cpu kvm64
-kernel $kernel
-initrd $initrd
-m 512
-smp 2
-device e1000,netdev=net0
-netdev user,id=net0
-boot order=nc
-no-reboot
-watchdog i6300esb
-watchdog-action debug
-rtc base=localtime
-serial stdio
-display none
-monitor null

This patch reports that there are 98 unavailable pages.

They are: pfn 0 and pfns in range [159, 255].

Note that trim_low_memory_range() reserves only pfns in the range [0, 15];
it does not reserve the [159, 255] ones.


e820__memblock_setup() reports to Linux that the following physical ranges
are available:

[1, 158]
[256, 130783]

Notice that exactly the unavailable pfns are missing!

Now, let's check what we have in zone 0: [1, 131039]

pfn 0 is not part of the zone, but pfns [1, 158] are.

However, the bigger problem we have if we do not initialize these struct 
pages is with memory hotplug. That path operates at 2M boundaries 
(section_nr) and checks whether a 2M range of pages is hot removable. 
It starts with the first pfn from the zone, rounds it down to the 2M 
boundary (struct pages are allocated at 2M boundaries when vmemmap is 
created), and checks if that section is hot removable. In this case we 
start with pfn 1 and round it down to pfn 0.
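
To illustrate the rounding (a simplified sketch of the hot-remove path
described above; the exact helpers and constants in the hotplug code may
differ):

	unsigned long zone_pfn  = 1;	/* first pfn of the zone */
	unsigned long start_pfn = round_down(zone_pfn, PAGES_PER_SECTION); /* -> 0 */
	struct page *page = pfn_to_page(start_pfn);	/* struct page of pfn 0 */

	/* is_pageblock_removable_nolock(page) now reads fields of a struct
	 * page that never went through __init_single_page(). */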


Later the pfn is converted to a struct page, and some fields are checked. 
Now, if we do not zero struct pages, we get unpredictable results. In fact, 
when CONFIG_DEBUG_VM is enabled and we explicitly set all vmemmap 
memory to ones, I get the following panic during kernel testing 
without this patch applied:


[   23.277793] BUG: unable to handle kernel NULL pointer dereference at 
 (null)

[   23.278863] IP: is_pageblock_removable_nolock+0x35/0x90
[   23.279619] PGD 0 P4D 0
[   23.280031] Oops:  [#1] PREEMPT
[   23.280527] CPU: 0 PID: 249 Comm: udevd Not tainted 
4.14.0-rc3_pt_memset10-00335-g5e2c7478bed5-dirty #8

[   23.281735] task: 88001f4e2900 task.stack: c9314000
[   23.282532] RIP: 0010:is_pageblock_removable_nolock+0x35/0x90
[   23.283275] RSP: 0018:c9317d60 EFLAGS: 00010202
[   23.283948] RAX:  RBX: 88001d92b000 RCX: 

[   23.284862] RDX:  RSI: 0020 RDI: 
88001d92b000
[   23.285771] RBP: c9317d80 R08: 10c8 R09: 

[   23.286542] R10:  R11:  R12: 
88001db2b000
[   23.287264] R13: 81af6d00 R14: 88001f7d5000 R15: 
82a1b6c0
[   23.287971] FS:  7f4eb857f7c0() GS:81c27000() 
knlGS:

[   23.288775] CS:  0010 DS:  ES:  CR0: 80050033
[   23.289355] CR2:  CR3: 1f4e6000 CR4: 
06b0

[   23.290066] Call Trace:
[   23.290323]  ? is_mem_section_removable+0x5a/0xd0
[   23.290798]  show_mem_removable+0x6b/0xa0
[   23.291204]  dev_attr_show+0x1b/0x50
[   23.291565]  sysfs_kf_seq_show+0xa1/0x100
[   23.291967]  kernfs_seq_show+0x22/0x30
[   23.292349]  seq_read+0x1ac/0x3a0
[   23.292687]  kernfs_fop_read+0x36/0x190
[   23.293074]  ? security_file_permission+0x90/0xb0
[   23.293547]  __vfs_read+0x16/0x30
[   23.293884]  vfs_read+0x81/0x130
[   23.294214]  SyS_read+0x44/0xa0
[   23.294537]  entry_SYSCALL_64_fastpath+0x1f/0xbd
[   23.295003] RIP: 0033:0x7f4eb7c660a0
[   23.295364] RSP: 002b:7ffda6cffe28 EFLAGS: 0246 ORIG_RAX: 

[   23.296152] RAX: ffda RBX: 03de RCX: 
7f4eb7c660a0
[   23.296934] RDX: 1000 RSI: 7ffda6cffec8 RDI: 
0005
[   23.297963] RBP: 7ffda6cffde8 R08: 7379732f73656369 R09: 
6f6d656d2f6d6574
[   23.299198] R10: 726f6d656d2f7972 R11: 0246 R12: 
0022
[   23.300400] R13: 561d68ea7710 R14:  R15: 
7ffda6d05c78
[   23.301591] Code: c1 ea 35 49 c1 e8 2b 48 8b 14 d5 c0 b6 a1 82 41 83 
e0 03 48 85 d2 74 0c 48 c1 e8 29 25 f0 0f 00 00 48 01 c2 4d 69 c0 98 
05 00 00 <48> 8b 02 48 89 fa 48 83 e0 f8 49 8b 88 28 b5 d3 81 48 29 c2 49
[   23.304739] RIP: is_pageblock_removable_nolock+0x35/0x90 RSP: 
c9317d60

[   23.305940] CR2: 


Re: [PATCH] mm: deferred_init_memmap improvements

2017-10-06 Thread Pasha Tatashin

Hi Anshuman,

Thank you very much for looking at this. My replies are below.

On 10/06/2017 02:48 AM, Anshuman Khandual wrote:

On 10/04/2017 08:59 PM, Pavel Tatashin wrote:

This patch fixes another existing issue on systems that have holes in
zones i.e CONFIG_HOLES_IN_ZONE is defined.

In for_each_mem_pfn_range() we have code like this:

if (!pfn_valid_within(pfn)
goto free_range;

Note: 'page' is not set to NULL and is not incremented but 'pfn' advances.


page is initialized to NULL at the beginning of the function.


Yes, it is initialized to NULL, but at the beginning of the
for_each_mem_pfn_range() loop.



PFN advances but we dont proceed unless pfn_valid_within(pfn)
holds true which basically should have checked with arch call
back if the PFN is valid in presence of memory holes as well.
Is not this correct ?


Correct: if pfn_valid_within() is false, we jump to the "free_range:" 
label, which is at the end of the for (; pfn < end_pfn; pfn++) loop, 
so we are not jumping outside of this loop.





Thus means if deferred struct pages are enabled on systems with these kind
of holes, linux would get memory corruptions. I have fixed this issue by
defining a new macro that performs all the necessary operations when we
free the current set of pages.


If we bail out in case PFN is not valid, then how corruption
can happen ?



We are not bailing out. We continue with the next iteration and the next 
pfn, but 'page' is not incremented.
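
A simplified sketch of the loop shape under discussion (most of the
per-page work is omitted; the variables are the ones the existing
deferred_init_memmap() code already uses):

	for (; pfn < end_pfn; pfn++) {
		if (!pfn_valid_within(pfn))
			goto free_range;	/* pfn advances, 'page' is left stale */

		if (!meminit_pfn_in_nid(pfn, nid, &nid_init_state)) {
			page = NULL;		/* the other bail-outs do reset it */
			goto free_range;
		}

		if (page)
			page++;			/* out of sync if the previous pfn was skipped */
		else
			page = pfn_to_page(pfn);

		__init_single_page(page, pfn, zid, nid);
		if (!free_base_page) {
			free_base_page = page;
			free_base_pfn = pfn;
		}
		nr_to_free++;
		continue;
free_range:
		deferred_free_range(free_base_page, free_base_pfn, nr_to_free);
		free_base_page = NULL;
		free_base_pfn = nr_to_free = 0;
	}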


Please let me know if I am missing something.

Thank you,
Pasha


Re: [PATCH v9 08/12] mm: zero reserved and unavailable struct pages

2017-10-04 Thread Pasha Tatashin



On 10/04/2017 10:04 AM, Michal Hocko wrote:

On Wed 04-10-17 09:28:55, Pasha Tatashin wrote:



I am not really familiar with the trim_low_memory_range code path. I am
not even sure we have to care about it because nobody should be walking
pfns outside of any zone.


According to the commit comments, the first 4K belongs to the BIOS, so I think
the memory exists but the BIOS may or may not report it to Linux. So, reserve
it to make sure we never touch it.


Yes and that memory should be outside of any zones, no?


I am not totally sure, I think some x86 expert could help us here. But, 
in either case this issue can be fixed separately from the rest of the 
series.





I am worried that this patch adds a code which
is not really used and it will just stay that way for ever because
nobody will dare to change it as it is too obscure and not explained
very well.


I could explain my code better. Perhaps add more comments, and explain
when it can be removed?


More explanation would be definitely helpful


trim_low_memory_range is a good example of this. Why do we
even reserve this range from the memory block allocator? The memory
shouldn't be backed by any real memory and thus not in the allocator in
the first place, no?



Since it is not enforced in memblock that everything in reserved list must
be part of memory list, we can have it, and we need to make sure kernel does
not panic. Otherwise, it is very hard to detect such bugs.


So, should we report such a memblock reservation API (ab)use to the log?
Are you actually sure that trim_low_memory_range is doing a sane and
really needed thing? In other words do we have a zone which contains
this no-memory backed pfns?



And, this patch reports it already:

+   pr_info("Reserved but unavailable: %lld pages", pgcnt);

I could add a comment above this print call explaining that such memory is 
probably bogus and must be studied/fixed. Also, add that this code can 
be removed once memblock is changed to allow reserving only memory that is 
backed by physical memory, i.e. in the "memory" list.


Pasha


Re: [PATCH v9 08/12] mm: zero reserved and unavailable struct pages

2017-10-04 Thread Pasha Tatashin



I am not really familiar with the trim_low_memory_range code path. I am
not even sure we have to care about it because nobody should be walking
pfns outside of any zone.


According to the commit comments, the first 4K belongs to the BIOS, so I think 
the memory exists but the BIOS may or may not report it to Linux. So, reserve 
it to make sure we never touch it.


 I am worried that this patch adds a code which

is not really used and it will just stay that way for ever because
nobody will dare to change it as it is too obscure and not explained
very well.


I could explain my code better. Perhaps add more comments, and explain 
when it can be removed?


 trim_low_memory_range is a good example of this. Why do we

even reserve this range from the memory block allocator? The memory
shouldn't be backed by any real memory and thus not in the allocator in
the first place, no?



Since it is not enforced in memblock that everything in reserved list 
must be part of memory list, we can have it, and we need to make sure 
kernel does not panic. Otherwise, it is very hard to detect such bugs.


Re: [PATCH v9 08/12] mm: zero reserved and unavailable struct pages

2017-10-04 Thread Pasha Tatashin

Could you be more specific where is such a memory reserved?



I know of one example: trim_low_memory_range() unconditionally reserves from
pfn 0, but e820__memblock_setup() might provide the exiting memory from pfn
1 (i.e. KVM).


Then just initialize struct pages for that mapping rigth there where a
special API is used.


But, there could be more based on this comment from linux/page-flags.h:

  19  * PG_reserved is set for special pages, which can never be swapped out.
Some
  20  * of them might not even exist (eg empty_bad_page)...


I have no idea wht empty_bad_page is but a quick grep shows that this is
never used. I might be wrong here but if somebody is reserving a memory
in a special way then we should handle the initialization right there.
E.g. create an API for special memblock reservations.



Hi Michal,

The reservations happen before struct pages are allocated and mapped. 
So, it is not always possible to do it at call sites.


Previously, I have solved this problem like this:

https://patchwork.kernel.org/patch/9886163

But I was not too happy with that approach, so I replaced it with the 
current approach, as it is more generic and solves similar issues if 
they happen in other places. Also, the comment in page-flags got me 
scared that there are probably other places, perhaps on other 
architectures, that can have a similar issue.


In addition, I did not like my solution: I was simply shrinking the low 
reservation from [0, reserve_low) to [min_pfn, reserve_low), but if 
min_pfn > reserve_low, can we skip the low reservation entirely? I was 
not sure.


The current approach notifies us if there are such pages, and we can 
fix/remove them in the future without crashing the kernel in the meantime.


Pasha


Re: [PATCH v9 06/12] mm: zero struct pages during initialization

2017-10-04 Thread Pasha Tatashin

On 10/04/2017 04:45 AM, Michal Hocko wrote:

On Tue 03-10-17 11:22:35, Pasha Tatashin wrote:



On 10/03/2017 09:08 AM, Michal Hocko wrote:

On Wed 20-09-17 16:17:08, Pavel Tatashin wrote:

Add struct page zeroing as a part of initialization of other fields in
__init_single_page().

This single thread performance collected on: Intel(R) Xeon(R) CPU E7-8895
v3 @ 2.60GHz with 1T of memory (268400646 pages in 8 nodes):

                        BASE            FIX
sparse_init        11.244671836s   0.007199623s
zone_sizes_init     4.879775891s   8.355182299s
                   --------------  -------------
Total              16.124447727s   8.362381922s


Hmm, this is confusing. This assumes that sparse_init doesn't zero pages
anymore, right? So these number depend on the last patch in the series?


Correct, without the last patch sparse_init time won't change.


THen this is just misleading.



OK, I will re-arrange patches the way you suggested earlier.

Pasha


Re: [PATCH v9 12/12] mm: stop zeroing memory during allocation in vmemmap

2017-10-03 Thread Pasha Tatashin

Hi Michal,

I decided not to merge these two patches, because in addition to moving the 
sparc optimization, we have these dependencies:


mm: zero reserved and unavailable struct pages

must be before

mm: stop zeroing memory during allocation in vmemmap.

Otherwise, we can end up with struct pages that are not zeroed properly.

However, the first patch depends on
mm: zero struct pages during initialization

As it uses mm_zero_struct_page().

Pasha


On 10/03/2017 11:34 AM, Pasha Tatashin wrote:

On 10/03/2017 09:19 AM, Michal Hocko wrote:

On Wed 20-09-17 16:17:14, Pavel Tatashin wrote:

vmemmap_alloc_block() will no longer zero the block, so zero memory
at its call sites for everything except struct pages.  Struct page 
memory

is zero'd by struct page initialization.

Replace allocators in sprase-vmemmap to use the non-zeroing version. So,
we will get the performance improvement by zeroing the memory in 
parallel

when struct pages are zeroed.


Is it possible to merge this patch with 
http://lkml.kernel.org/r/20170920201714.19817-7-pasha.tatas...@oracle.com


Yes, I will do that. It would also require re-arranging
[PATCH v9 07/12] sparc64: optimized struct page zeroing
optimization to come after this patch.

Pasha


Re: [PATCH v9 03/12] mm: deferred_init_memmap improvements

2017-10-03 Thread Pasha Tatashin

Hi Michal,

Are you OK if I replace the DEFERRED_FREE() macro with a function like this:

/*
 * Helper for deferred_init_range, free the given range, and reset the
 * counters
 */
static inline unsigned long __def_free(unsigned long *nr_free,
                                       unsigned long *free_base_pfn,
                                       struct page **page)
{
unsigned long nr = *nr_free;

deferred_free_range(*free_base_pfn, nr);
*free_base_pfn = 0;
*nr_free = 0;
*page = NULL;

return nr;
}

Since it is inline, and we operate on non-volatile counters, the compiler 
will be smart enough to remove all the unnecessary dereferences. As a 
plus, we won't be adding any new branches, and the code is still going 
to stay compact.
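
For illustration, a call site inside the pfn loop of deferred_init_range()
could then look roughly like this (just a sketch; the surrounding variables
are the ones the existing code already uses):

		/* Invalid pfn: flush the contiguous range accumulated so far */
		if (!pfn_valid_within(pfn)) {
			nr_pages += __def_free(&nr_free, &free_base_pfn, &page);
			continue;
		}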


Pasha

On 10/03/2017 11:15 AM, Pasha Tatashin wrote:

Hi Michal,



Please be explicit that this is possible only because we discard
memblock data later after 3010f876500f ("mm: discard memblock data
later"). Also be more explicit how the new code works.


OK



I like how the resulting code is more compact and smaller.


That was the goal :)


for_each_free_mem_range also looks more appropriate but I really detest
the DEFERRED_FREE thingy. Maybe we can handle all that in a single goto
section. I know this is not an art but manipulating variables from
macros is more error prone and much more ugly IMHO.


Sure, I can re-arrange the code to have a goto section. The function won't be 
as small, and if the compiler is not smart enough we might end up with more 
branches than my current code has.




please do not use macros. Btw. this deserves its own fix. I suspect that
no CONFIG_HOLES_IN_ZONE arch enables DEFERRED_STRUCT_PAGE_INIT but
purely from the review point of view it should be its own patch.


Sure, I will submit this patch separately from the rest of the project. 
In my opinion DEFERRED_STRUCT_PAGE_INIT is the way of the future, so we 
should make sure it is working with as many configs as possible.


Thank you,
Pasha


Re: [PATCH v9 12/12] mm: stop zeroing memory during allocation in vmemmap

2017-10-03 Thread Pasha Tatashin

On 10/03/2017 09:19 AM, Michal Hocko wrote:

On Wed 20-09-17 16:17:14, Pavel Tatashin wrote:

vmemmap_alloc_block() will no longer zero the block, so zero memory
at its call sites for everything except struct pages.  Struct page memory
is zero'd by struct page initialization.

Replace allocators in sprase-vmemmap to use the non-zeroing version. So,
we will get the performance improvement by zeroing the memory in parallel
when struct pages are zeroed.


Is it possible to merge this patch with 
http://lkml.kernel.org/r/20170920201714.19817-7-pasha.tatas...@oracle.com


Yes, I will do that. It would also require re-arranging
[PATCH v9 07/12] sparc64: optimized struct page zeroing
optimization to come after this patch.

Pasha


Re: [PATCH v9 08/12] mm: zero reserved and unavailable struct pages

2017-10-03 Thread Pasha Tatashin

On 10/03/2017 09:18 AM, Michal Hocko wrote:

On Wed 20-09-17 16:17:10, Pavel Tatashin wrote:

Some memory is reserved but unavailable: not present in memblock.memory
(because not backed by physical pages), but present in memblock.reserved.
Such memory has backing struct pages, but they are not initialized by going
through __init_single_page().


Could you be more specific where is such a memory reserved?



I know of one example: trim_low_memory_range() unconditionally reserves 
from pfn 0, but e820__memblock_setup() might provide the exiting memory 
from pfn 1 (i.e. KVM).


But, there could be more based on this comment from linux/page-flags.h:

 19  * PG_reserved is set for special pages, which can never be swapped 
out. Some

 20  * of them might not even exist (eg empty_bad_page)...

Pasha


Re: [PATCH v9 06/12] mm: zero struct pages during initialization

2017-10-03 Thread Pasha Tatashin



On 10/03/2017 09:08 AM, Michal Hocko wrote:

On Wed 20-09-17 16:17:08, Pavel Tatashin wrote:

Add struct page zeroing as a part of initialization of other fields in
__init_single_page().

This single thread performance collected on: Intel(R) Xeon(R) CPU E7-8895
v3 @ 2.60GHz with 1T of memory (268400646 pages in 8 nodes):

                        BASE            FIX
sparse_init        11.244671836s   0.007199623s
zone_sizes_init     4.879775891s   8.355182299s
                   --------------  -------------
Total              16.124447727s   8.362381922s


Hmm, this is confusing. This assumes that sparse_init doesn't zero pages
anymore, right? So these number depend on the last patch in the series?


Correct, without the last patch sparse_init time won't change.

Pasha


Re: [PATCH v9 04/12] sparc64: simplify vmemmap_populate

2017-10-03 Thread Pasha Tatashin




Acked-by: Michal Hocko 


Thank you,
Pasha


Re: [PATCH v9 03/12] mm: deferred_init_memmap improvements

2017-10-03 Thread Pasha Tatashin

Hi Michal,



Please be explicit that this is possible only because we discard
memblock data later after 3010f876500f ("mm: discard memblock data
later"). Also be more explicit how the new code works.


OK



I like how the resulting code is more compact and smaller.


That was the goal :)


for_each_free_mem_range also looks more appropriate but I really detest
the DEFERRED_FREE thingy. Maybe we can handle all that in a single goto
section. I know this is not an art but manipulating variables from
macros is more error prone and much more ugly IMHO.


Sure, I can re-arrange the code to have a goto section. The function won't be 
as small, and if the compiler is not smart enough we might end up with more 
branches than my current code has.




please do not use macros. Btw. this deserves its own fix. I suspect that
no CONFIG_HOLES_IN_ZONE arch enables DEFERRED_STRUCT_PAGE_INIT but
purely from the review point of view it should be its own patch.


Sure, I will submit this patch separately from the rest of the project. 
In my opinion DEFERRED_STRUCT_PAGE_INIT is the way of the future, so we 
should make sure it is working with as many configs as possible.


Thank you,
Pasha


Re: [PATCH v9 02/12] sparc64/mm: setting fields in deferred pages

2017-10-03 Thread Pasha Tatashin


As you separated x86 and sparc patches doing essentially the same I
assume David is going to take this patch?


Correct, I noticed that usually platform-specific changes are done in 
separate patches even if they are small. Dave already Acked this patch. 
So, I do not think it should be separated from the rest of the patches 
when this project goes into the mm tree.




Acked-by: Michal Hocko 


Thank you,
Pasha



Re: [PATCH v9 01/12] x86/mm: setting fields in deferred pages

2017-10-03 Thread Pasha Tatashin

Hi Michal,



I hope I haven't missed anything but it looks good to me.

Acked-by: Michal Hocko 


Thank you for your review.



one nit below

---
  arch/x86/mm/init_64.c | 9 +++--
  1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index 5ea1c3c2636e..30fe22558720 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -1182,12 +1182,17 @@ void __init mem_init(void)
  
  	/* clear_bss() already clear the empty_zero_page */
  
-	register_page_bootmem_info();

-
/* this will put all memory onto the freelists */
free_all_bootmem();
after_bootmem = 1;
  
+	/* Must be done after boot memory is put on freelist, because here we


standard code style is to do
/*
 * text starts here


OK, will change for both patch 1 and 2.

Pasha


Re: [PATCH v9 09/12] mm/kasan: kasan specific map populate function

2017-10-03 Thread Pasha Tatashin

Hi Mark,

I considered using a new *populate() function for the shadow without using 
vmemmap_populate(), but that makes things unnecessarily complicated: 
vmemmap_populate() has built-in:


1. large page support
2. device memory support
3. node locality support
4. several config based variants on different platforms

All of that would simply cause the code to be duplicated on each platform 
if we want to support that in kasan.


We could limit ourselves to only supporting base pages in memory by 
using something like vmemmap_populate_basepages(), but that is a step 
backward.  Kasan benefits from using large pages now, why remove it?


So, the solution I provide is walking the page table right after the memory 
is mapped. Since we are using the actual page table, it is guaranteed that 
we are not going to miss any mapped memory, and it is in common 
code, which makes things smaller and nicer.
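
To make the idea concrete, here is a rough sketch of such a walk (not the
actual patch: hole/error checks and the pud-level large-page case are left
out, and p{u,m}d_large() are the new helpers mentioned below):

static void __init zero_mapped_shadow(unsigned long start, unsigned long end)
{
	unsigned long addr = start, next;

	while (addr < end) {
		pgd_t *pgd = pgd_offset_k(addr);
		p4d_t *p4d = p4d_offset(pgd, addr);
		pud_t *pud = pud_offset(p4d, addr);
		pmd_t *pmd = pmd_offset(pud, addr);

		if (pmd_large(*pmd))
			next = min(end, (addr & PMD_MASK) + PMD_SIZE);
		else
			next = min(end, addr + PAGE_SIZE);

		/* the range is known to be mapped, so we can just zero it */
		memset((void *)addr, 0, next - addr);
		addr = next;
	}
}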


Thank you,
Pasha

On 10/03/2017 10:48 AM, Mark Rutland wrote:


I've given this a spin on arm64, and can confirm that it works.

Given that this involes redundant walking of page tables, I still think
it'd be preferable to have some common *_populate() helper that took a
gfp argument, but I guess it's not the end of the world.

I'll leave it to Will and Catalin to say whether they're happy with the
page table walking and the new p{u,m}d_large() helpers added to arm64.



Re: [PATCH v6 1/4] sched/clock: interface to allow timestamps early in boot

2017-09-28 Thread Pasha Tatashin

It will be best if we can support TSC sync capability in x86, but seems
is not easy.


Sure, your hardware achieving sync would be best, but even if it does
not, we can still use TSC. Using notsc simple because you fail to sync
TSCs is quite crazy.

The thing is, we need to support unsync'ed TSC in any case, because
older chips (pre Nehalem) didn't have synchronized TSC in any case, and
it still happens on recent chips if the BIOS mucks it up, which happens
surprisingly often :-(

I would suggest you try your reconfigurable setup with "tsc=unstable"
and see if that works for you. That marks the TSC unconditionally
unstable at boot and avoids any further wobbles once the TSC watchdog
notices (although that too _should_ more or less work).


That should do the trick nicely and we might just end up converting notsc
to tsc=unstable silently so we can avoid the bike shed discussions about
removing it.



Ok, I will start working on converting notsc to unstable, and modify my 
patches to do what Peter suggested earlier. In the meantime, I'd like 
to hear from Dou whether this setup works with dynamic reconfig.


Thank you,
Pasha


Re: [PATCH v6 1/4] sched/clock: interface to allow timestamps early in boot

2017-09-27 Thread Pasha Tatashin

Hi Dou,

This makes sense. The current sched_clock_early() approach does not 
break it, because with notsc the TSC is used early in boot and later 
stopped. But notsc must stay.


Peter,

So, we could either extend sched_clock() with another static branch for the 
early clock, or use what I proposed. IMO, the latter is better, but 
either way works for me.


Thank you,
Pasha

On 09/27/2017 09:52 AM, Dou Liyang wrote:

Hi Pasha, Peter

At 09/27/2017 09:16 PM, Pasha Tatashin wrote:

Hi Peter,

I am totally happy with removing notsc. This certainly simplifies the
sched_clock code. Are there any issues with removing existing kernel
parameters that I should be aware of?



We do not want to do that. Because, we use "notsc" to support Dynamic
Reconfiguration[1].

AFAIK, this feature enables hot-add system board which contains CPUs
and memories. But the CPUs in different board may have different TSCs
which are not consistent with the TSC from the existing CPUs. If we 
hot-add a board directly, the machine may happen the inconsistency of

TSC.

We make our effort to specify the same TSC value as existing one through
hardware and firmware, but it is hard. So we recommend to specify
"notsc" option in command line for users who want to use Dynamic
Reconfiguration.

[1] 
http://www.fujitsu.com/global/products/computing/servers/mission-critical/primequest/technology/availability/dynamic-reconfiguration.html 



Re: [PATCH v6 1/4] sched/clock: interface to allow timestamps early in boot

2017-09-27 Thread Pasha Tatashin

Hi Russell,

This might be so for ARM, and in fact if you look at my SPARC 
implementation, I simply made the clock source initialize early, so the 
regular sched_clock() is used. On SPARC, we use either the %tick or %stick 
registers, with the frequency determined via OpenFirmware. But on x86 there 
are a dozen ways clock sources are set up, and some of them become available 
quite late in boot because of various dependencies. So, my early clock 
initialization for x86 (extendable to other platforms with unstable 
clocks) is to make it available when the TSC is available, which is 
determined by already existing kernel functionality in 
simple_udelay_calibration().


My goal was not to introduce any regressions into the already complex (in 
terms of number of branches and loads) sched_clock_cpu(); therefore I 
added a new function and avoided any extra branches throughout the life 
of the system. I could mitigate some of that by using static branches, 
but IMO the current approach is better.


Pasha


Re: [PATCH v6 1/4] sched/clock: interface to allow timestamps early in boot

2017-09-27 Thread Pasha Tatashin

Hi Peter,

I am totally happy with removing notsc. This certainly simplifies the 
sched_clock code. Are there any issues with removing existing kernel 
parameters that I should be aware of?


Thank you,
Pasha

On 09/27/2017 09:10 AM, Peter Zijlstra wrote:

On Wed, Sep 27, 2017 at 02:58:57PM +0200, Peter Zijlstra wrote:

(we're violating "notsc" in any case and really should kill that
option).


Something like so; in particular simple_udelay_calibrate() will issue
RDTSC _way_ early, so there is absolutely no point in then pretending we
can't use RDTSC for sched_clock.



Re: [PATCH v6 4/4] x86/tsc: use tsc early

2017-08-30 Thread Pasha Tatashin

Hi Fenghua,

Thank you for looking at this. Unfortunately I can't mark either of them 
__init because sched_clock_early() is called from

u64 sched_clock_cpu(int cpu)

which is around for the life of the system.

Thank you,
Pasha

On 08/30/2017 05:21 PM, Fenghua Yu wrote:

On Wed, Aug 30, 2017 at 02:12:09PM -0700, Fenghua Yu wrote:

+static struct cyc2ns_data  cyc2ns_early;
+static bool sched_clock_early_enabled;


Should these two varaibles be "__initdata"?


+u64 sched_clock_early(void)

This function is only called during boot time. Should it
be a "__init" function?

Thanks.

-Fenghua



Re: [PATCH v7 07/11] sparc64: optimized struct page zeroing

2017-08-30 Thread Pasha Tatashin

Hi Dave,

Thank you for acking.

The reason I am not doing initializing stores is that they require a 
membar, even if only regular stores follow (I hoped to do a 
membar before the first load). This is something I thought was not 
true, but after consulting with colleagues and checking the processor 
manual, I verified that it is the case.


Pasha



You should probably use initializing stores when you are doing 8
stores and we thus know the page struct is cache line aligned.

But other than that:

Acked-by: David S. Miller 


Re: [PATCH v5 1/2] sched/clock: interface to allow timestamps early in boot

2017-08-28 Thread Pasha Tatashin

void __init timekeeping_init(void)
{
/*
 * We must determine boot timestamp before getting current  
 * persistent clock value, because implementation of
 * read_boot_clock64() might also call the persistent
 * clock, and a leap second may occur.
 */

read_boot_clock64(&boot);
...
read_persistent_clock64(&now);


No. That's the same crap just the other way round.

s390 can do that, because the boot timestamp is correlated with the
persistent clock. Your's not so much.



OK, how about reading the persistent clock only once, and passing its 
value, to be used for the boot timestamp calculation, to read_boot_clock64() 
via a new argument:


read_boot_clock64(&now, &boot);

Does this sound alright or is there a better way?

I would need to update read_boot_clock64() everywhere it is declared to 
add the __init macro, so this extra argument is not going to increase 
the number of changed lines.
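
For concreteness, the x86 side could then look roughly like this (just a
sketch of what I have in mind; the exact naming is open):

/* 'now' is the persistent clock value that timekeeping_init() just read */
void __init read_boot_clock64(struct timespec64 *now, struct timespec64 *ts)
{
	u64 ns_boot = sched_clock_early();	/* nsec since boot */
	struct timespec64 uptime = ns_to_timespec64(ns_boot);

	*ts = timespec64_sub(*now, uptime);	/* wall time at boot */
}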


Thank you,
Pasha


Re: [PATCH v5 1/2] sched/clock: interface to allow timestamps early in boot

2017-08-28 Thread Pasha Tatashin

And because its called only once, it does not need to be marked __init()
and must be kept around forever, right?


This is because every other architecture implements read_boot_clock64()
without __init: arm, s390. Besides, the original weak stub does not have the
__init macro. So, I can certainly try to add it for x86, but I am not sure
what the behavior is once the __init section is gone but the weak
implementation stays.


And what about fixing that everywhere?



Sure, I will update it everywhere.


Re: [PATCH v5 1/2] sched/clock: interface to allow timestamps early in boot

2017-08-28 Thread Pasha Tatashin

Hi Thomas,

Thank you for your comments. My replies below.


+/*
+ * Called once during to boot to initialize boot time.
+ */
+void read_boot_clock64(struct timespec64 *ts)


And because its called only once, it does not need to be marked __init()
and must be kept around forever, right?


This is because every other architecture implements read_boot_clock64() 
without __init: arm, s390. Besides, the original weak stub does not have the 
__init macro. So, I can certainly try to add it for x86, but I am not sure 
what the behavior is once the __init section is gone but the weak 
implementation stays.





+{
+   u64 ns_boot = sched_clock_early(); /* nsec from boot */


Please do not use tail comments. They are a horrible habit.

Instead of adding this crap you'd have better spent time in adding proper
comments explaining the reasoning behind this function,


OK, I will add an introductory comment and remove the tail comment.


This is really broken. Look at the time keeping init code. It does:

  read_persistent_clock64(&now);
  ...
  read_boot_clock64(&boot);
  ...
  tk_set_xtime(tk, &now);
  ...
  set_normalized_timespec64(&tmp, -boot.tv_sec, -boot.tv_nsec);
  tk_set_wall_to_mono(tk, tmp);

Lets assume that the initial read_persistent_clock64() happens right before
the second. For simplicity lets assume we get 1000 seconds since the epoch.

Now read_boot_clock() reads sched_clock_early() which returns 1 second.

The second read_persistent_clock64() returns 1001 seconds since the epoch
because the RTC advanced by now. So the resulting time stamp is going to be
1000s since the epoch.

In case the RTC still returns 100 since the epoch, the resulting time stamp
is 999s since the epoch.

A full second difference. That's time stamp lottery but nothing which we
want to base any boot time analysis on.

You have to come up with something more useful than that.



This makes sense. Changing the order in timekeeping_init() should take 
care of this:


Change to:

void __init timekeeping_init(void)
{
/*
 * We must determine boot timestamp before getting current  
 * persistent clock value, because implementation of
 * read_boot_clock64() might also call the persistent
 * clock, and a leap second may occur.
 */

read_boot_clock64(&boot);
...
read_persistent_clock64(&now);
...
}


Re: [v6 01/15] x86/mm: reserve only exiting low pages

2017-08-17 Thread Pasha Tatashin

Hi Michal,

While working on a bug that was reported to me by the "kernel test robot":

 unable to handle kernel NULL pointer dereference at   (null)

The issue was that page_to_pfn() in that configuration was looking up the 
section from the flags field of "struct page". So, reserved but 
unavailable memory should have its "struct page" zeroed.


Therefore, I am going to remove this patch from my series and instead 
add a new patch that iterates through the

(reserved && !memory)

memblocks and zeroes the struct pages for them, since for that memory the 
struct pages will never go through __init_single_page(), yet some fields 
might still be accessed.
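
Roughly, something like this (a sketch only; helper and macro names may
differ in the actual patch):

	struct memblock_region *r;
	unsigned long pfn, end_pfn;

	/* Zero struct pages for memory that is reserved but not present in
	 * memblock.memory, since __init_single_page() never runs on it. */
	for_each_memblock(reserved, r) {
		pfn = PFN_DOWN(r->base);
		end_pfn = PFN_UP(r->base + r->size);

		for (; pfn < end_pfn; pfn++) {
			if (!memblock_is_memory(PFN_PHYS(pfn)))
				mm_zero_struct_page(pfn_to_page(pfn));
		}
	}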


Pasha

On 08/14/2017 09:55 AM, Michal Hocko wrote:

Let's CC Hpa on this one. I am still not sure it is correct. The full
series is here
http://lkml.kernel.org/r/1502138329-123460-1-git-send-email-pasha.tatas...@oracle.com

On Mon 07-08-17 16:38:35, Pavel Tatashin wrote:

Struct pages are initialized by going through __init_single_page(). Since
the existing physical memory in memblock is represented in memblock.memory
list, struct page for every page from this list goes through
__init_single_page().

The second memblock list: memblock.reserved, manages the allocated memory.
The memory that won't be available to kernel allocator. So, every page from
this list goes through reserve_bootmem_region(), where certain struct page
fields are set, the assumption being that the struct pages have been
initialized beforehand.

In trim_low_memory_range() we unconditionally reserve memory from PFN 0, but
memblock.memory might start at a later PFN. For example, in QEMU,
e820__memblock_setup() can use PFN 1 as the first PFN in memblock.memory,
so PFN 0 is not on memblock.memory (and hence isn't initialized via
__init_single_page) but is on memblock.reserved (and hence we set fields in
the uninitialized struct page).

Currently, the struct page memory is always zeroed during allocation,
which prevents this problem from being detected. But, if some asserts
provided by CONFIG_DEBUG_VM_PGFLAGS are tighten, this problem may become
visible in existing kernels.

In this patchset we will stop zeroing struct page memory during allocation.
Therefore, this bug must be fixed in order to avoid random assert failures
caused by CONFIG_DEBUG_VM_PGFLAGS triggers.

The fix is to reserve memory from the first existing PFN.

Signed-off-by: Pavel Tatashin 
Reviewed-by: Steven Sistare 
Reviewed-by: Daniel Jordan 
Reviewed-by: Bob Picco 
---
  arch/x86/kernel/setup.c | 5 -
  1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index 3486d0498800..489cdc141bcb 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -790,7 +790,10 @@ early_param("reservelow", parse_reservelow);
  
  static void __init trim_low_memory_range(void)

  {
-   memblock_reserve(0, ALIGN(reserve_low, PAGE_SIZE));
+   unsigned long min_pfn = find_min_pfn_with_active_regions();
+   phys_addr_t base = min_pfn << PAGE_SHIFT;
+
+   memblock_reserve(base, ALIGN(reserve_low, PAGE_SIZE));
  }

  /*
--
2.14.0




Re: [v6 05/15] mm: don't accessed uninitialized struct pages

2017-08-17 Thread Pasha Tatashin

Hi Michal,

I've been looking through this code again, and I think your suggestion 
will work. I did not realize this iterator already exists:


for_each_free_mem_range() basically iterates through (memory && !reserved)

This is exactly what we need here. So, I will update this patch to use 
this iterator, which will simplify it.
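
Something along these lines (a sketch only; deferred_init_range() stands in
for whatever per-range helper the updated patch ends up with):

	phys_addr_t spa, epa;
	u64 i;

	/* (memory && !reserved) ranges only */
	for_each_free_mem_range(i, nid, MEMBLOCK_NONE, &spa, &epa, NULL) {
		unsigned long spfn = PFN_UP(spa);
		unsigned long epfn = PFN_DOWN(epa);

		if (spfn < epfn)
			nr_pages += deferred_init_range(nid, zid, spfn, epfn);
	}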


Pasha

On 08/14/2017 09:51 AM, Pasha Tatashin wrote:

mem_init()
  free_all_bootmem()
    free_low_memory_core_early()
      for_each_reserved_mem_region()
        reserve_bootmem_region()
          init_reserved_page() <- if this is a deferred reserved page
            __init_single_pfn()
              __init_single_page()

So, currently, we are using the value of page->flags to figure out if this
page has been initialized while being part of a deferred page, but this is not
going to work for this project, as we do not zero the memory that is backing
the struct pages, so the value of page->flags can be anything.


True, this is the initialization part I've missed in one of the previous
patches already. Would it be possible to only iterate over !reserved
memory blocks instead? Now that we discard all the metadata later it
should be quite easy to do for_each_memblock_type, no?


Hi Michal,

Clever suggestion to add a new iterator to go through unreserved 
existing memory. I do not think such an iterator is available, so it 
would need to be implemented, using a similar approach to what I have done 
with a callback.


However, there is a different reason why I took the current approach.

Daniel Jordan is working on a ktask support:
https://lkml.org/lkml/2017/7/14/666

He and I discussed how to multi-thread struct page initialization 
within memory nodes using ktasks. Having this callback interface makes 
that multi-threading quite easy, improving the boot performance further: 
with his prototype we saw 4-6x improvements (using 4-8 threads per 
node), reducing the total time it takes to initialize all struct pages 
on machines with terabytes of memory to less than one second.


Pasha


Re: [v3 1/2] sched/clock: interface to allow timestamps early in boot

2017-08-14 Thread Pasha Tatashin

Hi Dou,

Thank you for your comments:


 {
 x86_init.timers.timer_init();
 tsc_init();
+tsc_early_fini();


tsc_early_fini() is defined in patch 2, I guess you may miss it
when you split your patches.


Indeed, I will move it to patch 2.


+static DEFINE_STATIC_KEY_TRUE(__use_sched_clock_early);
+static bool __read_mostly sched_clock_early_running;
+


In my opinion, these two parameters are repetitive, I suggest remove
one.

eg. remove sched_clock_early_running like below
First, static DEFINE_STATIC_KEY_FALSE(__use_sched_clock_early);


We can't change the static branches before jump_label_init() is called, 
and we start early boot timestamps before that.


This is why having two booleans is appropriate: one that can be changed 
early in boot, and another to patch the hot code in order to keep good 
performance after boot.


I will update the comment before __use_sched_clock_early explaining 
why we need two of them.
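
For clarity, this is roughly how the two flags are meant to cooperate
(simplified; the names are the ones from the patch, the body is not the
real sched_clock_cpu()):

u64 sched_clock_cpu(int cpu)
{
	/* Patched out after boot, so there is no extra cost later. */
	if (static_branch_unlikely(&__use_sched_clock_early)) {
		/* The plain bool can be flipped before jump_label_init(). */
		if (sched_clock_early_running)
			return sched_clock_early();
	}

	return sched_clock();	/* stand-in for the regular path */
}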


Thank you,
Pasha


Re: [v6 15/15] mm: debug for raw alloctor

2017-08-14 Thread Pasha Tatashin

However, now thinking about it, I will change it to CONFIG_MEMBLOCK_DEBUG,
and let users decide what other debugging configs need to be enabled, as
this is also OK.


Actually the more I think about it the more I am convinced that a kernel
boot parameter would be better because it doesn't need the kernel to be
recompiled and it is a single branch in not so hot path.


The main reason I do not like a kernel parameter is that automated test 
suites for every platform would need to be updated to include this new 
parameter in order to test it.


Yet, I think it is important at least initially to test it on every 
platform unconditionally when certain debug configs are enabled.


This patch series allows the boot allocator to return uninitialized memory, 
a behavior Linux never had before, but way too often firmware 
explicitly zeroes all the memory before starting the OS. Therefore, it would 
be hard to debug issues that might only be seen during kinit-type 
reboots.


In the future, when memory sizes increase to the point where this memset 
becomes unacceptable even on debug kernels, it can always be removed, but 
at least by that time we will know that the code has been tested for 
many years.
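
For concreteness, the poisoning under discussion is roughly this (a sketch
only; the final config guard is exactly what is being decided above, and the
internal helper name follows the series):

void * __init memblock_virt_alloc_try_nid_raw(phys_addr_t size, phys_addr_t align,
					      phys_addr_t min_addr,
					      phys_addr_t max_addr, int nid)
{
	void *ptr = memblock_virt_alloc_internal(size, align,
						 min_addr, max_addr, nid);

#ifdef CONFIG_DEBUG_VM
	if (ptr && size > 0)
		memset(ptr, 0xff, size);	/* catch callers that assume zeroed memory */
#endif
	return ptr;
}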


Re: [v6 05/15] mm: don't accessed uninitialized struct pages

2017-08-14 Thread Pasha Tatashin

mem_init()
  free_all_bootmem()
    free_low_memory_core_early()
      for_each_reserved_mem_region()
        reserve_bootmem_region()
          init_reserved_page() <- if this is a deferred reserved page
            __init_single_pfn()
              __init_single_page()

So, currently, we are using the value of page->flags to figure out if this
page has been initialized while being part of a deferred page, but this is not
going to work for this project, as we do not zero the memory that is backing
the struct pages, so the value of page->flags can be anything.


True, this is the initialization part I've missed in one of the previous
patches already. Would it be possible to only iterate over !reserved
memory blocks instead? Now that we discard all the metadata later it
should be quite easy to do for_each_memblock_type, no?


Hi Michal,

Clever suggestion to add a new iterator to go through unreserved 
existing memory. I do not think such an iterator is available, so it 
would need to be implemented, using a similar approach to what I have done 
with a callback.


However, there is a different reason why I took the current approach.

Daniel Jordan is working on a ktask support:
https://lkml.org/lkml/2017/7/14/666

He and I discussed how to multi-thread struct page initialization 
within memory nodes using ktasks. Having this callback interface makes 
that multi-threading quite easy, improving the boot performance further: 
with his prototype we saw 4-6x improvements (using 4-8 threads per 
node), reducing the total time it takes to initialize all struct pages 
on machines with terabytes of memory to less than one second.


Pasha


Re: [v6 04/15] mm: discard memblock data later

2017-08-14 Thread Pasha Tatashin

#ifdef CONFIG_MEMBLOCK in page_alloc, or define memblock_discard() stubs in
nobootmem headfile.


This is the standard way to do this. And it is usually preferred to
proliferate ifdefs in the code.


Hi Michal,

As you suggested, I sent out this patch separately. If you feel 
strongly that this should be updated to have stubs for platforms that 
do not implement memblock, please send a reply to that e-mail, so those 
who do not follow this thread will see it. Otherwise, I can leave it as 
is; the page_alloc file already has a number of memblock-related ifdefs, 
all of which can be cleaned out once every platform implements it (is 
that even achievable?)


Thank you,
Pasha


Re: [v6 04/15] mm: discard memblock data later

2017-08-14 Thread Pasha Tatashin

OK, I will post it separately. No it does not depend on the rest, but the
reset depends on this. So, I am not sure how to enforce that this comes
before the rest.


Andrew will take care of that. Just make it explicit that some of the
patch depends on an earlier work when reposting.


Ok.


Yes, they said that the problem was bisected down to this patch. Do you know
if there is a way to submit a patch to this test robot?


You can ask them for re testing with an updated patch by replying to
their report. ANyway I fail to see how the change could lead to this
patch.


I have already done that. Anyway, I think it is unrelated. I have used 
their scripts to test the patch alone, with the number of elements in the 
memblock array reduced down to 4. I verified that my freeing code is 
called, and never hit the problem that they reported.


Re: [v6 02/15] x86/mm: setting fields in deferred pages

2017-08-14 Thread Pasha Tatashin



On 08/14/2017 07:43 AM, Michal Hocko wrote:

register_page_bootmem_info
  register_page_bootmem_info_node
    get_page_bootmem
      .. setting fields here ..
      such as: page->freelist = (void *)type;

free_all_bootmem()
  free_low_memory_core_early()
    for_each_reserved_mem_region()
      reserve_bootmem_region()
        init_reserved_page() <- only if this is a deferred reserved page
          __init_single_pfn()
            __init_single_page()
              memset(0) <-- lose the set fields here!

OK, I have missed that part. Please make it explicit in the changelog.
It is quite easy to get lost in the deep call chains.


Ok, will update comment.


Re: [v6 01/15] x86/mm: reserve only exiting low pages

2017-08-14 Thread Pasha Tatashin

Correct, the pgflags asserts were triggered when we were setting reserved
flags to struct page for PFN 0 in which was never initialized through
__init_single_page(). The reason they were triggered is because we set all
uninitialized memory to ones in one of the debug patches.


And why don't we need the same treatment for other architectures?



I have not seen similar issues on other architectures. At least this low 
memory reserve is x86 specific for BIOS purposes:


Documentation/admin-guide/kernel-parameters.txt
3624reservelow= [X86]
3625Format: nn[K]
3626Set the amount of memory to reserve for BIOS at
3627the bottom of the address space.

If there are similar cases with other architectures, they will be caught 
by the last patch in this series, where all allocated memory is set to 
ones, and page flags asserts will be triggered. I have boot-tested on 
SPARC, ARM, and x86.


Pasha


Re: [v6 04/15] mm: discard memblock data later

2017-08-11 Thread Pasha Tatashin

Hi Michal,

This suggestion won't work, because there are arches without memblock 
support: tile, sh...

So, I would still need to have:

#ifdef CONFIG_MEMBLOCK in page_alloc, or define memblock_discard() stubs 
in the nobootmem header file. In either case it would become messier than 
it is right now.


Pasha


I have just one nit below
Acked-by: Michal Hocko 

[...]

diff --git a/mm/memblock.c b/mm/memblock.c
index 2cb25fe4452c..bf14aea6ab70 100644
--- a/mm/memblock.c
+++ b/mm/memblock.c
@@ -285,31 +285,27 @@ static void __init_memblock memblock_remove_region(struct 
memblock_type *type, u
  }
  
  #ifdef CONFIG_ARCH_DISCARD_MEMBLOCK


pull this ifdef inside memblock_discard and you do not have an another
one in page_alloc_init_late

[...]

+/**
+ * Discard memory and reserved arrays if they were allocated
+ */
+void __init memblock_discard(void)
  {


here


-   if (memblock.memory.regions == memblock_memory_init_regions)
-   return 0;
+   phys_addr_t addr, size;
  
-	*addr = __pa(memblock.memory.regions);

+   if (memblock.reserved.regions != memblock_reserved_init_regions) {
+   addr = __pa(memblock.reserved.regions);
+   size = PAGE_ALIGN(sizeof(struct memblock_region) *
+ memblock.reserved.max);
+   __memblock_free_late(addr, size);
+   }
  
-	return PAGE_ALIGN(sizeof(struct memblock_region) *

- memblock.memory.max);
+   if (memblock.memory.regions == memblock_memory_init_regions) {
+   addr = __pa(memblock.memory.regions);
+   size = PAGE_ALIGN(sizeof(struct memblock_region) *
+ memblock.memory.max);
+   __memblock_free_late(addr, size);
+   }
  }
-
  #endif


Re: [v6 07/15] mm: defining memblock_virt_alloc_try_nid_raw

2017-08-11 Thread Pasha Tatashin

Sure, I could do this, but as I understood from earlier Dave Miller's
comments, we should do one logical change at a time. Hence, introduce API in
one patch use it in another. So, this is how I tried to organize this patch
set. Is this assumption incorrect?


Well, it really depends. If the patch is really small then adding a new
API along with users is easier to review and backport because you have a
clear view of the usage. I believe this is the case here. But if others
feel otherwise I will not object.


I will merge them.

Thank you,
Pasha


Re: [v6 04/15] mm: discard memblock data later

2017-08-11 Thread Pasha Tatashin

I will address your comment, and send out a new patch. Should I send it out
separately from the series or should I keep it inside?


I would post it separatelly. It doesn't depend on the rest.


OK, I will post it separately. No, it does not depend on the rest, but the 
rest depends on this. So, I am not sure how to enforce that this comes 
in before the rest.





Also, before I send out a new patch, I will need to root cause and resolve
problem found by kernel test robot , and bisected
down to this patch.

[  156.659400] BUG: Bad page state in process swapper  pfn:03147
[  156.660051] page:88001ed8a1c0 count:0 mapcount:-127 mapping:
(null) index:0x1
[  156.660917] flags: 0x0()
[  156.661198] raw:   0001
ff80
[  156.662006] raw: 88001f4a8120 88001ed85ce0 

[  156.662811] page dumped because: nonzero mapcount
[  156.663307] CPU: 0 PID: 1 Comm: swapper Not tainted
4.13.0-rc3-00220-g1aad694 #1
[  156.664077] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
1.9.3-20161025_171302-gandalf 04/01/2014
[  156.665129] Call Trace:
[  156.665422]  dump_stack+0x1e/0x20
[  156.665802]  bad_page+0x122/0x148


Was the report related with this patch?


Yes, they said that the problem was bisected down to this patch. Do you 
know if there is a way to submit a patch to this test robot?


Thank you,
Pasha


Re: [v6 15/15] mm: debug for raw alloctor

2017-08-11 Thread Pasha Tatashin

When CONFIG_DEBUG_VM is enabled, this patch sets all the memory that is
returned by memblock_virt_alloc_try_nid_raw() to ones to ensure that no
places expect zeroed memory.


Please fold this into the patch which introduces
memblock_virt_alloc_try_nid_raw.


OK

 I am not sure CONFIG_DEBUG_VM is the

best config because that tends to be enabled quite often. Maybe
CONFIG_MEMBLOCK_DEBUG? Or even make it a kernel command line parameter?



Initially, I did not want to make it CONFIG_MEMBLOCK_DEBUG because we 
really benefit from this debugging code when VM debug is enabled, 
especially from the struct page debugging asserts, which also depend on 
CONFIG_DEBUG_VM.

However, thinking about it now, I will change it to CONFIG_MEMBLOCK_DEBUG 
and let users decide which other debugging configs need to be enabled, as 
this is also OK.
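For context, the poisoning being discussed would sit at the tail of the 
raw allocator, roughly like this (a sketch, assuming the option ends up 
being named CONFIG_MEMBLOCK_DEBUG):

	ptr = memblock_virt_alloc_internal(size, align,
					   min_addr, max_addr, nid);
#ifdef CONFIG_MEMBLOCK_DEBUG
	/* Fill the uninitialized allocation with ones so that any caller
	 * that wrongly assumes zeroed memory trips the existing asserts. */
	if (ptr && size > 0)
		memset(ptr, 0xff, size);
#endif
	return ptr;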


Thank you,
Pasha


Re: [v6 14/15] mm: optimize early system hash allocations

2017-08-11 Thread Pasha Tatashin

Clients can call alloc_large_system_hash() with the HASH_ZERO flag to
specify that the memory allocated for the system hash needs to be zeroed;
otherwise, the memory does not need to be zeroed, and the client will
initialize it.

If memory does not need to be zero'd, call the new
memblock_virt_alloc_raw() interface, and thus improve the boot performance.

Signed-off-by: Pavel Tatashin 
Reviewed-by: Steven Sistare 
Reviewed-by: Daniel Jordan 
Reviewed-by: Bob Picco 


OK, but as mentioned in the previous patch add memblock_virt_alloc_raw
in this patch.

Acked-by: Michal Hocko 


Ok I will merge them.
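For reference, the selection inside alloc_large_system_hash() would look
roughly like this (a sketch; the exact call sites may differ in the merged
patch):

	if (flags & HASH_ZERO)
		/* the caller relies on zeroed memory */
		table = memblock_virt_alloc_nopanic(size, 0);
	else
		/* the caller initializes the table itself */
		table = memblock_virt_alloc_raw(size, 0);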

Thank you,
Pasha


Re: [v6 13/15] mm: stop zeroing memory during allocation in vmemmap

2017-08-11 Thread Pasha Tatashin

On 08/11/2017 09:04 AM, Michal Hocko wrote:

On Mon 07-08-17 16:38:47, Pavel Tatashin wrote:

Replace the allocators in sparse-vmemmap with the non-zeroing version. This
way we get the performance improvement by zeroing the memory in parallel,
when struct pages are zeroed.


First of all, this should probably be merged with the previous patch. Then
I think vmemmap_alloc_block would be better split up into
__vmemmap_alloc_block, which doesn't zero, and vmemmap_alloc_block, which
does zero; that would reduce the memset callsites and make the interface
slightly more robust.


Ok, I will add: vmemmap_alloc_block_zero() call, and merge this and the 
previous patches together.
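A minimal sketch of such a wrapper (assuming it keeps the existing
vmemmap_alloc_block() signature):

	static void * __meminit vmemmap_alloc_block_zero(unsigned long size,
							 int node)
	{
		void *p = vmemmap_alloc_block(size, node);

		if (!p)
			return NULL;
		memset(p, 0, size);

		return p;
	}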


Re: [v6 09/15] sparc64: optimized struct page zeroing

2017-08-11 Thread Pasha Tatashin

Add an optimized mm_zero_struct_page(), so struct pages are zeroed without
calling memset(). We do eight to ten regular stores based on the size of
struct page. The compiler optimizes out the conditions of the switch()
statement.


Again, this doesn't explain why we need this. You have mentioned those
reasons in some previous emails but be explicit here please.



I will add performance data to this patch as well.
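For context, the sparc64 helper is roughly of this shape (a sketch based 
on the description above; the posted version may differ in detail):

	#define mm_zero_struct_page(pp) do {				\
		unsigned long *_pp = (void *)(pp);			\
									\
		/* struct page must be 64..80 bytes, 8-byte aligned */	\
		BUILD_BUG_ON(sizeof(struct page) & 7);			\
		BUILD_BUG_ON(sizeof(struct page) < 64);			\
		BUILD_BUG_ON(sizeof(struct page) > 80);			\
									\
		switch (sizeof(struct page)) {				\
		case 80:						\
			_pp[9] = 0;	/* fallthrough */		\
		case 72:						\
			_pp[8] = 0;	/* fallthrough */		\
		default:						\
			_pp[7] = 0;					\
			_pp[6] = 0;					\
			_pp[5] = 0;					\
			_pp[4] = 0;					\
			_pp[3] = 0;					\
			_pp[2] = 0;					\
			_pp[1] = 0;					\
			_pp[0] = 0;					\
		}							\
	} while (0)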

Thank you,
Pasha


Re: [v6 08/15] mm: zero struct pages during initialization

2017-08-11 Thread Pasha Tatashin

I believe this deserves much more detailed explanation why this is safe.
What actually prevents any pfn walker from seeing an uninitialized
struct page? Please make your assumptions explicit in the commit log so
that we can check them independently.


There is nothing that prevents pfn walkers from walking over any struct 
pages, deferred and non-deferred. However, during boot, before deferred 
pages are initialized, we have just a few places that do that, and all of 
those cases are fixed in this patchset.



Also this is done with some purpose which is the performance, right? You
have mentioned that in the cover letter but if somebody is going to read
through git logs this wouldn't be obvious from the specific commit.
So add that information here as well. Especially numbers will be
interesting.


I will add more performance data to this patch comment.


Re: [v6 07/15] mm: defining memblock_virt_alloc_try_nid_raw

2017-08-11 Thread Pasha Tatashin

On 08/11/2017 08:39 AM, Michal Hocko wrote:

On Mon 07-08-17 16:38:41, Pavel Tatashin wrote:

A new variant of memblock_virt_alloc_* allocations:
memblock_virt_alloc_try_nid_raw()
 - Does not zero the allocated memory
 - Does not panic if request cannot be satisfied


OK, this looks good but I would not introduce memblock_virt_alloc_raw
here because we do not have any users. Please move that to "mm: optimize
early system hash allocations" which actually uses the API. It would be
easier to review it that way.


Signed-off-by: Pavel Tatashin 
Reviewed-by: Steven Sistare 
Reviewed-by: Daniel Jordan 
Reviewed-by: Bob Picco 


other than that
Acked-by: Michal Hocko 


Sure, I could do this, but as I understood from earlier Dave Miller's 
comments, we should do one logical change at a time. Hence, introduce the 
API in one patch and use it in another. So, this is how I tried to organize 
this patch set. Is this assumption incorrect?
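For reference, a caller of the new raw variant would look roughly like
this (a sketch; the arguments follow the existing
memblock_virt_alloc_try_nid() signature):

	struct page *map;

	map = memblock_virt_alloc_try_nid_raw(size, PAGE_SIZE,
					      __pa(MAX_DMA_ADDRESS),
					      BOOTMEM_ALLOC_ACCESSIBLE,
					      nid);
	if (!map)
		return NULL;	/* the raw variant does not panic */
	/* NOTE: the returned memory is not zeroed */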


Re: [v6 05/15] mm: don't accessed uninitialized struct pages

2017-08-11 Thread Pasha Tatashin

On 08/11/2017 05:37 AM, Michal Hocko wrote:

On Mon 07-08-17 16:38:39, Pavel Tatashin wrote:

In deferred_init_memmap() where all deferred struct pages are initialized
we have a check like this:

 if (page->flags) {
 VM_BUG_ON(page_zone(page) != zone);
 goto free_range;
 }

This way we are checking whether the current deferred page has already been
initialized. It works because memory for struct pages has been zeroed, and
the only way flags can be non-zero is if the page went through
__init_single_page() before. But once we change the current behavior and no
longer zero the memory in the memblock allocator, we cannot trust anything
inside struct pages until they are initialized. This patch fixes this.

This patch defines a new accessor memblock_get_reserved_pfn_range()
which returns successive ranges of reserved PFNs.  deferred_init_memmap()
calls it to determine if a PFN and its struct page has already been
initialized.


Why don't we simply check the pfn against pgdat->first_deferred_pfn?


Because we are initializing deferred pages, and all of them have a pfn 
greater than pgdat->first_deferred_pfn. However, some of the deferred pages 
were already initialized if they were reserved, in this path:


mem_init()
 free_all_bootmem()
  free_low_memory_core_early()
   for_each_reserved_mem_region()
reserve_bootmem_region()
 init_reserved_page() <- if this is deferred reserved page
  __init_single_pfn()
   __init_single_page()

So, currently, we are using the value of page->flags to figure out whether 
this page has been initialized while being part of the deferred range, but 
this is not going to work for this project, as we do not zero the memory 
that is backing the struct pages, and so the value of page->flags can 
be anything.
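For illustration only (the variable names below are hypothetical; the real
signature of memblock_get_reserved_pfn_range() is whatever the patch
defines), the check in deferred_init_memmap() becomes something like:

	/* res_start_pfn/res_end_pfn were obtained from
	 * memblock_get_reserved_pfn_range() for the range containing pfn */
	if (pfn >= res_start_pfn && pfn < res_end_pfn) {
		/* reserved page: already initialized via
		 * reserve_bootmem_region() -> __init_single_page() */
		goto free_range;
	}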


Re: [v6 04/15] mm: discard memblock data later

2017-08-11 Thread Pasha Tatashin

I guess this goes all the way down to
Fixes: 7e18adb4f80b ("mm: meminit: initialise remaining struct pages in parallel 
with kswapd")


I will add this to the patch.


Signed-off-by: Pavel Tatashin 
Reviewed-by: Steven Sistare 
Reviewed-by: Daniel Jordan 
Reviewed-by: Bob Picco 


Considering that some HW might behave strangely and this would be rather
hard to debug I would be tempted to mark this for stable. It should also
be merged separately from the rest of the series.

I have just one nit below
Acked-by: Michal Hocko 


I will address your comment, and send out a new patch. Should I send it 
out separately from the series or should I keep it inside?


Also, before I send out a new patch, I will need to root cause and 
resolve the problem found by the kernel test robot, which was bisected 
down to this patch.


[  156.659400] BUG: Bad page state in process swapper  pfn:03147
[  156.660051] page:88001ed8a1c0 count:0 mapcount:-127 mapping: 
(null) index:0x1

[  156.660917] flags: 0x0()
[  156.661198] raw:   0001 
ff80
[  156.662006] raw: 88001f4a8120 88001ed85ce0  


[  156.662811] page dumped because: nonzero mapcount
[  156.663307] CPU: 0 PID: 1 Comm: swapper Not tainted 
4.13.0-rc3-00220-g1aad694 #1
[  156.664077] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), 
BIOS 1.9.3-20161025_171302-gandalf 04/01/2014

[  156.665129] Call Trace:
[  156.665422]  dump_stack+0x1e/0x20
[  156.665802]  bad_page+0x122/0x148

I was not able to reproduce this problem, even though I used their qemu 
script and config. But I am getting the following panic with both the base 
and the fix:


[  115.763259] VFS: Cannot open root device "ram0" or 
unknown-block(0,0): error -6
[  115.764511] Please append a correct "root=" boot option; here are the 
available partitions:
[  115.765816] Kernel panic - not syncing: VFS: Unable to mount root fs 
on unknown-block(0,0)
[  115.767124] CPU: 0 PID: 1 Comm: swapper Not tainted 
4.13.0-rc4_pt_memset6-00033-g7e65200b1473 #7
[  115.768506] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), 
BIOS rel-1.10.2-0-g5f4c7b1-prebuilt.qemu-project.org 04/01/2014

[  115.770368] Call Trace:
[  115.770771]  dump_stack+0x1e/0x20
[  115.771310]  panic+0xf8/0x2bc
[  115.771792]  mount_block_root+0x3bb/0x441
[  115.772437]  ? do_early_param+0xc5/0xc5
[  115.773051]  ? do_early_param+0xc5/0xc5
[  115.773683]  mount_root+0x7c/0x7f
[  115.774243]  prepare_namespace+0x194/0x1d1
[  115.774898]  kernel_init_freeable+0x1c8/0x1df
[  115.775575]  ? rest_init+0x13f/0x13f
[  115.776153]  kernel_init+0x14/0x142
[  115.776711]  ? rest_init+0x13f/0x13f
[  115.777285]  ret_from_fork+0x2a/0x40
[  115.777864] Kernel Offset: disabled

Their config has CONFIG_BLK_DEV_RAM disabled, but the qemu script has 
root=/dev/ram0, so I enabled dev_ram, but I am still getting a panic when 
root is mounted, with both the base and the fix.


Pasha


Re: [v6 02/15] x86/mm: setting fields in deferred pages

2017-08-11 Thread Pasha Tatashin

AFAIU register_page_bootmem_info_node is only about struct pages backing
pgdat, usemap and memmap. Those should be in reserved memblocks and we
do not initialize those at later times, they are not relevant to the
deferred initialization as your changelog suggests so the ordering with
get_page_bootmem shouldn't matter. Or am I missing something here?


The pages for pgdat, usemap, and memmap are part of the reserved memory, 
and thus get initialized when free_all_bootmem() is called.


So, we have something like this in mem_init()

register_page_bootmem_info
 register_page_bootmem_info_node
  get_page_bootmem
   .. setting fields here ..
   such as: page->freelist = (void *)type;

free_all_bootmem()
 free_low_memory_core_early()
  for_each_reserved_mem_region()
   reserve_bootmem_region()
init_reserved_page() <- Only if this is deferred reserved page
 __init_single_pfn()
  __init_single_page()
  memset(0) <-- Lose the set fields here!

memblock does not know about deferred pages, and can be requested to 
allocate physical pages anywhere. So, the reserved pages in memblock can 
be in both the non-deferred and the deferred parts of memory.


Without deferred pages enabled, by the time register_page_bootmem_info() 
is called every page has gone through __init_single_page(); but with 
deferred pages enabled, there is a scenario where fields can be set before 
pages go through __init_single_page(). This patch fixes it.
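Conceptually, the ordering fix in mem_init() looks like this (a sketch, 
not the literal x86 hunk):

	/* Put all memory onto the freelists first; with deferred struct
	 * page init this is also what runs __init_single_page() for every
	 * reserved (including deferred) page. */
	free_all_bootmem();

	/* Only now is it safe for get_page_bootmem() to set fields such as
	 * page->freelist in the struct pages backing pgdat/usemap/memmap. */
	register_page_bootmem_info();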


Re: [v6 01/15] x86/mm: reserve only exiting low pages

2017-08-11 Thread Pasha Tatashin

Struct pages are initialized by going through __init_single_page(). Since
the existing physical memory in memblock is represented in memblock.memory
list, struct page for every page from this list goes through
__init_single_page().


By a page _from_ this list you mean struct pages backing the physical
memory of the memblock lists?


Correct: "for every page from this list...", for every page represented 
by this list the struct page is initialized through __init_single_page()



In this patchset we will stop zeroing struct page memory during allocation.
Therefore, this bug must be fixed in order to avoid random assert failures
caused by CONFIG_DEBUG_VM_PGFLAGS triggers.

The fix is to reserve memory from the first existing PFN.


Hmm, I assume this is a result of some assert triggering, right? Which
one? Why don't we need the same treatment for other than x86 arch?


Correct, the pgflags asserts were triggered when we were setting the 
reserved flag on the struct page for PFN 0, which was never initialized 
through __init_single_page(). The reason they were triggered is that 
we set all uninitialized memory to ones in one of the debug patches.
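Illustratively, the fix amounts to something like the following (a sketch 
under the assumption that the low reservation is simply clamped to the 
first PFN that memblock knows about; not the literal hunk):

	static void __init trim_low_memory_range(void)
	{
		/* Start the low reservation at the first existing address
		 * instead of 0, so no struct page outside memblock.memory
		 * is ever marked reserved. */
		phys_addr_t start = memblock_start_of_DRAM();

		if (start < ALIGN(reserve_low, PAGE_SIZE))
			memblock_reserve(start,
					 ALIGN(reserve_low, PAGE_SIZE) - start);
	}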



Signed-off-by: Pavel Tatashin 
Reviewed-by: Steven Sistare 
Reviewed-by: Daniel Jordan 
Reviewed-by: Bob Picco 


I guess that the review happened inhouse. I do not want to question its
value but it is rather strange to not hear the specific review comments
which might be useful in general and moreover even not include those
people on the CC list so they are aware of the follow up discussion.


I will bring this up with my colleagues to figure out how to handle this 
better in the future. I will also CC the reviewers when I send out the 
updated patch series.


Re: [v6 00/15] complete deferred page initialization

2017-08-11 Thread Pasha Tatashin

On 08/11/2017 03:58 AM, Michal Hocko wrote:

[I am sorry I didn't get to your previous versions]


Thank you for reviewing this work. I will address your comments and 
send out new patches.




In this work we do the following:
- Never read access struct page until it was initialized


How is this enforced? What about pfn walkers? E.g. page_ext
initialization code (page owner in particular)


This is hard to enforce 100%. But because we have a patch in this 
series that sets all memory allocated by 
memblock_virt_alloc_try_nid_raw() to ones when debug options are enabled, 
and because Linux has a good set of asserts in place that check that 
struct pages are sane (especially the ones enabled by 
CONFIG_DEBUG_VM_PGFLAGS), I was able to find many places in Linux that 
accessed struct pages before __init_single_page() is performed, and fix 
them. Most of these places happen only when deferred struct page 
initialization code is enabled.





- Never set any fields in struct pages before they are initialized
- Zero struct page at the beginning of struct page initialization


Please give us a more highlevel description of how your reimplementation
works and how is the patchset organized. I will go through those patches
but it is always good to give an overview in the cover letter to make
the review easier.


Ok, will add more explanation to the cover letter.


Single threaded struct page init: 7.6s/T improvement
Deferred struct page init: 10.2s/T improvement


What are before and after numbers and how have you measured them.


When I send out this series the next time I will include before vs. 
after numbers on the machine I tested, including links to dmesg output.


I used my early boot timestamps for x86 and sparc to measure the data. 
The early boot timestamps for sparc are already part of mainline; the x86 
patches are out for review: https://lkml.org/lkml/2017/8/10/946 (I should 
have changed the subject line there :) ).


Re: [v6 11/15] arm64/kasan: explicitly zero kasan shadow memory

2017-08-08 Thread Pasha Tatashin

On 2017-08-08 09:15, David Laight wrote:

From: Pasha Tatashin

Sent: 08 August 2017 12:49
Thank you for looking at this change. What you described was in my
previous iterations of this project.

See for example here: https://lkml.org/lkml/2017/5/5/369

I was asked to remove that flag, and only zero memory in place when
needed. Overall the current approach is better everywhere else in the
kernel, but it adds a little extra code to kasan initialization.


Perhaps you could #define the function prototype(s?) so that the flags
are not passed unless it is a kasan build?



Hi David,

Thank you for the suggestion. I think a kasan-specific vmemmap (what I 
described in the previous e-mail) would be a better solution than having 
different prototypes for different builds. It would be cleaner to have 
all kasan-specific code in one place.
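(Just to illustrate David's suggestion for comparison; the names below 
are hypothetical and not taken from the actual patches:)

	#ifdef CONFIG_KASAN
	void *__vmemmap_alloc_block(unsigned long size, int node, bool zero);
	#define vmemmap_alloc_block(size, node) \
		__vmemmap_alloc_block(size, node, true)
	#else
	void *vmemmap_alloc_block(unsigned long size, int node);
	#endif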


Pasha

