Re: [PATCH v4 5/6] pstore/ram: Introduce max_reason and convert dump_oops

2020-05-15 Thread Pavel Tatashin
 pdata.dump_oops = dump_oops;
> +   /* If "max_reason" is set, its value has priority over "dump_oops". */
> +   if (ramoops_max_reason != -1)
> +   pdata.max_reason = ramoops_max_reason;

 (ramoops_max_reason >= 0) might make more sense here; we do not want a
negative max_reason even if one was provided by the user.
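
For clarity, the hunk with the suggested condition would look roughly like
this (context lines taken from the quote above, nothing else changed):

 	pdata.dump_oops = dump_oops;
+	/* If "max_reason" is set, its value has priority over "dump_oops". */
+	if (ramoops_max_reason >= 0)
+		pdata.max_reason = ramoops_max_reason;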

Otherwise the series looks good.

Thank you,
Pasha


Re: [PATCH v4 5/6] pstore/ram: Introduce max_reason and convert dump_oops

2020-05-15 Thread Pavel Tatashin
>  #define parse_u32(name, field, default_value) {			\
> 	ret = ramoops_parse_dt_u32(pdev, name, default_value,		\

The series seems to be missing the patch where ramoops_parse_dt_size
gets renamed to ramoops_parse_dt_u32 and updated to handle a default
value.
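
For reference, a rough sketch of what such a renamed helper could look like
(the actual missing patch is not quoted here, so the exact name, signature,
and error handling below are assumptions):

static int ramoops_parse_dt_u32(struct platform_device *pdev,
				const char *propname,
				u32 default_value, u32 *value)
{
	u32 val32 = 0;
	int ret;

	ret = of_property_read_u32(pdev->dev.of_node, propname, &val32);
	if (ret == -EINVAL) {
		/* Property is missing: fall back to the supplied default. */
		val32 = default_value;
	} else if (ret < 0) {
		dev_err(&pdev->dev, "failed to parse property %s: %d\n",
			propname, ret);
		return ret;
	}

	*value = val32;
	return 0;
}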


Re: [PATCH v4 3/6] printk: Introduce kmsg_dump_reason_str()

2020-05-15 Thread Pavel Tatashin
On Fri, May 15, 2020 at 2:44 PM Kees Cook  wrote:
>
> The pstore subsystem already had a private version of this function.
> With the coming addition of the pstore/zone driver, this needs to be
> shared. As it really should live with printk, move it there instead.
>
> Link: https://lore.kernel.org/lkml/20200510202436.63222-8-keesc...@chromium.org/
> Acked-by: Petr Mladek 
> Acked-by: Sergey Senozhatsky 
> Signed-off-by: Kees Cook 

Reviewed-by: Pavel Tatashin 
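
For readers without the patch at hand, the helper being moved is essentially
a reason-to-string mapping along these lines (a sketch, not a verbatim copy
of the patch):

const char *kmsg_dump_reason_str(enum kmsg_dump_reason reason)
{
	switch (reason) {
	case KMSG_DUMP_PANIC:
		return "Panic";
	case KMSG_DUMP_OOPS:
		return "Oops";
	case KMSG_DUMP_EMERG:
		return "Emergency";
	case KMSG_DUMP_SHUTDOWN:
		return "Shutdown";
	default:
		return "Unknown";
	}
}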


Re: [PATCH v4 1/6] printk: Collapse shutdown types into a single dump reason

2020-05-15 Thread Pavel Tatashin
On Fri, May 15, 2020 at 2:44 PM Kees Cook  wrote:
>
> To turn the KMSG_DUMP_* reasons into a more ordered list, collapse
> the redundant KMSG_DUMP_(RESTART|HALT|POWEROFF) reasons into
> KMSG_DUMP_SHUTDOWN. The current users already don't meaningfully
> distinguish between them, so there's no need to, as discussed here:
> https://lore.kernel.org/lkml/ca+ck2bapv5u1ih5y9t5funtyximtfctdyxjcpuyjoyhnokr...@mail.gmail.com/
>
> Signed-off-by: Kees Cook 

Maybe it makes sense to mention in the commit log that for all three
merged cases there is a pr_emerg() message logged right before the
kmsg_dump(), so the reason is still distinguishable from the dmesg log
itself.
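
The pattern being referred to looks roughly like this in each of the
collapsed shutdown paths (an illustration of the point, not a quote from
the patch):

	pr_emerg("Restarting system\n");
	kmsg_dump(KMSG_DUMP_SHUTDOWN);	/* previously KMSG_DUMP_RESTART */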

Reviewed-by: Pavel Tatashin 


Re: [PATCH v4 0/6] allow ramoops to collect all kmesg_dump events

2020-05-15 Thread Pavel Tatashin
On Fri, May 15, 2020 at 2:44 PM Kees Cook  wrote:
>
> Hello!
>
> I wanted to get the pstore tree nailed down, so here's the v4 of
> Pavel's series, tweaked for the feedback during v3 review.

Hi Kees,

Thank you, I was planning to send a new version of this series later
today. Let me quickly review it.

Pasha

>
> -Kees
>
> v4:
> - rebase on pstore tree
> - collapse shutdown types into a single dump reason
>   https://lore.kernel.org/lkml/ca+ck2bapv5u1ih5y9t5funtyximtfctdyxjcpuyjoyhnokr...@mail.gmail.com/
> - fix dump_oops vs max_reason module params
>   https://lore.kernel.org/lkml/20200512233504.GA118720@sequoia/
> - typos
>   https://lore.kernel.org/lkml/4cdeaa2af2fe0d6cc2ca8ce3a37608340799df8a.ca...@perches.com/
> - rename DT parsing routines ..._size -> ..._u32
>   https://lore.kernel.org/lkml/ca+ck2bcu8efomiu+nebjvn-o2dbuecxwrfssnjb3ys3cacb...@mail.gmail.com/
> v3: https://lore.kernel.org/lkml/20200506211523.15077-1-keesc...@chromium.org/
> v2: https://lore.kernel.org/lkml/20200505154510.93506-1-pasha.tatas...@soleen.com
> v1: https://lore.kernel.org/lkml/20200502143555.543636-1-pasha.tatas...@soleen.com
>
> Kees Cook (3):
>   printk: Collapse shutdown types into a single dump reason
>   printk: Introduce kmsg_dump_reason_str()
>   pstore/ram: Introduce max_reason and convert dump_oops
>
> Pavel Tatashin (3):
>   printk: honor the max_reason field in kmsg_dumper
>   pstore/platform: Pass max_reason to kmesg dump
>   ramoops: Add max_reason optional field to ramoops DT node
>
>  Documentation/admin-guide/ramoops.rst | 14 +++--
>  .../bindings/reserved-memory/ramoops.txt  | 13 -
>  arch/powerpc/kernel/nvram_64.c|  4 +-
>  drivers/platform/chrome/chromeos_pstore.c |  2 +-
>  fs/pstore/platform.c  | 26 ++---
>  fs/pstore/ram.c   | 58 +--
>  include/linux/kmsg_dump.h | 12 +++-
>  include/linux/pstore.h|  7 +++
>  include/linux/pstore_ram.h|  2 +-
>  kernel/printk/printk.c| 32 --
>  kernel/reboot.c   |  6 +-
>  11 files changed, 114 insertions(+), 62 deletions(-)
>
> --
> 2.20.1
>


Re: [PATCH v3 07/11] mm/memory_hotplug: Create memory block devices after arch_add_memory()

2019-05-30 Thread Pavel Tatashin
On Mon, May 27, 2019 at 7:12 AM David Hildenbrand  wrote:
>
> Only memory to be added to the buddy and to be onlined/offlined by
> user space using /sys/devices/system/memory/... needs (and should have!)
> memory block devices.
>
> Factor out creation of memory block devices. Create all devices after
> arch_add_memory() succeeded. We can later drop the want_memblock parameter,
> because it is now effectively stale.
>
> Only after memory block devices have been added, memory can be onlined
> by user space. This implies, that memory is not visible to user space at
> all before arch_add_memory() succeeded.
>
> While at it
> - use WARN_ON_ONCE instead of BUG_ON in moved unregister_memory()
> - introduce find_memory_block_by_id() to search via block id
> - Use find_memory_block_by_id() in init_memory_block() to catch
>   duplicates
>
> Cc: Greg Kroah-Hartman 
> Cc: "Rafael J. Wysocki" 
> Cc: David Hildenbrand 
> Cc: "mike.tra...@hpe.com" 
> Cc: Andrew Morton 
> Cc: Ingo Molnar 
> Cc: Andrew Banman 
> Cc: Oscar Salvador 
> Cc: Michal Hocko 
> Cc: Pavel Tatashin 
> Cc: Qian Cai 
> Cc: Wei Yang 
> Cc: Arun KS 
> Cc: Mathieu Malaterre 
> Signed-off-by: David Hildenbrand 

LGTM
Reviewed-by: Pavel Tatashin 


Re: [PATCH v3 06/11] mm/memory_hotplug: Allow arch_remove_pages() without CONFIG_MEMORY_HOTREMOVE

2019-05-30 Thread Pavel Tatashin
On Mon, May 27, 2019 at 7:12 AM David Hildenbrand  wrote:
>
> We want to improve error handling while adding memory by allowing
> to use arch_remove_memory() and __remove_pages() even if
> CONFIG_MEMORY_HOTREMOVE is not set to e.g., implement something like:
>
> arch_add_memory()
> rc = do_something();
> if (rc) {
> arch_remove_memory();
> }
>
> We won't get rid of CONFIG_MEMORY_HOTREMOVE for now, as it will require
> quite some dependencies for memory offlining.

I like this simplification; we should really get rid of CONFIG_MEMORY_HOTREMOVE.
Reviewed-by: Pavel Tatashin 


Re: [PATCH v3 01/11] mm/memory_hotplug: Simplify and fix check_hotplug_memory_range()

2019-05-30 Thread Pavel Tatashin
On Mon, May 27, 2019 at 7:12 AM David Hildenbrand  wrote:
>
> By converting start and size to page granularity, we actually ignore
> unaligned parts within a page instead of properly bailing out with an
> error.
>
> Cc: Andrew Morton 
> Cc: Oscar Salvador 
> Cc: Michal Hocko 
> Cc: David Hildenbrand 
> Cc: Pavel Tatashin 
> Cc: Qian Cai 
> Cc: Wei Yang 
> Cc: Arun KS 
> Cc: Mathieu Malaterre 
> Reviewed-by: Dan Williams 
> Reviewed-by: Wei Yang 
> Signed-off-by: David Hildenbrand 

Reviewed-by: Pavel Tatashin 


Re: [next-20180711][Oops] linux-next kernel boot is broken on powerpc

2018-07-17 Thread Pavel Tatashin
On Tue, Jul 17, 2018 at 6:49 AM Abdul Haleem wrote:
>
> On Sat, 2018-07-14 at 10:55 +1000, Stephen Rothwell wrote:
> > Hi Abdul,
> >
> > > On Fri, 13 Jul 2018 14:43:11 +0530 Abdul Haleem wrote:
> > >
> > > On Thu, 2018-07-12 at 13:44 -0400, Pavel Tatashin wrote:
> > > > > Related commit could be one of below ? I see lots of patches related 
> > > > > to mm and could not bisect
> > > > >
> > > > > 5479976fda7d3ab23ba0a4eb4d60b296eb88b866 mm: page_alloc: restore 
> > > > > memblock_next_valid_pfn() on arm/arm64
> > > > > 41619b27b5696e7e5ef76d9c692dd7342c1ad7eb 
> > > > > mm-drop-vm_bug_on-from-__get_free_pages-fix
> > > > > 531bbe6bd2721f4b66cdb0f5cf5ac14612fa1419 mm: drop VM_BUG_ON from 
> > > > > __get_free_pages
> > > > > 479350dd1a35f8bfb2534697e5ca68ee8a6e8dea mm, page_alloc: actually 
> > > > > ignore mempolicies for high priority allocations
> > > > > 088018f6fe571444caaeb16e84c9f24f22dfc8b0 mm: skip invalid pages block 
> > > > > at a time in zero_resv_unresv()
> > > >
> > > > Looks like:
> > > > 0ba29a108979 mm/sparse: Remove CONFIG_SPARSEMEM_ALLOC_MEM_MAP_TOGETHER
> > > >
> > > > This patch is going to be reverted from linux-next. Abdul, please
> > > > verify that the issue is gone once you revert this patch.
> > >
> > > kernel booted fine when the above patch is reverted.
> >
> > And it has been removed from linux-next as of next-20180713.  (Friday
> > the 13th is not all bad :-))
>
> Hi Stephen,
>
> After reverting 0ba29a108979, our bare-metal machines boot fails with
> kernel panic, is this related ?
>
> I have attached the boot logs.

The panic happens much later in boot and looks unrelated to the
sparse_init changes.

Thank you,
Pavel


Re: [next-20180711][Oops] linux-next kernel boot is broken on powerpc

2018-07-12 Thread Pavel Tatashin
> Related commit could be one of below ? I see lots of patches related to mm 
> and could not bisect
>
> 5479976fda7d3ab23ba0a4eb4d60b296eb88b866 mm: page_alloc: restore 
> memblock_next_valid_pfn() on arm/arm64
> 41619b27b5696e7e5ef76d9c692dd7342c1ad7eb 
> mm-drop-vm_bug_on-from-__get_free_pages-fix
> 531bbe6bd2721f4b66cdb0f5cf5ac14612fa1419 mm: drop VM_BUG_ON from 
> __get_free_pages
> 479350dd1a35f8bfb2534697e5ca68ee8a6e8dea mm, page_alloc: actually ignore 
> mempolicies for high priority allocations
> 088018f6fe571444caaeb16e84c9f24f22dfc8b0 mm: skip invalid pages block at a 
> time in zero_resv_unresv()

Looks like:
0ba29a108979 mm/sparse: Remove CONFIG_SPARSEMEM_ALLOC_MEM_MAP_TOGETHER

This patch is going to be reverted from linux-next. Abdul, please
verify that the issue is gone once you revert this patch.

Thank you,
Pavel


Re: Boot failures with "mm/sparse: Remove CONFIG_SPARSEMEM_ALLOC_MEM_MAP_TOGETHER" on powerpc (was Re: mmotm 2018-07-10-16-50 uploaded)

2018-07-12 Thread Pavel Tatashin
On Thu, Jul 12, 2018 at 5:50 AM Oscar Salvador wrote:
>
> > > I just roughly check, but if I checked the right place,
> > > vmemmap_populated() checks for the section to contain the flags we are
> > > setting in sparse_init_one_section().
> >
> > Yes.
> >
> > > But with this patch, we populate first everything, and then we call
> > > sparse_init_one_section() in sparse_init().
> > > As I said I could be mistaken because I just checked the surface.

Yes, this is right: sparse_init_one_section() is needed after every
populate call on ppc64. I am adding this to my sparse_init re-write,
and it actually simplifies the code, as it avoids one extra loop and
makes ppc64 work.
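
A minimal sketch of that ordering (simplified; populate_memmap_for() is a
placeholder helper, not the actual rewrite):

	for (pnum = 0; pnum < NR_MEM_SECTIONS; pnum++) {
		struct page *map;

		if (!present_section_nr(pnum))
			continue;
		map = populate_memmap_for(pnum, nid);	/* placeholder helper */
		if (!map)
			continue;
		/* Hook the section up right after its memmap is populated,
		 * so ppc64's vmemmap_populated() sees an initialized section.
		 */
		sparse_init_one_section(__nr_to_section(pnum), pnum, map, usemap);
	}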

Pavel


Re: Boot failures with "mm/sparse: Remove CONFIG_SPARSEMEM_ALLOC_MEM_MAP_TOGETHER" on powerpc (was Re: mmotm 2018-07-10-16-50 uploaded)

2018-07-11 Thread Pavel Tatashin
I am OK if this patch is removed from Baoquan's series. But I would
still like to get rid of CONFIG_SPARSEMEM_ALLOC_MEM_MAP_TOGETHER; I
can work on this in my sparse_init re-write series. ppc64 should
really fall back safely to small-chunk allocations, and if it does not,
there is an existing bug. Michael, please send the config that you
used.

Thank you,
Pavel
On Wed, Jul 11, 2018 at 9:37 AM Oscar Salvador wrote:
>
> On Wed, Jul 11, 2018 at 10:49:58PM +1000, Michael Ellerman wrote:
> > a...@linux-foundation.org writes:
> > > The mm-of-the-moment snapshot 2018-07-10-16-50 has been uploaded to
> > >
> > >http://www.ozlabs.org/~akpm/mmotm/
> > ...
> >
> > > * mm-sparse-add-a-static-variable-nr_present_sections.patch
> > > * mm-sparsemem-defer-the-ms-section_mem_map-clearing.patch
> > > * mm-sparsemem-defer-the-ms-section_mem_map-clearing-fix.patch
> > > * 
> > > mm-sparse-add-a-new-parameter-data_unit_size-for-alloc_usemap_and_memmap.patch
> > > * mm-sparse-optimize-memmap-allocation-during-sparse_init.patch
> > > * 
> > > mm-sparse-optimize-memmap-allocation-during-sparse_init-checkpatch-fixes.patch
> >
> > > * mm-sparse-remove-config_sparsemem_alloc_mem_map_together.patch
> >
> > This seems to be breaking my powerpc pseries qemu boots.
> >
> > The boot log with some extra debug shows eg:
> >
> >   $ make pseries_le_defconfig
>
> Could you please share the config?
> I was not able to find such config in the kernel tree.
> --
> Oscar Salvador
> SUSE L3
>


Re: [PATCHv3 2/4] drivers/base: utilize device tree info to shutdown devices

2018-07-03 Thread Pavel Tatashin
Thank you, Andy, for the heads up. I might need to rebase my work
(http://lkml.kernel.org/r/20180629182541.6735-1-pasha.tatas...@oracle.com)
on top of this change. But it is possible that it will be harder to
parallelize based on the device tree. I will need to think about it.

Pavel

On Tue, Jul 3, 2018 at 6:59 AM Andy Shevchenko wrote:
>
> I think Pavel would be interested to see this as well (he is doing
> some parallel device shutdown stuff)
>
> On Tue, Jul 3, 2018 at 9:50 AM, Pingfan Liu  wrote:
> > commit 52cdbdd49853 ("driver core: correct device's shutdown order")
> > places an assumption of supplier<-consumer order on the process of probe.
> > But it turns out to break down the parent <- child order in some scene.
> > E.g in pci, a bridge is enabled by pci core, and behind it, the devices
> > have been probed. Then comes the bridge's module, which enables extra
> > feature(such as hotplug) on this bridge. This will break the
> > parent<-children order and cause failure when "kexec -e" in some scenario.
> >
> > The detailed description of the scenario:
> > An IBM Power9 machine on which, two drivers portdrv_pci and shpchp(a mod)
> > match the PCI_CLASS_BRIDGE_PCI, but neither of them success to probe due
> > to some issue. For this case, the bridge is moved after its children in
> > devices_kset. Then, when "kexec -e", a ata-disk behind the bridge can not
> > write back buffer in flight due to the former shutdown of the bridge which
> > clears the BusMaster bit.
> >
> > It is a little hard to impose both "parent<-child" and "supplier<-consumer"
> > order on devices_kset. Take the following scene:
> > step0: before a consumer's probing, (note child_a is supplier of consumer_a)
> >   [ consumer-X, child_a, , child_z] [... consumer_a, ..., consumer_z, 
> > ...] supplier-X
> >  ^^ affected range 
> > ^^
> > step1: when probing, moving consumer-X after supplier-X
> >   [ child_a, , child_z] [ consumer_a, ..., consumer_z, ...] 
> > supplier-X, consumer-X
> > step2: the children of consumer-X should be re-ordered to maintain the seq
> >   [... consumer_a, ..., consumer_z, ] supplier-X  [consumer-X, child_a, 
> > , child_z]
> > step3: the consumer_a should be re-ordered to maintain the seq
> >   [... consumer_z, ...] supplier-X [ consumer-X, child_a, consumer_a ..., 
> > child_z]
> >
> > It requires two nested recursion to drain out all out-of-order item in
> > "affected range". To avoid such complicated code, this patch suggests
> > to utilize the info in device tree, instead of using the order of
> > devices_kset during shutdown. It iterates the device tree, and firstly
> > shutdown a device's children and consumers. After this patch, the buggy
> > commit is hollow and left to clean.
> >
> > Cc: Greg Kroah-Hartman 
> > Cc: Rafael J. Wysocki 
> > Cc: Grygorii Strashko 
> > Cc: Christoph Hellwig 
> > Cc: Bjorn Helgaas 
> > Cc: Dave Young 
> > Cc: linux-...@vger.kernel.org
> > Cc: linuxppc-dev@lists.ozlabs.org
> > Signed-off-by: Pingfan Liu 
> > ---
> >  drivers/base/core.c| 48 
> > +++-
> >  include/linux/device.h |  1 +
> >  2 files changed, 44 insertions(+), 5 deletions(-)
> >
> > diff --git a/drivers/base/core.c b/drivers/base/core.c
> > index a48868f..684b994 100644
> > --- a/drivers/base/core.c
> > +++ b/drivers/base/core.c
> > @@ -1446,6 +1446,7 @@ void device_initialize(struct device *dev)
> > INIT_LIST_HEAD(&dev->links.consumers);
> > INIT_LIST_HEAD(&dev->links.suppliers);
> > dev->links.status = DL_DEV_NO_DRIVER;
> > +   dev->shutdown = false;
> >  }
> >  EXPORT_SYMBOL_GPL(device_initialize);
> >
> > @@ -2811,7 +2812,6 @@ static void __device_shutdown(struct device *dev)
> >  * lock is to be held
> >  */
> > parent = get_device(dev->parent);
> > -   get_device(dev);
> > /*
> >  * Make sure the device is off the kset list, in the
> >  * event that dev->*->shutdown() doesn't remove it.
> > @@ -2842,23 +2842,60 @@ static void __device_shutdown(struct device *dev)
> > dev_info(dev, "shutdown\n");
> > dev->driver->shutdown(dev);
> > }
> > -
> > +   dev->shutdown = true;
> > device_unlock(dev);
> > if (parent)
> > device_unlock(parent);
> >
> > -   put_device(dev);
> > put_device(parent);
> > spin_lock(&devices_kset->list_lock);
> >  }
> >
> > +/* shutdown dev's children and consumer firstly, then itself */
> > +static int device_for_each_child_shutdown(struct device *dev)
> > +{
> > +   struct klist_iter i;
> > +   struct device *child;
> > +   struct device_link *link;
> > +
> > +   /* already shutdown, then skip this sub tree */
> > +   if (dev->shutdown)
> > +   return 0;
> > +
> > +   if (!dev->p)
> > +   goto check_consumers;
> > +
> > +   /* there is breakage of 

Re: [PATCH v1] mm: relax deferred struct page requirements

2018-06-19 Thread Pavel Tatashin
On Tue, Jun 19, 2018 at 9:50 AM Pavel Tatashin wrote:
>
> On Sat, Jun 16, 2018 at 4:04 AM Jiri Slaby  wrote:
> >
> > On 11/21/2017, 08:24 AM, Michal Hocko wrote:
> > > On Thu 16-11-17 20:46:01, Pavel Tatashin wrote:
> > >> There is no need to have ARCH_SUPPORTS_DEFERRED_STRUCT_PAGE_INIT,
> > >> as all the page initialization code is in common code.
> > >>
> > >> Also, there is no need to depend on MEMORY_HOTPLUG, as initialization 
> > >> code
> > >> does not really use hotplug memory functionality. So, we can remove this
> > >> requirement as well.
> > >>
> > >> This patch allows to use deferred struct page initialization on all
> > >> platforms with memblock allocator.
> > >>
> > >> Tested on x86, arm64, and sparc. Also, verified that code compiles on
> > >> PPC with CONFIG_MEMORY_HOTPLUG disabled.
> > >
> > > There is slight risk that we will encounter corner cases on some
> > > architectures with weird memory layout/topology
> >
> > Which x86_32-pae seems to be. Many bad page state errors are emitted
> > during boot when this patch is applied:
>
> Hi Jiri,
>
> Thank you for reporting this bug.
>
> Because 32-bit systems are limited in the maximum amount of physical
> memory, they don't need deferred struct pages. So, we can add depends
> on 64BIT to DEFERRED_STRUCT_PAGE_INIT in mm/Kconfig.
>
> However, before we do this, I want to try reproducing this problem and
> root cause it, as it might expose a general problem that is not 32-bit
> specific.

Hi Jiri,

Could you please attach your config and full qemu arguments that you
used to reproduce this bug.

Thank you,
Pavel


>
> Thank you,
> Pavel


Re: [PATCH v1] mm: relax deferred struct page requirements

2018-06-19 Thread Pavel Tatashin
On Sat, Jun 16, 2018 at 4:04 AM Jiri Slaby  wrote:
>
> On 11/21/2017, 08:24 AM, Michal Hocko wrote:
> > On Thu 16-11-17 20:46:01, Pavel Tatashin wrote:
> >> There is no need to have ARCH_SUPPORTS_DEFERRED_STRUCT_PAGE_INIT,
> >> as all the page initialization code is in common code.
> >>
> >> Also, there is no need to depend on MEMORY_HOTPLUG, as initialization code
> >> does not really use hotplug memory functionality. So, we can remove this
> >> requirement as well.
> >>
> >> This patch allows to use deferred struct page initialization on all
> >> platforms with memblock allocator.
> >>
> >> Tested on x86, arm64, and sparc. Also, verified that code compiles on
> >> PPC with CONFIG_MEMORY_HOTPLUG disabled.
> >
> > There is slight risk that we will encounter corner cases on some
> > architectures with weird memory layout/topology
>
> Which x86_32-pae seems to be. Many bad page state errors are emitted
> during boot when this patch is applied:

Hi Jiri,

Thank you for reporting this bug.

Because 32-bit systems are limited in the maximum amount of physical
memory, they don't need deferred struct pages. So, we can add depends
on 64BIT to DEFERRED_STRUCT_PAGE_INIT in mm/Kconfig.

However, before we do this, I want to try reproducing this problem and
root cause it, as it might expose a general problem that is not 32-bit
specific.

Thank you,
Pavel


[PATCH v1] mm: relax deferred struct page requirements

2017-11-16 Thread Pavel Tatashin
There is no need to have ARCH_SUPPORTS_DEFERRED_STRUCT_PAGE_INIT,
as all the page initialization code is in common code.

Also, there is no need to depend on MEMORY_HOTPLUG, as initialization code
does not really use hotplug memory functionality. So, we can remove this
requirement as well.

This patch allows deferred struct page initialization to be used on all
platforms with the memblock allocator.

Tested on x86, arm64, and sparc. Also, verified that code compiles on
PPC with CONFIG_MEMORY_HOTPLUG disabled.

Signed-off-by: Pavel Tatashin <pasha.tatas...@oracle.com>
---
 arch/powerpc/Kconfig | 1 -
 arch/s390/Kconfig| 1 -
 arch/x86/Kconfig | 1 -
 mm/Kconfig   | 7 +--
 4 files changed, 1 insertion(+), 9 deletions(-)

diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index cb782ac1c35d..1540348691c9 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -148,7 +148,6 @@ config PPC
select ARCH_MIGHT_HAVE_PC_PARPORT
select ARCH_MIGHT_HAVE_PC_SERIO
select ARCH_SUPPORTS_ATOMIC_RMW
-   select ARCH_SUPPORTS_DEFERRED_STRUCT_PAGE_INIT
select ARCH_USE_BUILTIN_BSWAP
select ARCH_USE_CMPXCHG_LOCKREF if PPC64
select ARCH_WANT_IPC_PARSE_VERSION
diff --git a/arch/s390/Kconfig b/arch/s390/Kconfig
index 863a62a6de3c..525c2e3df6f5 100644
--- a/arch/s390/Kconfig
+++ b/arch/s390/Kconfig
@@ -108,7 +108,6 @@ config S390
select ARCH_INLINE_WRITE_UNLOCK_IRQRESTORE
select ARCH_SAVE_PAGE_KEYS if HIBERNATION
select ARCH_SUPPORTS_ATOMIC_RMW
-   select ARCH_SUPPORTS_DEFERRED_STRUCT_PAGE_INIT
select ARCH_SUPPORTS_NUMA_BALANCING
select ARCH_USE_BUILTIN_BSWAP
select ARCH_USE_CMPXCHG_LOCKREF
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index df3276d6bfe3..00a5446de394 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -69,7 +69,6 @@ config X86
select ARCH_MIGHT_HAVE_PC_PARPORT
select ARCH_MIGHT_HAVE_PC_SERIO
select ARCH_SUPPORTS_ATOMIC_RMW
-   select ARCH_SUPPORTS_DEFERRED_STRUCT_PAGE_INIT
select ARCH_SUPPORTS_NUMA_BALANCING if X86_64
select ARCH_USE_BUILTIN_BSWAP
select ARCH_USE_QUEUED_RWLOCKS
diff --git a/mm/Kconfig b/mm/Kconfig
index 9c4b80c2..c6bd0309ce7a 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -639,15 +639,10 @@ config MAX_STACK_SIZE_MB
 
  A sane initial value is 80 MB.
 
-# For architectures that support deferred memory initialisation
-config ARCH_SUPPORTS_DEFERRED_STRUCT_PAGE_INIT
-   bool
-
 config DEFERRED_STRUCT_PAGE_INIT
bool "Defer initialisation of struct pages to kthreads"
default n
-   depends on ARCH_SUPPORTS_DEFERRED_STRUCT_PAGE_INIT
-   depends on NO_BOOTMEM && MEMORY_HOTPLUG
+   depends on NO_BOOTMEM
depends on !FLATMEM
help
  Ordinarily all struct pages are initialised during early boot in a
-- 
2.15.0



Re: [PATCH v12 08/11] arm64/kasan: add and use kasan_map_populate()

2017-11-03 Thread Pavel Tatashin

1. Replace these two patches:

arm64/kasan: add and use kasan_map_populate()
x86/kasan: add and use kasan_map_populate()

With:

x86/mm/kasan: don't use vmemmap_populate() to initialize
  shadow
arm64/mm/kasan: don't use vmemmap_populate() to initialize
  shadow



Pavel, could you please send the patches? These patches don't interfere with
the rest of the series, so I think it should be enough to send just two
patches to replace the old ones.



Hi Andrey,

I asked Michal and Andrew how to proceed but never received a reply from
them. The patches are independent from the deferred page init series as long
as they come before the series.


Anyway, I will post these two patches to the mailing list soon. But I am not
really sure if they will be taken into the mm-tree.


Pavel


Re: [PATCH v12 09/11] mm: stop zeroing memory during allocation in vmemmap

2017-10-19 Thread Pavel Tatashin
This looks good to me, thank you Andrew.

Pavel


Re: [PATCH v12 08/11] arm64/kasan: add and use kasan_map_populate()

2017-10-18 Thread Pavel Tatashin

Hi Andrew and Michal,

There are a few changes I need to do to my series:

1. Replace these two patches:

arm64/kasan: add and use kasan_map_populate()
x86/kasan: add and use kasan_map_populate()

With:

x86/mm/kasan: don't use vmemmap_populate() to initialize
 shadow
arm64/mm/kasan: don't use vmemmap_populate() to initialize
 shadow

2. Fix a kbuild warning about section mismatch in
mm: deferred_init_memmap improvements

How should I proceed to get these replaced in mm-tree? Send three new 
patches, or send a new series?


Thank you,
Pavel

On 10/18/2017 01:18 PM, Andrey Ryabinin wrote:

On 10/18/2017 08:08 PM, Pavel Tatashin wrote:


As I said, I'm fine either way, I just didn't want to cause extra work
or rebasing:

http://lists.infradead.org/pipermail/linux-arm-kernel/2017-October/535703.html


Makes sense. I am also fine either way, I can submit a new patch merging 
together the two if needed.



Please, do this. Single patch makes more sense



Pavel


Re: [PATCH v12 07/11] x86/kasan: add and use kasan_map_populate()

2017-10-18 Thread Pavel Tatashin
Thank you, Andrey, I will test this patch. Should it go on top of, or replace,
the existing patch in mm-tree? ARM and x86 should be done the same way:
either both as follow-ups or both as replacements.


Pavel


Re: [PATCH v12 08/11] arm64/kasan: add and use kasan_map_populate()

2017-10-18 Thread Pavel Tatashin


As I said, I'm fine either way, I just didn't want to cause extra work
or rebasing:

http://lists.infradead.org/pipermail/linux-arm-kernel/2017-October/535703.html


Makes sense. I am also fine either way, I can submit a new patch merging 
together the two if needed.


Pavel


Re: [PATCH v12 08/11] arm64/kasan: add and use kasan_map_populate()

2017-10-18 Thread Pavel Tatashin

Hi Andrey,

I asked Will about it, and he preferred to have this patch added to
the end of my series instead of replacing "arm64/kasan: add and use
kasan_map_populate()".


In addition, Will's patch stops using large pages for kasan memory, and
thus might add some regression, in which case it is easier to revert just
that patch instead of the whole series. It is unlikely that the regression
is going to be detectable, because kasan by itself makes the system quite
slow already.


Pasha


Re: [PATCH v12 01/11] mm: deferred_init_memmap improvements

2017-10-17 Thread Pavel Tatashin

This really begs to have two patches... I will not insist though. I also
suspect the code can be further simplified but again this is nothing to
block this to go.


Perhaps "page" can be avoided in deferred_init_range(), as pfn is 
converted to page in deferred_free_range, but I have not studied it.


  

Signed-off-by: Pavel Tatashin <pasha.tatas...@oracle.com>
Reviewed-by: Steven Sistare <steven.sist...@oracle.com>
Reviewed-by: Daniel Jordan <daniel.m.jor...@oracle.com>
Reviewed-by: Bob Picco <bob.pi...@oracle.com>


I do not see any obvious issues in the patch

Acked-by: Michal Hocko <mho...@suse.com>


Thank you very much!

Pavel




---
  mm/page_alloc.c | 168 
  1 file changed, 85 insertions(+), 83 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 77e4d3c5c57b..cdbd14829fd3 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1410,14 +1410,17 @@ void clear_zone_contiguous(struct zone *zone)
  }
  
  #ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT

-static void __init deferred_free_range(struct page *page,
-   unsigned long pfn, int nr_pages)
+static void __init deferred_free_range(unsigned long pfn,
+  unsigned long nr_pages)
  {
-   int i;
+   struct page *page;
+   unsigned long i;
  
-	if (!page)

+   if (!nr_pages)
return;
  
+	page = pfn_to_page(pfn);

+
/* Free a large naturally-aligned chunk if possible */
if (nr_pages == pageblock_nr_pages &&
(pfn & (pageblock_nr_pages - 1)) == 0) {
@@ -1443,19 +1446,89 @@ static inline void __init 
pgdat_init_report_one_done(void)
	complete(&pgdat_init_all_done_comp);
  }
  
+/*

+ * Helper for deferred_init_range, free the given range, reset the counters, 
and
+ * return number of pages freed.
+ */
+static inline unsigned long __def_free(unsigned long *nr_free,
+  unsigned long *free_base_pfn,
+  struct page **page)
+{
+   unsigned long nr = *nr_free;
+
+   deferred_free_range(*free_base_pfn, nr);
+   *free_base_pfn = 0;
+   *nr_free = 0;
+   *page = NULL;
+
+   return nr;
+}
+
+static unsigned long deferred_init_range(int nid, int zid, unsigned long pfn,
+unsigned long end_pfn)
+{
+   struct mminit_pfnnid_cache nid_init_state = { };
+   unsigned long nr_pgmask = pageblock_nr_pages - 1;
+   unsigned long free_base_pfn = 0;
+   unsigned long nr_pages = 0;
+   unsigned long nr_free = 0;
+   struct page *page = NULL;
+
+   for (; pfn < end_pfn; pfn++) {
+   /*
+* First we check if pfn is valid on architectures where it is
+* possible to have holes within pageblock_nr_pages. On systems
+* where it is not possible, this function is optimized out.
+*
+* Then, we check if a current large page is valid by only
+* checking the validity of the head pfn.
+*
+* meminit_pfn_in_nid is checked on systems where pfns can
+* interleave within a node: a pfn is between start and end
+* of a node, but does not belong to this memory node.
+*
+* Finally, we minimize pfn page lookups and scheduler checks by
+* performing it only once every pageblock_nr_pages.
+*/
+   if (!pfn_valid_within(pfn)) {
+   nr_pages += __def_free(&nr_free, &free_base_pfn, &page);
+   } else if (!(pfn & nr_pgmask) && !pfn_valid(pfn)) {
+   nr_pages += __def_free(&nr_free, &free_base_pfn, &page);
+   } else if (!meminit_pfn_in_nid(pfn, nid, &nid_init_state)) {
+   nr_pages += __def_free(&nr_free, &free_base_pfn, &page);
+   } else if (page && (pfn & nr_pgmask)) {
+   page++;
+   __init_single_page(page, pfn, zid, nid);
+   nr_free++;
+   } else {
+   nr_pages += __def_free(&nr_free, &free_base_pfn, &page);
+   page = pfn_to_page(pfn);
+   __init_single_page(page, pfn, zid, nid);
+   free_base_pfn = pfn;
+   nr_free = 1;
+   cond_resched();
+   }
+   }
+   /* Free the last block of pages to allocator */
+   nr_pages += __def_free(&nr_free, &free_base_pfn, &page);
+
+   return nr_pages;
+}
+
  /* Initialise remaining memory on a node */
  static int __init deferred_init_memmap(void *data)
  {
pg_data_t *pgdat = data;
int nid = pgdat->node_id;
-   struct mminit_pfnnid_cache nid_init_state = { };
unsigned long start = jiffies;
unsigned

[PATCH v12 10/11] sparc64: optimized struct page zeroing

2017-10-13 Thread Pavel Tatashin
Add an optimized mm_zero_struct_page(), so struct pages are zeroed without
calling memset(). We do eight to ten regular stores based on the size of
struct page. The compiler optimizes out the conditions of the switch() statement.

SPARC-M6 with 15T of memory, single thread performance:

                        BASE             FIX              OPTIMIZED_FIX
bootmem_init             28.440467985s     2.305674818s     2.305161615s
free_area_init_nodes    202.845901673s   225.343084508s   172.556506560s
                        ------------------------------------------------
Total                   231.286369658s   227.648759326s   174.861668175s

BASE:  current linux
FIX:   This patch series without "optimized struct page zeroing"
OPTIMIZED_FIX: This patch series including the current patch.

bootmem_init() is where memory for struct pages is zeroed during
allocation. Note, about two seconds in this function is a fixed time: it
does not increase as memory is increased.

Signed-off-by: Pavel Tatashin <pasha.tatas...@oracle.com>
Reviewed-by: Steven Sistare <steven.sist...@oracle.com>
Reviewed-by: Daniel Jordan <daniel.m.jor...@oracle.com>
Reviewed-by: Bob Picco <bob.pi...@oracle.com>
Acked-by: David S. Miller <da...@davemloft.net>
---
 arch/sparc/include/asm/pgtable_64.h | 30 ++
 1 file changed, 30 insertions(+)

diff --git a/arch/sparc/include/asm/pgtable_64.h 
b/arch/sparc/include/asm/pgtable_64.h
index 4fefe3762083..8ed478abc630 100644
--- a/arch/sparc/include/asm/pgtable_64.h
+++ b/arch/sparc/include/asm/pgtable_64.h
@@ -230,6 +230,36 @@ extern unsigned long _PAGE_ALL_SZ_BITS;
 extern struct page *mem_map_zero;
 #define ZERO_PAGE(vaddr)   (mem_map_zero)
 
+/* This macro must be updated when the size of struct page grows above 80
+ * or reduces below 64.
+ * The idea that compiler optimizes out switch() statement, and only
+ * leaves clrx instructions
+ */
+#define mm_zero_struct_page(pp) do {				\
+   unsigned long *_pp = (void *)(pp);  \
+   \
+/* Check that struct page is either 64, 72, or 80 bytes */ \
+   BUILD_BUG_ON(sizeof(struct page) & 7);  \
+   BUILD_BUG_ON(sizeof(struct page) < 64); \
+   BUILD_BUG_ON(sizeof(struct page) > 80); \
+   \
+   switch (sizeof(struct page)) {  \
+   case 80:\
+   _pp[9] = 0; /* fallthrough */   \
+   case 72:\
+   _pp[8] = 0; /* fallthrough */   \
+   default:\
+   _pp[7] = 0; \
+   _pp[6] = 0; \
+   _pp[5] = 0; \
+   _pp[4] = 0; \
+   _pp[3] = 0; \
+   _pp[2] = 0; \
+   _pp[1] = 0; \
+   _pp[0] = 0; \
+   }   \
+} while (0)
+
 /* PFNs are real physical page numbers.  However, mem_map only begins to record
  * per-page information starting at pfn_base.  This is to handle systems where
  * the first physical page in the machine is at some huge physical address,
-- 
2.14.2



[PATCH v12 05/11] mm: defining memblock_virt_alloc_try_nid_raw

2017-10-13 Thread Pavel Tatashin
* A new variant of memblock_virt_alloc_* allocations:
memblock_virt_alloc_try_nid_raw()
- Does not zero the allocated memory
- Does not panic if request cannot be satisfied

* optimize early system hash allocations

Clients can call alloc_large_system_hash() with the HASH_ZERO flag to specify
that the memory allocated for the system hash needs to be zeroed;
otherwise the memory does not need to be zeroed, and the client will
initialize it.
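
As a usage illustration (modeled on existing callers such as the inode hash
table; the call below is not part of this patch):

	inode_hashtable =
		alloc_large_system_hash("Inode-cache",
					sizeof(struct hlist_head),
					ihash_entries,
					14,
					HASH_ZERO,	/* table must come back zeroed */
					&i_hash_shift,
					&i_hash_mask,
					0,
					0);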

If memory does not need to be zero'd, call the new
memblock_virt_alloc_raw() interface, and thus improve the boot performance.

* debug for the raw allocator

When CONFIG_DEBUG_VM is enabled, this patch sets all the memory that is
returned by memblock_virt_alloc_try_nid_raw() to ones to ensure that no
places expect zeroed memory.
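
A rough sketch of that debug aid (the corresponding hunk is not part of the
quoted diff below, so the exact placement and poison value are assumptions):

void * __init memblock_virt_alloc_try_nid_raw(phys_addr_t size, phys_addr_t align,
					      phys_addr_t min_addr,
					      phys_addr_t max_addr, int nid)
{
	void *ptr;

	ptr = memblock_virt_alloc_internal(size, align, min_addr, max_addr, nid);
#ifdef CONFIG_DEBUG_VM
	if (ptr && size > 0)
		memset(ptr, 0xff, size);	/* all ones, deliberately not zero */
#endif
	return ptr;
}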

Signed-off-by: Pavel Tatashin <pasha.tatas...@oracle.com>
Reviewed-by: Steven Sistare <steven.sist...@oracle.com>
Reviewed-by: Daniel Jordan <daniel.m.jor...@oracle.com>
Reviewed-by: Bob Picco <bob.pi...@oracle.com>
Acked-by: Michal Hocko <mho...@suse.com>
---
 include/linux/bootmem.h | 27 ++
 mm/memblock.c   | 60 +++--
 mm/page_alloc.c | 15 ++---
 3 files changed, 87 insertions(+), 15 deletions(-)

diff --git a/include/linux/bootmem.h b/include/linux/bootmem.h
index e223d91b6439..ea30b3987282 100644
--- a/include/linux/bootmem.h
+++ b/include/linux/bootmem.h
@@ -160,6 +160,9 @@ extern void *__alloc_bootmem_low_node(pg_data_t *pgdat,
 #define BOOTMEM_ALLOC_ANYWHERE (~(phys_addr_t)0)
 
 /* FIXME: Move to memblock.h at a point where we remove nobootmem.c */
+void *memblock_virt_alloc_try_nid_raw(phys_addr_t size, phys_addr_t align,
+ phys_addr_t min_addr,
+ phys_addr_t max_addr, int nid);
 void *memblock_virt_alloc_try_nid_nopanic(phys_addr_t size,
phys_addr_t align, phys_addr_t min_addr,
phys_addr_t max_addr, int nid);
@@ -176,6 +179,14 @@ static inline void * __init memblock_virt_alloc(
NUMA_NO_NODE);
 }
 
+static inline void * __init memblock_virt_alloc_raw(
+   phys_addr_t size,  phys_addr_t align)
+{
+   return memblock_virt_alloc_try_nid_raw(size, align, BOOTMEM_LOW_LIMIT,
+   BOOTMEM_ALLOC_ACCESSIBLE,
+   NUMA_NO_NODE);
+}
+
 static inline void * __init memblock_virt_alloc_nopanic(
phys_addr_t size, phys_addr_t align)
 {
@@ -257,6 +268,14 @@ static inline void * __init memblock_virt_alloc(
return __alloc_bootmem(size, align, BOOTMEM_LOW_LIMIT);
 }
 
+static inline void * __init memblock_virt_alloc_raw(
+   phys_addr_t size,  phys_addr_t align)
+{
+   if (!align)
+   align = SMP_CACHE_BYTES;
+   return __alloc_bootmem_nopanic(size, align, BOOTMEM_LOW_LIMIT);
+}
+
 static inline void * __init memblock_virt_alloc_nopanic(
phys_addr_t size, phys_addr_t align)
 {
@@ -309,6 +328,14 @@ static inline void * __init 
memblock_virt_alloc_try_nid(phys_addr_t size,
  min_addr);
 }
 
+static inline void * __init memblock_virt_alloc_try_nid_raw(
+   phys_addr_t size, phys_addr_t align,
+   phys_addr_t min_addr, phys_addr_t max_addr, int nid)
+{
+   return ___alloc_bootmem_node_nopanic(NODE_DATA(nid), size, align,
+   min_addr, max_addr);
+}
+
 static inline void * __init memblock_virt_alloc_try_nid_nopanic(
phys_addr_t size, phys_addr_t align,
phys_addr_t min_addr, phys_addr_t max_addr, int nid)
diff --git a/mm/memblock.c b/mm/memblock.c
index 91205780e6b1..1f299fb1eb08 100644
--- a/mm/memblock.c
+++ b/mm/memblock.c
@@ -1327,7 +1327,6 @@ static void * __init memblock_virt_alloc_internal(
return NULL;
 done:
ptr = phys_to_virt(alloc);
-   memset(ptr, 0, size);
 
/*
 * The min_count is set to 0 so that bootmem allocated blocks
@@ -1340,6 +1339,45 @@ static void * __init memblock_virt_alloc_internal(
return ptr;
 }
 
+/**
+ * memblock_virt_alloc_try_nid_raw - allocate boot memory block without zeroing
+ * memory and without panicking
+ * @size: size of memory block to be allocated in bytes
+ * @align: alignment of the region and block's size
+ * @min_addr: the lower bound of the memory region from where the allocation
+ *   is preferred (phys address)
+ * @max_addr: the upper bound of the memory region from where the allocation
+ *   is preferred (phys address), or %BOOTMEM_ALLOC_ACCESSIBLE to
+ *   allocate only from memory limited by memblock.current_limit value
+ * @nid: nid of the free area to fi

[PATCH v12 00/11] complete deferred page initialization

2017-10-13 Thread Pavel Tatashin
Changelog:
v12 - v11
- Improved comments for mm: zero reserved and unavailable struct pages
- Added back patch: mm: deferred_init_memmap improvements
- Added patch from Will Deacon: arm64: kasan: Avoid using
  vmemmap_populate to initialise shadow

v11 - v10
- Moved kasan_map_populate() implementation from common code into arch
  specific as discussed with Will Deacon. We do not need
  "mm/kasan: kasan specific map populate function" anymore, so only
  9 patches left.

v10 - v9
- Addressed new comments from Michal Hocko.
- Sent "mm: deferred_init_memmap improvements" as a separate patch as
  it is also fixing existing problem.
- Merged "mm: stop zeroing memory during allocation in vmemmap" with
  "mm: zero struct pages during initialization".
- Added more comments "mm: zero reserved and unavailable struct pages"

v9 - v8
- Addressed comments raised by Mark Rutland and Ard Biesheuvel: changed
  kasan implementation. Added a new function: kasan_map_populate() that
  zeroes the allocated and mapped memory

v8 - v7
- Added Acked-by's from Dave Miller for SPARC changes
- Fixed a minor compiling issue on tile architecture reported by kbuild

v7 - v6
- Addressed comments from Michal Hocko
- memblock_discard() patch was removed from this series and integrated
  separately
- Fixed bug reported by kbuild test robot new patch:
  mm: zero reserved and unavailable struct pages
- Removed patch
  x86/mm: reserve only exiting low pages
  As, it is not needed anymore, because of the previous fix 
- Re-wrote deferred_init_memmap(), found and fixed an existing bug where
  the page variable is not reset when zone holes are present.
- Merged several patches together per Michal request
- Added performance data including raw logs

v6 - v5
- Fixed ARM64 + kasan code, as reported by Ard Biesheuvel
- Tested ARM64 code in qemu and found few more issues, that I fixed in this
  iteration
- Added page roundup/rounddown to x86 and arm zeroing routines to zero the
  whole allocated range, instead of only provided address range.
- Addressed SPARC related comment from Sam Ravnborg
- Fixed section mismatch warnings related to memblock_discard().

v5 - v4
- Fixed build issues reported by kbuild on various configurations
v4 - v3
- Rewrote code to zero struct pages in __init_single_page() as
  suggested by Michal Hocko
- Added code to handle issues related to accessing struct page
  memory before they are initialized.

v3 - v2
- Addressed David Miller comments about one change per patch:
* Split changes to platforms into 4 patches
* Made "do not zero vmemmap_buf" as a separate patch

v2 - v1
- Per request, added s390 to deferred "struct page" zeroing
- Collected performance data on x86 which proves the importance of
  keeping memset() as a prefetch (see below).

SMP machines can benefit from the DEFERRED_STRUCT_PAGE_INIT config option,
which defers initializing struct pages until all cpus have been started so
it can be done in parallel.

However, this feature is sub-optimal, because the deferred page
initialization code expects that the struct pages have already been zeroed,
and the zeroing is done early in boot with a single thread only.  Also, we
access that memory and set flags before struct pages are initialized. All
of this is fixed in this patchset.

In this work we do the following:
- Never read access struct page until it was initialized
- Never set any fields in struct pages before they are initialized
- Zero struct page at the beginning of struct page initialization


==
Performance improvements on x86 machine with 8 nodes:
Intel(R) Xeon(R) CPU E7-8895 v3 @ 2.60GHz and 1T of memory:
                        TIME            SPEED UP
base no deferred:       95.796233s
fix no deferred:        79.978956s      19.77%

base deferred:          77.254713s
fix deferred:           55.050509s      40.34%
==
SPARC M6 3600 MHz with 15T of memory
                        TIME            SPEED UP
base no deferred:       358.335727s
fix no deferred:        302.320936s     18.52%

base deferred:          237.534603s
fix deferred:           182.103003s     30.44%
==
Raw dmesg output with timestamps:
x86 base no deferred:https://hastebin.com/ofunepurit.scala
x86 base deferred:   https://hastebin.com/ifazegeyas.scala
x86 fix no deferred: https://hastebin.com/pegocohevo.scala
x86 fix deferred:https://hastebin.com/ofupevikuk.scala
sparc base no deferred:  https://hastebin.com/ibobeteken.go
sparc base deferred: https://hastebin.com/fariqimiyu.go
sparc fix no deferred:   https://hastebin.com/muhegoheyi.go
sparc fix deferred:  https://hastebin.com/xadinobutu.go

Pavel Tatashin (10):
  mm: deferred_init_memmap improvements
  x86/mm: setting fields in deferred pa

[PATCH v12 09/11] mm: stop zeroing memory during allocation in vmemmap

2017-10-13 Thread Pavel Tatashin
vmemmap_alloc_block() will no longer zero the block, so zero memory
at its call sites for everything except struct pages.  Struct page memory
is zero'd by struct page initialization.

Replace allocators in sparse-vmemmap to use the non-zeroing version. So,
we will get the performance improvement by zeroing the memory in parallel
when struct pages are zeroed.

Add struct page zeroing as a part of initialization of other fields in
__init_single_page().

This single thread performance collected on: Intel(R) Xeon(R) CPU E7-8895
v3 @ 2.60GHz with 1T of memory (268400646 pages in 8 nodes):

                     BASE             FIX
sparse_init         11.244671836s    0.007199623s
zone_sizes_init      4.879775891s    8.355182299s
                    ------------------------------
Total               16.124447727s    8.362381922s

sparse_init is where memory for struct pages is zeroed, and the zeroing
part is moved later in this patch into __init_single_page(), which is
called from zone_sizes_init().

Signed-off-by: Pavel Tatashin <pasha.tatas...@oracle.com>
Reviewed-by: Steven Sistare <steven.sist...@oracle.com>
Reviewed-by: Daniel Jordan <daniel.m.jor...@oracle.com>
Reviewed-by: Bob Picco <bob.pi...@oracle.com>
Acked-by: Michal Hocko <mho...@suse.com>
---
 include/linux/mm.h  | 11 +++
 mm/page_alloc.c |  1 +
 mm/sparse-vmemmap.c | 15 +++
 mm/sparse.c |  6 +++---
 4 files changed, 22 insertions(+), 11 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 04c8b2e5aff4..fd045a3b243a 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2501,6 +2501,17 @@ static inline void *vmemmap_alloc_block_buf(unsigned 
long size, int node)
return __vmemmap_alloc_block_buf(size, node, NULL);
 }
 
+static inline void *vmemmap_alloc_block_zero(unsigned long size, int node)
+{
+   void *p = vmemmap_alloc_block(size, node);
+
+   if (!p)
+   return NULL;
+   memset(p, 0, size);
+
+   return p;
+}
+
 void vmemmap_verify(pte_t *, int, unsigned long, unsigned long);
 int vmemmap_populate_basepages(unsigned long start, unsigned long end,
   int node);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 54e0fa12e7ff..eb2ac79926e8 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1170,6 +1170,7 @@ static void free_one_page(struct zone *zone,
 static void __meminit __init_single_page(struct page *page, unsigned long pfn,
unsigned long zone, int nid)
 {
+   mm_zero_struct_page(page);
set_page_links(page, zone, nid, pfn);
init_page_count(page);
page_mapcount_reset(page);
diff --git a/mm/sparse-vmemmap.c b/mm/sparse-vmemmap.c
index d1a39b8051e0..c2f5654e7c9d 100644
--- a/mm/sparse-vmemmap.c
+++ b/mm/sparse-vmemmap.c
@@ -41,7 +41,7 @@ static void * __ref __earlyonly_bootmem_alloc(int node,
unsigned long align,
unsigned long goal)
 {
-   return memblock_virt_alloc_try_nid(size, align, goal,
+   return memblock_virt_alloc_try_nid_raw(size, align, goal,
BOOTMEM_ALLOC_ACCESSIBLE, node);
 }
 
@@ -54,9 +54,8 @@ void * __meminit vmemmap_alloc_block(unsigned long size, int 
node)
if (slab_is_available()) {
struct page *page;
 
-   page = alloc_pages_node(node,
-   GFP_KERNEL | __GFP_ZERO | __GFP_RETRY_MAYFAIL,
-   get_order(size));
+   page = alloc_pages_node(node, GFP_KERNEL | __GFP_RETRY_MAYFAIL,
+   get_order(size));
if (page)
return page_address(page);
return NULL;
@@ -183,7 +182,7 @@ pmd_t * __meminit vmemmap_pmd_populate(pud_t *pud, unsigned 
long addr, int node)
 {
pmd_t *pmd = pmd_offset(pud, addr);
if (pmd_none(*pmd)) {
-   void *p = vmemmap_alloc_block(PAGE_SIZE, node);
+   void *p = vmemmap_alloc_block_zero(PAGE_SIZE, node);
if (!p)
return NULL;
	pmd_populate_kernel(&init_mm, pmd, p);
@@ -195,7 +194,7 @@ pud_t * __meminit vmemmap_pud_populate(p4d_t *p4d, unsigned 
long addr, int node)
 {
pud_t *pud = pud_offset(p4d, addr);
if (pud_none(*pud)) {
-   void *p = vmemmap_alloc_block(PAGE_SIZE, node);
+   void *p = vmemmap_alloc_block_zero(PAGE_SIZE, node);
if (!p)
return NULL;
	pud_populate(&init_mm, pud, p);
@@ -207,7 +206,7 @@ p4d_t * __meminit vmemmap_p4d_populate(pgd_t *pgd, unsigned 
long addr, int node)
 {
p4d_t *p4d = p4d_offset(pgd, addr);
if (p4d_none(*p4d)) {
-   void *p = vmemmap_alloc_block(PAGE_SIZE, node);
+   void *p = vmemmap_alloc_block_zero(PAGE_SIZE, node);
if (!p)
return NULL;
  

[PATCH v12 02/11] x86/mm: setting fields in deferred pages

2017-10-13 Thread Pavel Tatashin
Without deferred struct page feature (CONFIG_DEFERRED_STRUCT_PAGE_INIT),
flags and other fields in "struct page"es are never changed prior to first
initializing struct pages by going through __init_single_page().

With deferred struct page feature enabled, however, we set fields in
register_page_bootmem_info that are subsequently clobbered right after in
free_all_bootmem:

mem_init() {
register_page_bootmem_info();
free_all_bootmem();
...
}

When register_page_bootmem_info() is called only non-deferred struct pages
are initialized. But, this function goes through some reserved pages which
might be part of the deferred, and thus are not yet initialized.

  mem_init
   register_page_bootmem_info
register_page_bootmem_info_node
 get_page_bootmem
  .. setting fields here ..
  such as: page->freelist = (void *)type;

  free_all_bootmem()
   free_low_memory_core_early()
for_each_reserved_mem_region()
 reserve_bootmem_region()
  init_reserved_page() <- Only if this is deferred reserved page
   __init_single_pfn()
__init_single_page()
memset(0) <-- Loose the set fields here

We end up with an issue where we currently do not observe a problem, as memory
is explicitly zeroed. But if flag asserts are changed, we can start hitting
issues.

Also, because in this patch series we will stop zeroing struct page memory
during allocation, we must make sure that struct pages are properly
initialized prior to using them.

The deferred-reserved pages are initialized in free_all_bootmem().
Therefore, the fix is to switch the above calls.

Signed-off-by: Pavel Tatashin <pasha.tatas...@oracle.com>
Reviewed-by: Steven Sistare <steven.sist...@oracle.com>
Reviewed-by: Daniel Jordan <daniel.m.jor...@oracle.com>
Reviewed-by: Bob Picco <bob.pi...@oracle.com>
Acked-by: Michal Hocko <mho...@suse.com>
---
 arch/x86/mm/init_64.c | 10 --
 1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index 5ea1c3c2636e..8822523fdcd7 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -1182,12 +1182,18 @@ void __init mem_init(void)
 
/* clear_bss() already clear the empty_zero_page */
 
-   register_page_bootmem_info();
-
/* this will put all memory onto the freelists */
free_all_bootmem();
after_bootmem = 1;
 
+   /*
+* Must be done after boot memory is put on freelist, because here we
+* might set fields in deferred struct pages that have not yet been
+* initialized, and free_all_bootmem() initializes all the reserved
+* deferred pages for us.
+*/
+   register_page_bootmem_info();
+
/* Register memory areas for /proc/kcore */
	kclist_add(&kcore_vsyscall, (void *)VSYSCALL_ADDR,
 PAGE_SIZE, KCORE_OTHER);
-- 
2.14.2



[PATCH v12 08/11] arm64/kasan: add and use kasan_map_populate()

2017-10-13 Thread Pavel Tatashin
During early boot, kasan uses vmemmap_populate() to establish its shadow
memory. But, that interface is intended for struct pages use.

Because of the current project, vmemmap won't be zeroed during allocation,
but kasan expects that memory to be zeroed. We are adding a new
kasan_map_populate() function to resolve this difference.

Therefore, we must use a new interface to allocate and map kasan shadow
memory that also zeroes the memory for us.

Signed-off-by: Pavel Tatashin <pasha.tatas...@oracle.com>
---
 arch/arm64/mm/kasan_init.c | 72 ++
 1 file changed, 66 insertions(+), 6 deletions(-)

diff --git a/arch/arm64/mm/kasan_init.c b/arch/arm64/mm/kasan_init.c
index 81f03959a4ab..cb4af2951c90 100644
--- a/arch/arm64/mm/kasan_init.c
+++ b/arch/arm64/mm/kasan_init.c
@@ -28,6 +28,66 @@
 
 static pgd_t tmp_pg_dir[PTRS_PER_PGD] __initdata __aligned(PGD_SIZE);
 
+/* Creates mappings for kasan during early boot. The mapped memory is zeroed */
+static int __meminit kasan_map_populate(unsigned long start, unsigned long end,
+   int node)
+{
+   unsigned long addr, pfn, next;
+   unsigned long long size;
+   pgd_t *pgd;
+   pud_t *pud;
+   pmd_t *pmd;
+   pte_t *pte;
+   int ret;
+
+   ret = vmemmap_populate(start, end, node);
+   /*
+* We might have partially populated memory, so check for no entries,
+* and zero only those that actually exist.
+*/
+   for (addr = start; addr < end; addr = next) {
+   pgd = pgd_offset_k(addr);
+   if (pgd_none(*pgd)) {
+   next = pgd_addr_end(addr, end);
+   continue;
+   }
+
+   pud = pud_offset(pgd, addr);
+   if (pud_none(*pud)) {
+   next = pud_addr_end(addr, end);
+   continue;
+   }
+   if (pud_sect(*pud)) {
+   /* This is PUD size page */
+   next = pud_addr_end(addr, end);
+   size = PUD_SIZE;
+   pfn = pud_pfn(*pud);
+   } else {
+   pmd = pmd_offset(pud, addr);
+   if (pmd_none(*pmd)) {
+   next = pmd_addr_end(addr, end);
+   continue;
+   }
+   if (pmd_sect(*pmd)) {
+   /* This is PMD size page */
+   next = pmd_addr_end(addr, end);
+   size = PMD_SIZE;
+   pfn = pmd_pfn(*pmd);
+   } else {
+   pte = pte_offset_kernel(pmd, addr);
+   next = addr + PAGE_SIZE;
+   if (pte_none(*pte))
+   continue;
+   /* This is base size page */
+   size = PAGE_SIZE;
+   pfn = pte_pfn(*pte);
+   }
+   }
+   memset(phys_to_virt(PFN_PHYS(pfn)), 0, size);
+   }
+   return ret;
+}
+
 /*
  * The p*d_populate functions call virt_to_phys implicitly so they can't be 
used
  * directly on kernel symbols (bm_p*d). All the early functions are called too
@@ -161,11 +221,11 @@ void __init kasan_init(void)
 
clear_pgds(KASAN_SHADOW_START, KASAN_SHADOW_END);
 
-   vmemmap_populate(kimg_shadow_start, kimg_shadow_end,
-pfn_to_nid(virt_to_pfn(lm_alias(_text;
+   kasan_map_populate(kimg_shadow_start, kimg_shadow_end,
+  pfn_to_nid(virt_to_pfn(lm_alias(_text;
 
/*
-* vmemmap_populate() has populated the shadow region that covers the
+* kasan_map_populate() has populated the shadow region that covers the
 * kernel image with SWAPPER_BLOCK_SIZE mappings, so we have to round
 * the start and end addresses to SWAPPER_BLOCK_SIZE as well, to prevent
 * kasan_populate_zero_shadow() from replacing the page table entries
@@ -191,9 +251,9 @@ void __init kasan_init(void)
if (start >= end)
break;
 
-   vmemmap_populate((unsigned long)kasan_mem_to_shadow(start),
-   (unsigned long)kasan_mem_to_shadow(end),
-   pfn_to_nid(virt_to_pfn(start)));
+   kasan_map_populate((unsigned long)kasan_mem_to_shadow(start),
+  (unsigned long)kasan_mem_to_shadow(end),
+  pfn_to_nid(virt_to_pfn(start)));
}
 
/*
-- 
2.14.2



[PATCH v12 11/11] arm64: kasan: Avoid using vmemmap_populate to initialise shadow

2017-10-13 Thread Pavel Tatashin
From: Will Deacon <will.dea...@arm.com>

The kasan shadow is currently mapped using vmemmap_populate since that
provides a semi-convenient way to map pages into swapper. However, since
that no longer zeroes the mapped pages, it is not suitable for kasan,
which requires that the shadow is zeroed in order to avoid false
positives.

This patch removes our reliance on vmemmap_populate and reuses the
existing kasan page table code, which is already required for creating
the early shadow.

Signed-off-by: Will Deacon <will.dea...@arm.com>
Signed-off-by: Pavel Tatashin <pasha.tatas...@oracle.com>
---
 arch/arm64/Kconfig |   2 +-
 arch/arm64/mm/kasan_init.c | 180 +++--
 2 files changed, 76 insertions(+), 106 deletions(-)

diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index 0df64a6a56d4..888580b9036e 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -68,7 +68,7 @@ config ARM64
select HAVE_ARCH_BITREVERSE
select HAVE_ARCH_HUGE_VMAP
select HAVE_ARCH_JUMP_LABEL
-   select HAVE_ARCH_KASAN if SPARSEMEM_VMEMMAP && !(ARM64_16K_PAGES && 
ARM64_VA_BITS_48)
+   select HAVE_ARCH_KASAN if !(ARM64_16K_PAGES && ARM64_VA_BITS_48)
select HAVE_ARCH_KGDB
select HAVE_ARCH_MMAP_RND_BITS
select HAVE_ARCH_MMAP_RND_COMPAT_BITS if COMPAT
diff --git a/arch/arm64/mm/kasan_init.c b/arch/arm64/mm/kasan_init.c
index cb4af2951c90..acba49fb5aac 100644
--- a/arch/arm64/mm/kasan_init.c
+++ b/arch/arm64/mm/kasan_init.c
@@ -11,6 +11,7 @@
  */
 
 #define pr_fmt(fmt) "kasan: " fmt
+#include 
 #include 
 #include 
 #include 
@@ -28,66 +29,6 @@
 
 static pgd_t tmp_pg_dir[PTRS_PER_PGD] __initdata __aligned(PGD_SIZE);
 
-/* Creates mappings for kasan during early boot. The mapped memory is zeroed */
-static int __meminit kasan_map_populate(unsigned long start, unsigned long end,
-   int node)
-{
-   unsigned long addr, pfn, next;
-   unsigned long long size;
-   pgd_t *pgd;
-   pud_t *pud;
-   pmd_t *pmd;
-   pte_t *pte;
-   int ret;
-
-   ret = vmemmap_populate(start, end, node);
-   /*
-* We might have partially populated memory, so check for no entries,
-* and zero only those that actually exist.
-*/
-   for (addr = start; addr < end; addr = next) {
-   pgd = pgd_offset_k(addr);
-   if (pgd_none(*pgd)) {
-   next = pgd_addr_end(addr, end);
-   continue;
-   }
-
-   pud = pud_offset(pgd, addr);
-   if (pud_none(*pud)) {
-   next = pud_addr_end(addr, end);
-   continue;
-   }
-   if (pud_sect(*pud)) {
-   /* This is PUD size page */
-   next = pud_addr_end(addr, end);
-   size = PUD_SIZE;
-   pfn = pud_pfn(*pud);
-   } else {
-   pmd = pmd_offset(pud, addr);
-   if (pmd_none(*pmd)) {
-   next = pmd_addr_end(addr, end);
-   continue;
-   }
-   if (pmd_sect(*pmd)) {
-   /* This is PMD size page */
-   next = pmd_addr_end(addr, end);
-   size = PMD_SIZE;
-   pfn = pmd_pfn(*pmd);
-   } else {
-   pte = pte_offset_kernel(pmd, addr);
-   next = addr + PAGE_SIZE;
-   if (pte_none(*pte))
-   continue;
-   /* This is base size page */
-   size = PAGE_SIZE;
-   pfn = pte_pfn(*pte);
-   }
-   }
-   memset(phys_to_virt(PFN_PHYS(pfn)), 0, size);
-   }
-   return ret;
-}
-
 /*
  * The p*d_populate functions call virt_to_phys implicitly so they can't be 
used
  * directly on kernel symbols (bm_p*d). All the early functions are called too
@@ -95,77 +36,117 @@ static int __meminit kasan_map_populate(unsigned long 
start, unsigned long end,
  * with the physical address from __pa_symbol.
  */
 
-static void __init kasan_early_pte_populate(pmd_t *pmd, unsigned long addr,
-   unsigned long end)
+static phys_addr_t __init kasan_alloc_zeroed_page(int node)
 {
-   pte_t *pte;
-   unsigned long next;
+   void *p = memblock_virt_alloc_try_nid(PAGE_SIZE, PAGE_SIZE,
+ __pa(MAX_DMA_ADDRESS),
+ MEMBLOCK_ALLOC_ACCESSIBLE, node);
+   return __pa(p);
+}
 
-   if (pmd_none(*pmd))
-   __pmd_pop

[PATCH v12 04/11] sparc64: simplify vmemmap_populate

2017-10-13 Thread Pavel Tatashin
Remove duplicating code by using common functions
vmemmap_pud_populate and vmemmap_pgd_populate.

Signed-off-by: Pavel Tatashin <pasha.tatas...@oracle.com>
Reviewed-by: Steven Sistare <steven.sist...@oracle.com>
Reviewed-by: Daniel Jordan <daniel.m.jor...@oracle.com>
Reviewed-by: Bob Picco <bob.pi...@oracle.com>
Acked-by: David S. Miller <da...@davemloft.net>
Acked-by: Michal Hocko <mho...@suse.com>
---
 arch/sparc/mm/init_64.c | 23 ++-
 1 file changed, 6 insertions(+), 17 deletions(-)

diff --git a/arch/sparc/mm/init_64.c b/arch/sparc/mm/init_64.c
index caed495544e9..6839db3ffe1d 100644
--- a/arch/sparc/mm/init_64.c
+++ b/arch/sparc/mm/init_64.c
@@ -2652,30 +2652,19 @@ int __meminit vmemmap_populate(unsigned long vstart, 
unsigned long vend,
vstart = vstart & PMD_MASK;
vend = ALIGN(vend, PMD_SIZE);
for (; vstart < vend; vstart += PMD_SIZE) {
-   pgd_t *pgd = pgd_offset_k(vstart);
+   pgd_t *pgd = vmemmap_pgd_populate(vstart, node);
unsigned long pte;
pud_t *pud;
pmd_t *pmd;
 
-   if (pgd_none(*pgd)) {
-   pud_t *new = vmemmap_alloc_block(PAGE_SIZE, node);
+   if (!pgd)
+   return -ENOMEM;
 
-   if (!new)
-   return -ENOMEM;
-   pgd_populate(&init_mm, pgd, new);
-   }
-
-   pud = pud_offset(pgd, vstart);
-   if (pud_none(*pud)) {
-   pmd_t *new = vmemmap_alloc_block(PAGE_SIZE, node);
-
-   if (!new)
-   return -ENOMEM;
-   pud_populate(&init_mm, pud, new);
-   }
+   pud = vmemmap_pud_populate(pgd, vstart, node);
+   if (!pud)
+   return -ENOMEM;
 
pmd = pmd_offset(pud, vstart);
-
pte = pmd_val(*pmd);
if (!(pte & _PAGE_VALID)) {
void *block = vmemmap_alloc_block(PMD_SIZE, node);
-- 
2.14.2



[PATCH v12 06/11] mm: zero reserved and unavailable struct pages

2017-10-13 Thread Pavel Tatashin
Some memory is reserved but unavailable: not present in memblock.memory
(because not backed by physical pages), but present in memblock.reserved.
Such memory has backing struct pages, but they are not initialized by going
through __init_single_page().

In some cases these struct pages are accessed even if they do not contain
any data. One example is page_to_pfn() might access page->flags if this is
where section information is stored (CONFIG_SPARSEMEM,
SECTION_IN_PAGE_FLAGS).

One example of such memory: trim_low_memory_range() unconditionally
reserves from pfn 0, but e820__memblock_setup() might provide the existing
memory from pfn 1 (i.e. KVM).

Since, struct pages are zeroed in __init_single_page(), and not during
allocation time, we must zero such struct pages explicitly.

The patch involves adding a new memblock iterator:
for_each_resv_unavail_range(i, p_start, p_end)

Which iterates through reserved && !memory lists, and we zero struct pages
explicitly by calling mm_zero_struct_page().
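
For illustration, the zeroing loop boils down to roughly the following (a
simplified sketch of the zero_resv_unavail() added below; the real function
also counts the zeroed pages and prints a diagnostic):

void __paginginit zero_resv_unavail(void)
{
        phys_addr_t start, end;
        unsigned long pfn;
        u64 i;

        /* Ranges present in memblock.reserved but not in memblock.memory */
        for_each_resv_unavail_range(i, &start, &end) {
                for (pfn = PFN_DOWN(start); pfn < PFN_UP(end); pfn++)
                        mm_zero_struct_page(pfn_to_page(pfn));
        }
}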

===

Here is more detailed example of problem that this patch is addressing:

Run tested on qemu with the following arguments:

-enable-kvm -cpu kvm64 -m 512 -smp 2

This patch reports that there are 98 unavailable pages.

They are: pfn 0 and pfns in range [159, 255].

Note, trim_low_memory_range() reserves only pfns in range [0, 15], it does
not reserve [159, 255] ones.

e820__memblock_setup() reports linux that the following physical ranges are
available:
[1 , 158]
[256, 130783]

Notice, that exactly unavailable pfns are missing!

Now, lets check what we have in zone 0: [1, 131039]

pfn 0, is not part of the zone, but pfns [1, 158], are.

However, the bigger problem we have if we do not initialize these struct
pages is with memory hotplug. That path operates at 2M boundaries
(section_nr) and checks whether a 2M range of pages is hot removable. It starts
with the first pfn from the zone, rounds it down to a 2M boundary (struct pages
are allocated at 2M boundaries when vmemmap is created), and checks if that
section is hot removable. In this case we start with pfn 1 and convert it down
to pfn 0. Later the pfn is converted to a struct page, and some fields are
checked. Now, if we do not zero struct pages, we get unpredictable results.
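
The problematic access is, roughly, the following (a sketch based on the
is_mem_section_removable()/is_pageblock_removable_nolock() path in the trace
below; the local names are illustrative):

        /* zone starts at pfn 1; rounding down to the 2M boundary gives pfn 0 */
        unsigned long pfn = round_down(zone->zone_start_pfn, pageblock_nr_pages);
        struct page *page = pfn_to_page(pfn);

        /*
         * page_zone() decodes page->flags; for a reserved page that never went
         * through __init_single_page() this reads whatever happens to be there.
         */
        if (!zone_spans_pfn(page_zone(page), pfn))
                return false;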

In fact, when CONFIG_DEBUG_VM is enabled and we explicitly set all vmemmap
memory to ones, the following panic is observed with a kernel test without
this patch applied:

BUG: unable to handle kernel NULL pointer dereference at  (null)
IP: is_pageblock_removable_nolock+0x35/0x90
PGD 0 P4D 0
Oops:  [#1] PREEMPT
...
task: 88001f4e2900 task.stack: c9314000
RIP: 0010:is_pageblock_removable_nolock+0x35/0x90
RSP: 0018:c9317d60 EFLAGS: 00010202
RAX:  RBX: 88001d92b000 RCX: 
RDX:  RSI: 0020 RDI: 88001d92b000
RBP: c9317d80 R08: 10c8 R09: 
R10:  R11:  R12: 88001db2b000
R13: 81af6d00 R14: 88001f7d5000 R15: 82a1b6c0
FS:  7f4eb857f7c0() GS:81c27000() knlGS:0
CS:  0010 DS:  ES:  CR0: 80050033
CR2:  CR3: 1f4e6000 CR4: 06b0
Call Trace:
 ? is_mem_section_removable+0x5a/0xd0
 show_mem_removable+0x6b/0xa0
 dev_attr_show+0x1b/0x50
 sysfs_kf_seq_show+0xa1/0x100
 kernfs_seq_show+0x22/0x30
 seq_read+0x1ac/0x3a0
 kernfs_fop_read+0x36/0x190
 ? security_file_permission+0x90/0xb0
 __vfs_read+0x16/0x30
 vfs_read+0x81/0x130
 SyS_read+0x44/0xa0
 entry_SYSCALL_64_fastpath+0x1f/0xbd

Signed-off-by: Pavel Tatashin <pasha.tatas...@oracle.com>
Reviewed-by: Steven Sistare <steven.sist...@oracle.com>
Reviewed-by: Daniel Jordan <daniel.m.jor...@oracle.com>
Reviewed-by: Bob Picco <bob.pi...@oracle.com>
Acked-by: Michal Hocko <mho...@suse.com>
---
 include/linux/memblock.h | 16 
 include/linux/mm.h   | 15 +++
 mm/page_alloc.c  | 40 
 3 files changed, 71 insertions(+)

diff --git a/include/linux/memblock.h b/include/linux/memblock.h
index bae11c7e7bf3..ce8bfa5f3e9b 100644
--- a/include/linux/memblock.h
+++ b/include/linux/memblock.h
@@ -237,6 +237,22 @@ unsigned long memblock_next_valid_pfn(unsigned long pfn, 
unsigned long max_pfn);
for_each_mem_range_rev(i, &memblock.memory, &memblock.reserved, \
   nid, flags, p_start, p_end, p_nid)
 
+/**
+ * for_each_resv_unavail_range - iterate through reserved and unavailable 
memory
+ * @i: u64 used as loop variable
+ * @flags: pick from blocks based on memory attributes
+ * @p_start: ptr to phys_addr_t for start address of the range, can be %NULL
+ * @p_end: ptr to phys_addr_t for end address of the range, can be %NULL
+ *
+ * Walks over unavailable but reserved (reserved && !memory) areas of memblock.
+ * Available as soon as memblock is initialized.
+ * Note: because this memory does n

[PATCH v12 03/11] sparc64/mm: setting fields in deferred pages

2017-10-13 Thread Pavel Tatashin
Without deferred struct page feature (CONFIG_DEFERRED_STRUCT_PAGE_INIT),
flags and other fields in "struct page"es are never changed prior to first
initializing struct pages by going through __init_single_page().

With deferred struct page feature enabled there is a case where we set some
fields prior to initializing:

mem_init() {
 register_page_bootmem_info();
 free_all_bootmem();
 ...
}

When register_page_bootmem_info() is called only non-deferred struct pages
are initialized. But, this function goes through some reserved pages which
might be part of the deferred, and thus are not yet initialized.

mem_init
register_page_bootmem_info
register_page_bootmem_info_node
 get_page_bootmem
  .. setting fields here ..
  such as: page->freelist = (void *)type;

free_all_bootmem()
free_low_memory_core_early()
 for_each_reserved_mem_region()
  reserve_bootmem_region()
   init_reserved_page() <- Only if this is deferred reserved page
__init_single_pfn()
 __init_single_page()
  memset(0) <-- Lose the set fields here

We end up with a similar issue as in the previous patch, where currently we
do not observe a problem as memory is zeroed. But, if flag asserts are
changed, we can start hitting issues.

Also, because in this patch series we will stop zeroing struct page memory
during allocation, we must make sure that struct pages are properly
initialized prior to using them.

The deferred-reserved pages are initialized in free_all_bootmem().
Therefore, the fix is to switch the above calls.

Signed-off-by: Pavel Tatashin <pasha.tatas...@oracle.com>
Reviewed-by: Steven Sistare <steven.sist...@oracle.com>
Reviewed-by: Daniel Jordan <daniel.m.jor...@oracle.com>
Reviewed-by: Bob Picco <bob.pi...@oracle.com>
Acked-by: David S. Miller <da...@davemloft.net>
Acked-by: Michal Hocko <mho...@suse.com>
---
 arch/sparc/mm/init_64.c | 9 -
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/arch/sparc/mm/init_64.c b/arch/sparc/mm/init_64.c
index 6034569e2c0d..caed495544e9 100644
--- a/arch/sparc/mm/init_64.c
+++ b/arch/sparc/mm/init_64.c
@@ -2548,9 +2548,16 @@ void __init mem_init(void)
 {
high_memory = __va(last_valid_pfn << PAGE_SHIFT);
 
-   register_page_bootmem_info();
free_all_bootmem();
 
+   /*
+* Must be done after boot memory is put on freelist, because here we
+* might set fields in deferred struct pages that have not yet been
+* initialized, and free_all_bootmem() initializes all the reserved
+* deferred pages for us.
+*/
+   register_page_bootmem_info();
+
/*
 * Set up the zero page, mark it reserved, so that page count
 * is not manipulated when freeing the page from user ptes.
-- 
2.14.2



[PATCH v12 01/11] mm: deferred_init_memmap improvements

2017-10-13 Thread Pavel Tatashin
deferred_init_memmap() is called when struct pages are initialized later
in boot by slave CPUs. This patch simplifies and optimizes this function,
and also fixes a couple issues (described below).

The main change is that now we are iterating through free memblock areas
instead of all configured memory. Thus, we do not have to check if the
struct page has already been initialized.

=
In deferred_init_memmap() where all deferred struct pages are initialized
we have a check like this:

if (page->flags) {
VM_BUG_ON(page_zone(page) != zone);
goto free_range;
}

This way we are checking if the current deferred page has already been
initialized. It works because the memory for struct pages has been zeroed, and
the only way flags can be non-zero is if the page went through
__init_single_page() before.  But, once we change the current behavior and no
longer zero the memory in the memblock allocator, we cannot trust anything
inside "struct page"es until they are initialized. This patch fixes this.

The deferred_init_memmap() is re-written to loop through only free memory
ranges provided by memblock.

Note, this first issue is relevant only when the following change is
merged:

=
This patch fixes another existing issue on systems that have holes in
zones i.e CONFIG_HOLES_IN_ZONE is defined.

In for_each_mem_pfn_range() we have code like this:

if (!pfn_valid_within(pfn))
goto free_range;

Note: 'page' is not set to NULL and is not incremented but 'pfn' advances.
This means that if deferred struct pages are enabled on systems with this kind
of hole, Linux would get memory corruption. I have fixed this issue by
defining a new macro that performs all the necessary operations when we
free the current set of pages.
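
The core of the rewritten loop is roughly the following (a sketch; spa/epa are
the physical bounds of a free range and spfn/epfn the corresponding pfn bounds
clamped to the zone, as used in the rewritten deferred_init_memmap()):

        unsigned long spfn, epfn, nr_pages = 0;
        phys_addr_t spa, epa;
        u64 i;

        /* Walk only ranges that memblock reports as free, clamped to this zone */
        for_each_free_mem_range(i, nid, MEMBLOCK_NONE, &spa, &epa, NULL) {
                spfn = max_t(unsigned long, first_init_pfn, PFN_UP(spa));
                epfn = min_t(unsigned long, zone_end_pfn(zone), PFN_DOWN(epa));
                nr_pages += deferred_init_range(nid, zid, spfn, epfn);
        }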

Signed-off-by: Pavel Tatashin <pasha.tatas...@oracle.com>
Reviewed-by: Steven Sistare <steven.sist...@oracle.com>
Reviewed-by: Daniel Jordan <daniel.m.jor...@oracle.com>
Reviewed-by: Bob Picco <bob.pi...@oracle.com>
---
 mm/page_alloc.c | 168 
 1 file changed, 85 insertions(+), 83 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 77e4d3c5c57b..cdbd14829fd3 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1410,14 +1410,17 @@ void clear_zone_contiguous(struct zone *zone)
 }
 
 #ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT
-static void __init deferred_free_range(struct page *page,
-   unsigned long pfn, int nr_pages)
+static void __init deferred_free_range(unsigned long pfn,
+  unsigned long nr_pages)
 {
-   int i;
+   struct page *page;
+   unsigned long i;
 
-   if (!page)
+   if (!nr_pages)
return;
 
+   page = pfn_to_page(pfn);
+
/* Free a large naturally-aligned chunk if possible */
if (nr_pages == pageblock_nr_pages &&
(pfn & (pageblock_nr_pages - 1)) == 0) {
@@ -1443,19 +1446,89 @@ static inline void __init 
pgdat_init_report_one_done(void)
complete(&pgdat_init_all_done_comp);
 }
 
+/*
+ * Helper for deferred_init_range, free the given range, reset the counters, 
and
+ * return number of pages freed.
+ */
+static inline unsigned long __def_free(unsigned long *nr_free,
+  unsigned long *free_base_pfn,
+  struct page **page)
+{
+   unsigned long nr = *nr_free;
+
+   deferred_free_range(*free_base_pfn, nr);
+   *free_base_pfn = 0;
+   *nr_free = 0;
+   *page = NULL;
+
+   return nr;
+}
+
+static unsigned long deferred_init_range(int nid, int zid, unsigned long pfn,
+unsigned long end_pfn)
+{
+   struct mminit_pfnnid_cache nid_init_state = { };
+   unsigned long nr_pgmask = pageblock_nr_pages - 1;
+   unsigned long free_base_pfn = 0;
+   unsigned long nr_pages = 0;
+   unsigned long nr_free = 0;
+   struct page *page = NULL;
+
+   for (; pfn < end_pfn; pfn++) {
+   /*
+* First we check if pfn is valid on architectures where it is
+* possible to have holes within pageblock_nr_pages. On systems
+* where it is not possible, this function is optimized out.
+*
+* Then, we check if a current large page is valid by only
+* checking the validity of the head pfn.
+*
+* meminit_pfn_in_nid is checked on systems where pfns can
+* interleave within a node: a pfn is between start and end
+* of a node, but does not belong to this memory node.
+*
+* Finally, we minimize pfn page lookups and scheduler checks by
+* performing it only once every pageblock_nr_pages.
+*/
+   if (!pfn_valid_within(pfn)) {
+   nr_pages += __

[PATCH v12 07/11] x86/kasan: add and use kasan_map_populate()

2017-10-13 Thread Pavel Tatashin
During early boot, kasan uses vmemmap_populate() to establish its shadow
memory. But, that interface is intended for struct pages use.

Because of the current project, vmemmap won't be zeroed during allocation,
but kasan expects that memory to be zeroed. We are adding a new
kasan_map_populate() function to resolve this difference.

Therefore, we must use a new interface to allocate and map kasan shadow
memory, that also zeroes memory for us.

Signed-off-by: Pavel Tatashin <pasha.tatas...@oracle.com>
---
 arch/x86/mm/kasan_init_64.c | 75 ++---
 1 file changed, 71 insertions(+), 4 deletions(-)

diff --git a/arch/x86/mm/kasan_init_64.c b/arch/x86/mm/kasan_init_64.c
index bc84b73684b7..9778fec8a5dc 100644
--- a/arch/x86/mm/kasan_init_64.c
+++ b/arch/x86/mm/kasan_init_64.c
@@ -15,6 +15,73 @@
 
 extern struct range pfn_mapped[E820_MAX_ENTRIES];
 
+/* Creates mappings for kasan during early boot. The mapped memory is zeroed */
+static int __meminit kasan_map_populate(unsigned long start, unsigned long end,
+   int node)
+{
+   unsigned long addr, pfn, next;
+   unsigned long long size;
+   pgd_t *pgd;
+   p4d_t *p4d;
+   pud_t *pud;
+   pmd_t *pmd;
+   pte_t *pte;
+   int ret;
+
+   ret = vmemmap_populate(start, end, node);
+   /*
+* We might have partially populated memory, so check for no entries,
+* and zero only those that actually exist.
+*/
+   for (addr = start; addr < end; addr = next) {
+   pgd = pgd_offset_k(addr);
+   if (pgd_none(*pgd)) {
+   next = pgd_addr_end(addr, end);
+   continue;
+   }
+
+   p4d = p4d_offset(pgd, addr);
+   if (p4d_none(*p4d)) {
+   next = p4d_addr_end(addr, end);
+   continue;
+   }
+
+   pud = pud_offset(p4d, addr);
+   if (pud_none(*pud)) {
+   next = pud_addr_end(addr, end);
+   continue;
+   }
+   if (pud_large(*pud)) {
+   /* This is PUD size page */
+   next = pud_addr_end(addr, end);
+   size = PUD_SIZE;
+   pfn = pud_pfn(*pud);
+   } else {
+   pmd = pmd_offset(pud, addr);
+   if (pmd_none(*pmd)) {
+   next = pmd_addr_end(addr, end);
+   continue;
+   }
+   if (pmd_large(*pmd)) {
+   /* This is PMD size page */
+   next = pmd_addr_end(addr, end);
+   size = PMD_SIZE;
+   pfn = pmd_pfn(*pmd);
+   } else {
+   pte = pte_offset_kernel(pmd, addr);
+   next = addr + PAGE_SIZE;
+   if (pte_none(*pte))
+   continue;
+   /* This is base size page */
+   size = PAGE_SIZE;
+   pfn = pte_pfn(*pte);
+   }
+   }
+   memset(phys_to_virt(PFN_PHYS(pfn)), 0, size);
+   }
+   return ret;
+}
+
 static int __init map_range(struct range *range)
 {
unsigned long start;
@@ -23,7 +90,7 @@ static int __init map_range(struct range *range)
start = (unsigned long)kasan_mem_to_shadow(pfn_to_kaddr(range->start));
end = (unsigned long)kasan_mem_to_shadow(pfn_to_kaddr(range->end));
 
-   return vmemmap_populate(start, end, NUMA_NO_NODE);
+   return kasan_map_populate(start, end, NUMA_NO_NODE);
 }
 
 static void __init clear_pgds(unsigned long start,
@@ -136,9 +203,9 @@ void __init kasan_init(void)
kasan_mem_to_shadow((void *)PAGE_OFFSET + MAXMEM),
kasan_mem_to_shadow((void *)__START_KERNEL_map));
 
-   vmemmap_populate((unsigned long)kasan_mem_to_shadow(_stext),
-   (unsigned long)kasan_mem_to_shadow(_end),
-   NUMA_NO_NODE);
+   kasan_map_populate((unsigned long)kasan_mem_to_shadow(_stext),
+  (unsigned long)kasan_mem_to_shadow(_end),
+  NUMA_NO_NODE);
 
kasan_populate_zero_shadow(kasan_mem_to_shadow((void *)MODULES_END),
(void *)KASAN_SHADOW_END);
-- 
2.14.2



Re: [PATCH v11 7/9] arm64/kasan: add and use kasan_map_populate()

2017-10-13 Thread Pavel Tatashin
BTW, don't we need the same aligments inside for_each_memblock() loop?

How about change kasan_map_populate() to accept regular VA start, end
address, and convert them internally after aligning to PAGE_SIZE?

Thank you,
Pavel


On Fri, Oct 13, 2017 at 11:54 AM, Pavel Tatashin
<pasha.tatas...@oracle.com> wrote:
>> Thanks for sharing the .config and tree. It looks like the problem is that
>> kimg_shadow_start and kimg_shadow_end are not page-aligned. Whilst I fix
>> them up in kasan_map_populate, they remain unaligned when passed to
>> kasan_populate_zero_shadow, which confuses the loop termination conditions
>> in e.g. zero_pte_populate and the shadow isn't configured properly.
>
> This makes sense. Thank you. I will insert these changes into your
> patch, and send out a new series soon after sanity checking it.
>
> Pavel


Re: [PATCH v11 7/9] arm64/kasan: add and use kasan_map_populate()

2017-10-13 Thread Pavel Tatashin
> Thanks for sharing the .config and tree. It looks like the problem is that
> kimg_shadow_start and kimg_shadow_end are not page-aligned. Whilst I fix
> them up in kasan_map_populate, they remain unaligned when passed to
> kasan_populate_zero_shadow, which confuses the loop termination conditions
> in e.g. zero_pte_populate and the shadow isn't configured properly.

This makes sense. Thank you. I will insert these changes into your
patch, and send out a new series soon after sanity checking it.

Pavel


Re: [PATCH v11 7/9] arm64/kasan: add and use kasan_map_populate()

2017-10-13 Thread Pavel Tatashin
Here is simplified qemu command:

qemu-system-aarch64 \
  -display none \
  -kernel ./arch/arm64/boot/Image  \
  -M virt -cpu cortex-a57 -s -S

In a separate terminal start arm64 cross debugger:

$ aarch64-unknown-linux-gnu-gdb ./vmlinux
...
Reading symbols from ./vmlinux...done.
(gdb) target remote :1234
Remote debugging using :1234
0x4000 in ?? ()
(gdb) c
Continuing.
^C
(gdb) lx-dmesg
[0.00] Booting Linux on physical CPU 0x0
[0.00] Linux version 4.14.0-rc4_pt_study-00136-gbed2c89768ba
(soleen@xakep) (gcc version 7.1.0 (crosstool-NG
crosstool-ng-1.23.0-90-g81327dd9)) #1 SMP PREEMPT Fri Oct 13 11:24:46
EDT 2017
... until the panic message is printed ...

Thank you,
Pavel


Re: [PATCH v11 7/9] arm64/kasan: add and use kasan_map_populate()

2017-10-13 Thread Pavel Tatashin
> It shouldn't be difficult to use section mappings with my patch, I just
> don't really see the need to try to optimise TLB pressure when you're
> running with KASAN enabled which already has something like a 3x slowdown
> afaik. If it ends up being a big deal, we can always do that later, but
> my main aim here is to divorce kasan from vmemmap because they should be
> completely unrelated.

Yes, I understand that kasan makes the system slow, but my point is why
make it even slower? However, I am OK with adding your patch to the series.
BTW, symmetric changes will be needed for x86 as well sometime later.

>
> This certainly doesn't sound right; mapping the shadow with pages shouldn't
> lead to problems. I also can't seem to reproduce this myself -- could you
> share your full .config and a pointer to the git tree that you're using,
> please?

Config is attached. I am using my patch series + your patch + today's
clone from https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

Also, in a separate e-mail i sent out the qemu arguments.

>
>> I feel, this patch requires more work, and I am troubled with using
>> base pages instead of large pages.
>
> I'm happy to try fixing this, because I think splitting up kasan and vmemmap
> is the right thing to do here.

Thank you very much.

Pavel


config.gz
Description: GNU Zip compressed data


Re: [PATCH v11 7/9] arm64/kasan: add and use kasan_map_populate()

2017-10-13 Thread Pavel Tatashin
> Do you know what your physical memory layout looks like?

[0.00] Memory: 34960K/131072K available (16316K kernel code,
6716K rwdata, 7996K rodata, 1472K init, 8837K bss, 79728K reserved,
16384K cma-reserved)
[0.00] Virtual kernel memory layout:
[0.00] kasan   : 0x - 0x2000
( 32768 GB)
[0.00] modules : 0x2000 - 0x2800
(   128 MB)
[0.00] vmalloc : 0x2800 - 0x7dffbfff
( 96254 GB)
[0.00]   .text : 0x2808 - 0x2907
( 16320 KB)
[0.00] .rodata : 0x2907 - 0x2985
(  8064 KB)
[0.00]   .init : 0x2985 - 0x299c
(  1472 KB)
[0.00]   .data : 0x299c - 0x2a04f200
(  6717 KB)
[0.00].bss : 0x2a04f200 - 0x2a8f09e0
(  8838 KB)
[0.00] fixed   : 0x7dfffe7fd000 - 0x7dfffec0
(  4108 KB)
[0.00] PCI I/O : 0x7dfffee0 - 0x7de0
(16 MB)
[0.00] vmemmap : 0x7e00 - 0x8000
(  2048 GB maximum)
[0.00]   0x7e00 - 0x7e20
( 2 MB actual)
[0.00] memory  : 0x8000 - 0x8800
(   128 MB)

>
> Knowing that would tell us where shadow memory *should* be.
>
> Can you share the command line you're using the launch the VM?
>

virtme-run --kdir . --arch aarch64 --qemu-opts -s -S

and get messages from connected gdb session via lx-dmesg command.

The actual qemu arguments are these:

qemu-system-aarch64 -fsdev
local,id=virtfs1,path=/,security_model=none,readonly -device
virtio-9p-device,fsdev=virtfs1,mount_tag=/dev/root -fsdev
local,id=virtfs5,path=/usr/share/virtme-guest-0,security_model=none,readonly
-device virtio-9p-device,fsdev=virtfs5,mount_tag=virtme.guesttools -M
virt -cpu cortex-a57 -parallel none -net none -echr 1 -serial none
-chardev stdio,id=console,signal=off,mux=on -serial chardev:console
-mon chardev=console -vga none -display none -kernel
./arch/arm64/boot/Image -append 'earlyprintk=serial,ttyAMA0,115200
console=ttyAMA0 psmouse.proto=exps "virtme_stty_con=rows 57 cols 105
iutf8" TERM=screen-256color-bce rootfstype=9p
rootflags=version=9p2000.L,trans=virtio,access=any raid=noautodetect
ro init=/bin/sh -- -c "mount -t tmpfs run /run;mkdir -p
/run/virtme/guesttools;/bin/mount -n -t 9p -o
ro,version=9p2000.L,trans=virtio,access=any virtme.guesttools
/run/virtme/guesttools;exec /run/virtme/guesttools/virtme-init"' -s -S


Re: [PATCH v11 7/9] arm64/kasan: add and use kasan_map_populate()

2017-10-13 Thread Pavel Tatashin
] [] kernel_thread+0x30/0x38
[0.062156] [] rest_init+0x34/0x108
[0.062260] [] start_kernel+0x45c/0x48c
[0.062458] Code: 540001e1 d343fc00 d2c40002 f2fbffe2 (38e26800)
[0.063559] ---[ end trace 390c5d4fc6641888 ]---
[0.064164] Kernel panic - not syncing: Attempted to kill the idle task!
[0.064438] ---[ end Kernel panic - not syncing: Attempted to kill
the idle task!


So, I've been trying to root cause it, and here is what I've got:

First, I went back to my version of kasan_map_populate() and replaced
vmemmap_populate() with vmemmap_populate_basepages(), which
behavior-wise made it very similar to your patch. After doing this I
got the same panic. So, I figured it must have something to do with
the difference that regular vmemmap is allocated with a granularity of
SWAPPER_BLOCK_SIZE while kasan uses a granularity of PAGE_SIZE.

So, I made the following modification to your patch:

static void __init kasan_map_populate(unsigned long start, unsigned long end,
  int node)
{
+   start = round_down(start, SWAPPER_BLOCK_SIZE);
+   end = round_up(end, SWAPPER_BLOCK_SIZE);
kasan_pgd_populate(start & PAGE_MASK, PAGE_ALIGN(end), node, false);
}

This basically makes the shadow tree ranges SWAPPER_BLOCK_SIZE
aligned. After this modification everything is working.  However, I
am not sure if this is a proper fix.

I feel this patch requires more work, and I am troubled by using
base pages instead of large pages.

Thank you,
Pavel

On Tue, Oct 10, 2017 at 1:41 PM, Pavel Tatashin
<pasha.tatas...@oracle.com> wrote:
> Hi Will,
>
> Ok, I will add your patch at the end of my series.
>
> Thank you,
> Pavel
>
>>
>> I was thinking that you could just add my patch to the end of your series
>> and have the whole lot go up like that. If you want to merge it with your
>> patch, I'm fine with that too.
>>
>> Will
>>
>> --
>> To unsubscribe, send a message with 'unsubscribe linux-mm' in
>> the body to majord...@kvack.org.  For more info on Linux MM,
>> see: http://www.linux-mm.org/ .
>> Don't email: em...@kvack.org


Re: [PATCH v11 7/9] arm64/kasan: add and use kasan_map_populate()

2017-10-10 Thread Pavel Tatashin
Hi Will,

Ok, I will add your patch at the end of my series.

Thank you,
Pavel

>
> I was thinking that you could just add my patch to the end of your series
> and have the whole lot go up like that. If you want to merge it with your
> patch, I'm fine with that too.
>
> Will
>
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majord...@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: em...@kvack.org


Re: [PATCH v11 0/9] complete deferred page initialization

2017-10-10 Thread Pavel Tatashin
I wanted to thank you Michal for spending time and doing the in-depth 
reviews of every incremental change. Overall the series is in much 
better shape now because of your feedback.


Pavel

On 10/10/2017 10:15 AM, Michal Hocko wrote:

Btw. thanks for your persistance and willingness to go over all the
suggestions which might not have been consistent btween different
versions. I believe this is a general improvement in the early
initialization code. We do not rely on an implicit zeroing which just
happens to work by a chance. The perfomance improvements are a bonus on
top.

Thanks, good work!



Re: [PATCH v11 7/9] arm64/kasan: add and use kasan_map_populate()

2017-10-10 Thread Pavel Tatashin
Hi Will,

Thank you for doing this work. How would you like to proceed?

- If you are OK with my series being accepted as-is, so your patch can be
added later on top, I think I need an ack from you for the kasan changes.
- Otherwise, I can replace: 4267aaf1d279 arm64/kasan: add and use
kasan_map_populate() in my series with code from your patch.

Thank you,
Pavel


Re: [PATCH v11 5/9] mm: zero reserved and unavailable struct pages

2017-10-10 Thread Pavel Tatashin
> Btw. I would add your example from 
> http://lkml.kernel.org/r/bcf24369-ac37-cedd-a264-3396fb5cf...@oracle.com
> to do changelog
>

Will add, thank you for your review.

Pavel


[PATCH v11 5/9] mm: zero reserved and unavailable struct pages

2017-10-09 Thread Pavel Tatashin
Some memory is reserved but unavailable: not present in memblock.memory
(because not backed by physical pages), but present in memblock.reserved.
Such memory has backing struct pages, but they are not initialized by going
through __init_single_page().

In some cases these struct pages are accessed even if they do not contain
any data. One example is page_to_pfn() might access page->flags if this is
where section information is stored (CONFIG_SPARSEMEM,
SECTION_IN_PAGE_FLAGS).

One example of such memory: trim_low_memory_range() unconditionally
reserves from pfn 0, but e820__memblock_setup() might provide the existing
memory from pfn 1 (i.e. KVM).

Since, struct pages are zeroed in __init_single_page(), and not during
allocation time, we must zero such struct pages explicitly.

The patch involves adding a new memblock iterator:
for_each_resv_unavail_range(i, p_start, p_end)

Which iterates through reserved && !memory lists, and we zero struct pages
explicitly by calling mm_zero_struct_page().

Signed-off-by: Pavel Tatashin <pasha.tatas...@oracle.com>
Reviewed-by: Steven Sistare <steven.sist...@oracle.com>
Reviewed-by: Daniel Jordan <daniel.m.jor...@oracle.com>
Reviewed-by: Bob Picco <bob.pi...@oracle.com>
---
 include/linux/memblock.h | 16 
 include/linux/mm.h   | 15 +++
 mm/page_alloc.c  | 38 ++
 3 files changed, 69 insertions(+)

diff --git a/include/linux/memblock.h b/include/linux/memblock.h
index bae11c7e7bf3..ce8bfa5f3e9b 100644
--- a/include/linux/memblock.h
+++ b/include/linux/memblock.h
@@ -237,6 +237,22 @@ unsigned long memblock_next_valid_pfn(unsigned long pfn, 
unsigned long max_pfn);
for_each_mem_range_rev(i, &memblock.memory, &memblock.reserved, \
   nid, flags, p_start, p_end, p_nid)
 
+/**
+ * for_each_resv_unavail_range - iterate through reserved and unavailable 
memory
+ * @i: u64 used as loop variable
+ * @flags: pick from blocks based on memory attributes
+ * @p_start: ptr to phys_addr_t for start address of the range, can be %NULL
+ * @p_end: ptr to phys_addr_t for end address of the range, can be %NULL
+ *
+ * Walks over unavailable but reserved (reserved && !memory) areas of memblock.
+ * Available as soon as memblock is initialized.
+ * Note: because this memory does not belong to any physical node, flags and
+ * nid arguments do not make sense and thus not exported as arguments.
+ */
+#define for_each_resv_unavail_range(i, p_start, p_end) \
+   for_each_mem_range(i, &memblock.reserved, &memblock.memory, \
+  NUMA_NO_NODE, MEMBLOCK_NONE, p_start, p_end, NULL)
+
 static inline void memblock_set_region_flags(struct memblock_region *r,
 unsigned long flags)
 {
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 065d99deb847..04c8b2e5aff4 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -94,6 +94,15 @@ extern int mmap_rnd_compat_bits __read_mostly;
 #define mm_forbids_zeropage(X) (0)
 #endif
 
+/*
+ * On some architectures it is expensive to call memset() for small sizes.
+ * Those architectures should provide their own implementation of "struct page"
+ * zeroing by defining this macro in .
+ */
+#ifndef mm_zero_struct_page
+#define mm_zero_struct_page(pp)  ((void)memset((pp), 0, sizeof(struct page)))
+#endif
+
 /*
  * Default maximum number of active map areas, this limits the number of vmas
  * per mm struct. Users can overwrite this number by sysctl but there is a
@@ -2001,6 +2010,12 @@ extern int __meminit __early_pfn_to_nid(unsigned long 
pfn,
struct mminit_pfnnid_cache *state);
 #endif
 
+#ifdef CONFIG_HAVE_MEMBLOCK
+void zero_resv_unavail(void);
+#else
+static inline void zero_resv_unavail(void) {}
+#endif
+
 extern void set_dma_reserve(unsigned long new_dma_reserve);
 extern void memmap_init_zone(unsigned long, int, unsigned long,
unsigned long, enum memmap_context);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 20b0bace2235..5f0013bbbe9d 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -6209,6 +6209,42 @@ void __paginginit free_area_init_node(int nid, unsigned 
long *zones_size,
free_area_init_core(pgdat);
 }
 
+#ifdef CONFIG_HAVE_MEMBLOCK
+/*
+ * Only struct pages that are backed by physical memory are zeroed and
+ * initialized by going through __init_single_page(). But, there are some
+ * struct pages which are reserved in memblock allocator and their fields
+ * may be accessed (for example page_to_pfn() on some configuration accesses
+ * flags). We must explicitly zero those struct pages.
+ */
+void __paginginit zero_resv_unavail(void)
+{
+   phys_addr_t start, end;
+   unsigned long pfn;
+   u64 i, pgcnt;
+
+   /* Loop through ranges that are reserved, but do not have reported
+* physical memory backing.
+*/
+  

[PATCH v11 2/9] sparc64/mm: setting fields in deferred pages

2017-10-09 Thread Pavel Tatashin
Without deferred struct page feature (CONFIG_DEFERRED_STRUCT_PAGE_INIT),
flags and other fields in "struct page"es are never changed prior to first
initializing struct pages by going through __init_single_page().

With deferred struct page feature enabled there is a case where we set some
fields prior to initializing:

mem_init() {
 register_page_bootmem_info();
 free_all_bootmem();
 ...
}

When register_page_bootmem_info() is called only non-deferred struct pages
are initialized. But, this function goes through some reserved pages which
might be part of the deferred, and thus are not yet initialized.

mem_init
register_page_bootmem_info
register_page_bootmem_info_node
 get_page_bootmem
  .. setting fields here ..
  such as: page->freelist = (void *)type;

free_all_bootmem()
free_low_memory_core_early()
 for_each_reserved_mem_region()
  reserve_bootmem_region()
   init_reserved_page() <- Only if this is deferred reserved page
__init_single_pfn()
 __init_single_page()
  memset(0) <-- Lose the set fields here

We end up with a similar issue as in the previous patch, where currently we
do not observe a problem as memory is zeroed. But, if flag asserts are
changed, we can start hitting issues.

Also, because in this patch series we will stop zeroing struct page memory
during allocation, we must make sure that struct pages are properly
initialized prior to using them.

The deferred-reserved pages are initialized in free_all_bootmem().
Therefore, the fix is to switch the above calls.

Signed-off-by: Pavel Tatashin <pasha.tatas...@oracle.com>
Reviewed-by: Steven Sistare <steven.sist...@oracle.com>
Reviewed-by: Daniel Jordan <daniel.m.jor...@oracle.com>
Reviewed-by: Bob Picco <bob.pi...@oracle.com>
Acked-by: David S. Miller <da...@davemloft.net>
Acked-by: Michal Hocko <mho...@suse.com>
---
 arch/sparc/mm/init_64.c | 9 -
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/arch/sparc/mm/init_64.c b/arch/sparc/mm/init_64.c
index 6034569e2c0d..caed495544e9 100644
--- a/arch/sparc/mm/init_64.c
+++ b/arch/sparc/mm/init_64.c
@@ -2548,9 +2548,16 @@ void __init mem_init(void)
 {
high_memory = __va(last_valid_pfn << PAGE_SHIFT);
 
-   register_page_bootmem_info();
free_all_bootmem();
 
+   /*
+* Must be done after boot memory is put on freelist, because here we
+* might set fields in deferred struct pages that have not yet been
+* initialized, and free_all_bootmem() initializes all the reserved
+* deferred pages for us.
+*/
+   register_page_bootmem_info();
+
/*
 * Set up the zero page, mark it reserved, so that page count
 * is not manipulated when freeing the page from user ptes.
-- 
2.14.2



[PATCH v11 6/9] x86/kasan: add and use kasan_map_populate()

2017-10-09 Thread Pavel Tatashin
During early boot, kasan uses vmemmap_populate() to establish its shadow
memory. But, that interface is intended for struct pages use.

Because of the current project, vmemmap won't be zeroed during allocation,
but kasan expects that memory to be zeroed. We are adding a new
kasan_map_populate() function to resolve this difference.

Therefore, we must use a new interface to allocate and map kasan shadow
memory, that also zeroes memory for us.

Signed-off-by: Pavel Tatashin <pasha.tatas...@oracle.com>
---
 arch/x86/mm/kasan_init_64.c | 75 ++---
 1 file changed, 71 insertions(+), 4 deletions(-)

diff --git a/arch/x86/mm/kasan_init_64.c b/arch/x86/mm/kasan_init_64.c
index bc84b73684b7..9778fec8a5dc 100644
--- a/arch/x86/mm/kasan_init_64.c
+++ b/arch/x86/mm/kasan_init_64.c
@@ -15,6 +15,73 @@
 
 extern struct range pfn_mapped[E820_MAX_ENTRIES];
 
+/* Creates mappings for kasan during early boot. The mapped memory is zeroed */
+static int __meminit kasan_map_populate(unsigned long start, unsigned long end,
+   int node)
+{
+   unsigned long addr, pfn, next;
+   unsigned long long size;
+   pgd_t *pgd;
+   p4d_t *p4d;
+   pud_t *pud;
+   pmd_t *pmd;
+   pte_t *pte;
+   int ret;
+
+   ret = vmemmap_populate(start, end, node);
+   /*
+* We might have partially populated memory, so check for no entries,
+* and zero only those that actually exist.
+*/
+   for (addr = start; addr < end; addr = next) {
+   pgd = pgd_offset_k(addr);
+   if (pgd_none(*pgd)) {
+   next = pgd_addr_end(addr, end);
+   continue;
+   }
+
+   p4d = p4d_offset(pgd, addr);
+   if (p4d_none(*p4d)) {
+   next = p4d_addr_end(addr, end);
+   continue;
+   }
+
+   pud = pud_offset(p4d, addr);
+   if (pud_none(*pud)) {
+   next = pud_addr_end(addr, end);
+   continue;
+   }
+   if (pud_large(*pud)) {
+   /* This is PUD size page */
+   next = pud_addr_end(addr, end);
+   size = PUD_SIZE;
+   pfn = pud_pfn(*pud);
+   } else {
+   pmd = pmd_offset(pud, addr);
+   if (pmd_none(*pmd)) {
+   next = pmd_addr_end(addr, end);
+   continue;
+   }
+   if (pmd_large(*pmd)) {
+   /* This is PMD size page */
+   next = pmd_addr_end(addr, end);
+   size = PMD_SIZE;
+   pfn = pmd_pfn(*pmd);
+   } else {
+   pte = pte_offset_kernel(pmd, addr);
+   next = addr + PAGE_SIZE;
+   if (pte_none(*pte))
+   continue;
+   /* This is base size page */
+   size = PAGE_SIZE;
+   pfn = pte_pfn(*pte);
+   }
+   }
+   memset(phys_to_virt(PFN_PHYS(pfn)), 0, size);
+   }
+   return ret;
+}
+
 static int __init map_range(struct range *range)
 {
unsigned long start;
@@ -23,7 +90,7 @@ static int __init map_range(struct range *range)
start = (unsigned long)kasan_mem_to_shadow(pfn_to_kaddr(range->start));
end = (unsigned long)kasan_mem_to_shadow(pfn_to_kaddr(range->end));
 
-   return vmemmap_populate(start, end, NUMA_NO_NODE);
+   return kasan_map_populate(start, end, NUMA_NO_NODE);
 }
 
 static void __init clear_pgds(unsigned long start,
@@ -136,9 +203,9 @@ void __init kasan_init(void)
kasan_mem_to_shadow((void *)PAGE_OFFSET + MAXMEM),
kasan_mem_to_shadow((void *)__START_KERNEL_map));
 
-   vmemmap_populate((unsigned long)kasan_mem_to_shadow(_stext),
-   (unsigned long)kasan_mem_to_shadow(_end),
-   NUMA_NO_NODE);
+   kasan_map_populate((unsigned long)kasan_mem_to_shadow(_stext),
+  (unsigned long)kasan_mem_to_shadow(_end),
+  NUMA_NO_NODE);
 
kasan_populate_zero_shadow(kasan_mem_to_shadow((void *)MODULES_END),
(void *)KASAN_SHADOW_END);
-- 
2.14.2



[PATCH v11 0/9] complete deferred page initialization

2017-10-09 Thread Pavel Tatashin
Changelog:
v11 - v10
- Moved kasan_map_populate() implementation from common code into arch
  specific as discussed with Will Deacon. We do not need
  "mm/kasan: kasan specific map populate function" anymore, so only
  9 patches left.

v10 - v9
- Addressed new comments from Michal Hocko.
- Sent "mm: deferred_init_memmap improvements" as a separate patch as
  it is also fixing existing problem.
- Merged "mm: stop zeroing memory during allocation in vmemmap" with
  "mm: zero struct pages during initialization".
- Added more comments "mm: zero reserved and unavailable struct pages"

v9 - v8
- Addressed comments raised by Mark Rutland and Ard Biesheuvel: changed
  kasan implementation. Added a new function: kasan_map_populate() that
  zeroes the allocated and mapped memory

v8 - v7
- Added Acked-by's from Dave Miller for SPARC changes
- Fixed a minor compiling issue on tile architecture reported by kbuild

v7 - v6
- Addressed comments from Michal Hocko
- memblock_discard() patch was removed from this series and integrated
  separately
- Fixed bug reported by kbuild test robot new patch:
  mm: zero reserved and unavailable struct pages
- Removed patch
  x86/mm: reserve only exiting low pages
  As, it is not needed anymore, because of the previous fix 
- Re-wrote deferred_init_memmap(), found and fixed an existing bug, where
  page variable is not reset when zone holes present.
- Merged several patches together per Michal request
- Added performance data including raw logs

v6 - v5
- Fixed ARM64 + kasan code, as reported by Ard Biesheuvel
- Tested ARM64 code in qemu and found few more issues, that I fixed in this
  iteration
- Added page roundup/rounddown to x86 and arm zeroing routines to zero the
  whole allocated range, instead of only provided address range.
- Addressed SPARC related comment from Sam Ravnborg
- Fixed section mismatch warnings related to memblock_discard().

v5 - v4
- Fixed build issues reported by kbuild on various configurations
v4 - v3
- Rewrote code to zero struct pages in __init_single_page() as
  suggested by Michal Hocko
- Added code to handle issues related to accessing struct page
  memory before they are initialized.

v3 - v2
- Addressed David Miller comments about one change per patch:
* Splited changes to platforms into 4 patches
* Made "do not zero vmemmap_buf" as a separate patch

v2 - v1
- Per request, added s390 to deferred "struct page" zeroing
- Collected performance data on x86 which proves the importance of
  keeping memset() as a prefetch (see below).

SMP machines can benefit from the DEFERRED_STRUCT_PAGE_INIT config option,
which defers initializing struct pages until all cpus have been started so
it can be done in parallel.

However, this feature is sub-optimal, because the deferred page
initialization code expects that the struct pages have already been zeroed,
and the zeroing is done early in boot with a single thread only.  Also, we
access that memory and set flags before struct pages are initialized. All
of this is fixed in this patchset.

In this work we do the following:
- Never read access struct page until it was initialized
- Never set any fields in struct pages before they are initialized
- Zero struct page at the beginning of struct page initialization


==
Performance improvements on x86 machine with 8 nodes:
Intel(R) Xeon(R) CPU E7-8895 v3 @ 2.60GHz and 1T of memory:
                     TIME            SPEED UP
base no deferred:    95.796233s
fix no deferred:     79.978956s      19.77%

base deferred:       77.254713s
fix deferred:        55.050509s      40.34%
==
SPARC M6 3600 MHz with 15T of memory
                     TIME            SPEED UP
base no deferred:    358.335727s
fix no deferred:     302.320936s     18.52%

base deferred:       237.534603s
fix deferred:        182.103003s     30.44%
==
Raw dmesg output with timestamps:
x86 base no deferred:https://hastebin.com/ofunepurit.scala
x86 base deferred:   https://hastebin.com/ifazegeyas.scala
x86 fix no deferred: https://hastebin.com/pegocohevo.scala
x86 fix deferred:https://hastebin.com/ofupevikuk.scala
sparc base no deferred:  https://hastebin.com/ibobeteken.go
sparc base deferred: https://hastebin.com/fariqimiyu.go
sparc fix no deferred:   https://hastebin.com/muhegoheyi.go
sparc fix deferred:  https://hastebin.com/xadinobutu.go

Pavel Tatashin (9):
  x86/mm: setting fields in deferred pages
  sparc64/mm: setting fields in deferred pages
  sparc64: simplify vmemmap_populate
  mm: defining memblock_virt_alloc_try_nid_raw
  mm: zero reserved and unavailable struct pages
  x86/kasan: add and use kasan_map_populate()
  arm64/kasan: add and use kasan_map_populate()
  mm:

[PATCH v11 8/9] mm: stop zeroing memory during allocation in vmemmap

2017-10-09 Thread Pavel Tatashin
vmemmap_alloc_block() will no longer zero the block, so zero memory
at its call sites for everything except struct pages.  Struct page memory
is zero'd by struct page initialization.

Replace allocators in sparse-vmemmap to use the non-zeroing version. So,
we will get the performance improvement by zeroing the memory in parallel
when struct pages are zeroed.

Add struct page zeroing as a part of initialization of other fields in
__init_single_page().

This is single-thread performance collected on: Intel(R) Xeon(R) CPU E7-8895
v3 @ 2.60GHz with 1T of memory (268400646 pages in 8 nodes):

                         BASE            FIX
sparse_init              11.244671836s   0.007199623s
zone_sizes_init           4.879775891s   8.355182299s
                         ------------------------------
Total                    16.124447727s   8.362381922s

sparse_init is where memory for struct pages is zeroed, and the zeroing
part is moved later in this patch into __init_single_page(), which is
called from zone_sizes_init().

Signed-off-by: Pavel Tatashin <pasha.tatas...@oracle.com>
Reviewed-by: Steven Sistare <steven.sist...@oracle.com>
Reviewed-by: Daniel Jordan <daniel.m.jor...@oracle.com>
Reviewed-by: Bob Picco <bob.pi...@oracle.com>
Acked-by: Michal Hocko <mho...@suse.com>
---
 include/linux/mm.h  | 11 +++
 mm/page_alloc.c |  1 +
 mm/sparse-vmemmap.c | 15 +++
 mm/sparse.c |  6 +++---
 4 files changed, 22 insertions(+), 11 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 04c8b2e5aff4..fd045a3b243a 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2501,6 +2501,17 @@ static inline void *vmemmap_alloc_block_buf(unsigned 
long size, int node)
return __vmemmap_alloc_block_buf(size, node, NULL);
 }
 
+static inline void *vmemmap_alloc_block_zero(unsigned long size, int node)
+{
+   void *p = vmemmap_alloc_block(size, node);
+
+   if (!p)
+   return NULL;
+   memset(p, 0, size);
+
+   return p;
+}
+
 void vmemmap_verify(pte_t *, int, unsigned long, unsigned long);
 int vmemmap_populate_basepages(unsigned long start, unsigned long end,
   int node);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 5f0013bbbe9d..85e038e1e941 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1170,6 +1170,7 @@ static void free_one_page(struct zone *zone,
 static void __meminit __init_single_page(struct page *page, unsigned long pfn,
unsigned long zone, int nid)
 {
+   mm_zero_struct_page(page);
set_page_links(page, zone, nid, pfn);
init_page_count(page);
page_mapcount_reset(page);
diff --git a/mm/sparse-vmemmap.c b/mm/sparse-vmemmap.c
index d1a39b8051e0..c2f5654e7c9d 100644
--- a/mm/sparse-vmemmap.c
+++ b/mm/sparse-vmemmap.c
@@ -41,7 +41,7 @@ static void * __ref __earlyonly_bootmem_alloc(int node,
unsigned long align,
unsigned long goal)
 {
-   return memblock_virt_alloc_try_nid(size, align, goal,
+   return memblock_virt_alloc_try_nid_raw(size, align, goal,
BOOTMEM_ALLOC_ACCESSIBLE, node);
 }
 
@@ -54,9 +54,8 @@ void * __meminit vmemmap_alloc_block(unsigned long size, int 
node)
if (slab_is_available()) {
struct page *page;
 
-   page = alloc_pages_node(node,
-   GFP_KERNEL | __GFP_ZERO | __GFP_RETRY_MAYFAIL,
-   get_order(size));
+   page = alloc_pages_node(node, GFP_KERNEL | __GFP_RETRY_MAYFAIL,
+   get_order(size));
if (page)
return page_address(page);
return NULL;
@@ -183,7 +182,7 @@ pmd_t * __meminit vmemmap_pmd_populate(pud_t *pud, unsigned 
long addr, int node)
 {
pmd_t *pmd = pmd_offset(pud, addr);
if (pmd_none(*pmd)) {
-   void *p = vmemmap_alloc_block(PAGE_SIZE, node);
+   void *p = vmemmap_alloc_block_zero(PAGE_SIZE, node);
if (!p)
return NULL;
pmd_populate_kernel(_mm, pmd, p);
@@ -195,7 +194,7 @@ pud_t * __meminit vmemmap_pud_populate(p4d_t *p4d, unsigned 
long addr, int node)
 {
pud_t *pud = pud_offset(p4d, addr);
if (pud_none(*pud)) {
-   void *p = vmemmap_alloc_block(PAGE_SIZE, node);
+   void *p = vmemmap_alloc_block_zero(PAGE_SIZE, node);
if (!p)
return NULL;
pud_populate(_mm, pud, p);
@@ -207,7 +206,7 @@ p4d_t * __meminit vmemmap_p4d_populate(pgd_t *pgd, unsigned 
long addr, int node)
 {
p4d_t *p4d = p4d_offset(pgd, addr);
if (p4d_none(*p4d)) {
-   void *p = vmemmap_alloc_block(PAGE_SIZE, node);
+   void *p = vmemmap_alloc_block_zero(PAGE_SIZE, node);
if (!p)
return NULL;
  

[PATCH v11 4/9] mm: defining memblock_virt_alloc_try_nid_raw

2017-10-09 Thread Pavel Tatashin
* A new variant of memblock_virt_alloc_* allocations:
memblock_virt_alloc_try_nid_raw()
- Does not zero the allocated memory
- Does not panic if request cannot be satisfied

* optimize early system hash allocations

Clients can call alloc_large_system_hash() with flag: HASH_ZERO to specify
that memory that was allocated for system hash needs to be zeroed,
otherwise the memory does not need to be zeroed, and client will initialize
it.

If memory does not need to be zero'd, call the new
memblock_virt_alloc_raw() interface, and thus improve the boot performance.

* debug for the raw allocator

When CONFIG_DEBUG_VM is enabled, this patch sets all the memory that is
returned by memblock_virt_alloc_try_nid_raw() to ones to ensure that no
places expect zeroed memory.
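
For reference, the raw variant amounts to roughly the following (a simplified
sketch; the allocation itself goes through the same
memblock_virt_alloc_internal() as the existing variants):

void * __init memblock_virt_alloc_try_nid_raw(phys_addr_t size, phys_addr_t align,
                                              phys_addr_t min_addr,
                                              phys_addr_t max_addr, int nid)
{
        void *ptr = memblock_virt_alloc_internal(size, align, min_addr,
                                                 max_addr, nid);
#ifdef CONFIG_DEBUG_VM
        /* poison with ones so callers that assume zeroed memory are caught */
        if (ptr && size > 0)
                memset(ptr, 0xff, size);
#endif
        return ptr;
}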

Signed-off-by: Pavel Tatashin <pasha.tatas...@oracle.com>
Reviewed-by: Steven Sistare <steven.sist...@oracle.com>
Reviewed-by: Daniel Jordan <daniel.m.jor...@oracle.com>
Reviewed-by: Bob Picco <bob.pi...@oracle.com>
Acked-by: Michal Hocko <mho...@suse.com>
---
 include/linux/bootmem.h | 27 ++
 mm/memblock.c   | 60 +++--
 mm/page_alloc.c | 15 ++---
 3 files changed, 87 insertions(+), 15 deletions(-)

diff --git a/include/linux/bootmem.h b/include/linux/bootmem.h
index e223d91b6439..ea30b3987282 100644
--- a/include/linux/bootmem.h
+++ b/include/linux/bootmem.h
@@ -160,6 +160,9 @@ extern void *__alloc_bootmem_low_node(pg_data_t *pgdat,
 #define BOOTMEM_ALLOC_ANYWHERE (~(phys_addr_t)0)
 
 /* FIXME: Move to memblock.h at a point where we remove nobootmem.c */
+void *memblock_virt_alloc_try_nid_raw(phys_addr_t size, phys_addr_t align,
+ phys_addr_t min_addr,
+ phys_addr_t max_addr, int nid);
 void *memblock_virt_alloc_try_nid_nopanic(phys_addr_t size,
phys_addr_t align, phys_addr_t min_addr,
phys_addr_t max_addr, int nid);
@@ -176,6 +179,14 @@ static inline void * __init memblock_virt_alloc(
NUMA_NO_NODE);
 }
 
+static inline void * __init memblock_virt_alloc_raw(
+   phys_addr_t size,  phys_addr_t align)
+{
+   return memblock_virt_alloc_try_nid_raw(size, align, BOOTMEM_LOW_LIMIT,
+   BOOTMEM_ALLOC_ACCESSIBLE,
+   NUMA_NO_NODE);
+}
+
 static inline void * __init memblock_virt_alloc_nopanic(
phys_addr_t size, phys_addr_t align)
 {
@@ -257,6 +268,14 @@ static inline void * __init memblock_virt_alloc(
return __alloc_bootmem(size, align, BOOTMEM_LOW_LIMIT);
 }
 
+static inline void * __init memblock_virt_alloc_raw(
+   phys_addr_t size,  phys_addr_t align)
+{
+   if (!align)
+   align = SMP_CACHE_BYTES;
+   return __alloc_bootmem_nopanic(size, align, BOOTMEM_LOW_LIMIT);
+}
+
 static inline void * __init memblock_virt_alloc_nopanic(
phys_addr_t size, phys_addr_t align)
 {
@@ -309,6 +328,14 @@ static inline void * __init 
memblock_virt_alloc_try_nid(phys_addr_t size,
  min_addr);
 }
 
+static inline void * __init memblock_virt_alloc_try_nid_raw(
+   phys_addr_t size, phys_addr_t align,
+   phys_addr_t min_addr, phys_addr_t max_addr, int nid)
+{
+   return ___alloc_bootmem_node_nopanic(NODE_DATA(nid), size, align,
+   min_addr, max_addr);
+}
+
 static inline void * __init memblock_virt_alloc_try_nid_nopanic(
phys_addr_t size, phys_addr_t align,
phys_addr_t min_addr, phys_addr_t max_addr, int nid)
diff --git a/mm/memblock.c b/mm/memblock.c
index 91205780e6b1..1f299fb1eb08 100644
--- a/mm/memblock.c
+++ b/mm/memblock.c
@@ -1327,7 +1327,6 @@ static void * __init memblock_virt_alloc_internal(
return NULL;
 done:
ptr = phys_to_virt(alloc);
-   memset(ptr, 0, size);
 
/*
 * The min_count is set to 0 so that bootmem allocated blocks
@@ -1340,6 +1339,45 @@ static void * __init memblock_virt_alloc_internal(
return ptr;
 }
 
+/**
+ * memblock_virt_alloc_try_nid_raw - allocate boot memory block without zeroing
+ * memory and without panicking
+ * @size: size of memory block to be allocated in bytes
+ * @align: alignment of the region and block's size
+ * @min_addr: the lower bound of the memory region from where the allocation
+ *   is preferred (phys address)
+ * @max_addr: the upper bound of the memory region from where the allocation
+ *   is preferred (phys address), or %BOOTMEM_ALLOC_ACCESSIBLE to
+ *   allocate only from memory limited by memblock.current_limit value
+ * @nid: nid of the free area to fi

[PATCH v11 9/9] sparc64: optimized struct page zeroing

2017-10-09 Thread Pavel Tatashin
Add an optimized mm_zero_struct_page(), so struct pages are zeroed without
calling memset(). We do eight to ten regular stores based on the size of
struct page. The compiler optimizes out the conditions of the switch() statement.

SPARC-M6 with 15T of memory, single thread performance:

                        BASE             FIX              OPTIMIZED_FIX
bootmem_init            28.440467985s    2.305674818s     2.305161615s
free_area_init_nodes   202.845901673s  225.343084508s   172.556506560s
                       ---------------------------------------------------
Total                  231.286369658s  227.648759326s   174.861668175s

BASE:  current linux
FIX:   This patch series without "optimized struct page zeroing"
OPTIMIZED_FIX: This patch series including the current patch.

bootmem_init() is where memory for struct pages is zeroed during
allocation. Note, about two seconds in this function is a fixed time: it
does not increase as memory is increased.

Signed-off-by: Pavel Tatashin <pasha.tatas...@oracle.com>
Reviewed-by: Steven Sistare <steven.sist...@oracle.com>
Reviewed-by: Daniel Jordan <daniel.m.jor...@oracle.com>
Reviewed-by: Bob Picco <bob.pi...@oracle.com>
Acked-by: David S. Miller <da...@davemloft.net>
---
 arch/sparc/include/asm/pgtable_64.h | 30 ++
 1 file changed, 30 insertions(+)

diff --git a/arch/sparc/include/asm/pgtable_64.h 
b/arch/sparc/include/asm/pgtable_64.h
index 4fefe3762083..8ed478abc630 100644
--- a/arch/sparc/include/asm/pgtable_64.h
+++ b/arch/sparc/include/asm/pgtable_64.h
@@ -230,6 +230,36 @@ extern unsigned long _PAGE_ALL_SZ_BITS;
 extern struct page *mem_map_zero;
 #define ZERO_PAGE(vaddr)   (mem_map_zero)
 
+/* This macro must be updated when the size of struct page grows above 80
+ * or reduces below 64.
+ * The idea that compiler optimizes out switch() statement, and only
+ * leaves clrx instructions
+ */
+#definemm_zero_struct_page(pp) do {
\
+   unsigned long *_pp = (void *)(pp);  \
+   \
+/* Check that struct page is either 64, 72, or 80 bytes */ \
+   BUILD_BUG_ON(sizeof(struct page) & 7);  \
+   BUILD_BUG_ON(sizeof(struct page) < 64); \
+   BUILD_BUG_ON(sizeof(struct page) > 80); \
+   \
+   switch (sizeof(struct page)) {  \
+   case 80:\
+   _pp[9] = 0; /* fallthrough */   \
+   case 72:\
+   _pp[8] = 0; /* fallthrough */   \
+   default:\
+   _pp[7] = 0; \
+   _pp[6] = 0; \
+   _pp[5] = 0; \
+   _pp[4] = 0; \
+   _pp[3] = 0; \
+   _pp[2] = 0; \
+   _pp[1] = 0; \
+   _pp[0] = 0; \
+   }   \
+} while (0)
+
 /* PFNs are real physical page numbers.  However, mem_map only begins to record
  * per-page information starting at pfn_base.  This is to handle systems where
  * the first physical page in the machine is at some huge physical address,
-- 
2.14.2



[PATCH v11 3/9] sparc64: simplify vmemmap_populate

2017-10-09 Thread Pavel Tatashin
Remove duplicated code by using the common functions
vmemmap_pud_populate and vmemmap_pgd_populate.

Signed-off-by: Pavel Tatashin <pasha.tatas...@oracle.com>
Reviewed-by: Steven Sistare <steven.sist...@oracle.com>
Reviewed-by: Daniel Jordan <daniel.m.jor...@oracle.com>
Reviewed-by: Bob Picco <bob.pi...@oracle.com>
Acked-by: David S. Miller <da...@davemloft.net>
Acked-by: Michal Hocko <mho...@suse.com>
---
 arch/sparc/mm/init_64.c | 23 ++-
 1 file changed, 6 insertions(+), 17 deletions(-)

diff --git a/arch/sparc/mm/init_64.c b/arch/sparc/mm/init_64.c
index caed495544e9..6839db3ffe1d 100644
--- a/arch/sparc/mm/init_64.c
+++ b/arch/sparc/mm/init_64.c
@@ -2652,30 +2652,19 @@ int __meminit vmemmap_populate(unsigned long vstart, 
unsigned long vend,
vstart = vstart & PMD_MASK;
vend = ALIGN(vend, PMD_SIZE);
for (; vstart < vend; vstart += PMD_SIZE) {
-   pgd_t *pgd = pgd_offset_k(vstart);
+   pgd_t *pgd = vmemmap_pgd_populate(vstart, node);
unsigned long pte;
pud_t *pud;
pmd_t *pmd;
 
-   if (pgd_none(*pgd)) {
-   pud_t *new = vmemmap_alloc_block(PAGE_SIZE, node);
+   if (!pgd)
+   return -ENOMEM;
 
-   if (!new)
-   return -ENOMEM;
-   pgd_populate(&init_mm, pgd, new);
-   }
-
-   pud = pud_offset(pgd, vstart);
-   if (pud_none(*pud)) {
-   pmd_t *new = vmemmap_alloc_block(PAGE_SIZE, node);
-
-   if (!new)
-   return -ENOMEM;
-   pud_populate(&init_mm, pud, new);
-   }
+   pud = vmemmap_pud_populate(pgd, vstart, node);
+   if (!pud)
+   return -ENOMEM;
 
pmd = pmd_offset(pud, vstart);
-
pte = pmd_val(*pmd);
if (!(pte & _PAGE_VALID)) {
void *block = vmemmap_alloc_block(PMD_SIZE, node);
-- 
2.14.2



[PATCH v11 7/9] arm64/kasan: add and use kasan_map_populate()

2017-10-09 Thread Pavel Tatashin
During early boot, kasan uses vmemmap_populate() to establish its shadow
memory. But, that interface is intended for struct pages use.

Because of the current project, vmemmap won't be zeroed during allocation,
but kasan expects that memory to be zeroed. We are adding a new
kasan_map_populate() function to resolve this difference.

Therefore, we must use a new interface to allocate and map kasan shadow
memory, that also zeroes memory for us.

Signed-off-by: Pavel Tatashin <pasha.tatas...@oracle.com>
---
 arch/arm64/mm/kasan_init.c | 72 ++
 1 file changed, 66 insertions(+), 6 deletions(-)

diff --git a/arch/arm64/mm/kasan_init.c b/arch/arm64/mm/kasan_init.c
index 81f03959a4ab..cb4af2951c90 100644
--- a/arch/arm64/mm/kasan_init.c
+++ b/arch/arm64/mm/kasan_init.c
@@ -28,6 +28,66 @@
 
 static pgd_t tmp_pg_dir[PTRS_PER_PGD] __initdata __aligned(PGD_SIZE);
 
+/* Creates mappings for kasan during early boot. The mapped memory is zeroed */
+static int __meminit kasan_map_populate(unsigned long start, unsigned long end,
+   int node)
+{
+   unsigned long addr, pfn, next;
+   unsigned long long size;
+   pgd_t *pgd;
+   pud_t *pud;
+   pmd_t *pmd;
+   pte_t *pte;
+   int ret;
+
+   ret = vmemmap_populate(start, end, node);
+   /*
+* We might have partially populated memory, so check for no entries,
+* and zero only those that actually exist.
+*/
+   for (addr = start; addr < end; addr = next) {
+   pgd = pgd_offset_k(addr);
+   if (pgd_none(*pgd)) {
+   next = pgd_addr_end(addr, end);
+   continue;
+   }
+
+   pud = pud_offset(pgd, addr);
+   if (pud_none(*pud)) {
+   next = pud_addr_end(addr, end);
+   continue;
+   }
+   if (pud_sect(*pud)) {
+   /* This is PUD size page */
+   next = pud_addr_end(addr, end);
+   size = PUD_SIZE;
+   pfn = pud_pfn(*pud);
+   } else {
+   pmd = pmd_offset(pud, addr);
+   if (pmd_none(*pmd)) {
+   next = pmd_addr_end(addr, end);
+   continue;
+   }
+   if (pmd_sect(*pmd)) {
+   /* This is PMD size page */
+   next = pmd_addr_end(addr, end);
+   size = PMD_SIZE;
+   pfn = pmd_pfn(*pmd);
+   } else {
+   pte = pte_offset_kernel(pmd, addr);
+   next = addr + PAGE_SIZE;
+   if (pte_none(*pte))
+   continue;
+   /* This is base size page */
+   size = PAGE_SIZE;
+   pfn = pte_pfn(*pte);
+   }
+   }
+   memset(phys_to_virt(PFN_PHYS(pfn)), 0, size);
+   }
+   return ret;
+}
+
 /*
  * The p*d_populate functions call virt_to_phys implicitly so they can't be 
used
  * directly on kernel symbols (bm_p*d). All the early functions are called too
@@ -161,11 +221,11 @@ void __init kasan_init(void)
 
clear_pgds(KASAN_SHADOW_START, KASAN_SHADOW_END);
 
-   vmemmap_populate(kimg_shadow_start, kimg_shadow_end,
-pfn_to_nid(virt_to_pfn(lm_alias(_text))));
+   kasan_map_populate(kimg_shadow_start, kimg_shadow_end,
+  pfn_to_nid(virt_to_pfn(lm_alias(_text))));
 
/*
-* vmemmap_populate() has populated the shadow region that covers the
+* kasan_map_populate() has populated the shadow region that covers the
 * kernel image with SWAPPER_BLOCK_SIZE mappings, so we have to round
 * the start and end addresses to SWAPPER_BLOCK_SIZE as well, to prevent
 * kasan_populate_zero_shadow() from replacing the page table entries
@@ -191,9 +251,9 @@ void __init kasan_init(void)
if (start >= end)
break;
 
-   vmemmap_populate((unsigned long)kasan_mem_to_shadow(start),
-   (unsigned long)kasan_mem_to_shadow(end),
-   pfn_to_nid(virt_to_pfn(start)));
+   kasan_map_populate((unsigned long)kasan_mem_to_shadow(start),
+  (unsigned long)kasan_mem_to_shadow(end),
+  pfn_to_nid(virt_to_pfn(start)));
}
 
/*
-- 
2.14.2



[PATCH v11 1/9] x86/mm: setting fields in deferred pages

2017-10-09 Thread Pavel Tatashin
Without the deferred struct page feature (CONFIG_DEFERRED_STRUCT_PAGE_INIT),
flags and other fields in struct pages are never changed prior to first
initializing the struct pages by going through __init_single_page().

With deferred struct page feature enabled, however, we set fields in
register_page_bootmem_info that are subsequently clobbered right after in
free_all_bootmem:

mem_init() {
register_page_bootmem_info();
free_all_bootmem();
...
}

When register_page_bootmem_info() is called only non-deferred struct pages
are initialized. But, this function goes through some reserved pages which
might be part of the deferred, and thus are not yet initialized.

  mem_init
   register_page_bootmem_info
register_page_bootmem_info_node
 get_page_bootmem
  .. setting fields here ..
  such as: page->freelist = (void *)type;

  free_all_bootmem()
   free_low_memory_core_early()
for_each_reserved_mem_region()
 reserve_bootmem_region()
  init_reserved_page() <- Only if this is deferred reserved page
   __init_single_pfn()
__init_single_page()
memset(0) <-- Lose the set fields here

We end up with an issue where we currently do not observe a problem because
the memory is explicitly zeroed. But if flag asserts are changed, we can
start hitting issues.

Also, because in this patch series we will stop zeroing struct page memory
during allocation, we must make sure that struct pages are properly
initialized prior to using them.

The deferred-reserved pages are initialized in free_all_bootmem().
Therefore, the fix is to switch the above calls.

Signed-off-by: Pavel Tatashin <pasha.tatas...@oracle.com>
Reviewed-by: Steven Sistare <steven.sist...@oracle.com>
Reviewed-by: Daniel Jordan <daniel.m.jor...@oracle.com>
Reviewed-by: Bob Picco <bob.pi...@oracle.com>
Acked-by: Michal Hocko <mho...@suse.com>
---
 arch/x86/mm/init_64.c | 10 --
 1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index 5ea1c3c2636e..8822523fdcd7 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -1182,12 +1182,18 @@ void __init mem_init(void)
 
/* clear_bss() already clear the empty_zero_page */
 
-   register_page_bootmem_info();
-
/* this will put all memory onto the freelists */
free_all_bootmem();
after_bootmem = 1;
 
+   /*
+* Must be done after boot memory is put on freelist, because here we
+* might set fields in deferred struct pages that have not yet been
+* initialized, and free_all_bootmem() initializes all the reserved
+* deferred pages for us.
+*/
+   register_page_bootmem_info();
+
/* Register memory areas for /proc/kcore */
kclist_add(&kcore_vsyscall, (void *)VSYSCALL_ADDR,
 PAGE_SIZE, KCORE_OTHER);
-- 
2.14.2



Re: [PATCH v9 09/12] mm/kasan: kasan specific map populate function

2017-10-09 Thread Pavel Tatashin
>> I guess we could implement that on arm64 using our current vmemmap_populate
>> logic and an explicit memset.

Hi Will,

I will send out a new patch series with x86/arm64  versions of
kasan_map_populate(), so you could take a look if this is something
that is acceptable.

Thank you,
Pavel


Re: [PATCH v9 09/12] mm/kasan: kasan specific map populate function

2017-10-09 Thread Pavel Tatashin
>
> Ok, but I'm still missing why you think that is needed. What would be the
> second page table walker that needs implementing?
>
> I guess we could implement that on arm64 using our current vmemmap_populate
> logic and an explicit memset.
>

Hi Will,

What do you mean by explicit memset()? We can't simply memset() from
start to end without doing the page table walk, because at the time
kasan is calling vmemmap_populate() we have a tmp_pg_dir instead of
swapper_pg_dir.

We could do the explicit memset() after
cpu_replace_ttbr1(lm_alias(swapper_pg_dir)); but again, this was in
one of my previous implementations, and I was asked to replace that.

Pavel


Re: [PATCH v9 09/12] mm/kasan: kasan specific map populate function

2017-10-09 Thread Pavel Tatashin
Hi Will,

> We have two table walks even with your patch series applied afaict: one in
> our definition of vmemmap_populate (arch/arm64/mm/mmu.c) and this one
> in the core code.

I meant to say implementing two new page table walkers, not at runtime.

> My worry is that these are actually highly arch-specific, but will likely
> grow more users in mm/ that assume things for all architectures that aren't
> necessarily valid.

I see; how about moving the new kasan_map_populate() implementation into
arch-dependent code:

arch/x86/mm/kasan_init_64.c
arch/arm64/mm/kasan_init.c

This way we won't need to add pmd_large()/pud_large() macros for arm64?

Pavel


Re: [PATCH v9 09/12] mm/kasan: kasan specific map populate function

2017-10-09 Thread Pavel Tatashin
Hi Will,

In addition to what Michal wrote:

> As an interim step, why not introduce something like
> vmemmap_alloc_block_flags and make the page-table walking opt-out for
> architectures that don't want it? Then we can just pass __GFP_ZERO from
> our vmemmap_populate where necessary and other architectures can do the
> page-table walking dance if they prefer.

I do not see the benefit: implementing this approach means that we
would need to implement two table walks instead of one: one for x86,
another for ARM, as these two architectures support kasan. Also, any
future architecture that wants to add kasan support would be required
to add its own page table walk implementation.

>> IMO, while I understand that it looks strange that we must walk page
>> table after creating it, it is a better approach: more enclosed as it
>> affects kasan only, and more universal as it is in common code.
>
> I don't buy the more universal aspect, but I appreciate it's subjective.
> Frankly, I'd just sooner not have core code walking early page tables if
> it can be avoided, and it doesn't look hard to avoid it in this case.
> The fact that you're having to add pmd_large and pud_large, which are
> otherwise unused in mm/, is an indication that this isn't quite right imo.

 28 +#define pmd_large(pmd) pmd_sect(pmd)
 29 +#define pud_large(pud) pud_sect(pud)

It is just a naming difference: ARM64 calls them pmd_sect/pud_sect, while
common mm and other arches call them pmd_large/pud_large. Even ARM has
these defines in

arm/include/asm/pgtable-3level.h
arm/include/asm/pgtable-2level.h

Pavel


Re: [PATCH v9 09/12] mm/kasan: kasan specific map populate function

2017-10-09 Thread Pavel Tatashin
Hi Will,

I can go back to that approach if Michal is OK with it. But that would
mean that I would need to touch every single architecture that
implements vmemmap_populate(), and also pass flags at least through
these functions on every architecture (some have more than one,
selected by configs):

vmemmap_populate()
vmemmap_populate_basepages()
vmemmap_populate_hugepages()
vmemmap_pte_populate()
__vmemmap_alloc_block_buf()
alloc_block_buf()
vmemmap_alloc_block()

IMO, while I understand that it looks strange that we must walk page
table after creating it, it is a better approach: more enclosed as it
affects kasan only, and more universal as it is in common code. We are
also somewhat late in the review process, which means we would again need
to get ACKs from the maintainers of the other arches.

Pavel

On Mon, Oct 9, 2017 at 1:13 PM, Will Deacon <will.dea...@arm.com> wrote:
> On Tue, Oct 03, 2017 at 03:48:46PM +0100, Mark Rutland wrote:
>> On Wed, Sep 20, 2017 at 04:17:11PM -0400, Pavel Tatashin wrote:
>> > During early boot, kasan uses vmemmap_populate() to establish its shadow
>> > memory. But, that interface is intended for struct pages use.
>> >
>> > Because of the current project, vmemmap won't be zeroed during allocation,
>> > but kasan expects that memory to be zeroed. We are adding a new
>> > kasan_map_populate() function to resolve this difference.
>>
>> Thanks for putting this together.
>>
>> I've given this a spin on arm64, and can confirm that it works.
>>
>> Given that this involes redundant walking of page tables, I still think
>> it'd be preferable to have some common *_populate() helper that took a
>> gfp argument, but I guess it's not the end of the world.
>>
>> I'll leave it to Will and Catalin to say whether they're happy with the
>> page table walking and the new p{u,m}d_large() helpers added to arm64.
>
> To be honest, it just looks completely backwards to me; we're walking the
> page tables we created earlier on so that we can figure out what needs to
> be zeroed for KASAN. We already had that information before, hence my
> preference to allow propagation of GFP_FLAGs to vmemmap_alloc_block when
> it's needed. I know that's not popular for some reason, but is walking the
> page tables really better?
>
> Will
>
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majord...@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: em...@kvack.org


[PATCH v10 10/10] sparc64: optimized struct page zeroing

2017-10-05 Thread Pavel Tatashin
Add an optimized mm_zero_struct_page(), so struct pages are zeroed without
calling memset(). We do eight to ten regular stores, based on the size of
struct page; the compiler optimizes out the conditions of the switch() statement.

SPARC-M6 with 15T of memory, single thread performance:

                      BASE            FIX            OPTIMIZED_FIX
bootmem_init   28.440467985s   2.305674818s   2.305161615s
free_area_init_nodes  202.845901673s 225.343084508s 172.556506560s
  
Total 231.286369658s 227.648759326s 174.861668175s

BASE:  current linux
FIX:   This patch series without "optimized struct page zeroing"
OPTIMIZED_FIX: This patch series including the current patch.

bootmem_init() is where memory for struct pages is zeroed during
allocation. Note that about two seconds of this function's time is fixed:
it does not increase as memory is increased.

Signed-off-by: Pavel Tatashin <pasha.tatas...@oracle.com>
Reviewed-by: Steven Sistare <steven.sist...@oracle.com>
Reviewed-by: Daniel Jordan <daniel.m.jor...@oracle.com>
Reviewed-by: Bob Picco <bob.pi...@oracle.com>
Acked-by: David S. Miller <da...@davemloft.net>
---
 arch/sparc/include/asm/pgtable_64.h | 30 ++
 1 file changed, 30 insertions(+)

diff --git a/arch/sparc/include/asm/pgtable_64.h 
b/arch/sparc/include/asm/pgtable_64.h
index 4fefe3762083..8ed478abc630 100644
--- a/arch/sparc/include/asm/pgtable_64.h
+++ b/arch/sparc/include/asm/pgtable_64.h
@@ -230,6 +230,36 @@ extern unsigned long _PAGE_ALL_SZ_BITS;
 extern struct page *mem_map_zero;
 #define ZERO_PAGE(vaddr)   (mem_map_zero)
 
+/* This macro must be updated when the size of struct page grows above 80
+ * or reduces below 64.
+ * The idea that compiler optimizes out switch() statement, and only
+ * leaves clrx instructions
+ */
+#define mm_zero_struct_page(pp) do {   \
+   unsigned long *_pp = (void *)(pp);  \
+   \
+/* Check that struct page is either 64, 72, or 80 bytes */ \
+   BUILD_BUG_ON(sizeof(struct page) & 7);  \
+   BUILD_BUG_ON(sizeof(struct page) < 64); \
+   BUILD_BUG_ON(sizeof(struct page) > 80); \
+   \
+   switch (sizeof(struct page)) {  \
+   case 80:\
+   _pp[9] = 0; /* fallthrough */   \
+   case 72:\
+   _pp[8] = 0; /* fallthrough */   \
+   default:\
+   _pp[7] = 0; \
+   _pp[6] = 0; \
+   _pp[5] = 0; \
+   _pp[4] = 0; \
+   _pp[3] = 0; \
+   _pp[2] = 0; \
+   _pp[1] = 0; \
+   _pp[0] = 0; \
+   }   \
+} while (0)
+
 /* PFNs are real physical page numbers.  However, mem_map only begins to record
  * per-page information starting at pfn_base.  This is to handle systems where
  * the first physical page in the machine is at some huge physical address,
-- 
2.14.2
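
For readers who want to try the trick outside the kernel tree, here is a small,
self-contained sketch of the same fallthrough-store idea, using a hypothetical
72-byte structure in place of struct page (an illustration only, not the kernel
macro; with -O2 the switch collapses into nine straight 8-byte stores, which on
SPARC become clrx instructions):

#include <assert.h>
#include <string.h>

struct fake_page { unsigned long w[9]; };       /* 72 bytes on LP64 */

static inline void zero_fake_page(struct fake_page *p)
{
        unsigned long *_pp = (unsigned long *)p;

        /* dead branches are removed because sizeof() is a compile-time constant */
        switch (sizeof(struct fake_page)) {
        case 80:
                _pp[9] = 0;     /* fallthrough */
        case 72:
                _pp[8] = 0;     /* fallthrough */
        default:
                _pp[7] = 0;
                _pp[6] = 0;
                _pp[5] = 0;
                _pp[4] = 0;
                _pp[3] = 0;
                _pp[2] = 0;
                _pp[1] = 0;
                _pp[0] = 0;
        }
}

int main(void)
{
        struct fake_page pg;
        int i;

        memset(&pg, 0xff, sizeof(pg));
        zero_fake_page(&pg);
        for (i = 0; i < 9; i++)
                assert(pg.w[i] == 0);
        return 0;
}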



[PATCH v10 08/10] arm64/kasan: use kasan_map_populate()

2017-10-05 Thread Pavel Tatashin
To optimize the performance of struct page initialization,
vmemmap_populate() will no longer zero memory.

Therefore, we must use a new interface to allocate and map kasan shadow
memory that also zeroes the memory for us.

Signed-off-by: Pavel Tatashin <pasha.tatas...@oracle.com>
---
 arch/arm64/mm/kasan_init.c | 12 ++--
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/arch/arm64/mm/kasan_init.c b/arch/arm64/mm/kasan_init.c
index 81f03959a4ab..b6e92cfa3ea3 100644
--- a/arch/arm64/mm/kasan_init.c
+++ b/arch/arm64/mm/kasan_init.c
@@ -161,11 +161,11 @@ void __init kasan_init(void)
 
clear_pgds(KASAN_SHADOW_START, KASAN_SHADOW_END);
 
-   vmemmap_populate(kimg_shadow_start, kimg_shadow_end,
-pfn_to_nid(virt_to_pfn(lm_alias(_text))));
+   kasan_map_populate(kimg_shadow_start, kimg_shadow_end,
+  pfn_to_nid(virt_to_pfn(lm_alias(_text))));
 
/*
-* vmemmap_populate() has populated the shadow region that covers the
+* kasan_map_populate() has populated the shadow region that covers the
 * kernel image with SWAPPER_BLOCK_SIZE mappings, so we have to round
 * the start and end addresses to SWAPPER_BLOCK_SIZE as well, to prevent
 * kasan_populate_zero_shadow() from replacing the page table entries
@@ -191,9 +191,9 @@ void __init kasan_init(void)
if (start >= end)
break;
 
-   vmemmap_populate((unsigned long)kasan_mem_to_shadow(start),
-   (unsigned long)kasan_mem_to_shadow(end),
-   pfn_to_nid(virt_to_pfn(start)));
+   kasan_map_populate((unsigned long)kasan_mem_to_shadow(start),
+  (unsigned long)kasan_mem_to_shadow(end),
+  pfn_to_nid(virt_to_pfn(start)));
}
 
/*
-- 
2.14.2



[PATCH v10 06/10] mm/kasan: kasan specific map populate function

2017-10-05 Thread Pavel Tatashin
During early boot, kasan uses vmemmap_populate() to establish its shadow
memory. But that interface is intended for struct page use.

With this series, vmemmap memory is no longer zeroed during allocation,
but kasan expects that memory to be zeroed. We are adding a new
kasan_map_populate() function to resolve this difference.

Signed-off-by: Pavel Tatashin <pasha.tatas...@oracle.com>
---
 arch/arm64/include/asm/pgtable.h |  3 ++
 include/linux/kasan.h|  2 ++
 mm/kasan/kasan_init.c| 67 
 3 files changed, 72 insertions(+)

diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index b46e54c2399b..11ff58901519 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -381,6 +381,9 @@ extern pgprot_t phys_mem_access_prot(struct file *file, 
unsigned long pfn,
 PUD_TYPE_TABLE)
 #endif
 
+#define pmd_large(pmd) pmd_sect(pmd)
+#define pud_large(pud) pud_sect(pud)
+
 static inline void set_pmd(pmd_t *pmdp, pmd_t pmd)
 {
*pmdp = pmd;
diff --git a/include/linux/kasan.h b/include/linux/kasan.h
index a5c7046f26b4..7e13df1722c2 100644
--- a/include/linux/kasan.h
+++ b/include/linux/kasan.h
@@ -78,6 +78,8 @@ size_t kasan_metadata_size(struct kmem_cache *cache);
 
 bool kasan_save_enable_multi_shot(void);
 void kasan_restore_multi_shot(bool enabled);
+int __meminit kasan_map_populate(unsigned long start, unsigned long end,
+int node);
 
 #else /* CONFIG_KASAN */
 
diff --git a/mm/kasan/kasan_init.c b/mm/kasan/kasan_init.c
index 554e4c0f23a2..57a973f05f63 100644
--- a/mm/kasan/kasan_init.c
+++ b/mm/kasan/kasan_init.c
@@ -197,3 +197,70 @@ void __init kasan_populate_zero_shadow(const void 
*shadow_start,
zero_p4d_populate(pgd, addr, next);
} while (pgd++, addr = next, addr != end);
 }
+
+/* Creates mappings for kasan during early boot. The mapped memory is zeroed */
+int __meminit kasan_map_populate(unsigned long start, unsigned long end,
+int node)
+{
+   unsigned long addr, pfn, next;
+   unsigned long long size;
+   pgd_t *pgd;
+   p4d_t *p4d;
+   pud_t *pud;
+   pmd_t *pmd;
+   pte_t *pte;
+   int ret;
+
+   ret = vmemmap_populate(start, end, node);
+   /*
+* We might have partially populated memory, so check for no entries,
+* and zero only those that actually exist.
+*/
+   for (addr = start; addr < end; addr = next) {
+   pgd = pgd_offset_k(addr);
+   if (pgd_none(*pgd)) {
+   next = pgd_addr_end(addr, end);
+   continue;
+   }
+
+   p4d = p4d_offset(pgd, addr);
+   if (p4d_none(*p4d)) {
+   next = p4d_addr_end(addr, end);
+   continue;
+   }
+
+   pud = pud_offset(p4d, addr);
+   if (pud_none(*pud)) {
+   next = pud_addr_end(addr, end);
+   continue;
+   }
+   if (pud_large(*pud)) {
+   /* This is PUD size page */
+   next = pud_addr_end(addr, end);
+   size = PUD_SIZE;
+   pfn = pud_pfn(*pud);
+   } else {
+   pmd = pmd_offset(pud, addr);
+   if (pmd_none(*pmd)) {
+   next = pmd_addr_end(addr, end);
+   continue;
+   }
+   if (pmd_large(*pmd)) {
+   /* This is PMD size page */
+   next = pmd_addr_end(addr, end);
+   size = PMD_SIZE;
+   pfn = pmd_pfn(*pmd);
+   } else {
+   pte = pte_offset_kernel(pmd, addr);
+   next = addr + PAGE_SIZE;
+   if (pte_none(*pte))
+   continue;
+   /* This is base size page */
+   size = PAGE_SIZE;
+   pfn = pte_pfn(*pte);
+   }
+   }
+   memset(phys_to_virt(PFN_PHYS(pfn)), 0, size);
+   }
+   return ret;
+}
-- 
2.14.2
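
One detail of the walk above that is worth spelling out: the zeroing goes
through phys_to_virt(PFN_PHYS(pfn)), i.e. through the linear map, not through
the shadow virtual address that was just populated. As discussed elsewhere in
this thread, on arm64 the shadow range can still be mapped via tmp_pg_dir
rather than swapper_pg_dir at this point, so a plain memset() over [start, end)
would not be safe. A minimal sketch of one PMD-sized step, for illustration
only:

        if (pmd_large(*pmd)) {
                unsigned long pfn = pmd_pfn(*pmd);

                /* zero through the linear alias of the backing pages, not
                 * through the shadow virtual address, which may not be live
                 * yet on all architectures
                 */
                memset(phys_to_virt(PFN_PHYS(pfn)), 0, PMD_SIZE);
        }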



[PATCH v10 04/10] mm: defining memblock_virt_alloc_try_nid_raw

2017-10-05 Thread Pavel Tatashin
* A new variant of memblock_virt_alloc_* allocations:
memblock_virt_alloc_try_nid_raw()
- Does not zero the allocated memory
- Does not panic if request cannot be satisfied

* optimize early system hash allocations

Clients can call alloc_large_system_hash() with the HASH_ZERO flag to specify
that the memory allocated for the system hash needs to be zeroed;
otherwise the memory does not need to be zeroed, and the client will
initialize it.

If the memory does not need to be zeroed, call the new
memblock_virt_alloc_raw() interface, and thus improve boot performance.

* debug for raw allocator

When CONFIG_DEBUG_VM is enabled, this patch sets all the memory that is
returned by memblock_virt_alloc_try_nid_raw() to ones, to ensure that no
places expect zeroed memory.

Signed-off-by: Pavel Tatashin <pasha.tatas...@oracle.com>
Reviewed-by: Steven Sistare <steven.sist...@oracle.com>
Reviewed-by: Daniel Jordan <daniel.m.jor...@oracle.com>
Reviewed-by: Bob Picco <bob.pi...@oracle.com>
Acked-by: Michal Hocko <mho...@suse.com>
---
 include/linux/bootmem.h | 27 ++
 mm/memblock.c   | 60 +++--
 mm/page_alloc.c | 15 ++---
 3 files changed, 87 insertions(+), 15 deletions(-)

diff --git a/include/linux/bootmem.h b/include/linux/bootmem.h
index e223d91b6439..ea30b3987282 100644
--- a/include/linux/bootmem.h
+++ b/include/linux/bootmem.h
@@ -160,6 +160,9 @@ extern void *__alloc_bootmem_low_node(pg_data_t *pgdat,
 #define BOOTMEM_ALLOC_ANYWHERE (~(phys_addr_t)0)
 
 /* FIXME: Move to memblock.h at a point where we remove nobootmem.c */
+void *memblock_virt_alloc_try_nid_raw(phys_addr_t size, phys_addr_t align,
+ phys_addr_t min_addr,
+ phys_addr_t max_addr, int nid);
 void *memblock_virt_alloc_try_nid_nopanic(phys_addr_t size,
phys_addr_t align, phys_addr_t min_addr,
phys_addr_t max_addr, int nid);
@@ -176,6 +179,14 @@ static inline void * __init memblock_virt_alloc(
NUMA_NO_NODE);
 }
 
+static inline void * __init memblock_virt_alloc_raw(
+   phys_addr_t size,  phys_addr_t align)
+{
+   return memblock_virt_alloc_try_nid_raw(size, align, BOOTMEM_LOW_LIMIT,
+   BOOTMEM_ALLOC_ACCESSIBLE,
+   NUMA_NO_NODE);
+}
+
 static inline void * __init memblock_virt_alloc_nopanic(
phys_addr_t size, phys_addr_t align)
 {
@@ -257,6 +268,14 @@ static inline void * __init memblock_virt_alloc(
return __alloc_bootmem(size, align, BOOTMEM_LOW_LIMIT);
 }
 
+static inline void * __init memblock_virt_alloc_raw(
+   phys_addr_t size,  phys_addr_t align)
+{
+   if (!align)
+   align = SMP_CACHE_BYTES;
+   return __alloc_bootmem_nopanic(size, align, BOOTMEM_LOW_LIMIT);
+}
+
 static inline void * __init memblock_virt_alloc_nopanic(
phys_addr_t size, phys_addr_t align)
 {
@@ -309,6 +328,14 @@ static inline void * __init 
memblock_virt_alloc_try_nid(phys_addr_t size,
  min_addr);
 }
 
+static inline void * __init memblock_virt_alloc_try_nid_raw(
+   phys_addr_t size, phys_addr_t align,
+   phys_addr_t min_addr, phys_addr_t max_addr, int nid)
+{
+   return ___alloc_bootmem_node_nopanic(NODE_DATA(nid), size, align,
+   min_addr, max_addr);
+}
+
 static inline void * __init memblock_virt_alloc_try_nid_nopanic(
phys_addr_t size, phys_addr_t align,
phys_addr_t min_addr, phys_addr_t max_addr, int nid)
diff --git a/mm/memblock.c b/mm/memblock.c
index 91205780e6b1..1f299fb1eb08 100644
--- a/mm/memblock.c
+++ b/mm/memblock.c
@@ -1327,7 +1327,6 @@ static void * __init memblock_virt_alloc_internal(
return NULL;
 done:
ptr = phys_to_virt(alloc);
-   memset(ptr, 0, size);
 
/*
 * The min_count is set to 0 so that bootmem allocated blocks
@@ -1340,6 +1339,45 @@ static void * __init memblock_virt_alloc_internal(
return ptr;
 }
 
+/**
+ * memblock_virt_alloc_try_nid_raw - allocate boot memory block without zeroing
+ * memory and without panicking
+ * @size: size of memory block to be allocated in bytes
+ * @align: alignment of the region and block's size
+ * @min_addr: the lower bound of the memory region from where the allocation
+ *   is preferred (phys address)
+ * @max_addr: the upper bound of the memory region from where the allocation
+ *   is preferred (phys address), or %BOOTMEM_ALLOC_ACCESSIBLE to
+ *   allocate only from memory limited by memblock.current_limit value
+ * @nid: nid of the free area to fi

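To make the intended usage concrete, here is a hedged sketch of a hypothetical
boot-time caller (only memblock_virt_alloc_raw() and memblock_virt_alloc() are
from this patch; the table and the helper are made up, and kernel context,
linux/bootmem.h, is assumed):

/* Hypothetical caller: every slot is written right after allocation, so the
 * raw (non-zeroing) variant is safe and saves one pass over the memory.
 */
static unsigned long *boot_table __initdata;

static void __init boot_table_alloc(unsigned long nr_entries)
{
        unsigned long i;

        boot_table = memblock_virt_alloc_raw(nr_entries * sizeof(*boot_table),
                                             SMP_CACHE_BYTES);
        if (!boot_table)
                return;                 /* the raw variant does not panic */

        for (i = 0; i < nr_entries; i++)
                boot_table[i] = i;      /* caller-side initialization */
}

If the buffer could be read before it is fully written, the zeroing variant
memblock_virt_alloc() would still be the right call.
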
[PATCH v10 09/10] mm: stop zeroing memory during allocation in vmemmap

2017-10-05 Thread Pavel Tatashin
vmemmap_alloc_block() will no longer zero the block, so zero memory
at its call sites for everything except struct pages.  Struct page memory
is zero'd by struct page initialization.

Replace allocators in sparse-vmemmap with the non-zeroing version, so
we get the performance improvement of zeroing the memory in parallel
when struct pages are zeroed.

Add struct page zeroing as a part of initialization of other fields in
__init_single_page().

This single-thread performance was collected on: Intel(R) Xeon(R) CPU E7-8895
v3 @ 2.60GHz with 1T of memory (268400646 pages in 8 nodes):

                    BASE            FIX
sparse_init 11.244671836s   0.007199623s
zone_sizes_init  4.879775891s   8.355182299s
  --
Total   16.124447727s   8.362381922s

sparse_init is where memory for struct pages is zeroed, and the zeroing
part is moved later in this patch into __init_single_page(), which is
called from zone_sizes_init().

Signed-off-by: Pavel Tatashin <pasha.tatas...@oracle.com>
Reviewed-by: Steven Sistare <steven.sist...@oracle.com>
Reviewed-by: Daniel Jordan <daniel.m.jor...@oracle.com>
Reviewed-by: Bob Picco <bob.pi...@oracle.com>
Acked-by: Michal Hocko <mho...@suse.com>
---
 include/linux/mm.h  | 11 +++
 mm/page_alloc.c |  1 +
 mm/sparse-vmemmap.c | 15 +++
 mm/sparse.c |  6 +++---
 4 files changed, 22 insertions(+), 11 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 04c8b2e5aff4..fd045a3b243a 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2501,6 +2501,17 @@ static inline void *vmemmap_alloc_block_buf(unsigned 
long size, int node)
return __vmemmap_alloc_block_buf(size, node, NULL);
 }
 
+static inline void *vmemmap_alloc_block_zero(unsigned long size, int node)
+{
+   void *p = vmemmap_alloc_block(size, node);
+
+   if (!p)
+   return NULL;
+   memset(p, 0, size);
+
+   return p;
+}
+
 void vmemmap_verify(pte_t *, int, unsigned long, unsigned long);
 int vmemmap_populate_basepages(unsigned long start, unsigned long end,
   int node);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 5f0013bbbe9d..85e038e1e941 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1170,6 +1170,7 @@ static void free_one_page(struct zone *zone,
 static void __meminit __init_single_page(struct page *page, unsigned long pfn,
unsigned long zone, int nid)
 {
+   mm_zero_struct_page(page);
set_page_links(page, zone, nid, pfn);
init_page_count(page);
page_mapcount_reset(page);
diff --git a/mm/sparse-vmemmap.c b/mm/sparse-vmemmap.c
index d1a39b8051e0..c2f5654e7c9d 100644
--- a/mm/sparse-vmemmap.c
+++ b/mm/sparse-vmemmap.c
@@ -41,7 +41,7 @@ static void * __ref __earlyonly_bootmem_alloc(int node,
unsigned long align,
unsigned long goal)
 {
-   return memblock_virt_alloc_try_nid(size, align, goal,
+   return memblock_virt_alloc_try_nid_raw(size, align, goal,
BOOTMEM_ALLOC_ACCESSIBLE, node);
 }
 
@@ -54,9 +54,8 @@ void * __meminit vmemmap_alloc_block(unsigned long size, int 
node)
if (slab_is_available()) {
struct page *page;
 
-   page = alloc_pages_node(node,
-   GFP_KERNEL | __GFP_ZERO | __GFP_RETRY_MAYFAIL,
-   get_order(size));
+   page = alloc_pages_node(node, GFP_KERNEL | __GFP_RETRY_MAYFAIL,
+   get_order(size));
if (page)
return page_address(page);
return NULL;
@@ -183,7 +182,7 @@ pmd_t * __meminit vmemmap_pmd_populate(pud_t *pud, unsigned 
long addr, int node)
 {
pmd_t *pmd = pmd_offset(pud, addr);
if (pmd_none(*pmd)) {
-   void *p = vmemmap_alloc_block(PAGE_SIZE, node);
+   void *p = vmemmap_alloc_block_zero(PAGE_SIZE, node);
if (!p)
return NULL;
pmd_populate_kernel(&init_mm, pmd, p);
@@ -195,7 +194,7 @@ pud_t * __meminit vmemmap_pud_populate(p4d_t *p4d, unsigned 
long addr, int node)
 {
pud_t *pud = pud_offset(p4d, addr);
if (pud_none(*pud)) {
-   void *p = vmemmap_alloc_block(PAGE_SIZE, node);
+   void *p = vmemmap_alloc_block_zero(PAGE_SIZE, node);
if (!p)
return NULL;
pud_populate(&init_mm, pud, p);
@@ -207,7 +206,7 @@ p4d_t * __meminit vmemmap_p4d_populate(pgd_t *pgd, unsigned 
long addr, int node)
 {
p4d_t *p4d = p4d_offset(pgd, addr);
if (p4d_none(*p4d)) {
-   void *p = vmemmap_alloc_block(PAGE_SIZE, node);
+   void *p = vmemmap_alloc_block_zero(PAGE_SIZE, node);
if (!p)
return NULL;
  

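The rule of thumb after this change, as a short sketch (the two allocator names
are from this series; the helper around them is hypothetical):

/* Hypothetical helper illustrating the split: page-table pages must read back
 * as "none" until they are populated, so they need the zeroing wrapper; the
 * struct page backing store is zeroed later in __init_single_page(), so the
 * raw allocator is enough for it.
 */
static void * __meminit vmemmap_alloc_sketch(unsigned long size, int node,
                                             bool pagetable)
{
        if (pagetable)
                return vmemmap_alloc_block_zero(size, node);
        return vmemmap_alloc_block(size, node);
}
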
[PATCH v10 02/10] sparc64/mm: setting fields in deferred pages

2017-10-05 Thread Pavel Tatashin
Without the deferred struct page feature (CONFIG_DEFERRED_STRUCT_PAGE_INIT),
flags and other fields in struct pages are never changed prior to first
initializing the struct pages by going through __init_single_page().

With deferred struct page feature enabled there is a case where we set some
fields prior to initializing:

mem_init() {
 register_page_bootmem_info();
 free_all_bootmem();
 ...
}

When register_page_bootmem_info() is called only non-deferred struct pages
are initialized. But, this function goes through some reserved pages which
might be part of the deferred, and thus are not yet initialized.

mem_init
register_page_bootmem_info
register_page_bootmem_info_node
 get_page_bootmem
  .. setting fields here ..
  such as: page->freelist = (void *)type;

free_all_bootmem()
free_low_memory_core_early()
 for_each_reserved_mem_region()
  reserve_bootmem_region()
   init_reserved_page() <- Only if this is deferred reserved page
__init_single_pfn()
 __init_single_page()
  memset(0) <-- Lose the set fields here

We end up with a similar issue as in the previous patch, where we currently
do not observe a problem because the memory is zeroed. But if flag asserts
are changed, we can start hitting issues.

Also, because in this patch series we will stop zeroing struct page memory
during allocation, we must make sure that struct pages are properly
initialized prior to using them.

The deferred-reserved pages are initialized in free_all_bootmem().
Therefore, the fix is to switch the above calls.

Signed-off-by: Pavel Tatashin <pasha.tatas...@oracle.com>
Reviewed-by: Steven Sistare <steven.sist...@oracle.com>
Reviewed-by: Daniel Jordan <daniel.m.jor...@oracle.com>
Reviewed-by: Bob Picco <bob.pi...@oracle.com>
Acked-by: David S. Miller <da...@davemloft.net>
Acked-by: Michal Hocko <mho...@suse.com>
---
 arch/sparc/mm/init_64.c | 9 -
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/arch/sparc/mm/init_64.c b/arch/sparc/mm/init_64.c
index 6034569e2c0d..caed495544e9 100644
--- a/arch/sparc/mm/init_64.c
+++ b/arch/sparc/mm/init_64.c
@@ -2548,9 +2548,16 @@ void __init mem_init(void)
 {
high_memory = __va(last_valid_pfn << PAGE_SHIFT);
 
-   register_page_bootmem_info();
free_all_bootmem();
 
+   /*
+* Must be done after boot memory is put on freelist, because here we
+* might set fields in deferred struct pages that have not yet been
+* initialized, and free_all_bootmem() initializes all the reserved
+* deferred pages for us.
+*/
+   register_page_bootmem_info();
+
/*
 * Set up the zero page, mark it reserved, so that page count
 * is not manipulated when freeing the page from user ptes.
-- 
2.14.2



[PATCH v10 07/10] x86/kasan: use kasan_map_populate()

2017-10-05 Thread Pavel Tatashin
To optimize the performance of struct page initialization,
vmemmap_populate() will no longer zero memory.

Therefore, we must use a new interface to allocate and map kasan shadow
memory that also zeroes the memory for us.

Signed-off-by: Pavel Tatashin <pasha.tatas...@oracle.com>
---
 arch/x86/mm/kasan_init_64.c | 8 
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/arch/x86/mm/kasan_init_64.c b/arch/x86/mm/kasan_init_64.c
index bc84b73684b7..2db95efd208e 100644
--- a/arch/x86/mm/kasan_init_64.c
+++ b/arch/x86/mm/kasan_init_64.c
@@ -23,7 +23,7 @@ static int __init map_range(struct range *range)
start = (unsigned long)kasan_mem_to_shadow(pfn_to_kaddr(range->start));
end = (unsigned long)kasan_mem_to_shadow(pfn_to_kaddr(range->end));
 
-   return vmemmap_populate(start, end, NUMA_NO_NODE);
+   return kasan_map_populate(start, end, NUMA_NO_NODE);
 }
 
 static void __init clear_pgds(unsigned long start,
@@ -136,9 +136,9 @@ void __init kasan_init(void)
kasan_mem_to_shadow((void *)PAGE_OFFSET + MAXMEM),
kasan_mem_to_shadow((void *)__START_KERNEL_map));
 
-   vmemmap_populate((unsigned long)kasan_mem_to_shadow(_stext),
-   (unsigned long)kasan_mem_to_shadow(_end),
-   NUMA_NO_NODE);
+   kasan_map_populate((unsigned long)kasan_mem_to_shadow(_stext),
+  (unsigned long)kasan_mem_to_shadow(_end),
+  NUMA_NO_NODE);
 
kasan_populate_zero_shadow(kasan_mem_to_shadow((void *)MODULES_END),
(void *)KASAN_SHADOW_END);
-- 
2.14.2



[PATCH v10 05/10] mm: zero reserved and unavailable struct pages

2017-10-05 Thread Pavel Tatashin
Some memory is reserved but unavailable: not present in memblock.memory
(because not backed by physical pages), but present in memblock.reserved.
Such memory has backing struct pages, but they are not initialized by going
through __init_single_page().

In some cases these struct pages are accessed even if they do not contain
any data. One example is page_to_pfn() might access page->flags if this is
where section information is stored (CONFIG_SPARSEMEM,
SECTION_IN_PAGE_FLAGS).

One example of such memory: trim_low_memory_range() unconditionally
reserves from pfn 0, but e820__memblock_setup() might provide the existing
memory starting from pfn 1 (e.g. KVM).

Since struct pages are zeroed in __init_single_page(), and not at
allocation time, we must zero such struct pages explicitly.

The patch involves adding a new memblock iterator:
for_each_resv_unavail_range(i, p_start, p_end)

Which iterates through reserved && !memory lists, and we zero struct pages
explicitly by calling mm_zero_struct_page().

Signed-off-by: Pavel Tatashin <pasha.tatas...@oracle.com>
Reviewed-by: Steven Sistare <steven.sist...@oracle.com>
Reviewed-by: Daniel Jordan <daniel.m.jor...@oracle.com>
Reviewed-by: Bob Picco <bob.pi...@oracle.com>
---
 include/linux/memblock.h | 16 
 include/linux/mm.h   | 15 +++
 mm/page_alloc.c  | 38 ++
 3 files changed, 69 insertions(+)

diff --git a/include/linux/memblock.h b/include/linux/memblock.h
index bae11c7e7bf3..ce8bfa5f3e9b 100644
--- a/include/linux/memblock.h
+++ b/include/linux/memblock.h
@@ -237,6 +237,22 @@ unsigned long memblock_next_valid_pfn(unsigned long pfn, 
unsigned long max_pfn);
for_each_mem_range_rev(i, &memblock.memory, &memblock.reserved, \
   nid, flags, p_start, p_end, p_nid)
 
+/**
+ * for_each_resv_unavail_range - iterate through reserved and unavailable 
memory
+ * @i: u64 used as loop variable
+ * @flags: pick from blocks based on memory attributes
+ * @p_start: ptr to phys_addr_t for start address of the range, can be %NULL
+ * @p_end: ptr to phys_addr_t for end address of the range, can be %NULL
+ *
+ * Walks over unavailable but reserved (reserved && !memory) areas of memblock.
+ * Available as soon as memblock is initialized.
+ * Note: because this memory does not belong to any physical node, flags and
+ * nid arguments do not make sense and thus not exported as arguments.
+ */
+#define for_each_resv_unavail_range(i, p_start, p_end) \
+   for_each_mem_range(i, &memblock.reserved, &memblock.memory, \
+  NUMA_NO_NODE, MEMBLOCK_NONE, p_start, p_end, NULL)
+
 static inline void memblock_set_region_flags(struct memblock_region *r,
 unsigned long flags)
 {
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 065d99deb847..04c8b2e5aff4 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -94,6 +94,15 @@ extern int mmap_rnd_compat_bits __read_mostly;
 #define mm_forbids_zeropage(X) (0)
 #endif
 
+/*
+ * On some architectures it is expensive to call memset() for small sizes.
+ * Those architectures should provide their own implementation of "struct page"
+ * zeroing by defining this macro in .
+ */
+#ifndef mm_zero_struct_page
+#define mm_zero_struct_page(pp)  ((void)memset((pp), 0, sizeof(struct page)))
+#endif
+
 /*
  * Default maximum number of active map areas, this limits the number of vmas
  * per mm struct. Users can overwrite this number by sysctl but there is a
@@ -2001,6 +2010,12 @@ extern int __meminit __early_pfn_to_nid(unsigned long 
pfn,
struct mminit_pfnnid_cache *state);
 #endif
 
+#ifdef CONFIG_HAVE_MEMBLOCK
+void zero_resv_unavail(void);
+#else
+static inline void zero_resv_unavail(void) {}
+#endif
+
 extern void set_dma_reserve(unsigned long new_dma_reserve);
 extern void memmap_init_zone(unsigned long, int, unsigned long,
unsigned long, enum memmap_context);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 20b0bace2235..5f0013bbbe9d 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -6209,6 +6209,42 @@ void __paginginit free_area_init_node(int nid, unsigned 
long *zones_size,
free_area_init_core(pgdat);
 }
 
+#ifdef CONFIG_HAVE_MEMBLOCK
+/*
+ * Only struct pages that are backed by physical memory are zeroed and
+ * initialized by going through __init_single_page(). But, there are some
+ * struct pages which are reserved in memblock allocator and their fields
+ * may be accessed (for example page_to_pfn() on some configuration accesses
+ * flags). We must explicitly zero those struct pages.
+ */
+void __paginginit zero_resv_unavail(void)
+{
+   phys_addr_t start, end;
+   unsigned long pfn;
+   u64 i, pgcnt;
+
+   /* Loop through ranges that are reserved, but do not have reported
+* physical memory backing.
+*/
+  

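The hunk above shows the new iterator and the local declarations; as a rough
sketch, it is meant to pair with mm_zero_struct_page() from the earlier patch
roughly like this (illustration only, the exact body of zero_resv_unavail()
may differ):

        pgcnt = 0;
        for_each_resv_unavail_range(i, &start, &end) {
                for (pfn = PFN_DOWN(start); pfn < PFN_UP(end); pfn++) {
                        mm_zero_struct_page(pfn_to_page(pfn));
                        pgcnt++;
                }
        }
        if (pgcnt)
                pr_info("Reserved but unavailable: %llu pages\n", pgcnt);
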
[PATCH v10 03/10] sparc64: simplify vmemmap_populate

2017-10-05 Thread Pavel Tatashin
Remove duplicating code by using common functions
vmemmap_pud_populate and vmemmap_pgd_populate.

Signed-off-by: Pavel Tatashin <pasha.tatas...@oracle.com>
Reviewed-by: Steven Sistare <steven.sist...@oracle.com>
Reviewed-by: Daniel Jordan <daniel.m.jor...@oracle.com>
Reviewed-by: Bob Picco <bob.pi...@oracle.com>
Acked-by: David S. Miller <da...@davemloft.net>
Acked-by: Michal Hocko <mho...@suse.com>
---
 arch/sparc/mm/init_64.c | 23 ++-
 1 file changed, 6 insertions(+), 17 deletions(-)

diff --git a/arch/sparc/mm/init_64.c b/arch/sparc/mm/init_64.c
index caed495544e9..6839db3ffe1d 100644
--- a/arch/sparc/mm/init_64.c
+++ b/arch/sparc/mm/init_64.c
@@ -2652,30 +2652,19 @@ int __meminit vmemmap_populate(unsigned long vstart, 
unsigned long vend,
vstart = vstart & PMD_MASK;
vend = ALIGN(vend, PMD_SIZE);
for (; vstart < vend; vstart += PMD_SIZE) {
-   pgd_t *pgd = pgd_offset_k(vstart);
+   pgd_t *pgd = vmemmap_pgd_populate(vstart, node);
unsigned long pte;
pud_t *pud;
pmd_t *pmd;
 
-   if (pgd_none(*pgd)) {
-   pud_t *new = vmemmap_alloc_block(PAGE_SIZE, node);
+   if (!pgd)
+   return -ENOMEM;
 
-   if (!new)
-   return -ENOMEM;
-   pgd_populate(&init_mm, pgd, new);
-   }
-
-   pud = pud_offset(pgd, vstart);
-   if (pud_none(*pud)) {
-   pmd_t *new = vmemmap_alloc_block(PAGE_SIZE, node);
-
-   if (!new)
-   return -ENOMEM;
-   pud_populate(&init_mm, pud, new);
-   }
+   pud = vmemmap_pud_populate(pgd, vstart, node);
+   if (!pud)
+   return -ENOMEM;
 
pmd = pmd_offset(pud, vstart);
-
pte = pmd_val(*pmd);
if (!(pte & _PAGE_VALID)) {
void *block = vmemmap_alloc_block(PMD_SIZE, node);
-- 
2.14.2



[PATCH v10 00/10] complete deferred page initialization

2017-10-05 Thread Pavel Tatashin
Changelog:
v10 - v9
- Addressed new comments from Michal Hocko.
- Sent "mm: deferred_init_memmap improvements" as a separate patch as
  it is also fixing existing problem.
- Merged "mm: stop zeroing memory during allocation in vmemmap" with
  "mm: zero struct pages during initialization".
- Added more comments "mm: zero reserved and unavailable struct pages"

v9 - v8
- Addressed comments raised by Mark Rutland and Ard Biesheuvel: changed
  kasan implementation. Added a new function: kasan_map_populate() that
  zeroes the allocated and mapped memory

v8 - v7
- Added Acked-by's from Dave Miller for SPARC changes
- Fixed a minor compiling issue on tile architecture reported by kbuild

v7 - v6
- Addressed comments from Michal Hocko
- memblock_discard() patch was removed from this series and integrated
  separately
- Fixed bug reported by kbuild test robot new patch:
  mm: zero reserved and unavailable struct pages
- Removed patch
  x86/mm: reserve only exiting low pages
  As, it is not needed anymore, because of the previous fix 
- Rewrote deferred_init_memmap(); found and fixed an existing bug where
  the page variable is not reset when zone holes are present.
- Merged several patches together per Michal request
- Added performance data including raw logs

v6 - v5
- Fixed ARM64 + kasan code, as reported by Ard Biesheuvel
- Tested ARM64 code in qemu and found few more issues, that I fixed in this
  iteration
- Added page roundup/rounddown to x86 and arm zeroing routines to zero the
  whole allocated range, instead of only provided address range.
- Addressed SPARC related comment from Sam Ravnborg
- Fixed section mismatch warnings related to memblock_discard().

v5 - v4
- Fixed build issues reported by kbuild on various configurations
v4 - v3
- Rewrote code to zero struct pages in __init_single_page() as
  suggested by Michal Hocko
- Added code to handle issues related to accessing struct page
  memory before they are initialized.

v3 - v2
- Addressed David Miller comments about one change per patch:
* Split changes to platforms into 4 patches
* Made "do not zero vmemmap_buf" as a separate patch

v2 - v1
- Per request, added s390 to deferred "struct page" zeroing
- Collected performance data on x86 which proves the importance of
  keeping memset() as a prefetch (see below).

SMP machines can benefit from the DEFERRED_STRUCT_PAGE_INIT config option,
which defers initializing struct pages until all cpus have been started so
it can be done in parallel.

However, this feature is sub-optimal, because the deferred page
initialization code expects that the struct pages have already been zeroed,
and the zeroing is done early in boot with a single thread only.  Also, we
access that memory and set flags before struct pages are initialized. All
of this is fixed in this patchset.

In this work we do the following:
- Never read-access a struct page until it is initialized
- Never set any fields in struct pages before they are initialized
- Zero struct page at the beginning of struct page initialization

==
Performance improvements on x86 machine with 8 nodes:
Intel(R) Xeon(R) CPU E7-8895 v3 @ 2.60GHz and 1T of memory:
TIME  SPEED UP
base no deferred:   95.796233s
fix no deferred:79.978956s19.77%

base deferred:  77.254713s
fix deferred:   55.050509s40.34%
==
SPARC M6 3600 MHz with 15T of memory
TIME  SPEED UP
base no deferred:   358.335727s
fix no deferred:302.320936s   18.52%

base deferred:  237.534603s
fix deferred:   182.103003s   30.44%
==
Raw dmesg output with timestamps:
x86 base no deferred:https://hastebin.com/ofunepurit.scala
x86 base deferred:   https://hastebin.com/ifazegeyas.scala
x86 fix no deferred: https://hastebin.com/pegocohevo.scala
x86 fix deferred:https://hastebin.com/ofupevikuk.scala
sparc base no deferred:  https://hastebin.com/ibobeteken.go
sparc base deferred: https://hastebin.com/fariqimiyu.go
sparc fix no deferred:   https://hastebin.com/muhegoheyi.go
sparc fix deferred:  https://hastebin.com/xadinobutu.go

Pavel Tatashin (10):
  x86/mm: setting fields in deferred pages
  sparc64/mm: setting fields in deferred pages
  sparc64: simplify vmemmap_populate
  mm: defining memblock_virt_alloc_try_nid_raw
  mm: zero reserved and unavailable struct pages
  mm/kasan: kasan specific map populate function
  x86/kasan: use kasan_map_populate()
  arm64/kasan: use kasan_map_populate()
  mm: stop zeroing memory during allocation in vmemmap
  sparc64: optimized struct page zeroing

 arch/arm64/include/asm/pgtable.h|  3 ++
 arch/arm64/mm/kasan_init.c  | 12 +++
 arch/sparc/in

[PATCH v10 01/10] x86/mm: setting fields in deferred pages

2017-10-05 Thread Pavel Tatashin
Without the deferred struct page feature (CONFIG_DEFERRED_STRUCT_PAGE_INIT),
flags and other fields in struct pages are never changed prior to first
initializing the struct pages by going through __init_single_page().

With deferred struct page feature enabled, however, we set fields in
register_page_bootmem_info that are subsequently clobbered right after in
free_all_bootmem:

mem_init() {
register_page_bootmem_info();
free_all_bootmem();
...
}

When register_page_bootmem_info() is called only non-deferred struct pages
are initialized. But, this function goes through some reserved pages which
might be part of the deferred, and thus are not yet initialized.

  mem_init
   register_page_bootmem_info
register_page_bootmem_info_node
 get_page_bootmem
  .. setting fields here ..
  such as: page->freelist = (void *)type;

  free_all_bootmem()
   free_low_memory_core_early()
for_each_reserved_mem_region()
 reserve_bootmem_region()
  init_reserved_page() <- Only if this is deferred reserved page
   __init_single_pfn()
__init_single_page()
memset(0) <-- Lose the set fields here

We end up with an issue where we currently do not observe a problem because
the memory is explicitly zeroed. But if flag asserts are changed, we can
start hitting issues.

Also, because in this patch series we will stop zeroing struct page memory
during allocation, we must make sure that struct pages are properly
initialized prior to using them.

The deferred-reserved pages are initialized in free_all_bootmem().
Therefore, the fix is to switch the above calls.

Signed-off-by: Pavel Tatashin <pasha.tatas...@oracle.com>
Reviewed-by: Steven Sistare <steven.sist...@oracle.com>
Reviewed-by: Daniel Jordan <daniel.m.jor...@oracle.com>
Reviewed-by: Bob Picco <bob.pi...@oracle.com>
Acked-by: Michal Hocko <mho...@suse.com>
---
 arch/x86/mm/init_64.c | 10 --
 1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index 5ea1c3c2636e..8822523fdcd7 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -1182,12 +1182,18 @@ void __init mem_init(void)
 
/* clear_bss() already clear the empty_zero_page */
 
-   register_page_bootmem_info();
-
/* this will put all memory onto the freelists */
free_all_bootmem();
after_bootmem = 1;
 
+   /*
+* Must be done after boot memory is put on freelist, because here we
+* might set fields in deferred struct pages that have not yet been
+* initialized, and free_all_bootmem() initializes all the reserved
+* deferred pages for us.
+*/
+   register_page_bootmem_info();
+
/* Register memory areas for /proc/kcore */
kclist_add(&kcore_vsyscall, (void *)VSYSCALL_ADDR,
 PAGE_SIZE, KCORE_OTHER);
-- 
2.14.2



[PATCH] mm: deferred_init_memmap improvements

2017-10-04 Thread Pavel Tatashin
deferred_init_memmap() is called when struct pages are initialized later
in boot by slave CPUs. This patch simplifies and optimizes this function,
and also fixes a couple of issues (described below).

The main change is that now we are iterating through free memblock areas
instead of all configured memory. Thus, we do not have to check if the
struct page has already been initialized.

=
In deferred_init_memmap() where all deferred struct pages are initialized
we have a check like this:

if (page->flags) {
VM_BUG_ON(page_zone(page) != zone);
goto free_range;
}

This way we are checking if the current deferred page has already been
initialized. It works because memory for struct pages has been zeroed, and
the only way flags can be non-zero is if the page already went through
__init_single_page(). But once we change the current behavior and no longer
zero the memory in the memblock allocator, we cannot trust anything inside
struct pages until they are initialized. This patch fixes this.

The deferred_init_memmap() is re-written to loop through only free memory
ranges provided by memblock.

Note, this first issue is relevant only when the following change is merged:

mm: stop zeroing memory during allocation in vmemmap

=

This patch fixes another existing issue on systems that have holes in
zones, i.e. CONFIG_HOLES_IN_ZONE is defined.

In for_each_mem_pfn_range() we have code like this:

if (!pfn_valid_within(pfn))
goto free_range;

Note: 'page' is not set to NULL and is not incremented, but 'pfn' advances.
This means that if deferred struct pages are enabled on systems with this
kind of hole, Linux would get memory corruption. I have fixed this issue by
defining a new macro that performs all the necessary operations when we
free the current set of pages.

Signed-off-by: Pavel Tatashin <pasha.tatas...@oracle.com>
Reviewed-by: Steven Sistare <steven.sist...@oracle.com>
Reviewed-by: Daniel Jordan <daniel.m.jor...@oracle.com>
Reviewed-by: Bob Picco <bob.pi...@oracle.com>
---
 mm/page_alloc.c | 168 
 1 file changed, 85 insertions(+), 83 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index c841af88836a..dcfd657cfd4e 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1410,14 +1410,17 @@ void clear_zone_contiguous(struct zone *zone)
 }
 
 #ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT
-static void __init deferred_free_range(struct page *page,
-   unsigned long pfn, int nr_pages)
+static void __init deferred_free_range(unsigned long pfn,
+  unsigned long nr_pages)
 {
-   int i;
+   struct page *page;
+   unsigned long i;
 
-   if (!page)
+   if (!nr_pages)
return;
 
+   page = pfn_to_page(pfn);
+
/* Free a large naturally-aligned chunk if possible */
if (nr_pages == pageblock_nr_pages &&
(pfn & (pageblock_nr_pages - 1)) == 0) {
@@ -1443,19 +1446,89 @@ static inline void __init 
pgdat_init_report_one_done(void)
complete(&pgdat_init_all_done_comp);
 }
 
+/*
+ * Helper for deferred_init_range, free the given range, reset the counters, 
and
+ * return number of pages freed.
+ */
+static inline unsigned long __def_free(unsigned long *nr_free,
+  unsigned long *free_base_pfn,
+  struct page **page)
+{
+   unsigned long nr = *nr_free;
+
+   deferred_free_range(*free_base_pfn, nr);
+   *free_base_pfn = 0;
+   *nr_free = 0;
+   *page = NULL;
+
+   return nr;
+}
+
+static unsigned long deferred_init_range(int nid, int zid, unsigned long pfn,
+unsigned long end_pfn)
+{
+   struct mminit_pfnnid_cache nid_init_state = { };
+   unsigned long nr_pgmask = pageblock_nr_pages - 1;
+   unsigned long free_base_pfn = 0;
+   unsigned long nr_pages = 0;
+   unsigned long nr_free = 0;
+   struct page *page = NULL;
+
+   for (; pfn < end_pfn; pfn++) {
+   /*
+* First we check if pfn is valid on architectures where it is
+* possible to have holes within pageblock_nr_pages. On systems
+* where it is not possible, this function is optimized out.
+*
+* Then, we check if a current large page is valid by only
+* checking the validity of the head pfn.
+*
+* meminit_pfn_in_nid is checked on systems where pfns can
+* interleave within a node: a pfn is between start and end
+* of a node, but does not belong to this memory node.
+*
+* Finally, we minimize pfn page lookups and scheduler checks by
+* performing it only once every pageblock_nr_pages.
+*/
+   if (!pfn_v

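For orientation, a sketch of how deferred_init_range() is meant to be driven
from deferred_init_memmap(): walk the free memblock ranges of the node and
clamp each range to the span that is still deferred (illustrative only; apart
from for_each_free_mem_range() and deferred_init_range(), the variable names
are placeholders):

        unsigned long spfn, epfn, nr_pages = 0;
        phys_addr_t spa, epa;
        u64 i;

        for_each_free_mem_range(i, nid, MEMBLOCK_NONE, &spa, &epa, NULL) {
                spfn = max_t(unsigned long, PFN_UP(spa), first_init_pfn);
                epfn = min_t(unsigned long, PFN_DOWN(epa), zone_end_pfn(zone));
                if (spfn < epfn)
                        nr_pages += deferred_init_range(nid, zid, spfn, epfn);
        }
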
[PATCH v9 06/12] mm: zero struct pages during initialization

2017-09-20 Thread Pavel Tatashin
Add struct page zeroing as a part of initialization of other fields in
__init_single_page().

This single-thread performance was collected on: Intel(R) Xeon(R) CPU E7-8895
v3 @ 2.60GHz with 1T of memory (268400646 pages in 8 nodes):

                    BASE            FIX
sparse_init 11.244671836s   0.007199623s
zone_sizes_init  4.879775891s   8.355182299s
  --
Total   16.124447727s   8.362381922s

sparse_init is where memory for struct pages is zeroed, and the zeroing
part is moved later in this patch into __init_single_page(), which is
called from zone_sizes_init().

Signed-off-by: Pavel Tatashin <pasha.tatas...@oracle.com>
Reviewed-by: Steven Sistare <steven.sist...@oracle.com>
Reviewed-by: Daniel Jordan <daniel.m.jor...@oracle.com>
Reviewed-by: Bob Picco <bob.pi...@oracle.com>
Acked-by: Michal Hocko <mho...@suse.com>
---
 include/linux/mm.h | 9 +
 mm/page_alloc.c| 1 +
 2 files changed, 10 insertions(+)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index f8c10d336e42..50b74d628243 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -94,6 +94,15 @@ extern int mmap_rnd_compat_bits __read_mostly;
 #define mm_forbids_zeropage(X) (0)
 #endif
 
+/*
+ * On some architectures it is expensive to call memset() for small sizes.
+ * Those architectures should provide their own implementation of "struct page"
+ * zeroing by defining this macro in .
+ */
+#ifndef mm_zero_struct_page
+#define mm_zero_struct_page(pp)  ((void)memset((pp), 0, sizeof(struct page)))
+#endif
+
 /*
  * Default maximum number of active map areas, this limits the number of vmas
  * per mm struct. Users can overwrite this number by sysctl but there is a
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index a8dbd405ed94..4b630ee91430 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1170,6 +1170,7 @@ static void free_one_page(struct zone *zone,
 static void __meminit __init_single_page(struct page *page, unsigned long pfn,
unsigned long zone, int nid)
 {
+   mm_zero_struct_page(page);
set_page_links(page, zone, nid, pfn);
init_page_count(page);
page_mapcount_reset(page);
-- 
2.14.1



[PATCH v9 11/12] arm64/kasan: use kasan_map_populate()

2017-09-20 Thread Pavel Tatashin
To optimize the performance of struct page initialization,
vmemmap_populate() will no longer zero memory.

Therefore, we must use a new interface to allocate and map kasan shadow
memory that also zeroes the memory for us.

Signed-off-by: Pavel Tatashin <pasha.tatas...@oracle.com>
---
 arch/arm64/mm/kasan_init.c | 12 ++--
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/arch/arm64/mm/kasan_init.c b/arch/arm64/mm/kasan_init.c
index 81f03959a4ab..b6e92cfa3ea3 100644
--- a/arch/arm64/mm/kasan_init.c
+++ b/arch/arm64/mm/kasan_init.c
@@ -161,11 +161,11 @@ void __init kasan_init(void)
 
clear_pgds(KASAN_SHADOW_START, KASAN_SHADOW_END);
 
-   vmemmap_populate(kimg_shadow_start, kimg_shadow_end,
-pfn_to_nid(virt_to_pfn(lm_alias(_text))));
+   kasan_map_populate(kimg_shadow_start, kimg_shadow_end,
+  pfn_to_nid(virt_to_pfn(lm_alias(_text))));
 
/*
-* vmemmap_populate() has populated the shadow region that covers the
+* kasan_map_populate() has populated the shadow region that covers the
 * kernel image with SWAPPER_BLOCK_SIZE mappings, so we have to round
 * the start and end addresses to SWAPPER_BLOCK_SIZE as well, to prevent
 * kasan_populate_zero_shadow() from replacing the page table entries
@@ -191,9 +191,9 @@ void __init kasan_init(void)
if (start >= end)
break;
 
-   vmemmap_populate((unsigned long)kasan_mem_to_shadow(start),
-   (unsigned long)kasan_mem_to_shadow(end),
-   pfn_to_nid(virt_to_pfn(start)));
+   kasan_map_populate((unsigned long)kasan_mem_to_shadow(start),
+  (unsigned long)kasan_mem_to_shadow(end),
+  pfn_to_nid(virt_to_pfn(start)));
}
 
/*
-- 
2.14.1



[PATCH v9 12/12] mm: stop zeroing memory during allocation in vmemmap

2017-09-20 Thread Pavel Tatashin
vmemmap_alloc_block() will no longer zero the block, so zero memory
at its call sites for everything except struct pages.  Struct page memory
is zero'd by struct page initialization.

Replace allocators in sparse-vmemmap with the non-zeroing version, so
we get the performance improvement of zeroing the memory in parallel
when struct pages are zeroed.

Signed-off-by: Pavel Tatashin <pasha.tatas...@oracle.com>
Reviewed-by: Steven Sistare <steven.sist...@oracle.com>
Reviewed-by: Daniel Jordan <daniel.m.jor...@oracle.com>
Reviewed-by: Bob Picco <bob.pi...@oracle.com>
---
 include/linux/mm.h  | 11 +++
 mm/sparse-vmemmap.c | 15 +++
 mm/sparse.c |  6 +++---
 3 files changed, 21 insertions(+), 11 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index a7bba4ce79ba..25848764570f 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2501,6 +2501,17 @@ static inline void *vmemmap_alloc_block_buf(unsigned 
long size, int node)
return __vmemmap_alloc_block_buf(size, node, NULL);
 }
 
+static inline void *vmemmap_alloc_block_zero(unsigned long size, int node)
+{
+   void *p = vmemmap_alloc_block(size, node);
+
+   if (!p)
+   return NULL;
+   memset(p, 0, size);
+
+   return p;
+}
+
 void vmemmap_verify(pte_t *, int, unsigned long, unsigned long);
 int vmemmap_populate_basepages(unsigned long start, unsigned long end,
   int node);
diff --git a/mm/sparse-vmemmap.c b/mm/sparse-vmemmap.c
index d1a39b8051e0..c2f5654e7c9d 100644
--- a/mm/sparse-vmemmap.c
+++ b/mm/sparse-vmemmap.c
@@ -41,7 +41,7 @@ static void * __ref __earlyonly_bootmem_alloc(int node,
unsigned long align,
unsigned long goal)
 {
-   return memblock_virt_alloc_try_nid(size, align, goal,
+   return memblock_virt_alloc_try_nid_raw(size, align, goal,
BOOTMEM_ALLOC_ACCESSIBLE, node);
 }
 
@@ -54,9 +54,8 @@ void * __meminit vmemmap_alloc_block(unsigned long size, int 
node)
if (slab_is_available()) {
struct page *page;
 
-   page = alloc_pages_node(node,
-   GFP_KERNEL | __GFP_ZERO | __GFP_RETRY_MAYFAIL,
-   get_order(size));
+   page = alloc_pages_node(node, GFP_KERNEL | __GFP_RETRY_MAYFAIL,
+   get_order(size));
if (page)
return page_address(page);
return NULL;
@@ -183,7 +182,7 @@ pmd_t * __meminit vmemmap_pmd_populate(pud_t *pud, unsigned 
long addr, int node)
 {
pmd_t *pmd = pmd_offset(pud, addr);
if (pmd_none(*pmd)) {
-   void *p = vmemmap_alloc_block(PAGE_SIZE, node);
+   void *p = vmemmap_alloc_block_zero(PAGE_SIZE, node);
if (!p)
return NULL;
pmd_populate_kernel(&init_mm, pmd, p);
@@ -195,7 +194,7 @@ pud_t * __meminit vmemmap_pud_populate(p4d_t *p4d, unsigned 
long addr, int node)
 {
pud_t *pud = pud_offset(p4d, addr);
if (pud_none(*pud)) {
-   void *p = vmemmap_alloc_block(PAGE_SIZE, node);
+   void *p = vmemmap_alloc_block_zero(PAGE_SIZE, node);
if (!p)
return NULL;
pud_populate(&init_mm, pud, p);
@@ -207,7 +206,7 @@ p4d_t * __meminit vmemmap_p4d_populate(pgd_t *pgd, unsigned 
long addr, int node)
 {
p4d_t *p4d = p4d_offset(pgd, addr);
if (p4d_none(*p4d)) {
-   void *p = vmemmap_alloc_block(PAGE_SIZE, node);
+   void *p = vmemmap_alloc_block_zero(PAGE_SIZE, node);
if (!p)
return NULL;
p4d_populate(&init_mm, p4d, p);
@@ -219,7 +218,7 @@ pgd_t * __meminit vmemmap_pgd_populate(unsigned long addr, 
int node)
 {
pgd_t *pgd = pgd_offset_k(addr);
if (pgd_none(*pgd)) {
-   void *p = vmemmap_alloc_block(PAGE_SIZE, node);
+   void *p = vmemmap_alloc_block_zero(PAGE_SIZE, node);
if (!p)
return NULL;
pgd_populate(&init_mm, pgd, p);
diff --git a/mm/sparse.c b/mm/sparse.c
index 83b3bf6461af..d22f51bb7c79 100644
--- a/mm/sparse.c
+++ b/mm/sparse.c
@@ -437,9 +437,9 @@ void __init sparse_mem_maps_populate_node(struct page 
**map_map,
}
 
size = PAGE_ALIGN(size);
-   map = memblock_virt_alloc_try_nid(size * map_count,
- PAGE_SIZE, __pa(MAX_DMA_ADDRESS),
- BOOTMEM_ALLOC_ACCESSIBLE, nodeid);
+   map = memblock_virt_alloc_try_nid_raw(size * map_count,
+ PAGE_SIZE, __pa(MAX_DMA_ADDRESS),
+ BOOTMEM_ALLOC_ACCESSIBLE, nodeid);
if (map) {
for (pnum = pnum_begi

[PATCH v9 05/12] mm: defining memblock_virt_alloc_try_nid_raw

2017-09-20 Thread Pavel Tatashin
* A new variant of memblock_virt_alloc_* allocations:
memblock_virt_alloc_try_nid_raw()
- Does not zero the allocated memory
- Does not panic if request cannot be satisfied

* optimize early system hash allocations

Clients can call alloc_large_system_hash() with the HASH_ZERO flag to specify
that the memory allocated for the system hash needs to be zeroed; otherwise
the memory does not need to be zeroed, and the client will initialize it.

If memory does not need to be zero'd, call the new
memblock_virt_alloc_raw() interface, and thus improve the boot performance.

* debug for raw allocator

When CONFIG_DEBUG_VM is enabled, this patch sets all the memory that is
returned by memblock_virt_alloc_try_nid_raw() to ones to ensure that no
places expect zeroed memory.

Signed-off-by: Pavel Tatashin <pasha.tatas...@oracle.com>
Reviewed-by: Steven Sistare <steven.sist...@oracle.com>
Reviewed-by: Daniel Jordan <daniel.m.jor...@oracle.com>
Reviewed-by: Bob Picco <bob.pi...@oracle.com>
Acked-by: Michal Hocko <mho...@suse.com>
---
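A minimal caller-side sketch of how the two variants are meant to be used;
struct example_entry, the size 1024 and example_table_init() are hypothetical,
only memblock_virt_alloc_raw()/memblock_virt_alloc() come from this patch, and
alloc_large_system_hash() with HASH_ZERO is the existing hash path mentioned
above:

	struct example_entry { unsigned long key, val; };

	static void __init example_table_init(void)
	{
		struct example_entry *tbl;
		unsigned long *bitmap;
		int i;

		/* The caller writes every entry itself: skip the memset. */
		tbl = memblock_virt_alloc_raw(1024 * sizeof(*tbl),
					      SMP_CACHE_BYTES);
		for (i = 0; i < 1024; i++)
			tbl[i].key = tbl[i].val = i;

		/* The caller relies on zero-filled memory: keep the
		 * zeroing variant.
		 */
		bitmap = memblock_virt_alloc(BITS_TO_LONGS(1024) * sizeof(long),
					     SMP_CACHE_BYTES);
		__set_bit(0, bitmap);
	}
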
 include/linux/bootmem.h | 27 ++
 mm/memblock.c   | 60 +++--
 mm/page_alloc.c | 15 ++---
 3 files changed, 87 insertions(+), 15 deletions(-)

diff --git a/include/linux/bootmem.h b/include/linux/bootmem.h
index e223d91b6439..ea30b3987282 100644
--- a/include/linux/bootmem.h
+++ b/include/linux/bootmem.h
@@ -160,6 +160,9 @@ extern void *__alloc_bootmem_low_node(pg_data_t *pgdat,
 #define BOOTMEM_ALLOC_ANYWHERE (~(phys_addr_t)0)
 
 /* FIXME: Move to memblock.h at a point where we remove nobootmem.c */
+void *memblock_virt_alloc_try_nid_raw(phys_addr_t size, phys_addr_t align,
+ phys_addr_t min_addr,
+ phys_addr_t max_addr, int nid);
 void *memblock_virt_alloc_try_nid_nopanic(phys_addr_t size,
phys_addr_t align, phys_addr_t min_addr,
phys_addr_t max_addr, int nid);
@@ -176,6 +179,14 @@ static inline void * __init memblock_virt_alloc(
NUMA_NO_NODE);
 }
 
+static inline void * __init memblock_virt_alloc_raw(
+   phys_addr_t size,  phys_addr_t align)
+{
+   return memblock_virt_alloc_try_nid_raw(size, align, BOOTMEM_LOW_LIMIT,
+   BOOTMEM_ALLOC_ACCESSIBLE,
+   NUMA_NO_NODE);
+}
+
 static inline void * __init memblock_virt_alloc_nopanic(
phys_addr_t size, phys_addr_t align)
 {
@@ -257,6 +268,14 @@ static inline void * __init memblock_virt_alloc(
return __alloc_bootmem(size, align, BOOTMEM_LOW_LIMIT);
 }
 
+static inline void * __init memblock_virt_alloc_raw(
+   phys_addr_t size,  phys_addr_t align)
+{
+   if (!align)
+   align = SMP_CACHE_BYTES;
+   return __alloc_bootmem_nopanic(size, align, BOOTMEM_LOW_LIMIT);
+}
+
 static inline void * __init memblock_virt_alloc_nopanic(
phys_addr_t size, phys_addr_t align)
 {
@@ -309,6 +328,14 @@ static inline void * __init 
memblock_virt_alloc_try_nid(phys_addr_t size,
  min_addr);
 }
 
+static inline void * __init memblock_virt_alloc_try_nid_raw(
+   phys_addr_t size, phys_addr_t align,
+   phys_addr_t min_addr, phys_addr_t max_addr, int nid)
+{
+   return ___alloc_bootmem_node_nopanic(NODE_DATA(nid), size, align,
+   min_addr, max_addr);
+}
+
 static inline void * __init memblock_virt_alloc_try_nid_nopanic(
phys_addr_t size, phys_addr_t align,
phys_addr_t min_addr, phys_addr_t max_addr, int nid)
diff --git a/mm/memblock.c b/mm/memblock.c
index 91205780e6b1..1f299fb1eb08 100644
--- a/mm/memblock.c
+++ b/mm/memblock.c
@@ -1327,7 +1327,6 @@ static void * __init memblock_virt_alloc_internal(
return NULL;
 done:
ptr = phys_to_virt(alloc);
-   memset(ptr, 0, size);
 
/*
 * The min_count is set to 0 so that bootmem allocated blocks
@@ -1340,6 +1339,45 @@ static void * __init memblock_virt_alloc_internal(
return ptr;
 }
 
+/**
+ * memblock_virt_alloc_try_nid_raw - allocate boot memory block without zeroing
+ * memory and without panicking
+ * @size: size of memory block to be allocated in bytes
+ * @align: alignment of the region and block's size
+ * @min_addr: the lower bound of the memory region from where the allocation
+ *   is preferred (phys address)
+ * @max_addr: the upper bound of the memory region from where the allocation
+ *   is preferred (phys address), or %BOOTMEM_ALLOC_ACCESSIBLE to
+ *   allocate only from memory limited by memblock.current_limit value
+ * @nid: nid of the free area to fi

[PATCH v9 08/12] mm: zero reserved and unavailable struct pages

2017-09-20 Thread Pavel Tatashin
Some memory is reserved but unavailable: not present in memblock.memory
(because not backed by physical pages), but present in memblock.reserved.
Such memory has backing struct pages, but they are not initialized by going
through __init_single_page().

In some cases these struct pages are accessed even if they do not contain
any data. One example is that page_to_pfn() might access page->flags if this is
where section information is stored (CONFIG_SPARSEMEM,
SECTION_IN_PAGE_FLAGS).

Since, struct pages are zeroed in __init_single_page(), and not during
allocation time, we must zero such struct pages explicitly.

The patch involves adding a new memblock iterator:
for_each_resv_unavail_range(i, p_start, p_end)

which iterates through the reserved && !memory ranges; we zero such struct
pages explicitly by calling mm_zero_struct_page().

Signed-off-by: Pavel Tatashin <pasha.tatas...@oracle.com>
Reviewed-by: Steven Sistare <steven.sist...@oracle.com>
Reviewed-by: Daniel Jordan <daniel.m.jor...@oracle.com>
Reviewed-by: Bob Picco <bob.pi...@oracle.com>
---
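For context, the SECTION_IN_PAGE_FLAGS case mentioned above looks roughly like
this in the existing headers (a sketch of include/linux/mm.h and
include/asm-generic/memory_model.h, not new code in this patch), which is why
even a never-used reserved page must have sane flags:

	static inline unsigned long page_to_section(const struct page *page)
	{
		return (page->flags >> SECTIONS_PGSHIFT) & SECTIONS_MASK;
	}

	#define __page_to_pfn(pg)					\
	({	const struct page *__pg = (pg);				\
		int __sec = page_to_section(__pg);			\
		(unsigned long)(__pg -					\
			__section_mem_map_addr(__nr_to_section(__sec)));\
	})
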
 include/linux/memblock.h | 16 
 include/linux/mm.h   |  6 ++
 mm/page_alloc.c  | 30 ++
 3 files changed, 52 insertions(+)

diff --git a/include/linux/memblock.h b/include/linux/memblock.h
index bae11c7e7bf3..bdd4268f9323 100644
--- a/include/linux/memblock.h
+++ b/include/linux/memblock.h
@@ -237,6 +237,22 @@ unsigned long memblock_next_valid_pfn(unsigned long pfn, 
unsigned long max_pfn);
	for_each_mem_range_rev(i, &memblock.memory, &memblock.reserved,	\
   nid, flags, p_start, p_end, p_nid)
 
+/**
+ * for_each_resv_unavail_range - iterate through reserved and unavailable 
memory
+ * @i: u64 used as loop variable
+ * @flags: pick from blocks based on memory attributes
+ * @p_start: ptr to phys_addr_t for start address of the range, can be %NULL
+ * @p_end: ptr to phys_addr_t for end address of the range, can be %NULL
+ *
+ * Walks over unavailable but reserved (reserved && !memory) areas of 
memblock.
+ * Available as soon as memblock is initialized.
+ * Note: because this memory does not belong to any physical node, flags and
+ * nid arguments do not make sense and thus not exported as arguments.
+ */
+#define for_each_resv_unavail_range(i, p_start, p_end) \
+   for_each_mem_range(i, &memblock.reserved, &memblock.memory, \
+  NUMA_NO_NODE, MEMBLOCK_NONE, p_start, p_end, NULL)
+
 static inline void memblock_set_region_flags(struct memblock_region *r,
 unsigned long flags)
 {
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 50b74d628243..a7bba4ce79ba 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2010,6 +2010,12 @@ extern int __meminit __early_pfn_to_nid(unsigned long 
pfn,
struct mminit_pfnnid_cache *state);
 #endif
 
+#ifdef CONFIG_HAVE_MEMBLOCK
+void zero_resv_unavail(void);
+#else
+static inline void zero_resv_unavail(void) {}
+#endif
+
 extern void set_dma_reserve(unsigned long new_dma_reserve);
 extern void memmap_init_zone(unsigned long, int, unsigned long,
unsigned long, enum memmap_context);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 4b630ee91430..1d38d391dffd 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -6202,6 +6202,34 @@ void __paginginit free_area_init_node(int nid, unsigned 
long *zones_size,
free_area_init_core(pgdat);
 }
 
+#ifdef CONFIG_HAVE_MEMBLOCK
+/*
+ * Only struct pages that are backed by physical memory are zeroed and
+ * initialized by going through __init_single_page(). But, there are some
+ * struct pages which are reserved in memblock allocator and their fields
+ * may be accessed (for example page_to_pfn() on some configuration accesses
+ * flags). We must explicitly zero those struct pages.
+ */
+void __paginginit zero_resv_unavail(void)
+{
+   phys_addr_t start, end;
+   unsigned long pfn;
+   u64 i, pgcnt;
+
+   /* Loop through ranges that are reserved, but do not have reported
+* physical memory backing.
+*/
+   pgcnt = 0;
+   for_each_resv_unavail_range(i, &start, &end) {
+   for (pfn = PFN_DOWN(start); pfn < PFN_UP(end); pfn++) {
+   mm_zero_struct_page(pfn_to_page(pfn));
+   pgcnt++;
+   }
+   }
+   pr_info("Reserved but unavailable: %lld pages", pgcnt);
+}
+#endif /* CONFIG_HAVE_MEMBLOCK */
+
 #ifdef CONFIG_HAVE_MEMBLOCK_NODE_MAP
 
 #if MAX_NUMNODES > 1
@@ -6625,6 +6653,7 @@ void __init free_area_init_nodes(unsigned long 
*max_zone_pfn)
node_set_state(nid, N_MEMORY);
check_for_memory(pgdat, nid);
}
+   zero_resv_unavail();
 }
 
 static int __init cmdline_parse_core(char *p, unsigned long *core)
@@ -6788,6 +6817,7 @@ void __init free_area_init(unsigned lon

[PATCH v9 03/12] mm: deferred_init_memmap improvements

2017-09-20 Thread Pavel Tatashin
This patch fixes two issues in deferred_init_memmap

=
In deferred_init_memmap() where all deferred struct pages are initialized
we have a check like this:

if (page->flags) {
VM_BUG_ON(page_zone(page) != zone);
goto free_range;
}

This way we are checking if the current deferred page has already been
initialized. It works because memory for struct pages has been zeroed, and
the only way flags can be non-zero is if the page already went through
__init_single_page().  But once we change the current behavior and no longer
zero the memory in the memblock allocator, we cannot trust anything inside
"struct page"es until they are initialized. This patch fixes this.

The deferred_init_memmap() is re-written to loop through only free memory
ranges provided by memblock.

=
This patch fixes another existing issue on systems that have holes in
zones i.e CONFIG_HOLES_IN_ZONE is defined.

In for_each_mem_pfn_range() we have code like this:

if (!pfn_valid_within(pfn))
goto free_range;

Note: 'page' is not set to NULL and is not incremented, but 'pfn' advances.
This means that if deferred struct pages are enabled on systems with these
kinds of holes, Linux would get memory corruption. I have fixed this issue
by defining a new macro that performs all the necessary operations when we
free the current set of pages.

Signed-off-by: Pavel Tatashin <pasha.tatas...@oracle.com>
Reviewed-by: Steven Sistare <steven.sist...@oracle.com>
Reviewed-by: Daniel Jordan <daniel.m.jor...@oracle.com>
Reviewed-by: Bob Picco <bob.pi...@oracle.com>
---
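For context, the rework makes deferred_init_memmap() walk only the free
ranges reported by memblock and hand each range to deferred_init_range();
roughly along the lines of the sketch below (shape only, not the literal
hunk; nid, zid, zone and first_init_pfn come from the enclosing function):

	unsigned long spfn, epfn, nr_pages = 0;
	phys_addr_t spa, epa;
	u64 i;

	for_each_free_mem_range(i, nid, MEMBLOCK_NONE, &spa, &epa, NULL) {
		spfn = max_t(unsigned long, first_init_pfn, PFN_UP(spa));
		epfn = min_t(unsigned long, zone_end_pfn(zone), PFN_DOWN(epa));
		nr_pages += deferred_init_range(nid, zid, spfn, epfn);
	}
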
 mm/page_alloc.c | 161 +++-
 1 file changed, 78 insertions(+), 83 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index c841af88836a..d132c801d2c1 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1410,14 +1410,17 @@ void clear_zone_contiguous(struct zone *zone)
 }
 
 #ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT
-static void __init deferred_free_range(struct page *page,
-   unsigned long pfn, int nr_pages)
+static void __init deferred_free_range(unsigned long pfn,
+  unsigned long nr_pages)
 {
-   int i;
+   struct page *page;
+   unsigned long i;
 
-   if (!page)
+   if (!nr_pages)
return;
 
+   page = pfn_to_page(pfn);
+
/* Free a large naturally-aligned chunk if possible */
if (nr_pages == pageblock_nr_pages &&
(pfn & (pageblock_nr_pages - 1)) == 0) {
@@ -1443,19 +1446,82 @@ static inline void __init 
pgdat_init_report_one_done(void)
complete(&pgdat_init_all_done_comp);
 }
 
+#define DEFERRED_FREE(nr_free, free_base_pfn, page)\
+({ \
+   unsigned long nr = (nr_free);   \
+   \
+   deferred_free_range((free_base_pfn), (nr)); \
+   (free_base_pfn) = 0;\
+   (nr_free) = 0;  \
+   page = NULL;\
+   nr; \
+})
+
+static unsigned long deferred_init_range(int nid, int zid, unsigned long pfn,
+unsigned long end_pfn)
+{
+   struct mminit_pfnnid_cache nid_init_state = { };
+   unsigned long nr_pgmask = pageblock_nr_pages - 1;
+   unsigned long free_base_pfn = 0;
+   unsigned long nr_pages = 0;
+   unsigned long nr_free = 0;
+   struct page *page = NULL;
+
+   for (; pfn < end_pfn; pfn++) {
+   /*
+* First we check if pfn is valid on architectures where it is
+* possible to have holes within pageblock_nr_pages. On systems
+* where it is not possible, this function is optimized out.
+*
+* Then, we check if a current large page is valid by only
+* checking the validity of the head pfn.
+*
+* meminit_pfn_in_nid is checked on systems where pfns can
+* interleave within a node: a pfn is between start and end
+* of a node, but does not belong to this memory node.
+*
+* Finally, we minimize pfn page lookups and scheduler checks by
+* performing it only once every pageblock_nr_pages.
+*/
+   if (!pfn_valid_within(pfn)) {
+   nr_pages += DEFERRED_FREE(nr_free, free_base_pfn, page);
+   } else if (!(pfn & nr_pgmask) && !pfn_valid(pfn)) {
+   nr_pages += DEFERRED_FREE(nr_free, free_base_pfn, page);
+   } else if (!meminit_pfn_i

[PATCH v9 10/12] x86/kasan: use kasan_map_populate()

2017-09-20 Thread Pavel Tatashin
To optimize the performance of struct page initialization,
vmemmap_populate() will no longer zero memory.

Therefore, we must use a new interface to allocate and map kasan shadow
memory, that also zeroes memory for us.

Signed-off-by: Pavel Tatashin <pasha.tatas...@oracle.com>
---
 arch/x86/mm/kasan_init_64.c | 8 
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/arch/x86/mm/kasan_init_64.c b/arch/x86/mm/kasan_init_64.c
index bc84b73684b7..2db95efd208e 100644
--- a/arch/x86/mm/kasan_init_64.c
+++ b/arch/x86/mm/kasan_init_64.c
@@ -23,7 +23,7 @@ static int __init map_range(struct range *range)
start = (unsigned long)kasan_mem_to_shadow(pfn_to_kaddr(range->start));
end = (unsigned long)kasan_mem_to_shadow(pfn_to_kaddr(range->end));
 
-   return vmemmap_populate(start, end, NUMA_NO_NODE);
+   return kasan_map_populate(start, end, NUMA_NO_NODE);
 }
 
 static void __init clear_pgds(unsigned long start,
@@ -136,9 +136,9 @@ void __init kasan_init(void)
kasan_mem_to_shadow((void *)PAGE_OFFSET + MAXMEM),
kasan_mem_to_shadow((void *)__START_KERNEL_map));
 
-   vmemmap_populate((unsigned long)kasan_mem_to_shadow(_stext),
-   (unsigned long)kasan_mem_to_shadow(_end),
-   NUMA_NO_NODE);
+   kasan_map_populate((unsigned long)kasan_mem_to_shadow(_stext),
+  (unsigned long)kasan_mem_to_shadow(_end),
+  NUMA_NO_NODE);
 
kasan_populate_zero_shadow(kasan_mem_to_shadow((void *)MODULES_END),
(void *)KASAN_SHADOW_END);
-- 
2.14.1



[PATCH v9 09/12] mm/kasan: kasan specific map populate function

2017-09-20 Thread Pavel Tatashin
During early boot, kasan uses vmemmap_populate() to establish its shadow
memory. But that interface is intended for struct page use.

With this series, vmemmap memory will no longer be zeroed during allocation,
but kasan expects that memory to be zeroed. We are adding a new
kasan_map_populate() function to resolve this difference.

Signed-off-by: Pavel Tatashin <pasha.tatas...@oracle.com>
---
 arch/arm64/include/asm/pgtable.h |  3 ++
 include/linux/kasan.h|  2 ++
 mm/kasan/kasan_init.c| 67 
 3 files changed, 72 insertions(+)

diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index bc4e92337d16..d89713f04354 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -381,6 +381,9 @@ extern pgprot_t phys_mem_access_prot(struct file *file, 
unsigned long pfn,
 PUD_TYPE_TABLE)
 #endif
 
+#define pmd_large(pmd) pmd_sect(pmd)
+#define pud_large(pud) pud_sect(pud)
+
 static inline void set_pmd(pmd_t *pmdp, pmd_t pmd)
 {
*pmdp = pmd;
diff --git a/include/linux/kasan.h b/include/linux/kasan.h
index a5c7046f26b4..7e13df1722c2 100644
--- a/include/linux/kasan.h
+++ b/include/linux/kasan.h
@@ -78,6 +78,8 @@ size_t kasan_metadata_size(struct kmem_cache *cache);
 
 bool kasan_save_enable_multi_shot(void);
 void kasan_restore_multi_shot(bool enabled);
+int __meminit kasan_map_populate(unsigned long start, unsigned long end,
+int node);
 
 #else /* CONFIG_KASAN */
 
diff --git a/mm/kasan/kasan_init.c b/mm/kasan/kasan_init.c
index 554e4c0f23a2..57a973f05f63 100644
--- a/mm/kasan/kasan_init.c
+++ b/mm/kasan/kasan_init.c
@@ -197,3 +197,70 @@ void __init kasan_populate_zero_shadow(const void 
*shadow_start,
zero_p4d_populate(pgd, addr, next);
} while (pgd++, addr = next, addr != end);
 }
+
+/* Creates mappings for kasan during early boot. The mapped memory is zeroed */
+int __meminit kasan_map_populate(unsigned long start, unsigned long end,
+int node)
+{
+   unsigned long addr, pfn, next;
+   unsigned long long size;
+   pgd_t *pgd;
+   p4d_t *p4d;
+   pud_t *pud;
+   pmd_t *pmd;
+   pte_t *pte;
+   int ret;
+
+   ret = vmemmap_populate(start, end, node);
+   /*
+* We might have partially populated memory, so check for no entries,
+* and zero only those that actually exist.
+*/
+   for (addr = start; addr < end; addr = next) {
+   pgd = pgd_offset_k(addr);
+   if (pgd_none(*pgd)) {
+   next = pgd_addr_end(addr, end);
+   continue;
+   }
+
+   p4d = p4d_offset(pgd, addr);
+   if (p4d_none(*p4d)) {
+   next = p4d_addr_end(addr, end);
+   continue;
+   }
+
+   pud = pud_offset(p4d, addr);
+   if (pud_none(*pud)) {
+   next = pud_addr_end(addr, end);
+   continue;
+   }
+   if (pud_large(*pud)) {
+   /* This is PUD size page */
+   next = pud_addr_end(addr, end);
+   size = PUD_SIZE;
+   pfn = pud_pfn(*pud);
+   } else {
+   pmd = pmd_offset(pud, addr);
+   if (pmd_none(*pmd)) {
+   next = pmd_addr_end(addr, end);
+   continue;
+   }
+   if (pmd_large(*pmd)) {
+   /* This is PMD size page */
+   next = pmd_addr_end(addr, end);
+   size = PMD_SIZE;
+   pfn = pmd_pfn(*pmd);
+   } else {
+   pte = pte_offset_kernel(pmd, addr);
+   next = addr + PAGE_SIZE;
+   if (pte_none(*pte))
+   continue;
+   /* This is base size page */
+   size = PAGE_SIZE;
+   pfn = pte_pfn(*pte);
+   }
+   }
+   memset(phys_to_virt(PFN_PHYS(pfn)), 0, size);
+   }
+   return ret;
+}
-- 
2.14.1



[PATCH v9 07/12] sparc64: optimized struct page zeroing

2017-09-20 Thread Pavel Tatashin
Add an optimized mm_zero_struct_page(), so struct pages are zeroed without
calling memset(). We do eight to ten regular stores based on the size of
struct page. The compiler optimizes out the conditions of the switch() statement.

SPARC-M6 with 15T of memory, single thread performance:

   BASEFIX  OPTIMIZED_FIX
bootmem_init   28.440467985s   2.305674818s   2.305161615s
free_area_init_nodes  202.845901673s 225.343084508s 172.556506560s
  
Total 231.286369658s 227.648759326s 174.861668175s

BASE:  current linux
FIX:   This patch series without "optimized struct page zeroing"
OPTIMIZED_FIX: This patch series including the current patch.

bootmem_init() is where memory for struct pages is zeroed during
allocation. Note, about two seconds in this function is a fixed time: it
does not increase as memory is increased.

Signed-off-by: Pavel Tatashin <pasha.tatas...@oracle.com>
Reviewed-by: Steven Sistare <steven.sist...@oracle.com>
Reviewed-by: Daniel Jordan <daniel.m.jor...@oracle.com>
Reviewed-by: Bob Picco <bob.pi...@oracle.com>
Acked-by: David S. Miller <da...@davemloft.net>
---
 arch/sparc/include/asm/pgtable_64.h | 30 ++
 1 file changed, 30 insertions(+)

diff --git a/arch/sparc/include/asm/pgtable_64.h 
b/arch/sparc/include/asm/pgtable_64.h
index 4fefe3762083..8ed478abc630 100644
--- a/arch/sparc/include/asm/pgtable_64.h
+++ b/arch/sparc/include/asm/pgtable_64.h
@@ -230,6 +230,36 @@ extern unsigned long _PAGE_ALL_SZ_BITS;
 extern struct page *mem_map_zero;
 #define ZERO_PAGE(vaddr)   (mem_map_zero)
 
+/* This macro must be updated when the size of struct page grows above 80
+ * or reduces below 64.
+ * The idea that compiler optimizes out switch() statement, and only
+ * leaves clrx instructions
+ */
+#define mm_zero_struct_page(pp) do {
\
+   unsigned long *_pp = (void *)(pp);  \
+   \
+/* Check that struct page is either 64, 72, or 80 bytes */ \
+   BUILD_BUG_ON(sizeof(struct page) & 7);  \
+   BUILD_BUG_ON(sizeof(struct page) < 64); \
+   BUILD_BUG_ON(sizeof(struct page) > 80); \
+   \
+   switch (sizeof(struct page)) {  \
+   case 80:\
+   _pp[9] = 0; /* fallthrough */   \
+   case 72:\
+   _pp[8] = 0; /* fallthrough */   \
+   default:\
+   _pp[7] = 0; \
+   _pp[6] = 0; \
+   _pp[5] = 0; \
+   _pp[4] = 0; \
+   _pp[3] = 0; \
+   _pp[2] = 0; \
+   _pp[1] = 0; \
+   _pp[0] = 0; \
+   }   \
+} while (0)
+
 /* PFNs are real physical page numbers.  However, mem_map only begins to record
  * per-page information starting at pfn_base.  This is to handle systems where
  * the first physical page in the machine is at some huge physical address,
-- 
2.14.1



[PATCH v9 02/12] sparc64/mm: setting fields in deferred pages

2017-09-20 Thread Pavel Tatashin
Without deferred struct page feature (CONFIG_DEFERRED_STRUCT_PAGE_INIT),
flags and other fields in "struct page"es are never changed prior to first
initializing struct pages by going through __init_single_page().

With deferred struct page feature enabled there is a case where we set some
fields prior to initializing:

mem_init() {
 register_page_bootmem_info();
 free_all_bootmem();
 ...
}

When register_page_bootmem_info() is called only non-deferred struct pages
are initialized. But, this function goes through some reserved pages which
might be part of the deferred, and thus are not yet initialized.

mem_init
register_page_bootmem_info
register_page_bootmem_info_node
 get_page_bootmem
  .. setting fields here ..
  such as: page->freelist = (void *)type;

free_all_bootmem()
free_low_memory_core_early()
 for_each_reserved_mem_region()
  reserve_bootmem_region()
   init_reserved_page() <- Only if this is deferred reserved page
__init_single_pfn()
 __init_single_page()
  memset(0) <-- Lose the set fields here

We end up with a similar issue as in the previous patch, where currently we
do not observe a problem because the memory is zeroed. But if flag asserts
are changed, we can start hitting issues.

Also, because in this patch series we will stop zeroing struct page memory
during allocation, we must make sure that struct pages are properly
initialized prior to using them.

The deferred-reserved pages are initialized in free_all_bootmem().
Therefore, the fix is to switch the above calls.

Signed-off-by: Pavel Tatashin <pasha.tatas...@oracle.com>
Reviewed-by: Steven Sistare <steven.sist...@oracle.com>
Reviewed-by: Daniel Jordan <daniel.m.jor...@oracle.com>
Reviewed-by: Bob Picco <bob.pi...@oracle.com>
Acked-by: David S. Miller <da...@davemloft.net>
---
 arch/sparc/mm/init_64.c | 8 +++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/arch/sparc/mm/init_64.c b/arch/sparc/mm/init_64.c
index 6034569e2c0d..310c6754bcaa 100644
--- a/arch/sparc/mm/init_64.c
+++ b/arch/sparc/mm/init_64.c
@@ -2548,9 +2548,15 @@ void __init mem_init(void)
 {
high_memory = __va(last_valid_pfn << PAGE_SHIFT);
 
-   register_page_bootmem_info();
free_all_bootmem();
 
+   /* Must be done after boot memory is put on freelist, because here we
+* might set fields in deferred struct pages that have not yet been
+* initialized, and free_all_bootmem() initializes all the reserved
+* deferred pages for us.
+*/
+   register_page_bootmem_info();
+
/*
 * Set up the zero page, mark it reserved, so that page count
 * is not manipulated when freeing the page from user ptes.
-- 
2.14.1



[PATCH v9 01/12] x86/mm: setting fields in deferred pages

2017-09-20 Thread Pavel Tatashin
Without deferred struct page feature (CONFIG_DEFERRED_STRUCT_PAGE_INIT),
flags and other fields in "struct page"es are never changed prior to first
initializing struct pages by going through __init_single_page().

With deferred struct page feature enabled, however, we set fields in
register_page_bootmem_info that are subsequently clobbered right after in
free_all_bootmem:

mem_init() {
register_page_bootmem_info();
free_all_bootmem();
...
}

When register_page_bootmem_info() is called only non-deferred struct pages
are initialized. But, this function goes through some reserved pages which
might be part of the deferred, and thus are not yet initialized.

  mem_init
   register_page_bootmem_info
register_page_bootmem_info_node
 get_page_bootmem
  .. setting fields here ..
  such as: page->freelist = (void *)type;

  free_all_bootmem()
   free_low_memory_core_early()
for_each_reserved_mem_region()
 reserve_bootmem_region()
  init_reserved_page() <- Only if this is deferred reserved page
   __init_single_pfn()
__init_single_page()
memset(0) <-- Lose the set fields here

We end up with an issue where we currently do not observe a problem because
the memory is explicitly zeroed. But if flag asserts are changed, we can
start hitting issues.

Also, because in this patch series we will stop zeroing struct page memory
during allocation, we must make sure that struct pages are properly
initialized prior to using them.

The deferred-reserved pages are initialized in free_all_bootmem().
Therefore, the fix is to switch the above calls.

Signed-off-by: Pavel Tatashin <pasha.tatas...@oracle.com>
Reviewed-by: Steven Sistare <steven.sist...@oracle.com>
Reviewed-by: Daniel Jordan <daniel.m.jor...@oracle.com>
Reviewed-by: Bob Picco <bob.pi...@oracle.com>
---
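For context, the fields referenced in the trace above are set by the existing
get_page_bootmem() (mm/memory_hotplug.c, reproduced roughly below, not changed
by this patch); they are exactly what the later memset(0) in
__init_single_page() would wipe for a deferred page:

	void get_page_bootmem(unsigned long info, struct page *page,
			      unsigned long type)
	{
		page->freelist = (void *)type;
		SetPagePrivate(page);
		set_page_private(page, info);
		page_ref_inc(page);
	}
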
 arch/x86/mm/init_64.c | 9 +++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index 5ea1c3c2636e..30fe22558720 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -1182,12 +1182,17 @@ void __init mem_init(void)
 
/* clear_bss() already clear the empty_zero_page */
 
-   register_page_bootmem_info();
-
/* this will put all memory onto the freelists */
free_all_bootmem();
after_bootmem = 1;
 
+   /* Must be done after boot memory is put on freelist, because here we
+* might set fields in deferred struct pages that have not yet been
+* initialized, and free_all_bootmem() initializes all the reserved
+* deferred pages for us.
+*/
+   register_page_bootmem_info();
+
/* Register memory areas for /proc/kcore */
kclist_add(&kcore_vsyscall, (void *)VSYSCALL_ADDR,
 PAGE_SIZE, KCORE_OTHER);
-- 
2.14.1



[PATCH v9 00/12] complete deferred page initialization

2017-09-20 Thread Pavel Tatashin
Changelog:
v9 - v8
- Addressed comments raised by Mark Rutland and Ard Biesheuvel: changed
  kasan implementation. Added a new function: kasan_map_populate() that
  zeroes the allocated and mapped memory

v8 - v7
- Added Acked-by's from Dave Miller for SPARC changes
- Fixed a minor compiling issue on tile architecture reported by kbuild

v7 - v6
- Addressed comments from Michal Hocko
- memblock_discard() patch was removed from this series and integrated
  separately
- Fixed bug reported by kbuild test robot new patch:
  mm: zero reserved and unavailable struct pages
- Removed patch
  x86/mm: reserve only exiting low pages
  As, it is not needed anymore, because of the previous fix
- Re-wrote deferred_init_memmap(), found and fixed an existing bug, where
  page variable is not reset when zone holes present.
- Merged several patches together per Michal request
- Added performance data including raw logs

v6 - v5
- Fixed ARM64 + kasan code, as reported by Ard Biesheuvel
- Tested ARM64 code in qemu and found few more issues, that I fixed in this
  iteration
- Added page roundup/rounddown to x86 and arm zeroing routines to zero the
  whole allocated range, instead of only provided address range.
- Addressed SPARC related comment from Sam Ravnborg
- Fixed section mismatch warnings related to memblock_discard().

v5 - v4
- Fixed build issues reported by kbuild on various configurations

v4 - v3
- Rewrote code to zero struct pages in __init_single_page() as
  suggested by Michal Hocko
- Added code to handle issues related to accessing struct page
  memory before they are initialized.

v3 - v2
- Addressed David Miller comments about one change per patch:
* Split changes to platforms into 4 patches
* Made "do not zero vmemmap_buf" as a separate patch

v2 - v1
- Per request, added s390 to deferred "struct page" zeroing
- Collected performance data on x86 which proves the importance of
  keeping memset() as a prefetch (see below).

SMP machines can benefit from the DEFERRED_STRUCT_PAGE_INIT config option,
which defers initializing struct pages until all cpus have been started so
it can be done in parallel.

However, this feature is sub-optimal, because the deferred page
initialization code expects that the struct pages have already been zeroed,
and the zeroing is done early in boot with a single thread only.  Also, we
access that memory and set flags before struct pages are initialized. All
of this is fixed in this patchset.

In this work we do the following:
- Never read-access struct pages until they are initialized
- Never set any fields in struct pages before they are initialized
- Zero struct page at the beginning of struct page initialization
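
The zeroing itself is moved into __init_single_page() by the series' patch
"mm: zero struct pages during initialization" (06/12, not quoted here);
roughly, the generic fallback and the call look like the sketch below, with
architectures free to override mm_zero_struct_page() as the sparc64 patch
does:

	/* include/linux/mm.h -- generic fallback (sketch) */
	#ifndef mm_zero_struct_page
	#define mm_zero_struct_page(pp)	((void)memset((pp), 0, sizeof(struct page)))
	#endif

	/* mm/page_alloc.c -- zero each struct page exactly once, right
	 * before initializing it (sketch, remaining init elided)
	 */
	static void __meminit __init_single_page(struct page *page,
						 unsigned long pfn,
						 unsigned long zone, int nid)
	{
		mm_zero_struct_page(page);
		set_page_links(page, zone, nid, pfn);
		init_page_count(page);
		page_mapcount_reset(page);
		page_cpupid_reset_last(page);
		/* ... */
	}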


==
Performance improvements on x86 machine with 8 nodes:
Intel(R) Xeon(R) CPU E7-8895 v3 @ 2.60GHz and 1T of memory:
TIME  SPEED UP
base no deferred:   95.796233s
fix no deferred:79.978956s19.77%

base deferred:  77.254713s
fix deferred:   55.050509s40.34%
==
SPARC M6 3600 MHz with 15T of memory
TIME  SPEED UP
base no deferred:   358.335727s
fix no deferred:302.320936s   18.52%

base deferred:  237.534603s
fix deferred:   182.103003s   30.44%
==
Raw dmesg output with timestamps:
x86 base no deferred:https://hastebin.com/ofunepurit.scala
x86 base deferred:   https://hastebin.com/ifazegeyas.scala
x86 fix no deferred: https://hastebin.com/pegocohevo.scala
x86 fix deferred:https://hastebin.com/ofupevikuk.scala
sparc base no deferred:  https://hastebin.com/ibobeteken.go
sparc base deferred: https://hastebin.com/fariqimiyu.go
sparc fix no deferred:   https://hastebin.com/muhegoheyi.go
sparc fix deferred:  https://hastebin.com/xadinobutu.go

Pavel Tatashin (12):
  x86/mm: setting fields in deferred pages
  sparc64/mm: setting fields in deferred pages
  mm: deferred_init_memmap improvements
  sparc64: simplify vmemmap_populate
  mm: defining memblock_virt_alloc_try_nid_raw
  mm: zero struct pages during initialization
  sparc64: optimized struct page zeroing
  mm: zero reserved and unavailable struct pages
  mm/kasan: kasan specific map populate function
  x86/kasan: use kasan_map_populate()
  arm64/kasan: use kasan_map_populate()
  mm: stop zeroing memory during allocation in vmemmap

 arch/arm64/include/asm/pgtable.h|   3 +
 arch/arm64/mm/kasan_init.c  |  12 +--
 arch/sparc/include/asm/pgtable_64.h |  30 ++
 arch/sparc/mm/init_64.c |  31 +++---
 arch/x86/mm/init_64.c   |   9 +-
 arch/x86/mm/kasan_init_64.c |   8 +-
 include/linux/bootmem.h |  27 +
 include/linux/kasan.h   |   2 +
 include/linux/memblock.h  

[PATCH v9 04/12] sparc64: simplify vmemmap_populate

2017-09-20 Thread Pavel Tatashin
Remove duplicating code by using common functions
vmemmap_pud_populate and vmemmap_pgd_populate.

Signed-off-by: Pavel Tatashin <pasha.tatas...@oracle.com>
Reviewed-by: Steven Sistare <steven.sist...@oracle.com>
Reviewed-by: Daniel Jordan <daniel.m.jor...@oracle.com>
Reviewed-by: Bob Picco <bob.pi...@oracle.com>
Acked-by: David S. Miller <da...@davemloft.net>
---
 arch/sparc/mm/init_64.c | 23 ++-
 1 file changed, 6 insertions(+), 17 deletions(-)

diff --git a/arch/sparc/mm/init_64.c b/arch/sparc/mm/init_64.c
index 310c6754bcaa..99aea4d15a5f 100644
--- a/arch/sparc/mm/init_64.c
+++ b/arch/sparc/mm/init_64.c
@@ -2651,30 +2651,19 @@ int __meminit vmemmap_populate(unsigned long vstart, 
unsigned long vend,
vstart = vstart & PMD_MASK;
vend = ALIGN(vend, PMD_SIZE);
for (; vstart < vend; vstart += PMD_SIZE) {
-   pgd_t *pgd = pgd_offset_k(vstart);
+   pgd_t *pgd = vmemmap_pgd_populate(vstart, node);
unsigned long pte;
pud_t *pud;
pmd_t *pmd;
 
-   if (pgd_none(*pgd)) {
-   pud_t *new = vmemmap_alloc_block(PAGE_SIZE, node);
+   if (!pgd)
+   return -ENOMEM;
 
-   if (!new)
-   return -ENOMEM;
-   pgd_populate(&init_mm, pgd, new);
-   }
-
-   pud = pud_offset(pgd, vstart);
-   if (pud_none(*pud)) {
-   pmd_t *new = vmemmap_alloc_block(PAGE_SIZE, node);
-
-   if (!new)
-   return -ENOMEM;
-   pud_populate(&init_mm, pud, new);
-   }
+   pud = vmemmap_pud_populate(pgd, vstart, node);
+   if (!pud)
+   return -ENOMEM;
 
pmd = pmd_offset(pud, vstart);
-
pte = pmd_val(*pmd);
if (!(pte & _PAGE_VALID)) {
void *block = vmemmap_alloc_block(PMD_SIZE, node);
-- 
2.14.1



Re: [PATCH v8 10/11] arm64/kasan: explicitly zero kasan shadow memory

2017-09-15 Thread Pavel Tatashin

Hi Mark,

I had this option back up to version 3, where a zero flag was passed into 
vmemmap_alloc_block(), but I was asked to remove it because it required 
too many changes in other places. So, the current approach is cleaner; 
the idea is that kasan should use its own version of 
vmemmap_populate() for both x86 and ARM, but I think that is outside 
the scope of this work.


See this comment from Ard Biesheuvel:
https://lkml.org/lkml/2017/8/3/948

"
KASAN uses vmemmap_populate as a convenience: kasan has nothing to do
with vmemmap, but the function already existed and happened to do what
KASAN requires.

Given that that will no longer be the case, it would be far better to
stop using vmemmap_populate altogether, and clone it into a KASAN
specific version (with an appropriate name) with the zeroing folded
into it.
"

If you think I should add these functions in this project, then sure, I 
can send a new version with kasan_map_populate() functions.


Thank you,
Pasha

On 09/15/2017 04:38 PM, Mark Rutland wrote:

> On Thu, Sep 14, 2017 at 09:30:28PM -0400, Pavel Tatashin wrote:
> > Hi Mark, Thank you for looking at this. We can't do this because page
> > table is not set until cpu_replace_ttbr1() is called. So, we can't do
> > memset() on this memory until then.
>
> I see. Sorry, I had missed that we were on the temporary tables at
> this point in time. I'm still not keen on duplicating the iteration.
> Can we split the vmemmap code so that we have a variant that takes a
> GFP? That way we could explicitly pass __GFP_ZERO for those cases
> where we want a zeroed page, and are happy to pay the cost of
> initialization. Thanks, Mark.




Re: [PATCH v8 10/11] arm64/kasan: explicitly zero kasan shadow memory

2017-09-14 Thread Pavel Tatashin

Hi Mark,

Thank you for looking at this. We can't do this because the page table is 
not set until cpu_replace_ttbr1() is called. So, we can't do memset() on 
this memory until then.


Pasha



[PATCH v8 00/11] complete deferred page initialization

2017-09-14 Thread Pavel Tatashin

Copy paste error, changing the subject for the header to v8 from v7.

On 09/14/2017 06:35 PM, Pavel Tatashin wrote:

Changelog:
v8 - v7
- Added Acked-by's from Dave Miller for SPARC changes
- Fixed a minor compiling issue on tile architecture reported by kbuild

v7 - v6
- Addressed comments from Michal Hocko
- memblock_discard() patch was removed from this series and integrated
   separately
- Fixed bug reported by kbuild test robot new patch:
   mm: zero reserved and unavailable struct pages
- Removed patch
   x86/mm: reserve only exiting low pages
   As, it is not needed anymore, because of the previous fix
- Re-wrote deferred_init_memmap(), found and fixed an existing bug, where
   page variable is not reset when zone holes present.
- Merged several patches together per Michal request
- Added performance data including raw logs

v6 - v5
- Fixed ARM64 + kasan code, as reported by Ard Biesheuvel
- Tested ARM64 code in qemu and found few more issues, that I fixed in this
   iteration
- Added page roundup/rounddown to x86 and arm zeroing routines to zero the
   whole allocated range, instead of only provided address range.
- Addressed SPARC related comment from Sam Ravnborg
- Fixed section mismatch warnings related to memblock_discard().

v5 - v4
- Fixed build issues reported by kbuild on various configurations

v4 - v3
- Rewrote code to zero struct pages in __init_single_page() as
   suggested by Michal Hocko
- Added code to handle issues related to accessing struct page
   memory before they are initialized.

v3 - v2
- Addressed David Miller comments about one change per patch:
 * Split changes to platforms into 4 patches
 * Made "do not zero vmemmap_buf" as a separate patch

v2 - v1
- Per request, added s390 to deferred "struct page" zeroing
- Collected performance data on x86 which proves the importance of
   keeping memset() as a prefetch (see below).

SMP machines can benefit from the DEFERRED_STRUCT_PAGE_INIT config option,
which defers initializing struct pages until all cpus have been started so
it can be done in parallel.

However, this feature is sub-optimal, because the deferred page
initialization code expects that the struct pages have already been zeroed,
and the zeroing is done early in boot with a single thread only.  Also, we
access that memory and set flags before struct pages are initialized. All
of this is fixed in this patchset.

In this work we do the following:
- Never read-access struct pages until they are initialized
- Never set any fields in struct pages before they are initialized
- Zero struct page at the beginning of struct page initialization


==
Performance improvements on x86 machine with 8 nodes:
Intel(R) Xeon(R) CPU E7-8895 v3 @ 2.60GHz and 1T of memory:
 TIME  SPEED UP
base no deferred:   95.796233s
fix no deferred:79.978956s19.77%

base deferred:  77.254713s
fix deferred:   55.050509s40.34%
==
SPARC M6 3600 MHz with 15T of memory
 TIME  SPEED UP
base no deferred:   358.335727s
fix no deferred:302.320936s   18.52%

base deferred:  237.534603s
fix deferred:   182.103003s   30.44%
==
Raw dmesg output with timestamps:
x86 base no deferred:https://hastebin.com/ofunepurit.scala
x86 base deferred:   https://hastebin.com/ifazegeyas.scala
x86 fix no deferred: https://hastebin.com/pegocohevo.scala
x86 fix deferred:https://hastebin.com/ofupevikuk.scala
sparc base no deferred:  https://hastebin.com/ibobeteken.go
sparc base deferred: https://hastebin.com/fariqimiyu.go
sparc fix no deferred:   https://hastebin.com/muhegoheyi.go
sparc fix deferred:  https://hastebin.com/xadinobutu.go

Pavel Tatashin (11):
   x86/mm: setting fields in deferred pages
   sparc64/mm: setting fields in deferred pages
   mm: deferred_init_memmap improvements
   sparc64: simplify vmemmap_populate
   mm: defining memblock_virt_alloc_try_nid_raw
   mm: zero struct pages during initialization
   sparc64: optimized struct page zeroing
   mm: zero reserved and unavailable struct pages
   x86/kasan: explicitly zero kasan shadow memory
   arm64/kasan: explicitly zero kasan shadow memory
   mm: stop zeroing memory during allocation in vmemmap

  arch/arm64/mm/kasan_init.c  |  42 
  arch/sparc/include/asm/pgtable_64.h |  30 ++
  arch/sparc/mm/init_64.c |  31 +++---
  arch/x86/mm/init_64.c   |   9 +-
  arch/x86/mm/kasan_init_64.c |  66 
  include/linux/bootmem.h |  27 +
  include/linux/memblock.h|  16 +++
  include/linux/mm.h  |  26 +
  mm/memblock.c   |  60 +--
  mm/page_alloc.c  

[PATCH v7 00/11] complete deferred page initialization

2017-09-14 Thread Pavel Tatashin
Changelog:
v8 - v7
- Added Acked-by's from Dave Miller for SPARC changes
- Fixed a minor compiling issue on tile architecture reported by kbuild

v7 - v6
- Addressed comments from Michal Hocko
- memblock_discard() patch was removed from this series and integrated
  separately
- Fixed bug reported by kbuild test robot new patch:
  mm: zero reserved and unavailable struct pages
- Removed patch
  x86/mm: reserve only exiting low pages
  As, it is not needed anymore, because of the previous fix
- Re-wrote deferred_init_memmap(), found and fixed an existing bug, where
  page variable is not reset when zone holes present.
- Merged several patches together per Michal request
- Added performance data including raw logs

v6 - v5
- Fixed ARM64 + kasan code, as reported by Ard Biesheuvel
- Tested ARM64 code in qemu and found few more issues, that I fixed in this
  iteration
- Added page roundup/rounddown to x86 and arm zeroing routines to zero the
  whole allocated range, instead of only provided address range.
- Addressed SPARC related comment from Sam Ravnborg
- Fixed section mismatch warnings related to memblock_discard().

v5 - v4
- Fixed build issues reported by kbuild on various configurations

v4 - v3
- Rewrote code to zero struct pages in __init_single_page() as
  suggested by Michal Hocko
- Added code to handle issues related to accessing struct page
  memory before they are initialized.

v3 - v2
- Addressed David Miller comments about one change per patch:
* Split changes to platforms into 4 patches
* Made "do not zero vmemmap_buf" as a separate patch

v2 - v1
- Per request, added s390 to deferred "struct page" zeroing
- Collected performance data on x86 which proves the importance of
  keeping memset() as a prefetch (see below).

SMP machines can benefit from the DEFERRED_STRUCT_PAGE_INIT config option,
which defers initializing struct pages until all cpus have been started so
it can be done in parallel.

However, this feature is sub-optimal, because the deferred page
initialization code expects that the struct pages have already been zeroed,
and the zeroing is done early in boot with a single thread only.  Also, we
access that memory and set flags before struct pages are initialized. All
of this is fixed in this patchset.

In this work we do the following:
- Never read-access struct pages until they are initialized
- Never set any fields in struct pages before they are initialized
- Zero struct page at the beginning of struct page initialization


==
Performance improvements on x86 machine with 8 nodes:
Intel(R) Xeon(R) CPU E7-8895 v3 @ 2.60GHz and 1T of memory:
TIME  SPEED UP
base no deferred:   95.796233s
fix no deferred:79.978956s19.77%

base deferred:  77.254713s
fix deferred:   55.050509s40.34%
==
SPARC M6 3600 MHz with 15T of memory
TIME  SPEED UP
base no deferred:   358.335727s
fix no deferred:302.320936s   18.52%

base deferred:  237.534603s
fix deferred:   182.103003s   30.44%
==
Raw dmesg output with timestamps:
x86 base no deferred:https://hastebin.com/ofunepurit.scala
x86 base deferred:   https://hastebin.com/ifazegeyas.scala
x86 fix no deferred: https://hastebin.com/pegocohevo.scala
x86 fix deferred:https://hastebin.com/ofupevikuk.scala
sparc base no deferred:  https://hastebin.com/ibobeteken.go
sparc base deferred: https://hastebin.com/fariqimiyu.go
sparc fix no deferred:   https://hastebin.com/muhegoheyi.go
sparc fix deferred:  https://hastebin.com/xadinobutu.go

Pavel Tatashin (11):
  x86/mm: setting fields in deferred pages
  sparc64/mm: setting fields in deferred pages
  mm: deferred_init_memmap improvements
  sparc64: simplify vmemmap_populate
  mm: defining memblock_virt_alloc_try_nid_raw
  mm: zero struct pages during initialization
  sparc64: optimized struct page zeroing
  mm: zero reserved and unavailable struct pages
  x86/kasan: explicitly zero kasan shadow memory
  arm64/kasan: explicitly zero kasan shadow memory
  mm: stop zeroing memory during allocation in vmemmap

 arch/arm64/mm/kasan_init.c  |  42 
 arch/sparc/include/asm/pgtable_64.h |  30 ++
 arch/sparc/mm/init_64.c |  31 +++---
 arch/x86/mm/init_64.c   |   9 +-
 arch/x86/mm/kasan_init_64.c |  66 
 include/linux/bootmem.h |  27 +
 include/linux/memblock.h|  16 +++
 include/linux/mm.h  |  26 +
 mm/memblock.c   |  60 +--
 mm/page_alloc.c | 207 
 mm/sparse-vmemmap.c |  15 ++-
 mm/sparse.c |   6 +-
 1

[PATCH v8 02/11] sparc64/mm: setting fields in deferred pages

2017-09-14 Thread Pavel Tatashin
Without deferred struct page feature (CONFIG_DEFERRED_STRUCT_PAGE_INIT),
flags and other fields in "struct page"es are never changed prior to first
initializing struct pages by going through __init_single_page().

With deferred struct page feature enabled there is a case where we set some
fields prior to initializing:

mem_init() {
 register_page_bootmem_info();
 free_all_bootmem();
 ...
}

When register_page_bootmem_info() is called only non-deferred struct pages
are initialized. But, this function goes through some reserved pages which
might be part of the deferred, and thus are not yet initialized.

mem_init
register_page_bootmem_info
register_page_bootmem_info_node
 get_page_bootmem
  .. setting fields here ..
  such as: page->freelist = (void *)type;

free_all_bootmem()
free_low_memory_core_early()
 for_each_reserved_mem_region()
  reserve_bootmem_region()
   init_reserved_page() <- Only if this is deferred reserved page
__init_single_pfn()
 __init_single_page()
  memset(0) <-- Lose the set fields here

We end up with a similar issue as in the previous patch, where currently we
do not observe a problem because the memory is zeroed. But if flag asserts
are changed, we can start hitting issues.

Also, because in this patch series we will stop zeroing struct page memory
during allocation, we must make sure that struct pages are properly
initialized prior to using them.

The deferred-reserved pages are initialized in free_all_bootmem().
Therefore, the fix is to switch the above calls.

Signed-off-by: Pavel Tatashin <pasha.tatas...@oracle.com>
Reviewed-by: Steven Sistare <steven.sist...@oracle.com>
Reviewed-by: Daniel Jordan <daniel.m.jor...@oracle.com>
Reviewed-by: Bob Picco <bob.pi...@oracle.com>
Acked-by: David S. Miller <da...@davemloft.net>
---
 arch/sparc/mm/init_64.c | 8 +++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/arch/sparc/mm/init_64.c b/arch/sparc/mm/init_64.c
index b2ba410b26f4..078f1352736e 100644
--- a/arch/sparc/mm/init_64.c
+++ b/arch/sparc/mm/init_64.c
@@ -2539,9 +2539,15 @@ void __init mem_init(void)
 {
high_memory = __va(last_valid_pfn << PAGE_SHIFT);
 
-   register_page_bootmem_info();
free_all_bootmem();
 
+   /* Must be done after boot memory is put on freelist, because here we
+* might set fields in deferred struct pages that have not yet been
+* initialized, and free_all_bootmem() initializes all the reserved
+* deferred pages for us.
+*/
+   register_page_bootmem_info();
+
/*
 * Set up the zero page, mark it reserved, so that page count
 * is not manipulated when freeing the page from user ptes.
-- 
2.14.1



[PATCH v8 07/11] sparc64: optimized struct page zeroing

2017-09-14 Thread Pavel Tatashin
Add an optimized mm_zero_struct_page(), so struct pages are zeroed without
calling memset(). We do eight to ten regular stores based on the size of
struct page. The compiler optimizes out the conditions of the switch() statement.

SPARC-M6 with 15T of memory, single thread performance:

   BASEFIX  OPTIMIZED_FIX
bootmem_init   28.440467985s   2.305674818s   2.305161615s
free_area_init_nodes  202.845901673s 225.343084508s 172.556506560s
  
Total 231.286369658s 227.648759326s 174.861668175s

BASE:  current linux
FIX:   This patch series without "optimized struct page zeroing"
OPTIMIZED_FIX: This patch series including the current patch.

bootmem_init() is where memory for struct pages is zeroed during
allocation. Note, about two seconds in this function is a fixed time: it
does not increase as memory is increased.

Signed-off-by: Pavel Tatashin <pasha.tatas...@oracle.com>
Reviewed-by: Steven Sistare <steven.sist...@oracle.com>
Reviewed-by: Daniel Jordan <daniel.m.jor...@oracle.com>
Reviewed-by: Bob Picco <bob.pi...@oracle.com>
Acked-by: David S. Miller <da...@davemloft.net>
---
 arch/sparc/include/asm/pgtable_64.h | 30 ++
 1 file changed, 30 insertions(+)

diff --git a/arch/sparc/include/asm/pgtable_64.h 
b/arch/sparc/include/asm/pgtable_64.h
index 4fefe3762083..8ed478abc630 100644
--- a/arch/sparc/include/asm/pgtable_64.h
+++ b/arch/sparc/include/asm/pgtable_64.h
@@ -230,6 +230,36 @@ extern unsigned long _PAGE_ALL_SZ_BITS;
 extern struct page *mem_map_zero;
 #define ZERO_PAGE(vaddr)   (mem_map_zero)
 
+/* This macro must be updated when the size of struct page grows above 80
+ * or reduces below 64.
+ * The idea that compiler optimizes out switch() statement, and only
+ * leaves clrx instructions
+ */
+#define mm_zero_struct_page(pp) do {
\
+   unsigned long *_pp = (void *)(pp);  \
+   \
+/* Check that struct page is either 64, 72, or 80 bytes */ \
+   BUILD_BUG_ON(sizeof(struct page) & 7);  \
+   BUILD_BUG_ON(sizeof(struct page) < 64); \
+   BUILD_BUG_ON(sizeof(struct page) > 80); \
+   \
+   switch (sizeof(struct page)) {  \
+   case 80:\
+   _pp[9] = 0; /* fallthrough */   \
+   case 72:\
+   _pp[8] = 0; /* fallthrough */   \
+   default:\
+   _pp[7] = 0; \
+   _pp[6] = 0; \
+   _pp[5] = 0; \
+   _pp[4] = 0; \
+   _pp[3] = 0; \
+   _pp[2] = 0; \
+   _pp[1] = 0; \
+   _pp[0] = 0; \
+   }   \
+} while (0)
+
 /* PFNs are real physical page numbers.  However, mem_map only begins to record
  * per-page information starting at pfn_base.  This is to handle systems where
  * the first physical page in the machine is at some huge physical address,
-- 
2.14.1



[PATCH v8 05/11] mm: defining memblock_virt_alloc_try_nid_raw

2017-09-14 Thread Pavel Tatashin
* A new variant of memblock_virt_alloc_* allocations:
memblock_virt_alloc_try_nid_raw()
- Does not zero the allocated memory
- Does not panic if request cannot be satisfied

* optimize early system hash allocations

Clients can call alloc_large_system_hash() with the HASH_ZERO flag to specify
that the memory allocated for the system hash needs to be zeroed; otherwise
the memory does not need to be zeroed, and the client will initialize it.

If memory does not need to be zero'd, call the new
memblock_virt_alloc_raw() interface, and thus improve the boot performance.

* debug for raw allocator

When CONFIG_DEBUG_VM is enabled, this patch sets all the memory that is
returned by memblock_virt_alloc_try_nid_raw() to ones to ensure that no
places expect zeroed memory.

Signed-off-by: Pavel Tatashin <pasha.tatas...@oracle.com>
Reviewed-by: Steven Sistare <steven.sist...@oracle.com>
Reviewed-by: Daniel Jordan <daniel.m.jor...@oracle.com>
Reviewed-by: Bob Picco <bob.pi...@oracle.com>
Acked-by: Michal Hocko <mho...@suse.com>
---
 include/linux/bootmem.h | 27 ++
 mm/memblock.c   | 60 +++--
 mm/page_alloc.c | 15 ++---
 3 files changed, 87 insertions(+), 15 deletions(-)

diff --git a/include/linux/bootmem.h b/include/linux/bootmem.h
index e223d91b6439..ea30b3987282 100644
--- a/include/linux/bootmem.h
+++ b/include/linux/bootmem.h
@@ -160,6 +160,9 @@ extern void *__alloc_bootmem_low_node(pg_data_t *pgdat,
 #define BOOTMEM_ALLOC_ANYWHERE (~(phys_addr_t)0)
 
 /* FIXME: Move to memblock.h at a point where we remove nobootmem.c */
+void *memblock_virt_alloc_try_nid_raw(phys_addr_t size, phys_addr_t align,
+ phys_addr_t min_addr,
+ phys_addr_t max_addr, int nid);
 void *memblock_virt_alloc_try_nid_nopanic(phys_addr_t size,
phys_addr_t align, phys_addr_t min_addr,
phys_addr_t max_addr, int nid);
@@ -176,6 +179,14 @@ static inline void * __init memblock_virt_alloc(
NUMA_NO_NODE);
 }
 
+static inline void * __init memblock_virt_alloc_raw(
+   phys_addr_t size,  phys_addr_t align)
+{
+   return memblock_virt_alloc_try_nid_raw(size, align, BOOTMEM_LOW_LIMIT,
+   BOOTMEM_ALLOC_ACCESSIBLE,
+   NUMA_NO_NODE);
+}
+
 static inline void * __init memblock_virt_alloc_nopanic(
phys_addr_t size, phys_addr_t align)
 {
@@ -257,6 +268,14 @@ static inline void * __init memblock_virt_alloc(
return __alloc_bootmem(size, align, BOOTMEM_LOW_LIMIT);
 }
 
+static inline void * __init memblock_virt_alloc_raw(
+   phys_addr_t size,  phys_addr_t align)
+{
+   if (!align)
+   align = SMP_CACHE_BYTES;
+   return __alloc_bootmem_nopanic(size, align, BOOTMEM_LOW_LIMIT);
+}
+
 static inline void * __init memblock_virt_alloc_nopanic(
phys_addr_t size, phys_addr_t align)
 {
@@ -309,6 +328,14 @@ static inline void * __init 
memblock_virt_alloc_try_nid(phys_addr_t size,
  min_addr);
 }
 
+static inline void * __init memblock_virt_alloc_try_nid_raw(
+   phys_addr_t size, phys_addr_t align,
+   phys_addr_t min_addr, phys_addr_t max_addr, int nid)
+{
+   return ___alloc_bootmem_node_nopanic(NODE_DATA(nid), size, align,
+   min_addr, max_addr);
+}
+
 static inline void * __init memblock_virt_alloc_try_nid_nopanic(
phys_addr_t size, phys_addr_t align,
phys_addr_t min_addr, phys_addr_t max_addr, int nid)
diff --git a/mm/memblock.c b/mm/memblock.c
index 91205780e6b1..1f299fb1eb08 100644
--- a/mm/memblock.c
+++ b/mm/memblock.c
@@ -1327,7 +1327,6 @@ static void * __init memblock_virt_alloc_internal(
return NULL;
 done:
ptr = phys_to_virt(alloc);
-   memset(ptr, 0, size);
 
/*
 * The min_count is set to 0 so that bootmem allocated blocks
@@ -1340,6 +1339,45 @@ static void * __init memblock_virt_alloc_internal(
return ptr;
 }
 
+/**
+ * memblock_virt_alloc_try_nid_raw - allocate boot memory block without zeroing
+ * memory and without panicking
+ * @size: size of memory block to be allocated in bytes
+ * @align: alignment of the region and block's size
+ * @min_addr: the lower bound of the memory region from where the allocation
+ *   is preferred (phys address)
+ * @max_addr: the upper bound of the memory region from where the allocation
+ *   is preferred (phys address), or %BOOTMEM_ALLOC_ACCESSIBLE to
+ *   allocate only from memory limited by memblock.current_limit value
+ * @nid: nid of the free area to fi

[PATCH v8 11/11] mm: stop zeroing memory during allocation in vmemmap

2017-09-14 Thread Pavel Tatashin
vmemmap_alloc_block() will no longer zero the block, so zero memory
at its call sites for everything except struct pages.  Struct page memory
is zero'd by struct page initialization.

Replace the allocators in sparse-vmemmap to use the non-zeroing version. This
way we get the performance improvement by zeroing the memory in parallel when
struct pages are zeroed.

Signed-off-by: Pavel Tatashin <pasha.tatas...@oracle.com>
Reviewed-by: Steven Sistare <steven.sist...@oracle.com>
Reviewed-by: Daniel Jordan <daniel.m.jor...@oracle.com>
Reviewed-by: Bob Picco <bob.pi...@oracle.com>
---
 include/linux/mm.h  | 11 +++
 mm/sparse-vmemmap.c | 15 +++
 mm/sparse.c |  6 +++---
 3 files changed, 21 insertions(+), 11 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index a7bba4ce79ba..25848764570f 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2501,6 +2501,17 @@ static inline void *vmemmap_alloc_block_buf(unsigned 
long size, int node)
return __vmemmap_alloc_block_buf(size, node, NULL);
 }
 
+static inline void *vmemmap_alloc_block_zero(unsigned long size, int node)
+{
+   void *p = vmemmap_alloc_block(size, node);
+
+   if (!p)
+   return NULL;
+   memset(p, 0, size);
+
+   return p;
+}
+
 void vmemmap_verify(pte_t *, int, unsigned long, unsigned long);
 int vmemmap_populate_basepages(unsigned long start, unsigned long end,
   int node);
diff --git a/mm/sparse-vmemmap.c b/mm/sparse-vmemmap.c
index d1a39b8051e0..c2f5654e7c9d 100644
--- a/mm/sparse-vmemmap.c
+++ b/mm/sparse-vmemmap.c
@@ -41,7 +41,7 @@ static void * __ref __earlyonly_bootmem_alloc(int node,
unsigned long align,
unsigned long goal)
 {
-   return memblock_virt_alloc_try_nid(size, align, goal,
+   return memblock_virt_alloc_try_nid_raw(size, align, goal,
BOOTMEM_ALLOC_ACCESSIBLE, node);
 }
 
@@ -54,9 +54,8 @@ void * __meminit vmemmap_alloc_block(unsigned long size, int 
node)
if (slab_is_available()) {
struct page *page;
 
-   page = alloc_pages_node(node,
-   GFP_KERNEL | __GFP_ZERO | __GFP_RETRY_MAYFAIL,
-   get_order(size));
+   page = alloc_pages_node(node, GFP_KERNEL | __GFP_RETRY_MAYFAIL,
+   get_order(size));
if (page)
return page_address(page);
return NULL;
@@ -183,7 +182,7 @@ pmd_t * __meminit vmemmap_pmd_populate(pud_t *pud, unsigned 
long addr, int node)
 {
pmd_t *pmd = pmd_offset(pud, addr);
if (pmd_none(*pmd)) {
-   void *p = vmemmap_alloc_block(PAGE_SIZE, node);
+   void *p = vmemmap_alloc_block_zero(PAGE_SIZE, node);
if (!p)
return NULL;
pmd_populate_kernel(&init_mm, pmd, p);
@@ -195,7 +194,7 @@ pud_t * __meminit vmemmap_pud_populate(p4d_t *p4d, unsigned 
long addr, int node)
 {
pud_t *pud = pud_offset(p4d, addr);
if (pud_none(*pud)) {
-   void *p = vmemmap_alloc_block(PAGE_SIZE, node);
+   void *p = vmemmap_alloc_block_zero(PAGE_SIZE, node);
if (!p)
return NULL;
pud_populate(&init_mm, pud, p);
@@ -207,7 +206,7 @@ p4d_t * __meminit vmemmap_p4d_populate(pgd_t *pgd, unsigned 
long addr, int node)
 {
p4d_t *p4d = p4d_offset(pgd, addr);
if (p4d_none(*p4d)) {
-   void *p = vmemmap_alloc_block(PAGE_SIZE, node);
+   void *p = vmemmap_alloc_block_zero(PAGE_SIZE, node);
if (!p)
return NULL;
                p4d_populate(&init_mm, p4d, p);
@@ -219,7 +218,7 @@ pgd_t * __meminit vmemmap_pgd_populate(unsigned long addr, 
int node)
 {
pgd_t *pgd = pgd_offset_k(addr);
if (pgd_none(*pgd)) {
-   void *p = vmemmap_alloc_block(PAGE_SIZE, node);
+   void *p = vmemmap_alloc_block_zero(PAGE_SIZE, node);
if (!p)
return NULL;
                pgd_populate(&init_mm, pgd, p);
diff --git a/mm/sparse.c b/mm/sparse.c
index 83b3bf6461af..d22f51bb7c79 100644
--- a/mm/sparse.c
+++ b/mm/sparse.c
@@ -437,9 +437,9 @@ void __init sparse_mem_maps_populate_node(struct page 
**map_map,
}
 
size = PAGE_ALIGN(size);
-   map = memblock_virt_alloc_try_nid(size * map_count,
- PAGE_SIZE, __pa(MAX_DMA_ADDRESS),
- BOOTMEM_ALLOC_ACCESSIBLE, nodeid);
+   map = memblock_virt_alloc_try_nid_raw(size * map_count,
+ PAGE_SIZE, __pa(MAX_DMA_ADDRESS),
+ BOOTMEM_ALLOC_ACCESSIBLE, nodeid);
if (map) {
for (pnum = pnum_begi
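
The change above removes implicit zeroing from the block allocator and pushes
it to the few call sites that still need it. As a rough illustration, here is
a minimal userspace sketch of that split; alloc_block_raw() and
alloc_block_zero() are stand-ins for vmemmap_alloc_block() and
vmemmap_alloc_block_zero(), not kernel APIs:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* stand-in for the now non-zeroing vmemmap_alloc_block() */
static void *alloc_block_raw(size_t size)
{
	return malloc(size);            /* contents are undefined */
}

/* stand-in for vmemmap_alloc_block_zero(): zero only when required */
static void *alloc_block_zero(size_t size)
{
	void *p = alloc_block_raw(size);

	if (!p)
		return NULL;
	memset(p, 0, size);
	return p;
}

int main(void)
{
	/* page-table pages must be zeroed before they are linked in */
	unsigned long *pt = alloc_block_zero(512 * sizeof(*pt));
	/* struct-page blocks may stay raw; they are zeroed later, during
	 * per-page initialization */
	unsigned char *pages = alloc_block_raw(4096);

	if (!pt || !pages)
		return 1;
	printf("pt[0]=%lu (zeroed), pages[] left raw until init\n", pt[0]);
	free(pt);
	free(pages);
	return 0;
}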

[PATCH v8 03/11] mm: deferred_init_memmap improvements

2017-09-14 Thread Pavel Tatashin
This patch fixes two issues in deferred_init_memmap().

=
In deferred_init_memmap() where all deferred struct pages are initialized
we have a check like this:

if (page->flags) {
VM_BUG_ON(page_zone(page) != zone);
goto free_range;
}

This way we are checking if the current deferred page has already been
initialized. It works because memory for struct pages has been zeroed, and
the only way flags are not zero is if the page went through
__init_single_page() before.  But, once we change the current behavior and
no longer zero the memory in the memblock allocator, we cannot trust
anything inside "struct page"s until they are initialized. This patch fixes
this.

The deferred_init_memmap() is re-written to loop through only free memory
ranges provided by memblock.

=
This patch fixes another existing issue on systems that have holes in
zones, i.e. where CONFIG_HOLES_IN_ZONE is defined.

In for_each_mem_pfn_range() we have code like this:

if (!pfn_valid_within(pfn))
goto free_range;

Note: 'page' is not set to NULL and is not incremented, but 'pfn' advances.
This means that if deferred struct pages are enabled on systems with this
kind of holes, Linux would get memory corruptions. I have fixed this issue
by defining a new macro that performs all the necessary operations when we
free the current set of pages.

Signed-off-by: Pavel Tatashin <pasha.tatas...@oracle.com>
Reviewed-by: Steven Sistare <steven.sist...@oracle.com>
Reviewed-by: Daniel Jordan <daniel.m.jor...@oracle.com>
Reviewed-by: Bob Picco <bob.pi...@oracle.com>
---
 mm/page_alloc.c | 161 +++-
 1 file changed, 78 insertions(+), 83 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index c841af88836a..d132c801d2c1 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1410,14 +1410,17 @@ void clear_zone_contiguous(struct zone *zone)
 }
 
 #ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT
-static void __init deferred_free_range(struct page *page,
-   unsigned long pfn, int nr_pages)
+static void __init deferred_free_range(unsigned long pfn,
+  unsigned long nr_pages)
 {
-   int i;
+   struct page *page;
+   unsigned long i;
 
-   if (!page)
+   if (!nr_pages)
return;
 
+   page = pfn_to_page(pfn);
+
/* Free a large naturally-aligned chunk if possible */
if (nr_pages == pageblock_nr_pages &&
(pfn & (pageblock_nr_pages - 1)) == 0) {
@@ -1443,19 +1446,82 @@ static inline void __init 
pgdat_init_report_one_done(void)
        complete(&pgdat_init_all_done_comp);
 }
 
+#define DEFERRED_FREE(nr_free, free_base_pfn, page)\
+({ \
+   unsigned long nr = (nr_free);   \
+   \
+   deferred_free_range((free_base_pfn), (nr)); \
+   (free_base_pfn) = 0;\
+   (nr_free) = 0;  \
+   page = NULL;\
+   nr; \
+})
+
+static unsigned long deferred_init_range(int nid, int zid, unsigned long pfn,
+unsigned long end_pfn)
+{
+   struct mminit_pfnnid_cache nid_init_state = { };
+   unsigned long nr_pgmask = pageblock_nr_pages - 1;
+   unsigned long free_base_pfn = 0;
+   unsigned long nr_pages = 0;
+   unsigned long nr_free = 0;
+   struct page *page = NULL;
+
+   for (; pfn < end_pfn; pfn++) {
+   /*
+* First we check if pfn is valid on architectures where it is
+* possible to have holes within pageblock_nr_pages. On systems
+* where it is not possible, this function is optimized out.
+*
+* Then, we check if a current large page is valid by only
+* checking the validity of the head pfn.
+*
+* meminit_pfn_in_nid is checked on systems where pfns can
+* interleave within a node: a pfn is between start and end
+* of a node, but does not belong to this memory node.
+*
+* Finally, we minimize pfn page lookups and scheduler checks by
+* performing it only once every pageblock_nr_pages.
+*/
+   if (!pfn_valid_within(pfn)) {
+   nr_pages += DEFERRED_FREE(nr_free, free_base_pfn, page);
+   } else if (!(pfn & nr_pgmask) && !pfn_valid(pfn)) {
+   nr_pages += DEFERRED_FREE(nr_free, free_base_pfn, page);
+   } else if (!meminit_pfn_i
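
The DEFERRED_FREE() macro above flushes the run of pages accumulated so far
whenever the walk hits an invalid pfn. A small userspace sketch of the same
batching idea, with is_valid() standing in for pfn_valid()/pfn_valid_within()
(all names here are illustrative only):

#include <stdio.h>

static int is_valid(unsigned long pfn)
{
	return pfn < 10 || pfn > 14;    /* pretend pfns 10..14 are a hole */
}

/* flush the accumulated run, like DEFERRED_FREE()/deferred_free_range() */
static unsigned long flush_run(unsigned long base, unsigned long nr)
{
	if (nr)
		printf("free range [%lu, %lu)\n", base, base + nr);
	return nr;
}

int main(void)
{
	unsigned long pfn, base = 0, nr = 0, freed = 0;

	for (pfn = 0; pfn < 32; pfn++) {
		if (!is_valid(pfn)) {
			freed += flush_run(base, nr);   /* hole: flush */
			nr = 0;
		} else if (!nr) {
			base = pfn;                     /* start a new run */
			nr = 1;
		} else {
			nr++;                           /* extend the run */
		}
	}
	freed += flush_run(base, nr);                   /* final flush */
	printf("freed %lu pages\n", freed);
	return 0;
}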

[PATCH v8 10/11] arm64/kasan: explicitly zero kasan shadow memory

2017-09-14 Thread Pavel Tatashin
To optimize the performance of struct page initialization,
vmemmap_populate() will no longer zero memory.

We must explicitly zero the memory that is allocated by vmemmap_populate()
for kasan, as this memory does not go through the struct page initialization
path.

Signed-off-by: Pavel Tatashin <pasha.tatas...@oracle.com>
Reviewed-by: Steven Sistare <steven.sist...@oracle.com>
Reviewed-by: Daniel Jordan <daniel.m.jor...@oracle.com>
Reviewed-by: Bob Picco <bob.pi...@oracle.com>
---
 arch/arm64/mm/kasan_init.c | 42 ++
 1 file changed, 42 insertions(+)

diff --git a/arch/arm64/mm/kasan_init.c b/arch/arm64/mm/kasan_init.c
index 81f03959a4ab..e78a9ecbb687 100644
--- a/arch/arm64/mm/kasan_init.c
+++ b/arch/arm64/mm/kasan_init.c
@@ -135,6 +135,41 @@ static void __init clear_pgds(unsigned long start,
set_pgd(pgd_offset_k(start), __pgd(0));
 }
 
+/*
+ * Memory that was allocated by vmemmap_populate is not zeroed, so we must
+ * zero it here explicitly.
+ */
+static void
+zero_vmemmap_populated_memory(void)
+{
+   struct memblock_region *reg;
+   u64 start, end;
+
+   for_each_memblock(memory, reg) {
+   start = __phys_to_virt(reg->base);
+   end = __phys_to_virt(reg->base + reg->size);
+
+   if (start >= end)
+   break;
+
+   start = (u64)kasan_mem_to_shadow((void *)start);
+   end = (u64)kasan_mem_to_shadow((void *)end);
+
+   /* Round to the start and end of the mapped pages */
+   start = round_down(start, SWAPPER_BLOCK_SIZE);
+   end = round_up(end, SWAPPER_BLOCK_SIZE);
+   memset((void *)start, 0, end - start);
+   }
+
+   start = (u64)kasan_mem_to_shadow(_text);
+   end = (u64)kasan_mem_to_shadow(_end);
+
+   /* Round to the start and end of the mapped pages */
+   start = round_down(start, SWAPPER_BLOCK_SIZE);
+   end = round_up(end, SWAPPER_BLOCK_SIZE);
+   memset((void *)start, 0, end - start);
+}
+
 void __init kasan_init(void)
 {
u64 kimg_shadow_start, kimg_shadow_end;
@@ -205,8 +240,15 @@ void __init kasan_init(void)
pfn_pte(sym_to_pfn(kasan_zero_page), PAGE_KERNEL_RO));
 
memset(kasan_zero_page, 0, PAGE_SIZE);
+
cpu_replace_ttbr1(lm_alias(swapper_pg_dir));
 
+   /*
+* vmemmap_populate does not zero the memory, so we need to zero it
+* explicitly
+*/
+   zero_vmemmap_populated_memory();
+
/* At this point kasan is fully initialized. Enable error messages */
init_task.kasan_depth = 0;
pr_info("KernelAddressSanitizer initialized\n");
-- 
2.14.1
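
zero_vmemmap_populated_memory() above rounds each shadow range out to
SWAPPER_BLOCK_SIZE before clearing it, so the memset covers exactly the
blocks that vmemmap_populate() mapped. A tiny standalone sketch of that
round-and-clear step; BLOCK and the shadow[] array are made up for the
example and do not correspond to kernel symbols:

#include <stdio.h>
#include <string.h>

#define BLOCK 64UL                      /* stand-in for SWAPPER_BLOCK_SIZE */

#define round_down(x, a)        ((x) & ~((a) - 1))
#define round_up(x, a)          (((x) + (a) - 1) & ~((a) - 1))

static unsigned char shadow[1024];      /* toy shadow region */

static void zero_shadow_range(unsigned long start, unsigned long end)
{
	/* expand to whole mapped blocks before clearing */
	start = round_down(start, BLOCK);
	end = round_up(end, BLOCK);
	memset(shadow + start, 0, end - start);
	printf("cleared [%lu, %lu)\n", start, end);
}

int main(void)
{
	memset(shadow, 0xa5, sizeof(shadow));   /* "uninitialized" pattern */
	zero_shadow_range(100, 300);            /* becomes [64, 320) */
	return 0;
}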



[PATCH v8 04/11] sparc64: simplify vmemmap_populate

2017-09-14 Thread Pavel Tatashin
Remove duplicated code by using the common functions
vmemmap_pud_populate and vmemmap_pgd_populate.

Signed-off-by: Pavel Tatashin <pasha.tatas...@oracle.com>
Reviewed-by: Steven Sistare <steven.sist...@oracle.com>
Reviewed-by: Daniel Jordan <daniel.m.jor...@oracle.com>
Reviewed-by: Bob Picco <bob.pi...@oracle.com>
Acked-by: David S. Miller <da...@davemloft.net>
---
 arch/sparc/mm/init_64.c | 23 ++-
 1 file changed, 6 insertions(+), 17 deletions(-)

diff --git a/arch/sparc/mm/init_64.c b/arch/sparc/mm/init_64.c
index 078f1352736e..fc47afa518f5 100644
--- a/arch/sparc/mm/init_64.c
+++ b/arch/sparc/mm/init_64.c
@@ -2642,30 +2642,19 @@ int __meminit vmemmap_populate(unsigned long vstart, 
unsigned long vend,
vstart = vstart & PMD_MASK;
vend = ALIGN(vend, PMD_SIZE);
for (; vstart < vend; vstart += PMD_SIZE) {
-   pgd_t *pgd = pgd_offset_k(vstart);
+   pgd_t *pgd = vmemmap_pgd_populate(vstart, node);
unsigned long pte;
pud_t *pud;
pmd_t *pmd;
 
-   if (pgd_none(*pgd)) {
-   pud_t *new = vmemmap_alloc_block(PAGE_SIZE, node);
+   if (!pgd)
+   return -ENOMEM;
 
-   if (!new)
-   return -ENOMEM;
-   pgd_populate(&init_mm, pgd, new);
-   }
-
-   pud = pud_offset(pgd, vstart);
-   if (pud_none(*pud)) {
-   pmd_t *new = vmemmap_alloc_block(PAGE_SIZE, node);
-
-   if (!new)
-   return -ENOMEM;
-   pud_populate(&init_mm, pud, new);
-   }
+   pud = vmemmap_pud_populate(pgd, vstart, node);
+   if (!pud)
+   return -ENOMEM;
 
pmd = pmd_offset(pud, vstart);
-
pte = pmd_val(*pmd);
if (!(pte & _PAGE_VALID)) {
void *block = vmemmap_alloc_block(PMD_SIZE, node);
-- 
2.14.1
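
The refactoring above reuses the generic "populate this level if it is
empty" helpers instead of open-coding them per architecture. A minimal
userspace sketch of that helper pattern; struct level and level_populate()
are invented for illustration and are not kernel types:

#include <stdio.h>
#include <stdlib.h>

#define ENTRIES 8

struct level {
	void *slot[ENTRIES];
};

/* return the next-level table for @idx, allocating it if the slot is empty */
static void *level_populate(struct level *tbl, unsigned int idx)
{
	if (!tbl->slot[idx]) {
		void *p = calloc(1, sizeof(struct level));

		if (!p)
			return NULL;
		tbl->slot[idx] = p;
	}
	return tbl->slot[idx];
}

int main(void)
{
	struct level root = { { 0 } };
	struct level *mid = level_populate(&root, 3);

	if (!mid)
		return 1;
	/* a second walk finds the already-populated entry */
	printf("same table: %d\n", level_populate(&root, 3) == mid);
	free(mid);
	return 0;
}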



[PATCH v8 08/11] mm: zero reserved and unavailable struct pages

2017-09-14 Thread Pavel Tatashin
Some memory is reserved but unavailable: not present in memblock.memory
(because not backed by physical pages), but present in memblock.reserved.
Such memory has backing struct pages, but they are not initialized by going
through __init_single_page().

In some cases these struct pages are accessed even if they do not contain
any data. One example is that page_to_pfn() might access page->flags if
this is where section information is stored (CONFIG_SPARSEMEM,
SECTION_IN_PAGE_FLAGS).

Since struct pages are zeroed in __init_single_page(), and not at
allocation time, we must zero such struct pages explicitly.

The patch involves adding a new memblock iterator:
for_each_resv_unavail_range(i, p_start, p_end)

Which iterates through reserved && !memory lists, and we zero struct pages
explicitly by calling mm_zero_struct_page().

Signed-off-by: Pavel Tatashin <pasha.tatas...@oracle.com>
Reviewed-by: Steven Sistare <steven.sist...@oracle.com>
Reviewed-by: Daniel Jordan <daniel.m.jor...@oracle.com>
Reviewed-by: Bob Picco <bob.pi...@oracle.com>
---
 include/linux/memblock.h | 16 
 include/linux/mm.h   |  6 ++
 mm/page_alloc.c  | 30 ++
 3 files changed, 52 insertions(+)

diff --git a/include/linux/memblock.h b/include/linux/memblock.h
index bae11c7e7bf3..bdd4268f9323 100644
--- a/include/linux/memblock.h
+++ b/include/linux/memblock.h
@@ -237,6 +237,22 @@ unsigned long memblock_next_valid_pfn(unsigned long pfn, 
unsigned long max_pfn);
	for_each_mem_range_rev(i, &memblock.memory, &memblock.reserved, \
   nid, flags, p_start, p_end, p_nid)
 
+/**
+ * for_each_resv_unavail_range - iterate through reserved and unavailable 
memory
+ * @i: u64 used as loop variable
+ * @flags: pick from blocks based on memory attributes
+ * @p_start: ptr to phys_addr_t for start address of the range, can be %NULL
+ * @p_end: ptr to phys_addr_t for end address of the range, can be %NULL
+ *
+ * Walks over unavailable but reserved (reserved && !memory) areas of memblock.
+ * Available as soon as memblock is initialized.
+ * Note: because this memory does not belong to any physical node, flags and
+ * nid arguments do not make sense and thus not exported as arguments.
+ */
+#define for_each_resv_unavail_range(i, p_start, p_end) \
+   for_each_mem_range(i, &memblock.reserved, &memblock.memory, \
+  NUMA_NO_NODE, MEMBLOCK_NONE, p_start, p_end, NULL)
+
 static inline void memblock_set_region_flags(struct memblock_region *r,
 unsigned long flags)
 {
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 50b74d628243..a7bba4ce79ba 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2010,6 +2010,12 @@ extern int __meminit __early_pfn_to_nid(unsigned long 
pfn,
struct mminit_pfnnid_cache *state);
 #endif
 
+#ifdef CONFIG_HAVE_MEMBLOCK
+void zero_resv_unavail(void);
+#else
+static inline void zero_resv_unavail(void) {}
+#endif
+
 extern void set_dma_reserve(unsigned long new_dma_reserve);
 extern void memmap_init_zone(unsigned long, int, unsigned long,
unsigned long, enum memmap_context);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 4b630ee91430..1d38d391dffd 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -6202,6 +6202,34 @@ void __paginginit free_area_init_node(int nid, unsigned 
long *zones_size,
free_area_init_core(pgdat);
 }
 
+#ifdef CONFIG_HAVE_MEMBLOCK
+/*
+ * Only struct pages that are backed by physical memory are zeroed and
+ * initialized by going through __init_single_page(). But, there are some
+ * struct pages which are reserved in memblock allocator and their fields
+ * may be accessed (for example page_to_pfn() on some configuration accesses
+ * flags). We must explicitly zero those struct pages.
+ */
+void __paginginit zero_resv_unavail(void)
+{
+   phys_addr_t start, end;
+   unsigned long pfn;
+   u64 i, pgcnt;
+
+   /* Loop through ranges that are reserved, but do not have reported
+* physical memory backing.
+*/
+   pgcnt = 0;
+   for_each_resv_unavail_range(i, &start, &end) {
+   for (pfn = PFN_DOWN(start); pfn < PFN_UP(end); pfn++) {
+   mm_zero_struct_page(pfn_to_page(pfn));
+   pgcnt++;
+   }
+   }
+   pr_info("Reserved but unavailable: %lld pages", pgcnt);
+}
+#endif /* CONFIG_HAVE_MEMBLOCK */
+
 #ifdef CONFIG_HAVE_MEMBLOCK_NODE_MAP
 
 #if MAX_NUMNODES > 1
@@ -6625,6 +6653,7 @@ void __init free_area_init_nodes(unsigned long 
*max_zone_pfn)
node_set_state(nid, N_MEMORY);
check_for_memory(pgdat, nid);
}
+   zero_resv_unavail();
 }
 
 static int __init cmdline_parse_core(char *p, unsigned long *core)
@@ -6788,6 +6817,7 @@ void __init free_area_init(unsigned lon
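
for_each_resv_unavail_range() walks the parts of memblock.reserved that have
no backing in memblock.memory, i.e. a set difference of two sorted range
lists. The kernel computes this inside the memblock iterator itself; the
simple userspace loop below only models the result, with made-up ranges:

#include <stdio.h>

struct range { unsigned long start, end; };

static const struct range memory[]   = { { 0x1000, 0x8000 } };
static const struct range reserved[] = { { 0x0000, 0x2000 },    /* partly off-memory */
					 { 0x3000, 0x4000 },    /* fully backed */
					 { 0x9000, 0xa000 } };  /* not backed at all */

static void emit(unsigned long s, unsigned long e)
{
	if (s < e)
		printf("reserved but unavailable: [%#lx, %#lx)\n", s, e);
}

int main(void)
{
	for (size_t i = 0; i < sizeof(reserved) / sizeof(reserved[0]); i++) {
		unsigned long s = reserved[i].start, e = reserved[i].end;

		/* carve out every overlap with a memory range */
		for (size_t j = 0; j < sizeof(memory) / sizeof(memory[0]); j++) {
			unsigned long ms = memory[j].start, me = memory[j].end;

			if (me <= s || ms >= e)
				continue;               /* no overlap */
			emit(s, ms > s ? ms : s);       /* part before the overlap */
			s = me > s ? me : s;            /* skip past the overlap */
		}
		emit(s, e);             /* whatever remains after all overlaps */
	}
	return 0;
}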

[PATCH v8 06/11] mm: zero struct pages during initialization

2017-09-14 Thread Pavel Tatashin
Add struct page zeroing as a part of initialization of other fields in
__init_single_page().

This is single-threaded performance, collected on an Intel(R) Xeon(R) CPU
E7-8895 v3 @ 2.60GHz with 1T of memory (268400646 pages in 8 nodes):

BASEFIX
sparse_init 11.244671836s   0.007199623s
zone_sizes_init  4.879775891s   8.355182299s
  --
Total   16.124447727s   8.362381922s

sparse_init is where memory for struct pages is zeroed, and the zeroing
part is moved later in this patch into __init_single_page(), which is
called from zone_sizes_init().

Signed-off-by: Pavel Tatashin <pasha.tatas...@oracle.com>
Reviewed-by: Steven Sistare <steven.sist...@oracle.com>
Reviewed-by: Daniel Jordan <daniel.m.jor...@oracle.com>
Reviewed-by: Bob Picco <bob.pi...@oracle.com>
Acked-by: Michal Hocko <mho...@suse.com>
---
 include/linux/mm.h | 9 +
 mm/page_alloc.c| 1 +
 2 files changed, 10 insertions(+)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index f8c10d336e42..50b74d628243 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -94,6 +94,15 @@ extern int mmap_rnd_compat_bits __read_mostly;
 #define mm_forbids_zeropage(X) (0)
 #endif
 
+/*
+ * On some architectures it is expensive to call memset() for small sizes.
+ * Those architectures should provide their own implementation of "struct page"
+ * zeroing by defining this macro in <asm/pgtable.h>.
+ */
+#ifndef mm_zero_struct_page
+#define mm_zero_struct_page(pp)  ((void)memset((pp), 0, sizeof(struct page)))
+#endif
+
 /*
  * Default maximum number of active map areas, this limits the number of vmas
  * per mm struct. Users can overwrite this number by sysctl but there is a
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index a8dbd405ed94..4b630ee91430 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1170,6 +1170,7 @@ static void free_one_page(struct zone *zone,
 static void __meminit __init_single_page(struct page *page, unsigned long pfn,
unsigned long zone, int nid)
 {
+   mm_zero_struct_page(page);
set_page_links(page, zone, nid, pfn);
init_page_count(page);
page_mapcount_reset(page);
-- 
2.14.1
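
The mm_zero_struct_page() macro above defaults to memset() but can be
pre-defined by an architecture where small memsets are expensive. A
standalone sketch of that override pattern; the toy struct page and the
ARCH_FAST_ZERO switch are made up for the example:

#include <stdio.h>
#include <string.h>

struct page { unsigned long flags, word1, word2, word3; };      /* toy layout */

#ifdef ARCH_FAST_ZERO
/* hypothetical arch override: unrolled stores instead of a memset() call */
#define mm_zero_struct_page(pp) do {                    \
	struct page *__p = (pp);                        \
	__p->flags = 0; __p->word1 = 0;                 \
	__p->word2 = 0; __p->word3 = 0;                 \
} while (0)
#endif

#ifndef mm_zero_struct_page
#define mm_zero_struct_page(pp) ((void)memset((pp), 0, sizeof(struct page)))
#endif

static void init_single_page(struct page *page, unsigned long pfn)
{
	mm_zero_struct_page(page);      /* zeroing now happens here ... */
	page->flags = pfn & 0xff;       /* ... right before fields are set */
}

int main(void)
{
	struct page p;

	init_single_page(&p, 42);
	printf("flags=%lu word1=%lu\n", p.flags, p.word1);
	return 0;
}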



[PATCH v8 01/11] x86/mm: setting fields in deferred pages

2017-09-14 Thread Pavel Tatashin
Without the deferred struct page feature (CONFIG_DEFERRED_STRUCT_PAGE_INIT),
flags and other fields in struct pages are never changed prior to first
initializing the struct pages by going through __init_single_page().

With deferred struct page feature enabled, however, we set fields in
register_page_bootmem_info that are subsequently clobbered right after in
free_all_bootmem:

mem_init() {
register_page_bootmem_info();
free_all_bootmem();
...
}

When register_page_bootmem_info() is called, only non-deferred struct pages
are initialized. But this function goes through some reserved pages which
might be part of the deferred range, and thus are not yet initialized.

  mem_init
   register_page_bootmem_info
register_page_bootmem_info_node
 get_page_bootmem
  .. setting fields here ..
  such as: page->freelist = (void *)type;

  free_all_bootmem()
   free_low_memory_core_early()
for_each_reserved_mem_region()
 reserve_bootmem_region()
  init_reserved_page() <- Only if this is deferred reserved page
   __init_single_pfn()
__init_single_page()
memset(0) <-- Lose the set fields here

We end up with a situation where we currently do not observe a problem
because the memory is explicitly zeroed. But, if the flag asserts are ever
changed, we can start hitting issues.

Also, because in this patch series we will stop zeroing struct page memory
during allocation, we must make sure that struct pages are properly
initialized prior to using them.

The deferred-reserved pages are initialized in free_all_bootmem().
Therefore, the fix is to switch the above calls.

Signed-off-by: Pavel Tatashin <pasha.tatas...@oracle.com>
Reviewed-by: Steven Sistare <steven.sist...@oracle.com>
Reviewed-by: Daniel Jordan <daniel.m.jor...@oracle.com>
Reviewed-by: Bob Picco <bob.pi...@oracle.com>
---
 arch/x86/mm/init_64.c | 9 +++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index 048fbe8fc274..42b4b7a585c2 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -1173,12 +1173,17 @@ void __init mem_init(void)
 
/* clear_bss() already clear the empty_zero_page */
 
-   register_page_bootmem_info();
-
/* this will put all memory onto the freelists */
free_all_bootmem();
after_bootmem = 1;
 
+   /* Must be done after boot memory is put on freelist, because here we
+* might set fields in deferred struct pages that have not yet been
+* initialized, and free_all_bootmem() initializes all the reserved
+* deferred pages for us.
+*/
+   register_page_bootmem_info();
+
/* Register memory areas for /proc/kcore */
        kclist_add(&kcore_vsyscall, (void *)VSYSCALL_ADDR,
 PAGE_SIZE, KCORE_OTHER);
-- 
2.14.1
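
The hazard described above is purely an ordering problem: a field set on a
not-yet-initialized page is wiped by the later memset()-based
initialization. A toy sketch of that hazard and of the reordered, fixed
sequence; all names below are illustrative, not the kernel functions:

#include <stdio.h>
#include <string.h>

struct page { unsigned long flags; void *freelist; };

static struct page deferred_page;       /* stands in for a deferred struct page */

/* like the init done from free_all_bootmem(): starts with a memset */
static void init_deferred_pages(void)
{
	memset(&deferred_page, 0, sizeof(deferred_page));
	deferred_page.flags = 0x1;
}

/* like register_page_bootmem_info(): sets a field on a reserved page */
static void register_bootmem_info(void)
{
	deferred_page.freelist = (void *)0xdeadbeef;
}

int main(void)
{
	/* wrong order: the memset in the init path clobbers freelist */
	register_bootmem_info();
	init_deferred_pages();
	printf("wrong order: freelist=%p\n", deferred_page.freelist);

	/* fixed order: initialize first, then set the fields */
	init_deferred_pages();
	register_bootmem_info();
	printf("fixed order: freelist=%p\n", deferred_page.freelist);
	return 0;
}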



[PATCH v7 05/11] mm: defining memblock_virt_alloc_try_nid_raw

2017-08-28 Thread Pavel Tatashin
* A new variant of memblock_virt_alloc_* allocations:
memblock_virt_alloc_try_nid_raw()
- Does not zero the allocated memory
- Does not panic if request cannot be satisfied

* optimize early system hash allocations

Clients can call alloc_large_system_hash() with the HASH_ZERO flag to
specify that the memory allocated for the system hash needs to be zeroed;
otherwise the memory does not need to be zeroed, and the client will
initialize it.

If the memory does not need to be zeroed, call the new
memblock_virt_alloc_raw() interface, and thus improve boot performance.

* debug for raw allocator

When CONFIG_DEBUG_VM is enabled, this patch sets all the memory that is
returned by memblock_virt_alloc_try_nid_raw() to ones, to ensure that no
callers expect zeroed memory.

Signed-off-by: Pavel Tatashin <pasha.tatas...@oracle.com>
Reviewed-by: Steven Sistare <steven.sist...@oracle.com>
Reviewed-by: Daniel Jordan <daniel.m.jor...@oracle.com>
Reviewed-by: Bob Picco <bob.pi...@oracle.com>
Acked-by: Michal Hocko <mho...@suse.com>
---
 include/linux/bootmem.h | 27 ++
 mm/memblock.c   | 60 +++--
 mm/page_alloc.c | 15 ++---
 3 files changed, 87 insertions(+), 15 deletions(-)

diff --git a/include/linux/bootmem.h b/include/linux/bootmem.h
index e223d91b6439..ea30b3987282 100644
--- a/include/linux/bootmem.h
+++ b/include/linux/bootmem.h
@@ -160,6 +160,9 @@ extern void *__alloc_bootmem_low_node(pg_data_t *pgdat,
 #define BOOTMEM_ALLOC_ANYWHERE (~(phys_addr_t)0)
 
 /* FIXME: Move to memblock.h at a point where we remove nobootmem.c */
+void *memblock_virt_alloc_try_nid_raw(phys_addr_t size, phys_addr_t align,
+ phys_addr_t min_addr,
+ phys_addr_t max_addr, int nid);
 void *memblock_virt_alloc_try_nid_nopanic(phys_addr_t size,
phys_addr_t align, phys_addr_t min_addr,
phys_addr_t max_addr, int nid);
@@ -176,6 +179,14 @@ static inline void * __init memblock_virt_alloc(
NUMA_NO_NODE);
 }
 
+static inline void * __init memblock_virt_alloc_raw(
+   phys_addr_t size,  phys_addr_t align)
+{
+   return memblock_virt_alloc_try_nid_raw(size, align, BOOTMEM_LOW_LIMIT,
+   BOOTMEM_ALLOC_ACCESSIBLE,
+   NUMA_NO_NODE);
+}
+
 static inline void * __init memblock_virt_alloc_nopanic(
phys_addr_t size, phys_addr_t align)
 {
@@ -257,6 +268,14 @@ static inline void * __init memblock_virt_alloc(
return __alloc_bootmem(size, align, BOOTMEM_LOW_LIMIT);
 }
 
+static inline void * __init memblock_virt_alloc_raw(
+   phys_addr_t size,  phys_addr_t align)
+{
+   if (!align)
+   align = SMP_CACHE_BYTES;
+   return __alloc_bootmem_nopanic(size, align, BOOTMEM_LOW_LIMIT);
+}
+
 static inline void * __init memblock_virt_alloc_nopanic(
phys_addr_t size, phys_addr_t align)
 {
@@ -309,6 +328,14 @@ static inline void * __init 
memblock_virt_alloc_try_nid(phys_addr_t size,
  min_addr);
 }
 
+static inline void * __init memblock_virt_alloc_try_nid_raw(
+   phys_addr_t size, phys_addr_t align,
+   phys_addr_t min_addr, phys_addr_t max_addr, int nid)
+{
+   return ___alloc_bootmem_node_nopanic(NODE_DATA(nid), size, align,
+   min_addr, max_addr);
+}
+
 static inline void * __init memblock_virt_alloc_try_nid_nopanic(
phys_addr_t size, phys_addr_t align,
phys_addr_t min_addr, phys_addr_t max_addr, int nid)
diff --git a/mm/memblock.c b/mm/memblock.c
index 91205780e6b1..1f299fb1eb08 100644
--- a/mm/memblock.c
+++ b/mm/memblock.c
@@ -1327,7 +1327,6 @@ static void * __init memblock_virt_alloc_internal(
return NULL;
 done:
ptr = phys_to_virt(alloc);
-   memset(ptr, 0, size);
 
/*
 * The min_count is set to 0 so that bootmem allocated blocks
@@ -1340,6 +1339,45 @@ static void * __init memblock_virt_alloc_internal(
return ptr;
 }
 
+/**
+ * memblock_virt_alloc_try_nid_raw - allocate boot memory block without zeroing
+ * memory and without panicking
+ * @size: size of memory block to be allocated in bytes
+ * @align: alignment of the region and block's size
+ * @min_addr: the lower bound of the memory region from where the allocation
+ *   is preferred (phys address)
+ * @max_addr: the upper bound of the memory region from where the allocation
+ *   is preferred (phys address), or %BOOTMEM_ALLOC_ACCESSIBLE to
+ *   allocate only from memory limited by memblock.current_limit value
+ * @nid: nid of the free area to fi
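
Putting the pieces of this patch together in userspace terms: a raw
allocator that does not zero, a debug mode that poisons raw allocations with
ones so code wrongly expecting zeroed memory fails fast, and a
HASH_ZERO-style flag so callers request zeroing only when they need it.
DEBUG_RAW and the function names below are invented for the sketch and are
not the kernel interfaces:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define HASH_ZERO 0x1

static void *virt_alloc_raw(size_t size)
{
	void *p = malloc(size);

#ifdef DEBUG_RAW
	if (p)
		memset(p, 0xff, size);  /* poison: catches "expects zero" bugs */
#endif
	return p;
}

static void *alloc_system_hash(size_t size, unsigned int flags)
{
	void *p = virt_alloc_raw(size);

	if (p && (flags & HASH_ZERO))
		memset(p, 0, size);     /* zero only if the caller asked for it */
	return p;
}

int main(void)
{
	unsigned char *zeroed = alloc_system_hash(64, HASH_ZERO);
	unsigned char *raw = alloc_system_hash(64, 0);

	if (!zeroed || !raw)
		return 1;
	printf("zeroed[0]=%d, raw[0] is undefined (0xff under DEBUG_RAW)\n",
	       zeroed[0]);
	free(zeroed);
	free(raw);
	return 0;
}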
