Re: [PATCH v4 3/5] RISC-V: Initial DTS for Microchip ICICLE board

2021-04-18 Thread Vitaly Wool
Hi Atish,

On Sun, Apr 18, 2021 at 5:37 AM Atish Patra  wrote:
>
> On Mon, Mar 29, 2021 at 10:04 AM Vitaly Wool  wrote:
> >
> > On Sat, Mar 27, 2021 at 6:24 PM Alex Ghiti  wrote:
> > >
> > > Hi Atish,
> > >
> > > Le 3/3/21 à 3:02 PM, Atish Patra a écrit :
> > > > Add initial DTS for Microchip ICICLE board having only
> > > > essential devices (clocks, sdhci, ethernet, serial, etc).
> > > > The device tree is based on the U-Boot patch.
> > > >
> > > > https://patchwork.ozlabs.org/project/uboot/patch/20201110103414.10142-6-padmarao.beg...@microchip.com/
> > > >
> > > > Signed-off-by: Atish Patra 
> > > > ---
> > > >   arch/riscv/boot/dts/Makefile  |   1 +
> > > >   arch/riscv/boot/dts/microchip/Makefile|   2 +
> > > >   .../microchip/microchip-mpfs-icicle-kit.dts   |  72 
> > > >   .../boot/dts/microchip/microchip-mpfs.dtsi| 329 ++
> > > >   4 files changed, 404 insertions(+)
> > > >   create mode 100644 arch/riscv/boot/dts/microchip/Makefile
> > > >   create mode 100644 
> > > > arch/riscv/boot/dts/microchip/microchip-mpfs-icicle-kit.dts
> > > >   create mode 100644 arch/riscv/boot/dts/microchip/microchip-mpfs.dtsi
> > > >
> > > > diff --git a/arch/riscv/boot/dts/Makefile b/arch/riscv/boot/dts/Makefile
> > > > index 7ffd502e3e7b..fe996b88319e 100644
> > > > --- a/arch/riscv/boot/dts/Makefile
> > > > +++ b/arch/riscv/boot/dts/Makefile
> > > > @@ -1,5 +1,6 @@
> > > >   # SPDX-License-Identifier: GPL-2.0
> > > >   subdir-y += sifive
> > > >   subdir-$(CONFIG_SOC_CANAAN_K210_DTB_BUILTIN) += canaan
> > > > +subdir-y += microchip
> > > >
> > > >   obj-$(CONFIG_BUILTIN_DTB) := $(addsuffix /, $(subdir-y))
> > > > diff --git a/arch/riscv/boot/dts/microchip/Makefile 
> > > > b/arch/riscv/boot/dts/microchip/Makefile
> > > > new file mode 100644
> > > > index ..622b12771fd3
> > > > --- /dev/null
> > > > +++ b/arch/riscv/boot/dts/microchip/Makefile
> > > > @@ -0,0 +1,2 @@
> > > > +# SPDX-License-Identifier: GPL-2.0
> > > > +dtb-$(CONFIG_SOC_MICROCHIP_POLARFIRE) += microchip-mpfs-icicle-kit.dtb
> > >
> > > I'm playing (or trying to...) with XIP_KERNEL and I had to add the
> > > following to have the device tree actually built into the kernel:
> > >
> > > diff --git a/arch/riscv/boot/dts/microchip/Makefile
> > > b/arch/riscv/boot/dts/microchip/Makefile
> > > index 622b12771fd3..855c1502d912 100644
> > > --- a/arch/riscv/boot/dts/microchip/Makefile
> > > +++ b/arch/riscv/boot/dts/microchip/Makefile
> > > @@ -1,2 +1,3 @@
> > >   # SPDX-License-Identifier: GPL-2.0
> > >   dtb-$(CONFIG_SOC_MICROCHIP_POLARFIRE) += microchip-mpfs-icicle-kit.dtb
> > > +obj-$(CONFIG_BUILTIN_DTB) += $(addsuffix .o, $(dtb-y))
> > >
> > > Alex
> >
> > Yes, I believe this is necessary for BUILTIN_DTB to work on Polarfire,
> > regardless of whether the kernel is XIP or not.
> >
>
> But there is no use case for BUILTIN_DTB for polarfire except the XIP kernel.
> The bootloaders for polarfire are capable of providing a DTB to the kernel.

I have a hard time seeing an industrial application with a bootloader
mounting a vfat partition to load a device tree file. So there has to
be a less obscure and less time-consuming alternative. And if the
mainline kernel doesn't provide one (e.g. in the form of BUILTIN_DTB
support), it opens the door to error-prone custom solutions from
various vendors. Is that really what we want?

Best regards,
   Vitaly

> If the XIP kernel is enabled, the following line in
> arch/riscv/boot/dts/Makefile should take care of things
>
>
> > Best regards,
> >Vitaly
> >
> > > > diff --git 
> > > > a/arch/riscv/boot/dts/microchip/microchip-mpfs-icicle-kit.dts 
> > > > b/arch/riscv/boot/dts/microchip/microchip-mpfs-icicle-kit.dts
> > > > new file mode 100644
> > > > index ..ec79944065c9
> > > > --- /dev/null
> > > > +++ b/arch/riscv/boot/dts/microchip/microchip-mpfs-icicle-kit.dts
> > > > @@ -0,0 +1,72 @@
> > > > +// SPDX-License-Identifier: (GPL-2.0 OR MIT)
> > > > +/* Copyright (c) 2020 Microchip Technology Inc */
> > > > +
> > > > +/dts-v1/;
> > > > +
> > > > +#include "microchip-mpfs.

Re: [PATCH v7] RISC-V: enable XIP

2021-04-12 Thread Vitaly Wool
On Mon, Apr 12, 2021 at 7:12 AM Alex Ghiti  wrote:
>
> Le 4/9/21 à 10:42 AM, Vitaly Wool a écrit :
> > On Fri, Apr 9, 2021 at 3:59 PM Mike Rapoport  wrote:
> >>
> >> On Fri, Apr 09, 2021 at 02:46:17PM +0200, David Hildenbrand wrote:
> >>>>>> Also, will that memory properly be exposed in the resource tree as
> >>>>>> System RAM (e.g., /proc/iomem) ? Otherwise some things (/proc/kcore)
> >>>>>> won't work as expected - the kernel won't be included in a dump.
> >>>> Do we really need an XIP kernel to be included in kdump?
> >>>> And doesn't it sound weird to expose flash as System RAM in
> >>>> /proc/iomem? ;-)
> >>>
> >>> See my other mail, maybe we actually want something different.
> >>>
> >>>>
> >>>>> I have just checked and it does not appear in /proc/iomem.
> >>>>>
> >>>>> Ok your conclusion would be to have struct page, I'm going to implement 
> >>>>> this
> >>>>> version then using memblock as you described.
> >>>>
> >>>> I'm not sure this is required. With XIP, the kernel text never gets into
> >>>> RAM, so it does not seem to require struct pages.
> >>>>
> >>>> XIP by definition has some limitations relative to "normal" operation,
> >>>> so the lack of kdump could be one of them.
> >>>
> >>> I agree.
> >>>
> >>>>
> >>>> I might be wrong, but IMHO, artificially creating a memory map for part 
> >>>> of
> >>>> flash would cause more problems in the long run.
> >>>
> >>> Can you elaborate?
> >>
> >> Nothing particular, just a gut feeling. Usually, when you force something
> >> it comes out the wrong way later.
> >
> > It's still possible that MTD_XIP gets implemented, allowing writes to
> > the flash used for XIP. While the flash is being written, the memory map
> > doesn't make sense at all. I can't come up with a real-life example
> > where this actually leads to problems, but it is indeed weird when
> > System RAM suddenly becomes unreadable. I really don't think exposing
> > it in /proc/iomem is a good idea.
> >
> >>>> BTW, how does XIP account for the kernel text on other architectures that
> >>>> implement it?
> >>>
> >>> Interesting point, I thought XIP would be something new on RISC-V (well, 
> >>> at
> >>> least to me :) ). If that concept exists already, we better mimic what
> >>> existing implementations do.
> >>
> >> I had a quick glance at ARM; it seems that the kernel text does not have a
> >> memory map and does not show up in System RAM.
> >
> > Exactly, and I believe ARM64 won't do that either when it gets its own
> > XIP support (which is underway).
> >
>
>
> memmap does not seem necessary and ARM/ARM64 do not use it.
>
> But if someone tries to get a struct page from a physical address that
> lies in flash, as mentioned by David, that could lead to silent
> corruptions if something exists at the address where the struct page
> should be. And it is hard to know which features in the kernel depend
> on that.
>
> Regarding SPARSEMEM, the vmemmap lies in its own region, so that's
> unlikely to happen and we will catch those invalid accesses (and that's
> what I observed on riscv).
>
> But for FLATMEM, the memmap is in the linear mapping, so that could very
> likely happen silently.
>
> Could a simple solution be to force SPARSEMEM for those XIP kernels?
> Then wrong things could still happen, but we would see them and avoid
> spending hours debugging :)
>
> I will at least send a v8 to remove the pfn_valid modifications for
> FLATMEM that now return true for PFNs in flash.

That sounds good to me. I am not very keen on spending 200K on struct
pages for flash (we can think of this as an option, but I would
definitely want to be able to compile it out in the end), so let's
remove pfn_valid and fix whatever eventually breaks, if anything.
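
For illustration, here is a minimal userspace sketch of the FLATMEM pitfall
discussed above (this is not kernel code; the RAM base, RAM size and flash PFN
below are made-up values): a FLATMEM-style pfn_to_page() is plain array
arithmetic with no bounds check, so a PFN that lies in flash silently yields a
pointer far outside the real memory map.

#include <stdio.h>

struct page { unsigned long flags; };      /* stand-in for the real struct page */

#define RAM_BASE_PFN  0x80000UL            /* assumed: RAM starts at 0x80000000 */
#define RAM_NR_PFNS   0x10000UL            /* assumed: 256 MB of RAM            */

static struct page memmap[RAM_NR_PFNS];    /* flat map covering RAM only */

/* FLATMEM-style pfn_to_page(): pure pointer arithmetic, no bounds check */
static struct page *flat_pfn_to_page(unsigned long pfn)
{
        return &memmap[pfn - RAM_BASE_PFN];
}

int main(void)
{
        unsigned long flash_pfn = 0x21000UL;   /* assumed PFN inside the XIP flash */

        /* The subtraction wraps around, so this pointer lands far outside
         * memmap[]; a write through it would silently corrupt unrelated
         * data, which is exactly the scenario described above. */
        printf("bogus struct page at %p\n", (void *)flat_pfn_to_page(flash_pfn));
        return 0;
}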

Best regards,
   Vitaly


Re: [PATCH v7] RISC-V: enable XIP

2021-04-09 Thread Vitaly Wool
On Fri, Apr 9, 2021 at 3:59 PM Mike Rapoport  wrote:
>
> On Fri, Apr 09, 2021 at 02:46:17PM +0200, David Hildenbrand wrote:
> > > > > Also, will that memory properly be exposed in the resource tree as
> > > > > System RAM (e.g., /proc/iomem) ? Otherwise some things (/proc/kcore)
> > > > > won't work as expected - the kernel won't be included in a dump.
> > > Do we really need an XIP kernel to be included in kdump?
> > > And doesn't it sound weird to expose flash as System RAM in /proc/iomem?
> > > ;-)
> >
> > See my other mail, maybe we actually want something different.
> >
> > >
> > > > I have just checked and it does not appear in /proc/iomem.
> > > >
> > > > Ok your conclusion would be to have struct page, I'm going to implement 
> > > > this
> > > > version then using memblock as you described.
> > >
> > > I'm not sure this is required. With XIP, the kernel text never gets into
> > > RAM, so it does not seem to require struct pages.
> > >
> > > XIP by definition has some limitations relative to "normal" operation,
> > > so the lack of kdump could be one of them.
> >
> > I agree.
> >
> > >
> > > I might be wrong, but IMHO, artificially creating a memory map for part of
> > > flash would cause more problems in the long run.
> >
> > Can you elaborate?
>
> Nothing particular, just a gut feeling. Usually, when you force something
> it comes out the wrong way later.

It's still possible that MTD_XIP gets implemented, allowing writes to
the flash used for XIP. While the flash is being written, the memory map
doesn't make sense at all. I can't come up with a real-life example
where this actually leads to problems, but it is indeed weird when
System RAM suddenly becomes unreadable. I really don't think exposing
it in /proc/iomem is a good idea.

> > > BTW, how does XIP account for the kernel text on other architectures that
> > > implement it?
> >
> > Interesting point, I thought XIP would be something new on RISC-V (well, at
> > least to me :) ). If that concept exists already, we better mimic what
> > existing implementations do.
>
> I had a quick glance at ARM; it seems that the kernel text does not have a
> memory map and does not show up in System RAM.

Exactly, and I believe ARM64 won't do that either when it gets its own
XIP support (which is underway).

Best regards,
   Vitaly


Re: [PATCH v6] RISC-V: enable XIP

2021-04-07 Thread Vitaly Wool
Hi Alex,


> > All in all, I am quite sure now that your take on XIP is working fine.
> > The issue with single-core boot under QEmu seems to be less
> > reproducible on slower machines running QEmu and more reproducible on
> > higher-performance ones. It's not clear to me whether that is a QEmu
> > problem or an in-kernel race, but it's hardly an XIP problem: I was
> > able to reproduce it once on a non-XIP kernel too, by copying it to
> > RAM in u-boot and giving it a 'go'.
>
> Ok then I'll post a v7 of your patch soon, hoping it will go to for-next.
> I'll add my SoB to yours as I modified quite a few things and I think
> people need to know who to yell at, if you don't mind of course.

No, absolutely not. :) Thanks for digging into this!

Best regards,
   Vitaly


Re: [PATCH v6] RISC-V: enable XIP

2021-04-06 Thread Vitaly Wool
On Tue, Apr 6, 2021 at 8:47 AM Alex Ghiti  wrote:
>
> Hi Vitaly,
>
> Le 4/5/21 à 4:34 AM, Vitaly Wool a écrit :
> > On Sun, Apr 4, 2021 at 10:39 AM Vitaly Wool  
> > wrote:
> >>
> >> On Sat, Apr 3, 2021 at 12:00 PM Alex Ghiti  wrote:
> >>>
> >>> Hi Vitaly,
> >>>
> >>> Le 4/1/21 à 7:10 AM, Alex Ghiti a écrit :
> >>>> Le 4/1/21 à 4:52 AM, Vitaly Wool a écrit :
> >>>>> Hi Alex,
> >>>>>
> >>>>> On Thu, Apr 1, 2021 at 10:11 AM Alex Ghiti  wrote:
> >>>>>>
> >>>>>> Hi,
> >>>>>>
> >>>>>> Le 3/30/21 à 4:04 PM, Alex Ghiti a écrit :
> >>>>>>> Le 3/30/21 à 3:33 PM, Palmer Dabbelt a écrit :
> >>>>>>>> On Tue, 30 Mar 2021 11:39:10 PDT (-0700), a...@ghiti.fr wrote:
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> Le 3/30/21 à 2:26 AM, Vitaly Wool a écrit :
> >>>>>>>>>> On Tue, Mar 30, 2021 at 8:23 AM Palmer Dabbelt
> >>>>>>>>>>  wrote:
> >>>>>>>>>>>
> >>>>>>>>>>> On Sun, 21 Mar 2021 17:12:15 PDT (-0700), vitaly.w...@konsulko.com
> >>>>>>>>>>> wrote:
> >>>>>>>>>>>> Introduce XIP (eXecute In Place) support for RISC-V platforms.
> >>>>>>>>>>>> It allows code to be executed directly from non-volatile storage
> >>>>>>>>>>>> directly addressable by the CPU, such as QSPI NOR flash which can
> >>>>>>>>>>>> be found on many RISC-V platforms. This makes way for significant
> >>>>>>>>>>>> optimization of RAM footprint. The XIP kernel is not compressed
> >>>>>>>>>>>> since it has to run directly from flash, so it will occupy more
> >>>>>>>>>>>> space on the non-volatile storage. The physical flash address 
> >>>>>>>>>>>> used
> >>>>>>>>>>>> to link the kernel object files and for storing it has to be 
> >>>>>>>>>>>> known
> >>>>>>>>>>>> at compile time and is represented by a Kconfig option.
> >>>>>>>>>>>>
> >>>>>>>>>>>> XIP on RISC-V will for the time being only work on MMU-enabled
> >>>>>>>>>>>> kernels.
> >>>>>>>>>>>>
> >>>>>>>>>>>> Signed-off-by: Vitaly Wool 
> >>>>>>>>>>>>
> >>>>>>>>>>>> ---
> >>>>>>>>>>>>
> >>>>>>>>>>>> Changes in v2:
> >>>>>>>>>>>> - dedicated macro for XIP address fixup when MMU is not enabled
> >>>>>>>>>>>> yet
> >>>>>>>>>>>>  o both for 32-bit and 64-bit RISC-V
> >>>>>>>>>>>> - SP is explicitly set to a safe place in RAM before
> >>>>>>>>>>>> __copy_data call
> >>>>>>>>>>>> - removed redundant alignment requirements in vmlinux-xip.lds.S
> >>>>>>>>>>>> - changed long -> uintptr_t typecast in __XIP_FIXUP macro.
> >>>>>>>>>>>> Changes in v3:
> >>>>>>>>>>>> - rebased against latest for-next
> >>>>>>>>>>>> - XIP address fixup macro now takes an argument
> >>>>>>>>>>>> - SMP related fixes
> >>>>>>>>>>>> Changes in v4:
> >>>>>>>>>>>> - rebased against the current for-next
> >>>>>>>>>>>> - less #ifdef's in C/ASM code
> >>>>>>>>>>>> - dedicated XIP_FIXUP_OFFSET assembler macro in head.S
> >>>>>>>>>>>> - C-specific definitions moved into #ifndef __ASSEMBLY__
> >>>>>>>>>>>> - Fixed multi-core boot
> >>>>>>>>>>>> Changes in v5:
> >>>>>>>>>>>> - fixed build error for non-XI

Re: [PATCH v6] RISC-V: enable XIP

2021-04-05 Thread Vitaly Wool
On Sun, Apr 4, 2021 at 10:39 AM Vitaly Wool  wrote:
>
> On Sat, Apr 3, 2021 at 12:00 PM Alex Ghiti  wrote:
> >
> > Hi Vitaly,
> >
> > Le 4/1/21 à 7:10 AM, Alex Ghiti a écrit :
> > > Le 4/1/21 à 4:52 AM, Vitaly Wool a écrit :
> > >> Hi Alex,
> > >>
> > >> On Thu, Apr 1, 2021 at 10:11 AM Alex Ghiti  wrote:
> > >>>
> > >>> Hi,
> > >>>
> > >>> Le 3/30/21 à 4:04 PM, Alex Ghiti a écrit :
> > >>>> Le 3/30/21 à 3:33 PM, Palmer Dabbelt a écrit :
> > >>>>> On Tue, 30 Mar 2021 11:39:10 PDT (-0700), a...@ghiti.fr wrote:
> > >>>>>>
> > >>>>>>
> > >>>>>> Le 3/30/21 à 2:26 AM, Vitaly Wool a écrit :
> > >>>>>>> On Tue, Mar 30, 2021 at 8:23 AM Palmer Dabbelt
> > >>>>>>>  wrote:
> > >>>>>>>>
> > >>>>>>>> On Sun, 21 Mar 2021 17:12:15 PDT (-0700), vitaly.w...@konsulko.com
> > >>>>>>>> wrote:
> > >>>>>>>>> Introduce XIP (eXecute In Place) support for RISC-V platforms.
> > >>>>>>>>> It allows code to be executed directly from non-volatile storage
> > >>>>>>>>> directly addressable by the CPU, such as QSPI NOR flash which can
> > >>>>>>>>> be found on many RISC-V platforms. This makes way for significant
> > >>>>>>>>> optimization of RAM footprint. The XIP kernel is not compressed
> > >>>>>>>>> since it has to run directly from flash, so it will occupy more
> > >>>>>>>>> space on the non-volatile storage. The physical flash address used
> > >>>>>>>>> to link the kernel object files and for storing it has to be known
> > >>>>>>>>> at compile time and is represented by a Kconfig option.
> > >>>>>>>>>
> > >>>>>>>>> XIP on RISC-V will for the time being only work on MMU-enabled
> > >>>>>>>>> kernels.
> > >>>>>>>>>
> > >>>>>>>>> Signed-off-by: Vitaly Wool 
> > >>>>>>>>>
> > >>>>>>>>> ---
> > >>>>>>>>>
> > >>>>>>>>> Changes in v2:
> > >>>>>>>>> - dedicated macro for XIP address fixup when MMU is not enabled
> > >>>>>>>>> yet
> > >>>>>>>>> o both for 32-bit and 64-bit RISC-V
> > >>>>>>>>> - SP is explicitly set to a safe place in RAM before
> > >>>>>>>>> __copy_data call
> > >>>>>>>>> - removed redundant alignment requirements in vmlinux-xip.lds.S
> > >>>>>>>>> - changed long -> uintptr_t typecast in __XIP_FIXUP macro.
> > >>>>>>>>> Changes in v3:
> > >>>>>>>>> - rebased against latest for-next
> > >>>>>>>>> - XIP address fixup macro now takes an argument
> > >>>>>>>>> - SMP related fixes
> > >>>>>>>>> Changes in v4:
> > >>>>>>>>> - rebased against the current for-next
> > >>>>>>>>> - less #ifdef's in C/ASM code
> > >>>>>>>>> - dedicated XIP_FIXUP_OFFSET assembler macro in head.S
> > >>>>>>>>> - C-specific definitions moved into #ifndef __ASSEMBLY__
> > >>>>>>>>> - Fixed multi-core boot
> > >>>>>>>>> Changes in v5:
> > >>>>>>>>> - fixed build error for non-XIP kernels
> > >>>>>>>>> Changes in v6:
> > >>>>>>>>> - XIP_PHYS_RAM_BASE config option renamed to PHYS_RAM_BASE
> > >>>>>>>>> - added PHYS_RAM_BASE_FIXED config flag to allow usage of
> > >>>>>>>>> PHYS_RAM_BASE in non-XIP configurations if needed
> > >>>>>>>>> - XIP_FIXUP macro rewritten with a temporary variable to avoid
> > >>>>>>>>> side
> > >>>>>>>>> effects
> > >>>>>>>>> - fixed crash for non-XIP kernels that don't use b

Re: [PATCH v6] RISC-V: enable XIP

2021-04-04 Thread Vitaly Wool
On Sat, Apr 3, 2021 at 12:00 PM Alex Ghiti  wrote:
>
> Hi Vitaly,
>
> Le 4/1/21 à 7:10 AM, Alex Ghiti a écrit :
> > Le 4/1/21 à 4:52 AM, Vitaly Wool a écrit :
> >> Hi Alex,
> >>
> >> On Thu, Apr 1, 2021 at 10:11 AM Alex Ghiti  wrote:
> >>>
> >>> Hi,
> >>>
> >>> Le 3/30/21 à 4:04 PM, Alex Ghiti a écrit :
> >>>> Le 3/30/21 à 3:33 PM, Palmer Dabbelt a écrit :
> >>>>> On Tue, 30 Mar 2021 11:39:10 PDT (-0700), a...@ghiti.fr wrote:
> >>>>>>
> >>>>>>
> >>>>>> Le 3/30/21 à 2:26 AM, Vitaly Wool a écrit :
> >>>>>>> On Tue, Mar 30, 2021 at 8:23 AM Palmer Dabbelt
> >>>>>>>  wrote:
> >>>>>>>>
> >>>>>>>> On Sun, 21 Mar 2021 17:12:15 PDT (-0700), vitaly.w...@konsulko.com
> >>>>>>>> wrote:
> >>>>>>>>> Introduce XIP (eXecute In Place) support for RISC-V platforms.
> >>>>>>>>> It allows code to be executed directly from non-volatile storage
> >>>>>>>>> directly addressable by the CPU, such as QSPI NOR flash which can
> >>>>>>>>> be found on many RISC-V platforms. This makes way for significant
> >>>>>>>>> optimization of RAM footprint. The XIP kernel is not compressed
> >>>>>>>>> since it has to run directly from flash, so it will occupy more
> >>>>>>>>> space on the non-volatile storage. The physical flash address used
> >>>>>>>>> to link the kernel object files and for storing it has to be known
> >>>>>>>>> at compile time and is represented by a Kconfig option.
> >>>>>>>>>
> >>>>>>>>> XIP on RISC-V will for the time being only work on MMU-enabled
> >>>>>>>>> kernels.
> >>>>>>>>>
> >>>>>>>>> Signed-off-by: Vitaly Wool 
> >>>>>>>>>
> >>>>>>>>> ---
> >>>>>>>>>
> >>>>>>>>> Changes in v2:
> >>>>>>>>> - dedicated macro for XIP address fixup when MMU is not enabled
> >>>>>>>>> yet
> >>>>>>>>> o both for 32-bit and 64-bit RISC-V
> >>>>>>>>> - SP is explicitly set to a safe place in RAM before
> >>>>>>>>> __copy_data call
> >>>>>>>>> - removed redundant alignment requirements in vmlinux-xip.lds.S
> >>>>>>>>> - changed long -> uintptr_t typecast in __XIP_FIXUP macro.
> >>>>>>>>> Changes in v3:
> >>>>>>>>> - rebased against latest for-next
> >>>>>>>>> - XIP address fixup macro now takes an argument
> >>>>>>>>> - SMP related fixes
> >>>>>>>>> Changes in v4:
> >>>>>>>>> - rebased against the current for-next
> >>>>>>>>> - less #ifdef's in C/ASM code
> >>>>>>>>> - dedicated XIP_FIXUP_OFFSET assembler macro in head.S
> >>>>>>>>> - C-specific definitions moved into #ifndef __ASSEMBLY__
> >>>>>>>>> - Fixed multi-core boot
> >>>>>>>>> Changes in v5:
> >>>>>>>>> - fixed build error for non-XIP kernels
> >>>>>>>>> Changes in v6:
> >>>>>>>>> - XIP_PHYS_RAM_BASE config option renamed to PHYS_RAM_BASE
> >>>>>>>>> - added PHYS_RAM_BASE_FIXED config flag to allow usage of
> >>>>>>>>> PHYS_RAM_BASE in non-XIP configurations if needed
> >>>>>>>>> - XIP_FIXUP macro rewritten with a tempoarary variable to avoid
> >>>>>>>>> side
> >>>>>>>>> effects
> >>>>>>>>> - fixed crash for non-XIP kernels that don't use built-in DTB
> >>>>>>>>
> >>>>>>>> So v5 landed on for-next, which generally means it's best to avoid
> >>>>>>>> re-spinning the patch and instead send along fixups.  That said,
> >>>>>>>> the v5
> >>>>>>>> is causing some testing failures fo

Re: [PATCH v6] RISC-V: enable XIP

2021-04-01 Thread Vitaly Wool
Hi Alex,

On Thu, Apr 1, 2021 at 10:11 AM Alex Ghiti  wrote:
>
> Hi,
>
> Le 3/30/21 à 4:04 PM, Alex Ghiti a écrit :
> > Le 3/30/21 à 3:33 PM, Palmer Dabbelt a écrit :
> >> On Tue, 30 Mar 2021 11:39:10 PDT (-0700), a...@ghiti.fr wrote:
> >>>
> >>>
> >>> Le 3/30/21 à 2:26 AM, Vitaly Wool a écrit :
> >>>> On Tue, Mar 30, 2021 at 8:23 AM Palmer Dabbelt
> >>>>  wrote:
> >>>>>
> >>>>> On Sun, 21 Mar 2021 17:12:15 PDT (-0700), vitaly.w...@konsulko.com
> >>>>> wrote:
> >>>>>> Introduce XIP (eXecute In Place) support for RISC-V platforms.
> >>>>>> It allows code to be executed directly from non-volatile storage
> >>>>>> directly addressable by the CPU, such as QSPI NOR flash which can
> >>>>>> be found on many RISC-V platforms. This makes way for significant
> >>>>>> optimization of RAM footprint. The XIP kernel is not compressed
> >>>>>> since it has to run directly from flash, so it will occupy more
> >>>>>> space on the non-volatile storage. The physical flash address used
> >>>>>> to link the kernel object files and for storing it has to be known
> >>>>>> at compile time and is represented by a Kconfig option.
> >>>>>>
> >>>>>> XIP on RISC-V will for the time being only work on MMU-enabled
> >>>>>> kernels.
> >>>>>>
> >>>>>> Signed-off-by: Vitaly Wool 
> >>>>>>
> >>>>>> ---
> >>>>>>
> >>>>>> Changes in v2:
> >>>>>> - dedicated macro for XIP address fixup when MMU is not enabled yet
> >>>>>>o both for 32-bit and 64-bit RISC-V
> >>>>>> - SP is explicitly set to a safe place in RAM before __copy_data call
> >>>>>> - removed redundant alignment requirements in vmlinux-xip.lds.S
> >>>>>> - changed long -> uintptr_t typecast in __XIP_FIXUP macro.
> >>>>>> Changes in v3:
> >>>>>> - rebased against latest for-next
> >>>>>> - XIP address fixup macro now takes an argument
> >>>>>> - SMP related fixes
> >>>>>> Changes in v4:
> >>>>>> - rebased against the current for-next
> >>>>>> - less #ifdef's in C/ASM code
> >>>>>> - dedicated XIP_FIXUP_OFFSET assembler macro in head.S
> >>>>>> - C-specific definitions moved into #ifndef __ASSEMBLY__
> >>>>>> - Fixed multi-core boot
> >>>>>> Changes in v5:
> >>>>>> - fixed build error for non-XIP kernels
> >>>>>> Changes in v6:
> >>>>>> - XIP_PHYS_RAM_BASE config option renamed to PHYS_RAM_BASE
> >>>>>> - added PHYS_RAM_BASE_FIXED config flag to allow usage of
> >>>>>>PHYS_RAM_BASE in non-XIP configurations if needed
> >>>>>> - XIP_FIXUP macro rewritten with a temporary variable to avoid side
> >>>>>>effects
> >>>>>> - fixed crash for non-XIP kernels that don't use built-in DTB
> >>>>>
> >>>>> So v5 landed on for-next, which generally means it's best to avoid
> >>>>> re-spinning the patch and instead send along fixups.  That said,
> >>>>> the v5
> >>>>> is causing some testing failures for me.
> >>>>>
> >>>>> I'm going to drop the v5 for now as I don't have time to test this
> >>>>> tonight.  I'll try and take a look soon, as it will conflict with
> >>>>> Alex's
> >>>>> patches.
> >>>>
> >>>> I can come up with the incremental patch instead pretty much straight
> >>>> away if that works better.
> >>>>
> >>>> ~Vitaly
> >>>>
> >>>>>>   arch/riscv/Kconfig  |  49 ++-
> >>>>>>   arch/riscv/Makefile |   8 +-
> >>>>>>   arch/riscv/boot/Makefile|  13 +++
> >>>>>>   arch/riscv/include/asm/pgtable.h|  65 --
> >>>>>>   arch/riscv/kernel/cpu_ops_sbi.c |  11 ++-
> >>>>>>   arch/riscv/kernel/head.S|  49 ++-
> >>>>>>   arch/riscv/kernel/head.h|   3 +
> 

Re: [PATCH v6] RISC-V: enable XIP

2021-03-31 Thread Vitaly Wool
Hi Kefeng,

On Wed, Mar 31, 2021 at 10:37 AM Kefeng Wang  wrote:
>
> Hi, some errors occur when the XIP_KERNEL config is enabled;
> ARCH_HAS_STRICT_KERNEL_RWX should be disabled when XIP_KERNEL is enabled,
> but there are:
>
> riscv64-linux-ld: section .data LMA [0080,008cd37f]
> overlaps section .rodata LMA [00706bc0,0085dd67]
> riscv64-linux-ld: section .pci_fixup LMA
> [0085dd68,00861397] overlaps section .data LMA
> [0080,008cd37f]
> riscv64-linux-ld: arch/riscv/mm/init.o: in function `.L138':
> init.c:(.text+0x232): undefined reference to `__init_text_begin'
> riscv64-linux-ld: arch/riscv/mm/init.o: in function
> `protect_kernel_text_data':
> init.c:(.text+0x23a): undefined reference to `__init_data_begin'
> riscv64-linux-ld: init.c:(.text+0x28c): undefined reference to
> `__init_text_begin'
> riscv64-linux-ld: init.c:(.text+0x2a0): undefined reference to
> `__init_data_begin'

All the RO sections have to fit in 8 MB for an xipImage. Could you please
remove the unnecessary parts from your kernel and retry?

Thanks,
   Vitaly

> On 2021/3/22 8:12, Vitaly Wool wrote:
> > Introduce XIP (eXecute In Place) support for RISC-V platforms.
> > It allows code to be executed directly from non-volatile storage
> > directly addressable by the CPU, such as QSPI NOR flash which can
> > be found on many RISC-V platforms. This makes way for significant
> > optimization of RAM footprint. The XIP kernel is not compressed
> > since it has to run directly from flash, so it will occupy more
> > space on the non-volatile storage. The physical flash address used
> > to link the kernel object files and for storing it has to be known
> > at compile time and is represented by a Kconfig option.
> >
> > XIP on RISC-V will for the time being only work on MMU-enabled
> > kernels.
> >
> > Signed-off-by: Vitaly Wool 
> >
> > ---
> >
> > Changes in v2:
> > - dedicated macro for XIP address fixup when MMU is not enabled yet
> >o both for 32-bit and 64-bit RISC-V
> > - SP is explicitly set to a safe place in RAM before __copy_data call
> > - removed redundant alignment requirements in vmlinux-xip.lds.S
> > - changed long -> uintptr_t typecast in __XIP_FIXUP macro.
> > Changes in v3:
> > - rebased against latest for-next
> > - XIP address fixup macro now takes an argument
> > - SMP related fixes
> > Changes in v4:
> > - rebased against the current for-next
> > - less #ifdef's in C/ASM code
> > - dedicated XIP_FIXUP_OFFSET assembler macro in head.S
> > - C-specific definitions moved into #ifndef __ASSEMBLY__
> > - Fixed multi-core boot
> > Changes in v5:
> > - fixed build error for non-XIP kernels
> > Changes in v6:
> > - XIP_PHYS_RAM_BASE config option renamed to PHYS_RAM_BASE
> > - added PHYS_RAM_BASE_FIXED config flag to allow usage of
> >PHYS_RAM_BASE in non-XIP configurations if needed
> > - XIP_FIXUP macro rewritten with a temporary variable to avoid side
> >effects
> > - fixed crash for non-XIP kernels that don't use built-in DTB
> >
> >   arch/riscv/Kconfig  |  49 ++-
> >   arch/riscv/Makefile |   8 +-
> >   arch/riscv/boot/Makefile|  13 +++
> >   arch/riscv/include/asm/pgtable.h|  65 --
> >   arch/riscv/kernel/cpu_ops_sbi.c |  11 ++-
> >   arch/riscv/kernel/head.S|  49 ++-
> >   arch/riscv/kernel/head.h|   3 +
> >   arch/riscv/kernel/setup.c   |   8 +-
> >   arch/riscv/kernel/vmlinux-xip.lds.S | 132 
> >   arch/riscv/kernel/vmlinux.lds.S |   6 ++
> >   arch/riscv/mm/init.c| 100 +++--
> >   11 files changed, 426 insertions(+), 18 deletions(-)
> >   create mode 100644 arch/riscv/kernel/vmlinux-xip.lds.S
> >
> > diff --git a/arch/riscv/Kconfig b/arch/riscv/Kconfig
> > index 8ea60a0a19ae..bd6f82240c34 100644
> > --- a/arch/riscv/Kconfig
> > +++ b/arch/riscv/Kconfig
> > @@ -441,7 +441,7 @@ config EFI_STUB
> >
> >   config EFI
> >   bool "UEFI runtime support"
> > - depends on OF
> > + depends on OF && !XIP_KERNEL
> >   select LIBFDT
> >   select UCS2_STRING
> >   select EFI_PARAMS_FROM_FDT
> > @@ -465,11 +465,56 @@ config STACKPROTECTOR_PER_TASK
> >   def_bool y
> >   depends on STACKPROTECTOR && CC_HAVE_STACKPROTECTOR_TLS
> >
> > +config PHYS_RAM_BASE_FIXED
> > + bool "Explicitly specified physical RAM address"
> > +  

Re: [PATCH v6] RISC-V: enable XIP

2021-03-30 Thread Vitaly Wool
On Tue, Mar 30, 2021 at 8:23 AM Palmer Dabbelt  wrote:
>
> On Sun, 21 Mar 2021 17:12:15 PDT (-0700), vitaly.w...@konsulko.com wrote:
> > Introduce XIP (eXecute In Place) support for RISC-V platforms.
> > It allows code to be executed directly from non-volatile storage
> > directly addressable by the CPU, such as QSPI NOR flash which can
> > be found on many RISC-V platforms. This makes way for significant
> > optimization of RAM footprint. The XIP kernel is not compressed
> > since it has to run directly from flash, so it will occupy more
> > space on the non-volatile storage. The physical flash address used
> > to link the kernel object files and for storing it has to be known
> > at compile time and is represented by a Kconfig option.
> >
> > XIP on RISC-V will for the time being only work on MMU-enabled
> > kernels.
> >
> > Signed-off-by: Vitaly Wool 
> >
> > ---
> >
> > Changes in v2:
> > - dedicated macro for XIP address fixup when MMU is not enabled yet
> >   o both for 32-bit and 64-bit RISC-V
> > - SP is explicitly set to a safe place in RAM before __copy_data call
> > - removed redundant alignment requirements in vmlinux-xip.lds.S
> > - changed long -> uintptr_t typecast in __XIP_FIXUP macro.
> > Changes in v3:
> > - rebased against latest for-next
> > - XIP address fixup macro now takes an argument
> > - SMP related fixes
> > Changes in v4:
> > - rebased against the current for-next
> > - less #ifdef's in C/ASM code
> > - dedicated XIP_FIXUP_OFFSET assembler macro in head.S
> > - C-specific definitions moved into #ifndef __ASSEMBLY__
> > - Fixed multi-core boot
> > Changes in v5:
> > - fixed build error for non-XIP kernels
> > Changes in v6:
> > - XIP_PHYS_RAM_BASE config option renamed to PHYS_RAM_BASE
> > - added PHYS_RAM_BASE_FIXED config flag to allow usage of
> >   PHYS_RAM_BASE in non-XIP configurations if needed
> > - XIP_FIXUP macro rewritten with a temporary variable to avoid
> >   effects
> > - fixed crash for non-XIP kernels that don't use built-in DTB
>
> So v5 landed on for-next, which generally means it's best to avoid
> re-spinning the patch and instead send along fixups.  That said, the v5
> is causing some testing failures for me.
>
> I'm going to drop the v5 for now as I don't have time to test this
> tonight.  I'll try and take a look soon, as it will conflict with Alex's
> patches.

I can come up with the incremental patch instead pretty much straight
away if that works better.

~Vitaly

> >  arch/riscv/Kconfig  |  49 ++-
> >  arch/riscv/Makefile |   8 +-
> >  arch/riscv/boot/Makefile|  13 +++
> >  arch/riscv/include/asm/pgtable.h|  65 --
> >  arch/riscv/kernel/cpu_ops_sbi.c |  11 ++-
> >  arch/riscv/kernel/head.S|  49 ++-
> >  arch/riscv/kernel/head.h|   3 +
> >  arch/riscv/kernel/setup.c   |   8 +-
> >  arch/riscv/kernel/vmlinux-xip.lds.S | 132 
> >  arch/riscv/kernel/vmlinux.lds.S |   6 ++
> >  arch/riscv/mm/init.c| 100 +++--
> >  11 files changed, 426 insertions(+), 18 deletions(-)
> >  create mode 100644 arch/riscv/kernel/vmlinux-xip.lds.S
> >
> > diff --git a/arch/riscv/Kconfig b/arch/riscv/Kconfig
> > index 8ea60a0a19ae..bd6f82240c34 100644
> > --- a/arch/riscv/Kconfig
> > +++ b/arch/riscv/Kconfig
> > @@ -441,7 +441,7 @@ config EFI_STUB
> >
> >  config EFI
> >   bool "UEFI runtime support"
> > - depends on OF
> > + depends on OF && !XIP_KERNEL
> >   select LIBFDT
> >   select UCS2_STRING
> >   select EFI_PARAMS_FROM_FDT
> > @@ -465,11 +465,56 @@ config STACKPROTECTOR_PER_TASK
> >   def_bool y
> >   depends on STACKPROTECTOR && CC_HAVE_STACKPROTECTOR_TLS
> >
> > +config PHYS_RAM_BASE_FIXED
> > + bool "Explicitly specified physical RAM address"
> > + default n
> > +
> > +config PHYS_RAM_BASE
> > + hex "Platform Physical RAM address"
> > + depends on PHYS_RAM_BASE_FIXED
> > + default "0x8000"
> > + help
> > +   This is the physical address of RAM in the system. It has to be
> > +   explicitly specified to run early relocations of read-write data
> > +   from flash to RAM.
> > +
> > +config XIP_KERNEL
> > + bool "Kernel Execute-In-Place from ROM"
> > + depends on MMU
> > + select PHYS

Re: [PATCH v4 3/5] RISC-V: Initial DTS for Microchip ICICLE board

2021-03-28 Thread Vitaly Wool
On Sat, Mar 27, 2021 at 6:24 PM Alex Ghiti  wrote:
>
> Hi Atish,
>
> Le 3/3/21 à 3:02 PM, Atish Patra a écrit :
> > Add initial DTS for Microchip ICICLE board having only
> > essential devices (clocks, sdhci, ethernet, serial, etc).
> > The device tree is based on the U-Boot patch.
> >
> > https://patchwork.ozlabs.org/project/uboot/patch/20201110103414.10142-6-padmarao.beg...@microchip.com/
> >
> > Signed-off-by: Atish Patra 
> > ---
> >   arch/riscv/boot/dts/Makefile  |   1 +
> >   arch/riscv/boot/dts/microchip/Makefile|   2 +
> >   .../microchip/microchip-mpfs-icicle-kit.dts   |  72 
> >   .../boot/dts/microchip/microchip-mpfs.dtsi| 329 ++
> >   4 files changed, 404 insertions(+)
> >   create mode 100644 arch/riscv/boot/dts/microchip/Makefile
> >   create mode 100644 
> > arch/riscv/boot/dts/microchip/microchip-mpfs-icicle-kit.dts
> >   create mode 100644 arch/riscv/boot/dts/microchip/microchip-mpfs.dtsi
> >
> > diff --git a/arch/riscv/boot/dts/Makefile b/arch/riscv/boot/dts/Makefile
> > index 7ffd502e3e7b..fe996b88319e 100644
> > --- a/arch/riscv/boot/dts/Makefile
> > +++ b/arch/riscv/boot/dts/Makefile
> > @@ -1,5 +1,6 @@
> >   # SPDX-License-Identifier: GPL-2.0
> >   subdir-y += sifive
> >   subdir-$(CONFIG_SOC_CANAAN_K210_DTB_BUILTIN) += canaan
> > +subdir-y += microchip
> >
> >   obj-$(CONFIG_BUILTIN_DTB) := $(addsuffix /, $(subdir-y))
> > diff --git a/arch/riscv/boot/dts/microchip/Makefile 
> > b/arch/riscv/boot/dts/microchip/Makefile
> > new file mode 100644
> > index ..622b12771fd3
> > --- /dev/null
> > +++ b/arch/riscv/boot/dts/microchip/Makefile
> > @@ -0,0 +1,2 @@
> > +# SPDX-License-Identifier: GPL-2.0
> > +dtb-$(CONFIG_SOC_MICROCHIP_POLARFIRE) += microchip-mpfs-icicle-kit.dtb
>
> I'm playing (or trying to...) with XIP_KERNEL and I had to add the
> following to have the device tree actually built into the kernel:
>
> diff --git a/arch/riscv/boot/dts/microchip/Makefile
> b/arch/riscv/boot/dts/microchip/Makefile
> index 622b12771fd3..855c1502d912 100644
> --- a/arch/riscv/boot/dts/microchip/Makefile
> +++ b/arch/riscv/boot/dts/microchip/Makefile
> @@ -1,2 +1,3 @@
>   # SPDX-License-Identifier: GPL-2.0
>   dtb-$(CONFIG_SOC_MICROCHIP_POLARFIRE) += microchip-mpfs-icicle-kit.dtb
> +obj-$(CONFIG_BUILTIN_DTB) += $(addsuffix .o, $(dtb-y))
>
> Alex

Yes, I believe this is necessary for BUILTIN_DTB to work on Polarfire,
regardless of whether the kernel is XIP or not.

Best regards,
   Vitaly

> > diff --git a/arch/riscv/boot/dts/microchip/microchip-mpfs-icicle-kit.dts 
> > b/arch/riscv/boot/dts/microchip/microchip-mpfs-icicle-kit.dts
> > new file mode 100644
> > index ..ec79944065c9
> > --- /dev/null
> > +++ b/arch/riscv/boot/dts/microchip/microchip-mpfs-icicle-kit.dts
> > @@ -0,0 +1,72 @@
> > +// SPDX-License-Identifier: (GPL-2.0 OR MIT)
> > +/* Copyright (c) 2020 Microchip Technology Inc */
> > +
> > +/dts-v1/;
> > +
> > +#include "microchip-mpfs.dtsi"
> > +
> > +/* Clock frequency (in Hz) of the rtcclk */
> > +#define RTCCLK_FREQ  100
> > +
> > +/ {
> > + #address-cells = <2>;
> > + #size-cells = <2>;
> > + model = "Microchip PolarFire-SoC Icicle Kit";
> > + compatible = "microchip,mpfs-icicle-kit";
> > +
> > + chosen {
> > + stdout-path = 
> > + };
> > +
> > + cpus {
> > + timebase-frequency = ;
> > + };
> > +
> > + memory@8000 {
> > + device_type = "memory";
> > + reg = <0x0 0x8000 0x0 0x4000>;
> > + clocks = < 26>;
> > + };
> > +
> > + soc {
> > + };
> > +};
> > +
> > + {
> > + status = "okay";
> > +};
> > +
> > + {
> > + status = "okay";
> > +};
> > +
> > + {
> > + status = "okay";
> > +};
> > +
> > + {
> > + status = "okay";
> > +};
> > +
> > + {
> > + status = "okay";
> > +};
> > +
> > + {
> > + phy-mode = "sgmii";
> > + phy-handle = <>;
> > + phy0: ethernet-phy@8 {
> > + reg = <8>;
> > + ti,fifo-depth = <0x01>;
> > + };
> > +};
> > +
> > + {
> > + status = "okay";
> > + phy-mode = "sgmii";
> > + phy-handle = <>;
> > + phy1: ethernet-phy@9 {
> > + reg = <9>;
> > + ti,fifo-depth = <0x01>;
> > + };
> > +};
> > diff --git a/arch/riscv/boot/dts/microchip/microchip-mpfs.dtsi 
> > b/arch/riscv/boot/dts/microchip/microchip-mpfs.dtsi
> > new file mode 100644
> > index ..b9819570a7d1
> > --- /dev/null
> > +++ b/arch/riscv/boot/dts/microchip/microchip-mpfs.dtsi
> > @@ -0,0 +1,329 @@
> > +// SPDX-License-Identifier: (GPL-2.0 OR MIT)
> > +/* Copyright (c) 2020 Microchip Technology Inc */
> > +
> > +/dts-v1/;
> > +
> > +/ {
> > + #address-cells = <2>;
> > + #size-cells = <2>;
> > + model = "Microchip MPFS Icicle Kit";
> > + compatible = "microchip,mpfs-icicle-kit";
> > +
> > + chosen {
> > + };
> > +
> > + cpus {
> > + #address-cells = <1>;
> > +

[PATCH] riscv: XIP: fix build for rv32

2021-03-27 Thread Vitaly Wool
32-bit RISC-V uses folded page tables by default, so we should
follow that in the XIP-specific part of init too.

Signed-off-by: Vitaly Wool 
Reported-by: kernel test robot 
---
 arch/riscv/mm/init.c | 12 +++-
 1 file changed, 11 insertions(+), 1 deletion(-)

diff --git a/arch/riscv/mm/init.c b/arch/riscv/mm/init.c
index fe583c2aa5a2..8c0eeaae67a3 100644
--- a/arch/riscv/mm/init.c
+++ b/arch/riscv/mm/init.c
@@ -481,6 +481,7 @@ asmlinkage void __init setup_vm(uintptr_t dtb_pa)
 #endif
 
 #ifdef CONFIG_XIP_KERNEL
+#ifndef __PAGETABLE_PMD_FOLDED
create_pgd_mapping(trampoline_pg_dir, XIP_VIRT_ADDR_START,
   (uintptr_t)xip_pmd, PGDIR_SIZE, PAGE_TABLE);
for (va = XIP_VIRT_ADDR_START;
@@ -493,7 +494,16 @@ asmlinkage void __init setup_vm(uintptr_t dtb_pa)
 
create_pgd_mapping(early_pg_dir, XIP_VIRT_ADDR_START,
   (uintptr_t)xip_pmd, PGDIR_SIZE, PAGE_TABLE);
-#endif
+#else
+   for (va = XIP_VIRT_ADDR_START;
+va < XIP_VIRT_ADDR_START + xiprom_sz;
+va += map_size) {
+   create_pgd_mapping(early_pg_dir, va,
+  xiprom + (va - XIP_VIRT_ADDR_START),
+  map_size, PAGE_KERNEL_EXEC);
+   }
+#endif /* __PAGETABLE_PMD_FOLDED */
+#endif /* CONFIG_XIP_KERNEL */
 
/*
 * Setup early PGD covering entire kernel which will allows
-- 
2.29.2
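
For readers unfamiliar with the folded-PMD configuration this patch handles,
a hedged sketch of the idea (the constant and function below are invented for
illustration, not the kernel's actual implementation): with the PMD level
folded, there is no intermediate table to point the PGD at, so the leaf
mappings have to be installed directly at PGD level, which is what the new
#else branch does.

#include <stdio.h>
#include <stdbool.h>

#define PMD_LEVEL_FOLDED true   /* assumed: 32-bit / Sv32-style two-level tables */

static void map_xip_region(unsigned long va, unsigned long pa, unsigned long size)
{
        if (PMD_LEVEL_FOLDED) {
                /* two-level tables: install the leaf mapping straight in the
                 * top-level directory, as the patch's #else branch does */
                printf("pgd[va=%#lx] -> pa=%#lx (%lu bytes, leaf)\n", va, pa, size);
        } else {
                /* three-level tables: the top-level entry points to a PMD,
                 * and the leaf mappings go into that PMD instead */
                printf("pgd[va=%#lx] -> pmd -> pa=%#lx (%lu bytes)\n", va, pa, size);
        }
}

int main(void)
{
        /* invented example addresses and size */
        map_xip_region(0x40000000UL, 0x21000000UL, 2UL << 20);
        return 0;
}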



Re: [PATCH v5] RISC-V: enable XIP

2021-03-22 Thread Vitaly Wool
On Mon, Mar 22, 2021 at 7:55 AM Alex Ghiti  wrote:
>
> Le 3/21/21 à 2:06 PM, Vitaly Wool a écrit :
> > Hey Alex,
> >
> > On Sun, Mar 21, 2021 at 4:11 PM Alex Ghiti  wrote:
> >>
> >> Hi Vitaly,
> >>
> >> Le 3/10/21 à 4:22 AM, Vitaly Wool a écrit :
> >>> Introduce XIP (eXecute In Place) support for RISC-V platforms.
> >>> It allows code to be executed directly from non-volatile storage
> >>> directly addressable by the CPU, such as QSPI NOR flash which can
> >>> be found on many RISC-V platforms. This makes way for significant
> >>> optimization of RAM footprint. The XIP kernel is not compressed
> >>> since it has to run directly from flash, so it will occupy more
> >>> space on the non-volatile storage to The physical flash address
> >>
> >> There seems to be missing something here.
> >
> >
> > Hmmm... strange indeed. I'll come up with a respin shortly and will
> > double check.
> >>
> >>
> >>> used to link the kernel object files and for storing it has to
> >>> be known at compile time and is represented by a Kconfig option.
> >>>
> >>> XIP on RISC-V will currently only work on MMU-enabled kernels.
> >>>
> >>> Signed-off-by: Vitaly Wool 
> >>
> >>
> >> This fails to boot on current for-next with the following panic.
> >> This is because dtb_early_va points to an address that is not mapped in
> >> swapper_pg_dir: using __va(dtb_early_pa) instead works fine.
> >
> > Is it with CONFIG_BUILTIN_DTB or without?
>
> It is without CONFIG_BUILTIN_DTB enabled.
>
> And I noticed I can't link a XIP_KERNEL either:
>
> /home/alex/wip/lpc/buildroot/build_rv64/host/bin/riscv64-buildroot-linux-gnu-ld:
> section .data LMA [0080,008cd77f] overlaps section
> .rodata LMA [006f61c0,008499a7]
> /home/alex/wip/lpc/buildroot/build_rv64/host/bin/riscv64-buildroot-linux-gnu-ld:
> section .pci_fixup LMA [008499a8,0084cfd7] overlaps
> section .data LMA [0080,008cd77f]
> /home/alex/wip/lpc/buildroot/build_rv64/host/bin/riscv64-buildroot-linux-gnu-ld:
> arch/riscv/mm/init.o: in function `.L138':
> init.c:(.text+0x232): undefined reference to `__init_text_begin'
> /home/alex/wip/lpc/buildroot/build_rv64/host/bin/riscv64-buildroot-linux-gnu-ld:
> arch/riscv/mm/init.o: in function `protect_kernel_text_data':
> init.c:(.text+0x23a): undefined reference to `__init_data_begin'
> /home/alex/wip/lpc/buildroot/build_rv64/host/bin/riscv64-buildroot-linux-gnu-ld:
> init.c:(.text+0x28c): undefined reference to `__init_text_begin'
> /home/alex/wip/lpc/buildroot/build_rv64/host/bin/riscv64-buildroot-linux-gnu-ld:
> init.c:(.text+0x2a0): undefined reference to `__init_data_begin'
> make[2]: *** [Makefile:1197: vmlinux] Error 1
> make[1]: *** [package/pkg-generic.mk:250:
> /home/alex/wip/lpc/buildroot/build_rv64/build/linux-custom/.stamp_built]
> Error 2
> make: *** [Makefile:23: _all] Error 2
>
> The 2 missing symbols are not defined in vmlinux-xip.lds.S and are
> required for CONFIG_STRICT_KERNEL_RWX; I don't think both configs are
> mutually exclusive? I added them to the linker script and that works.

All the executable and read-only parts of an XIP kernel reside in NOR
flash, which is read-only when it is in random-access mode, so
CONFIG_STRICT_KERNEL_RWX basically has zero value for XIP. And since
NOR flash space is usually quite limited, any option you can avoid
enabling for an XIP kernel should not be enabled.

> But then I'm still blocked by the overlaps, any idea?

As I've said, flash space is very limited. In a standard RISC-V case
it is 16 MB, and the linker basically assumes we split it 50/50
between RO and RW sections.
The first 8 MB of the NOR flash are then mapped right before the
standard RAM mapping to create a virtually contiguous space of:
* RO/X sections still residing in flash, and
* RW sections that have been copied to the beginning of RAM so that
  they are actually writable.

So the idea is simple: you have to slim down your kernel so that the
RO/X sections occupy approximately 300K less than they do now :)
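
As a back-of-the-envelope illustration of that budget (the 16 MB flash size
and the 50/50 split come from the description above; the ~300K overshoot is
an assumed figure, not a measured one):

#include <stdio.h>

int main(void)
{
        const unsigned long flash_size  = 16UL << 20;      /* assumed NOR flash size          */
        const unsigned long roix_budget = flash_size / 2;  /* RO/X half, executed in place    */
        const unsigned long rw_budget   = flash_size / 2;  /* RW image, copied to RAM at boot */

        /* assumed current RO/X footprint, roughly 300K over budget */
        const unsigned long roix_size   = roix_budget + (300UL << 10);

        printf("RO/X budget: %lu bytes, RW budget: %lu bytes\n",
               roix_budget, rw_budget);
        printf("need to trim the RO/X sections by about %lu bytes\n",
               roix_size - roix_budget);
        return 0;
}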

Thanks again for the review and for the interest taken,

Best regards,
   Vitaly

> >>
> >> And as this likely needs another version, I'm going to add my comments
> >> below.
> >>
> >> [0.00] OF: fdt: Ignoring memory range 0x8000 - 0x8020
> >> [0.00] Machine model: riscv-virtio,qemu
> >> [0.00] earlycon: sbi0 at I/O port 0x0 (options '')
> >> [0.00] printk: bootconsole [sbi0] enabled
> >> [0.00] efi: UEFI not found.
> >> [ 

[PATCH v6] RISC-V: enable XIP

2021-03-21 Thread Vitaly Wool
Introduce XIP (eXecute In Place) support for RISC-V platforms.
It allows code to be executed directly from non-volatile storage
directly addressable by the CPU, such as QSPI NOR flash which can
be found on many RISC-V platforms. This makes way for significant
optimization of RAM footprint. The XIP kernel is not compressed
since it has to run directly from flash, so it will occupy more
space on the non-volatile storage. The physical flash address used
to link the kernel object files and for storing it has to be known
at compile time and is represented by a Kconfig option.

XIP on RISC-V will for the time being only work on MMU-enabled
kernels.

Signed-off-by: Vitaly Wool 

---

Changes in v2:
- dedicated macro for XIP address fixup when MMU is not enabled yet
  o both for 32-bit and 64-bit RISC-V
- SP is explicitly set to a safe place in RAM before __copy_data call
- removed redundant alignment requirements in vmlinux-xip.lds.S
- changed long -> uintptr_t typecast in __XIP_FIXUP macro.
Changes in v3:
- rebased against latest for-next
- XIP address fixup macro now takes an argument
- SMP related fixes
Changes in v4:
- rebased against the current for-next
- less #ifdef's in C/ASM code
- dedicated XIP_FIXUP_OFFSET assembler macro in head.S
- C-specific definitions moved into #ifndef __ASSEMBLY__
- Fixed multi-core boot
Changes in v5:
- fixed build error for non-XIP kernels
Changes in v6:
- XIP_PHYS_RAM_BASE config option renamed to PHYS_RAM_BASE
- added PHYS_RAM_BASE_FIXED config flag to allow usage of
  PHYS_RAM_BASE in non-XIP configurations if needed
- XIP_FIXUP macro rewritten with a temporary variable to avoid side
  effects
- fixed crash for non-XIP kernels that don't use built-in DTB

 arch/riscv/Kconfig  |  49 ++-
 arch/riscv/Makefile |   8 +-
 arch/riscv/boot/Makefile|  13 +++
 arch/riscv/include/asm/pgtable.h|  65 --
 arch/riscv/kernel/cpu_ops_sbi.c |  11 ++-
 arch/riscv/kernel/head.S|  49 ++-
 arch/riscv/kernel/head.h|   3 +
 arch/riscv/kernel/setup.c   |   8 +-
 arch/riscv/kernel/vmlinux-xip.lds.S | 132 
 arch/riscv/kernel/vmlinux.lds.S |   6 ++
 arch/riscv/mm/init.c| 100 +++--
 11 files changed, 426 insertions(+), 18 deletions(-)
 create mode 100644 arch/riscv/kernel/vmlinux-xip.lds.S

diff --git a/arch/riscv/Kconfig b/arch/riscv/Kconfig
index 8ea60a0a19ae..bd6f82240c34 100644
--- a/arch/riscv/Kconfig
+++ b/arch/riscv/Kconfig
@@ -441,7 +441,7 @@ config EFI_STUB
 
 config EFI
bool "UEFI runtime support"
-   depends on OF
+   depends on OF && !XIP_KERNEL
select LIBFDT
select UCS2_STRING
select EFI_PARAMS_FROM_FDT
@@ -465,11 +465,56 @@ config STACKPROTECTOR_PER_TASK
def_bool y
depends on STACKPROTECTOR && CC_HAVE_STACKPROTECTOR_TLS
 
+config PHYS_RAM_BASE_FIXED
+   bool "Explicitly specified physical RAM address"
+   default n
+
+config PHYS_RAM_BASE
+   hex "Platform Physical RAM address"
+   depends on PHYS_RAM_BASE_FIXED
+   default "0x8000"
+   help
+ This is the physical address of RAM in the system. It has to be
+ explicitly specified to run early relocations of read-write data
+ from flash to RAM.
+
+config XIP_KERNEL
+   bool "Kernel Execute-In-Place from ROM"
+   depends on MMU
+   select PHYS_RAM_BASE_FIXED
+   help
+ Execute-In-Place allows the kernel to run from non-volatile storage
+ directly addressable by the CPU, such as NOR flash. This saves RAM
+ space since the text section of the kernel is not loaded from flash
+ to RAM.  Read-write sections, such as the data section and stack,
+ are still copied to RAM.  The XIP kernel is not compressed since
+ it has to run directly from flash, so it will take more space to
+ store it.  The flash address used to link the kernel object files,
+ and for storing it, is configuration dependent. Therefore, if you
+ say Y here, you must know the proper physical address where to
+ store the kernel image depending on your own flash memory usage.
+
+ Also note that the make target becomes "make xipImage" rather than
+ "make zImage" or "make Image".  The final kernel binary to put in
+ ROM memory will be arch/riscv/boot/xipImage.
+
+ If unsure, say N.
+
+config XIP_PHYS_ADDR
+   hex "XIP Kernel Physical Location"
+   depends on XIP_KERNEL
+   default "0x2100"
+   help
+ This is the physical address in your flash memory the kernel will
+ be linked for and stored to.  This address is dependent on your
+ own flash usage.
+
 endmenu
 
 config BUILTIN_DTB
-   def_bool n
+   bool
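
As an aside on the "XIP address fixup" items in the changelog above, here is a
hedged sketch of the general idea (all names and constants are invented for a
64-bit example and are not the patch's actual definitions): before the MMU is
enabled, a symbol linked at its final virtual address has to be rebased onto
the physical flash address the code currently runs from, and the temporary
variable keeps the macro from evaluating its argument more than once.

#include <stdint.h>
#include <stdio.h>

/* invented link-time and load-time bases, for illustration only */
#define LINKED_VIRT_BASE  0xffffffe000000000UL
#define FLASH_PHYS_BASE   0x21000000UL

/* single-evaluation fixup; GNU C statement expression, as in kernel macros */
#define MY_XIP_FIXUP(addr) ({                                   \
        uintptr_t __a = (uintptr_t)(addr);                      \
        (__a - LINKED_VIRT_BASE) + FLASH_PHYS_BASE;             \
})

int main(void)
{
        uintptr_t linked = LINKED_VIRT_BASE + 0x1234;   /* some linked symbol */

        printf("runs from %#lx before the MMU is on\n",
               (unsigned long)MY_XIP_FIXUP(linked));
        return 0;
}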
  

Re: [PATCH v5] RISC-V: enable XIP

2021-03-21 Thread Vitaly Wool
Hey Alex,

On Sun, Mar 21, 2021 at 4:11 PM Alex Ghiti  wrote:
>
> Hi Vitaly,
>
> Le 3/10/21 à 4:22 AM, Vitaly Wool a écrit :
> > Introduce XIP (eXecute In Place) support for RISC-V platforms.
> > It allows code to be executed directly from non-volatile storage
> > directly addressable by the CPU, such as QSPI NOR flash which can
> > be found on many RISC-V platforms. This makes way for significant
> > optimization of RAM footprint. The XIP kernel is not compressed
> > since it has to run directly from flash, so it will occupy more
> > space on the non-volatile storage to The physical flash address
>
> There seems to be missing something here.


Hmmm... strange indeed. I'll come up with a respin shortly and will
double check.
>
>
> > used to link the kernel object files and for storing it has to
> > be known at compile time and is represented by a Kconfig option.
> >
> > XIP on RISC-V will currently only work on MMU-enabled kernels.
> >
> > Signed-off-by: Vitaly Wool 
>
>
> This fails to boot on current for-next with the following panic.
> This is because dtb_early_va points to an address that is not mapped in
> swapper_pg_dir: using __va(dtb_early_pa) instead works fine.

Is it with CONFIG_BUILTIN_DTB or without?

>
> And as this likely needs another version, I'm going to add my comments
> below.
>
> [0.00] OF: fdt: Ignoring memory range 0x8000 - 0x8020
> [0.00] Machine model: riscv-virtio,qemu
> [0.00] earlycon: sbi0 at I/O port 0x0 (options '')
> [0.00] printk: bootconsole [sbi0] enabled
> [0.00] efi: UEFI not found.
> [0.00] Unable to handle kernel paging request at virtual address
> 4001
> [0.00] Oops [#1]
> [0.00] Modules linked in:
> [0.00] CPU: 0 PID: 0 Comm: swapper Not tainted 5.12.0-rc2 #155
> [0.00] Hardware name: riscv-virtio,qemu (DT)
> [0.00] epc : fdt_check_header+0x0/0x1fc
> [0.00]  ra : early_init_dt_verify+0x16/0x6e
> [0.00] epc : ffe0002b955e ra : ffe00082100c sp :
> ffe001203f10
> [0.00]  gp : ffe0012e40b8 tp : ffe00120bd80 t0 :
> ffe23fdf7000
> [0.00]  t1 :  t2 :  s0 :
> ffe001203f30
> [0.00]  s1 : 4000 a0 : 4000 a1 :
> 0002bfff
> [0.00]  a2 : ffe23fdf6f00 a3 : 0001 a4 :
> 0018
> [0.00]  a5 : ffe000a0b5e8 a6 : ffe23fdf6ef0 a7 :
> 0018
> [0.00]  s2 : 8200 s3 : 0fff s4 :
> ffe000a0a958
> [0.00]  s5 : 0005 s6 : 0140 s7 :
> ffe23fdf6ec0
> [0.00]  s8 : 81000200 s9 : 8200 s10:
> ffe000a01000
> [0.00]  s11: 0fff t3 : bfff7000 t4 :
> 
> [0.00]  t5 : 80e0 t6 : 80202000
> [0.00] status: 0100 badaddr: 4001 cause:
> 000d
> [0.00] Call Trace:
> [0.00] [] fdt_check_header+0x0/0x1fc
> [0.00] [] setup_arch+0x3a6/0x412
> [0.00] [] start_kernel+0x7e/0x580
> [0.00] random: get_random_bytes called from
> print_oops_end_marker+0x22/0x44 with crng_init=0
> [0.00] ---[ end trace  ]---
> [0.00] Kernel panic - not syncing: Fatal exception
> [0.00] ---[ end Kernel panic - not syncing: Fatal exception ]---
>
> >
> > ---
> > Changes in v2:
> > - dedicated macro for XIP address fixup when MMU is not enabled yet
> >o both for 32-bit and 64-bit RISC-V
> > - SP is explicitly set to a safe place in RAM before __copy_data call
> > - removed redundant alignment requirements in vmlinux-xip.lds.S
> > - changed long -> uintptr_t typecast in __XIP_FIXUP macro.
> > Changes in v3:
> > - rebased against latest for-next
> > - XIP address fixup macro now takes an argument
> > - SMP related fixes
> > Changes in v4:
> > - rebased against the current for-next
> > - less #ifdef's in C/ASM code
> > - dedicated XIP_FIXUP_OFFSET assembler macro in head.S
> > - C-specific definitions moved into #ifndef __ASSEMBLY__
> > - Fixed multi-core boot
> > Changes in v5:
> > - fixed build error for non-XIP kernels
> >
> >   arch/riscv/Kconfig  |  44 +-
> >   arch/riscv/Makefile |   8 +-
> >   arch/riscv/boot/Makefile|  13 +++
> >   arch/riscv/include/asm/pgtable.h|  65 --
> >   arch/riscv/kernel/cpu_ops_sbi.c |  12 ++-
> >   arch/riscv/kernel/head.S

Re: [PATCH] z3fold: prevent reclaim/free race for headless pages

2021-03-12 Thread Vitaly Wool
On Thu, Mar 11, 2021 at 9:40 AM Thomas Hebb  wrote:
>
> commit ca0246bb97c2 ("z3fold: fix possible reclaim races") introduced
> the PAGE_CLAIMED flag "to avoid racing on a z3fold 'headless' page
> release." By atomically testing and setting the bit in each of
> z3fold_free() and z3fold_reclaim_page(), a double-free was avoided.
>
> However, commit dcf5aedb24f8 ("z3fold: stricter locking and more careful
> reclaim") appears to have unintentionally broken this behavior by moving
> the PAGE_CLAIMED check in z3fold_reclaim_page() to after the page lock
> gets taken, which only happens for non-headless pages. For headless
> pages, the check is now skipped entirely and races can occur again.
>
> I have observed such a race on my system:
>
> page:ffbd76b7 refcount:0 mapcount:0 mapping: 
> index:0x0 pfn:0x165316
> flags: 0x200()
> raw: 0200 ea0004535f48 8881d553a170 
> raw:  0011  
> page dumped because: VM_BUG_ON_PAGE(page_ref_count(page) == 0)
> [ cut here ]
> kernel BUG at include/linux/mm.h:707!
> invalid opcode:  [#1] PREEMPT SMP KASAN PTI
> CPU: 2 PID: 291928 Comm: kworker/2:0 Tainted: GB 
> 5.10.7-arch1-1-kasan #1
> Hardware name: Gigabyte Technology Co., Ltd. H97N-WIFI/H97N-WIFI, BIOS 
> F9b 03/03/2016
> Workqueue: zswap-shrink shrink_worker
> RIP: 0010:__free_pages+0x10a/0x130
> Code: c1 e7 06 48 01 ef 45 85 e4 74 d1 44 89 e6 31 d2 41 83 ec 01 e8 e7 
> b0 ff ff eb da 48 c7 c6 e0 32 91 88 48 89 ef e8 a6 89 f8 ff <0f> 0b 4c 89 e7 
> e8 fc 79 07 00 e9 33 ff ff ff 48 89 ef e8 ff 79 07
> RSP: :88819a2ffb98 EFLAGS: 00010296
> RAX:  RBX: ea000594c5a8 RCX: 
> RDX: 1d4000b298b7 RSI:  RDI: ea000594c5b8
> RBP: ea000594c580 R08: 003e R09: 8881d5520bbb
> R10: ed103aaa4177 R11: 0001 R12: ea000594c5b4
> R13:  R14: 888165316000 R15: ea000594c588
> FS:  () GS:8881d550() 
> knlGS:
> CS:  0010 DS:  ES:  CR0: 80050033
> CR2: 7f7c8c3654d8 CR3: 000103f42004 CR4: 001706e0
> Call Trace:
>  z3fold_zpool_shrink+0x9b6/0x1240
>  ? sugov_update_single+0x357/0x990
>  ? sched_clock+0x5/0x10
>  ? sched_clock_cpu+0x18/0x180
>  ? z3fold_zpool_map+0x490/0x490
>  ? _raw_spin_lock_irq+0x88/0xe0
>  shrink_worker+0x35/0x90
>  process_one_work+0x70c/0x1210
>  ? pwq_dec_nr_in_flight+0x15b/0x2a0
>  worker_thread+0x539/0x1200
>  ? __kthread_parkme+0x73/0x120
>  ? rescuer_thread+0x1000/0x1000
>  kthread+0x330/0x400
>  ? __kthread_bind_mask+0x90/0x90
>  ret_from_fork+0x22/0x30
> Modules linked in: rfcomm ebtable_filter ebtables ip6table_filter 
> ip6_tables iptable_filter ccm algif_aead des_generic libdes ecb 
> algif_skcipher cmac bnep md4 algif_hash af_alg vfat fat intel_rapl_msr 
> intel_rapl_common x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel 
> iwlmvm hid_logitech_hidpp kvm at24 mac80211 snd_hda_codec_realtek iTCO_wdt 
> snd_hda_codec_generic intel_pmc_bxt snd_hda_codec_hdmi ledtrig_audio 
> iTCO_vendor_support mei_wdt mei_hdcp snd_hda_intel snd_intel_dspcfg libarc4 
> soundwire_intel irqbypass iwlwifi soundwire_generic_allocation rapl 
> soundwire_cadence intel_cstate snd_hda_codec intel_uncore btusb joydev 
> mousedev snd_usb_audio pcspkr btrtl uvcvideo nouveau btbcm i2c_i801 btintel 
> snd_hda_core videobuf2_vmalloc i2c_smbus snd_usbmidi_lib videobuf2_memops 
> bluetooth snd_hwdep soundwire_bus snd_soc_rt5640 videobuf2_v4l2 cfg80211 
> snd_soc_rl6231 videobuf2_common snd_rawmidi lpc_ich alx videodev mdio 
> snd_seq_device snd_soc_core mc ecdh_generic mxm_wmi mei_me
>  hid_logitech_dj wmi snd_compress e1000e ac97_bus mei ttm rfkill 
> snd_pcm_dmaengine ecc snd_pcm snd_timer snd soundcore mac_hid acpi_pad 
> pkcs8_key_parser it87 hwmon_vid crypto_user fuse ip_tables x_tables ext4 
> crc32c_generic crc16 mbcache jbd2 dm_crypt cbc encrypted_keys trusted tpm 
> rng_core usbhid dm_mod crct10dif_pclmul crc32_pclmul crc32c_intel 
> ghash_clmulni_intel aesni_intel crypto_simd cryptd glue_helper xhci_pci 
> xhci_pci_renesas i915 video intel_gtt i2c_algo_bit drm_kms_helper syscopyarea 
> sysfillrect sysimgblt fb_sys_fops cec drm agpgart
> ---[ end trace 126d646fc3dc0ad8 ]---
>
> To fix the issue, re-add the earlier test and set in the case where we
> have a headless page.
>
> Fixes: dcf5aedb24f8 ("z3fold: stricter locking and more careful reclaim")
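
A minimal sketch of the claim-before-free pattern the commit message describes
(illustrative only, with invented names; this is not the actual z3fold code):
the free path and the reclaim path race to atomically set a claimed flag, and
only the winner is allowed to release the headless page.

#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

struct fake_page {
        atomic_flag claimed;            /* stand-in for PAGE_CLAIMED in page->private */
};

static bool claim_page(struct fake_page *p)
{
        /* true only for the first caller, like !test_and_set_bit(...) */
        return !atomic_flag_test_and_set(&p->claimed);
}

static void free_path(struct fake_page *p)
{
        if (claim_page(p))
                printf("free path releases the page\n");
}

static void reclaim_path(struct fake_page *p)
{
        if (claim_page(p))
                printf("reclaim path releases the page\n");
        /* the loser backs off instead of double-freeing */
}

int main(void)
{
        struct fake_page page = { .claimed = ATOMIC_FLAG_INIT };

        free_path(&page);
        reclaim_path(&page);    /* prints nothing: the page was already claimed */
        return 0;
}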

[PATCH v5] RISC-V: enable XIP

2021-03-10 Thread Vitaly Wool
Introduce XIP (eXecute In Place) support for RISC-V platforms.
It allows code to be executed directly from non-volatile storage
directly addressable by the CPU, such as QSPI NOR flash which can
be found on many RISC-V platforms. This makes way for significant
optimization of RAM footprint. The XIP kernel is not compressed
since it has to run directly from flash, so it will occupy more
space on the non-volatile storage to The physical flash address
used to link the kernel object files and for storing it has to
be known at compile time and is represented by a Kconfig option.

XIP on RISC-V will currently only work on MMU-enabled kernels.

Signed-off-by: Vitaly Wool 

---
Changes in v2:
- dedicated macro for XIP address fixup when MMU is not enabled yet
  o both for 32-bit and 64-bit RISC-V
- SP is explicitly set to a safe place in RAM before __copy_data call
- removed redundant alignment requirements in vmlinux-xip.lds.S
- changed long -> uintptr_t typecast in __XIP_FIXUP macro.
Changes in v3:
- rebased against latest for-next
- XIP address fixup macro now takes an argument
- SMP related fixes
Changes in v4:
- rebased against the current for-next
- less #ifdef's in C/ASM code
- dedicated XIP_FIXUP_OFFSET assembler macro in head.S
- C-specific definitions moved into #ifndef __ASSEMBLY__
- Fixed multi-core boot
Changes in v5:
- fixed build error for non-XIP kernels

 arch/riscv/Kconfig  |  44 +-
 arch/riscv/Makefile |   8 +-
 arch/riscv/boot/Makefile|  13 +++
 arch/riscv/include/asm/pgtable.h|  65 --
 arch/riscv/kernel/cpu_ops_sbi.c |  12 ++-
 arch/riscv/kernel/head.S|  59 -
 arch/riscv/kernel/head.h|   3 +
 arch/riscv/kernel/setup.c   |   8 +-
 arch/riscv/kernel/vmlinux-xip.lds.S | 132 
 arch/riscv/kernel/vmlinux.lds.S |   6 ++
 arch/riscv/mm/init.c| 100 +++--
 11 files changed, 432 insertions(+), 18 deletions(-)
 create mode 100644 arch/riscv/kernel/vmlinux-xip.lds.S

diff --git a/arch/riscv/Kconfig b/arch/riscv/Kconfig
index 85d626b8ce5e..59fb945a900e 100644
--- a/arch/riscv/Kconfig
+++ b/arch/riscv/Kconfig
@@ -438,7 +438,7 @@ config EFI_STUB
 
 config EFI
bool "UEFI runtime support"
-   depends on OF
+   depends on OF && !XIP_KERNEL
select LIBFDT
select UCS2_STRING
select EFI_PARAMS_FROM_FDT
@@ -462,11 +462,51 @@ config STACKPROTECTOR_PER_TASK
def_bool y
depends on STACKPROTECTOR && CC_HAVE_STACKPROTECTOR_TLS
 
+config XIP_KERNEL
+   bool "Kernel Execute-In-Place from ROM"
+   depends on MMU
+   help
+ Execute-In-Place allows the kernel to run from non-volatile storage
+ directly addressable by the CPU, such as NOR flash. This saves RAM
+ space since the text section of the kernel is not loaded from flash
+ to RAM.  Read-write sections, such as the data section and stack,
+ are still copied to RAM.  The XIP kernel is not compressed since
+ it has to run directly from flash, so it will take more space to
+ store it.  The flash address used to link the kernel object files,
+ and for storing it, is configuration dependent. Therefore, if you
+ say Y here, you must know the proper physical address where to
+ store the kernel image depending on your own flash memory usage.
+
+ Also note that the make target becomes "make xipImage" rather than
+ "make zImage" or "make Image".  The final kernel binary to put in
+ ROM memory will be arch/riscv/boot/xipImage.
+
+ If unsure, say N.
+
+config XIP_PHYS_ADDR
+   hex "XIP Kernel Physical Location"
+   depends on XIP_KERNEL
+   default "0x21000000"
+   help
+ This is the physical address in your flash memory the kernel will
+ be linked for and stored to.  This address is dependent on your
+ own flash usage.
+
+config XIP_PHYS_RAM_BASE
+   hex "Platform Physical RAM address"
+   depends on XIP_KERNEL
+   default "0x80000000"
+   help
+ This is the physical address of RAM in the system. It has to be
+ explicitly specified to run early relocations of read-write data
+ from flash to RAM.
+
 endmenu
 
 config BUILTIN_DTB
-   def_bool n
+   bool
depends on OF
+   default y if XIP_KERNEL
 
 menu "Power management options"
 
diff --git a/arch/riscv/Makefile b/arch/riscv/Makefile
index 1368d943f1f3..8fcbec03974d 100644
--- a/arch/riscv/Makefile
+++ b/arch/riscv/Makefile
@@ -82,7 +82,11 @@ CHECKFLAGS += -D__riscv -D__riscv_xlen=$(BITS)
 
 # Default target when executing plain make
 boot   := arch/riscv/boot
+ifeq ($(CONFIG_XIP_KERNEL),y)
+KBUILD_IMAGE := $(boot)/xipImage
+else
 KBUILD_IMAGE   := 

[PATCH v4] RISC-V: enable XIP

2021-03-06 Thread Vitaly Wool
Introduce XIP (eXecute In Place) support for RISC-V platforms.
It allows code to be executed directly from non-volatile storage
directly addressable by the CPU, such as QSPI NOR flash which can
be found on many RISC-V platforms. This makes way for significant
optimization of RAM footprint. The XIP kernel is not compressed
since it has to run directly from flash, so it will occupy more
space on the non-volatile storage. The physical flash address
used to link the kernel object files and for storing it has to
be known at compile time and is represented by a Kconfig option.

XIP on RISC-V will currently only work on MMU-enabled kernels.

Signed-off-by: Vitaly Wool 

---

Changed in v2:
- dedicated macro for XIP address fixup when MMU is not enabled yet
  o both for 32-bit and 64-bit RISC-V
- SP is explicitly set to a safe place in RAM before __copy_data call
- removed redundant alignment requirements in vmlinux-xip.lds.S
- changed long -> uintptr_t typecast in __XIP_FIXUP macro.

Changed in v3:
- rebased against latest for-next
- XIP address fixup macro now takes an argument
- SMP related fixes

Changes in v4:
- rebased against the current for-next
- less #ifdef's in C/ASM code
- dedicated XIP_FIXUP_OFFSET assembler macro in head.S
- C-specific definitions moved into #ifndef __ASSEMBLY__
- Fixed multi-core boot
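
A note on the "SP is explicitly set to a safe place in RAM before
__copy_data call" item: with XIP the writable sections initially exist
only as a load image in flash, so the stack has to live in RAM already
when they are copied over. A hedged C-level sketch of what that early
copy amounts to (the section symbols are assumptions mirroring a typical
XIP linker script, not the actual head.S assembly):

#include <linux/init.h>
#include <linux/string.h>

extern char __data_loc[];        /* load address (LMA) of .data in flash */
extern char _sdata[], _edata[];  /* run-time (VMA) bounds of .data in RAM */

static void __init xip_copy_writable_sections(void)
{
        /* Must run with the stack pointer already placed in RAM, since
         * the flash image is read-only and .data is not populated yet. */
        memcpy(_sdata, __data_loc, _edata - _sdata);
}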

 arch/riscv/Kconfig  |  44 +-
 arch/riscv/Makefile |   8 +-
 arch/riscv/boot/Makefile|  13 +++
 arch/riscv/include/asm/pgtable.h|  65 --
 arch/riscv/kernel/cpu_ops_sbi.c |  10 ++-
 arch/riscv/kernel/head.S|  59 -
 arch/riscv/kernel/head.h|   3 +
 arch/riscv/kernel/setup.c   |   8 +-
 arch/riscv/kernel/vmlinux-xip.lds.S | 132 
 arch/riscv/kernel/vmlinux.lds.S |   6 ++
 arch/riscv/mm/init.c| 100 +++--
 11 files changed, 430 insertions(+), 18 deletions(-)
 create mode 100644 arch/riscv/kernel/vmlinux-xip.lds.S

diff --git a/arch/riscv/Kconfig b/arch/riscv/Kconfig
index 85d626b8ce5e..59fb945a900e 100644
--- a/arch/riscv/Kconfig
+++ b/arch/riscv/Kconfig
@@ -438,7 +438,7 @@ config EFI_STUB
 
 config EFI
bool "UEFI runtime support"
-   depends on OF
+   depends on OF && !XIP_KERNEL
select LIBFDT
select UCS2_STRING
select EFI_PARAMS_FROM_FDT
@@ -462,11 +462,51 @@ config STACKPROTECTOR_PER_TASK
def_bool y
depends on STACKPROTECTOR && CC_HAVE_STACKPROTECTOR_TLS
 
+config XIP_KERNEL
+   bool "Kernel Execute-In-Place from ROM"
+   depends on MMU
+   help
+ Execute-In-Place allows the kernel to run from non-volatile storage
+ directly addressable by the CPU, such as NOR flash. This saves RAM
+ space since the text section of the kernel is not loaded from flash
+ to RAM.  Read-write sections, such as the data section and stack,
+ are still copied to RAM.  The XIP kernel is not compressed since
+ it has to run directly from flash, so it will take more space to
+ store it.  The flash address used to link the kernel object files,
+ and for storing it, is configuration dependent. Therefore, if you
+ say Y here, you must know the proper physical address where to
+ store the kernel image depending on your own flash memory usage.
+
+ Also note that the make target becomes "make xipImage" rather than
+ "make zImage" or "make Image".  The final kernel binary to put in
+ ROM memory will be arch/riscv/boot/xipImage.
+
+ If unsure, say N.
+
+config XIP_PHYS_ADDR
+   hex "XIP Kernel Physical Location"
+   depends on XIP_KERNEL
+   default "0x21000000"
+   help
+ This is the physical address in your flash memory the kernel will
+ be linked for and stored to.  This address is dependent on your
+ own flash usage.
+
+config XIP_PHYS_RAM_BASE
+   hex "Platform Physical RAM address"
+   depends on XIP_KERNEL
+   default "0x80000000"
+   help
+ This is the physical address of RAM in the system. It has to be
+ explicitly specified to run early relocations of read-write data
+ from flash to RAM.
+
 endmenu
 
 config BUILTIN_DTB
-   def_bool n
+   bool
depends on OF
+   default y if XIP_KERNEL
 
 menu "Power management options"
 
diff --git a/arch/riscv/Makefile b/arch/riscv/Makefile
index 1368d943f1f3..8fcbec03974d 100644
--- a/arch/riscv/Makefile
+++ b/arch/riscv/Makefile
@@ -82,7 +82,11 @@ CHECKFLAGS += -D__riscv -D__riscv_xlen=$(BITS)
 
 # Default target when executing plain make
 boot   := arch/riscv/boot
+ifeq ($(CONFIG_XIP_KERNEL),y)
+KBUILD_IMAGE := $(boot)/xipImage
+else
 KBUILD_IMAGE   := $(boot)/Image.gz
+endif
 
 head-y := arch/riscv/kerne

Re: [RFC PATCH] z3fold: prevent reclaim/free race for headless pages

2021-02-16 Thread Vitaly Wool
Hi Thomas,

On Tue, Feb 16, 2021 at 9:44 AM Thomas Hebb  wrote:
>
> commit ca0246bb97c2 ("z3fold: fix possible reclaim races") introduced
> the PAGE_CLAIMED flag "to avoid racing on a z3fold 'headless' page
> release." By atomically testing and setting the bit in each of
> z3fold_free() and z3fold_reclaim_page(), a double-free was avoided.
>
> However, commit 746d179b0e66 ("z3fold: stricter locking and more careful
> reclaim") appears to have unintentionally broken this behavior by moving
> the PAGE_CLAIMED check in z3fold_reclaim_page() to after the page lock
> gets taken, which only happens for non-headless pages. For headless
> pages, the check is now skipped entirely and races can occur again.
>
> I have observed such a race on my system:
>
> page:ffbd76b7 refcount:0 mapcount:0 mapping: 
> index:0x0 pfn:0x165316
> flags: 0x200()
> raw: 0200 ea0004535f48 8881d553a170 
> raw:  0011  
> page dumped because: VM_BUG_ON_PAGE(page_ref_count(page) == 0)
> [ cut here ]
> kernel BUG at include/linux/mm.h:707!
> invalid opcode:  [#1] PREEMPT SMP KASAN PTI
> CPU: 2 PID: 291928 Comm: kworker/2:0 Tainted: GB 
> 5.10.7-arch1-1-kasan #1
> Hardware name: Gigabyte Technology Co., Ltd. H97N-WIFI/H97N-WIFI, BIOS 
> F9b 03/03/2016
> Workqueue: zswap-shrink shrink_worker
> RIP: 0010:__free_pages+0x10a/0x130
> Code: c1 e7 06 48 01 ef 45 85 e4 74 d1 44 89 e6 31 d2 41 83 ec 01 e8 e7 
> b0 ff ff eb da 48 c7 c6 e0 32 91 88 48 89 ef e8 a6 89 f8 ff <0f> 0b 4c 89 e7 
> e8 fc 79 07 00 e9 33 ff ff ff 48 89 ef e8 ff 79 07
> RSP: :88819a2ffb98 EFLAGS: 00010296
> RAX:  RBX: ea000594c5a8 RCX: 
> RDX: 1d4000b298b7 RSI:  RDI: ea000594c5b8
> RBP: ea000594c580 R08: 003e R09: 8881d5520bbb
> R10: ed103aaa4177 R11: 0001 R12: ea000594c5b4
> R13:  R14: 888165316000 R15: ea000594c588
> FS:  () GS:8881d550() 
> knlGS:
> CS:  0010 DS:  ES:  CR0: 80050033
> CR2: 7f7c8c3654d8 CR3: 000103f42004 CR4: 001706e0
> Call Trace:
>  z3fold_zpool_shrink+0x9b6/0x1240
>  ? sugov_update_single+0x357/0x990
>  ? sched_clock+0x5/0x10
>  ? sched_clock_cpu+0x18/0x180
>  ? z3fold_zpool_map+0x490/0x490
>  ? _raw_spin_lock_irq+0x88/0xe0
>  shrink_worker+0x35/0x90
>  process_one_work+0x70c/0x1210
>  ? pwq_dec_nr_in_flight+0x15b/0x2a0
>  worker_thread+0x539/0x1200
>  ? __kthread_parkme+0x73/0x120
>  ? rescuer_thread+0x1000/0x1000
>  kthread+0x330/0x400
>  ? __kthread_bind_mask+0x90/0x90
>  ret_from_fork+0x22/0x30
> Modules linked in: rfcomm ebtable_filter ebtables ip6table_filter 
> ip6_tables iptable_filter ccm algif_aead des_generic libdes ecb 
> algif_skcipher cmac bnep md4 algif_hash af_alg vfat fat intel_rapl_msr 
> intel_rapl_common x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel 
> iwlmvm hid_logitech_hidpp kvm at24 mac80211 snd_hda_codec_realtek iTCO_wdt 
> snd_hda_codec_generic intel_pmc_bxt snd_hda_codec_hdmi ledtrig_audio 
> iTCO_vendor_support mei_wdt mei_hdcp snd_hda_intel snd_intel_dspcfg libarc4 
> soundwire_intel irqbypass iwlwifi soundwire_generic_allocation rapl 
> soundwire_cadence intel_cstate snd_hda_codec intel_uncore btusb joydev 
> mousedev snd_usb_audio pcspkr btrtl uvcvideo nouveau btbcm i2c_i801 btintel 
> snd_hda_core videobuf2_vmalloc i2c_smbus snd_usbmidi_lib videobuf2_memops 
> bluetooth snd_hwdep soundwire_bus snd_soc_rt5640 videobuf2_v4l2 cfg80211 
> snd_soc_rl6231 videobuf2_common snd_rawmidi lpc_ich alx videodev mdio 
> snd_seq_device snd_soc_core mc ecdh_generic mxm_wmi mei_me
>  hid_logitech_dj wmi snd_compress e1000e ac97_bus mei ttm rfkill 
> snd_pcm_dmaengine ecc snd_pcm snd_timer snd soundcore mac_hid acpi_pad 
> pkcs8_key_parser it87 hwmon_vid crypto_user fuse ip_tables x_tables ext4 
> crc32c_generic crc16 mbcache jbd2 dm_crypt cbc encrypted_keys trusted tpm 
> rng_core usbhid dm_mod crct10dif_pclmul crc32_pclmul crc32c_intel 
> ghash_clmulni_intel aesni_intel crypto_simd cryptd glue_helper xhci_pci 
> xhci_pci_renesas i915 video intel_gtt i2c_algo_bit drm_kms_helper syscopyarea 
> sysfillrect sysimgblt fb_sys_fops cec drm agpgart
> ---[ end trace 126d646fc3dc0ad8 ]---
>
> To fix the issue, re-add the earlier test and set in the case where we
> have a headless page.
>
> Fixes: 746d179b0e66 ("z3fold: stricter locking and more careful reclaim")
> Signed-off-by: Thomas Hebb 
> ---
> I have NOT tested this patch yet beyond compiling it. If the approach
> seems good, I'll test it on my system for a period of several days and
> see if I can reproduce the crash before sending a v1.
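
For readers following the fix described above, a hedged sketch of the
kind of guard being re-added on the headless path; the bit index and the
helper name are simplified assumptions about z3fold's internals, not the
exact code from the patch:

#include <linux/bitops.h>
#include <linux/mm_types.h>

/* Stand-in for z3fold's PAGE_CLAIMED flag, which is kept in page->private. */
#define Z3FOLD_PAGE_CLAIMED     4

/* Whoever wins this atomic test-and-set owns the page teardown; the loser
 * (free racing against reclaim, or vice versa) must back off instead of
 * freeing the headless page a second time. */
static bool z3fold_claim_page(struct page *page)
{
        return !test_and_set_bit(Z3FOLD_PAGE_CLAIMED, &page->private);
}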


Re: [PATCH v3] riscv: add BUILTIN_DTB support for MMU-enabled targets

2021-01-21 Thread Vitaly Wool
On Sat, Jan 16, 2021 at 12:57 AM Vitaly Wool  wrote:
>
> Sometimes, especially in a production system we may not want to
> use a "smart bootloader" like u-boot to load kernel, ramdisk and
> device tree from a filesystem on eMMC, but rather load the kernel
> from a NAND partition and just run it as soon as we can, and in
> this case it is convenient to have device tree compiled into the
> kernel binary. Since this case is not limited to MMU-less systems,
> let's support it for these which have MMU enabled too.
>
> While at it, provide __dtb_start as a parameter to setup_vm() in
> BUILTIN_DTB case, so we don't have to duplicate BUILTIN_DTB specific
> processing in MMU-enabled and MMU-disabled versions of setup_vm().

@Palmer: ping :)

> Signed-off-by: Vitaly Wool 

While at it, since this is just a respin/concatenation:
@Damien: are you okay with re-adding 'Tested-By:' ?
@Anup: are you okay with adding 'Reviewed-by:' since you have reviewed
both v1 patches that were concatenated?

Best regards,
   Vitaly

> ---
> Changes from v2:
> * folded "RISC-V: simplify BUILTIN_DTB processing" patch
> [http://lists.infradead.org/pipermail/linux-riscv/2021-January/004153.html]
> Changes from v1:
> * no direct initial_boot_params assignment
> * skips the temporary mapping for DT if BUILTIN_DTB=y
>
>  arch/riscv/Kconfig   |  1 -
>  arch/riscv/kernel/head.S |  4 
>  arch/riscv/mm/init.c | 19 +--
>  3 files changed, 17 insertions(+), 7 deletions(-)
>
> diff --git a/arch/riscv/Kconfig b/arch/riscv/Kconfig
> index 2ef05ef921b5..444a1ed1e847 100644
> --- a/arch/riscv/Kconfig
> +++ b/arch/riscv/Kconfig
> @@ -445,7 +445,6 @@ endmenu
>
>  config BUILTIN_DTB
> def_bool n
> -   depends on RISCV_M_MODE
> depends on OF
>
>  menu "Power management options"
> diff --git a/arch/riscv/kernel/head.S b/arch/riscv/kernel/head.S
> index 16e9941900c4..f5a9bad86e58 100644
> --- a/arch/riscv/kernel/head.S
> +++ b/arch/riscv/kernel/head.S
> @@ -260,7 +260,11 @@ clear_bss_done:
>
> /* Initialize page tables and relocate to virtual addresses */
> la sp, init_thread_union + THREAD_SIZE
> +#ifdef CONFIG_BUILTIN_DTB
> +   la a0, __dtb_start
> +#else
> mv a0, s1
> +#endif /* CONFIG_BUILTIN_DTB */
> call setup_vm
>  #ifdef CONFIG_MMU
> la a0, early_pg_dir
> diff --git a/arch/riscv/mm/init.c b/arch/riscv/mm/init.c
> index 30b61f2c6b87..45faad7c4291 100644
> --- a/arch/riscv/mm/init.c
> +++ b/arch/riscv/mm/init.c
> @@ -192,10 +192,13 @@ void __init setup_bootmem(void)
>  #endif /* CONFIG_BLK_DEV_INITRD */
>
> /*
> -* Avoid using early_init_fdt_reserve_self() since __pa() does
> +* If DTB is built in, no need to reserve its memblock.
> +* Otherwise, do reserve it but avoid using
> +* early_init_fdt_reserve_self() since __pa() does
>  * not work for DTB pointers that are fixmap addresses
>  */
> -   memblock_reserve(dtb_early_pa, fdt_totalsize(dtb_early_va));
> +   if (!IS_ENABLED(CONFIG_BUILTIN_DTB))
> +   memblock_reserve(dtb_early_pa, fdt_totalsize(dtb_early_va));
>
> early_init_fdt_scan_reserved_mem();
> dma_contiguous_reserve(dma32_phys_limit);
> @@ -499,6 +502,7 @@ asmlinkage void __init setup_vm(uintptr_t dtb_pa)
> /* Setup early PMD for DTB */
> create_pgd_mapping(early_pg_dir, DTB_EARLY_BASE_VA,
>(uintptr_t)early_dtb_pmd, PGDIR_SIZE, PAGE_TABLE);
> +#ifndef CONFIG_BUILTIN_DTB
> /* Create two consecutive PMD mappings for FDT early scan */
> pa = dtb_pa & ~(PMD_SIZE - 1);
> create_pmd_mapping(early_dtb_pmd, DTB_EARLY_BASE_VA,
> @@ -506,7 +510,11 @@ asmlinkage void __init setup_vm(uintptr_t dtb_pa)
> create_pmd_mapping(early_dtb_pmd, DTB_EARLY_BASE_VA + PMD_SIZE,
>pa + PMD_SIZE, PMD_SIZE, PAGE_KERNEL);
> dtb_early_va = (void *)DTB_EARLY_BASE_VA + (dtb_pa & (PMD_SIZE - 1));
> +#else /* CONFIG_BUILTIN_DTB */
> +   dtb_early_va = __va(dtb_pa);
> +#endif /* CONFIG_BUILTIN_DTB */
>  #else
> +#ifndef CONFIG_BUILTIN_DTB
> /* Create two consecutive PGD mappings for FDT early scan */
> pa = dtb_pa & ~(PGDIR_SIZE - 1);
> create_pgd_mapping(early_pg_dir, DTB_EARLY_BASE_VA,
> @@ -514,6 +522,9 @@ asmlinkage void __init setup_vm(uintptr_t dtb_pa)
> create_pgd_mapping(early_pg_dir, DTB_EARLY_BASE_VA + PGDIR_SIZE,
>pa + PGDIR_SIZE, PGDIR_SIZE, PAGE_KERNEL);
> dtb_early_va = (void *)DTB_EARLY_BASE_VA + (dtb_pa & (PGDIR_SIZE - 
> 1));
> +#else /* CONFIG_BU

[PATCH v3] riscv: add BUILTIN_DTB support for MMU-enabled targets

2021-01-15 Thread Vitaly Wool
Sometimes, especially in a production system, we may not want to
use a "smart bootloader" like U-Boot to load the kernel, ramdisk and
device tree from a filesystem on eMMC, but rather load the kernel
from a NAND partition and run it as soon as we can; in that case it
is convenient to have the device tree compiled into the kernel
binary. Since this case is not limited to MMU-less systems, let's
support it for those that have the MMU enabled too.

While at it, provide __dtb_start as a parameter to setup_vm() in
BUILTIN_DTB case, so we don't have to duplicate BUILTIN_DTB specific
processing in MMU-enabled and MMU-disabled versions of setup_vm().

Signed-off-by: Vitaly Wool 
---
Changes from v2:
* folded "RISC-V: simplify BUILTIN_DTB processing" patch
[http://lists.infradead.org/pipermail/linux-riscv/2021-January/004153.html]
Changes from v1:
* no direct initial_boot_params assignment
* skips the temporary mapping for DT if BUILTIN_DTB=y
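
For context, __dtb_start (passed to setup_vm() above) marks the start of
the DTB blob that the build links into the kernel image. A minimal
sketch of how such a symbol is conventionally reached from C; this is
illustrative only and not part of the patch:

extern char __dtb_start[];      /* start of the linked-in DTB blob */

static void *builtin_dtb_va(void)
{
        /* With BUILTIN_DTB the blob lives inside the kernel mapping, so
         * its virtual address is directly usable; no early fixmap window
         * for the DT is needed. */
        return __dtb_start;
}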

 arch/riscv/Kconfig   |  1 -
 arch/riscv/kernel/head.S |  4 
 arch/riscv/mm/init.c | 19 +--
 3 files changed, 17 insertions(+), 7 deletions(-)

diff --git a/arch/riscv/Kconfig b/arch/riscv/Kconfig
index 2ef05ef921b5..444a1ed1e847 100644
--- a/arch/riscv/Kconfig
+++ b/arch/riscv/Kconfig
@@ -445,7 +445,6 @@ endmenu
 
 config BUILTIN_DTB
def_bool n
-   depends on RISCV_M_MODE
depends on OF
 
 menu "Power management options"
diff --git a/arch/riscv/kernel/head.S b/arch/riscv/kernel/head.S
index 16e9941900c4..f5a9bad86e58 100644
--- a/arch/riscv/kernel/head.S
+++ b/arch/riscv/kernel/head.S
@@ -260,7 +260,11 @@ clear_bss_done:
 
/* Initialize page tables and relocate to virtual addresses */
la sp, init_thread_union + THREAD_SIZE
+#ifdef CONFIG_BUILTIN_DTB
+   la a0, __dtb_start
+#else
mv a0, s1
+#endif /* CONFIG_BUILTIN_DTB */
call setup_vm
 #ifdef CONFIG_MMU
la a0, early_pg_dir
diff --git a/arch/riscv/mm/init.c b/arch/riscv/mm/init.c
index 30b61f2c6b87..45faad7c4291 100644
--- a/arch/riscv/mm/init.c
+++ b/arch/riscv/mm/init.c
@@ -192,10 +192,13 @@ void __init setup_bootmem(void)
 #endif /* CONFIG_BLK_DEV_INITRD */
 
/*
-* Avoid using early_init_fdt_reserve_self() since __pa() does
+* If DTB is built in, no need to reserve its memblock.
+* Otherwise, do reserve it but avoid using
+* early_init_fdt_reserve_self() since __pa() does
 * not work for DTB pointers that are fixmap addresses
 */
-   memblock_reserve(dtb_early_pa, fdt_totalsize(dtb_early_va));
+   if (!IS_ENABLED(CONFIG_BUILTIN_DTB))
+   memblock_reserve(dtb_early_pa, fdt_totalsize(dtb_early_va));
 
early_init_fdt_scan_reserved_mem();
dma_contiguous_reserve(dma32_phys_limit);
@@ -499,6 +502,7 @@ asmlinkage void __init setup_vm(uintptr_t dtb_pa)
/* Setup early PMD for DTB */
create_pgd_mapping(early_pg_dir, DTB_EARLY_BASE_VA,
   (uintptr_t)early_dtb_pmd, PGDIR_SIZE, PAGE_TABLE);
+#ifndef CONFIG_BUILTIN_DTB
/* Create two consecutive PMD mappings for FDT early scan */
pa = dtb_pa & ~(PMD_SIZE - 1);
create_pmd_mapping(early_dtb_pmd, DTB_EARLY_BASE_VA,
@@ -506,7 +510,11 @@ asmlinkage void __init setup_vm(uintptr_t dtb_pa)
create_pmd_mapping(early_dtb_pmd, DTB_EARLY_BASE_VA + PMD_SIZE,
   pa + PMD_SIZE, PMD_SIZE, PAGE_KERNEL);
dtb_early_va = (void *)DTB_EARLY_BASE_VA + (dtb_pa & (PMD_SIZE - 1));
+#else /* CONFIG_BUILTIN_DTB */
+   dtb_early_va = __va(dtb_pa);
+#endif /* CONFIG_BUILTIN_DTB */
 #else
+#ifndef CONFIG_BUILTIN_DTB
/* Create two consecutive PGD mappings for FDT early scan */
pa = dtb_pa & ~(PGDIR_SIZE - 1);
create_pgd_mapping(early_pg_dir, DTB_EARLY_BASE_VA,
@@ -514,6 +522,9 @@ asmlinkage void __init setup_vm(uintptr_t dtb_pa)
create_pgd_mapping(early_pg_dir, DTB_EARLY_BASE_VA + PGDIR_SIZE,
   pa + PGDIR_SIZE, PGDIR_SIZE, PAGE_KERNEL);
dtb_early_va = (void *)DTB_EARLY_BASE_VA + (dtb_pa & (PGDIR_SIZE - 1));
+#else /* CONFIG_BUILTIN_DTB */
+   dtb_early_va = __va(dtb_pa);
+#endif /* CONFIG_BUILTIN_DTB */
 #endif
dtb_early_pa = dtb_pa;
 
@@ -604,11 +615,7 @@ static void __init setup_vm_final(void)
 #else
 asmlinkage void __init setup_vm(uintptr_t dtb_pa)
 {
-#ifdef CONFIG_BUILTIN_DTB
-   dtb_early_va = (void *) __dtb_start;
-#else
dtb_early_va = (void *)dtb_pa;
-#endif
dtb_early_pa = dtb_pa;
 }
 
-- 
2.20.1



Re: [PATCH] RISC-V: simplify BUILTIN_DTB processing

2021-01-15 Thread Vitaly Wool
On Fri, Jan 15, 2021 at 11:43 AM Anup Patel  wrote:
>
> On Fri, Jan 15, 2021 at 3:18 PM Vitaly Wool  wrote:
> >
> >
> >
> > On Fri, 15 Jan 2021, 10:39 Anup Patel,  wrote:
> >>
> >> On Tue, Jan 12, 2021 at 2:51 AM Vitaly Wool  
> >> wrote:
> >> >
> >> > Provide __dtb_start as a parameter to setup_vm() in case
> >> > CONFIG_BUILTIN_DTB is true, so we don't have to duplicate
> >> > BUILTIN_DTB specific processing in MMU-enabled and MMU-disabled
> >> > versions of setup_vm().
> >> >
> >> > Signed-off-by: Vitaly Wool 
> >> > ---
> >> >  arch/riscv/kernel/head.S | 4 
> >> >  arch/riscv/mm/init.c | 4 
> >> >  2 files changed, 4 insertions(+), 4 deletions(-)
> >> >
> >> > diff --git a/arch/riscv/kernel/head.S b/arch/riscv/kernel/head.S
> >> > index 16e9941900c4..f5a9bad86e58 100644
> >> > --- a/arch/riscv/kernel/head.S
> >> > +++ b/arch/riscv/kernel/head.S
> >> > @@ -260,7 +260,11 @@ clear_bss_done:
> >> >
> >> > /* Initialize page tables and relocate to virtual addresses */
> >> > la sp, init_thread_union + THREAD_SIZE
> >> > +#ifdef CONFIG_BUILTIN_DTB
> >> > +   la a0, __dtb_start
> >> > +#else
> >> > mv a0, s1
> >> > +#endif /* CONFIG_BUILTIN_DTB */
> >> > call setup_vm
> >> >  #ifdef CONFIG_MMU
> >> > la a0, early_pg_dir
> >> > diff --git a/arch/riscv/mm/init.c b/arch/riscv/mm/init.c
> >> > index 5b17f8d22f91..45faad7c4291 100644
> >> > --- a/arch/riscv/mm/init.c
> >> > +++ b/arch/riscv/mm/init.c
> >> > @@ -615,11 +615,7 @@ static void __init setup_vm_final(void)
> >> >  #else
> >> >  asmlinkage void __init setup_vm(uintptr_t dtb_pa)
> >> >  {
> >> > -#ifdef CONFIG_BUILTIN_DTB
> >> > -   dtb_early_va = (void *) __dtb_start;
> >> > -#else
> >> > dtb_early_va = (void *)dtb_pa;
> >> > -#endif
> >> > dtb_early_pa = dtb_pa;
> >> >  }
> >> >
> >> > --
> >> > 2.20.1
> >> >
> >>
> >> We can avoid the early DTB mapping for MMU-enabled case when
> >> BUILTIN_DTB is enabled (same as previous discussion). Otherwise
> >> looks good to me.
> >
> >
> > Right, but I had already submitted the patch which takes care of that, and 
> > you have reviewed it too IIRC :)
>
> Ahh, I assumed this patch is based on latest stable Linux-5.11-rcX.
>
> Either you can create a series with two patches OR you can squash this patch
> into your previous patch.

Fair enough, I'll come up with a new (aggregate) one.

Best regards,
   Vitaly


Re: [PATCH] zsmalloc: do not use bit_spin_lock

2021-01-14 Thread Vitaly Wool
On Thu, Jan 14, 2021 at 5:56 PM Sebastian Andrzej Siewior
 wrote:
>
> On 2021-01-14 17:29:37 [+0100], Vitaly Wool wrote:
> > On Thu, 14 Jan 2021, 17:18 Sebastian Andrzej Siewior,
> >  wrote:
> > >
> > > On 2020-12-23 19:25:02 [+0100], Vitaly Wool wrote:
> > > > > write the following patch according to your idea, what do you think ?
> > > >
> > > > Yep, that is basically what I was thinking of. Some nitpicks below:
> > >
> > > Did this go somewhere? The thread just ends here on my end.
> > > Mike, is this patch fixing / helping your case in anyway?
> >
> > Please see
> > * https://marc.info/?l=linux-mm=160889419514019=2
> > * https://marc.info/?l=linux-mm=160889418114011=2
> > * https://marc.info/?l=linux-mm=160889448814057=2
>
> Thank you, that would be
>1608894171-54174-1-git-send-email-tiant...@hisilicon.com
>
> for b4 compatibility :)
>
> > Haven't had time to test these yet but seem to be alright.
>
> So zs_map_object() still disables preemption but the mutex part is
> avoided by the patch?

Basically, yes. Minchan was very clear that he didn't want to remove
that inter-function locking, so be it.
I wouldn't really advise using zsmalloc with zswap because zsmalloc
has no support for reclaim; nevertheless, I wouldn't want this
configuration to stop working for those who are already using it.

Would you or Mike be up for testing Tian Tao's patchset?

Best regards,
   Vitaly


Re: [PATCH] zsmalloc: do not use bit_spin_lock

2021-01-14 Thread Vitaly Wool
On Thu, 14 Jan 2021, 17:18 Sebastian Andrzej Siewior,
 wrote:
>
> On 2020-12-23 19:25:02 [+0100], Vitaly Wool wrote:
> > > write the following patch according to your idea, what do you think ?
> >
> > Yep, that is basically what I was thinking of. Some nitpicks below:
>
> Did this go somewhere? The thread just ends here on my end.
> Mike, is this patch fixing / helping your case in anyway?

Please see
* https://marc.info/?l=linux-mm=160889419514019=2
* https://marc.info/?l=linux-mm=160889418114011=2
* https://marc.info/?l=linux-mm=160889448814057=2

Haven't had time to test these yet, but they seem to be alright.

Best regards,
   Vitaly


[PATCH] RISC-V: simplify BUILTIN_DTB processing

2021-01-11 Thread Vitaly Wool
Provide __dtb_start as a parameter to setup_vm() in case
CONFIG_BUILTIN_DTB is true, so we don't have to duplicate
BUILTIN_DTB specific processing in MMU-enabled and MMU-disabled
versions of setup_vm().

Signed-off-by: Vitaly Wool 
---
 arch/riscv/kernel/head.S | 4 
 arch/riscv/mm/init.c | 4 
 2 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/arch/riscv/kernel/head.S b/arch/riscv/kernel/head.S
index 16e9941900c4..f5a9bad86e58 100644
--- a/arch/riscv/kernel/head.S
+++ b/arch/riscv/kernel/head.S
@@ -260,7 +260,11 @@ clear_bss_done:
 
/* Initialize page tables and relocate to virtual addresses */
la sp, init_thread_union + THREAD_SIZE
+#ifdef CONFIG_BUILTIN_DTB
+   la a0, __dtb_start
+#else
mv a0, s1
+#endif /* CONFIG_BUILTIN_DTB */
call setup_vm
 #ifdef CONFIG_MMU
la a0, early_pg_dir
diff --git a/arch/riscv/mm/init.c b/arch/riscv/mm/init.c
index 5b17f8d22f91..45faad7c4291 100644
--- a/arch/riscv/mm/init.c
+++ b/arch/riscv/mm/init.c
@@ -615,11 +615,7 @@ static void __init setup_vm_final(void)
 #else
 asmlinkage void __init setup_vm(uintptr_t dtb_pa)
 {
-#ifdef CONFIG_BUILTIN_DTB
-   dtb_early_va = (void *) __dtb_start;
-#else
dtb_early_va = (void *)dtb_pa;
-#endif
dtb_early_pa = dtb_pa;
 }
 
-- 
2.20.1



[PATCH v2] riscv: add BUILTIN_DTB support for MMU-enabled targets

2021-01-04 Thread Vitaly Wool
Sometimes, especially in a production system, we may not want to
use a "smart bootloader" like U-Boot to load the kernel, ramdisk and
device tree from a filesystem on eMMC, but rather load the kernel
from a NAND partition and run it as soon as we can; in that case it
is convenient to have the device tree compiled into the kernel
binary. Since this case is not limited to MMU-less systems, let's
support it for those that have the MMU enabled too.

Signed-off-by: Vitaly Wool 
---
Changelog from v1:
* no direct initial_boot_params assignment
* skips the temporary mapping for DT if BUILTIN_DTB=y 
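
A small aside on the IS_ENABLED() form used in the diff below: it keeps
the reservation branch visible to the compiler (and type-checked) even
when BUILTIN_DTB=y, which is why it tends to be preferred over an #ifdef
around a single statement. A generic sketch of the pattern, not code
taken from the patch:

#include <linux/init.h>
#include <linux/kconfig.h>
#include <linux/memblock.h>

/* Reserve the DTB region only when the blob comes from outside the kernel
 * image; a built-in DTB is already covered by the kernel's own memory. */
static void __init reserve_external_dtb(phys_addr_t pa, phys_addr_t size)
{
        if (!IS_ENABLED(CONFIG_BUILTIN_DTB))
                memblock_reserve(pa, size);
}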

 arch/riscv/Kconfig   |  1 -
 arch/riscv/mm/init.c | 15 +--
 2 files changed, 13 insertions(+), 3 deletions(-)

diff --git a/arch/riscv/Kconfig b/arch/riscv/Kconfig
index 81b76d44725d..07a8bdcc423f 100644
--- a/arch/riscv/Kconfig
+++ b/arch/riscv/Kconfig
@@ -416,7 +416,6 @@ endmenu
 
 config BUILTIN_DTB
def_bool n
-   depends on RISCV_M_MODE
depends on OF
 
 menu "Power management options"
diff --git a/arch/riscv/mm/init.c b/arch/riscv/mm/init.c
index 13ba533f462b..04aeee276817 100644
--- a/arch/riscv/mm/init.c
+++ b/arch/riscv/mm/init.c
@@ -191,10 +191,13 @@ void __init setup_bootmem(void)
 #endif /* CONFIG_BLK_DEV_INITRD */
 
/*
-* Avoid using early_init_fdt_reserve_self() since __pa() does
+* If DTB is built in, no need to reserve its memblock.
+* Otherwise, do reserve it but avoid using
+* early_init_fdt_reserve_self() since __pa() does
 * not work for DTB pointers that are fixmap addresses
 */
-   memblock_reserve(dtb_early_pa, fdt_totalsize(dtb_early_va));
+   if (!IS_ENABLED(CONFIG_BUILTIN_DTB))
+   memblock_reserve(dtb_early_pa, fdt_totalsize(dtb_early_va));
 
early_init_fdt_scan_reserved_mem();
dma_contiguous_reserve(dma32_phys_limit);
@@ -499,6 +502,7 @@ asmlinkage void __init setup_vm(uintptr_t dtb_pa)
/* Setup early PMD for DTB */
create_pgd_mapping(early_pg_dir, DTB_EARLY_BASE_VA,
   (uintptr_t)early_dtb_pmd, PGDIR_SIZE, PAGE_TABLE);
+#ifndef CONFIG_BUILTIN_DTB
/* Create two consecutive PMD mappings for FDT early scan */
pa = dtb_pa & ~(PMD_SIZE - 1);
create_pmd_mapping(early_dtb_pmd, DTB_EARLY_BASE_VA,
@@ -506,7 +510,11 @@ asmlinkage void __init setup_vm(uintptr_t dtb_pa)
create_pmd_mapping(early_dtb_pmd, DTB_EARLY_BASE_VA + PMD_SIZE,
   pa + PMD_SIZE, PMD_SIZE, PAGE_KERNEL);
dtb_early_va = (void *)DTB_EARLY_BASE_VA + (dtb_pa & (PMD_SIZE - 1));
+#else /* CONFIG_BUILTIN_DTB */
+   dtb_early_va = __va(dtb_pa);
+#endif /* CONFIG_BUILTIN_DTB */
 #else
+#ifndef CONFIG_BUILTIN_DTB
/* Create two consecutive PGD mappings for FDT early scan */
pa = dtb_pa & ~(PGDIR_SIZE - 1);
create_pgd_mapping(early_pg_dir, DTB_EARLY_BASE_VA,
@@ -514,6 +522,9 @@ asmlinkage void __init setup_vm(uintptr_t dtb_pa)
create_pgd_mapping(early_pg_dir, DTB_EARLY_BASE_VA + PGDIR_SIZE,
   pa + PGDIR_SIZE, PGDIR_SIZE, PAGE_KERNEL);
dtb_early_va = (void *)DTB_EARLY_BASE_VA + (dtb_pa & (PGDIR_SIZE - 1));
+#else /* CONFIG_BUILTIN_DTB */
+   dtb_early_va = __va(dtb_pa);
+#endif /* CONFIG_BUILTIN_DTB */
 #endif
dtb_early_pa = dtb_pa;
 
-- 
2.20.1



Re: [PATCH] riscv: add BUILTIN_DTB support for MMU-enabled targets

2020-12-31 Thread Vitaly Wool
On Tue, Dec 29, 2020 at 6:05 AM Anup Patel  wrote:
>
> On Mon, Dec 28, 2020 at 10:08 PM Vitaly Wool  wrote:
> >
> > On Mon, Dec 28, 2020 at 3:10 PM Anup Patel  wrote:
> > >
> > > On Mon, Dec 28, 2020 at 7:05 PM Vitaly Wool  
> > > wrote:
> > > >
> > > > On Mon, Dec 28, 2020 at 12:59 PM Anup Patel  wrote:
> > > > >
> > > > > On Sat, Dec 26, 2020 at 10:03 PM Vitaly Wool 
> > > > >  wrote:
> > > > > >
> > > > > > Sometimes, especially in a production system we may not want to
> > > > > > use a "smart bootloader" like u-boot to load kernel, ramdisk and
> > > > > > device tree from a filesystem on eMMC, but rather load the kernel
> > > > > > from a NAND partition and just run it as soon as we can, and in
> > > > > > this case it is convenient to have device tree compiled into the
> > > > > > kernel binary. Since this case is not limited to MMU-less systems,
> > > > > > let's support it for these which have MMU enabled too.
> > > > > >
> > > > > > Signed-off-by: Vitaly Wool 
> > > > > > ---
> > > > > >  arch/riscv/Kconfig   |  1 -
> > > > > >  arch/riscv/mm/init.c | 12 ++--
> > > > > >  2 files changed, 10 insertions(+), 3 deletions(-)
> > > > > >
> > > > > > diff --git a/arch/riscv/Kconfig b/arch/riscv/Kconfig
> > > > > > index 2b41f6d8e458..9464b4e3a71a 100644
> > > > > > --- a/arch/riscv/Kconfig
> > > > > > +++ b/arch/riscv/Kconfig
> > > > > > @@ -419,7 +419,6 @@ endmenu
> > > > > >
> > > > > >  config BUILTIN_DTB
> > > > > > def_bool n
> > > > > > -   depends on RISCV_M_MODE
> > > > > > depends on OF
> > > > > >
> > > > > >  menu "Power management options"
> > > > > > diff --git a/arch/riscv/mm/init.c b/arch/riscv/mm/init.c
> > > > > > index 87c305c566ac..5d1c7a3ec01c 100644
> > > > > > --- a/arch/riscv/mm/init.c
> > > > > > +++ b/arch/riscv/mm/init.c
> > > > > > @@ -194,12 +194,20 @@ void __init setup_bootmem(void)
> > > > > > setup_initrd();
> > > > > >  #endif /* CONFIG_BLK_DEV_INITRD */
> > > > > >
> > > > > > +   /*
> > > > > > +* If DTB is built in, no need to reserve its memblock.
> > > > > > +* OTOH, initial_boot_params has to be set to properly copy 
> > > > > > DTB
> > > > > > +* before unflattening later on.
> > > > > > +*/
> > > > > > +   if (IS_ENABLED(CONFIG_BUILTIN_DTB))
> > > > > > +   initial_boot_params = __va(dtb_early_pa);
> > > > >
> > > > > Don't assign initial_boot_params directly here because the
> > > > > early_init_dt_scan() will do it.
> > > >
> > > > early_init_dt_scan will set initial_boot_params to dtb_early_va from
> > > > the early mapping which will be gone by the time
> > > > unflatten_and_copy_device_tree() is called.
> > >
> > > That's why we are doing early_init_dt_verify() again for the MMU-enabled
> > > case which already takes care of your concern.
> >
> > I might be out in the woods here but... Do you mean the call to
> > early_init_dt_verify() in setup_arch() which is compiled out
> > completely in the CONFIG_BUILTIN_DTB case?
> > Or is there any other call that I'm overlooking?
>
> Sorry for the confusion, what I meant was that we are calling
> early_init_dt_verify() from setup_arch() for the MMU-enabled
> with built-in DTB disabled case to update "initial_boot_params"
> after the boot CPU has switched from early_pg_dir to swapper_pg_dir.
>
> For MMU-enabled with built-in DTB case, if setup_vm() sets the
> dtb_early_va and dtb_early_pa correctly then early_init_dt_scan()
> called from setup_arch() will automatically set correct value for
> "initial_boot_params".

Oh, I think I get it now. You are suggesting that we skip the temporary
mapping for the DT altogether, since it is within the kernel mapping
range anyway, aren't you?
That does make sense indeed, thanks :)
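
To make the ordering being discussed concrete, here is a hedged outline
of the relevant calls (heavily simplified from arch/riscv/kernel/setup.c
and drivers/of/fdt.c; error handling and the MMU-disabled path are
omitted):

#include <linux/init.h>
#include <linux/kernel.h>
#include <linux/of_fdt.h>

static void __init dt_flow_outline(void *dtb_early_va)
{
        /* Sets initial_boot_params to the early virtual address of the DTB... */
        if (!early_init_dt_scan(dtb_early_va))
                panic("No DTB found\n");

        /* ...which must still be valid (or be re-verified with
         * early_init_dt_verify()) by the time the flattened tree is
         * copied out of the early mapping. */
        unflatten_and_copy_device_tree();
}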

> It is strange that early_init_dt_verify() is being compiled-out for you
> because the early_init_dt_scan() called from se

Re: [PATCH] riscv: add BUILTIN_DTB support for MMU-enabled targets

2020-12-28 Thread Vitaly Wool
On Mon, Dec 28, 2020 at 3:10 PM Anup Patel  wrote:
>
> On Mon, Dec 28, 2020 at 7:05 PM Vitaly Wool  wrote:
> >
> > On Mon, Dec 28, 2020 at 12:59 PM Anup Patel  wrote:
> > >
> > > On Sat, Dec 26, 2020 at 10:03 PM Vitaly Wool  
> > > wrote:
> > > >
> > > > Sometimes, especially in a production system we may not want to
> > > > use a "smart bootloader" like u-boot to load kernel, ramdisk and
> > > > device tree from a filesystem on eMMC, but rather load the kernel
> > > > from a NAND partition and just run it as soon as we can, and in
> > > > this case it is convenient to have device tree compiled into the
> > > > kernel binary. Since this case is not limited to MMU-less systems,
> > > > let's support it for these which have MMU enabled too.
> > > >
> > > > Signed-off-by: Vitaly Wool 
> > > > ---
> > > >  arch/riscv/Kconfig   |  1 -
> > > >  arch/riscv/mm/init.c | 12 ++--
> > > >  2 files changed, 10 insertions(+), 3 deletions(-)
> > > >
> > > > diff --git a/arch/riscv/Kconfig b/arch/riscv/Kconfig
> > > > index 2b41f6d8e458..9464b4e3a71a 100644
> > > > --- a/arch/riscv/Kconfig
> > > > +++ b/arch/riscv/Kconfig
> > > > @@ -419,7 +419,6 @@ endmenu
> > > >
> > > >  config BUILTIN_DTB
> > > > def_bool n
> > > > -   depends on RISCV_M_MODE
> > > > depends on OF
> > > >
> > > >  menu "Power management options"
> > > > diff --git a/arch/riscv/mm/init.c b/arch/riscv/mm/init.c
> > > > index 87c305c566ac..5d1c7a3ec01c 100644
> > > > --- a/arch/riscv/mm/init.c
> > > > +++ b/arch/riscv/mm/init.c
> > > > @@ -194,12 +194,20 @@ void __init setup_bootmem(void)
> > > > setup_initrd();
> > > >  #endif /* CONFIG_BLK_DEV_INITRD */
> > > >
> > > > +   /*
> > > > +* If DTB is built in, no need to reserve its memblock.
> > > > +* OTOH, initial_boot_params has to be set to properly copy DTB
> > > > +* before unflattening later on.
> > > > +*/
> > > > +   if (IS_ENABLED(CONFIG_BUILTIN_DTB))
> > > > +   initial_boot_params = __va(dtb_early_pa);
> > >
> > > Don't assign initial_boot_params directly here because the
> > > early_init_dt_scan() will do it.
> >
> > early_init_dt_scan will set initial_boot_params to dtb_early_va from
> > the early mapping which will be gone by the time
> > unflatten_and_copy_device_tree() is called.
>
> That's why we are doing early_init_dt_verify() again for the MMU-enabled
> case which already takes care of your concern.

I might be out in the woods here but... Do you mean the call to
early_init_dt_verify() in setup_arch() which is compiled out
completely in the CONFIG_BUILTIN_DTB case?
Or is there any other call that I'm overlooking?

Best regards,
   Vitaly

> We use early_init_dt_verify() like most architectures to set the initial DTB.
>
> >
> > > The setup_vm() is supposed to setup dtb_early_va and dtb_early_pa
> > > for MMU-enabled case so please add a "#ifdef" over there for the
> > > built-in DTB case.
> > >
> > > > +   else
> > > > +   memblock_reserve(dtb_early_pa, 
> > > > fdt_totalsize(dtb_early_va));
> > > > +
> > > > /*
> > > >  * Avoid using early_init_fdt_reserve_self() since __pa() does
> > > >  * not work for DTB pointers that are fixmap addresses
> > > >  */
> > >
> > > This comment needs to be updated and moved along the memblock_reserve()
> > > statement.
> > >
> > > > -   memblock_reserve(dtb_early_pa, fdt_totalsize(dtb_early_va));
> > > > -
> > > > early_init_fdt_scan_reserved_mem();
> > > > dma_contiguous_reserve(dma32_phys_limit);
> > > > memblock_allow_resize();
> > > > --
> > > > 2.29.2
> > > >
> > >
> > > This patch should be based upon Damiens builtin DTB patch.
> > > Refer, https://www.spinics.net/lists/linux-gpio/msg56616.html
> >
> > Thanks for the pointer, however I don't think our patches have
> > intersections. Besides, Damien is dealing with the MMU-less case
> > there.
>
> Damien's patch is also trying to move to use generic BUILTIN_DTB
> support for the MMU-less case so it is similar work hence the chance
> of patch conflict.
>
> Regards,
> Anup


Re: [PATCH] riscv: add BUILTIN_DTB support for MMU-enabled targets

2020-12-28 Thread Vitaly Wool
On Mon, Dec 28, 2020 at 12:59 PM Anup Patel  wrote:
>
> On Sat, Dec 26, 2020 at 10:03 PM Vitaly Wool  wrote:
> >
> > Sometimes, especially in a production system we may not want to
> > use a "smart bootloader" like u-boot to load kernel, ramdisk and
> > device tree from a filesystem on eMMC, but rather load the kernel
> > from a NAND partition and just run it as soon as we can, and in
> > this case it is convenient to have device tree compiled into the
> > kernel binary. Since this case is not limited to MMU-less systems,
> > let's support it for these which have MMU enabled too.
> >
> > Signed-off-by: Vitaly Wool 
> > ---
> >  arch/riscv/Kconfig   |  1 -
> >  arch/riscv/mm/init.c | 12 ++--
> >  2 files changed, 10 insertions(+), 3 deletions(-)
> >
> > diff --git a/arch/riscv/Kconfig b/arch/riscv/Kconfig
> > index 2b41f6d8e458..9464b4e3a71a 100644
> > --- a/arch/riscv/Kconfig
> > +++ b/arch/riscv/Kconfig
> > @@ -419,7 +419,6 @@ endmenu
> >
> >  config BUILTIN_DTB
> > def_bool n
> > -   depends on RISCV_M_MODE
> > depends on OF
> >
> >  menu "Power management options"
> > diff --git a/arch/riscv/mm/init.c b/arch/riscv/mm/init.c
> > index 87c305c566ac..5d1c7a3ec01c 100644
> > --- a/arch/riscv/mm/init.c
> > +++ b/arch/riscv/mm/init.c
> > @@ -194,12 +194,20 @@ void __init setup_bootmem(void)
> > setup_initrd();
> >  #endif /* CONFIG_BLK_DEV_INITRD */
> >
> > +   /*
> > +* If DTB is built in, no need to reserve its memblock.
> > +* OTOH, initial_boot_params has to be set to properly copy DTB
> > +* before unflattening later on.
> > +*/
> > +   if (IS_ENABLED(CONFIG_BUILTIN_DTB))
> > +   initial_boot_params = __va(dtb_early_pa);
>
> Don't assign initial_boot_params directly here because the
> early_init_dt_scan() will do it.

early_init_dt_scan will set initial_boot_params to dtb_early_va from
the early mapping which will be gone by the time
unflatten_and_copy_device_tree() is called.

> The setup_vm() is supposed to setup dtb_early_va and dtb_early_pa
> for MMU-enabled case so please add a "#ifdef" over there for the
> built-in DTB case.
>
> > +   else
> > +   memblock_reserve(dtb_early_pa, fdt_totalsize(dtb_early_va));
> > +
> > /*
> >  * Avoid using early_init_fdt_reserve_self() since __pa() does
> >  * not work for DTB pointers that are fixmap addresses
> >  */
>
> This comment needs to be updated and moved along the memblock_reserve()
> statement.
>
> > -   memblock_reserve(dtb_early_pa, fdt_totalsize(dtb_early_va));
> > -
> > early_init_fdt_scan_reserved_mem();
> > dma_contiguous_reserve(dma32_phys_limit);
> > memblock_allow_resize();
> > --
> > 2.29.2
> >
>
> This patch should be based upon Damiens builtin DTB patch.
> Refer, https://www.spinics.net/lists/linux-gpio/msg56616.html

Thanks for the pointer; however, I don't think our patches intersect.
Besides, Damien is dealing with the MMU-less case there.

Best regards,
   Vitaly


[PATCH] riscv: add BUILTIN_DTB support for MMU-enabled targets

2020-12-26 Thread Vitaly Wool
Sometimes, especially in a production system, we may not want to
use a "smart bootloader" like U-Boot to load the kernel, ramdisk and
device tree from a filesystem on eMMC, but rather load the kernel
from a NAND partition and run it as soon as we can; in that case it
is convenient to have the device tree compiled into the kernel
binary. Since this case is not limited to MMU-less systems, let's
support it for those that have the MMU enabled too.

Signed-off-by: Vitaly Wool 
---
 arch/riscv/Kconfig   |  1 -
 arch/riscv/mm/init.c | 12 ++--
 2 files changed, 10 insertions(+), 3 deletions(-)

diff --git a/arch/riscv/Kconfig b/arch/riscv/Kconfig
index 2b41f6d8e458..9464b4e3a71a 100644
--- a/arch/riscv/Kconfig
+++ b/arch/riscv/Kconfig
@@ -419,7 +419,6 @@ endmenu
 
 config BUILTIN_DTB
def_bool n
-   depends on RISCV_M_MODE
depends on OF
 
 menu "Power management options"
diff --git a/arch/riscv/mm/init.c b/arch/riscv/mm/init.c
index 87c305c566ac..5d1c7a3ec01c 100644
--- a/arch/riscv/mm/init.c
+++ b/arch/riscv/mm/init.c
@@ -194,12 +194,20 @@ void __init setup_bootmem(void)
setup_initrd();
 #endif /* CONFIG_BLK_DEV_INITRD */
 
+   /*
+* If DTB is built in, no need to reserve its memblock.
+* OTOH, initial_boot_params has to be set to properly copy DTB
+* before unflattening later on.
+*/
+   if (IS_ENABLED(CONFIG_BUILTIN_DTB))
+   initial_boot_params = __va(dtb_early_pa);
+   else
+   memblock_reserve(dtb_early_pa, fdt_totalsize(dtb_early_va));
+
/*
 * Avoid using early_init_fdt_reserve_self() since __pa() does
 * not work for DTB pointers that are fixmap addresses
 */
-   memblock_reserve(dtb_early_pa, fdt_totalsize(dtb_early_va));
-
early_init_fdt_scan_reserved_mem();
dma_contiguous_reserve(dma32_phys_limit);
memblock_allow_resize();
-- 
2.29.2



Re: [PATCH] zsmalloc: do not use bit_spin_lock

2020-12-23 Thread Vitaly Wool
On Wed, Dec 23, 2020 at 1:44 PM tiantao (H)  wrote:
>
>
> On 2020/12/23 8:11, Vitaly Wool wrote:
> > On Tue, 22 Dec 2020, 22:06 Song Bao Hua (Barry Song),
> >  wrote:
> >>
> >>
> >>> -Original Message-
> >>> From: Vitaly Wool [mailto:vitaly.w...@konsulko.com]
> >>> Sent: Tuesday, December 22, 2020 10:44 PM
> >>> To: Song Bao Hua (Barry Song) 
> >>> Cc: Shakeel Butt ; Minchan Kim ; 
> >>> Mike
> >>> Galbraith ; LKML ; linux-mm
> >>> ; Sebastian Andrzej Siewior ;
> >>> NitinGupta ; Sergey Senozhatsky
> >>> ; Andrew Morton
> >>> ; tiantao (H) 
> >>> Subject: Re: [PATCH] zsmalloc: do not use bit_spin_lock
> >>>
> >>> On Tue, 22 Dec 2020, 03:11 Song Bao Hua (Barry Song),
> >>>  wrote:
> >>>>
> >>>>
> >>>>> -Original Message-
> >>>>> From: Song Bao Hua (Barry Song)
> >>>>> Sent: Tuesday, December 22, 2020 3:03 PM
> >>>>> To: 'Vitaly Wool' 
> >>>>> Cc: Shakeel Butt ; Minchan Kim 
> >>>>> ;
> >>> Mike
> >>>>> Galbraith ; LKML ; linux-mm
> >>>>> ; Sebastian Andrzej Siewior ;
> >>>>> NitinGupta ; Sergey Senozhatsky
> >>>>> ; Andrew Morton
> >>>>> ; tiantao (H) 
> >>>>> Subject: RE: [PATCH] zsmalloc: do not use bit_spin_lock
> >>>>>
> >>>>>
> >>>>>> I'm still not convinced. Will kmap what, src? At this point src might
> >>> become
> >>>>> just a bogus pointer.
> >>>>>
> >>>>> As long as the memory is still there, we can kmap it by its page struct.
> >>> But
> >>>>> if
> >>>>> it is not there anymore, we have no way.
> >>>>>
> >>>>>> Why couldn't the object have been moved somewhere else (due to the 
> >>>>>> compaction
> >>>>> mechanism for instance)
> >>>>>> at the time DMA kicks in?
> >>>>> So zs_map_object() will guarantee the src won't be moved by holding 
> >>>>> those
> >>>>> preemption-disabled lock?
> >>>>> If so, it seems we have to drop the MOVABLE gfp in zswap for zsmalloc 
> >>>>> case?
> >>>>>
> >>>> Or we can do get_page() to avoid the movement of the page.
> >>>
> >>> I would like to discuss this more in zswap context than zsmalloc's.
> >>> Since zsmalloc does not implement reclaim callback, using it in zswap
> >>> is a corner case anyway.
> >> I see. But it seems we still need a solution for the compatibility
> >> of zsmalloc and zswap? this will require change in either zsmalloc
> >> or zswap.
> >> or do you want to make zswap depend on !ZSMALLOC?
> > No, I really don't think we should go that far. What if we add a flag
> > to zpool, named like "can_sleep_mapped", and have it set for
> > zbud/z3fold?
> > Then zswap could go the current path if the flag is set; and if it's
> > not set, and mutex_trylock fails, copy data from src to a temporary
> > buffer, then unmap the handle, take the mutex, process the buffer
> > instead of src. Not the nicest thing to do but at least it won't break
> > anything.
>
> write the following patch according to your idea, what do you think ?

Yep, that is basically what I was thinking of. Some nitpicks below:

> --- a/mm/zswap.c
>
> +++ b/mm/zswap.c
> @@ -1235,7 +1235,7 @@ static int zswap_frontswap_load(unsigned type,
> pgoff_t offset,
>  struct zswap_entry *entry;
>  struct scatterlist input, output;
>  struct crypto_acomp_ctx *acomp_ctx;
> -   u8 *src, *dst;
> +   u8 *src, *dst, *tmp;
>  unsigned int dlen;
>  int ret;
>
> @@ -1262,16 +1262,26 @@ static int zswap_frontswap_load(unsigned type,
> pgoff_t offset,
>  if (zpool_evictable(entry->pool->zpool))
>  src += sizeof(struct zswap_header);
>
> +   if (!zpool_can_sleep_mapped(entry->pool->zpool) &&
> !mutex_trylock(acomp_ctx->mutex)) {
> +   tmp = kmemdup(src, entry->length, GFP_ATOMIC);

kmemdup? just use memcpy :)

> +   zpool_unmap_handle(entry->pool->zpool, entry->handle); ???
> +   if (!tmp)
> +   goto freeentry;

Jumping to freeentry results in returning su

Re: [PATCH] zsmalloc: do not use bit_spin_lock

2020-12-22 Thread Vitaly Wool
On Tue, 22 Dec 2020, 22:06 Song Bao Hua (Barry Song),
 wrote:
>
>
>
> > -Original Message-
> > From: Vitaly Wool [mailto:vitaly.w...@konsulko.com]
> > Sent: Tuesday, December 22, 2020 10:44 PM
> > To: Song Bao Hua (Barry Song) 
> > Cc: Shakeel Butt ; Minchan Kim ; 
> > Mike
> > Galbraith ; LKML ; linux-mm
> > ; Sebastian Andrzej Siewior ;
> > NitinGupta ; Sergey Senozhatsky
> > ; Andrew Morton
> > ; tiantao (H) 
> > Subject: Re: [PATCH] zsmalloc: do not use bit_spin_lock
> >
> > On Tue, 22 Dec 2020, 03:11 Song Bao Hua (Barry Song),
> >  wrote:
> > >
> > >
> > >
> > > > -----Original Message-
> > > > From: Song Bao Hua (Barry Song)
> > > > Sent: Tuesday, December 22, 2020 3:03 PM
> > > > To: 'Vitaly Wool' 
> > > > Cc: Shakeel Butt ; Minchan Kim 
> > > > ;
> > Mike
> > > > Galbraith ; LKML ; linux-mm
> > > > ; Sebastian Andrzej Siewior ;
> > > > NitinGupta ; Sergey Senozhatsky
> > > > ; Andrew Morton
> > > > ; tiantao (H) 
> > > > Subject: RE: [PATCH] zsmalloc: do not use bit_spin_lock
> > > >
> > > >
> > > > > I'm still not convinced. Will kmap what, src? At this point src might
> > become
> > > > just a bogus pointer.
> > > >
> > > > As long as the memory is still there, we can kmap it by its page struct.
> > But
> > > > if
> > > > it is not there anymore, we have no way.
> > > >
> > > > > Why couldn't the object have been moved somewhere else (due to the 
> > > > > compaction
> > > > mechanism for instance)
> > > > > at the time DMA kicks in?
> > > >
> > > > So zs_map_object() will guarantee the src won't be moved by holding 
> > > > those
> > > > preemption-disabled lock?
> > > > If so, it seems we have to drop the MOVABLE gfp in zswap for zsmalloc 
> > > > case?
> > > >
> > >
> > > Or we can do get_page() to avoid the movement of the page.
> >
> >
> > I would like to discuss this more in zswap context than zsmalloc's.
> > Since zsmalloc does not implement reclaim callback, using it in zswap
> > is a corner case anyway.
>
> I see. But it seems we still need a solution for the compatibility
> of zsmalloc and zswap? this will require change in either zsmalloc
> or zswap.
> or do you want to make zswap depend on !ZSMALLOC?

No, I really don't think we should go that far. What if we add a flag
to zpool, named something like "can_sleep_mapped", and have it set for
zbud/z3fold?
Then zswap could go the current path if the flag is set; and if it's
not set, and mutex_trylock fails, copy data from src to a temporary
buffer, then unmap the handle, take the mutex, process the buffer
instead of src. Not the nicest thing to do but at least it won't break
anything.
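
A rough sketch of that fallback path; zpool_map_handle(),
zpool_unmap_handle(), kmemdup() and kfree() are the existing APIs, while
the can_sleep_mapped query and the surrounding details are assumptions
about the proposal rather than existing code:

#include <linux/errno.h>
#include <linux/slab.h>
#include <linux/zpool.h>

/* If the backend cannot stay mapped across a sleeping section, copy the
 * object out under the atomic mapping and work on the copy; otherwise
 * keep the current zswap path and stay mapped. */
static int load_object_copy_if_needed(struct zpool *pool, unsigned long handle,
                                      unsigned int len,
                                      int (*process)(const u8 *src, unsigned int len))
{
        u8 *src, *tmp = NULL;
        int ret;

        src = zpool_map_handle(pool, handle, ZPOOL_MM_RO);
        if (!zpool_can_sleep_mapped(pool)) {    /* proposed helper, assumed */
                tmp = kmemdup(src, len, GFP_ATOMIC);
                zpool_unmap_handle(pool, handle);
                if (!tmp)
                        return -ENOMEM;
                src = tmp;
        }

        ret = process(src, len);        /* may take mutexes, may sleep */

        if (tmp)
                kfree(tmp);
        else
                zpool_unmap_handle(pool, handle);
        return ret;
}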

~Vitaly

> > zswap, on the other hand, may be dealing with some new backends in
> > future which have more chances to become mainstream. Imagine typical
> > NUMA-like cases, i. e. a zswap pool allocated in some kind SRAM, or in
> > unused video memory. In such a case if you try to use a pointer to an
> > invalidated zpool mapping, you are on the way to thrash the system.
> > So: no assumptions that the zswap pool is in regular linear RAM should
> > be made.
> >
> > ~Vitaly
>
> Thanks
> Barry


Re: [PATCH v3] RISC-V: enable XIP

2020-12-22 Thread Vitaly Wool
Hi Anup,

On Tue, Dec 22, 2020 at 6:16 AM Anup Patel  wrote:
>
> On Tue, Dec 22, 2020 at 2:08 AM Vitaly Wool  wrote:
> >
> > Introduce XIP (eXecute In Place) support for RISC-V platforms.
> > It allows code to be executed directly from non-volatile storage
> > directly addressable by the CPU, such as QSPI NOR flash which can
> > be found on many RISC-V platforms. This makes way for significant
> > optimization of RAM footprint. The XIP kernel is not compressed
> > since it has to run directly from flash, so it will occupy more
> > space on the non-volatile storage. The physical flash address
> > used to link the kernel object files and for storing it has to
> > be known at compile time and is represented by a Kconfig option.
> >
> > XIP on RISC-V will currently only work on MMU-enabled kernels.
> >
> > Changed in v2:
> > - dedicated macro for XIP address fixup when MMU is not enabled yet
> >   o both for 32-bit and 64-bit RISC-V
> > - SP is explicitly set to a safe place in RAM before __copy_data call
> > - removed redundant alignment requirements in vmlinux-xip.lds.S
> > - changed long -> uintptr_t typecast in __XIP_FIXUP macro.
> >
> > Changed in v3:
> > - rebased against latest for-next
> > - XIP address fixup macro now takes an argument
> > - SMP related fixes
> > Signed-off-by: Vitaly Wool 
> > ---
> >  arch/riscv/Kconfig  |  46 -
> >  arch/riscv/Makefile |   8 +-
> >  arch/riscv/boot/Makefile|  13 +++
> >  arch/riscv/include/asm/pgtable.h|  56 +--
> >  arch/riscv/kernel/cpu_ops_sbi.c |   3 +
> >  arch/riscv/kernel/head.S|  69 +-
> >  arch/riscv/kernel/head.h|   3 +
> >  arch/riscv/kernel/setup.c   |   8 +-
> >  arch/riscv/kernel/vmlinux-xip.lds.S | 132 ++
> >  arch/riscv/kernel/vmlinux.lds.S |   6 ++
> >  arch/riscv/mm/init.c| 142 +---
> >  11 files changed, 460 insertions(+), 26 deletions(-)
> >  create mode 100644 arch/riscv/kernel/vmlinux-xip.lds.S
>
> If possible please break down this patch into smaller patches
> which are bisect-able and easy to reivew.

I was thinking about that, but any such split would look artificial to
me. In my opinion it would obscure the idea rather than help in
understanding it.

> >
> > diff --git a/arch/riscv/Kconfig b/arch/riscv/Kconfig
> > index 2b41f6d8e458..fabafdf763da 100644
> > --- a/arch/riscv/Kconfig
> > +++ b/arch/riscv/Kconfig
> > @@ -398,7 +398,7 @@ config EFI_STUB
> >
> >  config EFI
> > bool "UEFI runtime support"
> > -   depends on OF
> > +   depends on OF && !XIP_KERNEL
> > select LIBFDT
> > select UCS2_STRING
> > select EFI_PARAMS_FROM_FDT
> > @@ -415,12 +415,52 @@ config EFI
> >   allow the kernel to be booted as an EFI application. This
> >   is only useful on systems that have UEFI firmware.
> >
> > +config XIP_KERNEL
> > +   bool "Kernel Execute-In-Place from ROM"
> > +   depends on MMU
> > +   help
> > + Execute-In-Place allows the kernel to run from non-volatile 
> > storage
> > + directly addressable by the CPU, such as NOR flash. This saves RAM
> > + space since the text section of the kernel is not loaded from 
> > flash
> > + to RAM.  Read-write sections, such as the data section and stack,
> > + are still copied to RAM.  The XIP kernel is not compressed since
> > + it has to run directly from flash, so it will take more space to
> > + store it.  The flash address used to link the kernel object files,
> > + and for storing it, is configuration dependent. Therefore, if you
> > + say Y here, you must know the proper physical address where to
> > + store the kernel image depending on your own flash memory usage.
> > +
> > + Also note that the make target becomes "make xipImage" rather than
> > + "make zImage" or "make Image".  The final kernel binary to put in
> > + ROM memory will be arch/riscv/boot/xipImage.
> > +
> > + If unsure, say N.
> > +
> > +config XIP_PHYS_ADDR
> > +   hex "XIP Kernel Physical Location"
> > +   depends on XIP_KERNEL
> > +   default "0x21000000"
> > +   help
> > + This is the physical address in your flash memory the kernel will
> >

Re: [PATCH] zsmalloc: do not use bit_spin_lock

2020-12-22 Thread Vitaly Wool
On Tue, 22 Dec 2020, 03:11 Song Bao Hua (Barry Song),
 wrote:
>
>
>
> > -Original Message-
> > From: Song Bao Hua (Barry Song)
> > Sent: Tuesday, December 22, 2020 3:03 PM
> > To: 'Vitaly Wool' 
> > Cc: Shakeel Butt ; Minchan Kim ; 
> > Mike
> > Galbraith ; LKML ; linux-mm
> > ; Sebastian Andrzej Siewior ;
> > NitinGupta ; Sergey Senozhatsky
> > ; Andrew Morton
> > ; tiantao (H) 
> > Subject: RE: [PATCH] zsmalloc: do not use bit_spin_lock
> >
> >
> > > I'm still not convinced. Will kmap what, src? At this point src might 
> > > become
> > just a bogus pointer.
> >
> > As long as the memory is still there, we can kmap it by its page struct. But
> > if
> > it is not there anymore, we have no way.
> >
> > > Why couldn't the object have been moved somewhere else (due to the 
> > > compaction
> > mechanism for instance)
> > > at the time DMA kicks in?
> >
> > So zs_map_object() will guarantee the src won't be moved by holding those
> > preemption-disabled lock?
> > If so, it seems we have to drop the MOVABLE gfp in zswap for zsmalloc case?
> >
>
> Or we can do get_page() to avoid the movement of the page.


I would like to discuss this more in zswap context than zsmalloc's.
Since zsmalloc does not implement reclaim callback, using it in zswap
is a corner case anyway.

zswap, on the other hand, may be dealing with new backends in the
future which have better chances of becoming mainstream. Imagine
typical NUMA-like cases, i.e. a zswap pool allocated in some kind of
SRAM, or in unused video memory. In such a case, if you try to use a
pointer to an invalidated zpool mapping, you are on the way to
thrashing the system. So no assumption should be made that the zswap
pool is in regular linear RAM.

~Vitaly
>
>
> > >
> > > >
> > > > ~Vitaly
> > >
> >
> > Thanks
> > Barry
>
>


Re: [PATCH v3] RISC-V: enable XIP

2020-12-21 Thread Vitaly Wool
On Tue, Dec 22, 2020 at 2:44 AM Bin Meng  wrote:
>
> Hi Vitaly,
>
> On Tue, Dec 22, 2020 at 4:39 AM Vitaly Wool  wrote:
> >
> > Introduce XIP (eXecute In Place) support for RISC-V platforms.
> > It allows code to be executed directly from non-volatile storage
> > directly addressable by the CPU, such as QSPI NOR flash which can
> > be found on many RISC-V platforms. This makes way for significant
> > optimization of RAM footprint. The XIP kernel is not compressed
> > since it has to run directly from flash, so it will occupy more
> > space on the non-volatile storage. The physical flash address
> > used to link the kernel object files and for storing it has to
> > be known at compile time and is represented by a Kconfig option.
> >
> > XIP on RISC-V will currently only work on MMU-enabled kernels.
> >
> > Changed in v2:
> > - dedicated macro for XIP address fixup when MMU is not enabled yet
> >   o both for 32-bit and 64-bit RISC-V
> > - SP is explicitly set to a safe place in RAM before __copy_data call
> > - removed redundant alignment requirements in vmlinux-xip.lds.S
> > - changed long -> uintptr_t typecast in __XIP_FIXUP macro.
> >
> > Changed in v3:
> > - rebased against latest for-next
> > - XIP address fixup macro now takes an argument
> > - SMP related fixes
>
> The above changelogs should go below ---

That is very fair, thanks. Will do that for v4.

~Vitaly

> > Signed-off-by: Vitaly Wool 
> > ---
> >  arch/riscv/Kconfig  |  46 -
> >  arch/riscv/Makefile |   8 +-
> >  arch/riscv/boot/Makefile|  13 +++
> >  arch/riscv/include/asm/pgtable.h|  56 +--
> >  arch/riscv/kernel/cpu_ops_sbi.c |   3 +
> >  arch/riscv/kernel/head.S|  69 +-
> >  arch/riscv/kernel/head.h|   3 +
> >  arch/riscv/kernel/setup.c   |   8 +-
> >  arch/riscv/kernel/vmlinux-xip.lds.S | 132 ++
> >  arch/riscv/kernel/vmlinux.lds.S |   6 ++
> >  arch/riscv/mm/init.c| 142 +---
> >  11 files changed, 460 insertions(+), 26 deletions(-)
> >  create mode 100644 arch/riscv/kernel/vmlinux-xip.lds.S
> >
>
> Regards,
> Bin


Re: [PATCH] zsmalloc: do not use bit_spin_lock

2020-12-21 Thread Vitaly Wool
On Tue, Dec 22, 2020 at 12:37 AM Song Bao Hua (Barry Song)
 wrote:
>
>
>
> > -Original Message-
> > From: Song Bao Hua (Barry Song)
> > Sent: Tuesday, December 22, 2020 11:38 AM
> > To: 'Vitaly Wool' 
> > Cc: Shakeel Butt ; Minchan Kim ; 
> > Mike
> > Galbraith ; LKML ; linux-mm
> > ; Sebastian Andrzej Siewior ;
> > NitinGupta ; Sergey Senozhatsky
> > ; Andrew Morton
> > 
> > Subject: RE: [PATCH] zsmalloc: do not use bit_spin_lock
> >
> >
> >
> > > -Original Message-
> > > From: Vitaly Wool [mailto:vitaly.w...@konsulko.com]
> > > Sent: Tuesday, December 22, 2020 11:12 AM
> > > To: Song Bao Hua (Barry Song) 
> > > Cc: Shakeel Butt ; Minchan Kim ;
> > Mike
> > > Galbraith ; LKML ; linux-mm
> > > ; Sebastian Andrzej Siewior ;
> > > NitinGupta ; Sergey Senozhatsky
> > > ; Andrew Morton
> > > 
> > > Subject: Re: [PATCH] zsmalloc: do not use bit_spin_lock
> > >
> > > On Mon, Dec 21, 2020 at 10:30 PM Song Bao Hua (Barry Song)
> > >  wrote:
> > > >
> > > >
> > > >
> > > > > -Original Message-
> > > > > From: Shakeel Butt [mailto:shake...@google.com]
> > > > > Sent: Tuesday, December 22, 2020 10:03 AM
> > > > > To: Song Bao Hua (Barry Song) 
> > > > > Cc: Vitaly Wool ; Minchan Kim
> > > ;
> > > > > Mike Galbraith ; LKML ;
> > > linux-mm
> > > > > ; Sebastian Andrzej Siewior 
> > > > > ;
> > > > > NitinGupta ; Sergey Senozhatsky
> > > > > ; Andrew Morton
> > > > > 
> > > > > Subject: Re: [PATCH] zsmalloc: do not use bit_spin_lock
> > > > >
> > > > > On Mon, Dec 21, 2020 at 12:06 PM Song Bao Hua (Barry Song)
> > > > >  wrote:
> > > > > >
> > > > > >
> > > > > >
> > > > > > > -Original Message-
> > > > > > > From: Shakeel Butt [mailto:shake...@google.com]
> > > > > > > Sent: Tuesday, December 22, 2020 8:50 AM
> > > > > > > To: Vitaly Wool 
> > > > > > > Cc: Minchan Kim ; Mike Galbraith 
> > > > > > > ;
> > > LKML
> > > > > > > ; linux-mm ; 
> > > > > > > Song
> > > Bao
> > > > > Hua
> > > > > > > (Barry Song) ; Sebastian Andrzej 
> > > > > > > Siewior
> > > > > > > ; NitinGupta ; Sergey
> > > > > Senozhatsky
> > > > > > > ; Andrew Morton
> > > > > > > 
> > > > > > > Subject: Re: [PATCH] zsmalloc: do not use bit_spin_lock
> > > > > > >
> > > > > > > On Mon, Dec 21, 2020 at 11:20 AM Vitaly Wool 
> > > > > > > 
> > > > > wrote:
> > > > > > > >
> > > > > > > > On Mon, Dec 21, 2020 at 6:24 PM Minchan Kim 
> > wrote:
> > > > > > > > >
> > > > > > > > > On Sun, Dec 20, 2020 at 02:22:28AM +0200, Vitaly Wool wrote:
> > > > > > > > > > zsmalloc takes bit spinlock in its _map() callback and 
> > > > > > > > > > releases
> > > it
> > > > > > > > > > only in unmap() which is unsafe and leads to zswap 
> > > > > > > > > > complaining
> > > > > > > > > > about scheduling in atomic context.
> > > > > > > > > >
> > > > > > > > > > To fix that and to improve RT properties of zsmalloc, remove
> > that
> > > > > > > > > > bit spinlock completely and use a bit flag instead.
> > > > > > > > >
> > > > > > > > > I don't want to use such open code for the lock.
> > > > > > > > >
> > > > > > > > > I see from Mike's patch, recent zswap change introduced the 
> > > > > > > > > lockdep
> > > > > > > > > splat bug and you want to improve zsmalloc to fix the zswap 
> > > > > > > > > bug
> > > and
> > > > > > > > > introduce this patch with allowing preemption enabling.
> > > > > > > >
> > > > > > > > This understanding is upsi

Re: [PATCH] zsmalloc: do not use bit_spin_lock

2020-12-21 Thread Vitaly Wool
On Mon, Dec 21, 2020 at 10:30 PM Song Bao Hua (Barry Song)
 wrote:
>
>
>
> > -Original Message-
> > From: Shakeel Butt [mailto:shake...@google.com]
> > Sent: Tuesday, December 22, 2020 10:03 AM
> > To: Song Bao Hua (Barry Song) 
> > Cc: Vitaly Wool ; Minchan Kim 
> > ;
> > Mike Galbraith ; LKML ; 
> > linux-mm
> > ; Sebastian Andrzej Siewior ;
> > NitinGupta ; Sergey Senozhatsky
> > ; Andrew Morton
> > 
> > Subject: Re: [PATCH] zsmalloc: do not use bit_spin_lock
> >
> > On Mon, Dec 21, 2020 at 12:06 PM Song Bao Hua (Barry Song)
> >  wrote:
> > >
> > >
> > >
> > > > -Original Message-
> > > > From: Shakeel Butt [mailto:shake...@google.com]
> > > > Sent: Tuesday, December 22, 2020 8:50 AM
> > > > To: Vitaly Wool 
> > > > Cc: Minchan Kim ; Mike Galbraith ; 
> > > > LKML
> > > > ; linux-mm ; Song Bao
> > Hua
> > > > (Barry Song) ; Sebastian Andrzej Siewior
> > > > ; NitinGupta ; Sergey
> > Senozhatsky
> > > > ; Andrew Morton
> > > > 
> > > > Subject: Re: [PATCH] zsmalloc: do not use bit_spin_lock
> > > >
> > > > On Mon, Dec 21, 2020 at 11:20 AM Vitaly Wool 
> > wrote:
> > > > >
> > > > > On Mon, Dec 21, 2020 at 6:24 PM Minchan Kim  
> > > > > wrote:
> > > > > >
> > > > > > On Sun, Dec 20, 2020 at 02:22:28AM +0200, Vitaly Wool wrote:
> > > > > > > zsmalloc takes bit spinlock in its _map() callback and releases it
> > > > > > > only in unmap() which is unsafe and leads to zswap complaining
> > > > > > > about scheduling in atomic context.
> > > > > > >
> > > > > > > To fix that and to improve RT properties of zsmalloc, remove that
> > > > > > > bit spinlock completely and use a bit flag instead.
> > > > > >
> > > > > > I don't want to use such open code for the lock.
> > > > > >
> > > > > > I see from Mike's patch, recent zswap change introduced the lockdep
> > > > > > splat bug and you want to improve zsmalloc to fix the zswap bug and
> > > > > > introduce this patch with allowing preemption enabling.
> > > > >
> > > > > This understanding is upside down. The code in zswap you are referring
> > > > > to is not buggy.  You may claim that it is suboptimal but there is
> > > > > nothing wrong in taking a mutex.
> > > > >
> > > >
> > > > Is this suboptimal for all or just the hardware accelerators? Sorry, I
> > > > am not very familiar with the crypto API. If I select lzo or lz4 as a
> > > > zswap compressor will the [de]compression be async or sync?
> > >
> > > Right now, in crypto subsystem, new drivers are required to write based on
> > > async APIs. The old sync API can't work in new accelerator drivers as they
> > > are not supported at all.
> > >
> > > Old drivers are used to sync, but they've got async wrappers to support 
> > > async
> > > APIs. Eg.
> > > crypto: acomp - add support for lz4 via scomp
> > >
> > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/
> > crypto/lz4.c?id=8cd9330e0a615c931037d4def98b5ce0d540f08d
> > >
> > > crypto: acomp - add support for lzo via scomp
> > >
> > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/
> > crypto/lzo.c?id=ac9d2c4b39e022d2c61486bfc33b730cfd02898e
> > >
> > > so they are supporting async APIs but they are still working in sync mode
> > as
> > > those old drivers don't sleep.
> > >
> >
> > Good to know that those are sync because I want them to be sync.
> > Please note that zswap is a cache in front of a real swap and the load
> > operation is latency sensitive as it comes in the page fault path and
> > directly impacts the applications. I doubt decompressing synchronously
> > a 4k page on a cpu will be costlier than asynchronously decompressing
> > the same page from hardware accelerators.
>
> If you read the old paper:
> https://www.ibm.com/support/pages/new-linux-zswap-compression-functionality
> Because the hardware accelerator speeds up compression, looking at the zswap
> metrics we observed that there were more store and load requests in a given
> amount of time, which filled up the zswap pool faste

[PATCH] arch/Kconfig: JUMP_LABEL should depend on !XIP

2020-12-21 Thread Vitaly Wool
There's no point in the JUMP_LABEL optimization when the code is
executed from a read-only medium (e. g. NOR flash) as in the XIP
case, since jump labels work by patching the kernel text at runtime.
Moreover, trying to do so in an XIP kernel may result in faults and
kernel crashes, so let's explicitly disallow JUMP_LABEL if the
XIP_KERNEL configuration option is set.

Signed-off-by: Vitaly Wool 
---
 arch/Kconfig | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/Kconfig b/arch/Kconfig
index 56b6ccc0e32d..88632c9588ae 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -81,6 +81,7 @@ config JUMP_LABEL
bool "Optimize very unlikely/likely branches"
depends on HAVE_ARCH_JUMP_LABEL
depends on CC_HAS_ASM_GOTO
+   depends on !XIP_KERNEL
help
 This option enables a transparent branch optimization that
 makes certain almost-always-true or almost-always-false branch
-- 
2.29.2



[PATCH v3] RISC-V: enable XIP

2020-12-21 Thread Vitaly Wool
Introduce XIP (eXecute In Place) support for RISC-V platforms.
It allows code to be executed directly from non-volatile storage
directly addressable by the CPU, such as QSPI NOR flash which can
be found on many RISC-V platforms. This makes way for significant
optimization of RAM footprint. The XIP kernel is not compressed
since it has to run directly from flash, so it will occupy more
space on the non-volatile storage. The physical flash address
used to link the kernel object files and for storing it has to
be known at compile time and is represented by a Kconfig option.

XIP on RISC-V will currently only work on MMU-enabled kernels.

Changed in v2:
- dedicated macro for XIP address fixup when MMU is not enabled yet
  o both for 32-bit and 64-bit RISC-V
- SP is explicitly set to a safe place in RAM before __copy_data call
- removed redundant alignment requirements in vmlinux-xip.lds.S
- changed long -> uintptr_t typecast in __XIP_FIXUP macro.

Changed in v3:
- rebased against latest for-next
- XIP address fixup macro now takes an argument
- SMP related fixes
Signed-off-by: Vitaly Wool 
---
 arch/riscv/Kconfig  |  46 -
 arch/riscv/Makefile |   8 +-
 arch/riscv/boot/Makefile|  13 +++
 arch/riscv/include/asm/pgtable.h|  56 +--
 arch/riscv/kernel/cpu_ops_sbi.c |   3 +
 arch/riscv/kernel/head.S|  69 +-
 arch/riscv/kernel/head.h|   3 +
 arch/riscv/kernel/setup.c   |   8 +-
 arch/riscv/kernel/vmlinux-xip.lds.S | 132 ++
 arch/riscv/kernel/vmlinux.lds.S |   6 ++
 arch/riscv/mm/init.c| 142 +---
 11 files changed, 460 insertions(+), 26 deletions(-)
 create mode 100644 arch/riscv/kernel/vmlinux-xip.lds.S

diff --git a/arch/riscv/Kconfig b/arch/riscv/Kconfig
index 2b41f6d8e458..fabafdf763da 100644
--- a/arch/riscv/Kconfig
+++ b/arch/riscv/Kconfig
@@ -398,7 +398,7 @@ config EFI_STUB
 
 config EFI
bool "UEFI runtime support"
-   depends on OF
+   depends on OF && !XIP_KERNEL
select LIBFDT
select UCS2_STRING
select EFI_PARAMS_FROM_FDT
@@ -415,12 +415,52 @@ config EFI
  allow the kernel to be booted as an EFI application. This
  is only useful on systems that have UEFI firmware.
 
+config XIP_KERNEL
+   bool "Kernel Execute-In-Place from ROM"
+   depends on MMU
+   help
+ Execute-In-Place allows the kernel to run from non-volatile storage
+ directly addressable by the CPU, such as NOR flash. This saves RAM
+ space since the text section of the kernel is not loaded from flash
+ to RAM.  Read-write sections, such as the data section and stack,
+ are still copied to RAM.  The XIP kernel is not compressed since
+ it has to run directly from flash, so it will take more space to
+ store it.  The flash address used to link the kernel object files,
+ and for storing it, is configuration dependent. Therefore, if you
+ say Y here, you must know the proper physical address where to
+ store the kernel image depending on your own flash memory usage.
+
+ Also note that the make target becomes "make xipImage" rather than
+ "make zImage" or "make Image".  The final kernel binary to put in
+ ROM memory will be arch/riscv/boot/xipImage.
+
+ If unsure, say N.
+
+config XIP_PHYS_ADDR
+   hex "XIP Kernel Physical Location"
+   depends on XIP_KERNEL
+   default "0x21000000"
+   help
+ This is the physical address in your flash memory the kernel will
+ be linked for and stored to.  This address is dependent on your
+ own flash usage.
+
+config XIP_PHYS_RAM_BASE
+   hex "Platform Physical RAM address"
+   depends on XIP_KERNEL
+   default "0x80000000"
+   help
+ This is the physical address of RAM in the system. It has to be
+ explicitly specified to run early relocations of read-write data
+ from flash to RAM.
 endmenu
 
 config BUILTIN_DTB
-   def_bool n
-   depends on RISCV_M_MODE
+   bool
+   depends on (RISCV_M_MODE || XIP_KERNEL)
depends on OF
+   default y if XIP_KERNEL
+   default n if !XIP_KERNEL
 
 menu "Power management options"
 
diff --git a/arch/riscv/Makefile b/arch/riscv/Makefile
index 8c29e553ef7f..b89b6897b0be 100644
--- a/arch/riscv/Makefile
+++ b/arch/riscv/Makefile
@@ -70,7 +70,11 @@ CHECKFLAGS += -D__riscv -D__riscv_xlen=$(BITS)
 
 # Default target when executing plain make
 boot   := arch/riscv/boot
+ifeq ($(CONFIG_XIP_KERNEL),y)
+KBUILD_IMAGE := $(boot)/xipImage
+else
 KBUILD_IMAGE   := $(boot)/Image.gz
+endif
 
 head-y := arch/riscv/kernel/head.o
 
@@ -83,12 +87,14 @@ PHONY += vdso_install
 vdso_install:
$(Q)$(MAKE) $(build)=arc

Re: [PATCH] zsmalloc: do not use bit_spin_lock

2020-12-21 Thread Vitaly Wool
On Mon, Dec 21, 2020 at 6:24 PM Minchan Kim  wrote:
>
> On Sun, Dec 20, 2020 at 02:22:28AM +0200, Vitaly Wool wrote:
> > zsmalloc takes bit spinlock in its _map() callback and releases it
> > only in unmap() which is unsafe and leads to zswap complaining
> > about scheduling in atomic context.
> >
> > To fix that and to improve RT properties of zsmalloc, remove that
> > bit spinlock completely and use a bit flag instead.
>
> I don't want to use such open code for the lock.
>
> I see from Mike's patch, recent zswap change introduced the lockdep
> splat bug and you want to improve zsmalloc to fix the zswap bug and
> introduce this patch with allowing preemption enabling.

This understanding is upside down. The code in zswap you are referring
to is not buggy.  You may claim that it is suboptimal but there is
nothing wrong in taking a mutex.

> https://lore.kernel.org/linux-mm/fae85e4440a8ef6f13192476bd33a4826416fc58.ca...@gmx.de/
>
> zs_[un/map]_object is designed to be used in fast path(i.e.,
> zs_map_object/4K page copy/zs_unmap_object) so the spinlock is
> perfectly fine for API point of view. However, zswap introduced
> using the API with mutex_lock/crypto_wait_req where allowing
> preemption, which was wrong.

Taking a spinlock in one callback and releasing it in another is
unsafe and error prone. What if unmap was called on completion of a
DMA-like transfer from another context, like a threaded IRQ handler?
In that case this spinlock might never be released.
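
A hypothetical flow of that kind, just to illustrate (the device,
callback and submit function are invented here; only zs_map_object()
and zs_unmap_object() are real zsmalloc API):

#include <linux/types.h>
#include <linux/zsmalloc.h>

/* Hypothetical device and submit function, for illustration only. */
struct my_dma_dev;
typedef void (*my_dma_cb_t)(void *buf, int status);
extern int my_dma_submit(struct my_dma_dev *dev, void *buf, size_t len,
			 my_dma_cb_t cb);

static void start_dma_from_zspool(struct zs_pool *pool, unsigned long handle,
				  struct my_dma_dev *dev, size_t len)
{
	/* takes the bit spinlock inside zsmalloc */
	void *buf = zs_map_object(pool, handle, ZS_MM_RO);

	/* the completion callback runs later, e.g. from a threaded IRQ */
	my_dma_submit(dev, buf, len, NULL);
	/*
	 * zs_unmap_object(pool, handle) would have to be called from the
	 * completion callback, i.e. in a different context; if the transfer
	 * never completes, the "lock" taken above is never released.
	 */
}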

Anyway I can come up with a zswap patch explicitly stating that
zsmalloc is not fully compliant with zswap / zpool API to avoid
confusion for the time being. Would that be ok with you?

Best regards,
   Vitaly

> Furthermore, the zs_map_object already has a few more places where
> disablepreemptions(migrate_read_lock, get_cpu_var and kmap_atomic).
>
> Without making those locks preemptible all at once, zswap will still
> see the lockdep warning.
>
> >
> > Signed-off-by: Vitaly Wool 
> > ---
> >  mm/zsmalloc.c | 13 -
> >  1 file changed, 8 insertions(+), 5 deletions(-)
> >
> > diff --git a/mm/zsmalloc.c b/mm/zsmalloc.c
> > index 7289f502ffac..ff26546a7fed 100644
> > --- a/mm/zsmalloc.c
> > +++ b/mm/zsmalloc.c
> > @@ -876,22 +876,25 @@ static unsigned long obj_to_head(struct page *page, 
> > void *obj)
> >
> >  static inline int testpin_tag(unsigned long handle)
> >  {
> > - return bit_spin_is_locked(HANDLE_PIN_BIT, (unsigned long *)handle);
> > + return test_bit(HANDLE_PIN_BIT, (unsigned long *)handle);
> >  }
> >
> >  static inline int trypin_tag(unsigned long handle)
> >  {
> > - return bit_spin_trylock(HANDLE_PIN_BIT, (unsigned long *)handle);
> > + return !test_and_set_bit(HANDLE_PIN_BIT, (unsigned long *)handle);
> >  }
> >
> > -static void pin_tag(unsigned long handle) __acquires(bitlock)
> > +static void pin_tag(unsigned long handle)
> >  {
> > - bit_spin_lock(HANDLE_PIN_BIT, (unsigned long *)handle);
> > + preempt_disable();
> > + while(test_and_set_bit(HANDLE_PIN_BIT, (unsigned long *)handle))
> > + cpu_relax();
> > + preempt_enable();
> >  }
> >
> >  static void unpin_tag(unsigned long handle) __releases(bitlock)
> >  {
> > - bit_spin_unlock(HANDLE_PIN_BIT, (unsigned long *)handle);
> > + clear_bit(HANDLE_PIN_BIT, (unsigned long *)handle);
> >  }
> >
> >  static void reset_page(struct page *page)
> > --
> > 2.20.1
> >


Re: [PATCH] zsmalloc: do not use bit_spin_lock

2020-12-19 Thread Vitaly Wool
On Sun, Dec 20, 2020 at 2:18 AM Matthew Wilcox  wrote:
>
> On Sun, Dec 20, 2020 at 02:22:28AM +0200, Vitaly Wool wrote:
> > zsmalloc takes bit spinlock in its _map() callback and releases it
> > only in unmap() which is unsafe and leads to zswap complaining
> > about scheduling in atomic context.
> >
> > To fix that and to improve RT properties of zsmalloc, remove that
> > bit spinlock completely and use a bit flag instead.
>
> Isn't this just "I open coded bit spinlock to make the lockdep
> warnings go away"?

Not really, because a bit spinlock leaves preemption disabled until it
is unlocked.
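
Simplified sketches of the two behaviours (not the exact kernel
implementations) to spell out the difference:

#include <linux/bitops.h>
#include <linux/preempt.h>
#include <linux/processor.h>

/* roughly what bit_spin_lock() does: preemption stays off until unlock */
static inline void like_bit_spin_lock(int nr, unsigned long *addr)
{
	preempt_disable();
	while (test_and_set_bit_lock(nr, addr))
		cpu_relax();
	/* caller runs with preemption disabled until bit_spin_unlock() */
}

/* roughly what the proposed pin_tag() does: the caller may sleep later */
static inline void like_pin_tag(int nr, unsigned long *addr)
{
	preempt_disable();
	while (test_and_set_bit(nr, addr))
		cpu_relax();
	preempt_enable();
}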


[PATCH] zsmalloc: do not use bit_spin_lock

2020-12-19 Thread Vitaly Wool
zsmalloc takes bit spinlock in its _map() callback and releases it
only in unmap() which is unsafe and leads to zswap complaining
about scheduling in atomic context.

To fix that and to improve RT properties of zsmalloc, remove that
bit spinlock completely and use a bit flag instead.

Signed-off-by: Vitaly Wool 
---
 mm/zsmalloc.c | 13 -
 1 file changed, 8 insertions(+), 5 deletions(-)

diff --git a/mm/zsmalloc.c b/mm/zsmalloc.c
index 7289f502ffac..ff26546a7fed 100644
--- a/mm/zsmalloc.c
+++ b/mm/zsmalloc.c
@@ -876,22 +876,25 @@ static unsigned long obj_to_head(struct page *page, void 
*obj)
 
 static inline int testpin_tag(unsigned long handle)
 {
-   return bit_spin_is_locked(HANDLE_PIN_BIT, (unsigned long *)handle);
+   return test_bit(HANDLE_PIN_BIT, (unsigned long *)handle);
 }
 
 static inline int trypin_tag(unsigned long handle)
 {
-   return bit_spin_trylock(HANDLE_PIN_BIT, (unsigned long *)handle);
+   return !test_and_set_bit(HANDLE_PIN_BIT, (unsigned long *)handle);
 }
 
-static void pin_tag(unsigned long handle) __acquires(bitlock)
+static void pin_tag(unsigned long handle)
 {
-   bit_spin_lock(HANDLE_PIN_BIT, (unsigned long *)handle);
+   preempt_disable();
+   while(test_and_set_bit(HANDLE_PIN_BIT, (unsigned long *)handle))
+   cpu_relax();
+   preempt_enable();
 }
 
 static void unpin_tag(unsigned long handle) __releases(bitlock)
 {
-   bit_spin_unlock(HANDLE_PIN_BIT, (unsigned long *)handle);
+   clear_bit(HANDLE_PIN_BIT, (unsigned long *)handle);
 }
 
 static void reset_page(struct page *page)
-- 
2.20.1



Re: [patch] zswap: fix zswap_frontswap_load() vs zsmalloc::map/unmap() might_sleep() splat

2020-12-19 Thread Vitaly Wool
On Sat, 19 Dec 2020, 11:27 Mike Galbraith,  wrote:
>
> On Sat, 2020-12-19 at 11:20 +0100, Vitaly Wool wrote:
> > Hi Mike,
> >
> > On Sat, Dec 19, 2020 at 11:12 AM Mike Galbraith  wrote:
> > >
> > > (mailer partially munged formatting? resend)
> > >
> > > mm/zswap: fix zswap_frontswap_load() vs zsmalloc::map/unmap() 
> > > might_sleep() splat
> > >
> > > zsmalloc map/unmap methods use preemption disabling bit spinlocks.  Take 
> > > the
> > > mutex outside of pool map/unmap methods in zswap_frontswap_load() as is 
> > > done
> > > in zswap_frontswap_store().
> >
> > oh wait... So is zsmalloc taking a spin lock in its map callback and
> > releasing it only in unmap? In this case, I would rather keep zswap as
> > is, mark zsmalloc as RT unsafe and have zsmalloc maintainer fix it.
>
> The kernel that generated that splat was NOT an RT kernel, it was plain
> master.today with a PREEMPT config.


I see, thanks. I don't think it makes things better for zsmalloc
though. From what I can see, the offending code is this:

>/* From now on, migration cannot move the object */
>pin_tag(handle);

A bit spinlock is taken in pin_tag(). I find the comment above somewhat
misleading: why is it necessary to take a spinlock to prevent
migration? I would guess an atomic flag should normally be enough.

zswap is not broken here, it is zsmalloc that needs to be fixed.

Best regards,
   Vitaly


Re: [patch] zswap: fix zswap_frontswap_load() vs zsmalloc::map/unmap() might_sleep() splat

2020-12-19 Thread Vitaly Wool
Hi Mike,

On Sat, Dec 19, 2020 at 11:12 AM Mike Galbraith  wrote:
>
> (mailer partially munged formatting? resend)
>
> mm/zswap: fix zswap_frontswap_load() vs zsmalloc::map/unmap() might_sleep() 
> splat
>
> zsmalloc map/unmap methods use preemption disabling bit spinlocks.  Take the
> mutex outside of pool map/unmap methods in zswap_frontswap_load() as is done
> in zswap_frontswap_store().

oh wait... So is zsmalloc taking a spin lock in its map callback and
releasing it only in unmap? In this case, I would rather keep zswap as
is, mark zsmalloc as RT unsafe and have zsmalloc maintainer fix it.

Best regards,
   Vitaly

> Signed-off-by: Mike Galbraith 
> Fixes: 1ec3b5fe6eec "mm/zswap: move to use crypto_acomp API for hardware 
> acceleration"
> ---
>  mm/zswap.c |6 +++---
>  1 file changed, 3 insertions(+), 3 deletions(-)
>
> --- a/mm/zswap.c
> +++ b/mm/zswap.c
> @@ -1258,20 +1258,20 @@ static int zswap_frontswap_load(unsigned
>
> /* decompress */
> dlen = PAGE_SIZE;
> +   acomp_ctx = raw_cpu_ptr(entry->pool->acomp_ctx);
> +   mutex_lock(acomp_ctx->mutex);
> src = zpool_map_handle(entry->pool->zpool, entry->handle, 
> ZPOOL_MM_RO);
> if (zpool_evictable(entry->pool->zpool))
> src += sizeof(struct zswap_header);
>
> -   acomp_ctx = raw_cpu_ptr(entry->pool->acomp_ctx);
> -   mutex_lock(acomp_ctx->mutex);
> sg_init_one(&input, src, entry->length);
> sg_init_table(&output, 1);
> sg_set_page(&output, page, PAGE_SIZE, 0);
> acomp_request_set_params(acomp_ctx->req, &input, &output,
> entry->length, dlen);
> ret = crypto_wait_req(crypto_acomp_decompress(acomp_ctx->req),
> &acomp_ctx->wait);
> -   mutex_unlock(acomp_ctx->mutex);
>
> zpool_unmap_handle(entry->pool->zpool, entry->handle);
> +   mutex_unlock(acomp_ctx->mutex);
> BUG_ON(ret);
>
>  freeentry:
>


Re: scheduling while atomic in z3fold

2020-12-08 Thread Vitaly Wool

Hi Mike,

On 2020-12-07 16:41, Mike Galbraith wrote:

On Mon, 2020-12-07 at 16:21 +0100, Vitaly Wool wrote:

On Mon, Dec 7, 2020 at 1:34 PM Mike Galbraith  wrote:





Unfortunately, that made zero difference.


Okay, I suggest that you submit the patch that changes read_lock() to
write_lock() in __release_z3fold_page() and I'll ack it then.
I would like to rewrite the code so that write_lock is not necessary
there but I don't want to hold you back and it isn't likely that I'll
complete this today.


Nah, I'm in no rush... especially not to sign off on "Because the
little voices in my head said this bit should look like that bit over
yonder, and testing _seems_ to indicate they're right about that" :)

-Mike



okay, thanks. Would this make things better:

diff --git a/mm/z3fold.c b/mm/z3fold.c
index 18feaa0bc537..340c38a5ffac 100644
--- a/mm/z3fold.c
+++ b/mm/z3fold.c
@@ -303,10 +303,9 @@ static inline void put_z3fold_header(struct 
z3fold_header *zhdr)

z3fold_page_unlock(zhdr);
 }

-static inline void free_handle(unsigned long handle)
+static inline void free_handle(unsigned long handle, struct 
z3fold_header *zhdr)

 {
struct z3fold_buddy_slots *slots;
-   struct z3fold_header *zhdr;
int i;
bool is_free;

@@ -316,22 +315,13 @@ static inline void free_handle(unsigned long handle)
if (WARN_ON(*(unsigned long *)handle == 0))
return;

-   zhdr = handle_to_z3fold_header(handle);
slots = handle_to_slots(handle);
write_lock(&slots->lock);
*(unsigned long *)handle = 0;
-   if (zhdr->slots == slots) {
-   write_unlock(&slots->lock);
-   return; /* simple case, nothing else to do */
-   }
+   if (zhdr->slots != slots)
+   zhdr->foreign_handles--;

-   /* we are freeing a foreign handle if we are here */
-   zhdr->foreign_handles--;
is_free = true;
-   if (!test_bit(HANDLES_ORPHANED, &slots->pool)) {
-   write_unlock(&slots->lock);
-   return;
-   }
for (i = 0; i <= BUDDY_MASK; i++) {
if (slots->slot[i]) {
is_free = false;
@@ -343,6 +333,8 @@ static inline void free_handle(unsigned long handle)
if (is_free) {
struct z3fold_pool *pool = slots_to_pool(slots);

+   if (zhdr->slots == slots)
+   zhdr->slots = NULL;
kmem_cache_free(pool->c_handle, slots);
}
 }
@@ -525,8 +517,6 @@ static void __release_z3fold_page(struct 
z3fold_header *zhdr, bool locked)

 {
struct page *page = virt_to_page(zhdr);
struct z3fold_pool *pool = zhdr_to_pool(zhdr);
-   bool is_free = true;
-   int i;

WARN_ON(!list_empty(&zhdr->buddy));
set_bit(PAGE_STALE, &page->private);
@@ -536,21 +526,6 @@ static void __release_z3fold_page(struct 
z3fold_header *zhdr, bool locked)

list_del_init(&page->lru);
spin_unlock(&pool->lock);

-   /* If there are no foreign handles, free the handles array */
-   read_lock(&zhdr->slots->lock);
-   for (i = 0; i <= BUDDY_MASK; i++) {
-   if (zhdr->slots->slot[i]) {
-   is_free = false;
-   break;
-   }
-   }
-   if (!is_free)
-   set_bit(HANDLES_ORPHANED, &zhdr->slots->pool);
-   read_unlock(&zhdr->slots->lock);
-
-   if (is_free)
-   kmem_cache_free(pool->c_handle, zhdr->slots);
-
if (locked)
z3fold_page_unlock(zhdr);

@@ -973,6 +948,9 @@ static inline struct z3fold_header 
*__z3fold_alloc(struct z3fold_pool *pool,

}
}

+   if (zhdr && !zhdr->slots)
+   zhdr->slots = alloc_slots(pool,
+   can_sleep ? GFP_NOIO : GFP_ATOMIC);
return zhdr;
 }

@@ -1270,7 +1248,7 @@ static void z3fold_free(struct z3fold_pool *pool, 
unsigned long handle)

}

if (!page_claimed)
-   free_handle(handle);
+   free_handle(handle, zhdr);
if (kref_put(&zhdr->refcount, release_z3fold_page_locked_list)) {
atomic64_dec(>pages_nr);
return;
@@ -1429,19 +1407,19 @@ static int z3fold_reclaim_page(struct 
z3fold_pool *pool, unsigned int retries)

ret = pool->ops->evict(pool, middle_handle);
if (ret)
goto next;
-   free_handle(middle_handle);
+   free_handle(middle_handle, zhdr);
}
if (first_handle) {
ret = pool->ops->evict(pool, first_handle);
if (ret)
goto next;
-   free_handle(first_handle);
+   free_handle(first_handle

Re: scheduling while atomic in z3fold

2020-12-07 Thread Vitaly Wool
On Mon, Dec 7, 2020 at 1:34 PM Mike Galbraith  wrote:
>
> On Mon, 2020-12-07 at 12:52 +0100, Vitaly Wool wrote:
> >
> > Thanks. This trace beats me because I don't quite get how this could
> > have happened.
>
> I swear there's a mythical creature loose in there somewhere ;-)
> Everything looks just peachy up to the instant it goes boom, then you
> find in the wreckage that which was most definitely NOT there just a
> few ns ago.
>
> > Hitting write_unlock at line 341 would mean that HANDLES_ORPHANED bit
> > is set but obviously it isn't.
> > Could you please comment out the ".shrink = z3fold_zpool_shrink" line
> > and retry?
>
> Unfortunately, that made zero difference.

Okay, I suggest that you submit the patch that changes read_lock() to
write_lock() in __release_z3fold_page() and I'll ack it then.
I would like to rewrite the code so that write_lock is not necessary
there but I don't want to hold you back and it isn't likely that I'll
complete this today.
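
For reference, the change in question boils down to taking the slots
lock for writing around the handles scan in __release_z3fold_page(),
roughly like this (a sketch based on the code quoted further down in
this thread):

	/* If there are no foreign handles, free the handles array */
	write_lock(&zhdr->slots->lock);		/* was: read_lock() */
	for (i = 0; i <= BUDDY_MASK; i++) {
		if (zhdr->slots->slot[i]) {
			is_free = false;
			break;
		}
	}
	if (!is_free)
		set_bit(HANDLES_ORPHANED, &zhdr->slots->pool);
	write_unlock(&zhdr->slots->lock);	/* was: read_unlock() */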

Best regards,
   Vitaly


Re: scheduling while atomic in z3fold

2020-12-07 Thread Vitaly Wool
On Mon, Dec 7, 2020 at 3:18 AM Mike Galbraith  wrote:
>
> On Mon, 2020-12-07 at 02:05 +0100, Vitaly Wool wrote:
> >
> > Could you please try the following patch in your setup:
>
> crash> gdb list *z3fold_zpool_free+0x527
> 0xc0e14487 is in z3fold_zpool_free (mm/z3fold.c:341).
> 336 if (slots->slot[i]) {
> 337 is_free = false;
> 338 break;
> 339 }
> 340 }
> 341 write_unlock(&slots->lock);  <== boom
> 342
> 343 if (is_free) {
> 344 struct z3fold_pool *pool = slots_to_pool(slots);
> 345
> crash> z3fold_buddy_slots -x 99a3287b8780
> struct z3fold_buddy_slots {
>   slot = {0xdeadbeef, 0xdeadbeef, 0xdeadbeef, 0xdeadbeef},
>   pool = 0x99a3146b8400,
>   lock = {
> rtmutex = {
>   wait_lock = {
> raw_lock = {
>   {
> val = {
>   counter = 0x1
> },
> {
>   locked = 0x1,
>   pending = 0x0
> },
> {
>   locked_pending = 0x1,
>   tail = 0x0
> }
>   }
> }
>   },
>   waiters = {
> rb_root = {
>   rb_node = 0x99a3287b8e00
> },
> rb_leftmost = 0x0
>   },
>   owner = 0x99a355c24500,
>   save_state = 0x1
> },
> readers = {
>   counter = 0x8000
> }
>   }
> }

Thanks. This trace beats me because I don't quite get how this could
have happened.
Hitting write_unlock at line 341 would mean that HANDLES_ORPHANED bit
is set but obviously it isn't.
Could you please comment out the ".shrink = z3fold_zpool_shrink" line
and retry? Reclaim is the trickiest thing over there since I have to
drop page lock while reclaiming.
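
For reference, what I mean is a debugging-only change along these lines
(assuming the zpool driver table looks like the upstream one):

static struct zpool_driver z3fold_zpool_driver = {
	.type =		"z3fold",
	.owner =	THIS_MODULE,
	.create =	z3fold_zpool_create,
	.destroy =	z3fold_zpool_destroy,
	.malloc =	z3fold_zpool_malloc,
	.free =		z3fold_zpool_free,
	/* .shrink =	z3fold_zpool_shrink, */	/* disabled to rule out reclaim */
	.map =		z3fold_zpool_map,
	.unmap =	z3fold_zpool_unmap,
	.total_size =	z3fold_zpool_total_size,
};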

Thanks,
   Vitaly

> > diff --git a/mm/z3fold.c b/mm/z3fold.c
> > index 18feaa0bc537..efe9a012643d 100644
> > --- a/mm/z3fold.c
> > +++ b/mm/z3fold.c
> > @@ -544,12 +544,17 @@ static void __release_z3fold_page(struct 
> > z3fold_header *zhdr, bool locked)
> >   break;
> >   }
> >   }
> > - if (!is_free)
> > + if (!is_free) {
> >   set_bit(HANDLES_ORPHANED, &zhdr->slots->pool);
> > - read_unlock(&zhdr->slots->lock);
> > -
> > - if (is_free)
> > + read_unlock(&zhdr->slots->lock);
> > + } else {
> > + zhdr->slots->slot[0] =
> > + zhdr->slots->slot[1] =
> > + zhdr->slots->slot[2] =
> > + zhdr->slots->slot[3] = 0xdeadbeef;
> > + read_unlock(&zhdr->slots->lock);
> >   kmem_cache_free(pool->c_handle, zhdr->slots);
> > + }
> >
> >   if (locked)
> >   z3fold_page_unlock(zhdr);
>


Re: scheduling while atomic in z3fold

2020-12-03 Thread Vitaly Wool
On Thu, Dec 3, 2020 at 2:39 PM Sebastian Andrzej Siewior
 wrote:
>
> On 2020-12-03 09:18:21 [+0100], Mike Galbraith wrote:
> > On Thu, 2020-12-03 at 03:16 +0100, Mike Galbraith wrote:
> > > On Wed, 2020-12-02 at 23:08 +0100, Sebastian Andrzej Siewior wrote:
> > > Looks like...
> > >
> > > d8f117abb380 z3fold: fix use-after-free when freeing handles
> > >
> > > ...wasn't completely effective...
> >
> > The top two hunks seem to have rendered the thing RT tolerant.

Thanks for all your efforts; I promise to take a closer look at this
today. I've had my hands full with RISC-V up until now.

Best regards,
   Vitaly

> Yes, it appears to. I have no idea if this is a proper fix or not.
> Without your write lock, after a few attempts, KASAN says:
>
> | BUG: KASAN: use-after-free in __pv_queued_spin_lock_slowpath+0x293/0x770
> | Write of size 2 at addr 88800e0e10aa by task kworker/u16:3/237
> |
> | CPU: 5 PID: 237 Comm: kworker/u16:3 Tainted: GW 
> 5.10.0-rc6-rt13-rt+
> | Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.14.0-1 04/01/2014
> | Workqueue: zswap1 compact_page_work
> | Call Trace:
> |  dump_stack+0x7d/0xa3
> |  print_address_description.constprop.0+0x19/0x120
> |  __kasan_report.cold+0x1d/0x35
> |  kasan_report+0x3a/0x50
> |  check_memory_region+0x145/0x1a0
> |  __pv_queued_spin_lock_slowpath+0x293/0x770
> |  _raw_spin_lock_irq+0xca/0xe0
> |  rt_read_unlock+0x2c/0x70
> |  __release_z3fold_page.constprop.0+0x45e/0x620
> |  do_compact_page+0x674/0xa50
> |  process_one_work+0x63a/0x1130
> |  worker_thread+0xd3/0xc80
> |  kthread+0x401/0x4e0
> |  ret_from_fork+0x22/0x30
> |
> | Allocated by task 225 (systemd-journal):
> |  kasan_save_stack+0x1b/0x40
> |  __kasan_kmalloc.constprop.0+0xc2/0xd0
> |  kmem_cache_alloc+0x103/0x2b0
> |  z3fold_alloc+0x597/0x1970
> |  zswap_frontswap_store+0x928/0x1bf0
> |  __frontswap_store+0x117/0x320
> |  swap_writepage+0x34/0x70
> |  pageout+0x268/0x7c0
> |  shrink_page_list+0x13e1/0x1e80
> |  shrink_inactive_list+0x303/0xde0
> |  shrink_lruvec+0x3dd/0x660
> |  shrink_node_memcgs+0x3a1/0x600
> |  shrink_node+0x3a7/0x1350
> |  shrink_zones+0x1f1/0x7f0
> |  do_try_to_free_pages+0x219/0xcc0
> |  try_to_free_pages+0x1c5/0x4b0
> |  __perform_reclaim+0x18f/0x2c0
> |  __alloc_pages_slowpath.constprop.0+0x7ea/0x1790
> |  __alloc_pages_nodemask+0x5f5/0x700
> |  page_cache_ra_unbounded+0x30f/0x690
> |  do_sync_mmap_readahead+0x3e3/0x640
> |  filemap_fault+0x981/0x1110
> |  __xfs_filemap_fault+0x12d/0x840
> |  __do_fault+0xf3/0x4b0
> |  do_fault+0x202/0x8c0
> |  __handle_mm_fault+0x338/0x500
> |  handle_mm_fault+0x1a8/0x670
> |  do_user_addr_fault+0x409/0x8b0
> |  exc_page_fault+0x60/0xc0
> |  asm_exc_page_fault+0x1e/0x30
> |
> | Freed by task 71 (oom_reaper):
> |  kasan_save_stack+0x1b/0x40
> |  kasan_set_track+0x1c/0x30
> |  kasan_set_free_info+0x1b/0x30
> |  __kasan_slab_free+0x110/0x150
> |  kmem_cache_free+0x7f/0x450
> |  z3fold_free+0x1f8/0xc90
> |  zswap_free_entry+0x168/0x230
> |  zswap_frontswap_invalidate_page+0x145/0x190
> |  __frontswap_invalidate_page+0xe8/0x1a0
> |  swap_range_free.constprop.0+0x266/0x300
> |  swapcache_free_entries+0x1dc/0x970
> |  free_swap_slot+0x19c/0x290
> |  __swap_entry_free+0x139/0x160
> |  free_swap_and_cache+0xda/0x230
> |  zap_pte_range+0x275/0x1590
> |  unmap_page_range+0x320/0x690
> |  __oom_reap_task_mm+0x207/0x330
> |  oom_reap_task_mm+0x78/0x7e0
> |  oom_reap_task+0x6d/0x1a0
> |  oom_reaper+0x103/0x290
> |  kthread+0x401/0x4e0
> |  ret_from_fork+0x22/0x30
> |
> | The buggy address belongs to the object at 88800e0e1080
> |  which belongs to the cache z3fold_handle of size 88
> | The buggy address is located 42 bytes inside of
> |  88-byte region [88800e0e1080, 88800e0e10d8)
> | The buggy address belongs to the page:
> | page:2ba661bc refcount:1 mapcount:0 mapping: 
> index:0x0 pfn:0xe0e1
> | flags: 0x80200(slab)
> | raw: 00080200 dead0100 dead0122 88800aa4eb40
> | raw:  00200020 0001 
> | page dumped because: kasan: bad access detected
> |
> | Memory state around the buggy address:
> |  88800e0e0f80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> |  88800e0e1000: 00 00 00 00 00 00 00 00 00 00 00 fc fc fc fc fc
> | >88800e0e1080: fa fb fb fb fb fb fb fb fb fb fb fc fc fc fc fc
> |   ^
> |  88800e0e1100: 00 00 00 00 00 00 00 00 00 00 00 fc fc fc fc fc
> |  88800e0e1180: 00 00 00 00 00 00 00 00 00 00 00 fc fc fc fc fc
> | ==
> | Disabling lock debugging due to kernel taint
>
> with the lock I haven't seen anything.
>
> Sebastian


[PATCH v2] RISC-V: enable XIP

2020-12-03 Thread Vitaly Wool
Introduce XIP (eXecute In Place) support for RISC-V platforms.
It allows code to be executed directly from non-volatile storage
directly addressable by the CPU, such as QSPI NOR flash which can
be found on many RISC-V platforms. This makes way for significant
optimization of RAM footprint. The XIP kernel is not compressed
since it has to run directly from flash, so it will occupy more
space on the non-volatile storage. The physical flash address
used to link the kernel object files and for storing it has to
be known at compile time and is represented by a Kconfig option.

XIP on RISC-V will currently only work on MMU-enabled kernels.

Changed in v2:
- dedicated macro for XIP address fixup when MMU is not enabled yet
  o both for 32-bit and 64-bit RISC-V
- SP is explicitly set to a safe place in RAM before __copy_data call
- removed redundant alignment requirements in vmlinux-xip.lds.S
- changed long -> uintptr_t typecast in __XIP_FIXUP macro.

Signed-off-by: Vitaly Wool 
---
 arch/riscv/Kconfig  |  40 -
 arch/riscv/Makefile |   8 +-
 arch/riscv/boot/Makefile|  14 ++-
 arch/riscv/include/asm/pgtable.h|  54 ++--
 arch/riscv/kernel/head.S|  54 +++-
 arch/riscv/kernel/head.h|   3 +
 arch/riscv/kernel/setup.c   |   2 +-
 arch/riscv/kernel/vmlinux-xip.lds.S | 132 
 arch/riscv/kernel/vmlinux.lds.S |   6 ++
 arch/riscv/mm/init.c| 132 ++--
 10 files changed, 423 insertions(+), 22 deletions(-)
 create mode 100644 arch/riscv/kernel/vmlinux-xip.lds.S

diff --git a/arch/riscv/Kconfig b/arch/riscv/Kconfig
index 44377fd7860e..c9bef841c884 100644
--- a/arch/riscv/Kconfig
+++ b/arch/riscv/Kconfig
@@ -395,7 +395,7 @@ config EFI_STUB
 
 config EFI
bool "UEFI runtime support"
-   depends on OF
+   depends on OF && !XIP_KERNEL
select LIBFDT
select UCS2_STRING
select EFI_PARAMS_FROM_FDT
@@ -412,6 +412,44 @@ config EFI
  allow the kernel to be booted as an EFI application. This
  is only useful on systems that have UEFI firmware.
 
+config XIP_KERNEL
+   bool "Kernel Execute-In-Place from ROM"
+   depends on MMU
+   help
+ Execute-In-Place allows the kernel to run from non-volatile storage
+ directly addressable by the CPU, such as NOR flash. This saves RAM
+ space since the text section of the kernel is not loaded from flash
+ to RAM.  Read-write sections, such as the data section and stack,
+ are still copied to RAM.  The XIP kernel is not compressed since
+ it has to run directly from flash, so it will take more space to
+ store it.  The flash address used to link the kernel object files,
+ and for storing it, is configuration dependent. Therefore, if you
+ say Y here, you must know the proper physical address where to
+ store the kernel image depending on your own flash memory usage.
+
+ Also note that the make target becomes "make xipImage" rather than
+ "make zImage" or "make Image".  The final kernel binary to put in
+ ROM memory will be arch/riscv/boot/xipImage.
+
+ If unsure, say N.
+
+config XIP_PHYS_ADDR
+   hex "XIP Kernel Physical Location"
+   depends on XIP_KERNEL
+   default "0x21000000"
+   help
+ This is the physical address in your flash memory the kernel will
+ be linked for and stored to.  This address is dependent on your
+ own flash usage.
+
+config XIP_PHYS_RAM_BASE
+   hex "Platform Physical RAM address"
+   depends on XIP_KERNEL
+   default "0x80000000"
+   help
+ This is the physical address of RAM in the system. It has to be
+ explicitly specified to run early relocations of read-write data
+ from flash to RAM.
 endmenu
 
 config BUILTIN_DTB
diff --git a/arch/riscv/Makefile b/arch/riscv/Makefile
index 0289a97325d1..387afe973530 100644
--- a/arch/riscv/Makefile
+++ b/arch/riscv/Makefile
@@ -70,7 +70,11 @@ CHECKFLAGS += -D__riscv -D__riscv_xlen=$(BITS)
 
 # Default target when executing plain make
 boot   := arch/riscv/boot
+ifeq ($(CONFIG_XIP_KERNEL),y)
+KBUILD_IMAGE := $(boot)/xipImage
+else
 KBUILD_IMAGE   := $(boot)/Image.gz
+endif
 
 head-y := arch/riscv/kernel/head.o
 
@@ -83,12 +87,14 @@ PHONY += vdso_install
 vdso_install:
$(Q)$(MAKE) $(build)=arch/riscv/kernel/vdso $@
 
+ifneq ($(CONFIG_XIP_KERNEL),y)
 ifeq ($(CONFIG_RISCV_M_MODE)$(CONFIG_SOC_KENDRYTE),yy)
 KBUILD_IMAGE := $(boot)/loader.bin
 else
 KBUILD_IMAGE := $(boot)/Image.gz
 endif
-BOOT_TARGETS := Image Image.gz loader loader.bin
+endif
+BOOT_TARGETS := Image Image.gz loader loader.bin xipImage
 
 all:   $(notdir $(KBUILD_IMAGE))
 
diff --git a/arch/riscv/boot/Makefile b/arch/riscv/boot/

Re: [PATCH] arch/riscv: enable XIP

2020-12-02 Thread Vitaly Wool
On Wed, Dec 2, 2020 at 7:06 PM Nicolas Pitre  wrote:
>
> On Wed, 2 Dec 2020, Vitaly Wool wrote:
>
> > Introduce XIP (eXecute In Place) support for RISC-V platforms.
> > It allows code to be executed directly from non-volatile storage
> > directly addressable by the CPU, such as QSPI NOR flash which can
> > be found on many RISC-V platforms. This makes way for significant
> > optimization of RAM footprint. The XIP kernel is not compressed
> > since it has to run directly from flash, so it will occupy more
> > space on the non-volatile storage. The physical flash address
> > used to link the kernel object files and for storing it has to
> > be known at compile time and is represented by a Kconfig option.
> >
> > XIP on RISC-V will currently only work on MMU-enabled kernels.
> >
> > Signed-off-by: Vitaly Wool 
>
> That's nice!
>
> Suggestion for a future enhancement:
> To save on ROM storage, and given that the .data segment has to be
> copied to RAM anyway, you could store .data compressed and decompress it
> to RAM instead. See commit ca8b5d97d6bf for inspiration. In fact, many
> parts there could be shared.

Thanks! That's in my TODO list.

> More comments below.
>
> > +#define __XIP_FIXUP(addr) \
> > + (((long)(addr) >= CONFIG_XIP_PHYS_ADDR && \
> > +   (long)(addr) <= CONFIG_XIP_PHYS_ADDR + SZ_16M) ? \
> > + (long)(addr) - CONFIG_XIP_PHYS_ADDR + CONFIG_XIP_PHYS_RAM_BASE - 
> > XIP_OFFSET : \
> > + (long)(addr))
>
> Here you should cast to unsigned long instead.

Right, or (just thought of it) uintptr_t for that matter. Does that sound right?
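
Concretely, I have something like this in mind (a sketch only; same
bounds as in the current patch, with a statement expression so the
argument is evaluated once):

#define __XIP_FIXUP(addr) ({						\
	uintptr_t __a = (uintptr_t)(addr);				\
	(__a >= CONFIG_XIP_PHYS_ADDR &&					\
	 __a <= CONFIG_XIP_PHYS_ADDR + SZ_16M) ?			\
		__a - CONFIG_XIP_PHYS_ADDR +				\
			CONFIG_XIP_PHYS_RAM_BASE - XIP_OFFSET :		\
		__a;							\
	})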

> > +#ifdef CONFIG_XIP_KERNEL
> > + la a0, _trampoline_pg_dir
> > + lw t0, _xip_fixup
> > + add a0, a0, t0
> [...]
> > +_xip_fixup:
> > + .dword CONFIG_XIP_PHYS_RAM_BASE - CONFIG_XIP_PHYS_ADDR - XIP_OFFSET
> > +#endif
>
> Here _xip_fixup is a dword but you're loading it as a word.
> This won't work for both rv32 and rv64.

Well, at this point I believe it does, as long as we use little
endian. The 64-bit version has been verified.
I don't dispute, though, that it isn't nice and should be fixed.
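
One way to fix it would be the REG_L/RISCV_PTR macros from asm/asm.h,
e.g. (a sketch; it assumes head.S pulls in that header):

#ifdef CONFIG_XIP_KERNEL
	la a0, _trampoline_pg_dir
	REG_L t0, _xip_fixup
	add a0, a0, t0
[...]
_xip_fixup:
	RISCV_PTR CONFIG_XIP_PHYS_RAM_BASE - CONFIG_XIP_PHYS_ADDR - XIP_OFFSET
#endif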

> > +SECTIONS
> > +{
> > + /* Beginning of code and text segment */
> > + . = XIP_VIRT_ADDR(CONFIG_XIP_PHYS_ADDR);
> > + _xiprom = .;
> > + _start = .;
> > + HEAD_TEXT_SECTION
> > + INIT_TEXT_SECTION(PAGE_SIZE)
> > + /* we have to discard exit text and such at runtime, not link time */
> > + .exit.text :
> > + {
> > + EXIT_TEXT
> > + }
> > +
> > + .text : {
> > + _text = .;
> > + _stext = .;
> > + TEXT_TEXT
> > + SCHED_TEXT
> > + CPUIDLE_TEXT
> > + LOCK_TEXT
> > + KPROBES_TEXT
> > + ENTRY_TEXT
> > + IRQENTRY_TEXT
> > + SOFTIRQENTRY_TEXT
> > + *(.fixup)
> > + _etext = .;
> > + }
> > + RO_DATA(L1_CACHE_BYTES)
> > + .srodata : {
> > + *(.srodata*)
> > + }
> > + .init.rodata : {
> > + INIT_SETUP(16)
> > + INIT_CALLS
> > + CON_INITCALL
> > + INIT_RAM_FS
> > + }
> > + _exiprom = ALIGN(PAGE_SIZE);/* End of XIP ROM area */
>
> Why do you align this to a page size?

TBH I just cut corners here and below so that I didn't have to worry
about partial pages and such.


> > +
> > +
> > +/*
> > + * From this point, stuff is considered writable and will be copied to RAM
> > + */
> > + __data_loc = ALIGN(PAGE_SIZE);  /* location in file */
>
> Same question here?
>
> > + . = PAGE_OFFSET;/* location in memory */
> > +
> > + _sdata = .; /* Start of data section */
> > + _data = .;
> > + RW_DATA(L1_CACHE_BYTES, PAGE_SIZE, THREAD_SIZE)
> > + _edata = .;
> > + __start_ro_after_init = .;
> > + .data.ro_after_init : AT(ADDR(.data.ro_after_init) - LOAD_OFFSET) {
> > + *(.data..ro_after_init)
> > + }
> > + __end_ro_after_init = .;
> > +
> > + . = ALIGN(PAGE_SIZE);
>
> And again here?
>
> > +#ifdef CONFIG_XIP_KERNEL
> > +/* called from head.S with MMU off */
> > +asmlinkage void __init __copy_data(void)
> > +{
> > + void *from = (void *)(&_sdata);
> > + void *end = (void *)(&_end);
> > + void *to = (void *)CONFIG_XIP_PHYS_RAM_BASE;
> > + size_t sz = (size_t)(end - from);
> > +
> > + memcpy(to, from, sz);
> > +}
> > +#endif
>
> Where is the stack located when this executes? The stack for the init
> task is typically found within the .data area. At least on ARM it is.
> You don't want to overwrite your stack here.

sp is set to a location within the .data area later; here we rely on
the U-Boot sp setting, which is outside of the destination area for the
image parts that are copied. I agree that this makes the implementation
fragile and will explicitly set sp in the next patch version.
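
Roughly along these lines in head.S, before the call (a sketch; the
exact fixup macro name and the choice of _end + THREAD_SIZE for the
stack location are assumptions at this point):

#ifdef CONFIG_XIP_KERNEL
	/* point sp at RAM, past the area __copy_data is about to fill */
	la sp, _end + THREAD_SIZE
	XIP_FIXUP_OFFSET sp
	call __copy_data
#endif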

Best regards,
   Vitaly


[PATCH] arch/riscv: enable XIP

2020-12-02 Thread Vitaly Wool
Introduce XIP (eXecute In Place) support for RISC-V platforms.
It allows code to be executed directly from non-volatile storage
directly addressable by the CPU, such as QSPI NOR flash which can
be found on many RISC-V platforms. This makes way for significant
optimization of RAM footprint. The XIP kernel is not compressed
since it has to run directly from flash, so it will occupy more
space on the non-volatile storage. The physical flash address
used to link the kernel object files and for storing it has to
be known at compile time and is represented by a Kconfig option.

XIP on RISC-V will currently only work on MMU-enabled kernels.

Signed-off-by: Vitaly Wool 
---
 arch/riscv/Kconfig  |  40 +++-
 arch/riscv/Makefile |   8 +-
 arch/riscv/boot/Makefile|  14 ++-
 arch/riscv/include/asm/pgtable.h|  53 +--
 arch/riscv/kernel/head.S|  35 ++-
 arch/riscv/kernel/head.h|   3 +
 arch/riscv/kernel/setup.c   |   2 +-
 arch/riscv/kernel/vmlinux-xip.lds.S | 132 +++
 arch/riscv/kernel/vmlinux.lds.S |   6 ++
 arch/riscv/mm/Makefile  |   1 +
 arch/riscv/mm/init.c| 136 +---
 11 files changed, 408 insertions(+), 22 deletions(-)
 create mode 100644 arch/riscv/kernel/vmlinux-xip.lds.S

diff --git a/arch/riscv/Kconfig b/arch/riscv/Kconfig
index 44377fd7860e..c9bef841c884 100644
--- a/arch/riscv/Kconfig
+++ b/arch/riscv/Kconfig
@@ -395,7 +395,7 @@ config EFI_STUB
 
 config EFI
bool "UEFI runtime support"
-   depends on OF
+   depends on OF && !XIP_KERNEL
select LIBFDT
select UCS2_STRING
select EFI_PARAMS_FROM_FDT
@@ -412,6 +412,44 @@ config EFI
  allow the kernel to be booted as an EFI application. This
  is only useful on systems that have UEFI firmware.
 
+config XIP_KERNEL
+   bool "Kernel Execute-In-Place from ROM"
+   depends on MMU
+   help
+ Execute-In-Place allows the kernel to run from non-volatile storage
+ directly addressable by the CPU, such as NOR flash. This saves RAM
+ space since the text section of the kernel is not loaded from flash
+ to RAM.  Read-write sections, such as the data section and stack,
+ are still copied to RAM.  The XIP kernel is not compressed since
+ it has to run directly from flash, so it will take more space to
+ store it.  The flash address used to link the kernel object files,
+ and for storing it, is configuration dependent. Therefore, if you
+ say Y here, you must know the proper physical address where to
+ store the kernel image depending on your own flash memory usage.
+
+ Also note that the make target becomes "make xipImage" rather than
+ "make zImage" or "make Image".  The final kernel binary to put in
+ ROM memory will be arch/riscv/boot/xipImage.
+
+ If unsure, say N.
+
+config XIP_PHYS_ADDR
+   hex "XIP Kernel Physical Location"
+   depends on XIP_KERNEL
+   default "0x21000000"
+   help
+ This is the physical address in your flash memory the kernel will
+ be linked for and stored to.  This address is dependent on your
+ own flash usage.
+
+config XIP_PHYS_RAM_BASE
+   hex "Platform Physical RAM address"
+   depends on XIP_KERNEL
+   default "0x80000000"
+   help
+ This is the physical address of RAM in the system. It has to be
+ explicitly specified to run early relocations of read-write data
+ from flash to RAM.
 endmenu
 
 config BUILTIN_DTB
diff --git a/arch/riscv/Makefile b/arch/riscv/Makefile
index 0289a97325d1..387afe973530 100644
--- a/arch/riscv/Makefile
+++ b/arch/riscv/Makefile
@@ -70,7 +70,11 @@ CHECKFLAGS += -D__riscv -D__riscv_xlen=$(BITS)
 
 # Default target when executing plain make
 boot   := arch/riscv/boot
+ifeq ($(CONFIG_XIP_KERNEL),y)
+KBUILD_IMAGE := $(boot)/xipImage
+else
 KBUILD_IMAGE   := $(boot)/Image.gz
+endif
 
 head-y := arch/riscv/kernel/head.o
 
@@ -83,12 +87,14 @@ PHONY += vdso_install
 vdso_install:
$(Q)$(MAKE) $(build)=arch/riscv/kernel/vdso $@
 
+ifneq ($(CONFIG_XIP_KERNEL),y)
 ifeq ($(CONFIG_RISCV_M_MODE)$(CONFIG_SOC_KENDRYTE),yy)
 KBUILD_IMAGE := $(boot)/loader.bin
 else
 KBUILD_IMAGE := $(boot)/Image.gz
 endif
-BOOT_TARGETS := Image Image.gz loader loader.bin
+endif
+BOOT_TARGETS := Image Image.gz loader loader.bin xipImage
 
 all:   $(notdir $(KBUILD_IMAGE))
 
diff --git a/arch/riscv/boot/Makefile b/arch/riscv/boot/Makefile
index c59fca695f9d..bda88bec0ad7 100644
--- a/arch/riscv/boot/Makefile
+++ b/arch/riscv/boot/Makefile
@@ -17,8 +17,20 @@
 KCOV_INSTRUMENT := n
 
 OBJCOPYFLAGS_Image :=-O binary -R .note -R .note.gnu.build-id -R .comment -S
+OBJCOPYFLAGS_xipImage :=-O binary -R .note -R .not

Re: [PATCH] riscv: toggle mmu_enabled flag in a precise manner

2020-12-01 Thread Vitaly Wool
On Tue, Dec 1, 2020 at 6:40 PM Atish Patra  wrote:
>
> On Tue, Dec 1, 2020 at 1:01 AM  wrote:
> >
> > From: Vitaly Wool 
> >
> > Currently the mmu_enabled flag is set to true way later than the actual
> > MMU enablement takes place. This leads to hard-to-track races in
> > e. g. SBI earlycon initialization taking the wrong path when configuring
> > the fixmap.
> >
>
> This code path is significantly changed in 5.10-rcX with UEFI series.
> https://patchwork.kernel.org/project/linux-riscv/patch/20200917223716.2300238-4-atish.pa...@wdc.com/
>
> Can you check if you can still reproduce the issue you were seeing
> with the latest upstream kernel ?
> If yes, please share the steps to reproduce the issue.

No, I don't think I can reproduce it now, thanks!

~Vitaly

> > To fix that, move mmu_enabled toggling to head.S and rename it to
> > _mmu_enabled to avoid possible name clashes since it's not a static
> > variable any more.
> >
> > Signed-off-by: Vitaly Wool 
> > ---
> >  arch/riscv/kernel/head.S |  9 +
> >  arch/riscv/mm/init.c | 13 +
> >  2 files changed, 14 insertions(+), 8 deletions(-)
> >
> > diff --git a/arch/riscv/kernel/head.S b/arch/riscv/kernel/head.S
> > index 0a4e81b8dc79..33cd57285be3 100644
> > --- a/arch/riscv/kernel/head.S
> > +++ b/arch/riscv/kernel/head.S
> > @@ -248,6 +248,10 @@ clear_bss_done:
> > call relocate
> >  #endif /* CONFIG_MMU */
> >
> > +   la a0, _mmu_enabled
> > +   li a1, 1
> > +   sw a1, (a0)
> > +
> > call setup_trap_vector
> > /* Restore C environment */
> > la tp, init_task
> > @@ -370,6 +374,11 @@ ENTRY(reset_regs)
> >  END(reset_regs)
> >  #endif /* CONFIG_RISCV_M_MODE */
> >
> > +.section ".data"
> > +   .global _mmu_enabled
> > +_mmu_enabled:
> > +   .word 0
> > +
> >  __PAGE_ALIGNED_BSS
> > /* Empty zero page */
> > .balign PAGE_SIZE
> > diff --git a/arch/riscv/mm/init.c b/arch/riscv/mm/init.c
> > index 787c75f751a5..4038be635e25 100644
> > --- a/arch/riscv/mm/init.c
> > +++ b/arch/riscv/mm/init.c
> > @@ -211,7 +211,7 @@ EXPORT_SYMBOL(pfn_base);
> >  pgd_t swapper_pg_dir[PTRS_PER_PGD] __page_aligned_bss;
> >  pgd_t trampoline_pg_dir[PTRS_PER_PGD] __page_aligned_bss;
> >  pte_t fixmap_pte[PTRS_PER_PTE] __page_aligned_bss;
> > -static bool mmu_enabled;
> > +extern bool _mmu_enabled;
> >
> >  #define MAX_EARLY_MAPPING_SIZE SZ_128M
> >
> > @@ -236,7 +236,7 @@ void __set_fixmap(enum fixed_addresses idx, phys_addr_t 
> > phys, pgprot_t prot)
> >
> >  static pte_t *__init get_pte_virt(phys_addr_t pa)
> >  {
> > -   if (mmu_enabled) {
> > +   if (_mmu_enabled) {
> > clear_fixmap(FIX_PTE);
> > return (pte_t *)set_fixmap_offset(FIX_PTE, pa);
> > } else {
> > @@ -250,7 +250,7 @@ static phys_addr_t __init alloc_pte(uintptr_t va)
> >  * We only create PMD or PGD early mappings so we
> >  * should never reach here with MMU disabled.
> >  */
> > -   BUG_ON(!mmu_enabled);
> > +   BUG_ON(!_mmu_enabled);
> >
> > return memblock_phys_alloc(PAGE_SIZE, PAGE_SIZE);
> >  }
> > @@ -281,7 +281,7 @@ pmd_t early_pmd[PTRS_PER_PMD * NUM_EARLY_PMDS] 
> > __initdata __aligned(PAGE_SIZE);
> >
> >  static pmd_t *__init get_pmd_virt(phys_addr_t pa)
> >  {
> > -   if (mmu_enabled) {
> > +   if (_mmu_enabled) {
> > clear_fixmap(FIX_PMD);
> > return (pmd_t *)set_fixmap_offset(FIX_PMD, pa);
> > } else {
> > @@ -293,7 +293,7 @@ static phys_addr_t __init alloc_pmd(uintptr_t va)
> >  {
> > uintptr_t pmd_num;
> >
> > -   if (mmu_enabled)
> > +   if (_mmu_enabled)
> > return memblock_phys_alloc(PAGE_SIZE, PAGE_SIZE);
> >
> > pmd_num = (va - PAGE_OFFSET) >> PGDIR_SHIFT;
> > @@ -467,9 +467,6 @@ static void __init setup_vm_final(void)
> > phys_addr_t pa, start, end;
> > struct memblock_region *reg;
> >
> > -   /* Set mmu_enabled flag */
> > -   mmu_enabled = true;
> > -
> > /* Setup swapper PGD for fixmap */
> > create_pgd_mapping(swapper_pg_dir, FIXADDR_START,
> >__pa_symbol(fixmap_pgd_next),
> > --
> > 2.20.1
> >
> >
> > ___
> > linux-riscv mailing list
> > linux-ri...@lists.infradead.org
> > http://lists.infradead.org/mailman/listinfo/linux-riscv
>
>
>
> --
> Regards,
> Atish


[PATCH] riscv: toggle mmu_enabled flag in a precise manner

2020-12-01 Thread vitaly . wool
From: Vitaly Wool 

Currently the mmu_enabled flag is set to true way later than the actual
MMU enablement takes place. This leads to hard-to-track races in
e. g. SBI earlycon initialization taking the wrong path when configuring
the fixmap.

To fix that, move mmu_enabled toggling to head.S and rename it to
_mmu_enabled to avoid possible name clashes since it's not a static
variable any more.

Signed-off-by: Vitaly Wool 
---
 arch/riscv/kernel/head.S |  9 +
 arch/riscv/mm/init.c | 13 +
 2 files changed, 14 insertions(+), 8 deletions(-)

diff --git a/arch/riscv/kernel/head.S b/arch/riscv/kernel/head.S
index 0a4e81b8dc79..33cd57285be3 100644
--- a/arch/riscv/kernel/head.S
+++ b/arch/riscv/kernel/head.S
@@ -248,6 +248,10 @@ clear_bss_done:
call relocate
 #endif /* CONFIG_MMU */
 
+   la a0, _mmu_enabled
+   li a1, 1
+   sw a1, (a0)
+
call setup_trap_vector
/* Restore C environment */
la tp, init_task
@@ -370,6 +374,11 @@ ENTRY(reset_regs)
 END(reset_regs)
 #endif /* CONFIG_RISCV_M_MODE */
 
+.section ".data"
+   .global _mmu_enabled
+_mmu_enabled:
+   .word 0
+
 __PAGE_ALIGNED_BSS
/* Empty zero page */
.balign PAGE_SIZE
diff --git a/arch/riscv/mm/init.c b/arch/riscv/mm/init.c
index 787c75f751a5..4038be635e25 100644
--- a/arch/riscv/mm/init.c
+++ b/arch/riscv/mm/init.c
@@ -211,7 +211,7 @@ EXPORT_SYMBOL(pfn_base);
 pgd_t swapper_pg_dir[PTRS_PER_PGD] __page_aligned_bss;
 pgd_t trampoline_pg_dir[PTRS_PER_PGD] __page_aligned_bss;
 pte_t fixmap_pte[PTRS_PER_PTE] __page_aligned_bss;
-static bool mmu_enabled;
+extern bool _mmu_enabled;
 
 #define MAX_EARLY_MAPPING_SIZE SZ_128M
 
@@ -236,7 +236,7 @@ void __set_fixmap(enum fixed_addresses idx, phys_addr_t 
phys, pgprot_t prot)
 
 static pte_t *__init get_pte_virt(phys_addr_t pa)
 {
-   if (mmu_enabled) {
+   if (_mmu_enabled) {
clear_fixmap(FIX_PTE);
return (pte_t *)set_fixmap_offset(FIX_PTE, pa);
} else {
@@ -250,7 +250,7 @@ static phys_addr_t __init alloc_pte(uintptr_t va)
 * We only create PMD or PGD early mappings so we
 * should never reach here with MMU disabled.
 */
-   BUG_ON(!mmu_enabled);
+   BUG_ON(!_mmu_enabled);
 
return memblock_phys_alloc(PAGE_SIZE, PAGE_SIZE);
 }
@@ -281,7 +281,7 @@ pmd_t early_pmd[PTRS_PER_PMD * NUM_EARLY_PMDS] __initdata 
__aligned(PAGE_SIZE);
 
 static pmd_t *__init get_pmd_virt(phys_addr_t pa)
 {
-   if (mmu_enabled) {
+   if (_mmu_enabled) {
clear_fixmap(FIX_PMD);
return (pmd_t *)set_fixmap_offset(FIX_PMD, pa);
} else {
@@ -293,7 +293,7 @@ static phys_addr_t __init alloc_pmd(uintptr_t va)
 {
uintptr_t pmd_num;
 
-   if (mmu_enabled)
+   if (_mmu_enabled)
return memblock_phys_alloc(PAGE_SIZE, PAGE_SIZE);
 
pmd_num = (va - PAGE_OFFSET) >> PGDIR_SHIFT;
@@ -467,9 +467,6 @@ static void __init setup_vm_final(void)
phys_addr_t pa, start, end;
struct memblock_region *reg;
 
-   /* Set mmu_enabled flag */
-   mmu_enabled = true;
-
/* Setup swapper PGD for fixmap */
create_pgd_mapping(swapper_pg_dir, FIXADDR_START,
   __pa_symbol(fixmap_pgd_next),
-- 
2.20.1



Re: [PATCH v6] mm/zswap: move to use crypto_acomp API for hardware acceleration

2020-09-28 Thread Vitaly Wool
On Tue, Aug 18, 2020 at 2:34 PM Barry Song  wrote:
>
> Right now, all new ZIP drivers are adapted to crypto_acomp APIs rather
> than legacy crypto_comp APIs. Tradiontal ZIP drivers like lz4,lzo etc
> have been also wrapped into acomp via scomp backend. But zswap.c is still
> using the old APIs. That means zswap won't be able to work on any new
> ZIP drivers in kernel.
>
> This patch moves to use cryto_acomp APIs to fix the disconnected bridge
> between new ZIP drivers and zswap. It is probably the first real user
> to use acomp but perhaps not a good example to demonstrate how multiple
> acomp requests can be executed in parallel in one acomp instance.
> frontswap is doing page load and store page by page synchronously.
> swap_writepage() depends on the completion of frontswap_store() to
> decide if it should call __swap_writepage() to swap to disk.
>
> However this patch creates multiple acomp instances, so multiple threads
> running on multiple different cpus can actually do (de)compression
> parallelly, leveraging the power of multiple ZIP hardware queues. This
> is also consistent with frontswap's page management model.
>
> The old zswap code uses atomic context and avoids the race conditions
> while shared resources like zswap_dstmem are accessed. Here since acomp
> can sleep, per-cpu mutex is used to replace preemption-disable.
>
> While it is possible to make mm/page_io.c and mm/frontswap.c support
> async (de)compression in some way, the entire design requires careful
> thinking and performance evaluation. For the first step, the base with
> fixed connection between ZIP drivers and zswap should be built.
>
> Cc: Luis Claudio R. Goncalves 
> Cc: Sebastian Andrzej Siewior 
> Cc: Andrew Morton 
> Cc: Herbert Xu 
> Cc: David S. Miller 
> Cc: Mahipal Challa 
> Cc: Seth Jennings 
> Cc: Dan Streetman 
> Cc: Vitaly Wool 
> Cc: Zhou Wang 
> Cc: Hao Fang 
> Cc: Colin Ian King 
> Signed-off-by: Barry Song 

Acked-by: Vitaly Wool 

> ---
>  -v6:
>  * rebase on top of 5.9-rc1;
>  * move to crypto_alloc_acomp_node() API to use local ZIP hardware
>
>  mm/zswap.c | 183 -
>  1 file changed, 138 insertions(+), 45 deletions(-)
>
> diff --git a/mm/zswap.c b/mm/zswap.c
> index fbb782924ccc..00b5f14a7332 100644
> --- a/mm/zswap.c
> +++ b/mm/zswap.c
> @@ -24,8 +24,10 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  #include 
>  #include 
> +#include 
>
>  #include 
>  #include 
> @@ -127,9 +129,17 @@ module_param_named(same_filled_pages_enabled, 
> zswap_same_filled_pages_enabled,
>  * data structures
>  **/
>
> +struct crypto_acomp_ctx {
> +   struct crypto_acomp *acomp;
> +   struct acomp_req *req;
> +   struct crypto_wait wait;
> +   u8 *dstmem;
> +   struct mutex *mutex;
> +};
> +
>  struct zswap_pool {
> struct zpool *zpool;
> -   struct crypto_comp * __percpu *tfm;
> +   struct crypto_acomp_ctx __percpu *acomp_ctx;
> struct kref kref;
> struct list_head list;
> struct work_struct release_work;
> @@ -388,23 +398,43 @@ static struct zswap_entry *zswap_entry_find_get(struct 
> rb_root *root,
>  * per-cpu code
>  **/
>  static DEFINE_PER_CPU(u8 *, zswap_dstmem);
> +/*
> + * If users dynamically change the zpool type and compressor at runtime, i.e.
> + * zswap is running, zswap can have more than one zpool on one cpu, but they
> * are sharing dstmem. So we need this mutex to be per-cpu.
> + */
> +static DEFINE_PER_CPU(struct mutex *, zswap_mutex);
>
>  static int zswap_dstmem_prepare(unsigned int cpu)
>  {
> +   struct mutex *mutex;
> u8 *dst;
>
> dst = kmalloc_node(PAGE_SIZE * 2, GFP_KERNEL, cpu_to_node(cpu));
> if (!dst)
> return -ENOMEM;
>
> +   mutex = kmalloc_node(sizeof(*mutex), GFP_KERNEL, cpu_to_node(cpu));
> +   if (!mutex) {
> +   kfree(dst);
> +   return -ENOMEM;
> +   }
> +
> +   mutex_init(mutex);
> per_cpu(zswap_dstmem, cpu) = dst;
> +   per_cpu(zswap_mutex, cpu) = mutex;
> return 0;
>  }
>
>  static int zswap_dstmem_dead(unsigned int cpu)
>  {
> +   struct mutex *mutex;
> u8 *dst;
>
> +   mutex = per_cpu(zswap_mutex, cpu);
> +   kfree(mutex);
> +   per_cpu(zswap_mutex, cpu) = NULL;
> +
> dst = per_cpu(zswap_dstmem, cpu);
> kfree(dst);
> per_cpu(zswap_dstmem, cpu) = NULL;
> @@ -415,30 +445,54 @@ static int zswap_dstmem_dead(unsigned int cpu)
>  static int zswap_cpu_comp_prepare(unsign
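
For reference, a per-CPU context like the one introduced in this patch is used at
compression time roughly as follows. This is only an illustrative sketch based on the
commit message and the structures above; the function name zswap_compress_page() is
made up here and is not part of the patch:

static int zswap_compress_page(struct zswap_pool *pool, struct page *page,
			       unsigned int *dlen)
{
	struct crypto_acomp_ctx *acomp_ctx = raw_cpu_ptr(pool->acomp_ctx);
	struct scatterlist input, output;
	int ret;

	/* The per-cpu mutex serializes access to the shared dstmem buffer. */
	mutex_lock(acomp_ctx->mutex);

	sg_init_table(&input, 1);
	sg_set_page(&input, page, PAGE_SIZE, 0);
	sg_init_one(&output, acomp_ctx->dstmem, PAGE_SIZE * 2);
	acomp_request_set_params(acomp_ctx->req, &input, &output,
				 PAGE_SIZE, PAGE_SIZE * 2);

	/* crypto_wait_req() turns the asynchronous acomp request into a
	 * synchronous call, which matches frontswap's page-by-page model. */
	ret = crypto_wait_req(crypto_acomp_compress(acomp_ctx->req),
			      &acomp_ctx->wait);
	*dlen = acomp_ctx->req->dlen;

	mutex_unlock(acomp_ctx->mutex);
	return ret;
}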

Re: [PATCH][next] mm/zswap: fix a couple of memory leaks and rework kzalloc failure check

2020-06-23 Thread Vitaly Wool
On Tue, Jun 23, 2020, 1:12 PM Colin Ian King  wrote:
>
> On 22/06/2020 20:55, Song Bao Hua (Barry Song) wrote:
> >
> >
> >> -Original Message-
> >> From: Dan Carpenter [mailto:dan.carpen...@oracle.com]
> >> Sent: Tuesday, June 23, 2020 6:28 AM
> >> To: Colin King 
> >> Cc: Seth Jennings ; Dan Streetman
> >> ; Vitaly Wool ; Andrew
> >> Morton ; Song Bao Hua (Barry Song)
> >> ; Stephen Rothwell ;
> >> linux...@kvack.org; kernel-janit...@vger.kernel.org;
> >> linux-kernel@vger.kernel.org
> >> Subject: Re: [PATCH][next] mm/zswap: fix a couple of memory leaks and
> >> rework kzalloc failure check
> >>
> >> On Mon, Jun 22, 2020 at 04:35:46PM +0100, Colin King wrote:
> >>> From: Colin Ian King 
> >>>
> >>> kzalloc failures return NULL on out of memory errors, so replace the
> >>> IS_ERR_OR_NULL check with the usual null pointer check.  Fix two memory
> >>> leaks with on acomp and acomp_ctx by ensuring these objects are free'd
> >>> on the error return path.
> >>>
> >>> Addresses-Coverity: ("Resource leak")
> >>> Fixes: d4f86abd6e35 ("mm/zswap: move to use crypto_acomp API for
> >> hardware acceleration")
> >>> Signed-off-by: Colin Ian King 
> >
> >
> > Colin, thanks for your patch. I am sorry I did the same thing with you here:
> > https://lkml.org/lkml/2020/6/22/347
>
> Thanks for fixing this correctly, I ran out of time yesterday to re-do
> the fix.
>
> Colin

I think this has gotten out of hand. Barry, could you please come up
with a replacement for the initial patch rather than doing it
incrementally?

Thanks,
   Vitaly

>
> >
> >
> >>> ---
> >>>  mm/zswap.c | 16 +++-
> >>>  1 file changed, 11 insertions(+), 5 deletions(-)
> >>>
> >>> diff --git a/mm/zswap.c b/mm/zswap.c
> >>> index 0d914ba6b4a0..14839cbac7ff 100644
> >>> --- a/mm/zswap.c
> >>> +++ b/mm/zswap.c
> >>> @@ -433,23 +433,23 @@ static int zswap_cpu_comp_prepare(unsigned int
> >> cpu, struct hlist_node *node)
> >>> return 0;
> >>>
> >>> acomp_ctx = kzalloc(sizeof(*acomp_ctx), GFP_KERNEL);
> >>> -   if (IS_ERR_OR_NULL(acomp_ctx)) {
> >>> +   if (!acomp_ctx) {
> >>> pr_err("Could not initialize acomp_ctx\n");
> >>> return -ENOMEM;
> >>> }
> >>> acomp = crypto_alloc_acomp(pool->tfm_name, 0, 0);
> >>> -   if (IS_ERR_OR_NULL(acomp)) {
> >>> +   if (!acomp) {
> >>
> >> This should be IS_ERR(acomp).  Please preserve the error code.
> >>
> >>> pr_err("could not alloc crypto acomp %s : %ld\n",
> >>> pool->tfm_name, PTR_ERR(acomp));
> >>> -   return -ENOMEM;
> >>> +   goto free_acomp_ctx;
> >>> }
> >>> acomp_ctx->acomp = acomp;
> >>>
> >>> req = acomp_request_alloc(acomp_ctx->acomp);
> >>> -   if (IS_ERR_OR_NULL(req)) {
> >>> +   if (!req) {
> >>> pr_err("could not alloc crypto acomp %s : %ld\n",
> >>>pool->tfm_name, PTR_ERR(acomp));
> >>> -   return -ENOMEM;
> >>> +   goto free_acomp;
> >>> }
> >>> acomp_ctx->req = req;
> >>>
> >>> @@ -462,6 +462,12 @@ static int zswap_cpu_comp_prepare(unsigned int
> >> cpu, struct hlist_node *node)
> >>> *per_cpu_ptr(pool->acomp_ctx, cpu) = acomp_ctx;
> >>>
> >>> return 0;
> >>> +
> >>> +free_acomp:
> >>> +   kfree(acomp);
> >>
> >> The kfree() isn't correct.  It needs to be:
> >>
> >>  crypto_free_acomp(acomp);
> >>
> >>> +free_acomp_ctx:
> >>> +   kfree(acomp_ctx);
> >>> +   return -ENOMEM;
> >>
> >> regards,
> >> dan carpenter
> >
>


Re: [PATCH v2] mm/zswap: move to use crypto_acomp API for hardware acceleration

2020-06-21 Thread Vitaly Wool
On Sun, Jun 21, 2020 at 1:52 AM Barry Song  wrote:
>
> right now, all new ZIP drivers are using crypto_acomp APIs rather than
> legacy crypto_comp APIs. But zswap.c is still using the old APIs. That
> means zswap won't be able to use any new zip drivers in kernel.
>
> This patch moves to use crypto_acomp APIs to fix the problem. On the
> other hand, traditional compressors like lz4, lzo etc. have been wrapped
> into acomp via scomp backend. So platforms without async compressors
> can fallback to use acomp via scomp backend.
>
> Cc: Luis Claudio R. Goncalves 
> Cc: Sebastian Andrzej Siewior 
> Cc: Andrew Morton 
> Cc: Herbert Xu 
> Cc: David S. Miller 
> Cc: Mahipal Challa 
> Cc: Seth Jennings 
> Cc: Dan Streetman 
> Cc: Vitaly Wool 
> Cc: Zhou Wang 
> Signed-off-by: Barry Song 
> ---
>  -v2:
>  rebase to 5.8-rc1;
>  cleanup commit log;
>  cleanup to improve the readability according to Sebastian's comment
>
>  mm/zswap.c | 153 ++---
>  1 file changed, 110 insertions(+), 43 deletions(-)
>
> diff --git a/mm/zswap.c b/mm/zswap.c
> index fbb782924ccc..0d914ba6b4a0 100644
> --- a/mm/zswap.c
> +++ b/mm/zswap.c
> @@ -24,8 +24,10 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  #include 
>  #include 
> +#include 
>
>  #include 
>  #include 
> @@ -127,9 +129,17 @@ module_param_named(same_filled_pages_enabled, 
> zswap_same_filled_pages_enabled,
>  * data structures
>  **/
>
> +struct crypto_acomp_ctx {
> +   struct crypto_acomp *acomp;
> +   struct acomp_req *req;
> +   struct crypto_wait wait;
> +   u8 *dstmem;
> +   struct mutex mutex;
> +};
> +
>  struct zswap_pool {
> struct zpool *zpool;
> -   struct crypto_comp * __percpu *tfm;
> +   struct crypto_acomp_ctx * __percpu *acomp_ctx;
> struct kref kref;
> struct list_head list;
> struct work_struct release_work;
> @@ -415,30 +425,60 @@ static int zswap_dstmem_dead(unsigned int cpu)
>  static int zswap_cpu_comp_prepare(unsigned int cpu, struct hlist_node *node)
>  {
> struct zswap_pool *pool = hlist_entry(node, struct zswap_pool, node);
> -   struct crypto_comp *tfm;
> +   struct crypto_acomp *acomp;
> +   struct acomp_req *req;
> +   struct crypto_acomp_ctx *acomp_ctx;
>
> -   if (WARN_ON(*per_cpu_ptr(pool->tfm, cpu)))
> +   if (WARN_ON(*per_cpu_ptr(pool->acomp_ctx, cpu)))
> return 0;
>
> -   tfm = crypto_alloc_comp(pool->tfm_name, 0, 0);
> -   if (IS_ERR_OR_NULL(tfm)) {
> -   pr_err("could not alloc crypto comp %s : %ld\n",
> -  pool->tfm_name, PTR_ERR(tfm));
> +   acomp_ctx = kzalloc(sizeof(*acomp_ctx), GFP_KERNEL);
> +   if (IS_ERR_OR_NULL(acomp_ctx)) {
> +   pr_err("Could not initialize acomp_ctx\n");
> +   return -ENOMEM;
> +   }
> +   acomp = crypto_alloc_acomp(pool->tfm_name, 0, 0);
> +   if (IS_ERR_OR_NULL(acomp)) {
> +   pr_err("could not alloc crypto acomp %s : %ld\n",
> +   pool->tfm_name, PTR_ERR(acomp));
> return -ENOMEM;
> }

I bet you actually want to free acomp_ctx here. Overall, could you
please provide a more careful error path implementation or explain why
it isn't necessary?

Best regards,
Vitaly
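
A minimal sketch of the more careful error unwinding being asked for here could look
like the following. This is illustrative only, not the actual fix that was eventually
merged; it simply undoes each allocation in reverse order:

	acomp_ctx = kzalloc(sizeof(*acomp_ctx), GFP_KERNEL);
	if (!acomp_ctx)
		return -ENOMEM;

	acomp = crypto_alloc_acomp(pool->tfm_name, 0, 0);
	if (IS_ERR(acomp)) {
		pr_err("could not alloc crypto acomp %s : %ld\n",
		       pool->tfm_name, PTR_ERR(acomp));
		kfree(acomp_ctx);
		return PTR_ERR(acomp);
	}
	acomp_ctx->acomp = acomp;

	req = acomp_request_alloc(acomp_ctx->acomp);
	if (!req) {
		pr_err("could not alloc acomp request for %s\n", pool->tfm_name);
		crypto_free_acomp(acomp);
		kfree(acomp_ctx);
		return -ENOMEM;
	}
	acomp_ctx->req = req;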

> -   *per_cpu_ptr(pool->tfm, cpu) = tfm;
> +   acomp_ctx->acomp = acomp;
> +
> +   req = acomp_request_alloc(acomp_ctx->acomp);
> +   if (IS_ERR_OR_NULL(req)) {
> +   pr_err("could not alloc crypto acomp %s : %ld\n",
> +  pool->tfm_name, PTR_ERR(acomp));
> +   return -ENOMEM;
> +   }
> +   acomp_ctx->req = req;
> +
> +   mutex_init(&acomp_ctx->mutex);
> +   crypto_init_wait(&acomp_ctx->wait);
> +   acomp_request_set_callback(req, CRYPTO_TFM_REQ_MAY_BACKLOG,
> +  crypto_req_done, &acomp_ctx->wait);
> +
> +   acomp_ctx->dstmem = per_cpu(zswap_dstmem, cpu);
> +   *per_cpu_ptr(pool->acomp_ctx, cpu) = acomp_ctx;
> +
> return 0;
>  }
>
>  static int zswap_cpu_comp_dead(unsigned int cpu, struct hlist_node *node)
>  {
> struct zswap_pool *pool = hlist_entry(node, struct zswap_pool, node);
> -   struct crypto_comp *tfm;
> +   struct crypto_acomp_ctx *acomp_ctx;
> +
> +   acomp_ctx = *per_cpu_ptr(pool->acomp_ctx, cpu);
> +   if (!IS_ERR_OR_NULL(acomp_ctx)) {
> +   if (!IS_ERR_OR_NULL(acomp_ctx->req))
> + 

Re: zswap z3fold + memory offline = infinite loop

2020-05-20 Thread Vitaly Wool
On Tue, May 19, 2020 at 5:50 AM Qian Cai  wrote:

>
> >
> > Removing that check in ->isolate() is not a big deal, but ->migratepage() 
> > shall not allow actual migration anyway if there are mapped objects.
>
> Is that worse than an endless loop here?

Well, let's figure if there really has to be an endless loop. Would
you mind retesting with
https://marc.info/?l=linux-mm&m=158996286816704&w=2?

Best regards,
   Vitaly


Re: [PATCH 0/3] Allow ZRAM to use any zpool-compatible backend

2019-10-21 Thread Vitaly Wool
On Tue, Oct 15, 2019 at 10:00 PM Minchan Kim  wrote:
>
> On Tue, Oct 15, 2019 at 09:39:35AM +0200, Vitaly Wool wrote:
> > Hi Minchan,
> >
> > On Mon, Oct 14, 2019 at 6:41 PM Minchan Kim  wrote:
> > >
> > > On Thu, Oct 10, 2019 at 11:04:14PM +0300, Vitaly Wool wrote:
> > > > The coming patchset is a new take on the old issue: ZRAM can currently 
> > > > be used only with zsmalloc even though this may not be the optimal 
> > > > combination for some configurations. The previous (unsuccessful) 
> > > > attempt dates back to 2015 [1] and is notable for the heated 
> > > > discussions it has caused.
> > > >
> > > > The patchset in [1] had basically the only goal of enabling ZRAM/zbud 
> > > > combo which had a very narrow use case. Things have changed 
> > > > substantially since then, and now, with z3fold used widely as a zswap 
> > > > backend, I, as the z3fold maintainer, am getting requests to 
> > > > re-iterate on making it possible to use ZRAM with any zpool-compatible 
> > > > backend, first of all z3fold.
> > > >
> > > > The preliminary results for this work have been delivered at Linux 
> > > > Plumbers this year [2]. The talk at LPC, though having attracted 
> > > > limited interest, ended in a consensus to continue the work and pursue 
> > > > the goal of decoupling ZRAM from zsmalloc.
> > > >
> > > > The current patchset has been stress tested on arm64 and x86_64 
> > > > devices, including the Dell laptop I'm writing this message on now, not 
> > > > to mention several QEMU configurations.
> > > >
> > > > [1] https://lkml.org/lkml/2015/9/14/356
> > > > [2] https://linuxplumbersconf.org/event/4/contributions/551/
> > >
> > > Please describe what's the usecase in real world, what's the benefit 
> > > zsmalloc
> > > cannot fulfill by design and how it's significant.
> >
> > I'm not entirely sure how to interpret the phrase "the benefit
> > zsmalloc cannot fulfill by design" but let me explain.
> > First, there are multi-core systems where z3fold can provide
> > better throughput.
>
> Please include number in the description with workload.

Sure. So on an HMP 8-core ARM64 system with ZRAM, we run the following command:
fio --bs=4k --randrepeat=1 --randseed=100 --refill_buffers \
--buffer_compress_percentage=50 --scramble_buffers=1 \
--direct=1 --loops=15 --numjobs=4 --filename=/dev/block/zram0 \
 --name=seq-write --rw=write --stonewall --name=seq-read \
 --rw=read --stonewall --name=seq-readwrite --rw=rw --stonewall \
 --name=rand-readwrite --rw=randrw --stonewall

The results are the following:

zsmalloc:
Run status group 0 (all jobs):
  WRITE: io=61440MB, aggrb=1680.4MB/s, minb=430167KB/s,
maxb=440590KB/s, mint=35699msec, maxt=36564msec

Run status group 1 (all jobs):
   READ: io=61440MB, aggrb=1620.4MB/s, minb=414817KB/s,
maxb=414850KB/s, mint=37914msec, maxt=37917msec

Run status group 2 (all jobs):
  READ: io=30615MB, aggrb=897979KB/s, minb=224494KB/s,
maxb=228161KB/s, mint=34351msec, maxt=34912msec
  WRITE: io=30825MB, aggrb=904110KB/s, minb=226027KB/s,
maxb=229718KB/s, mint=34351msec, maxt=34912msec

Run status group 3 (all jobs):
   READ: io=30615MB, aggrb=772002KB/s, minb=193000KB/s,
maxb=193010KB/s, mint=40607msec, maxt=40609msec
  WRITE: io=30825MB, aggrb=777273KB/s, minb=194318KB/s,
maxb=194327KB/s, mint=40607msec, maxt=40609msec

z3fold:
Run status group 0 (all jobs):
  WRITE: io=61440MB, aggrb=1224.8MB/s, minb=313525KB/s,
maxb=329941KB/s, mint=47671msec, maxt=50167msec

Run status group 1 (all jobs):
   READ: io=61440MB, aggrb=3119.3MB/s, minb=798529KB/s,
maxb=862883KB/s, mint=18228msec, maxt=19697msec

Run status group 2 (all jobs):
   READ: io=30615MB, aggrb=937283KB/s, minb=234320KB/s,
maxb=234334KB/s, mint=33446msec, maxt=33448msec
  WRITE: io=30825MB, aggrb=943682KB/s, minb=235920KB/s,
maxb=235934KB/s, mint=33446msec, maxt=33448msec

Run status group 3 (all jobs):
   READ: io=30615MB, aggrb=829591KB/s, minb=207397KB/s,
maxb=210285KB/s, mint=37271msec, maxt=37790msec
  WRITE: io=30825MB, aggrb=835255KB/s, minb=208813KB/s,
maxb=211721KB/s, mint=37271msec, maxt=37790msec

So, z3fold is faster everywhere (including being *two* times faster on
read) except for sequential write, which is the least important use
case in the real world.

> > Then, there are low end systems with hardware
> > compression/decompression support which don't need zsmalloc
> > sophistication and would rather use zbud with ZRAM because the
> > compression ratio is relatively low.
>
> I couldn't imagine how it's bad with zsmalloc. Could you be more
>

Re: [PATCH 0/3] Allow ZRAM to use any zpool-compatible backend

2019-10-15 Thread Vitaly Wool
Hi Minchan,

On Mon, Oct 14, 2019 at 6:41 PM Minchan Kim  wrote:
>
> On Thu, Oct 10, 2019 at 11:04:14PM +0300, Vitaly Wool wrote:
> > The coming patchset is a new take on the old issue: ZRAM can currently be 
> > used only with zsmalloc even though this may not be the optimal combination 
> > for some configurations. The previous (unsuccessful) attempt dates back to 
> > 2015 [1] and is notable for the heated discussions it has caused.
> >
> > The patchset in [1] had basically the only goal of enabling ZRAM/zbud combo 
> > which had a very narrow use case. Things have changed substantially since 
> > then, and now, with z3fold used widely as a zswap backend, I, as the z3fold 
> > maintainer, am getting requests to re-iterate on making it possible to use 
> > ZRAM with any zpool-compatible backend, first of all z3fold.
> >
> > The preliminary results for this work have been delivered at Linux Plumbers 
> > this year [2]. The talk at LPC, though having attracted limited interest, 
> > ended in a consensus to continue the work and pursue the goal of decoupling 
> > ZRAM from zsmalloc.
> >
> > The current patchset has been stress tested on arm64 and x86_64 devices, 
> > including the Dell laptop I'm writing this message on now, not to mention 
> > several QEMU configurations.
> >
> > [1] https://lkml.org/lkml/2015/9/14/356
> > [2] https://linuxplumbersconf.org/event/4/contributions/551/
>
> Please describe what's the usecase in real world, what's the benefit zsmalloc
> cannot fulfill by design and how it's significant.

I'm not entirely sure how to interpret the phrase "the benefit
zsmalloc cannot fulfill by design" but let me explain.
First, there are multi-core systems where z3fold can provide
better throughput.
Then, there are low end systems with hardware
compression/decompression support which don't need zsmalloc
sophistication and would rather use zbud with ZRAM because the
compression ratio is relatively low.
Finally, there are MMU-less systems targeting IoT that still run
Linux, and having a compressed RAM disk is something that would help
these systems operate in a better way (for the benefit of the overall
Linux ecosystem, if you care about that, of course; well, some people
do).

> I really don't want to make fragmentation of allocator so we should really see
> how zsmalloc cannot achieve things if you are claiming.

I have to say that this point is completely bogus. We do not create
fragmentation by using a better defined and standardized API. In fact,
we aim to increase the number of use cases and test coverage for ZRAM.
With that said, I have a hard time seeing how zsmalloc can operate on an
MMU-less system.

> Please tell us how to test it so that we could investigate what's the root
> cause.

I gather you have read neither the LPC documents nor my
conversation with Sergey re: these changes, because if you had, you
wouldn't have had the type of questions you're asking. Please also see
above.

I feel a bit awkward explaining basic things to you but there may not
be any other "root cause" than an applicability issue. zsmalloc is a great
allocator but it's not universal and has its limitations. The
(potential) scope for ZRAM is wider than zsmalloc can provide. We are
*helping* _you_ to extend this scope "in real world" (c) and you come
up with bogus objections. Why?

Best regards,
   Vitaly


Re: [PATCH 3/3] zram: use common zpool interface

2019-10-14 Thread Vitaly Wool
On Mon, Oct 14, 2019 at 12:49 PM Sergey Senozhatsky
 wrote:
>
> On (10/10/19 23:20), Vitaly Wool wrote:
> [..]
> >  static const char *default_compressor = "lzo-rle";
> >
> > +#define BACKEND_PAR_BUF_SIZE 32
> > +static char backend_par_buf[BACKEND_PAR_BUF_SIZE];
>
> We can have multiple zram devices (zram0 .. zramN), I guess it
> would make sense not to force all devices to use one particular
> allocator (e.g. see comp_algorithm_store()).
>
> If the motivation for the patch set is that zsmalloc does not
> perform equally well for various data access patterns, then the
> same is true for any other allocator. Thus, I think, we need to
> have a per-device 'allocator' knob.

We were thinking here in per-SoC terms basically, but this is a valid
point. Since zram has a well-established per-device sysfs
configuration interface, the backend choice had better be moved there. Agree?

~Vitaly
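
A per-device knob along the lines of the existing comp_algorithm attribute could look
roughly like the sketch below. This is purely a hypothetical illustration of the idea
being discussed; the attribute name, the backend_name field and the validation are made
up here and are not taken from any posted patch:

/* Hypothetical per-device backend selector, modelled on comp_algorithm_store();
 * a matching backend_show() would be needed as well. */
static ssize_t backend_store(struct device *dev,
			     struct device_attribute *attr,
			     const char *buf, size_t len)
{
	struct zram *zram = dev_to_zram(dev);
	char name[32] = {0};
	size_t sz;

	strlcpy(name, buf, sizeof(name));
	/* ignore trailing newline */
	sz = strlen(name);
	if (sz > 0 && name[sz - 1] == '\n')
		name[sz - 1] = '\0';

	down_write(&zram->init_lock);
	if (init_done(zram)) {
		up_write(&zram->init_lock);
		pr_info("Can't change backend for initialized device\n");
		return -EBUSY;
	}
	strcpy(zram->backend_name, name);	/* hypothetical field */
	up_write(&zram->init_lock);
	return len;
}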


Re: [PATCH 0/3] Allow ZRAM to use any zpool-compatible backend

2019-10-14 Thread Vitaly Wool
Hi Sergey,

On Mon, Oct 14, 2019 at 12:35 PM Sergey Senozhatsky
 wrote:
>
> Hi,
>
> On (10/10/19 23:04), Vitaly Wool wrote:
> [..]
> > The coming patchset is a new take on the old issue: ZRAM can
> > currently be used only with zsmalloc even though this may not
> > be the optimal combination for some configurations. The previous
> > (unsuccessful) attempt dates back to 2015 [1] and is notable for
> > the heated discussions it has caused.
>
> Oh, right, I do recall it.
>
> > The patchset in [1] had basically the only goal of enabling
> > ZRAM/zbud combo which had a very narrow use case. Things have
> > changed substantially since then, and now, with z3fold used
> > widely as a zswap backend, I, as the z3fold maintainer, am
> > getting requests to re-iterate on making it possible to use
> > ZRAM with any zpool-compatible backend, first of all z3fold.
>
> A quick question, what are the technical reasons to prefer
> allocator X over zsmalloc? Some data would help, I guess.

For z3fold, the data can be found here:
https://elinux.org/images/d/d3/Z3fold.pdf.

For zbud (which is also of interest), imagine a low-end platform with
a simplistic HW compressor that doesn't give a really high ratio. We
still want to be able to use ZRAM (not necessarily as a swap
partition, but rather for /home and /var) but we absolutely don't need
zsmalloc's complexity. zbud is a perfect match here (provided that it
can cope with PAGE_SIZE pages, yes, but it's a small patch to make
that work) since it's unlikely that we squeeze more than 2 compressed
pages per page with that HW compressor anyway.

> > The preliminary results for this work have been delivered at
> > Linux Plumbers this year [2]. The talk at LPC, though having
> > attracted limited interest, ended in a consensus to continue
> > the work and pursue the goal of decoupling ZRAM from zsmalloc.
>
> [..]
>
> > [1] https://lkml.org/lkml/2015/9/14/356
>
> I need to re-read it, thanks for the link. IIRC, but maybe
> I'm wrong, one of the things Minchan was not happy with was
> increased maintenance cost. So, perhaps, this also should
> be discussed/addressed (and maybe even in the first place).

I have a hard time seeing how the maintenance cost is increased here :)

~Vitaly


[PATCH 3/3] zram: use common zpool interface

2019-10-10 Thread Vitaly Wool
Change ZRAM to use the zpool API. This patch allows any
zpool-compatible allocation backend to be used with ZRAM. It is meant to make
no functional changes to ZRAM.

A zpool-registered backend can be selected via the module parameter
or the kernel boot string. 'zsmalloc' is taken by default.

Signed-off-by: Vitaly Wool 
---
 drivers/block/zram/Kconfig|  3 ++-
 drivers/block/zram/zram_drv.c | 64 +++
 drivers/block/zram/zram_drv.h |  4 +--
 3 files changed, 39 insertions(+), 32 deletions(-)

diff --git a/drivers/block/zram/Kconfig b/drivers/block/zram/Kconfig
index fe7a4b7d30cf..7248d5aa3468 100644
--- a/drivers/block/zram/Kconfig
+++ b/drivers/block/zram/Kconfig
@@ -1,8 +1,9 @@
 # SPDX-License-Identifier: GPL-2.0
 config ZRAM
tristate "Compressed RAM block device support"
-   depends on BLOCK && SYSFS && ZSMALLOC && CRYPTO
+   depends on BLOCK && SYSFS && CRYPTO
select CRYPTO_LZO
+   select ZPOOL
help
  Creates virtual block devices called /dev/zramX (X = 0, 1, ...).
  Pages written to these disks are compressed and stored in memory
diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
index d58a359a6622..881f10f99a5d 100644
--- a/drivers/block/zram/zram_drv.c
+++ b/drivers/block/zram/zram_drv.c
@@ -43,6 +43,9 @@ static DEFINE_MUTEX(zram_index_mutex);
 static int zram_major;
 static const char *default_compressor = "lzo-rle";
 
+#define BACKEND_PAR_BUF_SIZE   32
+static char backend_par_buf[BACKEND_PAR_BUF_SIZE];
+
 /* Module params (documentation at end) */
 static unsigned int num_devices = 1;
 /*
@@ -277,7 +280,7 @@ static ssize_t mem_used_max_store(struct device *dev,
	down_read(&zram->init_lock);
if (init_done(zram)) {
		atomic_long_set(&zram->stats.max_used_pages,
-   zs_get_total_pages(zram->mem_pool));
+   zpool_get_total_size(zram->mem_pool) >> PAGE_SHIFT);
}
	up_read(&zram->init_lock);
 
@@ -1020,7 +1023,7 @@ static ssize_t compact_store(struct device *dev,
return -EINVAL;
}
 
-   zs_compact(zram->mem_pool);
+   zpool_compact(zram->mem_pool);
	up_read(&zram->init_lock);
 
return len;
@@ -1048,17 +1051,14 @@ static ssize_t mm_stat_show(struct device *dev,
struct device_attribute *attr, char *buf)
 {
struct zram *zram = dev_to_zram(dev);
-   struct zs_pool_stats pool_stats;
u64 orig_size, mem_used = 0;
-   long max_used;
+   long max_used, num_compacted = 0;
ssize_t ret;
 
-   memset(&pool_stats, 0x00, sizeof(struct zs_pool_stats));
-
	down_read(&zram->init_lock);
if (init_done(zram)) {
-   mem_used = zs_get_total_pages(zram->mem_pool);
-   zs_pool_stats(zram->mem_pool, &pool_stats);
+   mem_used = zpool_get_total_size(zram->mem_pool);
+   num_compacted = zpool_get_num_compacted(zram->mem_pool);
}
 
	orig_size = atomic64_read(&zram->stats.pages_stored);
@@ -1068,11 +1068,11 @@ static ssize_t mm_stat_show(struct device *dev,
"%8llu %8llu %8llu %8lu %8ld %8llu %8lu %8llu\n",
orig_size << PAGE_SHIFT,
			(u64)atomic64_read(&zram->stats.compr_data_size),
-   mem_used << PAGE_SHIFT,
+   mem_used,
zram->limit_pages << PAGE_SHIFT,
max_used << PAGE_SHIFT,
			(u64)atomic64_read(&zram->stats.same_pages),
-   pool_stats.pages_compacted,
+   num_compacted,
			(u64)atomic64_read(&zram->stats.huge_pages));
	up_read(&zram->init_lock);
 
@@ -1133,27 +1133,30 @@ static void zram_meta_free(struct zram *zram, u64 
disksize)
for (index = 0; index < num_pages; index++)
zram_free_page(zram, index);
 
-   zs_destroy_pool(zram->mem_pool);
+   zpool_destroy_pool(zram->mem_pool);
vfree(zram->table);
 }
 
 static bool zram_meta_alloc(struct zram *zram, u64 disksize)
 {
size_t num_pages;
+   char *backend;
 
num_pages = disksize >> PAGE_SHIFT;
zram->table = vzalloc(array_size(num_pages, sizeof(*zram->table)));
if (!zram->table)
return false;
 
-   zram->mem_pool = zs_create_pool(zram->disk->disk_name);
+   backend = strlen(backend_par_buf) ? backend_par_buf : "zsmalloc";
+   zram->mem_pool = zpool_create_pool(backend, zram->disk->disk_name,
+   GFP_NOIO, NULL);
if (!zram->mem_pool) {
vfree(zram->table);
return false;
}
 
if (!huge_class_size)
-   huge_class_size 

[PATCH 2/3] zsmalloc: add compaction and huge class callbacks

2019-10-10 Thread Vitaly Wool
Add compaction callbacks for zpool compaction API extension.
Add huge_class_size callback too to be fully aligned.

With these in place, we can proceed with ZRAM modification
to use the universal (zpool) API. 

Signed-off-by: Vitaly Wool 
---
 mm/zsmalloc.c | 21 +
 1 file changed, 21 insertions(+)

diff --git a/mm/zsmalloc.c b/mm/zsmalloc.c
index 2b2b9aae8a3c..43f43272b998 100644
--- a/mm/zsmalloc.c
+++ b/mm/zsmalloc.c
@@ -437,11 +437,29 @@ static void zs_zpool_unmap(void *pool, unsigned long 
handle)
zs_unmap_object(pool, handle);
 }
 
+static unsigned long zs_zpool_compact(void *pool)
+{
+   return zs_compact(pool);
+}
+
+static unsigned long zs_zpool_get_compacted(void *pool)
+{
+   struct zs_pool_stats stats;
+
+   zs_pool_stats(pool, &stats);
+   return stats.pages_compacted;
+}
+
 static u64 zs_zpool_total_size(void *pool)
 {
return zs_get_total_pages(pool) << PAGE_SHIFT;
 }
 
+static size_t zs_zpool_huge_class_size(void *pool)
+{
+   return zs_huge_class_size(pool);
+}
+
 static struct zpool_driver zs_zpool_driver = {
.type =   "zsmalloc",
.owner =  THIS_MODULE,
@@ -453,6 +471,9 @@ static struct zpool_driver zs_zpool_driver = {
.map =zs_zpool_map,
.unmap =  zs_zpool_unmap,
.total_size = zs_zpool_total_size,
+   .compact =zs_zpool_compact,
+   .get_num_compacted =  zs_zpool_get_compacted,
+   .huge_class_size =zs_zpool_huge_class_size,
 };
 
 MODULE_ALIAS("zpool-zsmalloc");
-- 
2.20.1


[PATCH 1/3] zpool: extend API to match zsmalloc

2019-10-10 Thread Vitaly Wool
This patch adds the following functions to the zpool API:
- zpool_compact()
- zpool_get_num_compacted()
- zpool_huge_class_size()

The first one triggers compaction for the underlying allocator, the
second retrieves the number of pages migrated due to compaction for
the whole time of this pool's existence and the third one returns
the huge class size.

This API extension is done to align zpool API with zsmalloc API.

Signed-off-by: Vitaly Wool 
---
 include/linux/zpool.h | 14 +-
 mm/zpool.c| 36 
 2 files changed, 49 insertions(+), 1 deletion(-)

diff --git a/include/linux/zpool.h b/include/linux/zpool.h
index 51bf43076165..31f0c1360569 100644
--- a/include/linux/zpool.h
+++ b/include/linux/zpool.h
@@ -61,8 +61,13 @@ void *zpool_map_handle(struct zpool *pool, unsigned long 
handle,
 
 void zpool_unmap_handle(struct zpool *pool, unsigned long handle);
 
+unsigned long zpool_compact(struct zpool *pool);
+
+unsigned long zpool_get_num_compacted(struct zpool *pool);
+
 u64 zpool_get_total_size(struct zpool *pool);
 
+size_t zpool_huge_class_size(struct zpool *zpool);
 
 /**
  * struct zpool_driver - driver implementation for zpool
@@ -75,7 +80,10 @@ u64 zpool_get_total_size(struct zpool *pool);
  * @shrink:shrink the pool.
  * @map:   map a handle.
  * @unmap: unmap a handle.
- * @total_size:get total size of a pool.
+ * @compact:   try to run compaction over a pool
+ * @get_num_compacted: get amount of compacted pages for a pool
+ * @total_size:get total size of a pool
+ * @huge_class_size: huge class threshold for pool pages.
  *
  * This is created by a zpool implementation and registered
  * with zpool.
@@ -104,7 +112,11 @@ struct zpool_driver {
enum zpool_mapmode mm);
void (*unmap)(void *pool, unsigned long handle);
 
+   unsigned long (*compact)(void *pool);
+   unsigned long (*get_num_compacted)(void *pool);
+
u64 (*total_size)(void *pool);
+   size_t (*huge_class_size)(void *pool);
 };
 
 void zpool_register_driver(struct zpool_driver *driver);
diff --git a/mm/zpool.c b/mm/zpool.c
index 863669212070..55e69213c2eb 100644
--- a/mm/zpool.c
+++ b/mm/zpool.c
@@ -362,6 +362,30 @@ void zpool_unmap_handle(struct zpool *zpool, unsigned long 
handle)
zpool->driver->unmap(zpool->pool, handle);
 }
 
+ /**
+ * zpool_compact() - try to run compaction over zpool
+ * @pool   The zpool to compact
+ *
+ * Returns: the number of migrated pages
+ */
+unsigned long zpool_compact(struct zpool *zpool)
+{
+   return zpool->driver->compact ? zpool->driver->compact(zpool->pool) : 0;
+}
+
+
+/**
+ * zpool_get_num_compacted() - get the number of migrated/compacted pages
+ * @pool   The zpool to get compaction statistic for
+ *
+ * Returns: the total number of migrated pages for the pool
+ */
+unsigned long zpool_get_num_compacted(struct zpool *zpool)
+{
+   return zpool->driver->get_num_compacted ?
+   zpool->driver->get_num_compacted(zpool->pool) : 0;
+}
+
 /**
  * zpool_get_total_size() - The total size of the pool
  * @zpool: The zpool to check
@@ -375,6 +399,18 @@ u64 zpool_get_total_size(struct zpool *zpool)
return zpool->driver->total_size(zpool->pool);
 }
 
+/**
+ * zpool_huge_class_size() - get size for the "huge" class
+ * @pool   The zpool to check
+ *
+ * Returns: size of the huge class
+ */
+size_t zpool_huge_class_size(struct zpool *zpool)
+{
+   return zpool->driver->huge_class_size ?
+   zpool->driver->huge_class_size(zpool->pool) : 0;
+}
+
 /**
  * zpool_evictable() - Test if zpool is potentially evictable
  * @zpool: The zpool to test
-- 
2.20.1
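
To illustrate how a zpool user would consume the extended API above, a caller could do
something like the following. This is a sketch only; zswap and ZRAM wire this up
differently in the actual patches, and my_pool_compact() is a made-up name:

/* Compact a zpool and report progress, using only the callbacks added
 * above plus the existing zpool API. */
static void my_pool_compact(struct zpool *zpool)
{
	unsigned long moved, total;

	/* Both helpers fall back to 0 if the backend lacks the callback. */
	moved = zpool_compact(zpool);
	total = zpool_get_num_compacted(zpool);

	pr_info("zpool: compacted %lu pages now, %lu in total, size %llu bytes\n",
		moved, total, zpool_get_total_size(zpool));
}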


[PATCH 0/3] Allow ZRAM to use any zpool-compatible backend

2019-10-10 Thread Vitaly Wool
The coming patchset is a new take on the old issue: ZRAM can currently be used 
only with zsmalloc even though this may not be the optimal combination for some 
configurations. The previous (unsuccessful) attempt dates back to 2015 [1] and 
is notable for the heated discussions it has caused.

The patchset in [1] had basically the only goal of enabling ZRAM/zbud combo 
which had a very narrow use case. Things have changed substantially since then, 
and now, with z3fold used widely as a zswap backend, I, as the z3fold 
maintainer, am getting requests to re-iterate on making it possible to use 
ZRAM with any zpool-compatible backend, first of all z3fold.

The preliminary results for this work have been delivered at Linux Plumbers 
this year [2]. The talk at LPC, though having attracted limited interest, ended 
in a consensus to continue the work and pursue the goal of decoupling ZRAM from 
zsmalloc.

The current patchset has been stress tested on arm64 and x86_64 devices, 
including the Dell laptop I'm writing this message on now, not to mention 
several QEMU configurations.

[1] https://lkml.org/lkml/2015/9/14/356
[2] https://linuxplumbersconf.org/event/4/contributions/551/


[PATCH] z3fold: add inter-page compaction

2019-10-05 Thread Vitaly Wool
From: Vitaly Wool 

For each page scheduled for compaction (e.g. by z3fold_free()),
try to apply inter-page compaction before running the traditional/
existing intra-page compaction. That means, if the page has only one
buddy, we treat that buddy as a new object that we aim to place into
an existing z3fold page. If such a page is found, that object is
transferred and the old page is freed completely. The transferred
object is named "foreign" and treated slightly differently thereafter.

Namely, we increase "foreign handle" counter for the new page. Pages
with non-zero "foreign handle" count become unmovable. This patch
implements "foreign handle" detection when a handle is freed to
decrement the foreign handle counter accordingly, so a page may as
well become movable again as the time goes by.

As a result, we almost always have exactly 3 objects per page and
significantly better average compression ratio.

Signed-off-by: Vitaly Wool 
---
 mm/z3fold.c | 363 +---
 1 file changed, 291 insertions(+), 72 deletions(-)

diff --git a/mm/z3fold.c b/mm/z3fold.c
index 6d3d3f698ebb..25713a4a7186 100644
--- a/mm/z3fold.c
+++ b/mm/z3fold.c
@@ -41,6 +41,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 
@@ -90,6 +91,7 @@ struct z3fold_buddy_slots {
 */
unsigned long slot[BUDDY_MASK + 1];
unsigned long pool; /* back link + flags */
+   rwlock_t lock;
 };
 #define HANDLE_FLAG_MASK   (0x03)
 
@@ -124,6 +126,7 @@ struct z3fold_header {
unsigned short start_middle;
unsigned short first_num:2;
unsigned short mapped_count:2;
+   unsigned short foreign_handles:2;
 };
 
 /**
@@ -178,6 +181,19 @@ enum z3fold_page_flags {
PAGE_CLAIMED, /* by either reclaim or free */
 };
 
+/*
+ * handle flags, go under HANDLE_FLAG_MASK
+ */
+enum z3fold_handle_flags {
+   HANDLES_ORPHANED = 0,
+};
+
+/*
+ * Forward declarations
+ */
+static struct z3fold_header *__z3fold_alloc(struct z3fold_pool *, size_t, 
bool);
+static void compact_page_work(struct work_struct *w);
+
 /*
  * Helpers
 */
@@ -191,8 +207,6 @@ static int size_to_chunks(size_t size)
 #define for_each_unbuddied_list(_iter, _begin) \
for ((_iter) = (_begin); (_iter) < NCHUNKS; (_iter)++)
 
-static void compact_page_work(struct work_struct *w);
-
 static inline struct z3fold_buddy_slots *alloc_slots(struct z3fold_pool *pool,
gfp_t gfp)
 {
@@ -204,6 +218,7 @@ static inline struct z3fold_buddy_slots *alloc_slots(struct 
z3fold_pool *pool,
if (slots) {
memset(slots->slot, 0, sizeof(slots->slot));
slots->pool = (unsigned long)pool;
+   rwlock_init(&slots->lock);
}
 
return slots;
@@ -219,27 +234,108 @@ static inline struct z3fold_buddy_slots 
*handle_to_slots(unsigned long handle)
return (struct z3fold_buddy_slots *)(handle & ~(SLOTS_ALIGN - 1));
 }
 
+/* Lock a z3fold page */
+static inline void z3fold_page_lock(struct z3fold_header *zhdr)
+{
+   spin_lock(&zhdr->page_lock);
+}
+
+/* Try to lock a z3fold page */
+static inline int z3fold_page_trylock(struct z3fold_header *zhdr)
+{
+   return spin_trylock(&zhdr->page_lock);
+}
+
+/* Unlock a z3fold page */
+static inline void z3fold_page_unlock(struct z3fold_header *zhdr)
+{
+   spin_unlock(&zhdr->page_lock);
+}
+
+
+static inline struct z3fold_header *__get_z3fold_header(unsigned long handle,
+   bool lock)
+{
+   struct z3fold_buddy_slots *slots;
+   struct z3fold_header *zhdr;
+   int locked = 0;
+
+   if (!(handle & (1 << PAGE_HEADLESS))) {
+   slots = handle_to_slots(handle);
+   do {
+   unsigned long addr;
+
+   read_lock(&slots->lock);
+   addr = *(unsigned long *)handle;
+   zhdr = (struct z3fold_header *)(addr & PAGE_MASK);
+   if (lock)
+   locked = z3fold_page_trylock(zhdr);
+   read_unlock(&slots->lock);
+   if (locked)
+   break;
+   cpu_relax();
+   } while (lock);
+   } else {
+   zhdr = (struct z3fold_header *)(handle & PAGE_MASK);
+   }
+
+   return zhdr;
+}
+
+/* Returns the z3fold page where a given handle is stored */
+static inline struct z3fold_header *handle_to_z3fold_header(unsigned long h)
+{
+   return __get_z3fold_header(h, false);
+}
+
+/* return locked z3fold page if it's not headless */
+static inline struct z3fold_header *get_z3fold_header(unsigned long h)
+{
+   return __get_z3fold_header(h, true);
+}
+
+static inline void put_z3fold_header(struct z3fold_header *zhdr)
+{
+  

[PATCH v2] z3fold: claim page in the beginning of free

2019-09-28 Thread Vitaly Wool
There's a really hard to reproduce race in z3fold between
z3fold_free() and z3fold_reclaim_page(). z3fold_reclaim_page()
can claim the page after z3fold_free() has checked if the page
was claimed and z3fold_free() will then schedule this page for
compaction which may in turn lead to random page faults (since
that page would have been reclaimed by then). Fix that by
claiming the page at the beginning of z3fold_free() and not
forgetting to clear the claim at the end.

Reported-by: Markus Linnala 
Signed-off-by: Vitaly Wool 
Cc: 
---
 mm/z3fold.c | 10 --
 1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/mm/z3fold.c b/mm/z3fold.c
index 05bdf90646e7..6d3d3f698ebb 100644
--- a/mm/z3fold.c
+++ b/mm/z3fold.c
@@ -998,9 +998,11 @@ static void z3fold_free(struct z3fold_pool *pool, unsigned long handle)
	struct z3fold_header *zhdr;
struct page *page;
enum buddy bud;
+   bool page_claimed;
 
zhdr = handle_to_z3fold_header(handle);
page = virt_to_page(zhdr);
+   page_claimed = test_and_set_bit(PAGE_CLAIMED, &page->private);
 
if (test_bit(PAGE_HEADLESS, >private)) {
/* if a headless page is under reclaim, just leave.
@@ -1008,7 +1010,7 @@ static void z3fold_free(struct z3fold_pool *pool,
unsigned long handle)
 * has not been set before, we release this page
 * immediately so we don't care about its value any
more. */
-   if (!test_and_set_bit(PAGE_CLAIMED, &page->private)) {
+   if (!page_claimed) {
			spin_lock(&pool->lock);
			list_del(&page->lru);
			spin_unlock(&pool->lock);
@@ -1044,13 +1046,15 @@ static void z3fold_free(struct z3fold_pool *pool, unsigned long handle)
		atomic64_dec(&pool->pages_nr);
return;
}
-   if (test_bit(PAGE_CLAIMED, &page->private)) {
+   if (page_claimed) {
+   /* the page has not been claimed by us */
z3fold_page_unlock(zhdr);
return;
}
if (unlikely(PageIsolated(page)) ||
	    test_and_set_bit(NEEDS_COMPACTING, &page->private)) {
z3fold_page_unlock(zhdr);
+   clear_bit(PAGE_CLAIMED, &page->private);
return;
}
if (zhdr->cpu < 0 || !cpu_online(zhdr->cpu)) {
@@ -1060,10 +1064,12 @@ static void z3fold_free(struct z3fold_pool *pool, unsigned long handle)
		zhdr->cpu = -1;
		kref_get(&zhdr->refcount);
do_compact_page(zhdr, true);
+   clear_bit(PAGE_CLAIMED, &page->private);
return;
}
	kref_get(&zhdr->refcount);
	queue_work_on(zhdr->cpu, pool->compact_wq, &zhdr->work);
+   clear_bit(PAGE_CLAIMED, &page->private);
z3fold_page_unlock(zhdr);
 }
 
-- 
2.17.1


[PATCH] z3fold: claim page in the beginning of free

2019-09-26 Thread Vitaly Wool
There's a really hard to reproduce race in z3fold between
z3fold_free() and z3fold_reclaim_page(). z3fold_reclaim_page()
can claim the page after z3fold_free() has checked if the page
was claimed and z3fold_free() will then schedule this page for
compaction which may in turn lead to random page faults (since
that page would have been reclaimed by then). Fix that by
claiming the page at the beginning of z3fold_free().

Reported-by: Markus Linnala 
Signed-off-by: Vitaly Wool 
---
 mm/z3fold.c | 6 --
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/mm/z3fold.c b/mm/z3fold.c
index 05bdf90646e7..01b87c78b984 100644
--- a/mm/z3fold.c
+++ b/mm/z3fold.c
@@ -998,9 +998,11 @@ static void z3fold_free(struct z3fold_pool *pool, unsigned 
long handle)
struct z3fold_header *zhdr;
struct page *page;
enum buddy bud;
+   bool page_claimed;
 
zhdr = handle_to_z3fold_header(handle);
page = virt_to_page(zhdr);
+   page_claimed = test_and_set_bit(PAGE_CLAIMED, &page->private);
 
if (test_bit(PAGE_HEADLESS, >private)) {
/* if a headless page is under reclaim, just leave.
@@ -1008,7 +1010,7 @@ static void z3fold_free(struct z3fold_pool *pool, 
unsigned long handle)
 * has not been set before, we release this page
 * immediately so we don't care about its value any more.
 */
-   if (!test_and_set_bit(PAGE_CLAIMED, &page->private)) {
+   if (!page_claimed) {
			spin_lock(&pool->lock);
			list_del(&page->lru);
			spin_unlock(&pool->lock);
@@ -1044,7 +1046,7 @@ static void z3fold_free(struct z3fold_pool *pool, 
unsigned long handle)
		atomic64_dec(&pool->pages_nr);
return;
}
-   if (test_bit(PAGE_CLAIMED, &page->private)) {
+   if (page_claimed) {
z3fold_page_unlock(zhdr);
return;
}
-- 
2.17.1


Re: [PATCH] z3fold: fix memory leak in kmem cache

2019-09-19 Thread Vitaly Wool
On Wed, Sep 18, 2019 at 9:35 AM Vlastimil Babka  wrote:
>
> On 9/17/19 5:53 PM, Vitaly Wool wrote:
> > Currently there is a leak in init_z3fold_page() -- it allocates
> > handles from kmem cache even for headless pages, but then they are
> > never used and never freed, so eventually kmem cache may get
> > exhausted. This patch provides a fix for that.
> >
> > Reported-by: Markus Linnala 
> > Signed-off-by: Vitaly Wool 
>
> Can a Fixes: commit be pinpointed, and CC stable added?

Fixes: 7c2b8baa61fe578 "mm/z3fold.c: add structure for buddy handles"

Best regards,
   Vitaly

> > ---
> >  mm/z3fold.c | 15 +--
> >  1 file changed, 9 insertions(+), 6 deletions(-)
> >
> > diff --git a/mm/z3fold.c b/mm/z3fold.c
> > index 6397725b5ec6..7dffef2599c3 100644
> > --- a/mm/z3fold.c
> > +++ b/mm/z3fold.c
> > @@ -301,14 +301,11 @@ static void z3fold_unregister_migration(struct 
> > z3fold_pool *pool)
> >   }
> >
> >  /* Initializes the z3fold header of a newly allocated z3fold page */
> > -static struct z3fold_header *init_z3fold_page(struct page *page,
> > +static struct z3fold_header *init_z3fold_page(struct page *page, bool 
> > headless,
> >   struct z3fold_pool *pool, gfp_t gfp)
> >  {
> >   struct z3fold_header *zhdr = page_address(page);
> > - struct z3fold_buddy_slots *slots = alloc_slots(pool, gfp);
> > -
> > - if (!slots)
> > - return NULL;
> > + struct z3fold_buddy_slots *slots;
> >
> >   INIT_LIST_HEAD(&page->lru);
> >   clear_bit(PAGE_HEADLESS, &page->private);
> > @@ -316,6 +313,12 @@ static struct z3fold_header *init_z3fold_page(struct 
> > page *page,
> >   clear_bit(NEEDS_COMPACTING, &page->private);
> >   clear_bit(PAGE_STALE, &page->private);
> >   clear_bit(PAGE_CLAIMED, &page->private);
> > + if (headless)
> > + return zhdr;
> > +
> > + slots = alloc_slots(pool, gfp);
> > + if (!slots)
> > + return NULL;
> >
> >   spin_lock_init(&zhdr->page_lock);
> >   kref_init(&zhdr->refcount);
> > @@ -962,7 +965,7 @@ static int z3fold_alloc(struct z3fold_pool *pool, 
> > size_t size, gfp_t gfp,
> >   if (!page)
> >   return -ENOMEM;
> >
> > - zhdr = init_z3fold_page(page, pool, gfp);
> > + zhdr = init_z3fold_page(page, bud == HEADLESS, pool, gfp);
> >   if (!zhdr) {
> >   __free_page(page);
> >   return -ENOMEM;
> >
>


[PATCH] z3fold: fix memory leak in kmem cache

2019-09-17 Thread Vitaly Wool
Currently there is a leak in init_z3fold_page() -- it allocates
handles from kmem cache even for headless pages, but then they are
never used and never freed, so eventually kmem cache may get
exhausted. This patch provides a fix for that.

Reported-by: Markus Linnala 
Signed-off-by: Vitaly Wool 
---
 mm/z3fold.c | 15 +--
 1 file changed, 9 insertions(+), 6 deletions(-)

diff --git a/mm/z3fold.c b/mm/z3fold.c
index 6397725b5ec6..7dffef2599c3 100644
--- a/mm/z3fold.c
+++ b/mm/z3fold.c
@@ -301,14 +301,11 @@ static void z3fold_unregister_migration(struct 
z3fold_pool *pool)
  }
 
 /* Initializes the z3fold header of a newly allocated z3fold page */
-static struct z3fold_header *init_z3fold_page(struct page *page,
+static struct z3fold_header *init_z3fold_page(struct page *page, bool headless,
struct z3fold_pool *pool, gfp_t gfp)
 {
struct z3fold_header *zhdr = page_address(page);
-   struct z3fold_buddy_slots *slots = alloc_slots(pool, gfp);
-
-   if (!slots)
-   return NULL;
+   struct z3fold_buddy_slots *slots;
 
	INIT_LIST_HEAD(&page->lru);
	clear_bit(PAGE_HEADLESS, &page->private);
@@ -316,6 +313,12 @@ static struct z3fold_header *init_z3fold_page(struct page 
*page,
	clear_bit(NEEDS_COMPACTING, &page->private);
	clear_bit(PAGE_STALE, &page->private);
	clear_bit(PAGE_CLAIMED, &page->private);
+   if (headless)
+   return zhdr;
+
+   slots = alloc_slots(pool, gfp);
+   if (!slots)
+   return NULL;
 
	spin_lock_init(&zhdr->page_lock);
	kref_init(&zhdr->refcount);
@@ -962,7 +965,7 @@ static int z3fold_alloc(struct z3fold_pool *pool, size_t 
size, gfp_t gfp,
if (!page)
return -ENOMEM;
 
-   zhdr = init_z3fold_page(page, pool, gfp);
+   zhdr = init_z3fold_page(page, bud == HEADLESS, pool, gfp);
if (!zhdr) {
__free_page(page);
return -ENOMEM;
-- 
2.17.1


[PATCH/RFC] zswap: do not map same object twice

2019-09-15 Thread Vitaly Wool
zswap_writeback_entry() maps a handle to read swpentry first, and
then in the most common case it would map the same handle again.
This is ok when zbud is the backend since its mapping callback is
plain and simple, but it slows things down for z3fold.

Since there's hardly any point in unmapping a handle _that_ fast, as
zswap_writeback_entry() does when it reads swpentry, the
suggestion is to keep the handle mapped till the end.

Signed-off-by: Vitaly Wool 
---
 mm/zswap.c | 7 +++
 1 file changed, 3 insertions(+), 4 deletions(-)

diff --git a/mm/zswap.c b/mm/zswap.c
index 0e22744a76cb..b35464bc7315 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -856,7 +856,6 @@ static int zswap_writeback_entry(struct zpool *pool, 
unsigned long handle)
/* extract swpentry from data */
zhdr = zpool_map_handle(pool, handle, ZPOOL_MM_RO);
swpentry = zhdr->swpentry; /* here */
-   zpool_unmap_handle(pool, handle);
tree = zswap_trees[swp_type(swpentry)];
offset = swp_offset(swpentry);
 
@@ -866,6 +865,7 @@ static int zswap_writeback_entry(struct zpool *pool, 
unsigned long handle)
if (!entry) {
/* entry was invalidated */
		spin_unlock(&tree->lock);
+   zpool_unmap_handle(pool, handle);
return 0;
}
	spin_unlock(&tree->lock);
@@ -886,15 +886,13 @@ static int zswap_writeback_entry(struct zpool *pool, 
unsigned long handle)
case ZSWAP_SWAPCACHE_NEW: /* page is locked */
/* decompress */
dlen = PAGE_SIZE;
-   src = (u8 *)zpool_map_handle(entry->pool->zpool, entry->handle,
-   ZPOOL_MM_RO) + sizeof(struct zswap_header);
+   src = (u8 *)zhdr + sizeof(struct zswap_header);
dst = kmap_atomic(page);
tfm = *get_cpu_ptr(entry->pool->tfm);
ret = crypto_comp_decompress(tfm, src, entry->length,
	 dst, &dlen);
put_cpu_ptr(entry->pool->tfm);
kunmap_atomic(dst);
-   zpool_unmap_handle(entry->pool->zpool, entry->handle);
BUG_ON(ret);
BUG_ON(dlen != PAGE_SIZE);
 
@@ -940,6 +938,7 @@ static int zswap_writeback_entry(struct zpool *pool, 
unsigned long handle)
	spin_unlock(&tree->lock);
 
 end:
+   zpool_unmap_handle(pool, handle);
return ret;
 }
 
-- 
2.17.1


[PATCH] Revert "mm/z3fold.c: fix race between migration and destruction"

2019-09-10 Thread Vitaly Wool
With the original commit applied, z3fold_zpool_destroy() may
get blocked on wait_event() for an indefinite time. Revert this
commit for the time being to get rid of this problem since the
issue the original commit addresses is less severe.

This reverts commit d776aaa9895eb6eb770908e899cb7f5bd5025b3c.

Reported-by: Agustín Dall'Alba 
Signed-off-by: Vitaly Wool 
---
 mm/z3fold.c | 90 -
 1 file changed, 90 deletions(-)

diff --git a/mm/z3fold.c b/mm/z3fold.c
index 75b7962439ff..ed19d98c9dcd 100644
--- a/mm/z3fold.c
+++ b/mm/z3fold.c
@@ -41,7 +41,6 @@
 #include 
 #include 
 #include 
-#include 
 #include 
 #include 
 
@@ -146,8 +145,6 @@ struct z3fold_header {
  * @release_wq:workqueue for safe page release
  * @work:  work_struct for safe page release
  * @inode: inode for z3fold pseudo filesystem
- * @destroying: bool to stop migration once we start destruction
- * @isolated: int to count the number of pages currently in isolation
  *
  * This structure is allocated at pool creation time and maintains metadata
  * pertaining to a particular z3fold pool.
@@ -166,11 +163,8 @@ struct z3fold_pool {
const struct zpool_ops *zpool_ops;
struct workqueue_struct *compact_wq;
struct workqueue_struct *release_wq;
-   struct wait_queue_head isolate_wait;
struct work_struct work;
struct inode *inode;
-   bool destroying;
-   int isolated;
 };
 
 /*
@@ -775,7 +769,6 @@ static struct z3fold_pool *z3fold_create_pool(const char 
*name, gfp_t gfp,
goto out_c;
	spin_lock_init(&pool->lock);
	spin_lock_init(&pool->stale_lock);
-   init_waitqueue_head(&pool->isolate_wait);
pool->unbuddied = __alloc_percpu(sizeof(struct list_head)*NCHUNKS, 2);
if (!pool->unbuddied)
goto out_pool;
@@ -815,15 +808,6 @@ static struct z3fold_pool *z3fold_create_pool(const char 
*name, gfp_t gfp,
return NULL;
 }
 
-static bool pool_isolated_are_drained(struct z3fold_pool *pool)
-{
-   bool ret;
-
-   spin_lock(&pool->lock);
-   ret = pool->isolated == 0;
-   spin_unlock(&pool->lock);
-   return ret;
-}
 /**
  * z3fold_destroy_pool() - destroys an existing z3fold pool
  * @pool:  the z3fold pool to be destroyed
@@ -833,22 +817,6 @@ static bool pool_isolated_are_drained(struct z3fold_pool 
*pool)
 static void z3fold_destroy_pool(struct z3fold_pool *pool)
 {
kmem_cache_destroy(pool->c_handle);
-   /*
-* We set pool-> destroying under lock to ensure that
-* z3fold_page_isolate() sees any changes to destroying. This way we
-* avoid the need for any memory barriers.
-*/
-
-   spin_lock(&pool->lock);
-   pool->destroying = true;
-   spin_unlock(&pool->lock);
-
-   /*
-* We need to ensure that no pages are being migrated while we destroy
-* these workqueues, as migration can queue work on either of the
-* workqueues.
-*/
-   wait_event(pool->isolate_wait, !pool_isolated_are_drained(pool));
 
/*
 * We need to destroy pool->compact_wq before pool->release_wq,
@@ -1339,28 +1307,6 @@ static u64 z3fold_get_pool_size(struct z3fold_pool *pool)
	return atomic64_read(&pool->pages_nr);
 }
 
-/*
- * z3fold_dec_isolated() expects to be called while pool->lock is held.
- */
-static void z3fold_dec_isolated(struct z3fold_pool *pool)
-{
-   assert_spin_locked(&pool->lock);
-   VM_BUG_ON(pool->isolated <= 0);
-   pool->isolated--;
-
-   /*
-* If we have no more isolated pages, we have to see if
-* z3fold_destroy_pool() is waiting for a signal.
-*/
-   if (pool->isolated == 0 && waitqueue_active(&pool->isolate_wait))
-   wake_up_all(&pool->isolate_wait);
-}
-
-static void z3fold_inc_isolated(struct z3fold_pool *pool)
-{
-   pool->isolated++;
-}
-
 static bool z3fold_page_isolate(struct page *page, isolate_mode_t mode)
 {
struct z3fold_header *zhdr;
@@ -1387,34 +1333,6 @@ static bool z3fold_page_isolate(struct page *page, 
isolate_mode_t mode)
	spin_lock(&pool->lock);
	if (!list_empty(&page->lru))
		list_del(&page->lru);
-   /*
-* We need to check for destruction while holding pool->lock, as
-* otherwise destruction could see 0 isolated pages, and
-* proceed.
-*/
-   if (unlikely(pool->destroying)) {
-   spin_unlock(&pool->lock);
-   /*
-* If this page isn't stale, somebody else holds a
-* reference to it. Let't drop our refcount so that they
-* can call the release logic.
-*/
-   if (unlikely(kref_put(&zhdr->refcount,
- release_z3fold_pa

Re: [PATCH] z3fold: fix retry mechanism in page reclaim

2019-09-08 Thread Vitaly Wool
On Sun, Sep 8, 2019 at 4:56 PM Maciej S. Szmigiero
 wrote:
>
> On 08.09.2019 15:29, Vitaly Wool wrote:
> > z3fold_reclaim_page()'s retry mechanism is broken: on a second
> > iteration it will have zhdr from the first one so that zhdr
> > is no longer in line with struct page. That leads to crashes when
> > the system is stressed.
> >
> > Fix that by moving zhdr assignment up.
> >
> > While at it, protect against using already freed handles by using
> > own local slots structure in z3fold_page_reclaim().
> >
> > Reported-by: Markus Linnala 
> > Reported-by: Chris Murphy 
> > Reported-by: Agustin Dall'Alba 
> > Signed-off-by: Vitaly Wool 
> > ---
>
> Shouldn't this be CC'ed to stable@ ?

I guess :)

Thanks,
   Vitaly


[PATCH] z3fold: fix retry mechanism in page reclaim

2019-09-08 Thread Vitaly Wool
z3fold_reclaim_page()'s retry mechanism is broken: on a second
iteration it will have zhdr from the first one so that zhdr
is no longer in line with struct page. That leads to crashes when
the system is stressed.

Fix that by moving zhdr assignment up.

While at it, protect against using already freed handles by using
own local slots structure in z3fold_page_reclaim().

Reported-by: Markus Linnala 
Reported-by: Chris Murphy 
Reported-by: Agustin Dall'Alba 
Signed-off-by: Vitaly Wool 
---
 mm/z3fold.c | 49 ++---
 1 file changed, 34 insertions(+), 15 deletions(-)

diff --git a/mm/z3fold.c b/mm/z3fold.c
index 75b7962439ff..6397725b5ec6 100644
--- a/mm/z3fold.c
+++ b/mm/z3fold.c
@@ -372,9 +372,10 @@ static inline int __idx(struct z3fold_header *zhdr, enum 
buddy bud)
  * Encodes the handle of a particular buddy within a z3fold page
  * Pool lock should be held as this function accesses first_num
  */
-static unsigned long encode_handle(struct z3fold_header *zhdr, enum buddy bud)
+static unsigned long __encode_handle(struct z3fold_header *zhdr,
+   struct z3fold_buddy_slots *slots,
+   enum buddy bud)
 {
-   struct z3fold_buddy_slots *slots;
unsigned long h = (unsigned long)zhdr;
int idx = 0;
 
@@ -391,11 +392,15 @@ static unsigned long encode_handle(struct z3fold_header 
*zhdr, enum buddy bud)
if (bud == LAST)
h |= (zhdr->last_chunks << BUDDY_SHIFT);
 
-   slots = zhdr->slots;
slots->slot[idx] = h;
return (unsigned long)>slot[idx];
 }
 
+static unsigned long encode_handle(struct z3fold_header *zhdr, enum buddy bud)
+{
+   return __encode_handle(zhdr, zhdr->slots, bud);
+}
+
 /* Returns the z3fold page where a given handle is stored */
 static inline struct z3fold_header *handle_to_z3fold_header(unsigned long h)
 {
@@ -630,6 +635,7 @@ static void do_compact_page(struct z3fold_header *zhdr, 
bool locked)
}
 
if (unlikely(PageIsolated(page) ||
+test_bit(PAGE_CLAIMED, &page->private) ||
	 test_bit(PAGE_STALE, &page->private))) {
z3fold_page_unlock(zhdr);
return;
@@ -1132,6 +1138,7 @@ static int z3fold_reclaim_page(struct z3fold_pool *pool, 
unsigned int retries)
struct z3fold_header *zhdr = NULL;
struct page *page = NULL;
struct list_head *pos;
+   struct z3fold_buddy_slots slots;
unsigned long first_handle = 0, middle_handle = 0, last_handle = 0;
 
	spin_lock(&pool->lock);
@@ -1150,16 +1157,22 @@ static int z3fold_reclaim_page(struct z3fold_pool 
*pool, unsigned int retries)
/* this bit could have been set by free, in which case
 * we pass over to the next page in the pool.
 */
-   if (test_and_set_bit(PAGE_CLAIMED, &page->private))
+   if (test_and_set_bit(PAGE_CLAIMED, &page->private)) {
+   page = NULL;
continue;
+   }
 
-   if (unlikely(PageIsolated(page)))
+   if (unlikely(PageIsolated(page))) {
+   clear_bit(PAGE_CLAIMED, &page->private);
+   page = NULL;
continue;
+   }
+   zhdr = page_address(page);
		if (test_bit(PAGE_HEADLESS, &page->private))
break;
 
-   zhdr = page_address(page);
if (!z3fold_page_trylock(zhdr)) {
+   clear_bit(PAGE_CLAIMED, &page->private);
zhdr = NULL;
continue; /* can't evict at this point */
}
@@ -1177,26 +1190,30 @@ static int z3fold_reclaim_page(struct z3fold_pool 
*pool, unsigned int retries)
 
	if (!test_bit(PAGE_HEADLESS, &page->private)) {
/*
-* We need encode the handles before unlocking, since
-* we can race with free that will set
-* (first|last)_chunks to 0
+* We need encode the handles before unlocking, and
+* use our local slots structure because z3fold_free
+* can zero out zhdr->slots and we can't do much
+* about that
 */
first_handle = 0;
last_handle = 0;
middle_handle = 0;
if (zhdr->first_chunks)
-   first_handle = encode_handle(zhdr, FIRST);
+   first_handle = __encode_handle(zhdr, &slots,
+  

Re: [PATCH] mm/z3fold.c: Fix race between migration and destruction

2019-08-10 Thread Vitaly Wool
Hi Henry,

On Fri, Aug 9, 2019 at 6:46 PM Henry Burns  wrote:
>
> In z3fold_destroy_pool() we call destroy_workqueue(&pool->compact_wq).
> However, we have no guarantee that migration isn't happening in the
> background at that time.
>
> Migration directly calls queue_work_on(pool->compact_wq), if destruction
> wins that race we are using a destroyed workqueue.


Thanks for the fix. Would you please comment on why adding
flush_workqueue() isn't enough?

~Vitaly
>
>
> Signed-off-by: Henry Burns 
> ---
>  mm/z3fold.c | 51 +++
>  1 file changed, 51 insertions(+)
>
> diff --git a/mm/z3fold.c b/mm/z3fold.c
> index 78447cecfffa..e136d97ce56e 100644
> --- a/mm/z3fold.c
> +++ b/mm/z3fold.c
> @@ -40,6 +40,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  #include 
>
>  /*
> @@ -161,8 +162,10 @@ struct z3fold_pool {
> const struct zpool_ops *zpool_ops;
> struct workqueue_struct *compact_wq;
> struct workqueue_struct *release_wq;
> +   struct wait_queue_head isolate_wait;
> struct work_struct work;
> struct inode *inode;
> +   int isolated_pages;
>  };
>
>  /*
> @@ -772,6 +775,7 @@ static struct z3fold_pool *z3fold_create_pool(const char 
> *name, gfp_t gfp,
> goto out_c;
> spin_lock_init(>lock);
> spin_lock_init(>stale_lock);
> +   init_waitqueue_head(>isolate_wait);
> pool->unbuddied = __alloc_percpu(sizeof(struct list_head)*NCHUNKS, 2);
> if (!pool->unbuddied)
> goto out_pool;
> @@ -811,6 +815,15 @@ static struct z3fold_pool *z3fold_create_pool(const char 
> *name, gfp_t gfp,
> return NULL;
>  }
>
> +static bool pool_isolated_are_drained(struct z3fold_pool *pool)
> +{
> +   bool ret;
> +
> +   spin_lock(&pool->lock);
> +   ret = pool->isolated_pages == 0;
> +   spin_unlock(&pool->lock);
> +   return ret;
> +}
>  /**
>   * z3fold_destroy_pool() - destroys an existing z3fold pool
>   * @pool:  the z3fold pool to be destroyed
> @@ -821,6 +834,13 @@ static void z3fold_destroy_pool(struct z3fold_pool *pool)
>  {
> kmem_cache_destroy(pool->c_handle);
>
> +   /*
> +* We need to ensure that no pages are being migrated while we destroy
> +* these workqueues, as migration can queue work on either of the
> +* workqueues.
> +*/
> +   wait_event(pool->isolate_wait, !pool_isolated_are_drained(pool));
> +
> /*
>  * We need to destroy pool->compact_wq before pool->release_wq,
>  * as any pending work on pool->compact_wq will call
> @@ -1317,6 +1337,28 @@ static u64 z3fold_get_pool_size(struct z3fold_pool 
> *pool)
> return atomic64_read(>pages_nr);
>  }
>
> +/*
> + * z3fold_dec_isolated() expects to be called while pool->lock is held.
> + */
> +static void z3fold_dec_isolated(struct z3fold_pool *pool)
> +{
> +   assert_spin_locked(&pool->lock);
> +   VM_BUG_ON(pool->isolated_pages <= 0);
> +   pool->isolated_pages--;
> +
> +   /*
> +* If we have no more isolated pages, we have to see if
> +* z3fold_destroy_pool() is waiting for a signal.
> +*/
> +   if (pool->isolated_pages == 0 && 
> waitqueue_active(&pool->isolate_wait))
> +   wake_up_all(&pool->isolate_wait);
> +}
> +
> +static void z3fold_inc_isolated(struct z3fold_pool *pool)
> +{
> +   pool->isolated_pages++;
> +}
> +
>  static bool z3fold_page_isolate(struct page *page, isolate_mode_t mode)
>  {
> struct z3fold_header *zhdr;
> @@ -1343,6 +1385,7 @@ static bool z3fold_page_isolate(struct page *page, 
> isolate_mode_t mode)
> spin_lock(&pool->lock);
> if (!list_empty(&page->lru))
> list_del(&page->lru);
> +   z3fold_inc_isolated(pool);
> spin_unlock(&pool->lock);
> z3fold_page_unlock(zhdr);
> return true;
> @@ -1417,6 +1460,10 @@ static int z3fold_page_migrate(struct address_space 
> *mapping, struct page *newpa
>
> queue_work_on(new_zhdr->cpu, pool->compact_wq, _zhdr->work);
>
> +   spin_lock(&pool->lock);
> +   z3fold_dec_isolated(pool);
> +   spin_unlock(&pool->lock);
> +
> page_mapcount_reset(page);
> put_page(page);
> return 0;
> @@ -1436,10 +1483,14 @@ static void z3fold_page_putback(struct page *page)
> INIT_LIST_HEAD(&page->lru);
> if (kref_put(&zhdr->refcount, release_z3fold_page_locked)) {
> atomic64_dec(&pool->pages_nr);
> +   spin_lock(&pool->lock);
> +   z3fold_dec_isolated(pool);
> +   spin_unlock(&pool->lock);
> return;
> }
> spin_lock(&pool->lock);
> list_add(&page->lru, &pool->lru);
> +   z3fold_dec_isolated(pool);
> spin_unlock(&pool->lock);
> z3fold_page_unlock(zhdr);
>  }
> --
> 2.22.0.770.g0f2c4a37fd-goog
>


Re: [PATCH] mm/z3fold.c: Allow __GFP_HIGHMEM in z3fold_alloc

2019-07-13 Thread Vitaly Wool
On Sat, Jul 13, 2019 at 12:22 AM Henry Burns  wrote:
>
> One of the gfp flags used to show that a page is movable is
> __GFP_HIGHMEM.  Currently z3fold_alloc() fails when __GFP_HIGHMEM is
> passed.  Now that z3fold pages are movable, we allow __GFP_HIGHMEM. We
> strip the movability related flags from the call to kmem_cache_alloc()
> for our slots since it is a kernel allocation.
>
> Signed-off-by: Henry Burns 

Acked-by: Vitaly Wool 

> ---
>  mm/z3fold.c | 5 +++--
>  1 file changed, 3 insertions(+), 2 deletions(-)
>
> diff --git a/mm/z3fold.c b/mm/z3fold.c
> index e78f95284d7c..cb567ddf051c 100644
> --- a/mm/z3fold.c
> +++ b/mm/z3fold.c
> @@ -193,7 +193,8 @@ static inline struct z3fold_buddy_slots 
> *alloc_slots(struct z3fold_pool *pool,
> gfp_t gfp)
>  {
> struct z3fold_buddy_slots *slots = kmem_cache_alloc(pool->c_handle,
> -   gfp);
> +   (gfp & 
> ~(__GFP_HIGHMEM
> +  | 
> __GFP_MOVABLE)));
>
> if (slots) {
> memset(slots->slot, 0, sizeof(slots->slot));
> @@ -844,7 +845,7 @@ static int z3fold_alloc(struct z3fold_pool *pool, size_t 
> size, gfp_t gfp,
> enum buddy bud;
> bool can_sleep = gfpflags_allow_blocking(gfp);
>
> -   if (!size || (gfp & __GFP_HIGHMEM))
> +   if (!size)
> return -EINVAL;
>
> if (size > PAGE_SIZE)
> --
> 2.22.0.510.g264f2c817a-goog
>


[PATCH] mm/z3fold.c: don't try to use buddy slots after free

2019-07-08 Thread Vitaly Wool
>From fd87fdc38ea195e5a694102a57bd4d59fc177433 Mon Sep 17 00:00:00 2001
From: Vitaly Wool 
Date: Mon, 8 Jul 2019 13:41:02 +0200
[PATCH] mm/z3fold: don't try to use buddy slots after free

As reported by Henry Burns:

Running z3fold stress testing with address sanitization
showed zhdr->slots was being used after it was freed.

z3fold_free(z3fold_pool, handle)
  free_handle(handle)
kmem_cache_free(pool->c_handle, zhdr->slots)
  release_z3fold_page_locked_list(kref)
__release_z3fold_page(zhdr, true)
  zhdr_to_pool(zhdr)
slots_to_pool(zhdr->slots)  *BOOM*

To fix this, add pointer to the pool back to z3fold_header and modify
zhdr_to_pool to return zhdr->pool.

Fixes: 7c2b8baa61fe  ("mm/z3fold.c: add structure for buddy handles")

Reported-by: Henry Burns 
Signed-off-by: Vitaly Wool 
---
 mm/z3fold.c | 5 -
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/mm/z3fold.c b/mm/z3fold.c
index 985732c8b025..e1686bf6d689 100644
--- a/mm/z3fold.c
+++ b/mm/z3fold.c
@@ -101,6 +101,7 @@ struct z3fold_buddy_slots {
  * @refcount:  reference count for the z3fold page
  * @work:  work_struct for page layout optimization
  * @slots: pointer to the structure holding buddy slots
+ * @pool:  pointer to the containing pool
  * @cpu:   CPU which this page "belongs" to
  * @first_chunks:  the size of the first buddy in chunks, 0 if free
  * @middle_chunks: the size of the middle buddy in chunks, 0 if free
@@ -114,6 +115,7 @@ struct z3fold_header {
struct kref refcount;
struct work_struct work;
struct z3fold_buddy_slots *slots;
+   struct z3fold_pool *pool;
short cpu;
unsigned short first_chunks;
unsigned short middle_chunks;
@@ -320,6 +322,7 @@ static struct z3fold_header *init_z3fold_page(struct page 
*page,
zhdr->start_middle = 0;
zhdr->cpu = -1;
zhdr->slots = slots;
+   zhdr->pool = pool;
INIT_LIST_HEAD(>buddy);
INIT_WORK(>work, compact_page_work);
return zhdr;
@@ -426,7 +429,7 @@ static enum buddy handle_to_buddy(unsigned long handle)
 
 static inline struct z3fold_pool *zhdr_to_pool(struct z3fold_header *zhdr)
 {
-   return slots_to_pool(zhdr->slots);
+   return zhdr->pool;
 }
 
 static void __release_z3fold_page(struct z3fold_header *zhdr, bool locked)
-- 
2.17.1


Re: [PATCH] mm/z3fold: Fix z3fold_buddy_slots use after free

2019-07-03 Thread Vitaly Wool
On Tue, Jul 2, 2019 at 6:57 PM Henry Burns  wrote:
>
> On Tue, Jul 2, 2019 at 12:45 AM Vitaly Wool  wrote:
> >
> > Hi Henry,
> >
> > On Mon, Jul 1, 2019 at 8:31 PM Henry Burns  wrote:
> > >
> > > Running z3fold stress testing with address sanitization
> > > showed zhdr->slots was being used after it was freed.
> > >
> > > z3fold_free(z3fold_pool, handle)
> > >   free_handle(handle)
> > > kmem_cache_free(pool->c_handle, zhdr->slots)
> > >   release_z3fold_page_locked_list(kref)
> > > __release_z3fold_page(zhdr, true)
> > >   zhdr_to_pool(zhdr)
> > > slots_to_pool(zhdr->slots)  *BOOM*
> >
> > Thanks for looking into this. I'm not entirely sure I'm all for
> > splitting free_handle() but let me think about it.
> >
> > > Instead we split free_handle into two functions, release_handle()
> > > and free_slots(). We use release_handle() in place of free_handle(),
> > > and use free_slots() to call kmem_cache_free() after
> > > __release_z3fold_page() is done.
> >
> > A little less intrusive solution would be to move backlink to pool
> > from slots back to z3fold_header. Looks like it was a bad idea from
> > the start.
> >
> > Best regards,
> >Vitaly
>
> We still want z3fold pages to be movable though. Wouldn't moving
> the backink to the pool from slots to z3fold_header prevent us from
> enabling migration?

That is a valid point, but we can just add back a pool pointer to
z3fold_header. The thing here is, there's another patch in the
pipeline that allows for better (inter-page) compaction, and it will
somewhat complicate things, because sometimes slots will have to be
released after the z3fold page is released (because they will hold a
handle to another z3fold page). I would prefer that we just added the
pool pointer back to z3fold_header and changed zhdr_to_pool to simply
return zhdr->pool, got the compaction patch working on top of that, and
then came back to the size optimization.
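
A rough sketch of the layout being proposed (illustrative, not a patch):
keep the pool back link in both places, so zhdr_to_pool() no longer has to
go through slots that may already have been freed, while the slots keep
their own back link for the planned inter-page compaction, where they can
outlive the page:

struct z3fold_buddy_slots {
        unsigned long slot[BUDDY_MASK + 1];
        unsigned long pool;             /* back link + flags, may outlive the page */
};

struct z3fold_header {
        /* ... */
        struct z3fold_buddy_slots *slots;
        struct z3fold_pool *pool;       /* direct back link, safe after free_handle() */
        /* ... */
};

static inline struct z3fold_pool *zhdr_to_pool(struct z3fold_header *zhdr)
{
        return zhdr->pool;              /* no longer dereferences zhdr->slots */
}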

Best regards,
   Vitaly


Re: [PATCH v2] mm/z3fold.c: Lock z3fold page before __SetPageMovable()

2019-07-02 Thread Vitaly Wool
On Wed, Jul 3, 2019 at 12:18 AM Henry Burns  wrote:
>
> On Tue, Jul 2, 2019 at 2:19 PM Andrew Morton  
> wrote:
> >
> > On Mon, 1 Jul 2019 18:16:30 -0700 Henry Burns  wrote:
> >
> > > Cc: Vitaly Wool , Vitaly Vul 
> >
> > Are these the same person?
> I think it's the same person, but I wasn't sure which email to include
> because one was
> in the list of maintainers and I had contacted the other earlier.

This is the same person; it's the transliteration being done differently
that caused this :)

~Vitaly


Re: [PATCH v2] mm/z3fold.c: Lock z3fold page before __SetPageMovable()

2019-07-02 Thread Vitaly Wool
On Wed, Jul 3, 2019 at 12:24 AM Andrew Morton  wrote:
>
> On Tue, 2 Jul 2019 15:17:47 -0700 Henry Burns  wrote:
>
> > > > > > +   if (can_sleep) {
> > > > > > +   lock_page(page);
> > > > > > +   __SetPageMovable(page, pool->inode->i_mapping);
> > > > > > +   unlock_page(page);
> > > > > > +   } else {
> > > > > > +   if (!WARN_ON(!trylock_page(page))) {
> > > > > > +   __SetPageMovable(page, 
> > > > > > pool->inode->i_mapping);
> > > > > > +   unlock_page(page);
> > > > > > +   } else {
> > > > > > +   pr_err("Newly allocated z3fold page is 
> > > > > > locked\n");
> > > > > > +   WARN_ON(1);
>
> The WARN_ON will have already warned in this case.
>
> But the whole idea of warning in this case may be undesirable.  We KNOW
> that the warning will sometimes trigger (yes?).  So what's the point in
> scaring users?

Well, normally a newly allocated page that we own should not be locked
by someone else so this is worth a warning IMO. With that said, the
else branch here appears to be redundant.
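
A sketch of the simplification implied here, assuming the premise holds
that a newly allocated page cannot legitimately be locked by anyone else
(so a failed trylock only ever deserves the WARN_ON, not a separate error
path):

        if (can_sleep) {
                lock_page(page);
                __SetPageMovable(page, pool->inode->i_mapping);
                unlock_page(page);
        } else if (!WARN_ON(!trylock_page(page))) {
                /* trylock succeeded: mark the page movable and drop the lock */
                __SetPageMovable(page, pool->inode->i_mapping);
                unlock_page(page);
        }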

~Vitaly


Re: [PATCH v2] mm/z3fold.c: Lock z3fold page before __SetPageMovable()

2019-07-02 Thread Vitaly Wool
On Tue, Jul 2, 2019 at 3:51 AM Henry Burns  wrote:
>
> __SetPageMovable() expects its page to be locked, but z3fold.c doesn't
> lock the page. Following zsmalloc.c's example we call trylock_page() and
> unlock_page(). Also makes z3fold_page_migrate() assert that newpage is
> passed in locked, as documentation.
>
> Signed-off-by: Henry Burns 
> Suggested-by: Vitaly Wool 

Acked-by: Vitaly Wool 

Thanks!

> ---
>  Changelog since v1:
>  - Added an if statement around WARN_ON(trylock_page(page)) to avoid
>unlocking a page locked by a someone else.
>
>  mm/z3fold.c | 6 +-
>  1 file changed, 5 insertions(+), 1 deletion(-)
>
> diff --git a/mm/z3fold.c b/mm/z3fold.c
> index e174d1549734..6341435b9610 100644
> --- a/mm/z3fold.c
> +++ b/mm/z3fold.c
> @@ -918,7 +918,10 @@ static int z3fold_alloc(struct z3fold_pool *pool, size_t 
> size, gfp_t gfp,
> set_bit(PAGE_HEADLESS, >private);
> goto headless;
> }
> -   __SetPageMovable(page, pool->inode->i_mapping);
> +   if (!WARN_ON(!trylock_page(page))) {
> +   __SetPageMovable(page, pool->inode->i_mapping);
> +   unlock_page(page);
> +   }
> z3fold_page_lock(zhdr);
>
>  found:
> @@ -1325,6 +1328,7 @@ static int z3fold_page_migrate(struct address_space 
> *mapping, struct page *newpa
>
> VM_BUG_ON_PAGE(!PageMovable(page), page);
> VM_BUG_ON_PAGE(!PageIsolated(page), page);
> +   VM_BUG_ON_PAGE(!PageLocked(newpage), newpage);
>
> zhdr = page_address(page);
> pool = zhdr_to_pool(zhdr);
> --
> 2.22.0.410.gd8fdbe21b5-goog
>


Re: [PATCH] mm/z3fold: Fix z3fold_buddy_slots use after free

2019-07-02 Thread Vitaly Wool
Hi Henry,

On Mon, Jul 1, 2019 at 8:31 PM Henry Burns  wrote:
>
> Running z3fold stress testing with address sanitization
> showed zhdr->slots was being used after it was freed.
>
> z3fold_free(z3fold_pool, handle)
>   free_handle(handle)
> kmem_cache_free(pool->c_handle, zhdr->slots)
>   release_z3fold_page_locked_list(kref)
> __release_z3fold_page(zhdr, true)
>   zhdr_to_pool(zhdr)
> slots_to_pool(zhdr->slots)  *BOOM*

Thanks for looking into this. I'm not entirely sure I'm all for
splitting free_handle() but let me think about it.

> Instead we split free_handle into two functions, release_handle()
> and free_slots(). We use release_handle() in place of free_handle(),
> and use free_slots() to call kmem_cache_free() after
> __release_z3fold_page() is done.

A little less intrusive solution would be to move backlink to pool
from slots back to z3fold_header. Looks like it was a bad idea from
the start.

Best regards,
   Vitaly


Re: [PATCH V3 1/2] zpool: Add malloc_support_movable to zpool_driver

2019-06-05 Thread Vitaly Wool
Hi Shakeel,

On Wed, Jun 5, 2019 at 6:31 PM Shakeel Butt  wrote:
>
> On Wed, Jun 5, 2019 at 3:06 AM Hui Zhu  wrote:
> >
> > As a zpool_driver, zsmalloc can allocate movable memory because it
> > supports migrating pages.
> > But zbud and z3fold cannot allocate movable memory.
> >
>
> Cc: Vitaly

thanks for looping me in :)

> It seems like z3fold does support page migration but z3fold's malloc
> is rejecting __GFP_HIGHMEM. Vitaly, is there a reason to keep
> rejecting __GFP_HIGHMEM after 1f862989b04a ("mm/z3fold.c: support page
> migration").

No; I don't think I see a reason to keep that part. You are very
welcome to submit a patch, or otherwise I can do it when I'm done with
the patches that are already in the pipeline.
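
For reference, the change being invited here boils down to dropping the
flag check in z3fold_alloc() and stripping the movability hints only for
the internal slots allocation, i.e. essentially the patch quoted elsewhere
in this archive as "mm/z3fold.c: Allow __GFP_HIGHMEM in z3fold_alloc":

-       if (!size || (gfp & __GFP_HIGHMEM))
+       if (!size)
                return -EINVAL;

        /* slots are a plain kernel (slab) allocation, keep them unmovable */
        slots = kmem_cache_alloc(pool->c_handle,
                                 gfp & ~(__GFP_HIGHMEM | __GFP_MOVABLE));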

Thanks,
   Vitaly


[PATCH v2] z3fold: add inter-page compaction

2019-05-27 Thread Vitaly Wool
For each page scheduled for compaction (e. g. by z3fold_free()),
try to apply inter-page compaction before running the traditional/
existing intra-page compaction. That means, if the page has only one
buddy, we treat that buddy as a new object that we aim to place into
an existing z3fold page. If such a page is found, that object is
transferred and the old page is freed completely. The transferred
object is named "foreign" and treated slightly differently thereafter.

Namely, we increase the "foreign handle" counter for the new page. Pages
with a non-zero "foreign handle" count become unmovable. This patch
implements "foreign handle" detection when a handle is freed to
decrement the foreign handle counter accordingly, so a page may well
become movable again as time goes by.

As a result, we almost always have exactly 3 objects per page and
significantly better average compression ratio.
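
In pseudo-C, the inter-page step described above looks roughly like this
(single_buddy_size() and move_object() are illustrative names, not taken
from the patch; __z3fold_alloc(), foreign_handles and free_z3fold_page()
are the real ones):

static bool try_interpage_compact(struct z3fold_pool *pool,
                                  struct z3fold_header *zhdr)
{
        size_t sz = single_buddy_size(zhdr);            /* only one buddy left */
        struct z3fold_header *new_zhdr;

        new_zhdr = __z3fold_alloc(pool, sz, false);     /* find a fitting page */
        if (!new_zhdr)
                return false;                           /* fall back to intra-page */

        move_object(new_zhdr, zhdr);                    /* copy data, re-point handle */
        new_zhdr->foreign_handles++;                    /* new page becomes unmovable */
        free_z3fold_page(virt_to_page(zhdr), false);    /* old page freed completely */
        return true;
}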

Changes from v1:
* balanced use of inlining
* more comments in the key parts of code
* code rearranged to avoid forward declarations
* rwlock instead of seqlock

Signed-off-by: Vitaly Wool 
---
 mm/z3fold.c | 538 
 1 file changed, 373 insertions(+), 165 deletions(-)

diff --git a/mm/z3fold.c b/mm/z3fold.c
index 985732c8b025..2bc3dbde6255 100644
--- a/mm/z3fold.c
+++ b/mm/z3fold.c
@@ -89,6 +89,7 @@ struct z3fold_buddy_slots {
 */
unsigned long slot[BUDDY_MASK + 1];
unsigned long pool; /* back link + flags */
+   rwlock_t lock;
 };
 #define HANDLE_FLAG_MASK   (0x03)
 
@@ -121,6 +122,7 @@ struct z3fold_header {
unsigned short start_middle;
unsigned short first_num:2;
unsigned short mapped_count:2;
+   unsigned short foreign_handles:2;
 };
 
 /**
@@ -175,6 +177,14 @@ enum z3fold_page_flags {
PAGE_CLAIMED, /* by either reclaim or free */
 };
 
+/*
+ * handle flags, go under HANDLE_FLAG_MASK
+ */
+enum z3fold_handle_flags {
+   HANDLES_ORPHANED = 0,
+};
+
+
 /*
  * Helpers
 */
@@ -199,6 +209,7 @@ static inline struct z3fold_buddy_slots *alloc_slots(struct 
z3fold_pool *pool,
if (slots) {
memset(slots->slot, 0, sizeof(slots->slot));
slots->pool = (unsigned long)pool;
+   rwlock_init(>lock);
}
 
return slots;
@@ -214,33 +225,6 @@ static inline struct z3fold_buddy_slots 
*handle_to_slots(unsigned long handle)
return (struct z3fold_buddy_slots *)(handle & ~(SLOTS_ALIGN - 1));
 }
 
-static inline void free_handle(unsigned long handle)
-{
-   struct z3fold_buddy_slots *slots;
-   int i;
-   bool is_free;
-
-   if (handle & (1 << PAGE_HEADLESS))
-   return;
-
-   WARN_ON(*(unsigned long *)handle == 0);
-   *(unsigned long *)handle = 0;
-   slots = handle_to_slots(handle);
-   is_free = true;
-   for (i = 0; i <= BUDDY_MASK; i++) {
-   if (slots->slot[i]) {
-   is_free = false;
-   break;
-   }
-   }
-
-   if (is_free) {
-   struct z3fold_pool *pool = slots_to_pool(slots);
-
-   kmem_cache_free(pool->c_handle, slots);
-   }
-}
-
 static struct dentry *z3fold_do_mount(struct file_system_type *fs_type,
int flags, const char *dev_name, void *data)
 {
@@ -320,6 +304,7 @@ static struct z3fold_header *init_z3fold_page(struct page 
*page,
zhdr->start_middle = 0;
zhdr->cpu = -1;
zhdr->slots = slots;
+   zhdr->foreign_handles = 0;
INIT_LIST_HEAD(>buddy);
INIT_WORK(>work, compact_page_work);
return zhdr;
@@ -361,6 +346,55 @@ static inline int __idx(struct z3fold_header *zhdr, enum 
buddy bud)
return (bud + zhdr->first_num) & BUDDY_MASK;
 }
 
+static inline struct z3fold_header *__get_z3fold_header(unsigned long handle,
+   bool lock)
+{
+   struct z3fold_buddy_slots *slots;
+   struct z3fold_header *zhdr;
+
+   if (!(handle & (1 << PAGE_HEADLESS))) {
+   slots = handle_to_slots(handle);
+   do {
+   unsigned long addr;
+
+   read_lock(>lock);
+   addr = *(unsigned long *)handle;
+   zhdr = (struct z3fold_header *)(addr & PAGE_MASK);
+   if (lock && z3fold_page_trylock(zhdr)) {
+   read_unlock(>lock);
+   break;
+   }
+   read_unlock(>lock);
+   cpu_relax();
+   } while (lock);
+   } else {
+   zhdr = (struct z3fold_header *)(handle & PAGE_MASK);
+   }
+
+   return zhdr;
+}
+
+
+/* Returns the z3fold page where a g

Re: [PATCH] z3fold: add inter-page compaction

2019-05-27 Thread Vitaly Wool
On Sun, May 26, 2019 at 12:09 AM Andrew Morton
 wrote:


> Forward-declaring inline functions is peculiar, but it does appear to work.
>
> z3fold is quite inline-happy.  Fortunately the compiler will ignore the
> inline hint if it seems a bad idea.  Even then, the below shrinks
> z3fold.o text from 30k to 27k.  Which might even make it faster

It is faster with inlines; I'll try to find a better balance between
size and performance in the next version of the patch, though.


> >
> > ...
> >
> > +static inline struct z3fold_header *__get_z3fold_header(unsigned long 
> > handle,
> > + bool lock)
> > +{
> > + struct z3fold_buddy_slots *slots;
> > + struct z3fold_header *zhdr;
> > + unsigned int seq;
> > + bool is_valid;
> > +
> > + if (!(handle & (1 << PAGE_HEADLESS))) {
> > + slots = handle_to_slots(handle);
> > + do {
> > + unsigned long addr;
> > +
> > + seq = read_seqbegin(>seqlock);
> > + addr = *(unsigned long *)handle;
> > + zhdr = (struct z3fold_header *)(addr & PAGE_MASK);
> > + preempt_disable();
>
> Why is this done?
>
> > + is_valid = !read_seqretry(>seqlock, seq);
> > + if (!is_valid) {
> > + preempt_enable();
> > + continue;
> > + }
> > + /*
> > +  * if we are here, zhdr is a pointer to a valid z3fold
> > +  * header. Lock it! And then re-check if someone has
> > +  * changed which z3fold page this handle points to
> > +  */
> > + if (lock)
> > + z3fold_page_lock(zhdr);
> > + preempt_enable();
> > + /*
> > +  * we use is_valid as a "cached" value: if it's false,
> > +  * no other checks needed, have to go one more round
> > +  */
> > + } while (!is_valid || (read_seqretry(>seqlock, seq) &&
> > + (lock ? ({ z3fold_page_unlock(zhdr); 1; }) : 1)));
> > + } else {
> > + zhdr = (struct z3fold_header *)(handle & PAGE_MASK);
> > + }
> > +
> > + return zhdr;
> > +}
> >
> > ...
> >
> >  static unsigned short handle_to_chunks(unsigned long handle)
> >  {
> > - unsigned long addr = *(unsigned long *)handle;
> > + unsigned long addr;
> > + struct z3fold_buddy_slots *slots = handle_to_slots(handle);
> > + unsigned int seq;
> > +
> > + do {
> > + seq = read_seqbegin(>seqlock);
> > + addr = *(unsigned long *)handle;
> > + } while (read_seqretry(>seqlock, seq));
>
> It isn't done here (I think).

handle_to_chunks() is always called with the z3fold header locked, which
makes it a lot easier in this case. I'll add some comments in V2.
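
To illustrate the point (a sketch of what the helper can stay as, not a
new hunk): with the z3fold page lock held the slot cannot be re-encoded
concurrently, so a single read is enough and no seqlock retry loop is
needed.

static unsigned short handle_to_chunks(unsigned long handle)
{
        /* caller holds z3fold_page_lock(), so the slot contents are stable */
        unsigned long addr = *(unsigned long *)handle;

        return (addr & ~PAGE_MASK) >> BUDDY_SHIFT;
}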

Thanks,
   Vitaly


[PATCH] z3fold: add inter-page compaction

2019-05-24 Thread Vitaly Wool
For each page scheduled for compaction (e. g. by z3fold_free()),
try to apply inter-page compaction before running the traditional/
existing intra-page compaction. That means, if the page has only one
buddy, we treat that buddy as a new object that we aim to place into
an existing z3fold page. If such a page is found, that object is
transferred and the old page is freed completely. The transferred
object is named "foreign" and treated slightly differently thereafter.

Namely, we increase the "foreign handle" counter for the new page. Pages
with a non-zero "foreign handle" count become unmovable. This patch
implements "foreign handle" detection when a handle is freed to
decrement the foreign handle counter accordingly, so a page may well
become movable again as time goes by.

As a result, we almost always have exactly 3 objects per page and
significantly better average compression ratio.
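
The foreign-handle detection mentioned above amounts to comparing the
slots structure a handle lives in with the one owned by the page it points
into (simplified from the free_handle() hunk below):

        zhdr = handle_to_z3fold_header(handle);
        slots = handle_to_slots(handle);
        if (zhdr->slots == slots)
                return;                 /* regular handle, nothing else to do */

        /* the handle lives in another page's slots, so it is foreign */
        zhdr->foreign_handles--;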

Signed-off-by: Vitaly Wool 
---
 mm/z3fold.c | 328 +---
 1 file changed, 285 insertions(+), 43 deletions(-)

diff --git a/mm/z3fold.c b/mm/z3fold.c
index 985732c8b025..d82bccc8bc90 100644
--- a/mm/z3fold.c
+++ b/mm/z3fold.c
@@ -41,6 +41,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 
 /*
@@ -89,6 +90,7 @@ struct z3fold_buddy_slots {
 */
unsigned long slot[BUDDY_MASK + 1];
unsigned long pool; /* back link + flags */
+   seqlock_t seqlock;
 };
 #define HANDLE_FLAG_MASK   (0x03)
 
@@ -121,6 +123,7 @@ struct z3fold_header {
unsigned short start_middle;
unsigned short first_num:2;
unsigned short mapped_count:2;
+   unsigned short foreign_handles:2;
 };
 
 /**
@@ -175,6 +178,18 @@ enum z3fold_page_flags {
PAGE_CLAIMED, /* by either reclaim or free */
 };
 
+/*
+ * handle flags, go under HANDLE_FLAG_MASK
+ */
+enum z3fold_handle_flags {
+   HANDLES_ORPHANED = 0,
+};
+
+static inline struct z3fold_header *handle_to_z3fold_header(unsigned long);
+static inline struct z3fold_pool *zhdr_to_pool(struct z3fold_header *);
+static struct z3fold_header *__z3fold_alloc(struct z3fold_pool *, size_t, 
bool);
+static void add_to_unbuddied(struct z3fold_pool *, struct z3fold_header *);
+
 /*
  * Helpers
 */
@@ -199,6 +214,7 @@ static inline struct z3fold_buddy_slots *alloc_slots(struct 
z3fold_pool *pool,
if (slots) {
memset(slots->slot, 0, sizeof(slots->slot));
slots->pool = (unsigned long)pool;
+   seqlock_init(>seqlock);
}
 
return slots;
@@ -217,24 +233,39 @@ static inline struct z3fold_buddy_slots 
*handle_to_slots(unsigned long handle)
 static inline void free_handle(unsigned long handle)
 {
struct z3fold_buddy_slots *slots;
+   struct z3fold_header *zhdr;
int i;
bool is_free;
+   unsigned int seq;
 
if (handle & (1 << PAGE_HEADLESS))
return;
 
-   WARN_ON(*(unsigned long *)handle == 0);
-   *(unsigned long *)handle = 0;
+   if (WARN_ON(*(unsigned long *)handle == 0))
+   return;
+
+   zhdr = handle_to_z3fold_header(handle);
slots = handle_to_slots(handle);
-   is_free = true;
-   for (i = 0; i <= BUDDY_MASK; i++) {
-   if (slots->slot[i]) {
-   is_free = false;
-   break;
+   write_seqlock(>seqlock);
+   *(unsigned long *)handle = 0;
+   write_sequnlock(>seqlock);
+   if (zhdr->slots == slots)
+   return; /* simple case, nothing else to do */
+
+   /* we are freeing a foreign handle if we are here */
+   zhdr->foreign_handles--;
+   do {
+   is_free = true;
+   seq = read_seqbegin(>seqlock);
+   for (i = 0; i <= BUDDY_MASK; i++) {
+   if (slots->slot[i]) {
+   is_free = false;
+   break;
+   }
}
-   }
+   } while (read_seqretry(>seqlock, seq));
 
-   if (is_free) {
+   if (is_free && test_and_clear_bit(HANDLES_ORPHANED, >pool)) {
struct z3fold_pool *pool = slots_to_pool(slots);
 
kmem_cache_free(pool->c_handle, slots);
@@ -320,6 +351,7 @@ static struct z3fold_header *init_z3fold_page(struct page 
*page,
zhdr->start_middle = 0;
zhdr->cpu = -1;
zhdr->slots = slots;
+   zhdr->foreign_handles = 0;
INIT_LIST_HEAD(>buddy);
INIT_WORK(>work, compact_page_work);
return zhdr;
@@ -385,25 +417,87 @@ static unsigned long encode_handle(struct z3fold_header 
*zhdr, enum buddy bud)
h |= (zhdr->last_chunks << BUDDY_SHIFT);
 
slots = zhdr->slots;
+   write_seqlock(>seqlock);
slots->slot[idx] = h;
+   write_seq

[PATCH] z3fold: fix scheduling while atomic

2019-05-23 Thread Vitaly Wool
kmem_cache_alloc() may be called from z3fold_alloc() in atomic
context, so we need to pass the correct gfp flags to avoid a "scheduling
while atomic" bug.

Signed-off-by: Vitaly Wool 
---
 mm/z3fold.c | 11 ++-
 1 file changed, 6 insertions(+), 5 deletions(-)

diff --git a/mm/z3fold.c b/mm/z3fold.c
index 99be52c5ca45..985732c8b025 100644
--- a/mm/z3fold.c
+++ b/mm/z3fold.c
@@ -190,10 +190,11 @@ static int size_to_chunks(size_t size)
 
 static void compact_page_work(struct work_struct *w);
 
-static inline struct z3fold_buddy_slots *alloc_slots(struct z3fold_pool *pool)
+static inline struct z3fold_buddy_slots *alloc_slots(struct z3fold_pool *pool,
+   gfp_t gfp)
 {
struct z3fold_buddy_slots *slots = kmem_cache_alloc(pool->c_handle,
-   GFP_KERNEL);
+   gfp);
 
if (slots) {
memset(slots->slot, 0, sizeof(slots->slot));
@@ -295,10 +296,10 @@ static void z3fold_unregister_migration(struct 
z3fold_pool *pool)
 
 /* Initializes the z3fold header of a newly allocated z3fold page */
 static struct z3fold_header *init_z3fold_page(struct page *page,
-   struct z3fold_pool *pool)
+   struct z3fold_pool *pool, gfp_t gfp)
 {
struct z3fold_header *zhdr = page_address(page);
-   struct z3fold_buddy_slots *slots = alloc_slots(pool);
+   struct z3fold_buddy_slots *slots = alloc_slots(pool, gfp);
 
if (!slots)
return NULL;
@@ -912,7 +913,7 @@ static int z3fold_alloc(struct z3fold_pool *pool, size_t 
size, gfp_t gfp,
if (!page)
return -ENOMEM;
 
-   zhdr = init_z3fold_page(page, pool);
+   zhdr = init_z3fold_page(page, pool, gfp);
if (!zhdr) {
__free_page(page);
return -ENOMEM;
-- 
2.17.1


[PATCHv2 4/4] z3fold: support page migration

2019-04-17 Thread Vitaly Wool
Now that we are not using page address in handles directly, we
can make z3fold pages movable to decrease the memory fragmentation
z3fold may create over time.

This patch starts advertising non-headless z3fold pages as movable
and uses the existing kernel infrastructure to implement moving of
such pages per memory management subsystem's request. It thus
implements 3 required callbacks for page migration:

* isolation callback: z3fold_page_isolate(): try to isolate the
page by removing it from all lists. Pages scheduled for some activity
and mapped pages will not be isolated. Return true if isolation was
successful or false otherwise
* migration callback: z3fold_page_migrate(): re-check critical
conditions and migrate page contents to the new page provided by the
memory subsystem. Returns 0 on success or negative error code
otherwise
* putback callback: z3fold_page_putback(): put back the page if
z3fold_page_migrate() for it failed permanently (i. e. not with
-EAGAIN code).
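
The three callbacks are wired into the pool's anonymous inode mapping via
address_space_operations; the hunk doing that is cut off in this archive,
so the sketch below shows the expected registration:

static const struct address_space_operations z3fold_aops = {
        .isolate_page   = z3fold_page_isolate,
        .migratepage    = z3fold_page_migrate,
        .putback_page   = z3fold_page_putback,
};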

Signed-off-by: Vitaly Wool 
---
 mm/z3fold.c | 241 +---
 1 file changed, 231 insertions(+), 10 deletions(-)

diff --git a/mm/z3fold.c b/mm/z3fold.c
index bebc10083f1c..d9eabfdad0fe 100644
--- a/mm/z3fold.c
+++ b/mm/z3fold.c
@@ -24,10 +24,18 @@
 
 #include 
 #include 
+#include 
+#include 
 #include 
 #include 
 #include 
+#include 
+#include 
+#include 
+#include 
 #include 
+#include 
+#include 
 #include 
 #include 
 #include 
@@ -97,6 +105,7 @@ struct z3fold_buddy_slots {
  * @middle_chunks: the size of the middle buddy in chunks, 0 if free
  * @last_chunks:   the size of the last buddy in chunks, 0 if free
  * @first_num: the starting number (for the first handle)
+ * @mapped_count:  the number of objects currently mapped
  */
 struct z3fold_header {
struct list_head buddy;
@@ -110,6 +119,7 @@ struct z3fold_header {
unsigned short last_chunks;
unsigned short start_middle;
unsigned short first_num:2;
+   unsigned short mapped_count:2;
 };
 
 /**
@@ -130,6 +140,7 @@ struct z3fold_header {
  * @compact_wq:workqueue for page layout background optimization
  * @release_wq:workqueue for safe page release
  * @work:  work_struct for safe page release
+ * @inode: inode for z3fold pseudo filesystem
  *
  * This structure is allocated at pool creation time and maintains metadata
  * pertaining to a particular z3fold pool.
@@ -149,6 +160,7 @@ struct z3fold_pool {
struct workqueue_struct *compact_wq;
struct workqueue_struct *release_wq;
struct work_struct work;
+   struct inode *inode;
 };
 
 /*
@@ -227,6 +239,59 @@ static inline void free_handle(unsigned long handle)
}
 }
 
+static struct dentry *z3fold_do_mount(struct file_system_type *fs_type,
+   int flags, const char *dev_name, void *data)
+{
+   static const struct dentry_operations ops = {
+   .d_dname = simple_dname,
+   };
+
+   return mount_pseudo(fs_type, "z3fold:", NULL, , 0x33);
+}
+
+static struct file_system_type z3fold_fs = {
+   .name   = "z3fold",
+   .mount  = z3fold_do_mount,
+   .kill_sb= kill_anon_super,
+};
+
+static struct vfsmount *z3fold_mnt;
+static int z3fold_mount(void)
+{
+   int ret = 0;
+
+   z3fold_mnt = kern_mount(_fs);
+   if (IS_ERR(z3fold_mnt))
+   ret = PTR_ERR(z3fold_mnt);
+
+   return ret;
+}
+
+static void z3fold_unmount(void)
+{
+   kern_unmount(z3fold_mnt);
+}
+
+static const struct address_space_operations z3fold_aops;
+static int z3fold_register_migration(struct z3fold_pool *pool)
+{
+   pool->inode = alloc_anon_inode(z3fold_mnt->mnt_sb);
+   if (IS_ERR(pool->inode)) {
+   pool->inode = NULL;
+   return 1;
+   }
+
+   pool->inode->i_mapping->private_data = pool;
+   pool->inode->i_mapping->a_ops = _aops;
+   return 0;
+}
+
+static void z3fold_unregister_migration(struct z3fold_pool *pool)
+{
+   if (pool->inode)
+   iput(pool->inode);
+ }
+
 /* Initializes the z3fold header of a newly allocated z3fold page */
 static struct z3fold_header *init_z3fold_page(struct page *page,
struct z3fold_pool *pool)
@@ -259,8 +324,14 @@ static struct z3fold_header *init_z3fold_page(struct page 
*page,
 }
 
 /* Resets the struct page fields and frees the page */
-static void free_z3fold_page(struct page *page)
+static void free_z3fold_page(struct page *page, bool headless)
 {
+   if (!headless) {
+   lock_page(page);
+   __ClearPageMovable(page);
+   unlock_page(page);
+   }
+   ClearPagePrivate(page);
__free_page(page);
 }
 
@@ -317,12 +388,12 @@ static unsigned long encode_handle(struct z3fold_header 
*zhdr, enum buddy bud)
 }
 
 /* Returns the z3fold page where a given

[PATCHv2 2/4] z3fold: improve compression by extending search

2019-04-17 Thread Vitaly Wool
The current z3fold implementation only searches this CPU's page
lists for a fitting page to put a new object into. This patch adds
quick search for very well fitting pages (i. e. those having
exactly the required number of free space) on other CPUs too,
before allocating a new page for that object.

Signed-off-by: Vitaly Wool 
---
 mm/z3fold.c | 36 
 1 file changed, 36 insertions(+)

diff --git a/mm/z3fold.c b/mm/z3fold.c
index 7a59875d880c..29a4f1249bef 100644
--- a/mm/z3fold.c
+++ b/mm/z3fold.c
@@ -522,6 +522,42 @@ static inline struct z3fold_header *__z3fold_alloc(struct 
z3fold_pool *pool,
}
put_cpu_ptr(pool->unbuddied);
 
+   if (!zhdr) {
+   int cpu;
+
+   /* look for _exact_ match on other cpus' lists */
+   for_each_online_cpu(cpu) {
+   struct list_head *l;
+
+   unbuddied = per_cpu_ptr(pool->unbuddied, cpu);
+   spin_lock(>lock);
+   l = [chunks];
+
+   zhdr = list_first_entry_or_null(READ_ONCE(l),
+   struct z3fold_header, buddy);
+
+   if (!zhdr || !z3fold_page_trylock(zhdr)) {
+   spin_unlock(>lock);
+   zhdr = NULL;
+   continue;
+   }
+   list_del_init(>buddy);
+   zhdr->cpu = -1;
+   spin_unlock(>lock);
+
+   page = virt_to_page(zhdr);
+   if (test_bit(NEEDS_COMPACTING, >private)) {
+   z3fold_page_unlock(zhdr);
+   zhdr = NULL;
+   if (can_sleep)
+   cond_resched();
+   continue;
+   }
+   kref_get(>refcount);
+   break;
+   }
+   }
+
return zhdr;
 }
 
-- 
2.17.1


[PATCHv2 3/4] z3fold: add structure for buddy handles

2019-04-17 Thread Vitaly Wool
For z3fold to be able to move its pages per request of the memory
subsystem, it should not use direct object addresses in handles.
Instead, it will create abstract handles (3 per page) which will
contain pointers to z3fold objects. Thus, it will be possible to
change these pointers when z3fold page is moved.
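
In code terms, a handle becomes the address of a slot inside the per-page
z3fold_buddy_slots structure, and the slot holds the object's current
address (a sketch using the names from the patch):

        /* handing out a handle: point at the slot, not at the object */
        unsigned long handle = (unsigned long)&zhdr->slots->slot[idx];

        /* resolving it later: one extra dereference finds the live object */
        unsigned long addr = *(unsigned long *)handle;
        struct z3fold_header *owner = (struct z3fold_header *)(addr & PAGE_MASK);

        /* migration only has to rewrite slot[idx]; outstanding handles stay valid */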

Signed-off-by: Vitaly Wool 
---
 mm/z3fold.c | 185 
 1 file changed, 145 insertions(+), 40 deletions(-)

diff --git a/mm/z3fold.c b/mm/z3fold.c
index 29a4f1249bef..bebc10083f1c 100644
--- a/mm/z3fold.c
+++ b/mm/z3fold.c
@@ -34,6 +34,29 @@
 #include 
 #include 
 
+/*
+ * NCHUNKS_ORDER determines the internal allocation granularity, effectively
+ * adjusting internal fragmentation.  It also determines the number of
+ * freelists maintained in each pool. NCHUNKS_ORDER of 6 means that the
+ * allocation granularity will be in chunks of size PAGE_SIZE/64. Some chunks
+ * in the beginning of an allocated page are occupied by z3fold header, so
+ * NCHUNKS will be calculated to 63 (or 62 in case CONFIG_DEBUG_SPINLOCK=y),
+ * which shows the max number of free chunks in z3fold page, also there will
+ * be 63, or 62, respectively, freelists per pool.
+ */
+#define NCHUNKS_ORDER  6
+
+#define CHUNK_SHIFT(PAGE_SHIFT - NCHUNKS_ORDER)
+#define CHUNK_SIZE (1 << CHUNK_SHIFT)
+#define ZHDR_SIZE_ALIGNED round_up(sizeof(struct z3fold_header), CHUNK_SIZE)
+#define ZHDR_CHUNKS(ZHDR_SIZE_ALIGNED >> CHUNK_SHIFT)
+#define TOTAL_CHUNKS   (PAGE_SIZE >> CHUNK_SHIFT)
+#define NCHUNKS((PAGE_SIZE - ZHDR_SIZE_ALIGNED) >> CHUNK_SHIFT)
+
+#define BUDDY_MASK (0x3)
+#define BUDDY_SHIFT2
+#define SLOTS_ALIGN(0x40)
+
 /*
  * Structures
 */
@@ -47,9 +70,19 @@ enum buddy {
FIRST,
MIDDLE,
LAST,
-   BUDDIES_MAX
+   BUDDIES_MAX = LAST
 };
 
+struct z3fold_buddy_slots {
+   /*
+* we are using BUDDY_MASK in handle_to_buddy etc. so there should
+* be enough slots to hold all possible variants
+*/
+   unsigned long slot[BUDDY_MASK + 1];
+   unsigned long pool; /* back link + flags */
+};
+#define HANDLE_FLAG_MASK   (0x03)
+
 /*
  * struct z3fold_header - z3fold page metadata occupying first chunks of each
  * z3fold page, except for HEADLESS pages
@@ -58,7 +91,7 @@ enum buddy {
  * @page_lock: per-page lock
  * @refcount:  reference count for the z3fold page
  * @work:  work_struct for page layout optimization
- * @pool:  pointer to the pool which this page belongs to
+ * @slots: pointer to the structure holding buddy slots
  * @cpu:   CPU which this page "belongs" to
  * @first_chunks:  the size of the first buddy in chunks, 0 if free
  * @middle_chunks: the size of the middle buddy in chunks, 0 if free
@@ -70,7 +103,7 @@ struct z3fold_header {
spinlock_t page_lock;
struct kref refcount;
struct work_struct work;
-   struct z3fold_pool *pool;
+   struct z3fold_buddy_slots *slots;
short cpu;
unsigned short first_chunks;
unsigned short middle_chunks;
@@ -79,28 +112,6 @@ struct z3fold_header {
unsigned short first_num:2;
 };
 
-/*
- * NCHUNKS_ORDER determines the internal allocation granularity, effectively
- * adjusting internal fragmentation.  It also determines the number of
- * freelists maintained in each pool. NCHUNKS_ORDER of 6 means that the
- * allocation granularity will be in chunks of size PAGE_SIZE/64. Some chunks
- * in the beginning of an allocated page are occupied by z3fold header, so
- * NCHUNKS will be calculated to 63 (or 62 in case CONFIG_DEBUG_SPINLOCK=y),
- * which shows the max number of free chunks in z3fold page, also there will
- * be 63, or 62, respectively, freelists per pool.
- */
-#define NCHUNKS_ORDER  6
-
-#define CHUNK_SHIFT(PAGE_SHIFT - NCHUNKS_ORDER)
-#define CHUNK_SIZE (1 << CHUNK_SHIFT)
-#define ZHDR_SIZE_ALIGNED round_up(sizeof(struct z3fold_header), CHUNK_SIZE)
-#define ZHDR_CHUNKS(ZHDR_SIZE_ALIGNED >> CHUNK_SHIFT)
-#define TOTAL_CHUNKS   (PAGE_SIZE >> CHUNK_SHIFT)
-#define NCHUNKS((PAGE_SIZE - ZHDR_SIZE_ALIGNED) >> CHUNK_SHIFT)
-
-#define BUDDY_MASK (0x3)
-#define BUDDY_SHIFT2
-
 /**
  * struct z3fold_pool - stores metadata for each z3fold pool
  * @name:  pool name
@@ -113,6 +124,7 @@ struct z3fold_header {
  * added buddy.
  * @stale: list of pages marked for freeing
  * @pages_nr:  number of z3fold pages in the pool.
+ * @c_handle:  cache for z3fold_buddy_slots allocation
  * @ops:   pointer to a structure of user defined operations specified at
  * pool creation time.
  * @compact_wq:workqueue for page layout background optimization
@@ -130,6 +142,7 @@ struct z3fold_pool {
struct list_head lru;
struct li

[PATCHV2 1/4] z3fold: introduce helper functions

2019-04-17 Thread Vitaly Wool
This patch introduces a separate helper function for object
allocation, as well as 2 smaller helpers to add a buddy to the list
and to get a pointer to the pool from the z3fold header. No
functional changes here.

Signed-off-by: Vitaly Wool 
---
 mm/z3fold.c | 184 
 1 file changed, 100 insertions(+), 84 deletions(-)

diff --git a/mm/z3fold.c b/mm/z3fold.c
index aee9b0b8d907..7a59875d880c 100644
--- a/mm/z3fold.c
+++ b/mm/z3fold.c
@@ -255,10 +255,15 @@ static enum buddy handle_to_buddy(unsigned long handle)
return (handle - zhdr->first_num) & BUDDY_MASK;
 }
 
+static inline struct z3fold_pool *zhdr_to_pool(struct z3fold_header *zhdr)
+{
+   return zhdr->pool;
+}
+
 static void __release_z3fold_page(struct z3fold_header *zhdr, bool locked)
 {
struct page *page = virt_to_page(zhdr);
-   struct z3fold_pool *pool = zhdr->pool;
+   struct z3fold_pool *pool = zhdr_to_pool(zhdr);
 
WARN_ON(!list_empty(>buddy));
set_bit(PAGE_STALE, >private);
@@ -295,9 +300,10 @@ static void release_z3fold_page_locked_list(struct kref 
*ref)
 {
struct z3fold_header *zhdr = container_of(ref, struct z3fold_header,
   refcount);
-   spin_lock(>pool->lock);
+   struct z3fold_pool *pool = zhdr_to_pool(zhdr);
+   spin_lock(>lock);
list_del_init(>buddy);
-   spin_unlock(>pool->lock);
+   spin_unlock(>lock);
 
WARN_ON(z3fold_page_trylock(zhdr));
__release_z3fold_page(zhdr, true);
@@ -349,6 +355,23 @@ static int num_free_chunks(struct z3fold_header *zhdr)
return nfree;
 }
 
+/* Add to the appropriate unbuddied list */
+static inline void add_to_unbuddied(struct z3fold_pool *pool,
+   struct z3fold_header *zhdr)
+{
+   if (zhdr->first_chunks == 0 || zhdr->last_chunks == 0 ||
+   zhdr->middle_chunks == 0) {
+   struct list_head *unbuddied = get_cpu_ptr(pool->unbuddied);
+
+   int freechunks = num_free_chunks(zhdr);
+   spin_lock(>lock);
+   list_add(>buddy, [freechunks]);
+   spin_unlock(>lock);
+   zhdr->cpu = smp_processor_id();
+   put_cpu_ptr(pool->unbuddied);
+   }
+}
+
 static inline void *mchunk_memmove(struct z3fold_header *zhdr,
unsigned short dst_chunk)
 {
@@ -406,10 +429,8 @@ static int z3fold_compact_page(struct z3fold_header *zhdr)
 
 static void do_compact_page(struct z3fold_header *zhdr, bool locked)
 {
-   struct z3fold_pool *pool = zhdr->pool;
+   struct z3fold_pool *pool = zhdr_to_pool(zhdr);
struct page *page;
-   struct list_head *unbuddied;
-   int fchunks;
 
page = virt_to_page(zhdr);
if (locked)
@@ -430,18 +451,7 @@ static void do_compact_page(struct z3fold_header *zhdr, 
bool locked)
}
 
z3fold_compact_page(zhdr);
-   unbuddied = get_cpu_ptr(pool->unbuddied);
-   fchunks = num_free_chunks(zhdr);
-   if (fchunks < NCHUNKS &&
-   (!zhdr->first_chunks || !zhdr->middle_chunks ||
-   !zhdr->last_chunks)) {
-   /* the page's not completely free and it's unbuddied */
-   spin_lock(>lock);
-   list_add(>buddy, [fchunks]);
-   spin_unlock(>lock);
-   zhdr->cpu = smp_processor_id();
-   }
-   put_cpu_ptr(pool->unbuddied);
+   add_to_unbuddied(pool, zhdr);
z3fold_page_unlock(zhdr);
 }
 
@@ -453,6 +463,67 @@ static void compact_page_work(struct work_struct *w)
do_compact_page(zhdr, false);
 }
 
+/* returns _locked_ z3fold page header or NULL */
+static inline struct z3fold_header *__z3fold_alloc(struct z3fold_pool *pool,
+   size_t size, bool can_sleep)
+{
+   struct z3fold_header *zhdr = NULL;
+   struct page *page;
+   struct list_head *unbuddied;
+   int chunks = size_to_chunks(size), i;
+
+lookup:
+   /* First, try to find an unbuddied z3fold page. */
+   unbuddied = get_cpu_ptr(pool->unbuddied);
+   for_each_unbuddied_list(i, chunks) {
+   struct list_head *l = [i];
+
+   zhdr = list_first_entry_or_null(READ_ONCE(l),
+   struct z3fold_header, buddy);
+
+   if (!zhdr)
+   continue;
+
+   /* Re-check under lock. */
+   spin_lock(>lock);
+   l = [i];
+   if (unlikely(zhdr != list_first_entry(READ_ONCE(l),
+   struct z3fold_header, buddy)) ||
+   !z3fold_page_trylock(zhdr)) {
+   spin_unlock(>lock);
+   zhdr = NULL;
+   

[PATCHv2 0/4] z3fold: support page migration

2019-04-17 Thread Vitaly Wool
This patchset implements page migration support and slightly better
buddy search. To implement page migration support, z3fold has to move
away from the current scheme of handle encoding. i. e. stop encoding
page address in handles. Instead, a small per-page structure is created
which will contain actual addresses for z3fold objects, while pointers
to fields of that structure will be used as handles.

Thus, it will be possible to change the underlying addresses to reflect
page migration.

To support migration itself, 3 callbacks will be implemented:
1: isolation callback: z3fold_page_isolate(): try to isolate
the page by removing it from all lists. Pages scheduled for some
activity and mapped pages will not be isolated. Return true if
isolation was successful or false otherwise
2: migration callback: z3fold_page_migrate(): re-check critical
conditions and migrate page contents to the new page provided by the
system. Returns 0 on success or negative error code otherwise
3: putback callback: z3fold_page_putback(): put back the page
if z3fold_page_migrate() for it failed permanently (i. e. not with
-EAGAIN code).

To make sure an isolated page doesn't get freed, its kref is incremented
in z3fold_page_isolate() and decremented during post-migration
compaction, if migration was successful, or by z3fold_page_putback() in
the other case.
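
The reference counting described above balances out as follows (call-flow
sketch, not literal code):

z3fold_page_isolate(page)
  kref_get(&zhdr->refcount)                     /* page pinned while isolated */

/* migration succeeded */
z3fold_page_migrate(mapping, newpage, page, mode)
  ... post-migration compaction on newpage ...
    kref_put(&new_zhdr->refcount, ...)          /* extra reference dropped */

/* migration failed permanently */
z3fold_page_putback(page)
  kref_put(&zhdr->refcount, release_z3fold_page_locked)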

Since the new handle encoding scheme implies slight memory consumption
increase, better buddy search (which decreases memory consumption) is
included in this patchset.

Vitaly Wool (4):
  z3fold: introduce helper functions
  z3fold: improve compression by extending search
  z3fold: add structure for buddy handles
  z3fold: support page migration

 mm/z3fold.c |  638 
++-
 1 file changed, 508 insertions(+), 130 deletions(-)


Re: [PATCH 0/4] z3fold: support page migration

2019-04-17 Thread Vitaly Wool
On Wed, Apr 17, 2019 at 01:18, Andrew Morton wrote:
>
> On Thu, 11 Apr 2019 17:32:12 +0200 Vitaly Wool  wrote:
>
> > This patchset implements page migration support and slightly better
> > buddy search. To implement page migration support, z3fold has to move
> > away from the current scheme of handle encoding. i. e. stop encoding
> > page address in handles. Instead, a small per-page structure is created
> > which will contain actual addresses for z3fold objects, while pointers
> > to fields of that structure will be used as handles.
>
> Can you please help find a reviewer for this work?
>
> For some reason I'm seeing a massive number of rejects when trying to
> apply these.  It looks like your mail client performed some sort of
> selective space-stuffing.  I suggest you email a patch to yourself,
> check that the result applies properly.


Sorry about that. You can never be sure when you work with
Thunderbird. I checked that the tabs were not converted to spaces, but
Thunderbird managed to add an extra space at the beginning of each
unchanged line of the patch.

I'll just do a v2 patchset today.


[PATCH 4/4] z3fold: support page migration

2019-04-11 Thread Vitaly Wool

Now that we are not using page address in handles directly, we
can make z3fold pages movable to decrease the memory fragmentation
z3fold may create over time.

This patch starts advertising non-headless z3fold pages as movable
and uses the existing kernel infrastructure to implement moving of
such pages per memory management subsystem's request. It thus
implements 3 required callbacks for page migration:

* isolation callback: z3fold_page_isolate(): try to isolate the
page by removing it from all lists. Pages scheduled for some activity
and mapped pages will not be isolated. Return true if isolation was
successful or false otherwise
* migration callback: z3fold_page_migrate(): re-check critical
conditions and migrate page contents to the new page provided by the
memory subsystem. Returns 0 on success or negative error code
otherwise
* putback callback: z3fold_page_putback(): put back the page if
z3fold_page_migrate() for it failed permanently (i. e. not with
-EAGAIN code).

Signed-off-by: Vitaly Wool 
---
 mm/z3fold.c | 241 +---
 1 file changed, 231 insertions(+), 10 deletions(-)

diff --git a/mm/z3fold.c b/mm/z3fold.c
index bebc10083f1c..d9eabfdad0fe 100644
--- a/mm/z3fold.c
+++ b/mm/z3fold.c
@@ -24,10 +24,18 @@
 
 #include 

 #include 
+#include 
+#include 
 #include 
 #include 
 #include 
+#include 
+#include 
+#include 
+#include 
 #include 
+#include 
+#include 
 #include 
 #include 
 #include 
@@ -97,6 +105,7 @@ struct z3fold_buddy_slots {
  * @middle_chunks: the size of the middle buddy in chunks, 0 if free
  * @last_chunks:   the size of the last buddy in chunks, 0 if free
  * @first_num: the starting number (for the first handle)
+ * @mapped_count:  the number of objects currently mapped
  */
 struct z3fold_header {
struct list_head buddy;
@@ -110,6 +119,7 @@ struct z3fold_header {
unsigned short last_chunks;
unsigned short start_middle;
unsigned short first_num:2;
+   unsigned short mapped_count:2;
 };
 
 /**

@@ -130,6 +140,7 @@ struct z3fold_header {
  * @compact_wq:workqueue for page layout background optimization
  * @release_wq:workqueue for safe page release
  * @work:  work_struct for safe page release
+ * @inode: inode for z3fold pseudo filesystem
  *
  * This structure is allocated at pool creation time and maintains metadata
  * pertaining to a particular z3fold pool.
@@ -149,6 +160,7 @@ struct z3fold_pool {
struct workqueue_struct *compact_wq;
struct workqueue_struct *release_wq;
struct work_struct work;
+   struct inode *inode;
 };
 
 /*

@@ -227,6 +239,59 @@ static inline void free_handle(unsigned long handle)
}
 }
 
+static struct dentry *z3fold_do_mount(struct file_system_type *fs_type,

+   int flags, const char *dev_name, void *data)
+{
+   static const struct dentry_operations ops = {
+   .d_dname = simple_dname,
+   };
+
+   return mount_pseudo(fs_type, "z3fold:", NULL, , 0x33);
+}
+
+static struct file_system_type z3fold_fs = {
+   .name   = "z3fold",
+   .mount  = z3fold_do_mount,
+   .kill_sb= kill_anon_super,
+};
+
+static struct vfsmount *z3fold_mnt;
+static int z3fold_mount(void)
+{
+   int ret = 0;
+
+   z3fold_mnt = kern_mount(_fs);
+   if (IS_ERR(z3fold_mnt))
+   ret = PTR_ERR(z3fold_mnt);
+
+   return ret;
+}
+
+static void z3fold_unmount(void)
+{
+   kern_unmount(z3fold_mnt);
+}
+
+static const struct address_space_operations z3fold_aops;
+static int z3fold_register_migration(struct z3fold_pool *pool)
+{
+   pool->inode = alloc_anon_inode(z3fold_mnt->mnt_sb);
+   if (IS_ERR(pool->inode)) {
+   pool->inode = NULL;
+   return 1;
+   }
+
+   pool->inode->i_mapping->private_data = pool;
+   pool->inode->i_mapping->a_ops = _aops;
+   return 0;
+}
+
+static void z3fold_unregister_migration(struct z3fold_pool *pool)
+{
+   if (pool->inode)
+   iput(pool->inode);
+ }
+
 /* Initializes the z3fold header of a newly allocated z3fold page */
 static struct z3fold_header *init_z3fold_page(struct page *page,
struct z3fold_pool *pool)
@@ -259,8 +324,14 @@ static struct z3fold_header *init_z3fold_page(struct page 
*page,
 }
 
 /* Resets the struct page fields and frees the page */

-static void free_z3fold_page(struct page *page)
+static void free_z3fold_page(struct page *page, bool headless)
 {
+   if (!headless) {
+   lock_page(page);
+   __ClearPageMovable(page);
+   unlock_page(page);
+   }
+   ClearPagePrivate(page);
__free_page(page);
 }
 
@@ -317,12 +388,12 @@ static unsigned long encode_handle(struct z3fold_header *zhdr, enum buddy bud)

 }
 
 /* Returns the z3fold page 

[PATCH 3/4] z3fold: add structure for buddy handles

2019-04-11 Thread Vitaly Wool

For z3fold to be able to move its pages per request of the memory
subsystem, it should not use direct object addresses in handles.
Instead, it will create abstract handles (3 per page) which will
contain pointers to z3fold objects. Thus, it will be possible to
change these pointers when z3fold page is moved.

Signed-off-by: Vitaly Wool 
---
 mm/z3fold.c | 185 
 1 file changed, 145 insertions(+), 40 deletions(-)

diff --git a/mm/z3fold.c b/mm/z3fold.c
index 29a4f1249bef..bebc10083f1c 100644
--- a/mm/z3fold.c
+++ b/mm/z3fold.c
@@ -34,6 +34,29 @@
 #include 
 #include 
 
+/*

+ * NCHUNKS_ORDER determines the internal allocation granularity, effectively
+ * adjusting internal fragmentation.  It also determines the number of
+ * freelists maintained in each pool. NCHUNKS_ORDER of 6 means that the
+ * allocation granularity will be in chunks of size PAGE_SIZE/64. Some chunks
+ * in the beginning of an allocated page are occupied by z3fold header, so
+ * NCHUNKS will be calculated to 63 (or 62 in case CONFIG_DEBUG_SPINLOCK=y),
+ * which shows the max number of free chunks in z3fold page, also there will
+ * be 63, or 62, respectively, freelists per pool.
+ */
+#define NCHUNKS_ORDER  6
+
+#define CHUNK_SHIFT(PAGE_SHIFT - NCHUNKS_ORDER)
+#define CHUNK_SIZE (1 << CHUNK_SHIFT)
+#define ZHDR_SIZE_ALIGNED round_up(sizeof(struct z3fold_header), CHUNK_SIZE)
+#define ZHDR_CHUNKS(ZHDR_SIZE_ALIGNED >> CHUNK_SHIFT)
+#define TOTAL_CHUNKS   (PAGE_SIZE >> CHUNK_SHIFT)
+#define NCHUNKS((PAGE_SIZE - ZHDR_SIZE_ALIGNED) >> CHUNK_SHIFT)
+
+#define BUDDY_MASK (0x3)
+#define BUDDY_SHIFT2
+#define SLOTS_ALIGN(0x40)
+
 /*
  * Structures
 */
@@ -47,9 +70,19 @@ enum buddy {
FIRST,
MIDDLE,
LAST,
-   BUDDIES_MAX
+   BUDDIES_MAX = LAST
 };
 
+struct z3fold_buddy_slots {

+   /*
+* we are using BUDDY_MASK in handle_to_buddy etc. so there should
+* be enough slots to hold all possible variants
+*/
+   unsigned long slot[BUDDY_MASK + 1];
+   unsigned long pool; /* back link + flags */
+};
+#define HANDLE_FLAG_MASK   (0x03)
+
 /*
  * struct z3fold_header - z3fold page metadata occupying first chunks of each
  * z3fold page, except for HEADLESS pages
@@ -58,7 +91,7 @@ enum buddy {
  * @page_lock: per-page lock
  * @refcount:  reference count for the z3fold page
  * @work:  work_struct for page layout optimization
- * @pool:  pointer to the pool which this page belongs to
+ * @slots: pointer to the structure holding buddy slots
  * @cpu:   CPU which this page "belongs" to
  * @first_chunks:  the size of the first buddy in chunks, 0 if free
  * @middle_chunks: the size of the middle buddy in chunks, 0 if free
@@ -70,7 +103,7 @@ struct z3fold_header {
spinlock_t page_lock;
struct kref refcount;
struct work_struct work;
-   struct z3fold_pool *pool;
+   struct z3fold_buddy_slots *slots;
short cpu;
unsigned short first_chunks;
unsigned short middle_chunks;
@@ -79,28 +112,6 @@ struct z3fold_header {
unsigned short first_num:2;
 };
 
-/*

- * NCHUNKS_ORDER determines the internal allocation granularity, effectively
- * adjusting internal fragmentation.  It also determines the number of
- * freelists maintained in each pool. NCHUNKS_ORDER of 6 means that the
- * allocation granularity will be in chunks of size PAGE_SIZE/64. Some chunks
- * in the beginning of an allocated page are occupied by z3fold header, so
- * NCHUNKS will be calculated to 63 (or 62 in case CONFIG_DEBUG_SPINLOCK=y),
- * which shows the max number of free chunks in z3fold page, also there will
- * be 63, or 62, respectively, freelists per pool.
- */
-#define NCHUNKS_ORDER  6
-
-#define CHUNK_SHIFT(PAGE_SHIFT - NCHUNKS_ORDER)
-#define CHUNK_SIZE (1 << CHUNK_SHIFT)
-#define ZHDR_SIZE_ALIGNED round_up(sizeof(struct z3fold_header), CHUNK_SIZE)
-#define ZHDR_CHUNKS(ZHDR_SIZE_ALIGNED >> CHUNK_SHIFT)
-#define TOTAL_CHUNKS   (PAGE_SIZE >> CHUNK_SHIFT)
-#define NCHUNKS((PAGE_SIZE - ZHDR_SIZE_ALIGNED) >> CHUNK_SHIFT)
-
-#define BUDDY_MASK (0x3)
-#define BUDDY_SHIFT2
-
 /**
  * struct z3fold_pool - stores metadata for each z3fold pool
  * @name:  pool name
@@ -113,6 +124,7 @@ struct z3fold_header {
  * added buddy.
  * @stale: list of pages marked for freeing
  * @pages_nr:  number of z3fold pages in the pool.
+ * @c_handle:  cache for z3fold_buddy_slots allocation
  * @ops:   pointer to a structure of user defined operations specified at
  * pool creation time.
  * @compact_wq:workqueue for page layout background optimization
@@ -130,6 +142,7 @@ struct z3fold_pool {
struct list_head lru;
struct li

[PATCH 2/4] z3fold: improve compression by extending search

2019-04-11 Thread Vitaly Wool

The current z3fold implementation only searches this CPU's page
lists for a fitting page to put a new object into. This patch adds
quick search for very well fitting pages (i. e. those having
exactly the required number of free space) on other CPUs too,
before allocating a new page for that object.

Signed-off-by: Vitaly Wool 
---
 mm/z3fold.c | 36 
 1 file changed, 36 insertions(+)

diff --git a/mm/z3fold.c b/mm/z3fold.c
index 7a59875d880c..29a4f1249bef 100644
--- a/mm/z3fold.c
+++ b/mm/z3fold.c
@@ -522,6 +522,42 @@ static inline struct z3fold_header *__z3fold_alloc(struct 
z3fold_pool *pool,
}
put_cpu_ptr(pool->unbuddied);
 
+	if (!zhdr) {

+   int cpu;
+
+   /* look for _exact_ match on other cpus' lists */
+   for_each_online_cpu(cpu) {
+   struct list_head *l;
+
+   unbuddied = per_cpu_ptr(pool->unbuddied, cpu);
+   spin_lock(>lock);
+   l = [chunks];
+
+   zhdr = list_first_entry_or_null(READ_ONCE(l),
+   struct z3fold_header, buddy);
+
+   if (!zhdr || !z3fold_page_trylock(zhdr)) {
+   spin_unlock(>lock);
+   zhdr = NULL;
+   continue;
+   }
+   list_del_init(>buddy);
+   zhdr->cpu = -1;
+   spin_unlock(>lock);
+
+   page = virt_to_page(zhdr);
+   if (test_bit(NEEDS_COMPACTING, >private)) {
+   z3fold_page_unlock(zhdr);
+   zhdr = NULL;
+   if (can_sleep)
+   cond_resched();
+   continue;
+   }
+   kref_get(>refcount);
+   break;
+   }
+   }
+
return zhdr;
 }
 
--

2.17.1



[PATCH 1/4] z3fold: introduce helper functions

2019-04-11 Thread Vitaly Wool

This patch introduces a separate helper function for object
allocation, as well as 2 smaller helpers to add a buddy to the list
and to get a pointer to the pool from the z3fold header. No
functional changes here.

Signed-off-by: Vitaly Wool 
---
 mm/z3fold.c | 184 
 1 file changed, 100 insertions(+), 84 deletions(-)

diff --git a/mm/z3fold.c b/mm/z3fold.c
index aee9b0b8d907..7a59875d880c 100644
--- a/mm/z3fold.c
+++ b/mm/z3fold.c
@@ -255,10 +255,15 @@ static enum buddy handle_to_buddy(unsigned long handle)
return (handle - zhdr->first_num) & BUDDY_MASK;
 }
 
+static inline struct z3fold_pool *zhdr_to_pool(struct z3fold_header *zhdr)

+{
+   return zhdr->pool;
+}
+
 static void __release_z3fold_page(struct z3fold_header *zhdr, bool locked)
 {
struct page *page = virt_to_page(zhdr);
-   struct z3fold_pool *pool = zhdr->pool;
+   struct z3fold_pool *pool = zhdr_to_pool(zhdr);
 
 	WARN_ON(!list_empty(>buddy));

set_bit(PAGE_STALE, >private);
@@ -295,9 +300,10 @@ static void release_z3fold_page_locked_list(struct kref 
*ref)
 {
struct z3fold_header *zhdr = container_of(ref, struct z3fold_header,
   refcount);
-   spin_lock(>pool->lock);
+   struct z3fold_pool *pool = zhdr_to_pool(zhdr);
+   spin_lock(>lock);
list_del_init(>buddy);
-   spin_unlock(>pool->lock);
+   spin_unlock(>lock);
 
 	WARN_ON(z3fold_page_trylock(zhdr));

__release_z3fold_page(zhdr, true);
@@ -349,6 +355,23 @@ static int num_free_chunks(struct z3fold_header *zhdr)
 	return nfree;
 }
 
+/* Add to the appropriate unbuddied list */
+static inline void add_to_unbuddied(struct z3fold_pool *pool,
+				struct z3fold_header *zhdr)
+{
+	if (zhdr->first_chunks == 0 || zhdr->last_chunks == 0 ||
+			zhdr->middle_chunks == 0) {
+		struct list_head *unbuddied = get_cpu_ptr(pool->unbuddied);
+
+		int freechunks = num_free_chunks(zhdr);
+		spin_lock(&pool->lock);
+		list_add(&zhdr->buddy, &unbuddied[freechunks]);
+		spin_unlock(&pool->lock);
+		zhdr->cpu = smp_processor_id();
+		put_cpu_ptr(pool->unbuddied);
+	}
+}
+
 static inline void *mchunk_memmove(struct z3fold_header *zhdr,
 				unsigned short dst_chunk)
 {
@@ -406,10 +429,8 @@ static int z3fold_compact_page(struct z3fold_header *zhdr)
 
 static void do_compact_page(struct z3fold_header *zhdr, bool locked)
 {
-	struct z3fold_pool *pool = zhdr->pool;
+	struct z3fold_pool *pool = zhdr_to_pool(zhdr);
 	struct page *page;
-	struct list_head *unbuddied;
-	int fchunks;
 
 	page = virt_to_page(zhdr);
 	if (locked)
@@ -430,18 +451,7 @@ static void do_compact_page(struct z3fold_header *zhdr, bool locked)
 	}
 
 	z3fold_compact_page(zhdr);
-	unbuddied = get_cpu_ptr(pool->unbuddied);
-	fchunks = num_free_chunks(zhdr);
-	if (fchunks < NCHUNKS &&
-	    (!zhdr->first_chunks || !zhdr->middle_chunks ||
-			!zhdr->last_chunks)) {
-		/* the page's not completely free and it's unbuddied */
-		spin_lock(&pool->lock);
-		list_add(&zhdr->buddy, &unbuddied[fchunks]);
-		spin_unlock(&pool->lock);
-		zhdr->cpu = smp_processor_id();
-	}
-	put_cpu_ptr(pool->unbuddied);
+	add_to_unbuddied(pool, zhdr);
 	z3fold_page_unlock(zhdr);
 }
 
@@ -453,6 +463,67 @@ static void compact_page_work(struct work_struct *w)
 	do_compact_page(zhdr, false);
 }
 
+/* returns _locked_ z3fold page header or NULL */
+static inline struct z3fold_header *__z3fold_alloc(struct z3fold_pool *pool,
+						size_t size, bool can_sleep)
+{
+	struct z3fold_header *zhdr = NULL;
+	struct page *page;
+	struct list_head *unbuddied;
+	int chunks = size_to_chunks(size), i;
+
+lookup:
+	/* First, try to find an unbuddied z3fold page. */
+	unbuddied = get_cpu_ptr(pool->unbuddied);
+	for_each_unbuddied_list(i, chunks) {
+		struct list_head *l = &unbuddied[i];
+
+		zhdr = list_first_entry_or_null(READ_ONCE(l),
+					struct z3fold_header, buddy);
+
+		if (!zhdr)
+			continue;
+
+		/* Re-check under lock. */
+		spin_lock(&pool->lock);
+		l = &unbuddied[i];
+		if (unlikely(zhdr != list_first_entry(READ_ONCE(l),
+				struct z3fold_header, buddy)) ||
+		    !z3fold_page_trylock(zhdr)) {
+			spin_unlock(&pool->lock);
+			zhdr = NULL;
+			put_cpu_ptr(pool->unbuddied);
+ 

[PATCH 0/4] z3fold: support page migration

2019-04-11 Thread Vitaly Wool

This patchset implements page migration support and a slightly better
buddy search. To support page migration, z3fold has to move away from
the current handle encoding scheme, i.e. stop encoding the page address
in handles. Instead, a small per-page structure is created which
contains the actual addresses of the z3fold objects, while pointers to
the fields of that structure are used as handles.

Thus, it becomes possible to change the underlying addresses to reflect
page migration.
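
For illustration only, here is a minimal sketch of the idea behind the
new handle encoding. The structure name and field layout below are
assumptions made for this example, not a verbatim excerpt from patch 3/4:

/*
 * Hypothetical sketch of the per-page handle structure described above.
 * A handle is the address of one slot, so the slot contents can be
 * rewritten on page migration without invalidating handles that were
 * already handed out to users of the pool.
 */
struct z3fold_buddy_slots {
	unsigned long slot[4];	/* one object address per buddy position */
	unsigned long pool;	/* back-link to the owning pool, plus flags */
};

/* Encoding: store the object address and hand out the slot's address. */
static inline unsigned long encode_handle_sketch(struct z3fold_buddy_slots *slots,
						 int bud, unsigned long obj)
{
	slots->slot[bud] = obj;
	return (unsigned long)&slots->slot[bud];
}

/* Decoding: a handle dereferences to the current object address. */
static inline unsigned long handle_to_object_sketch(unsigned long handle)
{
	return *(unsigned long *)handle;
}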

To support migration itself, 3 callbacks will be implemented (a sketch
of how they can be wired up follows the list):
1: isolation callback: z3fold_page_isolate(): try to isolate the page
by removing it from all lists. Pages scheduled for some activity and
mapped pages will not be isolated. Returns true if isolation was
successful, false otherwise.
2: migration callback: z3fold_page_migrate(): re-check critical
conditions and migrate the page contents to the new page provided by
the system. Returns 0 on success or a negative error code otherwise.
3: putback callback: z3fold_page_putback(): put the page back if
z3fold_page_migrate() failed for it permanently (i.e. not with
-EAGAIN).
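
As a rough sketch only: callbacks like these are typically hooked into
the kernel's movable (non-LRU) page migration machinery of that era via
struct address_space_operations. The bodies below are empty
placeholders rather than the actual patch 4/4 implementation; only the
wiring pattern is shown:

#include <linux/fs.h>
#include <linux/migrate.h>

/* Placeholder callbacks; the real ones are added to mm/z3fold.c. */
static bool z3fold_page_isolate(struct page *page, isolate_mode_t mode)
{
	/* remove the page from the pool lists unless it is busy or mapped */
	return false;
}

static int z3fold_page_migrate(struct address_space *mapping,
			       struct page *newpage, struct page *page,
			       enum migrate_mode mode)
{
	/* copy the objects to newpage and rewrite the per-page handle slots */
	return -EAGAIN;
}

static void z3fold_page_putback(struct page *page)
{
	/* return the page to the proper pool list after a failed migration */
}

static const struct address_space_operations z3fold_aops = {
	.isolate_page	= z3fold_page_isolate,
	.migratepage	= z3fold_page_migrate,
	.putback_page	= z3fold_page_putback,
};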

To make sure an isolated page doesn't get freed, its kref is
incremented in z3fold_page_isolate() and decremented during
post-migration compaction if migration was successful, or by
z3fold_page_putback() otherwise.

Since the new handle encoding scheme implies a slight increase in
memory consumption, the better buddy search (which decreases memory
consumption) is included in this patchset.

Vitaly Wool (4):
  z3fold: introduce helper functions
  z3fold: improve compression by extending search
  z3fold: add structure for buddy handles
  z3fold: support page migration

 mm/z3fold.c |  638 ++-
 1 file changed, 508 insertions(+), 130 deletions(-)

  





Re: [PATCH] z3fold: fix wrong handling of headless pages

2018-11-08 Thread Vitaly Wool
On Thu, Nov 8, 2018 at 1:34 PM 김종석 wrote:
>
> Hi Vitaly,
> thank you for the reply.
>
> I agree that your new solution is more comprehensive and that dropping
> my patch is the simpler way.
> But I think it's not fair.
> If my previous patch was not wrong, is (my patch -> your patch) the right way?

I could apply the new patch on top of yours but that would effectively
revert most of your changes.
Would it be ok for you if I add you to Signed-off-by for the new patch instead?

~Vitaly

> I'm sorry I sent reply twice.
>
> Best regards,
> Jongseok
>
>
> > On 6/11/2018 4:48 PM, Vitaly Wool wrote:
> > Hi Jongseok,
>
> > thank you for your work, we've now got a more comprehensive solution:
> > https://lkml.org/lkml/2018/11/5/726
>
> > Would you please confirm that it works for you? Also, would you be
> >okay with dropping your patch in favor of the new one?
>
> > ~Vitaly



