Re: [GIT PULL] Modules changes for v6.7-rc1

2023-11-02 Thread Andrea Righi
On Thu, Nov 02, 2023 at 08:29:17AM +0100, Andrea Righi wrote:
> On Wed, Nov 01, 2023 at 09:21:09PM -1000, Linus Torvalds wrote:
> > On Wed, 1 Nov 2023 at 21:02, Linus Torvalds
> >  wrote:
> > >
> > > kmalloc() isn't just about "use physically contiguous allocations".
> > > It's also more memory-efficient, and a *lot* faster than vmalloc(),
> > > which has to play VM tricks.
> > 
> > I've pulled this, but I think you should do something like the
> > attached (UNTESTED!) patch.
> > 
> > Linus
> 
> Looks good to me, I'll give it a try ASAP.
> 
> -Andrea

Just tested this both with zstd and gzip module compression, all good.

You can add my:

Tested-by: Andrea Righi 

Or if you need the proper paperwork:

--

From: Andrea Righi 
Subject: [PATCH] module/decompress: use kvmalloc() consistently

We consistently switched from kmalloc() to vmalloc() in module
decompression to prevent potential memory allocation failures with large
modules; however, vmalloc() is not as memory-efficient or as fast as
kmalloc().

Since we don't know in general the size of the workspace required by the
decompression algorithm, it is more reasonable to use kvmalloc()
consistently, also considering that we don't have special memory
requirements here.

Signed-off-by: Linus Torvalds 
Tested-by: Andrea Righi 
Signed-off-by: Andrea Righi 
---
 kernel/module/decompress.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/kernel/module/decompress.c b/kernel/module/decompress.c
index 4156d59be440..474e68f0f063 100644
--- a/kernel/module/decompress.c
+++ b/kernel/module/decompress.c
@@ -100,7 +100,7 @@ static ssize_t module_gzip_decompress(struct load_info *info,
s.next_in = buf + gzip_hdr_len;
s.avail_in = size - gzip_hdr_len;
 
-   s.workspace = vmalloc(zlib_inflate_workspacesize());
+   s.workspace = kvmalloc(zlib_inflate_workspacesize(), GFP_KERNEL);
if (!s.workspace)
return -ENOMEM;
 
@@ -138,7 +138,7 @@ static ssize_t module_gzip_decompress(struct load_info *info,
 out_inflate_end:
 	zlib_inflateEnd(&s);
 out:
-   vfree(s.workspace);
+   kvfree(s.workspace);
return retval;
 }
 #elif defined(CONFIG_MODULE_COMPRESS_XZ)
@@ -241,7 +241,7 @@ static ssize_t module_zstd_decompress(struct load_info *info,
}
 
wksp_size = zstd_dstream_workspace_bound(header.windowSize);
-   wksp = vmalloc(wksp_size);
+   wksp = kvmalloc(wksp_size, GFP_KERNEL);
if (!wksp) {
retval = -ENOMEM;
goto out;
@@ -284,7 +284,7 @@ static ssize_t module_zstd_decompress(struct load_info *info,
retval = new_size;
 
  out:
-   vfree(wksp);
+   kvfree(wksp);
return retval;
 }
 #else
-- 
2.40.1




Re: [GIT PULL] Modules changes for v6.7-rc1

2023-11-02 Thread Andrea Righi
On Wed, Nov 01, 2023 at 09:21:09PM -1000, Linus Torvalds wrote:
> On Wed, 1 Nov 2023 at 21:02, Linus Torvalds
>  wrote:
> >
> > kmalloc() isn't just about "use physically contiguous allocations".
> > It's also more memory-efficient, and a *lot* faster than vmalloc(),
> > which has to play VM tricks.
> 
> I've pulled this, but I think you should do something like the
> attached (UNTESTED!) patch.
> 
> Linus

Looks good to me, I'll give it a try ASAP.

-Andrea


>  kernel/module/decompress.c | 8 ++++----
>  1 file changed, 4 insertions(+), 4 deletions(-)
> 
> diff --git a/kernel/module/decompress.c b/kernel/module/decompress.c
> index 4156d59be440..474e68f0f063 100644
> --- a/kernel/module/decompress.c
> +++ b/kernel/module/decompress.c
> @@ -100,7 +100,7 @@ static ssize_t module_gzip_decompress(struct load_info *info,
>   s.next_in = buf + gzip_hdr_len;
>   s.avail_in = size - gzip_hdr_len;
>  
> - s.workspace = vmalloc(zlib_inflate_workspacesize());
> + s.workspace = kvmalloc(zlib_inflate_workspacesize(), GFP_KERNEL);
>   if (!s.workspace)
>   return -ENOMEM;
>  
> @@ -138,7 +138,7 @@ static ssize_t module_gzip_decompress(struct load_info *info,
>  out_inflate_end:
>  	zlib_inflateEnd(&s);
>  out:
> - vfree(s.workspace);
> + kvfree(s.workspace);
>   return retval;
>  }
>  #elif defined(CONFIG_MODULE_COMPRESS_XZ)
> @@ -241,7 +241,7 @@ static ssize_t module_zstd_decompress(struct load_info *info,
>   }
>  
>   wksp_size = zstd_dstream_workspace_bound(header.windowSize);
> - wksp = vmalloc(wksp_size);
> + wksp = kvmalloc(wksp_size, GFP_KERNEL);
>   if (!wksp) {
>   retval = -ENOMEM;
>   goto out;
> @@ -284,7 +284,7 @@ static ssize_t module_zstd_decompress(struct load_info *info,
>   retval = new_size;
>  
>   out:
> - vfree(wksp);
> + kvfree(wksp);
>   return retval;
>  }
>  #else




Re: [GIT PULL] Modules changes for v6.7-rc1

2023-11-02 Thread Andrea Righi
On Wed, Nov 01, 2023 at 09:02:51PM -1000, Linus Torvalds wrote:
> On Wed, 1 Nov 2023 at 10:13, Luis Chamberlain  wrote:
> >
> > The only thing worth highlighting is that gzip moves to use vmalloc()
> > instead of kmalloc(), just as we had a fix for this for zstd in v6.6-rc1.
> 
> Actually, that's almost certainly entirely the wrong thing to do.
> 
> Unless you *know* that the allocation is large, you shouldn't use
> vmalloc(). And since kmalloc() has worked fine, you most definitely
> don't know that.
> 
> So we have 'kvmalloc()' *exactly* for this reason, which is a "use
> kmalloc, unless that is too small, then use vmalloc".
> 
> kmalloc() isn't just about "use physically contiguous allocations".
> It's also more memory-efficient, and a *lot* faster than vmalloc(),
> which has to play VM tricks.
> 
> So this "just switch to vmalloc()" is entirely wrong.
> 
>   Linus

I proposed that change mostly for consistency with the zstd case, but I
haven't experienced any issue with gzip-compressed modules (they seem to
require less memory, even with larger modules).

So, yes, it probably makes sense to drop this change for now and I can
send another patch to switch to kvmalloc() for all the decompress cases.

Thanks,
-Andrea
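
As a side note, the pattern Linus describes above ("use kmalloc, unless
that is too small, then use vmalloc") can be sketched roughly as follows.
This is illustrative only, not the actual mm/util.c implementation of
kvmalloc(); the real helper has extra GFP handling and size checks that
are omitted here:

#include <linux/slab.h>
#include <linux/vmalloc.h>

/*
 * Rough sketch of the kvmalloc() idea: try the fast, physically
 * contiguous kmalloc() first, and fall back to vmalloc() only when the
 * slab path cannot satisfy the request.
 */
static void *kvmalloc_sketch(size_t size, gfp_t flags)
{
	void *p;

	/* kmalloc() is faster and more memory-efficient for most sizes. */
	p = kmalloc(size, flags | __GFP_NOWARN);
	if (p)
		return p;

	/* Only large (or badly fragmented) requests pay the vmalloc() cost. */
	return __vmalloc(size, flags);
}

Memory obtained this way has to be released with kvfree(), which checks
whether the pointer came from vmalloc() before deciding how to free it,
which is why the patch above pairs kvmalloc() with kvfree().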



Re: [PATCH] leds: trigger: fix potential deadlock with libata

2021-03-06 Thread Andrea Righi
On Sun, Mar 07, 2021 at 10:02:32AM +0800, Boqun Feng wrote:
> On Sat, Mar 06, 2021 at 09:39:54PM +0100, Marc Kleine-Budde wrote:
> > Hello *,
> > 
> > On 02.11.2020 11:41:52, Andrea Righi wrote:
> > > We have the following potential deadlock condition:
> > > 
> > >  
> > >  WARNING: possible irq lock inversion dependency detected
> > >  5.10.0-rc2+ #25 Not tainted
> > >  
> > >  swapper/3/0 just changed the state of lock:
> > >  8880063bd618 (&host->lock){-...}-{2:2}, at: ata_bmdma_interrupt+0x27/0x200
> > >  but this lock took another, HARDIRQ-READ-unsafe lock in the past:
> > >   (&trig->leddev_list_lock){.+.?}-{2:2}
> > > 
> > >  and interrupts could create inverse lock ordering between them.
> > 
> > [...]
> > 
> > > ---
> > >  drivers/leds/led-triggers.c | 5 +++--
> > >  1 file changed, 3 insertions(+), 2 deletions(-)
> > > 
> > > diff --git a/drivers/leds/led-triggers.c b/drivers/leds/led-triggers.c
> > > index 91da90cfb11d..16d1a93a10a8 100644
> > > --- a/drivers/leds/led-triggers.c
> > > +++ b/drivers/leds/led-triggers.c
> > > @@ -378,14 +378,15 @@ void led_trigger_event(struct led_trigger *trig,
> > >   enum led_brightness brightness)
> > >  {
> > >   struct led_classdev *led_cdev;
> > > + unsigned long flags;
> > >  
> > >   if (!trig)
> > >   return;
> > >  
> > > -	read_lock(&trig->leddev_list_lock);
> > > +	read_lock_irqsave(&trig->leddev_list_lock, flags);
> > >  	list_for_each_entry(led_cdev, &trig->led_cdevs, trig_list)
> > >  		led_set_brightness(led_cdev, brightness);
> > > -	read_unlock(&trig->leddev_list_lock);
> > > +	read_unlock_irqrestore(&trig->leddev_list_lock, flags);
> > >  }
> > >  EXPORT_SYMBOL_GPL(led_trigger_event);
> > 
> > meanwhile this patch hit v5.10.x stable and caused a performance
> > degradation on our use case:
> > 
> > It's an embedded ARM system, 4x Cortex A53, with an SPI attached CAN
> > controller. CAN stands for Controller Area Network and here used to
> > connect to some automotive equipment. Over CAN an ISOTP (a CAN-specific
> > Transport Protocol) transfer is running. With this patch, we see CAN
> > frames delayed for ~6ms, the usual gap between CAN frames is 240µs.
> > 
> > Reverting this patch, restores the old performance.
> > 
> > What is the best way to solve this dilemma? Identify the critical path
> > in our use case? Is there a way we can get around the irqsave in
> > led_trigger_event()?
> > 
> 
> Probably, we can change from rwlock to rcu here, POC code as follow,
> only compile tested. Marc, could you see whether this help the
> performance on your platform? Please note that I haven't test it in a
> running kernel and I'm not that familiar with the led subsystem, so use it
> with caution ;-)

If we don't want to touch the led subsystem at all, maybe we could try to
fix the problem in libata instead: we just need to prevent calling
led_trigger_blink_oneshot() with host->lock held from ata_qc_complete(),
maybe by doing the led blinking from another context (a workqueue, for
example)?

-Andrea
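
A minimal sketch of the workqueue idea mentioned above, purely as an
illustration: this is not a tested libata change, the helper names are
made up, and the read/write flag of the disk trigger is dropped for
brevity:

#include <linux/workqueue.h>
#include <linux/leds.h>

/*
 * Illustrative sketch only: defer the disk-activity LED blink to process
 * context so that it is never triggered with host->lock held.
 */
static void ata_ledtrig_work_fn(struct work_struct *work)
{
	/* Runs in process context, with no ata locks held. */
	ledtrig_disk_activity(false);
}

static DECLARE_WORK(ata_ledtrig_work, ata_ledtrig_work_fn);

/*
 * Hypothetical replacement for the ledtrig_disk_activity() call made
 * from ata_qc_complete().
 */
static inline void ata_ledtrig_activity(void)
{
	schedule_work(&ata_ledtrig_work);
}

For reference, the fix that was actually merged in 2020 (shown further
down in this archive) went the other way and made led_trigger_event()
IRQ-safe; the workqueue idea here is an alternative meant to avoid the
irqsave cost that Marc measured.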


[PATCH v2] x86/entry: build thunk_$(BITS) only if CONFIG_PREEMPTION=y

2021-01-23 Thread Andrea Righi
With CONFIG_PREEMPTION disabled, arch/x86/entry/thunk_64.o is just an
empty object file.

With the newer binutils (tested with 2.35.90.20210113-1ubuntu1) the GNU
assembler doesn't generate a symbol table for empty object files and
objtool fails with the following error when a valid symbol table cannot
be found:

  arch/x86/entry/thunk_64.o: warning: objtool: missing symbol table

To prevent this from happening, build thunk_$(BITS).o only if
CONFIG_PREEMPTION is enabled.

BugLink: https://bugs.launchpad.net/bugs/1911359
Fixes: 320100a5ffe5 ("x86/entry: Remove the TRACE_IRQS cruft")
Signed-off-by: Andrea Righi 
---
 arch/x86/entry/Makefile   | 3 ++-
 arch/x86/entry/thunk_32.S | 2 --
 arch/x86/entry/thunk_64.S | 4 
 arch/x86/um/Makefile  | 3 ++-
 4 files changed, 4 insertions(+), 8 deletions(-)

ChangeLog (v1 -> v2):
 - do not break UML build

diff --git a/arch/x86/entry/Makefile b/arch/x86/entry/Makefile
index 08bf95dbc911..83c98dae74a6 100644
--- a/arch/x86/entry/Makefile
+++ b/arch/x86/entry/Makefile
@@ -21,12 +21,13 @@ CFLAGS_syscall_64.o += $(call cc-option,-Wno-override-init,)
 CFLAGS_syscall_32.o+= $(call cc-option,-Wno-override-init,)
 CFLAGS_syscall_x32.o   += $(call cc-option,-Wno-override-init,)
 
-obj-y  := entry_$(BITS).o thunk_$(BITS).o syscall_$(BITS).o
+obj-y  := entry_$(BITS).o syscall_$(BITS).o
 obj-y  += common.o
 
 obj-y  += vdso/
 obj-y  += vsyscall/
 
+obj-$(CONFIG_PREEMPTION)   += thunk_$(BITS).o
 obj-$(CONFIG_IA32_EMULATION)   += entry_64_compat.o syscall_32.o
 obj-$(CONFIG_X86_X32_ABI)  += syscall_x32.o
 
diff --git a/arch/x86/entry/thunk_32.S b/arch/x86/entry/thunk_32.S
index f1f96d4d8cd6..5997ec0b4b17 100644
--- a/arch/x86/entry/thunk_32.S
+++ b/arch/x86/entry/thunk_32.S
@@ -29,10 +29,8 @@ SYM_CODE_START_NOALIGN(\name)
 SYM_CODE_END(\name)
.endm
 
-#ifdef CONFIG_PREEMPTION
THUNK preempt_schedule_thunk, preempt_schedule
THUNK preempt_schedule_notrace_thunk, preempt_schedule_notrace
EXPORT_SYMBOL(preempt_schedule_thunk)
EXPORT_SYMBOL(preempt_schedule_notrace_thunk)
-#endif
 
diff --git a/arch/x86/entry/thunk_64.S b/arch/x86/entry/thunk_64.S
index ccd32877a3c4..c7cf79be7231 100644
--- a/arch/x86/entry/thunk_64.S
+++ b/arch/x86/entry/thunk_64.S
@@ -36,14 +36,11 @@ SYM_FUNC_END(\name)
_ASM_NOKPROBE(\name)
.endm
 
-#ifdef CONFIG_PREEMPTION
THUNK preempt_schedule_thunk, preempt_schedule
THUNK preempt_schedule_notrace_thunk, preempt_schedule_notrace
EXPORT_SYMBOL(preempt_schedule_thunk)
EXPORT_SYMBOL(preempt_schedule_notrace_thunk)
-#endif
 
-#ifdef CONFIG_PREEMPTION
 SYM_CODE_START_LOCAL_NOALIGN(.L_restore)
popq %r11
popq %r10
@@ -58,4 +55,3 @@ SYM_CODE_START_LOCAL_NOALIGN(.L_restore)
ret
_ASM_NOKPROBE(.L_restore)
 SYM_CODE_END(.L_restore)
-#endif
diff --git a/arch/x86/um/Makefile b/arch/x86/um/Makefile
index 77f70b969d14..3113800da63a 100644
--- a/arch/x86/um/Makefile
+++ b/arch/x86/um/Makefile
@@ -27,7 +27,8 @@ else
 
 obj-y += syscalls_64.o vdso/
 
-subarch-y = ../lib/csum-partial_64.o ../lib/memcpy_64.o ../entry/thunk_64.o
+subarch-y = ../lib/csum-partial_64.o ../lib/memcpy_64.o
+subarch-$(CONFIG_PREEMPTION) += ../entry/thunk_64.o
 
 endif
 
-- 
2.29.2



Re: [tip: x86/entry] x86/entry: Build thunk_$(BITS) only if CONFIG_PREEMPTION=y

2021-01-21 Thread Andrea Righi
On Thu, Jan 21, 2021 at 09:52:01AM +0100, Andrea Righi wrote:
> On Thu, Jan 21, 2021 at 08:49:28AM +0100, Ingo Molnar wrote:
> > 
> > * tip-bot2 for Andrea Righi  wrote:
> > 
> > > The following commit has been merged into the x86/entry branch of tip:
> > > 
> > > Commit-ID: e6d92b6680371ae1aeeb6c5eb2387fdc5d9a2c89
> > > Gitweb:
> > > https://git.kernel.org/tip/e6d92b6680371ae1aeeb6c5eb2387fdc5d9a2c89
> > > Author:Andrea Righi 
> > > AuthorDate:Thu, 14 Jan 2021 12:48:35 +01:00
> > > Committer: Ingo Molnar 
> > > CommitterDate: Thu, 21 Jan 2021 08:11:52 +01:00
> > > 
> > > x86/entry: Build thunk_$(BITS) only if CONFIG_PREEMPTION=y
> > > 
> > > With CONFIG_PREEMPTION disabled, arch/x86/entry/thunk_64.o is just an
> > > empty object file.
> > > 
> > > With the newer binutils (tested with 2.35.90.20210113-1ubuntu1) the GNU
> > > assembler doesn't generate a symbol table for empty object files and
> > > objtool fails with the following error when a valid symbol table cannot
> > > be found:
> > > 
> > >   arch/x86/entry/thunk_64.o: warning: objtool: missing symbol table
> > > 
> > > To prevent this from happening, build thunk_$(BITS).o only if
> > > CONFIG_PREEMPTION is enabled.
> > > 
> > >   BugLink: https://bugs.launchpad.net/bugs/1911359
> > > 
> > > Fixes: 320100a5ffe5 ("x86/entry: Remove the TRACE_IRQS cruft")
> > > Signed-off-by: Andrea Righi 
> > > Signed-off-by: Ingo Molnar 
> > > Cc: Borislav Petkov 
> > > Link: https://lore.kernel.org/r/YAAvk0UQelq0Ae7+@xps-13-7390
> > 
> > Hm, this fails to build on UML defconfig:
> > 
> >  
> > /home/mingo/gcc/cross/lib/gcc/x86_64-linux/9.3.1/../../../../x86_64-linux/bin/ld:
> >  arch/x86/um/../entry/thunk_64.o: in function `preempt_schedule_thunk':
> >  /home/mingo/tip.cross/arch/x86/um/../entry/thunk_64.S:34: undefined 
> > reference to `preempt_schedule'
> >  
> > /home/mingo/gcc/cross/lib/gcc/x86_64-linux/9.3.1/../../../../x86_64-linux/bin/ld:
> >  arch/x86/um/../entry/thunk_64.o: in function 
> > `preempt_schedule_notrace_thunk':
> >  /home/mingo/tip.cross/arch/x86/um/../entry/thunk_64.S:35: undefined 
> > reference to `preempt_schedule_notrace'
> > 
> > Thanks,
> > 
> > Ingo
> 
> I've been able to reproduce it, I'm looking at this right now. Thanks!

I see, basically UML selects ARCH_NO_PREEMPT, but in
arch/x86/um/Makefile it explicitly includes thunk_$(BITS).o regardless.

Considering that thunk_$(BITS) only contains preemption code now, we can
probably drop it from the Makefile, or, to be more consistent with the
x86 change, we could include it only if CONFIG_PREEMPTION is enabled
(even if it would never be, because UML has ARCH_NO_PREEMPT).

If it's unlikely that preemption will ever be enabled in UML, I'd
probably go with the former; otherwise I'd go with the latter, because it
looks more consistent.

Opinions?

Thanks,
-Andrea


Re: [tip: x86/entry] x86/entry: Build thunk_$(BITS) only if CONFIG_PREEMPTION=y

2021-01-21 Thread Andrea Righi
On Thu, Jan 21, 2021 at 08:49:28AM +0100, Ingo Molnar wrote:
> 
> * tip-bot2 for Andrea Righi  wrote:
> 
> > The following commit has been merged into the x86/entry branch of tip:
> > 
> > Commit-ID: e6d92b6680371ae1aeeb6c5eb2387fdc5d9a2c89
> > Gitweb:
> > https://git.kernel.org/tip/e6d92b6680371ae1aeeb6c5eb2387fdc5d9a2c89
> > Author:Andrea Righi 
> > AuthorDate:Thu, 14 Jan 2021 12:48:35 +01:00
> > Committer: Ingo Molnar 
> > CommitterDate: Thu, 21 Jan 2021 08:11:52 +01:00
> > 
> > x86/entry: Build thunk_$(BITS) only if CONFIG_PREEMPTION=y
> > 
> > With CONFIG_PREEMPTION disabled, arch/x86/entry/thunk_64.o is just an
> > empty object file.
> > 
> > With the newer binutils (tested with 2.35.90.20210113-1ubuntu1) the GNU
> > assembler doesn't generate a symbol table for empty object files and
> > objtool fails with the following error when a valid symbol table cannot
> > be found:
> > 
> >   arch/x86/entry/thunk_64.o: warning: objtool: missing symbol table
> > 
> > To prevent this from happening, build thunk_$(BITS).o only if
> > CONFIG_PREEMPTION is enabled.
> > 
> >   BugLink: https://bugs.launchpad.net/bugs/1911359
> > 
> > Fixes: 320100a5ffe5 ("x86/entry: Remove the TRACE_IRQS cruft")
> > Signed-off-by: Andrea Righi 
> > Signed-off-by: Ingo Molnar 
> > Cc: Borislav Petkov 
> > Link: https://lore.kernel.org/r/YAAvk0UQelq0Ae7+@xps-13-7390
> 
> Hm, this fails to build on UML defconfig:
> 
>  
> /home/mingo/gcc/cross/lib/gcc/x86_64-linux/9.3.1/../../../../x86_64-linux/bin/ld:
>  arch/x86/um/../entry/thunk_64.o: in function `preempt_schedule_thunk':
>  /home/mingo/tip.cross/arch/x86/um/../entry/thunk_64.S:34: undefined 
> reference to `preempt_schedule'
>  
> /home/mingo/gcc/cross/lib/gcc/x86_64-linux/9.3.1/../../../../x86_64-linux/bin/ld:
>  arch/x86/um/../entry/thunk_64.o: in function 
> `preempt_schedule_notrace_thunk':
>  /home/mingo/tip.cross/arch/x86/um/../entry/thunk_64.S:35: undefined 
> reference to `preempt_schedule_notrace'
> 
> Thanks,
> 
>   Ingo

I've been able to reproduce it, I'm looking at this right now. Thanks!

-Andrea


[tip: x86/entry] x86/entry: Build thunk_$(BITS) only if CONFIG_PREEMPTION=y

2021-01-20 Thread tip-bot2 for Andrea Righi
The following commit has been merged into the x86/entry branch of tip:

Commit-ID: e6d92b6680371ae1aeeb6c5eb2387fdc5d9a2c89
Gitweb:
https://git.kernel.org/tip/e6d92b6680371ae1aeeb6c5eb2387fdc5d9a2c89
Author:Andrea Righi 
AuthorDate:Thu, 14 Jan 2021 12:48:35 +01:00
Committer: Ingo Molnar 
CommitterDate: Thu, 21 Jan 2021 08:11:52 +01:00

x86/entry: Build thunk_$(BITS) only if CONFIG_PREEMPTION=y

With CONFIG_PREEMPTION disabled, arch/x86/entry/thunk_64.o is just an
empty object file.

With the newer binutils (tested with 2.35.90.20210113-1ubuntu1) the GNU
assembler doesn't generate a symbol table for empty object files and
objtool fails with the following error when a valid symbol table cannot
be found:

  arch/x86/entry/thunk_64.o: warning: objtool: missing symbol table

To prevent this from happening, build thunk_$(BITS).o only if
CONFIG_PREEMPTION is enabled.

  BugLink: https://bugs.launchpad.net/bugs/1911359

Fixes: 320100a5ffe5 ("x86/entry: Remove the TRACE_IRQS cruft")
Signed-off-by: Andrea Righi 
Signed-off-by: Ingo Molnar 
Cc: Borislav Petkov 
Link: https://lore.kernel.org/r/YAAvk0UQelq0Ae7+@xps-13-7390
---
 arch/x86/entry/Makefile   | 3 ++-
 arch/x86/entry/thunk_32.S | 2 --
 arch/x86/entry/thunk_64.S | 4 
 3 files changed, 2 insertions(+), 7 deletions(-)

diff --git a/arch/x86/entry/Makefile b/arch/x86/entry/Makefile
index 08bf95d..83c98da 100644
--- a/arch/x86/entry/Makefile
+++ b/arch/x86/entry/Makefile
@@ -21,12 +21,13 @@ CFLAGS_syscall_64.o += $(call cc-option,-Wno-override-init,)
 CFLAGS_syscall_32.o+= $(call cc-option,-Wno-override-init,)
 CFLAGS_syscall_x32.o   += $(call cc-option,-Wno-override-init,)
 
-obj-y  := entry_$(BITS).o thunk_$(BITS).o syscall_$(BITS).o
+obj-y  := entry_$(BITS).o syscall_$(BITS).o
 obj-y  += common.o
 
 obj-y  += vdso/
 obj-y  += vsyscall/
 
+obj-$(CONFIG_PREEMPTION)   += thunk_$(BITS).o
 obj-$(CONFIG_IA32_EMULATION)   += entry_64_compat.o syscall_32.o
 obj-$(CONFIG_X86_X32_ABI)  += syscall_x32.o
 
diff --git a/arch/x86/entry/thunk_32.S b/arch/x86/entry/thunk_32.S
index f1f96d4..5997ec0 100644
--- a/arch/x86/entry/thunk_32.S
+++ b/arch/x86/entry/thunk_32.S
@@ -29,10 +29,8 @@ SYM_CODE_START_NOALIGN(\name)
 SYM_CODE_END(\name)
.endm
 
-#ifdef CONFIG_PREEMPTION
THUNK preempt_schedule_thunk, preempt_schedule
THUNK preempt_schedule_notrace_thunk, preempt_schedule_notrace
EXPORT_SYMBOL(preempt_schedule_thunk)
EXPORT_SYMBOL(preempt_schedule_notrace_thunk)
-#endif
 
diff --git a/arch/x86/entry/thunk_64.S b/arch/x86/entry/thunk_64.S
index 496b11e..9d543c4 100644
--- a/arch/x86/entry/thunk_64.S
+++ b/arch/x86/entry/thunk_64.S
@@ -31,14 +31,11 @@ SYM_FUNC_END(\name)
_ASM_NOKPROBE(\name)
.endm
 
-#ifdef CONFIG_PREEMPTION
THUNK preempt_schedule_thunk, preempt_schedule
THUNK preempt_schedule_notrace_thunk, preempt_schedule_notrace
EXPORT_SYMBOL(preempt_schedule_thunk)
EXPORT_SYMBOL(preempt_schedule_notrace_thunk)
-#endif
 
-#ifdef CONFIG_PREEMPTION
 SYM_CODE_START_LOCAL_NOALIGN(__thunk_restore)
popq %r11
popq %r10
@@ -53,4 +50,3 @@ SYM_CODE_START_LOCAL_NOALIGN(__thunk_restore)
ret
_ASM_NOKPROBE(__thunk_restore)
 SYM_CODE_END(__thunk_restore)
-#endif


[PATCH] x86/entry: build thunk_$(BITS) only if CONFIG_PREEMPTION=y

2021-01-14 Thread Andrea Righi
With CONFIG_PREEMPTION disabled, arch/x86/entry/thunk_64.o is just an
empty object file.

With the newer binutils (tested with 2.35.90.20210113-1ubuntu1) the GNU
assembler doesn't generate a symbol table for empty object files and
objtool fails with the following error when a valid symbol table cannot
be found:

  arch/x86/entry/thunk_64.o: warning: objtool: missing symbol table

To prevent this from happening, build thunk_$(BITS).o only if
CONFIG_PREEMPTION is enabled.

BugLink: https://bugs.launchpad.net/bugs/1911359
Fixes: 320100a5ffe5 ("x86/entry: Remove the TRACE_IRQS cruft")
Signed-off-by: Andrea Righi 
---
 arch/x86/entry/Makefile   | 3 ++-
 arch/x86/entry/thunk_32.S | 2 --
 arch/x86/entry/thunk_64.S | 4 
 3 files changed, 2 insertions(+), 7 deletions(-)

diff --git a/arch/x86/entry/Makefile b/arch/x86/entry/Makefile
index 08bf95dbc911..83c98dae74a6 100644
--- a/arch/x86/entry/Makefile
+++ b/arch/x86/entry/Makefile
@@ -21,12 +21,13 @@ CFLAGS_syscall_64.o += $(call cc-option,-Wno-override-init,)
 CFLAGS_syscall_32.o+= $(call cc-option,-Wno-override-init,)
 CFLAGS_syscall_x32.o   += $(call cc-option,-Wno-override-init,)
 
-obj-y  := entry_$(BITS).o thunk_$(BITS).o syscall_$(BITS).o
+obj-y  := entry_$(BITS).o syscall_$(BITS).o
 obj-y  += common.o
 
 obj-y  += vdso/
 obj-y  += vsyscall/
 
+obj-$(CONFIG_PREEMPTION)   += thunk_$(BITS).o
 obj-$(CONFIG_IA32_EMULATION)   += entry_64_compat.o syscall_32.o
 obj-$(CONFIG_X86_X32_ABI)  += syscall_x32.o
 
diff --git a/arch/x86/entry/thunk_32.S b/arch/x86/entry/thunk_32.S
index f1f96d4d8cd6..5997ec0b4b17 100644
--- a/arch/x86/entry/thunk_32.S
+++ b/arch/x86/entry/thunk_32.S
@@ -29,10 +29,8 @@ SYM_CODE_START_NOALIGN(\name)
 SYM_CODE_END(\name)
.endm
 
-#ifdef CONFIG_PREEMPTION
THUNK preempt_schedule_thunk, preempt_schedule
THUNK preempt_schedule_notrace_thunk, preempt_schedule_notrace
EXPORT_SYMBOL(preempt_schedule_thunk)
EXPORT_SYMBOL(preempt_schedule_notrace_thunk)
-#endif
 
diff --git a/arch/x86/entry/thunk_64.S b/arch/x86/entry/thunk_64.S
index ccd32877a3c4..c7cf79be7231 100644
--- a/arch/x86/entry/thunk_64.S
+++ b/arch/x86/entry/thunk_64.S
@@ -36,14 +36,11 @@ SYM_FUNC_END(\name)
_ASM_NOKPROBE(\name)
.endm
 
-#ifdef CONFIG_PREEMPTION
THUNK preempt_schedule_thunk, preempt_schedule
THUNK preempt_schedule_notrace_thunk, preempt_schedule_notrace
EXPORT_SYMBOL(preempt_schedule_thunk)
EXPORT_SYMBOL(preempt_schedule_notrace_thunk)
-#endif
 
-#ifdef CONFIG_PREEMPTION
 SYM_CODE_START_LOCAL_NOALIGN(.L_restore)
popq %r11
popq %r10
@@ -58,4 +55,3 @@ SYM_CODE_START_LOCAL_NOALIGN(.L_restore)
ret
_ASM_NOKPROBE(.L_restore)
 SYM_CODE_END(.L_restore)
-#endif
-- 
2.29.2



[PATCH] ring-buffer: set the right timestamp in the slow path of __rb_reserve_next()

2020-11-28 Thread Andrea Righi
In the slow path of __rb_reserve_next() one or more nested events can
happen between evaluating the timestamp delta of the current event and
updating write_stamp via local_cmpxchg(); in this case the delta is not
valid anymore and it should be set to 0 (same timestamp as the
interrupting event), since the event that we are currently processing is
not the last event in the buffer.

Link: https://lwn.net/Articles/831207
Fixes: a389d86f7fd0 ("ring-buffer: Have nested events still record running time stamp")
Signed-off-by: Andrea Righi 
---
 kernel/trace/ring_buffer.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index dc83b3fa9fe7..5e30e0cdb6ce 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -3287,11 +3287,11 @@ __rb_reserve_next(struct ring_buffer_per_cpu *cpu_buffer,
 		ts = rb_time_stamp(cpu_buffer->buffer);
 		barrier();
  /*E*/	if (write == (local_read(&tail_page->write) & RB_WRITE_MASK) &&
-		    info->after < ts) {
+		    info->after < ts &&
+		    rb_time_cmpxchg(&cpu_buffer->write_stamp,
+				    info->after, info->ts)) {
 			/* Nothing came after this event between C and E */
 			info->delta = ts - info->after;
-			(void)rb_time_cmpxchg(&cpu_buffer->write_stamp,
-					      info->after, info->ts);
 			info->ts = ts;
 		} else {
 			/*
-- 
2.29.2



Re: [PATCH] leds: trigger: fix potential deadlock with libata

2020-11-25 Thread Andrea Righi
On Wed, Nov 25, 2020 at 03:15:18PM +0100, Andrea Righi wrote:
...
> > I'd hate to see this in stable 3 days after Linus merges it...
> > 
> > Do these need _irqsave, too?
> > 
> > drivers/leds/led-triggers.c:   read_lock(&trig->leddev_list_lock);
> > drivers/leds/led-triggers.c:   read_unlock(&trig->leddev_list_lock);
> > drivers/leds/led-triggers.c:   read_lock(&trig->leddev_list_lock);
> > drivers/leds/led-triggers.c:   read_unlock(&trig->leddev_list_lock);
> > 
> > Best regards,
> 
> I think also led_trigger_blink_setup() needs to use irqsave/irqrestore,
> in fact:
> 
> $ git grep "led_trigger_blink("
> drivers/leds/led-triggers.c:void led_trigger_blink(struct led_trigger *trig,
> drivers/power/supply/power_supply_leds.c:   
> led_trigger_blink(psy->charging_blink_full_solid_trig,
> include/linux/leds.h:void led_trigger_blink(struct led_trigger *trigger, 
> unsigned long *delay_on,
> include/linux/leds.h:static inline void led_trigger_blink(struct led_trigger 
> *trigger,
> 
> power_supply_leds.c is using led_trigger_blink() from a workqueue
> context, so potentially the same deadlock condition can also happen.
> 
> Let me know if you want me to send a new patch to include also this
> case.

Just sent (and tested) a v2 of this patch that changes also
led_trigger_blink_setup().

-Andrea


[PATCH v2] leds: trigger: fix potential deadlock with libata

2020-11-25 Thread Andrea Righi
driver_probe_device+0xe9/0x160
device_driver_attach+0xb2/0xc0
__driver_attach+0x91/0x150
bus_for_each_dev+0x81/0xc0
driver_attach+0x1e/0x20
bus_add_driver+0x138/0x1f0
driver_register+0x91/0xf0
__pci_register_driver+0x73/0x80
piix_init+0x1e/0x2e
do_one_initcall+0x5f/0x2d0
kernel_init_freeable+0x26f/0x2cf
kernel_init+0xe/0x113
ret_from_fork+0x1f/0x30
  }
  ... key  at: [] __key.6+0x0/0x10
  ... acquired at:
__lock_acquire+0x9da/0x2370
lock_acquire+0x15f/0x420
_raw_spin_lock_irqsave+0x52/0xa0
ata_bmdma_interrupt+0x27/0x200
__handle_irq_event_percpu+0xd5/0x2b0
handle_irq_event+0x57/0xb0
handle_edge_irq+0x8c/0x230
asm_call_irq_on_stack+0xf/0x20
common_interrupt+0x100/0x1c0
asm_common_interrupt+0x1e/0x40
native_safe_halt+0xe/0x10
arch_cpu_idle+0x15/0x20
default_idle_call+0x59/0x1c0
do_idle+0x22c/0x2c0
cpu_startup_entry+0x20/0x30
start_secondary+0x11d/0x150
secondary_startup_64_no_verify+0xa6/0xab

This lockdep splat is reported after:
commit e918188611f0 ("locking: More accurate annotations for read_lock()")

To clarify:
 - read-locks are recursive only in interrupt context (when
   in_interrupt() returns true)
 - after acquiring host->lock on CPU1, another cpu (i.e. CPU2) may call
   write_lock(&trig->leddev_list_lock), which would be blocked by CPU0,
   since CPU0 holds trig->leddev_list_lock in read-mode
 - when CPU1 (ata_qc_complete()) tries to read-lock
   trig->leddev_list_lock, it would be blocked by the write-lock waiter
   on CPU2 (because we are not in interrupt context, so the read-lock is
   not recursive)
 - at this point if an interrupt happens on CPU0 and
   ata_bmdma_interrupt() is executed it will try to acquire host->lock,
   that is held by CPU1, that is currently blocked by CPU2, so:

   * CPU0 blocked by CPU1
   * CPU1 blocked by CPU2
   * CPU2 blocked by CPU0

 *** DEADLOCK ***

The deadlock scenario is better represented by the following schema
(thanks to Boqun Feng  for the schema and the
detailed explanation of the deadlock condition):

 CPU 0:                      CPU 1:                        CPU 2:
 ------                      ------                        ------
 led_trigger_event():
   read_lock(&trig->leddev_list_lock);
                             ata_hsm_qc_complete():
                               spin_lock_irqsave(&host->lock);
                                                           write_lock(&trig->leddev_list_lock);
                               ata_port_freeze():
                                 ata_do_link_abort():
                                   ata_qc_complete():
                                     ledtrig_disk_activity():
                                       led_trigger_blink_oneshot():
                                         read_lock(&trig->leddev_list_lock);
                                         // ^ not in in_interrupt() context, so could get blocked by CPU 2
 <interrupt>
   ata_bmdma_interrupt():
     spin_lock_irqsave(&host->lock);

Fix by using read_lock_irqsave/irqrestore() in led_trigger_event(), so
that no interrupt can happen in between, preventing the deadlock
condition.

Apply the same change to led_trigger_blink_setup() as well, since the
same deadlock scenario can also happen in power_supply_update_bat_leds()
-> led_trigger_blink() -> led_trigger_blink_setup() (workqueue context),
and potentially prevent other similar usages.

Link: https://lore.kernel.org/lkml/20201101092614.GB3989@xps-13-7390/
Fixes: eb25cb9956cc ("leds: convert IDE trigger to common disk trigger")
Signed-off-by: Andrea Righi 
---
 drivers/leds/led-triggers.c | 10 ++++++----
 1 file changed, 6 insertions(+), 4 deletions(-)

Changelog (v1 -> v2):
 - use _irqsave/irqsrestore also in led_trigger_blink_setup()

diff --git a/drivers/leds/led-triggers.c b/drivers/leds/led-triggers.c
index 91da90cfb11d..4e7b78a84149 100644
--- a/drivers/leds/led-triggers.c
+++ b/drivers/leds/led-triggers.c
@@ -378,14 +378,15 @@ void led_trigger_event(struct led_trigger *trig,
enum led_brightness brightness)
 {
struct led_classdev *led_cdev;
+   unsigned long flags;
 
if (!trig)
return;
 
-	read_lock(&trig->leddev_list_lock);
+	read_lock_irqsave(&trig->leddev_list_lock, flags);
 	list_for_each_entry(led_cdev, &trig->led_cdevs, trig_list)
 		led_set_brightness(led_cdev, brightness);
-	read_unlock(&trig->leddev_list_lock);
+	read_unlock_irqrestore(&trig->leddev_list_lock, flags);
 }
 EXPORT_SYMBOL_GPL(led_trigger_event);
 
@@ -396,11 +397,12 @@ static void led_trigger_blink_setup(struct led_trigger *trig,
   

Re: [PATCH] leds: trigger: fix potential deadlock with libata

2020-11-25 Thread Andrea Righi
nterrupt+0x100/0x1c0
> >  asm_common_interrupt+0x1e/0x40
> >  native_safe_halt+0xe/0x10
> >  arch_cpu_idle+0x15/0x20
> >  default_idle_call+0x59/0x1c0
> >  do_idle+0x22c/0x2c0
> >  cpu_startup_entry+0x20/0x30
> >  start_secondary+0x11d/0x150
> >  secondary_startup_64_no_verify+0xa6/0xab
> > INITIAL USE at:
> > lock_acquire+0x15f/0x420
> > _raw_spin_lock_irqsave+0x52/0xa0
> > ata_dev_init+0x54/0xe0
> > ata_link_init+0x8b/0xd0
> > ata_port_alloc+0x1f1/0x210
> > ata_host_alloc+0xf1/0x130
> > ata_host_alloc_pinfo+0x14/0xb0
> > ata_pci_sff_prepare_host+0x41/0xa0
> > ata_pci_bmdma_prepare_host+0x14/0x30
> > piix_init_one+0x21f/0x600
> > local_pci_probe+0x48/0x80
> > pci_device_probe+0x105/0x1c0
> > really_probe+0x221/0x490
> > driver_probe_device+0xe9/0x160
> > device_driver_attach+0xb2/0xc0
> > __driver_attach+0x91/0x150
> > bus_for_each_dev+0x81/0xc0
> > driver_attach+0x1e/0x20
> > bus_add_driver+0x138/0x1f0
> > driver_register+0x91/0xf0
> > __pci_register_driver+0x73/0x80
> > piix_init+0x1e/0x2e
> > do_one_initcall+0x5f/0x2d0
> > kernel_init_freeable+0x26f/0x2cf
> > kernel_init+0xe/0x113
> > ret_from_fork+0x1f/0x30
> >   }
> >   ... key  at: [] __key.6+0x0/0x10
> >   ... acquired at:
> > __lock_acquire+0x9da/0x2370
> > lock_acquire+0x15f/0x420
> > _raw_spin_lock_irqsave+0x52/0xa0
> > ata_bmdma_interrupt+0x27/0x200
> > __handle_irq_event_percpu+0xd5/0x2b0
> > handle_irq_event+0x57/0xb0
> > handle_edge_irq+0x8c/0x230
> > asm_call_irq_on_stack+0xf/0x20
> > common_interrupt+0x100/0x1c0
> > asm_common_interrupt+0x1e/0x40
> > native_safe_halt+0xe/0x10
> > arch_cpu_idle+0x15/0x20
> > default_idle_call+0x59/0x1c0
> > do_idle+0x22c/0x2c0
> > cpu_startup_entry+0x20/0x30
> > start_secondary+0x11d/0x150
> > secondary_startup_64_no_verify+0xa6/0xab
> > 
> > This lockdep splat is reported after:
> > commit e918188611f0 ("locking: More accurate annotations for read_lock()")
> > 
> > To clarify:
> >  - read-locks are recursive only in interrupt context (when
> >in_interrupt() returns true)
> >  - after acquiring host->lock in CPU1, another cpu (i.e. CPU2) may call
> >write_lock(>leddev_list_lock) that would be blocked by CPU0
> >that holds trig->leddev_list_lock in read-mode
> >  - when CPU1 (ata_ac_complete()) tries to read-lock
> >trig->leddev_list_lock, it would be blocked by the write-lock waiter
> >on CPU2 (because we are not in interrupt context, so the read-lock is
> >not recursive)
> >  - at this point if an interrupt happens on CPU0 and
> >ata_bmdma_interrupt() is executed it will try to acquire host->lock,
> >that is held by CPU1, that is currently blocked by CPU2, so:
> > 
> >* CPU0 blocked by CPU1
> >* CPU1 blocked by CPU2
> >* CPU2 blocked by CPU0
> > 
> >  *** DEADLOCK ***
> > 
> > The deadlock scenario is better represented by the following schema
> > (thanks to Boqun Feng  for the schema and the
> > detailed explanation of the deadlock condition):
> > 
> >  CPU 0:  CPU 1:CPU 2:
> >  -   - -
> >  led_trigger_event():
> >read_lock(>leddev_list_lock);
> > 
> > ata_hsm_qc_complete():
> >   spin_lock_irqsave(>lock);
> > 
> > write_lock(>leddev_list_lock);
> >   ata_port_freeze():
> > ata_do_link_abort():
> >   ata_qc_complete():
> > ledtrig_disk_activity():
> >

Re: lockdep: possible irq lock inversion dependency detected (trig->leddev_list_lock)

2020-11-05 Thread Andrea Righi
On Mon, Nov 02, 2020 at 10:09:28AM +0100, Andrea Righi wrote:
> On Mon, Nov 02, 2020 at 09:56:58AM +0100, Pavel Machek wrote:
> > Hi!
> > 
> > > > > I'm getting the following lockdep splat (see below).
> > > > > 
> > > > > Apparently this warning starts to be reported after applying:
> > > > > 
> > > > >  e918188611f0 ("locking: More accurate annotations for read_lock()")
> > > > > 
> > > > > It looks like a false positive to me, but it made me think a bit and
> > > > > IIUC there can be still a potential deadlock, even if the deadlock
> > > > > scenario is a bit different than what lockdep is showing.
> > > > > 
> > > > > In the assumption that read-locks are recursive only in_interrupt()
> > > > > context (as stated in e918188611f0), the following scenario can still
> > > > > happen:
> > > > > 
> > > > >  CPU0 CPU1
> > > > >   
> > > > >  read_lock(>leddev_list_lock);
> > > > >   
> > > > > write_lock(>leddev_list_lock);
> > > > >  
> > > > >  kbd_bh()
> > > > >-> read_lock(>leddev_list_lock);
> > > > > 
> > > > >  *** DEADLOCK ***
> > > > > 
> > > > > The write-lock is waiting on CPU1 and the second read_lock() on CPU0
> > > > > would be blocked by the write-lock *waiter* on CPU1 => deadlock.
> > > > > 
> > > > 
> > > > No, this is not a deadlock, as a write-lock waiter only blocks
> > > > *non-recursive* readers, so since the read_lock() in kbd_bh() is called
> > > > in soft-irq (which in_interrupt() returns true), so it's a recursive
> > > > reader and won't get blocked by the write-lock waiter.
> > > 
> > > That's right, I was missing that in_interrupt() returns true also from
> > > soft-irq context.
> > > 
> > > > > In that case we could prevent this deadlock condition using a 
> > > > > workqueue
> > > > > to call kbd_propagate_led_state() instead of calling it directly from
> > > > > kbd_bh() (even if lockdep would still report the false positive).
> > > > > 
> > > > 
> > > > The deadlock senario reported by the following splat is:
> > > > 
> > > > 
> > > > CPU 0:  CPU 1:  
> > > > CPU 2:
> > > > -   -   
> > > > -
> > > > led_trigger_event():
> > > >   read_lock(>leddev_list_lock);
> > > > 
> > > > ata_hsm_qs_complete():
> > > >   
> > > > spin_lock_irqsave(>lock);
> > > > 
> > > > write_lock(>leddev_list_lock);
> > > >   ata_port_freeze():
> > > > ata_do_link_abort():
> > > >   ata_qc_complete():
> > > > ledtrig_disk_activity():
> > > >   
> > > > led_trigger_blink_oneshot():
> > > > 
> > > > read_lock(>leddev_list_lock);
> > > > // ^ not in 
> > > > in_interrupt() context, so could get blocked by CPU 2
> > > > 
> > > >   ata_bmdma_interrupt():
> > > > spin_lock_irqsave(>lock);
> > > >   
> > > > , where CPU 0 is blocked by CPU 1 because of the spin_lock_irqsave() in
> > > > ata_bmdma_interrupt() and CPU 1 is blocked by CPU 2 because of the
> > > > read_lock() in led_trigger_blink_oneshot() and CPU 2 is blocked by CPU 0
> > > > because of an arbitrary writer on >leddev_list_lock.
> > > > 
> > > > So I don't think it's false positive, but I might miss something
> > > > obvious, because I don't know what the code here actually does ;-)
> > > 
> > > With the CPU2 part it all makes sense now and lockdep was right. :)
> > > 
> > > At this point I think we could just schedule a separate work to do the
> > > led trigger and avoid calling it with host->lock held and that should
> > > prevent the deadlock. I'll send a patch to do that.
> > 
> > Let's... not do that, unless we have no choice.
> > 
> > Would it help if leddev_list_lock used _irqsave() locking?
> 
> Using read_lock_irqsave/irqrestore() in led_trigger_event() would be
> enough to prevent the deadlock. If it's an acceptable solution I can
> send a patch (already tested it and lockdep doesn't complain :)).

Any comment on
https://lore.kernel.org/lkml/20201102104152.GG9930@xps-13-7390/?

Thanks,
-Andrea


[PATCH] leds: trigger: fix potential deadlock with libata

2020-11-02 Thread Andrea Righi
driver_probe_device+0xe9/0x160
device_driver_attach+0xb2/0xc0
__driver_attach+0x91/0x150
bus_for_each_dev+0x81/0xc0
driver_attach+0x1e/0x20
bus_add_driver+0x138/0x1f0
driver_register+0x91/0xf0
__pci_register_driver+0x73/0x80
piix_init+0x1e/0x2e
do_one_initcall+0x5f/0x2d0
kernel_init_freeable+0x26f/0x2cf
kernel_init+0xe/0x113
ret_from_fork+0x1f/0x30
  }
  ... key  at: [] __key.6+0x0/0x10
  ... acquired at:
__lock_acquire+0x9da/0x2370
lock_acquire+0x15f/0x420
_raw_spin_lock_irqsave+0x52/0xa0
ata_bmdma_interrupt+0x27/0x200
__handle_irq_event_percpu+0xd5/0x2b0
handle_irq_event+0x57/0xb0
handle_edge_irq+0x8c/0x230
asm_call_irq_on_stack+0xf/0x20
common_interrupt+0x100/0x1c0
asm_common_interrupt+0x1e/0x40
native_safe_halt+0xe/0x10
arch_cpu_idle+0x15/0x20
default_idle_call+0x59/0x1c0
do_idle+0x22c/0x2c0
cpu_startup_entry+0x20/0x30
start_secondary+0x11d/0x150
secondary_startup_64_no_verify+0xa6/0xab

This lockdep splat is reported after:
commit e918188611f0 ("locking: More accurate annotations for read_lock()")

To clarify:
 - read-locks are recursive only in interrupt context (when
   in_interrupt() returns true)
 - after acquiring host->lock on CPU1, another cpu (i.e. CPU2) may call
   write_lock(&trig->leddev_list_lock), which would be blocked by CPU0,
   since CPU0 holds trig->leddev_list_lock in read-mode
 - when CPU1 (ata_qc_complete()) tries to read-lock
   trig->leddev_list_lock, it would be blocked by the write-lock waiter
   on CPU2 (because we are not in interrupt context, so the read-lock is
   not recursive)
 - at this point if an interrupt happens on CPU0 and
   ata_bmdma_interrupt() is executed it will try to acquire host->lock,
   that is held by CPU1, that is currently blocked by CPU2, so:

   * CPU0 blocked by CPU1
   * CPU1 blocked by CPU2
   * CPU2 blocked by CPU0

 *** DEADLOCK ***

The deadlock scenario is better represented by the following schema
(thanks to Boqun Feng  for the schema and the
detailed explanation of the deadlock condition):

 CPU 0:                      CPU 1:                        CPU 2:
 ------                      ------                        ------
 led_trigger_event():
   read_lock(&trig->leddev_list_lock);
                             ata_hsm_qc_complete():
                               spin_lock_irqsave(&host->lock);
                                                           write_lock(&trig->leddev_list_lock);
                               ata_port_freeze():
                                 ata_do_link_abort():
                                   ata_qc_complete():
                                     ledtrig_disk_activity():
                                       led_trigger_blink_oneshot():
                                         read_lock(&trig->leddev_list_lock);
                                         // ^ not in in_interrupt() context, so could get blocked by CPU 2
 <interrupt>
   ata_bmdma_interrupt():
     spin_lock_irqsave(&host->lock);

Fix by using read_lock_irqsave/irqrestore() in led_trigger_event(), so
that no interrupt can happen in between, preventing the deadlock
condition.

Link: https://lore.kernel.org/lkml/20201101092614.GB3989@xps-13-7390/
Fixes: eb25cb9956cc ("leds: convert IDE trigger to common disk trigger")
Signed-off-by: Andrea Righi 
---
 drivers/leds/led-triggers.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/drivers/leds/led-triggers.c b/drivers/leds/led-triggers.c
index 91da90cfb11d..16d1a93a10a8 100644
--- a/drivers/leds/led-triggers.c
+++ b/drivers/leds/led-triggers.c
@@ -378,14 +378,15 @@ void led_trigger_event(struct led_trigger *trig,
enum led_brightness brightness)
 {
struct led_classdev *led_cdev;
+   unsigned long flags;
 
if (!trig)
return;
 
-	read_lock(&trig->leddev_list_lock);
+	read_lock_irqsave(&trig->leddev_list_lock, flags);
 	list_for_each_entry(led_cdev, &trig->led_cdevs, trig_list)
 		led_set_brightness(led_cdev, brightness);
-	read_unlock(&trig->leddev_list_lock);
+	read_unlock_irqrestore(&trig->leddev_list_lock, flags);
 }
 EXPORT_SYMBOL_GPL(led_trigger_event);
 
-- 
2.27.0



Re: lockdep: possible irq lock inversion dependency detected (trig->leddev_list_lock)

2020-11-02 Thread Andrea Righi
On Mon, Nov 02, 2020 at 09:56:58AM +0100, Pavel Machek wrote:
> Hi!
> 
> > > > I'm getting the following lockdep splat (see below).
> > > > 
> > > > Apparently this warning starts to be reported after applying:
> > > > 
> > > >  e918188611f0 ("locking: More accurate annotations for read_lock()")
> > > > 
> > > > It looks like a false positive to me, but it made me think a bit and
> > > > IIUC there can be still a potential deadlock, even if the deadlock
> > > > scenario is a bit different than what lockdep is showing.
> > > > 
> > > > In the assumption that read-locks are recursive only in_interrupt()
> > > > context (as stated in e918188611f0), the following scenario can still
> > > > happen:
> > > > 
> > > >  CPU0 CPU1
> > > >   
> > > >  read_lock(>leddev_list_lock);
> > > >   
> > > > write_lock(>leddev_list_lock);
> > > >  
> > > >  kbd_bh()
> > > >-> read_lock(>leddev_list_lock);
> > > > 
> > > >  *** DEADLOCK ***
> > > > 
> > > > The write-lock is waiting on CPU1 and the second read_lock() on CPU0
> > > > would be blocked by the write-lock *waiter* on CPU1 => deadlock.
> > > > 
> > > 
> > > No, this is not a deadlock, as a write-lock waiter only blocks
> > > *non-recursive* readers, so since the read_lock() in kbd_bh() is called
> > > in soft-irq (which in_interrupt() returns true), so it's a recursive
> > > reader and won't get blocked by the write-lock waiter.
> > 
> > That's right, I was missing that in_interrupt() returns true also from
> > soft-irq context.
> > 
> > > > In that case we could prevent this deadlock condition using a workqueue
> > > > to call kbd_propagate_led_state() instead of calling it directly from
> > > > kbd_bh() (even if lockdep would still report the false positive).
> > > > 
> > > 
> > > The deadlock senario reported by the following splat is:
> > > 
> > >   
> > >   CPU 0:  CPU 1:  
> > > CPU 2:
> > >   -   -   
> > > -
> > >   led_trigger_event():
> > > read_lock(>leddev_list_lock);
> > >   
> > >   ata_hsm_qs_complete():
> > > spin_lock_irqsave(>lock);
> > >   
> > > write_lock(>leddev_list_lock);
> > > ata_port_freeze():
> > >   ata_do_link_abort():
> > > ata_qc_complete():
> > >   ledtrig_disk_activity():
> > > led_trigger_blink_oneshot():
> > >   
> > > read_lock(>leddev_list_lock);
> > >   // ^ not in in_interrupt() 
> > > context, so could get blocked by CPU 2
> > >   
> > > ata_bmdma_interrupt():
> > >   spin_lock_irqsave(>lock);
> > > 
> > > , where CPU 0 is blocked by CPU 1 because of the spin_lock_irqsave() in
> > > ata_bmdma_interrupt() and CPU 1 is blocked by CPU 2 because of the
> > > read_lock() in led_trigger_blink_oneshot() and CPU 2 is blocked by CPU 0
> > > because of an arbitrary writer on >leddev_list_lock.
> > > 
> > > So I don't think it's false positive, but I might miss something
> > > obvious, because I don't know what the code here actually does ;-)
> > 
> > With the CPU2 part it all makes sense now and lockdep was right. :)
> > 
> > At this point I think we could just schedule a separate work to do the
> > led trigger and avoid calling it with host->lock held and that should
> > prevent the deadlock. I'll send a patch to do that.
> 
> Let's... not do that, unless we have no choice.
> 
> Would it help if leddev_list_lock used _irqsave() locking?

Using read_lock_irqsave/irqrestore() in led_trigger_event() would be
enough to prevent the deadlock. If it's an acceptable solution I can
send a patch (already tested it and lockdep doesn't complain :)).

Thanks,
-Andrea


Re: lockdep: possible irq lock inversion dependency detected (trig->leddev_list_lock)

2020-11-01 Thread Andrea Righi
On Sun, Nov 01, 2020 at 05:28:38PM +0100, Pavel Machek wrote:
> Hi!
> 
> > I'm getting the following lockdep splat (see below).
> > 
> > Apparently this warning starts to be reported after applying:
> > 
> >  e918188611f0 ("locking: More accurate annotations for read_lock()")
> > 
> > It looks like a false positive to me, but it made me think a bit and
> > IIUC there can be still a potential deadlock, even if the deadlock
> > scenario is a bit different than what lockdep is showing.
> > 
> > In the assumption that read-locks are recursive only in_interrupt()
> > context (as stated in e918188611f0), the following scenario can still
> > happen:
> > 
> >  CPU0 CPU1
> >   
> >  read_lock(>leddev_list_lock);
> >   
> > write_lock(>leddev_list_lock);
> >  
> >  kbd_bh()
> >-> read_lock(>leddev_list_lock);
> > 
> >  *** DEADLOCK ***
> > 
> > The write-lock is waiting on CPU1 and the second read_lock() on CPU0
> > would be blocked by the write-lock *waiter* on CPU1 => deadlock.
> > 
> > In that case we could prevent this deadlock condition using a workqueue
> > to call kbd_propagate_led_state() instead of calling it directly from
> > kbd_bh() (even if lockdep would still report the false positive).
> 
> console.c is already using bh to delay work from
> interrupt. But... that should not be neccessary. led_trigger_event
> should already be callable from interrupt context, AFAICT.
> 
> Could this be resolved by doing the operations directly from keyboard
> interrupt?

As pointed out by Boqun this is not a deadlock condition, because the
read_lock() called from soft-irq context is recursive (I was missing
that in_interrupt() returns true also from soft-irq context).

But the initial lockdep warning was correct, so there is still a
potential deadlock condition between trig->leddev_list_lock and
host->lock. And I think this can be prevented simply by scheduling the
led triggering part in a separate work from ata_hsm_qs_complete(), so
that led_trigger_event() won't be called with host->lock held. I'll send
a patch soon to do that.

-Andrea


Re: lockdep: possible irq lock inversion dependency detected (trig->leddev_list_lock)

2020-11-01 Thread Andrea Righi
On Sat, Oct 31, 2020 at 06:17:40PM +0800, Boqun Feng wrote:
> Hi Andrea,
> 
> On Sun, Nov 01, 2020 at 10:26:14AM +0100, Andrea Righi wrote:
> > I'm getting the following lockdep splat (see below).
> > 
> > Apparently this warning starts to be reported after applying:
> > 
> >  e918188611f0 ("locking: More accurate annotations for read_lock()")
> > 
> > It looks like a false positive to me, but it made me think a bit and
> > IIUC there can be still a potential deadlock, even if the deadlock
> > scenario is a bit different than what lockdep is showing.
> > 
> > In the assumption that read-locks are recursive only in_interrupt()
> > context (as stated in e918188611f0), the following scenario can still
> > happen:
> > 
> >  CPU0 CPU1
> >   
> >  read_lock(>leddev_list_lock);
> >   
> > write_lock(>leddev_list_lock);
> >  
> >  kbd_bh()
> >-> read_lock(>leddev_list_lock);
> > 
> >  *** DEADLOCK ***
> > 
> > The write-lock is waiting on CPU1 and the second read_lock() on CPU0
> > would be blocked by the write-lock *waiter* on CPU1 => deadlock.
> > 
> 
> No, this is not a deadlock, as a write-lock waiter only blocks
> *non-recursive* readers, so since the read_lock() in kbd_bh() is called
> in soft-irq (which in_interrupt() returns true), so it's a recursive
> reader and won't get blocked by the write-lock waiter.

That's right, I was missing that in_interrupt() returns true also from
soft-irq context.

> 
> > In that case we could prevent this deadlock condition using a workqueue
> > to call kbd_propagate_led_state() instead of calling it directly from
> > kbd_bh() (even if lockdep would still report the false positive).
> > 
> 
> The deadlock scenario reported by the following splat is:
> 
>   
>   CPU 0:                      CPU 1:                        CPU 2:
>   ------                      ------                        ------
>   led_trigger_event():
>     read_lock(&trig->leddev_list_lock);
>                               ata_hsm_qs_complete():
>                                 spin_lock_irqsave(&host->lock);
>                                                             write_lock(&trig->leddev_list_lock);
>                                 ata_port_freeze():
>                                   ata_do_link_abort():
>                                     ata_qc_complete():
>                                       ledtrig_disk_activity():
>                                         led_trigger_blink_oneshot():
>                                           read_lock(&trig->leddev_list_lock);
>                                           // ^ not in in_interrupt() context, so could get blocked by CPU 2
>   <interrupt>
>     ata_bmdma_interrupt():
>       spin_lock_irqsave(&host->lock);
> 
> , where CPU 0 is blocked by CPU 1 because of the spin_lock_irqsave() in
> ata_bmdma_interrupt() and CPU 1 is blocked by CPU 2 because of the
> read_lock() in led_trigger_blink_oneshot() and CPU 2 is blocked by CPU 0
> because of an arbitrary writer on &trig->leddev_list_lock.
> 
> So I don't think it's false positive, but I might miss something
> obvious, because I don't know what the code here actually does ;-)

With the CPU2 part it all makes sense now and lockdep was right. :)

At this point I think we could just schedule a separate work to do the
led trigger and avoid calling it with host->lock held and that should
prevent the deadlock. I'll send a patch to do that.

Thanks tons for you detailed explanation!

-Andrea


lockdep: possible irq lock inversion dependency detected (trig->leddev_list_lock)

2020-11-01 Thread Andrea Righi
I'm getting the following lockdep splat (see below).

Apparently this warning starts to be reported after applying:

 e918188611f0 ("locking: More accurate annotations for read_lock()")

It looks like a false positive to me, but it made me think a bit and
IIUC there can be still a potential deadlock, even if the deadlock
scenario is a bit different than what lockdep is showing.

In the assumption that read-locks are recursive only in_interrupt()
context (as stated in e918188611f0), the following scenario can still
happen:

 CPU0                                     CPU1
 ----                                     ----
 read_lock(&trig->leddev_list_lock);
                                          write_lock(&trig->leddev_list_lock);

 kbd_bh()
   -> read_lock(&trig->leddev_list_lock);

 *** DEADLOCK ***

The write-lock is waiting on CPU1 and the second read_lock() on CPU0
would be blocked by the write-lock *waiter* on CPU1 => deadlock.

In that case we could prevent this deadlock condition using a workqueue
to call kbd_propagate_led_state() instead of calling it directly from
kbd_bh() (even if lockdep would still report the false positive).

Can you help me to understand if this assumption is correct or if I'm
missing something?

Thanks,
-Andrea

Lockdep trace:

[1.087260] WARNING: possible irq lock inversion dependency detected
[1.087267] 5.10.0-rc1+ #18 Not tainted
[1.088829] softirqs last  enabled at (0): [] 
copy_process+0x6c7/0x1c70
[1.089662] 
[1.090284] softirqs last disabled at (0): [<>] 0x0
[1.092766] swapper/3/0 just changed the state of lock:
[1.093325] 888006394c18 (&host->lock){-...}-{2:2}, at: ata_bmdma_interrupt+0x27/0x200
[1.094190] but this lock took another, HARDIRQ-READ-unsafe lock in the past:
[1.094944]  (&trig->leddev_list_lock){.+.?}-{2:2}
[1.094946] 
[1.094946] 
[1.094946] and interrupts could create inverse lock ordering between them.
[1.094946] 
[1.096600] 
[1.096600] other info that might help us debug this:
[1.097250]  Possible interrupt unsafe locking scenario:
[1.097250] 
[1.097940]        CPU0                    CPU1
[1.098401]        ----                    ----
[1.098873]   lock(&trig->leddev_list_lock);
[1.099315]                                local_irq_disable();
[1.099932]                                lock(&host->lock);
[1.100527]                                lock(&trig->leddev_list_lock);
[1.101219]   <Interrupt>
[1.101490]     lock(&host->lock);
[1.101844] 
[1.101844]  *** DEADLOCK ***
[1.101844] 
[1.102447] no locks held by swapper/3/0.
[1.102858] 
[1.102858] the shortest dependencies between 2nd lock and 1st lock:
[1.103646]  -> (&trig->leddev_list_lock){.+.?}-{2:2} ops: 46 {
[1.104248] HARDIRQ-ON-R at:
[1.104600]   lock_acquire+0xec/0x430
[1.105120]   _raw_read_lock+0x42/0x90
[1.105839]   led_trigger_event+0x2b/0x70
[1.106348]   rfkill_global_led_trigger_worker+0x94/0xb0
[1.106970]   process_one_work+0x240/0x560
[1.107498]   worker_thread+0x58/0x3d0
[1.107984]   kthread+0x151/0x170
[1.108447]   ret_from_fork+0x1f/0x30
[1.108924] IN-SOFTIRQ-R at:
[1.109227]   lock_acquire+0xec/0x430
[1.109820]   _raw_read_lock+0x42/0x90
[1.110404]   led_trigger_event+0x2b/0x70
[1.111051]   kbd_bh+0x9e/0xc0
[1.111558]   
tasklet_action_common.constprop.0+0xe9/0x100
[1.112265]   tasklet_action+0x22/0x30
[1.112917]   __do_softirq+0xcc/0x46d
[1.113474]   run_ksoftirqd+0x3f/0x70
[1.114033]   smpboot_thread_fn+0x116/0x1f0
[1.114597]   kthread+0x151/0x170
[1.115118]   ret_from_fork+0x1f/0x30
[1.115674] SOFTIRQ-ON-R at:
[1.115987]   lock_acquire+0xec/0x430
[1.116468]   _raw_read_lock+0x42/0x90
[1.116949]   led_trigger_event+0x2b/0x70
[1.117454]   rfkill_global_led_trigger_worker+0x94/0xb0
[1.118070]   process_one_work+0x240/0x560
[1.118659]   worker_thread+0x58/0x3d0
[1.119225]   kthread+0x151/0x170
[1.119740]   ret_from_fork+0x1f/0x30
[1.120294] INITIAL READ USE at:
[1.120639]   lock_acquire+0xec/0x430
[1.121141]   _raw_read_lock+0x42/0x90
[1.121649]   led_trigger_event+0x2b/0x70
[1.122177]   
rfkill_global_led_trigger_worker+0x94/0xb0
[1.122841]   

[PATCH] ext4: properly check for dirty state in ext4_inode_datasync_dirty()

2020-10-24 Thread Andrea Righi
ext4_inode_datasync_dirty() needs to return 'true' if the inode is
dirty, 'false' otherwise, but the logic seems to be incorrectly changed
by commit aa75f4d3daae ("ext4: main fast-commit commit path").

This introduces a problem with swap files that are always failing to be
activated, showing this error in dmesg:

 [   34.406479] swapon: file is not committed

Simple test case to reproduce the problem:

  # fallocate -l 8G swapfile
  # chmod 0600 swapfile
  # mkswap swapfile
  # swapon swapfile

Fix the logic to return the proper state of the inode.

Link: https://lore.kernel.org/lkml/20201024131333.GA32124@xps-13-7390
Fixes: aa75f4d3daae ("ext4: main fast-commit commit path")
Signed-off-by: Andrea Righi 
---
 fs/ext4/inode.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 03c2253005f0..a890a17ab7e1 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -3308,8 +3308,8 @@ static bool ext4_inode_datasync_dirty(struct inode *inode)
if (journal) {
if (jbd2_transaction_committed(journal,
EXT4_I(inode)->i_datasync_tid))
-			return true;
-		return atomic_read(&EXT4_SB(inode->i_sb)->s_fc_subtid) >=
+			return false;
+		return atomic_read(&EXT4_SB(inode->i_sb)->s_fc_subtid) <
EXT4_I(inode)->i_fc_committed_subtid;
}
 
-- 
2.27.0



Re: swap file broken with ext4 fast-commit

2020-10-24 Thread Andrea Righi
On Sat, Oct 24, 2020 at 03:13:37PM +0200, Andrea Righi wrote:
> I'm getting the following error if I try to create and activate a swap
> file defined on an ext4 filesystem:
> 
>  [   34.406479] swapon: file is not committed
> 
> The swap file is created in the root filesystem (ext4 mounted with the
> following options):
> 
> $ grep " / " /proc/mounts
> /dev/vda1 / ext4 rw,relatime 0 0
> 
> No matter how long I wait or how many times I run sync, I'm still
> getting the same error and the swap file is never activated.
> 
> A git bisect shows that this issue has been introduced by the following
> commit:
> 
>  aa75f4d3daae ("ext4: main fast-commit commit path")
> 
> Simple test case to reproduce the problem:
> 
>  # fallocate -l 8G /swapfile
>  # chmod 0600 /swapfile
>  # mkswap /swapfile
>  # swapon /swapfile
> 
> Maybe we're failing to mark the inode as clean somewhere, even if the
> transaction is committed to the journal?

I think I see the problem. There's something wrong in
ext4_inode_datasync_dirty(): it looks like the logic that checks whether
the inode is dirty is inverted.

I'll test and send a patch soon.

-Andrea


swap file broken with ext4 fast-commit

2020-10-24 Thread Andrea Righi
I'm getting the following error if I try to create and activate a swap
file defined on an ext4 filesystem:

 [   34.406479] swapon: file is not committed

The swap file is created in the root filesystem (ext4 mounted with the
following options):

$ grep " / " /proc/mounts
/dev/vda1 / ext4 rw,relatime 0 0

No matter how long I wait or how many times I run sync, I'm still
getting the same error and the swap file is never activated.

A git bisect shows that this issue has been introduced by the following
commit:

 aa75f4d3daae ("ext4: main fast-commit commit path")

Simple test case to reproduce the problem:

 # fallocate -l 8G /swapfile
 # chmod 0600 /swapfile
 # mkswap /swapfile
 # swapon /swapfile

Maybe we're failing to mark the inode as clean somewhere, even if the
transaction is committed to the journal?

Thanks,
-Andrea


Re: [PATCH RFC v2] Opportunistic memory reclaim

2020-10-05 Thread Andrea Righi
On Mon, Oct 05, 2020 at 03:46:12PM +0100, Chris Down wrote:
> Andrea Righi writes:
> > senpai focuses on estimating the ideal memory requirements without
> > affecting performance, and this covers the use case of reducing the
> > memory footprint.
> > 
> > In my specific use case (hibernation) I would let the system use as much
> > memory as possible while it is active (reclaiming memory only when the
> > kernel itself decides to) and apply a more aggressive memory reclaim
> > policy when the system is mostly idle.
> 
> From this description, I don't see any reason why it needs to be implemented
> in kernel space. All of that information is available to userspace, and all
> of the knobs are there.
> 
> As it is I'm afraid of the "only when the system is mostly idle" comment,
> because it's usually after such periods that applications need to do large
> retrievals, and now they're going to be in slowpath (eg. periodic jobs).

True, but with memory.high there's the risk of thrashing some applications
badly if I don't react quickly enough to raise memory.high.

However, something I definitely want to try is to move all the memory
hogs into a cgroup, set memory.high to a very small value and then
immediately set it back to 'max'. The effect should be pretty much the
same as calling shrink_all_memory(), which is what I'm doing with my
memory.swap.reclaim.
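
As a rough sketch of that memory.high squeeze (cgroup v2 is assumed to be
mounted at /sys/fs/cgroup; the group name, the pid variable and the clamp
value are just placeholders):

  # Move the memory hogs into a dedicated cgroup:
  mkdir -p /sys/fs/cgroup/hogs
  echo $HOG_PID > /sys/fs/cgroup/hogs/cgroup.procs
  # Clamp memory.high to force reclaim, then immediately relax it:
  echo 64M > /sys/fs/cgroup/hogs/memory.high
  echo max > /sys/fs/cgroup/hogs/memory.high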

> 
> Such tradeoffs for a specific situation might be fine in userspace as a
> distribution maintainer, but codifying them in the kernel seems premature to
> me, especially for a knob which we will have to maintain forever onwards.
> 
> > I could probably implement this behavior by adjusting memory.high
> > dynamically, like senpai, but I'm worried about potential sudden large
> > allocations that may require responding faster by increasing
> > memory.high. I think the user-space triggered memory reclaim approach is
> > a safer solution from this perspective.
> 
> Have you seen Shakeel's recent "per-memcg reclaim interface" patches? I
> suspect they may help you there.

Yes, Michal pointed me to his work; it's basically the same approach
that I'm using.

I started this work with a patch that was hibernation specific
(https://lore.kernel.org/lkml/20200601160636.148346-1-andrea.ri...@canonical.com/);
this v2 was the natural evolution of my previous work and I didn't
notice that something similar had been posted in the meantime.

Anyway, I already contacted Shakeel, so we won't duplicate the efforts
in the future. :)

Thanks for your feedback!
-Andrea


Re: [PATCH RFC v2] Opportunistic memory reclaim

2020-10-05 Thread Andrea Righi
On Mon, Oct 05, 2020 at 12:25:55PM +0100, Chris Down wrote:
> Andrea Righi writes:
> > This feature has been successfully used to improve hibernation time of
> > cloud computing instances.
> > 
> > Certain cloud providers allow to run "spot instances": low-priority
> > instances that run when there are spare resources available and can be
> > stopped at any time to prioritize other more privileged instances [2].
> > 
> > Hibernation can be used to stop these low-priority instances nicely,
> > rather than losing state when the instance is shut down. Being able to
> > quickly stop low-priority instances can be critical to provide a better
> > quality of service in the overall cloud infrastructure [1].
> > 
> > The main bottleneck of hibernation is represented by the I/O generated
> > to write all the main memory (hibernation image) to a persistent
> > storage.
> > 
> > Opportunistic memory reclaim can be used to reduce the size of the
> > hibernation image in advance, for example if the system is idle for a
> > certain amount of time, so if a hibernation request happens, the kernel
> > has already saved most of the memory to the swap device (caches have
> > been dropped, etc.) and hibernation can complete quickly.
> 
> Hmm, why does this need to be implemented in kernelspace? We already have
> userspace shrinkers using memory pressure information as part of PID control
> already (eg. senpai). Using memory.high and pressure information looks a lot
> easier to reason about than having to choose an absolute number ahead of
> time and hoping it works.

senpai focuses on estimating the ideal memory requirements without
affecting performance, and this covers the use case of reducing the
memory footprint.

In my specific use case (hibernation) I would let the system use as much
memory as possible while it is active (reclaiming memory only when the
kernel itself decides to) and apply a more aggressive memory reclaim
policy when the system is mostly idle.

I could probably implement this behavior by adjusting memory.high
dynamically, like senpai, but I'm worried about potential sudden large
allocations that may require responding faster by increasing
memory.high. I think the user-space triggered memory reclaim approach is
a safer solution from this perspective.

-Andrea


Re: [PATCH RFC v2] Opportunistic memory reclaim

2020-10-05 Thread Andrea Righi
On Mon, Oct 05, 2020 at 10:35:16AM +0200, Michal Hocko wrote:
> A similar thing has been proposed recently by Shakeel
> http://lkml.kernel.org/r/20200909215752.1725525-1-shake...@google.com
> Please have a look at the follow up discussion.

Thanks for pointing this out, I wasn't aware of that patch and yes, it's
definitely similar. I'll follow up on that.

-Andrea


[PATCH RFC v2 2/2] mm: memcontrol: introduce opportunistic memory reclaim

2020-10-05 Thread Andrea Righi
Opportunistic memory reclaim allows user-space to trigger an artificial
memory pressure condition and force the system to reclaim memory (drop
caches, swap out anonymous memory, etc.).

This feature is provided by adding a new file to each memcg:
memory.swap.reclaim.

Writing a number to this file forces a memcg to reclaim memory up to
that number of bytes ("max" means as much memory as possible). Reading
from this file returns the number of bytes reclaimed in the last
opportunistic memory reclaim attempt.

Memory reclaim can be interrupted by sending a signal to the process
that is writing to memory.swap.reclaim (i.e., to set a timeout for the
whole memory reclaim run).
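
For reference, a possible usage sketch from user-space (the cgroup v2
path and the sizes are examples, not mandated by this patch):

  # reclaim up to 1G from this memcg ("max" = as much as possible)
  echo 1G > /sys/fs/cgroup/mygroup/memory.swap.reclaim
  # check how many bytes were reclaimed in the last attempt
  cat /sys/fs/cgroup/mygroup/memory.swap.reclaim
  # bound the operation to 30s: timeout(1) signals the writer, which
  # interrupts the reclaim loop
  timeout 30 sh -c 'echo max > /sys/fs/cgroup/mygroup/memory.swap.reclaim'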

Signed-off-by: Andrea Righi 
---
 Documentation/admin-guide/cgroup-v2.rst | 18 
 include/linux/memcontrol.h  |  4 ++
 mm/memcontrol.c | 59 +
 3 files changed, 81 insertions(+)

diff --git a/Documentation/admin-guide/cgroup-v2.rst 
b/Documentation/admin-guide/cgroup-v2.rst
index baa07b30845e..2850a5cb4b1e 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -1409,6 +1409,24 @@ PAGE_SIZE multiple when read back.
Swap usage hard limit.  If a cgroup's swap usage reaches this
limit, anonymous memory of the cgroup will not be swapped out.
 
+  memory.swap.reclaim
+A read-write single value file that can be used to trigger
+opportunistic memory reclaim.
+
+The string written to this file represents the amount of memory to be
+reclaimed (special value "max" means "as much memory as possible").
+
+When opportunistic memory reclaim is started the system will be put
+into an artificial memory pressure condition and memory will be
+reclaimed by dropping clean page cache pages, swapping out anonymous
+pages, etc.
+
+NOTE: it is possible to interrupt the memory reclaim sending a signal
+to the writer of this file.
+
+Reading from memory.swap.reclaim returns the amount of bytes reclaimed
+in the last attempt.
+
   memory.swap.events
A read-only flat-keyed file which exists on non-root cgroups.
The following entries are defined.  Unless specified
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index d0b036123c6a..0c90d989bdc1 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -306,6 +306,10 @@ struct mem_cgroup {
booltcpmem_active;
int tcpmem_pressure;
 
+#ifdef CONFIG_MEMCG_SWAP
+   unsigned long   nr_swap_reclaimed;
+#endif
+
 #ifdef CONFIG_MEMCG_KMEM
 /* Index in the kmem_cache->memcg_params.memcg_caches array */
int kmemcg_id;
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 6877c765b8d0..b98e9bbd61b0 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -7346,6 +7346,60 @@ static int swap_events_show(struct seq_file *m, void *v)
return 0;
 }
 
+/*
+ * Try to reclaim some memory in the system, stop when one of the following
+ * conditions occurs:
+ *  - at least "nr_pages" have been reclaimed
+ *  - no more pages can be reclaimed
+ *  - current task explicitly interrupted by a signal (e.g., user space
+ *timeout)
+ *
+ *  @nr_pages - amount of pages to be reclaimed (0 means "as many pages as
+ *  possible").
+ */
+static unsigned long
+do_mm_reclaim(struct mem_cgroup *memcg, unsigned long nr_pages)
+{
+   unsigned long nr_reclaimed = 0;
+
+   while (nr_pages > 0) {
+   unsigned long reclaimed;
+
+   if (signal_pending(current))
+   break;
+   reclaimed = __shrink_all_memory(nr_pages, memcg);
+   if (!reclaimed)
+   break;
+   nr_reclaimed += reclaimed;
+   nr_pages -= min_t(unsigned long, reclaimed, nr_pages);
+   }
+   return nr_reclaimed;
+}
+
+static ssize_t swap_reclaim_write(struct kernfs_open_file *of,
+ char *buf, size_t nbytes, loff_t off)
+{
+   struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
+   unsigned long nr_to_reclaim;
+   int err;
+
+   buf = strstrip(buf);
+   err = page_counter_memparse(buf, "max", &nr_to_reclaim);
+   if (err)
+   return err;
+   memcg->nr_swap_reclaimed = do_mm_reclaim(memcg, nr_to_reclaim);
+
+   return nbytes;
+}
+
+static u64 swap_reclaim_read(struct cgroup_subsys_state *css,
+struct cftype *cft)
+{
+   struct mem_cgroup *memcg = mem_cgroup_from_css(css);
+
+   return memcg->nr_swap_reclaimed << PAGE_SHIFT;
+}
+
 static struct cftype swap_files[] = {
{
.name = "swap.current",
@@ -7370,6 +7424,11 @@ static struct cftype swap_files[] = {
.file_offset = off

[PATCH RFC v2] Opportunistic memory reclaim

2020-10-05 Thread Andrea Righi
t attempt (per-memcg)

--------
Andrea Righi (2):
  mm: memcontrol: make shrink_all_memory() memcg aware
  mm: memcontrol: introduce opportunistic memory reclaim

 Documentation/admin-guide/cgroup-v2.rst | 18 ++
 include/linux/memcontrol.h  |  4 +++
 include/linux/swap.h|  9 -
 mm/memcontrol.c | 59 +
 mm/vmscan.c |  6 ++--
 5 files changed, 92 insertions(+), 4 deletions(-)



[PATCH RFC v2 1/2] mm: memcontrol: make shrink_all_memory() memcg aware

2020-10-05 Thread Andrea Righi
Allow specifying a memcg when calling shrink_all_memory(), to reclaim
some memory from a specific cgroup.

Moreover, make shrink_all_memory() always available, instead of
depending on CONFIG_HIBERNATION being enabled.

This is required by the opportunistic memory reclaim feature.

Signed-off-by: Andrea Righi 
---
 include/linux/swap.h | 9 -
 mm/vmscan.c  | 6 +++---
 2 files changed, 11 insertions(+), 4 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index 661046994db4..1490b09a6e6c 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -368,7 +368,14 @@ extern unsigned long mem_cgroup_shrink_node(struct 
mem_cgroup *mem,
gfp_t gfp_mask, bool noswap,
pg_data_t *pgdat,
unsigned long *nr_scanned);
-extern unsigned long shrink_all_memory(unsigned long nr_pages);
+extern unsigned long
+__shrink_all_memory(unsigned long nr_pages, struct mem_cgroup *memcg);
+
+static inline unsigned long shrink_all_memory(unsigned long nr_pages)
+{
+   return __shrink_all_memory(nr_pages, NULL);
+}
+
 extern int vm_swappiness;
 extern int remove_mapping(struct address_space *mapping, struct page *page);
 
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 466fc3144fff..ac04d5e16c42 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -3986,7 +3986,6 @@ void wakeup_kswapd(struct zone *zone, gfp_t gfp_flags, 
int order,
wake_up_interruptible(>kswapd_wait);
 }
 
-#ifdef CONFIG_HIBERNATION
 /*
  * Try to free `nr_to_reclaim' of memory, system-wide, and return the number of
  * freed pages.
@@ -3995,7 +3994,8 @@ void wakeup_kswapd(struct zone *zone, gfp_t gfp_flags, 
int order,
  * LRU order by reclaiming preferentially
  * inactive > active > active referenced > active mapped
  */
-unsigned long shrink_all_memory(unsigned long nr_to_reclaim)
+unsigned long
+__shrink_all_memory(unsigned long nr_to_reclaim, struct mem_cgroup *memcg)
 {
struct scan_control sc = {
.nr_to_reclaim = nr_to_reclaim,
@@ -4006,6 +4006,7 @@ unsigned long shrink_all_memory(unsigned long 
nr_to_reclaim)
.may_unmap = 1,
.may_swap = 1,
.hibernation_mode = 1,
+   .target_mem_cgroup = memcg,
};
struct zonelist *zonelist = node_zonelist(numa_node_id(), sc.gfp_mask);
unsigned long nr_reclaimed;
@@ -4023,7 +4024,6 @@ unsigned long shrink_all_memory(unsigned long 
nr_to_reclaim)
 
return nr_reclaimed;
 }
-#endif /* CONFIG_HIBERNATION */
 
 /*
  * This kswapd start function will be called by init and node-hot-add.
-- 
2.27.0



Re: [RFC PATCH 2/2] PM: hibernate: introduce opportunistic memory reclaim

2020-09-21 Thread Andrea Righi
On Mon, Sep 21, 2020 at 05:36:30PM +0200, Rafael J. Wysocki wrote:
...
> > > 3. It is not clear how much mm_reclaim/release is going to help.  If
> > > the preloading of the swapped-out pages uses some kind of LIFO order,
> > > and can batch multiple pages, then it might help.  Otherwise demand
> > > paging is likely to be more effective.  If the preloading does indeed
> > > help, it may be useful to explain why in the commit message.
> >
> > Swap readahead helps a lot in terms of performance if we preload all at
> > once. But I agree that for the majority of cases on-demand paging just
> > works fine.
> >
> > My specific use-case for mm_reclaim/release is to make sure a VM
> > that is just resumed is immediately "fast" by preloading the swapped-out
> > pages back to memory all at once.
> >
> > Without mm_reclaim/release I've been using the trick of running swapoff
> > followed by a swapon to force all the pages back to memory, but it's
> > kinda ugly and I was looking for a better way to do this. I've been
> > trying also the ptrace() + reading all the VMAs via /proc/pid/mem, it
> > works, but it's not as fast as swapoff+swapon or mm_reclaim/release.
> >
> > I'll report performance numbers of mm_reclaim/release vs ptrace() +
> > /proc/pid/mem in the next version of this patch.
> 
> Sorry for the huge delay.
> 
> I'm wondering what your vision regarding the use of this mechanism in
> practice is?
> 
> In the "Testing" part of the changelog you say that "in the
> 5.7-mm_reclaim case a user-space daemon detects when the system is
> idle and triggers the opportunistic memory reclaim via
> /sys/power/mm_reclaim/run", but this may not be entirely practical,
> because hibernation is not triggered every time the system is idle.
> 
> In particular, how much time is required for the opportunistic reclaim
> to run before hibernation so as to make a significant difference?
> 
> Thanks!

Hi Rafael,

the typical use-case for this feature is to hibernate "spot" cloud
instances (low-priority instances that can be stopped at any time to
prioritize more privileged instances, see for example [1]). In this
scenario hibernation can be used as a "nicer" way to stop low priority
instances, instead of shutting them down.

Opportunistic memory reclaim doesn't really reduce the time to hibernate
overall: performance-wise, regular hibernation and hibernation w/
opportunistic reclaim require pretty much the same time.

But the advantage of opportunistic reclaim is that we can "prepare" a
system for hibernation using some idle time, so when we really need to
hibernate a low-priority instance because a high-priority instance needs
to run, hibernation can be significantly faster.

What do you think about it? Do you see a better way to achieve this
goal?

Thanks,
-Andrea

[1] https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/spot-interruptions.html


Re: crypto: aegis128: error: incompatible types when initializing type 'unsigned char' using type 'uint8x16_t'

2020-07-30 Thread Andrea Righi
On Thu, Jul 30, 2020 at 10:11:52AM -0500, Justin Forbes wrote:
> On Mon, Jul 27, 2020 at 8:05 AM Andrea Righi  
> wrote:
> >
> > I'm experiencing this build error on arm64 after updating to gcc 10:
> >
> > crypto/aegis128-neon-inner.c: In function 'crypto_aegis128_init_neon':
> > crypto/aegis128-neon-inner.c:151:3: error: incompatible types when 
> > initializing type 'unsigned char' using type 'uint8x16_t'
> >   151 |   k ^ vld1q_u8(const0),
> >   |   ^
> > crypto/aegis128-neon-inner.c:152:3: error: incompatible types when 
> > initializing type 'unsigned char' using type 'uint8x16_t'
> >   152 |   k ^ vld1q_u8(const1),
> >   |   ^
> >
> > Does anybody know if there's a fix for this already? Otherwise I'll take a
> > look at it.
> 
> 
> I hit it and have been working with Jakub on the issue.
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96377
> 
> Justin

Thanks, Justin! I'll keep an eye on this bug report.

-Andrea


crypto: aegis128: error: incompatible types when initializing type 'unsigned char' using type 'uint8x16_t'

2020-07-27 Thread Andrea Righi
I'm experiencing this build error on arm64 after updating to gcc 10:

crypto/aegis128-neon-inner.c: In function 'crypto_aegis128_init_neon':
crypto/aegis128-neon-inner.c:151:3: error: incompatible types when initializing 
type 'unsigned char' using type 'uint8x16_t'
  151 |   k ^ vld1q_u8(const0),
  |   ^
crypto/aegis128-neon-inner.c:152:3: error: incompatible types when initializing 
type 'unsigned char' using type 'uint8x16_t'
  152 |   k ^ vld1q_u8(const1),
  |   ^

Does anybody know if there's a fix for this already? Otherwise I'll take a
look at it.

Thanks,
-Andrea


[PATCH v2] xen-netfront: fix potential deadlock in xennet_remove()

2020-07-24 Thread Andrea Righi
There's a potential race in xennet_remove(); this is what the driver is
doing upon unregistering a network device:

  1. state = read bus state
  2. if state is not "Closed":
  3.request to set state to "Closing"
  4.wait for state to be set to "Closing"
  5.request to set state to "Closed"
  6.wait for state to be set to "Closed"

If the state changes to "Closed" immediately after step 1 we are stuck
forever in step 4, because the state will never go back from "Closed" to
"Closing".

Make sure to check also for state == "Closed" in step 4 to prevent the
deadlock.

Also add a 5 sec timeout any time we wait for the bus state to change,
to avoid getting stuck forever in wait_event().

Signed-off-by: Andrea Righi 
---
Changes in v2:
 - remove all dev_dbg() calls (as suggested by David Miller)

 drivers/net/xen-netfront.c | 64 +-
 1 file changed, 42 insertions(+), 22 deletions(-)

diff --git a/drivers/net/xen-netfront.c b/drivers/net/xen-netfront.c
index 482c6c8b0fb7..88280057e032 100644
--- a/drivers/net/xen-netfront.c
+++ b/drivers/net/xen-netfront.c
@@ -63,6 +63,8 @@ module_param_named(max_queues, xennet_max_queues, uint, 0644);
 MODULE_PARM_DESC(max_queues,
 "Maximum number of queues per virtual interface");
 
+#define XENNET_TIMEOUT  (5 * HZ)
+
 static const struct ethtool_ops xennet_ethtool_ops;
 
 struct netfront_cb {
@@ -1334,12 +1336,15 @@ static struct net_device *xennet_create_dev(struct 
xenbus_device *dev)
 
netif_carrier_off(netdev);
 
-   xenbus_switch_state(dev, XenbusStateInitialising);
-   wait_event(module_wq,
-  xenbus_read_driver_state(dev->otherend) !=
-  XenbusStateClosed &&
-  xenbus_read_driver_state(dev->otherend) !=
-  XenbusStateUnknown);
+   do {
+   xenbus_switch_state(dev, XenbusStateInitialising);
+   err = wait_event_timeout(module_wq,
+xenbus_read_driver_state(dev->otherend) !=
+XenbusStateClosed &&
+xenbus_read_driver_state(dev->otherend) !=
+XenbusStateUnknown, XENNET_TIMEOUT);
+   } while (!err);
+
return netdev;
 
  exit:
@@ -2139,28 +2144,43 @@ static const struct attribute_group xennet_dev_group = {
 };
 #endif /* CONFIG_SYSFS */
 
-static int xennet_remove(struct xenbus_device *dev)
+static void xennet_bus_close(struct xenbus_device *dev)
 {
-   struct netfront_info *info = dev_get_drvdata(>dev);
-
-   dev_dbg(>dev, "%s\n", dev->nodename);
+   int ret;
 
-   if (xenbus_read_driver_state(dev->otherend) != XenbusStateClosed) {
+   if (xenbus_read_driver_state(dev->otherend) == XenbusStateClosed)
+   return;
+   do {
xenbus_switch_state(dev, XenbusStateClosing);
-   wait_event(module_wq,
-  xenbus_read_driver_state(dev->otherend) ==
-  XenbusStateClosing ||
-  xenbus_read_driver_state(dev->otherend) ==
-  XenbusStateUnknown);
+   ret = wait_event_timeout(module_wq,
+  xenbus_read_driver_state(dev->otherend) ==
+  XenbusStateClosing ||
+  xenbus_read_driver_state(dev->otherend) ==
+  XenbusStateClosed ||
+  xenbus_read_driver_state(dev->otherend) ==
+  XenbusStateUnknown,
+  XENNET_TIMEOUT);
+   } while (!ret);
+
+   if (xenbus_read_driver_state(dev->otherend) == XenbusStateClosed)
+   return;
 
+   do {
xenbus_switch_state(dev, XenbusStateClosed);
-   wait_event(module_wq,
-  xenbus_read_driver_state(dev->otherend) ==
-  XenbusStateClosed ||
-  xenbus_read_driver_state(dev->otherend) ==
-  XenbusStateUnknown);
-   }
+   ret = wait_event_timeout(module_wq,
+  xenbus_read_driver_state(dev->otherend) ==
+  XenbusStateClosed ||
+  xenbus_read_driver_state(dev->otherend) ==
+  XenbusStateUnknown,
+  XENNET_TIMEOUT);
+   } while (!ret);
+}
+
+static int xennet_remove(struct xenbus_device *dev)
+{
+   struct netfront_info *info = dev_get_drvdata(>dev);
 
+   xennet_bus_close(dev);
xennet_disconnect_backend(info);
 
if (info->netdev->reg_state == NETREG_REGISTERED)
-- 
2.25.1



Re: [PATCH] xen-netfront: fix potential deadlock in xennet_remove()

2020-07-24 Thread Andrea Righi
On Thu, Jul 23, 2020 at 02:57:22PM -0700, David Miller wrote:
> From: Andrea Righi 
> Date: Wed, 22 Jul 2020 08:52:11 +0200
> 
> > +static int xennet_remove(struct xenbus_device *dev)
> > +{
> > +   struct netfront_info *info = dev_get_drvdata(>dev);
> > +
> > +   dev_dbg(>dev, "%s\n", dev->nodename);
> 
> These kinds of debugging messages provide zero context and are so much
> less useful than simply using tracepoints which are more universally
> available than printk debugging facilities.
> 
> Please remove all of the dev_dbg() calls from this patch.

I didn't add that dev_dbg() call, it's just the old code moved around,
but I agree, I'll remove that call and send a new version of this patch.

Thanks for looking at it!
-Andrea


Re: [PATCH] mm: swap: do not wait for lock_page() in unuse_pte_range()

2020-07-22 Thread Andrea Righi
On Wed, Jul 22, 2020 at 07:04:25PM +0100, Matthew Wilcox wrote:
> On Wed, Jul 22, 2020 at 07:44:36PM +0200, Andrea Righi wrote:
> > Waiting for lock_page() with mm->mmap_sem held in unuse_pte_range() can
> > lead to stalls while running swapoff (i.e., not being able to ssh into
> > the system, inability to execute simple commands like 'ps', etc.).
> > 
> > Replace lock_page() with trylock_page() and release mm->mmap_sem if we
> > fail to lock it, giving other tasks a chance to continue and prevent
> > the stall.
> 
> I think you've removed the warning at the expense of turning a stall
> into a potential livelock.
> 
> > @@ -1977,7 +1977,11 @@ static int unuse_pte_range(struct vm_area_struct 
> > *vma, pmd_t *pmd,
> > return -ENOMEM;
> > }
> >  
> > -   lock_page(page);
> > +   if (!trylock_page(page)) {
> > +   ret = -EAGAIN;
> > +   put_page(page);
> > +   goto out;
> > +   }
> 
> If you look at the patterns we have elsewhere in the MM for doing
> this kind of thing (eg truncate_inode_pages_range()), we iterate over the
> entire range, take care of the easy cases, then go back and deal with the
> hard cases later.
> 
> So that would argue for skipping any page that we can't trylock, but
> continue over at least the VMA, and quite possibly the entire MM until
> we're convinced that we have unused all of the required pages.
> 
> Another thing we could do is drop the MM semaphore _here_, sleep on this
> page until it's unlocked, then go around again.
> 
>   if (!trylock_page(page)) {
>   mmap_read_unlock(mm);
>   lock_page(page);
>   unlock_page(page);
>   put_page(page);
>   ret = -EAGAIN;
>   goto out;
>   }
> 
> (I haven't checked the call paths; maybe you can't do this because
> sometimes it's called with the mmap sem held for write)
> 
> Also, if we're trying to scale this better, there are some fun
> workloads where readers block writers who block subsequent readers
> and we shouldn't wait for I/O in swapin_readahead().  See patches like
> 6b4c9f4469819a0c1a38a0a4541337e0f9bf6c11 for more on this kind of thing.

Thanks for the review, Matthew. I'll see if I can find a better solution
following your useful hints!

-Andrea


[PATCH] mm: swap: do not wait for lock_page() in unuse_pte_range()

2020-07-22 Thread Andrea Righi
Waiting for lock_page() with mm->mmap_sem held in unuse_pte_range() can
lead to stalls while running swapoff (e.g., not being able to ssh into
the system, inability to execute simple commands like 'ps', etc.).

Replace lock_page() with trylock_page() and release mm->mmap_sem if we
fail to lock it, giving other tasks a chance to continue and prevent
the stall.

This issue can be easily reproduced by running swapoff on systems with a
large amount of RAM (>=100GB) and a lot of pages swapped out to disk. A
specific use case is to run swapoff immediately after resuming from
hibernation.

Under these conditions, and without this patch applied, the system can
be stalled for up to 15min; with this patch applied the system is always
responsive.
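
A rough outline of the reproducer (sizes, the swap file path and the
memory-hog tool are only examples; any workload that pushes a large
amount of anonymous memory to swap will do):

  fallocate -l 64G /swapfile && chmod 0600 /swapfile
  mkswap /swapfile && swapon /swapfile
  # allocate more anonymous memory than fits in RAM, touch it once,
  # then leave it idle so it gets swapped out
  stress-ng --vm 1 --vm-bytes 150% --vm-hang 0 &
  sleep 300
  # without this patch, other tasks can stall here for many minutes
  swapoff /swapfile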

Signed-off-by: Andrea Righi 
---
 mm/swapfile.c | 12 +++-
 1 file changed, 11 insertions(+), 1 deletion(-)

diff --git a/mm/swapfile.c b/mm/swapfile.c
index 987276c557d1..794935ecf82a 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -1977,7 +1977,11 @@ static int unuse_pte_range(struct vm_area_struct *vma, 
pmd_t *pmd,
return -ENOMEM;
}
 
-   lock_page(page);
+   if (!trylock_page(page)) {
+   ret = -EAGAIN;
+   put_page(page);
+   goto out;
+   }
wait_on_page_writeback(page);
ret = unuse_pte(vma, pmd, addr, entry, page);
if (ret < 0) {
@@ -2100,11 +2104,17 @@ static int unuse_mm(struct mm_struct *mm, unsigned int 
type,
struct vm_area_struct *vma;
int ret = 0;
 
+retry:
mmap_read_lock(mm);
for (vma = mm->mmap; vma; vma = vma->vm_next) {
if (vma->anon_vma) {
ret = unuse_vma(vma, type, frontswap,
fs_pages_to_unuse);
+   if (ret == -EAGAIN) {
+   mmap_read_unlock(mm);
+   cond_resched();
+   goto retry;
+   }
if (ret)
break;
}
-- 
2.25.1



[PATCH] xen-netfront: fix potential deadlock in xennet_remove()

2020-07-22 Thread Andrea Righi
There's a potential race in xennet_remove(); this is what the driver is
doing upon unregistering a network device:

  1. state = read bus state
  2. if state is not "Closed":
  3.request to set state to "Closing"
  4.wait for state to be set to "Closing"
  5.request to set state to "Closed"
  6.wait for state to be set to "Closed"

If the state changes to "Closed" immediately after step 1 we are stuck
forever in step 4, because the state will never go back from "Closed" to
"Closing".

Make sure to check also for state == "Closed" in step 4 to prevent the
deadlock.

Also add a 5 sec timeout any time we wait for the bus state to change,
to avoid getting stuck forever in wait_event(), and add a debug message
to help track down similar issues in the future.

Signed-off-by: Andrea Righi 
---
 drivers/net/xen-netfront.c | 79 +++---
 1 file changed, 57 insertions(+), 22 deletions(-)

diff --git a/drivers/net/xen-netfront.c b/drivers/net/xen-netfront.c
index 482c6c8b0fb7..e09caba93dd9 100644
--- a/drivers/net/xen-netfront.c
+++ b/drivers/net/xen-netfront.c
@@ -63,6 +63,8 @@ module_param_named(max_queues, xennet_max_queues, uint, 0644);
 MODULE_PARM_DESC(max_queues,
 "Maximum number of queues per virtual interface");
 
+#define XENNET_TIMEOUT  (5 * HZ)
+
 static const struct ethtool_ops xennet_ethtool_ops;
 
 struct netfront_cb {
@@ -1334,12 +1336,20 @@ static struct net_device *xennet_create_dev(struct 
xenbus_device *dev)
 
netif_carrier_off(netdev);
 
-   xenbus_switch_state(dev, XenbusStateInitialising);
-   wait_event(module_wq,
-  xenbus_read_driver_state(dev->otherend) !=
-  XenbusStateClosed &&
-  xenbus_read_driver_state(dev->otherend) !=
-  XenbusStateUnknown);
+   do {
+   dev_dbg(>dev,
+   "%s: switching to XenbusStateInitialising\n",
+   dev->nodename);
+   xenbus_switch_state(dev, XenbusStateInitialising);
+   err = wait_event_timeout(module_wq,
+xenbus_read_driver_state(dev->otherend) !=
+XenbusStateClosed &&
+xenbus_read_driver_state(dev->otherend) !=
+XenbusStateUnknown, XENNET_TIMEOUT);
+   dev_dbg(>dev, "%s: state = %d\n", dev->nodename,
+   xenbus_read_driver_state(dev->otherend));
+   } while (!err);
+
return netdev;
 
  exit:
@@ -2139,28 +2149,53 @@ static const struct attribute_group xennet_dev_group = {
 };
 #endif /* CONFIG_SYSFS */
 
-static int xennet_remove(struct xenbus_device *dev)
+static void xennet_bus_close(struct xenbus_device *dev)
 {
-   struct netfront_info *info = dev_get_drvdata(>dev);
-
-   dev_dbg(>dev, "%s\n", dev->nodename);
+   int ret;
 
-   if (xenbus_read_driver_state(dev->otherend) != XenbusStateClosed) {
+   if (xenbus_read_driver_state(dev->otherend) == XenbusStateClosed)
+   return;
+   do {
+   dev_dbg(>dev, "%s: switching to XenbusStateClosing\n",
+   dev->nodename);
xenbus_switch_state(dev, XenbusStateClosing);
-   wait_event(module_wq,
-  xenbus_read_driver_state(dev->otherend) ==
-  XenbusStateClosing ||
-  xenbus_read_driver_state(dev->otherend) ==
-  XenbusStateUnknown);
+   ret = wait_event_timeout(module_wq,
+  xenbus_read_driver_state(dev->otherend) ==
+  XenbusStateClosing ||
+  xenbus_read_driver_state(dev->otherend) ==
+  XenbusStateClosed ||
+  xenbus_read_driver_state(dev->otherend) ==
+  XenbusStateUnknown,
+  XENNET_TIMEOUT);
+   dev_dbg(>dev, "%s: state = %d\n", dev->nodename,
+   xenbus_read_driver_state(dev->otherend));
+   } while (!ret);
+
+   if (xenbus_read_driver_state(dev->otherend) == XenbusStateClosed)
+   return;
 
+   do {
+   dev_dbg(>dev, "%s: switching to XenbusStateClosed\n",
+   dev->nodename);
xenbus_switch_state(dev, XenbusStateClosed);
-   wait_event(module_wq,
-  xenbus_read_driver_state(dev->otherend) ==
-  XenbusStateClosed ||
-  xenbus_read_driver_stat

Re: [RFC PATCH 2/2] PM: hibernate: introduce opportunistic memory reclaim

2020-06-09 Thread Andrea Righi
On Mon, Jun 08, 2020 at 03:23:22PM -0700, Luigi Semenzato wrote:
> Hi Andrea,
> 
> 1. This mechanism is quite general.  It is possible that, although
> hibernation may be an important use, there will be other uses for it.
> I suggest leaving the hibernation example and performance analysis,
> but not mentioning PM or hibernation in the patch subject.

I was actually thinking of making this feature even more generic, since
there might be other potential users of this forced "memory reclaim"
feature outside hibernation. So, instead of adding the new sysfs files
under /sys/power/mm_reclaim/, maybe move them to /sys/kernel/mm/ (since
it's more of a mm feature than a PM/hibernation feature).

> 
> 2. It may be useful to have run_show() return the number of pages
> reclaimed in the last attempt.  (I had suggested something similar in
> https://lore.kernel.org/linux-mm/caa25o9sxajraa+zyhvtydakdxokcrnyxgeuimax4sujgcmr...@mail.gmail.com/).

I like this idea, I'll add that in the next version.

> 
> 3. It is not clear how much mm_reclaim/release is going to help.  If
> the preloading of the swapped-out pages uses some kind of LIFO order,
> and can batch multiple pages, then it might help.  Otherwise demand
> paging is likely to be more effective.  If the preloading does indeed
> help, it may be useful to explain why in the commit message.

Swap readahead helps a lot in terms of performance if we preload all at
once. But I agree that for the majority of cases on-demand paging just
works fine.

My specific use-case for mm_reclaim/release is to make sure a VM
that is just resumed is immediately "fast" by preloading the swapped-out
pages back to memory all at once.

Without mm_reclaim/release I've been using the trick of running swapoff
followed by a swapon to force all the pages back to memory, but it's
kinda ugly and I was looking for a better way to do this. I've also
tried ptrace() + reading all the VMAs via /proc/pid/mem; it works, but
it's not as fast as swapoff+swapon or mm_reclaim/release.
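
(For completeness, the swapoff+swapon trick above is literally just the
following, with the swap device path as an example:

  swapoff /swapfile && swapon /swapfile

i.e. disabling the swap area forces every swapped-out page back into RAM
before re-enabling it.)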

I'll report performance numbers of mm_reclaim/release vs ptrace() +
/proc/pid/mem in the next version of this patch.

> 
> Thanks!

Thanks for your review!

-Andrea


[RFC PATCH 1/2] mm: swap: allow partial swapoff with try_to_unuse()

2020-06-01 Thread Andrea Righi
Allow running try_to_unuse() with an arbitrary number of pages also
when frontswap is not used.

To preserve the default behavior, introduce a new function called
try_to_unuse_wait() with a new 'wait' parameter: if 'wait' is false,
return as soon as "pages_to_unuse" pages are unused; if it is true,
simply ignore "pages_to_unuse" and wait until all the pages are unused.

In any case the value of 0 in "pages_to_unuse" means "all pages".

This is required by the PM / hibernation opportunistic memory reclaim
feature.

Signed-off-by: Andrea Righi 
---
 include/linux/swapfile.h |  7 +++
 mm/swapfile.c| 15 +++
 2 files changed, 14 insertions(+), 8 deletions(-)

diff --git a/include/linux/swapfile.h b/include/linux/swapfile.h
index e06febf62978..ac4d0ccd1f7b 100644
--- a/include/linux/swapfile.h
+++ b/include/linux/swapfile.h
@@ -9,6 +9,13 @@
 extern spinlock_t swap_lock;
 extern struct plist_head swap_active_head;
 extern struct swap_info_struct *swap_info[];
+extern int try_to_unuse_wait(unsigned int type, bool frontswap, bool wait,
+unsigned long pages_to_unuse);
+static inline int
+try_to_unuse(unsigned int type, bool frontswap, unsigned long pages_to_unuse)
+{
+   return try_to_unuse_wait(type, frontswap, true, pages_to_unuse);
+}
 extern int try_to_unuse(unsigned int, bool, unsigned long);
 extern unsigned long generic_max_swapfile_size(void);
 extern unsigned long max_swapfile_size(void);
diff --git a/mm/swapfile.c b/mm/swapfile.c
index f8bf926c9c8f..651471ccf133 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -2121,10 +2121,13 @@ static unsigned int find_next_to_unuse(struct 
swap_info_struct *si,
 }
 
 /*
- * If the boolean frontswap is true, only unuse pages_to_unuse pages;
- * pages_to_unuse==0 means all pages; ignored if frontswap is false
+ * Unuse pages_to_unuse pages; pages_to_unuse==0 means all pages.
+ *
+ * If "wait" is false stop as soon as "pages_to_unuse" pages are unused, if
+ * wait is true "pages_to_unuse" will be ignored and wait until all the pages
+ * are unused.
  */
-int try_to_unuse(unsigned int type, bool frontswap,
+int try_to_unuse_wait(unsigned int type, bool frontswap, bool wait,
 unsigned long pages_to_unuse)
 {
struct mm_struct *prev_mm;
@@ -2138,10 +2141,6 @@ int try_to_unuse(unsigned int type, bool frontswap,
 
if (!READ_ONCE(si->inuse_pages))
return 0;
-
-   if (!frontswap)
-   pages_to_unuse = 0;
-
 retry:
retval = shmem_unuse(type, frontswap, _to_unuse);
if (retval)
@@ -2223,7 +,7 @@ int try_to_unuse(unsigned int type, bool frontswap,
 * been preempted after get_swap_page(), temporarily hiding that swap.
 * It's easy and robust (though cpu-intensive) just to keep retrying.
 */
-   if (READ_ONCE(si->inuse_pages)) {
+   if (wait && READ_ONCE(si->inuse_pages)) {
if (!signal_pending(current))
goto retry;
retval = -EINTR;
-- 
2.25.1



[RFC PATCH 2/2] PM: hibernate: introduce opportunistic memory reclaim

2020-06-01 Thread Andrea Righi
== Overview ==

When a system is going to be hibernated, the kernel needs to allocate
and dump the content of the entire memory to the resume device (swap) by
creating a "hibernation image".

To make sure this image fits in the available free memory, the kernel
can induce an artificial memory pressure condition that allows it to
free up some pages (i.e., drop clean page cache pages, writeback dirty
page cache pages, swap out anonymous memory, etc.).

How hard the kernel pushes to free up memory is determined by
/sys/power/image_size: a smaller size will cause more memory to be
dropped, cutting down the amount of I/O required to write the
hibernation image; a larger image size, instead, is going to generate
more I/O, but the system will likely be less sluggish at resume, because
more caches will be restored, reducing the paging time.

The I/O generated to free up memory, write the hibernation image to disk
and load it back to memory is the main bottleneck of hibernation [1].

== Proposed solution ==

"Opportunistic memory reclaim" aims to provide an interface that lets
user-space control this artificial memory pressure. With this feature
user-space can trigger the memory reclaim before the actual hibernation
is started (e.g., if the system is idle for a certain amount of time).

This allows hibernation to be consistently faster when needed (in terms
of time to hibernate) by reducing the size of the hibernation image in
advance.

== Interface ==

To accomplish this goal, the following new files are provided in sysfs:

 - /sys/power/mm_reclaim/run
 - /sys/power/mm_reclaim/release

The former can be used to start the memory reclaim by writing a number
representing the desired number of pages to be reclaimed (with "0" the
kernel will try to reclaim as many pages as possible).

The latter can be used in the same way to force the kernel to pull a
certain number of swapped-out pages back to memory (by writing the
number of pages, or "0" to load back as many pages as possible); this
can be useful immediately after resume to speed up the paging time and
get the system back to full speed faster.

Memory reclaim and release can be interrupted by sending a signal to the
process that is writing to /sys/power/mm_reclaim/{run,release} (i.e., to
set a timeout for the particular operation).
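
For illustration, the intended usage from user-space looks like this
(the values are examples; both files take a number of pages, with "0"
meaning "as many pages as possible"):

  # before hibernating, e.g. once the system has been idle for a while
  echo 0 > /sys/power/mm_reclaim/run
  # right after resume, optionally preload the swapped-out pages
  echo 0 > /sys/power/mm_reclaim/release
  # both writes can be bounded with a timeout, since a signal
  # interrupts them
  timeout 60 sh -c 'echo 0 > /sys/power/mm_reclaim/run'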

== Testing ==

Environment:
   - VM (kvm):
 8GB of RAM
 disk speed: 100 MB/s
 8GB swap file on ext4 (/swapfile)

Use case:
  - allocate 85% of memory, stay almost idle for 60s, then hibernate
and resume (measuring the time)

Result (average of 10 runs):
                                  5.7-vanilla   5.7-mm_reclaim
                                  -----------   --------------
  [hibernate] image_size=default       51.56s            4.19s
     [resume] image_size=default       26.34s            5.01s
  [hibernate] image_size=0             73.22s            5.36s
     [resume] image_size=0              5.32s            5.26s

NOTE #1: in the 5.7-mm_reclaim case a user-space daemon detects when the
system is idle and triggers the opportunistic memory reclaim via
/sys/power/mm_reclaim/run.
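
(A minimal sketch of such a daemon, assuming the load average as the
idleness criterion; the threshold and the poll interval are arbitrary:

  while sleep 60; do
      load=$(cut -d' ' -f1 /proc/loadavg)
      # if the 1-minute load average is low enough, consider the
      # system idle and trigger the opportunistic reclaim
      if awk -v l="$load" 'BEGIN { exit !(l < 0.1) }'; then
          echo 0 > /sys/power/mm_reclaim/run
      fi
  done

The real daemon can obviously use any other idleness heuristic.)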

NOTE #2: in the 5.7-mm_reclaim case, after the system is resumed, a
user-space process can (optionally) use /sys/power/mm_reclaim/release to
pre-load back to memory all (or some) of the swapped out pages in order
to have a more responsive system.

== Conclusion ==

Opportunistic memory reclaim can provide a significant benefit to those
systems where being able to hibernate quickly is important.

The typical use case is with "spot" cloud instances: low-priority
instances that can be stopped at any time (with prior notice) to
prioritize other more privileged instances [2].

Being able to quickly stop low-priority instances that are mostly idle
can be critical to providing a better quality of service in the overall
cloud infrastructure.

== See also ==

[1] https://lwn.net/Articles/821158/
[2] https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/spot-interruptions.html

Signed-off-by: Andrea Righi 
---
 Documentation/ABI/testing/sysfs-power | 38 +++
 include/linux/swapfile.h  |  1 +
 kernel/power/hibernate.c  | 94 ++-
 mm/swapfile.c | 30 +
 4 files changed, 162 insertions(+), 1 deletion(-)

diff --git a/Documentation/ABI/testing/sysfs-power 
b/Documentation/ABI/testing/sysfs-power
index 5e6ead29124c..b33db9816a8c 100644
--- a/Documentation/ABI/testing/sysfs-power
+++ b/Documentation/ABI/testing/sysfs-power
@@ -192,6 +192,44 @@ Description:
Reading from this file will display the current value, which is
set to 1 MB by default.
 
+What:  /sys/power/mm_reclaim/
+Date:  May 2020
+Contact:   Andrea Righi 
+Description:
+   The /sys/power/mm_reclaim directory contains all the
+   opportunistic 

[RFC PATCH 0/2] PM: hibernate: opportunistic memory reclaim

2020-06-01 Thread Andrea Righi
Here is the first attempt to provide an interface that allows user-space
tasks to trigger an opportunistic memory reclaim before hibernating a
system.

Reclaiming memory in advance (e.g., when the system is idle) reduces the
size of the hibernation image and significantly speeds up the time to
hibernate and resume.

The typical use case of this feature is to allow high-priority cloud
instances to preempt low-priority instances (e.g., "spot" instances [1])
by hibernating them.

Opportunistic memory reclaim is very effective at quickly hibernating
instances that allocate a large chunk of memory but remain mostly idle,
using only a minimal working set.
This topic has been mentioned during the OSPM 2020 conference [2].

See [RFC PATCH 2/2] for details about the proposed solution.

Feedback is welcome.

Thanks,
-Andrea

[1] 
https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/using-spot-instances.html
[2] https://lwn.net/Articles/821158/

--------
Andrea Righi (2):
  mm: swap: allow partial swapoff with try_to_unuse()
  PM: hibernate: introduce opportunistic memory reclaim

 Documentation/ABI/testing/sysfs-power | 38 ++
 include/linux/swapfile.h  |  8 +++
 kernel/power/hibernate.c  | 94 ++-
 mm/swapfile.c | 45 ++---
 4 files changed, 176 insertions(+), 9 deletions(-)



Re: [PATCH v3] bcache: fix deadlock in bcache_allocator

2019-10-10 Thread Andrea Righi
On Wed, Aug 07, 2019 at 09:53:46PM +0800, Coly Li wrote:
> On 2019/8/7 6:38 PM, Andrea Righi wrote:
> > bcache_allocator can call the following:
> > 
> >  bch_allocator_thread()
> >   -> bch_prio_write()
> >  -> bch_bucket_alloc()
> > -> wait on &ca->set->bucket_wait
> > 
> > But the wake up event on bucket_wait is supposed to come from
> > bch_allocator_thread() itself => deadlock:
> > 
> > [ 1158.490744] INFO: task bcache_allocato:15861 blocked for more than 10 
> > seconds.
> > [ 1158.495929]   Not tainted 5.3.0-050300rc3-generic #201908042232
> > [ 1158.500653] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables 
> > this message.
> > [ 1158.504413] bcache_allocato D0 15861  2 0x80004000
> > [ 1158.504419] Call Trace:
> > [ 1158.504429]  __schedule+0x2a8/0x670
> > [ 1158.504432]  schedule+0x2d/0x90
> > [ 1158.504448]  bch_bucket_alloc+0xe5/0x370 [bcache]
> > [ 1158.504453]  ? wait_woken+0x80/0x80
> > [ 1158.504466]  bch_prio_write+0x1dc/0x390 [bcache]
> > [ 1158.504476]  bch_allocator_thread+0x233/0x490 [bcache]
> > [ 1158.504491]  kthread+0x121/0x140
> > [ 1158.504503]  ? invalidate_buckets+0x890/0x890 [bcache]
> > [ 1158.504506]  ? kthread_park+0xb0/0xb0
> > [ 1158.504510]  ret_from_fork+0x35/0x40
> > 
> > Fix by making the call to bch_prio_write() non-blocking, so that
> > bch_allocator_thread() never waits on itself.
> > 
> > Moreover, make sure to wake up the garbage collector thread when
> > bch_prio_write() is failing to allocate buckets.
> > 
> > BugLink: https://bugs.launchpad.net/bugs/1784665
> > BugLink: https://bugs.launchpad.net/bugs/1796292
> > Signed-off-by: Andrea Righi 
> 
> OK, I add this version into my for-test directory. Once you have a new
> version, I will update it. Thanks.
> 
> Coly Li

Hi Coly,

any news about this patch? We're still using it in Ubuntu and no errors
have been reported so far. Do you think we can add this to linux-next?

Thanks,
-Andrea


[tip:perf/urgent] kprobes: Fix potential deadlock in kprobe_optimizer()

2019-08-19 Thread tip-bot for Andrea Righi
Commit-ID:  f1c6ece23729257fb46562ff9224cf5f61b818da
Gitweb: https://git.kernel.org/tip/f1c6ece23729257fb46562ff9224cf5f61b818da
Author: Andrea Righi 
AuthorDate: Mon, 12 Aug 2019 20:43:02 +0200
Committer:  Ingo Molnar 
CommitDate: Mon, 19 Aug 2019 12:22:19 +0200

kprobes: Fix potential deadlock in kprobe_optimizer()

lockdep reports the following deadlock scenario:

 WARNING: possible circular locking dependency detected

 kworker/1:1/48 is trying to acquire lock:
 8d7a62b2 (text_mutex){+.+.}, at: kprobe_optimizer+0x163/0x290

 but task is already holding lock:
 850b5e2d (module_mutex){+.+.}, at: kprobe_optimizer+0x31/0x290

 which lock already depends on the new lock.

 the existing dependency chain (in reverse order) is:

 -> #1 (module_mutex){+.+.}:
__mutex_lock+0xac/0x9f0
mutex_lock_nested+0x1b/0x20
set_all_modules_text_rw+0x22/0x90
ftrace_arch_code_modify_prepare+0x1c/0x20
ftrace_run_update_code+0xe/0x30
ftrace_startup_enable+0x2e/0x50
ftrace_startup+0xa7/0x100
register_ftrace_function+0x27/0x70
arm_kprobe+0xb3/0x130
enable_kprobe+0x83/0xa0
enable_trace_kprobe.part.0+0x2e/0x80
kprobe_register+0x6f/0xc0
perf_trace_event_init+0x16b/0x270
perf_kprobe_init+0xa7/0xe0
perf_kprobe_event_init+0x3e/0x70
perf_try_init_event+0x4a/0x140
perf_event_alloc+0x93a/0xde0
__do_sys_perf_event_open+0x19f/0xf30
__x64_sys_perf_event_open+0x20/0x30
do_syscall_64+0x65/0x1d0
entry_SYSCALL_64_after_hwframe+0x49/0xbe

 -> #0 (text_mutex){+.+.}:
__lock_acquire+0xfcb/0x1b60
lock_acquire+0xca/0x1d0
__mutex_lock+0xac/0x9f0
mutex_lock_nested+0x1b/0x20
kprobe_optimizer+0x163/0x290
process_one_work+0x22b/0x560
worker_thread+0x50/0x3c0
kthread+0x112/0x150
ret_from_fork+0x3a/0x50

 other info that might help us debug this:

  Possible unsafe locking scenario:

        CPU0                    CPU1
        ----                    ----
   lock(module_mutex);
                                lock(text_mutex);
                                lock(module_mutex);
   lock(text_mutex);

  *** DEADLOCK ***

As a reproducer I've been using bcc's funccount.py
(https://github.com/iovisor/bcc/blob/master/tools/funccount.py),
for example:

 # ./funccount.py '*interrupt*'

That immediately triggers the lockdep splat.

Fix by acquiring text_mutex before module_mutex in kprobe_optimizer().

Signed-off-by: Andrea Righi 
Acked-by: Masami Hiramatsu 
Cc: Anil S Keshavamurthy 
Cc: David S. Miller 
Cc: Linus Torvalds 
Cc: Naveen N. Rao 
Cc: Peter Zijlstra 
Cc: Thomas Gleixner 
Fixes: d5b844a2cf50 ("ftrace/x86: Remove possible deadlock between 
register_kprobe() and ftrace_run_update_code()")
Link: http://lkml.kernel.org/r/20190812184302.GA7010@xps-13
Signed-off-by: Ingo Molnar 
---
 kernel/kprobes.c | 8 
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/kernel/kprobes.c b/kernel/kprobes.c
index 9873fc627d61..d9770a5393c8 100644
--- a/kernel/kprobes.c
+++ b/kernel/kprobes.c
@@ -470,6 +470,7 @@ static DECLARE_DELAYED_WORK(optimizing_work, 
kprobe_optimizer);
  */
 static void do_optimize_kprobes(void)
 {
+   lockdep_assert_held(&text_mutex);
/*
 * The optimization/unoptimization refers online_cpus via
 * stop_machine() and cpu-hotplug modifies online_cpus.
@@ -487,9 +488,7 @@ static void do_optimize_kprobes(void)
list_empty(&optimizing_list))
return;
 
-   mutex_lock(&text_mutex);
arch_optimize_kprobes(&optimizing_list);
-   mutex_unlock(&text_mutex);
 }
 
 /*
@@ -500,6 +499,7 @@ static void do_unoptimize_kprobes(void)
 {
struct optimized_kprobe *op, *tmp;
 
+   lockdep_assert_held(&text_mutex);
/* See comment in do_optimize_kprobes() */
lockdep_assert_cpus_held();
 
@@ -507,7 +507,6 @@ static void do_unoptimize_kprobes(void)
if (list_empty(&unoptimizing_list))
return;
 
-   mutex_lock(&text_mutex);
arch_unoptimize_kprobes(&unoptimizing_list, &free_list);
/* Loop free_list for disarming */
list_for_each_entry_safe(op, tmp, &free_list, list) {
@@ -524,7 +523,6 @@ static void do_unoptimize_kprobes(void)
} else
list_del_init(&op->list);
}
-   mutex_unlock(&text_mutex);
 }
 
 /* Reclaim all kprobes on the free_list */
@@ -556,6 +554,7 @@ static void kprobe_optimizer(struct work_struct *work)
 {
mutex_lock(&kprobe_mutex);
cpus_read_lock();
+   mutex_lock(&text_mutex);
/* Lock modules while optimizing kprobes */
mutex_lock(&module_mutex);
 
@@ -583,6 +582,7 @@ static void kprobe_optimizer(struct work_struct *work)
do_free_cleaned_kprobes();
 
mutex_unlock(&module_mutex);
+   mutex_unlock(&text_mutex);
cpus_read_unlock();
mutex_unlock(&kprobe_mutex);
 


[PATCH] kprobes: fix potential deadlock in kprobe_optimizer()

2019-08-12 Thread Andrea Righi
lockdep reports the following:

 WARNING: possible circular locking dependency detected

 kworker/1:1/48 is trying to acquire lock:
 8d7a62b2 (text_mutex){+.+.}, at: kprobe_optimizer+0x163/0x290

 but task is already holding lock:
 850b5e2d (module_mutex){+.+.}, at: kprobe_optimizer+0x31/0x290

 which lock already depends on the new lock.

 the existing dependency chain (in reverse order) is:

 -> #1 (module_mutex){+.+.}:
__mutex_lock+0xac/0x9f0
mutex_lock_nested+0x1b/0x20
set_all_modules_text_rw+0x22/0x90
ftrace_arch_code_modify_prepare+0x1c/0x20
ftrace_run_update_code+0xe/0x30
ftrace_startup_enable+0x2e/0x50
ftrace_startup+0xa7/0x100
register_ftrace_function+0x27/0x70
arm_kprobe+0xb3/0x130
enable_kprobe+0x83/0xa0
enable_trace_kprobe.part.0+0x2e/0x80
kprobe_register+0x6f/0xc0
perf_trace_event_init+0x16b/0x270
perf_kprobe_init+0xa7/0xe0
perf_kprobe_event_init+0x3e/0x70
perf_try_init_event+0x4a/0x140
perf_event_alloc+0x93a/0xde0
__do_sys_perf_event_open+0x19f/0xf30
__x64_sys_perf_event_open+0x20/0x30
do_syscall_64+0x65/0x1d0
entry_SYSCALL_64_after_hwframe+0x49/0xbe

 -> #0 (text_mutex){+.+.}:
__lock_acquire+0xfcb/0x1b60
lock_acquire+0xca/0x1d0
__mutex_lock+0xac/0x9f0
mutex_lock_nested+0x1b/0x20
kprobe_optimizer+0x163/0x290
process_one_work+0x22b/0x560
worker_thread+0x50/0x3c0
kthread+0x112/0x150
ret_from_fork+0x3a/0x50

 other info that might help us debug this:

  Possible unsafe locking scenario:

        CPU0                    CPU1
        ----                    ----
   lock(module_mutex);
                                lock(text_mutex);
                                lock(module_mutex);
   lock(text_mutex);

  *** DEADLOCK ***

As a reproducer I've been using bcc's funccount.py
(https://github.com/iovisor/bcc/blob/master/tools/funccount.py),
for example:

 # ./funccount.py '*interrupt*'

That immediately triggers the lockdep splat.

Fix by acquiring text_mutex before module_mutex in kprobe_optimizer().

Fixes: d5b844a2cf50 ("ftrace/x86: Remove possible deadlock between 
register_kprobe() and ftrace_run_update_code()")
Signed-off-by: Andrea Righi 
---
 kernel/kprobes.c | 8 
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/kernel/kprobes.c b/kernel/kprobes.c
index 9873fc627d61..d9770a5393c8 100644
--- a/kernel/kprobes.c
+++ b/kernel/kprobes.c
@@ -470,6 +470,7 @@ static DECLARE_DELAYED_WORK(optimizing_work, 
kprobe_optimizer);
  */
 static void do_optimize_kprobes(void)
 {
+   lockdep_assert_held(&text_mutex);
/*
 * The optimization/unoptimization refers online_cpus via
 * stop_machine() and cpu-hotplug modifies online_cpus.
@@ -487,9 +488,7 @@ static void do_optimize_kprobes(void)
list_empty(&optimizing_list))
return;
 
-   mutex_lock(&text_mutex);
arch_optimize_kprobes(&optimizing_list);
-   mutex_unlock(&text_mutex);
 }
 
 /*
@@ -500,6 +499,7 @@ static void do_unoptimize_kprobes(void)
 {
struct optimized_kprobe *op, *tmp;
 
+   lockdep_assert_held(&text_mutex);
/* See comment in do_optimize_kprobes() */
lockdep_assert_cpus_held();
 
@@ -507,7 +507,6 @@ static void do_unoptimize_kprobes(void)
if (list_empty(&unoptimizing_list))
return;
 
-   mutex_lock(&text_mutex);
arch_unoptimize_kprobes(&unoptimizing_list, &free_list);
/* Loop free_list for disarming */
list_for_each_entry_safe(op, tmp, &free_list, list) {
@@ -524,7 +523,6 @@ static void do_unoptimize_kprobes(void)
} else
list_del_init(&op->list);
}
-   mutex_unlock(&text_mutex);
 }
 
 /* Reclaim all kprobes on the free_list */
@@ -556,6 +554,7 @@ static void kprobe_optimizer(struct work_struct *work)
 {
mutex_lock(&kprobe_mutex);
cpus_read_lock();
+   mutex_lock(&text_mutex);
/* Lock modules while optimizing kprobes */
mutex_lock(&module_mutex);
 
@@ -583,6 +582,7 @@ static void kprobe_optimizer(struct work_struct *work)
do_free_cleaned_kprobes();
 
mutex_unlock(&module_mutex);
+   mutex_unlock(&text_mutex);
cpus_read_unlock();
mutex_unlock(&kprobe_mutex);
 
-- 
2.20.1



[PATCH v3] bcache: fix deadlock in bcache_allocator

2019-08-07 Thread Andrea Righi
bcache_allocator can call the following:

 bch_allocator_thread()
  -> bch_prio_write()
 -> bch_bucket_alloc()
-> wait on &ca->set->bucket_wait

But the wake up event on bucket_wait is supposed to come from
bch_allocator_thread() itself => deadlock:

[ 1158.490744] INFO: task bcache_allocato:15861 blocked for more than 10 
seconds.
[ 1158.495929]   Not tainted 5.3.0-050300rc3-generic #201908042232
[ 1158.500653] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this 
message.
[ 1158.504413] bcache_allocato D0 15861  2 0x80004000
[ 1158.504419] Call Trace:
[ 1158.504429]  __schedule+0x2a8/0x670
[ 1158.504432]  schedule+0x2d/0x90
[ 1158.504448]  bch_bucket_alloc+0xe5/0x370 [bcache]
[ 1158.504453]  ? wait_woken+0x80/0x80
[ 1158.504466]  bch_prio_write+0x1dc/0x390 [bcache]
[ 1158.504476]  bch_allocator_thread+0x233/0x490 [bcache]
[ 1158.504491]  kthread+0x121/0x140
[ 1158.504503]  ? invalidate_buckets+0x890/0x890 [bcache]
[ 1158.504506]  ? kthread_park+0xb0/0xb0
[ 1158.504510]  ret_from_fork+0x35/0x40

Fix by making the call to bch_prio_write() non-blocking, so that
bch_allocator_thread() never waits on itself.

Moreover, make sure to wake up the garbage collector thread when
bch_prio_write() is failing to allocate buckets.

BugLink: https://bugs.launchpad.net/bugs/1784665
BugLink: https://bugs.launchpad.net/bugs/1796292
Signed-off-by: Andrea Righi 
---
Changes in v3:
 - prevent buckets leak in bch_prio_write()

 drivers/md/bcache/alloc.c  |  5 -
 drivers/md/bcache/bcache.h |  2 +-
 drivers/md/bcache/super.c  | 27 +--
 3 files changed, 26 insertions(+), 8 deletions(-)

diff --git a/drivers/md/bcache/alloc.c b/drivers/md/bcache/alloc.c
index 6f776823b9ba..a1df0d95151c 100644
--- a/drivers/md/bcache/alloc.c
+++ b/drivers/md/bcache/alloc.c
@@ -377,7 +377,10 @@ static int bch_allocator_thread(void *arg)
if (!fifo_full(>free_inc))
goto retry_invalidate;
 
-   bch_prio_write(ca);
+   if (bch_prio_write(ca, false) < 0) {
+   ca->invalidate_needs_gc = 1;
+   wake_up_gc(ca->set);
+   }
}
}
 out:
diff --git a/drivers/md/bcache/bcache.h b/drivers/md/bcache/bcache.h
index 013e35a9e317..deb924e1d790 100644
--- a/drivers/md/bcache/bcache.h
+++ b/drivers/md/bcache/bcache.h
@@ -977,7 +977,7 @@ bool bch_cached_dev_error(struct cached_dev *dc);
 __printf(2, 3)
 bool bch_cache_set_error(struct cache_set *c, const char *fmt, ...);
 
-void bch_prio_write(struct cache *ca);
+int bch_prio_write(struct cache *ca, bool wait);
 void bch_write_bdev_super(struct cached_dev *dc, struct closure *parent);
 
 extern struct workqueue_struct *bcache_wq;
diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c
index 20ed838e9413..bd153234290d 100644
--- a/drivers/md/bcache/super.c
+++ b/drivers/md/bcache/super.c
@@ -529,12 +529,29 @@ static void prio_io(struct cache *ca, uint64_t bucket, 
int op,
closure_sync(cl);
 }
 
-void bch_prio_write(struct cache *ca)
+int bch_prio_write(struct cache *ca, bool wait)
 {
int i;
struct bucket *b;
struct closure cl;
 
+   pr_debug("free_prio=%zu, free_none=%zu, free_inc=%zu",
+fifo_used(>free[RESERVE_PRIO]),
+fifo_used(>free[RESERVE_NONE]),
+fifo_used(>free_inc));
+
+   /*
+* Pre-check if there are enough free buckets. In the non-blocking
+* scenario it's better to fail early rather than starting to allocate
+* buckets and do a cleanup later in case of failure.
+*/
+   if (!wait) {
+   size_t avail = fifo_used(>free[RESERVE_PRIO]) +
+  fifo_used(>free[RESERVE_NONE]);
+   if (prio_buckets(ca) > avail)
+   return -ENOMEM;
+   }
+
closure_init_stack();
 
lockdep_assert_held(>set->bucket_lock);
@@ -544,9 +561,6 @@ void bch_prio_write(struct cache *ca)
atomic_long_add(ca->sb.bucket_size * prio_buckets(ca),
>meta_sectors_written);
 
-   //pr_debug("free %zu, free_inc %zu, unused %zu", fifo_used(>free),
-   //   fifo_used(>free_inc), fifo_used(>unused));
-
for (i = prio_buckets(ca) - 1; i >= 0; --i) {
long bucket;
struct prio_set *p = ca->disk_buckets;
@@ -564,7 +578,7 @@ void bch_prio_write(struct cache *ca)
p->magic= pset_magic(>sb);
p->csum = bch_crc64(>magic, bucket_bytes(ca) - 8);
 
-   bucket = bch_bucket_alloc(ca, RESERVE_PRIO, true);
+   bucket = bch_bucket_alloc(ca, RESERVE_PRIO, wait);
BUG_ON(bucket == -1);
 
mutex_unlock(>s

Re: [PATCH v2] bcache: fix deadlock in bcache_allocator

2019-08-07 Thread Andrea Righi
On Tue, Aug 06, 2019 at 07:36:48PM +0200, Andrea Righi wrote:
> On Tue, Aug 06, 2019 at 11:18:01AM +0200, Andrea Righi wrote:
> > bcache_allocator() can call the following:
> > 
> >  bch_allocator_thread()
> >   -> bch_prio_write()
> >  -> bch_bucket_alloc()
> > > -> wait on &ca->set->bucket_wait
> > 
> > But the wake up event on bucket_wait is supposed to come from
> > bch_allocator_thread() itself => deadlock:
> > 
> > [ 1158.490744] INFO: task bcache_allocato:15861 blocked for more than 10 
> > seconds.
> > [ 1158.495929]   Not tainted 5.3.0-050300rc3-generic #201908042232
> > [ 1158.500653] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables 
> > this message.
> > [ 1158.504413] bcache_allocato D0 15861  2 0x80004000
> > [ 1158.504419] Call Trace:
> > [ 1158.504429]  __schedule+0x2a8/0x670
> > [ 1158.504432]  schedule+0x2d/0x90
> > [ 1158.504448]  bch_bucket_alloc+0xe5/0x370 [bcache]
> > [ 1158.504453]  ? wait_woken+0x80/0x80
> > [ 1158.504466]  bch_prio_write+0x1dc/0x390 [bcache]
> > [ 1158.504476]  bch_allocator_thread+0x233/0x490 [bcache]
> > [ 1158.504491]  kthread+0x121/0x140
> > [ 1158.504503]  ? invalidate_buckets+0x890/0x890 [bcache]
> > [ 1158.504506]  ? kthread_park+0xb0/0xb0
> > [ 1158.504510]  ret_from_fork+0x35/0x40
> > 
> > Fix by making the call to bch_prio_write() non-blocking, so that
> > bch_allocator_thread() never waits on itself.
> > 
> > Moreover, make sure to wake up the garbage collector thread when
> > bch_prio_write() is failing to allocate buckets.
> > 
> > BugLink: https://bugs.launchpad.net/bugs/1784665
> > BugLink: https://bugs.launchpad.net/bugs/1796292
> > Signed-off-by: Andrea Righi 
> > ---
> > Changes in v2:
> >  - prevent retry_invalidate busy loop in bch_allocator_thread()
> > 
> >  drivers/md/bcache/alloc.c  |  5 -
> >  drivers/md/bcache/bcache.h |  2 +-
> >  drivers/md/bcache/super.c  | 13 +
> >  3 files changed, 14 insertions(+), 6 deletions(-)
> > 
> > diff --git a/drivers/md/bcache/alloc.c b/drivers/md/bcache/alloc.c
> > index 6f776823b9ba..a1df0d95151c 100644
> > --- a/drivers/md/bcache/alloc.c
> > +++ b/drivers/md/bcache/alloc.c
> > @@ -377,7 +377,10 @@ static int bch_allocator_thread(void *arg)
> > if (!fifo_full(&ca->free_inc))
> > goto retry_invalidate;
> >  
> > -   bch_prio_write(ca);
> > +   if (bch_prio_write(ca, false) < 0) {
> > +   ca->invalidate_needs_gc = 1;
> > +   wake_up_gc(ca->set);
> > +   }
> > }
> > }
> >  out:
> > diff --git a/drivers/md/bcache/bcache.h b/drivers/md/bcache/bcache.h
> > index 013e35a9e317..deb924e1d790 100644
> > --- a/drivers/md/bcache/bcache.h
> > +++ b/drivers/md/bcache/bcache.h
> > @@ -977,7 +977,7 @@ bool bch_cached_dev_error(struct cached_dev *dc);
> >  __printf(2, 3)
> >  bool bch_cache_set_error(struct cache_set *c, const char *fmt, ...);
> >  
> > -void bch_prio_write(struct cache *ca);
> > +int bch_prio_write(struct cache *ca, bool wait);
> >  void bch_write_bdev_super(struct cached_dev *dc, struct closure *parent);
> >  
> >  extern struct workqueue_struct *bcache_wq;
> > diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c
> > index 20ed838e9413..716ea272fb55 100644
> > --- a/drivers/md/bcache/super.c
> > +++ b/drivers/md/bcache/super.c
> > @@ -529,7 +529,7 @@ static void prio_io(struct cache *ca, uint64_t bucket, 
> > int op,
> > closure_sync(cl);
> >  }
> >  
> > -void bch_prio_write(struct cache *ca)
> > +int bch_prio_write(struct cache *ca, bool wait)
> >  {
> > int i;
> > struct bucket *b;
> > @@ -564,8 +564,12 @@ void bch_prio_write(struct cache *ca)
> > p->magic = pset_magic(&ca->sb);
> > p->csum = bch_crc64(&p->magic, bucket_bytes(ca) - 8);
> >  
> > -   bucket = bch_bucket_alloc(ca, RESERVE_PRIO, true);
> > -   BUG_ON(bucket == -1);
> > +   bucket = bch_bucket_alloc(ca, RESERVE_PRIO, wait);
> > +   if (bucket == -1) {
> > +   if (!wait)
> > +   return -ENOMEM;
> > +       BUG_ON(1);
> > +   }
> 
> Coly,
> 
> looking more at this change, I think we should handle the failure path
> properly or we may leak buckets, am I right? (sorry for not realizing
> this before). Maybe we need something like the following on top of my
> previous patch.
> 
> I'm going to run more stress tests with this patch applied and will try
> to figure out if we're actually leaking buckets without it.
> 
> ---
> Subject: bcache: prevent leaking buckets in bch_prio_write()
> 
> Handle the allocation failure path properly in bch_prio_write() to avoid
> leaking buckets from the previous successful iterations.
> 
> Signed-off-by: Andrea Righi 

Coly, ignore this one please. A v3 of the previous patch with a better
fix for this potential buckets leak is on the way.

Thanks,
-Andrea


Re: [PATCH v2] bcache: fix deadlock in bcache_allocator

2019-08-06 Thread Andrea Righi
On Tue, Aug 06, 2019 at 11:18:01AM +0200, Andrea Righi wrote:
> bcache_allocator() can call the following:
> 
>  bch_allocator_thread()
>   -> bch_prio_write()
>  -> bch_bucket_alloc()
> -> wait on &ca->set->bucket_wait
> 
> But the wake up event on bucket_wait is supposed to come from
> bch_allocator_thread() itself => deadlock:
> 
> [ 1158.490744] INFO: task bcache_allocato:15861 blocked for more than 10 
> seconds.
> [ 1158.495929]   Not tainted 5.3.0-050300rc3-generic #201908042232
> [ 1158.500653] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables 
> this message.
> [ 1158.504413] bcache_allocato D0 15861  2 0x80004000
> [ 1158.504419] Call Trace:
> [ 1158.504429]  __schedule+0x2a8/0x670
> [ 1158.504432]  schedule+0x2d/0x90
> [ 1158.504448]  bch_bucket_alloc+0xe5/0x370 [bcache]
> [ 1158.504453]  ? wait_woken+0x80/0x80
> [ 1158.504466]  bch_prio_write+0x1dc/0x390 [bcache]
> [ 1158.504476]  bch_allocator_thread+0x233/0x490 [bcache]
> [ 1158.504491]  kthread+0x121/0x140
> [ 1158.504503]  ? invalidate_buckets+0x890/0x890 [bcache]
> [ 1158.504506]  ? kthread_park+0xb0/0xb0
> [ 1158.504510]  ret_from_fork+0x35/0x40
> 
> Fix by making the call to bch_prio_write() non-blocking, so that
> bch_allocator_thread() never waits on itself.
> 
> Moreover, make sure to wake up the garbage collector thread when
> bch_prio_write() is failing to allocate buckets.
> 
> BugLink: https://bugs.launchpad.net/bugs/1784665
> BugLink: https://bugs.launchpad.net/bugs/1796292
> Signed-off-by: Andrea Righi 
> ---
> Changes in v2:
>  - prevent retry_invalidate busy loop in bch_allocator_thread()
> 
>  drivers/md/bcache/alloc.c  |  5 -
>  drivers/md/bcache/bcache.h |  2 +-
>  drivers/md/bcache/super.c  | 13 +
>  3 files changed, 14 insertions(+), 6 deletions(-)
> 
> diff --git a/drivers/md/bcache/alloc.c b/drivers/md/bcache/alloc.c
> index 6f776823b9ba..a1df0d95151c 100644
> --- a/drivers/md/bcache/alloc.c
> +++ b/drivers/md/bcache/alloc.c
> @@ -377,7 +377,10 @@ static int bch_allocator_thread(void *arg)
>   if (!fifo_full(&ca->free_inc))
>   goto retry_invalidate;
>  
> - bch_prio_write(ca);
> + if (bch_prio_write(ca, false) < 0) {
> + ca->invalidate_needs_gc = 1;
> + wake_up_gc(ca->set);
> + }
>   }
>   }
>  out:
> diff --git a/drivers/md/bcache/bcache.h b/drivers/md/bcache/bcache.h
> index 013e35a9e317..deb924e1d790 100644
> --- a/drivers/md/bcache/bcache.h
> +++ b/drivers/md/bcache/bcache.h
> @@ -977,7 +977,7 @@ bool bch_cached_dev_error(struct cached_dev *dc);
>  __printf(2, 3)
>  bool bch_cache_set_error(struct cache_set *c, const char *fmt, ...);
>  
> -void bch_prio_write(struct cache *ca);
> +int bch_prio_write(struct cache *ca, bool wait);
>  void bch_write_bdev_super(struct cached_dev *dc, struct closure *parent);
>  
>  extern struct workqueue_struct *bcache_wq;
> diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c
> index 20ed838e9413..716ea272fb55 100644
> --- a/drivers/md/bcache/super.c
> +++ b/drivers/md/bcache/super.c
> @@ -529,7 +529,7 @@ static void prio_io(struct cache *ca, uint64_t bucket, 
> int op,
>   closure_sync(cl);
>  }
>  
> -void bch_prio_write(struct cache *ca)
> +int bch_prio_write(struct cache *ca, bool wait)
>  {
>   int i;
>   struct bucket *b;
> @@ -564,8 +564,12 @@ void bch_prio_write(struct cache *ca)
>   p->magic = pset_magic(&ca->sb);
>   p->csum = bch_crc64(&p->magic, bucket_bytes(ca) - 8);
>  
> - bucket = bch_bucket_alloc(ca, RESERVE_PRIO, true);
> - BUG_ON(bucket == -1);
> + bucket = bch_bucket_alloc(ca, RESERVE_PRIO, wait);
> + if (bucket == -1) {
> + if (!wait)
> + return -ENOMEM;
> + BUG_ON(1);
> + }

Coly,

looking more at this change, I think we should handle the failure path
properly or we may leak buckets, am I right? (sorry for not realizing
this before). Maybe we need something like the following on top of my
previous patch.

I'm going to run more stress tests with this patch applied and will try
to figure out if we're actually leaking buckets without it.

---
Subject: bcache: prevent leaking buckets in bch_prio_write()

Handle the allocation failure path properly in bch_prio_write() to avoid
leaking buckets from the previous successful iterations.

Signed-off-by: Andrea Righi 
---
 drivers/md/bcache/super.c | 20 

[PATCH v2] bcache: fix deadlock in bcache_allocator

2019-08-06 Thread Andrea Righi
bcache_allocator() can call the following:

 bch_allocator_thread()
  -> bch_prio_write()
 -> bch_bucket_alloc()
-> wait on &ca->set->bucket_wait

But the wake up event on bucket_wait is supposed to come from
bch_allocator_thread() itself => deadlock:

[ 1158.490744] INFO: task bcache_allocato:15861 blocked for more than 10 
seconds.
[ 1158.495929]   Not tainted 5.3.0-050300rc3-generic #201908042232
[ 1158.500653] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this 
message.
[ 1158.504413] bcache_allocato D0 15861  2 0x80004000
[ 1158.504419] Call Trace:
[ 1158.504429]  __schedule+0x2a8/0x670
[ 1158.504432]  schedule+0x2d/0x90
[ 1158.504448]  bch_bucket_alloc+0xe5/0x370 [bcache]
[ 1158.504453]  ? wait_woken+0x80/0x80
[ 1158.504466]  bch_prio_write+0x1dc/0x390 [bcache]
[ 1158.504476]  bch_allocator_thread+0x233/0x490 [bcache]
[ 1158.504491]  kthread+0x121/0x140
[ 1158.504503]  ? invalidate_buckets+0x890/0x890 [bcache]
[ 1158.504506]  ? kthread_park+0xb0/0xb0
[ 1158.504510]  ret_from_fork+0x35/0x40

Fix by making the call to bch_prio_write() non-blocking, so that
bch_allocator_thread() never waits on itself.

Moreover, make sure to wake up the garbage collector thread when
bch_prio_write() is failing to allocate buckets.

BugLink: https://bugs.launchpad.net/bugs/1784665
BugLink: https://bugs.launchpad.net/bugs/1796292
Signed-off-by: Andrea Righi 
---
Changes in v2:
 - prevent retry_invalidate busy loop in bch_allocator_thread()

 drivers/md/bcache/alloc.c  |  5 -
 drivers/md/bcache/bcache.h |  2 +-
 drivers/md/bcache/super.c  | 13 +
 3 files changed, 14 insertions(+), 6 deletions(-)

diff --git a/drivers/md/bcache/alloc.c b/drivers/md/bcache/alloc.c
index 6f776823b9ba..a1df0d95151c 100644
--- a/drivers/md/bcache/alloc.c
+++ b/drivers/md/bcache/alloc.c
@@ -377,7 +377,10 @@ static int bch_allocator_thread(void *arg)
if (!fifo_full(&ca->free_inc))
goto retry_invalidate;
 
-   bch_prio_write(ca);
+   if (bch_prio_write(ca, false) < 0) {
+   ca->invalidate_needs_gc = 1;
+   wake_up_gc(ca->set);
+   }
}
}
 out:
diff --git a/drivers/md/bcache/bcache.h b/drivers/md/bcache/bcache.h
index 013e35a9e317..deb924e1d790 100644
--- a/drivers/md/bcache/bcache.h
+++ b/drivers/md/bcache/bcache.h
@@ -977,7 +977,7 @@ bool bch_cached_dev_error(struct cached_dev *dc);
 __printf(2, 3)
 bool bch_cache_set_error(struct cache_set *c, const char *fmt, ...);
 
-void bch_prio_write(struct cache *ca);
+int bch_prio_write(struct cache *ca, bool wait);
 void bch_write_bdev_super(struct cached_dev *dc, struct closure *parent);
 
 extern struct workqueue_struct *bcache_wq;
diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c
index 20ed838e9413..716ea272fb55 100644
--- a/drivers/md/bcache/super.c
+++ b/drivers/md/bcache/super.c
@@ -529,7 +529,7 @@ static void prio_io(struct cache *ca, uint64_t bucket, int 
op,
closure_sync(cl);
 }
 
-void bch_prio_write(struct cache *ca)
+int bch_prio_write(struct cache *ca, bool wait)
 {
int i;
struct bucket *b;
@@ -564,8 +564,12 @@ void bch_prio_write(struct cache *ca)
p->magic = pset_magic(&ca->sb);
p->csum = bch_crc64(&p->magic, bucket_bytes(ca) - 8);
 
-   bucket = bch_bucket_alloc(ca, RESERVE_PRIO, true);
-   BUG_ON(bucket == -1);
+   bucket = bch_bucket_alloc(ca, RESERVE_PRIO, wait);
+   if (bucket == -1) {
+   if (!wait)
+   return -ENOMEM;
+   BUG_ON(1);
+   }
 
mutex_unlock(&ca->set->bucket_lock);
prio_io(ca, bucket, REQ_OP_WRITE, 0);
@@ -593,6 +597,7 @@ void bch_prio_write(struct cache *ca)
 
ca->prio_last_buckets[i] = ca->prio_buckets[i];
}
+   return 0;
 }
 
 static void prio_read(struct cache *ca, uint64_t bucket)
@@ -1954,7 +1959,7 @@ static int run_cache_set(struct cache_set *c)
 
mutex_lock(&c->bucket_lock);
for_each_cache(ca, c, i)
-   bch_prio_write(ca);
+   bch_prio_write(ca, true);
mutex_unlock(&c->bucket_lock);
 
err = "cannot allocate new UUID bucket";
-- 
2.20.1



Re: [PATCH] bcache: fix deadlock in bcache_allocator()

2019-08-06 Thread Andrea Righi
On Wed, Jul 10, 2019 at 05:46:56PM +0200, Andrea Righi wrote:
> On Wed, Jul 10, 2019 at 11:11:37PM +0800, Coly Li wrote:
> > On 2019/7/10 5:31 下午, Andrea Righi wrote:
> > > bcache_allocator() can call the following:
> > > 
> > >  bch_allocator_thread()
> > >   -> bch_prio_write()
> > >  -> bch_bucket_alloc()
> > > -> wait on &ca->set->bucket_wait
> > > 
> > > But the wake up event on bucket_wait is supposed to come from
> > > bch_allocator_thread() itself => deadlock:
> > > 
> > >  [ 242.888435] INFO: task bcache_allocato:9015 blocked for more than 120 
> > > seconds.
> > >  [ 242.893786] Not tainted 4.20.0-042000rc3-generic #201811182231
> > >  [ 242.896669] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" 
> > > disables this message.
> > >  [ 242.900428] bcache_allocato D 0 9015 2 0x8000
> > >  [ 242.900434] Call Trace:
> > >  [ 242.900448] __schedule+0x2a2/0x880
> > >  [ 242.900455] ? __schedule+0x2aa/0x880
> > >  [ 242.900462] schedule+0x2c/0x80
> > >  [ 242.900480] bch_bucket_alloc+0x19d/0x380 [bcache]
> > >  [ 242.900503] ? wait_woken+0x80/0x80
> > >  [ 242.900519] bch_prio_write+0x190/0x340 [bcache]
> > >  [ 242.900530] bch_allocator_thread+0x482/0xd10 [bcache]
> > >  [ 242.900535] kthread+0x120/0x140
> > >  [ 242.900546] ? bch_invalidate_one_bucket+0x80/0x80 [bcache]
> > >  [ 242.900549] ? kthread_park+0x90/0x90
> > >  [ 242.900554] ret_from_fork+0x35/0x40
> > > 
> > > Fix by making the call to bch_prio_write() non-blocking, so that
> > > bch_allocator_thread() never waits on itself.
> > > 
> > > Moreover, make sure to wake up the garbage collector thread when
> > > bch_prio_write() is failing to allocate buckets.
> > > 
> > > BugLink: https://bugs.launchpad.net/bugs/1784665
> > > Signed-off-by: Andrea Righi 
> > 
> > Hi Andrea,
> > 
> 
> Hi Coly,
> 
> > From the BugLink, it seems several critical bcache fixes are missing.
> > Could you please to try current 5.3-rc kernel, and try whether such
> > problem exists or not ?
> 
> Sure, I'll do a test with the latest 5.3-rc kernel. I just wanna mention
> that I've been able to reproduce this problem after backporting all the
> fixes (even those from linux-next), but I agree that testing 5.3-rc is a
> better idea (I may have introduced bugs while backporting stuff).

Finally I've been able to do a test with the latest 5.3.0-rc3 vanilla
kernel (from today's Linus git) and I confirm that I can reproduce the
same deadlock issue:

[ 1158.490744] INFO: task bcache_allocato:15861 blocked for more than 120 
seconds.
[ 1158.495929]   Not tainted 5.3.0-050300rc3-generic #201908042232
[ 1158.500653] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this 
message.
[ 1158.504413] bcache_allocato D0 15861  2 0x80004000
[ 1158.504419] Call Trace:
[ 1158.504429]  __schedule+0x2a8/0x670
[ 1158.504432]  schedule+0x2d/0x90
[ 1158.504448]  bch_bucket_alloc+0xe5/0x370 [bcache]
[ 1158.504453]  ? wait_woken+0x80/0x80
[ 1158.504466]  bch_prio_write+0x1dc/0x390 [bcache]
[ 1158.504476]  bch_allocator_thread+0x233/0x490 [bcache]
[ 1158.504491]  kthread+0x121/0x140
[ 1158.504503]  ? invalidate_buckets+0x890/0x890 [bcache]
[ 1158.504506]  ? kthread_park+0xb0/0xb0
[ 1158.504510]  ret_from_fork+0x35/0x40

[ 1158.473567] INFO: task python3:13282 blocked for more than 120 seconds.
[ 1158.479846]   Not tainted 5.3.0-050300rc3-generic #201908042232
[ 1158.484503] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this 
message.
[ 1158.490237] python3 D0 13282  13274 0x4000
[ 1158.490246] Call Trace:
[ 1158.490347]  __schedule+0x2a8/0x670
[ 1158.490360]  ? __switch_to_asm+0x40/0x70
[ 1158.490365]  schedule+0x2d/0x90
[ 1158.490433]  bch_bucket_alloc+0xe5/0x370 [bcache]
[ 1158.490468]  ? wait_woken+0x80/0x80
[ 1158.490484]  __bch_bucket_alloc_set+0x10d/0x160 [bcache]
[ 1158.490497]  bch_bucket_alloc_set+0x4e/0x70 [bcache]
[ 1158.490519]  __uuid_write+0x61/0x180 [bcache]
[ 1158.490538]  ? __write_super+0x154/0x190 [bcache]
[ 1158.490556]  bch_uuid_write+0x16/0x40 [bcache]
[ 1158.490573]  __cached_dev_store+0x668/0x8c0 [bcache]
[ 1158.490592]  bch_cached_dev_store+0x46/0x110 [bcache]
[ 1158.490623]  sysfs_kf_write+0x3c/0x50
[ 1158.490631]  kernfs_fop_write+0x125/0x1a0
[ 1158.490648]  __vfs_write+0x1b/0x40
[ 1158.490654]  vfs_write+0xb1/0x1a0
[ 1158.490658]  ksys_write+0xa7/0xe0
[ 1158.490663]  __x64_sys_write+0x1a/0x20
[ 1158.490675]  do_syscall_64+0x5a/0x130
[ 1158.490685]  entry_SYSCALL_64_after_hwframe+0x44/0xa9

A better reproducer has been posted here:
https://launchpadlibrarian.net/435523192/curtin-nvme.sh

(see https://bugs.launchpad.net/curtin/+bug/1796292 for more details)

With this new reproducer script it is very easy to hit the deadlock.

I've slightly modified my original patch and with that applied it seems
that I can't trigger any problem. I'm not sure if my patch is actually
the right thing to do, but it seems to prevent the deadlock from
happening.

I'll send a v2 soon.

-Andrea


Re: [PATCH] bcache: fix deadlock in bcache_allocator()

2019-07-10 Thread Andrea Righi
On Wed, Jul 10, 2019 at 11:11:37PM +0800, Coly Li wrote:
> On 2019/7/10 5:31 下午, Andrea Righi wrote:
> > bcache_allocator() can call the following:
> > 
> >  bch_allocator_thread()
> >   -> bch_prio_write()
> >  -> bch_bucket_alloc()
> > -> wait on &ca->set->bucket_wait
> > 
> > But the wake up event on bucket_wait is supposed to come from
> > bch_allocator_thread() itself => deadlock:
> > 
> >  [ 242.888435] INFO: task bcache_allocato:9015 blocked for more than 120 
> > seconds.
> >  [ 242.893786] Not tainted 4.20.0-042000rc3-generic #201811182231
> >  [ 242.896669] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables 
> > this message.
> >  [ 242.900428] bcache_allocato D 0 9015 2 0x8000
> >  [ 242.900434] Call Trace:
> >  [ 242.900448] __schedule+0x2a2/0x880
> >  [ 242.900455] ? __schedule+0x2aa/0x880
> >  [ 242.900462] schedule+0x2c/0x80
> >  [ 242.900480] bch_bucket_alloc+0x19d/0x380 [bcache]
> >  [ 242.900503] ? wait_woken+0x80/0x80
> >  [ 242.900519] bch_prio_write+0x190/0x340 [bcache]
> >  [ 242.900530] bch_allocator_thread+0x482/0xd10 [bcache]
> >  [ 242.900535] kthread+0x120/0x140
> >  [ 242.900546] ? bch_invalidate_one_bucket+0x80/0x80 [bcache]
> >  [ 242.900549] ? kthread_park+0x90/0x90
> >  [ 242.900554] ret_from_fork+0x35/0x40
> > 
> > Fix by making the call to bch_prio_write() non-blocking, so that
> > bch_allocator_thread() never waits on itself.
> > 
> > Moreover, make sure to wake up the garbage collector thread when
> > bch_prio_write() is failing to allocate buckets.
> > 
> > BugLink: https://bugs.launchpad.net/bugs/1784665
> > Signed-off-by: Andrea Righi 
> 
> Hi Andrea,
> 

Hi Coly,

> From the BugLink, it seems several critical bcache fixes are missing.
> Could you please to try current 5.3-rc kernel, and try whether such
> problem exists or not ?

Sure, I'll do a test with the latest 5.3-rc kernel. I just wanna mention
that I've been able to reproduce this problem after backporting all the
fixes (even those from linux-next), but I agree that testing 5.3-rc is a
better idea (I may have introduced bugs while backporting stuff).

> 
> For this patch itself, it looks good except that I am not sure whether
> invoking garbage collection is a proper method. Because bch_prio_write()
> is called right after garbage collection gets done, jump back to
> retry_invalidate: again may just hide a non-space long time waiting
> condition.

Honestly I was thinking the same, but if I don't call the garbage
collector bch_allocator_thread() gets stuck forever (or for a very very
long time) in the retry_invalidate loop...

> 
> Could you please give me some hint, on how to reproduce such hang
> timeout situation. If I am lucky to reproduce such problem on 5.3-rc
> kernel, it may be very helpful to understand what exact problem your
> patch fixes.

Fortunately I have a reproducer, here's the script that I'm using:

---
#!/bin/bash -x

BACKING=/sys/class/block/bcache0
CACHE=/sys/fs/bcache/*-*-*
while true; do
echo "1" | tee ${BACKING}/bcache/stop
echo "1" | tee ${CACHE}/stop
udevadm settle
[ ! -e "${BACKING}" -a ! -e "${CACHE}" ] && break
sleep 1
done
wipefs --all --force /dev/vdc2
wipefs --all --force /dev/vdc1
wipefs --all --force /dev/vdc
wipefs --all --force /dev/vdd
blockdev --rereadpt /dev/vdc
blockdev --rereadpt /dev/vdd
udevadm settle

# create ext4 fs over bcache
parted /dev/vdc --script mklabel msdos || exit 1
udevadm settle --exit-if-exists=/dev/vdc
parted /dev/vdc --script mkpart primary 2048s 2047999s || exit 1
udevadm settle --exit-if-exists=/dev/vdc1
parted /dev/vdc --script mkpart primary 2048000s 20922367s || exit 1
udevadm settle --exit-if-exists=/dev/vdc2
make-bcache -C /dev/vdd || exit 1
while true; do
udevadm settle
CSET=`ls /sys/fs/bcache | grep -- -`
[ -n "$CSET" ] && break;
sleep 1
done
make-bcache -B /dev/vdc2 || exit 1
while true; do
udevadm settle
[ -e "${BACKING}" ] && break
sleep 1;
done
echo $CSET | tee ${BACKING}/bcache/attach
udevadm settle --exit-if-exists=/dev/bcache0
bcache-super-show /dev/vdc2
udevadm settle
mkfs.ext4 -F -L boot-fs -U e9f00d20-95a0-11e8-82a2-525400123401 /dev/vdc1
udevadm settle
mkfs.ext4 -F -L root-fs -U e9f00d21-95a0-11e8-82a2-525400123401 /dev/bcache0 || 
exit 1
blkid
---

I just run this as root in a busy loop (something like
`while :; do ./test.sh; done`) on a kvm instance with two extra disks
(in addition to the root disk).

The extra disks are created as following:

 qemu-img create -f qcow2 disk1.qcow 10G
 qemu-img create -f qcow2 disk2.qcow 2G

I'm using these particular sizes, but I think we can reproduce the same
problem also using different sizes.

Thanks,
-Andrea


[PATCH] bcache: fix deadlock in bcache_allocator()

2019-07-10 Thread Andrea Righi
bcache_allocator() can call the following:

 bch_allocator_thread()
  -> bch_prio_write()
 -> bch_bucket_alloc()
-> wait on &ca->set->bucket_wait

But the wake up event on bucket_wait is supposed to come from
bch_allocator_thread() itself => deadlock:

 [ 242.888435] INFO: task bcache_allocato:9015 blocked for more than 120 
seconds.
 [ 242.893786] Not tainted 4.20.0-042000rc3-generic #201811182231
 [ 242.896669] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this 
message.
 [ 242.900428] bcache_allocato D 0 9015 2 0x8000
 [ 242.900434] Call Trace:
 [ 242.900448] __schedule+0x2a2/0x880
 [ 242.900455] ? __schedule+0x2aa/0x880
 [ 242.900462] schedule+0x2c/0x80
 [ 242.900480] bch_bucket_alloc+0x19d/0x380 [bcache]
 [ 242.900503] ? wait_woken+0x80/0x80
 [ 242.900519] bch_prio_write+0x190/0x340 [bcache]
 [ 242.900530] bch_allocator_thread+0x482/0xd10 [bcache]
 [ 242.900535] kthread+0x120/0x140
 [ 242.900546] ? bch_invalidate_one_bucket+0x80/0x80 [bcache]
 [ 242.900549] ? kthread_park+0x90/0x90
 [ 242.900554] ret_from_fork+0x35/0x40

Fix by making the call to bch_prio_write() non-blocking, so that
bch_allocator_thread() never waits on itself.

Moreover, make sure to wake up the garbage collector thread when
bch_prio_write() is failing to allocate buckets.

BugLink: https://bugs.launchpad.net/bugs/1784665
Signed-off-by: Andrea Righi 
---
 drivers/md/bcache/alloc.c  |  6 +-
 drivers/md/bcache/bcache.h |  2 +-
 drivers/md/bcache/super.c  | 13 +
 3 files changed, 15 insertions(+), 6 deletions(-)

diff --git a/drivers/md/bcache/alloc.c b/drivers/md/bcache/alloc.c
index f8986effcb50..0797587600c7 100644
--- a/drivers/md/bcache/alloc.c
+++ b/drivers/md/bcache/alloc.c
@@ -377,7 +377,11 @@ static int bch_allocator_thread(void *arg)
if (!fifo_full(&ca->free_inc))
goto retry_invalidate;
 
-   bch_prio_write(ca);
+   if (bch_prio_write(ca, false) < 0) {
+   ca->invalidate_needs_gc = 1;
+   wake_up_gc(ca->set);
+   goto retry_invalidate;
+   }
}
}
 out:
diff --git a/drivers/md/bcache/bcache.h b/drivers/md/bcache/bcache.h
index fdf75352e16a..dc5106b21260 100644
--- a/drivers/md/bcache/bcache.h
+++ b/drivers/md/bcache/bcache.h
@@ -979,7 +979,7 @@ bool bch_cached_dev_error(struct cached_dev *dc);
 __printf(2, 3)
 bool bch_cache_set_error(struct cache_set *c, const char *fmt, ...);
 
-void bch_prio_write(struct cache *ca);
+int bch_prio_write(struct cache *ca, bool wait);
 void bch_write_bdev_super(struct cached_dev *dc, struct closure *parent);
 
 extern struct workqueue_struct *bcache_wq;
diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c
index 1b63ac876169..6598b457df1a 100644
--- a/drivers/md/bcache/super.c
+++ b/drivers/md/bcache/super.c
@@ -525,7 +525,7 @@ static void prio_io(struct cache *ca, uint64_t bucket, int 
op,
closure_sync(cl);
 }
 
-void bch_prio_write(struct cache *ca)
+int bch_prio_write(struct cache *ca, bool wait)
 {
int i;
struct bucket *b;
@@ -560,8 +560,12 @@ void bch_prio_write(struct cache *ca)
p->magic = pset_magic(&ca->sb);
p->csum = bch_crc64(&p->magic, bucket_bytes(ca) - 8);
 
-   bucket = bch_bucket_alloc(ca, RESERVE_PRIO, true);
-   BUG_ON(bucket == -1);
+   bucket = bch_bucket_alloc(ca, RESERVE_PRIO, wait);
+   if (bucket == -1) {
+   if (!wait)
+   return -ENOMEM;
+   BUG_ON(1);
+   }
 
mutex_unlock(&ca->set->bucket_lock);
prio_io(ca, bucket, REQ_OP_WRITE, 0);
@@ -589,6 +593,7 @@ void bch_prio_write(struct cache *ca)
 
ca->prio_last_buckets[i] = ca->prio_buckets[i];
}
+   return 0;
 }
 
 static void prio_read(struct cache *ca, uint64_t bucket)
@@ -1903,7 +1908,7 @@ static int run_cache_set(struct cache_set *c)
 
mutex_lock(&c->bucket_lock);
for_each_cache(ca, c, i)
-   bch_prio_write(ca);
+   bch_prio_write(ca, true);
mutex_unlock(&c->bucket_lock);
 
err = "cannot allocate new UUID bucket";
-- 
2.20.1



[PATCH v2] openvswitch: fix flow actions reallocation

2019-03-28 Thread Andrea Righi
The flow action buffer can be resized if it's not big enough to contain
all the requested flow actions. However, this resize doesn't take into
account the new requested size: the buffer is only increased by a factor
of 2x. This might not be enough to contain the new data, causing a
buffer overflow, for example:

[   42.044472] =============================================================================
[   42.045608] BUG kmalloc-96 (Not tainted): Redzone overwritten
[   42.046415] -----------------------------------------------------------------------------

[   42.047715] Disabling lock debugging due to kernel taint
[   42.047716] INFO: 0x8bf2c4a5-0x720c0928. First byte 0x0 instead of 0xcc
[   42.048677] INFO: Slab 0xbc6d2040 objects=29 used=18 fp=0xdc07dec4 flags=0x2808101
[   42.049743] INFO: Object 0xd53a3464 @offset=2528 fp=0xccdcdebb

[   42.050747] Redzone 76f1b237: cc cc cc cc cc cc cc cc                          ........
[   42.051839] Object d53a3464: 6b 6b 6b 6b 6b 6b 6b 6b 0c 00 00 00 6c 00 00 00  kkkkkkkk....l...
[   42.053015] Object f49a30cc: 6c 00 0c 00 00 00 00 00 00 00 00 03 78 a3 15 f6  l...........x...
[   42.054203] Object acfe4220: 20 00 02 00 ff ff ff ff 00 00 00 00 00 00 00 00   ...............
[   42.055370] Object 21024e91: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[   42.056541] Object 070e04c3: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[   42.057797] Object 948a777a: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[   42.059061] Redzone 8bf2c4a5: 00 00 00 00                                      ....
[   42.060189] Padding a681b46e: 5a 5a 5a 5a 5a 5a 5a 5a                          ZZZZZZZZ

Fix by making sure the new buffer is properly resized to contain all the
requested data.
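
To see why plain doubling can be insufficient, here is a small standalone
illustration; the sizes below are hypothetical, chosen only to show the
effect, and are not taken from the report above:

    #include <stddef.h>
    #include <stdio.h>

    #define MAX(a, b) ((a) > (b) ? (a) : (b))

    int main(void)
    {
            /* Hypothetical values, for illustration only. */
            size_t cur_size    = 96;   /* ksize(*sfa): object from kmalloc-96   */
            size_t next_offset = 88;   /* header + actions already stored       */
            size_t req_size    = 112;  /* NLA_ALIGN(attr_len) of the new action */

            size_t old_alloc = cur_size * 2;                              /* 192 */
            size_t new_alloc = MAX(next_offset + req_size, cur_size * 2); /* 200 */

            printf("needed %zu, old sizing %zu, fixed sizing %zu\n",
                   next_offset + req_size, old_alloc, new_alloc);
            return 0;
    }

With the old sizing the 200 bytes of actions end up in a 192-byte object,
which is exactly the kind of redzone overwrite reported above.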

BugLink: https://bugs.launchpad.net/bugs/1813244
Signed-off-by: Andrea Righi 
---
Changes in v2:
 - correctly resize to current_size+req_size (thanks to Pravin)

 net/openvswitch/flow_netlink.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/openvswitch/flow_netlink.c b/net/openvswitch/flow_netlink.c
index 691da853bef5..4bdf5e3ac208 100644
--- a/net/openvswitch/flow_netlink.c
+++ b/net/openvswitch/flow_netlink.c
@@ -2306,14 +2306,14 @@ static struct nlattr *reserve_sfa_size(struct 
sw_flow_actions **sfa,
 
struct sw_flow_actions *acts;
int new_acts_size;
-   int req_size = NLA_ALIGN(attr_len);
+   size_t req_size = NLA_ALIGN(attr_len);
int next_offset = offsetof(struct sw_flow_actions, actions) +
(*sfa)->actions_len;
 
if (req_size <= (ksize(*sfa) - next_offset))
goto out;
 
-   new_acts_size = ksize(*sfa) * 2;
+   new_acts_size = max(next_offset + req_size, ksize(*sfa) * 2);
 
if (new_acts_size > MAX_ACTIONS_BUFSIZE) {
if ((MAX_ACTIONS_BUFSIZE - next_offset) < req_size) {
-- 
2.19.1



Re: [PATCH -tip v3 04/10] x86/kprobes: Prohibit probing on IRQ handlers directly

2019-03-26 Thread Andrea Righi
On Tue, Mar 26, 2019 at 11:50:52PM +0900, Masami Hiramatsu wrote:
> On Mon, 25 Mar 2019 17:23:34 -0400
> Steven Rostedt  wrote:
> 
> > On Wed, 13 Feb 2019 01:12:44 +0900
> > Masami Hiramatsu  wrote:
> > 
> > > Prohibit probing on IRQ handlers in irqentry_text because
> > > if it interrupts user mode, at that point we haven't changed
> > > to kernel space yet and which eventually leads a double fault.
> > > E.g.
> > > 
> > >  # echo p apic_timer_interrupt > kprobe_events
> > 
> > Hmm, this breaks one of my tests (which I probe on do_IRQ).
> 
> OK, it seems this patch is a bit redundant, because
> I found that these interrupt handler issue has been fixed
> by Andrea's commit before merge this patch.
> 
> commit a50480cb6d61d5c5fc13308479407b628b6bc1c5
> Author: Andrea Righi 
> Date:   Thu Dec 6 10:56:48 2018 +0100
> 
> kprobes/x86: Blacklist non-attachable interrupt functions
> 
> These interrupt functions are already non-attachable by kprobes.
> Blacklist them explicitly so that they can show up in
> /sys/kernel/debug/kprobes/blacklist and tools like BCC can use this
> additional information.
> 
> This description is a bit odd (maybe his patch is after mine?) I think
> while updating this series, the patches were merged out of order.
> Anyway, with above patch, the core problematic probe points are blacklisted.

This is the previous thread when I posted my patch (not sure if it helps
to figure out what happened - maybe it was just an out of order merge
issue, like you said):

https://lkml.org/lkml/2018/12/6/212

> 
> > 
> > It's been working for years.
> > 
> > 
> > >  # echo 1 > events/kprobes/enable
> > >  PANIC: double fault, error_code: 0x0
> > >  CPU: 1 PID: 814 Comm: less Not tainted 4.20.0-rc3+ #30
> > >  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996)
> > >  RIP: 0010:error_entry+0x12/0xf0
> > >  [snip]
> > >  Call Trace:
> > >   
> > >   ? native_iret+0x7/0x7
> > >   ? async_page_fault+0x8/0x30
> > >   ? trace_hardirqs_on_thunk+0x1c/0x1c
> > >   ? error_entry+0x7c/0xf0
> > >   ? async_page_fault+0x8/0x30
> > >   ? native_iret+0x7/0x7
> > >   ? int3+0xa/0x20
> > >   ? trace_hardirqs_on_thunk+0x1c/0x1c
> > >   ? error_entry+0x7c/0xf0
> > >   ? int3+0xa/0x20
> > >   ? apic_timer_interrupt+0x1/0x20
> > >   
> > >  Kernel panic - not syncing: Machine halted.
> > >  Kernel Offset: disabled
> > 
> > I'm not able to reproduce this (by removing this commit). 
> 
> I ensured that if I revert both of this patch and Andrea's patch,
> I can reproduce this with probing on apic_timer_interrupt().
> 
> > I'm thinking something else may have changed, as I've been tracing
> > interrupt entries for years, and interrupting userspace while doing
> > this.
> > 
> > I've even added probes where ftrace isn't (where it uses an int3) and
> > still haven't hit a problem.
> > 
> > I think this patch is swatting a symptom of a bug and not addressing
> > the bug itself. Can you send me the config that triggers this?
> 
> Yes, it seems you're right. Andrea's commit specifically fixed the
> issue and mine is redundant. (I'm not sure why do_IRQ is in 
> __irqentry_text...)

Not sure if there are specific reasons for that, but do_IRQ is part of
__irqentry_text because it's explicitly marked with __irq_entry.
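
For reference, __irq_entry is only a section marker: when the function
graph tracer or KASAN is enabled it expands to
__attribute__((__section__(".irqentry.text"))), so the handler body lands
between __irqentry_text_start and __irqentry_text_end, which is the range
the patch above blacklists. The x86 declaration looks roughly like this
(sketch from memory of the v5.x sources, not an exact quote):

    /* arch/x86/kernel/irq.c */
    __visible unsigned int __irq_entry do_IRQ(struct pt_regs *regs);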

> 
> So, Ingo, please revert this, since this bug already has been fixed by
> commit a50480cb6d61 ("kprobes: x86_64: blacklist non-attachable interrupt
> functions")
> 
> BTW, for further error investigation, I attached my kconfig which is
> usually I'm testing (some options can be changed) on Qemu.
> I'm using my mini-container shellscript ( https://github.com/mhiramat/mincs 
> ) which supports qemu-container.
> 
> 
> Thank you,
> 
> -- 
> Masami Hiramatsu 

Thanks,
-Andrea


[PATCH v2] btrfs: raid56: properly unmap parity page in finish_parity_scrub()

2019-03-14 Thread Andrea Righi
Parity page is incorrectly unmapped in finish_parity_scrub(), triggering
a reference counter bug on i386, i.e.:

 [ 157.662401] kernel BUG at mm/highmem.c:349!
 [ 157.666725] invalid opcode:  [#1] SMP PTI

The reason is that kunmap(p_page) was completely left out, so we never
did an unmap for the p_page and the loop unmapping the rbio page was
iterating over the wrong number of stripes: unmapping should be done
with nr_data instead of rbio->real_stripes.

Test case to reproduce the bug:

 - create a raid5 btrfs filesystem:
   # mkfs.btrfs -m raid5 -d raid5 /dev/sdb /dev/sdc /dev/sdd /dev/sde

 - mount it:
   # mount /dev/sdb /mnt

 - run btrfs scrub in a loop:
   # while :; do btrfs scrub start -BR /mnt; done

BugLink: https://bugs.launchpad.net/bugs/1812845
Reviewed-by: Johannes Thumshirn 
Signed-off-by: Andrea Righi 
---
Changes in v2:
 - added a better description about this fix (thanks to Johannes)

 fs/btrfs/raid56.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c
index 1869ba8e5981..67a6f7d47402 100644
--- a/fs/btrfs/raid56.c
+++ b/fs/btrfs/raid56.c
@@ -2430,8 +2430,9 @@ static noinline void finish_parity_scrub(struct 
btrfs_raid_bio *rbio,
bitmap_clear(rbio->dbitmap, pagenr, 1);
kunmap(p);
 
-   for (stripe = 0; stripe < rbio->real_stripes; stripe++)
+   for (stripe = 0; stripe < nr_data; stripe++)
kunmap(page_in_rbio(rbio, stripe, pagenr, 0));
+   kunmap(p_page);
}
 
__free_page(p_page);
-- 
2.19.1



[PATCH v4] blkcg: prevent priority inversion problem during sync()

2019-03-09 Thread Andrea Righi
When sync(2) is executed from a high-priority cgroup, the process is
forced to wait for the completion of the entire outstanding writeback
I/O, potentially even I/O that was originally generated by low-priority
cgroups.

This may cause massive latencies to random processes (even those running
in the root cgroup) that shouldn't be I/O-throttled at all, similarly to
a classic priority inversion problem.

Prevent this problem by saving a list of blkcg's that are waiting for
writeback: every time a sync(2) is executed the current blkcg is added
to the list.

Then, when I/O is throttled, if there's a blkcg waiting for writeback
different than the current blkcg, no throttling is applied (we can
probably refine this logic later, i.e., a better policy could be to
adjust the I/O rate using the blkcg with the highest speed from the list
of waiters).
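
A rough way to see the inversion in action (device numbers, paths and
limits below are only illustrative):

  # let the io controller throttle a low-priority group to 1 MiB/s on sdb (8:16)
  echo +io > /sys/fs/cgroup/cgroup.subtree_control
  mkdir /sys/fs/cgroup/lowprio
  echo "8:16 wbps=1048576" > /sys/fs/cgroup/lowprio/io.max

  # from a shell inside the throttled group: dirty a lot of pages
  echo $$ > /sys/fs/cgroup/lowprio/cgroup.procs
  dd if=/dev/zero of=/mnt/dirty bs=1M count=2048 &

  # from a shell in any other (unthrottled) cgroup: the sync(2) below now
  # has to wait for the throttled writeback to complete
  time sync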

See also:
  https://lkml.org/lkml/2019/3/7/640

Signed-off-by: Andrea Righi 
---
Changes in v4:
  - fix a build bug when CONFIG_BLOCK is unset

 block/blk-cgroup.c   | 130 +++
 block/blk-throttle.c |  11 ++-
 fs/fs-writeback.c|   5 ++
 fs/sync.c|   8 +-
 include/linux/backing-dev-defs.h |   2 +
 include/linux/blk-cgroup.h   |  35 -
 mm/backing-dev.c |   2 +
 7 files changed, 188 insertions(+), 5 deletions(-)

diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index 77f37ef8ef06..5334cb3acd22 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -1351,6 +1351,136 @@ struct cgroup_subsys io_cgrp_subsys = {
 };
 EXPORT_SYMBOL_GPL(io_cgrp_subsys);
 
+#ifdef CONFIG_CGROUP_WRITEBACK
+struct blkcg_wb_sleeper {
+   struct blkcg *blkcg;
+   refcount_t refcnt;
+   struct list_head node;
+};
+
+static struct blkcg_wb_sleeper *
+blkcg_wb_sleeper_find(struct blkcg *blkcg, struct backing_dev_info *bdi)
+{
+   struct blkcg_wb_sleeper *bws;
+
+   list_for_each_entry(bws, &bdi->cgwb_waiters, node)
+   if (bws->blkcg == blkcg)
+   return bws;
+   return NULL;
+}
+
+static void
+blkcg_wb_sleeper_add(struct backing_dev_info *bdi, struct blkcg_wb_sleeper 
*bws)
+{
+   list_add(&bws->node, &bdi->cgwb_waiters);
+}
+
+static void
+blkcg_wb_sleeper_del(struct backing_dev_info *bdi, struct blkcg_wb_sleeper 
*bws)
+{
+   list_del_init(&bws->node);
+}
+
+/**
+ * blkcg_wb_waiters_on_bdi - check for writeback waiters on a block device
+ * @blkcg: current blkcg cgroup
+ * @bdi: block device to check
+ *
+ * Return true if any other blkcg different than the current one is waiting for
+ * writeback on the target block device, false otherwise.
+ */
+bool blkcg_wb_waiters_on_bdi(struct blkcg *blkcg, struct backing_dev_info *bdi)
+{
+   struct blkcg_wb_sleeper *bws;
+   bool ret = false;
+
+   if (likely(list_empty(&bdi->cgwb_waiters)))
+   return false;
+   spin_lock(&bdi->cgwb_waiters_lock);
+   list_for_each_entry(bws, &bdi->cgwb_waiters, node)
+   if (bws->blkcg != blkcg) {
+   ret = true;
+   break;
+   }
+   spin_unlock(&bdi->cgwb_waiters_lock);
+
+   return ret;
+}
+
+/**
+ * blkcg_start_wb_wait_on_bdi - add current blkcg to writeback waiters list
+ * @bdi: target block device
+ *
+ * Add current blkcg to the list of writeback waiters on target block device.
+ */
+void blkcg_start_wb_wait_on_bdi(struct backing_dev_info *bdi)
+{
+   struct blkcg_wb_sleeper *new_bws, *bws;
+   struct blkcg *blkcg;
+
+   new_bws = kzalloc(sizeof(*new_bws), GFP_KERNEL);
+   if (unlikely(!new_bws))
+   return;
+
+   rcu_read_lock();
+   blkcg = blkcg_from_current();
+   if (likely(blkcg)) {
+   /* Check if blkcg is already sleeping on bdi */
+   spin_lock_bh(&bdi->cgwb_waiters_lock);
+   bws = blkcg_wb_sleeper_find(blkcg, bdi);
+   if (bws) {
+   refcount_inc(&bws->refcnt);
+   } else {
+   /* Add current blkcg as a new wb sleeper on bdi */
+   css_get(&blkcg->css);
+   new_bws->blkcg = blkcg;
+   refcount_set(&new_bws->refcnt, 1);
+   blkcg_wb_sleeper_add(bdi, new_bws);
+   new_bws = NULL;
+   }
+   spin_unlock_bh(&bdi->cgwb_waiters_lock);
+   }
+   rcu_read_unlock();
+
+   kfree(new_bws);
+}
+
+/**
+ * blkcg_stop_wb_wait_on_bdi - remove current blkcg from writeback waiters list
+ * @bdi: target block device
+ *
+ * Remove current blkcg from the list of writeback waiters on target block
+ * device.
+ */
+void blkcg_stop_wb_wait_on_bdi(struct backing_dev_info *bdi)
+{
+   struct blkcg_wb_sleeper *bws = NULL;
+   struct blkcg *blkcg;
+
+   rcu_read_lock();
+   blkcg = blkcg_from_current();
+   if (!blkcg) {
+   rcu_read_unlock();
+ 

[PATCH v3] blkcg: prevent priority inversion problem during sync()

2019-03-08 Thread Andrea Righi
When sync(2) is executed from a high-priority cgroup, the process is
forced to wait for the completion of the entire outstanding writeback
I/O, potentially even I/O that was originally generated by low-priority
cgroups.

This may cause massive latencies to random processes (even those running
in the root cgroup) that shouldn't be I/O-throttled at all, similarly to
a classic priority inversion problem.

Prevent this problem by saving a list of blkcg's that are waiting for
writeback: every time a sync(2) is executed the current blkcg is added
to the list.

Then, when I/O is throttled, if there's a blkcg waiting for writeback
different than the current blkcg, no throttling is applied (we can
probably refine this logic later, i.e., a better policy could be to
adjust the I/O rate using the blkcg with the highest speed from the list
of waiters).

See also:
  https://lkml.org/lkml/2019/3/7/640

Signed-off-by: Andrea Righi 
---
Changes in v3:
 - drop sync(2) isolation patches (this will be addressed by another
   patch, potentially operating at the fs namespace level)
 - use a per-bdi lock and a per-bdi list instead of a global lock and a
   global list to save the list of sync(2) waiters

 block/blk-cgroup.c   | 130 +++
 block/blk-throttle.c |  11 ++-
 fs/fs-writeback.c|   5 ++
 fs/sync.c|   8 +-
 include/linux/backing-dev-defs.h |   2 +
 include/linux/blk-cgroup.h   |  25 ++
 mm/backing-dev.c |   2 +
 7 files changed, 179 insertions(+), 4 deletions(-)

diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index 2bed5725aa03..b380d678cfc2 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -1351,6 +1351,136 @@ struct cgroup_subsys io_cgrp_subsys = {
 };
 EXPORT_SYMBOL_GPL(io_cgrp_subsys);
 
+#ifdef CONFIG_CGROUP_WRITEBACK
+struct blkcg_wb_sleeper {
+   struct blkcg *blkcg;
+   refcount_t refcnt;
+   struct list_head node;
+};
+
+static struct blkcg_wb_sleeper *
+blkcg_wb_sleeper_find(struct blkcg *blkcg, struct backing_dev_info *bdi)
+{
+   struct blkcg_wb_sleeper *bws;
+
+   list_for_each_entry(bws, &bdi->cgwb_waiters, node)
+   if (bws->blkcg == blkcg)
+   return bws;
+   return NULL;
+}
+
+static void
+blkcg_wb_sleeper_add(struct backing_dev_info *bdi, struct blkcg_wb_sleeper 
*bws)
+{
+   list_add(&bws->node, &bdi->cgwb_waiters);
+}
+
+static void
+blkcg_wb_sleeper_del(struct backing_dev_info *bdi, struct blkcg_wb_sleeper 
*bws)
+{
+   list_del_init(&bws->node);
+}
+
+/**
+ * blkcg_wb_waiters_on_bdi - check for writeback waiters on a block device
+ * @blkcg: current blkcg cgroup
+ * @bdi: block device to check
+ *
+ * Return true if any other blkcg different than the current one is waiting for
+ * writeback on the target block device, false otherwise.
+ */
+bool blkcg_wb_waiters_on_bdi(struct blkcg *blkcg, struct backing_dev_info *bdi)
+{
+   struct blkcg_wb_sleeper *bws;
+   bool ret = false;
+
+   if (likely(list_empty(&bdi->cgwb_waiters)))
+   return false;
+   spin_lock(&bdi->cgwb_waiters_lock);
+   list_for_each_entry(bws, &bdi->cgwb_waiters, node)
+   if (bws->blkcg != blkcg) {
+   ret = true;
+   break;
+   }
+   spin_unlock(&bdi->cgwb_waiters_lock);
+
+   return ret;
+}
+
+/**
+ * blkcg_start_wb_wait_on_bdi - add current blkcg to writeback waiters list
+ * @bdi: target block device
+ *
+ * Add current blkcg to the list of writeback waiters on target block device.
+ */
+void blkcg_start_wb_wait_on_bdi(struct backing_dev_info *bdi)
+{
+   struct blkcg_wb_sleeper *new_bws, *bws;
+   struct blkcg *blkcg;
+
+   new_bws = kzalloc(sizeof(*new_bws), GFP_KERNEL);
+   if (unlikely(!new_bws))
+   return;
+
+   rcu_read_lock();
+   blkcg = blkcg_from_current();
+   if (likely(blkcg)) {
+   /* Check if blkcg is already sleeping on bdi */
+   spin_lock_bh(&bdi->cgwb_waiters_lock);
+   bws = blkcg_wb_sleeper_find(blkcg, bdi);
+   if (bws) {
+   refcount_inc(&bws->refcnt);
+   } else {
+   /* Add current blkcg as a new wb sleeper on bdi */
+   css_get(&blkcg->css);
+   new_bws->blkcg = blkcg;
+   refcount_set(&new_bws->refcnt, 1);
+   blkcg_wb_sleeper_add(bdi, new_bws);
+   new_bws = NULL;
+   }
+   spin_unlock_bh(&bdi->cgwb_waiters_lock);
+   }
+   rcu_read_unlock();
+
+   kfree(new_bws);
+}
+
+/**
+ * blkcg_stop_wb_wait_on_bdi - remove current blkcg from writeback waiters list
+ * @bdi: target block device
+ *
+ * Remove current blkcg from the list of writeback waiters on target block
+ * device.
+ */
+void blkcg_stop_wb_wait_on_bdi(struct backing_dev_info *bdi

Re: [PATCH v2 0/3] blkcg: sync() isolation

2019-03-08 Thread Andrea Righi
On Fri, Mar 08, 2019 at 12:22:20PM -0500, Josef Bacik wrote:
> On Thu, Mar 07, 2019 at 07:08:31PM +0100, Andrea Righi wrote:
> > = Problem =
> > 
> > When sync() is executed from a high-priority cgroup, the process is forced 
> > to
> > wait the completion of the entire outstanding writeback I/O, even the I/O 
> > that
> > was originally generated by low-priority cgroups potentially.
> > 
> > This may cause massive latencies to random processes (even those running in 
> > the
> > root cgroup) that shouldn't be I/O-throttled at all, similarly to a classic
> > priority inversion problem.
> > 
> > This topic has been previously discussed here:
> > https://patchwork.kernel.org/patch/10804489/
> > 
> 
> Sorry to move the goal posts on you again Andrea, but Tejun and I talked about
> this some more offline.
> 
> We don't want cgroup to become the arbiter of correctness/behavior here.  We
> just want it to be isolating things.
> 
> For you that means you can drop the per-cgroup flag stuff, and only do the
> priority boosting for multiple sync(2) waiters.  That is a real priority
> inversion that needs to be fixed.  io.latency and io.max are capable of 
> noticing
> that a low priority group is going above their configured limits and putting
> pressure elsewhere accordingly.

Alright, so IIUC that means we just need patch 1/3 for now (with the
per-bdi lock instead of the global lock). If that's the case I'll focus
on that patch then.
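
For reference, a minimal userspace mock of what the per-bdi variant looks
like (the real v3/v4 code above uses a list_head plus a spinlock embedded
in the bdi; the names below mirror the kernel ones, but this is only an
illustration):

    #include <stdbool.h>
    #include <stdio.h>

    struct blkcg { int id; };

    struct blkcg_wb_sleeper {
            struct blkcg *blkcg;
            struct blkcg_wb_sleeper *next;
    };

    struct backing_dev_info {
            struct blkcg_wb_sleeper *cgwb_waiters;  /* per-bdi waiter list */
    };

    /* Fast path first: most bdis never have any sync(2) waiters at all. */
    static bool blkcg_wb_waiters_on_bdi(struct blkcg *cur,
                                        struct backing_dev_info *bdi)
    {
            struct blkcg_wb_sleeper *bws;

            if (!bdi->cgwb_waiters)
                    return false;
            for (bws = bdi->cgwb_waiters; bws; bws = bws->next)
                    if (bws->blkcg != cur)
                            return true;
            return false;
    }

    int main(void)
    {
            struct blkcg high = { 1 }, low = { 2 };
            struct blkcg_wb_sleeper w = { &high, NULL };
            struct backing_dev_info bdi = { &w };

            /* the throttled group sees a foreign waiter -> skip throttling */
            printf("%d %d\n", blkcg_wb_waiters_on_bdi(&low, &bdi),
                              blkcg_wb_waiters_on_bdi(&high, &bdi));  /* 1 0 */
            return 0;
    }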

> 
> Tejun said he'd rather see the sync(2) isolation be done at the namespace 
> level.
> That way if you have fs namespacing you are already isolated to your 
> namespace.
> If you feel like tackling that then hooray, but that's a separate dragon to 
> slay
> so don't feel like you have to right now.

Makes sense. I can take a look and see what I can do after posting the
new patch with the priority inversion fix only.

> 
> This way we keep cgroup doing its job, controlling resources.  Then we allow
> namespacing to do its thing, isolating resources.  Thanks,
> 
> Josef

Looks like a good plan to me. Thanks for the update.

-Andrea


Re: [PATCH v2 3/3] blkcg: implement sync() isolation

2019-03-07 Thread Andrea Righi
On Thu, Mar 07, 2019 at 05:07:01PM -0500, Josef Bacik wrote:
> On Thu, Mar 07, 2019 at 07:08:34PM +0100, Andrea Righi wrote:
> > Keep track of the inodes that have been dirtied by each blkcg cgroup and
> > make sure that a blkcg issuing a sync() can trigger the writeback + wait
> > of only those pages that belong to the cgroup itself.
> > 
> > This behavior is applied only when io.sync_isolation is enabled in the
> > cgroup, otherwise the old behavior is applied: sync() triggers the
> > writeback of any dirty page.
> > 
> > Signed-off-by: Andrea Righi 
> > ---
> >  block/blk-cgroup.c | 47 ++
> >  fs/fs-writeback.c  | 52 +++---
> >  fs/inode.c |  1 +
> >  include/linux/blk-cgroup.h | 22 
> >  include/linux/fs.h |  4 +++
> >  mm/page-writeback.c|  1 +
> >  6 files changed, 124 insertions(+), 3 deletions(-)
> > 
> > diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
> > index 4305e78d1bb2..7d3b26ba4575 100644
> > --- a/block/blk-cgroup.c
> > +++ b/block/blk-cgroup.c
> > @@ -1480,6 +1480,53 @@ void blkcg_stop_wb_wait_on_bdi(struct 
> > backing_dev_info *bdi)
> > spin_unlock(&blkcg_wb_sleeper_lock);
> > rcu_read_unlock();
> >  }
> > +
> > +/**
> > + * blkcg_set_mapping_dirty - set owner of a dirty mapping
> > + * @mapping: target address space
> > + *
> > + * Set the current blkcg as the owner of the address space @mapping (the 
> > first
> > + * blkcg that dirties @mapping becomes the owner).
> > + */
> > +void blkcg_set_mapping_dirty(struct address_space *mapping)
> > +{
> > +   struct blkcg *curr_blkcg, *blkcg;
> > +
> > +   if (mapping_tagged(mapping, PAGECACHE_TAG_WRITEBACK) ||
> > +   mapping_tagged(mapping, PAGECACHE_TAG_DIRTY))
> > +   return;
> > +
> > +   rcu_read_lock();
> > +   curr_blkcg = blkcg_from_current();
> > +   blkcg = blkcg_from_mapping(mapping);
> > +   if (curr_blkcg != blkcg) {
> > +   if (blkcg)
> > +   css_put(&blkcg->css);
> > +   css_get(&curr_blkcg->css);
> > +   rcu_assign_pointer(mapping->i_blkcg, curr_blkcg);
> > +   }
> > +   rcu_read_unlock();
> > +}
> > +
> > +/**
> > + * blkcg_set_mapping_clean - clear the owner of a dirty mapping
> > + * @mapping: target address space
> > + *
> > + * Unset the owner of @mapping when it becomes clean.
> > + */
> > +
> > +void blkcg_set_mapping_clean(struct address_space *mapping)
> > +{
> > +   struct blkcg *blkcg;
> > +
> > +   rcu_read_lock();
> > +   blkcg = rcu_dereference(mapping->i_blkcg);
> > +   if (blkcg) {
> > +   css_put(&blkcg->css);
> > +   RCU_INIT_POINTER(mapping->i_blkcg, NULL);
> > +   }
> > +   rcu_read_unlock();
> > +}
> >  #endif
> >  
> 
> Why do we need this?  We already have the inode_attach_wb(), which has the
> blkcg_css embedded in it for whoever dirtied the inode first.  Can we not just
> use that?  Thanks,
> 
> Josef

I'm realizing only now that inode_attach_wb() also has blkcg embedded
in addition to the memcg. I think I can use that and drop these
blkcg_set_mapping_dirty/clean()..
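
Roughly, the lookup could then become something like the sketch below
(kernel-internal fragment, just to show the direction; blkcg_from_inode is
a made-up name and locking/reference handling is omitted):

    /* Resolve the blkcg that first dirtied @inode via its attached wb. */
    static struct blkcg *blkcg_from_inode(struct inode *inode)
    {
            if (!inode->i_wb)       /* no wb attached yet */
                    return NULL;
            return css_to_blkcg(inode->i_wb->blkcg_css);
    }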

Thanks,
-Andrea


Re: [PATCH v2 1/3] blkcg: prevent priority inversion problem during sync()

2019-03-07 Thread Andrea Righi
On Thu, Mar 07, 2019 at 05:10:53PM -0500, Josef Bacik wrote:
> On Thu, Mar 07, 2019 at 07:08:32PM +0100, Andrea Righi wrote:
> > Prevent priority inversion problem when a high-priority blkcg issues a
> > sync() and it is forced to wait the completion of all the writeback I/O
> > generated by any other low-priority blkcg, causing massive latencies to
> > processes that shouldn't be I/O-throttled at all.
> > 
> > The idea is to save a list of blkcg's that are waiting for writeback:
> > every time a sync() is executed the current blkcg is added to the list.
> > 
> > Then, when I/O is throttled, if there's a blkcg waiting for writeback
> > different than the current blkcg, no throttling is applied (we can
> > probably refine this logic later, i.e., a better policy could be to
> > adjust the throttling I/O rate using the blkcg with the highest speed
> > from the list of waiters - priority inheritance, kinda).
> > 
> > Signed-off-by: Andrea Righi 
> > ---
> >  block/blk-cgroup.c   | 131 +++
> >  block/blk-throttle.c |  11 ++-
> >  fs/fs-writeback.c|   5 ++
> >  fs/sync.c|   8 +-
> >  include/linux/backing-dev-defs.h |   2 +
> >  include/linux/blk-cgroup.h   |  23 ++
> >  mm/backing-dev.c |   2 +
> >  7 files changed, 178 insertions(+), 4 deletions(-)
> > 
> > diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
> > index 2bed5725aa03..4305e78d1bb2 100644
> > --- a/block/blk-cgroup.c
> > +++ b/block/blk-cgroup.c
> > @@ -1351,6 +1351,137 @@ struct cgroup_subsys io_cgrp_subsys = {
> >  };
> >  EXPORT_SYMBOL_GPL(io_cgrp_subsys);
> >  
> > +#ifdef CONFIG_CGROUP_WRITEBACK
> > +struct blkcg_wb_sleeper {
> > +   struct backing_dev_info *bdi;
> > +   struct blkcg *blkcg;
> > +   refcount_t refcnt;
> > +   struct list_head node;
> > +};
> > +
> > +static DEFINE_SPINLOCK(blkcg_wb_sleeper_lock);
> > +static LIST_HEAD(blkcg_wb_sleeper_list);
> > +
> > +static struct blkcg_wb_sleeper *
> > +blkcg_wb_sleeper_find(struct blkcg *blkcg, struct backing_dev_info *bdi)
> > +{
> > +   struct blkcg_wb_sleeper *bws;
> > +
> > +   list_for_each_entry(bws, &blkcg_wb_sleeper_list, node)
> > +   if (bws->blkcg == blkcg && bws->bdi == bdi)
> > +   return bws;
> > +   return NULL;
> > +}
> > +
> > +static void blkcg_wb_sleeper_add(struct blkcg_wb_sleeper *bws)
> > +{
> > +   list_add(&bws->node, &blkcg_wb_sleeper_list);
> > +}
> > +
> > +static void blkcg_wb_sleeper_del(struct blkcg_wb_sleeper *bws)
> > +{
> > +   list_del_init(&bws->node);
> > +}
> > +
> > +/**
> > + * blkcg_wb_waiters_on_bdi - check for writeback waiters on a block device
> > + * @blkcg: current blkcg cgroup
> > + * @bdi: block device to check
> > + *
> > + * Return true if any other blkcg different than the current one is 
> > waiting for
> > + * writeback on the target block device, false otherwise.
> > + */
> > +bool blkcg_wb_waiters_on_bdi(struct blkcg *blkcg, struct backing_dev_info 
> > *bdi)
> > +{
> > +   struct blkcg_wb_sleeper *bws;
> > +   bool ret = false;
> > +
> > +   spin_lock(&blkcg_wb_sleeper_lock);
> > +   list_for_each_entry(bws, &blkcg_wb_sleeper_list, node)
> > +   if (bws->bdi == bdi && bws->blkcg != blkcg) {
> > +   ret = true;
> > +   break;
> > +   }
> > +   spin_unlock(&blkcg_wb_sleeper_lock);
> > +
> > +   return ret;
> > +}
> 
> No global lock please, add something to the bdi I think?  Also have a fast 
> path
> of

OK, I'll add a list per-bdi and a lock as well.

> 
> if (list_empty(&blkcg_wb_sleeper_list))
>    return false;

OK.

> 
> we don't need to be super accurate here.  Thanks,
> 
> Josef

Thanks,
-Andrea


[PATCH v2 2/3] blkcg: introduce io.sync_isolation

2019-03-07 Thread Andrea Righi
Add a flag to the blkcg cgroups to make sync()'ers in a cgroup only be
allowed to write out pages that have been dirtied by the cgroup itself.

This flag is disabled by default (meaning that we are not changing the
previous behavior by default).

When this flag is enabled any cgroup can write out only dirty pages that
belong to the cgroup itself (except for the root cgroup that would still
be able to write out all pages globally).
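
A minimal usage sketch (cgroup v2 unified hierarchy; paths below are
illustrative):

  # enable the io controller for the child groups
  echo +io > /sys/fs/cgroup/cgroup.subtree_control
  mkdir /sys/fs/cgroup/lowprio

  # sync() issued from 'lowprio' now only writes back pages that
  # 'lowprio' itself dirtied
  echo 1 > /sys/fs/cgroup/lowprio/io.sync_isolation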

Signed-off-by: Andrea Righi 
---
 Documentation/admin-guide/cgroup-v2.rst |  9 ++
 block/blk-throttle.c| 37 +
 include/linux/blk-cgroup.h  |  7 +
 3 files changed, 53 insertions(+)

diff --git a/Documentation/admin-guide/cgroup-v2.rst 
b/Documentation/admin-guide/cgroup-v2.rst
index 53d3288c328b..17fff0ee97b8 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -1448,6 +1448,15 @@ IO Interface Files
Shows pressure stall information for IO. See
Documentation/accounting/psi.txt for details.
 
+  io.sync_isolation
+A flag (0|1) that determines whether a cgroup is allowed to write out
+only pages that have been dirtied by the cgroup itself. This option is
+set to false (0) by default, meaning that any cgroup would try to write
+out dirty pages globally, even those that have been dirtied by other
+cgroups.
+
+Setting this option to true (1) provides a better isolation across
+cgroups that are doing an intense write I/O activity.
 
 Writeback
 ~~~~~~~~~
diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index da817896cded..4bc3b40a4d93 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -1704,6 +1704,35 @@ static ssize_t tg_set_limit(struct kernfs_open_file *of,
return ret ?: nbytes;
 }
 
+#ifdef CONFIG_CGROUP_WRITEBACK
+static int sync_isolation_show(struct seq_file *sf, void *v)
+{
+   struct blkcg *blkcg = css_to_blkcg(seq_css(sf));
+
+   seq_printf(sf, "%d\n", test_bit(BLKCG_SYNC_ISOLATION, &blkcg->flags));
+   return 0;
+}
+
+static ssize_t sync_isolation_write(struct kernfs_open_file *of,
+   char *buf, size_t nbytes, loff_t off)
+{
+   struct blkcg *blkcg = css_to_blkcg(of_css(of));
+   unsigned long val;
+   int err;
+
+   buf = strstrip(buf);
+   err = kstrtoul(buf, 0, &val);
+   if (err)
+   return err;
+   if (val)
+   set_bit(BLKCG_SYNC_ISOLATION, &blkcg->flags);
+   else
+   clear_bit(BLKCG_SYNC_ISOLATION, &blkcg->flags);
+
+   return nbytes;
+}
+#endif
+
 static struct cftype throtl_files[] = {
 #ifdef CONFIG_BLK_DEV_THROTTLING_LOW
{
@@ -1721,6 +1750,14 @@ static struct cftype throtl_files[] = {
.write = tg_set_limit,
.private = LIMIT_MAX,
},
+#ifdef CONFIG_CGROUP_WRITEBACK
+   {
+   .name = "sync_isolation",
+   .flags = CFTYPE_NOT_ON_ROOT,
+   .seq_show = sync_isolation_show,
+   .write = sync_isolation_write,
+   },
+#endif
{ } /* terminate */
 };
 
diff --git a/include/linux/blk-cgroup.h b/include/linux/blk-cgroup.h
index 0f7dcb70e922..6ac5aa049334 100644
--- a/include/linux/blk-cgroup.h
+++ b/include/linux/blk-cgroup.h
@@ -44,6 +44,12 @@ enum blkg_rwstat_type {
 
 struct blkcg_gq;
 
+/* blkcg->flags */
+enum {
+   /* sync()'ers allowed to write out pages dirtied by the blkcg */
+   BLKCG_SYNC_ISOLATION,
+};
+
 struct blkcg {
struct cgroup_subsys_state  css;
spinlock_t  lock;
@@ -55,6 +61,7 @@ struct blkcg {
struct blkcg_policy_data*cpd[BLKCG_MAX_POLS];
 
struct list_headall_blkcgs_node;
+   unsigned long   flags;
 #ifdef CONFIG_CGROUP_WRITEBACK
struct list_headcgwb_wait_node;
struct list_headcgwb_list;
-- 
2.19.1



[PATCH v2 3/3] blkcg: implement sync() isolation

2019-03-07 Thread Andrea Righi
Keep track of the inodes that have been dirtied by each blkcg cgroup and
make sure that a blkcg issuing a sync() can trigger the writeback + wait
of only those pages that belong to the cgroup itself.

This behavior is applied only when io.sync_isolation is enabled in the
cgroup, otherwise the old behavior is applied: sync() triggers the
writeback of any dirty page.

Signed-off-by: Andrea Righi 
---
 block/blk-cgroup.c | 47 ++
 fs/fs-writeback.c  | 52 +++---
 fs/inode.c |  1 +
 include/linux/blk-cgroup.h | 22 
 include/linux/fs.h |  4 +++
 mm/page-writeback.c|  1 +
 6 files changed, 124 insertions(+), 3 deletions(-)

diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index 4305e78d1bb2..7d3b26ba4575 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -1480,6 +1480,53 @@ void blkcg_stop_wb_wait_on_bdi(struct backing_dev_info 
*bdi)
spin_unlock(&blkcg_wb_sleeper_lock);
rcu_read_unlock();
 }
+
+/**
+ * blkcg_set_mapping_dirty - set owner of a dirty mapping
+ * @mapping: target address space
+ *
+ * Set the current blkcg as the owner of the address space @mapping (the first
+ * blkcg that dirties @mapping becomes the owner).
+ */
+void blkcg_set_mapping_dirty(struct address_space *mapping)
+{
+   struct blkcg *curr_blkcg, *blkcg;
+
+   if (mapping_tagged(mapping, PAGECACHE_TAG_WRITEBACK) ||
+   mapping_tagged(mapping, PAGECACHE_TAG_DIRTY))
+   return;
+
+   rcu_read_lock();
+   curr_blkcg = blkcg_from_current();
+   blkcg = blkcg_from_mapping(mapping);
+   if (curr_blkcg != blkcg) {
+   if (blkcg)
+   css_put(&blkcg->css);
+   css_get(&curr_blkcg->css);
+   rcu_assign_pointer(mapping->i_blkcg, curr_blkcg);
+   }
+   rcu_read_unlock();
+}
+
+/**
+ * blkcg_set_mapping_clean - clear the owner of a dirty mapping
+ * @mapping: target address space
+ *
+ * Unset the owner of @mapping when it becomes clean.
+ */
+
+void blkcg_set_mapping_clean(struct address_space *mapping)
+{
+   struct blkcg *blkcg;
+
+   rcu_read_lock();
+   blkcg = rcu_dereference(mapping->i_blkcg);
+   if (blkcg) {
+   css_put(&blkcg->css);
+   RCU_INIT_POINTER(mapping->i_blkcg, NULL);
+   }
+   rcu_read_unlock();
+}
 #endif
 
 /**
diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 77c039a0ec25..d003d0593f41 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -58,6 +58,9 @@ struct wb_writeback_work {
 
struct list_head list;  /* pending work list */
struct wb_completion *done; /* set if the caller waits */
+#ifdef CONFIG_CGROUP_WRITEBACK
+   struct blkcg *blkcg;
+#endif
 };
 
 /*
@@ -916,6 +919,29 @@ static int __init cgroup_writeback_init(void)
 }
 fs_initcall(cgroup_writeback_init);
 
+static void blkcg_set_sync_domain(struct wb_writeback_work *work)
+{
+   rcu_read_lock();
+   work->blkcg = blkcg_from_current();
+   rcu_read_unlock();
+}
+
+static bool blkcg_same_sync_domain(struct wb_writeback_work *work,
+  struct address_space *mapping)
+{
+   struct blkcg *blkcg;
+
+   if (!work->blkcg || work->blkcg == &blkcg_root)
+   return true;
+   if (!test_bit(BLKCG_SYNC_ISOLATION, &work->blkcg->flags))
+   return true;
+   rcu_read_lock();
+   blkcg = blkcg_from_mapping(mapping);
+   rcu_read_unlock();
+
+   return blkcg == work->blkcg;
+}
+
 #else  /* CONFIG_CGROUP_WRITEBACK */
 
 static void bdi_down_write_wb_switch_rwsem(struct backing_dev_info *bdi) { }
@@ -959,6 +985,15 @@ static void bdi_split_work_to_wbs(struct backing_dev_info 
*bdi,
}
 }
 
+static void blkcg_set_sync_domain(struct wb_writeback_work *work)
+{
+}
+
+static bool blkcg_same_sync_domain(struct wb_writeback_work *work,
+  struct address_space *mapping)
+{
+   return true;
+}
 #endif /* CONFIG_CGROUP_WRITEBACK */
 
 /*
@@ -1131,7 +1166,7 @@ static int move_expired_inodes(struct list_head 
*delaying_queue,
LIST_HEAD(tmp);
struct list_head *pos, *node;
struct super_block *sb = NULL;
-   struct inode *inode;
+   struct inode *inode, *next;
int do_sb_sort = 0;
int moved = 0;
 
@@ -1141,11 +1176,12 @@ static int move_expired_inodes(struct list_head 
*delaying_queue,
expire_time = jiffies - (dirtytime_expire_interval * HZ);
older_than_this = &expire_time;
}
-   while (!list_empty(delaying_queue)) {
-   inode = wb_inode(delaying_queue->prev);
+   list_for_each_entry_safe(inode, next, delaying_queue, i_io_list) {
if (older_than_this &&
inode_dirtied_after(inode, *older_than_this))
break;
+   if 

[PATCH v2 1/3] blkcg: prevent priority inversion problem during sync()

2019-03-07 Thread Andrea Righi
Prevent a priority inversion problem when a high-priority blkcg issues a
sync() and is forced to wait for the completion of all the writeback I/O
generated by any other low-priority blkcg, causing massive latencies to
processes that shouldn't be I/O-throttled at all.

The idea is to save a list of blkcg's that are waiting for writeback:
every time a sync() is executed the current blkcg is added to the list.

Then, when I/O is throttled, if there's a blkcg waiting for writeback
different than the current blkcg, no throttling is applied (we can
probably refine this logic later, i.e., a better policy could be to
adjust the throttling I/O rate using the blkcg with the highest speed
from the list of waiters - priority inheritance, kinda).

Signed-off-by: Andrea Righi 
---
 block/blk-cgroup.c   | 131 +++
 block/blk-throttle.c |  11 ++-
 fs/fs-writeback.c|   5 ++
 fs/sync.c|   8 +-
 include/linux/backing-dev-defs.h |   2 +
 include/linux/blk-cgroup.h   |  23 ++
 mm/backing-dev.c |   2 +
 7 files changed, 178 insertions(+), 4 deletions(-)

diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index 2bed5725aa03..4305e78d1bb2 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -1351,6 +1351,137 @@ struct cgroup_subsys io_cgrp_subsys = {
 };
 EXPORT_SYMBOL_GPL(io_cgrp_subsys);
 
+#ifdef CONFIG_CGROUP_WRITEBACK
+struct blkcg_wb_sleeper {
+   struct backing_dev_info *bdi;
+   struct blkcg *blkcg;
+   refcount_t refcnt;
+   struct list_head node;
+};
+
+static DEFINE_SPINLOCK(blkcg_wb_sleeper_lock);
+static LIST_HEAD(blkcg_wb_sleeper_list);
+
+static struct blkcg_wb_sleeper *
+blkcg_wb_sleeper_find(struct blkcg *blkcg, struct backing_dev_info *bdi)
+{
+   struct blkcg_wb_sleeper *bws;
+
+   list_for_each_entry(bws, &blkcg_wb_sleeper_list, node)
+   if (bws->blkcg == blkcg && bws->bdi == bdi)
+   return bws;
+   return NULL;
+}
+
+static void blkcg_wb_sleeper_add(struct blkcg_wb_sleeper *bws)
+{
+   list_add(&bws->node, &blkcg_wb_sleeper_list);
+}
+
+static void blkcg_wb_sleeper_del(struct blkcg_wb_sleeper *bws)
+{
+   list_del_init(&bws->node);
+}
+
+/**
+ * blkcg_wb_waiters_on_bdi - check for writeback waiters on a block device
+ * @blkcg: current blkcg cgroup
+ * @bdi: block device to check
+ *
+ * Return true if any other blkcg different than the current one is waiting for
+ * writeback on the target block device, false otherwise.
+ */
+bool blkcg_wb_waiters_on_bdi(struct blkcg *blkcg, struct backing_dev_info *bdi)
+{
+   struct blkcg_wb_sleeper *bws;
+   bool ret = false;
+
+   spin_lock(&blkcg_wb_sleeper_lock);
+   list_for_each_entry(bws, &blkcg_wb_sleeper_list, node)
+   if (bws->bdi == bdi && bws->blkcg != blkcg) {
+   ret = true;
+   break;
+   }
+   spin_unlock(&blkcg_wb_sleeper_lock);
+
+   return ret;
+}
+
+/**
+ * blkcg_start_wb_wait_on_bdi - add current blkcg to writeback waiters list
+ * @bdi: target block device
+ *
+ * Add current blkcg to the list of writeback waiters on target block device.
+ */
+void blkcg_start_wb_wait_on_bdi(struct backing_dev_info *bdi)
+{
+   struct blkcg_wb_sleeper *new_bws, *bws;
+   struct blkcg *blkcg;
+
+   new_bws = kzalloc(sizeof(*new_bws), GFP_KERNEL);
+   if (unlikely(!new_bws))
+   return;
+
+   rcu_read_lock();
+   blkcg = blkcg_from_current();
+   if (likely(blkcg)) {
+   /* Check if blkcg is already sleeping on bdi */
+   spin_lock(&blkcg_wb_sleeper_lock);
+   bws = blkcg_wb_sleeper_find(blkcg, bdi);
+   if (bws) {
+   refcount_inc(&bws->refcnt);
+   } else {
+   /* Add current blkcg as a new wb sleeper on bdi */
+   css_get(&blkcg->css);
+   new_bws->blkcg = blkcg;
+   new_bws->bdi = bdi;
+   refcount_set(&new_bws->refcnt, 1);
+   blkcg_wb_sleeper_add(new_bws);
+   new_bws = NULL;
+   }
+   spin_unlock(&blkcg_wb_sleeper_lock);
+   }
+   rcu_read_unlock();
+
+   kfree(new_bws);
+}
+
+/**
+ * blkcg_stop_wb_wait_on_bdi - remove current blkcg from writeback waiters list
+ * @bdi: target block device
+ *
+ * Remove current blkcg from the list of writeback waiters on target block
+ * device.
+ */
+void blkcg_stop_wb_wait_on_bdi(struct backing_dev_info *bdi)
+{
+   struct blkcg_wb_sleeper *bws = NULL;
+   struct blkcg *blkcg;
+
+   rcu_read_lock();
+   blkcg = blkcg_from_current();
+   if (!blkcg) {
+   rcu_read_unlock();
+   return;
+   }
+   spin_lock(&blkcg_wb_sleeper_lock);
+   bws = blkcg_wb_sleeper_find(blkcg, bdi);
+   if (unlikely(!bws)) {
+   

[PATCH v2 0/3] blkcg: sync() isolation

2019-03-07 Thread Andrea Righi
= Problem =

When sync() is executed from a high-priority cgroup, the process is forced to
wait for the completion of all the outstanding writeback I/O, even I/O that
was potentially generated by low-priority cgroups.

This may cause massive latencies to random processes (even those running in the
root cgroup) that shouldn't be I/O-throttled at all, similarly to a classic
priority inversion problem.

This topic has been previously discussed here:
https://patchwork.kernel.org/patch/10804489/

[ Thanks to Josef for the suggestions ]

= Solution =

Here's a slightly more detailed description of the solution, as suggested by
Josef and Tejun (let me know if I misunderstood or missed anything):

 - track the submitter of wb work (when issuing sync()) and the cgroup that
   originally dirtied any inode, then use this information to determine the
   proper "sync() domain" and decide if the I/O speed needs to be boosted or
   not in order to prevent priority-inversion problems

 - by default when sync() is issued, all the outstanding writeback I/O is
   boosted to maximum speed to prevent priority inversion problems

 - if sync() is issued by the same throttled cgroup that generated the dirty
   pages, the corresponding writeback I/O is still throttled normally

 - add a new flag to cgroups (io.sync_isolation) that would make sync()'ers in
   that cgroup only be allowed to write out dirty pages that belong to its
   cgroup

= Test =

Here's a trivial example to trigger the problem:

 - create 2 cgroups: cg1 and cg2

 # mkdir /sys/fs/cgroup/unified/cg1
 # mkdir /sys/fs/cgroup/unified/cg2

 - set an I/O limit of 1MB/s on cg1/io.max:

 # echo "8:0 rbps=1048576 wbps=1048576" > /sys/fs/cgroup/unified/cg1/io.max

 - run a write-intensive workload in cg1

 # cat /proc/self/cgroup
 0::/cg1
 # fio --rw=write --bs=1M --size=32M --numjobs=16 --name=writer --time_based 
--runtime=30

 - run sync in cg2 and measure time

== Vanilla kernel ==

 # cat /proc/self/cgroup
 0::/cg2

 # time sync
 real   9m32,618s
 user   0m0,000s
 sys0m0,018s

Ideally "sync" should complete almost immediately, because cg2 is unlimited and
it's not doing any I/O at all. Instead, the entire system is totally sluggish,
waiting for the throttled writeback I/O to complete, and it also triggers many
hung task timeout warnings.

== With this patch set applied and io.sync_isolation=0 (default) ==

 # cat /proc/self/cgroup
 0::/cg2

 # time sync
 real   0m2,044s
 user   0m0,009s
 sys0m0,000s

[ Time range goes from 2s to 4s ]

== With this patch set applied and io.sync_isolation=1 ==

 # cat /proc/self/cgroup
 0::/cg2

 # time sync

 real   0m0,768s
 user   0m0,001s
 sys0m0,008s

[ Time range goes from 0.7s to 1.6s ]
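
For reference, the io.sync_isolation knob used in the last run is the flag
introduced by patch 2/3; assuming the interface lands as described there, it
can be toggled per-cgroup like the other io.* files, e.g.:

 # echo 1 > /sys/fs/cgroup/unified/cg2/io.sync_isolation
 # cat /sys/fs/cgroup/unified/cg2/io.sync_isolation
 1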

Changes in v2:
 - fix: properly keep track of sync waiters when a blkcg is writing to
   many block devices at the same time

Andrea Righi (3):
  blkcg: prevent priority inversion problem during sync()
  blkcg: introduce io.sync_isolation
  blkcg: implement sync() isolation

 Documentation/admin-guide/cgroup-v2.rst |   9 +++
 block/blk-cgroup.c  | 178 

 block/blk-throttle.c|  48 ++-
 fs/fs-writeback.c   |  57 +-
 fs/inode.c  |   1 +
 fs/sync.c   |   8 ++-
 include/linux/backing-dev-defs.h|   2 +
 include/linux/blk-cgroup.h  |  52 +
 include/linux/fs.h  |   4 ++
 mm/backing-dev.c|   2 +
 mm/page-writeback.c |   1 +
 11 files changed, 355 insertions(+), 7 deletions(-)



[PATCH 0/3] blkcg: sync() isolation

2019-02-19 Thread Andrea Righi
= Problem =

When sync() is executed from a high-priority cgroup, the process is forced to
wait for the completion of all the outstanding writeback I/O, even I/O that
was potentially generated by low-priority cgroups.

This may cause massive latencies to random processes (even those running in the
root cgroup) that shouldn't be I/O-throttled at all, similarly to a classic
priority inversion problem.

This topic has been previously discussed here:
https://patchwork.kernel.org/patch/10804489/

[ Thanks to Josef for the suggestions ]

= Solution =

Here's a slightly more detailed description of the solution, as suggested by
Josef and Tejun (let me know if I misunderstood or missed anything):

 - track the submitter of wb work (when issuing sync()) and the cgroup that
   originally dirtied any inode, then use this information to determine the
   proper "sync() domain" and decide if the I/O speed needs to be boosted or
   not in order to prevent priority-inversion problems

 - by default when sync() is issued, all the outstanding writeback I/O is
   boosted to maximum speed to prevent priority inversion problems

 - if sync() is issued by the same throttled cgroup that generated the dirty
   pages, the corresponding writeback I/O is still throttled normally

 - add a new flag to cgroups (io.sync_isolation) that would make sync()'ers in
   that cgroup only be allowed to write out dirty pages that belong to its
   cgroup

= Test =

Here's a trivial example to trigger the problem:

 - create 2 cgroups: cg1 and cg2

 # mkdir /sys/fs/cgroup/unified/cg1
 # mkdir /sys/fs/cgroup/unified/cg2

 - set an I/O limit of 1MB/s on cg1/io.max:

 # echo "8:0 rbps=1048576 wbps=1048576" > /sys/fs/cgroup/unified/cg1/io.max

 - run a write-intensive workload in cg1

 # cat /proc/self/cgroup
 0::/cg1
 # fio --rw=write --bs=1M --size=32M --numjobs=16 --name=writer --time_based 
--runtime=30

 - run sync in cg2 and measure time

== Vanilla kernel ==

 # cat /proc/self/cgroup
 0::/cg2

 # time sync
 real   9m32,618s
 user   0m0,000s
 sys0m0,018s

Ideally "sync" should complete almost immediately, because cg2 is unlimited and
it's not doing any I/O at all. Instead, the entire system is totally sluggish,
waiting for the throttled writeback I/O to complete, and it also triggers many
hung task timeout warnings.

== With this patch set applied and io.sync_isolation=0 (default) ==

 # cat /proc/self/cgroup
 0::/cg2

 # time sync
 real   0m2,044s
 user   0m0,009s
 sys0m0,000s

[ Time range goes from 2s to 4s ]

== With this patch set applied and io.sync_isolation=1 ==

 # cat /proc/self/cgroup
 0::/cg2

 # time sync

 real   0m0,768s
 user   0m0,001s
 sys0m0,008s

[ Time range goes from 0.7s to 1.6s ]

Andrea Righi (3):
  blkcg: prevent priority inversion problem during sync()
  blkcg: introduce io.sync_isolation
  blkcg: implement sync() isolation

 Documentation/admin-guide/cgroup-v2.rst |   9 +++
 block/blk-cgroup.c  | 120 
 block/blk-throttle.c|  48 -
 fs/fs-writeback.c   |  57 ++-
 fs/inode.c  |   1 +
 fs/sync.c   |   8 ++-
 include/linux/backing-dev-defs.h|   2 +
 include/linux/blk-cgroup.h  |  52 ++
 include/linux/fs.h  |   4 ++
 mm/backing-dev.c|   2 +
 mm/page-writeback.c |   1 +
 11 files changed, 297 insertions(+), 7 deletions(-)



[PATCH 3/3] blkcg: implement sync() isolation

2019-02-19 Thread Andrea Righi
Keep track of the inodes that have been dirtied by each blkcg cgroup and
make sure that a blkcg issuing a sync() can trigger the writeback + wait
of only those pages that belong to the cgroup itself.

This behavior is enabled only when io.sync_isolation is enabled in the
cgroup, otherwise the old behavior is applied: sync() triggers the
writeback of any dirty page.

Signed-off-by: Andrea Righi 
---
 block/blk-cgroup.c | 47 ++
 fs/fs-writeback.c  | 52 +++---
 fs/inode.c |  1 +
 include/linux/blk-cgroup.h | 22 
 include/linux/fs.h |  4 +++
 mm/page-writeback.c|  1 +
 6 files changed, 124 insertions(+), 3 deletions(-)

diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index fb3c39eadf92..c6ddf9eeab37 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -1422,6 +1422,53 @@ void blkcg_stop_wb_wait_on_bdi(struct backing_dev_info 
*bdi)
rcu_read_unlock();
synchronize_rcu();
 }
+
+/**
+ * blkcg_set_mapping_dirty - set owner of a dirty mapping
+ * @mapping: target address space
+ *
+ * Set the current blkcg as the owner of the address space @mapping (the first
+ * blkcg that dirties @mapping becomes the owner).
+ */
+void blkcg_set_mapping_dirty(struct address_space *mapping)
+{
+   struct blkcg *curr_blkcg, *blkcg;
+
+   if (mapping_tagged(mapping, PAGECACHE_TAG_WRITEBACK) ||
+   mapping_tagged(mapping, PAGECACHE_TAG_DIRTY))
+   return;
+
+   rcu_read_lock();
+   curr_blkcg = blkcg_from_current();
+   blkcg = blkcg_from_mapping(mapping);
+   if (curr_blkcg != blkcg) {
+   if (blkcg)
+   css_put(&blkcg->css);
+   css_get(&curr_blkcg->css);
+   rcu_assign_pointer(mapping->i_blkcg, curr_blkcg);
+   }
+   rcu_read_unlock();
+}
+
+/**
+ * blkcg_set_mapping_clean - clear the owner of a dirty mapping
+ * @mapping: target address space
+ *
+ * Unset the owner of @mapping when it becomes clean.
+ */
+
+void blkcg_set_mapping_clean(struct address_space *mapping)
+{
+   struct blkcg *blkcg;
+
+   rcu_read_lock();
+   blkcg = rcu_dereference(mapping->i_blkcg);
+   if (blkcg) {
+   css_put(&blkcg->css);
+   RCU_INIT_POINTER(mapping->i_blkcg, NULL);
+   }
+   rcu_read_unlock();
+}
 #endif
 
 /**
diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 77c039a0ec25..d003d0593f41 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -58,6 +58,9 @@ struct wb_writeback_work {
 
struct list_head list;  /* pending work list */
struct wb_completion *done; /* set if the caller waits */
+#ifdef CONFIG_CGROUP_WRITEBACK
+   struct blkcg *blkcg;
+#endif
 };
 
 /*
@@ -916,6 +919,29 @@ static int __init cgroup_writeback_init(void)
 }
 fs_initcall(cgroup_writeback_init);
 
+static void blkcg_set_sync_domain(struct wb_writeback_work *work)
+{
+   rcu_read_lock();
+   work->blkcg = blkcg_from_current();
+   rcu_read_unlock();
+}
+
+static bool blkcg_same_sync_domain(struct wb_writeback_work *work,
+  struct address_space *mapping)
+{
+   struct blkcg *blkcg;
+
+   if (!work->blkcg || work->blkcg == &blkcg_root)
+   return true;
+   if (!test_bit(BLKCG_SYNC_ISOLATION, &work->blkcg->flags))
+   return true;
+   rcu_read_lock();
+   blkcg = blkcg_from_mapping(mapping);
+   rcu_read_unlock();
+
+   return blkcg == work->blkcg;
+}
+
 #else  /* CONFIG_CGROUP_WRITEBACK */
 
 static void bdi_down_write_wb_switch_rwsem(struct backing_dev_info *bdi) { }
@@ -959,6 +985,15 @@ static void bdi_split_work_to_wbs(struct backing_dev_info 
*bdi,
}
 }
 
+static void blkcg_set_sync_domain(struct wb_writeback_work *work)
+{
+}
+
+static bool blkcg_same_sync_domain(struct wb_writeback_work *work,
+  struct address_space *mapping)
+{
+   return true;
+}
 #endif /* CONFIG_CGROUP_WRITEBACK */
 
 /*
@@ -1131,7 +1166,7 @@ static int move_expired_inodes(struct list_head 
*delaying_queue,
LIST_HEAD(tmp);
struct list_head *pos, *node;
struct super_block *sb = NULL;
-   struct inode *inode;
+   struct inode *inode, *next;
int do_sb_sort = 0;
int moved = 0;
 
@@ -1141,11 +1176,12 @@ static int move_expired_inodes(struct list_head 
*delaying_queue,
expire_time = jiffies - (dirtytime_expire_interval * HZ);
older_than_this = &expire_time;
}
-   while (!list_empty(delaying_queue)) {
-   inode = wb_inode(delaying_queue->prev);
+   list_for_each_entry_safe(inode, next, delaying_queue, i_io_list) {
if (older_than_this &&
inode_dirtied_after(inode, *older_than_this))
break;
+   if (!blkcg_same_sync_domain(work, inod

[PATCH 1/3] blkcg: prevent priority inversion problem during sync()

2019-02-19 Thread Andrea Righi
Prevent a priority inversion problem when a high-priority blkcg issues a
sync() and is forced to wait for the completion of all the writeback I/O
generated by any other low-priority blkcg, causing massive latencies to
processes that shouldn't be I/O-throttled at all.

The idea is to save a list of blkcg's that are waiting for writeback:
every time a sync() is executed the current blkcg is added to the list.

Then, when I/O is throttled, if there's a blkcg waiting for writeback
different than the current blkcg, no throttling is applied (we can
probably refine this logic later, i.e., a better policy could be to
adjust the throttling I/O rate using the blkcg with the highest speed
from the list of waiters - priority inheritance, kinda).

Signed-off-by: Andrea Righi 
---
 block/blk-cgroup.c   | 73 
 block/blk-throttle.c | 11 +++--
 fs/fs-writeback.c|  5 +++
 fs/sync.c|  8 +++-
 include/linux/backing-dev-defs.h |  2 +
 include/linux/blk-cgroup.h   | 23 ++
 mm/backing-dev.c |  2 +
 7 files changed, 120 insertions(+), 4 deletions(-)

diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index 2bed5725aa03..fb3c39eadf92 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -1351,6 +1351,79 @@ struct cgroup_subsys io_cgrp_subsys = {
 };
 EXPORT_SYMBOL_GPL(io_cgrp_subsys);
 
+#ifdef CONFIG_CGROUP_WRITEBACK
+/**
+ * blkcg_wb_waiters_on_bdi - check for writeback waiters on a block device
+ * @blkcg: current blkcg cgroup
+ * @bdi: block device to check
+ *
+ * Return true if any other blkcg is waiting for writeback on the target block
+ * device, false otherwise.
+ */
+bool blkcg_wb_waiters_on_bdi(struct blkcg *blkcg, struct backing_dev_info *bdi)
+{
+   struct blkcg *wait_blkcg;
+   bool ret = false;
+
+   if (unlikely(!bdi))
+   return false;
+
+   rcu_read_lock();
+   list_for_each_entry_rcu(wait_blkcg, &bdi->cgwb_waiters, cgwb_wait_node)
+   if (wait_blkcg != blkcg) {
+   ret = true;
+   break;
+   }
+   rcu_read_unlock();
+
+   return ret;
+}
+
+/**
+ * blkcg_start_wb_wait_on_bdi - add current blkcg to writeback waiters list
+ * @bdi: target block device
+ *
+ * Add current blkcg to the list of writeback waiters on target block device.
+ */
+void blkcg_start_wb_wait_on_bdi(struct backing_dev_info *bdi)
+{
+   struct blkcg *blkcg;
+
+   rcu_read_lock();
+   blkcg = blkcg_from_current();
+   if (blkcg) {
+   css_get(&blkcg->css);
+   spin_lock(&bdi->cgwb_waiters_lock);
+   list_add_rcu(&blkcg->cgwb_wait_node, &bdi->cgwb_waiters);
+   spin_unlock(&bdi->cgwb_waiters_lock);
+   }
+   rcu_read_unlock();
+}
+
+/**
+ * blkcg_stop_wb_wait_on_bdi - remove current blkcg from writeback waiters list
+ * @bdi: target block device
+ *
+ * Remove current blkcg from the list of writeback waiters on target block
+ * device.
+ */
+void blkcg_stop_wb_wait_on_bdi(struct backing_dev_info *bdi)
+{
+   struct blkcg *blkcg;
+
+   rcu_read_lock();
+   blkcg = blkcg_from_current();
+   if (blkcg) {
+   spin_lock(&bdi->cgwb_waiters_lock);
+   list_del_rcu(&blkcg->cgwb_wait_node);
+   spin_unlock(&bdi->cgwb_waiters_lock);
+   css_put(&blkcg->css);
+   }
+   rcu_read_unlock();
+   synchronize_rcu();
+}
+#endif
+
 /**
  * blkcg_activate_policy - activate a blkcg policy on a request_queue
  * @q: request_queue of interest
diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index 1b97a73d2fb1..da817896cded 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -970,9 +970,13 @@ static bool tg_may_dispatch(struct throtl_grp *tg, struct 
bio *bio,
 {
bool rw = bio_data_dir(bio);
unsigned long bps_wait = 0, iops_wait = 0, max_wait = 0;
+   struct throtl_data *td = tg->td;
+   struct request_queue *q = td->queue;
+   struct backing_dev_info *bdi = q->backing_dev_info;
+   struct blkcg_gq *blkg = tg_to_blkg(tg);
 
/*
-* Currently whole state machine of group depends on first bio
+* Currently whole state machine of group depends on first bio
 * queued in the group bio list. So one should not be calling
 * this function with a different bio if there are other bios
 * queued.
@@ -981,8 +985,9 @@ static bool tg_may_dispatch(struct throtl_grp *tg, struct 
bio *bio,
   bio != throtl_peek_queued(&tg->service_queue.queued[rw]));
 
/* If tg->bps = -1, then BW is unlimited */
-   if (tg_bps_limit(tg, rw) == U64_MAX &&
-   tg_iops_limit(tg, rw) == UINT_MAX) {
+   if (blkcg_wb_waiters_on_bdi(blkg->blkcg, bdi) ||
+   (tg_bps_limit(tg, rw) == U64_MAX &&
+   tg_iops_limit(tg, rw) == UINT_MAX)) {
if (wait)
  

[PATCH 2/3] blkcg: introduce io.sync_isolation

2019-02-19 Thread Andrea Righi
Add a flag to the blkcg cgroups to make sync()'ers in a cgroup only be
allowed to write out pages that have been dirtied by the cgroup itself.

This flag is disabled by default (meaning that the previous behavior is
preserved unless the flag is explicitly enabled).

When this flag is enabled any cgroup can write out only dirty pages that
belong to the cgroup itself (except for the root cgroup that would still
be able to write out all pages globally).

Signed-off-by: Andrea Righi 
---
 Documentation/admin-guide/cgroup-v2.rst |  9 ++
 block/blk-throttle.c| 37 +
 include/linux/blk-cgroup.h  |  7 +
 3 files changed, 53 insertions(+)

diff --git a/Documentation/admin-guide/cgroup-v2.rst 
b/Documentation/admin-guide/cgroup-v2.rst
index 7bf3f129c68b..f98027fc2398 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -1432,6 +1432,15 @@ IO Interface Files
Shows pressure stall information for IO. See
Documentation/accounting/psi.txt for details.
 
+  io.sync_isolation
+A flag (0|1) that determines whether a cgroup is allowed to write out
+only pages that have been dirtied by the cgroup itself. This option is
+set to false (0) by default, meaning that any cgroup would try to write
+out dirty pages globally, even those that have been dirtied by other
+cgroups.
+
+Setting this option to true (1) provides better isolation across cgroups
+that are doing intense write I/O activity.
 
 Writeback
 ~
diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index da817896cded..4bc3b40a4d93 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -1704,6 +1704,35 @@ static ssize_t tg_set_limit(struct kernfs_open_file *of,
return ret ?: nbytes;
 }
 
+#ifdef CONFIG_CGROUP_WRITEBACK
+static int sync_isolation_show(struct seq_file *sf, void *v)
+{
+   struct blkcg *blkcg = css_to_blkcg(seq_css(sf));
+
+   seq_printf(sf, "%d\n", test_bit(BLKCG_SYNC_ISOLATION, &blkcg->flags));
+   return 0;
+}
+
+static ssize_t sync_isolation_write(struct kernfs_open_file *of,
+   char *buf, size_t nbytes, loff_t off)
+{
+   struct blkcg *blkcg = css_to_blkcg(of_css(of));
+   unsigned long val;
+   int err;
+
+   buf = strstrip(buf);
+   err = kstrtoul(buf, 0, &val);
+   if (err)
+   return err;
+   if (val)
+   set_bit(BLKCG_SYNC_ISOLATION, &blkcg->flags);
+   else
+   clear_bit(BLKCG_SYNC_ISOLATION, &blkcg->flags);
+
+   return nbytes;
+}
+#endif
+
 static struct cftype throtl_files[] = {
 #ifdef CONFIG_BLK_DEV_THROTTLING_LOW
{
@@ -1721,6 +1750,14 @@ static struct cftype throtl_files[] = {
.write = tg_set_limit,
.private = LIMIT_MAX,
},
+#ifdef CONFIG_CGROUP_WRITEBACK
+   {
+   .name = "sync_isolation",
+   .flags = CFTYPE_NOT_ON_ROOT,
+   .seq_show = sync_isolation_show,
+   .write = sync_isolation_write,
+   },
+#endif
{ } /* terminate */
 };
 
diff --git a/include/linux/blk-cgroup.h b/include/linux/blk-cgroup.h
index 0f7dcb70e922..6ac5aa049334 100644
--- a/include/linux/blk-cgroup.h
+++ b/include/linux/blk-cgroup.h
@@ -44,6 +44,12 @@ enum blkg_rwstat_type {
 
 struct blkcg_gq;
 
+/* blkcg->flags */
+enum {
+   /* sync()'ers allowed to write out pages dirtied by the blkcg */
+   BLKCG_SYNC_ISOLATION,
+};
+
 struct blkcg {
struct cgroup_subsys_state  css;
spinlock_t  lock;
@@ -55,6 +61,7 @@ struct blkcg {
struct blkcg_policy_data*cpd[BLKCG_MAX_POLS];
 
struct list_headall_blkcgs_node;
+   unsigned long   flags;
 #ifdef CONFIG_CGROUP_WRITEBACK
struct list_headcgwb_wait_node;
struct list_headcgwb_list;
-- 
2.17.1



[tip:perf/core] kprobes: Prohibit probing on bsearch()

2019-02-13 Thread tip-bot for Andrea Righi
Commit-ID:  02106f883cd745523f7766d90a739f983f19e650
Gitweb: https://git.kernel.org/tip/02106f883cd745523f7766d90a739f983f19e650
Author: Andrea Righi 
AuthorDate: Wed, 13 Feb 2019 01:15:34 +0900
Committer:  Ingo Molnar 
CommitDate: Wed, 13 Feb 2019 08:16:41 +0100

kprobes: Prohibit probing on bsearch()

Since the kprobe breakpoint handler is using bsearch(), probing on this
routine can cause a recursive breakpoint problem.

int3
 ->do_int3()
   ->ftrace_int3_handler()
 ->ftrace_location()
   ->ftrace_location_range()
 ->bsearch() -> int3

Prohibit probing on bsearch().

Signed-off-by: Andrea Righi 
Acked-by: Masami Hiramatsu 
Cc: Alexander Shishkin 
Cc: Arnaldo Carvalho de Melo 
Cc: Jiri Olsa 
Cc: Linus Torvalds 
Cc: Mathieu Desnoyers 
Cc: Peter Zijlstra 
Cc: Steven Rostedt 
Cc: Thomas Gleixner 
Link: 
http://lkml.kernel.org/r/154998813406.31052.8791425358974650922.stgit@devbox
Signed-off-by: Ingo Molnar 
---
 lib/bsearch.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/lib/bsearch.c b/lib/bsearch.c
index 18b445b010c3..82512fe7b33c 100644
--- a/lib/bsearch.c
+++ b/lib/bsearch.c
@@ -11,6 +11,7 @@
 
 #include 
 #include 
+#include 
 
 /*
  * bsearch - binary search an array of elements
@@ -53,3 +54,4 @@ void *bsearch(const void *key, const void *base, size_t num, 
size_t size,
return NULL;
 }
 EXPORT_SYMBOL(bsearch);
+NOKPROBE_SYMBOL(bsearch);


Re: [RFC PATCH v2] blkcg: prevent priority inversion problem during sync()

2019-02-11 Thread Andrea Righi
On Mon, Feb 11, 2019 at 10:39:34AM -0500, Josef Bacik wrote:
> On Sat, Feb 09, 2019 at 03:07:49PM +0100, Andrea Righi wrote:
> > This is an attempt to mitigate the priority inversion problem of a
> > high-priority blkcg issuing a sync() and being forced to wait the
> > completion of all the writeback I/O generated by any other low-priority
> > blkcg, causing massive latencies to processes that shouldn't be
> > I/O-throttled at all.
> > 
> > The idea is to save a list of blkcg's that are waiting for writeback:
> > every time a sync() is executed the current blkcg is added to the list.
> > 
> > Then, when I/O is throttled, if there's a blkcg waiting for writeback
> > different than the current blkcg, no throttling is applied (we can
> > probably refine this logic later, i.e., a better policy could be to
> > adjust the throttling I/O rate using the blkcg with the highest speed
> > from the list of waiters - priority inheritance, kinda).
> > 
> > This topic has been discussed here:
> > https://lwn.net/ml/cgroups/20190118103127.325-1-righi.and...@gmail.com/
> > 
> > But we didn't come up with any definitive solution.
> > 
> > This patch is not a definitive solution either, but it's an attempt to
> > continue addressing this issue and handling the priority inversion
> > problem with sync() in a better way.
> > 
> > Signed-off-by: Andrea Righi 
> 
> Talked with Tejun about this some and we agreed the following is probably the
> best way forward

First of all thanks for the update!

> 
> 1) Track the submitter of the wb work to the writeback code.

Are we going to track the cgroup that originated the dirty pages (or
maybe dirty inodes) or do you have any idea in particular?

> 2) Sync() defaults to the root cg, and and it writes all the things as the 
> root
>cg.

OK.

> 3) Add a flag to the cgroups that would make sync()'ers in that group only be
>allowed to write out things that belong to its group.

So, IIUC, when this flag is enabled a cgroup that is doing sync() would
trigger the writeback of the pages that belong to that cgroup only and
it waits only for these pages to be sync-ed, right? In this case
writeback can still go at cgroup's speed.

Instead when the flag is disabled, sync() would trigger writeback I/O
globally, as usual, and it goes at full speed (root cgroup's speed).

Am I understanding correctly?

> 
> This way we avoid the priority inversion of having things like systemd or 
> random
> logged in user doing sync() and having to wait, and we keep low prio cgroups
> from causing big IO storms by syncing out stuff and getting upgraded to root
> priority just to avoid the inversion.
> 
> Obviously by default we want this flag to be off since its such a big change,
> but people/setups really worried about this behavior (Facebook for instance
> would likely use this flag) can go ahead and set it and be sure we're getting
> good isolation and still avoiding the priority inversion associated with 
> running
> sync from a high priority context.  Thanks,
> 
> Josef

Thanks,
-Andrea


[RFC PATCH v2] blkcg: prevent priority inversion problem during sync()

2019-02-09 Thread Andrea Righi
This is an attempt to mitigate the priority inversion problem of a
high-priority blkcg issuing a sync() and being forced to wait for the
completion of all the writeback I/O generated by any other low-priority
blkcg, causing massive latencies to processes that shouldn't be
I/O-throttled at all.

The idea is to save a list of blkcg's that are waiting for writeback:
every time a sync() is executed the current blkcg is added to the list.

Then, when I/O is throttled, if there's a blkcg waiting for writeback
different than the current blkcg, no throttling is applied (we can
probably refine this logic later, i.e., a better policy could be to
adjust the throttling I/O rate using the blkcg with the highest speed
from the list of waiters - priority inheritance, kinda).

This topic has been discussed here:
https://lwn.net/ml/cgroups/20190118103127.325-1-righi.and...@gmail.com/

But we didn't come up with any definitive solution.

This patch is not a definitive solution either, but it's an attempt to
continue addressing this issue and handling the priority inversion
problem with sync() in a better way.

Signed-off-by: Andrea Righi 
---
Changes in v2:
 - fix: use the proper current blkcg in blkcg_wb_waiters_on_bdi()

 block/blk-cgroup.c   | 69 
 block/blk-throttle.c | 11 +++--
 fs/fs-writeback.c|  4 ++
 include/linux/backing-dev-defs.h |  2 +
 include/linux/blk-cgroup.h   | 18 -
 mm/backing-dev.c |  2 +
 6 files changed, 102 insertions(+), 4 deletions(-)

diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index 2bed5725aa03..21f14148a9c6 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -1635,6 +1635,75 @@ static void blkcg_scale_delay(struct blkcg_gq *blkg, u64 
now)
}
 }
 
+/**
+ * blkcg_wb_waiters_on_bdi - check for writeback waiters on a block device
+ * @blkcg: current blkcg cgroup
+ * @bdi: block device to check
+ *
+ * Return true if any other blkcg is waiting for writeback on the target block
+ * device, false otherwise.
+ */
+bool blkcg_wb_waiters_on_bdi(struct blkcg *blkcg, struct backing_dev_info *bdi)
+{
+   struct blkcg *wait_blkcg;
+   bool ret = false;
+
+   if (unlikely(!bdi))
+   return false;
+
+   rcu_read_lock();
+   list_for_each_entry_rcu(wait_blkcg, &bdi->cgwb_waiters, cgwb_wait_node)
+   if (wait_blkcg != blkcg) {
+   ret = true;
+   break;
+   }
+   rcu_read_unlock();
+
+   return ret;
+}
+
+/**
+ * blkcg_start_wb_wait_on_bdi - add current blkcg to writeback waiters list
+ * @bdi: target block device
+ *
+ * Add current blkcg to the list of writeback waiters on target block device.
+ */
+void blkcg_start_wb_wait_on_bdi(struct backing_dev_info *bdi)
+{
+   struct blkcg *blkcg;
+
+   rcu_read_lock();
+   blkcg = css_to_blkcg(task_css(current, io_cgrp_id));
+   if (blkcg) {
+   spin_lock(&bdi->cgwb_waiters_lock);
+   list_add_rcu(&blkcg->cgwb_wait_node, &bdi->cgwb_waiters);
+   spin_unlock(&bdi->cgwb_waiters_lock);
+   }
+   rcu_read_unlock();
+}
+
+/**
+ * blkcg_stop_wb_wait_on_bdi - remove current blkcg from writeback waiters list
+ * @bdi: target block device
+ *
+ * Remove current blkcg from the list of writeback waiters on target block
+ * device.
+ */
+void blkcg_stop_wb_wait_on_bdi(struct backing_dev_info *bdi)
+{
+   struct blkcg *blkcg;
+
+   rcu_read_lock();
+   blkcg = css_to_blkcg(task_css(current, io_cgrp_id));
+   if (blkcg) {
+   spin_lock(&bdi->cgwb_waiters_lock);
+   list_del_rcu(&blkcg->cgwb_wait_node);
+   spin_unlock(&bdi->cgwb_waiters_lock);
+   }
+   rcu_read_unlock();
+   synchronize_rcu();
+}
+
 /*
  * This is called when we want to actually walk up the hierarchy and check to
  * see if we need to throttle, and then actually throttle if there is some
diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index 1b97a73d2fb1..da817896cded 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -970,9 +970,13 @@ static bool tg_may_dispatch(struct throtl_grp *tg, struct 
bio *bio,
 {
bool rw = bio_data_dir(bio);
unsigned long bps_wait = 0, iops_wait = 0, max_wait = 0;
+   struct throtl_data *td = tg->td;
+   struct request_queue *q = td->queue;
+   struct backing_dev_info *bdi = q->backing_dev_info;
+   struct blkcg_gq *blkg = tg_to_blkg(tg);
 
/*
-* Currently whole state machine of group depends on first bio
+* Currently whole state machine of group depends on first bio
 * queued in the group bio list. So one should not be calling
 * this function with a different bio if there are other bios
 * queued.
@@ -981,8 +985,9 @@ static bool tg_may_dispatch(struct throtl_grp *tg, struct 
bio *bio,
   bio != throtl_peek_queued(&tg->service_queue.queued[rw]));

Re: [RFC PATCH] blkcg: prevent priority inversion problem during sync()

2019-02-09 Thread Andrea Righi
On Sat, Feb 09, 2019 at 01:06:33PM +0100, Andrea Righi wrote:
...
> +/**
> + * blkcg_wb_waiters_on_bdi - check for writeback waiters on a block device
> + * @bdi: block device to check
> + *
> + * Return true if any other blkcg is waiting for writeback on the target 
> block
> + * device, false otherwise.
> + */
> +bool blkcg_wb_waiters_on_bdi(struct backing_dev_info *bdi)
> +{
> + struct blkcg *blkcg, *curr_blkcg;
> + bool ret = false;
> +
> + if (unlikely(!bdi))
> + return false;
> +
> + rcu_read_lock();
> + curr_blkcg = css_to_blkcg(task_css(current, io_cgrp_id));

Sorry, the logic is messed up here. We shouldn't get curr_blkcg from the
current task, because during writeback throttling the context is
obviously not the current task.

I'll post a new patch soon.

> + list_for_each_entry_rcu(blkcg, &bdi->cgwb_waiters, cgwb_wait_node)
> + if (blkcg != curr_blkcg) {
> + ret = true;
> + break;
> + }
> + rcu_read_unlock();
> +
> + return ret;
> +}

-Andrea


[RFC PATCH] blkcg: prevent priority inversion problem during sync()

2019-02-09 Thread Andrea Righi
This is an attempt to mitigate the priority inversion problem of a
high-priority blkcg issuing a sync() and being forced to wait for the
completion of all the writeback I/O generated by any other low-priority
blkcg, causing massive latencies to processes that shouldn't be
I/O-throttled at all.

The idea is to save a list of blkcg's that are waiting for writeback:
every time a sync() is executed the current blkcg is added to the list.

Then, when I/O is throttled, if there's a blkcg waiting for writeback
different than the current blkcg, no throttling is applied (we can
probably refine this logic later, i.e., a better policy could be to
adjust the throttling rate using the blkcg with the highest speed from
the list of waiters - priority inheritance, kinda).

This topic has been discussed here:
https://lwn.net/ml/cgroups/20190118103127.325-1-righi.and...@gmail.com/

But we didn't come up with any definitive solution.

This patch is not a definitive solution either, but it's an attempt to
continue addressing the issue and, hopefully, handle the priority
inversion problem with sync() in a better way.

Signed-off-by: Andrea Righi 
---
 block/blk-cgroup.c   | 69 
 block/blk-throttle.c | 10 +++--
 fs/fs-writeback.c|  4 ++
 include/linux/backing-dev-defs.h |  2 +
 include/linux/blk-cgroup.h   | 15 +++
 mm/backing-dev.c |  2 +
 6 files changed, 99 insertions(+), 3 deletions(-)

diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index 2bed5725aa03..d71e3cb0688d 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -1635,6 +1635,75 @@ static void blkcg_scale_delay(struct blkcg_gq *blkg, u64 
now)
}
 }
 
+/**
+ * blkcg_wb_waiters_on_bdi - check for writeback waiters on a block device
+ * @bdi: block device to check
+ *
+ * Return true if any other blkcg is waiting for writeback on the target block
+ * device, false otherwise.
+ */
+bool blkcg_wb_waiters_on_bdi(struct backing_dev_info *bdi)
+{
+   struct blkcg *blkcg, *curr_blkcg;
+   bool ret = false;
+
+   if (unlikely(!bdi))
+   return false;
+
+   rcu_read_lock();
+   curr_blkcg = css_to_blkcg(task_css(current, io_cgrp_id));
+   list_for_each_entry_rcu(blkcg, &bdi->cgwb_waiters, cgwb_wait_node)
+   if (blkcg != curr_blkcg) {
+   ret = true;
+   break;
+   }
+   rcu_read_unlock();
+
+   return ret;
+}
+
+/**
+ * blkcg_start_wb_wait_on_bdi - add current blkcg to writeback waiters list
+ * @bdi: target block device
+ *
+ * Add current blkcg to the list of writeback waiters on target block device.
+ */
+void blkcg_start_wb_wait_on_bdi(struct backing_dev_info *bdi)
+{
+   struct blkcg *blkcg;
+
+   rcu_read_lock();
+   blkcg = css_to_blkcg(task_css(current, io_cgrp_id));
+   if (blkcg) {
+   spin_lock(&bdi->cgwb_waiters_lock);
+   list_add_rcu(&blkcg->cgwb_wait_node, &bdi->cgwb_waiters);
+   spin_unlock(&bdi->cgwb_waiters_lock);
+   }
+   rcu_read_unlock();
+}
+
+/**
+ * blkcg_stop_wb_wait_on_bdi - remove current blkcg from writeback waiters list
+ * @bdi: target block device
+ *
+ * Remove current blkcg from the list of writeback waiters on target block
+ * device.
+ */
+void blkcg_stop_wb_wait_on_bdi(struct backing_dev_info *bdi)
+{
+   struct blkcg *blkcg;
+
+   rcu_read_lock();
+   blkcg = css_to_blkcg(task_css(current, io_cgrp_id));
+   if (blkcg) {
+   spin_lock(&bdi->cgwb_waiters_lock);
+   list_del_rcu(&blkcg->cgwb_wait_node);
+   spin_unlock(&bdi->cgwb_waiters_lock);
+   }
+   rcu_read_unlock();
+   synchronize_rcu();
+}
+
 /*
  * This is called when we want to actually walk up the hierarchy and check to
  * see if we need to throttle, and then actually throttle if there is some
diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index 1b97a73d2fb1..14d9cd6e702d 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -970,9 +970,12 @@ static bool tg_may_dispatch(struct throtl_grp *tg, struct 
bio *bio,
 {
bool rw = bio_data_dir(bio);
unsigned long bps_wait = 0, iops_wait = 0, max_wait = 0;
+   struct throtl_data *td = tg->td;
+   struct request_queue *q = td->queue;
+   struct backing_dev_info *bdi = q->backing_dev_info;
 
/*
-* Currently whole state machine of group depends on first bio
+* Currently whole state machine of group depends on first bio
 * queued in the group bio list. So one should not be calling
 * this function with a different bio if there are other bios
 * queued.
@@ -981,8 +984,9 @@ static bool tg_may_dispatch(struct throtl_grp *tg, struct 
bio *bio,
   bio != throtl_peek_queued(&tg->service_queue.queued[rw]));
 
/* If tg->bps = -1, then BW is unlimited */
-   if (tg_bps_limit(tg, rw) == U64_MAX &&

Re: [RFC PATCH 0/3] cgroup: fsio throttle controller

2019-01-29 Thread Andrea Righi
On Mon, Jan 28, 2019 at 02:26:20PM -0500, Vivek Goyal wrote:
> On Mon, Jan 28, 2019 at 06:41:29PM +0100, Andrea Righi wrote:
> > Hi Vivek,
> > 
> > sorry for the late reply.
> > 
> > On Mon, Jan 21, 2019 at 04:47:15PM -0500, Vivek Goyal wrote:
> > > On Sat, Jan 19, 2019 at 11:08:27AM +0100, Andrea Righi wrote:
> > > 
> > > [..]
> > > > Alright, let's skip the root cgroup for now. I think the point here is
> > > > if we want to provide sync() isolation among cgroups or not.
> > > > 
> > > > According to the manpage:
> > > > 
> > > >sync()  causes  all  pending  modifications  to filesystem 
> > > > metadata and cached file data to be
> > > >written to the underlying filesystems.
> > > > 
> > > > And:
> > > >According to the standard specification (e.g., POSIX.1-2001), 
> > > > sync() schedules the writes, but
> > > >may  return  before  the actual writing is done.  However Linux 
> > > > waits for I/O completions, and
> > > >thus sync() or syncfs() provide the same guarantees as fsync 
> > > > called on every file in the  sys‐
> > > >tem or filesystem respectively.
> > > > 
> > > > Excluding the root cgroup, do you think a sync() issued inside a
> > > > specific cgroup should wait for I/O completions only for the writes that
> > > > have been generated by that cgroup?
> > > 
> > > Can we account I/O towards the cgroup which issued "sync" only if write
> > > rate of sync cgroup is higher than cgroup to which page belongs to. Will
> > > that solve problem, assuming its doable?
> > 
> > Maybe this would mitigate the problem, in part, but it doesn't solve it.
> > 
> > The thing is, if a dirty page belongs to a slow cgroup and a fast cgroup
> > issues "sync", the fast cgroup needs to wait a lot, because writeback is
> > happening at the speed of the slow cgroup.
> 
> Hi Andrea,
> 
> But that's true only for I/O which has already been submitted to block
> layer, right? Any new I/O yet to be submitted could still be attributed
> to faster cgroup requesting sync.

Right. If we could bump up the new I/O yet to be submitted I think we
could effectively prevent the priority inversion problem (the ongoing
writeback I/O should be negligible).

> 
> Until and unless cgroups limits are absurdly low, it should not take very
> long for already submitted I/O to finish. If yes, then in practice, it
> might not be a big problem?

I was actually doing my tests with a very low limit (1MB/s both for rbps
and wbps), but this shows the problem very well I think.

Here's what I'm doing:

 [ slow cgroup (1Mbps read/write) ]

   $ cat /sys/fs/cgroup/unified/cg1/io.max
   259:0 rbps=1048576 wbps=1048576 riops=max wiops=max
   $ cat /proc/self/cgroup
   0::/cg1

   $ fio --rw=write --bs=1M --size=32M --numjobs=16 --name=writer --time_based 
--runtime=30

 [ fast cgroup (root cgroup, no limitation) ]

   # cat /proc/self/cgroup
   0::/

   # time sync
   real 9m32,618s
   user 0m0,000s
   sys  0m0,018s

With this simple test I can easily trigger hung task timeout warnings
and make the whole system totally sluggish (even the processes running
in the root cgroup).

When fio ends, writeback is still taking forever to complete, as you can
see by the insane amount that sync takes to complete.

-Andrea


Re: [RFC PATCH 0/3] cgroup: fsio throttle controller

2019-01-28 Thread Andrea Righi
Hi Vivek,

sorry for the late reply.

On Mon, Jan 21, 2019 at 04:47:15PM -0500, Vivek Goyal wrote:
> On Sat, Jan 19, 2019 at 11:08:27AM +0100, Andrea Righi wrote:
> 
> [..]
> > Alright, let's skip the root cgroup for now. I think the point here is
> > if we want to provide sync() isolation among cgroups or not.
> > 
> > According to the manpage:
> > 
> >sync()  causes  all  pending  modifications  to filesystem metadata 
> > and cached file data to be
> >written to the underlying filesystems.
> > 
> > And:
> >According to the standard specification (e.g., POSIX.1-2001), sync() 
> > schedules the writes, but
> >may  return  before  the actual writing is done.  However Linux 
> > waits for I/O completions, and
> >thus sync() or syncfs() provide the same guarantees as fsync called 
> > on every file in the  sys‐
> >tem or filesystem respectively.
> > 
> > Excluding the root cgroup, do you think a sync() issued inside a
> > specific cgroup should wait for I/O completions only for the writes that
> > have been generated by that cgroup?
> 
> Can we account I/O towards the cgroup which issued "sync" only if write
> rate of sync cgroup is higher than cgroup to which page belongs to. Will
> that solve problem, assuming its doable?

Maybe this would mitigate the problem, in part, but it doesn't solve it.

The thing is, if a dirty page belongs to a slow cgroup and a fast cgroup
issues "sync", the fast cgroup needs to wait a lot, because writeback is
happening at the speed of the slow cgroup.

Ideally in this case we should bump up the writeback speed, maybe even
temporarily inherit the write rate of the sync cgroup, similarly to a
priority-inversion locking scenario, but I think it's not doable at the
moment without applying big changes.

Or we could isolate the sync domain, meaning that a cgroup issuing a
sync will only wait for the syncing of the pages that belong to that
sync cgroup. But probably also this method requires big changes...

-Andrea


Re: [RFC PATCH 0/3] cgroup: fsio throttle controller

2019-01-19 Thread Andrea Righi
On Fri, Jan 18, 2019 at 02:46:53PM -0500, Josef Bacik wrote:
> On Fri, Jan 18, 2019 at 07:44:03PM +0100, Andrea Righi wrote:
> > On Fri, Jan 18, 2019 at 11:35:31AM -0500, Josef Bacik wrote:
> > > On Fri, Jan 18, 2019 at 11:31:24AM +0100, Andrea Righi wrote:
> > > > This is a redesign of my old cgroup-io-throttle controller:
> > > > https://lwn.net/Articles/330531/
> > > > 
> > > > I'm resuming this old patch to point out a problem that I think is still
> > > > not solved completely.
> > > > 
> > > > = Problem =
> > > > 
> > > > The io.max controller works really well at limiting synchronous I/O
> > > > (READs), but a lot of I/O requests are initiated outside the context of
> > > > the process that is ultimately responsible for its creation (e.g.,
> > > > WRITEs).
> > > > 
> > > > Throttling at the block layer in some cases is too late and we may end
> > > > up slowing down processes that are not responsible for the I/O that
> > > > is being processed at that level.
> > > 
> > > How so?  The writeback threads are per-cgroup and have the cgroup stuff 
> > > set
> > > properly.  So if you dirty a bunch of pages, they are associated with your
> > > cgroup, and then writeback happens and it's done in the writeback thread
> > > associated with your cgroup and then that is throttled.  Then you are 
> > > throttled
> > > at balance_dirty_pages() because the writeout is taking longer.
> > 
> > Right, writeback is per-cgroup and slowing down writeback affects only
> > that specific cgroup, but, there are cases where other processes from
> > other cgroups may require to wait on that writeback to complete before
> > doing I/O (for example an fsync() to a file shared among different
> > cgroups). In this case we may end up blocking cgroups that shouldn't be
> > blocked, that looks like a priority-inversion problem. This is the
> > problem that I'm trying to address.
> 
> Well this case is a misconfiguration, you shouldn't be sharing files between
> cgroups.  But even if you are, fsync() is synchronous, we should be getting 
> the
> context from the process itself and thus should have its own rules applied.
> There's nothing we can do for outstanding IO, but that shouldn't be that much.
> That would need to be dealt with on a per-contoller basis.

OK, fair point. We shouldn't be sharing files between cgroups.

I'm still not sure if we can have similar issues with metadata I/O (that
may introduce latencies like the sync() scenario), I have to investigate
more and do more tests.

> 
> > 
> > > 
> > > I introduced the blk_cgroup_congested() stuff for paths that it's not 
> > > easy to
> > > clearly tie IO to the thing generating the IO, such as readahead and 
> > > such.  If
> > > you are running into this case that may be something worth using.  Course 
> > > it
> > > only works for io.latency now but there's no reason you can't add support 
> > > to it
> > > for io.max or whatever.
> > 
> > IIUC blk_cgroup_congested() is used in readahead I/O (and swap with
> > memcg), something like this: if the cgroup is already congested don't
> > generate extra I/O due to readahead. Am I right?
> 
> Yeah, but that's just how it's currently used, it can be used any which way we
> feel like.

I think it'd be very interesting to have the possibility to either
throttle I/O before writeback or during writeback. Right now we can only
throttle writeback. Maybe we can try to introduce some kind of dirty
page throttling controller using blk_cgroup_congested()... Opinions?
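
Just to make the idea above more concrete, here is a minimal sketch (purely
illustrative, not taken from any posted patch; the function name and the
back-off interval are made up) of what a dirty-page throttling hook based on
blk_cgroup_congested() could look like:

static void dirty_throttle_current(void)
{
        /*
         * Back off while the current task's blkcg is congested, so that
         * the task is throttled at page-dirtying time instead of being
         * slowed down only later, at writeback time.
         */
        while (blk_cgroup_congested()) {
                if (fatal_signal_pending(current))
                        break;
                schedule_timeout_interruptible(HZ / 10);
        }
}

Something like this would have to be called from the paths that dirty page
cache pages (e.g., near balance_dirty_pages()), but that placement is just a
guess at this stage.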

> 
> > 
> > > 
> > > > 
> > > > = Proposed solution =
> > > > 
> > > > The main idea of this controller is to split I/O measurement and I/O
> > > > throttling: I/O is measured at the block layer for READS, at page cache
> > > > (dirty pages) for WRITEs, and processes are limited while they're
> > > > generating I/O at the VFS level, based on the measured I/O.
> > > > 
> > > 
> > > This is what blk_cgroup_congested() is meant to accomplish, I would 
> > > suggest
> > > looking into that route and simply changing the existing io controller 
> > > you are
> > > using to take advantage of that so it will actually throttle things.  
> > > Then just
> > > sprinkle it around the areas where we indirectly generate IO.  Thanks,
> > 
> > Absolutely, I can probably use blk_cgroup_congested() as a method to
> &g

Re: [RFC PATCH 0/3] cgroup: fsio throttle controller

2019-01-18 Thread Andrea Righi
On Fri, Jan 18, 2019 at 06:07:45PM +0100, Paolo Valente wrote:
> 
> 
> > Il giorno 18 gen 2019, alle ore 17:35, Josef Bacik  
> > ha scritto:
> > 
> > On Fri, Jan 18, 2019 at 11:31:24AM +0100, Andrea Righi wrote:
> >> This is a redesign of my old cgroup-io-throttle controller:
> >> https://lwn.net/Articles/330531/
> >> 
> >> I'm resuming this old patch to point out a problem that I think is still
> >> not solved completely.
> >> 
> >> = Problem =
> >> 
> >> The io.max controller works really well at limiting synchronous I/O
> >> (READs), but a lot of I/O requests are initiated outside the context of
> >> the process that is ultimately responsible for its creation (e.g.,
> >> WRITEs).
> >> 
> >> Throttling at the block layer in some cases is too late and we may end
> >> up slowing down processes that are not responsible for the I/O that
> >> is being processed at that level.
> > 
> > How so?  The writeback threads are per-cgroup and have the cgroup stuff set
> > properly.  So if you dirty a bunch of pages, they are associated with your
> > cgroup, and then writeback happens and it's done in the writeback thread
> > associated with your cgroup and then that is throttled.  Then you are 
> > throttled
> > at balance_dirty_pages() because the writeout is taking longer.
> > 
> 
> IIUC, Andrea described this problem: certain processes in a certain group 
> dirty a
> lot of pages, causing write back to start.  Then some other blameless
> process in the same group experiences very high latency, in spite of
> the fact that it has to do little I/O.
> 
> Does your blk_cgroup_congested() stuff solves this issue?
> 
> Or simply I didn't get what Andrea meant at all :)
> 
> Thanks,
> Paolo

Yes, there is also this problem: provide fairness among processes
running inside the same cgroup.

This is a tough one, because once the I/O limit is reached whoever
process comes next gets punished, even if it hasn't done any I/O before.

Well, my proposal doesn't solve this problem. To solve this one in the
"throttling" scenario, we should probably save some information about
the previously generated I/O activity and apply a delay proportional to
that (like a dynamic weight for each process inside each cgroup), as sketched
below.
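
To sketch that "dynamic weight" idea (again, purely illustrative, nothing like
this exists in the posted patches and all names are made up): each task could
keep a decayed counter of the I/O it recently generated and pay a share of the
cgroup-wide delay proportional to that counter:

static unsigned long fsio_task_delay(u64 task_recent_bytes,
                                     u64 cgroup_recent_bytes,
                                     unsigned long cgroup_delay_jiffies)
{
        if (!cgroup_recent_bytes)
                return 0;
        /* tasks that generated more I/O sleep longer */
        return div64_u64(cgroup_delay_jiffies * task_recent_bytes,
                         cgroup_recent_bytes);
}

That way a process that did little I/O inside a throttled cgroup would see
only a small delay, even right after the cgroup limit has been hit.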

AFAICS the io.max controller doesn't solve this problem either.

-Andrea


Re: [RFC PATCH 0/3] cgroup: fsio throttle controller

2019-01-18 Thread Andrea Righi
On Fri, Jan 18, 2019 at 11:35:31AM -0500, Josef Bacik wrote:
> On Fri, Jan 18, 2019 at 11:31:24AM +0100, Andrea Righi wrote:
> > This is a redesign of my old cgroup-io-throttle controller:
> > https://lwn.net/Articles/330531/
> > 
> > I'm resuming this old patch to point out a problem that I think is still
> > not solved completely.
> > 
> > = Problem =
> > 
> > The io.max controller works really well at limiting synchronous I/O
> > (READs), but a lot of I/O requests are initiated outside the context of
> > the process that is ultimately responsible for its creation (e.g.,
> > WRITEs).
> > 
> > Throttling at the block layer in some cases is too late and we may end
> > up slowing down processes that are not responsible for the I/O that
> > is being processed at that level.
> 
> How so?  The writeback threads are per-cgroup and have the cgroup stuff set
> properly.  So if you dirty a bunch of pages, they are associated with your
> cgroup, and then writeback happens and it's done in the writeback thread
> associated with your cgroup and then that is throttled.  Then you are 
> throttled
> at balance_dirty_pages() because the writeout is taking longer.

Right, writeback is per-cgroup and slowing down writeback affects only
that specific cgroup, but there are cases where other processes from
other cgroups may need to wait for that writeback to complete before
doing I/O (for example an fsync() to a file shared among different
cgroups). In this case we may end up blocking cgroups that shouldn't be
blocked, that looks like a priority-inversion problem. This is the
problem that I'm trying to address.

> 
> I introduced the blk_cgroup_congested() stuff for paths that it's not easy to
> clearly tie IO to the thing generating the IO, such as readahead and such.  If
> you are running into this case that may be something worth using.  Course it
> only works for io.latency now but there's no reason you can't add support to 
> it
> for io.max or whatever.

IIUC blk_cgroup_congested() is used in readahead I/O (and swap with
memcg), something like this: if the cgroup is already congested don't
generate extra I/O due to readahead. Am I right?

> 
> > 
> > = Proposed solution =
> > 
> > The main idea of this controller is to split I/O measurement and I/O
> > throttling: I/O is measured at the block layer for READS, at page cache
> > (dirty pages) for WRITEs, and processes are limited while they're
> > generating I/O at the VFS level, based on the measured I/O.
> > 
> 
> This is what blk_cgroup_congested() is meant to accomplish, I would suggest
> looking into that route and simply changing the existing io controller you are
> using to take advantage of that so it will actually throttle things.  Then 
> just
> sprinkle it around the areas where we indirectly generate IO.  Thanks,

Absolutely, I can probably use blk_cgroup_congested() as a method to
determine when a cgroup should be throttled (instead of doing my own
I/O measuring), but to prevent the "slow writeback slowing down other
cgroups" issue I still need to apply throttling when pages are dirtied
in page cache.

Thanks,
-Andrea


Re: [RFC PATCH 0/3] cgroup: fsio throttle controller

2019-01-18 Thread Andrea Righi
On Fri, Jan 18, 2019 at 12:04:17PM +0100, Paolo Valente wrote:
> 
> 
> > Il giorno 18 gen 2019, alle ore 11:31, Andrea Righi 
> >  ha scritto:
> > 
> > This is a redesign of my old cgroup-io-throttle controller:
> > https://lwn.net/Articles/330531/
> > 
> > I'm resuming this old patch to point out a problem that I think is still
> > not solved completely.
> > 
> > = Problem =
> > 
> > The io.max controller works really well at limiting synchronous I/O
> > (READs), but a lot of I/O requests are initiated outside the context of
> > the process that is ultimately responsible for its creation (e.g.,
> > WRITEs).
> > 
> > Throttling at the block layer in some cases is too late and we may end
> > up slowing down processes that are not responsible for the I/O that
> > is being processed at that level.
> > 
> > = Proposed solution =
> > 
> > The main idea of this controller is to split I/O measurement and I/O
> > throttling: I/O is measured at the block layer for READS, at page cache
> > (dirty pages) for WRITEs, and processes are limited while they're
> > generating I/O at the VFS level, based on the measured I/O.
> > 
> 
> Hi Andrea,
> what the about the case where two processes are dirtying the same
> pages?  Which will be charged?
> 
> Thanks,
> Paolo

Hi Paolo,

in this case only the first one will be charged for the I/O activity
(the one that changes a page from clean to dirty). This is probably not
totally fair in some cases, but I think it's a good compromise: in the
end, rewriting the same page over and over while it's already dirty
doesn't actually generate I/O activity until the page is flushed back
to disk.
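
Just to make it more concrete, the accounting I have in mind triggers
only on the clean->dirty transition, something like this (simplified
sketch; fsio_charge_current() is just a made-up name for the accounting
step):

 static void fsio_account_dirty(struct page *page,
                                struct address_space *mapping)
 {
         /* already dirty: nothing new will be written back */
         if (PageDirty(page))
                 return;

         /* charge one page to the current task's cgroup */
         fsio_charge_current(bdev_to_dev(as_to_bdev(mapping)), PAGE_SIZE);
 }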

Obviously I'm open to other better ideas and suggestions.

Thanks!
-Andrea


[RFC PATCH 1/3] fsio-throttle: documentation

2019-01-18 Thread Andrea Righi
Document the filesystem I/O controller: description, usage, design,
etc.

Signed-off-by: Andrea Righi 
---
 Documentation/cgroup-v1/fsio-throttle.txt | 142 ++
 1 file changed, 142 insertions(+)
 create mode 100644 Documentation/cgroup-v1/fsio-throttle.txt

diff --git a/Documentation/cgroup-v1/fsio-throttle.txt 
b/Documentation/cgroup-v1/fsio-throttle.txt
new file mode 100644
index ..4f33cae2adea
--- /dev/null
+++ b/Documentation/cgroup-v1/fsio-throttle.txt
@@ -0,0 +1,142 @@
+
+ Filesystem I/O throttling controller
+
+--
+1. OVERVIEW
+
+This controller allows limiting the filesystem I/O of mounted devices for
+specific process containers (cgroups [1]), enforcing delays on the processes
+that exceed the limits defined for their cgroup.
+
+The goal of the filesystem I/O controller is to improve performance
+predictability from the applications' point of view and to provide performance
+isolation between control groups sharing the same filesystems.
+
+--
+2. DESIGN
+
+I/O activity generated by READs is evaluated at the block layer; WRITEs are
+evaluated when a page changes from clean to dirty (rewriting a page that was
+already dirty doesn't generate extra I/O activity).
+
+Throttling is always performed at the VFS layer.
+
+This solution has the advantage of always being able to determine the
+task/cgroup that originally generated the I/O request, and it prevents
+filesystem locking contention and potential priority inversion problems
+(for example, throttled journal I/O slowing down the entire system).
+
+The downside of this solution is that the controller is more fuzzy (compared to
+the blkio controller) and it allows I/O bursts that may happen at the I/O
+scheduler layer.
+
+--
+2.1. TOKEN BUCKET THROTTLING
+
+Tokens (I/O rate) ---.
+                     o
+                     o
+                     o
+                  .......   <--.
+                  \     /      | Bucket size (burst limit)
+                   \ooo/       |
+                    ---     <--'
+                     |ooo
+Incoming  ---------->|----------> Conforming
+I/O                  |oo          I/O
+requests  ---------->|----------> requests
+                     |
+                .....|.....
+
+Token bucket [2] throttling: tokens are added to the bucket at the configured
+I/O rate; the bucket can hold at most <bucket size> tokens (the burst limit);
+I/O requests are accepted if there are available tokens in the bucket; when a
+request of N bytes arrives, N tokens are removed from the bucket; if fewer than
+N tokens are available in the bucket, the request is delayed until a sufficient
+amount of tokens is available again in the bucket.
+
+--
+3. USER INTERFACE
+
+A new I/O limit (in MB/s) can be defined using the file:
+- fsio.max_mbs
+
+The syntax of a throttling policy is the following:
+
+":  "
+
+Examples:
+
+- set a maximum I/O rate of 10MB/s on /dev/sda (8:0), bucket size = 10MB:
+
+  # echo "8:0 10 10" > /sys/fs/cgroup/cg1/fsio.max_mbs
+
+- remove the I/O limit defined for /dev/sda (8:0):
+
+  # echo "8:0 0 0" > /sys/fs/cgroup/cg1/fsio.max_mbs
+
+--
+4. Additional parameters
+
+--
+4.1. Sleep timeslice
+
+The sleep timeslice is a configurable parameter that determines the minimum
+sleep time enforced on throttled tasks. Tasks will never be put to sleep
+for less than the sleep timeslice. Moreover, wakeup timers are always
+aligned to a multiple of the sleep timeslice.
+
+Increasing the sleep timeslice reduces the overhead of the controller: with
+coarser-grained control, fewer timers are created to wake up tasks, which means
+less softirq pressure and less overhead in the system. However, a bigger sleep
+timeslice makes the controller fuzzier, since throttled tasks receive fewer,
+longer throttling events.
+
+The parameter can be changed via:
+/sys/module/fsio_throttle/parameters/throttle_timeslice_ms
+
+The default value is 250 ms.
+
+Example:
+  - set the sleep timeslice to 1s:
+
+# echo 1000 > /sys/module/fsio_throttle/parameters/throttle_timeslice_ms
+
+--
+4.2. Sleep timeframe
+
+This parameter defines the maximum time a throttled task can be put to sleep.
+
+The parameter can be changed via:
+/sys/module/fsio_throttle/parameters/throttle_timeframe_ms
+
+The default value is 2 sec.
+
+Example:
+  - set the sleep timeframe to 5s:
+
+# echo 5000 > /sys/module/fsio_throttle/parameters/throttle_timeframe_ms
+
+4.3. Throttle kernel threads
+
+By default kernel threads are never throttled.

[RFC PATCH 0/3] cgroup: fsio throttle controller

2019-01-18 Thread Andrea Righi
This is a redesign of my old cgroup-io-throttle controller:
https://lwn.net/Articles/330531/

I'm resuming this old patch to point out a problem that I think is still
not solved completely.

= Problem =

The io.max controller works really well at limiting synchronous I/O
(READs), but a lot of I/O requests are initiated outside the context of
the process that is ultimately responsible for its creation (e.g.,
WRITEs).

Throttling at the block layer in some cases is too late and we may end
up slowing down processes that are not responsible for the I/O that
is being processed at that level.

= Proposed solution =

The main idea of this controller is to split I/O measurement and I/O
throttling: I/O is measured at the block layer for READS, at page cache
(dirty pages) for WRITEs, and processes are limited while they're
generating I/O at the VFS level, based on the measured I/O.

= Example =

Here's a trivial example: create 2 cgroups, set an io.max limit of
10MB/s, run a write-intensive workload on both and after a while, from a
root cgroup, run "sync".

 # cat /proc/self/cgroup
 0::/cg1
 # fio --rw=write --bs=1M --size=32M --numjobs=16 --name=seeker --time_based 
--runtime=30

 # cat /proc/self/cgroup
 0::/cg2
 # fio --rw=write --bs=1M --size=32M --numjobs=16 --name=seeker --time_based 
--runtime=30

 - io.max controller:

 # echo "259:0 rbps=10485760 wbps=10485760" > /sys/fs/cgroup/unified/cg1/io.max
 # echo "259:0 rbps=10485760 wbps=10485760" > /sys/fs/cgroup/unified/cg2/io.max

 # cat /proc/self/cgroup
 0::/
 # time sync

 real   0m51,241s
 user   0m0,000s
 sys0m0,113s

Ideally "sync" should complete almost immediately, because the root
cgroup is unlimited and it's not doing any I/O at all, but instead it's
blocked for more than 50 sec with io.max, because the writeback is
throttled to satisfy the io.max limits.

- fsio controller:

 # echo "259:0 10 10" > /sys/fs/cgroup/unified/cg1/fsio.max_mbs
 # echo "259:0 10 10" > /sys/fs/cgroup/unified/cg2/fsio.max_mbs

 [you can find details about the syntax in the documentation patch]

 # cat /proc/self/cgroup
 0::/
 # time sync

 real   0m0,146s
 user   0m0,003s
 sys0m0,001s

= Questions =

Q: Do we need another controller?
A: Probably not. I think it would be better to integrate this policy (or
   something similar) into the current blkio controller; this RFC is just meant
   to highlight the problem and get some ideas on how to address it.

Q: What about proportional limits / latency?
A: It should be trivial to add latency-based limits if we integrate this in the
   current I/O controller. As for proportional limits (weights): they're
   strictly related to I/O scheduling, and since this controller doesn't touch
   I/O dispatching policies it's not trivial to implement them
   (bandwidth limiting is definitely more straightforward).

Q: Applying delays at the VFS layer doesn't prevent I/O spikes during
   writeback, right?
A: Correct, the tradeoff here is to tolerate I/O bursts during writeback to
   avoid priority inversion problems in the system.

Andrea Righi (3):
  fsio-throttle: documentation
  fsio-throttle: controller infrastructure
  fsio-throttle: instrumentation

 Documentation/cgroup-v1/fsio-throttle.txt | 142 +
 block/blk-core.c  |  10 +
 include/linux/cgroup_subsys.h |   4 +
 include/linux/fsio-throttle.h |  43 +++
 include/linux/writeback.h |   7 +-
 init/Kconfig  |  11 +
 kernel/cgroup/Makefile|   1 +
 kernel/cgroup/fsio-throttle.c | 501 ++
 mm/filemap.c  |  20 +-
 mm/page-writeback.c   |  14 +-
 10 files changed, 749 insertions(+), 4 deletions(-)



[RFC PATCH 3/3] fsio-throttle: instrumentation

2019-01-18 Thread Andrea Righi
Apply the fsio controller to the opportune kernel functions to evaluate
and throttle filesystem I/O.

Signed-off-by: Andrea Righi 
---
 block/blk-core.c  | 10 ++
 include/linux/writeback.h |  7 ++-
 mm/filemap.c  | 20 +++-
 mm/page-writeback.c   | 14 --
 4 files changed, 47 insertions(+), 4 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 3c5f61ceeb67..4b4717f64ac1 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -16,6 +16,7 @@
 #include 
 #include 
 #include 
+#include <linux/fsio-throttle.h>
 #include 
 #include 
 #include 
@@ -956,6 +957,15 @@ generic_make_request_checks(struct bio *bio)
 */
create_io_context(GFP_ATOMIC, q->node);
 
+   /*
+* Account only READs at this layer (WRITEs are accounted and throttled
+* in balance_dirty_pages()) and don't enforce sleeps (state=0): in this
+* way we can prevent potential lock contentions and priority inversion
+* problems at the filesystem layer.
+*/
+   if (bio_op(bio) == REQ_OP_READ)
+   fsio_throttle(bio_dev(bio), bio->bi_iter.bi_size, 0);
+
if (!blkcg_bio_issue_check(q, bio))
return false;
 
diff --git a/include/linux/writeback.h b/include/linux/writeback.h
index 738a0c24874f..1e161c7969e5 100644
--- a/include/linux/writeback.h
+++ b/include/linux/writeback.h
@@ -356,7 +356,12 @@ void global_dirty_limits(unsigned long *pbackground, 
unsigned long *pdirty);
 unsigned long wb_calc_thresh(struct bdi_writeback *wb, unsigned long thresh);
 
 void wb_update_bandwidth(struct bdi_writeback *wb, unsigned long start_time);
-void balance_dirty_pages_ratelimited(struct address_space *mapping);
+
+#define balance_dirty_pages_ratelimited(__mapping) \
+   __balance_dirty_pages_ratelimited(__mapping, false)
+void __balance_dirty_pages_ratelimited(struct address_space *mapping,
+  bool redirty);
+
 bool wb_over_bg_thresh(struct bdi_writeback *wb);
 
 typedef int (*writepage_t)(struct page *page, struct writeback_control *wbc,
diff --git a/mm/filemap.c b/mm/filemap.c
index 9f5e323e883e..5cc0959274d6 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -29,6 +29,7 @@
 #include 
 #include 
 #include 
+#include <linux/fsio-throttle.h>
 #include 
 #include 
 #include 
@@ -2040,6 +2041,7 @@ static ssize_t generic_file_buffered_read(struct kiocb 
*iocb,
 {
struct file *filp = iocb->ki_filp;
struct address_space *mapping = filp->f_mapping;
+   struct block_device *bdev = as_to_bdev(mapping);
struct inode *inode = mapping->host;
	struct file_ra_state *ra = &filp->f_ra;
	loff_t *ppos = &iocb->ki_pos;
@@ -2068,6 +2070,7 @@ static ssize_t generic_file_buffered_read(struct kiocb 
*iocb,
 
cond_resched();
 find_page:
+   fsio_throttle(bdev_to_dev(bdev), 0, TASK_INTERRUPTIBLE);
if (fatal_signal_pending(current)) {
error = -EINTR;
goto out;
@@ -2308,11 +2311,17 @@ generic_file_read_iter(struct kiocb *iocb, struct 
iov_iter *iter)
if (iocb->ki_flags & IOCB_DIRECT) {
struct file *file = iocb->ki_filp;
struct address_space *mapping = file->f_mapping;
+   struct block_device *bdev = as_to_bdev(mapping);
struct inode *inode = mapping->host;
loff_t size;
 
size = i_size_read(inode);
if (iocb->ki_flags & IOCB_NOWAIT) {
+   unsigned long long sleep;
+
+   sleep = fsio_throttle(bdev_to_dev(bdev), 0, 0);
+   if (sleep)
+   return -EAGAIN;
if (filemap_range_has_page(mapping, iocb->ki_pos,
   iocb->ki_pos + count - 1))
return -EAGAIN;
@@ -2322,6 +2331,7 @@ generic_file_read_iter(struct kiocb *iocb, struct 
iov_iter *iter)
iocb->ki_pos + count - 1);
if (retval < 0)
goto out;
+   fsio_throttle(bdev_to_dev(bdev), 0, TASK_INTERRUPTIBLE);
}
 
file_accessed(file);
@@ -2366,9 +2376,11 @@ EXPORT_SYMBOL(generic_file_read_iter);
 static int page_cache_read(struct file *file, pgoff_t offset, gfp_t gfp_mask)
 {
struct address_space *mapping = file->f_mapping;
+   struct block_device *bdev = as_to_bdev(mapping);
struct page *page;
int ret;
 
+   fsio_throttle(bdev_to_dev(bdev), 0, TASK_INTERRUPTIBLE);
do {
page = __page_cache_alloc(gfp_mask);
if (!page)
@@ -2498,11 +2510,15 @@ vm_fault_t filemap_fault(struct vm_fault *vmf)
 */
page = find_get_page(mapping, offset);
if (likely(page) && !(vmf->

[RFC PATCH 2/3] fsio-throttle: controller infrastructure

2019-01-18 Thread Andrea Righi
This is the core of the fsio-throttle controller: it defines the
interface to the cgroup subsystem and implements the I/O measurement and
throttling logic.

Signed-off-by: Andrea Righi 
---
 include/linux/cgroup_subsys.h |   4 +
 include/linux/fsio-throttle.h |  43 +++
 init/Kconfig  |  11 +
 kernel/cgroup/Makefile|   1 +
 kernel/cgroup/fsio-throttle.c | 501 ++
 5 files changed, 560 insertions(+)
 create mode 100644 include/linux/fsio-throttle.h
 create mode 100644 kernel/cgroup/fsio-throttle.c

diff --git a/include/linux/cgroup_subsys.h b/include/linux/cgroup_subsys.h
index acb77dcff3b4..33beb70c0eca 100644
--- a/include/linux/cgroup_subsys.h
+++ b/include/linux/cgroup_subsys.h
@@ -61,6 +61,10 @@ SUBSYS(pids)
 SUBSYS(rdma)
 #endif
 
+#if IS_ENABLED(CONFIG_CGROUP_FSIO_THROTTLE)
+SUBSYS(fsio)
+#endif
+
 /*
  * The following subsystems are not supported on the default hierarchy.
  */
diff --git a/include/linux/fsio-throttle.h b/include/linux/fsio-throttle.h
new file mode 100644
index ..3a46df712475
--- /dev/null
+++ b/include/linux/fsio-throttle.h
@@ -0,0 +1,43 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+
+#ifndef __FSIO_THROTTLE_H__
+#define __FSIO_THROTTLE_H__
+
+#include 
+#include 
+
+#ifdef CONFIG_BLOCK
+static inline dev_t bdev_to_dev(struct block_device *bdev)
+{
+   return bdev ? MKDEV(MAJOR(bdev->bd_inode->i_rdev),
+   bdev->bd_disk->first_minor) : 0;
+}
+
+static inline struct block_device *as_to_bdev(struct address_space *mapping)
+{
+   return (mapping->host && mapping->host->i_sb->s_bdev) ?
+   mapping->host->i_sb->s_bdev : NULL;
+}
+#else /* CONFIG_BLOCK */
+static dev_t bdev_to_dev(struct block_device *bdev)
+{
+   return 0;
+}
+
+static inline struct block_device *as_to_bdev(struct address_space *mapping)
+{
+   return NULL;
+}
+#endif /* CONFIG_BLOCK */
+
+#ifdef CONFIG_CGROUP_FSIO_THROTTLE
+int fsio_throttle(dev_t dev, ssize_t bytes, int state);
+#else /* CONFIG_CGROUP_FSIO_THROTTLE */
+static inline int
+fsio_throttle(dev_t dev, ssize_t bytes, int state)
+{
+   return 0;
+}
+#endif /* CONFIG_CGROUP_FSIO_THROTTLE */
+
+#endif /* __FSIO_THROTTLE_H__ */
diff --git a/init/Kconfig b/init/Kconfig
index d47cb77a220e..95d7342801eb 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -775,6 +775,17 @@ config CGROUP_WRITEBACK
depends on MEMCG && BLK_CGROUP
default y
 
+config CGROUP_FSIO_THROTTLE
+   bool "Filesystem I/O throttling controller"
+   default n
+   depends on BLOCK
+   help
+ This option enables filesystem I/O throttling infrastructure.
+
+ This allows reads and writes to be properly throttled at the
+ filesystem level, without introducing I/O locking contention or
+ priority inversion problems.
+
 menuconfig CGROUP_SCHED
bool "CPU controller"
default n
diff --git a/kernel/cgroup/Makefile b/kernel/cgroup/Makefile
index bfcdae896122..12de828b36cd 100644
--- a/kernel/cgroup/Makefile
+++ b/kernel/cgroup/Makefile
@@ -2,6 +2,7 @@
 obj-y := cgroup.o rstat.o namespace.o cgroup-v1.o
 
 obj-$(CONFIG_CGROUP_FREEZER) += freezer.o
+obj-$(CONFIG_CGROUP_FSIO_THROTTLE) += fsio-throttle.o
 obj-$(CONFIG_CGROUP_PIDS) += pids.o
 obj-$(CONFIG_CGROUP_RDMA) += rdma.o
 obj-$(CONFIG_CPUSETS) += cpuset.o
diff --git a/kernel/cgroup/fsio-throttle.c b/kernel/cgroup/fsio-throttle.c
new file mode 100644
index ..46f3ffd4015b
--- /dev/null
+++ b/kernel/cgroup/fsio-throttle.c
@@ -0,0 +1,501 @@
+// SPDX-License-Identifier: GPL-2.0
+
+/*
+ * fsio-throttle.c - I/O cgroup controller
+ *
+ * Copyright (C) 2019 Andrea Righi 
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#define KB(x)   ((x) * 1024)
+#define MB(x)   (KB(KB(x)))
+#define GB(x)   (MB(KB(x)))
+
+static int throttle_kernel_threads __read_mostly;
+module_param(throttle_kernel_threads, int, 0644);
+MODULE_PARM_DESC(throttle_kernel_threads,
+ "enable/disable I/O throttling for kernel threads");
+
+static int throttle_timeslice_ms __read_mostly = 250;
+module_param(throttle_timeslice_ms, int, 0644);
+MODULE_PARM_DESC(throttle_timeslice_ms,
+ "throttling time slice (default 250ms)");
+
+static int throttle_timeframe_ms __read_mostly = 2000;
+module_param(throttle_timeframe_ms, int, 0644);
+MODULE_PARM_DESC(throttle_timeframe_ms,
+ "maximum sleep time enforced (default 2000ms)");
+
+struct iothrottle {
+   struct cgroup_subsys_state css;
+   struct list_head list;
+   /* protect the list of iothrottle_node elements (list) */
+   struct mutex lock;
+   wait_queue_head_t wait;
+   struct timer_list timer;
+   bool timer_cancel;
+   /* protect the wait queue elements */
+   spinlock_t 

Re: [PATCH v2 0/9] kprobes: Fix and improve blacklist symbols

2019-01-12 Thread Andrea Righi
On Sat, Jan 12, 2019 at 11:25:40AM +0900, Masami Hiramatsu wrote:
...
> And I found several functions which must be blacklisted.
>  - optprobe template code, which is just a template code and
>never be executed. Moreover, since it can be copied and
>reused, if we probe it, it modifies the template code and
>can cause a crash. ([1/9][2/9])
>  - functions which is called before kprobe_int3_handler()
>handles kprobes. This can cause a breakpoint recursion. ([3/9])
>  - IRQ entry text, which should not be probed since register/pagetable
>status has not been stable at that point. ([4/9])
>  - Suffixed symbols, like .constprop, .part etc. Those suffixed
>symbols never be blacklisted even if the non-suffixed version
>has been blacklisted. ([5/9])
>  - hardirq tracer also works before int3 handling. ([6/9])
>  - preempt_check debug function also is involved in int3 handling.
>([7/9])
>  - RCU debug routine is also called before kprobe_int3_handler().
>([8/9])
>  - Some lockdep functions are also involved in int3 handling.
>([9/9])
> 
> Of course there still may be some functions which can be called
> by configuration change, I'll continue to test it.

Hi Masami,

I think I've found another recursion problem. Could you include also
this one?

Thanks,

From: Andrea Righi 
Subject: [PATCH] kprobes: prohibit probing on bsearch()

Since the kprobe breakpoint handler uses bsearch(), probing this
routine can cause a recursive breakpoint problem.

int3
 ->do_int3()
   ->ftrace_int3_handler()
 ->ftrace_location()
   ->ftrace_location_range()
     ->bsearch() -> int3

Prohibit probing on bsearch().

Signed-off-by: Andrea Righi 
---
 lib/bsearch.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/lib/bsearch.c b/lib/bsearch.c
index 18b445b010c3..82512fe7b33c 100644
--- a/lib/bsearch.c
+++ b/lib/bsearch.c
@@ -11,6 +11,7 @@
 
 #include <linux/export.h>
 #include <linux/bsearch.h>
+#include <linux/kprobes.h>
 
 /*
  * bsearch - binary search an array of elements
@@ -53,3 +54,4 @@ void *bsearch(const void *key, const void *base, size_t num, 
size_t size,
return NULL;
 }
 EXPORT_SYMBOL(bsearch);
+NOKPROBE_SYMBOL(bsearch);
-- 
2.17.1



[PATCH v2] tracing/kprobes: fix NULL pointer dereference in trace_kprobe_create()

2019-01-10 Thread Andrea Righi
It is possible to trigger a NULL pointer dereference by writing an
incorrectly formatted string to kprobe_events (trying to create a
kretprobe while omitting the symbol).

Example:

 echo "r:event_1 " >> /sys/kernel/debug/tracing/kprobe_events

That triggers this:

 BUG: unable to handle kernel NULL pointer dereference at 
 #PF error: [normal kernel read fault]
 PGD 0 P4D 0
 Oops:  [#1] SMP PTI
 CPU: 6 PID: 1757 Comm: bash Not tainted 5.0.0-rc1+ #125
 Hardware name: Dell Inc. XPS 13 9370/0F6P3V, BIOS 1.5.1 08/09/2018
 RIP: 0010:kstrtoull+0x2/0x20
 Code: 28 00 00 00 75 17 48 83 c4 18 5b 41 5c 5d c3 b8 ea ff ff ff eb e1 b8 de 
ff ff ff eb da e8 d6 36 bb ff 66 0f 1f 44 00 00 31 c0 <80> 3f 2b 55 48 89 e5 0f 
94 c0 48 01 c7 e8 5c ff ff ff 5d c3 66 2e
 RSP: 0018:b5d482e57cb8 EFLAGS: 00010246
 RAX:  RBX: 0001 RCX: 82b12720
 RDX: b5d482e57cf8 RSI:  RDI: 
 RBP: b5d482e57d70 R08: a0c05e5a7080 R09: a0c05e003980
 R10:  R11: 4000 R12: a0c04fe87b08
 R13: 0001 R14: 000b R15: a0c058d749e1
 FS:  7f137c7f7740() GS:a0c05e58() knlGS:
 CS:  0010 DS:  ES:  CR0: 80050033
 CR2:  CR3: 000497d46004 CR4: 003606e0
 Call Trace:
  ? trace_kprobe_create+0xb6/0x840
  ? _cond_resched+0x19/0x40
  ? _cond_resched+0x19/0x40
  ? __kmalloc+0x62/0x210
  ? argv_split+0x8f/0x140
  ? trace_kprobe_create+0x840/0x840
  ? trace_kprobe_create+0x840/0x840
  create_or_delete_trace_kprobe+0x11/0x30
  trace_run_command+0x50/0x90
  trace_parse_run_command+0xc1/0x160
  probes_write+0x10/0x20
  __vfs_write+0x3a/0x1b0
  ? apparmor_file_permission+0x1a/0x20
  ? security_file_permission+0x31/0xf0
  ? _cond_resched+0x19/0x40
  vfs_write+0xb1/0x1a0
  ksys_write+0x55/0xc0
  __x64_sys_write+0x1a/0x20
  do_syscall_64+0x5a/0x120
  entry_SYSCALL_64_after_hwframe+0x44/0xa9

Fix by doing the proper argument checks in trace_kprobe_create().

Link: 
https://lore.kernel.org/lkml/20190111095108.b79a2ee026185cbd62365...@kernel.org
Fixes: 6212dd29683e ("tracing/kprobes: Use dyn_event framework for kprobe 
events")
Cc: sta...@vger.kernel.org
Signed-off-by: Andrea Righi 
Signed-off-by: Masami Hiramatsu 
---
 v2: argument check refactoring

 kernel/trace/trace_kprobe.c | 12 +---
 1 file changed, 9 insertions(+), 3 deletions(-)

diff --git a/kernel/trace/trace_kprobe.c b/kernel/trace/trace_kprobe.c
index 5c19b8c41c7e..d5fb09ebba8b 100644
--- a/kernel/trace/trace_kprobe.c
+++ b/kernel/trace/trace_kprobe.c
@@ -607,11 +607,17 @@ static int trace_kprobe_create(int argc, const char 
*argv[])
char buf[MAX_EVENT_NAME_LEN];
unsigned int flags = TPARG_FL_KERNEL;
 
-   /* argc must be >= 1 */
-   if (argv[0][0] == 'r') {
+   switch (argv[0][0]) {
+   case 'r':
is_return = true;
flags |= TPARG_FL_RETURN;
-   } else if (argv[0][0] != 'p' || argc < 2)
+   break;
+   case 'p':
+   break;
+   default:
+   return -ECANCELED;
+   }
+   if (argc < 2)
return -ECANCELED;
 
	event = strchr(&argv[0][1], ':');
-- 
2.17.1



[PATCH] tracing/kprobes: fix NULL pointer dereference in trace_kprobe_create()

2019-01-10 Thread Andrea Righi
It is possible to trigger a NULL pointer dereference by writing an
incorrectly formatted string to kprobe_events (omitting the symbol).

Example:

 echo "r:event_1 " >> /sys/kernel/debug/tracing/kprobe_events

That triggers this:

 BUG: unable to handle kernel NULL pointer dereference at 
 #PF error: [normal kernel read fault]
 PGD 0 P4D 0
 Oops:  [#1] SMP PTI
 CPU: 6 PID: 1757 Comm: bash Not tainted 5.0.0-rc1+ #125
 Hardware name: Dell Inc. XPS 13 9370/0F6P3V, BIOS 1.5.1 08/09/2018
 RIP: 0010:kstrtoull+0x2/0x20
 Code: 28 00 00 00 75 17 48 83 c4 18 5b 41 5c 5d c3 b8 ea ff ff ff eb e1 b8 de 
ff ff ff eb da e8 d6 36 bb ff 66 0f 1f 44 00 00 31 c0 <80> 3f 2b 55 48 89 e5 0f 
94 c0 48 01 c7 e8 5c ff ff ff 5d c3 66 2e
 RSP: 0018:b5d482e57cb8 EFLAGS: 00010246
 RAX:  RBX: 0001 RCX: 82b12720
 RDX: b5d482e57cf8 RSI:  RDI: 
 RBP: b5d482e57d70 R08: a0c05e5a7080 R09: a0c05e003980
 R10:  R11: 4000 R12: a0c04fe87b08
 R13: 0001 R14: 000b R15: a0c058d749e1
 FS:  7f137c7f7740() GS:a0c05e58() knlGS:
 CS:  0010 DS:  ES:  CR0: 80050033
 CR2:  CR3: 000497d46004 CR4: 003606e0
 Call Trace:
  ? trace_kprobe_create+0xb6/0x840
  ? _cond_resched+0x19/0x40
  ? _cond_resched+0x19/0x40
  ? __kmalloc+0x62/0x210
  ? argv_split+0x8f/0x140
  ? trace_kprobe_create+0x840/0x840
  ? trace_kprobe_create+0x840/0x840
  create_or_delete_trace_kprobe+0x11/0x30
  trace_run_command+0x50/0x90
  trace_parse_run_command+0xc1/0x160
  probes_write+0x10/0x20
  __vfs_write+0x3a/0x1b0
  ? apparmor_file_permission+0x1a/0x20
  ? security_file_permission+0x31/0xf0
  ? _cond_resched+0x19/0x40
  vfs_write+0xb1/0x1a0
  ksys_write+0x55/0xc0
  __x64_sys_write+0x1a/0x20
  do_syscall_64+0x5a/0x120
  entry_SYSCALL_64_after_hwframe+0x44/0xa9

Fix by doing the proper argument check when a NULL symbol is passed in
trace_kprobe_create().

Signed-off-by: Andrea Righi 
---
 kernel/trace/trace_kprobe.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/kernel/trace/trace_kprobe.c b/kernel/trace/trace_kprobe.c
index 5c19b8c41c7e..76410ceeff50 100644
--- a/kernel/trace/trace_kprobe.c
+++ b/kernel/trace/trace_kprobe.c
@@ -644,6 +644,8 @@ static int trace_kprobe_create(int argc, const char *argv[])
 
/* try to parse an address. if that fails, try to read the
 * input as a symbol. */
+   if (!argv[1])
+   return -EINVAL;
	if (kstrtoul(argv[1], 0, (unsigned long *)&addr)) {
/* Check whether uprobe event specified */
if (strchr(argv[1], '/') && strchr(argv[1], ':'))
-- 
2.17.1



Re: [PATCH v2 0/3] kprobes: Fix kretprobe issues

2019-01-08 Thread Andrea Righi
On Tue, Jan 08, 2019 at 01:43:55PM +0900, Masami Hiramatsu wrote:
> Hello,
> 
> This is v2 series of fixing kretprobe incorrect stacking order patches.
> In this version, I fixed a lack of kprobes.h including and added new
> patch for kretprobe trampoline recursion issue. (and add Cc:stable)
> 
> (1) kprobe incorrct stacking order problem
> 
> On recent talk with Andrea, I started more precise investigation on
> the kernel panic with kretprobes on notrace functions, which Francis
> had been reported last year ( https://lkml.org/lkml/2017/7/14/466 ).
> 
> See the investigation details in 
> https://lkml.kernel.org/r/154686789378.15479.2886543882215785247.stgit@devbox
> 
> When we put a kretprobe on ftrace_ops_assist_func() and put another
> kretprobe on probed-function, below happens
> 
> 
>  ->
>->fentry
> ->ftrace_ops_assist_func()
>  ->int3
>   ->kprobe_int3_handler()
>   ...->pre_handler_kretprobe()
>push the return address (*fentry*) of ftrace_ops_assist_func() to
>top of the kretprobe list and replace it with kretprobe_trampoline.
>   <-kprobe_int3_handler()
>  <-(int3)
>  ->kprobe_ftrace_handler()
>   ...->pre_handler_kretprobe()
>push the return address (caller) of probed-function to top of the
>kretprobe list and replace it with kretprobe_trampoline.
>  <-(kprobe_ftrace_handler())
> <-(ftrace_ops_assist_func())
> [kretprobe_trampoline]
>  ->tampoline_handler()
>pop the return address (caller) from top of the kretprobe list
>  <-(trampoline_handler())
> 
> [run caller with incorrect stack information]
><-()
>   !!KERNEL PANIC!!
> 
> Therefore, this kernel panic happens only when we put 2 k*ret*probes on
> ftrace_ops_assist_func() and other functions. If we put kprobes, it
> doesn't cause any issue, since it doesn't change the return address.
> 
> To fix (or just avoid) this issue, we can introduce a frame pointer
> verification to skip wrong order entries. And I also would like to
> blacklist those functions because those are part of ftrace-based 
> kprobe handling routine.
> 
> (2) kretprobe trampoline recursion problem
> 
> This was found by Andrea in the previous thread
> https://lkml.kernel.org/r/20190107183444.GA5966@xps-13
> 
> 
>  echo "r:event_1 __fdget" >> kprobe_events
>  echo "r:event_2 _raw_spin_lock_irqsave" >> kprobe_events
>  echo 1 > events/kprobes/enable
>  [DEADLOCK]
> 
> 
> Because kretprobe trampoline_handler uses spinlock for protecting
> hash table, if we probe the spinlock itself, it causes deadlock.
> Thank you Andrea and Steve for discovering this root cause!!
> 
> This bug has been introduced with the asm-coded trampoline
> code, since previously it used another kprobe for hooking
> the function return placeholder (which only has a nop) and
> trampoline handler was called from that kprobe.
> 
> To fix this bug, I introduced a dummy kprobe and set it in
> current_kprobe as we did in old days.
> 
> Thank you,

It looks all good to me, with this patch set I couldn't break the
kernel in any way.

Tested-by: Andrea Righi 

Thanks,
-Andrea


Re: [PATCH 0/2] kprobes: Fix kretprobe incorrect stacking order problem

2019-01-07 Thread Andrea Righi
On Mon, Jan 07, 2019 at 04:28:33PM -0500, Steven Rostedt wrote:
> On Mon, 7 Jan 2019 22:19:04 +0100
> Andrea Righi  wrote:
> 
> > > > If we put a kretprobe to raw_spin_lock_irqsave() it looks like
> > > > kretprobe is going to call kretprobe...  
> > > 
> > > Right, but we should be able to add some recursion protection to stop
> > > that. I have similar protection in the ftrace code.  
> > 
> > If we assume that __raw_spin_lock/unlock*() are always inlined a
> 
> I wouldn't assume that.
> 
> > possible way to prevent this recursion could be to use directly those
> > functions to do locking from the kretprobe trampoline.
> > 
> > But I'm not sure if that's a safe assumption... if not I'll see if I can
> > find a better solution.
> 
> All you need to do is have a per_cpu variable, where you just do:
> 
>   preempt_disable_notrace();
>   if (this_cpu_read(kprobe_recursion))
>   goto out;
>   this_cpu_inc(kprobe_recursion);
>   [...]
>   this_cpu_dec(kprobe_recursion);
> out:
>   preempt_enable_notrace();
> 
> And then just ignore any kprobes that trigger while you are processing
> the current kprobe.
> 
> Something like that. If you want (or if it already happens) replace
> preempt_disable() with local_irq_save().

Oh.. definitely much better. I'll work on that and send a new patch.
Thanks for the suggestion!
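
Just to double-check that I understood the idea, this is roughly the
helper I have in mind (rough, untested sketch; the names are mine, not
from any existing patch):

 static DEFINE_PER_CPU(int, kretprobe_busy);

 static bool kretprobe_enter(void)
 {
         preempt_disable_notrace();
         if (this_cpu_read(kretprobe_busy)) {
                 /* nested kretprobe: tell the caller to bail out */
                 preempt_enable_notrace();
                 return false;
         }
         this_cpu_inc(kretprobe_busy);
         return true;
 }

 static void kretprobe_exit(void)
 {
         this_cpu_dec(kretprobe_busy);
         preempt_enable_notrace();
 }

so that a nested kretprobe hit (e.g., on the spinlock functions used by
the trampoline) would just be ignored instead of recursing.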

-Andrea


Re: [PATCH 0/2] kprobes: Fix kretprobe incorrect stacking order problem

2019-01-07 Thread Andrea Righi
On Mon, Jan 07, 2019 at 02:59:18PM -0500, Steven Rostedt wrote:
> On Mon, 7 Jan 2019 20:52:09 +0100
> Andrea Righi  wrote:
> 
> > > Ug, kretprobe calls spinlocks in the callback? I wonder if we can
> > > remove them.
> > > 
> > > I'm guessing this is a different issue than the one that this patch
> > > fixes. This sounds like we are calling kretprobe from kretprobe?
> > > 
> > > -- Steve  
> > 
> > kretprobe_trampoline()
> >   -> trampoline_handler()
> > -> kretprobe_hash_lock()
> >   -> raw_spin_lock_irqsave()  
> > 
> > If we put a kretprobe to raw_spin_lock_irqsave() it looks like
> > kretprobe is going to call kretprobe...
> 
> Right, but we should be able to add some recursion protection to stop
> that. I have similar protection in the ftrace code.

If we assume that __raw_spin_lock/unlock*() are always inlined a
possible way to prevent this recursion could be to use directly those
functions to do locking from the kretprobe trampoline.

But I'm not sure if that's a safe assumption... if not I'll see if I can
find a better solution.

Thanks,

From: Andrea Righi 
Subject: [PATCH] kprobes: prevent recursion deadlock with kretprobe and
 spinlocks

kretprobe_trampoline() uses a spinlock to protect the hash table of
kretprobes. Adding a kretprobe to the spinlock functions may cause
a recursion deadlock where the kretprobe ends up calling itself:

 kretprobe_trampoline()
   -> trampoline_handler()
 -> kretprobe_hash_lock()
   -> raw_spin_lock_irqsave()
 -> _raw_spin_lock_irqsave()
 kretprobe_trampoline from _raw_spin_lock_irqsave => DEADLOCK

 kretprobe_trampoline()
   -> trampoline_handler()
 -> recycle_rp_inst()
   -> raw_spin_lock()
 -> _raw_spin_lock()
 kretprobe_trampoline from _raw_spin_lock => DEADLOCK

Use the corresponding inlined spinlock functions to prevent this
recursion.

Signed-off-by: Andrea Righi 
---
 kernel/kprobes.c | 8 
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/kernel/kprobes.c b/kernel/kprobes.c
index f4ddfdd2d07e..b89bef5e3d80 100644
--- a/kernel/kprobes.c
+++ b/kernel/kprobes.c
@@ -1154,9 +1154,9 @@ void recycle_rp_inst(struct kretprobe_instance *ri,
	hlist_del(&ri->hlist);
	INIT_HLIST_NODE(&ri->hlist);
	if (likely(rp)) {
-		raw_spin_lock(&rp->lock);
+		__raw_spin_lock(&rp->lock);
		hlist_add_head(&ri->hlist, &rp->free_instances);
-		raw_spin_unlock(&rp->lock);
+		__raw_spin_unlock(&rp->lock);
	} else
		/* Unregistering */
		hlist_add_head(&ri->hlist, head);
@@ -1172,7 +1172,7 @@ __acquires(hlist_lock)
 
	*head = &kretprobe_inst_table[hash];
hlist_lock = kretprobe_table_lock_ptr(hash);
-   raw_spin_lock_irqsave(hlist_lock, *flags);
+   *flags = __raw_spin_lock_irqsave(hlist_lock);
 }
 NOKPROBE_SYMBOL(kretprobe_hash_lock);
 
@@ -1193,7 +1193,7 @@ __releases(hlist_lock)
raw_spinlock_t *hlist_lock;
 
hlist_lock = kretprobe_table_lock_ptr(hash);
-   raw_spin_unlock_irqrestore(hlist_lock, *flags);
+   __raw_spin_unlock_irqrestore(hlist_lock, *flags);
 }
 NOKPROBE_SYMBOL(kretprobe_hash_unlock);
 
-- 
2.17.1



Re: [PATCH 0/2] kprobes: Fix kretprobe incorrect stacking order problem

2019-01-07 Thread Andrea Righi
On Mon, Jan 07, 2019 at 02:27:49PM -0500, Steven Rostedt wrote:
> On Mon, 7 Jan 2019 19:34:44 +0100
> Andrea Righi  wrote:
> 
> > On Mon, Jan 07, 2019 at 10:31:34PM +0900, Masami Hiramatsu wrote:
> > ...
> > > BTW, this is not all of issues. To remove CONFIG_KPROBE_EVENTS_ON_NOTRACE
> > > I'm trying to find out other notrace functions which can cause
> > > kernel crash by probing. Mostly done on x86, so I'll post it
> > > after this series.  
> > 
> > Not sure if you found it already, but it looks like some of the
> > _raw_spin_lock/unlock* functions (when they're not inlined) are causing
> > the same problem (or something similar), I can deadlock the system by
> > doing this for example:
> > 
> >  echo "r:event_1 __fdget" >> kprobe_events
> >  echo "r:event_2 _raw_spin_lock_irqsave" >> kprobe_events
> >  echo 1 > events/kprobes/enable
> >  [DEADLOCK]
> > 
> > Sending the following just in case...
> >
> 
> Ug, kretprobe calls spinlocks in the callback? I wonder if we can
> remove them.
> 
> I'm guessing this is a different issue than the one that this patch
> fixes. This sounds like we are calling kretprobe from kretprobe?
> 
> -- Steve

kretprobe_trampoline()
  -> trampoline_handler()
-> kretprobe_hash_lock()
  -> raw_spin_lock_irqsave()

If we put a kretprobe to raw_spin_lock_irqsave() it looks like
kretprobe is going to call kretprobe...

-Andrea


Re: [PATCH 0/2] kprobes: Fix kretprobe incorrect stacking order problem

2019-01-07 Thread Andrea Righi
On Mon, Jan 07, 2019 at 10:31:34PM +0900, Masami Hiramatsu wrote:
...
> BTW, this is not all of issues. To remove CONFIG_KPROBE_EVENTS_ON_NOTRACE
> I'm trying to find out other notrace functions which can cause
> kernel crash by probing. Mostly done on x86, so I'll post it
> after this series.

Not sure if you found it already, but it looks like some of the
_raw_spin_lock/unlock* functions (when they're not inlined) are causing
the same problem (or something similar), I can deadlock the system by
doing this for example:

 echo "r:event_1 __fdget" >> kprobe_events
 echo "r:event_2 _raw_spin_lock_irqsave" >> kprobe_events
 echo 1 > events/kprobes/enable
 [DEADLOCK]

Sending the following just in case...

Thanks,

 kernel/locking/spinlock.c | 11 +++
 1 file changed, 11 insertions(+)

diff --git a/kernel/locking/spinlock.c b/kernel/locking/spinlock.c
index 936f3d14dd6b..d93e88019239 100644
--- a/kernel/locking/spinlock.c
+++ b/kernel/locking/spinlock.c
@@ -19,6 +19,7 @@
 #include 
 #include 
 #include 
+#include <linux/kprobes.h>
 #include 
 #include 
 
@@ -128,6 +129,7 @@ int __lockfunc _raw_spin_trylock(raw_spinlock_t *lock)
return __raw_spin_trylock(lock);
 }
 EXPORT_SYMBOL(_raw_spin_trylock);
+NOKPROBE_SYMBOL(_raw_spin_trylock);
 #endif
 
 #ifndef CONFIG_INLINE_SPIN_TRYLOCK_BH
@@ -136,6 +138,7 @@ int __lockfunc _raw_spin_trylock_bh(raw_spinlock_t *lock)
return __raw_spin_trylock_bh(lock);
 }
 EXPORT_SYMBOL(_raw_spin_trylock_bh);
+NOKPROBE_SYMBOL(_raw_spin_trylock_bh);
 #endif
 
 #ifndef CONFIG_INLINE_SPIN_LOCK
@@ -144,6 +147,7 @@ void __lockfunc _raw_spin_lock(raw_spinlock_t *lock)
__raw_spin_lock(lock);
 }
 EXPORT_SYMBOL(_raw_spin_lock);
+NOKPROBE_SYMBOL(_raw_spin_lock);
 #endif
 
 #ifndef CONFIG_INLINE_SPIN_LOCK_IRQSAVE
@@ -152,6 +156,7 @@ unsigned long __lockfunc 
_raw_spin_lock_irqsave(raw_spinlock_t *lock)
return __raw_spin_lock_irqsave(lock);
 }
 EXPORT_SYMBOL(_raw_spin_lock_irqsave);
+NOKPROBE_SYMBOL(_raw_spin_lock_irqsave);
 #endif
 
 #ifndef CONFIG_INLINE_SPIN_LOCK_IRQ
@@ -160,6 +165,7 @@ void __lockfunc _raw_spin_lock_irq(raw_spinlock_t *lock)
__raw_spin_lock_irq(lock);
 }
 EXPORT_SYMBOL(_raw_spin_lock_irq);
+NOKPROBE_SYMBOL(_raw_spin_lock_irq);
 #endif
 
 #ifndef CONFIG_INLINE_SPIN_LOCK_BH
@@ -168,6 +174,7 @@ void __lockfunc _raw_spin_lock_bh(raw_spinlock_t *lock)
__raw_spin_lock_bh(lock);
 }
 EXPORT_SYMBOL(_raw_spin_lock_bh);
+NOKPROBE_SYMBOL(_raw_spin_lock_bh);
 #endif
 
 #ifdef CONFIG_UNINLINE_SPIN_UNLOCK
@@ -176,6 +183,7 @@ void __lockfunc _raw_spin_unlock(raw_spinlock_t *lock)
__raw_spin_unlock(lock);
 }
 EXPORT_SYMBOL(_raw_spin_unlock);
+NOKPROBE_SYMBOL(_raw_spin_unlock);
 #endif
 
 #ifndef CONFIG_INLINE_SPIN_UNLOCK_IRQRESTORE
@@ -184,6 +192,7 @@ void __lockfunc _raw_spin_unlock_irqrestore(raw_spinlock_t 
*lock, unsigned long
__raw_spin_unlock_irqrestore(lock, flags);
 }
 EXPORT_SYMBOL(_raw_spin_unlock_irqrestore);
+NOKPROBE_SYMBOL(_raw_spin_unlock_irqrestore);
 #endif
 
 #ifndef CONFIG_INLINE_SPIN_UNLOCK_IRQ
@@ -192,6 +201,7 @@ void __lockfunc _raw_spin_unlock_irq(raw_spinlock_t *lock)
__raw_spin_unlock_irq(lock);
 }
 EXPORT_SYMBOL(_raw_spin_unlock_irq);
+NOKPROBE_SYMBOL(_raw_spin_unlock_irq);
 #endif
 
 #ifndef CONFIG_INLINE_SPIN_UNLOCK_BH
@@ -200,6 +210,7 @@ void __lockfunc _raw_spin_unlock_bh(raw_spinlock_t *lock)
__raw_spin_unlock_bh(lock);
 }
 EXPORT_SYMBOL(_raw_spin_unlock_bh);
+NOKPROBE_SYMBOL(_raw_spin_unlock_bh);
 #endif
 
 #ifndef CONFIG_INLINE_READ_TRYLOCK

Signed-off-by: Andrea Righi 


Re: [PATCH 0/2] kprobes: Fix kretprobe incorrect stacking order problem

2019-01-07 Thread Andrea Righi
unction)
> [kretprobe_trampoline]
>  ->tampoline_handler()
>pop the return address (caller) from top of the kretprobe list
>  <-(trampoline_handler())
> 
> 
> When we put a kretprobe on ftrace_ops_assist_func(), below happens
> 
> 
>  ->
>->fentry
> ->ftrace_ops_assist_func()
>  ->int3
>   ->kprobe_int3_handler()
>   ...->pre_handler_kretprobe()
>push the return address (*fentry*) of ftrace_ops_assist_func() to
>top of the kretprobe list and replace it with kretprobe_trampoline.
>   <-kprobe_int3_handler()
>  <-(int3)
>  ->kprobe_ftrace_handler()
>   ...->pre_handler_kretprobe()
>push the return address (caller) of probed-function to top of the
>kretprobe list and replace it with kretprobe_trampoline.
>  <-(kprobe_ftrace_handler())
> <-(ftrace_ops_assist_func())
> [kretprobe_trampoline]
>  ->tampoline_handler()
>pop the return address (caller) from top of the kretprobe list
>  <-(trampoline_handler())
> 
> [run caller with incorrect stack information]
><-()
>   !!KERNEL PANIC!!
> 
> Therefore, this kernel panic happens only when we put 2 k*ret*probes on
> ftrace_ops_assist_func() and other functions. If we put kprobes, it
> doesn't cause any issue, since it doesn't change the return address.
> 
> To fix (or just avoid) this issue, we can introduce a frame pointer
> verification to skip wrong order entries. And I also would like to
> blacklist those functions because those are part of ftrace-based 
> kprobe handling routine.
> 
> BTW, this is not all of issues. To remove CONFIG_KPROBE_EVENTS_ON_NOTRACE
> I'm trying to find out other notrace functions which can cause
> kernel crash by probing. Mostly done on x86, so I'll post it
> after this series.
> 
> Thank you,

Apart from the missing #include <linux/kprobes.h> in PATCH 2/2,
everything else looks good to me.

Tested-by: Andrea Righi 

Thanks!
-Andrea


Re: [PATCH 2/2] kprobes: Mark ftrace mcount handler functions nokprobe

2019-01-07 Thread Andrea Righi
On Mon, Jan 07, 2019 at 10:32:32PM +0900, Masami Hiramatsu wrote:
> Mark ftrace mcount handler functions nokprobe since
> probing on these functions with kretprobe pushes
> return address incorrectly on kretprobe shadow stack.
> 
> Signed-off-by: Masami Hiramatsu 
> Reported-by: Francis Deslauriers 
> ---
>  kernel/trace/ftrace.c |5 -
>  1 file changed, 4 insertions(+), 1 deletion(-)
> 
> diff --git a/kernel/trace/ftrace.c b/kernel/trace/ftrace.c
> index f0ff24173a0b..ad4babad4a03 100644
> --- a/kernel/trace/ftrace.c
> +++ b/kernel/trace/ftrace.c
> @@ -6250,7 +6250,7 @@ void ftrace_reset_array_ops(struct trace_array *tr)
>   tr->ops->func = ftrace_stub;
>  }
>  
> -static inline void
> +static nokprobe_inline void

I think we need to #include <linux/kprobes.h>, otherwise:

  CC  kernel/trace/ftrace.o
kernel/trace/ftrace.c:6219:24: error: expected ‘=’, ‘,’, ‘;’, ‘asm’ or 
‘__attribute__’ before ‘void’
 static nokprobe_inline void
^~~~

 kernel/trace/ftrace.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/kernel/trace/ftrace.c b/kernel/trace/ftrace.c
index 3a58ad280d83..0333241034d5 100644
--- a/kernel/trace/ftrace.c
+++ b/kernel/trace/ftrace.c
@@ -32,6 +32,7 @@
 #include 
 #include 
 #include 
+#include <linux/kprobes.h>
 #include 
 
 #include 

Thanks,
-Andrea

>  __ftrace_ops_list_func(unsigned long ip, unsigned long parent_ip,
>  struct ftrace_ops *ignored, struct pt_regs *regs)
>  {
> @@ -6310,11 +6310,13 @@ static void ftrace_ops_list_func(unsigned long ip, 
> unsigned long parent_ip,
>  {
>   __ftrace_ops_list_func(ip, parent_ip, NULL, regs);
>  }
> +NOKPROBE_SYMBOL(ftrace_ops_list_func);
>  #else
>  static void ftrace_ops_no_ops(unsigned long ip, unsigned long parent_ip)
>  {
>   __ftrace_ops_list_func(ip, parent_ip, NULL, NULL);
>  }
> +NOKPROBE_SYMBOL(ftrace_ops_no_ops);
>  #endif
>  
>  /*
> @@ -6341,6 +6343,7 @@ static void ftrace_ops_assist_func(unsigned long ip, 
> unsigned long parent_ip,
>   preempt_enable_notrace();
>   trace_clear_recursion(bit);
>  }
> +NOKPROBE_SYMBOL(ftrace_ops_assist_func);
>  
>  /**
>   * ftrace_ops_get_func - get the function a trampoline should call


Re: [PATCH v2 0/3] x86: kprobes: Show correct blaclkist in debugfs

2019-01-01 Thread Andrea Righi
On Tue, Jan 01, 2019 at 10:16:54PM +0900, Masami Hiramatsu wrote:
...
> > > > > Do you see a nice and clean way to blacklist all these functions
> > > > > (something like arch_populate_kprobe_blacklist()), or should we just
> > > > > flag all of them explicitly with NOKPROBE_SYMBOL()?
> > > > 
> > > > As I pointed, you can probe it via your own kprobe module. Like 
> > > > systemtap,
> > > > you still can probe it. The blacklist is for "kprobes", not for 
> > > > "kprobe_events".
> > > > (Those are used to same, but since the above commit, those are 
> > > > different now)
> > > > 
> > > > I think the most sane solution is, identifying which (combination of) 
> > > > functions
> > > > in ftrace (kernel/trace/*) causes a problem, marking those 
> > > > NOKPROBE_SYMBOL() and
> > > > removing CONFIG_KPROBE_EVENTS_ON_NOTRACE.
> > 
> > I'm planning to spend a little bit more time on this and see if I can
> > identify the problematic ftrace functions and eventually drop
> > CONFIG_KPROBE_EVENTS_ON_NOTRACE, following the sane solution.
> > 
> > However, in the meantime, with the following patch I've been able to get
> > a more reliable kprobes blacklist and show also the notrace functions in
> > debugfs when CONFIG_KPROBE_EVENTS_ON_NOTRACE is off.
> 
> Hmm, if CONFIG_KPROBE_EVENTS_ON_NOTRACE=n, we already have a whitelist of
> functions in /sys/kernel/debug/tracing/available_filter_functions,
> so I don't think we need a blacklist.

OK.

> 
> > It's probably ugly and inefficient, because it's iterating over all
> > symbols in x86's arch_populate_kprobe_blacklist(), but it seems to work
> > for my specific use case, so I thought it shouldn't be bad to share it,
> > just in case (maybe someone else is also interested).
> 
> Hmm, but in that case, it limits other native kprobes users like systemtap
> to disable probing on notrace functions with no reasons. That may not be 
> acceptable.

True...

> 
> OK, I'll retry to find which notrace function combination tracing with
> kprobes are problematic. Let me do it...

OK. Thanks tons for looking into this!

-Andrea


Re: [PATCH v2 0/3] x86: kprobes: Show correct blaclkist in debugfs

2018-12-27 Thread Andrea Righi
On Tue, Dec 18, 2018 at 06:24:35PM +0100, Andrea Righi wrote:
> On Tue, Dec 18, 2018 at 01:50:26PM +0900, Masami Hiramatsu wrote:
> ...
> > > Side question: there are certain symbols in arch/x86/xen that should be
> > > blacklisted explicitly, because they're non-attachable.
> > > 
> > > More exactly, all functions defined in arch/x86/xen/spinlock.c,
> > > arch/x86/xen/time.c and arch/x86/xen/irq.c.
> > > 
> > > The reason is that these files are compiled without -pg to allow the
> > > usage of ftrace within a Xen domain apparently (from
> > > arch/x86/xen/Makefile):
> > > 
> > >  ifdef CONFIG_FUNCTION_TRACER
> > >  # Do not profile debug and lowlevel utilities
> > >  CFLAGS_REMOVE_spinlock.o = -pg
> > >  CFLAGS_REMOVE_time.o = -pg
> > >  CFLAGS_REMOVE_irq.o = -pg
> > >  endif
> > 
> > 
> > Actually, the reason why you can not probe those functions via
> > tracing/kprobe_events is just a side effect. You can probe it if you
> > write a kprobe module. Since the kprobe_events depends on some ftrace
> > tracing functions, it sometimes cause a recursive call problem. To avoid
> > this issue, I have introduced a CONFIG_KPROBE_EVENTS_ON_NOTRACE, see
> > commit 45408c4f9250 ("tracing: kprobes: Prohibit probing on notrace 
> > function").
> > 
> > If you set CONFIG_KPROBE_EVENTS_ON_NOTRACE=n, you can continue putting 
> > probes
> > on Xen spinlock functions too.
> 
> OK.
> 
> > 
> > > Do you see a nice and clean way to blacklist all these functions
> > > (something like arch_populate_kprobe_blacklist()), or should we just
> > > flag all of them explicitly with NOKPROBE_SYMBOL()?
> > 
> > As I pointed, you can probe it via your own kprobe module. Like systemtap,
> > you still can probe it. The blacklist is for "kprobes", not for 
> > "kprobe_events".
> > (Those are used to same, but since the above commit, those are different 
> > now)
> > 
> > I think the most sane solution is, identifying which (combination of) 
> > functions
> > in ftrace (kernel/trace/*) causes a problem, marking those 
> > NOKPROBE_SYMBOL() and
> > removing CONFIG_KPROBE_EVENTS_ON_NOTRACE.

I'm planning to spend a little bit more time on this and see if I can
identify the problematic ftrace functions and eventually drop
CONFIG_KPROBE_EVENTS_ON_NOTRACE, following the sane solution.

However, in the meantime, with the following patch I've been able to get
a more reliable kprobes blacklist and show also the notrace functions in
debugfs when CONFIG_KPROBE_EVENTS_ON_NOTRACE is off.

It's probably ugly and inefficient, because it's iterating over all
symbols in x86's arch_populate_kprobe_blacklist(), but it seems to work
for my specific use case, so I thought it shouldn't be bad to share it,
just in case (maybe someone else is also interested).

Thanks,

From: Andrea Righi 
Subject: [PATCH] x86: kprobes: automatically blacklist all non-traceable 
functions

Iterate over all symbols to detect those that are non-traceable and
blacklist them.

Signed-off-by: Andrea Righi 
---
 arch/x86/kernel/kprobes/core.c | 11 +--
 kernel/kprobes.c   | 22 --
 2 files changed, 29 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kernel/kprobes/core.c b/arch/x86/kernel/kprobes/core.c
index 4ba75afba527..8cc7191ba3f9 100644
--- a/arch/x86/kernel/kprobes/core.c
+++ b/arch/x86/kernel/kprobes/core.c
@@ -1026,10 +1026,17 @@ int kprobe_fault_handler(struct pt_regs *regs, int 
trapnr)
 }
 NOKPROBE_SYMBOL(kprobe_fault_handler);
 
+static int do_kprobes_arch_blacklist(void *data, const char *name,
+struct module *mod, unsigned long addr)
+{
+   if (arch_within_kprobe_blacklist(addr))
+   kprobe_add_ksym_blacklist(addr);
+   return 0;
+}
+
 int __init arch_populate_kprobe_blacklist(void)
 {
-   return kprobe_add_area_blacklist((unsigned long)__entry_text_start,
-(unsigned long)__entry_text_end);
+   return kallsyms_on_each_symbol(do_kprobes_arch_blacklist, NULL);
 }
 
 int __init arch_init_kprobes(void)
diff --git a/kernel/kprobes.c b/kernel/kprobes.c
index f4ddfdd2d07e..2e824cd536ba 100644
--- a/kernel/kprobes.c
+++ b/kernel/kprobes.c
@@ -1389,11 +1389,29 @@ static int register_aggr_kprobe(struct kprobe *orig_p, 
struct kprobe *p)
return ret;
 }
 
+#if defined(CONFIG_KPROBES_ON_FTRACE) && \
+   !defined(CONFIG_KPROBE_EVENTS_ON_NOTRACE)
+static bool within_notrace(unsigned long addr)
+{
+   unsigned long offset, size;
+
+	if (!kallsyms_lookup_size_offset(addr, &size, &offset))
+   return true;
+   return !ftrace_location_ra

Re: [PATCH v2 0/3] x86: kprobes: Show correct blaclkist in debugfs

2018-12-18 Thread Andrea Righi
On Tue, Dec 18, 2018 at 01:50:26PM +0900, Masami Hiramatsu wrote:
...
> > Side question: there are certain symbols in arch/x86/xen that should be
> > blacklisted explicitly, because they're non-attachable.
> > 
> > More exactly, all functions defined in arch/x86/xen/spinlock.c,
> > arch/x86/xen/time.c and arch/x86/xen/irq.c.
> > 
> > The reason is that these files are compiled without -pg to allow the
> > usage of ftrace within a Xen domain apparently (from
> > arch/x86/xen/Makefile):
> > 
> >  ifdef CONFIG_FUNCTION_TRACER
> >  # Do not profile debug and lowlevel utilities
> >  CFLAGS_REMOVE_spinlock.o = -pg
> >  CFLAGS_REMOVE_time.o = -pg
> >  CFLAGS_REMOVE_irq.o = -pg
> >  endif
> 
> 
> Actually, the reason why you can not probe those functions via
> tracing/kprobe_events is just a side effect. You can probe it if you
> write a kprobe module. Since the kprobe_events depends on some ftrace
> tracing functions, it sometimes cause a recursive call problem. To avoid
> this issue, I have introduced a CONFIG_KPROBE_EVENTS_ON_NOTRACE, see
> commit 45408c4f9250 ("tracing: kprobes: Prohibit probing on notrace 
> function").
> 
> If you set CONFIG_KPROBE_EVENTS_ON_NOTRACE=n, you can continue putting probes
> on Xen spinlock functions too.

OK.

> 
> > Do you see a nice and clean way to blacklist all these functions
> > (something like arch_populate_kprobe_blacklist()), or should we just
> > flag all of them explicitly with NOKPROBE_SYMBOL()?
> 
> As I pointed, you can probe it via your own kprobe module. Like systemtap,
> you still can probe it. The blacklist is for "kprobes", not for 
> "kprobe_events".
> (Those are used to same, but since the above commit, those are different now)
> 
> I think the most sane solution is, identifying which (combination of) 
> functions
> in ftrace (kernel/trace/*) causes a problem, marking those NOKPROBE_SYMBOL() 
> and
> removing CONFIG_KPROBE_EVENTS_ON_NOTRACE.

OK. Thanks for the clarification!

-Andrea


Re: [PATCH v2 0/3] x86: kprobes: Show correct blaclkist in debugfs

2018-12-17 Thread Andrea Righi
On Mon, Dec 17, 2018 at 05:20:25PM +0900, Masami Hiramatsu wrote:
> This is v2 series for showing correct kprobe blacklist in
> debugfs.
> 
> v1 is here:
> 
>  https://lkml.org/lkml/2018/12/7/517
> 
> I splitted the RFC v1 patch into x86 and generic parts,
> also added a patch to remove unneeded arch-specific
> blacklist check function (because those have been added
> to the generic blacklist.)
> 
> If this style is good, I will make another series for the
> archs which have own arch_within_kprobe_blacklist(), and
> eventually replace that with arch_populate_kprobe_blacklist()
> so that user can get the correct kprobe blacklist in debugfs.
> 
> Thank you,

Looks good to me. Thanks!

Tested-by: Andrea Righi 

Side question: there are certain symbols in arch/x86/xen that should be
blacklisted explicitly, because they're non-attachable.

More exactly, all functions defined in arch/x86/xen/spinlock.c,
arch/x86/xen/time.c and arch/x86/xen/irq.c.

The reason is that these files are compiled without -pg to allow the
usage of ftrace within a Xen domain apparently (from
arch/x86/xen/Makefile):

 ifdef CONFIG_FUNCTION_TRACER
 # Do not profile debug and lowlevel utilities
 CFLAGS_REMOVE_spinlock.o = -pg
 CFLAGS_REMOVE_time.o = -pg
 CFLAGS_REMOVE_irq.o = -pg
 endif

Do you see a nice and clean way to blacklist all these functions
(something like arch_populate_kprobe_blacklist()), or should we just
flag all of them explicitly with NOKPROBE_SYMBOL()?

Thanks,
-Andrea


[PATCH] kprobes/x86/xen: blacklist non-attachable xen interrupt functions

2018-12-10 Thread Andrea Righi
Blacklist symbols in Xen probe-prohibited areas, so that users can see
these prohibited symbols in debugfs.

See also: a50480cb6d61.

Signed-off-by: Andrea Righi 
---
 arch/x86/xen/xen-asm_64.S | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/arch/x86/xen/xen-asm_64.S b/arch/x86/xen/xen-asm_64.S
index bb1c2da0381d..1e9ef0ba30a5 100644
--- a/arch/x86/xen/xen-asm_64.S
+++ b/arch/x86/xen/xen-asm_64.S
@@ -12,6 +12,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 
@@ -24,6 +25,7 @@ ENTRY(xen_\name)
pop %r11
jmp  \name
 END(xen_\name)
+_ASM_NOKPROBE(xen_\name)
 .endm
 
 xen_pv_trap divide_error
-- 
2.17.1



Re: [PATCH] kprobes: x86_64: blacklist non-attachable interrupt functions

2018-12-07 Thread Andrea Righi
On Sat, Dec 08, 2018 at 12:48:59PM +0900, Masami Hiramatsu wrote:
> On Fri, 7 Dec 2018 18:58:05 +0100
> Andrea Righi  wrote:
> 
> > On Sat, Dec 08, 2018 at 01:01:20AM +0900, Masami Hiramatsu wrote:
> > > Hi Andrea and Ingo,
> > > 
> > > Here is the patch what I meant. I just ran it on qemu-x86, and seemed 
> > > working.
> > > After introducing this patch, I will start adding 
> > > arch_populate_kprobe_blacklist()
> > > to some arches.
> > > 
> > > Thank you,
> > > 
> > > [RFC] kprobes: x86/kprobes: Blacklist symbols in arch-defined prohibited 
> > > area
> > > 
> > > From: Masami Hiramatsu 
> > > 
> > > Blacklist symbols in arch-defined probe-prohibited areas.
> > > With this change, user can see all symbols which are prohibited
> > > to probe in debugfs.
> > > 
> > > All archtectures which have custom prohibit areas should define
> > > its own arch_populate_kprobe_blacklist() function, but unless that,
> > > all symbols marked __kprobes are blacklisted.
> > 
> > What about iterating all symbols and use arch_within_kprobe_blacklist()
> > to check if we need to blacklist them or not.
> 
> Sorry, I don't want to iterate all ksyms since it may take a long time
> (especially embedded small devices.)
> 
> > 
> > In this way we don't have to introduce an
> > arch_populate_kprobe_blacklist() for each architecture.
> 
> Hmm, I had a same idea, but there are some arch which prohibit probing
> extable entries (e.g. arm64.) For correctness of the blacklist, I think
> it should be listed (not entire the function body).
> I also rather like to remove arch_within_kprobe_blacklist() instead.

OK. Thanks.

-Andrea


Re: [PATCH] kprobes: x86_64: blacklist non-attachable interrupt functions

2018-12-07 Thread Andrea Righi
On Sat, Dec 08, 2018 at 12:42:10PM +0900, Masami Hiramatsu wrote:
> On Fri, 7 Dec 2018 18:00:26 +0100
> Andrea Righi  wrote:
> 
> > On Sat, Dec 08, 2018 at 01:01:20AM +0900, Masami Hiramatsu wrote:
> > > Hi Andrea and Ingo,
> > > 
> > > Here is the patch that I meant. I just ran it on qemu-x86 and it
> > > seemed to be working. After introducing this patch, I will start
> > > adding arch_populate_kprobe_blacklist() to some arches.
> > > 
> > > Thank you,
> > > 
> > > [RFC] kprobes: x86/kprobes: Blacklist symbols in arch-defined prohibited area
> > > 
> > > From: Masami Hiramatsu 
> > > 
> > > Blacklist symbols in arch-defined probe-prohibited areas.
> > > With this change, users can see all symbols which are prohibited
> > > from being probed in debugfs.
> > > 
> > > All architectures which have custom prohibited areas should define
> > > their own arch_populate_kprobe_blacklist() function; unless they do,
> > > all symbols marked __kprobes are blacklisted.
> > > 
> > > Reported-by: Andrea Righi 
> > > Signed-off-by: Masami Hiramatsu 
> > > ---
> > 
> > [snip]
> > 
> > > +int kprobe_add_ksym_blacklist(unsigned long entry)
> > > +{
> > > + struct kprobe_blacklist_entry *ent;
> > > + unsigned long offset = 0, size = 0;
> > > +
> > > + if (!kernel_text_address(entry) ||
> > > + !kallsyms_lookup_size_offset(entry, &size, &offset))
> > > + return -EINVAL;
> > > +
> > > + ent = kmalloc(sizeof(*ent), GFP_KERNEL);
> > > + if (!ent)
> > > + return -ENOMEM;
> > > + ent->start_addr = entry - offset;
> > > + ent->end_addr = entry - offset + size;
> > 
> > Do we need to take offset into account? The code before wasn't using it.
> 
> Yes, if we hit an alias symbol (zero-size), we forcibly increment the
> address and retry. In that case, offset will be 1.
> 
> > 
> > > + INIT_LIST_HEAD(&ent->list);
> > > + list_add_tail(&ent->list, &kprobe_blacklist);
> > > +
> > > + return (int)size;
> > > +}
> > > +
> > > +/* Add functions in arch defined probe-prohibited area */
> > > +int __weak arch_populate_kprobe_blacklist(void)
> > > +{
> > > + unsigned long entry;
> > > + int ret = 0;
> > > +
> > > + for (entry = (unsigned long)__kprobes_text_start;
> > > +  entry < (unsigned long)__kprobes_text_end;
> > > +  entry += ret) {
> > > + ret = kprobe_add_ksym_blacklist(entry);
> > > + if (ret < 0)
> > > + return ret;
> > > + if (ret == 0)   /* In case of alias symbol */
> > > + ret = 1;
> 
> Here, we incremented.
> 
> Thank you,

Makes sense, thanks for the clarification.

-Andrea
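
To make the alias handling concrete, here is a small standalone
walk-through of the loop arithmetic with a fake symbol table and
hypothetical addresses (it only mimics kallsyms_lookup_size_offset();
it is not kernel code). The first pass hits the zero-size alias and
covers nothing, the loop steps forward by one and retries, and on the
retry offset == 1, so entry - offset points back at the real function
start and the whole [start, start + size) range gets blacklisted:

/*
 * Standalone userspace sketch (hypothetical addresses): a zero-size
 * alias symbol at 0x1000 sits in front of the real 0x40-byte function
 * it aliases, mimicking the case discussed above.
 */
#include <stdio.h>

struct sym { unsigned long start, size; };

static const struct sym alias = { 0x1000, 0x00 };   /* resolved first */
static const struct sym real  = { 0x1000, 0x40 };   /* the real function */

/* stand-in for kallsyms_lookup_size_offset() */
static void lookup(unsigned long addr, unsigned long *size,
                   unsigned long *offset)
{
        const struct sym *s = (addr == alias.start) ? &alias : &real;

        *size = s->size;
        *offset = addr - s->start;
}

int main(void)
{
        unsigned long entry, size, offset, step = 0;

        for (entry = 0x1000; entry < 0x1040; entry += step) {
                lookup(entry, &size, &offset);
                printf("entry=%#lx size=%#lx offset=%#lx -> [%#lx, %#lx)\n",
                       entry, size, offset,
                       entry - offset, entry - offset + size);
                step = size ? size : 1;  /* zero-size alias: retry at entry + 1 */
        }
        return 0;
}

As the quoted code stands, the first pass also adds a zero-length entry
to the list, which can never match a later lookup and should therefore
be harmless.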

