Re: [PATCH] fs: btrfs: Remove repeated struct declaration

2021-04-01 Thread Nikolay Borisov



On 1.04.21 г. 11:03, Wan Jiabing wrote:
> struct btrfs_inode is declared twice. One declaration is at line 67.
> The declaration below is not needed. Remove the duplicate.
> struct btrfs_fs_info should be declared in the forward declarations.
> Move it to the forward declarations.
> 
> Signed-off-by: Wan Jiabing 

Reviewed-by: Nikolay Borisov 


Re: [PATCH] btrfs: Use immediate assignment when referencing cc-option

2021-03-17 Thread Nikolay Borisov



On 17.03.21 г. 0:46 ч., Victor Erminpour wrote:
> Calling cc-option will use KBUILD_CFLAGS, which when lazy setting
> subdir-ccflags-y produces the following build error:
> 
> scripts/Makefile.lib:10: *** Recursive variable `KBUILD_CFLAGS' \
>   references itself (eventually).  Stop.
> 
> Use := assignment to subdir-ccflags-y when referencing cc-option.
> This causes make to also evaluate += immediately, cc-option
> calls are done right away and we don't end up with KBUILD_CFLAGS
> referencing itself.
> 
> Signed-off-by: Victor Erminpour 
> ---
>  fs/btrfs/Makefile | 8 
>  1 file changed, 4 insertions(+), 4 deletions(-)
> 
> diff --git a/fs/btrfs/Makefile b/fs/btrfs/Makefile
> index b634c42115ea..3dba1336fa95 100644
> --- a/fs/btrfs/Makefile
> +++ b/fs/btrfs/Makefile
> @@ -7,10 +7,10 @@ subdir-ccflags-y += -Wmissing-format-attribute
>  subdir-ccflags-y += -Wmissing-prototypes
>  subdir-ccflags-y += -Wold-style-definition
>  subdir-ccflags-y += -Wmissing-include-dirs
> -subdir-ccflags-y += $(call cc-option, -Wunused-but-set-variable)
> -subdir-ccflags-y += $(call cc-option, -Wunused-const-variable)
> -subdir-ccflags-y += $(call cc-option, -Wpacked-not-aligned)
> -subdir-ccflags-y += $(call cc-option, -Wstringop-truncation)
> +subdir-ccflags-y := $(call cc-option, -Wunused-but-set-variable)
> +subdir-ccflags-y := $(call cc-option, -Wunused-const-variable)
> +subdir-ccflags-y := $(call cc-option, -Wpacked-not-aligned)
> +subdir-ccflags-y := $(call cc-option, -Wstringop-truncation)
>  # The following turn off the warnings enabled by -Wextra
>  subdir-ccflags-y += -Wno-missing-field-initializers
>  subdir-ccflags-y += -Wno-sign-compare
> 

Why does this patch change only some of the assignments while leaving the
others as they were?


Re: [btrfs] 5297199a8b: xfstests.btrfs.220.fail

2021-03-09 Thread Nikolay Borisov



On 9.03.21 г. 10:49 ч., kernel test robot wrote:
> 
> 
> Greeting,
> 
> FYI, we noticed the following commit (built with gcc-9):
> 
> commit: 5297199a8bca12b8b96afcbf2341605efb6005de ("btrfs: remove inode number 
> cache feature")
> https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git master
> 
> 
> in testcase: xfstests
> version: xfstests-x86_64-d41dcbd-1_20201218
> with following parameters:
> 
>   disk: 6HDD
>   fs: btrfs
>   test: btrfs-group-22
>   ucode: 0x28
> 
> test-description: xfstests is a regression test suite for xfs and other files 
> ystems.
> test-url: git://git.kernel.org/pub/scm/fs/xfs/xfstests-dev.git
> 
> 
> on test machine: 8 threads Intel(R) Core(TM) i7-4770 CPU @ 3.40GHz with 8G 
> memory
> 
> caused below changes (please refer to attached dmesg/kmsg for entire 
> log/backtrace):
> 
> 
> 
> 
> If you fix the issue, kindly add following tag
> Reported-by: kernel test robot 
> 
> 2021-03-09 04:13:26 export TEST_DIR=/fs/sdb1
> 2021-03-09 04:13:26 export TEST_DEV=/dev/sdb1
> 2021-03-09 04:13:26 export FSTYP=btrfs
> 2021-03-09 04:13:26 export SCRATCH_MNT=/fs/scratch
> 2021-03-09 04:13:26 mkdir /fs/scratch -p
> 2021-03-09 04:13:26 export SCRATCH_DEV_POOL="/dev/sdb2 /dev/sdb3 /dev/sdb4 
> /dev/sdb5 /dev/sdb6"
> 2021-03-09 04:13:26 sed "s:^:btrfs/:" 
> //lkp/benchmarks/xfstests/tests/btrfs-group-22
> 2021-03-09 04:13:26 ./check btrfs/220 btrfs/221 btrfs/222 btrfs/223 btrfs/224 
> btrfs/225 btrfs/226 btrfs/227
> FSTYP -- btrfs
> PLATFORM  -- Linux/x86_64 lkp-hsw-d01 5.10.0-rc7-00162-g5297199a8bca #1 
> SMP Sat Feb 27 21:06:26 CST 2021
> MKFS_OPTIONS  -- /dev/sdb2
> MOUNT_OPTIONS -- /dev/sdb2 /fs/scratch
> 
> btrfs/220 - output mismatch (see 
> /lkp/benchmarks/xfstests/results//btrfs/220.out.bad)
> --- tests/btrfs/220.out   2021-01-14 07:40:58.0 +
> +++ /lkp/benchmarks/xfstests/results//btrfs/220.out.bad   2021-03-09 
> 04:13:32.880794446 +
> @@ -1,2 +1,3 @@
>  QA output created by 220
> +Unexepcted mount options, checking for 
> 'inode_cache,relatime,space_cache,subvol=/,subvolid=5' in 
> 'rw,relatime,space_cache,subvolid=5,subvol=/' using 'inode_cache'


Given that the commit removes the inode_cache feature, that's expected; I
assume you need to adjust your fstests configuration to not use
inode_cache.


[tip: locking/core] locking/rwsem: Remove empty rwsem.h

2021-02-01 Thread tip-bot2 for Nikolay Borisov
The following commit has been merged into the locking/core branch of tip:

Commit-ID: 442187f3c2de40bab13b8f9751b37925bede73b0
Gitweb: https://git.kernel.org/tip/442187f3c2de40bab13b8f9751b37925bede73b0
Author: Nikolay Borisov 
AuthorDate: Tue, 26 Jan 2021 12:17:21 +02:00
Committer: Peter Zijlstra 
CommitterDate: Fri, 29 Jan 2021 20:02:34 +01:00

locking/rwsem: Remove empty rwsem.h

This is a leftover from 7f26482a872c ("locking/percpu-rwsem: Remove the 
embedded rwsem")

Signed-off-by: Nikolay Borisov 
Signed-off-by: Peter Zijlstra (Intel) 
Acked-by: Will Deacon 
Link: https://lkml.kernel.org/r/20210126101721.976027-1-nbori...@suse.com
---
 kernel/locking/rwsem.h | 0
 1 file changed, 0 insertions(+), 0 deletions(-)
 delete mode 100644 kernel/locking/rwsem.h

diff --git a/kernel/locking/rwsem.h b/kernel/locking/rwsem.h
deleted file mode 100644
index e69de29..000
--- a/kernel/locking/rwsem.h
+++ /dev/null


Re: [PATCH] x86: Disable CET instrumentation in the kernel

2021-01-29 Thread Nikolay Borisov



On 29.01.21 г. 18:49 ч., Josh Poimboeuf wrote:
> Agreed, stable is a good idea.   I think Nikolay saw it with GCC 9.


Yes I did, with the default Ubuntu compiler as well as the default gcc-10 
compiler: 

# gcc -v -Q -O2 --help=target | grep protection

gcc version 9.3.0 (Ubuntu 9.3.0-17ubuntu1~20.04) 
COLLECT_GCC_OPTIONS='-v' '-Q' '-O2' '--help=target' '-mtune=generic' 
'-march=x86-64'
 /usr/lib/gcc/x86_64-linux-gnu/9/cc1 -v -imultiarch x86_64-linux-gnu help-dummy 
-dumpbase help-dummy -mtune=generic -march=x86-64 -auxbase help-dummy -O2 
-version --help=target -fasynchronous-unwind-tables -fstack-protector-strong 
-Wformat -Wformat-security -fstack-clash-protection -fcf-protection -o 
/tmp/ccSecttk.s
GNU C17 (Ubuntu 9.3.0-17ubuntu1~20.04) version 9.3.0 (x86_64-linux-gnu)
compiled by GNU C version 9.3.0, GMP version 6.2.0, MPFR version 4.0.2, 
MPC version 1.1.0, isl version isl-0.22.1-GMP


It seems -fcf-protection is turned on by default.
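For what it's worth, a quick way to double-check this from the compiler's side,
instead of parsing --help=target output, is a trivial test program. It relies on
the assumption that gcc predefines the __CET__ macro only when -fcf-protection
is in effect; everything else below is just illustrative scaffolding:

/*
 * Minimal sketch: build with the distro's default flags and run it.
 * If __CET__ is set without passing -fcf-protection explicitly, the
 * compiler default enables CET instrumentation.
 */
#include <stdio.h>

int main(void)
{
#ifdef __CET__
	printf("-fcf-protection is on by default (__CET__ = %d)\n", __CET__);
#else
	printf("-fcf-protection is off by default\n");
#endif
	return 0;
}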


Re: kprobes broken since 0d00449c7a28 ("x86: Replace ist_enter() with nmi_enter()")

2021-01-28 Thread Nikolay Borisov



On 29.01.21 г. 3:34 ч., Alexei Starovoitov wrote:
> On Thu, Jan 28, 2021 at 07:24:14PM +0100, Peter Zijlstra wrote:
>> On Thu, Jan 28, 2021 at 06:45:56PM +0200, Nikolay Borisov wrote:
>>> it would be placed on the __fentry__ (and not endbr64) hence it works.
>>> So perhaps a workaround outside of bpf could essentially detect this
>>> scenario and adjust the probe to be on the __fentry__ and not preceding
>>> instruction if it's detected to be endbr64 ?
>>
>> Arguably the fentry handler should also set the nmi context, it can,
>> after all, interrupt pretty much any other context by construction.
> 
> But that doesn't make it a real nmi.
> nmi can actually interrupt anything. Whereas kprobe via int3 cannot
> do nokprobe and noinstr sections. The exposure is a lot different.
> ftrace is even more contained. It's only at the start of the functions.
> It's even smaller subset of places than kprobes.
> So ftrace < kprobe < nmi.
> Grouping them all into nmi doesn't make sense to me.
> That bpf breaking change came unnoticed mostly because people use
> kprobes in the beginning of the functions only, but there are cases
> where kprobes are in the middle too. People just didn't upgrade kernels
> fast enough to notice.

nit: slight correction - I observed this while putting kprobes at the
beginning of the function, but __fentry__ wasn't the first thing in the
function's code. The reason people haven't observed it is that everyone
is running with retpolines enabled, which disables CFI/CET.

> imo appropriate solution would be to have some distinction between
> ftrace < kprobe_via_int3 < nmi, so that bpf side can react accordingly.
> That code in trace_call_bpf:
>   if (in_nmi()) /* not supported yet */
> was necessary in the past. Now we have all sorts of protections built in.
> So it's probably ok to just drop it, but I think adding
> called_via_ftrace vs called_via_kprobe_int3 vs called_via_nmi
> is more appropriate solution long term.
> 


Re: [PATCH] x86: Disable CET instrumentation in the kernel

2021-01-28 Thread Nikolay Borisov



On 28.01.21 г. 23:52 ч., Josh Poimboeuf wrote:
> 
> With retpolines disabled, some configurations of GCC will add Intel CET
> instrumentation to the kernel by default.  That breaks certain tracing
> scenarios by adding a superfluous ENDBR64 instruction before the fentry
> call, for functions which can be called indirectly.
> 
> CET instrumentation isn't currently necessary in the kernel, as CET is
> only supported in user space.  Disable it unconditionally.
> 
> Reported-by: Nikolay Borisov 
> Signed-off-by: Josh Poimboeuf 

Tested-by: Nikolay Borisov 
Reviewed-by: Nikolay Borisov 


Re: kprobes broken since 0d00449c7a28 ("x86: Replace ist_enter() with nmi_enter()")

2021-01-28 Thread Nikolay Borisov



On 28.01.21 г. 18:12 ч., Nikolay Borisov wrote:
> 
> 
> On 28.01.21 г. 5:38 ч., Masami Hiramatsu wrote:
>> Hi,
> 
> 
> 
>>
>> Alexei, could you tell me what is the concerning situation for bpf?
> 
> Another data point, Masami, is that this affects bpf kprobes which are
> entered via int3; alternatively, if the kprobe is entered via
> kprobe_ftrace_handler it works as expected. I haven't been able to
> determine why a particular bpf probe won't use ftrace's infrastructure
> if it's put at the beginning of the function. An alternative call chain
> is:
> 
>  => __ftrace_trace_stack
>  => trace_call_bpf
>  => kprobe_perf_func
>  => kprobe_ftrace_handler
>  => 0xc095d0c8
>  => btrfs_validate_metadata_buffer
>  => end_bio_extent_readpage
>  => end_workqueue_fn
>  => btrfs_work_helper
>  => process_one_work
>  => worker_thread
>  => kthread
>  => ret_from_fork
> 
>>

I have a working theory why I'm seeing this. My kernel (broken) was
compiled with retpolines off and with the gcc that comes with Ubuntu
(both 9 and 10:
gcc (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0
gcc-10 (Ubuntu 10.2.0-5ubuntu1~20.04) 10.2.0
)

This results in CFI being enabled, so functions look like:
0x81493890 <+0>: endbr64
0x81493894 <+4>: callq  0x8104d820 <__fentry__>

i.e. fentry's thunk is not the first instruction in the function, hence
it's not going through the optimized ftrace handler. Instead it's using
int3, which is broken as ascertained.

After testing with my testcase I can confirm that with CFI off and
__fentry__ being the first instruction, bpf starts working. And indeed,
even with CFI turned on, if I use a probe like:

bpftrace -e 'kprobe:btrfs_sync_file+4 {printf("kprobe: %s\n",
kstack());}' &>bpf-output &


it would be placed on the __fentry__ (and not the endbr64), hence it works.
So perhaps a workaround outside of bpf could essentially detect this
scenario and adjust the probe to be on the __fentry__ rather than the
preceding instruction, if that instruction is detected to be endbr64?
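To make the idea concrete, here is a rough userspace sketch of that detection.
The only hard fact used here is the ENDBR64 encoding (f3 0f 1e fa); the helper
names and the notion of nudging the requested probe offset are purely
illustrative and not existing kprobes code:

/*
 * Illustrative only: if a probe is requested at offset 0 of a function
 * whose first instruction is ENDBR64, shift it by the instruction's
 * length so it lands on the __fentry__ call instead.
 */
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define ENDBR64_SIZE 4

static bool starts_with_endbr64(const uint8_t *text)
{
	static const uint8_t endbr64[ENDBR64_SIZE] = { 0xf3, 0x0f, 0x1e, 0xfa };

	return memcmp(text, endbr64, ENDBR64_SIZE) == 0;
}

/* Hypothetical helper: adjust a requested probe offset within a function. */
static size_t fixup_probe_offset(const uint8_t *func_text, size_t offset)
{
	if (offset == 0 && starts_with_endbr64(func_text))
		return ENDBR64_SIZE;	/* probe the __fentry__ call instead */

	return offset;
}

int main(void)
{
	/* Fake "function text": ENDBR64 followed by a call opcode byte. */
	const uint8_t text[] = { 0xf3, 0x0f, 0x1e, 0xfa, 0xe8, 0, 0, 0, 0 };

	printf("adjusted offset: %zu\n", fixup_probe_offset(text, 0));
	return 0;
}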







Re: kprobes broken since 0d00449c7a28 ("x86: Replace ist_enter() with nmi_enter()")

2021-01-28 Thread Nikolay Borisov



On 28.01.21 г. 5:38 ч., Masami Hiramatsu wrote:
> Hi,



> 
> Alexei, could you tell me what is the concerning situation for bpf?

Another data point, Masami, is that this affects bpf kprobes which are
entered via int3; alternatively, if the kprobe is entered via
kprobe_ftrace_handler it works as expected. I haven't been able to
determine why a particular bpf probe won't use ftrace's infrastructure
if it's put at the beginning of the function. An alternative call chain
is:

 => __ftrace_trace_stack
 => trace_call_bpf
 => kprobe_perf_func
 => kprobe_ftrace_handler
 => 0xc095d0c8
 => btrfs_validate_metadata_buffer
 => end_bio_extent_readpage
 => end_workqueue_fn
 => btrfs_work_helper
 => process_one_work
 => worker_thread
 => kthread
 => ret_from_fork

> 
> Thank you,
> 
> From c5cd0e5f60ef6494c9e1579ec1b82b7344c41f9a Mon Sep 17 00:00:00 2001
> From: Masami Hiramatsu 
> Date: Thu, 28 Jan 2021 12:31:02 +0900
> Subject: [PATCH] tracing: bpf: Remove in_nmi() check from kprobe handler
> 
> Since commit 0d00449c7a28 ("x86: Replace ist_enter() with nmi_enter()") has
> changed the kprobe handler to run in the NMI context, in_nmi() always returns
> true. This means the bpf events on kprobes always skipped.
> 
> Signed-off-by: Masami Hiramatsu 
> ---
>  kernel/trace/bpf_trace.c | 3 ---
>  1 file changed, 3 deletions(-)
> 
> diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
> index 6c0018abe68a..764400260eb6 100644
> --- a/kernel/trace/bpf_trace.c
> +++ b/kernel/trace/bpf_trace.c
> @@ -96,9 +96,6 @@ unsigned int trace_call_bpf(struct trace_event_call *call, 
> void *ctx)
>  {
>   unsigned int ret;
>  
> - if (in_nmi()) /* not supported yet */
> - return 1;
> -
>   cant_sleep();
>  
>   if (unlikely(__this_cpu_inc_return(bpf_prog_active) != 1)) {
> 


Re: kprobes broken since 0d00449c7a28 ("x86: Replace ist_enter() with nmi_enter()")

2021-01-27 Thread Nikolay Borisov



On 28.01.21 г. 5:38 ч., Masami Hiramatsu wrote:
> Hi,
> 



> 
> Yeah, there is. Nikolay, could you try this tentative patch?
I can confirm that with this patch everything is working. I also applied
the following diff:

diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
index 6c0018abe68a..cc5a3a18816d 100644
--- a/kernel/trace/bpf_trace.c
+++ b/kernel/trace/bpf_trace.c
@@ -96,8 +96,10 @@ unsigned int trace_call_bpf(struct trace_event_call
*call, void *ctx)
 {
unsigned int ret;

-   if (in_nmi()) /* not supported yet */
+   if (in_nmi()) /* not supported yet */ {
+   trace_dump_stack(0);
return 1;
+   }

cant_sleep();



And can confirm that the branch is being hit and the following call
trace is produced:

 => __ftrace_trace_stack
 => trace_call_bpf
 => kprobe_perf_func
 => kprobe_int3_handler
 => exc_int3
 => asm_exc_int3
 => btrfs_sync_file
 => do_fsync
 => __x64_sys_fsync
 => do_syscall_64
 => entry_SYSCALL_64_after_hwframe


> 
> Of course this just drops the NMI check from the handler, so alternative
> checker is required. But I'm not sure what the original code concerns.
> As far as I can see, there seems no re-entrant block flag, nor locks
> among ebpf programs in runtime.
> 
> Alexei, could you tell me what is the concerning situation for bpf?
> 
> Thank you,
> 
> From c5cd0e5f60ef6494c9e1579ec1b82b7344c41f9a Mon Sep 17 00:00:00 2001
> From: Masami Hiramatsu 
> Date: Thu, 28 Jan 2021 12:31:02 +0900
> Subject: [PATCH] tracing: bpf: Remove in_nmi() check from kprobe handler
> 
> Since commit 0d00449c7a28 ("x86: Replace ist_enter() with nmi_enter()") has
> changed the kprobe handler to run in the NMI context, in_nmi() always returns
> true. This means the bpf events on kprobes always skipped.

FWIW I'd prefer if in addition to the original commit you also mention:

ba1f2b2eaa2a ("x86/entry: Fix NMI vs IRQ state tracking")
b6be002bcd1d ("x86/entry: Move nmi entry/exit into common code")

Since they changed the way nmi state is managed in exc_int3 and not in
the original do_int3. The latter no longer contains any references to
nmi-related code.

> 
> Signed-off-by: Masami Hiramatsu 
> ---
>  kernel/trace/bpf_trace.c | 3 ---
>  1 file changed, 3 deletions(-)
> 
> diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
> index 6c0018abe68a..764400260eb6 100644
> --- a/kernel/trace/bpf_trace.c
> +++ b/kernel/trace/bpf_trace.c
> @@ -96,9 +96,6 @@ unsigned int trace_call_bpf(struct trace_event_call *call, 
> void *ctx)
>  {
>   unsigned int ret;
>  
> - if (in_nmi()) /* not supported yet */
> - return 1;
> -
>   cant_sleep();
>  
>   if (unlikely(__this_cpu_inc_return(bpf_prog_active) != 1)) {
> 


Re: kprobes broken since 0d00449c7a28 ("x86: Replace ist_enter() with nmi_enter()")

2021-01-27 Thread Nikolay Borisov


On 27.01.21 г. 17:24 ч., Masami Hiramatsu wrote:
> On Thu, 28 Jan 2021 00:13:53 +0900
> Masami Hiramatsu  wrote:
> 
>> Hi Nikolay,
>>
>> On Wed, 27 Jan 2021 15:43:29 +0200
>> Nikolay Borisov  wrote:
>>
>>> Hello,
>>>
>>> I'm currently seeing latest Linus' master being somewhat broken w.r.t
>>> krpobes. In particular I have the following test-case:
>>>
>>> #!/bin/bash
>>>
>>> mkfs.btrfs -f /dev/vdc &> /dev/null
>>> mount /dev/vdc /media/scratch/
>>>
>>> bpftrace -e 'kprobe:btrfs_sync_file {printf("kprobe: %s\n", kstack());}'
>>> &>bpf-output &
>>> bpf_trace_pid=$!
>>>
>>> # force btrfs_sync_file to be called
>>> sleep 2
>>> xfs_io -f -c "pwrite 0 4m" -c "fsync" /media/scratch/file5
>>>
>>> kill $bpf_trace_pid
>>> sleep 1
>>>
>>> grep -q kprobe bpf-output
>>> retval=$?
>>> rm -f bpf-output
>>> umount /media/scratch
>>>
>>> exit $retval
>>>
>>> It traces btrfs_sync_file which is called when fsync is executed on a
>>> btrfs file, however I don't see the stacktrace being printed i.e the
>>> kprobe doesn't fire at all. The following alternative program:
>>>
>>> bpftrace -e 'tracepoint:btrfs:btrfs_sync_file {printf("tracepoint:
>>> %s\n", kstack());} kprobe:btrfs_sync_file {printf("kprobe: %s\n",
>>> kstack());}'
>>>
>>> only prints the stack from the tracepoint and not from the kprobe, given
>>> that the tracepoint is called from the btrfs_sync_file function.
>>
>> Thank you for reporting!
>>
>> If you don't mind, could you confirm it with ftrace (tracefs)?
>> bpftrace etc. involves too many things. It is better to test with
>> simpler way to test it.
>> I'm not familer with the bpftrace, but I think you can check it with
>>
>> # cd /sys/kernel/debug/tracing
>> # echo p:myevent btrfs_sync_file >> kprobe_events
>> # echo stacktrace > events/kprobes/myevent/trigger
>>  (or echo 1 > options/stacktrace , if trigger file doesn't exist)
> 
> Of course, also you have to enable the event.
>  # echo 1 > events/kprobes/myevent/enable
> 
> And check the results
> 
>  # cat trace
> 
> 
>> Could you also share your kernel config, so that we can reproduce it?
> 

I've attached the config and indeed with the scenario you proposed it
seems to work. I see:

   xfs_io-20280   [000] d.Z.  9900.748633: myevent:
(btrfs_sync_file+0x0/0x580)
  xfs_io-20280   [000] d.Z.  9900.748647: <stack trace>
 => kprobe_trace_func
 => kprobe_dispatcher
 => kprobe_int3_handler
 => exc_int3
 => asm_exc_int3
 => btrfs_sync_file
 => do_fsync
 => __x64_sys_fsync
 => do_syscall_64
 => entry_SYSCALL_64_after_hwframe



#
# Automatically generated file; DO NOT EDIT.
# Linux/x86 5.11.0-rc5 Kernel Configuration
#
CONFIG_CC_VERSION_TEXT="gcc (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0"
CONFIG_CC_IS_GCC=y
CONFIG_GCC_VERSION=90300
CONFIG_LD_VERSION=23400
CONFIG_CLANG_VERSION=0
CONFIG_LLD_VERSION=0
CONFIG_CC_CAN_LINK=y
CONFIG_CC_CAN_LINK_STATIC=y
CONFIG_CC_HAS_ASM_GOTO=y
CONFIG_CC_HAS_ASM_INLINE=y
CONFIG_IRQ_WORK=y
CONFIG_BUILDTIME_TABLE_SORT=y
CONFIG_THREAD_INFO_IN_TASK=y

#
# General setup
#
CONFIG_INIT_ENV_ARG_LIMIT=32
# CONFIG_COMPILE_TEST is not set
CONFIG_LOCALVERSION="-default"
# CONFIG_LOCALVERSION_AUTO is not set
CONFIG_BUILD_SALT=""
CONFIG_HAVE_KERNEL_GZIP=y
CONFIG_HAVE_KERNEL_BZIP2=y
CONFIG_HAVE_KERNEL_LZMA=y
CONFIG_HAVE_KERNEL_XZ=y
CONFIG_HAVE_KERNEL_LZO=y
CONFIG_HAVE_KERNEL_LZ4=y
CONFIG_HAVE_KERNEL_ZSTD=y
CONFIG_KERNEL_GZIP=y
# CONFIG_KERNEL_BZIP2 is not set
# CONFIG_KERNEL_LZMA is not set
# CONFIG_KERNEL_XZ is not set
# CONFIG_KERNEL_LZO is not set
# CONFIG_KERNEL_LZ4 is not set
# CONFIG_KERNEL_ZSTD is not set
CONFIG_DEFAULT_INIT=""
CONFIG_DEFAULT_HOSTNAME="(none)"
CONFIG_SWAP=y
CONFIG_SYSVIPC=y
CONFIG_SYSVIPC_SYSCTL=y
CONFIG_POSIX_MQUEUE=y
CONFIG_POSIX_MQUEUE_SYSCTL=y
# CONFIG_WATCH_QUEUE is not set
CONFIG_CROSS_MEMORY_ATTACH=y
CONFIG_USELIB=y
# CONFIG_AUDIT is not set
CONFIG_HAVE_ARCH_AUDITSYSCALL=y

#
# IRQ subsystem
#
CONFIG_GENERIC_IRQ_PROBE=y
CONFIG_GENERIC_IRQ_SHOW=y
CONFIG_GENERIC_IRQ_EFFECTIVE_AFF_MASK=y
CONFIG_GENERIC_PENDING_IRQ=y
CONFIG_GENERIC_IRQ_MIGRATION=y
CONFIG_HARDIRQS_SW_RESEND=y
CONFIG_IRQ_DOMAIN=y
CONFIG_IRQ_DOMAIN_HIERARCHY=y
CONFIG_GENERIC_MSI_IRQ=y
CONFIG_GENERIC_MSI_IRQ_DOMAIN=y
CONFIG_GENERIC_IRQ_MATRIX_ALLOCATOR=y
CONFIG_GENERIC_IRQ_RESERVATION_MODE=y
CONFIG_IRQ_FORCED_THREADING=y
CONFIG_SPARSE_IRQ=y
# CONFIG_GENERIC_IRQ_DEBUGFS is not set
# end of IRQ subsystem


Re: [PATCH v2] btrfs: Avoid calling btrfs_get_chunk_map() twice

2021-01-27 Thread Nikolay Borisov



On 27.01.21 г. 15:57 ч., Michal Rostecki wrote:
> From: Michal Rostecki 
> 
> Before this change, the btrfs_get_io_geometry() function was calling
> btrfs_get_chunk_map() to get the extent mapping, necessary for
> calculating the I/O geometry. It was using that extent mapping only
> internally and freeing the pointer after its execution.
> 
> That resulted in calling btrfs_get_chunk_map() de facto twice by the
> __btrfs_map_block() function. It was calling btrfs_get_io_geometry()
> first and then calling btrfs_get_chunk_map() directly to get the extent
> mapping, used by the rest of the function.
> 
> This change fixes that by passing the extent mapping to the
> btrfs_get_io_geometry() function as an argument.
> 
> v2:
> When btrfs_get_chunk_map() returns an error in btrfs_submit_direct():
> - Use errno_to_blk_status(PTR_ERR(em)) as the status
> - Set em to NULL
> 
> Signed-off-by: Michal Rostecki 
> ---
>  fs/btrfs/inode.c   | 38 +-
>  fs/btrfs/volumes.c | 39 ---
>  fs/btrfs/volumes.h |  5 +++--
>  3 files changed, 48 insertions(+), 34 deletions(-)

So this adds more code but for what benefit? In your reply to Filipe you
said you didn't observe this being a performance-affecting change so

> 
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index 0dbe1aaa0b71..e2ee3a9c1140 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -2183,9 +2183,10 @@ int btrfs_bio_fits_in_stripe(struct page *page, size_t 
> size, struct bio *bio,
>   struct inode *inode = page->mapping->host;
>   struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
>   u64 logical = bio->bi_iter.bi_sector << 9;
> + struct extent_map *em;
>   u64 length = 0;
>   u64 map_length;
> - int ret;
> + int ret = 0;
>   struct btrfs_io_geometry geom;
>  
>   if (bio_flags & EXTENT_BIO_COMPRESSED)
> @@ -2193,14 +2194,21 @@ int btrfs_bio_fits_in_stripe(struct page *page, 
> size_t size, struct bio *bio,
>  
>   length = bio->bi_iter.bi_size;
>   map_length = length;
> - ret = btrfs_get_io_geometry(fs_info, btrfs_op(bio), logical, map_length,
> - &geom);
> + em = btrfs_get_chunk_map(fs_info, logical, map_length);
> + if (IS_ERR(em))
> + return PTR_ERR(em);
> + ret = btrfs_get_io_geometry(fs_info, em, btrfs_op(bio), logical,
> + map_length, &geom);
>   if (ret < 0)
> - return ret;
> + goto out;
>  
> - if (geom.len < length + size)
> - return 1;
> - return 0;
> + if (geom.len < length + size) {
> + ret = 1;
> + goto out;
> + }

this could be simply:

if (geom.len < length + size)
	ret = 1;

> +out:
> + free_extent_map(em);
> + return ret;
>  }
>  
>  /*
> @@ -7941,10 +7949,12 @@ static blk_qc_t btrfs_submit_direct(struct inode 
> *inode, struct iomap *iomap,
>   u64 submit_len;
>   int clone_offset = 0;
>   int clone_len;
> + int logical;
>   int ret;
>   blk_status_t status;
>   struct btrfs_io_geometry geom;
>   struct btrfs_dio_data *dio_data = iomap->private;
> + struct extent_map *em;
>  
>   dip = btrfs_create_dio_private(dio_bio, inode, file_offset);
>   if (!dip) {
> @@ -7970,11 +7980,18 @@ static blk_qc_t btrfs_submit_direct(struct inode 
> *inode, struct iomap *iomap,
>   }
>  
>   start_sector = dio_bio->bi_iter.bi_sector;
> + logical = start_sector << 9;
>   submit_len = dio_bio->bi_iter.bi_size;
>  
>   do {
> - ret = btrfs_get_io_geometry(fs_info, btrfs_op(dio_bio),
> - start_sector << 9, submit_len,
> + em = btrfs_get_chunk_map(fs_info, logical, submit_len);
> + if (IS_ERR(em)) {
> + status = errno_to_blk_status(PTR_ERR(em));
> + em = NULL;
> + goto out_err;
> + }
> + ret = btrfs_get_io_geometry(fs_info, em, btrfs_op(dio_bio),
> + logical, submit_len,
> &geom);
>   if (ret) {
>   status = errno_to_blk_status(ret);
> @@ -8030,12 +8047,15 @@ static blk_qc_t btrfs_submit_direct(struct inode 
> *inode, struct iomap *iomap,
>   clone_offset += clone_len;
>   start_sector += clone_len >> 9;
>   file_offset += clone_len;
> +
> + free_extent_map(em);
>   } while (submit_len > 0);
>   return BLK_QC_T_NONE;
>  
>  out_err:
>   dip->dio_bio->bi_status = status;
>   btrfs_dio_private_put(dip);
> + free_extent_map(em);
>   return BLK_QC_T_NONE;
>  }

For example in this function you increase complexity by having to deal
with free_extent_map as well so I'm not sure this is a net-win.

>  
> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
> index a8ec8539cd8d..4c753b17c0a2 100644
> 

kprobes broken since 0d00449c7a28 ("x86: Replace ist_enter() with nmi_enter()")

2021-01-27 Thread Nikolay Borisov
Hello,

I'm currently seeing latest Linus' master being somewhat broken w.r.t
krpobes. In particular I have the following test-case:

#!/bin/bash

mkfs.btrfs -f /dev/vdc &> /dev/null
mount /dev/vdc /media/scratch/

bpftrace -e 'kprobe:btrfs_sync_file {printf("kprobe: %s\n", kstack());}'
&>bpf-output &
bpf_trace_pid=$!

# force btrfs_sync_file to be called
sleep 2
xfs_io -f -c "pwrite 0 4m" -c "fsync" /media/scratch/file5

kill $bpf_trace_pid
sleep 1

grep -q kprobe bpf-output
retval=$?
rm -f bpf-output
umount /media/scratch

exit $retval

It traces btrfs_sync_file which is called when fsync is executed on a
btrfs file, however I don't see the stacktrace being printed i.e the
kprobe doesn't fire at all. The following alternative program:

bpftrace -e 'tracepoint:btrfs:btrfs_sync_file {printf("tracepoint:
%s\n", kstack());} kprobe:btrfs_sync_file {printf("kprobe: %s\n",
kstack());}'

only prints the stack from the tracepoint and not from the kprobe, given
that the tracepoint is called from the btrfs_sync_file function.

I started bisecting this and arrived at the following commit:

0d00449c7a28 ("x86: Replace ist_enter() with nmi_enter()")

FWIW the following series is applied on the kernel I was testing:
https://lore.kernel.org/lkml/159870598914.1229682.15230803449082078353.stgit@devnote2/

but it's still broken.


[PATCH v2] locking/rwsem: Remove empty rwsem.h

2021-01-26 Thread Nikolay Borisov
This is a leftover from 7f26482a872c ("locking/percpu-rwsem: Remove the 
embedded rwsem")

Signed-off-by: Nikolay Borisov 
---
V2:
 * Add reference to commit which made the file useless.

 kernel/locking/rwsem.h | 0
 1 file changed, 0 insertions(+), 0 deletions(-)
 delete mode 100644 kernel/locking/rwsem.h

diff --git a/kernel/locking/rwsem.h b/kernel/locking/rwsem.h
deleted file mode 100644
index e69de29bb2d1..
--
2.25.1



Re: [RFC PATCH 04/37] btrfs: use bio_init_fields in volumes

2021-01-19 Thread Nikolay Borisov



On 19.01.21 г. 7:05 ч., Chaitanya Kulkarni wrote:
> Signed-off-by: Chaitanya Kulkarni 
> ---
>  fs/btrfs/volumes.c | 4 +---
>  1 file changed, 1 insertion(+), 3 deletions(-)
> 
> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
> index ee086fc56c30..836167212252 100644
> --- a/fs/btrfs/volumes.c
> +++ b/fs/btrfs/volumes.c
> @@ -6371,14 +6371,12 @@ static void submit_stripe_bio(struct btrfs_bio *bbio, 
> struct bio *bio,
>  
>   bio->bi_private = bbio;
This line can be removed as ->private is initialized by bio_init_fields.

>   btrfs_io_bio(bio)->device = dev;
> - bio->bi_end_io = btrfs_end_bio;
> - bio->bi_iter.bi_sector = physical >> 9;
> + bio_init_fields(bio, dev->bdev, physical >> 9, bbio, btrfs_end_bio, 0, 
> 0);
>   btrfs_debug_in_rcu(fs_info,
>   "btrfs_map_bio: rw %d 0x%x, sector=%llu, dev=%lu (%s id %llu), size=%u",
>   bio_op(bio), bio->bi_opf, bio->bi_iter.bi_sector,
>   (unsigned long)dev->bdev->bd_dev, rcu_str_deref(dev->name),
>   dev->devid, bio->bi_iter.bi_size);
> - bio_set_dev(bio, dev->bdev);
>  
>   btrfs_bio_counter_inc_noblocked(fs_info);
>  
> 


[PATCH] locking/rwsem: Remove empty rwsem.h

2021-01-13 Thread Nikolay Borisov
Signed-off-by: Nikolay Borisov 
---
 kernel/locking/rwsem.h | 0
 1 file changed, 0 insertions(+), 0 deletions(-)
 delete mode 100644 kernel/locking/rwsem.h

diff --git a/kernel/locking/rwsem.h b/kernel/locking/rwsem.h
deleted file mode 100644
index e69de29bb2d1..
-- 
2.25.1



Re: [RFC PATCH 10/11] mm/filemap: Add folio_add_to_page_cache

2020-12-11 Thread Nikolay Borisov



On 8.12.20 г. 21:46 ч., Matthew Wilcox (Oracle) wrote:
> Pages being added to the page cache should already be folios, so
> turn add_to_page_cache_lru() into a wrapper.  Saves hundreds of
> bytes of text.
> 
> Signed-off-by: Matthew Wilcox (Oracle) 
> ---
>  include/linux/pagemap.h | 13 +++--
>  mm/filemap.c| 62 -
>  2 files changed, 41 insertions(+), 34 deletions(-)
> 
> diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
> index 060faeb8d701..3bc56b3aa384 100644
> --- a/include/linux/pagemap.h
> +++ b/include/linux/pagemap.h
> @@ -778,9 +778,9 @@ static inline int fault_in_pages_readable(const char 
> __user *uaddr, int size)
>  }
>  
>  int add_to_page_cache_locked(struct page *page, struct address_space 
> *mapping,
> - pgoff_t index, gfp_t gfp_mask);
> -int add_to_page_cache_lru(struct page *page, struct address_space *mapping,
> - pgoff_t index, gfp_t gfp_mask);
> + pgoff_t index, gfp_t gfp);
> +int folio_add_to_page_cache(struct folio *folio, struct address_space 
> *mapping,
> + pgoff_t index, gfp_t gfp);
>  extern void delete_from_page_cache(struct page *page);
>  extern void __delete_from_page_cache(struct page *page, void *shadow);
>  int replace_page_cache_page(struct page *old, struct page *new, gfp_t 
> gfp_mask);
> @@ -805,6 +805,13 @@ static inline int add_to_page_cache(struct page *page,
>   return error;
>  }
>  
> +static inline int add_to_page_cache_lru(struct page *page,
> + struct address_space *mapping, pgoff_t index, gfp_t gfp)
> +{
> + return folio_add_to_page_cache((struct folio *)page, mapping,
> + index, gfp);
> +}
> +
>  /**
>   * struct readahead_control - Describes a readahead request.
>   *
> diff --git a/mm/filemap.c b/mm/filemap.c
> index 56ff6aa24265..297144524f58 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -828,25 +828,25 @@ int replace_page_cache_page(struct page *old, struct 
> page *new, gfp_t gfp_mask)
>  }
>  EXPORT_SYMBOL_GPL(replace_page_cache_page);
>  
> -static noinline int __add_to_page_cache_locked(struct page *page,
> +static noinline int __add_to_page_cache_locked(struct folio *folio,
>   struct address_space *mapping,
> - pgoff_t offset, gfp_t gfp,
> + pgoff_t index, gfp_t gfp,
>   void **shadowp)
>  {
> - XA_STATE(xas, &mapping->i_pages, offset);
> - int huge = PageHuge(page);
> + XA_STATE(xas, &mapping->i_pages, index);
> + int huge = PageHuge(&folio->page);

PageHuge also does page_compound; since you know whether this is the head
page or not, you could use PageHeadHuge, which simply checks if it's a
head page and then goes directly to perform the hugepage check via the
dtor.




Re: [PATCH v2 1/8] lib/find_bit.c: Add find_last_zero_bit

2020-12-06 Thread Nikolay Borisov



On 6.12.20 г. 10:56 ч., Yun Levi wrote:
>> This, and the change above this, are not related to this patch so you
>> might not want to include them.
> 
>> Also, why is this patch series even needed?  I don't see a justification
>> for it anywhere, only "what" this patch is, not "why".
> 
> I think the find_last_zero_bit will help to improve in
> 7th patch's change and It can be used in the future.
> But if my thinking is bad.. Please let me know..
> 
> Thanks.
> Levi.
> 

btrfs' free space cache v1 is going to be removed some time in the
future, so introducing a kernel-wide change just for its own sake is a
bit premature. Also, do you have measurements showing it indeed improves
performance?


Re: [PATCH] printk: ringbuffer: Convert function argument to local variable

2020-11-10 Thread Nikolay Borisov



On 10.11.20 г. 15:14 ч., John Ogness wrote:
> On 2020-11-10, Nikolay Borisov  wrote:
>> data_alloc's 2nd argument is always rb::text_data_ring and that function
>> always takes a struct printk_ringbuffer. Instead of passing the data
>> ring buffer as an argument simply make it a local variable.
> 
> This is a relic of when we had a second data ring (for
> dictionaries). The patch is a nice cleanup, but there are actually
> several functions that could use this exact same cleanup:
> 
> - data_make_reusable()
> - data_push_tail()
> - data_alloc()
> - data_realloc()
> 
> Perhaps we should fix them all in a single patch?

I observed that right after sending this patch, so I have already
authored the necessary changes; I can squash them and send them.

> 
> John Ogness
> 


[PATCH] printk: ringbuffer: Convert function argument to local variable

2020-11-10 Thread Nikolay Borisov
data_alloc's 2nd argument is always rb::text_data_ring and that function
always takes a struct printk_ringbuffer. Instead of passing the data
ring buffer as an argument simply make it a local variable.

Signed-off-by: Nikolay Borisov 
---
 kernel/printk/printk_ringbuffer.c | 9 -
 1 file changed, 4 insertions(+), 5 deletions(-)

diff --git a/kernel/printk/printk_ringbuffer.c 
b/kernel/printk/printk_ringbuffer.c
index 6b1525685277..7f2713e1bbcc 100644
--- a/kernel/printk/printk_ringbuffer.c
+++ b/kernel/printk/printk_ringbuffer.c
@@ -1021,10 +1021,10 @@ static unsigned long get_next_lpos(struct prb_data_ring 
*data_ring,
  * if necessary. This function also associates the data block with
  * a specified descriptor.
  */
-static char *data_alloc(struct printk_ringbuffer *rb,
-   struct prb_data_ring *data_ring, unsigned int size,
+static char *data_alloc(struct printk_ringbuffer *rb, unsigned int size,
struct prb_data_blk_lpos *blk_lpos, unsigned long id)
 {
+   struct prb_data_ring *data_ring = &rb->text_data_ring;
struct prb_data_block *blk;
unsigned long begin_lpos;
unsigned long next_lpos;
@@ -1397,7 +1397,7 @@ bool prb_reserve_in_last(struct prb_reserved_entry *e, 
struct printk_ringbuffer
if (r->text_buf_size > max_size)
goto fail;
 
-   r->text_buf = data_alloc(rb, &rb->text_data_ring,
r->text_buf_size,
+   r->text_buf = data_alloc(rb, r->text_buf_size,
 &r->text_blk_lpos, id);
} else {
if (!get_data(&rb->text_data_ring, &r->text_blk_lpos,
&data_size))
@@ -1549,8 +1549,7 @@ bool prb_reserve(struct prb_reserved_entry *e, struct 
printk_ringbuffer *rb,
if (info->seq > 0)
desc_make_final(desc_ring, DESC_ID(id - 1));
 
-   r->text_buf = data_alloc(rb, &rb->text_data_ring, r->text_buf_size,
- &r->text_blk_lpos, id);
+   r->text_buf = data_alloc(rb, r->text_buf_size, &r->text_blk_lpos, id);
/* If text data allocation fails, a data-less record is committed. */
if (r->text_buf_size && !r->text_buf) {
prb_commit(e);
-- 
2.25.1



Re: [PATCH] fs: tree-checker: fix missing brace warning for old compilers

2020-10-03 Thread Nikolay Borisov



On 3.10.20 г. 3:11 ч., Pujin Shi wrote:
> For older versions of gcc, the array = {0}; will cause warnings:
> 
> fs/btrfs/tree-checker.c: In function 'check_root_item':
> fs/btrfs/tree-checker.c:1038:9: warning: missing braces around initializer 
> [-Wmissing-braces]
>   struct btrfs_root_item ri = { 0 };
>  ^
> fs/btrfs/tree-checker.c:1038:9: warning: (near initialization for 'ri.inode') 
> [-Wmissing-braces]
> 
> 1 warnings generated
> 
> Fixes: 443b313c7ff8 ("btrfs: tree-checker: fix false alert caused by legacy 
> btrfs root item")
> Signed-off-by: Pujin Shi 

This is a compiler artifact, please see:
http://www.ex-parrot.com/~chris/random/initialise.html

Also, having an empty initialization list like = {}, while valid for gcc,
is actually invalid according to the official standard. Check ISO C
Standard section 6.7.9 for the correct syntax of an initializer-list.

IOW - NAK.

> ---
>  fs/btrfs/tree-checker.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/fs/btrfs/tree-checker.c b/fs/btrfs/tree-checker.c
> index f0ffd5ee77bd..5028b3af308c 100644
> --- a/fs/btrfs/tree-checker.c
> +++ b/fs/btrfs/tree-checker.c
> @@ -1035,7 +1035,7 @@ static int check_root_item(struct extent_buffer *leaf, 
> struct btrfs_key *key,
>  int slot)
>  {
>   struct btrfs_fs_info *fs_info = leaf->fs_info;
> - struct btrfs_root_item ri = { 0 };
> + struct btrfs_root_item ri = {};
>   const u64 valid_root_flags = BTRFS_ROOT_SUBVOL_RDONLY |
>BTRFS_ROOT_SUBVOL_DEAD;
>   int ret;
> 


Re: [PATCH] btrfs: Fix missing close devices

2020-09-23 Thread Nikolay Borisov



On 21.09.20 г. 10:29 ч., qiang.zh...@windriver.com wrote:
> From: Zqiang 
> 
> When the btrfs fill super error, we should first close devices and
> then call deactivate_locked_super func to free fs_info.
> 
> Signed-off-by: Zqiang 
> ---
>  fs/btrfs/super.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
> index 8840a4fa81eb..3bfd54e8f388 100644
> --- a/fs/btrfs/super.c
> +++ b/fs/btrfs/super.c
> @@ -1675,6 +1675,7 @@ static struct dentry *btrfs_mount_root(struct 
> file_system_type *fs_type,
>   error = security_sb_set_mnt_opts(s, new_sec_opts, 0, NULL);
>   security_free_mnt_opts(&new_sec_opts);
>   if (error) {
> + btrfs_close_devices(fs_devices);
>   deactivate_locked_super(s);
>   return ERR_PTR(error);
>   }
> 

NAK,

Devices are properly closed via:


deactivate_locked_super
  kill_sb (btrfs_kill_super)
kill_anon_super
  generic_shutdown_super
   put_super (btrfs_put_super)
 close_ctree


It seems you haven't done deep enough analysis of the involved call chains.


Re: More filesystem need this fix (xfs: use MMAPLOCK around filemap_map_pages())

2020-09-16 Thread Nikolay Borisov



On 17.09.20 г. 4:44 ч., Dave Chinner wrote:
> On Wed, Sep 16, 2020 at 05:58:51PM +0200, Jan Kara wrote:
>> On Sat 12-09-20 09:19:11, Amir Goldstein wrote:
>>> On Tue, Jun 23, 2020 at 8:21 AM Dave Chinner  wrote:



> 
> So
> 
> P0p1
> 
> hole punch starts
>   takes XFS_MMAPLOCK_EXCL
>   truncate_pagecache_range()
> unmap_mapping_range(start, end)
>   
>   
>   do_fault_around()
> ->map_pages
>   filemap_map_pages()
> page mapping valid,
> page is up to date
> maps PTEs
>   
> truncate_inode_pages_range()
>   truncate_cleanup_page(page)
> invalidates page
>   delete_from_page_cache_batch(page)
> frees page
>   
> 
> That doesn't seem good to me.
> 
> Sure, maybe the page hasn't been freed back to the free lists
> because of elevated refcounts. But it's been released by the
> filesystem and no longer in the page cache so nothing good can come
> of this situation...
> 
> AFAICT, this race condition exists for the truncate case as well
> as filemap_map_pages() doesn't have a "page beyond inode size" check
> in it. 

(It's not relevant to the discussion at hand but for the sake of
completeness):

It does have a check:

 max_idx = DIV_ROUND_UP(i_size_read(mapping->host), PAGE_SIZE);
 if (page->index >= max_idx)
  goto unlock;





Re: INFO: task hung in vfs_setxattr (3)

2020-09-15 Thread Nikolay Borisov



On 15.09.20 г. 7:59 ч., syzbot wrote:
> Hello,
> 
> syzbot found the following issue on:
> 
> HEAD commit:7fe10096 Merge branch 'linus' of git://git.kernel.org/pub/..
> git tree:   upstream
> console output: https://syzkaller.appspot.com/x/log.txt?x=140b085390
> kernel config:  https://syzkaller.appspot.com/x/.config?x=a9075b36a6ae26c9
> dashboard link: https://syzkaller.appspot.com/bug?extid=738cca7d7d9754493513
> compiler:   gcc (GCC) 10.1.0-syz 20200507
> syz repro:  https://syzkaller.appspot.com/x/repro.syz?x=108d45a590
> 
> The issue was bisected to:
> 
> commit 6a3c7f5c87854e948c3c234e5f5e745c7c553722
> Author: Nikolay Borisov 
> Date:   Thu May 28 08:05:13 2020 +
> 
> btrfs: don't balance btree inode pages from buffered write path
> 
> bisection log:  https://syzkaller.appspot.com/x/bisect.txt?x=14884d2190
> final oops: https://syzkaller.appspot.com/x/report.txt?x=16884d2190
> console output: https://syzkaller.appspot.com/x/log.txt?x=12884d2190


Unlikely. That patch does change the performance characteristics, because
now we don't balance metadata when doing buffered writes, but I don't see
how it can lead to a hang.


Re: [PATCH v3 00/10] NTFS read-write driver GPL implementation by Paragon Software

2020-08-29 Thread Nikolay Borisov



On 28.08.20 г. 17:39 ч., Konstantin Komarov wrote:
> This patch adds NTFS Read-Write driver to fs/ntfs3.
> 
> Having decades of expertise in commercial file systems development and huge
> test coverage, we at Paragon Software GmbH want to make our contribution to
> the Open Source Community by providing implementation of NTFS Read-Write
> driver for the Linux Kernel.
> 
> This is fully functional NTFS Read-Write driver. Current version works with
> NTFS(including v3.1) and normal/compressed/sparse files and supports journal 
> replaying.
> 
> We plan to support this version after the codebase once merged, and add new
> features and fix bugs. For example, full journaling support over JBD will be
> added in later updates.
> 
> v2:
>  - patch splitted to chunks (file-wise)
>  - build issues fixed
>  - sparse and checkpatch.pl errors fixed
>  - NULL pointer dereference on mkfs.ntfs-formatted volume mount fixed
>  - cosmetics + code cleanup
> 
> v3:
>  - added acl, noatime, no_acs_rules, prealloc mount options
>  - added fiemap support
>  - fixed encodings support
>  - removed typedefs
>  - adapted Kernel-way logging mechanisms
>  - fixed typos and corner-case issues
> 
> Konstantin Komarov (10):
>   fs/ntfs3: Add headers and misc files
>   fs/ntfs3: Add initialization of super block

This patch is missing

>   fs/ntfs3: Add bitmap
>   fs/ntfs3: Add file operations and implementation

This patch is missing

>   fs/ntfs3: Add attrib operations
>   fs/ntfs3: Add compression
>   fs/ntfs3: Add NTFS journal
This patch is missing

>   fs/ntfs3: Add Kconfig, Makefile and doc
>   fs/ntfs3: Add NTFS3 in fs/Kconfig and fs/Makefile
>   fs/ntfs3: Add MAINTAINERS
> 
>  Documentation/filesystems/ntfs3.rst |  103 +
>  MAINTAINERS |7 +
>  fs/Kconfig  |1 +
>  fs/Makefile |1 +
>  fs/ntfs3/Kconfig|   23 +
>  fs/ntfs3/Makefile   |   11 +
>  fs/ntfs3/attrib.c   | 1285 +++
>  fs/ntfs3/attrlist.c |  462 +++
>  fs/ntfs3/bitfunc.c  |  137 +
>  fs/ntfs3/bitmap.c   | 1545 
>  fs/ntfs3/debug.h|   45 +
>  fs/ntfs3/dir.c  |  642 
>  fs/ntfs3/file.c | 1214 +++
>  fs/ntfs3/frecord.c  | 2378 
>  fs/ntfs3/fslog.c| 5222 +++
>  fs/ntfs3/fsntfs.c   | 2218 
>  fs/ntfs3/index.c| 2661 ++
>  fs/ntfs3/inode.c| 2068 +++
>  fs/ntfs3/lznt.c |  451 +++
>  fs/ntfs3/namei.c|  580 +++
>  fs/ntfs3/ntfs.h | 1249 +++
>  fs/ntfs3/ntfs_fs.h  | 1001 +
>  fs/ntfs3/record.c   |  615 
>  fs/ntfs3/run.c  | 1188 ++
>  fs/ntfs3/super.c| 1406 
>  fs/ntfs3/upcase.c   |   78 +
>  fs/ntfs3/xattr.c| 1007 ++
>  27 files changed, 27598 insertions(+)
>  create mode 100644 Documentation/filesystems/ntfs3.rst
>  create mode 100644 fs/ntfs3/Kconfig
>  create mode 100644 fs/ntfs3/Makefile
>  create mode 100644 fs/ntfs3/attrib.c
>  create mode 100644 fs/ntfs3/attrlist.c
>  create mode 100644 fs/ntfs3/bitfunc.c
>  create mode 100644 fs/ntfs3/bitmap.c
>  create mode 100644 fs/ntfs3/debug.h
>  create mode 100644 fs/ntfs3/dir.c
>  create mode 100644 fs/ntfs3/file.c
>  create mode 100644 fs/ntfs3/frecord.c
>  create mode 100644 fs/ntfs3/fslog.c
>  create mode 100644 fs/ntfs3/fsntfs.c
>  create mode 100644 fs/ntfs3/index.c
>  create mode 100644 fs/ntfs3/inode.c
>  create mode 100644 fs/ntfs3/lznt.c
>  create mode 100644 fs/ntfs3/namei.c
>  create mode 100644 fs/ntfs3/ntfs.h
>  create mode 100644 fs/ntfs3/ntfs_fs.h
>  create mode 100644 fs/ntfs3/record.c
>  create mode 100644 fs/ntfs3/run.c
>  create mode 100644 fs/ntfs3/super.c
>  create mode 100644 fs/ntfs3/upcase.c
>  create mode 100644 fs/ntfs3/xattr.c
> 


Re: [PATCH] fs: NTFS read-write driver GPL implementation by Paragon Software.

2020-08-14 Thread Nikolay Borisov



On 14.08.20 г. 15:29 ч., Konstantin Komarov wrote:
> This patch adds NTFS Read-Write driver to fs/ntfs3.
> 
> Having decades of expertise in commercial file systems development and huge
> test coverage, we at Paragon Software GmbH want to make our contribution to
> the Open Source Community by providing implementation of NTFS Read-Write
> driver for the Linux Kernel.
> 
> This is fully functional NTFS Read-Write driver. Current version works with
> NTFS(including v3.1) normal/compressed/sparse files and supports journal 
> replaying.
> 
> We plan to support this version after the codebase once merged, and add new
> features and fix bugs. For example, full journaling support over JBD will be
> added in later updates.
> 
> The patch is too big to handle it within an e-mail body, so it is available 
> to download 
> on our server: https://dl.paragon-software.com/ntfs3/ntfs3.patch
> 
> Signed-off-by: Konstantin Komarov 
> 

So how exactly do you expect someone to review this monstrosity?


Re: [PATCH] btrfs: fix error value in btrfs_get_extent

2020-08-03 Thread Nikolay Borisov



On 3.08.20 г. 12:39 ч., Nikolay Borisov wrote:
> 
> 
> On 3.08.20 г. 12:35 ч., Pavel Machek wrote:
>> btrfs_get_extent() sets variable ret, but out: error path expect error
>> to be in variable err. Fix that.
>>
>> Signed-off-by: Pavel Machek (CIP) 
> 
> Good catch, this also needs:
> 
> Fixes: 6bf9e4bd6a27 ("btrfs: inode: Verify inode mode to avoid NULL
> pointer dereference")
> 
> Reviewed-by: Nikolay Borisov 

Actually, the reason this error got introduced in the first place (and I
missed it during review) is that the function is doing something rather
counter-intuitive - it's using the 'err' variable as a synonym for
'ret'. A better approach would be to simply remove 'err' from that
function. I'm now authoring such a patch; nevertheless, the issue still
stands.

> 
> 
>>
>> ---
>>
>> Notice that patch introducing this problem is on its way to 4.19.137-stable.
>>
>> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
>> index 7befb7c12bd3..4aaa01540f89 100644
>> --- a/fs/btrfs/inode.c
>> +++ b/fs/btrfs/inode.c
>> @@ -7012,7 +7012,7 @@ struct extent_map *btrfs_get_extent(struct btrfs_inode 
>> *inode,
>>  found_type == BTRFS_FILE_EXTENT_PREALLOC) {
>>  /* Only regular file could have regular/prealloc extent */
>>  if (!S_ISREG(inode->vfs_inode.i_mode)) {
>> -ret = -EUCLEAN;
>> +err = -EUCLEAN;
>>  btrfs_crit(fs_info,
>>  "regular/prealloc extent found for non-regular inode %llu",
>> btrfs_ino(inode));
>>
> 


Re: [PATCH] btrfs: fix error value in btrfs_get_extent

2020-08-03 Thread Nikolay Borisov



On 3.08.20 г. 12:35 ч., Pavel Machek wrote:
> btrfs_get_extent() sets variable ret, but out: error path expect error
> to be in variable err. Fix that.
> 
> Signed-off-by: Pavel Machek (CIP) 

Good catch, this also needs:

Fixes: 6bf9e4bd6a27 ("btrfs: inode: Verify inode mode to avoid NULL
pointer dereference")

Reviewed-by: Nikolay Borisov 


> 
> ---
> 
> Notice that patch introducing this problem is on its way to 4.19.137-stable.
> 
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index 7befb7c12bd3..4aaa01540f89 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -7012,7 +7012,7 @@ struct extent_map *btrfs_get_extent(struct btrfs_inode 
> *inode,
>   found_type == BTRFS_FILE_EXTENT_PREALLOC) {
>   /* Only regular file could have regular/prealloc extent */
>   if (!S_ISREG(inode->vfs_inode.i_mode)) {
> - ret = -EUCLEAN;
> + err = -EUCLEAN;
>   btrfs_crit(fs_info,
>   "regular/prealloc extent found for non-regular inode %llu",
>  btrfs_ino(inode));
> 


Re: [PATCH] btfrs: initialize return of btrfs_extent_same

2020-07-05 Thread Nikolay Borisov



On 5.07.20 г. 17:20 ч., t...@redhat.com wrote:
> From: Tom Rix 
> 
> clang static analysis flags a garbage return
> 
> fs/btrfs/reflink.c:611:2: warning: Undefined or garbage value returned to 
> caller [core.uninitialized.UndefReturn]
> return ret;
> ^~
> ret will not be set when olen is 0
> When olen is 0, this function does no work.
> 
> So initialize ret to 0
> 
> Signed-off-by: Tom Rix 

And I forgot:

Reviewed-by: Nikolay Borisov 

> ---
>  fs/btrfs/reflink.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/fs/btrfs/reflink.c b/fs/btrfs/reflink.c
> index 040009d1cc31..200a80fcbecb 100644
> --- a/fs/btrfs/reflink.c
> +++ b/fs/btrfs/reflink.c
> @@ -572,7 +572,7 @@ static int btrfs_extent_same_range(struct inode *src, u64 
> loff, u64 len,
>  static int btrfs_extent_same(struct inode *src, u64 loff, u64 olen,
>struct inode *dst, u64 dst_loff)
>  {
> - int ret;
> + int ret = 0;
>   u64 i, tail_len, chunk_count;
>   struct btrfs_root *root_dst = BTRFS_I(dst)->root;
>  
> 


Re: [PATCH] btfrs: initialize return of btrfs_extent_same

2020-07-05 Thread Nikolay Borisov



On 5.07.20 г. 17:20 ч., t...@redhat.com wrote:
> From: Tom Rix 
> 
> clang static analysis flags a garbage return
> 
> fs/btrfs/reflink.c:611:2: warning: Undefined or garbage value returned to 
> caller [core.uninitialized.UndefReturn]
> return ret;
> ^~
> ret will not be set when olen is 0
> When olen is 0, this function does no work.
> 
> So initialize ret to 0
> 
> Signed-off-by: Tom Rix 

The patch itself is good; however, the bug cannot currently be triggered,
due to the following code in the sole caller (btrfs_remap_file_range):



   15 if (ret < 0 || len == 0)

   14 goto out_unlock;

   13

   12 if (remap_flags & REMAP_FILE_DEDUP)

   11 ret = btrfs_extent_same(src_inode, off, len,
dst_inode, destoff);
   10 else

9 ret = btrfs_clone_files(dst_file, src_file, off,
len, destoff);



i.e len is guaranteed to be non-zero

> ---
>  fs/btrfs/reflink.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/fs/btrfs/reflink.c b/fs/btrfs/reflink.c
> index 040009d1cc31..200a80fcbecb 100644
> --- a/fs/btrfs/reflink.c
> +++ b/fs/btrfs/reflink.c
> @@ -572,7 +572,7 @@ static int btrfs_extent_same_range(struct inode *src, u64 
> loff, u64 len,
>  static int btrfs_extent_same(struct inode *src, u64 loff, u64 olen,
>struct inode *dst, u64 dst_loff)
>  {
> - int ret;
> + int ret = 0;
>   u64 i, tail_len, chunk_count;
>   struct btrfs_root *root_dst = BTRFS_I(dst)->root;
>  
> 


Re: Lockdep warning after `mdadm -S`

2020-06-10 Thread Nikolay Borisov



On 10.06.20 г. 10:19 ч., Michał Mirosław wrote:
> Dear Developers,
> 
> I found a lockdep warning in dmesg some after doing 'mdadm -S' while
> also having btrfs mounted (light to none I/O load).  Disks under MD and
> btrfs are unrelated.

Huhz, I think that's genuine, because btrfs and md share at least the bd_mutex
and the workqueue. So the scenario could go something along the lines of:

   

 
T1: initiates the process of stopping md; holds bd_mutex and open_mutex;
blocks on flushing md_misc_wq (callstack #6).

T2: mddev_delayed_delete should begin on md_misc_wq, but never does
because the workqueue is blocked due to T3 being blocked.

T3: wants to begin a new transaction but blocks on callstack #0 due to a
running transaction commit in T4. NB: this happens from writeback
context, i.e. the same workqueue as T2.

T4: transaction commit, holding sb_internal, blocked on device_list_mutex
(callstack #3).

T5: user tries to (erroneously) mount the mddev; holds device_list_mutex
but blocks on callstack #4 since bd_mutex is held by T1.

So T5 blocks T4, which blocks T3, which blocks the shared writeback
workqueue; this prevents T2 from running, which when done would cause T1
to unlock bd_mutex, which would unblock T5. I think this is generally
possible but highly unlikely.

Also, looking at the code in T5 (callstack #4 below), it seems that the
same could happen if a scan ioctl is sent for the mddev. Discussing this
with peterz, he proposed the following half-baked patch:
https://paste.debian.net/1151314/

The idea is to remove the md open_mutex, which would break the dependency
chain between #4->#6. What do the mdraid people think?


> 
> Best Regards,
> Michał Mirosław
> 
> ==
> WARNING: possible circular locking dependency detected
> 5.7.1mq+ #383 Tainted: G   O 
> --
> kworker/u16:3/8175 is trying to acquire lock:
> 8882f19556a0 (sb_internal#3){.+.+}-{0:0}, at: 
> start_transaction+0x37e/0x550 [btrfs]
> 
> but task is already holding lock:
> c900087c7e68 ((work_completion)(&(&wb->dwork)->work)){+.+.}-{0:0}, at: 
> process_one_work+0x235/0x620
> 
> which lock already depends on the new lock.
> 
> 
> the existing dependency chain (in reverse order) is:
> 
> -> #8 ((work_completion)(&(&wb->dwork)->work)){+.+.}-{0:0}:
>__flush_work+0x331/0x490
>wb_shutdown+0x8f/0xb0
>bdi_unregister+0x72/0x1f0
>del_gendisk+0x2b0/0x2c0
>md_free+0x28/0x90
>kobject_put+0xa6/0x1b0
>process_one_work+0x2b6/0x620
>worker_thread+0x35/0x3e0
>kthread+0x143/0x160
>ret_from_fork+0x3a/0x50
> 
> -> #7 ((work_completion)(&mddev->del_work)){+.+.}-{0:0}:
>process_one_work+0x28d/0x620
>worker_thread+0x35/0x3e0
>kthread+0x143/0x160
>ret_from_fork+0x3a/0x50
> 
> -> #6 ((wq_completion)md_misc){+.+.}-{0:0}:
>flush_workqueue+0xa9/0x4e0
>__md_stop_writes+0x18/0x100
>do_md_stop+0x165/0x2d0
>md_ioctl+0xa52/0x1d60
>blkdev_ioctl+0x1cc/0x2a0
>block_ioctl+0x3a/0x40
>ksys_ioctl+0x81/0xc0
>__x64_sys_ioctl+0x11/0x20
>do_syscall_64+0x4f/0x210
>entry_SYSCALL_64_after_hwframe+0x49/0xb3
> 
> -> #5 (&mddev->open_mutex){+.+.}-{3:3}:
>__mutex_lock+0x93/0x9c0
>md_open+0x43/0xc0
>__blkdev_get+0xea/0x560
>blkdev_get+0x60/0x130
>do_dentry_open+0x147/0x3e0
>path_openat+0x84f/0xa80
>do_filp_open+0x8e/0x100
>do_sys_openat2+0x225/0x2e0
>do_sys_open+0x46/0x80
>do_syscall_64+0x4f/0x210
>entry_SYSCALL_64_after_hwframe+0x49/0xb3
> 
> -> #4 (&bdev->bd_mutex){+.+.}-{3:3}:
>__mutex_lock+0x93/0x9c0
>__blkdev_get+0x77/0x560
>blkdev_get+0x60/0x130
>blkdev_get_by_path+0x41/0x80
>btrfs_get_bdev_and_sb+0x16/0xb0 [btrfs]
>open_fs_devices+0x9d/0x240 [btrfs]
>btrfs_open_devices+0x89/0x90 [btrfs]
>

[RESEND PATCH] bloat-o-meter: Support comparing library archives

2020-06-03 Thread Nikolay Borisov
Library archives (.a) usually contain multiple object files so their
output of nm --size-sort contains lines like:


03a8 t run_test

extent-map-tests.o:


bloat-o-meter currently doesn't handle them which results in errors
when calling .split() on them. Fix this by simply ignoring them. This
enables diffing subsystems which generate built-in.a files.

Signed-off-by: Nikolay Borisov 
---

Resending and CCing Andrew this time.

 scripts/bloat-o-meter | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/scripts/bloat-o-meter b/scripts/bloat-o-meter
index 8c965f6a9881..d7ca46c612b3 100755
--- a/scripts/bloat-o-meter
+++ b/scripts/bloat-o-meter
@@ -26,6 +26,8 @@ re_NUMBER = re.compile(r'\.[0-9]+')
 sym = {}
 with os.popen("nm --size-sort " + file) as f:
 for line in f:
+if line.startswith("\n") or ":" in line:
+continue
 size, type, name = line.split()
 if type in format:
 # strip generated symbols
--
2.17.1



[PATCH] bloat-o-meter: Support comparing library archives

2020-05-31 Thread Nikolay Borisov
Library archives (.a) usually contain multiple object files so their
output of nm --size-sort contains lines like:


03a8 t run_test

extent-map-tests.o:


bloat-o-meter currently doesn't handle them which results in errors
when calling .split() on them. Fix this by simply ignoring them. This
enables diffing subsystems which generate built-in.a files.

Signed-off-by: Nikolay Borisov 
---
 scripts/bloat-o-meter | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/scripts/bloat-o-meter b/scripts/bloat-o-meter
index 8c965f6a9881..d7ca46c612b3 100755
--- a/scripts/bloat-o-meter
+++ b/scripts/bloat-o-meter
@@ -26,6 +26,8 @@ re_NUMBER = re.compile(r'\.[0-9]+')
 sym = {}
 with os.popen("nm --size-sort " + file) as f:
 for line in f:
+if line.startswith("\n") or ":" in line:
+continue
 size, type, name = line.split()
 if type in format:
 # strip generated symbols
-- 
2.17.1



Re: [PATCH 1/4] fs: btrfs: fix a data race in btrfs_block_group_done()

2020-05-09 Thread Nikolay Borisov



On 9.05.20 г. 8:20 ч., Jia-Ju Bai wrote:
> The functions btrfs_block_group_done() and caching_thread() are 
> concurrently executed at runtime in the following call contexts:
> 
> Thread 1:
>   btrfs_sync_file()
> start_ordered_ops()
>   btrfs_fdatawrite_range()
> btrfs_writepages() [via function pointer]
>   extent_writepages()
> extent_write_cache_pages()
>   __extent_writepage()
> writepage_delalloc()
>   btrfs_run_delalloc_range()
> cow_file_range()
>   btrfs_reserve_extent()
> find_free_extent()
>   btrfs_block_group_done()
> 
> Thread 2:
>   caching_thread()
> 
> In btrfs_block_group_done():
>   smp_mb();
>   return cache->cached == BTRFS_CACHE_FINISHED ||
>   cache->cached == BTRFS_CACHE_ERROR;
> 
> In caching_thread():
>   spin_lock(&block_group->lock);
>   block_group->caching_ctl = NULL;
>   block_group->cached = ret ? BTRFS_CACHE_ERROR : BTRFS_CACHE_FINISHED;
>   spin_unlock(&block_group->lock);
> 
> The values cache->cached and block_group->cached access the same memory, 
> and thus a data race can occur.
> This data race was found and actually reproduced by our concurrency 
> fuzzer.
> 
> To fix this race, the spinlock cache->lock is used to protect the 
> access to cache->cached in btrfs_block_group_done().
> 
> Signed-off-by: Jia-Ju Bai 
> ---
>  fs/btrfs/block-group.h | 8 ++--
>  1 file changed, 6 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/btrfs/block-group.h b/fs/btrfs/block-group.h
> index 107bb557ca8d..fb5f12acea40 100644
> --- a/fs/btrfs/block-group.h
> +++ b/fs/btrfs/block-group.h
> @@ -278,9 +278,13 @@ static inline u64 btrfs_system_alloc_profile(struct 
> btrfs_fs_info *fs_info)
>  
>  static inline int btrfs_block_group_done(struct btrfs_block_group *cache)
>  {
> + int flag;
>   smp_mb();
> - return cache->cached == BTRFS_CACHE_FINISHED ||
> - cache->cached == BTRFS_CACHE_ERROR;
> + spin_lock(&cache->lock);
> + flag = (cache->cached == BTRFS_CACHE_FINISHED ||
> + cache->cached == BTRFS_CACHE_ERROR);
> + spin_unlock(&cache->lock);
> + return flag;

Using the lock also obviates the need for the memory barrier.
Furthermore this race is benign because even if it occurs and we call
into btrfs_cache_block_group we do check cache->cached under
btrfs_block_group::lock and do the right thing, though we would have
done a bit more unnecessary work in that case. So have you actually
measured the effect of adding the spinlock? The code is likely written
this way to keep the fastpath lock-free. Perhaps a better approach would
be to annotate the accesses of ->cached with READ_ONCE/WRITE_ONCE so the
value is fetched from memory and not optimized out by the compiler, and
leave the heavier work in the unlikely path.
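
Something along these lines (an untested sketch just to illustrate the
idea - the writer side in caching_thread() would pair it with a
WRITE_ONCE under the block group lock):

	static inline int btrfs_block_group_done(struct btrfs_block_group *cache)
	{
		/* read ->cached exactly once so the compiler cannot reload it */
		int cached = READ_ONCE(cache->cached);

		smp_mb();
		return cached == BTRFS_CACHE_FINISHED || cached == BTRFS_CACHE_ERROR;
	}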

Please exercise some critical thinking when looking into such issues as
there might be a good reason why something has been coded in a
particular way and it might look wrong on a first look but in fact is not.

>  }
>  
>  #ifdef CONFIG_BTRFS_FS_RUN_SANITY_TESTS
> 


Re: [PATCH] btrfs: prevent memory leak in super.c

2019-09-24 Thread Nikolay Borisov



On 24.09.19 г. 1:57 ч., Navid Emamdoost wrote:
> In btrfs_mount_root the last error checking was not going to the error
> handling path. Fixed it.
> 
> Signed-off-by: Navid Emamdoost 

NAK

deactivate_locked_super actually calls btrfs_kill_super which in turn
calls generic_shutdown_super which does the required shutdown sequence.
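
In other words the existing error path is already complete, roughly:

	if (error) {
		/*
		 * deactivate_locked_super() ends up in btrfs_kill_super() ->
		 * generic_shutdown_super(), which already closes the devices,
		 * so also jumping to error_close_devices here would tear the
		 * same state down a second time.
		 */
		deactivate_locked_super(s);
		return ERR_PTR(error);
	}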

> ---
>  fs/btrfs/super.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
> index 1b151af25772..9f3f62c000fa 100644
> --- a/fs/btrfs/super.c
> +++ b/fs/btrfs/super.c
> @@ -1565,7 +1565,7 @@ static struct dentry *btrfs_mount_root(struct 
> file_system_type *fs_type,
>   security_free_mnt_opts(_sec_opts);
>   if (error) {
>   deactivate_locked_super(s);
> - return ERR_PTR(error);
> + goto error_close_devices;
>   }
>  
>   return dget(s->s_root);
> 


Re: [PATCH] btrfs compression: check string length

2019-09-24 Thread Nikolay Borisov



On 24.09.19 г. 9:14 ч., Pavel Machek wrote:
> AFAICT, with current code user could pass something like "lzox" and
> still get "lzo" compression. Check string lengths to prevent that.
> 
> Signed-off-by: Pavel Machek 
> 
> diff --git a/fs/btrfs/compression.c b/fs/btrfs/compression.c
> index b05b361..1083ab4 100644
> --- a/fs/btrfs/compression.c
> +++ b/fs/btrfs/compression.c
> @@ -51,7 +51,7 @@ bool btrfs_compress_is_valid_type(const char *str, size_t 
> len)
>   for (i = 1; i < ARRAY_SIZE(btrfs_compress_types); i++) {
>   size_t comp_len = strlen(btrfs_compress_types[i]);
>  
> - if (len < comp_len)
> + if (len != comp_len)
>   continue;

It's like that so that we can support compression strings such as
zlib:9. In fact the initial version was written like you suggest:

https://www.mail-archive.com/linux-btrfs@vger.kernel.org/msg88216.html

>  
>   if (!strncmp(btrfs_compress_types[i], str, comp_len))
> 
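
For reference, this is roughly what the helper looks like today (a
sketch, not the literal tree contents) and why the prefix match is
intentional:

	bool btrfs_compress_is_valid_type(const char *str, size_t len)
	{
		int i;

		for (i = 1; i < ARRAY_SIZE(btrfs_compress_types); i++) {
			size_t comp_len = strlen(btrfs_compress_types[i]);

			/*
			 * "zlib:9" comes in with len == 6 while comp_len for
			 * "zlib" is 4: skipping only when len < comp_len and
			 * comparing just the first comp_len bytes is what lets
			 * the ":9" level suffix through, so len != comp_len
			 * would break that.
			 */
			if (len < comp_len)
				continue;

			if (!strncmp(btrfs_compress_types[i], str, comp_len))
				return true;
		}

		return false;
	}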


Re: [PATCH 3/5] Btrfs: only associate the locked page with one async_cow struct

2019-07-26 Thread Nikolay Borisov



On 11.07.19 г. 22:52 ч., Chris Mason wrote:
> On 11 Jul 2019, at 12:00, Nikolay Borisov wrote:
> 
>> On 10.07.19 г. 22:28 ч., Tejun Heo wrote:
>>> From: Chris Mason 
>>>
>>> The btrfs writepages function collects a large range of pages flagged
>>> for delayed allocation, and then sends them down through the COW code
>>> for processing.  When compression is on, we allocate one async_cow
>>
>> nit: The code no longer uses async_cow to represent in-flight chunks 
>> but
>> the more aptly named async_chunk. Presumably this patchset predates
>> those changes.
> 
> Not by much, but yes.
> 
>>
>>>
>>> The end result is that cowA is completed and cleaned up before cowB 
>>> even
>>> starts processing.  This means we can free locked_page() and reuse it
>>> elsewhere.  If we get really lucky, it'll have the same page->index 
>>> in
>>> its new home as it did before.
>>>
>>> While we're processing cowB, we might decide we need to fall back to
>>> uncompressed IO, and so compress_file_range() will call
>>> __set_page_dirty_nobufers() on cowB->locked_page.
>>>
>>> Without cgroups in use, this creates as a phantom dirty page, which> 
>>> isn't great but isn't the end of the world.  With cgroups in use, we
>>
>> Having a phantom dirty page is not great but not terrible without
>> cgroups but apart from that, does it have any other implications?
> 
> Best case, it'll go through the writepage fixup worker and go through 
> the whole cow machinery again.  Worst case we go to this code more than 
> once:
> 
>  /*
>   * if page_started, cow_file_range inserted an
>   * inline extent and took care of all the 
> unlocking
>   * and IO for us.  Otherwise, we need to submit
>   * all those pages down to the drive.
>   */
>  if (!page_started && !ret)
>  extent_write_locked_range(inode,
>async_extent->start,
>async_extent->start +
>async_extent->ram_size 
> - 1,
>WB_SYNC_ALL);
>  else if (ret)
>  unlock_page(async_chunk->locked_page);
> 
> 
> That never happened in production as far as I can tell, but it seems 
> possible.
> 
>>
>>
>>> might crash in the accounting code because page->mapping->i_wb isn't
>>> set.
>>>
>>> [ 8308.523110] BUG: unable to handle kernel NULL pointer dereference 
>>> at 00d0
>>> [ 8308.531084] IP: percpu_counter_add_batch+0x11/0x70
>>> [ 8308.538371] PGD 66534e067 P4D 66534e067 PUD 66534f067 PMD 0
>>> [ 8308.541750] Oops:  [#1] SMP DEBUG_PAGEALLOC
>>> [ 8308.551948] CPU: 16 PID: 2172 Comm: rm Not tainted
>>> [ 8308.566883] RIP: 0010:percpu_counter_add_batch+0x11/0x70
>>> [ 8308.567891] RSP: 0018:c9000a97bbe0 EFLAGS: 00010286
>>> [ 8308.568986] RAX: 0005 RBX: 0090 RCX: 
>>> 00026115
>>> [ 8308.570734] RDX: 0030 RSI:  RDI: 
>>> 0090
>>> [ 8308.572543] RBP:  R08: fff5 R09: 
>>> 
>>> [ 8308.573856] R10: 000260c0 R11: 881037fc26c0 R12: 
>>> 
>>> [ 8308.580099] R13: 880fe4111548 R14: c9000a97bc90 R15: 
>>> 0001
>>> [ 8308.582520] FS:  7f5503ced480() GS:880ff720() 
>>> knlGS:
>>> [ 8308.585440] CS:  0010 DS:  ES:  CR0: 80050033
>>> [ 8308.587951] CR2: 00d0 CR3: 0001e0459005 CR4: 
>>> 00360ee0
>>> [ 8308.590707] DR0:  DR1:  DR2: 
>>> 
>>> [ 8308.592865] DR3:  DR6: fffe0ff0 DR7: 
>>> 0400
>>> [ 8308.594469] Call Trace:
>>> [ 8308.595149]  account_page_cleaned+0x15b/0x1f0
>>> [ 8308.596340]  __cancel_dirty_page+0x146/0x200
>>> [ 8308.599395]  truncate_cleanup_page+0x92/0xb0
>>> [ 8308.600480]  truncate_inode_pages_range+0x202/0x7d0
>>> [ 8308.617392]  btrfs_evict_inode+0x92/0x5a0
>>> [ 8308.619108]  evict+0xc1/0x190
>&

Re: BTRFS: Kmemleak errors with do_sys_ftruncate

2019-07-26 Thread Nikolay Borisov



On 26.07.19 г. 16:16 ч., Paul Menzel wrote:
> Dear Linux folks,
> 
> 
> On a Power 8 server
> 
> Linux power 5.3.0-rc1+ #1 SMP Fri Jul 26 11:34:28 CEST 2019 ppc64le 
> ppc64le ppc64le GNU/Linux
> 
> Kmemleak reports the warnings below.
> 
> ```
> $ sudo more /sys/kernel/debug/kmemleak
> unreferenced object 0xc0f290866130 (size 80):
>   comm "networkd-dispat", pid 7630, jiffies 4294903138 (age 6540.748s)
>   hex dump (first 32 bytes):
> 00 00 00 00 00 00 00 00 ff ff 00 00 00 00 00 00  
> 40 61 86 90 f2 00 00 c0 00 00 00 00 00 00 00 00  @a..
>   backtrace:
> [] alloc_extent_state+0x4c/0x1b0 [btrfs]
> [<3f22a890>] __set_extent_bit+0x330/0x790 [btrfs]
> [<20fcf70f>] lock_extent_bits+0x98/0x320 [btrfs]
> [<1d1d0f77>] btrfs_lock_and_flush_ordered_range+0xf0/0x1b0 [btrfs]
> [<195c001c>] btrfs_cont_expand+0x14c/0x600 [btrfs]
> [] btrfs_setattr+0x130/0x770 [btrfs]
> [<8f89382e>] notify_change+0x218/0x5b0
> [] do_truncate+0x9c/0x140
> [<8b736052>] do_sys_ftruncate+0x1e4/0x210
> [] system_call+0x5c/0x70
> unreferenced object 0xc07f9cc555f0 (size 80):
>   comm "unattended-upgr", pid 8069, jiffies 4294903488 (age 6539.396s)
>   hex dump (first 32 bytes):
> 00 00 00 00 00 00 00 00 ff ff 00 00 00 00 00 00  
> 00 56 c5 9c 7f 00 00 c0 00 00 00 00 00 00 00 00  .V..
>   backtrace:
> [] alloc_extent_state+0x4c/0x1b0 [btrfs]
> [<3f22a890>] __set_extent_bit+0x330/0x790 [btrfs]
> [<20fcf70f>] lock_extent_bits+0x98/0x320 [btrfs]
> [<1d1d0f77>] btrfs_lock_and_flush_ordered_range+0xf0/0x1b0 [btrfs]
> [<195c001c>] btrfs_cont_expand+0x14c/0x600 [btrfs]
> [] btrfs_setattr+0x130/0x770 [btrfs]
> [<8f89382e>] notify_change+0x218/0x5b0
> [] do_truncate+0x9c/0x140
> [<8b736052>] do_sys_ftruncate+0x1e4/0x210
> [] system_call+0x5c/0x70
> unreferenced object 0xc07fa1431fe0 (size 80):
>   comm "lxd", pid 8347, jiffies 4294904066 (age 6537.104s)
>   hex dump (first 32 bytes):
> 00 00 00 00 00 00 00 00 ff ff 00 00 00 00 00 00  
> f0 1f 43 a1 7f 00 00 c0 00 00 00 00 00 00 00 00  ..C.
>   backtrace:
> [] alloc_extent_state+0x4c/0x1b0 [btrfs]
> [<3f22a890>] __set_extent_bit+0x330/0x790 [btrfs]
> [<20fcf70f>] lock_extent_bits+0x98/0x320 [btrfs]
> [<1d1d0f77>] btrfs_lock_and_flush_ordered_range+0xf0/0x1b0 [btrfs]
> [<195c001c>] btrfs_cont_expand+0x14c/0x600 [btrfs]
> [] btrfs_setattr+0x130/0x770 [btrfs]
> [<8f89382e>] notify_change+0x218/0x5b0
> [] do_truncate+0x9c/0x140
> [<8b736052>] do_sys_ftruncate+0x1e4/0x210
> [] system_call+0x5c/0x70
> unreferenced object 0xc07f98359470 (size 80):
>   comm "lxd", pid 8347, jiffies 4294904738 (age 6534.416s)
>   hex dump (first 32 bytes):
> 00 00 00 00 00 00 00 00 ff ff 00 00 00 00 00 00  
> 80 94 35 98 7f 00 00 c0 00 00 00 00 00 00 00 00  ..5.
>   backtrace:
> [] alloc_extent_state+0x4c/0x1b0 [btrfs]
> [<3f22a890>] __set_extent_bit+0x330/0x790 [btrfs]
> [<20fcf70f>] lock_extent_bits+0x98/0x320 [btrfs]
> [<1d1d0f77>] btrfs_lock_and_flush_ordered_range+0xf0/0x1b0 [btrfs]
> [<195c001c>] btrfs_cont_expand+0x14c/0x600 [btrfs]
> [] btrfs_setattr+0x130/0x770 [btrfs]
> [<8f89382e>] notify_change+0x218/0x5b0
> [] do_truncate+0x9c/0x140
> [<8b736052>] do_sys_ftruncate+0x1e4/0x210
> [] system_call+0x5c/0x70
> unreferenced object 0xc07f9e7078f0 (size 80):
>   comm "landscape-sysin", pid 10187, jiffies 4295061226 (age 5908.520s)
>   hex dump (first 32 bytes):
> 00 00 00 00 00 00 00 00 ff ff 00 00 00 00 00 00  
> 00 79 70 9e 7f 00 00 c0 00 00 00 00 00 00 00 00  .yp.
>   backtrace:
> [] alloc_extent_state+0x4c/0x1b0 [btrfs]
> [<3f22a890>] __set_extent_bit+0x330/0x790 [btrfs]
> [<20fcf70f>] lock_extent_bits+0x98/0x320 [btrfs]
> [<1d1d0f77>] btrfs_lock_and_flush_ordered_range+0xf0/0x1b0 [btrfs]
> [<195c001c>] btrfs_cont_expand+0x14c/0x600 [btrfs]
> [] btrfs_setattr+0x130/0x770 [btrfs]
> [<8f89382e>] notify_change+0x218/0x5b0
> [] do_truncate+0x9c/0x140
> [<8b736052>] do_sys_ftruncate+0x1e4/0x210
> [] system_call+0x5c/0x70
> unreferenced object 0xc0f29b4f48d0 (size 80):
>   comm "landscape-sysin", pid 14433, jiffies 4295569439 (age 

5.3-rc1 BUGS in dma_addressing_limited

2019-07-24 Thread Nikolay Borisov
Hello Christoph, 

5.3-rc1 crashes for me when run in qemu with scsi disks. 
Quick investigation shows that the following triggers a BUG_ON: 

diff --git a/include/linux/dma-mapping.h b/include/linux/dma-mapping.h
index e11b115dd0e4..4465e352b8dd 100644
--- a/include/linux/dma-mapping.h
+++ b/include/linux/dma-mapping.h
@@ -689,6 +689,7 @@ static inline int dma_coerce_mask_and_coherent(struct 
device *dev, u64 mask)
  */
 static inline bool dma_addressing_limited(struct device *dev)
 {
+   BUG_ON(!(dev->dma_mask));
return min_not_zero(*dev->dma_mask, dev->bus_dma_mask) <
dma_get_required_mask(dev);

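I'm not sure what the correct fix is, but making the helper tolerate a
missing mask would look something like the below (completely untested,
and whether "not limited" is the right answer for such devices is a
separate question):

	static inline bool dma_addressing_limited(struct device *dev)
	{
		/* e.g. virtio-scsi children get here without a dma_mask set */
		if (!dev->dma_mask)
			return false;

		return min_not_zero(*dev->dma_mask, dev->bus_dma_mask) <
			dma_get_required_mask(dev);
	}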

Without the BUG_ON, here is what the real backtrace looks like:

[5.387839] scsi host0: Virtio SCSI HBA
[5.389860] BUG: kernel NULL pointer dereference, address: 
[5.390217] #PF: supervisor read access in kernel mode
[5.390520] #PF: error_code(0x) - not-present page
[5.390813] PGD 0 P4D 0 
[5.391007] Oops:  [#1] SMP
[5.391007] CPU: 3 PID: 1 Comm: swapper/0 Not tainted 5.3.0-rc1-default #578
[5.391007] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
1.10.2-1ubuntu1 04/01/2014
[5.391007] RIP: 0010:dma_direct_max_mapping_size+0x21/0x80
[5.391007] Code: 0f b6 c0 c3 0f 1f 44 00 00 0f 1f 44 00 00 55 53 48 89 fb 
e8 f1 0e 00 00 84 c0 74 42 48 8b 83 e8 01 00 00 48 8b ab f8 01 00 00 <48> 8b 00 
48 85 c0 74 0c 48 85 ed 74 31 48 39 c5 48 0f 47 e8 48 89
[5.391007] RSP: :b0edc0013ac0 EFLAGS: 00010202
[5.391007] RAX:  RBX: 9216f9d8b838 RCX: 
[5.391007] RDX:  RSI: 007e RDI: 9216f9d8b838
[5.391007] RBP:  R08: 000249ffd97b R09: 0001
[5.391007] R10:  R11:  R12: 9216f9d8b838
[5.391007] R13:  R14: 9216f7348580 R15: 
[5.391007] FS:  () GS:9216fba0() 
knlGS:
[5.391007] CS:  0010 DS:  ES:  CR0: 80050033
[5.391007] CR2:  CR3: 7a211000 CR4: 06e0
[5.391007] Call Trace:
[5.391007]  __scsi_init_queue+0x75/0x130
[5.391007]  scsi_mq_alloc_queue+0x34/0x50
[5.391007]  scsi_alloc_sdev+0x232/0x300
[5.391007]  scsi_probe_and_add_lun+0x482/0xda0
[5.391007]  ? scsi_alloc_target+0x282/0x340
[5.391007]  __scsi_scan_target+0xe6/0x610
[5.391007]  ? sched_clock_local+0x12/0x80
[5.391007]  ? sched_clock_cpu+0x94/0xc0
[5.391007]  scsi_scan_channel.part.15+0x55/0x70
[5.391007]  scsi_scan_host_selected+0xd7/0x180
[5.391007]  virtscsi_probe+0x6f6/0x710
[5.391007]  ? msi_get_domain_info+0x10/0x10
[5.391007]  virtio_dev_probe+0x147/0x1d0
[5.391007]  really_probe+0xd6/0x3b0
[5.391007]  ? set_debug_rodata+0x11/0x11
[5.391007]  device_driver_attach+0x4f/0x60
[5.391007]  __driver_attach+0x99/0x130
[5.391007]  ? device_driver_attach+0x60/0x60
[5.391007]  bus_for_each_dev+0x76/0xc0
[5.391007]  bus_add_driver+0x144/0x220
[5.391007]  ? sym2_init+0xf6/0xf6
[5.391007]  driver_register+0x5b/0xe0
[5.391007]  ? sym2_init+0xf6/0xf6
[5.391007]  init+0x86/0xcc
[5.391007]  do_one_initcall+0x5a/0x2d4
[5.391007]  ? set_debug_rodata+0x11/0x11
[5.391007]  ? rcu_read_lock_sched_held+0x74/0x80
[5.391007]  kernel_init_freeable+0x139/0x1c9
[5.391007]  ? rest_init+0x260/0x260
[5.391007]  kernel_init+0xa/0x100
[5.391007]  ret_from_fork+0x24/0x30
[5.391007] Modules linked in:
[5.391007] CR2: 
[5.391007] ---[ end trace 03e50b8909d2f2e5 ]---
[5.391007] RIP: 0010:dma_direct_max_mapping_size+0x21/0x80
[5.391007] Code: 0f b6 c0 c3 0f 1f 44 00 00 0f 1f 44 00 00 55 53 48 89 fb 
e8 f1 0e 00 00 84 c0 74 42 48 8b 83 e8 01 00 00 48 8b ab f8 01 00 00 <48> 8b 00 
48 85 c0 74 0c 48 85 ed 74 31 48 39 c5 48 0f 47 e8 48 89
[5.391007] RSP: :b0edc0013ac0 EFLAGS: 00010202
[5.391007] RAX:  RBX: 9216f9d8b838 RCX: 
[5.391007] RDX:  RSI: 007e RDI: 9216f9d8b838
[5.391007] RBP:  R08: 000249ffd97b R09: 0001
[5.391007] R10:  R11:  R12: 9216f9d8b838
[5.391007] R13:  R14: 9216f7348580 R15: 
[5.391007] FS:  () GS:9216fba0() 
knlGS:
[5.391007] CS:  0010 DS:  ES:  CR0: 80050033
[5.391007] CR2:  CR3: 7a211000 CR4: 06e0
[5.391007] BUG: sleeping function called from invalid context at 
./include/linux/percpu-rwsem.h:38
[5.391007] in_atomic(): 0, irqs_disabled(): 1, pid: 1, name: swapper/0
[5.391007] INFO: lockdep is turned off.
[5.391007] irq event stamp: 13427044
[5.391007] hardirqs last  enabled at (13427043): [] 

[PATCH v2 2/2] btrfs: convert snapshot/nocow exclusion to drw lock

2019-07-19 Thread Nikolay Borisov
This patch removes all haphazard code implementing nocow writers
exclusion from pending snapshot creation and switches to using the drw
lock to ensure this invariant still holds. "Readers" are snapshot
creators from create_snapshot and 'writers' are nocow writers from
buffered write path or btrfs_setsize. This locking scheme allows for
multiple snapshots to happen while any nocow writers are blocked, since
writes to page cache in the nocow path will make snapshots inconsistent.

So for performance reasons we'd like to have the ability to run multiple
concurrent snapshots, and the lock also favors readers in this case. And in case
there aren't pending snapshots (which will be the majority of the cases)
we rely on the percpu's writers counter to avoid cacheline contention.

The main gain from using the drw is it's now a lot easier to reason about
the guarantees of the locking scheme and whether there is some silent
breakage lurking.

Signed-off-by: Nikolay Borisov 
---
 fs/btrfs/ctree.h   |  9 ++---
 fs/btrfs/disk-io.c | 46 ++
 fs/btrfs/extent-tree.c | 35 
 fs/btrfs/file.c| 11 +-
 fs/btrfs/inode.c   |  8 
 fs/btrfs/ioctl.c   | 10 +++--
 6 files changed, 25 insertions(+), 94 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index b7c9359b24a0..c04a23384210 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -1087,11 +1087,6 @@ static inline struct btrfs_fs_info *btrfs_sb(struct 
super_block *sb)
return sb->s_fs_info;
 }
 
-struct btrfs_subvolume_writers {
-   struct percpu_counter   counter;
-   wait_queue_head_t   wait;
-};
-
 /*
  * The state of btrfs root
  */
@@ -1263,8 +1258,8 @@ struct btrfs_root {
 * root_item_lock.
 */
int dedupe_in_progress;
-   struct btrfs_subvolume_writers *subv_writers;
-   atomic_t will_be_snapshotted;
+   struct btrfs_drw_lock snapshot_lock;
+
atomic_t snapshot_force_cow;
 
/* For qgroup metadata reserved space */
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index baebbb74032d..5a967d6264c9 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -1122,32 +1122,6 @@ void btrfs_clean_tree_block(struct extent_buffer *buf)
}
 }
 
-static struct btrfs_subvolume_writers *btrfs_alloc_subvolume_writers(void)
-{
-   struct btrfs_subvolume_writers *writers;
-   int ret;
-
-   writers = kmalloc(sizeof(*writers), GFP_NOFS);
-   if (!writers)
-   return ERR_PTR(-ENOMEM);
-
-   ret = percpu_counter_init(&writers->counter, 0, GFP_NOFS);
-   if (ret < 0) {
-   kfree(writers);
-   return ERR_PTR(ret);
-   }
-
-   init_waitqueue_head(&writers->wait);
-   return writers;
-}
-
-static void
-btrfs_free_subvolume_writers(struct btrfs_subvolume_writers *writers)
-{
-   percpu_counter_destroy(&writers->counter);
-   kfree(writers);
-}
-
 static void __setup_root(struct btrfs_root *root, struct btrfs_fs_info 
*fs_info,
 u64 objectid)
 {
@@ -1195,7 +1169,6 @@ static void __setup_root(struct btrfs_root *root, struct 
btrfs_fs_info *fs_info,
atomic_set(&root->log_writers, 0);
atomic_set(&root->log_batch, 0);
refcount_set(&root->refs, 1);
-   atomic_set(&root->will_be_snapshotted, 0);
atomic_set(&root->snapshot_force_cow, 0);
atomic_set(&root->nr_swapfiles, 0);
root->log_transid = 0;
@@ -1484,7 +1457,7 @@ struct btrfs_root *btrfs_read_fs_root(struct btrfs_root 
*tree_root,
 int btrfs_init_fs_root(struct btrfs_root *root)
 {
int ret;
-   struct btrfs_subvolume_writers *writers;
+   unsigned int nofs_flag;
 
root->free_ino_ctl = kzalloc(sizeof(*root->free_ino_ctl), GFP_NOFS);
root->free_ino_pinned = kzalloc(sizeof(*root->free_ino_pinned),
@@ -1494,12 +1467,16 @@ int btrfs_init_fs_root(struct btrfs_root *root)
goto fail;
}
 
-   writers = btrfs_alloc_subvolume_writers();
-   if (IS_ERR(writers)) {
-   ret = PTR_ERR(writers);
+   /*
+* We might be called under a transaction (e.g. indirect
+* backref resolution) which could deadlock if it triggers memory
+* reclaim
+*/
+   nofs_flag = memalloc_nofs_save();
+   ret = btrfs_drw_lock_init(&root->snapshot_lock);
+   memalloc_nofs_restore(nofs_flag);
+   if (ret)
goto fail;
-   }
-   root->subv_writers = writers;
 
btrfs_init_free_ino_ctl(root);
spin_lock_init(&root->ino_cache_lock);
@@ -3920,8 +3897,7 @@ void btrfs_free_fs_root(struct btrfs_root *root)
WARN_ON(!RB_EMPTY_ROOT(&root->inode_tree));
if (root->anon_dev)
free_anon_bdev(root->anon_dev);
-   if (root->subv_writers)
-   btrfs_free_subvolume_writers(root->subv_writers);
+   btrfs_drw_lock_destroy(&root->snapshot_lock);

[PATCH v2 0/2] Refactor snapshot vs nocow writers locking

2019-07-19 Thread Nikolay Borisov
Hello, 

Here is the second version of the DRW lock for btrfs. Main changes from v1: 

* Fixed all checkpatch warnings (Andrea Parri)
* Properly call write_unlock in btrfs_drw_try_write_lock (Filipe Manana)
* Comment fix. 
* Stress tested it via locktorture. Survived for 8 straight days on a 4 socket
48 thread machine.

I have also produced a PlusCal specification which I'd be happy to discuss with
people, since I'm new to formal specification and it seems it doesn't have the
right fidelity:

---- MODULE specs ----
EXTENDS Integers, Sequences, TLC

CONSTANT NumLockers

ASSUME NumLockers > 0

(*--algorithm DRW

variables
lock_state = "idle",
states = {"idle", "write_locked", "read_locked", "read_waiting", 
"write_waiting"},
threads = [thread \in 1..NumLockers |-> "idle" ] \* everyone is in idle 
state at first, this generates a tuple

define
INV_SingleLockerType  == \/ lock_state = "write_locked" /\ ~\E thread \in 
1..Len(threads): threads[thread] = "read_locked"
 \/ lock_state = "read_locked" /\ ~\E thread \in 
1..Len(threads): threads[thread] = "write_locked"
 \/ lock_state = "idle" /\ \A thread \in 
1..Len(threads): threads[thread] = "idle"
end define;

macro ReadLock(tid) begin
if lock_state = "idle" \/ lock_state = "read_locked" then
lock_state := "read_locked";
threads[tid] := "read_locked";
else
assert lock_state = "write_locked";
threads[tid] := "write_waiting"; \* waiting for writers to finish
        await lock_state = "idle" \/ lock_state = "read_locked";
end if;

end macro;

macro WriteLock(tid) begin
if lock_state = "idle" \/ lock_state = "write_locked" then
lock_state := "write_locked";
threads[tid] := "write_locked";
else
assert lock_state = "read_locked";
threads[tid] := "reader_waiting"; \* waiting for readers to finish
await lock_state = "idle" \/ lock_state = "write_locked";
end if;

end macro;

macro ReadUnlock(tid) begin
if threads[tid] = "read_locked" then
threads[tid] := "idle";
if ~\E thread \in 1..Len(threads): threads[thread] = "read_locked" then
lock_state := "idle"; \* we were the last read holder, everyone 
else should be waiting, unlock the lock
end if;
end if;

end macro;

macro WriteUnlock(tid) begin
if threads[tid] = "write_locked" then
threads[tid] := "idle";
if ~\E thread \in 1..Len(threads): threads[thread] = "write_locked" then
lock_state := "idle"; \* we were the last write holder, everyone 
else should be waiting, unlock the lock
end if;
end if;

end macro;

process lock \in 1..NumLockers

begin LOCKER:
while TRUE do
either
ReadLock(self);
or
WriteLock(self);
or
ReadUnlock(self);
or
WriteUnlock(self);
end either;
end while;
end process;

end algorithm; *)




Nikolay Borisov (2):
  btrfs: Implement DRW lock
  btrfs: convert snapshot/nocow exclusion to drw lock

 fs/btrfs/ctree.h   | 10 ++---
 fs/btrfs/disk-io.c | 46 ++
 fs/btrfs/extent-tree.c | 35 -
 fs/btrfs/file.c| 11 +++---
 fs/btrfs/inode.c   |  8 ++--
 fs/btrfs/ioctl.c   | 10 ++---
 fs/btrfs/locking.c | 88 ++
 fs/btrfs/locking.h | 20 ++
 8 files changed, 134 insertions(+), 94 deletions(-)

-- 
2.17.1



Re: [PATCH 09/12] xfs: refactor the ioend merging code

2019-06-25 Thread Nikolay Borisov



On 25.06.19 г. 13:14 ч., Christoph Hellwig wrote:
> On Mon, Jun 24, 2019 at 07:06:22PM +0300, Nikolay Borisov wrote:
>>> +{
>>> +   struct list_headtmp;
>>> +
>>> +   list_replace_init(&ioend->io_list, &tmp);
>>> +   xfs_destroy_ioend(ioend, error);
>>> +   while ((ioend = list_pop(&tmp, struct xfs_ioend, io_list)))
>>> +   xfs_destroy_ioend(ioend, error);
>>
>> nit: I'd prefer if the list_pop patch is right before this one since
>> this is the first user of it.
> 
> I try to keep generic infrastructure first instead of interveawing
> it with subystem-specific patches.
> 
>> Additionally, I don't think list_pop is
>> really a net-negative win 
> 
> What is a "net-negative win" ?

What I meant was 'net-positive win', in terms of making the code more
readable or optimised.

> 
>> in comparison to list_for_each_entry_safe
>> here. In fact this "delete the list" would seems more idiomatic if
>> implemented via list_for_each_entry_safe
> 
> I disagree.  The for_each loops require an additional next iterator,
> and also don't clearly express what is going on, but require additional
> spotting of the list_del.

That is of course your opinion. At the very least we can agree to disagree.

What I'm worried about, though, is that you've essentially introduced a
new idiom to dispose of lists, which is used only in your code. If it
doesn't become more widespread and gradually start replacing current
list_for_each_entry_safe usage then you will have increased the public
list interface to cater for one specific use case, just because it seems
more natural to you. I guess only time will show whether it makes sense
to have list_pop_entry.
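
For comparison, the two idioms side by side (just a sketch, not lifted
from the patch - list_pop is the helper this series introduces):

	/* the idiom the series introduces */
	while ((ioend = list_pop(&tmp, struct xfs_ioend, io_list)))
		xfs_destroy_ioend(ioend, error);

	/* the established equivalent */
	struct xfs_ioend *next;

	list_for_each_entry_safe(ioend, next, &tmp, io_list) {
		list_del_init(&ioend->io_list);
		xfs_destroy_ioend(ioend, error);
	}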



Re: [PATCH 09/12] xfs: refactor the ioend merging code

2019-06-24 Thread Nikolay Borisov



On 24.06.19 г. 8:52 ч., Christoph Hellwig wrote:
> Introduce two nicely abstracted helper, which can be moved to the
> iomap code later.  Also use list_pop and list_first_entry_or_null
> to simplify the code a bit.
> 
> Signed-off-by: Christoph Hellwig 
> ---
>  fs/xfs/xfs_aops.c | 66 ++-
>  1 file changed, 36 insertions(+), 30 deletions(-)
> 
> diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
> index acbd73976067..5d302ebe2a33 100644
> --- a/fs/xfs/xfs_aops.c
> +++ b/fs/xfs/xfs_aops.c
> @@ -121,6 +121,19 @@ xfs_destroy_ioend(
>   }
>  }
>  
> +static void
> +xfs_destroy_ioends(
> + struct xfs_ioend*ioend,
> + int error)
> +{
> + struct list_headtmp;
> +
> + list_replace_init(&ioend->io_list, &tmp);
> + xfs_destroy_ioend(ioend, error);
> + while ((ioend = list_pop(&tmp, struct xfs_ioend, io_list)))
> + xfs_destroy_ioend(ioend, error);

nit: I'd prefer if the list_pop patch is right before this one since
this is the first user of it. Additionally, I don't think list_pop is
really a net-negative win in comparison to list_for_each_entry_safe
here. In fact this "delete the list" would seems more idiomatic if
implemented via list_for_each_entry_safe

> +}
> +
>  /*
>   * Fast and loose check if this write could update the on-disk inode size.
>   */
> @@ -173,7 +186,6 @@ xfs_end_ioend(
>   struct xfs_ioend*ioend)
>  {
>   unsigned intnofs_flag = memalloc_nofs_save();
> - struct list_headioend_list;
>   struct xfs_inode*ip = XFS_I(ioend->io_inode);
>   xfs_off_t   offset = ioend->io_offset;
>   size_t  size = ioend->io_size;
> @@ -207,16 +219,7 @@ xfs_end_ioend(
>   if (!error && xfs_ioend_is_append(ioend))
>   error = xfs_setfilesize(ip, offset, size);
>  done:
> - list_replace_init(>io_list, _list);
> - xfs_destroy_ioend(ioend, error);
> -
> - while (!list_empty(_list)) {
> - ioend = list_first_entry(_list, struct xfs_ioend,
> - io_list);
> - list_del_init(>io_list);
> - xfs_destroy_ioend(ioend, error);
> - }
> -
> + xfs_destroy_ioends(ioend, error);
>   memalloc_nofs_restore(nofs_flag);
>  }
>  
> @@ -246,15 +249,16 @@ xfs_ioend_try_merge(
>   struct xfs_ioend*ioend,
>   struct list_head*more_ioends)
>  {
> - struct xfs_ioend*next_ioend;
> + struct xfs_ioend*next;
>  
> - while (!list_empty(more_ioends)) {
> - next_ioend = list_first_entry(more_ioends, struct xfs_ioend,
> - io_list);
> - if (!xfs_ioend_can_merge(ioend, next_ioend))
> + INIT_LIST_HEAD(>io_list);
> +
> + while ((next = list_first_entry_or_null(more_ioends, struct xfs_ioend,
> + io_list))) {
> + if (!xfs_ioend_can_merge(ioend, next))
>   break;
> - list_move_tail(_ioend->io_list, >io_list);
> - ioend->io_size += next_ioend->io_size;
> + list_move_tail(>io_list, >io_list);
> + ioend->io_size += next->io_size;
>   }
>  }
>  
> @@ -277,29 +281,31 @@ xfs_ioend_compare(
>   return 0;
>  }
>  
> +static void
> +xfs_sort_ioends(
> + struct list_head*ioend_list)
> +{
> + list_sort(NULL, ioend_list, xfs_ioend_compare);
> +}
> +
>  /* Finish all pending io completions. */
>  void
>  xfs_end_io(
>   struct work_struct  *work)
>  {
> - struct xfs_inode*ip;
> + struct xfs_inode*ip =
> + container_of(work, struct xfs_inode, i_ioend_work);
>   struct xfs_ioend*ioend;
> - struct list_headcompletion_list;
> + struct list_headtmp;
>   unsigned long   flags;
>  
> - ip = container_of(work, struct xfs_inode, i_ioend_work);
> -
>   spin_lock_irqsave(>i_ioend_lock, flags);
> - list_replace_init(>i_ioend_list, _list);
> + list_replace_init(>i_ioend_list, );
>   spin_unlock_irqrestore(>i_ioend_lock, flags);
>  
> - list_sort(NULL, _list, xfs_ioend_compare);
> -
> - while (!list_empty(_list)) {
> - ioend = list_first_entry(_list, struct xfs_ioend,
> - io_list);
> - list_del_init(>io_list);
> - xfs_ioend_try_merge(ioend, _list);
> + xfs_sort_ioends();
> + while ((ioend = list_pop(, struct xfs_ioend, io_list))) {
> + xfs_ioend_try_merge(ioend, );
>   xfs_end_ioend(ioend);

Here again, tmp is a local copy that is immutable so using while()
instead of list_for_each_entry_safe doesn't seem to be providing much.

>   }
>  }
> 


[tip:locking/core] locking/lockdep: Rename lockdep_assert_held_exclusive() -> lockdep_assert_held_write()

2019-06-17 Thread tip-bot for Nikolay Borisov
Commit-ID:  9ffbe8ac05dbb4ab4a4836a55a47fc6be945a38f
Gitweb: https://git.kernel.org/tip/9ffbe8ac05dbb4ab4a4836a55a47fc6be945a38f
Author: Nikolay Borisov 
AuthorDate: Fri, 31 May 2019 13:06:51 +0300
Committer:  Ingo Molnar 
CommitDate: Mon, 17 Jun 2019 12:09:24 +0200

locking/lockdep: Rename lockdep_assert_held_exclusive() -> 
lockdep_assert_held_write()

All callers of lockdep_assert_held_exclusive() use it to verify the
correct locking state of either a semaphore (ldisc_sem in tty,
mmap_sem for perf events, i_rwsem of inode for dax) or rwlock by
apparmor. Thus it makes sense to rename _exclusive to _write since
that's the semantics callers care about. Additionally there is already
lockdep_assert_held_read(), which this new naming is more consistent with.

No functional changes.

Signed-off-by: Nikolay Borisov 
Signed-off-by: Peter Zijlstra (Intel) 
Cc: Linus Torvalds 
Cc: Peter Zijlstra 
Cc: Thomas Gleixner 
Link: https://lkml.kernel.org/r/20190531100651.3969-1-nbori...@suse.com
Signed-off-by: Ingo Molnar 
---
 arch/x86/events/core.c   | 2 +-
 drivers/infiniband/core/device.c | 2 +-
 drivers/tty/tty_ldisc.c  | 8 
 fs/dax.c | 2 +-
 include/linux/lockdep.h  | 4 ++--
 security/apparmor/label.c| 8 
 6 files changed, 13 insertions(+), 13 deletions(-)

diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
index f315425d8468..cf91d80b8452 100644
--- a/arch/x86/events/core.c
+++ b/arch/x86/events/core.c
@@ -2179,7 +2179,7 @@ static void x86_pmu_event_mapped(struct perf_event 
*event, struct mm_struct *mm)
 * For now, this can't happen because all callers hold mmap_sem
 * for write.  If this changes, we'll need a different solution.
 */
-   lockdep_assert_held_exclusive(>mmap_sem);
+   lockdep_assert_held_write(>mmap_sem);
 
if (atomic_inc_return(>context.perf_rdpmc_allowed) == 1)
on_each_cpu_mask(mm_cpumask(mm), refresh_pce, NULL, 1);
diff --git a/drivers/infiniband/core/device.c b/drivers/infiniband/core/device.c
index 29f7b15c81d9..d020bb4d03d5 100644
--- a/drivers/infiniband/core/device.c
+++ b/drivers/infiniband/core/device.c
@@ -457,7 +457,7 @@ static int alloc_name(struct ib_device *ibdev, const char 
*name)
int rc;
int i;
 
-   lockdep_assert_held_exclusive(_rwsem);
+   lockdep_assert_held_write(_rwsem);
ida_init();
xa_for_each (, index, device) {
char buf[IB_DEVICE_NAME_MAX];
diff --git a/drivers/tty/tty_ldisc.c b/drivers/tty/tty_ldisc.c
index e38f104db174..fde8d4073e74 100644
--- a/drivers/tty/tty_ldisc.c
+++ b/drivers/tty/tty_ldisc.c
@@ -487,7 +487,7 @@ static int tty_ldisc_open(struct tty_struct *tty, struct 
tty_ldisc *ld)
 
 static void tty_ldisc_close(struct tty_struct *tty, struct tty_ldisc *ld)
 {
-   lockdep_assert_held_exclusive(>ldisc_sem);
+   lockdep_assert_held_write(>ldisc_sem);
WARN_ON(!test_bit(TTY_LDISC_OPEN, >flags));
clear_bit(TTY_LDISC_OPEN, >flags);
if (ld->ops->close)
@@ -509,7 +509,7 @@ static int tty_ldisc_failto(struct tty_struct *tty, int ld)
struct tty_ldisc *disc = tty_ldisc_get(tty, ld);
int r;
 
-   lockdep_assert_held_exclusive(>ldisc_sem);
+   lockdep_assert_held_write(>ldisc_sem);
if (IS_ERR(disc))
return PTR_ERR(disc);
tty->ldisc = disc;
@@ -633,7 +633,7 @@ EXPORT_SYMBOL_GPL(tty_set_ldisc);
  */
 static void tty_ldisc_kill(struct tty_struct *tty)
 {
-   lockdep_assert_held_exclusive(>ldisc_sem);
+   lockdep_assert_held_write(>ldisc_sem);
if (!tty->ldisc)
return;
/*
@@ -681,7 +681,7 @@ int tty_ldisc_reinit(struct tty_struct *tty, int disc)
struct tty_ldisc *ld;
int retval;
 
-   lockdep_assert_held_exclusive(>ldisc_sem);
+   lockdep_assert_held_write(>ldisc_sem);
ld = tty_ldisc_get(tty, disc);
if (IS_ERR(ld)) {
BUG_ON(disc == N_TTY);
diff --git a/fs/dax.c b/fs/dax.c
index 2e48c7ebb973..bf8686d48b2d 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -1188,7 +1188,7 @@ dax_iomap_rw(struct kiocb *iocb, struct iov_iter *iter,
unsigned flags = 0;
 
if (iov_iter_rw(iter) == WRITE) {
-   lockdep_assert_held_exclusive(>i_rwsem);
+   lockdep_assert_held_write(>i_rwsem);
flags |= IOMAP_WRITE;
} else {
lockdep_assert_held(>i_rwsem);
diff --git a/include/linux/lockdep.h b/include/linux/lockdep.h
index 30a0f81aa130..151d55711082 100644
--- a/include/linux/lockdep.h
+++ b/include/linux/lockdep.h
@@ -394,7 +394,7 @@ extern void lock_unpin_lock(struct lockdep_map *lock, 
struct pin_cookie);
WARN_ON(debug_locks && !lockdep_is_held(l));\
} while (0)
 
-#define lockdep_assert_held_exclusive(l)   do {\
+#define lockdep_assert_held_w

[PATCH 2/2] btrfs: convert snapshot/nocow exclusion to drw lock

2019-06-06 Thread Nikolay Borisov
This patch removes all haphazard code implementing nocow writers
exclusion from pending snapshot creation and switches to using the drw
lock to ensure this invariant still holds. "Readers" are snapshot
creators from create_snapshot and 'writers' are nocow writers from
buffered write path or btrfs_setsize. This locking scheme allows for
multiple snapshots to happen while any nocow writers are blocked, since
writes to page cache in the nocow path will make snapshots inconsistent.

So for performance reasons we'd like to have the ability to run multiple
concurrent snapshots, and the lock also favors readers in this case. And in case
there aren't pending snapshots (which will be the majority of the cases)
we rely on the percpu's writers counter to avoid cacheline contention.

The main gain from using the drw is it's now a lot easier to reason about
the guarantees of the locking scheme and whether there is some silent
breakage lurking.

Signed-off-by: Nikolay Borisov 
---
 fs/btrfs/ctree.h   | 10 +++---
 fs/btrfs/disk-io.c | 39 +++
 fs/btrfs/extent-tree.c | 35 ---
 fs/btrfs/file.c| 12 ++--
 fs/btrfs/inode.c   |  8 
 fs/btrfs/ioctl.c   | 10 +++---
 6 files changed, 19 insertions(+), 95 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index a66ed58058d9..fa8a2e15c77c 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -32,6 +32,7 @@
 #include "extent_io.h"
 #include "extent_map.h"
 #include "async-thread.h"
+#include "drw_lock.h"
 
 struct btrfs_trans_handle;
 struct btrfs_transaction;
@@ -1174,11 +1175,6 @@ static inline struct btrfs_fs_info *btrfs_sb(struct 
super_block *sb)
return sb->s_fs_info;
 }
 
-struct btrfs_subvolume_writers {
-   struct percpu_counter   counter;
-   wait_queue_head_t   wait;
-};
-
 /*
  * The state of btrfs root
  */
@@ -1350,8 +1346,8 @@ struct btrfs_root {
 * root_item_lock.
 */
int dedupe_in_progress;
-   struct btrfs_subvolume_writers *subv_writers;
-   atomic_t will_be_snapshotted;
+   struct btrfs_drw_lock snapshot_lock;
+
atomic_t snapshot_force_cow;
 
/* For qgroup metadata reserved space */
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 05f215b4d060..ece45e606846 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -1125,32 +1125,6 @@ void btrfs_clean_tree_block(struct extent_buffer *buf)
}
 }
 
-static struct btrfs_subvolume_writers *btrfs_alloc_subvolume_writers(void)
-{
-   struct btrfs_subvolume_writers *writers;
-   int ret;
-
-   writers = kmalloc(sizeof(*writers), GFP_NOFS);
-   if (!writers)
-   return ERR_PTR(-ENOMEM);
-
-   ret = percpu_counter_init(&writers->counter, 0, GFP_NOFS);
-   if (ret < 0) {
-   kfree(writers);
-   return ERR_PTR(ret);
-   }
-
-   init_waitqueue_head(&writers->wait);
-   return writers;
-}
-
-static void
-btrfs_free_subvolume_writers(struct btrfs_subvolume_writers *writers)
-{
-   percpu_counter_destroy(&writers->counter);
-   kfree(writers);
-}
-
 static void __setup_root(struct btrfs_root *root, struct btrfs_fs_info 
*fs_info,
 u64 objectid)
 {
@@ -1198,7 +1172,6 @@ static void __setup_root(struct btrfs_root *root, struct 
btrfs_fs_info *fs_info,
atomic_set(&root->log_writers, 0);
atomic_set(&root->log_batch, 0);
refcount_set(&root->refs, 1);
-   atomic_set(&root->will_be_snapshotted, 0);
atomic_set(&root->snapshot_force_cow, 0);
atomic_set(&root->nr_swapfiles, 0);
root->log_transid = 0;
@@ -1487,7 +1460,6 @@ struct btrfs_root *btrfs_read_fs_root(struct btrfs_root 
*tree_root,
 int btrfs_init_fs_root(struct btrfs_root *root)
 {
int ret;
-   struct btrfs_subvolume_writers *writers;
 
root->free_ino_ctl = kzalloc(sizeof(*root->free_ino_ctl), GFP_NOFS);
root->free_ino_pinned = kzalloc(sizeof(*root->free_ino_pinned),
@@ -1497,12 +1469,8 @@ int btrfs_init_fs_root(struct btrfs_root *root)
goto fail;
}
 
-   writers = btrfs_alloc_subvolume_writers();
-   if (IS_ERR(writers)) {
-   ret = PTR_ERR(writers);
-   goto fail;
-   }
-   root->subv_writers = writers;
+
+   btrfs_drw_lock_init(&root->snapshot_lock);
 
btrfs_init_free_ino_ctl(root);
spin_lock_init(&root->ino_cache_lock);
@@ -3873,8 +3841,7 @@ void btrfs_free_fs_root(struct btrfs_root *root)
WARN_ON(!RB_EMPTY_ROOT(&root->inode_tree));
if (root->anon_dev)
free_anon_bdev(root->anon_dev);
-   if (root->subv_writers)
-   btrfs_free_subvolume_writers(root->subv_writers);
+   btrfs_drw_lock_destroy(&root->snapshot_lock);
free_extent_buffer(root->node);
free_extent_buffer(root->commit_root);
kfree

[PATCH 1/2] btrfs: Implement DRW lock

2019-06-06 Thread Nikolay Borisov
A (D)ouble (R)eader (W)riter lock is a locking primitive that allows
to have multiple readers or multiple writers but not multiple readers
and writers holding it concurrently. The code is factored out from
the existing open-coded locking scheme used to exclude pending
snapshots from nocow writers and vice-versa. The current implementation
actually favors readers (that is, snapshot creators) over writers (nocow
writers of the filesystem).

Signed-off-by: Nikolay Borisov 
---
 fs/btrfs/Makefile   |  2 +-
 fs/btrfs/drw_lock.c | 71 +
 fs/btrfs/drw_lock.h | 23 +++
 3 files changed, 95 insertions(+), 1 deletion(-)
 create mode 100644 fs/btrfs/drw_lock.c
 create mode 100644 fs/btrfs/drw_lock.h

diff --git a/fs/btrfs/Makefile b/fs/btrfs/Makefile
index ca693dd554e9..dc60127791e6 100644
--- a/fs/btrfs/Makefile
+++ b/fs/btrfs/Makefile
@@ -10,7 +10,7 @@ btrfs-y += super.o ctree.o extent-tree.o print-tree.o 
root-tree.o dir-item.o \
   export.o tree-log.o free-space-cache.o zlib.o lzo.o zstd.o \
   compression.o delayed-ref.o relocation.o delayed-inode.o scrub.o \
   reada.o backref.o ulist.o qgroup.o send.o dev-replace.o raid56.o \
-  uuid-tree.o props.o free-space-tree.o tree-checker.o
+  uuid-tree.o props.o free-space-tree.o tree-checker.o drw_lock.o
 
 btrfs-$(CONFIG_BTRFS_FS_POSIX_ACL) += acl.o
 btrfs-$(CONFIG_BTRFS_FS_CHECK_INTEGRITY) += check-integrity.o
diff --git a/fs/btrfs/drw_lock.c b/fs/btrfs/drw_lock.c
new file mode 100644
index ..9681bf7544be
--- /dev/null
+++ b/fs/btrfs/drw_lock.c
@@ -0,0 +1,71 @@
+#include "drw_lock.h"
+#include "ctree.h"
+
+void btrfs_drw_lock_init(struct btrfs_drw_lock *lock)
+{
+   atomic_set(&lock->readers, 0);
+   percpu_counter_init(&lock->writers, 0, GFP_KERNEL);
+   init_waitqueue_head(&lock->pending_readers);
+   init_waitqueue_head(&lock->pending_writers);
+}
+
+void btrfs_drw_lock_destroy(struct btrfs_drw_lock *lock)
+{
+   percpu_counter_destroy(&lock->writers);
+}
+
+bool btrfs_drw_try_write_lock(struct btrfs_drw_lock *lock)
+{
+   if (atomic_read(&lock->readers))
+   return false;
+
+   percpu_counter_inc(&lock->writers);
+
+   /*
+* Ensure writers count is updated before we check for
+* pending readers
+*/
+   smp_mb();
+   if (atomic_read(&lock->readers)) {
+   btrfs_drw_read_unlock(lock);
+   return false;
+   }
+
+   return true;
+}
+
+void btrfs_drw_write_lock(struct btrfs_drw_lock *lock)
+{
+   while(true) {
+   if (btrfs_drw_try_write_lock(lock))
+   return;
+   wait_event(lock->pending_writers, !atomic_read(&lock->readers));
+   }
+}
+
+void btrfs_drw_write_unlock(struct btrfs_drw_lock *lock)
+{
+   percpu_counter_dec(&lock->writers);
+   cond_wake_up(&lock->pending_readers);
+}
+
+void btrfs_drw_read_lock(struct btrfs_drw_lock *lock)
+{
+   atomic_inc(&lock->readers);
+   smp_mb__after_atomic();
+
+   wait_event(lock->pending_readers,
+  percpu_counter_sum(&lock->writers) == 0);
+}
+
+void btrfs_drw_read_unlock(struct btrfs_drw_lock *lock)
+{
+   /*
+* Atomic RMW operations imply full barrier, so woken up writers
+* are guaranteed to see the decrement
+*/
+   if (atomic_dec_and_test(&lock->readers))
+   wake_up(&lock->pending_writers);
+}
+
+
diff --git a/fs/btrfs/drw_lock.h b/fs/btrfs/drw_lock.h
new file mode 100644
index ..baff59561c06
--- /dev/null
+++ b/fs/btrfs/drw_lock.h
@@ -0,0 +1,23 @@
+#ifndef BTRFS_DRW_LOCK_H
+#define BTRFS_DRW_LOCK_H
+
+#include 
+#include 
+#include 
+
+struct btrfs_drw_lock {
+   atomic_t readers;
+   struct percpu_counter writers;
+   wait_queue_head_t pending_writers;
+   wait_queue_head_t pending_readers;
+};
+
+void btrfs_drw_lock_init(struct btrfs_drw_lock *lock);
+void btrfs_drw_lock_destroy(struct btrfs_drw_lock *lock);
+void btrfs_drw_write_lock(struct btrfs_drw_lock *lock);
+bool btrfs_drw_try_write_lock(struct btrfs_drw_lock *lock);
+void btrfs_drw_write_unlock(struct btrfs_drw_lock *lock);
+void btrfs_drw_read_lock(struct btrfs_drw_lock *lock);
+void btrfs_drw_read_unlock(struct btrfs_drw_lock *lock);
+
+#endif
-- 
2.17.1
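
The intended usage on top of this primitive is roughly the following
(the helper names are made up for illustration, and root->snapshot_lock
is the member patch 2/2 adds); nocow writers take the "write" side,
snapshot creation takes the "read" side:

	/* nocow buffered write / btrfs_setsize paths */
	static bool start_nocow_write(struct btrfs_root *root)
	{
		/* fails instead of blocking if a snapshot is pending */
		return btrfs_drw_try_write_lock(&root->snapshot_lock);
	}

	static void end_nocow_write(struct btrfs_root *root)
	{
		btrfs_drw_write_unlock(&root->snapshot_lock);
	}

	/* snapshot creation: waits for all in-flight nocow writers */
	static void snapshot_begin(struct btrfs_root *root)
	{
		btrfs_drw_read_lock(&root->snapshot_lock);
	}

	static void snapshot_end(struct btrfs_root *root)
	{
		btrfs_drw_read_unlock(&root->snapshot_lock);
	}

Multiple snapshot creators can hold the read side at the same time,
which is the point of not using a plain rwsem here.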



[PATCH] lockdep: Rename lockdep_assert_held_exclusive -> lockdep_assert_held_write

2019-05-31 Thread Nikolay Borisov
All callers of lockdep_assert_held_exclusive use it to verify the
correct locking state of either a semaphore (ldisc_sem in tty,
mmap_sem for perf events, i_rwsem of inode for dax) or rwlock by
apparmor. Thus it makes sense to rename _exclusive to _write since
that's the semantics callers care about. Additionally there is already
lockdep_assert_held_read, so the current name brings asymmetry to the naming.

No functional changes.

Signed-off-by: Nikolay Borisov 
---
 arch/x86/events/core.c   | 2 +-
 drivers/infiniband/core/device.c | 2 +-
 drivers/tty/tty_ldisc.c  | 8 
 fs/dax.c | 2 +-
 include/linux/lockdep.h  | 4 ++--
 security/apparmor/label.c| 8 
 6 files changed, 13 insertions(+), 13 deletions(-)

diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
index f315425d8468..cf91d80b8452 100644
--- a/arch/x86/events/core.c
+++ b/arch/x86/events/core.c
@@ -2179,7 +2179,7 @@ static void x86_pmu_event_mapped(struct perf_event 
*event, struct mm_struct *mm)
 * For now, this can't happen because all callers hold mmap_sem
 * for write.  If this changes, we'll need a different solution.
 */
-   lockdep_assert_held_exclusive(>mmap_sem);
+   lockdep_assert_held_write(>mmap_sem);
 
if (atomic_inc_return(>context.perf_rdpmc_allowed) == 1)
on_each_cpu_mask(mm_cpumask(mm), refresh_pce, NULL, 1);
diff --git a/drivers/infiniband/core/device.c b/drivers/infiniband/core/device.c
index 78dc07c6ac4b..bbf72f995d1b 100644
--- a/drivers/infiniband/core/device.c
+++ b/drivers/infiniband/core/device.c
@@ -440,7 +440,7 @@ static int alloc_name(struct ib_device *ibdev, const char 
*name)
int rc;
int i;
 
-   lockdep_assert_held_exclusive(_rwsem);
+   lockdep_assert_held_write(_rwsem);
ida_init();
xa_for_each (, index, device) {
char buf[IB_DEVICE_NAME_MAX];
diff --git a/drivers/tty/tty_ldisc.c b/drivers/tty/tty_ldisc.c
index e38f104db174..fde8d4073e74 100644
--- a/drivers/tty/tty_ldisc.c
+++ b/drivers/tty/tty_ldisc.c
@@ -487,7 +487,7 @@ static int tty_ldisc_open(struct tty_struct *tty, struct 
tty_ldisc *ld)
 
 static void tty_ldisc_close(struct tty_struct *tty, struct tty_ldisc *ld)
 {
-   lockdep_assert_held_exclusive(>ldisc_sem);
+   lockdep_assert_held_write(>ldisc_sem);
WARN_ON(!test_bit(TTY_LDISC_OPEN, >flags));
clear_bit(TTY_LDISC_OPEN, >flags);
if (ld->ops->close)
@@ -509,7 +509,7 @@ static int tty_ldisc_failto(struct tty_struct *tty, int ld)
struct tty_ldisc *disc = tty_ldisc_get(tty, ld);
int r;
 
-   lockdep_assert_held_exclusive(>ldisc_sem);
+   lockdep_assert_held_write(>ldisc_sem);
if (IS_ERR(disc))
return PTR_ERR(disc);
tty->ldisc = disc;
@@ -633,7 +633,7 @@ EXPORT_SYMBOL_GPL(tty_set_ldisc);
  */
 static void tty_ldisc_kill(struct tty_struct *tty)
 {
-   lockdep_assert_held_exclusive(>ldisc_sem);
+   lockdep_assert_held_write(>ldisc_sem);
if (!tty->ldisc)
return;
/*
@@ -681,7 +681,7 @@ int tty_ldisc_reinit(struct tty_struct *tty, int disc)
struct tty_ldisc *ld;
int retval;
 
-   lockdep_assert_held_exclusive(>ldisc_sem);
+   lockdep_assert_held_write(>ldisc_sem);
ld = tty_ldisc_get(tty, disc);
if (IS_ERR(ld)) {
BUG_ON(disc == N_TTY);
diff --git a/fs/dax.c b/fs/dax.c
index f74386293632..1224275f79d7 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -1196,7 +1196,7 @@ dax_iomap_rw(struct kiocb *iocb, struct iov_iter *iter,
unsigned flags = 0;
 
if (iov_iter_rw(iter) == WRITE) {
-   lockdep_assert_held_exclusive(>i_rwsem);
+   lockdep_assert_held_write(>i_rwsem);
flags |= IOMAP_WRITE;
} else {
lockdep_assert_held(>i_rwsem);
diff --git a/include/linux/lockdep.h b/include/linux/lockdep.h
index 6e2377e6c1d6..d6d6a857fe52 100644
--- a/include/linux/lockdep.h
+++ b/include/linux/lockdep.h
@@ -385,7 +385,7 @@ extern void lock_unpin_lock(struct lockdep_map *lock, 
struct pin_cookie);
WARN_ON(debug_locks && !lockdep_is_held(l));\
} while (0)
 
-#define lockdep_assert_held_exclusive(l)   do {\
+#define lockdep_assert_held_write(l)   do {\
WARN_ON(debug_locks && !lockdep_is_held_type(l, 0));\
} while (0)
 
@@ -466,7 +466,7 @@ struct lockdep_map { };
 #define lockdep_is_held_type(l, r) (1)
 
 #define lockdep_assert_held(l) do { (void)(l); } while (0)
-#define lockdep_assert_held_exclusive(l)   do { (void)(l); } while (0)
+#define lockdep_assert_held_write(l)   do { (void)(l); } while (0)
 #define lockdep_assert_held_read(l)do { (void)(l); } while (0)
 #define lockdep_assert_held_once(l)  

Re: [PATCH] riscv: fix locking violation in page fault handler

2019-05-07 Thread Nikolay Borisov



On 7.05.19 г. 17:12 ч., Andreas Schwab wrote:
> On Mai 07 2019, Nikolay Borisov  wrote:
> 
>> At the very least the code under
>> no_context label could go into it's own function since it just kills the
>> process and never returns?
> 
> This is not true.

Can you be more specific? According to do_task_dead, after the final
__schedule call the calling context is gone.
> 
> Andreas.
> 


Re: [PATCH] riscv: fix locking violation in page fault handler

2019-05-07 Thread Nikolay Borisov



On 7.05.19 г. 10:36 ч., Andreas Schwab wrote:
> When a user mode process accesses an address in the vmalloc area
> do_page_fault tries to unlock the mmap semaphore when it isn't locked.
> 
> Signed-off-by: Andreas Schwab 
> ---
>  arch/riscv/mm/fault.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/riscv/mm/fault.c b/arch/riscv/mm/fault.c
> index 88401d5125bc..c51878e5a66a 100644
> --- a/arch/riscv/mm/fault.c
> +++ b/arch/riscv/mm/fault.c
> @@ -181,6 +181,7 @@ asmlinkage void do_page_fault(struct pt_regs *regs)
>   up_read(>mmap_sem);
>   /* User mode accesses just cause a SIGSEGV */
>   if (user_mode(regs)) {
> +bad_area_do_trap:
>   do_trap(regs, SIGSEGV, code, addr, tsk);
>   return;
>   }
> @@ -230,7 +231,7 @@ asmlinkage void do_page_fault(struct pt_regs *regs)
>   int index;
>  
>   if (user_mode(regs))
> - goto bad_area;
> + goto bad_area_do_trap;
>  
>   /*
>* Synchronize this task's top level page-table
> 

In this case I think it will be a lot cleaner if you just duplicated the
do_trap call. On a slightly different note - is there any reason why
do_page_fault is such a spaghetti mess? At the very least the code under
the no_context label could go into its own function, since it just kills
the process and never returns. Furthermore the whole vmalloc_fault path
just cries out to be factored into its own function; it's already
explicitly in its own block.
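
Concretely, for the first point, the vmalloc_fault branch could do
something like this (untested) instead of jumping back into the bad_area
unlock path:

	if (user_mode(regs)) {
		/*
		 * mmap_sem was never taken on this branch, so there is
		 * nothing to unlock - just deliver the SIGSEGV directly.
		 */
		do_trap(regs, SIGSEGV, code, addr, tsk);
		return;
	}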


Re: [V2] btrfs: drop inode reference count on error path

2019-04-18 Thread Nikolay Borisov



On 18.04.19 г. 17:07 ч., Josef Bacik wrote:
> On Thu, Apr 18, 2019 at 03:50:00PM +0300, Nikolay Borisov wrote:
>>
>>
>> On 18.04.19 г. 14:06 ч., Pan Bian wrote:
>>> The reference count of inode is incremented by ihold. It should be
>>> dropped if not used. However, the reference count is not dropped if
>>> error occurs during updating the inode or deleting orphan items. This
>>> patch fixes the bug.
>>>
>>> Signed-off-by: Pan Bian 
>>> ---
>>> V2: move ihold just before device_initialize to make code clearer
>>> ---
>>>  fs/btrfs/inode.c | 54 
>>> +-
>>>  1 file changed, 25 insertions(+), 29 deletions(-)
>>>
>>> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
>>> index 82fdda8..d6630df 100644
>>> --- a/fs/btrfs/inode.c
>>> +++ b/fs/btrfs/inode.c
>>> @@ -6579,7 +6579,7 @@ static int btrfs_link(struct dentry *old_dentry, 
>>> struct inode *dir,
>>> struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
>>> u64 index;
>>> int err;
>>> -   int drop_inode = 0;
>>> +   int log_mode;
>>>  
>>> /* do not allow sys_link's with other subvols of the same device */
>>> if (root->root_key.objectid != BTRFS_I(inode)->root->root_key.objectid)
>>> @@ -6616,41 +6616,37 @@ static int btrfs_link(struct dentry *old_dentry, 
>>> struct inode *dir,
>>> err = btrfs_add_nondir(trans, BTRFS_I(dir), dentry, BTRFS_I(inode),
>>> 1, index);
>>>  
>>> -   if (err) {
>>> -   drop_inode = 1;
>>> -   } else {
>>> -   struct dentry *parent = dentry->d_parent;
>>> -   int ret;
>>> +   if (err)
>>> +   goto err_link;
>>>  
>>> -   err = btrfs_update_inode(trans, root, inode);
>>> +   err = btrfs_update_inode(trans, root, inode);
>>> +   if (err)
>>> +   goto err_link;
>>> +   if (inode->i_nlink == 1) {
>>> +   /*
>>> +* If new hard link count is 1, it's a file created
>>> +* with open(2) O_TMPFILE flag.
>>> +*/
>>> +   err = btrfs_orphan_del(trans, BTRFS_I(inode));
>>> if (err)
>>> -   goto fail;
>>> -   if (inode->i_nlink == 1) {
>>> -   /*
>>> -* If new hard link count is 1, it's a file created
>>> -* with open(2) O_TMPFILE flag.
>>> -*/
>>> -   err = btrfs_orphan_del(trans, BTRFS_I(inode));
>>> -   if (err)
>>> -   goto fail;
>>> -   }
>>> -   BTRFS_I(inode)->last_link_trans = trans->transid;
>>> -   d_instantiate(dentry, inode);
>>> -   ret = btrfs_log_new_name(trans, BTRFS_I(inode), NULL, parent,
>>> -true, NULL);
>>> -   if (ret == BTRFS_NEED_TRANS_COMMIT) {
>>> -   err = btrfs_commit_transaction(trans);
>>> -   trans = NULL;
>>> -   }
>>> +   goto err_link;
>>> +   }
>>> +   BTRFS_I(inode)->last_link_trans = trans->transid;
>>> +   ihold(inode);
> 
> Where is the iput for this ihold?

This ihold is sort of "given" to d_instantiate(), i.e. the iput happens
when the respective dentry is unhashed/removed.
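
In other words the pairing is roughly (sketch):

	ihold(inode);                   /* reference handed over to the dentry */
	d_instantiate(dentry, inode);   /* dropped via iput() when the dentry
	                                 * is eventually unhashed and killed */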

> 
> Josef
> 


Re: [PATCH] fs/open: Fix most outstanding security bugs

2019-04-01 Thread Nikolay Borisov



On 1.04.19 г. 12:01 ч., Johannes Thumshirn wrote:
> Over the last 20 years, the Linux kernel has accumulated hundreds if not
> thousands of security vulnerabilities.
> 
> One common pattern in most of these security related reports is processes
> called "syzkaller", "trinity" or "syz-executor" opening files and then
> abuse kernel interfaces causing kernel crashes or even worse threats using
> memory overwrites or by exploiting race conditions.
> 
> Hunting down these bugs has become time consuming and very expensive, so
> I've decided to put an end to it.
> 
> If one of the above mentioned processes tries opening a file, return -EPERM
> indicating this process does not have the permission to open files on Linux
> anymore.
> 
> Signed-off-by: Johannes Thumshirn 

Acked-by: Nikolay Borisov 

> ---
>  fs/open.c | 14 ++
>  1 file changed, 14 insertions(+)
> 
> diff --git a/fs/open.c b/fs/open.c
> index f1c2f855fd43..3a3b460beccd 100644
> --- a/fs/open.c
> +++ b/fs/open.c
> @@ -1056,6 +1056,20 @@ long do_sys_open(int dfd, const char __user *filename, 
> int flags, umode_t mode)
>   struct open_flags op;
>   int fd = build_open_flags(flags, mode, &op);
>   struct filename *tmp;
> + char comm[TASK_COMM_LEN];
> + int i;
> + static const char * const list[] = {
> + "syzkaller",
> + "syz-executor,"
> + "trinity",
> + NULL
> + };
> +
> + get_task_comm(comm, current);
> +
> + for (i = 0; i < ARRAY_SIZE(list); i++)
> + if (!strncmp(comm, list[i], strlen(list[i])))
> + return -EPERM;
>  
>   if (fd)
>   return fd;
> 


Re: [btrfs] 70d28b0e4f: BUG:kernel_reboot-without-warning_in_early-boot_stage,last_printk:Probing_EDD(edd=off_to_disable)...ok

2019-04-01 Thread Nikolay Borisov


On 1.04.19 г. 16:24 ч., kernel test robot wrote:
> FYI, we noticed the following commit (built with gcc-7):
> 
> commit: 70d28b0e4f8ed2d38571e7b1f9bec7f321a53102 ("btrfs: tree-checker: 
> Verify dev item")
> https://git.kernel.org/cgit/linux/kernel/git/next/linux-next.git master
> 
> in testcase: trinity
> with following parameters:
> 
>   runtime: 300s
> 
> test-description: Trinity is a linux system call fuzz tester.
> test-url: http://codemonkey.org.uk/projects/trinity/
> 
> 
> on test machine: qemu-system-x86_64 -enable-kvm -cpu SandyBridge -smp 2 -m 2G
> 
> caused below changes (please refer to attached dmesg/kmsg for entire 
> log/backtrace):
> 
> 
> +--------------------------------------------------------+------------+------------+
> |                                                        | 36b9d2bc69 | 70d28b0e4f |
> +--------------------------------------------------------+------------+------------+
> | boot_successes                                         | 14         | 0          |
> | boot_failures                                          | 2          | 14         |
> | IP-Config:Auto-configuration_of_network_failed         | 2          |            |
> | BUG:kernel_reboot-without-warning_in_early-boot_stage, | 0          | 14         |
> |   last_printk:Probing_EDD(edd=off_to_disable)...ok     |            |            |
> +--------------------------------------------------------+------------+------------+
> 
> 
> 
> early console in setup code
> Probing EDD (edd=off to disable)... ok
> BUG: kernel reboot-without-warning in early-boot stage, last printk: Probing 
> EDD (edd=off to disable)... ok
> Linux version 5.0.0-rc8-00196-g70d28b0 #1
> Command line: ip=vm-snb-quantal-x86_64-1415::dhcp root=/dev/ram0 user=lkp 
> job=/lkp/jobs/scheduled/vm-snb-quantal-x86_64-1415/trinity-300s-quantal-core-x86_64-2018-11-09.cgz-70d28b0-20190330-29362-1y6g0qb-2.yaml
>  ARCH=x86_64 kconfig=x86_64-randconfig-s5-03231928 
> branch=linux-devel/devel-hourly-2019032317 
> commit=70d28b0e4f8ed2d38571e7b1f9bec7f321a53102 
> BOOT_IMAGE=/pkg/linux/x86_64-randconfig-s5-03231928/gcc-7/70d28b0e4f8ed2d38571e7b1f9bec7f321a53102/vmlinuz-5.0.0-rc8-00196-g70d28b0
>  max_uptime=1500 
> RESULT_ROOT=/result/trinity/300s/vm-snb-quantal-x86_64/quantal-core-x86_64-2018-11-09.cgz/x86_64-randconfig-s5-03231928/gcc-7/70d28b0e4f8ed2d38571e7b1f9bec7f321a53102/8
>  LKP_SERVER=inn debug apic=debug sysrq_always_enabled 
> rcupdate.rcu_cpu_stall_timeout=100 net.ifnames=0 printk.devkmsg=on panic=-1 
> softlockup_panic=1 nmi_watchdog=panic oops=panic load_ramdisk=2 
> prompt_ramdisk=0 drbd.minor_count=8 systemd.log_level=err ignore_loglevel 
> console=tty0 earlyprintk=ttyS0,115200 console=ttyS0,115200 vga=normal rw 
> rcuperf.shutdown=0
> 

Can this report be made useful by actually including output from the serial
console? For example possible BUG_ONs or whatnot? dmesg.xz just contains
qemu's command line + some metadata about the test and:

"BUG: kernel reboot-without-warning in early-boot stage, last printk:
Probing EDD (edd=off to disable)... ok"

At least a stack trace would have been useful.




Re: [PATCH] btrfs: zstd ensure reclaim timer is properly cleaned up

2019-02-22 Thread Nikolay Borisov



On 21.02.19 г. 22:25 ч., Dennis Zhou wrote:
> The timer function, zstd_reclaim_timer_fn(), reschedules itself under
> certain conditions. Switch to del_timer_sync() to ensure that the timer
> function hasn't rescheduled itself.

According to del_timer_sync it just waits for any concurrent invocation
to finish. But it's the responsibility of the caller to ensure the timer
cannot really be restarted. It's not obvious how
zstd_cleanup_workspace_manager ensures that the timer won't be restarted.
Looking at the timer function, for it not to restart, wsm.lru_list has to
be empty. And seeing that workspace->lru_list is deleted only afterwards
I'm not sure this invariant holds.
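
If the timer can legitimately re-arm itself, the usual pattern is an
explicit shutdown flag that the timer function checks before calling
mod_timer, along these lines (a sketch with made-up names, not what the
patch does):

	static struct timer_list reclaim_timer;
	static bool shutting_down;

	static void reclaim_timer_fn(struct timer_list *t)
	{
		/* ... do the actual reclaim work ... */

		/* only re-arm while the manager is alive */
		if (!READ_ONCE(shutting_down))
			mod_timer(t, jiffies + HZ);
	}

	static void cleanup(void)
	{
		WRITE_ONCE(shutting_down, true);  /* forbid re-arming first */
		del_timer_sync(&reclaim_timer);   /* then wait for the callback */
	}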

> 
> Signed-off-by: Dennis Zhou 
> ---
>  fs/btrfs/zstd.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/fs/btrfs/zstd.c b/fs/btrfs/zstd.c
> index 3e418a3aeb11..62de9a211321 100644
> --- a/fs/btrfs/zstd.c
> +++ b/fs/btrfs/zstd.c
> @@ -195,7 +195,7 @@ static void zstd_cleanup_workspace_manager(void)
>   struct workspace *workspace;
>   int i;
>  
> - del_timer(&wsm.timer);
> + del_timer_sync(&wsm.timer);
>  
>   for (i = 0; i < ZSTD_BTRFS_MAX_LEVEL; i++) {
>   while (!list_empty(&wsm.idle_ws[i])) {
> 


Re: [PATCH 07/11] btrfs: move to fn pointers for get/put workspaces

2019-01-29 Thread Nikolay Borisov



On 28.01.19 г. 23:24 ч., Dennis Zhou wrote:
> The previous patch added generic helpers for get_workspace() and
> put_workspace(). Now, we can migrate ownership of the workspace_manager
> to be in the compression type code as the compression code itself
> doesn't care beyond being able to get a workspace. The init/cleanup
> and get/put methods are abstracted so each compression algorithm can
> decide how they want to manage their workspaces.
> 
> Signed-off-by: Dennis Zhou 

TBH I can't really see the value in this patch. IMO it doesn't make the
code more readable; on the contrary, you create algorithm-specific
wrappers over the generic function, where the sole specialization is in
the arguments passed to the generic functions. You introduce 4 more
function pointers and this affects performance negatively (albeit I can't
say to what extent) due to Spectre mitigations (retpolines).

I also read the follow up patches with the hopes of seeing how the code
becomes cleaner to no avail. At this point I'm really not in favor of
this particular patch.


Re: [PATCH 10/11] btrfs: zstd use the passed through level instead of default

2019-01-29 Thread Nikolay Borisov



On 28.01.19 г. 23:24 ч., Dennis Zhou wrote:
> Zstd currently only supports the default level of compression. This
> patch switches to using the level passed in for btrfs zstd
> configuration.
> 
> Zstd workspaces now keep track of the requested level as this can differ
> from the size of the workspace.
> 
> Signed-off-by: Dennis Zhou 

Reviewed-by: Nikolay Borisov 

> ---
>  fs/btrfs/zstd.c | 19 +--
>  1 file changed, 13 insertions(+), 6 deletions(-)
> 
> diff --git a/fs/btrfs/zstd.c b/fs/btrfs/zstd.c
> index 43f3be755b8c..a951d4fe77f7 100644
> --- a/fs/btrfs/zstd.c
> +++ b/fs/btrfs/zstd.c
> @@ -21,10 +21,10 @@
>  #define ZSTD_BTRFS_MAX_INPUT (1 << ZSTD_BTRFS_MAX_WINDOWLOG)
>  #define ZSTD_BTRFS_DEFAULT_LEVEL 3
>  
> -static ZSTD_parameters zstd_get_btrfs_parameters(size_t src_len)
> +static ZSTD_parameters zstd_get_btrfs_parameters(unsigned int level,
> +  size_t src_len)
>  {
> - ZSTD_parameters params = ZSTD_getParams(ZSTD_BTRFS_DEFAULT_LEVEL,
> - src_len, 0);
> + ZSTD_parameters params = ZSTD_getParams(level, src_len, 0);
>  
>   if (params.cParams.windowLog > ZSTD_BTRFS_MAX_WINDOWLOG)
>   params.cParams.windowLog = ZSTD_BTRFS_MAX_WINDOWLOG;
> @@ -36,6 +36,7 @@ struct workspace {
>   void *mem;
>   size_t size;
>   char *buf;
> + unsigned int req_level;
>   struct list_head list;
>   ZSTD_inBuffer in_buf;
>   ZSTD_outBuffer out_buf;
> @@ -55,7 +56,12 @@ static void zstd_cleanup_workspace_manager(void)
>  
>  static struct list_head *zstd_get_workspace(unsigned int level)
>  {
> - return btrfs_get_workspace(&wsm, level);
> + struct list_head *ws = btrfs_get_workspace(&wsm, level);
> + struct workspace *workspace = list_entry(ws, struct workspace, list);
> +
> + workspace->req_level = level;
> +
> + return ws;
>  }
>  
>  static void zstd_put_workspace(struct list_head *ws)
> @@ -75,7 +81,7 @@ static void zstd_free_workspace(struct list_head *ws)
>  static struct list_head *zstd_alloc_workspace(unsigned int level)
>  {
>   ZSTD_parameters params =
> - zstd_get_btrfs_parameters(ZSTD_BTRFS_MAX_INPUT);
> + zstd_get_btrfs_parameters(level, ZSTD_BTRFS_MAX_INPUT);
>   struct workspace *workspace;
>  
>   workspace = kzalloc(sizeof(*workspace), GFP_KERNEL);
> @@ -117,7 +123,8 @@ static int zstd_compress_pages(struct list_head *ws,
>   unsigned long len = *total_out;
>   const unsigned long nr_dest_pages = *out_pages;
>   unsigned long max_out = nr_dest_pages * PAGE_SIZE;
> - ZSTD_parameters params = zstd_get_btrfs_parameters(len);
> + ZSTD_parameters params = zstd_get_btrfs_parameters(workspace->req_level,
> +len);
>  
>   *out_pages = 0;
>   *total_out = 0;
> 


Re: [PATCH 09/11] btrfs: change set_level() to bound the level passed in

2019-01-29 Thread Nikolay Borisov



On 28.01.19 г. 23:24 ч., Dennis Zhou wrote:
> Currently, the only user of set_level() is zlib which sets an internal
> workspace parameter. As level is now plumbed into get_workspace(), this
> can be handled there rather than separately.
> 
> This repurposes set_level() to bound the level passed in so it can be
> used when setting the mounts compression level and as well as verifying
> the level before getting a workspace. The other benefit is this divides
> the meaning of compress(0) and get_workspace(0). The former means we
> want to use the default compression level of the compression type. The
> latter means we can use any workspace available.
> 
> Signed-off-by: Dennis Zhou 
> ---
>  fs/btrfs/compression.c | 23 +++
>  fs/btrfs/compression.h |  4 ++--
>  fs/btrfs/lzo.c |  3 ++-
>  fs/btrfs/super.c   |  4 +++-
>  fs/btrfs/zlib.c| 18 +++---
>  fs/btrfs/zstd.c|  3 ++-
>  6 files changed, 35 insertions(+), 20 deletions(-)
> 
> diff --git a/fs/btrfs/compression.c b/fs/btrfs/compression.c
> index e509071eaa69..a552c6f61e6d 100644
> --- a/fs/btrfs/compression.c
> +++ b/fs/btrfs/compression.c
> @@ -1008,9 +1008,9 @@ int btrfs_compress_pages(unsigned int type_level, 
> struct address_space *mapping,
>   struct list_head *workspace;
>   int ret;
>  
> - workspace = get_workspace(type, level);
> + level = btrfs_compress_op[type]->set_level(level);
>  
> - btrfs_compress_op[type]->set_level(workspace, level);
> + workspace = get_workspace(type, level);
>   ret = btrfs_compress_op[type]->compress_pages(workspace, mapping,
> start, pages,
> out_pages,
> @@ -1563,14 +1563,21 @@ int btrfs_compress_heuristic(struct inode *inode, u64 
> start, u64 end)
>   return ret;
>  }
>  
> -unsigned int btrfs_compress_str2level(const char *str)
> +unsigned int btrfs_compress_str2level(unsigned int type, const char *str)
>  {
> - if (strncmp(str, "zlib", 4) != 0)
> + unsigned int level;
> + int ret;
> +
> + if (!type)
>   return 0;
>  
> - /* Accepted form: zlib:1 up to zlib:9 and nothing left after the number 
> */
> - if (str[4] == ':' && '1' <= str[5] && str[5] <= '9' && str[6] == 0)
> - return str[5] - '0';
> + if (str[0] == ':') {
> + ret = kstrtouint(str + 1, 10, &level);
> + if (ret)
> + level = 0;
> + }
> +
> + level = btrfs_compress_op[type]->set_level(level);
>  
> - return BTRFS_ZLIB_DEFAULT_LEVEL;
> + return level;
>  }
> diff --git a/fs/btrfs/compression.h b/fs/btrfs/compression.h
> index e3627139bc5c..d607be40aa0e 100644
> --- a/fs/btrfs/compression.h
> +++ b/fs/btrfs/compression.h
> @@ -90,7 +90,7 @@ blk_status_t btrfs_submit_compressed_write(struct inode 
> *inode, u64 start,
>  blk_status_t btrfs_submit_compressed_read(struct inode *inode, struct bio 
> *bio,
>int mirror_num, unsigned long bio_flags);
>  
> -unsigned btrfs_compress_str2level(const char *str);
> +unsigned int btrfs_compress_str2level(unsigned int type, const char *str);
>  
>  enum btrfs_compression_type {
>   BTRFS_COMPRESS_NONE  = 0,
> @@ -149,7 +149,7 @@ struct btrfs_compress_op {
> unsigned long start_byte,
> size_t srclen, size_t destlen);
>  
> - void (*set_level)(struct list_head *ws, unsigned int type);
> + unsigned int (*set_level)(unsigned int level);

It might be good to document the return value since this is an
interface. AFAICS implementations are required to return the actual
level set irrespective of what level was passed in, no?
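
Something like this perhaps (wording only a suggestion, the exact contract is
yours to confirm):

	/*
	 * Bound the given level to what this compression type supports and
	 * return the level that will actually be used, so that callers can
	 * rely on the returned value rather than on what they passed in.
	 */
	unsigned int (*set_level)(unsigned int level);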

>  };
>  
>  extern const struct btrfs_compress_op btrfs_heuristic_compress;
> diff --git a/fs/btrfs/lzo.c b/fs/btrfs/lzo.c
> index f132af45a924..579d53ae256f 100644
> --- a/fs/btrfs/lzo.c
> +++ b/fs/btrfs/lzo.c
> @@ -507,8 +507,9 @@ static int lzo_decompress(struct list_head *ws, unsigned 
> char *data_in,
>   return ret;
>  }
>  
> -static void lzo_set_level(struct list_head *ws, unsigned int type)
> +static unsigned int lzo_set_level(unsigned int level)
>  {
> + return 0;
>  }
>  
>  const struct btrfs_compress_op btrfs_lzo_compress = {
> diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
> index c5586ffd1426..b28dff207383 100644
> --- a/fs/btrfs/super.c
> +++ b/fs/btrfs/super.c
> @@ -529,7 +529,9 @@ int btrfs_parse_options(struct btrfs_fs_info *info, char 
> *options,
>   if (token != Opt_compress &&
>   token != Opt_compress_force)
>   info->compress_level =
> -   
> btrfs_compress_str2level(args[0].from);
> +   btrfs_compress_str2level(
> + BTRFS_COMPRESS_ZLIB,
> +  

Re: [PATCH 08/11] btrfs: plumb level through the compression interface

2019-01-29 Thread Nikolay Borisov



On 28.01.19 г. 23:24 ч., Dennis Zhou wrote:
> Zlib compression supports multiple levels, but doesn't require changing
> in how a workspace itself is created and managed. Zstd introduces a
> different memory requirement such that higher levels of compression
> require more memory. This requires changes in how the alloc()/get()
> methods work for zstd. This pach plumbs compression level through the
> interface as a parameter in preparation for zstd compression levels.
> This gives the compression types opportunity to create/manage based on
> the compression level.
> 
> Signed-off-by: Dennis Zhou 

Reviewed-by: Nikolay Borisov 

> ---
>  fs/btrfs/compression.c | 31 ---
>  fs/btrfs/compression.h |  7 ---
>  fs/btrfs/lzo.c |  6 +++---
>  fs/btrfs/zlib.c|  7 ---
>  fs/btrfs/zstd.c|  6 +++---
>  5 files changed, 30 insertions(+), 27 deletions(-)
> 
> diff --git a/fs/btrfs/compression.c b/fs/btrfs/compression.c
> index ab694760ffdb..e509071eaa69 100644
> --- a/fs/btrfs/compression.c
> +++ b/fs/btrfs/compression.c
> @@ -744,9 +744,9 @@ static void heuristic_cleanup_workspace_manager(void)
>   btrfs_cleanup_workspace_manager(&heuristic_wsm);
>  }
>  
> -static struct list_head *heuristic_get_workspace(void)
> +static struct list_head *heuristic_get_workspace(unsigned int level)
>  {
> - return btrfs_get_workspace(&heuristic_wsm);
> + return btrfs_get_workspace(&heuristic_wsm, level);
>  }
>  
>  static void heuristic_put_workspace(struct list_head *ws)
> @@ -766,7 +766,7 @@ static void free_heuristic_ws(struct list_head *ws)
>   kfree(workspace);
>  }
>  
> -static struct list_head *alloc_heuristic_ws(void)
> +static struct list_head *alloc_heuristic_ws(unsigned int level)
>  {
>   struct heuristic_ws *ws;
>  
> @@ -825,7 +825,7 @@ void btrfs_init_workspace_manager(struct 
> workspace_manager *wsm,
>* Preallocate one workspace for each compression type so
>* we can guarantee forward progress in the worst case
>*/
> - workspace = wsm->ops->alloc_workspace();
> + workspace = wsm->ops->alloc_workspace(0);
>   if (IS_ERR(workspace)) {
>   pr_warn("BTRFS: cannot preallocate compression workspace, will 
> try later\n");
>   } else {
> @@ -853,7 +853,8 @@ void btrfs_cleanup_workspace_manager(struct 
> workspace_manager *wsman)
>   * Preallocation makes a forward progress guarantees and we do not return
>   * errors.
>   */
> -struct list_head *btrfs_get_workspace(struct workspace_manager *wsm)
> +struct list_head *btrfs_get_workspace(struct workspace_manager *wsm,
> +   unsigned int level)
>  {
>   struct list_head *workspace;
>   int cpus = num_online_cpus();
> @@ -899,7 +900,7 @@ struct list_head *btrfs_get_workspace(struct 
> workspace_manager *wsm)
>* context of btrfs_compress_bio/btrfs_compress_pages
>*/
>   nofs_flag = memalloc_nofs_save();
> - workspace = wsm->ops->alloc_workspace();
> + workspace = wsm->ops->alloc_workspace(level);
>   memalloc_nofs_restore(nofs_flag);
>  
>   if (IS_ERR(workspace)) {
> @@ -930,9 +931,9 @@ struct list_head *btrfs_get_workspace(struct 
> workspace_manager *wsm)
>   return workspace;
>  }
>  
> -static struct list_head *get_workspace(int type)
> +static struct list_head *get_workspace(int type, int level)
>  {
> - return btrfs_compress_op[type]->get_workspace();
> + return btrfs_compress_op[type]->get_workspace(level);
>  }
>  
>  /*
> @@ -1003,12 +1004,13 @@ int btrfs_compress_pages(unsigned int type_level, 
> struct address_space *mapping,
>unsigned long *total_out)
>  {
>   int type = BTRFS_COMPRESS_TYPE(type_level);
> + int level = BTRFS_COMPRESS_LEVEL(type_level);
>   struct list_head *workspace;
>   int ret;
>  
> - workspace = get_workspace(type);
> + workspace = get_workspace(type, level);
>  
> - btrfs_compress_op[type]->set_level(workspace, type_level);
> + btrfs_compress_op[type]->set_level(workspace, level);
>   ret = btrfs_compress_op[type]->compress_pages(workspace, mapping,
> start, pages,
> out_pages,
> @@ -1037,7 +1039,7 @@ static int btrfs_decompress_bio(struct compressed_bio 
> *cb)
>   int ret;
>   int type = cb->compress_type;
>  
> - workspace = get_workspace(type);
> + workspace = get_workspace(type, 0);
>   ret = btrfs_compress_op[type]->decompress_bio(workspace, cb);
>   put_workspace(type, wor

Re: [PATCH 05/11] btrfs: add helper methods for workspace manager init and cleanup

2019-01-28 Thread Nikolay Borisov



On 28.01.19 г. 23:24 ч., Dennis Zhou wrote:
> Workspace manager init and cleanup code is open coded inside a for loop
> over the compression types. This forces each compression type to rely on
> the same workspace manager implementation. This patch creates helper
> methods that will be the generic implementation for btrfs workspace
> management.
> 
> Signed-off-by: Dennis Zhou 

Reviewed-by: Nikolay Borisov 


Re: [PATCH 04/11] btrfs: unify compression ops with workspace_manager

2019-01-28 Thread Nikolay Borisov



On 28.01.19 г. 23:24 ч., Dennis Zhou wrote:
> Make the workspace_manager own the interface operations rather than
> managing index-paired arrays for the workspace_manager and compression
> operations.
> 
> Signed-off-by: Dennis Zhou 

Reviewed-by: Nikolay Borisov 

> ---
>  fs/btrfs/compression.c | 11 +++
>  1 file changed, 7 insertions(+), 4 deletions(-)
> 
> diff --git a/fs/btrfs/compression.c b/fs/btrfs/compression.c
> index bda7e8d2cbc7..b7e986e16640 100644
> --- a/fs/btrfs/compression.c
> +++ b/fs/btrfs/compression.c
> @@ -777,6 +777,7 @@ const struct btrfs_compress_op btrfs_heuristic_compress = 
> {
>  };
>  
>  struct workspace_manager {
> + const struct btrfs_compress_op *ops;
>   struct list_head idle_ws;
>   spinlock_t ws_lock;
>   /* Number of free workspaces */
> @@ -802,6 +803,8 @@ void __init btrfs_init_compress(void)
>   int i;
>  
>   for (i = 0; i < BTRFS_COMPRESS_TYPES; i++) {
> + wsm[i].ops = btrfs_compress_op[i];
> +
>   INIT_LIST_HEAD(&wsm[i].idle_ws);
>   spin_lock_init(&wsm[i].ws_lock);
>   atomic_set(&wsm[i].total_ws, 0);
> @@ -811,7 +814,7 @@ void __init btrfs_init_compress(void)
>* Preallocate one workspace for each compression type so
>* we can guarantee forward progress in the worst case
>*/
> - workspace = btrfs_compress_op[i]->alloc_workspace();
> + workspace = wsm[i].ops->alloc_workspace();
>   if (IS_ERR(workspace)) {
>   pr_warn("BTRFS: cannot preallocate compression 
> workspace, will try later\n");
>   } else {
> @@ -874,7 +877,7 @@ static struct list_head *find_workspace(int type)
>* context of btrfs_compress_bio/btrfs_compress_pages
>*/
>   nofs_flag = memalloc_nofs_save();
> - workspace = btrfs_compress_op[type]->alloc_workspace();
> + workspace = wsm[type].ops->alloc_workspace();
>   memalloc_nofs_restore(nofs_flag);
>  
>   if (IS_ERR(workspace)) {
> @@ -932,7 +935,7 @@ static void free_workspace(int type, struct list_head 
> *workspace)
>   }
>   spin_unlock(ws_lock);
>  
> - btrfs_compress_op[type]->free_workspace(workspace);
> + wsm[type].ops->free_workspace(workspace);
>   atomic_dec(total_ws);
>  wake:
>   cond_wake_up(ws_wait);
> @@ -950,7 +953,7 @@ static void free_workspaces(void)
>   while (!list_empty(&wsm[i].idle_ws)) {
>   workspace = wsm[i].idle_ws.next;
>   list_del(workspace);
> - btrfs_compress_op[i]->free_workspace(workspace);
> + wsm[i].ops->free_workspace(workspace);
>   atomic_dec(&wsm[i].total_ws);
>   }
>   }
> 


Re: [PATCH 03/11] btrfs: manage heuristic workspace as index 0

2019-01-28 Thread Nikolay Borisov



On 28.01.19 г. 23:24 ч., Dennis Zhou wrote:
> While the heuristic workspaces aren't really compression workspaces,
> they use the same interface for managing them. So rather than branching,
> let's just handle them once again as the index 0 compression type.
> 
> Signed-off-by: Dennis Zhou 

Reviewed-by: Nikolay Borisov  albeit one minor nit
below.
> ---
>  fs/btrfs/compression.c  | 107 +++-
>  fs/btrfs/compression.h  |   3 +-
>  fs/btrfs/ioctl.c|   2 +-
>  fs/btrfs/tree-checker.c |   4 +-
>  4 files changed, 33 insertions(+), 83 deletions(-)
> 
> diff --git a/fs/btrfs/compression.c b/fs/btrfs/compression.c
> index aced261984e2..bda7e8d2cbc7 100644
> --- a/fs/btrfs/compression.c
> +++ b/fs/btrfs/compression.c
> @@ -37,6 +37,8 @@ const char* btrfs_compress_type2str(enum 
> btrfs_compression_type type)
>   case BTRFS_COMPRESS_ZSTD:
>   case BTRFS_COMPRESS_NONE:
>   return btrfs_compress_types[type];
> + default:
> + return NULL;

nit: With this change...

>   }
>  
>   return NULL;

This becomes redundant. I doubt the compiler will issue a warning, since
it should be clever enough to figure out that we can never fall out of the
switch() construct without returning.
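
I.e. the helper could then simply look like this (illustrative sketch only,
assuming the case labels above stay as they are):

const char *btrfs_compress_type2str(enum btrfs_compression_type type)
{
	switch (type) {
	case BTRFS_COMPRESS_ZLIB:
	case BTRFS_COMPRESS_LZO:
	case BTRFS_COMPRESS_ZSTD:
	case BTRFS_COMPRESS_NONE:
		return btrfs_compress_types[type];
	default:
		return NULL;
	}
}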

> @@ -769,6 +771,11 @@ static struct list_head *alloc_heuristic_ws(void)
>   return ERR_PTR(-ENOMEM);
>  }
>  
> +const struct btrfs_compress_op btrfs_heuristic_compress = {
> + .alloc_workspace = alloc_heuristic_ws,
> + .free_workspace = free_heuristic_ws,
> +};
> +
>  struct workspace_manager {
>   struct list_head idle_ws;
>   spinlock_t ws_lock;
> @@ -782,9 +789,8 @@ struct workspace_manager {
>  
>  static struct workspace_manager wsm[BTRFS_COMPRESS_TYPES];
>  
> -static struct workspace_manager btrfs_heuristic_ws;
> -
>  static const struct btrfs_compress_op * const btrfs_compress_op[] = {
> + &btrfs_heuristic_compress,
>   &btrfs_zlib_compress,
>   &btrfs_lzo_compress,
>   &btrfs_zstd_compress,
> @@ -795,21 +801,6 @@ void __init btrfs_init_compress(void)
>   struct list_head *workspace;
>   int i;
>  
> - INIT_LIST_HEAD(&btrfs_heuristic_ws.idle_ws);
> - spin_lock_init(&btrfs_heuristic_ws.ws_lock);
> - atomic_set(&btrfs_heuristic_ws.total_ws, 0);
> - init_waitqueue_head(&btrfs_heuristic_ws.ws_wait);
> -
> - workspace = alloc_heuristic_ws();
> - if (IS_ERR(workspace)) {
> - pr_warn(
> - "BTRFS: cannot preallocate heuristic workspace, will try later\n");
> - } else {
> - atomic_set(&btrfs_heuristic_ws.total_ws, 1);
> - btrfs_heuristic_ws.free_ws = 1;
> - list_add(workspace, &btrfs_heuristic_ws.idle_ws);
> - }
> -
>   for (i = 0; i < BTRFS_COMPRESS_TYPES; i++) {
>   INIT_LIST_HEAD(&wsm[i].idle_ws);
>   spin_lock_init(&wsm[i].ws_lock);
> @@ -837,11 +828,10 @@ void __init btrfs_init_compress(void)
>   * Preallocation makes a forward progress guarantees and we do not return
>   * errors.
>   */
> -static struct list_head *__find_workspace(int type, bool heuristic)
> +static struct list_head *find_workspace(int type)
>  {
>   struct list_head *workspace;
>   int cpus = num_online_cpus();
> - int idx = type - 1;
>   unsigned nofs_flag;
>   struct list_head *idle_ws;
>   spinlock_t *ws_lock;
> @@ -849,19 +839,11 @@ static struct list_head *__find_workspace(int type, 
> bool heuristic)
>   wait_queue_head_t *ws_wait;
>   int *free_ws;
>  
> - if (heuristic) {
> - idle_ws  = &btrfs_heuristic_ws.idle_ws;
> - ws_lock  = &btrfs_heuristic_ws.ws_lock;
> - total_ws = &btrfs_heuristic_ws.total_ws;
> - ws_wait  = &btrfs_heuristic_ws.ws_wait;
> - free_ws  = &btrfs_heuristic_ws.free_ws;
> - } else {
> - idle_ws  = &wsm[idx].idle_ws;
> - ws_lock  = &wsm[idx].ws_lock;
> - total_ws = &wsm[idx].total_ws;
> - ws_wait  = &wsm[idx].ws_wait;
> - free_ws  = &wsm[idx].free_ws;
> - }
> + idle_ws  = &wsm[type].idle_ws;
> + ws_lock  = &wsm[type].ws_lock;
> + total_ws = &wsm[type].total_ws;
> + ws_wait  = &wsm[type].ws_wait;
> + free_ws  = &wsm[type].free_ws;
>  
>  again:
>   spin_lock(ws_lock);
> @@ -892,10 +874,7 @@ static struct list_head *__find_workspace(int type, bool 
> heuristic)
>* context of btrfs_compress_bio/btrfs_compress_pages
>*/
>   nofs_flag = memalloc_nofs_save();
> - if (heuristic)
> - workspace = alloc_heuristic_ws();
> - else
> - workspace = btrfs_compress_op[idx]->alloc_workspace();
> + workspace = btrfs_compress_op[type]->alloc_workspace();
>   memalloc_nof

Re: [PATCH 02/11] btrfs: rename workspaces_list to workspace_manager

2019-01-28 Thread Nikolay Borisov



On 28.01.19 г. 23:24 ч., Dennis Zhou wrote:
> This is in preparation for zstd compression levels. As each level will
> require different sized workspaces, workspaces_list is no longer a
> really fitting name.
> 
> Signed-off-by: Dennis Zhou 

Reviewed-by: Nikolay Borisov 

> ---
>  fs/btrfs/compression.c | 46 +-
>  1 file changed, 23 insertions(+), 23 deletions(-)
> 
> diff --git a/fs/btrfs/compression.c b/fs/btrfs/compression.c
> index 586f95ac0aea..aced261984e2 100644
> --- a/fs/btrfs/compression.c
> +++ b/fs/btrfs/compression.c
> @@ -769,7 +769,7 @@ static struct list_head *alloc_heuristic_ws(void)
>   return ERR_PTR(-ENOMEM);
>  }
>  
> -struct workspaces_list {
> +struct workspace_manager {
>   struct list_head idle_ws;
>   spinlock_t ws_lock;
>   /* Number of free workspaces */
> @@ -780,9 +780,9 @@ struct workspaces_list {
>   wait_queue_head_t ws_wait;
>  };
>  
> -static struct workspaces_list btrfs_comp_ws[BTRFS_COMPRESS_TYPES];
> +static struct workspace_manager wsm[BTRFS_COMPRESS_TYPES];
>  
> -static struct workspaces_list btrfs_heuristic_ws;
> +static struct workspace_manager btrfs_heuristic_ws;
>  
>  static const struct btrfs_compress_op * const btrfs_compress_op[] = {
>   &btrfs_zlib_compress,
> @@ -811,10 +811,10 @@ void __init btrfs_init_compress(void)
>   }
>  
>   for (i = 0; i < BTRFS_COMPRESS_TYPES; i++) {
> - INIT_LIST_HEAD(&btrfs_comp_ws[i].idle_ws);
> - spin_lock_init(&btrfs_comp_ws[i].ws_lock);
> - atomic_set(&btrfs_comp_ws[i].total_ws, 0);
> - init_waitqueue_head(&btrfs_comp_ws[i].ws_wait);
> + INIT_LIST_HEAD(&wsm[i].idle_ws);
> + spin_lock_init(&wsm[i].ws_lock);
> + atomic_set(&wsm[i].total_ws, 0);
> + init_waitqueue_head(&wsm[i].ws_wait);
>  
>   /*
>* Preallocate one workspace for each compression type so
> @@ -824,9 +824,9 @@ void __init btrfs_init_compress(void)
>   if (IS_ERR(workspace)) {
>   pr_warn("BTRFS: cannot preallocate compression 
> workspace, will try later\n");
>   } else {
> - atomic_set(&btrfs_comp_ws[i].total_ws, 1);
> - btrfs_comp_ws[i].free_ws = 1;
> - list_add(workspace, &btrfs_comp_ws[i].idle_ws);
> + atomic_set(&wsm[i].total_ws, 1);
> + wsm[i].free_ws = 1;
> + list_add(workspace, &wsm[i].idle_ws);
>   }
>   }
>  }
> @@ -856,11 +856,11 @@ static struct list_head *__find_workspace(int type, 
> bool heuristic)
>   ws_wait  = &btrfs_heuristic_ws.ws_wait;
>   free_ws  = &btrfs_heuristic_ws.free_ws;
>   } else {
> - idle_ws  = &btrfs_comp_ws[idx].idle_ws;
> - ws_lock  = &btrfs_comp_ws[idx].ws_lock;
> - total_ws = &btrfs_comp_ws[idx].total_ws;
> - ws_wait  = &btrfs_comp_ws[idx].ws_wait;
> - free_ws  = &btrfs_comp_ws[idx].free_ws;
> + idle_ws  = &wsm[idx].idle_ws;
> + ws_lock  = &wsm[idx].ws_lock;
> + total_ws = &wsm[idx].total_ws;
> + ws_wait  = &wsm[idx].ws_wait;
> + free_ws  = &wsm[idx].free_ws;
>   }
>  
>  again:
> @@ -952,11 +952,11 @@ static void __free_workspace(int type, struct list_head 
> *workspace,
>   ws_wait  = &btrfs_heuristic_ws.ws_wait;
>   free_ws  = &btrfs_heuristic_ws.free_ws;
>   } else {
> - idle_ws  = &btrfs_comp_ws[idx].idle_ws;
> - ws_lock  = &btrfs_comp_ws[idx].ws_lock;
> - total_ws = &btrfs_comp_ws[idx].total_ws;
> - ws_wait  = &btrfs_comp_ws[idx].ws_wait;
> - free_ws  = &btrfs_comp_ws[idx].free_ws;
> + idle_ws  = &wsm[idx].idle_ws;
> + ws_lock  = &wsm[idx].ws_lock;
> + total_ws = &wsm[idx].total_ws;
> + ws_wait  = &wsm[idx].ws_wait;
> + free_ws  = &wsm[idx].free_ws;
>   }
>  
>   spin_lock(ws_lock);
> @@ -998,11 +998,11 @@ static void free_workspaces(void)
>   }
>  
>   for (i = 0; i < BTRFS_COMPRESS_TYPES; i++) {
> - while (!list_empty(&btrfs_comp_ws[i].idle_ws)) {
> - workspace = btrfs_comp_ws[i].idle_ws.next;
> + while (!list_empty(&wsm[i].idle_ws)) {
> + workspace = wsm[i].idle_ws.next;
>   list_del(workspace);
>   btrfs_compress_op[i]->free_workspace(workspace);
> - atomic_dec(&btrfs_comp_ws[i].total_ws);
> + atomic_dec(&wsm[i].total_ws);
>   }
>   }
>  }
> 


Re: [PATCH 01/11] btrfs: add macros for compression type and level

2019-01-28 Thread Nikolay Borisov



On 28.01.19 г. 23:24 ч., Dennis Zhou wrote:
> It is very easy to miss places that rely on a certain bitshifting for
> decyphering the type_level overloading. Make macros handle this instead.
> 
> Signed-off-by: Dennis Zhou 

Reviewed-by: Nikolay Borisov 
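
The overloading in question packs the compression type and the requested level
into a single unsigned int; conceptually the new macros look roughly like this
(a sketch only, the exact masks and shift width are whatever the patch
defines):

/* Sketch - the real definitions live in fs/btrfs/compression.h. */
#define BTRFS_COMPRESS_TYPE(type_level)		((type_level) & 0xF)
#define BTRFS_COMPRESS_LEVEL(type_level)	(((type_level) >> 4) & 0xF)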


Re: [PATCH 11/11] btrfs: add zstd compression level support

2019-01-28 Thread Nikolay Borisov



On 28.01.19 г. 23:24 ч., Dennis Zhou wrote:
> Zstd compression requires different amounts of memory for each level of
> compression. The prior patches implemented indirection to allow for each
> compression type to manage their workspaces independently. This patch
> uses this indirection to implement compression level support for zstd.
> 
> As mentioned above, a requirement that differs zstd from zlib is that
> higher levels of compression require more memory. To manage this, each
> compression level has its own queue of workspaces. A global LRU is used
> to help with reclaim. To guarantee forward progress, a max level
> workspace is preallocated and hidden from the LRU.
> 
> When getting a workspace, it uses a bitmap to identify the levels that
> are populated and scans up. If it finds a workspace that is greater than
> it, it uses it, but does not update the last_used time and the
> corresponding place in the LRU. This provides a mechanism to decrease
> memory utilization as we only keep around workspaces that are sized
> appropriately for the in use compression levels.
> 
> By knowing which compression levels have available workspaces, we can
> recycle rather than always create new workspaces as well as take
> advantage of the preallocated max level for forward progress. If we hit
> memory pressure, we sleep on the max level workspace. We continue to
> rescan in case we can use a smaller workspace, but eventually should be
> able to obtain the max level workspace or allocate one again should
> memory pressure subside. The memory requirement for decompression is the
> same as level 1, and therefore can use any of available workspace.
> 
> The number of workspaces is bound by an upper limit of the workqueue's
> limit which currently is 2 (percpu limit). Second, a reclaim timer is
> used to free inactive/improperly sized workspaces. The reclaim timer is
> set to 67s to avoid colliding with transaction commit (every 30s) and
> attempts to reclaim any unused workspace older than 45s.
> 
> Repeating the experiment from v2 [1], the Silesia corpus was copied to a
> btrfs filesystem 10 times and then read back after dropping the caches.
> The btrfs filesystem was on an SSD.
> 
> Level   Ratio   Compression (MB/s)  Decompression (MB/s)
> 1   2.658438.47910.51
> 2   2.744364.86886.55
> 3   2.801336.33828.41
> 4   2.858286.71886.55
> 5   2.916212.77556.84
> 6   2.363119.82990.85
> 7   3.000154.06849.30
> 8   3.011159.54875.03
> 9   3.025100.51940.15
> 10  3.033118.97616.26
> 11  3.036 94.19802.11
> 12  3.037 73.45931.49
> 13  3.041 55.17835.26
> 14  3.087 44.70716.78
> 15  3.126 37.30878.84
> 
> [1] 
> https://lore.kernel.org/linux-btrfs/20181031181108.289340-1-terre...@fb.com/
> 
> Signed-off-by: Dennis Zhou 
> Cc: Nick Terrell 
> Cc: Omar Sandoval 
> ---
>  fs/btrfs/super.c |   6 +-
>  fs/btrfs/zstd.c  | 229 +--
>  2 files changed, 226 insertions(+), 9 deletions(-)
> 
> diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
> index b28dff207383..0ecc513cb56c 100644
> --- a/fs/btrfs/super.c
> +++ b/fs/btrfs/super.c
> @@ -544,9 +544,13 @@ int btrfs_parse_options(struct btrfs_fs_info *info, char 
> *options,
>   btrfs_clear_opt(info->mount_opt, NODATASUM);
>   btrfs_set_fs_incompat(info, COMPRESS_LZO);
>   no_compress = 0;
> - } else if (strcmp(args[0].from, "zstd") == 0) {
> + } else if (strncmp(args[0].from, "zstd", 4) == 0) {
>   compress_type = "zstd";
>   info->compress_type = BTRFS_COMPRESS_ZSTD;
> + info->compress_level =
> + btrfs_compress_str2level(
> +  BTRFS_COMPRESS_ZSTD,
> +  args[0].from + 4);
>   btrfs_set_opt(info->mount_opt, COMPRESS);
>   btrfs_clear_opt(info->mount_opt, NODATACOW);
>   btrfs_clear_opt(info->mount_opt, NODATASUM);
> diff --git a/fs/btrfs/zstd.c b/fs/btrfs/zstd.c
> index a951d4fe77f7..ce9b466c197f 100644
> --- a/fs/btrfs/zstd.c
> +++ b/fs/btrfs/zstd.c
> @@ -6,20 +6,27 @@
>   */
>  
>  #include <linux/bio.h>
> +#include <linux/bitmap.h>
>  #include <linux/err.h>
>  #include <linux/init.h>
>  #include <linux/kernel.h>
>  #include <linux/mm.h>
> +#include <linux/sched/mm.h>
>  #include <linux/pagemap.h>
>  #include <linux/refcount.h>
>  #include <linux/sched.h>
>  #include <linux/slab.h>
>  #include <linux/zstd.h>
>  #include "compression.h"
> +#include "ctree.h"
>  
>  #define 
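
To make the 'scan upwards through the populated levels' behaviour described in
the commit message above more concrete, here is a rough, hypothetical sketch of
the lookup; it is simplified and is not the code from the patch (only the
ZSTD_BTRFS_MAX_LEVEL name is taken from the series):

#include <linux/bitops.h>
#include <linux/list.h>

/*
 * Sketch only: one idle list per level plus a bitmap of levels that
 * currently have an idle workspace.  Bit (level - 1) set == populated.
 */
static struct list_head *pick_idle_workspace(unsigned long *active_map,
					     struct list_head *idle_ws,
					     unsigned int level)
{
	unsigned int first = level ? level - 1 : 0;	/* level 0 == any level */
	unsigned int i = find_next_bit(active_map, ZSTD_BTRFS_MAX_LEVEL, first);

	/*
	 * Take the first populated level at or above the requested one; a
	 * bigger workspace always fits a smaller job, just with more memory.
	 */
	if (i < ZSTD_BTRFS_MAX_LEVEL && !list_empty(&idle_ws[i]))
		return idle_ws[i].next;

	return NULL;	/* caller allocates a new workspace or waits */
}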

Re: [PATCH] [PATCH] btrfs: remove unused wariable 'num_pages'

2019-01-23 Thread Nikolay Borisov



On 23.01.19 г. 0:59 ч., Matthew Friday wrote:
> Signed-off-by: Matthew Friday 

It seems you are using an outdated version of the code. When sending new
code, always base it on the misc-next branch from
https://github.com/kdave/btrfs-devel

> ---
>  fs/btrfs/ioctl.c | 3 ---
>  1 file changed, 3 deletions(-)
> 
> diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
> index 9c8e1734429c..6243f734c0cd 100644
> --- a/fs/btrfs/ioctl.c
> +++ b/fs/btrfs/ioctl.c
> @@ -3278,7 +3278,6 @@ static int btrfs_extent_same(struct inode *src, u64 
> loff, u64 olen,
>struct inode *dst, u64 dst_loff)
>  {
>   int ret;
> - int num_pages = PAGE_ALIGN(BTRFS_MAX_DEDUPE_LEN) >> PAGE_SHIFT;
>   u64 i, tail_len, chunk_count;
>  
>   /* don't make the dst file partly checksummed */
> @@ -3291,8 +3290,6 @@ static int btrfs_extent_same(struct inode *src, u64 
> loff, u64 olen,
>  
>   tail_len = olen % BTRFS_MAX_DEDUPE_LEN;
>   chunk_count = div_u64(olen, BTRFS_MAX_DEDUPE_LEN);
> - if (chunk_count == 0)
> - num_pages = PAGE_ALIGN(tail_len) >> PAGE_SHIFT;
>  
>   for (i = 0; i < chunk_count; i++) {
>   ret = btrfs_extent_same_range(src, loff, BTRFS_MAX_DEDUPE_LEN,
> 


[PATCH] mm: Refactor readahead defines in mm.h

2018-12-21 Thread Nikolay Borisov
All users of VM_MAX_READAHEAD actually convert it from kbytes to bytes and
then to pages. Define the macro explicitly as (SZ_128K / PAGE_SIZE). This
simplifies the expression in every filesystem. Also rename the macro to
VM_READAHEAD_PAGES to properly convey its meaning. Finally, remove the unused
VM_MIN_READAHEAD.

Signed-off-by: Nikolay Borisov 
---
 block/blk-core.c   | 3 +--
 fs/9p/vfs_super.c  | 2 +-
 fs/afs/super.c | 2 +-
 fs/btrfs/disk-io.c | 2 +-
 fs/fuse/inode.c| 2 +-
 include/linux/mm.h | 4 ++--
 6 files changed, 7 insertions(+), 8 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index deb56932f8c4..d25c8564a117 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -1031,8 +1031,7 @@ struct request_queue *blk_alloc_queue_node(gfp_t 
gfp_mask, int node_id,
if (!q->stats)
goto fail_stats;
 
-   q->backing_dev_info->ra_pages =
-   (VM_MAX_READAHEAD * 1024) / PAGE_SIZE;
+   q->backing_dev_info->ra_pages = VM_READAHEAD_PAGES;
q->backing_dev_info->capabilities = BDI_CAP_CGROUP_WRITEBACK;
q->backing_dev_info->name = "block";
q->node = node_id;
diff --git a/fs/9p/vfs_super.c b/fs/9p/vfs_super.c
index 48ce50484e80..10d3bd3f534b 100644
--- a/fs/9p/vfs_super.c
+++ b/fs/9p/vfs_super.c
@@ -92,7 +92,7 @@ v9fs_fill_super(struct super_block *sb, struct 
v9fs_session_info *v9ses,
return ret;
 
if (v9ses->cache)
-   sb->s_bdi->ra_pages = (VM_MAX_READAHEAD * 1024)/PAGE_SIZE;
+   sb->s_bdi->ra_pages = VM_READAHEAD_PAGES;
 
sb->s_flags |= SB_ACTIVE | SB_DIRSYNC;
if (!v9ses->cache)
diff --git a/fs/afs/super.c b/fs/afs/super.c
index dcd07fe99871..e684f6769b15 100644
--- a/fs/afs/super.c
+++ b/fs/afs/super.c
@@ -399,7 +399,7 @@ static int afs_fill_super(struct super_block *sb,
ret = super_setup_bdi(sb);
if (ret)
return ret;
-   sb->s_bdi->ra_pages = VM_MAX_READAHEAD * 1024 / PAGE_SIZE;
+   sb->s_bdi->ra_pages = VM_READAHEAD_PAGES;
 
/* allocate the root inode and dentry */
if (as->dyn_root) {
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 6d776717d8b3..ee47d8b5b50c 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -2900,7 +2900,7 @@ int open_ctree(struct super_block *sb,
sb->s_bdi->congested_fn = btrfs_congested_fn;
sb->s_bdi->congested_data = fs_info;
sb->s_bdi->capabilities |= BDI_CAP_CGROUP_WRITEBACK;
-   sb->s_bdi->ra_pages = VM_MAX_READAHEAD * SZ_1K / PAGE_SIZE;
+   sb->s_bdi->ra_pages = VM_READAHEAD_PAGES;
sb->s_bdi->ra_pages *= btrfs_super_num_devices(disk_super);
sb->s_bdi->ra_pages = max(sb->s_bdi->ra_pages, SZ_4M / PAGE_SIZE);
 
diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index 568abed20eb2..d3eab53a29b7 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -1009,7 +1009,7 @@ static int fuse_bdi_init(struct fuse_conn *fc, struct 
super_block *sb)
if (err)
return err;
 
-   sb->s_bdi->ra_pages = (VM_MAX_READAHEAD * 1024) / PAGE_SIZE;
+   sb->s_bdi->ra_pages = VM_READAHEAD_PAGES;
/* fuse does it's own writeback accounting */
sb->s_bdi->capabilities = BDI_CAP_NO_ACCT_WB | BDI_CAP_STRICTLIMIT;
 
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 5411de93a363..1579082af177 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -26,6 +26,7 @@
 #include 
 #include 
 #include 
+#include <linux/sizes.h>
 
 struct mempolicy;
 struct anon_vma;
@@ -2396,8 +2397,7 @@ int __must_check write_one_page(struct page *page);
 void task_dirty_inc(struct task_struct *tsk);
 
 /* readahead.c */
-#define VM_MAX_READAHEAD   128 /* kbytes */
-#define VM_MIN_READAHEAD   16  /* kbytes (includes current page) */
+#define VM_READAHEAD_PAGES (SZ_128K / PAGE_SIZE)
 
 int force_page_cache_readahead(struct address_space *mapping, struct file 
*filp,
pgoff_t offset, unsigned long nr_to_read);
-- 
2.17.1
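
For reference, with the common 4 KiB page size the new constant works out to
VM_READAHEAD_PAGES = SZ_128K / PAGE_SIZE = 131072 / 4096 = 32 pages, which is
exactly what the old (VM_MAX_READAHEAD * 1024) / PAGE_SIZE expression
evaluated to (128 * 1024 / 4096 = 32).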



Re: [PATCH] mm: Define VM_(MAX|MIN)_READAHEAD via sizes.h constants

2018-12-21 Thread Nikolay Borisov



On 21.12.18 г. 15:24 ч., Matthew Wilcox wrote:
> On Fri, Dec 21, 2018 at 02:53:14PM +0200, Nikolay Borisov wrote:
>> All users of the aformentioned macros convert them to kbytes by
>> multplying. Instead, directly define the macros via the aptly named
>> SZ_16K/SZ_128K ones. Also remove the now redundant comments explaining
>> that VM_* are defined in kbytes it's obvious. No functional changes.
> 
> Actually, all users of these constants convert them to pages!
> 
>> +q->backing_dev_info->ra_pages = VM_MAX_READAHEAD / PAGE_SIZE;
>> +sb->s_bdi->ra_pages = VM_MAX_READAHEAD / PAGE_SIZE;
>> +sb->s_bdi->ra_pages = VM_MAX_READAHEAD / PAGE_SIZE;
>> +sb->s_bdi->ra_pages = VM_MAX_READAHEAD / PAGE_SIZE;
>> +sb->s_bdi->ra_pages = VM_MAX_READAHEAD / PAGE_SIZE;
> 
>> -#define VM_MAX_READAHEAD128 /* kbytes */
>> -#define VM_MIN_READAHEAD16  /* kbytes (includes current page) */
>> +#define VM_MAX_READAHEADSZ_128K
>> +#define VM_MIN_READAHEADSZ_16K  /* includes current page */
> 
> So perhaps:
> 
> #define VM_MAX_READAHEAD  (SZ_128K / PAGE_SIZE)
> 
> VM_MIN_READAHEAD isn't used, so just delete it?

I thought about that but didn't know if people would complain that at some
point in the future we might need it.

> 


[PATCH] mm: Define VM_(MAX|MIN)_READAHEAD via sizes.h constants

2018-12-21 Thread Nikolay Borisov
All users of the aforementioned macros convert them to bytes by
multiplying. Instead, directly define the macros via the aptly named
SZ_16K/SZ_128K constants. Also remove the now redundant comments explaining
that VM_* are defined in kbytes, as it's obvious. No functional changes.

Signed-off-by: Nikolay Borisov 
---

I guess it makes sense for this to land via Andrew's mmotm tree?

 block/blk-core.c   | 3 +--
 fs/9p/vfs_super.c  | 2 +-
 fs/afs/super.c | 2 +-
 fs/btrfs/disk-io.c | 2 +-
 fs/fuse/inode.c| 2 +-
 include/linux/mm.h | 5 +++--
 6 files changed, 8 insertions(+), 8 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index deb56932f8c4..d28b2eeec07e 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -1031,8 +1031,7 @@ struct request_queue *blk_alloc_queue_node(gfp_t 
gfp_mask, int node_id,
if (!q->stats)
goto fail_stats;
 
-   q->backing_dev_info->ra_pages =
-   (VM_MAX_READAHEAD * 1024) / PAGE_SIZE;
+   q->backing_dev_info->ra_pages = VM_MAX_READAHEAD / PAGE_SIZE;
q->backing_dev_info->capabilities = BDI_CAP_CGROUP_WRITEBACK;
q->backing_dev_info->name = "block";
q->node = node_id;
diff --git a/fs/9p/vfs_super.c b/fs/9p/vfs_super.c
index 48ce50484e80..5c9f757410ae 100644
--- a/fs/9p/vfs_super.c
+++ b/fs/9p/vfs_super.c
@@ -92,7 +92,7 @@ v9fs_fill_super(struct super_block *sb, struct 
v9fs_session_info *v9ses,
return ret;
 
if (v9ses->cache)
-   sb->s_bdi->ra_pages = (VM_MAX_READAHEAD * 1024)/PAGE_SIZE;
+   sb->s_bdi->ra_pages = VM_MAX_READAHEAD / PAGE_SIZE;
 
sb->s_flags |= SB_ACTIVE | SB_DIRSYNC;
if (!v9ses->cache)
diff --git a/fs/afs/super.c b/fs/afs/super.c
index dcd07fe99871..d1f3af74481a 100644
--- a/fs/afs/super.c
+++ b/fs/afs/super.c
@@ -399,7 +399,7 @@ static int afs_fill_super(struct super_block *sb,
ret = super_setup_bdi(sb);
if (ret)
return ret;
-   sb->s_bdi->ra_pages = VM_MAX_READAHEAD * 1024 / PAGE_SIZE;
+   sb->s_bdi->ra_pages = VM_MAX_READAHEAD / PAGE_SIZE;
 
/* allocate the root inode and dentry */
if (as->dyn_root) {
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 6d776717d8b3..d84e7283d24b 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -2900,7 +2900,7 @@ int open_ctree(struct super_block *sb,
sb->s_bdi->congested_fn = btrfs_congested_fn;
sb->s_bdi->congested_data = fs_info;
sb->s_bdi->capabilities |= BDI_CAP_CGROUP_WRITEBACK;
-   sb->s_bdi->ra_pages = VM_MAX_READAHEAD * SZ_1K / PAGE_SIZE;
+   sb->s_bdi->ra_pages = VM_MAX_READAHEAD / PAGE_SIZE;
sb->s_bdi->ra_pages *= btrfs_super_num_devices(disk_super);
sb->s_bdi->ra_pages = max(sb->s_bdi->ra_pages, SZ_4M / PAGE_SIZE);
 
diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index 568abed20eb2..25766e9035b1 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -1009,7 +1009,7 @@ static int fuse_bdi_init(struct fuse_conn *fc, struct 
super_block *sb)
if (err)
return err;
 
-   sb->s_bdi->ra_pages = (VM_MAX_READAHEAD * 1024) / PAGE_SIZE;
+   sb->s_bdi->ra_pages = VM_MAX_READAHEAD / PAGE_SIZE;
/* fuse does it's own writeback accounting */
sb->s_bdi->capabilities = BDI_CAP_NO_ACCT_WB | BDI_CAP_STRICTLIMIT;
 
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 5411de93a363..e2085eaceae9 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -26,6 +26,7 @@
 #include 
 #include 
 #include 
+#include <linux/sizes.h>
 
 struct mempolicy;
 struct anon_vma;
@@ -2396,8 +2397,8 @@ int __must_check write_one_page(struct page *page);
 void task_dirty_inc(struct task_struct *tsk);
 
 /* readahead.c */
-#define VM_MAX_READAHEAD   128 /* kbytes */
-#define VM_MIN_READAHEAD   16  /* kbytes (includes current page) */
+#define VM_MAX_READAHEAD   SZ_128K
+#define VM_MIN_READAHEAD   SZ_16K  /* includes current page */
 
 int force_page_cache_readahead(struct address_space *mapping, struct file 
*filp,
pgoff_t offset, unsigned long nr_to_read);
-- 
2.17.1



Re: [PATCH] workqueue: remove some duplicated includes

2018-12-01 Thread Nikolay Borisov



On 1.12.18 г. 8:46 ч., Frank Lee wrote:
> It seems that the trivial tree has not been maintained for some time.
> Are there aother tree to choose?


perhaps mmotm, maintained by Andrew (cc'ed)
> 
> MBR,
> Yangtao
> On Sat, Dec 1, 2018 at 2:36 AM Tejun Heo  wrote:
>>
>> On Mon, Nov 26, 2018 at 09:33:26AM -0500, Yangtao Li wrote:
>>> workqueue.h and kthread.h are included twice.It's unnecessary.
>>> hence just remove them.
>>>
>>> Signed-off-by: Yangtao Li 
>>
>> Acked-by: Tejun Heo 
>>
>> Can you route this through the trivial tree?
>>
>> Thanks.
>>
>> --
>> tejun



Re: [PATCH RFC 01/15] MIPS: replace **** with a hug

2018-11-30 Thread Nikolay Borisov



On 30.11.18 г. 23:50 ч., Jarkko Sakkinen wrote:
> On Fri, Nov 30, 2018 at 11:06:30PM +0200, Nikolay Borisov wrote:
>> I find it quite ridiculous that grown up people are engaging at yet
>> another coc-related conversation and are on the way to blowing it out of
>> proportions. Let's focus on improving the ACTUAL code what's the point
>> to bikeshed on the word?
>>
>> I've seen people before me comment that some comments are plainly out of
>> date and that warrants their removal but otherwise nitpicking things
>> like that. It seems common sense is very scant these days...
> 
> I have absolutely nothing against CoC.

I didn't mean you are against the CoC, I meant it's ridiculous to follow
it so closely.

And regarding 'fuck', I believe this is rather pertinent as to why this
patchset is unnecessary:

https://ia801406.us.archive.org/30/items/JackWagnerattr.MontyPythontheWordFuck/The_Word_Fuck.ogg

> 
> If you look up from LKML, there total zero posts from criticizing of
> having such. OK, maybe I find it bit too high-level, and yeah, this
> patch set has been useful to know where the line is. I can go through
> the commits and think more appropriate alternatives (thus the RFC-tag).
> 
> Thanks.
> 
> /Jarkko
> 



Re: [PATCH RFC 01/15] MIPS: replace **** with a hug

2018-11-30 Thread Nikolay Borisov



On 30.11.18 г. 21:57 ч., Jarkko Sakkinen wrote:
> On Fri, Nov 30, 2018 at 11:27:10AM -0800, Jarkko Sakkinen wrote:
>> In order to comply with the CoC, replace  with a hug.
>>
>> Signed-off-by: Jarkko Sakkinen 
> 
> Since v17 of the SGX patch set, my cover letters get cut for some
> reason. I receive them myself but they don't appear on kernel
> lists. I don't what the hug is going on :-/
> 
> Here's the cover letter for this patch set:
> 
> From 386420215a93d2eb5bbcb061e914cfb16f456de8 Mon Sep 17 00:00:00 2001
> From: Jarkko Sakkinen 
> Date: Fri, 30 Nov 2018 11:25:40 -0800
> Subject: [PATCH RFC 00/15] Zero s, hugload of hugs <3
> 
> In order to comply with the CoC, replace  with a hug.

I find it quite ridiculous that grown-up people are engaging in yet
another CoC-related conversation and are on the way to blowing it out of
proportion. Let's focus on improving the ACTUAL code; what's the point of
bikeshedding on the word?

I've seen people before me point out that some of those comments are plainly
out of date and that warrants their removal, but otherwise I see no point in
nitpicking things like that. It seems common sense is very scant these days...



> 
> Jarkko Sakkinen (15):
>   MIPS: replace  with a hug
>   Documentation: replace  with a hug
>   drm/nouveau: replace  with a hug
>   m68k: replace  with a hug
>   parisc: replace  with a hug
>   cpufreq: replace  with a hug
>   ide: replace  with a hug
>   media: replace  with a hug
>   mtd: replace  with a hug
>   net/sunhme: replace  with a hug
>   scsi: replace  with a hug
>   inotify: replace  with a hug
>   irq: replace  with a hug
>   lib: replace  with a hug
>   net: replace  with a hug
> 
>  Documentation/kernel-hacking/locking.rst  |  2 +-
>  arch/m68k/include/asm/sun3ints.h  |  2 +-
>  arch/mips/pci/ops-bridge.c| 24 +--
>  arch/mips/sgi-ip22/ip22-setup.c   |  2 +-
>  arch/parisc/kernel/sys_parisc.c   |  2 +-
>  drivers/cpufreq/powernow-k7.c |  2 +-
>  .../gpu/drm/nouveau/nvkm/subdev/bios/init.c   |  2 +-
>  .../nouveau/nvkm/subdev/pmu/fuc/macros.fuc|  2 +-
>  drivers/ide/cmd640.c  |  2 +-
>  drivers/media/i2c/bt819.c |  8 ---
>  drivers/mtd/mtd_blkdevs.c |  2 +-
>  drivers/net/ethernet/sun/sunhme.c |  4 ++--
>  drivers/scsi/qlogicpti.h  |  2 +-
>  fs/notify/inotify/inotify_user.c  |  2 +-
>  kernel/irq/timings.c  |  2 +-
>  lib/vsprintf.c|  2 +-
>  net/core/skbuff.c |  2 +-
>  17 files changed, 33 insertions(+), 31 deletions(-)
> 




[PATCH] fs: Don't opencode lru_to_page

2018-11-29 Thread Nikolay Borisov
Multiple filesystems opencode lru_to_page. Rectify this by moving the
macro from mm_inline (which is specific to lru stuff) to the more
generic mm.h header and start using the macro where appropriate.

No functional changes.

Signed-off-by: Nikolay Borisov 
Acked-by: Michal Hocko 
Reviewed-by: David Hildenbrand 
Link: https://lkml.kernel.org/r/20181129075301.29087-1-nbori...@suse.com
---

Hi Andrew, 

I believe mmotm is the best place for a patch like this, so please take it. It
already had a first incarnation as 2 patches, and Michal suggested squashing
them, so here it is.

Please take this patch
 fs/afs/file.c | 5 +++--
 fs/btrfs/extent_io.c  | 2 +-
 fs/ceph/addr.c| 5 ++---
 fs/cifs/file.c| 3 ++-
 fs/ext4/readpage.c| 2 +-
 fs/ocfs2/aops.c   | 3 ++-
 fs/orangefs/inode.c   | 2 +-
 include/linux/mm.h| 2 ++
 include/linux/mm_inline.h | 3 ---
 mm/swap.c | 2 +-
 10 files changed, 15 insertions(+), 14 deletions(-)

diff --git a/fs/afs/file.c b/fs/afs/file.c
index d6bc3f5d784b..323ae9912203 100644
--- a/fs/afs/file.c
+++ b/fs/afs/file.c
@@ -17,6 +17,7 @@
 #include 
 #include 
 #include 
+#include <linux/mm.h>
 #include "internal.h"
 
 static int afs_file_mmap(struct file *file, struct vm_area_struct *vma);
@@ -441,7 +442,7 @@ static int afs_readpages_one(struct file *file, struct 
address_space *mapping,
/* Count the number of contiguous pages at the front of the list.  Note
 * that the list goes prev-wards rather than next-wards.
 */
-   first = list_entry(pages->prev, struct page, lru);
+   first = lru_to_page(pages);
index = first->index + 1;
n = 1;
for (p = first->lru.prev; p != pages; p = p->prev) {
@@ -473,7 +474,7 @@ static int afs_readpages_one(struct file *file, struct 
address_space *mapping,
 * page at the end of the file.
 */
do {
-   page = list_entry(pages->prev, struct page, lru);
+   page = lru_to_page(pages);
list_del(&page->lru);
index = page->index;
if (add_to_page_cache_lru(page, mapping, index,
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 19f4b8fd654f..8332c5f4b1c3 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -4104,7 +4104,7 @@ int extent_readpages(struct address_space *mapping, 
struct list_head *pages,
u64 prev_em_start = (u64)-1;
 
for (page_idx = 0; page_idx < nr_pages; page_idx++) {
-   page = list_entry(pages->prev, struct page, lru);
+   page = lru_to_page(pages);
 
prefetchw(&page->flags);
list_del(&page->lru);
diff --git a/fs/ceph/addr.c b/fs/ceph/addr.c
index 8eade7a993c1..5d0c05e288cc 100644
--- a/fs/ceph/addr.c
+++ b/fs/ceph/addr.c
@@ -306,7 +306,7 @@ static int start_read(struct inode *inode, struct 
ceph_rw_context *rw_ctx,
struct ceph_osd_client *osdc =
&ceph_inode_to_client(inode)->client->osdc;
struct ceph_inode_info *ci = ceph_inode(inode);
-   struct page *page = list_entry(page_list->prev, struct page, lru);
+   struct page *page = lru_to_page(page_list);
struct ceph_vino vino;
struct ceph_osd_request *req;
u64 off;
@@ -333,8 +333,7 @@ static int start_read(struct inode *inode, struct 
ceph_rw_context *rw_ctx,
if (got)
ceph_put_cap_refs(ci, got);
while (!list_empty(page_list)) {
-   page = list_entry(page_list->prev,
- struct page, lru);
+   page = lru_to_page(page_list);
list_del(&page->lru);
put_page(page);
}
diff --git a/fs/cifs/file.c b/fs/cifs/file.c
index 74c33d5fafc8..b16a4d887d17 100644
--- a/fs/cifs/file.c
+++ b/fs/cifs/file.c
@@ -33,6 +33,7 @@
 #include 
 #include 
 #include 
+#include <linux/mm.h>
 #include 
 #include "cifsfs.h"
 #include "cifspdu.h"
@@ -3975,7 +3976,7 @@ readpages_get_pages(struct address_space *mapping, struct 
list_head *page_list,
 
INIT_LIST_HEAD(tmplist);
 
-   page = list_entry(page_list->prev, struct page, lru);
+   page = lru_to_page(page_list);
 
/*
 * Lock the page and put it in the cache. Since no one else
diff --git a/fs/ext4/readpage.c b/fs/ext4/readpage.c
index f461d75ac049..6aa282ee455a 100644
--- a/fs/ext4/readpage.c
+++ b/fs/ext4/readpage.c
@@ -128,7 +128,7 @@ int ext4_mpage_readpages(struct address_space *mapping,
 
prefetchw(&page->flags);
if (pages) {
-   page = list_entry(pages->prev, struct page, lru);
+   page = lru_to_page(pages);
list_del(&page->lru);
if (add_to_page_


[PATCH 1/2] mm: Move lru_to_page to mm.h

2018-11-28 Thread Nikolay Borisov
There are multiple places in the kernel which opencode this helper. This patch
moves it to the more generic mm.h header in preparation for using it there.
No functional changes.

Signed-off-by: Nikolay Borisov 
---
 include/linux/mm.h| 2 ++
 include/linux/mm_inline.h | 3 ---
 2 files changed, 2 insertions(+), 3 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 5411de93a363..47b4aa5bba93 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -146,6 +146,8 @@ extern int overcommit_kbytes_handler(struct ctl_table *, 
int, void __user *,
 /* test whether an address (unsigned long or pointer) is aligned to PAGE_SIZE 
*/
 #define PAGE_ALIGNED(addr) IS_ALIGNED((unsigned long)(addr), PAGE_SIZE)
 
+#define lru_to_page(head) (list_entry((head)->prev, struct page, lru))
+
 /*
  * Linux kernel virtual memory manager primitives.
  * The idea being to have a "virtual" mm in the same way
diff --git a/include/linux/mm_inline.h b/include/linux/mm_inline.h
index 10191c28fc04..04ec454d44ce 100644
--- a/include/linux/mm_inline.h
+++ b/include/linux/mm_inline.h
@@ -124,7 +124,4 @@ static __always_inline enum lru_list page_lru(struct page 
*page)
}
return lru;
 }
-
-#define lru_to_page(head) (list_entry((head)->prev, struct page, lru))
-
 #endif
-- 
2.17.1
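
For illustration, a minimal sketch of the consumer pattern this helper is
meant to replace -- not part of the patch itself; "pages" is assumed to be
the list_head handed to a ->readpages() implementation, with the pages
threaded on page->lru:

#include <linux/mm.h>
#include <linux/pagemap.h>

static void drain_readpages_list(struct list_head *pages)
{
        while (!list_empty(pages)) {
                /* was: list_entry(pages->prev, struct page, lru) */
                struct page *page = lru_to_page(pages);

                list_del(&page->lru);
                /* ... hand the page to the filesystem's read path ... */
                put_page(page);
        }
}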



[PATCH] tracing: Export trace_dump_stack to modules

2018-10-17 Thread Nikolay Borisov
There is no reason for this function to be unexported and it's a useful
debugging aid.

Signed-off-by: Nikolay Borisov 
---
 kernel/trace/trace.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index bf6f1d70484d..15c7a7d01505 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -2727,6 +2727,7 @@ void trace_dump_stack(int skip)
__ftrace_trace_stack(global_trace.trace_buffer.buffer,
 flags, skip, preempt_count(), NULL);
 }
+EXPORT_SYMBOL_GPL(trace_dump_stack);
 
 static DEFINE_PER_CPU(int, user_stack_count);
 
-- 
2.7.4
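
For illustration, how a module might use the symbol once it is exported --
the module and the hook below are made up, only trace_dump_stack() and its
int skip argument are the real interface:

#include <linux/kernel.h>
#include <linux/module.h>

/* Hypothetical debug hook: record the current call chain in the ftrace
 * ring buffer instead of spamming dmesg with dump_stack(). */
static void my_debug_hook(void)
{
        trace_dump_stack(0);    /* 0 = do not skip any stack frames */
}

MODULE_LICENSE("GPL");  /* needed, since the export is EXPORT_SYMBOL_GPL */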



Re: [PATCH] lib/Kconfig.debug: add a comment to PROVE_LOCKING impact

2018-10-09 Thread Nikolay Borisov



On  9.10.2018 18:39, Lukasz Luba wrote:
> This patch add some comment related to performance impact,
> which can be really big (x3 times slower context switch).
> 
> Signed-off-by: Lukasz Luba 

I don't think this brings any value. lockdep is a debugging aid aimed at
developers. A developer should already be aware that it has a performance
impact, and the "x3" classification doesn't really add anything.
> ---
>  lib/Kconfig.debug | 11 +++
>  1 file changed, 11 insertions(+)
> 
> diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
> index 4966c4f..9e67a2a3 100644
> --- a/lib/Kconfig.debug
> +++ b/lib/Kconfig.debug
> @@ -1088,6 +1088,9 @@ config PROVE_LOCKING
>the proof of observed correctness is also maintained for an
>arbitrary combination of these separate locking variants.
>  
> +  This feature enables LOCKDEP which can harm system performance
> +  even x3 times.
> +
>For more details, see Documentation/locking/lockdep-design.txt.
>  
>  config LOCK_STAT
> @@ -1112,6 +1115,10 @@ config LOCK_STAT
>CONFIG_LOCK_STAT defines "contended" and "acquired" lock events.
>(CONFIG_LOCKDEP defines "acquire" and "release" events.)
>  
> +  This feature enables LOCKDEP which can harm system performance
> +  even x3 times.
> +  For more details, see Documentation/locking/lockdep-design.txt.
> +
>  config DEBUG_RT_MUTEXES
>   bool "RT Mutex debugging, deadlock detection"
>   depends on DEBUG_KERNEL && RT_MUTEXES
> @@ -1175,6 +1182,10 @@ config DEBUG_LOCK_ALLOC
>spin_lock_init()/mutex_init()/etc., or whether there is any lock
>held during task exit.
>  
> +  This feature enables LOCKDEP which can harm system performance
> +  even x3 times.
> +  For more details, see Documentation/locking/lockdep-design.txt.
> +
>  config LOCKDEP
>   bool
>   depends on DEBUG_KERNEL && LOCK_DEBUGGING_SUPPORT
> 


Re: [PATCHv3 2/2] btrfs: change remove_extent_mapping to be void function

2018-09-12 Thread Nikolay Borisov



On 13.09.2018 06:01, zhong jiang wrote:
> remove_extent_mapping use the variable "ret" for return value,
> but it is not modified after initialzation. Further, I find that
> any of the callers do not handle the return value, so it is safe
> to drop the unneeded "ret" and make it to be void function.
> 
> Signed-off-by: zhong jiang 

Reviewed-by: Nikolay Borisov 
> ---
>  fs/btrfs/extent_map.c | 5 +
>  fs/btrfs/extent_map.h | 2 +-
>  2 files changed, 2 insertions(+), 5 deletions(-)
> 
> diff --git a/fs/btrfs/extent_map.c b/fs/btrfs/extent_map.c
> index c4e2347..81b6a08 100644
> --- a/fs/btrfs/extent_map.c
> +++ b/fs/btrfs/extent_map.c
> @@ -428,16 +428,13 @@ struct extent_map *search_extent_mapping(struct 
> extent_map_tree *tree,
>   * Removes @em from @tree.  No reference counts are dropped, and no checks
>   * are done to see if the range is in use
>   */
> -int remove_extent_mapping(struct extent_map_tree *tree, struct extent_map 
> *em)
> +void remove_extent_mapping(struct extent_map_tree *tree, struct extent_map 
> *em)
>  {
> - int ret = 0;
> -
>   WARN_ON(test_bit(EXTENT_FLAG_PINNED, &em->flags));
>   rb_erase_cached(&em->rb_node, &tree->map);
>   if (!test_bit(EXTENT_FLAG_LOGGING, &em->flags))
>   list_del_init(&em->list);
>   RB_CLEAR_NODE(&em->rb_node);
> - return ret;
>  }
>  
>  void replace_extent_mapping(struct extent_map_tree *tree,
> diff --git a/fs/btrfs/extent_map.h b/fs/btrfs/extent_map.h
> index df4e1a5..8798745 100644
> --- a/fs/btrfs/extent_map.h
> +++ b/fs/btrfs/extent_map.h
> @@ -78,7 +78,7 @@ struct extent_map *lookup_extent_mapping(struct 
> extent_map_tree *tree,
>u64 start, u64 len);
>  int add_extent_mapping(struct extent_map_tree *tree,
>  struct extent_map *em, int modified);
> -int remove_extent_mapping(struct extent_map_tree *tree, struct extent_map 
> *em);
> +void remove_extent_mapping(struct extent_map_tree *tree, struct extent_map 
> *em);
>  void replace_extent_mapping(struct extent_map_tree *tree,
>   struct extent_map *cur,
>   struct extent_map *new,
> 
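
For illustration, a typical call site already ignores the return value, so
nothing changes for callers; a rough sketch (the variable names are assumed,
not taken from a specific caller):

        write_lock(&tree->lock);
        remove_extent_mapping(tree, em);
        write_unlock(&tree->lock);
        free_extent_map(em);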


Re: [PATCH 8/9] dt-bindings: interrupt-controller: RISC-V PLIC documentation

2018-08-02 Thread Nikolay Borisov



On 26.07.2018 17:37, Christoph Hellwig wrote:
> From: Palmer Dabbelt 
> 
> This patch adds documentation for the platform-level interrupt
> controller (PLIC) found in all RISC-V systems.  This interrupt
> controller routes interrupts from all the devices in the system to each
> hart-local interrupt controller.
> 
> Note: the DTS bindings for the PLIC aren't set in stone yet, as we might
> want to change how we're specifying holes in the hart list.
> 
> Signed-off-by: Palmer Dabbelt 
> ---
>  .../interrupt-controller/riscv,plic0.txt  | 55 +++
>  1 file changed, 55 insertions(+)
>  create mode 100644 
> Documentation/devicetree/bindings/interrupt-controller/riscv,plic0.txt
> 
> diff --git 
> a/Documentation/devicetree/bindings/interrupt-controller/riscv,plic0.txt 
> b/Documentation/devicetree/bindings/interrupt-controller/riscv,plic0.txt
> new file mode 100644
> index ..99cd359dbd43
> --- /dev/null
> +++ b/Documentation/devicetree/bindings/interrupt-controller/riscv,plic0.txt
> @@ -0,0 +1,55 @@
> +RISC-V Platform-Level Interrupt Controller (PLIC)
> +-
> +
> +The RISC-V supervisor ISA specification allows for the presence of a
> +platform-level interrupt controller (PLIC).   The PLIC connects all external
> +interrupts in the system to all hart contexts in the system, via the external
> +interrupt source in each hart's hart-local interrupt controller (HLIC).  A 
> hart
> +context is a privilege mode in a hardware execution thread.  For example, in
> +an 4 core system with 2-way SMT, you have 8 harts and probably at least two
> +privilege modes per hart; machine mode and supervisor mode.
> +
> +Each interrupt can be enabled on per-context basis. Any context can claim
> +a pending enabled interrupt and then release it once it has been handled.
> +
> +Each interrupt has a configurable priority. Higher priority interrupts are
> +serviced firs. Each context can specify a priority threshold. Interrupts
^^ missing 't'
> +with priority below this threshold will not cause the PLIC to raise its
> +interrupt line leading to the context.
> +
> +While the PLIC supports both edge-triggered and level-triggered interrupts,
> +interrupt handlers are oblivious to this distinction and therefor it is not
^^
missing 'e'
> +specific in the PLIC device-tree binding.
> +
> +While the RISC-V ISA doesn't specify a memory layout for the PLIC, the
> +"riscv,plic0" device is a concrete implementation of the PLIC that contains a
> +specific memory layout.  More details about the memory layout of the
> +"riscv,plic0" device can be found as a comment in the device driver, or as 
> part
> +of the SiFive U5 Coreplex Series Manual (page 22 of the PDF of version 1.0)
> +
> +
> +Required properties:
> +- compatible : "riscv,plic0"
> +- #address-cells : should be <0>
> +- #interrupt-cells : should be <1>
> +- interrupt-controller : Identifies the node as an interrupt controller
> +- reg : Should contain 1 register range (address and length)
> +- interrupts-extended : Specifies which contexts are connected to the PLIC,
> +  with "-1" specifying that a context is not present.
> +
> +Example:
> +
> + plic: interrupt-controller@c00 {
> + #address-cells = <0>;
> + #interrupt-cells = <1>;
> + compatible = "riscv,plic0";
> + interrupt-controller;
> + interrupts-extended = <
> +  11
> +  11  9
> +  11  9
> +  11  9
> +  11  9>;
> + reg = <0xc00 0x400>;
> + riscv,ndev = <10>;
> + };
> 


Re: [PATCH 14/16] btrfs: simplify btrfs_iget()

2018-07-30 Thread Nikolay Borisov
[ADDED David Sterba and Linux-btrfs ml to cc. ]

On 30.07.2018 01:04, Al Viro wrote:
> From: Al Viro 
> 
> don't open-code iget_failed(), don't bother with btrfs_free_path(NULL),
> move handling of positive return values of btrfs_lookup_inode() from
> btrfs_read_locked_inode() to btrfs_iget() and kill now obviously pointless
> ASSERT() in there.
> 
> Signed-off-by: Al Viro 

Reviewed-by: Nikolay Borisov 

> ---
>  fs/btrfs/inode.c | 24 
>  1 file changed, 8 insertions(+), 16 deletions(-)
> 
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index 8f0b2592feb0..388b2dba68a0 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -3610,18 +3610,15 @@ static int btrfs_read_locked_inode(struct inode 
> *inode)
>   filled = true;
>  
>   path = btrfs_alloc_path();
> - if (!path) {
> - ret = -ENOMEM;
> - goto make_bad;
> - }
> + if (!path)
> + return -ENOMEM;
>  
>   memcpy(&location, &BTRFS_I(inode)->location, sizeof(location));
>  
>   ret = btrfs_lookup_inode(NULL, root, path, &location, 0);
>   if (ret) {
> - if (ret > 0)
> - ret = -ENOENT;
> - goto make_bad;
> + btrfs_free_path(path);
> + return ret;
>   }
>  
>   leaf = path->nodes[0];
> @@ -3774,10 +3771,6 @@ static int btrfs_read_locked_inode(struct inode *inode)
>  
>   btrfs_sync_inode_flags_to_i_flags(inode);
>   return 0;
> -
> -make_bad:
> - btrfs_free_path(path);
> - return ret;
>  }
>  
>  /*
> @@ -5713,11 +5706,10 @@ struct inode *btrfs_iget(struct super_block *s, 
> struct btrfs_key *location,
>   if (new)
>   *new = 1;
>   } else {
> - make_bad_inode(inode);
> - unlock_new_inode(inode);
> - iput(inode);
> - ASSERT(ret < 0);
> - inode = ERR_PTR(ret < 0 ? ret : -ESTALE);
> + iget_failed(inode);
> + if (ret > 0)
> + ret = -ENOENT;
> + inode = ERR_PTR(ret);
>   }
>   }
>  
> 


[PATCH] jfs: Fix buffer overrun in ea_get

2018-06-18 Thread Nikolay Borisov
Currently ea_buf->xattr buffer is allocated with min(min_size, ea_size).
This is wrong since after the xattr buffer is allocated the ->max_size
variable is actually rounded up to the next ->s_blocksize boundary. Fix this
by using the rounded-up max_size as the allocation size.

Suggested-by: Shankara Pailoor 
Reported-by: Shankara Pailoor 
CC: shankarapail...@gmail.com
Signed-off-by: Nikolay Borisov 
---
Hello David, 

I'm sending you the patch for the issue which was originally reported and 
suggested by Shankar.  I won't usually got and override the original 
author of a patch but given the clear lack of experience with upstream (missing 
SOB line, no changelog explaining the change etc) and the 
fact there is already a CVE for this issue (using syzkaller for quick CVE 
generation seems to be all the rage these days, go figure...) I'd rather have 
an upstream, backportable version sooner rather than later. 

 fs/jfs/xattr.c | 8 
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/fs/jfs/xattr.c b/fs/jfs/xattr.c
index c60f3d32ee91..96b9355ff69a 100644
--- a/fs/jfs/xattr.c
+++ b/fs/jfs/xattr.c
@@ -493,14 +493,14 @@ static int ea_get(struct inode *inode, struct ea_buffer 
*ea_buf, int min_size)
 * To keep the rest of the code simple.  Allocate a
 * contiguous buffer to work with
 */
-   ea_buf->xattr = kmalloc(size, GFP_KERNEL);
-   if (ea_buf->xattr == NULL)
-   return -ENOMEM;
-
ea_buf->flag = EA_MALLOC;
ea_buf->max_size = (size + sb->s_blocksize - 1) &
~(sb->s_blocksize - 1);
 
+   ea_buf->xattr = kmalloc(ea_buf->max_size, GFP_KERNEL);
+   if (ea_buf->xattr == NULL)
+   return -ENOMEM;
+
if (ea_size == 0)
return 0;
 
-- 
2.7.4
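
To make the overrun concrete, a worked example with made-up numbers:

        /* sb->s_blocksize = 4096, size = 100
         *
         * before: kmalloc(100), then max_size = roundup(100, 4096) = 4096
         *         -> callers may write up to 4096 bytes into a 100-byte buffer
         * after:  max_size = 4096 is computed first and kmalloc(4096) is used
         *         -> the buffer is always at least max_size bytes long
         */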



Re: [PATCH] doc: add description to dirtytime_expire_seconds

2018-06-15 Thread Nikolay Borisov



On 31.05.2018 02:56, Yang Shi wrote:
> commit 1efff914afac8a965ad63817ecf8861a927c2ace ("fs: add
> dirtytime_expire_seconds sysctl") introduced dirtytime_expire_seconds
> knob, but there is not description about it in
> Documentation/sysctl/vm.txt.
> 
> Add the description for it.
> 
> Cc: Theodore Ts'o 
> Signed-off-by: Yang Shi 
> ---
> I didn't dig into the old review discussion about why the description
> was not added at the first place. I'm supposed every knob under /proc/sys
> should have a brief description.
> 
>  Documentation/sysctl/vm.txt | 11 +++
>  1 file changed, 11 insertions(+)
> 
> diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt
> index 17256f2..f4f4f9c 100644
> --- a/Documentation/sysctl/vm.txt
> +++ b/Documentation/sysctl/vm.txt
> @@ -27,6 +27,7 @@ Currently, these files are in /proc/sys/vm:
>  - dirty_bytes
>  - dirty_expire_centisecs
>  - dirty_ratio
> +- dirtytime_expire_seconds
>  - dirty_writeback_centisecs
>  - drop_caches
>  - extfrag_threshold
> @@ -178,6 +179,16 @@ The total available memory is not equal to total system 
> memory.
>  
>  ==
>  
> +dirtytime_expire_seconds
> +
> +When a lazytime inode is constantly having its pages dirtied, it with an

The second part of this sentence, after the comma doesn't parse.

> +updated timestamp will never get chance to be written out.  This tunable
> +is used to define when dirty inode is old enough to be eligible for
> +writeback by the kernel flusher threads. And, it is also used as the
> +interval to wakeup dirtytime_writeback thread. It is expressed in seconds.

I think the final sentence is a bit redundant, given the very explicit
name of the knob.

> +
> +==
> +
>  dirty_writeback_centisecs
>  
>  The kernel flusher threads will periodically wake up and write `old' data
> 


Re: [PATCH] doc: document scope NOFS, NOIO APIs

2018-05-28 Thread Nikolay Borisov


On 25.05.2018 10:52, Michal Hocko wrote:
> On Thu 24-05-18 09:37:18, Randy Dunlap wrote:
>> On 05/24/2018 04:43 AM, Michal Hocko wrote:
> [...]
>>> +The traditional way to avoid this deadlock problem is to clear __GFP_FS
>>> +resp. __GFP_IO (note the later implies clearing the first as well) in
>>
>> latter
> 
> ?
> No I really meant that clearing __GFP_IO implies __GFP_FS clearing
Sorry to barge in like that, but Randy is right.




https://www.merriam-webster.com/dictionary/latter

" of, relating to, or being the second of two groups or things or the
last of several groups or things referred to




> 
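
For readers without the patch body at hand, the scope API being documented
brackets a region instead of passing GFP_NOFS to each allocation; a minimal
sketch (the function and lock are made up, the two memalloc_nofs helpers are
the real interface from <linux/sched/mm.h>):

#include <linux/mutex.h>
#include <linux/sched/mm.h>
#include <linux/slab.h>

static void fs_critical_section(struct mutex *lock)
{
        unsigned int noreclaim_flag;
        void *p;

        mutex_lock(lock);
        noreclaim_flag = memalloc_nofs_save();
        /* any allocation in this scope behaves as if GFP_NOFS was passed,
         * so direct reclaim cannot re-enter the filesystem and deadlock
         * on "lock" */
        p = kmalloc(128, GFP_KERNEL);
        kfree(p);
        memalloc_nofs_restore(noreclaim_flag);
        mutex_unlock(lock);
}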


Re: [btrfs_put_block_group] WARNING: CPU: 1 PID: 14674 at fs/btrfs/disk-io.c:3675 free_fs_root+0xc2/0xd0 [btrfs]

2018-05-03 Thread Nikolay Borisov


On  3.05.2018 11:07, Anand Jain wrote:
> 
> 
> On 04/19/2018 03:25 PM, Nikolay Borisov wrote:
>>
>>
>> On 19.04.2018 08:32, Fengguang Wu wrote:
>>> Hello,
>>>
>>> FYI this happens in mainline kernel and at least dates back to v4.16 .
>>>
>>> It's rather rare error and happens when running xfstests.
>>
>> Yeah, so this is something which only recently was characterised as
>> leaking delalloc inodes. I can easily reproduce this when running
>> generic/019 test. A fix is in the works.
> 
> Hi Nikolay
>  Is the fix available? generic/475 also notice the same thing.

Yes, the series is [PATCH 0/5] Fix delalloc inodes leaking on btrfs unmount

> 
> -Anand
> 
> 
>>>
>>> [  438.327552] BTRFS: error (device dm-0) in
>>> __btrfs_free_extent:6962: errno=-5 IO failure
>>> [  438.336415] BTRFS: error (device dm-0) in
>>> btrfs_run_delayed_refs:3070: errno=-5 IO failure
>>> [  438.345590] BTRFS error (device dm-0): pending csums is 1028096
>>> [  438.369254] BTRFS error (device dm-0): cleaner transaction attach
>>> returned -30
>>> [  438.377674] BTRFS info (device dm-0): at unmount delalloc count 98304
>>> [  438.385166] WARNING: CPU: 1 PID: 14674 at fs/btrfs/disk-io.c:3675
>>> free_fs_root+0xc2/0xd0 [btrfs]
>>> [  438.396562] Modules linked in: dm_snapshot dm_thin_pool
>>> dm_persistent_data dm_bio_prison dm_bufio dm_flakey dm_mod netconsole
>>> btrfs xor zstd_decompress zstd_compress xxhash raid6_pq sd_mod sg
>>> snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_codec_generic
>>> ata_generic pata_acpi intel_rapl x86_pkg_temp_thermal
>>> intel_powerclamp coretemp snd_hda_intel kvm_intel snd_hda_codec kvm
>>> irqbypass crct10dif_pclmul eeepc_wmi crc32_pclmul crc32c_intel
>>> ghash_clmulni_intel pata_via asus_wmi sparse_keymap snd_hda_core
>>> ata_piix snd_hwdep ppdev rfkill wmi_bmof i915 pcbc snd_pcm snd_timer
>>> drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops snd
>>> aesni_intel parport_pc crypto_simd pcspkr libata soundcore cryptd
>>> glue_helper drm wmi parport video shpchp ip_tables
>>> [  438.467607] CPU: 1 PID: 14674 Comm: umount Not tainted 4.17.0-rc1 #1
>>> [  438.474798] Hardware name: System manufacturer System Product
>>> Name/P8H67-M PRO, BIOS 1002 04/01/2011
>>> [  438.484804] RIP: 0010:free_fs_root+0xc2/0xd0 [btrfs]
>>> [  438.490590] RSP: 0018:c90008b0fda8 EFLAGS: 00010282
>>> [  438.496641] RAX: 88017c5954b0 RBX: 880137f6d800 RCX:
>>> 00018013
>>> [  438.504652] RDX: 0001 RSI: ea0006e93600 RDI:
>>> 
>>> [  438.512679] RBP: 88017b36 R08: 8801ba4dd000 R09:
>>> 00018013
>>> [  438.520644] R10: c90008b0fc70 R11:  R12:
>>> c90008b0fdd0
>>> [  438.528657] R13: 88017b360080 R14: c90008b0fdc8 R15:
>>> 
>>> [  438.536662] FS:  7f06c1a80fc0() GS:8801bfa8()
>>> knlGS:
>>> [  438.545582] CS:  0010 DS:  ES:  CR0: 80050033
>>> [  438.552157] CR2: 7f06c12b0260 CR3: 00017d3de004 CR4:
>>> 000606e0
>>> [  438.642653] RAX:  RBX: 0234a2d0 RCX:
>>> 7f06c1359cf7
>>> [  438.793559] CPU: 1 PID: 14674 Comm: umount Tainted: G   
>>> W 4.17.0-rc1 #1
>>> [  438.802152] Hardware name: System manufacturer System Product
>>> Name/P8H67-M PRO, BIOS 1002 04/01/2011
>>> [  438.812108] RIP: 0010:btrfs_put_block_group+0x41/0x60 [btrfs]
>>> [  438.819364] RSP: 0018:c90008b0fde0 EFLAGS: 00010206
>>> [  438.825378] RAX:  RBX: 8801abf63000 RCX:
>>> e38e38e38e38e38f
>>> [  438.833307] RDX: 0001 RSI: 09f6 RDI:
>>> 8801abf63000
>>> [  438.841230] RBP: 88017b36 R08: 88017d3b7750 R09:
>>> 000180380010
>>> [  438.849133] R10: c90008b0fca0 R11:  R12:
>>> 8801abf63000
>>> [  438.857047] R13: 88017b3600a0 R14: 8801abf630e0 R15:
>>> dead0100
>>> [  438.864943] FS:  7f06c1a80fc0() GS:8801bfa8()
>>> knlGS:
>>> [  438.873793] CS:  0010 DS:  ES:  CR0: 80050033
>>> [  438.880320] CR2: 7f06c12b0260 CR3: 00017d3de004 CR4:
>>> 000606e0
>>> [  438.888226] Call Trace:
>>> [  438.891454]  btrfs_free_block_groups+0x138/0x3d0 [btrfs]

Re: Moving unmaintained filesystems to staging

2018-04-25 Thread Nikolay Borisov


On 25.04.2018 23:30, David Sterba wrote:
> On Wed, Apr 25, 2018 at 08:46:02AM -0700, Matthew Wilcox wrote:
>> Recently ncpfs got moved to staging.  Also recently, we had some fuzzer
>> developers report bugs in hfs, which they deem a security hole because
>> Ubuntu attempts to automount an inserted USB device as hfs.
>>
>> We have no maintainer for hfs, and no likely prospect of anyone stepping
>> up soon to become hfs maintainer.  I think it's irresponsible of us
>> to present unmaintained code on an equal basis with filesystems under
>> active maintenance like ext2.
>>
>> I therefore propose we move the following filesystems which are explicitly
>> listed as Orphaned to staging:
>>
>> affs - Amiga filesystem.
>> efs - old SGI filesystem predating XFS, used on CDs for a while.
>> hfs - Mac filesystem.
>> hfsplus - Mac filesystem.
>>
>> I further propose we move the following filesystems which have no entry
>> in MAINTAINERS to staging:
>>
>> adfs - Acorn filesystem from the 1990s.
>> minix
>> qnx6
> 
> I had similar toughts some time ago while browsing the fs/ directory.
> Access to the filesystem images can be reimplemented in FUSE, but other
> than that, I don't think the in-kernel code would be missed.
> 
> It's hard to know how many users are there. I was curious eg. about bfs,
> befs, coda or feevxfs, looked at the last commits and searched around
> web if there are any mentions or user community. But as long as there's
> somebody listed in MAINTAINERS, the above are not candidates for moving
> to staging or deletion.
> 

I think the presence of a MAINTAINERS entry is necessary but not sufficient.
What if the maintainer has long lost interest but just didn't bother
updating the file? At the very least maintainers should be pinged and
asked whether they are still maintaining the respective piece of code.


Re: [btrfs_put_block_group] WARNING: CPU: 1 PID: 14674 at fs/btrfs/disk-io.c:3675 free_fs_root+0xc2/0xd0 [btrfs]

2018-04-19 Thread Nikolay Borisov


On 19.04.2018 08:32, Fengguang Wu wrote:
> Hello,
> 
> FYI this happens in mainline kernel and at least dates back to v4.16 .
> 
> It's rather rare error and happens when running xfstests.

Yeah, so this is something which only recently was characterised as
leaking delalloc inodes. I can easily reproduce this when running
generic/019 test. A fix is in the works.

> 
> [  438.327552] BTRFS: error (device dm-0) in __btrfs_free_extent:6962: 
> errno=-5 IO failure
> [  438.336415] BTRFS: error (device dm-0) in btrfs_run_delayed_refs:3070: 
> errno=-5 IO failure
> [  438.345590] BTRFS error (device dm-0): pending csums is 1028096
> [  438.369254] BTRFS error (device dm-0): cleaner transaction attach returned 
> -30
> [  438.377674] BTRFS info (device dm-0): at unmount delalloc count 98304
> [  438.385166] WARNING: CPU: 1 PID: 14674 at fs/btrfs/disk-io.c:3675 
> free_fs_root+0xc2/0xd0 [btrfs]
> [  438.396562] Modules linked in: dm_snapshot dm_thin_pool dm_persistent_data 
> dm_bio_prison dm_bufio dm_flakey dm_mod netconsole btrfs xor zstd_decompress 
> zstd_compress xxhash raid6_pq sd_mod sg snd_hda_codec_hdmi 
> snd_hda_codec_realtek snd_hda_codec_generic ata_generic pata_acpi intel_rapl 
> x86_pkg_temp_thermal intel_powerclamp coretemp snd_hda_intel kvm_intel 
> snd_hda_codec kvm irqbypass crct10dif_pclmul eeepc_wmi crc32_pclmul 
> crc32c_intel ghash_clmulni_intel pata_via asus_wmi sparse_keymap snd_hda_core 
> ata_piix snd_hwdep ppdev rfkill wmi_bmof i915 pcbc snd_pcm snd_timer 
> drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops snd aesni_intel 
> parport_pc crypto_simd pcspkr libata soundcore cryptd glue_helper drm wmi 
> parport video shpchp ip_tables
> [  438.467607] CPU: 1 PID: 14674 Comm: umount Not tainted 4.17.0-rc1 #1
> [  438.474798] Hardware name: System manufacturer System Product Name/P8H67-M 
> PRO, BIOS 1002 04/01/2011
> [  438.484804] RIP: 0010:free_fs_root+0xc2/0xd0 [btrfs]
> [  438.490590] RSP: 0018:c90008b0fda8 EFLAGS: 00010282
> [  438.496641] RAX: 88017c5954b0 RBX: 880137f6d800 RCX: 
> 00018013
> [  438.504652] RDX: 0001 RSI: ea0006e93600 RDI: 
> 
> [  438.512679] RBP: 88017b36 R08: 8801ba4dd000 R09: 
> 00018013
> [  438.520644] R10: c90008b0fc70 R11:  R12: 
> c90008b0fdd0
> [  438.528657] R13: 88017b360080 R14: c90008b0fdc8 R15: 
> 
> [  438.536662] FS:  7f06c1a80fc0() GS:8801bfa8() 
> knlGS:
> [  438.545582] CS:  0010 DS:  ES:  CR0: 80050033
> [  438.552157] CR2: 7f06c12b0260 CR3: 00017d3de004 CR4: 
> 000606e0
> [  438.642653] RAX:  RBX: 0234a2d0 RCX: 
> 7f06c1359cf7
> [  438.793559] CPU: 1 PID: 14674 Comm: umount Tainted: GW 
> 4.17.0-rc1 #1
> [  438.802152] Hardware name: System manufacturer System Product Name/P8H67-M 
> PRO, BIOS 1002 04/01/2011
> [  438.812108] RIP: 0010:btrfs_put_block_group+0x41/0x60 [btrfs]
> [  438.819364] RSP: 0018:c90008b0fde0 EFLAGS: 00010206
> [  438.825378] RAX:  RBX: 8801abf63000 RCX: 
> e38e38e38e38e38f
> [  438.833307] RDX: 0001 RSI: 09f6 RDI: 
> 8801abf63000
> [  438.841230] RBP: 88017b36 R08: 88017d3b7750 R09: 
> 000180380010
> [  438.849133] R10: c90008b0fca0 R11:  R12: 
> 8801abf63000
> [  438.857047] R13: 88017b3600a0 R14: 8801abf630e0 R15: 
> dead0100
> [  438.864943] FS:  7f06c1a80fc0() GS:8801bfa8() 
> knlGS:
> [  438.873793] CS:  0010 DS:  ES:  CR0: 80050033
> [  438.880320] CR2: 7f06c12b0260 CR3: 00017d3de004 CR4: 
> 000606e0
> [  438.888226] Call Trace:
> [  438.891454]  btrfs_free_block_groups+0x138/0x3d0 [btrfs]
> [  438.897569]  close_ctree+0x13b/0x2f0 [btrfs]
> [  438.902618]  generic_shutdown_super+0x6c/0x120:
>   __read_once_size at 
> include/linux/compiler.h:188
>(inlined by) list_empty at 
> include/linux/list.h:203
>(inlined by) 
> generic_shutdown_super at fs/super.c:442
> [  438.907801]  kill_anon_super+0xe/0x20:
>   kill_anon_super at 
> fs/super.c:1038
> [  438.912223]  btrfs_kill_super+0x13/0x100 [btrfs]
> [  438.917598]  deactivate_locked_super+0x3f/0x70:
>   deactivate_locked_super at 
> fs/super.c:320
> [  438.922757]  cleanup_mnt+0x3b/0x70:
>   cleanup_mnt at 
> fs/namespace.c:1174
> [  438.926879]  task_work_run+0xa3/0xe0:
>   task_work_run at 
> kernel/task_work.c:115 (discriminator 1)
> [  438.931205]  exit_to_usermode_loop+0x9e/0xa0:
>   


Re: [PATCH] fs/dcache.c: re-add cond_resched() in shrink_dcache_parent()

2018-04-14 Thread Nikolay Borisov


On 14.04.2018 00:14, Andrew Morton wrote:
> On Fri, 13 Apr 2018 13:28:23 -0700 Khazhismel Kumykov  
> wrote:
> 
>> shrink_dcache_parent may spin waiting for a parallel shrink_dentry_list.
>> In this case we may have 0 dentries to dispose, so we will never
>> schedule out while waiting for the parallel shrink_dentry_list to
>> complete.
>>
>> Tested that this fixes syzbot reports of stalls in shrink_dcache_parent()
> 
> Well I guess the patch is OK as a stopgap, but things seem fairly
> messed up in there.  shrink_dcache_parent() shouldn't be doing a
> busywait, waiting for the concurrent shrink_dentry_list().
> 
> Either we should be waiting (sleeping) for the concurrent operation to
> complete or we should just bail out of shrink_dcache_parent(), perhaps
> with 
> 
>   if (list_empty())
>   break;
> 
> or similar.  Dunno.

I agree; however, not being a dcache expert I'd refrain from touching
it, since it seems to be rather fragile. Perhaps Al could take a look in
there?

> 
> 
> That block comment over `struct select_data' is not a good one.  "It
> returns zero iff...".  *What* returns zero?  select_collect()?  No it
> doesn't, it returns an `enum d_walk_ret'.  Perhaps the comment is
> trying to refer to select_data.found.  And the real interpretation of
> select_data.found is, umm, hard to describe.  "Counts the number of
> dentries which are on a shrink list or which were moved to the dispose
> list".  Why?  What's that all about?
> 
> This code needs a bit of thought, documentation and perhaps a redo,
> I suspect.
> 
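
For reference, the shape of the loop being discussed, with the re-added
cond_resched() -- a sketch of the fs/dcache.c internals as they looked at
the time (select_data, select_collect and d_walk are dcache-private), not
the exact upstream diff:

        for (;;) {
                struct select_data data;

                INIT_LIST_HEAD(&data.dispose);
                data.start = parent;
                data.found = 0;

                d_walk(parent, &data, select_collect, NULL);
                if (!data.found)
                        break;

                shrink_dentry_list(&data.dispose);
                cond_resched();    /* yield instead of busy-waiting on a
                                      parallel shrink_dentry_list() */
        }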

