Re: [PATCH v3 05/10] writeback: add counters for metadata usage

2017-12-18 Thread Jan Kara
On Mon 11-12-17 16:55:30, Josef Bacik wrote:
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index 356a814e7c8e..48de090f5a07 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -179,9 +179,19 @@ enum node_stat_item {
>   NR_VMSCAN_IMMEDIATE,/* Prioritise for reclaim when writeback ends */
>   NR_DIRTIED, /* page dirtyings since bootup */
>   NR_WRITTEN, /* page writings since bootup */
> + NR_METADATA_DIRTY_BYTES,/* Metadata dirty bytes */
> + NR_METADATA_WRITEBACK_BYTES,/* Metadata writeback bytes */
> + NR_METADATA_BYTES,  /* total metadata bytes in use. */
>   NR_VM_NODE_STAT_ITEMS
>  };

Please add here something like: "Warning: These counters will overflow on
32-bit machines if we ever have more than 2G of metadata on such a machine!
But the kernel won't be able to address that much metadata easily either, so
it should not be a real issue."
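
Folded into the quoted enum, the suggested comment could sit roughly like
this (only a sketch of placement, reusing the wording above):

        NR_DIRTIED,             /* page dirtyings since bootup */
        NR_WRITTEN,             /* page writings since bootup */
        /*
         * Warning: the following counters are stored in bytes and will
         * overflow on 32-bit machines if we ever have more than 2G of
         * metadata on such a machine!  But the kernel won't be able to
         * address that much metadata easily either, so it should not be
         * a real issue.
         */
        NR_METADATA_DIRTY_BYTES,        /* Metadata dirty bytes */
        NR_METADATA_WRITEBACK_BYTES,    /* Metadata writeback bytes */
        NR_METADATA_BYTES,              /* total metadata bytes in use. */
        NR_VM_NODE_STAT_ITEMS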

> diff --git a/mm/vmstat.c b/mm/vmstat.c
> index 4bb13e72ac97..0b32e6381590 100644
> --- a/mm/vmstat.c
> +++ b/mm/vmstat.c
> @@ -273,6 +273,13 @@ void __mod_node_page_state(struct pglist_data *pgdat, 
> enum node_stat_item item,
>  
>   t = __this_cpu_read(pcp->stat_threshold);
>  
> + /*
> +  * If this item is counted in bytes and not pages adjust the threshold
> +  * accordingly.
> +  */
> + if (is_bytes_node_stat(item))
> + t <<= PAGE_SHIFT;
> +
>   if (unlikely(x > t || x < -t)) {
>   node_page_state_add(x, pgdat, item);
>   x = 0;

This is wrong. The per-cpu counters are stored as s8 values, so you cannot
just bump the threshold. I would just ignore the PCP counters for metadata
(I don't think they are that critical for the performance of metadata
tracking) and add to the comment I've suggested above: "Also note that
updates to these counters won't be batched using per-cpu counters since the
updates are generally larger than the counter threshold."
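
To illustrate the s8 limitation with a stand-alone user-space sketch (this
is not kernel code; the int8_t merely stands in for the s8 per-cpu delta
that __mod_node_page_state() accumulates into before flushing):

        #include <stdio.h>
        #include <stdint.h>

        #define PAGE_SHIFT 12

        int main(void)
        {
                int8_t pcp_diff = 0;    /* stands in for the s8 per-cpu delta */
                long threshold = 125;   /* typical page-based stat_threshold */
                long byte_threshold = threshold << PAGE_SHIFT;  /* 512000 */

                /* A 16KiB metadata update stays below the byte-scaled
                 * threshold, so it would be parked in the s8 delta... */
                long delta = 16384;
                pcp_diff = (int8_t)(pcp_diff + delta);  /* truncated to 8 bits */

                printf("byte threshold %ld, stored delta %d (wanted %ld)\n",
                       byte_threshold, (int)pcp_diff, delta);
                return 0;
        }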

Honza

-- 
Jan Kara 
SUSE Labs, CR


[PATCH] btrfs: qgroup: Fix root item corruption when multiple snapshots of the same source are created with quota enabled

2017-12-18 Thread Qu Wenruo
When multiple pending snapshots referring to the same source subvolume are
executed, enabled quota will cause root item corruption, where root
items still use the old bytenr (with no backref in the extent tree).

This can be triggered by fstests btrfs/152.

The cause is that, while the source subvolume is still dirty, the extra
commit (a simplified transaction commit) done by qgroup_account_snapshot()
can skip dirty roots not recorded in the current transaction, leaving the
root item of the source subvolume not updated.

Fix it by forcing the source subvolume to be recorded in the current
transaction before the qgroup sub-transaction commit.

Reported-by: Justin Maggard 
Signed-off-by: Qu Wenruo 
---
 fs/btrfs/transaction.c | 10 +-
 1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
index ddae813c01dd..f645e5de5fa5 100644
--- a/fs/btrfs/transaction.c
+++ b/fs/btrfs/transaction.c
@@ -319,7 +319,7 @@ static int record_root_in_trans(struct btrfs_trans_handle 
*trans,
if ((test_bit(BTRFS_ROOT_REF_COWS, &root->state) &&
root->last_trans < trans->transid) || force) {
WARN_ON(root == fs_info->extent_root);
-   WARN_ON(root->commit_root != root->node);
+   WARN_ON(!force && root->commit_root != root->node);
 
/*
 * see below for IN_TRANS_SETUP usage rules
@@ -1366,6 +1366,14 @@ static int qgroup_account_snapshot(struct 
btrfs_trans_handle *trans,
if (!test_bit(BTRFS_FS_QUOTA_ENABLED, &fs_info->flags))
return 0;
 
+   /*
+* Ensure the dirty @src will be committed.
+* Otherwise, after the coming commit_fs_roots() and switch_commit_roots(),
+* any dirty but not recorded root will never be updated again,
+* causing an outdated root item.
+*/
+   record_root_in_trans(trans, src, 1);
+
/*
 * We are going to commit transaction, see btrfs_commit_transaction()
 * comment for reason locking tree_log_mutex
-- 
2.15.1



Re: [PATCH v3 03/10] lib: add a __fprop_add_percpu_max

2017-12-18 Thread Jan Kara
On Mon 11-12-17 16:55:28, Josef Bacik wrote:
> From: Josef Bacik 
> 
> This helper allows us to add an arbitrary amount to the fprop
> structures.
> 
> Signed-off-by: Josef Bacik 

Looks good. You can add:

Reviewed-by: Jan Kara 

Honza

> ---
>  include/linux/flex_proportions.h | 11 +--
>  lib/flex_proportions.c   |  9 +
>  2 files changed, 14 insertions(+), 6 deletions(-)
> 
> diff --git a/include/linux/flex_proportions.h 
> b/include/linux/flex_proportions.h
> index 0d348e011a6e..9f88684bf0a0 100644
> --- a/include/linux/flex_proportions.h
> +++ b/include/linux/flex_proportions.h
> @@ -83,8 +83,8 @@ struct fprop_local_percpu {
>  int fprop_local_init_percpu(struct fprop_local_percpu *pl, gfp_t gfp);
>  void fprop_local_destroy_percpu(struct fprop_local_percpu *pl);
>  void __fprop_inc_percpu(struct fprop_global *p, struct fprop_local_percpu 
> *pl);
> -void __fprop_inc_percpu_max(struct fprop_global *p, struct 
> fprop_local_percpu *pl,
> - int max_frac);
> +void __fprop_add_percpu_max(struct fprop_global *p, struct 
> fprop_local_percpu *pl,
> + unsigned long nr, int max_frac);
>  void fprop_fraction_percpu(struct fprop_global *p,
>   struct fprop_local_percpu *pl, unsigned long *numerator,
>   unsigned long *denominator);
> @@ -99,4 +99,11 @@ void fprop_inc_percpu(struct fprop_global *p, struct 
> fprop_local_percpu *pl)
>   local_irq_restore(flags);
>  }
>  
> +static inline
> +void __fprop_inc_percpu_max(struct fprop_global *p,
> + struct fprop_local_percpu *pl, int max_frac)
> +{
> + __fprop_add_percpu_max(p, pl, 1, max_frac);
> +}
> +
>  #endif
> diff --git a/lib/flex_proportions.c b/lib/flex_proportions.c
> index 2cc1f94e03a1..31003989d34a 100644
> --- a/lib/flex_proportions.c
> +++ b/lib/flex_proportions.c
> @@ -255,8 +255,9 @@ void fprop_fraction_percpu(struct fprop_global *p,
>   * Like __fprop_inc_percpu() except that event is counted only if the given
>   * type has fraction smaller than @max_frac/FPROP_FRAC_BASE
>   */
> -void __fprop_inc_percpu_max(struct fprop_global *p,
> - struct fprop_local_percpu *pl, int max_frac)
> +void __fprop_add_percpu_max(struct fprop_global *p,
> + struct fprop_local_percpu *pl, unsigned long nr,
> + int max_frac)
>  {
>   if (unlikely(max_frac < FPROP_FRAC_BASE)) {
>   unsigned long numerator, denominator;
> @@ -267,6 +268,6 @@ void __fprop_inc_percpu_max(struct fprop_global *p,
>   return;
>   } else
>   fprop_reflect_period_percpu(p, pl);
> - percpu_counter_add_batch(&pl->events, 1, PROP_BATCH);
> - percpu_counter_add(&p->events, 1);
> + percpu_counter_add_batch(&pl->events, nr, PROP_BATCH);
> + percpu_counter_add(&p->events, nr);
>  }
> -- 
> 2.7.5
> 
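
As an aside, a hypothetical byte-based caller of the new helper could look
roughly like this (a sketch only; the writeback structure and field names
are assumptions, not taken from the posted series):

        static void wb_account_metadata_written(struct bdi_writeback *wb,
                                                unsigned long bytes)
        {
                struct wb_domain *dom = &global_wb_domain;

                /* like __fprop_inc_percpu_max(), callers are expected
                 * to have interrupts disabled */
                __fprop_add_percpu_max(&dom->completions, &wb->completions,
                                       bytes >> PAGE_SHIFT,
                                       wb->bdi->max_prop_frac);
        }
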
-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH v10 3/5] bpf: add a bpf_override_function helper

2017-12-18 Thread Masami Hiramatsu
On Mon, 18 Dec 2017 16:09:30 +0100
Daniel Borkmann  wrote:

> On 12/18/2017 10:51 AM, Masami Hiramatsu wrote:
> > On Fri, 15 Dec 2017 14:12:54 -0500
> > Josef Bacik  wrote:
> >> From: Josef Bacik 
> >>
> >> Error injection is sloppy and very ad-hoc.  BPF could fill this niche
> >> perfectly with its kprobe functionality.  We could make sure errors are
> >> only triggered in specific call chains that we care about with very
> >> specific situations.  Accomplish this with the bpf_override_function
> >> helper.  This will modify the probed caller's return value to the
> >> specified value and set the PC to an override function that simply
> >> returns, bypassing the originally probed function.  This gives us a nice
> >> clean way to implement systematic error injection for all of our code
> >> paths.
> > 
> > OK, got it. I think the error_injectable function list should be defined
> > in kernel/trace/bpf_trace.c because only bpf calls it and needs to care
> > about the "safeness".
> > 
> > [...]
> >> diff --git a/arch/x86/kernel/kprobes/ftrace.c 
> >> b/arch/x86/kernel/kprobes/ftrace.c
> >> index 8dc0161cec8f..1ea748d682fd 100644
> >> --- a/arch/x86/kernel/kprobes/ftrace.c
> >> +++ b/arch/x86/kernel/kprobes/ftrace.c
> >> @@ -97,3 +97,17 @@ int arch_prepare_kprobe_ftrace(struct kprobe *p)
> >>p->ainsn.boostable = false;
> >>return 0;
> >>  }
> >> +
> >> +asmlinkage void override_func(void);
> >> +asm(
> >> +  ".type override_func, @function\n"
> >> +  "override_func:\n"
> >> +  "   ret\n"
> >> +  ".size override_func, .-override_func\n"
> >> +);
> >> +
> >> +void arch_ftrace_kprobe_override_function(struct pt_regs *regs)
> >> +{
> >> +  regs->ip = (unsigned long)&override_func;
> >> +}
> >> +NOKPROBE_SYMBOL(arch_ftrace_kprobe_override_function);
> > 
> > Calling this "override_function" is meaningless. This is a function
> > which just returns. So I think the combination of just_return_func() and
> > arch_bpf_override_func_just_return() would be better.
> > 
> > Moreover, this arch/x86/kernel/kprobes/ftrace.c is an architecture-
> > dependent implementation of kprobes, not bpf.
> 
> Josef, please work out any necessary cleanups that would still need
> to be addressed based on Masami's feedback and send them as follow-up
> patches, thanks.
> 
> > Hmm, arch/x86/net/bpf_jit_comp.c will be better place?
> 
> (No, it's JIT only and I'd really prefer to keep it that way, mixing
>  this would result in a huge mess.)

OK, the same applies to kprobes. kernel/kprobes.c and arch/x86/kernel/kprobes/*
are instrumentation code, and kernel/trace/trace_kprobe.c is ftrace's
kprobe user interface, just one implementation of kprobe usage. So please
do not mix them up. It would result in a huge mess for me.
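
For illustration, the stub with the naming suggested above might look like
this (a sketch against the quoted hunk, not a posted patch):

        asmlinkage void just_return_func(void);
        asm(
                ".type just_return_func, @function\n"
                "just_return_func:\n"
                "       ret\n"
                ".size just_return_func, .-just_return_func\n"
        );

        void arch_bpf_override_func_just_return(struct pt_regs *regs)
        {
                regs->ip = (unsigned long)&just_return_func;
        }
        NOKPROBE_SYMBOL(arch_bpf_override_func_just_return);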

Thank you,

-- 
Masami Hiramatsu 


Re: [PATCH v10 1/5] add infrastructure for tagging functions as error injectable

2017-12-18 Thread Masami Hiramatsu
On Fri, 15 Dec 2017 14:12:52 -0500
Josef Bacik  wrote:

> From: Josef Bacik 
> 
> Using BPF we can override kprobed functions and return arbitrary
> values.  Obviously this can be a bit unsafe, so make this feature opt-in
> for functions.  Simply tag a function with KPROBE_ERROR_INJECT_SYMBOL in
> order to give BPF access to that function for error injection purposes.
> 
> Signed-off-by: Josef Bacik 
> Acked-by: Ingo Molnar 
> ---
>  include/asm-generic/vmlinux.lds.h |  10 +++
>  include/linux/bpf.h   |  11 +++
>  include/linux/kprobes.h   |   1 +
>  include/linux/module.h|   5 ++
>  kernel/kprobes.c  | 163 
> ++
>  kernel/module.c   |   6 +-
>  6 files changed, 195 insertions(+), 1 deletion(-)
> 
> diff --git a/include/asm-generic/vmlinux.lds.h 
> b/include/asm-generic/vmlinux.lds.h
> index ee8b707d9fa9..a2e8582d094a 100644
> --- a/include/asm-generic/vmlinux.lds.h
> +++ b/include/asm-generic/vmlinux.lds.h
> @@ -136,6 +136,15 @@
>  #define KPROBE_BLACKLIST()
>  #endif
>  
> +#ifdef CONFIG_BPF_KPROBE_OVERRIDE
> +#define ERROR_INJECT_LIST()  . = ALIGN(8);   
> \
> + 
> VMLINUX_SYMBOL(__start_kprobe_error_inject_list) = .;   \
> + KEEP(*(_kprobe_error_inject_list))  
> \
> + VMLINUX_SYMBOL(__stop_kprobe_error_inject_list) 
> = .;
> +#else
> +#define ERROR_INJECT_LIST()
> +#endif
> +
>  #ifdef CONFIG_EVENT_TRACING
>  #define FTRACE_EVENTS()  . = ALIGN(8);   
> \
>   VMLINUX_SYMBOL(__start_ftrace_events) = .;  \
> @@ -564,6 +573,7 @@
>   FTRACE_EVENTS() \
>   TRACE_SYSCALLS()\
>   KPROBE_BLACKLIST()  \
> + ERROR_INJECT_LIST() \
>   MEM_DISCARD(init.rodata)\
>   CLK_OF_TABLES() \
>   RESERVEDMEM_OF_TABLES() \
> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> index e55e4255a210..7f4d2a953173 100644
> --- a/include/linux/bpf.h
> +++ b/include/linux/bpf.h
> @@ -576,4 +576,15 @@ extern const struct bpf_func_proto 
> bpf_sock_map_update_proto;
>  void bpf_user_rnd_init_once(void);
>  u64 bpf_user_rnd_u32(u64 r1, u64 r2, u64 r3, u64 r4, u64 r5);
>  
> +#if defined(__KERNEL__) && !defined(__ASSEMBLY__)
> +#ifdef CONFIG_BPF_KPROBE_OVERRIDE

BTW, CONFIG_BPF_KPROBE_OVERRIDE is also a confusing name.
Since this feature overrides a function to just return with
some return value (as far as I understand; or would you
also plan to modify the execution path inside a function?),
I think CONFIG_BPF_FUNCTION_OVERRIDE or
CONFIG_BPF_EXECUTION_OVERRIDE would be better.

Indeed, BPF is based on kprobes, but it seems you are limiting it
to ftrace (function-call trace) (I'm not sure why),
so using "kprobes" for this feature seems strange to me.

The idea in this patch itself (marking injectable functions on
a list) is OK with me.
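
For example, opting a function in would then look roughly like this (the
function name is purely illustrative, not taken from the series):

        int example_prepare_write(void)
        {
                /* real work elided; an error-injection probe could force
                 * this to return e.g. -ENOMEM for testing */
                return 0;
        }
        BPF_ALLOW_ERROR_INJECTION(example_prepare_write);

The macro just records the function's address in the
_kprobe_error_inject_list section, which
within_kprobe_error_injection_list() is expected to consult.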

Thank you,

> +#define BPF_ALLOW_ERROR_INJECTION(fname) \
> +static unsigned long __used  \
> + __attribute__((__section__("_kprobe_error_inject_list")))   \
> + _eil_addr_##fname = (unsigned long)fname;
> +#else
> +#define BPF_ALLOW_ERROR_INJECTION(fname)
> +#endif
> +#endif
> +
>  #endif /* _LINUX_BPF_H */
> diff --git a/include/linux/kprobes.h b/include/linux/kprobes.h
> index 9440a2fc8893..963fd364f3d6 100644
> --- a/include/linux/kprobes.h
> +++ b/include/linux/kprobes.h
> @@ -271,6 +271,7 @@ extern bool arch_kprobe_on_func_entry(unsigned long 
> offset);
>  extern bool kprobe_on_func_entry(kprobe_opcode_t *addr, const char *sym, 
> unsigned long offset);
>  
>  extern bool within_kprobe_blacklist(unsigned long addr);
> +extern bool within_kprobe_error_injection_list(unsigned long addr);
>  
>  struct kprobe_insn_cache {
>   struct mutex mutex;
> diff --git a/include/linux/module.h b/include/linux/module.h
> index c69b49abe877..548fa09fa806 100644
> --- a/include/linux/module.h
> +++ b/include/linux/module.h
> @@ -475,6 +475,11 @@ struct module {
>   ctor_fn_t *ctors;
>   unsigned int num_ctors;
>  #endif
> +
> +#ifdef CONFIG_BPF_KPROBE_OVERRIDE
> + unsigned int num_kprobe_ei_funcs;
> + unsigned long *kprobe_ei_funcs;
> +#endif
>  } cacheline_aligned __randomize_layout;
>  #ifndef MODULE_ARCH_INIT
>  #define MODULE_ARCH_INIT {}
> diff --git a/kernel/kprobes.c b/kernel/kprobes.c
> index da2ccf142358..b4aab48ad258 100644
> --- a/kernel/kprobes.c
> +++ b/kernel/kprobes.c
> @@ -83,6 

Re: [PATCH] btrfs: Fix memory leak on multiple mounts

2017-12-18 Thread Nikolay Borisov


On 19.12.2017 08:05, Misono, Tomohiro wrote:
> Hello, 
> 
> On 2017/12/18 19:06, Nikolay Borisov wrote:
>>
>>
>> On 18.12.2017 12:03, Nikolay Borisov wrote:
>>> Currently, if a mounted btrfs instance is mounted for the 2nd time
>>> without first unmounting the first instance, then we hit a memory leak
>>> in btrfs_mount_root because the fs_info of the acquired superblock is
>>> different from the newly allocated fs_info. Fix this by specifically
>>> checking if the fs_info instance of the newly acquired superblock is
>>> the same as ours and freeing it if not.
>>>
>>> Reproducer:
>>>
>>> mount /dev/vdc /media/scratch
>>> mount /dev/vdc /media/scratch2 <- memory leak hit
>>>
>>> Signed-off-by: Nikolay Borisov 
>>
>> This one fixes:
>>
>> 4da82a9e60a1 btrfs: add btrfs_mount_root() and new file_system_type
>>
>> Which seems to be in for-next only.
> 
> The patch doesn't change the logic here (same as current btrfs_mount()).
> 
>>
>>> ---
>>>  fs/btrfs/super.c | 6 +-
>>>  1 file changed, 5 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
>>> index e57eb6e70278..ea3bca85be44 100644
>>> --- a/fs/btrfs/super.c
>>> +++ b/fs/btrfs/super.c
>>> @@ -1612,7 +1612,11 @@ static struct dentry *btrfs_mount_root(struct 
>>> file_system_type *fs_type,
>>> goto error_sec_opts;
>>> }
>>>  
>>> -   fs_info = btrfs_sb(s);
>>> +   if (btrfs_sb(s) != fs_info) {
>>> +   free_fs_info(fs_info);
>>> +   fs_info = btrfs_sb(s);
>>> +   }
>>> +
>>> error = setup_security_options(fs_info, s, &new_sec_opts);
>>> if (error) {
>>> deactivate_locked_super(s);
>>>
> 
> I'm not sure how the memory leak happens.
> Just above this line is:
> 
> ---
>  if (s->s_root) {
>btrfs_close_devices(fs_devices);
>free_fs_info(fs_info);
>if ((flags ^ s->s_flags) & SB_RDONLY)
>  error = -EBUSY;
>  } else {
>snprintf(s->s_id, sizeof(s->s_id), "%pg", bdev);
>btrfs_sb(s)->bdev_holder = fs_type;
>error = btrfs_fill_super(s, fs_devices, data);
>  }
> ---
> 
> In the first mount, s->s_root is null and btrfs_fill_super() is called.
> On the other hand, in the second mount, s->s_root is not null and fs_info is 
> freed there.
> 
> So, I think there is no memory leak in the current code, or am I missing 
> something? 

Hm, you are right, I've missed that, so disregard this patch.
> 
> Regards,
> Tomohiro
> 
>>
> 
> 


Re: [PATCH] btrfs: Fix memory leak on multiple mounts

2017-12-18 Thread Misono, Tomohiro
Hello, 

On 2017/12/18 19:06, Nikolay Borisov wrote:
> 
> 
> On 18.12.2017 12:03, Nikolay Borisov wrote:
>> Currently, if a mounted btrfs instance is mounted for the 2nd time
>> without first unmounting the first instance, then we hit a memory leak
>> in btrfs_mount_root because the fs_info of the acquired superblock is
>> different from the newly allocated fs_info. Fix this by specifically
>> checking if the fs_info instance of the newly acquired superblock is
>> the same as ours and freeing it if not.
>>
>> Reproducer:
>>
>> mount /dev/vdc /media/scratch
>> mount /dev/vdc /media/scratch2 <- memory leak hit
>>
>> Signed-off-by: Nikolay Borisov 
> 
> This one fixes:
> 
> 4da82a9e60a1 btrfs: add btrfs_mount_root() and new file_system_type
> 
> Which seems to be in for-next only.

The patch doesn't change the logic here (same as current btrfs_mount()).

> 
>> ---
>>  fs/btrfs/super.c | 6 +-
>>  1 file changed, 5 insertions(+), 1 deletion(-)
>>
>> diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
>> index e57eb6e70278..ea3bca85be44 100644
>> --- a/fs/btrfs/super.c
>> +++ b/fs/btrfs/super.c
>> @@ -1612,7 +1612,11 @@ static struct dentry *btrfs_mount_root(struct 
>> file_system_type *fs_type,
>>  goto error_sec_opts;
>>  }
>>  
>> -fs_info = btrfs_sb(s);
>> +if (btrfs_sb(s) != fs_info) {
>> +free_fs_info(fs_info);
>> +fs_info = btrfs_sb(s);
>> +}
>> +
>>  error = setup_security_options(fs_info, s, &new_sec_opts);
>>  if (error) {
>>  deactivate_locked_super(s);
>>

I'm not sure how the memory leak happens.
Just above this line is:

---
 if (s->s_root) {
   btrfs_close_devices(fs_devices);
   free_fs_info(fs_info);
   if ((flags ^ s->s_flags) & SB_RDONLY)
 error = -EBUSY;
 } else {
   snprintf(s->s_id, sizeof(s->s_id), "%pg", bdev);
   btrfs_sb(s)->bdev_holder = fs_type;
   error = btrfs_fill_super(s, fs_devices, data);
 }
---

In the first mount, s->s_root is null and btrfs_fill_super() is called.
On the other hand, in the second mount, s->s_root is not null and fs_info is 
freed there.

So, I think there is no memory leak in the current code, or am I missing 
something? 

Regards,
Tomohiro

> 



[RFC PATCH btrfs] buffer_head: create_bh_bio() can be static

2017-12-18 Thread kbuild test robot

Fixes: 61b4603b8338 ("buffer_head: separate out create_bh_bio() from 
submit_bh_wbc()")
Signed-off-by: Fengguang Wu 
---
 buffer.c |4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/fs/buffer.c b/fs/buffer.c
index eb15599..b793f6d 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -3055,8 +3055,8 @@ void guard_bio_eod(int op, struct bio *bio)
}
 }
 
-struct bio *create_bh_bio(int op, int op_flags, struct buffer_head *bh,
-  enum rw_hint write_hint)
+static struct bio *create_bh_bio(int op, int op_flags, struct buffer_head *bh,
+enum rw_hint write_hint)
 {
struct bio *bio;
 


[btrfs:cgroup-btrfs 3/5] fs/buffer.c:3058:12: sparse: symbol 'create_bh_bio' was not declared. Should it be static?

2017-12-18 Thread kbuild test robot
tree:   https://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs.git 
cgroup-btrfs
head:   2e032c72c43e8008be45f23376d9e24d75c3d85f
commit: 61b4603b83389326984a47e702a319a40d006f77 [3/5] buffer_head: separate 
out create_bh_bio() from submit_bh_wbc()
reproduce:
# apt-get install sparse
git checkout 61b4603b83389326984a47e702a319a40d006f77
make ARCH=x86_64 allmodconfig
make C=1 CF=-D__CHECK_ENDIAN__


sparse warnings: (new ones prefixed by >>)

>> fs/buffer.c:3058:12: sparse: symbol 'create_bh_bio' was not declared. Should it be static?

Please review and possibly fold the followup patch.

---
0-DAY kernel test infrastructureOpen Source Technology Center
https://lists.01.org/pipermail/kbuild-all   Intel Corporation


Re: btrfs check lowmem crash, backref.c:466: __add_missing_keys: Assertion `ref->root_id` failed

2017-12-18 Thread Chris Murphy
On Mon, Dec 18, 2017 at 8:26 PM, Chris Murphy  wrote:
> On Mon, Dec 18, 2017 at 7:19 PM, Su Yue  wrote:

>>>
>>
>> Did you apply all patches on this branch?
>
> Yes, I think so:


 1032  git remote add qu https://github.com/adam900710/btrfs-progs
 1033  git fetch qu
 1034  git checkout -b qulowmemfix qu/lowmem_fix


And then rebuilt. Also checked out the kdave devel branch and rebuilt, same results.



-- 
Chris Murphy


Re: btrfs check lowmem crash, backref.c:466: __add_missing_keys: Assertion `ref->root_id` failed

2017-12-18 Thread Chris Murphy
On Mon, Dec 18, 2017 at 7:19 PM, Su Yue  wrote:
>
>
> On 12/19/2017 09:34 AM, Chris Murphy wrote:
>>
>> On Sun, Dec 17, 2017 at 8:07 PM, Su Yue 
>> wrote:
>>>
>>> Hi,
>>>
>>> On 12/18/2017 10:50 AM, Chris Murphy wrote:


 This is v4.14.

 I've filed a bug which contains the build steps, versions. It's crashing
 on all volumes I try it on so far.


>>>
>>> It was fixed by Qu's patch btrfs-progs: backref: Allow backref walk
>>> to handle direct parent ref which is available on his github
>>> https://github.com/adam900710/btrfs-progs/tree/lowmem_fix
>>>
>> OK that fixed the crashing problem on all tested volumes. One volume so
>> far has a bunch of these:
>>
>> ERROR: data extent[1218774720512 167936] backref lost
>>
>
> Did you apply all patches on this branch?

Yes, I think so:

git


> If only the patch mentioned above is applied, you can try another
> patch named
>  btrfs-progs: lowmem check: Fix false backref lost warning for keyed
> extent data ref
> which exists in the same branch.
>
> Otherwise, it seems to be a bug in lowmem check. Would you please
> provide some dump of btrfs-debug-tree?
>
> # btrfs-debug-tree /$device  |\
>   grep -C10 1218774720512
>

There are two errors repeated many times:

ERROR: data extent[1214681899008 21139456] backref lost
ERROR: data extent[1218774720512 167936] backref lost

URLs for both:

https://drive.google.com/open?id=1hdgjYZg-QFeVbT_xVUxlZRC9Q0PP29eS
https://drive.google.com/open?id=1zomVGrtw6TX6ZNL5QmBn4nhgV4OgUI92


-- 
Chris Murphy


Re: btrfs check lowmem crash, backref.c:466: __add_missing_keys: Assertion `ref->root_id` failed

2017-12-18 Thread Su Yue



On 12/19/2017 09:34 AM, Chris Murphy wrote:
> On Sun, Dec 17, 2017 at 8:07 PM, Su Yue 
> wrote:
>>
>> Hi,
>>
>> On 12/18/2017 10:50 AM, Chris Murphy wrote:
>>>
>>> This is v4.14.
>>>
>>> I've filed a bug which contains the build steps, versions. It's
>>> crashing on all volumes I try it on so far.
>>>
>> It was fixed by Qu's patch btrfs-progs: backref: Allow backref walk
>> to handle direct parent ref which is available on his github
>> https://github.com/adam900710/btrfs-progs/tree/lowmem_fix
>>
> OK that fixed the crashing problem on all tested volumes. One volume
> so far has a bunch of these:
>
> ERROR: data extent[1218774720512 167936] backref lost
>

Did you apply all patches on this branch?
If only the patch mentioned above is applied, you can try another
patch named
 btrfs-progs: lowmem check: Fix false backref lost warning for keyed
extent data ref
which exists in the same branch.

Otherwise, it seems to be a bug in lowmem check. Would you please
provide some dump of btrfs-debug-tree?

# btrfs-debug-tree /$device  |\
  grep -C10 1218774720512

You can erase filenames if you think they are private.

> With the same numbers for each line. Is that fixed in kdave devel
> branch? Or pending?

All commits in this branch are pending.

Thanks,
Su




Re: btrfs check lowmem crash, backref.c:466: __add_missing_keys: Assertion `ref->root_id` failed

2017-12-18 Thread Chris Murphy
On Sun, Dec 17, 2017 at 8:07 PM, Su Yue  wrote:
> Hi,
>
> On 12/18/2017 10:50 AM, Chris Murphy wrote:
>>
>> This is v4.14.
>>
>> I've filed a bug which contains the build steps, versions. It's
>> crashing on all volumes I try it on so far.
>>
>>
>
> It was fixed by Qu's patch
>   btrfs-progs: backref: Allow backref walk to handle direct parent ref
> which is available on his github
> https://github.com/adam900710/btrfs-progs/tree/lowmem_fix
>
 OK that fixed the crashing problem on all tested volumes. One volume
so far has a bunch of these:

ERROR: data extent[1218774720512 167936] backref lost

With the same numbers for each line. Is that fixed in kdave devel
branch? Or pending?

-- 
Chris Murphy


Re: again "out of space" and remount read only, with 4.14

2017-12-18 Thread Martin Raiber
On 03.12.2017 16:39 Martin Raiber wrote:
> Am 26.11.2017 um 17:02 schrieb Tomasz Chmielewski:
>> On 2017-11-27 00:37, Martin Raiber wrote:
>>> On 26.11.2017 08:46 Tomasz Chmielewski wrote:
 Got this one on a 4.14-rc7 filesystem with some 400 GB left:
>>> I guess it is too late now, but I guess the "btrfs fi usage" output of
>>> the file system (especially after it went ro) would be useful.
>> It was more or less similar as it went ro:
>>
>> # btrfs fi usage /srv
>> Overall:
>>     Device size:           5.25TiB
>>     Device allocated:      4.45TiB
>>     Device unallocated:  823.97GiB
>>     Device missing:          0.00B
>>     Used:                  4.33TiB
>>     Free (estimated):    471.91GiB  (min: 471.91GiB)
>>     Data ratio:               2.00
>>     Metadata ratio:           2.00
>>     Global reserve:      512.00MiB  (used: 0.00B)
>>
>> Unallocated:
>>    /dev/sda4  411.99GiB
>>    /dev/sdb4  411.99GiB
> I wanted to check if it is the same issue I have, e.g. with 4.14.1
> space_cache=v2:
>
> [153245.341823] BTRFS: error (device loop0) in
> btrfs_run_delayed_refs:3089: errno=-28 No space left
> [153245.341845] BTRFS: error (device loop0) in btrfs_drop_snapshot:9317:
> errno=-28 No space left
> [153245.341848] BTRFS info (device loop0): forced readonly
> [153245.341972] BTRFS warning (device loop0): Skipping commit of aborted
> transaction.
> [153245.341975] BTRFS: error (device loop0) in cleanup_transaction:1873:
> errno=-28 No space left
> # btrfs fi usage /media/backup
> Overall:
>     Device size:          49.60TiB
>     Device allocated:     38.10TiB
>     Device unallocated:   11.50TiB
>     Device missing:          0.00B
>     Used:                 36.98TiB
>     Free (estimated):     12.59TiB  (min: 12.59TiB)
>     Data ratio:               1.00
>     Metadata ratio:           1.00
>     Global reserve:        2.00GiB  (used: 1.99GiB)
>
> Data,single: Size:37.70TiB, Used:36.61TiB
>    /dev/loop0   37.70TiB
>
> Metadata,single: Size:411.01GiB, Used:380.98GiB
>    /dev/loop0  411.01GiB
>
> System,single: Size:36.00MiB, Used:4.00MiB
>    /dev/loop0   36.00MiB
>
> Unallocated:
>    /dev/loop0   11.50TiB
>
> Note the global reserve being at maximum. I already increased that in
> the code to 2G and that seems to make this issue appear more rarely.

This time with enospc_debug mount option:

With Linux 4.14.3. Single large device.

[15179.739038] [ cut here ]
[15179.739059] WARNING: CPU: 0 PID: 28694 at fs/btrfs/extent-tree.c:8458
btrfs_alloc_tree_block+0x38f/0x4a0
[15179.739060] Modules linked in: bcache loop dm_crypt algif_skcipher
af_alg st sr_mod cdrom xfs libcrc32c zbud intel_rapl sb_edac
x86_pkg_temp_thermal intel_powerclamp coretemp iTCO_wdt kvm_intel kvm
iTCO_vendor_support irqbypass crct10dif_pclmul crc32_pclmul
ghash_clmulni_intel pcbc raid1 mgag200 snd_pcm aesni_intel ttm snd_timer
drm_kms_helper snd soundcore aes_x86_64 crypto_simd glue_helper cryptd
pcspkr i2c_i801 joydev drm mei_me evdev lpc_ich mei mfd_core ipmi_si
ipmi_devintf ipmi_msghandler tpm_tis tpm_tis_core tpm wmi ioatdma button
shpchp fuse autofs4 hid_generic usbhid hid sg sd_mod dm_mod dax md_mod
crc32c_intel isci ahci mpt3sas libsas libahci igb raid_class ehci_pci
i2c_algo_bit libata dca ehci_hcd scsi_transport_sas ptp nvme pps_core
scsi_mod usbcore nvme_core
[15179.739133] CPU: 0 PID: 28694 Comm: btrfs Not tainted 4.14.3 #2
[15179.739134] Hardware name: Supermicro
X9DRi-LN4+/X9DR3-LN4+/X9DRi-LN4+/X9DR3-LN4+, BIOS 3.2 03/04/2015
[15179.739136] task: 8813e4f02ac0 task.stack: c9000aea
[15179.739140] RIP: 0010:btrfs_alloc_tree_block+0x38f/0x4a0
[15179.739141] RSP: 0018:c9000aea3558 EFLAGS: 00010292
[15179.739144] RAX: 001d RBX: 4000 RCX:

[15179.739146] RDX: 880c4fa15b38 RSI: 880c4fa0de58 RDI:
880c4fa0de58
[15179.739147] RBP: c9000aea35d0 R08: 0001 R09:
0662
[15179.739149] R10: 1600 R11: 0662 R12:
880c0a454000
[15179.739151] R13: 880c4ba33800 R14: 0001 R15:
880c0a454128
[15179.739153] FS:  7f0d699128c0() GS:880c4fa0()
knlGS:
[15179.739155] CS:  0010 DS:  ES:  CR0: 80050033
[15179.739156] CR2: 7bbfcdf2c6e8 CR3: 00151da91003 CR4:
000606f0
[15179.739158] Call Trace:
[15179.739166]  __btrfs_cow_block+0x117/0x580
[15179.739169]  btrfs_cow_block+0xdf/0x200
[15179.739171]  btrfs_search_slot+0x1ea/0x990
[15179.739174]  lookup_inline_extent_backref+0xdf/0x570
[15179.739177]  ? __set_page_dirty_nobuffers+0x11e/0x160
[15179.739179]  insert_inline_extent_backref+0x40/0xb0
[15179.739183]  ? __slab_free+0x178/0x2c0
[15179.739185]  __btrfs_inc_extent_ref.isra.59+0x6b/0x220
[15179.739188]  __btrfs_run_delayed_refs+0x924/0x12c0

Re: Storing errors in the XArray

2017-12-18 Thread Randy Dunlap
On 12/15/2017 09:10 AM, Matthew Wilcox wrote:
> On Mon, Dec 11, 2017 at 03:10:22PM -0800, Randy Dunlap wrote:
>>> +The XArray does not support storing :c:func:`IS_ERR` pointers; some
>>> +conflict with data values and others conflict with entries the XArray
>>> +uses for its own purposes.  If you need to store special values which
>>> +cannot be confused with real kernel pointers, the values 4, 8, ... 4092
>>> +are available.
>>
>> or if I know that the values are errno-range values, I can just shift them
>> left by 2 to store them and then shift them right by 2 to use them?
> 
> On further thought, I like this idea so much, it's worth writing helpers
> for this usage.  And test-suite (also doubles as a demonstration of how
> to use it).
> 
> diff --git a/include/linux/xarray.h b/include/linux/xarray.h
> index c616e9319c7c..53aa251df57a 100644
> --- a/include/linux/xarray.h
> +++ b/include/linux/xarray.h
> @@ -232,6 +232,39 @@ static inline bool xa_is_value(const void *entry)
>   return (unsigned long)entry & 1;
>  }
>  
> +/**
> + * xa_mk_errno() - Create an XArray entry from an error number.
> + * @error: Error number to store in XArray.
> + *
> + * Return: An entry suitable for storing in the XArray.
> + */
> +static inline void *xa_mk_errno(long error)
> +{
> + return (void *)(error << 2);
> +}
> +
> +/**
> + * xa_to_errno() - Get error number stored in an XArray entry.
> + * @entry: XArray entry.
> + *
> + * Return: The error number stored in the XArray entry.
> + */
> +static inline unsigned long xa_to_errno(const void *entry)
> +{
> + return (long)entry >> 2;
> +}
> +
> +/**
> + * xa_is_errno() - Determine if an entry is an errno.
> + * @entry: XArray entry.
> + *
> + * Return: True if the entry is an errno, false if it is a pointer.
> + */
> +static inline bool xa_is_errno(const void *entry)
> +{
> + return (((unsigned long)entry & 3) == 0) && (entry > (void *)-4096);

Some named mask bits would be ^^^ preferable there, e.g.:

#define MAX_ERRNO	4095	/* from err.h */

	&& (entry >= (void *)-MAX_ERRNO);

> +}
> +
>  /**
>   * xa_is_internal() - Is the entry an internal entry?
>   * @entry: Entry retrieved from the XArray
> diff --git a/tools/testing/radix-tree/xarray-test.c 
> b/tools/testing/radix-tree/xarray-test.c
> index 43111786ebdd..b843cedf3988 100644
> --- a/tools/testing/radix-tree/xarray-test.c
> +++ b/tools/testing/radix-tree/xarray-test.c
> @@ -29,7 +29,13 @@ void check_xa_err(struct xarray *xa)
>   assert(xa_err(xa_store(xa, 1, xa_mk_value(0), GFP_KERNEL)) == 0);
>   assert(xa_err(xa_store(xa, 1, NULL, 0)) == 0);
>  // kills the test-suite :-(
> -// assert(xa_err(xa_store(xa, 0, xa_mk_internal(0), 0)) == -EINVAL);
> +//   assert(xa_err(xa_store(xa, 0, xa_mk_internal(0), 0)) == -EINVAL);
> +
> + assert(xa_err(xa_store(xa, 0, xa_mk_errno(-ENOMEM), GFP_KERNEL)) == 0);
> + assert(xa_err(xa_load(xa, 0)) == 0);
> + assert(xa_is_errno(xa_load(xa, 0)) == true);
> + assert(xa_to_errno(xa_load(xa, 0)) == -ENOMEM);
> + xa_erase(xa, 0);
>  }
>  
>  void check_xa_tag(struct xarray *xa)
> 
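
With the named constant folded in, the check might read like this (sketch
only, assuming MAX_ERRNO from <linux/err.h>):

        static inline bool xa_is_errno(const void *entry)
        {
                return (((unsigned long)entry & 3) == 0) &&
                       (entry >= (void *)-MAX_ERRNO);
        }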

Thanks,
-- 
~Randy


Re: Naming of tag operations in the XArray

2017-12-18 Thread Randy Dunlap
On 12/15/2017 04:34 AM, Matthew Wilcox wrote:
> On Thu, Dec 14, 2017 at 08:22:14PM -0800, Matthew Wilcox wrote:
>> On Mon, Dec 11, 2017 at 03:10:22PM -0800, Randy Dunlap wrote:
 +A freshly-initialised XArray contains a ``NULL`` pointer at every index.
 +Each non-``NULL`` entry in the array has three bits associated with
 +it called tags.  Each tag may be flipped on or off independently of
 +the others.  You can search for entries with a given tag set.
>>>
>>> Only tags that are set, or search for entries with some tag(s) cleared?
>>> Or is that like a mathematical set?
>>
>> hmm ...
>>
>> "Each tag may be set or cleared independently of the others.  You can
>> search for entries which have a particular tag set."
>>
>> Doesn't completely remove the ambiguity, but I can't think of how to phrase
>> that better ...
> 
> Thinking about this some more ...
> 
> At the moment, the pieces of the API which deal with tags look like this:
> 
> bool xa_tagged(const struct xarray *, xa_tag_t)
> bool xa_get_tag(struct xarray *, unsigned long index, xa_tag_t);
> void xa_set_tag(struct xarray *, unsigned long index, xa_tag_t);
> void xa_clear_tag(struct xarray *, unsigned long index, xa_tag_t);
> int xa_get_tagged(struct xarray *, void **dst, unsigned long start,
> unsigned long max, unsigned int n, xa_tag_t);
> 
> bool xas_get_tag(const struct xa_state *, xa_tag_t);
> void xas_set_tag(const struct xa_state *, xa_tag_t);
> void xas_clear_tag(const struct xa_state *, xa_tag_t);
> void *xas_find_tag(struct xa_state *, unsigned long max, xa_tag_t);
> xas_for_each_tag(xas, entry, max, tag) { }
> 
> (at some point there will be an xa_for_each_tag too, there just hasn't
> been a user yet).
> 
> I'm always ambivalent about using the word 'get' in an API because it has
> two common meanings; (increment a refcount) and (return the state).  How

Yes, I get that.  But you usually wouldn't lock a tag AFAIK.

> would people feel about these names instead:

I think that the original names are mostly better, except I do like
xa_select() instead of xa_get_tagged().  But even that doesn't have
to change.

> bool xa_any_tagged(xa, tag);
> bool xa_is_tagged(xa, index, tag);
> void xa_tag(xa, index, tag);
> void xa_untag(xa, index, tag);
> int xa_select(xa, dst, start, max, n, tag);
> 
> bool xas_is_tagged(xas, tag);
> void xas_tag(xas, tag);
> void xas_untag(xas, tag);
> void *xas_find_tag(xas, max, tag);
> xas_for_each_tag(xas, entry, max, tag) { }
> 
> (the last two are unchanged)
> 
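
For concreteness, the same sequence under each naming scheme (the xarray,
index and tag here are purely illustrative):

        /* current names */
        xa_set_tag(&xa, index, XA_TAG_0);
        if (xa_get_tag(&xa, index, XA_TAG_0))
                xa_clear_tag(&xa, index, XA_TAG_0);

        /* proposed names */
        xa_tag(&xa, index, XA_TAG_0);
        if (xa_is_tagged(&xa, index, XA_TAG_0))
                xa_untag(&xa, index, XA_TAG_0);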


-- 
~Randy


Re: Unexpected raid1 behaviour

2017-12-18 Thread Chris Murphy
On Mon, Dec 18, 2017 at 3:28 PM, Chris Murphy  wrote:
> On Mon, Dec 18, 2017 at 1:49 AM, Anand Jain  wrote:
>
>>  Agreed. IMO degraded-raid1-single-chunk is an accidental feature
>>  caused by [1], which we should revert back, since..
>>    - balance (to raid1 chunk) may fail if FS is near full
>>    - recovery (to raid1 chunk) will take more writes as compared
>>      to recovery under degraded raid1 chunks
>
>
> The advantage of writing single chunks when degraded, is in the case
> where a missing device returns (is readded, intact). Catching up that
> device with the first drive, is a manual but simple invocation of
> 'btrfs balance start -dconvert=raid1,soft -mconvert=raid1,soft'.  The
> alternative is a full balance or full scrub. It's pretty tedious for
> big arrays.
>
> mdadm uses bitmap=internal for any array larger than 100GB for this
> reason, avoiding full resync.
>
> 'btrfs sub find' will list all *added* files since an arbitrarily
> specified generation; but not deletions.

Looks like LVM raid types (the non-legacy ones that use md driver)
also use a bitmap by default.

-- 
Chris Murphy


Re: Unexpected raid1 behaviour

2017-12-18 Thread Chris Murphy
On Mon, Dec 18, 2017 at 1:49 AM, Anand Jain  wrote:

>  Agreed. IMO degraded-raid1-single-chunk is an accidental feature
>  caused by [1], which we should revert back, since..
>    - balance (to raid1 chunk) may fail if FS is near full
>    - recovery (to raid1 chunk) will take more writes as compared
>      to recovery under degraded raid1 chunks


The advantage of writing single chunks when degraded, is in the case
where a missing device returns (is readded, intact). Catching up that
device with the first drive, is a manual but simple invocation of
'btrfs balance start -dconvert=raid1,soft -mconvert=raid1,soft'.  The
alternative is a full balance or full scrub. It's pretty tedious for
big arrays.

mdadm uses bitmap=internal for any array larger than 100GB for this
reason, avoiding full resync.

'btrfs sub find' will list all *added* files since an arbitrarily
specified generation; but not deletions.


-- 
Chris Murphy


Re: [PATCH v3 19/19] fs: handle inode->i_version more efficiently

2017-12-18 Thread Dave Chinner
On Mon, Dec 18, 2017 at 02:35:20PM -0500, Jeff Layton wrote:
> [PATCH] SQUASH: add memory barriers around i_version accesses

Why explicit memory barriers rather than annotating the operations
with the required semantics and getting the barriers the arch
requires automatically?  I suspect this should be using
atomic_read_acquire() and atomic_cmpxchg_release(), because AFAICT
the atomic_cmpxchg needs to have release semantics to match the
acquire semantics needed for the load of the current value.

From include/linux/atomic.h:

 * For compound atomics performing both a load and a store, ACQUIRE
 * semantics apply only to the load and RELEASE semantics only to the
 * store portion of the operation. Note that a failed cmpxchg_acquire
 * does -not- imply any memory ordering constraints.

Memory barriers hurt my brain. :/

At minimum, shouldn't the atomic op specific barriers be used rather
than full memory barriers? i.e:

> diff --git a/include/linux/iversion.h b/include/linux/iversion.h
> index a9fbf99709df..39ec9aa9e08e 100644
> --- a/include/linux/iversion.h
> +++ b/include/linux/iversion.h
> @@ -87,6 +87,25 @@ static inline void
>  inode_set_iversion_raw(struct inode *inode, const u64 val)
>  {
>   atomic64_set(&inode->i_version, val);
> + smp_wmb();

smp_mb__after_atomic();
.
> +static inline u64
> +inode_peek_iversion_raw(const struct inode *inode)
> +{
> + smp_rmb();

smp_mb__before_atomic();

> + return atomic64_read(&inode->i_version);
>  }

And, of course, these will require comments explaining what they
match with and what they are ordering.

> @@ -152,7 +171,7 @@ inode_maybe_inc_iversion(struct inode *inode, bool force)
>  {
>   u64 cur, old, new;
>  
> - cur = (u64)atomic64_read(&inode->i_version);
> + cur = inode_peek_iversion_raw(inode);
>   for (;;) {
>   /* If flag is clear then we needn't do anything */
>   if (!force && !(cur & I_VERSION_QUERIED))

cmpxchg in this loop needs a release barrier so everyone will
see the change?

> @@ -183,23 +202,6 @@ inode_inc_iversion(struct inode *inode)
>   inode_maybe_inc_iversion(inode, true);
>  }
>  
> -/**
> - * inode_peek_iversion_raw - grab a "raw" iversion value
> - * @inode: inode from which i_version should be read
> - *
> - * Grab a "raw" inode->i_version value and return it. The i_version is not
> - * flagged or converted in any way. This is mostly used to access a 
> self-managed
> - * i_version.
> - *
> - * With those filesystems, we want to treat the i_version as an entirely
> - * opaque value.
> - */
> -static inline u64
> -inode_peek_iversion_raw(const struct inode *inode)
> -{
> - return atomic64_read(&inode->i_version);
> -}
> -
>  /**
>   * inode_iversion_need_inc - is the i_version in need of being incremented?
>   * @inode: inode to check
> @@ -248,7 +250,7 @@ inode_query_iversion(struct inode *inode)
>  {
>   u64 cur, old, new;
>  
> - cur = atomic64_read(&inode->i_version);
> + cur = inode_peek_iversion_raw(inode);
>   for (;;) {
>   /* If flag is already set, then no need to swap */
>   if (cur & I_VERSION_QUERIED)

cmpxchg in this loop needs a release barrier so everyone will
see the change on load?

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: Unexpected raid1 behaviour

2017-12-18 Thread Peter Grandi
>> The fact is, the only cases where this is really an issue is
>> if you've either got intermittently bad hardware, or are
>> dealing with external

> Well, the RAID1+ is all about the failing hardware.

>> storage devices. For the majority of people who are using
>> multi-device setups, the common case is internally connected
>> fixed storage devices with properly working hardware, and for
>> that use case, it works perfectly fine.

> If you're talking about "RAID"-0 or storage pools (volume
> management) that is true. But if you imply, that RAID1+ "works
> perfectly fine as long as hardware works fine" this is
> fundamentally wrong.

I really agree with this, the argument about "properly working
hardware" is utterly ridiculous. I'll add to this: apparently I am
not the first one to discover the "anomalies" in the "RAID"
profiles, but I may have been the first to document some of
them, e.g. the famous issues with the 'raid1' profile. How did I
discover them? Well, I had used Btrfs in single device mode for
a bit, and wanted to try multi-device, and the docs seemed
"strange", so I did tests before trying it out.

The tests were simply on a spare PC with a bunch of old disks to
create two block devices (partitions), put them in 'raid1' first
natively, then by adding a new member to an existing partition,
and then 'remove' one, or simply unplug it (actually 'echo 1 >
/sys/block/.../device/delete') initially. I wanted to check
exactly what happened, resync times, speed, behaviour and speed
when degraded, just ordinary operational tasks.

Well I found significant problems after less than one hour. I
can't imagine anyone with some experience of hw or sw RAID
(especially hw RAID, as hw RAID firmware is often fantastically
buggy especially as to RAID operations) that wouldn't have done
the same tests before operational use, and would not have found
the same issues too straight away. The only guess I could draw
is that whoever designed the "RAID" profile had zero operational
system administration experience.

> If the hardware needs to work properly for the RAID to work
> properly, no one would need this RAID in the first place.

It is not just that, but some maintenance operations are needed
even if the hardware works properly: for example preventive
maintenance, replacing drives that are becoming too old,
expanding capacity, testing periodically hardware bits. Systems
engineers don't just say "it works, let's assume it continues to
work properly, why worry".

My impression is that multi-device and "chunks" were designed in
one way by someone, and someone else did not understand the
intent, and confused them with "RAID", and based the 'raid'
profiles on that confusion. For example the 'raid10' profile
seems the least confused to me, and that's I think because the
"RAID" aspect is kept more distinct from the "multi-device"
aspect. But perhaps I am an optimist...

To simplify a longer discussion to have "RAID" one needs an
explicit design concept of "stripe", which in Btrfs needs to be
quite different from that of "set of member devices" and
"chunks", so that for example adding/removing to a "stripe" is
not quite the same thing as adding/removing members to a volume,
plus to make a distinction between online and offline members,
not just added and removed ones, and well-defined state machine
transitions (e.g. in response to hardware problems) among all
those, like in MD RAID. But the importance of such distinctions
may not be apparent to everybody.

But I may have read comments in which "block device" (a data
container on some medium), "block device inode" (a descriptor
for that) and "block device name" (a path to a "block device
inode") were hopelessly confused, so I don't hold a lot of
hope. :-(


Re: Unexpected raid1 behaviour

2017-12-18 Thread Tomasz Pala
On Mon, Dec 18, 2017 at 08:06:57 -0500, Austin S. Hemmelgarn wrote:

> The fact is, the only cases where this is really an issue is if you've 
> either got intermittently bad hardware, or are dealing with external 

Well, the RAID1+ is all about the failing hardware.

> storage devices.  For the majority of people who are using multi-device 
> setups, the common case is internally connected fixed storage devices 
> with properly working hardware, and for that use case, it works 
> perfectly fine.

If you're talking about "RAID"-0 or storage pools (volume management)
that is true.
But if you imply that RAID1+ "works perfectly fine as long as hardware
works fine", this is fundamentally wrong. If the hardware needs to work
properly for the RAID to work properly, no one would need this RAID in
the first place.

> that BTRFS should not care.  At the point at which a device is dropping 
> off the bus and reappearing with enough regularity for this to be an 
> issue, you have absolutely no idea how else it's corrupting your data, 
> and support of such a situation is beyond any filesystem (including ZFS).

Support for such a situation is exactly what RAID is for. So don't blame
people for expecting this to be handled as long as you call the
filesystem feature 'RAID'.

If this feature is not going to mitigate hardware hiccups by design (as
opposed to "not implemented yet, needs some time", which is perfectly
understandable), just don't call it 'RAID'.

All the features currently working, like bit-rot mitigation for
duplicated data (dup/raid*) using checksums, are something different
than RAID itself. RAID means "survive failure of N devices/controllers"
- I got one "RAID1" stuck in r/o after degraded mount, not nice... Not
_expected_ to happen after single disk failure (without any reappearing).

-- 
Tomasz Pala 


Re: [PATCH v3 19/19] fs: handle inode->i_version more efficiently

2017-12-18 Thread Jeff Layton
On Mon, 2017-12-18 at 12:36 -0500, J. Bruce Fields wrote:
> On Mon, Dec 18, 2017 at 12:22:20PM -0500, Jeff Layton wrote:
> > On Mon, 2017-12-18 at 17:34 +0100, Jan Kara wrote:
> > > On Mon 18-12-17 10:11:56, Jeff Layton wrote:
> > > >  static inline bool
> > > >  inode_maybe_inc_iversion(struct inode *inode, bool force)
> > > >  {
> > > > -   atomic64_t *ivp = (atomic64_t *)&inode->i_version;
> > > >  
> > > > -   atomic64_inc(ivp);
> > > > +   cur = (u64)atomic64_read(&inode->i_version);
> > > > +   cur = (u64)atomic64_read(>i_version);
> > > > +   for (;;) {
> > > > +   /* If flag is clear then we needn't do anything */
> > > > +   if (!force && !(cur & I_VERSION_QUERIED))
> > > > +   return false;
> > > 
> > > The fast path here misses any memory barrier. Thus it seems this query
> > > could be in theory reordered before any store that happened to modify the
> > > inode? Or maybe we could race and miss the fact that in fact this 
> > > i_version
> > > has already been queried? But maybe there's some higher level locking that
> > > makes sure this is all a non-issue... But in that case it would deserve
> > > some comment I guess.
> > > 
> > 
> > There's no higher-level locking. Getting locking out of this codepath is
> > a good thing IMO. The larger question here is whether we really care
> > about ordering this with anything else.
> > 
> > The i_version, as implemented today, is not ordered with actual changes
> > to the inode. We only take the i_lock today when modifying it, not when
> > querying it. It's possible today that you could see the results of a
> > change and then do a fetch of the i_version that doesn't show an
> > increment vs. a previous change.
> > 
> > It'd be nice if this were atomic with the actual changes that it
> > represents, but I think that would be prohibitively expensive. That may
> > be something we need to address. I'm not sure we really want to do it as
> > part of this patchset though.
> 
> I don't think we need full atomicity, but we *do* need the i_version
> increment to be seen as ordered after the inode change.  That should be
> relatively inexpensive, yes?
> 
> Otherwise an inode change could be overlooked indefinitely, because a
> client could cache the old contents with the new i_version value and
> never see the i_version change again as long as the inode isn't touched
> again.
> 
> (If the client instead caches new data with the old inode version, there
> may be an unnecessary cache invalidation next time the client queries
> the change attribute.  That's a performance bug, but not really a
> correctness problem, at least for clients depending only on
> close-to-open semantics: they'll only depend on seeing the new change
> attribute after the change has been flushed to disk.  The performance
> problem should be mitigated by the race being very rare.)

Perhaps something like this patch squashed into #19?

This should tighten up the initial loads and stores, like Jan was
suggesting (I think). The increments are done via cmpxchg so they
already have a full implied barrier (IIUC), and we don't exit the loop
until it succeeds or it looks like the flag is set.

To be clear though:

This series does not try to order the i_version update with any other
inode metadata or data (though we generally do update it after a write
is done, rather than before or during).

Inode field updates are not at all synchronized either and this series
does not change that. A stat() is also generally done locklessly so you
can easily get a half-baked attributes there if you hit the timing
wrong.

-8<--

[PATCH] SQUASH: add memory barriers around i_version accesses

Signed-off-by: Jeff Layton 
---
 include/linux/iversion.h | 40 +---
 1 file changed, 21 insertions(+), 19 deletions(-)

diff --git a/include/linux/iversion.h b/include/linux/iversion.h
index a9fbf99709df..39ec9aa9e08e 100644
--- a/include/linux/iversion.h
+++ b/include/linux/iversion.h
@@ -87,6 +87,25 @@ static inline void
 inode_set_iversion_raw(struct inode *inode, const u64 val)
 {
   atomic64_set(&inode->i_version, val);
+   smp_wmb();
+}
+
+/**
+ * inode_peek_iversion_raw - grab a "raw" iversion value
+ * @inode: inode from which i_version should be read
+ *
+ * Grab a "raw" inode->i_version value and return it. The i_version is not
+ * flagged or converted in any way. This is mostly used to access a 
self-managed
+ * i_version.
+ *
+ * With those filesystems, we want to treat the i_version as an entirely
+ * opaque value.
+ */
+static inline u64
+inode_peek_iversion_raw(const struct inode *inode)
+{
+   smp_rmb();
+   return atomic64_read(&inode->i_version);
 }
 
 /**
@@ -152,7 +171,7 @@ inode_maybe_inc_iversion(struct inode *inode, bool force)
 {
u64 cur, old, new;
 
-   cur = (u64)atomic64_read(&inode->i_version);
+   cur = inode_peek_iversion_raw(inode);
for (;;) {

Re: [PATCH v3 01/19] fs: new API for handling inode->i_version

2017-12-18 Thread Jeff Layton
On Mon, 2017-12-18 at 10:11 -0500, Jeff Layton wrote:
> From: Jeff Layton 
> 
> Add a documentation blob that explains what the i_version field is, how
> it is expected to work, and how it is currently implemented by various
> filesystems.
> 
> We already have inode_inc_iversion. Add several other functions for
> manipulating and accessing the i_version counter. For now, the
> implementation is trivial and basically works the way that all of the
> open-coded i_version accesses work today.
> 
> Future patches will convert existing users of i_version to use the new
> API, and then convert the backend implementation to do things more
> efficiently.
> 
> Signed-off-by: Jeff Layton 
> ---
>  include/linux/fs.h   |  15 
>  include/linux/iversion.h | 205 
> +++
>  2 files changed, 205 insertions(+), 15 deletions(-)
>  create mode 100644 include/linux/iversion.h
> 
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 511fbaabf624..76382c24e9d0 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -2036,21 +2036,6 @@ static inline void inode_dec_link_count(struct inode 
> *inode)
>   mark_inode_dirty(inode);
>  }
>  
> -/**
> - * inode_inc_iversion - increments i_version
> - * @inode: inode that need to be updated
> - *
> - * Every time the inode is modified, the i_version field will be incremented.
> - * The filesystem has to be mounted with i_version flag
> - */
> -
> -static inline void inode_inc_iversion(struct inode *inode)
> -{
> -   spin_lock(&inode->i_lock);
> -   inode->i_version++;
> -   spin_unlock(&inode->i_lock);
> -}
> -

Note that this patch is broken, as I neglected to include
linux/iversion.h in the files with the existing callers of
inode_inc_iversion. Fixed in my in-tree version.


>  enum file_time_flags {
>   S_ATIME = 1,
>   S_MTIME = 2,
> diff --git a/include/linux/iversion.h b/include/linux/iversion.h
> new file mode 100644
> index ..bb50d27c71f9
> --- /dev/null
> +++ b/include/linux/iversion.h
> @@ -0,0 +1,205 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#ifndef _LINUX_IVERSION_H
> +#define _LINUX_IVERSION_H
> +
> +#include 
> +
> +/*
> + * The change attribute (i_version) is mandated by NFSv4 and is mostly for
> + * knfsd, but is also used for other purposes (e.g. IMA). The i_version must
> + * appear different to observers if there was a change to the inode's data or
> + * metadata since it was last queried.
> + *
> + * It should be considered an opaque value by observers. If it remains the 
> same
> + * since it was last checked, then nothing has changed in the inode. If it's
> + * different then something has changed. Observers cannot infer anything 
> about
> + * the nature or magnitude of the changes from the value, only that the inode
> + * has changed in some fashion.
> + *
> + * Not all filesystems properly implement the i_version counter. Subsystems 
> that
> + * want to use i_version field on an inode should first check whether the
> + * filesystem sets the SB_I_VERSION flag (usually via the IS_I_VERSION 
> macro).
> + *
> + * Those that set SB_I_VERSION will automatically have their i_version 
> counter
> + * incremented on writes to normal files. If the SB_I_VERSION is not set, 
> then
> + * the VFS will not touch it on writes, and the filesystem can use it how it
> + * wishes. Note that the filesystem is always responsible for updating the
> + * i_version on namespace changes in directories (mkdir, rmdir, unlink, 
> etc.).
> + * We consider these sorts of filesystems to have a kernel-managed i_version.
> + *
> + * Note that some filesystems (e.g. NFS and AFS) just use the field to store
> + * a server-provided value (for the most part). For that reason, those
> + * filesystems do not set SB_I_VERSION. These filesystems are considered to
> + * have a self-managed i_version.
> + */
> +
> +/**
> + * inode_set_iversion_raw - set i_version to the specified raw value
> + * @inode: inode to set
> + * @new: new i_version value to set
> + *
> + * Set @inode's i_version field to @new. This function is for use by
> + * filesystems that self-manage the i_version.
> + *
> + * For example, the NFS client stores its NFSv4 change attribute in this way,
> + * and the AFS client stores the data_version from the server here.
> + */
> +static inline void
> +inode_set_iversion_raw(struct inode *inode, const u64 new)
> +{
> + inode->i_version = new;
> +}
> +
> +/**
> + * inode_set_iversion - set i_version to a particular value
> + * @inode: inode to set
> + * @new: new i_version value to set
> + *
> + * Set @inode's i_version field to @new. This function is for filesystems 
> with
> + * a kernel-managed i_version.
> + *
> + * For now, this just does the same thing as the _raw variant.
> + */
> +static inline void
> +inode_set_iversion(struct inode *inode, const u64 new)
> +{
> + inode_set_iversion_raw(inode, new);
> +}
> +
> +/**
> + * 

Re: [PATCH v3 19/19] fs: handle inode->i_version more efficiently

2017-12-18 Thread J. Bruce Fields
On Mon, Dec 18, 2017 at 12:22:20PM -0500, Jeff Layton wrote:
> On Mon, 2017-12-18 at 17:34 +0100, Jan Kara wrote:
> > On Mon 18-12-17 10:11:56, Jeff Layton wrote:
> > >  static inline bool
> > >  inode_maybe_inc_iversion(struct inode *inode, bool force)
> > >  {
> > > - atomic64_t *ivp = (atomic64_t *)&inode->i_version;
> > > + u64 cur, old, new;
> > > 
> > > - atomic64_inc(ivp);
> > > + cur = (u64)atomic64_read(&inode->i_version);
> > > + for (;;) {
> > > + /* If flag is clear then we needn't do anything */
> > > + if (!force && !(cur & I_VERSION_QUERIED))
> > > + return false;
> > 
> > The fast path here misses any memory barrier. Thus it seems this query
> > could be in theory reordered before any store that happened to modify the
> > inode? Or maybe we could race and miss the fact that in fact this i_version
> > has already been queried? But maybe there's some higher level locking that
> > makes sure this is all a non-issue... But in that case it would deserve
> > some comment I guess.
> > 
> 
> There's no higher-level locking. Getting locking out of this codepath is
> a good thing IMO. The larger question here is whether we really care
> about ordering this with anything else.
> 
> The i_version, as implemented today, is not ordered with actual changes
> to the inode. We only take the i_lock today when modifying it, not when
> querying it. It's possible today that you could see the results of a
> change and then do a fetch of the i_version that doesn't show an
> increment vs. a previous change.
> 
> It'd be nice if this were atomic with the actual changes that it
> represents, but I think that would be prohibitively expensive. That may
> be something we need to address. I'm not sure we really want to do it as
> part of this patchset though.

I don't think we need full atomicity, but we *do* need the i_version
increment to be seen as ordered after the inode change.  That should be
relatively inexpensive, yes?

Otherwise an inode change could be overlooked indefinitely, because a
client could cache the old contents with the new i_version value and
never see the i_version change again as long as the inode isn't touched
again.

(If the client instead caches new data with the old inode version, there
may be an unnecessary cache invalidation next time the client queries
the change attribute.  That's a performance bug, but not really a
correctness problem, at least for clients depending only on
close-to-open semantics: they'll only depend on seeing the new change
attribute after the change has been flushed to disk.  The performance
problem should be mitigated by the race being very rare.)
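
As a concrete illustration of the ordering being asked for here -- this is
only a sketch, not code from the series; it assumes i_version is the
atomic64_t it becomes by the end of the set, and the i_size update stands in
for "the inode change":

	#include <linux/fs.h>
	#include <linux/atomic.h>

	/* writer: publish the change, then the new change attribute */
	static void example_change_and_bump(struct inode *inode, loff_t newsize)
	{
		i_size_write(inode, newsize);		/* the actual change */
		smp_wmb();				/* change visible before the bump */
		atomic64_add(2, &inode->i_version);	/* +2: the low bit is the QUERIED flag */
	}

	/* reader: fetch the change attribute before looking at the data */
	static u64 example_fetch_change_attr(struct inode *inode)
	{
		u64 v = atomic64_read(&inode->i_version);

		smp_rmb();	/* pairs with the smp_wmb() above: data read after
				 * this point cannot be older than what v describes */
		return v;
	}

With that pairing, an observer that sees the new i_version also sees the
change it describes; the reverse case (new data cached against an old change
attribute) is still possible, but as noted above that is only a spurious
invalidation.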

--b.


Re: btrfs subvolume list for not mounted filesystem?

2017-12-18 Thread Andrei Borzenkov
18.12.2017 19:49, Ulli Horlacher wrote:
> I want to mount an alternative subvolume of a btrfs filesystem.
> I can list the subvolumes when the filesystem is mounted, but how do I
> know them, when the filesystem is not mounted? Is there a query command?
> 
> root@xerus:~# mount | grep /test
> /dev/sdd4 on /test type btrfs 
> (rw,relatime,space_cache,user_subvol_rm_allowed,subvolid=5,subvol=/)
> 
> root@xerus:~# btrfs subvolume list /test
> ID 258 gen 156 top level 5 path tux/zz
> ID 259 gen 156 top level 5 path tux/z1
> ID 260 gen 156 top level 5 path tmp/zz
> ID 261 gen 19 top level 260 path tmp/zz/z1
> ID 269 gen 156 top level 5 path tux/test
> ID 271 gen 129 top level 269 path tux/test/.snapshot/2017-12-02_1341.test
> 
> root@xerus:~# umount /test
> 
> root@xerus:~# btrfs subvolume list /dev/sdd4
> ERROR: not a btrfs filesystem: /dev/sdd4
> ERROR: can't access '/dev/sdd4'
> 
> 
> 

btrfs-debug-tree -r /dev/sdd4

(or in more recent btrfs-progs, "btrfs inspect-internal dump-tree") will
show subvolume IDs (as ROOT_ITEM); I'm not sure if there is a simple way
to resolve subvolume path names on an unmounted device.


Re: [PATCH v3 19/19] fs: handle inode->i_version more efficiently

2017-12-18 Thread Jeff Layton
On Mon, 2017-12-18 at 17:34 +0100, Jan Kara wrote:
> On Mon 18-12-17 10:11:56, Jeff Layton wrote:
> >  static inline bool
> >  inode_maybe_inc_iversion(struct inode *inode, bool force)
> >  {
> > -   atomic64_t *ivp = (atomic64_t *)&inode->i_version;
> > +   u64 cur, old, new;
> > 
> > -   atomic64_inc(ivp);
> > +   cur = (u64)atomic64_read(&inode->i_version);
> > +   for (;;) {
> > +   /* If flag is clear then we needn't do anything */
> > +   if (!force && !(cur & I_VERSION_QUERIED))
> > +   return false;
> 
> The fast path here misses any memory barrier. Thus it seems this query
> could be in theory reordered before any store that happened to modify the
> inode? Or maybe we could race and miss the fact that in fact this i_version
> has already been queried? But maybe there's some higher level locking that
> makes sure this is all a non-issue... But in that case it would deserve
> some comment I guess.
> 

There's no higher-level locking. Getting locking out of this codepath is
a good thing IMO. The larger question here is whether we really care
about ordering this with anything else.

The i_version, as implemented today, is not ordered with actual changes
to the inode. We only take the i_lock today when modifying it, not when
querying it. It's possible today that you could see the results of a
change and then do a fetch of the i_version that doesn't show an
increment vs. a previous change.

It'd be nice if this were atomic with the actual changes that it
represents, but I think that would be prohibitively expensive. That may
be something we need to address. I'm not sure we really want to do it as
part of this patchset though.

> > +
> > +   /* Since lowest bit is flag, add 2 to avoid it */
> > +   new = (cur & ~I_VERSION_QUERIED) + I_VERSION_INCREMENT;
> > +
> > +   old = atomic64_cmpxchg(&inode->i_version, cur, new);
> > +   if (likely(old == cur))
> > +   break;
> > +   cur = old;
> > +   }
> > return true;
> >  }
> >  
> 
> ...
> 
> >  static inline u64
> >  inode_query_iversion(struct inode *inode)
> >  {
> > -   return inode_peek_iversion(inode);
> > +   u64 cur, old, new;
> > +
> > +   cur = atomic64_read(&inode->i_version);
> > +   for (;;) {
> > +   /* If flag is already set, then no need to swap */
> > +   if (cur & I_VERSION_QUERIED)
> > +   break;
> > +
> > +   new = cur | I_VERSION_QUERIED;
> > +   old = atomic64_cmpxchg(&inode->i_version, cur, new);
> > +   if (old == cur)
> > +   break;
> > +   cur = old;
> > +   }
> 
> Why not just use atomic64_or() here?
> 

If the cmpxchg fails, then either:

1) it was incremented
2) someone flagged it QUERIED

If an increment happened then we don't need to flag it as QUERIED if
we're returning an older value. If we use atomic64_or, then we can't
tell if an increment happened so we'd end up potentially flagging it
more than necessary.

In principle, either outcome is technically OK and we don't have to loop
if the cmpxchg doesn't work. That said, if we think there might be a
later i_version available, then I think we probably want to try to query
it again so we can return as recent a value as possible.
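
For comparison, the atomic64_or() approach Jan suggests would boil down to
something like this (purely illustrative, not proposed code; the helper name
is made up, I_VERSION_QUERIED is the flag bit from this series):

	#include <linux/fs.h>
	#include <linux/iversion.h>

	static inline void
	inode_flag_iversion_queried(struct inode *inode)
	{
		/*
		 * One-shot RMW: always succeeds, so it is simpler than the
		 * cmpxchg loop.  But because it cannot fail, we also cannot
		 * notice that a concurrent increment already happened, so we
		 * may set QUERIED more often than necessary and we never get
		 * the chance to retry and hand back a newer counter value.
		 */
		atomic64_or(I_VERSION_QUERIED, &inode->i_version);
	}
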
-- 
Jeff Layton 


btrfs subvolume list for not mounted filesystem?

2017-12-18 Thread Ulli Horlacher
I want to mount an alternative subvolume of a btrfs filesystem.
I can list the subvolumes when the filesystem is mounted, but how do I
know them, when the filesystem is not mounted? Is there a query command?

root@xerus:~# mount | grep /test
/dev/sdd4 on /test type btrfs 
(rw,relatime,space_cache,user_subvol_rm_allowed,subvolid=5,subvol=/)

root@xerus:~# btrfs subvolume list /test
ID 258 gen 156 top level 5 path tux/zz
ID 259 gen 156 top level 5 path tux/z1
ID 260 gen 156 top level 5 path tmp/zz
ID 261 gen 19 top level 260 path tmp/zz/z1
ID 269 gen 156 top level 5 path tux/test
ID 271 gen 129 top level 269 path tux/test/.snapshot/2017-12-02_1341.test

root@xerus:~# umount /test

root@xerus:~# btrfs subvolume list /dev/sdd4
ERROR: not a btrfs filesystem: /dev/sdd4
ERROR: can't access '/dev/sdd4'



-- 
Ullrich Horlacher  Server und Virtualisierung
Rechenzentrum TIK 
Universitaet Stuttgart E-Mail: horlac...@tik.uni-stuttgart.de
Allmandring 30aTel:++49-711-68565868
70569 Stuttgart (Germany)  WWW:http://www.tik.uni-stuttgart.de/
REF:<20171218164941.ga22...@rus.uni-stuttgart.de>


Re: [PATCH v3 19/19] fs: handle inode->i_version more efficiently

2017-12-18 Thread Jan Kara
On Mon 18-12-17 10:11:56, Jeff Layton wrote:
>  static inline bool
>  inode_maybe_inc_iversion(struct inode *inode, bool force)
>  {
> - atomic64_t *ivp = (atomic64_t *)&inode->i_version;
> + u64 cur, old, new;
> 
> - atomic64_inc(ivp);
> + cur = (u64)atomic64_read(&inode->i_version);
> + for (;;) {
> + /* If flag is clear then we needn't do anything */
> + if (!force && !(cur & I_VERSION_QUERIED))
> + return false;

The fast path here misses any memory barrier. Thus it seems this query
could be in theory reordered before any store that happened to modify the
inode? Or maybe we could race and miss the fact that in fact this i_version
has already been queried? But maybe there's some higher level locking that
makes sure this is all a non-issue... But in that case it would deserve
some comment I guess.

> +
> + /* Since lowest bit is flag, add 2 to avoid it */
> + new = (cur & ~I_VERSION_QUERIED) + I_VERSION_INCREMENT;
> +
> + old = atomic64_cmpxchg(&inode->i_version, cur, new);
> + if (likely(old == cur))
> + break;
> + cur = old;
> + }
>   return true;
>  }
>  

...

>  static inline u64
>  inode_query_iversion(struct inode *inode)
>  {
> - return inode_peek_iversion(inode);
> + u64 cur, old, new;
> +
> + cur = atomic64_read(&inode->i_version);
> + for (;;) {
> + /* If flag is already set, then no need to swap */
> + if (cur & I_VERSION_QUERIED)
> + break;
> +
> + new = cur | I_VERSION_QUERIED;
> + old = atomic64_cmpxchg(&inode->i_version, cur, new);
> + if (old == cur)
> + break;
> + cur = old;
> + }

Why not just use atomic64_or() here?

Honza
-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH RESEND 1/2] btrfs: handle volume split brain scenario

2017-12-18 Thread Austin S. Hemmelgarn

On 2017-12-18 09:39, Anand Jain wrote:





Now the procedure to assemble the disks would be to continue to mount
the good set first without the device set on which new data can be
ignored, and later run btrfs device scan to bring in the missing device
and complete the RAID group which then shall reset the flag
BTRFS_SUPER_FLAG_VOL_MOVED_ON.

Couple of thoughts on this:

1. This needs to go in _after_ the patches to allow the kernel to 
forget devices.


  Right, I will mention that explicitly; alternatively a kernel module
  reload can be used if that is an option.

  Otherwise, the only way to handle things is to physically disconnect 
the device you don't want, mount the other one, and then reconnect the 
disconnected device, because pretty much everything uses udev, which 
by default scans for BTRFS on every device that gets connected, and 
handling it like that is not an option for some people (potentially 
for quite a few people).


  Yeah. That's a problem. If it's an option then they can reload
  the module.
Which may not always be a choice (though I sincerely hope people aren't 
messing around with stuff like this with system-critical filesystems 
like / and /usr...).


2. How exactly is this supposed to work for people who don't have new 
enough btrfs-progs to tell the kernel to forget the device?  Ideally, 
there should be some other way to tell the kernel which one to use 
(perhaps a one-shot mount option?) without needing any userspace 
support (and if that happens, then I retract my first comment).  I'm 
not saying that the currently proposed method is a bad solution, just 
that it's not a complete solution from a usability perspective.


  Oh, I get it. But the current solution using the forget cli OR a kernel
  module reload is a general one, and the split-brain problem only arises
  from wrong usage in the first place.

  Your concern about not having the forget cli is only a short-term one,
  and I am not sure it is right to develop a new mount option for a
  short-term concern. I am OK either way; if more people prefer that
  approach I don't mind writing one.

3. How does this get handled with volumes using the raid1 profile and 
more than 2 devices?
In that case, things get complicated fast. Handling an array of N 
devices in raid1 mode where N-1 devices have this flag set is in 
theory easy, except you have no verification that the N-1 devices were 
mounted separately or as a group (in the first case, the user needs to 
pick one, in the second, it should automatically recover since they 
all agree on the state of the filesystem).


  Even in an N-device raid1, the number of devices that can be missing
  is still one. There will be at least one device common to both
  sets, so split brain can't happen.
No, there isn't a guarantee that that's the case.  Suppose you have 3 
devices in a raid1 and one goes missing.  The user can now, in trying to 
fix this, mount each of the two remaining devices separately (by 
accident of course), in which case you can still have a split brain 
situation.  The same logic applies for all values of N greater than 3 as 
well, with a greater number of possible situations for each higher N 
(for N=4, you have at least 4 ways the user can create a split-brain, 
for N=5 you have at least 11, etc).


I have no issue with there being no handling for this (it should be a 
reasonably rare case among people who don't have existing knowledge of 
how to fix it as things are already), but if that's the case it should 
ideally be explicitly stated that you don't handle this.


4. Having some way to do this solely from btrfs-progs would be nice 
too (so that you can just drop to a shell during boot, or drop to the 
initramfs, and fix things without having to mess with device 
scanning), but is secondary to the kernel-side IMHO.


  Please note this patch only detects the split-brain devices and keeps
  them from losing data, by failing the mount. However the fix itself is
  outside of this patch. Also, the cli 'btrfs dev forget' isn't purposely
  made for this; it is generic un-scan code which helps here
  otherwise as well.

  If we need a purposely built cli (btrfstune ?) to fix the SB, it
  can be done. But I think it will be complicated.
It's arguably easier to make a (possibly single) call to `btrfstune` to 
unset the flag on the device you want to drop and then mount again than 
it is to make the kernel forget it, mount degraded, and then re-scan 
devices (and sounds a whole lot safer and less error-prone too).


Even if that does become the case, having kernel code to mark the 
devices mounted degraded and refuse the mount in some circumstances is 
still needed though.


Other than all of that, I do think this is a good idea.  Being able to 
more sanely inform the user about the situation and help them figure 
out how to fix it correctly is something that's desperately needed here.


Thanks, Anand



Signed-off-by: Anand Jain 
---
On top of kdave misc-next.

  

Re: [PATCH v3 16/19] fs: only set S_VERSION when updating times if necessary

2017-12-18 Thread Jan Kara
On Mon 18-12-17 10:11:53, Jeff Layton wrote:
> From: Jeff Layton 
> 
> We only really need to update i_version if someone has queried for it
> since we last incremented it. By doing that, we can avoid having to
> update the inode if the times haven't changed.
> 
> If the times have changed, then we go ahead and forcibly increment the
> counter, under the assumption that we'll be going to the storage
> anyway, and the increment itself is relatively cheap.
> 
> Signed-off-by: Jeff Layton 
> ---
>  fs/inode.c | 20 ++--
>  1 file changed, 14 insertions(+), 6 deletions(-)
> 
> diff --git a/fs/inode.c b/fs/inode.c
> index 03102d6ef044..83f6cfc3cde7 100644
> --- a/fs/inode.c
> +++ b/fs/inode.c
> @@ -19,6 +19,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  #include "internal.h"
>  
>  /*
> @@ -1634,17 +1635,24 @@ static int relatime_need_update(const struct path 
> *path, struct inode *inode,
>  int generic_update_time(struct inode *inode, struct timespec *time, int 
> flags)
>  {
>   int iflags = I_DIRTY_TIME;
> + bool dirty = false;
>  
> - if (flags & S_ATIME)
> + if (flags & S_ATIME) {
>   inode->i_atime = *time;
> + dirty |= !(inode->i_sb->s_flags & SB_LAZYTIME);
> + }
>   if (flags & S_VERSION)
> - inode_inc_iversion(inode);
> - if (flags & S_CTIME)
> + dirty |= inode_maybe_inc_iversion(inode, dirty);
> + if (flags & S_CTIME) {
>   inode->i_ctime = *time;
> - if (flags & S_MTIME)
> + dirty = true;
> + }
> + if (flags & S_MTIME) {
>   inode->i_mtime = *time;
> + dirty = true;
> + }

The SB_LAZYTIME handling is wrong here. That option is not only about atime
handling but rather about all inode time stamps. So you rather need
something like:

if (flags & (S_ATIME | S_CTIME | S_MTIME) &&
!(inode->i_sb->s_flags & SB_LAZYTIME))
dirty = true;
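
FWIW, with that folded in the whole helper would end up looking roughly like
the following -- just a sketch of the suggestion, not an actual respin:

	int generic_update_time(struct inode *inode, struct timespec *time, int flags)
	{
		int iflags = I_DIRTY_TIME;
		bool dirty = false;

		if (flags & S_ATIME)
			inode->i_atime = *time;
		if (flags & S_CTIME)
			inode->i_ctime = *time;
		if (flags & S_MTIME)
			inode->i_mtime = *time;

		/* lazytime defers all pure timestamp updates, not just atime */
		if ((flags & (S_ATIME | S_CTIME | S_MTIME)) &&
		    !(inode->i_sb->s_flags & SB_LAZYTIME))
			dirty = true;

		/* force the increment when the inode is going to be dirtied anyway */
		if (flags & S_VERSION)
			dirty |= inode_maybe_inc_iversion(inode, dirty);

		if (dirty)
			iflags |= I_DIRTY_SYNC;
		__mark_inode_dirty(inode, iflags);
		return 0;
	}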

Honza

>  
> - if (!(inode->i_sb->s_flags & SB_LAZYTIME) || (flags & S_VERSION))
> + if (dirty)
>   iflags |= I_DIRTY_SYNC;
>   __mark_inode_dirty(inode, iflags);
>   return 0;
> @@ -1863,7 +1871,7 @@ int file_update_time(struct file *file)
>   if (!timespec_equal(&inode->i_ctime, &now))
>   sync_it |= S_CTIME;
>  
> - if (IS_I_VERSION(inode))
> + if (IS_I_VERSION(inode) && inode_iversion_need_inc(inode))
>   sync_it |= S_VERSION;
>  
>   if (!sync_it)
> -- 
> 2.14.3
> 
-- 
Jan Kara 
SUSE Labs, CR


[PATCH v3 01/19] fs: new API for handling inode->i_version

2017-12-18 Thread Jeff Layton
From: Jeff Layton 

Add a documentation blob that explains what the i_version field is, how
it is expected to work, and how it is currently implemented by various
filesystems.

We already have inode_inc_iversion. Add several other functions for
manipulating and accessing the i_version counter. For now, the
implementation is trivial and basically works the way that all of the
open-coded i_version accesses work today.

Future patches will convert existing users of i_version to use the new
API, and then convert the backend implementation to do things more
efficiently.

Signed-off-by: Jeff Layton 
---
 include/linux/fs.h   |  15 
 include/linux/iversion.h | 205 +++
 2 files changed, 205 insertions(+), 15 deletions(-)
 create mode 100644 include/linux/iversion.h

diff --git a/include/linux/fs.h b/include/linux/fs.h
index 511fbaabf624..76382c24e9d0 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2036,21 +2036,6 @@ static inline void inode_dec_link_count(struct inode 
*inode)
mark_inode_dirty(inode);
 }
 
-/**
- * inode_inc_iversion - increments i_version
- * @inode: inode that need to be updated
- *
- * Every time the inode is modified, the i_version field will be incremented.
- * The filesystem has to be mounted with i_version flag
- */
-
-static inline void inode_inc_iversion(struct inode *inode)
-{
-   spin_lock(&inode->i_lock);
-   inode->i_version++;
-   spin_unlock(&inode->i_lock);
-}
-
 enum file_time_flags {
S_ATIME = 1,
S_MTIME = 2,
diff --git a/include/linux/iversion.h b/include/linux/iversion.h
new file mode 100644
index ..bb50d27c71f9
--- /dev/null
+++ b/include/linux/iversion.h
@@ -0,0 +1,205 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _LINUX_IVERSION_H
+#define _LINUX_IVERSION_H
+
+#include 
+
+/*
+ * The change attribute (i_version) is mandated by NFSv4 and is mostly for
+ * knfsd, but is also used for other purposes (e.g. IMA). The i_version must
+ * appear different to observers if there was a change to the inode's data or
+ * metadata since it was last queried.
+ *
+ * It should be considered an opaque value by observers. If it remains the same
+ * since it was last checked, then nothing has changed in the inode. If it's
+ * different then something has changed. Observers cannot infer anything about
+ * the nature or magnitude of the changes from the value, only that the inode
+ * has changed in some fashion.
+ *
+ * Not all filesystems properly implement the i_version counter. Subsystems 
that
+ * want to use i_version field on an inode should first check whether the
+ * filesystem sets the SB_I_VERSION flag (usually via the IS_I_VERSION macro).
+ *
+ * Those that set SB_I_VERSION will automatically have their i_version counter
+ * incremented on writes to normal files. If the SB_I_VERSION is not set, then
+ * the VFS will not touch it on writes, and the filesystem can use it how it
+ * wishes. Note that the filesystem is always responsible for updating the
+ * i_version on namespace changes in directories (mkdir, rmdir, unlink, etc.).
+ * We consider these sorts of filesystems to have a kernel-managed i_version.
+ *
+ * Note that some filesystems (e.g. NFS and AFS) just use the field to store
+ * a server-provided value (for the most part). For that reason, those
+ * filesystems do not set SB_I_VERSION. These filesystems are considered to
+ * have a self-managed i_version.
+ */
+
+/**
+ * inode_set_iversion_raw - set i_version to the specified raw value
+ * @inode: inode to set
+ * @new: new i_version value to set
+ *
+ * Set @inode's i_version field to @new. This function is for use by
+ * filesystems that self-manage the i_version.
+ *
+ * For example, the NFS client stores its NFSv4 change attribute in this way,
+ * and the AFS client stores the data_version from the server here.
+ */
+static inline void
+inode_set_iversion_raw(struct inode *inode, const u64 new)
+{
+   inode->i_version = new;
+}
+
+/**
+ * inode_set_iversion - set i_version to a particular value
+ * @inode: inode to set
+ * @new: new i_version value to set
+ *
+ * Set @inode's i_version field to @new. This function is for filesystems with
+ * a kernel-managed i_version.
+ *
+ * For now, this just does the same thing as the _raw variant.
+ */
+static inline void
+inode_set_iversion(struct inode *inode, const u64 new)
+{
+   inode_set_iversion_raw(inode, new);
+}
+
+/**
+ * inode_set_iversion_queried - set i_version to a particular value and set
+ *  flag to indicate that it has been viewed
+ * @inode: inode to set
+ * @new: new i_version value to set
+ *
+ * When loading in an i_version value from a backing store, we typically don't
+ * know whether it was previously viewed before being stored or not. Thus, we
+ * must assume that it was, to ensure that any changes will result in the
+ * value changing.
+ *
+ * This function will set the 

[PATCH v3 00/19] fs: rework and optimize i_version handling in filesystems

2017-12-18 Thread Jeff Layton
From: Jeff Layton 

v3:
- move i_version handling functions to new header file
- document that the kernel-managed i_version implementation will appear to
  increase over time
- fix inode_cmp_iversion to handle wraparound correctly

v2:
- xfs should use inode_peek_iversion instead of inode_peek_iversion_raw
- rework file_update_time patch
- don't dirty inode when only S_ATIME is set and SB_LAZYTIME is enabled
- better comments and documentation

tl;dr: I think we can greatly reduce the cost of the inode->i_version
counter, by exploiting the fact that we don't need to increment it if no
one is looking at it. We can also clean up the code to prepare to
eventually expose this value via statx().

Note that this set relies on a few patches that are in other trees. The
full stack that I've been testing with is here:


https://git.kernel.org/pub/scm/linux/kernel/git/jlayton/linux.git/log/?h=iversion

The inode->i_version field is supposed to be a value that changes
whenever there is any data or metadata change to the inode. Some
filesystems use it internally to detect directory changes during
readdir. knfsd will use it if the filesystem has MS_I_VERSION set. IMA
will also use it to optimize away some remeasurement if it's available.
NFS and AFS just use it to store an opaque change attribute from the
server.

Only btrfs, ext4, and xfs increment it for data changes. Because of
this, these filesystems must log the inode to disk whenever the
i_version counter changes. That has a non-zero performance impact,
especially on write-heavy workloads, because we end up dirtying the
inode metadata on every write, not just when the times change.

It turns out though that none of these users of i_version require that
it change on every change to the file. The only real requirement is that
it be different if something changed since the last time we queried for
it.

If we keep track of when something queries the value, we can avoid
bumping the counter and an on-disk update when nothing else has changed
if no one has queried it since it was last incremented.

This patchset changes the code to only bump the i_version counter when
it's strictly necessary, or when we're updating the inode metadata
anyway (e.g. when times change).

It takes the approach of converting the existing accessors of i_version
to use a new API, while leaving the underlying implementation mostly the
same.  The last patch then converts the existing implementation to keep
track of whether the value has been queried since it was last
incremented. It then uses that to avoid incrementing the counter when
it can.
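
To make that concrete, the scheme in the last patch boils down to something
like the sketch below. It is deliberately simplified and non-atomic (the real
code closes the races between the two paths with cmpxchg loops); the constant
names match the series, the demo_* function names do not exist anywhere:

	#include <linux/atomic.h>
	#include <linux/fs.h>

	#define I_VERSION_QUERIED	1ULL	/* low bit: value has been handed out */
	#define I_VERSION_INCREMENT	2ULL	/* the real counter starts at bit 1 */

	/* write path: skip the bump (and the inode update) if nobody looked */
	static bool demo_maybe_inc_iversion(struct inode *inode, bool force)
	{
		u64 cur = atomic64_read(&inode->i_version);

		if (!force && !(cur & I_VERSION_QUERIED))
			return false;			/* nothing needs logging */
		atomic64_set(&inode->i_version,
			     (cur & ~I_VERSION_QUERIED) + I_VERSION_INCREMENT);
		return true;				/* caller dirties the inode */
	}

	/* query path: remember that the value escaped, so the next change bumps */
	static u64 demo_query_iversion(struct inode *inode)
	{
		u64 cur = atomic64_read(&inode->i_version);

		atomic64_set(&inode->i_version, cur | I_VERSION_QUERIED);
		return cur;
	}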

With this, we reduce inode metadata updates across all 3 filesystems
down to roughly the frequency of the timestamp granularity, particularly
when it's not being queried (the vastly common case).

I can see measurable performance gains on xfs and ext4 with iversion
enabled, when streaming small (4k) I/Os. 

btrfs shows some slight gain in testing, but not quite the magnitude
that xfs and ext4 show. I'm not sure why yet and would appreciate some
input from btrfs folks.

My goal is to get this into linux-next fairly soon. If it shows no
problems then we can look at merging it for 4.16, or 4.17 if all of the
prequisite patches are not yet merged.

Jeff Layton (19):
  fs: new API for handling inode->i_version
  fs: don't take the i_lock in inode_inc_iversion
  fat: convert to new i_version API
  affs: convert to new i_version API
  afs: convert to new i_version API
  btrfs: convert to new i_version API
  exofs: switch to new i_version API
  ext2: convert to new i_version API
  ext4: convert to new i_version API
  nfs: convert to new i_version API
  nfsd: convert to new i_version API
  ocfs2: convert to new i_version API
  ufs: use new i_version API
  xfs: convert to new i_version API
  IMA: switch IMA over to new i_version API
  fs: only set S_VERSION when updating times if necessary
  xfs: avoid setting XFS_ILOG_CORE if i_version doesn't need
incrementing
  btrfs: only dirty the inode in btrfs_update_time if something was
changed
  fs: handle inode->i_version more efficiently

 fs/affs/amigaffs.c|   5 +-
 fs/affs/dir.c |   5 +-
 fs/affs/super.c   |   3 +-
 fs/afs/fsclient.c |   3 +-
 fs/afs/inode.c|   5 +-
 fs/btrfs/delayed-inode.c  |   7 +-
 fs/btrfs/file.c   |   1 +
 fs/btrfs/inode.c  |  12 +-
 fs/btrfs/ioctl.c  |   1 +
 fs/btrfs/tree-log.c   |   4 +-
 fs/btrfs/xattr.c  |   1 +
 fs/exofs/dir.c|   9 +-
 fs/exofs/super.c  |   3 +-
 fs/ext2/dir.c |   9 +-
 fs/ext2/super.c   |   5 +-
 fs/ext4/dir.c |   9 +-
 fs/ext4/inline.c  |   7 +-
 fs/ext4/inode.c   |  13 +-
 fs/ext4/ioctl.c   |   3 +-
 fs/ext4/namei.c  

[PATCH v3 04/19] affs: convert to new i_version API

2017-12-18 Thread Jeff Layton
From: Jeff Layton 

Signed-off-by: Jeff Layton 
---
 fs/affs/amigaffs.c | 5 +++--
 fs/affs/dir.c  | 5 +++--
 fs/affs/super.c| 3 ++-
 3 files changed, 8 insertions(+), 5 deletions(-)

diff --git a/fs/affs/amigaffs.c b/fs/affs/amigaffs.c
index 0f0e6925e97d..14a6c1b90c9f 100644
--- a/fs/affs/amigaffs.c
+++ b/fs/affs/amigaffs.c
@@ -10,6 +10,7 @@
  */
 
 #include 
+#include 
 #include "affs.h"
 
 /*
@@ -60,7 +61,7 @@ affs_insert_hash(struct inode *dir, struct buffer_head *bh)
affs_brelse(dir_bh);
 
dir->i_mtime = dir->i_ctime = current_time(dir);
-   dir->i_version++;
+   inode_inc_iversion(dir);
mark_inode_dirty(dir);
 
return 0;
@@ -114,7 +115,7 @@ affs_remove_hash(struct inode *dir, struct buffer_head 
*rem_bh)
affs_brelse(bh);
 
dir->i_mtime = dir->i_ctime = current_time(dir);
-   dir->i_version++;
+   inode_inc_iversion(dir);
mark_inode_dirty(dir);
 
return retval;
diff --git a/fs/affs/dir.c b/fs/affs/dir.c
index a105e77df2c1..d180b46453cf 100644
--- a/fs/affs/dir.c
+++ b/fs/affs/dir.c
@@ -14,6 +14,7 @@
  *
  */
 
+#include 
 #include "affs.h"
 
 static int affs_readdir(struct file *, struct dir_context *);
@@ -80,7 +81,7 @@ affs_readdir(struct file *file, struct dir_context *ctx)
 * we can jump directly to where we left off.
 */
ino = (u32)(long)file->private_data;
-   if (ino && file->f_version == inode->i_version) {
+   if (ino && inode_cmp_iversion(inode, file->f_version) == 0) {
pr_debug("readdir() left off=%d\n", ino);
goto inside;
}
@@ -130,7 +131,7 @@ affs_readdir(struct file *file, struct dir_context *ctx)
} while (ino);
}
 done:
-   file->f_version = inode->i_version;
+   file->f_version = inode_query_iversion(inode);
file->private_data = (void *)(long)ino;
affs_brelse(fh_bh);
 
diff --git a/fs/affs/super.c b/fs/affs/super.c
index 1117e36134cc..e602619aed9d 100644
--- a/fs/affs/super.c
+++ b/fs/affs/super.c
@@ -21,6 +21,7 @@
 #include 
 #include 
 #include 
+#include 
 #include "affs.h"
 
 static int affs_statfs(struct dentry *dentry, struct kstatfs *buf);
@@ -102,7 +103,7 @@ static struct inode *affs_alloc_inode(struct super_block 
*sb)
if (!i)
return NULL;
 
-   i->vfs_inode.i_version = 1;
+   inode_set_iversion(&i->vfs_inode, 1);
i->i_lc = NULL;
i->i_ext_bh = NULL;
i->i_pa_cnt = 0;
-- 
2.14.3



[PATCH v3 03/19] fat: convert to new i_version API

2017-12-18 Thread Jeff Layton
From: Jeff Layton 

Signed-off-by: Jeff Layton 
---
 fs/fat/dir.c |  3 ++-
 fs/fat/inode.c   |  9 +
 fs/fat/namei_msdos.c |  7 ---
 fs/fat/namei_vfat.c  | 22 +++---
 4 files changed, 22 insertions(+), 19 deletions(-)

diff --git a/fs/fat/dir.c b/fs/fat/dir.c
index b833ffeee1e1..8e100c3bf72c 100644
--- a/fs/fat/dir.c
+++ b/fs/fat/dir.c
@@ -16,6 +16,7 @@
 #include 
 #include 
 #include 
+#include 
 #include "fat.h"
 
 /*
@@ -1055,7 +1056,7 @@ int fat_remove_entries(struct inode *dir, struct 
fat_slot_info *sinfo)
brelse(bh);
if (err)
return err;
-   dir->i_version++;
+   inode_inc_iversion(dir);
 
if (nr_slots) {
/*
diff --git a/fs/fat/inode.c b/fs/fat/inode.c
index 20a0a89eaca5..ffbbf0520d9e 100644
--- a/fs/fat/inode.c
+++ b/fs/fat/inode.c
@@ -20,6 +20,7 @@
 #include 
 #include 
 #include 
+#include 
 #include "fat.h"
 
 #ifndef CONFIG_FAT_DEFAULT_IOCHARSET
@@ -507,7 +508,7 @@ int fat_fill_inode(struct inode *inode, struct 
msdos_dir_entry *de)
MSDOS_I(inode)->i_pos = 0;
inode->i_uid = sbi->options.fs_uid;
inode->i_gid = sbi->options.fs_gid;
-   inode->i_version++;
+   inode_inc_iversion(inode);
inode->i_generation = get_seconds();
 
if ((de->attr & ATTR_DIR) && !IS_FREE(de->name)) {
@@ -590,7 +591,7 @@ struct inode *fat_build_inode(struct super_block *sb,
goto out;
}
inode->i_ino = iunique(sb, MSDOS_ROOT_INO);
-   inode->i_version = 1;
+   inode_set_iversion(inode, 1);
err = fat_fill_inode(inode, de);
if (err) {
iput(inode);
@@ -1377,7 +1378,7 @@ static int fat_read_root(struct inode *inode)
MSDOS_I(inode)->i_pos = MSDOS_ROOT_INO;
inode->i_uid = sbi->options.fs_uid;
inode->i_gid = sbi->options.fs_gid;
-   inode->i_version++;
+   inode_inc_iversion(inode);
inode->i_generation = 0;
inode->i_mode = fat_make_mode(sbi, ATTR_DIR, S_IRWXUGO);
inode->i_op = sbi->dir_ops;
@@ -1828,7 +1829,7 @@ int fat_fill_super(struct super_block *sb, void *data, 
int silent, int isvfat,
if (!root_inode)
goto out_fail;
root_inode->i_ino = MSDOS_ROOT_INO;
-   root_inode->i_version = 1;
+   inode_set_iversion(root_inode, 1);
error = fat_read_root(root_inode);
if (error < 0) {
iput(root_inode);
diff --git a/fs/fat/namei_msdos.c b/fs/fat/namei_msdos.c
index d24d2758a363..582ca731a6c9 100644
--- a/fs/fat/namei_msdos.c
+++ b/fs/fat/namei_msdos.c
@@ -7,6 +7,7 @@
  */
 
 #include 
+#include 
 #include "fat.h"
 
 /* Characters that are undesirable in an MS-DOS file name */
@@ -480,7 +481,7 @@ static int do_msdos_rename(struct inode *old_dir, unsigned 
char *old_name,
} else
mark_inode_dirty(old_inode);
 
-   old_dir->i_version++;
+   inode_inc_iversion(old_dir);
old_dir->i_ctime = old_dir->i_mtime = 
current_time(old_dir);
if (IS_DIRSYNC(old_dir))
(void)fat_sync_inode(old_dir);
@@ -508,7 +509,7 @@ static int do_msdos_rename(struct inode *old_dir, unsigned 
char *old_name,
goto out;
new_i_pos = sinfo.i_pos;
}
-   new_dir->i_version++;
+   inode_inc_iversion(new_dir);
 
fat_detach(old_inode);
fat_attach(old_inode, new_i_pos);
@@ -540,7 +541,7 @@ static int do_msdos_rename(struct inode *old_dir, unsigned 
char *old_name,
old_sinfo.bh = NULL;
if (err)
goto error_dotdot;
-   old_dir->i_version++;
+   inode_inc_iversion(old_dir);
old_dir->i_ctime = old_dir->i_mtime = ts;
if (IS_DIRSYNC(old_dir))
(void)fat_sync_inode(old_dir);
diff --git a/fs/fat/namei_vfat.c b/fs/fat/namei_vfat.c
index 02c03a3a..cefea792cde8 100644
--- a/fs/fat/namei_vfat.c
+++ b/fs/fat/namei_vfat.c
@@ -20,7 +20,7 @@
 #include 
 #include 
 #include 
-
+#include 
 #include "fat.h"
 
 static inline unsigned long vfat_d_version(struct dentry *dentry)
@@ -46,7 +46,7 @@ static int vfat_revalidate_shortname(struct dentry *dentry)
 {
int ret = 1;
spin_lock(&dentry->d_lock);
-   if (vfat_d_version(dentry) != d_inode(dentry->d_parent)->i_version)
+   if (inode_cmp_iversion(d_inode(dentry->d_parent), 
vfat_d_version(dentry)))
ret = 0;
spin_unlock(&dentry->d_lock);
return ret;
@@ -759,7 +759,7 @@ static struct dentry *vfat_lookup(struct inode *dir, struct 
dentry *dentry,
 out:
mutex_unlock(&MSDOS_SB(sb)->s_lock);
if (!inode)
-   vfat_d_version_set(dentry, dir->i_version);
+   vfat_d_version_set(dentry, inode_query_iversion(dir));
return d_splice_alias(inode, dentry);
 error:

[PATCH v3 02/19] fs: don't take the i_lock in inode_inc_iversion

2017-12-18 Thread Jeff Layton
From: Jeff Layton 

The rationale for taking the i_lock when incrementing this value is
lost in antiquity. The readers of the field don't take it (at least
not universally), so my assumption is that it was only done here to
serialize incrementors.

If that is indeed the case, then we can drop the i_lock from this
codepath and treat it as a atomic64_t for the purposes of
incrementing it. This allows us to use inode_inc_iversion without
any danger of lock inversion.

Note that the read side is not fetched atomically with this change.
The assumption here is that that is not a critical issue since the
i_version is not fully synchronized with anything else anyway.

Signed-off-by: Jeff Layton 
---
 include/linux/iversion.h | 7 ---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/include/linux/iversion.h b/include/linux/iversion.h
index bb50d27c71f9..e08c634779df 100644
--- a/include/linux/iversion.h
+++ b/include/linux/iversion.h
@@ -104,12 +104,13 @@ inode_set_iversion_queried(struct inode *inode, const u64 
new)
 static inline bool
 inode_maybe_inc_iversion(struct inode *inode, bool force)
 {
-   spin_lock(&inode->i_lock);
-   inode->i_version++;
-   spin_unlock(&inode->i_lock);
+   atomic64_t *ivp = (atomic64_t *)&inode->i_version;
+
+   atomic64_inc(ivp);
return true;
 }
 
+
 /**
  * inode_inc_iversion - forcibly increment i_version
  * @inode: inode that needs to be updated
-- 
2.14.3



[PATCH v3 05/19] afs: convert to new i_version API

2017-12-18 Thread Jeff Layton
From: Jeff Layton 

For AFS, it's generally treated as an opaque value, so we use the
*_raw variants of the API here.

Note that AFS has quite a different definition for this counter. AFS
only increments it on changes to the data, not for the metadata. We'll
need to reconcile that somehow if we ever want to present this to
userspace via statx.

Signed-off-by: Jeff Layton 
---
 fs/afs/fsclient.c | 3 ++-
 fs/afs/inode.c| 5 +++--
 2 files changed, 5 insertions(+), 3 deletions(-)

diff --git a/fs/afs/fsclient.c b/fs/afs/fsclient.c
index b90ef39ae914..88ec38c2d83c 100644
--- a/fs/afs/fsclient.c
+++ b/fs/afs/fsclient.c
@@ -13,6 +13,7 @@
 #include 
 #include 
 #include 
+#include 
 #include "internal.h"
 #include "afs_fs.h"
 
@@ -124,7 +125,7 @@ static void xdr_decode_AFSFetchStatus(const __be32 **_bp,
vnode->vfs_inode.i_ctime.tv_sec = status->mtime_client;
vnode->vfs_inode.i_mtime= vnode->vfs_inode.i_ctime;
vnode->vfs_inode.i_atime= vnode->vfs_inode.i_ctime;
-   vnode->vfs_inode.i_version  = data_version;
+   inode_set_iversion_raw(&vnode->vfs_inode, data_version);
}
 
expected_version = status->data_version;
diff --git a/fs/afs/inode.c b/fs/afs/inode.c
index 3415eb7484f6..dcd2e08d6cdb 100644
--- a/fs/afs/inode.c
+++ b/fs/afs/inode.c
@@ -21,6 +21,7 @@
 #include 
 #include 
 #include 
+#include 
 #include "internal.h"
 
 static const struct inode_operations afs_symlink_inode_operations = {
@@ -89,7 +90,7 @@ static int afs_inode_map_status(struct afs_vnode *vnode, 
struct key *key)
inode->i_atime  = inode->i_mtime = inode->i_ctime;
inode->i_blocks = 0;
inode->i_generation = vnode->fid.unique;
-   inode->i_version= vnode->status.data_version;
+   inode_set_iversion_raw(inode, vnode->status.data_version);
inode->i_mapping->a_ops = &afs_fs_aops;
 
read_sequnlock_excl(&vnode->cb_lock);
@@ -218,7 +219,7 @@ struct inode *afs_iget_autocell(struct inode *dir, const 
char *dev_name,
inode->i_ctime.tv_nsec  = 0;
inode->i_atime  = inode->i_mtime = inode->i_ctime;
inode->i_blocks = 0;
-   inode->i_version= 0;
+   inode_set_iversion_raw(inode, 0);
inode->i_generation = 0;
 
set_bit(AFS_VNODE_PSEUDODIR, >flags);
-- 
2.14.3



[PATCH v3 07/19] exofs: switch to new i_version API

2017-12-18 Thread Jeff Layton
From: Jeff Layton 

Signed-off-by: Jeff Layton 
---
 fs/exofs/dir.c   | 9 +
 fs/exofs/super.c | 3 ++-
 2 files changed, 7 insertions(+), 5 deletions(-)

diff --git a/fs/exofs/dir.c b/fs/exofs/dir.c
index 98233a97b7b8..c5a53fcc43ea 100644
--- a/fs/exofs/dir.c
+++ b/fs/exofs/dir.c
@@ -31,6 +31,7 @@
  * Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA  02110-1301  USA
  */
 
+#include 
 #include "exofs.h"
 
 static inline unsigned exofs_chunk_size(struct inode *inode)
@@ -60,7 +61,7 @@ static int exofs_commit_chunk(struct page *page, loff_t pos, 
unsigned len)
struct inode *dir = mapping->host;
int err = 0;
 
-   dir->i_version++;
+   inode_inc_iversion(dir);
 
if (!PageUptodate(page))
SetPageUptodate(page);
@@ -241,7 +242,7 @@ exofs_readdir(struct file *file, struct dir_context *ctx)
unsigned long n = pos >> PAGE_SHIFT;
unsigned long npages = dir_pages(inode);
unsigned chunk_mask = ~(exofs_chunk_size(inode)-1);
-   int need_revalidate = (file->f_version != inode->i_version);
+   bool need_revalidate = inode_cmp_iversion(inode, file->f_version);
 
if (pos > inode->i_size - EXOFS_DIR_REC_LEN(1))
return 0;
@@ -264,8 +265,8 @@ exofs_readdir(struct file *file, struct dir_context *ctx)
chunk_mask);
ctx->pos = (n<<PAGE_SHIFT) + offset;
-   file->f_version = inode->i_version;
-   need_revalidate = 0;
+   file->f_version = inode_query_iversion(inode);
+   need_revalidate = false;
}
de = (struct exofs_dir_entry *)(kaddr + offset);
limit = kaddr + exofs_last_byte(inode, n) -
diff --git a/fs/exofs/super.c b/fs/exofs/super.c
index 819624cfc8da..7e244093c0e5 100644
--- a/fs/exofs/super.c
+++ b/fs/exofs/super.c
@@ -38,6 +38,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include "exofs.h"
 
@@ -159,7 +160,7 @@ static struct inode *exofs_alloc_inode(struct super_block 
*sb)
if (!oi)
return NULL;
 
-   oi->vfs_inode.i_version = 1;
+   inode_set_iversion(&oi->vfs_inode, 1);
return &oi->vfs_inode;
 }
 
-- 
2.14.3



[PATCH v3 06/19] btrfs: convert to new i_version API

2017-12-18 Thread Jeff Layton
From: Jeff Layton 

Signed-off-by: Jeff Layton 
---
 fs/btrfs/delayed-inode.c | 7 +--
 fs/btrfs/file.c  | 1 +
 fs/btrfs/inode.c | 7 +--
 fs/btrfs/ioctl.c | 1 +
 fs/btrfs/tree-log.c  | 4 +++-
 fs/btrfs/xattr.c | 1 +
 6 files changed, 16 insertions(+), 5 deletions(-)

diff --git a/fs/btrfs/delayed-inode.c b/fs/btrfs/delayed-inode.c
index 5d73f79ded8b..6a246ae2bcb2 100644
--- a/fs/btrfs/delayed-inode.c
+++ b/fs/btrfs/delayed-inode.c
@@ -18,6 +18,7 @@
  */
 
 #include 
+#include 
 #include "delayed-inode.h"
 #include "disk-io.h"
 #include "transaction.h"
@@ -1700,7 +1701,8 @@ static void fill_stack_inode_item(struct 
btrfs_trans_handle *trans,
btrfs_set_stack_inode_nbytes(inode_item, inode_get_bytes(inode));
btrfs_set_stack_inode_generation(inode_item,
 BTRFS_I(inode)->generation);
-   btrfs_set_stack_inode_sequence(inode_item, inode->i_version);
+   btrfs_set_stack_inode_sequence(inode_item,
+  inode_peek_iversion(inode));
btrfs_set_stack_inode_transid(inode_item, trans->transid);
btrfs_set_stack_inode_rdev(inode_item, inode->i_rdev);
btrfs_set_stack_inode_flags(inode_item, BTRFS_I(inode)->flags);
@@ -1754,7 +1756,8 @@ int btrfs_fill_inode(struct inode *inode, u32 *rdev)
BTRFS_I(inode)->generation = btrfs_stack_inode_generation(inode_item);
 BTRFS_I(inode)->last_trans = btrfs_stack_inode_transid(inode_item);
 
-   inode->i_version = btrfs_stack_inode_sequence(inode_item);
+   inode_set_iversion_queried(inode,
+  btrfs_stack_inode_sequence(inode_item));
inode->i_rdev = 0;
*rdev = btrfs_stack_inode_rdev(inode_item);
BTRFS_I(inode)->flags = btrfs_stack_inode_flags(inode_item);
diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index eb1bac7c8553..c95d7b2efefb 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -31,6 +31,7 @@
 #include 
 #include 
 #include 
+#include 
 #include "ctree.h"
 #include "disk-io.h"
 #include "transaction.h"
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index e1a7f3cb5be9..ac8692849a81 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -43,6 +43,7 @@
 #include 
 #include 
 #include 
+#include 
 #include "ctree.h"
 #include "disk-io.h"
 #include "transaction.h"
@@ -3777,7 +3778,8 @@ static int btrfs_read_locked_inode(struct inode *inode)
BTRFS_I(inode)->generation = btrfs_inode_generation(leaf, inode_item);
BTRFS_I(inode)->last_trans = btrfs_inode_transid(leaf, inode_item);
 
-   inode->i_version = btrfs_inode_sequence(leaf, inode_item);
+   inode_set_iversion_queried(inode,
+  btrfs_inode_sequence(leaf, inode_item));
inode->i_generation = BTRFS_I(inode)->generation;
inode->i_rdev = 0;
rdev = btrfs_inode_rdev(leaf, inode_item);
@@ -3945,7 +3947,8 @@ static void fill_inode_item(struct btrfs_trans_handle 
*trans,
 &token);
btrfs_set_token_inode_generation(leaf, item, BTRFS_I(inode)->generation,
 &token);
-   btrfs_set_token_inode_sequence(leaf, item, inode->i_version, &token);
+   btrfs_set_token_inode_sequence(leaf, item, inode_peek_iversion(inode),
+  &token);
btrfs_set_token_inode_transid(leaf, item, trans->transid, &token);
btrfs_set_token_inode_rdev(leaf, item, inode->i_rdev, &token);
btrfs_set_token_inode_flags(leaf, item, BTRFS_I(inode)->flags, &token);
diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index 2ef8acaac688..aa452c9e2eff 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -43,6 +43,7 @@
 #include 
 #include 
 #include 
+#include 
 #include "ctree.h"
 #include "disk-io.h"
 #include "transaction.h"
diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
index 7bf9b31561db..1b7d92075c1f 100644
--- a/fs/btrfs/tree-log.c
+++ b/fs/btrfs/tree-log.c
@@ -20,6 +20,7 @@
 #include 
 #include 
 #include 
+#include 
 #include "tree-log.h"
 #include "disk-io.h"
 #include "locking.h"
@@ -3609,7 +3610,8 @@ static void fill_inode_item(struct btrfs_trans_handle 
*trans,
btrfs_set_token_inode_nbytes(leaf, item, inode_get_bytes(inode),
 &token);
 
-   btrfs_set_token_inode_sequence(leaf, item, inode->i_version, &token);
+   btrfs_set_token_inode_sequence(leaf, item,
+  inode_peek_iversion(inode), &token);
btrfs_set_token_inode_transid(leaf, item, trans->transid, &token);
btrfs_set_token_inode_rdev(leaf, item, inode->i_rdev, &token);
btrfs_set_token_inode_flags(leaf, item, BTRFS_I(inode)->flags, &token);
diff --git a/fs/btrfs/xattr.c b/fs/btrfs/xattr.c
index 2c7e53f9ff1b..5258c1714830 100644
--- a/fs/btrfs/xattr.c
+++ b/fs/btrfs/xattr.c
@@ -23,6 +23,7 @@
 #include 
 #include 
 #include 
+#include 
 #include "ctree.h"
 #include 

[PATCH v3 08/19] ext2: convert to new i_version API

2017-12-18 Thread Jeff Layton
From: Jeff Layton 

Signed-off-by: Jeff Layton 
Reviewed-by: Jan Kara 
---
 fs/ext2/dir.c   | 9 +
 fs/ext2/super.c | 5 +++--
 2 files changed, 8 insertions(+), 6 deletions(-)

diff --git a/fs/ext2/dir.c b/fs/ext2/dir.c
index 987647986f47..4111085a129f 100644
--- a/fs/ext2/dir.c
+++ b/fs/ext2/dir.c
@@ -26,6 +26,7 @@
 #include 
 #include 
 #include 
+#include 
 
 typedef struct ext2_dir_entry_2 ext2_dirent;
 
@@ -92,7 +93,7 @@ static int ext2_commit_chunk(struct page *page, loff_t pos, 
unsigned len)
struct inode *dir = mapping->host;
int err = 0;
 
-   dir->i_version++;
+   inode_inc_iversion(dir);
block_write_end(NULL, mapping, pos, len, len, page, NULL);
 
if (pos+len > dir->i_size) {
@@ -293,7 +294,7 @@ ext2_readdir(struct file *file, struct dir_context *ctx)
unsigned long npages = dir_pages(inode);
unsigned chunk_mask = ~(ext2_chunk_size(inode)-1);
unsigned char *types = NULL;
-   int need_revalidate = file->f_version != inode->i_version;
+   bool need_revalidate = inode_cmp_iversion(inode, file->f_version);
 
if (pos > inode->i_size - EXT2_DIR_REC_LEN(1))
return 0;
@@ -319,8 +320,8 @@ ext2_readdir(struct file *file, struct dir_context *ctx)
offset = ext2_validate_entry(kaddr, offset, 
chunk_mask);
ctx->pos = (n<<PAGE_SHIFT) + offset;
-   file->f_version = inode->i_version;
-   need_revalidate = 0;
+   file->f_version = inode_query_iversion(inode);
+   need_revalidate = false;
}
de = (ext2_dirent *)(kaddr+offset);
limit = kaddr + ext2_last_byte(inode, n) - EXT2_DIR_REC_LEN(1);
diff --git a/fs/ext2/super.c b/fs/ext2/super.c
index 7646818ab266..554c98b8a93a 100644
--- a/fs/ext2/super.c
+++ b/fs/ext2/super.c
@@ -33,6 +33,7 @@
 #include 
 #include 
 #include 
+#include 
 #include "ext2.h"
 #include "xattr.h"
 #include "acl.h"
@@ -184,7 +185,7 @@ static struct inode *ext2_alloc_inode(struct super_block 
*sb)
if (!ei)
return NULL;
ei->i_block_alloc_info = NULL;
-   ei->vfs_inode.i_version = 1;
+   inode_set_iversion(&ei->vfs_inode, 1);
 #ifdef CONFIG_QUOTA
memset(&ei->i_dquot, 0, sizeof(ei->i_dquot));
 #endif
@@ -1569,7 +1570,7 @@ static ssize_t ext2_quota_write(struct super_block *sb, 
int type,
return err;
if (inode->i_size < off+len-towrite)
i_size_write(inode, off+len-towrite);
-   inode->i_version++;
+   inode_inc_iversion(inode);
inode->i_mtime = inode->i_ctime = current_time(inode);
mark_inode_dirty(inode);
return len - towrite;
-- 
2.14.3



[PATCH v3 10/19] nfs: convert to new i_version API

2017-12-18 Thread Jeff Layton
From: Jeff Layton 

For NFS, we just use the "raw" API since the i_version is mostly
managed by the server. The exception there is when the client
holds a write delegation, but we only need to bump it once
there anyway to handle CB_GETATTR.

Signed-off-by: Jeff Layton 
---
 fs/nfs/delegation.c|  3 ++-
 fs/nfs/fscache-index.c |  5 +++--
 fs/nfs/inode.c | 18 +-
 fs/nfs/nfs4proc.c  | 10 ++
 fs/nfs/nfstrace.h  |  5 +++--
 fs/nfs/write.c |  8 +++-
 6 files changed, 26 insertions(+), 23 deletions(-)

diff --git a/fs/nfs/delegation.c b/fs/nfs/delegation.c
index ade44ca0c66c..d8b47624fee2 100644
--- a/fs/nfs/delegation.c
+++ b/fs/nfs/delegation.c
@@ -12,6 +12,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -347,7 +348,7 @@ int nfs_inode_set_delegation(struct inode *inode, struct 
rpc_cred *cred, struct
nfs4_stateid_copy(&delegation->stateid, &res->delegation);
delegation->type = res->delegation_type;
delegation->pagemod_limit = res->pagemod_limit;
-   delegation->change_attr = inode->i_version;
+   delegation->change_attr = inode_peek_iversion_raw(inode);
delegation->cred = get_rpccred(cred);
delegation->inode = inode;
delegation->flags = 1<vfs_inode.i_ctime;
 
if (NFS_SERVER(&nfsi->vfs_inode)->nfs_client->rpc_ops->version == 4)
-   auxdata.change_attr = nfsi->vfs_inode.i_version;
+   auxdata.change_attr = inode_peek_iversion_raw(&nfsi->vfs_inode);
 
if (bufmax > sizeof(auxdata))
bufmax = sizeof(auxdata);
@@ -243,7 +244,7 @@ enum fscache_checkaux nfs_fscache_inode_check_aux(void 
*cookie_netfs_data,
auxdata.ctime = nfsi->vfs_inode.i_ctime;
 
if (NFS_SERVER(&nfsi->vfs_inode)->nfs_client->rpc_ops->version == 4)
-   auxdata.change_attr = nfsi->vfs_inode.i_version;
+   auxdata.change_attr = inode_peek_iversion_raw(&nfsi->vfs_inode);
 
if (memcmp(data, &auxdata, datalen) != 0)
return FSCACHE_CHECKAUX_OBSOLETE;
diff --git a/fs/nfs/inode.c b/fs/nfs/inode.c
index b992d2382ffa..0b85cca1184b 100644
--- a/fs/nfs/inode.c
+++ b/fs/nfs/inode.c
@@ -38,8 +38,8 @@
 #include 
 #include 
 #include 
-
 #include 
+#include 
 
 #include "nfs4_fs.h"
 #include "callback.h"
@@ -483,7 +483,7 @@ nfs_fhget(struct super_block *sb, struct nfs_fh *fh, struct 
nfs_fattr *fattr, st
memset(&inode->i_atime, 0, sizeof(inode->i_atime));
memset(&inode->i_mtime, 0, sizeof(inode->i_mtime));
memset(&inode->i_ctime, 0, sizeof(inode->i_ctime));
-   inode->i_version = 0;
+   inode_set_iversion_raw(inode, 0);
inode->i_size = 0;
clear_nlink(inode);
inode->i_uid = make_kuid(&init_user_ns, -2);
@@ -508,7 +508,7 @@ nfs_fhget(struct super_block *sb, struct nfs_fh *fh, struct 
nfs_fattr *fattr, st
else if (nfs_server_capable(inode, NFS_CAP_CTIME))
nfs_set_cache_invalid(inode, NFS_INO_INVALID_ATTR);
if (fattr->valid & NFS_ATTR_FATTR_CHANGE)
-   inode->i_version = fattr->change_attr;
+   inode_set_iversion_raw(inode, fattr->change_attr);
else
nfs_set_cache_invalid(inode, NFS_INO_INVALID_ATTR
| NFS_INO_REVAL_PAGECACHE);
@@ -1289,8 +1289,8 @@ static unsigned long nfs_wcc_update_inode(struct inode 
*inode, struct nfs_fattr
 
if ((fattr->valid & NFS_ATTR_FATTR_PRECHANGE)
&& (fattr->valid & NFS_ATTR_FATTR_CHANGE)
-   && inode->i_version == fattr->pre_change_attr) {
-   inode->i_version = fattr->change_attr;
+   && !inode_cmp_iversion(inode, fattr->pre_change_attr)) {
+   inode_set_iversion_raw(inode, fattr->change_attr);
if (S_ISDIR(inode->i_mode))
nfs_set_cache_invalid(inode, NFS_INO_INVALID_DATA);
ret |= NFS_INO_INVALID_ATTR;
@@ -1348,7 +1348,7 @@ static int nfs_check_inode_attributes(struct inode 
*inode, struct nfs_fattr *fat
 
if (!nfs_file_has_buffered_writers(nfsi)) {
/* Verify a few of the more important attributes */
-   if ((fattr->valid & NFS_ATTR_FATTR_CHANGE) != 0 && 
inode->i_version != fattr->change_attr)
+   if ((fattr->valid & NFS_ATTR_FATTR_CHANGE) != 0 && 
inode_cmp_iversion(inode, fattr->change_attr))
invalid |= 

[PATCH v3 14/19] xfs: convert to new i_version API

2017-12-18 Thread Jeff Layton
From: Jeff Layton 

Signed-off-by: Jeff Layton 
---
 fs/xfs/libxfs/xfs_inode_buf.c | 7 +--
 fs/xfs/xfs_icache.c   | 5 +++--
 fs/xfs/xfs_inode.c| 3 ++-
 fs/xfs/xfs_inode_item.c   | 3 ++-
 fs/xfs/xfs_trans_inode.c  | 4 +++-
 5 files changed, 15 insertions(+), 7 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_inode_buf.c b/fs/xfs/libxfs/xfs_inode_buf.c
index 6b7989038d75..b9c0bf80669c 100644
--- a/fs/xfs/libxfs/xfs_inode_buf.c
+++ b/fs/xfs/libxfs/xfs_inode_buf.c
@@ -32,6 +32,8 @@
 #include "xfs_ialloc.h"
 #include "xfs_dir2.h"
 
+#include 
+
 /*
  * Check that none of the inode's in the buffer have a next
  * unlinked field of 0.
@@ -264,7 +266,8 @@ xfs_inode_from_disk(
to->di_flags= be16_to_cpu(from->di_flags);
 
if (to->di_version == 3) {
-   inode->i_version = be64_to_cpu(from->di_changecount);
+   inode_set_iversion_queried(inode,
+  be64_to_cpu(from->di_changecount));
to->di_crtime.t_sec = be32_to_cpu(from->di_crtime.t_sec);
to->di_crtime.t_nsec = be32_to_cpu(from->di_crtime.t_nsec);
to->di_flags2 = be64_to_cpu(from->di_flags2);
@@ -314,7 +317,7 @@ xfs_inode_to_disk(
to->di_flags = cpu_to_be16(from->di_flags);
 
if (from->di_version == 3) {
-   to->di_changecount = cpu_to_be64(inode->i_version);
+   to->di_changecount = cpu_to_be64(inode_peek_iversion(inode));
to->di_crtime.t_sec = cpu_to_be32(from->di_crtime.t_sec);
to->di_crtime.t_nsec = cpu_to_be32(from->di_crtime.t_nsec);
to->di_flags2 = cpu_to_be64(from->di_flags2);
diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
index 43005fbe8b1e..4c315adb05e6 100644
--- a/fs/xfs/xfs_icache.c
+++ b/fs/xfs/xfs_icache.c
@@ -37,6 +37,7 @@
 
 #include 
 #include 
+#include 
 
 /*
  * Allocate and initialise an xfs_inode.
@@ -293,14 +294,14 @@ xfs_reinit_inode(
int error;
uint32_tnlink = inode->i_nlink;
uint32_tgeneration = inode->i_generation;
-   uint64_tversion = inode->i_version;
+   uint64_tversion = inode_peek_iversion(inode);
umode_t mode = inode->i_mode;
 
error = inode_init_always(mp->m_super, inode);
 
set_nlink(inode, nlink);
inode->i_generation = generation;
-   inode->i_version = version;
+   inode_set_iversion_queried(inode, version);
inode->i_mode = mode;
return error;
 }
diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
index 801274126648..dfc5e60d8af3 100644
--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
@@ -16,6 +16,7 @@
  * Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301  USA
  */
 #include 
+#include 
 
 #include "xfs.h"
 #include "xfs_fs.h"
@@ -833,7 +834,7 @@ xfs_ialloc(
ip->i_d.di_flags = 0;
 
if (ip->i_d.di_version == 3) {
-   inode->i_version = 1;
+   inode_set_iversion(inode, 1);
ip->i_d.di_flags2 = 0;
ip->i_d.di_cowextsize = 0;
ip->i_d.di_crtime.t_sec = (int32_t)tv.tv_sec;
diff --git a/fs/xfs/xfs_inode_item.c b/fs/xfs/xfs_inode_item.c
index 6ee5c3bf19ad..7571abf5dfb3 100644
--- a/fs/xfs/xfs_inode_item.c
+++ b/fs/xfs/xfs_inode_item.c
@@ -30,6 +30,7 @@
 #include "xfs_buf_item.h"
 #include "xfs_log.h"
 
+#include 
 
 kmem_zone_t*xfs_ili_zone;  /* inode log item zone */
 
@@ -354,7 +355,7 @@ xfs_inode_to_log_dinode(
to->di_next_unlinked = NULLAGINO;
 
if (from->di_version == 3) {
-   to->di_changecount = inode->i_version;
+   to->di_changecount = inode_peek_iversion(inode);
to->di_crtime.t_sec = from->di_crtime.t_sec;
to->di_crtime.t_nsec = from->di_crtime.t_nsec;
to->di_flags2 = from->di_flags2;
diff --git a/fs/xfs/xfs_trans_inode.c b/fs/xfs/xfs_trans_inode.c
index daa7615497f9..225544327c4f 100644
--- a/fs/xfs/xfs_trans_inode.c
+++ b/fs/xfs/xfs_trans_inode.c
@@ -28,6 +28,8 @@
 #include "xfs_inode_item.h"
 #include "xfs_trace.h"
 
+#include 
+
 /*
  * Add a locked inode to the transaction.
  *
@@ -117,7 +119,7 @@ xfs_trans_log_inode(
 */
if (!(ip->i_itemp->ili_item.li_desc->lid_flags & XFS_LID_DIRTY) &&
IS_I_VERSION(VFS_I(ip))) {
-   VFS_I(ip)->i_version++;
+   inode_inc_iversion(VFS_I(ip));
flags |= XFS_ILOG_CORE;
}
 
-- 
2.14.3

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v3 09/19] ext4: convert to new i_version API

2017-12-18 Thread Jeff Layton
From: Jeff Layton 

Signed-off-by: Jeff Layton 
Acked-by: Theodore Ts'o 
---
 fs/ext4/dir.c|  9 +
 fs/ext4/inline.c |  7 ---
 fs/ext4/inode.c  | 13 +
 fs/ext4/ioctl.c  |  3 ++-
 fs/ext4/namei.c  |  5 +++--
 fs/ext4/super.c  |  3 ++-
 fs/ext4/xattr.c  |  5 +++--
 7 files changed, 28 insertions(+), 17 deletions(-)

diff --git a/fs/ext4/dir.c b/fs/ext4/dir.c
index d5babc9f222b..afda0a0499ce 100644
--- a/fs/ext4/dir.c
+++ b/fs/ext4/dir.c
@@ -25,6 +25,7 @@
 #include 
 #include 
 #include 
+#include 
 #include "ext4.h"
 #include "xattr.h"
 
@@ -208,7 +209,7 @@ static int ext4_readdir(struct file *file, struct 
dir_context *ctx)
 * readdir(2), then we might be pointing to an invalid
 * dirent right now.  Scan from the start of the block
 * to make sure. */
-   if (file->f_version != inode->i_version) {
+   if (inode_cmp_iversion(inode, file->f_version)) {
for (i = 0; i < sb->s_blocksize && i < offset; ) {
de = (struct ext4_dir_entry_2 *)
(bh->b_data + i);
@@ -227,7 +228,7 @@ static int ext4_readdir(struct file *file, struct 
dir_context *ctx)
offset = i;
ctx->pos = (ctx->pos & ~(sb->s_blocksize - 1))
| offset;
-   file->f_version = inode->i_version;
+   file->f_version = inode_query_iversion(inode);
}
 
while (ctx->pos < inode->i_size
@@ -568,10 +569,10 @@ static int ext4_dx_readdir(struct file *file, struct 
dir_context *ctx)
 * cached entries.
 */
if ((!info->curr_node) ||
-   (file->f_version != inode->i_version)) {
+   inode_cmp_iversion(inode, file->f_version)) {
info->curr_node = NULL;
free_rb_tree_fname(>root);
-   file->f_version = inode->i_version;
+   file->f_version = inode_query_iversion(inode);
ret = ext4_htree_fill_tree(file, info->curr_hash,
   info->curr_minor_hash,
   >next_hash);
diff --git a/fs/ext4/inline.c b/fs/ext4/inline.c
index 1367553c43bb..a8b987b71173 100644
--- a/fs/ext4/inline.c
+++ b/fs/ext4/inline.c
@@ -14,6 +14,7 @@
 
 #include 
 #include 
+#include 
 
 #include "ext4_jbd2.h"
 #include "ext4.h"
@@ -1042,7 +1043,7 @@ static int ext4_add_dirent_to_inline(handle_t *handle,
 */
dir->i_mtime = dir->i_ctime = current_time(dir);
ext4_update_dx_flag(dir);
-   dir->i_version++;
+   inode_inc_iversion(dir);
return 1;
 }
 
@@ -1494,7 +1495,7 @@ int ext4_read_inline_dir(struct file *file,
 * dirent right now.  Scan from the start of the inline
 * dir to make sure.
 */
-   if (file->f_version != inode->i_version) {
+   if (inode_cmp_iversion(inode, file->f_version)) {
for (i = 0; i < extra_size && i < offset;) {
/*
 * "." is with offset 0 and
@@ -1526,7 +1527,7 @@ int ext4_read_inline_dir(struct file *file,
}
offset = i;
ctx->pos = offset;
-   file->f_version = inode->i_version;
+   file->f_version = inode_query_iversion(inode);
}
 
while (ctx->pos < extra_size) {
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 7df2c5644e59..1b0d54b372f2 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -39,6 +39,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include "ext4_jbd2.h"
 #include "xattr.h"
@@ -4873,12 +4874,14 @@ struct inode *ext4_iget(struct super_block *sb, 
unsigned long ino)
EXT4_EINODE_GET_XTIME(i_crtime, ei, raw_inode);
 
if (likely(!test_opt2(inode->i_sb, HURD_COMPAT))) {
-   inode->i_version = le32_to_cpu(raw_inode->i_disk_version);
+   u64 ivers = le32_to_cpu(raw_inode->i_disk_version);
+
if (EXT4_INODE_SIZE(inode->i_sb) > EXT4_GOOD_OLD_INODE_SIZE) {
if (EXT4_FITS_IN_INODE(raw_inode, ei, i_version_hi))
-   inode->i_version |=
+   ivers |=
(__u64)(le32_to_cpu(raw_inode->i_version_hi)) << 32;
}
+   inode_set_iversion_queried(inode, ivers);
}
 
ret = 0;
@@ -5164,11 +5167,13 @@ static int ext4_do_update_inode(handle_t *handle,
}
 
if (likely(!test_opt2(inode->i_sb, HURD_COMPAT))) {
-   raw_inode->i_disk_version = cpu_to_le32(inode->i_version);
+   u64 ivers = inode_peek_iversion(inode);
+
+   raw_inode->i_disk_version = 

[PATCH v3 11/19] nfsd: convert to new i_version API

2017-12-18 Thread Jeff Layton
From: Jeff Layton 

Mostly just making sure we use the "get" wrappers so we know when
it is being fetched for later use.
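
For context, a small illustration of the distinction these wrappers draw
(a sketch, not part of the patch): the "query" helper records that the
value was observed, so the next change is guaranteed to bump it, while
"peek" reads the value without that side effect.

#include <linux/fs.h>
#include <linux/iversion.h>

/* Writing the counter to backing store is not a client observation. */
static u64 example_value_for_disk(struct inode *inode)
{
	return inode_peek_iversion(inode);
}

/* Handing it out as an NFSv4 change attribute is, so flag it as queried. */
static u64 example_value_for_client(struct inode *inode)
{
	return inode_query_iversion(inode);
}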

Signed-off-by: Jeff Layton 
---
 fs/nfsd/nfsfh.h | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/fs/nfsd/nfsfh.h b/fs/nfsd/nfsfh.h
index 43f31cf49bae..b8444189223b 100644
--- a/fs/nfsd/nfsfh.h
+++ b/fs/nfsd/nfsfh.h
@@ -11,6 +11,7 @@
 #include 
 #include 
 #include 
+#include 
 
 static inline __u32 ino_t_to_u32(ino_t ino)
 {
@@ -259,7 +260,7 @@ static inline u64 nfsd4_change_attribute(struct inode 
*inode)
chattr =  inode->i_ctime.tv_sec;
chattr <<= 30;
chattr += inode->i_ctime.tv_nsec;
-   chattr += inode->i_version;
+   chattr += inode_query_iversion(inode);
return chattr;
 }
 
-- 
2.14.3

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v3 13/19] ufs: use new i_version API

2017-12-18 Thread Jeff Layton
From: Jeff Layton 

Signed-off-by: Jeff Layton 
---
 fs/ufs/dir.c   | 9 +
 fs/ufs/inode.c | 3 ++-
 fs/ufs/super.c | 3 ++-
 3 files changed, 9 insertions(+), 6 deletions(-)

diff --git a/fs/ufs/dir.c b/fs/ufs/dir.c
index 2edc1755b7c5..50dfce000864 100644
--- a/fs/ufs/dir.c
+++ b/fs/ufs/dir.c
@@ -20,6 +20,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include "ufs_fs.h"
 #include "ufs.h"
@@ -47,7 +48,7 @@ static int ufs_commit_chunk(struct page *page, loff_t pos, 
unsigned len)
struct inode *dir = mapping->host;
int err = 0;
 
-   dir->i_version++;
+   inode_inc_iversion(dir);
block_write_end(NULL, mapping, pos, len, len, page, NULL);
if (pos+len > dir->i_size) {
i_size_write(dir, pos+len);
@@ -428,7 +429,7 @@ ufs_readdir(struct file *file, struct dir_context *ctx)
unsigned long n = pos >> PAGE_SHIFT;
unsigned long npages = dir_pages(inode);
unsigned chunk_mask = ~(UFS_SB(sb)->s_uspi->s_dirblksize - 1);
-   int need_revalidate = file->f_version != inode->i_version;
+   bool need_revalidate = inode_cmp_iversion(inode, file->f_version);
unsigned flags = UFS_SB(sb)->s_flags;
 
UFSD("BEGIN\n");
@@ -455,8 +456,8 @@ ufs_readdir(struct file *file, struct dir_context *ctx)
offset = ufs_validate_entry(sb, kaddr, offset, 
chunk_mask);
ctx->pos = (n<f_version = inode->i_version;
-   need_revalidate = 0;
+   file->f_version = inode_query_iversion(inode);
+   need_revalidate = false;
}
de = (struct ufs_dir_entry *)(kaddr+offset);
limit = kaddr + ufs_last_byte(inode, n) - UFS_DIR_REC_LEN(1);
diff --git a/fs/ufs/inode.c b/fs/ufs/inode.c
index afb601c0dda0..c843ec858cf7 100644
--- a/fs/ufs/inode.c
+++ b/fs/ufs/inode.c
@@ -36,6 +36,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include "ufs_fs.h"
 #include "ufs.h"
@@ -693,7 +694,7 @@ struct inode *ufs_iget(struct super_block *sb, unsigned 
long ino)
if (err)
goto bad_inode;
 
-   inode->i_version++;
+   inode_inc_iversion(inode);
ufsi->i_lastfrag =
(inode->i_size + uspi->s_fsize - 1) >> uspi->s_fshift;
ufsi->i_dir_start_lookup = 0;
diff --git a/fs/ufs/super.c b/fs/ufs/super.c
index 4d497e9c6883..b6ba80e05bff 100644
--- a/fs/ufs/super.c
+++ b/fs/ufs/super.c
@@ -88,6 +88,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include "ufs_fs.h"
 #include "ufs.h"
@@ -1440,7 +1441,7 @@ static struct inode *ufs_alloc_inode(struct super_block 
*sb)
if (!ei)
return NULL;
 
-   ei->vfs_inode.i_version = 1;
+   inode_set_iversion(>vfs_inode, 1);
seqlock_init(>meta_lock);
mutex_init(>truncate_mutex);
return >vfs_inode;
-- 
2.14.3

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v3 15/19] IMA: switch IMA over to new i_version API

2017-12-18 Thread Jeff Layton
From: Jeff Layton 

Signed-off-by: Jeff Layton 
---
 security/integrity/ima/ima_api.c  | 3 ++-
 security/integrity/ima/ima_main.c | 3 ++-
 2 files changed, 4 insertions(+), 2 deletions(-)

diff --git a/security/integrity/ima/ima_api.c b/security/integrity/ima/ima_api.c
index c7e8db0ea4c0..c6ae42266270 100644
--- a/security/integrity/ima/ima_api.c
+++ b/security/integrity/ima/ima_api.c
@@ -18,6 +18,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include "ima.h"
 
@@ -215,7 +216,7 @@ int ima_collect_measurement(struct integrity_iint_cache 
*iint,
 * which do not support i_version, support is limited to an initial
 * measurement/appraisal/audit.
 */
-   i_version = file_inode(file)->i_version;
+   i_version = inode_query_iversion(inode);
hash.hdr.algo = algo;
 
/* Initialize hash digest to 0's in case of failure */
diff --git a/security/integrity/ima/ima_main.c 
b/security/integrity/ima/ima_main.c
index 50b8254d..06a70c5a2329 100644
--- a/security/integrity/ima/ima_main.c
+++ b/security/integrity/ima/ima_main.c
@@ -24,6 +24,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include "ima.h"
 
@@ -128,7 +129,7 @@ static void ima_check_last_writer(struct 
integrity_iint_cache *iint,
inode_lock(inode);
if (atomic_read(>i_writecount) == 1) {
if (!IS_I_VERSION(inode) ||
-   (iint->version != inode->i_version) ||
+   inode_cmp_iversion(inode, iint->version) ||
(iint->flags & IMA_NEW_FILE)) {
iint->flags &= ~(IMA_DONE_MASK | IMA_NEW_FILE);
iint->measured_pcrs = 0;
-- 
2.14.3

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v3 12/19] ocfs2: convert to new i_version API

2017-12-18 Thread Jeff Layton
From: Jeff Layton 

Signed-off-by: Jeff Layton 
Reviewed-by: Jan Kara 
---
 fs/ocfs2/dir.c  | 15 ---
 fs/ocfs2/inode.c|  3 ++-
 fs/ocfs2/namei.c|  3 ++-
 fs/ocfs2/quota_global.c |  3 ++-
 4 files changed, 14 insertions(+), 10 deletions(-)

diff --git a/fs/ocfs2/dir.c b/fs/ocfs2/dir.c
index febe6312ceff..32f9c72dff17 100644
--- a/fs/ocfs2/dir.c
+++ b/fs/ocfs2/dir.c
@@ -42,6 +42,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 
@@ -1174,7 +1175,7 @@ static int __ocfs2_delete_entry(handle_t *handle, struct 
inode *dir,
le16_add_cpu(>rec_len,
le16_to_cpu(de->rec_len));
de->inode = 0;
-   dir->i_version++;
+   inode_inc_iversion(dir);
ocfs2_journal_dirty(handle, bh);
goto bail;
}
@@ -1729,7 +1730,7 @@ int __ocfs2_add_entry(handle_t *handle,
if (ocfs2_dir_indexed(dir))
ocfs2_recalc_free_list(dir, handle, lookup);
 
-   dir->i_version++;
+   inode_inc_iversion(dir);
ocfs2_journal_dirty(handle, insert_bh);
retval = 0;
goto bail;
@@ -1775,7 +1776,7 @@ static int ocfs2_dir_foreach_blk_id(struct inode *inode,
 * readdir(2), then we might be pointing to an invalid
 * dirent right now.  Scan from the start of the block
 * to make sure. */
-   if (*f_version != inode->i_version) {
+   if (inode_cmp_iversion(inode, *f_version)) {
for (i = 0; i < i_size_read(inode) && i < offset; ) {
de = (struct ocfs2_dir_entry *)
(data->id_data + i);
@@ -1791,7 +1792,7 @@ static int ocfs2_dir_foreach_blk_id(struct inode *inode,
i += le16_to_cpu(de->rec_len);
}
ctx->pos = offset = i;
-   *f_version = inode->i_version;
+   *f_version = inode_query_iversion(inode);
}
 
de = (struct ocfs2_dir_entry *) (data->id_data + ctx->pos);
@@ -1869,7 +1870,7 @@ static int ocfs2_dir_foreach_blk_el(struct inode *inode,
 * readdir(2), then we might be pointing to an invalid
 * dirent right now.  Scan from the start of the block
 * to make sure. */
-   if (*f_version != inode->i_version) {
+   if (inode_cmp_iversion(inode, *f_version)) {
for (i = 0; i < sb->s_blocksize && i < offset; ) {
de = (struct ocfs2_dir_entry *) (bh->b_data + 
i);
/* It's too expensive to do a full
@@ -1886,7 +1887,7 @@ static int ocfs2_dir_foreach_blk_el(struct inode *inode,
offset = i;
ctx->pos = (ctx->pos & ~(sb->s_blocksize - 1))
| offset;
-   *f_version = inode->i_version;
+   *f_version = inode_query_iversion(inode);
}
 
while (ctx->pos < i_size_read(inode)
@@ -1940,7 +1941,7 @@ static int ocfs2_dir_foreach_blk(struct inode *inode, u64 
*f_version,
  */
 int ocfs2_dir_foreach(struct inode *inode, struct dir_context *ctx)
 {
-   u64 version = inode->i_version;
+   u64 version = inode_query_iversion(inode);
ocfs2_dir_foreach_blk(inode, , ctx, true);
return 0;
 }
diff --git a/fs/ocfs2/inode.c b/fs/ocfs2/inode.c
index 1a1e0078ab38..d51b80edd972 100644
--- a/fs/ocfs2/inode.c
+++ b/fs/ocfs2/inode.c
@@ -28,6 +28,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 
@@ -302,7 +303,7 @@ void ocfs2_populate_inode(struct inode *inode, struct 
ocfs2_dinode *fe,
OCFS2_I(inode)->ip_attr = le32_to_cpu(fe->i_attr);
OCFS2_I(inode)->ip_dyn_features = le16_to_cpu(fe->i_dyn_features);
 
-   inode->i_version = 1;
+   inode_set_iversion(inode, 1);
inode->i_generation = le32_to_cpu(fe->i_generation);
inode->i_rdev = huge_decode_dev(le64_to_cpu(fe->id1.dev1.i_rdev));
inode->i_mode = le16_to_cpu(fe->i_mode);
diff --git a/fs/ocfs2/namei.c b/fs/ocfs2/namei.c
index 3b0a10d9b36f..c801eddc4bf3 100644
--- a/fs/ocfs2/namei.c
+++ b/fs/ocfs2/namei.c
@@ -41,6 +41,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 
@@ -1520,7 +1521,7 @@ static int ocfs2_rename(struct inode *old_dir,
mlog_errno(status);
goto bail;
}
-   new_dir->i_version++;
+   inode_inc_iversion(new_dir);
 
if (S_ISDIR(new_inode->i_mode))

[PATCH v3 17/19] xfs: avoid setting XFS_ILOG_CORE if i_version doesn't need incrementing

2017-12-18 Thread Jeff Layton
From: Jeff Layton 

If XFS_ILOG_CORE is already set then go ahead and bump the i_version
counter unconditionally; otherwise only set XFS_ILOG_CORE (and bump
i_version) when the value has actually been queried since the last
increment.

Signed-off-by: Jeff Layton 
---
 fs/xfs/xfs_trans_inode.c | 14 --
 1 file changed, 8 insertions(+), 6 deletions(-)

diff --git a/fs/xfs/xfs_trans_inode.c b/fs/xfs/xfs_trans_inode.c
index 225544327c4f..4a89da4b6fe7 100644
--- a/fs/xfs/xfs_trans_inode.c
+++ b/fs/xfs/xfs_trans_inode.c
@@ -112,15 +112,17 @@ xfs_trans_log_inode(
 
/*
 * First time we log the inode in a transaction, bump the inode change
-* counter if it is configured for this to occur. We don't use
-* inode_inc_version() because there is no need for extra locking around
-* i_version as we already hold the inode locked exclusively for
-* metadata modification.
+* counter if it is configured for this to occur. While we have the
+* inode locked exclusively for metadata modification, we can usually
+* avoid setting XFS_ILOG_CORE if no one has queried the value since
+* the last time it was incremented. If we have XFS_ILOG_CORE already
+* set however, then go ahead and bump the i_version counter
+* unconditionally.
 */
if (!(ip->i_itemp->ili_item.li_desc->lid_flags & XFS_LID_DIRTY) &&
IS_I_VERSION(VFS_I(ip))) {
-   inode_inc_iversion(VFS_I(ip));
-   flags |= XFS_ILOG_CORE;
+   if (inode_maybe_inc_iversion(VFS_I(ip), flags & XFS_ILOG_CORE))
+   flags |= XFS_ILOG_CORE;
}
 
tp->t_flags |= XFS_TRANS_DIRTY;
-- 
2.14.3

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v3 16/19] fs: only set S_VERSION when updating times if necessary

2017-12-18 Thread Jeff Layton
From: Jeff Layton 

We only really need to update i_version if someone has queried for it
since we last incremented it. By doing that, we can avoid having to
update the inode if the times haven't changed.

If the times have changed, then we go ahead and forcibly increment the
counter, under the assumption that we'll be going to the storage
anyway, and the increment itself is relatively cheap.
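
As a sketch of the pattern this enables (illustrative only; it assumes the
inode_maybe_inc_iversion() contract used in this series, i.e. it returns
true only when the counter actually moved, and is not the real
generic_update_time() body):

#include <linux/fs.h>
#include <linux/iversion.h>

/*
 * Bump i_version opportunistically: force the bump only when the inode
 * is being written back anyway, and feed the result back into the
 * "do we need to dirty the inode at all?" decision.
 */
static bool example_update_version(struct inode *inode, bool already_dirty)
{
	bool dirty = already_dirty;

	if (IS_I_VERSION(inode))
		dirty |= inode_maybe_inc_iversion(inode, already_dirty);

	return dirty;
}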

Signed-off-by: Jeff Layton 
---
 fs/inode.c | 20 ++--
 1 file changed, 14 insertions(+), 6 deletions(-)

diff --git a/fs/inode.c b/fs/inode.c
index 03102d6ef044..83f6cfc3cde7 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -19,6 +19,7 @@
 #include 
 #include 
 #include 
+#include 
 #include "internal.h"
 
 /*
@@ -1634,17 +1635,24 @@ static int relatime_need_update(const struct path 
*path, struct inode *inode,
 int generic_update_time(struct inode *inode, struct timespec *time, int flags)
 {
int iflags = I_DIRTY_TIME;
+   bool dirty = false;
 
-   if (flags & S_ATIME)
+   if (flags & S_ATIME) {
inode->i_atime = *time;
+   dirty |= !(inode->i_sb->s_flags & SB_LAZYTIME);
+   }
if (flags & S_VERSION)
-   inode_inc_iversion(inode);
-   if (flags & S_CTIME)
+   dirty |= inode_maybe_inc_iversion(inode, dirty);
+   if (flags & S_CTIME) {
inode->i_ctime = *time;
-   if (flags & S_MTIME)
+   dirty = true;
+   }
+   if (flags & S_MTIME) {
inode->i_mtime = *time;
+   dirty = true;
+   }
 
-   if (!(inode->i_sb->s_flags & SB_LAZYTIME) || (flags & S_VERSION))
+   if (dirty)
iflags |= I_DIRTY_SYNC;
__mark_inode_dirty(inode, iflags);
return 0;
@@ -1863,7 +1871,7 @@ int file_update_time(struct file *file)
if (!timespec_equal(>i_ctime, ))
sync_it |= S_CTIME;
 
-   if (IS_I_VERSION(inode))
+   if (IS_I_VERSION(inode) && inode_iversion_need_inc(inode))
sync_it |= S_VERSION;
 
if (!sync_it)
-- 
2.14.3

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v3 18/19] btrfs: only dirty the inode in btrfs_update_time if something was changed

2017-12-18 Thread Jeff Layton
From: Jeff Layton 

At this point, we know that "now" and the file times may differ, and we
suspect that the i_version has been flagged to be bumped. Attempt to
bump the i_version, and only mark the inode dirty if that actually
occurred or if one of the times was updated.

Signed-off-by: Jeff Layton 
---
 fs/btrfs/inode.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index ac8692849a81..76245323a7c8 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -6107,19 +6107,20 @@ static int btrfs_update_time(struct inode *inode, 
struct timespec *now,
 int flags)
 {
struct btrfs_root *root = BTRFS_I(inode)->root;
+   bool dirty = flags & ~S_VERSION;
 
if (btrfs_root_readonly(root))
return -EROFS;
 
if (flags & S_VERSION)
-   inode_inc_iversion(inode);
+   dirty |= inode_maybe_inc_iversion(inode, dirty);
if (flags & S_CTIME)
inode->i_ctime = *now;
if (flags & S_MTIME)
inode->i_mtime = *now;
if (flags & S_ATIME)
inode->i_atime = *now;
-   return btrfs_dirty_inode(inode);
+   return dirty ? btrfs_dirty_inode(inode) : 0;
 }
 
 /*
-- 
2.14.3

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v3 19/19] fs: handle inode->i_version more efficiently

2017-12-18 Thread Jeff Layton
From: Jeff Layton 

Since i_version is mostly treated as an opaque value, we can exploit that
fact to avoid incrementing it when no one is watching. With that change,
we can avoid incrementing the counter on writes, unless someone has
queried for it since it was last incremented. If the a/c/mtime don't
change, and the i_version hasn't changed, then there's no need to dirty
the inode metadata on a write.

Convert the i_version counter to an atomic64_t, and use the lowest order
bit to hold a flag that will tell whether anyone has queried the value
since it was last incremented.

When we go to maybe increment it, we fetch the value and check the flag
bit.  If it's clear then we don't need to do anything if the update
isn't being forced.

If we do need to update, then we increment the counter by 2, and clear
the flag bit, and then use a CAS op to swap it into place. If that
works, we return true. If it doesn't then do it again with the value
that we fetch from the CAS operation.

On the query side, if the flag is already set, then we just shift the
value down by 1 bit and return it. Otherwise, we set the flag in our
on-stack value and again use cmpxchg to swap it into place if it hasn't
changed. If it has, then we use the value from the cmpxchg as the new
"old" value and try again.

This method allows us to avoid incrementing the counter on writes (and
dirtying the metadata) under typical workloads. We only need to increment
if it has been queried since it was last changed.
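
Since the hunk below is cut off before the new helpers, here is a minimal,
self-contained userspace model of the scheme just described (C11 atomics;
the names, flag layout and main() are illustrative only, not the actual
<linux/iversion.h> code):

#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define I_VERSION_QUERIED	1ULL
#define I_VERSION_INCREMENT	2ULL

static _Atomic uint64_t i_version;

/* Bump the counter if forced, or if it was queried since the last bump. */
static bool maybe_inc_iversion(bool force)
{
	uint64_t cur = atomic_load(&i_version);

	for (;;) {
		uint64_t new;

		/* Nobody looked at it since the last bump, and no force: skip. */
		if (!force && !(cur & I_VERSION_QUERIED))
			return false;

		/* Increment by 2 and clear the QUERIED flag. */
		new = (cur + I_VERSION_INCREMENT) & ~I_VERSION_QUERIED;
		if (atomic_compare_exchange_weak(&i_version, &cur, new))
			return true;
		/* cur was reloaded by the failed CAS; try again. */
	}
}

/* Fetch the value and mark it queried so the next change bumps it. */
static uint64_t query_iversion(void)
{
	uint64_t cur = atomic_load(&i_version);

	for (;;) {
		uint64_t new = cur | I_VERSION_QUERIED;

		if (cur == new ||
		    atomic_compare_exchange_weak(&i_version, &cur, new))
			return cur >> 1;
	}
}

int main(void)
{
	maybe_inc_iversion(false);	/* skipped: never queried */
	printf("%llu\n", (unsigned long long)query_iversion());	/* 0 */
	maybe_inc_iversion(false);	/* bumps: it was queried above */
	printf("%llu\n", (unsigned long long)query_iversion());	/* 1 */
	return 0;
}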

Signed-off-by: Jeff Layton 
---
 include/linux/fs.h   |   2 +-
 include/linux/iversion.h | 180 ++-
 2 files changed, 131 insertions(+), 51 deletions(-)

diff --git a/include/linux/fs.h b/include/linux/fs.h
index 76382c24e9d0..6804d075933e 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -639,7 +639,7 @@ struct inode {
struct hlist_head   i_dentry;
struct rcu_head i_rcu;
};
-   u64 i_version;
+   atomic64_t  i_version;
atomic_ti_count;
atomic_ti_dio_count;
atomic_ti_writecount;
diff --git a/include/linux/iversion.h b/include/linux/iversion.h
index e08c634779df..a9fbf99709df 100644
--- a/include/linux/iversion.h
+++ b/include/linux/iversion.h
@@ -5,6 +5,8 @@
 #include 
 
 /*
+ * The inode->i_version field:
+ * ---
  * The change attribute (i_version) is mandated by NFSv4 and is mostly for
  * knfsd, but is also used for other purposes (e.g. IMA). The i_version must
  * appear different to observers if there was a change to the inode's data or
@@ -27,86 +29,143 @@
  * i_version on namespace changes in directories (mkdir, rmdir, unlink, etc.).
  * We consider these sorts of filesystems to have a kernel-managed i_version.
  *
+ * This implementation uses the low bit in the i_version field as a flag to
+ * track when the value has been queried. If it has not been queried since it
+ * was last incremented, we can skip the increment in most cases.
+ *
+ * In the event that we're updating the ctime, we will usually go ahead and
+ * bump the i_version anyway. Since that has to go to stable storage in some
+ * fashion, we might as well increment it as well.
+ *
+ * With this implementation, the value should always appear to observers to
+ * increase over time if the file has changed. It's recommended to use
+ * inode_cmp_iversion() helper to compare values.
+ *
  * Note that some filesystems (e.g. NFS and AFS) just use the field to store
- * a server-provided value (for the most part). For that reason, those
+ * a server-provided value for the most part. For that reason, those
  * filesystems do not set SB_I_VERSION. These filesystems are considered to
  * have a self-managed i_version.
+ *
+ * Persistently storing the i_version
+ * --
+ * Queries of the i_version field are not gated on them hitting the backing
+ * store. It's always possible that the host could crash after allowing
+ * a query of the value but before it has made it to disk.
+ *
+ * To mitigate this problem, filesystems should always use
+ * inode_set_iversion_queried when loading an existing inode from disk. This
+ * ensures that the next attempted inode increment will result in the value
+ * changing.
+ *
+ * Storing the value to disk therefore does not count as a query, so those
+ * filesystems should use inode_peek_iversion to grab the value to be stored.
+ * There is no need to flag the value as having been queried in that case.
  */
 
+/*
+ * We borrow the lowest bit in the i_version to use as a flag to tell whether
+ * it has been queried since we last incremented it. If it has, then we must
+ * increment it on the next change. After that, we can clear the flag and
+ * avoid incrementing it again until it has again been queried.
+ */
+#define 

Re: [PATCH v10 3/5] bpf: add a bpf_override_function helper

2017-12-18 Thread Daniel Borkmann
On 12/18/2017 10:51 AM, Masami Hiramatsu wrote:
> On Fri, 15 Dec 2017 14:12:54 -0500
> Josef Bacik  wrote:
>> From: Josef Bacik 
>>
>> Error injection is sloppy and very ad-hoc.  BPF could fill this niche
>> perfectly with its kprobe functionality.  We could make sure errors are
>> only triggered in specific call chains that we care about with very
>> specific situations.  Accomplish this with the bpf_override_function
>> helper.  This will modify the probed caller's return value to the
>> specified value and set the PC to an override function that simply
>> returns, bypassing the originally probed function.  This gives us a nice
>> clean way to implement systematic error injection for all of our code
>> paths.
> 
> OK, got it. I think the error_injectable function list should be defined
> in kernel/trace/bpf_trace.c because only bpf calls it and needs to care
> the "safeness".
> 
> [...]
>> diff --git a/arch/x86/kernel/kprobes/ftrace.c 
>> b/arch/x86/kernel/kprobes/ftrace.c
>> index 8dc0161cec8f..1ea748d682fd 100644
>> --- a/arch/x86/kernel/kprobes/ftrace.c
>> +++ b/arch/x86/kernel/kprobes/ftrace.c
>> @@ -97,3 +97,17 @@ int arch_prepare_kprobe_ftrace(struct kprobe *p)
>>  p->ainsn.boostable = false;
>>  return 0;
>>  }
>> +
>> +asmlinkage void override_func(void);
>> +asm(
>> +".type override_func, @function\n"
>> +"override_func:\n"
>> +"   ret\n"
>> +".size override_func, .-override_func\n"
>> +);
>> +
>> +void arch_ftrace_kprobe_override_function(struct pt_regs *regs)
>> +{
>> +regs->ip = (unsigned long)_func;
>> +}
>> +NOKPROBE_SYMBOL(arch_ftrace_kprobe_override_function);
> 
> Calling this as "override_function" is meaningless. This is a function
> which just return. So I think combination of just_return_func() and
> arch_bpf_override_func_just_return() will be better.
> 
> Moreover, this arch/x86/kernel/kprobes/ftrace.c is an architecture-
> dependent implementation of kprobes, not bpf.

Josef, please work out any necessary cleanups that would still need
to be addressed based on Masami's feedback and send them as follow-up
patches, thanks.

> Hmm, arch/x86/net/bpf_jit_comp.c will be better place?

(No, it's JIT only and I'd really prefer to keep it that way, mixing
 this would result in a huge mess.)
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH RESEND 1/2] btrfs: handle volume split brain scenario

2017-12-18 Thread Anand Jain



On 12/18/2017 08:52 PM, Nikolay Borisov wrote:



On 17.12.2017 15:52, Anand Jain wrote:

In two device configs of RAID1/RAID5 where one device can be missing
in the degraded mount, or in the configs such as four devices RAID6
where two devices can be missing, in these type of configs it can form
two separate set of devices where each of the set can be mounted without
the other set. And if user does that and writes the data, one set would
have to lose the data. As of now we just lose the data without the
user consent depending on the SB generation number which user won't have
any meaningful interpretations.


This is rather hard to parse. Something like:

In raid configs RAID1/RAID5/RAID6 it's possible to have some devices
missing which would render btrfs to be mounted in degraded state but
still be operational. In those cases it's possible (albeit highly
undesirable) that the degraded and missing parts of the filesystem are
mounted independently. When writes occur such split-brain scenarios
(caused by intentional user action) then one of the sides of the RAID
config will have to be blown away when bringing it back to the
consistent state.


 It make sense to put it like this. Thanks.



This patch introduces a new SB flag BTRFS_SUPER_FLAG_VOL_MOVED_ON which

MOVED_ON sounds way too colloquial. Something like VOL_FORCED_MOUNT or
VOL_PARTIAL_MOUNT would be better.

Another thing which comes to mind is why not piggyback on the degraded
mount capabilities of the filesystem - is it just a mount option or do
we have a flag which indicates this volume was mounted in degraded mode?
If there is no flag, perhaps name the flag VOL_DEGRADED_MOUNT?


 I struggled to name it. Thanks for the suggestion.
 How about just that BTRFS_SUPER_FLAG_DEGRADED.
 It matches BTRFS_MOUNT_DEGRADED.


will be set when RAID group is mounted with missing device, so when RAID
is mounted with no missing device and if all the devices contains this
flag then we know that each of these set are mounted without the other
set. In this scenario this patch will fail the mount so that user can
decide which set would they prefer to keep the data.


This is also hard to parse


 Sure will update.



Now the procedure to assemble the disks would be to continue to mount
the good set first without the device set on which new data can be
ignored, and later run btrfs device scan to bring in the missing device
and complete the RAID group which then shall reset the flag
BTRFS_SUPER_FLAG_VOL_MOVED_ON.

Signed-off-by: Anand Jain 
---
On top of kdave misc-next.

  fs/btrfs/disk-io.c  | 56 -
  fs/btrfs/volumes.c  | 14 +--
  include/uapi/linux/btrfs_tree.h |  1 +
  3 files changed, 68 insertions(+), 3 deletions(-)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index a3e9b74f6006..55bc6c846671 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -61,7 +61,8 @@
 BTRFS_HEADER_FLAG_RELOC |\
 BTRFS_SUPER_FLAG_ERROR |\
 BTRFS_SUPER_FLAG_SEEDING |\
-BTRFS_SUPER_FLAG_METADUMP)
+BTRFS_SUPER_FLAG_METADUMP|\
+BTRFS_SUPER_FLAG_VOL_MOVED_ON)
  
  static const struct extent_io_ops btree_extent_io_ops;

  static void end_workqueue_fn(struct btrfs_work *work);
@@ -2383,6 +2384,43 @@ static int btrfs_read_roots(struct btrfs_fs_info 
*fs_info)
return 0;
  }
  
+bool volume_has_split_brain(struct btrfs_fs_info *fs_info)

+{
+   unsigned long devs_moved_on = 0;
+   struct btrfs_fs_devices *fs_devs = fs_info->fs_devices;


almost every other instance which iterates fs_devices uses the
full name of the temp variable, fs_devices; let's do that here as well
for the sake of grep


 Ok.


+   struct list_head *head = _devs->devices;
+   struct btrfs_device *dev;
+
+again:
+   list_for_each_entry(dev, head, dev_list) {
+   struct buffer_head *bh;
+   struct btrfs_super_block *sb;
+
+   if (!dev->devid)
+   continue;
+
+   bh = btrfs_read_dev_super(dev->bdev);
+   if (IS_ERR(bh))
+   continue;
+
+   sb = (struct btrfs_super_block *)bh->b_data;
+   if (btrfs_super_flags(sb) & BTRFS_SUPER_FLAG_VOL_MOVED_ON)
+   devs_moved_on++;
+   brelse(bh);
+   }
+
+   fs_devs = fs_devs->seed;
+   if (fs_devs) {
+   head = _devs->devices;
+   goto again;
+   }
+
+   if (devs_moved_on == fs_info->fs_devices->total_devices)
+   return true;
+   else
+   return false;
+}
+
  int open_ctree(struct super_block *sb,
   struct btrfs_fs_devices *fs_devices,
   char *options)
@@ -2872,6 +2910,22 @@ int open_ctree(struct super_block *sb,
   

Re: [PATCH RESEND 1/2] btrfs: handle volume split brain scenario

2017-12-18 Thread Anand Jain





Now the procedure to assemble the disks would be to continue to mount
the good set first without the device set on which new data can be
ignored, and later run btrfs device scan to bring in the missing device
and complete the RAID group which then shall reset the flag
BTRFS_SUPER_FLAG_VOL_MOVED_ON.

Couple of thoughts on this:

1. This needs to go in _after_ the patches to allow the kernel to forget 
devices.


 Right, I will mention that explicitly, or a kernel module reload can
 be used if that's an option.

  Otherwise, the only way to handle things is to physically 
disconnect the device you don't want, mount the other one, and then 
reconnect the disconnected device, because pretty much everything uses 
udev, which by default scans for BTRFS on every device that gets 
connected, and handling it like that is not an option for some people 
(potentially for quite a few people).


 Yeah, that's a problem. If it's a choice then they can reload
 the module.

2. How exactly is this supposed to work for people who don't have new 
enough btrfs-progs to tell the kernel to forget the device?  Ideally, 
there should be some other way to tell the kernel which one to use 
(perhaps a one-shot mount option?) without needing any userspace support 
(and if that happens, then I retract my first comment).  I'm not saying 
that the currently proposed method is a bad solution, just that it's not 
a complete solution from a usability perspective.


 Oh, I got it. But the current solution using the forget CLI or a kernel
 module reload is a general solution; also, the split-brain problem
 arises from wrong usage.

 Your concern about not having the forget CLI is only a short-term
 concern, and I am not sure developing a new mount option for a
 short-term concern is right. I am OK either way; if more people prefer
 that way I don't mind writing one.

3. How does this get handled with volumes using the raid1 profile and 
more than 2 devices?
In that case, things get complicated fast. 
Handling an array of N devices in raid1 mode where N-1 devices have this 
flag set is in theory easy, except you have no verification that the N-1 
devices were mounted separately or as a group (in the first case, the 
user needs to pick one, in the second, it should automatically recover 
since they all agree on the state of the filesystem).


 Even in an N-device raid1, the number of devices that can be missing
 is still one. There will be at least one device which is common to
 both sets, so split brain can't happen.

4. Having some way to do this solely from btrfs-progs would be nice too 
(so that you can just drop to a shell during boot, or drop to the 
initramfs, and fix things without having to mess with device scanning), 
but is secondary to the kernel-side IMHO.


 Please note this patch only detects the split-brain condition and
 avoids losing data by failing the mount. The actual fix is outside of
 this patch. Also, the 'btrfs dev forget' CLI isn't purpose-built for
 this; it's generic un-scan code which helps here and elsewhere as
 well.

 If we need a purpose-built CLI (btrfstune?) to fix the SB, it can be
 done, but I think it would be complicated.

Other than all of that, I do think this is a good idea.  Being able to 
more sanely inform the user about the situation and help them figure out 
how to fix it correctly is something that's desperately needed here.


Thanks, Anand



Signed-off-by: Anand Jain 
---
On top of kdave misc-next.

  fs/btrfs/disk-io.c  | 56 
-

  fs/btrfs/volumes.c  | 14 +--
  include/uapi/linux/btrfs_tree.h |  1 +
  3 files changed, 68 insertions(+), 3 deletions(-)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index a3e9b74f6006..55bc6c846671 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -61,7 +61,8 @@
   BTRFS_HEADER_FLAG_RELOC |\
   BTRFS_SUPER_FLAG_ERROR |\
   BTRFS_SUPER_FLAG_SEEDING |\
- BTRFS_SUPER_FLAG_METADUMP)
+ BTRFS_SUPER_FLAG_METADUMP|\
+ BTRFS_SUPER_FLAG_VOL_MOVED_ON)
  static const struct extent_io_ops btree_extent_io_ops;
  static void end_workqueue_fn(struct btrfs_work *work);
@@ -2383,6 +2384,43 @@ static int btrfs_read_roots(struct 
btrfs_fs_info *fs_info)

  return 0;
  }
+bool volume_has_split_brain(struct btrfs_fs_info *fs_info)
+{
+    unsigned long devs_moved_on = 0;
+    struct btrfs_fs_devices *fs_devs = fs_info->fs_devices;
+    struct list_head *head = _devs->devices;
+    struct btrfs_device *dev;
+
+again:
+    list_for_each_entry(dev, head, dev_list) {
+    struct buffer_head *bh;
+    struct btrfs_super_block *sb;
+
+    if (!dev->devid)
+    continue;
+
+    bh = btrfs_read_dev_super(dev->bdev);
+    if (IS_ERR(bh))
+    continue;
+
+    sb = (struct btrfs_super_block *)bh->b_data;
+    if (btrfs_super_flags(sb) & 

Re: Unexpected raid1 behaviour

2017-12-18 Thread Anand Jain




what was intended is that it should be able to detect a previous
member block-device becoming available again as a different
device inode, which currently is very dangerous in some vital
situations.


 Peter, what's the dangerous part here?


  If device disappears, the patch [4] will completely take out the
  device from btrfs, and continues to RW in degraded mode.
  When it reappears then [5] will bring it back to the RW list.


but [5] relies on someone from userspace (presumably udev) actually
invoking BTRFS_IOC_SCAN_DEV/BTRFS_IOC_DEVICES_READY, no?


 Nikolay, yes. Most distro udev rules already do that: udev calls
 btrfs dev scan when the SB is overwritten from userland or when a
 device with a primary btrfs SB (re)appears.


Because
device_list_add is only ever called from btrfs_scan_one_device, which in
turn is called by either of the aforementioned IOCTLS or during mount
(which is not at play here).


 Hm. as above.

Thanks Anand

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v3] btrfs: handle dynamically reappearing missing device

2017-12-18 Thread Anand Jain



On 12/18/2017 08:01 PM, Nikolay Borisov wrote:



On 17.12.2017 05:04, Anand Jain wrote:

If the device is not present at the time of the (-o degraded) mount,
the mount context will create a dummy missing struct btrfs_device.
Later this device may reappear after the FS is mounted, and
then the device is included in the device list but it missed the
open_device part. So this patch handles that case by going
through the open_device steps which this device missed and finally
adds it to the device alloc list.

So now with this patch, to bring back the missing device user can run,

btrfs dev scan 

Without this kernel patch, 'btrfs fi show' and 'btrfs dev ready'
would tell you that the missing device has reappeared successfully,
but in the kernel FS layer it actually didn't.

Signed-off-by: Anand Jain 
---
This patch needs:
  [PATCH 0/4]  factor __btrfs_open_devices()
v3:
The check for missing in the device_list_add() is now a another
patch as its not related.
  btrfs: fix inconsistency during missing device rejoin

v2:
Add more comments.
Add more change log.
Add to check if device missing is set, to handle the case
dev open fail and user will rerun the dev scan.

  fs/btrfs/volumes.c | 57 --
  1 file changed, 55 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 93d65c72b731..5c3190c65f81 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -812,8 +812,61 @@ static noinline int device_list_add(const char *path,
rcu_string_free(device->name);
rcu_assign_pointer(device->name, name);
if (test_bit(BTRFS_DEV_STATE_MISSING, >dev_state)) {
-   fs_devices->missing_devices--;
-   clear_bit(BTRFS_DEV_STATE_MISSING, >dev_state);
+   int ret;
+   struct btrfs_fs_info *fs_info = fs_devices->fs_info;
+   fmode_t fmode = FMODE_READ | FMODE_WRITE | FMODE_EXCL;
+
+   if (btrfs_super_flags(disk_super) &
+   BTRFS_SUPER_FLAG_SEEDING)
+   fmode &= ~FMODE_WRITE;
+
+   /*
+* Missing can be set only when FS is mounted.
+* So here its always fs_devices->opened > 0 and most
+* of the struct device members are already updated by
+* the mount process even if this device was missing, so
+* now follow the normal open device procedure for this
+* device. The scrub will take care of filling the


So how is scrub supposed to be initiated - automatically or by the user
since it's assumed he will have seen the message that a device has been
added? Reading the comment I'd expect scrub is kicked automatically.


 Oh, right. It isn't a good idea to reset the degraded option (below) in
 this thread; instead let scrub do it. Then the user can notice too. In
 fact, even if all the devices are present, if the chunks do not match
 what the user has configured, then it's a degraded volume.

 When I was resetting the degraded option below I had that question,
 but now the thought process and reasoning are satisfying. Will update.
 Thanks.



+* missing stripes for raid56 and balance for raid1 and
+* raid10.
+*/
+   ASSERT(fs_devices->opened);
+   mutex_lock(_devices->device_list_mutex);
+   mutex_lock(_info->chunk_mutex);
+   /*
+* As of now do not fail the dev scan thread for the
+* reason that btrfs_open_one_device() fails and keep
+* the legacy dev scan requisites as it is.
+* And reset missing only if open is successful, as
+* user can rerun dev scan after fixing the device
+* for which the device open (below) failed.
+*/
+   ret = btrfs_open_one_device(fs_devices, device, fmode,
+   fs_info->bdev_holder);
+   if (!ret) {
+   fs_devices->missing_devices--;
+   clear_bit(BTRFS_DEV_STATE_MISSING,
+   >dev_state);
+btrfs_clear_opt(fs_info->mount_opt, DEGRADED);  <-- 




+   btrfs_warn(fs_info,
+   "BTRFS: device %s devid %llu joined\n",
+   path, devid);


why do we have to warn, this is considered to be a "good" thing so
perhaps btrfs_info?


 you are right, it should be btrfs_info. Will change.


+   }
+
+  

Re: Unexpected raid1 behaviour

2017-12-18 Thread Austin S. Hemmelgarn

On 2017-12-16 14:50, Dark Penguin wrote:

Could someone please point me towards some reading about how btrfs handles
multiple devices? Namely, kicking faulty devices and re-adding them.

I've been using btrfs on single devices for a while, but now I want to
start using it in raid1 mode. I booted into an Ubuntu 17.10 LiveCD and
tried to see how does it handle various situations. The experience left
me very surprised; I've tried a number of things, all of which produced
unexpected results.

Expounding a bit on Duncan's answer with some more specific info.


I create a btrfs raid1 filesystem on two hard drives and mount it.

- When I pull one of the drives out (simulating a simple cable failure,
which happens pretty often to me), the filesystem sometimes goes
read-only. ???
- But only after a while, and not always. ???
The filesystem won't go read-only until it hits an I/O error, and it's 
non-deterministic how long it will be before that happens on an idle 
filesystem that only sees read access (because the files being read may 
all be served from the page cache).

- When I fix the cable problem (plug the device back), it's immediately
"re-added" back. But I see no replication of the data I've written onto
a degraded filesystem... Nothing shows any problems, so "my filesystem
must be ok". ???
One of two things happens in this case, and why there is no re-sync is 
dependent on which happens, but both ultimately have to do with the fact 
that BTRFS assumes I/O errors are from device failures, and are at worst 
transient.  Either:


1. The device reappears with the same name. This happens if the time it 
was disconnected is less than the kernel's command timeout (30 seconds 
by default).  In this case, BTRFS may not even notice that the device 
was gone (and if it doesn't, then a re-sync isn't necessary, since it 
will retry all the writes it needs to).  In this case, BTRFS assumes the 
I/O errors were temporary, and keeps using the device after logging the 
errors.  If this happens, then you need to manually re-sync things by 
scrubbing the filesystem (or balancing, but scrubbing is preferred as it 
should run quicker and will only re-write what is actually needed).
2. The device reappears with a different name.  In this case, the device 
was gone long enough that the block layer is certain it was 
disconnected, and thus when it reappears and BTRFS still holds open 
references to the old device node, it gets a new device node.  In this 
case, if the 'new' device is scanned, BTRFS will recognize it as part of 
the FS, but will keep using the old device node.  The correct fix here 
is to unmount the filesystem, re-scan all devices, and then remount the 
filesystem and manually re-sync with a scrub.



- If I unmount the filesystem and then mount it back, I see all my
recent changes lost (everything I wrote during the "degraded" period).
I'm not quite sure about this, but I think BTRFS is rolling back to the 
last common generation number for some reason.



- If I continue working with a degraded raid1 filesystem (even without
damaging it further by re-adding the faulty device), after a while it
won't mount at all, even with "-o degraded".
This is (probably) a known bug relating to chunk handling.  In a two 
device volume using a raid1 profile with a missing device, older kernels 
(I don't remember when the fix went in, but I could have sworn it was in 
4.13) will (erroneously) generate single-profile chunks when they need 
to allocate new chunks.  When you then go to mount the filesystem, the 
check for the degraded mount-ability of the FS fails because there is a 
device missing and single profile chunks.


Now, even without that bug, it's never a good idea to run a storage 
array degraded for any extended period of time, regardless of what type 
of array it is (BTRFS, ZFS, MD, LVM, or even hardware RAID).  By keeping 
it in 'degraded' mode, you're essentially telling the system that the 
array will be fixed in a reasonably short time-frame, which impacts how 
it handles the array.  If you're not going to fix it almost immediately, 
you should almost always reshape the array to account for the missing 
device if at all possible, as that will improve relative data safety and 
generally get you better performance than running degraded will.


I can't wrap my head about all this. Either the kicked device should not
be re-added, or it should be re-added "properly", or it should at least
show some errors and not pretend nothing happened, right?..
BTRFS is not the best at error reporting at the moment.  If you check 
the output of `btrfs device stats` for that filesystem though, it should 
show non-zero values in the error counters (note that these counters are 
cumulative, so they are counts since the last time they were reset (or 
when the FS was created if they have never been reset).  Similarly, 
scrub should report errors, there should be error messages in the kernel 
log, and switching the FS to read-only 

Re: Unexpected raid1 behaviour

2017-12-18 Thread Austin S. Hemmelgarn

On 2017-12-17 10:48, Peter Grandi wrote:

"Duncan"'s reply is slightly optimistic in parts, so some
further information...

[ ... ]


Basically, at this point btrfs doesn't have "dynamic" device
handling.  That is, if a device disappears, it doesn't know
it.


That's just the consequence of what is a completely broken
conceptual model: the current way most multi-device profiles are
designed is that block-devices and only be "added" or "removed",
and cannot be "broken"/"missing". Therefore if IO fails, that is
just one IO failing, not the entire block-device going away.
The time when a block-device is noticed as sort-of missing is
when it is not available for "add"-ing at start.

Put another way, the multi-device design is/was based on the
demented idea that block-devices that are missing are/should be
"remove"d, so that a 2-device volume with a 'raid1' profile
becomes a 1-device volume with a 'single'/'dup' profile, and not
a 2-device volume with a missing block-device and an incomplete
'raid1' profile, even if things have been awkwardly moving in
that direction in recent years.

Note the above is not totally accurate today because various
hacks have been introduced to work around the various issues.
You do realize you just restated exactly what Duncan said, just in a 
much more verbose (and aggressively negative) manner...



Thus, if a device disappears, to get it back you really have
to reboot, or at least unload/reload the btrfs kernel module,
in ordered to clear the stale device state and have btrfs
rescan and reassociate devices with the matching filesystems.


IIRC that is not quite accurate: a "missing" device can be
nowadays "replace"d (by "devid") or "remove"d, the latter
possibly implying profile changes:

   
https://btrfs.wiki.kernel.org/index.php/Using_Btrfs_with_Multiple_Devices#Using_add_and_delete

Terrible tricks like this also work:

   https://www.spinics.net/lists/linux-btrfs/msg48394.html
While that is all true, none of that _fixes_ the issue of a device 
disappearing and then being reconnected.  In theory, you can use `btrfs 
device replace` to force BTRFS to acknowledge the new name (by 
'replacing' the missing device with the now returned device), but doing 
so is horribly inefficient as to not be worth it unless you have no 
other choice.



Meanwhile, as mentioned above, there's active work on proper
dynamic btrfs device tracking and management. It may or may
not be ready for 4.16, but once it goes in, btrfs should
properly detect a device going away and react accordingly,


I haven't seen that, but I doubt that it is the radical redesign
of the multi-device layer of Btrfs that is needed to give it
operational semantics similar to those of MD RAID, and that I
have vaguely described previously.
Anand has been working on hot spare support, and as part of that has 
done some work on handling of missing devices.



and it should detect a device coming back as a different
device too.


That is disagreeable because of poor terminology: I guess that
what was intended is that it should be able to detect a previous
member block-device becoming available again as a different
device inode, which currently is very dangerous in some vital
situations.
How exactly is this dangerous?  The only situation I can think of is if 
a bogus device is hot-plugged and happens to perfectly match all the 
required identifiers, and at that point you've either got someone 
attacking your system who already has sufficient access to do whatever 
the hell they want with it, or you did something exceedingly stupid, and 
both cases are dangerous by themselves.



Longer term, there's further patches that will provide a
hot-spare functionality, automatically bringing in a device
pre-configured as a hot- spare if a device disappears, but
that of course requires that btrfs properly recognize devices
disappearing and coming back first, so one thing at a time.


That would be trivial if the complete redesign of block-device
states of the Btrfs multi-device layer happened, adding an
"active" flag to an "accessible" flag to describe new member
states, for example.
No, it wouldn't be trivial, because a complete redesign of part of the 
filesystem would be needed.


My guess is that while logically consistent, the current
multi-device logic is fundamentally broken from an operational
point of view, and needs a complete replacement instead of
fixes.
Then why don't you go write up some patches yourself if you feel so 
strongly about it?


The fact is, the only cases where this is really an issue is if you've 
either got intermittently bad hardware, or are dealing with external 
storage devices.  For the majority of people who are using multi-device 
setups, the common case is internally connected fixed storage devices 
with properly working hardware, and for that use case, it works 
perfectly fine.  In fact, the only people I've seen any reports of 
issues from are either:


1. Testing the behavior of device management (such as 

Re: [PATCH RESEND 1/2] btrfs: handle volume split brain scenario

2017-12-18 Thread Nikolay Borisov


On 17.12.2017 15:52, Anand Jain wrote:
> In two device configs of RAID1/RAID5 where one device can be missing
> in the degraded mount, or in the configs such as four devices RAID6
> where two devices can be missing, in these type of configs it can form
> two separate set of devices where each of the set can be mounted without
> the other set. And if user does that and writes the data, one set would
> have to lose the data. As of now we just lose the data without the
> user consent depending on the SB generation number which user won't have
> any meaningful interpretations.

This is rather hard to parse. Something like:

In raid configs RAID1/RAID5/RAID6 it's possible to have some devices
missing which would render btrfs to be mounted in degraded state but
still be operational. In those cases it's possible (albeit highly
undesirable) that the degraded and missing parts of the filesystem are
mounted independently. When writes occur such split-brain scenarios
(caused by intentional user action) then one of the sides of the RAID
config will have to be blown away when bringing it back to the
consistent state.

> 
> This patch introduces a new SB flag BTRFS_SUPER_FLAG_VOL_MOVED_ON which
MOVED_ON sounds way too colloquial. Something like VOL_FORCED_MOUNT or
VOL_PARTIAL_MOUNT would be better.

Another thing which comes to mind is why not piggyback on the degraded
mount capabilities of the filesystem - is it just a mount option or do
we have a flag which indicates this volume was mounted in degraded mode?
If there is no flag, perhaps name the flag VOL_DEGRADED_MOUNT?

> will be set when RAID group is mounted with missing device, so when RAID
> is mounted with no missing device and if all the devices contains this
> flag then we know that each of these set are mounted without the other
> set. In this scenario this patch will fail the mount so that user can
> decide which set would they prefer to keep the data.

This is also hard to parse

> 
> Now the procedure to assemble the disks would be to continue to mount
> the good set first without the device set on which new data can be
> ignored, and later run btrfs device scan to bring in the missing device
> and complete the RAID group which then shall reset the flag
> BTRFS_SUPER_FLAG_VOL_MOVED_ON.
> 
> Signed-off-by: Anand Jain 
> ---
> On top of kdave misc-next.
> 
>  fs/btrfs/disk-io.c  | 56 
> -
>  fs/btrfs/volumes.c  | 14 +--
>  include/uapi/linux/btrfs_tree.h |  1 +
>  3 files changed, 68 insertions(+), 3 deletions(-)
> 
> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
> index a3e9b74f6006..55bc6c846671 100644
> --- a/fs/btrfs/disk-io.c
> +++ b/fs/btrfs/disk-io.c
> @@ -61,7 +61,8 @@
>BTRFS_HEADER_FLAG_RELOC |\
>BTRFS_SUPER_FLAG_ERROR |\
>BTRFS_SUPER_FLAG_SEEDING |\
> -  BTRFS_SUPER_FLAG_METADUMP)
> +  BTRFS_SUPER_FLAG_METADUMP|\
> +  BTRFS_SUPER_FLAG_VOL_MOVED_ON)
>  
>  static const struct extent_io_ops btree_extent_io_ops;
>  static void end_workqueue_fn(struct btrfs_work *work);
> @@ -2383,6 +2384,43 @@ static int btrfs_read_roots(struct btrfs_fs_info 
> *fs_info)
>   return 0;
>  }
>  
> +bool volume_has_split_brain(struct btrfs_fs_info *fs_info)
> +{
> + unsigned long devs_moved_on = 0;
> + struct btrfs_fs_devices *fs_devs = fs_info->fs_devices;

almost every other instance which iterates fs_devices uses the
full name of the temp variable, fs_devices; let's do that here as well
for the sake of grep

> + struct list_head *head = _devs->devices;
> + struct btrfs_device *dev;
> +
> +again:
> + list_for_each_entry(dev, head, dev_list) {
> + struct buffer_head *bh;
> + struct btrfs_super_block *sb;
> +
> + if (!dev->devid)
> + continue;
> +
> + bh = btrfs_read_dev_super(dev->bdev);
> + if (IS_ERR(bh))
> + continue;
> +
> + sb = (struct btrfs_super_block *)bh->b_data;
> + if (btrfs_super_flags(sb) & BTRFS_SUPER_FLAG_VOL_MOVED_ON)
> + devs_moved_on++;
> + brelse(bh);
> + }
> +
> + fs_devs = fs_devs->seed;
> + if (fs_devs) {
> + head = _devs->devices;
> + goto again;
> + }
> +
> + if (devs_moved_on == fs_info->fs_devices->total_devices)
> + return true;
> + else
> + return false;
> +}
> +
>  int open_ctree(struct super_block *sb,
>  struct btrfs_fs_devices *fs_devices,
>  char *options)
> @@ -2872,6 +2910,22 @@ int open_ctree(struct super_block *sb,
>   goto fail_sysfs;
>   }
>  
> + if (fs_info->fs_devices->missing_devices) {
> + btrfs_set_super_flags(fs_info->super_copy,
> + 

[PATCH V4 10/45] btrfs: avoid to access bvec table directly for a cloned bio

2017-12-18 Thread Ming Lei
Commit 17347cec15f919901c90 ("Btrfs: change how we iterate bios in endio")
mentioned that for dio the submitted bio may be fast cloned; we can't
access the bvec table directly for a cloned bio, so use
bio_get_first_bvec() to retrieve the 1st bvec.
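For readers not following the block-layer series: a fast-cloned bio shares
the bvec table with its parent while carrying its own iterator, which is why
bi_io_vec[0] cannot be trusted. A minimal sketch of the idea (the helper
name below is invented here; this is not the exact body of
bio_get_first_bvec() from the series):

#include <linux/bio.h>

/*
 * A fast-cloned bio shares ->bi_io_vec with its parent but has its own
 * ->bi_iter, so the first *logical* segment must be derived through the
 * iterator rather than read as bi_io_vec[0].
 */
static inline void first_bvec_of(struct bio *bio, struct bio_vec *bv)
{
        *bv = bvec_iter_bvec(bio->bi_io_vec, bio->bi_iter);
}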

Cc: Chris Mason 
Cc: Josef Bacik 
Cc: David Sterba 
Cc: linux-btrfs@vger.kernel.org
Cc: Liu Bo 
Reviewed-by: Liu Bo 
Acked-by: David Sterba 
Signed-off-by: Ming Lei 
---
 fs/btrfs/inode.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 4d5cb6e93c80..cb1e2d201434 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -8015,6 +8015,7 @@ static blk_status_t dio_read_error(struct inode *inode, 
struct bio *failed_bio,
int segs;
int ret;
blk_status_t status;
+   struct bio_vec bvec;
 
BUG_ON(bio_op(failed_bio) == REQ_OP_WRITE);
 
@@ -8030,8 +8031,9 @@ static blk_status_t dio_read_error(struct inode *inode, 
struct bio *failed_bio,
}
 
segs = bio_segments(failed_bio);
+   bio_get_first_bvec(failed_bio, &bvec);
if (segs > 1 ||
-   (failed_bio->bi_io_vec->bv_len > btrfs_inode_sectorsize(inode)))
+   (bvec.bv_len > btrfs_inode_sectorsize(inode)))
read_mode |= REQ_FAILFAST_DEV;
 
isector = start - btrfs_io_bio(failed_bio)->logical;
-- 
2.9.5



[PATCH V4 09/45] btrfs: avoid access to .bi_vcnt directly

2017-12-18 Thread Ming Lei
BTRFS uses bio->bi_vcnt to figure out page numbers, but this is no
longer correct once we start to enable multipage bvec.

So use bio_nr_pages() to do that instead.
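For illustration, counting pages by walking the bio instead of trusting
bi_vcnt looks roughly like this (a sketch with a made-up function name, not
the helper added by this series):

#include <linux/bio.h>

/*
 * Walk the bio with its iterator and count single-page segments; with
 * multipage bvecs, bi_vcnt counts bio_vecs, which no longer maps 1:1
 * to pages.
 */
static unsigned bio_count_segments(struct bio *bio)
{
        struct bio_vec bv;
        struct bvec_iter iter;
        unsigned nr = 0;

        bio_for_each_segment(bv, bio, iter)
                nr++;
        return nr;
}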

Cc: Chris Mason 
Cc: Josef Bacik 
Cc: David Sterba 
Cc: linux-btrfs@vger.kernel.org
Acked-by: David Sterba 
Signed-off-by: Ming Lei 
---
 fs/btrfs/extent_io.c | 9 +
 fs/btrfs/extent_io.h | 2 +-
 2 files changed, 6 insertions(+), 5 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 69cd63d4503d..d43360b33ef6 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -2257,7 +2257,7 @@ int btrfs_get_io_failure_record(struct inode *inode, u64 
start, u64 end,
return 0;
 }
 
-bool btrfs_check_repairable(struct inode *inode, struct bio *failed_bio,
+bool btrfs_check_repairable(struct inode *inode, unsigned failed_bio_pages,
   struct io_failure_record *failrec, int failed_mirror)
 {
struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
@@ -2281,7 +2281,7 @@ bool btrfs_check_repairable(struct inode *inode, struct 
bio *failed_bio,
 *  a) deliver good data to the caller
 *  b) correct the bad sectors on disk
 */
-   if (failed_bio->bi_vcnt > 1) {
+   if (failed_bio_pages > 1) {
/*
 * to fulfill b), we need to know the exact failing sectors, as
 * we don't want to rewrite any more than the failed ones. thus,
@@ -2374,6 +2374,7 @@ static int bio_readpage_error(struct bio *failed_bio, u64 
phy_offset,
int read_mode = 0;
blk_status_t status;
int ret;
+   unsigned failed_bio_pages = bio_pages_all(failed_bio);
 
BUG_ON(bio_op(failed_bio) == REQ_OP_WRITE);
 
@@ -2381,13 +2382,13 @@ static int bio_readpage_error(struct bio *failed_bio, 
u64 phy_offset,
if (ret)
return ret;
 
-   if (!btrfs_check_repairable(inode, failed_bio, failrec,
+   if (!btrfs_check_repairable(inode, failed_bio_pages, failrec,
failed_mirror)) {
free_io_failure(failure_tree, tree, failrec);
return -EIO;
}
 
-   if (failed_bio->bi_vcnt > 1)
+   if (failed_bio_pages > 1)
read_mode |= REQ_FAILFAST_DEV;
 
phy_offset >>= inode->i_sb->s_blocksize_bits;
diff --git a/fs/btrfs/extent_io.h b/fs/btrfs/extent_io.h
index 93dcae0c3183..20854d63c75b 100644
--- a/fs/btrfs/extent_io.h
+++ b/fs/btrfs/extent_io.h
@@ -540,7 +540,7 @@ void btrfs_free_io_failure_record(struct btrfs_inode 
*inode, u64 start,
u64 end);
 int btrfs_get_io_failure_record(struct inode *inode, u64 start, u64 end,
struct io_failure_record **failrec_ret);
-bool btrfs_check_repairable(struct inode *inode, struct bio *failed_bio,
+bool btrfs_check_repairable(struct inode *inode, unsigned failed_bio_pages,
struct io_failure_record *failrec, int fail_mirror);
 struct bio *btrfs_create_repair_bio(struct inode *inode, struct bio 
*failed_bio,
struct io_failure_record *failrec,
-- 
2.9.5



[PATCH V4 24/45] btrfs: use segment_last_page to get bio's last page

2017-12-18 Thread Ming Lei
Preparing for supporting multipage bvec.

Cc: Chris Mason 
Cc: Josef Bacik 
Cc: David Sterba 
Cc: linux-btrfs@vger.kernel.org
Signed-off-by: Ming Lei 
---
 fs/btrfs/compression.c | 5 -
 fs/btrfs/extent_io.c   | 5 +++--
 2 files changed, 7 insertions(+), 3 deletions(-)

diff --git a/fs/btrfs/compression.c b/fs/btrfs/compression.c
index 15430c9fc6d7..1a62725a5f08 100644
--- a/fs/btrfs/compression.c
+++ b/fs/btrfs/compression.c
@@ -412,8 +412,11 @@ blk_status_t btrfs_submit_compressed_write(struct inode 
*inode, u64 start,
 static u64 bio_end_offset(struct bio *bio)
 {
struct bio_vec *last = bio_last_bvec_all(bio);
+   struct bio_vec bv;
 
-   return page_offset(last->bv_page) + last->bv_len + last->bv_offset;
+   segment_last_page(last, &bv);
+
+   return page_offset(bv.bv_page) + bv.bv_len + bv.bv_offset;
 }
 
 static noinline int add_ra_bio_pages(struct inode *inode,
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 6984e0d5b00d..f466289b66a3 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -2726,11 +2726,12 @@ static int __must_check submit_one_bio(struct bio *bio, 
int mirror_num,
 {
blk_status_t ret = 0;
struct bio_vec *bvec = bio_last_bvec_all(bio);
-   struct page *page = bvec->bv_page;
+   struct bio_vec bv;
struct extent_io_tree *tree = bio->bi_private;
u64 start;
 
-   start = page_offset(page) + bvec->bv_offset;
+   segment_last_page(bvec, &bv);
+   start = page_offset(bv.bv_page) + bv.bv_offset;
 
bio->bi_private = NULL;
bio_get(bio);
-- 
2.9.5



Re: [PATCH RESEND 1/2] btrfs: handle volume split brain scenario

2017-12-18 Thread Austin S. Hemmelgarn

On 2017-12-17 08:52, Anand Jain wrote:

In two-device configs of RAID1/RAID5, where one device can be missing
in a degraded mount, or in configs such as four-device RAID6, where two
devices can be missing, it is possible to form two separate sets of
devices where each set can be mounted without the other. If the user
does that and writes data, one set will have to lose its data. As of
now we just lose the data without the user's consent, depending on the
SB generation number, which the user has no meaningful way to interpret.

This patch introduces a new SB flag BTRFS_SUPER_FLAG_VOL_MOVED_ON which
will be set when a RAID group is mounted with a missing device, so when
the RAID is mounted with no missing device and all the devices contain
this flag, we know that each of these sets was mounted without the
other. In this scenario this patch will fail the mount so that the user
can decide which set's data they prefer to keep.

Now the procedure to assemble the disks would be to continue to mount
the good set first without the device set on which new data can be
ignored, and later run btrfs device scan to bring in the missing device
and complete the RAID group which then shall reset the flag
BTRFS_SUPER_FLAG_VOL_MOVED_ON.
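The rule being described reduces to roughly the following check at mount
time (a sketch of the idea only, not the patch's actual code; the function
and counter names are invented here):

/*
 * With no device missing, seeing VOL_MOVED_ON on every member means the
 * two halves were mounted apart at some point, so refuse the normal
 * mount and let the user pick which side to keep.
 */
static bool suspect_split_brain(unsigned int total, unsigned int missing,
                                unsigned int flagged)
{
        return missing == 0 && flagged == total;
}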

Couple of thoughts on this:

1. This needs to go in _after_ the patches to allow the kernel to forget 
devices.  Otherwise, the only way to handle things is to physically 
disconnect the device you don't want, mount the other one, and then 
reconnect the disconnected device, because pretty much everything uses 
udev, which by default scans for BTRFS on every device that gets 
connected, and handling it like that is not an option for some people 
(potentially for quite a few people).
2. How exactly is this supposed to work for people who don't have new 
enough btrfs-progs to tell the kernel to forget the device?  Ideally, 
there should be some other way to tell the kernel which one to use 
(perhaps a one-shot mount option?) without needing any userspace support 
(and if that happens, then I retract my first comment).  I'm not saying 
that the currently proposed method is a bad solution, just that it's not 
a complete solution from a usability perspective.
3. How does this get handled with volumes using the raid1 profile and 
more than 2 devices?  In that case, things get complicated fast. 
Handling an array of N devices in raid1 mode where N-1 devices have this 
flag set is in theory easy, except you have no verification that the N-1 
devices were mounted separately or as a group (in the first case, the 
user needs to pick one, in the second, it should automatically recover 
since they all agree on the state of the filesystem).
4. Having some way to do this solely from btrfs-progs would be nice too 
(so that you can just drop to a shell during boot, or drop to the 
initramfs, and fix things without having to mess with device scanning), 
but is secondary to the kernel-side IMHO.


Other than all of that, I do think this is a good idea.  Being able to 
more sanely inform the user about the situation and help them figure out 
how to fix it correctly is something that's desperately needed here.


Signed-off-by: Anand Jain 
---
On top of kdave misc-next.

  fs/btrfs/disk-io.c  | 56 -
  fs/btrfs/volumes.c  | 14 +--
  include/uapi/linux/btrfs_tree.h |  1 +
  3 files changed, 68 insertions(+), 3 deletions(-)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index a3e9b74f6006..55bc6c846671 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -61,7 +61,8 @@
 BTRFS_HEADER_FLAG_RELOC |\
 BTRFS_SUPER_FLAG_ERROR |\
 BTRFS_SUPER_FLAG_SEEDING |\
-BTRFS_SUPER_FLAG_METADUMP)
+BTRFS_SUPER_FLAG_METADUMP|\
+BTRFS_SUPER_FLAG_VOL_MOVED_ON)
  
  static const struct extent_io_ops btree_extent_io_ops;

  static void end_workqueue_fn(struct btrfs_work *work);
@@ -2383,6 +2384,43 @@ static int btrfs_read_roots(struct btrfs_fs_info 
*fs_info)
return 0;
  }
  
+bool volume_has_split_brain(struct btrfs_fs_info *fs_info)

+{
+   unsigned long devs_moved_on = 0;
+   struct btrfs_fs_devices *fs_devs = fs_info->fs_devices;
+   struct list_head *head = &fs_devs->devices;
+   struct btrfs_device *dev;
+
+again:
+   list_for_each_entry(dev, head, dev_list) {
+   struct buffer_head *bh;
+   struct btrfs_super_block *sb;
+
+   if (!dev->devid)
+   continue;
+
+   bh = btrfs_read_dev_super(dev->bdev);
+   if (IS_ERR(bh))
+   continue;
+
+   sb = (struct btrfs_super_block *)bh->b_data;
+   if (btrfs_super_flags(sb) & BTRFS_SUPER_FLAG_VOL_MOVED_ON)
+

Re: Unexpected raid1 behaviour

2017-12-18 Thread Nikolay Borisov


On 18.12.2017 10:49, Anand Jain wrote:
> 
> 
>> Put another way, the multi-device design is/was based on the
>> demented idea that block-devices that are missing are/should be
>> "remove"d, so that a 2-device volume with a 'raid1' profile
>> becomes a 1-device volume with a 'single'/'dup' profile, and not
>> a 2-device volume with a missing block-device and an incomplete
>> 'raid1' profile, 
> 
>  Agreed. IMO degraded-raid1-single-chunk is an accidental feature
>  caused by [1], which we should revert back, since..
>    - balance (to raid1 chunk) may fail if FS is near full
>    - recovery (to raid1 chunk) will take more writes as compared
>  to recovery under degraded raid1 chunks
> 
>  [1]
>  commit 95669976bd7d30ae265db938ecb46a6b7f8cb893
>  Btrfs: don't consider the missing device when allocating new chunks
> 
>  There is an attempt to fix it [2], but will certainly takes time as
>  there are many things to fix around this.
> 
>  [2]
>  [PATCH RFC] btrfs: create degraded-RAID1 chunks
> 
>> even if things have been awkwardly moving in
>> that direction in recent years.
>> Note the above is not totally accurate today because various
>> hacks have been introduced to work around the various issues.
>  May be you are talking about [3]. Pls note its a workaround
>  patch (which I mentioned in its original patch). Its nice that
>  we fixed the availability issue through this patch and the
>  helper function it added also helps the other developments.
>  But for long term we need to work on [2].
> 
>  [3]
>  btrfs: Introduce a function to check if all chunks a OK for degraded rw
> mount
> 
>>> Thus, if a device disappears, to get it back you really have
>>> to reboot, or at least unload/reload the btrfs kernel module,
>>> in ordered to clear the stale device state and have btrfs
>>> rescan and reassociate devices with the matching filesystems.
>>
>> IIRC that is not quite accurate: a "missing" device can be
>> nowadays "replace"d (by "devid") or "remove"d, the latter
>> possibly implying profile changes:
>>
>>   
>> https://btrfs.wiki.kernel.org/index.php/Using_Btrfs_with_Multiple_Devices#Using_add_and_delete
>>
>>
>> Terrible tricks like this also work:
>>
>>    https://www.spinics.net/lists/linux-btrfs/msg48394.html
> 
>  Its replace, which isn't about bringing back a missing disk.
> 
> 
>>> Meanwhile, as mentioned above, there's active work on proper
>>> dynamic btrfs device tracking and management. It may or may
>>> not be ready for 4.16, but once it goes in, btrfs should
>>> properly detect a device going away and react accordingly,
>>
>> I haven't seen that, but I doubt that it is the radical redesign
>> of the multi-device layer of Btrfs that is needed to give it
>> operational semantics similar to those of MD RAID, and that I
>> have vaguely described previously.
> 
>  I agree that btrfs volume manager is incomplete in view of
>  data center RAS requisites, there are couple of critical
>  bugs and inconsistent design between raid profiles, but I
>  doubt if it needs a radical redesign.
> 
>  Pls take a look at [4], comments are appreciated as usual.
>  I have experimented with two approaches and both are reasonable. -
>  There isn't any harm to leave failed disk opened (but stop any
>  new IO to it). And there will be udev
>  'btrfs dev forget --mounted ' call when device disappears
>  so that we can close the device.
>  In the 2nd approach, close the failed device right away when disk
>  write fails, so that we continue to have only two device states.
>  I like the latter.
> 
>>> and it should detect a device coming back as a different
>>> device too.
>>
>> That is disagreeable because of poor terminology: I guess that
>> what was intended that it should be able to detect a previous
>> member block-device becoming available again as a different
>> device inode, which currently is very dangerous in some vital
>> situations.
> 
>  If device disappears, the patch [4] will completely take out the
>  device from btrfs, and continues to RW in degraded mode.
>  When it reappears then [5] will bring it back to the RW list.

but [5] relies on someone from userspace (presumably udev) actually
invoking BTRFS_IOC_SCAN_DEV/BTRFS_IOC_DEVICES_READY, no? Because
device_list_add is only ever called from btrfs_scan_one_device, which in
turn is called by either of the aforementioned ioctls or during mount
(which is not at play here).
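For reference, 'btrfs device scan <dev>' boils down to a small ioctl on the
control node; a minimal user-space sketch (error handling trimmed, function
name invented here, constants taken from linux/btrfs.h):

#include <fcntl.h>
#include <string.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/btrfs.h>

/* ask the kernel to scan one block device for a btrfs superblock */
static int btrfs_scan_one(const char *path)
{
        struct btrfs_ioctl_vol_args args = { 0 };
        int fd, ret;

        fd = open("/dev/btrfs-control", O_RDWR);
        if (fd < 0)
                return -1;
        strncpy(args.name, path, BTRFS_PATH_NAME_MAX);
        ret = ioctl(fd, BTRFS_IOC_SCAN_DEV, &args);
        close(fd);
        return ret;
}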

> 
>   [4]
>   btrfs: introduce device dynamic state transition to failed
>   [5]
>   btrfs: handle dynamically reappearing missing device
> 
>  From the btrfs original design, it always depends on device SB
>  fsid:uuid:devid so it does not matter about the device
>  path or device inode or device transport layer. For eg. Dynamically
>  you can bring a device under different transport and it will work
>  without any down time.
> 
> 
>> That would be trivial if the complete redesign of block-device
>> states of the Btrfs multi-device layer happened, adding an
>> "active" flag to an 

Re: [PATCH v3] btrfs: handle dynamically reappearing missing device

2017-12-18 Thread Nikolay Borisov


On 17.12.2017 05:04, Anand Jain wrote:
> If the device is not present at the time of (-o degrade) mount,
> the mount context will create a dummy missing struct btrfs_device.
> Later this device may reappear after the FS is mounted and
> then device is included in the device list but it missed the
> open_device part. So this patch handles that case by going
> through the open_device steps which this device missed and finally
> adds to the device alloc list.
> 
> So now with this patch, to bring back the missing device user can run,
> 
>btrfs dev scan 
> 
> Without this kernel patch, even though 'btrfs fi show' and 'btrfs
> dev ready' would tell you that missing device has reappeared
> successfully but actually in kernel FS layer it didn't.
> 
> Signed-off-by: Anand Jain 
> ---
> This patch needs:
>  [PATCH 0/4]  factor __btrfs_open_devices()
> v3:
> The check for missing in the device_list_add() is now a another
> patch as its not related.
>  btrfs: fix inconsistency during missing device rejoin
> 
> v2:
> Add more comments.
> Add more change log.
> Add to check if device missing is set, to handle the case
> dev open fail and user will rerun the dev scan.
> 
>  fs/btrfs/volumes.c | 57 
> --
>  1 file changed, 55 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
> index 93d65c72b731..5c3190c65f81 100644
> --- a/fs/btrfs/volumes.c
> +++ b/fs/btrfs/volumes.c
> @@ -812,8 +812,61 @@ static noinline int device_list_add(const char *path,
>   rcu_string_free(device->name);
>   rcu_assign_pointer(device->name, name);
>   if (test_bit(BTRFS_DEV_STATE_MISSING, &device->dev_state)) {
> - fs_devices->missing_devices--;
> - clear_bit(BTRFS_DEV_STATE_MISSING, &device->dev_state);
> + int ret;
> + struct btrfs_fs_info *fs_info = fs_devices->fs_info;
> + fmode_t fmode = FMODE_READ | FMODE_WRITE | FMODE_EXCL;
> +
> + if (btrfs_super_flags(disk_super) &
> + BTRFS_SUPER_FLAG_SEEDING)
> + fmode &= ~FMODE_WRITE;
> +
> + /*
> +  * Missing can be set only when FS is mounted.
> +  * So here its always fs_devices->opened > 0 and most
> +  * of the struct device members are already updated by
> +  * the mount process even if this device was missing, so
> +  * now follow the normal open device procedure for this
> +  * device. The scrub will take care of filling the

So how is scrub supposed to be initiated - automatically or by the user,
since it's assumed they will have seen the message that a device has been
added? Reading the comment I'd expect scrub to be kicked off automatically.

> +  * missing stripes for raid56 and balance for raid1 and
> +  * raid10.
> +  */
> + ASSERT(fs_devices->opened);
> + mutex_lock(&fs_devices->device_list_mutex);
> + mutex_lock(&fs_info->chunk_mutex);
> + /*
> +  * As of now do not fail the dev scan thread for the
> +  * reason that btrfs_open_one_device() fails and keep
> +  * the legacy dev scan requisites as it is.
> +  * And reset missing only if open is successful, as
> +  * user can rerun dev scan after fixing the device
> +  * for which the device open (below) failed.
> +  */
> + ret = btrfs_open_one_device(fs_devices, device, fmode,
> + fs_info->bdev_holder);
> + if (!ret) {
> + fs_devices->missing_devices--;
> + clear_bit(BTRFS_DEV_STATE_MISSING,
> + &device->dev_state);
> + btrfs_clear_opt(fs_info->mount_opt, DEGRADED);
> + btrfs_warn(fs_info,
> + "BTRFS: device %s devid %llu joined\n",
> + path, devid);

Why do we have to warn? This is considered to be a "good" thing, so
perhaps btrfs_info?

> + }
> +
> + if (test_bit(BTRFS_DEV_STATE_WRITEABLE,
> +  &device->dev_state) &&

Can a device that is missing really be writable? So we have 2 cases
where we add a missing device:

1. Via add_missing_dev, which sets dev_state_missing, so writable will
be false.

2. In read_one_dev when we have successfully found the device from
btrfs_find_device but it doesn't have a ->bdev member, in which case we
don't really 

Re: Unexpected raid1 behaviour

2017-12-18 Thread Peter Grandi
>> I haven't seen that, but I doubt that it is the radical
>> redesign of the multi-device layer of Btrfs that is needed to
>> give it operational semantics similar to those of MD RAID,
>> and that I have vaguely described previously.

> I agree that btrfs volume manager is incomplete in view of
> data center RAS requisites, there are couple of critical
> bugs and inconsistent design between raid profiles, but I
> doubt if it needs a radical redesign.

Well it needs a radical redesign because the original design was
based on an entirely consistent and logical concept that was
quite different from that required for sensible operations, and
then special-case case was added (and keeps being added) to
fix the consequences.

But I suspect that it does not need a radical *recoding*,
because most if not all of the needed code is already there.
All that needs changing most likely is the member state-machine,
that's the bit that need a radical redesign, and it is a
relatively small part of the whole.

The closer the member state-machine design is to the MD RAID one
the better as it is a very workable, proven model.

Sometimes I suspect that the design needs to be changed to also
add a formal notion of "stripe" to the Btrfs internals, where a
"stripe" is a collection of chunks that are "related" (and
something like that is already part of the 'raid10' profile),
but I think that need not be user-visible.
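To make the comparison concrete, an MD-style member state machine for btrfs
devices might look something like this (purely an illustration of the
argument above; the state names are invented here, not proposed kernel API):

enum member_state {
        MEMBER_ACTIVE,          /* present, in sync, used for reads and writes */
        MEMBER_DEGRADED,        /* accessible but behind, needs catching up */
        MEMBER_FAILED,          /* present but no longer written to */
        MEMBER_MISSING,         /* not accessible at the moment */
        MEMBER_SPARE,           /* available to replace a failed member */
};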


Re: [PATCH] btrfs: Fix memory leak on multiple mounts

2017-12-18 Thread Nikolay Borisov


On 18.12.2017 12:03, Nikolay Borisov wrote:
> Currently if a mounted-btrfs instance is mounted for the 2nd time
> without first unmounting the first instance then we hit a memory leak
> in btrfs_mount_root due to the fs_info of the acquired superblock is
> different than the newly allocated fs info. Fix this by specifically
> checking if the fs_info instance of the newly acquired superblock is
> the same as ours and free it if not.
> 
> Reproducer:
> 
> mount /dev/vdc /media/scratch
> mount /dev/vdc /media/scratch2 <- memory leak hit
> 
> Signed-off-by: Nikolay Borisov 

This one fixes:

4da82a9e60a1 btrfs: add btrfs_mount_root() and new file_system_type

Which seems to be in for-next only.

> ---
>  fs/btrfs/super.c | 6 +-
>  1 file changed, 5 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
> index e57eb6e70278..ea3bca85be44 100644
> --- a/fs/btrfs/super.c
> +++ b/fs/btrfs/super.c
> @@ -1612,7 +1612,11 @@ static struct dentry *btrfs_mount_root(struct 
> file_system_type *fs_type,
>   goto error_sec_opts;
>   }
>  
> - fs_info = btrfs_sb(s);
> + if (btrfs_sb(s) != fs_info) {
> + free_fs_info(fs_info);
> + fs_info = btrfs_sb(s);
> + }
> +
>   error = setup_security_options(fs_info, s, _sec_opts);
>   if (error) {
>   deactivate_locked_super(s);
> 


[PATCH] btrfs: Fix memory leak on multiple mounts

2017-12-18 Thread Nikolay Borisov
Currently if a mounted btrfs instance is mounted for a 2nd time
without first unmounting the first instance, we hit a memory leak
in btrfs_mount_root because the fs_info of the acquired superblock is
different from the newly allocated fs_info. Fix this by explicitly
checking whether the fs_info instance of the newly acquired superblock
is the same as ours and freeing ours if not.

Reproducer:

mount /dev/vdc /media/scratch
mount /dev/vdc /media/scratch2 <- memory leak hit

Signed-off-by: Nikolay Borisov 
---
 fs/btrfs/super.c | 6 +-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index e57eb6e70278..ea3bca85be44 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -1612,7 +1612,11 @@ static struct dentry *btrfs_mount_root(struct 
file_system_type *fs_type,
goto error_sec_opts;
}
 
-   fs_info = btrfs_sb(s);
+   if (btrfs_sb(s) != fs_info) {
+   free_fs_info(fs_info);
+   fs_info = btrfs_sb(s);
+   }
+
error = setup_security_options(fs_info, s, _sec_opts);
if (error) {
deactivate_locked_super(s);
-- 
2.7.4



Re: [PATCH v10 3/5] bpf: add a bpf_override_function helper

2017-12-18 Thread Masami Hiramatsu
On Fri, 15 Dec 2017 14:12:54 -0500
Josef Bacik  wrote:

> From: Josef Bacik 
> 
> Error injection is sloppy and very ad-hoc.  BPF could fill this niche
> perfectly with its kprobe functionality.  We could make sure errors are
> only triggered in specific call chains that we care about with very
> specific situations.  Accomplish this with the bpf_override_function
> helper.  This will modify the probed caller's return value to the
> specified value and set the PC to an override function that simply
> returns, bypassing the originally probed function.  This gives us a nice
> clean way to implement systematic error injection for all of our code
> paths.
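For reference, a minimal consumer of the new helper looks roughly like this
(a samples/bpf-style sketch; the probed function, program name and injected
value are just examples, and loader boilerplate is trimmed):

#include <uapi/linux/ptrace.h>
#include <uapi/linux/bpf.h>
#include <linux/version.h>
#include "bpf_helpers.h"

/* force open_ctree() to "fail" with -ENOMEM whenever it is hit */
SEC("kprobe/open_ctree")
int inject_enomem(struct pt_regs *ctx)
{
        unsigned long rc = -12; /* -ENOMEM */

        bpf_override_return(ctx, rc);
        return 0;
}

char _license[] SEC("license") = "GPL";
u32 _version SEC("version") = LINUX_VERSION_CODE;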

OK, got it. I think the error_injectable function list should be defined
in kernel/trace/bpf_trace.c because only bpf calls it and needs to care
about the "safeness".

[...]
> diff --git a/arch/x86/kernel/kprobes/ftrace.c 
> b/arch/x86/kernel/kprobes/ftrace.c
> index 8dc0161cec8f..1ea748d682fd 100644
> --- a/arch/x86/kernel/kprobes/ftrace.c
> +++ b/arch/x86/kernel/kprobes/ftrace.c
> @@ -97,3 +97,17 @@ int arch_prepare_kprobe_ftrace(struct kprobe *p)
>   p->ainsn.boostable = false;
>   return 0;
>  }
> +
> +asmlinkage void override_func(void);
> +asm(
> + ".type override_func, @function\n"
> + "override_func:\n"
> + "   ret\n"
> + ".size override_func, .-override_func\n"
> +);
> +
> +void arch_ftrace_kprobe_override_function(struct pt_regs *regs)
> +{
> + regs->ip = (unsigned long)&override_func;
> +}
> +NOKPROBE_SYMBOL(arch_ftrace_kprobe_override_function);

Calling this "override_function" is meaningless. This is a function
which just returns. So I think a combination of just_return_func() and
arch_bpf_override_func_just_return() would be better.

Moreover, arch/x86/kernel/kprobes/ftrace.c is an architecture-dependent
implementation of kprobes, not bpf.

Hmm, arch/x86/net/bpf_jit_comp.c will be better place?
(why don't we have arch/x86/kernel/bpf.c?)

[..]
> diff --git a/kernel/trace/trace_kprobe.c b/kernel/trace/trace_kprobe.c
> index 492700c5fb4d..91f4b57dab82 100644
> --- a/kernel/trace/trace_kprobe.c
> +++ b/kernel/trace/trace_kprobe.c
> @@ -42,6 +42,7 @@ struct trace_kprobe {
>   (offsetof(struct trace_kprobe, tp.args) +   \
>   (sizeof(struct probe_arg) * (n)))
>  
> +DEFINE_PER_CPU(int, bpf_kprobe_override);
>  
>  static nokprobe_inline bool trace_kprobe_is_return(struct trace_kprobe *tk)
>  {
> @@ -87,6 +88,27 @@ static nokprobe_inline unsigned long 
> trace_kprobe_nhit(struct trace_kprobe *tk)
>   return nhit;
>  }
>  
> +int trace_kprobe_ftrace(struct trace_event_call *call)
> +{
> + struct trace_kprobe *tk = (struct trace_kprobe *)call->data;
> + return kprobe_ftrace(&tk->rp.kp);
> +}
> +
> +int trace_kprobe_error_injectable(struct trace_event_call *call)
> +{
> + struct trace_kprobe *tk = (struct trace_kprobe *)call->data;
> + unsigned long addr;
> +
> + if (tk->symbol) {
> + addr = (unsigned long)
> + kallsyms_lookup_name(trace_kprobe_symbol(tk));
> + addr += tk->rp.kp.offset;

If the tk is already registered, you don't need to look up the address;
you can use kp.addr. Anyway, kprobe_ftrace() also requires the kprobe
to already be registered.

> + } else {
> + addr = (unsigned long)tk->rp.kp.addr;
> + }
> + return within_kprobe_error_injection_list(addr);
> +}
> +
>  static int register_kprobe_event(struct trace_kprobe *tk);
>  static int unregister_kprobe_event(struct trace_kprobe *tk);
>  
> @@ -1170,7 +1192,7 @@ static int kretprobe_event_define_fields(struct 
> trace_event_call *event_call)
>  #ifdef CONFIG_PERF_EVENTS
>  
>  /* Kprobe profile handler */
> -static void
> +static int
>  kprobe_perf_func(struct trace_kprobe *tk, struct pt_regs *regs)
>  {
>   struct trace_event_call *call = &tk->tp.call;
> @@ -1179,12 +1201,29 @@ kprobe_perf_func(struct trace_kprobe *tk, struct 
> pt_regs *regs)
>   int size, __size, dsize;
>   int rctx;
>  
> - if (bpf_prog_array_valid(call) && !trace_call_bpf(call, regs))
> - return;
> + if (bpf_prog_array_valid(call)) {
> + int ret;
> +
> + ret = trace_call_bpf(call, regs);
> +
> + /*
> +  * We need to check and see if we modified the pc of the
> +  * pt_regs, and if so clear the kprobe and return 1 so that we
> +  * don't do the instruction skipping.  Also reset our state so
> +  * we are clean the next pass through.
> +  */
> + if (__this_cpu_read(bpf_kprobe_override)) {
> + __this_cpu_write(bpf_kprobe_override, 0);
> + reset_current_kprobe();

OK, I will fix this issue (reset kprobe and preempt-enable) by removing
jprobe soon.
(Currently waiting for the removal of the {tcp,sctp,dccp}_probe code,
which are the only users of jprobe in the kernel.)

Thank you,


> + return 1;
> +   

Re: [PATCH] btrfs: factor btrfs_check_rw_degradable() to check given device

2017-12-18 Thread Qu Wenruo


On 2017-12-18 17:08, Anand Jain wrote:
> Update btrfs_check_rw_degradable() to check against the given
> device if its lost.
> 
> We can use this function to know if the volume is going to be
> in degraded mode OR failed state, when the given device fails.
> Which is needed when we are handling the device failed state.
> 
> A preparatory patch does not affect the flow as such.
> 
> Signed-off-by: Anand Jain 

Looks good.

Reviewed-by: Qu Wenruo 

Thanks,
Qu

> ---
>  fs/btrfs/disk-io.c | 4 ++--
>  fs/btrfs/super.c   | 2 +-
>  fs/btrfs/volumes.c | 8 ++--
>  fs/btrfs/volumes.h | 4 ++--
>  4 files changed, 11 insertions(+), 7 deletions(-)
> 
> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
> index 55bc6c846671..5604f5e1873e 100644
> --- a/fs/btrfs/disk-io.c
> +++ b/fs/btrfs/disk-io.c
> @@ -2904,7 +2904,7 @@ int open_ctree(struct super_block *sb,
>   goto fail_sysfs;
>   }
>  
> - if (!sb_rdonly(sb) && !btrfs_check_rw_degradable(fs_info)) {
> + if (!sb_rdonly(sb) && !btrfs_check_rw_degradable(fs_info, NULL)) {
>   btrfs_warn(fs_info,
>   "writeable mount is not allowed due to too many missing 
> devices");
>   goto fail_sysfs;
> @@ -3423,7 +3423,7 @@ static blk_status_t wait_dev_flush(struct btrfs_device 
> *device)
>  
>  static int check_barrier_error(struct btrfs_fs_info *fs_info)
>  {
> - if (!btrfs_check_rw_degradable(fs_info))
> + if (!btrfs_check_rw_degradable(fs_info, NULL))
>   return -EIO;
>   return 0;
>  }
> diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
> index f8129ebad963..cb6d62cf83ec 100644
> --- a/fs/btrfs/super.c
> +++ b/fs/btrfs/super.c
> @@ -1855,7 +1855,7 @@ static int btrfs_remount(struct super_block *sb, int 
> *flags, char *data)
>   goto restore;
>   }
>  
> - if (!btrfs_check_rw_degradable(fs_info)) {
> + if (!btrfs_check_rw_degradable(fs_info, NULL)) {
>   btrfs_warn(fs_info,
>   "too many missing devices, writeable remount is 
> not allowed");
>   ret = -EACCES;
> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
> index b881fc2b2225..9c4868d6fd9a 100644
> --- a/fs/btrfs/volumes.c
> +++ b/fs/btrfs/volumes.c
> @@ -6923,7 +6923,8 @@ int btrfs_read_sys_array(struct btrfs_fs_info *fs_info)
>   * Return true if all chunks meet the minimal RW mount requirements.
>   * Return false if any chunk doesn't meet the minimal RW mount requirements.
>   */
> -bool btrfs_check_rw_degradable(struct btrfs_fs_info *fs_info)
> +bool btrfs_check_rw_degradable(struct btrfs_fs_info *fs_info,
> + struct btrfs_device *failing_dev)
>  {
>   struct btrfs_mapping_tree *map_tree = &fs_info->mapping_tree;
>   struct extent_map *em;
> @@ -6955,9 +6956,12 @@ bool btrfs_check_rw_degradable(struct btrfs_fs_info 
> *fs_info)
>   test_bit(BTRFS_DEV_STATE_MISSING, &dev->dev_state) 
> ||
>   dev->last_flush_error)
>   missing++;
> + else if (failing_dev && failing_dev == dev)
> + missing++;
>   }
>   if (missing > max_tolerated) {
> - btrfs_warn(fs_info,
> + if (!failing_dev)
> + btrfs_warn(fs_info,
>   "chunk %llu missing %d devices, max tolerance is %d for writeable 
> mount",
>  em->start, missing, max_tolerated);
>   free_extent_map(em);
> diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
> index c7484da44ca4..e707fb39b2d4 100644
> --- a/fs/btrfs/volumes.h
> +++ b/fs/btrfs/volumes.h
> @@ -555,7 +555,7 @@ void btrfs_update_commit_device_bytes_used(struct 
> btrfs_fs_info *fs_info,
>  struct list_head *btrfs_get_fs_uuids(void);
>  void btrfs_set_fs_info_ptr(struct btrfs_fs_info *fs_info);
>  void btrfs_reset_fs_info_ptr(struct btrfs_fs_info *fs_info);
> -
> -bool btrfs_check_rw_degradable(struct btrfs_fs_info *fs_info);
> +bool btrfs_check_rw_degradable(struct btrfs_fs_info *fs_info,
> + struct btrfs_device *failing_dev);
>  
>  #endif
> 





Re: [PATCH v10 1/5] add infrastructure for tagging functions as error injectable

2017-12-18 Thread Masami Hiramatsu
On Fri, 15 Dec 2017 14:12:52 -0500
Josef Bacik  wrote:

> From: Josef Bacik 
> 
> Using BPF we can override kprob'ed functions and return arbitrary
> values.  Obviously this can be a bit unsafe, so make this feature opt-in
> for functions.  Simply tag a function with KPROBE_ERROR_INJECT_SYMBOL in
> order to give BPF access to that function for error injection purposes.
> 

NAK. I'm very confused. What is the reason this feature is implemented
in kernel/kprobes.c? It does not seem to be within the usual "usage" of
kprobes. I recommend implementing this somewhere else, like
kernel/error_injection.c or kernel/module.c.

More precisely, to list the reasons why:

 - This is just for providing an API to check an address against an
  address-range list inside a kernel module (not related to kprobes).
 - There is no check in kprobes that an address modified via kprobes is
  actually on that list. (Yes, such a check would cause a big overhead...)
 - This can mislead users into thinking they must NOT change the
  instruction pointer from kprobes except for entries on that list.
 - If a user intends to insert a piece of code (like livepatch) in a
  function, they do NOT think of it as "error injection".
 - Or, if they find this API: "What?? I can not change the instruction
  pointer by kprobes? But I can." and they report it as a bug on lkml...

So, I don't like to see this in kprobes.c. It is better to make another
layer to do this.

Thank you,

> Signed-off-by: Josef Bacik 
> Acked-by: Ingo Molnar 
> ---
>  include/asm-generic/vmlinux.lds.h |  10 +++
>  include/linux/bpf.h   |  11 +++
>  include/linux/kprobes.h   |   1 +
>  include/linux/module.h|   5 ++
>  kernel/kprobes.c  | 163 
> ++
>  kernel/module.c   |   6 +-
>  6 files changed, 195 insertions(+), 1 deletion(-)
> 
> diff --git a/include/asm-generic/vmlinux.lds.h 
> b/include/asm-generic/vmlinux.lds.h
> index ee8b707d9fa9..a2e8582d094a 100644
> --- a/include/asm-generic/vmlinux.lds.h
> +++ b/include/asm-generic/vmlinux.lds.h
> @@ -136,6 +136,15 @@
>  #define KPROBE_BLACKLIST()
>  #endif
>  
> +#ifdef CONFIG_BPF_KPROBE_OVERRIDE
> +#define ERROR_INJECT_LIST()  . = ALIGN(8);   
> \
> + 
> VMLINUX_SYMBOL(__start_kprobe_error_inject_list) = .;   \
> + KEEP(*(_kprobe_error_inject_list))  
> \
> + VMLINUX_SYMBOL(__stop_kprobe_error_inject_list) 
> = .;
> +#else
> +#define ERROR_INJECT_LIST()
> +#endif
> +
>  #ifdef CONFIG_EVENT_TRACING
>  #define FTRACE_EVENTS()  . = ALIGN(8);   
> \
>   VMLINUX_SYMBOL(__start_ftrace_events) = .;  \
> @@ -564,6 +573,7 @@
>   FTRACE_EVENTS() \
>   TRACE_SYSCALLS()\
>   KPROBE_BLACKLIST()  \
> + ERROR_INJECT_LIST() \
>   MEM_DISCARD(init.rodata)\
>   CLK_OF_TABLES() \
>   RESERVEDMEM_OF_TABLES() \
> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> index e55e4255a210..7f4d2a953173 100644
> --- a/include/linux/bpf.h
> +++ b/include/linux/bpf.h
> @@ -576,4 +576,15 @@ extern const struct bpf_func_proto 
> bpf_sock_map_update_proto;
>  void bpf_user_rnd_init_once(void);
>  u64 bpf_user_rnd_u32(u64 r1, u64 r2, u64 r3, u64 r4, u64 r5);
>  
> +#if defined(__KERNEL__) && !defined(__ASSEMBLY__)
> +#ifdef CONFIG_BPF_KPROBE_OVERRIDE
> +#define BPF_ALLOW_ERROR_INJECTION(fname) \
> +static unsigned long __used  \
> + __attribute__((__section__("_kprobe_error_inject_list")))   \
> + _eil_addr_##fname = (unsigned long)fname;
> +#else
> +#define BPF_ALLOW_ERROR_INJECTION(fname)
> +#endif
> +#endif
> +
>  #endif /* _LINUX_BPF_H */
> diff --git a/include/linux/kprobes.h b/include/linux/kprobes.h
> index 9440a2fc8893..963fd364f3d6 100644
> --- a/include/linux/kprobes.h
> +++ b/include/linux/kprobes.h
> @@ -271,6 +271,7 @@ extern bool arch_kprobe_on_func_entry(unsigned long 
> offset);
>  extern bool kprobe_on_func_entry(kprobe_opcode_t *addr, const char *sym, 
> unsigned long offset);
>  
>  extern bool within_kprobe_blacklist(unsigned long addr);
> +extern bool within_kprobe_error_injection_list(unsigned long addr);
>  
>  struct kprobe_insn_cache {
>   struct mutex mutex;
> diff --git a/include/linux/module.h b/include/linux/module.h
> index c69b49abe877..548fa09fa806 100644
> --- a/include/linux/module.h
> +++ b/include/linux/module.h
> @@ -475,6 +475,11 @@ struct module {
>   ctor_fn_t *ctors;
>   unsigned int num_ctors;
>  #endif

[PATCH] btrfs: factor btrfs_check_rw_degradable() to check given device

2017-12-18 Thread Anand Jain
Update btrfs_check_rw_degradable() to check against the given
device as if it were lost.

We can use this function to know whether the volume is going to be
in degraded mode or in a failed state when the given device fails,
which is needed when we are handling the device failed state.

This is a preparatory patch and does not affect the flow as such.
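A hypothetical caller (not part of this patch) would use the new argument
along these lines when deciding how to react to a failing device:

/* sketch: decide between degraded-RW and failed state for the volume */
static void handle_device_failure(struct btrfs_fs_info *fs_info,
                                  struct btrfs_device *failing_dev)
{
        if (!btrfs_check_rw_degradable(fs_info, failing_dev)) {
                /* losing failing_dev exceeds the chunk tolerance: the
                 * volume cannot stay writeable, treat it as failed */
        } else {
                /* the volume can keep running RW in degraded mode */
        }
}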

Signed-off-by: Anand Jain 
---
 fs/btrfs/disk-io.c | 4 ++--
 fs/btrfs/super.c   | 2 +-
 fs/btrfs/volumes.c | 8 ++--
 fs/btrfs/volumes.h | 4 ++--
 4 files changed, 11 insertions(+), 7 deletions(-)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 55bc6c846671..5604f5e1873e 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -2904,7 +2904,7 @@ int open_ctree(struct super_block *sb,
goto fail_sysfs;
}
 
-   if (!sb_rdonly(sb) && !btrfs_check_rw_degradable(fs_info)) {
+   if (!sb_rdonly(sb) && !btrfs_check_rw_degradable(fs_info, NULL)) {
btrfs_warn(fs_info,
"writeable mount is not allowed due to too many missing 
devices");
goto fail_sysfs;
@@ -3423,7 +3423,7 @@ static blk_status_t wait_dev_flush(struct btrfs_device 
*device)
 
 static int check_barrier_error(struct btrfs_fs_info *fs_info)
 {
-   if (!btrfs_check_rw_degradable(fs_info))
+   if (!btrfs_check_rw_degradable(fs_info, NULL))
return -EIO;
return 0;
 }
diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index f8129ebad963..cb6d62cf83ec 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -1855,7 +1855,7 @@ static int btrfs_remount(struct super_block *sb, int 
*flags, char *data)
goto restore;
}
 
-   if (!btrfs_check_rw_degradable(fs_info)) {
+   if (!btrfs_check_rw_degradable(fs_info, NULL)) {
btrfs_warn(fs_info,
"too many missing devices, writeable remount is 
not allowed");
ret = -EACCES;
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index b881fc2b2225..9c4868d6fd9a 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -6923,7 +6923,8 @@ int btrfs_read_sys_array(struct btrfs_fs_info *fs_info)
  * Return true if all chunks meet the minimal RW mount requirements.
  * Return false if any chunk doesn't meet the minimal RW mount requirements.
  */
-bool btrfs_check_rw_degradable(struct btrfs_fs_info *fs_info)
+bool btrfs_check_rw_degradable(struct btrfs_fs_info *fs_info,
+   struct btrfs_device *failing_dev)
 {
	struct btrfs_mapping_tree *map_tree = &fs_info->mapping_tree;
struct extent_map *em;
@@ -6955,9 +6956,12 @@ bool btrfs_check_rw_degradable(struct btrfs_fs_info 
*fs_info)
	    test_bit(BTRFS_DEV_STATE_MISSING, &dev->dev_state) 
||
dev->last_flush_error)
missing++;
+   else if (failing_dev && failing_dev == dev)
+   missing++;
}
if (missing > max_tolerated) {
-   btrfs_warn(fs_info,
+   if (!failing_dev)
+   btrfs_warn(fs_info,
"chunk %llu missing %d devices, max tolerance is %d for writeable 
mount",
   em->start, missing, max_tolerated);
free_extent_map(em);
diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
index c7484da44ca4..e707fb39b2d4 100644
--- a/fs/btrfs/volumes.h
+++ b/fs/btrfs/volumes.h
@@ -555,7 +555,7 @@ void btrfs_update_commit_device_bytes_used(struct 
btrfs_fs_info *fs_info,
 struct list_head *btrfs_get_fs_uuids(void);
 void btrfs_set_fs_info_ptr(struct btrfs_fs_info *fs_info);
 void btrfs_reset_fs_info_ptr(struct btrfs_fs_info *fs_info);
-
-bool btrfs_check_rw_degradable(struct btrfs_fs_info *fs_info);
+bool btrfs_check_rw_degradable(struct btrfs_fs_info *fs_info,
+   struct btrfs_device *failing_dev);
 
 #endif
-- 
2.15.0



Re: Unexpected raid1 behaviour

2017-12-18 Thread Anand Jain



formerly missing device - a very big penalty because the whole array
has to be done to catch it up for what might be only a few minutes of
missing time.

 For raid1, the cli [1] will pick only new chunks.
  [1]
  btrfs bal start -dprofiles=single -mprofiles=single 

Thanks, Anand


Re: Unexpected raid1 behaviour

2017-12-18 Thread Anand Jain




Put another way, the multi-device design is/was based on the
demented idea that block-devices that are missing are/should be
"remove"d, so that a 2-device volume with a 'raid1' profile
becomes a 1-device volume with a 'single'/'dup' profile, and not
a 2-device volume with a missing block-device and an incomplete
'raid1' profile, 


 Agreed. IMO degraded-raid1-single-chunk is an accidental feature
 caused by [1], which we should revert, since:
   - balance (to raid1 chunk) may fail if the FS is near full
   - recovery (to raid1 chunk) will take more writes compared
 to recovery under degraded raid1 chunks

 [1]
 commit 95669976bd7d30ae265db938ecb46a6b7f8cb893
 Btrfs: don't consider the missing device when allocating new chunks

 There is an attempt to fix it [2], but will certainly takes time as
 there are many things to fix around this.

 [2]
 [PATCH RFC] btrfs: create degraded-RAID1 chunks

> even if things have been awkwardly moving in
> that direction in recent years.

Note the above is not totally accurate today because various
hacks have been introduced to work around the various issues.

 Maybe you are talking about [3]. Pls note it's a workaround
 patch (which I mentioned in its original patch). It's nice that
 we fixed the availability issue through this patch, and the
 helper function it added also helps the other developments.
 But for the long term we need to work on [2].

 [3]
 btrfs: Introduce a function to check if all chunks a OK for degraded 
rw mount



Thus, if a device disappears, to get it back you really have
to reboot, or at least unload/reload the btrfs kernel module,
in ordered to clear the stale device state and have btrfs
rescan and reassociate devices with the matching filesystems.


IIRC that is not quite accurate: a "missing" device can be
nowadays "replace"d (by "devid") or "remove"d, the latter
possibly implying profile changes:

>

   
https://btrfs.wiki.kernel.org/index.php/Using_Btrfs_with_Multiple_Devices#Using_add_and_delete

Terrible tricks like this also work:

   https://www.spinics.net/lists/linux-btrfs/msg48394.html


 It's replace, which isn't about bringing back a missing disk.



Meanwhile, as mentioned above, there's active work on proper
dynamic btrfs device tracking and management. It may or may
not be ready for 4.16, but once it goes in, btrfs should
properly detect a device going away and react accordingly,


I haven't seen that, but I doubt that it is the radical redesign
of the multi-device layer of Btrfs that is needed to give it
operational semantics similar to those of MD RAID, and that I
have vaguely described previously.


 I agree that the btrfs volume manager is incomplete in view of
 data center RAS requisites (there are a couple of critical
 bugs and inconsistent design between raid profiles), but I
 doubt it needs a radical redesign.

 Pls take a look at [4]; comments are appreciated as usual.
 I have experimented with two approaches and both are reasonable.
 In the first, there isn't any harm in leaving the failed disk opened
 (but stopping any new IO to it), and there will be a udev
 'btrfs dev forget --mounted ' call when the device disappears
 so that we can close the device.
 In the 2nd approach, close the failed device right away when the disk
 write fails, so that we continue to have only two device states.
 I like the latter.


and it should detect a device coming back as a different
device too.


That is disagreeable because of poor terminology: I guess that
what was intended that it should be able to detect a previous
member block-device becoming available again as a different
device inode, which currently is very dangerous in some vital
situations.


 If a device disappears, patch [4] will completely take the
 device out of btrfs, and btrfs continues RW in degraded mode.
 When it reappears, [5] will bring it back to the RW list.

  [4]
  btrfs: introduce device dynamic state transition to failed
  [5]
  btrfs: handle dynamically reappearing missing device

 From the original btrfs design, it always depends on the device SB
 fsid:uuid:devid, so the device path, device inode, or device
 transport layer does not matter. For example, you can dynamically
 bring a device under a different transport and it will work
 without any downtime.
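 In other words, what identifies a member is the identity stored in its
 superblock, not where it shows up; schematically (field names follow the
 on-disk super block / dev item, the struct name is invented here):

struct member_identity {
        u8  fsid[BTRFS_FSID_SIZE];      /* which filesystem it belongs to */
        u64 devid;                      /* which slot within that filesystem */
        u8  dev_uuid[BTRFS_UUID_SIZE];  /* which device instance fills the slot */
};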


> That would be trivial if the complete redesign of block-device
> states of the Btrfs multi-device layer happened, adding an
> "active" flag to an "accessible" flag to describe new member
> states, for example.

 I think you are talking about BTRFS_DEV_STATE.. But I think
 Duncan is talking about the patches which I included in my
 reply.

Thanks, Anand



Re: xfstests/btrfs/100 and 151 failure

2017-12-18 Thread Nikolay Borisov


On 12.12.2017 18:28, Lakshmipathi.G wrote:
>> Actually 151 has been failing for me as well but not 100
>>
> Okay, can you share the kernel .config file? I'll give it a try
> with those config and check 100, sometime tomorrow.

I've attached my master config. Before compiling a kernel with it I just
do make olddefconfig to sync it with the latest kbuild options.

> 
> 
> Cheers,
> Lakshmipathi.G
> http://www.giis.co.in http://www.webminal.org
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
#
# Automatically generated file; DO NOT EDIT.
# Linux/x86 4.12.0-rc5 Kernel Configuration
#
CONFIG_64BIT=y
CONFIG_X86_64=y
CONFIG_X86=y
CONFIG_INSTRUCTION_DECODER=y
CONFIG_OUTPUT_FORMAT="elf64-x86-64"
CONFIG_ARCH_DEFCONFIG="arch/x86/configs/x86_64_defconfig"
CONFIG_LOCKDEP_SUPPORT=y
CONFIG_STACKTRACE_SUPPORT=y
CONFIG_MMU=y
CONFIG_ARCH_MMAP_RND_BITS_MIN=28
CONFIG_ARCH_MMAP_RND_BITS_MAX=32
CONFIG_ARCH_MMAP_RND_COMPAT_BITS_MIN=8
CONFIG_ARCH_MMAP_RND_COMPAT_BITS_MAX=16
CONFIG_NEED_DMA_MAP_STATE=y
CONFIG_NEED_SG_DMA_LENGTH=y
CONFIG_GENERIC_BUG=y
CONFIG_GENERIC_BUG_RELATIVE_POINTERS=y
CONFIG_GENERIC_HWEIGHT=y
CONFIG_RWSEM_XCHGADD_ALGORITHM=y
CONFIG_GENERIC_CALIBRATE_DELAY=y
CONFIG_ARCH_HAS_CPU_RELAX=y
CONFIG_ARCH_HAS_CACHE_LINE_SIZE=y
CONFIG_HAVE_SETUP_PER_CPU_AREA=y
CONFIG_NEED_PER_CPU_EMBED_FIRST_CHUNK=y
CONFIG_NEED_PER_CPU_PAGE_FIRST_CHUNK=y
CONFIG_ARCH_HIBERNATION_POSSIBLE=y
CONFIG_ARCH_SUSPEND_POSSIBLE=y
CONFIG_ARCH_WANT_HUGE_PMD_SHARE=y
CONFIG_ARCH_WANT_GENERAL_HUGETLB=y
CONFIG_ZONE_DMA32=y
CONFIG_AUDIT_ARCH=y
CONFIG_ARCH_SUPPORTS_OPTIMIZED_INLINING=y
CONFIG_ARCH_SUPPORTS_DEBUG_PAGEALLOC=y
CONFIG_X86_64_SMP=y
CONFIG_ARCH_SUPPORTS_UPROBES=y
CONFIG_FIX_EARLYCON_MEM=y
CONFIG_PGTABLE_LEVELS=4
CONFIG_DEFCONFIG_LIST="/lib/modules/$UNAME_RELEASE/.config"
CONFIG_IRQ_WORK=y
CONFIG_BUILDTIME_EXTABLE_SORT=y
CONFIG_THREAD_INFO_IN_TASK=y

#
# General setup
#
CONFIG_INIT_ENV_ARG_LIMIT=32
CONFIG_CROSS_COMPILE=""
# CONFIG_COMPILE_TEST is not set
CONFIG_LOCALVERSION="-nbor"
# CONFIG_LOCALVERSION_AUTO is not set
CONFIG_HAVE_KERNEL_GZIP=y
CONFIG_HAVE_KERNEL_BZIP2=y
CONFIG_HAVE_KERNEL_LZMA=y
CONFIG_HAVE_KERNEL_XZ=y
CONFIG_HAVE_KERNEL_LZO=y
CONFIG_HAVE_KERNEL_LZ4=y
CONFIG_KERNEL_GZIP=y
# CONFIG_KERNEL_BZIP2 is not set
# CONFIG_KERNEL_LZMA is not set
# CONFIG_KERNEL_XZ is not set
# CONFIG_KERNEL_LZO is not set
# CONFIG_KERNEL_LZ4 is not set
CONFIG_DEFAULT_HOSTNAME="(none)"
CONFIG_SWAP=y
CONFIG_SYSVIPC=y
CONFIG_SYSVIPC_SYSCTL=y
# CONFIG_POSIX_MQUEUE is not set
CONFIG_CROSS_MEMORY_ATTACH=y
CONFIG_FHANDLE=y
# CONFIG_USELIB is not set
# CONFIG_AUDIT is not set
CONFIG_HAVE_ARCH_AUDITSYSCALL=y

#
# IRQ subsystem
#
CONFIG_GENERIC_IRQ_PROBE=y
CONFIG_GENERIC_IRQ_SHOW=y
CONFIG_GENERIC_PENDING_IRQ=y
CONFIG_IRQ_DOMAIN=y
CONFIG_IRQ_DOMAIN_HIERARCHY=y
CONFIG_GENERIC_MSI_IRQ=y
CONFIG_GENERIC_MSI_IRQ_DOMAIN=y
# CONFIG_IRQ_DOMAIN_DEBUG is not set
CONFIG_IRQ_FORCED_THREADING=y
CONFIG_SPARSE_IRQ=y
CONFIG_CLOCKSOURCE_WATCHDOG=y
CONFIG_ARCH_CLOCKSOURCE_DATA=y
CONFIG_CLOCKSOURCE_VALIDATE_LAST_CYCLE=y
CONFIG_GENERIC_TIME_VSYSCALL=y
CONFIG_GENERIC_CLOCKEVENTS=y
CONFIG_GENERIC_CLOCKEVENTS_BROADCAST=y
CONFIG_GENERIC_CLOCKEVENTS_MIN_ADJUST=y
CONFIG_GENERIC_CMOS_UPDATE=y

#
# Timers subsystem
#
CONFIG_TICK_ONESHOT=y
CONFIG_NO_HZ_COMMON=y
# CONFIG_HZ_PERIODIC is not set
CONFIG_NO_HZ_IDLE=y
# CONFIG_NO_HZ_FULL is not set
# CONFIG_NO_HZ is not set
CONFIG_HIGH_RES_TIMERS=y

#
# CPU/Task time and stats accounting
#
CONFIG_TICK_CPU_ACCOUNTING=y
# CONFIG_VIRT_CPU_ACCOUNTING_GEN is not set
# CONFIG_IRQ_TIME_ACCOUNTING is not set
CONFIG_BSD_PROCESS_ACCT=y
CONFIG_BSD_PROCESS_ACCT_V3=y
# CONFIG_TASKSTATS is not set

#
# RCU Subsystem
#
CONFIG_TREE_RCU=y
CONFIG_RCU_EXPERT=y
CONFIG_SRCU=y
CONFIG_CLASSIC_SRCU=y
# CONFIG_TASKS_RCU is not set
CONFIG_RCU_STALL_COMMON=y
CONFIG_RCU_NEED_SEGCBLIST=y
CONFIG_RCU_FANOUT=64
CONFIG_RCU_FANOUT_LEAF=16
# CONFIG_RCU_FAST_NO_HZ is not set
# CONFIG_TREE_RCU_TRACE is not set
CONFIG_RCU_KTHREAD_PRIO=0
# CONFIG_RCU_NOCB_CPU is not set
CONFIG_BUILD_BIN2C=y
CONFIG_IKCONFIG=y
# CONFIG_IKCONFIG_PROC is not set
CONFIG_LOG_BUF_SHIFT=17
CONFIG_LOG_CPU_MAX_BUF_SHIFT=12
CONFIG_PRINTK_SAFE_LOG_BUF_SHIFT=13
CONFIG_HAVE_UNSTABLE_SCHED_CLOCK=y
CONFIG_ARCH_SUPPORTS_NUMA_BALANCING=y
CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH=y
CONFIG_ARCH_SUPPORTS_INT128=y
# CONFIG_NUMA_BALANCING is not set
CONFIG_CGROUPS=y
CONFIG_PAGE_COUNTER=y
CONFIG_MEMCG=y
# CONFIG_MEMCG_SWAP is not set
# CONFIG_BLK_CGROUP is not set
CONFIG_CGROUP_SCHED=y
CONFIG_FAIR_GROUP_SCHED=y
CONFIG_CFS_BANDWIDTH=y
CONFIG_RT_GROUP_SCHED=y
CONFIG_CGROUP_PIDS=y
# CONFIG_CGROUP_RDMA is not set
CONFIG_CGROUP_FREEZER=y
CONFIG_CPUSETS=y
CONFIG_PROC_PID_CPUSET=y
CONFIG_CGROUP_DEVICE=y
CONFIG_CGROUP_CPUACCT=y
CONFIG_CGROUP_PERF=y
# CONFIG_CGROUP_DEBUG is not set
CONFIG_SOCK_CGROUP_DATA=y
CONFIG_CHECKPOINT_RESTORE=y
CONFIG_NAMESPACES=y
CONFIG_UTS_NS=y