Re: [PATCH] drbd: fix discard_zeroes_if_aligned regression

2019-06-01 Thread Eric Wheeler
On Tue, 28 May 2019, Lars Ellenberg wrote:

> On Fri, May 10, 2019 at 05:36:32PM +0000, Eric Wheeler wrote:
> > Hi Lars,
> > 
> > We just tried 4.19.x and this bug still exists. We applied the patch 
> > which was originally submitted to this thread and it still applies cleanly 
> > and seems to work for our use case. You mentioned that you had some older 
> > code which zeroed out unaligned discard requests (or perhaps it was for a 
> > different purpose) that you may be able to use to patch this. Could you 
> > dig those up and see if we can get this solved?
> > 
> > It would be nice to be able to use drbd with thin backing volumes from the 
> > vanilla kernel. If this has already been fixed in something newer than 
> > 4.19, then please point me to the commit.
> 
> I think it was merged upstream in 5.0
> f31e583aa2c2 drbd: introduce P_ZEROES (REQ_OP_WRITE_ZEROES on the "wire")

Thanks Lars, I appreciate your patch.  

Your unaligned zeroout code in drbd_issue_discard_or_zero_out() looks 
great.  I particularly like how you adjusted max_discard_sectors to the 
granularity, as well as the alignment handling.  Well thought out.

Your commit notes that "for backward compatibility, P_TRIM means zero-out, 
unless the DRBD_FF_WZEROES feature flag is agreed upon during handshake."

We test our environment by deploying the newer kernel on one of the DRBD 
servers and checking for regressions---but this will cause a zero-out on 
the new server because the old server doesn't yet support DRBD_FF_WZEROES.

For our purpose, can you think of any reason that it would be unsafe to 
hack the following into drbd_do_features() so the newer version will not 
zero-out while we test and get both nodes up to the newer version?

diff --git a/drivers/block/drbd/drbd_receiver.c b/drivers/block/drbd/drbd_receiver.c
index c7ad88d..76191e6 100644
--- a/drivers/block/drbd/drbd_receiver.c
+++ b/drivers/block/drbd/drbd_receiver.c
@@ -5382,6 +5382,8 @@ static int drbd_do_features(struct drbd_connection *connection)
 	connection->agreed_pro_version = min_t(int, PRO_VERSION_MAX, p->protocol_max);
 	connection->agreed_features = PRO_FEATURES & be32_to_cpu(p->feature_flags);
 
+	connection->agreed_features |= DRBD_FF_WZEROES;
+
 	drbd_info(connection, "Handshake successful: "
 		  "Agreed network protocol version %d\n",
 		  connection->agreed_pro_version);


--
Eric Wheeler



> 
> -- 
> : Lars Ellenberg
> : LINBIT | Keeping the Digital World Running
> : DRBD -- Heartbeat -- Corosync -- Pacemaker
> : R&D, Integration, Ops, Consulting, Support
> 
> DRBD® and LINBIT® are registered trademarks of LINBIT
> 

Re: [PATCH] drbd: fix discard_zeroes_if_aligned regression

2019-05-10 Thread Eric Wheeler
On Tue, 16 Jan 2018, Lars Ellenberg wrote:

> On Mon, Jan 15, 2018 at 11:26:15PM -0800, Christoph Hellwig wrote:
> > NAK.  Calling a discard and expecting zeroing is simply buggy.
> 
> What he said.
> 
> The bug/misunderstanding was that we now use zeroout even for discards,
> with the assumption that it would try to do discards where it can.
> 
> Which is even true.
> 
> But our expectations of when zeroout "should" use unmap,
> and where it actually can do that safely
> based on the information it has,
> don't really match:
> our expectations were wrong, we assumed too much.
> (in the general case).
> 
> Which means in DRBD we have to stop using zeroout for discards,
> and again pass down normal discard for discards.
> 
> In the specific case where the backend to DRBD is lvm-thin,
> AND it does de-alloc blocks on discard,
> AND it does NOT have skip_block_zeroing set or its backend
> does zero on discard and it does discard passdown, and we tell
> DRBD about all of that (using the discard_zeroes_if_aligned flag)
> then we can do this "zero head and tail, discard full blocks",
> and expect them to be zero.
> 
> If skip_block_zeroing is set however, and there is no discard
> passdown in thin, or the backend of thin does not zero on discard,
> DRBD can still pass down discards to thin, and expect them to be
> de-allocated, but can not expect discarded ranges to remain
> "zero", any later partial write to an unallocated area could pull
> in different "undefined" garbage on different replicas for the
> not-written-to part of a new allocated block.
> 
> The end result is that you have areas of the block device
> that return different data depending on which replica you
> read from.
> 
> But that is the case even without discard in that setup,
> don't do that then or live with it.
> 
> "undefined data" is undefined, you have that directly on thin
> in that setup already, with DRBD on top you now have several
> versions of "undefined".
> 
> Anything on top of such a setup must not do "read-modify-write"
> of "undefined" data and expect a defined result, adding DRBD
> on top does not change that.
> 
> DRBD can deal with that just fine, but our "online verify" will
> report loads of boring "mismatches" for those areas.
> 
> TL;DR: we'll stop doing "discard-is-zeroout"
> (our assumptions were wrong).
> We still won't do exactly "discard-is-discard", but re-enable our
> "discard-is-discard plus zeroout on head and tail", because in
> some relevant setups, this gives us the best result, and avoids
> the false positives in our online-verify.
> 
> Lars
> 

Hi Lars,

We just tried 4.19.x and this bug still exists. We applied the patch 
which was originally submitted to this thread and it still applies cleanly 
and seems to work for our use case. You mentioned that you had some older 
code which zeroed out unaligned discard requests (or perhaps it was for a 
different purpose) that you may be able to use to patch this. Could you 
dig those up and see if we can get this solved?

It would be nice to be able to use drbd with thin backing volumes from the 
vanilla kernel. If this has already been fixed in something newer than 
4.19, then please point me to the commit.

Thank you for your help!

 
--
Eric Wheeler




Re: [PATCH 1/2] mm,vmscan: Kill global shrinker lock.

2018-01-25 Thread Eric Wheeler
On Thu, 25 Jan 2018, Tetsuo Handa wrote:

> Using a debug patch and a reproducer shown below, we can trivially form
> a circular locking dependency shown below.
> 
> 
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index 8219001..240efb1 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -950,7 +950,7 @@ static void oom_kill_process(struct oom_control *oc, 
> const char *message)
>   }
>   task_unlock(p);
>  
> -	if (__ratelimit(&oom_rs))
> +	if (0 && __ratelimit(&oom_rs))
>   dump_header(oc, p);
>  
>   pr_err("%s: Kill process %d (%s) score %u or sacrifice child\n",
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 1afb2af..9858449 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -410,6 +410,9 @@ static unsigned long do_shrink_slab(struct shrink_control 
> *shrinkctl,
>   return freed;
>  }
>  
> +struct lockdep_map __shrink_slab_map =
> + STATIC_LOCKDEP_MAP_INIT("shrink_slab", &__shrink_slab_map);
> +
>  /**
>   * shrink_slab - shrink slab caches
>   * @gfp_mask: allocation context
> @@ -453,6 +456,8 @@ static unsigned long shrink_slab(gfp_t gfp_mask, int nid,
>   goto out;
>   }
>  
> + lock_map_acquire(&__shrink_slab_map);
> +
>  	list_for_each_entry(shrinker, &shrinker_list, list) {
>   struct shrink_control sc = {
>   .gfp_mask = gfp_mask,
> @@ -491,6 +496,8 @@ static unsigned long shrink_slab(gfp_t gfp_mask, int nid,
>   }
>   }
>  
> + lock_map_release(&__shrink_slab_map);
> +
>  	up_read(&shrinker_rwsem);
>  out:
>   cond_resched();
> 
> 
> 
> #include <stdlib.h>
> 
> int main(int argc, char *argv[])
> {
>   unsigned long long size;
>   char *buf = NULL;
>   unsigned long long i;
>   for (size = 1048576; size < 512ULL * (1 << 30); size *= 2) {
>   char *cp = realloc(buf, size);
>   if (!cp) {
>   size /= 2;
>   break;
>   }
>   buf = cp;
>   }
>   for (i = 0; i < size; i += 4096)
>   buf[i] = 0;
>   return 0;
> }

Hi Tetsuo,

Thank you for looking into this!

I tried running this C program in 4.14.15 but did not get a deadlock, just 
OOM kills. Is the patch required to induce the deadlock?

Also, what are you doing to XFS to make it trigger?

--
Eric Wheeler

> 
> 
> 
> CentOS Linux 7 (Core)
> Kernel 4.15.0-rc8-next-20180119+ on an x86_64
> 
> localhost login: [   36.954893] cp (2850) used greatest stack depth: 10816 
> bytes left
> [   89.216085] Out of memory: Kill process 6981 (a.out) score 876 or 
> sacrifice child
> [   89.225853] Killed process 6981 (a.out) total-vm:4264020kB, 
> anon-rss:3346832kB, file-rss:8kB, shmem-rss:0kB
> [   89.313597] oom_reaper: reaped process 6981 (a.out), now anon-rss:0kB, 
> file-rss:0kB, shmem-rss:0kB
> [   92.640566] Out of memory: Kill process 6983 (a.out) score 876 or 
> sacrifice child
> [   92.642153] 
> [   92.643464] Killed process 6983 (a.out) total-vm:4264020kB, 
> anon-rss:3348624kB, file-rss:4kB, shmem-rss:0kB
> [   92.644416] ==
> [   92.644417] WARNING: possible circular locking dependency detected
> [   92.644418] 4.15.0-rc8-next-20180119+ #222 Not tainted
> [   92.644419] --
> [   92.644419] kworker/u256:29/401 is trying to acquire lock:
> [   92.644420]  (shrink_slab){+.+.}, at: [<40040aca>] 
> shrink_slab.part.42+0x73/0x350
> [   92.644428] 
> [   92.644428] but task is already holding lock:
> [   92.665257]  (_nondir_ilock_class){}, at: [<ae515ec8>] 
> xfs_ilock+0xa3/0x180 [xfs]
> [   92.668490] 
> [   92.668490] which lock already depends on the new lock.
> [   92.668490] 
> [   92.672781] 
> [   92.672781] the existing dependency chain (in reverse order) is:
> [   92.676310] 
> [   92.676310] -> #1 (_nondir_ilock_class){}:
> [   92.679519]xfs_free_eofblocks+0x9d/0x210 [xfs]
> [   92.681716]xfs_fs_destroy_inode+0x9e/0x220 [xfs]
> [   92.683962]dispose_list+0x30/0x40
> [   92.685822]prune_icache_sb+0x4d/0x70
> [   92.687961]super_cache_scan+0x136/0x180
> [   92.690017]shrink_slab.part.42+0x205/0x350
> [   92.692109]shrink_node+0x313/0x320
> [   92.694177]kswapd+0x386/0x6d0
> [   92.695951]kthread+0xeb/0x120
> 

Re: [PATCH] drbd: fix discard_zeroes_if_aligned regression

2018-01-16 Thread Eric Wheeler
On Mon, 15 Jan 2018, Christoph Hellwig wrote:

> NAK.  Calling a discard and expecting zeroing is simply buggy.

But of course, that would be silly.

We don't expect discard to zero---but we do expect discard to discard!

> And double NAK for patches like this without a linux-block Cc.

My apologies, I thought this was internal to DRBD.

What is the general rule here?

Should linux-block always be Cc'ed with a patch?

--
Eric Wheeler



Re: [PATCH 9/8] KVM: x86: limit MSR_IA32_SPEC_CTRL access based on CPUID availability

2018-01-15 Thread Eric Wheeler
On Tue, 9 Jan 2018, Paolo Bonzini wrote:
> MSR_IA32_SPEC_CTRL is not available unless CPU[7,0].EDX[26] is 1.
> Check that against host CPUID or guest CPUID, respectively for
> host-initiated and guest-initiated accesses.

Hi Radim, Paolo:

In porting this patch series to v4.14, I'm getting this BUILD_BUG_ON:

In file included from arch/x86/kvm/vmx.c:21:0:
In function 'x86_feature_cpuid',
inlined from 'guest_cpuid_get_register' at arch/x86/kvm/cpuid.h:72:25,
inlined from 'vmx_get_msr' at arch/x86/kvm/cpuid.h:101:6:
arch/x86/kvm/cpuid.h:64:232: error: call to '__compiletime_assert_64' 
declared with attribute error: BUILD_BUG_ON failed: 
reverse_cpuid[x86_leaf].function == 0
  BUILD_BUG_ON(reverse_cpuid[x86_leaf].function == 0);



^
In function 'x86_feature_cpuid',
inlined from 'guest_cpuid_get_register' at arch/x86/kvm/cpuid.h:72:25,
inlined from 'vmx_set_msr' at arch/x86/kvm/cpuid.h:101:6:
arch/x86/kvm/cpuid.h:64:232: error: call to '__compiletime_assert_64' 
declared with attribute error: BUILD_BUG_ON failed: 
reverse_cpuid[x86_leaf].function == 0
  BUILD_BUG_ON(reverse_cpuid[x86_leaf].function == 0);




I think this is caused by the following call stack for 
X86_FEATURE_SPEC_CTRL, but if not please correct me here:

arch/x86/kvm/vmx.c:
vmx_get_msr/vmx_set_msr()
guest_cpuid_has(vcpu, X86_FEATURE_SPEC_CTRL)
guest_cpuid_get_register(vcpu, x86_feature); 
x86_feature_cpuid(x86_feature);
x86_feature_cpuid(unsigned x86_feature)

BUILD_BUG_ON(reverse_cpuid[x86_leaf].function == 0);


It looks like I need to add something to reverse_cpuid[] but I'm not sure 
what.  

Do you know what needs to be added here?

-Eric

--
Eric Wheeler



> 
> Suggested-by: Jim Mattson <jmatt...@google.com>
> Signed-off-by: Paolo Bonzini <pbonz...@redhat.com>
> ---
>   This is for after X86_FEATURE_SPEC_CTRL is added to Linux, but
>   I still wanted to ack Jim's improvement.
> 
>  arch/x86/kvm/svm.c | 8 
>  arch/x86/kvm/vmx.c | 8 
>  2 files changed, 16 insertions(+)
> 
> diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
> index 97126c2bd663..3a646580d7c5 100644
> --- a/arch/x86/kvm/svm.c
> +++ b/arch/x86/kvm/svm.c
> @@ -3648,6 +3648,10 @@ static int svm_get_msr(struct kvm_vcpu *vcpu, struct 
> msr_data *msr_info)
>   msr_info->data = svm->nested.vm_cr_msr;
>   break;
>   case MSR_IA32_SPEC_CTRL:
> + if (!static_cpu_has(X86_FEATURE_SPEC_CTRL) ||
> + (!msr_info->host_initiated &&
> +  !guest_cpuid_has(vcpu, X86_FEATURE_SPEC_CTRL)))
> + return 1;
>   msr_info->data = svm->spec_ctrl;
>   break;
>   case MSR_IA32_UCODE_REV:
> @@ -3806,6 +3810,10 @@ static int svm_set_msr(struct kvm_vcpu *vcpu, struct 
> msr_data *msr)
>   vcpu_unimpl(vcpu, "unimplemented wrmsr: 0x%x data 0x%llx\n", 
> ecx, data);
>   break;
>   case MSR_IA32_SPEC_CTRL:
> + if (!static_cpu_has(X86_FEATURE_SPEC_CTRL) ||
> + (!msr_info->host_initiated &&
> +  !guest_cpuid_has(vcpu, X86_FEATURE_SPEC_CTRL)))
> + return 1;
>   svm->spec_ctrl = data;
>   break;
>   case MSR_IA32_APICBASE:
> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> index 49b4a2d61603..42bc7ee293e4 100644
> --- a/arch/x86/kvm/vmx.c
> +++ b/arch/x86/kvm/vmx.c
> @@ -3368,6 +3368,10 @@ static int vmx_get_msr(struct kvm_vcpu *vcpu, struct 
> msr_data *msr_info)
>   msr_info->data = guest_read_tsc(vcpu);
>   break;
>   case MSR_IA32_SPEC_CTRL:
> + if (!static_cpu_has(X86_FEATURE_SPEC_CTRL) ||
> + (!msr_info->host_initiated &&
> +  !guest_cpuid_has(vcpu, X86_FEATURE_SPEC_CTRL)))
> + return 1;
>   msr_info->data = to_vmx(vcpu)->spec_ctrl;
>   break;
>   case MSR_IA32_SYSENTER_CS:
> @@ -3510,6 +3514,10 @@ static int vmx_set_msr(struct kvm_vcpu *vcpu, struct 
> msr_data *msr_info)
>   kvm_write_tsc(vcpu, msr_info);
>   break;
>   

Re: [PATCH 8/8] KVM: x86: add SPEC_CTRL and IBPB_SUPPORT to MSR and CPUID lists

2018-01-15 Thread Eric Wheeler
On Sat, 13 Jan 2018, Paolo Bonzini wrote:

> Just add the new MSR at the end of the array.

I'm assuming you meant emulated_msrs[], correct?  


--
Eric Wheeler



> 
> Paolo
> 
> ----- Eric Wheeler <k...@lists.ewheeler.net> ha scritto:
> > On Tue, 9 Jan 2018, Paolo Bonzini wrote:
> > 
> > > Expose them to userspace, now that guests can use them.
> > > I am not adding cpufeatures here to avoid having a kernel
> > > that shows spec_ctrl in /proc/cpuinfo and actually has no
> > > support whatsoever for IBRS/IBPB.  Keep the ugly special-casing
> > > for now, and clean it up once the generic arch/x86/ code
> > > learns about them.
> > > 
> > > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > > index daa1918031df..4abb37d9f4d8 100644
> > > --- a/arch/x86/kvm/x86.c
> > > +++ b/arch/x86/kvm/x86.c
> > > @@ -1032,6 +1032,7 @@ unsigned int kvm_get_pt_addr_cnt(void)
> > >   MSR_IA32_RTIT_ADDR1_A, MSR_IA32_RTIT_ADDR1_B,
> > >   MSR_IA32_RTIT_ADDR2_A, MSR_IA32_RTIT_ADDR2_B,
> > >   MSR_IA32_RTIT_ADDR3_A, MSR_IA32_RTIT_ADDR3_B,
> > > + MSR_IA32_SPEC_CTRL,
> > >  };
> > 
> > Hi Paolo,
> > 
> > Thank you for posting this!
> > 
> > I am trying to merge this into 4.14 which does not have 
> > kvm_get_pt_addr_cnt. The rest of the patch commits, but this gets 
> > rejected. Is this a necessary part of the commit?
> > 
> > patching file arch/x86/kvm/cpuid.c
> > Hunk #1 succeeded at 389 (offset -8 lines).
> > Hunk #2 succeeded at 479 (offset -9 lines).
> > Hunk #3 succeeded at 636 (offset -27 lines).
> > patching file arch/x86/kvm/x86.c
> > Hunk #1 FAILED at 1032.
> > 1 out of 1 hunk FAILED -- saving rejects to file arch/x86/kvm/x86.c.rej
> > 
> > ]# cat arch/x86/kvm/x86.c.rej
> > --- arch/x86/kvm/x86.c
> > +++ arch/x86/kvm/x86.c
> > @@ -1032,6 +1032,7 @@
> > MSR_IA32_RTIT_ADDR1_A, MSR_IA32_RTIT_ADDR1_B,
> > MSR_IA32_RTIT_ADDR2_A, MSR_IA32_RTIT_ADDR2_B,
> > MSR_IA32_RTIT_ADDR3_A, MSR_IA32_RTIT_ADDR3_B,
> > +   MSR_IA32_SPEC_CTRL,
> >  };
> >  
> >  static unsigned num_msrs_to_save;
> > 
> > 
> > Thank you for your help!
> > 
> > --
> > Eric Wheeler
> > 
> > 
> > >  
> > >  static unsigned num_msrs_to_save;
> > > -- 
> > > 1.8.3.1
> > > 
> > > 
> > > 
> 
> 


[PATCH] drbd: standardize kthread/workqueue thread naming to include drbd minor number

2018-01-15 Thread Eric Wheeler
From: Eric Wheeler <g...@linux.ewheeler.net>

For DRBD resources with long names that start with the same prefix,
it was difficult to find all process pids for that minor since names
are truncated to the task_struct's comm field (16 bytes).

This patch names all processes associated with a DRBD device as drbdN_*
where N is the DRBD minor, in the same way that the drbdN_submit workqueue
is named.  Userspace tools can then look up the name=>minor=>pid mapping
for all pids and use tools like chrt, ionice, and nice, add pids to
cgroups, or put the pids to other useful purposes.

Signed-off-by: Eric Wheeler <d...@linux.ewheeler.net>
---
 drivers/block/drbd/drbd_main.c | 31 +--
 drivers/block/drbd/drbd_receiver.c | 19 +--
 2 files changed, 46 insertions(+), 4 deletions(-)

diff --git a/drivers/block/drbd/drbd_main.c b/drivers/block/drbd/drbd_main.c
index 8cb3791..c5444b3 100644
--- a/drivers/block/drbd/drbd_main.c
+++ b/drivers/block/drbd/drbd_main.c
@@ -330,7 +330,21 @@ static int drbd_thread_setup(void *arg)
unsigned long flags;
int retval;
 
-	snprintf(current->comm, sizeof(current->comm), "drbd_%c_%s",
+	int i, minor = -1;
+	struct drbd_device *device;
+
+	idr_for_each_entry(&resource->devices, device, i) {
+		if (minor >= 0 && minor != device->minor)
+			pr_err("drbd: %s(%s): minor is different.  was %d is now %d\n",
+				__func__,
+				resource->name,
+				minor, device->minor);
+
+		minor = device->minor;
+	}
+
+	snprintf(current->comm, sizeof(current->comm), "drbd%d_%c_%s",
+		 minor,
 		 thi->name[0],
 		 resource->name);
 
@@ -389,6 +403,8 @@ int drbd_thread_start(struct drbd_thread *thi)
 {
struct drbd_resource *resource = thi->resource;
struct task_struct *nt;
+   struct drbd_device *device;
+   int i, minor = -1;
unsigned long flags;
 
/* is used from state engine doing drbd_thread_stop_nowait,
@@ -417,8 +433,19 @@ int drbd_thread_start(struct drbd_thread *thi)
 	spin_unlock_irqrestore(&thi->t_lock, flags);
 	flush_signals(current); /* otherw. may get -ERESTARTNOINTR */
 
+	idr_for_each_entry(&resource->devices, device, i) {
+		if (minor >= 0 && minor != device->minor)
+			pr_err("drbd: drbd_thread_start(%s, %s): minor is different.  was %d is now %d\n",
+				thi->name,
+				thi->resource->name,
+				minor, device->minor);
+
+		minor = device->minor;
+	}
+
 	nt = kthread_create(drbd_thread_setup, (void *) thi,
-			    "drbd_%c_%s", thi->name[0], thi->resource->name);
+			    "drbd%d_%c_%s",
+			    minor, thi->name[0], thi->resource->name);
 
if (IS_ERR(nt)) {
drbd_err(resource, "Couldn't start thread\n");
diff --git a/drivers/block/drbd/drbd_receiver.c b/drivers/block/drbd/drbd_receiver.c
index 796eaf3..62a902f 100644
--- a/drivers/block/drbd/drbd_receiver.c
+++ b/drivers/block/drbd/drbd_receiver.c
@@ -934,7 +934,7 @@ static int conn_connect(struct drbd_connection *connection)
struct drbd_socket sock, msock;
struct drbd_peer_device *peer_device;
struct net_conf *nc;
-   int vnr, timeout, h;
+   int vnr, timeout, h, i, minor = -1;
bool discard_my_data, ok;
enum drbd_state_rv rv;
struct accept_wait_data ad = {
@@ -1132,10 +1132,25 @@ static int conn_connect(struct drbd_connection *connection)
}
 
 	drbd_thread_start(&connection->ack_receiver);
+
+	if (connection->resource) {
+		struct drbd_device *device;
+
+		idr_for_each_entry(&connection->resource->devices, device, i) {
+			if (minor >= 0 && minor != device->minor)
+				pr_err("drbd: conn_connect(%s): minor is different.  was %d is now %d\n",
+					connection->resource->name,
+					minor, device->minor);
+
+			minor = device->minor;
+		}
+	}
/* opencoded create_singlethread_workqueue(),
 * to be able to use format string arguments */
 	connection->ack_sender =
-		alloc_ordered_workqueue("drbd_as_%s", WQ_MEM_RECLAIM, connection->resource->name);
+		alloc_ordered_workqueue("drbd%d_as_%s",
+			WQ_MEM_RECLAIM, minor, connection->resource->name);
+
if (!connection->ack_sender) {
drbd_err(connection, "Failed to create workqueue ack_sender\n");
return 0;
-- 
1.8.3.1



[PATCH] drbd: standardize kthread/workqueue thread naming to include drbd minor number

2018-01-15 Thread Eric Wheeler
From: Eric Wheeler 

For DRBD resources with long names that start with the same prefix,
it was difficult to find all process pids for that minor since names
are truncated to the task_struct's comm field (16 bytes).

This patch names all processes associated with a DRBD device as drbdN_*
where N is the DRBD minor in the same ways that the drbdN_submit workqueue
is named.  Userspace tools can then lookup the name=>minor=>pid mapping
and for all pids and use tools like chrt, ioprio, nice, add pids to
cgroups, or for other useful purpose.

Signed-off-by: Eric Wheeler 
---
 drivers/block/drbd/drbd_main.c | 31 +++++++++++++++++++++++++++++--
 drivers/block/drbd/drbd_receiver.c | 19 +++++++++++++++++--
 2 files changed, 46 insertions(+), 4 deletions(-)

diff --git a/drivers/block/drbd/drbd_main.c b/drivers/block/drbd/drbd_main.c
index 8cb3791..c5444b3 100644
--- a/drivers/block/drbd/drbd_main.c
+++ b/drivers/block/drbd/drbd_main.c
@@ -330,7 +330,21 @@ static int drbd_thread_setup(void *arg)
unsigned long flags;
int retval;
 
-   snprintf(current->comm, sizeof(current->comm), "drbd_%c_%s",
+   int i, minor = -1;
+   struct drbd_device *device;
+
+   idr_for_each_entry(&resource->devices, device, i) {
+   if (minor >= 0 && minor != device->minor)
+   pr_err("drbd: %s(%s): minor is different.  was %d is now %d\n",
+   __func__,
+   resource->name,
+   minor, device->minor);
+
+   minor = device->minor;
+   }
+
+   snprintf(current->comm, sizeof(current->comm), "drbd%d_%c_%s",
+minor,
 thi->name[0],
 resource->name);
 
@@ -389,6 +403,8 @@ int drbd_thread_start(struct drbd_thread *thi)
 {
struct drbd_resource *resource = thi->resource;
struct task_struct *nt;
+   struct drbd_device *device;
+   int i, minor = -1;
unsigned long flags;
 
/* is used from state engine doing drbd_thread_stop_nowait,
@@ -417,8 +433,19 @@ int drbd_thread_start(struct drbd_thread *thi)
spin_unlock_irqrestore(&thi->t_lock, flags);
flush_signals(current); /* otherw. may get -ERESTARTNOINTR */
 
+   idr_for_each_entry(&resource->devices, device, i) {
+   if (minor >= 0 && minor != device->minor)
+   pr_err("drbd: drbd_thread_start(%s, %s): minor is different.  was %d is now %d\n",
+   thi->name,
+   thi->resource->name,
+   minor, device->minor);
+
+   minor = device->minor;
+   }
+
nt = kthread_create(drbd_thread_setup, (void *) thi,
-   "drbd_%c_%s", thi->name[0], thi->resource->name);
+   "drbd%d_%c_%s",
+   minor, thi->name[0], thi->resource->name);
 
if (IS_ERR(nt)) {
drbd_err(resource, "Couldn't start thread\n");
diff --git a/drivers/block/drbd/drbd_receiver.c b/drivers/block/drbd/drbd_receiver.c
index 796eaf3..62a902f 100644
--- a/drivers/block/drbd/drbd_receiver.c
+++ b/drivers/block/drbd/drbd_receiver.c
@@ -934,7 +934,7 @@ static int conn_connect(struct drbd_connection *connection)
struct drbd_socket sock, msock;
struct drbd_peer_device *peer_device;
struct net_conf *nc;
-   int vnr, timeout, h;
+   int vnr, timeout, h, i, minor = -1;
bool discard_my_data, ok;
enum drbd_state_rv rv;
struct accept_wait_data ad = {
@@ -1132,10 +1132,25 @@ static int conn_connect(struct drbd_connection *connection)
}
 
drbd_thread_start(&connection->ack_receiver);
+
+   if (connection->resource) {
+   struct drbd_device *device;
+
+   idr_for_each_entry(&connection->resource->devices, device, i) {
+   if (minor >= 0 && minor != device->minor)
+   pr_err("drbd: conn_connect(%s): minor is different.  was %d is now %d\n",
+   connection->resource->name,
+   minor, device->minor);
+
+   minor = device->minor;
+   }
+   }
/* opencoded create_singlethread_workqueue(),
 * to be able to use format string arguments */
connection->ack_sender =
-   alloc_ordered_workqueue("drbd_as_%s", WQ_MEM_RECLAIM, connection->resource->name);
+   alloc_ordered_workqueue("drbd%d_as_%s",
+   WQ_MEM_RECLAIM, minor, connection->resource->name);
+
if (!connection->ack_sender) {
drbd_err(connection, "Failed to create workqueue ack_sender\n");
return 0;
-- 
1.8.3.1
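The truncation problem the naming patch addresses comes from the kernel's 16-byte TASK_COMM_LEN (15 visible characters plus a NUL). A sketch with invented resource names shows why long names with a common prefix become indistinguishable and how the minor prefix fixes it:

```python
TASK_COMM_LEN = 16  # kernel task_struct comm size, including the NUL byte

def comm(fmt, *args):
    # Mimic snprintf(current->comm, sizeof(current->comm), fmt, ...):
    # at most 15 visible characters survive.
    return (fmt % args)[:TASK_COMM_LEN - 1]

# Two long resource names with a common prefix truncate identically:
old_a = comm("drbd_%c_%s", "w", "cluster-volume-alpha")
old_b = comm("drbd_%c_%s", "w", "cluster-volume-beta")

# Prefixing the minor number keeps the names distinguishable:
new_a = comm("drbd%d_%c_%s", 0, "w", "cluster-volume-alpha")
new_b = comm("drbd%d_%c_%s", 1, "w", "cluster-volume-beta")
```

With the old format both threads show up as "drbd_w_cluster-" in ps output; with the minor prefix the 15 surviving characters still identify the device.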



[PATCH] drbd: fix discard_zeroes_if_aligned regression

2018-01-15 Thread Eric Wheeler
From: Eric Wheeler <g...@linux.ewheeler.net>

According to drbd.conf documentation, "To not break established and
expected behaviour, and suddenly cause fstrim on thin-provisioned LVs to
run out-of-space instead of freeing up space, the default value is yes."

This behavior regressed in the REQ_OP_WRITE_ZEROES refactor near
45c21793 drbd: implement REQ_OP_WRITE_ZEROES
0dbed96  drbd: make intelligent use of blkdev_issue_zeroout
which caused dm-thin backed DRBD volumes to zero blocks and run out of
space instead of passing discard to the backing device as defined by the
discard_zeroes_if_aligned option.

A helper function could reduce code duplication.

Signed-off-by: Eric Wheeler <d...@linux.ewheeler.net>
Cc: <sta...@vger.kernel.org> # 4.14
---
 drivers/block/drbd/drbd_receiver.c | 22 ++++++++++++++++++++--
 drivers/block/drbd/drbd_req.c  | 23 +++++++++++++++++++++--
 2 files changed, 41 insertions(+), 4 deletions(-)

diff --git a/drivers/block/drbd/drbd_receiver.c b/drivers/block/drbd/drbd_receiver.c
index 62a902f..58f0e43 100644
--- a/drivers/block/drbd/drbd_receiver.c
+++ b/drivers/block/drbd/drbd_receiver.c
@@ -1507,9 +1507,27 @@ void drbd_bump_write_ordering(struct drbd_resource *resource, struct drbd_backin
 static void drbd_issue_peer_discard(struct drbd_device *device, struct drbd_peer_request *peer_req)
 {
struct block_device *bdev = device->ldev->backing_bdev;
+   struct disk_conf *dc;
+   bool discard_zeroes_if_aligned, zeroout;
 
-   if (blkdev_issue_zeroout(bdev, peer_req->i.sector, peer_req->i.size >> 9,
-   GFP_NOIO, 0))
+   rcu_read_lock();
+   dc = rcu_dereference(device->ldev->disk_conf);
+   discard_zeroes_if_aligned = dc->discard_zeroes_if_aligned;
+   rcu_read_unlock();
+
+   /* Use zeroout unless discard_zeroes_if_aligned is set.
+* If blkdev_issue_discard fails, then retry with blkdev_issue_zeroout.
+* See also drbd_process_discard_req() in drbd_req.c.
+*/
+   zeroout = true;
+   if (discard_zeroes_if_aligned &&
+   blkdev_issue_discard(bdev, peer_req->i.sector,
+   peer_req->i.size >> 9, GFP_NOIO, 0) == 0)
+   zeroout = false;
+
+   if (zeroout &&
+   blkdev_issue_zeroout(bdev, peer_req->i.sector,
+   peer_req->i.size >> 9, GFP_NOIO, 0) != 0)
peer_req->flags |= EE_WAS_ERROR;
 
drbd_endio_write_sec_final(peer_req);
diff --git a/drivers/block/drbd/drbd_req.c b/drivers/block/drbd/drbd_req.c
index de8566e..070b5e7 100644
--- a/drivers/block/drbd/drbd_req.c
+++ b/drivers/block/drbd/drbd_req.c
@@ -1158,9 +1158,28 @@ static int drbd_process_write_request(struct drbd_request *req)
 static void drbd_process_discard_req(struct drbd_request *req)
 {
struct block_device *bdev = req->device->ldev->backing_bdev;
+   struct drbd_device *device = req->device;
+   struct disk_conf *dc;
+   bool discard_zeroes_if_aligned, zeroout;
 
-   if (blkdev_issue_zeroout(bdev, req->i.sector, req->i.size >> 9,
-   GFP_NOIO, 0))
+   rcu_read_lock();
+   dc = rcu_dereference(device->ldev->disk_conf);
+   discard_zeroes_if_aligned = dc->discard_zeroes_if_aligned;
+   rcu_read_unlock();
+
+   /* Use zeroout unless discard_zeroes_if_aligned is set.
+* If blkdev_issue_discard fails, then retry with blkdev_issue_zeroout.
+* See also drbd_issue_peer_discard() in drbd_receiver.c.
+*/
+   zeroout = true;
+   if (discard_zeroes_if_aligned &&
+   blkdev_issue_discard(bdev, req->i.sector, req->i.size >> 9,
+   GFP_NOIO, 0) == 0)
+   zeroout = false;
+
+   if (zeroout &&
+   blkdev_issue_zeroout(bdev, req->i.sector, req->i.size >> 9,
+   GFP_NOIO, 0) != 0)
req->private_bio->bi_status = BLK_STS_IOERR;
bio_endio(req->private_bio);
 }
-- 
1.8.3.1
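The fallback ordering in both hunks of this patch is the same and can be condensed to a few lines. This sketch models the control flow only, with the blkdev_issue_discard()/blkdev_issue_zeroout() calls stubbed out as callables returning 0 on success:

```python
def issue_peer_discard(discard_zeroes_if_aligned, issue_discard, issue_zeroout):
    # Mirror of the patch's control flow: try a real discard only when
    # discard_zeroes_if_aligned is set; fall back to an explicit
    # zero-out when discard is disabled or fails.  Returns 0 on success.
    if discard_zeroes_if_aligned and issue_discard() == 0:
        return 0
    return issue_zeroout()

# Option off: discard is never attempted, zeroout path is taken.
off = issue_peer_discard(False, lambda: 0, lambda: 0)
# Option on and discard succeeds: the zeroout stub must never run.
ok = issue_peer_discard(True, lambda: 0, lambda: 1 // 0)
# Option on but discard fails (-EIO here): zeroout saves the request.
fallback = issue_peer_discard(True, lambda: -5, lambda: 0)
```

The key property is that a failing discard is retried as a zero-out rather than propagated as an error, so the option can only improve behavior on thin backing devices.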



Re: [PATCH 8/8] KVM: x86: add SPEC_CTRL and IBPB_SUPPORT to MSR and CPUID lists

2018-01-12 Thread Eric Wheeler
On Tue, 9 Jan 2018, Paolo Bonzini wrote:

> Expose them to userspace, now that guests can use them.
> I am not adding cpufeatures here to avoid having a kernel
> that shows spec_ctrl in /proc/cpuinfo and actually has no
> support whatsoever for IBRS/IBPB.  Keep the ugly special-casing
> for now, and clean it up once the generic arch/x86/ code
> learns about them.
> 
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index daa1918031df..4abb37d9f4d8 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -1032,6 +1032,7 @@ unsigned int kvm_get_pt_addr_cnt(void)
>   MSR_IA32_RTIT_ADDR1_A, MSR_IA32_RTIT_ADDR1_B,
>   MSR_IA32_RTIT_ADDR2_A, MSR_IA32_RTIT_ADDR2_B,
>   MSR_IA32_RTIT_ADDR3_A, MSR_IA32_RTIT_ADDR3_B,
> + MSR_IA32_SPEC_CTRL,
>  };

Hi Paolo,

Thank you for posting this!

I am trying to merge this into 4.14, which does not have 
kvm_get_pt_addr_cnt. The rest of the patch applies cleanly, but this hunk 
gets rejected. Is this a necessary part of the commit?

patching file arch/x86/kvm/cpuid.c
Hunk #1 succeeded at 389 (offset -8 lines).
Hunk #2 succeeded at 479 (offset -9 lines).
Hunk #3 succeeded at 636 (offset -27 lines).
patching file arch/x86/kvm/x86.c
Hunk #1 FAILED at 1032.
1 out of 1 hunk FAILED -- saving rejects to file arch/x86/kvm/x86.c.rej

]# cat arch/x86/kvm/x86.c.rej
--- arch/x86/kvm/x86.c
+++ arch/x86/kvm/x86.c
@@ -1032,6 +1032,7 @@
MSR_IA32_RTIT_ADDR1_A, MSR_IA32_RTIT_ADDR1_B,
MSR_IA32_RTIT_ADDR2_A, MSR_IA32_RTIT_ADDR2_B,
MSR_IA32_RTIT_ADDR3_A, MSR_IA32_RTIT_ADDR3_B,
+   MSR_IA32_SPEC_CTRL,
 };
 
 static unsigned num_msrs_to_save;


Thank you for your help!

--
Eric Wheeler


>  
>  static unsigned num_msrs_to_save;
> -- 
> 1.8.3.1
> 
> 
> 


Re: [PATCH v2] bcache: explicitly destroy mutex while exiting

2017-10-27 Thread Eric Wheeler
On Tue, 10 Oct 2017, Michael Lyle wrote:

> On 10/10/2017 05:25 AM, Coly Li wrote:
> > On 2017/10/10 下午5:00, Liang Chen wrote:
> >> mutex_destroy does nothing most of time, but it's better to call
> >> it to make the code future proof and it also has some meaning
> >> for like mutex debug.
> >>
> >> As Coly pointed out in a previous review, bcache_exit() may not be
> >> able to handle all the references properly if userspace registers
> >> cache and backing devices right before bch_debug_init runs and
> >> bch_debug_init failes later. So not exposing userspace interface
> >> until everything is ready to avoid that issue.
> >>
> >> Signed-off-by: Liang Chen <liangchen.li...@gmail.com>
> > 
> > Hi Liang,
> > 
> > No more comment from me, it looks good. Thanks.
> > 
> > Reviewed-by: Coly Li <col...@suse.de>
> 
> Looks good to me too.
> 
> Reviewed-by: Michael Lyle <ml...@lyle.org>

Should this Cc: stable to avoid the register race (possible 
crash?) described by Liang in other stable kernels?

Reviewed-by: Eric Wheeler <bca...@linux.ewheeler.net>

> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-bcache" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 

Re: bcache with existing ext4 filesystem

2017-07-26 Thread Eric Wheeler
On Wed, 26 Jul 2017, Austin S. Hemmelgarn wrote:

> On 2017-07-26 13:41, Eric Wheeler wrote:
> > On Wed, 26 Jul 2017, Pavel Machek wrote:
> > 
> > > Hi!
> > >
> > > > > > > Unfortunately, that would mean shifting 400GB data 8KB forward,
> > > > > > > and
> > > > > > > compatibility problems. So I'd prefer adding bcache superblock
> > > > > > > into
> > > > > > > the reserved space, so I can have caching _and_ compatibility with
> > > > > > > grub2 etc (and avoid 400GB move):
> > > > > >
> > > > > > The common way to do that is to move the beginning of the partition,
> > > > > > assuming your ext4 lives in a partition.
> > > > >
> > > > > Well... if I move the partition, grub2 (etc) will be unable to access
> > > > > data on it. (Plus I do not have free space before some of the
> > > > > partitions I'd like to be cached).
> > > >
> > > > Why not use dm-linear and prepend space for the bcache superblock?  If
> > > > this is your boot device, then you would need to write a custom
> > > > initrd hook too.
> > >
> > > Thanks for a pointer. That would actually work, but I'd have to be
> > > very, very careful using it...
> > >
> > > ...because if I, or systemd or some kind of automounter sees the
> > > underlying device (sda4) and writes to it (it is valid ext4 after
> > > all), I'll have inconsistent base device and cache ... and that will
> > > be asking for major problems (even in writethrough mode).
> > 
> > Sigh.  Gone are the days when distributions would only mount filesystems
> > if you ask them to.
> This is only an issue if:
> 1. You are crazy and have your desktop set up to auto-mount stuff on insertion
> (instead of on-access), or you're using a brain-dead desktop that doesn't let
> you turn this functionality off.
> or:
> 2. You are mounting by filesystem label or filesystem UUID (/dev/mapper/* and
> /dev// links are more than sufficiently stable unless you're
> changing the storage stack all the time).  This is of course almost always the
> case these days because for some reason the big distros assume that the device
> mapper links aren't stable or aren't safe to use.
> or:
> 3. You're working with a multi-device BTRFS volume (and in that case, it's not
> an issue of auto-mounting it, but an issue of buggy kernel behavior combined
> with the 'scan everything for BTRFS' udev rule that's now upstream in udev).
> > 
> > If this is a desktop, then I'm not sure what to suggest.  But for server
> > with no GUI, turn off the grub2 osprober (GRUB_DISABLE_OS_PROBER="true" in
> > /etc/sysconfig/grub).  If this was LVM, then I would suggest also setting
> > global_filter in lvm.conf.  If you find other places that need poked to
> > prevent automounting then please let me know!
> Nope, just those and what I mentioned above under 1 and 2.
> > 
> > As for ext4 feature bits, can they be arbitrarily named? (I think they are
> > bits, so maybe not).  Maybe propose a patch to ext4 to provide a
> > "disable_mount" feature.  This would prevent mounting altogether, and you
> > would set/clear it when you care to.  A strange feature indeed.  Doing
> > this as an obscure feature on a single filesystem doesn't quite seem
> > right.
> In the case of ext4, you can use the MMP feature (with the associated minor
> overhead) to enforce this.

Neat!  However, MMP might not get flagged if bcache is in writeback mode 
and caches the superblock update.  He would need to set MMP on the backing 
device explicitly, which might work but seems like a bad idea for 
consistency.  Writethrough would be fine.
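
The MMP mechanism referenced above works roughly as follows: a sampled-twice sequence number detects another active user of the filesystem. This is a toy model only; the real ext4/e2fsprogs protocol has more states and also guards the mount itself, not just a pre-check:

```python
import itertools

def mmp_check(read_seq, sleep=lambda s: None, interval=5):
    # Toy model of ext4 multi-mount protection (MMP): sample the MMP
    # sequence number, wait one update interval, and sample again.
    # If it advanced, another node is actively updating the block,
    # so refuse to mount.
    first = read_seq()
    sleep(interval)
    return read_seq() == first   # True -> safe to proceed

idle_ok = mmp_check(lambda: 42)          # idle device: mount allowed
ctr = itertools.count()
busy_ok = mmp_check(lambda: next(ctr))   # active device: mount refused
```

This also illustrates the writeback caveat: if the sequence updates land in a writeback cache instead of the backing device, the second sample on the other node never sees them advance.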
  
Given the earlier precautions and sanity over which volume should be 
mounted (eg, by device name not UUID), then he should be ok if mounting 
neither by label nor uuid.

-Eric

> > 
> > 
> > It might be better to have a block-layer device-mount mask so devices that
> > are allowed to be mounted can be whitelisted on the kernel cmdline or
> > something.  blk.allow_mount=8:16,253:*,...  or blk.disallow_mount=8:32 (or
> > probably both).
> > 
> > Just ideas, but it would be great to allow mounting only those
> > major/minors which are authorized, particularly with increasingly complex
> > block-device stacks.
> If it's not explicitly created as a mount unit, fstab entry, or automount
> entry, and you're not using a brain-dead desktop, and you're not using FS
> labels or UUID's to mount, it _really_ isn't an issue except on BTRFS, and for
> that this would provide _no_ benefit, because the issue there results from
> confusion over which device to use combined with the assumption that a UUID
> stored on the device is sufficient to uniquely identify a filesystem.
> 
> 


Re: bcache with existing ext4 filesystem

2017-07-26 Thread Eric Wheeler
On Wed, 26 Jul 2017, Pavel Machek wrote:

> Hi!
> 
> > > > > Unfortunately, that would mean shifting 400GB data 8KB forward, and
> > > > > compatibility problems. So I'd prefer adding bcache superblock into
> > > > > the reserved space, so I can have caching _and_ compatibility with
> > > > > grub2 etc (and avoid 400GB move):
> > > > 
> > > > The common way to do that is to move the beginning of the partition,
> > > > assuming your ext4 lives in a partition.
> > > 
> > > Well... if I move the partition, grub2 (etc) will be unable to access
> > > data on it. (Plus I do not have free space before some of the
> > > partitions I'd like to be cached).
> > 
> > Why not use dm-linear and prepend space for the bcache superblock?  If 
> > this is your boot device, then you would need to write a custom 
> > initrd hook too.
> 
> Thanks for a pointer. That would actually work, but I'd have to be
> very, very careful using it...
> 
> ...because if I, or systemd or some kind of automounter sees the
> underlying device (sda4) and writes to it (it is valid ext4 after
> all), I'll have inconsistent base device and cache ... and that will
> be asking for major problems (even in writethrough mode).

Sigh.  Gone are the days when distributions would only mount filesystems 
if you ask them to.  

If this is a desktop, then I'm not sure what to suggest.  But for server 
with no GUI, turn off the grub2 osprober (GRUB_DISABLE_OS_PROBER="true" in 
/etc/sysconfig/grub).  If this was LVM, then I would suggest also setting 
global_filter in lvm.conf.  If you find other places that need poked to 
prevent automounting then please let me know!

As for ext4 feature bits, can they be arbitrarily named? (I think they are 
bits, so maybe not).  Maybe propose a patch to ext4 to provide a 
"disable_mount" feature.  This would prevent mounting altogether, and you 
would set/clear it when you care to.  A strange feature indeed.  Doing 
this as an obscure feature on a single filesystem doesn't quite seem 
right.


It might be better to have a block-layer device-mount mask so devices that 
are allowed to be mounted can be whitelisted on the kernel cmdline or 
something.  blk.allow_mount=8:16,253:*,...  or blk.disallow_mount=8:32 (or 
probably both).

Just ideas, but it would be great to allow mounting only those 
major/minors which are authorized, particularly with increasingly complex 
block-device stacks.  


--
Eric Wheeler




> 
> Actually, this already would be usable, if we killed content of cache
> device on every mount. Hmmm. I have reasonably long uptimes these days.
> 
> If possible, I'd like something more clever: bcache saves mtime of the
> ext4 filesystem on shutdown. If the mtime does not match on the next
> startup, it means someone fsck-ed the filesystem or mounted it
> directly or something, and cache is invalid.
> 
> Bonus would be some kind of interlock with "incompatible feature"
> bits. If the bcache has dirty data in write-back cache, it would be
> nice to have "incompatible feature" bit set, so that tools that don't
> have access to the cache refuse to touch it.
> 
> Best regards,
> 
>   Pavel
> 
> -- 
> (english) http://www.livejournal.com/~pavelmachek
> (cesky, pictures) 
> http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
> 


Re: bcache with existing ext4 filesystem

2017-07-26 Thread Eric Wheeler
On Wed, 26 Jul 2017, Pavel Machek wrote:

> Hi!
> 
> > > > > Unfortunately, that would mean shifting 400GB data 8KB forward, and
> > > > > compatibility problems. So I'd prefer adding bcache superblock into
> > > > > the reserved space, so I can have caching _and_ compatibility with
> > > > > grub2 etc (and avoid 400GB move):
> > > > 
> > > > The common way to do that is to move the beginning of the partition,
> > > > assuming your ext4 lives in a partition.
> > > 
> > > Well... if I move the partition, grub2 (etc) will be unable to access
> > > data on it. (Plus I do not have free space before some of the
> > > partitions I'd like to be cached).
> > 
> > Why not use dm-linear and prepend space for the bcache superblock?  If 
> > this is your boot device, then you would need to write a custom 
> > initrd hook too.
> 
> Thanks for a pointer. That would actually work, but I'd have to be
> very, very careful using it...
> 
> ...because if I, or systemd or some kind of automounter sees the
> underlying device (sda4) and writes to it (it is valid ext4 after
> all), I'll have inconsistent base device and cache ... and that will
> be asking for major problems (even in writethrough mode).

Sigh.  Gone are the days when distributions would only mount filesystems 
if you ask them to.  

If this is a desktop, then I'm not sure what to suggest.  But for server 
with no GUI, turn off the grub2 osprober (GRUB_DISABLE_OS_PROBER="true" in 
/etc/sysconfig/grub).  If this was LVM, then I would suggest also setting 
global_filter in lvm.conf.  If you find other places that need poked to 
prevent automounting then please let me know!

As for ext4 feature bits, can they be arbitrarily named? (I think they are 
bits, so maybe not).  Maybe propose a patch to ext4 to provide a 
"disable_mount" feature.  This would prevent mounting altogether, and you 
would set/clear it when you care to.  A strange feature indeed.  Doing 
this as an obscure feature on a single filesystem doesn't quite seem 
right.


It might be better to have a block-layer device-mount mask so devices that 
are allowed to be mounted can be whitelisted on the kernel cmdline or 
something.  blk.allow_mount=8:16,253:*,...  or blk.disallow_mount=8:32 (or 
probably both).

Just ideas, but it would be great to allow mounting only those 
major/minors which are authorized, particularly with increasingly complex 
block-device stacks.  
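
As a rough model of that (entirely hypothetical) blk.allow_mount= grammar 
-- comma-separated MAJOR:MINOR entries, with '*' as a wildcard minor -- a 
whitelist check could look like:

```shell
# Hypothetical grammar sketch: blk.allow_mount=8:16,253:*
# mount_allowed LIST MAJOR MINOR -> exit 0 if MAJOR:MINOR is whitelisted.
mount_allowed() {
    local entry IFS=,
    for entry in $1; do
        case "$entry" in
            "$2:$3" | "$2:*") return 0 ;;   # exact match or wildcard minor
        esac
    done
    return 1
}

mount_allowed "8:16,253:*" 8 16  && echo "8:16 allowed"
mount_allowed "8:16,253:*" 253 7 && echo "253:7 allowed (wildcard)"
mount_allowed "8:16,253:*" 8 32  || echo "8:32 denied"
```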


--
Eric Wheeler




> 
> Actually, this already would be usable, if we killed content of cache
> device on every mount. Hmmm. I have reasonably long uptimes these days.
> 
> If possible, I'd like something more clever: bcache saves mtime of the
> ext4 filesystem on shutdown. If the mtime does not match on the next
> startup, it means someone fsck-ed the filesystem or mounted it
> directly or something, and cache is invalid.
> 
> Bonus would be some kind of interlock with "incompatible feature"
> bits. If the bcache has dirty data in write-back cache, it would be
> nice to have "incompatible feature" bit set, so that tools that don't
> have access to the cache refuse to touch it.
> 
> Best regards,
> 
>   Pavel
> 
> -- 
> (english) http://www.livejournal.com/~pavelmachek
> (cesky, pictures) 
> http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
> 


Re: v4.12: register_dcache lockdep problem on boot

2017-07-25 Thread Eric Wheeler
On Tue, 25 Jul 2017, Pavel Machek wrote:

> Hi!
> 
> I get this one during boot...
> 
>   Pavel
> [   12.717617]  __lock_acquire+0x86/0x94a
> [   12.720466]  lock_acquire+0x4a/0x66
> [   12.723223]  ? mca_reap+0x56/0x115
> [   12.725972]  down_write_trylock+0x44/0x59
> [   12.728737]  ? mca_reap+0x56/0x115
> [   12.731454]  mca_reap+0x56/0x115
> [   12.734131]  mca_alloc+0x80/0x390
> [   12.736741]  bch_btree_node_get+0x76/0x27e
> [   12.739324]  run_cache_set+0x2c0/0x7c1
> [   12.741866]  register_bcache+0xc35/0x109b
> [   12.744356]  ? register_bcache+0xc35/0x109b
> [   12.746826]  ? x86_perf_event_set_period+0x187/0x20c
> [   12.749305]  ? __mutex_lock+0x2a/0x5d7
> [   12.751725]  ? bch_cache_set_alloc+0x577/0x577
> [   12.754138]  kobj_attr_store+0x10/0x1f
> [   12.756508]  ? kobj_attr_store+0x10/0x1f
> [   12.758906]  sysfs_kf_write+0x2f/0x41
> [   12.761258]  ? sysfs_file_ops+0x40/0x40
> [   12.763521]  kernfs_fop_write+0xd3/0x14a
> [   12.765762]  ? sysfs_file_ops+0x40/0x40
> [   12.767948]  ? kernfs_vma_page_mkwrite+0x67/0x67
> [   12.770121]  __vfs_write+0x1c/0x103
> [   12.772076]  ? vfs_write+0x93/0x13c
> [   12.774058]  ? __sb_start_write+0xeb/0x14c
> [   12.776147]  ? vfs_write+0x93/0x13c
> [   12.778135]  vfs_write+0xa1/0x13c
> [   12.780033]  SyS_write+0x3d/0x81
> [   12.781854]  do_int80_syscall_32+0x42/0x84
> [   12.783631]  entry_INT80_32+0x2e/0x2e
> [   12.785357] EIP: 0xb7758c42
> [   12.787015] EFLAGS: 0246 CPU: 2
> [   12.788711] EAX: ffda EBX: 0003 ECX: b7755000 EDX: 000a
> [   12.790428] ESI: 000a EDI: b7755000 EBP: bfca9a0c ESP: bfca9954
> [   12.792168]  DS: 007b ES: 007b FS:  GS: 0033 SS: 007b
> [   12.793861] Code: 0f 85 94 00 00 00 68 da 7f e8 c4 68 f6 7d e8 c4 e8 58 e0 
> 00 00 0f ff 83 c4 08 eb 7e 8b 47 0c 39 83 08 01 00 00 0f 84 1e 03 00 00 <0f> 
> ff e9 17 03 00 00 85 db 74 0b 3b 43 10 74 e1 8b 1b 85 db 75
> [   12.797697] ---[ end trace 6e213ee067267bec ]---
> [   12.808884] bcache: bch_journal_replay() journal replay done, 0 keys in 2 
> entries, seq 4
> [   12.822321] bcache: register_cache() registered cache device sdb1
> [   13.378319] bcache: register_bdev() registered backing device sda3

It could be benign, try this patch:

-Eric

From: Liang Chen <liangchen.li...@gmail.com>

mutex_destroy() does nothing most of the time, but it's better to call
it to make the code future-proof, and it also has some value for
things like mutex debugging.

Signed-off-by: Liang Chen <liangchen.li...@gmail.com>
Reviewed-by: Eric Wheeler <bca...@linux.ewheeler.net>
Cc: sta...@vger.kernel.org
---
 drivers/md/bcache/super.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c
index 48b8c20..1f84791 100644
--- a/drivers/md/bcache/super.c
+++ b/drivers/md/bcache/super.c
@@ -2089,6 +2089,7 @@ static void bcache_exit(void)
if (bcache_major)
unregister_blkdev(bcache_major, "bcache");
unregister_reboot_notifier();
+   mutex_destroy(&bch_register_lock);
 }
 
 static int __init bcache_init(void)
@@ -2106,6 +2107,7 @@ static int __init bcache_init(void)
 
bcache_major = register_blkdev(0, "bcache");
if (bcache_major < 0) {
+   mutex_destroy(&bch_register_lock);
unregister_reboot_notifier();
return bcache_major;
}
-- 
1.8.3.1


Re: bcache with existing ext4 filesystem

2017-07-25 Thread Eric Wheeler
On Tue, 25 Jul 2017, Pavel Machek wrote:

> On Tue 2017-07-25 12:32:48, Vojtech Pavlik wrote:
> > On Tue, Jul 25, 2017 at 08:43:04AM +0200, Pavel Machek wrote:
> > > On Tue 2017-07-25 00:51:56, Theodore Ts'o wrote:
> > > > On Mon, Jul 24, 2017 at 10:04:51PM +0200, Pavel Machek wrote:
> > > > > Question for you was... Is the first 1KiB of each ext4 filesystem 
> > > > > still
> > > > > free and "reserved for a bootloader"?
> > > > 
> > > > Yes.
> > > 
> > > Thanks.
> > > 
> > > > > If I needed more for bcache superblock (8KiB, IIRC), would that be
> > > > > easy to accomplish on existing filesystem?
> > > > 
> > > > Huh?  Why would the bcache superblock matter when you're talking about
> > > > the ext4 layout?  The bcache superblock will be on the bcache
> > > > device/partition, and the ext4 superblock will be on the ext4
> > > > device/partition.
> > > 
> > > I'd like to enable bcache on already existing ext4 partition. AFAICT
> > > normal situation, even on the backing device, is:
> > > 
> > > | 8KiB bcache superblock | 1KiB reserved | ext4 superblock | 400GB data |
> > > 
> > > Unfortunately, that would mean shifting 400GB data 8KB forward, and
> > > compatibility problems. So I'd prefer adding bcache superblock into
> > > the reserved space, so I can have caching _and_ compatibility with
> > > grub2 etc (and avoid 400GB move):
> > 
> > The common way to do that is to move the beginning of the partition,
> > assuming your ext4 lives in a partition.
> 
> Well... if I move the partition, grub2 (etc) will be unable to access
> data on it. (Plus I do not have free space before some of the
> partitions I'd like to be cached).

Why not use dm-linear and prepend space for the bcache superblock?  If 
this is your boot device, then you would need to write a custom 
initrd hook too.

Note that if bcache comes up without its cache, you will need to:
echo 1 > /sys/block/sdX/bcache/running

This is of course unsafe with writeback but should be fine with 
writethrough.
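
A minimal sketch of that dm-linear approach (untested; /dev/sdz1, /dev/sda4, 
and the sector count are placeholder examples -- it only generates and 
prints the table, it does not load anything):

```shell
# Build a dm-linear table that prepends 16 sectors (8 KiB) of fresh space
# for the bcache superblock in front of an existing ext4 partition.
SB_DEV=/dev/sdz1         # example: small scratch device for the superblock
DATA_DEV=/dev/sda4       # example: the existing ext4 partition
DATA_SECTORS=838860800   # example: size of $DATA_DEV in 512-byte sectors

# Sectors 0-15 map to the scratch device, the rest to the ext4 partition.
cat > /tmp/bcache-linear.table <<EOF
0 16 linear $SB_DEV 0
16 $DATA_SECTORS linear $DATA_DEV 0
EOF
cat /tmp/bcache-linear.table

# Once the table is verified against your real layout, as root:
#   dmsetup create bcache-backing /tmp/bcache-linear.table
#   make-bcache -B /dev/mapper/bcache-backing
```

make-bcache would then write its superblock into the prepended 8 KiB, 
leaving the ext4 data untouched at sector offset 16.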


--
Eric Wheeler


> 
> > I don't see how overlapping the ext4 and the bcache backing device
> > starts would give you what you want, because bcache assumes the
> > backing device data starts with an offset.
> 
> My plan is to make offset 0. AFAICT bcache superblock can be shrunk.
> 
>   Pavel
> -- 
> (english) http://www.livejournal.com/~pavelmachek
> (cesky, pictures) 
> http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
> 


Re: [PULL] bcache: based on for-4.10/block, multiple updates

2017-02-27 Thread Eric Wheeler
On Wed, 14 Dec 2016, Jens Axboe wrote:
> On 12/14/2016 06:50 PM, Eric Wheeler wrote:
> > Hi Jens,
> > 
> > I know you're busy, so when you get a moment:
> > 
> > I've not yet seen your ack/nack on this yet and I want to make sure it 
> > gets in before the merge window closes for v4.10.  I rebased it on 
> > for-4.10/block as you asked so it's tested and ready to go as of 12/6 
> > unless you see something that I've missed.
> 
> Pulled for 4.10, sorry about the delay.

Hi Jens,

Checking in:

Can you get this in for 4.11 since it was missed for 4.10 and we're not in 
4.11-rc1 yet?

Here it is again:

  https://bitbucket.org/ewheelerinc/linux.git for-4.10-block-bcache-updates


--
Eric Wheeler


> 
> -- 
> Jens Axboe
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-bcache" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 


Re: [PULL] bcache: based on for-4.10/block, multiple updates

2017-01-31 Thread Eric Wheeler
On Tue, 31 Jan 2017, Jens Axboe wrote:
> On 01/31/2017 11:14 AM, Eric Wheeler wrote:
> > On Wed, 14 Dec 2016, Jens Axboe wrote:
> >> On 12/14/2016 06:50 PM, Eric Wheeler wrote:
> >>> Hi Jens,
> >>>
> >>> I know you're busy, so when you get a moment:
> >>>
> >>> I've not yet seen your ack/nack on this yet and I want to make sure it 
> >>> gets in before the merge window closes for v4.10.  I rebased it on 
> >>> for-4.10/block as you asked so it's tested and ready to go as of 12/6 
> >>> unless you see something that I've missed.
> >>
> >> Pulled for 4.10, sorry about the delay.
> > 
> > Hi Jens,
> > 
> > Looking at 4.10-rc6, I see these two commits were merged:
> > 
> >>   bcache: partition support: add 16 minors per bcacheN device
> >>   bcache: Make gc wakeup sane, remove set_task_state()
> > 
> > but these were in the same 'for-4.10-block-bcache-updates' and don't
> > appear to have been applied:
> > 
> >>   bcache: introduce per-process ioprio-based bypass/writeback hints
> >>   bcache: documentation for ioprio cache hinting
> >>   bcache: update bio->bi_opf bypass/writeback REQ_ flag hints
> > 
> > Did I miss an email that caused their exclusion?
> > 
> > Did these get bumped to 4.11 for some reason?
> 
> I thought I sent out that reply, maybe I didn't. I dropped the
> last three for two reasons:
> 
> 1) They happened after the merge window, and it's really a feature
>addon that should have gone in earlier
  
Onto 4.11 then.   Hopefully this is early enough :)

> 
> 2) There seemed to be confusion and discussion around the interface
>that needed to be resolved first.

None that I'm aware of.  This is a bcache-only feature and doesn't affect 
non-bcache users. In fact, people are using it as a patch with good 
results since they can control their erase-block churn.  Kent Overstreet 
acked it, too.

Can you pull this for 4.11?

Thank you for your help!

--
Eric Wheeler



> 
> Hope that explains it.
> 
> -- 
> Jens Axboe
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-bcache" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 


Re: [PULL] bcache: based on for-4.10/block, multiple updates

2017-01-31 Thread Eric Wheeler
On Wed, 14 Dec 2016, Jens Axboe wrote:
> On 12/14/2016 06:50 PM, Eric Wheeler wrote:
> > Hi Jens,
> > 
> > I know you're busy, so when you get a moment:
> > 
> > I've not yet seen your ack/nack on this yet and I want to make sure it 
> > gets in before the merge window closes for v4.10.  I rebased it on 
> > for-4.10/block as you asked so it's tested and ready to go as of 12/6 
> > unless you see something that I've missed.
> 
> Pulled for 4.10, sorry about the delay.

Hi Jens,

Looking at 4.10-rc6, I see these two commits were merged:

>   bcache: partition support: add 16 minors per bcacheN device
>   bcache: Make gc wakeup sane, remove set_task_state()

but these were in the same 'for-4.10-block-bcache-updates' and don't
appear to have been applied:

>   bcache: introduce per-process ioprio-based bypass/writeback hints
>   bcache: documentation for ioprio cache hinting
>   bcache: update bio->bi_opf bypass/writeback REQ_ flag hints

Did I miss an email that caused their exclusion?

Did these get bumped to 4.11 for some reason?

Original pull request is below for your reference.

Thank you for your help!

--
Eric Wheeler

[...] are available in the git repository at:

  https://bitbucket.org/ewheelerinc/linux.git for-4.10-block-bcache-updates

for you to fetch changes up to 951c0a8cc489aca1c68d80e5aaa576cd7e511191:

  bcache: partition support: add 16 minors per bcacheN device (2016-12-06 
10:13:24 -0800)

----
Eric Wheeler (4):
  bcache: introduce per-process ioprio-based bypass/writeback hints
  bcache: documentation for ioprio cache hinting
  bcache: update bio->bi_opf bypass/writeback REQ_ flag hints
  bcache: partition support: add 16 minors per bcacheN device

Kent Overstreet (1):
  bcache: Make gc wakeup sane, remove set_task_state()

 Documentation/bcache.txt  | 80 +++
 drivers/md/bcache/bcache.h|  7 ++--
 drivers/md/bcache/btree.c | 39 +++--
 drivers/md/bcache/btree.h |  3 +-
 drivers/md/bcache/request.c   | 31 +++--
 drivers/md/bcache/super.c |  7 +++-
 drivers/md/bcache/sysfs.c | 71 ++
 drivers/md/bcache/writeback.c |  8 +
 drivers/md/bcache/writeback.h | 27 ++-
 9 files changed, 245 insertions(+), 28 deletions(-)




> 
> -- 
> Jens Axboe
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-bcache" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 


Re: [PULL] bcache: based on for-4.10/block, multiple updates

2016-12-14 Thread Eric Wheeler
Hi Jens,

I know you're busy, so when you get a moment:

I've not yet seen your ack/nack on this yet and I want to make sure it 
gets in before the merge window closes for v4.10.  I rebased it on 
for-4.10/block as you asked so it's tested and ready to go as of 12/6 
unless you see something that I've missed.

Thanks!

-Eric

--
Eric Wheeler

On Tue, 6 Dec 2016, Eric Wheeler wrote:

> Hello Jens,
> 
> Please pull for 4.10:
> 
> Thank you!
> 
> This pull request is based on the git.kernel.dk for-4.10/block branch:
> 
> The following changes since commit 333ba053d145d6f9152f6b0311a345b876f0fed1:
> 
>   lightnvm: transform target get/set bad block (2016-11-29 12:12:51 -0700)
> 
> are available in the git repository at:
> 
>   https://bitbucket.org/ewheelerinc/linux.git for-4.10-block-bcache-updates
> 
> for you to fetch changes up to 951c0a8cc489aca1c68d80e5aaa576cd7e511191:
> 
>   bcache: partition support: add 16 minors per bcacheN device (2016-12-06 
> 10:13:24 -0800)
> 
> --------
> Eric Wheeler (4):
>   bcache: introduce per-process ioprio-based bypass/writeback hints
>   bcache: documentation for ioprio cache hinting
>   bcache: update bio->bi_opf bypass/writeback REQ_ flag hints
>   bcache: partition support: add 16 minors per bcacheN device
> 
> Kent Overstreet (1):
>   bcache: Make gc wakeup sane, remove set_task_state()
> 
>  Documentation/bcache.txt  | 80 
> +++
>  drivers/md/bcache/bcache.h|  7 ++--
>  drivers/md/bcache/btree.c | 39 +++--
>  drivers/md/bcache/btree.h |  3 +-
>  drivers/md/bcache/request.c   | 31 +++--
>  drivers/md/bcache/super.c |  7 +++-
>  drivers/md/bcache/sysfs.c | 71 ++
>  drivers/md/bcache/writeback.c |  8 +
>  drivers/md/bcache/writeback.h | 27 ++-
>  9 files changed, 245 insertions(+), 28 deletions(-)
> --
> To unsubscribe from this list: send the line "unsubscribe linux-bcache" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 


[PULL] bcache: based on for-4.10/block, multiple updates

2016-12-06 Thread Eric Wheeler
Hello Jens,

Please pull for 4.10:

Thank you!

This pull request is based on the git.kernel.dk for-4.10/block branch:

The following changes since commit 333ba053d145d6f9152f6b0311a345b876f0fed1:

  lightnvm: transform target get/set bad block (2016-11-29 12:12:51 -0700)

are available in the git repository at:

  https://bitbucket.org/ewheelerinc/linux.git for-4.10-block-bcache-updates

for you to fetch changes up to 951c0a8cc489aca1c68d80e5aaa576cd7e511191:

  bcache: partition support: add 16 minors per bcacheN device (2016-12-06 
10:13:24 -0800)


Eric Wheeler (4):
  bcache: introduce per-process ioprio-based bypass/writeback hints
  bcache: documentation for ioprio cache hinting
  bcache: update bio->bi_opf bypass/writeback REQ_ flag hints
  bcache: partition support: add 16 minors per bcacheN device

Kent Overstreet (1):
  bcache: Make gc wakeup sane, remove set_task_state()

 Documentation/bcache.txt  | 80 +++
 drivers/md/bcache/bcache.h|  7 ++--
 drivers/md/bcache/btree.c | 39 +++--
 drivers/md/bcache/btree.h |  3 +-
 drivers/md/bcache/request.c   | 31 +++--
 drivers/md/bcache/super.c |  7 +++-
 drivers/md/bcache/sysfs.c | 71 ++
 drivers/md/bcache/writeback.c |  8 +
 drivers/md/bcache/writeback.h | 27 ++-
 9 files changed, 245 insertions(+), 28 deletions(-)


Re: [PATCH 1/2] bcache: Remove redundant set_capacity

2016-12-01 Thread Eric Wheeler
On Thu, 1 Dec 2016, wangyijing wrote:

> 
> >>> It probably is a duplicate set_capacity, but has anyone tested bringing 
> >>> on 
> >>> a writeback volume, and late-attaching the cache volume with this patch 
> >>> applied?
> >>>
> >>> Otherwise stated, is it possible to get the backing device attached 
> >>> without setting the capacity?
> >>
> >> Hi Eric, I tested this case with the following steps; the result is fine, 
> >> the capacity is set.
> >>
> >> [root@38 sys]# make-bcache -B /dev/nvme1n1
> >> UUID:  6758bd42-c226-4de9-a6d5-fb003af63f9f
> >> Set UUID:  2661eadd-79b4-4c56-a2fb-9f8b505aa9fd
> >> version:   1
> >> block_size:1
> >> data_offset:   16
> >> [root@38 sys]# ls /dev/bcache
> >> bcache/  bcache0
> >> [root@38 sys]# fdisk -l
> >> Disk /dev/nvme1n1: 1.8 TiB, 2000398934016 bytes, 3907029168 sectors
> >> Units: sectors of 1 * 512 = 512 bytes
> >> Sector size (logical/physical): 512 bytes / 512 bytes
> >> I/O size (minimum/optimal): 512 bytes / 512 bytes
> >> 
> >> Disk /dev/bcache0: 1.8 TiB, 2000398925824 bytes, 3907029152 sectors
> >> Units: sectors of 1 * 512 = 512 bytes
> >> Sector size (logical/physical): 512 bytes / 512 bytes
> >> I/O size (minimum/optimal): 512 bytes / 512 bytes
> >> 
> >> [root@38 sys]# make-bcache -C /dev/ram0
> >> UUID:  b64a4425-b9c1-4650-9cab-3856410c9566
> >> Set UUID:  a0a31965-a89d-43b6-a5d6-968897abeb7a
> >> version:   0
> >> nbuckets:  1024
> >> block_size:1
> >> bucket_size:   1024
> >> nr_in_set: 1
> >> nr_this_dev:   0
> >> first_bucket:  1
> >> [root@38 sys]# echo a0a31965-a89d-43b6-a5d6-968897abeb7a > 
> >> /sys/block/bcache0/bcache/attach
> >> [root@38 sys]# echo writeback > /sys/block/bcache0/bcache/cache_mode
> >> [root@38 sys]# mount /dev/bcache0 /tmp
> >> [root@38 sys]# cd /tmp/
> >> [root@38 tmp]# fio ~/fio_write.sh
> >> file1: (g=0): rw=randwrite, bs=4K-4K/4K-4K/4K-4K, ioengine=psync, iodepth=1
> >> fio-2.2.8
> >> Starting 1 thread
> >> file1: Laying out IO file(s) (1 file(s) / 128MB)
> >> Jobs: 1 (f=1): [w(1)] [0.0% done] [0KB/177.2MB/0KB /s] [0/45.4K/0 iops] 
> >> [eta 05h:33m:13s]
> >>
> >> Thanks!
> >> Yijing.
> > 
> > I want to make sure that the set_capacity call that happens on cache 
> > attachment is not necessary when a backing device is attached without
> 
> Hi Eric, the set_capacity() removed in this patch happens in 
> cached_dev_init(),
> which is called when registering a backing device. What do you mean by 
> "set_capacity call that happens on cache
> > attachment"?


I'm sorry, you are correct.  I thought this was the cache-dev attachment, 
not the cached-dev attachment.  Looks good.

Reviewed-by: Eric Wheeler <bca...@linux.ewheeler.net>

--
Eric Wheeler

> 
> 
> > its dirty writeback cache since bcache0 is not presented until the cache 
> > attaches in that case.
> 
> I found the bcache0 device is present once we do make-bcache -B /dev/nvme1n1, 
> before attaching the cache set.
> So did I miss something?
> 
> > 
> > Can you also unregister the volume, attach the backing device first, and 
> > then the cache while the cache is dirty to make sure that the size is set 
> > correctly?
> 
> When I unregister the cache device, I find that all the dirty data has been 
> flushed to
> the backing device, so how can I test the case you describe?
> 
> Thanks!
> Yijing.
> 
> > 
> > --
> > Eric Wheeler
> > 
> >>
> >>>
> >>> -Eric
> >>>
> >>>>  dc->disk.disk->queue->backing_dev_info.ra_pages =
> >>>>  max(dc->disk.disk->queue->backing_dev_info.ra_pages,
> >>>>  q->backing_dev_info.ra_pages);
> >>>> -- 
> >>>> 2.5.0
> >>>>
> >>>> --
> >>>> To unsubscribe from this list: send the line "unsubscribe linux-bcache" 
> >>>> in
> >>>> the body of a message to majord...@vger.kernel.org
> >>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> >>>>
> >>>
> >>> .
> >>>
> >>
> >> --
> >> To unsubscribe from this list: send the line "unsubscribe linux-bcache" in
> >> the body of a message to majord...@vger.kernel.org
> >> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
> 


Re: [PATCH 1/2] bcache: Remove redundant set_capacity

2016-11-30 Thread Eric Wheeler
On Wed, 30 Nov 2016, wangyijing wrote:

> 
> 
> On 2016/11/30 4:49, Eric Wheeler wrote:
> > On Fri, 25 Nov 2016, Yijing Wang wrote:
> > 
> >> set_capacity() has been called in bcache_device_init(),
> >> remove the redundant one.
> >>
> >> Signed-off-by: Yijing Wang <wangyij...@huawei.com>
> >> ---
> >>  drivers/md/bcache/super.c | 3 ---
> >>  1 file changed, 3 deletions(-)
> >>
> >> diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c
> >> index 849ad44..b638a16 100644
> >> --- a/drivers/md/bcache/super.c
> >> +++ b/drivers/md/bcache/super.c
> >> @@ -1126,9 +1126,6 @@ static int cached_dev_init(struct cached_dev *dc, 
> >> unsigned block_size)
> >>if (ret)
> >>return ret;
> >>  
> >> -  set_capacity(dc->disk.disk,
> >> -   dc->bdev->bd_part->nr_sects - dc->sb.data_offset);
> >> -
> > 
> > It probably is a duplicate set_capacity, but has anyone tested bringing on 
> > a writeback volume, and late-attaching the cache volume with this patch 
> > applied?
> > 
> > Otherwise stated, is it possible to get the backing device attached 
> > without setting the capacity?
> 
> Hi Eric, I tested this case with the following steps; the result is fine, 
> and the capacity is set correctly.
> 
> [root@38 sys]# make-bcache -B /dev/nvme1n1
> UUID: 6758bd42-c226-4de9-a6d5-fb003af63f9f
> Set UUID: 2661eadd-79b4-4c56-a2fb-9f8b505aa9fd
> version:  1
> block_size:   1
> data_offset:  16
> [root@38 sys]# ls /dev/bcache
> bcache/  bcache0
> [root@38 sys]# fdisk -l
> Disk /dev/nvme1n1: 1.8 TiB, 2000398934016 bytes, 3907029168 sectors
> Units: sectors of 1 * 512 = 512 bytes
> Sector size (logical/physical): 512 bytes / 512 bytes
> I/O size (minimum/optimal): 512 bytes / 512 bytes
> 
> Disk /dev/bcache0: 1.8 TiB, 2000398925824 bytes, 3907029152 sectors
> Units: sectors of 1 * 512 = 512 bytes
> Sector size (logical/physical): 512 bytes / 512 bytes
> I/O size (minimum/optimal): 512 bytes / 512 bytes
> 
> [root@38 sys]# make-bcache -C /dev/ram0
> UUID: b64a4425-b9c1-4650-9cab-3856410c9566
> Set UUID: a0a31965-a89d-43b6-a5d6-968897abeb7a
> version:  0
> nbuckets: 1024
> block_size:   1
> bucket_size:  1024
> nr_in_set:1
> nr_this_dev:  0
> first_bucket: 1
> [root@38 sys]# echo a0a31965-a89d-43b6-a5d6-968897abeb7a > 
> /sys/block/bcache0/bcache/attach
> [root@38 sys]# echo writeback > /sys/block/bcache0/bcache/cache_mode
> [root@38 sys]# mount /dev/bcache0 /tmp
> [root@38 sys]# cd /tmp/
> [root@38 tmp]# fio ~/fio_write.sh
> file1: (g=0): rw=randwrite, bs=4K-4K/4K-4K/4K-4K, ioengine=psync, iodepth=1
> fio-2.2.8
> Starting 1 thread
> file1: Laying out IO file(s) (1 file(s) / 128MB)
> Jobs: 1 (f=1): [w(1)] [0.0% done] [0KB/177.2MB/0KB /s] [0/45.4K/0 iops] [eta 
> 05h:33m:13s]
> 
> Thanks!
> Yijing.

I want to make sure that the set_capacity call that happens on cache 
attachment is not necessary when a backing device is attached without 
its dirty writeback cache, since bcache0 is not present until the cache 
attaches in that case.

Can you also unregister the volume, attach the backing device first, and 
then the cache while the cache is dirty to make sure that the size is set 
correctly?

--
Eric Wheeler

> 
> > 
> > -Eric
> > 
> >>dc->disk.disk->queue->backing_dev_info.ra_pages =
> >>max(dc->disk.disk->queue->backing_dev_info.ra_pages,
> >>q->backing_dev_info.ra_pages);
> >> -- 
> >> 2.5.0
> >>
> >> --
> >> To unsubscribe from this list: send the line "unsubscribe linux-bcache" in
> >> the body of a message to majord...@vger.kernel.org
> >> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> >>
> > 
> > .
> > 
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-bcache" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 


Re: [PATCH 1/2] bcache: Remove redundant set_capacity

2016-11-29 Thread Eric Wheeler
On Fri, 25 Nov 2016, Yijing Wang wrote:

> set_capacity() has been called in bcache_device_init(),
> remove the redundant one.
> 
> Signed-off-by: Yijing Wang 
> ---
>  drivers/md/bcache/super.c | 3 ---
>  1 file changed, 3 deletions(-)
> 
> diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c
> index 849ad44..b638a16 100644
> --- a/drivers/md/bcache/super.c
> +++ b/drivers/md/bcache/super.c
> @@ -1126,9 +1126,6 @@ static int cached_dev_init(struct cached_dev *dc, 
> unsigned block_size)
>   if (ret)
>   return ret;
>  
> - set_capacity(dc->disk.disk,
> -  dc->bdev->bd_part->nr_sects - dc->sb.data_offset);
> -

It probably is a duplicate set_capacity, but has anyone tested bringing on 
a writeback volume, and late-attaching the cache volume with this patch 
applied?

Otherwise stated, is it possible to get the backing device attached 
without setting the capacity?

-Eric

>   dc->disk.disk->queue->backing_dev_info.ra_pages =
>   max(dc->disk.disk->queue->backing_dev_info.ra_pages,
>   q->backing_dev_info.ra_pages);
> -- 
> 2.5.0
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-bcache" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 


Re: [PULL] bcache: multiple updates

2016-10-31 Thread Eric Wheeler
On Sun, 30 Oct 2016, Jens Axboe wrote:
> On 10/30/2016 08:00 AM, Kent Overstreet wrote:
> > On Sat, Oct 29, 2016 at 06:32:38PM -0700, Eric Wheeler wrote:
> > > On Thu, 27 Oct 2016, Jens Axboe wrote:
> > >
> > > > On 10/27/2016 05:27 PM, Eric Wheeler wrote:
> > > > > Hi Jens,
> > > > >
> > > > > Please pull this v4.9-rc2 based series of bcache updates for v4.9-rc3:
> > > > > (You may disregard the previous -rc1-based request.)
> > > > >
> > > > >   git pull https://bitbucket.org/ewheelerinc/linux.git
> > > > >   v4.9-rc2-bcache-updates
> > > > >
> > > > > Thank you!
> > > > >
> > > > > --
> > > > > Eric Wheeler
> > > > >
> > > > > ]# git log --oneline v4.9-rc2..HEAD
> > > > > bd532a6 bcache: partition support: add 16 minors per bcacheN device
> > > > > 3312845 bcache: Make gc wakeup sane, remove set_task_state()
> > > > > 6bb7f1e bcache: update bio->bi_opf bypass/writeback REQ_ flag hints
> > > > > 3d58a09 bcache: documentation for ioprio cache hinting
> > > > > 2e8884b bcache: introduce per-process ioprio-based bypass/writeback
> > > > > hints
> > > >
> > > > How many of these are applicable to 4.9-rc3? I took a quick look at
> > > > them, and some of them look like they should go into the 4.10 branch
> > > > instead. We're after the merge window, so only strict fixes. Cleanups
> > > > and no features, no go.
> > >
> > > 3312845 might need to be in 4.9.  Not sure, kent asked me to pick it up in
> > > my next pull request.  4.10 is fine for the rest.
> > >
> > > Kent, Davidlohr, does 3312845 need to land in 4.9 for some reason?
> >
> > No, that can wait until 4.10.
> 
> Great, that's what it looked like.
> 
> Eric, three things. The first is that you based this branch as if it was
> going into 4.9, which means if I pull it into my 4.10 branch, I get a
> ton of extra stuff that has been added to master since. Not a huge
> problem, I can just generate the patches and apply them.
> 
> Secondly, you are depending on REQ_THROTTLED, which in the 4.10 series,
> has been modified:
> 
> commit 8d2bbd4c8236e9e38e6b36ac9e2c54fdcfe5b335
> Author: Christoph Hellwig <h...@lst.de>
> Date:   Thu Oct 20 15:12:12 2016 +0200
> 
> block: replace REQ_THROTTLED with a bio flag
> 
> And lastly, would you mind using the regular git request-pull to
> generate your pull requests? It's what everybody else is using, and
> honestly I think it's a lot more readable than the oneline thing you are
> doing.
> 
> Please resend the series against for-4.10/block and we can get it
> applied, thanks.

Will do, thanks!

-Eric


--
Eric Wheeler


> 
> -- 
> Jens Axboe
> 
> 
> 


Re: [PULL] bcache: multiple updates

2016-10-29 Thread Eric Wheeler
On Thu, 27 Oct 2016, Jens Axboe wrote:

> On 10/27/2016 05:27 PM, Eric Wheeler wrote:
> > Hi Jens,
> >
> > Please pull this v4.9-rc2 based series of bcache updates for v4.9-rc3:
> > (You may disregard the previous -rc1-based request.)
> >
> >   git pull https://bitbucket.org/ewheelerinc/linux.git
> >   v4.9-rc2-bcache-updates
> >
> > Thank you!
> >
> > --
> > Eric Wheeler
> >
> > ]# git log --oneline v4.9-rc2..HEAD
> > bd532a6 bcache: partition support: add 16 minors per bcacheN device
> > 3312845 bcache: Make gc wakeup sane, remove set_task_state()
> > 6bb7f1e bcache: update bio->bi_opf bypass/writeback REQ_ flag hints
> > 3d58a09 bcache: documentation for ioprio cache hinting
> > 2e8884b bcache: introduce per-process ioprio-based bypass/writeback hints
> 
> How many of these are applicable to 4.9-rc3? I took a quick look at
> them, and some of them look like they should go into the 4.10 branch
> instead. We're after the merge window, so only strict fixes. Cleanups
> and no features, no go.

3312845 might need to be in 4.9.  Not sure; Kent asked me to pick it up in 
my next pull request.  4.10 is fine for the rest.

Kent, Davidlohr, does 3312845 need to land in 4.9 for some reason?

See this thread: 
Re: ciao set_task_state() (was Re: [PATCH -v4 6/8] locking/mutex: 
Restructure wait loop)
http://www.gossamer-threads.com/lists/linux/kernel/2550193  


--
Eric Wheeler



> 
> -- 
> Jens Axboe
> 
> 
> 


Re: ciao set_task_state() (was Re: [PATCH -v4 6/8] locking/mutex: Restructure wait loop)

2016-10-25 Thread Eric Wheeler
On Mon, 24 Oct 2016, Kent Overstreet wrote:
> On Sun, Oct 23, 2016 at 06:57:26PM -0700, Davidlohr Bueso wrote:
> > On Wed, 19 Oct 2016, Peter Zijlstra wrote:
> > 
> > > Subject: sched: Better explain sleep/wakeup
> > > From: Peter Zijlstra 
> > > Date: Wed Oct 19 15:45:27 CEST 2016
> > > 
> > > There were a few questions wrt how sleep-wakeup works. Try and explain
> > > it more.
> > > 
> > > Requested-by: Will Deacon 
> > > Signed-off-by: Peter Zijlstra (Intel) 
> > > ---
> > > include/linux/sched.h |   52 
> > > ++
> > > kernel/sched/core.c   |   15 +++---
> > > 2 files changed, 44 insertions(+), 23 deletions(-)
> > > 
> > > --- a/include/linux/sched.h
> > > +++ b/include/linux/sched.h
> > > @@ -262,20 +262,9 @@ extern char ___assert_task_state[1 - 2*!
> > > #define set_task_state(tsk, state_value)  \
> > >   do {\
> > >   (tsk)->task_state_change = _THIS_IP_;   \
> > > - smp_store_mb((tsk)->state, (state_value));  \
> > > + smp_store_mb((tsk)->state, (state_value));  \
> > >   } while (0)
> > > 
> > > -/*
> > > - * set_current_state() includes a barrier so that the write of 
> > > current->state
> > > - * is correctly serialised wrt the caller's subsequent test of whether to
> > > - * actually sleep:
> > > - *
> > > - *   set_current_state(TASK_UNINTERRUPTIBLE);
> > > - *   if (do_i_need_to_sleep())
> > > - *   schedule();
> > > - *
> > > - * If the caller does not need such serialisation then use 
> > > __set_current_state()
> > > - */
> > > #define __set_current_state(state_value)  \
> > >   do {\
> > >   current->task_state_change = _THIS_IP_; \
> > > @@ -284,11 +273,19 @@ extern char ___assert_task_state[1 - 2*!
> > > #define set_current_state(state_value)\
> > >   do {\
> > >   current->task_state_change = _THIS_IP_; \
> > > - smp_store_mb(current->state, (state_value));\
> > > + smp_store_mb(current->state, (state_value));\
> > >   } while (0)
> > > 
> > > #else
> > > 
> > > +/*
> > > + * @tsk had better be current, or you get to keep the pieces.
> > 
> > That reminds me we were getting rid of the set_task_state() calls. Bcache 
> > was
> > pending, being only user in the kernel that doesn't actually use current; 
> > but
> > instead breaks newly (yet blocked/uninterruptible) created garbage 
> > collection
> > kthread. I cannot figure out why this is done (ie purposely accounting the
> > load avg). Furthermore gc kicks in in very specific scenarios obviously, such
> > as by the allocator task, so I don't see why bcache gc should want to be
> > interruptible.
> > 
> > Kent, Jens, can we get rid of this?
> 
> Here's a patch that just fixes the way gc gets woken up. Eric, you want to 
> take
> this?

Sure, I'll put it up with my -rc2 pull request to Jens.  

A couple of sanity checks (for my understanding at least):

Why does bch_data_insert_start() no longer need to call 
set_gc_sectors(op->c) now that bch_cache_set_alloc and bch_gc_thread do?  

Does bch_cache_set_alloc() even need to call set_gc_sectors since 
bch_gc_thread() does before calling bch_btree_gc?

Also I'm curious, why change invalidate_needs_gc from a bitfield? 

-Eric

> 
> -- >8 --
> Subject: [PATCH] bcache: Make gc wakeup saner
> 
> This lets us ditch a set_task_state() call.
> 
> Signed-off-by: Kent Overstreet 
> ---
>  drivers/md/bcache/bcache.h  |  4 ++--
>  drivers/md/bcache/btree.c   | 39 ---
>  drivers/md/bcache/btree.h   |  3 +--
>  drivers/md/bcache/request.c |  4 +---
>  drivers/md/bcache/super.c   |  2 ++
>  5 files changed, 26 insertions(+), 26 deletions(-)
> 
> diff --git a/drivers/md/bcache/bcache.h b/drivers/md/bcache/bcache.h
> index 6b420a55c7..c3ea03c9a1 100644
> --- a/drivers/md/bcache/bcache.h
> +++ b/drivers/md/bcache/bcache.h
> @@ -425,7 +425,7 @@ struct cache {
>* until a gc finishes - otherwise we could pointlessly burn a ton of
>* cpu
>*/
> - unsignedinvalidate_needs_gc:1;
> + unsignedinvalidate_needs_gc;
>  
>   booldiscard; /* Get rid of? */
>  
> @@ -593,8 +593,8 @@ struct cache_set {
>  
>   /* Counts how many sectors bio_insert has added to the cache */
>   atomic_tsectors_to_gc;
> + wait_queue_head_t   gc_wait;
>  
> - wait_queue_head_t   moving_gc_wait;
>   struct keybuf   moving_gc_keys;
>   /* Number of moving GC bios in flight */
>   struct semaphoremoving_in_flight;
> diff --git a/drivers/md/bcache/btree.c 

Re: ciao set_task_state() (was Re: [PATCH -v4 6/8] locking/mutex: Restructure wait loop)

2016-10-25 Thread Eric Wheeler
On Mon, 24 Oct 2016, Kent Overstreet wrote:
> On Sun, Oct 23, 2016 at 06:57:26PM -0700, Davidlohr Bueso wrote:
> > On Wed, 19 Oct 2016, Peter Zijlstra wrote:
> > 
> > > Subject: sched: Better explain sleep/wakeup
> > > From: Peter Zijlstra 
> > > Date: Wed Oct 19 15:45:27 CEST 2016
> > > 
> > > There were a few questions wrt how sleep-wakeup works. Try and explain
> > > it more.
> > > 
> > > Requested-by: Will Deacon 
> > > Signed-off-by: Peter Zijlstra (Intel) 
> > > ---
> > > include/linux/sched.h |   52 
> > > ++
> > > kernel/sched/core.c   |   15 +++---
> > > 2 files changed, 44 insertions(+), 23 deletions(-)
> > > 
> > > --- a/include/linux/sched.h
> > > +++ b/include/linux/sched.h
> > > @@ -262,20 +262,9 @@ extern char ___assert_task_state[1 - 2*!
> > > #define set_task_state(tsk, state_value)  \
> > >   do {\
> > >   (tsk)->task_state_change = _THIS_IP_;   \
> > > - smp_store_mb((tsk)->state, (state_value));  \
> > > + smp_store_mb((tsk)->state, (state_value));  \
> > >   } while (0)
> > > 
> > > -/*
> > > - * set_current_state() includes a barrier so that the write of 
> > > current->state
> > > - * is correctly serialised wrt the caller's subsequent test of whether to
> > > - * actually sleep:
> > > - *
> > > - *   set_current_state(TASK_UNINTERRUPTIBLE);
> > > - *   if (do_i_need_to_sleep())
> > > - *   schedule();
> > > - *
> > > - * If the caller does not need such serialisation then use 
> > > __set_current_state()
> > > - */
> > > #define __set_current_state(state_value)  \
> > >   do {\
> > >   current->task_state_change = _THIS_IP_; \
> > > @@ -284,11 +273,19 @@ extern char ___assert_task_state[1 - 2*!
> > > #define set_current_state(state_value)\
> > >   do {\
> > >   current->task_state_change = _THIS_IP_; \
> > > - smp_store_mb(current->state, (state_value));\
> > > + smp_store_mb(current->state, (state_value));\
> > >   } while (0)
> > > 
> > > #else
> > > 
> > > +/*
> > > + * @tsk had better be current, or you get to keep the pieces.
> > 
> > That reminds me we were getting rid of the set_task_state() calls. Bcache
> > was pending, being the only user in the kernel that doesn't actually use
> > current; instead it sets the state of a newly created (and still
> > blocked/uninterruptible) garbage collection kthread. I cannot figure out
> > why this is done (ie purposely accounting the load avg). Furthermore gc
> > kicks in in very specific scenarios obviously, such as by the allocator
> > task, so I don't see why bcache gc should want to be interruptible.
> > 
> > Kent, Jens, can we get rid of this?
> 
> Here's a patch that just fixes the way gc gets woken up. Eric, you want to 
> take
> this?

Sure, I'll put it up with my -rc2 pull request to Jens.  

A couple of sanity checks (for my understanding at least):

Why does bch_data_insert_start() no longer need to call 
set_gc_sectors(op->c) now that bch_cache_set_alloc and bch_gc_thread do?  

Does bch_cache_set_alloc() even need to call set_gc_sectors since 
bch_gc_thread() does before calling bch_btree_gc?

Also I'm curious, why change invalidate_needs_gc from a bitfield? 

-Eric

> 
> -- >8 --
> Subject: [PATCH] bcache: Make gc wakeup saner
> 
> This lets us ditch a set_task_state() call.
> 
> Signed-off-by: Kent Overstreet 
> ---
>  drivers/md/bcache/bcache.h  |  4 ++--
>  drivers/md/bcache/btree.c   | 39 ---
>  drivers/md/bcache/btree.h   |  3 +--
>  drivers/md/bcache/request.c |  4 +---
>  drivers/md/bcache/super.c   |  2 ++
>  5 files changed, 26 insertions(+), 26 deletions(-)
> 
> diff --git a/drivers/md/bcache/bcache.h b/drivers/md/bcache/bcache.h
> index 6b420a55c7..c3ea03c9a1 100644
> --- a/drivers/md/bcache/bcache.h
> +++ b/drivers/md/bcache/bcache.h
> @@ -425,7 +425,7 @@ struct cache {
>* until a gc finishes - otherwise we could pointlessly burn a ton of
>* cpu
>*/
> - unsigned  invalidate_needs_gc:1;
> + unsigned  invalidate_needs_gc;
>  
>   booldiscard; /* Get rid of? */
>  
> @@ -593,8 +593,8 @@ struct cache_set {
>  
>   /* Counts how many sectors bio_insert has added to the cache */
>   atomic_tsectors_to_gc;
> + wait_queue_head_t   gc_wait;
>  
> - wait_queue_head_t   moving_gc_wait;
>   struct keybuf   moving_gc_keys;
>   /* Number of moving GC bios in flight */
>   struct semaphore  moving_in_flight;
> diff --git a/drivers/md/bcache/btree.c b/drivers/md/bcache/btree.c
> index 81d3db40cd..2efdce0724 100644
> --- a/drivers/md/bcache/btree.c
> 

Re: bcachefs: Encryption (Posting for review)

2016-09-18 Thread Eric Wheeler
>  * 128 bit checksum/MAC field
>  * Magic number - identifies a given structure as btree/journal/allocation
>information, for that filesystem
>  * Version number (of on disk format), flags (including checksum/encryption
>type).
>  * Sequence numbers: journal entries have an ascending 64 bit sequence number,
>btree node entries have a random 64 bit sequence number identifying them as
>belonging to that node. Btree nodes also have a field containing the 
> sequence
>number of the most recent journal entry they contain updates from; this is
>stored unencrypted so it can be used as part of the nonce.
>  * Size of the btree node entry/journal entry, in u64s
> 
> Btree node layout information is encrypted; an attacker could tell that a 
> given
> location on disk was a btree node, but the part of the header that indicates
> what range of the keyspace, or which btree ID (extents/dirents/xattrs/etc.), 
> or
> which level of the btree is all encrypted.
> 
>  Metadata nonces
> 
>  * Journal entries use their sequence number - which is unique for a given
>filesystem. When metadata is being replicated and we're doing multiple
>journal writes with the same sequence number - and thus nonce - we really 
> are
>writing the same data (we only checksum once, not once per write).
> 
>  * Btree nodes concatenate a few things for the nonce:
>- A 64 bit random integer, which is generated per btree node (but btree 
> nodes
>  are log structured, so entries within a given btree node share the same
>  integer).
>- A journal sequence number. For btree node writes done at around the same
>  point in time, this field can be identical in unrelated btree node 
> writes -
>  but only for btree nodes writes done relatively close in time, so the
>  journal sequence number plus the previous random integer should be more
>  than sufficient entropy.
>- And lastly the offset within the btree node, so that btree node entries
>  sharing the same random integer are guaranteed a different nonce.
> 
>  * Allocation information (struct prio_set):
>bcache/bcachefs doesn't have allocation information persisted like other
>filesystems, but this is our closest equivalent - this structure mainly
>stores generation numbers that correspond to extent pointers.
> 
>Allocation information uses a dedicated randomly generated 96 bit nonce
>field.
> 
> ### Data
> 
> Data writes have no unencrypted header: checksums/MACs, nonces, etc. are 
> stored
> with the pointers, ZFS style.
> 
> Bcache/bcachefs is extent based, not block based: pointers point to variable
> sized chunks of data, and we store one checksum/MAC per extent, not per 
> block: a
> checksum or MAC might cover up to 64k (extents that aren't checksummed or
> compressed may be larger). Nonces are thus also per extent, not per block.
> 
> Currently, the Poly1305 MAC is truncated to 64 bits - due to a desire not to
> inflate our metadata any more than necessary. Guidance from cryptographers is
> requested as to whether this is a reasonable option; do note that the MAC is 
> not
> stored with the data, but is itself stored encrypted elsewhere in the btree. 

Encrypted or not, MACs are already only as strong as sqrt(2^n) because of 
birthday paradox collisions.  Truncating to 64 bits makes a collision 
likely after roughly 2^32 MAC values, which is too weak in my opinion.  
Lost space or no, it is definitely better to keep the MAC the same size 
as the output of the MACing function (or of the hash, if HMAC).
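To put numbers on the birthday bound: for an n-bit tag, the probability of at least one collision among k random values is roughly 1 - exp(-k(k-1)/2^(n+1)). A quick sketch of that arithmetic (plain Python, purely illustrative):

```python
import math

def collision_probability(n_bits: int, k: float) -> float:
    """Birthday bound: probability of at least one collision among
    k uniformly random n-bit values, 1 - exp(-k*(k-1) / 2^(n+1))."""
    return -math.expm1(-k * (k - 1) / 2.0 ** (n_bits + 1))

# 64-bit MAC after 2^32 tags: collision probability is already ~39%,
# matching the sqrt(2^n) birthday estimate above.
print(collision_probability(64, 2.0 ** 32))   # ~0.393

# Full 128-bit MAC after the same 2^32 tags: ~2.7e-20.
print(collision_probability(128, 2.0 ** 32))
```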

Another note on MACs:  Using Poly1305 sounds great, but what if it is 
found to be weakened or broken or deprecated in the future? 

IMO, it would be best to support MACs as HMACs and allow the user to specify 
the hash function using the existing kernel crypto library.  The RFC 2104 
HMAC construction is as follows (|| means concatenation, ^ is XOR, and 
ipad/opad are fixed padding constants):

HMAC_{HASH,key}(data) = HASH((key ^ opad) || HASH((key ^ ipad) || data))
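For reference, a minimal sketch of the RFC 2104 HMAC construction, using SHA-256 as the HASH purely for illustration; Python's stdlib `hmac` module computes the same function, which makes it easy to check:

```python
import hashlib
import hmac

def hmac_sha256(key: bytes, data: bytes) -> bytes:
    """RFC 2104: HMAC(K, m) = H((K ^ opad) || H((K ^ ipad) || m))."""
    block = 64                                # SHA-256 block size, in bytes
    if len(key) > block:
        key = hashlib.sha256(key).digest()    # long keys are hashed first
    key = key.ljust(block, b"\x00")           # then zero-padded to one block
    ipad = bytes(b ^ 0x36 for b in key)
    opad = bytes(b ^ 0x5C for b in key)
    inner = hashlib.sha256(ipad + data).digest()
    return hashlib.sha256(opad + inner).digest()

key, msg = b"secret key", b"btree node payload"
assert hmac_sha256(key, msg) == hmac.new(key, msg, hashlib.sha256).digest()
```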


Of course you are welcome to support all of the above and let the user 
decide on the MAC function (poly1305 vs HMAC_{HASH,key}) and crypto 
function (counter mode + block cipher, stream ciphers including chacha*).

It is appealing to have super-fast chacha20+poly1305---and equally so to 
have the flexibility and breadth of the existing crypto suite using well 
tested crypto constructs like counter-mode and HMAC.


--
Eric Wheeler

> We do already have different fields for storing 4 byte checksums and 8 
> byte checksums; it will be a trivial matter to add a field allowing 16 
> byte checksums to be stored, and we will add that anyways - so this 
> isn't a pressing design issue, this is just a matter of what the 
> defaults should be and what we should tell users.
> 
>  Extent nonces
> 
> We don't wish to simply add a rand

Re: [PATCH V2 00/22] Replace the CFQ I/O Scheduler with BFQ

2016-09-01 Thread Eric Wheeler
On Wed, 31 Aug 2016, Mark Brown wrote:
[...]
> I personally feel that given that it looks like this is all going to
> take a while it'd still be good to merge BFQ at least as an alternative
> scheduler so that people can take advantage of it while the work on
> modernising everything to use blk-mq - that way we can hopefully improve
> the state of the art for users in the short term or at least help get
> some wider feedback on how well this works in the real world
> independently of the work on blk-mq.

I would like to chime in and agree fervently with Mark.  

We have a pair of very busy hypervisors with a complicated block stack 
integrating bcache, drbd, LVM, dm-thin, kvm, ggaoed (AoE target), zram 
swap, continuous block-layer backups and snapshot verifies to tertiary 
storage, cgroup block IO throttled limits, and lots of hourly dm-thin 
snapshots replicated to tertiary storage.  All of this is performed under 
heavy memory pressure (35-40% swapped out to zram).

The systems work moderately well under cfq, but *amazingly well* using 
BFQ.  I like BFQ so much that I've backported v8r2 to Linux v4.1 [1].

+1 to upstream this as a new scheduler without replacing CFQ.

Including BFQ would be a boon for Linux and the community at large.

--
Eric Wheeler

[1] Based on Linux v4.1-rc1, it cleanly merges forward into v4.7:
https://bitbucket.org/ewheelerinc/linux/branch/v4.1-rc1-bfq-v8
git pull https://bitbucket.org/ewheelerinc/linux.git v4.1-rc1-bfq-v8


Re: [PATCH v3] block: make sure big bio is splitted into at most 256 bvecs

2016-08-18 Thread Eric Wheeler
> On Mon, Aug 15, 2016 at 11:23:28AM -0700, Christoph Hellwig wrote:
> > On Mon, Aug 15, 2016 at 11:11:22PM +0800, Ming Lei wrote:
> > > After arbitrary bio size is supported, the incoming bio may
> > > be very big. We have to split the bio into small bios so that
> > > each holds at most BIO_MAX_PAGES bvecs for safety reason, such
> > > as bio_clone().
> > 
> > I still think working around a rough driver submitting too large
> > I/O is a bad thing until we've done a full audit of all consuming
> > bios through ->make_request, and we've enabled it for the common
> > path as well.
> 
> bcache originally had workaround code to split too-large bios when it 
> first went upstream - that was dropped only after the patches to make 
> generic_make_request() handle arbitrary size bios went in. So to do what 
> you're suggesting would mean reverting that bcache patch and bringing 
> that code back, which from my perspective would be a step in the wrong 
> direction. I just want to get this over and done with.
> 
> > 
> > >   bool do_split = true;
> > >   struct bio *new = NULL;
> > >   const unsigned max_sectors = get_max_io_size(q, bio);
> > > + unsigned bvecs = 0;
> > > +
> > > + *no_merge = true;
> > >  
> > >   bio_for_each_segment(bv, bio, iter) {
> > >   /*
> > > +  * With arbitrary bio size, the incoming bio may be very
> > > +  * big. We have to split the bio into small bios so that
> > > +  * each holds at most BIO_MAX_PAGES bvecs because
> > > +  * bio_clone() can fail to allocate big bvecs.
> > > +  *
> > > +  * It should have been better to apply the limit per
> > > +  * request queue in which bio_clone() is involved,
> > > +  * instead of globally. The biggest blocker is
> > > +  * bio_clone() in bio bounce.
> > > +  *
> > > +  * If bio is splitted by this reason, we should allow
> > > +  * to continue bios merging.
> > > +  *
> > > +  * TODO: deal with bio bounce's bio_clone() gracefully
> > > +  * and convert the global limit into per-queue limit.
> > > +  */
> > > + if (bvecs++ >= BIO_MAX_PAGES) {
> > > + *no_merge = false;
> > > + goto split;
> > > + }
> > 
> > That being said this simple if check here is simple enough that it's
> > probably fine.  But I see no need to uglify the whole code path
> > with that no_merge flag.  Please drop if for now, and if we start
> > caring for this path in common code we should just move the
> > REQ_NOMERGE setting into the actual blk_bio_*_split helpers.
> 
> Agreed about the no_merge thing.

By removing `no_merge`, this patch should cherry-pick into stable v4.3+ 
without merge conflicts, since it also avoids interference from the bi_rw 
refactor.

Ming, can you send out a V4 without `no_merge`?
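As a toy illustration (Python; the names are illustrative, not kernel API), the splitting rule in the quoted hunk amounts to chunking the bio's segment list at BIO_MAX_PAGES:

```python
BIO_MAX_PAGES = 256  # max bvecs a single bio may carry

def split_bio(segments):
    """Chunk a bio's segment list so each child bio holds at most
    BIO_MAX_PAGES segments -- the `if (bvecs++ >= BIO_MAX_PAGES)
    goto split;` check from the quoted hunk, in list form."""
    return [segments[i:i + BIO_MAX_PAGES]
            for i in range(0, len(segments), BIO_MAX_PAGES)]

# A 600-segment bio becomes three children of 256 + 256 + 88 segments,
# each small enough for bio_clone() to handle.
print([len(c) for c in split_bio(list(range(600)))])  # [256, 256, 88]
```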

--
Eric Wheeler





Re: [PATCH v2] block: make sure big bio is splitted into at most 256 bvecs

2016-08-11 Thread Eric Wheeler
On Fri, 10 Jun 2016, Christoph Hellwig wrote: 
> On Wed, 6 Apr 2016, Ming Lei wrote:
> >
> > After arbitrary bio size is supported, the incoming bio may
> > be very big. We have to split the bio into small bios so that
> > each holds at most BIO_MAX_PAGES bvecs for safety reason, such
> > as bio_clone().
> > 
> > This patch fixes the following kernel crash:
> > [  172.664813]  [] ? raid1_make_request+0x2e8/0xad7 
> > [raid1]
> > [  172.664846]  [] ? blk_queue_split+0x377/0x3d4
> > [  172.664880]  [] ? md_make_request+0xf6/0x1e9 [md_mod]
> > [  172.664912]  [] ? generic_make_request+0xb5/0x155
> > [  172.664947]  [] ? prio_io+0x85/0x95 [bcache]
>
> The fixup to allow bio_clone support a larger size is the same one as to 
> allow everyone else submitting larger bios:  increase BIO_MAX_PAGES and 
> create the required mempools to back that new larger size.  Or just go 
> for multipage biovecs..

Hi Christoph, Ming, everyone:

I'm hoping you can help me get this off of a list of stability fixes 
related to changes around Linux 4.3.  Ming's patch [1] is known to fix an 
issue when a bio with bi_vcnt > BIO_MAX_PAGES is passed to 
generic_make_request and later hits bio_clone.  (Note that bi_vcnt can't 
be trusted since the introduction of immutable biovecs; it needs to be 
re-counted unless you own the bio, which Ming's patch does.)

The diffstat, 22 lines of which are commentary, 
seems relatively minor and would land in stable for v4.3+:
 block/blk-merge.c | 35 ---

I'm not sure I understood Christoph's suggestion; BIO_MAX_PAGES is a 
static #define and we don't know what the bi_vcnt from an arbitrary 
driver might be.  Wouldn't increasing BIO_MAX_PAGES just push the problem 
further out into the future when bi_vcnt might again exceed BIO_MAX_PAGES?  

Perhaps you could elaborate if I have misunderstood. Are you suggesting 
that no driver should call generic_make_request when bi_vcnt > BIO_MAX_PAGES?
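To illustrate the bi_vcnt caveat with a toy model (all names hypothetical, not kernel structures): with immutable biovecs a split bio shares the parent's bvec array and only advances an iterator, so the array length no longer tells you how many segments the bio actually covers:

```python
from dataclasses import dataclass

@dataclass
class ToyBio:
    bvecs: list        # shared, never-trimmed vec array ("immutable biovecs")
    iter_idx: int      # first segment this bio covers (like bi_iter)
    iter_count: int    # segments remaining for this bio

def segment_count(bio: ToyBio) -> int:
    """Count segments by walking the iterator, the way
    bio_for_each_segment() does, instead of trusting len(bio.bvecs)."""
    return sum(1 for _ in range(bio.iter_idx, bio.iter_idx + bio.iter_count))

parent = ToyBio(bvecs=list(range(512)), iter_idx=0, iter_count=512)
# A split advances the iterator but keeps the same big vec array:
child = ToyBio(bvecs=parent.bvecs, iter_idx=256, iter_count=256)
assert len(child.bvecs) == 512      # the "bi_vcnt" view: misleadingly large
assert segment_count(child) == 256  # the recounted, trustworthy value
```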


--
Eric Wheeler

[1] https://patchwork.kernel.org/patch/9169483/
Pasted below:



After arbitrary bio size is supported, the incoming bio may
be very big. We have to split the bio into small bios so that
each holds at most BIO_MAX_PAGES bvecs for safety reason, such
as bio_clone().

This patch fixes the following kernel crash:

> [  172.660142] BUG: unable to handle kernel NULL pointer dereference at 
> 0028
> [  172.660229] IP: [] bio_trim+0xf/0x2a
> [  172.660289] PGD 7faf3e067 PUD 7f9279067 PMD 0
> [  172.660399] Oops:  [#1] SMP
> [...]
> [  172.664780] Call Trace:
> [  172.664813]  [] ? raid1_make_request+0x2e8/0xad7 [raid1]
> [  172.664846]  [] ? blk_queue_split+0x377/0x3d4
> [  172.664880]  [] ? md_make_request+0xf6/0x1e9 [md_mod]
> [  172.664912]  [] ? generic_make_request+0xb5/0x155
> [  172.664947]  [] ? prio_io+0x85/0x95 [bcache]
> [  172.664981]  [] ? register_cache_set+0x355/0x8d0 [bcache]
> [  172.665016]  [] ? register_bcache+0x1006/0x1174 [bcache]

The issue can be reproduced by the following steps:
- create one raid1 over two virtio-blk
- build bcache device over the above raid1 and another cache device
and bucket size is set as 2Mbytes
- set cache mode as writeback
- run random write over ext4 on the bcache device

Fixes: 54efd50 ("block: make generic_make_request handle arbitrarily sized bios")
Reported-by: Sebastian Roesner <sroesner-kernel...@roesner-online.de>
Reported-by: Eric Wheeler <bca...@lists.ewheeler.net>
Cc: sta...@vger.kernel.org (4.3+)
Cc: Shaohua Li <s...@fb.com>
Acked-by: Kent Overstreet <kent.overstr...@gmail.com>
Signed-off-by: Ming Lei <ming@canonical.com>
---
V2:
- don't mark as REQ_NOMERGE in case the bio is splitted
for reaching the limit of bvecs count
V1:
- Kent pointed out that using max io size can't cover
the case of non-full bvecs/pages
 block/blk-merge.c | 35 ---
 1 file changed, 32 insertions(+), 3 deletions(-)

diff --git a/block/blk-merge.c b/block/blk-merge.c
index c265348..839529b 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -85,7 +85,8 @@  static inline unsigned get_max_io_size(struct request_queue 
*q,
 static struct bio *blk_bio_segment_split(struct request_queue *q,
 struct bio *bio,
 struct bio_set *bs,
-unsigned *segs)
+unsigned *segs,
+bool *no_merge)
 {
struct bio_vec bv, bvprv, *bvprvp = NULL;
struct bvec_iter iter;
@@ -94,9 +95,34 @@  static struct bio *blk_bio_segment_split(struct 
request_queue *q,
bool do_split = true;
struct bio *new = NULL;
const unsigned max_sectors = get_max_io_size(

Re: [PATCH v2 1/1] block: fix blk_queue_split() resource exhaustion

2016-08-10 Thread Eric Wheeler
Re: [PATCH v2 1/1] block: fix blk_queue_split() resource exhaustion

2016-08-10 Thread Eric Wheeler
On Tue, 19 Jul 2016, Lars Ellenberg wrote:
> On Tue, Jul 12, 2016 at 10:32:33PM -0400, Mike Snitzer wrote:
> > On Tue, Jul 12 2016 at 10:18pm -0400,
> > Eric Wheeler  wrote:
> > 
> > > On Tue, 12 Jul 2016, NeilBrown wrote:
> > > 
> > > > On Tue, Jul 12 2016, Lars Ellenberg wrote:
> > > > 
> > > > >
> > > > > Instead, I suggest to distinguish between recursive calls to
> > > > > generic_make_request(), and pushing back the remainder part in
> > > > > blk_queue_split(), by pointing current->bio_lists to a
> > > > >   struct recursion_to_iteration_bio_lists {
> > > > >   struct bio_list recursion;
> > > > >   struct bio_list queue;
> > > > >   }
> > > > >
> > > > > By providing each q->make_request_fn() with an empty "recursion"
> > > > > bio_list, then merging any recursively submitted bios to the
> > > > > head of the "queue" list, we can make the recursion-to-iteration
> > > > > logic in generic_make_request() process deepest level bios first,
> > > > > and "sibling" bios of the same level in "natural" order.
> > > > >
> > > > > Signed-off-by: Lars Ellenberg 
> > > > > Signed-off-by: Roland Kammerer 
> > > > 
> > > > Reviewed-by: NeilBrown 
> > > > 
> > > > Thanks again for doing this - I think this is a very significant
> > > > improvement and could allow other simplifications.
> > > 
> > > Thank you Lars for all of this work!  
> > > 
> > > It seems like there have been many 4.3+ blockdev stacking issues and this 
> > > will certainly address some of those (maybe all of them?).  (I think we 
> > > hit this while trying drbd in 4.4 so we dropped back to 4.1 without 
> > > issue.)  It would be great to hear 4.4.y stable pick this up if 
> > > compatible.
> > > 
> > > 
> > > Do you believe that this patch would solve any of the proposals by others 
> > > since 4.3 related to bio splitting/large bios?  I've been collecting a 
> > > list, none of which appear to have landed yet as of 4.7-rc7 (but correct me 
> > > if I'm wrong):

[... cut unrelated A,B ... ]

> > > C.  [1/3] block: flush queued bios when process blocks to avoid deadlock
> > >   by Mikulas Patocka: https://patchwork.kernel.org/patch/9204125/
> > >   (was https://patchwork.kernel.org/patch/7398411/)
> 
> As it stands now, this is yet another issue, but related.
> 
> From the link above:
> 
> | ** Here is the dm-snapshot deadlock that was observed:
> | 
> | 1) Process A sends one-page read bio to the dm-snapshot target. The bio
> | spans snapshot chunk boundary and so it is split to two bios by device
> | mapper.
> | 
> | 2) Device mapper creates the first sub-bio and sends it to the snapshot
> | driver.
> | 
> | 3) The function snapshot_map calls track_chunk (that allocates a
> | structure
> | dm_snap_tracked_chunk and adds it to tracked_chunk_hash) and then remaps
> | the bio to the underlying device and exits with DM_MAPIO_REMAPPED.
> | 
> | 4) The remapped bio is submitted with generic_make_request, but it isn't
> | issued - it is added to current->bio_list instead.
> | 
> | 5) Meanwhile, process B (dm's kcopyd) executes pending_complete for the
> | chunk affected by the first remapped bio, it takes down_write(&s->lock)
> | and then loops in __check_for_conflicting_io, waiting for
> | dm_snap_tracked_chunk created in step 3) to be released.
> | 
> | 6) Process A continues, it creates a second sub-bio for the rest of the
> | original bio.
> 
> Aha.
> Here is the relation.
> If "A" had only ever processed "just the chunk it can handle now",
> and "pushed back" the rest of the incoming bio,
> it could rely on all deeper level bios to have been submitted already.
> 
> But this does not look like it easily fits into the current DM model.
> 
> | 7) snapshot_map is called for this new bio, it waits on
> | down_write(&s->lock) that is held by Process B (in step 5).
> 
> There is another suggestion:
> Use down_trylock (or down_timeout),
> and if it fails, push back the currently to-be-processed bio.
> We can introduce a new bio helper for that.
> Kind of what blk_queue_split() does with my patch applied.
> 
> Or even better, ignore the down_trylock suggestion,
> simply not iterate over all pieces first,
> but process one piece, and return back the the
> itera

Re: To add, or not to add, a bio REQ_ROTATIONAL flag

2016-07-31 Thread Eric Wheeler
[+cc from "Enable use of Solid State Hybrid Drives"
https://lkml.org/lkml/2014/10/29/698 ]

On Thu, 28 Jul 2016, Martin K. Petersen wrote:
> >>>>> "Eric" == Eric Wheeler <bca...@lists.ewheeler.net> writes:
> Eric> [...]  This may imply that
> Eric> we need a new way to flag cache bypass from userspace [...]
> Eric> So what are our options?  What might be the best way to do this?
[...] 
> Eric> Are FADV_NOREUSE/FADV_DONTNEED reasonable candidates?
> 
> FADV_DONTNEED was intended for this. There have been patches posted in
> the past that tied the loop between the fadvise flags and the bio. I
> would like to see those revived.

That sounds like a good start, this looks about right from 2014:
https://lkml.org/lkml/2014/10/29/698
https://lwn.net/Articles/619058/

I read through the thread and have summarized the relevant parts here 
with additional commentary below the summary:

/* Summary 

They were seeking to do basically the same thing in 2014 that we want with 
stacked block caching drivers today: hint to the IO layer so the (ATA 3.2) 
driver can decide whether a block should hit the cache or spinning disk.  
This was done by adding bitflags to ioprio for IOPRIO_ADV_ advice.

There are two arguments throughout the thread: one that the cache hint 
should be per-process (ionice) and the other, that hints should be per 
inode via fadvise (and maybe madvise).  Dan Williams noted with respect to 
fadvise for their implementation that "It's straightforward to add, but I 
think "80%" of the benefit can be had by just having a per-thread cache 
priority."

Kapil Karkra extended the page flags so the ioprio advice bits can be 
copied into bio->bi_rw, to which Jens said "is a bit...icky. I see why 
it's done, though, it requires the least amount of plumbing."

Martin K. Petersen provides a matrix of desires for actual use cases here:
https://lkml.org/lkml/2014/10/29/1014 
and asks "Are there actually people asking for sub-file granularity? I 
didn't get any requests for that in the survey I did this summer. [...] In 
any case I thought it was interesting that pretty much every use case that 
people came up with could be adequately described by a handful of I/O 
classes."

Further, Jens notes that "I think we've needed a proper API for passing in 
appropriate hints on a per-io basis for a LONG time. [...] We've tried 
(and failed) in the past to define a set of hints that make sense. It'd be 
a shame to add something that's specific to a given transport/technology. 
That said, this set of hints do seem pretty basic and would not 
necessarily be a bad place to start. But they are still very specific to 
this use case."
*/


So, perhaps it is time to plan the hint API and figure out how to plumb 
it.  These are some design considerations based on the thread:

a. People want per-process cache hinting (ionice, or some other tool).
b. Per inode+range hinting would be useful to some (fadvise, ioctl, etc)
c. Don't use page flags to convey cache hints---or find a clean way to do so.
d. Per IO hints would be useful to both stacking and hardware drivers.
e. Cache layers will implement their own device assignment choice based 
on the caching hint; for example, an IO flagged to miss the cache might 
hit if already in cache due to unrelated IO and such a determination would 
be made per-cache-implementation.


I can see this go two ways:

1. A dedicated implementation for cache hinting.
2. An API for generalized hinting, upon which cache hinting may be 
implemented.

To consider #2, what hinting is wanted from processes and inodes down to 
bio's?  Does it justify an entire API for generalized hinting, or do we 
just need a cache hinting implementation?  If we do want #2, then what are 
all of the features wanted by the community so it can be designed as such?

If #1 is sufficient, then what is the preferred mechanism and 
implementation for cache hinting?

In either direction, how can those hints pass down to bio's in an 
appropriate way (ie, not page flags)?


With the interest of a cache hinting implementation independent of 
transport/technology, I have been playing with an idea to use two per-IO 
"TTL" counters, both of which tend toward zero; I've not yet started an 
implementation:

cacheskip: 
Decrement until zero to skip cache layers (slow medium)
Ignore cachedepth until cacheskip==0.

cachedepth:
Initialize to positive, negative, or zero value.  Once zero, no 
special treatment is given to the IO.  When less than zero, prefer the 
slower medium.  When greater than zero, prefer the faster medium.  
Inc/decrement toward zero each time the IO passes through a 
caching layer.

Independent of how we might apply these counters to a pid/inode, the cache 
layers might look something like this:

To add, or not to add, a bio REQ_ROTATIONAL flag

2016-07-28 Thread Eric Wheeler
Hello all,

With the many SSD caching layers being developed (bcache, dm-cache, 
dm-writeboost, etc), how could we flag a bio from userspace to indicate 
whether the bio is preferred to hit spinning disks instead of an SSD?

Unnecessary promotions, evictions, and writeback increase the write burden 
on the caching layer and burn out SSDs too fast (TBW), thus requiring 
equipment replacement.

Is there already a mechanism for this that could be added to the various 
caching mechanisms' promote/demote/bypass logic?

For example, I would like to prevent backups from influencing the cache 
eviction logic. Neither do I wish to evict cache due to a bio from a 
backup process, nor do I wish a bio from the backup process to be cached 
on the SSD.  


We would want to bypass the cache for IO that is somehow flagged to bypass 
block-layer caches and use the rotational disk unless the referenced block 
already exists on the SSD.

There might be two cases here that would be ideal to unify without 
touching filesystem code:

  1) open() of a block device

  2) open() on a file such that a filesystem must flag the bio

I had considered writing something to detect FADV_SEQUENTIAL/FADV_NOREUSE 
or `ionice -c3` on a process hitting bcache and modifying 
check_should_bypass()/should_writeback() to behave as such.

However, just because FADV_SEQUENTIAL is flagged doesn't mean the cache 
should bypass.  Filesystems can fragment, and while the file being read 
may be read sequentially, the blocks on which it resides may not be.  
Same thing for higher-level block devices such as dm-thinp where one might 
sequentially read a thin volume but its _tdata might not be in linear 
order.  This may imply that we need a new way to flag cache bypass from 
userspace that is neither io-priority nor fadvise driven.

So what are our options?  What might be the best way to do this?

If fadvise is the better option, how can a block device driver lookup the 
fadvise advice from a given bio struct?  Can we add an FADV_NOSSD flag 
since FADV_SEQUENTIAL may be insufficient?  Are FADV_NOREUSE/FADV_DONTNEED 
reasonable candidates?

Perhaps ionice could be used, but the concept of "priority" 
doesn't exactly encompass the concept of cache-bypass---so is something 
else needed?

Other ideas?  


--
Eric Wheeler


Re: [PATCH v2 1/1] block: fix blk_queue_split() resource exhaustion

2016-07-21 Thread Eric Wheeler
[+cc Mikulas Patocka, Jeff Moyer; Do either of you have any input on Lars' 
commentary related to patchwork #'s 9204125 and 7398411 and BZ#119841? ]

On Tue, 19 Jul 2016, Lars Ellenberg wrote:

> On Tue, Jul 12, 2016 at 10:32:33PM -0400, Mike Snitzer wrote:
> > On Tue, Jul 12 2016 at 10:18pm -0400,
> > Eric Wheeler <bca...@lists.ewheeler.net> wrote:
> > 
> > > On Tue, 12 Jul 2016, NeilBrown wrote:
> > > 
> > > > On Tue, Jul 12 2016, Lars Ellenberg wrote:
> > > > 
> > > > >
> > > > > Instead, I suggest to distinguish between recursive calls to
> > > > > generic_make_request(), and pushing back the remainder part in
> > > > > blk_queue_split(), by pointing current->bio_lists to a
> > > > >   struct recursion_to_iteration_bio_lists {
> > > > >   struct bio_list recursion;
> > > > >   struct bio_list queue;
> > > > >   }
> > > > >
> > > > > By providing each q->make_request_fn() with an empty "recursion"
> > > > > bio_list, then merging any recursively submitted bios to the
> > > > > head of the "queue" list, we can make the recursion-to-iteration
> > > > > logic in generic_make_request() process deepest level bios first,
> > > > > and "sibling" bios of the same level in "natural" order.
> > > > >
> > > > > Signed-off-by: Lars Ellenberg <lars.ellenb...@linbit.com>
> > > > > Signed-off-by: Roland Kammerer <roland.kamme...@linbit.com>
> > > > 
> > > > Reviewed-by: NeilBrown <ne...@suse.com>
> > > > 
> > > > Thanks again for doing this - I think this is a very significant
> > > > improvement and could allow other simplifications.
> > > 
> > > Thank you Lars for all of this work!  
> > > 
> > > It seems like there have been many 4.3+ blockdev stacking issues and this 
> > > will certainly address some of those (maybe all of them?).  (I think we 
> > > hit this while trying drbd in 4.4 so we dropped back to 4.1 without 
> > > issue.)  It would be great to hear 4.4.y stable pick this up if 
> > > compatible.
> > > 
> > > 
> > > Do you believe that this patch would solve any of the proposals by others 
> > > since 4.3 related to bio splitting/large bios?  I've been collecting a 
> > > list, none of which appear to have landed yet as of 4.7-rc7 (but correct me 
> > > if I'm wrong):
> > > 
> > > A.  [PATCH v2] block: make sure big bio is splitted into at most 256 bvecs
> > >   by Ming Lei: https://patchwork.kernel.org/patch/9169483/
> 
> That's an independent issue.
> 
> > > B.  block: don't make BLK_DEF_MAX_SECTORS too big
> > >   by Shaohua Li: http://www.spinics.net/lists/linux-bcache/msg03525.html
> 
> Yet another independent issue.
> 
> > > C.  [1/3] block: flush queued bios when process blocks to avoid deadlock
> > >   by Mikulas Patocka: https://patchwork.kernel.org/patch/9204125/
> > >   (was https://patchwork.kernel.org/patch/7398411/)
> 
> As it stands now,
> this is yet another issue, but related.
> 
> From the link above:
> 
> | ** Here is the dm-snapshot deadlock that was observed:
> | 
> | 1) Process A sends one-page read bio to the dm-snapshot target. The bio
> | spans snapshot chunk boundary and so it is split to two bios by device
> | mapper.
> | 
> | 2) Device mapper creates the first sub-bio and sends it to the snapshot
> | driver.
> | 
> | 3) The function snapshot_map calls track_chunk (that allocates a
> | structure
> | dm_snap_tracked_chunk and adds it to tracked_chunk_hash) and then remaps
> | the bio to the underlying device and exits with DM_MAPIO_REMAPPED.
> | 
> | 4) The remapped bio is submitted with generic_make_request, but it isn't
> | issued - it is added to current->bio_list instead.
> | 
> | 5) Meanwhile, process B (dm's kcopyd) executes pending_complete for the
> | chunk affected by the first remapped bio, it takes down_write(&s->lock)
> | and then loops in __check_for_conflicting_io, waiting for
> | dm_snap_tracked_chunk created in step 3) to be released.
> | 
> | 6) Process A continues, it creates a second sub-bio for the rest of the
> | original bio.
> 
> Aha.
> Here is the relation.
> If "A" had only ever processed "just the chunk it can handle now",
> and "pushed back" the rest of the incoming bio,
> it could rely on all deeper level bios to have been submitted already.
> 
> But this does not look like it easily fits into the current DM model.

Re: [dm-devel] [PATCH v2 1/1] block: fix blk_queue_split() resource exhaustion

2016-07-12 Thread Eric Wheeler
On Tue, 12 Jul 2016, NeilBrown wrote:

> On Tue, Jul 12 2016, Lars Ellenberg wrote:
> 
> >
> > Instead, I suggest to distinguish between recursive calls to
> > generic_make_request(), and pushing back the remainder part in
> > blk_queue_split(), by pointing current->bio_lists to a
> > struct recursion_to_iteration_bio_lists {
> > struct bio_list recursion;
> > struct bio_list queue;
> > }
> >
> > By providing each q->make_request_fn() with an empty "recursion"
> > bio_list, then merging any recursively submitted bios to the
> > head of the "queue" list, we can make the recursion-to-iteration
> > logic in generic_make_request() process deepest level bios first,
> > and "sibling" bios of the same level in "natural" order.
> >
> > Signed-off-by: Lars Ellenberg <lars.ellenb...@linbit.com>
> > Signed-off-by: Roland Kammerer <roland.kamme...@linbit.com>
> 
> Reviewed-by: NeilBrown <ne...@suse.com>
> 
> Thanks again for doing this - I think this is a very significant
> improvement and could allow other simplifications.

Thank you Lars for all of this work!  

It seems like there have been many 4.3+ blockdev stacking issues and this 
will certainly address some of those (maybe all of them?).  (I think we 
hit this while trying drbd in 4.4 so we dropped back to 4.1 without 
issue.)  It would be great to hear 4.4.y stable pick this up if 
compatible.


Do you believe that this patch would solve any of the proposals by others 
since 4.3 related to bio splitting/large bios?  I've been collecting a 
list, none of which appear to have landed yet as of 4.7-rc7 (but correct me 
if I'm wrong):

A.  [PATCH v2] block: make sure big bio is splitted into at most 256 bvecs
by Ming Lei: https://patchwork.kernel.org/patch/9169483/

B.  block: don't make BLK_DEF_MAX_SECTORS too big
by Shaohua Li: http://www.spinics.net/lists/linux-bcache/msg03525.html

C.  [1/3] block: flush queued bios when process blocks to avoid deadlock
by Mikulas Patocka: https://patchwork.kernel.org/patch/9204125/
(was https://patchwork.kernel.org/patch/7398411/)

D.  dm-crypt: Fix error with too large bios
by Mikulas Patocka: https://patchwork.kernel.org/patch/9138595/

A, B, and D are known to fix large bio issues when stacking dm+bcache 
(though B and D are trivial and probably necessary even with your patch).

Patch C was mentioned earlier in this thread by Mike Snitzer and you 
commented briefly that his patch might solve the issue; given that, and in 
the interest of minimizing duplicate effort, which of the following best 
describes the situation?

  1. Your patch could supersede Mikulas's patch; they address the same 
issue.

  2. Mikulas's patch addresses different issues such and both patches 
should be applied.

  3. There is overlap between both your patch and Mikulas's such that both 
#1,#2 are true and effort to solve this has been duplicated.


If #3, then what might be done to resolve the overlap?

What are the opinions of the authors and can a consensus be reached so we 
can see these pushed upstream with the appropriate stable Cc tags and 
ultimately fix 4.4.y?


--
Eric Wheeler


Re: [dm-devel] [PATCH v2 1/1] block: fix blk_queue_split() resource exhaustion

2016-07-12 Thread Eric Wheeler
On Tue, 12 Jul 2016, NeilBrown wrote:

> On Tue, Jul 12 2016, Lars Ellenberg wrote:
> 
> >
> > Instead, I suggest to distinguish between recursive calls to
> > generic_make_request(), and pushing back the remainder part in
> > blk_queue_split(), by pointing current->bio_lists to a
> > struct recursion_to_iteration_bio_lists {
> > struct bio_list recursion;
> > struct bio_list queue;
> > }
> >
> > By providing each q->make_request_fn() with an empty "recursion"
> > bio_list, then merging any recursively submitted bios to the
> > head of the "queue" list, we can make the recursion-to-iteration
> > logic in generic_make_request() process deepest level bios first,
> > and "sibling" bios of the same level in "natural" order.
> >
> > Signed-off-by: Lars Ellenberg 
> > Signed-off-by: Roland Kammerer 
> 
> Reviewed-by: NeilBrown 
> 
> Thanks again for doing this - I think this is a very significant
> improvement and could allow other simplifications.

Thank you Lars for all of this work!  

It seems like there have been many 4.3+ blockdev stacking issues and this 
will certainly address some of those (maybe all of them?).  (I think we 
hit this while trying drbd in 4.4 so we dropped back to 4.1 without 
issue.)  It would be great to see 4.4.y stable pick this up if 
compatible.


Do you believe that this patch would solve any of the proposals by others 
since 4.3 related to bio splitting/large bios?  I've been collecting a 
list, none of which appear to have landed yet as of 4.7-rc7 (but correct me 
if I'm wrong):

A.  [PATCH v2] block: make sure big bio is splitted into at most 256 bvecs
by Ming Lei: https://patchwork.kernel.org/patch/9169483/

B.  block: don't make BLK_DEF_MAX_SECTORS too big
by Shaohua Li: http://www.spinics.net/lists/linux-bcache/msg03525.html

C.  [1/3] block: flush queued bios when process blocks to avoid deadlock
by Mikulas Patocka: https://patchwork.kernel.org/patch/9204125/
(was https://patchwork.kernel.org/patch/7398411/)

D.  dm-crypt: Fix error with too large bios
by Mikulas Patocka: https://patchwork.kernel.org/patch/9138595/

The A,B,D are known to fix large bio issues when stacking dm+bcache 
(though the B,D are trivial and probably necessary even with your patch).

Patch C was mentioned earlier in this thread by Mike Snitzer and you 
commented briefly that his patch might solve the issue; given that, and in 
the interest of minimizing duplicate effort, which of the following best 
describes the situation?

  1. Your patch could supersede Mikulas's patch; they address the same 
issue.

  2. Mikulas's patch addresses different issues, and both patches 
should be applied.

  3. There is overlap between your patch and Mikulas's such that both 
#1 and #2 are partly true, and effort to solve this has been duplicated.


If #3, then what might be done to resolve the overlap?

What are the opinions of the authors and can a consensus be reached so we 
can see these pushed upstream with the appropriate stable Cc tags and 
ultimately fix 4.4.y?


--
Eric Wheeler


Re: [PATCH] bcache: bch_writeback_thread() is not freezable

2016-05-10 Thread Eric Wheeler
On Mon, 2 May 2016, Jiri Kosina wrote:
> On Tue, 19 Apr 2016, Jiri Kosina wrote:
> 
> > From: Jiri Kosina <jkos...@suse.cz>
> > 
> > bch_writeback_thread() is calling try_to_freeze(), but that's just an 
> > expensive no-op given the fact that the thread is not marked freezable.
> > 
> > I/O helper kthreads, exactly such as the bcache writeback thread, actually 
> > shouldn't be freezable, because they are potentially necessary for 
> > finalizing the image write-out.
> > 
> > Signed-off-by: Jiri Kosina <jkos...@suse.cz>
> 
> Let's send a friendly ping on these three patches after two weeks...

Maciej,

Were you able to get a backtrace?


--
Eric Wheeler


> 
> -- 
> Jiri Kosina
> SUSE Labs
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-bcache" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 



Re: [PATCH] bcache: bch_writeback_thread() is not freezable

2016-04-21 Thread Eric Wheeler
On Wed, 20 Apr 2016, Jiri Kosina wrote:
> On Tue, 19 Apr 2016, Eric Wheeler wrote:
> 
> > > bch_writeback_thread() is calling try_to_freeze(), but that's just an 
> > > expensive no-op given the fact that the thread is not marked freezable.
> > > 
> > > I/O helper kthreads, exactly such as the bcache writeback thread, 
> > > actually 
> > > shouldn't be freezable, because they are potentially necessary for 
> > > finalizing the image write-out.
> > 
> > This is good timing, as Maciej Piechotka just reported a hang when 
> > suspending his system.
> 
> Could you please point me to the actual report? Thanks.
>
> On Tue, 19 Apr 2016, Maciej Piechotka wrote:
> Eric Wheeler  lists.ewheeler.net> writes:
> > Interesting.  Can you collect the dmesg output as it suspends/resumes via
> > serial or something other means?
>
> I'll try to capture the output today.

No technical data yet, but this is the thread:  

http://comments.gmane.org/gmane.linux.kernel.bcache.devel/3820

> > What is the proper way to safely support suspend?  Assuming the 
> > try_to_freeze() calls are in the right place, should we simply 
> > set_freezable() on these kthreads?
> 
> Unfortunately, this is really a tricky question; the issue is that freezing 
> semantics are rather undefined for kthreads. For starters, please see
> 
>   https://lwn.net/Articles/662703/
>   http://lkml.org/lkml/2007/4/27/608

Interesting indeed.  So suspend should succeed independent of kernel 
threads since we want to get rid of freezable kthreads? 

Does this also mean that IO kthreads will always break suspend?

> I don't believe in freezable kthreads which serve as I/O helpers. Such 
> threads simply have to keep going until the image is written out and 
> machine powered down.
> 
> So I'd like to start with understanding how bcache is preventing suspend. 
> Maciej?

We await backtraces from Maciej, but I can say that bcache uses only two 
kthreads, one for garbage collection and another for writeback.  

Speculation: The writeback thread can (probably) be made unrunnable at any 
time without issue since it is (should be) fully asynchronous.  However, 
garbage collection might deadlock if the GC thread is unrunnable while 
hibernate (suspend?) IO is writing through bcache while bcache waits for 
GC to complete under allocation contention.  I'm not familiar with the 
bcache allocator details, so anyone else please chime in here.

Presumably, GC is only unsafe during writes to the cache for blocks that 
are not yet cached but would cause a cache allocation.  If so, then 
perhaps we can hook the pending suspend, set cache_mode to "writearound"  
to prevent btree changes, and restore the cache_mode on resume. It will be 
interesting to see the backtrace if Maciej can get one out of the system.


--
Eric Wheeler






Re: [PATCH] bcache: bch_writeback_thread() is not freezable

2016-04-19 Thread Eric Wheeler

On Tue, 19 Apr 2016, Jiri Kosina wrote:

> From: Jiri Kosina <jkos...@suse.cz>
> 
> bch_writeback_thread() is calling try_to_freeze(), but that's just an 
> expensive no-op given the fact that the thread is not marked freezable.
> 
> I/O helper kthreads, exactly such as the bcache writeback thread, actually 
> shouldn't be freezable, because they are potentially necessary for 
> finalizing the image write-out.

This is good timing, as Maciej Piechotka just reported a hang when 
suspending his system.

What is the proper way to safely support suspend?  Assuming the 
try_to_freeze() calls are in the right place, should we simply 
set_freezable() on these kthreads?


--
Eric Wheeler


> 
> Signed-off-by: Jiri Kosina <jkos...@suse.cz>
> ---
>  drivers/md/bcache/writeback.c | 3 ---
>  1 file changed, 3 deletions(-)
> 
> diff --git a/drivers/md/bcache/writeback.c b/drivers/md/bcache/writeback.c
> index b9346cd..6012367 100644
> --- a/drivers/md/bcache/writeback.c
> +++ b/drivers/md/bcache/writeback.c
> @@ -12,7 +12,6 @@
>  #include "writeback.h"
>  
>  #include 
> -#include 
>  #include 
>  #include 
>  
> @@ -228,7 +227,6 @@ static void read_dirty(struct cached_dev *dc)
>*/
>  
>   while (!kthread_should_stop()) {
> - try_to_freeze();
>  
>   w = bch_keybuf_next(&dc->writeback_keys);
>   if (!w)
> @@ -433,7 +431,6 @@ static int bch_writeback_thread(void *arg)
>   if (kthread_should_stop())
>   return 0;
>  
> - try_to_freeze();
>   schedule();
>   continue;
>   }
> -- 
> Jiri Kosina
> SUSE Labs
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-bcache" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 




Re: [PATCH] block: make sure big bio is splitted into at most 256 bvecs

2016-04-07 Thread Eric Wheeler
On Thu, 7 Apr 2016, Ming Lei wrote:

> On Thu, Apr 7, 2016 at 9:56 AM, Eric Wheeler <bca...@lists.ewheeler.net> 
> wrote:
> > On Thu, 7 Apr 2016, Ming Lei wrote:
> >
> >> On Thu, Apr 7, 2016 at 9:36 AM, Eric Wheeler <bca...@lists.ewheeler.net> 
> >> wrote:
> >> > On Wed, 6 Apr 2016, Ming Lei wrote:
> >> >
> >> >> After arbitrary bio size is supported, the incoming bio may
> >> >> be very big. We have to split the bio into small bios so that
> >> >> each holds at most BIO_MAX_PAGES bvecs for safety reason, such
> >> >> as bio_clone().
> >> >>
> >> >> This patch fixes the following kernel crash:
> >> >>
> >> >> > [  172.660142] BUG: unable to handle kernel NULL pointer dereference 
> >> >> > at
> >> >> > 0028
> >> >> > [  172.660229] IP: [] bio_trim+0xf/0x2a
> >> >> > [  172.660289] PGD 7faf3e067 PUD 7f9279067 PMD 0
> >> >> > [  172.660399] Oops:  [#1] SMP
> >> >> > [...]
> >> >> > [  172.664780] Call Trace:
> >> >> > [  172.664813]  [] ? raid1_make_request+0x2e8/0xad7 
> >> >> > [raid1]
> >> >> > [  172.664846]  [] ? blk_queue_split+0x377/0x3d4
> >> >> > [  172.664880]  [] ? md_make_request+0xf6/0x1e9 
> >> >> > [md_mod]
> >> >> > [  172.664912]  [] ? generic_make_request+0xb5/0x155
> >> >> > [  172.664947]  [] ? prio_io+0x85/0x95 [bcache]
> >> >> > [  172.664981]  [] ? register_cache_set+0x355/0x8d0 
> >> >> > [bcache]
> >> >> > [  172.665016]  [] ? register_bcache+0x1006/0x1174 
> >> >> > [bcache]
> >> >>
> >> >> Fixes: 54efd50(block: make generic_make_request handle arbitrarily 
> >> >> sized bios)
> >> >> Reported-by: Sebastian Roesner <sroesner-kernel...@roesner-online.de>
> >> >> Reported-by: Eric Wheeler <bca...@lists.ewheeler.net>
> >> >> Cc: sta...@vger.kernel.org (4.2+)
> >> >
> >> > Ming Lei,
> >> >
> >> > get_max_io_size doesn't appear until 4.5 based on a quick LXR check so we
> >> > won't see it in stable I don't think.
> >> >
> >> > It would be nice to see this fixed in 4.1 (if affected there).  Is there
> >>
> >> The issue should be introduced to v4.3 via 54efd50
> >>
> >> > another place this could be applied to be a bit more backward compatible?
> >>
> >> The v1 needn't change to get_max_io_size(), and it should be simple enough
> >> to backport to previous stables, please try it:
> >>
> >> http://marc.info/?l=linux-block&m=145991422422927&w=2
> >
> > V1 changes blk_bio_segment_split() which doesn't exist until 4.3.
> >
> > How might you port this to v4.1.y?
> 
> Can you see the issue with v4.1?
> 
> You mentioned there are three reports:
> 
> > [2014-02-04] https://bugzilla.redhat.com/show_bug.cgi?id=1061339
> > [2016-01-13] http://www.spinics.net/lists/linux-bcache/msg03335.html
>  https://bugzilla.kernel.org/show_bug.cgi?id=110771
> > [2016-03-25] http://thread.gmane.org/gmane.linux.kernel.bcache.devel/3607 
> > [this thread]
>  https://bugzilla.kernel.org/show_bug.cgi?id=114871
> 
> The first one has been fixed by '8423ae3 block: Fix cloning of
> discard/write same bios', as mentioned in bugzilla.
> 
> The other two are reported on v4.4 and v4.5.
> 
> If you look at the patch description, it is just needed for 4.3+.


Oh, that could be---but just to be sure:

I had thought perhaps this was an old issue since the first mention of 
this backtrace (but not bcache) was in 3.14 back in 2014 based on this 
post:
  https://bugzilla.redhat.com/show_bug.cgi?id=1061339

Is this relevant?

--
Eric Wheeler



> 
> Or I am wrong?
> 
> 
> Thanks,
> 
> >
> > --
> > Eric Wheeler
> >
> >
> >>
> >> Thanks,
> >>
> >> >
> >> >> Cc: Shaohua Li <s...@fb.com>
> >> >> Signed-off-by: Ming Lei <ming@canonical.com>
> >> >> ---
> >> >> I can reproduce the issue and verify the fix by the following approach:
> >> >>   - create one raid1 over two virtio-blk
> >> >>   - build bcache device over the above raid1 and another cache 
> >> >> device.
> >> >>   - set cache mode as writeback
> >> >>   - run random write over ext4 on the bcache device
> >> >>   - then the crash can be triggered


Re: [PATCH] block: make sure big bio is splitted into at most 256 bvecs

2016-04-06 Thread Eric Wheeler
On Thu, 7 Apr 2016, Ming Lei wrote:

> On Thu, Apr 7, 2016 at 9:36 AM, Eric Wheeler <bca...@lists.ewheeler.net> 
> wrote:
> > On Wed, 6 Apr 2016, Ming Lei wrote:
> >
> >> After arbitrary bio size is supported, the incoming bio may
> >> be very big. We have to split the bio into small bios so that
> >> each holds at most BIO_MAX_PAGES bvecs for safety reason, such
> >> as bio_clone().
> >>
> >> This patch fixes the following kernel crash:
> >>
> >> > [  172.660142] BUG: unable to handle kernel NULL pointer dereference at
> >> > 0028
> >> > [  172.660229] IP: [] bio_trim+0xf/0x2a
> >> > [  172.660289] PGD 7faf3e067 PUD 7f9279067 PMD 0
> >> > [  172.660399] Oops:  [#1] SMP
> >> > [...]
> >> > [  172.664780] Call Trace:
> >> > [  172.664813]  [] ? raid1_make_request+0x2e8/0xad7 
> >> > [raid1]
> >> > [  172.664846]  [] ? blk_queue_split+0x377/0x3d4
> >> > [  172.664880]  [] ? md_make_request+0xf6/0x1e9 
> >> > [md_mod]
> >> > [  172.664912]  [] ? generic_make_request+0xb5/0x155
> >> > [  172.664947]  [] ? prio_io+0x85/0x95 [bcache]
> >> > [  172.664981]  [] ? register_cache_set+0x355/0x8d0 
> >> > [bcache]
> >> > [  172.665016]  [] ? register_bcache+0x1006/0x1174 
> >> > [bcache]
> >>
> >> Fixes: 54efd50(block: make generic_make_request handle arbitrarily sized 
> >> bios)
> >> Reported-by: Sebastian Roesner <sroesner-kernel...@roesner-online.de>
> >> Reported-by: Eric Wheeler <bca...@lists.ewheeler.net>
> >> Cc: sta...@vger.kernel.org (4.2+)
> >
> > Ming Lei,
> >
> > get_max_io_size doesn't appear until 4.5 based on a quick LXR check so we
> > won't see it in stable I don't think.
> >
> > It would be nice to see this fixed in 4.1 (if affected there).  Is there
> 
> The issue should be introduced to v4.3 via 54efd50
> 
> > another place this could be applied to be a bit more backward compatible?
> 
> The v1 needn't change to get_max_io_size(), and it should be simple enough
> to backport to previous stables, please try it:
> 
> http://marc.info/?l=linux-block&m=145991422422927&w=2

V1 changes blk_bio_segment_split() which doesn't exist until 4.3.

How might you port this to v4.1.y?

--
Eric Wheeler


> 
> Thanks,
> 
> >
> >> Cc: Shaohua Li <s...@fb.com>
> >> Signed-off-by: Ming Lei <ming@canonical.com>
> >> ---
> >> I can reproduce the issue and verify the fix by the following approach:
> >>   - create one raid1 over two virtio-blk
> >>   - build bcache device over the above raid1 and another cache device.
> >>   - set cache mode as writeback
> >>   - run random write over ext4 on the bcache device
> >>   - then the crash can be triggered
> >>
> >>  block/blk-merge.c | 12 
> >>  1 file changed, 12 insertions(+)
> >>
> >> diff --git a/block/blk-merge.c b/block/blk-merge.c
> >> index 2613531..9a8651f 100644
> >> --- a/block/blk-merge.c
> >> +++ b/block/blk-merge.c
> >> @@ -79,6 +79,18 @@ static inline unsigned get_max_io_size(struct 
> >> request_queue *q,
> >>   /* aligned to logical block size */
> >>   sectors &= ~(mask >> 9);
> >>
> >> + /*
> >> +  * With arbitrary bio size, the incoming bio may be very big.
> >> +  * We have to split the bio into small bios so that each holds
> >> +  * at most BIO_MAX_PAGES bvecs for safety reason, such as
> >> +  * bio_clone().
> >> +  *
> >> +  * In the future, the limit might be converted into per-queue
> >> +  * flag.
> >> +  */
> >> + sectors = min_t(unsigned, sectors, BIO_MAX_PAGES <<
> >> + (PAGE_CACHE_SHIFT - 9));
> >> +
> >>   return sectors;
> >>  }
> >>
> >> --
> >> 1.9.1
> >>
> >>
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-block" in
> > the body of a message to majord...@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 




Re: [PATCH] block: make sure big bio is splitted into at most 256 bvecs

2016-04-06 Thread Eric Wheeler
On Wed, 6 Apr 2016, Ming Lei wrote:

> On Wed, Apr 6, 2016 at 1:44 AM, Ming Lei <ming@canonical.com> wrote:
> > After arbitrary bio size is supported, the incoming bio may
> > be very big. We have to split the bio into small bios so that
> > each holds at most BIO_MAX_PAGES bvecs for safety reason, such
> > as bio_clone().
> >
> > This patch fixes the following kernel crash:
> >
> >> [  172.660142] BUG: unable to handle kernel NULL pointer dereference at
> >> 0028
> >> [  172.660229] IP: [] bio_trim+0xf/0x2a
> >> [  172.660289] PGD 7faf3e067 PUD 7f9279067 PMD 0
> >> [  172.660399] Oops:  [#1] SMP
> >> [...]
> >> [  172.664780] Call Trace:
> >> [  172.664813]  [] ? raid1_make_request+0x2e8/0xad7 
> >> [raid1]
> >> [  172.664846]  [] ? blk_queue_split+0x377/0x3d4
> >> [  172.664880]  [] ? md_make_request+0xf6/0x1e9 [md_mod]
> >> [  172.664912]  [] ? generic_make_request+0xb5/0x155
> >> [  172.664947]  [] ? prio_io+0x85/0x95 [bcache]
> >> [  172.664981]  [] ? register_cache_set+0x355/0x8d0 
> >> [bcache]
> >> [  172.665016]  [] ? register_bcache+0x1006/0x1174 
> >> [bcache]
> >
> > Fixes: 54efd50(block: make generic_make_request handle arbitrarily sized 
> > bios)
> > Reported-by: Sebastian Roesner <sroesner-kernel...@roesner-online.de>
> > Reported-by: Eric Wheeler <bca...@lists.ewheeler.net>
> > Cc: sta...@vger.kernel.org (4.2+)
> > Cc: Shaohua Li <s...@fb.com>
> > Signed-off-by: Ming Lei <ming@canonical.com>
> > ---
> > I can reproduce the issue and verify the fix by the following approach:
> > - create one raid1 over two virtio-blk
> > - build bcache device over the above raid1 and another cache device.
> > - set cache mode as writeback
> > - run random write over ext4 on the bcache device
> > - then the crash can be triggered
> 
> For anyone who is interested in issue/fix, forget to mention:
> 
> The bucket size should be set bigger than 1M when making bcache.
> In my test, the bucket size is 2M.

Does the bucket size dictate the ideal cached data size, or is it just an 
optimization for erase block boundaries on the SSD?  

Are reads/writes smaller than the bucket size still cached effectively, or 
does a 2MB bucket slurp up 2MB of backing data along with it?  

For example, if 64k is our ideal IO size, should we use 64k buckets?

--
Eric Wheeler

> 
> Thanks,
> Ming
> 
> >
> >  block/blk-merge.c | 12 
> >  1 file changed, 12 insertions(+)
> >
> > diff --git a/block/blk-merge.c b/block/blk-merge.c
> > index 2613531..9a8651f 100644
> > --- a/block/blk-merge.c
> > +++ b/block/blk-merge.c
> > @@ -79,6 +79,18 @@ static inline unsigned get_max_io_size(struct 
> > request_queue *q,
> > /* aligned to logical block size */
> > sectors &= ~(mask >> 9);
> >
> > +   /*
> > +* With arbitrary bio size, the incoming bio may be very big.
> > +* We have to split the bio into small bios so that each holds
> > +* at most BIO_MAX_PAGES bvecs for safety reason, such as
> > +* bio_clone().
> > +*
> > +* In the future, the limit might be converted into per-queue
> > +* flag.
> > +*/
> > +   sectors = min_t(unsigned, sectors, BIO_MAX_PAGES <<
> > +   (PAGE_CACHE_SHIFT - 9));
> > +
> > return sectors;
> >  }
> >
> > --
> > 1.9.1
> >
> 


Re: [PATCH] block: make sure big bio is splitted into at most 256 bvecs

2016-04-06 Thread Eric Wheeler
On Wed, 6 Apr 2016, Ming Lei wrote:

> After arbitrary bio size is supported, the incoming bio may
> be very big. We have to split the bio into small bios so that
> each holds at most BIO_MAX_PAGES bvecs for safety reason, such
> as bio_clone().
> 
> This patch fixes the following kernel crash:
> 
> > [  172.660142] BUG: unable to handle kernel NULL pointer dereference at
> > 0028
> > [  172.660229] IP: [] bio_trim+0xf/0x2a
> > [  172.660289] PGD 7faf3e067 PUD 7f9279067 PMD 0
> > [  172.660399] Oops:  [#1] SMP
> > [...]
> > [  172.664780] Call Trace:
> > [  172.664813]  [] ? raid1_make_request+0x2e8/0xad7 
> > [raid1]
> > [  172.664846]  [] ? blk_queue_split+0x377/0x3d4
> > [  172.664880]  [] ? md_make_request+0xf6/0x1e9 [md_mod]
> > [  172.664912]  [] ? generic_make_request+0xb5/0x155
> > [  172.664947]  [] ? prio_io+0x85/0x95 [bcache]
> > [  172.664981]  [] ? register_cache_set+0x355/0x8d0 
> > [bcache]
> > [  172.665016]  [] ? register_bcache+0x1006/0x1174 
> > [bcache]
> 
> Fixes: 54efd50(block: make generic_make_request handle arbitrarily sized bios)
> Reported-by: Sebastian Roesner <sroesner-kernel...@roesner-online.de>
> Reported-by: Eric Wheeler <bca...@lists.ewheeler.net>
> Cc: sta...@vger.kernel.org (4.2+)

Ming Lei,

get_max_io_size() doesn't appear until 4.5 based on a quick LXR check, so I 
don't think we will see it in stable.

It would be nice to see this fixed in 4.1 (if affected there).  Is there 
another place this could be applied to be a bit more backward compatible?

-Eric


--
Eric Wheeler



> Cc: Shaohua Li <s...@fb.com>
> Signed-off-by: Ming Lei <ming@canonical.com>
> ---
> I can reproduce the issue and verify the fix by the following approach:
>   - create one raid1 over two virtio-blk 
>   - build bcache device over the above raid1 and another cache device.
>   - set cache mode as writeback
>   - run random write over ext4 on the bcache device
>   - then the crash can be triggered
> 
>  block/blk-merge.c | 12 
>  1 file changed, 12 insertions(+)
> 
> diff --git a/block/blk-merge.c b/block/blk-merge.c
> index 2613531..9a8651f 100644
> --- a/block/blk-merge.c
> +++ b/block/blk-merge.c
> @@ -79,6 +79,18 @@ static inline unsigned get_max_io_size(struct 
> request_queue *q,
>   /* aligned to logical block size */
>   sectors &= ~(mask >> 9);
>  
> + /*
> +  * With arbitrary bio size, the incoming bio may be very big.
> +  * We have to split the bio into small bios so that each holds
> +  * at most BIO_MAX_PAGES bvecs for safety reason, such as
> +  * bio_clone().
> +  *
> +  * In the future, the limit might be converted into per-queue
> +  * flag.
> +  */
> + sectors = min_t(unsigned, sectors, BIO_MAX_PAGES <<
> + (PAGE_CACHE_SHIFT - 9));
> +
>   return sectors;
>  }
>  
> -- 
> 1.9.1
> 
> 


[PATCH-RFC]: sysrq-a: graceful reboot via kernel_restart(), similar to sysrq-o

2016-03-10 Thread Eric Wheeler
Hello all,

We were having a discussion on the bcache list about the safest reboot 
options via sysrq here:
  http://thread.gmane.org/gmane.linux.kernel.bcache.devel/3559/focus=3586

The discussion resulted in a patch for sysrq-a to call 
kernel_restart() much in the same way that sysrq-o calls kernel_power_off().

Please comment on the patch and suggest any appropriate changes.  

Thank you!

--
Eric Wheeler


===
diff --git a/Documentation/sysrq.txt b/Documentation/sysrq.txt
index 0e307c9..cf43fc8 100644
--- a/Documentation/sysrq.txt
+++ b/Documentation/sysrq.txt
@@ -65,6 +65,11 @@ On all -  write a character to /proc/sysrq-trigger.  e.g.:
 
 *  What are the 'command' keys?
 ~~~
+'a' - Attempt to semi-gracefully reboot via kernel_restart(NULL). This
+  is in-kernel only (bypasses init), but is cleaner than a hard
+  reboot such as sysrq-b. sysrq-a is functionally similar to 'o' but
+  calls kernel_restart() instead of kernel_power_off().
+
 'b' - Will immediately reboot the system without syncing or unmounting
   your disks.
 
diff --git a/kernel/reboot.c b/kernel/reboot.c
index d20c85d..8181749 100644
--- a/kernel/reboot.c
+++ b/kernel/reboot.c
@@ -16,6 +16,7 @@
 #include 
 #include 
 #include 
+#include 
 
 /*
  * this indicates whether you can reboot with ctrl-alt-del: the default is yes
@@ -555,3 +556,42 @@ static int __init reboot_setup(char *str)
return 1;
 }
 __setup("reboot=", reboot_setup);
+
+
+/*
+ * When the user hits Sys-Rq 'a' to kernel_restart the machine this is the
+ * callback we use.
+ */
+
+static void do_krestart(struct work_struct *dummy)
+{
+   /* are these two necessary? */
+   lockdep_off();
+   local_irq_enable();
+
+   /* restart */
+   kernel_restart(NULL);
+}
+
+static DECLARE_WORK(krestart_work, do_krestart);
+
+static void handle_krestart(int key)
+{
+   /* run sysrq krestart on boot cpu */
+   schedule_work_on(cpumask_first(cpu_online_mask), &krestart_work);
+}
+
+static struct sysrq_key_op sysrq_krestart_op = {
+   .handler= handle_krestart,
+   .help_msg   = "kernel_restart(a)",
+   .action_msg = "Gracefullish Restart",
+   .enable_mask= SYSRQ_ENABLE_BOOT,
+};
+
+static int __init pm_sysrq_krestart_init(void)
+{
+   register_sysrq_key('a', &sysrq_krestart_op);
+   return 0;
+}
+
+subsys_initcall(pm_sysrq_krestart_init);
===


Re: bcache_writeback: bch_writeback_thread...blk_queue_bio IO hang [was BUG: soft lockup]

2016-02-01 Thread Eric Wheeler
[ changed subject, more below ]

On Fri, 29 Jan 2016, Johannes Thumshirn wrote:

> [ +cc Kent ]
> 
> On Wed, Jan 27, 2016 at 02:57:25PM +, Yannis Aribaud wrote:
> > Hi,
> > 
> > After several weeks using the 4.2.6 kernel + patches from Ewheeler we just 
> > ran into a crash again.
> > This time the kernel was still running and the server was responsive but 
> > not able to do any IO on the bcache devices.
> > 
> > [696983.683498] bcache_writebac D 810643df 0  5741  2 
> > 0x
> > [696983.683505]  88103d01f180 0046 88107842d000 
> > 811a95cd
> > [696983.683510]   8810388c4000 88103d01f180 
> > 0001
> > [696983.683514]  882034ae0c10  882034ae 
> > 8139601e
> > [696983.683518] Call Trace:
> > [696983.683530]  [] ? blk_queue_bio+0x262/0x279
> > [696983.683539]  [] ? schedule+0x6b/0x78
> > [696983.683553]  [] ? closure_sync+0x66/0x91 [bcache]
> > [696983.683563]  [] ? bch_writeback_thread+0x622/0x6b5 
> > [bcache]
> > [696983.683569]  [] ? __switch_to+0x1de/0x3f7
> > [696983.683578]  [] ? bch_writeback_thread+0x622/0x6b5 
> > [bcache]
> > [696983.683586]  [] ? write_dirty_finish+0x1bf/0x1bf 
> > [bcache]
> > [696983.683594]  [] ? kthread+0x99/0xa1
> > [696983.683598]  [] ? kthread_parkme+0x16/0x16
> > [696983.683603]  [] ? ret_from_fork+0x3f/0x70
> > [696983.683607]  [] ? kthread_parkme+0x16/0x16
> > 
> > Don't know if this help.
> > Unfortunately I think that we will rollback and stop using Bcache unless 
> > this is really fixed :/
> > 
> 
> Hi Yannis,
> 
> Do you have a machine with a bcache setup running where you can reproduce the
> error? Or do you know a method to reproduce the error?
> 
> What I'd be interested in is which locks are held when it locks up (you can
> acquire this information with SysRq+d or echo d > /proc/sysrq-trigger).
> 
> Kent, do you have an idea what's happening here?

Are you using md-based raid5/raid6?  

If so, it could be the raid_partial_stripes_expensive bug.

If it *is* the bcache optimizations around partial_stripes_expensive, then
please try the patch below and then do this *before* loading the bcache
module (or at least before registering the bcache backing volume):
echo 0 > /sys/block/BLK/queue/limits/raid_partial_stripes_expensive
where BLK is your backing volume.

I wrote this patch because we use hardware RAID5/6 and wanted to get the
partial_stripes_expensive optimizations on by setting
raid_partial_stripes_expensive=1 and io_opt=our_stride_width.
Unfortunately it caused the backtrace in the LKML thread below, so we
stopped using it.

See also this thread, however, it shows a backtrace prior to
removing the bio splitting code:
  https://lkml.org/lkml/2016/1/7/844

Note that commit 749b61dab30736eb95b1ee23738cae90973d4fc3 might not 
exactly address this issue, but it might prevent full hangs.  Make sure 
you cherry-pick commit 749b61dab30736eb95b1ee23738cae90973d4fc3 and 
hand-clean-up as necessary. It simplifies the bio code and deletes a bunch 
of stuff.  It showed up in 4.4 and isn't in my patchset.

After the commit above and setting 
  raid_partial_stripes_expensive=1 
we still get errors like this:
  bcache: bch_count_io_errors() dm-3: IO error on writing data to cache, 
recovering but they don't lock up the system.  Ultimately we run with 
raid_partial_stripes_expensive=0 because of these related problems and 
haven't had any issues.  Also, fwiw, we run 4.1.y as our stable branch 
backed by hardware raid.

See my queue_limits patch below which exposes 
  request_queue::limits
via sysfs.

If anyone does have a fix for this, I would love to see it---because then 
we could use the bcache raid_partial_stripes_expensive/opt_io 
optimizations!

-Eric


diff --git a/block/blk-core.c b/block/blk-core.c
index 03b5f8d..54006aa 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -642,6 +642,7 @@ struct request_queue *blk_alloc_queue_node(gfp_t gfp_mask, int node_id)
 	INIT_DELAYED_WORK(&q->delay_work, blk_delay_work);
 
 	kobject_init(&q->kobj, &blk_queue_ktype);
+	kobject_init(&q->ql_kobj, &blk_queue_limits_ktype);
 
 	mutex_init(&q->sysfs_lock);
 	spin_lock_init(&q->__queue_lock);
diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index 2b8fd30..9c29b44 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -430,6 +430,119 @@ static struct attribute *default_attrs[] = {
NULL,
 };
 
+/* First we define a show/store function generator 
+ * for `struct queue_limits` */
+#define QL_STORE_FUNC_NAME(name) queue_limits_store_##name
+#define QL_STORE_FUNC(name, type)  \
+   static ssize_t \
+   QL_STORE_FUNC_NAME(name) (struct request_queue *q, const char *page, size_t count)\
+   {  \
+   unsigned long val; \
+   

[PULL] Re: bcache stability patches

2015-12-29 Thread Eric Wheeler
Hi Jens and Kent,

This affects many users, so please take a look when you have a moment:

There is a growing bcache user community with a well-tested patchset that 
is necessary for production bcache use.  The diffstat is small, and merging 
it into mainline would serve many people.

More below:

On Tue, 22 Dec 2015, Denis Bychkov wrote:
> There is a set of bcache stability patches elevating bcache stability to 
> production level. As far as I know, there is no single reported and peer 
> confirmed bug that is not solved by this set. Unfortunately, for some 
> reason, Kent does not have enough time and/or energy to review them and 
> send them upstream. Let's come up with a solution that would allow to 
> review all these patches (some of them written by Ken himself, some of 
> them produced by the community), review them and hand them to the 
> maintainer who is willing to apply them upstream. Without that, bcache 
> is just another half-assed unstable and buggy cache layer. These patches 
> will allow people to start use bcache in production systems. Please find 
> the patch set attached. (The patches apply cleanly to 4.3 and 4.4 kernel 
> series).

Hi Dennis,

I'm maintaining a branch here that is ready to merge.  We have been 
testing this for about a year in production and works great.  All Cc's and 
authors are correct and it (should) have every stability patch below, 
possibly others too.  Please tell me if there are any patches missing:

git pull https://github.com/ewheelerinc/linux.git bcache-patches-for-3.17
(Yes, github for hosting only, I don't edit with their web interfaces.) 

Note that this branch still merges cleanly through v4.4-rc7 and as far 
back as 3.17-rc1 (maybe earlier).  Each patch provides Cc: 
sta...@vger.kernel.org.

It is ready to merge!  We just need Jens or Kent or someone to pull it in.  
Here is the diffstat and shortlog against v4.4-rc7:

 drivers/md/bcache/btree.c |  5 -
 drivers/md/bcache/super.c | 16 
 drivers/md/bcache/writeback.c | 37 ++---
 drivers/md/bcache/writeback.h |  3 ++-
 4 files changed, 48 insertions(+), 13 deletions(-)

Al Viro (1):
  bcache: fix a leak in bch_cached_dev_run()

Gabriel de Perthuis (1):
  bcache: allows use of register in udev to avoid "device_busy" error.

Kent Overstreet (2):
  bcache: Add a cond_resched() call to gc
  bcache: Change refill_dirty() to always scan entire disk if 
necessary

Stefan Bader (1):
  bcache: prevent crash on changing writeback_running

Zheng Liu (3):
  bcache: fix a livelock when we cause a huge number of cache misses
  bcache: clear BCACHE_DEV_UNLINK_DONE flag when attaching a backing device
  bcache: unregister reboot notifier if bcache fails to unregister device

See also these threads:
  https://lkml.org/lkml/2015/12/5/38
  https://www.mail-archive.com/linux-bcache@vger.kernel.org/msg03159.html

Quickly view commits here, too:
  https://github.com/ewheelerinc/linux/commits/bcache-patches-for-3.17


Cheers,

-Eric


-- 
Eric Wheeler, President   eWheeler, Inc. dba Global Linux Security
888-LINUX26 (888-546-8926)Fax: 503-716-3878   PO Box 25107
www.GlobalLinuxSecurity.pro   Linux since 1996! Portland, OR 97298

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PULL REQUEST]: bcache stability patches

2015-12-05 Thread Eric Wheeler
On Wed, 2 Dec 2015, Johannes Thumshirn wrote:
> On Sun, 2015-11-29 at 20:13 -0800, Eric Wheeler wrote:
> > Hello all,
> > 
> > Please pull this bcache branch with stability patches and see if it fixes 
> > any issue that you might have!  If you have stability patches (*that you 
> > have tested*) which are not part of this branch, then please forward them 
> > so they can be included.
> > 
> > We have been using bcache with great performance and stability for over a 
> > year; it is the best SSD caching mechanism that we have found for our 
> > workload.  
> > 
> > I've been collecting these fixes for a while now and just put them up on 
> > github so they can be easily merged.  I continue to see mailing list 
> > discussions about problems that users are having which these patches 
> > address, so hopefully we can get these merged:
> >  
> >   https://github.com/ewheelerinc/linux/commits/bcache-patches-for-3.17
> >   git remote add ewheelerinc https://github.com/ewheelerinc/linux.git
> >   git fetch ewheelerinc
> >   git merge ewheelerinc/bcache-patches-for-3.17
> > 
> > This branch is from a clone of Linus's tree circa 3.17-rc so git merge 
> > should bring this in cleanly to any later branch.  This will merge cleanly 
> > into kernel.org longterm versions v4.1.13 and v3.18.24.  The original 
> > author is cited in each commit; you are being Cc'ed in this email if you 
> > are an author on any of these patches or are part of a mailing list thread 
> > indicating problems that this pull may fix.
> 
> I just did a test merge on the patches against today's master and I had no
> merge conflicts. The overall diff looks reasonable as well and it builds w/o
> any warnings.
> 
> Just pulling them straight in won't be a good idea though, as the From: always
> points to Eric, that needs to be fixed up (but that's all just mechanics).

Fixed.  I've updated the authors and done a git push --force to adjust.  
Please re-merge.

> > Each commit includes `Cc: sta...@vger.kernel.org` so if they get pulled 
> > into mainline we will see bcache stable on the newer LTS branches of 
> > kernel.org.
> > 
> > This git branch includes patches that have been collected and tested by 
> > Denis Bychkov from his post on 2015-09-16 with one exception: the 
> > refill_dirty() patch has been updated to the latest stable version which 
> > Kent posted the next day.  The refill_dirty patch addresses issues when 
> > bcache is backed by raid5/6 md's causing 
> > queue_limits->partial_stripes_expensive to be nonzero.
> > 
> > 
> > Kent,
> > 
> > Can this be sent to Jens for a pull into mainline?  If you can give your 
> > OK on this patchset then I'll approach Jens and see if he can pull this 
> > in.  Maybe Cc him in your response.
> > 
> > If there's anything in here that you would like to omit or change from 
> > this pull then let me know and I'll rebase.
> > 
> > Thanks!
> > 
> > -Eric
> > 
> 
> Kent, can you please merge these patches? There are people out there actually
> using bcache and these patches aren't just cosmetics.
> 
> Thanks a lot for collecting these patches Eric.
>   Johannes

Thank you for the feedback Johannes!  I look forward to seeing these 
upstream.

-Eric
 
> > 
> > 
> > --
> > Eric Wheeler, President   eWheeler, Inc. dba Global Linux Security
> > 888-LINUX26 (888-546-8926)    Fax: 503-716-3878   PO Box 25107
> > www.GlobalLinuxSecurity.pro   Linux since 1996! Portland, OR 97298
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-bcache" in
> > the body of a message to majord...@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
> 
