Re: [PATCH V10 11/19] bcache: avoid to use bio_for_each_segment_all() in bch_bio_alloc_pages()
On Fri, Nov 16, 2018 at 02:46:45PM +0100, Christoph Hellwig wrote:
> > -	bio_for_each_segment_all(bv, bio, i) {
> > +	for (i = 0, bv = bio->bi_io_vec; i < bio->bi_vcnt; bv++) {
>
> This really needs a comment.  Otherwise it looks fine to me.

OK, will do it in next version.

Thanks,
Ming
Re: [PATCH V10 11/19] bcache: avoid to use bio_for_each_segment_all() in bch_bio_alloc_pages()
On Thu, Nov 15, 2018 at 04:44:02PM -0800, Omar Sandoval wrote:
> On Thu, Nov 15, 2018 at 04:52:58PM +0800, Ming Lei wrote:
> > bch_bio_alloc_pages() is always called on one new bio, so it is safe
> > to access the bvec table directly. Given it is the only kind of this
> > case, open code the bvec table access since bio_for_each_segment_all()
> > will be changed to support for iterating over multipage bvec.
> >
> > Cc: Dave Chinner
> > Cc: Kent Overstreet
> > Acked-by: Coly Li
> > Cc: Mike Snitzer
> > Cc: dm-de...@redhat.com
> > Cc: Alexander Viro
> > Cc: linux-fsde...@vger.kernel.org
> > Cc: Shaohua Li
> > Cc: linux-r...@vger.kernel.org
> > Cc: linux-er...@lists.ozlabs.org
> > Cc: David Sterba
> > Cc: linux-btrfs@vger.kernel.org
> > Cc: Darrick J. Wong
> > Cc: linux-...@vger.kernel.org
> > Cc: Gao Xiang
> > Cc: Christoph Hellwig
> > Cc: Theodore Ts'o
> > Cc: linux-e...@vger.kernel.org
> > Cc: Coly Li
> > Cc: linux-bca...@vger.kernel.org
> > Cc: Boaz Harrosh
> > Cc: Bob Peterson
> > Cc: cluster-de...@redhat.com
> > Signed-off-by: Ming Lei
> > ---
> >  drivers/md/bcache/util.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/drivers/md/bcache/util.c b/drivers/md/bcache/util.c
> > index 20eddeac1531..8517aebcda2d 100644
> > --- a/drivers/md/bcache/util.c
> > +++ b/drivers/md/bcache/util.c
> > @@ -270,7 +270,7 @@ int bch_bio_alloc_pages(struct bio *bio, gfp_t gfp_mask)
> >  	int i;
> >  	struct bio_vec *bv;
> >
> > -	bio_for_each_segment_all(bv, bio, i) {
> > +	for (i = 0, bv = bio->bi_io_vec; i < bio->bi_vcnt; bv++) {
>
> This is missing an i++.

Good catch, will fix it in next version.

thanks,
Ming
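[For readers following along, here is a sketch of what the loop might look like with both review comments folded in: the explanatory comment Christoph asked for and the i++ Omar spotted. The error-unwind body shown is an assumption about the surrounding function, not a quote of the next version of the patch.]

	int bch_bio_alloc_pages(struct bio *bio, gfp_t gfp_mask)
	{
		int i;
		struct bio_vec *bv;

		/*
		 * This function is only ever called on a freshly allocated bio,
		 * so it is safe to walk the bvec table directly instead of
		 * going through bio_for_each_segment_all().
		 */
		for (i = 0, bv = bio->bi_io_vec; i < bio->bi_vcnt; bv++, i++) {
			bv->bv_page = alloc_page(gfp_mask);
			if (!bv->bv_page) {
				/* Unwind the pages allocated so far. */
				while (--bv >= bio->bi_io_vec)
					__free_page(bv->bv_page);
				return -ENOMEM;
			}
		}

		return 0;
	}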
Re: [PATCH 05/12] bcache: convert to bioset_init()/mempool_init()
On 2018/5/21 6:25 AM, Kent Overstreet wrote:
> Signed-off-by: Kent Overstreet <kent.overstr...@gmail.com>

Hi Kent,

This change looks good to me,

Reviewed-by: Coly Li <col...@suse.de>

Thanks.

Coly Li

> [...]
[PATCH 05/12] bcache: convert to bioset_init()/mempool_init()
Signed-off-by: Kent Overstreet <kent.overstr...@gmail.com>
---
 drivers/md/bcache/bcache.h  | 10 +-
 drivers/md/bcache/bset.c    | 13 -
 drivers/md/bcache/bset.h    |  2 +-
 drivers/md/bcache/btree.c   |  4 ++--
 drivers/md/bcache/io.c      |  4 ++--
 drivers/md/bcache/request.c | 18 +-
 drivers/md/bcache/super.c   | 38 ++---
 7 files changed, 37 insertions(+), 52 deletions(-)

diff --git a/drivers/md/bcache/bcache.h b/drivers/md/bcache/bcache.h
index 3a0cfb237a..3050438761 100644
--- a/drivers/md/bcache/bcache.h
+++ b/drivers/md/bcache/bcache.h
@@ -269,7 +269,7 @@ struct bcache_device {
 	atomic_t		*stripe_sectors_dirty;
 	unsigned long		*full_dirty_stripes;
 
-	struct bio_set		*bio_split;
+	struct bio_set		bio_split;
 
 	unsigned		data_csum:1;
 
@@ -528,9 +528,9 @@ struct cache_set {
 	struct closure		sb_write;
 	struct semaphore	sb_write_mutex;
 
-	mempool_t		*search;
-	mempool_t		*bio_meta;
-	struct bio_set		*bio_split;
+	mempool_t		search;
+	mempool_t		bio_meta;
+	struct bio_set		bio_split;
 
 	/* For the btree cache */
 	struct shrinker		shrink;
@@ -655,7 +655,7 @@ struct cache_set {
 	 * A btree node on disk could have too many bsets for an iterator to fit
 	 * on the stack - have to dynamically allocate them
 	 */
-	mempool_t		*fill_iter;
+	mempool_t		fill_iter;
 
 	struct bset_sort_state	sort;
 
diff --git a/drivers/md/bcache/bset.c b/drivers/md/bcache/bset.c
index 579c696a5f..f3403b45bc 100644
--- a/drivers/md/bcache/bset.c
+++ b/drivers/md/bcache/bset.c
@@ -1118,8 +1118,7 @@ struct bkey *bch_btree_iter_next_filter(struct btree_iter *iter,
 
 void bch_bset_sort_state_free(struct bset_sort_state *state)
 {
-	if (state->pool)
-		mempool_destroy(state->pool);
+	mempool_exit(&state->pool);
 }
 
 int bch_bset_sort_state_init(struct bset_sort_state *state, unsigned page_order)
@@ -1129,11 +1128,7 @@ int bch_bset_sort_state_init(struct bset_sort_state *state, unsigned page_order)
 	state->page_order = page_order;
 	state->crit_factor = int_sqrt(1 << page_order);
 
-	state->pool = mempool_create_page_pool(1, page_order);
-	if (!state->pool)
-		return -ENOMEM;
-
-	return 0;
+	return mempool_init_page_pool(&state->pool, 1, page_order);
 }
 EXPORT_SYMBOL(bch_bset_sort_state_init);
 
@@ -1191,7 +1186,7 @@ static void __btree_sort(struct btree_keys *b, struct btree_iter *iter,
 
 		BUG_ON(order > state->page_order);
 
-		outp = mempool_alloc(state->pool, GFP_NOIO);
+		outp = mempool_alloc(&state->pool, GFP_NOIO);
 		out = page_address(outp);
 		used_mempool = true;
 		order = state->page_order;
@@ -1220,7 +1215,7 @@ static void __btree_sort(struct btree_keys *b, struct btree_iter *iter,
 	}
 
 	if (used_mempool)
-		mempool_free(virt_to_page(out), state->pool);
+		mempool_free(virt_to_page(out), &state->pool);
 	else
 		free_pages((unsigned long) out, order);
 
diff --git a/drivers/md/bcache/bset.h b/drivers/md/bcache/bset.h
index 0c24280f3b..b867f22004 100644
--- a/drivers/md/bcache/bset.h
+++ b/drivers/md/bcache/bset.h
@@ -347,7 +347,7 @@ static inline struct bkey *bch_bset_search(struct btree_keys *b,
 /* Sorting */
 
 struct bset_sort_state {
-	mempool_t		*pool;
+	mempool_t		pool;
 
 	unsigned		page_order;
 	unsigned		crit_factor;
diff --git a/drivers/md/bcache/btree.c b/drivers/md/bcache/btree.c
index 17936b2dc7..2a0968c04e 100644
--- a/drivers/md/bcache/btree.c
+++ b/drivers/md/bcache/btree.c
@@ -204,7 +204,7 @@ void bch_btree_node_read_done(struct btree *b)
 	struct bset *i = btree_bset_first(b);
 	struct btree_iter *iter;
 
-	iter = mempool_alloc(b->c->fill_iter, GFP_NOIO);
+	iter = mempool_alloc(&b->c->fill_iter, GFP_NOIO);
 	iter->size = b->c->sb.bucket_size / b->c->sb.block_size;
 	iter->used = 0;
 
@@ -271,7 +271,7 @@ void bch_btree_node_read_done(struct btree *b)
 	bch_bset_init_next(&b->keys, write_block(b), bset_magic(&b->c->sb));
 out:
-	mempool_free(iter, b->c->fill_iter);
+	mempool_free(iter, &b->c->fill_iter);
 	return;
 err:
 	set_btree_node_io_error(b);
diff --git a/drivers/md/bcache/io.c b/drivers/md/bcache/io.c
index 2ddf8515e6..9612873afe 100644
--- a/drivers/md/bcache/io.c
+++ b/drivers/md/bcache/io.c
@@ -17,12 +17,12 @@ void bch_bbio_free(struct bio *bio, str
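[As background to the conversion above: mempool_init()/mempool_exit() and bioset_init()/bioset_exit() operate on caller-embedded objects, while the older mempool_create_*()/mempool_destroy() pair heap-allocates the pool itself, adding a pointer to manage and an extra allocation failure case. A minimal sketch of the two styles follows; struct old_ctx/new_ctx and the init/free helpers are hypothetical names, not code from the patch.]

	#include <linux/mempool.h>

	struct old_ctx { mempool_t *pool; };	/* pool is a separate allocation */
	struct new_ctx { mempool_t pool; };	/* pool lives inside its owner */

	static int old_ctx_init(struct old_ctx *c)
	{
		c->pool = mempool_create_page_pool(1, 0);
		return c->pool ? 0 : -ENOMEM;	/* extra failure case to handle */
	}

	static void old_ctx_free(struct old_ctx *c)
	{
		if (c->pool)			/* NULL check needed before destroy */
			mempool_destroy(c->pool);
	}

	static int new_ctx_init(struct new_ctx *c)
	{
		/* Initializes the embedded pool in place; no pointer to manage. */
		return mempool_init_page_pool(&c->pool, 1, 0);
	}

	static void new_ctx_free(struct new_ctx *c)
	{
		/* mempool_exit() is documented as safe on a zeroed,
		 * never-initialized pool as well. */
		mempool_exit(&c->pool);
	}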
Re: [PATCH 08/10] bcache: move closures to lib/
On Fri, May 18, 2018 at 03:49:13AM -0400, Kent Overstreet wrote:
> Prep work for bcachefs - being a fork of bcache it also uses closures

Hell no.  This code needs to go away and not actually be promoted to lib/.
[PATCH 07/10] bcache: optimize continue_at_nobarrier()
Signed-off-by: Kent Overstreet <kent.overstr...@gmail.com>
---
 drivers/md/bcache/closure.h | 11 ---
 1 file changed, 8 insertions(+), 3 deletions(-)

diff --git a/drivers/md/bcache/closure.h b/drivers/md/bcache/closure.h
index 3b9dfc9962..2392a46bcd 100644
--- a/drivers/md/bcache/closure.h
+++ b/drivers/md/bcache/closure.h
@@ -244,7 +244,7 @@ static inline void closure_queue(struct closure *cl)
 		     != offsetof(struct work_struct, func));
 	if (wq) {
 		INIT_WORK(&cl->work, cl->work.func);
-		BUG_ON(!queue_work(wq, &cl->work));
+		queue_work(wq, &cl->work);
 	} else
 		cl->fn(cl);
 }
@@ -337,8 +337,13 @@ do {									\
  */
 #define continue_at_nobarrier(_cl, _fn, _wq)				\
 do {									\
-	set_closure_fn(_cl, _fn, _wq);					\
-	closure_queue(_cl);						\
+	closure_set_ip(_cl);						\
+	if (_wq) {							\
+		INIT_WORK(&(_cl)->work, (void *) _fn);			\
+		queue_work((_wq), &(_cl)->work);			\
+	} else {							\
+		(_fn)(_cl);						\
+	}								\
 } while (0)
 
 /**
-- 
2.17.0
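[For context, as I read the patch: continue_at_nobarrier() previously went through set_closure_fn(), which ends with a memory barrier to order the fn/wq stores against refcount drops from other threads; the open-coded version skips that barrier for the case where nobody else can touch the closure. A hedged usage sketch follows; struct my_op, my_wq, and the two callbacks are illustrative names, not code from bcache.]

	#include <linux/workqueue.h>
	#include "closure.h"	/* still in drivers/md/bcache at this point */

	static struct workqueue_struct *my_wq;

	struct my_op {
		struct closure	cl;
		/* ... per-operation state ... */
	};

	static void my_op_finish(struct closure *cl)
	{
		/* ... finish up via container_of(cl, struct my_op, cl) ... */
		closure_return(cl);	/* hand back to the parent, set up elsewhere */
	}

	static void my_op_step(struct closure *cl)
	{
		/*
		 * Nobody else can observe this closure between here and
		 * my_op_finish() running, so the ordering barrier implied by
		 * set_closure_fn()/continue_at() is unnecessary.
		 */
		continue_at_nobarrier(cl, my_op_finish, my_wq);
	}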
[PATCH 08/10] bcache: move closures to lib/
Prep work for bcachefs - being a fork of bcache it also uses closures

Signed-off-by: Kent Overstreet <kent.overstr...@gmail.com>
---
 drivers/md/bcache/Kconfig                      | 10 +-
 drivers/md/bcache/Makefile                     |  6 +++---
 drivers/md/bcache/bcache.h                     |  2 +-
 drivers/md/bcache/super.c                      |  1 -
 drivers/md/bcache/util.h                       |  3 +--
 {drivers/md/bcache => include/linux}/closure.h | 17 -
 lib/Kconfig                                    |  3 +++
 lib/Kconfig.debug                              |  9 +
 lib/Makefile                                   |  2 ++
 {drivers/md/bcache => lib}/closure.c           | 17 -
 10 files changed, 36 insertions(+), 34 deletions(-)
 rename {drivers/md/bcache => include/linux}/closure.h (97%)
 rename {drivers/md/bcache => lib}/closure.c (95%)

diff --git a/drivers/md/bcache/Kconfig b/drivers/md/bcache/Kconfig
index 4d200883c5..45f1094c08 100644
--- a/drivers/md/bcache/Kconfig
+++ b/drivers/md/bcache/Kconfig
@@ -1,6 +1,7 @@
 config BCACHE
 	tristate "Block device as cache"
+	select CLOSURES
 	---help---
 	Allows a block device to be used as cache for other devices; uses
 	a btree for indexing and the layout is optimized for SSDs.
@@ -15,12 +16,3 @@ config BCACHE_DEBUG
 	Enables extra debugging tools, allows expensive runtime checks to be
 	turned on.
-
-config BCACHE_CLOSURES_DEBUG
-	bool "Debug closures"
-	depends on BCACHE
-	select DEBUG_FS
-	---help---
-	Keeps all active closures in a linked list and provides a debugfs
-	interface to list them, which makes it possible to see asynchronous
-	operations that get stuck.

diff --git a/drivers/md/bcache/Makefile b/drivers/md/bcache/Makefile
index d26b351958..2b790fb813 100644
--- a/drivers/md/bcache/Makefile
+++ b/drivers/md/bcache/Makefile
@@ -2,8 +2,8 @@
 
 obj-$(CONFIG_BCACHE)	+= bcache.o
 
-bcache-y		:= alloc.o bset.o btree.o closure.o debug.o extents.o\
-	io.o journal.o movinggc.o request.o stats.o super.o sysfs.o trace.o\
-	util.o writeback.o
+bcache-y		:= alloc.o bset.o btree.o debug.o extents.o io.o\
+	journal.o movinggc.o request.o stats.o super.o sysfs.o trace.o util.o\
+	writeback.o
 
 CFLAGS_request.o	+= -Iblock

diff --git a/drivers/md/bcache/bcache.h b/drivers/md/bcache/bcache.h
index 12e5197f18..d954dc44dd 100644
--- a/drivers/md/bcache/bcache.h
+++ b/drivers/md/bcache/bcache.h
@@ -180,6 +180,7 @@
 
 #include <linux/bcache.h>
 #include <linux/bio.h>
+#include <linux/closure.h>
 #include <linux/kobject.h>
 #include <linux/list.h>
 #include <linux/mutex.h>
@@ -191,7 +192,6 @@
 
 #include "bset.h"
 #include "util.h"
-#include "closure.h"
 
 struct bucket {
 	atomic_t	pin;
diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c
index f2273143b3..5f1ac8e0a3 100644
--- a/drivers/md/bcache/super.c
+++ b/drivers/md/bcache/super.c
@@ -2148,7 +2148,6 @@ static int __init bcache_init(void)
 	mutex_init(&bch_register_lock);
 	init_waitqueue_head(&unregister_wait);
 	register_reboot_notifier(&reboot);
-	closure_debug_init();
 
 	bcache_major = register_blkdev(0, "bcache");
 	if (bcache_major < 0) {
diff --git a/drivers/md/bcache/util.h b/drivers/md/bcache/util.h
index a6763db7f0..a75523ed0d 100644
--- a/drivers/md/bcache/util.h
+++ b/drivers/md/bcache/util.h
@@ -4,6 +4,7 @@
 #define _BCACHE_UTIL_H
 
 #include <linux/blkdev.h>
+#include <linux/closure.h>
 #include <linux/errno.h>
 #include <linux/kernel.h>
 #include <linux/sched/clock.h>
@@ -12,8 +13,6 @@
 #include <linux/vmalloc.h>
 #include <linux/workqueue.h>
 
-#include "closure.h"
-
 #define PAGE_SECTORS		(PAGE_SIZE / 512)
 
 struct closure;
diff --git a/drivers/md/bcache/closure.h b/include/linux/closure.h
similarity index 97%
rename from drivers/md/bcache/closure.h
rename to include/linux/closure.h
index 2392a46bcd..1072bf2c13 100644
--- a/drivers/md/bcache/closure.h
+++ b/include/linux/closure.h
@@ -154,7 +154,7 @@ struct closure {
 
 	atomic_t		remaining;
 
-#ifdef CONFIG_BCACHE_CLOSURES_DEBUG
+#ifdef CONFIG_DEBUG_CLOSURES
 #define CLOSURE_MAGIC_DEAD	0xc054dead
 #define CLOSURE_MAGIC_ALIVE	0xc054a11e
 
@@ -183,15 +183,13 @@ static inline void closure_sync(struct closure *cl)
 		__closure_sync(cl);
 }
 
-#ifdef CONFIG_BCACHE_CLOSURES_DEBUG
+#ifdef CONFIG_DEBUG_CLOSURES
 
-void closure_debug_init(void);
 void closure_debug_create(struct closure *cl);
 void closure_debug_destroy(struct closure *cl);
 
 #else
 
-static inline void closure_debug_init(void) {}
 static inline void closure_debug_create(struct closure *cl) {}
 static inline void closure_debug_destroy(struct closure *cl) {}
 
@@ -199,21 +197,21 @@ static inline void closure_debug_destroy(struct closure *cl) {}
 
 static inline void closure_set_ip(struct closure *cl)
 {
-#ifdef CONFIG_BCACHE_CLOSURES_DEBUG
+#ifdef CONFIG_DEBUG_CLOSURES
 	cl->ip =
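[For readers who have not used them: a closure, as moved to lib/ here, is essentially a refcount with a continuation attached; when the count hits zero, the stored function runs, optionally on a workqueue. A rough lifecycle sketch under that model follows; struct my_io, the callbacks, and the submit step are invented for illustration and simplified relative to real bcache users, which typically end with closure_return() to a parent or closure_return_with_destructor().]

	#include <linux/closure.h>
	#include <linux/slab.h>
	#include <linux/workqueue.h>

	struct my_io {
		struct closure	cl;
		/* ... request state ... */
	};

	static void my_io_done(struct closure *cl)
	{
		struct my_io *io = container_of(cl, struct my_io, cl);

		/* Last reference dropped; tear down the standalone closure. */
		closure_debug_destroy(cl);
		kfree(io);
	}

	static void my_io_submit(struct my_io *io)
	{
		closure_init(&io->cl, NULL);	/* no parent closure */

		/* One ref per async sub-operation; each completion handler
		 * calls closure_put(&io->cl). */
		closure_get(&io->cl);
		/* ... submit a bio here ... */

		/* Drop our own ref; my_io_done() runs on system_wq once the
		 * last sub-operation completes. */
		continue_at(&io->cl, my_io_done, system_wq);
	}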
Re: Kernel 4.14 RAID5 multi disk array on bcache not mounting
On 11/21/17 23:22, Lionel Bouton wrote:
> On 21/11/2017 at 23:04, Andy Leadbetter wrote:
>> I have a 4 disk array on top of 120GB bcache setup, arranged as follows
> [...]
>> Upgraded today to 4.14.1 from their PPA and the
>
> 4.14 and 4.14.1 have a nasty bug affecting bcache users. See for example:
> https://www.reddit.com/r/linux/comments/7eh2oz/serious_regression_in_linux_414_using_bcache_can/

4.14.2 (just out as rc1) will have the fix.

-h
Re: Kernel 4.14 RAID5 multi disk array on bcache not mounting
On 21/11/2017 at 23:04, Andy Leadbetter wrote:
> I have a 4 disk array on top of 120GB bcache setup, arranged as follows
[...]
> Upgraded today to 4.14.1 from their PPA and the

4.14 and 4.14.1 have a nasty bug affecting bcache users. See for example:
https://www.reddit.com/r/linux/comments/7eh2oz/serious_regression_in_linux_414_using_bcache_can/

Lionel
Kernel 4.14 RAID5 multi disk array on bcache not mounting
I have a 4 disk array on top of 120GB bcache setup, arranged as follows:

/dev/sda1: UUID="42AE-12E3" TYPE="vfat" PARTLABEL="EFI System" PARTUUID="d337c56a-fb0f-4e87-8d5f-a89122c81167"
/dev/sda2: UUID="06e3ce52-f34a-409a-a143-3c04f1d334ff" TYPE="ext4" PARTLABEL="Linux filesystem" PARTUUID="d2d3fa93-eebf-41ab-8162-d81722bf47ec"
/dev/sda4: UUID="b729c490-81f0-461f-baa2-977af9a7b6d9" TYPE="bcache" PARTLABEL="Linux filesystem" PARTUUID="84548857-f504-440a-857f-c0838c1eb83d"
/dev/sdb1: UUID="6016277c-143d-46b4-ae4e-8565ffc8158f" TYPE="swap" PARTLABEL="Linux swap" PARTUUID="8692bf67-7271-4bf6-a623-b79d74093f2c"
/dev/sdb2: UUID="bc93c5e2-705a-4cbe-bcd9-7be1181163b2" TYPE="bcache" PARTLABEL="Linux filesystem" PARTUUID="662a450b-3592-4929-9647-8e8a1dedae69"
/dev/sdc1: UUID="9df21d4e-de02-4000-b684-5fb95d4d0492" TYPE="swap" PARTLABEL="Linux swap" PARTUUID="ed9d7b8e-5480-4e70-b983-1a350ecae38a"
/dev/sdc2: UUID="7d8feaf6-aa6a-4b13-af49-0ad1bd1efb64" TYPE="bcache" PARTLABEL="Linux filesystem" PARTUUID="d343e23a-39ed-4061-80a2-55b66e20ecc1"
/dev/sdd1: UUID="18defba2-594b-402e-b3b2-8e38035c624d" TYPE="swap" PARTLABEL="Linux swap" PARTUUID="fed9ffd6-0480-4496-8e6d-02d263d719b7"
/dev/sdd2: UUID="be0f0381-0d7e-46c9-ad04-01415bfc6f61" TYPE="bcache" PARTLABEL="Linux filesystem" PARTUUID="8f56de8a-105f-4d56-b699-59e1215b3c6b"
/dev/bcache32: UUID="38d5de43-28fb-40a9-a535-dbf17ff52e75" UUID_SUB="731c31f1-51dd-477a-9bd1-fac73d0e6f69" TYPE="btrfs"
/dev/sde: UUID="05514ad3-d90a-4e90-aa11-7c6d34515ca2" TYPE="bcache"
/dev/bcache16: UUID="38d5de43-28fb-40a9-a535-dbf17ff52e75" UUID_SUB="79cbcaf1-40b9-4954-a977-537ed3310e76" TYPE="btrfs"
/dev/bcache0: UUID="38d5de43-28fb-40a9-a535-dbf17ff52e75" UUID_SUB="42d3a0dd-fbec-4318-9a5b-6d96aa1f6328" TYPE="btrfs"
/dev/bcache48: UUID="38d5de43-28fb-40a9-a535-dbf17ff52e75" UUID_SUB="cb7018d6-a27d-493e-b41f-e45c64f6873a" TYPE="btrfs"
/dev/sda3: PARTUUID="d9fa3100-5044-4e10-9f2f-f8037786a43f"

ubuntu 17.10 with PPA Kernels up to 4.13.x all mount this array perfectly,
and the performance of the cache is as expected. Upgraded today to 4.14.1
from their PPA, and now running btrfs dev scan finds the btrfs filesystem
devices bcache16 and bcache32; bcache0 and bcache48 are not recognised, and
thus the file system will not mount. According to bcache, all devices are
present and attached to the cache device correctly.

btrfs fi show on Kernel 4.13 gives:

Label: none  uuid: 38d5de43-28fb-40a9-a535-dbf17ff52e75
	Total devices 4 FS bytes used 2.03TiB
	devid 1 size 1.82TiB used 1.07TiB path /dev/bcache16
	devid 2 size 1.82TiB used 1.07TiB path /dev/bcache32
	devid 3 size 1.82TiB used 1.07TiB path /dev/bcache0
	devid 4 size 1.82TiB used 1.07TiB path /dev/bcache48

Where do I start in debugging this?

btrfs-progs v4.12

btrfs fi df /
Data, RAID5: total=3.20TiB, used=2.02TiB
System, RAID5: total=192.00MiB, used=288.00KiB
Metadata, RAID5: total=6.09GiB, used=3.69GiB
GlobalReserve, single: total=512.00MiB, used=0.00B

There are no errors in the dmesg that I can see from btrfs scan; simply
the two devices are not found.
Re: Give up on bcache?
On 2017-09-26 18:46, Ferry Toth wrote:
> On Tue, 26 Sep 2017 15:52:44 -0400, Austin S. Hemmelgarn wrote:
>> On 2017-09-26 12:50, Ferry Toth wrote:
>>> Looking at the Phoronix benchmark here:
>>>
>>> https://www.phoronix.com/scan.php?page=article&item=linux414-bcache-raid&num=2
>>>
>>> I think it might be idle hopes to think bcache can be used as a ssd
>>> cache for btrfs to significantly improve performance.. True, the
>>> benchmark is using ext.
>> It's a benchmark. They're inherently synthetic and workload specific,
>> and therefore should not be trusted to represent things accurately for
>> arbitrary use cases.
> So what. A decent benchmark tries to measure a specific aspect of the fs.
Yes, and it usually measures it using a ridiculously unrealistic workload.
Some of the benchmarks in iozone are a good example of this, like the
backwards read one (there is nearly nothing that it provides any useful
data for). For a benchmark to be meaningful, you have to test what you
actually intend to use, and from a practical perspective, that article is
primarily testing throughput, which is not something you should be using
SSD caching for.
> I think you agree that applications doing lots of fsyncs (databases,
> dpkg) are slow on btrfs especially on hdd's, whatever way you measure
> that (it feels slow, it measures slow, it really is slow).
Yes, but they're also slow on _everything_. fsync() is slow. Period. It's
just more of an issue on BTRFS because it's a CoW filesystem _and_ it's
slower than ext4 even with that CoW layer bypassed.
> On a ssd the problem is less.
And most of that is a result of the significantly higher bulk throughput
on the SSD, which is not something that SSD caching replicates.
> So if you can fix that by using a ssd cache or a hybrid solution, how
> would you like to compare that? It _feels_ faster?
That depends. If it's on a desktop, then that actually is one of the best
ways to test it, since user perception is your primary quality metric (you
can make the fastest system in the world, but if the user can't tell,
you've gained nothing). If you're on anything else, you test the actual
workload if possible, and a benchmark that tries to replicate the workload
if not. Put another way, if you're building a PGSQL server, you should be
bench-marking things with a PGSQL bench-marking tool, not some arbitrary
benchmark that likely won't replicate a PGSQL workload.
>>> But the most important one (where btrfs always shows to be a little
>>> slow) would be the SQLite test. And with ext at least performance
>>> _degrades_ except for the Writeback mode, and even there is nowhere
>>> near what the SSD is capable of.
>> And what makes you think it will be? You're using it as a hot-data
>> cache, not a dedicated write-back cache, and you have the overhead from
>> bcache itself too. Just some simple math based on examining the bcache
>> code suggests you can't get better than about 98% of the SSD's
>> performance if you're lucky, and I'd guess it's more like 80% most of
>> the time.
>>> I think with btrfs it will be even worse and that it is a fundamental
>>> problem: caching is complex and the cache can not know how the data on
>>> the fs is used.
>> Actually, the improvement from using bcache with BTRFS is higher
>> proportionate to the baseline of not using it by a small margin than it
>> is when used with ext4. BTRFS does a lot more with the disk, so you
>> have a lot more time spent accessing the disk, and thus more time that
>> can be reduced by improving disk performance. While the CoW nature of
>> BTRFS does somewhat mitigate the performance improvement from using
>> bcache, it does not completely negate it.
> I would like to reverse this, how much degradation do you suffer from
> btrfs on a ssd as baseline compared to btrfs on a mixed ssd/hdd system.
Performance-wise? It's workload dependent, but in most cases it's a hit
regardless of whether you're using BTRFS or some other filesystem. If
instead you're asking what the difference in device longevity is, you can
probably expect the SSD to wear out faster in the second case. Unless you
have a reasonably big SSD and are using write-around caching, every write
will hit the SSD too, and you'll end up with lots of rewrites on the SSD.
> IMHO you are hoping to get ssd performance at hdd cost.
Then you're looking at the wrong tool. The primary use cases for SSD
caching are smoothing latency and improving interactivity by reducing head
movement. Any other measure of performance is pretty much guaranteed to be
worse with SSD caching than just using an SSD, and bulk throughput is often
just as bad as, if not worse than, using a regular HDD by itself. If you
are that desperate for performance like an SSD, quit whining about cost and
just buy an SSD. Decent ones are down to less than 0.40 USD per GB
depending on the brand (search 'Crucial MX300' on Amazon if you want an
example), so the cost isn't nearly as bad as people make it out to be,
especially considering that most of the time a normal person who isn't
doing multimedia work
Re: Give up on bcache?
On Tue, 26 Sep 2017 15:52:44 -0400, Austin S. Hemmelgarn wrote:
> On 2017-09-26 12:50, Ferry Toth wrote:
>> Looking at the Phoronix benchmark here:
>>
>> https://www.phoronix.com/scan.php?page=article&item=linux414-bcache-raid&num=2
>>
>> I think it might be idle hopes to think bcache can be used as a ssd
>> cache for btrfs to significantly improve performance.. True, the
>> benchmark is using ext.
> It's a benchmark. They're inherently synthetic and workload specific,
> and therefore should not be trusted to represent things accurately for
> arbitrary use cases.

So what. A decent benchmark tries to measure a specific aspect of the fs.

I think you agree that applications doing lots of fsyncs (databases, dpkg)
are slow on btrfs especially on hdd's, whatever way you measure that (it
feels slow, it measures slow, it really is slow).

On a ssd the problem is less.

So if you can fix that by using a ssd cache or a hybrid solution, how
would you like to compare that? It _feels_ faster?

>> But the most important one (where btrfs always shows to be a little
>> slow) would be the SQLite test. And with ext at least performance
>> _degrades_ except for the Writeback mode, and even there is nowhere
>> near what the SSD is capable of.
> And what makes you think it will be? You're using it as a hot-data
> cache, not a dedicated write-back cache, and you have the overhead from
> bcache itself too. Just some simple math based on examining the bcache
> code suggests you can't get better than about 98% of the SSD's
> performance if you're lucky, and I'd guess it's more like 80% most of
> the time.
>>
>> I think with btrfs it will be even worse and that it is a fundamental
>> problem: caching is complex and the cache can not know how the data on
>> the fs is used.
> Actually, the improvement from using bcache with BTRFS is higher
> proportionate to the baseline of not using it by a small margin than it
> is when used with ext4. BTRFS does a lot more with the disk, so you
> have a lot more time spent accessing the disk, and thus more time that
> can be reduced by improving disk performance. While the CoW nature of
> BTRFS does somewhat mitigate the performance improvement from using
> bcache, it does not completely negate it.

I would like to reverse this, how much degradation do you suffer from
btrfs on a ssd as baseline compared to btrfs on a mixed ssd/hdd system.

IMHO you are hoping to get ssd performance at hdd cost.

>> I think the original idea of hot data tracking has a much better chance
>> to significantly improve performance. This of course as the SSD's and
>> HDD's then will be equal citizens and btrfs itself gets to decide on
>> which drive the data is best stored.
> First, the user needs to decide, not BTRFS (at least, by default, BTRFS
> should not be involved in the decision). Second, tiered storage (that's
> what that's properly called) is mostly orthogonal to caching (though
> bcache and dm-cache behave like tiered storage once the cache is
> warmed).

So, on your desktop you really are going to search for all sqlite, mysql
and psql files, dpkg files etc. and move them to the ssd? You can already
do that. Go ahead! The big win would be if the file system does that
automatically for you.

>> With this implemented right, it would also finally silence the never
>> ending discussion why not btrfs and why zfs, ext, xfs etc. Which would
>> be a plus by its own right.
> Even with this, there would still be plenty of reasons to pick one of
> those filesystems over BTRFS. There would however be one more reason to
> pick BTRFS over ext or XFS (but not necessarily ZFS, it already has
> caching built in).

Exactly, one more advantage of btrfs and one less of zfs.
Re: Give up on bcache?
On Tue, Sep 26, 2017 at 11:33:19PM +0500, Roman Mamedov wrote:
> On Tue, 26 Sep 2017 16:50:00 +0000 (UTC)
> Ferry Toth <ft...@telfort.nl> wrote:
>
> > https://www.phoronix.com/scan.php?page=article&item=linux414-bcache-raid&num=2
> >
> > I think it might be idle hopes to think bcache can be used as a ssd cache
> > for btrfs to significantly improve performance..
>
> My personal real-world experience shows that SSD caching -- with lvmcache --
> does indeed significantly improve performance of a large Btrfs filesystem
> with slowish base storage.
>
> And that article, sadly, only demonstrates once again the general mediocre
> quality of Phoronix content: it is an astonishing oversight to not check out
> lvmcache in the same setup, to at least try to draw some useful conclusion:
> is it Bcache that is strangely deficient, or does SSD caching as a general
> concept not work well in the hardware setup utilized?

Also, it looks as if Phoronix' tests don't stress metadata at all. Btrfs is
all about metadata, and speeding it up greatly helps most workloads.

A pipe-dream wishlist would be:
* store and access the master copy of metadata on SSD only
* pin all data blocks referenced by generations not yet mirrored
* slowly copy over metadata to HDD

-- 
⢀⣴⠾⠻⢶⣦⠀ We domesticated dogs 36000 years ago; together we chased
⣾⠁⢰⠒⠀⣿⡁ animals, hung out and licked or scratched our private parts.
⢿⡄⠘⠷⠚⠋⠀ Cats domesticated us 9500 years ago, and immediately we got
⠈⠳⣄ agriculture, towns then cities.  -- whitroth on /.
Re: Give up on bcache?
On 2017-09-26 12:50, Ferry Toth wrote:
> Looking at the Phoronix benchmark here:
>
> https://www.phoronix.com/scan.php?page=article&item=linux414-bcache-raid&num=2
>
> I think it might be idle hopes to think bcache can be used as a ssd cache
> for btrfs to significantly improve performance.. True, the benchmark is
> using ext.
It's a benchmark. They're inherently synthetic and workload specific, and
therefore should not be trusted to represent things accurately for
arbitrary use cases.
> But the most important one (where btrfs always shows to be a little slow)
> would be the SQLite test. And with ext at least performance _degrades_
> except for the Writeback mode, and even there is nowhere near what the
> SSD is capable of.
And what makes you think it will be? You're using it as a hot-data cache,
not a dedicated write-back cache, and you have the overhead from bcache
itself too. Just some simple math based on examining the bcache code
suggests you can't get better than about 98% of the SSD's performance if
you're lucky, and I'd guess it's more like 80% most of the time.
> I think with btrfs it will be even worse and that it is a fundamental
> problem: caching is complex and the cache can not know how the data on
> the fs is used.
Actually, the improvement from using bcache with BTRFS is higher
proportionate to the baseline of not using it by a small margin than it is
when used with ext4. BTRFS does a lot more with the disk, so you have a lot
more time spent accessing the disk, and thus more time that can be reduced
by improving disk performance. While the CoW nature of BTRFS does somewhat
mitigate the performance improvement from using bcache, it does not
completely negate it.
> I think the original idea of hot data tracking has a much better chance
> to significantly improve performance. This of course as the SSD's and
> HDD's then will be equal citizens and btrfs itself gets to decide on
> which drive the data is best stored.
First, the user needs to decide, not BTRFS (at least, by default, BTRFS
should not be involved in the decision). Second, tiered storage (that's
what that's properly called) is mostly orthogonal to caching (though bcache
and dm-cache behave like tiered storage once the cache is warmed).
> With this implemented right, it would also finally silence the never
> ending discussion why not btrfs and why zfs, ext, xfs etc. Which would
> be a plus by its own right.
Even with this, there would still be plenty of reasons to pick one of those
filesystems over BTRFS. There would however be one more reason to pick
BTRFS over ext or XFS (but not necessarily ZFS, it already has caching
built in).
Re: Give up on bcache?
On Tue, 26 Sep 2017 23:33:19 +0500, Roman Mamedov <r...@romanrm.net> wrote:

> On Tue, 26 Sep 2017 16:50:00 +0000 (UTC)
> Ferry Toth <ft...@telfort.nl> wrote:
>
> > https://www.phoronix.com/scan.php?page=article&item=linux414-bcache-raid&num=2
> >
> > I think it might be idle hopes to think bcache can be used as a ssd
> > cache for btrfs to significantly improve performance..
>
> My personal real-world experience shows that SSD caching -- with
> lvmcache -- does indeed significantly improve performance of a large
> Btrfs filesystem with slowish base storage.
>
> And that article, sadly, only demonstrates once again the general
> mediocre quality of Phoronix content: it is an astonishing oversight
> to not check out lvmcache in the same setup, to at least try to draw
> some useful conclusion, is it Bcache that is strangely deficient, or
> SSD caching as a general concept does not work well in the hardware
> setup utilized.

Bcache is actually not meant to increase benchmark performance except for
very few corner cases. It is designed to improve interactivity and
perceived performance, reducing head movements. On the bcache homepage
there are actually tips on how to benchmark bcache correctly, including a
warm-up phase and turning on sequential caching. Phoronix doesn't do that;
they test default settings, which is imho a good thing, but you should
know the consequences and research how to turn the knobs.

Depending on the caching mode and cache size, the SQLite test may not show
real-world numbers.

Also, you should optimize some btrfs options to work correctly with
bcache, e.g. force it to mount "nossd" as it detects the bcache device as
SSD - which is wrong for some workloads, I think especially desktop
workloads and most server workloads.

Also, you may want to tune udev to correct some attributes so other
applications can do their detection and behavior correctly, too:

$ cat /etc/udev/rules.d/00-ssd-scheduler.rules
ACTION=="add|change", KERNEL=="bcache*", ATTR{queue/rotational}="1"
ACTION=="add|change", KERNEL=="sd[a-z]", ATTR{queue/rotational}=="0", ATTR{queue/iosched/slice_idle}="0"
ACTION=="add|change", KERNEL=="sd[a-z]", ATTR{queue/rotational}=="0", ATTR{queue/scheduler}="kyber"
ACTION=="add|change", KERNEL=="sd[a-z]", ATTR{queue/rotational}=="1", ATTR{queue/scheduler}="bfq"

Take note: on a non-mq system you may want to use noop/deadline/cfq
instead of kyber/bfq.

I've been running bcache for over two years now and the performance
improvement is very, very high, with boot times going down to 30-40s from
3+ minutes previously, faster app startup times (almost instant, like on
SSD), reduced noise from fewer head movements, etc. Also, it has easy
setup (no split metadata/data cache, you can attach more than one device
to a single cache), and it is rock-solid even when crashing the system.

Bcache learns by using LRU for caching: what you don't need will be pushed
out of cache over time; what you use stays. This is actually a lot like
"hot data caching". Given a big enough cache, everything of your daily
needs would stay in cache, easily achieving hit ratios around 90%. Since
sequential access is bypassed, you don't have to worry about flushing the
cache with large copy operations.

My system uses a 512G SSD with 400G dedicated to bcache, attached to
3x 1TB HDD draid0 mraid1 btrfs, filled with 2TB of net data and daily
backups using borgbackup. Bcache runs in writeback mode; the backup takes
around 15 minutes each night to dig through all data and stores it to an
internal intermediate backup, also on bcache (xfs, write-around mode).
Currently not implemented, this intermediate backup will later be mirrored
to an external, off-site location. Some of the rest of the SSD is EFI-ESP,
some swap space, and an over-provisioned area to keep bcache performance
high.

$ uptime && bcache-status
21:28:44 up 3 days, 20:38, 3 users, load average: 1,18, 1,44, 2,14
--- bcache ---
UUID                 aacfbcd9-dae5-4377-92d1-6808831a4885
Block Size           4.00 KiB
Bucket Size          512.00 KiB
Congested?           False
Read Congestion      2.0ms
Write Congestion     20.0ms
Total Cache Size     400 GiB
Total Cache Used     400 GiB (100%)
Total Cache Unused   0 B (0%)
Evictable Cache      396 GiB (99%)
Replacement Policy   [lru] fifo random
Cache Mode           (Various)
Total Hits           2364518 (89%)
Total Misses         290764
Total Bypass Hits    4284468 (100%)
Total Bypass Misses  0
Total Bypassed       215 GiB

The bucket size and block size were chosen to best fit with the Samsung
TLC arrangement. But this is pure theory, I
Re: Give up on bcache?
On Tue, 26 Sep 2017 16:50:00 +0000 (UTC)
Ferry Toth <ft...@telfort.nl> wrote:

> https://www.phoronix.com/scan.php?page=article&item=linux414-bcache-raid&num=2
>
> I think it might be idle hopes to think bcache can be used as a ssd cache
> for btrfs to significantly improve performance..

My personal real-world experience shows that SSD caching -- with lvmcache --
does indeed significantly improve performance of a large Btrfs filesystem
with slowish base storage.

And that article, sadly, only demonstrates once again the general mediocre
quality of Phoronix content: it is an astonishing oversight to not check out
lvmcache in the same setup, to at least try to draw some useful conclusion:
is it Bcache that is strangely deficient, or does SSD caching as a general
concept not work well in the hardware setup utilized?

-- 
With respect,
Roman
Give up on bcache?
Looking at the Phoronix benchmark here:

https://www.phoronix.com/scan.php?page=article&item=linux414-bcache-raid&num=2

I think it might be idle hopes to think bcache can be used as a ssd cache
for btrfs to significantly improve performance.. True, the benchmark is
using ext.

But the most important one (where btrfs always shows to be a little slow)
would be the SQLite test. And with ext at least performance _degrades_
except for the Writeback mode, and even there is nowhere near what the SSD
is capable of.

I think with btrfs it will be even worse and that it is a fundamental
problem: caching is complex and the cache can not know how the data on the
fs is used.

I think the original idea of hot data tracking has a much better chance to
significantly improve performance. This of course as the SSD's and HDD's
then will be equal citizens and btrfs itself gets to decide on which drive
the data is best stored.

With this implemented right, it would also finally silence the never
ending discussion why not btrfs and why zfs, ext, xfs etc. Which would be
a plus by its own right.
Re: 4.8.8, bcache deadlock and hard lockup
On Wed, 30 Nov 2016, Marc MERLIN wrote:
> On Wed, Nov 30, 2016 at 03:57:28PM -0800, Eric Wheeler wrote:
> > > I'll start another separate thread with the btrfs folks on how much
> > > pressure is put on the system, but on your side it would be good to help
> > > ensure that bcache doesn't crash the system altogether if too many
> > > requests are allowed to pile up.
> >
> > Try BFQ. It is AWESOME and helps reduce the congestion problem with bulk
> > writes at the request queue on its way to the spinning disk or SSD:
> > http://algo.ing.unimo.it/people/paolo/disk_sched/
> >
> > use the latest BFQ git here, merge it into v4.8.y:
> > https://github.com/linusw/linux-bfq/commits/bfq-v8
> >
> > This doesn't completely fix the dirty_ratio problem, but it is far better
> > than CFQ or deadline in my opinion (and experience).
>
> That's good to know thanks.
> But for my uninformed opinion, is there anything bcache can do to throttle
> incoming requests if they are piling up, or they're coming from producers
> upstream and bcache has no choice but try and process them as quickly as
> possible without a way to block the sender if too many are coming?

Not really. The congestion isn't in bcache, it's at the disk queue beyond
bcache, but userspace processes are blocked by the (huge) pagecache dirty
writeback which happens before bcache gets it and must complete before
userspace may proceed:

	fs -> pagecache -> bcache -> {ssd,disk}

The real issue is that the dirty page cache gets really big, flushes, waits
for downstream devices (bcache->ssd,disk) to finish, and then returns to
userspace. The only way to limit dirty cache are those options that Linus
mentioned. BFQ can help for processes not tied to the flush because it may
re-order other process requests ahead of the big flush---so even though a
big flush is happening and that process is stalled, others might proceed
without delay.

See this thread, too:
https://groups.google.com/forum/#!msg/bfq-iosched/M2M_UhbC05A/hf6Ni9JbAQAJ

--
Eric Wheeler
Re: btrfs flooding the I/O subsystem and hanging the machine, with bcache cache turned off
Is there any fundamental reason not to support huge writeback caches? (I
mean, besides working around bugs and/or questionably poor design choices
which no one wishes to fix.) The obvious drawback is the increased risk of
data loss upon hardware failure or kernel panic, but why couldn't the user
be allowed to draw the line between probability of data loss and potential
performance gains?

The last time I changed hardware, I put double the amount of RAM into my
little home server for the sole reason of using a relatively huge cache,
especially a huge writeback cache. I realized soon enough, though, that
writeback ratios like 20/45 will make the system unstable (OOM reaping)
even if ~90% of the memory is theoretically free = used as some form of
cache, read or write, depending on this ratio parameter, and I ended up
below the default to get rid of The Reaper. My plan was to try and
decrease the fragmentation of files which are created by dumping several
parallel real-time video streams into separate files (and also minimize
the HDD head seeks due to that). (The computer in question is on a UPS.)

On Thu, Dec 1, 2016 at 4:49 PM, Michal Hocko <mho...@kernel.org> wrote:
> On Wed 30-11-16 10:16:53, Marc MERLIN wrote:
> [...]
>
> As much as the dirty_*ratio defaults are a major PITA this is not something
> that would be _easy_ to change without high risks of regressions. [...]
>
> That being said this is something more for IO people than MM IMHO.
>
> --
> Michal Hocko
> SUSE Labs
Re: btrfs flooding the I/O subsystem and hanging the machine, with bcache cache turned off
On Wed 30-11-16 10:16:53, Marc MERLIN wrote:
> +folks from linux-mm thread for your suggestion
>
> On Wed, Nov 30, 2016 at 01:00:45PM -0500, Austin S. Hemmelgarn wrote:
> > > swraid5 < bcache < dmcrypt < btrfs
> > >
> > > Copying with btrfs send/receive causes massive hangs on the system.
> > > Please see this explanation from Linus on why the workaround was
> > > suggested:
> > > https://lkml.org/lkml/2016/11/29/667
> > And Linus' assessment is absolutely correct (at least, the general
> > assessment is, I have no idea about btrfs_start_shared_extent, but I'm
> > more than willing to bet he's correct that that's the culprit).
> > > All of this mostly went away with Linus' suggestion:
> > > echo 2 > /proc/sys/vm/dirty_ratio
> > > echo 1 > /proc/sys/vm/dirty_background_ratio
> > >
> > > But that's hiding the symptom which I think is that btrfs is piling
> > > up too many I/O requests during btrfs send/receive and btrfs scrub
> > > (probably balance too) and not looking at resulting impact to system
> > > health.
> > I see pretty much identical behavior using any number of other storage
> > configurations on a USB 2.0 flash drive connected to a system with 16GB
> > of RAM with the default dirty ratios because it's trying to cache up to
> > 3.2GB of data for writeback. While BTRFS is doing highly sub-optimal
> > things here, the ancient default writeback ratios are just as much a
> > culprit. I would suggest that get changed to 200MB or 20% of RAM,
> > whichever is smaller, which would give overall almost identical behavior
> > to x86-32, which in turn works reasonably well for most cases. I sadly
> > don't have the time, patience, or expertise to write up such a patch
> > myself though.
>
> Dear linux-mm folks, is that something you could consider (changing the
> dirty_ratio defaults) given that it affects at least bcache and btrfs
> (with or without bcache)?

As much as the dirty_*ratio defaults are a major PITA, this is not
something that would be _easy_ to change without high risks of regressions.
This topic has been discussed many times with many good ideas, nothing
really materialized from them though :/

To be honest I really do hate dirty_*ratio and have seen many issues on
very large machines and always suggested to use dirty_bytes instead, but a
particular value has always been a challenge to get right. It has always
been very workload specific.

That being said this is something more for IO people than MM IMHO.

-- 
Michal Hocko
SUSE Labs
Re: 4.8.8, bcache deadlock and hard lockup
On 2016-11-30 19:48, Chris Murphy wrote:
> On Wed, Nov 30, 2016 at 4:57 PM, Eric Wheeler <bca...@lists.ewheeler.net> wrote:
> [...]
>
> There are several threads over the past year with users having problems
> no one else had previously reported, and they were using BFQ. But there's
> no evidence whether BFQ was the cause, or exposing some existing bug that
> another scheduler doesn't. Anyway, I'd say using an out of tree scheduler
> means higher burden of testing and skepticism.

Normally I'd agree on this, but BFQ is a bit of a different situation from
usual because:
1. 90% of the reason that BFQ isn't in mainline is that the block
maintainers have declared the legacy (non blk-mq) code deprecated and
refuse to take anything new there despite having absolutely zero scheduling
in blk-mq.
2. It's been around for years with hundreds of thousands of users over the
years who have had no issues with it.
Re: 4.8.8, bcache deadlock and hard lockup
On Wed, Nov 30, 2016 at 4:57 PM, Eric Wheeler <bca...@lists.ewheeler.net> wrote: > On Wed, 30 Nov 2016, Marc MERLIN wrote: >> +btrfs mailing list, see below why >> >> On Tue, Nov 29, 2016 at 12:59:44PM -0800, Eric Wheeler wrote: >> > On Mon, 27 Nov 2016, Coly Li wrote: >> > > >> > > Yes, too many work queues... I guess the locking might be caused by some >> > > very obscure reference of closure code. I cannot have any clue if I >> > > cannot find a stable procedure to reproduce this issue. >> > > >> > > Hmm, if there is a tool to clone all the meta data of the back end cache >> > > and whole cached device, there might be a method to replay the oops much >> > > easier. >> > > >> > > Eric, do you have any hint ? >> > >> > Note that the backing device doesn't have any metadata, just a superblock. >> > You can easily dd that off onto some other volume without transferring the >> > data. By default, data starts at 8k, or whatever you used in `make-bcache >> > -w`. >> >> Ok, Linus helped me find a workaround for this problem: >> https://lkml.org/lkml/2016/11/29/667 >> namely: >>echo 2 > /proc/sys/vm/dirty_ratio >>echo 1 > /proc/sys/vm/dirty_background_ratio >> (it's a 24GB system, so the defaults of 20 and 10 were creating too many >> requests in th buffers) >> >> Note that this is only a workaround, not a fix. >> >> When I did this and re tried my big copy again, I still got 100+ kernel >> work queues, but apparently the underlying swraid5 was able to unblock >> and satisfy the write requests before too many accumulated and crashed >> the kernel. >> >> I'm not a kernel coder, but seems to me that bcache needs a way to >> throttle incoming requests if there are too many so that it does not end >> up in a state where things blow up due to too many piled up requests. >> >> You should be able to reproduce this by taking 5 spinning rust drives, >> put raid5 on top, dmcrypt, bcache and hopefully any filesystem (although >> I used btrfs) and send lots of requests. >> Actually to be honest, the problems have mostly been happening when I do >> btrfs scrub and btrfs send/receive which both generate I/O from within >> the kernel instead of user space. >> So here, btrfs may be a contributor to the problem too, but while btrfs >> still trashes my system if I remove the caching device on bcache (and >> with the default dirty ratio values), it doesn't crash the kernel. >> >> I'll start another separate thread with the btrfs folks on how much >> pressure is put on the system, but on your side it would be good to help >> ensure that bcache doesn't crash the system altogether if too many >> requests are allowed to pile up. > > > Try BFQ. It is AWESOME and helps reduce the congestion problem with bulk > writes at the request queue on its way to the spinning disk or SSD: > http://algo.ing.unimo.it/people/paolo/disk_sched/ > > use the latest BFQ git here, merge it into v4.8.y: > https://github.com/linusw/linux-bfq/commits/bfq-v8 > > This doesn't completely fix the dirty_ration problem, but it is far better > than CFQ or deadline in my opinion (and experience). There are several threads over the past year with users having problems no one else had previously reported, and they were using BFQ. But there's no evidence whether BFQ was the cause, or exposing some existing bug that another scheduler doesn't. Anyway, I'd say using an out of tree scheduler means higher burden of testing and skepticism. 
-- Chris Murphy
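For readers who want to try the BFQ suggestion from this thread: the I/O scheduler is selected per block device through sysfs. A minimal sketch, assuming a kernel where BFQ is available (on v4.8 that means the out-of-tree branch Eric linked; mainline gained BFQ later) and sdb as an example backing disk:

    # list the schedulers this queue supports; the active one is shown in brackets
    cat /sys/block/sdb/queue/scheduler
    # switch this queue to bfq (takes effect immediately, not persistent across reboots)
    echo bfq > /sys/block/sdb/queue/scheduler

Note this only changes request ordering at the bottom of the stack; it does not address the dirty-page pileup discussed in this thread.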
Re: 4.8.8, bcache deadlock and hard lockup
On Wed, Nov 30, 2016 at 03:57:28PM -0800, Eric Wheeler wrote: > > I'll start another separate thread with the btrfs folks on how much > > pressure is put on the system, but on your side it would be good to help > > ensure that bcache doesn't crash the system altogether if too many > > requests are allowed to pile up. > > Try BFQ. It is AWESOME and helps reduce the congestion problem with bulk > writes at the request queue on its way to the spinning disk or SSD: > http://algo.ing.unimo.it/people/paolo/disk_sched/ > > use the latest BFQ git here, merge it into v4.8.y: > https://github.com/linusw/linux-bfq/commits/bfq-v8 > > This doesn't completely fix the dirty_ration problem, but it is far better > than CFQ or deadline in my opinion (and experience). That's good to know, thanks. But in my uninformed opinion: is there anything bcache can do to throttle incoming requests when they pile up? Or do they come from producers upstream, leaving bcache no choice but to try and process them as quickly as possible, with no way to block the sender if too many are coming? Thanks, Marc -- "A mouse is a device used to point at the xterm you want to type in" - A.S.R. Microsoft is to operating systems what McDonalds is to gourmet cooking Home page: http://marc.merlins.org/
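On Marc's throttling question: bcache cannot block upstream producers, but it does expose per-device sysfs knobs that bound how much write traffic the cache absorbs, which is the closest thing to back-pressure it offers. A minimal sketch, assuming the device is bcache0 (knob names as in mainline bcache of this era):

    # drain dirty data to the backing device sooner by lowering the dirty target
    echo 5 > /sys/block/bcache0/bcache/writeback_percent
    # or stop absorbing writes on the SSD entirely and pass them through
    echo writethrough > /sys/block/bcache0/bcache/cache_mode

Neither is a real throttle, but both shrink the pile of deferred work that has to land on the spinning rust later.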
Re: 4.8.8, bcache deadlock and hard lockup
On Wed, 30 Nov 2016, Marc MERLIN wrote: > +btrfs mailing list, see below why > > On Tue, Nov 29, 2016 at 12:59:44PM -0800, Eric Wheeler wrote: > > On Mon, 27 Nov 2016, Coly Li wrote: > > > > > > Yes, too many work queues... I guess the locking might be caused by some > > > very obscure reference of closure code. I cannot have any clue if I > > > cannot find a stable procedure to reproduce this issue. > > > > > > Hmm, if there is a tool to clone all the meta data of the back end cache > > > and whole cached device, there might be a method to replay the oops much > > > easier. > > > > > > Eric, do you have any hint ? > > > > Note that the backing device doesn't have any metadata, just a superblock. > > You can easily dd that off onto some other volume without transferring the > > data. By default, data starts at 8k, or whatever you used in `make-bcache > > -w`. > > Ok, Linus helped me find a workaround for this problem: > https://lkml.org/lkml/2016/11/29/667 > namely: >echo 2 > /proc/sys/vm/dirty_ratio >echo 1 > /proc/sys/vm/dirty_background_ratio > (it's a 24GB system, so the defaults of 20 and 10 were creating too many > requests in th buffers) > > Note that this is only a workaround, not a fix. > > When I did this and re tried my big copy again, I still got 100+ kernel > work queues, but apparently the underlying swraid5 was able to unblock > and satisfy the write requests before too many accumulated and crashed > the kernel. > > I'm not a kernel coder, but seems to me that bcache needs a way to > throttle incoming requests if there are too many so that it does not end > up in a state where things blow up due to too many piled up requests. > > You should be able to reproduce this by taking 5 spinning rust drives, > put raid5 on top, dmcrypt, bcache and hopefully any filesystem (although > I used btrfs) and send lots of requests. > Actually to be honest, the problems have mostly been happening when I do > btrfs scrub and btrfs send/receive which both generate I/O from within > the kernel instead of user space. > So here, btrfs may be a contributor to the problem too, but while btrfs > still trashes my system if I remove the caching device on bcache (and > with the default dirty ratio values), it doesn't crash the kernel. > > I'll start another separate thread with the btrfs folks on how much > pressure is put on the system, but on your side it would be good to help > ensure that bcache doesn't crash the system altogether if too many > requests are allowed to pile up. Try BFQ. It is AWESOME and helps reduce the congestion problem with bulk writes at the request queue on its way to the spinning disk or SSD: http://algo.ing.unimo.it/people/paolo/disk_sched/ use the latest BFQ git here, merge it into v4.8.y: https://github.com/linusw/linux-bfq/commits/bfq-v8 This doesn't completely fix the dirty_ration problem, but it is far better than CFQ or deadline in my opinion (and experience). -Eric -- Eric Wheeler > > Thanks, > Marc > -- > "A mouse is a device used to point at the xterm you want to type in" - A.S.R. 
> Microsoft is to operating systems what McDonalds is to gourmet cooking > Home page: http://marc.merlins.org/ | PGP 1024R/763BE901
Re: btrfs flooding the I/O subsystem and hanging the machine, with bcache cache turned off
+folks from the linux-mm thread, for your suggestion On Wed, Nov 30, 2016 at 01:00:45PM -0500, Austin S. Hemmelgarn wrote: > > swraid5 < bcache < dmcrypt < btrfs > > > > Copying with btrfs send/receive causes massive hangs on the system. > > Please see this explanation from Linus on why the workaround was > > suggested: > > https://lkml.org/lkml/2016/11/29/667 > And Linus' assessment is absolutely correct (at least, the general > assessment is, I have no idea about btrfs_start_shared_extent, but I'm more > than willing to bet he's correct that that's the culprit). > > All of this mostly went away with Linus' suggestion: > > echo 2 > /proc/sys/vm/dirty_ratio > > echo 1 > /proc/sys/vm/dirty_background_ratio > > > > But that's hiding the symptom which I think is that btrfs is piling up too > > many I/O requests during btrfs send/receive and btrfs scrub (probably balance too) > > and not looking at resulting impact to system health. > I see pretty much identical behavior using any number of other storage > configurations on a USB 2.0 flash drive connected to a system with 16GB of > RAM with the default dirty ratios, because it's trying to cache up to 3.2GB > of data for writeback. While BTRFS is doing highly sub-optimal things here, > the ancient default writeback ratios are just as much a culprit. I would > suggest they be changed to 200MB or 20% of RAM, whichever is smaller, which > would give overall almost identical behavior to x86-32, which in turn works > reasonably well for most cases. I sadly don't have the time, patience, or > expertise to write up such a patch myself though. Dear linux-mm folks, is that something you could consider (changing the dirty_ratio defaults), given that it affects at least bcache and btrfs (with or without bcache)? By the way, on the 200MB max suggestion: when I had 2 and 1% (or 480MB and 240MB on my 24GB system), this was enough to make btrfs behave sanely, but only if I had bcache turned off. With bcache enabled, those values were just enough that bcache didn't crash my system, but not enough to prevent undesirable behaviour (things hanging, 100+ bcache kworkers piled up, and more). However, the copy did succeed, despite the relative impact on the system, so it's better than nothing :) But the impact from bcache probably goes beyond what btrfs is responsible for, so I have a separate thread on the bcache list: http://marc.info/?l=linux-bcache&m=148052441423532&w=2 http://marc.info/?l=linux-bcache&m=148052620524162&w=2 On the plus side, btrfs did ok, with no visible impact to my system, with those 480 and 240MB dirty ratio values. Thanks for your reply, Austin. Marc -- "A mouse is a device used to point at the xterm you want to type in" - A.S.R. Microsoft is to operating systems what McDonalds is to gourmet cooking Home page: http://marc.merlins.org/ | PGP 1024R/763BE901
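Austin's "200MB or 20% of RAM, whichever is smaller" proposal can be applied by hand today through the byte-granular sysctls, which override their ratio-based counterparts when set. A sketch of that policy (the numbers are the ones proposed in this thread, not a tested recommendation):

    # cap dirty page cache at min(200MB, 20% of RAM); start background writeback at half that
    mem_kb=$(awk '/MemTotal/ {print $2}' /proc/meminfo)
    cap=$(( mem_kb * 1024 / 5 ))                 # 20% of RAM, in bytes
    max=$(( 200 * 1024 * 1024 ))                 # 200MB ceiling
    [ "$cap" -gt "$max" ] && cap=$max
    echo "$cap" > /proc/sys/vm/dirty_bytes
    echo $(( cap / 2 )) > /proc/sys/vm/dirty_background_bytes

Setting vm.dirty_bytes zeroes vm.dirty_ratio (and likewise for the background pair), so this replaces rather than stacks with the percentage knobs.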
Re: btrfs flooding the I/O subsystem and hanging the machine, with bcache cache turned off
On 2016-11-30 12:18, Marc MERLIN wrote: On Wed, Nov 30, 2016 at 08:46:46AM -0800, Marc MERLIN wrote: +btrfs mailing list, see below why Ok, Linus helped me find a workaround for this problem: https://lkml.org/lkml/2016/11/29/667 namely: echo 2 > /proc/sys/vm/dirty_ratio echo 1 > /proc/sys/vm/dirty_background_ratio (it's a 24GB system, so the defaults of 20 and 10 were creating too many requests in the buffers) I'll remove the bcache list on this followup since I want to concentrate here on the fact that btrfs does behave badly with the default dirty_ratio values. I will comment that on big systems, almost everything behaves badly with the default dirty ratios; they're leftovers from when 1GB was a huge amount of RAM. As usual though, BTRFS has pathological behavior compared to other options. As a reminder, it's a btrfs send/receive copy between 2 swraid5 arrays on spinning rust. swraid5 < bcache < dmcrypt < btrfs Copying with btrfs send/receive causes massive hangs on the system. Please see this explanation from Linus on why the workaround was suggested: https://lkml.org/lkml/2016/11/29/667 And Linus' assessment is absolutely correct (at least, the general assessment is; I have no idea about btrfs_start_shared_extent, but I'm more than willing to bet he's correct that that's the culprit). The hangs that I'm getting with the bcache cache turned off (i.e. passthrough) are now very likely only due to btrfs, and they mess up anything doing file IO that ends up timing out, break USB even as reads time out in the middle of USB requests, lose interrupts, and so forth. All of this mostly went away with Linus' suggestion: echo 2 > /proc/sys/vm/dirty_ratio echo 1 > /proc/sys/vm/dirty_background_ratio But that's hiding the symptom, which I think is that btrfs is piling up too many I/O requests during btrfs send/receive and btrfs scrub (probably balance too) and not looking at the resulting impact to system health. I see pretty much identical behavior using any number of other storage configurations on a USB 2.0 flash drive connected to a system with 16GB of RAM with the default dirty ratios, because it's trying to cache up to 3.2GB of data for writeback. While BTRFS is doing highly sub-optimal things here, the ancient default writeback ratios are just as much a culprit. I would suggest they be changed to 200MB or 20% of RAM, whichever is smaller, which would give overall almost identical behavior to x86-32, which in turn works reasonably well for most cases. I sadly don't have the time, patience, or expertise to write up such a patch myself though. Is there a way to stop flooding the entire system with I/O and causing so much strain on it?
(I realize that if there is a caching layer underneath that just takes requests and says thank you without giving other clues that underneath bad things are happening, it may be hard, but I'm asking anyway :) [10338.968912] perf: interrupt took too long (3927 > 3917), lowering kernel.perf_event_max_sample_rate to 50750 [12971.047705] ftdi_sio ttyUSB15: usb_serial_generic_read_bulk_callback - urb stopped: -32 [17761.122238] usb 4-1.4: USB disconnect, device number 39 [17761.141063] usb 4-1.4: usbfs: USBDEVFS_CONTROL failed cmd hub-ctrl rqt 160 rq 6 len 1024 ret -108 [17761.263252] usb 4-1: reset SuperSpeed USB device number 2 using xhci_hcd [17761.938575] usb 4-1.4: new SuperSpeed USB device number 40 using xhci_hcd [24130.574425] hpet1: lost 2306 rtc interrupts [24156.034950] hpet1: lost 1628 rtc interrupts [24173.314738] hpet1: lost 1104 rtc interrupts [24180.129950] hpet1: lost 436 rtc interrupts [24257.557955] hpet1: lost 4954 rtc interrupts [24267.522656] hpet1: lost 637 rtc interrupts [28034.954435] INFO: task btrfs:5618 blocked for more than 120 seconds. [28034.975471] Tainted: G U 4.8.10-amd64-preempt-sysrq-20161121vb3tj1 #12 [28035.000964] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [28035.025429] btrfs D 91154d33fc70 0 5618 5372 0x0080 [28035.047717] 91154d33fc70 00200246 911842f880c0 9115a4cf01c0 [28035.071020] 91154d33fc58 91154d34 91165493bca0 9115623773f0 [28035.094252] 1000 0001 91154d33fc88 b86cf1a6 [28035.117538] Call Trace: [28035.125791] [] schedule+0x8b/0xa3 [28035.141550] [] btrfs_start_ordered_extent+0xce/0x122 [28035.162457] [] ? wake_up_atomic_t+0x2c/0x2c [28035.180891] [] btrfs_wait_ordered_range+0xa9/0x10d [28035.201723] [] btrfs_truncate+0x40/0x24b [28035.219269] [] btrfs_setattr+0x1da/0x2d7 [28035.237032] [] notify_change+0x252/0x39c [28035.254566] [] do_truncate+0x81/0xb4 [28035.271057] [] vfs_truncate+0xd9/0xf9 [28035.287782] [] do_sys_truncate+0x63/0xa7 [28155.781987] INFO: task btrfs:5618 blocked for more than 120 seconds. [28155.802229] Tainted: G U 4.8.10-amd64-preempt-sysrq-20161121vb3tj1 #
Re: 4.8.8, bcache deadlock and hard lockup
On Wed, Nov 30, 2016 at 08:46:46AM -0800, Marc MERLIN wrote: > +btrfs mailing list, see below why > > On Tue, Nov 29, 2016 at 12:59:44PM -0800, Eric Wheeler wrote: > > On Mon, 27 Nov 2016, Coly Li wrote: > > > > > > Yes, too many work queues... I guess the locking might be caused by some > > > very obscure reference of closure code. I cannot have any clue if I > > > cannot find a stable procedure to reproduce this issue. > > > > > > Hmm, if there is a tool to clone all the meta data of the back end cache > > > and whole cached device, there might be a method to replay the oops much > > > easier. > > > > > > Eric, do you have any hint ? > > > > Note that the backing device doesn't have any metadata, just a superblock. > > You can easily dd that off onto some other volume without transferring the > > data. By default, data starts at 8k, or whatever you used in `make-bcache > > -w`. > > Ok, Linus helped me find a workaround for this problem: > https://lkml.org/lkml/2016/11/29/667 > namely: >echo 2 > /proc/sys/vm/dirty_ratio >echo 1 > /proc/sys/vm/dirty_background_ratio > (it's a 24GB system, so the defaults of 20 and 10 were creating too many > requests in th buffers) > > Note that this is only a workaround, not a fix. Actually, I'm even more worried about the general bcache situation when caching is enabled. In the message above, Linus wrote: "One situation where I've seen something like this happen is (a) lots and lots of dirty data queued up (b) horribly slow storage (c) filesystem that ends up serializing on writeback under certain circumstances The usual case for (b) in the modern world is big SSD's that have bad worst-case behavior (ie they may do gbps speeds when doing well, and then they come to a screeching halt when their buffers fill up and they have to do rewrites, and their gbps throughput drops to mbps or lower). Generally you only find that kind of really nasty SSD in the USB stick world these days." Well, come to think of it, this is _exactly_ what bcache will create, by design. It'll swallow up a lot of IO cached to the SSD, until the SSD buffers fill up and then things will hang while bcache struggles to write it all to slower spinning rust storage. Looks to me like bcache and dirty_ratio need to be synced somehow, or things will fall over reliably. What do you think? Thanks, Marc > When I did this and re tried my big copy again, I still got 100+ kernel > work queues, but apparently the underlying swraid5 was able to unblock > and satisfy the write requests before too many accumulated and crashed > the kernel. > > I'm not a kernel coder, but seems to me that bcache needs a way to > throttle incoming requests if there are too many so that it does not end > up in a state where things blow up due to too many piled up requests. > > You should be able to reproduce this by taking 5 spinning rust drives, > put raid5 on top, dmcrypt, bcache and hopefully any filesystem (although > I used btrfs) and send lots of requests. > Actually to be honest, the problems have mostly been happening when I do > btrfs scrub and btrfs send/receive which both generate I/O from within > the kernel instead of user space. > So here, btrfs may be a contributor to the problem too, but while btrfs > still trashes my system if I remove the caching device on bcache (and > with the default dirty ratio values), it doesn't crash the kernel. 
> > I'll start another separate thread with the btrfs folks on how much > pressure is put on the system, but on your side it would be good to help > ensure that bcache doesn't crash the system altogether if too many > requests are allowed to pile up. > > Thanks, > Marc > -- > "A mouse is a device used to point at the xterm you want to type in" - A.S.R. > Microsoft is to operating systems what McDonalds is to gourmet cooking > Home page: http://marc.merlins.org/ | PGP 1024R/763BE901 -- "A mouse is a device used to point at the xterm you want to type in" - A.S.R. Microsoft is to operating systems what McDonalds is to gourmet cooking Home page: http://marc.merlins.org/ | PGP 1024R/763BE901
Re: btrfs flooding the I/O subsystem and hanging the machine, with bcache cache turned off
On Wed, Nov 30, 2016 at 08:46:46AM -0800, Marc MERLIN wrote: > +btrfs mailing list, see below why > > Ok, Linus helped me find a workaround for this problem: > https://lkml.org/lkml/2016/11/29/667 > namely: >echo 2 > /proc/sys/vm/dirty_ratio >echo 1 > /proc/sys/vm/dirty_background_ratio > (it's a 24GB system, so the defaults of 20 and 10 were creating too many > requests in th buffers) I'll remove the bcache list on this followup since I want to concentrate here on the fact that btrfs does behave badly with the default dirty_ratio values. As a reminder, it's a btrfs send/receive copy between 2 swraid5 arrays on spinning rust. swraid5 < bcache < dmcrypt < btrfs Copying with btrfs send/receive causes massive hangs on the system. Please see this explanation from Linus on why the workaround was suggested: https://lkml.org/lkml/2016/11/29/667 The hangs that I'm getting with bcache cache turned off (i.e. passthrough) are now very likely only due to btrfs and mess up anything doing file IO that ends up timing out, break USB even as reads time out in the middle of USB requests, interrupts lost, and so forth. All of this mostly went away with Linus' suggestion: echo 2 > /proc/sys/vm/dirty_ratio echo 1 > /proc/sys/vm/dirty_background_ratio But that's hiding the symptom which I think is that btrfs is piling up too many I/O requests during btrfs send/receive and btrfs scrub (probably balance too) and not looking at resulting impact to system health. Is there a way to stop flodding the entire system with I/O and causing so much strain on it? (I realize that if there is a caching layer underneath that just takes requests and says thank you without giving other clues that underneath bad things are happening, it may be hard, but I'm asking anyway :) [10338.968912] perf: interrupt took too long (3927 > 3917), lowering kernel.perf_event_max_sample_rate to 50750 [12971.047705] ftdi_sio ttyUSB15: usb_serial_generic_read_bulk_callback - urb stopped: -32 [17761.122238] usb 4-1.4: USB disconnect, device number 39 [17761.141063] usb 4-1.4: usbfs: USBDEVFS_CONTROL failed cmd hub-ctrl rqt 160 rq 6 len 1024 ret -108 [17761.263252] usb 4-1: reset SuperSpeed USB device number 2 using xhci_hcd [17761.938575] usb 4-1.4: new SuperSpeed USB device number 40 using xhci_hcd [24130.574425] hpet1: lost 2306 rtc interrupts [24156.034950] hpet1: lost 1628 rtc interrupts [24173.314738] hpet1: lost 1104 rtc interrupts [24180.129950] hpet1: lost 436 rtc interrupts [24257.557955] hpet1: lost 4954 rtc interrupts [24267.522656] hpet1: lost 637 rtc interrupts [28034.954435] INFO: task btrfs:5618 blocked for more than 120 seconds. [28034.975471] Tainted: G U 4.8.10-amd64-preempt-sysrq-20161121vb3tj1 #12 [28035.000964] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [28035.025429] btrfs D 91154d33fc70 0 5618 5372 0x0080 [28035.047717] 91154d33fc70 00200246 911842f880c0 9115a4cf01c0 [28035.071020] 91154d33fc58 91154d34 91165493bca0 9115623773f0 [28035.094252] 1000 0001 91154d33fc88 b86cf1a6 [28035.117538] Call Trace: [28035.125791] [] schedule+0x8b/0xa3 [28035.141550] [] btrfs_start_ordered_extent+0xce/0x122 [28035.162457] [] ? 
wake_up_atomic_t+0x2c/0x2c [28035.180891] [] btrfs_wait_ordered_range+0xa9/0x10d [28035.201723] [] btrfs_truncate+0x40/0x24b [28035.219269] [] btrfs_setattr+0x1da/0x2d7 [28035.237032] [] notify_change+0x252/0x39c [28035.254566] [] do_truncate+0x81/0xb4 [28035.271057] [] vfs_truncate+0xd9/0xf9 [28035.287782] [] do_sys_truncate+0x63/0xa7 [28155.781987] INFO: task btrfs:5618 blocked for more than 120 seconds. [28155.802229] Tainted: G U 4.8.10-amd64-preempt-sysrq-20161121vb3tj1 #12 [28155.827894] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [28155.852479] btrfs D 91154d33fc70 0 5618 5372 0x0080 [28155.874761] 91154d33fc70 00200246 911842f880c0 9115a4cf01c0 [28155.898059] 91154d33fc58 91154d34 91165493bca0 9115623773f0 [28155.921464] 1000 0001 91154d33fc88 b86cf1a6 [28155.944720] Call Trace: [28155.953176] [] schedule+0x8b/0xa3 [28155.968945] [] btrfs_start_ordered_extent+0xce/0x122 [28155.989811] [] ? wake_up_atomic_t+0x2c/0x2c [28156.008195] [] btrfs_wait_ordered_range+0xa9/0x10d [28156.028498] [] btrfs_truncate+0x40/0x24b [28156.046081] [] btrfs_setattr+0x1da/0x2d7 [28156.063621] [] notify_change+0x252/0x39c [28156.081667] [] do_truncate+0x81/0xb4 [28156.098732] [] vfs_truncate+0xd9/0xf9 [28156.115489] [] do_sys_truncate+0x63/0xa7 [28156.133389] [] SyS_truncate+0xe/0x10 [28156.149831] [] do_syscall_64+0x61/0x72 [28156.167179] [] entry_SYSCALL64_slow_path+0x25/0x25 [28397.436986] INFO: task btrfs:
Re: 4.8.8, bcache deadlock and hard lockup
+btrfs mailing list, see below why On Tue, Nov 29, 2016 at 12:59:44PM -0800, Eric Wheeler wrote: > On Mon, 27 Nov 2016, Coly Li wrote: > > > > Yes, too many work queues... I guess the locking might be caused by some > > very obscure reference of closure code. I cannot have any clue if I > > cannot find a stable procedure to reproduce this issue. > > > > Hmm, if there is a tool to clone all the meta data of the back end cache > > and whole cached device, there might be a method to replay the oops much > > easier. > > > > Eric, do you have any hint ? > > Note that the backing device doesn't have any metadata, just a superblock. > You can easily dd that off onto some other volume without transferring the > data. By default, data starts at 8k, or whatever you used in `make-bcache > -w`. Ok, Linus helped me find a workaround for this problem: https://lkml.org/lkml/2016/11/29/667 namely: echo 2 > /proc/sys/vm/dirty_ratio echo 1 > /proc/sys/vm/dirty_background_ratio (it's a 24GB system, so the defaults of 20 and 10 were creating too many requests in the buffers) Note that this is only a workaround, not a fix. When I did this and retried my big copy again, I still got 100+ kernel work queues, but apparently the underlying swraid5 was able to unblock and satisfy the write requests before too many accumulated and crashed the kernel. I'm not a kernel coder, but it seems to me that bcache needs a way to throttle incoming requests if there are too many, so that it does not end up in a state where things blow up due to too many piled up requests. You should be able to reproduce this by taking 5 spinning rust drives, putting raid5 on top, then dmcrypt, then bcache, and hopefully any filesystem (although I used btrfs), and sending lots of requests; a sketch of that stack follows below. Actually, to be honest, the problems have mostly been happening when I do btrfs scrub and btrfs send/receive, which both generate I/O from within the kernel instead of user space. So here, btrfs may be a contributor to the problem too, but while btrfs still trashes my system if I remove the caching device on bcache (and with the default dirty ratio values), it doesn't crash the kernel. I'll start another separate thread with the btrfs folks on how much pressure is put on the system, but on your side it would be good to help ensure that bcache doesn't crash the system altogether if too many requests are allowed to pile up. Thanks, Marc -- "A mouse is a device used to point at the xterm you want to type in" - A.S.R. Microsoft is to operating systems what McDonalds is to gourmet cooking Home page: http://marc.merlins.org/ | PGP 1024R/763BE901
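The reproduction stack Marc describes, spelled out as commands (a sketch only; device names and the mount point are examples, and the dmcrypt-below-bcache ordering is deliberately his, later identified in this thread as the wrong order):

    mdadm --create /dev/md5 --level=5 --raid-devices=5 /dev/sd[b-f]
    cryptsetup luksFormat /dev/md5
    cryptsetup luksOpen /dev/md5 cryptmd
    make-bcache -B /dev/mapper/cryptmd      # backing device -> /dev/bcache0
    make-bcache -C /dev/sdg                 # SSD cache set; the crash was seen with a cache attached
    echo "$CSET_UUID" > /sys/block/bcache0/bcache/attach   # UUID printed by make-bcache -C
    mkfs.btrfs /dev/bcache0
    mount /dev/bcache0 /mnt/test            # then drive heavy writes, e.g. btrfs send/receive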
BCache
Hello, I run bcache under a btrfs RAID with 4 hard drives. Normal use seems to work well; I have no problems. I have the btrfs RAID mounted through /dev/bcache3. But when I remove a disk (simulating a failure), btrfs doesn't report a missing disk:

# btrfs fi show
Label: 'RAID'  uuid: d0e2e2eb-2df7-454f-8446-5213cec2de3c
        Total devices 4 FS bytes used 12.55GiB
        devid 1 size 465.76GiB used 6.00GiB path /dev/bcache3
        devid 2 size 931.51GiB used 6.00GiB path /dev/bcache1
        devid 3 size 596.17GiB used 6.00GiB path /dev/bcache2
        devid 4 size 465.76GiB used 6.00GiB path /dev/bcache0

One of these (actually /dev/bcache3, i.e. /dev/sde) should show as broken or missing. I can't run "btrfs device delete missing" on the raw device either: it replies "ERROR: not a btrfs filesystem:", no matter whether I use /dev/bcacheX or /dev/sdX. And if I run "btrfs device delete missing /mnt/raid", it replies "ERROR: error removing device 'missing': no missing devices found to remove". It looks to me as if bcache hides this information from btrfs. Is this possible? What do you think? Thanks
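What btrfs sees here is the bcache layer, not the raw disks, and the /dev/bcacheN node can stay present even when its backing disk has gone away, so btrfs has nothing to report on its own. One way to ask bcache itself about each device is through sysfs and the superblock dump tool; a sketch, assuming bcache-tools is installed and sde is the suspect member (names are examples):

    # per-device bcache status ("clean", "dirty", "no cache", ...)
    for d in /sys/block/bcache*/bcache; do
        echo "$d: $(cat "$d"/state)"
    done
    # dump the bcache superblock of a raw member to confirm it is still readable
    bcache-super-show /dev/sde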
Re: [PATCH 25/45] bcache: use bio op accessors
On 06/05/2016 09:32 PM, mchri...@redhat.com wrote: > From: Mike Christie <mchri...@redhat.com> > > Separate the op from the rq_flag_bits and have bcache > set/get the bio using bio_set_op_attrs/bio_op. > > Signed-off-by: Mike Christie <mchri...@redhat.com> > --- > drivers/md/bcache/btree.c | 4 ++-- > drivers/md/bcache/debug.c | 4 ++-- > drivers/md/bcache/journal.c | 7 --- > drivers/md/bcache/movinggc.c | 2 +- > drivers/md/bcache/request.c | 14 +++--- > drivers/md/bcache/super.c | 24 +--- > drivers/md/bcache/writeback.c | 4 ++-- > 7 files changed, 31 insertions(+), 28 deletions(-) > Reviewed-by: Hannes Reinecke <h...@suse.com> Cheers, Hannes -- Dr. Hannes Reinecke  Teamlead Storage & Networking h...@suse.de +49 911 74053 688 SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton HRB 21284 (AG Nürnberg)
Re: [PATCH 07/45] bcache: use op_is_write instead of checking for REQ_WRITE
On 06/05/2016 09:31 PM, mchri...@redhat.com wrote: > From: Mike Christie <mchri...@redhat.com> > > We currently set REQ_WRITE/WRITE for all non READ IOs > like discard, flush, writesame, etc. In the next patches where we > no longer set up the op as a bitmap, we will not be able to > detect an operation direction like writesame by testing if REQ_WRITE is > set. > > This has bcache use the op_is_write helper which will do the right > thing. > > Signed-off-by: Mike Christie <mchri...@redhat.com> > --- > drivers/md/bcache/io.c | 2 +- > drivers/md/bcache/request.c | 6 +++--- > 2 files changed, 4 insertions(+), 4 deletions(-) (Could probably be folded together with the two previous patches.) Reviewed-by: Hannes Reinecke <h...@suse.com> Cheers, Hannes -- Dr. Hannes Reinecke  Teamlead Storage & Networking h...@suse.de +49 911 74053 688 SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton HRB 21284 (AG Nürnberg)
[PATCH 25/45] bcache: use bio op accessors
From: Mike Christie <mchri...@redhat.com> Separate the op from the rq_flag_bits and have bcache set/get the bio using bio_set_op_attrs/bio_op. Signed-off-by: Mike Christie <mchri...@redhat.com> --- drivers/md/bcache/btree.c | 4 ++-- drivers/md/bcache/debug.c | 4 ++-- drivers/md/bcache/journal.c | 7 --- drivers/md/bcache/movinggc.c | 2 +- drivers/md/bcache/request.c | 14 +++--- drivers/md/bcache/super.c | 24 +--- drivers/md/bcache/writeback.c | 4 ++-- 7 files changed, 31 insertions(+), 28 deletions(-) diff --git a/drivers/md/bcache/btree.c b/drivers/md/bcache/btree.c index eab505e..76f7534 100644 --- a/drivers/md/bcache/btree.c +++ b/drivers/md/bcache/btree.c @@ -294,10 +294,10 @@ static void bch_btree_node_read(struct btree *b) closure_init_stack(); bio = bch_bbio_alloc(b->c); - bio->bi_rw = REQ_META|READ_SYNC; bio->bi_iter.bi_size = KEY_SIZE(>key) << 9; bio->bi_end_io = btree_node_read_endio; bio->bi_private = + bio_set_op_attrs(bio, REQ_OP_READ, REQ_META|READ_SYNC); bch_bio_map(bio, b->keys.set[0].data); @@ -396,8 +396,8 @@ static void do_btree_node_write(struct btree *b) b->bio->bi_end_io = btree_node_write_endio; b->bio->bi_private = cl; - b->bio->bi_rw = REQ_META|WRITE_SYNC|REQ_FUA; b->bio->bi_iter.bi_size = roundup(set_bytes(i), block_bytes(b->c)); + bio_set_op_attrs(b->bio, REQ_OP_WRITE, REQ_META|WRITE_SYNC|REQ_FUA); bch_bio_map(b->bio, i); /* diff --git a/drivers/md/bcache/debug.c b/drivers/md/bcache/debug.c index 52b6bcf..c28df164 100644 --- a/drivers/md/bcache/debug.c +++ b/drivers/md/bcache/debug.c @@ -52,7 +52,7 @@ void bch_btree_verify(struct btree *b) bio->bi_bdev= PTR_CACHE(b->c, >key, 0)->bdev; bio->bi_iter.bi_sector = PTR_OFFSET(>key, 0); bio->bi_iter.bi_size= KEY_SIZE(>key) << 9; - bio->bi_rw = REQ_META|READ_SYNC; + bio_set_op_attrs(bio, REQ_OP_READ, REQ_META|READ_SYNC); bch_bio_map(bio, sorted); submit_bio_wait(bio); @@ -114,7 +114,7 @@ void bch_data_verify(struct cached_dev *dc, struct bio *bio) check = bio_clone(bio, GFP_NOIO); if (!check) return; - check->bi_rw |= READ_SYNC; + bio_set_op_attrs(check, REQ_OP_READ, READ_SYNC); if (bio_alloc_pages(check, GFP_NOIO)) goto out_put; diff --git a/drivers/md/bcache/journal.c b/drivers/md/bcache/journal.c index af3f9f7..a3c3b30 100644 --- a/drivers/md/bcache/journal.c +++ b/drivers/md/bcache/journal.c @@ -54,11 +54,11 @@ reread: left = ca->sb.bucket_size - offset; bio_reset(bio); bio->bi_iter.bi_sector = bucket + offset; bio->bi_bdev= ca->bdev; - bio->bi_rw = READ; bio->bi_iter.bi_size= len << 9; bio->bi_end_io = journal_read_endio; bio->bi_private = + bio_set_op_attrs(bio, REQ_OP_READ, 0); bch_bio_map(bio, data); closure_bio_submit(bio, ); @@ -449,10 +449,10 @@ static void do_journal_discard(struct cache *ca) atomic_set(>discard_in_flight, DISCARD_IN_FLIGHT); bio_init(bio); + bio_set_op_attrs(bio, REQ_OP_DISCARD, 0); bio->bi_iter.bi_sector = bucket_to_sector(ca->set, ca->sb.d[ja->discard_idx]); bio->bi_bdev= ca->bdev; - bio->bi_rw = REQ_WRITE|REQ_DISCARD; bio->bi_max_vecs= 1; bio->bi_io_vec = bio->bi_inline_vecs; bio->bi_iter.bi_size= bucket_bytes(ca); @@ -626,11 +626,12 @@ static void journal_write_unlocked(struct closure *cl) bio_reset(bio); bio->bi_iter.bi_sector = PTR_OFFSET(k, i); bio->bi_bdev= ca->bdev; - bio->bi_rw = REQ_WRITE|REQ_SYNC|REQ_META|REQ_FLUSH|REQ_FUA; bio->bi_iter.bi_size = sectors << 9; bio->bi_end_io = journal_write_endio; bio->bi_private = w; + bio_set_op_attrs(bio, REQ_OP_WRITE, + REQ_SYNC|REQ_META|REQ_FLUSH|REQ_FUA); bch_bio_map(bio, w->data); trace_bcache_journal_write(bio); diff --git 
a/drivers/md/bcache/movinggc.c b/drivers/md/bcache/movinggc.c index b929fc9..1881319 100644 --- a/drivers/md/bcache/movinggc.c +++ b/drivers/md/bcache/movinggc.c @@ -163,7 +163,7 @@ static void read_moving(struct cache_set *c)
[PATCH 07/45] bcache: use op_is_write instead of checking for REQ_WRITE
From: Mike Christie <mchri...@redhat.com> We currently set REQ_WRITE/WRITE for all non READ IOs like discard, flush, writesame, etc. In the next patches where we no longer set up the op as a bitmap, we will not be able to detect a operation direction like writesame by testing if REQ_WRITE is set. This has bcache use the op_is_write helper which will do the right thing. Signed-off-by: Mike Christie <mchri...@redhat.com> --- drivers/md/bcache/io.c | 2 +- drivers/md/bcache/request.c | 6 +++--- 2 files changed, 4 insertions(+), 4 deletions(-) diff --git a/drivers/md/bcache/io.c b/drivers/md/bcache/io.c index 86a0bb8..fd885cc 100644 --- a/drivers/md/bcache/io.c +++ b/drivers/md/bcache/io.c @@ -111,7 +111,7 @@ void bch_bbio_count_io_errors(struct cache_set *c, struct bio *bio, struct bbio *b = container_of(bio, struct bbio, bio); struct cache *ca = PTR_CACHE(c, >key, 0); - unsigned threshold = bio->bi_rw & REQ_WRITE + unsigned threshold = op_is_write(bio_op(bio)) ? c->congested_write_threshold_us : c->congested_read_threshold_us; diff --git a/drivers/md/bcache/request.c b/drivers/md/bcache/request.c index 25fa844..6b85a23 100644 --- a/drivers/md/bcache/request.c +++ b/drivers/md/bcache/request.c @@ -383,7 +383,7 @@ static bool check_should_bypass(struct cached_dev *dc, struct bio *bio) if (mode == CACHE_MODE_NONE || (mode == CACHE_MODE_WRITEAROUND && -(bio->bi_rw & REQ_WRITE))) +op_is_write(bio_op(bio goto skip; if (bio->bi_iter.bi_sector & (c->sb.block_size - 1) || @@ -404,7 +404,7 @@ static bool check_should_bypass(struct cached_dev *dc, struct bio *bio) if (!congested && mode == CACHE_MODE_WRITEBACK && - (bio->bi_rw & REQ_WRITE) && + op_is_write(bio_op(bio)) && (bio->bi_rw & REQ_SYNC)) goto rescale; @@ -657,7 +657,7 @@ static inline struct search *search_alloc(struct bio *bio, s->cache_miss = NULL; s->d= d; s->recoverable = 1; - s->write= (bio->bi_rw & REQ_WRITE) != 0; + s->write= op_is_write(bio_op(bio)); s->read_dirty_data = 0; s->start_time = jiffies; -- 2.7.2 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 21/42] bcache: set bi_op to REQ_OP
From: Mike Christie <mchri...@redhat.com> This patch has bcache use bio->bi_op for REQ_OPs and rq_flag_bits to bio->bi_rw. Signed-off-by: Mike Christie <mchri...@redhat.com> Reviewed-by: Christoph Hellwig <h...@lst.de> Reviewed-by: Hannes Reinecke <h...@suse.com> --- drivers/md/bcache/btree.c | 2 ++ drivers/md/bcache/debug.c | 2 ++ drivers/md/bcache/io.c| 2 +- drivers/md/bcache/journal.c | 7 --- drivers/md/bcache/movinggc.c | 2 +- drivers/md/bcache/request.c | 9 + drivers/md/bcache/super.c | 26 +++--- drivers/md/bcache/writeback.c | 4 ++-- 8 files changed, 32 insertions(+), 22 deletions(-) diff --git a/drivers/md/bcache/btree.c b/drivers/md/bcache/btree.c index 22b9e34..752a44f 100644 --- a/drivers/md/bcache/btree.c +++ b/drivers/md/bcache/btree.c @@ -295,6 +295,7 @@ static void bch_btree_node_read(struct btree *b) closure_init_stack(); bio = bch_bbio_alloc(b->c); + bio->bi_op = REQ_OP_READ; bio->bi_rw = REQ_META|READ_SYNC; bio->bi_iter.bi_size = KEY_SIZE(>key) << 9; bio->bi_end_io = btree_node_read_endio; @@ -397,6 +398,7 @@ static void do_btree_node_write(struct btree *b) b->bio->bi_end_io = btree_node_write_endio; b->bio->bi_private = cl; + b->bio->bi_op = REQ_OP_WRITE; b->bio->bi_rw = REQ_META|WRITE_SYNC|REQ_FUA; b->bio->bi_iter.bi_size = roundup(set_bytes(i), block_bytes(b->c)); bch_bio_map(b->bio, i); diff --git a/drivers/md/bcache/debug.c b/drivers/md/bcache/debug.c index 52b6bcf..8df9e66 100644 --- a/drivers/md/bcache/debug.c +++ b/drivers/md/bcache/debug.c @@ -52,6 +52,7 @@ void bch_btree_verify(struct btree *b) bio->bi_bdev= PTR_CACHE(b->c, >key, 0)->bdev; bio->bi_iter.bi_sector = PTR_OFFSET(>key, 0); bio->bi_iter.bi_size= KEY_SIZE(>key) << 9; + bio->bi_op = REQ_OP_READ; bio->bi_rw = REQ_META|READ_SYNC; bch_bio_map(bio, sorted); @@ -114,6 +115,7 @@ void bch_data_verify(struct cached_dev *dc, struct bio *bio) check = bio_clone(bio, GFP_NOIO); if (!check) return; + check->bi_op = REQ_OP_READ; check->bi_rw |= READ_SYNC; if (bio_alloc_pages(check, GFP_NOIO)) diff --git a/drivers/md/bcache/io.c b/drivers/md/bcache/io.c index 86a0bb8..f10a9a0 100644 --- a/drivers/md/bcache/io.c +++ b/drivers/md/bcache/io.c @@ -111,7 +111,7 @@ void bch_bbio_count_io_errors(struct cache_set *c, struct bio *bio, struct bbio *b = container_of(bio, struct bbio, bio); struct cache *ca = PTR_CACHE(c, >key, 0); - unsigned threshold = bio->bi_rw & REQ_WRITE + unsigned threshold = op_is_write(bio->bi_op) ? 
c->congested_write_threshold_us : c->congested_read_threshold_us; diff --git a/drivers/md/bcache/journal.c b/drivers/md/bcache/journal.c index af3f9f7..68fa0f0 100644 --- a/drivers/md/bcache/journal.c +++ b/drivers/md/bcache/journal.c @@ -54,7 +54,7 @@ reread: left = ca->sb.bucket_size - offset; bio_reset(bio); bio->bi_iter.bi_sector = bucket + offset; bio->bi_bdev= ca->bdev; - bio->bi_rw = READ; + bio->bi_op = REQ_OP_READ; bio->bi_iter.bi_size= len << 9; bio->bi_end_io = journal_read_endio; @@ -452,7 +452,7 @@ static void do_journal_discard(struct cache *ca) bio->bi_iter.bi_sector = bucket_to_sector(ca->set, ca->sb.d[ja->discard_idx]); bio->bi_bdev= ca->bdev; - bio->bi_rw = REQ_WRITE|REQ_DISCARD; + bio->bi_op = REQ_OP_DISCARD; bio->bi_max_vecs= 1; bio->bi_io_vec = bio->bi_inline_vecs; bio->bi_iter.bi_size= bucket_bytes(ca); @@ -626,7 +626,8 @@ static void journal_write_unlocked(struct closure *cl) bio_reset(bio); bio->bi_iter.bi_sector = PTR_OFFSET(k, i); bio->bi_bdev= ca->bdev; - bio->bi_rw = REQ_WRITE|REQ_SYNC|REQ_META|REQ_FLUSH|REQ_FUA; + bio->bi_op = REQ_OP_WRITE; + bio->bi_rw = REQ_SYNC|REQ_META|REQ_FLUSH|REQ_FUA; bio->bi_iter.bi_size = sectors << 9; bio->bi_end_io = journal_write_endio; diff --git a/drivers/md/bcache/movinggc.c b/drivers/md/bcache/movinggc.c index b929fc9..f33860a 100644 --- a/drivers/md/bcache/movinggc.c +++ b/drivers/md/bcache/movinggc.c @@
[PATCH 21/42] bcache: set bi_op to REQ_OP
From: Mike Christie <mchri...@redhat.com> This patch has bcache use bio->bi_op for REQ_OPs and rq_flag_bits to bio->bi_rw. Signed-off-by: Mike Christie <mchri...@redhat.com> Reviewed-by: Christoph Hellwig <h...@lst.de> --- drivers/md/bcache/btree.c | 2 ++ drivers/md/bcache/debug.c | 2 ++ drivers/md/bcache/io.c| 2 +- drivers/md/bcache/journal.c | 7 ++++--- drivers/md/bcache/movinggc.c | 2 +- drivers/md/bcache/request.c | 9 + drivers/md/bcache/super.c | 26 +++--- drivers/md/bcache/writeback.c | 4 ++-- 8 files changed, 32 insertions(+), 22 deletions(-) diff --git a/drivers/md/bcache/btree.c b/drivers/md/bcache/btree.c index 22b9e34..752a44f 100644 --- a/drivers/md/bcache/btree.c +++ b/drivers/md/bcache/btree.c @@ -295,6 +295,7 @@ static void bch_btree_node_read(struct btree *b) closure_init_stack(); bio = bch_bbio_alloc(b->c); + bio->bi_op = REQ_OP_READ; bio->bi_rw = REQ_META|READ_SYNC; bio->bi_iter.bi_size = KEY_SIZE(>key) << 9; bio->bi_end_io = btree_node_read_endio; @@ -397,6 +398,7 @@ static void do_btree_node_write(struct btree *b) b->bio->bi_end_io = btree_node_write_endio; b->bio->bi_private = cl; + b->bio->bi_op = REQ_OP_WRITE; b->bio->bi_rw = REQ_META|WRITE_SYNC|REQ_FUA; b->bio->bi_iter.bi_size = roundup(set_bytes(i), block_bytes(b->c)); bch_bio_map(b->bio, i); diff --git a/drivers/md/bcache/debug.c b/drivers/md/bcache/debug.c index 52b6bcf..8df9e66 100644 --- a/drivers/md/bcache/debug.c +++ b/drivers/md/bcache/debug.c @@ -52,6 +52,7 @@ void bch_btree_verify(struct btree *b) bio->bi_bdev= PTR_CACHE(b->c, >key, 0)->bdev; bio->bi_iter.bi_sector = PTR_OFFSET(>key, 0); bio->bi_iter.bi_size= KEY_SIZE(>key) << 9; + bio->bi_op = REQ_OP_READ; bio->bi_rw = REQ_META|READ_SYNC; bch_bio_map(bio, sorted); @@ -114,6 +115,7 @@ void bch_data_verify(struct cached_dev *dc, struct bio *bio) check = bio_clone(bio, GFP_NOIO); if (!check) return; + check->bi_op = REQ_OP_READ; check->bi_rw |= READ_SYNC; if (bio_alloc_pages(check, GFP_NOIO)) diff --git a/drivers/md/bcache/io.c b/drivers/md/bcache/io.c index 86a0bb8..f10a9a0 100644 --- a/drivers/md/bcache/io.c +++ b/drivers/md/bcache/io.c @@ -111,7 +111,7 @@ void bch_bbio_count_io_errors(struct cache_set *c, struct bio *bio, struct bbio *b = container_of(bio, struct bbio, bio); struct cache *ca = PTR_CACHE(c, >key, 0); - unsigned threshold = bio->bi_rw & REQ_WRITE + unsigned threshold = op_is_write(bio->bi_op) ? 
c->congested_write_threshold_us : c->congested_read_threshold_us; diff --git a/drivers/md/bcache/journal.c b/drivers/md/bcache/journal.c index af3f9f7..68fa0f0 100644 --- a/drivers/md/bcache/journal.c +++ b/drivers/md/bcache/journal.c @@ -54,7 +54,7 @@ reread: left = ca->sb.bucket_size - offset; bio_reset(bio); bio->bi_iter.bi_sector = bucket + offset; bio->bi_bdev= ca->bdev; - bio->bi_rw = READ; + bio->bi_op = REQ_OP_READ; bio->bi_iter.bi_size= len << 9; bio->bi_end_io = journal_read_endio; @@ -452,7 +452,7 @@ static void do_journal_discard(struct cache *ca) bio->bi_iter.bi_sector = bucket_to_sector(ca->set, ca->sb.d[ja->discard_idx]); bio->bi_bdev= ca->bdev; - bio->bi_rw = REQ_WRITE|REQ_DISCARD; + bio->bi_op = REQ_OP_DISCARD; bio->bi_max_vecs= 1; bio->bi_io_vec = bio->bi_inline_vecs; bio->bi_iter.bi_size= bucket_bytes(ca); @@ -626,7 +626,8 @@ static void journal_write_unlocked(struct closure *cl) bio_reset(bio); bio->bi_iter.bi_sector = PTR_OFFSET(k, i); bio->bi_bdev= ca->bdev; - bio->bi_rw = REQ_WRITE|REQ_SYNC|REQ_META|REQ_FLUSH|REQ_FUA; + bio->bi_op = REQ_OP_WRITE; + bio->bi_rw = REQ_SYNC|REQ_META|REQ_FLUSH|REQ_FUA; bio->bi_iter.bi_size = sectors << 9; bio->bi_end_io = journal_write_endio; diff --git a/drivers/md/bcache/movinggc.c b/drivers/md/bcache/movinggc.c index b929fc9..f33860a 100644 --- a/drivers/md/bcache/movinggc.c +++ b/drivers/md/bcache/movinggc.c @@ -163,7 +163,7 @@ static vo
[PATCH 21/35] bcache: set bi_op to REQ_OP
From: Mike Christie <mchri...@redhat.com> This patch has bcache set the bio bi_op to a REQ_OP, and rq_flag_bits to bi_rw. This patch is compile tested only Signed-off-by: Mike Christie <mchri...@redhat.com> --- drivers/md/bcache/btree.c | 2 ++ drivers/md/bcache/debug.c | 2 ++ drivers/md/bcache/io.c| 2 +- drivers/md/bcache/journal.c | 7 --- drivers/md/bcache/movinggc.c | 2 +- drivers/md/bcache/request.c | 9 + drivers/md/bcache/super.c | 26 +++--- drivers/md/bcache/writeback.c | 4 ++-- 8 files changed, 32 insertions(+), 22 deletions(-) diff --git a/drivers/md/bcache/btree.c b/drivers/md/bcache/btree.c index 22b9e34..752a44f 100644 --- a/drivers/md/bcache/btree.c +++ b/drivers/md/bcache/btree.c @@ -295,6 +295,7 @@ static void bch_btree_node_read(struct btree *b) closure_init_stack(); bio = bch_bbio_alloc(b->c); + bio->bi_op = REQ_OP_READ; bio->bi_rw = REQ_META|READ_SYNC; bio->bi_iter.bi_size = KEY_SIZE(>key) << 9; bio->bi_end_io = btree_node_read_endio; @@ -397,6 +398,7 @@ static void do_btree_node_write(struct btree *b) b->bio->bi_end_io = btree_node_write_endio; b->bio->bi_private = cl; + b->bio->bi_op = REQ_OP_WRITE; b->bio->bi_rw = REQ_META|WRITE_SYNC|REQ_FUA; b->bio->bi_iter.bi_size = roundup(set_bytes(i), block_bytes(b->c)); bch_bio_map(b->bio, i); diff --git a/drivers/md/bcache/debug.c b/drivers/md/bcache/debug.c index 52b6bcf..8df9e66 100644 --- a/drivers/md/bcache/debug.c +++ b/drivers/md/bcache/debug.c @@ -52,6 +52,7 @@ void bch_btree_verify(struct btree *b) bio->bi_bdev= PTR_CACHE(b->c, >key, 0)->bdev; bio->bi_iter.bi_sector = PTR_OFFSET(>key, 0); bio->bi_iter.bi_size= KEY_SIZE(>key) << 9; + bio->bi_op = REQ_OP_READ; bio->bi_rw = REQ_META|READ_SYNC; bch_bio_map(bio, sorted); @@ -114,6 +115,7 @@ void bch_data_verify(struct cached_dev *dc, struct bio *bio) check = bio_clone(bio, GFP_NOIO); if (!check) return; + check->bi_op = REQ_OP_READ; check->bi_rw |= READ_SYNC; if (bio_alloc_pages(check, GFP_NOIO)) diff --git a/drivers/md/bcache/io.c b/drivers/md/bcache/io.c index 86a0bb8..f10a9a0 100644 --- a/drivers/md/bcache/io.c +++ b/drivers/md/bcache/io.c @@ -111,7 +111,7 @@ void bch_bbio_count_io_errors(struct cache_set *c, struct bio *bio, struct bbio *b = container_of(bio, struct bbio, bio); struct cache *ca = PTR_CACHE(c, >key, 0); - unsigned threshold = bio->bi_rw & REQ_WRITE + unsigned threshold = op_is_write(bio->bi_op) ? 
c->congested_write_threshold_us : c->congested_read_threshold_us; diff --git a/drivers/md/bcache/journal.c b/drivers/md/bcache/journal.c index af3f9f7..68fa0f0 100644 --- a/drivers/md/bcache/journal.c +++ b/drivers/md/bcache/journal.c @@ -54,7 +54,7 @@ reread: left = ca->sb.bucket_size - offset; bio_reset(bio); bio->bi_iter.bi_sector = bucket + offset; bio->bi_bdev= ca->bdev; - bio->bi_rw = READ; + bio->bi_op = REQ_OP_READ; bio->bi_iter.bi_size= len << 9; bio->bi_end_io = journal_read_endio; @@ -452,7 +452,7 @@ static void do_journal_discard(struct cache *ca) bio->bi_iter.bi_sector = bucket_to_sector(ca->set, ca->sb.d[ja->discard_idx]); bio->bi_bdev= ca->bdev; - bio->bi_rw = REQ_WRITE|REQ_DISCARD; + bio->bi_op = REQ_OP_DISCARD; bio->bi_max_vecs= 1; bio->bi_io_vec = bio->bi_inline_vecs; bio->bi_iter.bi_size= bucket_bytes(ca); @@ -626,7 +626,8 @@ static void journal_write_unlocked(struct closure *cl) bio_reset(bio); bio->bi_iter.bi_sector = PTR_OFFSET(k, i); bio->bi_bdev= ca->bdev; - bio->bi_rw = REQ_WRITE|REQ_SYNC|REQ_META|REQ_FLUSH|REQ_FUA; + bio->bi_op = REQ_OP_WRITE; + bio->bi_rw = REQ_SYNC|REQ_META|REQ_FLUSH|REQ_FUA; bio->bi_iter.bi_size = sectors << 9; bio->bi_end_io = journal_write_endio; diff --git a/drivers/md/bcache/movinggc.c b/drivers/md/bcache/movinggc.c index b929fc9..f33860a 100644 --- a/drivers/md/bcache/movinggc.c +++ b/drivers/md/bcache/movinggc.c @@ -163,7 +163,7 @@ static void read_moving(struct cache_set *c)
Re: btrfs on top of bcache on top of dmcrypt on top of md raid5
Use all defaults for everything. Anything reasonably new should do the right thing, including 4096-byte alignment. gargamel:~# cryptsetup luksDump /dev/md8 [snip] Payload offset: 3072 This is a bit weird because the default is 4096. But since the LUKS offset (header + payload + extra unused space) is 2MiB, it doesn't affect alignment. There may be unpatched bugs (fixes not backported) in the tools of current long term supported distros that can cause misalignment. Probably the top concern would be parted/libparted, which would start partition 1 at LBA 63, which is not aligned. The upstream tools have for a long time now set partition 1 to LBA 2048, but these crusty old unpatched versions just seem to persist like a booger you can't flick off. It's really annoying - this idea of "stable bugs" that go on and on for a decade. Chris Murphy
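The LBA-63-versus-2048 concern can be checked directly with parted rather than by eyeballing offsets; a minimal sketch against an example disk:

    # print partition start sectors
    parted /dev/sda unit s print
    # ask parted whether partition 1 meets the disk's optimal alignment
    parted /dev/sda align-check opt 1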
Re: btrfs on top of bcache on top of dmcrypt on top of md raid5
On Sun, Feb 14, 2016 at 01:43:05PM -0700, Chris Murphy wrote: > Use all defaults for everything. Anything reasonably new should do the > right thing, including 4096-byte alignment. > > gargamel:~# cryptsetup luksDump /dev/md8 > [snip] > Payload offset: 3072 > > This is a bit weird because the default is 4096. But since the LUKS > offset (header + payload + extra unused space) is 2MiB, it doesn't > affect alignment. There may be unpatched bugs (fixes not backported) in the > tools of current long term supported distros that can cause > misalignment. Probably the top concern would be parted/libparted, which > would start partition 1 at LBA 63, which is not aligned. The upstream > tools have for a long time now set partition 1 to LBA 2048, but these > crusty old unpatched versions just seem to persist like a booger you > can't flick off. It's really annoying - this idea of "stable bugs" > that go on and on for a decade. Indeed. Thankfully my partitions now start at 2048 like you say. The only thing I did wrong last time (when using bcache) was the stacking order: md5 - dmcrypt - bcache - btrfs, with ssd - dmcrypt for the cache device. This was stupid; I needed to do this instead: md5 - bcache - dmcrypt - btrfs, with the ssd used directly. So I think at this point, just to be future-proof, I'm going to add bcache on top of all block devices I have, before putting dmcrypt on top, even if I don't have a cache device; a sketch of that layout follows below. That way I can later add a cache device without problems. Without doing that, adding bcache later is a full re-install :( Marc -- "A mouse is a device used to point at the xterm you want to type in" - A.S.R. Microsoft is to operating systems what McDonalds is to gourmet cooking Home page: http://marc.merlins.org/
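The future-proofing idea from this message, sketched end to end with bcache-tools and cryptsetup (untested; device and mapping names are examples):

    make-bcache -B /dev/md5                 # register the array as a backing device -> /dev/bcache0
    cryptsetup luksFormat /dev/bcache0      # dmcrypt goes on top of bcache this time
    cryptsetup luksOpen /dev/bcache0 cryptdata
    mkfs.btrfs /dev/mapper/cryptdata
    # later, when an SSD is available, create a cache set and attach it live:
    make-bcache -C /dev/sdf                 # prints the cache set UUID
    echo "$CSET_UUID" > /sys/block/bcache0/bcache/attach

A backing device with no cache attached behaves as passthrough, which is what makes registering bcache early so cheap.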
btrfs on top of bcache on top of dmcrypt on top of md raid5
I have a 5 drive md array with dmcrypt on top, and btrfs on top of that. Kernel: 4.4, but the filesystem was created 2 years ago with an older version of btrfs. It's littered with files and hardlinks (it's a backup server). Mostly it gets btrfs receive data, and rsyncs of filesystem trees that are occasionally hardlinked to keep history (for data that wasn't on btrfs to start with). Basically the filesystem works, but it's slow: I can see that my system feels sluggish when backups are running, and cronjobs that are somewhat time critical also fail to run in time when rsyncs/backups to that filesystem are running. It's time to re-create it, but this time I'm looking at adding bcache in the middle (backed by an encrypted ssd) to hopefully help with the random I/O bits that won't be as fast on disk-backed raid5. Are there best practises in doing this? Are there issues with the default filesystem options in btrfs? Do I want -m dup considering it's ultimately backed by raid5/hdd and not ssd? (I would think yes, but I've noticed -m dup gets disabled when bcache is in the middle, probably because the detection gets foiled.) Do I want to mess with --nodesize or --sectorsize and adjust for ssd write block size? (with ext4, I use -b 4096 -E stride=128,stripe-width=128) Any specific configuration I ought to do with bcache or mdadm chunk sizes? Does align-payload look ok? cryptsetup luksFormat --align-payload=8192 -s 256 -c aes-xts-plain Thanks, Marc

PS: for reference: As discussed in the past, there seems to be a general agreement that dmcrypt on top of mdadm is better than mdadm on top of dmcrypt now that dmcrypt is multithreaded. My current array and encryption look like this. Currently, I have:

gargamel:~# mdadm --detail /dev/md8
/dev/md8:
        Version : 1.2
  Creation Time : Sat Apr 19 23:03:59 2014
     Raid Level : raid5
     Array Size : 7813523456 (7451.56 GiB 8001.05 GB)
  Used Dev Size : 1953380864 (1862.89 GiB 2000.26 GB)
   Raid Devices : 5
  Total Devices : 5
    Persistence : Superblock is persistent
  Intent Bitmap : Internal
    Update Time : Thu Feb 11 08:26:45 2016
          State : active
 Active Devices : 5
Working Devices : 5
 Failed Devices : 0
  Spare Devices : 0
         Layout : left-symmetric
     Chunk Size : 256K

gargamel:~# cryptsetup luksDump /dev/md8
LUKS header information for /dev/md8
Version:        1
Cipher name:    aes
Cipher mode:    xts-plain64
Hash spec:      sha1
Payload offset: 3072
MK bits:        256

Thanks, Marc -- "A mouse is a device used to point at the xterm you want to type in" - A.S.R. Microsoft is to operating systems what McDonalds is to gourmet cooking Home page: http://marc.merlins.org/
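On the alignment questions, each layer's offset can be verified rather than guessed. Offsets below are in 512-byte sectors, so any multiple of 8 is 4KiB-aligned; "cryptdata" is an example mapping name:

    mdadm --detail /dev/md8 | grep 'Chunk Size'            # 256K here, 4KiB-aligned
    cryptsetup luksDump /dev/md8 | grep 'Payload offset'   # 3072 sectors = 1.5MiB, aligned
    blockdev --getalignoff /dev/mapper/cryptdata           # 0 means the kernel sees it aligned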
[PATCH 21/35] bcache: set bi_op to REQ_OP
From: Mike Christie <mchri...@redhat.com>

This patch has bcache set the bio bi_op to a REQ_OP, and rq_flag_bits to bi_rw.

This patch is compile tested only.

Signed-off-by: Mike Christie <mchri...@redhat.com>
---
 drivers/md/bcache/btree.c     |  2 ++
 drivers/md/bcache/debug.c     |  2 ++
 drivers/md/bcache/io.c        |  2 +-
 drivers/md/bcache/journal.c   |  7 ---
 drivers/md/bcache/movinggc.c  |  2 +-
 drivers/md/bcache/request.c   |  9 +
 drivers/md/bcache/super.c     | 26 +++---
 drivers/md/bcache/writeback.c |  4 ++--
 8 files changed, 32 insertions(+), 22 deletions(-)

diff --git a/drivers/md/bcache/btree.c b/drivers/md/bcache/btree.c
index 22b9e34..752a44f 100644
--- a/drivers/md/bcache/btree.c
+++ b/drivers/md/bcache/btree.c
@@ -295,6 +295,7 @@ static void bch_btree_node_read(struct btree *b)
 	closure_init_stack(&cl);
 	bio = bch_bbio_alloc(b->c);
+	bio->bi_op = REQ_OP_READ;
 	bio->bi_rw = REQ_META|READ_SYNC;
 	bio->bi_iter.bi_size = KEY_SIZE(&b->key) << 9;
 	bio->bi_end_io = btree_node_read_endio;
@@ -397,6 +398,7 @@ static void do_btree_node_write(struct btree *b)
 	b->bio->bi_end_io = btree_node_write_endio;
 	b->bio->bi_private = cl;
+	b->bio->bi_op = REQ_OP_WRITE;
 	b->bio->bi_rw = REQ_META|WRITE_SYNC|REQ_FUA;
 	b->bio->bi_iter.bi_size = roundup(set_bytes(i), block_bytes(b->c));
 	bch_bio_map(b->bio, i);
diff --git a/drivers/md/bcache/debug.c b/drivers/md/bcache/debug.c
index db68562..4c48783 100644
--- a/drivers/md/bcache/debug.c
+++ b/drivers/md/bcache/debug.c
@@ -52,6 +52,7 @@ void bch_btree_verify(struct btree *b)
 	bio->bi_bdev		= PTR_CACHE(b->c, &b->key, 0)->bdev;
 	bio->bi_iter.bi_sector	= PTR_OFFSET(&b->key, 0);
 	bio->bi_iter.bi_size	= KEY_SIZE(&v->key) << 9;
+	bio->bi_op		= REQ_OP_READ;
 	bio->bi_rw		|= REQ_META|READ_SYNC;
 	bch_bio_map(bio, sorted);
@@ -114,6 +115,7 @@ void bch_data_verify(struct cached_dev *dc, struct bio *bio)
 	check = bio_clone(bio, GFP_NOIO);
 	if (!check)
 		return;
+	check->bi_op = REQ_OP_READ;
 	check->bi_rw |= READ_SYNC;
 	if (bio_alloc_pages(check, GFP_NOIO))
diff --git a/drivers/md/bcache/io.c b/drivers/md/bcache/io.c
index 86a0bb8..f10a9a0 100644
--- a/drivers/md/bcache/io.c
+++ b/drivers/md/bcache/io.c
@@ -111,7 +111,7 @@ void bch_bbio_count_io_errors(struct cache_set *c, struct bio *bio,
 	struct bbio *b = container_of(bio, struct bbio, bio);
 	struct cache *ca = PTR_CACHE(c, &b->key, 0);
-	unsigned threshold = bio->bi_rw & REQ_WRITE
+	unsigned threshold = op_is_write(bio->bi_op)
 		? c->congested_write_threshold_us
 		: c->congested_read_threshold_us;
diff --git a/drivers/md/bcache/journal.c b/drivers/md/bcache/journal.c
index af3f9f7..68fa0f0 100644
--- a/drivers/md/bcache/journal.c
+++ b/drivers/md/bcache/journal.c
@@ -54,7 +54,7 @@ reread:		left = ca->sb.bucket_size - offset;
 		bio_reset(bio);
 		bio->bi_iter.bi_sector	= bucket + offset;
 		bio->bi_bdev		= ca->bdev;
-		bio->bi_rw		= READ;
+		bio->bi_op		= REQ_OP_READ;
 		bio->bi_iter.bi_size	= len << 9;
 		bio->bi_end_io		= journal_read_endio;
@@ -452,7 +452,7 @@ static void do_journal_discard(struct cache *ca)
 		bio->bi_iter.bi_sector	= bucket_to_sector(ca->set,
						   ca->sb.d[ja->discard_idx]);
 		bio->bi_bdev		= ca->bdev;
-		bio->bi_rw		= REQ_WRITE|REQ_DISCARD;
+		bio->bi_op		= REQ_OP_DISCARD;
 		bio->bi_max_vecs	= 1;
 		bio->bi_io_vec		= bio->bi_inline_vecs;
 		bio->bi_iter.bi_size	= bucket_bytes(ca);
@@ -626,7 +626,8 @@ static void journal_write_unlocked(struct closure *cl)
 		bio_reset(bio);
 		bio->bi_iter.bi_sector	= PTR_OFFSET(k, i);
 		bio->bi_bdev	= ca->bdev;
-		bio->bi_rw	= REQ_WRITE|REQ_SYNC|REQ_META|REQ_FLUSH|REQ_FUA;
+		bio->bi_op	= REQ_OP_WRITE;
+		bio->bi_rw	= REQ_SYNC|REQ_META|REQ_FLUSH|REQ_FUA;
 		bio->bi_iter.bi_size = sectors << 9;
 		bio->bi_end_io	= journal_write_endio;
diff --git a/drivers/md/bcache/movinggc.c b/drivers/md/bcache/movinggc.c
index b929fc9..f33860a 100644
--- a/drivers/md/bcache/movinggc.c
+++ b/drivers/md/bcache/movinggc.c
@@ -163,7 +163,7 @@ static void read_movi
Is btrfs on top of bcache stable now?
On Mon, Apr 20, 2015 at 10:27:05AM +0000, Hugo Mills wrote: See the first issue here: https://btrfs.wiki.kernel.org/index.php/Gotchas Hi Hugo, looking at the page again, I see "bcache + btrfs does not seem to be stable yet" linking to a thread more than 2 years old and btrfs kernels that wouldn't be stable without bcache anyway. I've seen others mention they switched to bcache recently and not seen new "it's broken" reports. So, is it ok 1) to assume bcache and btrfs play ok together now? 2) remove the warning from that gotchas page? Thanks, Marc -- "A mouse is a device used to point at the xterm you want to type in" - A.S.R. Microsoft is to operating systems what McDonalds is to gourmet cooking Home page: http://marc.merlins.org/ | PGP 1024R/763BE901
Re: Is btrfs on top of bcache stable now?
I'm one of those that used to have problems with btrfs on top of bcache. After some corruptions, I gave up this setup. Recently (from February, I think) I gave it another shot, and I have had no problems since. I use bcache in writeback mode, with very good performance. I'm feeling btrfs very stable in this setup. Best Regards, Fabio Pfeifer 2015-04-20 11:49 GMT-03:00 Marc MERLIN m...@merlins.org: [Marc's message quoted in full; trimmed, see above]
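[For reference, and not from the thread itself: the writeback mode Fabio mentions is a per-backing-device sysfs setting, so switching an existing bcache device looks something like this; the device name is an example:]

# echo writeback > /sys/block/bcache0/bcache/cache_mode
# cat /sys/block/bcache0/bcache/state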
Re: Recovering BTRFS from bcache failure.
On Tue, Apr 7, 2015 at 11:40 PM, Dan Merillat dan.meril...@gmail.com wrote: Bcache failures are nasty, because they leave a mix of old and new data on the disk. In this case, there was very little dirty data, but of course the tree roots were dirty and out-of-sync. [full dmesg and btrfs-progs output quoted; trimmed, it is reproduced in the original post at the end of this thread] CCing some more people on this one, while this filesystem isn't important I'd like to know that restore from backup isn't the only option for BTRFS corruption. All of the tools simply throw up their hands and bail when confronted with this filesystem, even btrfs-image.
Re: Recovering BTRFS from bcache failure.
It's a known bug with bcache and enabling discard: it was discarding sections containing data it wanted. After a reboot bcache refused to accept the cache data, and of course it was dirty because I'm frankly too stupid to breathe sometimes. So yes, it's a bcache issue, but that's unresolvable. I'm trying to rescue the btrfs data that it trashed. On Wed, Apr 8, 2015 at 2:27 PM, Cameron Berkenpas c...@neo-zeon.de wrote: Hello, I had some luck in the past with btrfs restore using the -r option. I don't recall how I determined the roots... Maybe I tried random numbers? I was able to recover nearly all of my data from a bcache related crash from over a year ago. What kind of bcache failure did you see? I've been doing some testing recently and ran into 2 bcache failures. With both of these failures, I had a "bad btree header at bucket" error message (which is entirely different from the crash I had over a year back). I'm currently trying a different SSD to see if that alleviates the issue. The error makes me think that it's a bcache specific issue that's unrelated to btrfs or possibly (in my case) an issue with the previous SSD. Did you encounter this same error? With my 2 most recent crashes, I didn't try to recover very hard (or even try "btrfs recover" at all) as I've been taking daily backups. I did try btrfsck, and not only would it fail, it would segfault. -Cameron [earlier messages in the thread quoted in full; trimmed, see the original post below]
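[As an aside, not from the thread: the discard behaviour Dan describes is a per-cache-device switch in sysfs, and bcache defaults to it being off. On an affected setup it can be inspected and turned off with something like the following; the cache-set UUID path is an example:]

# cat /sys/fs/bcache/<cache-set-uuid>/cache0/discard
# echo 0 > /sys/fs/bcache/<cache-set-uuid>/cache0/discard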
Re: Recovering BTRFS from bcache failure.
Sorry, I pressed send before I finished my thoughts. btrfs restore gets nowhere with any options. btrfs-recover says the superblocks are fine, and chunk recover does nothing after a few hours of reading. Everything else bails out with the errors I listed above. On Wed, Apr 8, 2015 at 2:36 PM, Dan Merillat dan.meril...@gmail.com wrote: [the two preceding messages in this thread quoted in full; trimmed, see above]
Re: Recovering BTRFS from bcache failure.
Dan Merillat dan.meril...@gmail.com schrieb: [Dan's report quoted in full; trimmed, see the original post below] There's always the last resort (LAST RESORT!) btrfs-zero-log. It may destroy some of your data, however, and can make things even worse if other repairs could've helped before. So here's some pointers:
* btrfs-find-root: find a working tree-root (no idea how to set it, tho)
* mount -o recovery: mount in recovery mode (tries to mount with a working superblock backup)
* btrfs restore: command to get files off a broken fs (at least what is readable, no guarantees for sane file contents, tho, I guess)
It's a bit hard to follow the discussion here because posts from Cameron are missing in my NNTP reader (I'm using the gmane gateway to read here). So I'm answering
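[Purely as an illustration of the pointers above, not from the thread; device, mount point, and the bytenr are placeholders. btrfs-find-root proposes candidate tree roots, a recovery mount can be attempted first, and btrfs restore's -t option takes a tree root bytenr to pull files off with:]

# btrfs-find-root /dev/bcache0
# mount -o recovery /dev/bcache0 /mnt
# btrfs restore -t <bytenr> /dev/bcache0 /mnt/recovered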
Re: Recovering BTRFS from bcache failure.
Any ideas on where to start with this? I did flush the cache out to disk before I made changes to the bcache configuration, so there shouldn't be anything completely missing, just some bits of stale metadata. If I can get the tools to take the closest match and run with it, it would probably recover nearly everything. At worst, is there a way to scan the metadata blocks and rebuild from found extent-trees? On Tue, Apr 7, 2015 at 11:40 PM, Dan Merillat dan.meril...@gmail.com wrote: [the original report quoted in full; trimmed, see the original post below]
Recovering BTRFS from bcache failure.
Bcache failures are nasty, because they leave a mix of old and new data on the disk. In this case, there was very little dirty data, but of course the tree roots were dirty and out-of-sync.

fileserver:/usr/src/btrfs-progs# ./btrfs --version
Btrfs v3.18.2
kernel version 3.18

[  572.573566] BTRFS info (device bcache0): enabling auto recovery
[  572.573619] BTRFS info (device bcache0): disk space caching is enabled
[  574.266055] BTRFS (device bcache0): parent transid verify failed on 7567956930560 wanted 613690 found 613681
[  574.276952] BTRFS (device bcache0): parent transid verify failed on 7567956930560 wanted 613690 found 613681
[  574.277008] BTRFS: failed to read tree root on bcache0
[  574.277187] BTRFS (device bcache0): parent transid verify failed on 7567956930560 wanted 613690 found 613681
[  574.277356] BTRFS (device bcache0): parent transid verify failed on 7567956930560 wanted 613690 found 613681
[  574.277398] BTRFS: failed to read tree root on bcache0
[  574.285955] BTRFS (device bcache0): parent transid verify failed on 7567965720576 wanted 613689 found 613694
[  574.298741] BTRFS (device bcache0): parent transid verify failed on 7567965720576 wanted 613689 found 610499
[  574.298804] BTRFS: failed to read tree root on bcache0
[  575.047079] BTRFS (device bcache0): bad tree block start 0 7567954464768
[  575.111495] BTRFS (device bcache0): parent transid verify failed on 7567954464768 wanted 613688 found 613685
[  575.111559] BTRFS: failed to read tree root on bcache0
[  575.121749] BTRFS (device bcache0): bad tree block start 0 7567954214912
[  575.131803] BTRFS (device bcache0): parent transid verify failed on 7567954214912 wanted 613687 found 613680
[  575.131866] BTRFS: failed to read tree root on bcache0
[  575.180101] BTRFS: open_ctree failed

all the btrfs tools throw up their hands with similar errors:

fileserver:/usr/src/btrfs-progs# btrfs restore /dev/bcache0 -l
parent transid verify failed on 7567956930560 wanted 613690 found 613681
parent transid verify failed on 7567956930560 wanted 613690 found 613681
parent transid verify failed on 7567956930560 wanted 613690 found 613681
parent transid verify failed on 7567956930560 wanted 613690 found 613681
Ignoring transid failure
Couldn't setup extent tree
Couldn't setup device tree
Could not open root, trying backup super
[the same transid failures and errors repeat for both backup supers]

fileserver:/usr/src/btrfs-progs# ./btrfsck --repair /dev/bcache0 --init-extent-tree
enabling repair mode
parent transid verify failed on 7567956930560 wanted 613690 found 613681
parent transid verify failed on 7567956930560 wanted 613690 found 613681
parent transid verify failed on 7567956930560 wanted 613690 found 613681
parent transid verify failed on 7567956930560 wanted 613690 found 613681
Ignoring transid failure
Couldn't setup extent tree
Couldn't setup device tree
Couldn't open file system

Annoyingly:

# ./btrfs-image -c9 -t4 -s -w /dev/bcache0 /tmp/test.out
parent transid verify failed on 7567956930560 wanted 613690 found 613681
parent transid verify failed on 7567956930560 wanted 613690 found 613681
parent transid verify failed on 7567956930560 wanted 613690 found 613681
parent transid verify failed on 7567956930560 wanted 613690 found 613681
Ignoring transid failure
Couldn't setup extent tree
Open ctree failed
create failed (Success)

So I can't even send an image for people to look at.
Suggestion on reducing short kernel hangs from my btrfs filesystems: bcache?
I have a server which runs zoneminder (video recording which is CPU and disk IO intensive) while also doing a bunch of I/O over serial ports. I have a dual core Intel(R) Core(TM) i3-2100T CPU @ 2.50GHz (4 virtual CPUs in /proc/cpuinfo). It's pretty clear that when zoneminder is doing more work, my programs that talk to serial ports start failing due to delays on the kernel side and desynchronization, causing serial port protocol errors (I'm using USB serial adapters, and use 12 of them). I'm pretty sure it's because of delays in the kernel more than user space, but can't prove that easily. I have a preempt kernel, kernel 3.16.3:

CONFIG_TREE_PREEMPT_RCU=y
CONFIG_PREEMPT_RCU=y
CONFIG_PREEMPT_NOTIFIERS=y
# CONFIG_PREEMPT_NONE is not set
# CONFIG_PREEMPT_VOLUNTARY is not set
CONFIG_PREEMPT=y
CONFIG_PREEMPT_COUNT=y
CONFIG_DEBUG_PREEMPT=y

From what I can tell, things did get worse after I upgraded from ext4 to btrfs (not counting times where I resync the software raid5 underneath or run a btrfs scrub). I may try to see if PREEMPT_VOLUNTARY might work better, but I'm thinking putting an SSD in front of that mdadm RAID5 array will help by relieving the IO load and hopefully giving more time for the CPU to handle serial port requests. I'm actually not sure if my issue is btrfs interrupting serial port connections due to PREEMPT, or if serial port connections aren't being serviced quickly enough because the kernel is busy with btrfs and PREEMPT hasn't kicked in yet. From reading the list, bcache may work with btrfs, but before I try that, I was curious if there are other or better ways to use an SSD to make btrfs less impacting on my server? Thanks, Marc
--
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/ | PGP 1024R/763BE901
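[Not from the thread, but one cheap thing to try before adding hardware: lowering the I/O priority of zoneminder's processes so they yield to everything else. This only has a real effect under the CFQ I/O scheduler, and "zmc" (ZoneMinder's capture daemon) is an assumed process name here:]

# put the oldest running capture process into the idle I/O class
ionice -c3 -p "$(pgrep -o zmc)"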
Re: btrfs on bcache
Hi, has this issue been resolved? I would like to use the bcache + btrfs combo. Thanks
Re: btrfs on bcache
After completely losing my filesystem twice because of this bug, I gave up using btrfs on top of bcache (also writeback). In my case, I used to have some subvolumes and some snapshots of these subvolumes, but not many of them. The btrfs mantra "backup, backup and backup" saved me. Best regards, Fábio Pfeifer 2014-07-30 20:01 GMT-03:00 Larkin Lowrey llow...@nuclearwinter.com: [Larkin's message quoted in full; trimmed, see below]
btrfs on bcache
Concerning http://thread.gmane.org/gmane.comp.file-systems.btrfs/31018, does this bug still exist? Kernel 3.14 B: 2x HDD 1 TB C: 1x SSD 256 GB # make-bcache -B /dev/sda /dev/sdb -C /dev/sdc --cache_replacement_policy=lru # mkfs.btrfs -d raid1 -m raid1 -L BTRFS_RAID /dev/bcache0 /dev/bcache1 I still have no "incomplete page write" messages in dmesg | grep btrfs, and the checksums of some manually reviewed files are okay. Who has more experience with this? Thanks, - dp
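[Two checks that go a bit further than grepping dmesg and hand-verifying checksums; illustrative, not part of dp's mail, and the mount point is an example:]

# read every block and verify it against its checksums, in the foreground
btrfs scrub start -B /mnt/btrfs_raid
# show per-device I/O and checksum error counters
btrfs device stats /mnt/btrfs_raid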
Re: btrfs on bcache
dptrash posted on Thu, 31 Jul 2014 17:35:44 +0200 as excerpted: [the question quoted in full; trimmed, see above] See the reply (not mine) to your earlier post of the question: http://permalink.gmane.org/gmane.linux.kernel.bcache.devel/2602 -- Duncan - List replies preferred. No HTML msgs. "Every nonfree program has a lord, a master -- and if you use the program, he is your master." Richard Stallman
Re: btrfs on bcache
I've been running two backup servers, with 25T and 20T of data, using btrfs on bcache (writeback) for about 7 months. I periodically run btrfs scrubs and backup verifies (SHA1 hashes) and have never had a corruption issue. My use of btrfs is simple, though, with no subvolumes and no btrfs level raid. My bcache backing devices are LVM volumes that span multiple md raid6 arrays. So, either the bug has been fixed or my configuration is not susceptible. I'm running kernel 3.15.5-200.fc20.x86_64. --Larkin On 7/30/2014 5:04 PM, dptr...@arcor.de wrote: [the question quoted in full; trimmed, see above]
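[The "backup verifies (SHA1 hashes)" Larkin mentions can be done with stock tools; a minimal sketch, with paths as examples: build a manifest on the source and check it against the backup copy:]

# on the source
find /data -type f -print0 | xargs -0 sha1sum > /tmp/manifest.sha1
# on the backup, from the corresponding directory
sha1sum -c /tmp/manifest.sha1 | grep -v ': OK$'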
Re: btrfs on bcache
On 2014-04-30 14:16, Felix Homann wrote: [the question quoted in full; trimmed, see the original post below] In all practicality, I don't think anyone who frequents the list knows. I do know that there are a number of people (myself included) who avoid bcache in general because of having issues with seemingly random kernel OOPSes when it is linked in (either as a module or compiled in), even when it isn't being used. My advice would be to just test it with some non-essential data (maybe set up a virtual machine?).
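[A throwaway test does not even need a virtual machine; a sketch using sparse files on loop devices, with sizes and paths being arbitrary:]

truncate -s 10G /var/tmp/backing.img
truncate -s 1G /var/tmp/cache.img
backing=$(losetup -f --show /var/tmp/backing.img)
cache=$(losetup -f --show /var/tmp/cache.img)
make-bcache -B "$backing" -C "$cache"
mkfs.btrfs /dev/bcache0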
btrfs on bcache
Hi, a couple of months ago there was some discussion about issues when using btrfs on bcache: http://thread.gmane.org/gmane.comp.file-systems.btrfs/31018 From looking at the mailing list archives I cannot tell whether or not this issue has been resolved in current kernels from either bcache's or btrfs' side. Can anyone tell me what's the current state of this issue? Should it be safe to use btrfs on bcache by now? Thanks and kind regards, Felix
Re: btrfs on bcache
On Mon, 2014-01-06 at 15:37 -0800, Kent Overstreet wrote: On Fri, Dec 20, 2013 at 03:46:30PM +0000, Chris Mason wrote: On Fri, 2013-12-20 at 10:42 -0200, Fábio Pfeifer wrote: Hello, I put the WARN_ON(1); after the printk lines (incomplete page read and incomplete page write) in extent_io.c. Here some call traces:

[ 19.509497] incomplete page read in btrfs with offset 2560 and length 1536
[ 19.509500] [ cut here ]
[ 19.509528] WARNING: CPU: 2 PID: 220 at fs/btrfs/extent_io.c:2441 end_bio_extent_readpage+0x788/0xc20 [btrfs]()
[ 19.509530] Modules linked in: cdc_acm fuse iTCO_wdt iTCO_vendor_support snd_hda_codec_analog coretemp kvm_intel kvm raid1 ext4 crc16 md_mod mbcache jbd2 microcode nvidia(PO) psmouse pcspkr evdev serio_raw i2c_i801 lpc_ich i2c_core snd_hda_intel sky2 skge i82975x_edac button asus_atk0110 snd_hda_codec snd_hwdep shpchp snd_pcm snd_page_alloc snd_timer acpi_cpufreq snd edac_core soundcore processor vboxdrv(O) sr_mod cdrom ata_generic pata_acpi hid_generic usbhid hid usb_storage sd_mod pata_marvell firewire_ohci uhci_hcd ahci ehci_pci firewire_core ata_piix libahci crc_itu_t ehci_hcd libata scsi_mod usbcore usb_common btrfs crc32c libcrc32c xor raid6_pq bcache
[ 19.509578] CPU: 2 PID: 220 Comm: btrfs-endio-met Tainted: P W O 3.12.5-1-ARCH #1
[ 19.509580] Hardware name: System manufacturer System Product Name/P5WDG2 WS Pro, BIOS 090503/06/2008
[ 19.509581] 0009 880231a63cb0 814ee37b
[ 19.509585] 880231a63ce8 81062bcd ea00085eaec0
[ 19.509587] 8802320cc9c0 880233b0e000 880231a63cf8
[ 19.509590] Call Trace:
[ 19.509596] [814ee37b] dump_stack+0x54/0x8d
[ 19.509601] [81062bcd] warn_slowpath_common+0x7d/0xa0
[ 19.509603] [81062caa] warn_slowpath_null+0x1a/0x20
[ 19.509614] [a00b7ba8] end_bio_extent_readpage+0x788/0xc20 [btrfs]

This should mean that bcache is either failing to read some blocks properly or is fiddling with the bv_len/bv_offset fields. Could someone from bcache comment?

Oh man, I found this and then threw up my hands in despair. Bcache isn't doing anything with the bv_len/bv_offset fields; it may clone the biovec so it can retry a bio on error, if the biovecs weren't all whole pages, otherwise it just passes the biovec down with the next bio to the underlying cache/backing device.

What btrfs appears to be doing though - I couldn't believe that code actually _worked_. Jens, please jump in here, but AFAIK bv_len/bv_offset are in practice undefined after a bio's completed; they might have been updated if the driver was using blk_update_request, but many drivers that just process the entire bio all at once just won't touch those fields - and that includes anything that clones the bio (md/dm). This is probably relevant to immutable biovecs here...

Ok, I looked again at the relevant btrfs code, I guess I can see how this printk isn't normally triggered. But Chris, _what on earth_ is btrfs trying to check for here? And why is it using bv_offset and bv_len further down in end_bio_extent_readpage()?

After the IO is done, we're recording the specific logical byte range that covered the IO. In practice it's always the full page, we can switch to just trusting PAGE_CACHE_SIZE. -chris
Re: btrfs on bcache
On Wed, Jan 08, 2014 at 07:35:32PM +0000, Chris Mason wrote: On Mon, 2014-01-06 at 15:37 -0800, Kent Overstreet wrote: Ok, I looked again at the relevant btrfs code, I guess I can see how this printk isn't normally triggered. But Chris, _what on earth_ is btrfs trying to check for here? And why is it using bv_offset and bv_len further down in end_bio_extent_readpage()? After the IO is done, we're recording the specific logical byte range that covered the IO. In practice it's always the full page, we can switch to just trusting PAGE_CACHE_SIZE. Yeah, the code already assumes it was doing PAGE_CACHE_SIZE reads; what you're effectively checking is that the driver did the bvec all at once, and that it didn't process half a bvec, update it, then process the rest - which is a completely fine thing to do. So for now - yeah, the correct thing to do is to just ignore bv_offset/bv_len and go by PAGE_CACHE_SIZE. But - after immutable biovecs is in, _then_ you'll be able to depend on bv_offset/bv_len remaining unchanged (and you can get rid of your dependency on PAGE_CACHE_SIZE bvecs).
Re: Migrate to bcache: A few questions
Austin S Hemmelgarn posted on Wed, 01 Jan 2014 15:12:40 -0500 as excerpted: [Austin's layout suggestion and portage-tree note quoted in full; trimmed, see below] Interesting observation. I had not seen it here (with the gentoo tree and overlays on btrfs), but that's very likely because all my btrfs are on SSD, as I upgraded to both at the same time, because my previous default filesystem choice, reiserfs, isn't well suited to SSD due to excessive writing due to the journaling. I do know slow syncs and portage dep-calculations were one of the reasons I switched to SSD (and thus btrfs), however. That was getting pretty painful on spinning rust, at least with reiserfs. And I imagine btrfs on single-device spinning rust would if anything be worse, at least for syncs, due to the default dup metadata, meaning at least three writes (and three seeks) for each file, once for the data, twice for the metadata. -- Duncan - List replies preferred. No HTML msgs. "Every nonfree program has a lord, a master -- and if you use the program, he is your master." Richard Stallman
Re: Migrate to bcache: A few questions
Austin S Hemmelgarn posted on Mon, 30 Dec 2013 11:02:21 -0500 as excerpted: [Austin's experience report quoted; trimmed, see below] Basically what I posted, but now with added real experience! (TM) =:^) [Austin's suggested disk layout quoted; trimmed, see below] Again, very similar to my own recommendation. Nice to see others saying the same thing. =:^) -- Duncan - List replies preferred. No HTML msgs. "Every nonfree program has a lord, a master -- and if you use the program, he is your master." Richard Stallman
Re: Migrate to bcache: A few questions
On 12/30/2013 11:02 AM, Austin S Hemmelgarn wrote: [the suggested disk layout quoted in full; trimmed, see below] On this specific note, I would actually suggest against putting the portage tree on btrfs, it makes syncing go ridiculously slow, and it also seems to slow down emerge as well.
Re: Migrate to bcache: A few questions
On Mon, Dec 30, 2013 at 02:22:55AM +0100, Kai Krakow wrote: These thoughts are actually quite interesting. So you are saying that data may not be fully written to SSD although the kernel thinks so? This is That, and worse. Incidentally, I have just posted on my G+ about this: https://plus.google.com/106981743284611658289/posts/Us8yjK9SPs6 which is mostly links to http://lkcl.net/reports/ssd_analysis.html https://www.usenix.org/conference/fast13/understanding-robustness-ssds-under-power-fault After you read those, you'll never think twice about SSDs and data loss anymore :-/ (I kind of found that out myself over time too, but these have much more data than I got myself empirically on a couple of SSDs) Marc -- "A mouse is a device used to point at the xterm you want to type in" - A.S.R. Microsoft is to operating systems what McDonalds is to gourmet cooking Home page: http://marc.merlins.org/ | PGP 1024R/763BE901
Re: Migrate to bcache: A few questions
On 12/29/2013 04:11 PM, Kai Krakow wrote: Hello list! I'm planning to buy a small SSD (around 60GB) and use it for bcache in front of my 3x 1TB HDD btrfs setup (mraid1+draid0) using write-back caching. Btrfs is my root device, thus the system must be able to boot from bcache using init ramdisk. My /boot is a separate filesystem outside of btrfs and will be outside of bcache. I am using Gentoo as my system. I have a few questions:

* How stable is it? I've read about some csum errors lately...
* I want to migrate my current storage to bcache without replaying a backup. Is it possible?
* Did others already use it? What is the perceived performance for desktop workloads in comparison to not using bcache?
* How well does bcache handle power outages? Btrfs does handle them very well since many months.
* How well does it play with dracut as initrd? Is it as simple as telling it the new device nodes or is there something complicated to configure?
* How does bcache handle a failing SSD when it starts to wear out in a few years?
* Is it worth waiting for hot-relocation support in btrfs to natively use a SSD as cache?
* Would you recommend going with a bigger/smaller SSD? I'm planning to use only 75% of it for bcache so wear-leveling can work better, maybe use another part of it for hibernation (suspend to disk).

I've actually tried a similar configuration myself a couple of times (also using Gentoo, in fact), and I can tell you from experience that unless things have changed greatly since kernel 3.12.1, it really isn't worth the headaches. Setting it up on an already installed system is a serious pain because the backing device has to be reformatted with a bcache super-block. In addition, every kernel that I have tried that had bcache compiled in or loaded as a module had issues: I would see a kernel OOPS on average once a day from the bcache code, usually followed shortly by a panic from some other unrelated subsystem. I didn't get any actual data corruption, but I wasn't using btrfs at the time for any of my filesystems. As an alternative to using bcache, you might try something similar to the following:

64G SSD with /boot, /, and /usr
Other HDD with /var, /usr/portage, /usr/src, and /home
tmpfs or ramdisk for /tmp and /var/tmp

This is essentially what I use now, and I have found that it significantly improves system performance.
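[Not from the thread: expressed as an /etc/fstab sketch, the split Austin describes would look roughly like this; device names, filesystem types, and sizes are all illustrative:]

/dev/sda1  /boot     ext2   noauto,noatime        1 2
/dev/sda2  /         ext4   noatime               0 1
/dev/sda3  /usr      ext4   noatime               0 2
/dev/sdb1  /var      ext4   noatime               0 2
/dev/sdb2  /home     ext4   noatime               0 2
tmpfs      /tmp      tmpfs  nosuid,nodev,size=4G  0 0
tmpfs      /var/tmp  tmpfs  nosuid,nodev,size=8G  0 0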
Re: Migrate to bcache: A few questions
Marc MERLIN m...@merlins.org schrieb:
> On Mon, Dec 30, 2013 at 02:22:55AM +0100, Kai Krakow wrote:
>> These thoughts are actually quite interesting. So you are saying that
>> data may not be fully written to SSD although the kernel thinks so?
>> This is
>
> That, and worse. Incidentally, I have just posted on my G+ about this:
> https://plus.google.com/106981743284611658289/posts/Us8yjK9SPs6
> which is mostly links to
> http://lkcl.net/reports/ssd_analysis.html
> https://www.usenix.org/conference/fast13/understanding-robustness-ssds-under-power-fault
>
> After you read those, you'll think twice about SSDs and data loss from
> now on :-/
> (I kind of found that out myself over time too, but these have much
> more data than I got myself empirically on a couple of SSDs)

The bad thing here is: even battery-backed RAID controllers won't help you here. I start to understand why I still don't trust this new technology entirely.

Thanks,
Kai
Re: Migrate to bcache: A few questions
Duncan 1i5t5.dun...@cox.net schrieb:

[ spoiler: tldr ;-) ]

>> * How stable is it? I've read about some csum errors lately...
>
> FWIW, both bcache and btrfs are new and still developing technology.
> While I'm using btrfs here, I have tested usable (which for root either
> means directly bootable, or that you have tested booting to a recovery
> image and restoring from there; I do the former here) backups, as
> STRONGLY recommended for btrfs in its current state, but haven't had to
> use them. And I considered bcache previously and might otherwise be
> using it, but at least personally, I'm not willing to try BOTH of them
> at once, since neither one is mature yet, and if there are problems, as
> there very well might be, I'd have the additional issue of figuring out
> which one was the problem, and I'm personally not prepared to deal with
> that.

I mostly trust btrfs by now. Don't get me wrong: I still have my nightly backup job syncing the complete system to an external drive - nothing beats a good backup. But btrfs has reliably survived multiple power losses, kernel panics/freezes, unreliable USB connections, ... It looks very stable from that view. Yes, it may have bugs that could introduce errors fatal to the filesystem structure. But generally, under usual workloads, it has proven stable for me. At least for desktop workloads.

> Instead, at this point I'd recommend choosing /either/ bcache /or/
> btrfs, and using bcache with a more mature filesystem like ext4 or
> (what I used for years previous and still use for spinning rust)
> reiserfs.

I used reiserfs for several years, a long time ago. But it scales very poorly for parallel/threaded workloads, which is a show stopper for server use. Still, it always survived even the worst failure scenarios (like the SCSI bus going offline for some RAID members), and the tools distributed with it were able to recover all data even when the FS was damaged beyond the usual measures you would try when it no longer mounts. I had been on Ext3 before, and more than once a simple power loss during high server workload destroyed the filesystem beyond repair, with fsck only making it worse. Since reiserfs did not scale well and the ext* filesystems had annoyed me more than once, we decided to go with XFS. While it tends to wipe some data after a power loss and leave you with zero-filled files, it has proven extremely reliable even in the situations mentioned above, like a dying SCSI bus. Not to the extent reiserfs managed, but still very satisfying. The big plus: it scales extremely well with parallel workloads and can be optimized for the stripe configuration of the underlying RAID layer. So I made it my default filesystem for the desktop, too - with the above-mentioned annoying habit of zeroing out recently touched files when the system crashes. But well, we all have proven backups, right? Yep, I also learned that lesson... *sigh*

But btrfs, first announced while I was already jealously looking at ZFS, seemed to be the FS of my choice, giving me flexible RAID setups, snapshots... I'm quite happy with it, although it feels slow sometimes. I simply threw more RAM at it - now it is okay.

> And as I said, keep your backups as current as you're willing to deal
> with losing what's not backed up, and tested usable and (for root)
> either bootable or restorable from alternate boot, because while at
> least btrfs is /reasonably/ stable for /ordinary/ daily use, there
> remain corner-cases and you never know when your case is going to BE a
> corner-case!

I've got a small rescue system I can boot which has btrfs-tools and a recent kernel, so I can flexibly repair, restore, or do whatever I want with my backup. The backup itself is not bootable (although it probably could be, if I changed some configuration files).

>> * I want to migrate my current storage to bcache without replaying a
>> backup. Is it possible?
>
> Since I've not actually used bcache, I won't try to answer some of
> these, but will answer based on what I've seen on the list where I
> can... I don't know on this one.

I remember someone created some python scripts to make it possible - wrt btrfs especially. Can't remember the link; maybe I'm able to dig it up. But at least I read it as: there's no migration path provided by bcache itself. I had hoped otherwise...

>> * Did others already use it? What is the perceived performance for
>> desktop workloads in comparison to not using bcache?
>
> Others are indeed already using it. I've seen some btrfs/bcache
> problems reported on this list, but as mentioned above, when both are
> in use that means figuring out which is the problem, and at least from
> the btrfs side I've not seen a lot of resolution in that regard. From
> here it /looks/ like that's simply being punted at this time, as there
> are still more easily traceable problems to work on first, without the
> additional bcache variable. But it's quite possible
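The python scripts Kai half-remembers are plausibly g2p's "blocks" conversion tool (https://github.com/g2p/blocks), which rewrites a partition in place so a bcache superblock sits in front of the existing filesystem. Treat the following one-liner as a hypothetical sketch to be verified against that project's documentation before any use:

  # in-place conversion via the third-party 'blocks' tool (not part of bcache-tools)
  blocks to-bcache /dev/sdb1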
Migrate to bcache: A few questions
Hello list!

I'm planning to buy a small SSD (around 60GB) and use it for bcache in front of my 3x 1TB HDD btrfs setup (mraid1+draid0) using write-back caching. Btrfs is my root device, thus the system must be able to boot from bcache using an init ramdisk. My /boot is a separate filesystem outside of btrfs and will be outside of bcache. I am using Gentoo as my system.

I have a few questions:

* How stable is it? I've read about some csum errors lately...
* I want to migrate my current storage to bcache without replaying a backup. Is it possible?
* Did others already use it? What is the perceived performance for desktop workloads in comparison to not using bcache?
* How well does bcache handle power outages? Btrfs has handled them very well for many months now.
* How well does it play with dracut as initrd? Is it as simple as telling it the new device nodes or is there something complicated to configure?
* How does bcache handle a failing SSD when it starts to wear out in a few years?
* Is it worth waiting for hot-relocation support in btrfs to natively use an SSD as cache?
* Would you recommend going with a bigger/smaller SSD? I'm planning to use only 75% of it for bcache so wear-leveling can work better, and maybe use another part of it for hibernation (suspend to disk).

Regards,
Kai
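For context on these questions, the standard provisioning steps for a fresh bcache setup look like this (a sketch using bcache-tools; device names and the UUID are placeholders). The fact that make-bcache reformats the backing device is exactly what makes the in-place migration question above the hard one:

  # format backing and cache devices (DESTROYS existing data on them)
  make-bcache -B /dev/sdb
  make-bcache -C /dev/sda2
  # attach the cache set to the new bcache device, then put btrfs on top
  echo <cache-set-uuid> > /sys/block/bcache0/bcache/attach
  mkfs.btrfs /dev/bcache0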
Re: Migrate to bcache: A few questions
On Dec 29, 2013, at 2:11 PM, Kai Krakow hurikhan77+bt...@gmail.com wrote:
> * How stable is it? I've read about some csum errors lately…

Seems like bcache devs are still looking into the recent btrfs csum issues.

> * I want to migrate my current storage to bcache without replaying a
>   backup. Is it possible?
> * Did others already use it? What is the perceived performance for
>   desktop workloads in comparison to not using bcache?
> * How well does bcache handle power outages? Btrfs does handle them
>   very well since many months.
> * How well does it play with dracut as initrd? Is it as simple as
>   telling it the new device nodes or is there something complicated to
>   configure?
> * How does bcache handle a failing SSD when it starts to wear out in a
>   few years?

I think most of these questions are better suited for the bcache list.

I think there are still many uncertainties about the behavior of SSDs during power failures when they aren't explicitly designed with power-failure protection in mind. At best I'd hope for a rollback involving data loss, but hopefully not a corrupt file system. I'd rather lose the last minute of data supposedly written to the drive than have to do a full restore from backup.

> * Is it worth waiting for hot-relocation support in btrfs to natively
>   use a SSD as cache?

I haven't read anything about it. I don't see it listed in the project ideas.

> * Would you recommend going with a bigger/smaller SSD? I'm planning to
>   use only 75% of it for bcache so wear-leveling can work better, maybe
>   use another part of it for hibernation (suspend to disk).

I think that depends greatly on workload. If you're writing or reading a lot of disparate files, or doing a lot of small random file writes (mail server), I'd go bigger. By default sequential IO isn't cached, so I think you can get a big boost in responsiveness with a relatively small bcache size.

Chris Murphy
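The "sequential IO isn't cached" behavior Chris mentions is a tunable heuristic rather than a fixed rule; a sketch of inspecting and adjusting it through the standard bcache sysfs interface (values illustrative):

  # show the sequential bypass threshold (bcache's default is 4.0M)
  cat /sys/block/bcache0/bcache/sequential_cutoff
  # lower it, or set 0 to cache everything regardless of access pattern
  echo 1M > /sys/block/bcache0/bcache/sequential_cutoff
  echo 0 > /sys/block/bcache0/bcache/sequential_cutoff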
Re: Migrate to bcache: A few questions
Chris Murphy li...@colorremedies.com schrieb:
> I think most of these questions are better suited for the bcache list.

Ah yes, you are right. I will repost the non-btrfs related questions to the bcache list. But actually I am most interested in using bcache together with btrfs, so getting a general picture of its current state in this combination would be nice - and so these questions may be partially appropriate here.

> I think there are still many uncertainties about the behavior of SSDs
> during power failures when they aren't explicitly designed with power
> failure protection in mind. At best I'd hope for a rollback involving
> data loss, but hopefully not a corrupt file system. I'd rather lose the
> last minute of data supposedly written to the drive, than have to do a
> full restore from backup.

These thoughts are actually quite interesting. So you are saying that data may not be fully written to the SSD although the kernel thinks so? This is probably very dangerous. The bcache module could not ensure coherence between its backing devices and its own contents - data loss will occur and probably destroy important filesystem structures.

I understand your words as: data may only be partially written. This, of course, may happen to HDDs as well. But usually a filesystem works with transactions, so the last incomplete transaction can simply be thrown away. I hope bcache implements the same architecture. But what does it mean for the stacked write-back architecture? As I understand it, bcache may use write-through for sequential writes but write-back for random writes. In this case, part of the data may have hit the backing device while other data exists only in the bcache. If that last transaction is not closed due to power loss, and then thrown away, we have part of the transaction already written to the backing device that the filesystem does not know of after resume.

I'd appreciate some thoughts about this, but the topic is probably also best moved over to the bcache list.

Thanks,
Kai
Re: Migrate to bcache: A few questions
On Dec 29, 2013, at 6:22 PM, Kai Krakow hurikhan77+bt...@gmail.com wrote:
> So you are saying that data may not be fully written to SSD although
> the kernel thinks so?

Drives shouldn't lie when asked to flush to disk, but they do. This older article at LWN is a decent primer on the subject of write barriers:
http://lwn.net/Articles/283161/

> This is probably very dangerous. The bcache module could not ensure
> coherence between its backing devices and its own contents - and data
> loss will occur and probably destroy important file system structures.

I don't know the details; there's more detail on lkml.org and the bcache lists. My impression is that, short of bugs, it should be much safer than you describe. It's not like a linear/concat md or LVM device-failure scenario. There's good info in the bcache.h file:
http://lxr.free-electrons.com/source/drivers/md/bcache/bcache.h

If anything, once the kinks are worked out, under heavy random write IO I'd expect bcache to improve the likelihood that data isn't lost: the faster speed of the SSD means we get a faster commit of the data to stable media. Also, bcache assumes the cache is always dirty on startup, no matter whether the shutdown was clean or dirty, so the code is explicitly designed to resolve the state of the cache relative to the backing device. It's actually pretty fascinating work.

It may not be required, but I'd expect we'd want the write cache on the backing device disabled. It should still honor write barriers, but it kinda seems unnecessary and riskier to have it enabled (which is the default with consumer drives).

> As I understand, bcache may use write-through for sequential writes,
> but write-back for random writes. In this case, part of the data may
> have hit the backing device, other data does only exist in the bcache.
> If that last transaction is not closed due to power-loss, and then
> thrown away, we have part of the transaction already written to the
> backing device that the filesystem does not know of after resume.

In the write-through case we should be no worse off than the bare drive in a power loss. In the write-back case the SSD should have committed more data than the HDD could have in the same situation. I don't understand the details of how partially successful writes to the backing media are handled when the system comes back up. Since bcache is also COW, SSD blocks aren't reused until data is committed to the backing device.

Chris Murphy
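Both of the knobs discussed here map to one-liners; a sketch with hypothetical device names (the sysfs paths are the standard bcache layout):

  # disable the volatile write cache on the backing HDD, as suggested above
  hdparm -W0 /dev/sdc
  # inspect and switch bcache's caching policy at runtime
  cat /sys/block/bcache0/bcache/cache_mode
  echo writethrough > /sys/block/bcache0/bcache/cache_mode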
Re: Migrate to bcache: A few questions
Kai Krakow posted on Sun, 29 Dec 2013 22:11:16 +0100 as excerpted:

> Hello list!
>
> I'm planning to buy a small SSD (around 60GB) and use it for bcache in
> front of my 3x 1TB HDD btrfs setup (mraid1+draid0) using write-back
> caching. Btrfs is my root device, thus the system must be able to boot
> from bcache using init ramdisk. My /boot is a separate filesystem
> outside of btrfs and will be outside of bcache. I am using Gentoo as my
> system.

Gentooer here too. =:^)

> I have a few questions:
>
> * How stable is it? I've read about some csum errors lately...

FWIW, both bcache and btrfs are new and still developing technology. While I'm using btrfs here, I have tested usable (which for root either means directly bootable, or that you have tested booting to a recovery image and restoring from there; I do the former here) backups, as STRONGLY recommended for btrfs in its current state, but haven't had to use them.

And I considered bcache previously and might otherwise be using it, but at least personally, I'm not willing to try BOTH of them at once, since neither one is mature yet, and if there are problems, as there very well might be, I'd have the additional issue of figuring out which one was the problem, and I'm personally not prepared to deal with that.

Instead, at this point I'd recommend choosing /either/ bcache /or/ btrfs, and using bcache with a more mature filesystem like ext4 or (what I used for years previous and still use for spinning rust) reiserfs.

And as I said, keep your backups as current as you're willing to deal with losing what's not backed up, and tested usable and (for root) either bootable or restorable from alternate boot, because while at least btrfs is /reasonably/ stable for /ordinary/ daily use, there remain corner-cases and you never know when your case is going to BE a corner-case!

> * I want to migrate my current storage to bcache without replaying a
>   backup. Is it possible?

Since I've not actually used bcache, I won't try to answer some of these, but will answer based on what I've seen on the list where I can... I don't know on this one.

> * Did others already use it? What is the perceived performance for
>   desktop workloads in comparison to not using bcache?

Others are indeed already using it. I've seen some btrfs/bcache problems reported on this list, but as mentioned above, when both are in use that means figuring out which is the problem, and at least from the btrfs side I've not seen a lot of resolution in that regard. From here it /looks/ like that's simply being punted at this time, as there are still more easily traceable problems to work on first, without the additional bcache variable. But it's quite possible the bcache list is actively tackling btrfs/bcache combination problems, as I'm not subscribed there.

So I can't answer the desktop performance comparison question directly, but given that I /am/ running btrfs on SSD, I /can/ say I'm quite happy with that. =:^)

Keep in mind... We're talking storage cache here. Given the cost of memory and common system configurations these days, 4-16 gig of memory on a desktop isn't unusual or cost prohibitive, and a common desktop working set should well fit. I suspect my desktop setup, 16 gigs memory backing a 6-core AMD fx6100 (bulldozer-1) @ 3.6 GHz, is probably a bit toward the high side even for a gentooer, but not inordinately so.

Based on my usage... Typical app memory usage runs 1-2 GiB (that's with KDE 4.12.49 from the gentoo/kde overlay, but USE=-semantic-desktop, etc). Buffer memory runs a few MiB but isn't normally significant, so it can fold into that same 1-2 GiB too. That leaves a full 14 GiB for cache. But at least with /my/ usage, normal non-update cache memory usage tends to stay below ~6 GiB too, so total apps/buffer/cache memory usage tends to be below 8 GiB as well.

When I'm doing multi-job builds or working with big media files, I'll sometimes go above 8 gig usage, and that occasional cache-spill was why I upgraded to 16 gig. But in practice, 10 gig would take care of that most of the time, and were it not for the accident of powers-of-two meaning 16 gig is the notch above 8 gig, 10 or 12 gig would be plenty. Truth be told, I so seldom use that last 4 gig that it's almost embarrassing.*

* Tho if I ran multi-GiB VMs, that'd use up that extra memory real fast! But while that /is/ becoming more common, I'm not exactly sure I'd classify 4 gigs plus of VM usage as desktop usage just yet. Workstation, yes, and definitely server, but not really desktop.

All that as background to this...

* Cache works only after first access. If you only access something occasionally, it may not be worth caching at all.

* Similarly, if access isn't time critical - think of playing a huge video file, where only a few meg in memory at once is plenty, and where storage access is several times faster than play-speed - cache isn't particularly useful.

* Bcache
Re: btrfs on bcache
(resend in text only)

Some more information about this issue. I installed my system last November (Arch x86_64) with kernel 3.11. At that time I didn't see any csum errors or incomplete page read errors. Some time later these errors started to show up. I don't know exactly if it was in the 3.11 -> 3.12 upgrade or somewhere in the 3.12 cycle. I've been using bcache in writeback mode from the beginning.

I did some more testing:
- tried bcache in writethrough, writearound and none modes;
- tried Linux kernel 3.13-rc5

The errors didn't go away (maybe because my filesystem is already corrupted). I didn't have time to test with kernel 3.11 again. But lately the errors increased, and they started to make my system unstable, and then unusable. I had to reformat everything and recover my backups. I don't have my / and /home on btrfs over bcache anymore, but I can run some tests on a spare HD and SSD I have here. I'll report back after Christmas.

thanks,
Fabio

2013/12/20 Chris Mason c...@fb.com:
> [Fábio's report and call traces quoted in full; snipped - see the
> messages below]
>
> This should mean that bcache is either failing to read some blocks
> properly or is fiddling with the bv_len/bv_offset fields. Could someone
> from bcache comment?
>
> -chris
Re: btrfs on bcache
On Thu, Dec 19, 2013 at 8:59 PM, Chris Mason c...@fb.com wrote:
> On Wed, 2013-12-18 at 18:17 +0100, eb wrote:
> Btrfs shouldn't be setting the offset on the bios. Are you able to add
> a WARN_ON to the message that prints this so we can see the stack
> trace?

If you send me a patch - my experience hacking on the kernel is exactly 0 - I'll try to see if I can compile a custom kernel and get it running.

> Could you please cc the bcache and btrfs list together?

Done.

I did some more testing - I copied an image of a 128GB drive over the network (via netcat) onto the bcache/btrfs system and verified the results twice using sha1sum. They're identical on the source system (which is *not* using bcache) and on the bcache/btrfs setup. I've gotten a lot of the incomplete write errors and a few csum errors in dmesg, but apparently they haven't done any harm? Not sure how remarkable this is, as large sequential writes like these are supposed to bypass the cache anyway, but I assume they still have to go through the subsystem.
Re: btrfs on bcache
Hello,

I put the WARN_ON(1); after the printk lines (incomplete page read and incomplete page write) in extent_io.c. Here are some call traces:

[ 19.509497] incomplete page read in btrfs with offset 2560 and length 1536
[ 19.509500] ------------[ cut here ]------------
[ 19.509528] WARNING: CPU: 2 PID: 220 at fs/btrfs/extent_io.c:2441 end_bio_extent_readpage+0x788/0xc20 [btrfs]()
[ 19.509530] Modules linked in: cdc_acm fuse iTCO_wdt iTCO_vendor_support snd_hda_codec_analog coretemp kvm_intel kvm raid1 ext4 crc16 md_mod mbcache jbd2 microcode nvidia(PO) psmouse pcspkr evdev serio_raw i2c_i801 lpc_ich i2c_core snd_hda_intel sky2 skge i82975x_edac button asus_atk0110 snd_hda_codec snd_hwdep shpchp snd_pcm snd_page_alloc snd_timer acpi_cpufreq snd edac_core soundcore processor vboxdrv(O) sr_mod cdrom ata_generic pata_acpi hid_generic usbhid hid usb_storage sd_mod pata_marvell firewire_ohci uhci_hcd ahci ehci_pci firewire_core ata_piix libahci crc_itu_t ehci_hcd libata scsi_mod usbcore usb_common btrfs crc32c libcrc32c xor raid6_pq bcache
[ 19.509578] CPU: 2 PID: 220 Comm: btrfs-endio-met Tainted: P W O 3.12.5-1-ARCH #1
[ 19.509580] Hardware name: System manufacturer System Product Name/P5WDG2 WS Pro, BIOS 090503/06/2008
[ 19.509581] 0009 880231a63cb0 814ee37b
[ 19.509585] 880231a63ce8 81062bcd ea00085eaec0
[ 19.509587] 8802320cc9c0 880233b0e000 880231a63cf8
[ 19.509590] Call Trace:
[ 19.509596] [814ee37b] dump_stack+0x54/0x8d
[ 19.509601] [81062bcd] warn_slowpath_common+0x7d/0xa0
[ 19.509603] [81062caa] warn_slowpath_null+0x1a/0x20
[ 19.509614] [a00b7ba8] end_bio_extent_readpage+0x788/0xc20 [btrfs]
[ 19.509617] [8107010b] ? lock_timer_base.isra.35+0x2b/0x50
[ 19.509619] [8106f660] ? detach_if_pending+0x120/0x120
[ 19.509623] [811d98dd] bio_endio+0x1d/0x30
[ 19.509632] [a0090227] end_workqueue_fn+0x37/0x40 [btrfs]
[ 19.509642] [a00c6b1e] worker_loop+0x14e/0x560 [btrfs]
[ 19.509646] [810952b2] ? default_wake_function+0x12/0x20
[ 19.509656] [a00c69d0] ? btrfs_queue_worker+0x330/0x330 [btrfs]
[ 19.509672] [81084fe0] kthread+0xc0/0xd0
[ 19.509677] [81084f20] ? kthread_create_on_node+0x120/0x120
[ 19.509680] [814fce7c] ret_from_fork+0x7c/0xb0
[ 19.509683] [81084f20] ? kthread_create_on_node+0x120/0x120
[ 19.509687] ---[ end trace bbc8d0d088375446 ]---
[ 25.592100] incomplete page read in btrfs with offset 2560 and length 1536
[ 25.592105] ------------[ cut here ]------------
[ 25.592141] WARNING: CPU: 0 PID: 442 at fs/btrfs/extent_io.c:2441 end_bio_extent_readpage+0x788/0xc20 [btrfs]()
[ 25.592143] Modules linked in: cdc_acm fuse iTCO_wdt iTCO_vendor_support snd_hda_codec_analog coretemp kvm_intel kvm raid1 ext4 crc16 md_mod mbcache jbd2 microcode nvidia(PO) psmouse pcspkr evdev serio_raw i2c_i801 lpc_ich i2c_core snd_hda_intel sky2 skge i82975x_edac button asus_atk0110 snd_hda_codec snd_hwdep shpchp snd_pcm snd_page_alloc snd_timer acpi_cpufreq snd edac_core soundcore processor vboxdrv(O) sr_mod cdrom ata_generic pata_acpi hid_generic usbhid hid usb_storage sd_mod pata_marvell firewire_ohci uhci_hcd ahci ehci_pci firewire_core ata_piix libahci crc_itu_t ehci_hcd libata scsi_mod usbcore usb_common btrfs crc32c libcrc32c xor raid6_pq bcache
[ 25.592205] CPU: 0 PID: 442 Comm: btrfs-endio-met Tainted: P W O 3.12.5-1-ARCH #1
[ 25.592208] Hardware name: System manufacturer System Product Name/P5WDG2 WS Pro, BIOS 090503/06/2008
[ 25.592211] 0009 880229773cb0 814ee37b
[ 25.592216] 880229773ce8 81062bcd ea0002a20a80
[ 25.592220] 88022d3ab180 88022d326000 880229773cf8
[ 25.592225] Call Trace:
[ 25.592234] [814ee37b] dump_stack+0x54/0x8d
[ 25.592240] [81062bcd] warn_slowpath_common+0x7d/0xa0
[ 25.592245] [81062caa] warn_slowpath_null+0x1a/0x20
[ 25.592262] [a00b7ba8] end_bio_extent_readpage+0x788/0xc20 [btrfs]
[ 25.592267] [810701ef] ? try_to_del_timer_sync+0x4f/0x70
[ 25.592271] [81070262] ? del_timer_sync+0x52/0x60
[ 25.592275] [8106f660] ? detach_if_pending+0x120/0x120
[ 25.592280] [811d98dd] bio_endio+0x1d/0x30
[ 25.592296] [a0090227] end_workqueue_fn+0x37/0x40 [btrfs]
[ 25.592312] [a00c6b1e] worker_loop+0x14e/0x560 [btrfs]
[ 25.592318] [810952b2] ? default_wake_function+0x12/0x20
[ 25.592335] [a00c69d0] ? btrfs_queue_worker+0x330/0x330 [btrfs]
[ 25.592350] [81084fe0] kthread+0xc0/0xd0
[ 25.592353] [81084f20] ? kthread_create_on_node+0x120/0x120
[ 25.592356] [814fce7c] ret_from_fork+0x7c/0xb0
[ 25.592359] [81084f20
Re: btrfs on bcache
On Fri, 2013-12-20 at 10:42 -0200, Fábio Pfeifer wrote:
> Hello, I put the WARN_ON(1); after the printk lines (incomplete page
> read and incomplete page write) in extent_io.c. Here are some call
> traces:
>
> [call traces quoted in full; snipped - see Fábio's message above]

This should mean that bcache is either failing to read some blocks properly or is fiddling with the bv_len/bv_offset fields. Could someone from bcache comment?

-chris
Re: btrfs on bcache
On Thu, Dec 19, 2013 at 2:04 PM, Fábio Pfeifer fmpfei...@gmail.com wrote:
> Any update on this? I have here exactly the same issue. Kernel
> 3.12.5-1-ARCH, backing device 500 GB IDE, cache 24 GB SSD =
> /dev/bcache0. On /dev/bcache0 I also have 2 subvolumes, / and /home.
> I get lots of messages in dmesg:

I also have this issue. Also, this afternoon I experienced data corruption on my btrfs device (checksum errors), which might or might not be related. I don't really know how to determine the cause, but if anyone has suggestions they'd be appreciated.

Cheers,
Henry de Valence
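For corruption like this, the standard btrfs tooling can at least quantify the damage before deciding whether bcache is implicated (the mount point is a placeholder, and the stats subcommand requires a reasonably recent btrfs-progs):

  # read and verify every extent, reporting checksum failures
  btrfs scrub start -B /mnt
  btrfs scrub status /mnt
  # per-device error counters
  btrfs device stats /mnt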
Re: btrfs on bcache
Any update on this? I have here exactly the same issue.

Kernel 3.12.5-1-ARCH, backing device 500 GB IDE, cache 24 GB SSD = /dev/bcache0. On /dev/bcache0 I also have 2 subvolumes, / and /home. I get lots of messages in dmesg:

(...)
[ 22.282469] BTRFS info (device bcache0): csum failed ino 56193 off 212992 csum 519977505 expected csum 3166125439
[ 22.282656] incomplete page read in btrfs with offset 1024 and length 3072
[ 23.370872] incomplete page read in btrfs with offset 1024 and length 3072
[ 23.370890] BTRFS info (device bcache0): csum failed ino 57765 off 106496 csum 3553846164 expected csum 1299185721
[ 23.505238] incomplete page read in btrfs with offset 2560 and length 1536
[ 23.505256] BTRFS info (device bcache0): csum failed ino 75922 off 172032 csum 1883678196 expected csum 1337496676
[ 23.508535] incomplete page read in btrfs with offset 2560 and length 1536
[ 23.508547] BTRFS info (device bcache0): csum failed ino 74368 off 237568 csum 2863587994 expected csum 2693116460
[ 25.683059] incomplete page read in btrfs with offset 2560 and length 1536
[ 25.683078] BTRFS info (device bcache0): csum failed ino 123709 off 57344 csum 1528117893 expected csum 2239543273
[ 25.684339] incomplete page read in btrfs with offset 1024 and length 3072
[ 26.622384] incomplete page read in btrfs with offset 1024 and length 3072
[ 26.906718] incomplete page read in btrfs with offset 2560 and length 1536
[ 27.823247] incomplete page read in btrfs with offset 1024 and length 3072
[ 27.823265] btrfs_readpage_end_io_hook: 2 callbacks suppressed
[ 27.823271] BTRFS info (device bcache0): csum failed ino 34587 off 16384 csum 1180114025 expected csum 474262911
[ 28.490066] incomplete page read in btrfs with offset 2560 and length 1536
[ 28.490085] BTRFS info (device bcache0): csum failed ino 65817 off 327680 csum 3065880108 expected csum 2663659117
[ 29.413824] incomplete page read in btrfs with offset 1024 and length 3072
[ 41.913857] incomplete page read in btrfs with offset 2560 and length 1536
[ 55.761753] incomplete page read in btrfs with offset 1024 and length 3072
[ 55.761771] BTRFS info (device bcache0): csum failed ino 72835 off 81920 csum 1511792656 expected csum 3733709121
[ 69.636498] incomplete page read in btrfs with offset 2560 and length 1536
(...)

Should I be worried?

thanks,
Fabio Pfeifer

2013/12/18 eb e...@gmx.ch:
> [eb's original report quoted in full; snipped - see the start of this
> thread below]
Re: btrfs on bcache
Forgot to mention: bcache is in writeback mode.

2013/12/19 Fábio Pfeifer fmpfei...@gmail.com:
> [previous message, including the dmesg excerpts and eb's original
> report, quoted in full; snipped - see above]
Re: btrfs on bcache
On Wed, 2013-12-18 at 18:17 +0100, eb wrote:
> I've recently set up a system (Kernel 3.12.5-1-ARCH) which is layered
> as follows:
>
> /dev/sdb3 -> cache0 (80 GB Intel SSD)
> /dev/sdc1 -> backing device (2 TB WD HDD)
> sdb3 + sdc1 = /dev/bcache0
>
> On /dev/bcache0, there's a btrfs filesystem with 2 subvolumes, mounted
> as / and /home. What's been bothering me are the following entries in
> my kernel log:
>
> [13811.845540] incomplete page write in btrfs with offset 1536 and length 2560
> [13870.326639] incomplete page write in btrfs with offset 3072 and length 1024
>
> The offset/length values are always either 1536/2560 or 3072/1024;
> they sum up nicely to 4K. There are 607 of those in there as I am
> writing this; the machine has been up 18 hours and been under no
> particular I/O strain (it's a desktop).

Btrfs shouldn't be setting the offset on the bios. Are you able to add a WARN_ON to the message that prints this so we can see the stack trace?

Could you please cc the bcache and btrfs lists together?

-chris
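The debugging change Chris is asking for, and which Fábio confirms applying in his follow-up above, amounts to a one-line addition in fs/btrfs/extent_io.c. A sketch against the 3.12-era code; the printk wording is taken from the log output in this thread, and the exact surrounding context is abbreviated:

  --- a/fs/btrfs/extent_io.c
  +++ b/fs/btrfs/extent_io.c
  @@ (in the bio completion paths that print the warning) @@
   		printk("incomplete page read in btrfs with offset %u and length %u\n",
   		       bvec->bv_offset, bvec->bv_len);
  +		WARN_ON(1);	/* dump the call chain that built this bio */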
btrfs on bcache
I've recently set up a system (Kernel 3.12.5-1-ARCH) which is layered as follows:

/dev/sdb3 -> cache0 (80 GB Intel SSD)
/dev/sdc1 -> backing device (2 TB WD HDD)
sdb3 + sdc1 = /dev/bcache0

On /dev/bcache0, there's a btrfs filesystem with 2 subvolumes, mounted as / and /home. What's been bothering me are the following entries in my kernel log:

[13811.845540] incomplete page write in btrfs with offset 1536 and length 2560
[13870.326639] incomplete page write in btrfs with offset 3072 and length 1024

The offset/length values are always either 1536/2560 or 3072/1024; they sum up nicely to 4K. There are 607 of those in there as I am writing this; the machine has been up 18 hours and been under no particular I/O strain (it's a desktop).

Trying to fix this, I detached the cache (still using /dev/bcache0, but without /dev/sdb3 attached), causing these errors to disappear. As soon as I re-attached /dev/sdb3 they started again, so I am fairly sure it's an unfavorable interaction between bcache and btrfs.

Is this something I should be worried about (they're only emitted with KERN_INFO?) or just an alignment problem? The underlying HDD is using 4K sectors, while the block size of bcache seems to be 512; could that be the issue here? I've also encountered incomplete reads and a few csum errors, but I have not been able to trigger these regularly. I have a feeling that the error is more likely to be on the bcache end (I've mailed to that list as well), but any insight into the matter would be much appreciated.

Thanks,
- eb
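The detach/re-attach cycle eb describes, plus the sector-size comparison his alignment question calls for, look like this (device names hypothetical; the sysfs paths are the standard bcache layout):

  # run without the SSD for a while, then re-attach it
  echo 1 > /sys/block/bcache0/bcache/detach
  echo <cset-uuid> > /sys/block/bcache0/bcache/attach   # UUID from bcache-super-show
  # compare logical/physical sector sizes with bcache's recorded block size
  blockdev --getss --getpbsz /dev/sdc1
  bcache-super-show /dev/sdb3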
[PATCH 8/9] bcache: generic_make_request() handles large bios now
So we get to delete our hacky workaround.

Signed-off-by: Kent Overstreet k...@daterainc.com
---
 drivers/md/bcache/bcache.h    |  18
 drivers/md/bcache/io.c        | 100 +-
 drivers/md/bcache/journal.c   |   4 +-
 drivers/md/bcache/request.c   |  16 +++
 drivers/md/bcache/super.c     |  33 ++
 drivers/md/bcache/util.h      |   5 ++-
 drivers/md/bcache/writeback.c |   4 +-
 include/linux/bio.h           |  12 -
 8 files changed, 19 insertions(+), 173 deletions(-)

diff --git a/drivers/md/bcache/bcache.h b/drivers/md/bcache/bcache.h
index 964353c..8f65331 100644
--- a/drivers/md/bcache/bcache.h
+++ b/drivers/md/bcache/bcache.h
@@ -241,19 +241,6 @@ struct keybuf {
 	DECLARE_ARRAY_ALLOCATOR(struct keybuf_key, freelist, KEYBUF_NR);
 };
 
-struct bio_split_pool {
-	struct bio_set		*bio_split;
-	mempool_t		*bio_split_hook;
-};
-
-struct bio_split_hook {
-	struct closure		cl;
-	struct bio_split_pool	*p;
-	struct bio		*bio;
-	bio_end_io_t		*bi_end_io;
-	void			*bi_private;
-};
-
 struct bcache_device {
 	struct closure		cl;
 
@@ -286,8 +273,6 @@ struct bcache_device {
 	int (*cache_miss)(struct btree *, struct search *,
 			  struct bio *, unsigned);
 	int (*ioctl) (struct bcache_device *, fmode_t, unsigned, unsigned long);
-
-	struct bio_split_pool	bio_split_hook;
 };
 
 struct io {
@@ -465,8 +450,6 @@ struct cache {
 	atomic_long_t		meta_sectors_written;
 	atomic_long_t		btree_sectors_written;
 	atomic_long_t		sectors_written;
-
-	struct bio_split_pool	bio_split_hook;
 };
 
 struct gc_stat {
@@ -901,7 +884,6 @@ void bch_bbio_endio(struct cache_set *, struct bio *, int, const char *);
 void bch_bbio_free(struct bio *, struct cache_set *);
 struct bio *bch_bbio_alloc(struct cache_set *);
 
-void bch_generic_make_request(struct bio *, struct bio_split_pool *);
 void __bch_submit_bbio(struct bio *, struct cache_set *);
 void bch_submit_bbio(struct bio *, struct cache_set *, struct bkey *, unsigned);
 
diff --git a/drivers/md/bcache/io.c b/drivers/md/bcache/io.c
index fa028fa..86a0bb8 100644
--- a/drivers/md/bcache/io.c
+++ b/drivers/md/bcache/io.c
@@ -11,104 +11,6 @@
 
 #include <linux/blkdev.h>
 
-static unsigned bch_bio_max_sectors(struct bio *bio)
-{
-	struct request_queue *q = bdev_get_queue(bio->bi_bdev);
-	struct bio_vec bv;
-	struct bvec_iter iter;
-	unsigned ret = 0, seg = 0;
-
-	if (bio->bi_rw & REQ_DISCARD)
-		return min(bio_sectors(bio), q->limits.max_discard_sectors);
-
-	bio_for_each_segment(bv, bio, iter) {
-		struct bvec_merge_data bvm = {
-			.bi_bdev	= bio->bi_bdev,
-			.bi_sector	= bio->bi_iter.bi_sector,
-			.bi_size	= ret << 9,
-			.bi_rw		= bio->bi_rw,
-		};
-
-		if (seg == min_t(unsigned, BIO_MAX_PAGES,
-				 queue_max_segments(q)))
-			break;
-
-		if (q->merge_bvec_fn &&
-		    q->merge_bvec_fn(q, &bvm, &bv) < (int) bv.bv_len)
-			break;
-
-		seg++;
-		ret += bv.bv_len >> 9;
-	}
-
-	ret = min(ret, queue_max_sectors(q));
-
-	WARN_ON(!ret);
-	ret = max_t(int, ret, bio_iovec(bio).bv_len >> 9);
-
-	return ret;
-}
-
-static void bch_bio_submit_split_done(struct closure *cl)
-{
-	struct bio_split_hook *s = container_of(cl, struct bio_split_hook, cl);
-
-	s->bio->bi_end_io	= s->bi_end_io;
-	s->bio->bi_private	= s->bi_private;
-	bio_endio_nodec(s->bio, 0);
-
-	closure_debug_destroy(&s->cl);
-	mempool_free(s, s->p->bio_split_hook);
-}
-
-static void bch_bio_submit_split_endio(struct bio *bio, int error)
-{
-	struct closure *cl = bio->bi_private;
-	struct bio_split_hook *s = container_of(cl, struct bio_split_hook, cl);
-
-	if (error)
-		clear_bit(BIO_UPTODATE, &s->bio->bi_flags);
-
-	bio_put(bio);
-	closure_put(cl);
-}
-
-void bch_generic_make_request(struct bio *bio, struct bio_split_pool *p)
-{
-	struct bio_split_hook *s;
-	struct bio *n;
-
-	if (!bio_has_data(bio) && !(bio->bi_rw & REQ_DISCARD))
-		goto submit;
-
-	if (bio_sectors(bio) <= bch_bio_max_sectors(bio))
-		goto submit;
-
-	s = mempool_alloc(p->bio_split_hook, GFP_NOIO);
-	closure_init(&s->cl, NULL);
-
-	s->bio		= bio;
-	s->p		= p;
-	s->bi_end_io	= bio->bi_end_io;
-	s->bi_private	= bio->bi_private;
-	bio_get(bio);
-
-	do {
-		n = bio_next_split(bio, bch_bio_max_sectors(bio),
-				   GFP_NOIO, s->p->bio_split
Re: [RFC PATCH 0/7] bcache: md conversion
Dan Williams wrote:
> The consensus from LSF was that bcache need not invent a new interface
> when md and dm can both do the job. As mentioned in patch 7 this series
> aims to be a minimal conversion. Other refactoring items like
> deprecating register_lock for mddev->reconfig_mutex are deferred.
>
> This supports assembly of an already established cache array:
>
>   mdadm -A /dev/md/bcache /dev/sd[ab]
>
> ...will create the /dev/md/bcache container and a subarray representing
> the cache volume. Flash-only or backing-device-only volumes were not
> tested. Create support and hot-add/hot-remove come later.
>
> Note:
> * When attempting to test with small loopback devices (100MB), assembly
>   soft locks in bcache_journal_read(). That hang went away with larger
>   devices, so there seems to be a minimum component device size that
>   needs to be considered in the tooling.

Is there any plan to separate the on-disk layout (per-device headers, etc) from the logic, for the purpose of reuse? I can think of at least one case where this would be extremely useful: integration in BtrFS.

BtrFS already has its own methods for making sure a group of devices are all present when the filesystem is mounted, so it doesn't really need the formatting of the backing device that bcache does to prevent it from being mounted solo. Putting bcache under BtrFS would be silly in the same way as putting it under a raid array, but bcache can't be put on top of BtrFS. Logically, looking at BtrFS' architecture, a cache would likely fit best at the 'block group' level, which IIUC would be roughly equivalent to the recommended 'over raid, under lvm' method of using bcache.
Re: bcache with SSD instead of battery powered raid cards
On 03/13/2012 04:06 AM, Kiran Patil wrote:
> Hi,
>
> Is anybody using bcache with SSD instead of battery-powered RAID cards
> with Btrfs?
>
> Hard drives are cheap and big, SSDs are fast but small and expensive.
> Wouldn't it be nice if you could transparently get the advantages of
> both? With Bcache, you can have your cake and eat it too. Bcache is a
> patch for the Linux kernel to use SSDs to cache other block devices.
> It's analogous to L2Arc for ZFS, but Bcache also does writeback
> caching, and it's filesystem agnostic. It's designed to be switched on
> with a minimum of effort, and to work well without configuration on
> any setup. By default it won't cache sequential IO, just the random
> reads and writes that SSDs excel at. It's meant to be suitable for
> desktops, servers, high end storage arrays, and perhaps even embedded.
>
> http://bcache.evilpiepirate.org/

Did you ever experiment with this? What results did you find?

There is also something similar called flashcache, developed by some Facebook engineers, that I'm interested in trying. They are supposedly using it to speed up mysql+innodb. It is out-of-mainline code though, and I don't think there is much of an effort to get it in. It supports writeback, writethrough and writearound (blocks are never cached on write, only on read) caching. It uses device-mapper to combine your cache block device with your slow spinning block device, and then you put your filesystem on top of that dm device.

https://github.com/facebook/flashcache

Regards,
--Justin
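For comparison with the bcache provisioning discussed elsewhere in this archive, creating a flashcache device is a single device-mapper call; a sketch following the project's documented usage (cache name and device paths are placeholders):

  # build a writeback cache device from an SSD partition and an HDD partition
  flashcache_create -p back fc_root /dev/sdb1 /dev/sdc1
  mkfs.btrfs /dev/mapper/fc_root
  mount /dev/mapper/fc_root /mnt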
bcache with SSD instead of battery powered raid cards
Hi,

Is anybody using bcache with SSD instead of battery-powered RAID cards with Btrfs?

Hard drives are cheap and big, SSDs are fast but small and expensive. Wouldn't it be nice if you could transparently get the advantages of both? With Bcache, you can have your cake and eat it too. Bcache is a patch for the Linux kernel to use SSDs to cache other block devices. It's analogous to L2Arc for ZFS, but Bcache also does writeback caching, and it's filesystem agnostic. It's designed to be switched on with a minimum of effort, and to work well without configuration on any setup. By default it won't cache sequential IO, just the random reads and writes that SSDs excel at. It's meant to be suitable for desktops, servers, high end storage arrays, and perhaps even embedded.

http://bcache.evilpiepirate.org/
http://news.gmane.org/gmane.linux.kernel.bcache.devel

Thanks,
Kiran.