btrfs scrub status misreports as interrupted
Hi all,

While I haven't gotten any "scrub already running" type errors any more, I do get one strange case of state misreport. When running scrub on /home (btrfs RAID10), after 3 of 4 drives have completed, the 4th drive (sdb) will report as interrupted, despite still running:

# btrfs scrub status -d /home
scrub status for 472c9290-3ff2-4096-9c47-0612d3a52cef
scrub device /dev/sda (id 1) history
	scrub started at Sat Nov 22 11:57:34 2014 and finished after 3380 seconds
	total bytes scrubbed: 252.86GiB with 0 errors
scrub device /dev/sdb (id 2) status
	scrub started at Sat Nov 22 11:57:34 2014, interrupted after 3698 seconds, not running
	total bytes scrubbed: 217.50GiB with 0 errors
scrub device /dev/sdc (id 3) history
	scrub started at Sat Nov 22 11:57:34 2014 and finished after 3013 seconds
	total bytes scrubbed: 252.85GiB with 0 errors
scrub device /dev/sdd (id 4) history
	scrub started at Sat Nov 22 11:57:34 2014 and finished after 2994 seconds
	total bytes scrubbed: 252.85GiB with 0 errors

The funny thing is, the time will still update as the scrub keeps going:

# btrfs scrub status -d /home
scrub status for 472c9290-3ff2-4096-9c47-0612d3a52cef
scrub device /dev/sda (id 1) history
	scrub started at Sat Nov 22 11:57:34 2014 and finished after 3380 seconds
	total bytes scrubbed: 252.86GiB with 0 errors
scrub device /dev/sdb (id 2) status
	scrub started at Sat Nov 22 11:57:34 2014, interrupted after 4136 seconds, not running
	total bytes scrubbed: 239.44GiB with 0 errors
scrub device /dev/sdc (id 3) history
	scrub started at Sat Nov 22 11:57:34 2014 and finished after 3013 seconds
	total bytes scrubbed: 252.85GiB with 0 errors
scrub device /dev/sdd (id 4) history
	scrub started at Sat Nov 22 11:57:34 2014 and finished after 2994 seconds
	total bytes scrubbed: 252.85GiB with 0 errors

This has happened a few times, and when sdb finally finishes, the status is then reported correctly as finished:

# btrfs scrub status -d /home
scrub status for 472c9290-3ff2-4096-9c47-0612d3a52cef
scrub device /dev/sda (id 1) history
	scrub started at Sat Nov 22 11:57:34 2014 and finished after 3380 seconds
	total bytes scrubbed: 252.86GiB with 0 errors
scrub device /dev/sdb (id 2) history
	scrub started at Sat Nov 22 11:57:34 2014 and finished after 4426 seconds
	total bytes scrubbed: 252.88GiB with 0 errors
scrub device /dev/sdc (id 3) history
	scrub started at Sat Nov 22 11:57:34 2014 and finished after 3013 seconds
	total bytes scrubbed: 252.85GiB with 0 errors
scrub device /dev/sdd (id 4) history
	scrub started at Sat Nov 22 11:57:34 2014 and finished after 2994 seconds
	total bytes scrubbed: 252.85GiB with 0 errors

Kernel and btrfs-progs version:

# uname -a
Linux marcec 3.16.7-gentoo #1 SMP PREEMPT Fri Oct 31 22:45:54 CET 2014 x86_64 AMD Athlon(tm) 64 X2 Dual Core Processor 4200+ AuthenticAMD GNU/Linux
# btrfs --version
Btrfs v3.17.1

Should I open a report on bugzilla?

-- 
Marc Joliet
--
"People who think they know everything really annoy those of us who know we don't" - Bjarne Stroustrup
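A rough way to catch this symptom from a script, assuming the `-d` output keeps the layout pasted above: take two samples and flag any device that is reported as interrupted in both while its byte count is still growing. The helper names `parse_status` and `misreported` are made up for illustration, and the regular expression has only been matched against the output shown in this mail.

```python
import re

# Matches one per-device record in `btrfs scrub status -d` output as pasted above.
DEVICE_RE = re.compile(
    r"scrub device (?P<dev>\S+) \(id \d+\) (?:history|status)\s+"
    r"scrub started at .*?(?P<state>finished|interrupted) after (?P<secs>\d+) seconds"
    r".*?total bytes scrubbed: (?P<gib>[\d.]+)GiB",
    re.S,
)


def parse_status(text):
    """Return {device: (state, seconds, GiB scrubbed)} for one status dump."""
    return {
        m["dev"]: (m["state"], int(m["secs"]), float(m["gib"]))
        for m in DEVICE_RE.finditer(text)
    }


def misreported(before, after):
    """Devices shown as interrupted in both samples whose progress still advanced."""
    a, b = parse_status(before), parse_status(after)
    return sorted(
        dev
        for dev in a.keys() & b.keys()
        if a[dev][0] == b[dev][0] == "interrupted" and b[dev][2] > a[dev][2]
    )
```

Fed the first two dumps from this mail, `misreported()` would single out /dev/sdb, since its byte count moves from 217.50GiB to 239.44GiB while the state stays "interrupted".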
[PATCH v2] Btrfs: fix allocating memory failure for btrfsic_state structure
The size of @btrfsic_state is more than 2M, so allocating it with kzalloc() is very likely to fail. See the following message:

[91428.902148] Call Trace:
 [816f6e0f] dump_stack+0x4d/0x66
 [811b1c7f] warn_alloc_failed+0xff/0x170
 [811b66e1] __alloc_pages_nodemask+0x951/0xc30
 [811fd9da] alloc_pages_current+0x11a/0x1f0
 [811b1e0b] ? alloc_kmem_pages+0x3b/0xf0
 [811b1e0b] alloc_kmem_pages+0x3b/0xf0
 [811d1018] kmalloc_order+0x18/0x50
 [811d1074] kmalloc_order_trace+0x24/0x140
 [a06c097b] btrfsic_mount+0x8b/0xae0 [btrfs]
 [810af555] ? check_preempt_curr+0x85/0xa0
 [810b2de3] ? try_to_wake_up+0x103/0x430
 [a063d200] open_ctree+0x1bd0/0x2130 [btrfs]
 [a060fdde] btrfs_mount+0x62e/0x8b0 [btrfs]
 [811fd9da] ? alloc_pages_current+0x11a/0x1f0
 [811b0a5e] ? __get_free_pages+0xe/0x50
 [81230429] mount_fs+0x39/0x1b0
 [812509fb] vfs_kern_mount+0x6b/0x150
 [812537fb] do_mount+0x27b/0xc30
 [811b0a5e] ? __get_free_pages+0xe/0x50
 [812544f6] SyS_mount+0x96/0xf0
 [81701970] system_call_fastpath+0x16/0x1b

Since we are allocating memory for a hash table array, it is good if we can allocate contiguous pages here.

Fix this problem by trying kzalloc() first and falling back to vzalloc() if it fails.

Signed-off-by: Wang Shilong <wangshilong1...@gmail.com>
---
v1->v2: include vmalloc.h and switch to the kvfree() helper.
---
 fs/btrfs/check-integrity.c | 14 +++++++++-----
 1 file changed, 9 insertions(+), 5 deletions(-)

diff --git a/fs/btrfs/check-integrity.c b/fs/btrfs/check-integrity.c
index cb7f3fe..f88b3a6 100644
--- a/fs/btrfs/check-integrity.c
+++ b/fs/btrfs/check-integrity.c
@@ -94,6 +94,7 @@
 #include <linux/mutex.h>
 #include <linux/genhd.h>
 #include <linux/blkdev.h>
+#include <linux/vmalloc.h>
 #include "ctree.h"
 #include "disk-io.h"
 #include "hash.h"
@@ -3130,10 +3131,13 @@ int btrfsic_mount(struct btrfs_root *root,
 		       root->sectorsize, PAGE_CACHE_SIZE);
 		return -1;
 	}
-	state = kzalloc(sizeof(*state), GFP_NOFS);
-	if (NULL == state) {
-		printk(KERN_INFO "btrfs check-integrity: kmalloc() failed!\n");
-		return -1;
+	state = kzalloc(sizeof(*state), GFP_KERNEL | __GFP_NOWARN | __GFP_REPEAT);
+	if (!state) {
+		state = vzalloc(sizeof(*state));
+		if (!state) {
+			printk(KERN_INFO "btrfs check-integrity: vzalloc() failed!\n");
+			return -1;
+		}
 	}
 
 	if (!btrfsic_is_initialized) {
@@ -3277,5 +3281,5 @@ void btrfsic_unmount(struct btrfs_root *root,
 
 	mutex_unlock(&btrfsic_mutex);
 
-	kfree(state);
+	kvfree(state);
 }
-- 
1.7.12.4

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] Btrfs: switch to kvfree() helper
A new helper kvfree() in mm/util.c frees memory that came from either kmalloc() or vmalloc(), so use it instead of open-coding the is_vmalloc_addr() check.

Signed-off-by: Wang Shilong <wangshilong1...@gmail.com>
---
 fs/btrfs/raid56.c | 13 +++----------
 1 file changed, 3 insertions(+), 10 deletions(-)

diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c
index 6a41631..12e343b 100644
--- a/fs/btrfs/raid56.c
+++ b/fs/btrfs/raid56.c
@@ -221,12 +221,8 @@ int btrfs_alloc_stripe_hash_table(struct btrfs_fs_info *info)
 	}
 
 	x = cmpxchg(&info->stripe_hash_table, NULL, table);
-	if (x) {
-		if (is_vmalloc_addr(x))
-			vfree(x);
-		else
-			kfree(x);
-	}
+	if (x)
+		kvfree(x);
 	return 0;
 }
 
@@ -436,10 +432,7 @@ void btrfs_free_stripe_hash_table(struct btrfs_fs_info *info)
 	if (!info->stripe_hash_table)
 		return;
 	btrfs_clear_rbio_cache(info);
-	if (is_vmalloc_addr(info->stripe_hash_table))
-		vfree(info->stripe_hash_table);
-	else
-		kfree(info->stripe_hash_table);
+	kvfree(info->stripe_hash_table);
 	info->stripe_hash_table = NULL;
 }
 
-- 
1.7.12.4
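For readers outside the kernel, the pattern shared by this patch and the btrfsic_state fix can be modelled in user space. Everything below is a toy: `ToyAllocator`, the contiguity limit and the string tags are invented for illustration. Only the shape follows the patches: try the contiguous allocator first, fall back to the other one, and free through a single helper that knows where the memory came from.

```python
class ToyAllocator:
    """Toy stand-in for kzalloc()/vzalloc()/kvfree(); not kernel behaviour."""

    def __init__(self, max_contiguous):
        self.max_contiguous = max_contiguous  # hypothetical contiguity limit
        self.live = {}                        # handle -> "kmalloc" or "vmalloc"
        self._next = 1

    def _new(self, kind):
        handle, self._next = self._next, self._next + 1
        self.live[handle] = kind
        return handle

    def kzalloc(self, size):
        # Contiguous allocation: fails (returns None) for large requests.
        return self._new("kmalloc") if size <= self.max_contiguous else None

    def vzalloc(self, size):
        # Virtually contiguous allocation: always succeeds in this toy model.
        return self._new("vmalloc")

    def alloc_with_fallback(self, size):
        # Same order as the check-integrity patch: kzalloc() first, then vzalloc().
        handle = self.kzalloc(size)
        return handle if handle is not None else self.vzalloc(size)

    def kvfree(self, handle):
        # One free routine for both kinds; returns which path was taken.
        return self.live.pop(handle)
```

The point of the single `kvfree()` entry is that callers no longer need to remember, or test for, which allocator satisfied the request.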
open_ctree problem
I have a workstation running Linux 3.14.something on a 120G SSD. It recently had a problem and now the root filesystem can't be mounted. Here is the message I get when trying to mount it read-only on Debian kernel 3.16.2-3:

[4703937.784447] BTRFS info (device loop0): disk space caching is enabled
[4703938.754247] BTRFS: log replay required on RO media
[4703938.794148] BTRFS: open_ctree failed

When I tried to boot it normally it gave a lot of kernel messages and failed to mount it. Here's the error I get from the btrfs-zero-log in btrfs-tools 0.19+20130501-1:

# btrfs-zero-log yayia-corrupt
extent buffer leak: start 157263929344 len 4096
*** Error in `btrfs-zero-log': corrupted double-linked list: 0x01068960 ***
Aborted

I installed btrfs-tools 3.17-1 and then btrfs-zero-log ran without error. But when I tried to mount the filesystem with Debian kernel 3.16.2-3, I got the attached kernel error.

Any suggestions on what I should do next?

-- 
My Main Blog         http://etbe.coker.com.au/
My Documents Blog    http://doc.coker.com.au/

Attachment: dmesg.txt.gz (GNU Zip compressed data)
[PATCH-v2 1/5] fs: split update_time() into update_time() and write_time()
In preparation for adding support for the lazytime mount option, we need to be able to separate out the update_time() and write_time() inode operations. Currently, only btrfs and xfs use update_time().

We needed to preserve update_time() because btrfs wants to have a special btrfs_root_readonly() check; otherwise we could drop the update_time() inode operation entirely.

Signed-off-by: Theodore Ts'o <ty...@mit.edu>
Cc: x...@oss.sgi.com
Cc: linux-btrfs@vger.kernel.org
---
 fs/btrfs/inode.c   | 10 ++++++++++
 fs/inode.c         | 29 ++++++++++++++++++-----------
 fs/xfs/xfs_iops.c  | 39 ++++++++++++++++-----------------------
 include/linux/fs.h |  1 +
 4 files changed, 45 insertions(+), 34 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index d23362f..a5e0d0d 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -5574,6 +5574,11 @@ static int btrfs_update_time(struct inode *inode, struct timespec *now,
 		inode->i_mtime = *now;
 	if (flags & S_ATIME)
 		inode->i_atime = *now;
+	return 0;
+}
+
+static int btrfs_write_time(struct inode *inode)
+{
 	return btrfs_dirty_inode(inode);
 }
 
@@ -9462,6 +9467,7 @@ static const struct inode_operations btrfs_dir_inode_operations = {
 	.get_acl	= btrfs_get_acl,
 	.set_acl	= btrfs_set_acl,
 	.update_time	= btrfs_update_time,
+	.write_time	= btrfs_write_time,
 	.tmpfile	= btrfs_tmpfile,
 };
 static const struct inode_operations btrfs_dir_ro_inode_operations = {
@@ -9470,6 +9476,7 @@ static const struct inode_operations btrfs_dir_ro_inode_operations = {
 	.get_acl	= btrfs_get_acl,
 	.set_acl	= btrfs_set_acl,
 	.update_time	= btrfs_update_time,
+	.write_time	= btrfs_write_time,
 };
 
 static const struct file_operations btrfs_dir_file_operations = {
@@ -9540,6 +9547,7 @@ static const struct inode_operations btrfs_file_inode_operations = {
 	.get_acl	= btrfs_get_acl,
 	.set_acl	= btrfs_set_acl,
 	.update_time	= btrfs_update_time,
+	.write_time	= btrfs_write_time,
 };
 static const struct inode_operations btrfs_special_inode_operations = {
 	.getattr	= btrfs_getattr,
@@ -9552,6 +9560,7 @@ static const struct inode_operations btrfs_special_inode_operations = {
 	.get_acl	= btrfs_get_acl,
 	.set_acl	= btrfs_set_acl,
 	.update_time	= btrfs_update_time,
+	.write_time	= btrfs_write_time,
 };
 static const struct inode_operations btrfs_symlink_inode_operations = {
 	.readlink	= generic_readlink,
@@ -9565,6 +9574,7 @@ static const struct inode_operations btrfs_symlink_inode_operations = {
 	.listxattr	= btrfs_listxattr,
 	.removexattr	= btrfs_removexattr,
 	.update_time	= btrfs_update_time,
+	.write_time	= btrfs_write_time,
 };
 
 const struct dentry_operations btrfs_dentry_operations = {
diff --git a/fs/inode.c b/fs/inode.c
index 26753ba..8f5c4b5 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -1499,17 +1499,24 @@ static int relatime_need_update(struct vfsmount *mnt, struct inode *inode,
  */
 static int update_time(struct inode *inode, struct timespec *time, int flags)
 {
-	if (inode->i_op->update_time)
-		return inode->i_op->update_time(inode, time, flags);
-
-	if (flags & S_ATIME)
-		inode->i_atime = *time;
-	if (flags & S_VERSION)
-		inode_inc_iversion(inode);
-	if (flags & S_CTIME)
-		inode->i_ctime = *time;
-	if (flags & S_MTIME)
-		inode->i_mtime = *time;
+	int ret;
+
+	if (inode->i_op->update_time) {
+		ret = inode->i_op->update_time(inode, time, flags);
+		if (ret)
+			return ret;
+	} else {
+		if (flags & S_ATIME)
+			inode->i_atime = *time;
+		if (flags & S_VERSION)
+			inode_inc_iversion(inode);
+		if (flags & S_CTIME)
+			inode->i_ctime = *time;
+		if (flags & S_MTIME)
+			inode->i_mtime = *time;
+	}
+	if (inode->i_op->write_time)
+		return inode->i_op->write_time(inode);
 	mark_inode_dirty_sync(inode);
 	return 0;
 }
diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
index ec6dcdc..0e9653c 100644
--- a/fs/xfs/xfs_iops.c
+++ b/fs/xfs/xfs_iops.c
@@ -984,10 +984,8 @@ xfs_vn_setattr(
 }
 
 STATIC int
-xfs_vn_update_time(
-	struct inode		*inode,
-	struct timespec		*now,
-	int			flags)
+xfs_vn_write_time(
+	struct inode		*inode)
 {
 	struct xfs_inode	*ip = XFS_I(inode);
 	struct xfs_mount	*mp = ip->i_mount;
@@ -1004,21 +1002,16 @@ xfs_vn_update_time(
 	}
 
 	xfs_ilock(ip, XFS_ILOCK_EXCL);
-	if (flags & S_CTIME) {
-		inode->i_ctime = *now;
-
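The control flow that the fs/inode.c hunk gives update_time() can be modelled outside the kernel. The sketch below is not kernel code: `ToyInode`, the integer timestamps and the flag constants are stand-ins chosen for illustration. Only the ordering follows the patch: call the filesystem's update_time hook if there is one (otherwise stamp the in-memory fields), then either hand off to write_time or mark the inode dirty.

```python
S_ATIME, S_MTIME, S_CTIME, S_VERSION = 1, 2, 4, 8  # illustrative flag bits


class ToyInode:
    def __init__(self, update_time=None, write_time=None):
        self.update_time = update_time   # optional per-filesystem hook
        self.write_time = write_time     # optional per-filesystem hook
        self.atime = self.mtime = self.ctime = None
        self.version = 0
        self.dirty = False


def update_time(inode, now, flags):
    """Mirror of the patched flow: stamp in memory, then write or mark dirty."""
    if inode.update_time is not None:
        ret = inode.update_time(inode, now, flags)
        if ret:
            return ret                   # hook refused, nothing gets written
    else:
        if flags & S_ATIME:
            inode.atime = now
        if flags & S_VERSION:
            inode.version += 1
        if flags & S_CTIME:
            inode.ctime = now
        if flags & S_MTIME:
            inode.mtime = now
    if inode.write_time is not None:
        return inode.write_time(inode)
    inode.dirty = True                   # stands in for mark_inode_dirty_sync()
    return 0
```

In this model a filesystem that only needs a veto (the btrfs read-only root check mentioned above) supplies `update_time`, while one that only needs a custom write path supplies `write_time`.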
[PATCH-v2 3/5] vfs: don't let the dirty time inodes get more than a day stale
Guarantee that the on-disk timestamps will be no more than 24 hours stale.

Signed-off-by: Theodore Ts'o <ty...@mit.edu>
---
 fs/fs-writeback.c  |  1 +
 fs/inode.c         | 16 +++++++++++++++-
 include/linux/fs.h |  1 +
 3 files changed, 17 insertions(+), 1 deletion(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index ce7de22..eb04277 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -1141,6 +1141,7 @@ void __mark_inode_dirty(struct inode *inode, int flags)
 	if (flags & (I_DIRTY_SYNC | I_DIRTY_DATASYNC)) {
 		trace_writeback_dirty_inode_start(inode, flags);
 
+		inode->i_ts_dirty_day = 0;
 		if (sb->s_op->dirty_inode)
 			sb->s_op->dirty_inode(inode, flags);
 
diff --git a/fs/inode.c b/fs/inode.c
index 11fe81b..0d939a8 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -1511,6 +1511,7 @@ static int relatime_need_update(struct vfsmount *mnt, struct inode *inode,
  */
 static int update_time(struct inode *inode, struct timespec *time, int flags)
 {
+	unsigned short days_since_boot = jiffies / (HZ * 86400);
 	int ret;
 
 	if (inode->i_op->update_time) {
@@ -1527,14 +1528,27 @@ static int update_time(struct inode *inode, struct timespec *time, int flags)
 		if (flags & S_MTIME)
 			inode->i_mtime = *time;
 	}
-	if (inode->i_sb->s_flags & MS_LAZYTIME) {
+	/*
+	 * If i_ts_dirty_day is zero, then either we have not deferred
+	 * timestamp updates, or the system has been up for less than
+	 * a day (so days_since_boot is zero), so we defer timestamp
+	 * updates in that case and set the I_DIRTY_TIME flag.  If a
+	 * day or more has passed, then i_ts_dirty_day will be
+	 * different from days_since_boot, and then we should update
+	 * the on-disk inode and then we can clear i_ts_dirty_day.
+	 */
+	if ((inode->i_sb->s_flags & MS_LAZYTIME) &&
+	    (!inode->i_ts_dirty_day ||
+	     inode->i_ts_dirty_day == days_since_boot)) {
 		if (inode->i_state & I_DIRTY_TIME)
 			return 0;
 		spin_lock(&inode->i_lock);
 		inode->i_state |= I_DIRTY_TIME;
 		spin_unlock(&inode->i_lock);
+		inode->i_ts_dirty_day = days_since_boot;
 		return 0;
 	}
+	inode->i_ts_dirty_day = 0;
 	if (inode->i_op->write_time)
 		return inode->i_op->write_time(inode);
 	mark_inode_dirty_sync(inode);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 489b2f2..e3574cd 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -575,6 +575,7 @@ struct inode {
 	struct timespec		i_ctime;
 	spinlock_t		i_lock;	/* i_blocks, i_bytes, maybe i_size */
 	unsigned short		i_bytes;
+	unsigned short		i_ts_dirty_day;
 	unsigned int		i_blkbits;
 	blkcnt_t		i_blocks;
 
-- 
2.1.0
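The deferral test in this patch boils down to a little integer arithmetic, which can be checked in isolation. The following is a user-space model, not kernel code: the tick rate `HZ = 250` is an assumed value, `jiffies` is passed in as a plain integer that starts at zero, and the function names simply echo the variables in the patch.

```python
HZ = 250                      # assumed tick rate; the real value is a build-time option
SECONDS_PER_DAY = 86400


def days_since_boot(jiffies):
    # The patch stores this in an unsigned short, hence the 16-bit mask.
    return (jiffies // (HZ * SECONDS_PER_DAY)) & 0xFFFF


def should_defer(lazytime, i_ts_dirty_day, today):
    """True if update_time() would only set I_DIRTY_TIME and return."""
    return lazytime and (i_ts_dirty_day == 0 or i_ts_dirty_day == today)
```

With these definitions, an inode whose update was deferred on day 2 is written out on the first update seen on day 3. During the first day of uptime `days_since_boot()` is 0, so `i_ts_dirty_day` stays 0 and updates keep being deferred, as the comment in the patch notes.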
[PATCH-v2 4/5] vfs: add lazytime tracepoints for better debugging
Signed-off-by: Theodore Ts'o <ty...@mit.edu>
---
 fs/fs-writeback.c         |  5 ++++-
 fs/inode.c                |  5 +++++
 include/trace/events/fs.h | 56 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 65 insertions(+), 1 deletion(-)
 create mode 100644 include/trace/events/fs.h

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index eb04277..cab2d6d 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -27,6 +27,7 @@
 #include <linux/backing-dev.h>
 #include <linux/tracepoint.h>
 #include <linux/device.h>
+#include <trace/events/fs.h>
 #include "internal.h"
 
 /*
@@ -1304,8 +1305,10 @@ static void flush_sb_dirty_time(struct super_block *sb)
 		iput(old_inode);
 		old_inode = inode;
 
-		if (dirty_time)
+		if (dirty_time) {
+			trace_fs_lazytime_flush(inode);
 			mark_inode_dirty(inode);
+		}
 		cond_resched();
 		spin_lock(&inode_sb_list_lock);
 	}
diff --git a/fs/inode.c b/fs/inode.c
index 0d939a8..5a9a7b0 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -20,6 +20,9 @@
 #include <linux/list_lru.h>
 #include "internal.h"
 
+#define CREATE_TRACE_POINTS
+#include <trace/events/fs.h>
+
 /*
  * Inode locking rules:
  *
@@ -544,6 +547,7 @@ static void evict(struct inode *inode)
 			mark_inode_dirty(inode);
 			inode->i_sb->s_op->write_inode(inode, &wbc);
 		}
+		trace_fs_lazytime_evict(inode);
 	}
 
 	if (!list_empty(&inode->i_wb_list))
@@ -1546,6 +1550,7 @@ static int update_time(struct inode *inode, struct timespec *time, int flags)
 		inode->i_state |= I_DIRTY_TIME;
 		spin_unlock(&inode->i_lock);
 		inode->i_ts_dirty_day = days_since_boot;
+		trace_fs_lazytime_defer(inode);
 		return 0;
 	}
 	inode->i_ts_dirty_day = 0;
diff --git a/include/trace/events/fs.h b/include/trace/events/fs.h
new file mode 100644
index 0000000..ca06d5c
--- /dev/null
+++ b/include/trace/events/fs.h
@@ -0,0 +1,56 @@
+#undef TRACE_SYSTEM
+#define TRACE_SYSTEM fs
+
+#if !defined(_TRACE_FS_H) || defined(TRACE_HEADER_MULTI_READ)
+#define _TRACE_FS_H
+
+#include <linux/tracepoint.h>
+
+DECLARE_EVENT_CLASS(fs__inode,
+	TP_PROTO(struct inode *inode),
+
+	TP_ARGS(inode),
+
+	TP_STRUCT__entry(
+		__field(	dev_t,	dev	)
+		__field(	ino_t,	ino	)
+		__field(	uid_t,	uid	)
+		__field(	gid_t,	gid	)
+		__field(	__u16,	mode	)
+	),
+
+	TP_fast_assign(
+		__entry->dev	= inode->i_sb->s_dev;
+		__entry->ino	= inode->i_ino;
+		__entry->uid	= i_uid_read(inode);
+		__entry->gid	= i_gid_read(inode);
+		__entry->mode	= inode->i_mode;
+	),
+
+	TP_printk("dev %d,%d ino %lu mode 0%o uid %u gid %u",
+		  MAJOR(__entry->dev), MINOR(__entry->dev),
+		  (unsigned long) __entry->ino, __entry->mode,
+		  __entry->uid, __entry->gid)
+);
+
+DEFINE_EVENT(fs__inode, fs_lazytime_defer,
+	TP_PROTO(struct inode *inode),
+
+	TP_ARGS(inode)
+);
+
+DEFINE_EVENT(fs__inode, fs_lazytime_evict,
+	TP_PROTO(struct inode *inode),
+
+	TP_ARGS(inode)
+);
+
+DEFINE_EVENT(fs__inode, fs_lazytime_flush,
+	TP_PROTO(struct inode *inode),
+
+	TP_ARGS(inode)
+);
+#endif /* _TRACE_FS_H */
+
+/* This part must be outside protection */
+#include <trace/define_trace.h>
-- 
2.1.0
[PATCH-v2 5/5] ext4: add support for a lazytime mount option
Add an optimization for the MS_LAZYTIME mount option so that we will opportunistically write out any inodes with the I_DIRTY_TIME flag set in a particular inode table block when we need to update some inode in that inode table block anyway.

Also add some temporary code so that we can set the lazytime mount option without needing a modified /sbin/mount program which can set MS_LAZYTIME. We can eventually make this go away once util-linux has added support.

Google-Bug-Id: 18297052

Signed-off-by: Theodore Ts'o <ty...@mit.edu>
---
 fs/ext4/inode.c             | 48 +++++++++++++++++++++++++++++++++++++++++++++---
 fs/ext4/super.c             |  9 +++++++++
 fs/inode.c                  | 36 ++++++++++++++++++++++++++++++++++++
 include/linux/fs.h          |  2 ++
 include/trace/events/ext4.h | 29 +++++++++++++++++++++++++++++
 5 files changed, 121 insertions(+), 3 deletions(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 3356ab5..03149b4 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -4163,6 +4163,50 @@ static int ext4_inode_blocks_set(handle_t *handle,
 }
 
 /*
+ * Opportunistically update the other time fields for other inodes in
+ * the same inode table block.
+ */
+static void ext4_update_other_inodes_time(struct super_block *sb,
+					  unsigned long orig_ino, char *buf)
+{
+	struct ext4_inode_info	*ei;
+	struct ext4_inode	*raw_inode;
+	unsigned long		ino;
+	struct inode		*inode;
+	int		i, inodes_per_block = EXT4_SB(sb)->s_inodes_per_block;
+	int		inode_size = EXT4_INODE_SIZE(sb);
+
+	ino = orig_ino & ~(inodes_per_block - 1);
+	for (i = 0; i < inodes_per_block; i++, ino++, buf += inode_size) {
+		if (ino == orig_ino)
+			continue;
+		inode = find_active_inode_nowait(sb, ino);
+		if (!inode ||
+		    (inode->i_state & I_DIRTY_TIME) == 0 ||
+		    !spin_trylock(&inode->i_lock)) {
+			iput(inode);
+			continue;
+		}
+		inode->i_state &= ~I_DIRTY_TIME;
+		inode->i_ts_dirty_day = 0;
+		spin_unlock(&inode->i_lock);
+
+		ei = EXT4_I(inode);
+		raw_inode = (struct ext4_inode *) buf;
+
+		spin_lock(&ei->i_raw_lock);
+		EXT4_INODE_SET_XTIME(i_ctime, inode, raw_inode);
+		EXT4_INODE_SET_XTIME(i_mtime, inode, raw_inode);
+		EXT4_INODE_SET_XTIME(i_atime, inode, raw_inode);
+		ext4_inode_csum_set(inode, raw_inode, ei);
+		spin_unlock(&ei->i_raw_lock);
+		trace_ext4_other_inode_update_time(inode, orig_ino);
+		iput(inode);
+	}
+}
+
+
+/*
  * Post the struct inode info into an on-disk inode location in the
  * buffer-cache.  This gobbles the caller's reference to the
  * buffer_head in the inode location struct.
@@ -4260,7 +4304,6 @@ static int ext4_do_update_inode(handle_t *handle,
 		for (block = 0; block < EXT4_N_BLOCKS; block++)
 			raw_inode->i_block[block] = ei->i_data[block];
 	}
-
 	if (likely(!test_opt2(inode->i_sb, HURD_COMPAT))) {
 		raw_inode->i_disk_version = cpu_to_le32(inode->i_version);
 		if (ei->i_extra_isize) {
@@ -4271,10 +4314,9 @@ static int ext4_do_update_inode(handle_t *handle,
 				cpu_to_le16(ei->i_extra_isize);
 		}
 	}
-
 	ext4_inode_csum_set(inode, raw_inode, ei);
-
 	spin_unlock(&ei->i_raw_lock);
+	ext4_update_other_inodes_time(inode->i_sb, inode->i_ino, bh->b_data);
 
 	BUFFER_TRACE(bh, "call ext4_handle_dirty_metadata");
 	rc = ext4_handle_dirty_metadata(handle, NULL, bh);
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 4b79f39..1ac1914 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -1133,6 +1133,7 @@ enum {
 	Opt_noquota, Opt_barrier, Opt_nobarrier, Opt_err,
 	Opt_usrquota, Opt_grpquota, Opt_i_version,
 	Opt_stripe, Opt_delalloc, Opt_nodelalloc, Opt_mblk_io_submit,
+	Opt_lazytime, Opt_nolazytime,
 	Opt_nomblk_io_submit, Opt_block_validity, Opt_noblock_validity,
 	Opt_inode_readahead_blks, Opt_journal_ioprio,
 	Opt_dioread_nolock, Opt_dioread_lock,
@@ -1195,6 +1196,8 @@ static const match_table_t tokens = {
 	{Opt_i_version, "i_version"},
 	{Opt_stripe, "stripe=%u"},
 	{Opt_delalloc, "delalloc"},
+	{Opt_lazytime, "lazytime"},
+	{Opt_nolazytime, "nolazytime"},
 	{Opt_nodelalloc, "nodelalloc"},
 	{Opt_removed, "mblk_io_submit"},
 	{Opt_removed, "nomblk_io_submit"},
@@ -1450,6 +1453,12 @@ static int handle_mount_opt(struct super_block *sb, char *opt, int token,
 	case Opt_i_version:
 		sb->s_flags |= MS_I_VERSION;
 		return 1;
+	case Opt_lazytime:
+		sb->s_flags |= MS_LAZYTIME;
+		return
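The masking step in ext4_update_other_inodes_time() (`ino = orig_ino & ~(inodes_per_block - 1)`) can be tried out on plain integers. This sketch only illustrates the arithmetic as written in the patch: `block_inode_range` is a made-up helper, the power-of-two check is an added assumption the mask relies on, and whether the resulting range lines up with how ext4 actually numbers inodes within a table block is not something the sketch checks.

```python
def block_inode_range(orig_ino, inodes_per_block):
    """Inode numbers the loop in the patch would visit, excluding orig_ino."""
    if inodes_per_block <= 0 or inodes_per_block & (inodes_per_block - 1):
        raise ValueError("mask trick needs a power-of-two inodes_per_block")
    # Clear the low bits to round down to a multiple of inodes_per_block.
    first = orig_ino & ~(inodes_per_block - 1)
    return [ino for ino in range(first, first + inodes_per_block) if ino != orig_ino]
```

For example, with 16 inodes per block and `orig_ino = 21`, the mask gives a starting number of 16 and the loop would consider inodes 16 through 31 except 21.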
[PATCH-v2 2/5] vfs: add support for a lazytime mount option
Add a new mount option which enables a new lazytime mode. This mode causes atime, mtime, and ctime updates to only be made to the in-memory version of the inode. The on-disk times will only get updated when (a) the inode needs to be updated for some non-time related change, (b) userspace calls fsync(), syncfs() or sync(), or (c) just before an undeleted inode is evicted from memory.

This is OK according to POSIX because there are no guarantees after a crash unless userspace explicitly requests them via an fsync(2) call.

For workloads which feature a large number of random writes to a preallocated file, the lazytime mount option significantly reduces writes to the inode table. The repeated 4k writes to a single block will result in undesirable stress on flash devices and SMR disk drives. Even on conventional HDDs, the repeated writes to the inode table block will trigger Adjacent Track Interference (ATI) remediation latencies, which very negatively impact 99.9 percentile latencies, and that is a very big deal for web serving tiers (for example).

Google-Bug-Id: 18297052

Signed-off-by: Theodore Ts'o <ty...@mit.edu>
---
 fs/fs-writeback.c       | 38 +++++++++++++++++++++++++++++++++++++-
 fs/inode.c              | 20 ++++++++++++++++++++
 fs/proc_namespace.c     |  1 +
 fs/sync.c               |  7 +++++++
 include/linux/fs.h      |  1 +
 include/uapi/linux/fs.h |  1 +
 6 files changed, 67 insertions(+), 1 deletion(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index ef9bef1..ce7de22 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -483,7 +483,7 @@ __writeback_single_inode(struct inode *inode, struct writeback_control *wbc)
 	if (!mapping_tagged(mapping, PAGECACHE_TAG_DIRTY))
 		inode->i_state &= ~I_DIRTY_PAGES;
 	dirty = inode->i_state & I_DIRTY;
-	inode->i_state &= ~(I_DIRTY_SYNC | I_DIRTY_DATASYNC);
+	inode->i_state &= ~(I_DIRTY_SYNC | I_DIRTY_DATASYNC | I_DIRTY_TIME);
 	spin_unlock(&inode->i_lock);
 	/* Don't write the inode if only I_DIRTY_PAGES was set */
 	if (dirty & (I_DIRTY_SYNC | I_DIRTY_DATASYNC)) {
@@ -1277,6 +1277,41 @@ static void wait_sb_inodes(struct super_block *sb)
 	iput(old_inode);
 }
 
+/*
+ * This works like wait_sb_inodes(), but it is called *before* we kick
+ * the bdi so the inodes can get written out.
+ */
+static void flush_sb_dirty_time(struct super_block *sb)
+{
+	struct inode *inode, *old_inode = NULL;
+
+	WARN_ON(!rwsem_is_locked(&sb->s_umount));
+	spin_lock(&inode_sb_list_lock);
+	list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
+		int dirty_time;
+
+		spin_lock(&inode->i_lock);
+		if (inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW)) {
+			spin_unlock(&inode->i_lock);
+			continue;
+		}
+		dirty_time = inode->i_state & I_DIRTY_TIME;
+		__iget(inode);
+		spin_unlock(&inode->i_lock);
+		spin_unlock(&inode_sb_list_lock);
+
+		iput(old_inode);
+		old_inode = inode;
+
+		if (dirty_time)
+			mark_inode_dirty(inode);
+		cond_resched();
+		spin_lock(&inode_sb_list_lock);
+	}
+	spin_unlock(&inode_sb_list_lock);
+	iput(old_inode);
+}
+
 /**
  * writeback_inodes_sb_nr -	writeback dirty inodes from given super_block
  * @sb: the superblock
@@ -1388,6 +1423,7 @@ void sync_inodes_sb(struct super_block *sb)
 		return;
 	WARN_ON(!rwsem_is_locked(&sb->s_umount));
 
+	flush_sb_dirty_time(sb);
 	bdi_queue_work(sb->s_bdi, &work);
 	wait_for_completion(&done);
 
diff --git a/fs/inode.c b/fs/inode.c
index 8f5c4b5..11fe81b 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -534,6 +534,18 @@ static void evict(struct inode *inode)
 	BUG_ON(!(inode->i_state & I_FREEING));
 	BUG_ON(!list_empty(&inode->i_lru));
 
+	if (inode->i_nlink && inode->i_state & I_DIRTY_TIME) {
+		if (inode->i_op->write_time)
+			inode->i_op->write_time(inode);
+		else if (inode->i_sb->s_op->write_inode) {
+			struct writeback_control wbc = {
+				.sync_mode = WB_SYNC_NONE,
+			};
+			mark_inode_dirty(inode);
+			inode->i_sb->s_op->write_inode(inode, &wbc);
+		}
+	}
+
 	if (!list_empty(&inode->i_wb_list))
 		inode_wb_list_del(inode);
 
@@ -1515,6 +1527,14 @@ static int update_time(struct inode *inode, struct timespec *time, int flags)
 		if (flags & S_MTIME)
 			inode->i_mtime = *time;
 	}
+	if (inode->i_sb->s_flags & MS_LAZYTIME) {
+		if (inode->i_state & I_DIRTY_TIME)
+			return 0;
+		spin_lock(&inode->i_lock);
+		inode->i_state |= I_DIRTY_TIME;
+		spin_unlock(&inode->i_lock);
+
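To make the three write-out triggers from the commit message concrete, here is a toy model of a single inode. It is a sketch only: `LazyInode`, its integer timestamps and the `disk_writes` counter are invented for illustration and say nothing about how the kernel batches or journals real writes.

```python
class LazyInode:
    """Toy inode: timestamps live in memory and reach 'disk' only on a trigger."""

    def __init__(self):
        self.mem_mtime = 0
        self.disk_mtime = 0
        self.dirty_time = False      # stands in for I_DIRTY_TIME
        self.disk_writes = 0
        self.nlink = 1

    def _write_out(self):
        self.disk_mtime = self.mem_mtime
        self.dirty_time = False
        self.disk_writes += 1

    def touch(self, now, lazytime=True):
        self.mem_mtime = now
        if lazytime:
            self.dirty_time = True   # defer: no disk write
        else:
            self._write_out()

    def change_metadata(self):       # (a) a non-time related inode change
        self._write_out()

    def fsync(self):                 # (b) explicit sync from userspace
        if self.dirty_time:
            self._write_out()

    def evict(self):                 # (c) eviction of an undeleted inode
        if self.nlink and self.dirty_time:
            self._write_out()
```

In this model a thousand `touch()` calls followed by one `fsync()` cost a single write with lazytime and a thousand without, which is the effect the commit message describes for random writes to a preallocated file.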
[PATCH-v2 0/5] add support for a lazytime mount option
This is an updated version of what had originally been an ext4-specific patch which significantly improves performance by lazily writing timestamp updates (and in particular, mtime updates) to disk. The in-memory timestamps are always correct, but they are only written to disk when required for correctness.

This provides a huge performance boost for ext4 due to how it handles journalling, but it's valuable for all file systems running on flash storage or drive-managed SMR disks by reducing the metadata write load. So upon request, I've moved the functionality to the VFS layer. Once the /sbin/mount program adds support for MS_LAZYTIME, all file systems should be able to benefit from this optimization.

There is still an ext4-specific optimization, which may be applicable for other file systems which store more than one inode in a block, but it will require file system specific code. It is purely optional, however.

Please note the changes to update_time() and the new write_time() inode operations functions, which impact btrfs and xfs. The changes are fairly simple, but I would appreciate confirmation from the btrfs and xfs teams that I got things right. Thanks!!

Any objections to my carrying these patches in the ext4 git tree?

Changes since -v1:
   - Added explanatory comments in update_time() regarding i_ts_dirty_days
   - Fix type used for days_since_boot
   - Improve SMP scalability in update_time and ext4_update_other_inodes_time
   - Added tracepoints to help test and characterize how often and under what circumstances inodes have their timestamps lazily updated

Theodore Ts'o (5):
  fs: split update_time() into update_time() and write_time()
  vfs: add support for a lazytime mount option
  vfs: don't let the dirty time inodes get more than a day stale
  vfs: add lazytime tracepoints for better debugging
  ext4: add support for a lazytime mount option

 fs/btrfs/inode.c            |  10 +
 fs/ext4/inode.c             |  48 ++--
 fs/ext4/super.c             |   9
 fs/fs-writeback.c           |  42 +-
 fs/inode.c                  | 104 +++-
 fs/proc_namespace.c         |   1 +
 fs/sync.c                   |   7 +++
 fs/xfs/xfs_iops.c           |  39 +++--
 include/linux/fs.h          |   5 +++
 include/trace/events/ext4.h |  29
 include/trace/events/fs.h   |  56
 include/uapi/linux/fs.h     |   1 +
 12 files changed, 313 insertions(+), 38 deletions(-)
 create mode 100644 include/trace/events/fs.h

-- 
2.1.0
Re: BTRFS messes up snapshot LV with origin
On 11/21/2014 05:28 AM, Zygo Blaxell wrote:
> e.g. if an ext4 filesystem explodes, I can:
>
> 	1. make a LVM snapshot of the broken filesystem
> 	2. run e2fsck on the snapshot
> 	3. mount and repair the snapshot, e.g. rsync any missing files from backups, salvage anything that survived
> 	4. LVM merge the snapshot to its origin volume
> 	5. umount the origin volume and mount the merged volume (or just reboot)
>
> ...and I can do all of this on a running system, in-place, with only a few minutes of downtime in the must-reboot case.
>
> None of the above works with btrfs at all. Multi-device btrfs fails at 2,

You can't compare ext4 with btrfs if you are talking about a multi-device filesystem: ext4 doesn't have this capability. Try to make an md-raid over snapshotted logical volume(s); I never tried that, but I suppose that there will be the same problems...

> and mounting the filesystem fails at 3.

Are you sure?

ghigo@venice:/tmp$ # create a btrfs filesystem in a logical volume
ghigo@venice:/tmp$ sudo truncate -s +10G disk.img
ghigo@venice:/tmp$ sudo losetup -f disk.img
ghigo@venice:/tmp$ sudo pvcreate /dev/loop0
ghigo@venice:/tmp$ sudo vgcreate vgtest /dev/loop0
ghigo@venice:/tmp$ sudo lvcreate -n lvone -L 3G vgtest
ghigo@venice:/tmp$ sudo mkfs.btrfs /dev/vgtest/lvone
ghigo@venice:/tmp$ mkdir t

ghigo@venice:/tmp$ # create a file inside a btrfs fs
ghigo@venice:/tmp$ sudo mount /dev/vgtest/lvone t/
ghigo@venice:/tmp$ sudo dd if=/dev/zero of=t/disk-orig bs=1M count=1
ghigo@venice:/tmp$ sudo umount t

ghigo@venice:/tmp$ # make a lvm snapshot and add a 2nd file
ghigo@venice:/tmp$ sudo lvcreate -s -n lvone_snap -L 3G vgtest/lvone
ghigo@venice:/tmp$ sudo mount /dev/vgtest/lvone_snap t/
ghigo@venice:/tmp$ sudo dd if=/dev/zero of=t/disk-snap bs=1M count=1
ghigo@venice:/tmp$ sudo umount t

ghigo@venice:/tmp$ # mount the first lv, and check the file
ghigo@venice:/tmp$ sudo mount /dev/vgtest/lvone t/
ghigo@venice:/tmp$ ls -l t
total 1024
-rw-r--r-- 1 root root 1048576 Nov 22 18:11 disk-orig
ghigo@venice:/tmp$ sudo umount t

ghigo@venice:/tmp$ # mount the snapshot lv, and check the files
ghigo@venice:/tmp$ sudo mount /dev/vgtest/lvone_snap t/
ghigo@venice:/tmp$ ls -l t
total 2048
-rw-r--r-- 1 root root 1048576 Nov 22 18:11 disk-orig
-rw-r--r-- 1 root root 1048576 Nov 22 18:12 disk-snap

On the basis of the example above, if you want to mount a single-disk filesystem, BTRFS seems to me to work properly. You only have to take care not to mount the two filesystems at the same time.

BR
G.Baroncelli

-- 
gpg @keyserver.linux.it: Goffredo Baroncelli kreijackATinwind.it
Key fingerprint BBF5 1610 0B64 DAC6 5F7D 17B2 0EDA 9B37 8B82 E0B5
Volume/subvolume UUID uniqueness, was: BTRFS messes up snapshot LV with origin
I don't know how to fix this but I've convinced myself there's at least a small problem. And not just with LVM snapshots as in the originating thread. - Via seed device method of creating a Btrfs volume, the resulting volume gets a new UUID. The volume UUID from the seed device doesn't pass through, is not inherited / copied. Therefore there's already recognition that snapshotting a Btrfs volume, which is what volume creation from a seed device effectively is, should result in the new volume getting a new UUID. Therefore it seems reasonable a mechanism to support new volume UUIDs upon LVM snapshots being taken is needed. Maybe leveraging existing seed code can help, consider existing volume data a virtual seed device, and the remaining free space as a virtual added device to enable changing volume UUID rather than rewriting possibly piles of UUIDs. - While the seed device method of creating a Btrfs volume results in a new volume UUID, subvolume UUIDs from the seed pass through to the new volume. Since I can create many new volumes from one seed device, in effect I'm creating many instances of subvolumes with identical UUIDs and can now no longer be differentiated, locally and remotely. This seems to be a much bigger problem than the LVM case, since it occurs with only Btrfs tools being used. The grandiose idea of UUIDs is persistence in identifying a specific object/resource for all time, anywhere in the universe. Reducing this to something practical, it should enable a way to identify an object or resource within one or two human lifetimes, within our solar system. Yet the current implementation has broken this on a much shorter time scale, on a single computer. Since we recognize subvolume snapshots should get new subvolume UUIDs, and volume snapshots via seed device method creation of new volumes get new volume UUIDs; a volume snapshot of course is also snapshotting the subvolumes too, so the subvolume UUIDs can't pass through the way they do right now. 
It's not correct behavior.

Another matter is what to do with parent uuid and snapshot relationship metadata in the new volume. Assuming all subvolumes get new UUIDs on the new volume, there are three possibilities:

1. parent uuid is always blank; no relationships between subvolumes are preserved
2. parent uuid is the uuid of its identical mirror (the original) in the seed device
3. parent uuid is the new uuid of its relative parent on the current new volume, preserving relationships between subvolumes and snapshots

I think any of those three is better than UUID duplication (recycling, actually). Maybe I'm not thinking of a use case for preserving these UUIDs, but at the moment I think it's specious. We can't be attached to specific UUIDs: the instant a subvolume is effectively snapshotted by LVM or a Btrfs seed device, it's a unique object/resource, and should have its own URN. After all, by default these objects are read/write. Maybe if by default they were read-only I could be convinced of the validity of UUID preservation.

--
Chris Murphy
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
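The three parent uuid options above can be compared with a small model. This is only a sketch under stated assumptions: `Subvol`, `reassign_uuids` and the policy names "blank", "seed" and "remap" are hypothetical names made up for illustration, not anything in btrfs-progs or the kernel. It shows what each option would record once every subvolume on the new volume has been given a fresh UUID.

```python
import uuid
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Subvol:
    """Hypothetical, simplified view of a subvolume's identity metadata."""
    name: str
    uuid: uuid.UUID
    parent_uuid: Optional[uuid.UUID]  # None means "no parent recorded"


def reassign_uuids(seed_subvols: List[Subvol], policy: str) -> List[Subvol]:
    """Give every subvolume a fresh UUID and fill parent_uuid per policy.

    policy "blank": option 1, parent uuid is always empty
    policy "seed":  option 2, parent uuid points at the original in the seed
    policy "remap": option 3, parent uuid points at the parent's *new* uuid
    """
    # Map each old UUID to its replacement on the new volume.
    new_id = {s.uuid: uuid.uuid4() for s in seed_subvols}
    result = []
    for s in seed_subvols:
        if policy == "blank":
            parent = None
        elif policy == "seed":
            parent = s.uuid
        elif policy == "remap":
            # A subvolume without a parent (or whose parent is not part of
            # the seed) ends up with no parent on the new volume either.
            parent = new_id.get(s.parent_uuid)
        else:
            raise ValueError(f"unknown policy: {policy}")
        result.append(Subvol(s.name, new_id[s.uuid], parent))
    return result


if __name__ == "__main__":
    home = Subvol("home", uuid.uuid4(), None)
    snap = Subvol("home-snap", uuid.uuid4(), home.uuid)
    for policy in ("blank", "seed", "remap"):
        print(policy, reassign_uuids([home, snap], policy))
```

Only the "remap" policy keeps the snapshot relationship usable on the new volume; "seed" records provenance instead, and "blank" discards both.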
Re: [PATCH v2 5/5] btrfs: enable swap file support
On Fri, Nov 21, 2014 at 07:00:45PM +0100, David Sterba wrote:
> > +		pr_err("BTRFS: swapfile has holes");
> > +		ret = -EINVAL;
> > +		goto out;
> > +	}
> > +	if (em->block_start == EXTENT_MAP_INLINE) {
> > +		pr_err("BTRFS: swapfile is inline");
>
> While the test is valid, this would mean that the file is smaller than
> the inline limit, which is now one page. I think the generic swap code
> would refuse such a small file anyway.

Sure. This test doesn't really cost us anything, so I think I'd feel a little better just leaving it in. I'll add a comment for the next close reader.

Besides that and Filipe's response, I'll address everything you mentioned here and in your other email in the next version, thanks.

> > +		ret = -EINVAL;
> > +		goto out;
> > +	}
> > +	if (test_bit(EXTENT_FLAG_COMPRESSED, &em->flags)) {
> > +		pr_err("BTRFS: swapfile is compressed");
> > +		ret = -EINVAL;
> > +		goto out;
> > +	}
>
> I think the preallocated extents should be refused as well. This means
> the filesystem has enough space to hold the data but it would still have
> to go through the allocation and could in turn stress the memory
> management code that triggered the swapping activity in the first place.
>
> Though it's probably still possible to reach such corner case even with
> fully allocated nodatacow file, this should be reviewed anyway.

I'll definitely take a closer look at this. In particular, btrfs_get_blocks_direct and btrfs_get_extent do allocations in some cases which I'll look into.

--
Omar
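The checks discussed in this review can be summarized as a small model. This is a sketch in Python rather than kernel C, and `ExtentKind`, `Extent` and `swapfile_problem` are hypothetical names used only for illustration. The hole, inline and compressed checks mirror the quoted patch; rejecting preallocated extents is the reviewer's suggestion and is included here as an assumption about what a later version might do.

```python
from dataclasses import dataclass
from enum import Enum, auto
from typing import Iterable, Optional


class ExtentKind(Enum):
    REGULAR = auto()
    HOLE = auto()
    INLINE = auto()
    PREALLOC = auto()


@dataclass(frozen=True)
class Extent:
    """Hypothetical, simplified stand-in for an extent map entry."""
    kind: ExtentKind
    compressed: bool = False


def swapfile_problem(extents: Iterable[Extent]) -> Optional[str]:
    """Return the first reason the file cannot be used for swap, or None."""
    for index, extent in enumerate(extents):
        if extent.kind is ExtentKind.HOLE:
            return f"extent {index}: swapfile has holes"
        if extent.kind is ExtentKind.INLINE:
            return f"extent {index}: swapfile is inline"
        if extent.compressed:
            return f"extent {index}: swapfile is compressed"
        if extent.kind is ExtentKind.PREALLOC:
            # Suggested in review, not part of the quoted patch: writing to
            # a preallocated extent may still need allocation work.
            return f"extent {index}: swapfile is preallocated"
    return None


if __name__ == "__main__":
    good = [Extent(ExtentKind.REGULAR), Extent(ExtentKind.REGULAR)]
    bad = [Extent(ExtentKind.REGULAR), Extent(ExtentKind.HOLE)]
    print(swapfile_problem(good))
    print(swapfile_problem(bad))
```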
Re: scrub implies failing drive - smartctl blissfully unaware
On 11/21/2014 04:12 PM, Robert White wrote:
> Here's a bug from 2005 of someone having a problem with the ACPI IDE
> support...

That is not ACPI emulation. ACPI is not used to access the disk, but rather it has hooks that give it a chance to diddle with the disk to do things like configure it to lie about its maximum size, or issue a security unlock during suspend/resume.

> People debating the merits of the ACPI IDE drivers in 2005.

No... that's not a debate at all; it is one guy asking if he should use IDE or ACPI mode... someone who again meant AHCI and typed the wrong acronym.

> Even when you get me for referencing windows, you're still wrong...
>
> How many times will you try get out of being hideously horribly wrong
> about ACPI supporting disk/storage IO? It is neither recent nor rare.
>
> How much egg does your face really need before you just see that your
> fantasy that it's new and uncommon is a delusional mistake?

Project much?

It seems I've proven just about everything I originally said you got wrong now so hopefully we can be done.
Re: Fixing Btrfs Filesystem Full Problems typo?
+btrfs list so that someone can correct me if I'm wrong.

On Sat, Nov 22, 2014 at 09:34:59PM +0100, Patrik Lundquist wrote:
> Hi,
>
> I was scratching my head over a failing btrfs balance and read your very
> informative
> http://marc.merlins.org/perso/btrfs/post_2014-05-04_Fixing-Btrfs-Filesystem-Full-Problems.html,
> but shouldn't "I can ask balance to rewrite all chunks that are more
> than 55% full" be "I can ask balance to rewrite all chunks that are less
> than 55% full"?

This one hurts my brain every time I think about it :)

So, the bigger the -dusage number, the more work btrfs has to do.
-dusage=0 does almost nothing
-dusage=100 effectively rebalances everything

But saying "less than 95% full" for -dusage=95 would mean rebalancing everything that isn't almost full, so I'm not sure it makes sense either (I would think you'd want to rebalance full blocks first).

The logical wording would be "less than 95% space free". I'll update my page since this is what makes the most sense.

Now, just to be sure, if I'm getting this right, if your filesystem is 55% full, you could rebalance all blocks that have less than 55% space free, and use -dusage=55. Does that sound right?

Marc
--
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/ | PGP 1024R/763BE901
Re: Volume/subvolume UUID uniqueness, was: BTRFS messes up snapshot LV with origin
On 11/22/2014 02:50 PM, Robert White wrote:
> Take a couple snapshots of a subvolume, and then send those subvolumes
> to another file system with send/receive, and then do "btrfs subvolume
> list -u -q" on the two filesystems and tell me that mess makes sense.
>
> Or try to recreate a subvolume from its snapshot in a way that doesn't
> shatter the relationships in your backup scheme.
>
> (I'm researching for a couple patches but I'm not expecting a warm
> reception given the silence to date).

(ASIDE: In particular use "btrfs sub send -c SNAP1 SNAP2" and then "btrfs sub send -c SNAP2 SNAP3" etc. before doing the "btrfs sub list -u -q" to view the mess I speak of.)
Re: Fixing Btrfs Filesystem Full Problems typo?
On 22 November 2014 at 23:26, Marc MERLIN m...@merlins.org wrote:
> This one hurts my brain every time I think about it :)

I'm new to Btrfs so I may very well be wrong, since I haven't really read up on it. :-)

> So, the bigger the -dusage number, the more work btrfs has to do.

Agreed.

> -dusage=0 does almost nothing
> -dusage=100 effectively rebalances everything

And -dusage=0 effectively reclaims empty chunks, right?

> But saying "less than 95% full" for -dusage=95 would mean rebalancing
> everything that isn't almost full,

But isn't that what rebalance does? Rewriting chunks <= 95% full into completely full chunks, effectively defragmenting chunks and most likely reducing the number of chunks.

A -dusage=0 rebalance reduced my number of chunks from 1173 to 998, and dev_item.bytes_used went from 1593466421248 to 1491460947968.

> Now, just to be sure, if I'm getting this right, if your filesystem is
> 55% full, you could rebalance all blocks that have less than 55% space
> free, and use -dusage=55

I realize that I interpret the usage parameter as operating on blocks (chunks? are they the same in this case?) that are <= 55% full while you interpret it as <= 55% free.

Which is correct?
Re: Fixing Btrfs Filesystem Full Problems typo?
On Sun, Nov 23, 2014 at 12:26:38AM +0100, Patrik Lundquist wrote:
> I realize that I interpret the usage parameter as operating on blocks
> (chunks? are they the same in this case?) that are <= 55% full while you
> interpret it as <= 55% free.
>
> Which is correct?

I will let someone else answer because I'm not 100% certain anymore.

Marc
Re: Fixing Btrfs Filesystem Full Problems typo?
On Sun, Nov 23, 2014 at 12:26:38AM +0100, Patrik Lundquist wrote:
> On 22 November 2014 at 23:26, Marc MERLIN m...@merlins.org wrote:
> > This one hurts my brain every time I think about it :)
>
> I'm new to Btrfs so I may very well be wrong, since I haven't really
> read up on it. :-)
>
> > So, the bigger the -dusage number, the more work btrfs has to do.
>
> Agreed.
>
> > -dusage=0 does almost nothing
> > -dusage=100 effectively rebalances everything
>
> And -dusage=0 effectively reclaims empty chunks, right?
>
> > But saying "less than 95% full" for -dusage=95 would mean rebalancing
> > everything that isn't almost full,
>
> But isn't that what rebalance does? Rewriting chunks <= 95% full into
> completely full chunks, effectively defragmenting chunks and most
> likely reducing the number of chunks.
>
> A -dusage=0 rebalance reduced my number of chunks from 1173 to 998, and
> dev_item.bytes_used went from 1593466421248 to 1491460947968.
>
> > Now, just to be sure, if I'm getting this right, if your filesystem is
> > 55% full, you could rebalance all blocks that have less than 55% space
> > free, and use -dusage=55
>
> I realize that I interpret the usage parameter as operating on blocks
> (chunks? are they the same in this case?) that are <= 55% full while you
> interpret it as <= 55% free.
>
> Which is correct?

Less than or equal to 55% full.

0 gives you less than or equal to 0% full -- i.e. the empty block groups. 100 gives you less than or equal to 100% full, i.e. all block groups.

A chunk is the part of a block group that lives on one device, so in RAID-1, every block group is precisely two chunks; in RAID-0, every block group is 2 or more chunks, up to the number of devices in the FS. A chunk is usually 1 GiB in size for data and 250 MiB for metadata, but can be smaller under some circumstances.

Hugo.

--
=== Hugo Mills: hugo@... carfax.org.uk | darksatanic.net | lug.org.uk ===
PGP key: 65E74AC0 from wwwkeys.eu.pgp.net or http://www.carfax.org.uk
--- And what rough beast, its hour come round at last / slouches ---
towards Bethlehem, to be born?
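The rule stated in this reply ("less than or equal to N% full") can be written down as a tiny filter. This is a sketch that follows the wording of the reply, not the kernel source; `BlockGroup` and `select_for_balance` are hypothetical names, and the sizes are made-up example values.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class BlockGroup:
    """Hypothetical, simplified block group: total size and bytes in use."""
    size: int
    used: int


def select_for_balance(groups: List[BlockGroup], usage: int) -> List[BlockGroup]:
    """Return the block groups a usage=N filter would pick, per the reply.

    A group is selected when it is at most `usage` percent full. Integer
    arithmetic avoids rounding surprises at the boundary.
    """
    return [g for g in groups if g.used * 100 <= usage * g.size]


if __name__ == "__main__":
    GIB = 1024 ** 3
    groups = [
        BlockGroup(GIB, 0),              # empty
        BlockGroup(GIB, GIB // 5),       # about 20% full
        BlockGroup(GIB, GIB * 55 // 100),  # about 55% full
        BlockGroup(GIB, GIB),            # completely full
    ]
    for usage in (0, 55, 100):
        print(usage, len(select_for_balance(groups, usage)))
```

With usage=0 only the empty group is selected, and with usage=100 every group is selected, which matches the two boundary cases given in the reply.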
Best GIT repository(s) for preparing patches?
Which is the best GIT repository to clone for each of the kernel support and btrfs-progs, for preparing a patch to submit to this email list?
Re: BTRFS messes up snapshot LV with origin
On Sat, Nov 22, 2014 at 06:34:38PM +0100, Goffredo Baroncelli wrote:
> On 11/21/2014 05:28 AM, Zygo Blaxell wrote:
> > e.g. if an ext4 filesystem explodes, I can:
> >
> > 1. make a LVM snapshot of the broken filesystem
> > 2. run e2fsck on the snapshot
> > 3. mount and repair the snapshot, e.g. rsync any missing files from
> >    backups, salvage anything that survived
> > 4. LVM merge the snapshot to its origin volume
> > 5. umount the origin volume and mount the merged volume (or just
> >    reboot)
> >
> > ...and I can do all of this on a running system, in-place, with only a
> > few minutes of downtime in the must-reboot case.
> >
> > None of the above works with btrfs at all. Multi-device btrfs fails
> > at 2,
>
> You can't compare ext4 with btrfs, if you are talking about a
> multi-device filesystem: ext4 haven't this capability.

btrfs fails this comparison as a single-device filesystem.

> Try to make a md-raid over a snapshotted logical volume(s); I never
> tried that, but I suppose that there will be the same problems...

md-raid works as long as you specify the devices, and because it's always the lowest layer it can ignore LVs (snapshot or otherwise). It's also not a particularly common use case, while making an LV snapshot of a filesystem is a typical use case.

> > and mounting the filesystem fails at 3.
>
> Are you sure ?

Yes, I'm sure. I've had to replace filesystems destroyed this way.

[working instance snipped]

> On the basis of the example above, in case you want to mount a
> single-disk, BTRFS seems me to work properly. You have to pay attention
> only to not mount the two filesystem at the same time.

The problem is btrfs stops searching when it sees one disk with each UUID, so the set of disks (snapshot vs origin) that you get is *random*. For a pair of origin + snapshots, there's a 50% chance it works, 50% chance it eats your data.

> BR
> G.Baroncelli
>
> --
> gpg @keyserver.linux.it: Goffredo Baroncelli kreijackATinwind.it
> Key fingerprint BBF5 1610 0B64 DAC6 5F7D 17B2 0EDA 9B37 8B82 E0B5
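The failure mode described above, where the first device seen for a given UUID wins, can be modelled in a few lines. This is a sketch of the behaviour as described in this message, not of the actual btrfs device scanning code; `first_seen_wins`, the device paths and the UUID string are hypothetical, and a single-device filesystem is assumed.

```python
from typing import Dict, Iterable, Tuple


def first_seen_wins(scan_order: Iterable[Tuple[str, str]]) -> Dict[str, str]:
    """Map each filesystem UUID to the first device path that carries it.

    scan_order is a sequence of (device_path, fs_uuid) pairs in the order
    the devices happen to be discovered. Later devices with an already
    known UUID are ignored, which is the behaviour described above.
    """
    chosen: Dict[str, str] = {}
    for path, fs_uuid in scan_order:
        chosen.setdefault(fs_uuid, path)
    return chosen


if __name__ == "__main__":
    origin = ("/dev/vg/origin", "example-fs-uuid")      # hypothetical names
    snapshot = ("/dev/vg/snapshot", "example-fs-uuid")  # same UUID as origin

    # The admin asked for the snapshot, but which device backs the mount
    # depends only on discovery order.
    print(first_seen_wins([origin, snapshot]))
    print(first_seen_wins([snapshot, origin]))
```

Because an LVM snapshot is a block-level copy, both devices carry the same UUID, so nothing in this model lets the caller say which one is meant; that is the ambiguity the 50% figure refers to.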
Re: Fixing Btrfs Filesystem Full Problems typo?
On Sun, Nov 23, 2014 at 12:05:04AM +0000, Hugo Mills wrote:
> > Which is correct?
>
> Less than or equal to 55% full.

This confuses me. Does that mean that the fullest blocks do not get rebalanced?

I guess I was under the mistaken impression that the more data you had the more you could be out of balance.

> A chunk is the part of a block group that lives on one device, so
> in RAID-1, every block group is precisely two chunks; in RAID-0, every
> block group is 2 or more chunks, up to the number of devices in the
> FS. A chunk is usually 1 GiB in size for data and 250 MiB for
> metadata, but can be smaller under some circumstances.

Right. So, why would you rebalance empty chunks or near empty chunks? Don't you want to rebalance almost full chunks first, and work your way to less and less full as needed?

Thanks,
Marc
Re: open_ctree problem
Strangely I repeated the same process on the same system (btrfs-zero-log and mount read-only) and it worked. While it's a concern that repeating the same process gives different results it's nice that I'm getting all my data back.

On Sun, 23 Nov 2014, Russell Coker russ...@coker.com.au wrote:
> I have a workstation running Linux 3.14.something on a 120G SSD. It
> recently had a problem and now the root filesystem can't be mounted,
> here is the message I get when trying to mount it read-only on Debian
> kernel 3.16.2-3:
>
> [4703937.784447] BTRFS info (device loop0): disk space caching is enabled
> [4703938.754247] BTRFS: log replay required on RO media
> [4703938.794148] BTRFS: open_ctree failed
>
> When I tried to boot it normally it gave a lot of kernel messages and
> failed to mount it.
>
> Here's the error I get from the btrfs-zero-log in btrfs-tools
> 0.19+20130501-1:
>
> # btrfs-zero-log yayia-corrupt
> extent buffer leak: start 157263929344 len 4096
> *** Error in `btrfs-zero-log': corrupted double-linked list: 0x01068960 ***
> Aborted
>
> I installed btrfs-tools 3.17-1 and then btrfs-zero-log ran without
> error. But when I tried to mount the filesystem I got the attached
> kernel error when trying to mount with Debian kernel 3.16.2-3.
>
> Any suggestions on what I should do next?

--
My Main Blog         http://etbe.coker.com.au/
My Documents Blog    http://doc.coker.com.au/
Re: Fixing Btrfs Filesystem Full Problems typo?
Marc MERLIN posted on Sat, 22 Nov 2014 17:07:42 -0800 as excerpted:

> On Sun, Nov 23, 2014 at 12:05:04AM +0000, Hugo Mills wrote:
> > > Which is correct?
> >
> > Less than or equal to 55% full.
>
> This confuses me. Does that mean that the fullest blocks do not get
> rebalanced?

Yes. =:^)

> I guess I was under the mistaken impression that the more data you had
> the more you could be out of balance.

What you were thinking is a misstatement of the situation, so yes, again, that was a mistaken impression. =:^)

> > A chunk is the part of a block group that lives on one device, so
> > in RAID-1, every block group is precisely two chunks; in RAID-0, every
> > block group is 2 or more chunks, up to the number of devices in the
> > FS. A chunk is usually 1 GiB in size for data and 250 MiB for
> > metadata, but can be smaller under some circumstances.
>
> Right. So, why would you rebalance empty chunks or near empty chunks?
> Don't you want to rebalance almost full chunks first, and work your way
> to less and less full as needed?

No, the closer to empty a chunk is, the more effect you can get in rebalancing it along with others of the same fullness.

Think of it this way.

One goal of a rebalance, the goal we have when data and metadata is unbalanced and we're hitting ENOSPC as a result (as opposed to the goal of converting or balancing among devices when one has just been added or removed), and thus the goal that the usage filter is designed to help solve, is this:

Free excess chunk-allocated but chunk-empty space back to unallocated, so it can be used by the other type, data or metadata.
More specifically, all available space has been allocated to data and metadata chunks leaving no space available to allocate more chunks, and one of two extremes has been reached, we'll call them D and M:

(
D1: All data chunks are full and more need to be allocated, but they can't be as there's no more unallocated space to allocate the new data chunks from,

*AND*

D2: There's a whole bunch of excess metadata chunks allocated, using up all that unallocated space, but they're mostly empty, and need to be rebalanced to consolidate usage into fewer but fuller metadata chunks, thus freeing the space currently taken by all those mostly empty metadata chunks.
)

*OR* the reverse:

(
M1: All metadata chunks are full and more need to be allocated, but they can't be as there's no more unallocated space to allocate the new metadata chunks from,

*AND*

M2: There's a whole bunch of excess data chunks allocated, using up all the unallocated space, but they're mostly empty, and need to be rebalanced to consolidate usage into fewer but fuller data chunks, thus freeing the space currently taken by all those mostly empty data chunks.
)

In both cases, the one type is full and needs more allocation, but the other type is hogging all the space with mostly empty chunks.

In both cases, then, you *DON'T* want to bother with the full type, since it's full and rewriting it won't do anything but shuffle the full chunks around -- you can't combine any because they're all full.

In both cases, what you *WANT* to do is deal with the EMPTY type, the chunks that are hogging all the space but not actually using it.

This is evidently a bit counterintuitive on first glance as you're not the first to have problems with it, but it /is/ the case, and once you understand what's actually happening and why, it /does/ make sense.
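The consolidation described in D2 and M2 can be put into numbers. The following is a back-of-the-envelope sketch, assuming all chunks are the same size and that a rewrite packs their contents perfectly into full chunks; `chunks_freed` is a hypothetical helper and the figures are illustrative, not measurements.

```python
def chunks_freed(count: int, percent_used: int) -> int:
    """Chunks returned to unallocated space after consolidation.

    `count` equally sized chunks, each `percent_used` percent full, are
    rewritten into as few completely full chunks as possible. The result
    is how many chunks are no longer needed afterwards.
    """
    if count < 0 or not 0 <= percent_used <= 100:
        raise ValueError("count must be >= 0 and percent_used within 0..100")
    # Ceiling division in integers: chunks needed to hold all the data.
    needed = (count * percent_used + 99) // 100
    return count - needed


if __name__ == "__main__":
    # Mostly empty chunks: a small rewrite frees a lot of space.
    print(chunks_freed(10, 30))
    # Nearly full chunks: a large rewrite frees very little.
    print(chunks_freed(10, 90))
    # Full chunks: rewriting frees nothing.
    print(chunks_freed(10, 100))
```

Under these assumptions, ten chunks at 30% shrink to three and free seven, ten chunks at 90% shrink to nine and free one, and full chunks free none, which is why the usage filter targets the emptiest chunks.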
More specifically, in the D case, where all /data/ chunks are full, you want to rebalance the mostly empty /metadata/ chunks, combining for example 5 near 20% full metadata chunks into a single near 100% full metadata chunk, deallocating the other four metadata chunks (instead of rewriting empty chunks) once there's nothing in them at all. Five just became one, freeing four to unallocated space, which can now be used to allocate new data chunks.

And the reverse in the M case, where all metadata chunks are full. Here, you want to rebalance the mostly empty data chunks, again combining say five 20% usage data chunks into a single 100% usage data chunk, deallocating the other four data chunks once there's nothing in them at all. Again, five just became one, freeing four to unallocated space, which now can be used to allocate new, in this case, metadata chunks.

Thus the goal is to rebalance the nearly /empty/ chunks of the *OPPOSITE* type to the one you're running short on, combining multiple nearly empty chunks of the type you have too many of, thus freeing that empty space back to unallocated, so the type that you're actually short on can actually allocate chunks from the just freed to unallocated space.

That being the goal, working with the full chunks won't get you much. Suppose you work with the 95% full chunks, 5% empty. You'll have to rewrite *TWENTY* of them to combine all those 5% empties to free just *ONE* chunk! And rewriting 100% full chunks won't get you anything at all toward this