btrfs scrub status misreports as interrupted

2014-11-22 Thread Marc Joliet
Hi all,

While I no longer get any "scrub already running" type errors, I do
see one strange case of state misreporting.  When running scrub on /home (btrfs
RAID10), after 3 of 4 drives have completed, the 4th drive (sdb) is reported as
interrupted, despite still running:

# btrfs scrub status -d /home
scrub status for 472c9290-3ff2-4096-9c47-0612d3a52cef
scrub device /dev/sda (id 1) history
	scrub started at Sat Nov 22 11:57:34 2014 and finished after 3380 seconds
	total bytes scrubbed: 252.86GiB with 0 errors
scrub device /dev/sdb (id 2) status
	scrub started at Sat Nov 22 11:57:34 2014, interrupted after 3698 seconds, not running
	total bytes scrubbed: 217.50GiB with 0 errors
scrub device /dev/sdc (id 3) history
	scrub started at Sat Nov 22 11:57:34 2014 and finished after 3013 seconds
	total bytes scrubbed: 252.85GiB with 0 errors
scrub device /dev/sdd (id 4) history
	scrub started at Sat Nov 22 11:57:34 2014 and finished after 2994 seconds
	total bytes scrubbed: 252.85GiB with 0 errors

The funny thing is, the time will still update as the scrub keeps going:

# btrfs scrub status -d /home
scrub status for 472c9290-3ff2-4096-9c47-0612d3a52cef
scrub device /dev/sda (id 1) history
	scrub started at Sat Nov 22 11:57:34 2014 and finished after 3380 seconds
	total bytes scrubbed: 252.86GiB with 0 errors
scrub device /dev/sdb (id 2) status
	scrub started at Sat Nov 22 11:57:34 2014, interrupted after 4136 seconds, not running
	total bytes scrubbed: 239.44GiB with 0 errors
scrub device /dev/sdc (id 3) history
	scrub started at Sat Nov 22 11:57:34 2014 and finished after 3013 seconds
	total bytes scrubbed: 252.85GiB with 0 errors
scrub device /dev/sdd (id 4) history
	scrub started at Sat Nov 22 11:57:34 2014 and finished after 2994 seconds
	total bytes scrubbed: 252.85GiB with 0 errors

This has happened a few times, and when sdb finally finishes, the status is then
reported correctly as finished:

# btrfs scrub status -d /home
scrub status for 472c9290-3ff2-4096-9c47-0612d3a52cef
scrub device /dev/sda (id 1) history
	scrub started at Sat Nov 22 11:57:34 2014 and finished after 3380 seconds
	total bytes scrubbed: 252.86GiB with 0 errors
scrub device /dev/sdb (id 2) history
	scrub started at Sat Nov 22 11:57:34 2014 and finished after 4426 seconds
	total bytes scrubbed: 252.88GiB with 0 errors
scrub device /dev/sdc (id 3) history
	scrub started at Sat Nov 22 11:57:34 2014 and finished after 3013 seconds
	total bytes scrubbed: 252.85GiB with 0 errors
scrub device /dev/sdd (id 4) history
	scrub started at Sat Nov 22 11:57:34 2014 and finished after 2994 seconds
	total bytes scrubbed: 252.85GiB with 0 errors

Kernel and btrfs-progs version:

# uname -a
Linux marcec 3.16.7-gentoo #1 SMP PREEMPT Fri Oct 31 22:45:54 CET 2014 x86_64 AMD Athlon(tm) 64 X2 Dual Core Processor 4200+ AuthenticAMD GNU/Linux

# btrfs --version
Btrfs v3.17.1

Should I open a report on bugzilla?

-- 
Marc Joliet
--
People who think they know everything really annoy those of us who know we
don't - Bjarne Stroustrup



[PATCH v2] Btrfs: fix allocating memory failure for btrfsic_state structure

2014-11-22 Thread Wang Shilong
The @btrfsic_state structure needs more than 2M, so allocating it with
kzalloc() is very likely to fail. See the following message:

[91428.902148] Call Trace:
[816f6e0f] dump_stack+0x4d/0x66
[811b1c7f] warn_alloc_failed+0xff/0x170
[811b66e1] __alloc_pages_nodemask+0x951/0xc30
[811fd9da] alloc_pages_current+0x11a/0x1f0
[811b1e0b] ? alloc_kmem_pages+0x3b/0xf0
[811b1e0b] alloc_kmem_pages+0x3b/0xf0
[811d1018] kmalloc_order+0x18/0x50
[811d1074] kmalloc_order_trace+0x24/0x140
[a06c097b] btrfsic_mount+0x8b/0xae0 [btrfs]
[810af555] ? check_preempt_curr+0x85/0xa0
[810b2de3] ? try_to_wake_up+0x103/0x430
[a063d200] open_ctree+0x1bd0/0x2130 [btrfs]
[a060fdde] btrfs_mount+0x62e/0x8b0 [btrfs]
[811fd9da] ? alloc_pages_current+0x11a/0x1f0
[811b0a5e] ? __get_free_pages+0xe/0x50
[81230429] mount_fs+0x39/0x1b0
[812509fb] vfs_kern_mount+0x6b/0x150
[812537fb] do_mount+0x27b/0xc30
[811b0a5e] ? __get_free_pages+0xe/0x50
[812544f6] SyS_mount+0x96/0xf0
[81701970] system_call_fastpath+0x16/0x1b

Since we are allocating memory for a hash table array, it would
be good if we could allocate contiguous pages here.

Fix this problem by trying kzalloc() first and, if that fails,
using vzalloc() instead.
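
The "try the contiguous allocator first, then fall back" idea described above can be sketched in userspace C. This is only an illustration, not the kernel code: MODEL_CONTIG_LIMIT and every model_* name are hypothetical, and calloc() stands in for both kzalloc() and vzalloc().

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical cap standing in for "large contiguous runs are hard to get". */
#define MODEL_CONTIG_LIMIT ((size_t)128 * 1024)

enum model_alloc_kind { MODEL_NONE, MODEL_CONTIG, MODEL_VIRT };

struct model_alloc {
	void *ptr;
	enum model_alloc_kind kind;
};

/* Stand-in for kzalloc(): zeroed memory, but refuses large requests. */
void *model_contig_zalloc(size_t size)
{
	if (size > MODEL_CONTIG_LIMIT)
		return NULL;
	return calloc(1, size);
}

/* Stand-in for vzalloc(): zeroed memory without a contiguity requirement. */
void *model_virt_zalloc(size_t size)
{
	return calloc(1, size);
}

/* Try the contiguous allocator first and fall back only when it fails. */
struct model_alloc model_alloc_state(size_t size)
{
	struct model_alloc a = { model_contig_zalloc(size), MODEL_CONTIG };

	if (!a.ptr) {
		a.ptr = model_virt_zalloc(size);
		a.kind = a.ptr ? MODEL_VIRT : MODEL_NONE;
	}
	return a;
}

/* One release path for both kinds, which is the role kvfree() plays. */
void model_free_state(struct model_alloc *a)
{
	assert(a != NULL);
	free(a->ptr);
	a->ptr = NULL;
	a->kind = MODEL_NONE;
}
```

In this model a 1 KiB request is served by the contiguous path, while a request of a bit more than 2 MiB lands on the fallback path, which mirrors the situation the patch addresses.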

Signed-off-by: Wang Shilong wangshilong1...@gmail.com
---
v1->v2:
include vmalloc.h and switch to the kvfree() helper.
---
 fs/btrfs/check-integrity.c | 14 +++++++++-----
 1 file changed, 9 insertions(+), 5 deletions(-)

diff --git a/fs/btrfs/check-integrity.c b/fs/btrfs/check-integrity.c
index cb7f3fe..f88b3a6 100644
--- a/fs/btrfs/check-integrity.c
+++ b/fs/btrfs/check-integrity.c
@@ -94,6 +94,7 @@
 #include <linux/mutex.h>
 #include <linux/genhd.h>
 #include <linux/blkdev.h>
+#include <linux/vmalloc.h>
 #include "ctree.h"
 #include "disk-io.h"
 #include "hash.h"
@@ -3130,10 +3131,13 @@ int btrfsic_mount(struct btrfs_root *root,
 		       root->sectorsize, PAGE_CACHE_SIZE);
 		return -1;
 	}
-	state = kzalloc(sizeof(*state), GFP_NOFS);
-	if (NULL == state) {
-		printk(KERN_INFO "btrfs check-integrity: kmalloc() failed!\n");
-		return -1;
+	state = kzalloc(sizeof(*state), GFP_KERNEL | __GFP_NOWARN | __GFP_REPEAT);
+	if (!state) {
+		state = vzalloc(sizeof(*state));
+		if (!state) {
+			printk(KERN_INFO "btrfs check-integrity: vzalloc() failed!\n");
+			return -1;
+		}
 	}
 
 	if (!btrfsic_is_initialized) {
@@ -3277,5 +3281,5 @@ void btrfsic_unmount(struct btrfs_root *root,
 
 	mutex_unlock(&btrfsic_mutex);
 
-	kfree(state);
+	kvfree(state);
 }
-- 
1.7.12.4

--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] Btrfs: switch to kvfree() helper

2014-11-22 Thread Wang Shilong
The new kvfree() helper in mm/util.c chooses between vfree() and kfree()
for us, so use it instead of open-coding the is_vmalloc_addr() check.
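
As a rough userspace illustration of what such a helper does, the sketch below dispatches on the address range of the pointer. Everything here is a stand-in: the fake_* names, the static "vmalloc area" array and the call counters are hypothetical and exist only to make the dispatch visible.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical region standing in for the kernel's vmalloc address range. */
#define FAKE_VMALLOC_SIZE 4096
unsigned char fake_vmalloc_area[FAKE_VMALLOC_SIZE];

int fake_vfree_calls;
int fake_kfree_calls;

/* Stand-in for is_vmalloc_addr(): is the pointer inside the fake region? */
bool fake_is_vmalloc_addr(const void *p)
{
	uintptr_t addr = (uintptr_t)p;
	uintptr_t start = (uintptr_t)fake_vmalloc_area;

	return addr >= start && addr < start + FAKE_VMALLOC_SIZE;
}

/* Stand-in for vfree(): only counts, the fake region is static storage. */
void fake_vfree(void *p)
{
	assert(fake_is_vmalloc_addr(p));
	fake_vfree_calls++;
}

/* Stand-in for kfree(): releases heap memory. */
void fake_kfree(void *p)
{
	free(p);
	fake_kfree_calls++;
}

/* The helper: callers no longer need to know how the memory was obtained. */
void fake_kvfree(void *p)
{
	if (fake_is_vmalloc_addr(p))
		fake_vfree(p);
	else
		fake_kfree(p);
}
```

The point of the design is that the knowledge of "which allocator produced this pointer" lives in one place, so call sites like the two in raid56.c shrink to a single call.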

Signed-off-by: Wang Shilong wangshilong1...@gmail.com
---
 fs/btrfs/raid56.c | 13 +++----------
 1 file changed, 3 insertions(+), 10 deletions(-)

diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c
index 6a41631..12e343b 100644
--- a/fs/btrfs/raid56.c
+++ b/fs/btrfs/raid56.c
@@ -221,12 +221,8 @@ int btrfs_alloc_stripe_hash_table(struct btrfs_fs_info *info)
 	}
 
 	x = cmpxchg(&info->stripe_hash_table, NULL, table);
-	if (x) {
-		if (is_vmalloc_addr(x))
-			vfree(x);
-		else
-			kfree(x);
-	}
+	if (x)
+		kvfree(x);
 	return 0;
 }
 
@@ -436,10 +432,7 @@ void btrfs_free_stripe_hash_table(struct btrfs_fs_info *info)
 	if (!info->stripe_hash_table)
 		return;
 	btrfs_clear_rbio_cache(info);
-	if (is_vmalloc_addr(info->stripe_hash_table))
-		vfree(info->stripe_hash_table);
-	else
-		kfree(info->stripe_hash_table);
+	kvfree(info->stripe_hash_table);
 	info->stripe_hash_table = NULL;
 }
 
-- 
1.7.12.4



open_ctree problem

2014-11-22 Thread Russell Coker
I have a workstation running Linux 3.14.something on a 120G SSD.  It recently 
had a problem and now the root filesystem can't be mounted, here is the 
message I get when trying to mount it read-only on Debian kernel 3.16.2-3:

[4703937.784447] BTRFS info (device loop0): disk space caching is enabled
[4703938.754247] BTRFS: log replay required on RO media
[4703938.794148] BTRFS: open_ctree failed

When I tried to boot it normally it gave a lot of kernel messages and failed 
to mount it.

Here's the error I get from the btrfs-zero-log in btrfs-tools 0.19+20130501-1:

# btrfs-zero-log yayia-corrupt 
extent buffer leak: start 157263929344 len 4096
*** Error in `btrfs-zero-log': corrupted double-linked list: 0x01068960 ***
Aborted

I installed btrfs-tools 3.17-1 and then btrfs-zero-log ran without error.  But 
when I then tried to mount the filesystem with Debian kernel 3.16.2-3, I got the 
attached kernel error.

Any suggestions on what I should do next?

-- 
My Main Blog http://etbe.coker.com.au/
My Documents Blog http://doc.coker.com.au/


dmesg.txt.gz
Description: GNU Zip compressed data


[PATCH-v2 1/5] fs: split update_time() into update_time() and write_time()

2014-11-22 Thread Theodore Ts'o
In preparation for adding support for the lazytime mount option, we
need to be able to separate out the update_time() and write_time()
inode operations.  Currently, only btrfs and xfs use update_time().

We needed to preserve update_time() because btrfs wants to have a
special btrfs_root_readonly() check; otherwise we could drop the
update_time() inode operation entirely.
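
The split described above can be pictured with a small userspace model of the dispatch: first update the in-memory time (through a filesystem hook if one exists), then hand the "write it out" step to a second hook or to a generic fallback. The sketch_* names, the read_only flag and the counters are hypothetical and only loosely follow the shape of the patch.

```c
#include <assert.h>
#include <stddef.h>

struct sketch_inode;

/* Two separate hooks, mirroring update_time() and write_time(). */
struct sketch_inode_ops {
	int (*update_time)(struct sketch_inode *inode, long now);
	int (*write_time)(struct sketch_inode *inode);
};

struct sketch_inode {
	const struct sketch_inode_ops *ops;
	long mtime;
	int generic_dirty_marks; /* stand-in for mark_inode_dirty_sync() */
	int fs_writes;           /* how often the write_time hook ran */
	int read_only;           /* stand-in for a read-only root check */
};

/* Generic entry point: step 1 updates memory, step 2 schedules the write. */
int sketch_update_time(struct sketch_inode *inode, long now)
{
	int ret;

	assert(inode && inode->ops);
	if (inode->ops->update_time) {
		ret = inode->ops->update_time(inode, now);
		if (ret)
			return ret;
	} else {
		inode->mtime = now;
	}
	if (inode->ops->write_time)
		return inode->ops->write_time(inode);
	inode->generic_dirty_marks++;
	return 0;
}

/* Example hooks for a filesystem that wants its own read-only check. */
int sketch_fs_update_time(struct sketch_inode *inode, long now)
{
	if (inode->read_only)
		return -1; /* stand-in for an error such as -EROFS */
	inode->mtime = now;
	return 0;
}

int sketch_fs_write_time(struct sketch_inode *inode)
{
	inode->fs_writes++;
	return 0;
}
```

Keeping the two steps apart is what later lets a lazytime mode skip the second step while still performing the first.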

Signed-off-by: Theodore Ts'o ty...@mit.edu
Cc: x...@oss.sgi.com
Cc: linux-btrfs@vger.kernel.org
---
 fs/btrfs/inode.c   | 10 ++++++++++
 fs/inode.c         | 29 ++++++++++++++++++-----------
 fs/xfs/xfs_iops.c  | 39 ++++++++++++++++-----------------------
 include/linux/fs.h |  1 +
 4 files changed, 45 insertions(+), 34 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index d23362f..a5e0d0d 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -5574,6 +5574,11 @@ static int btrfs_update_time(struct inode *inode, struct timespec *now,
 		inode->i_mtime = *now;
 	if (flags & S_ATIME)
 		inode->i_atime = *now;
+	return 0;
+}
+
+static int btrfs_write_time(struct inode *inode)
+{
 	return btrfs_dirty_inode(inode);
 }
 
@@ -9462,6 +9467,7 @@ static const struct inode_operations btrfs_dir_inode_operations = {
 	.get_acl	= btrfs_get_acl,
 	.set_acl	= btrfs_set_acl,
 	.update_time	= btrfs_update_time,
+	.write_time	= btrfs_write_time,
 	.tmpfile	= btrfs_tmpfile,
 };
 static const struct inode_operations btrfs_dir_ro_inode_operations = {
@@ -9470,6 +9476,7 @@ static const struct inode_operations btrfs_dir_ro_inode_operations = {
 	.get_acl	= btrfs_get_acl,
 	.set_acl	= btrfs_set_acl,
 	.update_time	= btrfs_update_time,
+	.write_time	= btrfs_write_time,
 };
 
 static const struct file_operations btrfs_dir_file_operations = {
@@ -9540,6 +9547,7 @@ static const struct inode_operations btrfs_file_inode_operations = {
 	.get_acl	= btrfs_get_acl,
 	.set_acl	= btrfs_set_acl,
 	.update_time	= btrfs_update_time,
+	.write_time	= btrfs_write_time,
 };
 static const struct inode_operations btrfs_special_inode_operations = {
 	.getattr	= btrfs_getattr,
@@ -9552,6 +9560,7 @@ static const struct inode_operations btrfs_special_inode_operations = {
 	.get_acl	= btrfs_get_acl,
 	.set_acl	= btrfs_set_acl,
 	.update_time	= btrfs_update_time,
+	.write_time	= btrfs_write_time,
 };
 static const struct inode_operations btrfs_symlink_inode_operations = {
 	.readlink	= generic_readlink,
@@ -9565,6 +9574,7 @@ static const struct inode_operations btrfs_symlink_inode_operations = {
 	.listxattr	= btrfs_listxattr,
 	.removexattr	= btrfs_removexattr,
 	.update_time	= btrfs_update_time,
+	.write_time	= btrfs_write_time,
 };
 
 const struct dentry_operations btrfs_dentry_operations = {
diff --git a/fs/inode.c b/fs/inode.c
index 26753ba..8f5c4b5 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -1499,17 +1499,24 @@ static int relatime_need_update(struct vfsmount *mnt, struct inode *inode,
  */
 static int update_time(struct inode *inode, struct timespec *time, int flags)
 {
-	if (inode->i_op->update_time)
-		return inode->i_op->update_time(inode, time, flags);
-
-	if (flags & S_ATIME)
-		inode->i_atime = *time;
-	if (flags & S_VERSION)
-		inode_inc_iversion(inode);
-	if (flags & S_CTIME)
-		inode->i_ctime = *time;
-	if (flags & S_MTIME)
-		inode->i_mtime = *time;
+	int ret;
+
+	if (inode->i_op->update_time) {
+		ret = inode->i_op->update_time(inode, time, flags);
+		if (ret)
+			return ret;
+	} else {
+		if (flags & S_ATIME)
+			inode->i_atime = *time;
+		if (flags & S_VERSION)
+			inode_inc_iversion(inode);
+		if (flags & S_CTIME)
+			inode->i_ctime = *time;
+		if (flags & S_MTIME)
+			inode->i_mtime = *time;
+	}
+	if (inode->i_op->write_time)
+		return inode->i_op->write_time(inode);
 	mark_inode_dirty_sync(inode);
 	return 0;
 }
diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
index ec6dcdc..0e9653c 100644
--- a/fs/xfs/xfs_iops.c
+++ b/fs/xfs/xfs_iops.c
@@ -984,10 +984,8 @@ xfs_vn_setattr(
 }
 
 STATIC int
-xfs_vn_update_time(
-	struct inode		*inode,
-	struct timespec		*now,
-	int			flags)
+xfs_vn_write_time(
+	struct inode		*inode)
 {
 	struct xfs_inode	*ip = XFS_I(inode);
 	struct xfs_mount	*mp = ip->i_mount;
@@ -1004,21 +1002,16 @@ xfs_vn_update_time(
 	}
 
 	xfs_ilock(ip, XFS_ILOCK_EXCL);
-	if (flags & S_CTIME) {
-		inode->i_ctime = *now;
-   

[PATCH-v2 3/5] vfs: don't let the dirty time inodes get more than a day stale

2014-11-22 Thread Theodore Ts'o
Guarantee that the on-disk timestamps will be no more than 24 hours
stale.

Signed-off-by: Theodore Ts'o ty...@mit.edu
---
 fs/fs-writeback.c  |  1 +
 fs/inode.c         | 16 +++++++++++++++-
 include/linux/fs.h |  1 +
 3 files changed, 17 insertions(+), 1 deletion(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index ce7de22..eb04277 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -1141,6 +1141,7 @@ void __mark_inode_dirty(struct inode *inode, int flags)
 	if (flags & (I_DIRTY_SYNC | I_DIRTY_DATASYNC)) {
 		trace_writeback_dirty_inode_start(inode, flags);
 
+		inode->i_ts_dirty_day = 0;
 		if (sb->s_op->dirty_inode)
 			sb->s_op->dirty_inode(inode, flags);
 
diff --git a/fs/inode.c b/fs/inode.c
index 11fe81b..0d939a8 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -1511,6 +1511,7 @@ static int relatime_need_update(struct vfsmount *mnt, struct inode *inode,
  */
 static int update_time(struct inode *inode, struct timespec *time, int flags)
 {
+	unsigned short days_since_boot = jiffies / (HZ * 86400);
 	int ret;
 
 	if (inode->i_op->update_time) {
@@ -1527,14 +1528,27 @@ static int update_time(struct inode *inode, struct timespec *time, int flags)
 		if (flags & S_MTIME)
 			inode->i_mtime = *time;
 	}
-	if (inode->i_sb->s_flags & MS_LAZYTIME) {
+	/*
+	 * If i_ts_dirty_day is zero, then either we have not deferred
+	 * timestamp updates, or the system has been up for less than
+	 * a day (so days_since_boot is zero), so we defer timestamp
+	 * updates in that case and set the I_DIRTY_TIME flag.  If a
+	 * day or more has passed, then i_ts_dirty_day will be
+	 * different from days_since_boot, and then we should update
+	 * the on-disk inode and then we can clear i_ts_dirty_day.
+	 */
+	if ((inode->i_sb->s_flags & MS_LAZYTIME) &&
+	    (!inode->i_ts_dirty_day ||
+	     inode->i_ts_dirty_day == days_since_boot)) {
 		if (inode->i_state & I_DIRTY_TIME)
 			return 0;
 		spin_lock(&inode->i_lock);
 		inode->i_state |= I_DIRTY_TIME;
 		spin_unlock(&inode->i_lock);
+		inode->i_ts_dirty_day = days_since_boot;
 		return 0;
 	}
+	inode->i_ts_dirty_day = 0;
 	if (inode->i_op->write_time)
 		return inode->i_op->write_time(inode);
 	mark_inode_dirty_sync(inode);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 489b2f2..e3574cd 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -575,6 +575,7 @@ struct inode {
 	struct timespec		i_ctime;
 	spinlock_t		i_lock;	/* i_blocks, i_bytes, maybe i_size */
 	unsigned short		i_bytes;
+	unsigned short		i_ts_dirty_day;
 	unsigned int		i_blkbits;
 	blkcnt_t		i_blocks;
 
-- 
2.1.0



[PATCH-v2 4/5] vfs: add lazytime tracepoints for better debugging

2014-11-22 Thread Theodore Ts'o
Signed-off-by: Theodore Ts'o ty...@mit.edu
---
 fs/fs-writeback.c |  5 -
 fs/inode.c|  5 +
 include/trace/events/fs.h | 56 +++
 3 files changed, 65 insertions(+), 1 deletion(-)
 create mode 100644 include/trace/events/fs.h

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index eb04277..cab2d6d 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -27,6 +27,7 @@
 #include <linux/backing-dev.h>
 #include <linux/tracepoint.h>
 #include <linux/device.h>
+#include <trace/events/fs.h>
 #include "internal.h"
 
 /*
@@ -1304,8 +1305,10 @@ static void flush_sb_dirty_time(struct super_block *sb)
 		iput(old_inode);
 		old_inode = inode;
 
-		if (dirty_time)
+		if (dirty_time) {
+			trace_fs_lazytime_flush(inode);
 			mark_inode_dirty(inode);
+		}
 		cond_resched();
 		spin_lock(&inode_sb_list_lock);
 	}
diff --git a/fs/inode.c b/fs/inode.c
index 0d939a8..5a9a7b0 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -20,6 +20,9 @@
 #include <linux/list_lru.h>
 #include "internal.h"
 
+#define CREATE_TRACE_POINTS
+#include <trace/events/fs.h>
+
 /*
  * Inode locking rules:
  *
@@ -544,6 +547,7 @@ static void evict(struct inode *inode)
 			mark_inode_dirty(inode);
 			inode->i_sb->s_op->write_inode(inode, &wbc);
 		}
+		trace_fs_lazytime_evict(inode);
 	}
 
 	if (!list_empty(&inode->i_wb_list))
@@ -1546,6 +1550,7 @@ static int update_time(struct inode *inode, struct timespec *time, int flags)
 		inode->i_state |= I_DIRTY_TIME;
 		spin_unlock(&inode->i_lock);
 		inode->i_ts_dirty_day = days_since_boot;
+		trace_fs_lazytime_defer(inode);
 		return 0;
 	}
 	inode->i_ts_dirty_day = 0;
diff --git a/include/trace/events/fs.h b/include/trace/events/fs.h
new file mode 100644
index 0000000..ca06d5c
--- /dev/null
+++ b/include/trace/events/fs.h
@@ -0,0 +1,56 @@
+#undef TRACE_SYSTEM
+#define TRACE_SYSTEM fs
+
+#if !defined(_TRACE_FS_H) || defined(TRACE_HEADER_MULTI_READ)
+#define _TRACE_FS_H
+
+#include <linux/tracepoint.h>
+
+DECLARE_EVENT_CLASS(fs__inode,
+	TP_PROTO(struct inode *inode),
+
+	TP_ARGS(inode),
+
+	TP_STRUCT__entry(
+		__field(dev_t,	dev			)
+		__field(ino_t,	ino			)
+		__field(uid_t,	uid			)
+		__field(gid_t,	gid			)
+		__field(__u16, mode			)
+	),
+
+	TP_fast_assign(
+		__entry->dev	= inode->i_sb->s_dev;
+		__entry->ino	= inode->i_ino;
+		__entry->uid	= i_uid_read(inode);
+		__entry->gid	= i_gid_read(inode);
+		__entry->mode	= inode->i_mode;
+	),
+
+	TP_printk("dev %d,%d ino %lu mode 0%o uid %u gid %u",
+		  MAJOR(__entry->dev), MINOR(__entry->dev),
+		  (unsigned long) __entry->ino, __entry->mode,
+		  __entry->uid, __entry->gid)
+);
+
+DEFINE_EVENT(fs__inode, fs_lazytime_defer,
+	TP_PROTO(struct inode *inode),
+
+	TP_ARGS(inode)
+);
+
+DEFINE_EVENT(fs__inode, fs_lazytime_evict,
+	TP_PROTO(struct inode *inode),
+
+	TP_ARGS(inode)
+);
+
+DEFINE_EVENT(fs__inode, fs_lazytime_flush,
+	TP_PROTO(struct inode *inode),
+
+	TP_ARGS(inode)
+);
+#endif /* _TRACE_FS_H */
+
+/* This part must be outside protection */
+#include <trace/define_trace.h>
-- 
2.1.0



[PATCH-v2 5/5] ext4: add support for a lazytime mount option

2014-11-22 Thread Theodore Ts'o
Add an optimization for the MS_LAZYTIME mount option so that we will
opportunistically write out any inodes with the I_DIRTY_TIME flag set
in a particular inode table block when we need to update some inode in
that inode table block anyway.
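
The opportunistic update described above can be modeled in userspace as "when one slot of a block is written anyway, also flush the neighbours whose times are dirty". This is a simplified sketch, not the ext4 code: SLOTS_PER_BLOCK, struct table_slot and the function name are hypothetical, and slots are addressed by zero-based array index rather than by ext4 inode number.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical number of inode slots that share one table block. */
#define SLOTS_PER_BLOCK 4

struct table_slot {
	bool dirty_time;   /* in-memory time is newer than the on-disk copy */
	long mem_mtime;
	long disk_mtime;
};

/*
 * Write slot `target` and piggyback every other slot in the same block
 * whose timestamp is dirty.  Returns how many extra slots were flushed.
 */
int write_slot_with_piggyback(struct table_slot *slots, size_t nslots,
			      size_t target)
{
	size_t first, end, i;
	int piggybacked = 0;

	assert(slots != NULL && target < nslots);
	first = target - (target % SLOTS_PER_BLOCK);
	end = first + SLOTS_PER_BLOCK;
	if (end > nslots)
		end = nslots;

	for (i = first; i < end; i++) {
		if (i != target && !slots[i].dirty_time)
			continue;
		slots[i].disk_mtime = slots[i].mem_mtime;
		slots[i].dirty_time = false;
		if (i != target)
			piggybacked++;
	}
	return piggybacked;
}
```

The block is being written regardless, so flushing the dirty neighbours costs no additional I/O in this model; slots in other blocks stay deferred.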

Also add some temporary code so that we can set the lazytime mount
option without needing a modified /sbin/mount program which can set
MS_LAZYTIME.  We can eventually make this go away once util-linux has
added support.

Google-Bug-Id: 18297052

Signed-off-by: Theodore Ts'o ty...@mit.edu
---
 fs/ext4/inode.c | 48 ++---
 fs/ext4/super.c |  9 +
 fs/inode.c  | 36 ++
 include/linux/fs.h  |  2 ++
 include/trace/events/ext4.h | 29 +++
 5 files changed, 121 insertions(+), 3 deletions(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 3356ab5..03149b4 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -4163,6 +4163,50 @@ static int ext4_inode_blocks_set(handle_t *handle,
 }
 
 /*
+ * Opportunistically update the other time fields for other inodes in
+ * the same inode table block.
+ */
+static void ext4_update_other_inodes_time(struct super_block *sb,
+					  unsigned long orig_ino, char *buf)
+{
+	struct ext4_inode_info	*ei;
+	struct ext4_inode	*raw_inode;
+	unsigned long		ino;
+	struct inode		*inode;
+	int		i, inodes_per_block = EXT4_SB(sb)->s_inodes_per_block;
+	int		inode_size = EXT4_INODE_SIZE(sb);
+
+	ino = orig_ino & ~(inodes_per_block - 1);
+	for (i = 0; i < inodes_per_block; i++, ino++, buf += inode_size) {
+		if (ino == orig_ino)
+			continue;
+		inode = find_active_inode_nowait(sb, ino);
+		if (!inode ||
+		    (inode->i_state & I_DIRTY_TIME) == 0 ||
+		    !spin_trylock(&inode->i_lock)) {
+			iput(inode);
+			continue;
+		}
+		inode->i_state &= ~I_DIRTY_TIME;
+		inode->i_ts_dirty_day = 0;
+		spin_unlock(&inode->i_lock);
+
+		ei = EXT4_I(inode);
+		raw_inode = (struct ext4_inode *) buf;
+
+		spin_lock(&ei->i_raw_lock);
+		EXT4_INODE_SET_XTIME(i_ctime, inode, raw_inode);
+		EXT4_INODE_SET_XTIME(i_mtime, inode, raw_inode);
+		EXT4_INODE_SET_XTIME(i_atime, inode, raw_inode);
+		ext4_inode_csum_set(inode, raw_inode, ei);
+		spin_unlock(&ei->i_raw_lock);
+		trace_ext4_other_inode_update_time(inode, orig_ino);
+		iput(inode);
+	}
+}
+
+
+/*
  * Post the struct inode info into an on-disk inode location in the
  * buffer-cache.  This gobbles the caller's reference to the
  * buffer_head in the inode location struct.
@@ -4260,7 +4304,6 @@ static int ext4_do_update_inode(handle_t *handle,
 		for (block = 0; block < EXT4_N_BLOCKS; block++)
 			raw_inode->i_block[block] = ei->i_data[block];
 	}
-
 	if (likely(!test_opt2(inode->i_sb, HURD_COMPAT))) {
 		raw_inode->i_disk_version = cpu_to_le32(inode->i_version);
 		if (ei->i_extra_isize) {
@@ -4271,10 +4314,9 @@ static int ext4_do_update_inode(handle_t *handle,
 				cpu_to_le16(ei->i_extra_isize);
 		}
 	}
-
 	ext4_inode_csum_set(inode, raw_inode, ei);
-
 	spin_unlock(&ei->i_raw_lock);
+	ext4_update_other_inodes_time(inode->i_sb, inode->i_ino, bh->b_data);
 
 	BUFFER_TRACE(bh, "call ext4_handle_dirty_metadata");
 	rc = ext4_handle_dirty_metadata(handle, NULL, bh);
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 4b79f39..1ac1914 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -1133,6 +1133,7 @@ enum {
 	Opt_noquota, Opt_barrier, Opt_nobarrier, Opt_err,
 	Opt_usrquota, Opt_grpquota, Opt_i_version,
 	Opt_stripe, Opt_delalloc, Opt_nodelalloc, Opt_mblk_io_submit,
+	Opt_lazytime, Opt_nolazytime,
 	Opt_nomblk_io_submit, Opt_block_validity, Opt_noblock_validity,
 	Opt_inode_readahead_blks, Opt_journal_ioprio,
 	Opt_dioread_nolock, Opt_dioread_lock,
@@ -1195,6 +1196,8 @@ static const match_table_t tokens = {
 	{Opt_i_version, "i_version"},
 	{Opt_stripe, "stripe=%u"},
 	{Opt_delalloc, "delalloc"},
+	{Opt_lazytime, "lazytime"},
+	{Opt_nolazytime, "nolazytime"},
 	{Opt_nodelalloc, "nodelalloc"},
 	{Opt_removed, "mblk_io_submit"},
 	{Opt_removed, "nomblk_io_submit"},
@@ -1450,6 +1453,12 @@ static int handle_mount_opt(struct super_block *sb, char *opt, int token,
 	case Opt_i_version:
 		sb->s_flags |= MS_I_VERSION;
 		return 1;
+	case Opt_lazytime:
+		sb->s_flags |= MS_LAZYTIME;
+   return 

[PATCH-v2 2/5] vfs: add support for a lazytime mount option

2014-11-22 Thread Theodore Ts'o
Add a new mount option which enables a new lazytime mode.  This mode
causes atime, mtime, and ctime updates to only be made to the
in-memory version of the inode.  The on-disk times will only get
updated when (a) the inode needs to be updated for some non-time
related change, (b) userspace calls fsync(), syncfs() or sync(), or
(c) an undeleted inode is about to be evicted from memory.

This is OK according to POSIX because there are no guarantees after a
crash unless userspace explicitly requests them via an fsync(2) call.

For workloads which feature a large number of random writes to a
preallocated file, the lazytime mount option significantly reduces
writes to the inode table.  The repeated 4k writes to a single block
will result in undesirable stress on flash devices and SMR disk
drives.  Even on conventional HDDs, the repeated writes to the inode
table block will trigger Adjacent Track Interference (ATI) remediation
latencies, which very negatively impact 99.9 percentile latencies,
a very big deal for web serving tiers (for example).

Google-Bug-Id: 18297052

Signed-off-by: Theodore Ts'o ty...@mit.edu
---
 fs/fs-writeback.c       | 38 +++++++++++++++++++++++++++++++++++++-
 fs/inode.c              | 20 ++++++++++++++++++++
 fs/proc_namespace.c     |  1 +
 fs/sync.c               |  7 +++++++
 include/linux/fs.h      |  1 +
 include/uapi/linux/fs.h |  1 +
 6 files changed, 67 insertions(+), 1 deletion(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index ef9bef1..ce7de22 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -483,7 +483,7 @@ __writeback_single_inode(struct inode *inode, struct writeback_control *wbc)
 	if (!mapping_tagged(mapping, PAGECACHE_TAG_DIRTY))
 		inode->i_state &= ~I_DIRTY_PAGES;
 	dirty = inode->i_state & I_DIRTY;
-	inode->i_state &= ~(I_DIRTY_SYNC | I_DIRTY_DATASYNC);
+	inode->i_state &= ~(I_DIRTY_SYNC | I_DIRTY_DATASYNC | I_DIRTY_TIME);
 	spin_unlock(&inode->i_lock);
 	/* Don't write the inode if only I_DIRTY_PAGES was set */
 	if (dirty & (I_DIRTY_SYNC | I_DIRTY_DATASYNC)) {
@@ -1277,6 +1277,41 @@ static void wait_sb_inodes(struct super_block *sb)
 	iput(old_inode);
 }
 
+/*
+ * This works like wait_sb_inodes(), but it is called *before* we kick
+ * the bdi so the inodes can get written out.
+ */
+static void flush_sb_dirty_time(struct super_block *sb)
+{
+	struct inode *inode, *old_inode = NULL;
+
+	WARN_ON(!rwsem_is_locked(&sb->s_umount));
+	spin_lock(&inode_sb_list_lock);
+	list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
+		int dirty_time;
+
+		spin_lock(&inode->i_lock);
+		if (inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW)) {
+			spin_unlock(&inode->i_lock);
+			continue;
+		}
+		dirty_time = inode->i_state & I_DIRTY_TIME;
+		__iget(inode);
+		spin_unlock(&inode->i_lock);
+		spin_unlock(&inode_sb_list_lock);
+
+		iput(old_inode);
+		old_inode = inode;
+
+		if (dirty_time)
+			mark_inode_dirty(inode);
+		cond_resched();
+		spin_lock(&inode_sb_list_lock);
+	}
+	spin_unlock(&inode_sb_list_lock);
+	iput(old_inode);
+}
+
 /**
  * writeback_inodes_sb_nr -	writeback dirty inodes from given super_block
  * @sb: the superblock
@@ -1388,6 +1423,7 @@ void sync_inodes_sb(struct super_block *sb)
 		return;
 	WARN_ON(!rwsem_is_locked(&sb->s_umount));
 
+	flush_sb_dirty_time(sb);
 	bdi_queue_work(sb->s_bdi, &work);
 	wait_for_completion(&done);
 
diff --git a/fs/inode.c b/fs/inode.c
index 8f5c4b5..11fe81b 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -534,6 +534,18 @@ static void evict(struct inode *inode)
 	BUG_ON(!(inode->i_state & I_FREEING));
 	BUG_ON(!list_empty(&inode->i_lru));
 
+	if (inode->i_nlink && inode->i_state & I_DIRTY_TIME) {
+		if (inode->i_op->write_time)
+			inode->i_op->write_time(inode);
+		else if (inode->i_sb->s_op->write_inode) {
+			struct writeback_control wbc = {
+				.sync_mode = WB_SYNC_NONE,
+			};
+			mark_inode_dirty(inode);
+			inode->i_sb->s_op->write_inode(inode, &wbc);
+		}
+	}
+
 	if (!list_empty(&inode->i_wb_list))
 		inode_wb_list_del(inode);
 
@@ -1515,6 +1527,14 @@ static int update_time(struct inode *inode, struct timespec *time, int flags)
 		if (flags & S_MTIME)
 			inode->i_mtime = *time;
 	}
+	if (inode->i_sb->s_flags & MS_LAZYTIME) {
+		if (inode->i_state & I_DIRTY_TIME)
+			return 0;
+		spin_lock(&inode->i_lock);
+		inode->i_state |= I_DIRTY_TIME;
+		spin_unlock(&inode->i_lock);
+  

[PATCH-v2 0/5] add support for a lazytime mount option

2014-11-22 Thread Theodore Ts'o
This is an updated version of what had originally been an
ext4-specific patch which significantly improves performance by lazily
writing timestamp updates (and in particular, mtime updates) to disk.
The in-memory timestamps are always correct, but they are only written
to disk when required for correctness.

This provides a huge performance boost for ext4 due to how it handles
journalling, but it's valuable for all file systems running on flash
storage or drive-managed SMR disks by reducing the metadata write
load.  So upon request, I've moved the functionality to the VFS layer.
Once the /sbin/mount program adds support for MS_LAZYTIME, all file
systems should be able to benefit from this optimization.

There is still an ext4-specific optimization, which may be applicable
for other file systems which store more than one inode in a block, but
it will require file system specific code.  It is purely optional,
however.

Please note the changes to update_time() and the new write_time() inode
operations functions, which impact btrfs and xfs.  The changes are
fairly simple, but I would appreciate confirmation from the btrfs and
xfs teams that I got things right.   Thanks!!

Any objections to my carrying these patches in the ext4 git tree?

Changes since -v1:
   - Added explanatory comments in update_time() regarding i_ts_dirty_day
   - Fix type used for days_since_boot
   - Improve SMP scalability in update_time and ext4_update_other_inodes_time
   - Added tracepoints to help test and characterize how often and under
 what circumstances inodes have their timestamps lazily updated

Theodore Ts'o (5):
  fs: split update_time() into update_time() and write_time()
  vfs: add support for a lazytime mount option
  vfs: don't let the dirty time inodes get more than a day stale
  vfs: add lazytime tracepoints for better debugging
  ext4: add support for a lazytime mount option

 fs/btrfs/inode.c|  10 +
 fs/ext4/inode.c |  48 ++--
 fs/ext4/super.c |   9 
 fs/fs-writeback.c   |  42 +-
 fs/inode.c  | 104 +++-
 fs/proc_namespace.c |   1 +
 fs/sync.c   |   7 +++
 fs/xfs/xfs_iops.c   |  39 +++--
 include/linux/fs.h  |   5 +++
 include/trace/events/ext4.h |  29 
 include/trace/events/fs.h   |  56 
 include/uapi/linux/fs.h |   1 +
 12 files changed, 313 insertions(+), 38 deletions(-)
 create mode 100644 include/trace/events/fs.h

-- 
2.1.0



Re: BTRFS messes up snapshot LV with origin

2014-11-22 Thread Goffredo Baroncelli
On 11/21/2014 05:28 AM, Zygo Blaxell wrote:
> e.g. if an ext4 filesystem explodes, I can:
>
>   1.  make a LVM snapshot of the broken filesystem
>
>   2.  run e2fsck on the snapshot
>
>   3.  mount and repair the snapshot, e.g. rsync any missing files
>   from backups, salvage anything that survived
>
>   4.  LVM merge the snapshot to its origin volume
>
>   5.  umount the origin volume and mount the merged volume
>   (or just reboot)
>
> ...and I can do all of this on a running system, in-place, with only a
> few minutes of downtime in the must-reboot case.
>
> None of the above works with btrfs at all.  Multi-device btrfs fails
> at 2,

You can't compare ext4 with btrfs if you are talking about a multi-device
filesystem: ext4 doesn't have this capability.
Try making an md-raid array over snapshotted logical volume(s); I have never tried
that, but I suppose the same problems would appear...

> and mounting the filesystem fails at 3.
Are you sure?

ghigo@venice:/tmp$ # create a btrfs filesystem in a logical volume
ghigo@venice:/tmp$ sudo truncate -s +10G disk.img
ghigo@venice:/tmp$ sudo losetup -f disk.img 
ghigo@venice:/tmp$ sudo pvcreate /dev/loop0 
ghigo@venice:/tmp$ sudo vgcreate vgtest /dev/loop0 
ghigo@venice:/tmp$ sudo lvcreate -n lvone -L 3G vgtest
ghigo@venice:/tmp$ sudo mkfs.btrfs /dev/vgtest/lvone 
ghigo@venice:/tmp$ mkdir t

ghigo@venice:/tmp$ # create a file inside a btrfs fs
ghigo@venice:/tmp$ sudo mount /dev/vgtest/lvone t/
ghigo@venice:/tmp$ sudo dd if=/dev/zero of=t/disk-orig bs=1M count=1
ghigo@venice:/tmp$ sudo umount t

ghigo@venice:/tmp$ # make a lvm snapshot and add a 2nd file
ghigo@venice:/tmp$ sudo lvcreate -s -n lvone_snap -L 3G vgtest/lvone
ghigo@venice:/tmp$ sudo mount /dev/vgtest/lvone_snap t/
ghigo@venice:/tmp$ sudo dd if=/dev/zero of=t/disk-snap bs=1M count=1
ghigo@venice:/tmp$ sudo umount t

ghigo@venice:/tmp$ # mount the first one lv, and check the file
ghigo@venice:/tmp$ sudo mount /dev/vgtest/lvone t/
ghigo@venice:/tmp$ ls -l t
total 1024
-rw-r--r-- 1 root root 1048576 Nov 22 18:11 disk-orig
ghigo@venice:/tmp$ sudo umount t

ghigo@venice:/tmp$ # mount the snapshot lv, and check the files
ghigo@venice:/tmp$ sudo mount /dev/vgtest/lvone_snap t/
ghigo@venice:/tmp$ ls -l t
total 2048
-rw-r--r-- 1 root root 1048576 Nov 22 18:11 disk-orig
-rw-r--r-- 1 root root 1048576 Nov 22 18:12 disk-snap

Based on the example above, BTRFS seems to me to work properly when you
mount a single-disk filesystem. You only have to take care not to
mount the two filesystems at the same time.

BR
G.Baroncelli


-- 
gpg @keyserver.linux.it: Goffredo Baroncelli kreijackATinwind.it
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Volume/subvolume UUID uniqueness, was: BTRFS messes up snapshot LV with origin

2014-11-22 Thread Chris Murphy
I don't know how to fix this but I've convinced myself there's at
least a small problem. And not just with LVM snapshots as in the
originating thread.

- Via seed device method of creating a Btrfs volume, the resulting
volume gets a new UUID. The volume UUID from the seed device doesn't
pass through, is not inherited / copied. Therefore there's already
recognition that snapshotting a Btrfs volume, which is what volume
creation from a seed device effectively is, should result in the new
volume getting a new UUID.

Therefore it seems reasonable that a mechanism is needed to give a volume
a new UUID when an LVM snapshot is taken. Maybe leveraging the existing
seed code can help: consider the existing volume data a virtual seed
device, and the remaining free space a virtual added device, to
enable changing the volume UUID rather than rewriting possibly piles of
UUIDs.


- While the seed device method of creating a Btrfs volume results in a
new volume UUID, subvolume UUIDs from the seed pass through to the new
volume. Since I can create many new volumes from one seed device, in
effect I'm creating many instances of subvolumes with identical UUIDs
that can no longer be differentiated, locally or remotely. This
seems to be a much bigger problem than the LVM case, since it occurs
with only Btrfs tools being used.

The grandiose idea of UUIDs is persistence in identifying a specific
object/resource for all time, anywhere in the universe. Reducing this
to something practical, it should enable a way to identify an object
or resource within one or two human lifetimes, within our solar
system. Yet the current implementation has broken this on a much
shorter time scale, on a single computer.

We recognize that subvolume snapshots should get new subvolume UUIDs,
and that volume snapshots created via the seed device method get new
volume UUIDs. A volume snapshot of course also snapshots the
subvolumes, so the subvolume UUIDs can't pass through the way
they do right now. It's not correct behavior.

Another matter is what to do with parent uuid and snapshot
relationship metadata in the new volume. Assuming all subvolumes get new
UUIDs on the new volume, there are three possibilities:
1. parent uuid is always blank, no relationships between subvolumes are preserved
2. parent uuid is the uuid of its identical mirror (the original) in
the seed device.
3. parent uuid is the new uuid of its relative parent on the current
new volume, preserving relationships between subvolumes and snapshots.
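
As a rough illustration of option 3, here is a small Python sketch. It is
not btrfs code: the Subvol record and the remap_uuids helper are
hypothetical names made up for this example, and a real implementation
would have to rewrite on-disk items. It only shows that every subvolume
can receive a fresh UUID while parent links are translated to the new
UUIDs.

```python
import uuid
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Subvol:
    # Hypothetical in-memory record; not a btrfs on-disk structure.
    name: str
    uuid: str
    parent_uuid: Optional[str]  # None means "no parent recorded"


def remap_uuids(seed_subvols):
    """Option 3: give every subvolume a new UUID and point parent_uuid
    at the parent's *new* UUID, so snapshot relationships survive."""
    new_id = {sv.uuid: str(uuid.uuid4()) for sv in seed_subvols}
    return [
        Subvol(
            name=sv.name,
            uuid=new_id[sv.uuid],
            # A parent that is not part of the seed volume cannot be
            # translated, so it is left blank (as in option 1).
            parent_uuid=new_id.get(sv.parent_uuid),
        )
        for sv in seed_subvols
    ]
```

Option 2 would instead keep the old UUID of each subvolume as the new
parent_uuid; the mapping table is the same, only the value stored differs.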

I think any of those three is better than UUID duplication (recycling
actually). Maybe I'm not thinking of a use case for preserving these
UUIDs but at the moment I think it's specious. We can't be attached to
specific UUIDs; the instant a subvolume is effectively snapshotted by LVM
or Btrfs seed device, it's a unique object/resource, and should have
its own URN. After all, by default these objects are read/write. Maybe
if by default they were readonly I could be convinced of the validity
of UUID preservation.


-- 
Chris Murphy


Re: [PATCH v2 5/5] btrfs: enable swap file support

2014-11-22 Thread Omar Sandoval
On Fri, Nov 21, 2014 at 07:00:45PM +0100, David Sterba wrote:
  +   pr_err("BTRFS: swapfile has holes");
  +   ret = -EINVAL;
  +   goto out;
  +   }
  +   if (em->block_start == EXTENT_MAP_INLINE) {
  +   pr_err("BTRFS: swapfile is inline");
 
 While the test is valid, this would mean that the file is smaller than
 the inline limit, which is now one page. I think the generic swap code
 would refuse such a small file anyway.
 
Sure. This test doesn't really cost us anything, so I think I'd feel a little
better just leaving it in. I'll add a comment for the next close reader.

Besides that and Filipe's response, I'll address everything you mentioned here
and in your other email in the next version, thanks.

  +   ret = -EINVAL;
  +   goto out;
  +   }
  +   if (test_bit(EXTENT_FLAG_COMPRESSED, &em->flags)) {
  +   pr_err("BTRFS: swapfile is compressed");
  +   ret = -EINVAL;
  +   goto out;
  +   }
 
 I think the preallocated extents should be refused as well. This means
 the filesystem has enough space to hold the data but it would still have
 to go through the allocation and could in turn stress the memory
 management code that triggered the swapping activity in the first place.
 
 Though it's probably still possible to reach such corner case even with
 fully allocated nodatacow file, this should be reviewed anyway.
 
I'll definitely take a closer look at this. In particular,
btrfs_get_blocks_direct and btrfs_get_extent do allocations in some cases which
I'll look into.

-- 
Omar


Re: scrub implies failing drive - smartctl blissfully unaware

2014-11-22 Thread Phillip Susi

On 11/21/2014 04:12 PM, Robert White wrote:
 Here's a bug from 2005 of someone having a problem with the ACPI
 IDE support...

That is not ACPI emulation.  ACPI is not used to access the disk,
but rather it has hooks that give it a chance to diddle with the disk
to do things like configure it to lie about its maximum size, or issue
a security unlock during suspend/resume.

 People debating the merits of the ACPI IDE drivers in 2005.

No... that's not a debate at all; it is one guy asking if he should
use IDE or ACPI mode... someone who again meant AHCI and typed the
wrong acronym.

 Even when you get me for referencing windows, you're still 
 wrong...
 
 How many times will you try to get out of being hideously horribly
 wrong about ACPI supporting disk/storage IO? It is neither recent
 nor rare.
 
 How much egg does your face really need before you just see that
 your fantasy that it's new and uncommon is a delusional mistake?

Project much?

It seems I've proven just about everything I originally said you got
wrong now so hopefully we can be done.



Re: Fixing Btrfs Filesystem Full Problems typo?

2014-11-22 Thread Marc MERLIN
+btrfs list so that someone can correct me if I'm wrong.

On Sat, Nov 22, 2014 at 09:34:59PM +0100, Patrik Lundquist wrote:
 Hi,
 
 I was scratching my head over a failing btrfs balance and read your
 very informative
 http://marc.merlins.org/perso/btrfs/post_2014-05-04_Fixing-Btrfs-Filesystem-Full-Problems.html,
 but shouldn't
 
 I can ask balance to rewrite all chunks that are more than 55% full
 
 be
 
 I can ask balance to rewrite all chunks that are less than 55% full?

This one hurts my brain every time I think about it :)

So, the bigger the -dusage number, the more work btrfs has to do.

-dusage=0 does almost nothing
-dusage=100 effectively rebalances everything

But saying "less than 95% full" for -dusage=95 would mean
rebalancing everything that isn't almost full, so I'm not sure it makes
sense either (I would think you'd want to rebalance full blocks first).

The logical wording would be "less than 95% space free".

I'll update my page since this is what makes the most sense.

Now, just to be sure, if I'm getting this right, if your filesystem is
55% full, you could rebalance all blocks that have less than 55% space
free, and use -dusage=55

Does that sound right?

Marc
-- 
A mouse is a device used to point at the xterm you want to type in - A.S.R.
Microsoft is to operating systems 
   what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/ | PGP 1024R/763BE901


Re: Volume/subvolume UUID uniqueness, was: BTRFS messes up snapshot LV with origin

2014-11-22 Thread Robert White

On 11/22/2014 02:50 PM, Robert White wrote:

 Take a couple snapshots of a subvolume, and then
 send those subvolumes to another file system with send/receive, and then
 do btrfs subvolume list -u -q on the two filesystems and tell me that
 mess makes sense. Or try to recreate a subvolume from its snapshot in a
 way that doesn't shatter the relationships in your backup scheme. (I'm
 researching for a couple patches but I'm not expecting a warm reception
 given the silence to date).


(ASIDE: In particular, use btrfs send -c SNAP1 SNAP2 and then
btrfs send -c SNAP2 SNAP3, etc., before doing the btrfs sub list -u -q to
view the mess I speak of.)




Re: Fixing Btrfs Filesystem Full Problems typo?

2014-11-22 Thread Patrik Lundquist
On 22 November 2014 at 23:26, Marc MERLIN m...@merlins.org wrote:

 This one hurts my brain every time I think about it :)

I'm new to Btrfs so I may very well be wrong, since I haven't really
read up on it. :-)


 So, the bigger the -dusage number, the more work btrfs has to do.

Agreed.


 -dusage=0 does almost nothing
 -dusage=100 effectively rebalances everything

And -dusage=0 effectively reclaims empty chunks, right?


 But saying "less than 95% full" for -dusage=95 would mean
 rebalancing everything that isn't almost full,

But isn't that what rebalance does? Rewriting chunks <=95% full to
completely full chunks, effectively defragmenting chunks and most
likely reducing the number of chunks.

A -dusage=0 rebalance reduced my number of chunks from 1173 to 998 and
dev_item.bytes_used went from 1593466421248 to 1491460947968.


 Now, just to be sure, if I'm getting this right, if your filesystem is
 55% full, you could rebalance all blocks that have less than 55% space
 free, and use -dusage=55

I realize that I interpret the usage parameter as operating on blocks
(chunks? are they the same in this case?) that are <= 55% full while
you interpret it as <= 55% free.

Which is correct?


Re: Fixing Btrfs Filesystem Full Problems typo?

2014-11-22 Thread Marc MERLIN
On Sun, Nov 23, 2014 at 12:26:38AM +0100, Patrik Lundquist wrote:
 I realize that I interpret the usage parameter as operating on blocks
 (chunks? are they the same in this case?) that are <= 55% full while
 you interpret it as <= 55% free.
 
 Which is correct?

I will let someone else answer because I'm not 100% certain anymore.

Marc
-- 
A mouse is a device used to point at the xterm you want to type in - A.S.R.
Microsoft is to operating systems 
   what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/ | PGP 1024R/763BE901


Re: Fixing Btrfs Filesystem Full Problems typo?

2014-11-22 Thread Hugo Mills
On Sun, Nov 23, 2014 at 12:26:38AM +0100, Patrik Lundquist wrote:
 On 22 November 2014 at 23:26, Marc MERLIN m...@merlins.org wrote:
 
  This one hurts my brain every time I think about it :)
 
 I'm new to Btrfs so I may very well be wrong, since I haven't really
 read up on it. :-)
 
 
  So, the bigger the -dusage number, the more work btrfs has to do.
 
 Agreed.
 
 
  -dusage=0 does almost nothing
  -dusage=100 effectively rebalances everything
 
 And -dusage=0 effectively reclaims empty chunks, right?
 
 
  But saying "less than 95% full" for -dusage=95 would mean
  rebalancing everything that isn't almost full,
 
 But isn't that what rebalance does? Rewriting chunks <=95% full to
 completely full chunks, effectively defragmenting chunks and most
 likely reducing the number of chunks.
 
 A -dusage=0 rebalance reduced my number of chunks from 1173 to 998 and
 dev_item.bytes_used went from 1593466421248 to 1491460947968.
 
 
  Now, just to be sure, if I'm getting this right, if your filesystem is
  55% full, you could rebalance all blocks that have less than 55% space
  free, and use -dusage=55
 
 I realize that I interpret the usage parameter as operating on blocks
 (chunks? are they the same in this case?) that are <= 55% full while
 you interpret it as <= 55% free.
 
 Which is correct?

   Less than or equal to 55% full.

   0 gives you less than or equal to 0% full -- i.e. the empty block
groups. 100 gives you less than or equal to 100% full, i.e. all block
groups.
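
   To make the selection rule concrete, here is a minimal Python sketch.
It models a block group as a (used_bytes, length_bytes) tuple, which is an
assumption made for this example, and it follows the "less than or equal
to" wording above; the exact boundary comparison in the kernel may differ.

```python
def selected_by_usage(block_groups, threshold_pct):
    """Return the block groups that are at most `threshold_pct` percent full.

    Each block group is a (used_bytes, length_bytes) tuple. This is a toy
    model of the usage filter as described in this thread, not kernel code.
    """
    return [
        (used, length)
        for used, length in block_groups
        if used * 100 <= threshold_pct * length  # integer math, no rounding
    ]


GIB = 1024 ** 3
# Three data block groups: empty, 50% full, and completely full.
groups = [(0, GIB), (GIB // 2, GIB), (GIB, GIB)]

print(len(selected_by_usage(groups, 0)))    # only the empty one
print(len(selected_by_usage(groups, 55)))   # the empty and the 50% one
print(len(selected_by_usage(groups, 100)))  # all of them
```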

   A chunk is the part of a block group that lives on one device, so
in RAID-1, every block group is precisely two chunks; in RAID-0, every
block group is 2 or more chunks, up to the number of devices in the
FS. A chunk is usually 1 GiB in size for data and 250 MiB for
metadata, but can be smaller under some circumstances.

   Hugo.

-- 
=== Hugo Mills: hugo@... carfax.org.uk | darksatanic.net | lug.org.uk ===
  PGP key: 65E74AC0 from wwwkeys.eu.pgp.net or http://www.carfax.org.uk
  --- And what rough beast,  its hour come round at last / slouches ---  
 towards Bethlehem,  to be born? 




Best GIT repository(s) for preparing patches?

2014-11-22 Thread Robert White
Which is the best GIT repository to clone for each of the kernel support 
and btrfs-progs, for preparing a patch to submit to this email list?




Re: BTRFS messes up snapshot LV with origin

2014-11-22 Thread Zygo Blaxell
On Sat, Nov 22, 2014 at 06:34:38PM +0100, Goffredo Baroncelli wrote:
 On 11/21/2014 05:28 AM, Zygo Blaxell wrote:
  e.g. if an ext4 filesystem explodes, I can:
  
  1.  make a LVM snapshot of the broken filesystem
  
  2.  run e2fsck on the snapshot
  
  3.  mount and repair the snapshot, e.g. rsync any missing files
  from backups, salvage anything that survived
  
  4.  LVM merge the snapshot to its origin volume
  
  5.  umount the origin volume and mount the merged volume
  (or just reboot)
  
  ...and I can do all of this on a running system, in-place, with only a
  few minutes of downtime in the must-reboot case.
  
  None of the above works with btrfs at all.  Multi-device btrfs fails
  at 2, 
 
 You can't compare ext4 with btrfs if you are talking about a multi-device
 filesystem: ext4 doesn't have this capability.

btrfs fails this comparison as a single-device filesystem.

 Try building an md-raid array over snapshotted logical volume(s); I have never
 tried that, but I suppose that the same problems would appear...

md-raid works as long as you specify the devices, and because it's always
the lowest layer it can ignore LVs (snapshot or otherwise).  It's also
not a particularly common use case, while making an LV snapshot of a
filesystem is a typical use case.

  and mounting the filesystem fails at 3.  
 Are you sure ?

Yes, I'm sure.  I've had to replace filesystems destroyed this way.

[working instance snipped]

 Based on the example above, BTRFS seems to me to work properly when you
 mount a single-disk filesystem. You only have to take care not to
 mount the two filesystems at the same time.

The problem is btrfs stops searching when it sees one disk with each UUID,
so the set of disks (snapshot vs origin) that you get is *random*.
For a pair of origin + snapshots, there's a 50% chance it works, 50%
chance it eats your data.
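
A toy model of that behaviour, in Python. This is not the kernel's device
scanning code; pick_device and the UUID value are hypothetical, and the
device paths are borrowed from the example earlier in the thread. It only
shows why the result depends on scan order once two block devices carry
the same filesystem UUID.

```python
def pick_device(scan_order, fsid):
    """Return the first scanned device carrying `fsid`, then stop looking.

    `scan_order` is a list of (path, fsid) tuples in the order the devices
    happened to be scanned. Later devices with the same fsid are ignored.
    """
    for path, dev_fsid in scan_order:
        if dev_fsid == fsid:
            return path
    return None


FSID = "11111111-2222-3333-4444-555555555555"  # made-up UUID for the example
origin = ("/dev/vgtest/lvone", FSID)
snapshot = ("/dev/vgtest/lvone_snap", FSID)  # an LVM snapshot keeps the UUID

# Same request, different scan order, different device.
print(pick_device([origin, snapshot], FSID))
print(pick_device([snapshot, origin], FSID))
```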

 BR
 G.Baroncelli
 
 
 -- 
 gpg @keyserver.linux.it: Goffredo Baroncelli kreijackATinwind.it
 Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5




Re: Fixing Btrfs Filesystem Full Problems typo?

2014-11-22 Thread Marc MERLIN
On Sun, Nov 23, 2014 at 12:05:04AM +, Hugo Mills wrote:
  Which is correct?
 
 Less than or equal to 55% full.
 
This confuses me. Does that mean that the fullest blocks do not get
rebalanced?
I guess I was under the mistaken impression that the more data you had the
more you could be out of balance.

 A chunk is the part of a block group that lives on one device, so
 in RAID-1, every block group is precisely two chunks; in RAID-0, every
 block group is 2 or more chunks, up to the number of devices in the
 FS. A chunk is usually 1 GiB in size for data and 250 MiB for
 metadata, but can be smaller under some circumstances.

Right. So, why would you rebalance empty chunks or near empty chunks?
Don't you want to rebalance almost full chunks first, and work your way to
less and less full as needed?

Thanks,
Marc
-- 
A mouse is a device used to point at the xterm you want to type in - A.S.R.
Microsoft is to operating systems 
   what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/  




Re: open_ctree problem

2014-11-22 Thread Russell Coker
Strangely, I repeated the same process on the same system (btrfs-zero-log and
mount read-only) and it worked.  While it's a concern that repeating the same
process gives different results, it's nice that I'm getting all my data back.

On Sun, 23 Nov 2014, Russell Coker russ...@coker.com.au wrote:
 I have a workstation running Linux 3.14.something on a 120G SSD.  It
 recently had a problem and now the root filesystem can't be mounted, here
 is the message I get when trying to mount it read-only on Debian kernel
 3.16.2-3:
 
 [4703937.784447] BTRFS info (device loop0): disk space caching is enabled
 [4703938.754247] BTRFS: log replay required on RO media
 [4703938.794148] BTRFS: open_ctree failed
 
 When I tried to boot it normally it gave a lot of kernel messages and
 failed to mount it.
 
 Here's the error I get from the btrfs-zero-log in btrfs-tools
 0.19+20130501-1:
 
 # btrfs-zero-log yayia-corrupt
 extent buffer leak: start 157263929344 len 4096
 *** Error in `btrfs-zero-log': corrupted double-linked list:
 0x01068960 ***
 Aborted
 
 I installed btrfs-tools 3.17-1 and then btrfs-zero-log ran without error. 
 But when I tried to mount the filesystem I got the attached kernel error
 when trying to mount with Debian kernel 3.16.2-3.
 
 Any suggestions on what I should do next?


-- 
My Main Blog http://etbe.coker.com.au/
My Documents Blog http://doc.coker.com.au/


Re: Fixing Btrfs Filesystem Full Problems typo?

2014-11-22 Thread Duncan
Marc MERLIN posted on Sat, 22 Nov 2014 17:07:42 -0800 as excerpted:

 On Sun, Nov 23, 2014 at 12:05:04AM +, Hugo Mills wrote:
  Which is correct?
 
 Less than or equal to 55% full.
  
 This confuses me. Does that mean that the fullest blocks do not get
 rebalanced?

Yes. =:^)

 I guess I was under the mistaken impression that the more data you had
 the more you could be out of balance.

What you were thinking is a misstatement of the situation, so yes, again, 
that was a mistaken impression. =:^)

 A chunk is the part of a block group that lives on one device, so
 in RAID-1, every block group is precisely two chunks; in RAID-0, every
 block group is 2 or more chunks, up to the number of devices in the FS.
 A chunk is usually 1 GiB in size for data and 250 MiB for metadata, but
 can be smaller under some circumstances.
 
 Right. So, why would you rebalance empty chunks or near empty chunks?
 Don't you want to rebalance almost full chunks first, and work your way
 to less and less full as needed?

No, the closer to empty a chunk is, the more effect you can get in 
rebalancing it along with others of the same fullness.


Think of it this way.

One goal of a rebalance, the goal we have when data and metadata are 
unbalanced and we're hitting ENOSPC as a result (as opposed to the goal 
of converting or balancing among devices when one has just been added or 
removed), and thus the goal that the usage filter is designed to help 
achieve, is this: Free excess chunk-allocated but chunk-empty space back to 
unallocated, so it can be used by the other type, data or metadata.

More specifically, all available space has been allocated to data and 
metadata chunks leaving no space available to allocate more chunks, and 
one of two extremes has been reached, we'll call them D and M:

(

D1: All data chunks are full and more need to be allocated, but they 
can't be as there's no more unallocated space to allocate the new data 
chunks from, 

*AND* 

D2: There's a whole bunch of excess metadata chunks allocated, using up 
all that unallocated space, but they're mostly empty, and need to be 
rebalanced to consolidate usage into fewer but fuller metadata chunks, 
thus freeing the space currently taken by all those mostly empty metadata 
chunks.

)

*OR* the reverse:

(

M1: All metadata chunks are full and more need to be allocated, but they 
can't be as there's no more unallocated space to allocate the new 
metadata chunks from,

*AND*

M2: There's a whole bunch of excess data chunks allocated, using up all 
the unallocated space, but they're mostly empty, and need to be 
rebalanced to consolidate usage into fewer but fuller data chunks, thus 
freeing the space currently taken by all those mostly empty data chunks.

)


In both cases, the one type is full and needs more allocation, but the 
other type is hogging all the space with mostly empty chunks.  In both 
cases, then, you *DON'T* want to bother with the full type, since it's 
full and rewriting it won't do anything but shuffle the full chunks 
around -- you can't combine any because they're all full.

In both cases, what you *WANT* to do is deal with the EMPTY type, the 
chunks that are hogging all the space but not actually using it.

This is evidently a bit counterintuitive on first glance as you're not 
the first to have problems with it, but it /is/ the case, and once you 
understand what's actually happening and why, it /does/ make sense.

More specifically, in the D case, where all /data/ chunks are full, you 
want to rebalance the mostly empty /metadata/ chunks, combining for 
example 5 near 20% full metadata chunks into a single near 100% full 
metadata chunk, deallocating the other four metadata chunks (instead of 
rewriting empty chunks) once there's nothing in them at all.  Five just 
became one, freeing four to unallocated space, which can now be used to 
allocate new data chunks.

And the reverse in the M case, where all metadata chunks are full.  Here, 
you want to rebalance the mostly empty data chunks, again combining say 
five 20% usage data chunks into a single 100% usage data chunk, 
deallocating the other four data chunks once there's nothing in them at 
all.  Again, five just became one, freeing four to unallocated space, 
which now can be used to allocate new, in this case, metadata chunks.
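
The arithmetic in those two examples can be written down as a short Python 
sketch.  It is idealized: chunks_freed is a name made up here, all chunks 
are assumed to be the same size and equally full, and the contents are 
assumed to pack perfectly, which a real balance won't always manage.

```python
def chunks_freed(count, percent_full):
    """Chunks returned to unallocated space when `count` equally full
    chunks are repacked into as few completely full chunks as possible."""
    total = count * percent_full   # total contents, in percent-of-a-chunk
    needed = -(-total // 100)      # ceiling division: chunks still required
    return count - needed


print(chunks_freed(5, 20))    # 4: five 20% full chunks become one
print(chunks_freed(20, 95))   # 1: twenty 95% full chunks become nineteen
print(chunks_freed(10, 100))  # 0: full chunks cannot be combined
```

The emptier the chunks, the more of them each rewrite gives back.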


Thus the goal is to rebalance the nearly /empty/ chunks of the *OPPOSITE* 
type to the one you're running short on, combining multiple nearly empty 
chunks of the type you have too many of, thus freeing that empty space 
back to unallocated, so the type that you're actually short on can 
actually allocate chunks from the just freed to unallocated space.
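
As a sketch of that decision in Python: the function name and the two 
thresholds below are hypothetical, chosen only for illustration, and the 
inputs would have to come from something like btrfs filesystem df.  The 
returned strings are the names of the balance usage filters, -dusage for 
data and -musage for metadata.

```python
def filter_to_try(unallocated, data, metadata,
                  nearly_full=0.95, mostly_empty=0.5):
    """Pick the usage filter for the *opposite* type to the one that is full.

    `data` and `metadata` are (used_bytes, allocated_bytes) tuples.
    Returns "-musage", "-dusage", or None if neither case D nor M applies.
    The thresholds are illustrative assumptions, not btrfs defaults.
    """
    if unallocated > 0:
        return None  # new chunks can still be allocated normally

    def ratio(pair):
        used, allocated = pair
        return used / allocated if allocated else 1.0

    d, m = ratio(data), ratio(metadata)
    if d >= nearly_full and m <= mostly_empty:
        return "-musage"  # case D: data full, metadata chunks mostly empty
    if m >= nearly_full and d <= mostly_empty:
        return "-dusage"  # case M: metadata full, data chunks mostly empty
    return None


print(filter_to_try(0, data=(99, 100), metadata=(10, 100)))  # -musage
print(filter_to_try(0, data=(10, 100), metadata=(99, 100)))  # -dusage
```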

That being the goal, working with the full chunks won't get you much.  
Suppose you work with the 95% full chunks, 5% empty.  You'll have to 
rewrite *TWENTY* of them to combine all those 5% empties to free just 
*ONE* chunk!  And rewriting 100% full chunks won't get you anything at 
all toward this