Re: BTRFS as image store for KVM?

2015-10-05 Thread Roman Mamedov
On Mon, 05 Oct 2015 11:43:18 +0300
Erkki Seppala  wrote:

> Lionel Bouton  writes:
> 
> > 1/ AFAIK the kernel md RAID1 code behaves the same (last time I checked
> > you need 2 processes to read from 2 devices at once) and I've never seen
> > anyone arguing that the current md code is unstable.
> 
> This indeed seems to be the case on my MD RAID1 HDD.

The key difference is that MD is smart enough to distribute reads to the least
loaded devices in a RAID (i.e. two processes reading from a two-HDD RAID1 will
~always use different HDDs, no matter what their PIDs are).
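
A minimal sketch of the contrast described here, in plain user-space C
(purely illustrative, not the actual md or btrfs kernel code; the struct
and function names are made up):

#include <sys/types.h>

struct mirror {
	int pending_ios;	/* requests currently queued to this leg */
};

/* btrfs-style choice, as described in this thread: depends only on
 * which process is reading, not on how busy each leg is */
int pick_mirror_by_pid(pid_t pid, int num_mirrors)
{
	return pid % num_mirrors;
}

/* md-style choice (simplified): route the request to the less busy leg */
int pick_mirror_by_load(const struct mirror *m, int num_mirrors)
{
	int best = 0;

	for (int i = 1; i < num_mirrors; i++)
		if (m[i].pending_ios < m[best].pending_ios)
			best = i;
	return best;
}

With the first policy, two readers whose PIDs share parity pile onto the
same device; with the second, they naturally spread out.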

Another example of the differences was that in my limited testing I saw Btrfs
submit writes sequentially on RAID1: according to 'iostat', there were long
periods of only one drive writing, and then only the other one. The result was
the RAID's write performance ended up being lower than that of a single drive.

-- 
With respect,
Roman




Re: [PATCH 16/23] Btrfs: device path change must be logged

2015-10-05 Thread Anand Jain


Hi David,

 Kindly note: this is accepted and sent out as part of the patch set
 whose patches depend on each other.

Thanks, Anand

On 10/01/2015 09:47 PM, Anand Jain wrote:



On 10/01/2015 09:40 PM, David Sterba wrote:

On Fri, Aug 14, 2015 at 06:33:01PM +0800, Anand Jain wrote:

From the issue-diagnosability point of view, log when the device path is
changed.

Signed-off-by: Anand Jain 
---
  fs/btrfs/volumes.c | 4 ++++
  1 file changed, 4 insertions(+)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index ebf37a9..dcb10fa 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -595,6 +595,10 @@ static noinline int device_list_add(const char *path,



  return -EEXIST;
  }

+	printk_in_rcu(KERN_INFO
+		"BTRFS: device fsid %pU devid %llu old path %s new path %s\n",
+		disk_super->fsid, devid, rcu_str_deref(device->name), path);


I don't think that the message should be put into device_list_add. Its
only callsite in btrfs_scan_one_device prints some device info messages
so it would be better to extend and use the return value if possible.



  Right. will fix it.

Thanks, Anand



--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html




[PATCH] Btrfs: optimize check for stale device

2015-10-05 Thread Anand Jain
Optimize the stale-device check so that it runs only when a device is
added or changed. If there is no update to the device, there is no need
to call btrfs_free_stale_device().

Signed-off-by: Anand Jain 
---
 fs/btrfs/volumes.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index b8b1171..ec1fcfa 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -620,7 +620,8 @@ static noinline int device_list_add(const char *path,
 * if there is new btrfs on an already registered device,
 * then remove the stale device entry.
 */
-   btrfs_free_stale_device(device);
+   if (ret > 0)
+   btrfs_free_stale_device(device);
 
*fs_devices_ret = fs_devices;
 
-- 
2.4.1

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: BTRFS as image store for KVM?

2015-10-05 Thread Duncan
Rich Freeman posted on Sun, 04 Oct 2015 08:21:53 -0400 as excerpted:

> On Sun, Oct 4, 2015 at 8:03 AM, Lionel Bouton
>  wrote:
>>
>> This focus on single reader RAID1 performance surprises me.
>>
>> 1/ AFAIK the kernel md RAID1 code behaves the same (last time I checked
>> you need 2 processes to read from 2 devices at once) and I've never
>> seen anyone arguing that the current md code is unstable.

I'm not a coder and could be wrong, but AFAIK, md/raid1 either works per 
thread (thus should multiplex I/O across raid1 devices in single-process-
multi-thread), or handles multiple AIO requests in parallel, if not both.

(If I'm laboring under a severe misconception, and I could be, please do 
correct me -- I'll rather be publicly corrected and have just that, my 
world-view corrected to align with reality, than be wrong, publicly or 
privately, and never know it, thus never correcting it!  =:^)

IOW, the primary case where I believe md/raid1 does single-device serial 
access, is where the single process is doing just that, serialized-single-
request-sleep-until-the-data's-ready.  Otherwise read requests are spread 
among the available spindles.  =:^)

But...

> Perhaps, but with btrfs it wouldn't be hard to get 1000 processes
> reading from a raid1 in btrfs and have every single request directed to
> the same disk with the other disk remaining completely idle.  I believe
> the algorithm is just whether the pid is even or odd, and doesn't take
> into account disk activity at all, let alone disk performance or
> anything more sophisticated than that.
> 
> I'm sure md does a better job than that.

Exactly.  Even/odd PID scheduling is great for testing, since it's simple 
enough to load either side exclusively or both sides exactly evenly, but 
it's absolutely horrible for multi-task, since worst-case single-device-
bottleneck is all too easy to achieve by accident, and even pure-random 
distribution is going to favor one side or the other to some extent, most 
of the time.

Even worse, due to the most-remaining-free-space chunk allocation 
algorithm and pair-mirroring only, no matter the number of devices, try 
to use 3+ devices of differing sizes, and until the space-available on 
the largest pair reaches that of the others, that largest pair will get 
the allocations.  Consider a bunch of quarter-TiB devices in raid1, with 
a pair of 2 TiB devices as well.  The quarter-TiB devices will remain 
idle until the pair of 2 TiB devices reach 1.75 TiB full, thus equalizing 
the space available on each compared to the other devices on the 
filesystem.  Of course, that means reads too, are going to be tied to 
only those two devices, for anything in that first 1.75 TiB of data, and 
if all those reads are from even or all from odd PIDs, it's only going to 
be ONE of... perhaps 10 devices! Possibly hundreds of read threads 
bottlenecking on a single device of ten, while the other 9/10 of the 
filesystem-array remains entirely idle! =:^(
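
To make that scenario concrete, here is a small stand-alone C simulation
of the "allocate raid1 chunks on the two devices with the most free
space" policy (an illustrative model only, not btrfs code; the device
sizes and the 1 GiB chunk size are assumptions):

#include <stdio.h>

#define NDEV 6

int main(void)
{
	/* free space in GiB: four 0.25 TiB devices and a pair of 2 TiB devices */
	double free_gib[NDEV] = { 256, 256, 256, 256, 2048, 2048 };
	double chunk_gib = 1.0;

	for (int n = 0; n < 2000; n++) {
		int a = 0, b = 1;	/* indices of the two emptiest devices */

		for (int i = 0; i < NDEV; i++) {
			if (free_gib[i] > free_gib[a]) { b = a; a = i; }
			else if (i != a && free_gib[i] > free_gib[b]) b = i;
		}
		free_gib[a] -= chunk_gib;	/* raid1: one copy on each device */
		free_gib[b] -= chunk_gib;
	}
	for (int i = 0; i < NDEV; i++)
		printf("dev %d: %.0f GiB free\n", i, free_gib[i]);
	return 0;
}

Running it shows the two big devices absorbing roughly the first 1.75 TiB
worth of chunks before the quarter-TiB devices see any allocation at all.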

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: BTRFS as image store for KVM?

2015-10-05 Thread Erkki Seppala
Lionel Bouton  writes:

> 1/ AFAIK the kernel md RAID1 code behaves the same (last time I checked
> you need 2 processes to read from 2 devices at once) and I've never seen
> anyone arguing that the current md code is unstable.

This indeed seems to be the case on my MD RAID1 HDD.

But on an MD SSD RAID10 it does use all four devices (running dd on the
md raid device and inspecting iostat at the same time).

So MD RAID1 not doing this on HDDs seems to apply to devices that
don't perform well in random-access scenarios, as you mentioned.

In practical terms I seem to be getting about 1.1 GB/s from the 4 SSDs
with 'dd', whereas I seem to be getting ~650 MB/s when I dd from the two
fastest components of the MD device at the same time. As I seem to
get 330 MB/s from two of the SSDs and 150 MB/s from the other two, it seems
the concurrent RAID10 IO is scaling linearly.

(In fact maybe I should look into why the two devices are getting lower
speeds overall - they used to be fast.)

I didn't calculate how large the linearly transferred chunks would need
to be to overcome the seek latency. Probably quite large.
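
For a rough sense of scale, a back-of-the-envelope sketch (assumed
figures only: ~10 ms average seek and ~150 MB/s sequential throughput
for an HDD):

#include <stdio.h>

int main(void)
{
	double seek_s = 0.010;		/* assumed average seek time, seconds */
	double rate_bps = 150e6;	/* assumed sequential throughput, bytes/s */
	double breakeven = seek_s * rate_bps;	/* chunk where seek == transfer */

	printf("break-even chunk: %.1f MB; want >= ~10x that (%.0f MB)\n",
	       breakeven / 1e6, 10 * breakeven / 1e6);
	return 0;
}

By that estimate the break-even chunk is around 1.5 MB, so the chunks
would need to reach tens of megabytes before alternating a single
sequential stream across two spindles clearly pays off.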

-- 
  _
 / __// /__   __   http://www.modeemi.fi/~flux/\   \
/ /_ / // // /\ \/ /\  /
   /_/  /_/ \___/ /_/\_\@modeemi.fi  \/

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH V2 3/3] Btrfs: device path change must be logged

2015-10-05 Thread Anand Jain
From the issue-diagnosability point of view, log when the device path is
changed.

Signed-off-by: Anand Jain 
---
 V2: Address David's review comment: add a new return value 2
 for device_list_add(), based on which the caller indicates
 whether the path was overwritten (thanks David).

 fs/btrfs/volumes.c | 7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index e76cad3..b8b1171 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -504,6 +504,7 @@ void btrfs_free_stale_device(struct btrfs_device *cur_dev)
  *
  * Returns:
  * 1   - first time device is seen
+ * 2   - device already known but now is overwritten with new path
  * 0   - device already known
  * < 0 - error
  */
@@ -603,6 +604,7 @@ static noinline int device_list_add(const char *path,
fs_devices->missing_devices--;
device->missing = 0;
}
+   ret = 2;
}
 
/*
@@ -1033,8 +1035,9 @@ int btrfs_scan_one_device(const char *path, fmode_t flags, void *holder,
 
	ret = device_list_add(path, disk_super, devid, fs_devices_ret);
	if (ret > 0) {
-		printk(KERN_INFO "BTRFS: device fsid %pU devid %llu transid %llu %s\n",
-			disk_super->fsid, devid, transid, path);
+		printk(KERN_INFO "BTRFS: device fsid %pU devid %llu transid %llu %s %s\n",
+			disk_super->fsid, devid, transid, path,
+			ret == 2 ? "(overwritten)":"");
ret = 0;
}
if (!ret && fs_devices_ret)
-- 
2.4.1

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 1/3] Btrfs: create a helper function to read the disk super

2015-10-05 Thread Anand Jain
Part of the code in btrfs_scan_one_device() is moved to a new
function, btrfs_read_disk_super(), so that the former looks cleaner;
the code that null-terminates the label is moved there as well.
Further, there is an opportunity to merge the various duplicated
pieces of code that read the disk super. An earlier attempt at this
was flagged as having some issues, for which there are multiple
versions, but it was not clear what the issue was. So until that is
worked out we can keep this in a separate function.

Signed-off-by: Anand Jain 
---
 fs/btrfs/volumes.c | 87 --
 1 file changed, 52 insertions(+), 35 deletions(-)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 1b2e40f..8f1d175 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -941,6 +941,56 @@ int btrfs_open_devices(struct btrfs_fs_devices *fs_devices,
return ret;
 }
 
+void btrfs_release_disk_super(struct page *page)
+{
+   kunmap(page);
+   page_cache_release(page);
+}
+
+int btrfs_read_disk_super(struct block_device *bdev, u64 bytenr,
+   struct page **page, struct btrfs_super_block **disk_super)
+{
+   void *p;
+   pgoff_t index;
+
+   /* make sure our super fits in the device */
+   if (bytenr + PAGE_CACHE_SIZE >= i_size_read(bdev->bd_inode))
+   return 1;
+
+   /* make sure our super fits in the page */
+   if (sizeof(**disk_super) > PAGE_CACHE_SIZE)
+   return 1;
+
+   /* make sure our super doesn't straddle pages on disk */
+   index = bytenr >> PAGE_CACHE_SHIFT;
+   if ((bytenr + sizeof(**disk_super) - 1) >> PAGE_CACHE_SHIFT != index)
+   return 1;
+
+   /* pull in the page with our super */
+   *page = read_cache_page_gfp(bdev->bd_inode->i_mapping,
+  index, GFP_NOFS);
+
+   if (IS_ERR_OR_NULL(*page))
+   return 1;
+
+   p = kmap(*page);
+
+   /* align our pointer to the offset of the super block */
+   *disk_super = p + (bytenr & ~PAGE_CACHE_MASK);
+
+   if (btrfs_super_bytenr(*disk_super) != bytenr ||
+   btrfs_super_magic(*disk_super) != BTRFS_MAGIC) {
+   btrfs_release_disk_super(*page);
+   return 1;
+   }
+
+   if ((*disk_super)->label[0] &&
+   (*disk_super)->label[BTRFS_LABEL_SIZE - 1])
+   (*disk_super)->label[BTRFS_LABEL_SIZE - 1] = '\0';
+
+   return 0;
+}
+
 /*
 * Look for a btrfs signature on a device. This may be called out of the mount path
  * and we are not allowed to call set_blocksize during the scan. The superblock
@@ -952,13 +1002,11 @@ int btrfs_scan_one_device(const char *path, fmode_t flags, void *holder,
struct btrfs_super_block *disk_super;
struct block_device *bdev;
struct page *page;
-   void *p;
int ret = -EINVAL;
u64 devid;
u64 transid;
u64 total_devices;
u64 bytenr;
-   pgoff_t index;
 
/*
 * we would like to check all the supers, but that would make
@@ -971,41 +1019,14 @@ int btrfs_scan_one_device(const char *path, fmode_t flags, void *holder,
	mutex_lock(&uuid_mutex);
 
bdev = blkdev_get_by_path(path, flags, holder);
-
if (IS_ERR(bdev)) {
ret = PTR_ERR(bdev);
goto error;
}
 
-   /* make sure our super fits in the device */
-   if (bytenr + PAGE_CACHE_SIZE >= i_size_read(bdev->bd_inode))
+	if (btrfs_read_disk_super(bdev, bytenr, &page, &disk_super))
goto error_bdev_put;
 
-   /* make sure our super fits in the page */
-   if (sizeof(*disk_super) > PAGE_CACHE_SIZE)
-   goto error_bdev_put;
-
-   /* make sure our super doesn't straddle pages on disk */
-   index = bytenr >> PAGE_CACHE_SHIFT;
-   if ((bytenr + sizeof(*disk_super) - 1) >> PAGE_CACHE_SHIFT != index)
-   goto error_bdev_put;
-
-   /* pull in the page with our super */
-   page = read_cache_page_gfp(bdev->bd_inode->i_mapping,
-  index, GFP_NOFS);
-
-   if (IS_ERR_OR_NULL(page))
-   goto error_bdev_put;
-
-   p = kmap(page);
-
-   /* align our pointer to the offset of the super block */
-   disk_super = p + (bytenr & ~PAGE_CACHE_MASK);
-
-   if (btrfs_super_bytenr(disk_super) != bytenr ||
-   btrfs_super_magic(disk_super) != BTRFS_MAGIC)
-   goto error_unmap;
-
	devid = btrfs_stack_device_id(&disk_super->dev_item);
transid = btrfs_super_generation(disk_super);
total_devices = btrfs_super_num_devices(disk_super);
@@ -1013,8 +1034,6 @@ int btrfs_scan_one_device(const char *path, fmode_t flags, void *holder,
ret = device_list_add(path, disk_super, devid, fs_devices_ret);
if (ret > 0) {
if (disk_super->label[0]) {
-   if (disk_super->label[BTRFS_LABEL_SIZE - 1])
- 

[PATCH V2 2/3] btrfs: maintain consistency in logging to help debugging

2015-10-05 Thread Anand Jain
The optional label may or may not be set, or it might only be set at
some later time. However, when debugging by searching through the
kernel logs, scripts need the logs to be consistent, so the log search
keywords shouldn't depend on optional fields; the fsid is a better key.

Signed-off-by: Anand Jain 
---
V2: commit corrected

 fs/btrfs/volumes.c | 9 ++-------
 1 file changed, 2 insertions(+), 7 deletions(-)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 8f1d175..e76cad3 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -1033,13 +1033,8 @@ int btrfs_scan_one_device(const char *path, fmode_t flags, void *holder,
 
	ret = device_list_add(path, disk_super, devid, fs_devices_ret);
	if (ret > 0) {
-		if (disk_super->label[0]) {
-			printk(KERN_INFO "BTRFS: device label %s ", disk_super->label);
-		} else {
-			printk(KERN_INFO "BTRFS: device fsid %pU ", disk_super->fsid);
-		}
-
-		printk(KERN_CONT "devid %llu transid %llu %s\n", devid, transid, path);
+		printk(KERN_INFO "BTRFS: device fsid %pU devid %llu transid %llu %s\n",
+			disk_super->fsid, devid, transid, path);
ret = 0;
}
if (!ret && fs_devices_ret)
-- 
2.4.1

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] Btrfs: add missing brelse when superblock checksum fails

2015-10-05 Thread Anand Jain
Looks like an oversight: call brelse() when the superblock checksum
check fails. Further down the code, in the non-error path, we do call
brelse(), which is why there is no brelse() in the 'goto error...'
paths.

Signed-off-by: Anand Jain 
---
 fs/btrfs/disk-io.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 7191b32..99d0804 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -2665,6 +2665,7 @@ int open_ctree(struct super_block *sb,
if (btrfs_check_super_csum(bh->b_data)) {
printk(KERN_ERR "BTRFS: superblock checksum mismatch\n");
err = -EINVAL;
+   brelse(bh);
goto fail_alloc;
}
 
-- 
2.4.1

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v2 2/2] btrfs: qgroup: Don't copy extent buffer to do qgroup rescan

2015-10-05 Thread Qu Wenruo
Ancient qgroup code calls memcpy() on an extent buffer and uses the copy
for leaf iteration.

As an extent buffer contains locks and pointers to pages, it is never
sane to do such a copy.

The following bug may be caused by this insane operation:
[92098.841309] general protection fault:  [#1] SMP
[92098.841338] Modules linked in: ...
[92098.841814] CPU: 1 PID: 24655 Comm: kworker/u4:12 Not tainted
4.3.0-rc1 #1
[92098.841868] Workqueue: btrfs-qgroup-rescan btrfs_qgroup_rescan_helper
[btrfs]
[92098.842261] Call Trace:
[92098.842277]  [] ? read_extent_buffer+0xb8/0x110
[btrfs]
[92098.842304]  [] ? btrfs_find_all_roots+0x60/0x70
[btrfs]
[92098.842329]  []
btrfs_qgroup_rescan_worker+0x28d/0x5a0 [btrfs]

Here btrfs_qgroup_rescan_worker+0x28d is btrfs_disk_key_to_cpu(),
called while reading a key from the memcpy'd extent_buffer.

This patch reads the whole leaf into memory and uses the newly
introduced stack functions to do the qgroup rescan.
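
A stand-alone illustration of the difference (not btrfs code; the struct
and names are invented): a shallow memcpy of a descriptor copies only its
lock and page pointers, not the data they guard, so the copy silently
aliases memory it does not own, whereas copying the raw leaf bytes is
safe:

#include <stdlib.h>
#include <string.h>

struct buffer_like {
	char *pages;		/* backing memory owned by the original */
	size_t len;
};

int main(void)
{
	struct buffer_like orig = { malloc(4096), 4096 };
	struct buffer_like shallow;
	char *deep = malloc(orig.len);

	memcpy(&shallow, &orig, sizeof(orig));	/* copies only the pointer  */
	memcpy(deep, orig.pages, orig.len);	/* copies the actual bytes  */

	free(orig.pages);	/* shallow.pages now dangles; deep stays valid */
	free(deep);
	return 0;
}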

Reported-by: Stephane Lesimple 
Signed-off-by: Qu Wenruo 
---
v2:
  Follow the parameter change in previous patch.
---
 fs/btrfs/qgroup.c | 22 ++++++++++++----------
 1 file changed, 12 insertions(+), 10 deletions(-)

diff --git a/fs/btrfs/qgroup.c b/fs/btrfs/qgroup.c
index e9ace09..6a83a40 100644
--- a/fs/btrfs/qgroup.c
+++ b/fs/btrfs/qgroup.c
@@ -2183,11 +2183,11 @@ void assert_qgroups_uptodate(struct btrfs_trans_handle *trans)
  */
 static int
 qgroup_rescan_leaf(struct btrfs_fs_info *fs_info, struct btrfs_path *path,
-  struct btrfs_trans_handle *trans,
-  struct extent_buffer *scratch_leaf)
+  struct btrfs_trans_handle *trans, char *stack_leaf)
 {
struct btrfs_key found;
struct ulist *roots = NULL;
+   struct btrfs_header *header;
struct seq_list tree_mod_seq_elem = SEQ_LIST_INIT(tree_mod_seq_elem);
u64 num_bytes;
int slot;
@@ -2224,13 +2224,15 @@ qgroup_rescan_leaf(struct btrfs_fs_info *fs_info, struct btrfs_path *path,
fs_info->qgroup_rescan_progress.objectid = found.objectid + 1;
 
	btrfs_get_tree_mod_seq(fs_info, &tree_mod_seq_elem);
-   memcpy(scratch_leaf, path->nodes[0], sizeof(*scratch_leaf));
+   read_extent_buffer(path->nodes[0], stack_leaf, 0,
+  fs_info->extent_root->nodesize);
+   header = (struct btrfs_header *)stack_leaf;
slot = path->slots[0];
btrfs_release_path(path);
	mutex_unlock(&fs_info->qgroup_rescan_lock);
 
-	for (; slot < btrfs_header_nritems(scratch_leaf); ++slot) {
-		btrfs_item_key_to_cpu(scratch_leaf, &found, slot);
+	for (; slot < btrfs_stack_header_nritems(header); ++slot) {
+		btrfs_stack_item_key_to_cpu(header, &found, slot);
if (found.type != BTRFS_EXTENT_ITEM_KEY &&
found.type != BTRFS_METADATA_ITEM_KEY)
continue;
@@ -2261,15 +2263,15 @@ static void btrfs_qgroup_rescan_worker(struct btrfs_work *work)
 qgroup_rescan_work);
struct btrfs_path *path;
struct btrfs_trans_handle *trans = NULL;
-   struct extent_buffer *scratch_leaf = NULL;
+   char *stack_leaf = NULL;
int err = -ENOMEM;
int ret = 0;
 
path = btrfs_alloc_path();
if (!path)
goto out;
-   scratch_leaf = kmalloc(sizeof(*scratch_leaf), GFP_NOFS);
-   if (!scratch_leaf)
+   stack_leaf = kmalloc(fs_info->extent_root->nodesize, GFP_NOFS);
+   if (!stack_leaf)
goto out;
 
err = 0;
@@ -2283,7 +2285,7 @@ static void btrfs_qgroup_rescan_worker(struct btrfs_work *work)
err = -EINTR;
} else {
err = qgroup_rescan_leaf(fs_info, path, trans,
-scratch_leaf);
+stack_leaf);
}
if (err > 0)
btrfs_commit_transaction(trans, fs_info->fs_root);
@@ -2292,7 +2294,7 @@ static void btrfs_qgroup_rescan_worker(struct btrfs_work *work)
}
 
 out:
-   kfree(scratch_leaf);
+   kfree(stack_leaf);
btrfs_free_path(path);
 
	mutex_lock(&fs_info->qgroup_rescan_lock);
-- 
2.5.3

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v2 1/2] btrfs: Add support to do stack item key operation

2015-10-05 Thread Qu Wenruo
The normal btrfs_item_key_to_cpu() needs an extent buffer to do its work,
and there is no stack version that handles an in-memory leaf.

Add a btrfs_stack_item_key_to_cpu() function for this operation, which
provides the basis for the later qgroup fix.

Signed-off-by: Qu Wenruo 
---
v2:
  Change the char* parameter to struct btrfs_header *, as a leaf always
  starts with a header.
---
 fs/btrfs/ctree.h | 20 ++++++++++++++++++++
 1 file changed, 20 insertions(+)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 938efe3..4864dfe 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -2683,6 +2683,17 @@ static inline void btrfs_item_key(struct extent_buffer *eb,
read_eb_member(eb, item, struct btrfs_item, key, disk_key);
 }
 
+static inline void btrfs_stack_item_key(struct btrfs_header *stack_leaf,
+   struct btrfs_disk_key *disk_key,
+   int nr)
+{
+   unsigned long item_offset = btrfs_item_nr_offset(nr);
+   struct btrfs_item *item;
+
+   item = (struct btrfs_item *)(stack_leaf + item_offset);
+	memcpy(disk_key, &item->key, sizeof(*disk_key));
+}
+
 static inline void btrfs_set_item_key(struct extent_buffer *eb,
   struct btrfs_disk_key *disk_key, int nr)
 {
@@ -2785,6 +2796,15 @@ static inline void btrfs_item_key_to_cpu(struct extent_buffer *eb,
	btrfs_disk_key_to_cpu(key, &disk_key);
 }
 
+static inline void btrfs_stack_item_key_to_cpu(struct btrfs_header *stack_leaf,
+  struct btrfs_key *key,
+  int nr)
+{
+   struct btrfs_disk_key disk_key;
+	btrfs_stack_item_key(stack_leaf, &disk_key, nr);
+	btrfs_disk_key_to_cpu(key, &disk_key);
+}
+
 static inline void btrfs_dir_item_key_to_cpu(struct extent_buffer *eb,
  struct btrfs_dir_item *item,
  struct btrfs_key *key)
-- 
2.5.3

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2] fstests: generic: Check if a full fallocate will change extent number

2015-10-05 Thread Dave Chinner
On Fri, Oct 02, 2015 at 04:35:53PM +0800, Qu Wenruo wrote:
> Hi Dave, I updated the patch and moved it to btrfs.
> 
> But I still has some question about the fallocate behavior.
> 
> Just as the new btrfs test case, I changed the fallocate range, not
> to cover the last part, to make the problem more obvious:
> 
> Btrfs will truncate beyond EOF even when that's *not covered* by the
> fallocate range.
> 
> It's OK for a fs to modify the extent layout during fallocate, but
> is it OK to modify extent layout completely *out* of the fallocate
> range?

It's beyond EOF, so the filesystem can make whatever changes to
allocated blocks in that space whenever it likes because userspace
cannot access it.

In XFS, we don't remove unwritten extents beyond EOF if they were
placed there by fallocate except via explicit truncate() or
fallocate() operations (e.g. hole punch) from userspace that
manipulate that extent range beyond EOF.

What other filesytems do with blocks beyond EOF in any given
operation is up to the filesystem, really. If btrfs wants to
truncate away all extents beyond EOF when punching a hole that spans
EOF, then you can do that. It might not be what the user expects,
but blocks beyond EOF don't fall under posix_fallocate() because
it explicitly states:

"If the size of the file is less than offset+len, then the file is
increased to this size; otherwise the file size is left unchanged."

which means it can't allocate blocks beyond EOF
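
As a stand-alone illustration of that distinction (not from the original
mail; error handling is omitted and the file name is arbitrary), Linux's
fallocate(2) only leaves blocks beyond EOF when FALLOC_FL_KEEP_SIZE is
passed; the default mode moves EOF, as posix_fallocate() requires:

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void)
{
	struct stat st;
	int fd = open("testfile", O_RDWR | O_CREAT | O_TRUNC, 0644);

	ftruncate(fd, 4096);				/* EOF at 4 KiB */
	fallocate(fd, FALLOC_FL_KEEP_SIZE, 4096, 1 << 20); /* blocks beyond EOF */
	fstat(fd, &st);
	printf("after KEEP_SIZE: size=%lld\n", (long long)st.st_size); /* 4096 */

	fallocate(fd, 0, 0, 1 << 21);			/* default mode grows the file */
	fstat(fd, &st);
	printf("after default:   size=%lld\n", (long long)st.st_size); /* 2 MiB */
	close(fd);
	return 0;
}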

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: BTRFS as image store for KVM?

2015-10-05 Thread Austin S Hemmelgarn

On 2015-10-05 07:16, Lionel Bouton wrote:

Hi,

Le 04/10/2015 14:03, Lionel Bouton a écrit :

[...]
This focus on single reader RAID1 performance surprises me.

1/ AFAIK the kernel md RAID1 code behaves the same (last time I checked
you need 2 processes to read from 2 devices at once) and I've never seen
anyone arguing that the current md code is unstable.


To better illustrate my point.

According to Phoronix tests, BTRFS RAID-1 is even faster than md RAID1
most of the time.

http://www.phoronix.com/scan.php?page=article&item=btrfs_raid_mdadm&num=1

The only case where md RAID1 was noticeably faster is sequential reads
with FIO libaio.
Part of this is because BTRFS's built-in raid functionality is designed 
for COW workloads, whereas mdraid isn't.  On top of that, I would be 
willing to bet that they were using the dup profile for metadata when 
testing mdraid (which is the default when using a single disk), which 
isn't a fair comparison because it stores more data in that case than 
the BTRFS raid.


If you do the same thing with ZFS, I'd expect that you would see similar 
results (although probably with a bigger difference between ZFS and 
mdraid).  A much better comparison (which _will_ sway in mdraid's favor) 
would be running XFS or ext4 (or even JFS) on top of mdraid, as that is 
the regular usage of mdraid.


Furthermore, there is no sane reason to be running BTRFS on top of a 
single mdraid device, thus this was even less of a reasonable 
comparison.  It is worth noting however, that using BTRFS raid1 on top 
of two md RAID0 devices is significantly faster than BTRFS RAID10.

So if you base your analysis on Phoronix tests when serving large files
to a few clients maybe md could perform better. In all other cases BTRFS
RAID1 seems to be a better place to start if you want performance.
According to the bad performance -> unstable logic, md would then be the
less stable RAID1 implementation which doesn't make sense to me.
No reasonable person should be basing their analysis on results from 
someone else's testing without replicating said results themselves, 
especially when those results are based on benchmarks and not real 
workloads.  This goes double for Phoronix, as they are essentially paid 
to make whatever the newest technology on Linux is look good.

I'm not even saying that BTRFS performs better than md for most
real-world scenarios (these are only benchmarks), but that arguing that
BTRFS is not stable because it has performance issues still doesn't make
sense to me. Even synthetic benchmarks aren't enough to find the best
fit for real-world scenarios, so you could always find a very
restrictive situation where any filesystem, RAID implementation, volume
manager could look bad even the most robust ones.

Of course if BTRFS RAID1 was always slower than md RAID1 the logic might
make more sense. But clearly there were design decisions and performance
tuning in BTRFS that led to better or similar performance in several
scenarios, if the remaining scenarios don't get attention it may be
because they represent a niche (at least from the point of view of the
developers) not a lack of polishing.
Like I said above, a large part of this is probably because BTRFS raid 
was designed for COW workloads, and mdraid wasn't.







Re: Issues with unmountable BTRFS raid1 filesystem

2015-10-05 Thread Hugo Mills
On Mon, Oct 05, 2015 at 08:30:17AM -0400, Austin S Hemmelgarn wrote:
> I've been having issues recently with a relatively simple setup
> using a two device BTRFS raid1 on top of two two device md RAID0's,
> and every time I've rebooted since starting trying to use this
> particular filesystem, I've found it unable to mount and had to
> recreate it from scratch.  This is more of an inconvenience than
> anything else (while I don't have backups of it, all the data is
> trivial to recreate (in fact, so trivial that doing backups would be
> more effort than just recreating the data by hand)), but it's still
> something that I would like to try and fix.
> 
> First off, general info:
> Kernel version: 4.2.1-local+ (4.2.1 with minor modifications,
> sources can be found here: https://github.com/ferroin/linux)
> Btrfs-progs version: 4.2
>
> I would post output from btrfs fi show, but that's spouting
> obviously wrong data (it's saying I'm using only 127MB with 2GB of
> allocations on each 'disk',  I had been storing approximately 4-6GB
> of actual data on the filesystem).
> 
> This particular filesystem is composed of BTRFS raid1 across two LVM
> managed DM/MD RAID0 devices, each of which spans 2 physical hard
> drives.  I have a couple of other filesystems with the exact same
> configuration that have not ever displayed this issue.
> 
> When I run 'btrfs check' on the filesystem when it refuses to mount,
> I get a number of lines like the following:
> bad metadata [, ) crossing stripe boundary
> 
> followed eventually by:
> Errors found in extent allocation tree or chunk allocation

   I _think_ this is a bug in mkfs from 4.2.0, fixed in later
releases of the btrfs-progs.

   Hugo.

> As is typical of a failed mount, dmesg shows a 'failed to read the
> system array on ' 'open_ctree failed'.
> 
> I doubt that this is a hardware issue because:
> 1. Memory is brand new, and I ran a 48 hour burn-in test that showed
> no errors.
> 2. A failing storage controller, PSU, or CPU would be manifesting
> with many more issues than just this.
> 3. A disk failure would mean that two different disks, from
> different manufacturing lots, are encountering errors on exactly the
> same LBA's at exactly the same time, which while possible is
> astronomically unlikely for disks bigger than a few hundred
> gigabytes (the disks in question are 1TB each).
> 



-- 
Hugo Mills | Jazz is the sort of music where no-one plays
hugo@... carfax.org.uk | anything the same way once.
http://carfax.org.uk/  |
PGP: E2AB1DE4  |






Re: BTRFS as image store for KVM?

2015-10-05 Thread Rich Freeman
On Mon, Oct 5, 2015 at 7:16 AM, Lionel Bouton
 wrote:
> According to the bad performance -> unstable logic, md would then be the
> less stable RAID1 implementation which doesn't make sense to me.
>

The argument wasn't that bad performance meant that something was unstable.

The argument was that a lack of significant performance optimization
meant that the developers considered it unstable and not worth
investing time on optimizing.

So, the question isn't whether btrfs is or isn't faster than something
else.  The question is whether it is or isn't faster than it could be
if it were properly optimized.  That is, how does btrfs perform today
against btrfs 20 years from now, which obviously cannot be
benchmarked today.

That said, I'm not really convinced that the developers haven't fixed
this because they feel that it would need to be redone later after
major refactoring.  I think it is more likely that there are just very
few developers working on btrfs and load-balancing on raid just
doesn't rank high on their list of interests or possibly expertise.
If any are being paid to work on btrfs then most likely their
employers don't care too much about it either.

I did find the phoronix results interesting though.  The whole driver
for "layer-violation" is that with knowledge of the filesystem you can
better optimize what you do/don't read and write, and that may be
showing here.

--
Rich
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Btrfs progs release 4.2.2

2015-10-05 Thread David Sterba
Hi,

here's a bugfix release of btrfs-progs, fixes picked from current devel
branch.

* fi label: use fallback if the label ioctl is not available
* convert: check nodesize constraints against commandline features (-O)
* scrub: report status 'running' until all devices are finished
* device scanning might crash in some scenarios
* fi usage: print summary for non-root users

Tarballs: https://www.kernel.org/pub/linux/kernel/people/kdave/btrfs-progs/
Git: git://git.kernel.org/pub/scm/linux/kernel/git/kdave/btrfs-progs.git


Shortlog:

Anand Jain (2):
  btrfs-progs: provide fail safe for BTRFS_IOC_GET_FSLABEL ioctl
  btrfs-progs: fix is_block_device() return checks

David Sterba (3):
  btrfs-progs: make sure that is_block_device will return only 0/1
  btrfs-progs: fix double free during scanning
  btrfs-progs: fix error checking in load_device_info

Qu Wenruo (1):
  btrfs-progs: utils: Check nodesize against features

Silvio Fricke (2):
  btrfs-progs: tasks info->id is a pthread_t and should not set to -1
  btrfs-progs: check: fix memset range

Zhao Lei (2):
  btrfs-progs: Increase running state's priority in stat output
  btrfs-progs: Show detail error message when write sb failed in 
write_dev_supers()
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: BTRFS as image store for KVM?

2015-10-05 Thread Lionel Bouton
Hi,

Le 04/10/2015 14:03, Lionel Bouton a écrit :
> [...]
> This focus on single reader RAID1 performance surprises me.
>
> 1/ AFAIK the kernel md RAID1 code behaves the same (last time I checked
> you need 2 processes to read from 2 devices at once) and I've never seen
> anyone arguing that the current md code is unstable.

To better illustrate my point.

According to Phoronix tests, BTRFS RAID-1 is even faster than md RAID1
most of the time.

http://www.phoronix.com/scan.php?page=article&item=btrfs_raid_mdadm&num=1

The only case where md RAID1 was noticeably faster is sequential reads
with FIO libaio.

So if you base your analysis on Phoronix tests when serving large files
to a few clients maybe md could perform better. In all other cases BTRFS
RAID1 seems to be a better place to start if you want performance.
According to the bad performance -> unstable logic, md would then be the
less stable RAID1 implementation which doesn't make sense to me.

I'm not even saying that BTRFS performs better than md for most
real-world scenarios (these are only benchmarks), but that arguing that
BTRFS is not stable because it has performance issues still doesn't make
sense to me. Even synthetic benchmarks aren't enough to find the best
fit for real-world scenarios, so you could always find a very
restrictive situation where any filesystem, RAID implementation, volume
manager could look bad even the most robust ones.

Of course if BTRFS RAID1 was always slower than md RAID1 the logic might
make more sense. But clearly there were design decisions and performance
tuning in BTRFS that led to better or similar performance in several
scenarios, if the remaining scenarios don't get attention it may be
because they represent a niche (at least from the point of view of the
developers) not a lack of polishing.

Best regards,

Lionel
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Issues with unmountable BTRFS raid1 filesystem

2015-10-05 Thread Austin S Hemmelgarn
I've been having issues recently with a relatively simple setup using a 
two device BTRFS raid1 on top of two two device md RAID0's, and every 
time I've rebooted since starting trying to use this particular 
filesystem, I've found it unable to mount and had to recreate it from 
scratch.  This is more of an inconvenience than anything else (while I 
don't have backups of it, all the data is trivial to recreate (in fact, 
so trivial that doing backups would be more effort than just recreating 
the data by hand)), but it's still something that I would like to try 
and fix.


First off, general info:
Kernel version: 4.2.1-local+ (4.2.1 with minor modifications, sources 
can be found here: https://github.com/ferroin/linux)

Btrfs-progs version: 4.2

I would post output from btrfs fi show, but that's spouting obviously 
wrong data (it's saying I'm using only 127MB with 2GB of allocations on 
each 'disk',  I had been storing approximately 4-6GB of actual data on 
the filesystem).


This particular filesystem is composed of BTRFS raid1 across two LVM 
managed DM/MD RAID0 devices, each of which spans 2 physical hard drives. 
 I have a couple of other filesystems with the exact same configuration 
that have not ever displayed this issue.


When I run 'btrfs check' on the filesystem when it refuses to mount, I 
get a number of lines like the following:

bad metadata [, ) crossing stripe boundary

followed eventually by:
Errors found in extent allocation tree or chunk allocation

As is typical of a failed mount, dmesg shows a 'failed to read the 
system array on ' 'open_ctree failed'.


I doubt that this is a hardware issue because:
1. Memory is brand new, and I ran a 48 hour burn-in test that showed no 
errors.
2. A failing storage controller, PSU, or CPU would be manifesting with 
many more issues than just this.
3. A disk failure would mean that two different disks, from different 
manufacturing lots, are encountering errors on exactly the same LBA's at 
exactly the same time, which while possible is astronomically unlikely 
for disks bigger than a few hundred gigabytes (the disks in question are 
1TB each).






Re: BTRFS as image store for KVM?

2015-10-05 Thread Duncan
On Mon, 5 Oct 2015 13:16:03 +0200
Lionel Bouton  wrote:

> To better illustrate my point.
> 
> According to Phoronix tests, BTRFS RAID-1 is even faster than md RAID1
> most of the time.
> 
> http://www.phoronix.com/scan.php?page=article&item=btrfs_raid_mdadm&num=1
> 
> The only case where md RAID1 was noticeably faster is sequential reads
> with FIO libaio.
> 
> So if you base your analysis on Phoronix tests

[Oops.  Actually send to the list too this time.]

FYI...

1) It's worth noting that while I personally think Phoronix has more
merit than sometimes given credit for, it does have a rather bad rep in
kernel circles, due to how results are (claimed to be) misused in
support of points they don't actually support at all, if you know how
to read the results taking into account the configuration used.

As such, Phoronix is about the last reference you want to be using if
trying to support points with kernel folks, because rightly or wrongly,
a lot of them will see that and simply shut down right there, having
already decided based on previous experience that there's little use
arguing with somebody quoting Phoronix, since they invariably don't
know how to read the results based on what was /actually/ tested given
the independent variable and the test configuration.

Tho I personally actually do find quite some use for various Phoronix
benchmark articles, reading them with the testing context in mind.  But
I definitely wouldn't be pulling them out to demonstrate a point to
kernel folks, unless it was near enough the end of a list of references
making a similar point, after I've demonstrated my ability to keep the
testing context in mind when looking at their results, because I know
based on the history, quoting Phoronix in support of something simply
isn't likely to get me anywhere with kernel folks.


As for the specific test you referenced...

2) At least the URL you pointed at was benchmarks for Intel
SSDs, not spinning rust.  The issues and bottlenecks in the case of
good SSDs are so entirely different than for spinning rust that it's an
entirely different debate.  Among other things, good SATA-based SSDs
are and have been for awhile now, so fast that if the tests really are
I/O bound, SATA-bus speed (thru SATA 3.0, 600 MB/s, anyway, SATA 3.2
aka SATA Express and M.2, 1969 MB/s, are fast enough to often put the
bottleneck on the device, again), not device speed, tends to be the
bottleneck. However, in many cases, because good SSDs and modern buses
are so fast, the bottleneck actually ends up being CPU, once again.

So in the case of good reasonably current SSDs, CPU is likely to be the
bottleneck.  Tho these aren't as current as might be expected...

The actual devices tested here are SATA 2, 300 MB/s bus speed, and are
rather dated given the Dec 2014 article date, as they're only rated 205
MB/s read, 45 MB/s write, so only read is anything near bus speed,
while bus speed itself is only SATA 2, 300 MB/s.

Given that, it's likely that write speed is device-bound, and while
raid isn't likely to slow it down despite the multi-time-writing
because it /is/ device-bound, it's unlikely to be faster than
single-device writing, either.

But this thread was addressing read-speed.  Read-speed is much closer
to the bus speed, and depending on the application, particularly for
raid, may well be CPU-bound.  Where it's CPU-bound, because the device
and bus speed are relatively high, the multiple devices of RAID aren't
likely to be of much benefit at all.

Meanwhile, what was the actual configuration on the devices themselves?

Here, we see that in both cases it was actually btrfs, btrfs with
defaults as installed (in single-device-mode, if reading between the
lines) on top of md/raid of the tested level for the md/raid side,
native btrfs raid of the tested level on the native btrfs raid side.

But... there's already so much that isn't known -- he says defaults
where not stated, but that's still ambiguous in some cases.

For instance, he does specifically state that in native mode, btrfs
detects the ssds, and activates ssd mode, but that it doesn't do so
when installed on the md/raid.  So we know for sure that he took the
detected ssd (or not) defaults there.

But...  we do NOT know... does btrfs native raid level mean for both
data and metadata, or only for (presumably) data, leaving metadata at
the defaults (which is raid1 for multi-device), or perhaps the
reverse, tested level metadata, defaults for data (which AFAIK is
single mode for multi-device).

And in single-device btrfs on top of md/raid mode, with the md/raid at
the tested level, we already know that it didn't detect ssd and enable
ssd mode, and he didn't enable it manually, but what we /don't/ know
for sure is how it was installed at mkfs.btrfs time and whether that
might have detected ssd.   If mkfs.btrfs detected ssd, it would have
created single-mode metadata by default, otherwise, dup mode metadata,
unless specifically told 

Re: [PATCH 4/4] btrfs-progs: change -t option for subvolume list to print a simple space-separated table (making it machine-readable)

2015-10-05 Thread Axel Burri


On 2015-10-04 16:34, Goffredo Baroncelli wrote:
> On 2015-10-04 05:37, Duncan wrote:
>> Goffredo Baroncelli posted on Sat, 03 Oct 2015 19:41:33 +0200 as
>> excerpted:
>>
>>> On 2015-10-03 12:09, Axel Burri wrote:


 On 2015-10-03 11:56, Goffredo Baroncelli wrote:
> On 2015-10-02 18:41, a...@tty0.ch wrote:
>> Old implementation used tabs "\t", and tried to work around problems
>> by guessing amount of tabs needed (e.g. "\t\t" after top level", with
>> buggy output as soon as empty uuids are printed). This will never
>> work correctly, as tab width is a user-defined setting in the
>> terminal.
>
>
> Why not use string_table() and table_*() functions  ?

 string_table(), as well as all table functions by nature, needs to know
 the maximum size of all cells in a row before printing, and therefore
 buffers all the output before printing. It would eat up a lot of memory
 for large tables (it is not unusual to have 1000+ subvolumes in btrfs
 if you make heavy use of snapshotting). Furthermore, it would slow down
 things by not printing the output linewise.
>>>
>>>
>>> Assuming 200bytes per row (== subvolume) x 1000 subvolumes = 200kB... I
>>> don't think that this could be a problem, nor in terms of memory used
>>> nor in terms of speed: if you have 1000+ subvolumes the most time
>>> consuming activity is traversing the filesystem looking at the
>>> snapshot...
>>
>> Perhaps unfortunately, scaling to millions of snapshots/subvolumes really 
>> *is* expected by some people.  You'd be surprised at the number of folks 
>> that setup automated per-minute snapshotting with no automated thinning, 
>> and expect to be able to keep several years' worth of snapshots, and then 
>> wonder why btrfs maintenance commands such as balance take weeks/months...
> [...]
>> Obviously btrfs doesn't scale to that level now, and if it did, someone 
>> making the mistake of trying to get a listing of millions of snapshots 
>> would very likely change their mind before even hitting 10%...
>>
>> But that's why actually processing line-by-line is important, so they'll 
>> actually /see/ what happened and ctrl-C it, instead of the program 
>> aborting as it runs into (for example) the 32-bit user/kernel memory 
>> barrier, without printing anything useful...
> 
> Please Ducan, read the code.
> 
> *TODAY* "btrfs list" does a scan of all subvolumes storing information in 
> memory !
> 
> This is the most time consuming activity (think traversing a filesystem with 
> millions of snapshots)
> 
> *Then* "btrfs list" print the info.
> 
> So you are already blocked at the screen until all subvolume are read. And I 
> repeat (re)processing the information requires less time than reading the 
> information from the disk.
> 
> []
> 

A quick look at the code shows me that Goffredo is right here, as
__list_subvol_search() always fetches ALL data from
BTRFS_IOC_TREE_SEARCH, putting it into an rbtree for later processing
(assembling full paths, sorting).

While there is certainly room for improvement here (assuming that
BTRFS_IOC_TREE_SEARCH returns objectids in sorted order, it would
definitely be possible to produce line-by-line output), the code looks
pretty elegant the way it is.

I still don't think it is wise to bloat things further just for printing
nice tables. My impression is that "btrfs subvolume list" is
human-readable enough without the '-t' flag, while the output with '-t'
flag is much more machine-readable-friendly, and thus should have the
highest possible performance. e.g.:

  btrfs sub list  -t / | (read th; while read $th ; do echo $gen; done)
  btrfs sub list -t | column -t

Again, this is just my opinion, being a "unix-purist". Maybe a good
compromise would be to use a single "\t" instead of " " as column delimiter.
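
As a small illustration of what that buys a consumer (a sketch, not part
of the patch; it assumes the four columns ID, gen, top level and path,
with path printed last so embedded spaces survive), one sscanf() per
line is enough:

#include <stdio.h>

int main(void)
{
	char line[4096], path[4096];
	unsigned long long id, gen, top;

	if (!fgets(line, sizeof(line), stdin))	/* skip the header row */
		return 1;
	while (fgets(line, sizeof(line), stdin))
		if (sscanf(line, "%llu %llu %llu %4095[^\n]",
			   &id, &gen, &top, path) == 4)
			printf("subvol %llu at %s (gen %llu)\n", id, path, gen);
	return 0;
}

e.g. btrfs sub list -t / | ./parse-subvols (parse-subvols being whatever
you compile the snippet to).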
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] Btrfs: fix qgroup sanity tests

2015-10-05 Thread Josef Bacik
With my changes to allow us to find old roots when resolving indirect refs I
introduced a regression to the sanity tests.  Since we don't really care to go
down into the fs roots we just need to have the old behavior of returning ENOENT
for dummy roots for the sanity tests.  In the future if we want to get fancy we
can populate the test fs trees with the references as well.  Thanks,

Signed-off-by: Josef Bacik 
---
 fs/btrfs/backref.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/fs/btrfs/backref.c b/fs/btrfs/backref.c
index 5de66e9..905d697 100644
--- a/fs/btrfs/backref.c
+++ b/fs/btrfs/backref.c
@@ -339,6 +339,13 @@ static int __resolve_indirect_ref(struct btrfs_fs_info *fs_info,
goto out;
}
 
+#ifdef CONFIG_BTRFS_FS_RUN_SANITY_TESTS
+	if (unlikely(test_bit(BTRFS_ROOT_DUMMY_ROOT, &root->state))) {
+		srcu_read_unlock(&fs_info->subvol_srcu, index);
+   ret = -ENOENT;
+   goto out;
+   }
+#endif
if (path->search_commit_root)
root_level = btrfs_header_level(root->commit_root);
else if (time_seq == (u64)-1)
-- 
1.8.3.1

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 4/4] btrfs-progs: change -t option for subvolume list to print a simple space-separated table (making it machine-readable)

2015-10-05 Thread Axel Burri


On 2015-10-05 17:42, Goffredo Baroncelli wrote:
> Hi Axel,
> 
> On 2015-10-05 17:04, Axel Burri wrote:
> [...]
>> I still don't think it is wise to bloat things further just for printing
>> nice tables. My impression is that "btrfs subvolume list" is
>> human-readable enough without the '-t' flag, while the output with '-t'
>> flag is much more machine-readable-friendly, and thus should have the
>> highest possible performance. e.g.:
> 
> I disagree, the "-t" mode is for the human; the non '-t' mode would be for 
> machine (if you take the space as separator and the first word as key, with 
> the exception of the path which starts from the "path " word and ends to the 
> end of line); unfortunately "top level" is an exception. Anyway btrfs is not 
> very machine-friendly (sic).

That's what my patches are all about in the first place, to make it more
machine-friendly :)

> Compare the two output below:
> 
> $ sudo btrfs sub list -a /
> ID 257 gen 44309 top level 5 path /debian
> ID 289 gen 17415 top level 257 path debian/var/lib/machines
> ID 298 gen 35434 top level 5 path /debian-i32
> ID 299 gen 43961 top level 5 path /boot
> ID 300 gen 39452 top level 5 path /debian-jessie

"grep-friendly"

> 
> $ sudo btrfs sub list -at / 
> IDgen top level   path
> ----- -   
> 257   44310   5   /debian
> 289   17415   257 debian/var/lib/machines
> 298   35434   5   /debian-i32
> 299   43961   5   /boot
> 300   39452   5   /debian-jessie
> 
> 
> I think that is easy to declare the latter more "human" friendly. 

Well yes, but in some ways the latter is also more machine-friendly, as
there is less data to pipe and a whitespace-separated list is much
easier to parse (of course the same applies to your table implementation
below).

Problems here are:

- "top level" (whitespace, addressed in [PATCH 4/4])

- timestamp (-s) "YYYY-MM-DD HH:MM:SS" (whitespace and no timezone,
addressed in [PATCH 3/4])

- path can contain whitespace (not _that_ important as it's always
printed last).

> 
> BTW with the use of the table_* function the output become:
> 
> $ sudo ./btrfs sub list -at / 
> ID  gen   top level path   
> === = = ===
> 257 44311 5 /debian   
> 289 17415   257 debian/var/lib/machines
> 298 35434 5 /debian-i32   
> 299 43961 5 /boot 
> 300 39452 5 /debian-jessie
> 
> $ sudo ./btrfs sub list -aptcguqR / 
> ID  gen   cgen  parent top level parent_uuid received_uuid uuid   
>   path   
> === = = == = === = 
>  ===
> 257 44313 7  5 5   - - 
> 840c86cf-e78b-d54a-ab38-2858812d /debian   
> 289 17415 17415257   257   - - 
> 8b857250-3a3e-754d-810e-57342bbb2f56 debian/var/lib/machines
> 298 35434 35399  5 5   - - 
> 1f38049b-b153-d741-b903-d2de6fd7b3fd /debian-i32   
> 299 43961 35512  5 5   - - 
> f9d52b6b-a6d1-8c45-a6cd-ddb68cf58062 /boot 
> 300 39452 37744  5 5   - - 
> 026e44bd-66d4-e341-9a14-0124acf79793 /debian-jessie
> 
> I right aligned each field with the exception of the path (which is left 
> aligned).
> I already prepared the patch which convert "btrfs sub list -t" to use the 
> table_* function. If you want I can send it to you to extend/replace your 
> patch #4.

Looks promising, although I'd still not recommend it because of the
increased memory/cpu requirements.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 4/4] btrfs-progs: change -t option for subvolume list to print a simple space-separated table (making it machine-readable)

2015-10-05 Thread Goffredo Baroncelli
Hi Axel,

On 2015-10-05 17:04, Axel Burri wrote:
[...]
> 
> A quick look at the code shows me that Goffredo is right here, as
> __list_subvol_search() always fetches ALL data from
> BTRFS_IOC_TREE_SEARCH, putting it into a rbtree for later processing
> (assemble full paths, sorting).
> 
> While there is certainly room for improvements here (assuming that
> BTRFS_IOC_TREE_SEARCH returns objectid's in sorted order, it would
> definitively be possible to produce line-by-line output), the code looks
> pretty elegant the way it is.
> 
> I still don't think it is wise to bloat things further just for printing
> nice tables. My impression is that "btrfs subvolume list" is
> human-readable enough without the '-t' flag, while the output with '-t'
> flag is much more machine-readable-friendly, and thus should have the
> highest possible performance. e.g.:

I disagree, the "-t" mode is for the human; the non '-t' mode would be for 
machine (if you take the space as separator and the first word as key, with the 
exception of the path which starts from the "path " word and ends to the end of 
line); unfortunately "top level" is an exception. Anyway btrfs is not very 
machine-friendly (sic).


Compare the two output below:

$ sudo btrfs sub list -a /
ID 257 gen 44309 top level 5 path /debian
ID 289 gen 17415 top level 257 path debian/var/lib/machines
ID 298 gen 35434 top level 5 path /debian-i32
ID 299 gen 43961 top level 5 path /boot
ID 300 gen 39452 top level 5 path /debian-jessie

$ sudo btrfs sub list -at / 
ID  gen top level   path
--  --- -   
257 44310   5   /debian
289 17415   257 debian/var/lib/machines
298 35434   5   /debian-i32
299 43961   5   /boot
300 39452   5   /debian-jessie


I think that is easy to declare the latter more "human" friendly. 

BTW with the use of the table_* function the output become:

$ sudo ./btrfs sub list -at / 
ID  gen   top level path   
=== = = ===
257 44311 5 /debian   
289 17415   257 debian/var/lib/machines
298 35434 5 /debian-i32   
299 43961 5 /boot 
300 39452 5 /debian-jessie

$ sudo ./btrfs sub list -aptcguqR / 
ID  gen   cgen  parent top level parent_uuid received_uuid uuid 
path   
=== = = == = === = 
 ===
257 44313 7  5 5   - - 
840c86cf-e78b-d54a-ab38-2858812d /debian   
289 17415 17415257   257   - - 
8b857250-3a3e-754d-810e-57342bbb2f56 debian/var/lib/machines
298 35434 35399  5 5   - - 
1f38049b-b153-d741-b903-d2de6fd7b3fd /debian-i32   
299 43961 35512  5 5   - - 
f9d52b6b-a6d1-8c45-a6cd-ddb68cf58062 /boot 
300 39452 37744  5 5   - - 
026e44bd-66d4-e341-9a14-0124acf79793 /debian-jessie

I right aligned each field with the exception of the path (which is left 
aligned).
I already prepared the patch which converts "btrfs sub list -t" to use the 
table_* function. If you want I can send it to you to extend/replace your patch 
#4.

BR
G.Baroncelli






> 
>   btrfs sub list  -t / | (read th; while read $th ; do echo $gen; done)
>   btrfs sub list -t | column -t
> 
> Again, this is just my opinion, being a "unix-purist". Maybe a good
> compromise would be to use a single "\t" instead of " " as column delimiter.




> 


-- 
gpg @keyserver.linux.it: Goffredo Baroncelli 
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: BTRFS as image store for KVM?

2015-10-05 Thread Austin S Hemmelgarn

On 2015-10-05 10:04, Duncan wrote:

On Mon, 5 Oct 2015 13:16:03 +0200
Lionel Bouton  wrote:


To better illustrate my point.

According to Phoronix tests, BTRFS RAID-1 is even faster than md RAID1
most of the time.

http://www.phoronix.com/scan.php?page=article&item=btrfs_raid_mdadm&num=1

The only case where md RAID1 was noticeably faster is sequential reads
with FIO libaio.

So if you base your analysis on Phoronix tests



[...snip...]


Hmm... I think I've begun to see the kernel folks point about people
quoting Phoronix in support of their points, when it's really not
apropos at all.  Yes, I do still consider Phoronix reports in context
to contain useful information, at some level.  However, one really must
be aware of what was actually tested in ordered to understand what the
results actually mean, and unfortunately, it seems most people quoting
it, including here, really can't properly do so in context, and thus
end up using it in support of points that simply are not supported by
the given evidence in the Phoronix articles people are attempting to
use.

Even aside from the obvious past issues with Phoronix reports, people 
forget that they are a news organization (regardless of what they claim, 
they _are_ a news organization), and as such their employees are not 
paid to verify existing results; they're paid to make impactful articles 
that grab people's attention (I'd be willing to bet that this story 
started in response to the people who correctly pointed out that XFS or 
ext4 on top of mdraid beats the pants off of BTRFS performance-wise, and 
(incorrectly) assumed that this meant that mdraid was better than BTRFS 
raid).  This, when combined with the near-total lack of actual 
statistical analysis in many cases, really hurts their credibility (at 
least, for me it does).


The other issue is that so many people tout benchmarks as the pinnacle 
of testing, when they really aren't.  Benchmarks are by definition 
synthetic workloads, and as such only the _really_ good ones (which 
there aren't many of) give you more than a very basic idea what 
performance differences you can expect with a given workload.  On top of 
that, people just accept results without trying to reproduce them 
themselves (Kernel folks tend to be much better about this than many 
other people though).


A truly sane person, looking to determine the best configuration for a 
given workload, will:
1. Look at a wide variety of sources to determine what configurations he 
should even be testing.  (The author of the linked article obviously 
didn't do this, or just didn't care, the defaults on btrfs are 
unsuitable for a significant number of cases, including usage on top of 
mdraid).
2. Using this information, run established benchmarks similar to his 
use-case to further narrow down the test candidates.
3. Write his own benchmark to simulate to the greatest degree possible 
the actual workload he expects to run, and then use that for testing the 
final candidates.
4. Gather some reasonable number of samples with the above mentioned 
benchmark, and use _real_ statistical analysis to determine what he 
should be using.


To put this in further perspective, most people just do step one, assume 
that other people know what they're talking about, and don't do any 
further testing, and there are other people who just do step two, and 
then claim their results are infallible.



