[PATCH] btrfs-progs: tests: add 020-extent-ref-cases

2016-05-29 Thread Lu Fengqi
In order to confirm that btrfsck supports checking a variety of
refs, add the following cases:
* keyed_block_ref
* keyed_data_ref
* shared_block_ref
* shared_data_ref
* no_inline_ref (an extent item without any inline ref)
* no_skinny_ref

Signed-off-by: Lu Fengqi 
---
In order to run btrfsck on the restored file system, we should use the patch
"btrfs-progs: make btrfs-image restore to support dup". That patch makes
btrfs-image correctly restore the image in the dup case.
---
 .../fsck-tests/020-extent-ref-cases/keyed_block_ref.img | Bin 0 -> 10240 bytes
 .../fsck-tests/020-extent-ref-cases/keyed_data_ref.img  | Bin 0 -> 4096 bytes
 tests/fsck-tests/020-extent-ref-cases/no_inline_ref.img | Bin 0 -> 4096 bytes
 tests/fsck-tests/020-extent-ref-cases/no_skinny_ref.img | Bin 0 -> 3072 bytes
 .../020-extent-ref-cases/shared_block_ref.img   | Bin 0 -> 23552 bytes
 .../fsck-tests/020-extent-ref-cases/shared_data_ref.img | Bin 0 -> 5120 bytes
 tests/fsck-tests/020-extent-ref-cases/test.sh   |  14 ++
 7 files changed, 14 insertions(+)
 create mode 100644 tests/fsck-tests/020-extent-ref-cases/keyed_block_ref.img
 create mode 100644 tests/fsck-tests/020-extent-ref-cases/keyed_data_ref.img
 create mode 100644 tests/fsck-tests/020-extent-ref-cases/no_inline_ref.img
 create mode 100644 tests/fsck-tests/020-extent-ref-cases/no_skinny_ref.img
 create mode 100644 tests/fsck-tests/020-extent-ref-cases/shared_block_ref.img
 create mode 100644 tests/fsck-tests/020-extent-ref-cases/shared_data_ref.img
 create mode 100755 tests/fsck-tests/020-extent-ref-cases/test.sh

diff --git a/tests/fsck-tests/020-extent-ref-cases/keyed_block_ref.img 
b/tests/fsck-tests/020-extent-ref-cases/keyed_block_ref.img
new file mode 100644
index 0000000000000000000000000000000000000000..289d37bc309fb8c33bff13cb71de8b9c5d83f1bb
GIT binary patch
literal 10240
[base85-encoded binary image data; truncated in this archive]
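
The 14-line test.sh itself is not included in this excerpt. As a rough sketch
only, a fsck test of this kind in btrfs-progs typically restores each image and
checks it, along these lines (helper names such as check_prereq and run_check
are assumed from tests/common, and the restore-then-check flow is an assumption
based on the btrfs-image note above):

  #!/bin/bash
  # rough sketch, not the actual test.sh from this patch
  source $TOP/tests/common

  check_prereq btrfs-image
  check_prereq btrfs

  for img in *.img; do
          # restore the dumped image, then run fsck on the result
          run_check $TOP/btrfs-image -r "$img" restored.img
          run_check $TOP/btrfs check restored.img
          rm -f restored.img
  done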

Re: [regression] make sure seed is writeable sprout after device add

2016-05-29 Thread Chris Murphy
On Sun, May 29, 2016 at 8:03 PM, Chris Murphy  wrote:

>
> # lsblk -o +UUID
> NAME   MAJ:MIN RM  SIZE RO TYPE MOUNTPOINT UUID
> loop0    7:0    0   10G  0 loop /mnt/0     63288b0c-9216-4f11-aed4-cc054ae90e07
> loop1    7:1    0   10G  0 loop            63288b0c-9216-4f11-aed4-cc054ae90e07
>
> This is worse.
>
> # btrfs fi show
> Label: none  uuid: 63288b0c-9216-4f11-aed4-cc054ae90e07
> Total devices 2 FS bytes used 384.00KiB
> devid    1 size 10.00GiB used 2.02GiB path /dev/loop0
> devid    2 size 10.00GiB used 0.00B path /dev/loop1
>
> So is this. Where is the new UUID for the newly created sprout volume?
>
> /dev/loop0: UUID="63288b0c-9216-4f11-aed4-cc054ae90e07"
> UUID_SUB="e379aedb-6d14-4d56-be7d-1772c9984bc5" TYPE="btrfs"
> /dev/loop1: UUID="63288b0c-9216-4f11-aed4-cc054ae90e07"
> UUID_SUB="b282f566-8382-468e-b9ec-f748244b703b" TYPE="btrfs"
>
> Uhh? Identical UUIDs for seed and sprout? That's not right.


Just to confirm that the UUID for the sprout is the same as the seed's
UUID at mkfs time; I didn't include it in the previous email:


# mkfs.btrfs /dev/loop0
btrfs-progs v4.5.2
See http://btrfs.wiki.kernel.org for more information.

Performing full device TRIM (10.00GiB) ...
Label:  (null)
UUID:   63288b0c-9216-4f11-aed4-cc054ae90e07



-- 
Chris Murphy


Re: [regression] make sure seed is writeable sprout after device add

2016-05-29 Thread Chris Murphy
With 4.5.5 the 'mount -o remount,rw' works like the wiki describes,
and is in my opinion contrary to the mount man page.

After the -o remount,rw following btrfs dev add VG/sprout, I get this partial:

# lsblk -o +UUID
NAME                      MAJ:MIN RM SIZE RO TYPE MOUNTPOINT UUID
│ └─VG-thintastic_tdata   253:1    0  90G  0 lvm
│   └─VG-thintastic-tpool 253:2    0  90G  0 lvm
│     ├─VG-thintastic     253:3    0  90G  0 lvm
│     ├─VG-seed           253:5    0  50G  0 lvm  /mnt/0     59828c01-8354-43ac-a92d-f22d1b5d0e22
│     └─VG-sprout         253:6    0  50G  0 lvm             e8de3a52-34d1-46af-98c1-8620642be884

And

# mount
[...trimmed...]
/dev/mapper/VG-seed on /mnt/0 type btrfs
(rw,relatime,seclabel,space_cache,subvolid=5,subvol=/)

This is just wrong. The wrong volume UUID is associated with /mnt/0 by
lsblk. And mount shows VG/seed as rw, which is not possible because by
definition the seed is read-only.

I've always thought this was misleading and confusing.

Next I tried it with kernel-4.6.0...

[root@f24m ~]# losetup
NAME       SIZELIMIT OFFSET AUTOCLEAR RO BACK-FILE    DIO
/dev/loop0 0  0 0  0 /root/seed 0
/dev/loop1 0  0 0  0 /root/sprout   0

# mount
[...trimmed...]
/dev/loop0 on /mnt/0 type btrfs
(rw,relatime,seclabel,space_cache,subvolid=5,subvol=/)

So it's the same incorrect information as 4.5.5.

# lsblk -o +UUID
NAME   MAJ:MIN RM  SIZE RO TYPE MOUNTPOINT UUID
loop0    7:0    0   10G  0 loop /mnt/0     63288b0c-9216-4f11-aed4-cc054ae90e07
loop1    7:1    0   10G  0 loop            63288b0c-9216-4f11-aed4-cc054ae90e07

This is worse.

# btrfs fi show
Label: none  uuid: 63288b0c-9216-4f11-aed4-cc054ae90e07
Total devices 2 FS bytes used 384.00KiB
devid    1 size 10.00GiB used 2.02GiB path /dev/loop0
devid    2 size 10.00GiB used 0.00B path /dev/loop1

So is this. Where is the new UUID for the newly created sprout volume?

/dev/loop0: UUID="63288b0c-9216-4f11-aed4-cc054ae90e07"
UUID_SUB="e379aedb-6d14-4d56-be7d-1772c9984bc5" TYPE="btrfs"
/dev/loop1: UUID="63288b0c-9216-4f11-aed4-cc054ae90e07"
UUID_SUB="b282f566-8382-468e-b9ec-f748244b703b" TYPE="btrfs"

Uhh? Identical UUIDs for seed and sprout? That's not right.

Same version of btrfs-progs in both cases, just different kernels, BUT
on different machines, which is why it's LVM in the first case and
fallocated files on loop devices in the second. I'm not sure what's
causing this.




Chris Murphy


Re: Functional difference between "replace" vs "add" then "delete missing" with a missing disk in a RAID56 array

2016-05-29 Thread Duncan
Chris Johnson posted on Sun, 29 May 2016 09:33:49 -0700 as excerpted:

> Situation: A six disk RAID5/6 array with a completely failed disk. The
> failed disk is removed and an identical replacement drive is plugged in.

First of all, be aware (as you already will be if you're following the 
list) that there are currently two, possibly related, (semi-?)critical 
known bugs still affecting raid56 mode, with the result being that 
despite raid56 nominal completion in 3.19 and fix of a couple even more 
critical bugs early on, by 4.1 release, raid56 mode remains negatively 
recommended for anything but testing.

One of the two bugs is that restriping (as done by balance either with
the restripe filters after adding devices or triggered automatically by
device delete) can, in SOME cases only, with the trigger variable unknown
at this point, take an order of magnitude (or even more) longer than
it should -- we're talking over a week for a rebalance that would be
expected to be done in under a day, up to possibly months for the multi-TB
filesystems that are a common use-case for raid5/6, that might be
expected to take a day or two under normal circumstances.

This rises to critical because other than the impractical time involved, 
once you're talking weeks to months restripe time, the danger of another 
device going out, thereby killing the entire array, increases 
unacceptably, to the point that raid56 cannot be considered usable for 
the normal things people use it for, thus the critical bug rating even if 
in theory the restripe is completing correctly and the data isn't in 
immediate danger.

Obviously you're not hitting it if your results show balance as 
significantly faster, but because we don't know what triggers the problem 
yet, that's no guarantee that you won't hit it later, after somehow 
triggering the problem.

The second bug is equally alarming, but in a different way.  A number of 
people have reported that replacing (by one method or the other) a first 
device appears to work, but if a second replace is attempted, it kills 
the array(!!), so obviously something's going wrong with the first 
replace as it's not returning the array to full undegraded functionality, 
even though all the current tools as well as operations before the second
replace is attempted suggest that it has done just that.

This one too remains untraced to an ultimate cause, and while the two 
bugs appear quite different, because they are both critical and remain 
untraced, it remains possible that they are actually simply two different 
symptoms of the same root bug.


So, if you're using raid56 only for testing as is recommended, great, but 
if you're using it for live data, for sure have your backups ready as 
there remains an uncomfortably high chance that you may need to use them 
if something goes wrong with that raid56 and these bug(s) prevent you 
from recovering the array.  Or alternatively, switch to the more mature 
raid1 or raid10 modes if realistic in your use-case, or to more 
traditional solutions such as md/dm-raid underneath btrfs or some other 
filesystem.

(One very interesting solution is btrfs raid1 mode over top of a pair of 
md/dm-raid0 virtual devices, each of which can then be composed of 
multiple physical devices.  This allows btrfs raid1 mode data and 
metadata integrity checking and repair that underlying raid modes don't 
have, and includes the repair of detected checksum errors that btrfs 
single mode won't be able to do because it can detect problems but not 
correct them.  Meanwhile, the underlying raid0 helps make up somewhat for 
the btrfs' poor raid1 optimization and performance as it tends to 
serialize access to multiple devices that other raid solutions 
parallelize.)
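
A minimal sketch of that layered setup, assuming four disks sdb through sde
(device names and sizes are hypothetical, not from this thread):

# mdadm --create /dev/md0 --level=0 --raid-devices=2 /dev/sdb /dev/sdc
# mdadm --create /dev/md1 --level=0 --raid-devices=2 /dev/sdd /dev/sde
# mkfs.btrfs -d raid1 -m raid1 /dev/md0 /dev/md1
# mount /dev/md0 /mnt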

Of course, the more mature zfs on linux can be another alternative, if 
you're prepared to overlook the licensing issues and have hardware up to
the task.

With that warning explained and alternatives provided, to your actual 
question...

> Here I have two options for replacing the disk, assuming the old drive
> is device 6 in the superblock and the replacement disk is /dev/sda.
> 
> 'btrfs replace start 6 /dev/sda /mnt'
> This will start a rebuild of the array using the new drive, copying data
> that would have been on device 6 to the new drive from the parity data.
> 
> btrfs add /dev/sda /mnt && btrfs device delete missing /mnt This adds a
> new device (the replacement disk) to the array and dev delete missing
> appears to trigger a rebalance before deleting the missing disk from the
> array. The end result appears to be identical to option 1.
> 
> A few weeks back I recovered an array with a failed drive using 'delete
> missing' because 'replace' caused a kernel panic. I later discovered
> that this was not (just) a failed drive but some other failed hardware
> that I've yet to start diagnosing. Either motherboard or HBA. The drives
> are in a new server now and I am currently rebuilding the array with
> 

Re: [regression] make sure seed is writeable sprout after device add

2016-05-29 Thread Chris Murphy
On Sun, May 29, 2016 at 1:48 PM, Anand Jain  wrote:
> Originally a seed FS becomes a writeable sprout FS after a
> device is added to it; however, as of 4.6 I don't see this
> behavior anymore.

I think the old behavior where it's possible to use -o remount,rw is
actually confusing.

Strictly speaking what is mounted is the seed volume with that
particular volume UUID. Doing a remount that effectively causes a
behind-the-scenes umount and then a mount of a different volume and volume UUID is
not obvious. What if there's more than one sprout? It seems really
ambiguous what the user wants with an -o remount so I'm wondering if
this operation should fail instead?

From the mount man page:

   remount
  Attempt to remount an already-mounted filesystem.  This
is commonly used to change the mount flags for a filesystem,
especially to make a readonly filesystem writable.  It does not change
device or mount point.

Two problems: 1. the seed and sprout are two filesystems, and one of
them is not already mounted. So remount should not mean switch from a
mounted filesystem to an unmounted one. 2. The former remount behavior
did change device, from that of the seed to that of the sprout.

So both of those seem to depart from the remount definition rather
significantly.
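
For reference, the sequence under discussion is roughly the following
(an illustration using loop devices as in the earlier mails, not an exact
transcript of the reported commands):

# truncate -s 10G /root/seed /root/sprout
# losetup /dev/loop0 /root/seed
# losetup /dev/loop1 /root/sprout
# mkfs.btrfs /dev/loop0
# btrfstune -S 1 /dev/loop0            <- mark as seed
# mount /dev/loop0 /mnt/0              <- seed mounts read-only
# btrfs device add /dev/loop1 /mnt/0   <- creates the sprout
# mount -o remount,rw /mnt/0           <- the remount being debated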


-- 
Chris Murphy


Re: Hot data tracking / hybrid storage

2016-05-29 Thread Ferry Toth
On Sun, 29 May 2016 12:33:06 -0600, Chris Murphy wrote:

> On Sun, May 29, 2016 at 12:03 PM, Holger Hoffstätte
>  wrote:
>> On 05/29/16 19:53, Chris Murphy wrote:
>>> But I'm skeptical of bcache using a hidden area historically for the
>>> bootloader, to put its device metadata. I didn't realize that was the
>>> case. Imagine if LVM were to stuff metadata into the MBR gap, or
>>> mdadm. Egads.
>>
>> On the matter of bcache in general this seems noteworthy:
>>
>> https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=4d1034eb7c2f5e32d48ddc4dfce0f1a723d28667
>>
>> bummer..
> 
> Well it doesn't mean no one will take it, just that no one has taken it
> yet. But the future of SSD caching may only be with LVM.
> 
> --
> Chris Murphy

I think all the above posts underline exactly my point:

Instead of using an SSD cache (be it bcache or dm-cache) it would be much
better to have the btrfs allocator be aware of SSDs in the pool and
prioritize allocations to the SSDs to maximize performance.

This would make it easy to add more SSDs or replace worn-out ones,
without the mentioned headaches. After all, adding/replacing drives in a
pool is one of btrfs's biggest advantages.



Re: [PATCH 1/5] Btrfs: test_check_exists: Fix infinite loop when searching for free space entries

2016-05-29 Thread Feifei Xu



On 2016/5/27 23:43, Josef Bacik wrote:



 fs/btrfs/free-space-cache.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/fs/btrfs/free-space-cache.c b/fs/btrfs/free-space-cache.c
index 5e6062c..05c9ef8 100644
--- a/fs/btrfs/free-space-cache.c
+++ b/fs/btrfs/free-space-cache.c
@@ -3662,6 +3662,7 @@ have_info:
 if (tmp->offset + tmp->bytes < offset)
 break;
 if (offset + bytes < tmp->offset) {
+info = tmp;
 n = rb_prev(&info->offset_index);
 continue;
 }
@@ -3676,6 +3677,7 @@ have_info:
 if (offset + bytes < tmp->offset)
 break;
 if (tmp->offset + tmp->bytes < offset) {
+info = tmp;
 n = rb_next(&info->offset_index);
 continue;
 }



Just make it rb_next(&tmp->offset_index)/rb_prev(&tmp->offset_index)
instead of doing the info = tmp thing.  Thanks,


Josef


I will change it in v2, thanks
Feifei



[regression] make sure seed is writeable sprout after device add

2016-05-29 Thread Anand Jain
Originally a seed FS becomes a writeable sprout FS after a
device is added to it; however, as of 4.6 I don't see this
behavior anymore.

The above feature is quite unique to btrfs, and there
are some good future solutions on top of it. So please preserve
this feature; here is a test case [1] which is to make
sure we do.

On the point of fixing the regression, I am trying; I traced back
until 3.8 and it looks like it fails beyond that as well. Will
continue after my vacation.

Further, while digging this out, I found another bug in
btrfs_init_new_device(), which [2] will fix. (Again, [2] is not
a fix for the regression.)


[1]
[PATCH] fstests: btrfs: test case to make sure seed FS is writable
after device add

[2]
[PATCH] btrfs: failed to create sprout should set back to rdonly


[PATCH] btrfs: failed to create sprout should set back to rdonly

2016-05-29 Thread Anand Jain
btrfs_init_new_device() should put the FS back to RDONLY
if init fails in the seed_device context.

Further, it adds the following cleanup:
- fix a BUG_ON() to goto the label error_trans:
- move btrfs_abort_transaction() to the label error_trans:, and
- as there is no code to undo btrfs_prepare_sprout(),
  temporarily call a BUG_ON()

Signed-off-by: Anand Jain 
---
 fs/btrfs/volumes.c | 31 +--
 1 file changed, 17 insertions(+), 14 deletions(-)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 2b88127bba5b..a637e99e4c6b 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -2351,7 +2351,8 @@ int btrfs_init_new_device(struct btrfs_root *root, char 
*device_path)
if (seeding_dev) {
sb->s_flags &= ~MS_RDONLY;
ret = btrfs_prepare_sprout(root);
-   BUG_ON(ret); /* -ENOMEM */
+   if (ret)
+   goto error_trans;
}
 
device->fs_devices = root->fs_info->fs_devices;
@@ -2398,26 +2399,20 @@ int btrfs_init_new_device(struct btrfs_root *root, char 
*device_path)
lock_chunks(root);
ret = init_first_rw_device(trans, root, device);
unlock_chunks(root);
-   if (ret) {
-   btrfs_abort_transaction(trans, root, ret);
-   goto error_trans;
-   }
+   if (ret)
+   goto error_sysfs;
}
 
ret = btrfs_add_device(trans, root, device);
-   if (ret) {
-   btrfs_abort_transaction(trans, root, ret);
-   goto error_trans;
-   }
+   if (ret)
+   goto error_sysfs;
 
if (seeding_dev) {
char fsid_buf[BTRFS_UUID_UNPARSED_SIZE];
 
ret = btrfs_finish_sprout(trans, root);
-   if (ret) {
-   btrfs_abort_transaction(trans, root, ret);
-   goto error_trans;
-   }
+   if (ret)
+   goto error_sysfs;
 
/* Sprouting would change fsid of the mounted root,
 * so rename the fsid on the sysfs
@@ -2460,10 +2455,18 @@ int btrfs_init_new_device(struct btrfs_root *root, char 
*device_path)
update_dev_time(device_path);
return ret;
 
+error_sysfs:
+   if (seeding_dev) {
+   /* undo of btrfs_prepare_sprout is missing*/
+   BUG_ON(1);
+   }
+   btrfs_sysfs_rm_device_link(root->fs_info->fs_devices, device);
 error_trans:
+   if (seeding_dev)
+   sb->s_flags |= MS_RDONLY;
+   btrfs_abort_transaction(trans, root, ret);
btrfs_end_transaction(trans, root);
rcu_string_free(device->name);
-   btrfs_sysfs_rm_device_link(root->fs_info->fs_devices, device);
kfree(device);
 error:
blkdev_put(bdev, FMODE_EXCL);
-- 
2.7.0



[PATCH] fstests: btrfs: test case to make sure seed FS is writable after device add

2016-05-29 Thread Anand Jain
Originally, when a device is added to a seed FS, the mount point
converts to writeable. However, there appears to be a regression:
in 4.6 the sprouted FS still remains read-only. Traced back
until 3.8 and the regression is still there.

The seed/sprout feature is one of the unique features of btrfs,
and interesting solutions can be developed using it.

So this test case makes sure that the original expected output is
preserved.
---
 tests/btrfs/125 | 81 +
 tests/btrfs/125.out |  1 +
 tests/btrfs/group   |  1 +
 3 files changed, 83 insertions(+)
 create mode 100755 tests/btrfs/125
 create mode 100644 tests/btrfs/125.out

diff --git a/tests/btrfs/125 b/tests/btrfs/125
new file mode 100755
index ..189d30614ad0
--- /dev/null
+++ b/tests/btrfs/125
@@ -0,0 +1,81 @@
+#! /bin/bash
+# FS QA Test No. btrfs/125
+#
+# Test BTRFS seed device add
+#
+# Steps:
+#   Create seed FS and mount
+#   Device add
+#   Check if the FS is now RW-able
+#
+#---
+# Copyright (c) 2016 Oracle.  All Rights Reserved.
+#
+# This program is free software; you can redistribute it and/or
+# modify it under the terms of the GNU General Public License as
+# published by the Free Software Foundation.
+#
+# This program is distributed in the hope that it would be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program; if not, write the Free Software Foundation,
+# Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301  USA
+#
+#---
+#
+
+seq=`basename $0`
+seqres=$RESULT_DIR/$seq
+echo "QA output created by $seq"
+
+here=`pwd`
+tmp=/tmp/$$
+status=1   # failure is the default!
+trap "_cleanup; exit \$status" 0 1 2 3 15
+
+_cleanup()
+{
+   cd /
+   rm -f $tmp.*
+}
+
+# get standard environment, filters and checks
+. ./common/rc
+. ./common/filter
+
+# real QA test starts here
+_supported_fs btrfs
+_supported_os Linux
+_require_scratch_nocheck
+_require_scratch_dev_pool 2
+
+rm -f $seqres.full
+
+_scratch_dev_pool_get 1
+_spare_dev_get
+
+_scratch_pool_mkfs >> $seqres.full 2>&1
+
+btrfstune -S 1 $SCRATCH_DEV_POOL || \
+   _fail "btrfstune failed to mark '$SCRATCH_DEV_POOL' as seed"
+
+_scratch_mount >> $seqres.full 2>&1
+
+_run_btrfs_util_prog filesystem show -m
+
+_run_btrfs_util_prog device add $SPARE_DEV "$SCRATCH_MNT"
+
+_run_btrfs_util_prog filesystem show -m
+
+touch "$SCRATCH_MNT"/tf1 || _fail "FS not Writeable"
+
+_scratch_unmount
+_spare_dev_put
+_scratch_dev_pool_put
+
+echo "Silence is golden"
+status=0
+exit
diff --git a/tests/btrfs/125.out b/tests/btrfs/125.out
new file mode 100644
index ..4f22ab0cb5e9
--- /dev/null
+++ b/tests/btrfs/125.out
@@ -0,0 +1 @@
+QA output created by 125
diff --git a/tests/btrfs/group b/tests/btrfs/group
index 1866b17aa6df..0afc82940f61 100644
--- a/tests/btrfs/group
+++ b/tests/btrfs/group
@@ -126,3 +126,4 @@
 122 auto quick snapshot qgroup
 123 auto replace
 124 auto replace
+125 auto replace
-- 
2.7.0



Re: Hot data tracking / hybrid storage

2016-05-29 Thread Chris Murphy
On Sun, May 29, 2016 at 12:03 PM, Holger Hoffstätte
 wrote:
> On 05/29/16 19:53, Chris Murphy wrote:
>> But I'm skeptical of bcache using a hidden area historically for the
>> bootloader, to put its device metadata. I didn't realize that was the
>> case. Imagine if LVM were to stuff metadata into the MBR gap, or
>> mdadm. Egads.
>
> On the matter of bcache in general this seems noteworthy:
>
> https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=4d1034eb7c2f5e32d48ddc4dfce0f1a723d28667
>
> bummer..

Well it doesn't mean no one will take it, just that no one has taken
it yet. But the future of SSD caching may only be with LVM.

-- 
Chris Murphy


Re: Resize doesnt work as expected

2016-05-29 Thread Chris Murphy
On Sun, May 29, 2016 at 12:16 PM, Peter Becker  wrote:
> 2016-05-29 19:11 GMT+02:00 Chris Murphy :
>> On Sat, May 28, 2016 at 3:42 PM, Peter Becker  wrote:
>>> Thanks for the clarification. I've probably overlooked this.
>>>
>>> But should "resize max" does not do what you expect instead of falling
>>> back on an "invisible" 1?
>>
>> How does it know what the user expects?
>
> Then simply remove the default deviceid and let the user choose what they want.

They can already choose what they want, but they have to specify it,
it's not an interactive UI. Plus the shrink case has to be considered.

What it could do is state what happened rather than completing without
any message, i.e. if devid not specifed it would say something like:

devid 1 resized from X to X

At least there's feedback. It doesn't exactly make sense to require
the most common case, single device, to have to specify the single
device, hence why devid 1 is assumed.
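
For reference, the explicit per-device form already exists, so something
like the following already disambiguates (devids and sizes illustrative):

# btrfs filesystem resize max /mnt      <- implicitly devid 1
# btrfs filesystem resize 2:max /mnt    <- explicitly devid 2
# btrfs filesystem resize 2:-10G /mnt   <- shrink devid 2 by 10GiB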


> IMHO it's a bad thing to automatically choose an option if it's not clear
> what the user wants. In particular there is no hint in the output about
> which deviceid is used.
> The output is:
>
> "Resize '/mnt' of 'max'" .. no hint the this only affect the deviceid
> 1. The suggestion for me is that the whole pool is resized to max.

That would mean the command affects all devid's at the same time. This
would require a lot more logic and safeguards for the shrink case. So
someone would need to do that work. I think such UX improvements are
happening in a separate github thread on revising the btrfs-progs
UI/UX.



>
> Possible solutions:
> 1. remove the default deviceid
> 2. resize without deviceid affects the whole pool
> 3. improve the output of the resize command by adding the deviceid
> 4. remove the inconsitent between add+remove and replace by triggering
> resize max after replace is finished.

1 negatively impacts single device setups.
2 doesn't account for shrink, where now every device is reduced by
some unknown amount and could end up a mess in some cases.
3 & 4 are reasonable

Right now the command is rather explicit with an exception for the
single device case. That's really what you're seeing here.


-- 
Chris Murphy


Re: Resize doesnt work as expected

2016-05-29 Thread Peter Becker
2016-05-29 19:11 GMT+02:00 Chris Murphy :
> On Sat, May 28, 2016 at 3:42 PM, Peter Becker  wrote:
>> Thanks for the clarification. I've probably overlooked this.
>>
>> But should "resize max" does not do what you expect instead of falling
>> back on an "invisible" 1?
>
> How does it know what the user expects?

Then simply remove the default deviceid and let the user choose what they want.
IMHO it's a bad thing to automatically choose an option if it's not clear
what the user wants. In particular there is no hint in the output about
which deviceid is used.
The output is:

"Resize '/mnt' of 'max'" .. no hint that this only affects deviceid
1. The impression this gives is that the whole pool is resized to max.

Possible solutions:
1. remove the default deviceid
2. resize without deviceid affects the whole pool
3. improve the output of the resize command by adding the deviceid
4. remove the inconsistency between add+remove and replace by triggering
resize max after replace is finished.

> I think the issue is not with the resize command, but rather the
> replace command does not include the resize max operation. Presumably
> the user intends the entire block device provided as the target for
> replacement to be used.
>
> So I think the mistake is replace assumes the user wants to use the
> same amount of space as the former block device. I think if the user
> wanted to use the former block device size on the new block device,
> they'd partition it and use the partition as the target.
>
>
> --
> Chris Murphy


Re: Hot data tracking / hybrid storage

2016-05-29 Thread Holger Hoffstätte
On 05/29/16 19:53, Chris Murphy wrote:
> But I'm skeptical of bcache using a hidden area historically for the
> bootloader, to put its device metadata. I didn't realize that was the
> case. Imagine if LVM were to stuff metadata into the MBR gap, or
> mdadm. Egads.

On the matter of bcache in general this seems noteworthy:

https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=4d1034eb7c2f5e32d48ddc4dfce0f1a723d28667

bummer..

Holger



Re: Hot data tracking / hybrid storage

2016-05-29 Thread Chris Murphy
On Sun, May 29, 2016 at 12:23 AM, Andrei Borzenkov  wrote:
> 20.05.2016 20:59, Austin S. Hemmelgarn wrote:
>> On 2016-05-20 13:02, Ferry Toth wrote:
>>> We have 4 1TB drives in MBR, 1MB free at the beginning, grub on all 4,
>>> then 8GB swap, then all the rest btrfs (no LVM used). The 4 btrfs
>>> partitions are in the same pool, which is in btrfs RAID10 format. /boot
>>> is in subvolume @boot.
>> If you have GRUB installed on all 4, then you don't actually have the
>> full 2047 sectors between the MBR and the partition free, as GRUB is
>> embedded in that space.  I forget exactly how much space it takes up,
>> but I know it's not the whole 1023.5K.  I would not suggest risking usage
>> of the final 8k there though.
>
> If you mean grub2, required space is variable and depends on where
> /boot/grub is located (i.e. which drivers it needs to access it).
> Assuming plain btrfs on legacy BIOS MBR, required space is around 40-50KB.
>
> Note that grub2 detects some post-MBR gap software signatures and skips
> over them (space need not be contiguous). It is entirely possible to add
> bcache detection if enough demand exists.

Might not be a bad idea, just to avoid it getting stepped on and
causing later confusion. If it is stepped on I don't think there's
data loss except possibly in the case where there's an unclean
shutdown where the SSD has bcache data that hasn't been committed to
the HDD?

But I'm skeptical of bcache using a hidden area historically for the
bootloader, to put its device metadata. I didn't realize that was the
case. Imagine if LVM were to stuff metadata into the MBR gap, or
mdadm. Egads.


-- 
Chris Murphy


Re: Resize doesnt work as expected

2016-05-29 Thread Chris Murphy
On Sat, May 28, 2016 at 3:42 PM, Peter Becker  wrote:
> Thanks for the clarification. I've probably overlooked this.
>
> But should "resize max" does not do what you expect instead of falling
> back on an "invisible" 1?

How does it know what the user expects?

I think the issue is not with the resize command, but rather the
replace command does not include the resize max operation. Presumably
the user intends the entire block device provided as the target for
replacement to be used.

So I think the mistake is replace assumes the user wants to use the
same amount of space as the former block device. I think if the user
wanted to use the former block device size on the new block device,
they'd partition it and use the partition as the target.


-- 
Chris Murphy


Functional difference between "replace" vs "add" then "delete missing" with a missing disk in a RAID56 array

2016-05-29 Thread Chris Johnson
Situation: A six disk RAID5/6 array with a completely failed disk. The
failed disk is removed and an identical replacement drive is plugged
in.

Here I have two options for replacing the disk, assuming the old drive
is device 6 in the superblock and the replacement disk is /dev/sda.

'btrfs replace start 6 /dev/sda /mnt'
This will start a rebuild of the array using the new drive, copying
data that would have been on device 6 to the new drive from the parity
data.

btrfs add /dev/sda /mnt && btrfs device delete missing /mnt
This adds a new device (the replacement disk) to the array and dev
delete missing appears to trigger a rebalance before deleting the
missing disk from the array. The end result appears to be identical to
option 1.

A few weeks back I recovered an array with a failed drive using
'delete missing' because 'replace' caused a kernel panic. I later
discovered that this was not (just) a failed drive but some other
failed hardware that I've yet to start diagnosing. Either motherboard
or HBA. The drives are in a new server now and I am currently
rebuilding the array with 'replace', which I believe is the "more
correct" way to replace a bad drive in an array.

Both work, but 'replace' seems to be slower so I'm curious what the
functional differences are between the two. I thought the replace
would be faster as I assumed it would need to read fewer blocks since
instead of a complete rebalance it's just rebuilding a drive from
parity data.

What are the differences between the two under the hood? The only
obvious difference I could see is that when I ran `replace` the space
on the replacement drive was instantly allocated under 'filesystem
show' while when I used 'device delete' the drive usage slowly crept
up through the course of the rebalance.


Re: Some ideas for improvements

2016-05-29 Thread Dmitry Katsubo
On 2016-05-25 21:03, Duncan wrote:
> Dmitry Katsubo posted on Wed, 25 May 2016 16:45:41 +0200 as excerpted:
>> * Would be nice if 'btrfs scrub status' shows estimated finishing time
>> (ETA) and throughput (in Mb/s).
> 
> That might not be so easy to implement.  (Caveat, I'm not a dev, just a 
> btrfs user and list regular, so if a dev says different...)
> 
> Currently, a running scrub simply outputs progress to a file (/var/lib/
> btrfs/scrub.status.), and scrub status is simply a UI to pretty-
> print that file.  Note that there's nothing in there which lists the 
> total number of extents or bytes to go -- that's not calculated ahead of 
> time.
> 
> So implementing some form of percentage done or eta is likely to increase 
> the processing time dramatically, as it could involve doing a dry-run 
first, in order to get the total figures against which to calculate
> percentage done.

Indeed, this cannot (should not) be done at the user-space level: the kernel
module should provide that information. I am not a dev :) but I think the
module should know the number of extents; at least something is shown in
the "btrfs fi usage ..." output.

The information doesn't have to be 100% exact, but at least some indication
would be great. In the worst case the module can remember the last scrub
time and make an estimate based on that (similar to how some CD burning
utilities do).

>> * Not possible to start scrub for all devices in the volume without
>> mounting it.
> 
> Interesting.  It's news to me that you can scrub individual devices 
> without mounting.  But given that, this would indeed be a useful feature, 
> and given that btrfs filesystem show can get the information, scrub 
> should be able to get and make use of it as well. =:^)

Moreover, I fell into a trap when I tried to use the "btrfs scrub start /dev/..."
syntax, as it only scrubs the given device. When I scrubbed the whole
volume after mounting it, the result was different. I understood it only
after reading man btrfs-scrub more attentively:

   start ... <path>|<device>

      Start a scrub on all devices of the filesystem identified by <path>
      or on a single <device>.

Other (shorter) forms of help misled me, giving the impression that
it does not matter whether I specify a path or device.
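
In other words (mount point and device name here are just examples):

# btrfs scrub start /mnt        <- scrubs every device of the mounted filesystem
# btrfs scrub start /dev/sdb    <- scrubs only that single device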

On 2016-05-26 00:05, Duncan wrote:
> Nicholas D Steeves posted on Wed, 25 May 2016 16:36:13 -0400 as excerpted:
>> On 25 May 2016 at 15:03, Duncan <1i5t5.dun...@cox.net> wrote:
>>> Dmitry Katsubo posted on Wed, 25 May 2016 16:45:41 +0200 as excerpted:
 btrfs-restore [needs an o]ption that applies (y) to all questions
 (completely unattended recovery)
>>>
>>> That['s] a known sore spot that a lot of people have complained
>>> about.
> 
>> I'm surprised no one has mentioned, in any of these discussions, what I
>> believe is the standard method of providing this functionality:
>> yes | btrfs-restore -options /dev/disk
> 
> Good point.
> 
> I didn't bring it up because while I've used btrfs restore a few times, 
> my btrfs are all on relatively small SSD partitions, so I both needed 
> less y's, and the total time per restore is a few minutes, not hours, so 
> it wasn't a big deal.  As a result, while I know of yes, I didn't need to 
> think about automation, and as I never used it, it didn't occur to me to 
> suggest it for others.

Thanks for the advice, Nicholas. Last time I tried it I used the following
command:

while true; do echo y; done | btrfs restore -voxmSi /dev/sda /mnt/tmp &> 
btrfs_restore &

which presumably is equivalent to what you suggest. The command was in
"running" state in "jobs" output for a while, but then turned into
"waiting" state and did not progress. I suspect that btrfs-restore
somehow reads directly from the terminal, not from stdin. I will try the
solution with "yes | btrfs-restore..." once I get a chance.

-- 
With best regards,
Dmitry


Re: [PULL] Btrfs for 4.7, part 2

2016-05-29 Thread Chris Mason

On Sat, May 28, 2016 at 01:14:13PM +0800, Anand Jain wrote:



On 05/27/2016 11:42 PM, Chris Mason wrote:

I'm getting errors from btrfs fi show -d, after the very last round of
device replaces.  A little extra debugging:

bytenr mismatch, want=4332716032, have=0
ERROR: cannot read chunk root
ERROR reading /dev/vdh
failed /dev/vdh

Which is cute because the very next command we run fscks /dev/vdh and
succeeds.


Checked the code paths of both btrfs fi show -d and btrfs check;
both call flush during their respective open_ctree in progs.

However the flush is called after we have read the superblock. That
means the superblock read during the 'show' cli (only) happens without
a flush, while 'check' doesn't have that problem because test 011 calls
'check' after 'show'. But it still does not explain the above error,
which is during open_ctree, not at superblock read. It remains a strange
case as of now.


It's because we're just not done writing it out yet when btrfs fi show is run.
I think replace is special here.



Also. I can't reproduce.



I'm in a relatively new test rig using kvm, which probably explains why
I haven't seen it before.  You can probably make it easier by adding
a sleep inside the actual __free_device() func.


So the page cache is stale and this isn't related to any of our patches.


close_ctree() calls into btrfs_close_devices(), which calls
btrfs_close_one_device(), which uses:

call_rcu(&device->rcu, free_device);

close_ctree() also does an rcu_barrier() to make sure and wait for
free_device() to finish.

But, free_device() just puts the work into schedule_work(), so we don't
know for sure the blkdev_put is done when we exit.


Right, saw that before. Any idea why it's like that? Or if it
should be fixed?


It's just trying to limit the work that is done from call_rcu, and it should
definitely be fixed.  It might cause EBUSY or other problems.  Probably
easiest to add a counter or completion object that gets changed by the
__free_device function.

-chris


Re: [PATCH v3 16/22] btrfs-progs: convert: Introduce function to migrate reserved ranges

2016-05-29 Thread Qu Wenruo



On 05/28/2016 11:16 AM, Liu Bo wrote:

On Fri, Jan 29, 2016 at 01:03:26PM +0800, Qu Wenruo wrote:

Introduce a new function, migrate_reserved_ranges(), to migrate used fs
data in the btrfs reserved space.

Unlike the old implementation, which needed to handle all the complicated
csum and reference relocation, previous patches already ensure such
reserved ranges won't be allocated.
So here we only need to copy the data out and create new
extent/csum/reference items.

Signed-off-by: Qu Wenruo 
Signed-off-by: David Sterba 
---
 btrfs-convert.c | 124 +++-
 1 file changed, 122 insertions(+), 2 deletions(-)

diff --git a/btrfs-convert.c b/btrfs-convert.c
index 16e2309..f6126db 100644
--- a/btrfs-convert.c
+++ b/btrfs-convert.c
@@ -1679,6 +1679,123 @@ static int create_image_file_range_v2(struct 
btrfs_trans_handle *trans,
return ret;
 }

+
+/*
+ * Relocate old fs data in one reserved ranges
+ *
+ * Since all old fs data in reserved range is not covered by any chunk nor
+ * data extent, we don't need to handle any reference but add new
+ * extent/reference, which makes codes more clear
+ */
+static int migrate_one_reserved_range(struct btrfs_trans_handle *trans,
+ struct btrfs_root *root,
+ struct cache_tree *used,
+ struct btrfs_inode_item *inode, int fd,
+ u64 ino, u64 start, u64 len, int datacsum)
+{
+   u64 cur_off = start;
+   u64 cur_len = len;
+   struct cache_extent *cache;
+   struct btrfs_key key;
+   struct extent_buffer *eb;
+   int ret = 0;
+
+   while (cur_off < start + len) {
+   cache = lookup_cache_extent(used, cur_off, cur_len);
+   if (!cache)
+   break;
+   cur_off = max(cache->start, cur_off);
+   cur_len = min(cache->start + cache->size, start + len) -
+ cur_off;
+   BUG_ON(cur_len < root->sectorsize);
+
+   /* reserve extent for the data */
+   ret = btrfs_reserve_extent(trans, root, cur_len, 0, 0, (u64)-1,
+  &key, 1);
+   if (ret < 0)
+   break;
+
+   eb = malloc(sizeof(*eb) + cur_len);
+   if (!eb) {
+   ret = -ENOMEM;
+   break;
+   }
+
+   ret = pread(fd, eb->data, cur_len, cur_off);
+   if (ret < cur_len) {
+   ret = (ret < 0 ? ret : -EIO);
+   free(eb);
+   break;
+   }
+   eb->start = key.objectid;
+   eb->len = key.offset;
+
+   /* Write the data */
+   ret = write_and_map_eb(trans, root, eb);
+   free(eb);
+   if (ret < 0)
+   break;


With write_data_to_disk(), we don't have to create an eb for the write.

Thanks,

-liubo


Nice advice.

I didn't remember whether write_data_to_disk() was there when the
patchset was written, but it's always a good idea to get rid of the
temporary eb.


Thanks,
Qu


+
+   /* Now handle extent item and file extent things */
+   ret = btrfs_record_file_extent(trans, root, ino, inode, cur_off,
+  key.objectid, key.offset);
+   if (ret < 0)
+   break;
+   /* Finally, insert csum items */
+   if (datacsum)
+   ret = csum_disk_extent(trans, root, key.objectid,
+  key.offset);
+
+   cur_off += key.offset;
+   cur_len = start + len - cur_off;
+   }
+   return ret;
+}
+
+/*
+ * Relocate the used ext2 data in reserved ranges
+ * [0,1M)
+ * [btrfs_sb_offset(1), +BTRFS_STRIPE_LEN)
+ * [btrfs_sb_offset(2), +BTRFS_STRIPE_LEN)
+ */
+static int migrate_reserved_ranges(struct btrfs_trans_handle *trans,
+  struct btrfs_root *root,
+  struct cache_tree *used,
+  struct btrfs_inode_item *inode, int fd,
+  u64 ino, u64 total_bytes, int datacsum)
+{
+   u64 cur_off;
+   u64 cur_len;
+   int ret = 0;
+
+   /* 0 ~ 1M */
+   cur_off = 0;
+   cur_len = 1024 * 1024;
+   ret = migrate_one_reserved_range(trans, root, used, inode, fd, ino,
+cur_off, cur_len, datacsum);
+   if (ret < 0)
+   return ret;
+
+   /* second sb(fisrt sb is included in 0~1M) */
+   cur_off = btrfs_sb_offset(1);
+   cur_len = min(total_bytes, cur_off + BTRFS_STRIPE_LEN) - cur_off;
+   if (cur_off < total_bytes)
+   return ret;
+   ret = 

Re: [PATCH v3 21/22] btrfs-progs: convert: Strictly avoid meta or system chunk allocation

2016-05-29 Thread Qu Wenruo



On 05/28/2016 11:30 AM, Liu Bo wrote:

On Fri, Jan 29, 2016 at 01:03:31PM +0800, Qu Wenruo wrote:

Before this patch, btrfs-convert only relies on a large enough initial
system/metadata chunk size to ensure no newer system/meta chunk will be
created.

But that's not safe enough. So add two new members in fs_info, the
avoid_sys/meta_chunk_alloc flags, to prevent any newer system or meta
chunks from being created before init_btrfs_v2().

Signed-off-by: Qu Wenruo 
Signed-off-by: David Sterba 
---
 btrfs-convert.c |  9 +
 ctree.h |  3 +++
 extent-tree.c   | 10 ++
 3 files changed, 22 insertions(+)

diff --git a/btrfs-convert.c b/btrfs-convert.c
index efa3b02..333f413 100644
--- a/btrfs-convert.c
+++ b/btrfs-convert.c
@@ -2322,6 +2322,13 @@ static int init_btrfs_v2(struct btrfs_mkfs_config *cfg, 
struct btrfs_root *root,
struct btrfs_fs_info *fs_info = root->fs_info;
int ret;

+   /*
+* Don't alloc any metadata/system chunk, as we don't want
+* any meta/sys chunk allcated before all data chunks are inserted.
+* Or we screw up the chunk layout just like the old implement.
+*/


I don't get this, with this patch set, we can allocate data from DATA chunk,
allocate metadata from METADATA chunk, but then we're not allowed to allocate 
new chunks?

Thanks,


For new convert, we are going through the following steps:
1) Create initial meta/sys chunks into unused space, manually
2) Open fs
3) Insert data chunks to cover all ext* used data
4) Do per inode copying

The whole patchset relies on a key assumption: all data chunks are already
allocated to cover all ext* used data, so new chunk/extent allocation
can follow the normal routine.


Before that, only chunks created in step 1) are completely safe, and new
chunk allocations before step 3) are all unsafe, as the key assumption is
not met yet.


So, until step 3), we must not allocate any new data/metadata chunks.

Thanks,
Qu


-liubo

+   fs_info->avoid_sys_chunk_alloc = 1;
+   fs_info->avoid_meta_chunk_alloc = 1;
trans = btrfs_start_transaction(root, 1);
BUG_ON(!trans);
ret = btrfs_fix_block_accounting(trans, root);
@@ -2359,6 +2366,8 @@ static int init_btrfs_v2(struct btrfs_mkfs_config *cfg, 
struct btrfs_root *root,
goto err;

ret = btrfs_commit_transaction(trans, root);
+   fs_info->avoid_sys_chunk_alloc = 0;
+   fs_info->avoid_meta_chunk_alloc = 0;
 err:
return ret;
 }
diff --git a/ctree.h b/ctree.h
index 1443746..187bd27 100644
--- a/ctree.h
+++ b/ctree.h
@@ -1030,6 +1030,9 @@ struct btrfs_fs_info {
unsigned int quota_enabled:1;
unsigned int suppress_check_block_errors:1;
unsigned int ignore_fsid_mismatch:1;
+   unsigned int avoid_meta_chunk_alloc:1;
+   unsigned int avoid_sys_chunk_alloc:1;
+

int (*free_extent_hook)(struct btrfs_trans_handle *trans,
struct btrfs_root *root,
diff --git a/extent-tree.c b/extent-tree.c
index 93b1945..e7c61b1 100644
--- a/extent-tree.c
+++ b/extent-tree.c
@@ -1904,6 +1904,16 @@ static int do_chunk_alloc(struct btrfs_trans_handle 
*trans,
thresh)
return 0;

+   /*
+* Avoid allocating given chunk type
+*/
+   if (extent_root->fs_info->avoid_meta_chunk_alloc &&
+   (flags & BTRFS_BLOCK_GROUP_METADATA))
+   return 0;
+   if (extent_root->fs_info->avoid_sys_chunk_alloc &&
+   (flags & BTRFS_BLOCK_GROUP_SYSTEM))
+   return 0;
+
ret = btrfs_alloc_chunk(trans, extent_root, &start, &num_bytes,
space_info->flags);
if (ret == -ENOSPC) {
--
2.7.0





Re: [PATCH v3 05/22] btrfs-progs: Introduce function to setup temporary superblock

2016-05-29 Thread Qu Wenruo



On 05/28/2016 11:04 AM, Liu Bo wrote:

On Fri, Jan 29, 2016 at 01:03:15PM +0800, Qu Wenruo wrote:

Introduce a new function, setup_temp_super(), to setup temporary super
for make_btrfs_v2().

Signed-off-by: Qu Wenruo 
Signed-off-by: David Sterba 
---
 utils.c | 117 
 1 file changed, 117 insertions(+)

diff --git a/utils.c b/utils.c
index bc10293..ed5476d 100644
--- a/utils.c
+++ b/utils.c
@@ -212,6 +212,98 @@ static int reserve_free_space(struct cache_tree 
*free_tree, u64 len,
return 0;
 }

+static inline int write_temp_super(int fd, struct btrfs_super_block *sb,
+  u64 sb_bytenr)
+{
+   u32 crc = ~(u32)0;
+   int ret;
+
+   crc = btrfs_csum_data(NULL, (char *)sb + BTRFS_CSUM_SIZE, crc,
+ BTRFS_SUPER_INFO_SIZE - BTRFS_CSUM_SIZE);
+   btrfs_csum_final(crc, (char *)&sb->csum[0]);
+   ret = pwrite(fd, sb, BTRFS_SUPER_INFO_SIZE, sb_bytenr);
+   if (ret < BTRFS_SUPER_INFO_SIZE)
+   ret = (ret < 0 ? -errno : -EIO);
+   else
+   ret = 0;
+   return ret;
+}
+
+/*
+ * Setup temporary superblock at cfg->super_bynter
+ * Needed info are extracted from cfg, and root_bytenr, chunk_bytenr
+ *
+ * For now sys chunk array will be empty and dev_item is empty
+ * too.
+ * They will be re-initialized at temp chunk tree setup.
+ */
+static int setup_temp_super(int fd, struct btrfs_mkfs_config *cfg,
+   u64 root_bytenr, u64 chunk_bytenr)
+{
+   unsigned char chunk_uuid[BTRFS_UUID_SIZE];
+   char super_buf[BTRFS_SUPER_INFO_SIZE];
+   struct btrfs_super_block *super = (struct btrfs_super_block *)super_buf;
+   int ret;
+
+   /*
+* We rely on cfg->chunk_uuid and cfg->fs_uuid to pass uuid
+* for other functions.
+* Caller must allocation space for them
+*/
+   BUG_ON(!cfg->chunk_uuid || !cfg->fs_uuid);
+   memset(super_buf, 0, BTRFS_SUPER_INFO_SIZE);
+   cfg->num_bytes = round_down(cfg->num_bytes, cfg->sectorsize);
+
+   if (cfg->fs_uuid && *cfg->fs_uuid) {
+   if (uuid_parse(cfg->fs_uuid, super->fsid) != 0) {
+   error("cound not parse UUID: %s", cfg->fs_uuid);
+   ret = -EINVAL;
+   goto out;
+   }
+   if (!test_uuid_unique(cfg->fs_uuid)) {
+   error("non-unique UUID: %s", cfg->fs_uuid);
+   ret = -EINVAL;
+   goto out;
+   }
+   } else {
+   uuid_generate(super->fsid);
+   uuid_unparse(super->fsid, cfg->fs_uuid);
+   }
+   uuid_generate(chunk_uuid);
+   uuid_unparse(chunk_uuid, cfg->chunk_uuid);
+
+   btrfs_set_super_bytenr(super, cfg->super_bytenr);
+   btrfs_set_super_num_devices(super, 1);
+   btrfs_set_super_magic(super, BTRFS_MAGIC);
+   btrfs_set_super_generation(super, 1);
+   btrfs_set_super_root(super, root_bytenr);
+   btrfs_set_super_chunk_root(super, chunk_bytenr);
+   btrfs_set_super_total_bytes(super, cfg->num_bytes);
+   /*
+* Temporary btrfs will only has 6 tree roots:
+* chunk tree, root tree, extent_tree, device tree, fs tree
+* and csum tree.
+*/
+   btrfs_set_super_bytes_used(super, 6 * cfg->nodesize);
+   btrfs_set_super_sectorsize(super, cfg->sectorsize);
+   btrfs_set_super_leafsize(super, cfg->nodesize);
+   btrfs_set_super_nodesize(super, cfg->nodesize);
+   btrfs_set_super_stripesize(super, cfg->stripesize);
+   btrfs_set_super_csum_type(super, BTRFS_CSUM_TYPE_CRC32);
+   btrfs_set_super_chunk_root(super, chunk_bytenr);
+   btrfs_set_super_cache_generation(super, -1);
+   btrfs_set_super_incompat_flags(super, cfg->features);
+   if (cfg->label)
+   strncpy(super->label, cfg->label, BTRFS_LABEL_SIZE - 1);


Why not use __strncpy_null?

Thanks,

-liubo


Good idea, I'll add a new patch to use it.

Thanks,
Qu



+
+   /* Sys chunk array will be re-initialized at chunk tree init time */
+   super->sys_chunk_array_size = 0;
+
+   ret = write_temp_super(fd, super, cfg->super_bytenr);
+out:
+   return ret;
+}
+
 /*
  * Improved version of make_btrfs().
  *
@@ -230,6 +322,10 @@ static int make_convert_btrfs(int fd, struct 
btrfs_mkfs_config *cfg,
struct cache_tree *used = >used;
u64 sys_chunk_start;
u64 meta_chunk_start;
+   /* chunk tree bytenr, in system chunk */
+   u64 chunk_bytenr;
+   /* metadata trees bytenr, in metadata chunk */
+   u64 root_bytenr;
int ret;

/* Shouldn't happen */
@@ -260,6 +356,27 @@ static int make_convert_btrfs(int fd, struct 
btrfs_mkfs_config *cfg,
if (ret < 0)
goto out;

+   /*
+* Inside the allocate metadata chunk, its layout will be:
+   

Re: [PATCH] btrfs,vfs: allow FILE_EXTENT_SAME on a file opened ro

2016-05-29 Thread Andrei Borzenkov
29.05.2016 03:56, Zygo Blaxell wrote:
>>
>> I don't think this can happen on btrfs: the superblock is updated only after
>> a barrier when both the data and extent refs are already on the disk.
> 
> If and only if the filesystem is mounted with the flushoncommit option,
> that's true.  This is not the default, though, and I lost a fair amount
> of time and data before I discovered this.
> 

According to the wiki, this is the default on "reasonably recent kernels";
unfortunately it does not say which kernel is recent enough. I am surprised
it can be disabled at all.
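
For anyone who would rather set it explicitly than rely on the default, it
is an ordinary mount option (device and mount point are placeholders):

# mount -o flushoncommit /dev/sdX /mnt
# mount -o remount,flushoncommit /mnt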





Re: Hot data tracking / hybrid storage

2016-05-29 Thread Andrei Borzenkov
20.05.2016 20:59, Austin S. Hemmelgarn wrote:
> On 2016-05-20 13:02, Ferry Toth wrote:
>> We have 4 1TB drives in MBR, 1MB free at the beginning, grub on all 4,
>> then 8GB swap, then all the rest btrfs (no LVM used). The 4 btrfs
>> partitions are in the same pool, which is in btrfs RAID10 format. /boot
>> is in subvolume @boot.
> If you have GRUB installed on all 4, then you don't actually have the
> full 2047 sectors between the MBR and the partition free, as GRUB is
> embedded in that space.  I forget exactly how much space it takes up,
> but I know it's not the whole 1023.5K.  I would not suggest risking usage
> of the final 8k there though.

If you mean grub2, required space is variable and depends on where
/boot/grub is located (i.e. which drivers it needs to access it).
Assuming plain btrfs on legacy BIOS MBR, required space is around 40-50KB.

Note that grub2 detects some post-MBR gap software signatures and skips
over them (space need not be contiguous). It is entirely possible to add
bcache detection if enough demand exists.