Re: btrfs receive bigger than original snapshot?

2015-09-23 Thread Filipe David Manana
On Tue, Sep 22, 2015 at 9:04 PM, Hugo Mills  wrote:
> On Tue, Sep 22, 2015 at 09:52:19PM +0200, carlo von lynX wrote:
>> Hello, it's me again. This time I searched the web to make sure
>> I'm not making another beginner's mistake. I'm still not on the
>> list, so please keep me in cc: on replies.
>>
>> I have optimized a btrfs subvolume with a script* that reflinks
>> all files with identical contents, then I did a read-only snap
>> and fed it to send/receive. The bad news: on the receiving
>> side the same snapshot grew from 5.5G to 7.1G.

So that's likely because you have files with holes. Right now, when a
hole exists in a file, the send stream contains an instruction to
write zeroes into the file instead of a punch hole instruction. So
imagine a file with a 1GB hole: the send stream makes the receiver
write 1GB of zeroes, wasting a lot of space (and time).

There's a patchset, over a year old, that adds hole punching support to
the send stream along with a few other features, but it was never picked
up by Josef at the time (when he was maintaining the integration branch)
nor by Chris.
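
To put numbers on the cost of writing zeroes instead of punching holes, here is a small back-of-the-envelope model. It is only a sketch: the function names and the extent layout are made up for illustration, and it does not reflect the actual send stream format.

```python
GIB = 1 << 30


def holes(file_size, data_extents):
    """Return the (offset, length) gaps not covered by any data extent.

    data_extents is a list of non-overlapping (offset, length) pairs.
    """
    gaps = []
    pos = 0
    for offset, length in sorted(data_extents):
        if offset > pos:
            gaps.append((pos, offset - pos))
        pos = offset + length
    if pos < file_size:
        gaps.append((pos, file_size - pos))
    return gaps


def bytes_written_by_receiver(file_size, data_extents, punch_holes):
    """Bytes the receiving side has to write for one file (toy model)."""
    data = sum(length for _, length in data_extents)
    if punch_holes:
        return data  # holes stay unallocated on the receiver
    zeroes = sum(length for _, length in holes(file_size, data_extents))
    return data + zeroes  # holes are materialised as zeroes


# Hypothetical file: 4 KiB of data on each side of a 1 GiB hole.
extents = [(0, 4096), (4096 + GIB, 4096)]
size = GIB + 8192
print(bytes_written_by_receiver(size, extents, punch_holes=False))  # 1073750016
print(bytes_written_by_receiver(size, extents, punch_holes=True))   # 8192
```

In this model the receiver writes the whole apparent file size when holes are sent as zeroes, and only the 8 KiB of real data when they are punched.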

>
> That's something I'd definitely expect it to be able to do. If it's
> not doing it, I'd say there's something wrong. cc'ing Filipe, who is,
> I think, currently the local expert on send/receive.
>
>> I assume send/receive does not support one of the coolest
>> btrfs features ever.. reflinks. Didn't find any mention on this
>> on https://btrfs.wiki.kernel.org/index.php/Incremental_Backup
>> or other pages. Is there any documentation that would explain
>> to me why this has to be or is it just a missing feature that
>> someone someday may find the time to add?
>>
>> Generally I find it odd that btrfs receive would not recreate
>> an identical clone of the original snapshot, that would also
>> allow me to continue working on a backup hard disk, then merge
>> the changes back to the main disk. Instead I have to decide
>> which device contains the master copy for all times and never
>> make rw snapshots elsewhere. What if the master disk dies?
>> Then I can turn a backup into the new master but I will have
>> to re-bootstrap all other backups as they will not accept the
>> non-identical parent snapshot.
>
> That's a known drawback, and one that's been discussed on this list
> already. It's fixable (within some limits), but requires a change to
> the send stream format. (See my analysis below).
>
>> Apparently I'm not the only one that thought this to be a
>> defect rather than a design choice:
>> http://www.spinics.net/lists/linux-btrfs/msg45175.html
>>
>> This actually confused me (in particular the absence of responses
>> to that mail), that's why I have btrfs-progs 4.0 installed...
>> but in the meantime I figured out that I expected send/receive
>> to be bidirectional. So my question in this case.. is there a
>> higher reasoning for the inexactness of send/receive transfers?
>
> It's about tracking enough metadata to be sure that the send (or
> the receive) is actually feasible. See
> http://www.spinics.net/lists/linux-btrfs/msg44089.html for my analysis
> of the problem, and (theoretical) suggestions for what the solution
> should look like.
>
>> And another classic: since the output size of the snapshot copy
>> is unpredictable, running out of disk space can be frequent.
>> Wouldn't it be cool if receive could resume rather than restarting
>> from scratch?
>
> Resuming is a bit tricky -- how do you know where to resume from?
> Bear in mind that send simply writes its results to stdout, so it has
> no knowledge of anything on the receiving side. In fact, the receiving
> side may not even exist at the point that the send stream is created.
>
> Hugo.
>
>> But maybe I still got it all wrong in my head. If these things
>> are FAQs, please add them to the FAQ document. In particular some
>> criteria to decide when rsync is actually a more suitable tool
>> over send/receive, which apparently under some circumstances is
>> the case. In some other cases, git can be the better suited tool.
>>
>> Still I am very glad that you created a new alternative for data
>> organization between the extremes of reckless rsync and overly
>> accurate git. It's just a steep learning mountain.
>>
>>
>> *) I ran fdupes' output through a perl script that calls
>>    "cp --reflink" for each match. Would "bedup" or "duperemove"
>>    do a better job? bedup looks like a better long-term solution.
>>
>>

Re: [PATCH] btrfs: Fix no space bug caused by removing bg

2015-09-22 Thread Filipe David Manana
On Tue, Sep 22, 2015 at 12:24 PM, Zhao Lei  wrote:
> Hi, Filipe David Manana
>
>> -----Original Message-----
>> From: linux-btrfs-ow...@vger.kernel.org
>> [mailto:linux-btrfs-ow...@vger.kernel.org] On Behalf Of Filipe David Manana
>> Sent: Tuesday, September 22, 2015 6:22 PM
>> To: Zhao Lei 
>> Cc: linux-btrfs@vger.kernel.org
>> Subject: Re: [PATCH] btrfs: Fix no space bug caused by removing bg
>>
>> On Tue, Sep 22, 2015 at 11:06 AM, Zhao Lei  wrote:
>> > Hi, Filipe David Manana
>> >
>> > Thanks for reviewing this patch.
>> >
>> >> -----Original Message-----
>> >> From: Filipe David Manana [mailto:fdman...@gmail.com]
>> >> Sent: Monday, September 21, 2015 9:27 PM
>> >> To: Zhao Lei 
>> >> Cc: linux-btrfs@vger.kernel.org
>> >> Subject: Re: [PATCH] btrfs: Fix no space bug caused by removing bg
>> >>
>> >> On Mon, Sep 21, 2015 at 1:59 PM, Zhao Lei wrote:
>> >> > btrfs in v4.3-rc1 failed many xfstests items with '-o nospace_cache'
>> >> > mount option.
>> >> >
>> >> > Failed cases are:
>> >> >
>> >> > btrfs/008,016,019,020,026,027,028,029,031,041,046,048,050,051,053,054,
>> >> >  077,083,084,087,092,094
>> >>
>> >> Hi Zhao,
>> >>
>> >> So far I tried a few of those against Chris' integration-4.3 branch
>> >> (same btrfs code as 4.3-rc1):
>> >>
>> >> MOUNT_OPTIONS="-o nospace_cache" ./check btrfs/008 btrfs/016 btrfs/019 btrfs/020
>> >> FSTYP -- btrfs
>> >> PLATFORM  -- Linux/x86_64 debian3 4.2.0-rc5-btrfs-next-12+
>> >> MKFS_OPTIONS  -- /dev/sdc
>> >> MOUNT_OPTIONS -- -o nospace_cache /dev/sdc /home/fdmanana/btrfs-tests/scratch_1
>> >>
>> >> btrfs/008 2s ... 1s
>> >> btrfs/016 4s ... 3s
>> >> btrfs/019 4s ... 2s
>> >> btrfs/020 2s ... 1s
>> >> Ran: btrfs/008 btrfs/016 btrfs/019 btrfs/020 Passed all 4 tests
>> >>
>> >> And none of the tests failed...
>> >>
>> > Sorry, I hadn't pasted the details of my test command.
>> >
>> > I hit this by accident, with steps that differ somewhat from the
>> > standard ones (yours): I mounted the fs with -o nospace_cache manually
>> > without setting MOUNT_OPTIONS, so xfstests took a special path and
>> > triggered the bug:
>> >   export TEST_DEV='/dev/sdb5'
>> >   export TEST_DIR='/var/ltf/tester/mnt'
>> >   mkdir -p '/var/ltf/tester/mnt'
>> >
>> >   export SCRATCH_DEV_POOL='/dev/sdb6 /dev/sdb7 /dev/sdb8 /dev/sdb9 /dev/sdb10 /dev/sdb11'
>> >   export SCRATCH_MNT='/var/ltf/tester/scratch_mnt'
>> >   mkdir -p '/var/ltf/tester/scratch_mnt'
>> >
>> >   export DIFF_LENGTH=0
>> >
>> >   mkfs.btrfs -f "$TEST_DEV"
>> >   mount -o nospace_cache "$TEST_DEV" "$TEST_DIR"
>> >
>> >   ./check generic/014
>> >
>> > Result:
>> >   FSTYP -- btrfs
>> >   PLATFORM  -- Linux/x86_64 lenovo 4.3.0-rc2_HEAD_1f93e4a96c9109378204c147b3eec0d0e8100fde_
>> >   MKFS_OPTIONS  -- /dev/sdb6
>> >   MOUNT_OPTIONS -- /dev/sdb6 /var/ltf/tester/scratch_mnt
>> >
>> >   generic/014 0s ... - output mismatch (see /var/lib/xfstests/results//generic/014.out.bad)
>> >   --- tests/generic/014.out   2015-09-22 17:46:13.855391451 +0800
>> >   +++ /var/lib/xfstests/results//generic/014.out.bad  2015-09-22 17:57:06.446095748 +0800
>> >   @@ -3,4 +3,5 @@
>> >--
>> >test 1
>> >--
>> >   -OK
>> >   +truncfile returned 1 : "write: No space left on device
>> >   +Seed = 1442915826 (use "-s 1442915826" to re-execute this test)"
>> >   Ran: generic/014
>> >   Failures: generic/014
>> >   Failed 1 of 1 tests
>> >
>> > The following script comes from tracing the above test.
>> > Maybe I can remove the xfstests description, because those are not the
>> > standard steps.
>> >
>> >> >
>> >> > generic/004,010,014,023,024,074,075,080,086,087,089,091,092,100,112,123,
>> >> > 124,125,126,127,131,133,192,193,198

Re: [PATCH] btrfs: Fix no space bug caused by removing bg

2015-09-22 Thread Filipe David Manana
On Tue, Sep 22, 2015 at 11:06 AM, Zhao Lei  wrote:
> Hi, Filipe David Manana
>
> Thanks for reviewing this patch.
>
>> -----Original Message-----
>> From: Filipe David Manana [mailto:fdman...@gmail.com]
>> Sent: Monday, September 21, 2015 9:27 PM
>> To: Zhao Lei 
>> Cc: linux-btrfs@vger.kernel.org
>> Subject: Re: [PATCH] btrfs: Fix no space bug caused by removing bg
>>
>> On Mon, Sep 21, 2015 at 1:59 PM, Zhao Lei  wrote:
>> > btrfs in v4.3-rc1 failed many xfstests items with '-o nospace_cache'
>> > mount option.
>> >
>> > Failed cases are:
>> >
>> > btrfs/008,016,019,020,026,027,028,029,031,041,046,048,050,051,053,054,
>> >  077,083,084,087,092,094
>>
>> Hi Zhao,
>>
>> So far I tried a few of those against Chris' integration-4.3 branch
>> (same btrfs code as 4.3-rc1):
>>
>> MOUNT_OPTIONS="-o nospace_cache" ./check btrfs/008 btrfs/016 btrfs/019 btrfs/020
>> FSTYP -- btrfs
>> PLATFORM  -- Linux/x86_64 debian3 4.2.0-rc5-btrfs-next-12+
>> MKFS_OPTIONS  -- /dev/sdc
>> MOUNT_OPTIONS -- -o nospace_cache /dev/sdc /home/fdmanana/btrfs-tests/scratch_1
>>
>> btrfs/008 2s ... 1s
>> btrfs/016 4s ... 3s
>> btrfs/019 4s ... 2s
>> btrfs/020 2s ... 1s
>> Ran: btrfs/008 btrfs/016 btrfs/019 btrfs/020 Passed all 4 tests
>>
>> And none of the tests failed...
>>
> Sorry, I hadn't pasted the details of my test command.
>
> I hit this by accident, with steps that differ somewhat from the standard
> ones (yours): I mounted the fs with -o nospace_cache manually without
> setting MOUNT_OPTIONS, so xfstests took a special path and triggered the bug:
>   export TEST_DEV='/dev/sdb5'
>   export TEST_DIR='/var/ltf/tester/mnt'
>   mkdir -p '/var/ltf/tester/mnt'
>
>   export SCRATCH_DEV_POOL='/dev/sdb6 /dev/sdb7 /dev/sdb8 /dev/sdb9 /dev/sdb10 /dev/sdb11'
>   export SCRATCH_MNT='/var/ltf/tester/scratch_mnt'
>   mkdir -p '/var/ltf/tester/scratch_mnt'
>
>   export DIFF_LENGTH=0
>
>   mkfs.btrfs -f "$TEST_DEV"
>   mount -o nospace_cache "$TEST_DEV" "$TEST_DIR"
>
>   ./check generic/014
>
> Result:
>   FSTYP -- btrfs
>   PLATFORM  -- Linux/x86_64 lenovo 4.3.0-rc2_HEAD_1f93e4a96c9109378204c147b3eec0d0e8100fde_
>   MKFS_OPTIONS  -- /dev/sdb6
>   MOUNT_OPTIONS -- /dev/sdb6 /var/ltf/tester/scratch_mnt
>
>   generic/014 0s ... - output mismatch (see /var/lib/xfstests/results//generic/014.out.bad)
>   --- tests/generic/014.out   2015-09-22 17:46:13.855391451 +0800
>   +++ /var/lib/xfstests/results//generic/014.out.bad  2015-09-22 17:57:06.446095748 +0800
>   @@ -3,4 +3,5 @@
>--
>test 1
>--
>   -OK
>   +truncfile returned 1 : "write: No space left on device
>   +Seed = 1442915826 (use "-s 1442915826" to re-execute this test)"
>   Ran: generic/014
>   Failures: generic/014
>   Failed 1 of 1 tests
>
> The following script comes from tracing the above test.
> Maybe I can remove the xfstests description, because those are not the
> standard steps.
>
>> >
>> > generic/004,010,014,023,024,074,075,080,086,087,089,091,092,100,112,123,
>> > 124,125,126,127,131,133,192,193,198,207,208,209,213,214,215,228,239,240,
>> >  246,247,248,255,263,285,306,313,316,323
>> >
>> > We can reproduce this bug with the following simple commands:
>> >  TEST_DEV=/dev/vdh
>> >  TEST_DIR=/mnt/tmp
>> >
>> >  umount "$TEST_DEV" >/dev/null
>> >  mkfs.btrfs -f "$TEST_DEV"
>> >  mount "$TEST_DEV" "$TEST_DIR"
>> >
>> >  umount "$TEST_DEV"
>> >  mount "$TEST_DEV" "$TEST_DIR"
>> >
>> >  cp /bin/bash $TEST_DIR
>> >
>> > Result is:
>> >  (omit previous commands)
>> >  # cp /bin/bash $TEST_DIR
>> >  cp: writing `/mnt/tmp/bash': No space left on device
>> >
>> > By bisect, we can see it is triggered by patch titled:
>> >  commit e44163e17796
>> >  ("btrfs: explictly delete unused block groups in close_ctree and
>> > ro-remount")
>> >
>> > But the faulty code is not in the above patch: btrfs deletes all chunks
>> > if the filesystem holds no data, and the above patch just makes that
>> > obvious.
>> >
>> > Detailed reason:
>> >  1: mkfs a blank filesystem, or delete everything in the filesystem

Re: [PATCH] btrfs: Fix no space bug caused by removing bg

2015-09-21 Thread Filipe David Manana
On Mon, Sep 21, 2015 at 2:27 PM, Filipe David Manana  wrote:
> On Mon, Sep 21, 2015 at 1:59 PM, Zhao Lei  wrote:
>> btrfs in v4.3-rc1 failed many xfstests items with '-o nospace_cache'
>> mount option.
>>
>> Failed cases are:
>>  btrfs/008,016,019,020,026,027,028,029,031,041,046,048,050,051,053,054,
>>  077,083,084,087,092,094
>
> Hi Zhao,
>
> So far I tried a few of those against Chris' integration-4.3 branch
> (same btrfs code as 4.3-rc1):
>
> MOUNT_OPTIONS="-o nospace_cache" ./check btrfs/008 btrfs/016 btrfs/019 btrfs/020
> FSTYP -- btrfs
> PLATFORM  -- Linux/x86_64 debian3 4.2.0-rc5-btrfs-next-12+
> MKFS_OPTIONS  -- /dev/sdc
> MOUNT_OPTIONS -- -o nospace_cache /dev/sdc /home/fdmanana/btrfs-tests/scratch_1
>
> btrfs/008 2s ... 1s
> btrfs/016 4s ... 3s
> btrfs/019 4s ... 2s
> btrfs/020 2s ... 1s
> Ran: btrfs/008 btrfs/016 btrfs/019 btrfs/020
> Passed all 4 tests
>
> And none of the tests failed...
>
>>  generic/004,010,014,023,024,074,075,080,086,087,089,091,092,100,112,123,
>>  124,125,126,127,131,133,192,193,198,207,208,209,213,214,215,228,239,240,
>>  246,247,248,255,263,285,306,313,316,323
>>
>> We can reproduce this bug with the following simple commands:
>>  TEST_DEV=/dev/vdh
>>  TEST_DIR=/mnt/tmp
>>
>>  umount "$TEST_DEV" >/dev/null
>>  mkfs.btrfs -f "$TEST_DEV"
>>  mount "$TEST_DEV" "$TEST_DIR"
>>
>>  umount "$TEST_DEV"
>>  mount "$TEST_DEV" "$TEST_DIR"
>>
>>  cp /bin/bash $TEST_DIR
>>
>> Result is:
>>  (omit previous commands)
>>  # cp /bin/bash $TEST_DIR
>>  cp: writing `/mnt/tmp/bash': No space left on device
>>
>> By bisect, we can see it is triggered by patch titled:
>>  commit e44163e17796
>>  ("btrfs: explictly delete unused block groups in close_ctree and ro-remount")
>>
>> But the faulty code is not in the above patch: btrfs deletes all chunks
>> if the filesystem holds no data, and the above patch just makes that obvious.
>>
>> Detailed reason:
>>  1: mkfs a blank filesystem, or delete everything in the filesystem
>>  2: umount fs
>>     (current code will delete all data chunks)
>>  3: mount fs
>>     Because there are no data chunks, the data space_cache has no chance
>>     to be initialized, which means: space_info->total_bytes == 0, and
>>     space_info->full == 1.
>
> Right, and that's the problem. When the space_info is initialized it
> should never be flagged as full, otherwise any buffered write attempts
> fail immediately with ENOSPC instead of trying to allocate a data
> block group (at extent-tree.c:btrfs_check_data_free_space()).
>
> That was fixed recently by:
>
> https://patchwork.kernel.org/patch/7133451/
>
> (with a respective test too, https://patchwork.kernel.org/patch/7133471/)
>
>>  4: do some writes
>>     Current code will skip chunk allocation because of space_info->full,
>>     and return -ENOSPC.
>>
>> Fix:
>>  Don't auto-delete the last block group of a raid type.
>>  If we delete all block groups of a raid type, it not only causes the above
>>  bug, but may also change the filesystem to all-single in some cases.
>
> I don't get this. Can you mention in which cases that happens and why
> (in the commit message)?
>
> It isn't clear when reading the patch why we need to keep at least one
> block group of each type/profile, and it seems to be a workaround for a
> different problem.

Plus it would be a bad fix for such a problem, as anyone can still
trigger deletion of the last block group via a balance operation (like
in the test at https://patchwork.kernel.org/patch/7133471/), i.e.,
preventing deletion by the cleaner kthread is not enough to guarantee
the last block group of a kind isn't deleted...

>
> thanks
>
>>
>> Test:
>>  Tested with the above script, and confirmed the logic with debug output.
>>
>> Signed-off-by: Zhao Lei 
>> ---
>>  fs/btrfs/extent-tree.c | 3 ++-
>>  1 file changed, 2 insertions(+), 1 deletion(-)
>>
>> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
>> index 5411f0a..35cf7eb 100644
>> --- a/fs/btrfs/extent-tree.c
>> +++ b/fs/btrfs/extent-tree.c
>> @@ -10012,7 +10012,8 @@ void btrfs_delete_unused_bgs(struct btrfs_fs_info *fs_info)
>>    bg_list);
>> space_info = block_group->space_info;
>> list_del_init(&block_group->bg_list);
>> -   if (ret || btrfs_mixed_space_info(space_info)) {

Re: [PATCH] btrfs: Fix no space bug caused by removing bg

2015-09-21 Thread Filipe David Manana
On Mon, Sep 21, 2015 at 1:59 PM, Zhao Lei  wrote:
> btrfs in v4.3-rc1 failed many xfstests items with '-o nospace_cache'
> mount option.
>
> Failed cases are:
>  btrfs/008,016,019,020,026,027,028,029,031,041,046,048,050,051,053,054,
>  077,083,084,087,092,094

Hi Zhao,

So far I tried a few of those against Chris' integration-4.3 branch
(same btrfs code as 4.3-rc1):

MOUNT_OPTIONS="-o nospace_cache" ./check btrfs/008 btrfs/016 btrfs/019 btrfs/020
FSTYP -- btrfs
PLATFORM  -- Linux/x86_64 debian3 4.2.0-rc5-btrfs-next-12+
MKFS_OPTIONS  -- /dev/sdc
MOUNT_OPTIONS -- -o nospace_cache /dev/sdc /home/fdmanana/btrfs-tests/scratch_1

btrfs/008 2s ... 1s
btrfs/016 4s ... 3s
btrfs/019 4s ... 2s
btrfs/020 2s ... 1s
Ran: btrfs/008 btrfs/016 btrfs/019 btrfs/020
Passed all 4 tests

And none of the tests failed...

>  generic/004,010,014,023,024,074,075,080,086,087,089,091,092,100,112,123,
>  124,125,126,127,131,133,192,193,198,207,208,209,213,214,215,228,239,240,
>  246,247,248,255,263,285,306,313,316,323
>
> We can reproduce this bug with the following simple commands:
>  TEST_DEV=/dev/vdh
>  TEST_DIR=/mnt/tmp
>
>  umount "$TEST_DEV" >/dev/null
>  mkfs.btrfs -f "$TEST_DEV"
>  mount "$TEST_DEV" "$TEST_DIR"
>
>  umount "$TEST_DEV"
>  mount "$TEST_DEV" "$TEST_DIR"
>
>  cp /bin/bash $TEST_DIR
>
> Result is:
>  (omit previous commands)
>  # cp /bin/bash $TEST_DIR
>  cp: writing `/mnt/tmp/bash': No space left on device
>
> By bisect, we can see it is triggered by patch titled:
>  commit e44163e17796
>  ("btrfs: explictly delete unused block groups in close_ctree and ro-remount")
>
> But the faulty code is not in the above patch: btrfs deletes all chunks
> if the filesystem holds no data, and the above patch just makes that obvious.
>
> Detailed reason:
>  1: mkfs a blank filesystem, or delete everything in the filesystem
>  2: umount fs
>     (current code will delete all data chunks)
>  3: mount fs
>     Because there are no data chunks, the data space_cache has no chance
>     to be initialized, which means: space_info->total_bytes == 0, and
>     space_info->full == 1.

Right, and that's the problem. When the space_info is initialized it
should never be flagged as full, otherwise any buffered write attempts
fail immediately with ENOSPC instead of trying to allocate a data
block group (at extent-tree.c:btrfs_check_data_free_space()).

That was fixed recently by:

https://patchwork.kernel.org/patch/7133451/

(with a respective test too, https://patchwork.kernel.org/patch/7133471/)
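
For anyone following along, the effect of the "full" flag can be modelled in a few lines. This is a toy illustration only, not the kernel code: SpaceInfo, reserve_data, CHUNK_SIZE and device_free are hypothetical names, and the real logic in extent-tree.c is considerably more involved.

```python
import errno

CHUNK_SIZE = 1 << 30  # hypothetical size of one data block group


class SpaceInfo:
    """Toy stand-in for the data space_info accounting."""

    def __init__(self, total_bytes=0, bytes_used=0, full=False):
        self.total_bytes = total_bytes
        self.bytes_used = bytes_used
        self.full = full


def reserve_data(space_info, nbytes, device_free):
    """Return 0 on success or -ENOSPC, allocating block groups on demand."""
    while space_info.bytes_used + nbytes > space_info.total_bytes:
        if space_info.full:
            # Flagged as full: give up without trying to allocate.
            return -errno.ENOSPC
        if device_free < CHUNK_SIZE:
            # Nothing left to allocate from, so now it really is full.
            space_info.full = True
            return -errno.ENOSPC
        space_info.total_bytes += CHUNK_SIZE
        device_free -= CHUNK_SIZE
    space_info.bytes_used += nbytes
    return 0


# State described in the report: no data block groups, flagged full.
buggy = SpaceInfo(total_bytes=0, full=True)
# Same state, but not flagged full at initialization.
fixed = SpaceInfo(total_bytes=0, full=False)

print(reserve_data(buggy, 1 << 20, device_free=4 * CHUNK_SIZE) == -errno.ENOSPC)  # True
print(reserve_data(fixed, 1 << 20, device_free=4 * CHUNK_SIZE))  # 0
```

With plenty of unallocated device space in both cases, only the space_info that starts out flagged as full refuses the write.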

>  4: do some writes
>     Current code will skip chunk allocation because of space_info->full,
>     and return -ENOSPC.
>
> Fix:
>  Don't auto-delete the last block group of a raid type.
>  If we delete all block groups of a raid type, it not only causes the above
>  bug, but may also change the filesystem to all-single in some cases.

I don't get this. Can you mention in which cases that happens and why
(in the commit message)?

It isn't clear when reading the patch why we need to keep at least one
block group of each type/profile, and it seems to be a workaround for a
different problem.

thanks

>
> Test:
>  Tested with the above script, and confirmed the logic with debug output.
>
> Signed-off-by: Zhao Lei 
> ---
>  fs/btrfs/extent-tree.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index 5411f0a..35cf7eb 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -10012,7 +10012,8 @@ void btrfs_delete_unused_bgs(struct btrfs_fs_info *fs_info)
>bg_list);
> space_info = block_group->space_info;
> list_del_init(&block_group->bg_list);
> -   if (ret || btrfs_mixed_space_info(space_info)) {
> +   if (ret || btrfs_mixed_space_info(space_info) ||
> +   block_group->list.next == block_group->list.prev) {
> btrfs_put_block_group(block_group);
> continue;
> }
> --
> 1.8.5.1
>
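
As I read the added condition in the quoted diff, block_group->list.next == block_group->list.prev is a way of asking whether this block group is the only entry on its list: in a circular doubly linked list with a sentinel head, a lone entry has both pointers aimed at the head. The sketch below models that in Python. ListHead, list_add_tail and is_only_entry are illustrative stand-ins patterned on the kernel's struct list_head, not kernel code.

```python
class ListHead:
    """Minimal circular doubly linked list node; a fresh node points to itself."""

    def __init__(self, name=""):
        self.name = name
        self.next = self
        self.prev = self


def list_add_tail(new, head):
    """Insert `new` just before `head`, i.e. at the tail of the list."""
    prev = head.prev
    new.prev = prev
    new.next = head
    prev.next = new
    head.prev = new


def is_only_entry(node):
    """True when next and prev are the same node.

    For a linked entry that means both point at the list head, so the
    entry is alone on its list. Note that an unlinked, self-pointing
    node also satisfies this test.
    """
    return node.next is node.prev


block_groups = ListHead("head")
bg_a = ListHead("bg_a")
bg_b = ListHead("bg_b")

list_add_tail(bg_a, block_groups)
print(is_only_entry(bg_a))  # True: bg_a is the last remaining block group

list_add_tail(bg_b, block_groups)
print(is_only_entry(bg_a), is_only_entry(bg_b))  # False False
```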
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html



-- 
Filipe David Manana,

"Reasonable men adapt themselves to the world.
 Unreasonable men adapt the world to themselves.
 That's why all progress depends on unreasonable men."


Re: btrfs corruption / bug after sending and receiving and repair

2015-09-11 Thread Filipe David Manana
R14: 880002004c58 R15: 
> 0001
> [86077.520456] FS:  () GS:88022fc8() knlGS:
> [86077.520458] CS:  0010 DS:  ES:  CR0: 8005003b
> [86077.520460] CR2: 7f8167699148 CR3: 01c13000 CR4: 07e0
> [86077.520461] Stack:
> [86077.520462]  88020ee93c78 c0408ca5 880104cfa000 880002001800
> [86077.520465]  880210fcf090 880105660578 880223b7ec00 8801b8337d80
> [86077.520468]  88020ee93d08 c03b42da 880210fcf098 880210fcf110
> [86077.520470] Call Trace:
> [86077.520487]  [] ? lookup_free_space_inode+0x45/0xf0 [btrfs]
> [86077.520498]  [] btrfs_remove_block_group+0x13a/0x760 [btrfs]
> [86077.520513]  [] btrfs_remove_chunk+0x63a/0x760 [btrfs]
> [86077.520524]  [] btrfs_delete_unused_bgs+0x249/0x270 [btrfs]
> [86077.520536]  [] cleaner_kthread+0x144/0x1a0 [btrfs]
> [86077.520547]  [] ? check_leaf+0x360/0x360 [btrfs]
> [86077.520552]  [] kthread+0xc9/0xe0
> [86077.520555]  [] ? kthread_create_on_node+0x1c0/0x1c0
> [86077.520558]  [] ret_from_fork+0x58/0x90
> [86077.520560]  [] ? kthread_create_on_node+0x1c0/0x1c0
> [86077.520562] Code: 60 04 00 00 e9 b0 fe ff ff 66 90 89 45 c8 f0 41 80 64 24 80 fd 4c 89 e7 e8 1e 21 fe ff 8b 45 c8 e9 1b ff ff ff 66 0f 1f 44 00 00 <0f> 0b b8 f4 ff ff ff e9 10 ff ff ff 4c 89 f7 45 31 f6 e8 69 07
> [86077.520587] RIP  [] btrfs_orphan_add+0x1c0/0x1e0 [btrfs]
> [86077.520600]  RSP 
> [86077.520602] ---[ end trace 24353018afe32b08 ]---

It isn't a corruption, just a failure to reserve space necessary for
block group deletion (triggering a BUG_ON / hang). This got fixed in
kernel 4.0:

https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=3d84be799194147e04c0e3129ed44a948773b80a



Re: [PATCH RFC 00/14] Accurate qgroup reserve framework

2015-09-10 Thread Filipe David Manana
On Thu, Sep 10, 2015 at 10:01 PM, Mark Fasheh  wrote:
> Hi Qu,
>
> On Tue, Sep 08, 2015 at 04:56:52PM +0800, Qu Wenruo wrote:
>> [[BUG]]
>> One of the most common ways to trigger the bug is the following:
>> 1) Enable quota
>> 2) Limit excl of qgroup 5 to 16M
>> 3) Write [0,2M) of a file inside subvol 5 10 times without sync
>>
>> EDQUOT will be triggered at about the 8th write.
>
> Does this happen on all kernels with qgroups or is this related to your
> recent rewrite?
>
>
>> [[CAUSE]]
>> The problem is caused by the fact that qgroup will reserve space even
>> if the data space is already reserved.
>>
>> In the above reproducer, each time we do a buffered write to [0,2M) qgroup
>> will reserve 2M of space, but in fact the 1st write has already reserved
>> 2M, and from then on we don't need to reserve any more data space as we
>> are only writing [0,2M).
>>
>> Also, the reserved space will only be freed *ONCE*, when its backref is
>> run at commit_transaction() time.
>>
>> That causes the reserved space to leak.
>>
>> [[FIX]]
>> The fix is not a simple one, as currently btrfs_qgroup_reserve() follows
>
> Indeed, this is quite a large patch series and I see no testing details from
> you. Can you please at the least provide a single reproducer in the form of
> something that can be added to xfstests?

https://patchwork.kernel.org/patch/7047641/

Came way before this patchset :)
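
For readers who want to see why the reproducer quoted above trips the limit, the arithmetic can be modelled roughly as below. This is a sketch under assumptions: Qgroup and first_failing_write are made-up names, and the 16 KiB of exclusive space already charged to the qgroup is a hypothetical figure chosen so that the model lands on the 8th write (with nothing charged beforehand, this model fails on the 9th).

```python
MIB = 1 << 20


class Qgroup:
    """Toy qgroup with an exclusive limit and a running reservation."""

    def __init__(self, max_excl, excl=0):
        self.max_excl = max_excl  # configured limit
        self.excl = excl          # exclusive bytes already accounted
        self.reserved = 0         # bytes reserved but not yet accounted

    def reserve(self, nbytes):
        """Reserve nbytes; return False where the kernel would return EDQUOT."""
        if self.excl + self.reserved + nbytes > self.max_excl:
            return False
        self.reserved += nbytes
        return True


def first_failing_write(qgroup, write_size, count):
    """Rewrite the same range `count` times without sync.

    Every write reserves again, although only the first one needs to,
    and nothing is released because no transaction commit happens.
    Returns the 1-based index of the first failing write, or None.
    """
    for i in range(1, count + 1):
        if not qgroup.reserve(write_size):
            return i
    return None


qg = Qgroup(max_excl=16 * MIB, excl=16 * 1024)  # 16 KiB is an assumed baseline
print(first_failing_write(qg, 2 * MIB, 10))  # 8
```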

>
>
>> the very bad btrfs space allocation principle:
>>   Allocate as much as you need, even if it's not fully used.
>>
>> So for accurate qgroup reserve, we introduce a completely new framework
>> for data and metadata.
>> 1) Per-inode data reserve map
>>    Now, each inode will have a data reserve map, recording which range
>>    of data is already reserved.
>>    If we are writing a range which is already reserved, we won't need to
>>    reserve space again.
>>
>>    Also, because qgroup is only accounted at commit_trans(), once data
>>    has been committed to disk and its metadata has been inserted into the
>>    current tree, we should free the reserved data range but still keep
>>    the reserved space until commit_trans().
>>
>>    So delayed_ref_head will have new members to record how much space is
>>    reserved and free them at commit_trans() time.
>> 2) Per-root metadata reserve counter
>>    For metadata (tree blocks), it's impossible to know in advance exactly
>>    how much space it will use.
>>    And due to the new qgroup accounting framework, the old
>>    free-at-end-trans may lead to exceeding the limit.
>>
>>    So we record how much metadata space is reserved for each root, and
>>    free it at commit_trans() time.
>>    This method is not perfect, but thanks to the comparatively small size
>>    of metadata, it should be quite good.
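
The per-inode data reserve map in item 1 of the quoted list can be pictured as a set of byte ranges where only the not-yet-covered part of a write costs a new reservation. The following is one possible sketch of that idea, not the patchset's implementation: DataReserveMap and its reserve() method are hypothetical, and a plain sorted list stands in for whatever structure the patches really use.

```python
MIB = 1 << 20


class DataReserveMap:
    """Sorted, non-overlapping [start, end) byte ranges already reserved."""

    def __init__(self):
        self.ranges = []

    def reserve(self, start, length):
        """Record [start, start + length); return the newly reserved bytes."""
        req_start, req_end = start, start + length
        lo, hi = req_start, req_end
        new_bytes = length
        kept = []
        for r_start, r_end in self.ranges:
            if r_end < req_start or r_start > req_end:
                kept.append((r_start, r_end))  # disjoint, keep unchanged
                continue
            # Overlapping or adjacent: do not charge the overlap again,
            # and fold the existing range into the merged one.
            new_bytes -= max(0, min(req_end, r_end) - max(req_start, r_start))
            lo = min(lo, r_start)
            hi = max(hi, r_end)
        kept.append((lo, hi))
        kept.sort()
        self.ranges = kept
        return new_bytes


rsv = DataReserveMap()
# Ten buffered writes to [0, 2M): only the first one reserves anything.
charged = [rsv.reserve(0, 2 * MIB) for _ in range(10)]
print(sum(charged) // MIB)  # 2
```

A write that only partly overlaps an existing range is charged just for the uncovered part, which is what keeps repeated rewrites from inflating the reservation.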
>>
>> More detailed info can be found in each commit message and source
>> comment.
>>
>> Qu Wenruo (19):
>>   btrfs: qgroup: New function declaration for new reserve implement
>>   btrfs: qgroup: Implement data_rsv_map init/free functions
>>   btrfs: qgroup: Introduce new function to search most left reserve
>> range
>>   btrfs: qgroup: Introduce function to insert non-overlap reserve range
>>   btrfs: qgroup: Introduce function to reserve data range per inode
>>   btrfs: qgroup: Introduce btrfs_qgroup_reserve_data function
>>   btrfs: qgroup: Introduce function to release reserved range
>>   btrfs: qgroup: Introduce function to release/free reserved data range
>>   btrfs: delayed_ref: Add new function to record reserved space into
>> delayed ref
>>   btrfs: delayed_ref: release and free qgroup reserved at proper timing
>>   btrfs: qgroup: Introduce new functions to reserve/free metadata
>>   btrfs: qgroup: Use new metadata reservation.
>>   btrfs: extent-tree: Add new verions of btrfs_check_data_free_space
>>   btrfs: Switch to new check_data_free_space
>>   btrfs: fallocate: Add support to accurate qgroup reserve
>>   btrfs: extent-tree: Add new version of btrfs_delalloc_reserve_space
>>   btrfs: extent-tree: Use new __btrfs_delalloc_reserve_space function
>>   btrfs: qgroup: Cleanup old inaccurate facilities
>>   btrfs: qgroup: Add handler for NOCOW and inline
>
> I took a quick look through a few of these, none of them have any trace_*
> functions, yet you're adding several new entrypoints to the qgroup code.
> Those are incredibly useful for debugging on live systems and in fact I've
> got a patch which reintroduces the ones you remov

Re: 4.1.6 gentoo-hardened: Hang during rename

2015-08-30 Thread Filipe David Manana
On Sat, Aug 29, 2015 at 9:23 PM, Kenneth Lakin  wrote:
> Hey. It looks like I'm being bitten by:
> http://article.gmane.org/gmane.comp.file-systems.btrfs/44987/ except I'm
> hitting it during rename, rather than unlink. Dmesg spew here:
> http://pastebin.ca/3137198 (That pastebin post includes the -unrelated-
> source of the W taint.)
>
> I'm running on Gentoo hardened-sources 4.1.6 on x86 (rather than amd64).
> The affected volume is mounted with
> "rw,relatime,ssd,discard,space_cache,autodefrag" and -AFAIK- has no
> subvolumes.
>
> emerge is the blocked program and is the Gentoo package manager. This
> lockup happens occasionally during the "Updating Portage cache" step of
> the emerge sync operation. The emerge task is unkillable and is blocked
> in D status. Attempts to re-run the cache update task have it hang at
> what appears to be the same place. Only a reboot can "fix" the problem.
> The rest of the system works just fine; the only thing that appears to
> be blocked is the Portage cache update.
>
> I couldn't find an open bug for this, but my Google-fu may be weak
> today. Did I miss the bug report, or is this a currently unknown or
> unconfirmed issue? Let me know if I can help with diagnostics.

The fix landed in 4.2-rc:
https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=6ca0709756710c47ec604dd08b9fc45929d36390

And it's not in 4.1 nor any other older releases.
>
>





Re: [PATCH] Btrfs: deal with error on secondary log properly

2015-08-28 Thread Filipe David Manana
On Wed, Aug 26, 2015 at 8:06 PM, Josef Bacik  wrote:
> On 08/25/2015 10:06 PM, Liu Bo wrote:
>>
>> On Tue, Aug 25, 2015 at 01:09:43PM -0400, Josef Bacik wrote:
>>>
>>> If we have an fsync at the same time in two separate subvolumes we could
>>> end up with the tree log pointing at invalid blocks.  We need to notice
>>> if our writeout
>>
>>
>> Mind sharing more details of the problem?
>>
>
> It's the problem Filipe was trying to solve.  Say we fsync() on two
> different subvols, one of them will race in and be the one who commits the
> log root tree.  So process A waits on process B to add its log to the log
> root tree and write out its log.  If we get an IO error while writing out
> the log the log root tree will be pointing to invalid crap, and we also
> won't return an error back to userspace.  We need to notice if there was an
> error, turn on the transaction commit stuff since we've already updated
> the log root tree with our subvol log so we don't get garbage on the disk,
> and we need to return an error to process B. Thanks,

Josef,

So the problem was that without forcing A to trigger a transaction
commit if B gets an error when writing one or more log tree
nodes/leafs, A could write a superblock pointing to a log root tree
for which not all nodes/leafs were persisted. Then B would fall back
to a transaction commit and everything would be ok - i.e. we would
only have a "small" time window where a superblock points to an
invalid log root tree.

So shouldn't the fix here be only to force A to do a transaction commit
(call btrfs_set_log_full_commit())? Why do we need to make B return an
error to userspace instead of falling back to a transaction commit too (as
it did before this change)? After all, this is the kind of failure for
which we can fall back to a transaction commit without losing any inode
metadata (links, owner, group, xattrs, etc) or file data.

The change looks good, just puzzled why we need to return the error to
userspace.

thanks

>
> Josef
>
>


Re: [PATCH v3] fstests: btrfs: Add reserved space leak check for rewrite dirty page

2015-08-19 Thread Filipe David Manana
able to write $FILESIZE - $BLOCKSIZE data now
> +$XFS_IO_PROG -f -c "pwrite -b $BLOCKSIZE 0 $(($FILESIZE - $BLOCKSIZE))" \
> +   $SCRATCH_MNT/foo | _filter_xfs_io
> +
> +# success, all done
> +status=0
> +exit
> diff --git a/tests/btrfs/089.out b/tests/btrfs/089.out
> new file mode 100644
> index 000..642bede
> --- /dev/null
> +++ b/tests/btrfs/089.out
> @@ -0,0 +1,13 @@
> +QA output created by 089
> +wrote 33554432/33554432 bytes at offset 0
> +XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
> +wrote 33554432/33554432 bytes at offset 0
> +XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
> +wrote 33554432/33554432 bytes at offset 0
> +XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
> +wrote 33554432/33554432 bytes at offset 0
> +XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
> +wrote 33554432/33554432 bytes at offset 0
> +XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
> +wrote 132120576/132120576 bytes at offset 0
> +XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
> diff --git a/tests/btrfs/group b/tests/btrfs/group
> index ffe18bf..da37e46 100644
> --- a/tests/btrfs/group
> +++ b/tests/btrfs/group
> @@ -91,6 +91,7 @@
>  086 auto quick clone
>  087 auto quick send
>  088 auto quick metadata
> +089 auto qgroup
>  090 auto quick metadata
>  091 auto quick qgroup
>  092 auto quick send
> --
> 1.8.3.1
>
> --
> To unsubscribe from this list: send the line "unsubscribe fstests" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html





Re: [PATCH v2] fstests: btrfs: Add reserved space leak check for rewrite dirty page

2015-08-17 Thread Filipe David Manana
on of why we do "sync" before calling "rm"
wouldn't hurt (not everyone is too familiar with the qgroups
implementation and knows that it frees space reservation at
transaction commit time).

> +sync
> +# error shouldn't happen, as BLOCKSIZE is large enough for metdata cow
> +rm $SCRATCH_MNT/foo || _fail "reserved space leak detected"

Why do we need || _fail ...? If rm fails it prints an error message to
stderr which makes the test fail.
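
To illustrate the point, here is a minimal sketch of how a golden-output
comparison catches an unexpected error message on its own. The helper names
(run_and_compare, test_body) and the temporary directory are hypothetical
and only mimic the idea; this is not the harness code itself.

```shell
#!/bin/bash
# Hypothetical stand-in for the harness: run a test body, capture stdout and
# stderr together, and compare the result with the expected (golden) output.
run_and_compare()
{
	local golden="$1"; shift
	local actual
	actual=$("$@" 2>&1)
	[ "$actual" = "$golden" ]
}

# A test body that removes a file; on failure rm writes a message to stderr.
test_body()
{
	echo "QA output created by demo"
	rm "$1"
}

workdir=$(mktemp -d)
touch "$workdir/foo"

# File exists: rm is silent, so the output matches the golden output.
run_and_compare "QA output created by demo" test_body "$workdir/foo" \
	&& echo "first run: pass"

# File is gone: rm's error message lands in the captured output, so the
# comparison fails without any explicit "|| _fail".
run_and_compare "QA output created by demo" test_body "$workdir/foo" \
	|| echo "second run: fail"

rm -rf "$workdir"
```

In this sketch the second run fails only because of rm's own message, which
is the behaviour the question above relies on.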

> +sync
> +
> +# We should be able to write $FILESIZE - $BLOCKSIZE data
> +$XFS_IO_PROG -f -c "pwrite -b $BLOCKSIZE 0 $(($FILESIZE - $BLOCKSIZE))" \
> +   $SCRATCH_MNT/foo | _filter_xfs_io
> +
> +# success, all done
> +status=0
> +exit
> diff --git a/tests/btrfs/089.out b/tests/btrfs/089.out
> new file mode 100644
> index 000..642bede
> --- /dev/null
> +++ b/tests/btrfs/089.out
> @@ -0,0 +1,13 @@
> +QA output created by 089
> +wrote 33554432/33554432 bytes at offset 0
> +XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
> +wrote 33554432/33554432 bytes at offset 0
> +XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
> +wrote 33554432/33554432 bytes at offset 0
> +XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
> +wrote 33554432/33554432 bytes at offset 0
> +XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
> +wrote 33554432/33554432 bytes at offset 0
> +XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
> +wrote 132120576/132120576 bytes at offset 0
> +XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
> diff --git a/tests/btrfs/group b/tests/btrfs/group
> index ffe18bf..da37e46 100644
> --- a/tests/btrfs/group
> +++ b/tests/btrfs/group
> @@ -91,6 +91,7 @@
>  086 auto quick clone
>  087 auto quick send
>  088 auto quick metadata
> +089 auto quick qgroup
>  090 auto quick metadata
>  091 auto quick qgroup
>  092 auto quick send
> --
> 1.8.3.1
>
> --
> To unsubscribe from this list: send the line "unsubscribe fstests" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html





Re: [PATCH v5 2/3] xfstests: btrfs: test device replace, with EIO on the src dev

2015-08-14 Thread Filipe David Manana
On Fri, Aug 14, 2015 at 11:47 AM, Anand Jain  wrote:
> From: Anand Jain 
>
> This test case will test to confirm the replace works with
> the failed (EIO) replacing source device. EIO condition is
> achieved using the DM device.
>
> Signed-off-by: Anand Jain 
> Reviewed-by: Filipe Manana 
> ---
> v4->v5: rebase on latest xfstests code and accepts Filipe comment
> v3->v4: rebase on latest xfstests code
> v2->v3: accepts Filipe Manana's review comments, thanks
> v1->v2: accepts Dave Chinner's review comments, thanks
>  tests/btrfs/098 | 81 
> +
>  tests/btrfs/098.out | 11 
>  tests/btrfs/group   |  1 +
>  3 files changed, 93 insertions(+)
>  create mode 100755 tests/btrfs/098
>  create mode 100644 tests/btrfs/098.out
>
> diff --git a/tests/btrfs/098 b/tests/btrfs/098
> new file mode 100755
> index 000..afb41d1
> --- /dev/null
> +++ b/tests/btrfs/098
> @@ -0,0 +1,81 @@
> +#! /bin/bash
> +# FS QA Test No. btrfs/098
> +#
> +#test device replace works when the source device has EIO
> +#
> +# Copyright (c) 2015 Oracle.  All Rights Reserved.
> +#
> +# This program is free software; you can redistribute it and/or
> +# modify it under the terms of the GNU General Public License as
> +# published by the Free Software Foundation.
> +#
> +# This program is distributed in the hope that it would be useful,
> +# but WITHOUT ANY WARRANTY; without even the implied warranty of
> +# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> +# GNU General Public License for more details.
> +#
> +# You should have received a copy of the GNU General Public License
> +# along with this program; if not, write the Free Software Foundation,
> +# Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301  USA
> +#
> +
> +seq=`basename $0`
> +seqres=$RESULT_DIR/$seq
> +echo "QA output created by $seq"
> +
> +here=`pwd`
> +tmp=/tmp/$$
> +
> +status=1   # failure is the default!
> +trap "_cleanup; exit \$status" 0 1 2 3 15
> +
> +
> +_cleanup()
> +{
> +   _cleanup_dmerror
> +   rm -f $tmp
> +}
> +
> +# get standard environment, filters and checks
> +. ./common/rc
> +. ./common/filter
> +. ./common/filter.btrfs
> +. ./common/dmerror
> +
> +_supported_fs btrfs
> +_supported_os Linux
> +_need_to_be_root
> +_require_scratch_dev_pool 3
> +_require_dmerror
> +
> +rm -f $seqres.full
> +
> +dev1="`echo $SCRATCH_DEV_POOL | $AWK_PROG '{print $2}'`"
> +dev2="`echo $SCRATCH_DEV_POOL | $AWK_PROG '{print $3}'`"
> +
> +_init_dmerror
> +_scratch_mkfs_dmerror "-f -d raid1 -m raid1 $dev1"
> +_mount_dmerror
> +
> +_run_btrfs_util_prog filesystem show -m $SCRATCH_MNT
> +$BTRFS_UTIL_PROG filesystem show -m $SCRATCH_MNT | _filter_btrfs_filesystem_show
> +
> +error_devid=`$BTRFS_UTIL_PROG filesystem show -m $SCRATCH_MNT |\
> +   egrep $DMERROR_DEV | $AWK_PROG '{print $2}'`
> +
> +snapshot_cmd="$BTRFS_UTIL_PROG subvolume snapshot -r $SCRATCH_MNT"
> +snapshot_cmd="$snapshot_cmd $SCRATCH_MNT/snap_\`date +'%H_%M_%S_%N'\`"
> +run_check $FSSTRESS_PROG -d $SCRATCH_MNT -n 200 -p 8 $FSSTRESS_AVOID -x \
> +   "$snapshot_cmd" -X 50 >&/dev/null

Sorry missed this before, but you don't need to redirect stdout/stderr
to /dev/null.
run_check redirects them to $seqres.full where it's actually useful -
when we have the test failing, we can check $seqres.full to see what
seed fsstress used (fsstress prints it to stdout/stderr). That's for
the case where it's failing only for some seeds of course.
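
As a rough illustration of why the extra redirection is redundant, here is a
simplified sketch of a run_check-style helper. The names run_check_sketch and
fake_stress, the log file location and the seed value are assumptions made
for this example, not the real helper or real fsstress output.

```shell
#!/bin/bash
# Sketch only: a simplified stand-in for a run_check-style helper.
# The log file plays the role of $seqres.full.
seqres_full=$(mktemp)

run_check_sketch()
{
	# Record the command line, then append its stdout and stderr to the log.
	echo "# $*" >> "$seqres_full"
	"$@" >> "$seqres_full" 2>&1 || { echo "failed: '$*'"; return 1; }
}

# Stand-in for fsstress: a command that reports the seed it used.
fake_stress()
{
	echo "seed = 1439550000"
}

# No extra redirection needed: the seed ends up in the log file.
run_check_sketch fake_stress
grep "seed" "$seqres_full"
```

With a helper shaped like this, the command's output is already captured in
the log, so a failing run can be reproduced later with the same seed.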

Same observation applies to the other test/patch.

Thanks.

> +
> +# now load the error into the DMERROR_DEV
> +_load_dmerror_table
> +
> +_run_btrfs_util_prog replace start -B $error_devid $dev2 $SCRATCH_MNT
> +
> +_run_btrfs_util_prog filesystem show -m $SCRATCH_MNT
> +$BTRFS_UTIL_PROG filesystem show -m $SCRATCH_MNT | _filter_btrfs_filesystem_show
> +
> +echo "=== device replace completed"
> +
> +status=0; exit
> diff --git a/tests/btrfs/098.out b/tests/btrfs/098.out
> new file mode 100644
> index 000..eb2f87f
> --- /dev/null
> +++ b/tests/btrfs/098.out
> @@ -0,0 +1,11 @@
> +QA output created by 098
> +Label: none  uuid:  
> +   Total devices  FS bytes used 
> +   devid  size  used  path SCRATCH_DEV
> +   devid  size  used  path /dev/mapper/error-test
> +
> +Label: none  uuid:  
> +   Total devices  FS bytes used 
> +   devid  size  used  path SCRATCH_DE

Re: [PATCH v2] fstests: generic/018: expand "write backwards sync but contiguous" to test regression in btrfs

2015-08-13 Thread Filipe David Manana
On Thu, Aug 13, 2015 at 10:43 AM, Filipe David Manana
 wrote:
> On Thu, Aug 13, 2015 at 9:47 AM, Liu Bo  wrote:
>> Btrfs has a problem when defragging a file which has a large fragmented
>> range: it leaves the tail extent as a separate extent instead of merging
>> it with the previous extents.
>>
>> This makes generic/018 recognize the above regression.
>>
>> Meanwhile, I find that in the case of "write backwards sync but contiguous",
>> ext4 doesn't produce fragments like btrfs and xfs, so I modify 018.out a
>> little bit to let ext4 pass.
>>
>> Moreover, I follow Filipe's suggestion to filter xfs_io's output in order to
>> check these writes actually succeed.
>>
>> Signed-off-by: Liu Bo 
>
> Reviewed-by: Filipe Manana 
>
> The lines with XFS_IO_PROG are now wider than 80 characters (the echo
> command lines were already wider than 80 too). But other than that all
> looks good to me.
> Test fails with the btrfs kernel fix, test passes with the fix applied

Err, typo, should have been "test fails without the btrfs kernel fix..."

> and ext4/xfs continue to pass here.
> Thanks.
>
>> ---
>> v2: fix typo in title, s/expend/expand/g
>>
>>  tests/generic/018 |  16 ++--
>>  tests/generic/018.out | 198 
>> +-
>>  2 files changed, 203 insertions(+), 11 deletions(-)
>>
>> diff --git a/tests/generic/018 b/tests/generic/018
>> index d97bb88..3693874 100755
>> --- a/tests/generic/018
>> +++ b/tests/generic/018
>> @@ -68,28 +68,24 @@ $XFS_IO_PROG -f -c "truncate 1m" $fragfile
>>  _defrag --before 0 --after 0 $fragfile
>>
>>  echo "Contiguous file:" | tee -a $seqres.full
>> -$XFS_IO_PROG -f -c "pwrite -b $((4 * bsize)) 0 $((4 * bsize))" $fragfile \
>> -   > /dev/null
>> +$XFS_IO_PROG -f -c "pwrite -b $((4 * bsize)) 0 $((4 * bsize))" $fragfile | _filter_xfs_io
>>  _defrag --before 1 --after 1 $fragfile
>>
>>  echo "Write backwards sync, but contiguous - should defrag to 1 extent" | tee -a $seqres.full
>> -for i in `seq 9 -1 0`; do
>> -   $XFS_IO_PROG -fs -c "pwrite -b $bsize $((i * bsize)) $bsize" $fragfile \
>> -   > /dev/null
>> +for i in `seq 64 -1 0`; do
>> +   $XFS_IO_PROG -fd -c "pwrite -b $bsize $((i * bsize)) $bsize" $fragfile | _filter_xfs_io
>>  done
>> -_defrag --before 10 --after 1 $fragfile
>> +_defrag --after 1 $fragfile
>>
>>  echo "Write backwards sync leaving holes - defrag should do nothing" | tee -a $seqres.full
>>  for i in `seq 31 -2 0`; do
>> -   $XFS_IO_PROG -fs -c "pwrite -b $bsize $((i * bsize)) $bsize" $fragfile \
>> -   > /dev/null
>> +   $XFS_IO_PROG -fs -c "pwrite -b $bsize $((i * bsize)) $bsize" $fragfile | _filter_xfs_io
>>  done
>>  _defrag --before 16 --after 16 $fragfile
>>
>>  echo "Write forwards sync leaving holes - defrag should do nothing" | tee -a $seqres.full
>>  for i in `seq 0 2 31`; do
>> -   $XFS_IO_PROG -fs -c "pwrite -b $bsize $((i * bsize)) $bsize" $fragfile \
>> -   > /dev/null
>> +   $XFS_IO_PROG -fs -c "pwrite -b $bsize $((i * bsize)) $bsize" $fragfile | _filter_xfs_io
>>  done
>>  _defrag --before 16 --after 16 $fragfile
>>
>> diff --git a/tests/generic/018.out b/tests/generic/018.out
>> index 5f265d1..0886a9a 100644
>> --- a/tests/generic/018.out
>> +++ b/tests/generic/018.out
>> @@ -6,14 +6,210 @@ Sparse file (no blocks):
>>  Before: 0
>>  After: 0
>>  Contiguous file:
>> +wrote 16384/16384 bytes at offset 0
>> +XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
>>  Before: 1
>>  After: 1
>>  Write backwards sync, but contiguous - should defrag to 1 extent
>> -Before: 10
>> +wrote 4096/4096 bytes at offset 262144
>> +XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
>> +wrote 4096/4096 bytes at offset 258048
>> +XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
>> +wrote 4096/4096 bytes at offset 253952
>> +XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
>> +wrote 4096/4096 bytes at offset 249856
>> +XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
>> +wrote 4096/4096 bytes at o

Re: [PATCH v2] fstests: generic/018: expand "write backwards sync but contiguous" to test regression in btrfs

2015-08-13 Thread Filipe David Manana
 at offset 36864
> +XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
> +wrote 4096/4096 bytes at offset 32768
> +XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
> +wrote 4096/4096 bytes at offset 28672
> +XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
> +wrote 4096/4096 bytes at offset 24576
> +XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
> +wrote 4096/4096 bytes at offset 20480
> +XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
> +wrote 4096/4096 bytes at offset 16384
> +XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
> +wrote 4096/4096 bytes at offset 12288
> +XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
> +wrote 4096/4096 bytes at offset 8192
> +XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
> +wrote 4096/4096 bytes at offset 4096
> +XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
> +wrote 4096/4096 bytes at offset 0
> +XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
> +Before: in_range(0, -1)
>  After: 1
>  Write backwards sync leaving holes - defrag should do nothing
> +wrote 4096/4096 bytes at offset 126976
> +XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
> +wrote 4096/4096 bytes at offset 118784
> +XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
> +wrote 4096/4096 bytes at offset 110592
> +XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
> +wrote 4096/4096 bytes at offset 102400
> +XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
> +wrote 4096/4096 bytes at offset 94208
> +XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
> +wrote 4096/4096 bytes at offset 86016
> +XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
> +wrote 4096/4096 bytes at offset 77824
> +XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
> +wrote 4096/4096 bytes at offset 69632
> +XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
> +wrote 4096/4096 bytes at offset 61440
> +XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
> +wrote 4096/4096 bytes at offset 53248
> +XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
> +wrote 4096/4096 bytes at offset 45056
> +XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
> +wrote 4096/4096 bytes at offset 36864
> +XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
> +wrote 4096/4096 bytes at offset 28672
> +XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
> +wrote 4096/4096 bytes at offset 20480
> +XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
> +wrote 4096/4096 bytes at offset 12288
> +XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
> +wrote 4096/4096 bytes at offset 4096
> +XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
>  Before: 16
>  After: 16
>  Write forwards sync leaving holes - defrag should do nothing
> +wrote 4096/4096 bytes at offset 0
> +XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
> +wrote 4096/4096 bytes at offset 8192
> +XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
> +wrote 4096/4096 bytes at offset 16384
> +XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
> +wrote 4096/4096 bytes at offset 24576
> +XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
> +wrote 4096/4096 bytes at offset 32768
> +XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
> +wrote 4096/4096 bytes at offset 40960
> +XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
> +wrote 4096/4096 bytes at offset 49152
> +XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
> +wrote 4096/4096 bytes at offset 57344
> +XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
> +wrote 4096/4096 bytes at offset 65536
> +XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
> +wrote 4096/4096 bytes at offset 73728
> +XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
> +wrote 4096/4096 bytes at offset 81920
> +XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
> +wrote 4096/4096 bytes at offset 90112
> +XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
> +wrote 4096/4096 bytes at offset 98304
> +XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
> +wrote 4096/4096 bytes at offset 106496
> +XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
> +wrote 4096/4096 bytes at offset 114688
> +XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
> +wrote 4096/4096 bytes at offset 122880
> +XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
>  Before: 16
>  After: 16
> --
> 1.8.2.1
>





Re: [PATCH] fstests: btrfs regression test for defrag tail extents

2015-08-10 Thread Filipe David Manana
On Mon, Aug 10, 2015 at 9:12 AM, Liu Bo  wrote:
> Regression test for btrfs defragment tool, it's aimed to verify
> that tail extents won't be skipped as a separate extent while the previous
> extents have been defrag'ed into a whole extent.

Thanks for doing this Liu.
Some comments below.

>
> Signed-off-by: Liu Bo 
> ---
>  tests/btrfs/098 | 68 
> +
>  tests/btrfs/098.out |  3 +++
>  tests/btrfs/group   |  1 +
>  3 files changed, 72 insertions(+)
>  create mode 100755 tests/btrfs/098
>  create mode 100644 tests/btrfs/098.out
>
> diff --git a/tests/btrfs/098 b/tests/btrfs/098
> new file mode 100755
> index 000..e4bb38a
> --- /dev/null
> +++ b/tests/btrfs/098
> @@ -0,0 +1,68 @@
> +#! /bin/bash
> +# FS QA Test 098
> +#
> +# Test if btrfs defrag tool can merge tail extents.

Well, this wasn't a problem in the tool (btrfs-progs) but rather in
the kernel's defrag code (same observation regarding the commit
message).

> +#
> +#---
> +# Copyright (c) 2015 Liu Bo.  All Rights Reserved.
> +#
> +# This program is free software; you can redistribute it and/or
> +# modify it under the terms of the GNU General Public License as
> +# published by the Free Software Foundation.
> +#
> +# This program is distributed in the hope that it would be useful,
> +# but WITHOUT ANY WARRANTY; without even the implied warranty of
> +# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> +# GNU General Public License for more details.
> +#
> +# You should have received a copy of the GNU General Public License
> +# along with this program; if not, write the Free Software Foundation,
> +# Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301  USA
> +#---
> +#
> +
> +seq=`basename $0`
> +seqres=$RESULT_DIR/$seq
> +echo "QA output created by $seq"
> +
> +here=`pwd`
> +tmp=/tmp/$$
> +status=1   # failure is the default!
> +trap "_cleanup; exit \$status" 0 1 2 3 15
> +
> +_cleanup()
> +{
> +   cd /
> +   rm -f $tmp.*
> +}
> +
> +# get standard environment, filters and checks
> +. ./common/rc
> +. ./common/filter
> +. ./common/defrag
> +
> +# real QA test starts here
> +
> +_supported_fs btrfs
> +_supported_os Linux
> +_require_scratch
> +_require_defrag
> +
> +rm -f $seqres.full
> +
> +_scratch_mkfs >> $seqres.full 2>&1
> +_scratch_mount
> +
> +$XFS_IO_PROG -f -c "pwrite 0 640k" $SCRATCH_MNT/foobar >> $seqres.full 2>&1

Shouldn't redirect stdout/stderr to $seqres.full but instead let it be
part of the golden output (and pipe its output to _filter_xfs_io).
That's what we do everywhere else.
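
For readers unfamiliar with the filtering step, here is a simplified sketch
of the idea: the variable size and timing line of a pwrite report is replaced
by a fixed string so that the output can be compared against the golden
output. The function name filter_io_sketch and the sample report values are
made up for illustration; the real _filter_xfs_io is more thorough.

```shell
#!/bin/bash
# Simplified, hypothetical filter in the spirit of _filter_xfs_io: replace
# the variable size/timing line of a pwrite report with a fixed string.
filter_io_sketch()
{
	sed -e 's/^.* ops; .* ops\/sec)$/XXX Bytes, X ops; XX:XX:XX.X (XXX YYY\/sec and XXX ops\/sec)/'
}

# Sample pwrite report (made-up timing values).
sample='wrote 655360/655360 bytes at offset 0
640 KiB, 160 ops; 0.0021 sec (297.619 MiB/sec and 76190.4762 ops/sec)'

# The "wrote ..." line is kept as is, the timing line is normalized.
printf '%s\n' "$sample" | filter_io_sketch
```

The "wrote N/N bytes" line survives the filter, so a short or failed write
shows up as a golden output mismatch.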

> +
> +# create sparse file layout
> +for ((i = 160; i > 0; i--)); do
> +   $XFS_IO_PROG -f -c "pwrite $((($RANDOM % 160) * 4))k 4k" \
> +   $SCRATCH_MNT/foobar >> $seqres.full 2>&1

Same here, if we could get rid of the random offset (is it really
needed?). Without this loop (and even without the btrfs fix applied)
this test succeeds as well - we want to verify the extent count after
defrag is 1 for this scenario of a sparse file, so we should really
check these writes actually succeed.
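
One possible way to avoid the random offsets, sketched under assumptions:
the loop below only prints the byte offsets of a deterministic backwards
write pattern. The function name backward_offsets, the block size and the
block count are illustrative values, not taken from the patch.

```shell
#!/bin/bash
# Sketch: print the byte offsets of a deterministic "write backwards" loop.
# Block size and block count are illustrative values.
backward_offsets()
{
	local bsize=$1 count=$2 i
	for ((i = count - 1; i >= 0; i--)); do
		echo $((i * bsize))
	done
}

# Each offset could then be used as the pwrite offset for one block write.
backward_offsets 4096 4
```

With fixed offsets like these, every "wrote N/N bytes at offset X" line is
reproducible, so the writes can be part of the golden output.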

> +done
> +
> +_defrag --after 1 $SCRATCH_MNT/foobar
> +
> +# success, all done
> +status=0
> +exit

There doesn't seem to be really anything btrfs specific in this test.
Any reason to not make it a generic test?

> diff --git a/tests/btrfs/098.out b/tests/btrfs/098.out
> new file mode 100644
> index 000..7306733
> --- /dev/null
> +++ b/tests/btrfs/098.out
> @@ -0,0 +1,3 @@
> +QA output created by 098
> +Before: in_range(0, -1)
> +After: 1

So even without your btrfs fix applied, the test passes, therefore it
doesn't serve as a regression test for btrfs.
Can you double check it?

thanks

> diff --git a/tests/btrfs/group b/tests/btrfs/group
> index e13865a..392de6d 100644
> --- a/tests/btrfs/group
> +++ b/tests/btrfs/group
> @@ -100,3 +100,4 @@
>  095 auto quick metadata
>  096 auto quick clone
>  097 auto quick send clone
> +098 auto defrag quick
> --
> 1.8.2.1
>
> --
> To unsubscribe from this list: send the line "unsubscribe fstests" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html





Re: [PATCH v2 3/3] xfstests: btrfs: test device delete with EIO on src dev

2015-08-07 Thread Filipe David Manana
On Tue, May 5, 2015 at 8:03 PM, Anand Jain  wrote:
> This test case tests if the device delete would work when
> the source device has failed with EIO errors.
>
> EIO errors are achieved using the DM device.
>
> Also this test needs the latest btrfs-progs and kernel patch

Not patch, but patch set instead. I was looking for a patch with that
title and didn't find any, only a cover letter with the subject:
"[PATCH 0/8 v2] device delete by devid".

> under title
>   [PATCH] device delete by devid
>
> When this patch is not found in the btrfs-progs, this test
> will not run. However, when the required patch is not found
> in the kernel it will fail gracefully.

What's the status of these patches (both kernel and btrfs-progs)? I've
noticed they've been around since April with no feedback on the mailing
list.

>
> Signed-off-by: Anand Jain 
> ---
> v1->v2: accepts Dave Chinner's review comments, thanks
>
>  common/rc   |  6 +
>  tests/btrfs/088 | 71 
> +
>  tests/btrfs/088.out |  2 ++
>  tests/btrfs/group   |  1 +
>  4 files changed, 80 insertions(+)
>  create mode 100755 tests/btrfs/088
>  create mode 100644 tests/btrfs/088.out
>
> diff --git a/common/rc b/common/rc
> index 447ab7f..aca7a62 100644
> --- a/common/rc
> +++ b/common/rc
> @@ -2685,6 +2685,12 @@ _require_test_fcntl_advisory_locks()
> _notrun "Require fcntl advisory locks support"
>  }
>
> +_require_btrfs_dev_del_by_devid()
> +{
> +   $BTRFS_UTIL_PROG device delete --help | egrep devid > /dev/null 2>&1
> +   [ $? -eq 0 ] || _notrun "$BTRFS_UTIL_PROG too old (must support 'btrfs device delete  /')"
> +}
> +
>  _get_total_inode()
>  {
> if [ -z "$1" ]; then
> diff --git a/tests/btrfs/088 b/tests/btrfs/088
> new file mode 100755
> index 000..87814ec
> --- /dev/null
> +++ b/tests/btrfs/088
> @@ -0,0 +1,71 @@
> +#! /bin/bash
> +# FS QA Test No. btrfs/088
> +#
> +# test device delete when the source device has EIO
> +#
> +# Copyright (c) 2015 Oracle.  All Rights Reserved.
> +#
> +# This program is free software; you can redistribute it and/or
> +# modify it under the terms of the GNU General Public License as
> +# published by the Free Software Foundation.
> +#
> +# This program is distributed in the hope that it would be useful,
> +# but WITHOUT ANY WARRANTY; without even the implied warranty of
> +# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> +# GNU General Public License for more details.
> +#
> +# You should have received a copy of the GNU General Public License
> +# along with this program; if not, write the Free Software Foundation,
> +# Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301  USA
> +#
> +
> +seq=`basename $0`
> +seqres=$RESULT_DIR/$seq
> +echo "QA output created by $seq"
> +
> +here=`pwd`
> +tmp=/tmp/$$
> +
> +status=1   # failure is the default!
> +trap "_cleanup; exit \$status" 0 1 2 3 15
> +
> +
> +_cleanup()
> +{
> +   _cleanup_dmerror
> +   rm -f $tmp
> +}
> +
> +# get standard environment, filters and checks
> +. ./common/rc
> +. ./common/filter
> +. ./common/dmerror
> +
> +_supported_fs btrfs
> +_supported_os Linux
> +_need_to_be_root
> +_require_scratch_dev_pool 3
> +_require_btrfs_dev_del_by_devid
> +_require_dmerror
> +
> +rm -f $seqres.full
> +
> +dev1="`echo $SCRATCH_DEV_POOL | $AWK_PROG '{print $2}'`"
> +dev2="`echo $SCRATCH_DEV_POOL | $AWK_PROG '{print $3}'`"
> +
> +_init_dmerror
> +_scratch_mkfs_dmerror "-f -d raid1 -m raid1 $dev1 $dev2"
> +_mount_dmerror
> +
> +#$BTRFS_UTIL_PROG filesystem show -m $SCRATCH_MNT | _filter_btrfs_filesystem_show
> +
> +error_devid=`$BTRFS_UTIL_PROG filesystem show -m $SCRATCH_MNT |\
> +   egrep $DMERROR_DEV | $AWK_PROG '{print $2}'`

Any reason to not run here fsstress like the test from patch 2? Doing
the device delete with a non-empty fs is a lot more interesting and
can help find bugs and regressions in the future.

> +
> +# now load the error into the DMERROR_DEV
> +_load_dmerror_table
> +
> +_run_btrfs_util_prog device delete $error_devid $SCRATCH_MNT
> +
> +echo "Silence is golden"
> +status=0; exit
> diff --git a/tests/btrfs/088.out b/tests/btrfs/088.out
> new file mode 100644
> index 000..c24480a
> --- /dev/null
> +++ b/tests/btrfs/088.out
> @@ -0,0 +1,2 @@
> +QA output created by 088
> +Silence is golden
> diff --git a/tests/btrfs/group b/tests/btrfs

Re: [PATCH v4 2/3] xfstests: btrfs: test device replace, with EIO on the src dev

2015-08-07 Thread Filipe David Manana
On Wed, Jul 22, 2015 at 11:14 AM, Anand Jain  wrote:
> From: Anand Jain 
>
> This test case will test to confirm the replace works when
> the replacing source device has EIO errors.
>
> EIO condition is achieved using the DM device.
>
> Signed-off-by: Anand Jain 
> ---
> v3->v4: rebase on latest xfstests code
> v2->v3: accepts Filipe Manana's review comments, thanks
> v1->v2: accepts Dave Chinner's review comments, thanks
>
>  tests/btrfs/095 | 76 
> +
>  tests/btrfs/095.out | 10 +++
>  tests/btrfs/group   |  1 +
>  3 files changed, 87 insertions(+)
>  create mode 100755 tests/btrfs/095
>  create mode 100644 tests/btrfs/095.out
>
> diff --git a/tests/btrfs/095 b/tests/btrfs/095
> new file mode 100755
> index 000..1da856f
> --- /dev/null
> +++ b/tests/btrfs/095
> @@ -0,0 +1,76 @@
> +#! /bin/bash
> +# FS QA Test No. btrfs/095
> +#
> +#test device replace works when the source device has EIO
> +#
> +# Copyright (c) 2015 Oracle.  All Rights Reserved.
> +#
> +# This program is free software; you can redistribute it and/or
> +# modify it under the terms of the GNU General Public License as
> +# published by the Free Software Foundation.
> +#
> +# This program is distributed in the hope that it would be useful,
> +# but WITHOUT ANY WARRANTY; without even the implied warranty of
> +# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> +# GNU General Public License for more details.
> +#
> +# You should have received a copy of the GNU General Public License
> +# along with this program; if not, write the Free Software Foundation,
> +# Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301  USA
> +#
> +
> +seq=`basename $0`
> +seqres=$RESULT_DIR/$seq
> +echo "QA output created by $seq"
> +
> +here=`pwd`
> +tmp=/tmp/$$
> +
> +status=1   # failure is the default!
> +trap "_cleanup; exit \$status" 0 1 2 3 15
> +
> +
> +_cleanup()
> +{
> +   _cleanup_dmerror
> +   rm -f $tmp
> +}
> +
> +# get standard environment, filters and checks
> +. ./common/rc
> +. ./common/filter
> +. ./common/filter.btrfs
> +. ./common/dmerror
> +
> +_supported_fs btrfs
> +_supported_os Linux
> +_need_to_be_root
> +_require_scratch_dev_pool 3
> +_require_dmerror
> +
> +rm -f $seqres.full
> +
> +dev1="`echo $SCRATCH_DEV_POOL | $AWK_PROG '{print $2}'`"
> +dev2="`echo $SCRATCH_DEV_POOL | $AWK_PROG '{print $3}'`"
> +
> +_init_dmerror
> +_scratch_mkfs_dmerror "-f -d raid1 -m raid1 $dev1"
> +_mount_dmerror
> +
> +$BTRFS_UTIL_PROG filesystem show -m $SCRATCH_MNT | _filter_btrfs_filesystem_show
> +
> +error_devid=`$BTRFS_UTIL_PROG filesystem show -m $SCRATCH_MNT |\
> +   egrep $DMERROR_DEV | $AWK_PROG '{print $2}'`
> +
> +snapshot_cmd="$BTRFS_UTIL_PROG subvolume snapshot -r $SCRATCH_MNT"
> +snapshot_cmd="$snapshot_cmd $SCRATCH_MNT/snap_\`date +'%H_%M_%S_%N'\`"
> +run_check $FSSTRESS_PROG -d $SCRATCH_MNT -n 200 -p 8 $FSSTRESS_AVOID -x "$snapshot_cmd" -X 50 >&/dev/null

Please keep the line length up to 80 characters.

> +
> +# now load the error into the DMERROR_DEV
> +_load_dmerror_table
> +
> +_run_btrfs_util_prog replace start -B $error_devid $dev2 $SCRATCH_MNT
> +
> +$BTRFS_UTIL_PROG filesystem show -m $SCRATCH_MNT | _filter_btrfs_filesystem_show
> +
> +status=0; exit
> diff --git a/tests/btrfs/095.out b/tests/btrfs/095.out
> new file mode 100644
> index 000..9af70bb
> --- /dev/null
> +++ b/tests/btrfs/095.out
> @@ -0,0 +1,10 @@
> +QA output created by 095
> +Label: none  uuid: 

So the test always fails due to a mismatch with this expected golden output:

-Label: none  uuid: <UUID>
+Label: none  uuid:  <UUID>
  Total devices  FS bytes used 
  devid  size  used  path SCRATCH_DEV
  devid  size  used  path /dev/mapper/error-test

The extra space after "uuid:" comes from _filter_uuid:

   sed -e "s/\(uuid[ :=]\+\) *[0-9a-f-][0-9a-f-]*/\1 <UUID>/ig"

Which gets \1 with the string "uuid: " and then does "uuid: " + " " + "<UUID>".

We need to either fix the golden output here or _filter_uuid (and
make sure it doesn't break any other tests using it directly or
indirectly such as yours).
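
The doubled space can be reproduced outside the test suite with a small
sketch, assuming the filter's replacement text is the <UUID> placeholder and
that GNU sed is in use. The function names and the sample UUID are
hypothetical, and the second function is only one possible variant, not a
proposed patch.

```shell
#!/bin/bash
# Reproduce the doubled space. The "<UUID>" replacement text is an assumption
# made for this sketch. Note: \+ and the "i" flag are GNU sed extensions.
filter_uuid_current()
{
	sed -e "s/\(uuid[ :=]\+\) *[0-9a-f-][0-9a-f-]*/\1 <UUID>/ig"
}

# Hypothetical variant: drop the literal space in the replacement, since \1
# already ends with the separator that was matched.
filter_uuid_variant()
{
	sed -e "s/\(uuid[ :=]\+\) *[0-9a-f-][0-9a-f-]*/\1<UUID>/ig"
}

line="Label: none  uuid: 0c5b4e4a-7d4b-4d2f-9a3c-1f2e3d4c5b6a"
echo "$line" | filter_uuid_current   # two spaces after "uuid:"
echo "$line" | filter_uuid_variant   # one space after "uuid:"
```

A variant like the second one would change the output for every other user of
the filter too, which is why the existing golden outputs would need checking.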

Other than that, the test looks good and when fixed you can add:

Reviewed-by: Filipe Manana 


> +   Total devices  FS bytes used 
> +   devid  size  used  path SCRATCH_DEV
> +   devid  size  used  path /dev/mapper/error-test
> +
> +Label: none  uuid: 
> +   Total devices  F

Re: [PATCH v4 1/3] xfstests: btrfs: add functions to create dm-error device

2015-08-07 Thread Filipe David Manana
On Wed, Jul 22, 2015 at 11:14 AM, Anand Jain  wrote:
> From: Anand Jain 
>
> Controlled EIO from the device is achieved using the dm device.
> Helper functions are at common/dmerror.
>
> Broadly steps will include calling _init_dmerror().
> _init_dmerror() will use SCRATCH_DEV to create dm linear device and assign
> DMERROR_DEV to /dev/mapper/error-test.
>
> When test script is ready to get EIO, the test cases can call
> _load_dmerror_table() which then it will load the dm error.
> so that reading DMERROR_DEV will cause EIO. After the test case is
> complete, cleanup must be done by calling _cleanup_dmerror().
>
> Signed-off-by: Anand Jain 
> ---
> v3->v4: rebase on latest xfstests code
> v2.1->v3: accepts Filipe Manana's review comments, thanks
> v2->v2.1: fixed missed typo error fixup in the commit.
> v1->v2: accepts Dave Chinner's review comments, thanks
>
>  common/dmerror | 69 
> ++
>  common/rc  | 15 +
>  2 files changed, 84 insertions(+)
>  create mode 100644 common/dmerror
>
> diff --git a/common/dmerror b/common/dmerror
> new file mode 100644
> index 000..f895d90
> --- /dev/null
> +++ b/common/dmerror
> @@ -0,0 +1,69 @@
> +##/bin/bash
> +#
> +# Copyright (c) 2015 Oracle.  All Rights Reserved.
> +#
> +# This program is free software; you can redistribute it and/or
> +# modify it under the terms of the GNU General Public License as
> +# published by the Free Software Foundation.
> +#
> +# This program is distributed in the hope that it would be useful,
> +# but WITHOUT ANY WARRANTY; without even the implied warranty of
> +# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> +# GNU General Public License for more details.
> +#
> +# You should have received a copy of the GNU General Public License
> +# along with this program; if not, write the Free Software Foundation,
> +# Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301  USA
> +#
> +#
> +# common functions for setting up and tearing down a dmerror device
> +
> +_init_dmerror()
> +{
> +   $DMSETUP_PROG remove error-test > /dev/null 2>&1
> +
> +   local BLK_DEV_SIZE=`blockdev --getsz $SCRATCH_DEV`
> +
> +   DMERROR_DEV='/dev/mapper/error-test'
> +
> +   DMLINEAR_TABLE="0 $BLK_DEV_SIZE linear $SCRATCH_DEV 0"
> +
> +   $DMSETUP_PROG create error-test --table "$DMLINEAR_TABLE" || \
> +   _fatal "failed to create dm linear device"
> +
> +   DMERROR_TABLE="0 $BLK_DEV_SIZE error $SCRATCH_DEV 0"
> +}
> +
> +_scratch_mkfs_dmerror()
> +{
> +   $MKFS_BTRFS_PROG $* $DMERROR_DEV >> $seqres.full 2>&1 || \
> +   _fatal "failed to create mkfs.btrfs $* $DMERROR_DEV"
> +}
> +
> +_mount_dmerror()
> +{
> +   mount -t $FSTYP $MOUNT_OPTIONS $DMERROR_DEV $SCRATCH_MNT
> +}
> +
> +_unmount_dmerror()
> +{
> +   $UMOUNT_PROGS $SCRATCH_MNT
> +}
> +
> +_cleanup_dmerror()
> +{
> +   $UMOUNT_PROG $SCRATCH_MNT > /dev/null 2>&1
> +   $DMSETUP_PROG remove error-test > /dev/null 2>&1
> +}
> +
> +_load_dmerror_table()
> +{
> +   $DMSETUP_PROG suspend error-test
> +   [ $? -ne 0 ] && _fatal  "failed to suspend error-test"
> +
> +   $DMSETUP_PROG load error-test --table "$DMERROR_TABLE"
> +   [ $? -ne 0 ] && _fatal "failed to load error table error-test"
> +
> +   $DMSETUP_PROG resume error-test
> +   [ $? -ne 0 ] && _fatal  "failed to resume error-test"
> +}
> diff --git a/common/rc b/common/rc
> index 610045e..ff0732a 100644
> --- a/common/rc
> +++ b/common/rc
> @@ -1336,6 +1336,21 @@ _require_sane_bdev_flush()
> fi
>  }
>
> +# this test requires the device mapper error target
> +#
> +_require_dmerror()
> +{
> +_require_command "$DMSETUP_PROG" dmsetup
> +
> +$DMSETUP_PROG targets | grep error >/dev/null 2>&1
> +if [ $? -eq 0 ]
> +then
> +   :
> +else
> +   _notrun "This test requires dm error support"
> +fi

Why not just:

[ $? -ne 0 ] && _notrun "This test requires dm error support"

The empty branch doesn't make much sense.

Also, please indent the body of this function with tabs (8 columns wide),
which is the official style for fstests (see the surrounding functions
for examples).

Other than that, it looks good to me. You can add:

Reviewed-by: Filipe Manana 

> +}
> +
>  # this test requires the device mapper flakey target
>  #
>  _require_dm_flakey()
> --
> 2.4.1
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html



-- 
Filipe David Manana,

"Reasonable men adapt themselves to the world.
 Unreasonable men adapt the world to themselves.
 That's why all progress depends on unreasonable men."
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2] fstests: btrfs: Add regression test for reserved space leak.

2015-08-05 Thread Filipe David Manana
On Wed, Aug 5, 2015 at 2:08 AM, Qu Wenruo  wrote:
> The regression is introduced in v4.2-rc1, with the big btrfs qgroup
> change.
> The problem is, qgroup reserved space is never freed, causing even we
> increase the limit, we can still hit the EDQUOT much faster than it
> should.
>
> Reported-by: Tsutomu Itoh 
> Signed-off-by: Qu Wenruo 
Reviewed-by: Filipe Manana 

Thanks!

> ---
>  tests/btrfs/089 | 88 
> +
>  tests/btrfs/089.out |  5 +++
>  tests/btrfs/group   |  1 +
>  3 files changed, 94 insertions(+)
>  create mode 100755 tests/btrfs/089
>  create mode 100644 tests/btrfs/089.out
>
> diff --git a/tests/btrfs/089 b/tests/btrfs/089
> new file mode 100755
> index 000..82db96c
> --- /dev/null
> +++ b/tests/btrfs/089
> @@ -0,0 +1,88 @@
> +#! /bin/bash
> +# FS QA Test 089
> +#
> +# Regression test for btrfs qgroup reserved space leak.
> +#
> +# Due to qgroup reserved space leak, EDQUOT can be trigged even it's not
> +# over limit after previous write.
> +#
> +#---
> +# Copyright (c) 2015 Fujitsu. All Rights Reserved.
> +#
> +# This program is free software; you can redistribute it and/or
> +# modify it under the terms of the GNU General Public License as
> +# published by the Free Software Foundation.
> +#
> +# This program is distributed in the hope that it would be useful,
> +# but WITHOUT ANY WARRANTY; without even the implied warranty of
> +# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> +# GNU General Public License for more details.
> +#
> +# You should have received a copy of the GNU General Public License
> +# along with this program; if not, write the Free Software Foundation,
> +# Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301  USA
> +#---
> +#
> +
> +seq=`basename $0`
> +seqres=$RESULT_DIR/$seq
> +echo "QA output created by $seq"
> +
> +here=`pwd`
> +tmp=/tmp/$$
> +status=1   # failure is the default!
> +trap "_cleanup; exit \$status" 0 1 2 3 15
> +
> +_cleanup()
> +{
> +   cd /
> +   rm -f $tmp.*
> +}
> +
> +# get standard environment, filters and checks
> +. ./common/rc
> +. ./common/filter
> +
> +# real QA test starts here
> +
> +# Modify as appropriate.
> +_supported_fs btrfs
> +_supported_os Linux
> +_require_scratch
> +_need_to_be_root
> +
> +# Use big blocksize to ensure there is still enough space left
> +# for metadata reserve after hitting EDQUOT
> +BLOCKSIZE=$(( 2 * 1024 * 1024 ))
> +FILESIZE=$(( 128 * 1024 * 1024 )) # 128Mbytes
> +
> +# The last block won't be able to finish write, as metadata takes
> +# $NODESIZE space, causing the last block triggering EDQUOT
> +LENGTH=$(( $FILESIZE - $BLOCKSIZE ))
> +
> +_scratch_mkfs >>$seqres.full 2>&1
> +_scratch_mount
> +_require_fs_space $SCRATCH_MNT $(($FILESIZE * 2 / 1024))
> +
> +_run_btrfs_util_prog quota enable $SCRATCH_MNT
> +_run_btrfs_util_prog qgroup limit $FILESIZE 5 $SCRATCH_MNT
> +
> +$XFS_IO_PROG -f -c "pwrite -b $BLOCKSIZE 0 $LENGTH" \
> +   $SCRATCH_MNT/foo | _filter_xfs_io
> +
> +# A sync is needed to trigger a commit_transaction.
> +# As the reserved space freeing happens at commit_transaction time,
> +# without a transaction commit, no reserved space needs freeing and
> +# won't trigger the bug.
> +sync
> +
> +# Double the limit to allow further write
> +_run_btrfs_util_prog qgroup limit $(($FILESIZE * 2)) 5 $SCRATCH_MNT
> +
> +# Test whether further write can succeed
> +$XFS_IO_PROG -f -c "pwrite -b $BLOCKSIZE $LENGTH $LENGTH" \
> +   $SCRATCH_MNT/foo | _filter_xfs_io
> +
> +# success, all done
> +status=0
> +exit
> diff --git a/tests/btrfs/089.out b/tests/btrfs/089.out
> new file mode 100644
> index 000..396888f
> --- /dev/null
> +++ b/tests/btrfs/089.out
> @@ -0,0 +1,5 @@
> +QA output created by 089
> +wrote 132120576/132120576 bytes at offset 0
> +XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
> +wrote 132120576/132120576 bytes at offset 132120576
> +XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
> diff --git a/tests/btrfs/group b/tests/btrfs/group
> index ffe18bf..225b532 100644
> --- a/tests/btrfs/group
> +++ b/tests/btrfs/group
> @@ -91,6 +91,7 @@
>  086 auto quick clone
>  087 auto quick send
>  088 auto quick metadata
> +089 auto quick qgroup
>  090 auto quick metadata
>  091 auto quick qgroup
>  092 auto quick send
> --
> 1.8.3.1

Re: [PATCH 2/3] Btrfs: fix null pointer dereference when extent buffer is already freed

2015-08-04 Thread Filipe David Manana
On Tue, Aug 4, 2015 at 3:43 PM, Anand Jain  wrote:
> When, read_tree_block() returns error it has already freed the extent_buffer
>
> read_tree_block(..)
> {
> ::
> ret = btree_read_extent_buffer_pages(root, buf, 0, parent_transid); <== fails
> if (ret) {
> free_extent_buffer(buf); <=== its freed already
> return ERR_PTR(ret);
> }
> ::
> }
>
> open_ctree()
> {
> ::
> chunk_root->node = read_tree_block(..);
> ::
>
> BTRFS: failed to read chunk root on sdf
> BUG: unable to handle kernel NULL pointer dereference at 001f
> IP: [] free_extent_buffer+0x1e/0xb0 [btrfs]
> ::
> [] ? free_root_extent_buffers+0x1e/0x40 [btrfs]
> [] free_root_pointers+0x56/0x60 [btrfs]
> [] open_ctree+0x19e0/0x2360 [btrfs]
> [] btrfs_mount+0x9e6/0xb10 [btrfs]
> [] ? find_next_zero_bit+0x1a/0x30
> [] ? find_next_bit+0x15/0x30
> [] ? pcpu_alloc+0x35a/0x680
> [] mount_fs+0x38/0x190
> [] ? __alloc_percpu+0x15/0x20
> [] vfs_kern_mount+0x6b/0x120
> [] btrfs_mount+0x1f8/0xb10 [btrfs]
> [] ? pcpu_alloc+0x35a/0x680
> [] mount_fs+0x38/0x190
> [] ? __alloc_percpu+0x15/0x20
> [] vfs_kern_mount+0x6b/0x120
> [] do_mount+0x224/0xb80
> [] SyS_mount+0x7e/0xe0
> [] system_call_fastpath+0x12/0x71
>
> this patch avoids calling it again
>
> Signed-off-by: Anand Jain 
> ---
>  fs/btrfs/disk-io.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
> index da7b7bf..e8901ac 100644
> --- a/fs/btrfs/disk-io.c
> +++ b/fs/btrfs/disk-io.c
> @@ -2081,7 +2081,8 @@ static void btrfs_stop_all_workers(struct btrfs_fs_info *fs_info)
>  static void free_root_extent_buffers(struct btrfs_root *root)
>  {
> if (root) {
> -   free_extent_buffer(root->node);
> +   if (!IS_ERR(root->node))
> +   free_extent_buffer(root->node);

Wasn't this already solved by commit [1], added in the last 4.2-rc?
Either way, it's better to avoid having an error pointer in root->node
in the first place; it should be either NULL or a valid (non-error)
address (like what's done everywhere else).

[1] https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=95ab1f64908795a2edd6b847eca94a0c63a44be4
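
To illustrate the convention suggested above, here is a rough userspace
sketch (not kernel code). err_ptr(), is_err(), struct toy_root and
toy_set_node() are made-up stand-ins for ERR_PTR(), IS_ERR() and the
root->node assignment. It assumes the usual encoding in which the top
4095 address values represent errno codes.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumption: mirrors the kernel's MAX_ERRNO convention. */
#define TOY_MAX_ERRNO 4095

/* Encode a negative errno value as a pointer (stand-in for ERR_PTR()). */
void *err_ptr(long error)
{
	return (void *)(intptr_t)error;
}

/* True if the pointer encodes an errno value (stand-in for IS_ERR()). */
int is_err(const void *ptr)
{
	return (uintptr_t)ptr >= (uintptr_t)-TOY_MAX_ERRNO;
}

/* Hypothetical stand-in for a btrfs root; only the field we care about. */
struct toy_root {
	void *node;
};

/*
 * Store the result of a tree block read. On failure the error is
 * returned to the caller and node is left NULL, so later cleanup code
 * never sees an error pointer and needs no IS_ERR() check.
 */
long toy_set_node(struct toy_root *root, void *result)
{
	if (is_err(result)) {
		root->node = NULL;
		return (long)(intptr_t)result;
	}
	root->node = result;
	return 0;
}
```

With this shape, a cleanup path only needs to handle NULL or a valid
address, which is what the rest of the code already expects.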

> free_extent_buffer(root->commit_root);
> root->node = NULL;
> root->commit_root = NULL;
> --
> 2.4.1


Re: [PATCH] fstests: btrfs: Add regression test for reserved space leak.

2015-08-04 Thread Filipe David Manana
On Tue, Aug 4, 2015 at 7:27 AM, Qu Wenruo  wrote:
> The regression is introduced in v4.2-rc1, with the big btrfs qgroup
> change.
> The problem is, qgroup reserved space is never freed, causing even we
> increase the limit, we can still hit the EDQUOT much faster than it
> should.
>
> Reported-by: Tsutomu Itoh 
> Signed-off-by: Qu Wenruo 

Thanks for doing this, Qu.
The test fails without the btrfs fix and passes with it, as expected.
However, one question below:

> ---
>  tests/btrfs/089 | 83 
> +
>  tests/btrfs/089.out |  5 
>  tests/btrfs/group   |  1 +
>  3 files changed, 89 insertions(+)
>  create mode 100755 tests/btrfs/089
>  create mode 100644 tests/btrfs/089.out
>
> diff --git a/tests/btrfs/089 b/tests/btrfs/089
> new file mode 100755
> index 000..0c018f2
> --- /dev/null
> +++ b/tests/btrfs/089
> @@ -0,0 +1,83 @@
> +#! /bin/bash
> +# FS QA Test 089
> +#
> +# Regression test for btrfs qgroup reserved space leak.
> +#
> +# Due to qgroup reserved space leak, EDQUOT can be trigged even it's not
> +# over limit after previous write.
> +#
> +#---
> +# Copyright (c) 2015 Fujitsu. All Rights Reserved.
> +#
> +# This program is free software; you can redistribute it and/or
> +# modify it under the terms of the GNU General Public License as
> +# published by the Free Software Foundation.
> +#
> +# This program is distributed in the hope that it would be useful,
> +# but WITHOUT ANY WARRANTY; without even the implied warranty of
> +# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> +# GNU General Public License for more details.
> +#
> +# You should have received a copy of the GNU General Public License
> +# along with this program; if not, write the Free Software Foundation,
> +# Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301  USA
> +#---
> +#
> +
> +seq=`basename $0`
> +seqres=$RESULT_DIR/$seq
> +echo "QA output created by $seq"
> +
> +here=`pwd`
> +tmp=/tmp/$$
> +status=1   # failure is the default!
> +trap "_cleanup; exit \$status" 0 1 2 3 15
> +
> +_cleanup()
> +{
> +   cd /
> +   rm -f $tmp.*
> +}
> +
> +# get standard environment, filters and checks
> +. ./common/rc
> +. ./common/filter
> +
> +# real QA test starts here
> +
> +# Modify as appropriate.
> +_supported_fs btrfs
> +_supported_os Linux
> +_require_scratch
> +_need_to_be_root
> +
> +# Use big blocksize to ensure there is still enough space left
> +# for metadata reserve after hitting EDQUOT
> +BLOCKSIZE=$(( 2 * 1024 * 1024 ))
> +FILESIZE=$(( 128 * 1024 * 1024 )) # 128Mbytes
> +
> +# The last block won't be able to finish write, as metadata takes
> +# $NODESIZE space, causing the last block triggering EDQUOT
> +LENGTH=$(( $FILESIZE - $BLOCKSIZE ))
> +
> +_scratch_mkfs >>$seqres.full 2>&1
> +_scratch_mount
> +_require_fs_space $SCRATCH_MNT $(($FILESIZE * 2 / 1024))
> +
> +_run_btrfs_util_prog quota enable $SCRATCH_MNT
> +_run_btrfs_util_prog qgroup limit $FILESIZE 5 $SCRATCH_MNT
> +
> +$XFS_IO_PROG -f -c "pwrite -b $BLOCKSIZE 0 $LENGTH" \
> +   $SCRATCH_MNT/foo | _filter_xfs_io
> +sync

Why is the sync needed here? Can you add a comment explaining why? It
isn't trivial/obvious (to me at least), especially because without the
call to "sync" the test passes even without the btrfs fix.

thanks

> +
> +# Double the limit to allow further write
> +_run_btrfs_util_prog qgroup limit $(($FILESIZE * 2)) 5 $SCRATCH_MNT
> +
> +# Test whether further write can succeed
> +$XFS_IO_PROG -f -c "pwrite -b $BLOCKSIZE $LENGTH $LENGTH" \
> +   $SCRATCH_MNT/foo | _filter_xfs_io
> +
> +# success, all done
> +status=0
> +exit
> diff --git a/tests/btrfs/089.out b/tests/btrfs/089.out
> new file mode 100644
> index 000..396888f
> --- /dev/null
> +++ b/tests/btrfs/089.out
> @@ -0,0 +1,5 @@
> +QA output created by 089
> +wrote 132120576/132120576 bytes at offset 0
> +XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
> +wrote 132120576/132120576 bytes at offset 132120576
> +XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
> diff --git a/tests/btrfs/group b/tests/btrfs/group
> index ffe18bf..225b532 100644
> --- a/tests/btrfs/group
> +++ b/tests/btrfs/group
> @@ -91,6 +91,7 @@
>  086 auto quick clone
>  087 auto quick send
>  088 auto quick metadata
> +089 auto quick qgroup
>  090 auto quick metadata
>  091 auto quick qgroup
>  092 auto quick send
> --

Re: [PATCH][RESEND] btrfs: fix search key advancing condition

2015-07-31 Thread Filipe David Manana
On Tue, Jun 30, 2015 at 3:25 AM, Naohiro Aota  wrote:
> The search key advancing condition used in copy_to_sk() is loose. It can
> advance the key even if it reaches sk->max_*: e.g. when the max key = (512,
> 1024, -1) and the current key = (512, 1025, 10), it increments the
> offset by 1, continues hopeless search from (512, 1025, 11). This issue
> make ioctl() to take unexpectedly long time scanning all the leaf a blocks
> one by one.
>
> This commit fix the problem using standard way of key comparison:
> btrfs_comp_cpu_keys()
>
> Signed-off-by: Naohiro Aota 
Reviewed-by: Filipe Manana 

thanks
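
For readers following along, the advancing logic after this patch can
be modelled in userspace roughly as below. This is only a sketch:
struct search_key, comp_keys() and advance_key() are illustrative names
and not the kernel's, and comp_keys() assumes the same ordering as
btrfs_comp_cpu_keys() (objectid, then type, then offset).

```c
#include <stdint.h>

/* Illustrative stand-in for struct btrfs_key. */
struct search_key {
	uint64_t objectid;
	uint8_t type;
	uint64_t offset;
};

/* Three-way compare: objectid first, then type, then offset. */
int comp_keys(const struct search_key *a, const struct search_key *b)
{
	if (a->objectid != b->objectid)
		return a->objectid < b->objectid ? -1 : 1;
	if (a->type != b->type)
		return a->type < b->type ? -1 : 1;
	if (a->offset != b->offset)
		return a->offset < b->offset ? -1 : 1;
	return 0;
}

/*
 * Advance key to the next possible key. Returns 1 when the search is
 * finished (key already at or beyond max, or nothing left to advance
 * to) and 0 when key was advanced.
 */
int advance_key(struct search_key *key, const struct search_key *max)
{
	if (comp_keys(key, max) >= 0)
		return 1;

	if (key->offset < UINT64_MAX) {
		key->offset++;
	} else if (key->type < UINT8_MAX) {
		key->offset = 0;
		key->type++;
	} else if (key->objectid < UINT64_MAX) {
		key->offset = 0;
		key->type = 0;
		key->objectid++;
	} else {
		return 1;
	}
	return 0;
}
```

Taking a scaled-down version of the case from the changelog, with
max = (512, 100, (u64)-1) and a current key of (512, 101, 10), the old
per-field checks would bump the offset to 11 and keep searching, while
the whole-key comparison stops right away.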

> ---
>  fs/btrfs/ioctl.c | 12 +---
>  1 file changed, 9 insertions(+), 3 deletions(-)
>
> diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
> index 1c22c65..07dc01d 100644
> --- a/fs/btrfs/ioctl.c
> +++ b/fs/btrfs/ioctl.c
> @@ -1932,6 +1932,7 @@ static noinline int copy_to_sk(struct btrfs_root *root,
> u64 found_transid;
> struct extent_buffer *leaf;
> struct btrfs_ioctl_search_header sh;
> +   struct btrfs_key test;
> unsigned long item_off;
> unsigned long item_len;
> int nritems;
> @@ -2015,12 +2016,17 @@ static noinline int copy_to_sk(struct btrfs_root *root,
> }
>  advance_key:
> ret = 0;
> -   if (key->offset < (u64)-1 && key->offset < sk->max_offset)
> +   test.objectid = sk->max_objectid;
> +   test.type = sk->max_type;
> +   test.offset = sk->max_offset;
> +   if (btrfs_comp_cpu_keys(key, &test) >= 0)
> +   ret = 1;
> +   else if (key->offset < (u64)-1)
> key->offset++;
> -   else if (key->type < (u8)-1 && key->type < sk->max_type) {
> +   else if (key->type < (u8)-1) {
> key->offset = 0;
> key->type++;
> -   } else if (key->objectid < (u64)-1 && key->objectid < sk->max_objectid) {
> +   } else if (key->objectid < (u64)-1) {
> key->offset = 0;
> key->type = 0;
>     key->objectid++;
> --
> 2.4.4


Re: Strange data backref offset?

2015-07-29 Thread Filipe David Manana
On Fri, Jul 17, 2015 at 3:38 AM, Qu Wenruo  wrote:
> Hi all,
>
> While I'm developing a new btrfs inband dedup mechanism, I found btrfsck and
> kernel doing strange behavior for clone.
>
> [Reproducer]
> # mount /dev/sdc -t btrfs /mnt/test
> # dd if=/dev/zero of=/mnt/test/file1 bs=4K count=4
> # sync
> # ~/xfstests/src/cloner -s 4096 -l 4096 /mnt/test/file1 /mnt/test/file2
> # sync
>
> Then btrfs-debug-tree gives quite strange result on the data backref:
> --
> 
> item 4 key (12845056 EXTENT_ITEM 16384) itemoff 16047 itemsize 111
> extent refs 3 gen 6 flags DATA
> extent data backref root 5 objectid 257 offset 0 count 1
> extent data backref root 5 objectid 258 offset
> 18446744073709547520 count 1
>
> 
> item 8 key (257 EXTENT_DATA 0) itemoff 15743 itemsize 53
> extent data disk byte 12845056 nr 16384
> extent data offset 0 nr 16384 ram 16384
> extent compression 0
> item 9 key (257 EXTENT_DATA 16384) itemoff 15690 itemsize 53
> extent data disk byte 12845056 nr 16384
> extent data offset 4096 nr 4096 ram 16384
> extent compression 0
> --
>
> The offset is file extent's key.offset - file exntent's offset,
> Which is 0 - 4096, causing the overflow result.
>
> Kernel and fsck all uses that behavior, so fsck can pass the strange thing.
>
> But shouldn't the offset in data backref matches with the key.offset of the
> file extent?
>
> And I'm quite sure the change of behavior can hugely break the fsck and
> kernel, but I'm wondering is this a known BUG or feature, and will it be
> handled?

Obviously a bug.

I was recently investigating incremental send failures after
cloning/deduping extents and that led me to this as well.
It's a bug, but it's not too bad as it affects only backref walking,
which can have a simple workaround (I just sent a patch for it). For
the purposes of incrementing/decrementing the data backref's count we
do the same calculation everywhere, always leading to the same large
and unexpected value, so we don't get bogus backrefs added/left
around.
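
The large value in the dump is just unsigned 64-bit wraparound of the
subtraction described in the report. A minimal sketch, where
data_backref_offset() is a made-up helper name and not a kernel
function:

```c
#include <stdint.h>

/*
 * Backref offset as described in the report: the file extent item's
 * key offset minus the extent data offset, computed in unsigned
 * 64-bit arithmetic (so it wraps instead of going negative).
 */
uint64_t data_backref_offset(uint64_t key_offset, uint64_t extent_offset)
{
	return key_offset - extent_offset;
}
```

For key offset 0 and extent data offset 4096 this gives 2^64 - 4096 =
18446744073709547520, which is the value btrfs-debug-tree printed.
Since every caller does the same subtraction, they all arrive at the
same wrapped value.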

>
> Thanks,
> Qu


Re: [PATCH] Btrfs: fix warning in backref walking

2015-07-27 Thread Filipe David Manana
On Mon, Jul 27, 2015 at 9:15 AM, Liu Bo  wrote:
> When we do backref walking, we search firstly in queued delayed refs
> and then the on-disk backrefs, but we parse differently for shared
> references, for delayed refs we also add 'ref->root' while for on-disk
> backrefs we don't, this can prevent us from merging refs indexed
> by the same bytenr and cause find_parent_nodes() to throw a warning at
> 'WARN_ON(ref->count < 0)', for example, when we have a shared data extent
> with 'ref_cnt=1' and a delayed shared data with a BTRFS_DROP_DELAYED_REF,
> that happens.
>
> For shared references, no matter if it's delayed or on-disk, ref->root is
> not at all used, instead it's ref->parent that really matters, so this has
> delayed refs handled as the same way as on-disk refs.
>
> Signed-off-by: Liu Bo 
> ---
>  fs/btrfs/backref.c | 5 ++---
>  1 file changed, 2 insertions(+), 3 deletions(-)
>
> diff --git a/fs/btrfs/backref.c b/fs/btrfs/backref.c
> index 802fabb..2485b868 100644
> --- a/fs/btrfs/backref.c
> +++ b/fs/btrfs/backref.c
> @@ -632,7 +632,7 @@ static int __add_delayed_refs(struct btrfs_delayed_ref_head *head, u64 seq,
> struct btrfs_delayed_tree_ref *ref;
>
> ref = btrfs_delayed_node_to_tree_ref(node);
> -   ret = __add_prelim_ref(prefs, ref->root, NULL,
> +   ret = __add_prelim_ref(prefs, 0, NULL,
>ref->level + 1, ref->parent,
>node->bytenr,
>node->ref_mod * sgn, GFP_ATOMIC);
> @@ -665,10 +665,9 @@ static int __add_delayed_refs(struct btrfs_delayed_ref_head *head, u64 seq,
>
> ref = btrfs_delayed_node_to_data_ref(node);
>
> -   key.objectid = ref->objectid;

Why remove only this line and not the following two as well, if key
isn't used anymore?

> key.type = BTRFS_EXTENT_DATA_KEY;
> key.offset = ref->offset;
> -   ret = __add_prelim_ref(prefs, ref->root, &key, 0,
> +   ret = __add_prelim_ref(prefs, 0, NULL, 0,
>ref->parent, node->bytenr,
>node->ref_mod * sgn, GFP_ATOMIC);
> break;

Do you have any reproducer to turn into an fstest? Would be nice,
since this is a rather critical part of the code.

thanks

> --
> 2.1.0


Re: btrfs_create_pending_block_groups:9460: errno=-27 unknown

2015-07-17 Thread Filipe David Manana
On Fri, Jul 17, 2015 at 7:36 PM, Omar Sandoval  wrote:
> Hey, Filipe,
>
> I've been seeing errors of this sort:
>
> [  658.221300] [ cut here ]
> [  658.221948] WARNING: CPU: 0 PID: 1636 at fs/btrfs/extent-tree.c:9460 
> btrfs_create_pending_block_groups+0x16b/0x210()
> [  658.223274] CPU: 0 PID: 1636 Comm: btrfs-transacti Not tainted 4.2.0-rc2 
> #65
> [  658.224205] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
> 1.7.5-20140709_153802- 04/01/2014
> [  658.225389]   7e9aaa02 880037a3bb78 
> 81a02490
> [  658.226435]  0026 880037a3bbd0 880037a3bbb8 
> 81080b81
> [  658.227494]  0200 88002c724000 88002c7241a8 
> 88003691c160
> [  658.228536] Call Trace:
> [  658.228860]  [] dump_stack+0x4c/0x6e
> [  658.229510]  [] warn_slowpath_common+0x81/0xc0
> [  658.230272]  [] warn_slowpath_fmt+0x55/0x70
> [  658.231032]  [] 
> btrfs_create_pending_block_groups+0x16b/0x210
> [  658.231987]  [] btrfs_start_dirty_block_groups+0xd1/0x3f0
> [  658.232852]  [] btrfs_commit_transaction+0x1c4/0xed0
> [  658.233692]  [] ? start_transaction+0xa4/0x730
> [  658.234477]  [] transaction_kthread+0x208/0x270
> [  658.235225]  [] ? btrfs_cleanup_transaction+0x700/0x700
> [  658.236078]  [] kthread+0xfe/0x120
> [  658.236861]  [] ? finish_task_switch+0x50/0x1a0
> [  658.237731]  [] ? __kthread_parkme+0xa0/0xa0
> [  658.238596]  [] ret_from_fork+0x3f/0x70
> [  658.239364]  [] ? __kthread_parkme+0xa0/0xa0
> [  658.240207] ---[ end trace f3b05c72d6a843fb ]---
> [  658.240887] BTRFS: error (device loop0) in 
> btrfs_create_pending_block_groups:9460: errno=-27 unknown
> [  658.242235] BTRFS info (device loop0): forced readonly
> [  658.338726] BTRFS warning (device loop0): Skipping commit of aborted 
> transaction.
> [  658.339629] BTRFS: error (device loop0) in cleanup_transaction:1710: 
> errno=-27 unknown
>
> on 4.2-rc2 which I tracked down to your commit 4fbcdf669454 ("Btrfs: fix
> -ENOSPC when finishing block group creation").
>
> Here's a reproducer, run on a 100TB Btrfs sparse image which is mounted
> over loopback:
>
> 
> truncate -s 100T big.img
> mkfs.btrfs big.img
> mount -o loop big.img /mnt/loop
>
> num=5
> for ((i = 0; i < num; i++)); do
> echo fallocate $i
> fallocate -l 10T /mnt/loop/testfile$i
> done
> btrfs filesystem sync /mnt/loop
>
> for ((i = 0; i < num; i++)); do
> echo rm $i
> rm /mnt/loop/testfile$i
> btrfs filesystem sync /mnt/loop
> done
>
> umount /mnt/loop
> 
>
> That works pre-4.2 but not with 4fbcdf669454 applied.
>
> That -27 is EFBIG which is coming from btrfs_add_system_chunk(). It
> seems like something is causing spurious allocations of system chunks or
> something, but I'm not familiar with the code. Could you take a look?

Thanks Omar.
I think I know what's going on. I'll take a look and get back to you
after analyzing/trying a few things.
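
As a side note for anyone decoding the log above, the kernel reports
errors as negative errno values, so the -27 can be mapped back to a
name in userspace. A small sketch, assuming Linux errno numbering;
describe_kernel_ret() is a made-up helper:

```c
#include <errno.h>
#include <string.h>

/*
 * Map a kernel-style return value (negative errno) to the C library's
 * description of that errno. Positive values are accepted as well.
 */
const char *describe_kernel_ret(int ret)
{
	return strerror(ret < 0 ? -ret : ret);
}
```

On Linux, EFBIG is 27, so describe_kernel_ret(-27) returns the same
text as strerror(EFBIG).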

>
> Thanks a lot,
> --
> Omar


Re: [PATCH] Btrfs: incremental send, fix clone operations for compressed extents

2015-07-17 Thread Filipe David Manana
e
>> +* receiving end.
>> +*/
>> +   if (compressed == BTRFS_COMPRESS_NONE)
>> +   backref_ctx->data_offset = 0;
>> +   else
>> +   backref_ctx->data_offset = btrfs_file_extent_offset(eb, fi);
>>
>> /*
>>  * The last extent of a file may be too large due to page alignment.
>> --
>> 2.1.3
>
> Is this patch still useful? It applies to 4.1.

Yes, Chris sent it in the pull request to Linus for 4.2-rc1.



Re: question about should_cow_block() and BTRFS_HEADER_FLAG_WRITTEN

2015-07-13 Thread Filipe David Manana
On Sun, Jul 12, 2015 at 6:15 PM, Alex Lyakas  wrote:
> Greetings,
> Looking at the code of should_cow_block(), I see:
>
> if (btrfs_header_generation(buf) == trans->transid &&
>!btrfs_header_flag(buf, BTRFS_HEADER_FLAG_WRITTEN) &&
> ...
> So if the extent buffer has been written to disk, and now is changed again
> in the same transaction, we insist on COW'ing it. Can anybody explain why
> COW is needed in this case? The transaction has not committed yet, so what
> is the danger of rewriting to the same location on disk? My understanding
> was that a tree block needs to be COW'ed at most once in the same
> transaction. But I see that this is not the case.

That logic is there, as far as I can see, for at least 2 obvious reasons:

1) fsync/log trees. All extent buffers (tree blocks) of a log tree
have the same transaction id/generation, and you can have multiple
fsyncs (log transaction commits) per transaction so you need to ensure
consistency. If we skipped the COWing in the example below, you would
get an inconsistent log tree at log replay time when the fs is
mounted:

transaction N start

   fsync inode A start
   creates tree block X
   flush X to disk
   write a new superblock
   fsync inode A end

   fsync inode B start
   skip COW of X because its generation == current transaction id and
   modify it in place
   flush X to disk

== crash ===

   write a new superblock
   fsync inode B end

transaction N commit

2) The flag BTRFS_HEADER_FLAG_WRITTEN is set not when the block is
written to disk but instead when we trigger writeback for it. So while
the writeback is ongoing we want to make sure the block's content
isn't concurrently modified (we don't keep the eb write locked to
allow concurrent reads during the writeback).

All tree blocks that don't belong to a log tree are normally written
only at the end of a transaction commit. But often, due to memory
pressure for example, the VM can call the writepages() callback of the
btree inode to force dirty tree blocks to be written to disk before
the transaction commit.
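
To make the two conditions from the quoted snippet concrete, here is a
toy userspace model. It is only a sketch: struct toy_block,
TOY_FLAG_WRITTEN, toy_should_cow() and toy_start_writeback() are
made-up names, and the real should_cow_block() checks more than these
two conditions.

```c
#include <stdbool.h>
#include <stdint.h>

/* Stand-in for BTRFS_HEADER_FLAG_WRITTEN. */
#define TOY_FLAG_WRITTEN (1u << 0)

/* Stand-in for the header fields of an extent buffer. */
struct toy_block {
	uint64_t generation;
	uint32_t flags;
};

/*
 * A block can be modified in place only if it was created in the
 * current transaction and writeback has never been started for it.
 * In every other case it has to be COWed.
 */
bool toy_should_cow(const struct toy_block *buf, uint64_t transid)
{
	if (buf->generation == transid && !(buf->flags & TOY_FLAG_WRITTEN))
		return false;
	return true;
}

/*
 * The flag is set when writeback is triggered, not when the write
 * completes, so changes made while the I/O is in flight go to a new
 * copy of the block.
 */
void toy_start_writeback(struct toy_block *buf)
{
	buf->flags |= TOY_FLAG_WRITTEN;
}
```

In this model a block created in transaction N is modified in place
until something (an fsync of a log tree, or writepages() under memory
pressure) starts writeback for it. After that, any further change in
the same transaction is COWed again, which would be consistent with
seeing more COWed blocks than the total metadata size.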

>
> I am asking because I am doing some profiling of btrfs metadata work under
> heavy loads, and I see that sometimes btrfs COW's almost twice more tree
> blocks than the total metadata size.
>
> Thanks,
> Alex.


Re: 4.1-rc6 - kernel crash after doing chattr +C

2015-07-03 Thread Filipe David Manana
On Sat, Jun 6, 2015 at 7:07 AM, Tomasz Chmielewski  wrote:
> 4.1-rc6, busy filesystem.
>
> I was running mongo import which made quite a lot of IO.
> During the import, I did "chattr +C /var/lib/mongodb" - shortly after I saw
> this in dmesg and server died:
>
> [57860.149839] BUG: unable to handle kernel NULL pointer dereference at
> 0008
> [57860.149877] IP: []
> btrfs_wait_pending_ordered+0x5e/0x110 [btrfs]
> [57860.149923] PGD 5d1ac6067 PUD 5d40fc067 PMD 0
> [57860.149943] Oops: 0002 [#1] SMP
> [57860.149960] Modules linked in: xt_conntrack veth xt_CHECKSUM
> iptable_mangle ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat
> nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack xt_tcpudp
> iptable_filter ip_tables x_tables bridge stp llc intel_rapl iosf_mbi
> x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm
> crct10dif_pclmul eeepc_wmi asus_wmi crc32_pclmul ghash_clmulni_intel
> sparse_keymap aesni_intel aes_x86_64 ie31200_edac lpc_ich lrw gf128mul
> edac_core glue_helper ablk_helper shpchp cryptd serio_raw wmi video
> tpm_infineon 8250_fintek mac_hid btrfs lp parport raid10 raid456
> async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq
> e1000e raid1 ahci raid0 ptp libahci pps_core multipath linear
> [57860.150203] CPU: 4 PID: 14111 Comm: mongod Not tainted
> 4.1.0-040100rc6-generic #201506010235
> [57860.150237] Hardware name: System manufacturer System Product Name/P8B
> WS, BIOS 0904 10/24/2011
> [57860.150271] task: 88007901bc60 ti: 8805d5c38000 task.ti:
> 8805d5c38000
> [57860.150303] RIP: 0010:[]  []
> btrfs_wait_pending_ordered+0x5e/0x110 [btrfs]
> [57860.150346] RSP: 0018:8805d5c3bd18  EFLAGS: 00010206
> [57860.150364] RAX:  RBX: 880103c9d950 RCX:
> 3d44
> [57860.150386] RDX:  RSI: 3d44 RDI:
> 880806a74838
> [57860.150407] RBP: 8805d5c3bd88 R08:  R09:
> 
> [57860.150428] R10: 0001 R11:  R12:
> 880806bcb800
> [57860.150450] R13: 880806a74838 R14: 880103c9d8d8 R15:
> 88080a7e3518
> [57860.150471] FS:  7f5f4e6dc700() GS:88082fb0()
> knlGS:
> [57860.150504] CS:  0010 DS:  ES:  CR0: 80050033
> [57860.150523] CR2: 0008 CR3: 00062a584000 CR4:
> 000407e0
> [57860.150544] Stack:
> [57860.150558]  8805d5c3bd48 88080a7e35c8 880806bcb000
> 880806bcb800
> [57860.150592]  8800070da638 d5c3bdb0 0287
> 88080a72a4d0
> [57860.150626]  880806bcb800 88080a72a4d0 880806bcb800
> 
> [57860.150659] Call Trace:
> [57860.150682]  [] btrfs_commit_transaction+0x40b/0xb60
> [btrfs]
> [57860.150717]  [] ? prepare_to_wait_event+0x100/0x100
> [57860.150745]  [] btrfs_sync_file+0x313/0x380 [btrfs]
> [57860.150768]  [] vfs_fsync_range+0x46/0xc0
> [57860.150788]  [] vfs_fsync+0x1c/0x20
> [57860.150806]  [] do_fsync+0x38/0x70
> [57860.150825]  [] SyS_fdatasync+0x13/0x20
> [57860.150846]  [] system_call_fastpath+0x16/0x75
> [57860.150866] Code: 45 98 48 39 d8 0f 84 ad 00 00 00 48 8d 45 a8 48 83 c0
> 18 48 89 45 90 66 0f 1f 44 00 00 48 8b 13 48 8b 43 08 4c 89 ef 4c 8d 73 88
> <48> 89 42 08 48 89 10 48 89 1b 48 89 5b 08 e8 bf 3a 6b c1 e8 aa
> [57860.150959] RIP  []
> btrfs_wait_pending_ordered+0x5e/0x110 [btrfs]
> [57860.150998]  RSP 
> [57860.151014] CR2: 0008
> [57860.151186] ---[ end trace f41cd52aa31494ac ]---

Hi,

Managed to reproduce it and the following patch should fix the problem:

https://patchwork.kernel.org/patch/6716871/

>
>
> --
> Tomasz Chmielewski
> http://wpkg.org
>


Re: linux 4.1 - memory leak (possibly dedup related)

2015-07-03 Thread Filipe David Manana
On Fri, Jul 3, 2015 at 7:58 AM, Marcel Ritter  wrote:
> Hi,
>
> I've been running some btrfs tests (mainly duperemove related) with
> linux kernel 4.1 for the last few days.
>
> Now I noticed by accident (dying processes), that all my memory (128
> GB!) is gone.
> "Gone" meaning, there's no user space process allocating this memory.
>
> Digging deeper I found the missing memory using slabtop (see output of
> /proc/slabinfo is attached): Looks like I got a lot of kernel memory
> allocated by kmalloc-1024 (memory leak?).
> Given the fact that the test machine does little more than btrfs
> testing I think this may be btrfs related.
>
> I was running duperemove on a 1.5 TB volume around the time the first
> "Out of memory" error were logged, so maybe the memory leak can be
> found somewhere in this code path.
>
> I'm still waiting for a scrub run to finish, after that I'll reboot
> the machine and try to reproduce this behaviour with a fresh btrfs
> filesystem.
>
> Have there been any fixes concerning memory leaks since 4.1 release I could 
> try?
> Any other ideas how to track down this potential memory leak?

Hi,

We had Julian Taylor reporting the same on IRC a couple days ago (he
also found what was being leaked). I just sent a fix and cc'ed you
(https://patchwork.kernel.org/patch/6713301/).

thanks

>
> Bye,
>Marcel





Re: [PATCH] Btrfs: fix wrong check for btrfs_force_chunk_alloc()

2015-06-30 Thread Filipe David Manana
On Sun, Apr 12, 2015 at 7:35 AM, Wang Shilong  wrote:
> btrfs_force_chunk_alloc() returns 1 when it allocates a chunk successfully.
> This problem exists since commit c87f08ca4.
>
> With this patch, we might fix some enospc problems for balances.
>
> Signed-off-by: Wang Shilong 
Reviewed-by: Filipe Manana 
Tested-by: Filipe Manana 

> ---
>  fs/btrfs/relocation.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c
> index d830853..c453464 100644
> --- a/fs/btrfs/relocation.c
> +++ b/fs/btrfs/relocation.c
> @@ -4037,7 +4037,7 @@ restart:
> if (trans && progress && err == -ENOSPC) {
> ret = btrfs_force_chunk_alloc(trans, rc->extent_root,
>   rc->block_group->flags);
> -   if (ret == 0) {
> +   if (ret == 1) {
> err = 0;
> progress = 0;
> goto restart;
> --
> 1.7.12.4


Re: [PATCH 1/2] [btrfs] btrfs_rename: abort transaction in case of error.

2015-06-30 Thread Filipe David Manana
On Tue, Jun 30, 2015 at 5:17 AM, Davide Italiano  wrote:
> On Mon, Jun 29, 2015 at 4:59 AM, Filipe David Manana  
> wrote:
>> On Sun, Jun 28, 2015 at 10:47 PM, Davide C. C. Italiano
>>  wrote:
>>> From: Davide Italiano 
>>>
>>> btrfs_insert_inode_ref() may fail and we want to make sure
>>> the transaction is aborted before calling btrfs_end_transaction(),
>>> as it already happens everywhere else in this function in case
>>> of error.
>>>
>>> Signed-off-by: Davide Italiano 
>>> ---
>>>  fs/btrfs/inode.c | 5 -
>>>  1 file changed, 4 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
>>> index 8bb0136..59c475c 100644
>>> --- a/fs/btrfs/inode.c
>>> +++ b/fs/btrfs/inode.c
>>> @@ -9114,8 +9114,11 @@ static int btrfs_rename(struct inode *old_dir, 
>>> struct dentry *old_dentry,
>>>  new_dentry->d_name.len,
>>>  old_ino,
>>>  btrfs_ino(new_dir), index);
>>> -   if (ret)
>>> +   if (ret) {
>>> +   btrfs_abort_transaction(trans, root, ret);
>>> goto out_fail;
>>> +   }
>>> +
>>
>> Hi,
>>
>> I don't think we need a transaction abortion here. The reason it's not
>> being done is likely because at that point the trees are in a
>> consistent state (i.e. we haven't touched any of them yet) and not
>> because it was forgotten. So an abortion there is
>> unnecessary/excessive.
>>
>> thanks
>>
>
> Thank you for the comment -- I updated the other patch and I have
> mixed feeling about this one.
> I can either withdraw the review or provide a new patch where I add a
> comment to clarify why this is not needed, for the future.
> Which one do you like better?

Hi,

I don't think it's needed. We do this pattern in many places and it's
quite obvious if one reads the code flow.

thanks

>
> --
> Davide





Re: [PATCH 2/2] [btrfs] btrfs_rename(): don't ignore btrfs_end_transaction() return

2015-06-30 Thread Filipe David Manana
On Tue, Jun 30, 2015 at 5:15 AM, Davide C. C. Italiano
 wrote:
> From: Davide Italiano 
>
> btrfs_end_transaction() can return an error -- this happens, e.g.
> if it tries to commit and the transaction was aborted in the meanwhile.
> Swallowing the error is wrong, so explicitly return it.
>
> Signed-off-by: Davide Italiano 
> ---
>  fs/btrfs/inode.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index 59c475c..61b26be 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -9199,7 +9199,8 @@ static int btrfs_rename(struct inode *old_dir, struct 
> dentry *old_dentry,
> btrfs_end_log_trans(root);
> }
>  out_fail:
> -   btrfs_end_transaction(trans, root);
> +   if (!ret)
> +   ret = btrfs_end_transaction(trans, root);

So if an error happened before, we still want to call
btrfs_end_transaction(), otherwise the transaction's refcount never
drops to 0 and its resources are never freed (memory).

Something like this:

if (ret)
btrfs_end_transaction(trans, root);
else
ret = btrfs_end_transaction(trans, root);

or

int ret2;

ret2 = btrfs_end_transaction(trans, root);
if (!ret)
ret = ret2;

Also don't forget to tag your patches with V and describe
what changed between patch versions (see
https://btrfs.wiki.kernel.org/index.php/Writing_patch_for_btrfs for
instructions about how to do it).

thanks

>  out_notrans:
> if (old_ino == BTRFS_FIRST_FREE_OBJECTID)
>     up_read(&root->fs_info->subvol_sem);
> --
> 2.4.3
>





Re: [PATCH 2/2] Btrfs: fix warning of bytes_may_use

2015-06-30 Thread Filipe David Manana
On Wed, Jun 17, 2015 at 9:59 AM, Liu Bo  wrote:
> While running generic/019, dmesg got several warnings from
> btrfs_free_reserved_data_space().
>
> Test generic/019 produces some disk failures, so submitting dio will get errors,
> in which case btrfs_direct_IO() goes to the error handling and frees
> bytes_may_use, but the problem is that bytes_may_use has been free'd
> during get_block().
>
> This adds a runtime flag to show if we've gone through get_block(), if so,
> don't do the cleanup work.
>
> Signed-off-by: Liu Bo 
Reviewed-by: Filipe Manana 
Tested-by: Filipe Manana 
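
For readers following along, the idea of the runtime flag can be sketched in userspace as below. This is only a rough model under simplifying assumptions: toy_inode, toy_get_block, toy_dio_error_cleanup and toy_failed_dio are invented names, the counter is a plain long, and the flag is a non-atomic bool where the patch uses test_and_clear_bit() on the inode's runtime_flags.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of the accounting state; not the kernel's btrfs_inode. */
struct toy_inode {
	long bytes_may_use;
	bool dio_ready;		/* models BTRFS_INODE_DIO_READY */
};

static void toy_reserve(struct toy_inode *i, long n)
{
	i->bytes_may_use += n;
}

static void toy_release(struct toy_inode *i, long n)
{
	i->bytes_may_use -= n;
}

/* The get_block stage gives the reservation back and records that fact. */
static void toy_get_block(struct toy_inode *i, long n)
{
	toy_release(i, n);
	i->dio_ready = true;
}

/* Non-atomic stand-in for test_and_clear_bit(). */
static bool toy_test_and_clear(bool *flag)
{
	bool old;

	assert(flag);
	old = *flag;
	*flag = false;
	return old;
}

/* Error path: release only if get_block did not already do it. */
static void toy_dio_error_cleanup(struct toy_inode *i, long n)
{
	if (!toy_test_and_clear(&i->dio_ready))
		toy_release(i, n);
}

/* Run one failing direct IO write and return the leftover counter. */
static long toy_failed_dio(bool reached_get_block)
{
	struct toy_inode i = { 0 };

	toy_reserve(&i, 4096);
	if (reached_get_block)
		toy_get_block(&i, 4096);
	toy_dio_error_cleanup(&i, 4096);
	return i.bytes_may_use;
}
```

Without the flag check, the case where get_block already ran would release the same 4096 bytes twice and leave the counter negative, which is the kind of imbalance the warning in btrfs_free_reserved_data_space() reports.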

> ---
>  fs/btrfs/btrfs_inode.h |  2 ++
>  fs/btrfs/inode.c   | 16 +---
>  2 files changed, 15 insertions(+), 3 deletions(-)
>
> diff --git a/fs/btrfs/btrfs_inode.h b/fs/btrfs/btrfs_inode.h
> index 0ef5cc1..81220b2 100644
> --- a/fs/btrfs/btrfs_inode.h
> +++ b/fs/btrfs/btrfs_inode.h
> @@ -44,6 +44,8 @@
>  #define BTRFS_INODE_IN_DELALLOC_LIST   9
>  #define BTRFS_INODE_READDIO_NEED_LOCK  10
>  #define BTRFS_INODE_HAS_PROPS  11
> +/* DIO is ready to submit */
> +#define BTRFS_INODE_DIO_READY  12
>  /*
>   * The following 3 bits are meant only for the btree inode.
>   * When any of them is set, it means an error happened while writing an
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index 7bf150a..438b56f 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -7530,6 +7530,7 @@ unlock:
>
> current->journal_info = outstanding_extents;
> btrfs_free_reserved_data_space(inode, len);
> +   set_bit(BTRFS_INODE_DIO_READY, 
> &BTRFS_I(inode)->runtime_flags);
> }
>
> /*
> @@ -8311,9 +8312,18 @@ static ssize_t btrfs_direct_IO(struct kiocb *iocb, 
> struct iov_iter *iter,
>btrfs_submit_direct, flags);
> if (iov_iter_rw(iter) == WRITE) {
> current->journal_info = NULL;
> -   if (ret < 0 && ret != -EIOCBQUEUED)
> -   btrfs_delalloc_release_space(inode, count);
> -   else if (ret >= 0 && (size_t)ret < count)
> +   if (ret < 0 && ret != -EIOCBQUEUED) {
> +   /*
> +* If the error comes from submitting stage,
> +* btrfs_get_blocsk_direct() has free'd data space,
> +* and metadata space will be handled by
> +* finish_ordered_fn, don't do that again to make
> +* sure bytes_may_use is correct.
> +*/
> +   if (!test_and_clear_bit(BTRFS_INODE_DIO_READY,
> +&BTRFS_I(inode)->runtime_flags))
> +   btrfs_delalloc_release_space(inode, count);
> +   } else if (ret >= 0 && (size_t)ret < count)
> btrfs_delalloc_release_space(inode,
>          count - (size_t)ret);
> }
> --
> 2.1.0
>


Re: [PATCH 1/2] Btrfs: fix hang when failing to submit bio of directIO

2015-06-30 Thread Filipe David Manana
On Wed, Jun 17, 2015 at 9:59 AM, Liu Bo  wrote:
> The hang is uncovered by generic/019.
>
> btrfs_endio_direct_write() skips the "finish_ordered_fn" part when it hits
> an error, thus those added ordered extents will never get processed, which
> blocks processes that are waiting for them via btrfs_start_ordered_extent().
>
> This fixes the above, and meanwhile finish_ordered_fn will do the space
> accounting work.
>
> Signed-off-by: Liu Bo 
Reviewed-by: Filipe Manana 
Tested-by: Filipe Manana 

> ---
>  fs/btrfs/inode.c | 3 ---
>  1 file changed, 3 deletions(-)
>
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index 8bb0136..7bf150a 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -7855,8 +7855,6 @@ static void btrfs_endio_direct_write(struct bio *bio, 
> int err)
> struct bio *dio_bio;
> int ret;
>
> -   if (err)
> -   goto out_done;
>  again:
> ret = btrfs_dec_test_first_ordered_pending(inode, &ordered,
>&ordered_offset,
> @@ -7879,7 +7877,6 @@ out_test:
> ordered = NULL;
> goto again;
> }
> -out_done:
> dio_bio = dip->dio_bio;
>
> kfree(dip);
> --
> 2.1.0
>


Re: [PATCH 2/2] [btrfs] btrfs_rename(): don't ignore btrfs_end_transaction() return

2015-06-29 Thread Filipe David Manana
On Sun, Jun 28, 2015 at 10:47 PM, Davide C. C. Italiano
 wrote:
> From: Davide Italiano 
>
> btrfs_end_transaction() can return an error -- this happens, e.g.
> if it tries to commit and the transaction was aborted in the meanwhile.
> Swallowing the error is wrong, so explicitly return it.
>
> Signed-off-by: Davide Italiano 
> ---
>  fs/btrfs/inode.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index 59c475c..7764132 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -9199,7 +9199,7 @@ static int btrfs_rename(struct inode *old_dir, struct 
> dentry *old_dentry,
> btrfs_end_log_trans(root);
> }
>  out_fail:
> -   btrfs_end_transaction(trans, root);
> +   ret = btrfs_end_transaction(trans, root);

Hi,

Good intention but it's now swallowing errors from earlier places in
the code that jump to the out_fail label. For example, if the call to
btrfs_set_inode_index() fails, we jump to out_fail and we lose the
error value that it returned (btrfs_end_transaction() may return 0,
for example), so userspace thinks everything succeeded when it didn't.
The correct fix is to set ret to the return value of
btrfs_end_transaction() only if ret is currently zero.
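
A minimal userspace sketch of that rule, assuming stand-in helpers: toy_end_transaction and toy_finish are hypothetical names, and the refcount parameter only models the fact that the end-of-transaction call must always run.

```c
#include <assert.h>
#include <errno.h>

/*
 * Hypothetical stand-in for btrfs_end_transaction(): always drops one
 * reference and reports end_result (0 or a negative errno).
 */
static int toy_end_transaction(int *refcount, int end_result)
{
	assert(refcount);
	(*refcount)--;
	return end_result;
}

/*
 * op_result is whatever an earlier step left in ret before jumping to
 * the cleanup label. The first error wins; the cleanup always runs.
 */
static int toy_finish(int op_result, int end_result, int *refcount)
{
	int ret = op_result;
	int ret2;

	ret2 = toy_end_transaction(refcount, end_result);
	if (!ret)
		ret = ret2;
	return ret;
}
```

With this shape an earlier -ENOSPC is still what the caller sees even if the cleanup itself succeeds, and an error from the cleanup is reported only when nothing failed before it.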

thanks

>  out_notrans:
> if (old_ino == BTRFS_FIRST_FREE_OBJECTID)
> up_read(&root->fs_info->subvol_sem);
> --
> 2.4.3
>


Re: [PATCH 1/2] [btrfs] btrfs_rename: abort transaction in case of error.

2015-06-29 Thread Filipe David Manana
On Sun, Jun 28, 2015 at 10:47 PM, Davide C. C. Italiano
 wrote:
> From: Davide Italiano 
>
> btrfs_insert_inode_ref() may fail and we want to make sure
> the transaction is aborted before calling btrfs_end_transaction(),
> as it already happens everywhere else in this function in case
> of error.
>
> Signed-off-by: Davide Italiano 
> ---
>  fs/btrfs/inode.c | 5 -
>  1 file changed, 4 insertions(+), 1 deletion(-)
>
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index 8bb0136..59c475c 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -9114,8 +9114,11 @@ static int btrfs_rename(struct inode *old_dir, struct 
> dentry *old_dentry,
>  new_dentry->d_name.len,
>  old_ino,
>  btrfs_ino(new_dir), index);
> -   if (ret)
> +   if (ret) {
> +   btrfs_abort_transaction(trans, root, ret);
> goto out_fail;
> +   }
> +

Hi,

I don't think we need a transaction abortion here. The reason it's not
being done is likely because at that point the trees are in a
consistent state (i.e. we haven't touched any of them yet) and not
because it was forgotten. So an abortion there is
unnecessary/excessive.

thanks

> /*
>  * this is an ugly little race, but the rename is required
>  * to make sure that if we crash, the inode is either at the
> --
> 2.4.3
>


Re: [PATCH v3 0/7] Btrfs incremental send fix serval case for rename and rm directory

2015-06-23 Thread Filipe David Manana
On Tue, Jun 23, 2015 at 11:39 AM, Robbie Ko  wrote:
> Patch for fix btrfs send receive. These patches base on v4.1
> plus following patches.
> [PATCH] Btrfs: incremental send, don't delay directory renames unnecessarily
> [PATCH] Btrfs: incremental send, check if orphanized dir inode needs delayed 
> rename
>
> Thanks.
>
> Robbie Ko (7):
>   Revert "Btrfs: incremental send, remove dead code"
>   Btrfs: incremental send, avoid circular waiting and descendant
> overwrite ancestor need to update path
>   Btrfs: incremental send, avoid ancestor rename to descendant
>   Btrfs: incremental send, fix orphan_dir_info leak
>   Btrfs: incremental send, fix rmdir but dir have a unprocess item
>   Btrfs: incremental send, don't send utimes for non-existing directory
>   Btrfs: incremental send, avoid the overhead of allocating an
> orphan_dir_info object unnecessarily

Robbie,

Are you considering sending test cases for fstests for patches 2
(all 3 examples in the commit message), 3 and 5?

Let me know if you want any assistance making those tests.
Thanks.

>
>  fs/btrfs/send.c | 179 +++-
>  1 file changed, 165 insertions(+), 14 deletions(-)
>
> --
> 1.9.1
>


Re: [PATCH v3 1/7] Revert "Btrfs: incremental send, remove dead code"

2015-06-23 Thread Filipe David Manana
  dm = get_waiting_dir_move(sctx, pm->ino);
> +   ASSERT(dm);
> +   dm->rmdir_ino = rmdir_ino;
> +   }
> +   goto out;
> +   }
> fs_path_reset(name);
> to_path = name;
> name = NULL;
> --
> 1.9.1
>


Re: [PATCH v3 3/7] Btrfs: incremental send, avoid ancestor rename to descendant

2015-06-23 Thread Filipe David Manana
ret = -ENOMEM;
> +   goto out;
> +   }
> +
> +   sctx->send_progress = sctx->cur_ino + 1;
> +   ret = path_loop(sctx, name, sctx->cur_ino,
> +   sctx->cur_inode_gen, 
> &ancestor);
> +   if (ret) {
> +   ret = add_pending_dir_move(sctx, 
> sctx->cur_ino,
> +  
> sctx->cur_inode_gen,
> +  ancestor,
> +  &sctx->new_refs,
> +  
> &sctx->deleted_refs,
> +  is_orphan);
> +   if (ret < 0) {
> +   sctx->send_progress = 
> old_send_progress;
> +   fs_path_free(name);
> +   goto out;
> +   }
> +   can_rename = false;
> +   *pending_move = 1;
> +   }
> +   sctx->send_progress = old_send_progress;
> +   fs_path_free(name);
> +   if (ret < 0)
> +   goto out;

This check for ret < 0 is useless here. The former check "if (ret) {
..." will clobber it. Either do this check before the other one or
make the other check as "if (ret == 1) { ...". In other words, we're
still ignoring errors (return value < 0) from path_loop().
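
To illustrate the ordering problem outside the kernel, here is a small sketch with stubbed helpers. toy_path_loop and toy_add_pending_move are invented stand-ins, and I am assuming the convention that the loop check returns a negative errno on error, 1 when a loop is found and 0 otherwise.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Stub: hands back the outcome the caller wants to simulate. */
static int toy_path_loop(int outcome)
{
	return outcome;
}

/* Stub: queuing the delayed move always succeeds in this sketch. */
static int toy_add_pending_move(void)
{
	return 0;
}

/* Ordering from the patch: "if (ret)" also matches negative values. */
static int toy_check_buggy(int outcome, bool *delayed)
{
	int ret = toy_path_loop(outcome);

	assert(delayed);
	*delayed = false;
	if (ret) {
		ret = toy_add_pending_move();	/* overwrites the error */
		if (ret < 0)
			return ret;
		*delayed = true;
	}
	if (ret < 0)	/* can no longer see an error from toy_path_loop() */
		return ret;
	return 0;
}

/* Suggested ordering: handle errors first, then the "loop found" case. */
static int toy_check_fixed(int outcome, bool *delayed)
{
	int ret = toy_path_loop(outcome);

	assert(delayed);
	*delayed = false;
	if (ret < 0)
		return ret;
	if (ret == 1) {
		ret = toy_add_pending_move();
		if (ret < 0)
			return ret;
		*delayed = true;
	}
	return 0;
}
```

With the first ordering an -ENOMEM from the loop check is treated as "loop found" and the rename is silently delayed; with the second it is returned to the caller.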

> +   }
> +
> +   /*
>  * link/move the ref to the new place. If we have an orphan
>  * inode, move it and update valid_path. If not, link or move
>  * it depending on the inode mode.
> --
> 1.9.1
>


Re: [PATCH v2 6/7] Btrfs: incremental send, don't send utimes for non-existing directory

2015-06-22 Thread Filipe David Manana
pm->update_refs, list) {
> -   if (cur->dir == rmdir_ino)
> +   /*
> +* don't send utimes for non-existing directory
> +*/
> +   ret = get_inode_info(sctx->send_root, cur->dir, NULL,
> +NULL , NULL, NULL, NULL, NULL);
> +   if (ret == -ENOENT) {
> +       ret = 0;
> continue;
> +   }
> +   if (ret < 0)
> +   goto out;
> +
> ret = send_utimes(sctx, cur->dir, cur->dir_gen);
> if (ret < 0)
> goto out;
> --
> 1.9.1
>


Re: [PATCH v2 3/7] Btrfs: incremental send, avoid ancestor rename to descendant

2015-06-22 Thread Filipe David Manana
   name = fs_path_alloc();
> +   if (!valid_path) {

Wrong variable. Must be:

if (!name) {
   (...)


> +   ret = -ENOMEM;
> +   goto out;
> +   }
> +
> +   sctx->send_progress = sctx->cur_ino + 1;
> +   ret = path_loop(sctx, name, sctx->cur_ino, 
> sctx->cur_inode_gen, &ancestor);

Need to check if ret < 0 and "goto out" if so.

> +   if (ret) {
> +   ret = add_pending_dir_move(sctx, 
> sctx->cur_ino, sctx->cur_inode_gen,
> +   ancestor, 
> &sctx->new_refs, &sctx->deleted_refs, is_orphan);
> +   if (ret < 0) {
> +   sctx->send_progress = 
> old_send_progress;
> +   fs_path_free(name);
> +   goto out;
> +   }
> +   can_rename = false;
> +   *pending_move = 1;
> +   }
> +   sctx->send_progress = old_send_progress;
> +   fs_path_free(name);
> +   }
> +
> +   /*
>  * link/move the ref to the new place. If we have an orphan
>  * inode, move it and update valid_path. If not, link or move
>  * it depending on the inode mode.
> --
> 1.9.1
>


Re: [PATCH v2 2/7] Btrfs: incremental send, avoid circular waiting and descendant overwrite ancestor need to update path

2015-06-22 Thread Filipe David Manana
the
> @@ -3689,6 +3703,18 @@ verbose_printk("btrfs: process_recorded_refs %llu\n", 
> sctx->cur_ino);
> name_cache_delete(sctx, nce);
> kfree(nce);
> }
> +
> +   /*
> +* ow_inode might currently be an ancestor of
> +* cur_ino, therefore compute valid_path (the
> +* current path of cur_ino) again because it
> +* might contain the pre-orphanization name of
> +* ow_inode, which is no longer valid.
> +*/
> +   fs_path_reset(valid_path);
> +   ret = get_cur_path(sctx, sctx->cur_ino, 
> sctx->cur_inode_gen, valid_path);
> +       if (ret < 0)
> +   goto out;
> } else {
> ret = send_unlink(sctx, cur->full_path);
> if (ret < 0)

Also please run your patch against checkpatch.pl, as mentioned in the
first review:

$ /path/to/kernel/source/scripts/checkpatch.pl  your_patch_file

(...)

WARNING: line over 80 characters
#118: FILE: fs/btrfs/send.c:1842:
+ if (other_inode > sctx->send_progress || is_waiting_for_move(sctx,
other_inode)) {

WARNING: line over 80 characters
#193: FILE: fs/btrfs/send.c:3686:
+ wdm = get_waiting_dir_move(sctx, ow_inode);

WARNING: line over 80 characters
#214: FILE: fs/btrfs/send.c:3715:
+ ret = get_cur_path(sctx, sctx->cur_ino, sctx->cur_inode_gen, valid_path);

(...)

Same comment applies to all your patches.


> --
> 1.9.1
>


Re: [PATCH v2 1/7] Revert "Btrfs: incremental send, remove dead code"

2015-06-22 Thread Filipe David Manana
; +   }
> +   ino = parent_inode;
> +   gen = parent_gen;
> +   }
> +   return ret;
> +}
> +
>  static int apply_dir_move(struct send_ctx *sctx, struct pending_dir_move *pm)
>  {
> struct fs_path *from_path = NULL;
> @@ -3091,6 +3133,7 @@ static int apply_dir_move(struct send_ctx *sctx, struct 
> pending_dir_move *pm)
> struct waiting_dir_move *dm = NULL;
> u64 rmdir_ino = 0;
> int ret;
> +   u64 ancestor = 0;
>
> name = fs_path_alloc();
> from_path = fs_path_alloc();
> @@ -3122,6 +3165,22 @@ static int apply_dir_move(struct send_ctx *sctx, 
> struct pending_dir_move *pm)
> goto out;
>
> sctx->send_progress = sctx->cur_ino + 1;
> +   ret = path_loop(sctx, name, pm->ino, pm->gen, &ancestor);
> +   if (ret) {
> +   LIST_HEAD(deleted_refs);
> +   ASSERT(ancestor > BTRFS_FIRST_FREE_OBJECTID);
> +   ret = add_pending_dir_move(sctx, pm->ino, pm->gen, ancestor,
> +  &pm->update_refs, &deleted_refs,
> +          pm->is_orphan);
> +   if (ret < 0)
> +   goto out;
> +   if (rmdir_ino) {
> +   dm = get_waiting_dir_move(sctx, pm->ino);
> +   ASSERT(dm);
> +   dm->rmdir_ino = rmdir_ino;
> +   }
> +   goto out;
> +   }
> fs_path_reset(name);
> to_path = name;
> name = NULL;
> --
> 1.9.1
>


Re: [PATCH v3] btrfs: test premature submount unmounting when deleting default subvolume

2015-06-19 Thread Filipe David Manana
On Fri, Jun 5, 2015 at 10:00 PM, Omar Sandoval  wrote:
> Add a regression test for a problem where attempting to delete the
> default subvolume would fail (as expected), but not until after all
> submounts under the subvolume were unmounted.
>
> Reviewed-by: Eryu Guan 
> Signed-off-by: Omar Sandoval 

Reviewed-by: Filipe Manana 
Tested-by: Filipe Manana 

Thanks for doing this.

> ---
> v2->v3:
> - Update description (thanks, Eryu)
> - Remove unneeded touch
>
> v1->v2:
> - Simpler test: just depends on umount of the bind mount succeeding
>   instead of running find. Without the patch applied, the bind mount
>   will disappear so umount will fail.
> - Fix Eryu's comments (better subject, copyright, use
>   _btrfs_get_subvolid)
>
>  tests/btrfs/089 | 79 +
>  tests/btrfs/089.out |  2 ++
>  tests/btrfs/group   |  1 +
>  3 files changed, 82 insertions(+)
>  create mode 100755 tests/btrfs/089
>  create mode 100644 tests/btrfs/089.out
>
> diff --git a/tests/btrfs/089 b/tests/btrfs/089
> new file mode 100755
> index ..537269824b62
> --- /dev/null
> +++ b/tests/btrfs/089
> @@ -0,0 +1,79 @@
> +#! /bin/bash
> +# FS QA Test 089
> +#
> +# Test deleting the default subvolume, making sure that submounts under it are
> +# not unmounted prematurely. This is a regression test for Linux commit "Btrfs:
> +# don't invalidate root dentry when subvolume deletion fails".
> +#
> +#---
> +# Copyright (c) 2015 Omar Sandoval.  All Rights Reserved.
> +#
> +# This program is free software; you can redistribute it and/or
> +# modify it under the terms of the GNU General Public License as
> +# published by the Free Software Foundation.
> +#
> +# This program is distributed in the hope that it would be useful,
> +# but WITHOUT ANY WARRANTY; without even the implied warranty of
> +# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> +# GNU General Public License for more details.
> +#
> +# You should have received a copy of the GNU General Public License
> +# along with this program; if not, write the Free Software Foundation,
> +# Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301  USA
> +#---
> +#
> +
> +seq=`basename $0`
> +seqres=$RESULT_DIR/$seq
> +echo "QA output created by $seq"
> +
> +here=`pwd`
> +tmp=/tmp/$$
> +status=1   # failure is the default!
> +trap "_cleanup; exit \$status" 0 1 2 3 15
> +
> +_cleanup()
> +{
> +   cd /
> +   rm -f $tmp.*
> +}
> +
> +# get standard environment, filters and checks
> +. ./common/rc
> +. ./common/filter
> +. ./common/filter.btrfs
> +
> +# real QA test starts here
> +
> +# Modify as appropriate.
> +_need_to_be_root
> +_supported_fs btrfs
> +_supported_os Linux
> +_require_scratch
> +
> +rm -f $seqres.full
> +
> +_scratch_mkfs >>$seqres.full 2>&1
> +_scratch_mount
> +
> +# Create a new subvolume and make it the default subvolume.
> +$BTRFS_UTIL_PROG subvolume create "$SCRATCH_MNT/testvol" >>$seqres.full 2>&1 \
> +   || _fail "couldn't create subvol"
> +testvol_id=$(_btrfs_get_subvolid "$SCRATCH_MNT" testvol)
> +$BTRFS_UTIL_PROG subvolume set-default $testvol_id "$SCRATCH_MNT" >>$seqres.full 2>&1 \
> +   || _fail "couldn't set default"
> +
> +# Bind-mount a directory under the default subvolume.
> +mkdir "$SCRATCH_MNT/testvol/testdir"
> +mkdir "$SCRATCH_MNT/testvol/mnt"
> +mount --bind "$SCRATCH_MNT/testvol/testdir" "$SCRATCH_MNT/testvol/mnt"
> +
> +# Now attempt to delete the default subvolume.
> +$BTRFS_UTIL_PROG subvolume delete "$SCRATCH_MNT/testvol" >>$seqres.full 2>&1
> +
> +# Unmount the bind mount, which should still be alive.
> +$UMOUNT_PROG "$SCRATCH_MNT/testvol/mnt"
> +
> +echo "Silence is golden"
> +status=0
> +exit
> diff --git a/tests/btrfs/089.out b/tests/btrfs/089.out
> new file mode 100644
> index ..a7fcdee9b767
> --- /dev/null
> +++ b/tests/btrfs/089.out
> @@ -0,0 +1,2 @@
> +QA output created by 089
> +Silence is golden
> diff --git a/tests/btrfs/group b/tests/btrfs/group
> index ffe18bff0d21..616d060758c1 100644
> --- a/tests/btrfs/group
> +++ b/tests/btrfs/group
> @@ -91,6 +91,7 @@
>  086 auto quick clone
>  087 auto quick send
>  088 auto quick metadata
> +089 auto quick subvol
>  090 auto quick metadata
>  091 auto quick qgroup
>  092 auto quick send
> --
> 2.4.2
>
> --
> To unsubscribe from this list: send the line "unsubscribe fstests" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html





Re: [PATCH 5/5] Btrfs: incremental send, fix rmdir not send utimes

2015-06-18 Thread Filipe David Manana
On Thu, Jun 18, 2015 at 4:21 AM, Robbie Ko  wrote:
> Hi Filipe,
>
> I've found that the following case is the main cause of this error;
> its fs tree, as shown by btrfs-debug-tree, is below.
>
> file tree key (459 ROOT_ITEM 20487)
> node 132988928 level 1 items 3 free 490 generation 20487 owner 459
> fs uuid b451ae42-3b03-4003-b0a4-45dce324557f
> chunk uuid d8831db3-2e42-4b32-9a5c-3efdf50d36bc
> key (256 INODE_ITEM 0) block 132710400 (8100) gen 20486
> key (264 INODE_ITEM 0) block 130695168 (7977) gen 20480
> key (266 XATTR_ITEM 952319794) block 126042112 (7693) gen 20464
> leaf 132710400 items 166 free space 3639 generation 20486 owner 455
> fs uuid b451ae42-3b03-4003-b0a4-45dce324557f
> chunk uuid d8831db3-2e42-4b32-9a5c-3efdf50d36bc
> item 0 key (256 INODE_ITEM 0) itemoff 16123 itemsize 160
> inode generation 20425 transid 20442 size 32 block
> group 0 mode 40755 links 1 uid 0 gid 0 rdev 0 flags 0x0
> item 1 key (256 INODE_REF 256) itemoff 16111 itemsize 12
> inode ref index 0 namelen 2 name: ..
> ...
> item 165 key (262 XATTR_ITEM 1100961104) itemoff 7789 itemsize 39
> location key (0 UNKNOWN.0 0) type XATTR
> namelen 8 datalen 1 name: user.a78
> data a
> binary 61
> leaf 130695168 items 133 free space 7332 generation 20480 owner 455
> fs uuid b451ae42-3b03-4003-b0a4-45dce324557f
> chunk uuid d8831db3-2e42-4b32-9a5c-3efdf50d36bc
> item 0 key (264 INODE_ITEM 0) itemoff 16123 itemsize 160
> inode generation 20428 transid 20434 size 10 block
> group 0 mode 40755 links 1 uid 0 gid 0 rdev 0 flags 0x0
> item 1 key (264 INODE_REF 256) itemoff 16112 itemsize 11
> inode ref index 11 namelen 1 name: c
> ...
>
> We can see that inode 262 is right at the end of the leaf. send_utimes() then
> uses btrfs_search_slot() to find an appropriate slot, which ends up right
> after the items of 262. However, that slot is uninitialized on disk.
> Suppose we read
> atime tv_sec:576469548413222912, tv_nsec:1919251317 from it and then send it out.
> The receiving side will get EINVAL, since tv_nsec:1919251317 is greater
> than 999,999,999.
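
The EINVAL on the receiving side is consistent with the range check that utimensat(2) documents for tv_nsec. A minimal sketch of that check, assuming the receiver hands the timestamps from the stream to utimensat() unchanged; the helper name timespec_nsec_valid() is hypothetical and not part of btrfs-progs, and the special values UTIME_NOW and UTIME_OMIT are deliberately left out:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <time.h>

#define NSEC_PER_SEC 1000000000L

/*
 * Hypothetical helper: true when tv_nsec is inside the range that
 * utimensat(2) accepts for an ordinary timestamp (0 .. 999,999,999).
 */
static bool timespec_nsec_valid(const struct timespec *ts)
{
	assert(ts != NULL);
	return ts->tv_nsec >= 0 && ts->tv_nsec < NSEC_PER_SEC;
}
```

With the value quoted above, tv_nsec = 1919251317 fails this check, which matches the reported EINVAL.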

I see.
So in apply_dir_move, instead of searching the btree of the send
snapshot, we can search the rbtree of orphan dir infos for an entry
with a key == cur->dir. Searching that rbtree makes the intention
clear and is more efficient (it is a fully in-memory structure, and much
smaller than the btree). It should work, but I haven't tested it.
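
To illustrate the kind of lookup meant here, a minimal userspace sketch follows. It is not the send.c code: a plain unbalanced binary search tree stands in for the kernel rbtree, and struct orphan_dir, orphan_dir_insert() and orphan_dir_lookup() are hypothetical names chosen for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for an orphan dir info entry, keyed by inode number. */
struct orphan_dir {
	uint64_t ino;
	struct orphan_dir *left;
	struct orphan_dir *right;
};

/* Insert into a plain binary search tree (the kernel would use an rbtree). */
static void orphan_dir_insert(struct orphan_dir **root, struct orphan_dir *node)
{
	struct orphan_dir **p = root;

	assert(root != NULL && node != NULL);
	node->left = node->right = NULL;
	while (*p) {
		if (node->ino == (*p)->ino)
			return;	/* key already present: keep the existing entry */
		p = node->ino < (*p)->ino ? &(*p)->left : &(*p)->right;
	}
	*p = node;
}

/* Return the entry whose key equals ino, or NULL if there is none. */
static struct orphan_dir *orphan_dir_lookup(struct orphan_dir *root, uint64_t ino)
{
	while (root) {
		if (ino == root->ino)
			return root;
		root = ino < root->ino ? root->left : root->right;
	}
	return NULL;
}
```

A NULL result means the directory is not tracked as an orphan; no leaf of the on-disk tree is touched, so there is no slot past the end of a leaf to misread.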

thanks

>
> Thanks.
> Robbie Ko
>
> 2015-06-10 18:06 GMT+08:00 Robbie Ko :
>> Hi Filipe,
>>
>> 2015-06-09 18:36 GMT+08:00 Filipe David Manana :
>>> On Tue, Jun 9, 2015 at 11:04 AM, Robbie Ko  wrote:
>>>> Hi Filipe,
>>>>
>>>> 2015-06-08 22:00 GMT+08:00 Filipe David Manana :
>>>>> On Mon, Jun 8, 2015 at 4:44 AM, Robbie Ko  wrote:
>>>>>> Hi Filipe,
>>>>>
>>>>> Hi Robbie,
>>>>>
>>>>>>
>>>>>> I've fixed "don't send utimes for non-existing directory" with another 
>>>>>> solution.
>>>>>>
>>>>>>  In apply_dir_move(), the old parent dir. and new parent dir. will be
>>>>>> updated after the current dir. has moved.
>>>>>>
>>>>>> And there's only one entry in old parent dir. (e.g. entry with
>>>>>> smallest ino) will be tagged with rmdir_ino to prevent its parent dir.
>>>>>> is deleted but updated.
>>>>>
>>>>> Can't parse this phrase. What do you mean by tagging an entry with 
>>>>> rmdir_ino?
>>>>> rmdir_ino corresponds to the number of an inode that wasn't deleted
>>>>> when it was processed because there was some inode with a lower number
>>>>> that is a child of the directory in the parent snapshot and had its
>>>>> rename/move operation delayed (it happens after the directory we want
>>>>> to delete is processed).
>>>>>
>>>>
>>>> Right, my "tagged with rmdir_ino" has the same meaning as what you explained here.
>>>>
>>>>>>
>>>>>> However, if we first process the rename for another entry not tagged with
>>>>>> rmdir_ino, its old parent dir., which has been deleted, will be updated
>>>>>> according to apply_dir_move().
>>>>>>
>>>>>> Therefore, I think we should check the existence of  the dir. before
>>>

Re: [PATCH 5/7] btrfs: explicitly delete unused block groups in close_ctree and ro-remount

2015-06-17 Thread Filipe David Manana
On Wed, Jun 17, 2015 at 3:36 PM, Jeff Mahoney  wrote:
> -BEGIN PGP SIGNED MESSAGE-
> Hash: SHA1
>
> On 6/17/15 10:32 AM, Jeff Mahoney wrote:
>> On 6/17/15 9:24 AM, Filipe David Manana wrote:
>>> On Wed, Jun 17, 2015 at 11:04 AM, Filipe David Manana
>>>  wrote:
>>>> On Mon, Jun 15, 2015 at 2:41 PM,   wrote:
>>>>> From: Jeff Mahoney 
>>>>>
>>>>> The cleaner thread may already be sleeping by the time we
>>>>> enter close_ctree.  If that's the case, we'll skip removing
>>>>> any unused block groups queued for removal, even during a
>>>>> normal umount. They'll be cleaned up automatically at next
>>>>> mount, but users expect a umount to be a clean
>>>>> synchronization point, especially when used on
>>>>> thin-provisioned storage with -odiscard.  We also explicitly
>>>>> remove unused block groups in the ro-remount path for the
>>>>> same reason.
>>>>>
>>>>> Signed-off-by: Jeff Mahoney 
>>>> Reviewed-by: Filipe Manana
>>>> Tested-by: Filipe Manana
>>>>
>>>>> ---
>>>>>  fs/btrfs/disk-io.c |  9 +
>>>>>  fs/btrfs/super.c   | 11 +++
>>>>>  2 files changed, 20 insertions(+)
>>>>>
>>>>> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
>>>>> index 2ef9a4b..2e47fef 100644
>>>>> --- a/fs/btrfs/disk-io.c
>>>>> +++ b/fs/btrfs/disk-io.c
>>>>> @@ -3710,6 +3710,15 @@ void close_ctree(struct btrfs_root *root)
>>>>> cancel_work_sync(&fs_info->async_reclaim_work);
>>>>>
>>>>> if (!(fs_info->sb->s_flags & MS_RDONLY)) {
>>>>> +   /*
>>>>> +* If the cleaner thread is stopped and there are
>>>>> +* block groups queued for removal, the deletion will be
>>>>> +* skipped when we quit the cleaner thread.
>>>>> +*/
>>>>> +   mutex_lock(&root->fs_info->cleaner_mutex);
>>>>> +   btrfs_delete_unused_bgs(root->fs_info);
>>>>> +   mutex_unlock(&root->fs_info->cleaner_mutex);
>>>>> +
>>>>> ret = btrfs_commit_super(root);
>>>>> if (ret)
>>>>> btrfs_err(fs_info, "commit super ret %d", ret);
>>>>> diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
>>>>> index 9e66f5e..2ccd8d4 100644
>>>>> --- a/fs/btrfs/super.c
>>>>> +++ b/fs/btrfs/super.c
>>>>> @@ -1539,6 +1539,17 @@ static int btrfs_remount(struct super_block *sb, int *flags, char *data)
>>>>>
>>>>> sb->s_flags |= MS_RDONLY;
>>>>>
>>>>> +   /*
>>>>> +* Setting MS_RDONLY will put the cleaner thread to
>>>>> +* sleep at the next loop if it's already active.
>>>>> +* If it's already asleep, we'll leave unused block
>>>>> +* groups on disk until we're mounted read-write again
>>>>> +* unless we clean them up here.
>>>>> +*/
>>>>> +   mutex_lock(&root->fs_info->cleaner_mutex);
>>>>> +   btrfs_delete_unused_bgs(fs_info);
>>>>> +   mutex_unlock(&root->fs_info->cleaner_mutex);
>>
>>> So actually, this allows for a deadlock after the patch I sent
>>> out last week:
>>
>>> https://patchwork.kernel.org/patch/6586811/
>>
>>> In that patch delete_unused_bgs is no longer called under the
>>> cleaner_mutex, and making it so, will cause a deadlock with
>>> relocation.
>>
>>> Even without that patch, I don't think you need using this mutex
>>>  anyway - no 2 tasks running this function can get the same bg
>>> from the fs_info->unused_bgs list.
>>
>> I was hitting crashes during umount when xfstests would do
>> remount-ro and umount in quick succession.  I can go back and
>> confirm this, but I believe I was encountering a race between the
>> cleaner thread and umount after being set read-only.  It didn't
>> trigger all the time.  My hypothesis is that if the cleaner thread
>> was running and had a lot of work to do, it could start before set
>> MS_RDONLY and still be performing work through the remount and into
>> the umount.  Ro-remount would have set MS_RDONLY so we skip the
>> btrfs_commit_super in close_ctree and then blow up afterwards.
>>
>> Taking the cleaner mutex means we either wait until the cleaner
>> thread 

Re: [PATCH 5/7] btrfs: explicitly delete unused block groups in close_ctree and ro-remount

2015-06-17 Thread Filipe David Manana
On Wed, Jun 17, 2015 at 11:04 AM, Filipe David Manana
 wrote:
> On Mon, Jun 15, 2015 at 2:41 PM,   wrote:
>> From: Jeff Mahoney 
>>
>> The cleaner thread may already be sleeping by the time we enter
>> close_ctree.  If that's the case, we'll skip removing any unused
>> block groups queued for removal, even during a normal umount.
>> They'll be cleaned up automatically at next mount, but users
>> expect a umount to be a clean synchronization point, especially
>> when used on thin-provisioned storage with -odiscard.  We also
>> explicitly remove unused block groups in the ro-remount path
>> for the same reason.
>>
>> Signed-off-by: Jeff Mahoney 
> Reviewed-by: Filipe Manana 
> Tested-by: Filipe Manana 
>
>> ---
>>  fs/btrfs/disk-io.c |  9 +
>>  fs/btrfs/super.c   | 11 +++
>>  2 files changed, 20 insertions(+)
>>
>> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
>> index 2ef9a4b..2e47fef 100644
>> --- a/fs/btrfs/disk-io.c
>> +++ b/fs/btrfs/disk-io.c
>> @@ -3710,6 +3710,15 @@ void close_ctree(struct btrfs_root *root)
>> cancel_work_sync(&fs_info->async_reclaim_work);
>>
>> if (!(fs_info->sb->s_flags & MS_RDONLY)) {
>> +   /*
>> +* If the cleaner thread is stopped and there are
>> +* block groups queued for removal, the deletion will be
>> +* skipped when we quit the cleaner thread.
>> +*/
>> +   mutex_lock(&root->fs_info->cleaner_mutex);
>> +   btrfs_delete_unused_bgs(root->fs_info);
>> +   mutex_unlock(&root->fs_info->cleaner_mutex);
>> +
>> ret = btrfs_commit_super(root);
>> if (ret)
>> btrfs_err(fs_info, "commit super ret %d", ret);
>> diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
>> index 9e66f5e..2ccd8d4 100644
>> --- a/fs/btrfs/super.c
>> +++ b/fs/btrfs/super.c
>> @@ -1539,6 +1539,17 @@ static int btrfs_remount(struct super_block *sb, int *flags, char *data)
>>
>> sb->s_flags |= MS_RDONLY;
>>
>> +   /*
>> +* Setting MS_RDONLY will put the cleaner thread to
>> +* sleep at the next loop if it's already active.
>> +* If it's already asleep, we'll leave unused block
>> +* groups on disk until we're mounted read-write again
>> +* unless we clean them up here.
>> +*/
>> +   mutex_lock(&root->fs_info->cleaner_mutex);
>> +   btrfs_delete_unused_bgs(fs_info);
>> +   mutex_unlock(&root->fs_info->cleaner_mutex);

So actually, this allows for a deadlock after the patch I sent out last week:

https://patchwork.kernel.org/patch/6586811/

In that patch, btrfs_delete_unused_bgs() is no longer called under the
cleaner_mutex, and calling it with the mutex held will cause a deadlock
with relocation.

Even without that patch, I don't think you need to take this mutex
anyway: no two tasks running this function can get the same bg from the
fs_info->unused_bgs list.
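
A minimal userspace sketch of why a lock on the list itself is enough for that property. It assumes entries are detached from the list while a list lock is held; struct bg, struct unused_list and the helper functions are hypothetical, and a C11 atomic_flag spinlock stands in for the kernel spinlock.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-in for a block group queued for removal. */
struct bg {
	int id;
	struct bg *next;
};

struct unused_list {
	atomic_flag lock;	/* minimal spinlock guarding head */
	struct bg *head;
};

static void list_lock(struct unused_list *l)
{
	while (atomic_flag_test_and_set_explicit(&l->lock, memory_order_acquire))
		;	/* spin until the flag was clear */
}

static void list_unlock(struct unused_list *l)
{
	atomic_flag_clear_explicit(&l->lock, memory_order_release);
}

static void unused_list_add(struct unused_list *l, struct bg *bg)
{
	assert(l != NULL && bg != NULL);
	list_lock(l);
	bg->next = l->head;
	l->head = bg;
	list_unlock(l);
}

/*
 * Detach and return one entry, or NULL when the list is empty. The entry
 * leaves the list before the lock is dropped, so a second caller can
 * never be handed the same entry.
 */
static struct bg *unused_list_pop(struct unused_list *l)
{
	struct bg *bg;

	assert(l != NULL);
	list_lock(l);
	bg = l->head;
	if (bg) {
		l->head = bg->next;
		bg->next = NULL;
	}
	list_unlock(l);
	return bg;
}
```

The point of the sketch is only that exclusive ownership of an entry comes from removing it under the list lock; no outer mutex around the whole function is needed for that.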

thanks


>> +
>> btrfs_dev_replace_suspend_for_unmount(fs_info);
>> btrfs_scrub_cancel(fs_info);
>>     btrfs_pause_balance(fs_info);
>> --
>> 2.4.3
>>
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
>> the body of a message to majord...@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>
>
>


Re: [PATCH 7/7] btrfs: cleanup, stop casting for extent_map->lookup everywhere

2015-06-17 Thread Filipe David Manana
t;orig_block_len;
>
> @@ -4702,7 +4702,7 @@ int btrfs_chunk_readonly(struct btrfs_root *root, u64 chunk_offset)
> if (!em)
> return 1;
>
> -   map = (struct map_lookup *)em->bdev;
> +   map = em->map_lookup;
> for (i = 0; i < map->num_stripes; i++) {
> if (map->stripes[i].dev->missing) {
> miss_ndevs++;
> @@ -4782,7 +4782,7 @@ int btrfs_num_copies(struct btrfs_fs_info *fs_info, u64 logical, u64 len)
> return 1;
> }
>
> -   map = (struct map_lookup *)em->bdev;
> +   map = em->map_lookup;
> if (map->type & (BTRFS_BLOCK_GROUP_DUP | BTRFS_BLOCK_GROUP_RAID1))
> ret = map->num_stripes;
> else if (map->type & BTRFS_BLOCK_GROUP_RAID10)
> @@ -4818,7 +4818,7 @@ unsigned long btrfs_full_stripe_len(struct btrfs_root *root,
> BUG_ON(!em);
>
> BUG_ON(em->start > logical || em->start + em->len < logical);
> -   map = (struct map_lookup *)em->bdev;
> +   map = em->map_lookup;
> if (map->type & BTRFS_BLOCK_GROUP_RAID56_MASK)
> len = map->stripe_len * nr_data_stripes(map);
> free_extent_map(em);
> @@ -4839,7 +4839,7 @@ int btrfs_is_parity_mirror(struct btrfs_mapping_tree *map_tree,
> BUG_ON(!em);
>
> BUG_ON(em->start > logical || em->start + em->len < logical);
> -   map = (struct map_lookup *)em->bdev;
> +   map = em->map_lookup;
> if (map->type & BTRFS_BLOCK_GROUP_RAID56_MASK)
> ret = 1;
> free_extent_map(em);
> @@ -5000,7 +5000,7 @@ static int __btrfs_map_block(struct btrfs_fs_info *fs_info, int rw,
> return -EINVAL;
> }
>
> -   map = (struct map_lookup *)em->bdev;
> +   map = em->map_lookup;
> offset = logical - em->start;
>
> stripe_len = map->stripe_len;
> @@ -5542,7 +5542,7 @@ int btrfs_rmap_block(struct btrfs_mapping_tree *map_tree,
> free_extent_map(em);
> return -EIO;
> }
> -   map = (struct map_lookup *)em->bdev;
> +   map = em->map_lookup;
>
> length = em->len;
> rmap_len = map->stripe_len;
> @@ -6057,7 +6057,7 @@ static int read_one_chunk(struct btrfs_root *root, struct btrfs_key *key,
> }
>
> set_bit(EXTENT_FLAG_FS_MAPPING, &em->flags);
> -   em->bdev = (struct block_device *)map;
> +   em->map_lookup = map;
> em->start = logical;
> em->len = length;
> em->orig_start = 0;
> @@ -6733,7 +6733,7 @@ void btrfs_update_commit_device_bytes_used(struct btrfs_root *root,
> /* In order to kick the device replace finish process */
> lock_chunks(root);
> list_for_each_entry(em, &transaction->pending_chunks, list) {
> -   map = (struct map_lookup *)em->bdev;
> +   map = em->map_lookup;
>
> for (i = 0; i < map->num_stripes; i++) {
> dev = map->stripes[i].dev;
> --
> 2.4.3
>


Re: [PATCH 6/7] btrfs: add missing discards when unpinning extents with -o discard

2015-06-17 Thread Filipe David Manana
> -out:
> -   spin_lock(&block_group->lock);
> -   if (atomic_dec_and_test(&block_group->trimming) &&
> -   block_group->removed) {
> -   struct extent_map_tree *em_tree;
> -   struct extent_map *em;
> -
> -   spin_unlock(&block_group->lock);
> -
> +   if (cleanup) {
> lock_chunks(block_group->fs_info->chunk_root);
> em_tree = &block_group->fs_info->mapping_tree.map_tree;
> write_lock(&em_tree->lock);
> @@ -3326,10 +3314,31 @@ out:
>  * this block group have left 1 entry each one. Free them.
>  */
> __btrfs_remove_free_space_cache(block_group->free_space_ctl);
> -   } else {
> +   }
> +}
> +
> +int btrfs_trim_block_group(struct btrfs_block_group_cache *block_group,
> +  u64 *trimmed, u64 start, u64 end, u64 minlen)
> +{
> +   int ret;
> +
> +   *trimmed = 0;
> +
> +   spin_lock(&block_group->lock);
> +   if (block_group->removed) {
> spin_unlock(&block_group->lock);
> +   return 0;
> }
> +   btrfs_get_block_group_trimming(block_group);
> +   spin_unlock(&block_group->lock);
> +
> +   ret = trim_no_bitmap(block_group, trimmed, start, end, minlen);
> +   if (ret)
> +   goto out;
>
> +   ret = trim_bitmaps(block_group, trimmed, start, end, minlen);
> +out:
> +   btrfs_put_block_group_trimming(block_group);
> return ret;
>  }
>
> diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
> index 2ccd8d4..a80da03 100644
> --- a/fs/btrfs/super.c
> +++ b/fs/btrfs/super.c
> @@ -69,7 +69,7 @@ static struct file_system_type btrfs_fs_type;
>
>  static int btrfs_remount(struct super_block *sb, int *flags, char *data);
>
> -static const char *btrfs_decode_error(int errno)
> +const char *btrfs_decode_error(int errno)
>  {
> char *errstr = "unknown";
>
> diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
> index 5628e25..2005262 100644
> --- a/fs/btrfs/transaction.c
> +++ b/fs/btrfs/transaction.c
> @@ -256,6 +256,8 @@ loop:
> mutex_init(&cur_trans->cache_write_mutex);
> cur_trans->num_dirty_bgs = 0;
> spin_lock_init(&cur_trans->dirty_bgs_lock);
> +   INIT_LIST_HEAD(&cur_trans->deleted_bgs);
> +   spin_lock_init(&cur_trans->deleted_bgs_lock);
> list_add_tail(&cur_trans->list, &fs_info->trans_list);
> extent_io_tree_init(&cur_trans->dirty_pages,
>  fs_info->btree_inode->i_mapping);
> diff --git a/fs/btrfs/transaction.h b/fs/btrfs/transaction.h
> index 0b24755..14325f2 100644
> --- a/fs/btrfs/transaction.h
> +++ b/fs/btrfs/transaction.h
> @@ -74,6 +74,8 @@ struct btrfs_transaction {
>  */
> struct mutex cache_write_mutex;
> spinlock_t dirty_bgs_lock;
> +   struct list_head deleted_bgs;
> +   spinlock_t deleted_bgs_lock;
> struct btrfs_delayed_ref_root delayed_refs;
> int aborted;
> int dirty_bg_run;
> --
> 2.4.3
>


Re: [PATCH 5/7] btrfs: explicitly delete unused block groups in close_ctree and ro-remount

2015-06-17 Thread Filipe David Manana
On Mon, Jun 15, 2015 at 2:41 PM,   wrote:
> From: Jeff Mahoney 
>
> The cleaner thread may already be sleeping by the time we enter
> close_ctree.  If that's the case, we'll skip removing any unused
> block groups queued for removal, even during a normal umount.
> They'll be cleaned up automatically at next mount, but users
> expect a umount to be a clean synchronization point, especially
> when used on thin-provisioned storage with -odiscard.  We also
> explicitly remove unused block groups in the ro-remount path
> for the same reason.
>
> Signed-off-by: Jeff Mahoney 
Reviewed-by: Filipe Manana 
Tested-by: Filipe Manana 

> ---
>  fs/btrfs/disk-io.c |  9 +
>  fs/btrfs/super.c   | 11 +++
>  2 files changed, 20 insertions(+)
>
> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
> index 2ef9a4b..2e47fef 100644
> --- a/fs/btrfs/disk-io.c
> +++ b/fs/btrfs/disk-io.c
> @@ -3710,6 +3710,15 @@ void close_ctree(struct btrfs_root *root)
> cancel_work_sync(&fs_info->async_reclaim_work);
>
> if (!(fs_info->sb->s_flags & MS_RDONLY)) {
> +   /*
> +* If the cleaner thread is stopped and there are
> +* block groups queued for removal, the deletion will be
> +* skipped when we quit the cleaner thread.
> +*/
> +   mutex_lock(&root->fs_info->cleaner_mutex);
> +   btrfs_delete_unused_bgs(root->fs_info);
> +   mutex_unlock(&root->fs_info->cleaner_mutex);
> +
> ret = btrfs_commit_super(root);
> if (ret)
> btrfs_err(fs_info, "commit super ret %d", ret);
> diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
> index 9e66f5e..2ccd8d4 100644
> --- a/fs/btrfs/super.c
> +++ b/fs/btrfs/super.c
> @@ -1539,6 +1539,17 @@ static int btrfs_remount(struct super_block *sb, int *flags, char *data)
>
> sb->s_flags |= MS_RDONLY;
>
> +   /*
> +* Setting MS_RDONLY will put the cleaner thread to
> +* sleep at the next loop if it's already active.
> +* If it's already asleep, we'll leave unused block
> +* groups on disk until we're mounted read-write again
> +* unless we clean them up here.
> +*/
> +   mutex_lock(&root->fs_info->cleaner_mutex);
> +   btrfs_delete_unused_bgs(fs_info);
> +   mutex_unlock(&root->fs_info->cleaner_mutex);
> +
> btrfs_dev_replace_suspend_for_unmount(fs_info);
> btrfs_scrub_cancel(fs_info);
>     btrfs_pause_balance(fs_info);
> --
> 2.4.3
>


Re: [PATCH 4/7] btrfs: iterate over unused chunk space in FITRIM

2015-06-17 Thread Filipe David Manana
find_free_dev_extent(struct btrfs_trans_handle *trans,
> -struct btrfs_device *device, u64 num_bytes,
> -u64 *start, u64 *len)
> +int find_free_dev_extent_start(struct btrfs_transaction *transaction,
> +  struct btrfs_device *device, u64 num_bytes,
> +  u64 search_start, u64 *start, u64 *len)
>  {
> struct btrfs_key key;
> struct btrfs_root *root = device->dev_root;
> @@ -1119,19 +1123,11 @@ int find_free_dev_extent(struct btrfs_trans_handle *trans,
> u64 max_hole_start;
> u64 max_hole_size;
> u64 extent_end;
> -   u64 search_start;
> u64 search_end = device->total_bytes;
> int ret;
> int slot;
> struct extent_buffer *l;
>
> -   /* FIXME use last free of some kind */
> -
> -   /* we don't want to overwrite the superblock on the drive,
> -* so we make sure to start at an offset of at least 1MB
> -*/
> -   search_start = max(root->fs_info->alloc_start, 1024ull * 1024);
> -
> path = btrfs_alloc_path();
> if (!path)
> return -ENOMEM;
> @@ -1192,7 +1188,7 @@ again:
>  * Have to check before we set max_hole_start, otherwise
>  * we could end up sending back this offset anyway.
>  */
> -   if (contains_pending_extent(trans, device,
> +   if (contains_pending_extent(transaction, device,
> &search_start,
> hole_size)) {
> if (key.offset >= search_start) {
> @@ -1241,7 +1237,7 @@ next:
> if (search_end > search_start) {
> hole_size = search_end - search_start;
>
> -   if (contains_pending_extent(trans, device, &search_start,
> +   if (contains_pending_extent(transaction, device, &search_start,
> hole_size)) {
> btrfs_release_path(path);
> goto again;
> @@ -1267,6 +1263,24 @@ out:
> return ret;
>  }
>
> +int find_free_dev_extent(struct btrfs_trans_handle *trans,
> +struct btrfs_device *device, u64 num_bytes,
> +u64 *start, u64 *len)
> +{
> +   struct btrfs_root *root = device->dev_root;
> +   u64 search_start;
> +
> +   /* FIXME use last free of some kind */
> +
> +   /*
> +* we don't want to overwrite the superblock on the drive,
> +* so we make sure to start at an offset of at least 1MB
> +*/
> +   search_start = max(root->fs_info->alloc_start, 1024ull * 1024);
> +   return find_free_dev_extent_start(trans->transaction, device,
> + num_bytes, search_start, start, len);
> +}
> +
>  static int btrfs_free_dev_extent(struct btrfs_trans_handle *trans,
>   struct btrfs_device *device,
>   u64 start, u64 *dev_extent_len)
> diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
> index ebc3133..30918a8 100644
> --- a/fs/btrfs/volumes.h
> +++ b/fs/btrfs/volumes.h
> @@ -449,6 +449,9 @@ int btrfs_cancel_balance(struct btrfs_fs_info *fs_info);
>  int btrfs_create_uuid_tree(struct btrfs_fs_info *fs_info);
>  int btrfs_check_uuid_tree(struct btrfs_fs_info *fs_info);
>  int btrfs_chunk_readonly(struct btrfs_root *root, u64 chunk_offset);
> +int find_free_dev_extent_start(struct btrfs_transaction *transaction,
> +struct btrfs_device *device, u64 num_bytes,
> +u64 search_start, u64 *start, u64 *max_avail);
>  int find_free_dev_extent(struct btrfs_trans_handle *trans,
>  struct btrfs_device *device, u64 num_bytes,
>  u64 *start, u64 *max_avail);
> --
> 2.4.3
>


Re: [PATCH 3/7] btrfs: skip superblocks during discard

2015-06-17 Thread Filipe David Manana
On Mon, Jun 15, 2015 at 2:41 PM,   wrote:
> From: Jeff Mahoney 
>
> Btrfs doesn't track superblocks with extent records so there is nothing
> persistent on-disk to indicate that those blocks are in use.  We track
> the superblocks in memory to ensure they don't get used by removing them
> from the free space cache when we load a block group from disk.  Prior
> to 47ab2a6c6a (Btrfs: remove empty block groups automatically), that
> was fine since the block group would never be reclaimed so the superblock
> was always safe.  Once we started removing the empty block groups, we
> were protected by the fact that discards weren't being properly issued
> for unused space either via FITRIM or -odiscard.  The block groups were
> still being released, but the blocks remained on disk.
>
> In order to properly discard unused block groups, we need to filter out
> the superblocks from the discard range.  Superblocks are located at fixed
> locations on each device, so it makes sense to filter them out in
> btrfs_issue_discard, which is used by both -odiscard and FITRIM.
>
> Signed-off-by: Jeff Mahoney 
Reviewed-by: Filipe Manana 
Tested-by: Filipe Manana 
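
For readers following the commit message, the filtering it describes amounts to subtracting the fixed superblock locations from the range to be discarded. A minimal userspace sketch, not the patch itself: the offsets in sb_offset() (64KiB, 64MiB, 256GiB) and the 4KiB superblock size are assumptions written out from memory of the btrfs on-disk layout, and struct range and discard_ranges() are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SB_MIRROR_MAX	3
#define SB_SIZE		4096ULL

/* Assumed superblock copy offsets, in ascending order: 64KiB, 64MiB, 256GiB. */
static uint64_t sb_offset(int mirror)
{
	return mirror ? (16384ULL << (12 * mirror)) : 65536ULL;
}

struct range {
	uint64_t start;
	uint64_t len;
};

/*
 * Split [start, start + len) into the pieces that do not overlap any
 * superblock copy. Writes at most SB_MIRROR_MAX + 1 pieces to out and
 * returns how many were written.
 */
static size_t discard_ranges(uint64_t start, uint64_t len, struct range *out)
{
	uint64_t end = start + len;
	size_t n = 0;

	assert(out != NULL);
	for (int i = 0; i < SB_MIRROR_MAX && start < end; i++) {
		uint64_t sb_start = sb_offset(i);
		uint64_t sb_end = sb_start + SB_SIZE;

		if (sb_end <= start || sb_start >= end)
			continue;	/* this copy is outside the range */
		if (sb_start > start)
			out[n++] = (struct range){ start, sb_start - start };
		start = sb_end;		/* resume after the superblock */
	}
	if (start < end)
		out[n++] = (struct range){ start, end - start };
	return n;
}
```

Each returned piece is what would be handed to the block layer discard call; a range that lies entirely inside a superblock copy yields no pieces at all.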

> ---
>  fs/btrfs/extent-tree.c | 59 
> ++
>  1 file changed, 55 insertions(+), 4 deletions(-)
>
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index cf9cefd..1e44b93 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -1884,10 +1884,12 @@ static int remove_extent_backref(struct btrfs_trans_handle *trans,
> return ret;
>  }
>
> +#define in_range(b, first, len) ((b) >= (first) && (b) < (first) + (len))
>  static int btrfs_issue_discard(struct block_device *bdev, u64 start, u64 len,
>u64 *discarded_bytes)
>  {
> -   int ret = 0;
> +   int j, ret = 0;
> +   u64 bytes_left, end;
> u64 aligned_start = ALIGN(start, 1 << 9);
>
> if (WARN_ON(start != aligned_start)) {
> @@ -1897,11 +1899,60 @@ static int btrfs_issue_discard(struct block_device *bdev, u64 start, u64 len,
> }
>
> *discarded_bytes = 0;
> -   if (len) {
> -   ret = blkdev_issue_discard(bdev, start >> 9, len >> 9,
> +
> +   if (!len)
> +   return 0;
> +
> +   end = start + len;
> +   bytes_left = len;
> +
> +   /* Skip any superblocks on this device. */
> +   for (j = 0; j < BTRFS_SUPER_MIRROR_MAX; j++) {
> +   u64 sb_start = btrfs_sb_offset(j);
> +   u64 sb_end = sb_start + BTRFS_SUPER_INFO_SIZE;
> +   u64 size = sb_start - start;
> +
> +   if (!in_range(sb_start, start, bytes_left) &&
> +   !in_range(sb_end, start, bytes_left) &&
> +   !in_range(start, sb_start, BTRFS_SUPER_INFO_SIZE))
> +   continue;
> +
> +   /*
> +* Superblock spans beginning of range.  Adjust start and
> +* try again.
> +*/
> +   if (sb_start <= start) {
> +   start += sb_end - start;
> +   if (start > end) {
> +   bytes_left = 0;
> +   break;
> +   }
> +   bytes_left = end - start;
> +   continue;
> +   }
> +
> +   if (size) {
> +   ret = blkdev_issue_discard(bdev, start >> 9, size >> 9,
> +  GFP_NOFS, 0);
> +   if (!ret)
> +   *discarded_bytes += size;
> +   else if (ret != -EOPNOTSUPP)
> +   return ret;
> +   }
> +
> +   start = sb_end;
> +   if (start > end) {
> +   bytes_left = 0;
> +   break;
> +   }
> +   bytes_left = end - start;
> +   }
> +
> +   if (bytes_left) {
> +   ret = blkdev_issue_discard(bdev, start >> 9, bytes_left >> 9,
>GFP_NOFS, 0);
> if (!ret)
> -   *discarded_bytes = len;
> +   *discarded_bytes += bytes_left;
> }
> return ret;
>  }
> --
> 2.4.3
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html



-- 
Filipe David Manana,

"Reasonable men adapt themselves to the world.
 Unreasonable men adapt the world to themselves.
 That's why all progress depends on unreasonable men."


Re: [PATCH 2/7] btrfs: btrfs_issue_discard ensure offset/length are aligned to sector boundaries

2015-06-17 Thread Filipe David Manana
On Mon, Jun 15, 2015 at 2:41 PM,   wrote:
> From: Jeff Mahoney 
>
> It's possible, though unexpected, to pass unaligned offsets and lengths
> to btrfs_issue_discard.  We then shift the offset/length values to sector
> units.  If an unaligned offset has been passed, it will result in the
> entire sector being discarded, possibly losing data.  An unaligned
> length is safe but we'll end up returning an inaccurate number of
> discarded bytes.
>
> This patch aligns the offset to the 512B boundary, adjusts the length,
> and warns, since we shouldn't be discarding on an offset that isn't
> aligned with our sector size.
>
> Signed-off-by: Jeff Mahoney 
Reviewed-by: Filipe Manana 
Tested-by: Filipe Manana 
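
To illustrate the arithmetic described in the changelog above, here is a
small user-space sketch. It is not the patch itself: discard_start() and
discard_len() are hypothetical helpers standing in for the ALIGN() and
round_down() calls, and the guard against a length shorter than the
skipped bytes is an extra assumption of this sketch that the patch does
not have.

```c
#include <assert.h>
#include <stdint.h>

#define SECTOR_SIZE 512ULL	/* 1 << 9, the unit the block layer discards in */

/* Start offset rounded up to the next 512-byte boundary. */
static uint64_t discard_start(uint64_t start)
{
	return (start + SECTOR_SIZE - 1) & ~(SECTOR_SIZE - 1);
}

/* Length left over once an unaligned start has been moved forward. */
static uint64_t discard_len(uint64_t start, uint64_t len)
{
	uint64_t skipped = discard_start(start) - start;

	if (!skipped)
		return len;	/* aligned start: length is used as passed in */
	if (len <= skipped)
		return 0;	/* extra guard in this sketch, not in the patch */
	return (len - skipped) & ~(SECTOR_SIZE - 1);
}
```

For example, a request for 2000 bytes at offset 700 becomes a request
for 1536 bytes at offset 1024, so the partially covered first sector is
left alone and the reported byte count matches what was really discarded.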

> ---
>  fs/btrfs/extent-tree.c | 17 +
>  1 file changed, 13 insertions(+), 4 deletions(-)
>
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index da1145d..cf9cefd 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -1888,12 +1888,21 @@ static int btrfs_issue_discard(struct block_device 
> *bdev, u64 start, u64 len,
>u64 *discarded_bytes)
>  {
> int ret = 0;
> +   u64 aligned_start = ALIGN(start, 1 << 9);
>
> -   *discarded_bytes = 0;
> -   ret = blkdev_issue_discard(bdev, start >> 9, len >> 9, GFP_NOFS, 0);
> -   if (!ret)
> -   *discarded_bytes = len;
> +   if (WARN_ON(start != aligned_start)) {
> +   len -= aligned_start - start;
> +   len = round_down(len, 1 << 9);
> +   start = aligned_start;
> +   }
>
> +   *discarded_bytes = 0;
> +   if (len) {
> +   ret = blkdev_issue_discard(bdev, start >> 9, len >> 9,
> +  GFP_NOFS, 0);
> +   if (!ret)
> +   *discarded_bytes = len;
> +   }
> return ret;
>  }
>
> --
> 2.4.3
>



-- 
Filipe David Manana,

"Reasonable men adapt themselves to the world.
 Unreasonable men adapt the world to themselves.
 That's why all progress depends on unreasonable men."


Re: [PATCH 1/7] btrfs: make btrfs_issue_discard return bytes discarded

2015-06-17 Thread Filipe David Manana
On Mon, Jun 15, 2015 at 2:41 PM,   wrote:
> From: Jeff Mahoney 
>
> Initially this will just be the length argument passed to it,
> but the following patches will adjust that to reflect re-alignment
> and skipped blocks.
>
> Signed-off-by: Jeff Mahoney 

Reviewed-by: Filipe Manana 
Tested-by: Filipe Manana 
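
The accounting pattern this patch introduces (the callee reports the
bytes it discarded through an out parameter, the caller sums them and
tolerates -EOPNOTSUPP) can be modelled outside the kernel roughly as
follows. struct stripe, issue_discard() and discard_stripes() are
made-up names for this sketch, and the err field simply simulates what
the block layer would return.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stripe description; err simulates the block layer result. */
struct stripe {
	uint64_t length;
	int can_discard;
	int err;
};

/* Reports the discarded bytes through *discarded_bytes, 0 on failure. */
static int issue_discard(const struct stripe *s, uint64_t *discarded_bytes)
{
	*discarded_bytes = 0;
	if (s->err)
		return s->err;
	*discarded_bytes = s->length;
	return 0;
}

/* Sums what was really discarded; stops on errors other than -EOPNOTSUPP. */
static uint64_t discard_stripes(const struct stripe *stripes, size_t n, int *err)
{
	uint64_t total = 0;
	size_t i;

	*err = 0;
	for (i = 0; i < n; i++) {
		uint64_t bytes;
		int ret;

		if (!stripes[i].can_discard)
			continue;

		ret = issue_discard(&stripes[i], &bytes);
		if (!ret) {
			total += bytes;
		} else if (ret != -EOPNOTSUPP) {
			*err = ret;
			break;
		}
	}
	return total;
}
```

The point of the out parameter is that later patches in the series can
make the callee discard less than the requested length without the
caller's total becoming inaccurate.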

> ---
>  fs/btrfs/extent-tree.c | 19 ++-
>  1 file changed, 14 insertions(+), 5 deletions(-)
>
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index 0ec3acd..da1145d 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -1884,10 +1884,17 @@ static int remove_extent_backref(struct 
> btrfs_trans_handle *trans,
> return ret;
>  }
>
> -static int btrfs_issue_discard(struct block_device *bdev,
> -   u64 start, u64 len)
> +static int btrfs_issue_discard(struct block_device *bdev, u64 start, u64 len,
> +  u64 *discarded_bytes)
>  {
> -   return blkdev_issue_discard(bdev, start >> 9, len >> 9, GFP_NOFS, 0);
> +   int ret = 0;
> +
> +   *discarded_bytes = 0;
> +   ret = blkdev_issue_discard(bdev, start >> 9, len >> 9, GFP_NOFS, 0);
> +   if (!ret)
> +   *discarded_bytes = len;
> +
> +   return ret;
>  }
>
>  int btrfs_discard_extent(struct btrfs_root *root, u64 bytenr,
> @@ -1908,14 +1915,16 @@ int btrfs_discard_extent(struct btrfs_root *root, u64 
> bytenr,
>
>
> for (i = 0; i < bbio->num_stripes; i++, stripe++) {
> +   u64 bytes;
> if (!stripe->dev->can_discard)
> continue;
>
> ret = btrfs_issue_discard(stripe->dev->bdev,
>   stripe->physical,
> - stripe->length);
> + stripe->length,
> + &bytes);
> if (!ret)
> -   discarded_bytes += stripe->length;
> +   discarded_bytes += bytes;
> else if (ret != -EOPNOTSUPP)
> break; /* Logic errors or -ENOMEM, or -EIO 
> but I don't know how that could happen JDM */
>
> --
> 2.4.3
>



-- 
Filipe David Manana,

"Reasonable men adapt themselves to the world.
 Unreasonable men adapt the world to themselves.
 That's why all progress depends on unreasonable men."


Re: [PATCH 1/4] btrfs: skip superblocks during discard

2015-06-11 Thread Filipe David Manana
On Thu, Jun 11, 2015 at 7:17 PM, Jeff Mahoney  wrote:
> -BEGIN PGP SIGNED MESSAGE-
> Hash: SHA1
>
> On 6/11/15 12:47 PM, Filipe David Manana wrote:
>> On Thu, Jun 11, 2015 at 4:20 PM,   wrote:
>>> From: Jeff Mahoney 
>>>
>>> Btrfs doesn't track superblocks with extent records so there is
>>> nothing persistent on-disk to indicate that those blocks are in
>>> use.  We track the superblocks in memory to ensure they don't get
>>> used by removing them from the free space cache when we load a
>>> block group from disk.  Prior to 47ab2a6c6a (Btrfs: remove empty
>>> block groups automatically), that was fine since the block group
>>> would never be reclaimed so the superblock was always safe.  Once
>>> we started removing the empty block groups, we were protected by
>>> the fact that discards weren't being properly issued for unused
>>> space either via FITRIM or -odiscard.  The block groups were
>>> still being released, but the blocks remained on disk.
>>>
>>> In order to properly discard unused block groups, we need to
>>> filter out the superblocks from the discard range.  Superblocks
>>> are located at fixed locations on each device, so it makes sense
>>> to filter them out in btrfs_issue_discard, which is used by both
>>> -odiscard and FITRIM.
>>>
>>> Signed-off-by: Jeff Mahoney  ---
>>> fs/btrfs/extent-tree.c | 50
>>> -- 1 file
>>> changed, 44 insertions(+), 6 deletions(-)
>>>
>>> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
>>> index 0ec3acd..75d0226 100644 --- a/fs/btrfs/extent-tree.c +++
>>> b/fs/btrfs/extent-tree.c @@ -1884,10 +1884,47 @@ static int
>>> remove_extent_backref(struct btrfs_trans_handle *trans, return
>>> ret; }
>>>
>>> -static int btrfs_issue_discard(struct block_device *bdev, -
>>> u64 start, u64 len) +#define in_range(b, first, len)((b)
>>> >= (first) && (b) < (first) + (len))
>>
>> Hi Jeff,
>>
>> So this will work if every caller behaves well and passes a region
>> whose start and end offsets are a multiple of the sector size
>> (4096) which currently matches the superblock size.
>>
>> However, I think it would be safer to check for the case where the
>> start offset of a superblock mirror is < (first) and (sb_offset +
>> sb_len) > (first).  Just to deal with cases where for example the
>> 2nd half of the sb starts at offset (first).
>>
>> I guess this sectorsize becoming less than 4096 will happen sooner
>> or later with the subpage sectorsize patch set, so it wouldn't hurt
>> to make it more bullet proof already.
>
> Is that something anyone intends to support?  While I suppose the
> subpage sector patch /could/ be used to allow file systems with a node
> size under 4k, the intention is the other way around -- systems that
> have higher order page sizes currently don't work with btrfs file
> system created on systems with smaller order page sizes like x86.
> Btrfs already has high enough metadata overhead.  Pretty much all new
> hardware has, at least, a native 4k sector size even if it's
> abstracted behind a RMW layer.  The sectors are only going to get
> larger.  With the metadata overhead that btrfs already incurs, I can't
> imagine any production use case with smaller sector sizes.
>
> Are we looking to support <4k nodes to test the subpage sector code on
> x86?  If so, then I'll change this to handle the possibility of
> superblocks crossing sector boundaries.  Otherwise, it's protecting
> against a use case that just shouldn't happen.

I understand your point.
I'm probably being too paranoid. But it's exactly because it's not
supposed to happen that at least an assertion or something similar
should be added, imho. A lot of "not supposed to happen" things do
happen, and that's often how people lose data and run into other bad
issues.

And I think I once heard of supporting <4k nodes (sector sizes), at
least for testing on x86, but I might not have understood it
correctly. Having such a check would help detect bugs during
development where some caller passes a wrong range to discard. It's
better to find it during development/RCs than in production.

But anyway, just a personal preference.
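
To make the case I have in mind concrete, here is a minimal user-space
sketch. in_range() is the macro from the patch; ranges_overlap() is a
hypothetical helper, not something that exists in the tree, showing the
kind of check (or assertion condition) that would also catch a
superblock straddling the start of the discard range. It assumes
non-zero lengths.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Check from the patch: is b inside [first, first + len)? */
#define in_range(b, first, len)	((b) >= (first) && (b) < (first) + (len))

/* Hypothetical helper: do [a, a + a_len) and [b, b + b_len) share a byte? */
static bool ranges_overlap(uint64_t a, uint64_t a_len,
			   uint64_t b, uint64_t b_len)
{
	return a < b + b_len && b < a + a_len;
}
```

With a superblock at 65536 (4096 bytes long) and a discard range that
starts at 67584, in_range(sb_start, first, len) is false even though the
second half of the superblock lies inside the range, while
ranges_overlap() reports the overlap.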

thanks

>
>> Otherwise it looks good to me. I'll give a test on this patchset
>> soon.
>
> Thanks,
>
> - -Jeff
>
>
> - --
> Jeff Mahoney
> SUSE Labs
> -BEGIN PG

Re: [PATCH 1/4] btrfs: skip superblocks during discard

2015-06-11 Thread Filipe David Manana
 stripe->physical,
> - stripe->length);
> + stripe->length, &bytes);
> if (!ret)
> -   discarded_bytes += stripe->length;
> +   discarded_bytes += bytes;
> else if (ret != -EOPNOTSUPP)
> break; /* Logic errors or -ENOMEM, or -EIO 
> but I don't know how that could happen JDM */
>
> --
> 1.8.4.5
>



-- 
Filipe David Manana,

"Reasonable men adapt themselves to the world.
 Unreasonable men adapt the world to themselves.
 That's why all progress depends on unreasonable men."


Re: [RFC PATCH 2/2] Btrfs: improve fsync for nocow file

2015-06-10 Thread Filipe David Manana
On Wed, Jun 10, 2015 at 9:26 AM, Filipe David Manana  wrote:
> On Wed, Jun 10, 2015 at 3:09 AM, Liu Bo  wrote:
>> On Tue, Jun 09, 2015 at 01:56:41PM +0100, Filipe David Manana wrote:
>>> On Tue, Jun 9, 2015 at 1:04 PM, Liu Bo  wrote:
>>> > If  we're overwriting an allocated file without changing timestamp
>>> > and inode version, and the file is with NODATACOW, we don't have any 
>>> > metadata to
>>> > commit, thus we can just flush the data device cache and go forward.
>>> >
>>> > However, if there's have any change on extents' disk bytenr, inode size,
>>> > timestamp or inode version, we need to go through the normal 
>>> > btrfs_log_inode
>>> > path.
>>> >
>>> > Test:
>>> > 
>>> > 1. sysbench test of
>>> > "1 file + 1 thread + bs=4k + size=40k + synchronous I/O mode + 
>>> > randomwrite +
>>> > fsync_on_each_write",
>>> > 2. loop device associated with tmpfs file
>>> > 3.
>>> >   - For btrfs, "-o nodatacow" and "-o noi_version" option
>>> >   - For ext4 and xfs, no extra mount options
>>> > 
>>> >
>>> > Results:
>>> > 
>>> > - btrfs:
>>> > w/o: ~30Mb/sec
>>> > w:   ~181Mb/sec
>>> >
>>> > - other filesystems: (both don't enable i_version by default)
>>> > ext4:  203Mb/sec
>>> > xfs:   212Mb/sec
>>> > 
>>> >
>>> > Signed-off-by: Liu Bo 
>>> > ---
>>> >  fs/btrfs/btrfs_inode.h |  2 ++
>>> >  fs/btrfs/disk-io.c |  2 +-
>>> >  fs/btrfs/disk-io.h |  1 +
>>> >  fs/btrfs/file.c| 39 ++-
>>> >  fs/btrfs/inode.c   |  3 +++
>>> >  5 files changed, 41 insertions(+), 6 deletions(-)
>>> >
>>> > diff --git a/fs/btrfs/btrfs_inode.h b/fs/btrfs/btrfs_inode.h
>>> > index 0ef5cc1..b36d87a 100644
>>> > --- a/fs/btrfs/btrfs_inode.h
>>> > +++ b/fs/btrfs/btrfs_inode.h
>>> > @@ -44,6 +44,8 @@
>>> >  #define BTRFS_INODE_IN_DELALLOC_LIST   9
>>> >  #define BTRFS_INODE_READDIO_NEED_LOCK  10
>>> >  #define BTRFS_INODE_HAS_PROPS  11
>>> > +#define BTRFS_INODE_NOTIMESTAMP12
>>> > +#define BTRFS_INODE_NOISIZE13
>>> >  /*
>>> >   * The following 3 bits are meant only for the btree inode.
>>> >   * When any of them is set, it means an error happened while writing an
>>> > diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
>>> > index 2ef9a4b..de7fd94 100644
>>> > --- a/fs/btrfs/disk-io.c
>>> > +++ b/fs/btrfs/disk-io.c
>>> > @@ -3343,7 +3343,7 @@ static int write_dev_flush(struct btrfs_device 
>>> > *device, int wait)
>>> >   * send an empty flush down to each device in parallel,
>>> >   * then wait for them
>>> >   */
>>> > -static int barrier_all_devices(struct btrfs_fs_info *info)
>>> > +int barrier_all_devices(struct btrfs_fs_info *info)
>>> >  {
>>> > struct list_head *head;
>>> > struct btrfs_device *dev;
>>> > diff --git a/fs/btrfs/disk-io.h b/fs/btrfs/disk-io.h
>>> > index d4cbfee..2bc91fe 100644
>>> > --- a/fs/btrfs/disk-io.h
>>> > +++ b/fs/btrfs/disk-io.h
>>> > @@ -60,6 +60,7 @@ void close_ctree(struct btrfs_root *root);
>>> >  int write_ctree_super(struct btrfs_trans_handle *trans,
>>> >   struct btrfs_root *root, int max_mirrors);
>>> >  struct buffer_head *btrfs_read_dev_super(struct block_device *bdev);
>>> > +int barrier_all_devices(struct btrfs_fs_info *info);
>>> >  int btrfs_commit_super(struct btrfs_root *root);
>>> >  struct extent_buffer *btrfs_find_tree_block(struct btrfs_fs_info 
>>> > *fs_info,
>>> > u64 bytenr);
>>> > diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
>>> > index 23b6e03..861c29f 100644
>>> > --- a/fs/btrfs/file.c
>>> > +++ b/fs/btrfs/file.c
>>> > @@ -519,8 +519,12 @@ int btrfs_dirty_pages(struct btrfs_root *root, 
>>> > struct inode *inode,
>>> >  * 

Re: [RFC PATCH 2/2] Btrfs: improve fsync for nocow file

2015-06-10 Thread Filipe David Manana
On Wed, Jun 10, 2015 at 3:09 AM, Liu Bo  wrote:
> On Tue, Jun 09, 2015 at 01:56:41PM +0100, Filipe David Manana wrote:
>> On Tue, Jun 9, 2015 at 1:04 PM, Liu Bo  wrote:
>> > If  we're overwriting an allocated file without changing timestamp
>> > and inode version, and the file is with NODATACOW, we don't have any 
>> > metadata to
>> > commit, thus we can just flush the data device cache and go forward.
>> >
>> > However, if there's have any change on extents' disk bytenr, inode size,
>> > timestamp or inode version, we need to go through the normal 
>> > btrfs_log_inode
>> > path.
>> >
>> > Test:
>> > 
>> > 1. sysbench test of
>> > "1 file + 1 thread + bs=4k + size=40k + synchronous I/O mode + randomwrite 
>> > +
>> > fsync_on_each_write",
>> > 2. loop device associated with tmpfs file
>> > 3.
>> >   - For btrfs, "-o nodatacow" and "-o noi_version" option
>> >   - For ext4 and xfs, no extra mount options
>> > 
>> >
>> > Results:
>> > 
>> > - btrfs:
>> > w/o: ~30Mb/sec
>> > w:   ~181Mb/sec
>> >
>> > - other filesystems: (both don't enable i_version by default)
>> > ext4:  203Mb/sec
>> > xfs:   212Mb/sec
>> > 
>> >
>> > Signed-off-by: Liu Bo 
>> > ---
>> >  fs/btrfs/btrfs_inode.h |  2 ++
>> >  fs/btrfs/disk-io.c |  2 +-
>> >  fs/btrfs/disk-io.h |  1 +
>> >  fs/btrfs/file.c| 39 ++-
>> >  fs/btrfs/inode.c   |  3 +++
>> >  5 files changed, 41 insertions(+), 6 deletions(-)
>> >
>> > diff --git a/fs/btrfs/btrfs_inode.h b/fs/btrfs/btrfs_inode.h
>> > index 0ef5cc1..b36d87a 100644
>> > --- a/fs/btrfs/btrfs_inode.h
>> > +++ b/fs/btrfs/btrfs_inode.h
>> > @@ -44,6 +44,8 @@
>> >  #define BTRFS_INODE_IN_DELALLOC_LIST   9
>> >  #define BTRFS_INODE_READDIO_NEED_LOCK  10
>> >  #define BTRFS_INODE_HAS_PROPS  11
>> > +#define BTRFS_INODE_NOTIMESTAMP12
>> > +#define BTRFS_INODE_NOISIZE13
>> >  /*
>> >   * The following 3 bits are meant only for the btree inode.
>> >   * When any of them is set, it means an error happened while writing an
>> > diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
>> > index 2ef9a4b..de7fd94 100644
>> > --- a/fs/btrfs/disk-io.c
>> > +++ b/fs/btrfs/disk-io.c
>> > @@ -3343,7 +3343,7 @@ static int write_dev_flush(struct btrfs_device 
>> > *device, int wait)
>> >   * send an empty flush down to each device in parallel,
>> >   * then wait for them
>> >   */
>> > -static int barrier_all_devices(struct btrfs_fs_info *info)
>> > +int barrier_all_devices(struct btrfs_fs_info *info)
>> >  {
>> > struct list_head *head;
>> > struct btrfs_device *dev;
>> > diff --git a/fs/btrfs/disk-io.h b/fs/btrfs/disk-io.h
>> > index d4cbfee..2bc91fe 100644
>> > --- a/fs/btrfs/disk-io.h
>> > +++ b/fs/btrfs/disk-io.h
>> > @@ -60,6 +60,7 @@ void close_ctree(struct btrfs_root *root);
>> >  int write_ctree_super(struct btrfs_trans_handle *trans,
>> >   struct btrfs_root *root, int max_mirrors);
>> >  struct buffer_head *btrfs_read_dev_super(struct block_device *bdev);
>> > +int barrier_all_devices(struct btrfs_fs_info *info);
>> >  int btrfs_commit_super(struct btrfs_root *root);
>> >  struct extent_buffer *btrfs_find_tree_block(struct btrfs_fs_info *fs_info,
>> > u64 bytenr);
>> > diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
>> > index 23b6e03..861c29f 100644
>> > --- a/fs/btrfs/file.c
>> > +++ b/fs/btrfs/file.c
>> > @@ -519,8 +519,12 @@ int btrfs_dirty_pages(struct btrfs_root *root, struct 
>> > inode *inode,
>> >  * the disk i_size.  There is no need to log the inode
>> >  * at this time.
>> >  */
>> > -   if (end_pos > isize)
>> > +   if (end_pos > isize) {
>> > i_size_write(inode, end_pos);
>> > +   clear_bit(BTRFS_INODE_NOISIZE, 
>> > &BTRFS_I(inode)->runtime_flags);
>> > +   } else {
>> > 

Re: [RFC PATCH 2/2] Btrfs: improve fsync for nocow file

2015-06-09 Thread Filipe David Manana
   sync_it |= S_CTIME;
> +   }
>
> -   if (IS_I_VERSION(inode))
> +   if (IS_I_VERSION(inode)) {
> inode_inc_iversion(inode);
> +   sync_it |= S_VERSION;
> +   }
> +
> +   if (!sync_it)
> +   set_bit(BTRFS_INODE_NOTIMESTAMP, 
> &BTRFS_I(inode)->runtime_flags);
> +   else
> +   clear_bit(BTRFS_INODE_NOTIMESTAMP, 
> &BTRFS_I(inode)->runtime_flags);
>  }
>
>  static ssize_t btrfs_file_write_iter(struct kiocb *iocb,
> @@ -1987,6 +2005,17 @@ int btrfs_sync_file(struct file *file, loff_t start, 
> loff_t end, int datasync)
> goto out;
> }
>
> +   if (BTRFS_I(inode)->flags & BTRFS_INODE_NODATACOW) {
> +   if (test_and_clear_bit(BTRFS_INODE_NOTIMESTAMP,
> +   &BTRFS_I(inode)->runtime_flags) &&
> +   test_and_clear_bit(BTRFS_INODE_NOISIZE,
> +   &BTRFS_I(inode)->runtime_flags)) {
> +   barrier_all_devices(root->fs_info);
> +   mutex_unlock(&inode->i_mutex);
> +   goto out;

Hi Liu,

For the non-full sync case, what happens if an IO error happened
during writeback?
I don't see anything here that checks whether an IO error happened and
returns -EIO to user space in that case, i.e. something that tests for
the AS_EIO bit in inode->i_mapping->flags.

thanks
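
Roughly what I mean, as a user-space model only (struct model_mapping,
MODEL_AS_EIO and both functions are invented for this sketch and the bit
position is arbitrary; the real code would test the AS_EIO bit of the
inode's mapping): the fast path should look at the recorded writeback
error before it decides that flushing the device caches is enough.

```c
#include <assert.h>
#include <errno.h>

#define MODEL_AS_EIO	(1UL << 0)	/* hypothetical bit, stands in for AS_EIO */

/* Hypothetical stand-in for the flags word of inode->i_mapping. */
struct model_mapping {
	unsigned long flags;
};

/* Report a writeback error once and clear it, so it is not lost. */
static int model_check_writeback_error(struct model_mapping *m)
{
	if (m->flags & MODEL_AS_EIO) {
		m->flags &= ~MODEL_AS_EIO;
		return -EIO;
	}
	return 0;
}

/*
 * Returns -EIO if writeback failed, 0 if the fast path (flush only) is
 * enough, 1 if the caller has to go through the normal logging path.
 */
static int model_fsync_fast_path(struct model_mapping *m,
				 int no_timestamp, int no_isize)
{
	int ret = model_check_writeback_error(m);

	if (ret)
		return ret;
	if (no_timestamp && no_isize)
		return 0;	/* this is where the device caches get flushed */
	return 1;
}
```

Without the error check, an fsync that takes the fast path would return
success to user space even though the data never made it to disk.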

> +   }
> +   }
> +
> /*
>  * ok we haven't committed the transaction yet, lets do a commit
>  */
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index 0020b56..3d230e6 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -1384,6 +1384,7 @@ out_check:
>
> btrfs_release_path(path);
> if (cow_start != (u64)-1) {
> +   clear_bit(BTRFS_INODE_NOISIZE, 
> &BTRFS_I(inode)->runtime_flags);
> ret = cow_file_range(inode, locked_page,
>  cow_start, found_key.offset - 1,
>  page_started, nr_written, 1);
> @@ -1426,6 +1427,7 @@ out_check:
> em->start + em->len - 1, 0);
> }
> type = BTRFS_ORDERED_PREALLOC;
> +       clear_bit(BTRFS_INODE_NOISIZE, 
> &BTRFS_I(inode)->runtime_flags);
> } else {
> type = BTRFS_ORDERED_NOCOW;
> }
> @@ -1464,6 +1466,7 @@ out_check:
> }
>
> if (cow_start != (u64)-1) {
> +   clear_bit(BTRFS_INODE_NOISIZE, 
> &BTRFS_I(inode)->runtime_flags);
> ret = cow_file_range(inode, locked_page, cow_start, end,
>  page_started, nr_written, 1);
> if (ret)
> --
> 2.1.0
>



-- 
Filipe David Manana,

"Reasonable men adapt themselves to the world.
 Unreasonable men adapt the world to themselves.
 That's why all progress depends on unreasonable men."


Re: [PATCH 5/5] Btrfs: incremental send, fix rmdir not send utimes

2015-06-09 Thread Filipe David Manana
On Tue, Jun 9, 2015 at 11:04 AM, Robbie Ko  wrote:
> Hi Filipe,
>
> 2015-06-08 22:00 GMT+08:00 Filipe David Manana :
>> On Mon, Jun 8, 2015 at 4:44 AM, Robbie Ko  wrote:
>>> Hi Filipe,
>>
>> Hi Robbie,
>>
>>>
>>> I've fixed "don't send utimes for non-existing directory" with another 
>>> solution.
>>>
>>>  In apply_dir_move(), the old parent dir. and new parent dir. will be
>>> updated after the current dir. has moved.
>>>
>>> And there's only one entry in old parent dir. (e.g. entry with
>>> smallest ino) will be tagged with rmdir_ino to prevent its parent dir.
>>> is deleted but updated.
>>
>> Can't parse this phrase. What do you mean by tagging an entry with rmdir_ino?
>> rmdir_ino corresponds to the number of a inode that wasn't deleted
>> when it was processed because there was some inode with a lower number
>> that is a child of the directory in the parent snapshot and had its
>> rename/move operation delayed (it happens after the directory we want
>> to delete is processed).
>>
>
> Right , my "tagged with rmdir_ino" is same meaning as you explained here.
>
>>>
>>> However, if we process rename for another entry not tagged with
>>> rmdir_ino first, its old parent dir. which is deleted  will be updated
>>> according to apply_dir_move().
>>>
>>> Therefore, I think we should check the existence of  the dir. before
>>> we're going to update it's utime.
>>>
>>> The patch is pasted in the following link, could you give me some comment?
>>>
>>> https://friendpaste.com/h8tZqOS9iAUpp2DvgGI2k
>>
>> Looks better.
>> However I still don't understand your explanation, and just tried the
>> example in your commit message:
>>
>> "Parent snapshot:
>>
>> | a/ (ino 259)
>>   | c (ino 264)
>> | b/ (ino 260)
>>   | d (ino 265)
>> | del/ (ino 263)
>>   | item1/ (ino 261)
>>   | item2/ (ino 262)
>>
>> Send snapshot:
>> | a/ (ino 259)
>> | b/ (ino 260)
>> | c/ (ino 2)
>>   | item2 (ino 259)
>> | d/ (ino 257)
>>   | item1/ (ino 258)"
>>
>> So it's confusing after looking at it.
>> First the send snapshot mentions inode number 2, which doesn't exist
>> in the parent snapshot - I assume you meant inode number 264.
>> Then, the send snapshot has two inodes with number 259. Is "item2" in
>> the send snapshot supposed to be inode 262?
>>
>
> Your guess is right. And I correct it as follow.
>
>  # Parent snapshot:
>  #
>  # | a/(ino 259)
>  # | | c   (ino 264)
>  # |
>  # | b/(ino 260)
>  # | | d   (ino 265)
>  # |
>  # | del/  (ino 263)
>  #| item1/ (ino 261)
>  #| item2/ (ino 262)
>
>  # Send snapshot:
>  #
>  # | a/(ino 259)
>  # | b/(ino 260)
>  # | c/(ino 264)
>  # | | item2/  (ino 262)
>  # |
>  # | d/(ino 265)
>  #   | item1/  (ino 261)
>
>> Anyway, assuming those 2 fixes to the example are correct guesses, I
>> tried the following and it didn't fail without your patches (i.e. no
>> attempts to send utimes to a non-existing directory):
>>
>
> Here my mean is :  btrfs tries to get utime from non-existing directory and
> apply it on the existing directory. And my patch is attempted to avoid
> this case.
> However, this case is not guaranteed to cause error anytime but it may
> fails somehow
> which is depending on the data on the disk.
> The following are the incremental procedures to send the snapshot.
>
> utimes
> utimes a
> utimes b
> rename del -> o263-259-0
> utimes
> rename a/c -> c
> utimes
> utimes a
> rename o263-259-0/item2 -> c/item2
> utimes c/item2
> utimes o263-259-0  <<-- this step may cause error

Why may it cause an error?
At that moment the name/path o263-259-0 exists at the destination
(i.e. the receiver, as it applies commands from the send stream
serially).

> utimes c
> utimes c
> rename b/d -> d
> utimes
> utimes b
> rename o263-259-0/item1 -> d/item1
> rmdir o263-259-0
> utimes d/item1
> utimes d
> utimes d
>
> As the above pointed procedure, o263-259-0 is not appeared in the send root.

Well yes, but that doesn't matter.
The oXXX-YYY-ZZZ names are never in any of the roots (send or parent

Re: [PATCH 3/3] btrfs: add missing discards when unpinning extents with -o discard

2015-06-08 Thread Filipe David Manana
y.objectid);
> +
> +   if (ret) {
> +   if (trimming)
> +   btrfs_put_block_group_trimming(block_group);
> +   goto end_trans;
> +   }
> +
> +   /*
> +* If we're not mounted with -odiscard, we can just forget
> +* about this block group. Otherwise we'll need to wait
> +* until transaction commit to do the actual discard.
> +*/
> +   if (trimming) {
> +   WARN_ON(!list_empty(&block_group->bg_list));
> +   spin_lock(&trans->transaction->deleted_bgs_lock);
> +   list_move(&block_group->bg_list,
> + &trans->transaction->deleted_bgs);
> +   spin_unlock(&trans->transaction->deleted_bgs_lock);
> +   btrfs_get_block_group(block_group);
> +   }
>  end_trans:
> btrfs_end_transaction(trans, root);
>  next:
> diff --git a/fs/btrfs/free-space-cache.c b/fs/btrfs/free-space-cache.c
> index 9dbe5b5..c79253e 100644
> --- a/fs/btrfs/free-space-cache.c
> +++ b/fs/btrfs/free-space-cache.c
> @@ -3274,35 +3274,23 @@ next:
> return ret;
>  }
>
> -int btrfs_trim_block_group(struct btrfs_block_group_cache *block_group,
> -  u64 *trimmed, u64 start, u64 end, u64 minlen)
> +void btrfs_get_block_group_trimming(struct btrfs_block_group_cache *cache)
>  {
> -   int ret;
> +   atomic_inc(&cache->trimming);
> +}
>
> -   *trimmed = 0;
> +void btrfs_put_block_group_trimming(struct btrfs_block_group_cache 
> *block_group)
> +{
> +   struct extent_map_tree *em_tree;
> +   struct extent_map *em;
> +   bool cleanup;
>
> spin_lock(&block_group->lock);
> -   if (block_group->removed) {
> -   spin_unlock(&block_group->lock);
> -   return 0;
> -   }
> -   atomic_inc(&block_group->trimming);
> +   cleanup = (atomic_dec_and_test(&block_group->trimming) &&
> +  block_group->removed);
> spin_unlock(&block_group->lock);
>
> -   ret = trim_no_bitmap(block_group, trimmed, start, end, minlen);
> -   if (ret)
> -   goto out;
> -
> -   ret = trim_bitmaps(block_group, trimmed, start, end, minlen);
> -out:
> -   spin_lock(&block_group->lock);
> -   if (atomic_dec_and_test(&block_group->trimming) &&
> -   block_group->removed) {
> -   struct extent_map_tree *em_tree;
> -   struct extent_map *em;
> -
> -   spin_unlock(&block_group->lock);
> -
> +   if (cleanup) {
> lock_chunks(block_group->fs_info->chunk_root);
> em_tree = &block_group->fs_info->mapping_tree.map_tree;
> write_lock(&em_tree->lock);
> @@ -3326,10 +3314,31 @@ out:
>  * this block group have left 1 entry each one. Free them.
>  */
> __btrfs_remove_free_space_cache(block_group->free_space_ctl);
> -   } else {
> +   }
> +}
> +
> +int btrfs_trim_block_group(struct btrfs_block_group_cache *block_group,
> +  u64 *trimmed, u64 start, u64 end, u64 minlen)
> +{
> +   int ret;
> +
> +   *trimmed = 0;
> +
> +   spin_lock(&block_group->lock);
> +   if (block_group->removed) {
> spin_unlock(&block_group->lock);
> +   return 0;
> }
> +   btrfs_get_block_group_trimming(block_group);
> +   spin_unlock(&block_group->lock);
> +
> +   ret = trim_no_bitmap(block_group, trimmed, start, end, minlen);
> +   if (ret)
> +   goto out;
>
> +   ret = trim_bitmaps(block_group, trimmed, start, end, minlen);
> +out:
> +   btrfs_put_block_group_trimming(block_group);
> return ret;
>  }
>
> diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
> index 2ccd8d4..a80da03 100644
> --- a/fs/btrfs/super.c
> +++ b/fs/btrfs/super.c
> @@ -69,7 +69,7 @@ static struct file_system_type btrfs_fs_type;
>
>  static int btrfs_remount(struct super_block *sb, int *flags, char *data);
>
> -static const char *btrfs_decode_error(int errno)
> +const char *btrfs_decode_error(int errno)
>  {
> char *errstr = "unknown";
>
> diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
> index 5628e25..2005262 100644
> --- a/fs/btrfs/transaction.c
> +++ b/fs/btrfs/tr

Re: [PATCH 5/5] Btrfs: incremental send, fix rmdir not send utimes

2015-06-08 Thread Filipe David Manana
On Mon, Jun 8, 2015 at 4:44 AM, Robbie Ko  wrote:
> Hi Filipe,

Hi Robbie,

>
> I've fixed "don't send utimes for non-existing directory" with another 
> solution.
>
>  In apply_dir_move(), the old parent dir. and new parent dir. will be
> updated after the current dir. has moved.
>
> And there's only one entry in old parent dir. (e.g. entry with
> smallest ino) will be tagged with rmdir_ino to prevent its parent dir.
> is deleted but updated.

Can't parse this phrase. What do you mean by tagging an entry with rmdir_ino?
rmdir_ino corresponds to the number of an inode that wasn't deleted
when it was processed because there was some inode with a lower number
that is a child of the directory in the parent snapshot and had its
rename/move operation delayed (it happens after the directory we want
to delete is processed).

>
> However, if we process rename for another entry not tagged with
> rmdir_ino first, its old parent dir. which is deleted  will be updated
> according to apply_dir_move().
>
> Therefore, I think we should check the existence of  the dir. before
> we're going to update it's utime.
>
> The patch is pasted in the following link, could you give me some comment?
>
> https://friendpaste.com/h8tZqOS9iAUpp2DvgGI2k

Looks better.
However I still don't understand your explanation, and just tried the
example in your commit message:

"Parent snapshot:

| a/ (ino 259)
  | c (ino 264)
| b/ (ino 260)
  | d (ino 265)
| del/ (ino 263)
  | item1/ (ino 261)
  | item2/ (ino 262)

Send snapshot:
| a/ (ino 259)
| b/ (ino 260)
| c/ (ino 2)
  | item2 (ino 259)
| d/ (ino 257)
  | item1/ (ino 258)"

So it's confusing after looking at it.
First the send snapshot mentions inode number 2, which doesn't exist
in the parent snapshot - I assume you meant inode number 264.
Then, the send snapshot has two inodes with number 259. Is "item2" in
the send snapshot supposed to be inode 262?

Anyway, assuming those 2 fixes to the example are correct guesses, I
tried the following and it didn't fail without your patches (i.e. no
attempts to send utimes to a non-existing directory):

# Parent snapshot:
#
# | a/            (ino 259)
# | | c           (ino 264)
# |
# | b/            (ino 260)
# | | d           (ino 265)
# |
# | del/          (ino 263)
#   | item1/      (ino 261)
#   | item2/      (ino 262)

# Send snapshot:
#
# | a/            (ino 259)
# | b/            (ino 260)
# | c/            (ino 264)
# | | item2/      (ino 262)
# |
# | d/            (ino 265)
#   | item1/      (ino 261)

mkdir $SCRATCH_MNT/0
mkdir $SCRATCH_MNT/1

mkdir $SCRATCH_MNT/a # 259
mkdir $SCRATCH_MNT/b # 260
mkdir $SCRATCH_MNT/item1 # 261
mkdir $SCRATCH_MNT/item2 # 262
mkdir $SCRATCH_MNT/del # 263
mv $SCRATCH_MNT/item1 $SCRATCH_MNT/del/item1
mv $SCRATCH_MNT/item2 $SCRATCH_MNT/del/item2
mkdir $SCRATCH_MNT/a/c # 264
mkdir $SCRATCH_MNT/b/d # 265

_run_btrfs_util_prog subvolume snapshot -r $SCRATCH_MNT $SCRATCH_MNT/mysnap1

mv $SCRATCH_MNT/a/c $SCRATCH_MNT/c
mv $SCRATCH_MNT/b/d $SCRATCH_MNT/d
mv $SCRATCH_MNT/del/item2 $SCRATCH_MNT/c
mv $SCRATCH_MNT/del/item1 $SCRATCH_MNT/d
rmdir $SCRATCH_MNT/del

_run_btrfs_util_prog subvolume snapshot -r $SCRATCH_MNT $SCRATCH_MNT/mysnap2

run_check $FSSUM_PROG -A -f -w $tmp/1.fssum $SCRATCH_MNT/mysnap1
run_check $FSSUM_PROG -A -f -w $tmp/2.fssum -x $SCRATCH_MNT/mysnap2/mysnap1 \
	$SCRATCH_MNT/mysnap2

_run_btrfs_util_prog send $SCRATCH_MNT/mysnap1 -f $tmp/1.snap
_run_btrfs_util_prog send -p $SCRATCH_MNT/mysnap1 $SCRATCH_MNT/mysnap2 \
	-f $tmp/2.snap

_check_scratch_fs

_scratch_unmount
_scratch_mkfs >/dev/null 2>&1
_scratch_mount

_run_btrfs_util_prog receive $SCRATCH_MNT -f $tmp/1.snap
run_check $FSSUM_PROG -r $tmp/1.fssum $SCRATCH_MNT/mysnap1

_run_btrfs_util_prog receive $SCRATCH_MNT -f $tmp/2.snap
run_check $FSSUM_PROG -r $tmp/2.fssum $SCRATCH_MNT/mysnap2



I would suggest making those hierarchy diagrams more readable: put pipes
right below the name of their parent, use continuation pipes, and
align all inode numbers in the same column, like the following:

# Parent snapshot:
#
# | a/            (ino 259)
# | | c           (ino 264)
# |
# | b/            (ino 260)
# | | d           (ino 265)
# |
# | del/          (ino 263)
#   | item1/      (ino 261)
#   | item2/      (ino 262)

# Send snapshot:
#
# | a/            (ino 259)
# | b/            (ino 260)
# | c/            (ino 264)
# | | item2/      (ino 262)
# |
# | d/            (ino 265)
#   | item1/      (ino 261)

(pasted here in case gmail screws up the indentation/formatting:
https://friendpaste.com/12wzqdcfFrlDdd1AiKX0bU)

thanks

>
> Thans!
>
> Robbie Ko
>
> 2015-06-05 0:14 GMT+08:00 Filipe David Manana :
>> On Thu, Jun 4, 2015 

Re: [PATCH 1/5] Btrfs: incremental send, avoid circular waiting and descendant overwrite ancestor need to update path

2015-06-05 Thread Filipe David Manana
On Fri, Jun 5, 2015 at 4:55 AM, Robbie Ko  wrote:
> Hi Filipe,
>
> There is another case for  2nd scenario where is_ancestor() can't be used.

So it's a 3rd case and not the 2nd one anymore. We must have an
xfstest for this case too.

>
> Parent snapshot:
> | a/ (ino 261)
>   | c (ino 267)
> | d/ (ino 259)
>   | ance/ (ino 266)
> | waiting_dir/ (ino 262)
> | pre/ (ino 264)
>   | ance/ (ino 265)
>
> Send snapshot:
> | a/ (ino 261)
>   | ance/ (ino 266)
> | c (ino 267)
>   | waiting_dir/ (ino 262)
> | pre/ (ino 264)
> | d/ (ino 259)
>   | ance/ (ino 265)
>
> First, 262 can't move to c/waiting_dir without the rename of inode 267.
> Second, 264 can move into dir 262. Although 262 is waiting, 264 is not
> parent of 262 in the parent root.
> (The second behavior will happen after applying "[PATCH] Btrfs:
> incremental send, don't delay directory renames unnecessarily")
> Finally, 265 will overwrite 266 and path for 265 should be updated
> since 266 is not the ancestor of 265.
> Here we need to check the current state of tree rather than parent
> root which  is_ancestor function does.

Right. But comparing full paths is not the way to go for the reasons
mentioned previously. So get_cur_path() gives us the full path of an
inode based on the current state (i.e. the state of the directory
hierarchy on the receiving side after applying all operations issued
in the send stream so far). That means we can use that code (write a
new function similar to it) to determine if some inode is currently an
ancestor of some other inode by walking up the hierarchy and comparing
inode numbers and generation numbers. That's the only correct way.
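
To make the "walk up and compare inode/generation numbers" idea concrete,
here is a minimal user-space sketch. Everything in it (struct toy_inode,
toy_tree, the toy_* functions) is made up for illustration and is not code
from send.c; the real thing would have to resolve each parent through the
send context the way get_cur_path() does.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, flattened view of the receiver's current directory tree. */
struct toy_inode {
	uint64_t ino;
	uint64_t gen;
	uint64_t parent_ino;	/* 0 means "no parent" (the root) */
	uint64_t parent_gen;
};

/* Made-up example state: 256 is the root, 257 and 259 live in it, 258 in 257. */
const struct toy_inode toy_tree[] = {
	{ 256, 1, 0, 0 },
	{ 257, 5, 256, 1 },
	{ 258, 5, 257, 5 },
	{ 259, 6, 256, 1 },
};
const size_t toy_tree_len = sizeof(toy_tree) / sizeof(toy_tree[0]);

/* Find an inode by number and generation, or return NULL. */
const struct toy_inode *toy_lookup(const struct toy_inode *tbl, size_t n,
				   uint64_t ino, uint64_t gen)
{
	for (size_t i = 0; i < n; i++)
		if (tbl[i].ino == ino && tbl[i].gen == gen)
			return &tbl[i];
	return NULL;
}

/*
 * Return true if (anc_ino, anc_gen) is currently an ancestor of (ino, gen).
 * Only numbers are compared, never path strings. The step counter guards
 * against a corrupted table that contains a cycle.
 */
bool toy_is_cur_ancestor(const struct toy_inode *tbl, size_t n,
			 uint64_t anc_ino, uint64_t anc_gen,
			 uint64_t ino, uint64_t gen)
{
	const struct toy_inode *cur = toy_lookup(tbl, n, ino, gen);
	size_t steps = 0;

	while (cur && cur->parent_ino != 0 && steps++ < n) {
		if (cur->parent_ino == anc_ino && cur->parent_gen == anc_gen)
			return true;
		cur = toy_lookup(tbl, n, cur->parent_ino, cur->parent_gen);
	}
	return false;
}
```

With that table, 257 (gen 5) is reported as an ancestor of 258, while 259
is not, and neither is a 257 with a different generation number.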

But we can make it simpler than writing such a new function that
would be similar to get_cur_path()... Just reset valid_path and
compute it again after orphanizing a conflicting entry, i.e. don't
bother checking for ancestry.

So the previous patch would become (also at
https://friendpaste.com/6jdXdYPdC6YFffwNL6V563):

diff --git a/fs/btrfs/send.c b/fs/btrfs/send.c
index 3c38879..d34df19 100644
--- a/fs/btrfs/send.c
+++ b/fs/btrfs/send.c
@@ -3629,16 +3629,6 @@ verbose_printk("btrfs: process_recorded_refs
%llu\n", sctx->cur_ino);
  if (ret) {
  struct name_cache_entry *nce;
  struct waiting_dir_move *wdm;
- bool cur_is_ancestor = false;
-
- /*
- * check is dset path is ancestor src path
- * if yes, need to update cur_ino path
- */
- if (strncmp(cur->full_path->start, valid_path->start,
fs_path_len(cur->full_path)) == 0 &&
- fs_path_len(valid_path) > fs_path_len(cur->full_path) &&
valid_path->start[fs_path_len(cur->full_path)] == '/') {
- cur_is_ancestor = true;
- }

  ret = orphanize_inode(sctx, ow_inode, ow_gen,
  cur->full_path);
@@ -3672,15 +3662,18 @@ verbose_printk("btrfs: process_recorded_refs
%llu\n", sctx->cur_ino);
  }

  /*
- * if ow_inode is ancestor cur_ino, need to update
- * update cur_ino path.
+ * ow_inode might currently be an ancestor of
+ * cur_ino, therefore compute valid_path (the
+ * current path of cur_ino) again because it
+ * might contain the pre-orphanization name of
+ * ow_inode, which is no longer valid.
  */
- if (cur_is_ancestor) {
- fs_path_reset(valid_path);
- ret = get_cur_path(sctx, sctx->cur_ino, sctx->cur_inode_gen, valid_path);
- if (ret < 0)
- goto out;
- }
+ fs_path_reset(valid_path);
+ ret = get_cur_path(sctx, sctx->cur_ino,
+   sctx->cur_inode_gen,
+   valid_path);
+ if (ret < 0)
+ goto out;
  } else {
  ret = send_unlink(sctx, cur->full_path);
  if (ret < 0)


>
> Thanks
> Robbie Ko
>
> 2015-06-05 3:19 GMT+08:00 Filipe David Manana :
>> On Thu, Jun 4, 2015 at 2:50 PM, Filipe David Manana  
>> wrote:
>>> On Thu, Jun 4, 2015 at 12:18 PM, Robbie Ko  wrote:
>>>> Base on [PATCH] Btrfs: incremental send, check if orphanized dir inode 
>>>> needs delayed rename
>>>>
>>>> Example1:
>>>> There's one case where we can't issue a rename operation for a directory
>>>> immediately when we process it.
>>>>
>>>> Parent snapshot:
>>>> | d/ (ino 257)
>>>>   | p1 (ino 258)
>>>> | p1/ (ino 259)
>>>>
>>>> Send snapshot:
>>>> | d/ (ino 257)
>>>>   | p1 (ino 259)
>>>> | p1/ (ino 258)
>>>>
>>>> Here we can not rename 258 from d/p1 to p1/p1 without the rename of inode 
>>>> 259.
>>>> p1 258 is put into wait_parent_move. 259 can't be rename to d/p1, so it is 
>>>> put into
>>>> circular waiting happens.
>>>
>>> "... i

Re: [PATCH 1/5] Btrfs: incremental send, avoid circular waiting and descendant overwrite ancestor need to update path

2015-06-04 Thread Filipe David Manana
On Thu, Jun 4, 2015 at 2:50 PM, Filipe David Manana  wrote:
> On Thu, Jun 4, 2015 at 12:18 PM, Robbie Ko  wrote:
>> Base on [PATCH] Btrfs: incremental send, check if orphanized dir inode needs 
>> delayed rename
>>
>> Example1:
>> There's one case where we can't issue a rename operation for a directory
>> immediately when we process it.
>>
>> Parent snapshot:
>> | d/ (ino 257)
>>   | p1 (ino 258)
>> | p1/ (ino 259)
>>
>> Send snapshot:
>> | d/ (ino 257)
>>   | p1 (ino 259)
>> | p1/ (ino 258)
>>
>> Here we can not rename 258 from d/p1 to p1/p1 without the rename of inode 
>> 259.
>> p1 258 is put into wait_parent_move. 259 can't be rename to d/p1, so it is 
>> put into
>> circular waiting happens.
>
> "... into circular waiting happens" -> so 259's rename is delayed to
> happen after 258's rename, which creates a circular dependency (258 ->
> 259 -> 258).
>
>> This is fix by rename destination directory and set
>> it as orphanized for this case.
>>
>> Example2:
>> There's one case where we can't issue a rename operation for a directory
>> immediately we process it.
>> After moving 262 outside, path of 265 is stored in the name_cache_entry.
>> When 263 try to overwrite 265, its ancestor, 265 is moved to orphanized. 
>> Path of 263
>> is still the original path, however. This causes error.
>
> For the sake of a more complete/informative change log, can you
> mention what's the error?
>
>>
>> Parent snapshot:
>> | a/ (ino 259)
>>   | c (ino 266)
>> | d/ (ino 260)
>>   | ance (ino 265)
>> | e (ino 261)
>> | f (ino 262)
>> | ance (ino 263)
>>
>> Send snapshot:
>> | a/ (ino 259)
>> | c/ (ino 266)
>>   | ance (ino 265)
>> | d/ (ino 260)
>>   | ance (ino 263)
>> | f/ (ino 262)
>>   | e (ino 261)
>>
>> Signed-off-by: Robbie Ko 
>> ---
>>  fs/btrfs/send.c | 45 -
>>  1 file changed, 40 insertions(+), 5 deletions(-)
>>
>> diff --git a/fs/btrfs/send.c b/fs/btrfs/send.c
>> index 1c1f161..fbfbb8b 100644
>> --- a/fs/btrfs/send.c
>> +++ b/fs/btrfs/send.c
>> @@ -230,7 +230,6 @@ struct pending_dir_move {
>> u64 parent_ino;
>> u64 ino;
>> u64 gen;
>> -   bool is_orphan;
>> struct list_head update_refs;
>>  };
>>
>> @@ -1840,7 +1839,7 @@ static int will_overwrite_ref(struct send_ctx *sctx, 
>> u64 dir, u64 dir_gen,
>>  * was already unlinked/moved, so we can safely assume that we will 
>> not
>>  * overwrite anything at this point in time.
>>  */
>> -   if (other_inode > sctx->send_progress) {
>> +   if (other_inode > sctx->send_progress || is_waiting_for_move(sctx, 
>> other_inode)) {
>> ret = get_inode_info(sctx->parent_root, other_inode, NULL,
>> who_gen, NULL, NULL, NULL, NULL);
>> if (ret < 0)
>> @@ -3014,7 +3013,6 @@ static int add_pending_dir_move(struct send_ctx *sctx,
>> pm->parent_ino = parent_ino;
>> pm->ino = ino;
>> pm->gen = ino_gen;
>> -   pm->is_orphan = is_orphan;
>> INIT_LIST_HEAD(&pm->list);
>> INIT_LIST_HEAD(&pm->update_refs);
>> RB_CLEAR_NODE(&pm->node);
>> @@ -3091,6 +3089,7 @@ static int apply_dir_move(struct send_ctx *sctx, 
>> struct pending_dir_move *pm)
>> struct waiting_dir_move *dm = NULL;
>> u64 rmdir_ino = 0;
>> int ret;
>> +   bool is_orphan;
>>
>> name = fs_path_alloc();
>> from_path = fs_path_alloc();
>> @@ -3102,9 +3101,10 @@ static int apply_dir_move(struct send_ctx *sctx, 
>> struct pending_dir_move *pm)
>> dm = get_waiting_dir_move(sctx, pm->ino);
>> ASSERT(dm);
>> rmdir_ino = dm->rmdir_ino;
>> +   is_orphan = dm->orphanized;
>> free_waiting_dir_move(sctx, dm);
>>
>> -   if (pm->is_orphan) {
>> +   if (is_orphan) {
>> ret = gen_unique_name(sctx, pm->ino,
>>   pm->gen, from_path);
>> } else {
>> @@ -3292,6 +3292,7 @@ static int wait_for_dest_dir_move(struct send_ctx 
>> *sctx,
>> u64 left_gen;

Re: [PATCH v2] btrfs: Fix lockdep warning of wr_ctx->wr_lock in scrub_free_wr_ctx()

2015-06-04 Thread Filipe David Manana
On Thu, Jun 4, 2015 at 1:09 PM, Zhaolei  wrote:
> From: Zhao Lei 
>
> lockdep report following warning in test:
>  [25176.843958] =
>  [25176.844519] [ INFO: inconsistent lock state ]
>  [25176.845047] 4.1.0-rc3 #22 Tainted: GW
>  [25176.845591] -
>  [25176.846153] inconsistent {SOFTIRQ-ON-W} -> {IN-SOFTIRQ-W} usage.
>  [25176.846713] fsstress/26661 [HC0[0]:SC1[1]:HE1:SE0] takes:
>  [25176.847246]  (&wr_ctx->wr_lock){+.?...}, at: [] 
> scrub_free_ctx+0x2d/0xf0 [btrfs]
>  [25176.847838] {SOFTIRQ-ON-W} state was registered at:
>  [25176.848396]   [] __lock_acquire+0x6a0/0xe10
>  [25176.848955]   [] lock_acquire+0xce/0x2c0
>  [25176.849491]   [] mutex_lock_nested+0x7f/0x410
>  [25176.850029]   [] scrub_stripe+0x4df/0x1080 [btrfs]
>  [25176.850575]   [] scrub_chunk.isra.19+0x111/0x130 [btrfs]
>  [25176.851110]   [] scrub_enumerate_chunks+0x27c/0x510 
> [btrfs]
>  [25176.851660]   [] btrfs_scrub_dev+0x1c7/0x6c0 [btrfs]
>  [25176.852189]   [] btrfs_dev_replace_start+0x36e/0x450 
> [btrfs]
>  [25176.852771]   [] btrfs_ioctl+0x1e10/0x2d20 [btrfs]
>  [25176.853315]   [] do_vfs_ioctl+0x318/0x570
>  [25176.853868]   [] SyS_ioctl+0x41/0x80
>  [25176.854406]   [] system_call_fastpath+0x12/0x6f
>  [25176.854935] irq event stamp: 51506
>  [25176.855511] hardirqs last  enabled at (51506): [] 
> vprintk_emit+0x225/0x5e0
>  [25176.856059] hardirqs last disabled at (51505): [] 
> vprintk_emit+0xb7/0x5e0
>  [25176.856642] softirqs last  enabled at (50886): [] 
> __do_softirq+0x363/0x640
>  [25176.857184] softirqs last disabled at (50949): [] 
> irq_exit+0x10d/0x120
>  [25176.857746]
>  other info that might help us debug this:
>  [25176.858845]  Possible unsafe locking scenario:
>  [25176.859981]CPU0
>  [25176.860537]
>  [25176.861059]   lock(&wr_ctx->wr_lock);
>  [25176.861705]   
>  [25176.862272] lock(&wr_ctx->wr_lock);
>  [25176.862881]
>   *** DEADLOCK ***
>
> Reason:
>  Above warning is caused by:
>  Interrupt
>  -> bio_endio()
>  -> ...
>  -> scrub_put_ctx()
>  -> scrub_free_ctx() *1
>  -> ...
>  -> mutex_lock(&wr_ctx->wr_lock);
>
>  scrub_put_ctx() is allowed to be called in end_bio interrupt, but
>  in code design, it will never call scrub_free_ctx(sctx) in interrupe
>  context(above *1), because btrfs_scrub_dev() get one additional
>  reference of sctx->refs, which makes scrub_free_ctx() only called
>  withine btrfs_scrub_dev().
>
>  Now the code runs out of our wish, because free sequence in
>  scrub_pending_bio_dec() have a gap.
>
>  Current code:
>  ---+---
>  scrub_pending_bio_dec()|  btrfs_scrub_dev
>  ---+---
>  atomic_dec(&sctx->bios_in_flight); |
>  wake_up(&sctx->list_wait); |
> | scrub_put_ctx()
> | -> atomic_dec_and_test(&sctx->refs)
>  scrub_put_ctx(sctx);   |
>  -> atomic_dec_and_test(&sctx->refs)|
>  -> scrub_free_ctx()|
>  ---+---
>
>  We expected:
>  ---+---
>  scrub_pending_bio_dec()|  btrfs_scrub_dev
>  ---+---
>  atomic_dec(&sctx->bios_in_flight); |
>  wake_up(&sctx->list_wait); |
>  scrub_put_ctx(sctx);   |
>  -> atomic_dec_and_test(&sctx->refs)|
> | scrub_put_ctx()
> | -> atomic_dec_and_test(&sctx->refs)
> | -> scrub_free_ctx()
>  ---+---
>
> Fix:
>  Move scrub_pending_bio_dec() to a workqueue, to avoid this function run
>  in interrupt context.
>  Tested by check tracelog in debug.
>
> Changelog v1->v2:
>  Use workqueue instead of adjust function call sequence in v1,
>  because v1 will introduce a bug pointed out by:
>  Filipe David Manana 
>
> Reported-by: Qu Wenruo 
> Signed-off-by: Zhao Lei 

Reviewed-by: Filipe Manana 

Thanks.

> ---
>  fs/btrfs/async-thread.c |  1 +
>  fs/btrfs/async-thread.h |  2 ++
>  fs/btrfs/ctree.h|  1 +
>  fs/btrfs/scrub.c| 26 +++---
>  4 files changed, 27 insertions(+), 3 deletions(-)
>
> diff --git a/fs/btrfs/async-thread.c b/fs/btrfs/async-thread.

Re: [PATCH 4/5] Btrfs: incremental send, fix rmdir but dir have a unprocess item

2015-06-04 Thread Filipe David Manana
On Thu, Jun 4, 2015 at 12:18 PM, Robbie Ko  wrote:
> There's one case where we can't rmdir issue.

"There's one case where we attempt to rmdir a directory prematurely."

>
> Example:
>
> Parent snapshot:
> | a/ (ino 279)
>   | c (ino 282)
> | del/ (ino 281)
>   | tmp/ (ino 280)
>   | long/ (ino 283)
>
> Send snapshot:
> | a/ (ino 279)
>   | long (ino 283)
> | c/ (ino 282)
>   | tmp/ (ino 280)
>
> Here we process 281 use can_rmdir check, but 280 is waiting, so create 
> orphan_dir_info
> and when 282 is move to dest, so 280 can move to c/tmp, and now run can_rmdir 
> check again.

> Return is true, because sctx->cur_ino is 282 , and call can_rmdir(, 
> sctx->cur_ino + 1)
> so 283 is equal or lesser than (sctx->cur_ino + 1), not anything unprocess.

We pass 283 (sctx->cur_ino + 1) as the send_progress to the
can_rmdir() function and that makes it return true when it shouldn't,
because the inode 283 wasn't processed yet and it's still a child of
the directory with inode number 281, which makes the receiver run into
an ENOTEMPTY error when attempting to remove the directory.
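
The off-by-one can be seen in a tiny user-space model. This is only a
sketch under the assumption that the relevant part of can_rmdir() boils
down to "refuse if any child has an inode number greater than
send_progress"; the real function does more (it also looks at waiting dir
moves), and toy_can_rmdir and toy_del_children are made-up names.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Toy stand-in for the send_progress check in can_rmdir(): the directory
 * can only be removed if no child in the parent snapshot has an inode
 * number greater than send_progress (such a child was not processed yet).
 */
bool toy_can_rmdir(const uint64_t *child_inos, size_t nr_children,
		   uint64_t send_progress)
{
	for (size_t i = 0; i < nr_children; i++)
		if (child_inos[i] > send_progress)
			return false;
	return true;
}

/* Children of del/ (ino 281) in the parent snapshot: tmp (280), long (283). */
const uint64_t toy_del_children[] = { 280, 283 };
```

With sctx->cur_ino == 282, passing cur_ino + 1 (283) lets inode 283 slip
through and the toy check says "yes, rmdir"; passing cur_ino (282) makes
it refuse, which is what we want while 283 is still pending.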

> So fix this rmdir for this case.
>
> Signed-off-by: Robbie Ko 
> ---
>  fs/btrfs/send.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/fs/btrfs/send.c b/fs/btrfs/send.c
> index ff9d052..e8eb3ab 100644
> --- a/fs/btrfs/send.c
> +++ b/fs/btrfs/send.c
> @@ -3213,7 +3213,7 @@ static int apply_dir_move(struct send_ctx *sctx, struct 
> pending_dir_move *pm)
> /* already deleted */
> goto finish;
> }
> -   ret = can_rmdir(sctx, rmdir_ino, odi->gen, sctx->cur_ino + 1);
> +   ret = can_rmdir(sctx, rmdir_ino, odi->gen, sctx->cur_ino);

Looks good, great catch.

Thanks.

> if (ret < 0)
> goto out;
> if (!ret)
> --
> 1.9.1
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html



-- 
Filipe David Manana,

"Reasonable men adapt themselves to the world.
 Unreasonable men adapt the world to themselves.
 That's why all progress depends on unreasonable men."
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 3/5] Btrfs: incremental send, fix orphan_dir_info not completely cleared

2015-06-04 Thread Filipe David Manana
On Thu, Jun 4, 2015 at 12:18 PM, Robbie Ko  wrote:
> There's one case where we not clear orphan_dir_info issue.

You mean where we leak an orphan_dir_info structure.

>
> Example:
>
> Parent snapshot:
> | a/ (ino 279)
>   | c (ino 282)
> | del/ (ino 281)
>   | tmp/ (ino 280)
>   | long/ (ino 283)
>   | longlong/ (ino 284)
>
> Send snapshot:
> | a/ (ino 279)
>   | long (ino 283)
>   | longlong (ino 284)
> | c/ (ino 282)
>   | tmp/ (ino 280)
>
> Here we process 281 use can_rmdir check, but 280 is waiting, so create 
> orphan_dir_info
> and when 282 is move to dest, so 280 can move to c/tmp, and now run can_rmdir 
> check again.
> Return is false, because 283 and 284 is unprocess, but now not release 
> orphan_dir_info.
> When 283 and 284 is processd, 281 be delete, but not delete orphan_dir_info.
> So fix this by release orphan_dir_info for this case.

Could be described more generically as freeing an existing
orphan_dir_info for a directory, when we realize we can't rmdir the
directory because it has a descendant that wasn't yet processed, and
the orphan_dir_info was created because it had a descendant that had
its rename operation delayed.

>
> Signed-off-by: Robbie Ko 
> ---
>  fs/btrfs/send.c | 19 ---
>  1 file changed, 12 insertions(+), 7 deletions(-)
>
> diff --git a/fs/btrfs/send.c b/fs/btrfs/send.c
> index 596b9dc..ff9d052 100644
> --- a/fs/btrfs/send.c
> +++ b/fs/btrfs/send.c
> @@ -2785,12 +2785,6 @@ add_orphan_dir_info(struct send_ctx *sctx, u64 dir_ino)
> struct rb_node *parent = NULL;
> struct orphan_dir_info *entry, *odi;
>
> -   odi = kmalloc(sizeof(*odi), GFP_NOFS);
> -   if (!odi)
> -   return ERR_PTR(-ENOMEM);
> -   odi->ino = dir_ino;
> -   odi->gen = 0;
> -
> while (*p) {
> parent = *p;
> entry = rb_entry(parent, struct orphan_dir_info, node);
> @@ -2799,11 +2793,16 @@ add_orphan_dir_info(struct send_ctx *sctx, u64 
> dir_ino)
> } else if (dir_ino > entry->ino) {
> p = &(*p)->rb_right;
> } else {
> -   kfree(odi);
> return entry;
> }
> }
>
> +   odi = kmalloc(sizeof(*odi), GFP_NOFS);
> +   if (!odi)
> +   return ERR_PTR(-ENOMEM);
> +   odi->ino = dir_ino;
> +   odi->gen = 0;
> +

All the above changes don't fix the issue described in this change -
the memory leak - they just avoid the overhead of allocating an
orphan_dir_info object unnecessarily.

The change is ok, but should be a separate patch in the series that
does only that.
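
For reference, the "look up first, allocate only on a miss" ordering looks
like this in a stripped-down user-space form. It is a sketch only: it uses
a plain unbalanced binary tree and calloc() instead of the kernel rbtree
and kmalloc(), it returns NULL rather than ERR_PTR(-ENOMEM), and the
toy_* names are invented for the example.

```c
#include <stdint.h>
#include <stdlib.h>

struct toy_odi {
	uint64_t ino;
	uint64_t gen;
	struct toy_odi *left;
	struct toy_odi *right;
};

/* Stands in for sctx->orphan_dirs in this toy model. */
struct toy_odi *toy_orphan_dirs;

/*
 * Return the existing entry for dir_ino, or insert and return a new one.
 * Returns NULL if the allocation fails. Nothing is allocated when the
 * entry already exists, so there is nothing to free on that path.
 */
struct toy_odi *toy_add_orphan_dir_info(struct toy_odi **root, uint64_t dir_ino)
{
	struct toy_odi **p = root;
	struct toy_odi *odi;

	while (*p) {
		if (dir_ino < (*p)->ino)
			p = &(*p)->left;
		else if (dir_ino > (*p)->ino)
			p = &(*p)->right;
		else
			return *p;
	}

	odi = calloc(1, sizeof(*odi));
	if (!odi)
		return NULL;
	odi->ino = dir_ino;
	odi->gen = 0;
	*p = odi;	/* link the new node where the search ended */
	return odi;
}
```

Calling it twice for the same directory inode returns the same entry both
times, and only the first call allocates.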

> rb_link_node(&odi->node, parent, p);
> rb_insert_color(&odi->node, &sctx->orphan_dirs);
> return odi;
> @@ -2913,6 +2912,12 @@ static int can_rmdir(struct send_ctx *sctx, u64 dir, 
> u64 dir_gen,
> }
>
> if (loc.objectid > send_progress) {
> +   struct orphan_dir_info *odi;
> +
> +   odi = get_orphan_dir_info(sctx, dir);
> +   if (odi) {
> +   free_orphan_dir_info(sctx, odi);
> +   }

Looks correct, great catch.

Thanks.

>             ret = 0;
> goto out;
> }
> --
> 1.9.1
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html





Re: [PATCH 5/5] Btrfs: incremental send, fix rmdir not send utimes

2015-06-04 Thread Filipe David Manana
On Thu, Jun 4, 2015 at 12:18 PM, Robbie Ko  wrote:
> There's one case where we can't issue a utimes operation for a directory.
> When 263 will delete, waiting 261 and set 261 rmdir_ino, but 262 earlier
> processed and update uime between two parent directory.
> So fix this by not update non exist utimes for this case.

So you mean that we attempt to update utimes for an inode,
corresponding to a directory, that exists in the parent snapshot but
not in the send snapshot.

So the subject should be something like "Btrfs: incremental send,
don't send utimes for non-existing directory" instead of "Btrfs:
incremental send, fix rmdir not send utimes"

>
> Example:
>
> Parent snapshot:
> | a/ (ino 259)
>   | c (ino 264)
> | b/ (ino 260)
>   | d (ino 265)
> | del/ (ino 263)
>   | item1/ (ino 261)
>   | item2/ (ino 262)
>
> Send snapshot:
> | a/ (ino 259)
> | b/ (ino 260)
> | c/ (ino 2)
>   | item2 (ino 259)
> | d/ (ino 257)
>   | item1/ (ino 258)
>
> Signed-off-by: Robbie Ko 
> ---
>  fs/btrfs/send.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/fs/btrfs/send.c b/fs/btrfs/send.c
> index e8eb3ab..46f954c 100644
> --- a/fs/btrfs/send.c
> +++ b/fs/btrfs/send.c
> @@ -2468,7 +2468,7 @@ verbose_printk("btrfs: send_utimes %llu\n", ino);
> key.type = BTRFS_INODE_ITEM_KEY;
> key.offset = 0;
> ret = btrfs_search_slot(NULL, sctx->send_root, &key, path, 0, 0);
> -   if (ret < 0)
> +   if (ret != 0)
> goto out;

So I don't think this is a good fix. The problem is in some code that
calls this function (send_utimes) against a directory that doesn't
exist. It just shouldn't do that; its logic should be fixed.
This approach, while it works, just hides logic errors in one or more
code paths. None of the callers checks for a return value of 1 (they
only react to values < 0), so it also introduces the possibility of
propagating a return value of 1 to user space.
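
To illustrate the return value problem: btrfs_search_slot() returns < 0 on
error, 0 when the key was found and 1 when it was not. The sketch below
models that convention in user space with made-up toy_* functions. The
"strict" variant, which turns "not found" into -ENOENT, is just one
possible way to keep a positive value from escaping; it is shown for
contrast and does not replace fixing the caller's logic.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Inodes that exist in the (toy) send snapshot. */
const uint64_t toy_send_root_inos[] = { 259, 260, 264, 265 };
const size_t toy_send_root_len =
	sizeof(toy_send_root_inos) / sizeof(toy_send_root_inos[0]);

/* Same convention as btrfs_search_slot(): <0 error, 0 found, 1 not found. */
int toy_search_slot(const uint64_t *inos, size_t n, uint64_t ino)
{
	if (!inos)
		return -EINVAL;
	for (size_t i = 0; i < n; i++)
		if (inos[i] == ino)
			return 0;
	return 1;
}

/* Mirrors "if (ret != 0) goto out;": a not-found lookup returns 1. */
int toy_send_utimes_leaky(const uint64_t *inos, size_t n, uint64_t ino)
{
	int ret = toy_search_slot(inos, n, ino);

	if (ret != 0)
		goto out;
	/* ... the utimes command would be emitted here ... */
out:
	return ret;
}

/* Maps "not found" to a real error so callers checking ret < 0 see it. */
int toy_send_utimes_strict(const uint64_t *inos, size_t n, uint64_t ino)
{
	int ret = toy_search_slot(inos, n, ino);

	if (ret > 0)
		ret = -ENOENT;
	if (ret < 0)
		goto out;
	/* ... the utimes command would be emitted here ... */
out:
	return ret;
}
```

For the deleted directory 263, the leaky variant hands back 1, which a
caller that only tests "ret < 0" treats as success and passes along.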

thanks

>
> eb = path->nodes[0];
> --
> 1.9.1
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html





Re: [PATCH 2/5] Btrfs: incremental send, avoid ancestor rename to descendant

2015-06-04 Thread Filipe David Manana
et > 0) {
> +   ret = 0;
> +   break;
> +   }
> +   }
> +   if (ret < 0)
> +   break;
> +   if (parent_inode == start_ino) {
> +   ret = 1;
> +   if (*ancestor_ino == 0)
> +   *ancestor_ino = ino;
> +   break;
> +   }
> +   ino = parent_inode;
> +   gen = parent_gen;
> +   }
> +   return ret;
> +}
> +
>  static int apply_dir_move(struct send_ctx *sctx, struct pending_dir_move *pm)
>  {
> struct fs_path *from_path = NULL;
> @@ -3089,6 +3139,7 @@ static int apply_dir_move(struct send_ctx *sctx, struct 
> pending_dir_move *pm)
> struct waiting_dir_move *dm = NULL;
> u64 rmdir_ino = 0;
> int ret;
> +   u64 ancestor = 0;
> bool is_orphan;
>
> name = fs_path_alloc();
> @@ -3122,6 +3173,22 @@ static int apply_dir_move(struct send_ctx *sctx, 
> struct pending_dir_move *pm)
> goto out;
>
> sctx->send_progress = sctx->cur_ino + 1;
> +   ret = path_loop(sctx, name, pm->ino, pm->gen, &ancestor);
> +   if (ret) {
> +   LIST_HEAD(deleted_refs);
> +   ASSERT(ancestor > BTRFS_FIRST_FREE_OBJECTID);
> +   ret = add_pending_dir_move(sctx, pm->ino, pm->gen, ancestor,
> +   
> &pm->update_refs, &deleted_refs,
> +   
> is_orphan);
> +   if (ret < 0)
> +   goto out;
> +   if (rmdir_ino) {
> +   dm = get_waiting_dir_move(sctx, pm->ino);
> +   ASSERT(dm);
> +   dm->rmdir_ino = rmdir_ino;
> +   }
> +   goto out;
> +   }

So far you're basically reverting this change:
https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=5f806c3ae2ff6263a10a6901f97abb74dac03d36

That should be a separate 'revert' patch in the series.


> fs_path_reset(name);
> to_path = name;
> name = NULL;
> @@ -3693,6 +3760,34 @@ verbose_printk("btrfs: process_recorded_refs %llu\n", 
> sctx->cur_ino);
> }
>
> /*
> +* if cur_ino is cur ancestor, can't move now,
> +* find descendant who is waiting, waiting it.
> +*/

This comment is confusing too. cur_ino is ancestor of itself, or
ancestor of which inode? Plus the code below is looking for an
ancestor while the comment mentions finding a descendant.

> +   if (can_rename && (strncmp(valid_path->start, 
> cur->full_path->start, fs_path_len(valid_path)) == 0) &&
> +   
> fs_path_len(cur->full_path) > fs_path_len(valid_path) && 
> cur->full_path->start[fs_path_len(valid_path)] == '/') {

Same comment as in the first patch: the line is too long.
Also, given that this check (second condition) is being repeated in 2
different places, it should be encapsulated in a helper function.
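
Something along these lines, as a rough user-space sketch. It works on
NUL-terminated strings instead of struct fs_path (so strlen() replaces
fs_path_len()), and toy_path_is_ancestor is a made-up name. It only shows
how the repeated three-part condition could be wrapped up, not whether
comparing paths is the right approach in the first place.

```c
#include <stdbool.h>
#include <string.h>

/*
 * Return true if `anc` is a proper path ancestor of `path`, i.e. `path`
 * starts with `anc` and the next character is a '/' separator.
 */
bool toy_path_is_ancestor(const char *anc, const char *path)
{
	size_t anc_len = strlen(anc);
	size_t path_len = strlen(path);

	return path_len > anc_len &&
	       strncmp(anc, path, anc_len) == 0 &&
	       path[anc_len] == '/';
}
```

The separator test is what keeps "a/b" from being treated as an ancestor
of "a/bc", and the length test rejects a path compared against itself.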

> +   struct fs_path *name = NULL;
> +   u64 ancestor;
> +   u64 old_send_progress = sctx->send_progress;
> +
> +   name = fs_path_alloc();

The allocation can fail and return NULL, in which case we need to return -ENOMEM.

thanks

> +   sctx->send_progress = sctx->cur_ino + 1;
> +   ret = path_loop(sctx, name, sctx->cur_ino, 
> sctx->cur_inode_gen, &ancestor);
> +   if (ret) {
> +   ret = add_pending_dir_move(sctx, 
> sctx->cur_ino, sctx->cur_inode_gen,
> +   ancestor, 
> &sctx->new_refs, &sctx->deleted_refs, is_orphan);
> +   if (ret < 0) {
> +   sctx->send_progress = 
> old_send_progress;
> +   fs_path_free(name);
> +   goto out;
> +   }
> +   can_rename = false;
> +   *pending_move = 1;
> +   }
> +   sctx->send_progress = old_send_progress;
> +   fs_path_free(name);
> +   }
> +
> +   /*
>  * link/move the ref to the new place. If we have an orphan
>  * inode, move it and update valid_path. If not, link or move
>  * it depending on the inode mode.
> --
> 1.9.1
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html





Re: [PATCH 1/5] Btrfs: incremental send, avoid circular waiting and descendant overwrite ancestor need to update path

2015-06-04 Thread Filipe David Manana
d_pending_dir_move(sctx,
>sctx->cur_ino,
>sctx->cur_inode_gen,
> @@ -3610,11 +3612,33 @@ verbose_printk("btrfs: process_recorded_refs %llu\n", 
> sctx->cur_ino);
> goto out;
> if (ret) {
> struct name_cache_entry *nce;
> +   struct waiting_dir_move *wdm;
> +   bool cur_is_ancestor = false;
> +
> +   /*
> +* check is dset path is ancestor src path
> +* if yes, need to update cur_ino path
> +*/

Typos/confusing comment and doesn't explain why the following check is
being done.

> +   if (strncmp(cur->full_path->start, 
> valid_path->start, fs_path_len(cur->full_path)) == 0 &&
> +   
> fs_path_len(valid_path) > fs_path_len(cur->full_path) && 
> valid_path->start[fs_path_len(cur->full_path)] == '/') {

At a first glance it seems confusing why we are comparing substrings
of an entire path instead of just the old and new names for the
current and the conflicting (ow_inode) inodes and their parent inode
numbers and generation. I think the comment should explain why.

Also please try to keep lines up to 80 characters (that line is 169
characters long).
You can run ./scripts/checkpatch.pl to validate your patch files; it
will warn you if the code doesn't comply with the coding standard.

> +   cur_is_ancestor = true;
> +   }
>
> ret = orphanize_inode(sctx, ow_inode, ow_gen,
> cur->full_path);
> if (ret < 0)
> goto out;
> +
> +   /*
> +* check is waiting dir, if yes change the ino
> +* to orphanized in the waiting tree.
> +*/
> +   if (is_waiting_for_move(sctx, ow_inode)) {
> +   wdm = get_waiting_dir_move(sctx, 
> ow_inode);
> +   ASSERT(wdm);
> +   wdm->orphanized = true;
> +   }
> +
> /*
>  * Make sure we clear our orphanized inode's
>  * name from the name cache. This is because 
> the
> @@ -3630,6 +3654,17 @@ verbose_printk("btrfs: process_recorded_refs %llu\n", 
> sctx->cur_ino);
> name_cache_delete(sctx, nce);
> kfree(nce);
> }
> +
> +   /*
> +* if ow_inode is ancestor cur_ino, need to 
> update
> +* update cur_ino path.
> +*/

"If ow_inode is an ancestor of cur_ino in the send snapshot, update
valid_path because ow_inode was orphanized and valid_path contains its
pre-orphanization name, which is not valid anymore".

> +   if (cur_is_ancestor) {
> +   fs_path_reset(valid_path);
> +   ret = get_cur_path(sctx, 
> sctx->cur_ino, sctx->cur_inode_gen, valid_path);
> +   if (ret < 0)
> +   goto out;
> +   }
> } else {
> ret = send_unlink(sctx, cur->full_path);
> if (ret < 0)
> --
> 1.9.1
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html





Re: [PATCH 0/5] Btrfs incremental send fix serval case for rename and rm directory

2015-06-04 Thread Filipe David Manana
On Thu, Jun 4, 2015 at 12:18 PM, Robbie Ko  wrote:
> Patch for fix btrfs send receive. These patches base on v4.1-rc6-49-g8a7deb3
> plus following patches.
> [PATCH] Btrfs: incremental send, don't delay directory renames unnecessarily
> [PATCH] Btrfs: incremental send, check if orphanized dir inode needs delayed 
> rename
>
> Thanks.
>
> Robbie Ko (5):
>   Btrfs: incremental send, avoid circular waiting and descendant
> overwrite ancestor need to update path
>   Btrfs: incremental send, avoid ancestor rename to descendant
>   Btrfs: incremental send, fix orphan_dir_info not completely cleared
>   Btrfs: incremental send, fix rmdir but dir have a unprocess item
>   Btrfs: incremental send, fix rmdir not send utimes
>
>  fs/btrfs/send.c | 163 
> +++-
>  1 file changed, 149 insertions(+), 14 deletions(-)

Thanks for doing this. After a quick look over all the patches they seem
ok; I'll follow up with some minor comments later.

Would you be willing to submit test cases for xfstests that cover all
these cases?
We don't want to get regressions in the future, and this particular
part of send is complex and messy, so it's easy to break it for some use
cases without noticing. xfstests [1] is the test suite that
developers (and some QA people) use to validate their changes and
verify they aren't introducing regressions.

[1] https://git.kernel.org/cgit/fs/xfs/xfstests-dev.git/log/ and
mailing list: fste...@vger.kernel.org

>
> --
> 1.9.1
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html





Re: [PATCH] xfstests: btrfs: 022: add a quota rescan -w to wait rescan finished.

2015-06-03 Thread Filipe David Manana
On Wed, Jun 3, 2015 at 8:10 AM, Dongsheng Yang wrote:
> When we enable quota, btrfs will rescan quota numbers. We need
> to wait for the rescan to finish before doing any more operations on btrfs
> qgroups. Otherwise, the new btrfs-progs will warn:
>
> WARNING: Rescan is running, qgroup data may be incorrect.
>
> This makes btrfs/022 fail.
>
> Signed-off-by: Dongsheng Yang 
Reviewed-by: Filipe Manana 

Thanks, it works and it makes sense.

> ---
>  tests/btrfs/022 | 1 +
>  1 file changed, 1 insertion(+)
>
> diff --git a/tests/btrfs/022 b/tests/btrfs/022
> index 5c1a82d..56d4f3d 100755
> --- a/tests/btrfs/022
> +++ b/tests/btrfs/022
> @@ -51,6 +51,7 @@ _basic_test()
>  {
> _run_btrfs_util_prog subvolume create $SCRATCH_MNT/a
> _run_btrfs_util_prog quota enable $SCRATCH_MNT/a
> +   _run_btrfs_util_prog quota rescan -w $SCRATCH_MNT
> subvolid=$(_btrfs_get_subvolid $SCRATCH_MNT a)
> $BTRFS_UTIL_PROG qgroup show $units $SCRATCH_MNT | grep $subvolid >> \
> $seqres.full 2>&1
> --
> 1.8.4.2
>



-- 
Filipe David Manana,

"Reasonable men adapt themselves to the world.
 Unreasonable men adapt the world to themselves.
 That's why all progress depends on unreasonable men."


Re: [PATCH] btrfs: Fix lockdep warning of wr_ctx->wr_lock in scrub_free_wr_ctx()

2015-06-02 Thread Filipe David Manana
b_put_ctx(sctx);
> atomic_dec(&sctx->bios_in_flight);
> wake_up(&sctx->list_wait);
> -   scrub_put_ctx(sctx);

Hi Zhao,

I find this confusing. A "put" should logically be done by a task when
it no longer needs or accesses the resource (sctx).

Plus, while this fixes what is apparently a device-replace-specific issue,
it reintroduces the issue fixed and described by:

https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=de554a4fa61d77df2704be5b6b47472b2dbd1875

(which got another fix on top:
https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=f55985f4dda5cfb6967c17e96237f3c859076eb3)

Can we get a fix that doesn't reintroduce the use-after-free issue for
the non-device-replace case?

thanks
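
To make the ordering I have in mind concrete, here is a minimal userspace
sketch. It is not the scrub code: struct ctx, ctx_get(), ctx_put() and
bio_end_io_worker() are hypothetical names, and a plain int stands in for
the atomic refcount. The only point is that the put comes after the last
access to the context.

```c
#include <assert.h>

/* Hypothetical refcounted context, standing in for something like sctx. */
struct ctx {
	int refs;            /* plain int: this sketch is single-threaded */
	int bios_in_flight;
	int freed;           /* set instead of calling free(), so it can be inspected */
};

void ctx_get(struct ctx *c)
{
	c->refs++;
}

void ctx_put(struct ctx *c)
{
	assert(c->refs > 0);
	if (--c->refs == 0)
		c->freed = 1;
}

/* Worker: every access to *c happens before the put. */
void bio_end_io_worker(struct ctx *c)
{
	c->bios_in_flight--;   /* last access to the context */
	ctx_put(c);            /* only now give up our reference */
}
```

If the put were moved before the decrement, the worker could touch the
context after another task dropped the final reference.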



>  }
>
>  static void __scrub_blocked_if_needed(struct btrfs_fs_info *fs_info)
> --
> 1.8.5.1
>



-- 
Filipe David Manana,

"Reasonable men adapt themselves to the world.
 Unreasonable men adapt the world to themselves.
 That's why all progress depends on unreasonable men."


Re: [PATCH] Btrfs: don't invalidate root dentry when subvolume deletion fails

2015-06-01 Thread Filipe David Manana
On Sat, May 30, 2015 at 9:59 AM, Omar Sandoval  wrote:
> Since commit bafc9b754f75 ("vfs: More precise tests in d_invalidate"),
> mounted subvolumes can be deleted because d_invalidate() won't fail.
> However, we run into problems when we attempt to delete the default
> subvolume while it is mounted as the root filesystem:
>
> # btrfs subvol list /
> ID 257 gen 306 top level 5 path rootvol
> ID 267 gen 334 top level 5 path snap1
> # btrfs subvol get-default /
> ID 267 gen 334 top level 5 path snap1
> # btrfs inspect-internal rootid /
> 267
> # mount -o subvol=/ /dev/vda1 /mnt
> # btrfs subvol del /mnt/snap1
> Delete subvolume (no-commit): '/mnt/snap1'
> ERROR: cannot delete '/mnt/snap1' - Operation not permitted
> # findmnt /
> findmnt: can't read /proc/mounts: No such file or directory
> # ls /proc
> #
>
> Markus reported that this same scenario simply led to a kernel oops.
>
> This happens because in btrfs_ioctl_snap_destroy(), we call
> d_invalidate() before we check may_destroy_subvol(), which means that we
> detach the submounts and drop the dentry before erroring out. Instead,
> we should only invalidate the dentry once we know that we're going
> through with the deletion.
>
> Cc: 
> Fixes: bafc9b754f75 ("vfs: More precise tests in d_invalidate")
> Reported-by: Markus Schauler 
> Signed-off-by: Omar Sandoval 
> ---
> The other fix for preventing all mounted subvolumes from being deleted
> would preclude this, but it sounded like we were leaning towards
> enforcing that in userspace once subvolume info becomes available in
> /proc/mounts, so this should be fixed separately.
>
>  fs/btrfs/ioctl.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
> index 1c22c6518504..8edb8544088b 100644
> --- a/fs/btrfs/ioctl.c
> +++ b/fs/btrfs/ioctl.c
> @@ -2413,14 +2413,14 @@ static noinline int btrfs_ioctl_snap_destroy(struct file *file,
> goto out_unlock_inode;
> }
>
> -   d_invalidate(dentry);
> -
> down_write(&root->fs_info->subvol_sem);
>
> err = may_destroy_subvol(dest);
> if (err)
> goto out_up_write;
>
> +   d_invalidate(dentry);
> +

Is there any reason not to call d_invalidate() only after the call to
btrfs_unlink_subvol() succeeds? I don't see why we should invalidate
before the deletion has actually succeeded (before that point the
metadata reservation can fail, or we can fail to start/join a
transaction, etc).

Also, would you consider making an xfstest for this?

thanks
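
The ordering I'm suggesting, as a minimal userspace sketch. This is not
the ioctl code: struct entry, do_delete() and destroy_entry() are
hypothetical names, and the simulated_error argument just stands in for
any step that can fail before the deletion completes.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of a cached entry that can be invalidated. */
struct entry {
	bool cached;    /* still valid in the (pretend) dentry cache */
	bool deleted;   /* the underlying object was removed */
};

/*
 * Stand-in for the steps that can fail (permission check, metadata
 * reservation, starting a transaction, the unlink itself).
 * Returns 0 on success or a negative error.
 */
int do_delete(struct entry *e, int simulated_error)
{
	if (simulated_error)
		return simulated_error;
	e->deleted = true;
	return 0;
}

int destroy_entry(struct entry *e, int simulated_error)
{
	int err = do_delete(e, simulated_error);

	if (err)
		return err;        /* failure: cache state is untouched */
	e->cached = false;         /* success: only now invalidate */
	return 0;
}
```

With this order a failed deletion leaves the entry exactly as it was, so
nothing gets detached when we end up returning an error.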


> btrfs_init_block_rsv(&block_rsv, BTRFS_BLOCK_RSV_TEMP);
> /*
>  * One for dir inode, two for dir entries, two for root
> --
> 2.4.2
>



-- 
Filipe David Manana,

"Reasonable men adapt themselves to the world.
 Unreasonable men adapt the world to themselves.
 That's why all progress depends on unreasonable men."


Re: [PATCH 6/7] btrfs-progs: Print warning message if qgroup data is inconsistent.

2015-06-01 Thread Filipe David Manana
On Mon, Jun 1, 2015 at 2:25 AM, Qu Wenruo  wrote:
>
>
>  Original Message  
> Subject: Re: [PATCH 6/7] btrfs-progs: Print warning message if qgroup data
> is inconsistent.
> From: Filipe David Manana 
> To: Qu Wenruo 
> Date: 2015-05-30 19:39
>
>> On Fri, Feb 27, 2015 at 8:26 AM, Qu Wenruo 
>> wrote:
>>>
>>> Before this patch, qgroup show didn't check the btrfs qgroup status, so even
>>> if the INCONSISTENT flag is set, the user is not aware of it.
>>>
>>> This patch includes BTRFS_QGROUP_STATUS_ITEM in the search range and
>>> checks the flags; if any flag indicates that the qgroup data is
>>> inconsistent, it informs the user.
>>>
>>> NOTE: There are several kernel bugs, ranging from the INCONSISTENT flag always
>>> being set to the RUNNING flag not being cleared until umount.
>>> So this warning will always be here if using a newer kernel fixing these
>>> bugs.
>>>
>>> Signed-off-by: Qu Wenruo 
>>> ---
>>>   qgroup.c | 26 --
>>>   1 file changed, 24 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/qgroup.c b/qgroup.c
>>> index 7288365..4173846 100644
>>> --- a/qgroup.c
>>> +++ b/qgroup.c
>>> @@ -1016,6 +1016,20 @@ static void __filter_and_sort_qgroups(struct
>>> qgroup_lookup *all_qgroups,
>>>  n = rb_prev(n);
>>>  }
>>>   }
>>> +
>>> +static inline void print_status_flag_warning(u64 flags)
>>> +{
>>> +   if (!(flags & BTRFS_QGROUP_STATUS_FLAG_ON))
>>> +   fprintf(stderr,
>>> +   "WARNING: Quota disabled, qgroup data may be out of
>>> date\n");
>>> +   else if (flags & BTRFS_QGROUP_STATUS_FLAG_RESCAN)
>>> +   fprintf(stderr,
>>> +   "WARNING: Rescan is running, qgroup data may be
>>> incorrect\n");
>>
>>
>> Hi Qu, did you run xfstests? Did btrfs/022 pass for you?
>>
>> btrfs/022 47s ... - output mismatch (see
>> /home/fdmanana/git/hub/xfstests/results//btrfs/022.out.bad)
>>  --- tests/btrfs/022.out 2014-11-17 20:59:51.178203000 +
>>  +++ /home/fdmanana/git/hub/xfstests/results//btrfs/022.out.bad
>> 2015-05-30 12:35:55.917146846 +0100
>>  @@ -1,2 +1,3 @@
>>   QA output created by 022
>>  +WARNING: Rescan is running, qgroup data may be incorrect
>>   Silence is golden
>>  ...
>>  (Run 'diff -u tests/btrfs/022.out
>> /home/fdmanana/git/hub/xfstests/results//btrfs/022.out.bad'  to see
>> the entire diff)
>>
>> thanks
>
> Unfortunately, it's quite hard to trigger in my environment.
> I tried about 15 times, and can only trigger it once.
>
> Any hint about mount options or other things to improve the reproducibility?

No mount options at all (nor mkfs -O features). Happens all the time here.

Thanks for looking into it.

>
> Thanks,
> Qu
>
>>
>>
>>> +   else if (flags & BTRFS_QGROUP_STATUS_FLAG_INCONSISTENT)
>>> +   fprintf(stderr,
>>> +   "WARNING: Qgroup data inconsistent, rescan
>>> recommended\n");
>>> +}
>>> +
>>>   static int __qgroups_search(int fd, struct qgroup_lookup
>>> *qgroup_lookup)
>>>   {
>>>  int ret;
>>> @@ -1039,7 +1053,7 @@ static int __qgroups_search(int fd, struct
>>> qgroup_lookup *qgroup_lookup)
>>>
>>>  sk->tree_id = BTRFS_QUOTA_TREE_OBJECTID;
>>>  sk->max_type = BTRFS_QGROUP_RELATION_KEY;
>>> -   sk->min_type = BTRFS_QGROUP_INFO_KEY;
>>> +   sk->min_type = BTRFS_QGROUP_STATUS_KEY;
>>>  sk->max_objectid = (u64)-1;
>>>  sk->max_offset = (u64)-1;
>>>  sk->max_transid = (u64)-1;
>>> @@ -1070,7 +1084,15 @@ static int __qgroups_search(int fd, struct
>>> qgroup_lookup *qgroup_lookup)
>>>off);
>>>  off += sizeof(*sh);
>>>
>>> -   if (sh->type == BTRFS_QGROUP_INFO_KEY) {
>>> +   if (sh->type == BTRFS_QGROUP_STATUS_KEY) {
>>> +   struct btrfs_qgroup_status_item *si;
>>> +       u64 flags;
>>> +
>>> +   si = (struct btrfs_qgroup_status_item *)
>>> + 

Re: [PATCH 6/7] btrfs-progs: Print warning message if qgroup data is inconsistent.

2015-05-30 Thread Filipe David Manana
On Fri, Feb 27, 2015 at 8:26 AM, Qu Wenruo  wrote:
> Before this patch, qgroup show didn't check the btrfs qgroup status, so even
> if the INCONSISTENT flag is set, the user is not aware of it.
>
> This patch includes BTRFS_QGROUP_STATUS_ITEM in the search range and
> checks the flags; if any flag indicates that the qgroup data is
> inconsistent, it informs the user.
>
> NOTE: There are several kernel bugs, ranging from the INCONSISTENT flag always
> being set to the RUNNING flag not being cleared until umount.
> So this warning will always be here if using a newer kernel fixing these
> bugs.
>
> Signed-off-by: Qu Wenruo 
> ---
>  qgroup.c | 26 --
>  1 file changed, 24 insertions(+), 2 deletions(-)
>
> diff --git a/qgroup.c b/qgroup.c
> index 7288365..4173846 100644
> --- a/qgroup.c
> +++ b/qgroup.c
> @@ -1016,6 +1016,20 @@ static void __filter_and_sort_qgroups(struct qgroup_lookup *all_qgroups,
> n = rb_prev(n);
> }
>  }
> +
> +static inline void print_status_flag_warning(u64 flags)
> +{
> +   if (!(flags & BTRFS_QGROUP_STATUS_FLAG_ON))
> +   fprintf(stderr,
> +   "WARNING: Quota disabled, qgroup data may be out of date\n");
> +   else if (flags & BTRFS_QGROUP_STATUS_FLAG_RESCAN)
> +   fprintf(stderr,
> +   "WARNING: Rescan is running, qgroup data may be incorrect\n");

Hi Qu, did you run xfstests? Did btrfs/022 pass for you?

btrfs/022 47s ... - output mismatch (see
/home/fdmanana/git/hub/xfstests/results//btrfs/022.out.bad)
--- tests/btrfs/022.out 2014-11-17 20:59:51.178203000 +
+++ /home/fdmanana/git/hub/xfstests/results//btrfs/022.out.bad
2015-05-30 12:35:55.917146846 +0100
@@ -1,2 +1,3 @@
 QA output created by 022
+WARNING: Rescan is running, qgroup data may be incorrect
 Silence is golden
...
(Run 'diff -u tests/btrfs/022.out
/home/fdmanana/git/hub/xfstests/results//btrfs/022.out.bad'  to see
the entire diff)

thanks
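
For reference, the precedence of the checks in the quoted helper can be
modeled in userspace as below. This is only a sketch: status_flag_warning()
is a hypothetical name, and the three flag values are what I believe the
on-disk format uses, so treat them as assumptions. It returns the message
instead of printing it so the result can be checked.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Assumed flag values; the names are shortened for the sketch. */
#define QGROUP_STATUS_FLAG_ON           (1ULL << 0)
#define QGROUP_STATUS_FLAG_RESCAN       (1ULL << 1)
#define QGROUP_STATUS_FLAG_INCONSISTENT (1ULL << 2)

/* Returns the warning text for the given flags, or NULL if none applies. */
const char *status_flag_warning(uint64_t flags)
{
	if (!(flags & QGROUP_STATUS_FLAG_ON))
		return "Quota disabled, qgroup data may be out of date";
	if (flags & QGROUP_STATUS_FLAG_RESCAN)
		return "Rescan is running, qgroup data may be incorrect";
	if (flags & QGROUP_STATUS_FLAG_INCONSISTENT)
		return "Qgroup data inconsistent, rescan recommended";
	return NULL;
}
```

Because of the else-if chain, a running rescan hides the inconsistent
warning, and any test that runs qgroup show while the rescan flag is
still set gets the extra line in its output.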


> +   else if (flags & BTRFS_QGROUP_STATUS_FLAG_INCONSISTENT)
> +   fprintf(stderr,
> +   "WARNING: Qgroup data inconsistent, rescan recommended\n");
> +}
> +
>  static int __qgroups_search(int fd, struct qgroup_lookup *qgroup_lookup)
>  {
> int ret;
> @@ -1039,7 +1053,7 @@ static int __qgroups_search(int fd, struct qgroup_lookup *qgroup_lookup)
>
> sk->tree_id = BTRFS_QUOTA_TREE_OBJECTID;
> sk->max_type = BTRFS_QGROUP_RELATION_KEY;
> -   sk->min_type = BTRFS_QGROUP_INFO_KEY;
> +   sk->min_type = BTRFS_QGROUP_STATUS_KEY;
> sk->max_objectid = (u64)-1;
> sk->max_offset = (u64)-1;
> sk->max_transid = (u64)-1;
> @@ -1070,7 +1084,15 @@ static int __qgroups_search(int fd, struct qgroup_lookup *qgroup_lookup)
>   off);
> off += sizeof(*sh);
>
> -   if (sh->type == BTRFS_QGROUP_INFO_KEY) {
> +   if (sh->type == BTRFS_QGROUP_STATUS_KEY) {
> +   struct btrfs_qgroup_status_item *si;
> +   u64 flags;
> +
> +   si = (struct btrfs_qgroup_status_item *)
> +(args.buf + off);
> +   flags = btrfs_stack_qgroup_status_flags(si);
> +   print_status_flag_warning(flags);
> +   } else if (sh->type == BTRFS_QGROUP_INFO_KEY) {
> info = (struct btrfs_qgroup_info_item *)
>(args.buf + off);
>         a1 = btrfs_stack_qgroup_info_generation(info);
> --
> 2.3.0
>



-- 
Filipe David Manana,

"Reasonable men adapt themselves to the world.
 Unreasonable men adapt the world to themselves.
 That's why all progress depends on unreasonable men."


Re: oops BUG unable to handle kernel NULL point dereference IP io_ctl_check_crc

2015-05-11 Thread Filipe David Manana
On Sat, May 9, 2015 at 8:43 AM, Chris Murphy  wrote:
> On Sat, May 9, 2015 at 12:34 AM, Chris Murphy  wrote:
>> https://bugzilla.kernel.org/show_bug.cgi?id=97991
>
> OK it appears to be a 32bit kernel problem, as it happens in an x86_64
> qemu/kvm VM.
>
> But that same vm with the same version 64bit kernel doesn't exhibit the problem.

Sounds like it should have been fixed in rc3 by [1]; the same trace
was reported recently [2].

[1] https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=1d3c61c2eb3fe4f96d3192212f1bdcee49ea55aa
[2] http://www.spinics.net/lists/linux-btrfs/msg43754.html

>
> --
> Chris Murphy



-- 
Filipe David Manana,

"Reasonable men adapt themselves to the world.
 Unreasonable men adapt the world to themselves.
 That's why all progress depends on unreasonable men."


Re: [PATCH v2] btrfs: add missing discards when unpinning extents with -o discard

2015-05-08 Thread Filipe David Manana
 -   lock_chunks(block_group->fs_info->chunk_root);
> -   em_tree = &block_group->fs_info->mapping_tree.map_tree;
> -   write_lock(&em_tree->lock);
> -   em = lookup_extent_mapping(em_tree, block_group->key.objectid,
> -  1);
> -   BUG_ON(!em); /* logic error, can't happen */
> -   /*
> -* remove_extent_mapping() will delete us from the 
> pinned_chunks
> -* list, which is protected by the chunk mutex.
> -*/
> -   remove_extent_mapping(em_tree, em);
> -   write_unlock(&em_tree->lock);
> -   unlock_chunks(block_group->fs_info->chunk_root);
> -
> -   /* once for us and once for the tree */
> -   free_extent_map(em);
> -   free_extent_map(em);
> +   btrfs_cleanup_block_group_mapping(block_group);
>
> /*
>  * We've left one free space entry and other tasks trimming
> diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
> index 9e66f5e..016e65a 100644
> --- a/fs/btrfs/super.c
> +++ b/fs/btrfs/super.c
> @@ -69,7 +69,7 @@ static struct file_system_type btrfs_fs_type;
>
>  static int btrfs_remount(struct super_block *sb, int *flags, char *data);
>
> -static const char *btrfs_decode_error(int errno)
> +const char *btrfs_decode_error(int errno)
>  {
> char *errstr = "unknown";
>
> diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
> index 5628e25..2005262 100644
> --- a/fs/btrfs/transaction.c
> +++ b/fs/btrfs/transaction.c
> @@ -256,6 +256,8 @@ loop:
> mutex_init(&cur_trans->cache_write_mutex);
> cur_trans->num_dirty_bgs = 0;
> spin_lock_init(&cur_trans->dirty_bgs_lock);
> +   INIT_LIST_HEAD(&cur_trans->deleted_bgs);
> +   spin_lock_init(&cur_trans->deleted_bgs_lock);
> list_add_tail(&cur_trans->list, &fs_info->trans_list);
> extent_io_tree_init(&cur_trans->dirty_pages,
>  fs_info->btree_inode->i_mapping);
> diff --git a/fs/btrfs/transaction.h b/fs/btrfs/transaction.h
> index 0b24755..14325f2 100644
> --- a/fs/btrfs/transaction.h
> +++ b/fs/btrfs/transaction.h
> @@ -74,6 +74,8 @@ struct btrfs_transaction {
>  */
> struct mutex cache_write_mutex;
> spinlock_t dirty_bgs_lock;
> +   struct list_head deleted_bgs;
> +   spinlock_t deleted_bgs_lock;
> struct btrfs_delayed_ref_root delayed_refs;
> int aborted;
> int dirty_bg_run;
> --
> 1.8.5.6
>
>
> --
> Jeff Mahoney
> SUSE Labs



-- 
Filipe David Manana,

"Reasonable men adapt themselves to the world.
 Unreasonable men adapt the world to themselves.
 That's why all progress depends on unreasonable men."


Re: Tux3 Report: How fast can we fsync?

2015-04-30 Thread Filipe David Manana
On Thu, Apr 30, 2015 at 12:36 PM, Daniel Phillips  wrote:
> On 04/30/2015 04:14 AM, Filipe Manana wrote:
>>
>> On 04/30/2015 11:28 AM, Daniel Phillips wrote:
>>> It looks like Btrfs hit a bug, not a huge surprise. Btrfs hit an assert
>>> for me earlier this evening. It is rare but it happens.
>>
>> Hi Daniel,
>>
>> Would you mind reporting (to linux-btrfs@vger.kernel.org) the
>> bug/assertion you hit during your tests with btrfs?
>
> Kernel 3.19.0 under KVM with BTRFS mounted on a file in /tmp, see
> the KVM command below. I believe I was running the 10,000 task test
> using the "sync" program below: "syncs foo 10 1".
>
> 346 [ cut here ]
> 347 kernel BUG at fs/btrfs/extent_io.c:4548!
> 348 invalid opcode:  [#1] PREEMPT SMP
> 349 Modules linked in:
> 350 CPU: 2 PID: 5754 Comm: sync6 Not tainted 3.19.0-56544-g65cf1a5 #756
> 351 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 
> 01/01/2011
> 352 task: ec3c0ea0 ti: ec3ea000 task.ti: ec3ea000
> 353 EIP: 0060:[] EFLAGS: 00010202 CPU: 2
> 354 EIP is at btrfs_release_extent_buffer_page+0xf0/0x100
> 355 EAX: 0001 EBX: f47198f0 ECX:  EDX: 0001
> 356 ESI: f47198f0 EDI: f61f1808 EBP: ec3ebbac ESP: ec3ebb9c
> 357  DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068
> 358 CR0: 8005003b CR2: b756a356 CR3: 2c3ce000 CR4: 06d0
> 359 Stack:
> 360  0005 f47198f0 f61f1000 f61f1808 ec3ebbc0 c1301a7f f47198f0 
> 
> 361  f6a3d940 ec3ebbcc c1301ee5 d9a6c770 ec3ebbdc c12b436d fff92000 
> da136b20
> 362  ec3ebc74 c12e42b6 0c00   1000  
> 
> 363 Call Trace:
> 364  [] release_extent_buffer+0x3f/0xb0
> 365  [] free_extent_buffer+0x45/0x80
> 366  [] btrfs_release_path+0x2d/0x90
> 367  [] cow_file_range_inline+0x466/0x600
> 368  [] cow_file_range+0x50e/0x640
> 369  [] ? find_lock_delalloc_range.constprop.42+0x2e1/0x320
> 370  [] run_delalloc_range+0x419/0x450
> 371  [] writepage_delalloc.isra.32+0x14b/0x1d0
> 372  [] __extent_writepage+0xde/0x2b0
> 373  [] ? find_get_pages_tag+0xad/0x120
> 374  [] extent_writepages+0x29c/0x350
> 375  [] ? btrfs_direct_IO+0x300/0x300
> 376  [] btrfs_writepages+0x1f/0x30
> 377  [] do_writepages+0x15/0x40
> 378  [] __filemap_fdatawrite_range+0x4f/0x60
> 379  [] filemap_fdatawrite_range+0x22/0x30
> 380  [] btrfs_fdatawrite_range+0x28/0x70
> 381  [] start_ordered_ops+0x21/0x30
> 382  [] btrfs_sync_file+0x43/0x370
> 383  [] ? vfs_write+0x135/0x1c0
> 384  [] ? start_ordered_ops+0x30/0x30
> 385  [] do_fsync+0x47/0x70
> 386  [] SyS_fsync+0xd/0x10
> 387  [] syscall_call+0x7/0x7
> 388 Code: 8b 03 f6 c4 20 75 26 f0 80 63 01 f7 c7 43 1c 00 00 00 00 89 d8 
> e8 61 94 e2 ff eb c3 8d
> b4 26 00 00 00 00 83 c4 04 5b 5e 5f 5d c3 <0f> 0b 0f 0b 388 0f 0b 0f 0b 
> 90 8d b4 26 00 00 00 00
> 55 89 e5 57 56
> 389 EIP: [] btrfs_release_extent_buffer_page+0xf0/0x100 SS:ESP 
> 0068:ec3ebb9c
> 390 ---[ end trace 12b9bbe75d9541a3 ]---
>
> KVM command:
>
> mkfs.btrfs -f /tmp/disk.img && kvm -kernel 
> /src/linux-tux3/arch/x86/boot/bzImage -append
> "root=/dev/sda1 console=ttyS0 console=tty0 oops=panic tux3.tux3_trace=0" 
> -serial file:serial.txt
> -hda /more/kvm/hdd.img -hdb /tmp/disk.img -net nic -net 
> user,hostfwd=tcp::1234-:22 -smp 4 -m 2000

There's a very recent patch to fix that issue:
https://patchwork.kernel.org/patch/6261421/
It's not really specific to fsync.

Thanks for reporting it.

>
> Source code:
>
> /*
>  * syncs.c
>  *
>  * D.R. Phillips, 2015
>  *
>  * To build: c99 -Wall syncs.c -o syncs
>  * To run: ./syncs [<basename> [<steps> [<tasks>]]]
>  */
>
> #include <stdio.h>
> #include <stdlib.h>
> #include <unistd.h>
> #include <fcntl.h>
> #include <sys/stat.h>
> #include <sys/wait.h>
>
> char text[1024] = { "hello world!\n" };
>
> int main(int argc, const char *argv[]) {
> const char *basename = argc < 1 ? "foo" : argv[1];
> char name[100];
> int steps = argc < 3 ? 1 : atoi(argv[2]);
> int tasks = argc < 4 ? 1 : atoi(argv[3]);
> int err, fd;
>
> for (int t = 0; t < tasks; t++) {
> snprintf(name, sizeof name, "%s%i", basename, t);
> if (!fork())
> goto child;
> }
> for (int t = 0; t < tasks; t++)
>     wait(&err);
> return 0;
>
> child:
> fd = creat(name, S_IRWXU);
> for (int i = 0; i < steps; i++)

Re: [PATCH v3] Btrfs: btrfs_release_extent_buffer_page didn't free pages of dummy extent

2015-04-27 Thread Filipe David Manana
On Mon, Feb 9, 2015 at 9:31 AM, Forrest Liu  wrote:
> btrfs_release_extent_buffer_page() can't properly handle a dummy extent
> buffer allocated by btrfs_clone_extent_buffer(). That is because the
> reference count of pages allocated by btrfs_clone_extent_buffer() is 2:
> 1 from alloc_page(), and another from attach_extent_buffer_page().
>
> Running the following command repeatedly reproduces this memory leak:
>
> btrfs inspect-internal inode-resolve 256 /mnt/btrfs
>
> Signed-off-by: Chien-Kuan Yeh 
> Signed-off-by: Forrest Liu 

Reviewed-by: Filipe Manana 
Tested-by: Filipe Manana 
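
For anyone following the reference counting described in the changelog,
here is a small userspace model of it. It is a sketch only: struct
fake_page and the fake_*() helpers are hypothetical stand-ins, not kernel
APIs. fake_release_old() mimics the old behavior, where the private
reference was dropped only for mapped buffers, and fake_release() mimics
the patched behavior.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for struct page: a count and a private flag. */
struct fake_page {
	int refs;
	bool has_private;
};

/* Models alloc_page(): the allocation holds one reference. */
void fake_alloc(struct fake_page *p)
{
	p->refs = 1;
	p->has_private = false;
}

/* Models attach_extent_buffer_page(): page private takes a second one. */
void fake_attach(struct fake_page *p)
{
	p->has_private = true;
	p->refs++;
}

/* Old behavior: the private reference is dropped only when mapped. */
void fake_release_old(struct fake_page *p, bool mapped)
{
	if (mapped && p->has_private) {
		p->has_private = false;
		p->refs--;       /* one for the page private */
	}
	p->refs--;               /* one for when the page was allocated */
}

/* Patched behavior: drop the private reference whether mapped or not. */
void fake_release(struct fake_page *p)
{
	if (p->has_private) {
		p->has_private = false;
		p->refs--;       /* one for the page private */
	}
	p->refs--;               /* one for when the page was allocated */
}
```

For an unmapped (dummy) buffer the old path leaves the count at 1, which
is the leak; the patched path brings it down to 0.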

> ---
> V2: do not call PagePrivate if page is NULL
> V3: add reproducing step in commit message
>
>  fs/btrfs/extent_io.c | 51 ++-
>  1 file changed, 26 insertions(+), 25 deletions(-)
>
> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
> index 790dbae..9de93ee 100644
> --- a/fs/btrfs/extent_io.c
> +++ b/fs/btrfs/extent_io.c
> @@ -4554,36 +4554,37 @@ static void btrfs_release_extent_buffer_page(struct extent_buffer *eb)
> do {
> index--;
> page = eb->pages[index];
> -   if (page && mapped) {
> +   if (!page)
> +   continue;
> +   if (mapped)
> spin_lock(&page->mapping->private_lock);
> +   /*
> +* We do this since we'll remove the pages after we've
> +* removed the eb from the radix tree, so we could race
> +* and have this page now attached to the new eb.  So
> +* only clear page_private if it's still connected to
> +* this eb.
> +*/
> +   if (PagePrivate(page) &&
> +   page->private == (unsigned long)eb) {
> +   BUG_ON(test_bit(EXTENT_BUFFER_DIRTY, &eb->bflags));
> +   BUG_ON(PageDirty(page));
> +   BUG_ON(PageWriteback(page));
> /*
> -* We do this since we'll remove the pages after we've
> -* removed the eb from the radix tree, so we could 
> race
> -* and have this page now attached to the new eb.  So
> -* only clear page_private if it's still connected to
> -* this eb.
> +* We need to make sure we haven't be attached
> +* to a new eb.
>  */
> -   if (PagePrivate(page) &&
> -   page->private == (unsigned long)eb) {
> -   BUG_ON(test_bit(EXTENT_BUFFER_DIRTY, 
> &eb->bflags));
> -   BUG_ON(PageDirty(page));
> -   BUG_ON(PageWriteback(page));
> -   /*
> -* We need to make sure we haven't be attached
> -* to a new eb.
> -*/
> -   ClearPagePrivate(page);
> -   set_page_private(page, 0);
> -   /* One for the page private */
> -   page_cache_release(page);
> -   }
> -   spin_unlock(&page->mapping->private_lock);
> -
> -   }
> -   if (page) {
> -   /* One for when we alloced the page */
> +   ClearPagePrivate(page);
> +   set_page_private(page, 0);
> +   /* One for the page private */
> page_cache_release(page);
> }
> +
> +   if (mapped)
> +   spin_unlock(&page->mapping->private_lock);
> +
> +   /* One for when we alloced the page */
> +   page_cache_release(page);
> } while (index != 0);
>  }
>
> --
> 1.9.1
>



-- 
Filipe David Manana,

"Reasonable men adapt themselves to the world.
 Unreasonable men adapt the world to themselves.
 That's why all progress depends on unreasonable men."


Re: [PATCH 0/4] btrfs: reduce block group cache writeout times during commit

2015-04-25 Thread Filipe David Manana
On Fri, Apr 24, 2015 at 4:05 PM, Filipe David Manana  wrote:
> On Fri, Apr 24, 2015 at 2:55 PM, Chris Mason  wrote:
>> On 04/24/2015 09:43 AM, Filipe David Manana wrote:
>>> On Fri, Apr 24, 2015 at 2:00 PM, Chris Mason  wrote:
>>
>>>> Can you please bang on this and get a more reliable reproduction? I'll
>>>> take a look.
>>>
>>> Not really that easy to get a more reliable reproducer - just run
>>> fsstress with multiple processes - it already happened twice again
>>> after I sent the previous mail.
>>> From the quick look I had at this, this seems to be the change causing
>>> the problem:
>>>
>>> http://git.kernel.org/cgit/linux/kernel/git/mason/linux-btrfs.git/commit/?h=for-linus-4.1&id=1bbc621ef28462456131c035eaeb5567a1a2a2fe
>>>
>>> Early in btrfs_commit_transaction(), btrfs_start_dirty_block_groups()
>>> is called which ends up calling __btrfs_write_out_cache() for each
>>> dirty block group, which collects all the bitmap entries from the bg's
>>> space cache into a local list while holding the cache's ctl->tree_lock
>>> (to serialize with concurrent allocation requests).
>>>
>>> Then we unlock ctl->tree_lock, do other stuff and later acquire
>>> ctl->tree_lock again and call write_bitmap_entries() to write the
>>> bitmap entries we previously collected. However, while we were doing
>>> the other stuff without holding that lock, allocation requests might
>>> have happened right? - since when we call
>>> btrfs_start_dirty_block_groups() in btrfs_commit_transaction() the
>>> transaction state wasn't yet changed, allowing other tasks to join the
>>> current transaction. If such other task allocates all the remaining
>>> space from a bitmap entry we collected before (because it's still in
>>> the space cache's rbtree), it ends up deleting it and freeing its
>>> ->bitmap member, which results in an invalid memory access (and the
>>> warning on the list corruption) when we later call
>>> write_bitmap_entries() in __btrfs_write_out_cache() - which is what
>>> the second part of the trace I sent says:
>>
>> It's easy to hold the ctl->tree_lock from collection to write out, but
>> everyone deleting items is using list_del_init, so it should be fine to
>> take the lock again and run through any items that are left.
>
> Right, but free_bitmap() / unlink_free_space() free the bitmap entry
> without deleting it from the list (nor do their callers), which
> should be enough to cause such a list corruption report.
>
> I'll try the patch and see if I can get at least one successful entire
> run of xfstests.
> Thanks Chris.

So with the updated version of that last patch, found at [1], this
problem no longer happens, as expected.
However, I found 2 other problems, for which I've just sent fixes, plus
another one with invalid on-disk caches; I'm not sure whether that one is
related to the (not new) race you mentioned before when writing
block group caches. It happened once with generic/299 and fsck's
report was:

checking extents
checking free space cache
Wanted bytes 2883584, found 1310720 for off 4929859584
Wanted bytes 468209664, found 1310720 for off 4929859584
cache appears valid but isnt 4324327424
Wanted bytes 262144, found 131072 for off 5403049984
Wanted bytes 1068761088, found 131072 for off 5403049984
cache appears valid but isnt 5398069248
Wanted bytes 262144, found 131072 for off 6472204288
Wanted bytes 1073348608, found 131072 for off 6472204288
cache appears valid but isnt 6471811072
Wanted bytes 786432, found 655360 for off 7545552896
Wanted bytes 1073741824, found 655360 for off 7545552896
cache appears valid but isnt 7545552896
Wanted bytes 1048576, found 131072 for off 8621916160
Wanted bytes 1071120384, found 131072 for off 8621916160
cache appears valid but isnt 8619294720
There is no free space entry for 9693298688-9695395840
There is no free space entry for 9693298688-10766778368
cache appears valid but isnt 9693036544
There is no free space entry for 10769268736-10769661952
There is no free space entry for 10769268736-11840520192
cache appears valid but isnt 10766778368
Checking filesystem on /dev/sdc
UUID: abf30a2f-f784-4829-9131-86e20f13a8cf
found 788994098 bytes used err is -22
total csum bytes: 2586348
total tree bytes: 14954496
total fs tree bytes: 4702208
total extent tree bytes: 3817472
btree space waste bytes: 5422437
file data blocks allocated: 2651303936
 referenced 2651303936

[1] http://git.kernel.org/cgit/linux/kernel/git/mason/linux-btrfs.git/commit/?h=for-linus-4.1&id=a3bdccc4e683f0ac69230707ed3fa20e7cf73a79

>
>>
>> Here's a replacement i

Re: [PATCH 0/4] btrfs: reduce block group cache writeout times during commit

2015-04-24 Thread Filipe David Manana
On Fri, Apr 24, 2015 at 2:55 PM, Chris Mason  wrote:
> On 04/24/2015 09:43 AM, Filipe David Manana wrote:
>> On Fri, Apr 24, 2015 at 2:00 PM, Chris Mason  wrote:
>
>>> Can you please bang on this and get a more reliable reproduction? I'll
>>> take a look.
>>
>> Not really that easy to get a more reliable reproducer - just run
>> fsstress with multiple processes - it already happened twice again
>> after I sent the previous mail.
>> From the quick look I had at this, this seems to be the change causing
>> the problem:
>>
>> http://git.kernel.org/cgit/linux/kernel/git/mason/linux-btrfs.git/commit/?h=for-linus-4.1&id=1bbc621ef28462456131c035eaeb5567a1a2a2fe
>>
>> Early in btrfs_commit_transaction(), btrfs_start_dirty_block_groups()
>> is called which ends up calling __btrfs_write_out_cache() for each
>> dirty block group, which collects all the bitmap entries from the bg's
>> space cache into a local list while holding the cache's ctl->tree_lock
>> (to serialize with concurrent allocation requests).
>>
>> Then we unlock ctl->tree_lock, do other stuff and later acquire
>> ctl->tree_lock again and call write_bitmap_entries() to write the
>> bitmap entries we previously collected. However, while we were doing
>> the other stuff without holding that lock, allocation requests might
>> have happened right? - since when we call
>> btrfs_start_dirty_block_groups() in btrfs_commit_transaction() the
>> transaction state wasn't yet changed, allowing other tasks to join the
>> current transaction. If such other task allocates all the remaining
>> space from a bitmap entry we collected before (because it's still in
>> the space cache's rbtree), it ends up deleting it and freeing its
>> ->bitmap member, which results in an invalid memory access (and the
>> warning on the list corruption) when we later call
>> write_bitmap_entries() in __btrfs_write_out_cache() - which is what
>> the second part of the trace I sent says:
>
> It's easy to hold the ctl->tree_lock from collection write out, but
> everyone deleting items is using list_del_init, so it should be fine to
> take the lock again and run through any items that are left.

Right, but free_bitmap() / unlink_free_space() free the bitmap entry
without deleting it from the list (and their callers don't do it either),
which should be enough to cause such a list corruption report.
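
To illustrate why freeing an entry that is still linked produces that
kind of report, here is a minimal userspace sketch. It is not the btrfs
code: the list helpers only mimic the kernel's list API, and demo_entry,
demo_new_entry(), demo_free_entry(), list_count() and list_is_consistent()
are hypothetical names made up for this example. The consistency check
assumes the same invariant the list debug code verifies (each neighbour
must point back at the entry).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Minimal stand-in for the kernel's struct list_head (illustration only). */
struct list_head {
    struct list_head *next, *prev;
};

/* Hypothetical, simplified entry; not the real struct btrfs_free_space. */
struct demo_entry {
    struct list_head list;
    unsigned long *bitmap;
};

static void list_init(struct list_head *h)
{
    h->next = h;
    h->prev = h;
}

static void list_add_tail(struct list_head *item, struct list_head *head)
{
    item->prev = head->prev;
    item->next = head;
    head->prev->next = item;
    head->prev = item;
}

/* Same idea as list_del_init(): unlink, then make the item point at itself. */
static void list_del_init(struct list_head *item)
{
    item->prev->next = item->next;
    item->next->prev = item->prev;
    list_init(item);
}

/* Walk the list and verify that every neighbour points back correctly. */
static bool list_is_consistent(const struct list_head *head)
{
    const struct list_head *pos = head;

    do {
        if (pos->next->prev != pos || pos->prev->next != pos)
            return false;
        pos = pos->next;
    } while (pos != head);
    return true;
}

static size_t list_count(const struct list_head *head)
{
    const struct list_head *pos;
    size_t n = 0;

    for (pos = head->next; pos != head; pos = pos->next)
        n++;
    return n;
}

/* Allocate an entry with a bitmap and queue it on the given list. */
static struct demo_entry *demo_new_entry(struct list_head *head)
{
    struct demo_entry *e = calloc(1, sizeof(*e));

    if (!e)
        return NULL;
    e->bitmap = calloc(1, sizeof(*e->bitmap));
    if (!e->bitmap) {
        free(e);
        return NULL;
    }
    list_init(&e->list);
    list_add_tail(&e->list, head);
    return e;
}

/*
 * Safe teardown: unlink before freeing. Skipping the list_del_init()
 * call would leave the neighbours pointing into freed memory, which is
 * the situation described above.
 */
static void demo_free_entry(struct demo_entry *e)
{
    assert(e != NULL);
    list_del_init(&e->list);
    free(e->bitmap);
    free(e);
}
```

Because list_del_init() leaves the item pointing at itself, calling
demo_free_entry() is harmless whether or not the entry is currently on a
list, so a later walk over the collected list only sees entries that are
still alive.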

I'll try the patch and see if I can get at least one successful entire
run of xfstests.
Thanks Chris.

>
> Here's a replacement incremental that'll cover both cases:
>
>
> diff --git a/fs/btrfs/free-space-cache.c b/fs/btrfs/free-space-cache.c
> index d773f22..657a8ec 100644
> --- a/fs/btrfs/free-space-cache.c
> +++ b/fs/btrfs/free-space-cache.c
> @@ -1119,18 +1119,21 @@ static int flush_dirty_cache(struct inode *inode)
>  }
>
>  static void noinline_for_stack
> -cleanup_write_cache_enospc(struct inode *inode,
> +cleanup_write_cache_enospc(struct btrfs_free_space_ctl *ctl,
> +  struct inode *inode,
>struct btrfs_io_ctl *io_ctl,
>struct extent_state **cached_state,
>struct list_head *bitmap_list)
>  {
> struct list_head *pos, *n;
>
> +   spin_lock(&ctl->tree_lock);
> list_for_each_safe(pos, n, bitmap_list) {
> struct btrfs_free_space *entry =
> list_entry(pos, struct btrfs_free_space, list);
> list_del_init(&entry->list);
> }
> +   spin_unlock(&ctl->tree_lock);
> io_ctl_drop_pages(io_ctl);
> unlock_extent_cached(&BTRFS_I(inode)->io_tree, 0,
>  i_size_read(inode) - 1, cached_state,
> @@ -1266,8 +1269,8 @@ static int __btrfs_write_out_cache(struct btrfs_root *root, struct inode *inode,
> ret = write_cache_extent_entries(io_ctl, ctl,
>  block_group, &entries, &bitmaps,
>  &bitmap_list);
> -   spin_unlock(&ctl->tree_lock);
> if (ret) {
> +   spin_unlock(&ctl->tree_lock);
> mutex_unlock(&ctl->cache_writeout_mutex);
> goto out_nospc;
> }
> @@ -1282,6 +1285,7 @@ static int __btrfs_write_out_cache(struct btrfs_root *root, struct inode *inode,
>  */
> ret = write_pinned_extent_entries(root, block_group, io_ctl, &entries);
> if (ret) {
> +   spin_unlock(&ctl->tree_lock);
> mutex_

Re: [PATCH 0/4] btrfs: reduce block group cache writeout times during commit

2015-04-24 Thread Filipe David Manana
On Fri, Apr 24, 2015 at 2:00 PM, Chris Mason  wrote:
> On 04/24/2015 02:34 AM, Filipe David Manana wrote:
>> On Thu, Apr 23, 2015 at 8:50 PM, Chris Mason  wrote:
>>> On 04/23/2015 03:43 PM, Filipe David Manana wrote:
>>>> On Thu, Apr 23, 2015 at 4:48 PM, Filipe David Manana  
>>>> wrote:
>>>>> On Thu, Apr 23, 2015 at 4:17 PM, Chris Mason  wrote:
>>>>>> On Thu, Apr 23, 2015 at 02:05:48PM +0100, Filipe David Manana wrote:
>>>>>>>>> Trying the current integration-4.1 branch, I ran into the following
>>>>>>>>> during xfstests/btrfs/049:
>>>>>>>>>
>>>>>>>>
>>>>>>>> Ugh, I must not be waiting correctly in one of the inode cache writeout
>>>>>>>> sections.  But I've run 049 a whole bunch of times without triggering,
>>>>>>>> can you get this to happen consistently?
>>>>>>>
>>>>>>> All the time so far.
>>>>>>
>>>>>> I'm testing with this now:
>>>>>>
>>>>>> commit 9f433238891b1b243c4f19d3f36eed913b270cbc
>>>>>> Author: Chris Mason 
>>>>>> Date:   Thu Apr 23 08:02:49 2015 -0700
>>>>>>
>>>>>> Btrfs: fix inode cache writeout
>>>>>>
>>>>>> The code to fix stalls during free spache cache IO wasn't using
>>>>>> the correct root when waiting on the IO for inode caches.  This
>>>>>> is only a problem when the inode cache is enabled with
>>>>>>
>>>>>> mount -o inode_cache
>>>>>>
>>>>>> This fixes the inode cache writeout to preserve any error values and
>>>>>> makes sure not to override the root when inode cache writeout is 
>>>>>> done.
>>>>>>
>>>>>> Reported-by: Filipe Manana 
>>>>>> Signed-off-by: Chris Mason 
>>>>>
>>>>> Thanks, btrfs/049 now passes with that patch applied.
>>>>> Running the whole xfstests suite now.
>>>>
>>>> btrfs/066 also failed once during final fsck with:
>>>>
>>>> _check_btrfs_filesystem: filesystem on /dev/sdc is inconsistent
>>>> *** fsck.btrfs output ***
>>>> checking extents
>>>> checking free space cache
>>>> There is no free space entry for 21676032-21680128
>>>> There is no free space entry for 21676032-87031808
>>>> cache appears valid but isnt 20971520
>>>
>>> Josef has a btrfs-progs patch for this.  The kernel will toss the cache.
>>>  There's a somewhat fundamental race in cache writeout this patch makes
>>> a little bigger, but it has always been there.
>>>
>>> (compare what find_free_extent can do with no trans running vs the
>>> actual cache writeback)
>>
>> There's also one list corruption I didn't get before and happened
>> while running fsstress (btrfs/078), apparently due to some race:
>
> Can you please bang on this and get a more reliable reproduction? I'll
> take a look.

Not really that easy to get a more reliable reproducer - just run
fsstress with multiple processes - it already happened twice again
after I sent the previous mail.
From the quick look I had at this, this seems to be the change causing
the problem:

http://git.kernel.org/cgit/linux/kernel/git/mason/linux-btrfs.git/commit/?h=for-linus-4.1&id=1bbc621ef28462456131c035eaeb5567a1a2a2fe

Early in btrfs_commit_transaction(), btrfs_start_dirty_block_groups()
is called which ends up calling __btrfs_write_out_cache() for each
dirty block group, which collects all the bitmap entries from the bg's
space cache into a local list while holding the cache's ctl->tree_lock
(to serialize with concurrent allocation requests).

Then we unlock ctl->tree_lock, do other stuff and later acquire
ctl->tree_lock again and call write_bitmap_entries() to write the
bitmap entries we previously collected. However, while we were doing
the other stuff without holding that lock, allocation requests might
have happened right? - since when we call
btrfs_start_dirty_block_groups() in btrfs_commit_transaction() the
transaction state wasn't yet changed, allowing other tasks to join the
current transaction. If such other task allocates all the remaining
space from a bitmap entry we collected before (because it's still in
the space cache's rbtree), it ends up deleting it and freeing its
->bitmap member, which result

Re: [PATCH 0/4] btrfs: reduce block group cache writeout times during commit

2015-04-23 Thread Filipe David Manana
On Thu, Apr 23, 2015 at 8:50 PM, Chris Mason  wrote:
> On 04/23/2015 03:43 PM, Filipe David Manana wrote:
>> On Thu, Apr 23, 2015 at 4:48 PM, Filipe David Manana  
>> wrote:
>>> On Thu, Apr 23, 2015 at 4:17 PM, Chris Mason  wrote:
>>>> On Thu, Apr 23, 2015 at 02:05:48PM +0100, Filipe David Manana wrote:
>>>>>>> Trying the current integration-4.1 branch, I ran into the following
>>>>>>> during xfstests/btrfs/049:
>>>>>>>
>>>>>>
>>>>>> Ugh, I must not be waiting correctly in one of the inode cache writeout
>>>>>> sections.  But I've run 049 a whole bunch of times without triggering,
>>>>>> can you get this to happen consistently?
>>>>>
>>>>> All the time so far.
>>>>
>>>> I'm testing with this now:
>>>>
>>>> commit 9f433238891b1b243c4f19d3f36eed913b270cbc
>>>> Author: Chris Mason 
>>>> Date:   Thu Apr 23 08:02:49 2015 -0700
>>>>
>>>> Btrfs: fix inode cache writeout
>>>>
>>>> The code to fix stalls during free spache cache IO wasn't using
>>>> the correct root when waiting on the IO for inode caches.  This
>>>> is only a problem when the inode cache is enabled with
>>>>
>>>> mount -o inode_cache
>>>>
>>>> This fixes the inode cache writeout to preserve any error values and
>>>> makes sure not to override the root when inode cache writeout is done.
>>>>
>>>> Reported-by: Filipe Manana 
>>>> Signed-off-by: Chris Mason 
>>>
>>> Thanks, btrfs/049 now passes with that patch applied.
>>> Running the whole xfstests suite now.
>>
>> btrfs/066 also failed once during final fsck with:
>>
>> _check_btrfs_filesystem: filesystem on /dev/sdc is inconsistent
>> *** fsck.btrfs output ***
>> checking extents
>> checking free space cache
>> There is no free space entry for 21676032-21680128
>> There is no free space entry for 21676032-87031808
>> cache appears valid but isnt 20971520
>
> Josef has a btrfs-progs patch for this.  The kernel will toss the cache.
>  There's a somewhat fundamental race in cache writeout this patch makes
> a little bigger, but it has always been there.
>
> (compare what find_free_extent can do with no trans running vs the
> actual cache writeback)

There's also one list corruption I didn't get before and happened
while running fsstress (btrfs/078), apparently due to some race:

[25590.799058] ------------[ cut here ]------------
[25590.800204] WARNING: CPU: 3 PID: 7280 at lib/list_debug.c:62
__list_del_entry+0x5a/0x98()
[25590.802101] list_del corruption. next->prev should be
8801a0f74d50, but was a56b6b6b6b6b6b6b
[25590.804236] Modules linked in: btrfs dm_flakey dm_mod
crc32c_generic xor raid6_pq nfsd auth_rpcgss oid_registry nfs_acl nfs
lockd grace fscache sunrpc loop fuse i2c_piix4 i2c_core psmouse
serio_raw evdev parport_pc parport acpi_cpufreq processor button
pcspkr thermal_sys microcode ext4 crc16 jbd2 mbcache sd_mod sg sr_mod
cdrom virtio_scsi ata_generic virtio_pci virtio_ring ata_piix e1000
virtio libata floppy scsi_mod [last unloaded: btrfs]
[25590.818580] CPU: 3 PID: 7280 Comm: fsstress Tainted: GW
  4.0.0-rc5-btrfs-next-9+ #1
[25590.820597] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
BIOS rel-1.7.5-0-ge51488c-20140602_164612-nilsson.home.kraxel.org
04/01/2014
[25590.823458]  0009 8803f031bc08 8142fa46
8108b6a2
[25590.825081]  8803f031bc58 8803f031bc48 81045ea5
0011
[25590.826568]  81245af7 8801a0f74d50 8801a0f74460
880041710df0
[25590.828106] Call Trace:
[25590.828630]  [] dump_stack+0x4f/0x7b
[25590.829706]  [] ? console_unlock+0x361/0x3ad
[25590.830785]  [] warn_slowpath_common+0xa1/0xbb
[25590.831957]  [] ? __list_del_entry+0x5a/0x98
[25590.867473]  [] warn_slowpath_fmt+0x46/0x48
[25590.868631]  [] ? btrfs_csum_data+0x16/0x18 [btrfs]
[25590.869524]  [] __list_del_entry+0x5a/0x98
[25590.870918]  [] write_bitmap_entries+0x99/0xbd [btrfs]
[25590.872377]  [] ?
__btrfs_write_out_cache.isra.21+0x20b/0x3a1 [btrfs]
[25590.874079]  []
__btrfs_write_out_cache.isra.21+0x217/0x3a1 [btrfs]
[25590.875594]  [] ? btrfs_write_out_cache+0x41/0xdc [btrfs]
[25590.877032]  [] btrfs_write_out_cache+0x93/0xdc [btrfs]
[25590.878406]  [] ?
btrfs_start_dirty_block_groups+0x156/0x29b [btrfs]
[25590.879859]  []
btrfs_start_dirty_block_groups+0x1e6/0x29b [btrfs]
[25590.881360]  []
btrfs_commit_transaction+0x130/0x9c9 [btrfs]
[25590.882504]  [] btrfs_sync_fs+0xe1/0x12d [btrfs]
[25590.883600]  []

Re: [PATCH 0/4] btrfs: reduce block group cache writeout times during commit

2015-04-23 Thread Filipe David Manana
On Thu, Apr 23, 2015 at 4:48 PM, Filipe David Manana  wrote:
> On Thu, Apr 23, 2015 at 4:17 PM, Chris Mason  wrote:
>> On Thu, Apr 23, 2015 at 02:05:48PM +0100, Filipe David Manana wrote:
>>> >> Trying the current integration-4.1 branch, I ran into the following
>>> >> during xfstests/btrfs/049:
>>> >>
>>> >
>>> > Ugh, I must not be waiting correctly in one of the inode cache writeout
>>> > sections.  But I've run 049 a whole bunch of times without triggering,
>>> > can you get this to happen consistently?
>>>
>>> All the time so far.
>>
>> I'm testing with this now:
>>
>> commit 9f433238891b1b243c4f19d3f36eed913b270cbc
>> Author: Chris Mason 
>> Date:   Thu Apr 23 08:02:49 2015 -0700
>>
>> Btrfs: fix inode cache writeout
>>
>> The code to fix stalls during free spache cache IO wasn't using
>> the correct root when waiting on the IO for inode caches.  This
>> is only a problem when the inode cache is enabled with
>>
>> mount -o inode_cache
>>
>> This fixes the inode cache writeout to preserve any error values and
>> makes sure not to override the root when inode cache writeout is done.
>>
>> Reported-by: Filipe Manana 
>> Signed-off-by: Chris Mason 
>
> Thanks, btrfs/049 now passes with that patch applied.
> Running the whole xfstests suite now.

btrfs/066 also failed once during final fsck with:

_check_btrfs_filesystem: filesystem on /dev/sdc is inconsistent
*** fsck.btrfs output ***
checking extents
checking free space cache
There is no free space entry for 21676032-21680128
There is no free space entry for 21676032-87031808
cache appears valid but isnt 20971520
Checking filesystem on /dev/sdc
UUID: f7785aa7-d5ba-479d-a211-7c31039dc9b1
found 11911316 bytes used err is -22
total csum bytes: 7656
total tree bytes: 454656
total fs tree bytes: 376832
total extent tree bytes: 36864
btree space waste bytes: 122959
file data blocks allocated: 42893312
 referenced 31158272

(it failed like that 1 out of 4 runs)


>
>>
>> diff --git a/fs/btrfs/free-space-cache.c b/fs/btrfs/free-space-cache.c
>> index 5a4f5d1..8cd797f 100644
>> --- a/fs/btrfs/free-space-cache.c
>> +++ b/fs/btrfs/free-space-cache.c
>> @@ -1149,7 +1149,8 @@ int btrfs_wait_cache_io(struct btrfs_root *root,
>> if (!inode)
>> return 0;
>>
>> -   root = root->fs_info->tree_root;
>> +   if (block_group)
>> +   root = root->fs_info->tree_root;
>>
>> /* Flush the dirty pages in the cache file. */
>> ret = flush_dirty_cache(inode);
>> @@ -3465,9 +3466,12 @@ int btrfs_write_out_ino_cache(struct btrfs_root *root,
>> if (!btrfs_test_opt(root, INODE_MAP_CACHE))
>> return 0;
>>
>> +   memset(&io_ctl, 0, sizeof(io_ctl));
>> ret = __btrfs_write_out_cache(root, inode, ctl, NULL, &io_ctl,
>> -             trans, path, 0) ||
>> -   btrfs_wait_cache_io(root, trans, NULL, &io_ctl, path, 0);
>> + trans, path, 0);
>> +   if (!ret)
>> +   ret = btrfs_wait_cache_io(root, trans, NULL, &io_ctl, path, 0);
>> +
>> if (ret) {
>> btrfs_delalloc_release_metadata(inode, inode->i_size);
>>  #ifdef DEBUG
>
>
>
> --
> Filipe David Manana,
>
> "Reasonable men adapt themselves to the world.
>  Unreasonable men adapt the world to themselves.
>  That's why all progress depends on unreasonable men."



-- 
Filipe David Manana,

"Reasonable men adapt themselves to the world.
 Unreasonable men adapt the world to themselves.
 That's why all progress depends on unreasonable men."
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/4] btrfs: reduce block group cache writeout times during commit

2015-04-23 Thread Filipe David Manana
On Thu, Apr 23, 2015 at 4:17 PM, Chris Mason  wrote:
> On Thu, Apr 23, 2015 at 02:05:48PM +0100, Filipe David Manana wrote:
>> >> Trying the current integration-4.1 branch, I ran into the following
>> >> during xfstests/btrfs/049:
>> >>
>> >
>> > Ugh, I must not be waiting correctly in one of the inode cache writeout
>> > sections.  But I've run 049 a whole bunch of times without triggering,
>> > can you get this to happen consistently?
>>
>> All the time so far.
>
> I'm testing with this now:
>
> commit 9f433238891b1b243c4f19d3f36eed913b270cbc
> Author: Chris Mason 
> Date:   Thu Apr 23 08:02:49 2015 -0700
>
> Btrfs: fix inode cache writeout
>
> The code to fix stalls during free spache cache IO wasn't using
> the correct root when waiting on the IO for inode caches.  This
> is only a problem when the inode cache is enabled with
>
> mount -o inode_cache
>
> This fixes the inode cache writeout to preserve any error values and
> makes sure not to override the root when inode cache writeout is done.
>
> Reported-by: Filipe Manana 
> Signed-off-by: Chris Mason 

Thanks, btrfs/049 now passes with that patch applied.
Running the whole xfstests suite now.
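
As a side note on the "preserve any error values" part of that commit
message: chaining the two calls with || turns any non-zero error into 1,
so the caller loses the errno. The sketch below shows the difference in
plain userspace C. The demo_* functions are hypothetical stand-ins for
the write out and wait steps, not the real btrfs functions.

```c
#include <errno.h>

/* Hypothetical stand-ins: each simply returns the error it is given. */
static int demo_write_out(int err)
{
    return err;
}

static int demo_wait_io(int err)
{
    return err;
}

/* Old pattern: logical OR collapses any non-zero error code to 1. */
static int demo_combined_or(int write_err, int wait_err)
{
    return demo_write_out(write_err) || demo_wait_io(wait_err);
}

/* New pattern: keep the first error, and only wait if write out succeeded. */
static int demo_combined_preserve(int write_err, int wait_err)
{
    int ret = demo_write_out(write_err);

    if (!ret)
        ret = demo_wait_io(wait_err);
    return ret;
}
```

With the first form a failure such as -ENOSPC reaches the caller as 1,
while the second form returns -ENOSPC unchanged, which is what the patch
switches to.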

>
> diff --git a/fs/btrfs/free-space-cache.c b/fs/btrfs/free-space-cache.c
> index 5a4f5d1..8cd797f 100644
> --- a/fs/btrfs/free-space-cache.c
> +++ b/fs/btrfs/free-space-cache.c
> @@ -1149,7 +1149,8 @@ int btrfs_wait_cache_io(struct btrfs_root *root,
> if (!inode)
> return 0;
>
> -   root = root->fs_info->tree_root;
> +   if (block_group)
> +   root = root->fs_info->tree_root;
>
> /* Flush the dirty pages in the cache file. */
> ret = flush_dirty_cache(inode);
> @@ -3465,9 +3466,12 @@ int btrfs_write_out_ino_cache(struct btrfs_root *root,
> if (!btrfs_test_opt(root, INODE_MAP_CACHE))
> return 0;
>
> +   memset(&io_ctl, 0, sizeof(io_ctl));
> ret = __btrfs_write_out_cache(root, inode, ctl, NULL, &io_ctl,
> - trans, path, 0) ||
> -   btrfs_wait_cache_io(root, trans, NULL, &io_ctl, path, 0);
> +         trans, path, 0);
> +   if (!ret)
> +   ret = btrfs_wait_cache_io(root, trans, NULL, &io_ctl, path, 0);
> +
> if (ret) {
> btrfs_delalloc_release_metadata(inode, inode->i_size);
>  #ifdef DEBUG





Re: [PATCH 0/4] btrfs: reduce block group cache writeout times during commit

2015-04-23 Thread Filipe David Manana
On Thu, Apr 23, 2015 at 1:52 PM, Chris Mason  wrote:
> On 04/23/2015 08:45 AM, Filipe David Manana wrote:
>> On Wed, Apr 22, 2015 at 5:55 PM, Chris Mason  wrote:
>>> On 04/22/2015 12:37 PM, Holger Hoffstätte wrote:
>>>> On Wed, 22 Apr 2015 18:09:18 +0200, Lutz Vieweg wrote:
>>>>
>>>>> On 04/13/2015 09:52 PM, Chris Mason wrote:
>>>>>> Large filesystems with lots of block groups can suffer long stalls during
>>>>>> commit while we create and send down all of the block group caches.  The
>>>>>> more blocks groups dirtied in a transaction, the longer these stalls can 
>>>>>> be.
>>>>>> Some workloads average 10 seconds per commit, but see peak times much 
>>>>>> higher.
>>>>>
>>>>> Since we see this problem very frequently on some shared development 
>>>>> servers,
>>>>> I will try to install this ASAP.
>>>>>
>>>>> Meanwhile, can anybody already tell success stories about successfully 
>>>>> removing
>>>>> lags by this patch?
>>>>
>>>> Works fine, but make sure to get the followup patch [1] as well while 
>>>> you're
>>>> at it. I've observed that my (bandwidth-throttled) backups now cause 
>>>> shorter,
>>>> nicely spaced-out blips of activity instead of longer ones when the 
>>>> writeback
>>>> kicks in.
>>>>
>>>> -h
>>>>
>>>> [1] 
>>>> https://git.kernel.org/cgit/linux/kernel/git/mason/linux-btrfs.git/commit/?h=integration-4.1&id=c1e31ffc317e4c28d242b1d961c9c6fe673c0377
>>>>
>>>
>>> Great to hear.  I recommend just using my for-linus-4.1 branch, since it
>>> has all the good things  in one place.
>>
>> Trying the current integration-4.1 branch, I ran into the following
>> during xfstests/btrfs/049:
>>
>
> Ugh, I must not be waiting correctly in one of the inode cache writeout
> sections.  But I've run 049 a whole bunch of times without triggering,
> can you get this to happen consistently?

All the time so far.

>
> -chris
>





Re: [PATCH 0/4] btrfs: reduce block group cache writeout times during commit

2015-04-23 Thread Filipe David Manana
]
[ 1705.070862]  []
btrfs_delalloc_release_metadata+0x54/0xe6 [btrfs]
[ 1705.072372]  [] btrfs_write_out_ino_cache+0x82/0x97 [btrfs]
[ 1705.073821]  [] btrfs_save_ino_cache+0x275/0x2dc [btrfs]
[ 1705.075150]  [] commit_fs_roots.isra.13+0xaa/0x137 [btrfs]
[ 1705.076593]  [] ? trace_hardirqs_on+0xd/0xf
[ 1705.077830]  [] ?
btrfs_commit_transaction+0x4bb/0x9d3 [btrfs]
[ 1705.079381]  [] ? _raw_spin_unlock+0x28/0x33
[ 1705.080588]  []
btrfs_commit_transaction+0x4ca/0x9d3 [btrfs]
[ 1705.082148]  [] ? trace_hardirqs_on+0xd/0xf
[ 1705.083298]  [] btrfs_sync_file+0x307/0x367 [btrfs]
[ 1705.084667]  [] vfs_fsync_range+0x95/0xa4
[ 1705.086187]  [] ? retint_swapgs+0xe/0x44
[ 1705.087289]  [] vfs_fsync+0x1c/0x1e
[ 1705.088389]  [] do_fsync+0x34/0x4e
[ 1705.089438]  [] SyS_fsync+0x10/0x14
[ 1705.092118]  [] system_call_fastpath+0x12/0x17
[ 1705.093373] note: xfs_io[3645] exited with preempt_count 1
[ 1946.983579] kmemleak: 1 new suspected memory leaks (see
/sys/kernel/debug/kmemleak)
[ 2566.080608] kmemleak: 1 new suspected memory leaks (see
/sys/kernel/debug/kmemleak)


>
> -chris
>





Re: 3.18.11 - "no space left on device" and 'fi usage' shows lots

2015-04-20 Thread Filipe David Manana
rt rpcsec_gss_krb5 nfsd auth_rpcgss
> nfs_acl nfs lockd grace sunrpc fscache btrfs raid10 raid456
> async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq igb
> isci i2c_algo_bit raid1 hid_generic dca raid0 usbhid ptp ses libsas ahci
> multipath enclosure hid libahci pps_core scsi_transport_sas megaraid_sas
> linear
> [12996.697253] CPU: 5 PID: 10458 Comm: btrfs-cleaner Tainted: G
> C 3.18.11-031811-generic #201504041535
> [12996.701338] Hardware name: Intel Corporation S2600CP/S2600CP, BIOS
> SE5C600.86B.02.03.0003.041920141333 04/19/2014
> [12996.705495] task: 881fdaafda00 ti: 881a49df4000 task.ti:
> 881a49df4000
> [12996.708514] RIP: 0010:[]  []
> btrfs_orphan_add+0x1a9/0x1c0 [btrfs]
> [12996.712306] RSP: 0018:881a49df7c98  EFLAGS: 00010286
> [12996.714429] RAX: ffe4 RBX: 880fd75f8000 RCX:
> 
> [12996.717308] RDX: 2b12 RSI: 0004 RDI:
> 880f5a51c138
> [12996.791256] RBP: 881a49df7cd8 R08: e8fffee20850 R09:
> 881aa38d5d40
> [12996.866007] R10:  R11: 0010 R12:
> 881fe4608dc0
> [12996.941445] R13: 881c1f61d790 R14: 880fd75f8458 R15:
> 0001
> [12997.016606] FS:  () GS:881ffe62()
> knlGS:
> [12997.163897] CS:  0010 DS:  ES:  CR0: 80050033
> [12997.238138] CR2: 0128d008 CR3: 01c16000 CR4:
> 001407e0
> [12997.312295] Stack:
> [12997.383946]  881a49df7cd8 c0375e0f 881fd4b35800
> 880080ad8200
> [12997.528170]  881fd4b35800 881aa38d5d40 881fe4608dc0
> 0001
> [12997.672790]  881a49df7d58 c031f2c0 880f5a51c000
> 0004c0305ffa
> [12997.816858] Call Trace:
> [12997.886473]  [] ?
> lookup_free_space_inode+0x4f/0x100 [btrfs]
> [12998.025910]  []
> btrfs_remove_block_group+0x140/0x490 [btrfs]
> [12998.166112]  [] btrfs_remove_chunk+0x245/0x380
> [btrfs]
> [12998.238039]  [] btrfs_delete_unused_bgs+0x236/0x270
> [btrfs]
> [12998.309001]  [] cleaner_kthread+0x12c/0x190 [btrfs]
> [12998.378869]  [] ?
> btree_readpage_end_io_hook+0x2c0/0x2c0 [btrfs]
> [12998.514511]  [] kthread+0xc9/0xe0
> [12998.581913]  [] ? flush_kthread_worker+0x90/0x90
> [12998.648486]  [] ret_from_fork+0x58/0x90
> [12998.714302]  [] ? flush_kthread_worker+0x90/0x90
>
>
> All data seems to be intact, but the system is unusable due to the frequent
> crashes. Does anyone have any suggestions on how to proceed? I've tried a
> balance (crashed after a long time), scrub (no errors), and fsck to no
> avail.

In addition to what Tsutomu replied to you regarding the fixes for
ENOSPC, that particular crash/BUG_ON was fixed in kernel 3.19 by the
following patch (that didn't get backported to 3.18 or other older
releases):

https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=3d84be799194147e04c0e3129ed44a948773b80a


>
> Thanks for any help!
> -Joel





Re: [PATCH v2] Btrfs: fix data loss after concurrent fsyncs for files in the same subvol

2015-04-17 Thread Filipe David Manana
On Fri, Apr 17, 2015 at 7:26 PM, Josef Bacik  wrote:
> On 04/17/2015 02:20 PM, Filipe Manana wrote:
>>
>> If we have concurrent fsync calls against files living in the same
>> subvolume,
>> we have some time window where we don't add the collected ordered extents
>> to the running transaction's list of ordered extents and return success to
>> userspace. This can result in data loss if the ordered extents complete
>> after
>> the current transaction commits and a power failure happens after the
>> current
>> transaction commits and before the next one commits.
>>
>> A sequence of steps that lead to this:
>>
>>          CPU 0                                        CPU 1
>>
>> btrfs_sync_file(inode A)                     btrfs_sync_file(inode B)
>>   btrfs_log_inode_parent()                     btrfs_log_inode_parent()
>>
>>     start_log_trans()
>>       lock root->log_mutex
>>       ctx->log_transid = root->log_transid = N
>>       unlock root->log_mutex
>>
>>                                                  start_log_trans()
>>                                                    lock root->log_mutex
>>                                                    ctx->log_transid = root->log_transid = N
>>                                                    unlock root->log_mutex
>>
>>     btrfs_log_inode()                            btrfs_log_inode()
>>       btrfs_get_logged_extents()                   btrfs_get_logged_extents()
>>         --> gets ordered extent A                    --> gets ordered extent B
>>             into local list logged_list                  into local list logged_list
>>       write items into the log tree                write items into the log tree
>>       btrfs_submit_logged_extents(&logged_list)
>>         --> splices logged_list into
>>             log_root->logged_list[N % 2]
>>             (N == log_root->log_transid)
>>
>>     btrfs_sync_log()
>>       lock root->log_mutex
>>
>>       atomic_set(&root->log_commit[N % 2], 1)
>>         (N == ctx->log_transid)
>
>
> Except this can't happen, we have a wait_for_writer() in between here that
> will wait for CPU 1 to finish doing its logging since it has already done
> its start_log_trans().  Thanks,

Right, totally forgot that.
Thanks for pointing it out Josef.

>
> Josef
>





Re: [PATCH v2] Btrfs: incremental send, don't rename a directory too soon

2015-04-14 Thread Filipe David Manana
On Tue, Apr 14, 2015 at 12:09 PM, Robbie Ko  wrote:
> Hi,
>
> Sorry for not making it clear.
>
> 2015-04-14 16:16 GMT+08:00 Filipe David Manana :
>> On Tue, Apr 14, 2015 at 8:33 AM, Robbie Ko  wrote:
>>> Hi,
>>>
>>> After applying the patch, I got WARN_ON.
>>> btrfs progs finished without any error message,
>>> but received subvolume is not the same as send subvolume.
>>>
>>> Here's the related information.
>>> thanks,
>>> robbieko
>>>
>>> uname -a
>>> Linux ubuntu 4.0.0-rc4-custom #2 SMP Tue Apr 14 11:43:00 CST 2015
>>> x86_64 x86_64 x86_64 GNU/Linux
>>> btrfs --version
>>> Btrfs v3.14.1
>>>
>>> Steps to reproduce:
>>>
>>>  $ mkfs.btrfs -f /dev/sdb
>>>  $ mount /dev/sdb /mnt
>>>  $ mkfs.btrfs -f /dev/sdc
>>>  $ mount /dev/sdc /mnt2
>>>
>>> $ mkdir -p /mnt/data
>>> $ mkdir -p /mnt/data/n1/n2
>>> $ mkdir -p /mnt/data/n4
>>> $ mkdir -p /mnt/data/n1/n2/p1
>>> $ mkdir -p /mnt/data/t4
>>> $ mkdir -p /mnt/data/p1
>>> $ mkdir -p /mnt/data/p1/2
>>>
>>>   $ btrfs subvolume snapshot -r /mnt /mnt/snap1
>>>
>>> $ mv /mnt/data/n1/n2 /mnt/data/t4
>>> $ mv /mnt/data/n4 /mnt/data/t4/n2
>>> $ mv /mnt/data/t4/n2/p1 /mnt/data/t4/p1
>>> $ mv /mnt/data/p1 /mnt/data/t4/n2
>>>
>>>   $ btrfs subvolume snapshot -r /mnt /mnt/snap2
>>>
>>>   $ btrfs send /mnt/snap1 | btrfs receive /mnt2
>>>   $ btrfs send -p /mnt/snap1 /mnt/snap2 | btrfs receive /mnt2
>>
>> So this is a new case, different from the ones you've sent before, isn't it?
>>
>> You should have all previous patches applied too, not just this one
>> you're replying to.
>
> Hi,
>
> I have applied all the patches fixed recently.
> Then WARN_ON happened with steps mentioned above.
> I tested it without these patches, no WARN_ON but the following error
> appeared instead.
> ERROR: rename data/t4/n2/p1 -> data/t4/n2/p1/p1 failed. Invalid argument
>
> I started to revert these patches and found that this patch causes the
> WARN_ON problem.
>
> I'm not sure whether it's a new case.

So it's a case that worked neither before nor after all the recent
fixes, but for different reasons.
I have 2 cases here, one triggered by your fuzz tester script and
another one I've known about for quite some time (involving creation of
new directories and removal of old ones in the second snapshot), but I
haven't had the time to find a solution that doesn't break other cases
that currently work (and have xfstests). However, I haven't checked
whether your reproducer fails for the same reasons as those 2 cases.

thanks

>
> thanks,
> robbieko
>
>> Also, it isn't clear, are you saying this happens only with this
>> particular patch applied but doesn't happen without it (and all other
>> recent ones)?
>>
>> thanks
>>
>>
>>>
>>> Call trace message
>>>
>>> [  135.498533] ------------[ cut here ]------------
>>>
>>> [  135.498557] WARNING: CPU: 1 PID: 2346 at fs/btrfs/send.c:5934
>>> btrfs_ioctl_send+0xc4c/0x11e0 [btrfs]()
>>>
>>> [  135.498560] Modules linked in: nf_conntrack_ipv4(E)
>>> nf_defrag_ipv4(E) xt_conntrack(E) nf_conntrack(E) ipt_REJECT(E)
>>> nf_reject_ipv4(E) xt_tcpudp(E) iptable_filter(E) ip_tables(E)
>>> x_tables(E) bridge(E) stp(E) llc(E) snd_intel8x0(E) snd_ac97_codec(E)
>>> ac97_bus(E) snd_pcm(E) snd_timer(E) snd(E) iosf_mbi(E) soundcore(E)
>>> ppdev(E) joydev(E) lp(E) serio_raw(E) parport_pc(E) i2c_piix4(E)
>>> mac_hid(E) parport(E) btrfs(E) xor(E) raid6_pq(E) hid_generic(E)
>>> usbhid(E) hid(E) ahci(E) psmouse(E) libahci(E) e1000(E) pata_acpi(E)
>>>
>>> [  135.498578] CPU: 1 PID: 2346 Comm: btrfs Tainted: G E
>>> 4.0.0-rc4-custom #3
>>>
>>> [  135.498580] Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS
>>> VirtualBox 12/01/2006
>>>
>>> [  135.498583]  c0509016 88007a233c08 817b62f3
>>> 0007
>>>
>>> [  135.498586]   88007a233c48 8107452a
>>> c9567000
>>>
>>> [  135.498590]  8800799f1400 8800799f1418 88007b3f82d0
>>> 880079f6
>>>
>>> [  135.498593] Call Trace:
>>>
>>> [  135.498602]  [] dump_stack+0x45/0x57
>>>
>>> [  135.498609]  [] warn_slowpath_common+0x8a/0xc0
>>>

Re: [PATCH] xfstests: make "BTRFS_UTIL_PROG filesystem defragment" work

2015-04-14 Thread Filipe David Manana
On Tue, Apr 14, 2015 at 10:01 AM, Liu Bo  wrote:
> _require_defrag() needs to check if the command is executable, but btrfs has
> its subcommand "filesystem defragment", which makes this checking fail.
>
> This works around it and now we can run case generic/324, generic/018, 
> btrfs/005.

There's already a patch from Zhao to fix the regression:

https://patchwork.kernel.org/patch/6205031/

thanks

>
> Signed-off-by: Liu Bo 
> ---
>  common/defrag | 6 +-
>  1 file changed, 5 insertions(+), 1 deletion(-)
>
> diff --git a/common/defrag b/common/defrag
> index f923dc0..f36a68b 100644
> --- a/common/defrag
> +++ b/common/defrag
> @@ -37,7 +37,11 @@ _require_defrag()
> ;;
>  esac
>
> -_require_command "$DEFRAG_PROG" defragment
> +if [ "$FSTYP" == "btrfs" ]; then
> +   _require_command "$BTRFS_UTIL_PROG" defragment
> +else
> +   _require_command "$DEFRAG_PROG" defragment
> +fi
>  _require_xfs_io_command "fiemap"
>  }
>
> --
> 1.8.2.1
>
> --
> To unsubscribe from this list: send the line "unsubscribe fstests" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html



-- 
Filipe David Manana,

"Reasonable men adapt themselves to the world.
 Unreasonable men adapt the world to themselves.
 That's why all progress depends on unreasonable men."
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2] Btrfs: incremental send, don't rename a directory too soon

2015-04-14 Thread Filipe David Manana
> [  135.498900] CPU: 1 PID: 2346 Comm: btrfs Tainted: G W   E
> 4.0.0-rc4-custom #3
>
> [  135.498903] Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS
> VirtualBox 12/01/2006
>
> [  135.498910]  c0509016 88007a233c08 817b62f3
> 4c724c72
>
> [  135.498918]   88007a233c48 8107452a
> 88007a233c38
>
> [  135.498923]  8800799f1400 880078d52cc0 880078d52cd8
> 8800799f15d8
>
> [  135.498927] Call Trace:
>
> [  135.498937]  [] dump_stack+0x45/0x57
>
> [  135.498945]  [] warn_slowpath_common+0x8a/0xc0
>
> [  135.498951]  [] warn_slowpath_null+0x1a/0x20
>
> [  135.498975]  [] btrfs_ioctl_send+0x28f/0x11e0 [btrfs]
>
> [  135.498984]  [] ? __alloc_pages_nodemask+0x1ae/0xab0
>
> [  135.498990]  [] ? sched_clock_local+0x25/0x90
>
> [  135.498996]  [] ? alloc_pid+0x2e/0x530
>
> [  135.499023]  [] btrfs_ioctl+0x286/0x27e0 [btrfs]
>
> [  135.499031]  [] ? __enqueue_entity+0x78/0x80
>
> [  135.499037]  [] ? enqueue_entity+0x400/0xc20
>
> [  135.499046]  [] ? native_sched_clock+0x2a/0x90
>
> [  135.499055]  [] ? enqueue_task_fair+0x178/0x730
>
> [  135.499061]  [] ? native_smp_send_reschedule+0x4d/0x70
>
> [  135.499069]  [] ? resched_curr+0x70/0xc0
>
> [  135.499078]  [] ? check_preempt_curr+0x5a/0xa0
>
> [  135.499087]  [] ? wake_up_new_task+0x12f/0x1b0
>
> [  135.499096]  [] do_vfs_ioctl+0x2e0/0x4e0
>
> [  135.499106]  [] ? do_fork+0x13c/0x370
>
> [  135.499115]  [] SyS_ioctl+0x81/0xa0
>
> [  135.499121]  [] ? SyS_clone+0x16/0x20
>
> [  135.499131]  [] ? stub_clone+0x6d/0x90
>
> [  135.499140]  [] system_call_fastpath+0x16/0x1b
>
> [  135.499144] ---[ end trace e1dd916182de3a9e ]---





Re: BTRFS corruption w/kernel 3.13 while using docker -s btrfs

2015-04-09 Thread Filipe David Manana
> - I fixed this then purged all the containers and images (i.e. btrfs
> subvolumes).
> - I did "docker pull debian:testing" which would have created a few new
> btrfs subvolumes.
> - I ran some "docker build"s and at some point I had two builds going at
> once (two docker processes each creating btrfs subvolumes/snapshots at the
> same time).
> - At some point one docker build was completed and the other was stalled
> trying to create a snapshot.
> - Minutes later I noticed that my other applications were freezing up. The
> UI was still responsive and I could use the terminal.
> - I looked at syslog and dmesg output in a terminal. There were no messages,
> ominous or otherwise coinciding with the disk i/o freezing up.
> - I held in the power button.
> - The reboot didn't boot. I ended up with the dreaded busybox shell, unable
> to mount btrfs rootfs...
> - Removed disk and took a dd copy of sda3.
>
> Other research...
> Certain BTRFS mount options break docker -s btrfs:
> - https://github.com/dotcloud/docker/issues/5429#issuecomment-42443919
>
> The "AUFS on btrfs" thing has a few mentions in docker issues, perhaps it's
> nothing to do with btrfs though:
> - https://github.com/dotcloud/docker/issues/829 - continued at...
> - https://github.com/dotcloud/docker/issues/1075 - continued at...
> - https://github.com/dotcloud/docker/issues/2961
> - https://github.com/dotcloud/docker/issues/2056 # just another datapoint.
>
> Cheers
>
> --
> Paul Harvey
>
>





Re: [PATCH v3 RESEND 2/2] fstests: btrfs/089: Test for incorrect exclusive reference number after file clone.

2015-04-07 Thread Filipe David Manana
load.
> +if [ $SUPPORT_NOINODE_CACHE == "no" ]; then
> +   EMPTY_SIZE=`$BTRFS_UTIL_PROG qgroup show $units $SCRATCH_MNT | \
> +   $SED_PROG -n '/[0-9]/p' | $AWK_PROG '{print $2}' | head -n1`
> +   if [ $EMPTY_SIZE != $NODESIZE ]; then
> +   _notrun "Kernel doesn't support to disable inode cache"
> +   fi
> +fi
> +
> +dd if=/dev/zero of=$SCRATCH_MNT/subv1/file1 bs=4K count=64 2>> $seqres.full

Sorry for not asking this on previous versions of the patch, but is dd really
necessary here? Wouldn't the following work:

$XFS_IO_PROG -f -c "pwrite 0 256K" $SCRATCH_MNT/subv1/file1 | _filter_xfs_io

I tried your test, and using xfs_io worked as well here.
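
As a quick sanity check that the two commands write the same amount of data, here is a small Python sketch. parse_size() is a hypothetical helper, and it assumes that both dd and xfs_io treat the "K" suffix as 1024 bytes.

```python
def parse_size(text: str) -> int:
    """Parse sizes such as '4K' or '256K', treating K as 1024 bytes (assumed)."""
    units = {"K": 1024, "M": 1024 ** 2, "G": 1024 ** 3}
    suffix = text[-1].upper()
    if suffix in units:
        return int(text[:-1]) * units[suffix]
    return int(text)


dd_bytes = parse_size("4K") * 64      # dd bs=4K count=64
pwrite_bytes = parse_size("256K")     # xfs_io pwrite 0 256K

print(dd_bytes, pwrite_bytes)
```

Both come to 262144 bytes. The expected output in the patch lists 327680 for the subvolumes holding the file, which is 262144 + 65536; that is consistent with the file data plus one 64 KiB metadata node, although that reading is an assumption.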

> +cp --reflink $SCRATCH_MNT/subv1/file1 $SCRATCH_MNT/subv2/file1
> +cp --reflink $SCRATCH_MNT/subv1/file1 $SCRATCH_MNT/subv3/file1
> +
> +# Current btrfs use tree search ioctl to show quota, which will only show info
> +# in commit tree. So need to sync to update the qgroup commit tree.
> +sync
> +
> +units=`_btrfs_qgroup_units`
> +$BTRFS_UTIL_PROG qgroup show $units $SCRATCH_MNT | $SED_PROG -n '/[0-9]/p' | \
> +   $AWK_PROG '{print $2" "$3}'
> +
> +# success, all done
> +status=0
> +exit
> diff --git a/tests/btrfs/089.out b/tests/btrfs/089.out
> new file mode 100644
> index 000..09a1077
> --- /dev/null
> +++ b/tests/btrfs/089.out
> @@ -0,0 +1,5 @@
> +QA output created by 089
> +65536 65536
> +327680 65536
> +327680 65536
> +327680 65536
> diff --git a/tests/btrfs/group b/tests/btrfs/group
> index a053d14..82f6fe6 100644
> --- a/tests/btrfs/group
> +++ b/tests/btrfs/group
> @@ -90,3 +90,4 @@
>  085 auto quick metadata subvol
>  086 auto quick clone
>  088 auto quick
> +089 auto quick qgroup
> --
> 2.3.5
>


Re: [PATCH RESEND 1/2] fstests: btrfs/088: Check return value of "btrfs filesystem show" command executed on umounted device.

2015-04-07 Thread Filipe David Manana
On Tue, Apr 7, 2015 at 10:04 AM, Qu Wenruo  wrote:
> The return value should always be 0 if no problem happens, but
> "btrfs filesystem show" run on an unmounted device always returns 1
> even when there is no problem.
>
> This testcase just checks it.
>
> Signed-off-by: Qu Wenruo 
> ---
> Rebased to latest fstests.
> 087 is taken by Filipe's test case "fstests: test for btrfs transaction
> abortion on device with discard support", so take 088.

I also sent tests numbered 088 and 089 a few days ago.
You should probably drop the test number from the patch subject. Dave
often renumbers tests when merging, so the number adds no value to the
subject.

> ---
>  tests/btrfs/088 | 71 
> +
>  tests/btrfs/088.out |  2 ++
>  tests/btrfs/group   |  1 +
>  3 files changed, 74 insertions(+)
>  create mode 100644 tests/btrfs/088
>  create mode 100644 tests/btrfs/088.out
>
> diff --git a/tests/btrfs/088 b/tests/btrfs/088
> new file mode 100644
> index 000..768d2bf
> --- /dev/null
> +++ b/tests/btrfs/088
> @@ -0,0 +1,71 @@
> +#! /bin/bash
> +# FS QA Test No. btrfs/088
> +#
> +# Check return value of "btrfs filesystem show" command executed on
> +# umounted device.
> +# It should return 0 if nothing wrong happens.
> +#
> +# Regression in v3.18 btrfs-progs and fixed by the following patch:
> +#
> +#btrfs-progs: Fix wrong return value when executing 'fi show' on
> +#umounted device.
> +#
> +#---
> +# Copyright (c) 2015 Fujitsu, Inc.  All Rights Reserved.
> +#
> +# This program is free software; you can redistribute it and/or
> +# modify it under the terms of the GNU General Public License as
> +# published by the Free Software Foundation.
> +#
> +# This program is distributed in the hope that it would be useful,
> +# but WITHOUT ANY WARRANTY; without even the implied warranty of
> +# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> +# GNU General Public License for more details.
> +#
> +# You should have received a copy of the GNU General Public License
> +# along with this program; if not, write the Free Software Foundation,
> +# Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301  USA
> +#---
> +#
> +
> +seq=`basename $0`
> +seqres=$RESULT_DIR/$seq
> +echo "QA output created by $seq"
> +
> +here=`pwd`
> +tmp=/tmp/$$
> +status=1   # failure is the default!
> +trap "_cleanup; exit \$status" 0 1 2 3 15
> +
> +_cleanup()
> +{
> +cd /
> +rm -f $tmp.*
> +}
> +
> +# get standard environment, filters and checks
> +. ./common/rc
> +. ./common/filter.btrfs
> +
> +# real QA test starts here
> +
> +# Modify as appropriate.
> +_supported_fs btrfs
> +_supported_os Linux
> +_require_scratch
> +_require_scratch_dev_pool

Do we really need a device pool here? Isn't a single-device fs enough?
If the test needs a multi-device fs, it would be a good idea to explain
why in a comment and to mention that requirement in the test description.

Otherwise looks good to me.

Thanks

> +
> +rm -f $seqres.full
> +
> +FIRST_POOL_DEV=`echo $SCRATCH_DEV_POOL | awk '{print $1}'`
> +TOTAL_DEVS=`echo $SCRATCH_DEV_POOL | wc -w`
> +
> +_scratch_pool_mkfs >> $seqres.full 2>&1 || _fail "mkfs failed"
> +
> +_run_btrfs_util_prog filesystem show $FIRST_POOL_DEV | \
> +   _filter_btrfs_filesystem_show $TOTAL_DEVS
> +
> +# success, all done
> +echo "Silence is golden"
> +status=0
> +exit
> diff --git a/tests/btrfs/088.out b/tests/btrfs/088.out
> new file mode 100644
> index 000..c24480a
> --- /dev/null
> +++ b/tests/btrfs/088.out
> @@ -0,0 +1,2 @@
> +QA output created by 088
> +Silence is golden
> diff --git a/tests/btrfs/group b/tests/btrfs/group
> index e9c15af..a053d14 100644
> --- a/tests/btrfs/group
> +++ b/tests/btrfs/group
> @@ -89,3 +89,4 @@
>  084 auto quick send
>  085 auto quick metadata subvol
>  086 auto quick clone
> +088 auto quick
> --
> 2.3.5
>


Re: [PATCH] btrfs: iterate over unused chunk space in FITRIM

2015-04-06 Thread Filipe David Manana
d that
mutex (extent-tree.c:do_chunk_alloc() locks chunks mutex and then
calls btrfs_alloc_chunk() while holding that mutex).

Thanks.

> +   btrfs_end_transaction(trans, root);
> +
> range->len = trimmed;
> return ret;
>  }
> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
> index 8222f6f..2f4ce7f 100644
> --- a/fs/btrfs/volumes.c
> +++ b/fs/btrfs/volumes.c
> @@ -1089,12 +1089,13 @@ again:
>
>
>  /*
> - * find_free_dev_extent - find free space in the specified device
> - * @device:the device which we search the free space in
> - * @num_bytes: the size of the free space that we need
> - * @start: store the start of the free space.
> - * @len:   the size of the free space. that we find, or the size of the max
> - * free space if we don't find suitable free space
> + * find_free_dev_extent_start - find free space in the specified device
> + * @device:  the device which we search the free space in
> + * @num_bytes:   the size of the free space that we need
> + * @search_start: the position from which to begin the search
> + * @start:   store the start of the free space.
> + * @len: the size of the free space. that we find, or the size
> + *   of the max free space if we don't find suitable free space
>   *
>   * this uses a pretty simple search, the expectation is that it is
>   * called very infrequently and that a given device has a small number
> @@ -1108,9 +1109,9 @@ again:
>   * But if we don't find suitable free space, it is used to store the size of
>   * the max free space.
>   */
> -int find_free_dev_extent(struct btrfs_trans_handle *trans,
> -struct btrfs_device *device, u64 num_bytes,
> -u64 *start, u64 *len)
> +int find_free_dev_extent_start(struct btrfs_trans_handle *trans,
> +  struct btrfs_device *device, u64 num_bytes,
> +  u64 search_start, u64 *start, u64 *len)
>  {
> struct btrfs_key key;
> struct btrfs_root *root = device->dev_root;
> @@ -1120,19 +1121,11 @@ int find_free_dev_extent(struct btrfs_trans_handle *trans,
> u64 max_hole_start;
> u64 max_hole_size;
> u64 extent_end;
> -   u64 search_start;
> u64 search_end = device->total_bytes;
> int ret;
> int slot;
> struct extent_buffer *l;
>
> -   /* FIXME use last free of some kind */
> -
> -   /* we don't want to overwrite the superblock on the drive,
> -* so we make sure to start at an offset of at least 1MB
> -*/
> -   search_start = max(root->fs_info->alloc_start, 1024ull * 1024);
> -
> path = btrfs_alloc_path();
> if (!path)
> return -ENOMEM;
> @@ -1260,6 +1253,24 @@ out:
> return ret;
>  }
>
> +int find_free_dev_extent(struct btrfs_trans_handle *trans,
> +struct btrfs_device *device, u64 num_bytes,
> +u64 *start, u64 *len)
> +{
> +   struct btrfs_root *root = device->dev_root;
> +   u64 search_start;
> +
> +   /* FIXME use last free of some kind */
> +
> +   /*
> +* we don't want to overwrite the superblock on the drive,
> +* so we make sure to start at an offset of at least 1MB
> +*/
> +   search_start = max(root->fs_info->alloc_start, 1024ull * 1024);
> +   return find_free_dev_extent_start(trans, device, num_bytes,
> + search_start, start, len);
> +}
> +
>  static int btrfs_free_dev_extent(struct btrfs_trans_handle *trans,
>   struct btrfs_device *device,
>   u64 start, u64 *dev_extent_len)
> diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
> index 83069de..c9a7ea9 100644
> --- a/fs/btrfs/volumes.h
> +++ b/fs/btrfs/volumes.h
> @@ -450,6 +450,9 @@ int btrfs_cancel_balance(struct btrfs_fs_info *fs_info);
>  int btrfs_create_uuid_tree(struct btrfs_fs_info *fs_info);
>  int btrfs_check_uuid_tree(struct btrfs_fs_info *fs_info);
>  int btrfs_chunk_readonly(struct btrfs_root *root, u64 chunk_offset);
> +int find_free_dev_extent_start(struct btrfs_trans_handle *trans,
> +struct btrfs_device *device, u64 num_bytes,
> +u64 search_start, u64 *start, u64 *max_avail);
>  int find_free_dev_extent(struct btrfs_trans_handle *trans,
>  struct btrfs_device *device, u64 num_bytes,
>  u64 *start, u64 *max_avail);
> --
> 1.8.5.6
>
>
> --
> Jeff Mahoney
> SUSE Labs


Re: kernel BUG at fs/btrfs/inode.c:3142

2015-04-04 Thread Filipe David Manana
> Metadata, single: total=8.00MiB, used=0.00B
> GlobalReserve, single: total=4.00MiB, used=3.87MiB
>
>
> BTW, it's interesting that 437 MB of data are used since there are no files
> left on the volume.
>
> Please let me know how I can help you to debug this.
>
>
> Best regards,
> Sebastian


Re: Btrfs ENOSPC issue

2015-04-04 Thread Filipe David Manana
On Sat, Apr 4, 2015 at 12:36 AM, Justin Maggard  wrote:
> Hi,
>
> We're hitting a consistently reproducible ENOSPC condition with a
> simple test case:
>
> # truncate -s 1T btrfs.fs
> # mkfs.btrfs btrfs.fs
> # mount btrfs.fs /mnt/
> # fallocate -l 1021G /mnt/fallocate
> # btrfs fi sync /mnt/
> # dd if=/dev/zero of=/mnt/dd bs=1G
> # btrfs fi sync /mnt/
> # rm /mnt/fallocate
> # btrfs fi sync /mnt/
> # fallocate -l 20G /mnt/fallocate
> fallocate: /mnt/fallocate: fallocate failed: No space left on device
>
> I continue to get ENOSPC even after unmount / mount.
>
> Here we have 1022GB free as reported by df, yet we can't allocate
> 20GB.  I tried the integration-4.1 tree, which had the same results.
> I also added Zhao Lei's most recent ENOSPC patchset from today, but it
> didn't seem to help.

Have you also tried the following patch (it is not in any release or rc yet)?

https://patchwork.kernel.org/patch/5800231/

>
> So it appears that when allocating the first chunk,
> find_free_dev_extent() finds a huge hole, and allocates a portion of
> that free 1022GB.  Real chunk allocation is delayed until transaction
> submit and does not insert the DEV_EXTENT item into the device tree
> immediately, so transaction->pending_chunks is used to record pending
> chunks.
>
> When it comes to the next chunk allocation, find_free_dev_extent()
> detects the same huge hole, but contains_pending_extent() returns true
> and sets hole_size to 0.  This means we skip our one and only huge
> free space hole and try to search for some other free space holes.
>
> The problem occurs when there is not enough space for chunk allocation
> if we skip that huge hole, and find_free_dev_extent() eventually
> returns -ENOSPC.
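
The behaviour described in the quoted paragraphs can be modelled with a toy Python sketch. This is not kernel code: trim_hole() and its arguments are hypothetical names, a pending chunk is reduced to a (physical start, length) pair, and the sizes are illustrative. With shrink=False the hole is discarded as soon as a pending chunk overlaps it, as the report describes; with shrink=True the hole is shortened instead, mirroring the arithmetic of the proposed diff.

```python
GiB = 1 << 30


def trim_hole(hole_start, hole_len, pending, shrink):
    """Adjust a free-space hole for pending (not yet committed) chunks.

    pending is a list of (physical_start, length) tuples.
    Returns the adjusted (start, length) of the hole.
    """
    start, length, found = hole_start, hole_len, False
    for p_start, p_len in pending:
        if p_start >= start + length or p_start + p_len <= start:
            continue  # this pending chunk does not overlap the hole
        start = p_start + p_len  # skip past the pending chunk
        if shrink:
            # Proposed behaviour: keep whatever is left of the hole.
            length = length - p_len if length > p_len else 0
        found = True
    if found and not shrink:
        # Reported behaviour: the caller zeroes the whole hole.
        length = 0
    return start, length


pending_chunks = [(0, 1 * GiB)]  # one chunk allocated but not yet committed
reported = trim_hole(0, 1022 * GiB, pending_chunks, shrink=False)
proposed = trim_hole(0, 1022 * GiB, pending_chunks, shrink=True)
print(reported, proposed)
```

In this model the reported behaviour leaves a hole of length 0, so a device with a single large hole has nothing else to offer, while the shrinking variant still leaves 1021 GiB. Whether shrinking is the right fix for the real allocator is not something the sketch can show.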
>
> The following patch makes it work for me, but I certainly may have
> missed some subtleties in how btrfs allocation works; so if something
> is incorrect here, I'd appreciate feedback.  If this is the proper way
> to go about fixing it, I can whip up a proper patch and post it to the
> list.
>
> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
> index a73acf4..d056448 100644
> --- a/fs/btrfs/volumes.c
> +++ b/fs/btrfs/volumes.c
> @@ -1053,7 +1053,7 @@ out:
>
>  static int contains_pending_extent(struct btrfs_trans_handle *trans,
>struct btrfs_device *device,
> -  u64 *start, u64 len)
> +  u64 *start, u64 *len)
>  {
> struct extent_map *em;
> struct list_head *search_list = &trans->transaction->pending_chunks;
> @@ -1068,12 +1068,16 @@ again:
> for (i = 0; i < map->num_stripes; i++) {
> if (map->stripes[i].dev != device)
> continue;
> -   if (map->stripes[i].physical >= *start + len ||
> +   if (map->stripes[i].physical >= *start + *len ||
> map->stripes[i].physical + em->orig_block_len <=
> *start)
> continue;
> *start = map->stripes[i].physical +
> em->orig_block_len;
> +   if (*len > em->orig_block_len)
> +   *len -= em->orig_block_len;
> +   else
> +   *len = 0;
> ret = 1;
> }
> }
> @@ -1191,10 +1195,9 @@ again:
>  * Have to check before we set max_hole_start, otherwise
>  * we could end up sending back this offset anyway.
>  */
> -   if (contains_pending_extent(trans, device,
> -   &search_start,
> -   hole_size))
> -   hole_size = 0;
> +   contains_pending_extent(trans, device,
> +   &search_start,
> +   &hole_size);
>
> if (hole_size > max_hole_size) {
> max_hole_start = search_start;
> @@ -1239,7 +1242,7 @@ next:
> max_hole_size = hole_size;
> }
>
> -   if (contains_pending_extent(trans, device, &search_start, hole_size)) {
> +   if (contains_pending_extent(trans, device, &search_start, &hole_size)) {
> btrfs_release_path(path);
> goto again;
>  

Re: [PATCH v2] Btrfs: fix range cloning when same inode used as source and destination

2015-04-02 Thread Filipe David Manana
On Thu, Apr 2, 2015 at 6:31 PM, Holger Hoffstätte wrote:
>
> On Thu, 02 Apr 2015 18:25:11 +0100, Filipe Manana wrote:
>
>> V2: Fixed a warning about potentially uninitialized variable. David
>> got this warning on a 4.5.1 gcc, but I didn't on a 4.9.2 gcc
>> however.
>
> I was *just* about to post this warning, since I saw it only a minute ago!
>
> I assume you mean:
>
> fs/btrfs/ioctl.c: In function 'btrfs_clone':
> fs/btrfs/ioctl.c:3531:14: warning: 'next_key_min_offset' may be used uninitialized in this function [-Wmaybe-uninitialized]
>key.offset = next_key_min_offset;
>   ^
>
> ..and this is with 4.9.2 here.
>
> Anyway..thanks for being faster :)

Yes, that was it.
Thanks.

>
> Holger
>

