Re: Linux-next regression?

2018-12-05 Thread Chris Mason
On 5 Dec 2018, at 5:59, Andrea Gelmini wrote:

> On Tue, Dec 04, 2018 at 10:29:49PM +0000, Chris Mason wrote:
>> I think (hope) this is:
>>
>> https://bugzilla.kernel.org/show_bug.cgi?id=201685
>>
>> Which was just nailed down to a blkmq bug.  It triggers when you have
>> scsi devices using elevator=none over blkmq.
>
> Thanks a lot Chris. Really.
> Good news: I confirm I recompiled and used blkmq and no-op (at that 
> time).
> Also, the massive write of btrfs defrag can explain the massive 
> trigger of
> the bug, and next corruption.

Sorry this happened, but glad you were able to confirm that it explains 
the trouble you hit.  Thanks for the report, I did end up using this as 
a datapoint to convince myself the bugzilla above wasn't ext4 specific.

-chris


Re: Linux-next regression?

2018-12-04 Thread Chris Mason
On 28 Nov 2018, at 11:05, Andrea Gelmini wrote:

> On Tue, Nov 27, 2018 at 10:16:52PM +0800, Qu Wenruo wrote:
>>
>> But it's less a concerning problem since it doesn't reach latest RC, 
>> so
>> if you could reproduce it stably, I'd recommend to do a bisect.
>
> No problem to bisect, usually.
> But right now it's not possible for me, I explain further.
> Anyway, here the rest of the story.
>
> So, in the end I:
> a) booted with 4.20.0-rc4
> b) updated backup
> c) did the btrfs check --read-only
> d) seven steps, everything is perfect
> e) no complains on screen or in logs (never had)
> f) so, started to compile linux-next 20181128 (on another partition)
> e) without using (reading or writing) on /home, I started
> f) btrfs filesystem defrag -v -r -t 128M /home
> g) it worked without complain (in screen or logs)
> h) then, reboot with kernel tag 20181128
> i) and no way to mount:

I think (hope) this is:

https://bugzilla.kernel.org/show_bug.cgi?id=201685

Which was just nailed down to a blkmq bug.  It triggers when you have 
scsi devices using elevator=none over blkmq.

-chris


Re: btrfs development - question about crypto api integration

2018-11-30 Thread Chris Mason
On 29 Nov 2018, at 12:37, Nikolay Borisov wrote:

> On 29.11.18 г. 18:43 ч., Jean Fobe wrote:
>> Hi all,
>> I've been studying LZ4 and other compression algorithms on the
>> kernel, and seen other projects such as zram and ubifs using the
>> crypto api. Is there a technical reason for not using the crypto api
>> for compression (and possibly for encryption) in btrfs?
>> I did not find any design/technical implementation choices in
>> btrfs development in the developer's FAQ on the wiki. If I completely
>> missed it, could someone point me in the right direction?
>> Lastly, if there is no technical reason for this, would it be
>> something interesting to have implemented?
>
> I personally think it would be better if btrfs' exploited the generic
> framework. And in fact when you look at zstd, btrfs does use the
> generic, low-level ZSTD routines but not the crypto library wrappers. 
> If
> I were I'd try and convert zstd (since it's the most recently added
> algorithm) to using the crypto layer to see if there are any lurking
> problems.

Back when I first added the zlib support, the zlib API was both easier 
to use and a better fit for our async worker threads.  That doesn't mean 
we shouldn't switch, it's just how we got to step one ;)

-chris


Re: [PATCH 2/3] btrfs: wakeup cleaner thread when adding delayed iput

2018-11-28 Thread Chris Mason
On 28 Nov 2018, at 14:06, David Sterba wrote:

> On Tue, Nov 27, 2018 at 03:08:08PM -0500, Josef Bacik wrote:
>> On Tue, Nov 27, 2018 at 07:59:42PM +0000, Chris Mason wrote:
>>> On 27 Nov 2018, at 14:54, Josef Bacik wrote:
>>>
>>>> On Tue, Nov 27, 2018 at 10:26:15AM +0200, Nikolay Borisov wrote:
>>>>>
>>>>>
>>>>> On 21.11.18 г. 21:09 ч., Josef Bacik wrote:
>>>>>> The cleaner thread usually takes care of delayed iputs, with the
>>>>>> exception of the btrfs_end_transaction_throttle path.  The 
>>>>>> cleaner
>>>>>> thread only gets woken up every 30 seconds, so instead wake it up 
>>>>>> to
>>>>>> do
>>>>>> it's work so that we can free up that space as quickly as 
>>>>>> possible.
>>>>>
>>>>> Have you done any measurements how this affects the overall 
>>>>> system. I
>>>>> suspect this introduces a lot of noise since now we are going to 
>>>>> be
>>>>> doing a thread wakeup on every iput, does this give a chance to 
>>>>> have
>>>>> nice, large batches of iputs that  the cost of wake up can be
>>>>> amortized
>>>>> across?
>>>>
>>>> I ran the whole patchset with our A/B testing stuff and the 
>>>> patchset
>>>> was a 5%
>>>> win overall, so I'm inclined to think it's fine.  Thanks,
>>>
>>> It's a good point though, a delayed wakeup may be less overhead.
>>
>> Sure, but how do we go about that without it sometimes messing up?  
>> In practice
>> the only time we're doing this is at the end of finish_ordered_io, so 
>> likely to
>> not be a constant stream.  I suppose since we have places where we 
>> force it to
>> run that we don't really need this.  IDK, I'm fine with dropping it.  
>> Thanks,
>
> The transaction thread wakes up cleaner periodically (commit interval,
> 30s by default), so the time to process iputs is not unbounded.
>
> I have the same concerns as Nikolay, coupling the wakeup to all 
> delayed
> iputs could result in smaller batches. But some of the callers of
> btrfs_add_delayed_iput might benefit from the extra wakeup, like
> btrfs_remove_block_group, so I don't want to leave the idea yet.

Yeah, I love the idea, I'm just worried about a tiny bit of rate 
limiting.  Since we're only waking up a fixed process and not creating 
new work queue items every time, it's probably fine, but a delay of HZ/2 
would probably be even better.

-chris


Re: [PATCH 2/3] btrfs: wakeup cleaner thread when adding delayed iput

2018-11-27 Thread Chris Mason
On 27 Nov 2018, at 14:54, Josef Bacik wrote:

> On Tue, Nov 27, 2018 at 10:26:15AM +0200, Nikolay Borisov wrote:
>>
>>
>> On 21.11.18 г. 21:09 ч., Josef Bacik wrote:
>>> The cleaner thread usually takes care of delayed iputs, with the
>>> exception of the btrfs_end_transaction_throttle path.  The cleaner
>>> thread only gets woken up every 30 seconds, so instead wake it up to 
>>> do
>>> it's work so that we can free up that space as quickly as possible.
>>
>> Have you done any measurements how this affects the overall system. I
>> suspect this introduces a lot of noise since now we are going to be
>> doing a thread wakeup on every iput, does this give a chance to have
>> nice, large batches of iputs that  the cost of wake up can be 
>> amortized
>> across?
>
> I ran the whole patchset with our A/B testing stuff and the patchset 
> was a 5%
> win overall, so I'm inclined to think it's fine.  Thanks,

It's a good point though, a delayed wakeup may be less overhead.

-chris


Re: [PATCH v2] Btrfs: fix missing delayed iputs on unmount

2018-11-01 Thread Chris Mason
On 1 Nov 2018, at 12:00, Omar Sandoval wrote:

> On Thu, Nov 01, 2018 at 04:29:48PM +0100, David Sterba wrote:
>>>
>>> How is that? kthread_stop() frees the task struct, so 
>>> wake_up_process()
>>> would be a use-after-free.
>>
>> That was an assumption for the demonstration purposes, the wording 
>> was
>> confusing sorry.
>
> Oh, well in that case, that's exactly what kthread_park() is ;) Stop 
> the
> thread and make wake_up a noop, and then we don't need to add special
> cases everywhere else.

This is how Omar won me over to kthread_park().  I think Dave's 
alternatives can be made correct, but Omar's way solves a whole class of 
potential problems.  The park() makes the thread safely ignore future 
work, and won't don't have to chase down weird bugs where people are 
sending future work at the wrong time.

-chris


Re: [PATCH v2] Btrfs: fix missing delayed iputs on unmount

2018-11-01 Thread Chris Mason
On 1 Nov 2018, at 6:15, David Sterba wrote:

> On Wed, Oct 31, 2018 at 10:06:08AM -0700, Omar Sandoval wrote:
>> From: Omar Sandoval 
>>
>> There's a race between close_ctree() and cleaner_kthread().
>> close_ctree() sets btrfs_fs_closing(), and the cleaner stops when it
>> sees it set, but this is racy; the cleaner might have already checked
>> the bit and could be cleaning stuff. In particular, if it deletes 
>> unused
>> block groups, it will create delayed iputs for the free space cache
>> inodes. As of "btrfs: don't run delayed_iputs in commit", we're no
>> longer running delayed iputs after a commit. Therefore, if the 
>> cleaner
>> creates more delayed iputs after delayed iputs are run in
>> btrfs_commit_super(), we will leak inodes on unmount and get a busy
>> inode crash from the VFS.
>>
>> Fix it by parking the cleaner
>
> Ouch, that's IMO wrong way to fix it. The bug is on a higher level,
> we're missing a commit or clean up data structures. Messing with state
> of a thread would be the last thing I'd try after proving that it's 
> not
> possible to fix in the logic of btrfs itself.
>
> The shutdown sequence in close_tree is quite tricky and we've had bugs
> there. The interdependencies of thread and data structures and other
> subsystems cannot have loops that could not find an ordering that will
> not leak something.
>
> It's not a big problem if some step is done more than once, like
> committing or cleaning up some other structures if we know that
> it could create new.

The problem is the cleaner thread needs to be told to stop doing new 
work, and we need to wait for the work it's already doing to be 
finished.  We're getting "stop doing new work" already because the 
cleaner thread checks to see if the FS is closing, but we don't have a 
way today to wait for him to finish what he's already doing.

kthread_park() is basically the same as adding another mutex or 
synchronization point.  I'm not sure how we could change close_tree() or 
the final commit to pick this up more effectively?

-chris



Re: [PATCH] btrfs: use tagged writepage to mitigate livelock of snapshot

2018-11-01 Thread Chris Mason
On 1 Nov 2018, at 6:21, ethanlien wrote:

> Nikolay Borisov 於 2018-11-01 18:01 寫到:
>> On 1.11.18 г. 11:56 ч., ethanlien wrote:
>>> Nikolay Borisov 於 2018-11-01 16:59 寫到:
 On 1.11.18 г. 8:49 ч., Ethan Lien wrote:
> Snapshot is expected to be fast. But if there are writers steadily
> create dirty pages in our subvolume, the snapshot may take a very 
> long
> time to complete. To fix the problem, we use tagged writepage for
> snapshot
> flusher as we do in the generic write_cache_pages(), so we can 
> ommit
> pages
> dirtied after the snapshot command.

 So the gist of this commit really is that you detect when 
 filemap_flush
 has been called from snapshot context and tag all pages at *that* 
 time
 as PAGECACHE_TAG_TOWRITE and then write them, ignoring any pages 
 that
 might have been dirtied beforehand. Your description kind of dances
 around this idea without really saying it explicitly. Those 
 semantics
 make sense, however i'd like to be stated more explicitly in the 
 change
  log.

 However, this is done at the expense of consistency, so we have 
 faster
 snapshots but depending the file which are stored in them they 
 might be
 broken (i.e a database) since they will be missing pages.

>>>
>>> tag_pages_for_writeback() will tag all pages with 
>>> PAGECACHE_TAG_DIRTY.
>>> If a dirty
>>> page get missed here, it means someone has initiated the flush 
>>> before
>>> us, so we
>>> will wait that dirty page to complete in create_snapshot() ->
>>> btrfs_wait_ordered_extents().
>>
>> And what happens in the scenario where you have  someone dirtying 
>> pages,
>> you issue the snapshot ioctl, mark all currently dirtied pages as
>> TOWRITE and then you have more delalloc pages being dirtied following
>> initial call to tag_pages_for_writeback , you will miss those, no ?
>>
>
> We don't freeze the filesystem when doing snapshot, so there is no 
> guarantee
> the page dirtied right after we send the ioctl, will be included in 
> snapshot.
> If a page is dirtied while we scan the dirty pages and its offset is 
> before our
> current index, we miss it in our current snapshot implementation too.

This looks like a pretty good compromise to me between btrfs v0's don't 
flush at all on snapshot and today's full sync on snapshot.  The full 
sync is always going to be a little racey wrt concurrent writes.

-chris


Re: [PATCH 0/2] address lock contention of btree root

2018-08-21 Thread Chris Mason

On 21 Aug 2018, at 14:15, Liu Bo wrote:


On Tue, Aug 21, 2018 at 01:54:11PM -0400, Chris Mason wrote:

On 16 Aug 2018, at 17:07, Liu Bo wrote:


The lock contention on btree nodes (esp. root node) is apparently a
bottleneck when there're multiple readers and writers concurrently
trying to access them.  Unfortunately this is by design and it's not
easy to fix it unless with some complex changes, however, there is
still some room.

With a stable workload based on fsmark which has 16 threads creating
1,600K files, we could see that a good amount of overhead comes from
switching path between spinning mode and blocking mode in
btrfs_search_slot().

Patch 1 provides more details about the overhead and test results 
from

fsmark and dbench.
Patch 2 kills leave_spinning due to the behaviour change from patch 
1.


This is really interesting, do you have numbers about how often we 
are able

to stay spinning?



I didn't gather how long we stay spinning,


I'm less worried about length of time spinning than I am how often we're 
able to call btrfs_search_slot() without ever blocking.  If one caller 
ends up going into blocking mode, it can cascade into all of the other 
callers, which can slow things down in low-to-medium contention 
workloads.


The flip side is that maybe the adaptive spinning in the mutexes is good 
enough now and we can just deleting the spinning completely.



but I was tracking

(1) how long a read lock or write lock can wait until it gets the
lock, with vanilla kernel, for read lock, it could be up to 10ms while
for write lock, it could be up to 200ms.


Nice, please add the stats to the patch descriptions, both before and 
after.  I'd love to see a histogram like you can get out of bcc's 
argdist.py.


-chris


Re: [PATCH 0/2] address lock contention of btree root

2018-08-21 Thread Chris Mason

On 16 Aug 2018, at 17:07, Liu Bo wrote:


The lock contention on btree nodes (esp. root node) is apparently a
bottleneck when there're multiple readers and writers concurrently
trying to access them.  Unfortunately this is by design and it's not
easy to fix it unless with some complex changes, however, there is
still some room.

With a stable workload based on fsmark which has 16 threads creating
1,600K files, we could see that a good amount of overhead comes from
switching path between spinning mode and blocking mode in
btrfs_search_slot().

Patch 1 provides more details about the overhead and test results from
fsmark and dbench.
Patch 2 kills leave_spinning due to the behaviour change from patch 1.


This is really interesting, do you have numbers about how often we are 
able to stay spinning?


IOW, do we end up blocking every time?

-chris


Re: Do btrfs compression option changes need to be atomic?

2018-08-21 Thread Chris Mason

On 21 Aug 2018, at 9:46, David Howells wrote:

Should changes to the compression options on a btrfs mount be atomic, 
the

problem being that they're split across three variables?

Further to that, how much of an issue is it if the configuration is 
split out
into its own struct that is accessed from struct btrfs_fs_info using 
RCU?




Hi David,

They shouldn't need to be atomic, we're making compression decisions on 
a per-extent basis and sometimes bailing on the compression if it 
doesn't make things smaller.


-chris


[PATCH v2] Btrfs: fix regression in btrfs_page_mkwrite() from vm_fault_t conversion

2018-06-25 Thread Chris Mason
The vm_fault_t conversion commit introduced a ret2 variable for tracking
the integer return values from internal btrfs functions.  It was
sometimes returning VM_FAULT_LOCKED for pages that were actually invalid
and had been removed from the radix.  Something like this:

ret2 = btrfs_delalloc_reserve_space() // returns zero on success

lock_page(page)
if (page->mapping != inode->i_mapping)
goto out_unlock;

...

out_unlock:
if (!ret2) {
...
return VM_FAULT_LOCKED;
}

This ends up triggering this WARNING in btrfs_destroy_inode()
WARN_ON(BTRFS_I(inode)->block_rsv.size);

xfstests generic/095 was able to reliably reproduce the errors.

Since out_unlock: is only used for errors, this fix moves it below the
if (!ret2) check we use to return VM_FAULT_LOCKED for success.

Fixes: a528a2415087 (btrfs: change return type of btrfs_page_mkwrite to 
vm_fault_t)
Signed-off-by: Chris Mason 
---

 Changes since v1: don't set the vmfault_t 'ret' to zero, just move our
 goto taret around around instead.

 fs/btrfs/inode.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 193f933..63f713a 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -9036,13 +9036,14 @@ vm_fault_t btrfs_page_mkwrite(struct vm_fault *vmf)
 
unlock_extent_cached(io_tree, page_start, page_end, _state);
 
-out_unlock:
if (!ret2) {
btrfs_delalloc_release_extents(BTRFS_I(inode), PAGE_SIZE, true);
sb_end_pagefault(inode->i_sb);
extent_changeset_free(data_reserved);
return VM_FAULT_LOCKED;
}
+
+out_unlock:
unlock_page(page);
 out:
btrfs_delalloc_release_extents(BTRFS_I(inode), PAGE_SIZE, (ret != 0));
-- 
2.9.5

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] Btrfs: fix regression in btrfs_page_mkwrite() from vm_fault_t conversion

2018-06-25 Thread Chris Mason
On 25 Jun 2018, at 9:54, David Sterba wrote:

> On Mon, Jun 25, 2018 at 06:45:32AM -0700, Chris Mason wrote:

>> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
>> index 193f933..38403aa 100644
>> --- a/fs/btrfs/inode.c
>> +++ b/fs/btrfs/inode.c
>> @@ -9013,6 +9013,7 @@ vm_fault_t btrfs_page_mkwrite(struct vm_fault *vmf)
>>  goto out_unlock;
>>  }
>>  ret2 = 0;
>> +ret = 0;
>
> That's something I'd rather avoid, now there are two error values that
> need to be tracked. Shouldn't the 'ret2 = 0' be removed completely?


Yeah, the whole thing hurts my eyes.  Fixing.

>
>>
>>  /* page is wholly or partially inside EOF */
>>  if (page_start + PAGE_SIZE > size)
>> @@ -9037,7 +9038,7 @@ vm_fault_t btrfs_page_mkwrite(struct vm_fault *vmf)
>>  unlock_extent_cached(io_tree, page_start, page_end, _state);
>>
>>  out_unlock:
>> -if (!ret2) {
>> +if (!ret) {
>
> I'm not sure is safe, the new ret is one of the VM_FAULT_ values and
> it's not 0 by default and in fact I don't see anything that would set it
> to 0 directly or indirectly. Which means that the code in the out: label
> also needs to be revisited:

Good point, ok.

-chris
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH RFC 0/2] Btrfs: fix file data corruptions due to lost dirty bits

2018-06-25 Thread Chris Mason

On 25 Jun 2018, at 7:10, David Sterba wrote:


On Fri, Jun 22, 2018 at 05:25:54PM -0400, Chris Mason wrote:

The bug came here:

commit a528a24150870c5c16cbbbec69dba7e992b08456
Author: Souptick Joarder 
Date:   Wed Jun 6 19:54:44 2018 +0530

 btrfs: change return type of btrfs_page_mkwrite to vm_fault_t

When page->mapping != mapping, we improperly return success because 
ret2

is zero on goto out_unlock.

I'm running a fix through a full xfstests and I'll post soon.


The ret/ret2 pattern is IMO used in the wrong way, the ret2 is some
temporary value and it should be set and tested next to each other, 
not

holding the state accross the function.

The fix I'd propose is to avoid relying on it and add a separate exit
block, similar to out_noreserve:

--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -8981,7 +8981,6 @@ vm_fault_t btrfs_page_mkwrite(struct vm_fault 
*vmf)

ret = VM_FAULT_SIGBUS;
goto out_unlock;
}
-   ret2 = 0;

/* page is wholly or partially inside EOF */
if (page_start + PAGE_SIZE > size)
@@ -9004,14 +9003,6 @@ vm_fault_t btrfs_page_mkwrite(struct vm_fault 
*vmf)
BTRFS_I(inode)->last_log_commit = 
BTRFS_I(inode)->root->last_log_commit;


unlock_extent_cached(io_tree, page_start, page_end, 
_state);

-
-out_unlock:
-   if (!ret2) {
-   btrfs_delalloc_release_extents(BTRFS_I(inode), 
PAGE_SIZE, true);

-   sb_end_pagefault(inode->i_sb);
-   extent_changeset_free(data_reserved);
-   return VM_FAULT_LOCKED;
-   }
unlock_page(page);


VM_FAULT_LOCKED is where we return success.  The out_unlock goto is 
confusing because it's actually only used for failure, but the goto 
lands right above the if (everything actually worked) {} test.


On top of the patch I sent today, moving out_unlock: after the if would 
make it more clear.


-chris
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] Btrfs: fix regression in btrfs_page_mkwrite() from vm_fault_t conversion

2018-06-25 Thread Chris Mason
The vm_fault_t conversion commit introduced a ret2 variable for
tracking the integer return values from internal btrfs functions.  It
was sometimes returning VM_FAULT_LOCKED for pages that were actually
invalid and had been removed from the radix.  Something like this:

ret2 = btrfs_delalloc_reserve_space() // returns zero on success

lock_page(page)
if (page->mapping != inode->i_mapping)
goto out_unlock;

...

out_unlock:
if (!ret2) {
...
return VM_FAULT_LOCKED;
}

This ends up triggering this WARNING in btrfs_destroy_inode()
WARN_ON(BTRFS_I(inode)->block_rsv.size);

The fix used here is to use ret instead of ret2 in our check for success.

Fixes: a528a2415087 (btrfs: change return type of btrfs_page_mkwrite to 
vm_fault_t)
Signed-off-by: Chris Mason 
---
 fs/btrfs/inode.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 193f933..38403aa 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -9013,6 +9013,7 @@ vm_fault_t btrfs_page_mkwrite(struct vm_fault *vmf)
goto out_unlock;
}
ret2 = 0;
+   ret = 0;
 
/* page is wholly or partially inside EOF */
if (page_start + PAGE_SIZE > size)
@@ -9037,7 +9038,7 @@ vm_fault_t btrfs_page_mkwrite(struct vm_fault *vmf)
unlock_extent_cached(io_tree, page_start, page_end, _state);
 
 out_unlock:
-   if (!ret2) {
+   if (!ret) {
btrfs_delalloc_release_extents(BTRFS_I(inode), PAGE_SIZE, true);
sb_end_pagefault(inode->i_sb);
extent_changeset_free(data_reserved);
-- 
2.9.5

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH RFC 0/2] Btrfs: fix file data corruptions due to lost dirty bits

2018-06-22 Thread Chris Mason

On 20 Jun 2018, at 16:24, David Sterba wrote:


On Wed, Jun 20, 2018 at 03:48:08PM -0400, Chris Mason wrote:



generic/095 [18:07:03][ 3769.317862] run fstests generic/095 at
2018-06-20 18:07:03


Hmpf, I pass both 095 and 208 here.

[ 3774.849685] BTRFS: device fsid 
3acffad9-28e5-43ce-80e1-f5032e334cba

devid 1 transid 5 /dev/vdb
[ 3774.875409] BTRFS info (device vdb): disk space caching is 
enabled

[ 3774.877723] BTRFS info (device vdb): has skinny extents
[ 3774.879371] BTRFS info (device vdb): flagging fs with big 
metadata

feature
[ 3774.885020] BTRFS info (device vdb): checking UUID tree
[ 3775.593329] Page cache invalidation failure on direct I/O.
Possible data corruption due to collision with buffered I/O!
[ 3775.596979] File: /tmp/scratch/file2 PID: 12031 Comm: kworker/1:1
[ 3776.642812] Page cache invalidation failure on direct I/O.
Possible data corruption due to collision with buffered I/O!
[ 3776.645041] File: /tmp/scratch/file2 PID: 12033 Comm: kworker/3:0
[ 3776.920634] WARNING: CPU: 0 PID: 12036 at fs/btrfs/inode.c:9319
btrfs_destroy_inode+0x1d5/0x290 [btrfs]



Which warning is this in your tree?  The file_write patch is more 
likely
to have screwed up our bits and the fixup worker is more likely to 
have

screwed up nrpages.


 9311 void btrfs_destroy_inode(struct inode *inode)
 9312 {
 9313 struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
 9314 struct btrfs_ordered_extent *ordered;
 9315 struct btrfs_root *root = BTRFS_I(inode)->root;
 9316
 9317 WARN_ON(!hlist_empty(>i_dentry));
 9318 WARN_ON(inode->i_data.nrpages);
 9319 WARN_ON(BTRFS_I(inode)->block_rsv.reserved);

The branch is the last pull, ie. no other 4.18-rc1 stuff plus your two 
patches.


The bug came here:

commit a528a24150870c5c16cbbbec69dba7e992b08456
Author: Souptick Joarder 
Date:   Wed Jun 6 19:54:44 2018 +0530

btrfs: change return type of btrfs_page_mkwrite to vm_fault_t

When page->mapping != mapping, we improperly return success because ret2 
is zero on goto out_unlock.


I'm running a fix through a full xfstests and I'll post soon.

-chris
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH] btrfs: Remove V0 extent support

2018-06-21 Thread Chris Mason




On 21 Jun 2018, at 5:22, David Sterba wrote:


On Thu, Jun 21, 2018 at 09:45:00AM +0300, Nikolay Borisov wrote:

The v0 compat code was introduced in commit 5d4f98a28c7d
("Btrfs: Mixed back reference  (FORWARD ROLLING FORMAT CHANGE)") 9
years ago, which was merged in 2.6.31. This means that the code is
there to support filesystems which are _VERY_ old and if you are 
using
btrfs on such an old kernel, you have much bigger problems. This 
coupled
with the fact that no one is likely testing/maintining this code 
likely

means it has bugs lurking. All things considered I think 43 kernel
releases later it's high time this remnant of the past got removed.


Ack. Given the age of the format and nearly zero chance of such
filesystem to be still in use it should be ok to remove it directly 
and

not add some sort of warning when such filesystem is found.


Agreed, it seems very unlikely to me that we have happy v0 users.



This patch removes all code wrapped in #ifdefs but leaves the BUG_ONs 
in case

we have a v0 with no support intact as a sort of safety-net.


I think there should be some verbose way to report what happened and
possibly do proper error handling and go to read-only eventually.
The type of checks for v0 do not seem to be something that could be 
done

at mount time.


Yeah, with the code we have right now, it's really obvious the BUG()s 
are for V0 extents.  With all the ifdefs removed its much less clear.  
I'd swap them out for a transaction abort with a message about the v0 
extents.


Also, there's at least one ASSERT() left over in a #else, which I would 
also turn into a more explicit abort or at least comment.


-chris
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH RFC 0/2] Btrfs: fix file data corruptions due to lost dirty bits

2018-06-21 Thread Chris Mason




On 20 Jun 2018, at 15:33, David Sterba wrote:


On Wed, Jun 20, 2018 at 07:56:10AM -0700, Chris Mason wrote:
We've been hunting the root cause of data crc errors here at FB for a 
while.
We'd find one or two corrupted files, usually displaying crc errors 
without any
corresponding IO errors from the storage.  The bug was rare enough 
that we'd
need to watch a large number of machines for a few days just to catch 
it

happening.

We're still running these patches through testing, but the fixup 
worker bug
seems to account for the vast majority of crc errors we're seeing in 
the fleet.
It's cleaning pages that were dirty, and creating a window where they 
can be

reclaimed before we finish processing the page.


I'm having flashbacks when I see 'fixup worker', and the test 
generic/208 does

not make it better:

generic/095		[18:07:03][ 3769.317862] run fstests generic/095 at 
2018-06-20 18:07:03
[ 3774.849685] BTRFS: device fsid 3acffad9-28e5-43ce-80e1-f5032e334cba 
devid 1 transid 5 /dev/vdb

[ 3774.875409] BTRFS info (device vdb): disk space caching is enabled
[ 3774.877723] BTRFS info (device vdb): has skinny extents
[ 3774.879371] BTRFS info (device vdb): flagging fs with big metadata 
feature

[ 3774.885020] BTRFS info (device vdb): checking UUID tree
[ 3775.593329] Page cache invalidation failure on direct I/O.  
Possible data corruption due to collision with buffered I/O!

[ 3775.596979] File: /tmp/scratch/file2 PID: 12031 Comm: kworker/1:1
[ 3776.642812] Page cache invalidation failure on direct I/O.  
Possible data corruption due to collision with buffered I/O!

[ 3776.645041] File: /tmp/scratch/file2 PID: 12033 Comm: kworker/3:0
[ 3776.920634] WARNING: CPU: 0 PID: 12036 at fs/btrfs/inode.c:9319 
btrfs_destroy_inode+0x1d5/0x290 [btrfs]
[ 3776.924182] Modules linked in: btrfs libcrc32c xor zstd_decompress 
zstd_compress xxhash raid6_pq loop [last unloaded: libcrc32c]
[ 3776.927703] CPU: 0 PID: 12036 Comm: umount Not tainted 
4.17.0-rc7-default+ #153
[ 3776.929164] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), 
BIOS 1.0.0-prebuilt.qemu-project.org 04/01/2014

[ 3776.931006] RIP: 0010:btrfs_destroy_inode+0x1d5/0x290 [btrfs]


Running generic/095 on current Linus git (without my patches), I'm 
seeing this same warning.  This makes me a little happy because I have 
my patches in prod, but mostly sad because it's easier to find when the 
suspect pool is small.  I'll bisect.


-chris
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH RFC 0/2] Btrfs: fix file data corruptions due to lost dirty bits

2018-06-20 Thread Chris Mason




On 20 Jun 2018, at 15:33, David Sterba wrote:


On Wed, Jun 20, 2018 at 07:56:10AM -0700, Chris Mason wrote:
We've been hunting the root cause of data crc errors here at FB for a 
while.
We'd find one or two corrupted files, usually displaying crc errors 
without any
corresponding IO errors from the storage.  The bug was rare enough 
that we'd
need to watch a large number of machines for a few days just to catch 
it

happening.

We're still running these patches through testing, but the fixup 
worker bug
seems to account for the vast majority of crc errors we're seeing in 
the fleet.
It's cleaning pages that were dirty, and creating a window where they 
can be

reclaimed before we finish processing the page.


I'm having flashbacks when I see 'fixup worker',


Yeah, I don't understand how so much pain can live in one little 
function.



and the test generic/208 does not make it better:

generic/095		[18:07:03][ 3769.317862] run fstests generic/095 at 
2018-06-20 18:07:03


Hmpf, I pass both 095 and 208 here.

[ 3774.849685] BTRFS: device fsid 3acffad9-28e5-43ce-80e1-f5032e334cba 
devid 1 transid 5 /dev/vdb

[ 3774.875409] BTRFS info (device vdb): disk space caching is enabled
[ 3774.877723] BTRFS info (device vdb): has skinny extents
[ 3774.879371] BTRFS info (device vdb): flagging fs with big metadata 
feature

[ 3774.885020] BTRFS info (device vdb): checking UUID tree
[ 3775.593329] Page cache invalidation failure on direct I/O.  
Possible data corruption due to collision with buffered I/O!

[ 3775.596979] File: /tmp/scratch/file2 PID: 12031 Comm: kworker/1:1
[ 3776.642812] Page cache invalidation failure on direct I/O.  
Possible data corruption due to collision with buffered I/O!

[ 3776.645041] File: /tmp/scratch/file2 PID: 12033 Comm: kworker/3:0
[ 3776.920634] WARNING: CPU: 0 PID: 12036 at fs/btrfs/inode.c:9319 
btrfs_destroy_inode+0x1d5/0x290 [btrfs]



Which warning is this in your tree?  The file_write patch is more likely 
to have screwed up our bits and the fixup worker is more likely to have 
screwed up nrpages.


-chris
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH RFC 0/2] Btrfs: fix file data corruptions due to lost dirty bits

2018-06-20 Thread Chris Mason
We've been hunting the root cause of data crc errors here at FB for a while.
We'd find one or two corrupted files, usually displaying crc errors without any
corresponding IO errors from the storage.  The bug was rare enough that we'd
need to watch a large number of machines for a few days just to catch it
happening.

We're still running these patches through testing, but the fixup worker bug
seems to account for the vast majority of crc errors we're seeing in the fleet.
It's cleaning pages that were dirty, and creating a window where they can be
reclaimed before we finish processing the page.

btrfs_file_write() has a similar bug when copy_from_user catches a page fault
and we're writing to a page that was already dirty when file_write started.
This one is much harder to trigger, and I haven't confirmed yet that we're
seeing it in the fleet.

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 2/2] Btrfs: keep pages dirty when using btrfs_writepage_fixup_worker

2018-06-20 Thread Chris Mason
For COW, btrfs expects pages dirty pages to have been through a few setup
steps.  This includes reserving space for the new block allocations and marking
the range in the state tree for delayed allocation.

A few places outside btrfs will dirty pages directly, especially when unmapping
mmap'd pages.  In order for these to properly go through COW, we run them
through a fixup worker to wait for stable pages, and do the delalloc prep.

87826df0ec36 added a window where the dirty pages were cleaned, but pending
more action from the fixup worker.  During this window, page migration can jump
in and relocate the page.  Once our fixup work actually starts, it finds
page->mapping is NULL and we end up freeing the page without ever writing it.

This leads to crc errors and other exciting problems, since it screws up the
whole statemachine for waiting for ordered extents.  The fix here is to keep
the page dirty while we're waiting for the fixup worker to get to work.  This
also makes sure the error handling in btrfs_writepage_fixup_worker does the
right thing with dirty bits when we run out of space.

Signed-off-by: Chris Mason 
---
 fs/btrfs/inode.c | 67 +---
 1 file changed, 49 insertions(+), 18 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 0b86cf1..5538900 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -2100,11 +2100,21 @@ static void btrfs_writepage_fixup_worker(struct 
btrfs_work *work)
page = fixup->page;
 again:
lock_page(page);
-   if (!page->mapping || !PageDirty(page) || !PageChecked(page)) {
-   ClearPageChecked(page);
+
+   /*
+* before we queued this fixup, we took a reference on the page.
+* page->mapping may go NULL, but it shouldn't be moved to a
+* different address space.
+*/
+   if (!page->mapping || !PageDirty(page) || !PageChecked(page))
goto out_page;
-   }
 
+   /*
+* we keep the PageChecked() bit set until we're done with the
+* btrfs_start_ordered_extent() dance that we do below.  That
+* drops and retakes the page lock, so we don't want new
+* fixup workers queued for this page during the churn.
+*/
inode = page->mapping->host;
page_start = page_offset(page);
page_end = page_offset(page) + PAGE_SIZE - 1;
@@ -2129,33 +2139,46 @@ static void btrfs_writepage_fixup_worker(struct 
btrfs_work *work)
 
ret = btrfs_delalloc_reserve_space(inode, _reserved, page_start,
   PAGE_SIZE);
-   if (ret) {
-   mapping_set_error(page->mapping, ret);
-   end_extent_writepage(page, ret, page_start, page_end);
-   ClearPageChecked(page);
-   goto out;
-}
+   if (ret)
+   goto out_error;
 
ret = btrfs_set_extent_delalloc(inode, page_start, page_end, 0,
_state, 0);
-   if (ret) {
-   mapping_set_error(page->mapping, ret);
-   end_extent_writepage(page, ret, page_start, page_end);
-   ClearPageChecked(page);
-   goto out;
-   }
+   if (ret)
+   goto out_error;
 
-   ClearPageChecked(page);
-   set_page_dirty(page);
btrfs_delalloc_release_extents(BTRFS_I(inode), PAGE_SIZE, false);
+
+   /*
+* everything went as planned, we're now the proud owners of a
+* Dirty page with delayed allocation bits set and space reserved
+* for our COW destination.
+*
+* The page was dirty when we started, nothing should have cleaned it.
+*/
+   BUG_ON(!PageDirty(page));
+
 out:
unlock_extent_cached(_I(inode)->io_tree, page_start, page_end,
 _state);
 out_page:
+   ClearPageChecked(page);
unlock_page(page);
put_page(page);
kfree(fixup);
extent_changeset_free(data_reserved);
+   return;
+
+out_error:
+   /*
+* We hit ENOSPC or other errors.  Update the mapping and page to
+* reflect the errors and clean the page.
+*/
+   mapping_set_error(page->mapping, ret);
+   end_extent_writepage(page, ret, page_start, page_end);
+   clear_page_dirty_for_io(page);
+   SetPageError(page);
+   goto out;
 }
 
 /*
@@ -2179,6 +2202,13 @@ static int btrfs_writepage_start_hook(struct page *page, 
u64 start, u64 end)
if (TestClearPagePrivate2(page))
return 0;
 
+   /*
+* PageChecked is set below when we create a fixup worker for this page,
+* don't try to create another one if we're already PageChecked()
+*
+* The extent_io writepage code will redirty the page if we send
+* back EAGAIN.
+*/
if (PageChecked(page))
return -EAGAIN;
 
@@ -2192,7 +,8 @@ stat

[PATCH 1/2] Btrfs: don't clean dirty pages during buffered writes

2018-06-20 Thread Chris Mason
During buffered writes, we follow this basic series of steps:

again:
lock all the pages
wait for writeback on all the pages
Take the extent range lock
wait for ordered extents on the whole range
clean all the pages

if (copy_from_user_in_atomic() hits a fault) {
drop our locks
goto again;
}

dirty all the pages
release all the locks

The extra waiting, cleaning and locking are there to make sure we don't
modify pages in flight to the drive, after they've been crc'd.

If some of the pages in the range were already dirty when the write
began, and we need to goto again, we create a window where a dirty page
has been cleaned and unlocked.  It may be reclaimed before we're able to
lock it again, which means we'll read the old contents off the drive and
lose any modifications that had been pending writeback.

We don't actually need to clean the pages.  All of the other locking in
place makes sure we don't start IO on the pages, so we can just leave
them dirty for the duration of the write.

Fixes: 73d59314e6ed (the original btrfs merge)
Signed-off-by: Chris Mason 
---
 fs/btrfs/file.c | 30 --
 1 file changed, 24 insertions(+), 6 deletions(-)

diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index f660ba1..89ec4d2 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -534,6 +534,15 @@ int btrfs_dirty_pages(struct inode *inode, struct page 
**pages,
 
end_of_last_block = start_pos + num_bytes - 1;
 
+   /*
+* the pages may have already been dirty, clear out old accounting
+* so we can set things up properly
+*/
+   clear_extent_bit(_I(inode)->io_tree, start_pos, end_of_last_block,
+EXTENT_DIRTY | EXTENT_DELALLOC |
+EXTENT_DO_ACCOUNTING | EXTENT_DEFRAG, 0, 0,
+cached);
+
if (!btrfs_is_free_space_inode(BTRFS_I(inode))) {
if (start_pos >= isize &&
!(BTRFS_I(inode)->flags & BTRFS_INODE_PREALLOC)) {
@@ -1504,18 +1513,27 @@ lock_and_cleanup_extent_if_need(struct btrfs_inode 
*inode, struct page **pages,
}
if (ordered)
btrfs_put_ordered_extent(ordered);
-   clear_extent_bit(>io_tree, start_pos, last_pos,
-EXTENT_DIRTY | EXTENT_DELALLOC |
-EXTENT_DO_ACCOUNTING | EXTENT_DEFRAG,
-0, 0, cached_state);
+
*lockstart = start_pos;
*lockend = last_pos;
ret = 1;
}
 
+   /*
+* It's possible the pages are dirty right now, but we don't want
+* to clean them yet because copy_from_user may catch a page fault
+* and we might have to fall back to one page at a time.  If that
+* happens, we'll unlock these pages and we'd have a window where
+* reclaim could sneak in and drop the once-dirty page on the floor
+* without writing it.
+*
+* We have the pages locked and the extent range locked, so there's
+* no way someone can start IO on any dirty pages in this range.
+*
+* we'll call btrfs_dirty_pages() later on, and that will flip around
+* delalloc bits and dirty the pages as required.
+*/
for (i = 0; i < num_pages; i++) {
-   if (clear_page_dirty_for_io(pages[i]))
-   account_page_redirty(pages[i]);
set_page_extent_mapped(pages[i]);
WARN_ON(!PageLocked(pages[i]));
}
-- 
2.9.5

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [LKP] [lkp-robot] [mm] 9092c71bb7: blogbench.write_score -12.3% regression

2018-06-20 Thread Chris Mason

On 19 Jun 2018, at 23:51, Huang, Ying wrote:

"Huang, Ying"  writes:


Hi, Josef,

Do you have time to take a look at the regression?

kernel test robot  writes:


Greeting,

FYI, we noticed a -12.3% regression of blogbench.write_score and a 
+9.6% improvement

of blogbench.read_score due to commit:


commit: 9092c71bb724dba2ecba849eae69e5c9d39bd3d2 ("mm: use 
sc->priority for slab shrink targets")
https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git 
master


in testcase: blogbench
on test machine: 16 threads Intel(R) Xeon(R) CPU D-1541 @ 2.10GHz 
with 8G memory

with following parameters:

disk: 1SSD
fs: btrfs
cpufreq_governor: performance

test-description: Blogbench is a portable filesystem benchmark 
that tries to reproduce the load of a real-world busy file server.

test-url:


I'm surprised, this patch is a big win in production here at FB.  I'll 
have to reproduce these results to better understand what is going on.  
My first guess is that since we have fewer inodes in slab, we're reading 
more inodes from disk in order to do the writes.


But that should also make our read scores lower.

-chris
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Bug 199931] New: systemd/rtorrent file data corruption when using echo 3 >/proc/sys/vm/drop_caches

2018-06-06 Thread Chris Mason

On 6 Jun 2018, at 9:38, Liu Bo wrote:


On Wed, Jun 6, 2018 at 8:18 AM, Chris Mason  wrote:



On 5 Jun 2018, at 16:03, Andrew Morton wrote:

(switched to email.  Please respond via emailed reply-to-all, not 
via the

bugzilla web interface).

On Tue, 05 Jun 2018 18:01:36 + 
bugzilla-dae...@bugzilla.kernel.org

wrote:


https://bugzilla.kernel.org/show_bug.cgi?id=199931

Bug ID: 199931
   Summary: systemd/rtorrent file data corruption when 
using echo

3 >/proc/sys/vm/drop_caches



A long tale of woe here.  Chris, do you think the pagecache 
corruption

is a general thing, or is it possible that btrfs is contributing?

Also, that 4.4 oom-killer regression sounds very serious.



This week I found a bug in btrfs file write with how we handle stable 
pages.

Basically it works like this:

write(fd, some bytes less than a page)
write(fd, some bytes into the same page)
btrfs prefaults the userland page
lock_and_cleanup_extent_if_need()   <- stable pages
wait for writeback()
clear_page_dirty_for_io()

At this point we have a page that was dirty and is now clean.  That's
normally fine, unless our prefaulted page isn't in ram anymore.

iov_iter_copy_from_user_atomic() <--- uh oh

If the copy_from_user fails, we drop all our locks and retry.  But 
along the
way, we completely lost the dirty bit on the page.  If the page is 
dropped

by drop_caches, the writes are lost.  We'll just read back the stale
contents of that page during the retry loop.  This won't result in 
crc

errors because the bytes we lost were never crc'd.



So we're going to carefully redirty the page under the page lock, 
right?


I don't think we actually need to clean it.  We have the page locked, 
writeback won't start until we unlock.




It could result in zeros in the file because we're basically reading 
a hole,
and those zeros could move around in the page depending on which part 
of the

page was dirty when the writes were lost.



I got a question, while re-reading this page, wouldn't it read
old/stale on-disk data?


If it was never written we should be treating it like a hole, but I'll 
double check.


-chris
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Bug 199931] New: systemd/rtorrent file data corruption when using echo 3 >/proc/sys/vm/drop_caches

2018-06-05 Thread Chris Mason




On 5 Jun 2018, at 16:03, Andrew Morton wrote:

(switched to email.  Please respond via emailed reply-to-all, not via 
the

bugzilla web interface).

On Tue, 05 Jun 2018 18:01:36 + bugzilla-dae...@bugzilla.kernel.org 
wrote:



https://bugzilla.kernel.org/show_bug.cgi?id=199931

Bug ID: 199931
   Summary: systemd/rtorrent file data corruption when using 
echo

3 >/proc/sys/vm/drop_caches


A long tale of woe here.  Chris, do you think the pagecache corruption
is a general thing, or is it possible that btrfs is contributing?

Also, that 4.4 oom-killer regression sounds very serious.


This week I found a bug in btrfs file write with how we handle stable 
pages.  Basically it works like this:


write(fd, some bytes less than a page)
write(fd, some bytes into the same page)
btrfs prefaults the userland page
lock_and_cleanup_extent_if_need()   <- stable pages
wait for writeback()
clear_page_dirty_for_io()

At this point we have a page that was dirty and is now clean.  That's 
normally fine, unless our prefaulted page isn't in ram anymore.


iov_iter_copy_from_user_atomic() <--- uh oh

If the copy_from_user fails, we drop all our locks and retry.  But along 
the way, we completely lost the dirty bit on the page.  If the page is 
dropped by drop_caches, the writes are lost.  We'll just read back the 
stale contents of that page during the retry loop.  This won't result in 
crc errors because the bytes we lost were never crc'd.


It could result in zeros in the file because we're basically reading a 
hole, and those zeros could move around in the page depending on which 
part of the page was dirty when the writes were lost.


I spent a morning trying to trigger this with drop_caches and couldn't 
make it happen, even with schedule_timeout()s inserted and other tricks. 
 But I was able to get corruptions if I manually invalidated pages in 
the critical section.


I'm working on a patch, and I'll check and see if any of the other 
recent fixes Dave integrated may have a less exotic explanation.


-chris
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] Btrfs: implement unlocked buffered write

2018-05-24 Thread Chris Mason

On 24 May 2018, at 4:46, robbieko wrote:


Chris Mason 於 2018-05-23 23:56 寫到:

On 23 May 2018, at 3:26, robbieko wrote:


But we're not avoiding the inode lock completely, we're just 
dropping

it for the expensive parts of writing to the file.  A quick guess
about what the expensive parts are:

1) balance_dirty_pages()
2) btrfs_btree_balance_dirty()
3) metadata reservations/enospc waiting.



The expensive part of buffered_write are:
1. prepare_pages()
--wait_on_page_writeback()
Because writeback submit page to PG_writeback.
We must wait until the page writeback IO ends.


Is this based on timing the fio job or something else?  We can 
trigger

a stable page prep run before we take the mutex, but stable pages
shouldn't be part of random IO workloads unless we're doing random IO
inside a file that mostly fits in ram.



This problem is easily encountered in the context of VMs and 
File-based iSCSI LUNs.

Most of them are random write pattern.
Fio can quickly simulate the problem.


With really random IO, to a device bigger than memory, the chances of us 
writing to the same block twice with it still in flight are pretty 
small.  With VMs and other virtual block devices backed by btrfs, it's 
easier to hit stable pages because filesystems try pretty hard to be 
less than completely random.




So which is the best way?
1. unlocked buffered write (like direct-io)
2. stable page prep
3. Or is there any way to completely avoid stable pages?


balance_dirty_pages() is a much more common waiting point, and doing
that with the inode lock held definitely adds contention.


Yes. I agree.
But for latency, balance_dirty_pages is usually a relatively smooth 
latency,

lock_and_cleanup_extent_if_need and prepare_pages are dependent on IO,
so the latency is a multiple of growth.


I think the best way is to peel out a stable page write wait before the 
inode lock is taken, and only when the write is inside i_size.  We'll 
need to keep the existing stable page wait as well, but waiting before 
the inode lock is taken should handle the vast majority of IO in flight.


I'd also push the balance_dirty_pages() outside the inode lock, as long 
as the write is relatively small.


For really getting rid of stable pages, that's a bigger project ;)

-chris
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] Btrfs: implement unlocked buffered write

2018-05-23 Thread Chris Mason



On 23 May 2018, at 2:37, Christoph Hellwig wrote:


On Tue, May 22, 2018 at 02:31:36PM -0400, Chris Mason wrote:

And what protects two writes from interleaving their results now?


page locks...ish, we at least won't have results interleaved in a 
single
page.  For btrfs it'll actually be multiple pages since we try to do 
more

than one at a time.


I think you are going to break just about every assumption people
have that any single write is going to be atomic vs another write.

E.g. this comes from the posix read definition reference by the
write definition:

"I/O is intended to be atomic to ordinary files and pipes and FIFOs.
Atomic means that all the bytes from a single operation that started 
out

together end up together, without interleaving from other I/O
operations. It is a known attribute of terminals that this is not
honored, and terminals are explicitly (and implicitly permanently)
excepted, making the behavior unspecified. The behavior for other 
device

types is also left unspecified, but the wording is intended to imply
that future standards might choose to specify atomicity (or not).
"

Without taking the inode lock (or some sort of range lock) you can
easily interleave data from two write requests.


"This volume of IEEE Std 1003.1-2001 does not specify behavior of 
concurrent writes to a file from multiple processes. Applications should 
use some form of concurrency control."


I'm always more worried about truncate than standards ;)  But just to be 
clear, I'm not disagreeing with you at all.  Interleaved writes just 
wasn't the first thing that jumped to my mind.




But we're not avoiding the inode lock completely, we're just dropping 
it for
the expensive parts of writing to the file.  A quick guess about what 
the

expensive parts are:


The way I read the patch it basically 'avoids' the inode lock for 
almost

the whole write call, just minus some setup.


Yeah, if we can get 90% of the way there by pushing some 
balance_dirty_pages() outside the lock (or whatever other expensive 
setup we're doing), I'd by much happier.


-chris
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] Btrfs: implement unlocked buffered write

2018-05-23 Thread Chris Mason

On 23 May 2018, at 3:26, robbieko wrote:


Chris Mason 於 2018-05-23 02:31 寫到:

On 22 May 2018, at 14:08, Christoph Hellwig wrote:


On Wed, May 16, 2018 at 11:52:37AM +0800, robbieko wrote:

From: Robbie Ko <robbi...@synology.com>

This idea is from direct io. By this patch, we can make the 
buffered
write parallel, and improve the performance and latency. But 
because we
can not update isize without i_mutex, the unlocked buffered write 
just

can be done in front of the EOF.

We needn't worry about the race between buffered write and 
truncate,

because the truncate need wait until all the buffered write end.

And we also needn't worry about the race between dio write and 
punch hole,

because we have extent lock to protect our operation.

I ran fio to test the performance of this feature.


And what protects two writes from interleaving their results now?


page locks...ish, we at least won't have results interleaved in a
single page.  For btrfs it'll actually be multiple pages since we try
to do more than one at a time.

I haven't verified all the assumptions around truncate and fallocate
and friends expecting the dio special locking to be inside i_size.  
In

general this makes me a little uncomfortable.

But we're not avoiding the inode lock completely, we're just dropping
it for the expensive parts of writing to the file.  A quick guess
about what the expensive parts are:

1) balance_dirty_pages()
2) btrfs_btree_balance_dirty()
3) metadata reservations/enospc waiting.



The expensive part of buffered_write are:
1. prepare_pages()
--wait_on_page_writeback()
Because writeback submit page to PG_writeback.
We must wait until the page writeback IO ends.


Is this based on timing the fio job or something else?  We can trigger a 
stable page prep run before we take the mutex, but stable pages 
shouldn't be part of random IO workloads unless we're doing random IO 
inside a file that mostly fits in ram.


balance_dirty_pages() is a much more common waiting point, and doing 
that with the inode lock held definitely adds contention.




2. lock_and_cleanup_extent_if_need
--btrfs_start_ordered_extent
When a large number of ordered_extent queue is in 
endio_write_workers workqueue.
Buffered_write assumes that ordered_extent is the last one in the 
endio_write_workers workqueue,
and waits for all ordered_extents to be processed before because 
the workqueue is a FIFO.




This isn't completely accurate, but it falls into the stable pages 
bucket as well.  We can push a lighter version of the stable page wait 
before the inode lock when the IO is inside of i_size.  It won't 
completely remove the possibility that someone else dirties those pages 
again, but if the workload is random or really splitting up the file, 
it'll make the work done with the lock held much less.


-chris
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] Btrfs: implement unlocked buffered write

2018-05-22 Thread Chris Mason

On 22 May 2018, at 14:08, Christoph Hellwig wrote:


On Wed, May 16, 2018 at 11:52:37AM +0800, robbieko wrote:

From: Robbie Ko 

This idea is from direct io. By this patch, we can make the buffered
write parallel, and improve the performance and latency. But because 
we
can not update isize without i_mutex, the unlocked buffered write 
just

can be done in front of the EOF.

We needn't worry about the race between buffered write and truncate,
because the truncate need wait until all the buffered write end.

And we also needn't worry about the race between dio write and punch 
hole,

because we have extent lock to protect our operation.

I ran fio to test the performance of this feature.


And what protects two writes from interleaving their results now?


page locks...ish, we at least won't have results interleaved in a single 
page.  For btrfs it'll actually be multiple pages since we try to do 
more than one at a time.


I haven't verified all the assumptions around truncate and fallocate and 
friends expecting the dio special locking to be inside i_size.  In 
general this makes me a little uncomfortable.


But we're not avoiding the inode lock completely, we're just dropping it 
for the expensive parts of writing to the file.  A quick guess about 
what the expensive parts are:


1) balance_dirty_pages()
2) btrfs_btree_balance_dirty()
3) metadata reservations/enospc waiting.

Can I bribe you to benchmark how much each of those things is impacting 
the iops/latency benefits?


-chris
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: verify key failure

2018-05-14 Thread Chris Mason



On 14 May 2018, at 10:35, Liu Bo wrote:


Hi,

I got another warning of verify_level_key by running btrfs/124 in a 
loop, I'm testing against 4.17-rc3.


Not sure if it's false positive.


How long does this take to trigger?

-chris
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/2 V2] hoist BTRFS_IOC_[SG]ET_FSLABEL to vfs

2018-05-11 Thread Chris Mason

On 11 May 2018, at 10:10, David Sterba wrote:


On Thu, May 10, 2018 at 08:16:09PM +0100, Al Viro wrote:

On Thu, May 10, 2018 at 01:13:57PM -0500, Eric Sandeen wrote:

Move the btrfs label ioctls up to the vfs for general use.

This retains 256 chars as the maximum size through the interface, 
which

is the btrfs limit and AFAIK exceeds any other filesystem's maximum
label size.

Signed-off-by: Eric Sandeen 
Reviewed-by: Andreas Dilger 
Reviewed-by: David Sterba 


No objections (and it obviously ought to go through btrfs tree).


I can take it through my tree, but Eric mentioned that there's a patch
for xfs that depends on it. In this case it would make sense to take
both patches at once via the xfs tree. There are no pending 
conflicting

changes in btrfs.


Probably easiest to just have a separate pull dedicated just for this 
series.  That way it doesn't really matter which tree it goes through.


-chris
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: metadata_ratio mount option?

2018-05-07 Thread Chris Mason



On 7 May 2018, at 12:16, Martin Svec wrote:


Hello Chris,

Dne 7.5.2018 v 16:49 Chris Mason napsal(a):

On 7 May 2018, at 7:40, Martin Svec wrote:


Hi,

According to man btrfs [1], I assume that metadata_ratio=1 mount 
option should
force allocation of one metadata chunk after every allocated data 
chunk. However,
when I set this option and start filling btrfs with "dd if=/dev/zero 
of=dummyfile.dat",
only data chunks are allocated but no metadata ones. So, how does 
the metadata_ratio

option really work?

Note that I'm trying to use this option as a workaround of the bug 
reported here:




[ urls that FB email server eats, sorry ]


It's link to "Btrfs remounted read-only due to ENOSPC in 
btrfs_run_delayed_refs" thread :)


Oh yeah, the link worked fine, it just goes through this url defense 
monster that munges it in replies.








i.e. I want to manually preallocate metadata chunks to avoid nightly 
ENOSPC errors.



metadata_ratio is almost but not quite what you want.  It sets a 
flag on the space_info to force a
chunk allocation the next time we decide to call 
should_alloc_chunk().  Thanks to the overcommit
code, we usually don't call that until the metadata we think we're 
going to need is bigger than
the metadata space available.  In other words, by the time we're 
into the code that honors the
force flag, reservations are already high enough to make us allocate 
the chunk anyway.


Yeah, that's how I understood the code. So I think metadata_ratio man 
section is quite confusing
because it implies that btrfs guarantees given metadata to data chunk 
space ratio, which isn't true.




I tried to use metadata_ratio to experiment with forcing more 
metadata slop space, but really I

have to tweak the overcommit code first.
Omar beat me to a better solution, tracking down our transient ENOSPC 
problems here at FB to
reservations done for orphans.  Do you have a lot of deleted files 
still being held open?  lsof

/mntpoint | grep deleted will list them.


I'll take a look during backup window. The initial bug report 
describes our rsync workload in

detail, for your reference.



We're working through a patch for the orphans here.  You've got a 
ton of bytes pinned, which isn't

a great match for the symptoms we see:

[285169.096630] BTRFS info (device sdb): space_info 4 has 
18446744072120172544 free, is not full
[285169.096633] BTRFS info (device sdb): space_info 
total=273804165120, used=269218267136,
pinned=3459629056, reserved=52396032, may_use=2663120896, 
readonly=131072


But, your may_use count is high enough that you might be hitting this 
problem.  Otherwise I'll
work out a patch to make some more metadata chunks while Josef is 
perfecting his great delayed ref

update.


As mentioned in the bug report, we have a custom patch that dedicates 
SSDs for metadata chunks and
HDDs for data chunks. So, all we need is to preallocate metadata 
chunks to occupy all of the SSD

space and our issues will be gone.
Note that btrfs with SSD-backed metadata works absolutely great for 
rsync backups, even if there're
billions of files and thousands of snapshots. The global reservation 
ENOSPC is the last issue we're

struggling with.


Great, we'll get this nailed down, thanks!

-chris
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: metadata_ratio mount option?

2018-05-07 Thread Chris Mason

On 7 May 2018, at 7:40, Martin Svec wrote:


Hi,

According to man btrfs [1], I assume that metadata_ratio=1 mount 
option should
force allocation of one metadata chunk after every allocated data 
chunk. However,
when I set this option and start filling btrfs with "dd if=/dev/zero 
of=dummyfile.dat",
only data chunks are allocated but no metadata ones. So, how does the 
metadata_ratio

option really work?

Note that I'm trying to use this option as a workaround of the bug 
reported here:




[ urls that FB email server eats, sorry ]



i.e. I want to manually preallocate metadata chunks to avoid nightly 
ENOSPC errors.



metadata_ratio is almost but not quite what you want.  It sets a flag on 
the space_info to force a chunk allocation the next time we decide to 
call should_alloc_chunk().  Thanks to the overcommit code, we usually 
don't call that until the metadata we think we're going to need is 
bigger than the metadata space available.  In other words, by the time 
we're into the code that honors the force flag, reservations are already 
high enough to make us allocate the chunk anyway.


I tried to use metadata_ratio to experiment with forcing more metadata 
slop space, but really I have to tweak the overcommit code first.
Omar beat me to a better solution, tracking down our transient ENOSPC 
problems here at FB to reservations done for orphans.  Do you have a lot 
of deleted files still being held open?  lsof /mntpoint | grep deleted 
will list them.


We're working through a patch for the orphans here.  You've got a ton of 
bytes pinned, which isn't a great match for the symptoms we see:


[285169.096630] BTRFS info (device sdb): space_info 4 has 
18446744072120172544 free, is not full
[285169.096633] BTRFS info (device sdb): space_info total=273804165120, 
used=269218267136, pinned=3459629056, reserved=52396032, 
may_use=2663120896, readonly=131072


But, your may_use count is high enough that you might be hitting this 
problem.  Otherwise I'll work out a patch to make some more metadata 
chunks while Josef is perfecting his great delayed ref update.


-chris


--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Inconsistent behavior of fsync in btrfs

2018-04-30 Thread Chris Mason



On 29 Apr 2018, at 18:16, Theodore Y. Ts'o wrote:


On Sun, Apr 29, 2018 at 03:55:39PM -0500, Vijay Chidambaram wrote:

In the spirit of clarifying fsync behavior, we have one more case
where we'd like to find out what should be expected.

Consider this:

Mkdir A
Creat A/bar
Fsync A/bar
Rename A to B
Fsync B/bar
-- Crash --

A/bar has been fsynced previously, so its not a newly created file.
After the crash, in ext4 and btrfs, can we expect directory B and
B/bar to exist?


or ext4, no.  The POSIX semantics apply: bar will *either* be in A,
or in B.


Same for btrfs.  If the rename for B goes down, it'll be a side effect 
of other decisions and not on purpose.  I'd actually like for the rename 
to be on disk in the normal case, but we won't always be able to catch 
it.




If you modify the file bar such that the mod time has been updated,
then fsync(2) --- but not necessarily fdatasync(2) --- will cause the
inode modifications to be written committed, and this will cause the
updates to directory B from the rename to be committed as a
side-effect.

Note though that there are plenty of people who consider this to be a
performance bug, and not a feature, and there have been papers
proposed by your fellow academics that if implemented, would change
this to no longer be true.

In general with these sorts of things it would be useful to reason
about this in the context of real world applications and why they want
such guarantees.  These guarantees can cost performance hits, and so
there is a cost/benefit tradeoff involved.  So my preference is to
negotiate with applicationt writes, and ask *why* they want such
guarantees, and to explore whether there better ways of achieving
their high level goals before we legislate this to be an iron-clad
commitment which might application A happy, but performance-seeking
user B unhappy.


I know this is not POSIX compliant, but from prior comments, it seems
like both ext4 and btrfs would like to persist directory entries upon
fsync of newly created files. So we were wondering if this extended 
to

this case.


We had real world examples of users/applications who suffered data
loss when the directory entries for newly created files were not
persisted.  It was on the basis of these complaints that we made this
commitment, since it seemed more important than the relatively minor
performance hit.



Agreeing with Ted and expanding a bit.  If fsync(some file) doesn't 
persist the name for that file, applications need to fsync the 
directories, which can be double the log commits.  Getting everything 
down to disk in one fsync() is much better for both the application and 
the FS.


-chris
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Inconsistent behavior of fsync in btrfs

2018-04-27 Thread Chris Mason

On 27 Apr 2018, at 10:07, David Sterba wrote:


On Thu, Apr 26, 2018 at 07:59:23PM -0500, Jayashree Mohan wrote:

Thanks for the response. We are using a tool we developed called
CrashMonkey[1] to run crash consistency tests and generate the bug
reports above. We'd be happy to guide you through setting up
CrashMonkey and getting these bugs reproduced. However, if you want 
to

be able to reproduce them with your current setup (xfstest +
dm-flakey), I have the workload scripts attached to the end of the
mail which might make your task simpler.

Interestingly we seem to have found another bug that breaks rename
atomicity and results in a previously fsynced file missing.

Workload:
1. mkdir A
2. creat A/bar (*)
3. fsync A/bar
4. mkdir B
5. creat B/bar
6. rename B/bar A/bar
7. creat A/foo
8. fsync A/foo
9. fsync A
--- crash---

When we recover from the crash, we see that file A/bar goes missing.
If the rename did not persist, we expect to see A/bar(*) created in
step 2 above, or if the rename indeed persisted, we still expect file
A/bar to be present.


I'm no fsync expert and the lack of standard or well defined behaviour
(mentioned elsewhere) leads me to question, on what do you base your
expectations? Not only for this report, but in general during your
testing.

Comparing various filesystems will show that at best it's 
implementation

defined and everybody has their own reasons for doing it one way or
another, or request fsync at particular time etc.

We have a manual page in section 5 that contains general topics of
btrfs, so documenting the fsync specifics would be good.


My goal for the fsync tree log was to make it just do the right thing 
most of the time.  We mostly got there, thanks to a ton of fixes and 
test cases from Filipe.


fsync(some file) -- all the names for this file will exist, without 
having to fsync the directory.


fsync(some dir) -- ugh, don't fsync the directory.  But if you do, all 
the files/subdirs will exist, and unlinks will be done and renames will 
be included.  This is slow and may require a full FS commit, which is 
why we don't want dirs fsunk.


-chris
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Inconsistent behavior of fsync in btrfs

2018-04-27 Thread Chris Mason

On 26 Apr 2018, at 18:59, Jayashree Mohan wrote:


Hi Chris,

Thanks for the response. We are using a tool we developed called
CrashMonkey[1] to run crash consistency tests and generate the bug
reports above. We'd be happy to guide you through setting up
CrashMonkey and getting these bugs reproduced. However, if you want to
be able to reproduce them with your current setup (xfstest +
dm-flakey), I have the workload scripts attached to the end of the
mail which might make your task simpler.

Interestingly we seem to have found another bug that breaks rename
atomicity and results in a previously fsynced file missing.

Workload:
1. mkdir A
2. creat A/bar (*)
3. fsync A/bar
4. mkdir B
5. creat B/bar
6. rename B/bar A/bar
7. creat A/foo
8. fsync A/foo
9. fsync A


The original workloads have been easy to reproduce/fix so far.  I want 
to make sure my patches aren't slowing things down and I'll send them 
out ~Monday.


This one looks similar but I'll double check the rename handling and 
make sure I've got all the cases.


-chris
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Inconsistent behavior of fsync in btrfs

2018-04-26 Thread Chris Mason

On 24 Apr 2018, at 20:35, Jayashree Mohan wrote:


Hi,

While investigating crash consistency bugs on btrfs, we came across
workloads that demonstrate inconsistent behavior of fsync.

Consider the following workload where fsync on the directory did not 
persist it.



Only file B/foo gets persisted, and both A/foo and C/foo are missing.

This seems like inconsistent behavior as only the first fsync persists
the file, while all others don't seem to. Do you agree if this is
indeed incorrect and needs fixing?

All the above tests pass on ext4 and xfs.

Please let us know what you feel about such inconsistency.



The btrfs fsync log is more fine grained than xfs/ext, but fsync(any 
file) should be enough to persist that file in its current directory.


I'll get these reproduced and see if we can nail down why we're not 
logging the location properly.


-chris
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Periodic frame losses when recording to btrfs volume with OBS

2018-01-24 Thread Chris Mason
On 01/22/2018 04:17 PM, Sebastian Ochmann wrote:
> Hello,
> 
> I attached to the ffmpeg-mux process for a little while and pasted the 
> result here:
> 
> https://urldefense.proofpoint.com/v2/url?u=https-3A__pastebin.com_XHaMLX8z=DwIDaQ=5VD0RTtNlTh3ycd41b3MUw=9QPtTAxcitoznaWRKKHoEQ=IkofqwZ_S5C0_qAXjt4EQae-mVE09Ir8zmSbuGqXaCs=1nw7xUkEoQF7MgYOlZ8iAA9U0UsRQObH1Z4VLqx8IF8=
>  
> 
> 
> Can you help me with interpreting this result? If you'd like me to run 
> strace with specific options, please let me know. This is a level of 
> debugging I'm not dealing with on a daily basis. :)
> 

Going to guess it's these sequences:

lseek(3, 1302012898, SEEK_SET)  = 1302012898
write(3, 
"\37C\266u\1\377\377\377\377\377\377\377\277\204|\271\347J\347\203\3@\0\243CY\202\0\0\0!\21"...,
 262144) = 262144
write(3, 
"\310\22\323g7J#h\351\0\323\270\f\206\5\207(.\232\246\27\371/\376\341\0\0\200\th\3\37"...,
 262144) = 262144
write(3, 
"\225*\245<8N\32\263\237k\331]\313\215\301\366$\7\216\0349\302AS\201\302\307T\361\365\3375"...,
 262144) = 262144
write(3, 
"\272e\37\255\250\24n\235\341E\272Me\36'\345W\353\2337K.n\367\264\\\370\307\341_\206|"...,
 262144) = 262144
write(3, ""..., 53271) = 53271
lseek(3, 1302012902, SEEK_SET)  = 1302012902
write(3, "\1\0\0\0\0\20\320\v", 8)  = 8
lseek(3, 1303114745, SEEK_SET)  = 1303114745

It's seeking, writing, then jumping back and updating what had been written 
before.

That's going to hit the stable page writing code in btrfs that I had mentioned 
earlier.

At Facebook, we've been experimenting with fixes for this that are limited to 
O_APPEND 
slowly growing log files.  Let me think harder...

-chris
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Periodic frame losses when recording to btrfs volume with OBS

2018-01-22 Thread Chris Mason

On 01/22/2018 01:33 PM, Sebastian Ochmann wrote:

[ skipping to the traces ;) ]


2866 ffmpeg-mux D
[] btrfs_start_ordered_extent+0x101/0x130 [btrfs]
[] lock_and_cleanup_extent_if_need+0x340/0x380 [btrfs]
[] __btrfs_buffered_write+0x261/0x740 [btrfs]
[] btrfs_file_write_iter+0x20f/0x650 [btrfs]
[] __vfs_write+0xf9/0x170
[] vfs_write+0xad/0x1a0
[] SyS_write+0x52/0xc0
[] entry_SYSCALL_64_fastpath+0x1a/0x7d
[] 0x


This is where we wait for writes that are already in flight before we're 
allowed to redirty those pages in the file.  It'll happen when we either 
overwrite a page in the file that we've already written, or when we're 
trickling down writes slowly in non-4K aligned writes.


You can probably figure out pretty quickly which is the case by stracing 
ffmpeg-mux.  Since lower dirty ratios made it happen more often for you, 
my guess is the app is sending down unaligned writes.


-chris


--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Periodic frame losses when recording to btrfs volume with OBS

2018-01-22 Thread Chris Mason

On 01/20/2018 05:47 AM, Sebastian Ochmann wrote:

Hello,

I would like to describe a real-world use case where btrfs does not 
perform well for me. I'm recording 60 fps, larger-than-1080p video using 
OBS Studio [1] where it is important that the video stream is encoded 
and written out to disk in real-time for a prolonged period of time (2-5 
hours). The result is a H264 video encoded on the GPU with a data rate 
ranging from approximately 10-50 MB/s.


The hardware used is powerful enough to handle this task. When I use a 
XFS volume for recording, no matter whether it's a SSD or HDD, the 
recording is smooth and no frame drops are reported (OBS has a nice 
Stats window where it shows the number of frames dropped due to encoding 
lag which seemingly also includes writing the data out to disk).


However, when using a btrfs volume I quickly observe severe, periodic 
frame drops. It's not single frames but larger chunks of frames that a 
dropped at a time. I tried mounting the volume with nobarrier but to no 
avail.


Of course, the simple fix is to use a FS that works for me(TM). However 
I thought since this is a common real-world use case I'd describe the 
symptoms here in case anyone is interested in analyzing this behavior. 
It's not immediately obvious that the FS makes such a difference. Also, 
if anyone has an idea what I could try to mitigate this issue (mount or 
mkfs options?) I can try that.


I saw this behavior on two different machines with kernels 4.14.13 and 
4.14.5, both Arch Linux. btrfs-progs 4.14, OBS 20.1.3-241-gf5c3af1b 
built from git.




This could be a few different things, trying without the space cache was 
already suggested, and that's a top suspect.


How does the application do the writes?  Are they always 4K aligned or 
does it send them out in odd sizes?


The easiest way to nail it down is to use offcputime from the iovisor 
project:



https://github.com/iovisor/bcc/blob/master/tools/offcputime.py

If you haven't already configured this it may take a little while, but 
it's the perfect tool for this problem.


Otherwise, if the stalls are long enough you can try to catch it with 
/proc//stack.  I've attached a helper script I often use to dump 
the stack trace of all the tasks in D state.


Just run walker.py and it should give you something useful.  You can use 
walker.py -a to see all the tasks instead of just D state.  This just 
walks /proc//stack, so you'll need to run it as someone with 
permissions to see the stack traces of the procs you care about.


-chris


#!/usr/bin/env python3
#
# this walks all the tasks on the system and prints out a stack trace
# of any tasks waiting in D state.  If you pass -a, it will print out
# the stack of every task it finds.
#
# It also makes a histogram of the common stacks so you can see where
# more of the tasks are.
#

import sys
import os

from optparse import OptionParser

usage = "usage: %prog [options]"
parser = OptionParser(usage=usage)
parser.add_option("-a", "--all_tasks", help="Dump all stacks", default=False,
  action="store_true")
parser.add_option("-s", "--smaps", help="Dump /proc/pid/smaps", default=False,
  action="store_true")
parser.add_option("-S", "--sort", help="/proc/pid/smaps sort key",
  default="size", type="str")
parser.add_option("-p", "--pid", help="Filter on pid", default=None,
 type="str")
parser.add_option("-c", "--command", help="Filter on command name",
 default=None, type="str")
parser.add_option("-f", "--files", help="don't collapse open files",
  default=False, action="store_true")
parser.add_option("-v", "--verbose", help="details++", default=False,
  action="store_true")
(options, args) = parser.parse_args()

stacks = {}

maps = {}

lines = []

# parse the units from a number and normalize into KB
def parse_number(s):
try:
words = s.split()
unit = words[-1].lower()
number = int(words[1])
tag = words[0].lower().rstrip(':')

# we store in kb
if unit == "mb":
number = number * 1024
elif unit == "gb":
number = number * 1024 * 1024
elif unit == "tb":
number = number * 1024 * 1024

return (tag, number)
except:
return (None, None)

# pretty print a number in KB with units
def print_number(num):
# we store the number in kb
units = ['KB', 'MB', 'GB', 'TB']
index = 0

while num > 1024:
num /= 1024
index += 1

final = float(num + num / 1024)
return (final, units[index])

# find a given line in the record and pretty print it
def print_line(header, record, tag):
num, unit = print_number(record[tag])

if options.verbose:
header = header + " "
else:
header = ""

print("\t%s%s: %.2f %s" % (header, tag.capitalize(), num, unit))


# print all the lines we care about in a given record
def 

Re: Btrfs allow compression on NoDataCow files? (AFAIK Not, but it does)

2017-12-21 Thread Chris Mason

On 12/20/2017 03:59 PM, Timofey Titovets wrote:

How reproduce:
touch test_file
chattr +C test_file
dd if=/dev/zero of=test_file bs=1M count=1
btrfs fi def -vrczlib test_file
filefrag -v test_file

test_file
Filesystem type is: 9123683e
File size of test_file is 1048576 (256 blocks of 4096 bytes)
ext: logical_offset:physical_offset: length:   expected: flags:
   0:0..  31:   72917050..  72917081: 32: encoded
   1:   32..  63:   72917118..  72917149: 32:   72917082: encoded
   2:   64..  95:   72919494..  72919525: 32:   72917150: encoded
   3:   96.. 127:   72927576..  72927607: 32:   72919526: encoded
   4:  128.. 159:   72943261..  72943292: 32:   72927608: encoded
   5:  160.. 191:   72944929..  72944960: 32:   72943293: encoded
   6:  192.. 223:   72944952..  72944983: 32:   72944961: encoded
   7:  224.. 255:   72967084..  72967115: 32:   72944984:
last,encoded,eof
test_file: 8 extents found

I can't found at now, where that error happen in code,
but it's reproducible on Linux 4.14.8


We'll silently cow in a few cases, this is one.

-chris
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] btrfs: fix refcount_t usage when deleting btrfs_delayed_nodes

2017-12-15 Thread Chris Mason
refcounts have a generic implementation and an asm optimized one.  The
generic version has extra debugging to make sure that once a refcount
goes to zero, refcount_inc won't increase it.

The btrfs delayed inode code wasn't expecting this, and we're tripping
over the warnings when the generic refcounts are used.  We ended up with
this race:

Process A Process B
  btrfs_get_delayed_node()
  spin_lock(root->inode_lock)
  radix_tree_lookup()
__btrfs_release_delayed_node()
refcount_dec_and_test(_node->refs)
our refcount is now zero
  refcount_add(2) <---
  warning here, refcount
  unchanged

spin_lock(root->inode_lock)
radix_tree_delete()

With the generic refcounts, we actually warn again when process B above
tries to release his refcount because refcount_add() turned into a
no-op.

We saw this in production on older kernels without the asm optimized
refcounts.

The fix used here is to use refcount_inc_not_zero() to detect when the
object is in the middle of being freed and return NULL.  This is almost
always the right answer anyway, since we usually end up pitching the
delayed_node if it didn't have fresh data in it.

This also changes __btrfs_release_delayed_node() to remove the extra
check for zero refcounts before radix tree deletion.
btrfs_get_delayed_node() was the only path that was allowing refcounts
to go from zero to one.

Signed-off-by: Chris Mason <c...@fb.com>
Fixes: 6de5f18e7b0da
cc: <sta...@vger.kernel.org> #4.12+
---
 fs/btrfs/delayed-inode.c | 45 ++---
 1 file changed, 34 insertions(+), 11 deletions(-)

diff --git a/fs/btrfs/delayed-inode.c b/fs/btrfs/delayed-inode.c
index 5d73f79..84c54af 100644
--- a/fs/btrfs/delayed-inode.c
+++ b/fs/btrfs/delayed-inode.c
@@ -87,6 +87,7 @@ static struct btrfs_delayed_node *btrfs_get_delayed_node(
 
spin_lock(>inode_lock);
node = radix_tree_lookup(>delayed_nodes_tree, ino);
+
if (node) {
if (btrfs_inode->delayed_node) {
refcount_inc(>refs);  /* can be accessed */
@@ -94,9 +95,30 @@ static struct btrfs_delayed_node *btrfs_get_delayed_node(
spin_unlock(>inode_lock);
return node;
}
-   btrfs_inode->delayed_node = node;
-   /* can be accessed and cached in the inode */
-   refcount_add(2, >refs);
+
+   /* it's possible that we're racing into the middle of
+* removing this node from the radix tree.  In this case,
+* the refcount was zero and it should never go back
+* to one.  Just return NULL like it was never in the radix
+* at all; our release function is in the process of removing
+* it.
+*
+* Some implementations of refcount_inc refuse to
+* bump the refcount once it has hit zero.  If we don't do
+* this dance here, refcount_inc() may decide to
+* just WARN_ONCE() instead of actually bumping the refcount.
+*
+* If this node is properly in the radix, we want to
+* bump the refcount twice, once for the inode
+* and once for this get operation.
+*/
+   if (refcount_inc_not_zero(>refs)) {
+   refcount_inc(>refs);
+   btrfs_inode->delayed_node = node;
+   } else {
+   node = NULL;
+   }
+
spin_unlock(>inode_lock);
return node;
}
@@ -254,17 +276,18 @@ static void __btrfs_release_delayed_node(
mutex_unlock(_node->mutex);
 
if (refcount_dec_and_test(_node->refs)) {
-   bool free = false;
struct btrfs_root *root = delayed_node->root;
+
spin_lock(>inode_lock);
-   if (refcount_read(_node->refs) == 0) {
-   radix_tree_delete(>delayed_nodes_tree,
- delayed_node->inode_id);
-   free = true;
-   }
+   /*
+* once our refcount goes to zero, nobody is allowed to
+* bump it back up.  We can delete it now
+*/
+   ASSERT(refcount_read(_node->refs) == 0);
+   radix_tree_delete(>delayed_nodes_tree,
+ delayed_node->inode_id);
spin_unlock(>inode_lock);
-   if (free)
-   

Re: [PATCHSET v2] cgroup, writeback, btrfs: make sure btrfs issues metadata IOs from the root cgroup

2017-11-30 Thread Chris Mason



On 11/30/2017 12:23 PM, David Sterba wrote:

On Wed, Nov 29, 2017 at 01:38:26PM -0500, Chris Mason wrote:

On 11/29/2017 12:05 PM, Tejun Heo wrote:

On Wed, Nov 29, 2017 at 09:03:30AM -0800, Tejun Heo wrote:

Hello,

On Wed, Nov 29, 2017 at 05:56:08PM +0100, Jan Kara wrote:

What has happened with this patch set?


No idea.  cc'ing Chris directly.  Chris, if the patchset looks good,
can you please route them through the btrfs tree?


lol looking at the patchset again, I'm not sure that's obviously the
right tree.  It can either be cgroup, block or btrfs.  If no one
objects, I'll just route them through cgroup.


We'll have to coordinate a bit during the next merge window but I don't
have a problem with these going in through cgroup.  Dave does this sound
good to you?


There are only minor changes to btrfs code so cgroup tree would be
better.


I'd like to include my patch to do all crcs inline (instead of handing
off to helper threads) when io controls are in place.  By the merge
window we should have some good data on how much it's all helping.


Are there any problems in sight if the inline crc and cgroup chnanges go
separately? I assume there's a runtime dependency, not a code
dependency, so it could be sorted by the right merge order.



The feature is just more useful with the inline crcs.  Without them we 
end up with kworkers doing both high and low prio submissions and it all 
boils down to the speed of the lowest priority.


-chris

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCHSET v2] cgroup, writeback, btrfs: make sure btrfs issues metadata IOs from the root cgroup

2017-11-29 Thread Chris Mason

On 11/29/2017 12:05 PM, Tejun Heo wrote:

On Wed, Nov 29, 2017 at 09:03:30AM -0800, Tejun Heo wrote:

Hello,

On Wed, Nov 29, 2017 at 05:56:08PM +0100, Jan Kara wrote:

What has happened with this patch set?


No idea.  cc'ing Chris directly.  Chris, if the patchset looks good,
can you please route them through the btrfs tree?


lol looking at the patchset again, I'm not sure that's obviously the
right tree.  It can either be cgroup, block or btrfs.  If no one
objects, I'll just route them through cgroup.


We'll have to coordinate a bit during the next merge window but I don't 
have a problem with these going in through cgroup.  Dave does this sound 
good to you?


I'd like to include my patch to do all crcs inline (instead of handing 
off to helper threads) when io controls are in place.  By the merge 
window we should have some good data on how much it's all helping.


-chris

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] btrfs: fix deadlock when writing out space cache

2017-11-16 Thread Chris Mason



On 11/16/2017 03:09 AM, Nikolay Borisov wrote:



On 15.11.2017 23:20, Josef Bacik wrote:

From: Josef Bacik 

If we fail to prepare our pages for whatever reason (out of memory in
our case) we need to make sure to drop the block_group->data_rwsem,
otherwise hilarity ensues.

Signed-off-by: Josef Bacik 
---
  fs/btrfs/free-space-cache.c | 6 +-
  1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/free-space-cache.c b/fs/btrfs/free-space-cache.c
index cdc9f4015ec3..a6c643275210 100644
--- a/fs/btrfs/free-space-cache.c
+++ b/fs/btrfs/free-space-cache.c
@@ -1263,8 +1263,12 @@ static int __btrfs_write_out_cache(struct btrfs_root 
*root, struct inode *inode,
  
  	/* Lock all pages first so we can lock the extent safely. */

ret = io_ctl_prepare_pages(io_ctl, inode, 0);
-   if (ret)
+   if (ret) {
+   if (block_group &&
+   (block_group->flags & BTRFS_BLOCK_GROUP_DATA))
+   up_write(_group->data_rwsem);
goto out;
+   }


Which function after out: label causes a deadlock - btrfs_update_inode
(unlikely) or invalidate_inode_pages2?


Neither, out: just doesn't drop the data_rwsem mutex, so it leaves the 
block group locked.


-chris
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] btrfs: fix deadlock when writing out space cache

2017-11-15 Thread Chris Mason

On 11/15/2017 06:46 PM, Liu Bo wrote:

On Wed, Nov 15, 2017 at 04:20:52PM -0500, Josef Bacik wrote:

From: Josef Bacik 

If we fail to prepare our pages for whatever reason (out of memory in
our case) we need to make sure to drop the block_group->data_rwsem,
otherwise hilarity ensues.



Thanks Josef, I searched all the logs and it looks like we've really 
only hit this twice this month.  It's surprising we haven't seen this 
more given how often we OOM.


-chris
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[GIT PULL v2] zstd support (lib, btrfs, squashfs, nocrypto)

2017-09-11 Thread Chris Mason
Hi Linus,

Nick Terrell's patch series to add zstd support to the kernel has been
floating around for a while.  After talking with Dave Sterba, Herbert
and Phillip, we decided to send the whole thing in as one pull request.

Herbert had asked about the crypto patch when we discussed the pull, but
I didn't realize he really meant not-right-now.  I've rebased it out of
this branch, and none of the other patches depended on it.

I have things in my zstd-minimal branch:

git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs.git zstd-minimal

There's a trivial conflict with the main btrfs pull from last week.
Dave's pull deletes BTRFS_COMPRESS_LAST in fs/btrfs/compression.h, and
I've put the sample resolution in a branch named zstd-4.14-merge.

zstd is a big win in speed over zlib and in compression ratio over lzo,
and the compression team here at FB has gotten great results using it in
production.  Nick will continue to update the kernel side with new
improvements from the open source zstd userland code.

Nick has a number of benchmarks for the main zstd code in his lib/zstd
commit:


I ran the benchmarks on a Ubuntu 14.04 VM with 2 cores and 4 GiB of RAM.
The VM is running on a MacBook Pro with a 3.1 GHz Intel Core i7 processor,
16 GB of RAM, and a SSD. I benchmarked using `silesia.tar` [3], which is
211,988,480 B large. Run the following commands for the benchmark:

sudo modprobe zstd_compress_test
sudo mknod zstd_compress_test c 245 0
sudo cp silesia.tar zstd_compress_test

The time is reported by the time of the userland `cp`.
The MB/s is computed with

1,536,217,008 B / time(buffer size, hash)

which includes the time to copy from userland.
The Adjusted MB/s is computed with

1,536,217,088 B / (time(buffer size, hash) - time(buffer size, none)).

The memory reported is the amount of memory the compressor requests.

| Method   | Size (B) | Time (s) | Ratio | MB/s| Adj MB/s | Mem (MB) |
|--|--|--|---|-|--|--|
| none | 11988480 |0.100 | 1 | 2119.88 |- |- |
| zstd -1  | 73645762 |1.044 | 2.878 |  203.05 |   224.56 | 1.23 |
| zstd -3  | 66988878 |1.761 | 3.165 |  120.38 |   127.63 | 2.47 |
| zstd -5  | 65001259 |2.563 | 3.261 |   82.71 |86.07 | 2.86 |
| zstd -10 | 60165346 |   13.242 | 3.523 |   16.01 |16.13 |13.22 |
| zstd -15 | 58009756 |   47.601 | 3.654 |4.45 | 4.46 |21.61 |
| zstd -19 | 54014593 |  102.835 | 3.925 |2.06 | 2.06 |60.15 |
| zlib -1  | 77260026 |2.895 | 2.744 |   73.23 |75.85 | 0.27 |
| zlib -3  | 72972206 |4.116 | 2.905 |   51.50 |52.79 | 0.27 |
| zlib -6  | 68190360 |9.633 | 3.109 |   22.01 |22.24 | 0.27 |
| zlib -9  | 67613382 |   22.554 | 3.135 |9.40 | 9.44 | 0.27 |

I benchmarked zstd decompression using the same method on the same machine.
The benchmark file is located in the upstream zstd repo under
`contrib/linux-kernel/zstd_decompress_test.c` [4]. The memory reported is
the amount of memory required to decompress data compressed with the given
compression level. If you know the maximum size of your input, you can
reduce the memory usage of decompression irrespective of the compression
level.

| Method   | Time (s) | MB/s| Adjusted MB/s | Memory (MB) |
|--|--|-|---|-|
| none |0.025 | 8479.54 | - |   - |
| zstd -1  |0.358 |  592.15 |636.60 |0.84 |
| zstd -3  |0.396 |  535.32 |571.40 |1.46 |
| zstd -5  |0.396 |  535.32 |571.40 |1.46 |
| zstd -10 |0.374 |  566.81 |607.42 |2.51 |
| zstd -15 |0.379 |  559.34 |598.84 |4.61 |
| zstd -19 |0.412 |  514.54 |547.77 |8.80 |
| zlib -1  |0.940 |  225.52 |231.68 |0.04 |
| zlib -3  |0.883 |  240.08 |247.07 |0.04 |
| zlib -6  |0.844 |  251.17 |258.84 |0.04 |
| zlib -9  |0.837 |  253.27 |287.64 |0.04 |

===

I ran a long series of tests and benchmarks on the btrfs side and
the gains are very similar to the core benchmarks Nick ran.

Nick Terrell (3) commits (+14222/-12):
btrfs: Add zstd support (+468/-12)
lib: Add zstd modules (+13014/-0)
lib: Add xxhash module (+740/-0)

Sean Purcell (1) commits (+178/-0):
squashfs: Add zstd support

Total: (4) commits (+14400/-12)

 fs/btrfs/Kconfig   |2 +
 fs/btrfs/Makefile  |2 +-
 fs/btrfs/compression.c |1 +
 fs/btrfs/compression.h |6 +-
 fs/btrfs/ctree.h   |1 +
 fs/btrfs/disk-io.c |2 +
 fs/btrfs/ioctl.c   |6 +-
 fs/btrfs/props.c   |6 +
 fs/btrfs/super.c   |   12 +-
 fs/btrfs/sysfs.c   |2 +
 fs/btrfs/zstd.c|  432 ++
 fs/squashfs/Kconfig|   14 +
 

Re: [GIT PULL] zstd support (lib, btrfs, squashfs)

2017-09-08 Thread Chris Mason

On Sat, Sep 09, 2017 at 09:35:59AM +0800, Herbert Xu wrote:

On Fri, Sep 08, 2017 at 03:33:05PM -0400, Chris Mason wrote:


 crypto/Kconfig |9 +
 crypto/Makefile|1 +
 crypto/testmgr.c   |   10 +
 crypto/testmgr.h   |   71 +
 crypto/zstd.c  |  265 


Is there anyone going to use zstd through the crypto API? If not
then I don't see the point in adding it at this point.  Especially
as the compression API is still in a state of flux.


That part was requested by intel, but I'm happy to leave it out for 
another time.  The rest of the patch series doesn't depend on it at all.


-chris
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [GIT PULL] zstd support (lib, btrfs, squashfs)

2017-09-08 Thread Chris Mason



On 09/08/2017 03:33 PM, Chris Mason wrote:

Hi Linus,

Nick Terrell's patch series to add zstd support to the kernel has been
floating around for a while.  After talking with Dave Sterba, Herbert and
Phillip, we decided to send the whole thing in as one pull request.

I have it in my zstd branch:

git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs.git zstd

There's a trivial conflict with the main btrfs pull that Dave Sterba just
sent.  His pull deletes BTRFS_COMPRESS_LAST in fs/btrfs/compression.h, and
I've put the sample resolution in a branch named zstd-4.14-merge.  My
idea was that you'd take our main btrfs pull first and this one second,
but the conflicts are small enough it's not a big deal.

zstd is a big win in speed over zlib and in compression ratio over lzo, and
the compression team here at FB has gotten great results using it in production.
Nick will continue to update the kernel side with new improvements from the
open source zstd userland code.


Just to clarify, we've been testing the kernel side of this here at FB, 
but our zstd use in prod is limited to the application side.


-chris
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[GIT PULL] zstd support (lib, btrfs, squashfs)

2017-09-08 Thread Chris Mason
Hi Linus,

Nick Terrell's patch series to add zstd support to the kernel has been
floating around for a while.  After talking with Dave Sterba, Herbert and
Phillip, we decided to send the whole thing in as one pull request.

I have it in my zstd branch:

git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs.git zstd

There's a trivial conflict with the main btrfs pull that Dave Sterba just
sent.  His pull deletes BTRFS_COMPRESS_LAST in fs/btrfs/compression.h, and
I've put the sample resolution in a branch named zstd-4.14-merge.  My
idea was that you'd take our main btrfs pull first and this one second,
but the conflicts are small enough it's not a big deal.

zstd is a big win in speed over zlib and in compression ratio over lzo, and
the compression team here at FB has gotten great results using it in production.
Nick will continue to update the kernel side with new improvements from the 
open source zstd userland code.

Nick has a number of benchmarks for the main zstd code in his lib/zstd
commit:


I ran the benchmarks on a Ubuntu 14.04 VM with 2 cores and 4 GiB of RAM.
The VM is running on a MacBook Pro with a 3.1 GHz Intel Core i7 processor,
16 GB of RAM, and a SSD. I benchmarked using `silesia.tar` [3], which is
211,988,480 B large. Run the following commands for the benchmark:

sudo modprobe zstd_compress_test
sudo mknod zstd_compress_test c 245 0
sudo cp silesia.tar zstd_compress_test

The time is reported by the time of the userland `cp`.
The MB/s is computed with

1,536,217,008 B / time(buffer size, hash)

which includes the time to copy from userland.
The Adjusted MB/s is computed with

1,536,217,088 B / (time(buffer size, hash) - time(buffer size, none)).

The memory reported is the amount of memory the compressor requests.

| Method   | Size (B) | Time (s) | Ratio | MB/s| Adj MB/s | Mem (MB) |
|--|--|--|---|-|--|--|
| none | 11988480 |0.100 | 1 | 2119.88 |- |- |
| zstd -1  | 73645762 |1.044 | 2.878 |  203.05 |   224.56 | 1.23 |
| zstd -3  | 66988878 |1.761 | 3.165 |  120.38 |   127.63 | 2.47 |
| zstd -5  | 65001259 |2.563 | 3.261 |   82.71 |86.07 | 2.86 |
| zstd -10 | 60165346 |   13.242 | 3.523 |   16.01 |16.13 |13.22 |
| zstd -15 | 58009756 |   47.601 | 3.654 |4.45 | 4.46 |21.61 |
| zstd -19 | 54014593 |  102.835 | 3.925 |2.06 | 2.06 |60.15 |
| zlib -1  | 77260026 |2.895 | 2.744 |   73.23 |75.85 | 0.27 |
| zlib -3  | 72972206 |4.116 | 2.905 |   51.50 |52.79 | 0.27 |
| zlib -6  | 68190360 |9.633 | 3.109 |   22.01 |22.24 | 0.27 |
| zlib -9  | 67613382 |   22.554 | 3.135 |9.40 | 9.44 | 0.27 |

I benchmarked zstd decompression using the same method on the same machine.
The benchmark file is located in the upstream zstd repo under
`contrib/linux-kernel/zstd_decompress_test.c` [4]. The memory reported is
the amount of memory required to decompress data compressed with the given
compression level. If you know the maximum size of your input, you can
reduce the memory usage of decompression irrespective of the compression
level.

| Method   | Time (s) | MB/s| Adjusted MB/s | Memory (MB) |
|--|--|-|---|-|
| none |0.025 | 8479.54 | - |   - |
| zstd -1  |0.358 |  592.15 |636.60 |0.84 |
| zstd -3  |0.396 |  535.32 |571.40 |1.46 |
| zstd -5  |0.396 |  535.32 |571.40 |1.46 |
| zstd -10 |0.374 |  566.81 |607.42 |2.51 |
| zstd -15 |0.379 |  559.34 |598.84 |4.61 |
| zstd -19 |0.412 |  514.54 |547.77 |8.80 |
| zlib -1  |0.940 |  225.52 |231.68 |0.04 |
| zlib -3  |0.883 |  240.08 |247.07 |0.04 |
| zlib -6  |0.844 |  251.17 |258.84 |0.04 |
| zlib -9  |0.837 |  253.27 |287.64 |0.04 |

===

I ran a long series of tests and benchmarks on the btrfs side and
the gains are very similar to the core benchmarks Nick ran.

Nick Terrell (4) commits (+14578/-12):  
crypto: Add zstd support (+356/-0)  
btrfs: Add zstd support (+468/-12)  
lib: Add zstd modules (+13014/-0)   
lib: Add xxhash module (+740/-0)

Sean Purcell (1) commits (+178/-0): 
squashfs: Add zstd support  

Total: (5) commits (+14756/-12)

Re: RedHat 7.4 Release Notes: "Btrfs has been deprecated" - wut?

2017-08-16 Thread Chris Mason

On Mon, Aug 14, 2017 at 09:54:48PM +0200, Christoph Anton Mitterer wrote:

On Mon, 2017-08-14 at 11:53 -0400, Austin S. Hemmelgarn wrote:

Quite a few applications actually _do_ have some degree of secondary 
verification or protection from a crash.  Go look at almost any
database 
software.

Then please give proper references for this!

This is from 2015, where you claimed this already and I looked up all
the bigger DBs and they either couldn't do it at all, didn't to it per
default, or it required application support (i.e. from the programs
using the DB)
https://www.spinics.net/lists/linux-btrfs/msg50258.html



It usually will not have checksumming, but it will almost 
always have support for a journal, which is enough to cover the 
particular data loss scenario we're talking about (unexpected
unclean 
shutdown).


I don't think we talk about this:
We talk about people wanting checksuming to notice e.g. silent data
corruption.

The crash case is only the corner case about what happens then if data
is written correctly but csums not.


We use the crcs to catch storage gone wrong, both in terms of simple 
things like cabling, bus errors, drives gone crazy or exotic problems 
like every time I reboot the box a handful of sectors return EFI 
partition table headers instead of the data I wrote.  You don't need 
data center scale for this to happen, but it does help...


So, we do catch crc errors in prod and they do keep us from replicating 
bad data over good data.  Some databases also crc, and all drives have 
correction bits of of some kind.  There's nothing wrong with crcs 
happening at lots of layers.


Btrfs couples the crcs with COW because it's the least complicated way 
to protect against:


* bits flipping
* IO getting lost on the way to the drive, leaving stale but valid data 
in place
* IO from sector A going to sector B instead, overwriting valid data 
with other valid data.


It's possible to protect against all three without COW, but all 
solutions have their own tradeoffs and this is the setup we chose.  It's 
easy to trust and easy to debug and at scale that really helps.


In general, production storage environments prefer clearly defined 
errors when the storage has the wrong data.  EIOs happen often, and you 
want to be able to quickly pitch the bad data and replicate in good 
data.


My real goal is to make COW fast enough that we can leave it on for the 
database applications too.  Obviously I haven't quite finished that one 
yet ;) But I'd rather keep the building block of all the other btrfs 
features in place than try to do crcs differently.


-chris
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH stable v4.12 backport] Btrfs: fix early ENOSPC due to delalloc

2017-08-11 Thread Chris Mason

On 08/11/2017 10:44 AM, Chris Mason wrote:

Hmpf, forgot to put the sha in Linus' tree:

17024ad0a0fdfcfe53043afb969b813d3e020c21



And Nikolay just reminded me this is already in Greg's queue.  Whoops.

-chris
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH stable v4.12 backport] Btrfs: fix early ENOSPC due to delalloc

2017-08-11 Thread Chris Mason

Hmpf, forgot to put the sha in Linus' tree:

17024ad0a0fdfcfe53043afb969b813d3e020c21

-chris

On 08/11/2017 10:41 AM, Chris Mason wrote:

From: Omar Sandoval <osan...@fb.com>

If a lot of metadata is reserved for outstanding delayed allocations, we
rely on shrink_delalloc() to reclaim metadata space in order to fulfill
reservation tickets. However, shrink_delalloc() has a shortcut where if
it determines that space can be overcommitted, it will stop early. This
made sense before the ticketed enospc system, but now it means that
shrink_delalloc() will often not reclaim enough space to fulfill any
tickets, leading to an early ENOSPC. (Reservation tickets don't care
about being able to overcommit, they need every byte accounted for.)

Fix it by getting rid of the shortcut so that shrink_delalloc() reclaims
all of the metadata it is supposed to. This fixes early ENOSPCs we were
seeing when doing a btrfs receive to populate a new filesystem, as well
as early ENOSPCs Christoph saw when doing a big cp -r onto Btrfs.

Fixes: 957780eb2788 ("Btrfs: introduce ticketed enospc infrastructure")
Tested-by: Christoph Anton Mitterer <m...@christoph.anton.mitterer.name>
Cc: sta...@vger.kernel.org
Reviewed-by: Josef Bacik <jba...@fb.com>
Signed-off-by: Omar Sandoval <osan...@fb.com>
Signed-off-by: David Sterba <dste...@suse.com>
Signed-off-by: Chris Mason <c...@fb.com>
---
  fs/btrfs/extent-tree.c | 4 
  1 file changed, 4 deletions(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 33d979e9..83eecd3 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -4776,10 +4776,6 @@ static void shrink_delalloc(struct btrfs_root *root, u64 
to_reclaim, u64 orig,
else
flush = BTRFS_RESERVE_NO_FLUSH;
spin_lock(_info->lock);
-   if (can_overcommit(root, space_info, orig, flush)) {
-   spin_unlock(_info->lock);
-   break;
-   }
if (list_empty(_info->tickets) &&
list_empty(_info->priority_tickets)) {
spin_unlock(_info->lock);


--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: FAILED: patch "[PATCH] Btrfs: fix early ENOSPC due to delalloc" failed to apply to 4.12-stable tree

2017-08-11 Thread Chris Mason


On 08/04/2017 03:29 PM, Christoph Anton Mitterer wrote:
> Hey.
> 
> Could someone of the devs put some attention on this...?
> 
> Thanks,
> Chris :-)

Done, you can also grab it here: 

https://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs.git/commit/?h=for-stable-4.12=354850ad1948af13248031e5180d495044d05aa5

-chris

> 
> 
> On Mon, 2017-07-31 at 18:06 -0700, gre...@linuxfoundation.org wrote:
>> The patch below does not apply to the 4.12-stable tree.
>> If someone wants it applied there, or to any other stable or longterm
>> tree, then please email the backport, including the original git
>> commit
>> id to .
>>
>> thanks,
>>
>> greg k-h
>>
>> -- original commit in Linus's tree --
>>
>>  From 17024ad0a0fdfcfe53043afb969b813d3e020c21 Mon Sep 17 00:00:00
>> 2001
>> From: Omar Sandoval 
>> Date: Thu, 20 Jul 2017 15:10:35 -0700
>> Subject: [PATCH] Btrfs: fix early ENOSPC due to delalloc
>>
>> If a lot of metadata is reserved for outstanding delayed allocations,
>> we
>> rely on shrink_delalloc() to reclaim metadata space in order to
>> fulfill
>> reservation tickets. However, shrink_delalloc() has a shortcut where
>> if
>> it determines that space can be overcommitted, it will stop early.
>> This
>> made sense before the ticketed enospc system, but now it means that
>> shrink_delalloc() will often not reclaim enough space to fulfill any
>> tickets, leading to an early ENOSPC. (Reservation tickets don't care
>> about being able to overcommit, they need every byte accounted for.)
>>
>> Fix it by getting rid of the shortcut so that shrink_delalloc()
>> reclaims
>> all of the metadata it is supposed to. This fixes early ENOSPCs we
>> were
>> seeing when doing a btrfs receive to populate a new filesystem, as
>> well
>> as early ENOSPCs Christoph saw when doing a big cp -r onto Btrfs.
>>
>> Fixes: 957780eb2788 ("Btrfs: introduce ticketed enospc
>> infrastructure")
>> Tested-by: Christoph Anton Mitterer > me>
>> Cc: sta...@vger.kernel.org
>> Reviewed-by: Josef Bacik 
>> Signed-off-by: Omar Sandoval 
>> Signed-off-by: David Sterba 
>>
>> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
>> index a6635f07b8f1..e3b0b4196d3d 100644
>> --- a/fs/btrfs/extent-tree.c
>> +++ b/fs/btrfs/extent-tree.c
>> @@ -4825,10 +4825,6 @@ static void shrink_delalloc(struct
>> btrfs_fs_info *fs_info, u64 to_reclaim,
>>  else
>>  flush = BTRFS_RESERVE_NO_FLUSH;
>>  spin_lock(_info->lock);
>> -if (can_overcommit(fs_info, space_info, orig, flush,
>> false)) {
>> -spin_unlock(_info->lock);
>> -break;
>> -}
>>  if (list_empty(_info->tickets) &&
>>  list_empty(_info->priority_tickets)) {
>>  spin_unlock(_info->lock);
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH stable v4.12 backport] Btrfs: fix early ENOSPC due to delalloc

2017-08-11 Thread Chris Mason
From: Omar Sandoval <osan...@fb.com>

If a lot of metadata is reserved for outstanding delayed allocations, we
rely on shrink_delalloc() to reclaim metadata space in order to fulfill
reservation tickets. However, shrink_delalloc() has a shortcut where if
it determines that space can be overcommitted, it will stop early. This
made sense before the ticketed enospc system, but now it means that
shrink_delalloc() will often not reclaim enough space to fulfill any
tickets, leading to an early ENOSPC. (Reservation tickets don't care
about being able to overcommit, they need every byte accounted for.)

Fix it by getting rid of the shortcut so that shrink_delalloc() reclaims
all of the metadata it is supposed to. This fixes early ENOSPCs we were
seeing when doing a btrfs receive to populate a new filesystem, as well
as early ENOSPCs Christoph saw when doing a big cp -r onto Btrfs.

Fixes: 957780eb2788 ("Btrfs: introduce ticketed enospc infrastructure")
Tested-by: Christoph Anton Mitterer <m...@christoph.anton.mitterer.name>
Cc: sta...@vger.kernel.org
Reviewed-by: Josef Bacik <jba...@fb.com>
Signed-off-by: Omar Sandoval <osan...@fb.com>
Signed-off-by: David Sterba <dste...@suse.com>
Signed-off-by: Chris Mason <c...@fb.com>
---
 fs/btrfs/extent-tree.c | 4 
 1 file changed, 4 deletions(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 33d979e9..83eecd3 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -4776,10 +4776,6 @@ static void shrink_delalloc(struct btrfs_root *root, u64 
to_reclaim, u64 orig,
else
flush = BTRFS_RESERVE_NO_FLUSH;
spin_lock(_info->lock);
-   if (can_overcommit(root, space_info, orig, flush)) {
-   spin_unlock(_info->lock);
-   break;
-   }
if (list_empty(_info->tickets) &&
list_empty(_info->priority_tickets)) {
spin_unlock(_info->lock);
-- 
2.9.3

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v5 2/5] lib: Add zstd modules

2017-08-11 Thread Chris Mason



On 08/10/2017 03:25 PM, Hugo Mills wrote:

On Thu, Aug 10, 2017 at 01:41:21PM -0400, Chris Mason wrote:

On 08/10/2017 04:30 AM, Eric Biggers wrote:


Theses benchmarks are misleading because they compress the whole file as a
single stream without resetting the dictionary, which isn't how data will
typically be compressed in kernel mode.  With filesystem compression the data
has to be divided into small chunks that can each be decompressed independently.
That eliminates one of the primary advantages of Zstandard (support for large
dictionary sizes).


I did btrfs benchmarks of kernel trees and other normal data sets as
well.  The numbers were in line with what Nick is posting here.
zstd is a big win over both lzo and zlib from a btrfs point of view.

It's true Nick's patches only support a single compression level in
btrfs, but that's because btrfs doesn't have a way to pass in the
compression ratio.  It could easily be a mount option, it was just
outside the scope of Nick's initial work.


Could we please not add more mount options? I get that they're easy
to implement, but it's a very blunt instrument. What we tend to see
(with both nodatacow and compress) is people using the mount options,
then asking for exceptions, discovering that they can't do that, and
then falling back to doing it with attributes or btrfs properties.
Could we just start with btrfs properties this time round, and cut out
the mount option part of this cycle.

In the long run, it'd be great to see most of the btrfs-specific
mount options get deprecated and ultimately removed entirely, in
favour of attributes/properties, where feasible.



It's a good point, and as was commented later down I'd just do mount -o 
compress=zstd:3 or something.


But I do prefer properties in general for this.  My big point was just 
that next step is outside of Nick's scope.


-chris

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v5 2/5] lib: Add zstd modules

2017-08-10 Thread Chris Mason

On 08/10/2017 03:00 PM, Eric Biggers wrote:

On Thu, Aug 10, 2017 at 01:41:21PM -0400, Chris Mason wrote:

On 08/10/2017 04:30 AM, Eric Biggers wrote:

On Wed, Aug 09, 2017 at 07:35:53PM -0700, Nick Terrell wrote:



The memory reported is the amount of memory the compressor requests.

| Method   | Size (B) | Time (s) | Ratio | MB/s| Adj MB/s | Mem (MB) |
|--|--|--|---|-|--|--|
| none | 11988480 |0.100 | 1 | 2119.88 |- |- |
| zstd -1  | 73645762 |1.044 | 2.878 |  203.05 |   224.56 | 1.23 |
| zstd -3  | 66988878 |1.761 | 3.165 |  120.38 |   127.63 | 2.47 |
| zstd -5  | 65001259 |2.563 | 3.261 |   82.71 |86.07 | 2.86 |
| zstd -10 | 60165346 |   13.242 | 3.523 |   16.01 |16.13 |13.22 |
| zstd -15 | 58009756 |   47.601 | 3.654 |4.45 | 4.46 |21.61 |
| zstd -19 | 54014593 |  102.835 | 3.925 |2.06 | 2.06 |60.15 |
| zlib -1  | 77260026 |2.895 | 2.744 |   73.23 |75.85 | 0.27 |
| zlib -3  | 72972206 |4.116 | 2.905 |   51.50 |52.79 | 0.27 |
| zlib -6  | 68190360 |9.633 | 3.109 |   22.01 |22.24 | 0.27 |
| zlib -9  | 67613382 |   22.554 | 3.135 |9.40 | 9.44 | 0.27 |



Theses benchmarks are misleading because they compress the whole file as a
single stream without resetting the dictionary, which isn't how data will
typically be compressed in kernel mode.  With filesystem compression the data
has to be divided into small chunks that can each be decompressed independently.
That eliminates one of the primary advantages of Zstandard (support for large
dictionary sizes).


I did btrfs benchmarks of kernel trees and other normal data sets as
well.  The numbers were in line with what Nick is posting here.
zstd is a big win over both lzo and zlib from a btrfs point of view.

It's true Nick's patches only support a single compression level in
btrfs, but that's because btrfs doesn't have a way to pass in the
compression ratio.  It could easily be a mount option, it was just
outside the scope of Nick's initial work.



I am not surprised --- Zstandard is closer to the state of the art, both
format-wise and implementation-wise, than the other choices in BTRFS.  My point
is that benchmarks need to account for how much data is compressed at a time.
This is a common mistake when comparing different compression algorithms; the
algorithm name and compression level do not tell the whole story.  The
dictionary size is extremely significant.  No one is going to compress or
decompress a 200 MB file as a single stream in kernel mode, so it does not make
sense to justify adding Zstandard *to the kernel* based on such a benchmark.  It
is going to be divided into chunks.  How big are the chunks in BTRFS?  I thought
that it compressed only one page (4 KiB) at a time, but I hope that has been, or
is being, improved; 32 KiB - 128 KiB should be a better amount.  (And if the
amount of data compressed at a time happens to be different between the
different algorithms, note that BTRFS benchmarks are likely to be measuring that
as much as the algorithms themselves.)


Btrfs hooks the compression code into the delayed allocation mechanism 
we use to gather large extents for COW.  So if you write 100MB to a 
file, we'll have 100MB to compress at a time (within the limits of the 
amount of pages we allow to collect before forcing it down).


But we want to balance how much memory you might need to uncompress 
during random reads.  So we have an artificial limit of 128KB that we 
send at a time to the compression code.  It's easy to change this, it's 
just a tradeoff made to limit the cost of reading small bits.


It's the same for zlib,lzo and the new zstd patch.

-chris

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v5 2/5] lib: Add zstd modules

2017-08-10 Thread Chris Mason

On 08/10/2017 04:30 AM, Eric Biggers wrote:

On Wed, Aug 09, 2017 at 07:35:53PM -0700, Nick Terrell wrote:



The memory reported is the amount of memory the compressor requests.

| Method   | Size (B) | Time (s) | Ratio | MB/s| Adj MB/s | Mem (MB) |
|--|--|--|---|-|--|--|
| none | 11988480 |0.100 | 1 | 2119.88 |- |- |
| zstd -1  | 73645762 |1.044 | 2.878 |  203.05 |   224.56 | 1.23 |
| zstd -3  | 66988878 |1.761 | 3.165 |  120.38 |   127.63 | 2.47 |
| zstd -5  | 65001259 |2.563 | 3.261 |   82.71 |86.07 | 2.86 |
| zstd -10 | 60165346 |   13.242 | 3.523 |   16.01 |16.13 |13.22 |
| zstd -15 | 58009756 |   47.601 | 3.654 |4.45 | 4.46 |21.61 |
| zstd -19 | 54014593 |  102.835 | 3.925 |2.06 | 2.06 |60.15 |
| zlib -1  | 77260026 |2.895 | 2.744 |   73.23 |75.85 | 0.27 |
| zlib -3  | 72972206 |4.116 | 2.905 |   51.50 |52.79 | 0.27 |
| zlib -6  | 68190360 |9.633 | 3.109 |   22.01 |22.24 | 0.27 |
| zlib -9  | 67613382 |   22.554 | 3.135 |9.40 | 9.44 | 0.27 |



Theses benchmarks are misleading because they compress the whole file as a
single stream without resetting the dictionary, which isn't how data will
typically be compressed in kernel mode.  With filesystem compression the data
has to be divided into small chunks that can each be decompressed independently.
That eliminates one of the primary advantages of Zstandard (support for large
dictionary sizes).


I did btrfs benchmarks of kernel trees and other normal data sets as 
well.  The numbers were in line with what Nick is posting here.  zstd is 
a big win over both lzo and zlib from a btrfs point of view.


It's true Nick's patches only support a single compression level in 
btrfs, but that's because btrfs doesn't have a way to pass in the 
compression ratio.  It could easily be a mount option, it was just 
outside the scope of Nick's initial work.


-chris



--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] Btrfs: avoid unnecessarily locking inode when clearing a range

2017-08-03 Thread Chris Mason



On 08/03/2017 11:25 AM, Wang Shilong wrote:

On Thu, Aug 3, 2017 at 11:00 PM, Chris Mason <c...@fb.com> wrote:



On 07/27/2017 02:52 PM, fdman...@kernel.org wrote:


From: Filipe Manana <fdman...@suse.com>

If the range being cleared was not marked for defrag and we are not
about to clear the range from the defrag status, we don't need to
lock and unlock the inode.

Signed-off-by: Filipe Manana <fdman...@suse.com>



Thanks Filipe, looks like it goes all the way back to:

commit 47059d930f0e002ff851beea87d738146804726d
Author: Wang Shilong <wangsl.f...@cn.fujitsu.com>
Date:   Thu Jul 3 18:22:07 2014 +0800

 Btrfs: make defragment work with nodatacow option

I can't see how the inode lock is required here.


This blames to me, thanks for fixing it.


No blame ;)  I'll take code that works any day.

-chris
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] btrfs: Move skip checksum check from btrfs_submit_direct to __btrfs_submit_dio_bio

2017-08-03 Thread Chris Mason



On 08/03/2017 08:44 AM, Nikolay Borisov wrote:

Currently the code checks whether we should do data checksumming in
btrfs_submit_direct and the boolean result of this check is passed to
btrfs_submit_direct_hook, in turn passing it to __btrfs_submit_dio_bio which
actually consumes it. The last function actually has all the necessary context
to figure out whether to skip the check or not, so let's move the check closer
to where it's being consumed. No functional changes.


I like it, thanks.

Reviewed-by: Chris Mason <c...@fb.com>

-chris
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] Btrfs: avoid unnecessarily locking inode when clearing a range

2017-08-03 Thread Chris Mason



On 07/27/2017 02:52 PM, fdman...@kernel.org wrote:

From: Filipe Manana <fdman...@suse.com>

If the range being cleared was not marked for defrag and we are not
about to clear the range from the defrag status, we don't need to
lock and unlock the inode.

Signed-off-by: Filipe Manana <fdman...@suse.com>


Thanks Filipe, looks like it goes all the way back to:

commit 47059d930f0e002ff851beea87d738146804726d
Author: Wang Shilong <wangsl.f...@cn.fujitsu.com>
Date:   Thu Jul 3 18:22:07 2014 +0800

Btrfs: make defragment work with nodatacow option

I can't see how the inode lock is required here.

Reviewed-by: Chris Mason <c...@fb.com>

-chris


---
  fs/btrfs/inode.c | 7 ---
  1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index eb495e956d53..51c45c0a8553 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -1797,10 +1797,11 @@ static void btrfs_clear_bit_hook(void *private_data,
u64 len = state->end + 1 - state->start;
u32 num_extents = count_max_extents(len);
  
-	spin_lock(>lock);

-   if ((state->state & EXTENT_DEFRAG) && (*bits & EXTENT_DEFRAG))
+   if ((state->state & EXTENT_DEFRAG) && (*bits & EXTENT_DEFRAG)) {
+   spin_lock(>lock);
inode->defrag_bytes -= len;
-   spin_unlock(>lock);
+   spin_unlock(>lock);
+   }
  
  	/*

 * set_bit and clear bit hooks normally require _irqsave/restore


--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 00/14 RFC] Btrfs: Add journal for raid5/6 writes

2017-08-02 Thread Chris Mason



On 08/01/2017 01:39 PM, Austin S. Hemmelgarn wrote:

On 2017-08-01 13:25, Roman Mamedov wrote:

On Tue,  1 Aug 2017 10:14:23 -0600
Liu Bo  wrote:


This aims to fix write hole issue on btrfs raid5/6 setup by adding a
separate disk as a journal (aka raid5/6 log), so that after unclean
shutdown we can make sure data and parity are consistent on the raid
array by replaying the journal.


Could it be possible to designate areas on the in-array devices to be 
used as

journal?

While md doesn't have much spare room in its metadata for extraneous 
things
like this, Btrfs could use almost as much as it wants to, adding to 
size of the
FS metadata areas. Reliability-wise, the log could be stored as RAID1 
chunks.


It doesn't seem convenient to need having an additional storage device 
around
just for the log, and also needing to maintain its fault tolerance 
yourself (so
the log device would better be on a mirror, such as mdadm RAID1? more 
expense

and maintenance complexity).

I agree, MD pretty much needs a separate device simply because they 
can't allocate arbitrary space on the other array members.  BTRFS can do 
that though, and I would actually think that that would be _easier_ to 
implement than having a separate device.


That said, I do think that it would need to be a separate chunk type, 
because things could get really complicated if the metadata is itself 
using a parity raid profile.


Thanks for running with this Liu, I'm reading through all the patches. 
I do agree that it's better to put the logging into a dedicated chunk 
type, that way we can have it default to either double or triple mirroring.


-chris

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: RedHat 7.4 Release Notes: "Btrfs has been deprecated" - wut?

2017-08-02 Thread Chris Mason

On 08/02/2017 04:38 AM, Brendan Hide wrote:
The title seems alarmist to me - and I suspect it is going to be 
misconstrued. :-/


Supporting any filesystem is a huge amount of work.  I don't have a 
problem with Redhat or any distro picking and choosing the projects they 
want to support.


At least inside of FB, our own internal btrfs usage is continuing to 
grow.  Btrfs is becoming a big part of how we ship containers and other 
workloads where snapshots improve performance.


We also heavily use XFS, so I'm happy to see RH's long standing 
investment there continue.


-chris
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] Btrfs: Do not use data_alloc_cluster in ssd mode

2017-07-24 Thread Chris Mason

On 07/24/2017 03:06 PM, Austin S. Hemmelgarn wrote:

On 2017-07-24 14:53, Chris Mason wrote:

On 07/24/2017 02:41 PM, David Sterba wrote:


would it be ok for you to keep ssd_working as before?

I'd really like to get this patch merged soon because "do not use ssd
mode for ssd" has started to be the recommended workaround. Once this
sticks, we won't need to have any ssd mode anymore ...


Works for me.  I do want to make sure that commits in this area 
include the workload they were targeting, how they were measured and 
what impacts they had.  That way when we go back to try and change 
this again we'll understand what profiles we want to preserve.
Just thinking long term here, but might it make sense to (eventually) 
allow the user to tune how big a chunk of space to look for?  I know 
that ext* have options to do this kind of thing, and I think XFS does 
too (but they do it at filesystem creation time), and I do know people 
who make use of those to make sure things are working at their absolute 
best.


Agreed, we have space in the on-disk format to record those preferences, 
but we haven't done it in the past.


-chris
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] Btrfs: Do not use data_alloc_cluster in ssd mode

2017-07-24 Thread Chris Mason



On 07/24/2017 02:41 PM, David Sterba wrote:

On Mon, Jul 24, 2017 at 02:01:07PM -0400, Chris Mason wrote:

On 07/24/2017 10:25 AM, David Sterba wrote:


Thanks for the extensive historical summary, this change really deserves
it.

Decoupling the assumptions about the device's block management is really
a good thing, mount option 'ssd' should mean that the device just has
cheap seeks. Moving the the allocation tweaks to ssd_spread provides a
way to keep the behaviour for anybody who wants it.

I'd like to push this change to 4.13-rc3, as I don't think we need more
time to let other users to test this. The effects of current ssd
implementation have been debated and debugged on IRC for a long time.


The description is great, but I'd love to see many more benchmarks.  At
Facebook we use the current ssd_spread mode in production on top of
hardware raid5/6 (spinning storage) because it dramatically reduces the
read/modify/write cycles done for metadata writes.


Well, I think this is an example that ssd got misused because of the
side effects of the allocation. If you observe good patterns for raid5,
then the allocator should be adapted for that case, otherwise
ssd/ssd_spread should be independent of the raid level.


Absolutely.  The optimizations that made ssd_spread useful for first 
generation flash are the same things that raid5/6 need.  Big writes, or 
said differently a minimum size for fast writes.





If we're going to play around with these, we need a good way to measure
free space fragmentation as part of benchmarks, as well as the IO
patterns coming out of the allocator.


Hans has a tool that visualizes the fragmentation. Most complaints I've
seen were about 'ssd' itself, excessive fragmentation, early ENOSPC. Not
many people use ssd_spread, 'ssd' gets turned on automatically so it has
much wider impact.


At least for our uses, ssd_spread matters much more for metadata than
data (the data writes are large and metadata is small).


 From the changes overview:


1. Throw out the current ssd_spread behaviour.


would it be ok for you to keep ssd_working as before?

I'd really like to get this patch merged soon because "do not use ssd
mode for ssd" has started to be the recommended workaround. Once this
sticks, we won't need to have any ssd mode anymore ...


Works for me.  I do want to make sure that commits in this area include 
the workload they were targeting, how they were measured and what 
impacts they had.  That way when we go back to try and change this again 
we'll understand what profiles we want to preserve.


-chris
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] Btrfs: Do not use data_alloc_cluster in ssd mode

2017-07-24 Thread Chris Mason

On 07/24/2017 10:25 AM, David Sterba wrote:


Thanks for the extensive historical summary, this change really deserves
it.

Decoupling the assumptions about the device's block management is really
a good thing, mount option 'ssd' should mean that the device just has
cheap seeks. Moving the the allocation tweaks to ssd_spread provides a
way to keep the behaviour for anybody who wants it.

I'd like to push this change to 4.13-rc3, as I don't think we need more
time to let other users to test this. The effects of current ssd
implementation have been debated and debugged on IRC for a long time.



The description is great, but I'd love to see many more benchmarks.  At 
Facebook we use the current ssd_spread mode in production on top of 
hardware raid5/6 (spinning storage) because it dramatically reduces the 
read/modify/write cycles done for metadata writes.


If we're going to play around with these, we need a good way to measure 
free space fragmentation as part of benchmarks, as well as the IO 
patterns coming out of the allocator.


At least for our uses, ssd_spread matters much more for metadata than 
data (the data writes are large and metadata is small).


Thanks again,
Chris
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PULL] Btrfs for 4.13, part 1

2017-06-26 Thread Chris Mason

On 06/23/2017 11:16 AM, David Sterba wrote:

Hi,

this is the main batch for 4.13. There are some user visible changes, see
below. The core updates improve error handling (mostly related to bios), with
the usual incremental work on the GFP_NOFS (mis)use removal. All patches have
been in for-next for an extensive amount of time.

Thre will be followups but I want push the series (111 patches) forward. There
are also some updates to adjacent subsystems (writeback and blocklayer), so I
want to give some stable point for merging in the upcoming weeks.


Thanks Dave, I ran this (along with the updates we added) through a long 
stress and the usual xfstests.


For everyone else on the list, since I'm heading off to vacation until 
~July 9th, Dave is sending this off to Linus once the merge window starts.


-chris
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v2] btrfs: fix integer overflow in calc_reclaim_items_nr

2017-06-23 Thread Chris Mason
Dave Jones hit a WARN_ON(nr < 0) in btrfs_wait_ordered_roots() with
v4.12-rc6.  This was because commit 70e7af244 made it possible for
calc_reclaim_items_nr() to return a negative number.  It's not really a
bug in that commit, it just didn't go far enough down the stack to find
all the possible 64->32 bit overflows.

This switches calc_reclaim_items_nr() to return a u64 and changes everyone
that uses the results of that math to u64 as well.

Reported-by: Dave Jones <da...@codemonkey.org.uk>
Fixes: 70e7af2 ("Btrfs: fix delalloc accounting leak caused by u32 overflow")
Signed-off-by: Chris Mason <c...@fb.com>
---

v1->v2 Fixed (int) cast in calc_reclaim_items_nr (thanks Holger Hoffstätte)

 fs/btrfs/dev-replace.c  |  4 ++--
 fs/btrfs/extent-tree.c  | 12 ++--
 fs/btrfs/ioctl.c|  2 +-
 fs/btrfs/ordered-data.c | 17 -
 fs/btrfs/ordered-data.h |  4 ++--
 fs/btrfs/qgroup.c   |  2 +-
 fs/btrfs/relocation.c   |  2 +-
 fs/btrfs/scrub.c|  2 +-
 fs/btrfs/super.c|  2 +-
 fs/btrfs/transaction.c  |  2 +-
 10 files changed, 24 insertions(+), 25 deletions(-)

diff --git a/fs/btrfs/dev-replace.c b/fs/btrfs/dev-replace.c
index 5fe1ca8..bee3ede 100644
--- a/fs/btrfs/dev-replace.c
+++ b/fs/btrfs/dev-replace.c
@@ -388,7 +388,7 @@ int btrfs_dev_replace_start(struct btrfs_fs_info *fs_info,
if (ret)
btrfs_err(fs_info, "kobj add dev failed %d", ret);
 
-   btrfs_wait_ordered_roots(fs_info, -1, 0, (u64)-1);
+   btrfs_wait_ordered_roots(fs_info, U64_MAX, 0, (u64)-1);
 
/* force writing the updated state information to disk */
trans = btrfs_start_transaction(root, 0);
@@ -507,7 +507,7 @@ static int btrfs_dev_replace_finishing(struct btrfs_fs_info 
*fs_info,
mutex_unlock(_replace->lock_finishing_cancel_unmount);
return ret;
}
-   btrfs_wait_ordered_roots(fs_info, -1, 0, (u64)-1);
+   btrfs_wait_ordered_roots(fs_info, U64_MAX, 0, (u64)-1);
 
trans = btrfs_start_transaction(root, 0);
if (IS_ERR(trans)) {
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 33d979e9..705247a 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -4238,7 +4238,7 @@ int btrfs_alloc_data_chunk_ondemand(struct btrfs_inode 
*inode, u64 bytes)
 
if (need_commit > 0) {
btrfs_start_delalloc_roots(fs_info, 0, -1);
-   btrfs_wait_ordered_roots(fs_info, -1, 0,
+   btrfs_wait_ordered_roots(fs_info, U64_MAX, 0,
 (u64)-1);
}
 
@@ -4698,14 +4698,14 @@ static void btrfs_writeback_inodes_sb_nr(struct 
btrfs_fs_info *fs_info,
}
 }
 
-static inline int calc_reclaim_items_nr(struct btrfs_fs_info *fs_info,
+static inline u64 calc_reclaim_items_nr(struct btrfs_fs_info *fs_info,
u64 to_reclaim)
 {
u64 bytes;
-   int nr;
+   u64 nr;
 
bytes = btrfs_calc_trans_metadata_size(fs_info, 1);
-   nr = (int)div64_u64(to_reclaim, bytes);
+   nr = div64_u64(to_reclaim, bytes);
if (!nr)
nr = 1;
return nr;
@@ -4725,15 +4725,15 @@ static void shrink_delalloc(struct btrfs_root *root, 
u64 to_reclaim, u64 orig,
struct btrfs_trans_handle *trans;
u64 delalloc_bytes;
u64 max_reclaim;
+   u64 items;
long time_left;
unsigned long nr_pages;
int loops;
-   int items;
enum btrfs_reserve_flush_enum flush;
 
/* Calc the number of the pages we need flush for space reservation */
items = calc_reclaim_items_nr(fs_info, to_reclaim);
-   to_reclaim = (u64)items * EXTENT_SIZE_PER_ITEM;
+   to_reclaim = items * EXTENT_SIZE_PER_ITEM;
 
trans = (struct btrfs_trans_handle *)current->journal_info;
block_rsv = _info->delalloc_block_rsv;
diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index e176375..7712217 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -689,7 +689,7 @@ static int create_snapshot(struct btrfs_root *root, struct 
inode *dir,
if (ret)
goto dec_and_free;
 
-   btrfs_wait_ordered_extents(root, -1, 0, (u64)-1);
+   btrfs_wait_ordered_extents(root, U64_MAX, 0, (u64)-1);
 
btrfs_init_block_rsv(_snapshot->block_rsv,
 BTRFS_BLOCK_RSV_TEMP);
diff --git a/fs/btrfs/ordered-data.c b/fs/btrfs/ordered-data.c
index 7b40e2e..a3aca49 100644
--- a/fs/btrfs/ordered-data.c
+++ b/fs/btrfs/ordered-data.c
@@ -663,7 +663,7 @@ static void btrfs_run_ordered_extent_work(struct btrfs_work 
*work)
  * wait for all the ordered extents in a root.  This is done when balancing
  * space between drives.
  */
-int btrfs_wait_ordered_extents(struct btrfs_root *root, int nr,
+u64 bt

Re: [PATCH] btrfs: fix integer overflow in calc_reclaim_items_nr

2017-06-23 Thread Chris Mason



On 06/23/2017 11:29 AM, Holger Hoffstätte wrote:

On 06/23/17 16:32, Chris Mason wrote:
[..]

-static inline int calc_reclaim_items_nr(struct btrfs_fs_info *fs_info,
+static inline u64 calc_reclaim_items_nr(struct btrfs_fs_info *fs_info,
u64 to_reclaim)
 {
u64 bytes;
-   int nr;
+   u64 nr;

bytes = btrfs_calc_trans_metadata_size(fs_info, 1);
nr = (int)div64_u64(to_reclaim, bytes);

^^

Isn't this a bit odd? I can't even tell whether it matters, just randomly
noticed it because I genuinely dislike type casts..



Crud, yes that's broken.  Thanks!

-chris

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] btrfs: fix integer overflow in calc_reclaim_items_nr

2017-06-23 Thread Chris Mason
Dave Jones hit a WARN_ON(nr < 0) in btrfs_wait_ordered_roots() with
v4.12-rc6.  This was because commit 70e7af244 made it possible for
calc_reclaim_items_nr() to return a negative number.  It's not really a
bug in that commit, it just didn't go far enough down the stack to find
all the possible 64->32 bit overflows.

This switches calc_reclaim_items_nr() to return a u64 and changes everyone
that uses the results of that math to u64 as well.

Reported-by: Dave Jones <da...@codemonkey.org.uk>
Fixes: 70e7af244f24c94604ef6eca32ad297632018583
Signed-off-by: Chris Mason <c...@fb.com>
---
 fs/btrfs/dev-replace.c  |  4 ++--
 fs/btrfs/extent-tree.c  | 10 +-
 fs/btrfs/ioctl.c|  2 +-
 fs/btrfs/ordered-data.c | 17 -
 fs/btrfs/ordered-data.h |  4 ++--
 fs/btrfs/qgroup.c   |  2 +-
 fs/btrfs/relocation.c   |  2 +-
 fs/btrfs/scrub.c|  2 +-
 fs/btrfs/super.c|  2 +-
 fs/btrfs/transaction.c  |  2 +-
 10 files changed, 23 insertions(+), 24 deletions(-)

diff --git a/fs/btrfs/dev-replace.c b/fs/btrfs/dev-replace.c
index 5fe1ca8..bee3ede 100644
--- a/fs/btrfs/dev-replace.c
+++ b/fs/btrfs/dev-replace.c
@@ -388,7 +388,7 @@ int btrfs_dev_replace_start(struct btrfs_fs_info *fs_info,
if (ret)
btrfs_err(fs_info, "kobj add dev failed %d", ret);
 
-   btrfs_wait_ordered_roots(fs_info, -1, 0, (u64)-1);
+   btrfs_wait_ordered_roots(fs_info, U64_MAX, 0, (u64)-1);
 
/* force writing the updated state information to disk */
trans = btrfs_start_transaction(root, 0);
@@ -507,7 +507,7 @@ static int btrfs_dev_replace_finishing(struct btrfs_fs_info 
*fs_info,
mutex_unlock(_replace->lock_finishing_cancel_unmount);
return ret;
}
-   btrfs_wait_ordered_roots(fs_info, -1, 0, (u64)-1);
+   btrfs_wait_ordered_roots(fs_info, U64_MAX, 0, (u64)-1);
 
trans = btrfs_start_transaction(root, 0);
if (IS_ERR(trans)) {
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 33d979e9..d675324 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -4238,7 +4238,7 @@ int btrfs_alloc_data_chunk_ondemand(struct btrfs_inode 
*inode, u64 bytes)
 
if (need_commit > 0) {
btrfs_start_delalloc_roots(fs_info, 0, -1);
-   btrfs_wait_ordered_roots(fs_info, -1, 0,
+   btrfs_wait_ordered_roots(fs_info, U64_MAX, 0,
 (u64)-1);
}
 
@@ -4698,11 +4698,11 @@ static void btrfs_writeback_inodes_sb_nr(struct 
btrfs_fs_info *fs_info,
}
 }
 
-static inline int calc_reclaim_items_nr(struct btrfs_fs_info *fs_info,
+static inline u64 calc_reclaim_items_nr(struct btrfs_fs_info *fs_info,
u64 to_reclaim)
 {
u64 bytes;
-   int nr;
+   u64 nr;
 
bytes = btrfs_calc_trans_metadata_size(fs_info, 1);
nr = (int)div64_u64(to_reclaim, bytes);
@@ -4725,15 +4725,15 @@ static void shrink_delalloc(struct btrfs_root *root, 
u64 to_reclaim, u64 orig,
struct btrfs_trans_handle *trans;
u64 delalloc_bytes;
u64 max_reclaim;
+   u64 items;
long time_left;
unsigned long nr_pages;
int loops;
-   int items;
enum btrfs_reserve_flush_enum flush;
 
/* Calc the number of the pages we need flush for space reservation */
items = calc_reclaim_items_nr(fs_info, to_reclaim);
-   to_reclaim = (u64)items * EXTENT_SIZE_PER_ITEM;
+   to_reclaim = items * EXTENT_SIZE_PER_ITEM;
 
trans = (struct btrfs_trans_handle *)current->journal_info;
block_rsv = _info->delalloc_block_rsv;
diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index e176375..7712217 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -689,7 +689,7 @@ static int create_snapshot(struct btrfs_root *root, struct 
inode *dir,
if (ret)
goto dec_and_free;
 
-   btrfs_wait_ordered_extents(root, -1, 0, (u64)-1);
+   btrfs_wait_ordered_extents(root, U64_MAX, 0, (u64)-1);
 
btrfs_init_block_rsv(_snapshot->block_rsv,
 BTRFS_BLOCK_RSV_TEMP);
diff --git a/fs/btrfs/ordered-data.c b/fs/btrfs/ordered-data.c
index 7b40e2e..a3aca49 100644
--- a/fs/btrfs/ordered-data.c
+++ b/fs/btrfs/ordered-data.c
@@ -663,7 +663,7 @@ static void btrfs_run_ordered_extent_work(struct btrfs_work 
*work)
  * wait for all the ordered extents in a root.  This is done when balancing
  * space between drives.
  */
-int btrfs_wait_ordered_extents(struct btrfs_root *root, int nr,
+u64 btrfs_wait_ordered_extents(struct btrfs_root *root, u64 nr,
   const u64 range_start, const u64 range_len)
 {
struct btrfs_fs_info *fs_info = root->fs_info;
@@ -671,7 +671,7 @@ int btrfs_wait_ordered_extents(struct btrfs_

Re: [PATCH 1/2] btrfs: account for pinned bytes and bytes_may_use in should_alloc_chunk

2017-06-21 Thread Chris Mason



On 06/21/2017 05:08 PM, Jeff Mahoney wrote:

On 6/21/17 4:31 PM, Chris Mason wrote:

On 06/21/2017 04:14 PM, Jeff Mahoney wrote:

On 6/14/17 11:44 AM, je...@suse.com wrote:

From: Jeff Mahoney <je...@suse.com>

In a heavy write scenario, we can end up with a large number of pinned
bytes.  This can translate into (very) premature ENOSPC because pinned
bytes must be accounted for when allowing a reservation but aren't
accounted for when deciding whether to create a new chunk.

This patch adds the accounting to should_alloc_chunk so that we can
create the chunk.

Signed-off-by: Jeff Mahoney <je...@suse.com>
---
 fs/btrfs/extent-tree.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index cb0b924..d027807 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -4389,7 +4389,7 @@ static int should_alloc_chunk(struct
btrfs_fs_info *fs_info,
 {
 struct btrfs_block_rsv *global_rsv = _info->global_block_rsv;
 u64 num_bytes = sinfo->total_bytes - sinfo->bytes_readonly;
-u64 num_allocated = sinfo->bytes_used + sinfo->bytes_reserved;
+u64 num_allocated = sinfo->bytes_used + sinfo->bytes_reserved +
sinfo->bytes_pinned + sinfo->bytes_may_use;
 u64 thresh;

 if (force == CHUNK_ALLOC_FORCE)




Ignore this patch.  It certainly allocates chunks more aggressively, but
it means we end up with a ton of metadata chunks even when we don't have
much metadata.



Josef and I pushed this needle back and forth a bunch of times in the
early days.  I still think we can allocate a few more chunks than we do
now...


I agree.  This patch was to fix an issue that we are seeing during
installation.  It'd stop with ENOSPC with >50GB completely unallocated.
The patch passed the test cases that were failing before but now it's
failing differently.  I was worried this pattern might be the end result:

Data,single: Size:4.00GiB, Used:3.32GiB
   /dev/vde4.00GiB

Metadata,DUP: Size:20.00GiB, Used:204.12MiB
   /dev/vde   40.00GiB

System,DUP: Size:8.00MiB, Used:16.00KiB
   /dev/vde   16.00MiB

This is on a fresh file system with just "cp /usr /mnt" executed.

I'm looking into it a bit more now.


Does this failure still happen with Omar's ENOSPC fix (commit: 
70e7af244f24c94604ef6eca32ad297632018583)


-chris

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/2] btrfs: account for pinned bytes and bytes_may_use in should_alloc_chunk

2017-06-21 Thread Chris Mason

On 06/21/2017 04:14 PM, Jeff Mahoney wrote:

On 6/14/17 11:44 AM, je...@suse.com wrote:

From: Jeff Mahoney 

In a heavy write scenario, we can end up with a large number of pinned
bytes.  This can translate into (very) premature ENOSPC because pinned
bytes must be accounted for when allowing a reservation but aren't
accounted for when deciding whether to create a new chunk.

This patch adds the accounting to should_alloc_chunk so that we can
create the chunk.

Signed-off-by: Jeff Mahoney 
---
 fs/btrfs/extent-tree.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index cb0b924..d027807 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -4389,7 +4389,7 @@ static int should_alloc_chunk(struct btrfs_fs_info 
*fs_info,
 {
struct btrfs_block_rsv *global_rsv = _info->global_block_rsv;
u64 num_bytes = sinfo->total_bytes - sinfo->bytes_readonly;
-   u64 num_allocated = sinfo->bytes_used + sinfo->bytes_reserved;
+   u64 num_allocated = sinfo->bytes_used + sinfo->bytes_reserved + 
sinfo->bytes_pinned + sinfo->bytes_may_use;
u64 thresh;

if (force == CHUNK_ALLOC_FORCE)




Ignore this patch.  It certainly allocates chunks more aggressively, but
it means we end up with a ton of metadata chunks even when we don't have
much metadata.



Josef and I pushed this needle back and forth a bunch of times in the 
early days.  I still think we can allocate a few more chunks than we do 
now...


-chris

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: btrfs_wait_ordered_roots warning triggered

2017-06-21 Thread Chris Mason

On 06/21/2017 11:16 AM, Dave Jones wrote:

WARNING: CPU: 2 PID: 7153 at fs/btrfs/ordered-data.c:753 
btrfs_wait_ordered_roots+0x1a3/0x220
CPU: 2 PID: 7153 Comm: kworker/u8:7 Not tainted 4.12.0-rc6-think+ #4
Workqueue: events_unbound btrfs_async_reclaim_metadata_space
task: 8804f08d5380 task.stack: c9000895c000
RIP: 0010:btrfs_wait_ordered_roots+0x1a3/0x220
RSP: 0018:c9000895fc70 EFLAGS: 00010286
RAX:  RBX: f7efdfb1 RCX: 25c3
RDX: 880507c0cde0 RSI:  RDI: 0002
RBP: c9000895fce8 R08:  R09: 0001
R10: c9000895fbd8 R11: 8804f08d5380 R12: 8804f4930008
R13: 0001 R14: 8804f36f R15: 8804f4930850
FS:  () GS:880507c0() knlGS:
CS:  0010 DS:  ES:  CR0: 80050033
CR2: 7fd40c4a1bf2 CR3: 00013995e000 CR4: 001406e0
DR0: 7f17be89 DR1:  DR2: 
DR3:  DR6: 0ff0 DR7: 0600
Call Trace:
 flush_space+0x285/0x6e0
 btrfs_async_reclaim_metadata_space+0xd7/0x420
 process_one_work+0x24d/0x680
 worker_thread+0x4e/0x3b0
 kthread+0x117/0x150
 ? process_one_work+0x680/0x680
 ? kthread_create_on_node+0x70/0x70
 ret_from_fork+0x27/0x40


Thanks Dave, which trinity command line triggered this?

-chris
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[GIT PULL] Btrfs

2017-06-10 Thread Chris Mason
Hi Linus,

My for-linus-4.12 branch has some fixes that Dave Sterba collected:

git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs.git 
for-linus-4.12

We've been hitting an early enospc problem on production machines that
Omar tracked down to an old int->u64 mistake.  I waited a bit on
this pull to make sure it was really the problem from production,
but it's on ~2100 hosts now and I think we're good.

Omar also noticed a commit in the queue would make new early ENOSPC
problems.  I pulled that out for now, which is why the top three commits
are younger than the rest.

Otherwise these are all fixes, some explaining very old bugs that we've
been poking at for a while.

Jeff Mahoney (2) commits (+4/-3):
btrfs: fix race with relocation recovery and fs_root setup (+3/-3)
btrfs: fix memory leak in update_space_info failure path (+1/-0)

Liu Bo (1) commits (+1/-1):
Btrfs: clear EXTENT_DEFRAG bits in finish_ordered_io

Colin Ian King (1) commits (+1/-1):
btrfs: fix incorrect error return ret being passed to mapping_set_error

Omar Sandoval (1) commits (+2/-2):
Btrfs: fix delalloc accounting leak caused by u32 overflow

Qu Wenruo (1) commits (+122/-2):
btrfs: fiemap: Cache and merge fiemap extent before submit it to user

David Sterba (1) commits (+2/-2):
btrfs: use correct types for page indices in btrfs_page_exists_in_range

Jan Kara (1) commits (+6/-4):
btrfs: Make flush bios explicitely sync

Su Yue (1) commits (+1/-1):
btrfs: tree-log.c: Wrong printk information about namelen

Total: (9) commits (+139/-16)

 fs/btrfs/ctree.h   |   4 +-
 fs/btrfs/dir-item.c|   2 +-
 fs/btrfs/disk-io.c |  10 ++--
 fs/btrfs/extent-tree.c |   7 +--
 fs/btrfs/extent_io.c   | 126 +++--
 fs/btrfs/inode.c   |   6 +--
 6 files changed, 139 insertions(+), 16 deletions(-)
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PULL] Btrfs fixes for 4.12

2017-06-09 Thread Chris Mason

On 06/05/2017 01:47 PM, David Sterba wrote:

Hi,

please pull the following assorted fixes to 4.12. Thanks.


The following changes since commit 9bcaaea7418d09691f1ffab5c49aacafe3eef9d0:

  btrfs: fix the gfp_mask for the reada_zones radix tree (2017-05-04 16:56:11 
-0700)

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux.git for-chris-4.12

for you to fetch changes up to 2eec59665d545ce443184b9c7a32a65b4dde6d4c:

  Btrfs: clear EXTENT_DEFRAG bits in finish_ordered_io (2017-06-01 17:00:11 
+0200)


Colin Ian King (1):
  btrfs: fix incorrect error return ret being passed to mapping_set_error

David Sterba (1):
  btrfs: use correct types for page indices in btrfs_page_exists_in_range

Jan Kara (1):
  btrfs: Make flush bios explicitely sync

Jeff Mahoney (2):
  btrfs: fix memory leak in update_space_info failure path
  btrfs: fix race with relocation recovery and fs_root setup

Liu Bo (2):
  Btrfs: skip commit transaction if we don't have enough pinned bytes


Thanks Dave, I've been bashing on these along with Omar's enospc fix for 
the last few days.


Omar mentioned that until we fix the pinned bytes counter we should 
leave out this commit from Liu Bo, so I've rebased it out for now.


-chris
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: btrfs list corruption and soft lockups while testing writeback error handling

2017-05-11 Thread Chris Mason

On 05/11/2017 03:52 PM, Jeff Layton wrote:

On Thu, 2017-05-11 at 07:13 -0400, Jeff Layton wrote:

I finally got my writeback error handling test to work on btrfs (thanks,
Chris!), by making the filesystem stripe the data and mirror the
metadata across two devices. The test passes now, but on one run, I got
the following list corruption warning and then a soft lockup (which is
probably fallout from the list corruption).

I ran the test several times before and since then without this failure,
so I don't have a clear reproducer. The kernel in this instance is
basically a v4.11 kernel with my pile of writeback error handling
patches on top:


https://urldefense.proofpoint.com/v2/url?u=https-3A__git.samba.org_-3Fp-3Djlayton_linux.git-3Ba-3Dshortlog-3Bh-3Drefs_heads_wberr=DwICaQ=5VD0RTtNlTh3ycd41b3MUw=9QPtTAxcitoznaWRKKHoEQ=BXXwaUFQNFNaGGFYHEVlvNBwkrXiIoH7K5iOdR_PvxM=xE6pIXeQ1rlaxAV8aTYBSiI06pb3WZoiRJW8Vo1L3NQ=

It may be that they are a contributing factor, but this smells more like
a bug down in btrfs. Let me know if you need other info:


[ btrfs inode logging ]


(cc'ing Liu Bo since we were discussing this earlier this week)

I can't reproduce this on stock v4.11, so I think this is a bug in my
series.

I think this is due to the differences in how errors are being reported
from filemap_fdatawait_range now causing some transactions to end up
being freed while they're still on the log_ctxs list. I'm working on
hunting down the problem now.

Sorry for the noise!



There's a list in the inode logging code that we consistently seem to 
find list debugging assertions with.  We've fixed up all the known 
issues, but I wouldn't be surprised if we've got a goto fail in there.


I'll take a look ;)

-chris
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [GIT PULL] Btrfs

2017-05-09 Thread Chris Mason
On 05/09/2017 01:56 PM, Chris Mason wrote:
> Hi Linus,
> 
> My for-linus-4.12 branch:
> 
> git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs.git 
> for-linus-4.12

I hit send too soon, sorry.  There's a trivial conflict with our WARN_ON
fix that went into 4.11.  I pushed the resolution to
for-linus-4.12-merged.

diff --cc fs/btrfs/qgroup.c
index afbea61,3f75b5c..deffbeb
--- a/fs/btrfs/qgroup.c
+++ b/fs/btrfs/qgroup.c
@@@ -1078,7 -1031,8 +1034,8 @@@ static int __qgroup_excl_accounting(str
qgroup->excl += sign * num_bytes;
qgroup->excl_cmpr += sign * num_bytes;
if (sign > 0) {
+   trace_qgroup_update_reserve(fs_info, qgroup, -(s64)num_bytes);
 -  if (WARN_ON(qgroup->reserved < num_bytes))
 +  if (qgroup->reserved < num_bytes)
report_reserved_underflow(fs_info, qgroup, num_bytes);
else
qgroup->reserved -= num_bytes;
@@@ -1103,7 -1057,9 +1060,9 @@@
WARN_ON(sign < 0 && qgroup->excl < num_bytes);
qgroup->excl += sign * num_bytes;
if (sign > 0) {
+   trace_qgroup_update_reserve(fs_info, qgroup,
+   -(s64)num_bytes);
 -  if (WARN_ON(qgroup->reserved < num_bytes))
 +  if (qgroup->reserved < num_bytes)
report_reserved_underflow(fs_info, qgroup,
  num_bytes);
else
@@@ -2472,7 -2451,8 +2454,8 @@@ void btrfs_qgroup_free_refroot(struct b
  
qg = unode_aux_to_qgroup(unode);
  
+   trace_qgroup_update_reserve(fs_info, qg, -(s64)num_bytes);
 -  if (WARN_ON(qg->reserved < num_bytes))
 +  if (qg->reserved < num_bytes)
report_reserved_underflow(fs_info, qg, num_bytes);
else
qg->reserved -= num_bytes;
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[GIT PULL] Btrfs

2017-05-09 Thread Chris Mason
 bdev_get_queue (+3/-4)
btrfs: check if the device is flush capable (+4/-0)
btrfs: delete unused member nobarriers (+0/-4)

Edmund Nadolski (2) commits (+25/-20):
btrfs: provide enumeration for __merge_refs mode argument (+13/-10)
btrfs: replace hardcoded value with SEQ_LAST macro (+12/-10)

Goldwyn Rodrigues (2) commits (+24/-3):
btrfs: qgroups: Retry after commit on getting EDQUOT (+23/-1)
btrfs: No need to check !(flags & MS_RDONLY) twice (+1/-2)

Chris Mason (1) commits (+2/-2):
btrfs: fix the gfp_mask for the reada_zones radix tree

Adam Borowski (1) commits (+9/-3):
btrfs: fix a bogus warning when converting only data or metadata

Deepa Dinamani (1) commits (+2/-1):
btrfs: Use ktime_get_real_ts for root ctime

Dan Carpenter (1) commits (+15/-26):
Btrfs: handle only applicable errors returned by btrfs_get_extent

Dmitry V. Levin (1) commits (+2/-0):
MAINTAINERS: add btrfs file entries for include directories

Hans van Kranenburg (1) commits (+5/-5):
Btrfs: consistent usage of types in balance_args

Total: (71) commits

 MAINTAINERS  |   2 +
 fs/btrfs/backref.c   |  41 ++-
 fs/btrfs/btrfs_inode.h   |   7 +
 fs/btrfs/compression.c   |  18 +-
 fs/btrfs/ctree.c |  20 +-
 fs/btrfs/ctree.h |  34 +-
 fs/btrfs/delayed-inode.c |  46 +--
 fs/btrfs/delayed-inode.h |   6 +-
 fs/btrfs/delayed-ref.c   |   8 +-
 fs/btrfs/delayed-ref.h   |   8 +-
 fs/btrfs/dev-replace.c   |   9 +-
 fs/btrfs/disk-io.c   |  13 +-
 fs/btrfs/disk-io.h   |   4 +-
 fs/btrfs/extent-tree.c   |  35 +-
 fs/btrfs/extent_io.c |  59 +--
 fs/btrfs/extent_io.h |   8 +-
 fs/btrfs/extent_map.c|  10 +-
 fs/btrfs/extent_map.h|   3 +-
 fs/btrfs/file.c  |  82 -
 fs/btrfs/free-space-cache.c  |   2 +-
 fs/btrfs/inode.c | 289 +++
 fs/btrfs/ioctl.c |  33 +-
 fs/btrfs/ordered-data.c  |  20 +-
 fs/btrfs/ordered-data.h  |   2 +-
 fs/btrfs/qgroup.c| 102 ++
 fs/btrfs/qgroup.h|  51 ++-
 fs/btrfs/raid56.c|  38 +-
 fs/btrfs/reada.c |  37 +-
 fs/btrfs/root-tree.c |   3 +-
 fs/btrfs/scrub.c | 331 +++--
 fs/btrfs/send.c  |  23 +-
 fs/btrfs/super.c |   3 +-
 fs/btrfs/tests/btrfs-tests.c |   1 -
 fs/btrfs/transaction.c   |  48 ++-
 fs/btrfs/transaction.h   |   6 +-
 fs/btrfs/tree-log.c  |   2 +-
 fs/btrfs/volumes.c   | 854 +++
 fs/btrfs/volumes.h   |   8 +-
 include/trace/events/btrfs.h | 187 +-
 include/uapi/linux/btrfs.h   |  10 +-
 40 files changed, 1629 insertions(+), 834 deletions(-)
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] btrfs: always write superblocks synchronously

2017-05-03 Thread Chris Mason



On 05/03/2017 04:36 AM, Jan Kara wrote:

On Tue 02-05-17 09:28:13, Davidlohr Bueso wrote:

Commit b685d3d65ac7 "block: treat REQ_FUA and REQ_PREFLUSH as
synchronous" removed REQ_SYNC flag from WRITE_FUA implementation.
Since REQ_FUA and REQ_FLUSH flags are stripped from submitted IO
when the disk doesn't have volatile write cache and thus effectively
make the write async. This was seen to cause performance hits up
to 90% regression in disk IO related benchmarks such as reaim and
dbench[1].

Fix the problem by making sure the first superblock write is also
treated as synchronous since they can block progress of the
journalling (commit, log syncs) machinery and thus the whole filesystem.





Fixes: b685d3d65ac (block: treat REQ_FUA and REQ_PREFLUSH as synchronous)
Cc: stable 
Cc: Jan Kara 
Signed-off-by: Davidlohr Bueso 


I wasn't patient enough and already sent the fix as part of my series
fixing other filesystems [1]. It also fixes one more place in btrfs that
needs REQ_SYNC to return to the original behavior.




Thanks guys.

-chris

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[GIT PULL] Btrfs

2017-04-27 Thread Chris Mason

Hi Linus,

We have one more for btrfs:

git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs.git 
for-linus-4.11

This is dropping a new WARN_ON from rc1 that ended up making more noise 
than we really want.  The larger fix for the underflow got delayed a bit 
and it's better for now to put it under CONFIG_BTRFS_DEBUG.


David Sterba (1) commits (+7/-4):
   btrfs: qgroup: move noisy underflow warning to debugging build

Total: (1) commits (+7/-4)

fs/btrfs/qgroup.c | 11 +++
1 file changed, 7 insertions(+), 4 deletions(-)
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [GIT PULL] Btrfs bug fixes for 4.12

2017-04-26 Thread Chris Mason

On 04/26/2017 01:52 PM, fdman...@kernel.org wrote:

From: Filipe Manana 

Hi Chris,

Please consider the following changes for the 4.12 merge window.
These are all bug fixes and nothing particularly outstanding compared to
changes for past merge windows.

Thanks.

The following changes since commit a967efb30b3afa3d858edd6a17f544f9e9e46eea:

  Btrfs: fix potential use-after-free for cloned bio (2017-04-11 18:49:56 +0200)

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/fdmanana/linux.git 
for-chris-4.12

for you to fetch changes up to a7e3b975a0f9296162b72ac6ab7fad9631a07630:

  Btrfs: fix reported number of inode blocks (2017-04-26 16:27:26 +0100)


Filipe Manana (5):
  Btrfs: fix invalid attempt to free reserved space on failure to cow range
  Btrfs: fix incorrect space accounting after failure to insert inline 
extent
  Btrfs: fix extent map leak during fallocate error path
  Btrfs: send, fix file hole not being preserved due to inline extent
  Btrfs: fix reported number of inode blocks

Qu Wenruo (2):
  btrfs: Fix metadata underflow caused by btrfs_reloc_clone_csum error
  btrfs: Handle delalloc error correctly to avoid ordered extent hang


Thanks Filipe, pulling this in.

-chris

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PULL] Btrfs, updates for 4.12

2017-04-26 Thread Chris Mason
On 04/26/2017 11:06 AM, Filipe Manana wrote:
 
> Hi,
> 
> Did you actually ran xfstests with those readahead patches to
> preallocate radix tree nodes?
> 
> With those 2 patches applied (Chris' for-linus.4,12 branch) this
> breaks things and many btrfs specific tests (at least, since I can't
> get pass them) result in tons of traces like the following in a debug
> kernel:
> 
> [ 8180.696804] BUG: sleeping function called from invalid context at
> mm/slab.h:432
> [ 8180.703584] in_atomic(): 1, irqs_disabled(): 0, pid: 28583, name: btrfs
> [ 8180.724146] 2 locks held by btrfs/28583:
> [ 8180.726427]  #0:  (sb_writers#12){.+.+.+}, at: []
> mnt_want_write_file+0x25/0x4d
> [ 8180.736742]  #1:  (&(_info->reada_lock)->rlock){+.+.+.}, at:
> [] reada_add_block+0x2fe/0x6cd [btrfs]
> [ 8180.766321] Preemption disabled at:
> [ 8180.766326] [] preempt_count_add+0x65/0x68
> [ 8180.794837] CPU: 5 PID: 28583 Comm: btrfs Tainted: GW
> 4.11.0-rc8-btrfs-next-39+ #1
> [ 8180.798818] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
> BIOS rel-1.9.1-0-gb3ef39f-prebuilt.qemu-project.org 04/01/2014
> [ 8180.798818] Call Trace:
> [ 8180.798818]  dump_stack+0x68/0x92
> [ 8180.798818]  ? preempt_count_add+0x65/0x68
> [ 8180.798818]  ___might_sleep+0x20f/0x226
> [ 8180.798818]  __might_sleep+0x77/0x7e
> [ 8180.798818]  slab_pre_alloc_hook+0x32/0x4f
> [ 8180.798818]  kmem_cache_alloc+0x39/0x233
> [ 8180.798818]  ? radix_tree_node_alloc.constprop.12+0x9d/0xdf
> [ 8180.798818]  radix_tree_node_alloc.constprop.12+0x9d/0xdf
> [ 8180.798818]  __radix_tree_create+0xc3/0x143
> [ 8180.798818]  __radix_tree_insert+0x32/0xc0
> [ 8180.798818]  reada_add_block+0x318/0x6cd [btrfs]

So radix_tree_preload doesn't work the way I thought it did.  It populates a 
per-cpu pool of radix tree nodes so the allocation is sure not to fail.

But, when we go to actually allocate the node during radix_tree_insert:


static struct radix_tree_node * 
radix_tree_node_alloc(gfp_t gfp_mask, struct radix_tree_node *parent,   
struct radix_tree_root *root,   
unsigned int shift, unsigned int offset,
unsigned int count, unsigned int exceptional)   
{   
struct radix_tree_node *ret = NULL; 

/*  
 * Preload code isn't irq safe and it doesn't make sense to use 
 * preloading during an interrupt anyway as all the allocations have
 * to be atomic. So just do normal allocation when in interrupt.
 */ 
if (!gfpflags_allow_blocking(gfp_mask) && !in_interrupt()) {
struct radix_tree_preload *rtp; 

/*  
 * Even if the caller has preloaded, try to allocate from the   
 * cache first for the new node to get accounted to the memory  
 * cgroup.  
 */ 
ret = kmem_cache_alloc(radix_tree_node_cachep,  
   gfp_mask | __GFP_NOWARN);
if (ret)
goto out;   

/*  
 * Provided the caller has preloaded here, we will always   
 * succeed in getting a node here (and never reach  
 * kmem_cache_alloc)
 */ 
rtp = this_cpu_ptr(_tree_preloads);   
if (rtp->nr) {  
ret = rtp->nodes;   
rtp->nodes = ret->parent;   
rtp->nr--;  
}   
/*  
 * 

Re: [PULL] Btrfs, updates for 4.12

2017-04-26 Thread Chris Mason



On 04/26/2017 11:06 AM, Filipe Manana wrote:


Hi,

Did you actually ran xfstests with those readahead patches to
preallocate radix tree nodes?

With those 2 patches applied (Chris' for-linus.4,12 branch) this
breaks things and many btrfs specific tests (at least, since I can't
get pass them) result in tons of traces like the following in a debug
kernel:


Huh, I did and these didn't come up.  I'll double check I have 
preemption enabled.


-chris
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PULL] Btrfs, qgroup message level adjustment, for 4.11-rc8

2017-04-25 Thread Chris Mason

On 04/25/2017 10:21 AM, David Sterba wrote:

On Wed, Apr 19, 2017 at 12:51:03PM +0200, David Sterba wrote:

Hi,

a single-patch pull request, the qgroup use can trigger an underflow warning
frequently. The warning is for debugging and should not be in the final release
of 4.11 as we won't be able to fix it.


Sigh, I looked for this pull after you mentioned it last week, couldn't 
find it and assumed you held off.  Sorry, maybe Josef has the right idea 
ditching exchange.


I'll send today.

-chris

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[GIT PULL] Btrfs

2017-04-14 Thread Chris Mason

Hi Linus

Dave Sterba collected a few more fixes for the last rc:

git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs.git 
for-linus-4.11

These aren't marked for stable, but I'm putting them in with a batch 
were testing/sending by hand for this release.


Liu Bo (3) commits (+11/-13):
   Btrfs: fix invalid dereference in btrfs_retry_endio (+4/-10)
   Btrfs: fix potential use-after-free for cloned bio (+1/-1)
   Btrfs: fix segmentation fault when doing dio read (+6/-2)

Adam Borowski (1) commits (+3/-0):
   btrfs: drop the nossd flag when remounting with -o ssd

Total: (4) commits (+14/-13)

fs/btrfs/inode.c   | 22 ++
fs/btrfs/super.c   |  3 +++
fs/btrfs/volumes.c |  2 +-
3 files changed, 14 insertions(+), 13 deletions(-)
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[GIT PULL] Btrfs

2017-03-31 Thread Chris Mason
Hi Linus,

We have 3 small fixes queued up in my for-linus-4.11 branch:

git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs.git 
for-linus-4.11

Goldwyn Rodrigues (1) commits (+7/-7):
btrfs: Change qgroup_meta_rsv to 64bit

Dan Carpenter (1) commits (+6/-1):
Btrfs: fix an integer overflow check

Liu Bo (1) commits (+31/-21):
Btrfs: bring back repair during read

Total: (3) commits (+44/-29)

 fs/btrfs/ctree.h |  2 +-
 fs/btrfs/disk-io.c   |  2 +-
 fs/btrfs/extent_io.c | 46 --
 fs/btrfs/inode.c |  6 +++---
 fs/btrfs/qgroup.c| 10 +-
 fs/btrfs/send.c  |  7 ++-
 6 files changed, 44 insertions(+), 29 deletions(-)
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[GIT PULL] Btrfs

2017-03-23 Thread Chris Mason

Hi Linus

We have a small set of fixes for the next RC:

git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs.git 
for-linus-4.11

Zygo tracked down a very old bug with inline compressed extents.
I didn't tag this one for stable because I want to do individual tested 
backports.  It's a little tricky and I'd rather do some extra testing

on it along the way.

Otherwise they are pretty obvious:

Liu Bo (1) commits (+2/-1):
   Btrfs: fix regression in lock_delalloc_pages

Dmitry V. Levin (1) commits (+0/-27):
   btrfs: remove btrfs_err_str function from uapi/linux/btrfs.h

Zygo Blaxell (1) commits (+14/-0):
   btrfs: add missing memset while reading compressed inline extents

Total: (3) commits (+16/-28)

fs/btrfs/extent_io.c   |  3 ++-
fs/btrfs/inode.c   | 14 ++
include/uapi/linux/btrfs.h | 27 ---
3 files changed, 16 insertions(+), 28 deletions(-)
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PULL] Btrfs updates for 4.11-rc2

2017-03-17 Thread Chris Mason

On 03/09/2017 08:55 AM, David Sterba wrote:

Hi,

there's a regression fix for the assertion failure reported by Dave Jones, the
rest of patches are minor updates. Please pull, thanks.

The following changes since commit e9f467d028cd7d8bee2a4d6c4fb806caf8cd580b:

  Merge branch 'for-chris-4.11-part2' of 
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux into for-linus-4.11 
(2017-02-28 14:35:09 -0800)

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux.git 
for-chris-4.11-rc2

for you to fetch changes up to 94b5a2924fc0ec2bb2347ca31156be8fabe907c4:

  Btrfs: consistent usage of types in balance_args (2017-03-09 14:23:00 +0100)



Thanks Dave, these all tested well all week, but a few were arguably 
cleanups.  I've pull out the fixes only for this time and I'll queue the 
others for 4.12.


-chris

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v3] btrfs: add missing memset while reading compressed inline extents

2017-03-10 Thread Chris Mason



On 03/10/2017 01:56 PM, Zygo Blaxell wrote:

On Fri, Mar 10, 2017 at 11:19:24AM -0500, Chris Mason wrote:

On 03/09/2017 11:41 PM, Zygo Blaxell wrote:

On Thu, Mar 09, 2017 at 10:39:49AM -0500, Chris Mason wrote:



On 03/08/2017 09:12 PM, Zygo Blaxell wrote:

This is a story about 4 distinct (and very old) btrfs bugs.



Really great write up.

[ ... ]



diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 25ac2cf..4d41a31 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -6805,6 +6805,12 @@ static noinline int uncompress_inline(struct btrfs_path 
*path,
max_size = min_t(unsigned long, PAGE_SIZE, max_size);
ret = btrfs_decompress(compress_type, tmp, page,
   extent_offset, inline_size, max_size);
+   WARN_ON(max_size + pg_offset > PAGE_SIZE);


Can you please drop this WARN_ON and make the math reflect any possible
pg_offset?  I do agree it shouldn't be happening, but its easy to correct
for and the WARN is likely to get lost.


I'm not sure how to do that.  It looks like I'd have to pass pg_offset
through btrfs_decompress to the decompress functions?

ret = btrfs_decompress(compress_type, tmp, page,
extent_offset, inline_size, max_size, pg_offset);

and in the compression functions get pg_offset from the argument list
instead of hardcoding zero.


Yeah, it's a good point.  Both zlib and lzo are assuming a zero pg_offset
right now, but just like there are wacky corners allowing inline extents
followed by more data, there are a few wacky corners allowing inline extents
at the end of the file.

Lets not mix that change in with this one though.  For now, just get the
memset right and we can pass pg_offset down in a later patch.


Are you saying "fix the memset in the patch" (and if so, what's wrong
with it?), or are you saying "let's take the patch with its memset as is,
and fix the pg_offset > 0 issues later"?


Your WARN_ON() would fire when this math is bad:

memset(map + pg_offset + max_size, 0, PAGE_SIZE - max_size - pg_offset);

Instead or warning, just don't memset if pg_offset + max_size >= PAGE_SIZE

-chris
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v3] btrfs: add missing memset while reading compressed inline extents

2017-03-10 Thread Chris Mason

On 03/09/2017 11:41 PM, Zygo Blaxell wrote:

On Thu, Mar 09, 2017 at 10:39:49AM -0500, Chris Mason wrote:



On 03/08/2017 09:12 PM, Zygo Blaxell wrote:

This is a story about 4 distinct (and very old) btrfs bugs.



Really great write up.

[ ... ]



diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 25ac2cf..4d41a31 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -6805,6 +6805,12 @@ static noinline int uncompress_inline(struct btrfs_path 
*path,
max_size = min_t(unsigned long, PAGE_SIZE, max_size);
ret = btrfs_decompress(compress_type, tmp, page,
   extent_offset, inline_size, max_size);
+   WARN_ON(max_size + pg_offset > PAGE_SIZE);


Can you please drop this WARN_ON and make the math reflect any possible
pg_offset?  I do agree it shouldn't be happening, but its easy to correct
for and the WARN is likely to get lost.


I'm not sure how to do that.  It looks like I'd have to pass pg_offset
through btrfs_decompress to the decompress functions?

ret = btrfs_decompress(compress_type, tmp, page,
extent_offset, inline_size, max_size, pg_offset);

and in the compression functions get pg_offset from the argument list
instead of hardcoding zero.


Yeah, it's a good point.  Both zlib and lzo are assuming a zero 
pg_offset right now, but just like there are wacky corners allowing 
inline extents followed by more data, there are a few wacky corners 
allowing inline extents at the end of the file.


Lets not mix that change in with this one though.  For now, just get the 
memset right and we can pass pg_offset down in a later patch.


-chris
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v3] btrfs: add missing memset while reading compressed inline extents

2017-03-09 Thread Chris Mason



On 03/08/2017 09:12 PM, Zygo Blaxell wrote:

This is a story about 4 distinct (and very old) btrfs bugs.



Really great write up.

[ ... ]



diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 25ac2cf..4d41a31 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -6805,6 +6805,12 @@ static noinline int uncompress_inline(struct btrfs_path 
*path,
max_size = min_t(unsigned long, PAGE_SIZE, max_size);
ret = btrfs_decompress(compress_type, tmp, page,
   extent_offset, inline_size, max_size);
+   WARN_ON(max_size + pg_offset > PAGE_SIZE);


Can you please drop this WARN_ON and make the math reflect any possible 
pg_offset?  I do agree it shouldn't be happening, but its easy to 
correct for and the WARN is likely to get lost.



+   if (max_size + pg_offset < PAGE_SIZE) {
+   char *map = kmap(page);
+   memset(map + pg_offset + max_size, 0, PAGE_SIZE - max_size - 
pg_offset);
+   kunmap(page);
+   }


Both lzo and zlib have a memset to cover the gap between what they 
actually decompress and the max_size that we pass here.  That's 
important because ram_bytes may not be 100% accurate.


Can you also please toss in a comment about how the decompression code 
is responsible for the memset up to max_bytes?


-chris
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/9 PULL REQUEST] Qgroup fixes for 4.11

2017-03-06 Thread Chris Mason

On 03/06/2017 11:44 AM, David Sterba wrote:

On Mon, Mar 06, 2017 at 04:08:41PM +0800, Qu Wenruo wrote:

Any response?

These patches are already here for at least 2 kernel releases.
And are all bug fixes, and fix bugs that are already reported.

I didn't see any reason why it should be delayed for so long time.


The reason is not enough review for the whole series. The delay is
unfortunate, same as almost zero people capable & willing to review it,
despite your efforts to keep the series up to date with current code.



I've been hesitated to take more series of qgroup fixes because past 
sets have ended up causing problems down the line.  I'm queuing it for 
v4.12 while I collect tests and reviews though, it's definitely not fair 
for you to have to keep rebasing it without suggestions on what to fix.


Thanks!

-chris
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[GIT PULL] Btrfs

2017-03-02 Thread Chris Mason
Hi Linus,

My for-linus-4.11 branch:

git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs.git 
for-linus-4.11

Has Btrfs round two.  These are mostly a continuation of Dave Sterba's 
collection
of cleanups, but Filipe also has some bug fixes and performance improvements.

Nikolay Borisov (42) commits (+611/-579):
btrfs: Make lock_and_cleanup_extent_if_need take btrfs_inode (+14/-14)
btrfs: Make btrfs_delalloc_reserve_metadata take btrfs_inode (+39/-38)
btrfs: Make btrfs_extent_item_to_extent_map take btrfs_inode (+10/-8)
btrfs: all btrfs_delalloc_release_metadata take btrfs_inode (+22/-19)
btrfs: make btrfs_inode_resume_unlocked_dio take btrfs_inode (+3/-4)
btrfs: make btrfs_alloc_data_chunk_ondemand take btrfs_inode (+7/-6)
btrfs: make btrfs_inode_block_unlocked_dio take btrfs_inode (+3/-3)
btrfs: Make btrfs_orphan_release_metadata take btrfs_inode (+8/-8)
btrfs: Make btrfs_orphan_reserve_metadata take btrfs_inode (+7/-7)
btrfs: Make check_parent_dirs_for_sync take btrfs_inode (+14/-14)
btrfs: make btrfs_free_io_failure_record take btrfs_inode (+9/-7)
btrfs: Make btrfs_lookup_ordered_range take btrfs_inode (+19/-18)
btrfs: Make (__)btrfs_add_inode_defrag take btrfs_inode (+17/-16)
btrfs: make btrfs_print_data_csum_error take btrfs_inode (+8/-7)
btrfs: make btrfs_is_free_space_inode take btrfs_inode (+20/-19)
btrfs: make btrfs_set_inode_index_count take btrfs_inode (+8/-8)
btrfs: Make btrfs_requeue_inode_defrag take btrfs_inode (+5/-5)
btrfs: Make clone_update_extent_map take btrfs_inode (+13/-14)
btrfs: Make btrfs_mark_extent_written take btrfs_inode (+6/-6)
btrfs: Make btrfs_drop_extent_cache take btrfs_inode (+30/-26)
btrfs: Make calc_csum_metadata_size take btrfs_inode (+12/-15)
btrfs: Make drop_outstanding_extent take btrfs_inode (+11/-12)
btrfs: Make btrfs_del_delalloc_inode take btrfs_inode (+7/-7)
btrfs: make btrfs_log_inode_parent take btrfs_inode (+24/-26)
btrfs: Make btrfs_set_inode_index take btrfs_inode (+13/-13)
btrfs: Make btrfs_clear_bit_hook take btrfs_inode (+25/-21)
btrfs: Make check_extent_to_block take btrfs_inode (+6/-5)
btrfs: make check_compressed_csum take btrfs_inode (+4/-5)
btrfs: Make btrfs_insert_dir_item take btrfs_inode (+7/-7)
btrfs: Make btrfs_log_all_parents take btrfs_inode (+5/-5)
btrfs: Make btrfs_i_size_write take btrfs_inode (+18/-19)
btrfs: make repair_io_failure take btrfs_inode (+12/-11)
btrfs: Make btrfs_orphan_add take btrfs_inode (+24/-22)
btrfs: make btrfs_orphan_del take btrfs_inode (+20/-20)
btrfs: make clean_io_failure take btrfs_inode (+15/-14)
btrfs: Make btrfs_add_nondir take btrfs_inode (+13/-9)
btrfs: make free_io_failure take btrfs_inode (+13/-11)
btrfs: Make check_can_nocow take btrfs_inode (+12/-10)
btrfs: Make btrfs_add_link take btrfs_inode (+26/-23)
btrfs: Make get_extent_t take btrfs_inode (+59/-54)
btrfs: Make hole_mergeable take btrfs_inode (+5/-4)
btrfs: Make fill_holes take btrfs_inode (+18/-19)

David Sterba (16) commits (+139/-124):
btrfs: use predefined limits for calculating maximum number of pages for 
compression (+6/-5)
btrfs: derive maximum output size in the compression implementation (+9/-14)
btrfs: merge nr_pages input and output parameter in compress_pages (+11/-15)
btrfs: merge length input and output parameter in compress_pages (+18/-20)
btrfs: add dummy callback for readpage_io_failed and drop checks (+10/-3)
btrfs: do proper error handling in btrfs_insert_xattr_item (+2/-1)
btrfs: drop checks for mandatory extent_io_ops callbacks (+3/-4)
btrfs: constify device path passed to relevant helpers (+22/-18)
btrfs: document existence of extent_io ops callbacks (+26/-11)
btrfs: handle allocation error in update_dev_stat_item (+2/-1)
btrfs: export compression buffer limits in a header (+15/-10)
btrfs: constify name of subvolume in creation helpers (+3/-3)
btrfs: constify buffers used by compression helpers (+3/-3)
btrfs: remove BUG_ON from __tree_mod_log_insert (+0/-2)
btrfs: constify input buffer of btrfs_csum_data (+3/-3)
btrfs: let writepage_end_io_hook return void (+6/-11)

Filipe Manana (8) commits (+163/-27):
Btrfs: do not create explicit holes when replaying log tree if NO_HOLES 
enabled (+5/-0)
Btrfs: try harder to migrate items to left sibling before splitting a leaf 
(+7/-0)
Btrfs: fix assertion failure when freeing block groups at close_ctree() 
(+9/-6)
Btrfs: incremental send, fix unnecessary hole writes for sparse files 
(+86/-2)
Btrfs: fix use-after-free due to wrong order of destroying work queues 
(+7/-2)
Btrfs: incremental send, do not delay rename when parent inode is new 
(+16/-3)
Btrfs: fix data loss after truncate when using the no-holes feature (+6/-13)
Btrfs: bulk delete checksum items in the same leaf (+27/-1)

Robbie Ko (3) commits 

Re: [PULL] Btrfs cleanups for 4.11, part 2

2017-02-28 Thread Chris Mason



On 02/28/2017 10:09 AM, David Sterba wrote:

Hi,

this is the second half of the 4.11 batch, the rest of the cleanups. Please
pull, thanks.

The following changes since commit 6288d6eabc7505f42dda34a2c2962f91914be3a4:

  Btrfs: use the correct type when creating cow dio extent (2017-02-22 15:55:03 
-0800)

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux.git 
for-chris-4.11-part2

for you to fetch changes up to 20a7db8ab3f2057a518448b1728d504ffadef65e:

  btrfs: add dummy callback for readpage_io_failed and drop checks (2017-02-28 
14:29:24 +0100)



Thanks Dave, I've got this along with Filipe's pull.

-chris

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[GIT PULL] Btrfs

2017-02-24 Thread Chris Mason
Hi Linus,

My for-linus-4.11 branch:

git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs.git 
for-linus-4.11

Has a series of fixes and cleanups that Dave Sterba has been collecting:

There is a pretty big variety here, cleaning up internal APIs and fixing 
corner cases.

David Sterba (46) commits (+235/-313):
 btrfs: remove unused parameter from btrfs_subvolume_release_metadata 
(+6/-11)
 btrfs: remove pointless rcu protection from btrfs_qgroup_inherit (+0/-2)
 btrfs: check quota status earlier and don't do unnecessary frees (+3/-2)
 btrfs: remove unused parameter from btrfs_prepare_extent_commit (+3/-5)
 btrfs: remove unnecessary mutex lock in qgroup_account_snapshot (+1/-5)
 btrfs: embed extent_changeset::range_changed to the structure (+11/-17)
 btrfs: remove unused parameter from cleanup_write_cache_enospc (+2/-3)
 btrfs: remove unused parameters from __btrfs_write_out_cache (+3/-8)
 btrfs: remove unused parameter from clone_copy_inline_extent (+2/-3)
 btrfs: remove unused parameter from extent_write_cache_pages (+2/-4)
 btrfs: remove unused parameter from tree_move_next_or_upnext (+2/-4)
 btrfs: remove unused parameter from btrfs_check_super_valid (+3/-5)
 btrfs: remove unused logic of limiting async delalloc pages (+0/-7)
 btrfs: fix over-80 lines introduced by previous cleanups (+74/-63)
 btrfs: remove unused parameter from read_block_for_search (+5/-5)
 btrfs: remove unused parameter from adjust_slots_upwards (+2/-3)
 btrfs: remove unused parameter from init_first_rw_device (+3/-5)
 btrfs: make space cache inode readahead failure nonfatal (+3/-7)
 btrfs: remove unused parameters from scrub_setup_wr_ctx (+3/-7)
 btrfs: remove unused parameter from __btrfs_alloc_chunk (+4/-6)
 btrfs: add wrapper for counting BTRFS_MAX_EXTENT_SIZE (+23/-31)
 btrfs: remove unused parameter from submit_extent_page (+3/-9)
 btrfs: remove unused parameter from clean_tree_block (+17/-19)
 btrfs: use GFP_KERNEL in btrfs_add/del_qgroup_relation (+2/-2)
 btrfs: remove unused parameter from __add_inline_refs (+2/-3)
 btrfs: remove unused parameter from add_pending_csums (+2/-4)
 btrfs: remove unused parameter from update_nr_written (+4/-4)
 btrfs: remove unused parameter from __push_leaf_right (+2/-3)
 btrfs: remove unused parameter from check_async_write (+2/-2)
 btrfs: remove unused parameter from btrfs_fill_super (+2/-3)
 btrfs: remove unused parameter from __push_leaf_left (+2/-3)
 btrfs: remove unused parameter from write_dev_supers (+3/-3)
 btrfs: remove unused parameter from __add_inode_ref (+1/-2)
 btrfs: remove unused parameters from btrfs_cmp_data (+2/-3)
 btrfs: remove unused parameter from create_snapshot (+2/-2)
 btrfs: ulist: make the finalization function public (+2/-1)
 btrfs: remove unused parameter from tree_move_down (+2/-2)
 btrfs: ulist: rename ulist_fini to ulist_release (+10/-10)
 btrfs: qgroups: make __del_qgroup_relation static (+1/-1)
 btrfs: use GFP_KERNEL in btrfs_read_qgroup_config (+1/-1)
 btrfs: remove unused parameter from split_item (+2/-3)
 btrfs: merge two superblock writing helpers (+4/-11)
 btrfs: qgroups: opencode qgroup_free helper (+9/-9)
 btrfs: use GFP_KERNEL in btrfs_quota_enable (+1/-1)
 btrfs: use GFP_KERNEL in create_snapshot (+2/-2)
 btrfs: remove unused ulist members (+0/-7)

Nikolay Borisov (36) commits (+476/-480):
 btrfs: Make btrfs_delayed_inode_reserve_metadata take btrfs_inode (+8/-8)
 btrfs: Make btrfs_inode_delayed_dir_index_count take btrfs_inode (+5/-5)
 btrfs: Make btrfs_commit_inode_delayed_items take btrfs_inode (+4/-4)
 btrfs: Make btrfs_commit_inode_delayed_inode take btrfs_inode (+6/-6)
 btrfs: Make btrfs_get_or_create_delayed_node take btrfs_inode (+5/-6)
 btrfs: Make btrfs_kill_delayed_inode_items take btrfs_inode (+4/-4)
 btrfs: Make btrfs_delayed_delete_inode_ref take btrfs_inode (+5/-5)
 btrfs: Make btrfs_delete_delayed_dir_index take btrfs_inode (+6/-6)
 btrfs: Make btrfs_insert_delayed_dir_index take btrfs_inode (+5/-5)
 btrfs: Make btrfs_check_ref_name_override take btrfs_inode (+4/-5)
 btrfs: Make btrfs_record_snapshot_destroy take btrfs_inode (+6/-6)
 btrfs: Make btrfs_must_commit_transaction take btrfs_inode (+9/-9)
 btrfs: Make btrfs_del_dir_entries_in_log take btrfs_inode (+7/-7)
 btrfs: Make btrfs_log_changed_extents take btrfs_inode (+11/-11)
 btrfs: Make btrfs_record_unlink_dir take btrfs_inode (+14/-14)
 btrfs: Make btrfs_remove_delayed_node take btrfs_inode (+5/-5)
 btrfs: Make btrfs_get_logged_extents take btrfs_inode (+4/-4)
 btrfs: Make btrfs_log_trailing_hole take btrfs_inode (+4/-4)
 btrfs: Make btrfs_get_delayed_node take btrfs_inode (+8/-9)
 btrfs: Make btrfs_ino take a struct btrfs_inode (+151/-151)
 btrfs: Make log_directory_changes take btrfs_inode (+5/-6)
   

Re: [GIT PULL] Btrfs bug fixes and improvements for 4.11 (2nd retry)

2017-02-24 Thread Chris Mason

On Fri, Feb 24, 2017 at 03:25:09AM +, fdman...@kernel.org wrote:

From: Filipe Manana 

Hi Chris, since my previous pull request (sent timely) was either missed
or not pulled for some reason I'm not aware of, here I send it again (with
one more patch included). The following is taken from the former pull
request:


Hi Filipe, I've got this queued up and will send Monday/Tuesday.

Thanks!

-chris
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[GIT PULL] Btrfs

2017-02-11 Thread Chris Mason
Hi Linus,

My for-linus-4.10 branch:

git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs.git 
for-linus-4.10

Has two last minute fixes.  The highest priority here is a regression 
fix for the decompression code, but we also fixed up a problem with the 
32 bit compat ioctls.

The decompression bug could hand back the wrong data on big reads when 
zlib was used.  I have a larger cleanup to make the math here less error 
prone, but at this stage in the release Omar's patch is the best choice.

Omar Sandoval (1) commits (+24/-15):
 Btrfs: fix btrfs_decompress_buf2page()

Jeff Mahoney (1) commits (+4/-2):
 btrfs: fix btrfs_compat_ioctl failures on non-compat ioctls

Total: (2) commits (+28/-17)

  fs/btrfs/compression.c | 39 ---
  fs/btrfs/ioctl.c   |  6 --
  2 files changed, 28 insertions(+), 17 deletions(-)
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2] Btrfs: fix btrfs_decompress_buf2page()

2017-02-11 Thread Chris Mason

On Fri, Feb 10, 2017 at 08:12:35PM -0800, Pat Erley wrote:



On 02/10/17 15:03, Omar Sandoval wrote:

From: Omar Sandoval 

If btrfs_decompress_buf2page() is handed a bio with its page in the
middle of the working buffer, then we adjust the offset into the working
buffer. After we copy into the bio, we advance the iterator by the
number of bytes we copied. Then, we have some logic to handle the case
of discontiguous pages and adjust the offset into the working buffer
again. However, if we didn't advance the bio to a new page, we may enter
this case in error, essentially repeating the adjustment that we already
made when we entered the function. The end result is bogus data in the
bio.

Previously, we only checked for this case when we advanced to a new
page, but the conversion to bio iterators changed that. This restores
the old, correct behavior.


I can confirm this fixes the corruption I was seeing

Feel free to add:

Tested-by: Pat Erley 


Thanks again Pat for bisecting this down.  It passed overnight so I'm 
sending in right now.


-chris
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PULL] Fix ioctls on 32bit/64bit userspace/kernel, for 4.10

2017-02-08 Thread Chris Mason

On Wed, Feb 08, 2017 at 05:51:28PM +0100, David Sterba wrote:

Hi,

could you please merge this single-patch pull request, for 4.10 still?  There
are quite a few patches on top of v4.10-rc7 so this IMHO does not look like
look too bad even late in the release cycle. Though it's a fix for an uncommon
usecase of 32bit userspace on 64bit kernel, it fixes basically operation of the
ioctls. Thanks.


Hi Dave, I'll pull this in and it, thanks.

-chris
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[GIT PULL] Btrfs

2017-01-27 Thread Chris Mason
Hi Linus,

My for-linus-4.10 branch:

git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs.git 
for-linus-4.10

Has some fixes that we've collected from the list.  We still have one 
more pending to nail down a regression in lzo compression, but I wanted 
to get this batch out the door.

Omar Sandoval (3) commits (+2/-6):
 Btrfs: remove ->{get, set}_acl() from btrfs_dir_ro_inode_operations (+0/-2)
 Btrfs: remove old tree_root case in btrfs_read_locked_inode() (+1/-4)
 Btrfs: disable xattr operations on subvolume directories (+1/-0)

Liu Bo (1) commits (+12/-1):
 Btrfs: fix truncate down when no_holes feature is enabled

Chandan Rajendra (1) commits (+2/-2):
 Btrfs: Fix deadlock between direct IO and fast fsync

Wang Xiaoguang (1) commits (+1/-0):
 btrfs: fix false enospc error when truncating heavily reflinked file

Total: (6) commits (+17/-9)

  fs/btrfs/inode.c | 26 +-
  1 file changed, 17 insertions(+), 9 deletions(-)
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


  1   2   3   4   5   6   7   8   9   10   >