Re: Metadata balance fails ENOSPC

2016-11-30 Thread Stefan Priebe - Profihost AG
On 01.12.2016 at 00:02, Chris Murphy wrote:
> On Wed, Nov 30, 2016 at 2:03 PM, Stefan Priebe - Profihost AG
>  wrote:
>> Hello,
>>
>> # btrfs balance start -v -dusage=0 -musage=1 /ssddisk/
>> Dumping filters: flags 0x7, state 0x0, force is off
>>   DATA (flags 0x2): balancing, usage=0
>>   METADATA (flags 0x2): balancing, usage=1
>>   SYSTEM (flags 0x2): balancing, usage=1
>> ERROR: error during balancing '/ssddisk/': No space left on device
>> There may be more info in syslog - try dmesg | tail
> 
> You haven't provided kernel messages at the time of the error.

Kernel Message:
[  429.107723] BTRFS info (device vdb1): 1 enospc errors during balance

> Also useful is the kernel version.

Custom 4.4 kernel with patches up to 4.10. But I already tried 4.9-rc7,
which does the same.


>> # btrfs filesystem show /ssddisk/
>> Label: none  uuid: a69d2e90-c2ca-4589-9876-234446868adc
>> Total devices 1 FS bytes used 305.67GiB
>> devid1 size 500.00GiB used 500.00GiB path /dev/vdb1
>>
>> # btrfs filesystem usage /ssddisk/
>> Overall:
>> Device size: 500.00GiB
>> Device allocated:500.00GiB
>> Device unallocated:1.05MiB
> 
> Drive is actually fully allocated so if Btrfs needs to create a new
> chunk right now, it can't. However,

Yes, but there's a lot of free space:
Free (estimated):            193.46GiB  (min: 193.46GiB)

How do these two fit together?


> All three chunk types have quite a bit of unused space in them, so
> it's unclear why there's a no space left error.
> 
> Try remounting with enospc_debug, and then trigger the problem again,
> and post the resulting kernel messages.

With enospc debug it says:
[39193.425682] BTRFS warning (device vdb1): no space to allocate a new
chunk for block group 839941881856
[39193.426033] BTRFS info (device vdb1): 1 enospc errors during balance

Greets,
Stefan
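
For reference, a workaround often suggested on this list when every byte of
the device is allocated (not something proposed in this thread; the scratch
file path, size and loop device below are placeholders) is to temporarily add
a scratch device so the balance has somewhere to write relocated chunks:

    # the scratch file must live on a *different* filesystem
    truncate -s 4G /mnt/other/btrfs-scratch.img
    losetup /dev/loop0 /mnt/other/btrfs-scratch.img
    btrfs device add /dev/loop0 /ssddisk

    # the usage-filtered balance now has unallocated space to work with
    btrfs balance start -dusage=10 -musage=10 /ssddisk

    # remove the scratch device again; this migrates any chunks off it
    btrfs device remove /dev/loop0 /ssddisk
    losetup -d /dev/loop0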


Re: Convert from RAID 5 to 10

2016-11-30 Thread Duncan
Austin S. Hemmelgarn posted on Wed, 30 Nov 2016 11:48:57 -0500 as
excerpted:

> On 2016-11-30 10:49, Wilson Meier wrote:

>> Do you also have all home users in mind, who go on vacation (sometimes
>> >3 weeks) and don't have a 24/7 support team to replace monitored disks
>> which do report SMART errors?

> Better than 90% of people I know either shut down their systems when
> they're going to be away for a long period of time, or like me have ways
> to log in remotely and tell the FS to not use that disk anymore.

https://btrfs.wiki.kernel.org/index.php/Getting_started ...

... has two warnings offset in red right in the first section:

* If you have btrfs filesystems, run the latest kernel.

* You should keep and test backups of your data, and be prepared to use 
them.

It also says:

The status of btrfs was experimental for a long time, but the core 
functionality is considered good enough for daily use. [...]
While many people use it reliably, there are still problems being found.


Were I editing, that or something very similar would be on the main 
landing page, and as a general status announcement on the feature and 
profile status page.  However, it IS on the wiki.

As to the three weeks vacation thing...

And "daily use" != "three weeks without physical access to something 
you're going to actually be relying on for parts of those three weeks".

And "keep and test backups [and] be prepared to use them" != "go away for 
three weeks and leave yourself unable to restore from those backups, for 
something you're relying on over those three weeks", either.

As Austin says, many home users actually shut down their systems when 
they're going to be away, because they are /not/ going to be using them 
in that period, and *certainly* *don't* actually /rely/ on them.

And most of those that /do/ actually rely on them, have learned or will 
learn, possibly the hard way, that "things happen", and they need either 
someone that can be called to poke the systems if necessary, or 
alternative plans in case what they can't access ATM fails.

Meanwhile, arguably those who /are/ relying on their filesystems to be up 
and running for extended periods while they can't actually poke (or have 
someone else poke) the hardware if necessary, shouldn't be running btrfs 
as yet in the first place, as it's simply not stable and mature enough 
for that.  And people who really care about it will have done the 
research to know the stability status.  And people who don't... well, by 
not doing that research they've effectively defined it as not that 
important in their life; other things have taken priority.  So if btrfs 
fails on them and they didn't know its stability status, it can only be 
because it wasn't that important to them that they know, so no big deal.

(I know for certain that before /I/ switched to btrfs, I scoured both the 
wiki and the manpages, as well as reading a number of articles on btrfs, 
and then still posted to this list a number of questions I had remaining 
after doing all that, and got answers I read as well, before I actually 
did my switch.  That's because it was my data at risk, data I place a 
high enough value on to want to know the risk at which I was placing it 
and the best way to deal with various issues I could anticipate possibly 
happening, before they actually happened.  And I actually did some of my 
own testing before final deployment, as well, satisfying myself that I 
/could/ reasonably deal with various hardware and software disaster 
scenarios, before putting any real data at risk.

Of course I don't expect everyone to do all that, but then I don't expect 
everyone to place the value in their data that I do in mine.  Which is 
fine, as long as they're willing to live with the consequences of the 
priority they placed on appreciating and dealing appropriately with the 
risk factor on their data, based on the definition of value their actions 
placed on it.  If they're willing to risk the data because it's of no 
particular value to them anyway, well then, no such preliminary research 
and testing is required.  Indeed, it would be stupid, because they surely 
have more important and higher priority things to deal with.)

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



[PATCH 2/2] btrfs: qgroup: Fix qgroup reserved space underflow caused by buffered write and quota enable

2016-11-30 Thread Qu Wenruo
[BUG]
Under the following case, we can underflow qgroup reserved space.

Task A                                 | Task B
---------------------------------------|-----------------------------------
Quota disabled                         |
Buffered write                         |
|- btrfs_check_data_free_space()       |
|  *NO* qgroup space is reserved       |
|  since quota is *DISABLED*           |
|- All pages are copied to page cache  |
                                       | Enable quota
                                       | Quota scan finished
                                       |
                                       | Sync_fs
                                       | |- run_delalloc_range
                                       | |- Write pages
                                       | |- btrfs_finish_ordered_io
                                       |    |- insert_reserved_file_extent
                                       |       |- btrfs_qgroup_release_data()
                                       |          Since no qgroup space is
                                       |          reserved in Task A, we
                                       |          underflow qgroup reserved
                                       |          space
This can be detected by fstest btrfs/104.

[CAUSE]
In insert_reserved_file_extent() we inform qgroup to release @ram_bytes
worth of qgroup reserved space in all cases.
And btrfs_qgroup_release_data() will check if qgroup is enabled.

However, in the above case the buffered write happens before quota is
enabled, so we don't have reserved space for that range.

[FIX]
In insert_reserved_file_extent(), use the actual number of bytes released
by qgroup (the return value of btrfs_qgroup_release_data()).
In the above case, since we don't have reserved space, qgroup releases
0 bytes, so the problem is fixed.

And thanks to the @reserved parameter introduced by the qgroup rework, and
the previous patch that returns the released bytes, the fix comes in at
fewer than 10 lines.

Signed-off-by: Qu Wenruo 
---
To David:
With these 2 patches, plus the updated extra WARN_ON() patch (V5), btrfs on
x86_64 is good for the current qgroup test group.
But the bug reported by Chandan still exists, and the fix for that will be
delayed for a while, as that fix involves a large interface change
(adding struct extent_changeset for every qgroup reserve caller).

I'll run extra tests for that fix to ensure they are OK and won't cause
extra bugs.

Thanks,
Qu
---
 fs/btrfs/inode.c | 15 ++-
 1 file changed, 10 insertions(+), 5 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 242dc7e..3db58d9 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -2256,6 +2256,7 @@ static int insert_reserved_file_extent(struct btrfs_trans_handle *trans,
struct btrfs_path *path;
struct extent_buffer *leaf;
struct btrfs_key ins;
+   u64 qg_released;
int extent_inserted = 0;
int ret;
 
@@ -2311,15 +2312,19 @@ static int insert_reserved_file_extent(struct btrfs_trans_handle *trans,
ins.objectid = disk_bytenr;
ins.offset = disk_num_bytes;
ins.type = BTRFS_EXTENT_ITEM_KEY;
-   ret = btrfs_alloc_reserved_file_extent(trans, root,
-   root->root_key.objectid,
-   btrfs_ino(inode), file_pos,
-   ram_bytes, &ins);
+
/*
 * Release the reserved range from inode dirty range map, as it is
 * already moved into delayed_ref_head
 */
-   btrfs_qgroup_release_data(inode, file_pos, ram_bytes);
+   ret = btrfs_qgroup_release_data(inode, file_pos, ram_bytes);
+   if (ret < 0)
+   goto out;
+   qg_released = ret;
+   ret = btrfs_alloc_reserved_file_extent(trans, root,
+   root->root_key.objectid,
+   btrfs_ino(inode), file_pos,
+   qg_released, &ins);
 out:
btrfs_free_path(path);
 
-- 
2.10.2





[PATCH 1/2] btrfs: qgroup: Return actually freed bytes for qgroup release or free data

2016-11-30 Thread Qu Wenruo
btrfs_qgroup_release/free_data() only returns 0 or a negative error
number (ENOMEM is the only possible error).

This is normally good enough, but sometimes we need the accurate byte
number it freed/released.

Change it to return the actual number of released/freed bytes instead of 0
for success.
Also slightly modify the related extent_changeset structure, since in btrfs
a single non-hole data extent won't be larger than 128M, so "unsigned int"
is large enough for the use case.
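
For scale, a quick back-of-the-envelope check (not part of the patch):

    128 MiB = 128 * 1024 * 1024 bytes = 134,217,728 = 2^27
    UINT_MAX (32-bit unsigned int)    = 4,294,967,295 = 2^32 - 1

so a per-extent byte count fits in an unsigned int with plenty of headroom.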

Signed-off-by: Qu Wenruo 
---
 fs/btrfs/extent-tree.c | 2 +-
 fs/btrfs/extent_io.h   | 2 +-
 fs/btrfs/qgroup.c  | 1 +
 3 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index ac3ae27..dae287d 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -4318,7 +4318,7 @@ int btrfs_check_data_free_space(struct inode *inode, u64 start, u64 len)
 
/* Use new btrfs_qgroup_reserve_data to reserve precious data space. */
ret = btrfs_qgroup_reserve_data(inode, start, len);
-   if (ret)
+   if (ret < 0)
btrfs_free_reserved_data_space_noquota(inode, start, len);
return ret;
 }
diff --git a/fs/btrfs/extent_io.h b/fs/btrfs/extent_io.h
index 8df24c6..13edb86 100644
--- a/fs/btrfs/extent_io.h
+++ b/fs/btrfs/extent_io.h
@@ -190,7 +190,7 @@ struct extent_buffer {
  */
 struct extent_changeset {
/* How many bytes are set/cleared in this operation */
-   u64 bytes_changed;
+   unsigned int bytes_changed;
 
/* Changed ranges */
struct ulist *range_changed;
diff --git a/fs/btrfs/qgroup.c b/fs/btrfs/qgroup.c
index 1ad3be8..7263065 100644
--- a/fs/btrfs/qgroup.c
+++ b/fs/btrfs/qgroup.c
@@ -2873,6 +2873,7 @@ static int __btrfs_qgroup_release_data(struct inode *inode, u64 start, u64 len,
}
trace_btrfs_qgroup_release_data(inode, start, len,
changeset.bytes_changed, trace_op);
+   ret = changeset.bytes_changed;
 out:
ulist_free(changeset.range_changed);
return ret;
-- 
2.10.2





Re: [PATCH] Btrfs: fix infinite loop when tree log recovery

2016-11-30 Thread robbieko

Hi Filipe,

Thank you for your review.
I have seen your modified change logs for the patches below:
Btrfs: fix tree search logic when replaying directory entry deletes
Btrfs: fix deadlock caused by fsync when logging directory entries
Btrfs: fix enospc in hole punching
So what's the next step?
Should I modify the patch change log and then send it again?

Thanks.
robbieko

Filipe Manana wrote on 2016-12-01 00:53:
On Fri, Oct 7, 2016 at 10:30 AM, robbieko  
wrote:

From: Robbie Ko 

If the log tree looks like below:
leaf N:
...
item 240 key (282 DIR_LOG_ITEM 0) itemoff 8189 itemsize 8
dir log end 1275809046
leaf N+1:
item 0 key (282 DIR_LOG_ITEM 3936149215) itemoff 16275 itemsize 8
dir log end 18446744073709551615
...

When start_ret > 1275809046, slot[0] never becomes >= nritems,
so we never go to the next leaf.


This doesn't explain how the infinite loop happens. Nor exactly how
any problem happens.

It's important to have detailed information in the change logs. I
understand that english isn't your native tongue (it's not mine
either, and I'm far from mastering it), but that's not an excuse to
not express all the important information in detail (we can all live
with grammar errors and typos, and we all do such errors frequently).

I've added this patch to my branch at
https://git.kernel.org/cgit/linux/kernel/git/fdmanana/linux.git/log/?h=for-chris-4.10
but with a modified changelog and subject.

The results of the wrong logic that decides when to move to the next
leaf are unpredictable, and it won't always result in an infinite
loop. We are accessing a slot that doesn't point to an item, i.e. a
memory location containing garbage or something unexpected, and in the
worst case that location is beyond the last page of the extent buffer.

Thanks.




Signed-off-by: Robbie Ko 
---
 fs/btrfs/tree-log.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
index ef9c55b..e63dd99 100644
--- a/fs/btrfs/tree-log.c
+++ b/fs/btrfs/tree-log.c
@@ -1940,12 +1940,11 @@ static noinline int find_dir_range(struct btrfs_root *root,

 next:
/* check the next slot in the tree to see if it is a valid item */

nritems = btrfs_header_nritems(path->nodes[0]);
+   path->slots[0]++;
if (path->slots[0] >= nritems) {
ret = btrfs_next_leaf(root, path);
if (ret)
goto out;
-   } else {
-   path->slots[0]++;
}

btrfs_item_key_to_cpu(path->nodes[0], &key, path->slots[0]);
--
1.9.1
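
To make the failure mode concrete, here is a small standalone sketch in plain
C (not the kernel code) of the indexing mistake the patch corrects: with
nritems items, valid slots are 0..nritems-1, and incrementing only after the
bounds check can leave the index equal to nritems before the next read.

    #include <stdio.h>

    int main(void)
    {
        int leaf[3] = { 10, 20, 30 };   /* stand-in for a leaf with nritems items */
        int nritems = 3;
        int slot = nritems - 1;         /* currently on the last valid item */

        /* old ordering: bounds check first, then increment */
        if (slot >= nritems)
            printf("old: move to next leaf\n");
        else
            slot++;                     /* slot is now 3 == nritems */
        printf("old: next read would use slot %d of %d items (out of bounds)\n",
               slot, nritems);

        /* fixed ordering: increment first, then bounds check */
        slot = nritems - 1;
        slot++;
        if (slot >= nritems)
            printf("new: move to next leaf instead of reading slot %d\n", slot);
        else
            printf("new: read slot %d (value %d)\n", slot, leaf[slot]);
        return 0;
    }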







Re: [PATCH v2 02/14] btrfs-progs: check: introduce function to find dir_item

2016-11-30 Thread Qu Wenruo



At 12/01/2016 12:34 AM, David Sterba wrote:

On Wed, Nov 16, 2016 at 10:27:59AM +0800, Qu Wenruo wrote:

Yes please. Third namespace for existing error bits is not a good
option. Move the I_ERR bits to start from 32 and use them in the low-mem
code that's been merged to devel.


I didn't see such fix in devel branch.


Well, that's because nobody implemented it and I was not intending to do
it myself, as it's a followup to your lowmem patchset in devel.



OK, I'm going to fix it soon.

Thanks,
Qu
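
For readers not following the btrfs-progs internals: giving a second group of
error bits its own range within a shared mask looks roughly like this. This is
a standalone sketch with made-up flag names, not the actual btrfs-progs
definitions.

    #include <stdint.h>
    #include <stdio.h>

    #define ROOT_ERR_EXAMPLE   (1ULL << 0)               /* existing bits stay below 32 */
    #define I_ERR_SHIFT        32                        /* second namespace starts here */
    #define I_ERR_EXAMPLE_A    (1ULL << (I_ERR_SHIFT + 0))
    #define I_ERR_EXAMPLE_B    (1ULL << (I_ERR_SHIFT + 1))

    int main(void)
    {
        uint64_t errors = 0;

        errors |= ROOT_ERR_EXAMPLE;       /* low-bit error from the first namespace */
        errors |= I_ERR_EXAMPLE_A;        /* high-bit error from the second namespace */

        if (errors & I_ERR_EXAMPLE_A)
            printf("inode-level error recorded without clashing with low bits\n");
        printf("error mask: 0x%016llx\n", (unsigned long long)errors);
        return 0;
    }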




Re: 4.8.8, bcache deadlock and hard lockup

2016-11-30 Thread Chris Murphy
On Wed, Nov 30, 2016 at 4:57 PM, Eric Wheeler  wrote:
> On Wed, 30 Nov 2016, Marc MERLIN wrote:
>> +btrfs mailing list, see below why
>>
>> On Tue, Nov 29, 2016 at 12:59:44PM -0800, Eric Wheeler wrote:
>> > On Mon, 27 Nov 2016, Coly Li wrote:
>> > >
>> > > Yes, too many work queues... I guess the locking might be caused by some
>> > > very obscure reference of closure code. I cannot have any clue if I
>> > > cannot find a stable procedure to reproduce this issue.
>> > >
>> > > Hmm, if there is a tool to clone all the meta data of the back end cache
>> > > and whole cached device, there might be a method to replay the oops much
>> > > easier.
>> > >
>> > > Eric, do you have any hint ?
>> >
>> > Note that the backing device doesn't have any metadata, just a superblock.
>> > You can easily dd that off onto some other volume without transferring the
>> > data. By default, data starts at 8k, or whatever you used in `make-bcache
>> > -w`.
>>
>> Ok, Linus helped me find a workaround for this problem:
>> https://lkml.org/lkml/2016/11/29/667
>> namely:
>>echo 2 > /proc/sys/vm/dirty_ratio
>>echo 1 > /proc/sys/vm/dirty_background_ratio
>> (it's a 24GB system, so the defaults of 20 and 10 were creating too many
>> requests in th buffers)
>>
>> Note that this is only a workaround, not a fix.
>>
>> When I did this and re tried my big copy again, I still got 100+ kernel
>> work queues, but apparently the underlying swraid5 was able to unblock
>> and satisfy the write requests before too many accumulated and crashed
>> the kernel.
>>
>> I'm not a kernel coder, but seems to me that bcache needs a way to
>> throttle incoming requests if there are too many so that it does not end
>> up in a state where things blow up due to too many piled up requests.
>>
>> You should be able to reproduce this by taking 5 spinning rust drives,
>> put raid5 on top, dmcrypt, bcache and hopefully any filesystem (although
>> I used btrfs) and send lots of requests.
>> Actually to be honest, the problems have mostly been happening when I do
>> btrfs scrub and btrfs send/receive which both generate I/O from within
>> the kernel instead of user space.
>> So here, btrfs may be a contributor to the problem too, but while btrfs
>> still trashes my system if I remove the caching device on bcache (and
>> with the default dirty ratio values), it doesn't crash the kernel.
>>
>> I'll start another separate thread with the btrfs folks on how much
>> pressure is put on the system, but on your side it would be good to help
>> ensure that bcache doesn't crash the system altogether if too many
>> requests are allowed to pile up.
>
>
> Try BFQ.  It is AWESOME and helps reduce the congestion problem with bulk
> writes at the request queue on its way to the spinning disk or SSD:
> http://algo.ing.unimo.it/people/paolo/disk_sched/
>
> use the latest BFQ git here, merge it into v4.8.y:
> https://github.com/linusw/linux-bfq/commits/bfq-v8
>
> This doesn't completely fix the dirty_ratio problem, but it is far better
> than CFQ or deadline in my opinion (and experience).

There are several threads over the past year with users having
problems no one else had previously reported, and they were using BFQ.
But there's no evidence whether BFQ was the cause, or was exposing some
existing bug that another scheduler doesn't trigger. Anyway, I'd say using an
out-of-tree scheduler means a higher burden of testing and skepticism.


-- 
Chris Murphy


Re: 4.8.8, bcache deadlock and hard lockup

2016-11-30 Thread Marc MERLIN
On Wed, Nov 30, 2016 at 03:57:28PM -0800, Eric Wheeler wrote:
> > I'll start another separate thread with the btrfs folks on how much
> > pressure is put on the system, but on your side it would be good to help
> > ensure that bcache doesn't crash the system altogether if too many
> > requests are allowed to pile up.
> 
> Try BFQ.  It is AWESOME and helps reduce the congestion problem with bulk 
> writes at the request queue on its way to the spinning disk or SSD:
>   http://algo.ing.unimo.it/people/paolo/disk_sched/
> 
> use the latest BFQ git here, merge it into v4.8.y:
>   https://github.com/linusw/linux-bfq/commits/bfq-v8
> 
> This doesn't completely fix the dirty_ratio problem, but it is far better 
> than CFQ or deadline in my opinion (and experience).

That's good to know thanks.
But in my uninformed opinion, is there anything bcache can do to throttle
incoming requests if they are piling up, or are they coming from producers
upstream such that bcache has no choice but to try and process them as quickly
as possible, without a way to block the sender if too many are coming?

Thanks,
Marc
-- 
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems 
   what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/  


Re: 4.8.8, bcache deadlock and hard lockup

2016-11-30 Thread Eric Wheeler
On Wed, 30 Nov 2016, Marc MERLIN wrote:
> +btrfs mailing list, see below why
> 
> On Tue, Nov 29, 2016 at 12:59:44PM -0800, Eric Wheeler wrote:
> > On Mon, 27 Nov 2016, Coly Li wrote:
> > > 
> > > Yes, too many work queues... I guess the locking might be caused by some
> > > very obscure reference of closure code. I cannot have any clue if I
> > > cannot find a stable procedure to reproduce this issue.
> > > 
> > > Hmm, if there is a tool to clone all the meta data of the back end cache
> > > and whole cached device, there might be a method to replay the oops much
> > > easier.
> > > 
> > > Eric, do you have any hint ?
> > 
> > Note that the backing device doesn't have any metadata, just a superblock. 
> > You can easily dd that off onto some other volume without transferring the 
> > data. By default, data starts at 8k, or whatever you used in `make-bcache 
> > -w`.
> 
> Ok, Linus helped me find a workaround for this problem:
> https://lkml.org/lkml/2016/11/29/667
> namely:
>echo 2 > /proc/sys/vm/dirty_ratio
>echo 1 > /proc/sys/vm/dirty_background_ratio
> (it's a 24GB system, so the defaults of 20 and 10 were creating too many
> requests in th buffers)
> 
> Note that this is only a workaround, not a fix.
> 
> When I did this and re tried my big copy again, I still got 100+ kernel
> work queues, but apparently the underlying swraid5 was able to unblock
> and satisfy the write requests before too many accumulated and crashed
> the kernel.
> 
> I'm not a kernel coder, but seems to me that bcache needs a way to
> throttle incoming requests if there are too many so that it does not end
> up in a state where things blow up due to too many piled up requests.
> 
> You should be able to reproduce this by taking 5 spinning rust drives,
> put raid5 on top, dmcrypt, bcache and hopefully any filesystem (although
> I used btrfs) and send lots of requests.
> Actually to be honest, the problems have mostly been happening when I do
> btrfs scrub and btrfs send/receive which both generate I/O from within
> the kernel instead of user space.
> So here, btrfs may be a contributor to the problem too, but while btrfs
> still trashes my system if I remove the caching device on bcache (and
> with the default dirty ratio values), it doesn't crash the kernel.
> 
> I'll start another separate thread with the btrfs folks on how much
> pressure is put on the system, but on your side it would be good to help
> ensure that bcache doesn't crash the system altogether if too many
> requests are allowed to pile up.


Try BFQ.  It is AWESOME and helps reduce the congestion problem with bulk 
writes at the request queue on its way to the spinning disk or SSD:
http://algo.ing.unimo.it/people/paolo/disk_sched/

use the latest BFQ git here, merge it into v4.8.y:
https://github.com/linusw/linux-bfq/commits/bfq-v8

This doesn't completely fix the dirty_ratio problem, but it is far better 
than CFQ or deadline in my opinion (and experience).
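
For anyone wanting to try that: with a BFQ-patched kernel (and BFQ built in or
loaded), the scheduler can be switched per block device at runtime; the device
name below is a placeholder:

    cat /sys/block/sdb/queue/scheduler        # e.g. "noop deadline [cfq] bfq"
    echo bfq > /sys/block/sdb/queue/scheduler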

-Eric



--
Eric Wheeler


> 
> Thanks,
> Marc
> -- 
> "A mouse is a device used to point at the xterm you want to type in" - A.S.R.
> Microsoft is to operating systems 
>    what McDonalds is to gourmet 
> cooking
> Home page: http://marc.merlins.org/ | PGP 
> 1024R/763BE901
> --
> To unsubscribe from this list: send the line "unsubscribe linux-bcache" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 


Re: Metadata balance fails ENOSPC

2016-11-30 Thread Chris Murphy
On Wed, Nov 30, 2016 at 2:03 PM, Stefan Priebe - Profihost AG
 wrote:
> Hello,
>
> # btrfs balance start -v -dusage=0 -musage=1 /ssddisk/
> Dumping filters: flags 0x7, state 0x0, force is off
>   DATA (flags 0x2): balancing, usage=0
>   METADATA (flags 0x2): balancing, usage=1
>   SYSTEM (flags 0x2): balancing, usage=1
> ERROR: error during balancing '/ssddisk/': No space left on device
> There may be more info in syslog - try dmesg | tail

You haven't provided kernel messages at the time of the error.

Also useful is the kernel version.



>
> # btrfs filesystem show /ssddisk/
> Label: none  uuid: a69d2e90-c2ca-4589-9876-234446868adc
> Total devices 1 FS bytes used 305.67GiB
> devid1 size 500.00GiB used 500.00GiB path /dev/vdb1
>
> # btrfs filesystem usage /ssddisk/
> Overall:
> Device size: 500.00GiB
> Device allocated:500.00GiB
> Device unallocated:1.05MiB

Drive is actually fully allocated so if Btrfs needs to create a new
chunk right now, it can't. However,



>
> Data,single: Size:483.97GiB, Used:298.18GiB
>/dev/vdb1 483.97GiB
>
> Metadata,single: Size:16.00GiB, Used:7.51GiB
>/dev/vdb1  16.00GiB
>
> System,single: Size:32.00MiB, Used:144.00KiB
>/dev/vdb1  32.00MiB

All three chunk types have quite a bit of unused space in them, so
it's unclear why there's a no space left error.

Try remounting with enospc_debug, and then trigger the problem again,
and post the resulting kernel messages.
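
For reference, that can be done without unmounting, e.g.:

    mount -o remount,enospc_debug /ssddisk
    btrfs balance start -v -dusage=0 -musage=1 /ssddisk
    dmesg | tail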


-- 
Chris Murphy


Metadata balance fails ENOSPC

2016-11-30 Thread Stefan Priebe - Profihost AG
Hello,

# btrfs balance start -v -dusage=0 -musage=1 /ssddisk/
Dumping filters: flags 0x7, state 0x0, force is off
  DATA (flags 0x2): balancing, usage=0
  METADATA (flags 0x2): balancing, usage=1
  SYSTEM (flags 0x2): balancing, usage=1
ERROR: error during balancing '/ssddisk/': No space left on device
There may be more info in syslog - try dmesg | tail

# btrfs filesystem show /ssddisk/
Label: none  uuid: a69d2e90-c2ca-4589-9876-234446868adc
Total devices 1 FS bytes used 305.67GiB
devid1 size 500.00GiB used 500.00GiB path /dev/vdb1

# btrfs filesystem usage /ssddisk/
Overall:
    Device size:                 500.00GiB
    Device allocated:            500.00GiB
    Device unallocated:            1.05MiB
    Device missing:                  0.00B
    Used:                        305.69GiB
    Free (estimated):            185.78GiB  (min: 185.78GiB)
    Data ratio:                       1.00
    Metadata ratio:                   1.00
    Global reserve:              512.00MiB  (used: 608.00KiB)

Data,single: Size:483.97GiB, Used:298.18GiB
   /dev/vdb1     483.97GiB

Metadata,single: Size:16.00GiB, Used:7.51GiB
   /dev/vdb1      16.00GiB

System,single: Size:32.00MiB, Used:144.00KiB
   /dev/vdb1      32.00MiB

Unallocated:
   /dev/vdb1       1.05MiB

How can I make it balance again?

Greets,
Stefan
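
As an aside on how the "Free (estimated)" figure above relates to the fully
allocated device: it is, to a good approximation, the unused space inside the
already allocated data chunks plus the unallocated space:

    Free (estimated) ~= (Data size - Data used) + Unallocated
                      = (483.97 GiB - 298.18 GiB) + 1.05 MiB
                      ~= 185.79 GiB

which matches the reported 185.78 GiB up to rounding, even though only
1.05 MiB is left for allocating new chunks.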


Re: Convert from RAID 5 to 10

2016-11-30 Thread Tomasz Kusmierz
On 30 November 2016 at 19:09, Chris Murphy  wrote:
> On Wed, Nov 30, 2016 at 7:37 AM, Austin S. Hemmelgarn
>  wrote:
>
>> The stability info could be improved, but _absolutely none_ of the things
>> mentioned as issues with raid1 are specific to raid1.  And in general, in
>> the context of a feature stability matrix, 'OK' generally means that there
>> are no significant issues with that specific feature, and since none of the
>> issues outlined are specific to raid1, it does meet that description of
>> 'OK'.
>
> Maybe the gotchas page needs a one or two liner for each profile's
> gotchas compared to what the profile leads the user into believing.
> The overriding gotcha with all Btrfs multiple device support is the
> lack of monitoring and notification other than kernel messages; and
> the raid10 actually being more like raid0+1 is, I think, certainly a
> gotcha; however, 'man mkfs.btrfs' contains a grid that very clearly
> states raid10 can only safely lose 1 device.
>
>
>> Looking at this another way, I've been using BTRFS on all my systems since
>> kernel 3.16 (I forget what exact vintage that is in regular years).  I've
>> not had any data integrity or data loss issues as a result of BTRFS itself
>> since 3.19, and in just the past year I've had multiple raid1 profile
>> filesystems survive multiple hardware issues with near zero issues (with the
>> caveat that I had to re-balance after replacing devices to convert a few
>> single chunks to raid1), and that includes multiple disk failures and 2 bad
>> PSU's plus about a dozen (not BTRFS related) kernel panics and 4 unexpected
>> power loss events.  I also have exhaustive monitoring, so I'm replacing bad
>> hardware early instead of waiting for it to actually fail.
>
> Possibly nothing aids predictably reliable storage stacks more than healthy
> doses of skepticism and awareness of all limitations. :-D
>
> --
> Chris Murphy
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

Please, I beg you, add another column to the man page and wiki stating
clearly how many devices every profile can withstand losing. I frequently
have to explain how btrfs profiles work and show quotes from this
mailing list because "Dunning-Kruger effect victims" keep popping up
with statements like "in btrfs raid10 with 8 drives you can lose 4
drives" ... I seriously beg you guys, my beating stick is half broken
by now.


Re: Convert from RAID 5 to 10

2016-11-30 Thread Martin Steigerwald
On Wednesday, 30 November 2016, 12:09:23 CET, Chris Murphy wrote:
> On Wed, Nov 30, 2016 at 7:37 AM, Austin S. Hemmelgarn
> 
>  wrote:
> > The stability info could be improved, but _absolutely none_ of the things
> > mentioned as issues with raid1 are specific to raid1.  And in general, in
> > the context of a feature stability matrix, 'OK' generally means that there
> > are no significant issues with that specific feature, and since none of
> > the
> > issues outlined are specific to raid1, it does meet that description of
> > 'OK'.
> 
> Maybe the gotchas page needs a one or two liner for each profile's
> gotchas compared to what the profile leads the user into believing.
> The overriding gotcha with all Btrfs multiple device support is the
> lack of monitoring and notification other than kernel messages; and
> the raid10 actually being more like raid0+1 is, I think, certainly a
> gotcha; however, 'man mkfs.btrfs' contains a grid that very clearly
> states raid10 can only safely lose 1 device.

Wow, that manpage is quite a resource.

Developers and documentation people have definitely improved the official
BTRFS documentation.

Thanks,
-- 
Martin


Re: Convert from RAID 5 to 10

2016-11-30 Thread Chris Murphy
On Wed, Nov 30, 2016 at 7:37 AM, Austin S. Hemmelgarn
 wrote:

> The stability info could be improved, but _absolutely none_ of the things
> mentioned as issues with raid1 are specific to raid1.  And in general, in
> the context of a feature stability matrix, 'OK' generally means that there
> are no significant issues with that specific feature, and since none of the
> issues outlined are specific to raid1, it does meet that description of
> 'OK'.

Maybe the gotchas page needs a one or two liner for each profile's
gotchas compared to what the profile leads the user into believing.
The overriding gotcha with all Btrfs multiple device support is the
lack of monitoring and notification other than kernel messages; and
the raid10 actually being more like raid0+1 is, I think, certainly a
gotcha; however, 'man mkfs.btrfs' contains a grid that very clearly
states raid10 can only safely lose 1 device.


> Looking at this another way, I've been using BTRFS on all my systems since
> kernel 3.16 (I forget what exact vintage that is in regular years).  I've
> not had any data integrity or data loss issues as a result of BTRFS itself
> since 3.19, and in just the past year I've had multiple raid1 profile
> filesystems survive multiple hardware issues with near zero issues (with the
> caveat that I had to re-balance after replacing devices to convert a few
> single chunks to raid1), and that includes multiple disk failures and 2 bad
> PSU's plus about a dozen (not BTRFS related) kernel panics and 4 unexpected
> power loss events.  I also have exhaustive monitoring, so I'm replacing bad
> hardware early instead of waiting for it to actually fail.

Possibly nothing aids predictably reliable storage stacks more than healthy
doses of skepticism and awareness of all limitations. :-D

-- 
Chris Murphy


Re: Convert from RAID 5 to 10

2016-11-30 Thread Chris Murphy
On Wed, Nov 30, 2016 at 7:04 AM, Roman Mamedov  wrote:
> On Wed, 30 Nov 2016 07:50:17 -0500

> Also I don't know what is particularly insane about copying a 4-8 GB file onto
> a storage array. I'd expect both disks to write at the same time (like they
> do in pretty much any other RAID1 system), not one-after-another, effectively
> slowing down the entire operation by as much as 2x in extreme cases.

I don't experience this behavior. Writes take the same amount of time
to single profile volume as a two device raid1 profile volume. iotop
reports 2x the write bandwidth when writing to the raid1 volume, which
corresponds to simultaneous writes to both drives in the volume. It's
also not an elaborate setup by any means: two laptop drives, each in
cheap USB 3.0 cases using bus power only, connected to a USB 3.0 hub,
in turn connected to an Intel NUC.


>
> Comparing to Ext4, that one appears to have the "errors=continue" behavior by
> default, the user has to explicitly request "errors=remount-ro", and I have
> never seen anyone use or recommend the third option of "errors=panic", which
> is basically the equivalent of the current Btrfs practice.

I think in the context of degradedness, it may be appropriate to mount
degraded,ro by default rather than fail. But changing the default
isn't enough for the root fs use case, because the mount command isn't
even issued when udev's btrfs 'dev scan' fails to report back all
devices available. In this case there is a sort of "pre check" before
even mounting is attempted, and that is what fails.

Also,  Btrfs has fatal_errors=panic and it's not the default. Rather,
we just get mount failure. There really isn't anything quite like this
in the mdadm/lvm + other file system world where the array is active
degraded and the file system mounts anyway; if it doesn't mount it's
because the array isn't active, and doesn't even exist yet.
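
For context, the manual recovery path being compared here looks roughly like
this on Btrfs (device names, the devid and the mount point are placeholders):

    # read-only degraded mount to reach the data while a device is missing
    mount -o degraded,ro /dev/sdb1 /mnt

    # or read-write (if the remaining devices still satisfy the profile),
    # followed by replacing the missing device by its devid:
    mount -o degraded /dev/sdb1 /mnt
    btrfs replace start -B 2 /dev/sdc1 /mnt     # "2" = devid of the missing disk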


> Unplugging and replugging a SATA cable of a RAID1 member should never put your
> system under the risk of a massive filesystem corruption; you cannot say it
> absolutely doesn't with the current implementation.

I can't say it absolutely doesn't even with md. Of course it
shouldn't, but users do report corruptions on all of the other fs
lists (ext4, XFS, linux-raid) from time to time that are not the
result of user error.




-- 
Chris Murphy


Re: btrfs flooding the I/O subsystem and hanging the machine, with bcache cache turned off

2016-11-30 Thread Marc MERLIN
+folks from linux-mm thread for your suggestion

On Wed, Nov 30, 2016 at 01:00:45PM -0500, Austin S. Hemmelgarn wrote:
> > swraid5 < bcache < dmcrypt < btrfs
> > 
> > Copying with btrfs send/receive causes massive hangs on the system.
> > Please see this explanation from Linus on why the workaround was
> > suggested:
> > https://lkml.org/lkml/2016/11/29/667
> And Linus' assessment is absolutely correct (at least, the general
> assessment is, I have no idea about btrfs_start_shared_extent, but I'm more
> than willing to bet he's correct that that's the culprit).

> > All of this mostly went away with Linus' suggestion:
> > echo 2 > /proc/sys/vm/dirty_ratio
> > echo 1 > /proc/sys/vm/dirty_background_ratio
> > 
> > But that's hiding the symptom which I think is that btrfs is piling up too 
> > many I/O
> > requests during btrfs send/receive and btrfs scrub (probably balance too) 
> > and not
> > looking at resulting impact to system health.

> I see pretty much identical behavior using any number of other storage
> configurations on a USB 2.0 flash drive connected to a system with 16GB of
> RAM with the default dirty ratios because it's trying to cache up to 3.2GB
> of data for writeback.  While BTRFS is doing highly sub-optimal things here,
> the ancient default writeback ratios are just as much a culprit.  I would
> suggest that get changed to 200MB or 20% of RAM, whichever is smaller, which
> would give overall almost identical behavior to x86-32, which in turn works
> reasonably well for most cases.  I sadly don't have the time, patience, or
> expertise to write up such a patch myself though.

Dear linux-mm folks, is that something you could consider (changing the
dirty_ratio defaults) given that it affects at least bcache and btrfs
(with or without bcache)?

By the way, on the 200MB max suggestion, when I had 2 and 1% (or 480MB
and 240MB on my 24GB system), this was enough to make btrfs behave
sanely, but only if I had bcache turned off.
With bcache enabled, those values were just enough so that bcache didn't
crash my system, but not enough to prevent undesirable behaviour
(things hanging, 100+ bcache kworkers piled up, and more). However, the
copy did succeed, despite the relative impact on the system, so it's
better than nothing :)
But the impact from bcache probably goes beyond what btrfs is
responsible for, so I have a separate thread on the bcache list:
http://marc.info/?l=linux-bcache&m=148052441423532&w=2
http://marc.info/?l=linux-bcache&m=148052620524162&w=2

On the plus side, btrfs did ok with 0 visible impact to my system with
those 480 and 240MB dirty ratio values.
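
If a fixed cap is preferred over a percentage, the kernel also exposes
byte-based knobs (setting the *_bytes sysctls overrides the ratio ones); a
sketch of pinning writeback near the 200MB figure suggested above:

    echo $((200 * 1024 * 1024)) > /proc/sys/vm/dirty_bytes
    echo $((100 * 1024 * 1024)) > /proc/sys/vm/dirty_background_bytes

    # or persistently, e.g. in /etc/sysctl.d/99-writeback.conf:
    #   vm.dirty_bytes = 209715200
    #   vm.dirty_background_bytes = 104857600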

Thanks for your reply, Austin.
Marc
-- 
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems 
   what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/ | PGP 1024R/763BE901


Re: btrfs flooding the I/O subsystem and hanging the machine, with bcache cache turned off

2016-11-30 Thread Austin S. Hemmelgarn

On 2016-11-30 12:18, Marc MERLIN wrote:

On Wed, Nov 30, 2016 at 08:46:46AM -0800, Marc MERLIN wrote:

+btrfs mailing list, see below why

Ok, Linus helped me find a workaround for this problem:
https://lkml.org/lkml/2016/11/29/667
namely:
   echo 2 > /proc/sys/vm/dirty_ratio
   echo 1 > /proc/sys/vm/dirty_background_ratio
(it's a 24GB system, so the defaults of 20 and 10 were creating too many
requests in th buffers)


I'll remove the bcache list on this followup since I want to concentrate
here on the fact that btrfs does behave badly with the default
dirty_ratio values.
I will comment that on big systems, almost everything behaves badly with 
the default dirty ratios, they're leftovers from when 1GB was a huge 
amount of RAM.  As usual though, BTRFS has pathological behavior 
compared to other options.

As a reminder, it's a btrfs send/receive copy between 2 swraid5 arrays
on spinning rust.
swraid5 < bcache < dmcrypt < btrfs

Copying with btrfs send/receive causes massive hangs on the system.
Please see this explanation from Linus on why the workaround was
suggested:
https://lkml.org/lkml/2016/11/29/667
And Linus' assessment is absolutely correct (at least, the general 
assessment is, I have no idea about btrfs_start_shared_extent, but I'm 
more than willing to bet he's correct that that's the culprit).


The hangs that I'm getting with bcache cache turned off (i.e.
passthrough) are now very likely only due to btrfs and mess up anything
doing file IO that ends up timing out, break USB even as reads time out
in the middle of USB requests, interrupts lost, and so forth.

All of this mostly went away with Linus' suggestion:
echo 2 > /proc/sys/vm/dirty_ratio
echo 1 > /proc/sys/vm/dirty_background_ratio

But that's hiding the symptom which I think is that btrfs is piling up too many 
I/O
requests during btrfs send/receive and btrfs scrub (probably balance too) and 
not
looking at resulting impact to system health.
I see pretty much identical behavior using any number of other storage 
configurations on a USB 2.0 flash drive connected to a system with 16GB 
of RAM with the default dirty ratios because it's trying to cache up to 
3.2GB of data for writeback.  While BTRFS is doing highly sub-optimal 
things here, the ancient default writeback ratios are just as much a 
culprit.  I would suggest that get changed to 200MB or 20% of RAM, 
whichever is smaller, which would give overall almost identical behavior 
to x86-32, which in turn works reasonably well for most cases.  I sadly 
don't have the time, patience, or expertise to write up such a patch 
myself though.
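
To put rough numbers on the ratios being discussed in this thread:

    16 GB RAM x 20% (default dirty_ratio)        ~= 3.2 GB of dirty data
    24 GB RAM x 20% / 10% (defaults)             ~= 4.8 GB / 2.4 GB
    24 GB RAM x 2% / 1% (Marc's workaround)      ~= 480 MB / 240 MB
    proposed cap: min(200 MB, 20% of RAM)         = 200 MB on both systems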


Is there a way to stop flooding the entire system with I/O and causing
so much strain on it?
(I realize that if there is a caching layer underneath that just takes
requests and says thank you without giving other clues that underneath
bad things are happening, it may be hard, but I'm asking anyway :)


[10338.968912] perf: interrupt took too long (3927 > 3917), lowering 
kernel.perf_event_max_sample_rate to 50750

[12971.047705] ftdi_sio ttyUSB15: usb_serial_generic_read_bulk_callback - urb 
stopped: -32

[17761.122238] usb 4-1.4: USB disconnect, device number 39
[17761.141063] usb 4-1.4: usbfs: USBDEVFS_CONTROL failed cmd hub-ctrl rqt 160 
rq 6 len 1024 ret -108
[17761.263252] usb 4-1: reset SuperSpeed USB device number 2 using xhci_hcd
[17761.938575] usb 4-1.4: new SuperSpeed USB device number 40 using xhci_hcd

[24130.574425] hpet1: lost 2306 rtc interrupts
[24156.034950] hpet1: lost 1628 rtc interrupts
[24173.314738] hpet1: lost 1104 rtc interrupts
[24180.129950] hpet1: lost 436 rtc interrupts
[24257.557955] hpet1: lost 4954 rtc interrupts
[24267.522656] hpet1: lost 637 rtc interrupts

[28034.954435] INFO: task btrfs:5618 blocked for more than 120 seconds.
[28034.975471]   Tainted: G U  
4.8.10-amd64-preempt-sysrq-20161121vb3tj1 #12
[28035.000964] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this 
message.
[28035.025429] btrfs   D 91154d33fc70 0  5618   5372 0x0080
[28035.047717]  91154d33fc70 00200246 911842f880c0 
9115a4cf01c0
[28035.071020]  91154d33fc58 91154d34 91165493bca0 
9115623773f0
[28035.094252]  1000 0001 91154d33fc88 
b86cf1a6
[28035.117538] Call Trace:
[28035.125791]  [] schedule+0x8b/0xa3
[28035.141550]  [] btrfs_start_ordered_extent+0xce/0x122
[28035.162457]  [] ? wake_up_atomic_t+0x2c/0x2c
[28035.180891]  [] btrfs_wait_ordered_range+0xa9/0x10d
[28035.201723]  [] btrfs_truncate+0x40/0x24b
[28035.219269]  [] btrfs_setattr+0x1da/0x2d7
[28035.237032]  [] notify_change+0x252/0x39c
[28035.254566]  [] do_truncate+0x81/0xb4
[28035.271057]  [] vfs_truncate+0xd9/0xf9
[28035.287782]  [] do_sys_truncate+0x63/0xa7

[28155.781987] INFO: task btrfs:5618 blocked for more than 120 seconds.
[28155.802229]   Tainted: G U  
4.8.10-amd64-preempt-sysrq-20161121vb3tj1 #12
[28155.827894] "echo 0 > 

[GIT PULL] Btrfs fixes for 4.10

2016-11-30 Thread fdmanana
From: Filipe Manana 

Hi Chris,

Here follows a small list of fixes and a couple of cleanups for the 4.10 merge
window. It contains all the patches from the previous pull request (which
went unanswered, nor were the changes pulled yet, apparently). The most important
change is still the fix for the extent tree corruption that happens due to
balance when qgroups are enabled (a regression introduced in 4.7 by a fix for
a regression from the last qgroups rework). This has been hitting SLE and
openSUSE users and QA very badly, where transactions keep getting aborted when
running delayed references leaving the root filesystem in RO mode and nearly
unusable.
There are fixes here that allow us to run xfstests again with the integrity
checker enabled, which has been impossible since 4.8 (apparently I'm the
only one running xfstests with the integrity checker enabled, which is useful
to validate dirtied leafs, like checking if there are keys out of order, etc).
The rest are just some trivial fixes, most of them tagged for stable, and two
cleanups.

Thanks.

The following changes since commit e3597e6090ddf40904dce6d0a5a404e2c490cac6:

  Merge branch 'for-4.9-rc3' of 
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux into for-linus-4.9 
(2016-11-01 12:54:45 -0700)

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/fdmanana/linux.git 
for-chris-4.10

for you to fetch changes up to 2a7bf53f577e49c43de4ffa7776056de26db65d9:

  Btrfs: fix tree search logic when replaying directory entry deletes 
(2016-11-30 16:56:12 +)


Filipe Manana (5):
  Btrfs: fix relocation incorrectly dropping data references
  Btrfs: remove unused code when creating and merging reloc trees
  Btrfs: remove rb_node field from the delayed ref node structure
  Btrfs: fix emptiness check for dirtied extent buffers at check_leaf()
  Btrfs: fix qgroup rescan worker initialization

Liu Bo (1):
  Btrfs: fix BUG_ON in btrfs_mark_buffer_dirty

Robbie Ko (3):
  Btrfs: fix enospc in hole punching
  Btrfs: fix deadlock caused by fsync when logging directory entries
  Btrfs: fix tree search logic when replaying directory entry deletes

 fs/btrfs/delayed-ref.h |  6 --
 fs/btrfs/disk-io.c | 23 +++
 fs/btrfs/file.c|  4 ++--
 fs/btrfs/qgroup.c  |  5 +
 fs/btrfs/relocation.c  | 34 --
 fs/btrfs/tree-log.c|  7 +++
 6 files changed, 37 insertions(+), 42 deletions(-)

-- 
2.7.0.rc3



Re: 4.8.8, bcache deadlock and hard lockup

2016-11-30 Thread Marc MERLIN
On Wed, Nov 30, 2016 at 08:46:46AM -0800, Marc MERLIN wrote:
> +btrfs mailing list, see below why
> 
> On Tue, Nov 29, 2016 at 12:59:44PM -0800, Eric Wheeler wrote:
> > On Mon, 27 Nov 2016, Coly Li wrote:
> > > 
> > > Yes, too many work queues... I guess the locking might be caused by some
> > > very obscure reference of closure code. I cannot have any clue if I
> > > cannot find a stable procedure to reproduce this issue.
> > > 
> > > Hmm, if there is a tool to clone all the meta data of the back end cache
> > > and whole cached device, there might be a method to replay the oops much
> > > easier.
> > > 
> > > Eric, do you have any hint ?
> > 
> > Note that the backing device doesn't have any metadata, just a superblock. 
> > You can easily dd that off onto some other volume without transferring the 
> > data. By default, data starts at 8k, or whatever you used in `make-bcache 
> > -w`.
> 
> Ok, Linus helped me find a workaround for this problem:
> https://lkml.org/lkml/2016/11/29/667
> namely:
>echo 2 > /proc/sys/vm/dirty_ratio
>echo 1 > /proc/sys/vm/dirty_background_ratio
> (it's a 24GB system, so the defaults of 20 and 10 were creating too many
> requests in th buffers)
> 
> Note that this is only a workaround, not a fix.

Actually, I'm even more worried about the general bcache situation when
caching is enabled. In the message above, Linus wrote:

"One situation where I've seen something like this happen is

 (a) lots and lots of dirty data queued up
 (b) horribly slow storage
 (c) filesystem that ends up serializing on writeback under certain
circumstances

The usual case for (b) in the modern world is big SSD's that have bad
worst-case behavior (ie they may do gbps speeds when doing well, and
then they come to a screeching halt when their buffers fill up and
they have to do rewrites, and their gbps throughput drops to mbps or
lower).

Generally you only find that kind of really nasty SSD in the USB stick
world these days."

Well, come to think of it, this is _exactly_ what bcache will create, by
design. It'll swallow up a lot of IO cached to the SSD, until the SSD
buffers fill up and then things will hang while bcache struggles to
write it all to slower spinning rust storage.

Looks to me like bcache and dirty_ratio need to be synced somehow, or
things will fall over reliably.

What do you think?

Thanks,
Marc


> When I did this and re tried my big copy again, I still got 100+ kernel
> work queues, but apparently the underlying swraid5 was able to unblock
> and satisfy the write requests before too many accumulated and crashed
> the kernel.
> 
> I'm not a kernel coder, but seems to me that bcache needs a way to
> throttle incoming requests if there are too many so that it does not end
> up in a state where things blow up due to too many piled up requests.
> 
> You should be able to reproduce this by taking 5 spinning rust drives,
> put raid5 on top, dmcrypt, bcache and hopefully any filesystem (although
> I used btrfs) and send lots of requests.
> Actually to be honest, the problems have mostly been happening when I do
> btrfs scrub and btrfs send/receive which both generate I/O from within
> the kernel instead of user space.
> So here, btrfs may be a contributor to the problem too, but while btrfs
> still trashes my system if I remove the caching device on bcache (and
> with the default dirty ratio values), it doesn't crash the kernel.
> 
> I'll start another separate thread with the btrfs folks on how much
> pressure is put on the system, but on your side it would be good to help
> ensure that bcache doesn't crash the system altogether if too many
> requests are allowed to pile up.
> 
> Thanks,
> Marc
> -- 
> "A mouse is a device used to point at the xterm you want to type in" - A.S.R.
> Microsoft is to operating systems 
>    what McDonalds is to gourmet 
> cooking
> Home page: http://marc.merlins.org/ | PGP 
> 1024R/763BE901

-- 
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems 
   what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/ | PGP 1024R/763BE901


Re: btrfs flooding the I/O subsystem and hanging the machine, with bcache cache turned off

2016-11-30 Thread Marc MERLIN
On Wed, Nov 30, 2016 at 08:46:46AM -0800, Marc MERLIN wrote:
> +btrfs mailing list, see below why
> 
> Ok, Linus helped me find a workaround for this problem:
> https://lkml.org/lkml/2016/11/29/667
> namely:
>echo 2 > /proc/sys/vm/dirty_ratio
>echo 1 > /proc/sys/vm/dirty_background_ratio
> (it's a 24GB system, so the defaults of 20 and 10 were creating too many
> requests in th buffers)

I'll remove the bcache list on this followup since I want to concentrate
here on the fact that btrfs does behave badly with the default
dirty_ratio values.
As a reminder, it's a btrfs send/receive copy between 2 swraid5 arrays
on spinning rust.
swraid5 < bcache < dmcrypt < btrfs

Copying with btrfs send/receive causes massive hangs on the system.
Please see this explanation from Linus on why the workaround was
suggested:
https://lkml.org/lkml/2016/11/29/667

The hangs that I'm getting with bcache cache turned off (i.e.
passthrough) are now very likely only due to btrfs and mess up anything
doing file IO that ends up timing out, break USB even as reads time out
in the middle of USB requests, interrupts lost, and so forth.

All of this mostly went away with Linus' suggestion:
echo 2 > /proc/sys/vm/dirty_ratio
echo 1 > /proc/sys/vm/dirty_background_ratio

But that's hiding the symptom which I think is that btrfs is piling up too many 
I/O
requests during btrfs send/receive and btrfs scrub (probably balance too) and 
not 
looking at resulting impact to system health.

Is there a way to stop flooding the entire system with I/O and causing
so much strain on it?
(I realize that if there is a caching layer underneath that just takes
requests and says thank you without giving other clues that underneath
bad things are happening, it may be hard, but I'm asking anyway :)


[10338.968912] perf: interrupt took too long (3927 > 3917), lowering 
kernel.perf_event_max_sample_rate to 50750

[12971.047705] ftdi_sio ttyUSB15: usb_serial_generic_read_bulk_callback - urb 
stopped: -32

[17761.122238] usb 4-1.4: USB disconnect, device number 39
[17761.141063] usb 4-1.4: usbfs: USBDEVFS_CONTROL failed cmd hub-ctrl rqt 160 
rq 6 len 1024 ret -108
[17761.263252] usb 4-1: reset SuperSpeed USB device number 2 using xhci_hcd
[17761.938575] usb 4-1.4: new SuperSpeed USB device number 40 using xhci_hcd

[24130.574425] hpet1: lost 2306 rtc interrupts
[24156.034950] hpet1: lost 1628 rtc interrupts
[24173.314738] hpet1: lost 1104 rtc interrupts
[24180.129950] hpet1: lost 436 rtc interrupts
[24257.557955] hpet1: lost 4954 rtc interrupts
[24267.522656] hpet1: lost 637 rtc interrupts

[28034.954435] INFO: task btrfs:5618 blocked for more than 120 seconds.
[28034.975471]   Tainted: G U  
4.8.10-amd64-preempt-sysrq-20161121vb3tj1 #12
[28035.000964] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this 
message.
[28035.025429] btrfs   D 91154d33fc70 0  5618   5372 0x0080
[28035.047717]  91154d33fc70 00200246 911842f880c0 
9115a4cf01c0
[28035.071020]  91154d33fc58 91154d34 91165493bca0 
9115623773f0
[28035.094252]  1000 0001 91154d33fc88 
b86cf1a6
[28035.117538] Call Trace:
[28035.125791]  [] schedule+0x8b/0xa3
[28035.141550]  [] btrfs_start_ordered_extent+0xce/0x122
[28035.162457]  [] ? wake_up_atomic_t+0x2c/0x2c
[28035.180891]  [] btrfs_wait_ordered_range+0xa9/0x10d
[28035.201723]  [] btrfs_truncate+0x40/0x24b
[28035.219269]  [] btrfs_setattr+0x1da/0x2d7
[28035.237032]  [] notify_change+0x252/0x39c
[28035.254566]  [] do_truncate+0x81/0xb4
[28035.271057]  [] vfs_truncate+0xd9/0xf9
[28035.287782]  [] do_sys_truncate+0x63/0xa7

[28155.781987] INFO: task btrfs:5618 blocked for more than 120 seconds.
[28155.802229]   Tainted: G U  
4.8.10-amd64-preempt-sysrq-20161121vb3tj1 #12
[28155.827894] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this 
message.
[28155.852479] btrfs   D 91154d33fc70 0  5618   5372 0x0080
[28155.874761]  91154d33fc70 00200246 911842f880c0 
9115a4cf01c0
[28155.898059]  91154d33fc58 91154d34 91165493bca0 
9115623773f0
[28155.921464]  1000 0001 91154d33fc88 
b86cf1a6
[28155.944720] Call Trace:
[28155.953176]  [] schedule+0x8b/0xa3
[28155.968945]  [] btrfs_start_ordered_extent+0xce/0x122
[28155.989811]  [] ? wake_up_atomic_t+0x2c/0x2c
[28156.008195]  [] btrfs_wait_ordered_range+0xa9/0x10d
[28156.028498]  [] btrfs_truncate+0x40/0x24b
[28156.046081]  [] btrfs_setattr+0x1da/0x2d7
[28156.063621]  [] notify_change+0x252/0x39c
[28156.081667]  [] do_truncate+0x81/0xb4
[28156.098732]  [] vfs_truncate+0xd9/0xf9
[28156.115489]  [] do_sys_truncate+0x63/0xa7
[28156.133389]  [] SyS_truncate+0xe/0x10
[28156.149831]  [] do_syscall_64+0x61/0x72
[28156.167179]  [] entry_SYSCALL64_slow_path+0x25/0x25

[28397.436986] INFO: task btrfs:5618 blocked for more than 120 seconds.
[28397.456798]   Tainted: G U

Re: [PATCH] Btrfs: fix infinite loop when tree log recovery

2016-11-30 Thread Filipe Manana
On Fri, Oct 7, 2016 at 10:30 AM, robbieko  wrote:
> From: Robbie Ko 
>
> if log tree like below:
> leaf N:
> ...
> item 240 key (282 DIR_LOG_ITEM 0) itemoff 8189 itemsize 8
> dir log end 1275809046
> leaf N+1:
> item 0 key (282 DIR_LOG_ITEM 3936149215) itemoff 16275 itemsize 8
> dir log end 18446744073709551615
> ...
>
> when start_ret > 1275809046, but slot[0] never >= nritems,
> so never go to next leaf.

This doesn't explain how the infinite loop happens. Nor exactly how
any problem happens.

It's important to have detailed information in the change logs. I
understand that english isn't your native tongue (it's not mine
either, and I'm far from mastering it), but that's not an excuse to
not express all the important information in detail (we can all live
with grammar errors and typos, and we all do such errors frequently).

I've added this patch to my branch at
https://git.kernel.org/cgit/linux/kernel/git/fdmanana/linux.git/log/?h=for-chris-4.10
but with a modified changelog and subject.

The results of the wrong logic that decides when to move to the next
leaf are unpredictable, and it won't always result in an infinite
loop. We are accessing a slot that doesn't point to an item, i.e. a
memory location containing garbage or something unexpected, and in the
worst case that location is beyond the last page of the extent buffer.

Thanks.
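
For readability, here is the iteration in find_dir_range() with the quoted
diff below applied, shown merged rather than as a diff (kernel fragment, not
standalone code):

next:
        /* check the next slot in the tree to see if it is a valid item */
        nritems = btrfs_header_nritems(path->nodes[0]);
        /* advance the slot first ... */
        path->slots[0]++;
        /* ... and only then move to the next leaf if we ran off the end */
        if (path->slots[0] >= nritems) {
                ret = btrfs_next_leaf(root, path);
                if (ret)
                        goto out;
        }

        btrfs_item_key_to_cpu(path->nodes[0], &key, path->slots[0]);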


>
> Signed-off-by: Robbie Ko 
> ---
>  fs/btrfs/tree-log.c | 3 +--
>  1 file changed, 1 insertion(+), 2 deletions(-)
>
> diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
> index ef9c55b..e63dd99 100644
> --- a/fs/btrfs/tree-log.c
> +++ b/fs/btrfs/tree-log.c
> @@ -1940,12 +1940,11 @@ static noinline int find_dir_range(struct btrfs_root 
> *root,
>  next:
> /* check the next slot in the tree to see if it is a valid item */
> nritems = btrfs_header_nritems(path->nodes[0]);
> +   path->slots[0]++;
> if (path->slots[0] >= nritems) {
> ret = btrfs_next_leaf(root, path);
> if (ret)
> goto out;
> -   } else {
> -   path->slots[0]++;
> }
>
> btrfs_item_key_to_cpu(path->nodes[0], &key, path->slots[0]);
> --
> 1.9.1
>



-- 
Filipe David Manana,

"People will forget what you said,
 people will forget what you did,
 but people will never forget how you made them feel."


Re: Convert from RAID 5 to 10

2016-11-30 Thread Austin S. Hemmelgarn

On 2016-11-30 10:49, Wilson Meier wrote:



Am 30/11/16 um 15:37 schrieb Austin S. Hemmelgarn:

On 2016-11-30 08:12, Wilson Meier wrote:

Am 30/11/16 um 11:41 schrieb Duncan:

Wilson Meier posted on Wed, 30 Nov 2016 09:35:36 +0100 as excerpted:


Am 30/11/16 um 09:06 schrieb Martin Steigerwald:

Am Mittwoch, 30. November 2016, 10:38:08 CET schrieb Roman Mamedov:

[snip]

So the stability matrix would need to be updated not to recommend any
kind of BTRFS RAID 1 at the moment?

Actually I faced the BTRFS RAID 1 read only after first attempt of
mounting it "degraded" just a short time ago.

BTRFS still needs way more stability work it seems to me.


I would say the matrix should be updated to not recommend any RAID
Level
as from the discussion it seems they all of them have flaws.
To me RAID is broken if one cannot expect to recover from a device
failure in a solid way as this is why RAID is used.
Correct me if i'm wrong. Right now i'm making my thoughts about
migrating to another FS and/or Hardware RAID.

It should be noted that no list regular that I'm aware of anyway, would
make any claims about btrfs being stable and mature either now or in
the
near-term future in any case.  Rather to the contrary, as I
generally put
it, btrfs is still stabilizing and maturing, with backups one is
willing
to use (and as any admin of any worth would say, a backup that hasn't
been tested usable isn't yet a backup; the job of creating the backup
isn't done until that backup has been tested actually usable for
recovery) still extremely strongly recommended.  Similarly, keeping up
with the list is recommended, as is staying relatively current on both
the kernel and userspace (generally considered to be within the latest
two kernel series of either current or LTS series kernels, and with a
similarly versioned btrfs userspace).

In that context, btrfs single-device and raid1 (and raid0 of course)
are
quite usable and as stable as btrfs in general is, that being
stabilizing
but not yet fully stable and mature, with raid10 being slightly less so
and raid56 being much more experimental/unstable at this point.

But that context never claims full stability even for the relatively
stable raid1 and single device modes, and in fact anticipates that
there
may be times when recovery from the existing filesystem may not be
practical, thus the recommendation to keep tested usable backups at the
ready.

Meanwhile, it remains relatively common on this list for those
wondering
about their btrfs on long-term-stale (not a typo) "enterprise" distros,
or even debian-stale, to be actively steered away from btrfs,
especially
if they're not willing to update to something far more current than
those
distros often provide, because in general, the current stability status
of btrfs is in conflict with the reason people generally choose to use
that level of old and stale software in the first place -- they
prioritize tried and tested to work, stable and mature, over the latest
generally newer and flashier featured but sometimes not entirely
stable,
and btrfs at this point simply doesn't meet that sort of stability/
maturity expectations, nor is it likely to for some time (measured in
years), due to all the reasons enumerated so well in the above thread.


In that context, the stability status matrix on the wiki is already
reasonably accurate, certainly so IMO, because "OK" in context means as
OK as btrfs is in general, and btrfs itself remains still stabilizing,
not fully stable and mature.

If there IS an argument as to the accuracy of the raid0/1/10 OK status,
I'd argue it's purely due to people not understanding the status of
btrfs
in general, and that if there's a general deficiency at all, it's in
the
lack of a general stability status paragraph on that page itself
explaining all this, despite the fact that the main https://
btrfs.wiki.kernel.org landing page states quite plainly under stability
status that btrfs remains under heavy development and that current
kernels are strongly recommended.  (Tho were I editing it, there'd
certainly be a more prominent mention of keeping backups at the
ready as
well.)


Hi Duncan,

i understand your arguments but cannot fully agree.
First of all, i'm not sticking with old stale versions of whatever as i
try to keep my system up2date.
My kernel is 4.8.4 (Gentoo) and btrfs-progs is 4.8.4.
That being said, i'm quite aware of the heavy development status of
btrfs but pointing the finger on the users saying that they don't fully
understand the status of btrfs without giving the information on the
wiki is in my opinion not the right way. Heavy development doesn't mean
that features marked as ok are "not" or "mostly" ok in the context of
overall btrfs stability.
There is no indication on the wiki that raid1 or every other raid
(except for raid5/6) suffers from the problems stated in this thread.

The performance issues are inherent to BTRFS right now, and none of
the other issues are likely to impact most regular 

Re: 4.8.8, bcache deadlock and hard lockup

2016-11-30 Thread Marc MERLIN
+btrfs mailing list, see below why

On Tue, Nov 29, 2016 at 12:59:44PM -0800, Eric Wheeler wrote:
> On Mon, 27 Nov 2016, Coly Li wrote:
> > 
> > Yes, too many work queues... I guess the locking might be caused by some
> > very obscure reference of closure code. I cannot have any clue if I
> > cannot find a stable procedure to reproduce this issue.
> > 
> > Hmm, if there is a tool to clone all the meta data of the back end cache
> > and whole cached device, there might be a method to replay the oops much
> > easier.
> > 
> > Eric, do you have any hint ?
> 
> Note that the backing device doesn't have any metadata, just a superblock. 
> You can easily dd that off onto some other volume without transferring the 
> data. By default, data starts at 8k, or whatever you used in `make-bcache 
> -w`.

Ok, Linus helped me find a workaround for this problem:
https://lkml.org/lkml/2016/11/29/667
namely:
   echo 2 > /proc/sys/vm/dirty_ratio
   echo 1 > /proc/sys/vm/dirty_background_ratio
(it's a 24GB system, so the defaults of 20 and 10 were creating too many
requests in the buffers)
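
To put rough numbers on it (dirty_ratio is, roughly, the percentage of memory
that may hold dirty pages before writers get throttled):

   20% of 24 GB ~= 4.8 GB of dirty data allowed to accumulate (old defaults)
    2% of 24 GB ~= 0.5 GB with the values above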

Note that this is only a workaround, not a fix.

When I did this and retried my big copy, I still got 100+ kernel
work queues, but apparently the underlying swraid5 was able to unblock
and satisfy the write requests before too many accumulated and crashed
the kernel.

I'm not a kernel coder, but it seems to me that bcache needs a way to
throttle incoming requests so that it does not end up in a state where
things blow up because too many of them have piled up.

You should be able to reproduce this by taking 5 spinning rust drives,
put raid5 on top, dmcrypt, bcache and hopefully any filesystem (although
I used btrfs) and send lots of requests.
Actually to be honest, the problems have mostly been happening when I do
btrfs scrub and btrfs send/receive which both generate I/O from within
the kernel instead of user space.
So here, btrfs may be a contributor to the problem too, but while btrfs
still trashes my system if I remove the caching device on bcache (and
with the default dirty ratio values), it doesn't crash the kernel.

I'll start another separate thread with the btrfs folks on how much
pressure is put on the system, but on your side it would be good to help
ensure that bcache doesn't crash the system altogether if too many
requests are allowed to pile up.

Thanks,
Marc
-- 
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems 
   what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/ | PGP 1024R/763BE901


Re: [PATCH v2 02/14] btrfs-progs: check: introduce function to find dir_item

2016-11-30 Thread David Sterba
On Wed, Nov 16, 2016 at 10:27:59AM +0800, Qu Wenruo wrote:
> > Yes please. Third namespace for existing error bits is not a good
> > option. Move the I_ERR bits to start from 32 and use them in the low-mem
> > code that's been merged to devel.
> 
> I didn't see such fix in devel branch.

Well, that's because nobody implemented it and I was not intending to do
it myself as it's a followup to your lowmem patchset in devel.


Re: Convert from RAID 5 to 10

2016-11-30 Thread Martin Steigerwald
Am Mittwoch, 30. November 2016, 16:49:59 CET schrieb Wilson Meier:
> Am 30/11/16 um 15:37 schrieb Austin S. Hemmelgarn:
> > On 2016-11-30 08:12, Wilson Meier wrote:
> >> Am 30/11/16 um 11:41 schrieb Duncan:
> >>> Wilson Meier posted on Wed, 30 Nov 2016 09:35:36 +0100 as excerpted:
>  Am 30/11/16 um 09:06 schrieb Martin Steigerwald:
> > Am Mittwoch, 30. November 2016, 10:38:08 CET schrieb Roman Mamedov:
[…]
> >> It is really disappointing to not have this information in the wiki
> >> itself. This would have saved me, and i'm quite sure others too, a lot
> >> of time.
> >> Sorry for being a bit frustrated.
> 
> I'm not angry or something like that :) .
> I just would like to have the possibility to read such information about
> the storage i put my personal data (> 3 TB) on its official wiki.

Anyone can get an account on the wiki and add notes there, so feel free.

You can even use footnotes or something like that. Maybe it would be good to
add a paragraph there noting that features are related to one another, so while
BTRFS RAID 1 for example might be quite okay on its own, it depends on features
that are still flaky.

I myself rely quite heavily on BTRFS RAID 1 with lzo compression and it seems
to work okay for me.

-- 
Martin


Re: [PATCH] Btrfs: fix fsync deadlock in log_new_dir_dentries

2016-11-30 Thread Filipe Manana
On Fri, Oct 28, 2016 at 3:48 AM, robbieko  wrote:
> From: Robbie Ko 
>
> We found a fsync deadlock in log_new_dir_dentries, because
> btrfs_search_forward get path lock, then call btrfs_iget will
> get another extent_buffer lock, maybe occur deadlock.

This still doesn't explain how the deadlock happens.
For it to happen it's necessary that before btrfs_iget() does a tree
search, some other task gets write locks on nodes and blocks waiting
for the leaf locked by btrfs_search_forward() to be unlocked, and that
btrfs_iget() tries to read lock those same nodes write locked by that
other task.
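
The shape of the fix is simply to drop the path, and the locks it holds,
before that second search; a fragment-level sketch with illustrative
variable names (not the exact patch):

        /* btrfs_search_forward() left path->nodes[0] read-locked */
        btrfs_item_key_to_cpu(path->nodes[0], &key, path->slots[0]);
        btrfs_release_path(path);       /* drop the leaf lock first ... */
        inode = btrfs_iget(fs_info->sb, &key, root, NULL); /* ... then search again */
        if (IS_ERR(inode)) {
                ret = PTR_ERR(inode);
                goto out;
        }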

It's important to have detailed information in the change logs. I
understand that english isn't your native tongue (it's not mine
either, and I'm far from mastering it), but that's not an excuse to
not express all the important information in detail (we can all live
with grammar errors and typos).

>
> Fix this by release path before call btrfs_iget, avoid deadlock occur.
>
> Example:
> Pid waiting: 32021->32020->32028->14431->14436->32021
>
> The following are their extent_buffer locked/waiting respectively:
> extent_buffer: start:207060992, len:16384
> locker pid: 32020 read lock
> wait pid: 32021 write lock
> extent_buffer: start:14730821632, len:16384
> locker pid: 32028 read lock
> wait pid: 32020 write lock
> extent_buffer: start:446503813120, len:16384
> locker pid: 14431 write lock
> wait pid: 32028 read lock
> extent_buffer: start:446503845888, len: 16384
> locker pid: 14436 write lock
> wait pid: 14431 write lock
> extent_buffer: start: 446504386560, len: 16384
> locker pid: 32021 write lock
> wait pid: 14436 write lock
>
> The following are their call trace respectively.
> [ 4077.478852] kworker/u24:10  D 88107fc90640 0 14431  2 
> 0x
> [ 4077.486752] Workqueue: btrfs-endio-write btrfs_endio_write_helper [btrfs]
> [ 4077.494346]  880ffa56bad0 0046 9000 
> 880ffa56bfd8
> [ 4077.502629]  880ffa56bfd8 881016ce21c0 a06ecb26 
> 88101a5d6138
> [ 4077.510915]  880ebb5173b0 880ffa56baf8 880ebb517410 
> 881016ce21c0
> [ 4077.519202] Call Trace:
> [ 4077.528752]  [] ? btrfs_tree_lock+0xdd/0x2f0 [btrfs]
> [ 4077.536049]  [] ? wake_up_atomic_t+0x30/0x30
> [ 4077.542574]  [] ? btrfs_search_slot+0x79f/0xb10 [btrfs]
> [ 4077.550171]  [] ? btrfs_lookup_file_extent+0x33/0x40 
> [btrfs]
> [ 4077.558252]  [] ? __btrfs_drop_extents+0x13b/0xdf0 
> [btrfs]
> [ 4077.566140]  [] ? add_delayed_data_ref+0xe2/0x150 [btrfs]
> [ 4077.573928]  [] ? btrfs_add_delayed_data_ref+0x149/0x1d0 
> [btrfs]
> [ 4077.582399]  [] ? __set_extent_bit+0x4c0/0x5c0 [btrfs]
> [ 4077.589896]  [] ? 
> insert_reserved_file_extent.constprop.75+0xa4/0x320 [btrfs]
> [ 4077.599632]  [] ? start_transaction+0x8d/0x470 [btrfs]
> [ 4077.607134]  [] ? btrfs_finish_ordered_io+0x2e7/0x600 
> [btrfs]
> [ 4077.615329]  [] ? process_one_work+0x142/0x3d0
> [ 4077.622043]  [] ? worker_thread+0x109/0x3b0
> [ 4077.628459]  [] ? manage_workers.isra.26+0x270/0x270
> [ 4077.635759]  [] ? kthread+0xaf/0xc0
> [ 4077.641404]  [] ? kthread_create_on_node+0x110/0x110
> [ 4077.648696]  [] ? ret_from_fork+0x58/0x90
> [ 4077.654926]  [] ? kthread_create_on_node+0x110/0x110
>
> [ 4078.358087] kworker/u24:15  D 88107fcd0640 0 14436  2 
> 0x
> [ 4078.365981] Workqueue: btrfs-endio-write btrfs_endio_write_helper [btrfs]
> [ 4078.373574]  880ffa57fad0 0046 9000 
> 880ffa57ffd8
> [ 4078.381864]  880ffa57ffd8 88103004d0a0 a06ecb26 
> 88101a5d6138
> [ 4078.390163]  880fbeffc298 880ffa57faf8 880fbeffc2f8 
> 88103004d0a0
> [ 4078.398466] Call Trace:
> [ 4078.408019]  [] ? btrfs_tree_lock+0xdd/0x2f0 [btrfs]
> [ 4078.415322]  [] ? wake_up_atomic_t+0x30/0x30
> [ 4078.421844]  [] ? btrfs_search_slot+0x79f/0xb10 [btrfs]
> [ 4078.429438]  [] ? btrfs_lookup_file_extent+0x33/0x40 
> [btrfs]
> [ 4078.437518]  [] ? __btrfs_drop_extents+0x13b/0xdf0 
> [btrfs]
> [ 4078.445404]  [] ? add_delayed_data_ref+0xe2/0x150 [btrfs]
> [ 4078.453194]  [] ? btrfs_add_delayed_data_ref+0x149/0x1d0 
> [btrfs]
> [ 4078.461663]  [] ? __set_extent_bit+0x4c0/0x5c0 [btrfs]
> [ 4078.469161]  [] ? 
> insert_reserved_file_extent.constprop.75+0xa4/0x320 [btrfs]
> [ 4078.478893]  [] ? start_transaction+0x8d/0x470 [btrfs]
> [ 4078.486388]  [] ? btrfs_finish_ordered_io+0x2e7/0x600 
> [btrfs]
> [ 4078.494561]  [] ? process_one_work+0x142/0x3d0
> [ 4078.501278]  [] ? pwq_activate_delayed_work+0x27/0x40
> [ 4078.508673]  [] ? worker_thread+0x109/0x3b0
> [ 4078.515098]  [] ? manage_workers.isra.26+0x270/0x270
> [ 4078.522396]  [] ? kthread+0xaf/0xc0
> [ 4078.528032]  [] ? kthread_create_on_node+0x110/0x110
> [ 4078.535325]  [] ? ret_from_fork+0x58/0x90
> [ 4078.541552]  [] ? kthread_create_on_node+0x110/0x110
>
> [ 4079.355824] user-space-program D 88107fd30640 0 32020   

Re: [PATCH v2 02/14] btrfs-progs: check: introduce function to find dir_item

2016-11-30 Thread David Sterba
On Tue, Nov 08, 2016 at 09:45:54AM +0800, Qu Wenruo wrote:
> > Yes please. Third namespace for existing error bits is not a good
> > option. Move the I_ERR bits to start from 32 and use them in the low-mem
> > code that's been merged to devel.
> >
> Should I submit a separate fix or replace the patchset?

Separate patches please. The check patches are at the beginning of
devel and there are several cleanup patches on top of them so that would
probably cause too many merge conflicts.


Re: [PATCH v2] Btrfs: fix enospc in hole punching

2016-11-30 Thread Filipe Manana
On Fri, Oct 28, 2016 at 3:32 AM, robbieko  wrote:
> From: Robbie Ko 
>
> The hole punching can result in adding new leafs (and as a consequence
> new nodes) to the tree because when we find file extent items that span
> beyond the hole range we may end up not deleting them (just adjusting them)
> and add new file extent items representing holes.
>
> That after splitting a leaf (therefore creating a new one), a new node
> might be added to each level of the tree (since there's a new key and
> every parent node was full).
>
> Fix this by use btrfs_calc_trans_metadata_size instead of
> btrfs_calc_trunc_metadata_size.
>
> v2:
> * Improve the change log

Version information does not belong in the changelog but after the ---
below (it wouldn't make sense to have it in the git changelogs...).
See https://btrfs.wiki.kernel.org/index.php/Developer's_FAQ#Repeated_submissions
and examples from others that submit patches to this list.
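
For example, the version note belongs here in the patch mail, where git-am
ignores it:

    Signed-off-by: ...
    ---
    v2: improved the change log

     fs/btrfs/file.c | 4 ++--
     1 file changed, 2 insertions(+), 2 deletions(-)

Everything after the "---" line (the version note and the diffstat) is
discarded when the patch is applied, so it never ends up in the commit
message.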

>
> Signed-off-by: Robbie Ko 

I've reworded the changelog for clarity and added it to my branch at:
https://git.kernel.org/cgit/linux/kernel/git/fdmanana/linux.git/log/?h=for-chris-4.10

Thanks.

> ---
>  fs/btrfs/file.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
> index fea31a4..809ca85 100644
> --- a/fs/btrfs/file.c
> +++ b/fs/btrfs/file.c
> @@ -2322,7 +2322,7 @@ static int btrfs_punch_hole(struct inode *inode, loff_t 
> offset, loff_t len)
> u64 tail_len;
> u64 orig_start = offset;
> u64 cur_offset;
> -   u64 min_size = btrfs_calc_trunc_metadata_size(root, 1);
> +   u64 min_size = btrfs_calc_trans_metadata_size(root, 1);
> u64 drop_end;
> int ret = 0;
> int err = 0;
> @@ -2469,7 +2469,7 @@ static int btrfs_punch_hole(struct inode *inode, loff_t 
> offset, loff_t len)
> ret = -ENOMEM;
> goto out_free;
> }
> -   rsv->size = btrfs_calc_trunc_metadata_size(root, 1);
> +   rsv->size = btrfs_calc_trans_metadata_size(root, 1);
> rsv->failfast = 1;
>
> /*
> --
> 1.9.1
>



-- 
Filipe David Manana,

"People will forget what you said,
 people will forget what you did,
 but people will never forget how you made them feel."


Re: Convert from RAID 5 to 10

2016-11-30 Thread Niccolò Belli

I completely agree, the whole wiki status is simply *FRUSTRATING*.

Niccolò Belli

On mercoledì 30 novembre 2016 14:12:36 CET, Wilson Meier wrote:

Am 30/11/16 um 11:41 schrieb Duncan:

Wilson Meier posted on Wed, 30 Nov 2016 09:35:36 +0100 as excerpted:
 ...

Hi Duncan,

i understand your arguments but cannot fully agree.
First of all, i'm not sticking with old stale versions of whatever as i
try to keep my system up2date.
My kernel is 4.8.4 (Gentoo) and btrfs-progs is 4.8.4.
That being said, i'm quite aware of the heavy development status of
btrfs but pointing the finger on the users saying that they don't fully
understand the status of btrfs without giving the information on the
wiki is in my opinion not the right way. Heavy development doesn't mean
that features marked as ok are "not" or "mostly" ok in the context of
overall btrfs stability.
There is no indication on the wiki that raid1 or every other raid
(except for raid5/6) suffers from the problems stated in this thread.
If there are known problems then the stability matrix should point them
out or link to a corresponding wiki entry otherwise one has to assume
that the features marked as "ok" are in fact "ok".
And yes, the overall btrfs stability should be put on the wiki.

Just to give you a quick overview of my history with btrfs.
I migrated away from MD Raid and ext4 to btrfs raid6 because of its CoW
and checksum features at a time as raid6 was not considered fully stable
but also not as badly broken.
After a few months i had a disk failure and the raid could not recover.
I looked at the wiki an the mailing list and noticed that raid6 has been
marked as badly broken :(
I was quite happy to have a backup. So i asked on the btrfs IRC channel
(the wiki had no relevant information) if raid10 is usable or suffers
from the same problems. The summary was "Yes it is usable and has no
known problems". So i migrated to raid10. Now i know that raid10 (marked
as ok) has also problems with 2 disk failures in different stripes and
can in fact lead to data loss.
I thought, hmm ok, i'll split my data and use raid1 (marked as ok). And
again the mailing list states that raid1 has also problems in case of
recovery.

It is really disappointing to not have this information in the wiki
itself. This would have saved me, and i'm quite sure others too, a lot
of time.
Sorry for being a bit frustrated.



Re: Convert from RAID 5 to 10

2016-11-30 Thread Wilson Meier


Am 30/11/16 um 15:37 schrieb Austin S. Hemmelgarn:
> On 2016-11-30 08:12, Wilson Meier wrote:
>> Am 30/11/16 um 11:41 schrieb Duncan:
>>> Wilson Meier posted on Wed, 30 Nov 2016 09:35:36 +0100 as excerpted:
>>>
 Am 30/11/16 um 09:06 schrieb Martin Steigerwald:
> Am Mittwoch, 30. November 2016, 10:38:08 CET schrieb Roman Mamedov:
>> [snip]
> So the stability matrix would need to be updated not to recommend any
> kind of BTRFS RAID 1 at the moment?
>
> Actually I faced the BTRFS RAID 1 read only after first attempt of
> mounting it "degraded" just a short time ago.
>
> BTRFS still needs way more stability work it seems to me.
>
 I would say the matrix should be updated to not recommend any RAID
 Level
 as from the discussion it seems they all of them have flaws.
 To me RAID is broken if one cannot expect to recover from a device
 failure in a solid way as this is why RAID is used.
 Correct me if i'm wrong. Right now i'm making my thoughts about
 migrating to another FS and/or Hardware RAID.
>>> It should be noted that no list regular that I'm aware of anyway, would
>>> make any claims about btrfs being stable and mature either now or in
>>> the
>>> near-term future in any case.  Rather to the contrary, as I
>>> generally put
>>> it, btrfs is still stabilizing and maturing, with backups one is
>>> willing
>>> to use (and as any admin of any worth would say, a backup that hasn't
>>> been tested usable isn't yet a backup; the job of creating the backup
>>> isn't done until that backup has been tested actually usable for
>>> recovery) still extremely strongly recommended.  Similarly, keeping up
>>> with the list is recommended, as is staying relatively current on both
>>> the kernel and userspace (generally considered to be within the latest
>>> two kernel series of either current or LTS series kernels, and with a
>>> similarly versioned btrfs userspace).
>>>
>>> In that context, btrfs single-device and raid1 (and raid0 of course)
>>> are
>>> quite usable and as stable as btrfs in general is, that being
>>> stabilizing
>>> but not yet fully stable and mature, with raid10 being slightly less so
>>> and raid56 being much more experimental/unstable at this point.
>>>
>>> But that context never claims full stability even for the relatively
>>> stable raid1 and single device modes, and in fact anticipates that
>>> there
>>> may be times when recovery from the existing filesystem may not be
>>> practical, thus the recommendation to keep tested usable backups at the
>>> ready.
>>>
>>> Meanwhile, it remains relatively common on this list for those
>>> wondering
>>> about their btrfs on long-term-stale (not a typo) "enterprise" distros,
>>> or even debian-stale, to be actively steered away from btrfs,
>>> especially
>>> if they're not willing to update to something far more current than
>>> those
>>> distros often provide, because in general, the current stability status
>>> of btrfs is in conflict with the reason people generally choose to use
>>> that level of old and stale software in the first place -- they
>>> prioritize tried and tested to work, stable and mature, over the latest
>>> generally newer and flashier featured but sometimes not entirely
>>> stable,
>>> and btrfs at this point simply doesn't meet that sort of stability/
>>> maturity expectations, nor is it likely to for some time (measured in
>>> years), due to all the reasons enumerated so well in the above thread.
>>>
>>>
>>> In that context, the stability status matrix on the wiki is already
>>> reasonably accurate, certainly so IMO, because "OK" in context means as
>>> OK as btrfs is in general, and btrfs itself remains still stabilizing,
>>> not fully stable and mature.
>>>
>>> If there IS an argument as to the accuracy of the raid0/1/10 OK status,
>>> I'd argue it's purely due to people not understanding the status of
>>> btrfs
>>> in general, and that if there's a general deficiency at all, it's in
>>> the
>>> lack of a general stability status paragraph on that page itself
>>> explaining all this, despite the fact that the main https://
>>> btrfs.wiki.kernel.org landing page states quite plainly under stability
>>> status that btrfs remains under heavy development and that current
>>> kernels are strongly recommended.  (Tho were I editing it, there'd
>>> certainly be a more prominent mention of keeping backups at the
>>> ready as
>>> well.)
>>>
>> Hi Duncan,
>>
>> i understand your arguments but cannot fully agree.
>> First of all, i'm not sticking with old stale versions of whatever as i
>> try to keep my system up2date.
>> My kernel is 4.8.4 (Gentoo) and btrfs-progs is 4.8.4.
>> That being said, i'm quite aware of the heavy development status of
>> btrfs but pointing the finger on the users saying that they don't fully
>> understand the status of btrfs without giving the information on the
>> wiki is in my opinion not the right way. Heavy development doesn't mean
>> that 

Re: Convert from RAID 5 to 10

2016-11-30 Thread Austin S. Hemmelgarn

On 2016-11-30 09:04, Roman Mamedov wrote:

On Wed, 30 Nov 2016 07:50:17 -0500
"Austin S. Hemmelgarn"  wrote:


*) Read performance is not optimized: all metadata is always read from the
first device unless it has failed, data reads are supposedly balanced between
devices per PID of the process reading. Better implementations dispatch reads
per request to devices that are currently idle.

Based on what I've seen, the metadata reads get balanced too.


https://github.com/torvalds/linux/blob/v4.8/fs/btrfs/disk-io.c#L451
This starts from the mirror number 0 and tries others in an incrementing
order, until succeeds. It appears that as long as the mirror with copy #0 is up
and not corrupted, all reads will simply get satisfied from it.
That's actually how all reads work, it's just that the PID selects what 
constitutes the 'first' copy.  IIRC, that selection is done by a lower 
layer.
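
Roughly, the read selection amounts to the following (illustrative only, not
the actual kernel code; num_copies stands for the number of stripes in the
raid1/raid10 group):

        /* One process keeps hitting the same copy, while different processes
         * spread across the copies.  Other mirrors are only tried if this one
         * is missing or its data fails the checksum. */
        preferred_mirror = current->pid % num_copies;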



*) Write performance is not optimized, during long full bandwidth sequential
writes it is common to see devices writing not in parallel, but with a long
periods of just one device writing, then another. (Admittedly have been some
time since I tested that).

I've never seen this be an issue in practice, especially if you're using
transparent compression (which caps extent size, and therefore I/O size
to a given device, at 128k).  I'm also sane enough that I'm not doing
bulk streaming writes to traditional HDD's or fully saturating the
bandwidth on my SSD's (you should be over-provisioning whenever
possible).  For a desktop user, unless you're doing real-time video
recording at higher than HD resolution with high quality surround sound,
this probably isn't going to hit you (and even then you should be
recording to a temporary location with much faster write speeds (tmpfs
or ext4 without a journal for example) because you'll likely get hit
with fragmentation).


I did not use compression while observing this;
Compression doesn't make things parallel, but it does cause BTRFS to 
distribute the writes more evenly because it writes first one extent 
then the other, which in turn makes things much more efficient because 
you're not stalling as much waiting for the I/O queue to finish.  It 
also means you have to write less overall to the disk, so on systems 
which can do LZO compression significantly faster than they can write to 
or read from the disk, it will generally improve performance all around.


Also I don't know what is particularly insane about copying a 4-8 GB file onto
a storage array. I'd expect both disks to write at the same time (like they
do in pretty much any other RAID1 system), not one-after-another, effectively
slowing down the entire operation by as much as 2x in extreme cases.
I'm not talking 4-8GB files, I'm talking really big stuff at least an 
order of magnitude larger than that, stuff like filesystem images and 
big databases.  On the only system I have where I have traditional hard 
disks (7200RPM consumer SATA3 drives connected to an LSI MPT2SAS HBA, 
about 80-100MB/s bulk write speed to a single disk), an 8GB copy from 
tmpfs is only in practice about 20% slower to BTRFS raid1 mode than to 
XFS on top of a DM-RAID RAID1 volume, and about 30% slower than the same 
with ext4.  In both cases, this is actually about 50% faster than ZFS 
(which does parallelize reads and writes) in an equivalent configuration 
on the same hardware.  Comparing all of that to single disk versions on 
the same hardware, I see roughly the same performance ratios between 
filesystems, and the same goes for running on the motherboard's SATA 
controller instead of the LSI HBA.  In this case, I am using compression 
(and the data gets reasonable compression ratios), and I see both disks 
running at just below peak bandwidth, and based on tracing, most of the 
difference is in the metadata updates required to change the extents.


I would love to see BTRFS properly parallelize writes and stripe reads 
sanely, but I seriously doubt it's going to have as much impact as you 
think, especially on systems with fast storage.



As far as not mounting degraded by default, that's a conscious design
choice that isn't going to change.  There's a switch (adding 'degraded'
to the mount options) to enable this behavior per-mount, so we're still
on-par in that respect with LVM and MD, we just picked a different
default.  In this case, I actually feel it's a better default for most
cases, because most regular users aren't doing exhaustive monitoring,
and thus are not likely to notice the filesystem being mounted degraded
until it's far too late.  If the filesystem is degraded, then
_something_ has happened that the user needs to know about, and until
some sane monitoring solution is implemented, the easiest way to ensure
this is to refuse to mount.


The easiest is to write to dmesg and syslog, if a user doesn't monitor those
either, it's their own fault; and the more user friendly one would be to still
auto mount degraded, but 

Re: [PATCH] btrfs-progs: Fix extents after finding all errors

2016-11-30 Thread David Sterba
On Thu, Nov 10, 2016 at 09:01:47AM -0600, Goldwyn Rodrigues wrote:
> Simplifying the logic of fixing.
> 
> Calling fixup_extent_ref() after encountering every error causes
> more error messages after the extent is fixed. In case of multiple errors,
> this is confusing because the error message is displayed after the fix
> message and it works on stale data. It is best to show all errors and
> then fix the extents.
> 
> Set a variable and call fixup_extent_ref() if it is set. err is not used,
> so cleared it.

Sounds ok, more comments below.

> Signed-off-by: Goldwyn Rodrigues 
> ---
>  cmds-check.c | 75 
> +++-
>  1 file changed, 24 insertions(+), 51 deletions(-)
> 
> diff --git a/cmds-check.c b/cmds-check.c
> index 779870a..8fa0b38 100644
> --- a/cmds-check.c
> +++ b/cmds-check.c
> @@ -8994,6 +8994,9 @@ out:
>   ret = err;
>   }
>  
> + if (!ret)
> + fprintf(stderr, "Repaired extent references for %llu\n", 
> (unsigned long long)rec->start);

Line too long, please stick to ~80 chars, here it's easy to break line
after string.
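
I.e. something like:

		fprintf(stderr, "Repaired extent references for %llu\n",
			(unsigned long long)rec->start);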

> +
>   btrfs_release_path();
>   return ret;
>  }
> @@ -9051,7 +9054,11 @@ static int fixup_extent_flags(struct btrfs_fs_info 
> *fs_info,
>   btrfs_set_extent_flags(path.nodes[0], ei, flags);
>   btrfs_mark_buffer_dirty(path.nodes[0]);
>   btrfs_release_path();
> - return btrfs_commit_transaction(trans, root);
> + ret = btrfs_commit_transaction(trans, root);
> + if (!ret)
> + fprintf(stderr, "Repaired extent flags for %llu\n", (unsigned 
> long long)rec->start);
> +
> + return ret;
>  }
>  
>  /* right now we only prune from the extent allocation tree */
> @@ -9178,11 +9185,8 @@ static int check_extent_refs(struct btrfs_root *root,
>  {
>   struct extent_record *rec;
>   struct cache_extent *cache;
> - int err = 0;
>   int ret = 0;
> - int fixed = 0;
>   int had_dups = 0;
> - int recorded = 0;
>  
>   if (repair) {
>   /*
> @@ -9251,9 +9255,8 @@ static int check_extent_refs(struct btrfs_root *root,
>  
>   while(1) {
>   int cur_err = 0;
> + int fix = 0;
>  
> - fixed = 0;
> - recorded = 0;
>   cache = search_cache_extent(extent_cache, 0);
>   if (!cache)
>   break;
> @@ -9261,7 +9264,6 @@ static int check_extent_refs(struct btrfs_root *root,
>   if (rec->num_duplicates) {
>   fprintf(stderr, "extent item %llu has multiple extent "
>   "items\n", (unsigned long long)rec->start);
> - err = 1;
>   cur_err = 1;
>   }
>  
> @@ -9272,57 +9274,33 @@ static int check_extent_refs(struct btrfs_root *root,
>   fprintf(stderr, "extent item %llu, found %llu\n",
>   (unsigned long long)rec->extent_item_refs,
>   (unsigned long long)rec->refs);
> - ret = record_orphan_data_extents(root->fs_info, rec);
> - if (ret < 0)
> + fix = record_orphan_data_extents(root->fs_info, rec);
> + if (fix < 0)
>   goto repair_abort;

I think ret has to be set to fix here as well (in some way, eg. not
using fix for a return value), otherwise the repair_abort label will not
take the same code path as before.
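
I.e. keep ret for the return value and use fix only as a flag, something
along these lines (sketch of one possible shape, not a tested fix):

			ret = record_orphan_data_extents(root->fs_info, rec);
			if (ret < 0)
				goto repair_abort;
			if (ret)
				fix = 1;	/* can't use the extent, fall back to fixup_extent_refs() */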

> - if (ret == 0) {
> - recorded = 1;
> - } else {
> - /*
> -  * we can't use the extent to repair file
> -  * extent, let the fallback method handle it.
> -  */
> - if (!fixed && repair) {
> - ret = fixup_extent_refs(
> - root->fs_info,
> - extent_cache, rec);
> - if (ret)
> - goto repair_abort;
> - fixed = 1;
> - }
> - }
> - err = 1;


Re: [PATCH] Minor coverity defect fix - CID 1125928 In set_file_xattrs: Dereference of an explicit null value

2016-11-30 Thread David Sterba
Hi,

this patch does not meet the basic formatting requirements; these have been
extensively documented, e.g. here 
https://btrfs.wiki.kernel.org/index.php/Developer's_FAQ .

Besides the formalities, I'm missing what's the change rationale. It
deals with a strange case when the xattr name length is 0, which is
unexpected and should not be handled silently. Next I'm not sure if
bailing out of the function is right, there are more items to process.
Best if we could skip the damaged ones but still continue.


Re: [PATCH] btrfs-progs: mkfs, balance convert: warn about RAID5/6 in fiery letters

2016-11-30 Thread David Sterba
On Mon, Nov 28, 2016 at 07:51:53PM +0100, Adam Borowski wrote:
> People who don't frequent IRC nor the mailing list tend to believe RAID 5/6
> are stable; this leads to data loss.  Thus, let's do warn them.
> 
> At this point, I think fiery letters that won't be missed are warranted.
> 
> Kernel 4.9 and its -progs will be a part of LTS of multiple distributions,
> so leaving experimental features without a warning is inappropriate.

I'm ok with adding the warning about raid56 feature, but I have some
comments to how it's implemented.

Special case warning for the raid56 is ok, as it corresponds to the
'mkfs_features' table where the missing value for 'safe' should lead to
a similar warning. This is planned to be more generic, so I just want to
make sure we can adjust it later without problems.

The warning should go last, after the final summary (and respect
verbosity level). If the message were not colored, I'd completely miss
the warning. This also means the warning should not be printed from a
helper function, nor during the option parsing phase.

The colors seem a bit too much to me, red text or just emphasize
'warning' would IMHO suffice.
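
Something along these lines would cover both points (sketch only; the name
and wording are placeholders, not the actual btrfs-progs code):

#include <stdio.h>

/* Called once, after the final mkfs/balance summary, and only when not in
 * quiet mode; only the "WARNING" token is emphasised. */
static void warn_raid56_experimental(int verbose)
{
	if (!verbose)
		return;
	fprintf(stderr,
		"\033[1;31mWARNING\033[0m: RAID5/6 support has known problems and is not\n"
		"recommended for production use; check the status page before relying on it.\n");
}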


Re: [PATCH 2/2] btrfs-progs: Remove duplicate printfs in warning_trace()/assert_trace()

2016-11-30 Thread David Sterba
On Tue, Nov 29, 2016 at 10:25:14AM -0600, Goldwyn Rodrigues wrote:
> From: Goldwyn Rodrigues 
> 
> Code reduction. Call warning_trace from assert_trace in order to
> reduce the printf's used. Also, trace variable in warning_trace()
> is not required because it is already handled by BTRFS_DISABLE_BACKTRACE.

This drops the distinction between BUG_ON and WARN_ON but I'm not sure
we need it. Patch applied, thanks.


Re: [PATCH 1/2] btrfs-progs: Correct value printed by assertions/BUG_ON/WARN_ON

2016-11-30 Thread David Sterba
On Tue, Nov 29, 2016 at 10:24:52AM -0600, Goldwyn Rodrigues wrote:
> From: Goldwyn Rodrigues 
> 
> The values passed to BUG_ON/WARN_ON are negated(!) and printed, which
> results in printing the value zero for each bug/warning. For example:
> volumes.c:988: btrfs_alloc_chunk: Assertion `ret` failed, value 0
> 
> This is not useful. Instead changed to print the value of the parameter
> passed to BUG_ON()/WARN_ON(). The value needed to be changed to long
> to accommodate pointers being passed.
> 
> Also, consolidated assert() and BUG() into ifndef.
> 
> Signed-off-by: Goldwyn Rodrigues 

Applied, thanks.
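
A tiny self-contained illustration of the difference (simplified macros, not
the real btrfs-progs ones):

#include <stdio.h>

/* Simplified stand-ins for the real macros: the old form forwarded !(c),
 * so a failing check always reported "value 0". */
#define OLD_WARN_ON(c) \
	((c) ? fprintf(stderr, "Warning: `%s` failed, value %d\n", #c, !(c)) : 0)
#define NEW_WARN_ON(c) \
	((c) ? fprintf(stderr, "Warning: `%s` failed, value %ld\n", #c, (long)(c)) : 0)

int main(void)
{
	int ret = -28;		/* pretend some call returned -ENOSPC */

	OLD_WARN_ON(ret);	/* prints: value 0   (useless) */
	NEW_WARN_ON(ret);	/* prints: value -28 */
	return 0;
}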


Re: Convert from RAID 5 to 10

2016-11-30 Thread Austin S. Hemmelgarn

On 2016-11-30 08:12, Wilson Meier wrote:

Am 30/11/16 um 11:41 schrieb Duncan:

Wilson Meier posted on Wed, 30 Nov 2016 09:35:36 +0100 as excerpted:


Am 30/11/16 um 09:06 schrieb Martin Steigerwald:

Am Mittwoch, 30. November 2016, 10:38:08 CET schrieb Roman Mamedov:

[snip]

So the stability matrix would need to be updated not to recommend any
kind of BTRFS RAID 1 at the moment?

Actually I faced the BTRFS RAID 1 read only after first attempt of
mounting it "degraded" just a short time ago.

BTRFS still needs way more stability work it seems to me.


I would say the matrix should be updated to not recommend any RAID Level
as from the discussion it seems they all of them have flaws.
To me RAID is broken if one cannot expect to recover from a device
failure in a solid way as this is why RAID is used.
Correct me if i'm wrong. Right now i'm making my thoughts about
migrating to another FS and/or Hardware RAID.

It should be noted that no list regular that I'm aware of anyway, would
make any claims about btrfs being stable and mature either now or in the
near-term future in any case.  Rather to the contrary, as I generally put
it, btrfs is still stabilizing and maturing, with backups one is willing
to use (and as any admin of any worth would say, a backup that hasn't
been tested usable isn't yet a backup; the job of creating the backup
isn't done until that backup has been tested actually usable for
recovery) still extremely strongly recommended.  Similarly, keeping up
with the list is recommended, as is staying relatively current on both
the kernel and userspace (generally considered to be within the latest
two kernel series of either current or LTS series kernels, and with a
similarly versioned btrfs userspace).

In that context, btrfs single-device and raid1 (and raid0 of course) are
quite usable and as stable as btrfs in general is, that being stabilizing
but not yet fully stable and mature, with raid10 being slightly less so
and raid56 being much more experimental/unstable at this point.

But that context never claims full stability even for the relatively
stable raid1 and single device modes, and in fact anticipates that there
may be times when recovery from the existing filesystem may not be
practical, thus the recommendation to keep tested usable backups at the
ready.

Meanwhile, it remains relatively common on this list for those wondering
about their btrfs on long-term-stale (not a typo) "enterprise" distros,
or even debian-stale, to be actively steered away from btrfs, especially
if they're not willing to update to something far more current than those
distros often provide, because in general, the current stability status
of btrfs is in conflict with the reason people generally choose to use
that level of old and stale software in the first place -- they
prioritize tried and tested to work, stable and mature, over the latest
generally newer and flashier featured but sometimes not entirely stable,
and btrfs at this point simply doesn't meet that sort of stability/
maturity expectations, nor is it likely to for some time (measured in
years), due to all the reasons enumerated so well in the above thread.


In that context, the stability status matrix on the wiki is already
reasonably accurate, certainly so IMO, because "OK" in context means as
OK as btrfs is in general, and btrfs itself remains still stabilizing,
not fully stable and mature.

If there IS an argument as to the accuracy of the raid0/1/10 OK status,
I'd argue it's purely due to people not understanding the status of btrfs
in general, and that if there's a general deficiency at all, it's in the
lack of a general stability status paragraph on that page itself
explaining all this, despite the fact that the main https://
btrfs.wiki.kernel.org landing page states quite plainly under stability
status that btrfs remains under heavy development and that current
kernels are strongly recommended.  (Tho were I editing it, there'd
certainly be a more prominent mention of keeping backups at the ready as
well.)


Hi Duncan,

i understand your arguments but cannot fully agree.
First of all, i'm not sticking with old stale versions of whatever as i
try to keep my system up2date.
My kernel is 4.8.4 (Gentoo) and btrfs-progs is 4.8.4.
That being said, i'm quite aware of the heavy development status of
btrfs but pointing the finger on the users saying that they don't fully
understand the status of btrfs without giving the information on the
wiki is in my opinion not the right way. Heavy development doesn't mean
that features marked as ok are "not" or "mostly" ok in the context of
overall btrfs stability.
There is no indication on the wiki that raid1 or every other raid
(except for raid5/6) suffers from the problems stated in this thread.
The performance issues are inherent to BTRFS right now, and none of the 
other issues are likely to impact most regular users.  Most of the 
people who would be interested in the features of BTRFS also have 
existing 

Re: [PATCH] btrfs-progs: Use helper functions to access btrfs_super_block->sys_chunk_array_size

2016-11-30 Thread David Sterba
On Tue, Nov 29, 2016 at 08:29:02PM +0530, Chandan Rajendra wrote:
> btrfs_super_block->sys_chunk_array_size is stored as le32 data on
> disk. However insert_temp_chunk_item() writes sys_chunk_array_size in
> host cpu order. This commit fixes this by using super block access
> helper functions to read and write
> btrfs_super_block->sys_chunk_array_size field.
> 
> Signed-off-by: Chandan Rajendra 
> ---
>  utils.c | 5 -
>  1 file changed, 4 insertions(+), 1 deletion(-)
> 
> diff --git a/utils.c b/utils.c
> index d0189ad..7b17b20 100644
> --- a/utils.c
> +++ b/utils.c
> @@ -562,14 +562,17 @@ static int insert_temp_chunk_item(int fd, struct 
> extent_buffer *buf,
>*/
>   if (type & BTRFS_BLOCK_GROUP_SYSTEM) {
>   char *cur;
> + u32 array_size;
>  
>   cur = (char *)sb->sys_chunk_array + sb->sys_chunk_array_size;

This should also use the accessor, 'sb' is directly mapped to the buffer
read from disk.
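
I.e. something like:

		cur = (char *)sb->sys_chunk_array + btrfs_super_sys_array_size(sb);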

>   memcpy(cur, &disk_key, sizeof(disk_key));
>   cur += sizeof(disk_key);
>   read_extent_buffer(buf, cur, (unsigned long int)chunk,
>  btrfs_chunk_item_size(1));
> - sb->sys_chunk_array_size += btrfs_chunk_item_size(1) +
> + array_size = btrfs_super_sys_array_size(sb);
> + array_size += btrfs_chunk_item_size(1) +
>   sizeof(disk_key);
> + btrfs_set_super_sys_array_size(sb, array_size);
>  
>   ret = write_temp_super(fd, sb, cfg->super_bytenr);
>   }
> -- 
> 2.5.5
> 


Re: Convert from RAID 5 to 10

2016-11-30 Thread Roman Mamedov
On Wed, 30 Nov 2016 07:50:17 -0500
"Austin S. Hemmelgarn"  wrote:

> > *) Read performance is not optimized: all metadata is always read from the
> > first device unless it has failed, data reads are supposedly balanced 
> > between
> > devices per PID of the process reading. Better implementations dispatch 
> > reads
> > per request to devices that are currently idle.
> Based on what I've seen, the metadata reads get balanced too.

https://github.com/torvalds/linux/blob/v4.8/fs/btrfs/disk-io.c#L451
This starts from the mirror number 0 and tries others in an incrementing
order, until succeeds. It appears that as long as the mirror with copy #0 is up
and not corrupted, all reads will simply get satisfied from it.

> > *) Write performance is not optimized, during long full bandwidth sequential
> > writes it is common to see devices writing not in parallel, but with a long
> > periods of just one device writing, then another. (Admittedly have been some
> > time since I tested that).
> I've never seen this be an issue in practice, especially if you're using 
> transparent compression (which caps extent size, and therefore I/O size 
> to a given device, at 128k).  I'm also sane enough that I'm not doing 
> bulk streaming writes to traditional HDD's or fully saturating the 
> bandwidth on my SSD's (you should be over-provisioning whenever 
> possible).  For a desktop user, unless you're doing real-time video 
> recording at higher than HD resolution with high quality surround sound, 
> this probably isn't going to hit you (and even then you should be 
> recording to a temporary location with much faster write speeds (tmpfs 
> or ext4 without a journal for example) because you'll likely get hit 
> with fragmentation).

I did not use compression while observing this;

Also I don't know what is particularly insane about copying a 4-8 GB file onto
a storage array. I'd expect both disks to write at the same time (like they
do in pretty much any other RAID1 system), not one-after-another, effectively
slowing down the entire operation by as much as 2x in extreme cases.

> As far as not mounting degraded by default, that's a conscious design 
> choice that isn't going to change.  There's a switch (adding 'degraded' 
> to the mount options) to enable this behavior per-mount, so we're still 
> on-par in that respect with LVM and MD, we just picked a different 
> default.  In this case, I actually feel it's a better default for most 
> cases, because most regular users aren't doing exhaustive monitoring, 
> and thus are not likely to notice the filesystem being mounted degraded 
> until it's far too late.  If the filesystem is degraded, then 
> _something_ has happened that the user needs to know about, and until 
> some sane monitoring solution is implemented, the easiest way to ensure 
> this is to refuse to mount.

The easiest is to write to dmesg and syslog, if a user doesn't monitor those
either, it's their own fault; and the more user friendly one would be to still
auto mount degraded, but read-only.

Comparing to Ext4, that one appears to have the "errors=continue" behavior by
default, the user has to explicitly request "errors=remount-ro", and I have
never seen anyone use or recommend the third option of "errors=panic", which
is basically the equivalent of the current Btrfs practice.

> > *) It does not properly handle a device disappearing during operation. 
> > (There
> > is a patchset to add that).
> >
> > *) It does not properly handle said device returning (under a
> > different /dev/sdX name, for bonus points).
> These are not an easy problem to fix completely, especially considering 
> that the device is currently guaranteed to reappear under a different 
> name because BTRFS will still have an open reference on the original 
> device name.
> 
> On top of that, if you've got hardware that's doing this without manual 
> intervention, you've got much bigger issues than how BTRFS reacts to it. 
>   No correctly working hardware should be doing this.

Unplugging and replugging a SATA cable of a RAID1 member should never put your
system at risk of massive filesystem corruption; you cannot say it
absolutely doesn't with the current implementation.

-- 
With respect,
Roman


Re: Convert from RAID 5 to 10

2016-11-30 Thread Wilson Meier
Am 30/11/16 um 11:41 schrieb Duncan:
> Wilson Meier posted on Wed, 30 Nov 2016 09:35:36 +0100 as excerpted:
>
>> Am 30/11/16 um 09:06 schrieb Martin Steigerwald:
>>> Am Mittwoch, 30. November 2016, 10:38:08 CET schrieb Roman Mamedov:
 [snip]
>>> So the stability matrix would need to be updated not to recommend any
>>> kind of BTRFS RAID 1 at the moment?
>>>
>>> Actually I faced the BTRFS RAID 1 read only after first attempt of
>>> mounting it "degraded" just a short time ago.
>>>
>>> BTRFS still needs way more stability work it seems to me.
>>>
>> I would say the matrix should be updated to not recommend any RAID Level
>> as from the discussion it seems they all of them have flaws.
>> To me RAID is broken if one cannot expect to recover from a device
>> failure in a solid way as this is why RAID is used.
>> Correct me if i'm wrong. Right now i'm making my thoughts about
>> migrating to another FS and/or Hardware RAID.
> It should be noted that no list regular that I'm aware of anyway, would 
> make any claims about btrfs being stable and mature either now or in the 
> near-term future in any case.  Rather to the contrary, as I generally put 
> it, btrfs is still stabilizing and maturing, with backups one is willing 
> to use (and as any admin of any worth would say, a backup that hasn't 
> been tested usable isn't yet a backup; the job of creating the backup 
> isn't done until that backup has been tested actually usable for 
> recovery) still extremely strongly recommended.  Similarly, keeping up 
> with the list is recommended, as is staying relatively current on both 
> the kernel and userspace (generally considered to be within the latest 
> two kernel series of either current or LTS series kernels, and with a 
> similarly versioned btrfs userspace).
>
> In that context, btrfs single-device and raid1 (and raid0 of course) are 
> quite usable and as stable as btrfs in general is, that being stabilizing 
> but not yet fully stable and mature, with raid10 being slightly less so 
> and raid56 being much more experimental/unstable at this point.
>
> But that context never claims full stability even for the relatively 
> stable raid1 and single device modes, and in fact anticipates that there 
> may be times when recovery from the existing filesystem may not be 
> practical, thus the recommendation to keep tested usable backups at the 
> ready.
>
> Meanwhile, it remains relatively common on this list for those wondering 
> about their btrfs on long-term-stale (not a typo) "enterprise" distros, 
> or even debian-stale, to be actively steered away from btrfs, especially 
> if they're not willing to update to something far more current than those 
> distros often provide, because in general, the current stability status 
> of btrfs is in conflict with the reason people generally choose to use 
> that level of old and stale software in the first place -- they 
> prioritize tried and tested to work, stable and mature, over the latest 
> generally newer and flashier featured but sometimes not entirely stable, 
> and btrfs at this point simply doesn't meet that sort of stability/
> maturity expectations, nor is it likely to for some time (measured in 
> years), due to all the reasons enumerated so well in the above thread.
>
>
> In that context, the stability status matrix on the wiki is already 
> reasonably accurate, certainly so IMO, because "OK" in context means as 
> OK as btrfs is in general, and btrfs itself remains still stabilizing, 
> not fully stable and mature.
>
> If there IS an argument as to the accuracy of the raid0/1/10 OK status, 
> I'd argue it's purely due to people not understanding the status of btrfs 
> in general, and that if there's a general deficiency at all, it's in the 
> lack of a general stability status paragraph on that page itself 
> explaining all this, despite the fact that the main https://
> btrfs.wiki.kernel.org landing page states quite plainly under stability 
> status that btrfs remains under heavy development and that current 
> kernels are strongly recommended.  (Tho were I editing it, there'd 
> certainly be a more prominent mention of keeping backups at the ready as 
> well.)
>
Hi Duncan,

I understand your arguments but cannot fully agree.
First of all, I'm not sticking with old, stale versions of anything; I try
to keep my system up to date.
My kernel is 4.8.4 (Gentoo) and btrfs-progs is 4.8.4.
That being said, I'm quite aware of the heavy development status of btrfs,
but pointing the finger at users, saying they don't fully understand the
status of btrfs, while not giving that information on the wiki, is in my
opinion not the right way. Heavy development doesn't mean that features
marked as OK are "not" or only "mostly" OK in the context of overall btrfs
stability.
There is no indication on the wiki that raid1 or any other raid level
(except raid5/6) suffers from the problems stated in this thread.
If there are known problems, then the stability matrix should point them out.

[PULL] Btrfs updates for 4.10

2016-11-30 Thread David Sterba
Hi,

here's my first pull request for 4.10: assorted patches that have been in
for-next, mostly fixes and some cleanups. I'm expecting to send one more before
rc1; I don't see much reason to hold the current queue back any longer.


The following changes since commit e5517c2a5a49ed5e99047008629f1cd60246ea0e:

  Linux 4.9-rc7 (2016-11-27 13:08:04 -0800)

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux.git for-chris-4.10

for you to fetch changes up to 515bdc479097ec9d5f389202842345af3162f71c:

  Merge branch 'misc-4.10' into for-chris-4.10-20161130 (2016-11-30 14:02:20 
+0100)
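
For anyone who wants to browse the queue locally before it lands, fetching the
branch into an existing kernel tree should be enough (branch and base as listed
above):

  $ git fetch git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux.git for-chris-4.10
  $ git log --oneline v4.9-rc7..FETCH_HEAD
  $ git merge FETCH_HEAD

The merge step is of course only needed if you actually want to build and test
the result.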


Adam Borowski (1):
  btrfs: make block group flags in balance printks human-readable

Christoph Hellwig (9):
  btrfs: don't abuse REQ_OP_* flags for btrfs_map_block
  btrfs: use bio iterators for the decompression handlers
  btrfs: don't access the bio directly in the raid5/6 code
  btrfs: don't access the bio directly in the direct I/O code
  btrfs: don't access the bio directly in btrfs_csum_one_bio
  btrfs: use bi_size
  btrfs: calculate end of bio offset properly
  btrfs: refactor __btrfs_lookup_bio_sums to use bio_for_each_segment_all
  btrfs: use bio_for_each_segment_all in __btrfsic_submit_bio

Christophe JAILLET (1):
  btrfs: remove redundant check of btrfs_iget return value

David Sterba (17):
  btrfs: remove unused headers, statfs.h
  btrfs: remove stale comment from btrfs_statfs
  btrfs: rename helper macros for qgroup and aux data casts
  btrfs: reada, cleanup remove unneeded variable in __readahead_hook
  btrfs: reada, remove unused parameter from __readahead_hook
  btrfs: reada, sink start parameter to btree_readahead_hook
  btrfs: reada, remove pointless BUG_ON in reada_find_extent
  btrfs: reada, remove pointless BUG_ON check for fs_info
  btrfs: remove trivial helper btrfs_find_tree_block
  btrfs: delete unused member from superblock
  btrfs: introduce helpers for updating eb uuids
  btrfs: use new helpers to set uuids in eb
  btrfs: use specialized page copying helpers in btrfs_clone_extent_buffer
  btrfs: remove constant parameter to memset_extent_buffer and rename it
  btrfs: add optimized version of eb to eb copy
  btrfs: store and load values of stripes_min/stripes_max in balance status 
item
  Merge branch 'misc-4.10' into for-chris-4.10-20161130

Domagoj Tršan (1):
  btrfs: change btrfs_csum_final result param type to u8

Jeff Mahoney (3):
  btrfs: remove old tree_root dirent processing in btrfs_real_readdir()
  btrfs: increment ctx->pos for every emitted or skipped dirent in readdir
  btrfs: Ensure proper sector alignment for btrfs_free_reserved_data_space

Josef Bacik (2):
  Btrfs: fix file extent corruption
  Btrfs: abort transaction if fill_holes() fails

Liu Bo (1):
  Btrfs: adjust len of writes if following a preallocated extent

Nick Terrell (1):
  btrfs: Call kunmap if zlib_inflateInit2 fails

Omar Sandoval (1):
  Btrfs: deal with existing encompassing extent map in btrfs_get_extent()

Qu Wenruo (4):
  btrfs: qgroup: Add comments explaining how btrfs qgroup works
  btrfs: qgroup: Rename functions to make it follow reserve,trace,account 
steps
  btrfs: Export and move leaf/subtree qgroup helpers to qgroup.c
  btrfs: qgroup: Fix qgroup data leaking by using subtree tracing

Shailendra Verma (1):
  btrfs: return early from failed memory allocations in ioctl handlers

Wang Xiaoguang (3):
  btrfs: cleanup: use already calculated value in 
btrfs_should_throttle_delayed_refs()
  btrfs: add necessary comments about tickets_id
  btrfs: improve delayed refs iterations

Xiaoguang Wang (1):
  btrfs: remove useless comments

 fs/btrfs/check-integrity.c   |  32 ++---
 fs/btrfs/compression.c   | 142 -
 fs/btrfs/compression.h   |  12 +-
 fs/btrfs/ctree.c |  49 +++-
 fs/btrfs/ctree.h |  14 ++-
 fs/btrfs/delayed-inode.c |   3 +-
 fs/btrfs/delayed-inode.h |   2 +-
 fs/btrfs/delayed-ref.c   |  20 ++-
 fs/btrfs/delayed-ref.h   |   8 ++
 fs/btrfs/disk-io.c   |  30 ++---
 fs/btrfs/disk-io.h   |   4 +-
 fs/btrfs/extent-tree.c   | 263 ---
 fs/btrfs/extent_io.c |  49 ++--
 fs/btrfs/extent_io.h |   9 +-
 fs/btrfs/file-item.c |  55 
 fs/btrfs/file.c  |  35 +-
 fs/btrfs/free-space-cache.c  |  10 +-
 fs/btrfs/inode.c | 163 
 fs/btrfs/ioctl.c |  32 ++---
 fs/btrfs/lzo.c   |  17 +--
 fs/btrfs/qgroup.c| 

Re: [PATCH v4 1/3] btrfs: Add WARN_ON for qgroup reserved underflow

2016-11-30 Thread David Sterba
On Wed, Nov 30, 2016 at 08:24:32AM +0800, Qu Wenruo wrote:
> 
> 
> At 11/30/2016 12:10 AM, David Sterba wrote:
> > On Mon, Nov 28, 2016 at 09:40:07AM +0800, Qu Wenruo wrote:
> >> Goldwyn Rodrigues has exposed and fixed a bug which underflows btrfs
> >> qgroup reserved space and leads to a non-writable fs.
> >>
> >> This reminds us that we don't have enough underflow checks for qgroup
> >> reserved space.
> >>
> >> For the underflow case, we should not actually underflow the numbers, but
> >> warn and keep qgroup working.
> >>
> >> So add more checks on qgroup reserved space, and add WARN_ON() and
> >> btrfs_warn() for any underflow case.
> >>
> >> Signed-off-by: Qu Wenruo 
> >> Reviewed-by: David Sterba 
> >
> > One of the warnings is visible during xfstests
> > (btrfs_qgroup_free_refroot); is there a fix, either a patch on the
> > mailing list or work in progress? If not, I'm a bit reluctant to add it
> > to 4.10, as we'd get that reported from users for sure.
> 
> Fix is WIP; ETA is within this week.

Good, thanks.

> At least the warning is working and has helped us find bugs.

No doubt about that. I'll keep the patch in the 4.10 queue but will not add
it to the first pull that I'm about to send today.


Re: Convert from RAID 5 to 10

2016-11-30 Thread Austin S. Hemmelgarn

On 2016-11-30 00:38, Roman Mamedov wrote:

On Wed, 30 Nov 2016 00:16:48 +0100
Wilson Meier  wrote:


That said, btrfs shouldn't be used for anything other than raid1, as every other
raid level has serious problems or at least doesn't work as the expected
raid level (in terms of failure recovery).


RAID1 shouldn't be used either:

*) Read performance is not optimized: all metadata is always read from the
first device unless it has failed, while data reads are supposedly balanced
between devices per PID of the process reading. Better implementations
dispatch reads per request to devices that are currently idle.

Based on what I've seen, the metadata reads get balanced too.

As far as read balancing in general: while it doesn't work very well
for single processes, if you have a large number of processes
started sequentially (for example, a thread-pool based server), it
actually works out to be near optimal with a lot less logic than DM
and MD have.  Aggregated over an entire system it's usually near optimal
as well.


*) Write performance is not optimized: during long full-bandwidth sequential
writes it is common to see devices writing not in parallel, but with long
periods of just one device writing, then another. (Admittedly, it has been
some time since I tested that.)

I've never seen this be an issue in practice, especially if you're using 
transparent compression (which caps extent size, and therefore I/O size 
to a given device, at 128k).  I'm also sane enough that I'm not doing 
bulk streaming writes to traditional HDD's or fully saturating the 
bandwidth on my SSD's (you should be over-provisioning whenever 
possible).  For a desktop user, unless you're doing real-time video 
recording at higher than HD resolution with high quality surround sound, 
this probably isn't going to hit you (and even then you should be 
recording to a temporary location with much faster write speeds (tmpfs 
or ext4 without a journal for example) because you'll likely get hit 
with fragmentation).


This also has overall pretty low impact compared to a number of other 
things that BTRFS does (BTRFS on a single disk with single profile for 
everything versus 2 of the same disks with raid1 profile for everything 
gets less than a 20% performance difference in all the testing I've done).


*) A degraded RAID1 won't mount by default.

If this was the root filesystem, the machine won't boot.

To mount it, you need to add the "degraded" mount option.
However you have exactly a single chance at that, you MUST restore the RAID to
non-degraded state while it's mounted during that session, since it won't ever
mount again in the r/w+degraded mode, and in r/o mode you can't perform any
operations on the filesystem, including adding/removing devices.

There is a fix pending for the single chance to mount degraded thing, 
and even then, it only applies to a 2 device raid1 array (with more 
devices, new chunks are still raid1 if you're missing 1 device, so the 
checks don't trigger and refuse the mount).


As far as not mounting degraded by default, that's a conscious design 
choice that isn't going to change.  There's a switch (adding 'degraded' 
to the mount options) to enable this behavior per-mount, so we're still 
on par in that respect with LVM and MD; we just picked a different 
default.  In this case, I actually feel it's a better default for most 
cases, because most regular users aren't doing exhaustive monitoring, 
and thus are not likely to notice the filesystem being mounted degraded 
until it's far too late.  If the filesystem is degraded, then 
_something_ has happened that the user needs to know about, and until 
some sane monitoring solution is implemented, the easiest way to ensure 
this is to refuse to mount.
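
For completeness, a minimal recovery sequence for the common case (a two-device
raid1 where devid 2 has died and /dev/sdc is the replacement; the device names
and devid here are just an example) would look roughly like this, all within
the same degraded read-write mount:

  # mount -o degraded /dev/sdb1 /mnt
  # btrfs replace start -B 2 /dev/sdc /mnt
  # btrfs balance start -dconvert=raid1,soft -mconvert=raid1,soft /mnt

The soft convert at the end only rewrites chunks that are not raid1 yet (that
is, the single chunks created while the filesystem was degraded), and a
subsequent 'btrfs filesystem usage /mnt' should show no single chunks left.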


*) It does not properly handle a device disappearing during operation. (There
is a patchset to add that).

*) It does not properly handle said device returning (under a
different /dev/sdX name, for bonus points).

These are not easy problems to fix completely, especially considering 
that the device is currently guaranteed to reappear under a different 
name because BTRFS will still have an open reference on the original 
device name.


On top of that, if you've got hardware that's doing this without manual 
intervention, you've got much bigger issues than how BTRFS reacts to it.
No correctly working hardware should be doing this.


Most of these also apply to all other RAID levels.




Btrfs progs release 4.8.5

2016-11-30 Thread David Sterba
Hi,

btrfs-progs version 4.8.5 has been released; it contains an urgent bugfix for
receive that mistakenly reported an error on valid streams.  The bug was
introduced in 4.8.4 by me, my apologies.

Changes:
  * receive: fix detection of end of stream (error reported even for valid
streams)
  * other:
* added test for the receive bug
* fix linking of library-test

Tarballs: https://www.kernel.org/pub/linux/kernel/people/kdave/btrfs-progs/
Git: git://git.kernel.org/pub/scm/linux/kernel/git/kdave/btrfs-progs.git
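
For reference, building from the git tag (assuming the usual build dependencies
such as libuuid, libblkid, zlib and lzo are installed) is roughly:

  $ git clone git://git.kernel.org/pub/scm/linux/kernel/git/kdave/btrfs-progs.git
  $ cd btrfs-progs
  $ git checkout v4.8.5
  $ ./autogen.sh && ./configure && make

The release tarballs ship a generated configure script, so the autogen.sh step
should only be needed when building from git.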

Shortlog:

David Sterba (7):
  btrfs-progs: docs: fix typo in btrfs-man5
  btrfs-progs: receive: properly detect end of stream conditions
  btrfs-progs: tests: end of stream conditions
  btrfs-progs: tests: add correct rpath to library-test
  btrfs-progs: test: fix static build of library-test
  btrfs-progs: update CHANGES for v4.8.5
  Btrfs progs v4.8.5


Re: Convert from RAID 5 to 10

2016-11-30 Thread Duncan
Wilson Meier posted on Wed, 30 Nov 2016 09:35:36 +0100 as excerpted:

> On 30/11/16 at 09:06, Martin Steigerwald wrote:
>> On Wednesday, 30 November 2016, 10:38:08 CET, Roman Mamedov wrote:
>>> On Wed, 30 Nov 2016 00:16:48 +0100
>>>
>>> Wilson Meier  wrote:
>>>> That said, btrfs shouldn't be used for anything other than raid1, as every
>>>> other raid level has serious problems or at least doesn't work as the
>>>> expected raid level (in terms of failure recovery).
>>> RAID1 shouldn't be used either:
>>>
>>> *) Read performance is not optimized: all metadata is always read from
>>> the first device unless it has failed, while data reads are supposedly
>>> balanced between devices per PID of the process reading. Better
>>> implementations dispatch reads per request to devices that are
>>> currently idle.
>>>
>>> *) Write performance is not optimized: during long full-bandwidth
>>> sequential writes it is common to see devices writing not in parallel,
>>> but with long periods of just one device writing, then another.
>>> (Admittedly, it has been some time since I tested that.)
>>>
>>> *) A degraded RAID1 won't mount by default.
>>>
>>> If this was the root filesystem, the machine won't boot.
>>>
>>> To mount it, you need to add the "degraded" mount option.
>>> However you have exactly a single chance at that, you MUST restore the
>>> RAID to non-degraded state while it's mounted during that session,
>>> since it won't ever mount again in the r/w+degraded mode, and in r/o
>>> mode you can't perform any operations on the filesystem, including
>>> adding/removing devices.
>>>
>>> *) It does not properly handle a device disappearing during operation.
>>> (There is a patchset to add that).
>>>
>>> *) It does not properly handle said device returning (under a
>>> different /dev/sdX name, for bonus points).
>>>
>>> Most of these also apply to all other RAID levels.
>> So the stability matrix would need to be updated not to recommend any
>> kind of BTRFS RAID 1 at the moment?
>>
>> Actually, I faced the BTRFS RAID 1 going read-only after the first attempt
>> at mounting it "degraded" just a short time ago.
>>
>> BTRFS still needs way more stability work it seems to me.
>>
> I would say the matrix should be updated to not recommend any RAID level,
> as from the discussion it seems all of them have flaws.
> To me, RAID is broken if one cannot expect to recover from a device
> failure in a solid way, as that is the whole reason RAID is used.
> Correct me if I'm wrong. Right now I'm thinking about
> migrating to another FS and/or hardware RAID.

It should be noted that no list regular (that I'm aware of, anyway) would 
make any claims about btrfs being stable and mature either now or in the 
near-term future in any case.  Rather to the contrary, as I generally put 
it, btrfs is still stabilizing and maturing, with backups one is willing 
to use (and as any admin of any worth would say, a backup that hasn't 
been tested usable isn't yet a backup; the job of creating the backup 
isn't done until that backup has been tested actually usable for 
recovery) still extremely strongly recommended.  Similarly, keeping up 
with the list is recommended, as is staying relatively current on both 
the kernel and userspace (generally considered to be within the latest 
two kernel series of either current or LTS series kernels, and with a 
similarly versioned btrfs userspace).
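
(As a quick sanity check, assuming a reasonably standard setup, the two version
numbers that matter can be read off directly:

  $ uname -r
  $ btrfs --version

and both should be within the latest couple of kernel/progs release series.)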

In that context, btrfs single-device and raid1 (and raid0 of course) are 
quite usable and as stable as btrfs in general is, that being stabilizing 
but not yet fully stable and mature, with raid10 being slightly less so 
and raid56 being much more experimental/unstable at this point.

But that context never claims full stability even for the relatively 
stable raid1 and single device modes, and in fact anticipates that there 
may be times when recovery from the existing filesystem may not be 
practical, thus the recommendation to keep tested usable backups at the 
ready.
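
(A minimal sketch of such a tested backup, assuming /data is a btrfs subvolume
and /mnt/backup is a separate btrfs filesystem; the paths are only an example:

  # btrfs subvolume snapshot -r /data /data/.snap-20161130
  # btrfs send /data/.snap-20161130 | btrfs receive /mnt/backup
  # diff -r /data/.snap-20161130 /mnt/backup/.snap-20161130

The final diff, or at least spot-checking a sample of restored files, is the
"tested usable" part.)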

Meanwhile, it remains relatively common on this list for those wondering 
about their btrfs on long-term-stale (not a typo) "enterprise" distros, 
or even debian-stale, to be actively steered away from btrfs, especially 
if they're not willing to update to something far more current than those 
distros often provide, because in general, the current stability status 
of btrfs is in conflict with the reason people generally choose to use 
that level of old and stale software in the first place -- they 
prioritize tried and tested to work, stable and mature, over the latest 
generally newer and flashier featured but sometimes not entirely stable, 
and btrfs at this point simply doesn't meet that sort of stability/
maturity expectations, nor is it likely to for some time (measured in 
years), due to all the reasons enumerated so well in the above thread.


In that context, the stability status matrix on the wiki is already 
reasonably accurate, certainly so IMO, because "OK" in context means as 
OK as btrfs is in general, and btrfs itself remains still stabilizing, not
fully stable and mature.

Re: Convert from RAID 5 to 10

2016-11-30 Thread Wilson Meier


On 30/11/16 at 09:06, Martin Steigerwald wrote:
> On Wednesday, 30 November 2016, 10:38:08 CET, Roman Mamedov wrote:
>> On Wed, 30 Nov 2016 00:16:48 +0100
>>
>> Wilson Meier  wrote:
>>> That said, btrfs shouldn't be used for anything other than raid1, as every other
>>> raid level has serious problems or at least doesn't work as the expected
>>> raid level (in terms of failure recovery).
>> RAID1 shouldn't be used either:
>>
>> *) Read performance is not optimized: all metadata is always read from the
>> first device unless it has failed, while data reads are supposedly balanced
>> between devices per PID of the process reading. Better implementations
>> dispatch reads per request to devices that are currently idle.
>>
>> *) Write performance is not optimized: during long full-bandwidth sequential
>> writes it is common to see devices writing not in parallel, but with long
>> periods of just one device writing, then another. (Admittedly, it has been
>> some time since I tested that.)
>>
>> *) A degraded RAID1 won't mount by default.
>>
>> If this was the root filesystem, the machine won't boot.
>>
>> To mount it, you need to add the "degraded" mount option.
>> However you have exactly a single chance at that, you MUST restore the RAID
>> to non-degraded state while it's mounted during that session, since it
>> won't ever mount again in the r/w+degraded mode, and in r/o mode you can't
>> perform any operations on the filesystem, including adding/removing
>> devices.
>>
>> *) It does not properly handle a device disappearing during operation.
>> (There is a patchset to add that).
>>
>> *) It does not properly handle said device returning (under a
>> different /dev/sdX name, for bonus points).
>>
>> Most of these also apply to all other RAID levels.
> So the stability matrix would need to be updated not to recommend any kind of 
> BTRFS RAID 1 at the moment?
>
> Actually, I faced the BTRFS RAID 1 going read-only after the first attempt at
> mounting it "degraded" just a short time ago.
>
> BTRFS still needs way more stability work it seems to me.
>
I would say the matrix should be updated to not recommend any RAID level,
as from the discussion it seems all of them have flaws.
To me, RAID is broken if one cannot expect to recover from a device
failure in a solid way, as that is the whole reason RAID is used.
Correct me if I'm wrong. Right now I'm thinking about
migrating to another FS and/or hardware RAID.




Re: Convert from RAID 5 to 10

2016-11-30 Thread Martin Steigerwald
On Wednesday, 30 November 2016, 10:38:08 CET, Roman Mamedov wrote:
> On Wed, 30 Nov 2016 00:16:48 +0100
> 
> Wilson Meier  wrote:
> > That said, btrfs shouldn't be used for anything other than raid1, as every other
> > raid level has serious problems or at least doesn't work as the expected
> > raid level (in terms of failure recovery).
> 
> RAID1 shouldn't be used either:
> 
> *) Read performance is not optimized: all metadata is always read from the
> first device unless it has failed, while data reads are supposedly balanced
> between devices per PID of the process reading. Better implementations
> dispatch reads per request to devices that are currently idle.
> 
> *) Write performance is not optimized: during long full-bandwidth sequential
> writes it is common to see devices writing not in parallel, but with long
> periods of just one device writing, then another. (Admittedly, it has been
> some time since I tested that.)
> 
> *) A degraded RAID1 won't mount by default.
> 
> If this was the root filesystem, the machine won't boot.
> 
> To mount it, you need to add the "degraded" mount option.
> However you have exactly a single chance at that, you MUST restore the RAID
> to non-degraded state while it's mounted during that session, since it
> won't ever mount again in the r/w+degraded mode, and in r/o mode you can't
> perform any operations on the filesystem, including adding/removing
> devices.
> 
> *) It does not properly handle a device disappearing during operation.
> (There is a patchset to add that).
> 
> *) It does not properly handle said device returning (under a
> different /dev/sdX name, for bonus points).
> 
> Most of these also apply to all other RAID levels.

So the stability matrix would need to be updated not to recommend any kind of 
BTRFS RAID 1 at the moment?

Actually, I faced the BTRFS RAID 1 going read-only after the first attempt at
mounting it "degraded" just a short time ago.

BTRFS still needs way more stability work it seems to me.

-- 
Martin