Re: Exactly what is wrong with RAID5/6

2017-06-21 Thread Goffredo Baroncelli
On 2017-06-22 04:12, Qu Wenruo wrote:
> 
> And in that case, even if the device of data stripe 2 is missing, btrfs doesn't
> really need to use parity to rebuild it, as btrfs knows there is no extent in
> that stripe, and the data csum matches for data stripe 1.

You are assuming that there is no data on disk2. That is likely, due to the COW
nature of BTRFS, but it is not always true.

Anyway, the same problem happens if you are writing data to disk2. If
a) the data (disk2) is written, and
b) the parity is not updated (due to a power failure),

up to that point you haven't lost anything; but if

c) disk1 disappears,

you are not in a position to recompute valid data for disk1 using only data2
and the stale parity.
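
To make the failure mode concrete, here is a toy illustration (made-up byte
values, plain XOR parity, not btrfs code): rebuilding the lost disk1 from the
new disk2 data and the stale parity yields garbage.

/*
 * Toy two-data-disk RAID5 stripe hit by the write hole described above.
 */
#include <stdio.h>
#include <stdint.h>

int main(void)
{
    uint8_t d1_old = 0xAA, d2_old = 0x55;
    uint8_t parity = d1_old ^ d2_old;        /* consistent stripe */

    uint8_t d2_new = 0x0F;                   /* a) data on disk2 is rewritten */
    /* b) power failure: parity is NOT updated to d1_old ^ d2_new */

    /* c) disk1 disappears; rebuild it from disk2 + the stale parity */
    uint8_t d1_rebuilt = d2_new ^ parity;

    printf("expected d1 = 0x%02x, rebuilt d1 = 0x%02x (%s)\n",
           d1_old, d1_rebuilt,
           d1_rebuilt == d1_old ? "ok" : "corrupted by the write hole");
    return 0;
}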


> No need to use parity at all.
> 
> So that's why I think the write hole is not an urgent case to handle right
> now.
> 
> Thanks,
> Qu


-- 
gpg @keyserver.linux.it: Goffredo Baroncelli 
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5


Re: How to fix errors that check --mode lomem finds, but --mode normal doesn't?

2017-06-21 Thread Qu Wenruo



At 06/22/2017 10:53 AM, Marc MERLIN wrote:

Ok, first it finished (almost 24H)

(...)
ERROR: root 3862 EXTENT_DATA[18170706 135168] interrupt
ERROR: root 3862 EXTENT_DATA[18170706 1048576] interrupt
ERROR: root 3864 EXTENT_DATA[109336 4096] interrupt
ERROR: errors found in fs roots
found 5544779108352 bytes used, error(s) found
total csum bytes: 5344523140
total tree bytes: 71323041792
total fs tree bytes: 59288403968
total extent tree bytes: 5378260992
btree space waste bytes: 10912166856
file data blocks allocated: 7830914256896
  referenced 6244104495104

Thanks for your reply Qu

On Thu, Jun 22, 2017 at 10:22:57AM +0800, Qu Wenruo wrote:

gargamel:~# btrfs check -p --mode lowmem  /dev/mapper/dshelf2
Checking filesystem on /dev/mapper/dshelf2
UUID: 85441c59-ad11-4b25-b1fe-974f9e4acede
ERROR: extent[3886187384832, 81920] referencer count mismatch (root:
11930, owner: 375444, offset: 1851654144) wanted: 1, have: 4


This means that in the extent tree, btrfs says there is only one reference
to this extent, but lowmem mode finds 4.
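
For readers unfamiliar with what the two counts are, a very simplified model of
this lowmem-mode check (illustrative only, with made-up structure and field
names, not the btrfs-progs code) looks roughly like:

/*
 * Count how many file extent items in one subvolume point at a given data
 * extent and compare with the count recorded in the extent tree backref.
 */
#include <stdio.h>
#include <stdint.h>

struct file_extent {            /* one EXTENT_DATA item in a subvolume tree */
    uint64_t disk_bytenr;       /* start of the on-disk data extent */
    uint64_t disk_num_bytes;    /* length of that extent */
};

static unsigned count_referencers(const struct file_extent *items, size_t n,
                                  uint64_t bytenr)
{
    unsigned found = 0;
    for (size_t i = 0; i < n; i++)
        if (items[i].disk_bytenr == bytenr)
            found++;
    return found;
}

int main(void)
{
    /* pretend root 11930 has four EXTENT_DATA items sharing one extent */
    struct file_extent root_11930[] = {
        { 3886187384832ULL, 81920 }, { 3886187384832ULL, 81920 },
        { 3886187384832ULL, 81920 }, { 3886187384832ULL, 81920 },
    };
    unsigned wanted = 1;   /* count stored in the extent tree backref */
    unsigned have = count_referencers(root_11930, 4, 3886187384832ULL);

    if (have != wanted)
        printf("ERROR: extent[3886187384832, 81920] referencer count "
               "mismatch wanted: %u, have: %u\n", wanted, have);
    return 0;
}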

It would help greatly if you could dump the extent tree for it.
# btrfs-debug-tree  | grep -C 10 3886187384832
  
 extent data backref root 11712 objectid 375444 offset 1851572224 count 1

 extent data backref root 11276 objectid 375444 offset 
1851572224 count 1
 extent data backref root 11058 objectid 375444 offset 
1851572224 count 1
 extent data backref root 11494 objectid 375444 offset 
1851572224 count 1
 item 37 key (3886187352064 EXTENT_ITEM 32768) itemoff 11381 itemsize 
140
 extent refs 4 gen 32382 flags DATA
 extent data backref root 11712 objectid 375444 offset 
1851596800 count 1
 extent data backref root 11276 objectid 375444 offset 
1851596800 count 1
 extent data backref root 11058 objectid 375444 offset 
1851596800 count 1
 extent data backref root 11494 objectid 375444 offset 
1851596800 count 1
 item 38 key (3886187384832 EXTENT_ITEM 81920) itemoff 11212 itemsize 
169
 extent refs 16 gen 32382 flags DATA
 extent data backref root 11712 objectid 375444 offset 
1851654144 count 4
 extent data backref root 11276 objectid 375444 offset 
1851654144 count 4
 extent data backref root 11058 objectid 375444 offset 
1851654144 count 3
 extent data backref root 11494 objectid 375444 offset 
1851654144 count 4
 extent data backref root 11930 objectid 375444 offset 
1851654144 count 1
 item 39 key (3886187466752 EXTENT_ITEM 16384) itemoff 11043 itemsize 
169
 extent refs 5 gen 32382 flags DATA
 extent data backref root 11712 objectid 375444 offset 
1851744256 count 1
 extent data backref root 11276 objectid 375444 offset 
1851744256 count 1


Well, that is only the output from the extent tree.

I was also expecting output from the subvolume (11930) tree.

It could be done by
# btrfs-debug-tree -t 11930 | grep -C 10 3886187384832

Please note that this dump may contain filenames; feel free to mask them.


Thanks,
Qu



  

ERROR: errors found in extent allocation tree or chunk allocation
cache and super generation don't match, space cache will be invalidated
ERROR: root 3857 EXTENT_DATA[108864 4096] interrupt


This means that, for root 3857, inode 108864, file offset 4096, there is
a gap before that extent.
In NO_HOLES mode that is allowed, but if the NO_HOLES incompat flag is not set,
this is a problem.
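
Roughly, the check behind that message walks the file extents of an inode in
offset order and flags any uncovered gap when NO_HOLES is off; a minimal sketch
(illustrative only, not the actual btrfs-progs code):

/*
 * Flag a gap between consecutive EXTENT_DATA items of one inode, as the
 * lowmem "interrupt" error above does.
 */
#include <stdio.h>
#include <stdint.h>
#include <stdbool.h>

struct file_extent {
    uint64_t offset;        /* file offset of this EXTENT_DATA item */
    uint64_t num_bytes;     /* bytes covered by it */
};

static void check_inode_extents(uint64_t root, uint64_t ino,
                                const struct file_extent *e, size_t n,
                                bool no_holes)
{
    uint64_t expected = 0;  /* next file offset we expect to be covered */

    for (size_t i = 0; i < n; i++) {
        if (!no_holes && e[i].offset > expected)
            printf("ERROR: root %llu EXTENT_DATA[%llu %llu] gap before extent\n",
                   (unsigned long long)root, (unsigned long long)ino,
                   (unsigned long long)e[i].offset);
        expected = e[i].offset + e[i].num_bytes;
    }
}

int main(void)
{
    /* inode 108864: nothing covers offset 0, data starts at 4096 */
    struct file_extent extents[] = { { 4096, 4096 }, { 8192, 4096 } };

    check_inode_extents(3857, 108864, extents, 2, false);
    return 0;
}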

I wonder if this is a problem caused by inlined compressed file extent.

This can also be dumped by the following command.
# btrfs-debug-tree -t 3857  | grep -C 10 108864


This one is much bigger (192KB), I've bzipped and attached it.


Thanks for this one.
It is indeed caused by an inlined compressed extent.

Lu Fengqi will send a patch fixing it.

Thanks,
Qu



Thanks for having a look, I appreciate it.

Marc






Re: Exactly what is wrong with RAID5/6

2017-06-21 Thread Qu Wenruo



At 06/22/2017 10:43 AM, Chris Murphy wrote:

On Wed, Jun 21, 2017 at 8:12 PM, Qu Wenruo  wrote:



Well, in fact, thanks to data csum and btrfs metadata CoW, there is quite a
high chance that we won't cause any data damage.


But we have examples where data does not COW; we see a partial stripe
overwrite. And if that is interrupted, it's clear that both the old and the new
metadata pointing to that stripe are wrong. There are many more cases
where we see csum errors on Btrfs raid56 after crashes, and there are
no bad devices.


First, if it's interrupted, there is no new metadata, as metadata is 
always updated after data.


And metadata is always updated CoW, so if a data write is interrupted, we
are still at the previous transaction.



And in that case, no COW means no csum.
Btrfs won't check correctness due to the lack of csum.

So in the nodatacow case, btrfs won't detect the corruption; users take
the responsibility of keeping their data correct.








For the example I gave above, no data damage at all.

First the data is written and then power is lost; data is always written before
metadata, so after the power loss the superblock is still using the
old tree roots.

So no one is really using that newly written data.


OK, but that assumes the newly written data is always COW, which on
Btrfs raid56 is not certain; there's a bunch of RMW code which
suggests overwrites are possible.


RMW is mainly there to update P/Q: even if we only update data stripe 1, we
still need data stripe 2 to calculate P/Q.



And for raid56 metadata it suggests RMW could happen for metadata also.


As long as we have P/Q, RMW must be used.

The root problem is that we need cross-device FUA to ensure the full stripe
is written correctly.
Or we may take the extent allocator modification, to ensure we only
write into vertical stripes that contain no used data.
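
A minimal sketch of that RMW step (toy buffers, not the kernel code): rewriting
only data stripe 1 still requires the old data and old parity (or the other
data stripes) so P can be recomputed, and the data write plus the parity write
must complete together to keep the full stripe consistent:

/*
 * Toy read-modify-write of a RAID5 parity stripe: fold the change to data
 * stripe 1 into the parity with P_new = P_old ^ D1_old ^ D1_new.
 */
#include <assert.h>
#include <stdint.h>
#include <stddef.h>

#define STRIPE_LEN 16

static void rmw_update_parity(uint8_t *parity, const uint8_t *old_data,
                              const uint8_t *new_data)
{
    for (size_t i = 0; i < STRIPE_LEN; i++)
        parity[i] ^= old_data[i] ^ new_data[i];
}

int main(void)
{
    uint8_t d1[STRIPE_LEN], d2[STRIPE_LEN], p[STRIPE_LEN], d1_new[STRIPE_LEN];

    for (size_t i = 0; i < STRIPE_LEN; i++) {
        d1[i] = (uint8_t)i;
        d2[i] = (uint8_t)(0x80 + i);
        p[i]  = d1[i] ^ d2[i];            /* consistent full stripe */
        d1_new[i] = (uint8_t)(0xF0 - i);
    }

    rmw_update_parity(p, d1, d1_new);     /* must land together with the data write */

    for (size_t i = 0; i < STRIPE_LEN; i++)
        assert(p[i] == (uint8_t)(d1_new[i] ^ d2[i]));
    return 0;
}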



So anyway, RAID5/6 is only designed to handle missing devices, not power
loss.
IIRC an mdadm RAID5/6 array needs to be scrubbed each time a power loss is
detected.


Thanks,
Qu



There's fairly strong anecdotal evidence that people have fewer
problems with Btrfs raid5 when raid5 applies to data block groups, and
metadata block groups use some other non-parity based profile like
raid1.








Re: How to fix errors that check --mode lomem finds, but --mode normal doesn't?

2017-06-21 Thread Marc MERLIN
Ok, first it finished (almost 24H)

(...)
ERROR: root 3862 EXTENT_DATA[18170706 135168] interrupt
ERROR: root 3862 EXTENT_DATA[18170706 1048576] interrupt
ERROR: root 3864 EXTENT_DATA[109336 4096] interrupt
ERROR: errors found in fs roots
found 5544779108352 bytes used, error(s) found
total csum bytes: 5344523140
total tree bytes: 71323041792
total fs tree bytes: 59288403968
total extent tree bytes: 5378260992
btree space waste bytes: 10912166856
file data blocks allocated: 7830914256896
 referenced 6244104495104

Thanks for your reply Qu

On Thu, Jun 22, 2017 at 10:22:57AM +0800, Qu Wenruo wrote:
> >gargamel:~# btrfs check -p --mode lowmem  /dev/mapper/dshelf2
> >Checking filesystem on /dev/mapper/dshelf2
> >UUID: 85441c59-ad11-4b25-b1fe-974f9e4acede
> >ERROR: extent[3886187384832, 81920] referencer count mismatch (root: 
> >11930, owner: 375444, offset: 1851654144) wanted: 1, have: 4
> 
> This means that in the extent tree, btrfs says there is only one reference
> to this extent, but lowmem mode finds 4.
> 
> It would help greatly if you could dump the extent tree for it.
> # btrfs-debug-tree  | grep -C 10 3886187384832
 
extent data backref root 11712 objectid 375444 offset 
1851572224 count 1
extent data backref root 11276 objectid 375444 offset 
1851572224 count 1
extent data backref root 11058 objectid 375444 offset 
1851572224 count 1
extent data backref root 11494 objectid 375444 offset 
1851572224 count 1
item 37 key (3886187352064 EXTENT_ITEM 32768) itemoff 11381 itemsize 140
extent refs 4 gen 32382 flags DATA
extent data backref root 11712 objectid 375444 offset 
1851596800 count 1
extent data backref root 11276 objectid 375444 offset 
1851596800 count 1
extent data backref root 11058 objectid 375444 offset 
1851596800 count 1
extent data backref root 11494 objectid 375444 offset 
1851596800 count 1
item 38 key (3886187384832 EXTENT_ITEM 81920) itemoff 11212 itemsize 169
extent refs 16 gen 32382 flags DATA
extent data backref root 11712 objectid 375444 offset 
1851654144 count 4
extent data backref root 11276 objectid 375444 offset 
1851654144 count 4
extent data backref root 11058 objectid 375444 offset 
1851654144 count 3
extent data backref root 11494 objectid 375444 offset 
1851654144 count 4
extent data backref root 11930 objectid 375444 offset 
1851654144 count 1
item 39 key (3886187466752 EXTENT_ITEM 16384) itemoff 11043 itemsize 169
extent refs 5 gen 32382 flags DATA
extent data backref root 11712 objectid 375444 offset 
1851744256 count 1
extent data backref root 11276 objectid 375444 offset 
1851744256 count 1

 
> >ERROR: errors found in extent allocation tree or chunk allocation
> >cache and super generation don't match, space cache will be invalidated
> >ERROR: root 3857 EXTENT_DATA[108864 4096] interrupt
> 
> This means that, for root 3857, inode 108864, file offset 4096, there is
> a gap before that extent.
> In NO_HOLES mode that is allowed, but if the NO_HOLES incompat flag is not set,
> this is a problem.
> 
> I wonder if this is a problem caused by inlined compressed file extent.
> 
> This can also be dumped by the following command.
> # btrfs-debug-tree -t 3857  | grep -C 10 108864

This one is much bigger (192KB), I've bzipped and attached it.

Thanks for having a look, I appreciate it.

Marc
-- 
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems 
   what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/  


out.bz2
Description: Binary data


Re: Exactly what is wrong with RAID5/6

2017-06-21 Thread Chris Murphy
On Wed, Jun 21, 2017 at 8:12 PM, Qu Wenruo  wrote:

>
> Well, in fact, thanks to data csum and btrfs metadata CoW, there is quite a
> high chance that we won't cause any data damage.

But we have examples where data does not COW; we see a partial stripe
overwrite. And if that is interrupted, it's clear that both the old and the new
metadata pointing to that stripe are wrong. There are many more cases
where we see csum errors on Btrfs raid56 after crashes, and there are
no bad devices.



>
> For the example I gave above, no data damage at all.
>
> First the data is written and then power is lost; data is always written before
> metadata, so after the power loss the superblock is still using the
> old tree roots.
>
> So no one is really using that newly written data.

OK, but that assumes the newly written data is always COW, which on
Btrfs raid56 is not certain; there's a bunch of RMW code which
suggests overwrites are possible.

And for raid56 metadata it suggests RMW could happen for metadata also.

There's fairly strong anecdotal evidence that people have fewer
problems with Btrfs raid5 when raid5 applies to data block groups, and
metadata block groups use some other non-parity based profile like
raid1.



-- 
Chris Murphy


Re: [PATCH] btrfs: Remove false alert when fiemap range is smaller than on-disk extent

2017-06-21 Thread Adam Borowski
On Thu, Jun 22, 2017 at 10:01:21AM +0800, Qu Wenruo wrote:
> Commit 4751832da990 ("btrfs: fiemap: Cache and merge fiemap extent before
> submit it to user") introduced a warning to catch unemitted cached
> fiemap extent.
> 
> However such warning doesn't take the following case into consideration:
> 
> 0                   4K                  8K
> |<-- fiemap range ->|
> |<----------- On-disk extent ---------->|

With this fix, my filesystem doesn't appear to go in flames yet.

-- 
⢀⣴⠾⠻⢶⣦⠀ 
⣾⠁⢠⠒⠀⣿⡁ A dumb species has no way to open a tuna can.
⢿⡄⠘⠷⠚⠋⠀ A smart species invents a can opener.
⠈⠳⣄ A master species delegates.


Re: How to fix errors that check --mode lomem finds, but --mode normal doesn't?

2017-06-21 Thread Qu Wenruo



At 06/21/2017 11:13 PM, Marc MERLIN wrote:

On Tue, Jun 20, 2017 at 08:43:52PM -0700, Marc MERLIN wrote:

On Tue, Jun 20, 2017 at 09:31:42PM -0600, Chris Murphy wrote:

On Tue, Jun 20, 2017 at 5:12 PM, Marc MERLIN  wrote:


I'm now going to remount this with nospace_cache to see if your guess about
space_cache was correct.
Other suggestions also welcome :)


What results do you get with lowmem mode? It won't repair without
additional patches, but might give a dev a clue what's going on. I
regularly see normal mode check finds no problems, and lowmem mode
finds problems. Lowmem mode is a total rewrite so it's a different
implementation and can find things normal mode won't.


Oh, I kind of forgot that lowmem mode looked for more things than regular
mode.
I will run this tonight and see what it says.
  
It's probably still a ways from being finished given how slow lowmem is in
comparison, but sadly it found a bunch of problems which regular mode didn't
find.

I'm pretty bummed. I just spent way too long recreating this filesystem and
the multiple btrfs send/receive relationships from other machines. Took a bit
over a week :(

It looks like the errors are not major (especially if the regular mode
doesn't even see them), but without lowmem --repair, I'm kind of screwed.

I'm wondering if I could/should leave those errors unfixed until lowmem --repair
finally happens, or whether I'm looking at spending another week rebuilding
this filesystem :-/


gargamel:~# btrfs check -p --mode lowmem  /dev/mapper/dshelf2
Checking filesystem on /dev/mapper/dshelf2
UUID: 85441c59-ad11-4b25-b1fe-974f9e4acede
ERROR: extent[3886187384832, 81920] referencer count mismatch (root: 11930, 
owner: 375444, offset: 1851654144) wanted: 1, have: 4


This means that in the extent tree, btrfs says there is only one reference
to this extent, but lowmem mode finds 4.


It would help greatly if you could dump the extent tree for it.
# btrfs-debug-tree  | grep -C 10 3886187384832



ERROR: extent[3886189391872, 122880] referencer count mismatch (root: 11712, 
owner: 863395, offset: 79659008) wanted: 1, have: 2
ERROR: extent[3933249708032, 69632] referencer count mismatch (root: 12424, 
owner: 6945, offset: 2083389440) wanted: 1, have: 2
ERROR: extent[3933249708032, 69632] referencer count mismatch (root: 12172, 
owner: 6945, offset: 2083389440) wanted: 1, have: 2
ERROR: extent[4571729862656, 876544] referencer count mismatch (root: 11058, 
owner: 375442, offset: 907706368) wanted: 6, have: 21
ERROR: extent[4641490833408, 270336] referencer count mismatch (root: 11712, 
owner: 375444, offset: 1848672256) wanted: 3, have: 5
ERROR: extent[4641490833408, 270336] referencer count mismatch (root: 11276, 
owner: 375444, offset: 1848672256) wanted: 3, have: 5
ERROR: extent[4641490833408, 270336] referencer count mismatch (root: 11058, 
owner: 375444, offset: 1848672256) wanted: 3, have: 5
ERROR: extent[4641490833408, 270336] referencer count mismatch (root: 11494, 
owner: 375444, offset: 1848672256) wanted: 3, have: 5
ERROR: extent[4658555617280, 122880] referencer count mismatch (root: 11712, 
owner: 375444, offset: 1848705024) wanted: 1, have: 3
ERROR: extent[4677858123776, 417792] referencer count mismatch (root: 11712, 
owner: 863395, offset: 79523840) wanted: 1, have: 3
ERROR: extent[4677858123776, 417792] referencer count mismatch (root: 11930, 
owner: 863395, offset: 79523840) wanted: 1, have: 3
ERROR: extent[4698380947456, 409600] referencer count mismatch (root: 11930, 
owner: 375444, offset: 1851596800) wanted: 3, have: 4
ERROR: extent[4720470421504, 667648] referencer count mismatch (root: 11058, 
owner: 3463478, offset: 2334720) wanted: 2, have: 10
ERROR: extent[4783941246976, 65536] referencer count mismatch (root: 9365, 
owner: 24493, offset: 4562944) wanted: 2, have: 3
ERROR: extent[5077564477440, 106496] referencer count mismatch (root: 9370, 
owner: 1602694, offset: 734756864) wanted: 1, have: 2
ERROR: extent[5136306929664, 131489792] referencer count mismatch (root: 11712, 
owner: 375441, offset: 910999552) wanted: 16, have: 1864
ERROR: extent[5136306929664, 131489792] referencer count mismatch (root: 11276, 
owner: 375441, offset: 910999552) wanted: 867, have: 1865
ERROR: extent[5136306929664, 131489792] referencer count mismatch (root: 11058, 
owner: 375441, offset: 910999552) wanted: 126, have: 1872
ERROR: extent[5136306929664, 131489792] referencer count mismatch (root: 11494, 
owner: 375441, offset: 910999552) wanted: 866, have: 1864
ERROR: extent[5136306929664, 131489792] referencer count mismatch (root: 11930, 
owner: 375441, offset: 910999552) wanted: 861, have: 1859
ERROR: extent[5136649891840, 66781184] referencer count mismatch (root: 11058, 
owner: 375442, offset: 192659456) wanted: 5, have: 19
ERROR: extent[5136879157248, 134217728] referencer count mismatch (root: 11930, 
owner: 375442, offset: 394543104) wanted: 10, have: 33
ERROR: extent[5137380671488, 80945152] referencer count mismatch (root: 11058, 
owner: 375442,

Re: Exactly what is wrong with RAID5/6

2017-06-21 Thread Qu Wenruo



At 06/22/2017 02:24 AM, Chris Murphy wrote:

On Wed, Jun 21, 2017 at 2:45 AM, Qu Wenruo  wrote:


Unlike a pure striping method, a fully functional RAID5/6 should be written in
full-stripe units, each made up of N data stripes and correct P/Q.

Given one example to show how the write sequence affects the usability of
RAID5/6.

Existing full stripe:
X = Used space (Extent allocated)
O = Unused space
Data 1   |XXXXXX|OOOOOO|
Data 2   |OOOOOO|OOOOOO|
Parity   |WWWWWW|WWWWWW|

When some new extent is allocated in the data 1 stripe, if we write
data directly into that region and then crash,
the result will be:

Data 1   |XXXXXX|XXXXOO|
Data 2   |OOOOOO|OOOOOO|
Parity   |WWWWWW|WWWWWW|

The parity stripe is not updated. That is fine in itself, since the data is still
correct, but it reduces the usability: in this case, if we lose the device
containing the data 2 stripe, we can't recover the correct data of data 2.

Although personally I don't think it's a big problem yet.

Someone has an idea to modify the extent allocator to handle it, but anyway I
don't consider it worth it.



If there is parity corruption and there is a lost device (or bad
sector causing lost data strip), that is in effect two failures and no
raid5 recovers, you have to have raid6. However, I don't know whether
Btrfs raid6 can even recover from it? If there is a single device
failure, with a missing data strip, you have both P&Q. Typically raid6
implementations use P first, and only use Q if P is not available. Is
Btrfs raid6 the same? And if reconstruction from P fails to match data
csum, does Btrfs retry using Q? Probably not is my guess.


Well, in fact, thanks to data csum and btrfs metadata CoW, there is 
quite a high chance that we won't cause any data damage.


For the example I gave above, no data damage at all.

First the data is written and then power is lost; data is always written
before metadata, so after the power loss the superblock is still
using the old tree roots.


So no one is really using that newly written data.

And in that case, even if the device of data stripe 2 is missing, btrfs doesn't
really need to use parity to rebuild it, as btrfs knows there is no
extent in that stripe, and the data csum matches for data stripe 1.

No need to use parity at all.

So that's why I think the write hole is not an urgent case to handle
right now.


Thanks,
Qu


I think that is a valid problem calling for a solution on Btrfs, given
its mandate. It is no worse than other raid6 implementations though
which would reconstruct from bad P, and give no warning, leaving it up
to application layers to deal with the problem.

I have no idea how ZFS RAIDZ2 and RAIDZ3 handle this same scenario.







2. Parity data is not checksummed
Why is this a problem? Does it have to do with the design of BTRFS
somehow?
Parity is after all just data, BTRFS does checksum data so what is the
reason this is a problem?



Because that's one solution to solve above problem.

And no, parity is not data.


Parity strip is differentiated from data strip, and by itself parity
is meaningless. But parity plus n-1 data strips is an encoded form of
the missing data strip, and is therefore an encoded copy of the data.
We kinda have to treat the parity as fractionally important compared
to data; just like each mirror copy has some fractional value. You
don't have to have both of them, but you do have to have at least one
of them.







Re: Exactly what is wrong with RAID5/6

2017-06-21 Thread Qu Wenruo



At 06/22/2017 01:03 AM, Goffredo Baroncelli wrote:

Hi Qu,

On 2017-06-21 10:45, Qu Wenruo wrote:

At 06/21/2017 06:57 AM, waxhead wrote:

I am trying to piece together the actual status of the RAID5/6 bit of BTRFS.
The wiki refer to kernel 3.19 which was released in February 2015 so I assume
that the information there is a tad outdated (the last update on the wiki page 
was July 2016)
https://btrfs.wiki.kernel.org/index.php/RAID56

Now there are four problems listed

1. Parity may be inconsistent after a crash (the "write hole")
Is this still true? If yes, would this not apply to RAID1 /
RAID10 as well? How was it solved there, and why can't that be done for RAID5/6?


Unlike a pure striping method, a fully functional RAID5/6 should be written in
full-stripe units, each made up of N data stripes and correct P/Q.

Given one example to show how the write sequence affects the usability of RAID5/6.

Existing full stripe:
X = Used space (Extent allocated)
O = Unused space
Data 1   |XXXXXX|OOOOOO|
Data 2   |OOOOOO|OOOOOO|
Parity   |WWWWWW|WWWWWW|

When some new extent is allocated in the data 1 stripe, if we write
data directly into that region and then crash,
the result will be:

Data 1   |XXXXXX|XXXXOO|
Data 2   |OOOOOO|OOOOOO|
Parity   |WWWWWW|WWWWWW|

The parity stripe is not updated. That is fine in itself, since the data is still
correct, but it reduces the usability: in this case, if we lose the device
containing the data 2 stripe, we can't recover the correct data of data 2.

Although personally I don't think it's a big problem yet.

Someone has an idea to modify the extent allocator to handle it, but anyway I
don't consider it worth it.



2. Parity data is not checksummed
Why is this a problem? Does it have to do with the design of BTRFS somehow?
Parity is after all just data, BTRFS does checksum data so what is the reason 
this is a problem?


Because that's one solution to solve above problem.


In what way could it be a solution for the write hole?


Not my idea, so I don't know why this is a solution either.

I prefer to lower the priority of such cases, as we have more work to do.

Thanks,
Qu


If the parity is wrong AND you lost a disk, even having a checksum of the parity,
you are not in a position to rebuild the missing data. And if you rebuild wrong
data, the data checksum highlights it anyway. So adding a checksum to the parity
would not solve any issue.

A possible mitigation is to track in an "intent log" all the non-full-stripe
writes during a transaction. If a power failure aborts a transaction, at the next
mount a scrub process is started to correct the parities only in the stripes
tracked before (see the sketch below).

A full solution is to journal all the non-full-stripe writes, as MD does.
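
A sketch of what such a post-power-failure scrub pass would do per tracked
stripe (illustrative only, not the btrfs scrub code; it assumes the data
stripes themselves still pass their csums):

/*
 * Recompute parity from the csum-verified data stripes and rewrite it when
 * it does not match, i.e. when it was left stale by the crash.
 */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define STRIPE_LEN 8

static bool scrub_stripe(const uint8_t *d1, const uint8_t *d2, uint8_t *parity)
{
    uint8_t expected[STRIPE_LEN];
    bool repaired = false;

    for (size_t i = 0; i < STRIPE_LEN; i++)
        expected[i] = d1[i] ^ d2[i];

    if (memcmp(expected, parity, STRIPE_LEN) != 0) {
        memcpy(parity, expected, STRIPE_LEN);   /* rewrite stale parity */
        repaired = true;
    }
    return repaired;
}

int main(void)
{
    uint8_t d1[STRIPE_LEN] = { 1, 2, 3, 4, 5, 6, 7, 8 };
    uint8_t d2[STRIPE_LEN] = { 0 };
    uint8_t stale_parity[STRIPE_LEN] = { 0xFF };   /* left over from before the crash */

    printf("parity %s\n", scrub_stripe(d1, d2, stale_parity) ?
           "was stale and has been rewritten" : "already consistent");
    return 0;
}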


BR
G.Baroncelli






commands hang 30-60s during scrubs, includes sysrq t

2017-06-21 Thread Chris Murphy
I'm getting command hangs and service start failures during scrubs.
top says CPU idle is 58-64%.

Running 'perf top' takes more than 1 minute for results to appear.
Connecting to a web management service (cockpit) takes longer, maybe 2
minutes or sometimes the login times out. And just doing an ls -l on a
directory takes 30-45 seconds. I don't remember things being this bad
but I don't do scrubs all that often and then try to use the system at
the same time.

Anyway, it seems like scrub should not soak system resources enough to
cause any hangs or system service failures.

kernel-4.11.6-301.fc26.x86_64
btrfs-progs-4.11-1.fc27.x86_64

There are two Btrfs file systems: rootfs, and the one being scrubbed.
The file system being scrubbed is only being scrubbed, nothing is
reading or writing to it. Both file systems share the same physical
block device, a laptop hard drive.

rootfs
/dev/sda1 on / type btrfs
(rw,noatime,seclabel,compress=zlib,space_cache,subvolid=983,subvol=/root)

filesystem being scrubbed (is LUKS device)
/dev/mapper/most on /srv/most type btrfs
(rw,noatime,seclabel,space_cache,subvolid=659,subvol=/most)

elevator=deadline

Linked file "perf top" output while the scrub is happening.
https://drive.google.com/open?id=0B_2Asp8DGjJ9VzhPRklGUXdOQkE

Linked file "kernelmsgsysrqt.log" contains two sysrq t's.
https://drive.google.com/open?id=0B_2Asp8DGjJ9aDBxeVpwVURPNGc


440 monotonic time I'm waiting for 'perf top' to do something. The
command has been issued, screen is black and appears "hung".

976 monotonic time I'm waiting for a client to authenticate to the
webui management system (cockpit), which ends up failing to
authenticate due to a timeout.


-- 
Chris Murphy


[PATCH] btrfs: Remove false alert when fiemap range is smaller than on-disk extent

2017-06-21 Thread Qu Wenruo
Commit 4751832da990 ("btrfs: fiemap: Cache and merge fiemap extent before
submit it to user") introduced a warning to catch unemitted cached
fiemap extent.

However such warning doesn't take the following case into consideration:

0                   4K                  8K
|<-- fiemap range ->|
|<----------- On-disk extent ---------->|

In this case, the whole 0~8K extent is cached, and since it's larger than the
fiemap range, it breaks the fiemap extent emit loop.
This leaves the fiemap extent cached but not emitted, and it is caught by the
final fiemap extent sanity check, causing the kernel warning.

This patch removes the kernel warning and renames the sanity check to
emit_last_fiemap_cache() since it's possible and valid to have a cached
fiemap extent.

Reported-by: David Sterba 
Reported-by: Adam Borowski 
Fixes: 4751832da990 ("btrfs: fiemap: Cache and merge fiemap extent ...")
Signed-off-by: Qu Wenruo 
---
Test case will follow with xfs_io modification to allow fiemap called on
a given range other than the whole file.
---
 fs/btrfs/extent_io.c | 28 
 1 file changed, 12 insertions(+), 16 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index d3619e010005..13cae68bfe46 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -4463,29 +4463,25 @@ static int emit_fiemap_extent(struct fiemap_extent_info *fieinfo,
 }
 
 /*
- * Sanity check for fiemap cache
+ * Emit last fiemap cache
  *
- * All fiemap cache should be submitted by emit_fiemap_extent()
- * Iteration should be terminated either by last fiemap extent or
- * fieinfo->fi_extents_max.
- * So no cached fiemap should exist.
+ * The last fiemap cache may still be cached in the following case:
+ * 0                   4k                  8k
+ * |<- Fiemap range ->|
+ * |<----------- First extent ----------->|
+ *
+ * In this case, the first extent range will be cached but not emitted.
+ * So we must emit it before ending extent_fiemap().
  */
-static int check_fiemap_cache(struct btrfs_fs_info *fs_info,
-  struct fiemap_extent_info *fieinfo,
-  struct fiemap_cache *cache)
+static int emit_last_fiemap_cache(struct btrfs_fs_info *fs_info,
+ struct fiemap_extent_info *fieinfo,
+ struct fiemap_cache *cache)
 {
int ret;
 
if (!cache->cached)
return 0;
 
-   /* Small and recoverbale problem, only to info developer */
-#ifdef CONFIG_BTRFS_DEBUG
-   WARN_ON(1);
-#endif
-   btrfs_warn(fs_info,
-  "unhandled fiemap cache detected: offset=%llu phys=%llu 
len=%llu flags=0x%x",
-  cache->offset, cache->phys, cache->len, cache->flags);
ret = fiemap_fill_next_extent(fieinfo, cache->offset, cache->phys,
  cache->len, cache->flags);
cache->cached = false;
@@ -4701,7 +4697,7 @@ int extent_fiemap(struct inode *inode, struct fiemap_extent_info *fieinfo,
}
 out_free:
if (!ret)
-   ret = check_fiemap_cache(root->fs_info, fieinfo, &cache);
+   ret = emit_last_fiemap_cache(root->fs_info, fieinfo, &cache);
free_extent_map(em);
 out:
btrfs_free_path(path);
-- 
2.13.1
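
Until the mentioned xfs_io change lands, a stand-alone way to exercise this
case from user space is a direct FIEMAP ioctl over only part of a file whose
on-disk extent is larger than the requested range; a minimal sketch (error
handling kept short, file name taken from the command line):

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/fs.h>
#include <linux/fiemap.h>

int main(int argc, char **argv)
{
    if (argc != 2) {
        fprintf(stderr, "usage: %s <file>\n", argv[0]);
        return 1;
    }

    int fd = open(argv[1], O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    /* room for a handful of extents after the fixed header */
    struct fiemap *fm = calloc(1, sizeof(*fm) + 16 * sizeof(struct fiemap_extent));
    if (!fm) { close(fd); return 1; }
    fm->fm_start = 0;
    fm->fm_length = 4096;           /* range smaller than the on-disk extent */
    fm->fm_flags = FIEMAP_FLAG_SYNC;
    fm->fm_extent_count = 16;

    if (ioctl(fd, FS_IOC_FIEMAP, fm) < 0) {
        perror("FS_IOC_FIEMAP");
    } else {
        for (unsigned i = 0; i < fm->fm_mapped_extents; i++)
            printf("extent %u: logical=%llu physical=%llu len=%llu flags=0x%x\n",
                   i,
                   (unsigned long long)fm->fm_extents[i].fe_logical,
                   (unsigned long long)fm->fm_extents[i].fe_physical,
                   (unsigned long long)fm->fm_extents[i].fe_length,
                   (unsigned)fm->fm_extents[i].fe_flags);
    }

    free(fm);
    close(fd);
    return 0;
}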





Re: How to fix errors that check --mode lomem finds, but --mode normal doesn't?

2017-06-21 Thread Marc MERLIN
On Wed, Jun 21, 2017 at 05:22:15PM -0600, Chris Murphy wrote:
> I don't know what it means. Maybe Qu has some idea. He might want a
> btrfs-image  of this file system to see if it's a bug. There are still
> some bugs found with lowmem mode, so these could be bogus messages.
> But the file system clearly has problems, the question is why does
> such a new file system have these kinds of problems that can't be
> fixed by normal repair because they aren't even being detected; or
> maybe there is no problem on disk per se, the problem might be a bug.
 
Yes, that's indeed the question I was asking myself too :)
Now, I did have a couple of drives that got kicked out of a (mdadm, not
btrfs) raid array, causing the array to go away while btrfs was trying to
write to it, but my understanding of btrfs write journalling is that the new
data that was being written should have been discarded and I should have
ended up at the previous good state.
AFAIK, I'm pretty sure I didn't get any block layer corruption this time, I
just got a drive effectively pulled from a running array (well 2, one went
to degraded, and the 2nd one killed the array. I re-added them carefully and
correctly in the right order and mdadm rebuilt what it needed using the
extent tree)

For what it's worth, I've had no end of trouble with SATA/SAS cards and their
4-in-1 SATA breakout cables:
https://www.amazon.com/gp/product/B0050SLTPC/ref=oh_aui_search_detailpage?ie=UTF8&psc=1
https://www.amazon.com/gp/product/B013G4EMH8/ref=oh_aui_search_detailpage?ie=UTF8&psc=1

I have it stable now, but those cables are super sensitive and have caused
drives to get kicked out if they weren't air canned first, and plugged in
just right :-/

> In which case, off chance going back to a substantially older kernel
> might help. Maybe the latest 4.9 series kernel?

If there is reasonable evidence that it will help, I can give it a shot.

Qu, or anyone: btrfs-image is going to take a long time (maybe a
day or more), I have to use at least -s before I can share the
image, and if I need -ss, then it's even slower from what I remember.

Basically please suggest the fastest image algorithm I can use. It's a quad
core HT machine, so should I use
btrfs-image -c0 -t8 -s /dev/ image
(I'm assuming -c9 will not be faster and that -ss will be even slower)

Thanks,
Marc
-- 
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems 
   what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/  


Re: [PATCH] btrfs: DEBUG fiemap: Show more info about extent_fiemap

2017-06-21 Thread Qu Wenruo



At 06/21/2017 08:10 PM, Adam Borowski wrote:

On Wed, Jun 21, 2017 at 05:28:50PM +0800, Qu Wenruo wrote:

Would you please try this patch based on v4.12-rc5 and try to reproduce
the kernel warning?

It would be better to eliminate the noise by ensuring there is no other fiemap
caller on btrfs.


Here's my current dmesg buffer from a single dpkg --reinstall invocation.



Thanks for the dmesg!

The problem is that if the fiemap range is smaller than the extent, we will
exit without emitting the fiemap extent to VFS, leaving it cached.


As shown in your dmesg:
---
 extent_fiemap: enter: root=10152 inode=5353157 start=0 len=4096
 emit_fiemap_extent: entered: offset=0 phys=2012381351936 len=131072 flags=0x2008

 emit_fiemap_extent: assigning new fiemap
 emit_fiemap_extent: last exit ret=0
 [ cut here ]
---
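
For readers following along, a toy model of that control flow (greatly
simplified, not the kernel code) shows how the cached extent can be left behind
when the requested range ends inside it, and why a final "emit whatever is
still cached" step fixes it:

/*
 * Adjacent extents are merged into a one-entry cache and the cache is only
 * flushed when the next extent cannot be merged, so a loop that stops
 * because the extent runs past the requested range ends with the cache full.
 */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

struct cache { bool cached; uint64_t offset, len; };

static void emit(uint64_t offset, uint64_t len)
{
    printf("emit extent: offset=%llu len=%llu\n",
           (unsigned long long)offset, (unsigned long long)len);
}

int main(void)
{
    /* fiemap asks for 0..4K, the only on-disk extent covers 0..8K */
    const uint64_t range_start = 0, range_len = 4096;
    const uint64_t extent_start = 0, extent_len = 8192;
    struct cache c = { .cached = false };

    uint64_t pos = range_start;
    while (pos < range_start + range_len) {
        /* cache the extent instead of emitting it, hoping to merge more */
        c = (struct cache){ .cached = true, .offset = extent_start,
                            .len = extent_len };
        pos = extent_start + extent_len;   /* jumps past the range: loop ends */
    }

    /* without this final step the cached extent is silently dropped */
    if (c.cached)
        emit(c.offset, c.len);
    return 0;
}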

Your dmesg really helps a lot!

I'll send the fix this week.

Thanks,
Qu




Re: How to fix errors that check --mode lomem finds, but --mode normal doesn't?

2017-06-21 Thread Chris Murphy
On Wed, Jun 21, 2017 at 9:13 AM, Marc MERLIN  wrote:
> On Tue, Jun 20, 2017 at 08:43:52PM -0700, Marc MERLIN wrote:
>> On Tue, Jun 20, 2017 at 09:31:42PM -0600, Chris Murphy wrote:
>> > On Tue, Jun 20, 2017 at 5:12 PM, Marc MERLIN  wrote:
>> >
>> > > I'm now going to remount this with nospace_cache to see if your guess 
>> > > about
>> > > space_cache was correct.
>> > > Other suggestions also welcome :)
>> >
>> > What results do you get with lowmem mode? It won't repair without
>> > additional patches, but might give a dev a clue what's going on. I
>> > regularly see normal mode check finds no problems, and lowmem mode
>> > finds problems. Lowmem mode is a total rewrite so it's a different
>> > implementation and can find things normal mode won't.
>>
>> Oh, I kind of forgot that lowmem mode looked for more things than regular
>> mode.
>> I will run this tonight and see what it says.
>
> It's probably still a ways from being finished given how slow lowmem is in
> comparison, but sadly it found a bunch of problems which regular mode didn't
> find.
>
> I'm pretty bummed. I just spent way too long recreating this filesystem and
> the multiple btrfs send/receive relationships from other machines. Took a bit
> over a week :(
>
> It looks like the errors are not major (especially if the regular mode
> doesn't even see them), but without lowmem --repair, I'm kind of screwed.
>
> I'm wondering if I could/should leave those errors unfixed until lowmem 
> --repair
> finally happens, or whether I'm looking at spending another week rebuilding
> this filesystem :-/
>
>
> gargamel:~# btrfs check -p --mode lowmem  /dev/mapper/dshelf2
> Checking filesystem on /dev/mapper/dshelf2
> UUID: 85441c59-ad11-4b25-b1fe-974f9e4acede
> ERROR: extent[3886187384832, 81920] referencer count mismatch (root: 11930, 
> owner: 375444, offset: 1851654144) wanted: 1, have: 4
> ERROR: extent[3886189391872, 122880] referencer count mismatch (root: 11712, 
> owner: 863395, offset: 79659008) wanted: 1, have: 2
> ERROR: extent[3933249708032, 69632] referencer count mismatch (root: 12424, 
> owner: 6945, offset: 2083389440) wanted: 1, have: 2
> ERROR: extent[3933249708032, 69632] referencer count mismatch (root: 12172, 
> owner: 6945, offset: 2083389440) wanted: 1, have: 2
> ERROR: extent[4571729862656, 876544] referencer count mismatch (root: 11058, 
> owner: 375442, offset: 907706368) wanted: 6, have: 21
> ERROR: extent[4641490833408, 270336] referencer count mismatch (root: 11712, 
> owner: 375444, offset: 1848672256) wanted: 3, have: 5
> ERROR: extent[4641490833408, 270336] referencer count mismatch (root: 11276, 
> owner: 375444, offset: 1848672256) wanted: 3, have: 5
> ERROR: extent[4641490833408, 270336] referencer count mismatch (root: 11058, 
> owner: 375444, offset: 1848672256) wanted: 3, have: 5
> ERROR: extent[4641490833408, 270336] referencer count mismatch (root: 11494, 
> owner: 375444, offset: 1848672256) wanted: 3, have: 5
> ERROR: extent[4658555617280, 122880] referencer count mismatch (root: 11712, 
> owner: 375444, offset: 1848705024) wanted: 1, have: 3
> ERROR: extent[4677858123776, 417792] referencer count mismatch (root: 11712, 
> owner: 863395, offset: 79523840) wanted: 1, have: 3
> ERROR: extent[4677858123776, 417792] referencer count mismatch (root: 11930, 
> owner: 863395, offset: 79523840) wanted: 1, have: 3
> ERROR: extent[4698380947456, 409600] referencer count mismatch (root: 11930, 
> owner: 375444, offset: 1851596800) wanted: 3, have: 4
> ERROR: extent[4720470421504, 667648] referencer count mismatch (root: 11058, 
> owner: 3463478, offset: 2334720) wanted: 2, have: 10
> ERROR: extent[4783941246976, 65536] referencer count mismatch (root: 9365, 
> owner: 24493, offset: 4562944) wanted: 2, have: 3
> ERROR: extent[5077564477440, 106496] referencer count mismatch (root: 9370, 
> owner: 1602694, offset: 734756864) wanted: 1, have: 2
> ERROR: extent[5136306929664, 131489792] referencer count mismatch (root: 
> 11712, owner: 375441, offset: 910999552) wanted: 16, have: 1864
> ERROR: extent[5136306929664, 131489792] referencer count mismatch (root: 
> 11276, owner: 375441, offset: 910999552) wanted: 867, have: 1865
> ERROR: extent[5136306929664, 131489792] referencer count mismatch (root: 
> 11058, owner: 375441, offset: 910999552) wanted: 126, have: 1872
> ERROR: extent[5136306929664, 131489792] referencer count mismatch (root: 
> 11494, owner: 375441, offset: 910999552) wanted: 866, have: 1864
> ERROR: extent[5136306929664, 131489792] referencer count mismatch (root: 
> 11930, owner: 375441, offset: 910999552) wanted: 861, have: 1859
> ERROR: extent[5136649891840, 66781184] referencer count mismatch (root: 
> 11058, owner: 375442, offset: 192659456) wanted: 5, have: 19
> ERROR: extent[5136879157248, 134217728] referencer count mismatch (root: 
> 11930, owner: 375442, offset: 394543104) wanted: 10, have: 33
> ERROR: extent[5137380671488, 80945152] referencer count mismatch (root: 
> 11058, owner: 375442, offset: 875233280

Re: Exactly what is wrong with RAID5/6

2017-06-21 Thread Chris Murphy
On Wed, Jun 21, 2017 at 2:12 PM, Goffredo Baroncelli  wrote:


>
> Generally speaking, when you write "two failures" this means two failures at
> the same time. But the write hole happens even if these two failures are not
> at the same time:
>
> Event #1: power failure between the data stripe write and the parity stripe
> write. The stripe is incoherent.
> Event #2: a disk is failing: if you try to reconstruct the data from the
> remaining data and the parity, you get wrong data.
>
> The likelihood of these two events happening together (a power failure and, at
> the next boot, a failing disk) is quite low. But over the life of a filesystem,
> these two events will likely happen.
>
> However BTRFS has an advantage: a simple scrub may (fingers crossed) recover
> from event #1.

Event #3: the stripe is read, missing a data strip due to event #2,
and is wrongly reconstructed due to event #1; Btrfs computes crc32c on
the reconstructed data and compares it to the extent csum, which then fails,
and EIO happens.

Btrfs is susceptible to the write hole happening on disk. But it's
still detected and corrupt data isn't propagated upward.




-- 
Chris Murphy


Re: [PATCH 1/2] btrfs: account for pinned bytes and bytes_may_use in should_alloc_chunk

2017-06-21 Thread Jeff Mahoney
On 6/21/17 5:15 PM, Chris Mason wrote:
> 
> 
> On 06/21/2017 05:08 PM, Jeff Mahoney wrote:
>> On 6/21/17 4:31 PM, Chris Mason wrote:
>>> On 06/21/2017 04:14 PM, Jeff Mahoney wrote:
 On 6/14/17 11:44 AM, je...@suse.com wrote:
> From: Jeff Mahoney 
>
> In a heavy write scenario, we can end up with a large number of pinned
> bytes.  This can translate into (very) premature ENOSPC because pinned
> bytes must be accounted for when allowing a reservation but aren't
> accounted for when deciding whether to create a new chunk.
>
> This patch adds the accounting to should_alloc_chunk so that we can
> create the chunk.
>
> Signed-off-by: Jeff Mahoney 
> ---
>  fs/btrfs/extent-tree.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index cb0b924..d027807 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -4389,7 +4389,7 @@ static int should_alloc_chunk(struct btrfs_fs_info *fs_info,
>  {
>  struct btrfs_block_rsv *global_rsv = &fs_info->global_block_rsv;
>  u64 num_bytes = sinfo->total_bytes - sinfo->bytes_readonly;
> -u64 num_allocated = sinfo->bytes_used + sinfo->bytes_reserved;
> +u64 num_allocated = sinfo->bytes_used + sinfo->bytes_reserved + sinfo->bytes_pinned + sinfo->bytes_may_use;
>  u64 thresh;
>
>  if (force == CHUNK_ALLOC_FORCE)
>


 Ignore this patch.  It certainly allocates chunks more aggressively,
 but
 it means we end up with a ton of metadata chunks even when we don't
 have
 much metadata.

>>>
>>> Josef and I pushed this needle back and forth a bunch of times in the
>>> early days.  I still think we can allocate a few more chunks than we do
>>> now...
>>
>> I agree.  This patch was to fix an issue that we are seeing during
>> installation.  It'd stop with ENOSPC with >50GB completely unallocated.
>> The patch passed the test cases that were failing before but now it's
>> failing differently.  I was worried this pattern might be the end result:
>>
>> Data,single: Size:4.00GiB, Used:3.32GiB
>>/dev/vde4.00GiB
>>
>> Metadata,DUP: Size:20.00GiB, Used:204.12MiB
>>/dev/vde   40.00GiB
>>
>> System,DUP: Size:8.00MiB, Used:16.00KiB
>>/dev/vde   16.00MiB
>>
>> This is on a fresh file system with just "cp /usr /mnt" executed.
>>
>> I'm looking into it a bit more now.
> 
> Does this failure still happen with Omar's ENOSPC fix (commit:
> 70e7af244f24c94604ef6eca32ad297632018583)

Nope.  There aren't any warnings either with or without my patch.
Adding Omar's didn't make a difference.

-Jeff


-- 
Jeff Mahoney
SUSE Labs





Re: [PATCH 1/2] btrfs: account for pinned bytes and bytes_may_use in should_alloc_chunk

2017-06-21 Thread Chris Mason



On 06/21/2017 05:08 PM, Jeff Mahoney wrote:

On 6/21/17 4:31 PM, Chris Mason wrote:

On 06/21/2017 04:14 PM, Jeff Mahoney wrote:

On 6/14/17 11:44 AM, je...@suse.com wrote:

From: Jeff Mahoney 

In a heavy write scenario, we can end up with a large number of pinned
bytes.  This can translate into (very) premature ENOSPC because pinned
bytes must be accounted for when allowing a reservation but aren't
accounted for when deciding whether to create a new chunk.

This patch adds the accounting to should_alloc_chunk so that we can
create the chunk.

Signed-off-by: Jeff Mahoney 
---
 fs/btrfs/extent-tree.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index cb0b924..d027807 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -4389,7 +4389,7 @@ static int should_alloc_chunk(struct btrfs_fs_info *fs_info,
 {
 struct btrfs_block_rsv *global_rsv = &fs_info->global_block_rsv;
 u64 num_bytes = sinfo->total_bytes - sinfo->bytes_readonly;
-u64 num_allocated = sinfo->bytes_used + sinfo->bytes_reserved;
+u64 num_allocated = sinfo->bytes_used + sinfo->bytes_reserved + sinfo->bytes_pinned + sinfo->bytes_may_use;
 u64 thresh;

 if (force == CHUNK_ALLOC_FORCE)




Ignore this patch.  It certainly allocates chunks more aggressively, but
it means we end up with a ton of metadata chunks even when we don't have
much metadata.



Josef and I pushed this needle back and forth a bunch of times in the
early days.  I still think we can allocate a few more chunks than we do
now...


I agree.  This patch was to fix an issue that we are seeing during
installation.  It'd stop with ENOSPC with >50GB completely unallocated.
The patch passed the test cases that were failing before but now it's
failing differently.  I was worried this pattern might be the end result:

Data,single: Size:4.00GiB, Used:3.32GiB
   /dev/vde4.00GiB

Metadata,DUP: Size:20.00GiB, Used:204.12MiB
   /dev/vde   40.00GiB

System,DUP: Size:8.00MiB, Used:16.00KiB
   /dev/vde   16.00MiB

This is on a fresh file system with just "cp /usr /mnt" executed.

I'm looking into it a bit more now.


Does this failure still happen with Omar's ENOSPC fix (commit: 
70e7af244f24c94604ef6eca32ad297632018583)


-chris



Re: [PATCH 1/2] btrfs: account for pinned bytes and bytes_may_use in should_alloc_chunk

2017-06-21 Thread Jeff Mahoney
On 6/21/17 4:31 PM, Chris Mason wrote:
> On 06/21/2017 04:14 PM, Jeff Mahoney wrote:
>> On 6/14/17 11:44 AM, je...@suse.com wrote:
>>> From: Jeff Mahoney 
>>>
>>> In a heavy write scenario, we can end up with a large number of pinned
>>> bytes.  This can translate into (very) premature ENOSPC because pinned
>>> bytes must be accounted for when allowing a reservation but aren't
>>> accounted for when deciding whether to create a new chunk.
>>>
>>> This patch adds the accounting to should_alloc_chunk so that we can
>>> create the chunk.
>>>
>>> Signed-off-by: Jeff Mahoney 
>>> ---
>>>  fs/btrfs/extent-tree.c | 2 +-
>>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>>
>>> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
>>> index cb0b924..d027807 100644
>>> --- a/fs/btrfs/extent-tree.c
>>> +++ b/fs/btrfs/extent-tree.c
>>> @@ -4389,7 +4389,7 @@ static int should_alloc_chunk(struct btrfs_fs_info *fs_info,
>>>  {
>>>  struct btrfs_block_rsv *global_rsv = &fs_info->global_block_rsv;
>>>  u64 num_bytes = sinfo->total_bytes - sinfo->bytes_readonly;
>>> -u64 num_allocated = sinfo->bytes_used + sinfo->bytes_reserved;
>>> +u64 num_allocated = sinfo->bytes_used + sinfo->bytes_reserved +
>>> sinfo->bytes_pinned + sinfo->bytes_may_use;
>>>  u64 thresh;
>>>
>>>  if (force == CHUNK_ALLOC_FORCE)
>>>
>>
>>
>> Ignore this patch.  It certainly allocates chunks more aggressively, but
>> it means we end up with a ton of metadata chunks even when we don't have
>> much metadata.
>>
> 
> Josef and I pushed this needle back and forth a bunch of times in the
> early days.  I still think we can allocate a few more chunks than we do
> now...

I agree.  This patch was to fix an issue that we are seeing during
installation.  It'd stop with ENOSPC with >50GB completely unallocated.
The patch passed the test cases that were failing before but now it's
failing differently.  I was worried this pattern might be the end result:

Data,single: Size:4.00GiB, Used:3.32GiB
   /dev/vde4.00GiB

Metadata,DUP: Size:20.00GiB, Used:204.12MiB
   /dev/vde   40.00GiB

System,DUP: Size:8.00MiB, Used:16.00KiB
   /dev/vde   16.00MiB

This is on a fresh file system with just "cp /usr /mnt" executed.

I'm looking into it a bit more now.
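
A rough model of why the chunk allocator becomes so aggressive (illustrative
numbers and a made-up threshold, not the kernel's actual should_alloc_chunk()
logic): once bytes_pinned and bytes_may_use are counted as allocated, a
metadata space_info dominated by reservations looks almost full even though
little metadata is really used:

/*
 * Illustrative only: the point is that counting reservations as "allocated"
 * makes a mostly-empty metadata space look full, so new chunks keep coming.
 */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

struct space_info {
    uint64_t total_bytes, bytes_readonly, bytes_used,
             bytes_reserved, bytes_pinned, bytes_may_use;
};

static bool should_alloc_chunk(const struct space_info *s, bool count_reservations)
{
    uint64_t num_bytes = s->total_bytes - s->bytes_readonly;
    uint64_t allocated = s->bytes_used + s->bytes_reserved;

    if (count_reservations)
        allocated += s->bytes_pinned + s->bytes_may_use;

    /* pretend the policy is "allocate when more than ~80% appears in use" */
    return allocated * 10 > num_bytes * 8;
}

int main(void)
{
    /* 1 GiB of metadata chunks, ~200 MiB really used, ~700 MiB reserved */
    struct space_info meta = {
        .total_bytes    = 1024ULL << 20,
        .bytes_used     =  200ULL << 20,
        .bytes_reserved =   10ULL << 20,
        .bytes_may_use  =  700ULL << 20,
    };

    printf("old accounting: %s\n", should_alloc_chunk(&meta, false) ? "alloc" : "no alloc");
    printf("new accounting: %s\n", should_alloc_chunk(&meta, true)  ? "alloc" : "no alloc");
    return 0;
}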

-Jeff

-- 
Jeff Mahoney
SUSE Labs





Re: [PATCH 1/2] btrfs: account for pinned bytes and bytes_may_use in should_alloc_chunk

2017-06-21 Thread Chris Mason

On 06/21/2017 04:14 PM, Jeff Mahoney wrote:

On 6/14/17 11:44 AM, je...@suse.com wrote:

From: Jeff Mahoney 

In a heavy write scenario, we can end up with a large number of pinned
bytes.  This can translate into (very) premature ENOSPC because pinned
bytes must be accounted for when allowing a reservation but aren't
accounted for when deciding whether to create a new chunk.

This patch adds the accounting to should_alloc_chunk so that we can
create the chunk.

Signed-off-by: Jeff Mahoney 
---
 fs/btrfs/extent-tree.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index cb0b924..d027807 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -4389,7 +4389,7 @@ static int should_alloc_chunk(struct btrfs_fs_info *fs_info,
 {
struct btrfs_block_rsv *global_rsv = &fs_info->global_block_rsv;
u64 num_bytes = sinfo->total_bytes - sinfo->bytes_readonly;
-   u64 num_allocated = sinfo->bytes_used + sinfo->bytes_reserved;
+   u64 num_allocated = sinfo->bytes_used + sinfo->bytes_reserved + sinfo->bytes_pinned + sinfo->bytes_may_use;
u64 thresh;

if (force == CHUNK_ALLOC_FORCE)




Ignore this patch.  It certainly allocates chunks more aggressively, but
it means we end up with a ton of metadata chunks even when we don't have
much metadata.



Josef and I pushed this needle back and forth a bunch of times in the 
early days.  I still think we can allocate a few more chunks than we do 
now...


-chris



Re: [PATCH 1/2] btrfs: account for pinned bytes and bytes_may_use in should_alloc_chunk

2017-06-21 Thread Jeff Mahoney
On 6/14/17 11:44 AM, je...@suse.com wrote:
> From: Jeff Mahoney 
> 
> In a heavy write scenario, we can end up with a large number of pinned
> bytes.  This can translate into (very) premature ENOSPC because pinned
> bytes must be accounted for when allowing a reservation but aren't
> accounted for when deciding whether to create a new chunk.
> 
> This patch adds the accounting to should_alloc_chunk so that we can
> create the chunk.
> 
> Signed-off-by: Jeff Mahoney 
> ---
>  fs/btrfs/extent-tree.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index cb0b924..d027807 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -4389,7 +4389,7 @@ static int should_alloc_chunk(struct btrfs_fs_info *fs_info,
>  {
>   struct btrfs_block_rsv *global_rsv = &fs_info->global_block_rsv;
>   u64 num_bytes = sinfo->total_bytes - sinfo->bytes_readonly;
> - u64 num_allocated = sinfo->bytes_used + sinfo->bytes_reserved;
> + u64 num_allocated = sinfo->bytes_used + sinfo->bytes_reserved + sinfo->bytes_pinned + sinfo->bytes_may_use;
>   u64 thresh;
>  
>   if (force == CHUNK_ALLOC_FORCE)
> 


Ignore this patch.  It certainly allocates chunks more aggressively, but
it means we end up with a ton of metadata chunks even when we don't have
much metadata.

-Jeff

-- 
Jeff Mahoney
SUSE Labs





Re: Exactly what is wrong with RAID5/6

2017-06-21 Thread Goffredo Baroncelli
On 2017-06-21 20:24, Chris Murphy wrote:
> On Wed, Jun 21, 2017 at 2:45 AM, Qu Wenruo  wrote:
> 
>> Unlike a pure striping method, a fully functional RAID5/6 should be written in
>> full-stripe units, each made up of N data stripes and correct P/Q.
>>
>> Given one example to show how the write sequence affects the usability of
>> RAID5/6.
>>
>> Existing full stripe:
>> X = Used space (Extent allocated)
>> O = Unused space
>> Data 1   |XXXXXX|OOOOOO|
>> Data 2   |OOOOOO|OOOOOO|
>> Parity   |WWWWWW|WWWWWW|
>>
>> When some new extent is allocated in the data 1 stripe, if we write
>> data directly into that region and then crash,
>> the result will be:
>>
>> Data 1   |XXXXXX|XXXXOO|
>> Data 2   |OOOOOO|OOOOOO|
>> Parity   |WWWWWW|WWWWWW|
>>
>> The parity stripe is not updated. That is fine in itself, since the data is
>> still correct, but it reduces the usability: in this case, if we lose the
>> device containing the data 2 stripe, we can't recover the correct data of
>> data 2.
>>
>> Although personally I don't think it's a big problem yet.
>>
>> Someone has an idea to modify the extent allocator to handle it, but anyway I
>> don't consider it worth it.
> 
> 
> If there is parity corruption and there is a lost device (or bad
> sector causing lost data strip), that is in effect two failures and no
> raid5 recovers, you have to have raid6. 

Generally speaking, when you write "two failures" this means two failures at the 
same time. But the write hole happens even if these two failures are not at the 
same time:

Event #1: a power failure between the data stripe write and the parity stripe 
write. The stripe is now incoherent.
Event #2: a disk fails: if you try to rebuild its data from the remaining data 
and the parity, you get wrong data.

The likelihood of these two events happening together (a power failure and then, 
at the next boot, a failing disk) is quite low. But over the life of a filesystem, 
these two events are likely to happen.

However BTRFS has an advantage: a simple scrub may (fingers crossed) recover 
from event #1.
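
To make the failure sequence concrete, here is a tiny userspace illustration
(plain XOR parity over two data strips, nothing here is btrfs code) of how a
stale parity turns into a silently wrong rebuild once a disk is lost; the data
csum would then flag the rebuilt block as bad, but could not repair it:

#include <stdio.h>
#include <stdint.h>

int main(void)
{
	uint8_t d1 = 0xAA, d2 = 0x55;   /* data strips on disk1 and disk2 */
	uint8_t p  = d1 ^ d2;           /* parity, consistent: 0xFF       */

	d1 = 0x11;      /* event #1: d1 is rewritten, power fails before  */
	                /* the parity update, so p is now stale           */

	/* event #2: disk2 dies; rebuild d2 from d1 and the stale parity */
	uint8_t d2_rebuilt = d1 ^ p;

	printf("real d2=0x%02x, rebuilt d2=0x%02x -> %s\n",
	       d2, d2_rebuilt, d2 == d2_rebuilt ? "ok" : "WRONG");
	return 0;
}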

> However, I don't know whether
> Btrfs raid6 can even recover from it? If there is a single device
> failure, with a missing data strip, you have both P&Q. Typically raid6
> implementations use P first, and only use Q if P is not available. Is
> Btrfs raid6 the same? And if reconstruction from P fails to match data
> csum, does Btrfs retry using Q? Probably not is my guess.

It could, and in any case it is only an "implementation detail" :-)
> 
> I think that is a valid problem calling for a solution on Btrfs, given
> its mandate. It is no worse than other raid6 implementations though
> which would reconstruct from bad P, and give no warning, leaving it up
> to application layers to deal with the problem.
> 
> I have no idea how ZFS RAIDZ2 and RAIDZ3 handle this same scenario.

If I understood correctly, ZFS has a variable stripe size. In BTRFS this could be 
implemented fairly easily: it would be sufficient to have different block groups 
with different numbers of disks.

If a filesystem is composed of 5 disks, it would contain:

1 BG RAID1 for writes up to 64k
1 BG RAID5 (3 disks) for writes up to 128k
1 BG RAID5 (4 disks) for writes up to 192k
1 BG RAID5 (5 disks) for all larger writes

From time to time the filesystem would need a re-balance in order to empty the 
smaller block groups.
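
A rough sketch of the allocation policy described above for the 5-disk case
(illustration only: the helper name is invented and the thresholds are just
the numbers from the example list):

#include <stdio.h>
#include <stddef.h>

/* Pick which block group a write should go to, following the 5-disk
 * example above: RAID1 for tiny writes, then progressively wider
 * RAID5 block groups so every write fills whole stripes. */
static const char *pick_block_group(size_t write_bytes)
{
	const size_t strip = 64 * 1024;        /* per-device strip size */

	if (write_bytes <= 1 * strip)
		return "RAID1 block group";
	if (write_bytes <= 2 * strip)
		return "RAID5 block group, 3 disks (2 data + parity)";
	if (write_bytes <= 3 * strip)
		return "RAID5 block group, 4 disks (3 data + parity)";
	return "RAID5 block group, 5 disks (4 data + parity)";
}

int main(void)
{
	size_t sizes[] = { 50 * 1024, 128 * 1024, 180 * 1024, 1024 * 1024 };

	for (size_t i = 0; i < sizeof(sizes) / sizeof(sizes[0]); i++)
		printf("%zu KiB -> %s\n", sizes[i] / 1024, pick_block_group(sizes[i]));
	return 0;
}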


Another option could be to track the stripes involved in an RMW cycle (i.e. all 
the writes smaller than a stripe size, which in a COW filesystem are supposed to 
be few) in an "intent log", and scrub all of these stripes if a power failure 
happens.




> 
> 
> 
>>
>>>
>>> 2. Parity data is not checksummed
>>> Why is this a problem? Does it have to do with the design of BTRFS
>>> somehow?
>>> Parity is after all just data, BTRFS does checksum data so what is the
>>> reason this is a problem?
>>
>>
>> Because that's one solution to solve above problem.
>>
>> And no, parity is not data.
> 
> Parity strip is differentiated from data strip, and by itself parity
> is meaningless. But parity plus n-1 data strips is an encoded form of
> the missing data strip, and is therefore an encoded copy of the data.
> We kinda have to treat the parity as fractionally important compared
> to data; just like each mirror copy has some fractional value. You
> don't have to have both of them, but you do have to have at least one
> of them.
> 
> 


-- 
gpg @keyserver.linux.it: Goffredo Baroncelli 
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5


Re: [RFC PATCH v3.2 0/6] Qgroup fixes, Non-stack version

2017-06-21 Thread David Sterba
On Wed, May 17, 2017 at 10:56:22AM +0800, Qu Wenruo wrote:
> The remaining qgroup fixes patches, based on the Chris' for-linus-4.12
> branch with commit 9bcaaea7418d09691f1ffab5c49aacafe3eef9d0 as base.
> 
> Can be fetched from github:
> https://github.com/adam900710/linux/tree/qgroup_fixes_non_stack
> 
> This update is commit message update only.
> 
> v2:
>   Add reviewed-by tag for 2nd patch
>   Update the first patch to follow the new trace point standard
> RFC v3:
>   Use non-stack (dyanamic allocation) for extent_changeset structure, in
>   5th patch, to reduce impact for quota disabled cases.
>   Rebase to latest for-linus-4.12 branch.
> RFC v3.1:
>   Update comment to include the newly introduced parameter
>   Use init/release function to replace open coded ulist_init/release().
> RFC v3.2:
>   Update commit message of 1st and 3rd patch.
> 
> Qu Wenruo (6):
>   btrfs: qgroup: Add quick exit for non-fs extents
>   btrfs: qgroup: Cleanup btrfs_qgroup_prepare_account_extents function
>   btrfs: qgroup: Return actually freed bytes for qgroup release or free
> data
>   btrfs: qgroup: Fix qgroup reserved space underflow caused by buffered
> write and quota enable
>   btrfs: qgroup: Introduce extent changeset for qgroup reserve functions
>   btrfs: qgroup: Fix qgroup reserved space underflow by only freeing
> reserved ranges

Added to 4.13 queue. I haven't seen any new reviews, so I tried to do my
best with my limited understanding. The preparatory look good to me, the
remaining got some review and testing so I rely on that. I made only
some wording adjustments to changelogs or comments. The whole patchset
has been in for-next for the entire dev cycle, I haven't seen any
related failures. Also I think we need to move forward with this
patchset, there was enough time to complain or help.


Re: Exactly what is wrong with RAID5/6

2017-06-21 Thread Chris Murphy
On Wed, Jun 21, 2017 at 12:51 AM, Marat Khalili  wrote:
> On 21/06/17 06:48, Chris Murphy wrote:
>>
>> Another possibility is to ensure a new write is written to a new*not*
>> full stripe, i.e. dynamic stripe size. So if the modification is a 50K
>> file on a 4 disk raid5; instead of writing 3 64K data strips + 1 64K
>> parity strip (a full stripe write); write out 1 64K data strip + 1 64K
>> parity strip. In effect, a 4 disk raid5 would quickly get not just 3
>> data + 1 parity strip Btrfs block groups; but 1 data + 1 parity, and 2
>> data + 1 parity chunks, and direct those write to the proper chunk
>> based on size. Anyway that's beyond my ability to assess how much
>> allocator work that is. Balance I'd expect to rewrite everything to
>> max data strips possible; the optimization would only apply to normal
>> operation COW..

> This will make some filesystems mostly RAID1, negating all space savings of
> RAID5, won't it?

No. It'd only apply to partial stripe writes, typically small files.
But small file, metadata centric workloads suck for raid5 anyway, and
should use raid1. So making the implementation more like raid1 than
raid5 for the RMW case I think is still better than Btrfs raid56 RMW
writes in effect being no-COW.


> Isn't it easier to recalculate parity block based using previous state of
> two rewritten strips, parity and data? I don't understand all performance
> implications, but it might scale better with number of devices.

The problem is atomicity. Either the data strip or parity strip is
overwritten first, and before the other is committed, the file system
is not merely inconsistent, it's basically lying, there's no way to
know for sure after the fact whether the data or parity were properly
written. And even the metadata is inconsistent too because it can only
describe the unmodified state and the successfully modified state,
whereas a 3rd state "partially modified" is possible and no way to
really fix it.
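
For reference, the parity update Marat is describing is the standard RMW delta,
sketched below (the helper name is made up, this is not btrfs code); the
write-hole point above is precisely that the two device writes it ends with are
not atomic:

#include <stdint.h>
#include <stddef.h>
#include <stdio.h>

/* Classic RAID5 read-modify-write for a sub-stripe update (sketch only):
 * new_parity = old_parity XOR old_data XOR new_data, so only the two
 * affected strips are touched, independent of array width. */
static void rmw_update(uint8_t *data_strip, uint8_t *parity_strip,
		       const uint8_t *new_data, size_t len)
{
	for (size_t i = 0; i < len; i++) {
		parity_strip[i] ^= data_strip[i] ^ new_data[i];
		data_strip[i] = new_data[i];
	}
	/* On a real array both strips must now be written back to two
	 * different devices; a crash between those two writes leaves the
	 * stripe incoherent, and nothing on disk says which half won. */
}

int main(void)
{
	uint8_t d[4] = { 1, 2, 3, 4 }, other[4] = { 9, 9, 9, 9 };
	uint8_t p[4], nd[4] = { 7, 7, 7, 7 };

	for (size_t i = 0; i < 4; i++)
		p[i] = d[i] ^ other[i];          /* consistent parity */
	rmw_update(d, p, nd, 4);
	for (size_t i = 0; i < 4; i++)
		printf("byte %zu parity still consistent: %d\n",
		       i, p[i] == (d[i] ^ other[i]));
	return 0;
}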


-- 
Chris Murphy


Re: Exactly what is wrong with RAID5/6

2017-06-21 Thread Chris Murphy
On Wed, Jun 21, 2017 at 2:45 AM, Qu Wenruo  wrote:

> Unlike pure stripe method, one fully functional RAID5/6 should be written in
> full stripe behavior, which is made up by N data stripes and correct P/Q.
>
> Given one example to show how write sequence affects the usability of
> RAID5/6.
>
> Existing full stripe:
> X = Used space (Extent allocated)
> O = Unused space
> Data 1   |XX||
> Data 2   |OOO|
> Parity   |WW||
>
> When some new extent is allocated to data 1 stripe, if we write
> data directly into that region, and crashed.
> The result will be:
>
> Data 1   |XX|XX|O|
> Data 2   |OOO|
> Parity   |WW||
>
> Parity stripe is not updated, although it's fine since data is still
> correct, this reduces the usability, as in this case, if we lost device
> containing data 2 stripe, we can't recover correct data of data 2.
>
> Although personally I don't think it's a big problem yet.
>
> Someone has idea to modify extent allocator to handle it, but anyway I don't
> consider it's worthy.


If there is parity corruption and there is a lost device (or bad
sector causing lost data strip), that is in effect two failures and no
raid5 recovers, you have to have raid6. However, I don't know whether
Btrfs raid6 can even recover from it? If there is a single device
failure, with a missing data strip, you have both P&Q. Typically raid6
implementations use P first, and only use Q if P is not available. Is
Btrfs raid6 the same? And if reconstruction from P fails to match data
csum, does Btrfs retry using Q? Probably not is my guess.

I think that is a valid problem calling for a solution on Btrfs, given
its mandate. It is no worse than other raid6 implementations though
which would reconstruct from bad P, and give no warning, leaving it up
to application layers to deal with the problem.

I have no idea how ZFS RAIDZ2 and RAIDZ3 handle this same scenario.



>
>>
>> 2. Parity data is not checksummed
>> Why is this a problem? Does it have to do with the design of BTRFS
>> somehow?
>> Parity is after all just data, BTRFS does checksum data so what is the
>> reason this is a problem?
>
>
> Because that's one solution to solve above problem.
>
> And no, parity is not data.

Parity strip is differentiated from data strip, and by itself parity
is meaningless. But parity plus n-1 data strips is an encoded form of
the missing data strip, and is therefore an encoded copy of the data.
We kinda have to treat the parity as fractionally important compared
to data; just like each mirror copy has some fractional value. You
don't have to have both of them, but you do have to have at least one
of them.


-- 
Chris Murphy


Re: [PATCH 7/7] Btrfs: warn if total_bytes_pinned is non-zero on unmount

2017-06-21 Thread David Sterba
On Tue, Jun 06, 2017 at 04:45:32PM -0700, Omar Sandoval wrote:
> From: Omar Sandoval 
> 
> Catch any future/remaining leaks or underflows of total_bytes_pinned.
> 
> Signed-off-by: Omar Sandoval 

This patch received some objections. As it's a debugging aid, I'd rather
merge a patch that other agree is useful for that purpose.


Re: [PATCH 2/7] Btrfs: make BUG_ON() in add_pinned_bytes() an ASSERT()

2017-06-21 Thread David Sterba
On Tue, Jun 06, 2017 at 04:45:27PM -0700, Omar Sandoval wrote:
> From: Omar Sandoval 
> 
> Signed-off-by: Omar Sandoval 

Reviewed-by: David Sterba 

Added some changelog.


Re: Exactly what is wrong with RAID5/6

2017-06-21 Thread Austin S. Hemmelgarn

On 2017-06-21 13:20, Andrei Borzenkov wrote:

21.06.2017 16:41, Austin S. Hemmelgarn wrote:

On 2017-06-21 08:43, Christoph Anton Mitterer wrote:

On Wed, 2017-06-21 at 16:45 +0800, Qu Wenruo wrote:

Btrfs is always using device ID to build up its device mapping.
And for any multi-device implementation (LVM,mdadam) it's never a
good
idea to use device path.


Isn't it rather the other way round? Using the ID is bad? Don't you
remember our discussion about using leaked UUIDs (or accidental
collisions) for all kinds of attacks?

Both are bad for different reasons.  For the particular case of sanely
handling transient storage failures (device disappears then reappears),
you can't do it with a path in /dev (which is what most people mean when
they say device path), and depending on how the hardware failed and the
specifics of the firmware, you may not be able to do it with a
hardware-level device path, but you can do it with a device ID assuming
you sanely verify the ID.  Right now, BTRFS is not sanely checking the
ID (it only verifies the UUID's in the FS itself, it should also be
checking hardware-level identifiers like WWN).


Which is not enough either; if a device dropped off the array and reappeared
later we need to be able to declare it stale, even if it has exactly the
same UUID and WWN and whatever hardware identifier is used. So we need
some generation number to be able to do it. Incidentally, MD does have
generation numbers and compares them to decide whether a device can be
assimilated.

I was not disputing that aspect, just the method of verifying that the device 
that reappeared is the same one that disappeared.  Outside of the 
requirement to properly re-sync (we would also need some kind of 
sanity check on the generation number too, otherwise we end up with the 
possibility of a partial write there nuking the whole FS when the device 
reconnects), verifying some level of hardware identification covers the 
security and data safety issues that Christoph is referring to 
sufficiently for the common cases (the biggest being USB-attached 
devices with BTRFS volumes on them).
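
As a conceptual sketch of the combined check being discussed (every name and
field below is invented for illustration, nothing mirrors real btrfs structures
or APIs): first confirm it is the same physical device via a hardware
identifier, then use a per-device generation to decide whether it can rejoin
as-is or must be re-synced first:

#include <stdint.h>
#include <string.h>
#include <stdio.h>

/* Invented for illustration only, not btrfs structures or APIs. */
struct dev_identity {
	char     wwn[32];       /* hardware identifier, e.g. a WWN     */
	uint8_t  fs_uuid[16];   /* filesystem UUID stamped on the dev  */
	uint64_t generation;    /* last transaction the dev completed  */
};

enum reappear_action { REJECT, RESYNC_THEN_JOIN, JOIN };

static enum reappear_action on_device_reappear(const struct dev_identity *known,
					       const struct dev_identity *found,
					       uint64_t fs_generation)
{
	/* Same filesystem and same physical device? */
	if (memcmp(known->fs_uuid, found->fs_uuid, 16) != 0 ||
	    strncmp(known->wwn, found->wwn, sizeof(known->wwn)) != 0)
		return REJECT;
	/* It missed transactions while it was gone: stale, re-sync first. */
	if (found->generation < fs_generation)
		return RESYNC_THEN_JOIN;
	return JOIN;
}

int main(void)
{
	struct dev_identity known = { "wwn-0x5000c500a1b2c3d4", { 1 }, 100 };
	struct dev_identity found = known;

	found.generation = 90;   /* reappeared after missing 10 commits */
	printf("action: %d (1 == resync then join)\n",
	       (int)on_device_reappear(&known, &found, 100));
	return 0;
}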



Re: Exactly what is wrong with RAID5/6

2017-06-21 Thread Andrei Borzenkov
21.06.2017 16:41, Austin S. Hemmelgarn wrote:
> On 2017-06-21 08:43, Christoph Anton Mitterer wrote:
>> On Wed, 2017-06-21 at 16:45 +0800, Qu Wenruo wrote:
>>> Btrfs is always using device ID to build up its device mapping.
>>> And for any multi-device implementation (LVM,mdadam) it's never a
>>> good
>>> idea to use device path.
>>
>> Isn't it rather the other way round? Using the ID is bad? Don't you
>> remember our discussion about using leaked UUIDs (or accidental
>> collisions) for all kinds of attacks?
> Both are bad for different reasons.  For the particular case of sanely
> handling transient storage failures (device disappears then reappears),
> you can't do it with a path in /dev (which is what most people mean when
> they say device path), and depending on how the hardware failed and the
> specifics of the firmware, you may not be able to do it with a
> hardware-level device path, but you can do it with a device ID assuming
> you sanely verify the ID.  Right now, BTRFS is not sanely checking the
> ID (it only verifies the UUID's in the FS itself, it should also be
> checking hardware-level identifiers like WWN).

Which is not enough either; if a device dropped off the array and reappeared
later we need to be able to declare it stale, even if it has exactly the
same UUID and WWN and whatever hardware identifier is used. So we need
some generation number to be able to do it. Incidentally, MD does have
generation numbers and compares them to decide whether a device can be
assimilated.


Re: Exactly what is wrong with RAID5/6

2017-06-21 Thread Andrei Borzenkov
21.06.2017 09:51, Marat Khalili wrote:
> On 21/06/17 06:48, Chris Murphy wrote:
>> Another possibility is to ensure a new write is written to a new*not*
>> full stripe, i.e. dynamic stripe size. So if the modification is a 50K
>> file on a 4 disk raid5; instead of writing 3 64K data strips + 1 64K
>> parity strip (a full stripe write); write out 1 64K data strip + 1 64K
>> parity strip. In effect, a 4 disk raid5 would quickly get not just 3
>> data + 1 parity strip Btrfs block groups; but 1 data + 1 parity, and 2
>> data + 1 parity chunks, and direct those write to the proper chunk
>> based on size. Anyway that's beyond my ability to assess how much
>> allocator work that is. Balance I'd expect to rewrite everything to
>> max data strips possible; the optimization would only apply to normal
>> operation COW.
> This will make some filesystems mostly RAID1, negating all space savings
> of RAID5, won't it?
> 
> Isn't it easier to recalculate parity block based using previous state
> of two rewritten strips, parity and data? I don't understand all
> performance implications, but it might scale better with number of devices.
> 

That's what it effectively does today; the problem is, the RAID[56] layer is
below the btrfs allocator, so the same stripe may be shared by different
transactions. This defeats the very idea of redirect-on-write, where data
on disk is assumed to never be changed by subsequent modifications.


Re: Exactly what is wrong with RAID5/6

2017-06-21 Thread Goffredo Baroncelli
Hi Qu,

On 2017-06-21 10:45, Qu Wenruo wrote:
> At 06/21/2017 06:57 AM, waxhead wrote:
>> I am trying to piece together the actual status of the RAID5/6 bit of BTRFS.
>> The wiki refer to kernel 3.19 which was released in February 2015 so I assume
>> that the information there is a tad outdated (the last update on the wiki 
>> page was July 2016)
>> https://btrfs.wiki.kernel.org/index.php/RAID56
>>
>> Now there are four problems listed
>>
>> 1. Parity may be inconsistent after a crash (the "write hole")
>> Is this still true, if yes - would not this apply for RAID1 / 
>> RAID10 as well? How was it solved there , and why can't that be done for 
>> RAID5/6
> 
> Unlike pure stripe method, one fully functional RAID5/6 should be written in 
> full stripe behavior,
>  which is made up by N data stripes and correct P/Q.
> 
> Given one example to show how write sequence affects the usability of RAID5/6.
> 
> Existing full stripe:
> X = Used space (Extent allocated)
> O = Unused space
> Data 1   |XX||
> Data 2   |OOO|
> Parity   |WW||
> 
> When some new extent is allocated to data 1 stripe, if we write
> data directly into that region, and crashed.
> The result will be:
> 
> Data 1   |XX|XX|O|
> Data 2   |OOO|
> Parity   |WW||
> 
> Parity stripe is not updated, although it's fine since data is still correct, 
> this reduces the 
> usability, as in this case, if we lost device containing data 2 stripe, we 
> can't 
> recover correct data of data 2.
> 
> Although personally I don't think it's a big problem yet.
> 
> Someone has idea to modify extent allocator to handle it, but anyway I don't 
> consider it's worthy.
> 
>>
>> 2. Parity data is not checksummed
>> Why is this a problem? Does it have to do with the design of BTRFS somehow?
>> Parity is after all just data, BTRFS does checksum data so what is the 
>> reason this is a problem?
> 
> Because that's one solution to solve above problem.

In what way could it be a solution for the write hole? If the parity is wrong AND 
you lost a disk, even having a checksum of the parity you are not in a position 
to rebuild the missing data. And if you rebuild wrong data, the checksum will 
highlight it anyway. So adding a checksum to the parity should not solve any issue.

A possible "mitigation" is to track in an "intent log" all the non-full-stripe 
writes during a transaction. If a power failure aborts a transaction, at the 
next mount a scrub process is started to correct the parities only in the 
stripes tracked before.

A solution is to journal all the non-full-stripe writes, as MD does.
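
To make that concrete, a minimal sketch of what such an intent log has to
record and when (the structure is invented for illustration and has no
relation to MD's journal format or to any real btrfs proposal):

#include <stdint.h>
#include <stdio.h>

#define INTENT_SLOTS 256

/* Illustrative record: which full stripes have a sub-stripe (RMW)
 * write in flight in the current transaction. */
struct intent_log {
	uint64_t stripe_nr[INTENT_SLOTS];
	uint32_t count;
};

/* Must be durable on disk before the partial-stripe write starts. */
static void intent_log_add(struct intent_log *log, uint64_t stripe_nr)
{
	if (log->count < INTENT_SLOTS)
		log->stripe_nr[log->count++] = stripe_nr;
	/* else: fall back to a full scrub after a crash */
}

/* Cleared once the transaction (data + parity) is fully committed. */
static void intent_log_clear(struct intent_log *log)
{
	log->count = 0;
}

/* After an unclean mount: recompute parity only for recorded stripes. */
static void replay_intent_log(const struct intent_log *log)
{
	for (uint32_t i = 0; i < log->count; i++)
		printf("scrub stripe %llu\n",
		       (unsigned long long)log->stripe_nr[i]);
}

int main(void)
{
	struct intent_log log = { .count = 0 };

	intent_log_add(&log, 42);   /* partial-stripe write hits stripe 42 */
	replay_intent_log(&log);    /* what a post-crash mount would scrub */
	intent_log_clear(&log);
	return 0;
}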


BR
G.Baroncelli

-- 
gpg @keyserver.linux.it: Goffredo Baroncelli 
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5


Re: btrfs_wait_ordered_roots warning triggered

2017-06-21 Thread Dave Jones
On Wed, Jun 21, 2017 at 11:52:36AM -0400, Chris Mason wrote:
 > On 06/21/2017 11:16 AM, Dave Jones wrote:
 > > WARNING: CPU: 2 PID: 7153 at fs/btrfs/ordered-data.c:753 
 > > btrfs_wait_ordered_roots+0x1a3/0x220
 > > CPU: 2 PID: 7153 Comm: kworker/u8:7 Not tainted 4.12.0-rc6-think+ #4
 > > Workqueue: events_unbound btrfs_async_reclaim_metadata_space
 > > task: 8804f08d5380 task.stack: c9000895c000
 > > RIP: 0010:btrfs_wait_ordered_roots+0x1a3/0x220
 > > RSP: 0018:c9000895fc70 EFLAGS: 00010286
 > > RAX:  RBX: f7efdfb1 RCX: 25c3
 > > RDX: 880507c0cde0 RSI:  RDI: 0002
 > > RBP: c9000895fce8 R08:  R09: 0001
 > > R10: c9000895fbd8 R11: 8804f08d5380 R12: 8804f4930008
 > > R13: 0001 R14: 8804f36f R15: 8804f4930850
 > > FS:  () GS:880507c0() 
 > > knlGS:
 > > CS:  0010 DS:  ES:  CR0: 80050033
 > > CR2: 7fd40c4a1bf2 CR3: 00013995e000 CR4: 001406e0
 > > DR0: 7f17be89 DR1:  DR2: 
 > > DR3:  DR6: 0ff0 DR7: 0600
 > > Call Trace:
 > >  flush_space+0x285/0x6e0
 > >  btrfs_async_reclaim_metadata_space+0xd7/0x420
 > >  process_one_work+0x24d/0x680
 > >  worker_thread+0x4e/0x3b0
 > >  kthread+0x117/0x150
 > >  ? process_one_work+0x680/0x680
 > >  ? kthread_create_on_node+0x70/0x70
 > >  ret_from_fork+0x27/0x40
 > 
 > Thanks Dave, which trinity command line triggered this?

nothing too fancy

in a loop..

  trinity -q -C64 -N 1 -E SMC -a64

(note you'll need https://lkml.org/lkml/2017/6/20/10 too to survive
 more than a few seconds)

Dave


btrfs_wait_ordered_roots warning triggered

2017-06-21 Thread Dave Jones
WARNING: CPU: 2 PID: 7153 at fs/btrfs/ordered-data.c:753 
btrfs_wait_ordered_roots+0x1a3/0x220
CPU: 2 PID: 7153 Comm: kworker/u8:7 Not tainted 4.12.0-rc6-think+ #4 
Workqueue: events_unbound btrfs_async_reclaim_metadata_space
task: 8804f08d5380 task.stack: c9000895c000
RIP: 0010:btrfs_wait_ordered_roots+0x1a3/0x220
RSP: 0018:c9000895fc70 EFLAGS: 00010286
RAX:  RBX: f7efdfb1 RCX: 25c3
RDX: 880507c0cde0 RSI:  RDI: 0002
RBP: c9000895fce8 R08:  R09: 0001
R10: c9000895fbd8 R11: 8804f08d5380 R12: 8804f4930008
R13: 0001 R14: 8804f36f R15: 8804f4930850
FS:  () GS:880507c0() knlGS:
CS:  0010 DS:  ES:  CR0: 80050033
CR2: 7fd40c4a1bf2 CR3: 00013995e000 CR4: 001406e0
DR0: 7f17be89 DR1:  DR2: 
DR3:  DR6: 0ff0 DR7: 0600
Call Trace:
 flush_space+0x285/0x6e0
 btrfs_async_reclaim_metadata_space+0xd7/0x420
 process_one_work+0x24d/0x680
 worker_thread+0x4e/0x3b0
 kthread+0x117/0x150
 ? process_one_work+0x680/0x680
 ? kthread_create_on_node+0x70/0x70
 ret_from_fork+0x27/0x40



This is 4.12rc6, with the following diff because I hit that ASSERT far too 
easily..

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index ef3c98c527c1..8a5c906ad67c 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -4722,7 +4722,7 @@ int btrfs_truncate_inode_items(struct btrfs_trans_handle *trans,
}
 error:
if (root->root_key.objectid != BTRFS_TREE_LOG_OBJECTID) {
-   ASSERT(last_size >= new_size);
+// ASSERT(last_size >= new_size);
if (!err && last_size > new_size)
last_size = new_size;
btrfs_ordered_update_i_size(inode, last_size, NULL);



[PATCH] btrfs: fix validation of XATTR_ITEM dir items

2017-06-21 Thread David Sterba
The XATTR_ITEM is a type of a directory item so we use the common
validator helper. We have to adjust the limits because of potential
data_len (ie. the xattr value), which is otherwise 0 for other directory
items.

Signed-off-by: David Sterba 
---
 fs/btrfs/dir-item.c | 12 +++-
 1 file changed, 11 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/dir-item.c b/fs/btrfs/dir-item.c
index 2b00dd746118..a48d496e63e4 100644
--- a/fs/btrfs/dir-item.c
+++ b/fs/btrfs/dir-item.c
@@ -498,6 +498,7 @@ bool btrfs_is_name_len_valid(struct extent_buffer *leaf, int slot,
 {
struct btrfs_fs_info *fs_info = leaf->fs_info;
struct btrfs_key key;
+   struct btrfs_dir_item *di;
u32 read_start;
u32 read_end;
u32 item_start;
@@ -515,8 +516,17 @@ bool btrfs_is_name_len_valid(struct extent_buffer *leaf, int slot,
btrfs_item_key_to_cpu(leaf, &key, slot);
 
switch (key.type) {
-   case BTRFS_DIR_ITEM_KEY:
case BTRFS_XATTR_ITEM_KEY:
+   /*
+* XATTR_ITEM could contain data so the item_end would not
+* match read_end as for other item types. Artificially lower
+* the upper bound so we check just name_len. We have to trust
+* the data_len value here, but if it's too big the validation
+* fails so it's safer than increasing read_end.
+*/
+   di = btrfs_item_ptr(leaf, slot, struct btrfs_dir_item);
+   item_end -= btrfs_dir_data_len(leaf, di);
+   case BTRFS_DIR_ITEM_KEY:
case BTRFS_DIR_INDEX_KEY:
size = sizeof(struct btrfs_dir_item);
break;
-- 
2.13.0
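
To picture what the validation is guarding (a simplified sketch with invented
types, not the real btrfs structures or helper): a dir/xattr item is a small
header followed by the name and, for xattrs, the value, and the check is
essentially that those lengths stay inside the item:

#include <stdint.h>
#include <stdbool.h>
#include <stdio.h>

/* Simplified picture of the on-leaf layout being validated; field
 * sizes are illustrative, not the real struct btrfs_dir_item. */
struct toy_dir_item {
	uint16_t name_len;
	uint16_t data_len;   /* non-zero only for xattr items */
	/* name bytes follow, then data_len value bytes */
};

static bool toy_name_len_valid(uint32_t item_size, uint16_t name_len,
			       uint16_t data_len)
{
	uint32_t end = sizeof(struct toy_dir_item);

	/* Name and (for xattrs) value must both fit inside the item. */
	end += name_len;
	end += data_len;
	return end <= item_size;
}

int main(void)
{
	printf("ok:  %d\n", toy_name_len_valid(64, 10, 20));
	printf("bad: %d\n", toy_name_len_valid(64, 200, 0));
	return 0;
}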



Re: btrfs_wait_ordered_roots warning triggered

2017-06-21 Thread Chris Mason

On 06/21/2017 11:16 AM, Dave Jones wrote:

WARNING: CPU: 2 PID: 7153 at fs/btrfs/ordered-data.c:753 
btrfs_wait_ordered_roots+0x1a3/0x220
CPU: 2 PID: 7153 Comm: kworker/u8:7 Not tainted 4.12.0-rc6-think+ #4
Workqueue: events_unbound btrfs_async_reclaim_metadata_space
task: 8804f08d5380 task.stack: c9000895c000
RIP: 0010:btrfs_wait_ordered_roots+0x1a3/0x220
RSP: 0018:c9000895fc70 EFLAGS: 00010286
RAX:  RBX: f7efdfb1 RCX: 25c3
RDX: 880507c0cde0 RSI:  RDI: 0002
RBP: c9000895fce8 R08:  R09: 0001
R10: c9000895fbd8 R11: 8804f08d5380 R12: 8804f4930008
R13: 0001 R14: 8804f36f R15: 8804f4930850
FS:  () GS:880507c0() knlGS:
CS:  0010 DS:  ES:  CR0: 80050033
CR2: 7fd40c4a1bf2 CR3: 00013995e000 CR4: 001406e0
DR0: 7f17be89 DR1:  DR2: 
DR3:  DR6: 0ff0 DR7: 0600
Call Trace:
 flush_space+0x285/0x6e0
 btrfs_async_reclaim_metadata_space+0xd7/0x420
 process_one_work+0x24d/0x680
 worker_thread+0x4e/0x3b0
 kthread+0x117/0x150
 ? process_one_work+0x680/0x680
 ? kthread_create_on_node+0x70/0x70
 ret_from_fork+0x27/0x40


Thanks Dave, which trinity command line triggered this?

-chris


Re: RFC: Compression - calculate entropy for data set

2017-06-21 Thread Timofey Titovets
I'm done with the heuristic method, so I'm posting some performance test output:
(I store the test data in /run/user/$UID/, and the script just runs each program 2 times)
###
# Performance test will measure initialization time
# And remove it from run time of tests
# This may be inaccurate in some cases
# But this allow see speed difference
./performance_test.sh

Test good compressible data: 128k
AVG Initialization time: 17018ms
avg mean:   3240ms
shannon float:  11537ms
shannon integ:  9933ms
heuristic:  2060ms
gzip -ck -3 /run/user/1000/test_data.bin:   19767ms
lzop -ck -3 /run/user/1000/test_data.bin:   12393ms
lzop -ck -1 /run/user/1000/test_data.bin:   13660ms
- - - - -
Test half compressible data: 128k
AVG Initialization time: 15820ms
avg mean:   3302ms
shannon float:  8680ms
shannon integ:  9659ms
heuristic:  3207ms // 128*2/1024/3 ~ 800 Mb/s
gzip -ck -3 /run/user/1000/test_data.bin:   57245ms
lzop -ck -3 /run/user/1000/test_data.bin:   14175ms
lzop -ck -1 /run/user/1000/test_data.bin:   15392ms
- - - - -
Test bad compressible data: 128k
AVG Initialization time: 16878ms
avg mean:   4410ms
shannon float:  7113ms
shannon integ:  6777ms
heuristic:  2162ms  // 128*2/1024/2 ~ 1250 Mb/s
gzip -ck -3 /run/user/1000/test_data.bin:   110578ms
lzop -ck -3 /run/user/1000/test_data.bin:   14238ms
lzop -ck -1 /run/user/1000/test_data.bin:   15332ms
- - - - -
Test good compressible data: 8k
AVG Initialization time: 17526ms
avg mean:   1683ms
shannon float:  5762ms
shannon integ:  2427ms
heuristic:  1858ms
gzip -ck -3 /run/user/1000/test_data.bin:   2745ms
lzop -ck -3 /run/user/1000/test_data.bin:   10221ms
lzop -ck -1 /run/user/1000/test_data.bin:   11960ms
- - - - -
Test half compressible data: 8k
AVG Initialization time: 17513ms
avg mean:   1853ms
shannon float:  1933ms
shannon integ:  2572ms
heuristic:  2571ms
gzip -ck -3 /run/user/1000/test_data.bin:   4167ms
lzop -ck -3 /run/user/1000/test_data.bin:   9502ms
lzop -ck -1 /run/user/1000/test_data.bin:   10923ms
- - - - -
Test bad compressible data: 8k
AVG Initialization time: 18164ms
avg mean:   -28ms
shannon float:  312ms
shannon integ:  1891ms
heuristic:  542ms
gzip -ck -3 /run/user/1000/test_data.bin:   6407ms
lzop -ck -3 /run/user/1000/test_data.bin:   9088ms
lzop -ck -1 /run/user/1000/test_data.bin:   10741ms

So as you can see, most of the time the heuristic test is about 5 times faster
than direct compression.
I did some basic testing on that (trying random pieces of data: blocks
from VM images, photos, texts, /dev/zero, /dev/urandom),
and the heuristic method is pretty accurate; most of the magic happens in [2].
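
For anyone who wants the idea without reading [2] (this is not the code from
the repo, just a minimal illustration of the approach): build a byte histogram
over a sample of the block and estimate its Shannon entropy; a low estimate
means the block is worth handing to the real compressor. A kernel-side version
would use integer or fixed-point log2 instead of libm:

#include <stdint.h>
#include <stddef.h>
#include <string.h>
#include <math.h>
#include <stdio.h>

/* Estimate bits of entropy per byte over a buffer (0.0 .. 8.0). */
static double shannon_entropy(const uint8_t *buf, size_t len)
{
	uint32_t hist[256] = { 0 };
	double entropy = 0.0;

	for (size_t i = 0; i < len; i++)
		hist[buf[i]]++;
	for (int b = 0; b < 256; b++) {
		if (!hist[b])
			continue;
		double p = (double)hist[b] / (double)len;
		entropy -= p * log2(p);
	}
	return entropy;
}

int main(void)
{
	uint8_t zeros[4096], text[4096];

	memset(zeros, 0, sizeof(zeros));
	for (size_t i = 0; i < sizeof(text); i++)
		text[i] = "the quick brown fox "[i % 20];

	/* Rule of thumb (the threshold is arbitrary here): skip compression
	 * only when the sample looks close to random. */
	printf("zeros: %.2f bits/byte\n", shannon_entropy(zeros, sizeof(zeros)));
	printf("text:  %.2f bits/byte\n", shannon_entropy(text, sizeof(text)));
	return 0;
}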


I think that in the near future I will try to make a kernel patch set for that.
If anyone is interested, you can look at the code and do some tests on your own data.
I appreciate any feedback!

Thanks!

P.S.
[1] https://github.com/Nefelim4ag/Entropy_Calculation
[2] https://github.com/Nefelim4ag/Entropy_Calculation/blob/master/heuristic.c

-- 
Have a nice day,
Timofey.


How to fix errors that check --mode lomem finds, but --mode normal doesn't?

2017-06-21 Thread Marc MERLIN
On Tue, Jun 20, 2017 at 08:43:52PM -0700, Marc MERLIN wrote:
> On Tue, Jun 20, 2017 at 09:31:42PM -0600, Chris Murphy wrote:
> > On Tue, Jun 20, 2017 at 5:12 PM, Marc MERLIN  wrote:
> > 
> > > I'm now going to remount this with nospace_cache to see if your guess 
> > > about
> > > space_cache was correct.
> > > Other suggestions also welcome :)
> > 
> > What results do you get with lowmem mode? It won't repair without
> > additional patches, but might give a dev a clue what's going on. I
> > regularly see normal mode check finds no problems, and lowmem mode
> > finds problems. Lowmem mode is a total rewrite so it's a different
> > implementation and can find things normal mode won't.
> 
> Oh, I kind of forgot that lowmem mode looked for more things than regular
> mode.
> I will run this tonight and see what it says.
 
It's probably still a ways from being finished given how slow lowmem is in
comparison, but sadly it found a bunch of problems which regular mode didn't
find.

I'm pretty bummed. I just spent way too long recreating this filesystem and
the multiple btrfs send/receive relationships from other machines. Took a bit
over a week :(

It looks like the errors are not major (especially if the regular mode
doesn't even see them), but without lowmem --repair, I'm kind of screwed.

I'm wondering if I could/should leave those errors unfixed until lowmem --repair
finally happens, or whether I'm looking at spending another week rebuilding
this filesystem :-/


gargamel:~# btrfs check -p --mode lowmem  /dev/mapper/dshelf2
Checking filesystem on /dev/mapper/dshelf2
UUID: 85441c59-ad11-4b25-b1fe-974f9e4acede
ERROR: extent[3886187384832, 81920] referencer count mismatch (root: 11930, 
owner: 375444, offset: 1851654144) wanted: 1, have: 4
ERROR: extent[3886189391872, 122880] referencer count mismatch (root: 11712, 
owner: 863395, offset: 79659008) wanted: 1, have: 2
ERROR: extent[3933249708032, 69632] referencer count mismatch (root: 12424, 
owner: 6945, offset: 2083389440) wanted: 1, have: 2
ERROR: extent[3933249708032, 69632] referencer count mismatch (root: 12172, 
owner: 6945, offset: 2083389440) wanted: 1, have: 2
ERROR: extent[4571729862656, 876544] referencer count mismatch (root: 11058, 
owner: 375442, offset: 907706368) wanted: 6, have: 21
ERROR: extent[4641490833408, 270336] referencer count mismatch (root: 11712, 
owner: 375444, offset: 1848672256) wanted: 3, have: 5
ERROR: extent[4641490833408, 270336] referencer count mismatch (root: 11276, 
owner: 375444, offset: 1848672256) wanted: 3, have: 5
ERROR: extent[4641490833408, 270336] referencer count mismatch (root: 11058, 
owner: 375444, offset: 1848672256) wanted: 3, have: 5
ERROR: extent[4641490833408, 270336] referencer count mismatch (root: 11494, 
owner: 375444, offset: 1848672256) wanted: 3, have: 5
ERROR: extent[4658555617280, 122880] referencer count mismatch (root: 11712, 
owner: 375444, offset: 1848705024) wanted: 1, have: 3
ERROR: extent[4677858123776, 417792] referencer count mismatch (root: 11712, 
owner: 863395, offset: 79523840) wanted: 1, have: 3
ERROR: extent[4677858123776, 417792] referencer count mismatch (root: 11930, 
owner: 863395, offset: 79523840) wanted: 1, have: 3
ERROR: extent[4698380947456, 409600] referencer count mismatch (root: 11930, 
owner: 375444, offset: 1851596800) wanted: 3, have: 4
ERROR: extent[4720470421504, 667648] referencer count mismatch (root: 11058, 
owner: 3463478, offset: 2334720) wanted: 2, have: 10
ERROR: extent[4783941246976, 65536] referencer count mismatch (root: 9365, 
owner: 24493, offset: 4562944) wanted: 2, have: 3
ERROR: extent[5077564477440, 106496] referencer count mismatch (root: 9370, 
owner: 1602694, offset: 734756864) wanted: 1, have: 2
ERROR: extent[5136306929664, 131489792] referencer count mismatch (root: 11712, 
owner: 375441, offset: 910999552) wanted: 16, have: 1864
ERROR: extent[5136306929664, 131489792] referencer count mismatch (root: 11276, 
owner: 375441, offset: 910999552) wanted: 867, have: 1865
ERROR: extent[5136306929664, 131489792] referencer count mismatch (root: 11058, 
owner: 375441, offset: 910999552) wanted: 126, have: 1872
ERROR: extent[5136306929664, 131489792] referencer count mismatch (root: 11494, 
owner: 375441, offset: 910999552) wanted: 866, have: 1864
ERROR: extent[5136306929664, 131489792] referencer count mismatch (root: 11930, 
owner: 375441, offset: 910999552) wanted: 861, have: 1859
ERROR: extent[5136649891840, 66781184] referencer count mismatch (root: 11058, 
owner: 375442, offset: 192659456) wanted: 5, have: 19
ERROR: extent[5136879157248, 134217728] referencer count mismatch (root: 11930, 
owner: 375442, offset: 394543104) wanted: 10, have: 33
ERROR: extent[5137380671488, 80945152] referencer count mismatch (root: 11058, 
owner: 375442, offset: 875233280) wanted: 1, have: 21
ERROR: extent[5137380671488, 80945152] referencer count mismatch (root: 11930, 
owner: 375442, offset: 875233280) wanted: 11, have: 21
ERROR: extent[5138641395712, 524288] referen

Re: [PATCH] btrfs: use new block error code

2017-06-21 Thread Jens Axboe
On 06/21/2017 07:17 AM, David Sterba wrote:
> On Mon, Jun 19, 2017 at 01:55:37PM +0300, Dan Carpenter wrote:
>> This function is supposed to return blk_status_t error codes now but
>> there was a stray -ENOMEM left behind.
>>
>> Fixes: 4e4cbee93d56 ("block: switch bios to blk_status_t")
>> Signed-off-by: Dan Carpenter 
> 
> Acked-by: David Sterba 
> 
> The patch depends on the blk_status_t patchset, so I expect that it's
> going to be merged to that.

Added, thanks.

-- 
Jens Axboe



Re: [PATCH] btrfs: add cond_resched to btrfs_qgroup_trace_leaf_items

2017-06-21 Thread David Sterba
On Tue, Jun 20, 2017 at 08:15:26AM -0400, je...@suse.com wrote:
> From: Jeff Mahoney 
> 
> On an uncontended system, we can end up hitting soft lockups while
> doing replace_path.  At the core, and frequently called is
> btrfs_qgroup_trace_leaf_items, so it makes sense to add a cond_resched
> there.
> 
> Signed-off-by: Jeff Mahoney 

Reviewed-by: David Sterba 


Re: Exactly what is wrong with RAID5/6

2017-06-21 Thread Austin S. Hemmelgarn

On 2017-06-21 08:43, Christoph Anton Mitterer wrote:

On Wed, 2017-06-21 at 16:45 +0800, Qu Wenruo wrote:

Btrfs is always using device ID to build up its device mapping.
And for any multi-device implementation (LVM,mdadam) it's never a
good
idea to use device path.


Isn't it rather the other way round? Using the ID is bad? Don't you
remember our discussion about using leaked UUIDs (or accidental
collisions) for all kinds of attacks?
Both are bad for different reasons.  For the particular case of sanely 
handling transient storage failures (device disappears then reappears), 
you can't do it with a path in /dev (which is what most people mean when 
they say device path), and depending on how the hardware failed and the 
specifics of the firmware, you may not be able to do it with a 
hardware-level device path, but you can do it with a device ID assuming 
you sanely verify the ID.  Right now, BTRFS is not sanely checking the 
ID (it only verifies the UUID's in the FS itself, it should also be 
checking hardware-level identifiers like WWN).



Re: [PATCH] btrfs: use new block error code

2017-06-21 Thread David Sterba
On Mon, Jun 19, 2017 at 01:55:37PM +0300, Dan Carpenter wrote:
> This function is supposed to return blk_status_t error codes now but
> there was a stray -ENOMEM left behind.
> 
> Fixes: 4e4cbee93d56 ("block: switch bios to blk_status_t")
> Signed-off-by: Dan Carpenter 

Acked-by: David Sterba 

The patch depends on the blk_status_t patchset, so I expect that it's
going to be merged to that.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Exactly what is wrong with RAID5/6

2017-06-21 Thread Christoph Anton Mitterer
On Wed, 2017-06-21 at 16:45 +0800, Qu Wenruo wrote:
> Btrfs is always using device ID to build up its device mapping.
> And for any multi-device implementation (LVM,mdadam) it's never a
> good 
> idea to use device path.

Isn't it rather the other way round? Using the ID is bad? Don't you
remember our discussion about using leaked UUIDs (or accidental
collisions) for all kinds of attacks?


Cheers,
Chris.



Re: 4.11.3: BTRFS critical (device dm-1): unable to add free space :-17 => btrfs check --repair runs clean

2017-06-21 Thread Duncan
Marc MERLIN posted on Tue, 20 Jun 2017 16:12:03 -0700 as excerpted:

> On Tue, Jun 20, 2017 at 08:44:29AM -0700, Marc MERLIN wrote:
>> On Tue, Jun 20, 2017 at 03:36:01PM +, Hugo Mills wrote:
>> 
 "space cache will be invalidated " => doesn't that mean that my
 cache was already cleared by check --repair, or are you saying I
 need to clear it again?
>>> 
>>>I'm never quite sure about that one. :)
>>> 
>>>It can't hurt to clear it manually as well.
>> 
>> Sounds good, done.
>  
> Except it didn't help :(
> It worked for a while, and failed again.
> 
> It looks like I'm hitting a persistent bug :(

[Omitted free space cache dmesg errors]

> Given that check --repair ran clean when I ran it yesterday after this
> first happened, and I then ran  mount -o clear_cache , the cache got
> rebuilt, and I got the problem again, this is not looking good, seems
> like a persistent bug :-/

Keep in mind this quote from a recent (I'm quoting -progs 4.11) btrfs-
check manpage (reformatted for posting):

> 

--clear-space-cache v1|v2

completely wipe all free space cache of given type

For free space cache v1, the clear_cache kernel mount option only 
rebuilds the free space cache for block groups that are modified while the
filesystem is mounted with that option. Thus, using this option with v1 
makes it possible to actually clear the entire free space cache.

For free space cache v2, the clear_cache kernel mount option does destroy 
the entire free space cache. This option with v2 provides an alternative
method of clearing the free space cache that doesn’t require mounting the 
filesystem.

<

Given the dmesg, it seems you're still running the v1 space cache, not the v2 
free-space tree (which is fine, I'm conservative enough not to have switched yet 
either).  So try the check option instead of the mount option.  The mount 
option might simply have not caught all the badness while it was active.
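
(With -progs 4.11 that would be, against the unmounted filesystem, something
along the lines of

btrfs check --clear-space-cache v1 /dev/mapper/dshelf2

after which a normal mount should rebuild the v1 cache as block groups get
touched.)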

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



[PATCH] btrfs: DEBUG fiemap: Show more info about extent_fiemap

2017-06-21 Thread Qu Wenruo
Hi Adam,

Would you please try this patch based on v4.12-rc5 and try to reproduce
the kernel warning?

It would be better to eliminate the noise by ensuring there is no other fiemap
caller on btrfs.

Thanks,
Qu

Signed-off-by: Qu Wenruo 
---
 fs/btrfs/extent_io.c | 23 +--
 1 file changed, 21 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index d3619e0..ebebfb0 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -4406,6 +4406,8 @@ static int emit_fiemap_extent(struct fiemap_extent_info *fieinfo,
 {
int ret = 0;
 
+   pr_info("%s: entered: offset=%llu phys=%llu len=%llu flags=0x%x\n",
+   __func__, offset, phys, len, flags);
if (!cache->cached)
goto assign;
 
@@ -4417,7 +4419,7 @@ static int emit_fiemap_extent(struct fiemap_extent_info *fieinfo,
 * NOTE: Physical address can overlap, due to compression
 */
if (cache->offset + cache->len > offset) {
-   WARN_ON(1);
+   pr_info("%s: Sanity check failed\n", __func__);
return -EINVAL;
}
 
@@ -4438,16 +4440,23 @@ static int emit_fiemap_extent(struct fiemap_extent_info *fieinfo,
(flags & ~FIEMAP_EXTENT_LAST)) {
cache->len += len;
cache->flags |= flags;
+   pr_info("%s: merged, new offset=%llu len=%llu flags=0x%x\n", __func__,
+   cache->offset, cache->len, cache->flags);
goto try_submit_last;
}
 
+   pr_info("%s: submit cached fiemap: offset=%llu len=%llu flags=0x%x\n",
+   __func__, cache->offset, cache->len, cache->flags);
/* Not mergeable, need to submit cached one */
ret = fiemap_fill_next_extent(fieinfo, cache->offset, cache->phys,
  cache->len, cache->flags);
cache->cached = false;
-   if (ret)
+   if (ret) {
+   pr_info("%s: cached submit exit ret=%d\n", __func__, ret);
return ret;
+   }
 assign:
+   pr_info("%s: assigning new fiemap\n", __func__);
cache->cached = true;
cache->offset = offset;
cache->phys = phys;
@@ -4455,10 +4464,13 @@ static int emit_fiemap_extent(struct fiemap_extent_info *fieinfo,
cache->flags = flags;
 try_submit_last:
if (cache->flags & FIEMAP_EXTENT_LAST) {
+   pr_info("%s: submit last fiemap: offset=%llu len=%llu flags=0x%x\n",
+   __func__, cache->offset, cache->len, cache->flags);
ret = fiemap_fill_next_extent(fieinfo, cache->offset,
cache->phys, cache->len, cache->flags);
cache->cached = false;
}
+   pr_info("%s: last exit ret=%d\n", __func__, ret);
return ret;
 }
 
@@ -4525,6 +4537,9 @@ int extent_fiemap(struct inode *inode, struct fiemap_extent_info *fieinfo,
return -ENOMEM;
path->leave_spinning = 1;
 
+   pr_info("%s: enter: root=%llu inode=%llu start=%llu len=%llu\n",
+   __func__, root->objectid, btrfs_ino(BTRFS_I(inode)),
+   start, len);
start = round_down(start, btrfs_inode_sectorsize(inode));
len = round_up(max, btrfs_inode_sectorsize(inode)) - start;
 
@@ -4696,6 +4711,7 @@ int extent_fiemap(struct inode *inode, struct fiemap_extent_info *fieinfo,
if (ret) {
if (ret == 1)
ret = 0;
+   pr_info("%s: out_free after emit_fiemap_extent\n", __func__);
goto out_free;
}
}
@@ -4707,6 +4723,9 @@ int extent_fiemap(struct inode *inode, struct fiemap_extent_info *fieinfo,
btrfs_free_path(path);
unlock_extent_cached(&BTRFS_I(inode)->io_tree, start, start + len - 1,
 &cached_state, GFP_NOFS);
+   pr_info("%s: exit: ret=%d root=%llu inode=%llu start=%llu len=%llu\n",
+   __func__, ret, root->objectid, btrfs_ino(BTRFS_I(inode)),
+   start, len);
return ret;
 }
 
-- 
2.9.3





Re: Exactly what is wrong with RAID5/6

2017-06-21 Thread Qu Wenruo



At 06/21/2017 06:57 AM, waxhead wrote:
I am trying to piece together the actual status of the RAID5/6 bit of 
BTRFS.
The wiki refer to kernel 3.19 which was released in February 2015 so I 
assume that the information there is a tad outdated (the last update on 
the wiki page was July 2016)

https://btrfs.wiki.kernel.org/index.php/RAID56

Now there are four problems listed

1. Parity may be inconsistent after a crash (the "write hole")
Is this still true, if yes - would not this apply for RAID1 / RAID10 as 
well? How was it solved there , and why can't that be done for RAID5/6


Unlike pure stripe method, one fully functional RAID5/6 should be 
written in full stripe behavior, which is made up by N data stripes and 
correct P/Q.


Given one example to show how write sequence affects the usability of 
RAID5/6.


Existing full stripe:
X = Used space (Extent allocated)
O = Unused space
Data 1   |XX||
Data 2   |OOO|
Parity   |WW||

When some new extent is allocated to data 1 stripe, if we write
data directly into that region, and crashed.
The result will be:

Data 1   |XX|XX|O|
Data 2   |OOO|
Parity   |WW||

Parity stripe is not updated, although it's fine since data is still 
correct, this reduces the usability, as in this case, if we lost device 
containing data 2 stripe, we can't recover correct data of data 2.


Although personally I don't think it's a big problem yet.

Someone has idea to modify extent allocator to handle it, but anyway I 
don't consider it's worthy.




2. Parity data is not checksummed
Why is this a problem? Does it have to do with the design of BTRFS somehow?
Parity is after all just data, BTRFS does checksum data so what is the 
reason this is a problem?


Because that's one solution to solve above problem.

And no, parity is not data.

Parity/mirror/stripe is done in btrfs chunk level, and represents a 
nice, easy to understand linear logical space for higher level.


For example:
If in the btrfs logical space, 0~1G is mapped to a RAID5 chunk with 3 
devices, higher level only needs to tell btrfs chunk layer how many 
bytes it wants to read and where the read starts.


If one devices is missing, then try to rebuild the data using parity so 
that higher layer don't need to care what the profile is or if there is 
parity or not.


So parity can't be addressed from btrfs logical space, which is to say there is 
no possible position to record a csum for it in the current btrfs design.
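
As a purely illustrative sketch (nothing like the real chunk mapping code, and
the parity rotation below is invented) of why every logical byte resolves to a
data stripe while the parity block is only reachable from inside the mapping,
with no logical address of its own to attach a csum to:

#include <stdint.h>
#include <stdio.h>

#define STRIPE_LEN (64 * 1024ULL)

/* Illustration only: place a logical offset inside a RAID5 chunk. */
static void map_raid5(uint64_t logical_in_chunk, int num_devs)
{
	int data_stripes = num_devs - 1;
	uint64_t strip_nr = logical_in_chunk / STRIPE_LEN;
	uint64_t full_stripe = strip_nr / data_stripes;
	int data_index = (int)(strip_nr % data_stripes);
	int parity_dev = (int)(full_stripe % num_devs);   /* made-up rotation */
	int data_dev = (parity_dev + 1 + data_index) % num_devs;

	printf("logical %llu -> full stripe %llu, data strip on dev %d, "
	       "parity (no logical address) on dev %d\n",
	       (unsigned long long)logical_in_chunk,
	       (unsigned long long)full_stripe, data_dev, parity_dev);
}

int main(void)
{
	map_raid5(0, 3);                /* 3 devices: 2 data + 1 parity */
	map_raid5(5 * STRIPE_LEN, 3);
	return 0;
}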






3. No support for discard? (possibly -- needs confirmation with cmason)
Does this matter that much really?, is there an update on this?


Not familiar with this though.



4. The algorithm uses as many devices as are available: No support for a 
fixed-width stripe.
What is the plan for this one? There was patches on the mailing list by 
the SnapRAID author to support up to 6 parity devices. Will the (re?) 
resign of btrfs raid5/6 support a scheme that allows for multiple parity 
devices?


Considering that the current maintainers seem to be focusing on bug fixes, not 
new features, I'm not confident about such a new feature.




I do have a few other questions as well...

5. BTRFS does still (kernel 4.9) not seem to use the device ID to 
communicate with devices.


Btrfs is always using device ID to build up its device mapping.
And for any multi-device implementation (LVM, mdadm) it's never a good 
idea to use the device path.




If you on a multi device filesystem yank out a device, for example 
/dev/sdg and it reappear as /dev/sdx for example btrfs will still 
happily try to write to /dev/sdg even if btrfs fi sh /mnt shows the 
correct device ID. What is the status for getting BTRFS to properly 
understand that a device is missing?


It's btrfs that doesn't support runtime switch from missing device to 
re-appeared device.


Most device missing detection is done at btrfs device scan time.
Runtime detection is not that perfect, but Anand Jain is introducing some 
nice infrastructure as a basis to enhance it.




6. RAID1 needs to be able to make two copies always. E.g. if you have 
three disks you can loose one and it should still work. What about 
RAID10 ? If you have for example 6 disk RAID10 array, loose one disk and 
reboots (due to #5 above). Will RAID10 recognize that the array now is a 
5 disk array and stripe+mirror over 2 disks (or possibly 2.5 disks?) 
instead of 3? In other words, will it work as long as it can create a 
RAID10 profile that requires a minimum of four disks?


At least, after a reboot, btrfs still knows it's a fs on 6 disks; although it 
has lost one, it will still create new chunks using all 6 disks.


Thanks,
Qu



Re: [PATCH v3] btrfs: fiemap: Cache and merge fiemap extent before submit it to user

2017-06-21 Thread Qu Wenruo



At 06/18/2017 09:42 PM, Adam Borowski wrote:

On Sun, Jun 18, 2017 at 07:23:00PM +0800, Qu Wenruo wrote:

[   39.726215] BTRFS warning (device sda1): unhandled fiemap cache
detected: offset=phys$35798867968 len1072 flags=0x2008



[  151.882586] BTRFS warning (device sda1): unhandled fiemap cache
detected: offset=phys$35798867968 len1072 flags=0x2008




I created file extents which returns 0x2008 and the last extent with 0x9.
But still failed to reproduce the error message.

BTW, I noticed that your output is a little strange.
Normally we should have "offset=%llu phys=%llu len=%llu and flags=0x%x".

But your output seems a little out of shape.
And furthermore, the len (if it's correct) is not aligned even to 512
bytes.


[24031.309852] BTRFS warning (device sda1): unhandled fiemap cache detected: 
offset=0 phys=2012381351936 len=131072 flags=0x2008

Eh?  All the numbers in my mails not only match that printf format but also
are nice round numbers; len being always 128KB (not in just what I posted,
also in all other occurences).


Well the new line seems pretty normal now.

I'll create a debug patch for you to get more info to pin down the bug.



But, in your replies there's '$'blahblab and 0x13"1072".


Seems something went wrong totally.


Perhaps there's some mail client borkage?  I see it ok in my Sent mbox but
it could have been corrupted in transit or on your side.

Here's my current dmesg: http://sprunge.us/ajag


In my work mailbox, everything is just as fine as yours.

But my gmx mailbox gives the wrong output even though the message has the same 
message id (<20170617145725.y3wsex73aaf3v...@angband.pl>).


I'd reduce the usage of my gmx mailbox.

Thanks,
Qu




Meow!






[PATCH v2] btrfs/146: Test various btrfs operations rounding behavior

2017-06-21 Thread Nikolay Borisov
When changing the size of disks/filesystem we should always be
rounding down to a multiple of sectorsize

Signed-off-by: Nikolay Borisov 
---

Changes since v1: 
 - Worked on incorporated feedback by Eryu 
 - Changed test number to 146 to avoid clashes

 tests/btrfs/146 | 147 
 tests/btrfs/146.out |  20 +++
 tests/btrfs/group   |   1 +
 3 files changed, 168 insertions(+)
 create mode 100755 tests/btrfs/146
 create mode 100644 tests/btrfs/146.out

diff --git a/tests/btrfs/146 b/tests/btrfs/146
new file mode 100755
index 000..7e6d40f
--- /dev/null
+++ b/tests/btrfs/146
@@ -0,0 +1,147 @@
+#! /bin/bash
+# FS QA Test No. btrfs/146
+#
+# Test that various code paths which deal with modifying the total size
+# of devices/superblock correctly round the value to a multiple of
+# sector size
+#
+#---
+#
+# Copyright (C) 2017 SUSE Linux Products GmbH. All Rights Reserved.
+# Author: Nikolay Borisov 
+#
+# This program is free software; you can redistribute it and/or
+# modify it under the terms of the GNU General Public License as
+# published by the Free Software Foundation.
+#
+# This program is distributed in the hope that it would be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program; if not, write the Free Software Foundation,
+# Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301  USA
+#---
+#
+
+seq=`basename $0`
+seqres=$RESULT_DIR/$seq
+echo "QA output created by $seq"
+
+tmp=/tmp/$$
+status=1   # failure is the default!
+trap "_cleanup; exit \$status" 0 1 2 3 15
+
+_cleanup()
+{
+   $UMOUNT_PROG $loop_mnt 
+   _destroy_loop_device $loop_dev1
+   _destroy_loop_device $loop_dev2
+   cd /
+   rm -f $tmp.*
+   
+}
+
+# get standard environment, filters and checks
+. ./common/rc
+. ./common/filter
+
+# real QA test starts here
+_supported_fs btrfs
+_supported_os Linux
+_require_scratch
+_require_loop
+_require_block_device $SCRATCH_DEV
+_require_btrfs_command inspect-internal dump-super
+# ~2.1 gigabytes just to be on the safe side
+_require_fs_space $SCRATCH_MNT $(( 2202009 ))
+
+rm -f $seqres.full
+
+_scratch_mkfs >>$seqres.full 2>&1
+_scratch_mount
+
+
+# Size of devices which are going to be half a page larger than
+# default sectorsize (4kb)
+expectedsize=$(( 1 * 1024 * 1024 * 1024 ))
+filesize=$(( $expectedsize + 2048 ))
+loop_mnt=$SCRATCH_MNT/mount
+fs_img1=$SCRATCH_MNT/disk1.img
+fs_img2=$SCRATCH_MNT/disk2.img
+
+mkdir $loop_mnt
+
+#create files to hold dummy devices
+$XFS_IO_PROG -f -c "falloc 0 $filesize" $fs_img1 &> /dev/null
+$XFS_IO_PROG -f -c "falloc 0 $filesize" $fs_img2 &> /dev/null
+
+loop_dev1=$(_create_loop_device $fs_img1)
+loop_dev2=$(_create_loop_device $fs_img2)
+
+#create fs only on the first device
+_mkfs_dev $loop_dev1
+_mount $loop_dev1 $loop_mnt
+
+echo "Size from mkfs"
+$BTRFS_UTIL_PROG inspect-internal dump-super /dev/loop0 | grep total
+
+#resize down to 768mb + 2k
+$BTRFS_UTIL_PROG filesystem resize 824182784 $loop_mnt >>$seqres.full 2>&1
+sync
+
+echo ""
+
+echo "Size after resizing down"
+$BTRFS_UTIL_PROG inspect-internal dump-super $loop_dev1 | grep total
+
+echo ""
+
+#now resize back up to 1 gb
+$BTRFS_UTIL_PROG filesystem resize max $loop_mnt >>$seqres.full 2>&1
+sync
+
+echo "Size after resizing up"
+$BTRFS_UTIL_PROG inspect-internal dump-super /dev/loop0 | grep total
+
+# Add fs load for later balance operations, we need to do this
+# before adding a second device
+$FSSTRESS_PROG -w -n 15000 -p 2 -d $loop_mnt >> $seqres.full 2>&1
+
+# add second unaligned device, this checks the btrfs_init_new_device codepath
+# device should be aligned
+$BTRFS_UTIL_PROG device add $loop_dev2 $loop_mnt >>$seqres.full 2>&1
+sync
+
+echo ""
+
+# Ensure that adding a device doesn't cause the superbock to be
+# unaligned
+echo "Size of superblock after device addition"
+$BTRFS_UTIL_PROG inspect-internal dump-super $loop_dev1 | grep total
+
+echo ""
+
+# The newly added device must also be aligned
+echo "Size of superblock/device for second device"
+$BTRFS_UTIL_PROG inspect-internal dump-super $loop_dev2 | grep total
+
+# now start a balance, this might produce warnings while
+# it's relocating chunks and updating the size of devices
+$BTRFS_UTIL_PROG balance start $loop_mnt >> $seqres.full 2>&1
+sync
+
+# delete everything and then run balance
+rm -rf $loop_mnt/*
+
+# do a bit more stress testing until we fill the fs
+$FSSTRESS_PROG -w -n 15000 -p 4 -d $loop_mnt >> $seqres.full 2>&1
+rm -rf $loop_mnt/p0 $loop_mnt/p2
+
+# run balance again, this time we will be freeing chunks
+$BTRFS_UTIL_PROG balance start $loop_mnt >> $seqres.full 

Re: Exactly what is wrong with RAID5/6

2017-06-21 Thread Peter Grandi
> [ ... ] This will make some filesystems mostly RAID1, negating
> all space savings of RAID5, won't it? [ ... ]

RAID5/RAID6/... don't merely save space, more precisely they
trade lower resilience and a more anisotropic and smaller
performance envelope to gain lower redundancy (= save space).