[PATCH] btrfs: Ensure proper sector alignment for btrfs_free_reserved_data_space

2016-11-18 Thread Jeff Mahoney
From: Jeff Mahoney 
Subject: btrfs: Ensure proper sector alignment for
 btrfs_free_reserved_data_space
References: bsc#1005666
Patch-mainline: Submitted 18 Nov 2016, linux-btrfs

This fixes the WARN_ON on BTRFS_I(inode)->reserved_extents in
btrfs_destroy_inode and the WARN_ON on nonzero delalloc bytes on umount
with qgroups enabled.

I was able to reproduce this by setting up a small (~500kb) quota limit
and writing a file one byte at a time until I hit the limit.  The warnings
would all hit on umount.

The root cause is that we would reserve a block-sized range in both
the space reservation and the qgroup in btrfs_check_data_free_space, but
if we encountered a problem (e.g. EDQUOT), we would only release the
single byte in the qgroup reservation.  That caused an io_tree state
split, which increased the number of outstanding extents and, in turn,
prevented the metadata reservation from being released.

Signed-off-by: Jeff Mahoney 
---
 fs/btrfs/extent-tree.c |7 +++
 1 file changed, 7 insertions(+)

--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -3822,6 +3822,13 @@ void btrfs_free_reserved_data_space_noqu
  */
 void btrfs_free_reserved_data_space(struct inode *inode, u64 start, u64 len)
 {
+   struct btrfs_root *root = BTRFS_I(inode)->root;
+
+   /* Make sure the range is aligned to sectorsize */
+   len = round_up(start + len, root->sectorsize) -
+ round_down(start, root->sectorsize);
+   start = round_down(start, root->sectorsize);
+
btrfs_free_reserved_data_space_noquota(inode, start, len);
btrfs_qgroup_free_data(inode, start, len);
 }
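
For illustration only (not part of the patch): a small sketch of the
rounding the hunk above performs, assuming a 4096-byte sectorsize.  With
the range aligned this way the whole sector is released, matching what
btrfs_check_data_free_space reserved, so the io_tree state is not split:

def round_down(x, align):
    return x - (x % align)

def round_up(x, align):
    return round_down(x + align - 1, align)

# A one-byte reservation at offset 100 with a 4096-byte sector:
start, length, sectorsize = 100, 1, 4096
aligned_start = round_down(start, sectorsize)
aligned_len = round_up(start + length, sectorsize) - aligned_start
print(aligned_start, aligned_len)   # 0 4096 -- the whole sector is freed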


-- 
Jeff Mahoney
SUSE Labs


Re: Btrfs Heatmap - v2 - block group internals!

2016-11-18 Thread Hans van Kranenburg
Hi,

On 11/19/2016 01:57 AM, Qu Wenruo wrote:
> On 11/18/2016 11:08 PM, Hans van Kranenburg wrote:
>> On 11/18/2016 03:08 AM, Qu Wenruo wrote:
>>>> I don't see what displaying a blockgroup-level aggregate usage number
>>>> has to do with multi-device, except that the same %usage will appear
>>>> another time when using RAID1*.
>>>
>>> Although in fact, for profiles like RAID0/5/6/10, it's completely
>>> possible that one dev_extent contains all the data, while another
>>> dev_extent is almost empty.
>>
>> When using something like RAID0 profile, I would expect 50% of the data
>> to end up in one dev_extent and 50% in the other?
> 
> First I'm mostly OK with current grayscale.
> What I'm saying can be considered as nitpicking.
> 
> The only concern is, for the full fs output, we are in fact outputting
> dev_extents of each device.
> 
> In that case, we should output info at the same level of dev_extent.
> 
> So if we really need to provide an *accurate* gray scale, then we should
> base it on the %usage of a dev extent.
> 
> And as for the 50%/50% assumption for RAID0, it's not true and we can
> easily create a case where it's 100%/0%.
> 
> [...]
> 
> Then, only the 2nd data stripe has data, while the 1st data stripe is
> free.

Ah, yes, I see, that's a good example.

So technically, for profiles other than single and RAID1, just using the
blockgroup usage might be wrong. OTOH, for most cases it will still be
"correct enough" for the eye, because, statistically speaking, the
distribution of data over the stripes will be more uniform more often
than not. It's good to be aware of, but I'm fine with having this as a
"known limitation".

> Things will become more complicated when RAID5/6 is involved.

Yes.

So being able to show a specific dev extent with the actual info (sorted
by physical byte location) would be a nice addition. Since it also
requires walking the extent tree for the related block group and doing
calculations on the ranges, it's not feasible to do for the high-level
file system picture.

>>> Strictly speaking, at full fs or dev level, we should output things at
>>> dev_extent level, then greyscale should represent dev_extent
>>> usage (which is not possible, or quite hard, to calculate)
>>
>> That's what it's doing now?
>>
>>> Anyway, the greyscale is mostly OK, just as a good addition output for
>>> full fs graph.
>>
>> I don't follow.
>>
>>> Although if it could output the fs or a specific dev without the gray
>>> scale, I think it would be better.
>>> It would make the dev_extent-level fragments much clearer.
>>
>> I have no idea what you mean, sorry.
> 
> The point is, for full fs or per-device output, a developer may focus on
> the fragments of unallocated space in each device.
> 
> In that case, an almost empty bg will be much like unallocated space.

The usage (from 0 to 100) is translated into a brightness between 16 and
255, which already causes empty allocated space to be visually
distinguishable from unallocated space:

def _brightness(self, used_pct):
    return 16 + int(round(used_pct * (255 - 16)))

If you need it to be clearer, just increase the value to more than 16
and voila.

> So I hope there can be an option to disable greyscale for the full fs
> output; that would be much better.

It's just some python, don't hesitate to change it and try things out.

 def _brightness(self, used_pct):
-    return 16 + int(round(used_pct * (255 - 16)))
+    return 255

Et voila, 0% used is bright white now. The experience of the resulting
image is really different, but if it helps in certain situations, it's
quite easy to get it done.
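
If you want something in between, here's a minimal sketch of a
parameterized drop-in variant of the method above (the min_brightness and
greyscale knobs are made up for illustration, and used_pct is assumed to
be a fraction between 0 and 1, as the existing formula implies):

def _brightness(self, used_pct, min_brightness=16, greyscale=True):
    # With greyscale disabled, anything allocated renders bright white,
    # so almost-empty block groups no longer fade towards unallocated space.
    if not greyscale:
        return 255
    # Map usage 0..1 onto min_brightness..255.
    return min_brightness + int(round(used_pct * (255 - min_brightness)))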

> Just like the blockgroup output, only black and white, and the example
> on github is really awesome!
> 
> It shows a lot of things I didn't have a clear view of before.
> Like batched metadata extents (mostly for the csum tree) and fragmented
> metadata for other trees.

:D

First thing I want to do with the blockgroup-level pics is make a
"rolling heatmap" of the 4 highest-vaddr blockgroups, combined, of the
filesystem that this thread was about:

http://www.spinics.net/lists/linux-btrfs/msg54940.html

First I'm going to disable autodefrag again, take a picture a few times
per hour, let it run for a few days, and then do the same thing again
with autodefrag enabled. (spoiler: even with autodefrag enabled, it's a
disaster)

But the resulting timelapse videos will show really interesting
information on the behaviour of autodefrag I guess. Can't wait to see them.

-- 
Hans van Kranenburg


Re: [PATCH 1/9] btrfs: use bio iterators for the decompression handlers

2016-11-18 Thread Liu Bo
On Wed, Nov 16, 2016 at 01:52:08PM +0100, Christoph Hellwig wrote:
> Pass the full bio to the decompression routines and use bio iterators
> to iterate over the data in the bio.

One question below,

> 
> Signed-off-by: Christoph Hellwig 
> ---
>  fs/btrfs/compression.c | 122 
> +
>  fs/btrfs/compression.h |  12 ++---
>  fs/btrfs/lzo.c |  17 ++-
>  fs/btrfs/zlib.c|  15 ++
>  4 files changed, 54 insertions(+), 112 deletions(-)
> 
> diff --git a/fs/btrfs/compression.c b/fs/btrfs/compression.c
> index d4d8b7e..12a631d 100644
> --- a/fs/btrfs/compression.c
> +++ b/fs/btrfs/compression.c
> @@ -81,9 +81,9 @@ struct compressed_bio {
>   u32 sums;
>  };
>  
> -static int btrfs_decompress_biovec(int type, struct page **pages_in,
> -u64 disk_start, struct bio_vec *bvec,
> -int vcnt, size_t srclen);
> +static int btrfs_decompress_bio(int type, struct page **pages_in,
> +u64 disk_start, struct bio *orig_bio,
> +size_t srclen);
>  
>  static inline int compressed_bio_size(struct btrfs_root *root,
> unsigned long disk_size)
> @@ -175,11 +175,10 @@ static void end_compressed_bio_read(struct bio *bio)
>   /* ok, we're the last bio for this extent, lets start
>* the decompression.
>*/
> - ret = btrfs_decompress_biovec(cb->compress_type,
> + ret = btrfs_decompress_bio(cb->compress_type,
> cb->compressed_pages,
> cb->start,
> -   cb->orig_bio->bi_io_vec,
> -   cb->orig_bio->bi_vcnt,
> +   cb->orig_bio,
> cb->compressed_len);
>  csum_failed:
>   if (ret)
> @@ -959,9 +958,7 @@ int btrfs_compress_pages(int type, struct address_space 
> *mapping,
>   *
>   * disk_start is the starting logical offset of this array in the file
>   *
> - * bvec is a bio_vec of pages from the file that we want to decompress into
> - *
> - * vcnt is the count of pages in the biovec
> + * orig_bio contains the pages from the file that we want to decompress into
>   *
>   * srclen is the number of bytes in pages_in
>   *
> @@ -970,18 +967,18 @@ int btrfs_compress_pages(int type, struct address_space 
> *mapping,
>   * be contiguous.  They all correspond to the range of bytes covered by
>   * the compressed extent.
>   */
> -static int btrfs_decompress_biovec(int type, struct page **pages_in,
> -u64 disk_start, struct bio_vec *bvec,
> -int vcnt, size_t srclen)
> +static int btrfs_decompress_bio(int type, struct page **pages_in,
> +u64 disk_start, struct bio *orig_bio,
> +size_t srclen)
>  {
>   struct list_head *workspace;
>   int ret;
>  
>   workspace = find_workspace(type);
>  
> - ret = btrfs_compress_op[type-1]->decompress_biovec(workspace, pages_in,
> -  disk_start,
> -  bvec, vcnt, srclen);
> + ret = btrfs_compress_op[type-1]->decompress_bio(workspace, pages_in,
> +  disk_start, orig_bio,
> +  srclen);
>   free_workspace(type, workspace);
>   return ret;
>  }
> @@ -1021,9 +1018,7 @@ void btrfs_exit_compress(void)
>   */
>  int btrfs_decompress_buf2page(char *buf, unsigned long buf_start,
> unsigned long total_out, u64 disk_start,
> -   struct bio_vec *bvec, int vcnt,
> -   unsigned long *pg_index,
> -   unsigned long *pg_offset)
> +   struct bio *bio)
>  {
>   unsigned long buf_offset;
>   unsigned long current_buf_start;
> @@ -1031,13 +1026,13 @@ int btrfs_decompress_buf2page(char *buf, unsigned 
> long buf_start,
>   unsigned long working_bytes = total_out - buf_start;
>   unsigned long bytes;
>   char *kaddr;
> - struct page *page_out = bvec[*pg_index].bv_page;
> + struct bio_vec bvec = bio_iter_iovec(bio, bio->bi_iter);
>  
>   /*
>* start byte is the first byte of the page we're currently
>* copying into relative to the start of the compressed data.
>*/
> - start_byte = page_offset(page_out) - disk_start;
> + start_byte = page_offset(bvec.bv_page) - disk_start;
>  
>   /* we haven't yet hit data corresponding to this page */
>   if (total_out <= start_byte)
> @@ -1057,80 +1052,45 @@ int btrfs_decompress_buf2page(char *buf, unsigned 
> long buf_start,
>  
>   /* copy bytes from the working buffer into the 

Re: Btrfs Heatmap - v2 - block group internals!

2016-11-18 Thread Qu Wenruo



On 11/18/2016 11:08 PM, Hans van Kranenburg wrote:

On 11/18/2016 03:08 AM, Qu Wenruo wrote:


Just found one small problem.
After specifying --size 16 to output a given block group (small block
group, I need large size to make output visible), it takes a full cpu
and takes a long long long time to run.
So long I don't even want to wait.

I changed size to 10, and it finished much faster.

Is that expected?


Yes, hilbert curve size increases exponentially when increasing order.

2**16 = 65536, 65536x65536 = 4294967296 pixels in the png image.

So even if you would have a petabyte filesystem, that would still mean
that a single pixel in the image represents only ~ 256kB. I don't think
your small block groups are 256kB big?

Specifying values like this does not make any sense at all, and it is
expected not to work well.
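
For reference, a rough back-of-the-envelope helper (assuming a single
Hilbert curve of the given order covers the whole filesystem):

def bytes_per_pixel(hilbert_order, fs_bytes):
    # A curve of order N fills a 2**N x 2**N grid, i.e. 4**N pixels.
    side = 2 ** hilbert_order
    return fs_bytes / (side * side)

print(bytes_per_pixel(16, 2**50))  # ~256 KiB per pixel for a 1 PiB filesystem
print(bytes_per_pixel(10, 2**50))  # ~1 GiB per pixel at order 10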


I don't see what displaying a blockgroup-level aggregate usage number
has to do with multi-device, except that the same %usage will appear
another time when using RAID1*.


Although in fact, for profiles like RAID0/5/6/10, it's completely
possible that one dev_extent contains all the data, while another
dev_extent is almost empty.


When using something like RAID0 profile, I would expect 50% of the data
to end up in one dev_extent and 50% in the other?


First I'm mostly OK with current grayscale.
What I'm saying can be considered as nitpicking.

The only concern is, for the full fs output, we are in fact outputting
dev_extents of each device.


In that case, we should output info at the same level of dev_extent.

So if we really need to provide an *accurate* gray scale, then we should
base it on the %usage of a dev extent.



And as for the 50%/50% assumption for RAID0, it's not true and we can
easily create a case where it's 100%/0%.


The method is as simple as the following script (assume the data BG is
1G, and the first write will be allocated to offset 0 of the BG):


# Fill 1G BG with 64K files
for i in $(seq -w 0 16383); do
xfs_io -f -c "pwrite 0 64k" $mnt/file_$i

# Sync fs to ensure extent allocation happens
sync
done

# Remove every other file, to create holes
for i in $(seq -w 0 2 16383); do
rm $mnt/file_$i
done
btrfs fi sync $mnt

Then, only the 2nd data stripe has data, while the 1st data stripe is free.


Things will become more complicated when RAID5/6 is involved.



Strictly speaking, at full fs or dev level, we should output things at
dev_extent level, then greyscale should represent dev_extent
usage (which is not possible, or quite hard, to calculate)


That's what it's doing now?


Anyway, the greyscale is mostly OK, just as a good addition output for
full fs graph.


I don't follow.


Although if it could output the fs or a specific dev without the gray
scale, I think it would be better.
It would make the dev_extent-level fragments much clearer.


I have no idea what you mean, sorry.


The point is, for full fs or per-device output, a developer may focus on 
the fragments of unallocated space in each device.


In that case, an almost empty bg will be much like unallocated space.

So I hope there can be an option to disable greyscale for the full fs
output; that would be much better.


Just like the blockgroup output, only black and white, and the example
on github is really awesome!


It shows a lot of things I didn't have a clear view of before.
Like batched metadata extents (mostly for the csum tree) and fragmented
metadata for other trees.


Thanks,
Qu



When generating a picture of a file system with multiple devices,
boundaries between the separate devices are not visible now.

If someone has a brilliant idea about how to do this without throwing
out actual usage data...


The first thought that comes to mind for me is to make each device be a
different color, and otherwise obey the same intensity mapping
correlating to how much data is there.  For example, if you've got a 3
device FS, the parts of the image that correspond to device 1 would go
from 0x000000 to 0xFF0000, the parts for device 2 could be 0x000000 to
0x00FF00, and the parts for device 3 could be 0x000000 to 0x0000FF. This
is of course not perfect (you can't tell what device each segment of
empty space corresponds to), but would probably cover most use cases.
(for example, with such a scheme, you could look at an image and tell
whether the data is relatively well distributed across all the devices
or you might need to re-balance).
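
A rough sketch of that per-device colouring idea (the function below is
made up for illustration and is not part of the heatmap tool; used_pct is
assumed to be a fraction between 0 and 1):

def device_color(dev_index, used_pct):
    # Map usage onto one RGB channel per device (R, G, B, cycling).
    floor = 16  # keep empty-but-allocated space visible, as in _brightness()
    value = floor + int(round(used_pct * (255 - floor)))
    rgb = [0, 0, 0]
    rgb[dev_index % 3] = value
    return tuple(rgb)  # device 0 -> shades of red, 1 -> green, 2 -> blue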


What about linear output separated with lines (or just black)?


Linear output does not produce useful images, except for really small
filesystems.




Re: [PATCH 0/2] RAID5/6 scrub race fix

2016-11-18 Thread Zygo Blaxell
On Fri, Nov 18, 2016 at 07:09:34PM +0100, Goffredo Baroncelli wrote:
> Hi Zygo
> On 2016-11-18 00:13, Zygo Blaxell wrote:
> > On Tue, Nov 15, 2016 at 10:50:22AM +0800, Qu Wenruo wrote:
> >> Fix the so-called famous RAID5/6 scrub error.
> >>
> >> Thanks Goffredo Baroncelli for reporting the bug, and make it into our
> >> sight.
> >> (Yes, without the Phoronix report on this,
> >> https://www.phoronix.com/scan.php?page=news_item&px=Btrfs-RAID-56-Is-Bad,
> >> I won't ever be aware of it)
> > 
> > If you're hearing about btrfs RAID5 bugs for the first time through
> > Phoronix, then your testing coverage is *clearly* inadequate.
> > 
> > Fill up a RAID5 array, start a FS stress test, pull a drive out while
> > that's running, let the FS stress test run for another hour, then try
> > to replace or delete the missing device.  If there are any crashes,
> > corruptions, or EIO during any part of this process (assuming all the
> > remaining disks are healthy), then btrfs RAID5 is still broken, and
> > you've found another bug to fix.
> > 
> > The fact that so many problems in btrfs can still be found this way
> > indicates to me that nobody is doing this basic level of testing
> > (or if they are, they're not doing anything about the results).
> 
> [...]
> 
> Sorry but I don't find this kind of discussion useful.  Yes, BTRFS
> RAID5/6 needs a lot of care. Yes, *our* test coverage is far from
> complete; but this is not the fault of a single person; and Qu tried to
> solve one issue, and for this we should only say thanks.
>
> Even if you don't find the work of Qu (and my little one :-) )
> valuable, it required some time and needs to be respected.

I do find this work valuable, and I do thank you and Qu for it.
I've been following it with great interest because I haven't had time
to dive into it myself.  It's a use case I used before and would like
to use again.

Most of my recent frustration, if directed at anyone, is really directed
at Phoronix for conflating "one bug was fixed" with "ready for production
use today," and I wanted to ensure that the latter rumor was promptly
quashed.

This is why I'm excited about Qu's work:  on my list of 7 btrfs-raid5
recovery bugs (6 I found plus yours), Qu has fixed at least 2 of them,
maybe as many as 4, with the patches so far.  I can fix 2 of the others,
for a total of 6 fixed out of 7.

Specifically, the 7 bugs I know of are:

1-2. BUG_ONs in functions that should return errors (I had
fixed both already when trying to recover my broken arrays)

3. scrub can't identify which drives or files are corrupted
(Qu might have fixed this--I won't know until I do testing)

4-6. symptom groups related to wrong data or EIO in scrub
recovery, including Goffredo's (Qu might have fixed all of these,
but from a quick read of the patch I think at least two are done).

7. the write hole.

I'll know more after I've had a chance to run Qu's patches through
testing, which I intend to do at some point.

Optimistically, this means there could be only *one* bug remaining
in the critical path for btrfs RAID56 single disk failure recovery.
That last bug is the write hole, which is why I keep going on about it.
It's the only bug I know exists in btrfs RAID56 that has neither an
existing fix nor any evidence of someone actively working on it, even
at the design proposal stage.  Please, I'd love to be wrong about this.

When I described the situation recently as "a thin layer of bugs on
top of a design defect", I was not trying to be mean.  I was trying to
describe the situation *precisely*.

The thin layer of bugs is much thinner thanks to Qu's work, and thanks
in part to his work, I now have confidence that further investment in
this area won't be wasted.

> Finally, I don't think that we should compare the RAID-hole with this
> kind of bug(fix). The former is a design issue, the latter is a bug
> related of one of the basic feature of the raid system (recover from
> the lost of a disk/corruption).
>
> Even the MD subsystem (which is far behind btrfs) had tolerated
> the raid-hole until last year. 

My frustration on this point is with the attitude that mdadm was ever
good enough, much less a model to emulate in the future.  It's 2016--there
have been some advancements in the state of the art since the IBM patent
describing RAID5 30 years ago, yet in the btrfs world, we seem to insist
on repeating all the same mistakes in the same order.

"We're as good as some existing broken-by-design thing" isn't a really
useful attitude.  We should aspire to do *better* than the existing
broken-by-design things.  If we didn't, we wouldn't be here, we'd all
be lurking on some other list, running ext4 or xfs on mdadm or lvm.

> And its solution is far from cheap
> (basically the MD subsystem writes the data first to the journal and
> then to the disk... which is the kind of issue that a COW filesystem
> would solve).

Journalling isn't required.  It's 

Re: RFC: raid with a variable stripe size

2016-11-18 Thread Janos Toth F.
Yes, I don't think one could find any NAND based SSDs with <4k page
size on the market right now (even =4k is hard to get) and 4k is
becoming the new norm for HDDs. However, some HDD manufacturers
continue to offer drives with 512 byte sectors (I think it's possible
to get new ones in sizable quantities if you need them).

I am aware it wouldn't solve the problem for >=4k sector devices
unless you are ready to balance frequently. But I think it would still
be a lot better to waste padding space on 4k stripes than on, say, 64k
stripes until you can balance the new block groups. And, if the space
waste ratio is tolerable, then this could be an automatic background
task as soon as an individual block group, or their total, reaches a
high waste ratio.
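
A toy illustration of such a trigger (the names and thresholds below are
made up for illustration):

def waste_ratio(allocated_bytes, useful_bytes):
    return 1.0 - useful_bytes / allocated_bytes

def needs_rebalance(block_groups, per_bg_limit=0.5, total_limit=0.2):
    # block_groups is a list of (allocated_bytes, useful_bytes) pairs.
    total_alloc = sum(a for a, _ in block_groups)
    total_used = sum(u for _, u in block_groups)
    if waste_ratio(total_alloc, total_used) > total_limit:
        return True
    return any(waste_ratio(a, u) > per_bg_limit for a, u in block_groups)

# A 1G block group holding only 300M of useful data trips the check:
print(needs_rebalance([(1 << 30, 300 << 20), (1 << 30, 1 << 30)]))  # True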

I suggest this as a quick temporary workaround because it could be
cheap in terms of work if the above-mentioned functionalities (stripe
size change, auto-balance) would be worked on anyway (regardless of
RAID-5/6 specific issues) until some better solution is realized
(probably through a lot more work over a much longer development
period). RAID-5 isn't really optimal for a huge number of disks (the
URE-during-rebuild issue...), so the temporary space waste is probably
<=8x per unbalanced block group (which is 1GB, or maybe ~10GB, if I
am not mistaken, so usually <<8x of the whole available space). But
maybe my guesstimates are wrong here.

On Fri, Nov 18, 2016 at 9:51 PM, Timofey Titovets  wrote:
> 2016-11-18 23:32 GMT+03:00 Janos Toth F. :
>> Based on the comments of this patch, stripe size could theoretically
>> go as low as 512 byte:
>> https://mail-archive.com/linux-btrfs@vger.kernel.org/msg56011.html
>> If these very small (0.5k-2k) stripe sizes could really work (it's
>> possible to implement such changes and it does not degrade performance
>> too much - or at all - to keep it so low), we could use RAID-5(/6) on
>> <=9(/10) disks with 512 byte physical sectors (assuming 4k filesystem
>> sector size + 4k node size, although I am not sure if node size is
>> really important here) without having to worry about RMW, extra space
>> waste or additional fragmentation.
>>
>> On Fri, Nov 18, 2016 at 7:15 PM, Goffredo Baroncelli  
>> wrote:
>>> Hello,
>>>
>>> these are only my thoughts; no code here, but I would like to share it 
>>> hoping that it could be useful.
>>>
>>> As reported several times by Zygo (and others), one of the problem of 
>>> raid5/6 is the write hole. Today BTRFS is not capable to address it.
>>>
>>> The problem is that the stripe size is bigger than the "sector size" (ok 
>>> sector is not the correct word, but I am referring to the basic unit of 
>>> writing on the disk, which is 4k or 16K in btrfs).
>>> So when btrfs writes less data than the stripe, the stripe is not filled; 
>>> when it is filled by a subsequent write, a RMW of the parity is required.
>>>
>>> On the best of my understanding (which could be very wrong) ZFS try to 
>>> solve this issue using a variable length stripe.
>>>
>>> On BTRFS this could be achieved using several BGs (== block group or 
>>> chunk), one for each stripe size.
>>>
>>> For example, if a filesystem - RAID5 is composed by 4 DISK, the filesystem 
>>> should have three BGs:
>>> BG #1,composed by two disks (1 data+ 1 parity)
>>> BG #2 composed by three disks (2 data + 1 parity)
>>> BG #3 composed by four disks (3 data + 1 parity).
>>>
>>> If the data to be written has a size of 4k, it will be allocated to the BG 
>>> #1.
>>> If the data to be written has a size of 8k, it will be allocated to the BG 
>>> #2
>>> If the data to be written has a size of 12k, it will be allocated to the BG 
>>> #3
>>> If the data to be written has a size greater than 12k, it will be allocated 
>>> to the BG3, until the data fills a full stripes; then the remainder will be 
>>> stored in BG #1 or BG #2.
>>>
>>>
>>> To avoid unbalancing of the disk usage, each BG could use all the disks, 
>>> even if a stripe uses less disks: i.e
>>>
>>> DISK1 DISK2 DISK3 DISK4
>>> S1    S1    S1    S2
>>> S2    S2    S3    S3
>>> S3    S4    S4    S4
>>> [...]
>>>
>>> Above is shown a BG which uses all the four disks, but has a stripe which 
>>> spans only 3 disks.
>>>
>>>
>>> Pro:
>>> - btrfs already is capable to handle different BG in the filesystem, only 
>>> the allocator has to change
>>> - no more RMW are required (== higher performance)
>>>
>>> Cons:
>>> - the data will be more fragmented
>>> - the filesystem will have more BGs; this will require a time-to-time 
>>> re-balance. But it is an issue which we already know (even if maybe not 
>>> 100% addressed).
>>>
>>>
>>> Thoughts ?
>>>
>>> BR
>>> G.Baroncelli
>>>
>>>
>>>
>>> --
>>> gpg @keyserver.linux.it: Goffredo Baroncelli 
>>> Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5

Re: [RFC] btrfs: make max inline data can be equal to sectorsize

2016-11-18 Thread Chris Mason



On 11/16/2016 11:10 AM, David Sterba wrote:

On Mon, Nov 14, 2016 at 09:55:34AM +0800, Qu Wenruo wrote:

At 11/12/2016 04:22 AM, Liu Bo wrote:

On Tue, Oct 11, 2016 at 02:47:42PM +0800, Wang Xiaoguang wrote:

If we use mount option "-o max_inline=sectorsize", say 4096, indeed
even for a fresh fs, say nodesize is 16k, we can not make the first
4k data completely inline, I found this condition causing this issue:
  !compressed_size && (actual_end & (root->sectorsize - 1)) == 0

If it returns true, we'll not make data inline. For 4k sectorsize,
0~4094 data range, we can make it inline, but 0~4095, it can not.
I don't think this limition is useful, so here remove it which will
make max inline data can be equal to sectorsize.
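
For reference, a much-simplified sketch of just this boundary condition
(not the full in-kernel inlining logic):

SECTORSIZE = 4096

def can_inline(actual_end, compressed_size=0, max_inline=4096,
               drop_boundary_check=False):
    if actual_end > max_inline:
        return False
    if drop_boundary_check:
        return True
    # the condition the patch removes:
    return not (compressed_size == 0 and
                (actual_end & (SECTORSIZE - 1)) == 0)

print(can_inline(4095))                            # True  (0~4094)
print(can_inline(4096))                            # False (0~4095, old behaviour)
print(can_inline(4096, drop_boundary_check=True))  # True  (patched behaviour)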


It's difficult to tell whether we need this, I'm not a big fan of using
max_inline size more than the default size 2048, given that most reports
about ENOSPC is due to metadata and inline may make it worse.


IMHO if we can use inline data extents to trigger ENOSPC more easily,
then we should allow it to dig the problem further.

Just ignoring it because it may cause more bugs will not solve the real
problem anyway.


Not allowing the full 4k value as max_inline looks artificial to me.
We've removed other similar limitation in the past so I'd tend to agree
to do the same here. There's no significant use for it as far as I can
tell, if you want to exhaust metadata, the difference to max_inline=4095
would be really tiny in the end. So, I'm okay with merging it. If
anybody feels like adding his reviewed-by, please do so.


The check is there because in practice it doesn't make sense to inline 
an extent if it fits perfectly in a data block.  You could argue it's 
saving seeks, but we're also adding seeks by spreading out the metadata 
in general.  So, I'd want to see benchmarks before deciding.


If we're using it for debugging, I'd rather stick with max_inline=4095.

-chris


Re: RFC: raid with a variable stripe size

2016-11-18 Thread Timofey Titovets
2016-11-18 23:32 GMT+03:00 Janos Toth F. :
> Based on the comments of this patch, stripe size could theoretically
> go as low as 512 byte:
> https://mail-archive.com/linux-btrfs@vger.kernel.org/msg56011.html
> If these very small (0.5k-2k) stripe sizes could really work (it's
> possible to implement such changes and it does not degrade performance
> too much - or at all - to keep it so low), we could use RAID-5(/6) on
> <=9(/10) disks with 512 byte physical sectors (assuming 4k filesystem
> sector size + 4k node size, although I am not sure if node size is
> really important here) without having to worry about RMW, extra space
> waste or additional fragmentation.
>
> On Fri, Nov 18, 2016 at 7:15 PM, Goffredo Baroncelli  
> wrote:
>> Hello,
>>
>> these are only my thoughts; no code here, but I would like to share it 
>> hoping that it could be useful.
>>
>> As reported several times by Zygo (and others), one of the problem of 
>> raid5/6 is the write hole. Today BTRFS is not capable to address it.
>>
>> The problem is that the stripe size is bigger than the "sector size" (ok 
>> sector is not the correct word, but I am referring to the basic unit of 
>> writing on the disk, which is 4k or 16K in btrfs).
>> So when btrfs writes less data than the stripe, the stripe is not filled; 
>> when it is filled by a subsequent write, a RMW of the parity is required.
>>
>> On the best of my understanding (which could be very wrong) ZFS try to solve 
>> this issue using a variable length stripe.
>>
>> On BTRFS this could be achieved using several BGs (== block group or chunk), 
>> one for each stripe size.
>>
>> For example, if a filesystem - RAID5 is composed by 4 DISK, the filesystem 
>> should have three BGs:
>> BG #1,composed by two disks (1 data+ 1 parity)
>> BG #2 composed by three disks (2 data + 1 parity)
>> BG #3 composed by four disks (3 data + 1 parity).
>>
>> If the data to be written has a size of 4k, it will be allocated to the BG 
>> #1.
>> If the data to be written has a size of 8k, it will be allocated to the BG #2
>> If the data to be written has a size of 12k, it will be allocated to the BG 
>> #3
>> If the data to be written has a size greater than 12k, it will be allocated 
>> to the BG3, until the data fills a full stripes; then the remainder will be 
>> stored in BG #1 or BG #2.
>>
>>
>> To avoid unbalancing of the disk usage, each BG could use all the disks, 
>> even if a stripe uses less disks: i.e
>>
>> DISK1 DISK2 DISK3 DISK4
>> S1    S1    S1    S2
>> S2    S2    S3    S3
>> S3    S4    S4    S4
>> [...]
>>
>> Above is shown a BG which uses all the four disks, but has a stripe which 
>> spans only 3 disks.
>>
>>
>> Pro:
>> - btrfs already is capable to handle different BG in the filesystem, only 
>> the allocator has to change
>> - no more RMW are required (== higher performance)
>>
>> Cons:
>> - the data will be more fragmented
>> - the filesystem will have more BGs; this will require a time-to-time 
>> re-balance. But it is an issue which we already know (even if maybe not 
>> 100% addressed).
>>
>>
>> Thoughts ?
>>
>> BR
>> G.Baroncelli
>>
>>
>>
>> --
>> gpg @keyserver.linux.it: Goffredo Baroncelli 
>> Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5

AFAIK all drives now use a 4k physical sector size and expose 512b only
logically, so it just creates another RMW: read 4k -> modify 512b ->
write 4k, instead of just writing 512b.

-- 
Have a nice day,
Timofey.


Re: [PATCH 8/9] btrfs: use bio_for_each_segment_all in __btrfsic_submit_bio

2016-11-18 Thread Omar Sandoval
On Wed, Nov 16, 2016 at 01:52:15PM +0100, Christoph Hellwig wrote:
> And remove the bogus check for a NULL return value from kmap, which
> can't happen.  While we're at it: I don't think that kmapping up to 256
> will work without deadlocks on highmem machines, a better idea would
> be to use vm_map_ram to map all of them into a single virtual address
> range.  Incidentally that would also simplify the code a lot.

Reviewed-by: Omar Sandoval 

> Signed-off-by: Christoph Hellwig 
> ---
>  fs/btrfs/check-integrity.c | 30 +++---
>  1 file changed, 11 insertions(+), 19 deletions(-)
> 
> diff --git a/fs/btrfs/check-integrity.c b/fs/btrfs/check-integrity.c
> index a6f657f..86f681f 100644
> --- a/fs/btrfs/check-integrity.c
> +++ b/fs/btrfs/check-integrity.c
> @@ -2819,10 +2819,11 @@ static void __btrfsic_submit_bio(struct bio *bio)
>* btrfsic_mount(), this might return NULL */
>   dev_state = btrfsic_dev_state_lookup(bio->bi_bdev);
>   if (NULL != dev_state &&
> - (bio_op(bio) == REQ_OP_WRITE) && NULL != bio->bi_io_vec) {
> + (bio_op(bio) == REQ_OP_WRITE) && bio_has_data(bio)) {
>   unsigned int i;
>   u64 dev_bytenr;
>   u64 cur_bytenr;
> + struct bio_vec *bvec;
>   int bio_is_patched;
>   char **mapped_datav;
>  
> @@ -2840,32 +2841,23 @@ static void __btrfsic_submit_bio(struct bio *bio)
>   if (!mapped_datav)
>   goto leave;
>   cur_bytenr = dev_bytenr;
> - for (i = 0; i < bio->bi_vcnt; i++) {
> - BUG_ON(bio->bi_io_vec[i].bv_len != PAGE_SIZE);
> - mapped_datav[i] = kmap(bio->bi_io_vec[i].bv_page);
> - if (!mapped_datav[i]) {
> - while (i > 0) {
> - i--;
> - kunmap(bio->bi_io_vec[i].bv_page);
> - }
> - kfree(mapped_datav);
> - goto leave;
> - }
> +
> + bio_for_each_segment_all(bvec, bio, i) {
> + BUG_ON(bvec->bv_len != PAGE_SIZE);
> + mapped_datav[i] = kmap(bvec->bv_page);
> +
>   if (dev_state->state->print_mask &
>   BTRFSIC_PRINT_MASK_SUBMIT_BIO_BH_VERBOSE)
>   pr_info("#%u: bytenr=%llu, len=%u, offset=%u\n",
> -i, cur_bytenr, bio->bi_io_vec[i].bv_len,
> -bio->bi_io_vec[i].bv_offset);
> - cur_bytenr += bio->bi_io_vec[i].bv_len;
> +i, cur_bytenr, bvec->bv_len, 
> bvec->bv_offset);
> + cur_bytenr += bvec->bv_len;
>   }
>   btrfsic_process_written_block(dev_state, dev_bytenr,
> mapped_datav, bio->bi_vcnt,
> bio, _is_patched,
> NULL, bio->bi_opf);
> - while (i > 0) {
> - i--;
> - kunmap(bio->bi_io_vec[i].bv_page);
> - }
> + bio_for_each_segment_all(bvec, bio, i)
> + kunmap(bvec->bv_page);
>   kfree(mapped_datav);
>   } else if (NULL != dev_state && (bio->bi_opf & REQ_PREFLUSH)) {
>   if (dev_state->state->print_mask &
> -- 
> 2.1.4
> 

-- 
Omar


Re: RFC: raid with a variable stripe size

2016-11-18 Thread Timofey Titovets
2016-11-18 21:15 GMT+03:00 Goffredo Baroncelli :
> Hello,
>
> these are only my thoughts; no code here, but I would like to share it hoping 
> that it could be useful.
>
> As reported several times by Zygo (and others), one of the problem of raid5/6 
> is the write hole. Today BTRFS is not capable to address it.
>
> The problem is that the stripe size is bigger than the "sector size" (ok 
> sector is not the correct word, but I am referring to the basic unit of 
> writing on the disk, which is 4k or 16K in btrfs).
> So when btrfs writes less data than the stripe, the stripe is not filled; 
> when it is filled by a subsequent write, a RMW of the parity is required.
>
> On the best of my understanding (which could be very wrong) ZFS try to solve 
> this issue using a variable length stripe.
>
> On BTRFS this could be achieved using several BGs (== block group or chunk), 
> one for each stripe size.
>
> For example, if a filesystem - RAID5 is composed by 4 DISK, the filesystem 
> should have three BGs:
> BG #1,composed by two disks (1 data+ 1 parity)
> BG #2 composed by three disks (2 data + 1 parity)
> BG #3 composed by four disks (3 data + 1 parity).
>
> If the data to be written has a size of 4k, it will be allocated to the BG #1.
> If the data to be written has a size of 8k, it will be allocated to the BG #2
> If the data to be written has a size of 12k, it will be allocated to the BG #3
> If the data to be written has a size greater than 12k, it will be allocated 
> to the BG3, until the data fills a full stripes; then the remainder will be 
> stored in BG #1 or BG #2.
>
>
> To avoid unbalancing of the disk usage, each BG could use all the disks, even 
> if a stripe uses less disks: i.e
>
> DISK1 DISK2 DISK3 DISK4
> S1    S1    S1    S2
> S2    S2    S3    S3
> S3    S4    S4    S4
> [...]
>
> Above is shown a BG which uses all the four disks, but has a stripe which 
> spans only 3 disks.
>
>
> Pro:
> - btrfs already is capable to handle different BG in the filesystem, only the 
> allocator has to change
> - no more RMW are required (== higher performance)
>
> Cons:
> - the data will be more fragmented
> - the filesystem will have more BGs; this will require a time-to-time 
> re-balance. But it is an issue which we already know (even if maybe not 
> 100% addressed).
>
>
> Thoughts ?
>
> BR
> G.Baroncelli
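
As a rough illustration of the allocation rule described above (a sketch
only; 4 disks, 4K sectors and the three BG widths from the proposal are
assumed):

SECTOR = 4096
BG_WIDTHS = [2, 3, 4]           # disks per stripe: 1+1, 2+1, 3+1 (data+parity)
FULL_STRIPE = (4 - 1) * SECTOR  # 12K of data fills a full-width stripe

def pick_block_groups(write_size):
    # Return a list of (bg_width, bytes) chunks for one write.
    chunks = []
    # Fill as many full-width stripes as possible first.
    while write_size > FULL_STRIPE:
        chunks.append((4, FULL_STRIPE))
        write_size -= FULL_STRIPE
    # The remainder goes to the narrowest BG whose stripe it fills exactly.
    data_sectors = (write_size + SECTOR - 1) // SECTOR  # 1, 2 or 3
    if data_sectors:
        chunks.append((BG_WIDTHS[data_sectors - 1], write_size))
    return chunks

print(pick_block_groups(4096))    # [(2, 4096)]             -> BG #1
print(pick_block_groups(8192))    # [(3, 8192)]             -> BG #2
print(pick_block_groups(20480))   # [(4, 12288), (3, 8192)] -> BG #3 + BG #2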

AFAIK, it's difficult to do such things with btrfs, because btrfs uses
chunk allocation for metadata & data; ZFS, AFAIK, works with the storage
more directly, so it spans a file across the different disks directly.

Maybe it can be implemented by some chunk allocator rework, I don't know.

Fix me if I'm wrong, thanks.

-- 
Have a nice day,
Timofey.


Re: RFC: raid with a variable stripe size

2016-11-18 Thread Janos Toth F.
Based on the comments of this patch, stripe size could theoretically
go as low as 512 byte:
https://mail-archive.com/linux-btrfs@vger.kernel.org/msg56011.html
If these very small (0.5k-2k) stripe sizes could really work (it's
possible to implement such changes and it does not degrade performance
too much - or at all - to keep it so low), we could use RAID-5(/6) on
<=9(/10) disks with 512 byte physical sectors (assuming 4k filesystem
sector size + 4k node size, although I am not sure if node size is
really important here) without having to worry about RMW, extra space
waste or additional fragmentation.

On Fri, Nov 18, 2016 at 7:15 PM, Goffredo Baroncelli  wrote:
> Hello,
>
> these are only my thoughts; no code here, but I would like to share it hoping 
> that it could be useful.
>
> As reported several times by Zygo (and others), one of the problem of raid5/6 
> is the write hole. Today BTRFS is not capable to address it.
>
> The problem is that the stripe size is bigger than the "sector size" (ok 
> sector is not the correct word, but I am referring to the basic unit of 
> writing on the disk, which is 4k or 16K in btrfs).
> So when btrfs writes less data than the stripe, the stripe is not filled; 
> when it is filled by a subsequent write, a RMW of the parity is required.
>
> On the best of my understanding (which could be very wrong) ZFS try to solve 
> this issue using a variable length stripe.
>
> On BTRFS this could be achieved using several BGs (== block group or chunk), 
> one for each stripe size.
>
> For example, if a filesystem - RAID5 is composed by 4 DISK, the filesystem 
> should have three BGs:
> BG #1,composed by two disks (1 data+ 1 parity)
> BG #2 composed by three disks (2 data + 1 parity)
> BG #3 composed by four disks (3 data + 1 parity).
>
> If the data to be written has a size of 4k, it will be allocated to the BG #1.
> If the data to be written has a size of 8k, it will be allocated to the BG #2
> If the data to be written has a size of 12k, it will be allocated to the BG #3
> If the data to be written has a size greater than 12k, it will be allocated 
> to the BG3, until the data fills a full stripes; then the remainder will be 
> stored in BG #1 or BG #2.
>
>
> To avoid unbalancing of the disk usage, each BG could use all the disks, even 
> if a stripe uses less disks: i.e
>
> DISK1 DISK2 DISK3 DISK4
> S1    S1    S1    S2
> S2    S2    S3    S3
> S3    S4    S4    S4
> [...]
>
> Above is shown a BG which uses all the four disks, but has a stripe which 
> spans only 3 disks.
>
>
> Pro:
> - btrfs already is capable to handle different BG in the filesystem, only the 
> allocator has to change
> - no more RMW are required (== higher performance)
>
> Cons:
> - the data will be more fragmented
> - the filesystem will have more BGs; this will require a time-to-time 
> re-balance. But it is an issue which we already know (even if maybe not 
> 100% addressed).
>
>
> Thoughts ?
>
> BR
> G.Baroncelli
>
>
>
> --
> gpg @keyserver.linux.it: Goffredo Baroncelli 
> Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5


Re: [PATCH 7/9] btrfs: refactor __btrfs_lookup_bio_sums to use bio_for_each_segment_all

2016-11-18 Thread Omar Sandoval
On Wed, Nov 16, 2016 at 01:52:14PM +0100, Christoph Hellwig wrote:
> Rework the loop a little bit to use the generic bio_for_each_segment_all
> helper for iterating over the bio.

One minor nit. Besides that,

Reviewed-by: Omar Sandoval 

> Signed-off-by: Christoph Hellwig 
> ---
>  fs/btrfs/file-item.c | 31 +++
>  1 file changed, 11 insertions(+), 20 deletions(-)
> 
> diff --git a/fs/btrfs/file-item.c b/fs/btrfs/file-item.c
> index fa8aa53..54ccb91 100644
> --- a/fs/btrfs/file-item.c
> +++ b/fs/btrfs/file-item.c
> @@ -163,7 +163,7 @@ static int __btrfs_lookup_bio_sums(struct btrfs_root 
> *root,
>  struct inode *inode, struct bio *bio,
>  u64 logical_offset, u32 *dst, int dio)
>  {
> - struct bio_vec *bvec = bio->bi_io_vec;
> + struct bio_vec *bvec;
>   struct btrfs_io_bio *btrfs_bio = btrfs_io_bio(bio);
>   struct btrfs_csum_item *item = NULL;
>   struct extent_io_tree *io_tree = _I(inode)->io_tree;
> @@ -177,7 +177,7 @@ static int __btrfs_lookup_bio_sums(struct btrfs_root 
> *root,
>   u32 diff;
>   int nblocks;
>   int bio_index = 0;
> - int count;
> + int count = 0;
>   u16 csum_size = btrfs_super_csum_size(root->fs_info->super_copy);
>  
>   path = btrfs_alloc_path();
> @@ -223,8 +223,11 @@ static int __btrfs_lookup_bio_sums(struct btrfs_root 
> *root,
>   if (dio)
>   offset = logical_offset;
>  
> - page_bytes_left = bvec->bv_len;
> - while (bio_index < bio->bi_vcnt) {
> + bio_for_each_segment_all(bvec, bio, bio_index) {

Can we just rename bio_index to i now?

> + page_bytes_left = bvec->bv_len;
> + if (count)
> + goto next;
> +
>   if (!dio)
>   offset = page_offset(bvec->bv_page) + bvec->bv_offset;
>   count = btrfs_find_ordered_sum(inode, offset, disk_bytenr,
> @@ -285,29 +288,17 @@ static int __btrfs_lookup_bio_sums(struct btrfs_root 
> *root,
>  found:
>   csum += count * csum_size;
>   nblocks -= count;
> -
> +next:
>   while (count--) {
>   disk_bytenr += root->sectorsize;
>   offset += root->sectorsize;
>   page_bytes_left -= root->sectorsize;
> - if (!page_bytes_left) {
> - bio_index++;
> - /*
> -  * make sure we're still inside the
> -  * bio before we update page_bytes_left
> -  */
> - if (bio_index >= bio->bi_vcnt) {
> - WARN_ON_ONCE(count);
> - goto done;
> - }
> - bvec++;
> - page_bytes_left = bvec->bv_len;
> - }
> -
> + if (!page_bytes_left)
> + break; /* move to next bio */
>   }
>   }
>  
> -done:
> + WARN_ON_ONCE(count);
>   btrfs_free_path(path);
>   return 0;
>  }
> -- 
> 2.1.4
> 

-- 
Omar


Re: [PATCH 6/9] btrfs: calculate end of bio offset properly

2016-11-18 Thread Omar Sandoval
On Wed, Nov 16, 2016 at 01:52:13PM +0100, Christoph Hellwig wrote:
> Use the bvec offset and len members to prepare for multipage bvecs.
> 
> Signed-off-by: Christoph Hellwig 
> ---
>  fs/btrfs/compression.c | 10 --
>  1 file changed, 8 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/btrfs/compression.c b/fs/btrfs/compression.c
> index 8618ac3..27e9feb 100644
> --- a/fs/btrfs/compression.c
> +++ b/fs/btrfs/compression.c
> @@ -445,6 +445,13 @@ int btrfs_submit_compressed_write(struct inode *inode, 
> u64 start,
>   return 0;
>  }
>  
> +static u64 bio_end_offset(struct bio *bio)
> +{
> + struct bio_vec *last = >bi_io_vec[bio->bi_vcnt - 1];
> +
> + return page_offset(last->bv_page) + last->bv_len - last->bv_offset;

Why is this minus bv_offset and not plus? Am I misunderstanding
bv_offset?

> +}
> +
>  static noinline int add_ra_bio_pages(struct inode *inode,
>u64 compressed_end,
>struct compressed_bio *cb)
> @@ -463,8 +470,7 @@ static noinline int add_ra_bio_pages(struct inode *inode,
>   u64 end;
>   int misses = 0;
>  
> - page = cb->orig_bio->bi_io_vec[cb->orig_bio->bi_vcnt - 1].bv_page;
> - last_offset = (page_offset(page) + PAGE_SIZE);
> + last_offset = bio_end_offset(cb->orig_bio);
>   em_tree = _I(inode)->extent_tree;
>   tree = _I(inode)->io_tree;
>  
> -- 
> 2.1.4
> 

-- 
Omar


Re: [PATCH 5/9] btrfs: use bi_size

2016-11-18 Thread Omar Sandoval
On Wed, Nov 16, 2016 at 01:52:12PM +0100, Christoph Hellwig wrote:
> Instead of using bi_vcnt to calculate it.

Reviewed-by: Omar Sandoval 

> Signed-off-by: Christoph Hellwig 
> ---
>  fs/btrfs/compression.c | 7 ++-
>  1 file changed, 2 insertions(+), 5 deletions(-)
> 
> diff --git a/fs/btrfs/compression.c b/fs/btrfs/compression.c
> index 12a631d..8618ac3 100644
> --- a/fs/btrfs/compression.c
> +++ b/fs/btrfs/compression.c
> @@ -562,7 +562,6 @@ static noinline int add_ra_bio_pages(struct inode *inode,
>   *
>   * bio->bi_iter.bi_sector points to the compressed extent on disk
>   * bio->bi_io_vec points to all of the inode pages
> - * bio->bi_vcnt is a count of pages
>   *
>   * After the compressed pages are read, we copy the bytes into the
>   * bio we were passed and then call the bio end_io calls
> @@ -574,7 +573,6 @@ int btrfs_submit_compressed_read(struct inode *inode, 
> struct bio *bio,
>   struct extent_map_tree *em_tree;
>   struct compressed_bio *cb;
>   struct btrfs_root *root = BTRFS_I(inode)->root;
> - unsigned long uncompressed_len = bio->bi_vcnt * PAGE_SIZE;
>   unsigned long compressed_len;
>   unsigned long nr_pages;
>   unsigned long pg_index;
> @@ -619,7 +617,7 @@ int btrfs_submit_compressed_read(struct inode *inode, 
> struct bio *bio,
>   free_extent_map(em);
>   em = NULL;
>  
> - cb->len = uncompressed_len;
> + cb->len = bio->bi_iter.bi_size;
>   cb->compressed_len = compressed_len;
>   cb->compress_type = extent_compress_type(bio_flags);
>   cb->orig_bio = bio;
> @@ -647,8 +645,7 @@ int btrfs_submit_compressed_read(struct inode *inode, 
> struct bio *bio,
>   add_ra_bio_pages(inode, em_start + em_len, cb);
>  
>   /* include any pages we added in add_ra-bio_pages */
> - uncompressed_len = bio->bi_vcnt * PAGE_SIZE;
> - cb->len = uncompressed_len;
> + cb->len = bio->bi_iter.bi_size;
>  
>   comp_bio = compressed_bio_alloc(bdev, cur_disk_byte, GFP_NOFS);
>   if (!comp_bio)
> -- 
> 2.1.4
> 

-- 
Omar


Re: [PATCH 4/9] btrfs: don't access the bio directly in btrfs_csum_one_bio

2016-11-18 Thread Omar Sandoval
On Wed, Nov 16, 2016 at 01:52:11PM +0100, Christoph Hellwig wrote:
> Use bio_for_each_segment_all to iterate over the segments instead.
> This requires a bit of reshuffling so that we only lookup up the ordered
> item once inside the bio_for_each_segment_all loop.

Reviewed-by: Omar Sandoval 

> Signed-off-by: Christoph Hellwig 
> ---
>  fs/btrfs/file-item.c | 21 ++---
>  1 file changed, 10 insertions(+), 11 deletions(-)
> 
> diff --git a/fs/btrfs/file-item.c b/fs/btrfs/file-item.c
> index d0d571c..fa8aa53 100644
> --- a/fs/btrfs/file-item.c
> +++ b/fs/btrfs/file-item.c
> @@ -447,13 +447,12 @@ int btrfs_csum_one_bio(struct btrfs_root *root, struct 
> inode *inode,
>  struct bio *bio, u64 file_start, int contig)
>  {
>   struct btrfs_ordered_sum *sums;
> - struct btrfs_ordered_extent *ordered;
> + struct btrfs_ordered_extent *ordered = NULL;
>   char *data;
> - struct bio_vec *bvec = bio->bi_io_vec;
> - int bio_index = 0;
> + struct bio_vec *bvec;
>   int index;
>   int nr_sectors;
> - int i;
> + int i, j;
>   unsigned long total_bytes = 0;
>   unsigned long this_sum_bytes = 0;
>   u64 offset;
> @@ -470,17 +469,20 @@ int btrfs_csum_one_bio(struct btrfs_root *root, struct 
> inode *inode,
>   if (contig)
>   offset = file_start;
>   else
> - offset = page_offset(bvec->bv_page) + bvec->bv_offset;
> + offset = 0; /* shut up gcc */
>  
> - ordered = btrfs_lookup_ordered_extent(inode, offset);
> - BUG_ON(!ordered); /* Logic error */
>   sums->bytenr = (u64)bio->bi_iter.bi_sector << 9;
>   index = 0;
>  
> - while (bio_index < bio->bi_vcnt) {
> + bio_for_each_segment_all(bvec, bio, j) {
>   if (!contig)
>   offset = page_offset(bvec->bv_page) + bvec->bv_offset;
>  
> + if (!ordered) {
> + ordered = btrfs_lookup_ordered_extent(inode, offset);
> + BUG_ON(!ordered); /* Logic error */
> + }
> +
>   data = kmap_atomic(bvec->bv_page);
>  
>   nr_sectors = BTRFS_BYTES_TO_BLKS(root->fs_info,
> @@ -529,9 +531,6 @@ int btrfs_csum_one_bio(struct btrfs_root *root, struct 
> inode *inode,
>   }
>  
>   kunmap_atomic(data);
> -
> - bio_index++;
> - bvec++;
>   }
>   this_sum_bytes = 0;
>   btrfs_add_ordered_sum(inode, ordered, sums);
> -- 
> 2.1.4
> 

-- 
Omar


Re: [PATCH 3/9] btrfs: don't access the bio directly in the raid5/6 code

2016-11-18 Thread Omar Sandoval
On Wed, Nov 16, 2016 at 01:52:10PM +0100, Christoph Hellwig wrote:
> Just use bio_for_each_segment_all to iterate over all segments.

The subject seems to be copied from patch 2 for this one. Otherwise,

Reviewed-by: Omar Sandoval 

> Signed-off-by: Christoph Hellwig 
> ---
>  fs/btrfs/inode.c | 7 +++
>  1 file changed, 3 insertions(+), 4 deletions(-)
> 
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index 147df4c..3f09cb6 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -8394,7 +8394,7 @@ static int btrfs_submit_direct_hook(struct 
> btrfs_dio_private *dip,
>   struct btrfs_root *root = BTRFS_I(inode)->root;
>   struct bio *bio;
>   struct bio *orig_bio = dip->orig_bio;
> - struct bio_vec *bvec = orig_bio->bi_io_vec;
> + struct bio_vec *bvec;
>   u64 start_sector = orig_bio->bi_iter.bi_sector;
>   u64 file_offset = dip->logical_offset;
>   u64 submit_len = 0;
> @@ -8403,7 +8403,7 @@ static int btrfs_submit_direct_hook(struct 
> btrfs_dio_private *dip,
>   int async_submit = 0;
>   int nr_sectors;
>   int ret;
> - int i;
> + int i, j;
>  
>   map_length = orig_bio->bi_iter.bi_size;
>   ret = btrfs_map_block(root->fs_info, btrfs_op(orig_bio),
> @@ -8433,7 +8433,7 @@ static int btrfs_submit_direct_hook(struct 
> btrfs_dio_private *dip,
>   btrfs_io_bio(bio)->logical = file_offset;
>   atomic_inc(>pending_bios);
>  
> - while (bvec <= (orig_bio->bi_io_vec + orig_bio->bi_vcnt - 1)) {
> + bio_for_each_segment_all(bvec, orig_bio, j) {
>   nr_sectors = BTRFS_BYTES_TO_BLKS(root->fs_info, bvec->bv_len);
>   i = 0;
>  next_block:
> @@ -8487,7 +8487,6 @@ static int btrfs_submit_direct_hook(struct 
> btrfs_dio_private *dip,
>   i++;
>   goto next_block;
>   }
> - bvec++;
>   }
>   }
>  
> -- 
> 2.1.4
> 

-- 
Omar


Re: [PATCH 2/9] btrfs: don't access the bio directly in the raid5/6 code

2016-11-18 Thread Omar Sandoval
On Wed, Nov 16, 2016 at 01:52:09PM +0100, Christoph Hellwig wrote:
> Just use bio_for_each_segment_all to iterate over all segments.
> 

Besides the minor whitespace issue below,

Reviewed-by: Omar Sandoval 

> Signed-off-by: Christoph Hellwig 
> ---
>  fs/btrfs/raid56.c | 16 ++--
>  1 file changed, 6 insertions(+), 10 deletions(-)
> 
> diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c
> index d016d4a..da941fb 100644
> --- a/fs/btrfs/raid56.c
> +++ b/fs/btrfs/raid56.c
> @@ -1144,10 +1144,10 @@ static void validate_rbio_for_rmw(struct 
> btrfs_raid_bio *rbio)
>  static void index_rbio_pages(struct btrfs_raid_bio *rbio)
>  {
>   struct bio *bio;
> + struct bio_vec *bvec;
>   u64 start;
>   unsigned long stripe_offset;
>   unsigned long page_index;
> - struct page *p;
>   int i;
>  
>   spin_lock_irq(>bio_list_lock);
> @@ -1156,10 +1156,8 @@ static void index_rbio_pages(struct btrfs_raid_bio 
> *rbio)
>   stripe_offset = start - rbio->bbio->raid_map[0];
>   page_index = stripe_offset >> PAGE_SHIFT;
>  
> - for (i = 0; i < bio->bi_vcnt; i++) {
> - p = bio->bi_io_vec[i].bv_page;
> - rbio->bio_pages[page_index + i] = p;
> - }
> + bio_for_each_segment_all(bvec, bio, i)
> + rbio->bio_pages[page_index+ i] = bvec->bv_page;

Missing a space right here --^

>   }
>   spin_unlock_irq(>bio_list_lock);
>  }
> @@ -1433,13 +1431,11 @@ static int fail_bio_stripe(struct btrfs_raid_bio 
> *rbio,
>   */
>  static void set_bio_pages_uptodate(struct bio *bio)
>  {
> + struct bio_vec *bvec;
>   int i;
> - struct page *p;
>  
> - for (i = 0; i < bio->bi_vcnt; i++) {
> - p = bio->bi_io_vec[i].bv_page;
> - SetPageUptodate(p);
> - }
> + bio_for_each_segment_all(bvec, bio, i)
> + SetPageUptodate(bvec->bv_page);
>  }
>  
>  /*
> -- 
> 2.1.4
> 

-- 
Omar


Re: [PATCH] btrfs-progs: check: fix missing newlines

2016-11-18 Thread David Sterba
On Wed, Nov 16, 2016 at 12:54:32PM -0800, Omar Sandoval wrote:
> From: Omar Sandoval 
> 
> Also, the other progress messages go to stderr, so "checking extents"
> probably should, as well.
> 
> Fixes: c7a1f66a205f ("btrfs-progs: check: switch some messages to common 
> helpers")
> Signed-off-by: Omar Sandoval 
> ---
> As a side note, it seems almost completely random whether we print to
> stdout or stderr for any given message. That could probably use some
> cleaning up for consistency. A quick run of e2fsck indicated that it
> prints almost everything on stdout except for usage and administrative
> problems. xfs_repair just seems to put everything in stderr. I
> personally like the e2fsck approach. Anyone have any preference?

Cleaning up the messages is ongoing work, most error messages have been
converted. In case of 'check', I think that stdout is good to capture
normal and error messages (so no error messages are accidentally lost if
the user runs just "check > log" instead of "check >& log").

For that printf is still the way to print them. Besides, the verbosity
level could be improved; we've had complaints about that since forever.

Patch applied.


Re: [PATCH] btrfs-progs: qgroup: fix error in ASSERT condition expression

2016-11-18 Thread David Sterba
On Thu, Nov 17, 2016 at 02:01:31PM +0900, Tsutomu Itoh wrote:
> Option -f, -F and --sort don't work because a conditional expression
> of ASSERT is wrong.
> 
> Signed-off-by: Tsutomu Itoh 

Applied, thanks.


Re: [PATCH 1/3] btrfs-progs: test: fix error of test target of Makefile

2016-11-18 Thread David Sterba
On Fri, Nov 18, 2016 at 02:44:12PM +0900, Tsutomu Itoh wrote:
> Add test-cli to test target of Makefile.
> 
> Signed-off-by: Tsutomu Itoh 

Applied, thanks.


Re: [PATCH 2/3] btrfs-progs: test: expand size of test device of fsck-tests 013

2016-11-18 Thread David Sterba
On Fri, Nov 18, 2016 at 02:36:28PM +0800, Qu Wenruo wrote:
> 
> 
> On 11/18/2016 01:47 PM, Tsutomu Itoh wrote:
> > In my test environment, the size of /lib/modules/`uname -r`/* is
> > larger than 1GB, so the test data cannot be copied to testdev.
> > Therefore we need to expand the size of testdev.
> 
> IMHO the correct fix is to enhance populate_fs() to fill the fs to a
> given size, rather than relying on the unreliable copy.

Yeah, that would be better; the contents of the modules dir could be
anything, but it was enough for initial testing.


RFC: raid with a variable stripe size

2016-11-18 Thread Goffredo Baroncelli
Hello,

these are only my thoughts; there is no code here, but I would like to share
them hoping they could be useful.

As reported several times by Zygo (and others), one of the problems of raid5/6
is the write hole. Today BTRFS is not capable of addressing it.

The problem is that the stripe size is bigger than the "sector size" (OK, sector
is not the correct word, but I am referring to the basic unit of writing on the
disk, which is 4k or 16K in btrfs).
So when btrfs writes less data than a full stripe, the stripe is not filled; when
it is filled by a subsequent write, an RMW of the parity is required.

To the best of my understanding (which could be very wrong), ZFS tries to solve
this issue using a variable-length stripe.

On BTRFS this could be achieved using several BGs (== block group or chunk), 
one for each stripe size.

For example, if a RAID5 filesystem is composed of 4 disks, the filesystem
should have three BGs:
BG #1, composed of two disks (1 data + 1 parity)
BG #2, composed of three disks (2 data + 1 parity)
BG #3, composed of four disks (3 data + 1 parity).

If the data to be written has a size of 4k, it will be allocated to BG #1.
If the data to be written has a size of 8k, it will be allocated to BG #2.
If the data to be written has a size of 12k, it will be allocated to BG #3.
If the data to be written has a size greater than 12k, it will be allocated to
BG #3 until the data fills full stripes; then the remainder will be stored
in BG #1 or BG #2 (see the sketch below).
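
To make that concrete, here is a minimal sketch of the selection in Python
(illustration only, assuming a 4-disk RAID5 and a 4k write unit; none of
these helpers exist in btrfs today):

STRIPE_ELEMENT = 4 * 1024      # basic unit of writing, 4k in this example
MAX_DATA_STRIPES = 3           # 4 disks - 1 parity

def split_write(length):
    # Full-width stripes go to the widest BG (#3); the tail goes to the
    # BG whose data-stripe count matches what is left over.
    chunks = []
    full_stripe = MAX_DATA_STRIPES * STRIPE_ELEMENT
    while length >= full_stripe:
        chunks.append((MAX_DATA_STRIPES, full_stripe))
        length -= full_stripe
    if length:
        data_stripes = -(-length // STRIPE_ELEMENT)   # round up to elements
        chunks.append((data_stripes, length))
    return chunks

print(split_write(4096))    # [(1, 4096)]             -> BG #1
print(split_write(8192))    # [(2, 8192)]             -> BG #2
print(split_write(20480))   # [(3, 12288), (2, 8192)] -> BG #3, then BG #2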


To avoid unbalanced disk usage, each BG could use all the disks, even
if a stripe uses fewer disks, i.e.:

DISK1 DISK2 DISK3 DISK4
S1S1S1S2
S2S2S3S3
S3S4S4S4
[]

Above is shown a BG which uses all four disks, but whose stripes each span
only 3 disks; a small sketch of this layout follows.
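
This sketch reproduces the layout above (again only an illustration): stripes
of width 3 are laid out consecutively over the 4 disks, so usage stays
balanced even though each individual stripe spans only 3 of them.

def layout(num_disks, stripe_width, num_stripes):
    rows = []
    for slot in range(stripe_width * num_stripes):
        stripe = slot // stripe_width + 1     # S1, S2, ...
        disk = slot % num_disks
        row = slot // num_disks
        if row == len(rows):
            rows.append(["  "] * num_disks)
        rows[row][disk] = "S%d" % stripe
    return rows

for row in layout(num_disks=4, stripe_width=3, num_stripes=4):
    print("    ".join(row))
# S1    S1    S1    S2
# S2    S2    S3    S3
# S3    S4    S4    S4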


Pros:
- btrfs is already capable of handling different BGs in the filesystem; only
the allocator has to change
- no more RMWs are required (== higher performance)

Cons:
- the data will be more fragmented
- the filesystem will have more BGs; this will require a re-balance from time
to time. But this is an issue which we already know about (even if it may not
be 100% addressed).


Thoughts ?

BR
G.Baroncelli



-- 
gpg @keyserver.linux.it: Goffredo Baroncelli 
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5


Re: [PATCH 3/3] btrfs-progs: test: fix convert-tests 004 failure

2016-11-18 Thread David Sterba
On Fri, Nov 18, 2016 at 02:49:51PM +0900, Tsutomu Itoh wrote:
> convert-tests 004 failed because the argument to check_all_images
> is not specified.
> 
> # make test-convert
> [TEST]   convert-tests.sh
> [TEST/conv]   004-ext2-backup-superblock-ranges
> find: '': No such file or directory
> #
> 
> Signed-off-by: Tsutomu Itoh 
> ---
>  tests/convert-tests/004-ext2-backup-superblock-ranges/test.sh | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/tests/convert-tests/004-ext2-backup-superblock-ranges/test.sh 
> b/tests/convert-tests/004-ext2-backup-superblock-ranges/test.sh
> index c56650b..d764ed8 100755
> --- a/tests/convert-tests/004-ext2-backup-superblock-ranges/test.sh
> +++ b/tests/convert-tests/004-ext2-backup-superblock-ranges/test.sh
> @@ -39,4 +39,4 @@ function check_image() {
>   rm -f $TEST_DEV
>  }
>  
> -check_all_images
> +check_all_images `pwd`

I'll add a fallback to use the current directory instead of forcing all
callsites to supply it.


Re: [PATCH v2 0/6] btrfs-progs: better space_cache=v2 support

2016-11-18 Thread David Sterba
On Mon, Nov 14, 2016 at 10:43:17AM -0800, Omar Sandoval wrote:
> Cover letter from v1:
> 
> This series implements some support for space_cache=v2 in btrfs-progs.
> In particular, this adds support for `btrfs check --clear-space-cache v2`,
> proper printing of the free space tree flags for `btrfs inspect-internal
> dump-super`, and better documentation.
> 
> We'd previously talked about always making btrfs-progs always invalidate
> the free space tree when doing write operations, but I decided that this
> should be an action explicitly requested by the user. It'd also be
> unsafe if using a kernel without the free space tree valid bit support,
> which is why I didn't implement a `btrfs check --invalidate-free-space-cache`
> option. Doing the full clear is always safe.
> 
> Still missing is full read-write support, but this should hopefully
> cover most btrfs-progs usage.
> 
> Changes since v1:
> 
> - Change unsigned -> unsigned int argument to
>   btrfs_check_fs_compatability() in patch 3
> - Remove BUG_ON() in btrfs_del_root() in patch 4
> - Return error from btrfs_free_tree_block() in patch 4
> - Handle errors from btrfs_free_tree_block() and clean_tree_block() in
>   patch 4
> - Add Qu Wenruo's Reviewed-by to patches 3, 4, and 5
> 
> Thanks!
> 
> Omar Sandoval (6):
>   btrfs-progs: add the FREE_SPACE_TREE_VALID compat_ro bit definition
>   btrfs-progs: format FREE_SPACE_TREE{,_VALID} nicely in dump-super
>   btrfs-progs: add OPEN_CTREE_INVALIDATE_FST flag
>   btrfs-progs: add btrfs_clear_free_space_tree() from the kernel
>   btrfs-progs: implement btrfs check --clear-space-cache v2
>   btrfs-progs: document space_cache=v2 more thoroughly

Applied, thanks.


Re: [PATCH 0/2] RAID5/6 scrub race fix

2016-11-18 Thread Goffredo Baroncelli
Hi Zygo
On 2016-11-18 00:13, Zygo Blaxell wrote:
> On Tue, Nov 15, 2016 at 10:50:22AM +0800, Qu Wenruo wrote:
>> Fix the so-called famous RAID5/6 scrub error.
>>
>> Thanks Goffredo Baroncelli for reporting the bug, and make it into our
>> sight.
>> (Yes, without the Phoronix report on this,
>> https://www.phoronix.com/scan.php?page=news_item=Btrfs-RAID-56-Is-Bad,
>> I won't ever be aware of it)
> 
> If you're hearing about btrfs RAID5 bugs for the first time through
> Phoronix, then your testing coverage is *clearly* inadequate.
> 
> Fill up a RAID5 array, start a FS stress test, pull a drive out while
> that's running, let the FS stress test run for another hour, then try
> to replace or delete the missing device.  If there are any crashes,
> corruptions, or EIO during any part of this process (assuming all the
> remaining disks are healthy), then btrfs RAID5 is still broken, and
> you've found another bug to fix.
> 
> The fact that so many problems in btrfs can still be found this way
> indicates to me that nobody is doing this basic level of testing
> (or if they are, they're not doing anything about the results).

[...]

Sorry, but I don't find this kind of discussion useful.
Yes, BTRFS RAID5/6 needs a lot of care. Yes, *our* test coverage is far from
complete; but this is not the fault of a single person, and Qu tried to solve
one issue, for which we should only say thanks.

Even if you don't find the work of Qu (and my little one :-) ) valuable, it
required some time and needs to be respected.

Finally, I don't think that we should compare the RAID hole with this kind of
bug(fix). The former is a design issue; the latter is a bug in one of the basic
features of a raid system (recovering from the loss of a disk or from
corruption).

Even the MD subsystem (which is far behind btrfs) tolerated the RAID hole
until last year. And its solution is far from cheap (basically the MD
subsystem writes the data first to the journal and then to the disk... which is
the kind of issue that a COW filesystem would solve).

BR
G.Baroncelli






-- 
gpg @keyserver.linux.it: Goffredo Baroncelli 
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5


Re: [PATCH] btrfs-progs: send: fix failure of xfstests btrfs/038

2016-11-18 Thread David Sterba
On Thu, Nov 17, 2016 at 11:24:41AM -0800, Omar Sandoval wrote:
> On Tue, Nov 15, 2016 at 05:16:27PM +0900, Tsutomu Itoh wrote:
> > The following patch was imperfect, so xfstests btrfs/038 failed.
> > 
> >   6d4fb3d  btrfs-progs: send: fix handling of multiple snapshots (-p option)
> > 
> > [before]
> > | # ./check btrfs/038
> > | FSTYP -- btrfs
> > | PLATFORM  -- Linux/x86_64 luna 4.9.0-rc5
> > | MKFS_OPTIONS  -- /dev/sdb3
> > | MOUNT_OPTIONS -- /dev/sdb3 /test6
> > |
> > | btrfs/038 1s ... [failed, exit status 1] - output mismatch (see 
> > /For_RT/xfstests2/results//btrfs/038.out.bad)
> > | --- tests/btrfs/038.out 2015-08-04 16:09:38.0 +0900
> > | +++ /For_RT/xfstests2/results//btrfs/038.out.bad2016-11-15 
> > 13:39:48.589435290 +0900
> > | @@ -7,3 +7,5 @@
> > |  XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
> > |  wrote 1/1 bytes at offset 30
> > |  XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
> > | +failed: '/usr/local/bin/btrfs send -p /test6/mysnap1 -c 
> > /test6/clones_snap /test6/mysnap2 -f /tmp/tmp.aLpvAC7lsx/2.snap'
> > | +(see /For_RT/xfstests2/results//btrfs/038.full for details)
> > | ...
> > | (Run 'diff -u tests/btrfs/038.out 
> > /For_RT/xfstests2/results//btrfs/038.out.bad'  to see the entire diff)
> > | Ran: btrfs/038
> > | Failures: btrfs/038
> > 
> > [after]
> > | # ./check btrfs/038
> > | FSTYP -- btrfs
> > | PLATFORM  -- Linux/x86_64 luna 4.9.0-rc5
> > | MKFS_OPTIONS  -- /dev/sdb3
> > | MOUNT_OPTIONS -- /dev/sdb3 /test6
> > | 
> > | btrfs/038 1s ... 1s
> > | Ran: btrfs/038
> > | Passed all 1 tests
> > 
> > Signed-off-by: Tsutomu Itoh 
> 
> Verified that this fixes both btrfs/038 and btrfs/117.
> 
> Tested-by: Omar Sandoval 

Applied, thanks.


Re: Btrfs Heatmap - v2 - block group internals!

2016-11-18 Thread Hans van Kranenburg
On 11/18/2016 04:30 PM, Austin S. Hemmelgarn wrote:
> 
> Now, I personally have no issue with the Hilbert ordering, but if there
> were an option to use a linear ordering, I would almost certainly use
> that instead, simply because I could more easily explain the data to
> people.

It's in there, but hidden :)

--curve linear

-- 
Hans van Kranenburg


Re: Btrfs Heatmap - v2 - block group internals!

2016-11-18 Thread Austin S. Hemmelgarn

On 2016-11-18 09:37, Hans van Kranenburg wrote:

Ha,

On 11/18/2016 01:36 PM, Austin S. Hemmelgarn wrote:

On 2016-11-17 16:08, Hans van Kranenburg wrote:

On 11/17/2016 08:27 PM, Austin S. Hemmelgarn wrote:

On 2016-11-17 13:51, Hans van Kranenburg wrote:

But, the fun with visualizations of data is that you learn whether they
just work(tm) or don't as soon as you see them. Mathematical or
algorithmic beauty is not always a good recipe for beauty as seen by the
human eye.

So, let's gather a bunch of ideas which we can try out and then observe
the result.

Before doing so, I'm going to restructure the code a bit more so I can
write another script in the same directory, just doing import heatmap
and calling a few functions in there to quickly try stuff, bypassing the
normal cli api.

Also, the png writing handling is now done by some random png library
that I found, which requires me to build (or copy/resize) an entire
pixel grid in memory, explicitely listing all pixel values, which is a
bit of a memory hog for bigger pictures, so I want to see if something
can be done there also.

I haven't had a chance to look at the code yet, but do you have an
option to control how much data a pixel represents?  On a multi TB
filesystem for example, you may not care about exact data, just an
overall view of the data, in which case making each pixel represent a
larger chunk of data (and thus reducing the resolution of the image)
would almost certainly save some memory on big filesystems.


--order, which defines the hilbert curve order.

Example: for a 238GiB filesystem, when specifying --order 7, then 2**7 =
128, so 128x128 = 16384 pixels, which means that a single one represents
~16MiB

when --size > --order, the image simply gets scaled up.

When not specifying --order, a number gets chosen automatically with
which bytes per pixel is closest to 32MiB.

When size is not specified, it's 10, or same as order if order is
greater than 10.

Now this output should make sense:

-# ./heatmap.py /mnt/238GiB
max_id 1 num_devices 1 fsid ed108358-c746-4e76-a071-3820d423a99d
nodesize 16384 sectorsize 4096 clone_alignment 4096
scope filesystem curve hilbert order 7 size 10 pngfile
fsid_ed10a358-c846-4e76-a071-3821d423a99d_at_1479473532.png
grid height 128 width 128 total_bytes 255057723392 bytes_per_pixel
15567488.0 pixels 16384

-# ./heatmap.py /mnt/40TiB
max_id 2 num_devices 2 fsid 9bc9947e-070f-4bbc-872e-49b2a39b3f7b
nodesize 16384 sectorsize 4096 clone_alignment 4096
scope filesystem curve hilbert order 10 size 10 pngfile
/home/beheer/heatmap/generated/fsid_9bd9947e-070f-4cbc-8e2e-49b3a39b8f7b_at_1479473950.png
grid height 1024 width 1024 total_bytes 46165378727936 bytes_per_pixel
44026736.0 pixels 1048576

OK, here's another thought: is it possible to parse smaller chunks of
the image at a time, and then use some external tool (ImageMagick
maybe?) to stitch those together into the final image?  That might also
be useful for other reasons too (if you implement it so you can do
arbitrary ranges, you could use it to split separate devices into
independent images).



Re: Btrfs Heatmap - v2 - block group internals!

2016-11-18 Thread Austin S. Hemmelgarn

On 2016-11-18 10:08, Hans van Kranenburg wrote:

On 11/18/2016 03:08 AM, Qu Wenruo wrote:

When generating a picture of a file system with multiple devices,
boundaries between the separate devices are not visible now.

If someone has a brilliant idea about how to do this without throwing
out actual usage data...


The first thought that comes to mind for me is to make each device be a
different color, and otherwise obey the same intensity mapping
correlating to how much data is there.  For example, if you've got a 3
device FS, the parts of the image that correspond to device 1 would go
from 0x00 to 0xFF, the parts for device 2 could be 0x00 to
0x00FF00, and the parts for device 3 could be 0x00 to 0xFF. This
is of course not perfect (you can't tell what device each segment of
empty space corresponds to), but would probably cover most use cases.
(for example, with such a scheme, you could look at an image and tell
whether the data is relatively well distributed across all the devices
or you might need to re-balance).


What about linear output separated with lines (or just black)?


Linear output does not produce useful images, except for really small
filesystems.
However, it's how the human brain is hardwired to parse data like this 
(two data points per item, one for value, one for ordering).  That's 
part of the reason that all known writing systems use a linear 
arrangement of symbols to store information (the other parts 
have to do with things like storage efficiency and error detection (and 
yes, I'm serious, those do play a part in the evolution of language and 
writing)).


As an example of why this is important, imagine showing someone who 
understands the concept of data fragmentation (most people have little 
to no issue understanding this concept) a heatmap of a filesystem with 
no space fragmentation at all without explaining that it uses a 
Hilbert Curve 2d ordering.  Pretty much 100% of people who aren't 
mathematicians or scientists will look at that and the first thought 
that will come to their mind is almost certainly going to be along the 
lines of 'holy crap that's fragmented really bad in this specific area'.


This is the reason that pretty much nothing outside of scientific or 
mathematical data uses a Hilbert curve based 2d ordering of data (and 
even then, they almost never use it for final presentation of the data).


Data presentation for something like this in a way that laypeople can 
understand is hard, but it's also important.  Take a look at some of the 
graphical tools for filesystem defragmentation.  The presentation 
requirements there are pretty similar, and so is the data being 
conveyed.  They all use a grid oriented linear presentation of 
allocation data.  The difference is that they scale up the blocks so 
that they're easily discernible by sight.  This allows them to represent 
the data in a way that's trivial to explain (read this line-by-line), 
unlike the Hilbert curve (the data follows a complex folded spiral 
pattern which is fractal in nature).


Now, I personally have no issue with the Hilbert ordering, but if there 
were an option to use a linear ordering, I would almost certainly use 
that instead, simply because I could more easily explain the data to people.



Re: Btrfs Heatmap - v2 - block group internals!

2016-11-18 Thread Hans van Kranenburg
On 11/18/2016 03:08 AM, Qu Wenruo wrote:
> 
> Just found one small problem.
> After specifying --size 16 to output a given block group (small block
> group, I need large size to make output visible), it takes a full cpu
> and takes a long long long time to run.
> So long I don't even want to wait.
> 
> I changed size to 10, and it finished much faster.
> 
> Is that expected?

Yes, hilbert curve size increases exponentially when increasing order.

2**16 = 65536, 65536x65536 = 4294967296 pixels in the png image.

So even if you had a petabyte filesystem, a single pixel in the image
would still represent only ~256kB. I don't think your small block groups
are 256kB big?

Specifying values like this does not make any sense at all, and it is
expected not to work well.
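
For reference, the numbers behind this (a quick back-of-the-envelope
calculation, assuming a hypothetical 1 PiB filesystem):

for order in (7, 10, 16):
    pixels = (2 ** order) ** 2
    bytes_per_pixel = (1 << 50) // pixels     # 1 PiB filesystem
    print(order, pixels, bytes_per_pixel // 1024, "KiB/pixel")
# 7        16384    67108864 KiB/pixel  (64 GiB)
# 10     1048576     1048576 KiB/pixel  (1 GiB)
# 16  4294967296         256 KiB/pixel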

>>> I don't see what displaying a blockgroup-level aggregate usage number
>>> has to do with multi-device, except that the same %usage will appear
>>> another time when using RAID1*.
> 
> Although in fact, for profiles like RAID0/5/6/10, it's completely
> possible that one dev_extent contains all the data, while another
> dev_extent is almost empty.

When using something like RAID0 profile, I would expect 50% of the data
to end up in one dev_extent and 50% in the other?

> Strictly speaking, at full fs or dev level, we should output things at
> dev_extent level, then greyscale should be representing dev_extent
> usage (which is not possible or quite hard to calculate)

That's what it's doing now?

> Anyway, the greyscale is mostly OK, just as a good addition output for
> full fs graph.

I don't follow.

> Although if it could output the fs or specific dev without gray scale, I
> think it would be better.
> It will be much clearer about the dev_extent level fragments.

I have no idea what you mean, sorry.

>>> When generating a picture of a file system with multiple devices,
>>> boundaries between the separate devices are not visible now.
>>>
>>> If someone has a brilliant idea about how to do this without throwing
>>> out actual usage data...
>>>
>> The first thought that comes to mind for me is to make each device be a
>> different color, and otherwise obey the same intensity mapping
>> correlating to how much data is there.  For example, if you've got a 3
>> device FS, the parts of the image that correspond to device 1 would go
>> from 0x00 to 0xFF, the parts for device 2 could be 0x00 to
>> 0x00FF00, and the parts for device 3 could be 0x00 to 0xFF. This
>> is of course not perfect (you can't tell what device each segment of
>> empty space corresponds to), but would probably cover most use cases.
>> (for example, with such a scheme, you could look at an image and tell
>> whether the data is relatively well distributed across all the devices
>> or you might need to re-balance).
> 
> What about linear output separated with lines (or just black)?

Linear output does not produce useful images, except for really small
filesystems.

-- 
Hans van Kranenburg


Re: [Bug 186671] New: OOM on system with just rsync running 32GB of ram 30GB of pagecache

2016-11-18 Thread E V
Yes, the short window between the stalls and the panic makes it
difficult to manually check much. I could set up a cron job every 5 minutes
or so if you want. Also, I see the OOMs in 4.8, but it has yet to
panic on me, whereas 4.9rc has panicked both times I've booted it, so
depending on what you want to look at it might be easier to
investigate on 4.8. Let me know; I can turn on a couple of the DEBUG
configs and build a new 4.8.8. I've never looked into a netconsole or
serial console. I think just getting the system to use a higher-res
console would be an improvement, but the OOMs seemed to be the root
cause of the panic, so I haven't spent any time looking into that as of
yet.

Thanks,
-Eli

On Fri, Nov 18, 2016 at 6:54 AM, Tetsuo Handa
 wrote:
> On 2016/11/18 6:49, Vlastimil Babka wrote:
>> On 11/16/2016 02:39 PM, E V wrote:
>>> System panic'd overnight running 4.9rc5 & rsync. Attached a photo of
>>> the stack trace, and the 38 call traces in a 2 minute window shortly
>>> before, to the bugzilla case for those not on it's e-mail list:
>>>
>>> https://bugzilla.kernel.org/show_bug.cgi?id=186671
>>
>> The panic screenshot has only the last part, but the end marker says
>> it's OOM with no killable processes. The DEBUG_VM config thus didn't
>> trigger anything, and still there's tons of pagecache, mostly clean,
>> that's not being reclaimed.
>>
>> Could you now try this?
>> - enable CONFIG_PAGE_OWNER
>> - boot with kernel option: page_owner=on
>> - after the first oom, "cat /sys/kernel/debug/page_owner > file"
>> - provide the file (compressed, it will be quite large)
>
> Excuse me for a noise, but do we really need to do
> "cat /sys/kernel/debug/page_owner > file" after the first OOM killer
> invocation? I worry that it might be too difficult to do.
> Shouldn't we rather do "cat /sys/kernel/debug/page_owner > file"
> hourly and compare tendency between the latest one and previous one?
>
> This system has swap, and /var/log/messages before panic
> reports that swapin was stalling at memory allocation.
>
> 
> [130346.262510] dsm_sa_datamgrd: page allocation stalls for 52400ms, order:0, 
> mode:0x24200ca(GFP_HIGHUSER_MOVABLE)
> [130346.262572] CPU: 1 PID: 3622 Comm: dsm_sa_datamgrd Tainted: GW I  
>4.9.0-rc5 #2
> [130346.262662]   8129ba69 8170e4c8 
> c90003ccb8d8
> [130346.262714]  8113449f 024200ca1ca11b40 8170e4c8 
> c90003ccb880
> [130346.262765]  0010 c90003ccb8e8 c90003ccb898 
> 88041f226e80
> [130346.262817] Call Trace:
> [130346.262843]  [] ? dump_stack+0x46/0x5d
> [130346.262872]  [] ? warn_alloc+0x11f/0x140
> [130346.262899]  [] ? __alloc_pages_slowpath+0x84b/0xa80
> [130346.262929]  [] ? __alloc_pages_nodemask+0x2b0/0x2f0
> [130346.262960]  [] ? alloc_pages_vma+0xbe/0x260
> [130346.262989]  [] ? pagecache_get_page+0x22/0x280
> [130346.263019]  [] ? __read_swap_cache_async+0x118/0x1a0
> [130346.263048]  [] ? read_swap_cache_async+0xf/0x30
> [130346.263077]  [] ? swapin_readahead+0x16e/0x1c0
> [130346.263106]  [] ? radix_tree_lookup_slot+0xe/0x20
> [130346.263135]  [] ? find_get_entry+0x14/0x130
> [130346.263162]  [] ? pagecache_get_page+0x22/0x280
> [130346.263193]  [] ? do_swap_page+0x44f/0x5f0
> [130346.263220]  [] ? __radix_tree_lookup+0x62/0xc0
> [130346.263249]  [] ? handle_mm_fault+0x66a/0xf00
> [130346.263277]  [] ? find_get_entry+0x14/0x130
> [130346.263305]  [] ? __do_page_fault+0x1c5/0x490
> [130346.263336]  [] ? page_fault+0x22/0x30
> [130346.263364]  [] ? copy_user_generic_string+0x2c/0x40
> [130346.263395]  [] ? set_fd_set+0x1d/0x30
> [130346.263422]  [] ? core_sys_select+0x1a5/0x260
> [130346.263450]  [] ? getname_flags+0x6a/0x1e0
> [130346.263479]  [] ? cp_new_stat+0x115/0x130
> [130346.263509]  [] ? ktime_get_ts64+0x3f/0xf0
> [130346.263537]  [] ? SyS_select+0xa5/0xe0
> [130346.263564]  [] ? entry_SYSCALL_64_fastpath+0x13/0x94
> 
>
> Under such situation, trying to login and execute /bin/cat could take minutes.
> Also, writing to btrfs and ext4 seems to be stalling. The btrfs one is a
> situation where WQ_MEM_RECLAIM kernel workqueue is unable to make progress.
>
> 
> [130420.008231] kworker/u34:21: page allocation stalls for 35028ms, order:0, 
> mode:0x2400840(GFP_NOFS|__GFP_NOFAIL)
> [130420.008287] CPU: 5 PID: 24286 Comm: kworker/u34:21 Tainted: GW I  
>4.9.0-rc5 #2
> [130420.008401] Workqueue: btrfs-extent-refs btrfs_extent_refs_helper [btrfs]
> [130420.008432]   8129ba69 8170e4c8 
> c900087836a0
> [130420.008483]  8113449f 024008401e3f1b40 8170e4c8 
> c90008783648
> [130420.008534]  0010 c900087836b0 c90008783660 
> 88041ecc4340
> [130420.008586] Call Trace:
> [130420.008611]  [] ? dump_stack+0x46/0x5d
> [130420.008640]  [] ? warn_alloc+0x11f/0x140
> 

Re: [PATCH] Btrfs: remove rb_node field from the delayed ref node structure

2016-11-18 Thread David Sterba
On Fri, Nov 18, 2016 at 09:38:44AM +, fdman...@kernel.org wrote:
> From: Filipe Manana 
> 
> After the last big change in the delayed references code that was needed
> for the last qgroups rework, the red black tree node field of struct
> btrfs_delayed_ref_node is no longer used, so just remove it, this helps
> us save some memory (since struct rb_node is 24 bytes on x86_64) for
> these structures.
> 
> Signed-off-by: Filipe Manana 

Reviewed-by: David Sterba 


Re: Btrfs Heatmap - v2 - block group internals!

2016-11-18 Thread Hans van Kranenburg
Ha,

On 11/18/2016 01:36 PM, Austin S. Hemmelgarn wrote:
> On 2016-11-17 16:08, Hans van Kranenburg wrote:
>> On 11/17/2016 08:27 PM, Austin S. Hemmelgarn wrote:
>>> On 2016-11-17 13:51, Hans van Kranenburg wrote:
>> But, the fun with visualizations of data is that you learn whether they
>> just work(tm) or don't as soon as you see them. Mathematical or
>> algorithmic beauty is not always a good recipe for beauty as seen by the
>> human eye.
>>
>> So, let's gather a bunch of ideas which we can try out and then observe
>> the result.
>>
>> Before doing so, I'm going to restructure the code a bit more so I can
>> write another script in the same directory, just doing import heatmap
>> and calling a few functions in there to quickly try stuff, bypassing the
>> normal cli api.
>>
>> Also, the png writing handling is now done by some random png library
>> that I found, which requires me to build (or copy/resize) an entire
>> pixel grid in memory, explicitely listing all pixel values, which is a
>> bit of a memory hog for bigger pictures, so I want to see if something
>> can be done there also.
> I haven't had a chance to look at the code yet, but do you have an
> option to control how much data a pixel represents?  On a multi TB
> filesystem for example, you may not care about exact data, just an
> overall view of the data, in which case making each pixel represent a
> larger chunk of data (and thus reducing the resolution of the image)
> would almost certainly save some memory on big filesystems.

--order, which defines the hilbert curve order.

Example: for a 238GiB filesystem, when specifying --order 7, then 2**7 =
128, so 128x128 = 16384 pixels, which means that a single one represents
~16MiB

when --size > --order, the image simply gets scaled up.

When not specifying --order, a number gets chosen automatically with
which bytes per pixel is closest to 32MiB.

When size is not specified, it's 10, or same as order if order is
greater than 10.
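
Roughly, the automatic choice amounts to something like this (a simplified
sketch, not the actual heatmap.py code; "closest" is taken here to mean the
smallest absolute difference):

TARGET = 32 * 1024 * 1024    # ~32MiB per pixel

def auto_order(total_bytes, candidates=range(1, 17)):
    # pick the hilbert curve order whose bytes-per-pixel is closest to TARGET
    return min(candidates,
               key=lambda o: abs(total_bytes / (2 ** o) ** 2 - TARGET))

print(auto_order(255057723392))     # 7  for the 238GiB filesystem below
print(auto_order(46165378727936))   # 10 for the 40TiB filesystem below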

Now this output should make sense:

-# ./heatmap.py /mnt/238GiB
max_id 1 num_devices 1 fsid ed108358-c746-4e76-a071-3820d423a99d
nodesize 16384 sectorsize 4096 clone_alignment 4096
scope filesystem curve hilbert order 7 size 10 pngfile
fsid_ed10a358-c846-4e76-a071-3821d423a99d_at_1479473532.png
grid height 128 width 128 total_bytes 255057723392 bytes_per_pixel
15567488.0 pixels 16384

-# ./heatmap.py /mnt/40TiB
max_id 2 num_devices 2 fsid 9bc9947e-070f-4bbc-872e-49b2a39b3f7b
nodesize 16384 sectorsize 4096 clone_alignment 4096
scope filesystem curve hilbert order 10 size 10 pngfile
/home/beheer/heatmap/generated/fsid_9bd9947e-070f-4cbc-8e2e-49b3a39b8f7b_at_1479473950.png
grid height 1024 width 1024 total_bytes 46165378727936 bytes_per_pixel
44026736.0 pixels 1048576

-- 
Hans van Kranenburg


Re: [PATCH 3/3] Btrfs: remove unused code when creating and merging reloc trees

2016-11-18 Thread Josef Bacik

On 11/18/2016 04:37 AM, fdman...@kernel.org wrote:

From: Filipe Manana 

In commit 5bc7247ac47c (Btrfs: fix broken nocow after balance) we started
abusing the rtransid and otransid fields of root items from relocation
trees to fix some issues with nodatacow mode. However later in commit
ba8b0289333a (Btrfs: do not reset last_snapshot after relocation) we
dropped the code that made use of those fields but did not remove
the code that sets those fields.

So just remove them to avoid confusion.

Signed-off-by: Filipe Manana 


Reviewed-by: Josef Bacik 

Thanks,

Josef


Re: [PATCH 2/3] Btrfs: fix relocation incorrectly dropping data references

2016-11-18 Thread Josef Bacik

On 11/18/2016 04:37 AM, fdman...@kernel.org wrote:

From: Filipe Manana 

During relocation of a data block group we create a relocation tree
for each fs/subvol tree by making a snapshot of each tree using
btrfs_copy_root() and the tree's commit root, and then setting the last
snapshot field for the fs/subvol tree's root to the value of the current
transaction id minus 1. However this can lead to relocation later
dropping references that it did not create if we have qgroups enabled,
leaving the filesystem in an inconsistent state that keeps aborting
transactions.

Lets consider the following example to explain the problem, which requires
qgroups to be enabled.

We are relocating data block group Y, we have a subvolume with id 258 that
has a root at level 1, that subvolume is used to store directory entries
for snapshots and we are currently at transaction 3404.

When committing transaction 3404, we have a pending snapshot and therefore
we call btrfs_run_delayed_items() at transaction.c:create_pending_snapshot()
in order to create its dentry at subvolume 258. This results in COWing
leaf A from root 258 in order to add the dentry. Note that leaf A
also contains file extent items referring to extents from some other
block group X (we are currently relocating block group Y). Later on, still
at create_pending_snapshot() we call qgroup_account_snapshot(), which
switches the commit root for root 258 when it calls switch_commit_roots(),
so now the COWed version of leaf A, lets call it leaf A', is accessible
from the commit root of tree 258. At the end of qgroup_account_snapshot(),
we call record_root_in_trans() with 258 as its argument, which results
in btrfs_init_reloc_root() being called, which in turn calls
relocation.c:create_reloc_root() in order to create a relocation tree
associated to root 258, which results in assigning the value of 3403
(which is the current transaction id minus 1 = 3404 - 1) to the
last_snapshot field of root 258. When creating the relocation tree root
at ctree.c:btrfs_copy_root() we add a shared reference for leaf A',
corresponding to the relocation tree's root, when we call btrfs_inc_ref()
against the COWed root (a copy of the commit root from tree 258), which
is at level 1. So at this point leaf A' has 2 references, one normal
reference corresponding to root 258 and one shared reference corresponding
to the root of the relocation tree.

Transaction 3404 finishes its commit and transaction 3405 is started by
relocation when calling merge_reloc_root() for the relocation tree
associated to root 258. In the meanwhile leaf A' is COWed again, in
response to some filesystem operation, when we are still at transaction
3405. However when we COW leaf A', at ctree.c:update_ref_for_cow(), we
call btrfs_block_can_be_shared() in order to figure out if other trees
refer to the leaf and if any such trees exists, add a full back reference
to leaf A' - but btrfs_block_can_be_shared() incorrectly returns false
because the following condition is false:

  btrfs_header_generation(buf) <= btrfs_root_last_snapshot(&root->root_item)

which evaluates to 3404 <= 3403. So after leaf A' is COWed, it stays with
only one reference, corresponding to the shared reference we created when
we called btrfs_copy_root() to create the relocation tree's root and
btrfs_inc_ref() ends up not being called for leaf A' nor we end up setting
the flag BTRFS_BLOCK_FLAG_FULL_BACKREF in leaf A'. This results in not
adding shared references for the extents from block group X that leaf A'
refers to with its file extent items.

Later, after merging the relocation root we do a call to
btrfs_drop_snapshot() in order to delete the relocation tree. This ends
up calling do_walk_down() when path->slots[1] points to leaf A', which
results in calling btrfs_lookup_extent_info() to get the number of
references for leaf A', which is 1 at this time (only the shared reference
exists) and this value is stored at wc->refs[0]. After this walk_up_proc()
is called when wc->level is 0 and path->nodes[0] corresponds to leaf A'.
Because the current level is 0 and wc->refs[0] is 1, it does call
btrfs_dec_ref() against leaf A', which results in removing the single
references that the extents from block group X have which are associated
to root 258 - the expectation was to have each of these extents with 2
references - one reference for root 258 and one shared reference related
to the root of the relocation tree, and so we would drop only the shared
reference (because leaf A' was supposed to have the flag
BTRFS_BLOCK_FLAG_FULL_BACKREF set).

This leaves the filesystem in an inconsistent state as we now have file
extent items in a subvolume tree that point to extents from block group X
without references in the extent tree. So later on when we try to decrement
the references for these extents, for example due to a file unlink operation,
truncate operation or overwriting ranges of a file, we fail because the
expected references do not exist in 

BCache

2016-11-18 Thread Heiri Müller
Hello,


I run BCache in front of a BTrFS RAID with 4 hard drives.
Normal use seems to work well; I have no problems.
I have the BTrFS RAID mounted through "/dev/bcache3".
But when I remove a disk (simulating that it's broken), BTrFS doesn't inform me
about a "missing disk":

# btrfs fi show

Label: 'RAID'  uuid: d0e2e2eb-2df7-454f-8446-5213cec2de3c
Total devices 4 FS bytes used 12.55GiB
devid1 size 465.76GiB used 6.00GiB path /dev/bcache3
devid2 size 931.51GiB used 6.00GiB path /dev/bcache1
devid3 size 596.17GiB used 6.00GiB path /dev/bcache2
devid4 size 465.76GiB used 6.00GiB path /dev/bcache0


One of these (actually "/dev/bcache3" or "/dev/sde") should be broken or
missing.
I can't do a "btrfs device delete missing" either. It replies "ERROR: not a
btrfs filesystem:". It doesn't matter if I use "/dev/bcacheX" or "/dev/sdX"
for that. And if I run "btrfs device delete missing /mnt/raid", it replies:
"ERROR: error removing device 'missing': no missing devices found to remove".
It looks to me as if BCache hides this information from BTrFS. Is this
possible? What do you think?


Thanks



Re: Btrfs Heatmap - v2 - block group internals!

2016-11-18 Thread Austin S. Hemmelgarn

On 2016-11-17 16:08, Hans van Kranenburg wrote:

On 11/17/2016 08:27 PM, Austin S. Hemmelgarn wrote:

On 2016-11-17 13:51, Hans van Kranenburg wrote:


When generating a picture of a file system with multiple devices,
boundaries between the separate devices are not visible now.

If someone has a brilliant idea about how to do this without throwing
out actual usage data...


The first thought that comes to mind for me is to make each device be a
different color, and otherwise obey the same intensity mapping
correlating to how much data is there.  For example, if you've got a 3
device FS, the parts of the image that correspond to device 1 would go
from 0x00 to 0xFF, the parts for device 2 could be 0x00 to
0x00FF00, and the parts for device 3 could be 0x00 to 0xFF. This
is of course not perfect (you can't tell what device each segment of
empty space corresponds to), but would probably cover most use cases.
(for example, with such a scheme, you could look at an image and tell
whether the data is relatively well distributed across all the devices
or you might need to re-balance).
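
A rough sketch of that color mapping idea (purely illustrative; the device
index picks the channel and the usage fraction picks the intensity):

CHANNELS = [(1, 0, 0), (0, 1, 0), (0, 0, 1)]  # dev 1 = red, 2 = green, 3 = blue

def pixel_color(device_index, usage):
    # usage: 0.0 (empty) .. 1.0 (full)
    r, g, b = CHANNELS[device_index % len(CHANNELS)]
    value = int(usage * 255)
    return (r * value, g * value, b * value)

print(pixel_color(0, 0.5))   # (127, 0, 0) - half-used area on device 1
print(pixel_color(2, 1.0))   # (0, 0, 255) - fully used area on device 3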


"most use cases" -> what are those use cases? If you want to know how
much total GiB or TiB is present on all devices, a simple btrfs fi show
does suffice.
Visualizing how the data patterning differs across devices would be the 
biggest one that comes to mind.


Another option is to just write three images, one for each of the
devices. :) Those are more easily compared.
That would actually be more useful probably, as you can then do pretty 
much whatever post-processing you want, and it would cover the above use 
case just as well.


The first idea with color that I had was to use two different colors for
data and metadata. When also using separate colors for devices, it might
all together become a big mess really quickly, or, maybe a beautiful
rainbow.
I actually like that idea a lot better than using color for 
differentiating between devices.


But, the fun with visualizations of data is that you learn whether they
just work(tm) or don't as soon as you see them. Mathematical or
algorithmic beauty is not always a good recipe for beauty as seen by the
human eye.

So, let's gather a bunch of ideas which we can try out and then observe
the result.

Before doing so, I'm going to restructure the code a bit more so I can
write another script in the same directory, just doing import heatmap
and calling a few functions in there to quickly try stuff, bypassing the
normal cli api.

Also, the png writing handling is now done by some random png library
that I found, which requires me to build (or copy/resize) an entire
pixel grid in memory, explicitely listing all pixel values, which is a
bit of a memory hog for bigger pictures, so I want to see if something
can be done there also.
I haven't had a chance to look at the code yet, but do you have an 
option to control how much data a pixel represents?  On a multi TB 
filesystem for example, you may not care about exact data, just an 
overall view of the data, in which case making each pixel represent a 
larger chunk of data (and thus reducing the resolution of the image) 
would almost certainly save some memory on big filesystems.



Re: [Bug 186671] New: OOM on system with just rsync running 32GB of ram 30GB of pagecache

2016-11-18 Thread Janos Toth F.
It could be totally unrelated but I have a similar problem: processes
get randomly OOM'd when I am doing anything "sort of heavy" on my
Btrfs filesystems.
I did some "evil tuning", so I assumed that must be the problem (even
if the values looked sane for my system). Thus, I kept cutting back on
the manually set values (mostly dirty/background ratio, io scheduler
request queue size and such tunables) but it seems to be a dead end. I
guess anything I change in order to try and cut back on the related
memory footprint just makes the OOMs less frequent but it's only a
matter of time and coincidence (lots of things randomly happen to do
some notable amount of IO) until OOMs happen anyway.
It seems to be plenty enough to start a defrag or balance on more than
a single filesystem (in parallel) and pretty much any notable "useful"
user load will have a high chance of triggering OOMs (and getting killed)
sooner or later. It's just my limited observation but database-like
loads [like that of bitcoind] (sync writes and/or frequent flushes?)
or high priority buffered writes (ffmpeg running with higher than
default priority and saving live video streams into files without
recoding) seem to have a higher chance of triggering this (more so than
simply reading or writing files sequentially and asynchronously,
either locally or through Samba).
I am on gentoo-sources 4.8.8 right now but it was there with 4.7.x as well.

On Thu, Nov 17, 2016 at 10:49 PM, Vlastimil Babka  wrote:
> On 11/16/2016 02:39 PM, E V wrote:
>> System panic'd overnight running 4.9rc5 & rsync. Attached a photo of
>> the stack trace, and the 38 call traces in a 2 minute window shortly
>> before, to the bugzilla case for those not on it's e-mail list:
>>
>> https://bugzilla.kernel.org/show_bug.cgi?id=186671
>
> The panic screenshot has only the last part, but the end marker says
> it's OOM with no killable processes. The DEBUG_VM config thus didn't
> trigger anything, and still there's tons of pagecache, mostly clean,
> that's not being reclaimed.
>
> Could you now try this?
> - enable CONFIG_PAGE_OWNER
> - boot with kernel option: page_owner=on
> - after the first oom, "cat /sys/kernel/debug/page_owner > file"
> - provide the file (compressed, it will be quite large)
>
> Vlastimil
>


Re: [Bug 186671] New: OOM on system with just rsync running 32GB of ram 30GB of pagecache

2016-11-18 Thread Tetsuo Handa
On 2016/11/18 6:49, Vlastimil Babka wrote:
> On 11/16/2016 02:39 PM, E V wrote:
>> System panic'd overnight running 4.9rc5 & rsync. Attached a photo of
>> the stack trace, and the 38 call traces in a 2 minute window shortly
>> before, to the bugzilla case for those not on it's e-mail list:
>>
>> https://bugzilla.kernel.org/show_bug.cgi?id=186671
> 
> The panic screenshot has only the last part, but the end marker says
> it's OOM with no killable processes. The DEBUG_VM config thus didn't
> trigger anything, and still there's tons of pagecache, mostly clean,
> that's not being reclaimed.
> 
> Could you now try this?
> - enable CONFIG_PAGE_OWNER
> - boot with kernel option: page_owner=on
> - after the first oom, "cat /sys/kernel/debug/page_owner > file"
> - provide the file (compressed, it will be quite large)

Excuse me for a noise, but do we really need to do
"cat /sys/kernel/debug/page_owner > file" after the first OOM killer
invocation? I worry that it might be too difficult to do.
Shouldn't we rather do "cat /sys/kernel/debug/page_owner > file"
hourly and compare tendency between the latest one and previous one?
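
A minimal helper for such a periodic capture could look like this (just a
sketch; it assumes CONFIG_PAGE_OWNER=y and page_owner=on as suggested above,
and compresses the snapshot because the file can be quite large):

import gzip
import shutil
import time

SRC = "/sys/kernel/debug/page_owner"

def snapshot():
    # copy the current page_owner dump to a timestamped, compressed file
    dst = "/var/tmp/page_owner-%s.gz" % time.strftime("%Y%m%d-%H%M%S")
    with open(SRC, "rb") as fin, gzip.open(dst, "wb") as fout:
        shutil.copyfileobj(fin, fout)
    return dst

if __name__ == "__main__":
    print(snapshot())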

This system has swap, and /var/log/messages before panic
reports that swapin was stalling at memory allocation.


[130346.262510] dsm_sa_datamgrd: page allocation stalls for 52400ms, order:0, 
mode:0x24200ca(GFP_HIGHUSER_MOVABLE)
[130346.262572] CPU: 1 PID: 3622 Comm: dsm_sa_datamgrd Tainted: GW I
 4.9.0-rc5 #2
[130346.262662]   8129ba69 8170e4c8 
c90003ccb8d8
[130346.262714]  8113449f 024200ca1ca11b40 8170e4c8 
c90003ccb880
[130346.262765]  0010 c90003ccb8e8 c90003ccb898 
88041f226e80
[130346.262817] Call Trace:
[130346.262843]  [] ? dump_stack+0x46/0x5d
[130346.262872]  [] ? warn_alloc+0x11f/0x140
[130346.262899]  [] ? __alloc_pages_slowpath+0x84b/0xa80
[130346.262929]  [] ? __alloc_pages_nodemask+0x2b0/0x2f0
[130346.262960]  [] ? alloc_pages_vma+0xbe/0x260
[130346.262989]  [] ? pagecache_get_page+0x22/0x280
[130346.263019]  [] ? __read_swap_cache_async+0x118/0x1a0
[130346.263048]  [] ? read_swap_cache_async+0xf/0x30
[130346.263077]  [] ? swapin_readahead+0x16e/0x1c0
[130346.263106]  [] ? radix_tree_lookup_slot+0xe/0x20
[130346.263135]  [] ? find_get_entry+0x14/0x130
[130346.263162]  [] ? pagecache_get_page+0x22/0x280
[130346.263193]  [] ? do_swap_page+0x44f/0x5f0
[130346.263220]  [] ? __radix_tree_lookup+0x62/0xc0
[130346.263249]  [] ? handle_mm_fault+0x66a/0xf00
[130346.263277]  [] ? find_get_entry+0x14/0x130
[130346.263305]  [] ? __do_page_fault+0x1c5/0x490
[130346.263336]  [] ? page_fault+0x22/0x30
[130346.263364]  [] ? copy_user_generic_string+0x2c/0x40
[130346.263395]  [] ? set_fd_set+0x1d/0x30
[130346.263422]  [] ? core_sys_select+0x1a5/0x260
[130346.263450]  [] ? getname_flags+0x6a/0x1e0
[130346.263479]  [] ? cp_new_stat+0x115/0x130
[130346.263509]  [] ? ktime_get_ts64+0x3f/0xf0
[130346.263537]  [] ? SyS_select+0xa5/0xe0
[130346.263564]  [] ? entry_SYSCALL_64_fastpath+0x13/0x94


Under such situation, trying to login and execute /bin/cat could take minutes.
Also, writing to btrfs and ext4 seems to be stalling. The btrfs one is a
situation where WQ_MEM_RECLAIM kernel workqueue is unable to make progress.


[130420.008231] kworker/u34:21: page allocation stalls for 35028ms, order:0, 
mode:0x2400840(GFP_NOFS|__GFP_NOFAIL)
[130420.008287] CPU: 5 PID: 24286 Comm: kworker/u34:21 Tainted: GW I
 4.9.0-rc5 #2
[130420.008401] Workqueue: btrfs-extent-refs btrfs_extent_refs_helper [btrfs]
[130420.008432]   8129ba69 8170e4c8 
c900087836a0
[130420.008483]  8113449f 024008401e3f1b40 8170e4c8 
c90008783648
[130420.008534]  0010 c900087836b0 c90008783660 
88041ecc4340
[130420.008586] Call Trace:
[130420.008611]  [] ? dump_stack+0x46/0x5d
[130420.008640]  [] ? warn_alloc+0x11f/0x140
[130420.008667]  [] ? __alloc_pages_slowpath+0x84b/0xa80
[130420.008707]  [] ? search_bitmap+0xc2/0x140 [btrfs]
[130420.008736]  [] ? __alloc_pages_nodemask+0x2b0/0x2f0
[130420.008766]  [] ? alloc_pages_current+0x8a/0x110
[130420.008796]  [] ? pagecache_get_page+0xec/0x280
[130420.008836]  [] ? alloc_extent_buffer+0x108/0x430 [btrfs]
[130420.008875]  [] ? btrfs_alloc_tree_block+0x118/0x4d0 
[btrfs]
[130420.008927]  [] ? __btrfs_cow_block+0x148/0x5d0 [btrfs]
[130420.008964]  [] ? btrfs_cow_block+0x114/0x1d0 [btrfs]
[130420.009001]  [] ? btrfs_search_slot+0x206/0xa40 [btrfs]
[130420.009039]  [] ? lookup_inline_extent_backref+0xd9/0x620 
[btrfs]
[130420.009095]  [] ? set_extent_bit+0x24/0x30 [btrfs]
[130420.009124]  [] ? kmem_cache_alloc+0x17f/0x1b0
[130420.009161]  [] ? __btrfs_free_extent.isra.69+0xef/0xd10 
[btrfs]
[130420.009215]  [] ? btrfs_merge_delayed_refs+0x56/0x6f0 
[btrfs]
[130420.009269]  [] ? 

Re: Announcing btrfs-dedupe

2016-11-18 Thread Niccolò Belli

On giovedì 17 novembre 2016 04:01:52 CET, Zygo Blaxell wrote:

Duperemove does use a lot of memory, but the logs at that URL only show
2G of RAM in duperemove--not nearly enough to trigger OOM under normal
conditions on an 8G machine.  There's another process with 6G of virtual
address space (although much less than that resident) that looks more
interesting (i.e. duperemove might just be the victim of some interaction
between baloo_file and the OOM killer).


Thanks, I killed baloo_file before starting duperemove and it somehow
improved (it reached 99.73% before getting killed by the OOM killer once
again):


[ 6342.147251] Purging GPU memory, 0 pages freed, 18268 pages still pinned.
[ 6342.147253] 48 and 0 pages still available in the bound and unbound GPU 
page lists.
[ 6342.147340] Xorg invoked oom-killer: 
gfp_mask=0x240c0d0(GFP_TEMPORARY|__GFP_COMP|__GFP_ZERO), order=3, 
oom_score_adj=0

[ 6342.147341] Xorg cpuset=/ mems_allowed=0
[ 6342.147346] CPU: 3 PID: 650 Comm: Xorg Not tainted 4.8.8-2-ARCH #1
[ 6342.147347] Hardware name: Dell Inc. XPS 13 9343/0F5KF3, BIOS A09 
08/29/2016
[ 6342.147348]  0286 9b89a9c8 88020752f598 
812fde10
[ 6342.147351]  88020752f758 8801edc62ac0 88020752f608 
81205fa2
[ 6342.147353]  000188020752f5a0 9b89a9c8  


[ 6342.147356] Call Trace:
[ 6342.147361]  [] dump_stack+0x63/0x83
[ 6342.147364]  [] dump_header+0x5c/0x1ea
[ 6342.147366]  [] oom_kill_process+0x265/0x410
[ 6342.147368]  [] ? has_capability_noaudit+0x17/0x20
[ 6342.147369]  [] out_of_memory+0x380/0x420
[ 6342.147373]  [] ? find_next_bit+0x18/0x20
[ 6342.147374]  [] __alloc_pages_nodemask+0xda0/0xde0
[ 6342.147377]  [] alloc_pages_current+0x95/0x140
[ 6342.147380]  [] kmalloc_order_trace+0x2e/0xf0
[ 6342.147382]  [] __kmalloc+0x1ea/0x200
[ 6342.147397]  [] ? alloc_gen8_temp_bitmaps+0x2e/0x80 
[i915]
[ 6342.147407]  [] alloc_gen8_temp_bitmaps+0x47/0x80 
[i915]
[ 6342.147417]  [] gen8_alloc_va_range_3lvl+0x98/0x9c0 
[i915]

[ 6342.147419]  [] ? shmem_getpage_gfp+0xed/0xc30
[ 6342.147421]  [] ? sg_init_table+0x1a/0x40
[ 6342.147423]  [] ? swiotlb_map_sg_attrs+0x53/0x130
[ 6342.147432]  [] gen8_alloc_va_range+0x256/0x490 [i915]
[ 6342.147442]  [] i915_vma_bind+0x9b/0x190 [i915]
[ 6342.147453]  [] i915_gem_object_do_pin+0x86b/0xa90 
[i915]

[ 6342.147463]  [] i915_gem_object_pin+0x2d/0x30 [i915]
[ 6342.147472]  [] 
i915_gem_execbuffer_reserve_vma.isra.7+0x9f/0x180 [i915]
[ 6342.147482]  [] 
i915_gem_execbuffer_reserve.isra.8+0x396/0x3c0 [i915]
[ 6342.147491]  [] 
i915_gem_do_execbuffer.isra.14+0x68b/0x1270 [i915]

[ 6342.147493]  [] ? unix_stream_read_generic+0x281/0x8a0
[ 6342.147503]  [] i915_gem_execbuffer2+0x104/0x270 
[i915]

[ 6342.147509]  [] drm_ioctl+0x200/0x4f0 [drm]
[ 6342.147518]  [] ? i915_gem_execbuffer+0x330/0x330 
[i915]

[ 6342.147520]  [] ? enqueue_hrtimer+0x3d/0xa0
[ 6342.147522]  [] ? timerqueue_del+0x24/0x70
[ 6342.147523]  [] ? __remove_hrtimer+0x3c/0x90
[ 6342.147525]  [] do_vfs_ioctl+0xa3/0x5f0
[ 6342.147527]  [] ? do_setitimer+0x12b/0x230
[ 6342.147529]  [] ? __fget+0x77/0xb0
[ 6342.147531]  [] SyS_ioctl+0x79/0x90
[ 6342.147533]  [] entry_SYSCALL_64_fastpath+0x1a/0xa4
[ 6342.147535] Mem-Info:
[ 6342.147538] active_anon:76311 inactive_anon:76782 isolated_anon:0
   active_file:347581 inactive_file:1415592 isolated_file:64
   unevictable:8 dirty:482 writeback:0 unstable:0
   slab_reclaimable:27219 slab_unreclaimable:14772
   mapped:20714 shmem:30458 pagetables:10557 bounce:0
   free:25642 free_pcp:327 free_cma:0
[ 6342.147541] Node 0 active_anon:305244kB inactive_anon:307128kB 
active_file:1390324kB inactive_file:5662368kB unevictable:32kB 
isolated(anon):0kB isolated(file):256kB mapped:82856kB dirty:1928kB 
writeback:0kB shmem:0kB shmem_thp: 0kB shmem_pmdmapped: 81920kB anon_thp: 
121832kB writeback_tmp:0kB unstable:0kB pages_scanned:32 all_unreclaimable? 
no
[ 6342.147542] Node 0 DMA free:15688kB min:132kB low:164kB high:196kB 
active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB 
unevictable:0kB writepending:0kB present:15984kB managed:15896kB 
mlocked:0kB slab_reclaimable:0kB slab_unreclaimable:208kB kernel_stack:0kB 
pagetables:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB

[ 6342.147545] lowmem_reserve[]: 0 3395 7850 7850 7850
[ 6342.147548] Node 0 DMA32 free:48772kB min:29172kB low:36464kB 
high:43756kB active_anon:84724kB inactive_anon:87164kB active_file:555728kB 
inactive_file:2639796kB unevictable:0kB writepending:696kB 
present:3564504kB managed:3488752kB mlocked:0kB slab_reclaimable:47196kB 
slab_unreclaimable:11472kB kernel_stack:192kB pagetables:200kB bounce:0kB 
free_pcp:1284kB local_pcp:0kB free_cma:0kB

[ 6342.147553] lowmem_reserve[]: 0 0 4454 4454 4454
[ 6342.147555] Node 0 Normal free:38108kB min:38276kB low:47844kB 
high:57412kB active_anon:220520kB inactive_anon:219964kB 
active_file:834596kB 

Re: [PATCH 1/3] Btrfs: remove balance warning that does not reflect a problem

2016-11-18 Thread Filipe Manana
On Fri, Nov 18, 2016 at 9:37 AM,   wrote:
> From: Filipe Manana 
>
> On openSUSE/SLE systems where balance is triggered periodically in the
> background, snapshotting happens when doing package installations and
> upgrades, and (by default) the root system is organized with multiple
> subvolumes, the following warning was triggered often:
>
> [  630.773059] WARNING: CPU: 1 PID: 2549 at fs/btrfs/relocation.c:1848 
> replace_path+0x3f0/0x940 [btrfs]
> [  630.773060] Modules linked in: af_packet iscsi_ibft iscsi_boot_sysfs xfs 
> libcrc32c acpi_cpufreq tpm_tis ppdev tpm parport_pc parport pcspkr e1000
> qemu_fw_cfg joydev i2c_piix4 button btrfs xor raid6_pq sr_mod cdrom 
> ata_generic virtio_scsi bochs_drm drm_kms_helper syscopyarea sysfillrect 
> sysimgblt
> fb_sys_fops ttm ata_piix virtio_pci virtio_ring virtio serio_raw floppy drm sg
> [  630.773070] CPU: 1 PID: 2549 Comm: btrfs Tainted: GW   
> 4.7.7-2-btrfs+ #2
> [  630.773071] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
> rel-1.9.1-0-gb3ef39f-prebuilt.qemu-project.org 04/01/2014
> [  630.773072]   8801f704b8c8 813afd12 
> 
> [  630.773073]   8801f704b908 81081f8b 
> 0738
> [  630.773075]  0001 8801e32eb8c0 1600 
> 8800
> [  630.773076] Call Trace:
> [  630.773078]  [] dump_stack+0x63/0x81
> [  630.773079]  [] __warn+0xcb/0xf0
> [  630.773080]  [] warn_slowpath_null+0x1d/0x20
> [  630.773090]  [] replace_path+0x3f0/0x940 [btrfs]
> [  630.773092]  [] ? ring_buffer_unlock_commit+0x3e/0x2a0
> [  630.773102]  [] merge_reloc_root+0x2b4/0x600 [btrfs]
> [  630.773111]  [] merge_reloc_roots+0x140/0x250 [btrfs]
> [  630.773120]  [] relocate_block_group+0x317/0x680 [btrfs]
> [  630.773129]  [] btrfs_relocate_block_group+0x1cc/0x2d0 
> [btrfs]
> [  630.773139]  [] btrfs_relocate_chunk.isra.40+0x56/0xf0 
> [btrfs]
> [  630.773149]  [] __btrfs_balance+0x8d5/0xbb0 [btrfs]
> [  630.773159]  [] btrfs_balance+0x2d0/0x5e0 [btrfs]
> [  630.773168]  [] btrfs_ioctl_balance+0x383/0x390 [btrfs]
> [  630.773178]  [] btrfs_ioctl+0x90f/0x1fb0 [btrfs]
> [  630.773180]  [] ? pte_alloc_one+0x33/0x40
> [  630.773182]  [] do_vfs_ioctl+0x93/0x5a0
> [  630.773183]  [] ? __do_page_fault+0x203/0x4e0
> [  630.773185]  [] SyS_ioctl+0x79/0x90
> [  630.773186]  [] entry_SYSCALL_64_fastpath+0x1e/0xa8
> [  630.773187] ---[ end trace 2cd6167577bf3be7 ]---
>
> It turned out that this warning does not reflect any problem and just
> makes users/system administrators worry about something going wrong.
> The warning happens because when we create a relocation root (which is
> just a snapshot of a subvolume tree) we set its last_snapshot field (as
> well as for the subvolume's tree root) to a value corresponding to the
> generation of the current transaction minus 1 (we do this at
> relocation.c:create_reloc_root()). This means that when we merge the
> relocation tree with the corresponding subvolume tree, at
> walk_down_reloc_tree() we can catch pointers (bytenr/generation pairs)
> with a generation that matches the generation of the transaction where
> we created the relocation root, so those pointers correspond to tree
> blocks created either before or after the relocation root was created.
> If they were created before the relocation root (and in the same
> transaction) we hit the warning, which we can safely remove because it
> means the tree blocks are accessible from both trees (the subvolume
> tree and the relocation tree).
>
> So fix this by removing the warning and adding a couple assertions that
> verify the pointer generations match and that their generation matches
> the value of the last_snapshot field from the relocation tree plus 1.
>
> Signed-off-by: Filipe Manana 


Ignore this patch in the series, since the second patch makes this one
unnecessary.
thanks

> ---
>  fs/btrfs/relocation.c | 21 -
>  1 file changed, 20 insertions(+), 1 deletion(-)
>
> diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c
> index 0ec8ffa..cdc1a1c 100644
> --- a/fs/btrfs/relocation.c
> +++ b/fs/btrfs/relocation.c
> @@ -1848,7 +1848,26 @@ again:
> new_ptr_gen = 0;
> }
>
> -   if (WARN_ON(new_bytenr > 0 && new_bytenr == old_bytenr)) {
> +   /*
> +* When we create the reloc root (which is a snapshot of the
> +* subvolume tree) we set its last_snapshot field (as well as
> +* for the subvolume's tree root) to the value of the current
> +* transaction generation minus 1 (at create_reloc_root()).
> +* This means that at walk_down_reloc_tree() we can catch
> +* pointers (bytenr/generation pairs) with a generation
> +* matching the generation of the transaction where we created
> +* the reloc 

Re: root backup-reformat-restore

2016-11-18 Thread Hugo Mills
On Fri, Nov 18, 2016 at 04:06:22AM +0200, Marcus Sundman wrote:
> On 18.11.2016 02:52, Hugo Mills wrote:
> >On Fri, Nov 18, 2016 at 02:38:25AM +0200, Marcus Sundman wrote:
> >>The FAQ says that "the best solution for small devices (under about
> >>16 GB) is to reformat the FS with the --mixed option to mkfs.btrfs".
> >>
> >>OK. Does anyone have any good suggestions for doing that with an
> >>existing / partition (which has special files and whatnot)?
> >>
> >>I assume a backup-restore cycle is needed, but with which program(s)?
> >  - If you don't have (or don't care about) any snapshots, then
> > - if you can mount both the old and the new FS at the same time
> >- mv
> 
> It's the same partition, so I probably won't have the new FS there
> at the same time.
> 
> > - else, if you want to replace the old FS with the new
> >- tar to a file elsewhere, and untar later
> 
> Can tar really preserve all special files' attributes? (I have
> separate partitions for /boot and /home, but the rest is in /)

   Yes, I've used it for moving / to a different machine before. Take
a look through the options and pick all the ones that look like
they're dealing with attributes. :)

   Hugo.

-- 
Hugo Mills | Anyone who claims their cryptographic protocol is
hugo@... carfax.org.uk | secure is either a genius or a fool. Given the
http://carfax.org.uk/  | genius/fool ratio for our species, the odds aren't
PGP: E2AB1DE4  | good.  Bruce Schneier




[PATCH] Btrfs: remove rb_node field from the delayed ref node structure

2016-11-18 Thread fdmanana
From: Filipe Manana 

After the last big change in the delayed references code that was needed
for the last qgroups rework, the red black tree node field of struct
btrfs_delayed_ref_node is no longer used, so just remove it. This saves
some memory for these structures, since struct rb_node is 24 bytes on
x86_64.
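
For anyone wondering where those 24 bytes come from, here is a minimal
userspace sketch (a stand-in mirroring the layout of the kernel's struct
rb_node as one unsigned long plus two pointers, not the kernel header
itself) that can be compiled to confirm the size on x86_64:

#include <stdio.h>

/*
 * Stand-in with the same layout as the kernel's struct rb_node: an
 * unsigned long holding the parent pointer and color bit, plus two
 * child pointers. On x86_64 each of the three fields is 8 bytes.
 */
struct rb_node_layout {
        unsigned long parent_color;
        struct rb_node_layout *rb_right;
        struct rb_node_layout *rb_left;
};

int main(void)
{
        printf("%zu\n", sizeof(struct rb_node_layout)); /* prints 24 */
        return 0;
}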

Signed-off-by: Filipe Manana 
---
 fs/btrfs/delayed-ref.h | 6 --
 1 file changed, 6 deletions(-)

diff --git a/fs/btrfs/delayed-ref.h b/fs/btrfs/delayed-ref.h
index 43f3629..a72fbe7 100644
--- a/fs/btrfs/delayed-ref.h
+++ b/fs/btrfs/delayed-ref.h
@@ -34,12 +34,6 @@
  * ref_head. Must clean this mess up later.
  */
 struct btrfs_delayed_ref_node {
-   /*
-* ref_head use rb tree, stored in ref_root->href.
-* indexed by bytenr
-*/
-   struct rb_node rb_node;
-
/*data/tree ref use list, stored in ref_head->ref_list. */
struct list_head list;
 
-- 
2.7.0.rc3



[PATCH 2/3] Btrfs: fix relocation incorrectly dropping data references

2016-11-18 Thread fdmanana
From: Filipe Manana 

During relocation of a data block group we create a relocation tree
for each fs/subvol tree by making a snapshot of each tree using
btrfs_copy_root() and the tree's commit root, and then setting the last
snapshot field for the fs/subvol tree's root to the value of the current
transaction id minus 1. However this can lead to relocation later
dropping references that it did not create if we have qgroups enabled,
leaving the filesystem in an inconsistent state that keeps aborting
transactions.

Let's consider the following example to explain the problem, which requires
qgroups to be enabled.

We are relocating data block group Y, we have a subvolume with id 258 that
has a root at level 1, that subvolume is used to store directory entries
for snapshots and we are currently at transaction 3404.

When committing transaction 3404, we have a pending snapshot and therefore
we call btrfs_run_delayed_items() at transaction.c:create_pending_snapshot()
in order to create its dentry at subvolume 258. This results in COWing
leaf A from root 258 in order to add the dentry. Note that leaf A
also contains file extent items referring to extents from some other
block group X (we are currently relocating block group Y). Later on, still
at create_pending_snapshot() we call qgroup_account_snapshot(), which
switches the commit root for root 258 when it calls switch_commit_roots(),
so now the COWed version of leaf A, let's call it leaf A', is accessible
from the commit root of tree 258. At the end of qgroup_account_snapshot(),
we call record_root_in_trans() with 258 as its argument, which results
in btrfs_init_reloc_root() being called, which in turn calls
relocation.c:create_reloc_root() in order to create a relocation tree
associated to root 258, which results in assigning the value of 3403
(which is the current transaction id minus 1 = 3404 - 1) to the
last_snapshot field of root 258. When creating the relocation tree root
at ctree.c:btrfs_copy_root() we add a shared reference for leaf A',
corresponding to the relocation tree's root, when we call btrfs_inc_ref()
against the COWed root (a copy of the commit root from tree 258), which
is at level 1. So at this point leaf A' has 2 references, one normal
reference corresponding to root 258 and one shared reference corresponding
to the root of the relocation tree.

Transaction 3404 finishes its commit and transaction 3405 is started by
relocation when calling merge_reloc_root() for the relocation tree
associated to root 258. In the meanwhile leaf A' is COWed again, in
response to some filesystem operation, when we are still at transaction
3405. However when we COW leaf A', at ctree.c:update_ref_for_cow(), we
call btrfs_block_can_be_shared() in order to figure out if other trees
refer to the leaf and if any such trees exists, add a full back reference
to leaf A' - but btrfs_block_can_be_shared() incorrectly returns false
because the following condition is false:

  btrfs_header_generation(buf) <= btrfs_root_last_snapshot(&root->root_item)

which evaluates to 3404 <= 3403. So after leaf A' is COWed, it stays with
only one reference, corresponding to the shared reference we created when
we called btrfs_copy_root() to create the relocation tree's root and
btrfs_inc_ref() ends up not being called for leaf A', nor do we end up setting
the flag BTRFS_BLOCK_FLAG_FULL_BACKREF in leaf A'. This results in not
adding shared references for the extents from block group X that leaf A'
refers to with its file extent items.
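
To make the failing check easier to follow, here is a heavily simplified
userspace sketch of the comparison quoted above, using the numbers from
this example (this is not the real ctree.c:btrfs_block_can_be_shared(),
which has additional conditions on the root state and header flags; the
helper name is made up for illustration):

#include <stdio.h>

/* Models only the generation comparison that goes wrong here. */
static int considered_shared(unsigned long long buf_generation,
                             unsigned long long last_snapshot)
{
        return buf_generation <= last_snapshot;
}

int main(void)
{
        /*
         * Leaf A' was COWed in transaction 3404, but create_reloc_root()
         * set last_snapshot of root 258 to 3403 (current transaction
         * minus 1), so the check says "not shared" and no full back
         * reference is added when leaf A' is COWed again.
         */
        printf("shared? %d\n", considered_shared(3404, 3403)); /* prints 0 */
        return 0;
}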

Later, after merging the relocation root we do a call to
btrfs_drop_snapshot() in order to delete the relocation tree. This ends
up calling do_walk_down() when path->slots[1] points to leaf A', which
results in calling btrfs_lookup_extent_info() to get the number of
references for leaf A', which is 1 at this time (only the shared reference
exists) and this value is stored at wc->refs[0]. After this walk_up_proc()
is called when wc->level is 0 and path->nodes[0] corresponds to leaf A'.
Because the current level is 0 and wc->refs[0] is 1, it does call
btrfs_dec_ref() against leaf A', which results in removing, for each
extent from block group X, the single reference associated with root 258.
The expectation was for each of these extents to have 2 references - one
reference for root 258 and one shared reference related to the root of
the relocation tree - so that we would drop only the shared reference
(because leaf A' was supposed to have the flag
BTRFS_BLOCK_FLAG_FULL_BACKREF set).

This leaves the filesystem in an inconsistent state as we now have file
extent items in a subvolume tree that point to extents from block group X
without references in the extent tree. So later on when we try to decrement
the references for these extents, for example due to a file unlink operation,
truncate operation or overwriting ranges of a file, we fail because the
expected references do not exist in the extent tree.
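
To summarize the reference accounting described above, here is a toy,
compilable model for one data extent from block group X referenced by
leaf A' (the names and the structure are illustrative only, not btrfs
code):

#include <stdio.h>

struct extent_refs {
        int owner_refs;  /* references keyed by root 258 */
        int shared_refs; /* shared references via the relocation tree root */
};

/*
 * Expected effect of btrfs_dec_ref() on leaf A' when the leaf carries
 * BTRFS_BLOCK_FLAG_FULL_BACKREF: only the shared reference is dropped.
 */
static void drop_shared(struct extent_refs *r) { r->shared_refs--; }

/*
 * What happens in the buggy scenario (flag missing): the reference owned
 * by root 258 is dropped instead.
 */
static void drop_owner(struct extent_refs *r) { r->owner_refs--; }

int main(void)
{
        struct extent_refs expected = { .owner_refs = 1, .shared_refs = 1 };
        struct extent_refs actual   = { .owner_refs = 1, .shared_refs = 0 };

        drop_shared(&expected); /* healthy case: root 258's reference survives */
        drop_owner(&actual);    /* buggy case: no reference left at all */

        printf("expected: owner=%d shared=%d\n",
               expected.owner_refs, expected.shared_refs);
        printf("actual:   owner=%d shared=%d (extent tree now inconsistent)\n",
               actual.owner_refs, actual.shared_refs);
        return 0;
}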

This leads to warnings and 

[PATCH 1/3] Btrfs: remove balance warning that does not reflect a problem

2016-11-18 Thread fdmanana
From: Filipe Manana 

On openSUSE/SLE systems where balance is triggered periodically in the
background, snapshotting happens when doing package installations and
upgrades, and (by default) the root system is organized with multiple
subvolumes, the following warning was triggered often:

[  630.773059] WARNING: CPU: 1 PID: 2549 at fs/btrfs/relocation.c:1848 
replace_path+0x3f0/0x940 [btrfs]
[  630.773060] Modules linked in: af_packet iscsi_ibft iscsi_boot_sysfs xfs 
libcrc32c acpi_cpufreq tpm_tis ppdev tpm parport_pc parport pcspkr e1000
qemu_fw_cfg joydev i2c_piix4 button btrfs xor raid6_pq sr_mod cdrom ata_generic 
virtio_scsi bochs_drm drm_kms_helper syscopyarea sysfillrect sysimgblt
fb_sys_fops ttm ata_piix virtio_pci virtio_ring virtio serio_raw floppy drm sg
[  630.773070] CPU: 1 PID: 2549 Comm: btrfs Tainted: GW   
4.7.7-2-btrfs+ #2
[  630.773071] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
rel-1.9.1-0-gb3ef39f-prebuilt.qemu-project.org 04/01/2014
[  630.773072]   8801f704b8c8 813afd12 

[  630.773073]   8801f704b908 81081f8b 
0738
[  630.773075]  0001 8801e32eb8c0 1600 
8800
[  630.773076] Call Trace:
[  630.773078]  [] dump_stack+0x63/0x81
[  630.773079]  [] __warn+0xcb/0xf0
[  630.773080]  [] warn_slowpath_null+0x1d/0x20
[  630.773090]  [] replace_path+0x3f0/0x940 [btrfs]
[  630.773092]  [] ? ring_buffer_unlock_commit+0x3e/0x2a0
[  630.773102]  [] merge_reloc_root+0x2b4/0x600 [btrfs]
[  630.773111]  [] merge_reloc_roots+0x140/0x250 [btrfs]
[  630.773120]  [] relocate_block_group+0x317/0x680 [btrfs]
[  630.773129]  [] btrfs_relocate_block_group+0x1cc/0x2d0 
[btrfs]
[  630.773139]  [] btrfs_relocate_chunk.isra.40+0x56/0xf0 
[btrfs]
[  630.773149]  [] __btrfs_balance+0x8d5/0xbb0 [btrfs]
[  630.773159]  [] btrfs_balance+0x2d0/0x5e0 [btrfs]
[  630.773168]  [] btrfs_ioctl_balance+0x383/0x390 [btrfs]
[  630.773178]  [] btrfs_ioctl+0x90f/0x1fb0 [btrfs]
[  630.773180]  [] ? pte_alloc_one+0x33/0x40
[  630.773182]  [] do_vfs_ioctl+0x93/0x5a0
[  630.773183]  [] ? __do_page_fault+0x203/0x4e0
[  630.773185]  [] SyS_ioctl+0x79/0x90
[  630.773186]  [] entry_SYSCALL_64_fastpath+0x1e/0xa8
[  630.773187] ---[ end trace 2cd6167577bf3be7 ]---

It turned out that this warning does not reflect any problem and just
makes users/system administrators worry about something going wrong.
The warning happens because when we create a relocation root (which is
just a snapshot of a subvolume tree) we set its last_snapshot field (as
well as for the subvolume's tree root) to a value corresponding to the
generation of the current transaction minus 1 (we do this at
relocation.c:create_reloc_root()). This means that when we merge the
relocation tree with the corresponding subvolume tree, at
walk_down_reloc_tree() we can catch pointers (bytenr/generation pairs)
with a generation that matches the generation of the transaction where
we created the relocation root, so those pointers correspond to tree
blocks created either before or after the relocation root was created.
If they were created before the relocation root (and in the same
transaction) we hit the warning, which we can safely remove because it
means the tree blocks are accessible from both trees (the subvolume
tree and the relocation tree).

So fix this by removing the warning and adding a couple assertions that
verify the pointer generations match and that their generation matches
the value of the last_snapshot field from the relocation tree plus 1.
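
The relationship those assertions are described to verify can be
illustrated with a tiny standalone example, reusing the generation
numbers from the example in patch 2/3 of this series (the values and
variable names below are illustrative only, not the patch's actual code):

#include <assert.h>
#include <stdio.h>

int main(void)
{
        unsigned long long last_snapshot = 3403; /* reloc root's last_snapshot */
        unsigned long long old_ptr_gen = 3404;   /* pointer in the subvolume tree */
        unsigned long long new_ptr_gen = 3404;   /* pointer in the reloc tree */

        /* The two relationships the new assertions check, per the text above. */
        assert(new_ptr_gen == old_ptr_gen);
        assert(new_ptr_gen == last_snapshot + 1);

        printf("both trees point at a block from generation %llu\n",
               new_ptr_gen);
        return 0;
}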

Signed-off-by: Filipe Manana 
---
 fs/btrfs/relocation.c | 21 -
 1 file changed, 20 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c
index 0ec8ffa..cdc1a1c 100644
--- a/fs/btrfs/relocation.c
+++ b/fs/btrfs/relocation.c
@@ -1848,7 +1848,26 @@ again:
new_ptr_gen = 0;
}
 
-   if (WARN_ON(new_bytenr > 0 && new_bytenr == old_bytenr)) {
+   /*
+* When we create the reloc root (which is a snapshot of the
+* subvolume tree) we set its last_snapshot field (as well as
+* for the subvolume's tree root) to the value of the current
+* transaction generation minus 1 (at create_reloc_root()).
+* This means that at walk_down_reloc_tree() we can catch
+* pointers (bytenr/generation pairs) with a generation
+* matching the generation of the transaction where we created
+* the reloc root, so those pointers correspond to tree blocks
+* that were either created before or after the reloc root was
+* created. If walk_down_reloc_tree() gave us a path that points
+* to a tree block that was created (or COWed) before the reloc
+* root was created and in the 

[PATCH 3/3] Btrfs: remove unused code when creating and merging reloc trees

2016-11-18 Thread fdmanana
From: Filipe Manana 

In commit 5bc7247ac47c (Btrfs: fix broken nocow after balance) we started
abusing the rtransid and otransid fields of root items from relocation
trees to fix some issues with nodatacow mode. However later in commit
ba8b0289333a (Btrfs: do not reset last_snapshot after relocation) we
dropped the code that made use of those fields but did not remove
the code that sets those fields.

So just remove them to avoid confusion.

Signed-off-by: Filipe Manana 
---
 fs/btrfs/relocation.c | 19 ---
 1 file changed, 19 deletions(-)

diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c
index 8777b17..83c1a8c 100644
--- a/fs/btrfs/relocation.c
+++ b/fs/btrfs/relocation.c
@@ -1384,7 +1384,6 @@ static struct btrfs_root *create_reloc_root(struct 
btrfs_trans_handle *trans,
struct extent_buffer *eb;
struct btrfs_root_item *root_item;
struct btrfs_key root_key;
-   u64 last_snap = 0;
int ret;
 
root_item = kmalloc(sizeof(*root_item), GFP_NOFS);
@@ -1401,7 +1400,6 @@ static struct btrfs_root *create_reloc_root(struct 
btrfs_trans_handle *trans,
ret = btrfs_copy_root(trans, root, root->commit_root, &eb,
  BTRFS_TREE_RELOC_OBJECTID);
BUG_ON(ret);
-   last_snap = btrfs_root_last_snapshot(&root->root_item);
/*
 * Set the last_snapshot field to the generation of the commit
 * root - like this ctree.c:btrfs_block_can_be_shared() behaves
@@ -1435,12 +1433,6 @@ static struct btrfs_root *create_reloc_root(struct 
btrfs_trans_handle *trans,
memset(&root_item->drop_progress, 0,
   sizeof(struct btrfs_disk_key));
root_item->drop_level = 0;
-   /*
-* abuse rtransid, it is safe because it is impossible to
-* receive data into a relocation tree.
-*/
-   btrfs_set_root_rtransid(root_item, last_snap);
-   btrfs_set_root_otransid(root_item, trans->transid);
}
 
btrfs_tree_unlock(eb);
@@ -2399,9 +2391,6 @@ void merge_reloc_roots(struct reloc_control *rc)
 {
struct btrfs_root *root;
struct btrfs_root *reloc_root;
-   u64 last_snap;
-   u64 otransid;
-   u64 objectid;
LIST_HEAD(reloc_roots);
int found = 0;
int ret = 0;
@@ -2440,14 +2429,6 @@ again:
list_del_init(&reloc_root->root_list);
}
 
-   /*
-* we keep the old last snapshot transid in rtranid when we
-* created the relocation tree.
-*/
-   last_snap = btrfs_root_rtransid(&reloc_root->root_item);
-   otransid = btrfs_root_otransid(&reloc_root->root_item);
-   objectid = reloc_root->root_key.offset;
-
ret = btrfs_drop_snapshot(reloc_root, rc->block_rsv, 0, 1);
if (ret < 0) {
if (list_empty(&reloc_root->root_list))
-- 
2.7.0.rc3
