[PATCH] btrfs: Ensure proper sector alignment for btrfs_free_reserved_data_space
From: Jeff Mahoney
Subject: btrfs: Ensure proper sector alignment for btrfs_free_reserved_data_space
References: bsc#1005666
Patch-mainline: Submitted 18 Nov 2016, linux-btrfs

This fixes the WARN_ON on BTRFS_I(inode)->reserved_extents in
btrfs_destroy_inode and the WARN_ON on nonzero delalloc bytes on umount
with qgroups enabled.

I was able to reproduce this by setting up a small (~500kB) quota limit
and writing a file one byte at a time until I hit the limit. The
warnings would all hit on umount.

The root cause is that we would reserve a block-sized range in both the
reservation and the quota in btrfs_check_data_free_space, but if we
encountered a problem (e.g. EDQUOT), we would only release the single
byte in the qgroup reservation. That caused an io_tree state split,
which increased the number of outstanding extents, in turn disallowing
releasing the metadata reservation.

Signed-off-by: Jeff Mahoney
---
 fs/btrfs/extent-tree.c | 7 +++++++
 1 file changed, 7 insertions(+)

--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -3822,6 +3822,13 @@ void btrfs_free_reserved_data_space_noqu
  */
 void btrfs_free_reserved_data_space(struct inode *inode, u64 start, u64 len)
 {
+	struct btrfs_root *root = BTRFS_I(inode)->root;
+
+	/* Make sure the range is aligned to sectorsize */
+	len = round_up(start + len, root->sectorsize) -
+	      round_down(start, root->sectorsize);
+	start = round_down(start, root->sectorsize);
+
 	btrfs_free_reserved_data_space_noquota(inode, start, len);
 	btrfs_qgroup_free_data(inode, start, len);
 }
--
Jeff Mahoney
SUSE Labs
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
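[Editorial aside: a toy model, in Python rather than the kernel code, of the alignment arithmetic the patch adds; SECTORSIZE is assumed to be 4096 for illustration.]

```python
SECTORSIZE = 4096  # assumed sectorsize for illustration

def round_down(x, align):
    return x - (x % align)

def round_up(x, align):
    return round_down(x + align - 1, align)

def aligned_range(start, length):
    """Mirror the patch: widen [start, start+length) to sector boundaries
    before releasing, so the whole reserved range is freed."""
    new_len = round_up(start + length, SECTORSIZE) - round_down(start, SECTORSIZE)
    return round_down(start, SECTORSIZE), new_len

# A 2-byte range straddling a sector boundary releases both sectors,
# matching what was actually reserved:
print(aligned_range(4095, 2))  # (0, 8192)
```

Without the widening, a one-byte release against a sector-sized reservation is exactly the io_tree state split described above.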
Re: Btrfs Heatmap - v2 - block group internals!
Hi, On 11/19/2016 01:57 AM, Qu Wenruo wrote: > On 11/18/2016 11:08 PM, Hans van Kranenburg wrote: >> On 11/18/2016 03:08 AM, Qu Wenruo wrote: > I don't see what displaying a blockgroup-level aggregate usage number > has to do with multi-device, except that the same %usage will appear > another time when using RAID1*. >>> >>> Although in fact, for profiles like RAID0/5/6/10, it's completely >>> possible that one dev_extent contains all the data, while another >>> dev_extent is almost empty. >> >> When using something like RAID0 profile, I would expect 50% of the data >> to end up in one dev_extent and 50% in the other? > > First I'm mostly OK with current grayscale. > What I'm saying can be considered as nitpicking. > > The only concern is, the for full fs output, we are in fact output > dev_extents of each device. > > In that case, we should output info at the same level of dev_extent. > > So if we really need to provide *accurate* gray scale, then we should > base on the %usage of a dev extent. > > And for 50%/50% assumption for RAID0, it's not true and we can easily > create a case where it's 100%/0% > > [...] > > Then, only the 2nd data stripe has data, while the 1st data stripe are > free. Ah, yes, I see, that's a good example. So technically, for others than single and RAID1, just using the blockgroup usage might be wrong. OTOH, for most cases it will still be "correct enough" for the eye, because statistically seen, distribution of data over the stripes will be more uniform more often than not. It's good to realize, but I'm fine with having this as a "known limitation". > Things will become more complicated when RAID5/6 is involved. Yes. So being able to show a specific dev extent with the actual info (sorted by physical byte location) would be a nice addition. Since it also requires walking the extent tree for the related block group and doing calculations on the ranges, it's not feasible for the high level file system picture to do. 
>>> Strictly speaking, at full fs or dev level, we should output things at
>>> dev_extent level, then greyscale should be representing dev_extent
>>> usage (which is not possible or quite hard to calculate)
>>
>> That's what it's doing now?
>>
>>> Anyway, the greyscale is mostly OK, just as a good addition output for
>>> full fs graph.
>>
>> I don't follow.
>>
>>> Although if it could output the fs or specific dev without gray scale, I
>>> think it would be better.
>>> It will be much clearer about the dev_extent level fragments.
>>
>> I have no idea what you mean, sorry.
>
> The point is, for full fs or per-device output, a developer may focus on
> the fragments of unallocated space in each device.
>
> In that case, an almost empty bg will be much like unallocated space.

The usage (from 0% to 100%) is translated into a brightness between 16 and 255, which already causes empty allocated space to be visually distinguishable from unallocated space:

    def _brightness(self, used_pct):
        return 16 + int(round(used_pct * (255 - 16)))

If you need it to be more clear, just increase the value to more than 16, and voila.

> So I hope if there is any option to disable greyscale at full fs output,
> it would be much better.

It's just some Python, don't hesitate to change it and try things out:

    def _brightness(self, used_pct):
    +    return 255
    -    return 16 + int(round(used_pct * (255 - 16)))

Et voila, 0% used is bright white now. The experience of the resulting image is really different, but if it helps in certain situations, it's quite easy to get it done.

> Just like the blockgroup output, only black and white, and the example
> in the github is really awesome!
>
> It shows a lot of things I didn't have a clear view of before,
> like batched metadata extents (mostly for csum tree) and fragmented
> metadata for other trees.
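[Editorial aside: the brightness mapping quoted above can be sanity-checked outside the tool; this assumes used_pct is a fraction between 0.0 and 1.0, as the formula implies.]

```python
def brightness(used_pct):
    # used_pct is assumed to be a fraction 0.0..1.0
    return 16 + int(round(used_pct * (255 - 16)))

# Empty-but-allocated space (16) stays distinguishable from
# unallocated space, which the tool draws as pure black (0):
print(brightness(0.0))  # 16
print(brightness(1.0))  # 255
```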
:D

First thing I want to do with the blockgroup-level pics is make a "rolling heatmap" of the 4 highest-vaddr blockgroups combined of the filesystem that this thread was about: http://www.spinics.net/lists/linux-btrfs/msg54940.html

First I'm going to disable autodefrag again, take a picture a few times per hour, let it run for a few days, and then do the same thing again with autodefrag enabled. (Spoiler: even with autodefrag enabled, it's a disaster.) But the resulting timelapse videos should show really interesting information about the behaviour of autodefrag, I guess. Can't wait to see them.

--
Hans van Kranenburg
Re: [PATCH 1/9] btrfs: use bio iterators for the decompression handlers
On Wed, Nov 16, 2016 at 01:52:08PM +0100, Christoph Hellwig wrote: > Pass the full bio to the decompression routines and use bio iterators > to iterate over the data in the bio. One question below, > > Signed-off-by: Christoph Hellwig> --- > fs/btrfs/compression.c | 122 > + > fs/btrfs/compression.h | 12 ++--- > fs/btrfs/lzo.c | 17 ++- > fs/btrfs/zlib.c| 15 ++ > 4 files changed, 54 insertions(+), 112 deletions(-) > > diff --git a/fs/btrfs/compression.c b/fs/btrfs/compression.c > index d4d8b7e..12a631d 100644 > --- a/fs/btrfs/compression.c > +++ b/fs/btrfs/compression.c > @@ -81,9 +81,9 @@ struct compressed_bio { > u32 sums; > }; > > -static int btrfs_decompress_biovec(int type, struct page **pages_in, > -u64 disk_start, struct bio_vec *bvec, > -int vcnt, size_t srclen); > +static int btrfs_decompress_bio(int type, struct page **pages_in, > +u64 disk_start, struct bio *orig_bio, > +size_t srclen); > > static inline int compressed_bio_size(struct btrfs_root *root, > unsigned long disk_size) > @@ -175,11 +175,10 @@ static void end_compressed_bio_read(struct bio *bio) > /* ok, we're the last bio for this extent, lets start >* the decompression. >*/ > - ret = btrfs_decompress_biovec(cb->compress_type, > + ret = btrfs_decompress_bio(cb->compress_type, > cb->compressed_pages, > cb->start, > - cb->orig_bio->bi_io_vec, > - cb->orig_bio->bi_vcnt, > + cb->orig_bio, > cb->compressed_len); > csum_failed: > if (ret) > @@ -959,9 +958,7 @@ int btrfs_compress_pages(int type, struct address_space > *mapping, > * > * disk_start is the starting logical offset of this array in the file > * > - * bvec is a bio_vec of pages from the file that we want to decompress into > - * > - * vcnt is the count of pages in the biovec > + * orig_bio contains the pages from the file that we want to decompress into > * > * srclen is the number of bytes in pages_in > * > @@ -970,18 +967,18 @@ int btrfs_compress_pages(int type, struct address_space > *mapping, > * be contiguous. 
They all correspond to the range of bytes covered by > * the compressed extent. > */ > -static int btrfs_decompress_biovec(int type, struct page **pages_in, > -u64 disk_start, struct bio_vec *bvec, > -int vcnt, size_t srclen) > +static int btrfs_decompress_bio(int type, struct page **pages_in, > +u64 disk_start, struct bio *orig_bio, > +size_t srclen) > { > struct list_head *workspace; > int ret; > > workspace = find_workspace(type); > > - ret = btrfs_compress_op[type-1]->decompress_biovec(workspace, pages_in, > - disk_start, > - bvec, vcnt, srclen); > + ret = btrfs_compress_op[type-1]->decompress_bio(workspace, pages_in, > + disk_start, orig_bio, > + srclen); > free_workspace(type, workspace); > return ret; > } > @@ -1021,9 +1018,7 @@ void btrfs_exit_compress(void) > */ > int btrfs_decompress_buf2page(char *buf, unsigned long buf_start, > unsigned long total_out, u64 disk_start, > - struct bio_vec *bvec, int vcnt, > - unsigned long *pg_index, > - unsigned long *pg_offset) > + struct bio *bio) > { > unsigned long buf_offset; > unsigned long current_buf_start; > @@ -1031,13 +1026,13 @@ int btrfs_decompress_buf2page(char *buf, unsigned > long buf_start, > unsigned long working_bytes = total_out - buf_start; > unsigned long bytes; > char *kaddr; > - struct page *page_out = bvec[*pg_index].bv_page; > + struct bio_vec bvec = bio_iter_iovec(bio, bio->bi_iter); > > /* >* start byte is the first byte of the page we're currently >* copying into relative to the start of the compressed data. >*/ > - start_byte = page_offset(page_out) - disk_start; > + start_byte = page_offset(bvec.bv_page) - disk_start; > > /* we haven't yet hit data corresponding to this page */ > if (total_out <= start_byte) > @@ -1057,80 +1052,45 @@ int btrfs_decompress_buf2page(char *buf, unsigned > long buf_start, > > /* copy bytes from the working buffer into the
Re: Btrfs Heatmap - v2 - block group internals!
On 11/18/2016 11:08 PM, Hans van Kranenburg wrote: On 11/18/2016 03:08 AM, Qu Wenruo wrote: Just found one small problem. After specifying --size 16 to output a given block group (small block group, I need large size to make output visible), it takes a full cpu and takes a long long long time to run. So long I don't even want to wait. I changed size to 10, and it finished much faster. Is that expected? Yes, hilbert curve size increases exponentially when increasing order. 2**16 = 65536, 65536x65536 = 4294967296 pixels in the png image. So even if you would have a petabyte filesystem, that would still mean that a single pixel in the image represents only ~ 256kB. I don't think your small block groups are 256kB big? Specifying values like this does not make any sense at all, and it expected not to work well. I don't see what displaying a blockgroup-level aggregate usage number has to do with multi-device, except that the same %usage will appear another time when using RAID1*. Although in fact, for profiles like RAID0/5/6/10, it's completely possible that one dev_extent contains all the data, while another dev_extent is almost empty. When using something like RAID0 profile, I would expect 50% of the data to end up in one dev_extent and 50% in the other? First I'm mostly OK with current grayscale. What I'm saying can be considered as nitpicking. The only concern is, the for full fs output, we are in fact output dev_extents of each device. In that case, we should output info at the same level of dev_extent. So if we really need to provide *accurate* gray scale, then we should base on the %usage of a dev extent. 
And for the 50%/50% assumption for RAID0, it's not true and we can easily create a case where it's 100%/0%. The method is as simple as the following script (assume the data BG is 1G, and the first write will be allocated to offset 0 of the BG):

    # Fill the 1G BG with 64K files
    for i in $(seq -w 0 16383); do
        xfs_io -f -c "pwrite 0 64k" $mnt/file_$i
        # Sync fs to ensure extent allocation happens
        sync
    done

    # Remove every other file, to create holes
    for i in $(seq -w 0 2 16383); do
        rm $mnt/file_$i
    done
    btrfs fi sync $mnt

Then, only the 2nd data stripe has data, while the 1st data stripe is free.

Things will become more complicated when RAID5/6 is involved.

Strictly speaking, at full fs or dev level, we should output things at dev_extent level; then greyscale should represent dev_extent usage (which is not possible, or quite hard to calculate).

That's what it's doing now?

Anyway, the greyscale is mostly OK, just as a good additional output for the full fs graph.

I don't follow.

Although if it could output the fs or a specific dev without gray scale, I think it would be better. It would be much clearer about the dev_extent-level fragments.

I have no idea what you mean, sorry.

The point is, for full fs or per-device output, a developer may focus on the fragments of unallocated space in each device. In that case, an almost empty bg will look much like unallocated space. So I hope there is an option to disable greyscale at full fs output; it would be much better. Just like the blockgroup output, only black and white, and the example in the github is really awesome! It shows a lot of things I didn't have a clear view of before, like batched metadata extents (mostly for the csum tree) and fragmented metadata for other trees.

Thanks, Qu

When generating a picture of a file system with multiple devices, boundaries between the separate devices are not visible now. If someone has a brilliant idea about how to do this without throwing out actual usage data... 
The first thought that comes to mind for me is to make each device be a different color, and otherwise obey the same intensity mapping correlating to how much data is there. For example, if you've got a 3-device FS, the parts of the image that correspond to device 1 would go from 0x000000 to 0xFF0000, the parts for device 2 could be 0x000000 to 0x00FF00, and the parts for device 3 could be 0x000000 to 0x0000FF. This is of course not perfect (you can't tell which device each segment of empty space corresponds to), but would probably cover most use cases (for example, with such a scheme, you could look at an image and tell whether the data is relatively well distributed across all the devices, or whether you might need to re-balance).

What about linear output separated with lines (or just black)?

Linear output does not produce useful images, except for really small filesystems.
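[Editorial aside: a minimal sketch of this per-device coloring idea — one RGB channel per device, intensity scaled by usage. The mapping is hypothetical, not part of the heatmap tool.]

```python
# Hypothetical scheme: device id -> index into the (R, G, B) tuple.
CHANNEL = {1: 0, 2: 1, 3: 2}

def device_color(devid, used_pct):
    """Scale one color channel by usage (used_pct is a 0.0..1.0 fraction)."""
    rgb = [0, 0, 0]
    rgb[CHANNEL[devid]] = int(round(used_pct * 255))
    return tuple(rgb)

print(device_color(1, 1.0))  # (255, 0, 0): device 1, fully used
print(device_color(3, 0.0))  # (0, 0, 0): empty space looks the same for every device
```

The last line shows the caveat mentioned above: at zero usage all devices collapse to black, so empty space loses its device identity.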
Re: [PATCH 0/2] RAID5/6 scrub race fix
On Fri, Nov 18, 2016 at 07:09:34PM +0100, Goffredo Baroncelli wrote: > Hi Zygo > On 2016-11-18 00:13, Zygo Blaxell wrote: > > On Tue, Nov 15, 2016 at 10:50:22AM +0800, Qu Wenruo wrote: > >> Fix the so-called famous RAID5/6 scrub error. > >> > >> Thanks Goffredo Baroncelli for reporting the bug, and make it into our > >> sight. > >> (Yes, without the Phoronix report on this, > >> https://www.phoronix.com/scan.php?page=news_item=Btrfs-RAID-56-Is-Bad, > >> I won't ever be aware of it) > > > > If you're hearing about btrfs RAID5 bugs for the first time through > > Phoronix, then your testing coverage is *clearly* inadequate. > > > > Fill up a RAID5 array, start a FS stress test, pull a drive out while > > that's running, let the FS stress test run for another hour, then try > > to replace or delete the missing device. If there are any crashes, > > corruptions, or EIO during any part of this process (assuming all the > > remaining disks are healthy), then btrfs RAID5 is still broken, and > > you've found another bug to fix. > > > > The fact that so many problems in btrfs can still be found this way > > indicates to me that nobody is doing this basic level of testing > > (or if they are, they're not doing anything about the results). > > [...] > > Sorry but I don't find useful this kind of discussion. Yes BTRFS > RAID5/6 needs a lot of care. Yes, *our* test coverage is far to be > complete; but this is not a fault of a single person; and Qu tried to > solve one issue and for this we should say only tanks.. > > Even if you don't find valuable the work of Qu (and my little one :-) > ), this required some time and need to be respected. I do find this work valuable, and I do thank you and Qu for it. I've been following it with great interest because I haven't had time to dive into it myself. It's a use case I used before and would like to use again. 
Most of my recent frustration, if directed at anyone, is really directed at Phoronix for conflating "one bug was fixed" with "ready for production use today," and I wanted to ensure that the latter rumor was promptly quashed. This is why I'm excited about Qu's work: on my list of 7 btrfs-raid5 recovery bugs (6 I found plus yours), Qu has fixed at least 2 of them, maybe as many as 4, with the patches so far. I can fix 2 of the others, for a total of 6 fixed out of 7. Specifically, the 7 bugs I know of are: 1-2. BUG_ONs in functions that should return errors (I had fixed both already when trying to recover my broken arrays) 3. scrub can't identify which drives or files are corrupted (Qu might have fixed this--I won't know until I do testing) 4-6. symptom groups related to wrong data or EIO in scrub recovery, including Goffredo's (Qu might have fixed all of these, but from a quick read of the patch I think at least two are done). 7. the write hole. I'll know more after I've had a chance to run Qu's patches through testing, which I intend to do at some point. Optimistically, this means there could be only *one* bug remaining in the critical path for btrfs RAID56 single disk failure recovery. That last bug is the write hole, which is why I keep going on about it. It's the only bug I know exists in btrfs RAID56 that has neither an existing fix nor any evidence of someone actively working on it, even at the design proposal stage. Please, I'd love to be wrong about this. When I described the situation recently as "a thin layer of bugs on top of a design defect", I was not trying to be mean. I was trying to describe the situation *precisely*. The thin layer of bugs is much thinner thanks to Qu's work, and thanks in part to his work, I now have confidence that further investment in this area won't be wasted. > Finally, I don't think that we should compare the RAID-hole with this > kind of bug(fix). 
The former is a design issue, the latter is a bug > related of one of the basic feature of the raid system (recover from > the lost of a disk/corruption). > > Even the MD subsystem (which is far behind btrfs) had tolerated > the raid-hole until last year. My frustration against this point is the attitude that mdadm was ever good enough, much less a model to emulate in the future. It's 2016--there have been some advancements in the state of the art since the IBM patent describing RAID5 30 years ago, yet in the btrfs world, we seem to insist on repeating all the same mistakes in the same order. "We're as good as some existing broken-by-design thing" isn't a really useful attitude. We should aspire to do *better* than the existing broken-by-design things. If we didn't, we wouldn't be here, we'd all be lurking on some other list, running ext4 or xfs on mdadm or lvm. > And its solution is far to be cheap > (basically the MD subsystem wrote the data first in the journal > then on the disk... which is the kind of issue that a COW filesystem > would solve). Journalling isn't required. It's
Re: RFC: raid with a variable stripe size
Yes, I don't think one could find any NAND-based SSDs with <4k page size on the market right now (even =4k is hard to get), and 4k is becoming the new norm for HDDs. However, some HDD manufacturers continue to offer drives with 512-byte sectors (I think it's possible to get new ones in sizable quantities if you need them).

I am aware it wouldn't solve the problem for >=4k sector devices unless you are ready to balance frequently. But I think it would still be a lot better to waste padding space on 4k stripes than, say, 64k stripes until you can balance the new block groups. And, if the space waste ratio is tolerable, then this could be an automatic background task as soon as an individual block group or their totals get to a high waste ratio.

I suggest this as a quick temporary workaround because it could be cheap in terms of work if the above-mentioned functionalities (stripe size change, auto-balance) would be worked on anyway (regardless of RAID-5/6-specific issues) until some better solution is realized (probably through a lot more work over a much longer development period). RAID-5 isn't really optimal for a huge number of disks (the URE-during-rebuild issue...), so the temporary space waste is probably <=8x per unbalanced block group (which is 1Gb, or maybe ~10Gb if I am not mistaken, so usually <<8x of the whole available space). But maybe my guesstimates are wrong here.

On Fri, Nov 18, 2016 at 9:51 PM, Timofey Titovets wrote: > 2016-11-18 23:32 GMT+03:00 Janos Toth F. 
: >> Based on the comments of this patch, stripe size could theoretically >> go as low as 512 byte: >> https://mail-archive.com/linux-btrfs@vger.kernel.org/msg56011.html >> If these very small (0.5k-2k) stripe sizes could really work (it's >> possible to implement such changes and it does not degrade performance >> too much - or at all - to keep it so low), we could use RAID-5(/6) on >> <=9(/10) disks with 512 byte physical sectors (assuming 4k filesystem >> sector size + 4k node size, although I am not sure if node size is >> really important here) without having to worry about RMW, extra space >> waste or additional fragmentation. >> >> On Fri, Nov 18, 2016 at 7:15 PM, Goffredo Baroncelli >> wrote: >>> Hello, >>> >>> these are only my thoughts; no code here, but I would like to share it >>> hoping that it could be useful. >>> >>> As reported several times by Zygo (and others), one of the problem of >>> raid5/6 is the write hole. Today BTRFS is not capable to address it. >>> >>> The problem is that the stripe size is bigger than the "sector size" (ok >>> sector is not the correct word, but I am referring to the basic unit of >>> writing on the disk, which is 4k or 16K in btrfs). >>> So when btrfs writes less data than the stripe, the stripe is not filled; >>> when it is filled by a subsequent write, a RMW of the parity is required. >>> >>> On the best of my understanding (which could be very wrong) ZFS try to >>> solve this issue using a variable length stripe. >>> >>> On BTRFS this could be achieved using several BGs (== block group or >>> chunk), one for each stripe size. >>> >>> For example, if a filesystem - RAID5 is composed by 4 DISK, the filesystem >>> should have three BGs: >>> BG #1,composed by two disks (1 data+ 1 parity) >>> BG #2 composed by three disks (2 data + 1 parity) >>> BG #3 composed by four disks (3 data + 1 parity). >>> >>> If the data to be written has a size of 4k, it will be allocated to the BG >>> #1. 
>>> If the data to be written has a size of 8k, it will be allocated to the BG >>> #2 >>> If the data to be written has a size of 12k, it will be allocated to the BG >>> #3 >>> If the data to be written has a size greater than 12k, it will be allocated >>> to the BG3, until the data fills a full stripes; then the remainder will be >>> stored in BG #1 or BG #2. >>> >>> >>> To avoid unbalancing of the disk usage, each BG could use all the disks, >>> even if a stripe uses less disks: i.e >>> >>> DISK1 DISK2 DISK3 DISK4 >>> S1S1S1S2 >>> S2S2S3S3 >>> S3S4S4S4 >>> [] >>> >>> Above is show a BG which uses all the four disks, but has a stripe which >>> spans only 3 disks. >>> >>> >>> Pro: >>> - btrfs already is capable to handle different BG in the filesystem, only >>> the allocator has to change >>> - no more RMW are required (== higher performance) >>> >>> Cons: >>> - the data will be more fragmented >>> - the filesystem, will have more BGs; this will require time-to time a >>> re-balance. But is is an issue which we already know (even if may be not >>> 100% addressed). >>> >>> >>> Thoughts ? >>> >>> BR >>> G.Baroncelli >>> >>> >>> >>> -- >>> gpg @keyserver.linux.it: Goffredo Baroncelli >>> Key fingerprint BBF5 1610 0B64 DAC6 5F7D 17B2 0EDA 9B37 8B82 E0B5 >>> -- >>> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in >>> the body of a message to
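[Editorial aside: the BG-selection rule quoted above can be sketched as follows. The 4k stripe element, the 3-data-element cap, and the BG names are illustrative assumptions taken from the 4-disk RAID5 example, not btrfs code.]

```python
STRIPE_ELEM = 4 * 1024  # one stripe element, per the 4k example
MAX_DATA = 3            # 4 disks -> at most 3 data elements + 1 parity

def allocate(size):
    """Split a write into (block group, bytes) chunks so that every
    stripe is written whole, avoiding read-modify-write of parity."""
    chunks = []
    while size > 0:
        # Use the widest BG this write can still fill completely.
        elems = min(-(-size // STRIPE_ELEM), MAX_DATA)  # ceil division
        take = min(size, elems * STRIPE_ELEM)
        chunks.append(("BG#%d" % elems, take))
        size -= take
    return chunks

print(allocate(4 * 1024))   # [('BG#1', 4096)]
print(allocate(16 * 1024))  # [('BG#3', 12288), ('BG#1', 4096)]
```

The 16k case reproduces the rule in the proposal: fill a full 3+1 stripe in BG #3, then put the 4k remainder in BG #1.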
Re: [RFC] btrfs: make max inline data can be equal to sectorsize
On 11/16/2016 11:10 AM, David Sterba wrote:
On Mon, Nov 14, 2016 at 09:55:34AM +0800, Qu Wenruo wrote:
At 11/12/2016 04:22 AM, Liu Bo wrote:
On Tue, Oct 11, 2016 at 02:47:42PM +0800, Wang Xiaoguang wrote:

If we use the mount option "-o max_inline=sectorsize", say 4096, then even for a fresh fs, say with a 16k nodesize, we cannot make the first 4k of data completely inline. I found this condition causing the issue:

    !compressed_size && (actual_end & (root->sectorsize - 1)) == 0

If it returns true, we'll not make the data inline. For a 4k sectorsize, a 0~4094 data range can be made inline, but 0~4095 cannot. I don't think this limitation is useful, so here we remove it, which makes the max inline data size equal to sectorsize.

It's difficult to tell whether we need this. I'm not a big fan of using a max_inline size larger than the default 2048, given that most reports about ENOSPC are due to metadata, and inlining may make that worse.

IMHO if we can use inline data extents to trigger ENOSPC more easily, then we should allow it, to dig into the problem further. Just ignoring it because it may cause more bugs will not solve the real problem anyway.

Not allowing the full 4k value as max_inline looks artificial to me. We've removed other similar limitations in the past, so I'd tend to agree to do the same here. There's no significant use for it as far as I can tell; if you want to exhaust metadata, the difference to max_inline=4095 would be really tiny in the end. So, I'm okay with merging it. If anybody feels like adding his reviewed-by, please do so.

The check is there because in practice it doesn't make sense to inline an extent if it fits perfectly in a data block. You could argue it's saving seeks, but we're also adding seeks by spreading out the metadata in general. So, I'd want to see benchmarks before deciding. If we're using it for debugging, I'd rather stick with max_inline=4095. 
-chris
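[Editorial aside: a toy model of the condition discussed in the thread above, mirroring the quoted C expression in Python; SECTORSIZE is assumed to be 4096.]

```python
SECTORSIZE = 4096  # assumed 4k sectorsize, as in the discussion

def can_inline(actual_end, compressed_size=0):
    """Model only the rejected case: an uncompressed range whose end
    falls exactly on a sector boundary is not inlined."""
    return not (not compressed_size and
                (actual_end & (SECTORSIZE - 1)) == 0)

print(can_inline(4095))  # True: bytes 0~4094 may be inlined
print(can_inline(4096))  # False: a full 4k range is rejected
```

This is exactly the asymmetry the patch removes: with max_inline=4096, a 4095-byte file inlines but a 4096-byte file does not.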
Re: RFC: raid with a variable stripe size
2016-11-18 23:32 GMT+03:00 Janos Toth F.: > Based on the comments of this patch, stripe size could theoretically > go as low as 512 byte: > https://mail-archive.com/linux-btrfs@vger.kernel.org/msg56011.html > If these very small (0.5k-2k) stripe sizes could really work (it's > possible to implement such changes and it does not degrade performance > too much - or at all - to keep it so low), we could use RAID-5(/6) on > <=9(/10) disks with 512 byte physical sectors (assuming 4k filesystem > sector size + 4k node size, although I am not sure if node size is > really important here) without having to worry about RMW, extra space > waste or additional fragmentation. > > On Fri, Nov 18, 2016 at 7:15 PM, Goffredo Baroncelli > wrote: >> Hello, >> >> these are only my thoughts; no code here, but I would like to share it >> hoping that it could be useful. >> >> As reported several times by Zygo (and others), one of the problem of >> raid5/6 is the write hole. Today BTRFS is not capable to address it. >> >> The problem is that the stripe size is bigger than the "sector size" (ok >> sector is not the correct word, but I am referring to the basic unit of >> writing on the disk, which is 4k or 16K in btrfs). >> So when btrfs writes less data than the stripe, the stripe is not filled; >> when it is filled by a subsequent write, a RMW of the parity is required. >> >> On the best of my understanding (which could be very wrong) ZFS try to solve >> this issue using a variable length stripe. >> >> On BTRFS this could be achieved using several BGs (== block group or chunk), >> one for each stripe size. >> >> For example, if a filesystem - RAID5 is composed by 4 DISK, the filesystem >> should have three BGs: >> BG #1,composed by two disks (1 data+ 1 parity) >> BG #2 composed by three disks (2 data + 1 parity) >> BG #3 composed by four disks (3 data + 1 parity). >> >> If the data to be written has a size of 4k, it will be allocated to the BG >> #1. 
>> If the data to be written has a size of 8k, it will be allocated to the BG #2 >> If the data to be written has a size of 12k, it will be allocated to the BG >> #3 >> If the data to be written has a size greater than 12k, it will be allocated >> to the BG3, until the data fills a full stripes; then the remainder will be >> stored in BG #1 or BG #2. >> >> >> To avoid unbalancing of the disk usage, each BG could use all the disks, >> even if a stripe uses less disks: i.e >> >> DISK1 DISK2 DISK3 DISK4 >> S1S1S1S2 >> S2S2S3S3 >> S3S4S4S4 >> [] >> >> Above is show a BG which uses all the four disks, but has a stripe which >> spans only 3 disks. >> >> >> Pro: >> - btrfs already is capable to handle different BG in the filesystem, only >> the allocator has to change >> - no more RMW are required (== higher performance) >> >> Cons: >> - the data will be more fragmented >> - the filesystem, will have more BGs; this will require time-to time a >> re-balance. But is is an issue which we already know (even if may be not >> 100% addressed). >> >> >> Thoughts ? >> >> BR >> G.Baroncelli >> >> >> >> -- >> gpg @keyserver.linux.it: Goffredo Baroncelli >> Key fingerprint BBF5 1610 0B64 DAC6 5F7D 17B2 0EDA 9B37 8B82 E0B5 >> -- >> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in >> the body of a message to majord...@vger.kernel.org >> More majordomo info at http://vger.kernel.org/majordomo-info.html > -- > To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in > the body of a message to majord...@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html AFAIK all drives at now use 4k physical sector size, and use 512b only logically So it's create another RWM Read 4k -> Modify 512b -> Write 4k, instead of just write 512b. -- Have a nice day, Timofey. 
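[Editorial aside: Timofey's point about 512e drives (512-byte logical, 4k physical sectors) can be modeled as follows — a sketch assuming writes are either sub-sector or fully sector-aligned.]

```python
PHYS_SECTOR = 4096  # physical sector on a 512e drive

def bytes_touched(write_bytes):
    """If a write doesn't cover whole physical sectors, the drive must
    read the 4k sector, merge the new bytes, and write 4k back (RMW)."""
    if write_bytes % PHYS_SECTOR:
        return {"read": PHYS_SECTOR, "write": PHYS_SECTOR}
    return {"read": 0, "write": write_bytes}

print(bytes_touched(512))   # a 512b logical write costs a full 4k RMW cycle
print(bytes_touched(4096))  # an aligned 4k write needs no read
```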
Re: [PATCH 8/9] btrfs: use bio_for_each_segment_all in __btrfsic_submit_bio
On Wed, Nov 16, 2016 at 01:52:15PM +0100, Christoph Hellwig wrote: > And remove the bogus check for a NULL return value from kmap, which > can't happen. While we're at it: I don't think that kmapping up to 256 > will work without deadlocks on highmem machines, a better idea would > be to use vm_map_ram to map all of them into a single virtual address > range. Incidentally that would also simplify the code a lot. Reviewed-by: Omar Sandoval > Signed-off-by: Christoph Hellwig > --- > fs/btrfs/check-integrity.c | 30 +++--- > 1 file changed, 11 insertions(+), 19 deletions(-) > > diff --git a/fs/btrfs/check-integrity.c b/fs/btrfs/check-integrity.c > index a6f657f..86f681f 100644 > --- a/fs/btrfs/check-integrity.c > +++ b/fs/btrfs/check-integrity.c > @@ -2819,10 +2819,11 @@ static void __btrfsic_submit_bio(struct bio *bio) > * btrfsic_mount(), this might return NULL */ > dev_state = btrfsic_dev_state_lookup(bio->bi_bdev); > if (NULL != dev_state && > - (bio_op(bio) == REQ_OP_WRITE) && NULL != bio->bi_io_vec) { > + (bio_op(bio) == REQ_OP_WRITE) && bio_has_data(bio)) { > unsigned int i; > u64 dev_bytenr; > u64 cur_bytenr; > + struct bio_vec *bvec; > int bio_is_patched; > char **mapped_datav; > > @@ -2840,32 +2841,23 @@ static void __btrfsic_submit_bio(struct bio *bio) > if (!mapped_datav) > goto leave; > cur_bytenr = dev_bytenr; > - for (i = 0; i < bio->bi_vcnt; i++) { > - BUG_ON(bio->bi_io_vec[i].bv_len != PAGE_SIZE); > - mapped_datav[i] = kmap(bio->bi_io_vec[i].bv_page); > - if (!mapped_datav[i]) { > - while (i > 0) { > - i--; > - kunmap(bio->bi_io_vec[i].bv_page); > - } > - kfree(mapped_datav); > - goto leave; > - } > + > + bio_for_each_segment_all(bvec, bio, i) { > + BUG_ON(bvec->bv_len != PAGE_SIZE); > + mapped_datav[i] = kmap(bvec->bv_page); > + > if (dev_state->state->print_mask & > BTRFSIC_PRINT_MASK_SUBMIT_BIO_BH_VERBOSE) > pr_info("#%u: bytenr=%llu, len=%u, offset=%u\n", > - i, cur_bytenr, bio->bi_io_vec[i].bv_len, > - bio->bi_io_vec[i].bv_offset); > - cur_bytenr += bio->bi_io_vec[i].bv_len; > + i, cur_bytenr, bvec->bv_len, > + bvec->bv_offset); > + cur_bytenr += bvec->bv_len; > } > btrfsic_process_written_block(dev_state, dev_bytenr, > mapped_datav, bio->bi_vcnt, > bio, &bio_is_patched, > NULL, bio->bi_opf); > - while (i > 0) { > - i--; > - kunmap(bio->bi_io_vec[i].bv_page); > - } > + bio_for_each_segment_all(bvec, bio, i) > + kunmap(bvec->bv_page); > kfree(mapped_datav); > } else if (NULL != dev_state && (bio->bi_opf & REQ_PREFLUSH)) { > if (dev_state->state->print_mask & > -- > 2.1.4 > > -- > To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in > the body of a message to majord...@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html -- Omar
Re: RFC: raid with a variable stripe size
2016-11-18 21:15 GMT+03:00 Goffredo Baroncelli: > Hello, > > these are only my thoughts; no code here, but I would like to share it hoping > that it could be useful. > > As reported several times by Zygo (and others), one of the problem of raid5/6 > is the write hole. Today BTRFS is not capable to address it. > > The problem is that the stripe size is bigger than the "sector size" (ok > sector is not the correct word, but I am referring to the basic unit of > writing on the disk, which is 4k or 16K in btrfs). > So when btrfs writes less data than the stripe, the stripe is not filled; > when it is filled by a subsequent write, a RMW of the parity is required. > > On the best of my understanding (which could be very wrong) ZFS try to solve > this issue using a variable length stripe. > > On BTRFS this could be achieved using several BGs (== block group or chunk), > one for each stripe size. > > For example, if a filesystem - RAID5 is composed by 4 DISK, the filesystem > should have three BGs: > BG #1,composed by two disks (1 data+ 1 parity) > BG #2 composed by three disks (2 data + 1 parity) > BG #3 composed by four disks (3 data + 1 parity). > > If the data to be written has a size of 4k, it will be allocated to the BG #1. > If the data to be written has a size of 8k, it will be allocated to the BG #2 > If the data to be written has a size of 12k, it will be allocated to the BG #3 > If the data to be written has a size greater than 12k, it will be allocated > to the BG3, until the data fills a full stripes; then the remainder will be > stored in BG #1 or BG #2. > > > To avoid unbalancing of the disk usage, each BG could use all the disks, even > if a stripe uses less disks: i.e > > DISK1 DISK2 DISK3 DISK4 > S1S1S1S2 > S2S2S3S3 > S3S4S4S4 > [] > > Above is show a BG which uses all the four disks, but has a stripe which > spans only 3 disks. 
> > > Pro: > - btrfs already is capable to handle different BG in the filesystem, only the > allocator has to change > - no more RMW are required (== higher performance) > > Cons: > - the data will be more fragmented > - the filesystem, will have more BGs; this will require time-to time a > re-balance. But is is an issue which we already know (even if may be not 100% > addressed). > > > Thoughts ? > > BR > G.Baroncelli AFAIK it's difficult to do such things in btrfs, because btrfs uses chunk allocation for metadata & data. ZFS, AFAIK, works with the storage more directly, so it can span a file across the different disks without a fixed chunk layout. Maybe it could be implemented by some chunk allocator rework, I don't know. Correct me if I'm wrong, thanks. -- Have a nice day, Timofey.
Re: RFC: raid with a variable stripe size
Based on the comments of this patch, stripe size could theoretically go as low as 512 byte: https://mail-archive.com/linux-btrfs@vger.kernel.org/msg56011.html If these very small (0.5k-2k) stripe sizes could really work (it's possible to implement such changes and it does not degrade performance too much - or at all - to keep it so low), we could use RAID-5(/6) on <=9(/10) disks with 512 byte physical sectors (assuming 4k filesystem sector size + 4k node size, although I am not sure if node size is really important here) without having to worry about RMW, extra space waste or additional fragmentation. On Fri, Nov 18, 2016 at 7:15 PM, Goffredo Baroncelliwrote: > Hello, > > these are only my thoughts; no code here, but I would like to share it hoping > that it could be useful. > > As reported several times by Zygo (and others), one of the problem of raid5/6 > is the write hole. Today BTRFS is not capable to address it. > > The problem is that the stripe size is bigger than the "sector size" (ok > sector is not the correct word, but I am referring to the basic unit of > writing on the disk, which is 4k or 16K in btrfs). > So when btrfs writes less data than the stripe, the stripe is not filled; > when it is filled by a subsequent write, a RMW of the parity is required. > > On the best of my understanding (which could be very wrong) ZFS try to solve > this issue using a variable length stripe. > > On BTRFS this could be achieved using several BGs (== block group or chunk), > one for each stripe size. > > For example, if a filesystem - RAID5 is composed by 4 DISK, the filesystem > should have three BGs: > BG #1,composed by two disks (1 data+ 1 parity) > BG #2 composed by three disks (2 data + 1 parity) > BG #3 composed by four disks (3 data + 1 parity). > > If the data to be written has a size of 4k, it will be allocated to the BG #1. 
> If the data to be written has a size of 8k, it will be allocated to the BG #2 > If the data to be written has a size of 12k, it will be allocated to the BG #3 > If the data to be written has a size greater than 12k, it will be allocated > to the BG3, until the data fills a full stripes; then the remainder will be > stored in BG #1 or BG #2. > > > To avoid unbalancing of the disk usage, each BG could use all the disks, even > if a stripe uses less disks: i.e > > DISK1 DISK2 DISK3 DISK4 > S1S1S1S2 > S2S2S3S3 > S3S4S4S4 > [] > > Above is show a BG which uses all the four disks, but has a stripe which > spans only 3 disks. > > > Pro: > - btrfs already is capable to handle different BG in the filesystem, only the > allocator has to change > - no more RMW are required (== higher performance) > > Cons: > - the data will be more fragmented > - the filesystem, will have more BGs; this will require time-to time a > re-balance. But is is an issue which we already know (even if may be not 100% > addressed). > > > Thoughts ? > > BR > G.Baroncelli > > > > -- > gpg @keyserver.linux.it: Goffredo Baroncelli > Key fingerprint BBF5 1610 0B64 DAC6 5F7D 17B2 0EDA 9B37 8B82 E0B5 > -- > To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in > the body of a message to majord...@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 7/9] btrfs: refactor __btrfs_lookup_bio_sums to use bio_for_each_segment_all
On Wed, Nov 16, 2016 at 01:52:14PM +0100, Christoph Hellwig wrote: > Rework the loop a little bit to use the generic bio_for_each_segment_all > helper for iterating over the bio. One minor nit. Besides that, Reviewed-by: Omar Sandoval > Signed-off-by: Christoph Hellwig > --- > fs/btrfs/file-item.c | 31 +++ > 1 file changed, 11 insertions(+), 20 deletions(-) > > diff --git a/fs/btrfs/file-item.c b/fs/btrfs/file-item.c > index fa8aa53..54ccb91 100644 > --- a/fs/btrfs/file-item.c > +++ b/fs/btrfs/file-item.c > @@ -163,7 +163,7 @@ static int __btrfs_lookup_bio_sums(struct btrfs_root > *root, > struct inode *inode, struct bio *bio, > u64 logical_offset, u32 *dst, int dio) > { > - struct bio_vec *bvec = bio->bi_io_vec; > + struct bio_vec *bvec; > struct btrfs_io_bio *btrfs_bio = btrfs_io_bio(bio); > struct btrfs_csum_item *item = NULL; > struct extent_io_tree *io_tree = &BTRFS_I(inode)->io_tree; > @@ -177,7 +177,7 @@ static int __btrfs_lookup_bio_sums(struct btrfs_root > *root, > u32 diff; > int nblocks; > int bio_index = 0; > - int count; > + int count = 0; > u16 csum_size = btrfs_super_csum_size(root->fs_info->super_copy); > > path = btrfs_alloc_path(); > @@ -223,8 +223,11 @@ static int __btrfs_lookup_bio_sums(struct btrfs_root > *root, > if (dio) > offset = logical_offset; > > - page_bytes_left = bvec->bv_len; > - while (bio_index < bio->bi_vcnt) { > + bio_for_each_segment_all(bvec, bio, bio_index) { Can we just rename bio_index to i now?
> + page_bytes_left = bvec->bv_len; > + if (count) > + goto next; > + > if (!dio) > offset = page_offset(bvec->bv_page) + bvec->bv_offset; > count = btrfs_find_ordered_sum(inode, offset, disk_bytenr, > @@ -285,29 +288,17 @@ static int __btrfs_lookup_bio_sums(struct btrfs_root > *root, > found: > csum += count * csum_size; > nblocks -= count; > - > +next: > while (count--) { > disk_bytenr += root->sectorsize; > offset += root->sectorsize; > page_bytes_left -= root->sectorsize; > - if (!page_bytes_left) { > - bio_index++; > - /* > - * make sure we're still inside the > - * bio before we update page_bytes_left > - */ > - if (bio_index >= bio->bi_vcnt) { > - WARN_ON_ONCE(count); > - goto done; > - } > - bvec++; > - page_bytes_left = bvec->bv_len; > - } > - > + if (!page_bytes_left) > + break; /* move to next bio */ > } > } > > -done: > + WARN_ON_ONCE(count); > btrfs_free_path(path); > return 0; > } > -- > 2.1.4 > > -- > To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in > the body of a message to majord...@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html -- Omar -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 6/9] btrfs: calculate end of bio offset properly
On Wed, Nov 16, 2016 at 01:52:13PM +0100, Christoph Hellwig wrote: > Use the bvec offset and len members to prepare for multipage bvecs. > > Signed-off-by: Christoph Hellwig > --- > fs/btrfs/compression.c | 10 -- > 1 file changed, 8 insertions(+), 2 deletions(-) > > diff --git a/fs/btrfs/compression.c b/fs/btrfs/compression.c > index 8618ac3..27e9feb 100644 > --- a/fs/btrfs/compression.c > +++ b/fs/btrfs/compression.c > @@ -445,6 +445,13 @@ int btrfs_submit_compressed_write(struct inode *inode, > u64 start, > return 0; > } > > +static u64 bio_end_offset(struct bio *bio) > +{ > + struct bio_vec *last = &bio->bi_io_vec[bio->bi_vcnt - 1]; > + > + return page_offset(last->bv_page) + last->bv_len - last->bv_offset; Why is this minus bv_offset and not plus? Am I misunderstanding bv_offset? > +} > + > static noinline int add_ra_bio_pages(struct inode *inode, > u64 compressed_end, > struct compressed_bio *cb) > @@ -463,8 +470,7 @@ static noinline int add_ra_bio_pages(struct inode *inode, > u64 end; > int misses = 0; > > - page = cb->orig_bio->bi_io_vec[cb->orig_bio->bi_vcnt - 1].bv_page; > - last_offset = (page_offset(page) + PAGE_SIZE); > + last_offset = bio_end_offset(cb->orig_bio); > em_tree = &BTRFS_I(inode)->extent_tree; > tree = &BTRFS_I(inode)->io_tree; > > -- > 2.1.4 > > -- > To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in > the body of a message to majord...@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html -- Omar
Re: [PATCH 5/9] btrfs: use bi_size
On Wed, Nov 16, 2016 at 01:52:12PM +0100, Christoph Hellwig wrote: > Instead of using bi_vcnt to calculate it. Reviewed-by: Omar Sandoval> Signed-off-by: Christoph Hellwig > --- > fs/btrfs/compression.c | 7 ++- > 1 file changed, 2 insertions(+), 5 deletions(-) > > diff --git a/fs/btrfs/compression.c b/fs/btrfs/compression.c > index 12a631d..8618ac3 100644 > --- a/fs/btrfs/compression.c > +++ b/fs/btrfs/compression.c > @@ -562,7 +562,6 @@ static noinline int add_ra_bio_pages(struct inode *inode, > * > * bio->bi_iter.bi_sector points to the compressed extent on disk > * bio->bi_io_vec points to all of the inode pages > - * bio->bi_vcnt is a count of pages > * > * After the compressed pages are read, we copy the bytes into the > * bio we were passed and then call the bio end_io calls > @@ -574,7 +573,6 @@ int btrfs_submit_compressed_read(struct inode *inode, > struct bio *bio, > struct extent_map_tree *em_tree; > struct compressed_bio *cb; > struct btrfs_root *root = BTRFS_I(inode)->root; > - unsigned long uncompressed_len = bio->bi_vcnt * PAGE_SIZE; > unsigned long compressed_len; > unsigned long nr_pages; > unsigned long pg_index; > @@ -619,7 +617,7 @@ int btrfs_submit_compressed_read(struct inode *inode, > struct bio *bio, > free_extent_map(em); > em = NULL; > > - cb->len = uncompressed_len; > + cb->len = bio->bi_iter.bi_size; > cb->compressed_len = compressed_len; > cb->compress_type = extent_compress_type(bio_flags); > cb->orig_bio = bio; > @@ -647,8 +645,7 @@ int btrfs_submit_compressed_read(struct inode *inode, > struct bio *bio, > add_ra_bio_pages(inode, em_start + em_len, cb); > > /* include any pages we added in add_ra-bio_pages */ > - uncompressed_len = bio->bi_vcnt * PAGE_SIZE; > - cb->len = uncompressed_len; > + cb->len = bio->bi_iter.bi_size; > > comp_bio = compressed_bio_alloc(bdev, cur_disk_byte, GFP_NOFS); > if (!comp_bio) > -- > 2.1.4 > > -- > To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in > the body of a message to 
majord...@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html -- Omar -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 4/9] btrfs: don't access the bio directly in btrfs_csum_one_bio
On Wed, Nov 16, 2016 at 01:52:11PM +0100, Christoph Hellwig wrote: > Use bio_for_each_segment_all to iterate over the segments instead. > This requires a bit of reshuffling so that we only lookup up the ordered > item once inside the bio_for_each_segment_all loop. Reviewed-by: Omar Sandoval> Signed-off-by: Christoph Hellwig > --- > fs/btrfs/file-item.c | 21 ++--- > 1 file changed, 10 insertions(+), 11 deletions(-) > > diff --git a/fs/btrfs/file-item.c b/fs/btrfs/file-item.c > index d0d571c..fa8aa53 100644 > --- a/fs/btrfs/file-item.c > +++ b/fs/btrfs/file-item.c > @@ -447,13 +447,12 @@ int btrfs_csum_one_bio(struct btrfs_root *root, struct > inode *inode, > struct bio *bio, u64 file_start, int contig) > { > struct btrfs_ordered_sum *sums; > - struct btrfs_ordered_extent *ordered; > + struct btrfs_ordered_extent *ordered = NULL; > char *data; > - struct bio_vec *bvec = bio->bi_io_vec; > - int bio_index = 0; > + struct bio_vec *bvec; > int index; > int nr_sectors; > - int i; > + int i, j; > unsigned long total_bytes = 0; > unsigned long this_sum_bytes = 0; > u64 offset; > @@ -470,17 +469,20 @@ int btrfs_csum_one_bio(struct btrfs_root *root, struct > inode *inode, > if (contig) > offset = file_start; > else > - offset = page_offset(bvec->bv_page) + bvec->bv_offset; > + offset = 0; /* shut up gcc */ > > - ordered = btrfs_lookup_ordered_extent(inode, offset); > - BUG_ON(!ordered); /* Logic error */ > sums->bytenr = (u64)bio->bi_iter.bi_sector << 9; > index = 0; > > - while (bio_index < bio->bi_vcnt) { > + bio_for_each_segment_all(bvec, bio, j) { > if (!contig) > offset = page_offset(bvec->bv_page) + bvec->bv_offset; > > + if (!ordered) { > + ordered = btrfs_lookup_ordered_extent(inode, offset); > + BUG_ON(!ordered); /* Logic error */ > + } > + > data = kmap_atomic(bvec->bv_page); > > nr_sectors = BTRFS_BYTES_TO_BLKS(root->fs_info, > @@ -529,9 +531,6 @@ int btrfs_csum_one_bio(struct btrfs_root *root, struct > inode *inode, > } > > kunmap_atomic(data); > - > - 
bio_index++; > - bvec++; > } > this_sum_bytes = 0; > btrfs_add_ordered_sum(inode, ordered, sums); > -- > 2.1.4 > > -- > To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in > the body of a message to majord...@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html -- Omar -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 3/9] btrfs: don't access the bio directly in the raid5/6 code
On Wed, Nov 16, 2016 at 01:52:10PM +0100, Christoph Hellwig wrote: > Just use bio_for_each_segment_all to iterate over all segments. The subject seems to be copied from patch 2 for this one. Otherwise, Reviewed-by: Omar Sandoval > Signed-off-by: Christoph Hellwig > --- > fs/btrfs/inode.c | 7 +++ > 1 file changed, 3 insertions(+), 4 deletions(-) > > diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c > index 147df4c..3f09cb6 100644 > --- a/fs/btrfs/inode.c > +++ b/fs/btrfs/inode.c > @@ -8394,7 +8394,7 @@ static int btrfs_submit_direct_hook(struct > btrfs_dio_private *dip, > struct btrfs_root *root = BTRFS_I(inode)->root; > struct bio *bio; > struct bio *orig_bio = dip->orig_bio; > - struct bio_vec *bvec = orig_bio->bi_io_vec; > + struct bio_vec *bvec; > u64 start_sector = orig_bio->bi_iter.bi_sector; > u64 file_offset = dip->logical_offset; > u64 submit_len = 0; > @@ -8403,7 +8403,7 @@ static int btrfs_submit_direct_hook(struct > btrfs_dio_private *dip, > int async_submit = 0; > int nr_sectors; > int ret; > - int i; > + int i, j; > > map_length = orig_bio->bi_iter.bi_size; > ret = btrfs_map_block(root->fs_info, btrfs_op(orig_bio), > @@ -8433,7 +8433,7 @@ static int btrfs_submit_direct_hook(struct > btrfs_dio_private *dip, > btrfs_io_bio(bio)->logical = file_offset; > atomic_inc(&dip->pending_bios); > > - while (bvec <= (orig_bio->bi_io_vec + orig_bio->bi_vcnt - 1)) { > + bio_for_each_segment_all(bvec, orig_bio, j) { > nr_sectors = BTRFS_BYTES_TO_BLKS(root->fs_info, bvec->bv_len); > i = 0; > next_block: > @@ -8487,7 +8487,6 @@ static int btrfs_submit_direct_hook(struct > btrfs_dio_private *dip, > i++; > goto next_block; > } > - bvec++; > } > } > > -- > 2.1.4 > > -- > To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in > the body of a message to majord...@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html -- Omar
Re: [PATCH 2/9] btrfs: don't access the bio directly in the raid5/6 code
On Wed, Nov 16, 2016 at 01:52:09PM +0100, Christoph Hellwig wrote: > Just use bio_for_each_segment_all to iterate over all segments. > Besides the minor whitespace issue below, Reviewed-by: Omar Sandoval > Signed-off-by: Christoph Hellwig > --- > fs/btrfs/raid56.c | 16 ++-- > 1 file changed, 6 insertions(+), 10 deletions(-) > > diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c > index d016d4a..da941fb 100644 > --- a/fs/btrfs/raid56.c > +++ b/fs/btrfs/raid56.c > @@ -1144,10 +1144,10 @@ static void validate_rbio_for_rmw(struct > btrfs_raid_bio *rbio) > static void index_rbio_pages(struct btrfs_raid_bio *rbio) > { > struct bio *bio; > + struct bio_vec *bvec; > u64 start; > unsigned long stripe_offset; > unsigned long page_index; > - struct page *p; > int i; > > spin_lock_irq(&rbio->bio_list_lock); > @@ -1156,10 +1156,8 @@ static void index_rbio_pages(struct btrfs_raid_bio > *rbio) > stripe_offset = start - rbio->bbio->raid_map[0]; > page_index = stripe_offset >> PAGE_SHIFT; > > - for (i = 0; i < bio->bi_vcnt; i++) { > - p = bio->bi_io_vec[i].bv_page; > - rbio->bio_pages[page_index + i] = p; > - } > + bio_for_each_segment_all(bvec, bio, i) > + rbio->bio_pages[page_index+ i] = bvec->bv_page; Missing a space right here --^ > } > spin_unlock_irq(&rbio->bio_list_lock); > } > @@ -1433,13 +1431,11 @@ static int fail_bio_stripe(struct btrfs_raid_bio > *rbio, > */ > static void set_bio_pages_uptodate(struct bio *bio) > { > + struct bio_vec *bvec; > int i; > - struct page *p; > > - for (i = 0; i < bio->bi_vcnt; i++) { > - p = bio->bi_io_vec[i].bv_page; > - SetPageUptodate(p); > - } > + bio_for_each_segment_all(bvec, bio, i) > + SetPageUptodate(bvec->bv_page); > } > > /* > -- > 2.1.4 > > -- > To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in > the body of a message to majord...@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html -- Omar
Re: [PATCH] btrfs-progs: check: fix missing newlines
On Wed, Nov 16, 2016 at 12:54:32PM -0800, Omar Sandoval wrote: > From: Omar Sandoval > > Also, the other progress messages go to stderr, so "checking extents" > probably should, as well. > > Fixes: c7a1f66a205f ("btrfs-progs: check: switch some messages to common > helpers") > Signed-off-by: Omar Sandoval > --- > As a side note, it seems almost completely random whether we print to > stdout or stderr for any given message. That could probably use some > cleaning up for consistency. A quick run of e2fsck indicated that it > prints almost everything on stdout except for usage and administrative > problems. xfs_repair just seems to put everything in stderr. I > personally like the e2fsck approach. Anyone have any preference? Cleaning up the messages is ongoing work; most error messages have been converted. In the case of 'check', I think that stdout is good for capturing both normal and error messages (so no error messages are accidentally lost if the user runs just "check > log" instead of "check >& log"). For that, printf is still the way to print them. Besides that, the verbosity level could be improved; we've had complaints about that forever. Patch applied.
Re: [PATCH] btrfs-progs: qgroup: fix error in ASSERT condition expression
On Thu, Nov 17, 2016 at 02:01:31PM +0900, Tsutomu Itoh wrote: > Option -f, -F and --sort don't work because a conditional expression > of ASSERT is wrong. > > Signed-off-by: Tsutomu Itoh Applied, thanks.
Re: [PATCH 1/3] btrfs-progs: test: fix error of test target of Makefile
On Fri, Nov 18, 2016 at 02:44:12PM +0900, Tsutomu Itoh wrote: > Add test-cli to test target of Makefile. > > Signed-off-by: Tsutomu Itoh Applied, thanks.
Re: [PATCH 2/3] btrfs-progs: test: expand size of test device of fsck-tests 013
On Fri, Nov 18, 2016 at 02:36:28PM +0800, Qu Wenruo wrote: > > > On 11/18/2016 01:47 PM, Tsutomu Itoh wrote: > > In my test environment, size of /lib/modules/`uname -r`/* is > > larger than 1GB, so test data can not copy to testdev. > > Therefore we need expand the size of testdev. > > IMHO the correct fix is to enhance populate_fs() to fill the fs into a > given size, other than using the unreliable copy. Yeah, that would be better; the contents of the modules dir could be anything, but it was enough for initial testing.
RFC: raid with a variable stripe size
Hello, these are only my thoughts; no code here, but I would like to share them hoping they could be useful.

As reported several times by Zygo (and others), one of the problems of raid5/6 is the write hole. Today BTRFS is not capable of addressing it.

The problem is that the stripe size is bigger than the "sector size" (ok, sector is not the correct word, but I am referring to the basic unit of writing on the disk, which is 4k or 16K in btrfs). So when btrfs writes less data than the stripe, the stripe is not filled; when it is filled by a subsequent write, a RMW of the parity is required.

To the best of my understanding (which could be very wrong), ZFS tries to solve this issue using a variable length stripe.

On BTRFS this could be achieved using several BGs (== block group or chunk), one for each stripe size.

For example, if a RAID5 filesystem is composed of 4 disks, the filesystem should have three BGs:
BG #1, composed of two disks (1 data + 1 parity)
BG #2, composed of three disks (2 data + 1 parity)
BG #3, composed of four disks (3 data + 1 parity)

If the data to be written has a size of 4k, it will be allocated to BG #1.
If the data to be written has a size of 8k, it will be allocated to BG #2.
If the data to be written has a size of 12k, it will be allocated to BG #3.
If the data to be written has a size greater than 12k, it will be allocated to BG #3 until the data fills a full stripe; then the remainder will be stored in BG #1 or BG #2.

To avoid unbalancing the disk usage, each BG could use all the disks, even if a stripe uses fewer disks, i.e.:

DISK1 DISK2 DISK3 DISK4
S1    S1    S1    S2
S2    S2    S3    S3
S3    S4    S4    S4
[...]

Above is shown a BG which uses all four disks, but has a stripe which spans only 3 disks.

Pro:
- btrfs is already capable of handling different BGs in the filesystem; only the allocator has to change
- no more RMW is required (== higher performance)

Cons:
- the data will be more fragmented
- the filesystem will have more BGs; this will require a re-balance from time to time. But this is an issue which we already know (even if it may be not 100% addressed).

Thoughts?

BR
G.Baroncelli

--
gpg @keyserver.linux.it: Goffredo Baroncelli
Key fingerprint BBF5 1610 0B64 DAC6 5F7D 17B2 0EDA 9B37 8B82 E0B5
Re: [PATCH 3/3] btrfs-progs: test: fix convert-tests 004 failure
On Fri, Nov 18, 2016 at 02:49:51PM +0900, Tsutomu Itoh wrote: > convert-tests 004 failed because the argument to check_all_images > is not specified. > > # make test-convert > [TEST] convert-tests.sh > [TEST/conv] 004-ext2-backup-superblock-ranges > find: '': No such file or directory > # > > Signed-off-by: Tsutomu Itoh > --- > tests/convert-tests/004-ext2-backup-superblock-ranges/test.sh | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/tests/convert-tests/004-ext2-backup-superblock-ranges/test.sh > b/tests/convert-tests/004-ext2-backup-superblock-ranges/test.sh > index c56650b..d764ed8 100755 > --- a/tests/convert-tests/004-ext2-backup-superblock-ranges/test.sh > +++ b/tests/convert-tests/004-ext2-backup-superblock-ranges/test.sh > @@ -39,4 +39,4 @@ function check_image() { > rm -f $TEST_DEV > } > > -check_all_images > +check_all_images `pwd` I'll add a fallback to use the current directory instead of forcing all callsites to supply it.
Re: [PATCH v2 0/6] btrfs-progs: better space_cache=v2 support
On Mon, Nov 14, 2016 at 10:43:17AM -0800, Omar Sandoval wrote: > Cover letter from v1: > > This series implements some support for space_cache=v2 in btrfs-progs. > In particular, this adds support for `btrfs check --clear-space-cache v2`, > proper printing of the free space tree flags for `btrfs inspect-internal > dump-super`, and better documentation. > > We'd previously talked about always making btrfs-progs always invalidate > the free space tree when doing write operations, but I decided that this > should be an action explicitly requested by the user. It'd also be > unsafe if using a kernel without the free space tree valid bit support, > which is why I didn't implement a `btrfs check --invalidate-free-space-cache` > option. Doing the full clear is always safe. > > Still missing is full read-write support, but this should hopefully > cover most btrfs-progs usage. > > Changes since v1: > > - Change unsigned -> unsigned int argument to > btrfs_check_fs_compatability() in patch 3 > - Remove BUG_ON() in btrfs_del_root() in patch 4 > - Return error from btrfs_free_tree_block() in patch 4 > - Handle errors from btrfs_free_tree_block() and clean_tree_block() in > patch 4 > - Add Qu Wenruo's Reviewed-by to patches 3, 4, and 5 > > Thanks! > > Omar Sandoval (6): > btrfs-progs: add the FREE_SPACE_TREE_VALID compat_ro bit definition > btrfs-progs: format FREE_SPACE_TREE{,_VALID} nicely in dump-super > btrfs-progs: add OPEN_CTREE_INVALIDATE_FST flag > btrfs-progs: add btrfs_clear_free_space_tree() from the kernel > btrfs-progs: implement btrfs check --clear-space-cache v2 > btrfs-progs: document space_cache=v2 more thoroughly Applied, thanks. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 0/2] RAID5/6 scrub race fix
Hi Zygo, On 2016-11-18 00:13, Zygo Blaxell wrote: > On Tue, Nov 15, 2016 at 10:50:22AM +0800, Qu Wenruo wrote: >> Fix the so-called famous RAID5/6 scrub error. >> >> Thanks Goffredo Baroncelli for reporting the bug, and make it into our >> sight. >> (Yes, without the Phoronix report on this, >> https://www.phoronix.com/scan.php?page=news_item=Btrfs-RAID-56-Is-Bad, >> I won't ever be aware of it) > > If you're hearing about btrfs RAID5 bugs for the first time through > Phoronix, then your testing coverage is *clearly* inadequate. > > Fill up a RAID5 array, start a FS stress test, pull a drive out while > that's running, let the FS stress test run for another hour, then try > to replace or delete the missing device. If there are any crashes, > corruptions, or EIO during any part of this process (assuming all the > remaining disks are healthy), then btrfs RAID5 is still broken, and > you've found another bug to fix. > > The fact that so many problems in btrfs can still be found this way > indicates to me that nobody is doing this basic level of testing > (or if they are, they're not doing anything about the results). [...] Sorry, but I don't find this kind of discussion useful. Yes, BTRFS RAID5/6 needs a lot of care. Yes, *our* test coverage is far from complete; but this is not the fault of a single person, and Qu tried to solve one issue, for which we should simply say thanks. Even if you don't find the work of Qu (and my little one :-) ) valuable, it required some time and deserves respect. Finally, I don't think we should compare the RAID hole with this kind of bug(fix). The former is a design issue; the latter is a bug in one of the basic features of a raid system (recovering from the loss of a disk or from corruption). Even the MD subsystem (which is far behind btrfs) tolerated the raid hole until last year. And its solution is far from cheap (basically MD writes the data first to a journal and then to the disk... which is the kind of issue that a COW filesystem would solve). BR G.Baroncelli -- gpg @keyserver.linux.it: Goffredo Baroncelli Key fingerprint BBF5 1610 0B64 DAC6 5F7D 17B2 0EDA 9B37 8B82 E0B5
Re: [PATCH] btrfs-progs: send: fix failure of xfstests btrfs/038
On Thu, Nov 17, 2016 at 11:24:41AM -0800, Omar Sandoval wrote: > On Tue, Nov 15, 2016 at 05:16:27PM +0900, Tsutomu Itoh wrote: > > The following patch was imperfect, so xfstests btrfs/038 failed. > > > > 6d4fb3d btrfs-progs: send: fix handling of multiple snapshots (-p option) > > > > [before] > > | # ./check btrfs/038 > > | FSTYP -- btrfs > > | PLATFORM -- Linux/x86_64 luna 4.9.0-rc5 > > | MKFS_OPTIONS -- /dev/sdb3 > > | MOUNT_OPTIONS -- /dev/sdb3 /test6 > > | > > | btrfs/038 1s ... [failed, exit status 1] - output mismatch (see > > /For_RT/xfstests2/results//btrfs/038.out.bad) > > | --- tests/btrfs/038.out 2015-08-04 16:09:38.0 +0900 > > | +++ /For_RT/xfstests2/results//btrfs/038.out.bad2016-11-15 > > 13:39:48.589435290 +0900 > > | @@ -7,3 +7,5 @@ > > | XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec) > > | wrote 1/1 bytes at offset 30 > > | XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec) > > | +failed: '/usr/local/bin/btrfs send -p /test6/mysnap1 -c > > /test6/clones_snap /test6/mysnap2 -f /tmp/tmp.aLpvAC7lsx/2.snap' > > | +(see /For_RT/xfstests2/results//btrfs/038.full for details) > > | ... > > | (Run 'diff -u tests/btrfs/038.out > > /For_RT/xfstests2/results//btrfs/038.out.bad' to see the entire diff) > > | Ran: btrfs/038 > > | Failures: btrfs/038 > > > > [after] > > | # ./check btrfs/038 > > | FSTYP -- btrfs > > | PLATFORM -- Linux/x86_64 luna 4.9.0-rc5 > > | MKFS_OPTIONS -- /dev/sdb3 > > | MOUNT_OPTIONS -- /dev/sdb3 /test6 > > | > > | btrfs/038 1s ... 1s > > | Ran: btrfs/038 > > | Passed all 1 tests > > > > Signed-off-by: Tsutomu Itoh > > Verified that this fixes both btrfs/038 and btrfs/117. > > Tested-by: Omar Sandoval Applied, thanks.
Re: Btrfs Heatmap - v2 - block group internals!
On 11/18/2016 04:30 PM, Austin S. Hemmelgarn wrote: > > Now, I personally have no issue with the Hilbert ordering, but if there > were an option to use a linear ordering, I would almost certainly use > that instead, simply because I could more easily explain the data to > people. It's in there, but hidden :) --curve linear -- Hans van Kranenburg
Re: Btrfs Heatmap - v2 - block group internals!
On 2016-11-18 09:37, Hans van Kranenburg wrote: > [...] > > When not specifying --order, a number gets chosen automatically with > which bytes per pixel is closest to 32MiB. When size is not specified, > it's 10, or same as order if order is greater than 10. OK, here's another thought, is it possible to parse smaller chunks of the image at a time, and then use some external tool (ImageMagick maybe?) to stitch those together into the final image? That might also be useful for other reasons too (If you implement it so you can do arbitrary ranges, you could use it to split separate devices into independent images).
Re: Btrfs Heatmap - v2 - block group internals!
On 2016-11-18 10:08, Hans van Kranenburg wrote: On 11/18/2016 03:08 AM, Qu Wenruo wrote: When generating a picture of a file system with multiple devices, boundaries between the separate devices are not visible now. If someone has a brilliant idea about how to do this without throwing out actual usage data... The first thought that comes to mind for me is to make each device be a different color, and otherwise obey the same intensity mapping correlating to how much data is there. For example, if you've got a 3 device FS, the parts of the image that correspond to device 1 would go from 0x000000 to 0xFF0000, the parts for device 2 could be 0x000000 to 0x00FF00, and the parts for device 3 could be 0x000000 to 0x0000FF. This is of course not perfect (you can't tell what device each segment of empty space corresponds to), but would probably cover most use cases. (for example, with such a scheme, you could look at an image and tell whether the data is relatively well distributed across all the devices or you might need to re-balance). What about linear output separated with lines(or just black)? Linear output does not produce useful images, except for really small filesystems. However, it's how the human brain is hardwired to parse data like this (two data points per item, one for value, one for ordering). That's part of the reason that all known writing systems use a linear arrangement of symbols to store information (the other parts have to do with things like storage efficiency and error detection (and yes, I'm serious, those do play a part in the evolution of language and writing)). As an example of why this is important, imagine showing someone who understands the concept of data fragmentation (most people have little to no issue understanding this concept) a heatmap of a filesystem with no space fragmentation at all without explaining that it uses a Hilbert Curve 2d ordering.
Pretty much 100% of people who aren't mathematicians or scientists will look at that and the first thought that will come to their mind is almost certainly going to be along the lines of 'holy crap that's fragmented really bad in this specific area'. This is the reason that pretty much nothing outside of scientific or mathematical data uses a Hilbert curve based 2d ordering of data (and even then, they almost never use it for final presentation of the data). Data presentation for something like this in a way that laypeople can understand is hard, but it's also important. Take a look at some of the graphical tools for filesystem defragmentation. The presentation requirements there are pretty similar, and so is the data being conveyed. They all use a grid oriented linear presentation of allocation data. The difference is that they scale up the blocks so that they're easily discernible by sight. This allows them to represent the data in a way that's trivial to explain (read this line-by-line), unlike the Hilbert curve (the data follows a complex folded spiral pattern which is fractal in nature). Now, I personally have no issue with the Hilbert ordering, but if there were an option to use a linear ordering, I would almost certainly use that instead, simply because I could more easily explain the data to people.
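The Hilbert-vs-linear point above can be made concrete with a short standalone sketch (this is the textbook distance-to-coordinate conversion, not code from heatmap.py): it maps a position along the curve to an (x, y) cell on the 2**order x 2**order grid, and consecutive positions always land in adjacent cells — exactly the property a row-by-row linear layout loses at every line wrap.

```python
def hilbert_d2xy(order, d):
    """Map distance d along a Hilbert curve of the given order to (x, y).

    Classic iterative conversion; the grid is 2**order cells per side.
    """
    n = 2 ** order
    x = y = 0
    s = 1
    while s < n:
        rx = 1 & (d // 2)
        ry = 1 & (d ^ rx)
        if ry == 0:
            if rx == 1:
                # mirror the sub-square
                x, y = s - 1 - x, s - 1 - y
            x, y = y, x  # rotate the sub-square
        x += s * rx
        y += s * ry
        d //= 4
        s *= 2
    return x, y

# walk an order-3 curve (8x8 grid); every step moves to a 4-neighbour cell
pts = [hilbert_d2xy(3, d) for d in range(64)]
print(pts[:4])
```

The curious "folded" look people notice in the images falls directly out of the mirror/rotate steps in the loop.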
Re: Btrfs Heatmap - v2 - block group internals!
On 11/18/2016 03:08 AM, Qu Wenruo wrote: > > Just found one small problem. > After specifying --size 16 to output a given block group (small block > group, I need large size to make output visible), it takes a full cpu > and takes a long long long time to run. > So long I don't even want to wait. > > I changed size to 10, and it finished much faster. > > Is that expected? Yes, hilbert curve size increases exponentially when increasing order. 2**16 = 65536, 65536x65536 = 4294967296 pixels in the png image. So even if you would have a petabyte filesystem, that would still mean that a single pixel in the image represents only ~ 256kB. I don't think your small block groups are 256kB big? Specifying values like this does not make any sense at all, and it is expected not to work well. >>> I don't see what displaying a blockgroup-level aggregate usage number >>> has to do with multi-device, except that the same %usage will appear >>> another time when using RAID1*. > > Although in fact, for profiles like RAID0/5/6/10, it's completely > possible that one dev_extent contains all the data, while another > dev_extent is almost empty. When using something like RAID0 profile, I would expect 50% of the data to end up in one dev_extent and 50% in the other? > Strictly speaking, at full fs or dev level, we should output things at > dev_extent level, then greyscale should be representing dev_extent > usage(which is not possible or quite hard to calculate) That's what it's doing now? > Anyway, the greyscale is mostly OK, just as a good addition output for > full fs graph. I don't follow. > Although if it could output the fs or specific dev without gray scale, I > think it would be better. > It will be much clearer about the dev_extent level fragments. I have no idea what you mean, sorry. >>> When generating a picture of a file system with multiple devices, >>> boundaries between the separate devices are not visible now.
>>> >>> If someone has a brilliant idea about how to do this without throwing >>> out actual usage data... >>> >> The first thought that comes to mind for me is to make each device be a >> different color, and otherwise obey the same intensity mapping >> correlating to how much data is there. For example, if you've got a 3 >> device FS, the parts of the image that correspond to device 1 would go >> from 0x000000 to 0xFF0000, the parts for device 2 could be 0x000000 to >> 0x00FF00, and the parts for device 3 could be 0x000000 to 0x0000FF. This >> is of course not perfect (you can't tell what device each segment of >> empty space corresponds to), but would probably cover most use cases. >> (for example, with such a scheme, you could look at an image and tell >> whether the data is relatively well distributed across all the devices >> or you might need to re-balance). > > What about linear output separated with lines(or just black)? Linear output does not produce useful images, except for really small filesystems. -- Hans van Kranenburg
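Qu's 100%/0% example from earlier in the thread is easy to demonstrate with a toy model of round-robin striping (an illustration only — the stripe length and layout here are simplified assumptions, not btrfs's actual chunk mapping code): if all extents in a block group happen to sit in odd-numbered stripes, every byte lands on the second device, so per-dev_extent usage can be wildly uneven even though the block group is striped.

```python
STRIPE_LEN = 64 * 1024  # illustrative stripe length, not read from a real fs

def bytes_per_device(extents, num_devices, stripe_len=STRIPE_LEN):
    """Count how many bytes of each (offset, length) extent land on each
    device under simple round-robin (RAID0-style) striping."""
    per_dev = [0] * num_devices
    for offset, length in extents:
        pos, end = offset, offset + length
        while pos < end:
            # bytes remaining in the current stripe, capped at extent end
            chunk = min(stripe_len - (pos % stripe_len), end - pos)
            per_dev[(pos // stripe_len) % num_devices] += chunk
            pos += chunk
    return per_dev

# extents confined to odd stripes: device 0 stays completely empty
skewed = [(1 * STRIPE_LEN, STRIPE_LEN), (3 * STRIPE_LEN, STRIPE_LEN)]
print(bytes_per_device(skewed, 2))

# one contiguous extent spanning four stripes: a perfect 50/50 split
print(bytes_per_device([(0, 4 * STRIPE_LEN)], 2))
```

So the 50/50 intuition holds for large contiguous extents, while small scattered extents can degenerate to 100%/0% — which is why blockgroup-level %usage is only "correct enough" for the eye.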
Re: [Bug 186671] New: OOM on system with just rsync running 32GB of ram 30GB of pagecache
Yes, the short window between the stalls and the panic makes it difficult to manually check much. I could set up a cron every 5 minutes or so if you want. Also, I see the OOM's in 4.8, but it has yet to panic on me. Whereas 4.9rc has panic'd both times I've booted it, so depending on what you want to look at it might be easier to investigate on 4.8. Let me know, I can turn on a couple of the DEBUG config's and build a new 4.8.8. Never looked into a netconsole or serial console. I think just getting the system to use a higher res console would be an improvement, but the OOM's seemed to be the root cause of the panic so I haven't spent any time looking into that as of yet. Thanks, -Eli On Fri, Nov 18, 2016 at 6:54 AM, Tetsuo Handa wrote: > On 2016/11/18 6:49, Vlastimil Babka wrote: >> On 11/16/2016 02:39 PM, E V wrote: >>> System panic'd overnight running 4.9rc5 & rsync. Attached a photo of >>> the stack trace, and the 38 call traces in a 2 minute window shortly >>> before, to the bugzilla case for those not on it's e-mail list: >>> >>> https://bugzilla.kernel.org/show_bug.cgi?id=186671 >> >> The panic screenshot has only the last part, but the end marker says >> it's OOM with no killable processes. The DEBUG_VM config thus didn't >> trigger anything, and still there's tons of pagecache, mostly clean, >> that's not being reclaimed. >> >> Could you now try this? >> - enable CONFIG_PAGE_OWNER >> - boot with kernel option: page_owner=on >> - after the first oom, "cat /sys/kernel/debug/page_owner > file" >> - provide the file (compressed, it will be quite large) > > Excuse me for a noise, but do we really need to do > "cat /sys/kernel/debug/page_owner > file" after the first OOM killer > invocation? I worry that it might be too difficult to do. > Shouldn't we rather do "cat /sys/kernel/debug/page_owner > file" > hourly and compare tendency between the latest one and previous one?
> > This system has swap, and /var/log/messages before panic > reports that swapin was stalling at memory allocation. > > > [130346.262510] dsm_sa_datamgrd: page allocation stalls for 52400ms, order:0, > mode:0x24200ca(GFP_HIGHUSER_MOVABLE) > [130346.262572] CPU: 1 PID: 3622 Comm: dsm_sa_datamgrd Tainted: GW I >4.9.0-rc5 #2 > [130346.262662] 8129ba69 8170e4c8 > c90003ccb8d8 > [130346.262714] 8113449f 024200ca1ca11b40 8170e4c8 > c90003ccb880 > [130346.262765] 0010 c90003ccb8e8 c90003ccb898 > 88041f226e80 > [130346.262817] Call Trace: > [130346.262843] [] ? dump_stack+0x46/0x5d > [130346.262872] [] ? warn_alloc+0x11f/0x140 > [130346.262899] [] ? __alloc_pages_slowpath+0x84b/0xa80 > [130346.262929] [] ? __alloc_pages_nodemask+0x2b0/0x2f0 > [130346.262960] [] ? alloc_pages_vma+0xbe/0x260 > [130346.262989] [] ? pagecache_get_page+0x22/0x280 > [130346.263019] [] ? __read_swap_cache_async+0x118/0x1a0 > [130346.263048] [] ? read_swap_cache_async+0xf/0x30 > [130346.263077] [] ? swapin_readahead+0x16e/0x1c0 > [130346.263106] [] ? radix_tree_lookup_slot+0xe/0x20 > [130346.263135] [] ? find_get_entry+0x14/0x130 > [130346.263162] [] ? pagecache_get_page+0x22/0x280 > [130346.263193] [] ? do_swap_page+0x44f/0x5f0 > [130346.263220] [] ? __radix_tree_lookup+0x62/0xc0 > [130346.263249] [] ? handle_mm_fault+0x66a/0xf00 > [130346.263277] [] ? find_get_entry+0x14/0x130 > [130346.263305] [] ? __do_page_fault+0x1c5/0x490 > [130346.263336] [] ? page_fault+0x22/0x30 > [130346.263364] [] ? copy_user_generic_string+0x2c/0x40 > [130346.263395] [] ? set_fd_set+0x1d/0x30 > [130346.263422] [] ? core_sys_select+0x1a5/0x260 > [130346.263450] [] ? getname_flags+0x6a/0x1e0 > [130346.263479] [] ? cp_new_stat+0x115/0x130 > [130346.263509] [] ? ktime_get_ts64+0x3f/0xf0 > [130346.263537] [] ? SyS_select+0xa5/0xe0 > [130346.263564] [] ? entry_SYSCALL_64_fastpath+0x13/0x94 > > > Under such situation, trying to login and execute /bin/cat could take minutes. 
> Also, writing to btrfs and ext4 seems to be stalling. The btrfs one is a > situation where WQ_MEM_RECLAIM kernel workqueue is unable to make progress. > > > [130420.008231] kworker/u34:21: page allocation stalls for 35028ms, order:0, > mode:0x2400840(GFP_NOFS|__GFP_NOFAIL) > [130420.008287] CPU: 5 PID: 24286 Comm: kworker/u34:21 Tainted: GW I >4.9.0-rc5 #2 > [130420.008401] Workqueue: btrfs-extent-refs btrfs_extent_refs_helper [btrfs] > [130420.008432] 8129ba69 8170e4c8 > c900087836a0 > [130420.008483] 8113449f 024008401e3f1b40 8170e4c8 > c90008783648 > [130420.008534] 0010 c900087836b0 c90008783660 > 88041ecc4340 > [130420.008586] Call Trace: > [130420.008611] [] ? dump_stack+0x46/0x5d > [130420.008640] [] ? warn_alloc+0x11f/0x140 >
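For the hourly-snapshot idea, something along these lines could aggregate a page_owner dump by allocation stack so two snapshots can be compared (a hypothetical sketch, not an existing tool — the exact record format varies between kernel versions, so the parsing below only assumes blank-line-separated records whose header line contains "order N", with the stack trace on the following lines):

```python
import re
from collections import Counter

def pages_by_stack(dump_text):
    """Aggregate a /sys/kernel/debug/page_owner dump: bytes per allocation stack.

    Assumes records separated by blank lines, each starting with a header
    like "Page allocated via order 0, ..." followed by stack trace lines.
    """
    counts = Counter()
    for record in dump_text.strip().split("\n\n"):
        lines = [l.strip() for l in record.splitlines() if l.strip()]
        m = re.search(r"order (\d+)", lines[0])
        if not m:
            continue  # skip records we can't parse
        stack = tuple(lines[1:])
        counts[stack] += 4096 << int(m.group(1))  # assumes 4 KiB base pages
    return counts

def snapshot_diff(old, new, limit=5):
    """Stacks whose memory footprint grew the most between two snapshots."""
    growth = {s: new[s] - old.get(s, 0) for s in new}
    return sorted(growth.items(), key=lambda kv: -kv[1])[:limit]
```

Running pages_by_stack() on each hourly dump and diffing consecutive results would show which allocation sites are accumulating pages, without having to catch the system right after the first OOM.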
Re: [PATCH] Btrfs: remove rb_node field from the delayed ref node structure
On Fri, Nov 18, 2016 at 09:38:44AM +, fdman...@kernel.org wrote: > From: Filipe Manana > > After the last big change in the delayed references code that was needed > for the last qgroups rework, the red black tree node field of struct > btrfs_delayed_ref_node is no longer used, so just remove it; this helps > us save some memory (since struct rb_node is 24 bytes on x86_64) for > these structures. > > Signed-off-by: Filipe Manana Reviewed-by: David Sterba
Re: Btrfs Heatmap - v2 - block group internals!
Ha, On 11/18/2016 01:36 PM, Austin S. Hemmelgarn wrote: > On 2016-11-17 16:08, Hans van Kranenburg wrote: >> On 11/17/2016 08:27 PM, Austin S. Hemmelgarn wrote: >>> On 2016-11-17 13:51, Hans van Kranenburg wrote: >> But, the fun with visualizations of data is that you learn whether they >> just work(tm) or don't as soon as you see them. Mathematical or >> algorithmic beauty is not always a good recipe for beauty as seen by the >> human eye. >> >> So, let's gather a bunch of ideas which we can try out and then observe >> the result. >> >> Before doing so, I'm going to restructure the code a bit more so I can >> write another script in the same directory, just doing import heatmap >> and calling a few functions in there to quickly try stuff, bypassing the >> normal cli api. >> >> Also, the png writing handling is now done by some random png library >> that I found, which requires me to build (or copy/resize) an entire >> pixel grid in memory, explicitely listing all pixel values, which is a >> bit of a memory hog for bigger pictures, so I want to see if something >> can be done there also. > I haven't had a chance to look at the code yet, but do you have an > option to control how much data a pixel represents? On a multi TB > filesystem for example, you may not care about exact data, just an > overall view of the data, in which case making each pixel represent a > larger chunk of data (and thus reducing the resolution of the image) > would almost certainly save some memory on big filesystems. --order, which defines the hilbert curve order. Example: for a 238GiB filesystem, when specifying --order 7, then 2**7 = 128, so 128x128 = 16384 pixels, which means that a single one represents ~16MiB when --size > --order, the image simply gets scaled up. When not specifying --order, a number gets chosen automatically with which bytes per pixel is closest to 32MiB. When size is not specified, it's 10, or same as order if order is greater than 10. 
Now this output should make sense: -# ./heatmap.py /mnt/238GiB max_id 1 num_devices 1 fsid ed108358-c746-4e76-a071-3820d423a99d nodesize 16384 sectorsize 4096 clone_alignment 4096 scope filesystem curve hilbert order 7 size 10 pngfile fsid_ed10a358-c846-4e76-a071-3821d423a99d_at_1479473532.png grid height 128 width 128 total_bytes 255057723392 bytes_per_pixel 15567488.0 pixels 16384 -# ./heatmap.py /mnt/40TiB max_id 2 num_devices 2 fsid 9bc9947e-070f-4bbc-872e-49b2a39b3f7b nodesize 16384 sectorsize 4096 clone_alignment 4096 scope filesystem curve hilbert order 10 size 10 pngfile /home/beheer/heatmap/generated/fsid_9bd9947e-070f-4cbc-8e2e-49b3a39b8f7b_at_1479473950.png grid height 1024 width 1024 total_bytes 46165378727936 bytes_per_pixel 44026736.0 pixels 1048576 -- Hans van Kranenburg
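For reference, the automatic order selection described above ("bytes per pixel closest to 32MiB") can be reproduced in a few lines — this is a sketch of the selection rule as stated, not the actual heatmap.py implementation, and the search bounds are an assumption:

```python
TARGET_BYTES_PER_PIXEL = 32 * 1024 * 1024  # 32MiB target, per the description

def choose_order(total_bytes, min_order=4, max_order=12):
    """Pick the hilbert curve order whose bytes-per-pixel is closest to 32MiB.

    An order-N curve fills a 2**N x 2**N grid, so pixel count grows as 4**N.
    """
    def distance(order):
        pixels = (2 ** order) ** 2
        return abs(total_bytes / pixels - TARGET_BYTES_PER_PIXEL)
    return min(range(min_order, max_order + 1), key=distance)

# the two filesystems from the output above
print(choose_order(255057723392))    # 238GiB filesystem
print(choose_order(46165378727936))  # 40TiB filesystem
```

Plugging in the total_bytes values from both runs reproduces the orders shown in the output (7 and 10), which is a quick sanity check that the rule is stated correctly.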
Re: [PATCH 3/3] Btrfs: remove unused code when creating and merging reloc trees
On 11/18/2016 04:37 AM, fdman...@kernel.org wrote: From: Filipe Manana In commit 5bc7247ac47c (Btrfs: fix broken nocow after balance) we started abusing the rtransid and otransid fields of root items from relocation trees to fix some issues with nodatacow mode. However later in commit ba8b0289333a (Btrfs: do not reset last_snapshot after relocation) we dropped the code that made use of those fields but did not remove the code that sets those fields. So just remove them to avoid confusion. Signed-off-by: Filipe Manana Reviewed-by: Josef Bacik Thanks, Josef
Re: [PATCH 2/3] Btrfs: fix relocation incorrectly dropping data references
On 11/18/2016 04:37 AM, fdman...@kernel.org wrote: From: Filipe Manana During relocation of a data block group we create a relocation tree for each fs/subvol tree by making a snapshot of each tree using btrfs_copy_root() and the tree's commit root, and then setting the last snapshot field for the fs/subvol tree's root to the value of the current transaction id minus 1. However this can lead to relocation later dropping references that it did not create if we have qgroups enabled, leaving the filesystem in an inconsistent state that keeps aborting transactions. Let's consider the following example to explain the problem, which requires qgroups to be enabled. We are relocating data block group Y, we have a subvolume with id 258 that has a root at level 1, that subvolume is used to store directory entries for snapshots and we are currently at transaction 3404. When committing transaction 3404, we have a pending snapshot and therefore we call btrfs_run_delayed_items() at transaction.c:create_pending_snapshot() in order to create its dentry at subvolume 258. This results in COWing leaf A from root 258 in order to add the dentry. Note that leaf A also contains file extent items referring to extents from some other block group X (we are currently relocating block group Y). Later on, still at create_pending_snapshot(), we call qgroup_account_snapshot(), which switches the commit root for root 258 when it calls switch_commit_roots(), so now the COWed version of leaf A, let's call it leaf A', is accessible from the commit root of tree 258. At the end of qgroup_account_snapshot(), we call record_root_in_trans() with 258 as its argument, which results in btrfs_init_reloc_root() being called, which in turn calls relocation.c:create_reloc_root() in order to create a relocation tree associated to root 258, which results in assigning the value of 3403 (which is the current transaction id minus 1 = 3404 - 1) to the last_snapshot field of root 258.
When creating the relocation tree root at ctree.c:btrfs_copy_root() we add a shared reference for leaf A', corresponding to the relocation tree's root, when we call btrfs_inc_ref() against the COWed root (a copy of the commit root from tree 258), which is at level 1. So at this point leaf A' has 2 references, one normal reference corresponding to root 258 and one shared reference corresponding to the root of the relocation tree. Transaction 3404 finishes its commit and transaction 3405 is started by relocation when calling merge_reloc_root() for the relocation tree associated to root 258. In the meanwhile leaf A' is COWed again, in response to some filesystem operation, when we are still at transaction 3405. However when we COW leaf A', at ctree.c:update_ref_for_cow(), we call btrfs_block_can_be_shared() in order to figure out if other trees refer to the leaf and, if any such tree exists, add a full back reference to leaf A' - but btrfs_block_can_be_shared() incorrectly returns false because the following condition is false: btrfs_header_generation(buf) <= btrfs_root_last_snapshot(&root->root_item) which evaluates to 3404 <= 3403. So after leaf A' is COWed, it stays with only one reference, corresponding to the shared reference we created when we called btrfs_copy_root() to create the relocation tree's root, and btrfs_inc_ref() ends up not being called for leaf A' nor do we end up setting the flag BTRFS_BLOCK_FLAG_FULL_BACKREF in leaf A'. This results in not adding shared references for the extents from block group X that leaf A' refers to with its file extent items. Later, after merging the relocation root, we do a call to btrfs_drop_snapshot() in order to delete the relocation tree. This ends up calling do_walk_down() when path->slots[1] points to leaf A', which results in calling btrfs_lookup_extent_info() to get the number of references for leaf A', which is 1 at this time (only the shared reference exists) and this value is stored at wc->refs[0].
After this, walk_up_proc() is called when wc->level is 0 and path->nodes[0] corresponds to leaf A'. Because the current level is 0 and wc->refs[0] is 1, it does call btrfs_dec_ref() against leaf A', which results in removing the single reference that each of the extents from block group X has, associated to root 258 - the expectation was to have each of these extents with 2 references - one reference for root 258 and one shared reference related to the root of the relocation tree - and so we would drop only the shared reference (because leaf A' was supposed to have the flag BTRFS_BLOCK_FLAG_FULL_BACKREF set). This leaves the filesystem in an inconsistent state as we now have file extent items in a subvolume tree that point to extents from block group X without references in the extent tree. So later on when we try to decrement the references for these extents, for example due to a file unlink operation, truncate operation or overwriting ranges of a file, we fail because the expected references do not exist in
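The off-by-one at the heart of this walkthrough can be condensed into a toy model of the btrfs_block_can_be_shared() check (illustrative Python only, not kernel code):

```python
def block_can_be_shared(header_generation, last_snapshot):
    """Mirror of the check described above: a block may be shared with
    another tree only if it was written in or before the transaction
    recorded as the root's last snapshot."""
    return header_generation <= last_snapshot

# Leaf A' was COWed in transaction 3404, but create_reloc_root() stored
# last_snapshot = 3404 - 1 = 3403, so the leaf is wrongly judged unshared
# even though the relocation tree's root also references it.
print(block_can_be_shared(3404, 3404 - 1))  # the buggy outcome
print(block_can_be_shared(3404, 3404))      # what the fix makes true
```

This is why the patch's approach of not subtracting 1 from the transaction id (so the comparison becomes 3404 <= 3404) restores the full back reference path.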
BCache
Hello, I use „BCache“ with a BTrFS RAID of 4 hard drives. Normal use seems to work well. I have no problems. I have the BTrFS RAID mounted through „/dev/bcache3“. But when I remove a disk (simulating that it's broken), BTrFS doesn't inform me about a „missing disk“: # btrfs fi show Label: 'RAID' uuid: d0e2e2eb-2df7-454f-8446-5213cec2de3c Total devices 4 FS bytes used 12.55GiB devid1 size 465.76GiB used 6.00GiB path /dev/bcache3 devid2 size 931.51GiB used 6.00GiB path /dev/bcache1 devid3 size 596.17GiB used 6.00GiB path /dev/bcache2 devid4 size 465.76GiB used 6.00GiB path /dev/bcache0 One of these (actually „/dev/bcache3“ or „/dev/sde“) should be broken or missing. I can't make a „btrfs device delete missing“ either. It replies „ERROR: not a btrfs filesystem:“. It doesn't matter if I use „/dev/bcacheX“ or „/dev/sdX“ for that. And if I make a „btrfs device delete missing /mnt/raid“, it replies: „ERROR: error removing device 'missing': no missing devices found to remove“. It looks to me as if BCache hides this information from BTrFS. Is this possible? What do you think? Thanks
Re: Btrfs Heatmap - v2 - block group internals!
On 2016-11-17 16:08, Hans van Kranenburg wrote: On 11/17/2016 08:27 PM, Austin S. Hemmelgarn wrote: On 2016-11-17 13:51, Hans van Kranenburg wrote: When generating a picture of a file system with multiple devices, boundaries between the separate devices are not visible now. If someone has a brilliant idea about how to do this without throwing out actual usage data... The first thought that comes to mind for me is to make each device be a different color, and otherwise obey the same intensity mapping correlating to how much data is there. For example, if you've got a 3 device FS, the parts of the image that correspond to device 1 would go from 0x000000 to 0xFF0000, the parts for device 2 could be 0x000000 to 0x00FF00, and the parts for device 3 could be 0x000000 to 0x0000FF. This is of course not perfect (you can't tell what device each segment of empty space corresponds to), but would probably cover most use cases. (for example, with such a scheme, you could look at an image and tell whether the data is relatively well distributed across all the devices or you might need to re-balance). "most use cases" -> what are those use cases? If you want to know how much total GiB or TiB is present on all devices, a simple btrfs fi show does suffice. Visualizing how the data patterning differs across devices would be the biggest one that comes to mind. Another option is to just write three images, one for each of the devices. :) Those are more easily compared. That would actually be more useful probably, as you can then do pretty much whatever post-processing you want, and it would cover the above use case just as well. The first idea with color that I had was to use two different colors for data and metadata. When also using separate colors for devices, it might all together become a big mess really quickly, or, maybe a beautiful rainbow. I actually like that idea a lot better than using color for differentiating between devices.
But, the fun with visualizations of data is that you learn whether they just work(tm) or don't as soon as you see them. Mathematical or algorithmic beauty is not always a good recipe for beauty as seen by the human eye. So, let's gather a bunch of ideas which we can try out and then observe the result. Before doing so, I'm going to restructure the code a bit more so I can write another script in the same directory, just doing import heatmap and calling a few functions in there to quickly try stuff, bypassing the normal cli api. Also, the png writing handling is now done by some random png library that I found, which requires me to build (or copy/resize) an entire pixel grid in memory, explicitly listing all pixel values, which is a bit of a memory hog for bigger pictures, so I want to see if something can be done there also. I haven't had a chance to look at the code yet, but do you have an option to control how much data a pixel represents? On a multi TB filesystem for example, you may not care about exact data, just an overall view of the data, in which case making each pixel represent a larger chunk of data (and thus reducing the resolution of the image) would almost certainly save some memory on big filesystems.
Re: [Bug 186671] New: OOM on system with just rsync running 32GB of ram 30GB of pagecache
It could be totally unrelated but I have a similar problem: processes get randomly OOM'd when I am doing anything "sort of heavy" on my Btrfs filesystems. I did some "evil tuning", so I assumed that must be the problem (even if the values looked sane for my system). Thus, I kept cutting back on the manually set values (mostly dirty/background ratio, io scheduler request queue size and such tunables) but it seems to be a dead end. I guess anything I change in order to try and cut back on the related memory footprint just makes the OOMs less frequent but it's only a matter of time and coincidence (lots of things randomly happen to do some notable amount of IO) until OOMs happen anyway. It seems to be plenty enough to start a defrag or balance on more than a single filesystem (in parallel) and pretty much any notable "useful" user load will have a high chance of triggering OOMs (and get killed) sooner or later. It's just my limited observation but database-like loads [like that of bitcoind] (sync writes and/or frequent flushes?) or high priority buffered writes (ffmpeg running with higher than default priority and saving live video streams into files without recoding) seem to have a higher chance of triggering this (more so than simply reading or writing files sequentially and asynchronously, either locally or through Samba). I am on gentoo-sources 4.8.8 right now but it was there with 4.7.x as well. On Thu, Nov 17, 2016 at 10:49 PM, Vlastimil Babka wrote: > On 11/16/2016 02:39 PM, E V wrote: >> System panic'd overnight running 4.9rc5 & rsync. Attached a photo of >> the stack trace, and the 38 call traces in a 2 minute window shortly >> before, to the bugzilla case for those not on it's e-mail list: >> >> https://bugzilla.kernel.org/show_bug.cgi?id=186671 > > The panic screenshot has only the last part, but the end marker says > it's OOM with no killable processes.
The DEBUG_VM config thus didn't > trigger anything, and still there's tons of pagecache, mostly clean, > that's not being reclaimed. > > Could you now try this? > - enable CONFIG_PAGE_OWNER > - boot with kernel option: page_owner=on > - after the first oom, "cat /sys/kernel/debug/page_owner > file" > - provide the file (compressed, it will be quite large) > > Vlastimil
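For anyone following along, those steps look roughly like the sketch below (assumptions: the kernel has been rebuilt with CONFIG_PAGE_OWNER=y, and debugfs is mounted at the usual /sys/kernel/debug):

```shell
# Prerequisites (cannot be changed on a running system):
#   1. rebuild the kernel with CONFIG_PAGE_OWNER=y
#   2. add "page_owner=on" to the kernel command line and reboot
# If debugfs is not mounted yet: mount -t debugfs none /sys/kernel/debug

# After the first OOM, capture the page allocation ownership data and
# compress it, since the dump is typically quite large.
if [ -r /sys/kernel/debug/page_owner ]; then
    cat /sys/kernel/debug/page_owner > page_owner.txt
    gzip -f page_owner.txt
else
    echo "page_owner data not available (kernel config/boot option missing?)" >&2
fi
```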
Re: [Bug 186671] New: OOM on system with just rsync running 32GB of ram 30GB of pagecache
On 2016/11/18 6:49, Vlastimil Babka wrote: > On 11/16/2016 02:39 PM, E V wrote: >> System panic'd overnight running 4.9rc5 & rsync. Attached a photo of >> the stack trace, and the 38 call traces in a 2 minute window shortly >> before, to the bugzilla case for those not on it's e-mail list: >> >> https://bugzilla.kernel.org/show_bug.cgi?id=186671 > > The panic screenshot has only the last part, but the end marker says > it's OOM with no killable processes. The DEBUG_VM config thus didn't > trigger anything, and still there's tons of pagecache, mostly clean, > that's not being reclaimed. > > Could you now try this? > - enable CONFIG_PAGE_OWNER > - boot with kernel option: page_owner=on > - after the first oom, "cat /sys/kernel/debug/page_owner > file" > - provide the file (compressed, it will be quite large) Excuse me for a noise, but do we really need to do "cat /sys/kernel/debug/page_owner > file" after the first OOM killer invocation? I worry that it might be too difficult to do. Shouldn't we rather do "cat /sys/kernel/debug/page_owner > file" hourly and compare tendency between the latest one and previous one? This system has swap, and /var/log/messages before panic reports that swapin was stalling at memory allocation. [130346.262510] dsm_sa_datamgrd: page allocation stalls for 52400ms, order:0, mode:0x24200ca(GFP_HIGHUSER_MOVABLE) [130346.262572] CPU: 1 PID: 3622 Comm: dsm_sa_datamgrd Tainted: GW I 4.9.0-rc5 #2 [130346.262662] 8129ba69 8170e4c8 c90003ccb8d8 [130346.262714] 8113449f 024200ca1ca11b40 8170e4c8 c90003ccb880 [130346.262765] 0010 c90003ccb8e8 c90003ccb898 88041f226e80 [130346.262817] Call Trace: [130346.262843] [] ? dump_stack+0x46/0x5d [130346.262872] [] ? warn_alloc+0x11f/0x140 [130346.262899] [] ? __alloc_pages_slowpath+0x84b/0xa80 [130346.262929] [] ? __alloc_pages_nodemask+0x2b0/0x2f0 [130346.262960] [] ? alloc_pages_vma+0xbe/0x260 [130346.262989] [] ? pagecache_get_page+0x22/0x280 [130346.263019] [] ? 
__read_swap_cache_async+0x118/0x1a0 [130346.263048] [] ? read_swap_cache_async+0xf/0x30 [130346.263077] [] ? swapin_readahead+0x16e/0x1c0 [130346.263106] [] ? radix_tree_lookup_slot+0xe/0x20 [130346.263135] [] ? find_get_entry+0x14/0x130 [130346.263162] [] ? pagecache_get_page+0x22/0x280 [130346.263193] [] ? do_swap_page+0x44f/0x5f0 [130346.263220] [] ? __radix_tree_lookup+0x62/0xc0 [130346.263249] [] ? handle_mm_fault+0x66a/0xf00 [130346.263277] [] ? find_get_entry+0x14/0x130 [130346.263305] [] ? __do_page_fault+0x1c5/0x490 [130346.263336] [] ? page_fault+0x22/0x30 [130346.263364] [] ? copy_user_generic_string+0x2c/0x40 [130346.263395] [] ? set_fd_set+0x1d/0x30 [130346.263422] [] ? core_sys_select+0x1a5/0x260 [130346.263450] [] ? getname_flags+0x6a/0x1e0 [130346.263479] [] ? cp_new_stat+0x115/0x130 [130346.263509] [] ? ktime_get_ts64+0x3f/0xf0 [130346.263537] [] ? SyS_select+0xa5/0xe0 [130346.263564] [] ? entry_SYSCALL_64_fastpath+0x13/0x94 Under such situation, trying to login and execute /bin/cat could take minutes. Also, writing to btrfs and ext4 seems to be stalling. The btrfs one is a situation where WQ_MEM_RECLAIM kernel workqueue is unable to make progress. [130420.008231] kworker/u34:21: page allocation stalls for 35028ms, order:0, mode:0x2400840(GFP_NOFS|__GFP_NOFAIL) [130420.008287] CPU: 5 PID: 24286 Comm: kworker/u34:21 Tainted: GW I 4.9.0-rc5 #2 [130420.008401] Workqueue: btrfs-extent-refs btrfs_extent_refs_helper [btrfs] [130420.008432] 8129ba69 8170e4c8 c900087836a0 [130420.008483] 8113449f 024008401e3f1b40 8170e4c8 c90008783648 [130420.008534] 0010 c900087836b0 c90008783660 88041ecc4340 [130420.008586] Call Trace: [130420.008611] [] ? dump_stack+0x46/0x5d [130420.008640] [] ? warn_alloc+0x11f/0x140 [130420.008667] [] ? __alloc_pages_slowpath+0x84b/0xa80 [130420.008707] [] ? search_bitmap+0xc2/0x140 [btrfs] [130420.008736] [] ? __alloc_pages_nodemask+0x2b0/0x2f0 [130420.008766] [] ? alloc_pages_current+0x8a/0x110 [130420.008796] [] ? 
pagecache_get_page+0xec/0x280 [130420.008836] [] ? alloc_extent_buffer+0x108/0x430 [btrfs] [130420.008875] [] ? btrfs_alloc_tree_block+0x118/0x4d0 [btrfs] [130420.008927] [] ? __btrfs_cow_block+0x148/0x5d0 [btrfs] [130420.008964] [] ? btrfs_cow_block+0x114/0x1d0 [btrfs] [130420.009001] [] ? btrfs_search_slot+0x206/0xa40 [btrfs] [130420.009039] [] ? lookup_inline_extent_backref+0xd9/0x620 [btrfs] [130420.009095] [] ? set_extent_bit+0x24/0x30 [btrfs] [130420.009124] [] ? kmem_cache_alloc+0x17f/0x1b0 [130420.009161] [] ? __btrfs_free_extent.isra.69+0xef/0xd10 [btrfs] [130420.009215] [] ? btrfs_merge_delayed_refs+0x56/0x6f0 [btrfs] [130420.009269] [] ?
Re: Announcing btrfs-dedupe
On giovedì 17 novembre 2016 04:01:52 CET, Zygo Blaxell wrote: Duperemove does use a lot of memory, but the logs at that URL only show 2G of RAM in duperemove--not nearly enough to trigger OOM under normal conditions on an 8G machine. There's another process with 6G of virtual address space (although much less than that resident) that looks more interesting (i.e. duperemove might just be the victim of some interaction between baloo_file and the OOM killer). Thanks, I killed baloo_file before starting duperemove and it somehow improved (it reached 99.73% before getting killed by OOM killer once again): [ 6342.147251] Purging GPU memory, 0 pages freed, 18268 pages still pinned. [ 6342.147253] 48 and 0 pages still available in the bound and unbound GPU page lists. [ 6342.147340] Xorg invoked oom-killer: gfp_mask=0x240c0d0(GFP_TEMPORARY|__GFP_COMP|__GFP_ZERO), order=3, oom_score_adj=0 [ 6342.147341] Xorg cpuset=/ mems_allowed=0 [ 6342.147346] CPU: 3 PID: 650 Comm: Xorg Not tainted 4.8.8-2-ARCH #1 [ 6342.147347] Hardware name: Dell Inc. XPS 13 9343/0F5KF3, BIOS A09 08/29/2016 [ 6342.147348] 0286 9b89a9c8 88020752f598 812fde10 [ 6342.147351] 88020752f758 8801edc62ac0 88020752f608 81205fa2 [ 6342.147353] 000188020752f5a0 9b89a9c8 [ 6342.147356] Call Trace: [ 6342.147361] [] dump_stack+0x63/0x83 [ 6342.147364] [] dump_header+0x5c/0x1ea [ 6342.147366] [] oom_kill_process+0x265/0x410 [ 6342.147368] [] ? has_capability_noaudit+0x17/0x20 [ 6342.147369] [] out_of_memory+0x380/0x420 [ 6342.147373] [] ? find_next_bit+0x18/0x20 [ 6342.147374] [] __alloc_pages_nodemask+0xda0/0xde0 [ 6342.147377] [] alloc_pages_current+0x95/0x140 [ 6342.147380] [] kmalloc_order_trace+0x2e/0xf0 [ 6342.147382] [] __kmalloc+0x1ea/0x200 [ 6342.147397] [] ? alloc_gen8_temp_bitmaps+0x2e/0x80 [i915] [ 6342.147407] [] alloc_gen8_temp_bitmaps+0x47/0x80 [i915] [ 6342.147417] [] gen8_alloc_va_range_3lvl+0x98/0x9c0 [i915] [ 6342.147419] [] ? shmem_getpage_gfp+0xed/0xc30 [ 6342.147421] [] ? 
sg_init_table+0x1a/0x40 [ 6342.147423] [] ? swiotlb_map_sg_attrs+0x53/0x130 [ 6342.147432] [] gen8_alloc_va_range+0x256/0x490 [i915] [ 6342.147442] [] i915_vma_bind+0x9b/0x190 [i915] [ 6342.147453] [] i915_gem_object_do_pin+0x86b/0xa90 [i915] [ 6342.147463] [] i915_gem_object_pin+0x2d/0x30 [i915] [ 6342.147472] [] i915_gem_execbuffer_reserve_vma.isra.7+0x9f/0x180 [i915] [ 6342.147482] [] i915_gem_execbuffer_reserve.isra.8+0x396/0x3c0 [i915] [ 6342.147491] [] i915_gem_do_execbuffer.isra.14+0x68b/0x1270 [i915] [ 6342.147493] [] ? unix_stream_read_generic+0x281/0x8a0 [ 6342.147503] [] i915_gem_execbuffer2+0x104/0x270 [i915] [ 6342.147509] [] drm_ioctl+0x200/0x4f0 [drm] [ 6342.147518] [] ? i915_gem_execbuffer+0x330/0x330 [i915] [ 6342.147520] [] ? enqueue_hrtimer+0x3d/0xa0 [ 6342.147522] [] ? timerqueue_del+0x24/0x70 [ 6342.147523] [] ? __remove_hrtimer+0x3c/0x90 [ 6342.147525] [] do_vfs_ioctl+0xa3/0x5f0 [ 6342.147527] [] ? do_setitimer+0x12b/0x230 [ 6342.147529] [] ? __fget+0x77/0xb0 [ 6342.147531] [] SyS_ioctl+0x79/0x90 [ 6342.147533] [] entry_SYSCALL_64_fastpath+0x1a/0xa4 [ 6342.147535] Mem-Info: [ 6342.147538] active_anon:76311 inactive_anon:76782 isolated_anon:0 active_file:347581 inactive_file:1415592 isolated_file:64 unevictable:8 dirty:482 writeback:0 unstable:0 slab_reclaimable:27219 slab_unreclaimable:14772 mapped:20714 shmem:30458 pagetables:10557 bounce:0 free:25642 free_pcp:327 free_cma:0 [ 6342.147541] Node 0 active_anon:305244kB inactive_anon:307128kB active_file:1390324kB inactive_file:5662368kB unevictable:32kB isolated(anon):0kB isolated(file):256kB mapped:82856kB dirty:1928kB writeback:0kB shmem:0kB shmem_thp: 0kB shmem_pmdmapped: 81920kB anon_thp: 121832kB writeback_tmp:0kB unstable:0kB pages_scanned:32 all_unreclaimable? 
no [ 6342.147542] Node 0 DMA free:15688kB min:132kB low:164kB high:196kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:15984kB managed:15896kB mlocked:0kB slab_reclaimable:0kB slab_unreclaimable:208kB kernel_stack:0kB pagetables:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB [ 6342.147545] lowmem_reserve[]: 0 3395 7850 7850 7850 [ 6342.147548] Node 0 DMA32 free:48772kB min:29172kB low:36464kB high:43756kB active_anon:84724kB inactive_anon:87164kB active_file:555728kB inactive_file:2639796kB unevictable:0kB writepending:696kB present:3564504kB managed:3488752kB mlocked:0kB slab_reclaimable:47196kB slab_unreclaimable:11472kB kernel_stack:192kB pagetables:200kB bounce:0kB free_pcp:1284kB local_pcp:0kB free_cma:0kB [ 6342.147553] lowmem_reserve[]: 0 0 4454 4454 4454 [ 6342.147555] Node 0 Normal free:38108kB min:38276kB low:47844kB high:57412kB active_anon:220520kB inactive_anon:219964kB active_file:834596kB
Re: [PATCH 1/3] Btrfs: remove balance warning that does not reflect a problem
On Fri, Nov 18, 2016 at 9:37 AM,wrote: > From: Filipe Manana > > On openSUSE/SLE systems where balance is triggered periodically in the > background, snapshotting happens when doing package installations and > upgrades, and (by default) the root system is organized with multiple > subvolumes, the following warning was triggered often: > > [ 630.773059] WARNING: CPU: 1 PID: 2549 at fs/btrfs/relocation.c:1848 > replace_path+0x3f0/0x940 [btrfs] > [ 630.773060] Modules linked in: af_packet iscsi_ibft iscsi_boot_sysfs xfs > libcrc32c acpi_cpufreq tpm_tis ppdev tpm parport_pc parport pcspkr e1000 > qemu_fw_cfg joydev i2c_piix4 button btrfs xor raid6_pq sr_mod cdrom > ata_generic virtio_scsi bochs_drm drm_kms_helper syscopyarea sysfillrect > sysimgblt > fb_sys_fops ttm ata_piix virtio_pci virtio_ring virtio serio_raw floppy drm sg > [ 630.773070] CPU: 1 PID: 2549 Comm: btrfs Tainted: GW > 4.7.7-2-btrfs+ #2 > [ 630.773071] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS > rel-1.9.1-0-gb3ef39f-prebuilt.qemu-project.org 04/01/2014 > [ 630.773072] 8801f704b8c8 813afd12 > > [ 630.773073] 8801f704b908 81081f8b > 0738 > [ 630.773075] 0001 8801e32eb8c0 1600 > 8800 > [ 630.773076] Call Trace: > [ 630.773078] [] dump_stack+0x63/0x81 > [ 630.773079] [] __warn+0xcb/0xf0 > [ 630.773080] [] warn_slowpath_null+0x1d/0x20 > [ 630.773090] [] replace_path+0x3f0/0x940 [btrfs] > [ 630.773092] [] ? ring_buffer_unlock_commit+0x3e/0x2a0 > [ 630.773102] [] merge_reloc_root+0x2b4/0x600 [btrfs] > [ 630.773111] [] merge_reloc_roots+0x140/0x250 [btrfs] > [ 630.773120] [] relocate_block_group+0x317/0x680 [btrfs] > [ 630.773129] [] btrfs_relocate_block_group+0x1cc/0x2d0 > [btrfs] > [ 630.773139] [] btrfs_relocate_chunk.isra.40+0x56/0xf0 > [btrfs] > [ 630.773149] [] __btrfs_balance+0x8d5/0xbb0 [btrfs] > [ 630.773159] [] btrfs_balance+0x2d0/0x5e0 [btrfs] > [ 630.773168] [] btrfs_ioctl_balance+0x383/0x390 [btrfs] > [ 630.773178] [] btrfs_ioctl+0x90f/0x1fb0 [btrfs] > [ 630.773180] [] ? 
pte_alloc_one+0x33/0x40 > [ 630.773182] [] do_vfs_ioctl+0x93/0x5a0 > [ 630.773183] [] ? __do_page_fault+0x203/0x4e0 > [ 630.773185] [] SyS_ioctl+0x79/0x90 > [ 630.773186] [] entry_SYSCALL_64_fastpath+0x1e/0xa8 > [ 630.773187] ---[ end trace 2cd6167577bf3be7 ]--- > > It turned out that this warning does not reflect any problem and just > makes users/system administrators worry about something going wrong. > The warning happens because when we create a relocation root (which is > just a snapshot of a subvolume tree) we set its last_snapshot field (as > well as for the subvolume's tree root) to a value corresponding to the > generation of the current transaction minus 1 (we do this at > relocation.c:create_reloc_root()). This means that when we merge the > relocation tree with the corresponding subvolume tree, at > walk_down_reloc_tree() we can catch pointers (bytenr/generation pairs) > with a generation that matches the generation of the transaction where > we created the relocation root, so those pointers correspond to tree > blocks created either before or after the relocation root was created. > If they were created before the relocation root (and in the same > transaction) we hit the warning, which we can safely remove because it > means the tree blocks are accessible from both trees (the subvolume > tree and the relocation tree). > > So fix this by removing the warning and adding a couple assertions that > verify the pointer generations match and that their generation matches > the value of the last_snapshot field from the relocation tree plus 1. > > Signed-off-by: Filipe Manana Ignore this patch in the series, since the second patch makes this one unnecessary. 
thanks > --- > fs/btrfs/relocation.c | 21 - > 1 file changed, 20 insertions(+), 1 deletion(-) > > diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c > index 0ec8ffa..cdc1a1c 100644 > --- a/fs/btrfs/relocation.c > +++ b/fs/btrfs/relocation.c > @@ -1848,7 +1848,26 @@ again: > new_ptr_gen = 0; > } > > - if (WARN_ON(new_bytenr > 0 && new_bytenr == old_bytenr)) { > + /* > +* When we create the reloc root (which is a snapshot of the > +* subvolume tree) we set its last_snapshot field (as well as > +* for the subvolume's tree root) to the value of the current > +* transaction generation minus 1 (at create_reloc_root()). > +* This means that at walk_down_reloc_tree() we can catch > +* pointers (bytenr/generation pairs) with a generation > +* matching the generation of the transaction where we created > +* the reloc
Re: root backup-reformat-restore
On Fri, Nov 18, 2016 at 04:06:22AM +0200, Marcus Sundman wrote: > On 18.11.2016 02:52, Hugo Mills wrote: > >On Fri, Nov 18, 2016 at 02:38:25AM +0200, Marcus Sundman wrote: > >>The FAQ says that "the best solution for small devices (under about > >>16 GB) is to reformat the FS with the --mixed option to mkfs.btrfs". > >> > >>OK. Does anyone have any good suggestions for doing that with an > >>existing / partition (which has special files and whatnot)? > >> > >>I assume a backup-restore cycle is needed, but with which program(s)? > > - If you don't have (or don't care about) any snapshots, then > > - if you can mount both the old and the new FS at the same time > >- mv > > It's the same partition, so I probably won't have the new FS there > at the same time. > > > - else, if you want to replace the old FS with the new > >- tar to a file elsewhere, and untar later > > Can tar really preserve all special files' attributes? (I have > separate partitions for /boot and /home, but the rest is in /) Yes, I've used it for moving / to a different machine before. Take a look through the options and pick all the ones that look like they're dealing with attributes. :) Hugo. -- Hugo Mills | Anyone who claims their cryptographic protocol is hugo@... carfax.org.uk | secure is either a genius or a fool. Given the http://carfax.org.uk/ | genius/fool ratio for our species, the odds aren't PGP: E2AB1DE4 | good. Bruce Schneier
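For what it's worth, the attribute-related options Hugo alludes to amount to something like the following with GNU tar (a sketch on throwaway directories; for the real backup the archive would live on external storage, the source would be / and the destination the freshly formatted filesystem's mountpoint, and flag availability such as --acls depends on how tar was built):

```shell
# Throwaway source/destination directories standing in for / and the new FS.
src=$(mktemp -d); dst=$(mktemp -d); archive=$(mktemp -u).tar
echo data > "$src/file"

# Preserve permissions, numeric ownership and extended attributes/ACLs;
# --one-file-system keeps other mounts such as /boot and /home out.
tar --create --file="$archive" --xattrs --acls --numeric-owner --one-file-system -C "$src" .

# Restore with the same attribute-preserving flags.
tar --extract --file="$archive" --xattrs --acls --numeric-owner -C "$dst"
cat "$dst/file"
```

Running the restore as root is needed for ownership and device nodes to come back correctly on the real /.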
[PATCH] Btrfs: remove rb_node field from the delayed ref node structure
From: Filipe Manana

After the last big change in the delayed references code that was needed for the last qgroups rework, the red black tree node field of struct btrfs_delayed_ref_node is no longer used, so just remove it, this helps us save some memory (since struct rb_node is 24 bytes on x86_64) for these structures. Signed-off-by: Filipe Manana --- fs/btrfs/delayed-ref.h | 6 -- 1 file changed, 6 deletions(-) diff --git a/fs/btrfs/delayed-ref.h b/fs/btrfs/delayed-ref.h index 43f3629..a72fbe7 100644 --- a/fs/btrfs/delayed-ref.h +++ b/fs/btrfs/delayed-ref.h @@ -34,12 +34,6 @@ * ref_head. Must clean this mess up later. */ struct btrfs_delayed_ref_node { - /* -* ref_head use rb tree, stored in ref_root->href. -* indexed by bytenr -*/ - struct rb_node rb_node; - /*data/tree ref use list, stored in ref_head->ref_list. */ struct list_head list; -- 2.7.0.rc3
[PATCH 2/3] Btrfs: fix relocation incorrectly dropping data references
From: Filipe Manana

During relocation of a data block group we create a relocation tree for each fs/subvol tree by making a snapshot of each tree using btrfs_copy_root() and the tree's commit root, and then setting the last snapshot field for the fs/subvol tree's root to the value of the current transaction id minus 1. However this can lead to relocation later dropping references that it did not create if we have qgroups enabled, leaving the filesystem in an inconsistent state that keeps aborting transactions. Let's consider the following example to explain the problem, which requires qgroups to be enabled. We are relocating data block group Y, we have a subvolume with id 258 that has a root at level 1, that subvolume is used to store directory entries for snapshots and we are currently at transaction 3404. When committing transaction 3404, we have a pending snapshot and therefore we call btrfs_run_delayed_items() at transaction.c:create_pending_snapshot() in order to create its dentry at subvolume 258. This results in COWing leaf A from root 258 in order to add the dentry. Note that leaf A also contains file extent items referring to extents from some other block group X (we are currently relocating block group Y). Later on, still at create_pending_snapshot() we call qgroup_account_snapshot(), which switches the commit root for root 258 when it calls switch_commit_roots(), so now the COWed version of leaf A, let's call it leaf A', is accessible from the commit root of tree 258. At the end of qgroup_account_snapshot(), we call record_root_in_trans() with 258 as its argument, which results in btrfs_init_reloc_root() being called, which in turn calls relocation.c:create_reloc_root() in order to create a relocation tree associated to root 258, which results in assigning the value of 3403 (which is the current transaction id minus 1 = 3404 - 1) to the last_snapshot field of root 258.
When creating the relocation tree root at ctree.c:btrfs_copy_root() we add a shared reference for leaf A', corresponding to the relocation tree's root, when we call btrfs_inc_ref() against the COWed root (a copy of the commit root from tree 258), which is at level 1. So at this point leaf A' has 2 references, one normal reference corresponding to root 258 and one shared reference corresponding to the root of the relocation tree. Transaction 3404 finishes its commit and transaction 3405 is started by relocation when calling merge_reloc_root() for the relocation tree associated to root 258. In the meanwhile leaf A' is COWed again, in response to some filesystem operation, when we are still at transaction 3405. However when we COW leaf A', at ctree.c:update_ref_for_cow(), we call btrfs_block_can_be_shared() in order to figure out if other trees refer to the leaf and if any such trees exist, add a full back reference to leaf A' - but btrfs_block_can_be_shared() incorrectly returns false because the following condition is false: btrfs_header_generation(buf) <= btrfs_root_last_snapshot(&root->root_item) which evaluates to 3404 <= 3403. So after leaf A' is COWed, it stays with only one reference, corresponding to the shared reference we created when we called btrfs_copy_root() to create the relocation tree's root and btrfs_inc_ref() ends up not being called for leaf A' nor do we end up setting the flag BTRFS_BLOCK_FLAG_FULL_BACKREF in leaf A'. This results in not adding shared references for the extents from block group X that leaf A' refers to with its file extent items. Later, after merging the relocation root we do a call to btrfs_drop_snapshot() in order to delete the relocation tree. This ends up calling do_walk_down() when path->slots[1] points to leaf A', which results in calling btrfs_lookup_extent_info() to get the number of references for leaf A', which is 1 at this time (only the shared reference exists) and this value is stored at wc->refs[0].
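Plugging the transaction numbers from this example into that condition makes the off-by-one easy to see (a toy model only; the real btrfs_block_can_be_shared() checks several other conditions besides the generation comparison):

```python
# Toy model of the generation check in ctree.c:btrfs_block_can_be_shared(),
# using the transaction ids from the example above. Not kernel code.

def can_be_shared(header_generation, last_snapshot):
    # A block may be shared with other trees only if it was written no
    # later than the last snapshot of its root (among other conditions).
    return header_generation <= last_snapshot

# Leaf A' was COWed in transaction 3404, but create_reloc_root() set the
# root's last_snapshot to 3404 - 1 = 3403, so the check wrongly concludes
# the leaf cannot be shared even though the reloc root also references it:
print(can_be_shared(3404, 3403))  # False
```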
After this walk_up_proc() is called when wc->level is 0 and path->nodes[0] corresponds to leaf A'. Because the current level is 0 and wc->refs[0] is 1, it does call btrfs_dec_ref() against leaf A', which results in removing the single references that the extents from block group X have which are associated to root 258 - the expectation was to have each of these extents with 2 references - one reference for root 258 and one shared reference related to the root of the relocation tree, and so we would drop only the shared reference (because leaf A' was supposed to have the flag BTRFS_BLOCK_FLAG_FULL_BACKREF set). This leaves the filesystem in an inconsistent state as we now have file extent items in a subvolume tree that point to extents from block group X without references in the extent tree. So later on when we try to decrement the references for these extents, for example due to a file unlink operation, truncate operation or overwriting ranges of a file, we fail because the expected references do not exist in the extent tree. This leads to warnings and
[PATCH 1/3] Btrfs: remove balance warning that does not reflect a problem
From: Filipe MananaOn openSUSE/SLE systems where balance is triggered periodically in the background, snapshotting happens when doing package installations and upgrades, and (by default) the root system is organized with multiple subvolumes, the following warning was triggered often: [ 630.773059] WARNING: CPU: 1 PID: 2549 at fs/btrfs/relocation.c:1848 replace_path+0x3f0/0x940 [btrfs] [ 630.773060] Modules linked in: af_packet iscsi_ibft iscsi_boot_sysfs xfs libcrc32c acpi_cpufreq tpm_tis ppdev tpm parport_pc parport pcspkr e1000 qemu_fw_cfg joydev i2c_piix4 button btrfs xor raid6_pq sr_mod cdrom ata_generic virtio_scsi bochs_drm drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm ata_piix virtio_pci virtio_ring virtio serio_raw floppy drm sg [ 630.773070] CPU: 1 PID: 2549 Comm: btrfs Tainted: GW 4.7.7-2-btrfs+ #2 [ 630.773071] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.9.1-0-gb3ef39f-prebuilt.qemu-project.org 04/01/2014 [ 630.773072] 8801f704b8c8 813afd12 [ 630.773073] 8801f704b908 81081f8b 0738 [ 630.773075] 0001 8801e32eb8c0 1600 8800 [ 630.773076] Call Trace: [ 630.773078] [] dump_stack+0x63/0x81 [ 630.773079] [] __warn+0xcb/0xf0 [ 630.773080] [] warn_slowpath_null+0x1d/0x20 [ 630.773090] [] replace_path+0x3f0/0x940 [btrfs] [ 630.773092] [] ? ring_buffer_unlock_commit+0x3e/0x2a0 [ 630.773102] [] merge_reloc_root+0x2b4/0x600 [btrfs] [ 630.773111] [] merge_reloc_roots+0x140/0x250 [btrfs] [ 630.773120] [] relocate_block_group+0x317/0x680 [btrfs] [ 630.773129] [] btrfs_relocate_block_group+0x1cc/0x2d0 [btrfs] [ 630.773139] [] btrfs_relocate_chunk.isra.40+0x56/0xf0 [btrfs] [ 630.773149] [] __btrfs_balance+0x8d5/0xbb0 [btrfs] [ 630.773159] [] btrfs_balance+0x2d0/0x5e0 [btrfs] [ 630.773168] [] btrfs_ioctl_balance+0x383/0x390 [btrfs] [ 630.773178] [] btrfs_ioctl+0x90f/0x1fb0 [btrfs] [ 630.773180] [] ? pte_alloc_one+0x33/0x40 [ 630.773182] [] do_vfs_ioctl+0x93/0x5a0 [ 630.773183] [] ? 
__do_page_fault+0x203/0x4e0 [ 630.773185] [] SyS_ioctl+0x79/0x90 [ 630.773186] [] entry_SYSCALL_64_fastpath+0x1e/0xa8 [ 630.773187] ---[ end trace 2cd6167577bf3be7 ]--- It turned out that this warning does not reflect any problem and just makes users/system administrators worry about something going wrong. The warning happens because when we create a relocation root (which is just a snapshot of a subvolume tree) we set its last_snapshot field (as well as for the subvolume's tree root) to a value corresponding to the generation of the current transaction minus 1 (we do this at relocation.c:create_reloc_root()). This means that when we merge the relocation tree with the corresponding subvolume tree, at walk_down_reloc_tree() we can catch pointers (bytenr/generation pairs) with a generation that matches the generation of the transaction where we created the relocation root, so those pointers correspond to tree blocks created either before or after the relocation root was created. If they were created before the relocation root (and in the same transaction) we hit the warning, which we can safely remove because it means the tree blocks are accessible from both trees (the subvolume tree and the relocation tree). So fix this by removing the warning and adding a couple assertions that verify the pointer generations match and that their generation matches the value of the last_snapshot field from the relocation tree plus 1. 
Signed-off-by: Filipe Manana --- fs/btrfs/relocation.c | 21 - 1 file changed, 20 insertions(+), 1 deletion(-) diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c index 0ec8ffa..cdc1a1c 100644 --- a/fs/btrfs/relocation.c +++ b/fs/btrfs/relocation.c @@ -1848,7 +1848,26 @@ again: new_ptr_gen = 0; } - if (WARN_ON(new_bytenr > 0 && new_bytenr == old_bytenr)) { + /* +* When we create the reloc root (which is a snapshot of the +* subvolume tree) we set its last_snapshot field (as well as +* for the subvolume's tree root) to the value of the current +* transaction generation minus 1 (at create_reloc_root()). +* This means that at walk_down_reloc_tree() we can catch +* pointers (bytenr/generation pairs) with a generation +* matching the generation of the transaction where we created +* the reloc root, so those pointers correspond to tree blocks +* that were either created before or after the reloc root was +* created. If walk_down_reloc_tree() gave us a path that points +* to a tree block that was created (or COWed) before the reloc +* root was created and in the
[PATCH 3/3] Btrfs: remove unused code when creating and merging reloc trees
From: Filipe Manana

In commit 5bc7247ac47c (Btrfs: fix broken nocow after balance) we started abusing the rtransid and otransid fields of root items from relocation trees to fix some issues with nodatacow mode. However later in commit ba8b0289333a (Btrfs: do not reset last_snapshot after relocation) we dropped the code that made use of those fields but did not remove the code that sets those fields. So just remove them to avoid confusion. Signed-off-by: Filipe Manana --- fs/btrfs/relocation.c | 19 --- 1 file changed, 19 deletions(-) diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c index 8777b17..83c1a8c 100644 --- a/fs/btrfs/relocation.c +++ b/fs/btrfs/relocation.c @@ -1384,7 +1384,6 @@ static struct btrfs_root *create_reloc_root(struct btrfs_trans_handle *trans, struct extent_buffer *eb; struct btrfs_root_item *root_item; struct btrfs_key root_key; - u64 last_snap = 0; int ret; root_item = kmalloc(sizeof(*root_item), GFP_NOFS); @@ -1401,7 +1400,6 @@ static struct btrfs_root *create_reloc_root(struct btrfs_trans_handle *trans, ret = btrfs_copy_root(trans, root, root->commit_root, &eb, BTRFS_TREE_RELOC_OBJECTID); BUG_ON(ret); - last_snap = btrfs_root_last_snapshot(&root->root_item); /* * Set the last_snapshot field to the generation of the commit * root - like this ctree.c:btrfs_block_can_be_shared() behaves @@ -1435,12 +1433,6 @@ static struct btrfs_root *create_reloc_root(struct btrfs_trans_handle *trans, memset(&root_item->drop_progress, 0, sizeof(struct btrfs_disk_key)); root_item->drop_level = 0; - /* -* abuse rtransid, it is safe because it is impossible to -* receive data into a relocation tree.
-*/ - btrfs_set_root_rtransid(root_item, last_snap); - btrfs_set_root_otransid(root_item, trans->transid); } btrfs_tree_unlock(eb); @@ -2399,9 +2391,6 @@ void merge_reloc_roots(struct reloc_control *rc) { struct btrfs_root *root; struct btrfs_root *reloc_root; - u64 last_snap; - u64 otransid; - u64 objectid; LIST_HEAD(reloc_roots); int found = 0; int ret = 0; @@ -2440,14 +2429,6 @@ again: list_del_init(&reloc_root->root_list); } - /* -* we keep the old last snapshot transid in rtranid when we -* created the relocation tree. -*/ - last_snap = btrfs_root_rtransid(&reloc_root->root_item); - otransid = btrfs_root_otransid(&reloc_root->root_item); - objectid = reloc_root->root_key.offset; - ret = btrfs_drop_snapshot(reloc_root, rc->block_rsv, 0, 1); if (ret < 0) { if (list_empty(&reloc_root->root_list)) -- 2.7.0.rc3