Re: Btrfs/SSD
On Tue, 18 Apr 2017 03:23:13 + (UTC) Duncan <1i5t5.dun...@cox.net> wrote:

> Without reading the links...
>
> Are you /sure/ it's /all/ ssds currently on the market? Or are you
> thinking narrowly, those actually sold as ssds?
>
> Because all I've read (and I admit I may not actually be current, but...)
> on for instance sd cards, certainly ssds by definition, says they're
> still very write-cycle sensitive -- very simple FTL with little FTL
> wear-leveling.
>
> And AFAIK, USB thumb drives tend to be in the middle, moderately complex
> FTL with some, somewhat simplistic, wear-leveling.

If I have to clarify, yes, it's all about SATA and NVMe SSDs. SD cards may
be SSDs "by definition", but nobody will think of an SD card when you say
"I bought an SSD for my computer". And yes, SD cards and USB flash sticks
are commonly understood to be much simpler and more brittle devices than
full blown desktop (not to mention server) SSDs.

> While the stuff actually marketed as SSDs, generally SATA or direct
> PCIE/NVME connected, may indeed match your argument, no real end-user
> concern necessary any more as the FTLs are advanced enough that user or
> filesystem level write-cycle concerns simply aren't necessary these days.
>
> So does that claim that write-cycle concerns simply don't apply to modern
> ssds, also apply to common thumb drives and sd cards? Because these are
> certainly ssds both technically and by btrfs standards.

-- 
With respect,
Roman

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: compressing nocow files
To inhibit chattr +C on systemd-journald journals:

- manually remove the attribute on /var/log/journal and /var/log/journal/
- write an empty file: /etc/tmpfiles.d/journal-nocow.conf

Chris Murphy
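The two steps above can be sketched as a shell session. This assumes systemd ships a tmpfiles.d entry named journal-nocow.conf (the file name comes from the mail); the machine-id subdirectory is system-specific and is an assumption here, since the original path was truncated:

```shell
# Remove the No_COW attribute from the journal directories so that
# newly created journal files inherit normal COW behaviour.
chattr -C /var/log/journal
chattr -C /var/log/journal/$(cat /etc/machine-id)   # assumed subdirectory

# An empty file in /etc/tmpfiles.d with the same name as systemd's own
# entry masks it, so tmpfiles never re-applies +C on boot.
touch /etc/tmpfiles.d/journal-nocow.conf
```

Note that chattr +C/-C only affects files created after the change; existing journal files keep whatever attribute they were created with.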
Re: compressing nocow files
Liu Bo posted on Mon, 17 Apr 2017 11:07:07 -0700 as excerpted:

> On Mon, Apr 17, 2017 at 11:36:17AM -0600, Chris Murphy wrote:
>> Hi,
>>
>> /dev/nvme0n1p8 on / type btrfs
>> (rw,relatime,seclabel,ssd,space_cache,subvolid=258,subvol=/root)
>>
>> I've got a test folder with +C set and then copied a test file into it.
>>
>> $ lsattr
>> C-- ./system@2547b430ecdc441c9cf82569eeb22065-0001-00054c3c31bec567.journal
>>
>> So now it's inherited +C. Next, check fragments and compression.
>>
>> $ sudo ~/Applications/btrfs-debugfs -f system@2547b430ecdc441c9cf82569eeb22065-0001-00054c3c31bec567.journal
>> (290662 0): ram 100663296 disk 17200840704 disk_size 100663296
>> file: system@2547b430ecdc441c9cf82569eeb22065-0001-00054c3c31bec567.journal
>> extents 1 disk size 100663296 logical size 100663296 ratio 1.00
>>
>> Try to compress it.
>>
>> $ sudo btrfs fi defrag -c system@2547b430ecdc441c9cf82569eeb22065-0001-00054c3c31bec567.journal
>>
>> Check again.
>>
>> [snip faux fragments]
>> file: system@2547b430ecdc441c9cf82569eeb22065-0001-00054c3c31bec567.journal
>> extents 768 disk size 21504000 logical size 100663296 ratio 4.68
>>
>> It's compressed! OK??
>>
>> OK, delete that file, and add +c to the temp folder.
>>
>> $ lsattr
>> c---C-- ./temp
>>
>> Copy another test file into temp.
>>
>> $ lsattr
>> c---C-- ./system@2547b430ecdc441c9cf82569eeb22065-0002eb9f-00054d4ebb44b135.journal
>>
>> $ sudo ~/Applications/btrfs-debugfs -f system@2547b430ecdc441c9cf82569eeb22065-0002eb9f-00054d4ebb44b135.journal
>> (290739 0): ram 83886080 disk 21764243456 disk_size 83886080
>> file: system@2547b430ecdc441c9cf82569eeb22065-0002eb9f-00054d4ebb44b135.journal
>> extents 1 disk size 83886080 logical size 83886080 ratio 1.00
>>
>> Not compressed. Hmmm. So somehow btrfs fi defrag -c will force
>> compression even on nocow files? I've also done this without the -c
>> option, with a Btrfs mounted using the compress option, and the file
>> does compress as well. I thought nocow files were always no-compress,
>> but it seems there are exceptions.
>
> Good catch.
>
> Btrfs defragment depends on COW to do the job, thus nocow inodes are
> forced to COW when processing the defragged range, and compression also
> depends on COW, so btrfs filesystem defrag -c becomes an exception.

Thanks for the explanation. =:^)

> Speaking of which, this exception seems to be not harmful other than
> confusing users.

My question is, what does it do then when a new modification-write comes
in to the compressed nocow file, and the modification isn't as
compressible as the data it replaced?

Actually, do new writes to a defrag-compressed nocow file compress at
all, or not, and/or does nocow effectively become cow1 (as when a nocow
file is snapshotted), or even get effectively ignored entirely (as I
believe happens when a normal cow file is marked nocow after it has been
written)?

Is this likely to be a source of some of the unreproduced bugs we've
seen, simply because, much like a metadata-inline block followed by
extent-tree blocks, it wasn't supposed to happen and thus wasn't designed
for or tested for?

(FTR, this shouldn't affect me ATM as AFAIK I have no nocow files;
systemd is set to produce temporary/tmpfs files only, no permanent
journal, so its journal files don't even hit btrfs, and I know of no
others set nocow by default, neither have I set anything nocow. But that
doesn't mean I'm not concerned, as it certainly affects others, and could
hypothetically affect me should I discover I have a reason to nocow
something.)

-- 
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman
Re: [PATCH] Btrfs: fix extent map leak during fallocate error path
On Tue, Apr 04, 2017 at 03:20:36AM +0100, fdman...@kernel.org wrote:
> From: Filipe Manana
>
> If the call to btrfs_qgroup_reserve_data() failed, we were leaking an
> extent map structure. The failure can happen either due to an -ENOMEM
> condition or, when quotas are enabled, due to -EDQUOT for example.

Reviewed-by: Liu Bo

Thanks,

-liubo

> Signed-off-by: Filipe Manana
> ---
>  fs/btrfs/file.c | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
>
> diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
> index 48dfb8e..56304c4 100644
> --- a/fs/btrfs/file.c
> +++ b/fs/btrfs/file.c
> @@ -2856,8 +2856,10 @@ static long btrfs_fallocate(struct file *file, int mode,
>  			}
>  			ret = btrfs_qgroup_reserve_data(inode, cur_offset,
>  					last_byte - cur_offset);
> -			if (ret < 0)
> +			if (ret < 0) {
> +				free_extent_map(em);
>  				break;
> +			}
>  		} else {
>  			/*
>  			 * Do not need to reserve unwritten extent for this
> --
> 2.7.0.rc3
Re: Btrfs/SSD
Roman Mamedov posted on Mon, 17 Apr 2017 23:24:19 +0500 as excerpted:

> Days are long gone since the end user had to ever think about device
> lifetimes with SSDs. Refer to endurance studies such as
> http://techreport.com/review/27909/the-ssd-endurance-experiment-theyre-all-dead
> http://ssdendurancetest.com/
> https://3dnews.ru/938764/
> It has been demonstrated that all SSDs on the market tend to overshoot
> even their rated TBW by several times, as a result it will take any user
> literally dozens of years to wear out the flash no matter which
> filesystem or what settings used.

Without reading the links...

Are you /sure/ it's /all/ ssds currently on the market? Or are you
thinking narrowly, those actually sold as ssds?

Because all I've read (and I admit I may not actually be current, but...)
on for instance sd cards, certainly ssds by definition, says they're
still very write-cycle sensitive -- very simple FTL with little FTL
wear-leveling.

And AFAIK, USB thumb drives tend to be in the middle, moderately complex
FTL with some, somewhat simplistic, wear-leveling.

While the stuff actually marketed as SSDs, generally SATA or direct
PCIE/NVME connected, may indeed match your argument, no real end-user
concern necessary any more as the FTLs are advanced enough that user or
filesystem level write-cycle concerns simply aren't necessary these days.

So does that claim that write-cycle concerns simply don't apply to modern
ssds, also apply to common thumb drives and sd cards? Because these are
certainly ssds both technically and by btrfs standards.

-- 
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman
[PATCH 3/6] Btrfs: change how we iterate bios in endio
Since dio submit has used bio_clone_fast, the submitted bio may not have a reliable bi_vcnt, for the bio vector iterations in checksum related functions, bio->bi_iter is not modified yet and it's safe to use bio_for_each_segment, while for those bio vector iterations in dio's read endio, we now save a copy of bvec_iter in struct btrfs_io_bio when cloning bios and use the helper __bio_for_each_segment with the saved bvec_iter to access each bvec. Signed-off-by: Liu Bo--- fs/btrfs/extent_io.c | 1 + fs/btrfs/file-item.c | 31 +++ fs/btrfs/inode.c | 33 + fs/btrfs/volumes.h | 1 + 4 files changed, 34 insertions(+), 32 deletions(-) diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c index 1b7156c..54108d1 100644 --- a/fs/btrfs/extent_io.c +++ b/fs/btrfs/extent_io.c @@ -2738,6 +2738,7 @@ struct bio *btrfs_bio_clone_partial(struct bio *orig, gfp_t gfp_mask, int offset btrfs_bio->end_io = NULL; bio_trim(bio, (offset >> 9), (size >> 9)); + btrfs_bio->iter = bio->bi_iter; } return bio; } diff --git a/fs/btrfs/file-item.c b/fs/btrfs/file-item.c index 64fcb31..9f6062c 100644 --- a/fs/btrfs/file-item.c +++ b/fs/btrfs/file-item.c @@ -164,7 +164,8 @@ static int __btrfs_lookup_bio_sums(struct inode *inode, struct bio *bio, u64 logical_offset, u32 *dst, int dio) { struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb); - struct bio_vec *bvec; + struct bio_vec bvec; + struct bvec_iter iter; struct btrfs_io_bio *btrfs_bio = btrfs_io_bio(bio); struct btrfs_csum_item *item = NULL; struct extent_io_tree *io_tree = _I(inode)->io_tree; @@ -177,7 +178,7 @@ static int __btrfs_lookup_bio_sums(struct inode *inode, struct bio *bio, u64 page_bytes_left; u32 diff; int nblocks; - int count = 0, i; + int count = 0; u16 csum_size = btrfs_super_csum_size(fs_info->super_copy); path = btrfs_alloc_path(); @@ -206,8 +207,6 @@ static int __btrfs_lookup_bio_sums(struct inode *inode, struct bio *bio, if (bio->bi_iter.bi_size > PAGE_SIZE * 8) path->reada = READA_FORWARD; - WARN_ON(bio->bi_vcnt <= 0); - /* 
* the free space stuff is only read when it hasn't been * updated in the current transaction. So, we can safely @@ -223,13 +222,13 @@ static int __btrfs_lookup_bio_sums(struct inode *inode, struct bio *bio, if (dio) offset = logical_offset; - bio_for_each_segment_all(bvec, bio, i) { - page_bytes_left = bvec->bv_len; + bio_for_each_segment(bvec, bio, iter) { + page_bytes_left = bvec.bv_len; if (count) goto next; if (!dio) - offset = page_offset(bvec->bv_page) + bvec->bv_offset; + offset = page_offset(bvec.bv_page) + bvec.bv_offset; count = btrfs_find_ordered_sum(inode, offset, disk_bytenr, (u32 *)csum, nblocks); if (count) @@ -440,15 +439,15 @@ int btrfs_csum_one_bio(struct inode *inode, struct bio *bio, struct btrfs_ordered_sum *sums; struct btrfs_ordered_extent *ordered = NULL; char *data; - struct bio_vec *bvec; + struct bvec_iter iter; + struct bio_vec bvec; int index; int nr_sectors; - int i, j; unsigned long total_bytes = 0; unsigned long this_sum_bytes = 0; + int i; u64 offset; - WARN_ON(bio->bi_vcnt <= 0); sums = kzalloc(btrfs_ordered_sum_size(fs_info, bio->bi_iter.bi_size), GFP_NOFS); if (!sums) @@ -465,19 +464,19 @@ int btrfs_csum_one_bio(struct inode *inode, struct bio *bio, sums->bytenr = (u64)bio->bi_iter.bi_sector << 9; index = 0; - bio_for_each_segment_all(bvec, bio, j) { + bio_for_each_segment(bvec, bio, iter) { if (!contig) - offset = page_offset(bvec->bv_page) + bvec->bv_offset; + offset = page_offset(bvec.bv_page) + bvec.bv_offset; if (!ordered) { ordered = btrfs_lookup_ordered_extent(inode, offset); BUG_ON(!ordered); /* Logic error */ } - data = kmap_atomic(bvec->bv_page); + data = kmap_atomic(bvec.bv_page); nr_sectors = BTRFS_BYTES_TO_BLKS(fs_info, -bvec->bv_len + fs_info->sectorsize +bvec.bv_len + fs_info->sectorsize - 1); for (i = 0; i < nr_sectors; i++) { @@ -504,12 +503,12 @@ int
[PATCH 0/6 RFC] utilize bio_clone_fast to clean up
This attempts to use bio_clone_fast() in the places where we clone bios,
such as when a bio gets cloned for multiple disks and when a bio gets
split during dio submit.

One benefit is to simplify dio submit by avoiding calling bio_add_page
one by one.

Another benefit is that, compared to bio_clone_bioset, bio_clone_fast is
faster because it copies the vector pointer directly. Since
bio_clone_fast doesn't modify bi_vcnt, the extra work is to fix up our
current bi_vcnt usage to use bi_iter to iterate bvecs.

Liu Bo (6):
  Btrfs: use bio_clone_fast to clone our bio
  Btrfs: use bio_clone_bioset_partial to simplify DIO submit
  Btrfs: change how we iterate bios in endio
  Btrfs: record error if one block has failed to retry
  Btrfs: change check-integrity to use bvec_iter
  Btrfs: unify naming of btrfs_io_bio

 fs/btrfs/check-integrity.c |  27 +++---
 fs/btrfs/extent_io.c       |  18 +++-
 fs/btrfs/extent_io.h       |   1 +
 fs/btrfs/file-item.c       |  31 ---
 fs/btrfs/inode.c           | 203 -
 fs/btrfs/volumes.h         |   1 +
 6 files changed, 138 insertions(+), 143 deletions(-)

-- 
2.5.5
[PATCH 5/6] Btrfs: change check-integrity to use bvec_iter
Some check-integrity code depends on bio->bi_vcnt, this changes it to use bio segments because some bios passing here may not have a reliable bi_vcnt. Signed-off-by: Liu Bo--- fs/btrfs/check-integrity.c | 27 +++ 1 file changed, 15 insertions(+), 12 deletions(-) diff --git a/fs/btrfs/check-integrity.c b/fs/btrfs/check-integrity.c index ab14c2e..8e7ce48 100644 --- a/fs/btrfs/check-integrity.c +++ b/fs/btrfs/check-integrity.c @@ -2822,44 +2822,47 @@ static void __btrfsic_submit_bio(struct bio *bio) dev_state = btrfsic_dev_state_lookup(bio->bi_bdev); if (NULL != dev_state && (bio_op(bio) == REQ_OP_WRITE) && bio_has_data(bio)) { - unsigned int i; + unsigned int i = 0; u64 dev_bytenr; u64 cur_bytenr; - struct bio_vec *bvec; + struct bio_vec bvec; + struct bvec_iter iter; int bio_is_patched; char **mapped_datav; + int segs = bio_segments(bio); dev_bytenr = 512 * bio->bi_iter.bi_sector; bio_is_patched = 0; if (dev_state->state->print_mask & BTRFSIC_PRINT_MASK_SUBMIT_BIO_BH) pr_info("submit_bio(rw=%d,0x%x, bi_vcnt=%u, bi_sector=%llu (bytenr %llu), bi_bdev=%p)\n", - bio_op(bio), bio->bi_opf, bio->bi_vcnt, + bio_op(bio), bio->bi_opf, segs, (unsigned long long)bio->bi_iter.bi_sector, dev_bytenr, bio->bi_bdev); - mapped_datav = kmalloc_array(bio->bi_vcnt, + mapped_datav = kmalloc_array(segs, sizeof(*mapped_datav), GFP_NOFS); if (!mapped_datav) goto leave; cur_bytenr = dev_bytenr; - bio_for_each_segment_all(bvec, bio, i) { - BUG_ON(bvec->bv_len != PAGE_SIZE); - mapped_datav[i] = kmap(bvec->bv_page); + bio_for_each_segment(bvec, bio, iter) { + BUG_ON(bvec.bv_len != PAGE_SIZE); + mapped_datav[i] = kmap(bvec.bv_page); + i++; if (dev_state->state->print_mask & BTRFSIC_PRINT_MASK_SUBMIT_BIO_BH_VERBOSE) pr_info("#%u: bytenr=%llu, len=%u, offset=%u\n", - i, cur_bytenr, bvec->bv_len, bvec->bv_offset); - cur_bytenr += bvec->bv_len; + i, cur_bytenr, bvec.bv_len, bvec.bv_offset); + cur_bytenr += bvec.bv_len; } btrfsic_process_written_block(dev_state, dev_bytenr, - mapped_datav, 
bio->bi_vcnt, + mapped_datav, segs, bio, _is_patched, NULL, bio->bi_opf); - bio_for_each_segment_all(bvec, bio, i) - kunmap(bvec->bv_page); + bio_for_each_segment(bvec, bio, iter) + kunmap(bvec.bv_page); kfree(mapped_datav); } else if (NULL != dev_state && (bio->bi_opf & REQ_PREFLUSH)) { if (dev_state->state->print_mask & -- 2.5.5 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 4/6] Btrfs: record error if one block has failed to retry
In the nocsum case of dio read endio, it returns immediately if an error
is returned when repairing, which leaves the rest of the blocks
unrepaired. The behavior is different from how buffered read endio works
in the same case. This changes it to only record the error and go on
repairing the rest of the blocks.

Signed-off-by: Liu Bo
---
 fs/btrfs/inode.c | 10 +++---
 1 file changed, 7 insertions(+), 3 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index fca2f1f..cc46d21 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -7942,6 +7942,7 @@ static int __btrfs_correct_data_nocsum(struct inode *inode,
 	u32 sectorsize;
 	int nr_sectors;
 	int ret;
+	int err;
 
 	fs_info = BTRFS_I(inode)->root->fs_info;
 	sectorsize = fs_info->sectorsize;
@@ -7962,8 +7963,10 @@ static int __btrfs_correct_data_nocsum(struct inode *inode,
 				pgoff, start, start + sectorsize - 1,
 				io_bio->mirror_num,
 				btrfs_retry_endio_nocsum, &done);
-		if (ret)
-			return ret;
+		if (ret) {
+			err = ret;
+			goto next;
+		}
 
 		wait_for_completion(&done);
 
@@ -7972,6 +7975,7 @@ static int __btrfs_correct_data_nocsum(struct inode *inode,
 			goto next_block_or_try_again;
 		}
 
+next:
 		start += sectorsize;
 
 		if (nr_sectors--) {
@@ -7980,7 +7984,7 @@ static int __btrfs_correct_data_nocsum(struct inode *inode,
 		}
 	}
 
-	return 0;
+	return err;
 }
 
 static void btrfs_retry_endio(struct bio *bio)
-- 
2.5.5
[PATCH 6/6] Btrfs: unify naming of btrfs_io_bio
All dio endio functions are using io_bio for struct btrfs_io_bio, this makes btrfs_submit_direct to follow this convention. Signed-off-by: Liu Bo--- fs/btrfs/inode.c | 38 +++--- 1 file changed, 19 insertions(+), 19 deletions(-) diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c index cc46d21..73e7a44 100644 --- a/fs/btrfs/inode.c +++ b/fs/btrfs/inode.c @@ -8424,16 +8424,16 @@ static void btrfs_submit_direct(struct bio *dio_bio, struct inode *inode, loff_t file_offset) { struct btrfs_dio_private *dip = NULL; - struct bio *io_bio = NULL; - struct btrfs_io_bio *btrfs_bio; + struct bio *bio = NULL; + struct btrfs_io_bio *io_bio; int skip_sum; bool write = (bio_op(dio_bio) == REQ_OP_WRITE); int ret = 0; skip_sum = BTRFS_I(inode)->flags & BTRFS_INODE_NODATASUM; - io_bio = btrfs_bio_clone(dio_bio, GFP_NOFS); - if (!io_bio) { + bio = btrfs_bio_clone(dio_bio, GFP_NOFS); + if (!bio) { ret = -ENOMEM; goto free_ordered; } @@ -8449,17 +8449,17 @@ static void btrfs_submit_direct(struct bio *dio_bio, struct inode *inode, dip->logical_offset = file_offset; dip->bytes = dio_bio->bi_iter.bi_size; dip->disk_bytenr = (u64)dio_bio->bi_iter.bi_sector << 9; - io_bio->bi_private = dip; - dip->orig_bio = io_bio; + bio->bi_private = dip; + dip->orig_bio = bio; dip->dio_bio = dio_bio; atomic_set(>pending_bios, 0); - btrfs_bio = btrfs_io_bio(io_bio); - btrfs_bio->logical = file_offset; + io_bio = btrfs_io_bio(bio); + io_bio->logical = file_offset; if (write) { - io_bio->bi_end_io = btrfs_endio_direct_write; + bio->bi_end_io = btrfs_endio_direct_write; } else { - io_bio->bi_end_io = btrfs_endio_direct_read; + bio->bi_end_io = btrfs_endio_direct_read; dip->subio_endio = btrfs_subio_endio_read; } @@ -8482,8 +8482,8 @@ static void btrfs_submit_direct(struct bio *dio_bio, struct inode *inode, if (!ret) return; - if (btrfs_bio->end_io) - btrfs_bio->end_io(btrfs_bio, ret); + if (io_bio->end_io) + io_bio->end_io(io_bio, ret); free_ordered: /* @@ -8495,16 +8495,16 @@ static void 
btrfs_submit_direct(struct bio *dio_bio, struct inode *inode, * same as btrfs_endio_direct_[write|read] because we can't call these * callbacks - they require an allocated dip and a clone of dio_bio. */ - if (io_bio && dip) { - io_bio->bi_error = -EIO; - bio_endio(io_bio); + if (bio && dip) { + bio->bi_error = -EIO; + bio_endio(bio); /* -* The end io callbacks free our dip, do the final put on io_bio +* The end io callbacks free our dip, do the final put on bio * and all the cleanup and final put for dio_bio (through * dio_end_io()). */ dip = NULL; - io_bio = NULL; + bio = NULL; } else { if (write) btrfs_endio_direct_write_update_ordered(inode, @@ -8522,8 +8522,8 @@ static void btrfs_submit_direct(struct bio *dio_bio, struct inode *inode, */ dio_end_io(dio_bio, ret); } - if (io_bio) - bio_put(io_bio); + if (bio) + bio_put(bio); kfree(dip); } -- 2.5.5 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 2/6] Btrfs: use bio_clone_bioset_partial to simplify DIO submit
Currently when mapping bio to limit bio to a single stripe length, we split bio by adding page to bio one by one, but later we don't modify the vector of bio at all, thus we can use bio_clone_fast to use the original bio vector directly. Signed-off-by: Liu Bo--- fs/btrfs/extent_io.c | 15 +++ fs/btrfs/extent_io.h | 1 + fs/btrfs/inode.c | 122 +++ 3 files changed, 62 insertions(+), 76 deletions(-) diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c index 0d4aea4..1b7156c 100644 --- a/fs/btrfs/extent_io.c +++ b/fs/btrfs/extent_io.c @@ -2726,6 +2726,21 @@ struct bio *btrfs_io_bio_alloc(gfp_t gfp_mask, unsigned int nr_iovecs) return bio; } +struct bio *btrfs_bio_clone_partial(struct bio *orig, gfp_t gfp_mask, int offset, int size) +{ + struct bio *bio; + + bio = bio_clone_fast(orig, gfp_mask, btrfs_bioset); + if (bio) { + struct btrfs_io_bio *btrfs_bio = btrfs_io_bio(bio); + btrfs_bio->csum = NULL; + btrfs_bio->csum_allocated = NULL; + btrfs_bio->end_io = NULL; + + bio_trim(bio, (offset >> 9), (size >> 9)); + } + return bio; +} static int __must_check submit_one_bio(struct bio *bio, int mirror_num, unsigned long bio_flags) diff --git a/fs/btrfs/extent_io.h b/fs/btrfs/extent_io.h index 3e4fad4..3b2bc88 100644 --- a/fs/btrfs/extent_io.h +++ b/fs/btrfs/extent_io.h @@ -460,6 +460,7 @@ btrfs_bio_alloc(struct block_device *bdev, u64 first_sector, int nr_vecs, gfp_t gfp_flags); struct bio *btrfs_io_bio_alloc(gfp_t gfp_mask, unsigned int nr_iovecs); struct bio *btrfs_bio_clone(struct bio *bio, gfp_t gfp_mask); +struct bio *btrfs_bio_clone_partial(struct bio *orig, gfp_t gfp_mask, int offset, int size); struct btrfs_fs_info; struct btrfs_inode; diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c index a18510b..6215720 100644 --- a/fs/btrfs/inode.c +++ b/fs/btrfs/inode.c @@ -8230,16 +8230,6 @@ static void btrfs_end_dio_bio(struct bio *bio) bio_put(bio); } -static struct bio *btrfs_dio_bio_alloc(struct block_device *bdev, - u64 first_sector, gfp_t gfp_flags) -{ - struct bio *bio; 
- bio = btrfs_bio_alloc(bdev, first_sector, BIO_MAX_PAGES, gfp_flags); - if (bio) - bio_associate_current(bio); - return bio; -} - static inline int btrfs_lookup_and_bind_dio_csum(struct inode *inode, struct btrfs_dio_private *dip, struct bio *bio, @@ -8329,24 +8319,22 @@ static int btrfs_submit_direct_hook(struct btrfs_dio_private *dip, struct btrfs_root *root = BTRFS_I(inode)->root; struct bio *bio; struct bio *orig_bio = dip->orig_bio; - struct bio_vec *bvec; u64 start_sector = orig_bio->bi_iter.bi_sector; u64 file_offset = dip->logical_offset; - u64 submit_len = 0; u64 map_length; - u32 blocksize = fs_info->sectorsize; int async_submit = 0; - int nr_sectors; + int submit_len; + int clone_offset = 0; + int clone_len; int ret; - int i, j; - map_length = orig_bio->bi_iter.bi_size; + submit_len = map_length = orig_bio->bi_iter.bi_size; ret = btrfs_map_block(fs_info, btrfs_op(orig_bio), start_sector << 9, _length, NULL, 0); if (ret) return -EIO; - if (map_length >= orig_bio->bi_iter.bi_size) { + if (map_length >= submit_len) { bio = orig_bio; dip->flags |= BTRFS_DIO_ORIG_BIO_SUBMITTED; goto submit; @@ -8358,70 +8346,52 @@ static int btrfs_submit_direct_hook(struct btrfs_dio_private *dip, else async_submit = 1; - bio = btrfs_dio_bio_alloc(orig_bio->bi_bdev, start_sector, GFP_NOFS); - if (!bio) - return -ENOMEM; - - bio->bi_opf = orig_bio->bi_opf; - bio->bi_private = dip; - bio->bi_end_io = btrfs_end_dio_bio; - btrfs_io_bio(bio)->logical = file_offset; + /* bio split */ atomic_inc(>pending_bios); + while (submit_len > 0) { + /* map_length < submit_len, it's a int */ + clone_len = min(submit_len, (int)map_length); + bio = btrfs_bio_clone_partial(orig_bio, GFP_NOFS, clone_offset, clone_len); + if (!bio) + goto out_err; + /* the above clone call also clone blkcg of orig_bio */ + + bio->bi_private = dip; + bio->bi_end_io = btrfs_end_dio_bio; + btrfs_io_bio(bio)->logical = file_offset; + + ASSERT(submit_len >= clone_len); + submit_len -= clone_len; + if (submit_len == 0) +
[PATCH 1/6] Btrfs: use bio_clone_fast to clone our bio
For raid1 and raid10, we clone the original bio to the bios which are
then sent to different disks.

Signed-off-by: Liu Bo
---
 fs/btrfs/extent_io.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 27fdb25..0d4aea4 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -2700,7 +2700,7 @@ struct bio *btrfs_bio_clone(struct bio *bio, gfp_t gfp_mask)
 	struct btrfs_io_bio *btrfs_bio;
 	struct bio *new;
 
-	new = bio_clone_bioset(bio, gfp_mask, btrfs_bioset);
+	new = bio_clone_fast(bio, gfp_mask, btrfs_bioset);
 	if (new) {
 		btrfs_bio = btrfs_io_bio(new);
 		btrfs_bio->csum = NULL;
-- 
2.5.5
Re: Btrfs/SSD
On 04/17/2017 09:22 PM, Imran Geriskovan wrote:
> [...]
>
> Going over the thread, the following questions come to my mind:
>
> - What exactly does the btrfs ssd option do relative to plain mode?

There's quite an amount of information in the very recent threads:

- "About free space fragmentation, metadata write amplification and (no)ssd"
- "BTRFS as a GlusterFS storage back-end, and what I've learned from
  using it as such."
- "btrfs filesystem keeps allocating new chunks for no apparent reason"
- ... and a few more

I suspect there will be some "summary" mails at some point, but for now,
I'd recommend crawling through these threads first.

And now for your instant satisfaction, a short visual guide to the
difference, which shows actual btrfs behaviour instead of our guesswork
around it (taken from the second mail thread just mentioned):

-o ssd:
https://syrinx.knorrie.org/~knorrie/btrfs/keep/2017-01-19-noautodefrag-ichiban.mp4

-o nossd:
https://syrinx.knorrie.org/~knorrie/btrfs/keep/2017-04-08-ichiban-walk-nossd.mp4

-- 
Hans van Kranenburg
Re: Btrfs/SSD
On Mon, Apr 17, 2017 at 1:26 PM, Austin S. Hemmelgarn wrote:
> On 2017-04-17 14:34, Chris Murphy wrote:
>> Nope. The first paragraph applies to NVMe machine with ssd mount
>> option. Few fragments.
>>
>> The second paragraph applies to SD Card machine with ssd_spread mount
>> option. Many fragments.
>
> Ah, apologies for my misunderstanding.
>
>> These are different versions of systemd-journald so I can't completely
>> rule out a difference in write behavior.
>
> There have only been a couple of changes in the write patterns that I
> know of, but I would double check that the values for Seal and Compress
> in the journald.conf file are the same, as I know for a fact that
> changing those does change the write patterns (not much, but they do
> change).

Same, unchanged defaults on both systems.

#Storage=auto
#Compress=yes
#Seal=yes
#SplitMode=uid
#SyncIntervalSec=5m
#RateLimitIntervalSec=30s
#RateLimitBurst=1000

The sync interval sec is curious. 5 minutes? Umm, I'm seeing nearly
constant hits every 2-5 seconds on the journal file, using filefrag. I'm
sure there's a better way to trace a single file being read/written to
than this, but...

It's almost like we need these things to not fsync at all, and just rely
on the filesystem commit time...

>>> Essentially yes, but that causes all kinds of other problems.
>>
>> Drat.
>
> Admittedly most of the problems are use-case specific (you can't afford
> to lose transactions in a financial database for example, so it
> functionally has to call fsync after each transaction), but most of it
> stems from the fact that BTRFS is doing a lot of the same stuff that
> much of the 'problem' software is doing itself internally.

Seems like the old way of doing things, and the staleness of the
internet, have colluded to create a lot of nervousness and misuse of
fsync. The very fact Btrfs needs a log tree to deal with fsync's in a
semi-sane way...

-- 
Chris Murphy
Re: Btrfs/SSD
On 2017-04-17 14:34, Chris Murphy wrote:
> On Mon, Apr 17, 2017 at 11:13 AM, Austin S. Hemmelgarn wrote:
>>> What is a high end SSD these days? Built-in NVMe?
>>
>> One with a good FTL in the firmware. At minimum, the good Samsung EVO
>> drives, the high quality Intel ones, and the Crucial MX series, but
>> probably some others. My choice of words here probably wasn't the best
>> though.
>
> It's a confusing market that sorta defies figuring out what we've got.
>
> I have a Samsung EVO SATA SSD in one laptop, but then I have a Samsung
> EVO+ SD Card in an Intel NUC. They use that same EVO branding on an $11
> SD Card. And then there's the Samsung Electronics Co Ltd NVMe SSD
> Controller SM951/PM951 in another laptop.

What makes it even more confusing is that other than Samsung (who _only_
use their own flash and controllers), manufacturer does not map to
controller choice consistently, and even two drives with the same
controller may have different firmware (and thus different degrees of
reliability; those OCZ drives that were such crap at data retention were
the result of a firmware option that the controller manufacturer pretty
much told them not to use on production devices).

> So long as this file is not reflinked or snapshot, filefrag shows a
> pile of mostly 4096 byte blocks, thousands. But as they're pretty much
> all continuous, the file fragmentation (extent count) is usually never
> higher than 12. It meanders between 1 and 12 extents for its life.
>
> Except on the system using ssd_spread mount option. That one has a
> journal file that is +C, is not being snapshot, but has over 3000
> extents per filefrag and btrfs-progs/debugfs. Really weird.

Given how the 'ssd' mount option behaves and the frequency that most
systemd instances write to their journals, that's actually reasonably
expected. We look for big chunks of free space to write into and then
align to 2M regardless of the actual size of the write, which in turn
means that files like the systemd journal which see lots of small
(relatively speaking) writes will have way more extents than they should
until you defragment them.

> Nope. The first paragraph applies to NVMe machine with ssd mount
> option. Few fragments.
>
> The second paragraph applies to SD Card machine with ssd_spread mount
> option. Many fragments.

Ah, apologies for my misunderstanding.

> These are different versions of systemd-journald so I can't completely
> rule out a difference in write behavior.

There have only been a couple of changes in the write patterns that I
know of, but I would double check that the values for Seal and Compress
in the journald.conf file are the same, as I know for a fact that
changing those does change the write patterns (not much, but they do
change).

>>> Now, systemd aside, there are databases that behave this same way
>>> where there's a small section constantly being overwritten, and one
>>> or more sections that grow the database file from within and at the
>>> end. If this is made cow, the file will absolutely fragment a ton.
>>> And especially if the changes are mostly 4KiB block sizes that then
>>> are fsync'd.
>>>
>>> It's almost like we need these things to not fsync at all, and just
>>> rely on the filesystem commit time...
>>
>> Essentially yes, but that causes all kinds of other problems.
>
> Drat.

Admittedly most of the problems are use-case specific (you can't afford
to lose transactions in a financial database for example, so it
functionally has to call fsync after each transaction), but most of it
stems from the fact that BTRFS is doing a lot of the same stuff that
much of the 'problem' software is doing itself internally.
Re: Btrfs/SSD
On 4/17/17, Roman Mamedov wrote:
> "Austin S. Hemmelgarn" wrote:
>> * Compression should help performance and device lifetime most of the
>> time, unless your CPU is fully utilized on a regular basis (in which
>> case it will hurt performance, but still improve device lifetimes).
>
> Days are long gone since the end user had to ever think about device
> lifetimes with SSDs. Refer to endurance studies such as
> http://techreport.com/review/27909/the-ssd-endurance-experiment-theyre-all-dead
> http://ssdendurancetest.com/
> https://3dnews.ru/938764/
> It has been demonstrated that all SSDs on the market tend to overshoot
> even their rated TBW by several times; as a result it will take any
> user literally dozens of years to wear out the flash no matter which
> filesystem or what settings are used. And most certainly it's not worth
> changing anything significant in your workflow (such as enabling
> compression if it's otherwise inconvenient or not needed) just to save
> the SSD lifetime.

Going over the thread, the following questions come to my mind:

- What exactly does the btrfs ssd option do relative to plain mode?

- Most (all?) SSDs employ wear leveling, don't they? That is, they are constantly remapping their blocks under the hood. So isn't it meaningless to speak of some kind of block forging/fragmentation/etc. effect of any writing pattern?

- If so, doesn't it mean that there is no better SSD usage strategy than minimizing the total bytes written? That is, whatever we do, if it contributes to this it is good, otherwise bad. Are all other things beyond any user control? Is there a recommended setting?

- How about "data retention" experiences? It is known that new SSDs can hold data safely for a longer period. As they age, that margin gets shorter. As an extreme case, if I write to a new SSD and shelve it, can I get my data back after 5 years? How about a file written 5 years ago and never touched again, although the rest of the SSD is in active use during that period?

- Yes, maybe lifetimes are getting irrelevant. However, TBW still has a direct relation to data retention capability. Knowing that writing more data to an SSD can reduce the "lifetime of your data" is something strange.

- But someone can come and say: hey, don't worry about "data retention years", because your SSD will already be dead before data retention becomes a problem for you... Which is relieving.. :))

Anyway, what are your opinions?
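Since most of these questions reduce to total bytes written, note that the drive itself reports that counter over SMART. A minimal sketch, assuming smartmontools is installed — attribute 241 ("Total_LBAs_Written") and the 512-byte LBA unit are common on Intel and Samsung drives but are vendor-specific, so treat both as assumptions and check your drive's documentation:

```shell
# Convert a Total_LBAs_Written raw value to terabytes written.
# Assumes the common 512-byte LBA unit -- vendor-dependent.
tb_written() {
    echo "$1" | awk '{printf "%.2f TB written\n", $1 * 512 / 1e12}'
}

# Typical use (needs root and smartmontools; attribute ID varies by vendor):
#   tb_written "$(smartctl -A /dev/sda | awk '$1 == 241 {print $10}')"
```

Comparing that figure against the drive's rated TBW gives a rough idea of how much of the endurance budget has actually been spent.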
Re: Btrfs/SSD
On Mon, Apr 17, 2017 at 11:13 AM, Austin S. Hemmelgarn wrote:
>> What is a high end SSD these days? Built-in NVMe?
>
> One with a good FTL in the firmware. At minimum, the good Samsung EVO
> drives, the high quality Intel ones, and the Crucial MX series, but
> probably some others. My choice of words here probably wasn't the best
> though.

It's a confusing market that sorta defies figuring out what we've got. I have a Samsung EVO SATA SSD in one laptop, but then I have a Samsung EVO+ SD Card in an Intel NUC. They use that same EVO branding on an $11 SD Card. And then there's the Samsung Electronics Co Ltd NVMe SSD Controller SM951/PM951 in another laptop.

>> So long as this file is not reflinked or snapshot, filefrag shows a
>> pile of mostly 4096 byte blocks, thousands. But as they're pretty much
>> all continuous, the file fragmentation (extent count) is usually never
>> higher than 12. It meanders between 1 and 12 extents for its life.
>>
>> Except on the system using ssd_spread mount option. That one has a
>> journal file that is +C, is not being snapshot, but has over 3000
>> extents per filefrag and btrfs-progs/debugfs. Really weird.
>
> Given how the 'ssd' mount option behaves and the frequency that most
> systemd instances write to their journals, that's actually reasonably
> expected. We look for big chunks of free space to write into and then
> align to 2M regardless of the actual size of the write, which in turn
> means that files like the systemd journal which see lots of small
> (relatively speaking) writes will have way more extents than they
> should until you defragment them.

Nope. The first paragraph applies to the NVMe machine with the ssd mount option. Few fragments. The second paragraph applies to the SD Card machine with the ssd_spread mount option. Many fragments.

These are different versions of systemd-journald so I can't completely rule out a difference in write behavior.

>> Now, systemd aside, there are databases that behave this same way
>> where there's a small section constantly being overwritten, and one or
>> more sections that grow the database file from within and at the end.
>> If this is made cow, the file will absolutely fragment a ton. And
>> especially if the changes are mostly 4KiB block sizes that then are
>> fsync'd.
>>
>> It's almost like we need these things to not fsync at all, and just
>> rely on the filesystem commit time...
>
> Essentially yes, but that causes all kinds of other problems.

Drat.

--
Chris Murphy
Re: Btrfs/SSD
On Mon, 17 Apr 2017 07:53:04 -0400 "Austin S. Hemmelgarn" wrote:
> General info (not BTRFS specific):
> * Based on SMART attributes and other factors, current life expectancy
> for light usage (normal desktop usage) appears to be somewhere around
> 8-12 years depending on specifics of usage (assuming the same workload,
> F2FS is at the very top of the range, BTRFS and NILFS2 are on the upper
> end, XFS is roughly in the middle, ext4 and NTFS are on the low end
> (tested using Windows 7's NTFS driver), and FAT32 is an outlier at the
> bottom of the barrel).

Life expectancy for an SSD is defined not in years, but in TBW (terabytes written), and AFAICT that's not "from host", but "to flash" (some SSDs will show you both values in two separate SMART attributes out of the box; on some it can be unlocked). Filesystems may come into play only by the amount of write amplification they cause (how much "to flash" is greater than "from host"). Do you have any test data to show that FSes are ranked in that order by the WA they cause, or is it all about "general feel" and how they are branded (F2FS says so, so it must be the best)?

> * Queued DISCARD support is still missing in most consumer SATA SSD's,
> which in turn makes the trade-off on those between performance and
> lifetime much sharper.

My choice was to make a script to run from crontab, using "fstrim" on all mounted SSDs nightly, and aside from that all FSes are mounted with "nodiscard". Best of both worlds, and no interference with actual IO operation.

> * Modern (2015 and newer) SSD's seem to have better handling in the FTL
> for the journaling behavior of filesystems like ext4 and XFS. I'm not
> sure if this is actually a result of the FTL being better, or some
> change in the hardware.

Again, what makes you think this? Did you observe the write amplification readings, and are those now demonstrably lower than on "2014 and older" SSDs? So, by how much, and which models did you compare?
> * In my personal experience, Intel, Samsung, and Crucial appear to be
> the best name brands (in relative order of quality). I have personally
> had bad experiences with SanDisk and Kingston SSD's, but I don't have
> anything beyond circumstantial evidence indicating that it was anything
> but bad luck on both counts.

Why not think in terms of platforms rather than "name brands", i.e. a controller model + flash combination? For instance, Intel have been using some other companies' controllers in their SSDs. Kingston uses tons of various controllers (Sandforce/Phison/Marvell/more?) depending on the model and range.

> * Files with NOCOW and filesystems with 'nodatacow' set will both hurt
> performance for BTRFS on SSD's, and appear to reduce the lifetime of the
> SSD.

"Appear to"? Just... what. So how many SSDs did you have fail under nocow? Or maybe can we get serious in a technical discussion? Did you by any chance mean they cause more writes to the SSD and more "to flash" writes (resulting in a higher WA)? If so, then by how much, and what was your test scenario comparing the same usage with and without nocow?

> * Compression should help performance and device lifetime most of the
> time, unless your CPU is fully utilized on a regular basis (in which
> case it will hurt performance, but still improve device lifetimes).

Days are long gone since the end user had to ever think about device lifetimes with SSDs. Refer to endurance studies such as

http://techreport.com/review/27909/the-ssd-endurance-experiment-theyre-all-dead
http://ssdendurancetest.com/
https://3dnews.ru/938764/

It has been demonstrated that all SSDs on the market tend to overshoot even their rated TBW by several times; as a result it will take any user literally dozens of years to wear out the flash no matter which filesystem or what settings are used.
And most certainly it's not worth changing anything significant in your workflow (such as enabling compression if it's otherwise inconvenient or not needed) just to save the SSD lifetime.

On Mon, 17 Apr 2017 13:13:39 -0400 "Austin S. Hemmelgarn" wrote:
>> What is a high end SSD these days? Built-in NVMe?
> One with a good FTL in the firmware. At minimum, the good Samsung EVO
> drives, the high quality Intel ones

As opposed to bad Samsung EVO drives and low-quality Intel ones?

> and the Crucial MX series, but
> probably some others. My choice of words here probably wasn't the best
> though.

Again, which controller? Crucial does not manufacture SSD controllers on their own; they just pack and brand stuff manufactured by someone else. So if you meant Marvell-based SSDs, then that's many brands, not just Crucial.

> For a normal filesystem or BTRFS with nodatacow or NOCOW, the block gets
> rewritten in-place. This means that cheap FTL's will rewrite that erase
> block in-place (which won't hurt performance but will impact device
> lifetime), and good ones will rewrite into a free block somewhere else
Re: Remounting read-write after error is not allowed
On Mon, Apr 17, 2017 at 02:00:45PM -0400, Alexandru Guzu wrote: > Not sure if anyone is looking into that segfault, but I have an update. > I disconnected the USB drive for a while and today I reconnected it > and it auto-mounted with no issue. > > What is interesting is that the drive letter changed to what is was > before when it was working. > Remember that in my first email, the drive encounters an error as > /dev/sdg2, and it reconnected as /dev/sdh2 > > I was getting errors when it was assigned /dev/sdh2, but now it mounts > just fine when it is assigned /dev/sdg2 > > sudo btrfs fi show > Label: 'USB-data' uuid: 227fbb6c-ae72-4b81-8e65-942a0ddc6ef7 > Total devices 1 FS bytes used 230.72GiB > devid1 size 447.13GiB used 238.04GiB path /dev/sdg2 > > > is BTRFS really so sensitive to drive letter changes? > The USB driver is automounted as such: > > /dev/sdg2 on /media/alex/USB-data1 type btrfs > (rw,nosuid,nodev,relatime,space_cache,subvolid=5,subvol=/,uhelper=udisks2) > I don't think it's because of the changed drive letter, it was hitting the BUG_ON in kernel code btrfs_search_forward() and showed that somehow your btrfs had failed to read the metadata out of the drive (either because the underlying drive is not available for reading or because metadata checksum check is failing). That BUG_ON had gotten fixed in a later version, you may want to try the latest btrfs. Thanks, -liubo > > Thanks for the reply. > > > > I mounted it ro: > > $ sudo btrfs fi show /mnt > > Segmentation fault (core dumped) > > > > dmesg says: > > ... > > kernel BUG at /build/linux-wXdoVv/linux-4.4.0/fs/btrfs/ctree.c:5205! > > ... > > RIP: 0010:[] [] > > btrfs_search_forward+0x268/0x350 [btrfs] > > ... > > Call Trace: > > [] search_ioctl+0xf2/0x1c0 [btrfs] > > [] ? zone_statistics+0x7c/0xa0 > > [] btrfs_ioctl_tree_search+0x72/0xc0 [btrfs] > > [] btrfs_ioctl+0x455/0x28b0 [btrfs] > > [] ? mem_cgroup_try_charge+0x6b/0x1e0 > > [] ? 
handle_mm_fault+0xcad/0x1820 > > [] do_vfs_ioctl+0x29f/0x490 > > [] ? __do_page_fault+0x1b4/0x400 > > [] SyS_ioctl+0x79/0x90 > > [] entry_SYSCALL_64_fastpath+0x16/0x71 > > ... > > > > full dmesg output is at: > > pastebin.com/bhsEJiJN > > > > $ sudo btrfs fi df /mnt > > Data, single: total=236.01GiB, used=230.35GiB > > System, DUP: total=8.00MiB, used=48.00KiB > > System, single: total=4.00MiB, used=0.00B > > Metadata, DUP: total=1.00GiB, used=349.11MiB > > Metadata, single: total=8.00MiB, used=0.00B > > GlobalReserve, single: total=128.00MiB, used=0.00B > > > > I downloaded and compiled the current btrfs v4.10.1-9-gbd0ab27. > > > > sudo ./btrfs check /dev/sdh2 > > Checking filesystem on /dev/sdh2 > > UUID: 227fbb6c-ae72-4b81-8e65-942a0ddc6ef7 > > checking extents > > checking free space cache > > checking fs roots > > checking csums > > checking root refs > > found 247628603392 bytes used, no error found > > total csum bytes: 241513756 > > total tree bytes: 366084096 > > total fs tree bytes: 90259456 > > total extent tree bytes: 10010624 > > btree space waste bytes: 35406538 > > file data blocks allocated: 23837459185664 > > referenced 252963553280 > > > > and with lowmem mode (again no errors found): > > ./btrfs check /dev/sdh2 --mode=lowmem > > Checking filesystem on /dev/sdh2 > > UUID: 227fbb6c-ae72-4b81-8e65-942a0ddc6ef7 > > checking extents > > checking free space cache > > checking fs roots > > checking csums > > checking root refs > > found 247738298368 bytes used, no error found > > total csum bytes: 241513756 > > total tree bytes: 366084096 > > total fs tree bytes: 90259456 > > total extent tree bytes: 10010624 > > btree space waste bytes: 35406538 > > file data blocks allocated: 23837459185664 > > referenced 252963553280 > > > > maybe there is some hint in that segmentation fault? 
> > Also, I compiled from
> > git.kernel.org/pub/scm/linux/kernel/git/kdave/btrfs-progs.git but I
> > did not get version 4.10.1 instead of 4.10.2
> >
> > Regards,
> >
> > On Fri, Apr 14, 2017 at 1:17 PM, Chris Murphy wrote:
> >> Can you ro mount it and:
> >>
> >> btrfs fi show /mnt
> >> btrfs fi df /mnt
> >>
> >> And then next update the btrfs-progs to something newer like 4.9.2 or
> >> 4.10.2 and then do another 'btrfs check' without repair. And then
> >> separately do it again with --mode=lowmem and post both sets of
> >> results?
> >>
> >> Chris Murphy
Re: compressing nocow files
On Mon, Apr 17, 2017 at 12:07 PM, Liu Bowrote: > On Mon, Apr 17, 2017 at 11:36:17AM -0600, Chris Murphy wrote: >> HI, >> >> >> /dev/nvme0n1p8 on / type btrfs >> (rw,relatime,seclabel,ssd,space_cache,subvolid=258,subvol=/root) >> >> I've got a test folder with +C set and then copied a test file into it. >> >> $ lsattr >> C-- >> ./system@2547b430ecdc441c9cf82569eeb22065-0001-00054c3c31bec567.journal >> >> So now it's inherited +C. Next check fragments and compression. >> >> $ sudo ~/Applications/btrfs-debugfs -f >> system@2547b430ecdc441c9cf82569eeb22065-0001-00054c3c31bec567.journal >> (290662 0): ram 100663296 disk 17200840704 disk_size 100663296 >> file: >> system@2547b430ecdc441c9cf82569eeb22065-0001-00054c3c31bec567.journal >> extents 1 disk size 100663296 logical size 100663296 ratio 1.00 >> >> >> Try to compress it. >> >> $ sudo btrfs fi defrag -c >> system@2547b430ecdc441c9cf82569eeb22065-0001-00054c3c31bec567.journal >> >> Check again. >> >> [snip faux fragments] >> file: >> system@2547b430ecdc441c9cf82569eeb22065-0001-00054c3c31bec567.journal >> extents 768 disk size 21504000 logical size 100663296 ratio 4.68 >> >> It's compressed! OK?? >> >> OK delete that file, and add +c to the temp folder. >> >> $ lsattr >> c---C-- ./temp >> >> Copy another test file into temp. >> >> $ lsattr >> c---C-- >> ./system@2547b430ecdc441c9cf82569eeb22065-0002eb9f-00054d4ebb44b135.journal >> >> $ sudo ~/Applications/btrfs-debugfs -f >> system@2547b430ecdc441c9cf82569eeb22065-0002eb9f-00054d4ebb44b135.journal >> (290739 0): ram 83886080 disk 21764243456 disk_size 83886080 >> file: >> system@2547b430ecdc441c9cf82569eeb22065-0002eb9f-00054d4ebb44b135.journal >> extents 1 disk size 83886080 logical size 83886080 ratio 1.00 >> >> Not compressed. Hmmm. So somehow btrfs fi defrag -c will force >> compression even on nocow files? I've also done this without -c >> option, with a Btrfs mounted using compress option; and the file does >> compress also. 
>> I thought nocow files were always no compress, but it
>> seems there's exceptions.
>
> Good catch.
>
> Btrfs defragment depends on COW to do the job, thus nocow inodes are
> forced to COW when processing the defragged range, and compression also
> depends on COW, so btrfs filesystem defrag -c becomes an exception.
>
> Speaking of which, this exception seems to be harmless, other than
> confusing users.

No, in fact it might be a benefit. Recent versions of systemd-journald defragment their journals, which are highly compressible files with a lot of slack space in them. So defragmenting on SSD is just increasing write amplification. If the defragment ioctl supports passing a compression request, maybe that's an optimization systemd-journald can leverage for much smaller files.

--
Chris Murphy
Re: compressing nocow files
On Mon, Apr 17, 2017 at 11:36:17AM -0600, Chris Murphy wrote: > HI, > > > /dev/nvme0n1p8 on / type btrfs > (rw,relatime,seclabel,ssd,space_cache,subvolid=258,subvol=/root) > > I've got a test folder with +C set and then copied a test file into it. > > $ lsattr > C-- > ./system@2547b430ecdc441c9cf82569eeb22065-0001-00054c3c31bec567.journal > > So now it's inherited +C. Next check fragments and compression. > > $ sudo ~/Applications/btrfs-debugfs -f > system@2547b430ecdc441c9cf82569eeb22065-0001-00054c3c31bec567.journal > (290662 0): ram 100663296 disk 17200840704 disk_size 100663296 > file: > system@2547b430ecdc441c9cf82569eeb22065-0001-00054c3c31bec567.journal > extents 1 disk size 100663296 logical size 100663296 ratio 1.00 > > > Try to compress it. > > $ sudo btrfs fi defrag -c > system@2547b430ecdc441c9cf82569eeb22065-0001-00054c3c31bec567.journal > > Check again. > > [snip faux fragments] > file: > system@2547b430ecdc441c9cf82569eeb22065-0001-00054c3c31bec567.journal > extents 768 disk size 21504000 logical size 100663296 ratio 4.68 > > It's compressed! OK?? > > OK delete that file, and add +c to the temp folder. > > $ lsattr > c---C-- ./temp > > Copy another test file into temp. > > $ lsattr > c---C-- > ./system@2547b430ecdc441c9cf82569eeb22065-0002eb9f-00054d4ebb44b135.journal > > $ sudo ~/Applications/btrfs-debugfs -f > system@2547b430ecdc441c9cf82569eeb22065-0002eb9f-00054d4ebb44b135.journal > (290739 0): ram 83886080 disk 21764243456 disk_size 83886080 > file: > system@2547b430ecdc441c9cf82569eeb22065-0002eb9f-00054d4ebb44b135.journal > extents 1 disk size 83886080 logical size 83886080 ratio 1.00 > > Not compressed. Hmmm. So somehow btrfs fi defrag -c will force > compression even on nocow files? I've also done this without -c > option, with a Btrfs mounted using compress option; and the file does > compress also. I thought nocow files were always no compress, but it > seems there's exceptions. > Good catch. 
Btrfs defragment depends on COW to do the job, thus nocow inodes are forced to COW when processing the defragged range, and compression also depends on COW, so btrfs filesystem defrag -c becomes an exception.

Speaking of which, this exception seems to be harmless, other than confusing users.

Thanks,

-liubo
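Liu Bo's point — defrag forces COW even for nocow inodes, so `-c` ends up compressing them — can be reproduced with a short script along these lines (an illustrative sketch, not Chris's exact commands; it needs root, a directory on btrfs, and picks zlib as the compression type):

```shell
# show_nocow_compress DIR FILE: copy FILE into a fresh NOCOW directory
# on a btrfs filesystem, then force compression via defrag and compare
# extent counts before and after.  DIR must live on btrfs; run as root.
show_nocow_compress() {
    dir=$1 src=$2
    mkdir -p "$dir" && chattr +C "$dir"        # new files in DIR inherit +C (NOCOW)
    cp "$src" "$dir/f"
    lsattr "$dir/f"                            # expect the C attribute to be set
    filefrag "$dir/f"                          # extent count before defrag
    btrfs filesystem defrag -czlib "$dir/f"    # forces COW, so the data compresses
    filefrag "$dir/f"                          # extent count jumps; file is now compressed
}
```

On Chris's machine this was the difference between ratio 1.00 and ratio 4.68 as reported by btrfs-debugfs.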
Re: compressing nocow files
On 2017-04-17 13:36, Chris Murphy wrote:
> Hi,
>
> /dev/nvme0n1p8 on / type btrfs
> (rw,relatime,seclabel,ssd,space_cache,subvolid=258,subvol=/root)
>
> I've got a test folder with +C set and then copied a test file into it.
>
> $ lsattr
> C--
> ./system@2547b430ecdc441c9cf82569eeb22065-0001-00054c3c31bec567.journal
>
> So now it's inherited +C. Next check fragments and compression.
>
> $ sudo ~/Applications/btrfs-debugfs -f
> system@2547b430ecdc441c9cf82569eeb22065-0001-00054c3c31bec567.journal
> (290662 0): ram 100663296 disk 17200840704 disk_size 100663296
> file: system@2547b430ecdc441c9cf82569eeb22065-0001-00054c3c31bec567.journal
> extents 1 disk size 100663296 logical size 100663296 ratio 1.00
>
> Try to compress it.
>
> $ sudo btrfs fi defrag -c
> system@2547b430ecdc441c9cf82569eeb22065-0001-00054c3c31bec567.journal
>
> Check again.
>
> [snip faux fragments]
> file: system@2547b430ecdc441c9cf82569eeb22065-0001-00054c3c31bec567.journal
> extents 768 disk size 21504000 logical size 100663296 ratio 4.68
>
> It's compressed! OK??
>
> OK delete that file, and add +c to the temp folder.
>
> $ lsattr
> c---C-- ./temp
>
> Copy another test file into temp.
>
> $ lsattr
> c---C--
> ./system@2547b430ecdc441c9cf82569eeb22065-0002eb9f-00054d4ebb44b135.journal
>
> $ sudo ~/Applications/btrfs-debugfs -f
> system@2547b430ecdc441c9cf82569eeb22065-0002eb9f-00054d4ebb44b135.journal
> (290739 0): ram 83886080 disk 21764243456 disk_size 83886080
> file: system@2547b430ecdc441c9cf82569eeb22065-0002eb9f-00054d4ebb44b135.journal
> extents 1 disk size 83886080 logical size 83886080 ratio 1.00
>
> Not compressed. Hmmm. So somehow btrfs fi defrag -c will force
> compression even on nocow files? I've also done this without -c
> option, with a Btrfs mounted using compress option; and the file does
> compress also. I thought nocow files were always no compress, but it
> seems there's exceptions.

This is odd behavior. The fact that it's inconsistent is what worries me the most though, not that it's possible to compress NOCOW files (the data loss window on a compressed write is a bit bigger than on an uncompressed one, but it still doesn't have the race conditions that checksums+NOCOW does). I'm willing to bet the issue is in how the FS handles defragmentation, seeing as the only case here where it was compressed was after you compressed it using the defrag command, which IIRC doesn't actually do any checks on the kernel side except making sure nothing has the file mapped as an executable.
Re: Remounting read-write after error is not allowed
Not sure if anyone is looking into that segfault, but I have an update. I disconnected the USB drive for a while and today I reconnected it and it auto-mounted with no issue. What is interesting is that the drive letter changed to what is was before when it was working. Remember that in my first email, the drive encounters an error as /dev/sdg2, and it reconnected as /dev/sdh2 I was getting errors when it was assigned /dev/sdh2, but now it mounts just fine when it is assigned /dev/sdg2 sudo btrfs fi show Label: 'USB-data' uuid: 227fbb6c-ae72-4b81-8e65-942a0ddc6ef7 Total devices 1 FS bytes used 230.72GiB devid1 size 447.13GiB used 238.04GiB path /dev/sdg2 is BTRFS really so sensitive to drive letter changes? The USB driver is automounted as such: /dev/sdg2 on /media/alex/USB-data1 type btrfs (rw,nosuid,nodev,relatime,space_cache,subvolid=5,subvol=/,uhelper=udisks2) Regards, On Fri, Apr 14, 2017 at 3:51 PM, Alexandru Guzuwrote: > Thanks for the reply. > > I mounted it ro: > $ sudo btrfs fi show /mnt > Segmentation fault (core dumped) > > dmesg says: > ... > kernel BUG at /build/linux-wXdoVv/linux-4.4.0/fs/btrfs/ctree.c:5205! > ... > RIP: 0010:[] [] > btrfs_search_forward+0x268/0x350 [btrfs] > ... > Call Trace: > [] search_ioctl+0xf2/0x1c0 [btrfs] > [] ? zone_statistics+0x7c/0xa0 > [] btrfs_ioctl_tree_search+0x72/0xc0 [btrfs] > [] btrfs_ioctl+0x455/0x28b0 [btrfs] > [] ? mem_cgroup_try_charge+0x6b/0x1e0 > [] ? handle_mm_fault+0xcad/0x1820 > [] do_vfs_ioctl+0x29f/0x490 > [] ? __do_page_fault+0x1b4/0x400 > [] SyS_ioctl+0x79/0x90 > [] entry_SYSCALL_64_fastpath+0x16/0x71 > ... 
> > full dmesg output is at: > pastebin.com/bhsEJiJN > > $ sudo btrfs fi df /mnt > Data, single: total=236.01GiB, used=230.35GiB > System, DUP: total=8.00MiB, used=48.00KiB > System, single: total=4.00MiB, used=0.00B > Metadata, DUP: total=1.00GiB, used=349.11MiB > Metadata, single: total=8.00MiB, used=0.00B > GlobalReserve, single: total=128.00MiB, used=0.00B > > I downloaded and compiled the current btrfs v4.10.1-9-gbd0ab27. > > sudo ./btrfs check /dev/sdh2 > Checking filesystem on /dev/sdh2 > UUID: 227fbb6c-ae72-4b81-8e65-942a0ddc6ef7 > checking extents > checking free space cache > checking fs roots > checking csums > checking root refs > found 247628603392 bytes used, no error found > total csum bytes: 241513756 > total tree bytes: 366084096 > total fs tree bytes: 90259456 > total extent tree bytes: 10010624 > btree space waste bytes: 35406538 > file data blocks allocated: 23837459185664 > referenced 252963553280 > > and with lowmem mode (again no errors found): > ./btrfs check /dev/sdh2 --mode=lowmem > Checking filesystem on /dev/sdh2 > UUID: 227fbb6c-ae72-4b81-8e65-942a0ddc6ef7 > checking extents > checking free space cache > checking fs roots > checking csums > checking root refs > found 247738298368 bytes used, no error found > total csum bytes: 241513756 > total tree bytes: 366084096 > total fs tree bytes: 90259456 > total extent tree bytes: 10010624 > btree space waste bytes: 35406538 > file data blocks allocated: 23837459185664 > referenced 252963553280 > > maybe there is some hint in that segmentation fault? > Also, I compiled from > git.kernel.org/pub/scm/linux/kernel/git/kdave/btrfs-progs.git but I > did not get version 4.10.1 instead of 4.10.2 > > Regards, > > On Fri, Apr 14, 2017 at 1:17 PM, Chris Murphy wrote: >> Can you ro mount it and: >> >> btrfs fi show /mnt >> btrfs fi df /mnt >> >> And then next update the btrfs-progs to something newer like 4.9.2 or >> 4.10.2 and then do another 'btrfs check' without repair. 
>> And then
>> separately do it again with --mode=lowmem and post both sets of
>> results?
>>
>> Chris Murphy
compressing nocow files
HI, /dev/nvme0n1p8 on / type btrfs (rw,relatime,seclabel,ssd,space_cache,subvolid=258,subvol=/root) I've got a test folder with +C set and then copied a test file into it. $ lsattr C-- ./system@2547b430ecdc441c9cf82569eeb22065-0001-00054c3c31bec567.journal So now it's inherited +C. Next check fragments and compression. $ sudo ~/Applications/btrfs-debugfs -f system@2547b430ecdc441c9cf82569eeb22065-0001-00054c3c31bec567.journal (290662 0): ram 100663296 disk 17200840704 disk_size 100663296 file: system@2547b430ecdc441c9cf82569eeb22065-0001-00054c3c31bec567.journal extents 1 disk size 100663296 logical size 100663296 ratio 1.00 Try to compress it. $ sudo btrfs fi defrag -c system@2547b430ecdc441c9cf82569eeb22065-0001-00054c3c31bec567.journal Check again. [snip faux fragments] file: system@2547b430ecdc441c9cf82569eeb22065-0001-00054c3c31bec567.journal extents 768 disk size 21504000 logical size 100663296 ratio 4.68 It's compressed! OK?? OK delete that file, and add +c to the temp folder. $ lsattr c---C-- ./temp Copy another test file into temp. $ lsattr c---C-- ./system@2547b430ecdc441c9cf82569eeb22065-0002eb9f-00054d4ebb44b135.journal $ sudo ~/Applications/btrfs-debugfs -f system@2547b430ecdc441c9cf82569eeb22065-0002eb9f-00054d4ebb44b135.journal (290739 0): ram 83886080 disk 21764243456 disk_size 83886080 file: system@2547b430ecdc441c9cf82569eeb22065-0002eb9f-00054d4ebb44b135.journal extents 1 disk size 83886080 logical size 83886080 ratio 1.00 Not compressed. Hmmm. So somehow btrfs fi defrag -c will force compression even on nocow files? I've also done this without -c option, with a Btrfs mounted using compress option; and the file does compress also. I thought nocow files were always no compress, but it seems there's exceptions. Chris Murphy -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Btrfs/SSD
On 2017-04-17 12:58, Chris Murphy wrote:
> On Mon, Apr 17, 2017 at 5:53 AM, Austin S. Hemmelgarn wrote:
>> Regarding BTRFS specifically:
>> * Given my recently newfound understanding of what the 'ssd' mount
>> option actually does, I'm inclined to recommend that people who are
>> using high-end SSD's _NOT_ use it as it will heavily increase
>> fragmentation and will likely have near zero impact on actual device
>> lifetime (but may _hurt_ performance). It will still probably help
>> with mid and low-end SSD's.
>
> What is a high end SSD these days? Built-in NVMe?

One with a good FTL in the firmware. At minimum, the good Samsung EVO drives, the high quality Intel ones, and the Crucial MX series, but probably some others. My choice of words here probably wasn't the best though.

>> * Files with NOCOW and filesystems with 'nodatacow' set will both
>> hurt performance for BTRFS on SSD's, and appear to reduce the
>> lifetime of the SSD.
>
> Can you elaborate? It's an interesting problem. On a small scale, the
> systemd folks have journald set +C on /var/log/journal so that any
> new journals are nocow. There is an initial fallocate, but the write
> behavior is writing in the same place at the head and tail. But at
> the tail, the writes get pushed toward the middle. So the file is
> growing into its fallocated space from the tail. The header changes
> in the same location; it's an overwrite.

For a normal filesystem, or BTRFS with nodatacow or NOCOW, the block gets rewritten in place. This means that cheap FTL's will rewrite that erase block in place (which won't hurt performance but will impact device lifetime), and good ones will rewrite into a free block somewhere else but may not free the original block for quite some time (which is bad for performance but slightly better for device lifetime). When BTRFS does a COW operation on a block, however, it guarantees that the block moves. Because of this, the old location will either:
1. Be discarded by the FS itself if the 'discard' mount option is set.
2. Be caught by a scheduled call to 'fstrim'.
3. Lie dormant for at least a while.
The first case is ideal for most FTL's, because it lets them know immediately that the data isn't needed and the space can be reused. The second is close to ideal, but defers telling the FTL that the block is unused, which can be better on some SSD's (some have firmware that handles wear-leveling better in batches). The third is not ideal, but is still better than what happens with NOCOW or nodatacow set. Overall, this boils down to the fact that most FTL's get slower if they can't wear-level the device properly, and in-place rewrites make it harder for them to do proper wear-leveling.

> So long as this file is not reflinked or snapshot, filefrag shows a
> pile of mostly 4096 byte blocks, thousands. But as they're pretty
> much all contiguous, the file fragmentation (extent count) is usually
> never higher than 12. It meanders between 1 and 12 extents for its
> life. Except on the system using the ssd_spread mount option: that
> one has a journal file that is +C, is not being snapshot, but has
> over 3000 extents per filefrag and btrfs-progs/debugfs. Really weird.

Given how the 'ssd' mount option behaves and the frequency with which most systemd instances write to their journals, that's actually reasonably expected. We look for big chunks of free space to write into and then align to 2M regardless of the actual size of the write, which in turn means that files like the systemd journal, which see lots of small (relatively speaking) writes, will have way more extents than they should until you defragment them.

> Now, systemd aside, there are databases that behave this same way,
> where there's a small section constantly being overwritten, and one
> or more sections that grow the database file from within and at the
> end. If this is made cow, the file will absolutely fragment a ton,
> especially if the changes are mostly 4KiB block sizes that are then
> fsync'd. It's almost like we need these things to not fsync at all,
> and just rely on the filesystem commit time...

Essentially yes, but that causes all kinds of other problems.
Re: Btrfs/SSD
On Mon, Apr 17, 2017 at 5:53 AM, Austin S. Hemmelgarn wrote:
> Regarding BTRFS specifically:
> * Given my recently newfound understanding of what the 'ssd' mount
> option actually does, I'm inclined to recommend that people who are
> using high-end SSD's _NOT_ use it as it will heavily increase
> fragmentation and will likely have near zero impact on actual device
> lifetime (but may _hurt_ performance). It will still probably help
> with mid and low-end SSD's.

What is a high end SSD these days? Built-in NVMe?

> * Files with NOCOW and filesystems with 'nodatacow' set will both
> hurt performance for BTRFS on SSD's, and appear to reduce the
> lifetime of the SSD.

Can you elaborate? It's an interesting problem. On a small scale, the systemd folks have journald set +C on /var/log/journal so that any new journals are nocow. There is an initial fallocate, but the write behavior is writing in the same place at the head and tail. But at the tail, the writes get pushed toward the middle. So the file is growing into its fallocated space from the tail. The header changes in the same location; it's an overwrite.

So long as this file is not reflinked or snapshot, filefrag shows a pile of mostly 4096 byte blocks, thousands. But as they're pretty much all contiguous, the file fragmentation (extent count) is usually never higher than 12. It meanders between 1 and 12 extents for its life. Except on the system using the ssd_spread mount option: that one has a journal file that is +C, is not being snapshot, but has over 3000 extents per filefrag and btrfs-progs/debugfs. Really weird.

Now, systemd aside, there are databases that behave this same way, where there's a small section constantly being overwritten, and one or more sections that grow the database file from within and at the end. If this is made cow, the file will absolutely fragment a ton, especially if the changes are mostly 4KiB block sizes that are then fsync'd.

It's almost like we need these things to not fsync at all, and just rely on the filesystem commit time...

-- 
Chris Murphy
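The fragmentation claim in the last paragraph can be sketched with a toy allocator: appends get consecutive physical blocks and merge into one extent, but if the header overwrite is CoW'd between appends, every relocation breaks contiguity. All numbers and the allocator itself are invented for illustration; this is not how the btrfs allocator actually works.

```python
# Toy model of the journald write pattern: a header block repeatedly
# overwritten, plus 4KiB appends at the tail. Compares nocow (in-place
# header rewrite) against CoW (header relocates on every overwrite).
# The bump allocator here is a deliberate oversimplification.

def count_extents(phys):
    """phys[i] = physical block backing logical block i. Adjacent
    logical blocks that are also physically adjacent merge."""
    extents = 1
    for i in range(1, len(phys)):
        if phys[i] != phys[i - 1] + 1:
            extents += 1
    return extents

def simulate(cow, appends=1000):
    next_free = 0
    def alloc():
        nonlocal next_free
        next_free += 1
        return next_free - 1
    phys = [alloc()]                # header block
    for _ in range(appends):
        phys.append(alloc())        # append one block at the tail
        if cow:
            phys[0] = alloc()       # CoW header overwrite: block moves
        # nocow: header rewritten in place, phys[0] unchanged
    return count_extents(phys)

print(simulate(cow=False))  # 1: the file stays one contiguous extent
print(simulate(cow=True))   # hundreds: every append is interleaved
                            # with a relocated header write
```

This is the "fragment a ton" case: interleaving the relocated header writes with the appends means the appended blocks are no longer physically adjacent to each other.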
Re: Btrfs/SSD
On 2017-04-14 07:02, Imran Geriskovan wrote:
> Hi,
> Some time ago we had some discussion about SSDs. Within the limits of
> unknown/undocumented device info, we loosely covered data retention
> capability, disk age/lifetime interrelations, the (in?)effectiveness
> of btrfs dup on SSDs, etc.
> Now that time has passed and some experience with SSDs has
> accumulated, I think we can again have a status check/update on them,
> if you can share your experiences and best practices. So if you have
> something to share about SSDs (it may or may not be directly related
> to btrfs), I'm sure everybody here will be happy to hear it.

General info (not BTRFS specific):
* Based on SMART attributes and other factors, current life expectancy for light usage (normal desktop usage) appears to be somewhere around 8-12 years depending on the specifics of usage (assuming the same workload, F2FS is at the very top of the range, BTRFS and NILFS2 are on the upper end, XFS is roughly in the middle, ext4 and NTFS are on the low end (tested using Windows 7's NTFS driver), and FAT32 is an outlier at the bottom of the barrel).
* Queued DISCARD support is still missing in most consumer SATA SSD's, which in turn makes the trade-off on those between performance and lifetime much sharper.
* Modern (2015 and newer) SSD's seem to have better handling in the FTL for the journaling behavior of filesystems like ext4 and XFS. I'm not sure if this is actually a result of the FTL being better, or some change in the hardware.
* In my personal experience, Intel, Samsung, and Crucial appear to be the best name brands (in relative order of quality). I have personally had bad experiences with SanDisk and Kingston SSD's, but I don't have anything beyond circumstantial evidence indicating that it was anything but bad luck on both counts.

Regarding BTRFS specifically:
* Given my recently newfound understanding of what the 'ssd' mount option actually does, I'm inclined to recommend that people who are using high-end SSD's _NOT_ use it, as it will heavily increase fragmentation and will likely have near zero impact on actual device lifetime (but may _hurt_ performance). It will still probably help with mid and low-end SSD's.
* Files with NOCOW and filesystems with 'nodatacow' set will both hurt performance for BTRFS on SSD's, and appear to reduce the lifetime of the SSD.
* Compression should help performance and device lifetime most of the time, unless your CPU is fully utilized on a regular basis (in which case it will hurt performance, but still improve device lifetime).
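The last point follows from simple arithmetic: if compression halves the bytes that reach the flash, the drive's rated endurance lasts roughly twice as long. A back-of-envelope sketch, with every figure invented for illustration (TBW rating, daily write volume, and compression ratio are all hypothetical):

```python
# Back-of-envelope: compression extends device lifetime roughly in
# proportion to the achieved compression ratio, because fewer bytes
# reach the flash. All figures below are invented for illustration.

TBW = 150                # rated endurance in TB written (hypothetical)
DAILY_GB = 20            # host writes per day in GB (hypothetical)
RATIO = 2.0              # assumed average compression ratio

def lifetime_years(tbw, daily_gb, compression_ratio=1.0):
    # Physical bytes written per day shrink by the compression ratio.
    physical_daily_gb = daily_gb / compression_ratio
    return tbw * 1024 / physical_daily_gb / 365

print(round(lifetime_years(TBW, DAILY_GB), 1))         # 21.0 years, raw
print(round(lifetime_years(TBW, DAILY_GB, RATIO), 1))  # 42.1 years, 2:1
```

This ignores write amplification inside the FTL (which compression also reduces, by leaving more effective over-provisioning free), so if anything it understates the benefit on a mostly-full drive.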