Re: Btrfs/SSD

2017-04-17 Thread Roman Mamedov
On Tue, 18 Apr 2017 03:23:13 + (UTC)
Duncan <1i5t5.dun...@cox.net> wrote:

> Without reading the links...
> 
> Are you /sure/ it's /all/ ssds currently on the market?  Or are you 
> thinking narrowly, those actually sold as ssds?
> 
> Because all I've read (and I admit I may not actually be current, but...) 
> on for instance sd cards, certainly ssds by definition, says they're 
> still very write-cycle sensitive -- very simple FTL with little FTL wear-
> leveling.
> 
> And AFAIK, USB thumb drives tend to be in the middle, moderately complex 
> FTL with some, somewhat simplistic, wear-leveling.
> 

If I have to clarify, yes, it's all about SATA and NVMe SSDs. SD cards may be
SSDs "by definition", but nobody will think of an SD card when you say "I
bought an SSD for my computer". And yes, SD cards and USB flash sticks are
commonly understood to be much simpler and more brittle devices than
full-blown desktop (not to mention server) SSDs.

> While the stuff actually marketed as SSDs, generally SATA or direct PCIE/
> NVME connected, may indeed match your argument -- no real end-user concern 
> is necessary any more, as the FTLs are advanced enough that user- or 
> filesystem-level write-cycle management just isn't needed these days.
> 
> 
> So does that claim that write-cycle concerns simply don't apply to modern 
> ssds, also apply to common thumb drives and sd cards?  Because these are 
> certainly ssds both technically and by btrfs standards.
> 


-- 
With respect,
Roman


Re: compressing nocow files

2017-04-17 Thread Chris Murphy
To inhibit chattr +C on systemd-journald journals:

- manually remove the attribute on /var/log/journal and
/var/log/journal/

- write an empty file:
/etc/tmpfiles.d/journal-nocow.conf
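
A minimal sketch of those two steps (assuming the distribution ships a
tmpfiles.d rule of the same name that applies +C, and with <machine-id>
standing in for whatever journal subdirectory exists on your system):

  # mask the vendor rule so tmpfiles stops (re)applying +C
  $ sudo touch /etc/tmpfiles.d/journal-nocow.conf

  # drop the attribute from the journal directories so newly created
  # journal files no longer inherit NOCOW
  $ sudo chattr -C /var/log/journal /var/log/journal/<machine-id>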


Chris Murphy


Re: compressing nocow files

2017-04-17 Thread Duncan
Liu Bo posted on Mon, 17 Apr 2017 11:07:07 -0700 as excerpted:

> On Mon, Apr 17, 2017 at 11:36:17AM -0600, Chris Murphy wrote:
>> HI,
>> 
>> 
>> /dev/nvme0n1p8 on / type btrfs
>> (rw,relatime,seclabel,ssd,space_cache,subvolid=258,subvol=/root)
>> 
>> I've got a test folder with +C set and then copied a test file into it.
>> 
>> $ lsattr
>> C--
>> ./system@2547b430ecdc441c9cf82569eeb22065-0001-00054c3c31bec567.journal
>> 
>> So now it's inherited +C. Next check fragments and compression.
>> 
>> $ sudo ~/Applications/btrfs-debugfs -f system@2547b430ecdc441c9cf82569eeb22065-0001-00054c3c31bec567.journal
>> (290662 0): ram 100663296 disk 17200840704 disk_size 100663296
>> file: system@2547b430ecdc441c9cf82569eeb22065-0001-00054c3c31bec567.journal
>> extents 1 disk size 100663296 logical size 100663296 ratio 1.00
>> 
>> 
>> Try to compress it.
>> 
>> $ sudo btrfs fi defrag -c system@2547b430ecdc441c9cf82569eeb22065-0001-00054c3c31bec567.journal
>> 
>> Check again.
>> 
>> [snip faux fragments]
>> file: system@2547b430ecdc441c9cf82569eeb22065-0001-00054c3c31bec567.journal
>> extents 768 disk size 21504000 logical size 100663296 ratio 4.68
>> 
>> It's compressed! OK??
>> 
>> OK delete that file, and add +c to the temp folder.
>> 
>> $ lsattr
>> c---C-- ./temp
>> 
>> Copy another test file into temp.
>> 
>> $ lsattr
>> c---C--
>> ./system@2547b430ecdc441c9cf82569eeb22065-0002eb9f-00054d4ebb44b135.journal
>> 
>> $ sudo ~/Applications/btrfs-debugfs -f system@2547b430ecdc441c9cf82569eeb22065-0002eb9f-00054d4ebb44b135.journal
>> (290739 0): ram 83886080 disk 21764243456 disk_size 83886080
>> file: system@2547b430ecdc441c9cf82569eeb22065-0002eb9f-00054d4ebb44b135.journal
>> extents 1 disk size 83886080 logical size 83886080 ratio 1.00
>> 
>> Not compressed. Hmmm. So somehow btrfs fi defrag -c will force
>> compression even on nocow files? I've also done this without -c option,
>> with a Btrfs mounted using compress option; and the file does compress
>> also. I thought nocow files were always no compress, but it seems
>> there's exceptions.
>>
>>
> Good catch.
> 
> Btrfs defragment depends on COW to do the job, thus nocow inodes are
> forced to COW when processing the defragged range, and compression also
> depends on COW, so btrfs filesystem defrag -c becomes an exception.

Thanks for the explanation. =:^)

> Speaking of which, this exception seems to be not harmful other than
> confusing users.

My question is, what does it do then when a new modification-write comes 
in to the compressed no-cow file, and the modification isn't as 
compressible as the data it replaced?

Actually, do new writes to a defrag-compressed nocow file compress at 
all, or not, and/or does nocow effectively become cow1 (as when a nocow 
file is snapshotted) or even get effectively ignored entirely (as I 
believe happens when a normal cow file is marked nocow after it has been 
written)?

Is this likely to be a source of some of the unreproduced bugs we've 
seen, simply because much like a metadata-inline block followed by extent-
tree blocks, it wasn't supposed to happen and thus wasn't designed for or 
tested for?

(FTR, this shouldn't affect me ATM as AFAIK I have no nocow files; 
systemd is set to produce temporary/tmpfs files only, no permanent 
journal, so its journal files don't even hit btrfs, and I know of no others 
set nocow by default, nor have I set anything nocow.  But that 
doesn't mean I'm not concerned, as it certainly affects others, and could 
hypothetically affect me should I discover I have a reason to nocow 
something.)

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



Re: [PATCH] Btrfs: fix extent map leak during fallocate error path

2017-04-17 Thread Liu Bo
On Tue, Apr 04, 2017 at 03:20:36AM +0100, fdman...@kernel.org wrote:
> From: Filipe Manana 
> 
> If the call to btrfs_qgroup_reserve_data() failed, we were leaking an
> extent map structure. The failure can happen either due to an -ENOMEM
> condition or, when quotas are enabled, due to -EDQUOT for example.
>

Reviewed-by: Liu Bo 

Thanks,

-liubo
> Signed-off-by: Filipe Manana 
> ---
>  fs/btrfs/file.c | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
> index 48dfb8e..56304c4 100644
> --- a/fs/btrfs/file.c
> +++ b/fs/btrfs/file.c
> @@ -2856,8 +2856,10 @@ static long btrfs_fallocate(struct file *file, int 
> mode,
>   }
>   ret = btrfs_qgroup_reserve_data(inode, cur_offset,
>   last_byte - cur_offset);
> - if (ret < 0)
> + if (ret < 0) {
> + free_extent_map(em);
>   break;
> + }
>   } else {
>   /*
>* Do not need to reserve unwritten extent for this
> -- 
> 2.7.0.rc3
> 


Re: Btrfs/SSD

2017-04-17 Thread Duncan
Roman Mamedov posted on Mon, 17 Apr 2017 23:24:19 +0500 as excerpted:

> Days are long gone since the end user had to ever think about device
> lifetimes with SSDs. Refer to endurance studies such as
> http://techreport.com/review/27909/the-ssd-endurance-experiment-theyre-all-dead
> http://ssdendurancetest.com/
> https://3dnews.ru/938764/
> It has been demonstrated that all SSDs on the market tend to overshoot
> even their rated TBW by several times, as a result it will take any user
> literally dozens of years to wear out the flash no matter which
> filesystem or what settings used

Without reading the links...

Are you /sure/ it's /all/ ssds currently on the market?  Or are you 
thinking narrowly, those actually sold as ssds?

Because all I've read (and I admit I may not actually be current, but...) 
on for instance sd cards, certainly ssds by definition, says they're 
still very write-cycle sensitive -- very simple FTL with little FTL wear-
leveling.

And AFAIK, USB thumb drives tend to be in the middle, moderately complex 
FTL with some, somewhat simplistic, wear-leveling.

While the stuff actually marketed as SSDs, generally SATA or direct PCIE/
NVME connected, may indeed match your argument -- no real end-user concern 
is necessary any more, as the FTLs are advanced enough that user- or 
filesystem-level write-cycle management just isn't needed these days.


So does that claim that write-cycle concerns simply don't apply to modern 
ssds, also apply to common thumb drives and sd cards?  Because these are 
certainly ssds both technically and by btrfs standards.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



[PATCH 3/6] Btrfs: change how we iterate bios in endio

2017-04-17 Thread Liu Bo
Since dio submit now uses bio_clone_fast, the submitted bio may not have a
reliable bi_vcnt.  For the bio vector iterations in checksum related
functions, bio->bi_iter is not yet modified and it's safe to use
bio_for_each_segment.  For the bio vector iterations in dio's read endio,
we now save a copy of the bvec_iter in struct btrfs_io_bio when cloning
bios and use the helper __bio_for_each_segment with the saved bvec_iter to
access each bvec.

Signed-off-by: Liu Bo 
---
 fs/btrfs/extent_io.c |  1 +
 fs/btrfs/file-item.c | 31 +++
 fs/btrfs/inode.c | 33 +
 fs/btrfs/volumes.h   |  1 +
 4 files changed, 34 insertions(+), 32 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 1b7156c..54108d1 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -2738,6 +2738,7 @@ struct bio *btrfs_bio_clone_partial(struct bio *orig, 
gfp_t gfp_mask, int offset
btrfs_bio->end_io = NULL;
 
bio_trim(bio, (offset >> 9), (size >> 9));
+   btrfs_bio->iter = bio->bi_iter;
}
return bio;
 }
diff --git a/fs/btrfs/file-item.c b/fs/btrfs/file-item.c
index 64fcb31..9f6062c 100644
--- a/fs/btrfs/file-item.c
+++ b/fs/btrfs/file-item.c
@@ -164,7 +164,8 @@ static int __btrfs_lookup_bio_sums(struct inode *inode, 
struct bio *bio,
   u64 logical_offset, u32 *dst, int dio)
 {
struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
-   struct bio_vec *bvec;
+   struct bio_vec bvec;
+   struct bvec_iter iter;
struct btrfs_io_bio *btrfs_bio = btrfs_io_bio(bio);
struct btrfs_csum_item *item = NULL;
	struct extent_io_tree *io_tree = &BTRFS_I(inode)->io_tree;
@@ -177,7 +178,7 @@ static int __btrfs_lookup_bio_sums(struct inode *inode, 
struct bio *bio,
u64 page_bytes_left;
u32 diff;
int nblocks;
-   int count = 0, i;
+   int count = 0;
u16 csum_size = btrfs_super_csum_size(fs_info->super_copy);
 
path = btrfs_alloc_path();
@@ -206,8 +207,6 @@ static int __btrfs_lookup_bio_sums(struct inode *inode, 
struct bio *bio,
if (bio->bi_iter.bi_size > PAGE_SIZE * 8)
path->reada = READA_FORWARD;
 
-   WARN_ON(bio->bi_vcnt <= 0);
-
/*
 * the free space stuff is only read when it hasn't been
 * updated in the current transaction.  So, we can safely
@@ -223,13 +222,13 @@ static int __btrfs_lookup_bio_sums(struct inode *inode, 
struct bio *bio,
if (dio)
offset = logical_offset;
 
-   bio_for_each_segment_all(bvec, bio, i) {
-   page_bytes_left = bvec->bv_len;
+   bio_for_each_segment(bvec, bio, iter) {
+   page_bytes_left = bvec.bv_len;
if (count)
goto next;
 
if (!dio)
-   offset = page_offset(bvec->bv_page) + bvec->bv_offset;
+   offset = page_offset(bvec.bv_page) + bvec.bv_offset;
count = btrfs_find_ordered_sum(inode, offset, disk_bytenr,
   (u32 *)csum, nblocks);
if (count)
@@ -440,15 +439,15 @@ int btrfs_csum_one_bio(struct inode *inode, struct bio 
*bio,
struct btrfs_ordered_sum *sums;
struct btrfs_ordered_extent *ordered = NULL;
char *data;
-   struct bio_vec *bvec;
+   struct bvec_iter iter;
+   struct bio_vec bvec;
int index;
int nr_sectors;
-   int i, j;
unsigned long total_bytes = 0;
unsigned long this_sum_bytes = 0;
+   int i;
u64 offset;
 
-   WARN_ON(bio->bi_vcnt <= 0);
sums = kzalloc(btrfs_ordered_sum_size(fs_info, bio->bi_iter.bi_size),
   GFP_NOFS);
if (!sums)
@@ -465,19 +464,19 @@ int btrfs_csum_one_bio(struct inode *inode, struct bio 
*bio,
sums->bytenr = (u64)bio->bi_iter.bi_sector << 9;
index = 0;
 
-   bio_for_each_segment_all(bvec, bio, j) {
+   bio_for_each_segment(bvec, bio, iter) {
if (!contig)
-   offset = page_offset(bvec->bv_page) + bvec->bv_offset;
+   offset = page_offset(bvec.bv_page) + bvec.bv_offset;
 
if (!ordered) {
ordered = btrfs_lookup_ordered_extent(inode, offset);
BUG_ON(!ordered); /* Logic error */
}
 
-   data = kmap_atomic(bvec->bv_page);
+   data = kmap_atomic(bvec.bv_page);
 
nr_sectors = BTRFS_BYTES_TO_BLKS(fs_info,
-bvec->bv_len + 
fs_info->sectorsize
+bvec.bv_len + 
fs_info->sectorsize
 - 1);
 
for (i = 0; i < nr_sectors; i++) {
@@ -504,12 +503,12 @@ int 

[PATCH 0/6 RFC] utilize bio_clone_fast to clean up

2017-04-17 Thread Liu Bo
This attempts to use bio_clone_fast() in the places where we clone bio,
such as when bio got cloned for multiple disks and when bio got split
during dio submit.

One benefit is to simplify dio submit to avoid calling bio_add_page one by
one.

Another benefit is that, compared to bio_clone_bioset, bio_clone_fast is
faster because it copies the vector pointer directly.  Since bio_clone_fast
doesn't modify bi_vcnt, the extra work is to fix up our bi_vcnt usage -- we
now have to use bi_iter to iterate over the bvecs.

Liu Bo (6):
  Btrfs: use bio_clone_fast to clone our bio
  Btrfs: use bio_clone_bioset_partial to simplify DIO submit
  Btrfs: change how we iterate bios in endio
  Btrfs: record error if one block has failed to retry
  Btrfs: change check-integrity to use bvec_iter
  Btrfs: unify naming of btrfs_io_bio

 fs/btrfs/check-integrity.c |  27 +++---
 fs/btrfs/extent_io.c   |  18 +++-
 fs/btrfs/extent_io.h   |   1 +
 fs/btrfs/file-item.c   |  31 ---
 fs/btrfs/inode.c   | 203 -
 fs/btrfs/volumes.h |   1 +
 6 files changed, 138 insertions(+), 143 deletions(-)

-- 
2.5.5



[PATCH 5/6] Btrfs: change check-integrity to use bvec_iter

2017-04-17 Thread Liu Bo
Some check-integrity code depends on bio->bi_vcnt; this changes it to use
bio segments because some bios passed in here may not have a reliable
bi_vcnt.

Signed-off-by: Liu Bo 
---
 fs/btrfs/check-integrity.c | 27 +++
 1 file changed, 15 insertions(+), 12 deletions(-)

diff --git a/fs/btrfs/check-integrity.c b/fs/btrfs/check-integrity.c
index ab14c2e..8e7ce48 100644
--- a/fs/btrfs/check-integrity.c
+++ b/fs/btrfs/check-integrity.c
@@ -2822,44 +2822,47 @@ static void __btrfsic_submit_bio(struct bio *bio)
dev_state = btrfsic_dev_state_lookup(bio->bi_bdev);
if (NULL != dev_state &&
(bio_op(bio) == REQ_OP_WRITE) && bio_has_data(bio)) {
-   unsigned int i;
+   unsigned int i = 0;
u64 dev_bytenr;
u64 cur_bytenr;
-   struct bio_vec *bvec;
+   struct bio_vec bvec;
+   struct bvec_iter iter;
int bio_is_patched;
char **mapped_datav;
+   int segs = bio_segments(bio);
 
dev_bytenr = 512 * bio->bi_iter.bi_sector;
bio_is_patched = 0;
if (dev_state->state->print_mask &
BTRFSIC_PRINT_MASK_SUBMIT_BIO_BH)
pr_info("submit_bio(rw=%d,0x%x, bi_vcnt=%u, 
bi_sector=%llu (bytenr %llu), bi_bdev=%p)\n",
-  bio_op(bio), bio->bi_opf, bio->bi_vcnt,
+  bio_op(bio), bio->bi_opf, segs,
   (unsigned long long)bio->bi_iter.bi_sector,
   dev_bytenr, bio->bi_bdev);
 
-   mapped_datav = kmalloc_array(bio->bi_vcnt,
+   mapped_datav = kmalloc_array(segs,
 sizeof(*mapped_datav), GFP_NOFS);
if (!mapped_datav)
goto leave;
cur_bytenr = dev_bytenr;
 
-   bio_for_each_segment_all(bvec, bio, i) {
-   BUG_ON(bvec->bv_len != PAGE_SIZE);
-   mapped_datav[i] = kmap(bvec->bv_page);
+   bio_for_each_segment(bvec, bio, iter) {
+   BUG_ON(bvec.bv_len != PAGE_SIZE);
+   mapped_datav[i] = kmap(bvec.bv_page);
+   i++;
 
if (dev_state->state->print_mask &
BTRFSIC_PRINT_MASK_SUBMIT_BIO_BH_VERBOSE)
pr_info("#%u: bytenr=%llu, len=%u, offset=%u\n",
-  i, cur_bytenr, bvec->bv_len, 
bvec->bv_offset);
-   cur_bytenr += bvec->bv_len;
+  i, cur_bytenr, bvec.bv_len, 
bvec.bv_offset);
+   cur_bytenr += bvec.bv_len;
}
btrfsic_process_written_block(dev_state, dev_bytenr,
- mapped_datav, bio->bi_vcnt,
+ mapped_datav, segs,
  bio, &bio_is_patched,
  NULL, bio->bi_opf);
-   bio_for_each_segment_all(bvec, bio, i)
-   kunmap(bvec->bv_page);
+   bio_for_each_segment(bvec, bio, iter)
+   kunmap(bvec.bv_page);
kfree(mapped_datav);
} else if (NULL != dev_state && (bio->bi_opf & REQ_PREFLUSH)) {
if (dev_state->state->print_mask &
-- 
2.5.5



[PATCH 4/6] Btrfs: record error if one block has failed to retry

2017-04-17 Thread Liu Bo
In the nocsum case of dio read endio, we currently return immediately if an
error is returned while repairing, which leaves the remaining blocks
unrepaired.  This behavior is different from how buffered read endio works
in the same case.  This changes it to only record the error and go on
repairing the remaining blocks.

Signed-off-by: Liu Bo 
---
 fs/btrfs/inode.c | 10 +++---
 1 file changed, 7 insertions(+), 3 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index fca2f1f..cc46d21 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -7942,6 +7942,7 @@ static int __btrfs_correct_data_nocsum(struct inode 
*inode,
u32 sectorsize;
int nr_sectors;
int ret;
+   int err;
 
fs_info = BTRFS_I(inode)->root->fs_info;
sectorsize = fs_info->sectorsize;
@@ -7962,8 +7963,10 @@ static int __btrfs_correct_data_nocsum(struct inode 
*inode,
pgoff, start, start + sectorsize - 1,
io_bio->mirror_num,
btrfs_retry_endio_nocsum, );
-   if (ret)
-   return ret;
+   if (ret) {
+   err = ret;
+   goto next;
+   }
 
wait_for_completion();
 
@@ -7972,6 +7975,7 @@ static int __btrfs_correct_data_nocsum(struct inode 
*inode,
goto next_block_or_try_again;
}
 
+next:
start += sectorsize;
 
if (nr_sectors--) {
@@ -7980,7 +7984,7 @@ static int __btrfs_correct_data_nocsum(struct inode 
*inode,
}
}
 
-   return 0;
+   return err;
 }
 
 static void btrfs_retry_endio(struct bio *bio)
-- 
2.5.5



[PATCH 6/6] Btrfs: unify naming of btrfs_io_bio

2017-04-17 Thread Liu Bo
All dio endio functions use io_bio for struct btrfs_io_bio; this makes
btrfs_submit_direct follow the same convention.

Signed-off-by: Liu Bo 
---
 fs/btrfs/inode.c | 38 +++---
 1 file changed, 19 insertions(+), 19 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index cc46d21..73e7a44 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -8424,16 +8424,16 @@ static void btrfs_submit_direct(struct bio *dio_bio, 
struct inode *inode,
loff_t file_offset)
 {
struct btrfs_dio_private *dip = NULL;
-   struct bio *io_bio = NULL;
-   struct btrfs_io_bio *btrfs_bio;
+   struct bio *bio = NULL;
+   struct btrfs_io_bio *io_bio;
int skip_sum;
bool write = (bio_op(dio_bio) == REQ_OP_WRITE);
int ret = 0;
 
skip_sum = BTRFS_I(inode)->flags & BTRFS_INODE_NODATASUM;
 
-   io_bio = btrfs_bio_clone(dio_bio, GFP_NOFS);
-   if (!io_bio) {
+   bio = btrfs_bio_clone(dio_bio, GFP_NOFS);
+   if (!bio) {
ret = -ENOMEM;
goto free_ordered;
}
@@ -8449,17 +8449,17 @@ static void btrfs_submit_direct(struct bio *dio_bio, 
struct inode *inode,
dip->logical_offset = file_offset;
dip->bytes = dio_bio->bi_iter.bi_size;
dip->disk_bytenr = (u64)dio_bio->bi_iter.bi_sector << 9;
-   io_bio->bi_private = dip;
-   dip->orig_bio = io_bio;
+   bio->bi_private = dip;
+   dip->orig_bio = bio;
dip->dio_bio = dio_bio;
	atomic_set(&dip->pending_bios, 0);
-   btrfs_bio = btrfs_io_bio(io_bio);
-   btrfs_bio->logical = file_offset;
+   io_bio = btrfs_io_bio(bio);
+   io_bio->logical = file_offset;
 
if (write) {
-   io_bio->bi_end_io = btrfs_endio_direct_write;
+   bio->bi_end_io = btrfs_endio_direct_write;
} else {
-   io_bio->bi_end_io = btrfs_endio_direct_read;
+   bio->bi_end_io = btrfs_endio_direct_read;
dip->subio_endio = btrfs_subio_endio_read;
}
 
@@ -8482,8 +8482,8 @@ static void btrfs_submit_direct(struct bio *dio_bio, 
struct inode *inode,
if (!ret)
return;
 
-   if (btrfs_bio->end_io)
-   btrfs_bio->end_io(btrfs_bio, ret);
+   if (io_bio->end_io)
+   io_bio->end_io(io_bio, ret);
 
 free_ordered:
/*
@@ -8495,16 +8495,16 @@ static void btrfs_submit_direct(struct bio *dio_bio, 
struct inode *inode,
 * same as btrfs_endio_direct_[write|read] because we can't call these
 * callbacks - they require an allocated dip and a clone of dio_bio.
 */
-   if (io_bio && dip) {
-   io_bio->bi_error = -EIO;
-   bio_endio(io_bio);
+   if (bio && dip) {
+   bio->bi_error = -EIO;
+   bio_endio(bio);
/*
-* The end io callbacks free our dip, do the final put on io_bio
+* The end io callbacks free our dip, do the final put on bio
 * and all the cleanup and final put for dio_bio (through
 * dio_end_io()).
 */
dip = NULL;
-   io_bio = NULL;
+   bio = NULL;
} else {
if (write)
btrfs_endio_direct_write_update_ordered(inode,
@@ -8522,8 +8522,8 @@ static void btrfs_submit_direct(struct bio *dio_bio, 
struct inode *inode,
 */
dio_end_io(dio_bio, ret);
}
-   if (io_bio)
-   bio_put(io_bio);
+   if (bio)
+   bio_put(bio);
kfree(dip);
 }
 
-- 
2.5.5



[PATCH 2/6] Btrfs: use bio_clone_bioset_partial to simplify DIO submit

2017-04-17 Thread Liu Bo
Currently, when mapping a bio to limit it to a single stripe length, we
split the bio by adding pages to it one by one, but since we never modify
the bio's vector afterwards, we can use bio_clone_fast to reuse the
original bio vector directly.

Signed-off-by: Liu Bo 
---
 fs/btrfs/extent_io.c |  15 +++
 fs/btrfs/extent_io.h |   1 +
 fs/btrfs/inode.c | 122 +++
 3 files changed, 62 insertions(+), 76 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 0d4aea4..1b7156c 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -2726,6 +2726,21 @@ struct bio *btrfs_io_bio_alloc(gfp_t gfp_mask, unsigned 
int nr_iovecs)
return bio;
 }
 
+struct bio *btrfs_bio_clone_partial(struct bio *orig, gfp_t gfp_mask, int 
offset, int size)
+{
+   struct bio *bio;
+
+   bio = bio_clone_fast(orig, gfp_mask, btrfs_bioset);
+   if (bio) {
+   struct btrfs_io_bio *btrfs_bio = btrfs_io_bio(bio);
+   btrfs_bio->csum = NULL;
+   btrfs_bio->csum_allocated = NULL;
+   btrfs_bio->end_io = NULL;
+
+   bio_trim(bio, (offset >> 9), (size >> 9));
+   }
+   return bio;
+}
 
 static int __must_check submit_one_bio(struct bio *bio, int mirror_num,
   unsigned long bio_flags)
diff --git a/fs/btrfs/extent_io.h b/fs/btrfs/extent_io.h
index 3e4fad4..3b2bc88 100644
--- a/fs/btrfs/extent_io.h
+++ b/fs/btrfs/extent_io.h
@@ -460,6 +460,7 @@ btrfs_bio_alloc(struct block_device *bdev, u64 
first_sector, int nr_vecs,
gfp_t gfp_flags);
 struct bio *btrfs_io_bio_alloc(gfp_t gfp_mask, unsigned int nr_iovecs);
 struct bio *btrfs_bio_clone(struct bio *bio, gfp_t gfp_mask);
+struct bio *btrfs_bio_clone_partial(struct bio *orig, gfp_t gfp_mask, int 
offset, int size);
 
 struct btrfs_fs_info;
 struct btrfs_inode;
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index a18510b..6215720 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -8230,16 +8230,6 @@ static void btrfs_end_dio_bio(struct bio *bio)
bio_put(bio);
 }
 
-static struct bio *btrfs_dio_bio_alloc(struct block_device *bdev,
-  u64 first_sector, gfp_t gfp_flags)
-{
-   struct bio *bio;
-   bio = btrfs_bio_alloc(bdev, first_sector, BIO_MAX_PAGES, gfp_flags);
-   if (bio)
-   bio_associate_current(bio);
-   return bio;
-}
-
 static inline int btrfs_lookup_and_bind_dio_csum(struct inode *inode,
 struct btrfs_dio_private *dip,
 struct bio *bio,
@@ -8329,24 +8319,22 @@ static int btrfs_submit_direct_hook(struct 
btrfs_dio_private *dip,
struct btrfs_root *root = BTRFS_I(inode)->root;
struct bio *bio;
struct bio *orig_bio = dip->orig_bio;
-   struct bio_vec *bvec;
u64 start_sector = orig_bio->bi_iter.bi_sector;
u64 file_offset = dip->logical_offset;
-   u64 submit_len = 0;
u64 map_length;
-   u32 blocksize = fs_info->sectorsize;
int async_submit = 0;
-   int nr_sectors;
+   int submit_len;
+   int clone_offset = 0;
+   int clone_len;
int ret;
-   int i, j;
 
-   map_length = orig_bio->bi_iter.bi_size;
+   submit_len = map_length = orig_bio->bi_iter.bi_size;
ret = btrfs_map_block(fs_info, btrfs_op(orig_bio), start_sector << 9,
  &map_length, NULL, 0);
if (ret)
return -EIO;
 
-   if (map_length >= orig_bio->bi_iter.bi_size) {
+   if (map_length >= submit_len) {
bio = orig_bio;
dip->flags |= BTRFS_DIO_ORIG_BIO_SUBMITTED;
goto submit;
@@ -8358,70 +8346,52 @@ static int btrfs_submit_direct_hook(struct 
btrfs_dio_private *dip,
else
async_submit = 1;
 
-   bio = btrfs_dio_bio_alloc(orig_bio->bi_bdev, start_sector, GFP_NOFS);
-   if (!bio)
-   return -ENOMEM;
-
-   bio->bi_opf = orig_bio->bi_opf;
-   bio->bi_private = dip;
-   bio->bi_end_io = btrfs_end_dio_bio;
-   btrfs_io_bio(bio)->logical = file_offset;
+   /* bio split */
+   atomic_inc(&dip->pending_bios);
+   while (submit_len > 0) {
+   /* map_length < submit_len, it's a int */
+   clone_len = min(submit_len, (int)map_length);
+   bio = btrfs_bio_clone_partial(orig_bio, GFP_NOFS, clone_offset, 
clone_len);
+   if (!bio)
+   goto out_err;
+   /* the above clone call also clone blkcg of orig_bio */
+
+   bio->bi_private = dip;
+   bio->bi_end_io = btrfs_end_dio_bio;
+   btrfs_io_bio(bio)->logical = file_offset;
+
+   ASSERT(submit_len >= clone_len);
+   submit_len -= clone_len;
+   if (submit_len == 0)
+  

[PATCH 1/6] Btrfs: use bio_clone_fast to clone our bio

2017-04-17 Thread Liu Bo
For raid1 and raid10, we clone the original bio to the bios which are then
sent to different disks.

Signed-off-by: Liu Bo 
---
 fs/btrfs/extent_io.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 27fdb25..0d4aea4 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -2700,7 +2700,7 @@ struct bio *btrfs_bio_clone(struct bio *bio, gfp_t 
gfp_mask)
struct btrfs_io_bio *btrfs_bio;
struct bio *new;
 
-   new = bio_clone_bioset(bio, gfp_mask, btrfs_bioset);
+   new = bio_clone_fast(bio, gfp_mask, btrfs_bioset);
if (new) {
btrfs_bio = btrfs_io_bio(new);
btrfs_bio->csum = NULL;
-- 
2.5.5



Re: Btrfs/SSD

2017-04-17 Thread Hans van Kranenburg
On 04/17/2017 09:22 PM, Imran Geriskovan wrote:
> [...]
> 
> Going over the thread, the following questions come to my mind:
> 
> - What exactly does the btrfs ssd option do relative to plain mode?

There's quite an amount of information in the very recent threads:
- "About free space fragmentation, metadata write amplification and (no)ssd"
- "BTRFS as a GlusterFS storage back-end, and what I've learned from
using it as such."
- "btrfs filesystem keeps allocating new chunks for no apparent reason"
- ... and a few more

I suspect there will be some "summary" mails at some point, but for now,
I'd recommend crawling through these threads first.

And now for your instant satisfaction, a short visual guide to the
difference, which shows actual btrfs behaviour instead of our guesswork
around it (taken from the second mail thread just mentioned):

-o ssd:

https://syrinx.knorrie.org/~knorrie/btrfs/keep/2017-01-19-noautodefrag-ichiban.mp4

-o nossd:

https://syrinx.knorrie.org/~knorrie/btrfs/keep/2017-04-08-ichiban-walk-nossd.mp4
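
In case it helps while digging through those threads: a quick way to see
which allocator behaviour a mounted filesystem is currently using, and to
flip it for an experiment (a sketch only, substitute your own mount point):

  $ findmnt -no OPTIONS /mnt/btrfs | tr ',' '\n' | grep -E '^(ssd|ssd_spread|nossd)$'
  $ sudo mount -o remount,nossd /mnt/btrfs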

-- 
Hans van Kranenburg


Re: Btrfs/SSD

2017-04-17 Thread Chris Murphy
On Mon, Apr 17, 2017 at 1:26 PM, Austin S. Hemmelgarn
 wrote:
> On 2017-04-17 14:34, Chris Murphy wrote:

>> Nope. The first paragraph applies to NVMe machine with ssd mount
>> option. Few fragments.
>>
>> The second paragraph applies to SD Card machine with ssd_spread mount
>> option. Many fragments.
>
> Ah, apologies for my misunderstanding.
>>
>>
>> These are different versions of systemd-journald so I can't completely
>> rule out a difference in write behavior.
>
> There have only been a couple of changes in the write patterns that I know
> of, but I would double check that the values for Seal and Compress in the
> journald.conf file are the same, as I know for a fact that changing those
> does change the write patterns (not much, but they do change).

Same, unchanged defaults on both systems.

#Storage=auto
#Compress=yes
#Seal=yes
#SplitMode=uid
#SyncIntervalSec=5m
#RateLimitIntervalSec=30s
#RateLimitBurst=1000


The SyncIntervalSec default is curious. 5 minutes? Umm, I'm seeing nearly
constant hits every 2-5 seconds on the journal file, using filefrag.
I'm sure there's a better way to trace a single file being
read/written to than this, but...
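
One rough way to watch it, beyond re-running filefrag by hand -- a sketch,
assuming inotify-tools is installed and with the journal path adjusted to
your own machine-id directory:

  $ sudo inotifywait -m -e modify -e close_write \
        /var/log/journal/<machine-id>/system.journal

  # or, cruder, poll the extent count every couple of seconds
  $ watch -n 2 filefrag /var/log/journal/<machine-id>/system.journal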


 It's almost like we need these things to not fsync at all, and just
 rely on the filesystem commit time...
>>>
>>>
>>> Essentially yes, but that causes all kinds of other problems.
>>
>>
>> Drat.
>>
> Admittedly most of the problems are use-case specific (you can't afford to
> lose transactions in a financial database  for example, so it functionally
> has to call fsync after each transaction), but most of it stems from the
> fact that BTRFS is doing a lot of the same stuff that much of the 'problem'
> software is doing itself internally.
>

Seems like the old way of doing things, and the staleness of the
internet, have colluded to create a lot of nervousness and misuse of
fsync. The very fact Btrfs needs a log tree to deal with fsync's in a
semi-sane way...
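
For what it's worth, the commit interval itself is tunable per mount, so
it's easy to at least observe behaviour with a longer window (a sketch, not
a recommendation -- applications that fsync will still push data out
earlier, and the fstab line is only an illustration with a placeholder UUID):

  $ sudo mount -o remount,commit=120 /
  # or persistently via /etc/fstab:
  # UUID=<fs-uuid>  /  btrfs  defaults,commit=120  0 0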


-- 
Chris Murphy


Re: Btrfs/SSD

2017-04-17 Thread Austin S. Hemmelgarn

On 2017-04-17 14:34, Chris Murphy wrote:

On Mon, Apr 17, 2017 at 11:13 AM, Austin S. Hemmelgarn
 wrote:


What is a high end SSD these days? Built-in NVMe?


One with a good FTL in the firmware.  At minimum, the good Samsung EVO
drives, the high quality Intel ones, and the Crucial MX series, but probably
some others.  My choice of words here probably wasn't the best though.


It's a confusing market that sorta defies figuring out what we've got.

I have a Samsung EVO SATA SSD in one laptop, but then I have a Samsung
EVO+ SD Card in an Intel NUC. They use that same EVO branding on an
$11 SD Card.

And then there's the Samsung Electronics Co Ltd NVMe SSD Controller
SM951/PM951 in another laptop.
What makes it even more confusing is that other than Samsung (who _only_ 
use their own flash and controllers), manufacturer does not map to 
controller choice consistently, and even two drives with the same 
controller may have different firmware (and thus different degrees of 
reliability, those OCZ drives that were such crap at data retention were 
the result of a firmware option that the controller manufacturer pretty 
much told them not to use on production devices).




So long as this file is not reflinked or snapshot, filefrag shows a
pile of mostly 4096 byte blocks, thousands. But as they're pretty much
all continuous, the file fragmentation (extent count) is usually never
higher than 12. It meanders between 1 and 12 extents for its life.

Except on the system using ssd_spread mount option. That one has a
journal file that is +C, is not being snapshot, but has over 3000
extents per filefrag and btrfs-progs/debugfs. Really weird.


Given how the 'ssd' mount option behaves and the frequency that most systemd
instances write to their journals, that's actually reasonably expected.  We
look for big chunks of free space to write into and then align to 2M
regardless of the actual size of the write, which in turn means that files
like the systemd journal which see lots of small (relatively speaking)
writes will have way more extents than they should until you defragment
them.


Nope. The first paragraph applies to NVMe machine with ssd mount
option. Few fragments.

The second paragraph applies to SD Card machine with ssd_spread mount
option. Many fragments.

Ah, apologies for my misunderstanding.


These are different versions of systemd-journald so I can't completely
rule out a difference in write behavior.
There have only been a couple of changes in the write patterns that I 
know of, but I would double check that the values for Seal and Compress 
in the journald.conf file are the same, as I know for a fact that 
changing those does change the write patterns (not much, but they do 
change).




Now, systemd aside, there are databases that behave this same way
where there's a small section contantly being overwritten, and one or
more sections that grow the data base file from within and at the end.
If this is made cow, the file will absolutely fragment a ton. And
especially if the changes are mostly 4KiB block sizes that then are
fsync'd.

It's almost like we need these things to not fsync at all, and just
rely on the filesystem commit time...


Essentially yes, but that causes all kinds of other problems.


Drat.

Admittedly most of the problems are use-case specific (you can't afford 
to lose transactions in a financial database  for example, so it 
functionally has to call fsync after each transaction), but most of it 
stems from the fact that BTRFS is doing a lot of the same stuff that 
much of the 'problem' software is doing itself internally.




Re: Btrfs/SSD

2017-04-17 Thread Imran Geriskovan
On 4/17/17, Roman Mamedov  wrote:
> "Austin S. Hemmelgarn"  wrote:

>> * Compression should help performance and device lifetime most of the
>> time, unless your CPU is fully utilized on a regular basis (in which
>> case it will hurt performance, but still improve device lifetimes).

> Days are long gone since the end user had to ever think about device lifetimes
> with SSDs. Refer to endurance studies such as
> It has been demonstrated that all SSDs on the market tend to overshoot even
> their rated TBW by several times, as a result it will take any user literally
> dozens of years to wear out the flash no matter which filesystem or what
> settings used. And most certainly it's not worth it changing anything
> significant in your workflow (such as enabling compression if it's
> otherwise inconvenient or not needed) just to save the SSD lifetime.

Going over the thread, the following questions come to my mind:

- What exactly does the btrfs ssd option do relative to plain mode?

- Most (all?) SSDs employ wear leveling, don't they? That is, they are
constantly remapping their blocks under the hood. So isn't it
meaningless to speak of some kind of block forging/fragmentation/etc.
effect of any writing pattern?

- If so, doesn't it mean that there is no better ssd usage strategy
than minimizing the total bytes written? That is, whatever we do,
if it contributes to this it is good, otherwise bad. Are all other things
beyond any user control? Is there a recommended setting?

- How about "data retention" experiences? It is known that
new ssds can hold data safely for a longer period. As they age
that margin gets shorter. As an extreme case, if I write to a new
ssd and shelve it, can I get my data back after 5 years?
How about a file written 5 years ago and never touched again, although
the rest of the ssd is in active use during that period?

- Yes, maybe lifetimes are getting irrelevant. However, TBW still has
a direct relation to data retention capability.
Knowing that writing more data to an ssd can reduce the
"lifetime of your data" is something strange.

- But someone can come and say: Hey, don't worry about
"data retention years", because your ssd will already be dead
before data retention becomes a problem for you... Which is
relieving.. :)) Anyway, what are your opinions?


Re: Btrfs/SSD

2017-04-17 Thread Chris Murphy
On Mon, Apr 17, 2017 at 11:13 AM, Austin S. Hemmelgarn
 wrote:

>> What is a high end SSD these days? Built-in NVMe?
>
> One with a good FTL in the firmware.  At minimum, the good Samsung EVO
> drives, the high quality Intel ones, and the Crucial MX series, but probably
> some others.  My choice of words here probably wasn't the best though.

It's a confusing market that sorta defies figuring out what we've got.

I have a Samsung EVO SATA SSD in one laptop, but then I have a Samsung
EVO+ SD Card in an Intel NUC. They use that same EVO branding on an
$11 SD Card.

And then there's the Samsung Electronics Co Ltd NVMe SSD Controller
SM951/PM951 in another laptop.


>> So long as this file is not reflinked or snapshot, filefrag shows a
>> pile of mostly 4096 byte blocks, thousands. But as they're pretty much
>> all continuous, the file fragmentation (extent count) is usually never
>> higher than 12. It meanders between 1 and 12 extents for its life.
>>
>> Except on the system using ssd_spread mount option. That one has a
>> journal file that is +C, is not being snapshot, but has over 3000
>> extents per filefrag and btrfs-progs/debugfs. Really weird.
>
> Given how the 'ssd' mount option behaves and the frequency that most systemd
> instances write to their journals, that's actually reasonably expected.  We
> look for big chunks of free space to write into and then align to 2M
> regardless of the actual size of the write, which in turn means that files
> like the systemd journal which see lots of small (relatively speaking)
> writes will have way more extents than they should until you defragment
> them.

Nope. The first paragraph applies to NVMe machine with ssd mount
option. Few fragments.

The second paragraph applies to SD Card machine with ssd_spread mount
option. Many fragments.

These are different versions of systemd-journald so I can't completely
rule out a difference in write behavior.


>> Now, systemd aside, there are databases that behave this same way
>> where there's a small section contantly being overwritten, and one or
>> more sections that grow the data base file from within and at the end.
>> If this is made cow, the file will absolutely fragment a ton. And
>> especially if the changes are mostly 4KiB block sizes that then are
>> fsync'd.
>>
>> It's almost like we need these things to not fsync at all, and just
>> rely on the filesystem commit time...
>
> Essentially yes, but that causes all kinds of other problems.

Drat.

-- 
Chris Murphy


Re: Btrfs/SSD

2017-04-17 Thread Roman Mamedov
On Mon, 17 Apr 2017 07:53:04 -0400
"Austin S. Hemmelgarn"  wrote:

> General info (not BTRFS specific):
> * Based on SMART attributes and other factors, current life expectancy 
> for light usage (normal desktop usage) appears to be somewhere around 
> 8-12 years depending on specifics of usage (assuming the same workload, 
> F2FS is at the very top of the range, BTRFS and NILFS2 are on the upper 
> end, XFS is roughly in the middle, ext4 and NTFS are on the low end 
> (tested using Windows 7's NTFS driver), and FAT32 is an outlier at the 
> bottom of the barrel).

Life expectancy for an SSD is defined not in years, but in TBW (terabytes
written), and AFAICT that's not "from host", but "to flash" (some SSDs will
show you both values in two separate SMART attributes out of the box, on some
it can be unlocked). Filesystems may come into play only through the amount of
write amplification they cause (how much greater "to flash" is than "from host").
Do you have any test data to show that FSes are ranked in that order by the WA
they cause, or is it all about "general feel" and how they are branded (F2FS
says so, so it must be the best)?

> * Queued DISCARD support is still missing in most consumer SATA SSD's, 
> which in turn makes the trade-off on those between performance and 
> lifetime much sharper.

My choice was to make a script to run from crontab, using "fstrim" on all
mounted SSDs nightly, and aside from that all FSes are mounted with
"nodiscard". Best of both worlds, and no interference with actual IO
operation.
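
For reference, a minimal sketch of that kind of setup (mount points and
schedule are only examples, not my actual configuration):

  #!/bin/sh
  # /etc/cron.daily/fstrim-ssd -- trim every mounted SSD filesystem once
  # a day, instead of mounting with -o discard
  for mnt in / /home /data; do
      fstrim -v "$mnt"
  done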

> * Modern (2015 and newer) SSD's seem to have better handling in the FTL 
> for the journaling behavior of filesystems like ext4 and XFS.  I'm not 
> sure if this is actually a result of the FTL being better, or some 
> change in the hardware.

Again, what makes you think this -- did you observe write amplification
readings that are now demonstrably lower than on "2014 and older" SSDs?
So, by how much, and which models did you compare?

> * In my personal experience, Intel, Samsung, and Crucial appear to be 
> the best name brands (in relative order of quality).  I have personally 
> had bad experiences with SanDisk and Kingston SSD's, but I don't have 
> anything beyond circumstantial evidence indicating that it was anything 
> but bad luck on both counts.

Why not think in terms not of "name brands" but platforms, i.e. a controller
model + flash combination. For instance Intel have been using some other
companies' controllers in their SSDs. Kingston uses tons of various
controllers (Sandforce/Phison/Marvell/more?) depending on the model and range.

> * Files with NOCOW and filesystems with 'nodatacow' set will both hurt 
> performance for BTRFS on SSD's, and appear to reduce the lifetime of the 
> SSD.

"Appear to"? Just... what. So how many SSDs did you have fail under nocow?

Or maybe we can get serious in a technical discussion? Did you by any chance
mean that they cause more writes to the SSD and more "to flash" writes
(resulting in a higher WA)? If so, then by how much, and what was your test
scenario comparing the same usage with and without nocow?

> * Compression should help performance and device lifetime most of the 
> time, unless your CPU is fully utilized on a regular basis (in which 
> case it will hurt performance, but still improve device lifetimes).

Days are long gone since the end user had to ever think about device lifetimes
with SSDs. Refer to endurance studies such as 
http://techreport.com/review/27909/the-ssd-endurance-experiment-theyre-all-dead
http://ssdendurancetest.com/
https://3dnews.ru/938764/
It has been demonstrated that all SSDs on the market tend to overshoot even
their rated TBW by several times, as a result it will take any user literally
dozens of years to wear out the flash no matter which filesystem or what
settings used. And most certainly it's not worth it changing anything
significant in your workflow (such as enabling compression if it's otherwise
inconvenient or not needed) just to save the SSD lifetime.

On Mon, 17 Apr 2017 13:13:39 -0400
"Austin S. Hemmelgarn"  wrote:

> > What is a high end SSD these days? Built-in NVMe?
> One with a good FTL in the firmware.  At minimum, the good Samsung EVO 
> drives, the high quality Intel ones

As opposed to bad Samsung EVO drives and low-quality Intel ones?

> and the Crucial MX series, but 
> probably some others.  My choice of words here probably wasn't the best 
> though.

Again, which controller? Crucial does not manufacture SSD controllers on their
own, they just pack and brand stuff manufactured by someone else. So if you
meant Marvell based SSDs, then that's many brands, not just Crucial.

> For a normal filesystem or BTRFS with nodatacow or NOCOW, the block gets 
> rewritten in-place.  This means that cheap FTL's will rewrite that erase 
> block in-place (which won't hurt performance but will impact device 
> lifetime), and good ones will rewrite into a free 

Re: Remounting read-write after error is not allowed

2017-04-17 Thread Liu Bo
On Mon, Apr 17, 2017 at 02:00:45PM -0400, Alexandru Guzu wrote:
> Not sure if anyone is looking into that segfault, but I have an update.
> I disconnected the USB drive for a while and today I reconnected it
> and it auto-mounted with no issue.
> 
> What is interesting is that the drive letter changed to what it was
> before when it was working.
> Remember that in my first email, the drive encounters an error as
> /dev/sdg2, and it reconnected as /dev/sdh2
> 
> I was getting errors when it was assigned /dev/sdh2, but now it mounts
> just fine when it is assigned /dev/sdg2
> 
> sudo btrfs fi show
> Label: 'USB-data'  uuid: 227fbb6c-ae72-4b81-8e65-942a0ddc6ef7
> Total devices 1 FS bytes used 230.72GiB
> devid1 size 447.13GiB used 238.04GiB path /dev/sdg2
> 
> 
> is BTRFS really so sensitive to drive letter changes?
> The USB drive is automounted as such:
> 
> /dev/sdg2 on /media/alex/USB-data1 type btrfs
> (rw,nosuid,nodev,relatime,space_cache,subvolid=5,subvol=/,uhelper=udisks2)
>

I don't think it's because of the changed drive letter; it was hitting the
BUG_ON in the kernel code btrfs_search_forward(), which showed that somehow your
btrfs had failed to read the metadata from the drive (either because the
underlying drive was not available for reading or because the metadata checksum
check was failing).

That BUG_ON has been fixed in a later version, so you may want to try the latest
btrfs.


Thanks,

-liubo

> > Thanks for the reply.
> >
> > I mounted it ro:
> > $ sudo btrfs fi show /mnt
> > Segmentation fault (core dumped)
> >
> > dmesg says:
> > ...
> > kernel BUG at /build/linux-wXdoVv/linux-4.4.0/fs/btrfs/ctree.c:5205!
> > ...
> > RIP: 0010:[]  []
> > btrfs_search_forward+0x268/0x350 [btrfs]
> > ...
> > Call Trace:
> > [] search_ioctl+0xf2/0x1c0 [btrfs]
> > [] ? zone_statistics+0x7c/0xa0
> > [] btrfs_ioctl_tree_search+0x72/0xc0 [btrfs]
> > [] btrfs_ioctl+0x455/0x28b0 [btrfs]
> > [] ? mem_cgroup_try_charge+0x6b/0x1e0
> > [] ? handle_mm_fault+0xcad/0x1820
> > [] do_vfs_ioctl+0x29f/0x490
> > [] ? __do_page_fault+0x1b4/0x400
> > [] SyS_ioctl+0x79/0x90
> > [] entry_SYSCALL_64_fastpath+0x16/0x71
> > ...
> >
> > full dmesg output is at:
> > pastebin.com/bhsEJiJN
> >
> > $ sudo btrfs fi df /mnt
> > Data, single: total=236.01GiB, used=230.35GiB
> > System, DUP: total=8.00MiB, used=48.00KiB
> > System, single: total=4.00MiB, used=0.00B
> > Metadata, DUP: total=1.00GiB, used=349.11MiB
> > Metadata, single: total=8.00MiB, used=0.00B
> > GlobalReserve, single: total=128.00MiB, used=0.00B
> >
> > I downloaded and compiled the current btrfs v4.10.1-9-gbd0ab27.
> >
> > sudo ./btrfs check /dev/sdh2
> > Checking filesystem on /dev/sdh2
> > UUID: 227fbb6c-ae72-4b81-8e65-942a0ddc6ef7
> > checking extents
> > checking free space cache
> > checking fs roots
> > checking csums
> > checking root refs
> > found 247628603392 bytes used, no error found
> > total csum bytes: 241513756
> > total tree bytes: 366084096
> > total fs tree bytes: 90259456
> > total extent tree bytes: 10010624
> > btree space waste bytes: 35406538
> > file data blocks allocated: 23837459185664
> >  referenced 252963553280
> >
> > and with lowmem mode (again no errors found):
> > ./btrfs check /dev/sdh2 --mode=lowmem
> > Checking filesystem on /dev/sdh2
> > UUID: 227fbb6c-ae72-4b81-8e65-942a0ddc6ef7
> > checking extents
> > checking free space cache
> > checking fs roots
> > checking csums
> > checking root refs
> > found 247738298368 bytes used, no error found
> > total csum bytes: 241513756
> > total tree bytes: 366084096
> > total fs tree bytes: 90259456
> > total extent tree bytes: 10010624
> > btree space waste bytes: 35406538
> > file data blocks allocated: 23837459185664
> >  referenced 252963553280
> >
> > maybe there is some hint in that segmentation fault?
> > Also, I compiled from
> > git.kernel.org/pub/scm/linux/kernel/git/kdave/btrfs-progs.git but I
> > got version 4.10.1 instead of 4.10.2
> >
> > Regards,
> >
> > On Fri, Apr 14, 2017 at 1:17 PM, Chris Murphy  
> > wrote:
> >> Can you ro mount it and:
> >>
> >> btrfs fi show /mnt
> >> btrfs fi df /mnt
> >>
> >> And then next update the btrfs-progs to something newer like 4.9.2 or
> >> 4.10.2 and then do another 'btrfs check' without repair. And then
> >> separately do it again with --mode=lowmem and post both sets of
> >> results?
> >>
> >>
> >> Chris Murphy


Re: compressing nocow files

2017-04-17 Thread Chris Murphy
On Mon, Apr 17, 2017 at 12:07 PM, Liu Bo  wrote:
> On Mon, Apr 17, 2017 at 11:36:17AM -0600, Chris Murphy wrote:
>> HI,
>>
>>
>> /dev/nvme0n1p8 on / type btrfs
>> (rw,relatime,seclabel,ssd,space_cache,subvolid=258,subvol=/root)
>>
>> I've got a test folder with +C set and then copied a test file into it.
>>
>> $ lsattr
>> C--
>> ./system@2547b430ecdc441c9cf82569eeb22065-0001-00054c3c31bec567.journal
>>
>> So now it's inherited +C. Next check fragments and compression.
>>
>> $ sudo ~/Applications/btrfs-debugfs -f
>> system@2547b430ecdc441c9cf82569eeb22065-0001-00054c3c31bec567.journal
>> (290662 0): ram 100663296 disk 17200840704 disk_size 100663296
>> file: 
>> system@2547b430ecdc441c9cf82569eeb22065-0001-00054c3c31bec567.journal
>> extents 1 disk size 100663296 logical size 100663296 ratio 1.00
>>
>>
>> Try to compress it.
>>
>> $ sudo btrfs fi defrag -c
>> system@2547b430ecdc441c9cf82569eeb22065-0001-00054c3c31bec567.journal
>>
>> Check again.
>>
>> [snip faux fragments]
>> file: 
>> system@2547b430ecdc441c9cf82569eeb22065-0001-00054c3c31bec567.journal
>> extents 768 disk size 21504000 logical size 100663296 ratio 4.68
>>
>> It's compressed! OK??
>>
>> OK delete that file, and add +c to the temp folder.
>>
>> $ lsattr
>> c---C-- ./temp
>>
>> Copy another test file into temp.
>>
>> $ lsattr
>> c---C--
>> ./system@2547b430ecdc441c9cf82569eeb22065-0002eb9f-00054d4ebb44b135.journal
>>
>> $ sudo ~/Applications/btrfs-debugfs -f
>> system@2547b430ecdc441c9cf82569eeb22065-0002eb9f-00054d4ebb44b135.journal
>> (290739 0): ram 83886080 disk 21764243456 disk_size 83886080
>> file: 
>> system@2547b430ecdc441c9cf82569eeb22065-0002eb9f-00054d4ebb44b135.journal
>> extents 1 disk size 83886080 logical size 83886080 ratio 1.00
>>
>> Not compressed. Hmmm. So somehow btrfs fi defrag -c will force
>> compression even on nocow files? I've also done this without -c
>> option, with a Btrfs mounted using compress option; and the file does
>> compress also. I thought nocow files were always no compress, but it
>> seems there's exceptions.
>>
>
> Good catch.
>
> Btrfs defragment depends on COW to do the job, thus nocow inodes are forced to
> COW when processing the defragged range, and compression also depends on COW, so
> btrfs filesystem defrag -c becomes an exception.
>
> Speaking of which, this exception seems to be not harmful other than confusing
> users.

No, in fact it might be a benefit. Recent versions of systemd-journald
defragment their journals, but those are highly compressible files with a
lot of slack space in them, so defragmenting on an SSD just increases write
amplification. If the defragment ioctl supports passing a compression
request, maybe that's an optimization systemd-journald can leverage
to get much smaller files.

-- 
Chris Murphy


Re: compressing nocow files

2017-04-17 Thread Liu Bo
On Mon, Apr 17, 2017 at 11:36:17AM -0600, Chris Murphy wrote:
> HI,
> 
> 
> /dev/nvme0n1p8 on / type btrfs
> (rw,relatime,seclabel,ssd,space_cache,subvolid=258,subvol=/root)
> 
> I've got a test folder with +C set and then copied a test file into it.
> 
> $ lsattr
> C--
> ./system@2547b430ecdc441c9cf82569eeb22065-0001-00054c3c31bec567.journal
> 
> So now it's inherited +C. Next check fragments and compression.
> 
> $ sudo ~/Applications/btrfs-debugfs -f
> system@2547b430ecdc441c9cf82569eeb22065-0001-00054c3c31bec567.journal
> (290662 0): ram 100663296 disk 17200840704 disk_size 100663296
> file: 
> system@2547b430ecdc441c9cf82569eeb22065-0001-00054c3c31bec567.journal
> extents 1 disk size 100663296 logical size 100663296 ratio 1.00
> 
> 
> Try to compress it.
> 
> $ sudo btrfs fi defrag -c
> system@2547b430ecdc441c9cf82569eeb22065-0001-00054c3c31bec567.journal
> 
> Check again.
> 
> [snip faux fragments]
> file: 
> system@2547b430ecdc441c9cf82569eeb22065-0001-00054c3c31bec567.journal
> extents 768 disk size 21504000 logical size 100663296 ratio 4.68
> 
> It's compressed! OK??
> 
> OK delete that file, and add +c to the temp folder.
> 
> $ lsattr
> c---C-- ./temp
> 
> Copy another test file into temp.
> 
> $ lsattr
> c---C--
> ./system@2547b430ecdc441c9cf82569eeb22065-0002eb9f-00054d4ebb44b135.journal
> 
> $ sudo ~/Applications/btrfs-debugfs -f
> system@2547b430ecdc441c9cf82569eeb22065-0002eb9f-00054d4ebb44b135.journal
> (290739 0): ram 83886080 disk 21764243456 disk_size 83886080
> file: 
> system@2547b430ecdc441c9cf82569eeb22065-0002eb9f-00054d4ebb44b135.journal
> extents 1 disk size 83886080 logical size 83886080 ratio 1.00
> 
> Not compressed. Hmmm. So somehow btrfs fi defrag -c will force
> compression even on nocow files? I've also done this without -c
> option, with a Btrfs mounted using compress option; and the file does
> compress also. I thought nocow files were always no compress, but it
> seems there's exceptions.
>

Good catch.

Btrfs defragment depends on COW to do the job, thus nocow inodes are forced to
COW when processing the defragged range, and compression also depends on COW, so
btrfs filesystem defrag -c becomes an exception.

Speaking of which, this exception seems to be not harmful other than confusing
users.

Thanks,

-liubo


Re: compressing nocow files

2017-04-17 Thread Austin S. Hemmelgarn

On 2017-04-17 13:36, Chris Murphy wrote:

HI,


/dev/nvme0n1p8 on / type btrfs
(rw,relatime,seclabel,ssd,space_cache,subvolid=258,subvol=/root)

I've got a test folder with +C set and then copied a test file into it.

$ lsattr
C--
./system@2547b430ecdc441c9cf82569eeb22065-0001-00054c3c31bec567.journal

So now it's inherited +C. Next check fragments and compression.

$ sudo ~/Applications/btrfs-debugfs -f
system@2547b430ecdc441c9cf82569eeb22065-0001-00054c3c31bec567.journal
(290662 0): ram 100663296 disk 17200840704 disk_size 100663296
file: 
system@2547b430ecdc441c9cf82569eeb22065-0001-00054c3c31bec567.journal
extents 1 disk size 100663296 logical size 100663296 ratio 1.00


Try to compress it.

$ sudo btrfs fi defrag -c
system@2547b430ecdc441c9cf82569eeb22065-0001-00054c3c31bec567.journal

Check again.

[snip faux fragments]
file: 
system@2547b430ecdc441c9cf82569eeb22065-0001-00054c3c31bec567.journal
extents 768 disk size 21504000 logical size 100663296 ratio 4.68

It's compressed! OK??

OK delete that file, and add +c to the temp folder.

$ lsattr
c---C-- ./temp

Copy another test file into temp.

$ lsattr
c---C--
./system@2547b430ecdc441c9cf82569eeb22065-0002eb9f-00054d4ebb44b135.journal

$ sudo ~/Applications/btrfs-debugfs -f
system@2547b430ecdc441c9cf82569eeb22065-0002eb9f-00054d4ebb44b135.journal
(290739 0): ram 83886080 disk 21764243456 disk_size 83886080
file: 
system@2547b430ecdc441c9cf82569eeb22065-0002eb9f-00054d4ebb44b135.journal
extents 1 disk size 83886080 logical size 83886080 ratio 1.00

Not compressed. Hmmm. So somehow btrfs fi defrag -c will force
compression even on nocow files? I've also done this without -c
option, with a Btrfs mounted using compress option; and the file does
compress also. I thought nocow files were always no compress, but it
seems there's exceptions.
This is odd behavior.  The fact that it's inconsistent is what worries 
me the most, though, not that it's possible to compress NOCOW files (the 
data loss window on a compressed write is a bit bigger than on an 
uncompressed one, but it still doesn't have the race conditions that 
checksums+NOCOW does).


I'm willing to bet the issue is in how the FS handles defragmentation, 
seeing as the only case here where the file was compressed was after you 
compressed it using the defrag command, which, IIRC, doesn't actually do 
any checks on the kernel side except making sure nothing has the file 
mapped as an executable.


--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Remounting read-write after error is not allowed

2017-04-17 Thread Alexandru Guzu
Not sure if anyone is looking into that segfault, but I have an update.
I disconnected the USB drive for a while and today I reconnected it
and it auto-mounted with no issue.

What is interesting is that the drive letter changed back to what it was
before, when it was working.
Remember that in my first email the drive encountered an error as
/dev/sdg2 and then reconnected as /dev/sdh2.

I was getting errors while it was assigned /dev/sdh2, but now it mounts
just fine when it is assigned /dev/sdg2.

sudo btrfs fi show
Label: 'USB-data'  uuid: 227fbb6c-ae72-4b81-8e65-942a0ddc6ef7
Total devices 1 FS bytes used 230.72GiB
devid1 size 447.13GiB used 238.04GiB path /dev/sdg2


Is BTRFS really so sensitive to drive letter changes?
The USB drive is automounted as follows:

/dev/sdg2 on /media/alex/USB-data1 type btrfs
(rw,nosuid,nodev,relatime,space_cache,subvolid=5,subvol=/,uhelper=udisks2)
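
For what it's worth, if the device-name change really were the trigger,
mounting by UUID should sidestep it entirely (an untested sketch, reusing
the UUID and mount point shown above):

$ sudo mount -U 227fbb6c-ae72-4b81-8e65-942a0ddc6ef7 /media/alex/USB-data1

or the equivalent /etc/fstab line:

UUID=227fbb6c-ae72-4b81-8e65-942a0ddc6ef7  /media/alex/USB-data1  btrfs  noauto,nofail  0  0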

Regards,

On Fri, Apr 14, 2017 at 3:51 PM, Alexandru Guzu  wrote:
> Thanks for the reply.
>
> I mounted it ro:
> $ sudo btrfs fi show /mnt
> Segmentation fault (core dumped)
>
> dmesg says:
> ...
> kernel BUG at /build/linux-wXdoVv/linux-4.4.0/fs/btrfs/ctree.c:5205!
> ...
> RIP: 0010:[]  []
> btrfs_search_forward+0x268/0x350 [btrfs]
> ...
> Call Trace:
> [] search_ioctl+0xf2/0x1c0 [btrfs]
> [] ? zone_statistics+0x7c/0xa0
> [] btrfs_ioctl_tree_search+0x72/0xc0 [btrfs]
> [] btrfs_ioctl+0x455/0x28b0 [btrfs]
> [] ? mem_cgroup_try_charge+0x6b/0x1e0
> [] ? handle_mm_fault+0xcad/0x1820
> [] do_vfs_ioctl+0x29f/0x490
> [] ? __do_page_fault+0x1b4/0x400
> [] SyS_ioctl+0x79/0x90
> [] entry_SYSCALL_64_fastpath+0x16/0x71
> ...
>
> full dmesg output is at:
> pastebin.com/bhsEJiJN
>
> $ sudo btrfs fi df /mnt
> Data, single: total=236.01GiB, used=230.35GiB
> System, DUP: total=8.00MiB, used=48.00KiB
> System, single: total=4.00MiB, used=0.00B
> Metadata, DUP: total=1.00GiB, used=349.11MiB
> Metadata, single: total=8.00MiB, used=0.00B
> GlobalReserve, single: total=128.00MiB, used=0.00B
>
> I downloaded and compiled the current btrfs v4.10.1-9-gbd0ab27.
>
> sudo ./btrfs check /dev/sdh2
> Checking filesystem on /dev/sdh2
> UUID: 227fbb6c-ae72-4b81-8e65-942a0ddc6ef7
> checking extents
> checking free space cache
> checking fs roots
> checking csums
> checking root refs
> found 247628603392 bytes used, no error found
> total csum bytes: 241513756
> total tree bytes: 366084096
> total fs tree bytes: 90259456
> total extent tree bytes: 10010624
> btree space waste bytes: 35406538
> file data blocks allocated: 23837459185664
>  referenced 252963553280
>
> and with lowmem mode (again no errors found):
> ./btrfs check /dev/sdh2 --mode=lowmem
> Checking filesystem on /dev/sdh2
> UUID: 227fbb6c-ae72-4b81-8e65-942a0ddc6ef7
> checking extents
> checking free space cache
> checking fs roots
> checking csums
> checking root refs
> found 247738298368 bytes used, no error found
> total csum bytes: 241513756
> total tree bytes: 366084096
> total fs tree bytes: 90259456
> total extent tree bytes: 10010624
> btree space waste bytes: 35406538
> file data blocks allocated: 23837459185664
>  referenced 252963553280
>
> maybe there is some hint in that segmentation fault?
> Also, I compiled from
> git.kernel.org/pub/scm/linux/kernel/git/kdave/btrfs-progs.git but I
> got version 4.10.1 instead of 4.10.2.
>
> Regards,
>
> On Fri, Apr 14, 2017 at 1:17 PM, Chris Murphy  wrote:
>> Can you ro mount it and:
>>
>> btrfs fi show /mnt
>> btrfs fi df /mnt
>>
>> And then next update the btrfs-progs to something newer like 4.9.2 or
>> 4.10.2 and then do another 'btrfs check' without repair. And then
>> separately do it again with --mode=lowmem and post both sets of
>> results?
>>
>>
>> Chris Murphy
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


compressing nocow files

2017-04-17 Thread Chris Murphy
Hi,


/dev/nvme0n1p8 on / type btrfs
(rw,relatime,seclabel,ssd,space_cache,subvolid=258,subvol=/root)

I've got a test folder with +C set and then copied a test file into it.

$ lsattr
C--
./system@2547b430ecdc441c9cf82569eeb22065-0001-00054c3c31bec567.journal

So now it's inherited +C. Next check fragments and compression.

$ sudo ~/Applications/btrfs-debugfs -f
system@2547b430ecdc441c9cf82569eeb22065-0001-00054c3c31bec567.journal
(290662 0): ram 100663296 disk 17200840704 disk_size 100663296
file: 
system@2547b430ecdc441c9cf82569eeb22065-0001-00054c3c31bec567.journal
extents 1 disk size 100663296 logical size 100663296 ratio 1.00


Try to compress it.

$ sudo btrfs fi defrag -c
system@2547b430ecdc441c9cf82569eeb22065-0001-00054c3c31bec567.journal

Check again.

[snip faux fragments]
file: 
system@2547b430ecdc441c9cf82569eeb22065-0001-00054c3c31bec567.journal
extents 768 disk size 21504000 logical size 100663296 ratio 4.68

It's compressed! OK??

OK delete that file, and add +c to the temp folder.

$ lsattr
c---C-- ./temp

Copy another test file into temp.

$ lsattr
c---C--
./system@2547b430ecdc441c9cf82569eeb22065-0002eb9f-00054d4ebb44b135.journal

$ sudo ~/Applications/btrfs-debugfs -f
system@2547b430ecdc441c9cf82569eeb22065-0002eb9f-00054d4ebb44b135.journal
(290739 0): ram 83886080 disk 21764243456 disk_size 83886080
file: 
system@2547b430ecdc441c9cf82569eeb22065-0002eb9f-00054d4ebb44b135.journal
extents 1 disk size 83886080 logical size 83886080 ratio 1.00

Not compressed. Hmmm. So somehow btrfs fi defrag -c will force
compression even on nocow files? I've also done this without the -c
option, on a Btrfs mounted with the compress option, and the file gets
compressed there as well. I thought nocow files were never compressed,
but it seems there are exceptions.


Chris Murphy
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Btrfs/SSD

2017-04-17 Thread Austin S. Hemmelgarn

On 2017-04-17 12:58, Chris Murphy wrote:

On Mon, Apr 17, 2017 at 5:53 AM, Austin S. Hemmelgarn
 wrote:


Regarding BTRFS specifically:
* Given my recently newfound understanding of what the 'ssd' mount option
actually does, I'm inclined to recommend that people who are using high-end
SSD's _NOT_ use it as it will heavily increase fragmentation and will likely
have near zero impact on actual device lifetime (but may _hurt_
performance).  It will still probably help with mid and low-end SSD's.


What is a high end SSD these days? Built-in NVMe?
One with a good FTL in the firmware.  At minimum the good Samsung EVO 
drives, the high-quality Intel ones, and the Crucial MX series, and 
probably some others as well.  My choice of words here probably wasn't 
the best, though.





* Files with NOCOW and filesystems with 'nodatacow' set will both hurt
performance for BTRFS on SSD's, and appear to reduce the lifetime of the
SSD.


Can you elaborate. It's an interesting problem, on a small scale the
systemd folks have journald set +C on /var/log/journal so that any new
journals are nocow. There is an initial fallocate, but the write
behavior is writing in the same place at the head and tail. But at the
tail, the writes get pushed torward the middle. So the file is growing
into its fallocated space from the tail. The header changes in the
same location, it's an overwrite.
For a normal filesystem or BTRFS with nodatacow or NOCOW, the block gets 
rewritten in-place.  This means that cheap FTL's will rewrite that erase 
block in-place (which won't hurt performance but will impact device 
lifetime), and good ones will rewrite into a free block somewhere else 
but may not free that original block for quite some time (which is bad 
for performance but slightly better for device lifetime).


When BTRFS does a COW operation on a block however, it will guarantee 
that that block moves.  Because of this, the old location will either:

1. Be discarded by the FS itself if the 'discard' mount option is set.
2. Be caught by a scheduled call to 'fstrim'.
3. Lay dormant for at least a while.

The first case is ideal for most FTL's, because it lets them know 
immediately that that data isn't needed and the space can be reused. 
The second is close to ideal, but defers telling the FTL that the block 
is unused, which can be better on some SSD's (some have firmware that 
handles wear-leveling better in batches).  The third is not ideal, but 
is still better than what happens with NOCOW or nodatacow set.


Overall, this boils down to the fact that most FTL's get slower if they 
can't wear-level the device properly, and in-place rewrites make it 
harder for them to do proper wear-leveling.
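
As a concrete illustration of the second case above (an example setup,
not something discussed in this thread), free space can be batch-discarded
on a schedule instead of mounting with 'discard':

# one-off batch discard of all free space on the root filesystem
$ sudo fstrim -v /

# or, on systemd-based distros that ship the util-linux timer unit
$ sudo systemctl enable --now fstrim.timer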


So long as this file is not reflinked or snapshot, filefrag shows a
pile of mostly 4096 byte blocks, thousands. But as they're pretty much
all continuous, the file fragmentation (extent count) is usually never
higher than 12. It meanders between 1 and 12 extents for its life.

Except on the system using ssd_spread mount option. That one has a
journal file that is +C, is not being snapshot, but has over 3000
extents per filefrag and btrfs-progs/debugfs. Really weird.
Given how the 'ssd' mount option behaves and the frequency that most 
systemd instances write to their journals, that's actually reasonably 
expected.  We look for big chunks of free space to write into and then 
align to 2M regardless of the actual size of the write, which in turn 
means that files like the systemd journal which see lots of small 
(relatively speaking) writes will have way more extents than they should 
until you defragment them.
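
(As a concrete aside, and not something Chris did here: those journal
files can be compacted in place with the recursive defragment command,

$ sudo btrfs filesystem defragment -r /var/log/journal

with the caveat from the other thread that adding -c, or a compress
mount option, will also compress the nocow journals.)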


Now, systemd aside, there are databases that behave this same way
where there's a small section contantly being overwritten, and one or
more sections that grow the data base file from within and at the end.
If this is made cow, the file will absolutely fragment a ton. And
especially if the changes are mostly 4KiB block sizes that then are
fsync'd.

It's almost like we need these things to not fsync at all, and just
rely on the filesystem commit time...

Essentially yes, but that causes all kinds of other problems.
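
(One partial mitigation for that kind of small-random-overwrite workload,
offered as a suggestion rather than something tested here, is the
autodefrag mount option, which watches for small random writes and queues
the affected files for background defragmentation:

$ sudo mount -o remount,autodefrag /

It adds rewrites of its own on busy databases, so it's a trade-off rather
than a fix.)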
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Btrfs/SSD

2017-04-17 Thread Chris Murphy
On Mon, Apr 17, 2017 at 5:53 AM, Austin S. Hemmelgarn
 wrote:

> Regarding BTRFS specifically:
> * Given my recently newfound understanding of what the 'ssd' mount option
> actually does, I'm inclined to recommend that people who are using high-end
> SSD's _NOT_ use it as it will heavily increase fragmentation and will likely
> have near zero impact on actual device lifetime (but may _hurt_
> performance).  It will still probably help with mid and low-end SSD's.

What is a high end SSD these days? Built-in NVMe?



> * Files with NOCOW and filesystems with 'nodatacow' set will both hurt
> performance for BTRFS on SSD's, and appear to reduce the lifetime of the
> SSD.

Can you elaborate? It's an interesting problem; on a small scale, the
systemd folks have journald set +C on /var/log/journal so that any new
journals are nocow. There is an initial fallocate, but the write
behavior is to write in the same place at the head and tail. At the
tail, the writes get pushed toward the middle, so the file grows
into its fallocated space from the tail. The header changes in the
same location; it's an overwrite.

So long as this file is not reflinked or snapshotted, filefrag shows a
pile of mostly 4096-byte blocks, thousands of them. But as they're pretty
much all contiguous, the file fragmentation (extent count) is usually
never higher than 12. It meanders between 1 and 12 extents for its life.

Except on the system using the ssd_spread mount option. That one has a
journal file that is +C and is not being snapshotted, but has over 3000
extents according to both filefrag and btrfs-debugfs. Really weird.

Now, systemd aside, there are databases that behave this same way,
where there's a small section constantly being overwritten, and one or
more sections that grow the database file from within and at the end.
If this is made cow, the file will absolutely fragment a ton,
especially if the changes are mostly 4KiB blocks that are then
fsync'd.

It's almost like we need these things to not fsync at all, and just
rely on the filesystem commit time...





-- 
Chris Murphy
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Btrfs/SSD

2017-04-17 Thread Austin S. Hemmelgarn

On 2017-04-14 07:02, Imran Geriskovan wrote:

Hi,
Some time ago we had some discussion about SSDs.
Within the limits of unknown/undocumented device info,
we loosely covered data retention capability/disk age/lifetime
interrelations, the (in?)effectiveness of btrfs dup on SSDs, etc.

Now that some time has passed and some experience with SSDs has
accumulated, I think we can again have a status check/update on them,
if you can share your experiences and best practices.

So if you have something to share about SSDs (it may or may not be
directly related to btrfs), I'm sure everybody here will be happy to
hear it.


General info (not BTRFS specific):
* Based on SMART attributes and other factors, current life expectancy 
for light usage (normal desktop usage) appears to be somewhere around 
8-12 years depending on specifics of usage (assuming the same workload, 
F2FS is at the very top of the range, BTRFS and NILFS2 are on the upper 
end, XFS is roughly in the middle, ext4 and NTFS are on the low end 
(tested using Windows 7's NTFS driver), and FAT32 is an outlier at the 
bottom of the barrel).
* Queued DISCARD support is still missing in most consumer SATA SSD's, 
which in turn makes the trade-off on those between performance and 
lifetime much sharper.
* Modern (2015 and newer) SSD's seem to have better handling in the FTL 
for the journaling behavior of filesystems like ext4 and XFS.  I'm not 
sure if this is actually a result of the FTL being better, or some 
change in the hardware.
* In my personal experience, Intel, Samsung, and Crucial appear to be 
the best name brands (in relative order of quality).  I have personally 
had bad experiences with SanDisk and Kingston SSD's, but I don't have 
anything beyond circumstantial evidence indicating that it was anything 
but bad luck on both counts.
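
(As an aside relating to the queued DISCARD point above, and not something
covered in the thread: a quick way to check whether a drive advertises
TRIM at all, though not whether it is the queued variant, is for example:

$ lsblk --discard /dev/sda
  # non-zero DISC-GRAN / DISC-MAX columns mean discard is supported

$ sudo hdparm -I /dev/sda | grep -i trim
  # look for "Data Set Management TRIM supported")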


Regarding BTRFS specifically:
* Given my recently newfound understanding of what the 'ssd' mount 
option actually does, I'm inclined to recommend that people who are 
using high-end SSD's _NOT_ use it as it will heavily increase 
fragmentation and will likely have near zero impact on actual device 
lifetime (but may _hurt_ performance).  It will still probably help with 
mid and low-end SSD's.
* Files with NOCOW and filesystems with 'nodatacow' set will both hurt 
performance for BTRFS on SSD's, and appear to reduce the lifetime of the 
SSD.
* Compression should help performance and device lifetime most of the 
time, unless your CPU is fully utilized on a regular basis (in which 
case it will hurt performance, but still improve device lifetimes).
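
As a practical sketch of the 'ssd' and compression points above (example
values only, adjust to your own filesystem), the relevant fstab line for a
high-end SSD would look something like:

# suppress the 'ssd' allocation heuristics, enable transparent compression
UUID=<fs-uuid>  /  btrfs  defaults,nossd,compress=lzo  0  0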

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html