Re: Ideas to reuse filesystem's checksum to enhance dm-raid1/10/5/6?

2017-11-15 Thread Nikolay Borisov


On 16.11.2017 09:38, Qu Wenruo wrote:
> 
> 
> On 2017年11月16日 14:54, Nikolay Borisov wrote:
>>
>>
>> On 16.11.2017 04:18, Qu Wenruo wrote:
>>> Hi all,
>>>
>>> [Background]
>>> Recently I've been considering the possibility of reusing filesystem
>>> checksums to enhance device-mapper RAID.
>>>
>>> The idea behind it is quite simple: most modern filesystems have
>>> checksums for their metadata, and some (btrfs) even have checksums for data.
>>>
>>> For btrfs RAID1/10 (ignoring RAID5/6 for now), at read time the
>>> checksum can be used to determine which copy is correct, so the
>>> correct data can be returned even if one copy gets corrupted.
>>>
>>> [Objective]
>>> The final objective is to allow device mapper to do the checksum
>>> verification (and repair if possible).
>>>
>>> For verification alone, this is not much different from the endio
>>> hook method currently used by most filesystems.
>>> However, if we can move the repair part out of the filesystem (so far
>>> only btrfs supports it), it would benefit all filesystems.
>>>
>>> [What we have]
>>> The nearest infrastructure I found in the kernel is bio_integrity_payload.
>>>
>>> However, it is bound to the device, as it was designed to support the
>>> SCSI/SATA integrity protocol.
>>> For this use case it is more naturally bound to the filesystem, as the
>>> fs (or a higher-layer dm device) is the source of the integrity data,
>>> and the device (dm-raid) only does the verification and possible repair.
>>>
>>> I'm not sure whether it is a good idea to reuse (or abuse)
>>> bio_integrity_payload for this purpose.
>>>
>>> Should we build new infrastructure or enhance the existing
>>> bio_integrity_payload?
>>>
>>> (Or is this a valid idea or just another crazy dream?)
>>>
>>
>> This sounds good in principle; however, I think there is one crucial
>> point which needs to be considered:
>>
>> Every fs with checksums stores them in some specific way, and when it
>> fetches data from disk it also knows how to acquire the respective
>> checksum.
> 
> Just like the integrity payload, we generate a READ bio with a checksum
> hook function and checksum data attached.

So how is this checksum data acquired in the first place?

> 
> For a data read, we read the checksum first, attach it to the data READ
> bio, then submit it.
> 
> For a metadata read, in most cases the checksum is integrated into the
> metadata header, as btrfs does.
> 
> In that case we attach empty checksum data to the bio, but use a
> metadata-specific hook function to handle it.
> 
>> What you suggest might be doable, but it will require the lower
>> layers (dm) to be aware of how to acquire the specific checksum for a
>> given piece of data.
> 
> In the above case, dm only needs to call the verification hook.
> If verification passes, good.
> If not, it tries another copy if one exists.
> 
> In this case, I don't think the dm layer needs any extra interface to
> communicate with the higher layer.


Well, that verification function is the interface I meant; you are
essentially communicating the checksum out of band (notwithstanding the
metadata case, since you said the checksum is in the actual metadata
header).

In the end, which problem are you trying to solve? Allowing for a
generic checksumming layer which filesystems may use if they choose to?

> 
> Thanks,
> Qu
> 
>> I don't think such infrastructure exists at this point, and frankly I
>> cannot even envision how it would work elegantly. Sure, you can create
>> a dm-checksum target (which I believe dm-verity is very similar to)
>> that stores checksums alongside the data, but at that point the fs is
>> really out of the picture.
>>
>>
>>> Thanks,
>>> Qu
>>>
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
>> the body of a message to majord...@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>
> 


Re: Ideas to reuse filesystem's checksum to enhance dm-raid1/10/5/6?

2017-11-15 Thread Qu Wenruo


On 2017年11月16日 14:54, Nikolay Borisov wrote:
> 
> 
> On 16.11.2017 04:18, Qu Wenruo wrote:
>> Hi all,
>>
>> [Background]
>> Recently I've been considering the possibility of reusing filesystem
>> checksums to enhance device-mapper RAID.
>>
>> The idea behind it is quite simple: most modern filesystems have
>> checksums for their metadata, and some (btrfs) even have checksums for data.
>>
>> For btrfs RAID1/10 (ignoring RAID5/6 for now), at read time the
>> checksum can be used to determine which copy is correct, so the
>> correct data can be returned even if one copy gets corrupted.
>>
>> [Objective]
>> The final objective is to allow device mapper to do the checksum
>> verification (and repair if possible).
>>
>> For verification alone, this is not much different from the endio
>> hook method currently used by most filesystems.
>> However, if we can move the repair part out of the filesystem (so far
>> only btrfs supports it), it would benefit all filesystems.
>>
>> [What we have]
>> The nearest infrastructure I found in the kernel is bio_integrity_payload.
>>
>> However, it is bound to the device, as it was designed to support the
>> SCSI/SATA integrity protocol.
>> For this use case it is more naturally bound to the filesystem, as the
>> fs (or a higher-layer dm device) is the source of the integrity data,
>> and the device (dm-raid) only does the verification and possible repair.
>>
>> I'm not sure whether it is a good idea to reuse (or abuse)
>> bio_integrity_payload for this purpose.
>>
>> Should we build new infrastructure or enhance the existing
>> bio_integrity_payload?
>>
>> (Or is this a valid idea or just another crazy dream?)
>>
> 
> This sounds good in principle; however, I think there is one crucial
> point which needs to be considered:
> 
> Every fs with checksums stores them in some specific way, and when it
> fetches data from disk it also knows how to acquire the respective
> checksum.

Just like the integrity payload, we generate a READ bio with a checksum
hook function and checksum data attached.

For a data read, we read the checksum first, attach it to the data READ
bio, then submit it.

For a metadata read, in most cases the checksum is integrated into the
metadata header, as btrfs does.

In that case we attach empty checksum data to the bio, but use a
metadata-specific hook function to handle it.

> What you suggest might be doable, but it will require the lower
> layers (dm) to be aware of how to acquire the specific checksum for a
> given piece of data.

In the above case, dm only needs to call the verification hook.
If verification passes, good.
If not, it tries another copy if one exists.

In this case, I don't think the dm layer needs any extra interface to
communicate with the higher layer.

Thanks,
Qu

> I don't think such infrastructure exists at this point, and frankly I
> cannot even envision how it would work elegantly. Sure, you can create
> a dm-checksum target (which I believe dm-verity is very similar to)
> that stores checksums alongside the data, but at that point the fs is
> really out of the picture.
> 
> 
>> Thanks,
>> Qu
>>
> 





Re: Ideas to reuse filesystem's checksum to enhance dm-raid1/10/5/6?

2017-11-15 Thread Nikolay Borisov


On 16.11.2017 04:18, Qu Wenruo wrote:
> Hi all,
> 
> [Background]
> Recently I've been considering the possibility of reusing filesystem
> checksums to enhance device-mapper RAID.
>
> The idea behind it is quite simple: most modern filesystems have
> checksums for their metadata, and some (btrfs) even have checksums for data.
>
> For btrfs RAID1/10 (ignoring RAID5/6 for now), at read time the
> checksum can be used to determine which copy is correct, so the
> correct data can be returned even if one copy gets corrupted.
>
> [Objective]
> The final objective is to allow device mapper to do the checksum
> verification (and repair if possible).
>
> For verification alone, this is not much different from the endio
> hook method currently used by most filesystems.
> However, if we can move the repair part out of the filesystem (so far
> only btrfs supports it), it would benefit all filesystems.
>
> [What we have]
> The nearest infrastructure I found in the kernel is bio_integrity_payload.
>
> However, it is bound to the device, as it was designed to support the
> SCSI/SATA integrity protocol.
> For this use case it is more naturally bound to the filesystem, as the
> fs (or a higher-layer dm device) is the source of the integrity data,
> and the device (dm-raid) only does the verification and possible repair.
>
> I'm not sure whether it is a good idea to reuse (or abuse)
> bio_integrity_payload for this purpose.
>
> Should we build new infrastructure or enhance the existing
> bio_integrity_payload?
> 
> (Or is this a valid idea or just another crazy dream?)
> 

This sounds good in principle; however, I think there is one crucial
point which needs to be considered:

Every fs with checksums stores them in some specific way, and when it
fetches data from disk it also knows how to acquire the respective
checksum. What you suggest might be doable, but it will require the
lower layers (dm) to be aware of how to acquire the specific checksum
for a given piece of data. I don't think such infrastructure exists at
this point, and frankly I cannot even envision how it would work
elegantly. Sure, you can create a dm-checksum target (which I believe
dm-verity is very similar to) that stores checksums alongside the
data, but at that point the fs is really out of the picture.


> Thanks,
> Qu
> 


Re: [PATCH 0/4] factor __btrfs_open_devices()

2017-11-15 Thread Anand Jain



On 11/16/2017 01:03 AM, David Sterba wrote:

On Thu, Nov 09, 2017 at 11:45:22PM +0800, Anand Jain wrote:

This is in preparation to help bring the missing device back to the
alloc list dynamically. As of now, if we run 'btrfs dev scan '
on a mounted FS, nothing happens; there is no code which handles this
condition. So the idea is to fix it by running through the device open
process for the reappeared missing device.

So this patch separates the device open steps into a separate function so
that it can be called for a single device, instead of for all the devices
in the dev_list. As this function is in the critical mount section, and
to show the source changes as clearly as possible, I have created 4
non-functional-change patches, which can be merged together if needed.


The splitting is fine, I've only merged 2 and 4 to one, but otherwise
it's clear from each patch if there are no functional changes. 1,3,2+4
added to next, thanks.


Thanks indeed. (git-diff showed changes which didn't make sense,
so I had to split; the merge is fine.)

Thanks, Anand


Ideas to reuse filesystem's checksum to enhance dm-raid1/10/5/6?

2017-11-15 Thread Qu Wenruo
Hi all,

[Background]
Recently I've been considering the possibility of reusing filesystem
checksums to enhance device-mapper RAID.

The idea behind it is quite simple: most modern filesystems have
checksums for their metadata, and some (btrfs) even have checksums for data.

For btrfs RAID1/10 (ignoring RAID5/6 for now), at read time the
checksum can be used to determine which copy is correct, so the
correct data can be returned even if one copy gets corrupted.

[Objective]
The final objective is to allow device mapper to do the checksum
verification (and repair if possible).

For verification alone, this is not much different from the endio
hook method currently used by most filesystems.
However, if we can move the repair part out of the filesystem (so far
only btrfs supports it), it would benefit all filesystems.

[What we have]
The nearest infrastructure I found in the kernel is bio_integrity_payload.

However, it is bound to the device, as it was designed to support the
SCSI/SATA integrity protocol.
For this use case it is more naturally bound to the filesystem, as the
fs (or a higher-layer dm device) is the source of the integrity data,
and the device (dm-raid) only does the verification and possible repair.

I'm not sure whether it is a good idea to reuse (or abuse)
bio_integrity_payload for this purpose.

Should we build new infrastructure or enhance the existing
bio_integrity_payload?

(Or is this a valid idea or just another crazy dream?)

Thanks,
Qu





Re: [PATCH v2 2/3] btrfs: tree-checker: Add checker for dir item

2017-11-15 Thread Liu Bo
On Wed, Nov 08, 2017 at 09:59:39AM +0200, Nikolay Borisov wrote:
> 
> 
> On  8.11.2017 02:54, Qu Wenruo wrote:
> > Add checker for dir item, for key types DIR_ITEM, DIR_INDEX and
> > XATTR_ITEM.
> > 
> > This checker does comprehensive check for:
> > 1) dir_item header and its data size
> >Against item boundary and maximum name/xattr length.
> >This part is mostly the same as old verify_dir_item().
> > 
> > 2) dir_type
> >Against maximum file types, and against key type.
> >Since XATTR key should only have FT_XATTR dir item, and normal dir
> >item type should not have XATTR key.
> > 
> >The check between key->type and dir_type is newly introduced by this
> >patch.
> > 
> > 3) name hash
> >For XATTR and DIR_ITEM key, key->offset is name hash (crc32).
> >Check the hash of name against key to ensure it's correct.
> > 
> >The name hash check is only found in btrfs-progs before this patch.
> > 
> > Signed-off-by: Qu Wenruo 
> > ---
> >  fs/btrfs/tree-checker.c | 141 
> > 
> >  1 file changed, 141 insertions(+)
> > 
> 
> 
> I'm gonna 'hijack' this patch to discuss something  - we do return the
> correct EUCLEAN status from the leaf checker, however the only place
> where it's called is in btree_readpage_end_io_hook and if the check
> fails we only return -EIO. I wonder whether want to propagate the
> EUCLEAN from there, any thoughts?

The @ret from the readpage_end_io_hook won't reach the upper layer, as
it's handled inside end_bio_extent_readpage().

readpage/readpages will just check the page's uptodate and error
bits.

Thanks,

-liubo


Re: [PATCH] btrfs: fix deadlock when writing out space cache

2017-11-15 Thread Chris Mason

On 11/15/2017 06:46 PM, Liu Bo wrote:

On Wed, Nov 15, 2017 at 04:20:52PM -0500, Josef Bacik wrote:

From: Josef Bacik 

If we fail to prepare our pages for whatever reason (out of memory in
our case) we need to make sure to drop the block_group->data_rwsem,
otherwise hilarity ensues.



Thanks Josef, I searched all the logs and it looks like we've really 
only hit this twice this month.  It's surprising we haven't seen this 
more given how often we OOM.


-chris


Re: [PATCH] btrfs: move some zstd work data from stack to workspace

2017-11-15 Thread Nick Terrell
On 11/15/17, 9:30 AM, "David Sterba"  wrote:
> * ZSTD_inBuffer in_buf
> * ZSTD_outBuffer out_buf
>
> are used in all functions to pass the compression parameters and the
> local variables consume some space. We can move them to the workspace
> and reduce the stack consumption:
>
> zstd.c:zstd_decompress-24 (136 -> 112)
> zstd.c:zstd_decompress_bio-24 (144 -> 120)
> zstd.c:zstd_compress_pages-24 (264 -> 240)

It looks good to me, and I ran my btrfs zstd compression and
decompression test and everything worked.

Is there a case where these 24 bytes matter, or is this just an easy
optimization?

Reviewed-by: Nick Terrell 




Re: [PATCH 1/2] btrfs: Fix wild memory access in compression level parser

2017-11-15 Thread Qu Wenruo


On 2017年11月15日 23:11, David Sterba wrote:
> On Mon, Nov 06, 2017 at 10:43:18AM +0800, Qu Wenruo wrote:
>> [BUG]
>> Kernel panic when mounting with "-o compress" mount option.
>> KASAN will report like:
>> --
>> ==
>> BUG: KASAN: wild-memory-access in strncmp+0x31/0xc0
>> Read of size 1 at addr d86735fce994f800 by task mount/662
>> ...
>> Call Trace:
>>  dump_stack+0xe3/0x175
>>  kasan_report+0x163/0x370
>>  __asan_load1+0x47/0x50
>>  strncmp+0x31/0xc0
>>  btrfs_compress_str2level+0x20/0x70 [btrfs]
>>  btrfs_parse_options+0xff4/0x1870 [btrfs]
>>  open_ctree+0x2679/0x49f0 [btrfs]
>>  btrfs_mount+0x1b7f/0x1d30 [btrfs]
>>  mount_fs+0x49/0x190
>>  vfs_kern_mount.part.29+0xba/0x280
>>  vfs_kern_mount+0x13/0x20
>>  btrfs_mount+0x31e/0x1d30 [btrfs]
>>  mount_fs+0x49/0x190
>>  vfs_kern_mount.part.29+0xba/0x280
>>  do_mount+0xaad/0x1a00
>>  SyS_mount+0x98/0xe0
>>  entry_SYSCALL_64_fastpath+0x1f/0xbe
>> --
>>
>> [Cause]
>> For the 'compress' and 'compress_force' options, the token doesn't expect
>> any parameter, so args[0] contains uninitialized data.
>> Accessing args[0] causes the wild memory access above.
>>
>> [Fix]
>> For Opt_compress and Opt_compress_force, set compression level to
>> Z_DEFAULT_COMPRESSION manually.
>>
>> NOTE: Don't set zlib compression level to 0 by default, which means no
>> compression.
> 
> But we never set the level to 0 at the point the compression actually
> happens. See zlib.c:zlib_set_level, if level is 0 then the level
> passed to zlib is 3. Z_DEFAULT_COMPRESSION is upstream zlib level 6,
> which is slower, we need zlib to stay in the real-time numbers.

Right, I missed that.

So should I still use 0, or use a separate macro like
BTRFS_DEFAULT_ZLIB_LEVEL?

Thanks,
Qu

> 
>> @@ -507,8 +508,19 @@ int btrfs_parse_options(struct btrfs_fs_info *info, char *options,
>>  token == Opt_compress_force ||
>>  strncmp(args[0].from, "zlib", 4) == 0) {
>>  compress_type = "zlib";
>> +
>>  info->compress_type = BTRFS_COMPRESS_ZLIB;
>> -info->compress_level =
>> +/*
>> + * args[0] contains uninitialized data since
>> + * for these tokens we don't expect any
>> + * parameter.
>> + */
>> +if (token == Opt_compress ||
>> +token == Opt_compress_force)
>> +info->compress_level =
>> +Z_DEFAULT_COMPRESSION;
>> +else
>> +info->compress_level =
>>  btrfs_compress_str2level(args[0].from);
> 
> At least this will not screw up the levels, anything that's not
> recognized will become the default.
> 
>>  btrfs_set_opt(info->mount_opt, COMPRESS);
>>  btrfs_clear_opt(info->mount_opt, NODATACOW);
> 





[PATCH v2] fstests: btrfs/143: make test case more reliable

2017-11-15 Thread Liu Bo
This changes the test to use '_scratch_cycle_mount' to drop all caches
btrfs could have, in order to avoid an issue where drop_caches somehow
doesn't work on Nikolay's box.

Also use 'bash -c' to run the read only when the pid is odd, so that we
can read from the faulty disk.

Reported-by: Nikolay Borisov 
Signed-off-by: Liu Bo 
---
v2: Change 'fadvise -d' to _scratch_cycle_mount.

To Nikolay,

I didn't add your tested-by, but could you please verify if this also
works on your test box?


 tests/btrfs/143 | 20 ++--
 1 file changed, 10 insertions(+), 10 deletions(-)

diff --git a/tests/btrfs/143 b/tests/btrfs/143
index da7bfd8..3875b6c 100755
--- a/tests/btrfs/143
+++ b/tests/btrfs/143
@@ -127,16 +127,16 @@ echo "step 3..repair the bad copy" >>$seqres.full
 # since raid1 consists of two copies, and the bad copy was put on stripe #1
 # while the good copy lies on stripe #0, the bad copy only gets access when the
 # reader's pid % 2 == 1 is true
-while true; do
-   # start_fail only fails the following buffered read so the repair is
-   # supposed to work.
-   echo 3 > /proc/sys/vm/drop_caches
-   start_fail
-   $XFS_IO_PROG -c "pread 0 4K" "$SCRATCH_MNT/foobar" > /dev/null &
-   pid=$!
-   wait
-   stop_fail
-   [ $((pid % 2)) == 1 ] && break
+while [[ -z ${result} ]]; do
+# invalidate the page cache.
+_scratch_cycle_mount
+
+start_fail
+result=$(bash -c "
+if [[ \$((\$\$ % 2)) -eq 1 ]]; then
+exec $XFS_IO_PROG -c \"pread 0 4K\" \"$SCRATCH_MNT/foobar\"
+fi");
+stop_fail
 done
 
 _scratch_unmount
-- 
2.5.0



[PATCH v3] Btrfs: avoid losing data raid profile when deleting a device

2017-11-15 Thread Liu Bo
We've already avoided losing the data raid profile when doing balance,
but it turns out that deleting a device could also result in the same
problem.

Say we have 3 disks, and they're created with '-d raid1' profile.

- We have chunk P (the only data chunk on the empty btrfs).

- Suppose that chunk P's two raid1 copies reside in disk A and disk B.

- Now, 'btrfs device remove disk B'
 btrfs_rm_device()
   -> btrfs_shrink_device()
  -> btrfs_relocate_chunk() #relocate any chunk on disk B
 to other places.

- Chunk P will be removed and a new chunk will be created to hold
  those data, but as chunk P is the only one holding raid1 profile,
  after it goes away, the new chunk will be created as single profile
  which is our default profile.

This fixes the problem by creating an empty data chunk before
relocating the data chunk.

Metadata/system chunks are supposed to have non-zero bytes used at all
times, so their raid profile is preserved.

Reported-by: James Alandt 
Signed-off-by: Liu Bo 
---
v3: More details about how losing data profile happens.

v2: - return the correct error.
- move helper ahead of __btrfs_balance().

 fs/btrfs/volumes.c | 84 ++
 1 file changed, 65 insertions(+), 19 deletions(-)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 4a72c45..a74396d 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -3018,6 +3018,48 @@ static int btrfs_relocate_sys_chunks(struct btrfs_fs_info *fs_info)
return ret;
 }
 
+/*
+ * return 1 : allocate a data chunk successfully,
+ * return <0: errors during allocating a data chunk,
+ * return 0 : no need to allocate a data chunk.
+ */
+static int btrfs_may_alloc_data_chunk(struct btrfs_fs_info *fs_info,
+ u64 chunk_offset)
+{
+   struct btrfs_block_group_cache *cache;
+   u64 bytes_used;
+   u64 chunk_type;
+
+   cache = btrfs_lookup_block_group(fs_info, chunk_offset);
+   ASSERT(cache);
+   chunk_type = cache->flags;
+   btrfs_put_block_group(cache);
+
+   if (chunk_type & BTRFS_BLOCK_GROUP_DATA) {
+   spin_lock(&fs_info->data_sinfo->lock);
+   bytes_used = fs_info->data_sinfo->bytes_used;
+   spin_unlock(&fs_info->data_sinfo->lock);
+
+   if (!bytes_used) {
+   struct btrfs_trans_handle *trans;
+   int ret;
+
+   trans = btrfs_join_transaction(fs_info->tree_root);
+   if (IS_ERR(trans))
+   return PTR_ERR(trans);
+
+   ret = btrfs_force_chunk_alloc(trans, fs_info,
+ BTRFS_BLOCK_GROUP_DATA);
+   btrfs_end_transaction(trans);
+   if (ret < 0)
+   return ret;
+
+   return 1;
+   }
+   }
+   return 0;
+}
+
 static int insert_balance_item(struct btrfs_fs_info *fs_info,
   struct btrfs_balance_control *bctl)
 {
@@ -3476,7 +3518,6 @@ static int __btrfs_balance(struct btrfs_fs_info *fs_info)
u32 count_meta = 0;
u32 count_sys = 0;
int chunk_reserved = 0;
-   u64 bytes_used = 0;
 
/* step one make some room on all the devices */
	devices = &fs_info->fs_devices->devices;
@@ -3635,28 +3676,21 @@ static int __btrfs_balance(struct btrfs_fs_info *fs_info)
goto loop;
}
 
-   ASSERT(fs_info->data_sinfo);
-   spin_lock(&fs_info->data_sinfo->lock);
-   bytes_used = fs_info->data_sinfo->bytes_used;
-   spin_unlock(&fs_info->data_sinfo->lock);
-
-   if ((chunk_type & BTRFS_BLOCK_GROUP_DATA) &&
-   !chunk_reserved && !bytes_used) {
-   trans = btrfs_start_transaction(chunk_root, 0);
-   if (IS_ERR(trans)) {
-   mutex_unlock(&fs_info->delete_unused_bgs_mutex);
-   ret = PTR_ERR(trans);
-   goto error;
-   }
-
-   ret = btrfs_force_chunk_alloc(trans, fs_info,
- BTRFS_BLOCK_GROUP_DATA);
-   btrfs_end_transaction(trans);
+   if (!chunk_reserved) {
+   /*
+* We may be relocating the only data chunk we have,
+* which could potentially end up with losing data's
+* raid profile, so lets allocate an empty one in
+* advance.
+*/
+   ret = btrfs_may_alloc_data_chunk(fs_info,
+found_key.offset);

[PATCH v2] Btrfs: set plug for fsync

2017-11-15 Thread Liu Bo
Setting up a plug lets the block layer merge adjacent IOs before
dispatching them to the disk driver.

Without a plug this is not much of a problem for single-disk use cases,
but with multiple disks using a raid profile, a large IO can be split
into several IOs of stripe length, and a plug helps bring them back
together per disk, saving several disk accesses.

Moreover, fsync issues synchronous writes, so the plug can really take
effect.

Signed-off-by: Liu Bo 
---
v2: Explain why setting plug makes sense.

 fs/btrfs/file.c | 10 ++
 1 file changed, 10 insertions(+)

diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index e43da6c..504e96d 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -2018,10 +2018,20 @@ int btrfs_release_file(struct inode *inode, struct file *filp)
 static int start_ordered_ops(struct inode *inode, loff_t start, loff_t end)
 {
int ret;
+   struct blk_plug plug;
 
+   /*
+* This is only called in fsync, which would do synchronous
+* writes, so a plug can merge adjacent IOs as much as
+* possible.  Esp. in case of multiple disks using raid
+* profile, a large IO can be split to several segments of
+* stripe length(64K).
+*/
+   blk_start_plug(&plug);
	atomic_inc(&BTRFS_I(inode)->sync_writers);
	ret = btrfs_fdatawrite_range(inode, start, end);
	atomic_dec(&BTRFS_I(inode)->sync_writers);
+   blk_finish_plug(&plug);
 
return ret;
 }
-- 
2.9.4



Re: [PATCH] btrfs: fix deadlock when writing out space cache

2017-11-15 Thread Liu Bo
On Wed, Nov 15, 2017 at 04:20:52PM -0500, Josef Bacik wrote:
> From: Josef Bacik 
> 
> If we fail to prepare our pages for whatever reason (out of memory in
> our case) we need to make sure to drop the block_group->data_rwsem,
> otherwise hilarity ensues.
>

Reviewed-by: Liu Bo 

Thanks,

-liubo
> Signed-off-by: Josef Bacik 
> ---
>  fs/btrfs/free-space-cache.c | 6 +-
>  1 file changed, 5 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/btrfs/free-space-cache.c b/fs/btrfs/free-space-cache.c
> index cdc9f4015ec3..a6c643275210 100644
> --- a/fs/btrfs/free-space-cache.c
> +++ b/fs/btrfs/free-space-cache.c
> @@ -1263,8 +1263,12 @@ static int __btrfs_write_out_cache(struct btrfs_root *root, struct inode *inode,
>  
>   /* Lock all pages first so we can lock the extent safely. */
>   ret = io_ctl_prepare_pages(io_ctl, inode, 0);
> - if (ret)
> + if (ret) {
> + if (block_group &&
> + (block_group->flags & BTRFS_BLOCK_GROUP_DATA))
> + up_write(&block_group->data_rwsem);
>   goto out;
> + }
>  
>   lock_extent_bits(&BTRFS_I(inode)->io_tree, 0, i_size_read(inode) - 1,
>    &cached_state);
> -- 
> 2.7.5
> 


Re: Tiered storage?

2017-11-15 Thread Duncan
Roy Sigurd Karlsbakk posted on Wed, 15 Nov 2017 15:10:08 +0100 as
excerpted:

>>> As for dedupe there is (to my knowledge) nothing fully automatic yet.
>>> You have to run a program to scan your filesystem but all the
>>> deduplication is done in the kernel.
>>> duperemove works apparently quite well when I tested it, but there may
>>> be some performance implications.

>> Correct, there is nothing automatic (and there are pretty significant
>> arguments against doing automatic deduplication in most cases), but the
>> off-line options (via the EXTENT_SAME ioctl) are reasonably reliable.
>> Duperemove in particular does a good job, though it may take a long
>> time for large data sets.
>> 
>> As far as performance, it's no worse than large numbers of snapshots.
>> The issues arise from using very large numbers of reflinks.
> 
> What is this "large" number of snapshots? Not that it's directly
> comparible, but I've worked with ZFS a while, and haven't seen those
> issues there.

Btrfs has scaling issues with reflinks, not so much in normal operation, 
but when it comes to filesystem maintenance such as btrfs check and btrfs 
balance.

Numerically, low double-digits of reflinks per extent seems to be 
reasonably fine, high double-digits to low triple-digits begins to run 
into scaling issues, and high triple digits to over 1000... better be 
prepared to wait awhile (can be days or weeks!) for that balance or check 
to complete, and check requires LOTS more memory as well, particularly at 
TB+ scale.

Of course snapshots are the common instance of reflinking, and each 
snapshot is another reflink to each extent of the data in the subvolume 
it covers, so limiting snapshots to 10-50 of each subvolume is 
recommended, and limiting to under 250-ish is STRONGLY recommended.  
(Total number of snapshots per filesystem, where there's many subvolumes 
and snapshots per subvolume falls within the above limits, doesn't seem 
to be a problem.)

Dedupe uses reflinking too, but the effects can be much more variable 
depending on the use-case and how many actual reflinks are being created.

A single extent with 1000 deduping reflinks, as might be common in a 
commercial/hosting use-case, shouldn't be too bad, perhaps comparable to 
a single snapshot, but obviously, do that with a bunch of extents (as a 
hosting use-case might) and it quickly builds to the effect of 1000 
snapshots of the same subvolume, which as mentioned above puts 
maintenance-task time out of the realm of reasonable, for many.

Tho of course in a commercial/hosting case maintenance may well not be 
done as a simple swap-in of a fresh backup is more likely, so it may not 
matter for that scenario.

OTOH, a typical individual/personal use-case may dedup many files but 
only single-digit times each, so the effect would be the same as a single-
digit number of snapshots at worst.

Meanwhile, while btrfs quotas are finally maturing in terms of actually 
tracking the numbers correctly, their effect on scaling is pretty bad 
too.  The recommendation is to keep btrfs quotas off unless you actually 
need them.  If you do need quotas, temporarily disable them while doing 
balances and device-removes (which do implicit balances), then quota-
rescan after the balance is done, because precisely tracking quotas thru 
a balance ends up repeatedly recalculating the numbers again and again 
during the balance, and that just doesn't scale.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



Re: zstd compression

2017-11-15 Thread Duncan
Austin S. Hemmelgarn posted on Wed, 15 Nov 2017 07:57:06 -0500 as
excerpted:

> The 'compress' and 'compress-force' mount options only impact newly
> written data.  The compression used is stored with the metadata for the
> extents themselves, so any existing data on the volume will be read just
> fine with whatever compression method it was written with, while new
> data will be written with the specified compression method.
> 
> If you want to convert existing files, you can use the '-c' option to
> the defrag command to do so.

... Being aware of course that using defrag to recompress files like that 
will break 100% of the existing reflinks, effectively (near) doubling 
data usage if the files are snapshotted, since the snapshot will now 
share 0% of its extents with the newly compressed files.

(The actual effect shouldn't be quite that bad, as some files are likely 
to be uncompressed due to not compressing well, and I'm not sure if 
defrag -c rewrites them or not.  Further, if there's multiple snapshots 
data usage should only double with respect to the latest one, the data 
delta between it and previous snapshots won't be doubled as well.)

While this makes sense if you think about it, it may not occur to some 
people until they've actually tried it, and see their data usage go way 
up instead of going down as they intuitively expected.  There have been 
posts to the list...

Of course if the data isn't snapshotted this doesn't apply and defrag -c 
to zstd should be fine. =:^)




Re: [PATCH] btrfs: fix deadlock when writing out space cache

2017-11-15 Thread Omar Sandoval
On Wed, Nov 15, 2017 at 04:20:52PM -0500, Josef Bacik wrote:
> From: Josef Bacik 
> 
> If we fail to prepare our pages for whatever reason (out of memory in
> our case) we need to make sure to drop the block_group->data_rwsem,
> otherwise hilarity ensues.
> 
> Signed-off-by: Josef Bacik 

Reviewed-by: Omar Sandoval 

> ---
>  fs/btrfs/free-space-cache.c | 6 +-
>  1 file changed, 5 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/btrfs/free-space-cache.c b/fs/btrfs/free-space-cache.c
> index cdc9f4015ec3..a6c643275210 100644
> --- a/fs/btrfs/free-space-cache.c
> +++ b/fs/btrfs/free-space-cache.c
> @@ -1263,8 +1263,12 @@ static int __btrfs_write_out_cache(struct btrfs_root 
> *root, struct inode *inode,
>  
>   /* Lock all pages first so we can lock the extent safely. */
>   ret = io_ctl_prepare_pages(io_ctl, inode, 0);
> - if (ret)
> + if (ret) {
> + if (block_group &&
> + (block_group->flags & BTRFS_BLOCK_GROUP_DATA))
> + up_write(&block_group->data_rwsem);
>   goto out;
> + }
>  
> lock_extent_bits(&BTRFS_I(inode)->io_tree, 0, i_size_read(inode) - 1,
> &cached_state);
> -- 
> 2.7.5
> 


[PATCH] btrfs: fix deadlock when writing out space cache

2017-11-15 Thread Josef Bacik
From: Josef Bacik 

If we fail to prepare our pages for whatever reason (out of memory in
our case) we need to make sure to drop the block_group->data_rwsem,
otherwise hilarity ensues.

Signed-off-by: Josef Bacik 
---
 fs/btrfs/free-space-cache.c | 6 +-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/free-space-cache.c b/fs/btrfs/free-space-cache.c
index cdc9f4015ec3..a6c643275210 100644
--- a/fs/btrfs/free-space-cache.c
+++ b/fs/btrfs/free-space-cache.c
@@ -1263,8 +1263,12 @@ static int __btrfs_write_out_cache(struct btrfs_root 
*root, struct inode *inode,
 
/* Lock all pages first so we can lock the extent safely. */
ret = io_ctl_prepare_pages(io_ctl, inode, 0);
-   if (ret)
+   if (ret) {
+   if (block_group &&
+   (block_group->flags & BTRFS_BLOCK_GROUP_DATA))
+   up_write(_group->data_rwsem);
goto out;
+   }
 
lock_extent_bits(&BTRFS_I(inode)->io_tree, 0, i_size_read(inode) - 1,
 &cached_state);
-- 
2.7.5



Re: Read before you deploy btrfs + zstd

2017-11-15 Thread Duncan
Austin S. Hemmelgarn posted on Tue, 14 Nov 2017 07:38:03 -0500 as
excerpted:

> On 2017-11-14 02:34, Martin Steigerwald wrote:
>> Hello David.
>> 
>> David Sterba - 13.11.17, 23:50:
>>> while 4.14 is still fresh, let me address some concerns I've seen on
>>> linux forums already.
>>>
>>> The newly added ZSTD support is a feature that has broader impact than
>>> just the runtime compression. The btrfs-progs understand filesystem
>>> with ZSTD since 4.13. The remaining key part is the bootloader.
>>>
>>> Up to now, there are no bootloaders supporting ZSTD. This could lead
>>> to an unmountable filesystem if the critical files under /boot get
>>> accidentally or intentionally compressed by ZSTD.
>> 
>> But otherwise ZSTD is safe to use? Are you aware of any other issues?
> Aside from the obvious issue that recovery media like SystemRescueCD and
> the GParted LIveCD haven't caught up yet, and thus won't be able to do
> anything with the filesystem, my testing has not uncovered any issues,
> though it is by no means rigorous.

Tho from my understanding and last I read, btrfs restore (I believe it 
was) hadn't been updated to handle zstd yet, tho btrfs check and btrfs 
filesystem defrag had been.  And of course btrfs balance handles it if 
the kernel does, since all balance does is call the kernel to do it.

So just confirming, does btrfs restore handle zstd from -progs 4.13?  

Because once a filesystem goes unmountable, restore is what I've had best 
luck with, so if it doesn't understand zstd, no zstd for me.  
(Regardless, being the slightly cautious type I'll very likely wait a 
couple kernel cycles before switching from lzo to zstd here, just to see 
if any weird reports about it from the first-out testers hit the list in 
the mean time.)

Meanwhile, it shouldn't need said but just in case, if you're using it, 
be sure you have backups /not/ using zstd, for at least a couple kernel 
cycles. =:^)




Re: Read before you deploy btrfs + zstd

2017-11-15 Thread Nick Terrell
On 11/15/17, 6:41 AM, "David Sterba"  wrote:
> The branch is now in a state that can be tested. Turns out the memory
> requirements are too much for grub, so the boot fails with "not enough
> memory". The calculated value
>
> ZSTD_BTRFS_MAX_INPUT: 131072
> ZSTD_DStreamWorkspaceBound with ZSTD_BTRFS_MAX_INPUT: 549424
>
> This is not something I could fix easily, we'd probably need a tuned
> version of ZSTD for grub constraints. Adding Nick to CC.

If I understand the grub code correctly, we only need to read, and we have
the entire input and output buffer in one segment. In that case you can use
ZSTD_initDCtx(), and ZSTD_decompressDCtx(). ZSTD_DCtxWorkspaceBound() is
only 155984. See decompress_single() in
https://patchwork.kernel.org/patch/9997909/ for an example.

This (uncompiled and untested) code should work:

```
static grub_ssize_t
grub_btrfs_zstd_decompress(char *ibuf, grub_size_t isize, grub_off_t off,
   char *obuf, grub_size_t osize)
{
  grub_size_t ret = 0;
  ZSTD_DCtx *ctx;
  void *wmem;
  grub_size_t wsize;

  wsize = ZSTD_DCtxWorkspaceBound();
  wmem = grub_malloc(wsize);
  if (!wmem)
{
   return -1;
}
  ctx = ZSTD_initDCtx(wmem, wsize);
  if (!ctx)
{
   grub_free(wmem);
   return -1;
}

  /* This is only necessary if there is junk after the end of the zstd
   * compressed data in the input buffer. Otherwise the return value
   * should always be exactly isize. If you delete this, and it always works,
   * then there isn't junk data at the end.
   */
  isize = ZSTD_findFrameCompressedSize(ibuf, isize);
  if (ZSTD_isError(isize))
{
   grub_free(wmem);
   return -1;
}

  ret = ZSTD_decompressDCtx(ctx, obuf, osize, ibuf, isize);
  grub_free(wmem);
  if (ZSTD_isError(ret))
{
   return -1;
}
  return ret;
}
```


Re: Read before you deploy btrfs + zstd

2017-11-15 Thread Duncan
Imran Geriskovan posted on Wed, 15 Nov 2017 20:16:56 +0200 as excerpted:

> On 11/15/17, Martin Steigerwald  wrote:
>> Somehow I am happy that I still have a plain Ext4 for /boot. :)
> 
> You may use uncompressed btrfs for /boot.
> Both Syslinux (my choice) and Grub supports it.

And I've been running grub2 with compress=lzo on my /boot for years, now, 
without issue.




[josef-btrfs:slab-priority 1/1] mm/vmscan.c:336: undefined reference to `__udivdi3'

2017-11-15 Thread kbuild test robot
tree:   https://git.kernel.org/pub/scm/linux/kernel/git/josef/btrfs-next.git 
slab-priority
head:   bd319083ec02fd19b9f3522935d3c6c0528e1864
commit: bd319083ec02fd19b9f3522935d3c6c0528e1864 [1/1] mm: use sc->priority for 
slab shrink targets
config: i386-randconfig-n0-201746 (attached as .config)
compiler: gcc-6 (Debian 6.4.0-9) 6.4.0 20171026
reproduce:
git checkout bd319083ec02fd19b9f3522935d3c6c0528e1864
# save the attached .config to linux build tree
make ARCH=i386 

All errors (new ones prefixed by >>):

   mm/vmscan.o: In function `do_shrink_slab':
>> mm/vmscan.c:336: undefined reference to `__udivdi3'

vim +336 mm/vmscan.c

   308  
   309  static unsigned long do_shrink_slab(struct shrink_control *shrinkctl,
   310  struct shrinker *shrinker, int 
priority)
   311  {
   312  unsigned long freed = 0;
   313  unsigned long long delta;
   314  long total_scan;
   315  long freeable;
   316  long nr;
   317  long new_nr;
   318  int nid = shrinkctl->nid;
   319  long batch_size = shrinker->batch ? shrinker->batch
   320  : SHRINK_BATCH;
   321  long scanned = 0, next_deferred;
   322  
   323  freeable = shrinker->count_objects(shrinker, shrinkctl);
   324  if (freeable == 0)
   325  return 0;
   326  
   327  /*
   328   * copy the current shrinker scan count into a local variable
   329   * and zero it so that other concurrent shrinker invocations
   330   * don't also do this scanning work.
   331   */
   332  nr = atomic_long_xchg(&shrinker->nr_deferred[nid], 0);
   333  
   334  total_scan = nr;
   335  delta = freeable >> priority;
 > 336  delta = (4 * delta) / shrinker->seeks;
   337  total_scan += delta;
   338  if (total_scan < 0) {
   339  pr_err("shrink_slab: %pF negative objects to delete 
nr=%ld\n",
   340 shrinker->scan_objects, total_scan);
   341  total_scan = freeable;
   342  next_deferred = nr;
   343  } else
   344  next_deferred = total_scan;
   345  
   346  /*
   347   * We need to avoid excessive windup on filesystem shrinkers
   348   * due to large numbers of GFP_NOFS allocations causing the
   349   * shrinkers to return -1 all the time. This results in a large
   350   * nr being built up so when a shrink that can do some work
   351   * comes along it empties the entire cache due to nr >>>
   352   * freeable. This is bad for sustaining a working set in
   353   * memory.
   354   *
   355   * Hence only allow the shrinker to scan the entire cache when
   356   * a large delta change is calculated directly.
   357   */
   358  if (delta < freeable / 4)
   359  total_scan = min(total_scan, freeable / 2);
   360  
   361  /*
   362   * Avoid risking looping forever due to too large nr value:
   363   * never try to free more than twice the estimate number of
   364   * freeable entries.
   365   */
   366  if (total_scan > freeable * 2)
   367  total_scan = freeable * 2;
   368  
   369  trace_mm_shrink_slab_start(shrinker, shrinkctl, nr,
   370 freeable, delta, total_scan, 
priority);
   371  
   372  /*
   373   * Normally, we should not scan less than batch_size objects in 
one
   374   * pass to avoid too frequent shrinker calls, but if the slab 
has less
   375   * than batch_size objects in total and we are really tight on 
memory,
   376   * we will try to reclaim all available objects, otherwise we 
can end
   377   * up failing allocations although there are plenty of 
reclaimable
   378   * objects spread over several slabs with usage less than the
   379   * batch_size.
   380   *
   381   * We detect the "tight on memory" situations by looking at the 
total
   382   * number of objects we want to scan (total_scan). If it is 
greater
   383   * than the total number of objects on slab (freeable), we must 
be
   384   * scanning at high prio and therefore should try to reclaim as 
much as
   385   * possible.
   386   */
   387  while (total_scan >= batch_size ||
   388 total_scan >= freeable) {
   389  unsigned long ret;
   390  unsigned long nr_to_scan = min(batch_size, total_scan);
   391  
   392  shrinkctl->nr_to_scan = nr_to_scan;
   393  shrinkctl->nr_scanned = nr_to_scan;
   394  ret = shrinker->scan_objects(shrinker, shrinkctl);
   395  

Re: Read before you deploy btrfs + zstd

2017-11-15 Thread Imran Geriskovan
On 11/15/17, Martin Steigerwald  wrote:
> Somehow I am happy that I still have a plain Ext4 for /boot. :)

You may use uncompressed btrfs for /boot.
Both Syslinux (my choice) and Grub support it.


[PATCH 1/6] btrfs: switch btrfs_trans_handle::adding_csums to bool

2017-11-15 Thread David Sterba
The semantics of adding_csums match bool; 'short' was most likely used
to save space in a698d0755adb6f2 ("Btrfs: add a type field for the
transaction handle").

Signed-off-by: David Sterba 
---
 fs/btrfs/inode.c   | 4 ++--
 fs/btrfs/transaction.h | 2 +-
 2 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index f352662e066f..3a32e782ccc4 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -2036,10 +2036,10 @@ static noinline int add_pending_csums(struct 
btrfs_trans_handle *trans,
struct btrfs_ordered_sum *sum;
 
list_for_each_entry(sum, list, list) {
-   trans->adding_csums = 1;
+   trans->adding_csums = true;
btrfs_csum_file_blocks(trans,
   BTRFS_I(inode)->root->fs_info->csum_root, sum);
-   trans->adding_csums = 0;
+   trans->adding_csums = false;
}
return 0;
 }
diff --git a/fs/btrfs/transaction.h b/fs/btrfs/transaction.h
index c55e44560103..a673142c003e 100644
--- a/fs/btrfs/transaction.h
+++ b/fs/btrfs/transaction.h
@@ -118,7 +118,7 @@ struct btrfs_trans_handle {
struct btrfs_block_rsv *block_rsv;
struct btrfs_block_rsv *orig_rsv;
short aborted;
-   short adding_csums;
+   bool adding_csums;
bool allocating_chunk;
bool can_flush_pending_bgs;
bool reloc_reserved;
-- 
2.14.3



[PATCH 0/6] Shrink some data structures

2017-11-15 Thread David Sterba
Unused members removed, other reordered.

struct btrfs_transaction:

-   /* size: 432, cachelines: 7, members: 27 */
-   /* sum members: 416, holes: 4, sum holes: 16 */
-   /* last cacheline: 48 bytes */
+   /* size: 416, cachelines: 7, members: 27 */
+   /* sum members: 412, holes: 1, sum holes: 4 */
+   /* last cacheline: 32 bytes */

struct btrfs_trans_handle:

-   /* size: 120, cachelines: 2, members: 20 */
-   /* sum members: 117, holes: 1, sum holes: 3 */
-   /* last cacheline: 56 bytes */
+   /* size: 104, cachelines: 2, members: 19 */
+   /* last cacheline: 40 bytes */

And incidentally the .text size goes down as well:

   textdata bss dec hex filename
 985430   75100   18560 1079090  107732 pre/btrfs.ko
 985243   75100   18560 1078903  107677 post/btrfs.ko

David Sterba (6):
  btrfs: switch btrfs_trans_handle::adding_csums to bool
  btrfs: remove unused member of btrfs_trans_handle
  btrfs: switch to refcount_t type for btrfs_trans_handle::use_count
  btrfs: reorder btrfs_trans_handle members for better packing
  btrfs: use narrower type for btrfs_transaction::num_dirty_bgs
  btrfs: reorder btrfs_transaction members for better packing

 fs/btrfs/extent-tree.c |  2 +-
 fs/btrfs/inode.c   |  4 ++--
 fs/btrfs/transaction.c | 12 ++--
 fs/btrfs/transaction.h | 11 +--
 4 files changed, 14 insertions(+), 15 deletions(-)

-- 
2.14.3



[PATCH 6/6] btrfs: reorder btrfs_transaction members for better packing

2017-11-15 Thread David Sterba
There are now 20 bytes of holes; we can reduce that to 4 with minor
changes. Moving 'aborted' next to the status and flags is also more
logical, and similarly for num_dirty_bgs. The size goes from 432 to 416.
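
The effect of such a reorder can be sketched with a hypothetical struct (names borrowed loosely from btrfs_transaction, not the real layout; pahole(1) reports the holes on real objects, as in the cover letter's size dumps):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * On a typical LP64 ABI, a 1-byte member followed by an 8-byte-aligned
 * member creates a 7-byte hole, and a lone 4-byte member before a
 * pointer creates another 4-byte hole.
 */
struct txn_scattered {
	bool aborted;			/* 1 byte + 7-byte hole on LP64 */
	long long start_time;		/* 8 bytes, 8-byte aligned      */
	unsigned int num_dirty_bgs;	/* 4 bytes + 4-byte hole        */
	void *fs_info;			/* 8 bytes on LP64              */
};

/* Largest/most-aligned members first; the small ones pack the tail. */
struct txn_packed {
	long long start_time;
	void *fs_info;
	unsigned int num_dirty_bgs;
	bool aborted;			/* only tail padding remains */
};
/* On common LP64 targets the saving is 32 -> 24 bytes. */
```
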

Signed-off-by: David Sterba 
---
 fs/btrfs/transaction.h | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/transaction.h b/fs/btrfs/transaction.h
index 1805fd101767..6beee072b1bd 100644
--- a/fs/btrfs/transaction.h
+++ b/fs/btrfs/transaction.h
@@ -58,6 +58,7 @@ struct btrfs_transaction {
 
/* Be protected by fs_info->trans_lock when we want to change it. */
enum btrfs_trans_state state;
+   int aborted;
struct list_head list;
struct extent_io_tree dirty_pages;
unsigned long start_time;
@@ -70,7 +71,6 @@ struct btrfs_transaction {
struct list_head dirty_bgs;
struct list_head io_bgs;
struct list_head dropped_roots;
-   unsigned int num_dirty_bgs;
 
/*
 * we need to make sure block group deletion doesn't race with
@@ -79,11 +79,11 @@ struct btrfs_transaction {
 */
struct mutex cache_write_mutex;
spinlock_t dirty_bgs_lock;
+   unsigned int num_dirty_bgs;
/* Protected by spin lock fs_info->unused_bgs_lock. */
struct list_head deleted_bgs;
spinlock_t dropped_roots_lock;
struct btrfs_delayed_ref_root delayed_refs;
-   int aborted;
struct btrfs_fs_info *fs_info;
 };
 
-- 
2.14.3



[PATCH 4/6] btrfs: reorder btrfs_trans_handle members for better packing

2017-11-15 Thread David Sterba
Recent updates to the structure left some holes; reorder the members so
the packing is tight. The size goes from 112 to 104 on 64bit.

Signed-off-by: David Sterba 
---
 fs/btrfs/transaction.h | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/transaction.h b/fs/btrfs/transaction.h
index afa88f035654..edf53112a6f2 100644
--- a/fs/btrfs/transaction.h
+++ b/fs/btrfs/transaction.h
@@ -111,11 +111,12 @@ struct btrfs_trans_handle {
u64 transid;
u64 bytes_reserved;
u64 chunk_bytes_reserved;
-   refcount_t use_count;
unsigned long delayed_ref_updates;
struct btrfs_transaction *transaction;
struct btrfs_block_rsv *block_rsv;
struct btrfs_block_rsv *orig_rsv;
+   refcount_t use_count;
+   unsigned int type;
short aborted;
bool adding_csums;
bool allocating_chunk;
@@ -123,7 +124,6 @@ struct btrfs_trans_handle {
bool reloc_reserved;
bool sync;
bool dirty;
-   unsigned int type;
struct btrfs_root *root;
struct btrfs_fs_info *fs_info;
struct list_head new_bgs;
-- 
2.14.3



[PATCH 2/6] btrfs: remove unused member of btrfs_trans_handle

2017-11-15 Thread David Sterba
Last user was removed in a monster commit a22285a6a32390195235171
("Btrfs: Integrate metadata reservation with start_transaction") in
2010.

Signed-off-by: David Sterba 
---
 fs/btrfs/transaction.h | 1 -
 1 file changed, 1 deletion(-)

diff --git a/fs/btrfs/transaction.h b/fs/btrfs/transaction.h
index a673142c003e..c48a4a03f1b4 100644
--- a/fs/btrfs/transaction.h
+++ b/fs/btrfs/transaction.h
@@ -112,7 +112,6 @@ struct btrfs_trans_handle {
u64 bytes_reserved;
u64 chunk_bytes_reserved;
unsigned long use_count;
-   unsigned long blocks_reserved;
unsigned long delayed_ref_updates;
struct btrfs_transaction *transaction;
struct btrfs_block_rsv *block_rsv;
-- 
2.14.3



[PATCH 3/6] btrfs: switch to refcount_t type for btrfs_trans_handle::use_count

2017-11-15 Thread David Sterba
The use_count is a reference counter, so we can use the refcount_t type,
though we don't use the atomicity. This is not performance-critical code
and we can catch underflows. The type is changed from long, but the
number of references will fit in an int.
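
For readers unfamiliar with why a transaction handle is reference counted at all: start_transaction() joins an already-running handle via current->journal_info, as the diff below shows. A minimal userspace sketch of that lifecycle (hypothetical names; a plain unsigned int stands in for refcount_t, which additionally saturates and warns on over/underflow, the motivation given above):

```c
#include <assert.h>
#include <stddef.h>

struct trans_handle_sketch {
	unsigned int use_count;
	int open;
};

/* Stand-in for current->journal_info. */
static struct trans_handle_sketch *journal_info_sketch;

static struct trans_handle_sketch *
start_transaction_sketch(struct trans_handle_sketch *h)
{
	if (journal_info_sketch) {
		/* Nested join: reuse the current handle and bump the count
		 * (refcount_inc(&h->use_count) in the patched code). */
		journal_info_sketch->use_count++;
		return journal_info_sketch;
	}
	h->use_count = 1;	/* refcount_set(&h->use_count, 1) */
	h->open = 1;
	journal_info_sketch = h;
	return h;
}

static void end_transaction_sketch(struct trans_handle_sketch *h)
{
	if (h->use_count > 1) {
		/* The refcount_read()/refcount_dec() pair in the real code. */
		h->use_count--;
		return;
	}
	h->use_count = 0;	/* last reference: tear the handle down */
	h->open = 0;
	journal_info_sketch = NULL;
}
```
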

Signed-off-by: David Sterba 
---
 fs/btrfs/transaction.c | 12 ++--
 fs/btrfs/transaction.h |  2 +-
 2 files changed, 7 insertions(+), 7 deletions(-)

diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
index dac688c696c3..6348573e26a7 100644
--- a/fs/btrfs/transaction.c
+++ b/fs/btrfs/transaction.c
@@ -495,8 +495,8 @@ start_transaction(struct btrfs_root *root, unsigned int 
num_items,
if (current->journal_info) {
WARN_ON(type & TRANS_EXTWRITERS);
h = current->journal_info;
-   h->use_count++;
-   WARN_ON(h->use_count > 2);
+   refcount_inc(&h->use_count);
+   WARN_ON(refcount_read(&h->use_count) > 2);
h->orig_rsv = h->block_rsv;
h->block_rsv = NULL;
goto got_it;
@@ -567,7 +567,7 @@ start_transaction(struct btrfs_root *root, unsigned int 
num_items,
h->transid = cur_trans->transid;
h->transaction = cur_trans;
h->root = root;
-   h->use_count = 1;
+   refcount_set(&h->use_count, 1);
h->fs_info = root->fs_info;
 
h->type = type;
@@ -837,8 +837,8 @@ static int __btrfs_end_transaction(struct 
btrfs_trans_handle *trans,
int err = 0;
int must_run_delayed_refs = 0;
 
-   if (trans->use_count > 1) {
-   trans->use_count--;
+   if (refcount_read(&trans->use_count) > 1) {
+   refcount_dec(&trans->use_count);
trans->block_rsv = trans->orig_rsv;
return 0;
}
@@ -1868,7 +1868,7 @@ static void cleanup_transaction(struct btrfs_trans_handle 
*trans,
struct btrfs_transaction *cur_trans = trans->transaction;
DEFINE_WAIT(wait);
 
-   WARN_ON(trans->use_count > 1);
+   WARN_ON(refcount_read(&trans->use_count) > 1);
 
btrfs_abort_transaction(trans, err);
 
diff --git a/fs/btrfs/transaction.h b/fs/btrfs/transaction.h
index c48a4a03f1b4..afa88f035654 100644
--- a/fs/btrfs/transaction.h
+++ b/fs/btrfs/transaction.h
@@ -111,7 +111,7 @@ struct btrfs_trans_handle {
u64 transid;
u64 bytes_reserved;
u64 chunk_bytes_reserved;
-   unsigned long use_count;
+   refcount_t use_count;
unsigned long delayed_ref_updates;
struct btrfs_transaction *transaction;
struct btrfs_block_rsv *block_rsv;
-- 
2.14.3



[PATCH 5/6] btrfs: use narrower type for btrfs_transaction::num_dirty_bgs

2017-11-15 Thread David Sterba
A u64 is overkill here; we could not possibly create that many
block groups in one transaction.

Signed-off-by: David Sterba 
---
 fs/btrfs/extent-tree.c | 2 +-
 fs/btrfs/transaction.h | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 11b7d613140b..d59a65852447 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -2894,7 +2894,7 @@ int btrfs_check_space_for_delayed_refs(struct 
btrfs_trans_handle *trans,
struct btrfs_block_rsv *global_rsv;
u64 num_heads = trans->transaction->delayed_refs.num_heads_ready;
u64 csum_bytes = trans->transaction->delayed_refs.pending_csums;
-   u64 num_dirty_bgs = trans->transaction->num_dirty_bgs;
+   unsigned int num_dirty_bgs = trans->transaction->num_dirty_bgs;
u64 num_bytes, num_dirty_bgs_bytes;
int ret = 0;
 
diff --git a/fs/btrfs/transaction.h b/fs/btrfs/transaction.h
index edf53112a6f2..1805fd101767 100644
--- a/fs/btrfs/transaction.h
+++ b/fs/btrfs/transaction.h
@@ -70,7 +70,7 @@ struct btrfs_transaction {
struct list_head dirty_bgs;
struct list_head io_bgs;
struct list_head dropped_roots;
-   u64 num_dirty_bgs;
+   unsigned int num_dirty_bgs;
 
/*
 * we need to make sure block group deletion doesn't race with
-- 
2.14.3



[PATCH] btrfs: move some zstd work data from stack to workspace

2017-11-15 Thread David Sterba
* ZSTD_inBuffer in_buf
* ZSTD_outBuffer out_buf

are used in all functions to pass the compression parameters and the
local variables consume some space. We can move them to the workspace
and reduce the stack consumption:

zstd.c:zstd_decompress-24 (136 -> 112)
zstd.c:zstd_decompress_bio-24 (144 -> 120)
zstd.c:zstd_compress_pages-24 (264 -> 240)

Signed-off-by: David Sterba 
---
 fs/btrfs/zstd.c | 132 
 1 file changed, 67 insertions(+), 65 deletions(-)

diff --git a/fs/btrfs/zstd.c b/fs/btrfs/zstd.c
index 17f2dd8fddb8..01a4eab602a3 100644
--- a/fs/btrfs/zstd.c
+++ b/fs/btrfs/zstd.c
@@ -43,6 +43,8 @@ struct workspace {
size_t size;
char *buf;
struct list_head list;
+   ZSTD_inBuffer in_buf;
+   ZSTD_outBuffer out_buf;
 };
 
 static void zstd_free_workspace(struct list_head *ws)
@@ -94,8 +96,6 @@ static int zstd_compress_pages(struct list_head *ws,
int nr_pages = 0;
struct page *in_page = NULL;  /* The current page to read */
struct page *out_page = NULL; /* The current page to write to */
-   ZSTD_inBuffer in_buf = { NULL, 0, 0 };
-   ZSTD_outBuffer out_buf = { NULL, 0, 0 };
unsigned long tot_in = 0;
unsigned long tot_out = 0;
unsigned long len = *total_out;
@@ -118,9 +118,9 @@ static int zstd_compress_pages(struct list_head *ws,
 
/* map in the first page of input data */
in_page = find_get_page(mapping, start >> PAGE_SHIFT);
-   in_buf.src = kmap(in_page);
-   in_buf.pos = 0;
-   in_buf.size = min_t(size_t, len, PAGE_SIZE);
+   workspace->in_buf.src = kmap(in_page);
+   workspace->in_buf.pos = 0;
+   workspace->in_buf.size = min_t(size_t, len, PAGE_SIZE);
 
 
/* Allocate and map in the output buffer */
@@ -130,14 +130,15 @@ static int zstd_compress_pages(struct list_head *ws,
goto out;
}
pages[nr_pages++] = out_page;
-   out_buf.dst = kmap(out_page);
-   out_buf.pos = 0;
-   out_buf.size = min_t(size_t, max_out, PAGE_SIZE);
+   workspace->out_buf.dst = kmap(out_page);
+   workspace->out_buf.pos = 0;
+   workspace->out_buf.size = min_t(size_t, max_out, PAGE_SIZE);
 
while (1) {
size_t ret2;
 
-   ret2 = ZSTD_compressStream(stream, &out_buf, &in_buf);
+   ret2 = ZSTD_compressStream(stream, &workspace->out_buf,
+   &workspace->in_buf);
if (ZSTD_isError(ret2)) {
pr_debug("BTRFS: ZSTD_compressStream returned %d\n",
ZSTD_getErrorCode(ret2));
@@ -146,22 +147,22 @@ static int zstd_compress_pages(struct list_head *ws,
}
 
/* Check to see if we are making it bigger */
-   if (tot_in + in_buf.pos > 8192 &&
-   tot_in + in_buf.pos <
-   tot_out + out_buf.pos) {
+   if (tot_in + workspace->in_buf.pos > 8192 &&
+   tot_in + workspace->in_buf.pos <
+   tot_out + workspace->out_buf.pos) {
ret = -E2BIG;
goto out;
}
 
/* We've reached the end of our output range */
-   if (out_buf.pos >= max_out) {
-   tot_out += out_buf.pos;
+   if (workspace->out_buf.pos >= max_out) {
+   tot_out += workspace->out_buf.pos;
ret = -E2BIG;
goto out;
}
 
/* Check if we need more output space */
-   if (out_buf.pos == out_buf.size) {
+   if (workspace->out_buf.pos == workspace->out_buf.size) {
tot_out += PAGE_SIZE;
max_out -= PAGE_SIZE;
kunmap(out_page);
@@ -176,19 +177,20 @@ static int zstd_compress_pages(struct list_head *ws,
goto out;
}
pages[nr_pages++] = out_page;
-   out_buf.dst = kmap(out_page);
-   out_buf.pos = 0;
-   out_buf.size = min_t(size_t, max_out, PAGE_SIZE);
+   workspace->out_buf.dst = kmap(out_page);
+   workspace->out_buf.pos = 0;
+   workspace->out_buf.size = min_t(size_t, max_out,
+   PAGE_SIZE);
}
 
/* We've reached the end of the input */
-   if (in_buf.pos >= len) {
-   tot_in += in_buf.pos;
+   if (workspace->in_buf.pos >= len) {
+   tot_in += workspace->in_buf.pos;
break;
}
 
/* Check if we 

Re: [PATCH 0/4] factor __btrfs_open_devices()

2017-11-15 Thread David Sterba
On Thu, Nov 09, 2017 at 11:45:22PM +0800, Anand Jain wrote:
> This is in preparation to help bring the missing device back to the
> alloc list dynamically. As of now if we run 'btrfs dev scan '
> on a mounted FS, nothing happens, there is no code which handles this
> condition. So the idea is to fix it by running through the device open
> process for the reappeared missing device.
> 
> So this patch separates device open steps into a separate function so that
> it can be called for a device, instead of for all the devices in the
> dev_list. As this function is in the critical mount section, and to show
> the src code changes as clearly as possible, I have created 4 no-functional
> changes patches, which can be merged together if needed.

The splitting is fine, I've only merged 2 and 4 to one, but otherwise
it's clear from each patch if there are no functional changes. 1,3,2+4
added to next, thanks.


Re: [PATCH 2/4] btrfs: let variable required be declared inside the loop

2017-11-15 Thread David Sterba
On Thu, Nov 09, 2017 at 05:53:58PM +0200, Nikolay Borisov wrote:
> On  9.11.2017 17:45, Anand Jain wrote:
> > A preparation patch to create the actual device open in a new function.
> Just say you are reducing the visibility of some variables to the loop.

This patch is IMHO too fine-grained and can be folded with the last one.
A preparatory change like this would make sense if there are some
non-obvious consequences, but here it's just moving the declaration
block to a new function, that was easy to review.


Re: [PATCH 3/7] btrfs: use i_size_read() instead of open code

2017-11-15 Thread David Sterba
On Tue, Nov 07, 2017 at 07:44:59AM +0800, Anand Jain wrote:
> 
> 
> On 11/07/2017 12:52 AM, David Sterba wrote:
> > On Mon, Nov 06, 2017 at 04:36:14PM +0800, Anand Jain wrote:
> >> As i_size_read() takes care of 32bit smp or preempt cases as well.
> > 
> > Can bdev->bd_inode->i_size change so that we need to use the
> > i_size_read()? My answer is 'no'.
> 
>   Hm. Right I was looking at it only from the theoretical point of view.
> 
>   And I presume you mean to say disk resize at the block layer is not
>   really a practically achievable concern.

There are some safety checks, at least if there are bios in flight on
the partition when it's mounted.


Re: [PATCH] btrfs: ref-verify: Remove unused parameter from walk_up_tree() to kill warning

2017-11-15 Thread David Sterba
On Wed, Nov 15, 2017 at 04:04:40PM +0100, Geert Uytterhoeven wrote:
> With gcc-4.1.2:
> 
> fs/btrfs/ref-verify.c: In function ‘btrfs_build_ref_tree’:
> fs/btrfs/ref-verify.c:1017: warning: ‘root’ is used uninitialized in this function
> 
> The variable is indeed passed uninitialized, but it is never used by the
> callee.  However, not all versions of gcc are smart enough to notice.
> 
> Hence remove the unused parameter from walk_up_tree() to silence the
> compiler warning.
> 
> Signed-off-by: Geert Uytterhoeven 

Reviewed-by: David Sterba 


Re: Read before you deploy btrfs + zstd

2017-11-15 Thread Martin Steigerwald
David Sterba - 15.11.17, 15:39:
> On Tue, Nov 14, 2017 at 07:53:31PM +0100, David Sterba wrote:
> > On Mon, Nov 13, 2017 at 11:50:46PM +0100, David Sterba wrote:
> > > Up to now, there are no bootloaders supporting ZSTD.
> > 
> > I've tried to implement the support to GRUB, still incomplete and hacky
> > but most of the code is there.  The ZSTD implementation is copied from
> > kernel. The allocators need to be properly set up, as it needs to use
> > grub_malloc/grub_free for the workspace that's called from some ZSTD_*
> > functions.
> > 
> > https://github.com/kdave/grub/tree/btrfs-zstd
> 
> The branch is now in a state that can be tested. Turns out the memory
> requirements are too much for grub, so the boot fails with "not enough
> memory". The calculated values:
> 
> ZSTD_BTRFS_MAX_INPUT: 131072
> ZSTD_DStreamWorkspaceBound with ZSTD_BTRFS_MAX_INPUT: 549424
> 
> This is not something I could fix easily, we'd probably need a tuned
> version of ZSTD for grub's constraints. Adding Nick to CC.

Somehow I am happy that I still have a plain Ext4 for /boot. :)

Thanks for looking into Grub support anyway.

Thanks,
-- 
Martin


Re: [PATCH v2 2/3] btrfs: tree-checker: Add checker for dir item

2017-11-15 Thread David Sterba
On Wed, Nov 08, 2017 at 08:54:25AM +0800, Qu Wenruo wrote:
> Add checker for dir item, for key types DIR_ITEM, DIR_INDEX and
> XATTR_ITEM.
> 
> This checker does comprehensive check for:
> 1) dir_item header and its data size
>Against item boundary and maximum name/xattr length.
>This part is mostly the same as old verify_dir_item().
> 
> 2) dir_type
>Against maximum file types, and against key type.
>Since XATTR key should only have FT_XATTR dir item, and normal dir
>item type should not have XATTR key.
> 
>The check between key->type and dir_type is newly introduced by this
>patch.
> 
> 3) name hash
>For XATTR and DIR_ITEM key, key->offset is name hash (crc32).
>Check the hash of name against key to ensure it's correct.
> 
>The name hash check is only found in btrfs-progs before this patch.
> 
> Signed-off-by: Qu Wenruo 

Reviewed-by: David Sterba 


Re: [PATCH v2 1/3] btrfs: tree-checker: Fix false panic for sanity test

2017-11-15 Thread David Sterba
On Wed, Nov 08, 2017 at 08:54:24AM +0800, Qu Wenruo wrote:
> [BUG]
> If we run btrfs with CONFIG_BTRFS_FS_RUN_SANITY_TESTS=y, it will
> instantly cause kernel panic like:
> 
> --
> ...
> assertion failed: 0, file: fs/btrfs/disk-io.c, line: 3853
> ...
> Call Trace:
>  btrfs_mark_buffer_dirty+0x187/0x1f0 [btrfs]
>  setup_items_for_insert+0x385/0x650 [btrfs]
>  __btrfs_drop_extents+0x129a/0x1870 [btrfs]
> ...
> -
> 
> [Cause]
> Btrfs will call btrfs_check_leaf() in btrfs_mark_buffer_dirty() to check
> if the leaf is valid with CONFIG_BTRFS_FS_RUN_SANITY_TESTS=y.
> 
> However quite a few btrfs_mark_buffer_dirty() callers(*) don't really
> initialize their item data but only initialize their item pointers, leaving
> the item data uninitialized.
> 
> This makes the tree-checker catch uninitialized data as an error, causing
> such a panic.
> 
> *: These callers include, but are not limited to:
> setup_items_for_insert()
> btrfs_split_item()
> btrfs_expand_item()
> 
> [Fix]
> Add a new parameter @check_item_data to btrfs_check_leaf().
> With @check_item_data set to false, item data check will be skipped and
> fallback to old btrfs_check_leaf() behavior.
> 
> So we can still get early warning if we screw up item pointers, and
> avoid false panic.
> 
> Cc: Filipe Manana 
> Reported-by: Lakshmipathi.G 
> Signed-off-by: Qu Wenruo 

Reviewed-by: David Sterba 


Re: [PATCH v2 3/3] btrfs: Cleanup existing name_len checks

2017-11-15 Thread David Sterba
On Wed, Nov 08, 2017 at 08:54:26AM +0800, Qu Wenruo wrote:
> Since tree-checker has verified leaf when reading from disk, we don't
> need the existing verify_dir_item() or btrfs_is_name_len_valid().
> 
> Signed-off-by: Qu Wenruo 

Reviewed-by: David Sterba 


Re: [PATCH v2 1/3] btrfs: tree-checker: Fix false panic for sanity test

2017-11-15 Thread David Sterba
On Wed, Nov 08, 2017 at 04:03:35PM +0800, Qu Wenruo wrote:
> >> +int btrfs_check_leaf_full(struct btrfs_root *root, struct extent_buffer 
> >> *leaf)
> >> +{
> >> +  return check_leaf(root, leaf, true);
> >> +}
> >> +
> >> +int btrfs_check_leaf_relaxed(struct btrfs_root *root,
> >> +   struct extent_buffer *leaf)
> >> +{
> >> +  return check_leaf(root, leaf, false);
> >> +}
> > 
> > Presumably the compiler will figure it out but such trivial function are
> > usually defined straight into the header file so that the compiler
> > inlines them.
> 
> In that case, the function check_leaf() must be exported so that we can
> inline it in the header.
> 
> But exporting check_leaf() increases the possibility for callers to use
> it incorrectly, so I prefer not to export any internally used function.
> 
> 
> And the compiler may or may not inline check_leaf() into the
> btrfs_check_leaf_relaxed() function, but it doesn't matter.
> 
> If an optimization can be done by the compiler, then let the compiler do it.
> 
> What we should do is ensure the abstraction/interface design is good
> enough, rather than doing possible "over-optimization".

Though my original idea was closer to what Nikolay says, I'm fine with
the way you've actually implemented it. We don't need the extended
check_leaf version that's hidden in the .c file.

The function inlining possibilities will be limited, but this is not
performance-critical code where the effects of inlining could be
observable in practice (I think; no numbers to back that up).


Re: Btrfs: add a extent ref verify tool

2017-11-15 Thread Geert Uytterhoeven
On Tue, Nov 14, 2017 at 11:46 PM, Linux Kernel Mailing List
 wrote:
> Web:
> https://git.kernel.org/torvalds/c/fd708b81d972a0714b02a60eb4792fdbf15868c4
> Commit: fd708b81d972a0714b02a60eb4792fdbf15868c4
> Parent: 84f7d8e6242ceb377c7af10a7133c653cc7fea5f
> Refname:refs/heads/master
> Author: Josef Bacik 
> AuthorDate: Fri Sep 29 15:43:50 2017 -0400
> Committer:  David Sterba 
> CommitDate: Mon Oct 30 12:28:00 2017 +0100
>
> Btrfs: add a extent ref verify tool
>
> We were having corruption issues that were tied back to problems with
> the extent tree.  In order to track them down I built this tool to try
> and find the culprit, which was pretty successful.  If you compile with
> this tool on it will live verify every ref update that the fs makes and
> make sure it is consistent and valid.  I've run this through with
> xfstests and haven't gotten any false positives.  Thanks,
>
> Signed-off-by: Josef Bacik 
> Reviewed-by: David Sterba 
> [ update error messages, add fixup from Dan Carpenter to handle errors
>   of read_tree_block ]
> Signed-off-by: David Sterba 

> --- /dev/null
> +++ b/fs/btrfs/ref-verify.c

> +static int process_extent_item(struct btrfs_fs_info *fs_info,
> +  struct btrfs_path *path, struct btrfs_key *key,
> +  int slot, int *tree_block_level)
> +{
> +   struct btrfs_extent_item *ei;
> +   struct btrfs_extent_inline_ref *iref;
> +   struct btrfs_extent_data_ref *dref;
> +   struct btrfs_shared_data_ref *sref;
> +   struct extent_buffer *leaf = path->nodes[0];
> +   u32 item_size = btrfs_item_size_nr(leaf, slot);
> +   unsigned long end, ptr;
> +   u64 offset, flags, count;
> +   int type, ret;

With gcc-4.1.2:

warning: ‘ret’ may be used uninitialized in this function

> +
> +   ei = btrfs_item_ptr(leaf, slot, struct btrfs_extent_item);
> +   flags = btrfs_extent_flags(leaf, ei);
> +
> +   if ((key->type == BTRFS_EXTENT_ITEM_KEY) &&
> +   flags & BTRFS_EXTENT_FLAG_TREE_BLOCK) {
> +   struct btrfs_tree_block_info *info;
> +
> +   info = (struct btrfs_tree_block_info *)(ei + 1);
> +   *tree_block_level = btrfs_tree_block_level(leaf, info);
> +   iref = (struct btrfs_extent_inline_ref *)(info + 1);
> +   } else {
> +   if (key->type == BTRFS_METADATA_ITEM_KEY)
> +   *tree_block_level = key->offset;
> +   iref = (struct btrfs_extent_inline_ref *)(ei + 1);
> +   }
> +
> +   ptr = (unsigned long)iref;
> +   end = (unsigned long)ei + item_size;
> +   while (ptr < end) {

Is this loop guaranteed to run at least once?
If not:
  1. uninitialized ret will be returned,
  2. To which value should ret be preinitialized? 0 or an error?

As I understand it, this is a verification tool, so it should be implemented
using defensive programming.

> +   iref = (struct btrfs_extent_inline_ref *)ptr;
> +   type = btrfs_extent_inline_ref_type(leaf, iref);
> +   offset = btrfs_extent_inline_ref_offset(leaf, iref);
> +   switch (type) {
> +   case BTRFS_TREE_BLOCK_REF_KEY:
> +   ret = add_tree_block(fs_info, offset, 0, key->objectid,
> +*tree_block_level);
> +   break;
> +   case BTRFS_SHARED_BLOCK_REF_KEY:
> +   ret = add_tree_block(fs_info, 0, offset, key->objectid,
> +*tree_block_level);
> +   break;
> +   case BTRFS_EXTENT_DATA_REF_KEY:
> +   dref = (struct btrfs_extent_data_ref *)(&iref->offset);
> +   ret = add_extent_data_ref(fs_info, leaf, dref,
> + key->objectid, key->offset);
> +   break;
> +   case BTRFS_SHARED_DATA_REF_KEY:
> +   sref = (struct btrfs_shared_data_ref *)(iref + 1);
> +   count = btrfs_shared_data_ref_count(leaf, sref);
> +   ret = add_shared_data_ref(fs_info, offset, count,
> + key->objectid, key->offset);
> +   break;
> +   default:
> +   btrfs_err(fs_info, "invalid key type in iref");
> +   ret = -EINVAL;
> +   break;
> +   }
> +   if (ret)
> +   break;
> +   ptr += btrfs_extent_inline_ref_size(type);
> +   }
> +   return ret;
> +}
> +
> +static int process_leaf(struct btrfs_root *root,
> +   struct btrfs_path *path, u64 *bytenr, u64 

Re: [PATCH v2 2/2] Btrfs: fix reported number of inode blocks after buffered append writes

2017-11-15 Thread David Sterba
On Sat, Nov 04, 2017 at 04:20:07AM +, fdman...@kernel.org wrote:
> From: Filipe Manana 
> 
> The patch from commit a7e3b975a0f9 ("Btrfs: fix reported number of inode
> blocks") introduced a regression where if we do a buffered write starting
> at position equal to or greater than the file's size and then stat(2) the
> file before writeback is triggered, the number of used blocks does not
> change (unless there's a prealloc/unwritten extent). Example:
> 
>   $ xfs_io -f -c "pwrite -S 0xab 0 64K" foobar
>   $ du -h foobar
>   0   foobar
>   $ sync
>   $ du -h foobar
>   64K foobar
> 
> The first version of that patch didn't have this regression, and the second
> version, which was the one committed, was made only to address a
> performance regression detected by the intel test robots using fs_mark.
> 
> This fixes the regression by setting the new delalloc bit in the range, and
> doing it at btrfs_dirty_pages() while setting the regular delalloc bit as
> well, so that this way we set both bits at once, avoiding navigating the
> inode's io tree twice. Doing it at btrfs_dirty_pages() is also the most
> meaningful place, as we should set the new delalloc bit if we set the
> delalloc bit, which happens only if we copied bytes into the pages at
> __btrfs_buffered_write().
> 
> This was making some of LTP's du tests fail, which can be quickly run
> using a command line like the following:
> 
>   $ ./runltp -q -p -l /ltp.log -f commands -s du -d /mnt
> 
> Fixes: a7e3b975a0f9 ("Btrfs: fix reported number of inode blocks")
> Signed-off-by: Filipe Manana 

FYI, I'm going to add the two patches to a 4.15 queue.


Re: [PATCH 1/2] btrfs: Fix wild memory access in compression level parser

2017-11-15 Thread David Sterba
On Mon, Nov 06, 2017 at 10:43:18AM +0800, Qu Wenruo wrote:
> [BUG]
> Kernel panic when mounting with "-o compress" mount option.
> KASAN will report like:
> --
> ==
> BUG: KASAN: wild-memory-access in strncmp+0x31/0xc0
> Read of size 1 at addr d86735fce994f800 by task mount/662
> ...
> Call Trace:
>  dump_stack+0xe3/0x175
>  kasan_report+0x163/0x370
>  __asan_load1+0x47/0x50
>  strncmp+0x31/0xc0
>  btrfs_compress_str2level+0x20/0x70 [btrfs]
>  btrfs_parse_options+0xff4/0x1870 [btrfs]
>  open_ctree+0x2679/0x49f0 [btrfs]
>  btrfs_mount+0x1b7f/0x1d30 [btrfs]
>  mount_fs+0x49/0x190
>  vfs_kern_mount.part.29+0xba/0x280
>  vfs_kern_mount+0x13/0x20
>  btrfs_mount+0x31e/0x1d30 [btrfs]
>  mount_fs+0x49/0x190
>  vfs_kern_mount.part.29+0xba/0x280
>  do_mount+0xaad/0x1a00
>  SyS_mount+0x98/0xe0
>  entry_SYSCALL_64_fastpath+0x1f/0xbe
> --
> 
> [Cause]
> For the 'compress' and 'compress_force' options, the token doesn't expect
> any parameter, so its args[0] contains uninitialized data.
> Accessing args[0] will cause the above wild memory access.
> 
> [Fix]
> For Opt_compress and Opt_compress_force, set compression level to
> Z_DEFAULT_COMPRESSION manually.
> 
> NOTE: Don't set zlib compression level to 0 by default, which means no
> compression.

But we never set the level to 0 at the point the compression actually
happens. See zlib.c:zlib_set_level, if level is 0 then the level
passed to zlib is 3. Z_DEFAULT_COMPRESSION is upstream zlib level 6,
which is slower, we need zlib to stay in the real-time numbers.

> @@ -507,8 +508,19 @@ int btrfs_parse_options(struct btrfs_fs_info *info, char *options,
>   token == Opt_compress_force ||
>   strncmp(args[0].from, "zlib", 4) == 0) {
>   compress_type = "zlib";
> +
>   info->compress_type = BTRFS_COMPRESS_ZLIB;
> - info->compress_level =
> + /*
> +  * args[0] contains uninitialized data since
> +  * for these tokens we don't expect any
> +  * parameter.
> +  */
> + if (token == Opt_compress ||
> + token == Opt_compress_force)
> + info->compress_level =
> + Z_DEFAULT_COMPRESSION;
> + else
> + info->compress_level =
>   btrfs_compress_str2level(args[0].from);

At least this will not screw up the levels, anything that's not
recognized will become the default.

>   btrfs_set_opt(info->mount_opt, COMPRESS);
>   btrfs_clear_opt(info->mount_opt, NODATACOW);


[PATCH] btrfs: ref-verify: Remove unused parameter from walk_up_tree() to kill warning

2017-11-15 Thread Geert Uytterhoeven
With gcc-4.1.2:

fs/btrfs/ref-verify.c: In function ‘btrfs_build_ref_tree’:
fs/btrfs/ref-verify.c:1017: warning: ‘root’ is used uninitialized in this function

The variable is indeed passed uninitialized, but it is never used by the
callee.  However, not all versions of gcc are smart enough to notice.

Hence remove the unused parameter from walk_up_tree() to silence the
compiler warning.

Signed-off-by: Geert Uytterhoeven 
---
 fs/btrfs/ref-verify.c | 6 ++
 1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/fs/btrfs/ref-verify.c b/fs/btrfs/ref-verify.c
index 34878699d363e641..171f3cce30e6badc 100644
--- a/fs/btrfs/ref-verify.c
+++ b/fs/btrfs/ref-verify.c
@@ -606,8 +606,7 @@ static int walk_down_tree(struct btrfs_root *root, struct btrfs_path *path,
 }
 
 /* Walk up to the next node that needs to be processed */
-static int walk_up_tree(struct btrfs_root *root, struct btrfs_path *path,
-   int *level)
+static int walk_up_tree(struct btrfs_path *path, int *level)
 {
int l;
 
@@ -984,7 +983,6 @@ void btrfs_free_ref_tree_range(struct btrfs_fs_info *fs_info, u64 start,
 int btrfs_build_ref_tree(struct btrfs_fs_info *fs_info)
 {
struct btrfs_path *path;
-   struct btrfs_root *root;
struct extent_buffer *eb;
u64 bytenr = 0, num_bytes = 0;
int ret, level;
@@ -1014,7 +1012,7 @@ int btrfs_build_ref_tree(struct btrfs_fs_info *fs_info)
 &bytenr, &num_bytes);
if (ret)
break;
-   ret = walk_up_tree(root, path, &level);
+   ret = walk_up_tree(path, &level);
if (ret < 0)
break;
if (ret > 0) {
-- 
2.7.4



Re: Read before you deploy btrfs + zstd

2017-11-15 Thread David Sterba
On Tue, Nov 14, 2017 at 07:53:31PM +0100, David Sterba wrote:
> On Mon, Nov 13, 2017 at 11:50:46PM +0100, David Sterba wrote:
> > Up to now, there are no bootloaders supporting ZSTD.
> 
> I've tried to implement the support to GRUB, still incomplete and hacky
> but most of the code is there.  The ZSTD implementation is copied from
> kernel. The allocators need to be properly set up, as it needs to use
> grub_malloc/grub_free for the workspace thats called from some ZSTD_*
> functions.
> 
> https://github.com/kdave/grub/tree/btrfs-zstd

The branch is now in a state that can be tested. Turns out the memory
requirements are too much for grub, so the boot fails with "not enough
memory". The calculated values:

ZSTD_BTRFS_MAX_INPUT: 131072
ZSTD_DStreamWorkspaceBound with ZSTD_BTRFS_MAX_INPUT: 549424

This is not something I could fix easily, we'd probably need a tuned
version of ZSTD for grub's constraints. Adding Nick to CC.


Re: Tiered storage?

2017-11-15 Thread Roy Sigurd Karlsbakk
>> As for dedupe there is (to my knowledge) nothing fully automatic yet.
>> You have to run a program to scan your filesystem but all the
>> deduplication is done in the kernel.
>> duperemove works apparently quite well when I tested it, but there may
>> be some performance implications.
> Correct, there is nothing automatic (and there are pretty significant
> arguments against doing automatic deduplication in most cases), but the
> off-line options (via the EXTENT_SAME ioctl) are reasonably reliable.
> Duperemove in particular does a good job, though it may take a long time
> for large data sets.
> 
> As far as performance, it's no worse than large numbers of snapshots.
> The issues arise from using very large numbers of reflinks.

What is this "large" number of snapshots? Not that it's directly comparable,
but I've worked with ZFS for a while and haven't seen those issues there.

Vennlig hilsen

roy
--
Roy Sigurd Karlsbakk
(+47) 98013356
http://blogg.karlsbakk.net/
GPG Public key: http://karlsbakk.net/roysigurdkarlsbakk.pubkey.txt
--
Hið góða skaltu í stein höggva, hið illa í snjó rita.


Re: [RFC PATCH 0/3] btrfs-progs: Add support for running the tools from

2017-11-15 Thread David Sterba
On Wed, Nov 01, 2017 at 01:42:42AM +, Gu, Jinxiang wrote:
> > On 2017年10月31日 16:43, Gu Jinxiang wrote:
> > > Add support for running the tools from a given path (for
> > > example,/usr/bin) by setting $EXEC while running tests.
> > >
> > > Achieved:
> > > Specify the location of binary and run test like this:
> > > $ make EXEC=/usr/bin/ test
> > Any special reason to test system installed btrfs?
> > Even for the system QA case, I think the self test should be run at
> > packaging time rather than after packaging.

This is supposed to extend the testing possibilities. The packaging
phase could run some tests, many tools run post-build tests, but we need
to manage devices and require root in general. This is not always
possible, besides that it requires to run from the sources directory.

> Just implement the project list below in github.
> .
> Add support for running the tools from a given path (git, /usr/bin, 
> /usr/local/bin), so the testsuite can be run independently.
> .

The idea is to export the testsuite files to a separate tar, run from a
script. In that case the system binaries will be preferred.

I've promoted the project task to 
https://github.com/kdave/btrfs-progs/issues/77 .


Re: zstd compression

2017-11-15 Thread Austin S. Hemmelgarn

On 2017-11-15 05:35, Imran Geriskovan wrote:

On 11/15/17, Lukas Pirl  wrote:

you might be interested in the thread "Read before you deploy
btrfs + zstd"¹.


Thanks. I've read it. Bootloader is not an issue since /boot is on
another uncompressed fs.

Let me make my question more generic:

Can there be any issues when switching mount-time
compression options from one to another, in any order?
(i.e. none -> lzo -> zlib -> zstd -> none -> ...)

zstd is only a newcomer so my question applies to all
combinations..
The 'compress' and 'compress-force' mount options only impact newly 
written data.  The compression used is stored with the metadata for the 
extents themselves, so any existing data on the volume will be read just 
fine with whatever compression method it was written with, while new 
data will be written with the specified compression method.


If you want to convert existing files, you can use the '-c' option to 
the defrag command to do so.


Aside from this, there is one other thing to keep in mind about zstd 
which I mentioned later in the above mentioned thread.  Most system 
recovery tools do not yet have a new enough version of the kernel and/or 
 btrfs-progs to be able to access BTRFS volumes with zstd compressed 
data or metadata, so you may need to roll your own recovery solution for 
the time being if you want to use zstd.



Re: Tiered storage?

2017-11-15 Thread Austin S. Hemmelgarn

On 2017-11-15 02:11, waxhead wrote:
As a regular BTRFS user I can tell you that there is no such thing as 
hot data tracking yet. Some people seem to use bcache together with 
btrfs and come asking for help on the mailing list.
Bcache works fine recently.  It was only with older versions that there 
were issues.  dm-cache similarly works fine on recent versions.  In both 
cases though, you need to be sure you know what you're doing, otherwise 
you are liable to break things.


Raid5/6 have received a few fixes recently, and it *may* soon be worth 
trying out raid5/6 for data, but keeping metadata in raid1/10 (I would 
rather lose a file or two than the entire filesystem).

I had plans to run some tests on this a while ago, but forgot about it.
As all good citizens do, remember to have good backups. The last time I 
tested raid5/6 I ran into issues easily. For what it's worth, raid1/10 
seems pretty rock solid as long as you have sufficient disks (hint: you 
need more than two for raid1 if you want to stay safe)
Parity profiles (raid5 and raid6) still have issues, although there are 
fewer than there were, with most of the remaining issues surrounding 
recovery.  I would still recommend against it for production usage.


Simple replication (raid1) is pretty much rock solid as long as you keep 
on top of replacing failing hardware and aren't stupid enough to run the 
array degraded for any extended period of time (converting to a single 
device volume instead of leaving things with half a volume is vastly 
preferred for multiple reasons).


Striped replication (raid10) is generally fine, but you can get much 
better performance by running BTRFS with a raid1 profile on top of two 
MD/LVM/Hardware RAID0 volumes (BTRFS still doesn't do a very good job of 
parallelizing things).


As for dedupe there is (to my knowledge) nothing fully automatic yet. 
You have to run a program to scan your filesystem but all the 
deduplication is done in the kernel.
duperemove works apparently quite well when I tested it, but there may 
be some performance implications.
Correct, there is nothing automatic (and there are pretty significant 
arguments against doing automatic deduplication in most cases), but the 
off-line options (via the EXTENT_SAME ioctl) are reasonably reliable. 
Duperemove in particular does a good job, though it may take a long time 
for large data sets.


As far as performance, it's no worse than large numbers of snapshots. 
The issues arise from using very large numbers of reflinks.


Roy Sigurd Karlsbakk wrote:

Hi all

I've been following this project on and off for quite a few years, and 
I wonder if anyone has looked into tiered storage on it. With tiered 
storage, I mean hot data lying on fast storage and cold data on slow 
storage. I'm not talking about caching (where you just keep a copy of 
the hot data on the fast storage).


And btw, how far is raid[56] and block-level dedup from something 
useful in production?


Vennlig hilsen

roy
--
Roy Sigurd Karlsbakk
(+47) 98013356
http://blogg.karlsbakk.net/
GPG Public key: http://karlsbakk.net/roysigurdkarlsbakk.pubkey.txt
--
Hið góða skaltu í stein höggva, hið illa í snjó rita.




Re: Tiered storage?

2017-11-15 Thread Austin S. Hemmelgarn

On 2017-11-15 04:26, Marat Khalili wrote:


On 15/11/17 10:11, waxhead wrote:

hint: you need more than two for raid1 if you want to stay safe
Huh? Two is not enough? Having three or more makes a difference? (Or, 
you mean hot spare?)
They're probably referring to an issue where a two-device array 
configured for raid1 that had lost a device and was mounted degraded 
and writable would generate single-profile chunks on the remaining 
device instead of half-complete raid1 chunks.  Combined with the fact 
that older kernels only checked the filesystem as a whole for 
normal/degraded/irreparable status instead of checking individual 
chunks, the resulting filesystem would refuse to mount, meaning you 
only had one chance to fix such an array.


If instead you have more than two devices, regular complete raid1 
profile chunks are generated, and it becomes a non-issue.


The second issue (checking degraded status at the chunk level instead of 
volume level) has been fixed in the most recent kernels.


The first issue has not been fixed yet, but I'm pretty sure there are 
patches pending.





Re: zstd compression

2017-11-15 Thread Imran Geriskovan
On 11/15/17, Lukas Pirl  wrote:
> you might be interested in the thread "Read before you deploy
> btrfs + zstd"¹.

Thanks. I've read it. Bootloader is not an issue since /boot is on
another uncompressed fs.

Let me make my question more generic:

Can there be any issues when switching mount-time
compression options from one to another, in any order?
(i.e. none -> lzo -> zlib -> zstd -> none -> ...)

zstd is only a newcomer so my question applies to all
combinations..


Re: [PATCH] btrfs: handle dynamically reappearing missing device

2017-11-15 Thread Anand Jain



On 11/15/2017 03:38 PM, kbuild test robot wrote:

Hi Anand,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on btrfs/next]
[also build test ERROR on v4.14 next-20171114]
[if your patch is applied to the wrong git tree, please drop us a note to help 
improve the system]

url:
https://github.com/0day-ci/linux/commits/Anand-Jain/btrfs-handle-dynamically-reappearing-missing-device/20171115-143047
base:   https://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs.git 
next
config: sparc64-allyesconfig (attached as .config)
compiler: sparc64-linux-gnu-gcc (Debian 7.2.0-11) 7.2.0
reproduce:
 wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
 chmod +x ~/bin/make.cross
 # save the attached .config to linux build tree
 make.cross ARCH=sparc64

All errors (new ones prefixed by >>):

fs/btrfs/volumes.c: In function 'device_list_add':

fs/btrfs/volumes.c:732:10: error: implicit declaration of function 'btrfs_open_one_device'; did you mean 'btrfs_scan_one_device'? [-Werror=implicit-function-declaration]

ret = btrfs_open_one_device(fs_devices, device, fmode,
  ^
  btrfs_scan_one_device
cc1: some warnings being treated as errors



It's missing this patch set on the ML, which was sent separately:
 [PATCH 0/4] factor __btrfs_open_devices()

Thanks, Anand



vim +732 fs/btrfs/volumes.c

610 
611 /*
612  * Add new device to list of registered devices
613  *
614  * Returns:
615  * 1   - first time device is seen
616  * 0   - device already known
617  * < 0 - error
618  */
619 static noinline int device_list_add(const char *path,
620                            struct btrfs_super_block *disk_super,
621                            u64 devid, struct btrfs_fs_devices **fs_devices_ret)
622 {
623         struct btrfs_device *device;
624         struct btrfs_fs_devices *fs_devices;
625         struct rcu_string *name;
626         int ret = 0;
627         u64 found_transid = btrfs_super_generation(disk_super);
628 
629         fs_devices = find_fsid(disk_super->fsid);
630         if (!fs_devices) {
631                 fs_devices = alloc_fs_devices(disk_super->fsid);
632                 if (IS_ERR(fs_devices))
633                         return PTR_ERR(fs_devices);
634 
635                 list_add(&fs_devices->list, &fs_uuids);
636 
637                 device = NULL;
638         } else {
639                 device = __find_device(&fs_devices->devices, devid,
640                                        disk_super->dev_item.uuid);
641         }
642 
643         if (!device) {
644                 if (fs_devices->opened)
645                         return -EBUSY;
646 
647                 device = btrfs_alloc_device(NULL, &devid,
648                                             disk_super->dev_item.uuid);
649                 if (IS_ERR(device)) {
650                         /* we can safely leave the fs_devices entry around */
651                         return PTR_ERR(device);
652                 }
653 
654                 name = rcu_string_strdup(path, GFP_NOFS);
655                 if (!name) {
656                         kfree(device);
657                         return -ENOMEM;
658                 }
659                 rcu_assign_pointer(device->name, name);
660 
661                 mutex_lock(&fs_devices->device_list_mutex);
662                 list_add_rcu(&device->dev_list, &fs_devices->devices);
663                 fs_devices->num_devices++;
664                 mutex_unlock(&fs_devices->device_list_mutex);
665 
666                 ret = 1;
667                 device->fs_devices = fs_devices;
668         } else if (!device->name || strcmp(device->name->str, path)) {
669                 /*
670                  * When FS is already mounted.
671                  * 1. If you are here and if the device->name is NULL that
672                  *    means this device was missing at time of FS mount.
673                  * 2. If you are here and if the device->name is different
674                  *    from 'path' that means either
675                  *      a. The same device disappeared and reappeared with
676                  *         different name. or
677                  *      b. The missing-disk-which-was-replaced, has
678                  *         reappeared now.
679                  *
680                  * We must allow 1 and 2a above. But 2b would be a spurious
681                  * and unintentional.
682                  *
683                  * Further in case of 1 and 2a above, the disk at 'path'

Re: [PATCH v2] btrfs/154: test for device dynamic rescan

2017-11-15 Thread Anand Jain



On 11/15/2017 02:47 PM, Eryu Guan wrote:

On Wed, Nov 15, 2017 at 11:05:15AM +0800, Anand Jain wrote:

Make sure missing device is included in the alloc list when it is
scanned on a mounted FS.

This test case needs btrfs kernel patch which is in the ML
   [PATCH] btrfs: handle dynamically reappearing missing device
Without the kernel patch, the test will run but report failure,
as the scanned device won't appear in the alloc_list.

Signed-off-by: Anand Jain 
---
v2: Fixed review comments.
  tests/btrfs/154 | 186 
  tests/btrfs/154.out |  10 +++
  tests/btrfs/group   |   1 +
  3 files changed, 197 insertions(+)
  create mode 100755 tests/btrfs/154
  create mode 100644 tests/btrfs/154.out

diff --git a/tests/btrfs/154 b/tests/btrfs/154
new file mode 100755
index ..73a185157389
--- /dev/null
+++ b/tests/btrfs/154
@@ -0,0 +1,186 @@
+#! /bin/bash
+# FS QA Test 154
+#
+# Test for reappearing missing device functionality.
+#   This test will fail without the btrfs kernel patch
+#   [PATCH] btrfs: handle dynamically reappearing missing device
+#
+#-----------------------------------------------------------------------
+# Copyright (c) 2017 Oracle.  All Rights Reserved.
+# Author: Anand Jain 
+#
+# This program is free software; you can redistribute it and/or
+# modify it under the terms of the GNU General Public License as
+# published by the Free Software Foundation.
+#
+# This program is distributed in the hope that it would be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program; if not, write the Free Software Foundation,
+# Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301  USA
+#-----------------------------------------------------------------------
+#
+
+seq=`basename $0`
+seqres=$RESULT_DIR/$seq
+echo "QA output created by $seq"
+
+here=`pwd`
+tmp=/tmp/$$
+status=1   # failure is the default!
+trap "_cleanup; exit \$status" 0 1 2 3 15
+
+_cleanup()
+{
+   cd /
+   rm -f $tmp.*
+}
+
+# get standard environment, filters and checks
+. ./common/rc
+. ./common/filter
+. ./common/module
+
+# remove previous $seqres.full before test
+rm -f $seqres.full
+
+# real QA test starts here
+
+_supported_fs btrfs
+_supported_os Linux
+_require_scratch_dev_pool 2
+_require_loadable_fs_module "btrfs"
+
+_scratch_dev_pool_get 2
+
+DEV1=`echo $SCRATCH_DEV_POOL | awk '{print $1}'`
+DEV2=`echo $SCRATCH_DEV_POOL | awk '{print $2}'`
+
+echo DEV1=$DEV1 >> $seqres.full
+echo DEV2=$DEV2 >> $seqres.full
+
+# Balance won't be successful if filled too much
+DEV1_SZ=`blockdev --getsize64 $DEV1`
+DEV2_SZ=`blockdev --getsize64 $DEV2`
+
+# get min
+MAX_FS_SZ=`echo -e "$DEV1_SZ\n$DEV2_SZ" | sort -n | head -1`
+# Need disks with more than 2G
+if [ $MAX_FS_SZ -lt 2000000000 ]; then
+   _scratch_dev_pool_put
+   _notrun "Smallest dev size $MAX_FS_SZ, Need at least 2G"
+fi
+
+MAX_FS_SZ=1000000000
+bs="1M"
+COUNT=$(($MAX_FS_SZ / 1000000))
+CHECKPOINT1=0
+CHECKPOINT2=0
+
+setup()
+{
+   echo >> $seqres.full
+   echo "MAX_FS_SZ=$MAX_FS_SZ COUNT=$COUNT" >> $seqres.full
+   echo "setup"
+   echo "-setup-" >> $seqres.full
+   _scratch_pool_mkfs "-mraid1 -draid1" >> $seqres.full 2>&1
+   _scratch_mount >> $seqres.full 2>&1
+   dd if=/dev/urandom of="$SCRATCH_MNT"/tf bs=$bs count=1 \
+   >>$seqres.full 2>&1
+   _run_btrfs_util_prog filesystem show -m ${SCRATCH_MNT}
+   _run_btrfs_util_prog filesystem df $SCRATCH_MNT
+   COUNT=$(( $COUNT - 1 ))
+   echo "unmount" >> $seqres.full
+   _scratch_unmount
+}
+
+degrade_mount_write()
+{
+   echo >> $seqres.full
+   echo "--degraded mount: max_fs_sz $MAX_FS_SZ bytes--" >> $seqres.full
+   echo
+   echo "degraded mount"
+
+   echo "clean btrfs ko" >> $seqres.full
+   # un-scan the btrfs devices
+   _reload_fs_module "btrfs"
+   _mount -o degraded $DEV1 $SCRATCH_MNT >>$seqres.full 2>&1
+   cnt=$(( $COUNT/10 ))
+   dd if=/dev/urandom of="$SCRATCH_MNT"/tf1 bs=$bs count=$cnt \
+   >>$seqres.full 2>&1
+   COUNT=$(( $COUNT - $cnt ))
+   _run_btrfs_util_prog filesystem show -m $SCRATCH_MNT
+   _run_btrfs_util_prog filesystem df $SCRATCH_MNT
+   CHECKPOINT1=`md5sum $SCRATCH_MNT/tf1`
+   echo $SCRATCH_MNT/tf1:$CHECKPOINT1 >> $seqres.full
+}
+
+scan_missing_dev_and_write()
+{
+   echo >> $seqres.full
+   echo "--scan missing $DEV2--" >> $seqres.full
+   echo
+   echo "scan missing dev and write"
+
+   _run_btrfs_util_prog device scan $DEV2
+
+   echo >> $seqres.full
+
+   _run_btrfs_util_prog filesystem show -m ${SCRATCH_MNT}
+   

Re: zstd compression

2017-11-15 Thread Lukas Pirl
Hi Imran,

On 11/15/2017 09:51 AM, Imran Geriskovan wrote as excerpted:
> Any further advices?

you might be interested in the thread "Read before you deploy btrfs +
zstd"¹.

Cheers,

Lukas

¹ https://www.mail-archive.com/linux-btrfs@vger.kernel.org/msg69871.html


Re: Tiered storage?

2017-11-15 Thread Marat Khalili


On 15/11/17 10:11, waxhead wrote:

hint: you need more than two for raid1 if you want to stay safe
Huh? Two is not enough? Having three or more makes a difference? (Or, 
you mean hot spare?)
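
(Presumably the hint above is about chunk copies versus device count; a
rough sketch of the two setups in question, with hypothetical device
names:)

```shell
# btrfs raid1 keeps exactly two copies of each chunk, regardless of
# how many devices are in the pool. With only two devices, losing one
# leaves nowhere to re-create the second copy until the disk is
# replaced:
mkfs.btrfs -f -d raid1 -m raid1 /dev/sdb /dev/sdc
# With three devices, after one fails, a rebalance across the two
# survivors can restore full redundancy without waiting for a
# replacement:
mkfs.btrfs -f -d raid1 -m raid1 /dev/sdb /dev/sdc /dev/sdd
```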


--

With Best Regards,
Marat Khalili


zstd compression

2017-11-15 Thread Imran Geriskovan
Kernel 4.14 now includes btrfs zstd compression support.

My question:
I currently have a fs mounted and used with "compress=lzo"
option. What happens if I change it to "compress=zstd"?

My guess is that existing files will be read and decompressed via lzo,
while new files will be written with zstd compression, and
everything will run smoothly.

Is this optimistic guess valid? What are the possible pitfalls,
if there are any? Any further advice?

Regards,
Imran
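
For instance, if the data already written with lzo should also become
zstd, my understanding is that it has to be rewritten explicitly; a
sketch (mount point hypothetical, not verified on all kernels):

```shell
# Switching compress= does not touch existing extents; to convert data
# already written with lzo, rewrite it, e.g. via defragment with a
# target compression algorithm (btrfs-progs >= 4.14 for -czstd):
mount -o remount,compress=zstd /mnt
btrfs filesystem defragment -r -czstd /mnt   # rewrites files using zstd
```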