Re: RAID system with adaption to changed number of disks

2016-10-14 Thread Chris Murphy
OK so we know for raid5 data block groups there can be RMW. And
because of that, any interruption results in the write hole. On Btrfs,
though, the write hole exists on disk only: if there's a lost strip
(failed drive or UNC read), reconstruction from wrong parity results
in a checksum error and EIO. That's good.

However, what happens in the metadata case? If metadata is raid5, and
there's a crash or power failure during metadata RMW, same problem,
wrong parity, bad reconstruction, csum mismatch, and EIO. So what's
the effect of EIO when reading metadata? And how common is RMW for
metadata operations?

I wonder whether this is where all of these strange cases come from,
where people can't do anything at all with a normally degraded raid5 -
one device failed and no other failures, yet they can't mount due to a
bunch of csum errors.


Chris Murphy
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Unable to rescue RAID5

2016-10-14 Thread Hiroshi Honda
> That's the proper answer.  In practice... all hope isn't yet lost.  
I understood the proper answer.
I'll take care of it in the future.

Is there any step or method I can try in this situation?

Thank you
Hiroshi Honda



Re: Unable to rescue RAID5

2016-10-14 Thread Austin S. Hemmelgarn

On 2016-10-14 06:11, Hiroshi Honda wrote:

That's the proper answer.  In practice... all hope isn't yet lost.

I understood the proper answer.
I'll take care of it in the future.

Is there any step or method I can try in this situation?

You should probably look at `btrfs restore`.  I'm not sure how well it 
works with the raid56 code, and you'll need somewhere to put the data 
being restored, but that's probably your best option.
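For the archives, a rough sketch of how that usually goes (the device name and target directory below are placeholders, not taken from this thread; always check the dry run before restoring for real):

```shell
# Hypothetical device and scratch directory -- adjust both.
# -D/--dry-run only lists what would be restored, writing nothing.
sudo btrfs restore -D /dev/sdb1 /mnt/rescue
# If the listing looks sane, restore for real: -v is verbose, and -i
# ignores checksum errors instead of aborting on damaged files.
sudo btrfs restore -v -i /dev/sdb1 /mnt/rescue
```

The restored files land in the target directory as plain copies, so it just needs enough free space; it doesn't have to be btrfs.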




Re: Copy BTRFS volume to another BTRFS volume including subvolumes and snapshots

2016-10-14 Thread Austin S. Hemmelgarn

On 2016-10-13 17:21, Alberto Bursi wrote:

Hi, I'm using OpenSUSE on a btrfs volume spanning 2 disks (set as raid1
for both metadata and data), no separate /home partition.
The distro loves to create dozens of subvolumes for various things and
makes snapshots, see:
alby@openSUSE-xeon:~> sudo btrfs subvolume list /
ID 257 gen 394 top level 5 path @
ID 258 gen 293390 top level 257 path @/.snapshots
ID 259 gen 293607 top level 258 path @/.snapshots/1/snapshot
ID 260 gen 107012 top level 257 path @/boot/grub2/i386-pc
ID 261 gen 107012 top level 257 path @/boot/grub2/x86_64-efi
ID 262 gen 293610 top level 257 path @/home
ID 263 gen 292439 top level 257 path @/opt
ID 264 gen 288726 top level 257 path @/srv
ID 265 gen 293610 top level 257 path @/tmp
ID 266 gen 292657 top level 257 path @/usr/local
ID 267 gen 104612 top level 257 path @/var/crash
ID 268 gen 133454 top level 257 path @/var/lib/libvirt/images
ID 269 gen 104612 top level 257 path @/var/lib/mailman
ID 270 gen 104612 top level 257 path @/var/lib/mariadb
ID 271 gen 292441 top level 257 path @/var/lib/mysql
ID 272 gen 104612 top level 257 path @/var/lib/named
ID 273 gen 104612 top level 257 path @/var/lib/pgsql
ID 274 gen 293608 top level 257 path @/var/log
ID 275 gen 104612 top level 257 path @/var/opt
ID 276 gen 293610 top level 257 path @/var/spool
ID 277 gen 293606 top level 257 path @/var/tmp
ID 362 gen 228259 top level 258 path @/.snapshots/56/snapshot
ID 364 gen 228259 top level 258 path @/.snapshots/57/snapshot
ID 528 gen 228259 top level 258 path @/.snapshots/110/snapshot
ID 529 gen 228259 top level 258 path @/.snapshots/111/snapshot
ID 670 gen 228259 top level 258 path @/.snapshots/240/snapshot
ID 671 gen 228259 top level 258 path @/.snapshots/241/snapshot
ID 894 gen 228283 top level 258 path @/.snapshots/438/snapshot
ID 895 gen 228283 top level 258 path @/.snapshots/439/snapshot
ID 896 gen 228283 top level 258 path @/.snapshots/440/snapshot
ID 897 gen 228283 top level 258 path @/.snapshots/441/snapshot
ID 1033 gen 288965 top level 258 path @/.snapshots/554/snapshot
ID 1034 gen 289531 top level 258 path @/.snapshots/555/snapshot
ID 1035 gen 289726 top level 258 path @/.snapshots/556/snapshot
ID 1036 gen 289729 top level 258 path @/.snapshots/557/snapshot
ID 1037 gen 290297 top level 258 path @/.snapshots/558/snapshot
ID 1038 gen 290301 top level 258 path @/.snapshots/559/snapshot
ID 1039 gen 290336 top level 258 path @/.snapshots/560/snapshot
ID 1041 gen 290338 top level 258 path @/.snapshots/562/snapshot
ID 1043 gen 292047 top level 258 path @/.snapshots/563/snapshot
ID 1044 gen 292051 top level 258 path @/.snapshots/564/snapshot
ID 1045 gen 292531 top level 258 path @/.snapshots/565/snapshot
ID 1046 gen 293153 top level 258 path @/.snapshots/566/snapshot

I'd like to be able to clone verbatim the whole volume to another
volume, for backup purposes.

Now, I think I can do that with a brutal partition clone from my
"recovery" system (a Debian system installed on another drive) and then
going through the procedure for dealing with a lost drive.

But I'd prefer a clone on a live system, and without doing a brutal
clone, as that would keep the same UUIDs.

I can (and will) script this, so even if it is a tedious process or it
involves writing a huge command line, that's not an issue.

I'm not sure there is any way to do this on a live system.  You
essentially need to either do a block-level copy and change the UUIDs
(which recent versions of btrfstune can do), or use btrfs send with some
manual setup to get the subvolumes across.  Both options require the
filesystem to be effectively read-only, which is not something that any
modern Linux distro can reliably handle for more than a few minutes.


If I had to do this, I'd go with the block-level copy, since it requires
a lot less effort; just make sure to use btrfstune to change the UUIDs
when the copy is done (that may itself take a while, though, since it has
to rewrite a lot of metadata).
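Sketched out below (device names are placeholders; for a two-disk raid1 like this one, each member device would need the same treatment, and both filesystems must stay unmounted during the copy):

```shell
# Block-level clone of one (unmounted) member device to its backup twin.
sudo dd if=/dev/sdb of=/dev/sdd bs=64M conv=fsync status=progress
# Give the copy a new filesystem UUID so both copies can coexist.
# This rewrites every metadata block, so it can take a while.
sudo btrfstune -u /dev/sdd
```

`btrfstune -u` is the "recent versions of btrfstune" feature mentioned above; without it, the clone keeps the original fsid and must never be visible to the kernel at the same time as the original.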



Re: btrfs and numa - needing drop_caches to keep speed up

2016-10-14 Thread Julian Taylor

On 10/14/2016 08:28 AM, Stefan Priebe - Profihost AG wrote:

Hello list,

while running the same workload on two machines (single xeon and a dual
xeon) both with 64GB RAM.

I need to run echo 3 >/proc/sys/vm/drop_caches every 15-30 minutes to
keep the speed as good as on the non numa system. I'm not sure whether
this is related to numa.

Is there any sysctl parameter to tune?

Tested with vanilla v4.8.1

Greets,
Stefan


hi,
why do you think this is related to btrfs?

The only known issue with this type of workaround that I'm aware of is
transparent huge pages.
This is easy to diagnose by recording some kernel stacks during the
problem with perf.
If there is very high system CPU usage in a spinlock called from
compaction functions during page faults, it is the synchronous memory
defragmentation needed for THP.
Should that be the case, the better workaround is disabling it in
/sys/kernel/mm/transparent_hugepage/defrag
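A sketch of both steps (the perf invocation is generic, not specific to this workload; the sysfs path is the standard one):

```shell
# Record kernel stacks system-wide for 30 seconds while the slowdown
# is happening, then look at where the time goes:
sudo perf record -a -g -- sleep 30
sudo perf report --stdio | head -n 40
# If compaction during page faults dominates, stop THP from
# defragmenting synchronously (this keeps THP itself enabled):
echo never | sudo tee /sys/kernel/mm/transparent_hugepage/defrag
```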


cheers,
Julian


Re: btrfs and numa - needing drop_caches to keep speed up

2016-10-14 Thread Austin S. Hemmelgarn

On 2016-10-14 02:28, Stefan Priebe - Profihost AG wrote:

Hello list,

while running the same workload on two machines (single xeon and a dual
xeon) both with 64GB RAM.

I need to run echo 3 >/proc/sys/vm/drop_caches every 15-30 minutes to
keep the speed as good as on the non numa system. I'm not sure whether
this is related to numa.

Is there any sysctl parameter to tune?

Tested with vanilla v4.8.1
This may sound odd, but does echoing 1 or 2 to /proc/sys/vm/drop_caches
help at all, or is it just 3?  The value itself is actually a bit-field
with just two bits defined: 1 drops only the page cache, 2 drops only
freeable SLAB objects, and 3 drops both.  The thing is, both have an
impact when dealing with filesystems (the page cache holds cached file
contents, while freeable SLAB objects include cached dentries and
inodes), so knowing whether one, the other, or only both helps would
narrow this down further.
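The bit-field is easy to see with shell arithmetic, and each value can be tested in isolation (the writes need root, and a `sync` first stops dirty pages from masking the effect):

```shell
# drop_caches is a bit-field: bit 0 (value 1) = page cache,
# bit 1 (value 2) = freeable SLAB objects (dentries/inodes).
# 3 is simply both bits set:
echo $(( 1 | 2 ))   # prints 3
# sync; echo 1 > /proc/sys/vm/drop_caches   # page cache only
# sync; echo 2 > /proc/sys/vm/drop_caches   # dentries/inodes only
# sync; echo 3 > /proc/sys/vm/drop_caches   # both
```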


Regardless of that, you might try adjusting vm.vfs_cache_pressure.
Increasing it makes the kernel reclaim the cached dentries and inodes
more aggressively, while decreasing it makes that reclaim less
aggressive.  It defaults to 100, and I wouldn't suggest setting it below
50 or above about 150.  Keep in mind that increasing it will likely put
more load on the storage devices.
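For reference, a sketch of inspecting and changing it (150 is just an example value within the range suggested above; the write needs root):

```shell
# Show the current value (default 100):
sysctl vm.vfs_cache_pressure
# Reclaim cached dentries/inodes more aggressively; put the same
# setting in /etc/sysctl.conf to make it survive a reboot:
sudo sysctl -w vm.vfs_cache_pressure=150
```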




Re: [PATCH 1/2] btrfs: improve inode's outstanding_extents computation

2016-10-14 Thread Stefan Priebe - Profihost AG

Am 06.10.2016 um 04:51 schrieb Wang Xiaoguang:
> This issue was revealed by modifying BTRFS_MAX_EXTENT_SIZE (128MB) to 64KB.
> With BTRFS_MAX_EXTENT_SIZE reduced to 64KB, the fsstress test often
> triggers these warnings from btrfs_destroy_inode():
>   WARN_ON(BTRFS_I(inode)->outstanding_extents);
>   WARN_ON(BTRFS_I(inode)->reserved_extents);
> 
> The simple test program below can reproduce this issue reliably.
> Note: you need to modify BTRFS_MAX_EXTENT_SIZE to 64KB for the test,
> otherwise there won't be such a WARNING.
>   #include <stdio.h>
>   #include <stdlib.h>
>   #include <string.h>
>   #include <fcntl.h>
>   #include <unistd.h>
> 
>   int main(void)
>   {
>   int fd;
>   char buf[68 * 1024];
> 
>   memset(buf, 0, 68 * 1024);
>   fd = open("testfile", O_CREAT | O_EXCL | O_RDWR, 0644);
>   pwrite(fd, buf, 68 * 1024, 64 * 1024);
>   return 0;
>   }
> 
> When BTRFS_MAX_EXTENT_SIZE is 64KB, the buffered data range is:
> 64KB           128KB  132KB
> |---------------|------|
>      64KB + 4KB
> 
> 1) For the above data range, btrfs_delalloc_reserve_metadata() will reserve
> metadata and set BTRFS_I(inode)->outstanding_extents to 2:
> (68KB + 64KB - 1) / 64KB == 2
> 
> Outstanding_extents: 2
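The rounding-up division above is the usual integer ceil(a/b) = (a + b - 1) / b trick; as a quick check:

```shell
# ceil(68KB / 64KB) with integer arithmetic, as in the commit message:
echo $(( (68*1024 + 64*1024 - 1) / (64*1024) ))   # prints 2
```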
> 
> 2) Then btrfs_dirty_pages() will be called to dirty the pages and set the
> EXTENT_DELALLOC flag. In this case, btrfs_set_bit_hook() will be called
> twice.
> The 1st set_bit_hook() call will set the DELALLOC flag for the first 64KB:
> 64KB           128KB
> |---------------|
>   64KB DELALLOC
> Outstanding_extents: 2
> 
> set_bit_hook() uses the FIRST_DELALLOC flag to avoid re-increasing the
> outstanding_extents counter.
> So the 1st set_bit_hook() call won't modify outstanding_extents;
> it's still 2.
> 
> Then FIRST_DELALLOC flag is *CLEARED*.
> 
> 3) The 2nd btrfs_set_bit_hook() call.
> Because FIRST_DELALLOC has been cleared by the previous set_bit_hook(),
> btrfs_set_bit_hook() will increase BTRFS_I(inode)->outstanding_extents by
> one, so now BTRFS_I(inode)->outstanding_extents is 3:
> 64KB           128KB  132KB
> |---------------|------|
>   64KB DELALLOC  4KB DELALLOC
> Outstanding_extents: 3
> 
> But the correct outstanding_extents number should be 2, not 3.
> The 2nd btrfs_set_bit_hook() call screws this up, and that leads to the
> WARN_ON().
> 
> Normally, we could solve this by only increasing outstanding_extents in
> set_bit_hook().
> But the problem is that for delalloc_reserve/release_metadata() we only
> have a 'length' parameter, and so calculate an inaccurate
> outstanding_extents.
> If we only rely on set_bit_hook(), release_metadata() will screw things
> up, as it will decrease an inaccurate number.
> 
> So the fix we use is:
> 1) Increase an *INACCURATE* outstanding_extents at delalloc_reserve_meta,
>    just as a placeholder.
> 2) Increase the *accurate* outstanding_extents at set_bit_hook().
>    This is the real increase.
> 3) Decrease the *INACCURATE* outstanding_extents before returning.
>    This brings outstanding_extents back to the correct value.
> 
> For the 128MB BTRFS_MAX_EXTENT_SIZE, due to a limitation of
> __btrfs_buffered_write(), each iteration will only handle about 2MB of
> data, so btrfs_dirty_pages() won't need to handle ranges crossing 2
> extents.
> 
> Signed-off-by: Wang Xiaoguang 

Tested-by: Stefan Priebe 

Has been working fine for 8 days - no ENOSPC errors anymore.

Greets,
Stefan

> ---
>  fs/btrfs/ctree.h |  2 ++
>  fs/btrfs/inode.c | 65 ++--
>  fs/btrfs/ioctl.c |  6 ++
>  3 files changed, 62 insertions(+), 11 deletions(-)
> 
> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> index 33fe035..16885f6 100644
> --- a/fs/btrfs/ctree.h
> +++ b/fs/btrfs/ctree.h
> @@ -3119,6 +3119,8 @@ int btrfs_start_delalloc_roots(struct btrfs_fs_info *fs_info, int delay_iput,
>  int nr);
>  int btrfs_set_extent_delalloc(struct inode *inode, u64 start, u64 end,
> struct extent_state **cached_state);
> +int btrfs_set_extent_defrag(struct inode *inode, u64 start, u64 end,
> + struct extent_state **cached_state);
>  int btrfs_create_subvol_root(struct btrfs_trans_handle *trans,
>struct btrfs_root *new_root,
>struct btrfs_root *parent_root,
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index e6811c4..a7193b1 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -1590,6 +1590,9 @@ static void btrfs_split_extent_hook(struct inode *inode,
>   if (!(orig->state & EXTENT_DELALLOC))
>   return;
>  
> + if (btrfs_is_free_space_inode(inode))
> + return;
> +
>   size = orig->end - orig->start + 1;
>   if (si

Re: [PATCH 2/2] btrfs: fix false enospc for compression

2016-10-14 Thread Stefan Priebe - Profihost AG

Am 06.10.2016 um 04:51 schrieb Wang Xiaoguang:
> When testing btrfs compression, we sometimes got an ENOSPC error even though
> the fs still had plenty of free space; xfstests generic/171, generic/172,
> generic/173, generic/174 and generic/175 can reveal this bug in my test
> environment when compression is enabled.
> 
> After some debugging work, we found that it's btrfs_delalloc_reserve_metadata()
> which sometimes tries to reserve plenty of metadata space, even for a very
> small data range. In btrfs_delalloc_reserve_metadata(), the number of metadata
> bytes to reserve is calculated from the difference between outstanding_extents
> and reserved_extents. Please see the case below for how ENOSPC occurs:
> 
>   1) Buffered-write 128MB of data in units of 128KB; finally we'll have 1
> outstanding extent on the inode, and reserved_extents will be 1024. Note it's
> btrfs_merge_extent_hook() that merges these 128KB units into one big
> outstanding extent, but it does not change reserved_extents.
> 
>   2) When writing dirty pages, for compression, cow_file_range_async() will
> split the big extent above into units of 128KB (the compression extent size).
> When the first split operation finishes, we'll have 2 outstanding extents and
> 1024 reserved extents. If, right then, the currently generated ordered extent
> is dispatched to run and completes, btrfs_delalloc_release_metadata() (see
> btrfs_finish_ordered_io()) will be called to release metadata; after that we
> will have 1 outstanding extent and 1 reserved extent (also see the logic in
> drop_outstanding_extent()). Later, cow_file_range_async() continues to handle
> the remaining data range [128KB, 128MB), and if no other ordered extent is
> dispatched to run, there will be 1023 outstanding extents and 1 reserved
> extent.
> 
>   3) Now if another buffered write to this file comes in,
> btrfs_delalloc_reserve_metadata() will try to reserve metadata for at least
> 1023 outstanding extents. For a 16KB node size that's 1023*16384*2*8, about
> 255MB; for a 64KB node size it's 1023*65536*2*8, about 1GB of metadata. That
> is obviously not sane and can easily result in an ENOSPC error.
> 
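The worst-case figures in step 3 can be reproduced with shell arithmetic (the 2 and 8 multipliers are taken verbatim from the products quoted above):

```shell
# 1023 outstanding extents, 16KB node size: bytes -> MB
echo $(( 1023 * 16384 * 2 * 8 / 1024 / 1024 ))   # prints 255
# 1023 outstanding extents, 64KB node size: bytes -> MB (~1GB)
echo $(( 1023 * 65536 * 2 * 8 / 1024 / 1024 ))   # prints 1023
```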
> The root cause is that with compression the max extent size is no longer
> BTRFS_MAX_EXTENT_SIZE (128MB) but 128KB, so the current metadata reservation
> method in btrfs is not appropriate. Here we introduce:
>   enum btrfs_metadata_reserve_type {
>   BTRFS_RESERVE_NORMAL,
>   BTRFS_RESERVE_COMPRESS,
>   };
> and expand btrfs_delalloc_reserve_metadata() and btrfs_delalloc_reserve_space()
> by adding a new enum btrfs_metadata_reserve_type argument. When a data range
> will go through compression, we use BTRFS_RESERVE_COMPRESS to reserve metadata.
> Meanwhile, we introduce the EXTENT_COMPRESS flag to mark a data range that
> will go through the compression path.
> 
> With this patch, we can fix these false ENOSPC errors for compression.
> 
> Signed-off-by: Wang Xiaoguang 

Tested-by: Stefan Priebe 

Has been working fine for 8 days - no ENOSPC errors anymore.

Greets,
Stefan


> ---
>  fs/btrfs/ctree.h |  31 ++--
>  fs/btrfs/extent-tree.c   |  55 +
>  fs/btrfs/extent_io.c |  59 +-
>  fs/btrfs/extent_io.h |   2 +
>  fs/btrfs/file.c  |  26 +--
>  fs/btrfs/free-space-cache.c  |   6 +-
>  fs/btrfs/inode-map.c |   5 +-
>  fs/btrfs/inode.c | 181 ---
>  fs/btrfs/ioctl.c |  12 ++-
>  fs/btrfs/relocation.c|  14 +++-
>  fs/btrfs/tests/inode-tests.c |  15 ++--
>  11 files changed, 309 insertions(+), 97 deletions(-)
> 
> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> index 16885f6..fa6a19a 100644
> --- a/fs/btrfs/ctree.h
> +++ b/fs/btrfs/ctree.h
> @@ -97,6 +97,19 @@ static const int btrfs_csum_sizes[] = { 4 };
>  
>  #define BTRFS_DIRTY_METADATA_THRESH  SZ_32M
>  
> +/*
> + * for compression, max file extent size would be limited to 128K, so when
> + * reserving metadata for such delalloc writes, pass BTRFS_RESERVE_COMPRESS 
> to
> + * btrfs_delalloc_reserve_metadata() or btrfs_delalloc_reserve_space() to
> + * calculate metadata, for none-compression, use BTRFS_RESERVE_NORMAL.
> + */
> +enum btrfs_metadata_reserve_type {
> + BTRFS_RESERVE_NORMAL,
> + BTRFS_RESERVE_COMPRESS,
> +};
> +int inode_need_compress(struct inode *inode);
> +u64 btrfs_max_extent_size(enum btrfs_metadata_reserve_type reserve_type);
> +
>  #define BTRFS_MAX_EXTENT_SIZE SZ_128M
>  
>  struct btrfs_mapping_tree {
> @@ -2677,10 +2690,14 @@ int btrfs_subvolume_reserve_metadata(struct btrfs_root *root,
>  void btrfs_subvolume_release_metadata(struct btrfs_root *root,
> struct btrfs_block_rsv *rsv,
> u64 qgroup_reserved);
> -int btrfs_delalloc_reserve_metadata(struct inode *inode, u64 num_bytes);
> -void btrfs_delalloc_release_metadata(struct inode *inode, u64 num_bytes);
> -i

Re: btrfs and numa - needing drop_caches to keep speed up

2016-10-14 Thread Stefan Priebe - Profihost AG
Dear Julian,

Am 14.10.2016 um 14:26 schrieb Julian Taylor:
> On 10/14/2016 08:28 AM, Stefan Priebe - Profihost AG wrote:
>> Hello list,
>>
>> while running the same workload on two machines (single xeon and a dual
>> xeon) both with 64GB RAM.
>>
>> I need to run echo 3 >/proc/sys/vm/drop_caches every 15-30 minutes to
>> keep the speed as good as on the non numa system. I'm not sure whether
>> this is related to numa.
>>
>> Is there any sysctl parameter to tune?
>>
>> Tested with vanilla v4.8.1
>>
>> Greets,
>> Stefan
> 
> hi,
> why do you think this is related to btrfs?

it was just an idea, as I couldn't find any other difference between
those systems.

> This is easy to diagnose by recording some kernel stacks during the
> problem with perf.

Do you just mean perf top? Does it also show locking problems? I don't
see much CPU usage in that case.

> The only known issue that has this type of workaround that I know of are
> transparent huge pages.

I already disabled thp by:
echo never > /sys/kernel/mm/transparent_hugepage/enabled

cat /proc/meminfo says:
HugePages_Total:   0
HugePages_Free:0
HugePages_Rsvd:0
HugePages_Surp:0
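(Note: the HugePages_* counters in /proc/meminfo track hugetlbfs pages, not THP; THP usage shows up under AnonHugePages. A more direct pair of checks:)

```shell
# THP enablement state -- "[never]" marks it disabled:
cat /sys/kernel/mm/transparent_hugepage/enabled
# THP pages actually in use (the HugePages_* counters don't cover this):
grep AnonHugePages /proc/meminfo
```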



Greets,
Stefan

> 
> cheers,
> Julian


Re: [PATCH 2/2] btrfs: fix false enospc for compression

2016-10-14 Thread Holger Hoffstätte
On 10/06/16 04:51, Wang Xiaoguang wrote:
> When testing btrfs compression, we sometimes got an ENOSPC error even though
> the fs still had plenty of free space; xfstests generic/171, generic/172,
> generic/173, generic/174 and generic/175 can reveal this bug in my test
> environment when compression is enabled.
> 
> After some debugging work, we found that it's btrfs_delalloc_reserve_metadata()
> which sometimes tries to reserve plenty of metadata space, even for a very
> small data range. In btrfs_delalloc_reserve_metadata(), the number of metadata
> bytes to reserve is calculated from the difference between outstanding_extents
> and reserved_extents. Please see the case below for how ENOSPC occurs:
> 
>   1) Buffered-write 128MB of data in units of 128KB; finally we'll have 1
> outstanding extent on the inode, and reserved_extents will be 1024. Note it's
> btrfs_merge_extent_hook() that merges these 128KB units into one big
> outstanding extent, but it does not change reserved_extents.
> 
>   2) When writing dirty pages, for compression, cow_file_range_async() will
> split the big extent above into units of 128KB (the compression extent size).
> When the first split operation finishes, we'll have 2 outstanding extents and
> 1024 reserved extents. If, right then, the currently generated ordered extent
> is dispatched to run and completes, btrfs_delalloc_release_metadata() (see
> btrfs_finish_ordered_io()) will be called to release metadata; after that we
> will have 1 outstanding extent and 1 reserved extent (also see the logic in
> drop_outstanding_extent()). Later, cow_file_range_async() continues to handle
> the remaining data range [128KB, 128MB), and if no other ordered extent is
> dispatched to run, there will be 1023 outstanding extents and 1 reserved
> extent.
> 
>   3) Now if another buffered write to this file comes in,
> btrfs_delalloc_reserve_metadata() will try to reserve metadata for at least
> 1023 outstanding extents. For a 16KB node size that's 1023*16384*2*8, about
> 255MB; for a 64KB node size it's 1023*65536*2*8, about 1GB of metadata. That
> is obviously not sane and can easily result in an ENOSPC error.
> 
> The root cause is that with compression the max extent size is no longer
> BTRFS_MAX_EXTENT_SIZE (128MB) but 128KB, so the current metadata reservation
> method in btrfs is not appropriate. Here we introduce:
>   enum btrfs_metadata_reserve_type {
>   BTRFS_RESERVE_NORMAL,
>   BTRFS_RESERVE_COMPRESS,
>   };
> and expand btrfs_delalloc_reserve_metadata() and btrfs_delalloc_reserve_space()
> by adding a new enum btrfs_metadata_reserve_type argument. When a data range
> will go through compression, we use BTRFS_RESERVE_COMPRESS to reserve metadata.
> Meanwhile, we introduce the EXTENT_COMPRESS flag to mark a data range that
> will go through the compression path.
> 
> With this patch, we can fix these false ENOSPC errors for compression.
> 
> Signed-off-by: Wang Xiaoguang 

I took some time again to get this into my tree on top of what's in
btrfs-4.9rc1 and managed to merge it after all.

Both this and patch #1 seem to work fine, and they don't seem to cause any
regressions; I ran a couple of full and incremental rsync backups with
>100GB on a new and now-compressed subvolume without problems. Also, Stefan
just reported that his ENOSPC error seems to be gone as well, so it seems
to be good. \o/

So for both this and patch #1 have a careful:

Tested-by: Holger Hoffstätte 

Also, a comment about something I found while resolving conflicts caused
by the preliminary dedupe support:

[..]
>  int btrfs_set_extent_delalloc(struct inode *inode, u64 start, u64 end,
> -   struct extent_state **cached_state);
> +   struct extent_state **cached_state, int flag);
>  int btrfs_set_extent_defrag(struct inode *inode, u64 start, u64 end,
> - struct extent_state **cached_state);
> + struct extent_state **cached_state, int flag);
[..]
>  int btrfs_dirty_pages(struct btrfs_root *root, struct inode *inode,
> struct page **pages, size_t num_pages,
> loff_t pos, size_t write_bytes,
> -   struct extent_state **cached);
> +   struct extent_state **cached, int flag);

Instead of adding "int flag", why not use the already-defined
btrfs_metadata_reserve_type enum? I know it's just an int at the end of
the day, but the dedupe support already added another "int dedupe" argument,
and it's easy to cause confusion.
Maybe later it would be beneficial to consolidate the flags into a consistent
set of enum values, to prevent more "int flag" inflation and to better declare
the intent of the extent state change. Not sure if that makes sense.

thanks,
Holger


Re: btrfs and numa - needing drop_caches to keep speed up

2016-10-14 Thread Stefan Priebe - Profihost AG
Hi,
Am 14.10.2016 um 15:19 schrieb Stefan Priebe - Profihost AG:
> Dear julian,
> 
> Am 14.10.2016 um 14:26 schrieb Julian Taylor:
>> On 10/14/2016 08:28 AM, Stefan Priebe - Profihost AG wrote:
>>> Hello list,
>>>
>>> while running the same workload on two machines (single xeon and a dual
>>> xeon) both with 64GB RAM.
>>>
>>> I need to run echo 3 >/proc/sys/vm/drop_caches every 15-30 minutes to
>>> keep the speed as good as on the non numa system. I'm not sure whether
>>> this is related to numa.
>>>
>>> Is there any sysctl parameter to tune?
>>>
>>> Tested with vanilla v4.8.1
>>>
>>> Greets,
>>> Stefan
>>
>> hi,
>> why do you think this is related to btrfs?
> 
> was just an idea as i couldn't find any other difference between those
> systems.
> 
>> This is easy to diagnose by recording some kernel stacks during the
>> problem with perf.
> 
> Do you just mean perf top? Does it also show locking problems? I don't
> see much CPU usage in that case.


perf top looks like this:
   5,46%  libc-2.19.so   [.] memset
   5,26%  [kernel]   [k] page_fault
   3,63%  [kernel]   [k] clear_page_c_e
   1,38%  [kernel]   [k] _raw_spin_lock
   1,06%  [kernel]   [k] get_page_from_freelist
   0,83%  [kernel]   [k] copy_user_enhanced_fast_string
   0,79%  [kernel]   [k] release_pages
   0,68%  [kernel]   [k] handle_mm_fault
   0,57%  [kernel]   [k] free_hot_cold_page
   0,55%  [kernel]   [k] handle_pte_fault
   0,54%  [kernel]   [k] __pagevec_lru_add_fn
   0,45%  [kernel]   [k] unmap_page_range
   0,45%  [kernel]   [k] __mod_zone_page_state
   0,43%  [kernel]   [k] page_add_new_anon_rmap
   0,38%  [kernel]   [k] free_pcppages_bulk

> 
>> The only known issue that has this type of workaround that I know of are
>> transparent huge pages.
> 
> I already disabled thp by:
> echo never > /sys/kernel/mm/transparent_hugepage/enabled
> 
> cat /proc/meminfo says:
> HugePages_Total:   0
> HugePages_Free:0
> HugePages_Rsvd:0
> HugePages_Surp:0
> 
> 
> 
> Greets,
> Stefan
> 
>>
>> cheers,
>> Julian


Re: [PATCH v2 0/4] Btrfs-convert: Add support to copy common inode flags

2016-10-14 Thread David Sterba
On Fri, Oct 14, 2016 at 08:50:31AM +0800, Qu Wenruo wrote:
> > I can pull the branches from you, either to devel or integration. Please
> > base them on the last release point (ie. master) if there are no other
> > dependencies. Otherwise, I'll publish a branch that contains the desired
> > patches and will promise not to change it with rebases.
> 
> Sorry, I'm not very familiar with when to send a pull request and when to
> send normal patches, and which branch they should be based on.

It's not required; it just would make a few things easier. The pull is
additional to the mailing-list patches, so send patches as usual.  If you
put the patches in a branch that I can pull, I'll do that eventually. That
happens when the patches are in a state where I don't need to touch them
at all, or only need to make minor modifications.

>  From my understanding, for a large patchset a pull request is preferred
> (and the patches should still be sent out for review), and for small
> changes normal mail should be good enough.

Basically yes. Single patches or very short series do not need to go
through git (just mails).

> And normally I rebase all my branches/patches to your devel branch.
> Or should I rebase them to the master branch?

As the devel branch can be rebased, it's not suitable as a starting
point. Even if you rebase on it, I might not be able to pull it if
something changes in the meantime. As I don't process progs patches
every day, this is likely to happen.

If you don't want to do it the git way now, no big deal. It's a long-term
goal, but it requires doing things right, so that both parties,
developers and maintainers, know what to expect. I'll also write that up
in the wiki for future reference.


Re: Btrfs progs build fails for 4.8

2016-10-14 Thread David Sterba
On Thu, Oct 13, 2016 at 03:25:57PM +0800, Qu Wenruo wrote:
> At 10/13/2016 01:26 AM, David Sterba wrote:
> > On Wed, Oct 12, 2016 at 10:01:27PM +0800, Qu Wenruo wrote:
> >>
> >>
> >> On 10/12/2016 09:58 PM, Abhay Sachan wrote:
> >>> Hi,
> >>> I tried building latest btrfs progs on CentOS 6, "./configure" failed
> >>> with the error:
> >>>
> >>> checking for FIEMAP_EXTENT_SHARED defined in linux/fiemap.h... no
> >>> configure: error: no definition of FIEMAP_EXTENT_SHARED found
> >>>
> >>> Any ideas what am I lacking in the system?
> >>
> >> Your kernel is too old and its header doesn't support
> >> FIEMAP_EXTENT_SHARED flag for fiemap ioctl.
> >
> > As this is not the first time, I wonder if we could provide a definition
> > of the macro in progs, regardless of the installed headers. Note that
> > this does not mean the running kernel is old. Previously the user said
> > he was running a 4.4 kernel [1] (or it could be any other kernel
> > version). For that combination of headers and running kernel, I think
> > it's ok to add a fallback definition.
> 
> Yes, a fallback definition is good for the case where the running kernel
> supports the SHARED_EXTENT flag but the kernel headers don't.
> 
> But I'm a little concerned that such a fallback definition will lead to
> corrupted results for "fi du".

Well, not corrupted, but wrong. The shared data will be reported as
exclusive.

> >> And since that flag is very important for tools like "btrfs filesystem
> >> du", for old kernels that don't support the EXTENT_SHARED flag we have
> >> no choice but to abort configuration.
> >
> > The check was added to configure so it's caught earlier than during
> > build. The 'fi du' command is useful, but not IMO critical to stop the
> > build.
> 
> What about just disabling subcommands using SHARED_EXTENT flag?
> Like "fi du".

This would need to be checked at runtime, based only on the kernel version.
I think we can do that: print a warning in 'fi du' but otherwise let it
continue.
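A sketch of what such a runtime gate could look like (purely illustrative, not actual btrfs-progs code; Eric's follow-up below suggests 2.6.33 as the cutoff, and `sort -V` does the version comparison):

```shell
# Illustrative only: warn in 'fi du' when the running kernel predates
# the assumed FIEMAP_EXTENT_SHARED cutoff. sort -V orders version
# strings; if 2.6.33 sorts first (or ties), the kernel is new enough.
kver=$(uname -r | cut -d- -f1)
if [ "$(printf '%s\n' 2.6.33 "$kver" | sort -V | head -n1)" = "2.6.33" ]; then
    echo "FIEMAP_EXTENT_SHARED assumed supported"
else
    echo "warning: shared extents will be reported as exclusive"
fi
```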


Re: Btrfs progs build fails for 4.8

2016-10-14 Thread Eric Sandeen
On 10/14/16 10:49 AM, David Sterba wrote:
> On Thu, Oct 13, 2016 at 03:25:57PM +0800, Qu Wenruo wrote:
>> At 10/13/2016 01:26 AM, David Sterba wrote:
>>> On Wed, Oct 12, 2016 at 10:01:27PM +0800, Qu Wenruo wrote:


 On 10/12/2016 09:58 PM, Abhay Sachan wrote:
> Hi,
> I tried building latest btrfs progs on CentOS 6, "./configure" failed
> with the error:
>
> checking for FIEMAP_EXTENT_SHARED defined in linux/fiemap.h... no
> configure: error: no definition of FIEMAP_EXTENT_SHARED found
>
> Any ideas what am I lacking in the system?

 Your kernel is too old and its header doesn't support
 FIEMAP_EXTENT_SHARED flag for fiemap ioctl.
>>>
>>> As this is not the first time, I wonder if we could provide a definition
>>> of the macro in progs, regardless of the installed headers. Note that
>>> this does not mean the running kernel is old. Previously the user said
>>> he was running a 4.4 kernel [1] (or it could be any other kernel
>>> version). For that combination of headers and running kernel, I think
>>> it's ok to add a fallback definition.
>>
>> Yes, a fallback definition is good for the case where the running kernel
>> supports the SHARED_EXTENT flag but the kernel headers don't.
>>
>> But I'm a little concerned that such a fallback definition will lead to
>> corrupted results for "fi du".
> 
> Well, not corrupted but wrong. The shared data will be reported as
> exclusive.
> 
 And since that flag is very important for tools like "btrfs filesystem
 du", for old kernels that don't support the EXTENT_SHARED flag, we have no
 choice but to abort configuration.
>>>
>>> The check was added to configure so it's caught earlier than during
>>> build. The 'fi du' command is useful, but not IMO critical to stop the
>>> build.
>>
>> What about just disabling subcommands using SHARED_EXTENT flag?
>> Like "fi du".
> 
> This would need to be checked at runtime, based only on kernel version.
> I think we can do that, print a warning in 'fi du' but otherwise let it
> continue.

Checking kernel version is pretty unreliable - distro kernels that backport
will quite possibly fail the version test.  Sure, it's possible to patch
the check back out of userspace in the distro, but one must first know that
such things exist, and look out for them...

Maybe warn on kernel version < 2.6.33 /and/ no shared extents found.

Or, why not just:

1) #ifndef FIEMAP_EXTENT_SHARED / #define FIEMAP_EXTENT_SHARED 0x2000 / 
#endif
2) carry on as usual.

Worst case is that you don't get shared accounting on very old kernels.

And FIEMAP_EXTENT_SHARED has been around since 2.6.33 so it's not going
to bite too many people, right?

(If you really wanted to, you could export has_fiemap_shared in the btrfs
features sysfs file, and runtime could see if shared extents are
detectable and if not, print a warning or something ... but given that
this would be a new feature much newer than the fiemap flag I'm not
sure that would do much good.)

-Eric
 



[PATCH] btrfs-progs: define FIEMAP_EXTENT_SHARED locally if needed

2016-10-14 Thread Eric Sandeen
As it stands today, btrfs-progs won't build if the installed
kernel headers don't define FIEMAP_EXTENT_SHARED.

But that doesn't tell us anything about the capabilities of
the /running/ kernel - so just define it locally if not found
in the fiemap header, and allow the build to proceed.

If run against an old kernel, worst case scenario is that
no shared extents will be reported via the du command.

Signed-off-by: Eric Sandeen 
---

you can take it or leave it, but I had this locally 
anyway, so if it's helpful here you go :)

diff --git a/cmds-fi-du.c b/cmds-fi-du.c
index a5b66e6..fa8f421 100644
--- a/cmds-fi-du.c
+++ b/cmds-fi-du.c
@@ -29,6 +29,11 @@
 #include 
 #include 
 
+/* Appeared upstream in 2.6.33 */
+#ifndef FIEMAP_EXTENT_SHARED
+#define FIEMAP_EXTENT_SHARED 0x2000
+#endif
+
 #include "utils.h"
 #include "commands.h"
 #include "kerncompat.h"
diff --git a/configure.ac b/configure.ac
index 8fd8f42..9b87b24 100644
--- a/configure.ac
+++ b/configure.ac
@@ -144,8 +144,6 @@ if test "$DISABLE_BTRFSCONVERT" = 0 && test "x$convertfs" = "x"; then
 	AC_MSG_ERROR([no filesystems for convert, use --disable-convert instead])
 fi
 
-AX_CHECK_DEFINE([linux/fiemap.h], [FIEMAP_EXTENT_SHARED], [],
-   [AC_MSG_ERROR([no definition of FIEMAP_EXTENT_SHARED found])])
 
 dnl Define _LIBS= and _CFLAGS= by pkg-config
 dnl



[PATCH] btrfs-progs: build: detect fiemap shared flag but don't fail build

2016-10-14 Thread David Sterba
The FIEMAP_EXTENT_SHARED fiemap flag was introduced in 2.6.33. If the
headers do not provide the definition, the build will fail. The support
of the fiemap sharing depends on the running kernel. There are still
systems with 2.6.32 kernel headers but running newer versions.

To support such environments, don't fail the build; provide our own
definition of the flag and detect whether an old kernel is in use in the
relevant command (btrfs fi du).

Reported-by: Abhay Sachan 
Signed-off-by: David Sterba 
---
 cmds-fi-du.c | 14 ++
 configure.ac | 10 +-
 2 files changed, 23 insertions(+), 1 deletion(-)

diff --git a/cmds-fi-du.c b/cmds-fi-du.c
index a5b66e6fdf1a..895c242e821c 100644
--- a/cmds-fi-du.c
+++ b/cmds-fi-du.c
@@ -27,8 +27,13 @@
 
 #include 
 #include 
+#include 
 #include 
 
+#if !defined(FIEMAP_EXTENT_SHARED) && (HAVE_OWN_FIEMAP_EXTENT_SHARED_DEFINE == 1)
+#define FIEMAP_EXTENT_SHARED   0x2000
+#endif
+
 #include "utils.h"
 #include "commands.h"
 #include "kerncompat.h"
@@ -546,6 +551,7 @@ int cmd_filesystem_du(int argc, char **argv)
 {
int ret = 0, err = 0;
int i;
+   u32 kernel_version;
 
unit_mode = get_unit_mode_from_arg(&argc, argv, 1);
 
@@ -570,6 +576,14 @@ int cmd_filesystem_du(int argc, char **argv)
if (check_argc_min(argc - optind, 1))
usage(cmd_filesystem_du_usage);
 
+   kernel_version = get_running_kernel_version();
+
+   if (kernel_version < KERNEL_VERSION(2,6,33)) {
+   warning(
+"old kernel version detected, shared space will be reported as exclusive\n"
+"due to missing support for FIEMAP_EXTENT_SHARED flag");
+   }
+
printf("%10s  %10s  %10s  %s\n", "Total", "Exclusive", "Set shared",
"Filename");
 
diff --git a/configure.ac b/configure.ac
index 8fd8f425a076..9c6df3b9d4ed 100644
--- a/configure.ac
+++ b/configure.ac
@@ -144,8 +144,16 @@ if test "$DISABLE_BTRFSCONVERT" = 0 && test "x$convertfs" = "x"; then
 	AC_MSG_ERROR([no filesystems for convert, use --disable-convert instead])
 fi
 
+HAVE_OWN_FIEMAP_EXTENT_SHARED_DEFINE=0
 AX_CHECK_DEFINE([linux/fiemap.h], [FIEMAP_EXTENT_SHARED], [],
-   [AC_MSG_ERROR([no definition of FIEMAP_EXTENT_SHARED found])])
+   [HAVE_OWN_FIEMAP_EXTENT_SHARED_DEFINE=1
+    AC_MSG_WARN([no definition of FIEMAP_EXTENT_SHARED found, probably old kernel, will use own defintion, 'btrfs fi du' might report wrong numbers])])
+
+if test "x$HAVE_OWN_FIEMAP_EXTENT_SHARED_DEFINE" == "x1"; then
+AC_DEFINE([HAVE_OWN_FIEMAP_EXTENT_SHARED_DEFINE], [1], [We defined FIEMAP_EXTENT_SHARED])
+else
+AC_DEFINE([HAVE_OWN_FIEMAP_EXTENT_SHARED_DEFINE], [0], [We did not define FIEMAP_EXTENT_SHARED])
+fi
 
 dnl Define _LIBS= and _CFLAGS= by pkg-config
 dnl
-- 
2.10.0



Re: Btrfs progs build fails for 4.8

2016-10-14 Thread David Sterba
On Fri, Oct 14, 2016 at 12:00:56PM -0500, Eric Sandeen wrote:
> On 10/14/16 10:49 AM, David Sterba wrote:
> > On Thu, Oct 13, 2016 at 03:25:57PM +0800, Qu Wenruo wrote:
> >> At 10/13/2016 01:26 AM, David Sterba wrote:
> >>> On Wed, Oct 12, 2016 at 10:01:27PM +0800, Qu Wenruo wrote:
> 
> 
>  On 10/12/2016 09:58 PM, Abhay Sachan wrote:
> > Hi,
> > I tried building latest btrfs progs on CentOS 6, "./configure" failed
> > with the error:
> >
> > checking for FIEMAP_EXTENT_SHARED defined in linux/fiemap.h... no
> > configure: error: no definition of FIEMAP_EXTENT_SHARED found
> >
> > Any ideas what am I lacking in the system?
> 
>  Your kernel is too old and its header doesn't support
>  FIEMAP_EXTENT_SHARED flag for fiemap ioctl.
> >>>
> >>> As this is not the first time, I wonder if we could provide a definition
> >>> of the macro in progs, regardless of the installed headers. Note that
> >>> this does not mean the running kernel is old. Previously the user said
> >>> he was running a 4.4 kernel [1] (or it could be any other kernel
> >>> version). For that combination of headers and running kernel, I think
> >>> it's ok to add a fallback definition.
> >>
> >> Yes, a fallback definition is good for the case where the running kernel
> >> supports the SHARED_EXTENT flag but the kernel headers don't.
> >>
> >> But I'm a little concerned that such a fallback definition will lead to
> >> corrupted results for "fi du".
> > 
> > Well, not corrupted but wrong. The shared data will be reported as
> > exclusive.
> > 
>  And since that flag is very important for tools like "btrfs filesystem
>  du", for old kernels that don't support the EXTENT_SHARED flag, we have no
>  choice but to abort configuration.
> >>>
> >>> The check was added to configure so it's caught earlier than during
> >>> build. The 'fi du' command is useful, but not IMO critical to stop the
> >>> build.
> >>
> >> What about just disabling subcommands using SHARED_EXTENT flag?
> >> Like "fi du".
> > 
> > This would need to be checked at runtime, based only on kernel version.
> > I think we can do that, print a warning in 'fi du' but otherwise let it
> > continue.
> 
> Checking kernel version is pretty unreliable - distro kernels that backport
> will quite possibly fail the version test.  Sure, it's possible to patch
> the check back out of userspace in the distro, but one must first know that
> such things exist, and look out for them...
> 
> Maybe warn on kernel version < 2.6.33 /and/ no shared extents found.

Yeah, the version check is not reliable; 'fi du' will just warn. As the
warning is printed at the beginning, we don't yet know whether there will
be any shared extents.

> Or, why not just:
> 
> 1) #ifndef FIEMAP_EXTENT_SHARED / #define FIEMAP_EXTENT_SHARED 0x2000 / 
> #endif
> 2) carry on as usual.
> 
> Worst case is that you don't get shared accounting on very old kernels.

Yes, implemented it that way.

> And FIEMAP_EXTENT_SHARED has been around since 2.6.33 so it's not going
> to bite too many people, right?

Right, I want to avoid another report that the build failed on centos 6.


Re: Btrfs progs build fails for 4.8

2016-10-14 Thread David Sterba
On Wed, Oct 12, 2016 at 07:28:20PM +0530, Abhay Sachan wrote:
> Hi,
> I tried building latest btrfs progs on CentOS 6, "./configure" failed
> with the error:
> 
> checking for FIEMAP_EXTENT_SHARED defined in linux/fiemap.h... no
> configure: error: no definition of FIEMAP_EXTENT_SHARED found
> 
> Any ideas what am I lacking in the system?

Can you please test the build from current devel branch? [1] Configure
will just print a warning if the flag is not defined, 'btrfs fi du' will
print another warning if it's running on a kernel < 2.6.33.

[1] git://github.com/kdave/btrfs-progs (branch devel)


Re: Is is possible to submit binary image as fstest test case?

2016-10-14 Thread David Sterba
On Mon, Oct 10, 2016 at 10:56:12AM +1100, Dave Chinner wrote:
> On Fri, Oct 07, 2016 at 06:05:51PM +0200, David Sterba wrote:
> > On Fri, Oct 07, 2016 at 08:18:38PM +1100, Dave Chinner wrote:
> > > On Thu, Oct 06, 2016 at 04:12:56PM +0800, Qu Wenruo wrote:
> > > > So I'm wondering if I can just upload a zipped raw image as part of
> > > > the test case?
> > > 
> > > Preferably not. We've managed to avoid pre-built images in xfstests
> > > for 15 years, so there'd have to be a really good reason to start
> > > doing this, especially as once we open that floodgate we'll end up
> > > with everyone wanting to do this and it will blow out the size of
> > > the repository in no time.
> > > 
> > > If the issue is just timing or being unable to trigger an error
> > > at the right time, this is what error injection frameworks or
> > > debug-only sysfs hooks are for. The XFS kernel code has both,
> > > xfstests use both, and they pretty much do away with the need for
> > > custom binary filesystem images for such testing...
> > 
> > An error injection framework may not always be available, e.g. in kernels
> > built for production. Yet I'd like to make the test possible.
> 
> Why would you ever enable error injection on a production kernel?

That's not what I mean. Let me rephrase. If the test needs the injection
infrastructure, it can't be run with a production build. A crafted image
can reproduce that.

> We simply don't run the error injection tests when the
> infrastructure is not there, just like we do with all other tests
> that are dependent on optional kernel/fs features

I think this depends on the test and the bug class it's supposed to hit.
The images need some determinism, eg. a traversal of structures, or
existence (or not) of some state, or error handling.

> > Also, I find it useful to keep the exact images that are attached to a
> > report and not necessarily try to recreate it to trigger a bug. If the
> > images are small, hosting them with the sources makes testing easy.
> > Big images would probably go to own repository and be linked.
> > 
> > I don't really expect fstests to store crafted images and would advise
> > Qu to not even try to propose that, because the answer was quite
> > predictable.
> 
> You say that like it's a bad thing?

From the POV of fstests, it won't work, except possibly for a very limited
number of tests. As it's a multi-filesystem project, this won't scale.
You said that earlier and I consider this obvious.

What we can do for fstests, configure an external path with filesystem
images and let tests use it for any purpose. In btrfs-progs we eg. don't
have mount tests of the fuzzed images, but I would like to make it more
accessible for wider testing. In a way that's acceptable for fstests.

> What's the guarantee that a
> specific corrupt image will always be sufficient to trigger the
> problem the test is supposed to check? i.e. we could change a
> different part of the FS code and that could mask the bug the image
> used to trigger. The test then passes, but we haven't actually fix
> the bug that the test used to exercise. i.e. we've got a false "we
> fixed the problem" report, when all we did is prevent a specific
> vector from tripping over it.

Again, I think this depends. I can't be sure the image will always
trigger the bug that was there originally. I've seen that some fuzzed
images required adding early checks which, in some cases, kept other
images from reaching the original point of failure. As long as the test
assumptions are met, the test will not give a false sense of fixing. In
the case of fuzzed images, one may not cover all the assumptions, but I
take them more as statistical proof that we don't have new bugs. And
it's convenient to have the images at hand.

> Error injection and sysfs hooks into debug code are much more
> reliable ways of testing that the root cause of a problem is fixed
> and stays fixed.  The reproducing trigger cannot be masked by
> changes in other code layers, so we know that if the error
> injection/sysfs hook test handles the problem correctly, we have
> actually fixed the underlying bug and not just masked it...

Agreed. This applies to the example that Qu asked about, as balance
modifies a global filesystem state, a single image won't cover many
variants of structures (with snapshots, optional mkfs features). The
injection framework will give better testing coverage.


Re: RAID system with adaption to changed number of disks

2016-10-14 Thread Zygo Blaxell
On Fri, Oct 14, 2016 at 01:16:05AM -0600, Chris Murphy wrote:
> OK so we know for raid5 data block groups there can be RMW. And
> because of that, any interruption results in the write hole. On Btrfs
> though, the write hole is on disk only. If there's a lost strip
> (failed drive or UNC read), reconstruction from wrong parity results
> in a checksum error and EIO. That's good.
> 
> However, what happens in the metadata case? If metadata is raid5, and
> there's a crash or power failure during metadata RMW, same problem,
> wrong parity, bad reconstruction, csum mismatch, and EIO. So what's
> the effect of EIO when reading metadata? 

The effect is you can't access the page or anything referenced by
the page.  If the page happens to be a root or interior node of
something important, large parts of the filesystem are inaccessible,
or the filesystem is not mountable at all.  RAID device management and
balance operations don't work because they abort as soon as they find
the first unreadable metadata page.

In theory it's still possible to rebuild parts of the filesystem offline
using backrefs or brute-force search.  Using an old root might work too,
but in bad cases the newest viable root could be thousands of generations
old (i.e. it's more likely that no viable root exists at all).

> And how common is RMW for metadata operations?

RMW in metadata is the norm.  It happens on nearly all commits--the only
exception seems to be when both ends of a commit write happen to land
on stripe boundaries accidentally, which is less than 1% of the time on
3 disks.

> I wonder if this is where all of these damn strange cases come from,
> where people can't do anything at all with a normally degraded raid5 -
> one device failed, and no other failures, but they can't mount due to a
> bunch of csum errors.

I'm *astonished* to hear about real-world successes with raid5 metadata.
The total-loss failure reports are the result I expect.

The current btrfs raid5 implementation is a thin layer of bugs on top
of code that is still missing critical pieces.  There is no mechanism to
prevent RMW-related failures combined with zero tolerance for RMW-related
failures in metadata, so I expect a btrfs filesystem using raid5 metadata
to be extremely fragile.  Failure is not likely--it's *inevitable*.

The non-RMW-aware allocator almost maximizes the risk of RMW data loss.
Every transaction commit contains multiple tree root pages, which
are the most critical metadata that could be lost due to RMW failure.
There is a window at least a few milliseconds wide, and potentially
several seconds wide, where some data on disk is in an unrecoverable
state due to RMW.  This happens twice a minute with the default commit
interval and 99% of commits are affected.  That's a million opportunities
per machine-year to lose metadata.  If a crash lands on one of those,
boom, no more filesystem.

I expect one random crash (i.e. a crash that is not strongly correlated
to RMW activity) out of 30-2000 (depending on filesystem size, workload,
rotation speed, btrfs mount parameters) will destroy a filesystem under
typical conditions.  Real world crashes tend not to be random (i.e. they
are strongly correlated to RMW activity), so filesystem loss will be
much more frequent in practice.


> 
> Chris Murphy




[GIT PULL] Btrfs

2016-10-14 Thread Chris Mason
Hi Linus,

My for-linus-4.9 branch:

git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs.git 
for-linus-4.9

Has some fixes from Omar and Dave Sterba for our new free space tree.
This isn't heavily used yet, but as we move toward making it the new
default we wanted to nail down an endian bug.

Omar Sandoval (5) commits (+259/-145):
Btrfs: expand free space tree sanity tests to catch endianness bug (+96/-68)
Btrfs: fix extent buffer bitmap tests on big-endian systems (+51/-36)
Btrfs: fix free space tree bitmaps on big-endian systems (+76/-27)
Btrfs: fix mount -o clear_cache,space_cache=v2 (+12/-12)
Btrfs: catch invalid free space trees (+24/-2)

David Sterba (2) commits (+13/-12):
btrfs: tests: uninline member definitions in free_space_extent (+2/-1)
btrfs: tests: constify free space extent specs (+11/-11)

Total: (7) commits (+272/-157)

 fs/btrfs/ctree.h   |   3 +-
 fs/btrfs/disk-io.c |  33 +++---
 fs/btrfs/extent_io.c   |  64 +++
 fs/btrfs/extent_io.h   |  22 
 fs/btrfs/free-space-tree.c |  19 ++--
 fs/btrfs/tests/extent-io-tests.c   |  87 ---
 fs/btrfs/tests/free-space-tree-tests.c | 189 +++--
 include/uapi/linux/btrfs.h |  12 ++-
 8 files changed, 272 insertions(+), 157 deletions(-)


Re: RAID system with adaption to changed number of disks

2016-10-14 Thread Duncan
Zygo Blaxell posted on Fri, 14 Oct 2016 15:55:45 -0400 as excerpted:

> The current btrfs raid5 implementation is a thin layer of bugs on top of
> code that is still missing critical pieces.  There is no mechanism to
> prevent RMW-related failures combined with zero tolerance for
> RMW-related failures in metadata, so I expect a btrfs filesystem using
> raid5 metadata to be extremely fragile.  Failure is not likely--it's
> *inevitable*.

Wow, that's a signature-quality quote reflecting just how dire the 
situation with btrfs parity-raid is ATM.  First sentence for a short sig, 
full paragraph for a longer one.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



Re: RAID system with adaption to changed number of disks

2016-10-14 Thread Chris Murphy
On Fri, Oct 14, 2016 at 1:55 PM, Zygo Blaxell
 wrote:

>
>> And how common is RMW for metadata operations?
>
> RMW in metadata is the norm.  It happens on nearly all commits--the only
> exception seems to be when both ends of a commit write happen to land
> on stripe boundaries accidentally, which is less than 1% of the time on
> 3 disks.

In the interest of due diligence, and given that I can't confirm or deny
this myself from reading the code (although I do see many comments
involving RMW in the code), I must ask Qu if he can corroborate this.

Because that basically means btrfs raid56 is no better than md raid56 - by
design. It has nothing to do with bugs. This is substantially worse
than the scrub->wrong parity bug.

Does it make sense to proscribe raid5 profile for metadata? As in,
disallow -m raid5 at mkfs time? Maybe recommend raid1. Even raid6
seems like it could be specious - yes there are two copies but if
there is constant RMW, then there's no CoW and we're not really
protected that well with all of these overwrites, statistically
speaking.

Basically you have to have a setup where there's no chance of torn or
misdirected writes, and no corruptions, in which case Btrfs checksums
aren't really helpful, you're using it for other reasons (snapshots
and what not).

Really seriously the CoW part of Btrfs being violated by all of this
RMW to me sounds like it reduces the pros of Btrfs.



-- 
Chris Murphy


Re: raid6 file system in a bad state

2016-10-14 Thread Chris Murphy
This may be relevant and is pretty terrible.

http://www.spinics.net/lists/linux-btrfs/msg59741.html


Chris Murphy


Re: Copy BTRFS volume to another BTRFS volume including subvolumes and snapshots

2016-10-14 Thread Alberto Bursi


On 10/14/2016 01:38 PM, Austin S. Hemmelgarn wrote:
> On 2016-10-13 17:21, Alberto Bursi wrote:
>> Hi, I'm using OpenSUSE on a btrfs volume spanning 2 disks (set as raid1
>> for both metadata and data), no separate /home partition.
>> -
>> I'd like to be able to clone verbatim the whole volume to another
>> volume, for backup purposes.
>>
>> Now, I think I can do that with a brutal partition clone from my
>> "recovery" (a debian system installed in another drive) and then doing
>> the procedures to deal with a lost drive.
>>
>> But I'd rather prefer a clone on a live system, and without doing a
>> brutal clone as that will keep the same UUIDs.
>>
>> I can(will) script this so even if it is a tedious process or it
>> involves writing a huge line in the commandline it's not an issue.
> I'm not sure there is any way to do this on a live system.  You
> essentially need to either do a block level copy and change the UUID's
> (which recent versions of btrfstune can do), or use send with some
> manual setup to get the subvolumes across.  Both options require the
> filesystem to be effectively read-only, which is not something that any
> modern Linux distro can reliably handle for more than a few minutes.
>
> If I had to do this, I'd go with the block level copy, since it requires
> a lot less effort, just make sure to use btrfstune to change the UUID's
> when the copy is done (that may take a while itself though, since it has
> to rewrite a lot of metadata).

Heh, one of the reasons I migrated to btrfs was that I wanted to do 
things like these on a live system.

With my script I can already generate any necessary folder structure and 
send over all subvolumes with a bunch of btrfs send | btrfs receive with 
like 5 lines of script.

I was hoping there was some neat trick with btrfs send | btrfs receive
that allowed me to send a snapshot of / to the / of the new volume.

With rsync (from a ro snapshot of /) it should be possible to use the 
subvolume list as exclusion list, but I'd have rather wanted to use a 
btrfs feature instead.

-Alberto


Re: Copy BTRFS volume to another BTRFS volume including subvolumes and snapshots

2016-10-14 Thread Chris Murphy
It should be that -e can accept a listing of all the subvolumes you want to
send at once. And possibly an -r flag, if it existed, could
automatically populate -e. But the last time I tested -e I just got
errors.

https://bugzilla.kernel.org/show_bug.cgi?id=111221


-- 
Chris Murphy


Re: RAID system with adaption to changed number of disks

2016-10-14 Thread Chris Murphy
On Fri, Oct 14, 2016 at 3:38 PM, Chris Murphy  wrote:
> On Fri, Oct 14, 2016 at 1:55 PM, Zygo Blaxell
>  wrote:
>
>>
>>> And how common is RMW for metadata operations?
>>
>> RMW in metadata is the norm.  It happens on nearly all commits--the only
>> exception seems to be when both ends of a commit write happen to land
>> on stripe boundaries accidentally, which is less than 1% of the time on
>> 3 disks.
>
> In the interest of due diligence, and the fact I can't confirm or deny
> this myself from reading the code (although I do see many comments
> involving RMW in the code), I must ask Qu if he can corroborate this.
>
> Because that basically means btrfs raid56 is no better than md raid56 - by
> design. It has nothing to do with bugs. This is substantially worse
> than the scrub->wrong parity bug.
>
> Does it make sense to proscribe raid5 profile for metadata? As in,
> disallow -m raid5 at mkfs time? Maybe recommend raid1. Even raid6
> seems like it could be specious - yes there are two copies but if
> there is constant RMW, then there's no CoW and we're not really
> protected that well with all of these overwrites, statistically
> speaking.
>
> Basically you have to have a setup where there's no chance of torn or
> misdirected writes, and no corruptions, in which case Btrfs checksums
> aren't really helpful, you're using it for other reasons (snapshots
> and what not).
>
> Really seriously the CoW part of Btrfs being violated by all of this
> RMW to me sounds like it reduces the pros of Btrfs.


Also, is there RMW with raid0, or even raid10? Or is that always CoW
for metadata and data, just like single and dup? If raid0 is always
CoW, then I don't think it's correct to consider raid5 minus parity to
be anything like raid0 - in a Btrfs context anyway. Outside of that
context, I understand the argument.



-- 
Chris Murphy


Re: Copy BTRFS volume to another BTRFS volume including subvolumes and snapshots

2016-10-14 Thread Alberto Bursi


On 10/15/2016 12:17 AM, Chris Murphy wrote:
> It should be -e can accept a listing of all the subvolumes you want to
> send at once. And possibly an -r flag, if it existed, could
> automatically populate -e. But the last time I tested -e I just got
> errors.
>
> https://bugzilla.kernel.org/show_bug.cgi?id=111221
>
>

Not a problem (for me anyway), I can send all subvolumes already with my 
script (one after another, but still automatically).

What I can't do with btrfs commands is to send over the contents of a ro 
snapshot of / called for example "oldRootSnapshot", directly to 
"/tmp/newroot" (which is where I have mounted the other drive/volume).

The only thing I can do is send over the subvolume as a subvolume.
So I end up with /tmp/newroot/oldRootSnapshot and inside oldRootSnapshot 
I get my root, not what I wanted.

The only way I found so far is using rsync to copy the contents of
oldRootSnapshot into /tmp/newroot, setting an exclusion list for all
subvolumes, then running a deduplication with duperemove.

So, is there something I missed to do that?

-Alberto

Re: RAID system with adaption to changed number of disks

2016-10-14 Thread Zygo Blaxell
On Fri, Oct 14, 2016 at 04:30:42PM -0600, Chris Murphy wrote:
> Also, is there RMW with raid0, or even raid10? 

No.  Mirroring is writing the same data in two isolated places.  Striping
is writing data at different isolated places.  No matter which sectors
you write through these layers, it does not affect the correctness of
data in any sector at a different logical address.  None of these use
RMW--you read or write only complete sectors and act only on the specific
sectors requested.  Only parity RAID does RMW.

e.g. in RAID0, when you modify block 47, you may actually modify block
93 on a different disk, but there's always a 1:1 mapping between every
logical and physical address.  If there is a crash we go back to an
earlier tree that does not contain block 47/93 so we don't care if the
write was interrupted.

e.g. in RAID1, when you modify block 47, you modify physical block 47 on
two separate disks.  The state of disk1-block47 may be different from
the state of disk2-block47 if the write is interrupted.  If there is a
crash we go back to an earlier tree that does not contain either copy
of block 47 so we don't care about any inconsistency there.

So raid0, single, dup, raid1, and raid10 are OK--they fall into one or
both of the above cases.  CoW works there.  None of these properties
change in degraded mode with the mirroring profiles.

Parity RAID is writing data in non-isolated places.  When you write to
some sectors, additional sectors are implicitly modified in degraded mode
(whether you are in degraded mode at the time of the writes or not).
This is different from the other cases because the other cases never
modify any sectors that were not explicitly requested by the upper layer.
This is OK if and only if the CoW layer is aware of this behavior and
works around it.
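The implicit modification is easy to see in the XOR parity math. A minimal sketch (again not btrfs code; one byte per strip, 3 data strips + 1 parity) of an RMW update: changing one data strip forces a parity strip update that the upper layer never requested:

```python
# RMW on parity RAID: read old data + old parity, fold the old value out
# of the parity and the new value in, then write data strip AND parity.
def xor(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

data = [b'\x01', b'\x02', b'\x03']               # 3 data strips
parity = xor(xor(data[0], data[1]), data[2])     # parity strip

new = b'\x07'
parity = xor(parity, xor(data[1], new))  # parity changes implicitly
data[1] = new

# Parity is consistent again only because BOTH writes completed:
assert xor(xor(data[0], data[1]), data[2]) == parity
```

If a crash lands between the data write and the parity write, the two are inconsistent, which is exactly the opening for the write hole described below.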

> Or is that always CoW
> for metadata and data, just like single and dup? 

It's always CoW at the higher levels, even for parity RAID.  The problem
is that the CoW layer is not aware of the RMW behavior buried in the
parity RAID layer, so the combination doesn't work properly.

CoW thinks it's modifying only block 47, when in fact it's modifying
an entire stripe in degraded mode.  Let's assume 5-disk RAID5 with a
strip size of one block for this example, and say blocks 45-48 are one
RAID stripe.  If there is a crash, data in blocks 45, 46, 47, and 48
may be irretrievably damaged by inconsistent modification of parity and
data blocks.  When we try to go back to an earlier tree that does not
contain block 47, we will end up with a tree that contains corruption in
one of the blocks 45, 46, or 48.  This corruption will only be visible
when something else goes wrong (parity mismatch, data csum failure,
disk failure, or scrub) so a damaged filesystem that isn't degraded
could appear to be healthy for a long time.
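The scenario above can be simulated in a few lines. This is a sketch under the same assumptions (5-disk RAID5, strip size of one block, blocks 45-48 in one stripe), not real btrfs behavior: an RMW of block 47 is interrupted after the data strip lands but before parity is updated, then a *different* strip is lost and reconstructed from the stale parity:

```python
def xor(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

stripe = {45: b'A', 46: b'B', 47: b'C', 48: b'D'}  # 4 data strips
parity = b'\x00'
for v in stripe.values():
    parity = xor(parity, v)

stripe[47] = b'X'   # data strip written...
# ...crash here: parity was never updated. CoW rolls back to a tree that
# doesn't reference block 47, but parity is now stale for the whole stripe.

# Later the disk holding block 45 fails; reconstruct it from parity:
recon = parity
for blk in (46, 47, 48):
    recon = xor(recon, stripe[blk])

assert recon != b'A'  # block 45 comes back silently wrong
```

Note that block 45 was never written; it was corrupted purely by the interrupted write to its neighbor, and nothing notices until reconstruction (or scrub) is actually needed.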

If the CoW layer is aware of this, it can arrange operations such
that no stripe is modified while it is referenced by a committed tree.
Suppose the stripe at blocks 49-52 is empty, so we write our CoW block at
block 49 instead of 47.  Since blocks 50-52 contain no data we care about,
we don't even have to bother reading them (just fill the other blocks
with zero or find some other data to write in the same commit), and we
can eliminate many slow RMW operations entirely*.  If there is a crash
we just fall back to an earlier tree that does not contain block 49.
This tree is not damaged because we left blocks 45-48 alone.
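A sketch of what a CoW-aware allocator would do (the policy here is illustrative, not btrfs's actual allocator): only hand out stripes that hold no committed data, and report how much of the stripe must be padded in the same commit so the whole stripe is written at once and RMW never happens:

```python
STRIPE_WIDTH = 4  # data strips per stripe (hypothetical)

def allocate_cow_block(stripe_used, want_blocks=1):
    """Return (stripe_index, pad_blocks) for a stripe safe to write whole."""
    for i, used in enumerate(stripe_used):
        if used == 0:  # stripe holds no committed data: full-stripe write OK
            return i, STRIPE_WIDTH - want_blocks
    raise RuntimeError("no empty stripe: would need RMW or a stripe journal")

# Stripe 0 (blocks 45-48) is full; stripe 1 (blocks 49-52) is empty, so the
# CoW block goes into stripe 1 and the other three strips get padded:
print(allocate_cow_block([4, 0]))  # -> (1, 3)
```

With this rule, an interrupted write can only damage a stripe no committed tree references, so falling back to the earlier tree is always safe.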

One way to tell this is done right is all data in each RAID stripe will
always belong to exactly zero or one transaction, not dozens of different
transactions as stripes have now.

The other way to fix things is to make stripe RMW atomic so that CoW
works properly.  You can tell this is done right if you can find a stripe
update journal in the disk format or the code.
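For comparison, a toy model of the journaling fix (this mirrors the md raid5 write-journal idea, not anything present in btrfs): log the new data and parity durably before touching the stripe in place, and replay unretired records after a crash, so data and parity can never be seen half-updated:

```python
journal = []   # stands in for a dedicated journal device or log area
stripes = {0: {"data": [b'A', b'B', b'C'], "parity": b'\x00'}}

def journaled_stripe_update(stripe_id, new_data, new_parity):
    journal.append((stripe_id, new_data, new_parity))  # 1. log (made durable)
    s = stripes[stripe_id]
    s["data"], s["parity"] = new_data, new_parity      # 2. in-place RMW
    journal.pop()                                      # 3. retire the record

def replay():
    """After a crash, redo every logged-but-unretired stripe update."""
    for stripe_id, new_data, new_parity in journal:
        stripes[stripe_id] = {"data": new_data, "parity": new_parity}
    journal.clear()
```

A crash between steps 1 and 3 leaves a record in the journal; replay re-applies the whole data+parity update atomically from the reader's point of view, closing the write hole at the cost of writing everything twice.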

> If raid0 is always
> CoW, then I don't think it's correct to consider raid5 minus parity to
> be anything like raid0 - in a Btrfs context anyway. Outside of that
> context, I understand the argument.
> 
> 
> 
> -- 
> Chris Murphy

[*] We'd still need parity RAID RMW for nodatacow and PREALLOC because
neither uses the CoW layer.  That doesn't matter for nodatacow because
nodatacow is how users tell us they don't want to read their data any
more, but it has interesting implications for PREALLOC.  Maybe a solution
for PREALLOC is to do the first write strictly in RAID-stripe-sized units?

