btrfs_sync_file alignment trap on arm (kernel 4.2.5)

2015-11-04 Thread Cody P Schafer
Ideas as to what could cause this would be appreciated.

This is consistently triggered shortly after boot (I presume due to
connmand calling fsync on a file).

Note that I'm not quite running 4.2.5, but none of the changes I have
additionally applied are to btrfs or atomics.

Let me know if there is a way for me to get you more info.
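
If it helps, the same path can presumably be exercised by hand with
something like this (a sketch; the mount point and file name are
examples, xfs_io comes from xfsprogs):

  # hypothetical manual trigger for the btrfs fsync path
  xfs_io -f -c "pwrite 0 4k" -c "fsync" /mnt/testfile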

It looks like the line is:

mutex_lock(&inode->i_mutex);
atomic_inc(&root->log_batch);
full_sync = test_bit(BTRFS_INODE_NEEDS_FULL_SYNC,
		     &BTRFS_I(inode)->runtime_flags);


addr2line of the trapping instruction:
addr2line -e \
  ./work/beaglebone-poky-linux-gnueabi/linux-yocto-ikabit/4.2.5+gitAUTOINC+c29ac1-r1/linux-beaglebone-standard-build/arch/arm/boot/vmlinux \
  c0217a34 -i

/home/cody/obj/y/tmp/work-shared/beaglebone/kernel-source/arch/arm/include/asm/atomic.h:194

/home/cody/obj/y/tmp/work-shared/beaglebone/kernel-source/fs/btrfs/file.c:1886
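
For completeness, the faulting opcode can be disassembled out of the
same vmlinux used for addr2line above (a sketch; the cross-objdump name
and the address window are assumptions). The opcode e1993f9f should
decode to an exclusive load (ldrex r3, [r9]), which requires a
word-aligned address - and r9 appears to hold the bad pointer reported
in the fault log below:

  # dump a few instructions around the trapping PC (c0217a34)
  arm-poky-linux-gnueabi-objdump -d vmlinux \
      --start-address=0xc0217a20 --stop-address=0xc0217a40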

fault log:

[   11.488382] Alignment trap: not handling instruction e1993f9f at []
[   11.560558] Unhandled fault: alignment exception (0x001) at 0x026d
[   11.607301] pgd = dc09
[   11.610166] [026d] *pgd=9c063831, *pte=, *ppte=
[   11.665548] Internal error: : 1 [#1] PREEMPT ARM
[   11.670388] Modules linked in: omaplfb(O) bufferclass_ti(O) pvrsrvkm(O)
[   11.677341] CPU: 0 PID: 248 Comm: connmand Tainted: G   O
 4.2.5-yocto-standard #1
[   11.686172] Hardware name: Generic AM33XX (Flattened Device Tree)
[   11.692551] task: dc00d100 ti: dc068000 task.ti: dc068000
[   11.698219] PC is at btrfs_sync_file+0x104/0x3f4
[   11.703051] LR is at btrfs_sync_file+0x100/0x3f4
[   11.707883] pc : []lr : []psr: 60060013
[   11.707883] sp : dc069e98  ip : dc069e98  fp : dc069f1c
[   11.719897] r10:   r9 : 026d  r8 : dd746b40
[   11.725363] r7 : dcef8dcc  r6 :   r5 : dcef8d68  r4 : 0001
[   11.732192] r3 : 7fff  r2 :   r1 :   r0 : dcef8dcc
[   11.739024] Flags: nZCv  IRQs on  FIQs on  Mode SVC_32  ISA ARM  Segment user
[   11.746491] Control: 10c5387d  Table: 9c090019  DAC: 0015
[   11.752502] Process connmand (pid: 248, stack limit = 0xdc068210)
[   11.758878] Stack: (0xdc069e98 to 0xdc06a000)
[   11.763436] 9e80:
 7fff
[   11.771998] 9ea0: dc069f38 c014582c b30b  
37f6 81a4 d5012f68
[   11.780559] 9ec0:  8000   
 008d 
[   11.789121] 9ee0: 0400  0001 20b04e34 563924e6
dd746b40 dce2beb8 
[   11.797683] 9f00:   dc068000  dc069f54
dc069f20 c0171d00 c0217940
[   11.806245] 9f20:  7fff  dc069f38 c0145b6c
dd746b40  dd746b40
[   11.814807] 9f40: 0076 c000f7a4 dc069f74 dc069f58 c0171d3c
c0171c40  7fff
[   11.823369] 9f60:  c015cc9c dc069f94 dc069f78 c0171d7c
c0171d14  bebe2bf8
[   11.831931] 9f80: 00e64c4d b6a104d0 dc069fa4 dc069f98 c017204c
c0171d50  dc069fa8
[   11.840492] 9fa0: c000f600 c017203c bebe2bf8 00e64c4d 0009
bebe2bf8 008d 
[   11.849054] 9fc0: bebe2bf8 00e64c4d b6a104d0 0076 00e64810
00e60cd0 0009 
[   11.857616] 9fe0:  bebe2bd4 b6e63384 b6dcef00 60060010
0009 044a1103 099b0303
[   11.866199] [] (btrfs_sync_file) from []
(vfs_fsync_range+0xcc/0xd4)
[   11.874674] [] (vfs_fsync_range) from []
(vfs_fsync+0x34/0x3c)
[   11.882600] [] (vfs_fsync) from [] (do_fsync+0x38/0x54)
[   11.889890] [] (do_fsync) from [] (SyS_fsync+0x1c/0x20)
[   11.897188] [] (SyS_fsync) from []
(ret_fast_syscall+0x0/0x3c)
[   11.905116] Code: e14b05fc e1a7 eb1251f7 e1993f9f (e2833001)
[   13.418969] ---[ end trace d7bcd93aea7d243c ]---
[   13.442420] Kernel panic - not syncing: Fatal exception


Re: Unable to account for space usage in particular btrfs volume

2015-11-04 Thread Hugo Mills
On Wed, Nov 04, 2015 at 09:10:42PM +, OmegaPhil wrote:
> Back in September I noticed that 'sudo du -chs /mnt/storage-1' reported
> 887GB used and 'df -h' 920GB for this particular volume - I went on
> #btrfs for any suggestions, and balancing + defragging made no
> difference. It had no subvolumes/snapshots etc, I basically used it like
> a checksummed ext4fs.
> 
> Since the volume was converted from ext4, I redid it from scratch (so
> made with kernel v4.1.3 or v4.1.6 on this Debian Testing machine), and
> the problem went away.
> 
> After a couple of months, df reports 907GB used, whereas du says 884GB -
> I currently have 8 large (1-5.5TB) btrfs volumes in use;
> storage-1 is the only SSD volume and the only one with this problem.
> 
> No balancing or defragging this time; it didn't make a difference before
> and this is a relatively new volume.
> 
> Are there any sysadmin-level ways I can account for the ~23GB lost space?

   There's an issue where replacing blocks in the middle of an
existing extent won't split the extent, and thus the "old" blocks
aren't freed up, because they're held by the original extent (even
though not actually referenced by any existing file). This might be
what you're seeing.

   I'm not sure how to confirm this theory, or what to do about it if
it's true. (Defrag the file? Copy it elsewhere? Other?)
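
   If I were experimenting, I'd probably start with something like this
on a suspect file (a sketch; the path is an example - defrag rewrites
the extents, so any held-but-unreferenced space should come back):

   # rewrite one file's extents, flush, then compare usage
   sudo btrfs filesystem defragment -f /mnt/storage-1/vm/disk.img
   sync
   sudo btrfs filesystem df /mnt/storage-1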

   Two other cases for df > du are orphaned files, although 23 GiB of
orphans is large; and missing out the dot-files in the directory that
du is run from (if doing, say, "du *" rather than "du ."). I've been
bitten by both of those in the past.
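
   If it helps, deleted-but-still-open files (one common kind of
orphan) can be listed with lsof; a rough sketch, assuming the volume is
mounted at /mnt/storage-1:

   # list open files on that filesystem with a zero link count
   sudo lsof +L1 /mnt/storage-1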

   Hugo.

> Thanks for any help.
> 
> =
> 
> $ uname -a
> 
> Linux omega1 4.2.0-1-amd64 #1 SMP Debian 4.2.5-1 (2015-10-27) x86_64
> GNU/Linux
> 
> $ btrfs --version
> 
> btrfs-progs v4.2.2
> 
> $ sudo btrfs fi usage /mnt/storage-1
> 
> Overall:
> Device size:   953.87GiB
> Device allocated:  932.04GiB
> Device unallocated: 21.83GiB
> Device missing:0.00B
> Used:  906.10GiB
> Free (estimated):   45.35GiB  (min: 34.43GiB)
> Data ratio: 1.00
> Metadata ratio: 2.00
> Global reserve:512.00MiB  (used: 0.00B)
> 
> Data,single: Size:925.01GiB, Used:901.50GiB
>/dev/sdb925.01GiB
> 
> Metadata,single: Size:8.00MiB, Used:0.00B
>/dev/sdb  8.00MiB
> 
> Metadata,DUP: Size:3.50GiB, Used:2.30GiB
>/dev/sdb  7.00GiB
> 
> System,single: Size:4.00MiB, Used:0.00B
>/dev/sdb  4.00MiB
> 
> System,DUP: Size:8.00MiB, Used:128.00KiB
>/dev/sdb 16.00MiB
> 
> Unallocated:
>/dev/sdb 21.83GiB
> 
> $ sudo btrfs-show-super /dev/sdb
> 
> superblock: bytenr=65536, device=/dev/sdb
> -
> csum  0x7f6b70be [match]
> bytenr65536
> flags 0x1
>   ( WRITTEN )
> magic _BHRfS_M [match]
> fsid  27430475-c49a-4e3f-8f8d-be5c14be59db
> label storage-1
> generation114344
> root  683413471232
> sys_array_size226
> chunk_root_generation 114251
> root_level1
> chunk_root21004288
> chunk_root_level  1
> log_root  683413979136
> log_root_transid  0
> log_root_level0
> total_bytes   1024209543168
> bytes_used971565568000
> sectorsize4096
> nodesize  16384
> leafsize  16384
> stripesize4096
> root_dir  6
> num_devices   1
> compat_flags  0x0
> compat_ro_flags   0x0
> incompat_flags0x161
>   ( MIXED_BACKREF |
> BIG_METADATA |
> EXTENDED_IREF |
> SKINNY_METADATA )
> csum_type 0
> csum_size 4
> cache_generation  114344
> uuid_tree_generation  114344
> dev_item.uuid c6b32341-6300-4f21-8c3b-3d7d458c3668
> dev_item.fsid 27430475-c49a-4e3f-8f8d-be5c14be59db [match]
> dev_item.type 0
> dev_item.total_bytes  1024209543168
> dev_item.bytes_used   1000765128704
> dev_item.io_align 4096
> dev_item.io_width 4096
> dev_item.sector_size  4096
> dev_item.devid1
> dev_item.dev_group0
> dev_item.seek_speed   0
> dev_item.bandwidth0
> dev_item.generation   0
> 
> =
> 
> dmesg contains a lot of information which is superfluous to btrfs and
> personal; I can filter on a regex and report if necessary.
> 



-- 
Hugo Mills | There are three things you should never see being
hugo@... carfax.org.uk | made: laws, standards, and sausages.
http://carfax.org.uk/  |
PGP: E2AB1DE4  |



Re: Unable to account for space usage in particular btrfs volume

2015-11-04 Thread OmegaPhil
On 04/11/15 21:30, Hugo Mills wrote:
> On Wed, Nov 04, 2015 at 09:10:42PM +, OmegaPhil wrote:
>> Back in September I noticed that 'sudo du -chs /mnt/storage-1' reported
>> 887GB used and 'df -h' 920GB for this particular volume - I went on
>> #btrfs for any suggestions, and balancing + defragging made no
>> difference. It had no subvolumes/snapshots etc, I basically used it like
>> a checksummed ext4fs.
>>
>> Since the volume was converted from ext4, I redid it from scratch (so
>> made with kernel v4.1.3 or v4.1.6 on this Debian Testing machine), and
>> the problem went away.
>>
>> After a couple of months, df reports 907GB used, whereas du says 884GB -
>> I currently have 8 large (1-5.5TB) btrfs volumes in use;
>> storage-1 is the only SSD volume and the only one with this problem.
>>
>> No balancing or defragging this time; it didn't make a difference before
>> and this is a relatively new volume.
>>
>> Are there any sysadmin-level ways I can account for the ~23GB lost space?
> 
>There's an issue where replacing blocks in the middle of an
> existing extent won't split the extent, and thus the "old" blocks
> aren't freed up, because they're held by the original extent (even
> though not actually referenced by any existing file). This might be
> what you're seeing.
> 
>I'm not sure how to confirm this theory, or what to do about it if
> it's true. (Defrag the file? Copy it elsewhere? Other?)
> 
>Two other cases for df > du are orphaned files, although 23 GiB of
> orphans is large; and missing out the dot-files in the directory that
> du is run from (if doing, say, "du *" rather than "du ."). I've been
> bitten by both of those in the past.
> 
>Hugo.


The volume doesn't change hugely over time, so it really ought not to
have broken so quickly - a quick rundown of the storage usage:

36% general (small files, some smallish videos)
24% music
23% pr0n
17% VMs

But in terms of 'large files changing', it could be the VM disks perhaps
- I'll move them out, balance, and then move them back in again;
hopefully that'd be a meaningful test.

du-wise, it was run directly on the root directory - any other ideas
for auditing orphaned files?

Thanks








Unable to account for space usage in particular btrfs volume

2015-11-04 Thread OmegaPhil
Back in September I noticed that 'sudo du -chs /mnt/storage-1' reported
887GB used and 'df -h' 920GB for this particular volume - I went on
#btrfs for any suggestions, and balancing + defragging made no
difference. It had no subvolumes/snapshots etc, I basically used it like
a checksummed ext4fs.

Since the volume was converted from ext4, I redid it from scratch (so
made with kernel v4.1.3 or v4.1.6 on this Debian Testing machine), and
the problem went away.

After a couple of months, df reports 907GB used, whereas du says 884GB -
I currently have 8 large (1-5.5TB) btrfs volumes in use;
storage-1 is the only SSD volume and the only one with this problem.

No balancing or defragging this time; it didn't make a difference before
and this is a relatively new volume.

Are there any sysadmin-level ways I can account for the ~23GB lost space?

Thanks for any help.

=

$ uname -a

Linux omega1 4.2.0-1-amd64 #1 SMP Debian 4.2.5-1 (2015-10-27) x86_64
GNU/Linux

$ btrfs --version

btrfs-progs v4.2.2

$ sudo btrfs fi usage /mnt/storage-1

Overall:
Device size: 953.87GiB
Device allocated:932.04GiB
Device unallocated:   21.83GiB
Device missing:  0.00B
Used:906.10GiB
Free (estimated): 45.35GiB  (min: 34.43GiB)
Data ratio:   1.00
Metadata ratio:   2.00
Global reserve:  512.00MiB  (used: 0.00B)

Data,single: Size:925.01GiB, Used:901.50GiB
   /dev/sdb  925.01GiB

Metadata,single: Size:8.00MiB, Used:0.00B
   /dev/sdb8.00MiB

Metadata,DUP: Size:3.50GiB, Used:2.30GiB
   /dev/sdb7.00GiB

System,single: Size:4.00MiB, Used:0.00B
   /dev/sdb4.00MiB

System,DUP: Size:8.00MiB, Used:128.00KiB
   /dev/sdb   16.00MiB

Unallocated:
   /dev/sdb   21.83GiB

$ sudo btrfs-show-super /dev/sdb

superblock: bytenr=65536, device=/dev/sdb
-
csum0x7f6b70be [match]
bytenr  65536
flags   0x1
( WRITTEN )
magic   _BHRfS_M [match]
fsid27430475-c49a-4e3f-8f8d-be5c14be59db
label   storage-1
generation  114344
root683413471232
sys_array_size  226
chunk_root_generation   114251
root_level  1
chunk_root  21004288
chunk_root_level1
log_root683413979136
log_root_transid0
log_root_level  0
total_bytes 1024209543168
bytes_used  971565568000
sectorsize  4096
nodesize16384
leafsize16384
stripesize  4096
root_dir6
num_devices 1
compat_flags0x0
compat_ro_flags 0x0
incompat_flags  0x161
( MIXED_BACKREF |
  BIG_METADATA |
  EXTENDED_IREF |
  SKINNY_METADATA )
csum_type   0
csum_size   4
cache_generation114344
uuid_tree_generation114344
dev_item.uuid   c6b32341-6300-4f21-8c3b-3d7d458c3668
dev_item.fsid   27430475-c49a-4e3f-8f8d-be5c14be59db [match]
dev_item.type   0
dev_item.total_bytes1024209543168
dev_item.bytes_used 1000765128704
dev_item.io_align   4096
dev_item.io_width   4096
dev_item.sector_size4096
dev_item.devid  1
dev_item.dev_group  0
dev_item.seek_speed 0
dev_item.bandwidth  0
dev_item.generation 0

=

dmesg contains a lot of information which is superfluous to btrfs and
personal; I can filter on a regex and report if necessary.
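
Something like this would be my starting point (a sketch; filtering on
the device name is an assumption based on the usage output above):

  # keep only btrfs- and sdb-related lines
  dmesg | grep -iE 'btrfs|sdb'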





[PATCH v3] btrfs: qgroup: exit the rescan worker during umount

2015-11-04 Thread Justin Maggard
I was hitting a consistent NULL pointer dereference during shutdown that
showed the trace running through end_workqueue_bio().  I traced it back to
the endio_meta_workers workqueue being poked after it had already been
destroyed.

Eventually I found that the root cause was a qgroup rescan that was still
in progress while we were stopping all the btrfs workers.

Currently we explicitly pause balance and scrub operations in
close_ctree(), but we do nothing to stop the qgroup rescan.  We should
probably be doing the same for qgroup rescan, but that's a much larger
change.  This small change is good enough to allow me to unmount without
crashing.

v3: avoid more races by calling btrfs_qgroup_wait_for_completion()

Signed-off-by: Justin Maggard 
---
 fs/btrfs/disk-io.c | 3 +++
 fs/btrfs/qgroup.c  | 9 ++---
 2 files changed, 9 insertions(+), 3 deletions(-)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 2d46675..1eb0839 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -3780,6 +3780,9 @@ void close_ctree(struct btrfs_root *root)
fs_info->closing = 1;
smp_mb();
 
+   /* wait for the qgroup rescan worker to stop */
+   btrfs_qgroup_wait_for_completion(fs_info);
+
/* wait for the uuid_scan task to finish */
	down(&fs_info->uuid_tree_rescan_sem);
/* avoid complains from lockdep et al., set sem back to initial state */
diff --git a/fs/btrfs/qgroup.c b/fs/btrfs/qgroup.c
index 46476c2..75c0249 100644
--- a/fs/btrfs/qgroup.c
+++ b/fs/btrfs/qgroup.c
@@ -2286,7 +2286,7 @@ static void btrfs_qgroup_rescan_worker(struct btrfs_work *work)
goto out;
 
err = 0;
-   while (!err) {
+   while (!err && !btrfs_fs_closing(fs_info)) {
trans = btrfs_start_transaction(fs_info->fs_root, 0);
if (IS_ERR(trans)) {
err = PTR_ERR(trans);
@@ -2307,7 +2307,8 @@ out:
btrfs_free_path(path);
 
	mutex_lock(&fs_info->qgroup_rescan_lock);
-   fs_info->qgroup_flags &= ~BTRFS_QGROUP_STATUS_FLAG_RESCAN;
+   if (!btrfs_fs_closing(fs_info))
+   fs_info->qgroup_flags &= ~BTRFS_QGROUP_STATUS_FLAG_RESCAN;
 
if (err > 0 &&
fs_info->qgroup_flags & BTRFS_QGROUP_STATUS_FLAG_INCONSISTENT) {
@@ -2336,7 +2337,9 @@ out:
}
btrfs_end_transaction(trans, fs_info->quota_root);
 
-   if (err >= 0) {
+   if (btrfs_fs_closing(fs_info)) {
+   btrfs_info(fs_info, "qgroup scan paused");
+   } else if (err >= 0) {
btrfs_info(fs_info, "qgroup scan completed%s",
err > 0 ? " (inconsistency flag cleared)" : "");
} else {
-- 
2.6.2



[PATCH v2] btrfs: test unmount during quota rescan

2015-11-04 Thread Justin Maggard
This test case tests if we are able to unmount a filesystem while
a quota rescan is running.  Up to now (4.3) this would result
in a kernel NULL pointer dereference.

Fixed by patch (btrfs: qgroup: exit the rescan worker during umount).

Signed-off-by: Justin Maggard 
---
 tests/btrfs/114 | 61 +
 tests/btrfs/114.out |  2 ++
 tests/btrfs/group   |  1 +
 3 files changed, 64 insertions(+)
 create mode 100644 tests/btrfs/114
 create mode 100644 tests/btrfs/114.out

diff --git a/tests/btrfs/114 b/tests/btrfs/114
new file mode 100644
index 000..0a0e8ba
--- /dev/null
+++ b/tests/btrfs/114
@@ -0,0 +1,61 @@
+#! /bin/bash
+# FS QA Test No. btrfs/114
+#
+# btrfs quota scan/unmount sanity test
+# Make sure that unmounting during a quota rescan doesn't crash
+#
+#---
+# Copyright (c) 2015 NETGEAR, Inc.  All Rights Reserved.
+#
+# This program is free software; you can redistribute it and/or
+# modify it under the terms of the GNU General Public License as
+# published by the Free Software Foundation.
+#
+# This program is distributed in the hope that it would be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program; if not, write the Free Software Foundation,
+# Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301  USA
+#---
+#
+
+seq=`basename $0`
+seqres=$RESULT_DIR/$seq
+echo "QA output created by $seq"
+
+here=`pwd`
+tmp=/tmp/$$
+status=1   # failure is the default!
+trap "_cleanup; exit \$status" 0 1 2 3 15
+
+_cleanup()
+{
+   cd /
+   rm -f $tmp.*
+}
+
+# get standard environment, filters and checks
+. ./common/rc
+. ./common/filter
+
+# real QA test starts here
+_supported_fs btrfs
+_supported_os Linux
+_require_scratch
+
+_scratch_mkfs >>$seqres.full 2>&1
+_scratch_mount
+
+for i in `seq 0 1 45`; do
+   echo -n > $SCRATCH_MNT/file.$i
+done
+echo 3 > /proc/sys/vm/drop_caches
+$BTRFS_UTIL_PROG quota enable $SCRATCH_MNT
+_scratch_unmount
+
+echo "Silence is golden"
+status=0
+exit
diff --git a/tests/btrfs/114.out b/tests/btrfs/114.out
new file mode 100644
index 000..a2aa4a2
--- /dev/null
+++ b/tests/btrfs/114.out
@@ -0,0 +1,2 @@
+QA output created by 114
+Silence is golden
diff --git a/tests/btrfs/group b/tests/btrfs/group
index 7cf7dd7..10ab26b 100644
--- a/tests/btrfs/group
+++ b/tests/btrfs/group
@@ -116,3 +116,4 @@
 111 auto quick send
 112 auto quick clone
 113 auto quick compress clone
+114 auto qgroup
-- 
2.6.2



Re: [PATCH 0/5] Btrfs: Per-chunk degradable check

2015-11-04 Thread Qu Wenruo

Any new comments?

And to Chris, is it possible to pick this patchset up for 4.4?

IMHO it's small enough (less than 100 lines) and doesn't change degraded
mount behavior much.


Thanks,
Qu

Qu Wenruo wrote on 2015/09/21 10:10 +0800:

Btrfs currently uses num_tolerated_disk_barrier_failures to do a global
check for the tolerated number of missing devices.

Although the one-size-fits-all solution is quite safe, it's too strict if
data and metadata have different duplication levels.

For example, if one uses single data and RAID1 metadata on 2 disks, any
missing device will make the fs impossible to mount degraded.

But in fact, sometimes all the single chunks may be on the remaining
device, and in that case we should allow the fs to be mounted degraded
read-write.

Such case can be easily reproduced using the following script:
  # mkfs.btrfs -f -m raid1 -d single /dev/sdb /dev/sdc
  # wipefs -f /dev/sdc
  # mount /dev/sdb -o degraded,rw

If using btrfs-debug-tree to check /dev/sdb, one should find that the
data chunks are only on sdb, so degraded mount should in fact be allowed.
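
If one wants to double-check, something like this should show it (a
rough sketch; the exact output format is from memory, so treat it as an
assumption):

  # dump the chunk tree (tree id 3) and see which devid backs each chunk;
  # with the layout above, the single data chunks should list only devid 1
  btrfs-debug-tree -t 3 /dev/sdb | grep -E 'CHUNK_ITEM|stripe '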

This patchset will introduce a new per-chunk degradable check for btrfs,
allowing the above case to succeed, and it's quite small anyway.

Also, it provides the possibility for later enhancements, like
automatically adding the 'degraded' mount option when possible.

Cc: Anand Jain 

Qu Wenruo (5):
   btrfs: Introduce a new function to check if all chunks are OK for
 degraded mount
   btrfs: Do per-chunk check for mount time check
   btrfs: Do per-chunk degraded check for remount
   btrfs: Allow barrier_all_devices to do per-chunk device check
   btrfs: Cleanup num_tolerated_disk_barrier_failures

  fs/btrfs/ctree.h   |  2 --
  fs/btrfs/disk-io.c | 87 ++
  fs/btrfs/disk-io.h |  2 --
  fs/btrfs/super.c   | 11 ---
  fs/btrfs/volumes.c | 84 +---
  fs/btrfs/volumes.h |  5 
  6 files changed, 94 insertions(+), 97 deletions(-)




Re: Btrfs/RAID5 became unmountable after SATA cable fault

2015-11-04 Thread Duncan
Austin S Hemmelgarn posted on Wed, 04 Nov 2015 13:45:37 -0500 as
excerpted:

> On 2015-11-04 13:01, Janos Toth F. wrote:
>> But the worst part is that there are some ISO files which were
>> seemingly copied without errors but their external checksums (the one
>> which I can calculate with md5sum and compare to the one supplied by
>> the publisher of the ISO file) don't match!
>> Well... this, I cannot understand.
>> How could these files become corrupt from a single disk failure? And
>> more importantly: how could these files be copied without errors? Why
>> didn't Btrfs give a read error when the checksums didn't add up?
> If you can prove that there was a checksum mismatch and BTRFS returned
> invalid data instead of a read error or going to the other disk, then
> that is a very serious bug that needs to be fixed.  You need to keep in
> mind also however that it's completely possible that the data was bad
> before you wrote it to the filesystem, and if that's the case, there's
> nothing any filesystem can do to fix it for you.

As Austin suggests, if btrfs is returning data, and you haven't turned 
off checksumming with nodatasum or nocow, then it's almost certainly 
returning the data it was given to write out in the first place.  Whether 
that data it was given to write out was correct, however, is an 
/entirely/ different matter.

If ISOs are failing their external checksums, then something is going 
on.  Had you verified the external checksums when you first got the 
files?  That is, are you sure the files were correct as downloaded and/or 
ripped?

Where were the ISOs stored between original procurement/validation and 
writing to btrfs?  Is it possible you still have some/all of them on that 
media?  Do they still external-checksum-verify there?

Basically, assuming btrfs checksums are validating, there are three other 
likely possibilities for where the corruption could have come from before 
writing to btrfs.  Either the files were bad as downloaded or otherwise 
procured -- which is why I asked whether you verified them upon receipt 
-- or you have memory that's going bad, or your temporary storage is 
going bad, before the files ever got written to btrfs.

The memory going bad is a particularly worrying possibility, 
considering...

>> Now I am really considering moving from Linux to Windows and from
>> Btrfs RAID-5 to Storage Spaces RAID-1 + ReFS (the only limitation is
>> that ReFS is only "self-healing" on RAID-1, not RAID-5, so I need a new
>> motherboard with more native SATA connectors and an extra HDD). That
>> one seemed to actually do what it promises (abort any read operation
>> upon checksum errors [which always happens seamlessly on every read]
>> but look at the redundant data first and seamlessly "self-heal" if
>> possible). The only thing which made Btrfs look like a better
>> alternative was the RAID-5 support. But I recently experienced two
>> cases of 1 drive failing of 3, and it always turned out to be a smaller
>> or bigger disaster (completely lost data or inconsistent data).

> Have you considered looking into ZFS?  I hate to suggest it as an
> alternative to BTRFS, but it's a much more mature and well tested
> technology than ReFS, and has many of the same features as BTRFS (and
> even has the option for triple parity instead of the double you get with
> RAID6).  If you do consider ZFS, make a point to look at FreeBSD in
> addition to the Linux version, the BSD one was a much better written
> port of the original Solaris drivers, and has better performance in many
> cases (and as much as I hate to admit it, BSD is way more reliable than
> Linux in most use cases).
> 
> You should also seriously consider whether the convenience of having a
> filesystem that fixes internal errors itself with no user intervention
> is worth the risk of it corrupting your data.  Returning correct data
> whenever possible is one thing, being 'self-healing' is completely
> different.  When you start talking about things that automatically fix
> internal errors without user intervention is when most seasoned system
> administrators start to get really nervous.  Self correcting systems
> have just as much chance to make things worse as they do to make things
> better, and most of them depend on the underlying hardware working
> correctly to actually provide any guarantee of reliability.

I too would point you at ZFS, but there's one VERY BIG caveat, and one 
related smaller one!

The people who have a lot of ZFS experience say it's generally quite 
reliable, but gobs of **RELIABLE** memory are *absolutely* *critical*!  
The self-healing works well, *PROVIDED* memory isn't producing errors.  
Absolutely reliable memory is in fact *so* critical, that running ZFS on 
non-ECC memory is severely discouraged as a very real risk to your data.

Which is why the above hints that your memory may be bad are so 
worrying.  Don't even *THINK* about ZFS, particularly its self-healing 
features, if you're not running ECC RAM.

Re: Unable to account for space usage in particular btrfs volume

2015-11-04 Thread Duncan
OmegaPhil posted on Wed, 04 Nov 2015 21:53:09 + as excerpted:

> The volume doesn't change hugely over time, so it really ought not to
> have broken so quickly - a quick rundown of the storage usage:
> 
> 36% general (small files, some smallish videos)
> 24% music
> 23% pr0n
> 17% VMs
> 
> But in terms of 'large files changing', it could be the VM disks perhaps
> - I'll move them out, balance, and then move them back in again;
> hopefully that'd be a meaningful test.

VM image files (and large database files, for the same reason) are a bit 
of a problem on btrfs, and indeed any COW-based filesystem, since the 
random-rewrite pattern of that use-case is pretty much the absolute 
worst case for a COW-based filesystem.

And that would be the worst-case in terms of the unsplit extents issue 
Hugo was talking about as well.  So they may well be the problem, indeed.

Since you're not doing snapshotting (which conflicts with this option, 
with an imperfect workaround), setting nocow on those files may well 
eliminate the problem, but be aware, if you aren't already, that (1) 
nocow turns off checksumming as well, in order to avoid a race that 
could easily lead to data corruption, and (2) you can't just activate 
nocow on an existing file and expect it to work; the procedure is a bit 
more complicated, since nocow is only guaranteed to work if it's set at 
file creation.  Detailed instructions for #2 have been posted many 
times; the rough shape is sketched below, and if anything is unclear, ask.
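
A sketch of that procedure (directory and file names are examples):

  # nocow must be inherited at file creation: set +C on a fresh
  # directory, then copy (not reflink) the images into it
  mkdir /mnt/storage-1/vm-nocow
  chattr +C /mnt/storage-1/vm-nocow
  cp --reflink=never /mnt/storage-1/vm/*.img /mnt/storage-1/vm-nocow/
  lsattr /mnt/storage-1/vm-nocow/    # files should show the C attribute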

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



Re: 4.2.5 forced read-only -ENOSPC w/ free space

2015-11-04 Thread E V
FYI, 4.1.12 completed the big rsync without issues. Guess I'm using
longterm for now.

On Mon, Nov 2, 2015 at 9:53 AM, E V  wrote:
> During an rsync, with 20TB unallocated space. Currently, no snapshots.
> Should I try 4.1.12, or 4.3?
> dmesg:
> [122014.436612] BTRFS: error (device sde) in
> btrfs_run_delayed_refs:2781: errno=-28 No space left
> [122014.436615] BTRFS info (device sde): forced readonly
> [122014.436624] BTRFS: error (device sde) in
> btrfs_run_delayed_refs:2781: errno=-28 No space left
> [122014.436725] WARNING: CPU: 13 PID: 8025 at
> fs/btrfs/extent-tree.c:2781 btrfs_run_delayed_refs+0x97/0x195
> [btrfs]()
> [122014.436741] BTRFS: error (device sde) in
> __btrfs_prealloc_file_range:9636: errno=-28 No space left
> [122014.436772] BTRFS: error (device sde) in
> btrfs_start_dirty_block_groups:3461: errno=-28 No space left
> [122014.436777] BTRFS warning (device sde): Skipping commit of aborted
> transaction.
> [122014.436780] BTRFS: error (device sde) in cleanup_transaction:1710:
> errno=-5 IO failure
> [122014.436959] BTRFS: Transaction aborted (error -28)
> [122014.436961] Modules linked in: ipmi_si mpt2sas raid_class
> scsi_transport_sas dell_rbu nfsv3 nfsv4 nfsd auth_rpcgss oid_registry
> nfs_acl nfs lockd grace fscache sunrpc ext4 crc16 jbd2 ext2 coretemp
> joydev crct10dif_pclmul sha256_generic psmouse serio_raw hmac drbg
> aesni_intel iTCO_wdt ipmi_devintf iTCO_vendor_support dcdbas evdev
> aes_x86_64 glue_helper lrw gf128mul ablk_helper pcspkr cryptd lpc_ich
> mfd_core i7core_edac edac_core ipmi_msghandler acpi_power_meter button
> processor thermal_sys loop ext3 mbcache jbd btrfs xor raid6_pq
> hid_generic usbhid hid sg sd_mod crc32c_intel uhci_hcd ehci_pci
> ehci_hcd megaraid_sas ixgbe mdio ptp usbcore pps_core usb_common
> scsi_mod bnx2 [last unloaded: ipmi_si]
> [122014.437405] CPU: 13 PID: 8025 Comm: kworker/u66:13 Tainted: G
> I 4.2.5 #1
> [122014.437519] Workqueue: btrfs-extent-refs btrfs_extent_refs_helper [btrfs]
> [122014.437552]   0009 813af77a
> 880006ab7d08
> [122014.437606]  810421bb 3dac a01ee526
> 880342782f30
> [122014.437660]  ffe4 880100ea33b0 8803218ae800
> 880342782e08
> [122014.437714] Call Trace:
> [122014.437743]  [] ? dump_stack+0x40/0x50
> [122014.437773]  [] ? warn_slowpath_common+0x98/0xb0
> [122014.437817]  [] ?
> btrfs_run_delayed_refs+0x97/0x195 [btrfs]
> [122014.437863]  [] ? warn_slowpath_fmt+0x45/0x4a
> [122014.437906]  [] ?
> btrfs_run_delayed_refs+0x97/0x195 [btrfs]
> [122014.437965]  [] ?
> delayed_ref_async_start+0x33/0x71 [btrfs]
> [122014.438029]  [] ? normal_work_helper+0xc3/0x1fa [btrfs]
> [122014.438063]  [] ? process_one_work+0x159/0x286
> [122014.438093]  [] ? worker_thread+0x1d9/0x280
> [122014.438123]  [] ? rescuer_thread+0x27a/0x27a
> [122014.438152]  [] ? kthread+0xab/0xb3
> [122014.438180]  [] ? kthread_parkme+0x16/0x16
> [122014.438211]  [] ? ret_from_fork+0x3f/0x70
> [122014.438240]  [] ? kthread_parkme+0x16/0x16
> [122014.438268] ---[ end trace 1c8deab18b734f90 ]---
> [122014.438296] BTRFS: error (device sde) in
> btrfs_run_delayed_refs:2781: errno=-28 No space left
>
> btrfs file usage /mirror
> Overall:
> Device size: 140.07TiB
> Device allocated:119.96TiB
> Device unallocated:   20.11TiB
> Device missing:  0.00B
> Used:117.54TiB
> Free (estimated): 22.53TiB  (min: 12.47TiB)
> Data ratio:   1.00
> Metadata ratio:   2.00
> Global reserve:  512.00MiB  (used: 0.00B)
>
> Data,single: Size:119.66TiB, Used:117.24TiB
>/dev/sdb   24.91TiB
>/dev/sdc   24.91TiB
>/dev/sdd   34.92TiB
>/dev/sde   34.92TiB
>
> Metadata,RAID10: Size:151.00GiB, Used:149.88GiB
>/dev/sdb   37.75GiB
>/dev/sdc   37.75GiB
>/dev/sdd   37.75GiB
>/dev/sde   37.75GiB
>
> System,RAID10: Size:64.00MiB, Used:15.75MiB
>/dev/sdb   16.00MiB
>/dev/sdc   16.00MiB
>/dev/sdd   16.00MiB
>/dev/sde   16.00MiB
>
> Unallocated:
>/dev/sdb5.06TiB
>/dev/sdc5.06TiB
>/dev/sdd5.06TiB
>/dev/sde5.06TiB


Re: Btrfs/RAID5 became unmountable after SATA cable fault

2015-11-04 Thread Janos Toth F.
Well. Now I am really confused about Btrfs RAID-5!

So, I replaced all SATA cables (which are explicitly marked as being
aimed at SATA3 speeds) and all the 3x2Tb WD Red 2.0 drives with 3x4Tb
Seagate Constellation ES.3 drives and started from scratch. I
secure-erased every drive, created an empty filesystem and ran a
"long" SMART self-test on all drives before I started using the
storage space (the tests finished without errors, all drives looked
fine, 0 bad sectors, 0 read or SATA CRC errors... all looked
perfectly fine at the time...).

It didn't take long before I realized that one of the new drives
started failing.
I started a scrub and it reported both corrected and uncorrectable errors.
I looked at the SMART data. 2 drives look perfectly fine and 1 drive
seems to be really sick. The latter one has some "reallocated" and
several hundred "pending" sectors among other error indications in
the log. I guess it's not the drive surface but the HDD controller (or
maybe a head) which is really dying.

I figured the uncorrectable errors are write errors, which is not
surprising given the perceived "health" of the drive according to its
SMART attributes and error logs. That's understandable.


However, I tried to copy data from the filesystem and it failed in
various ways.
There was a file which couldn't be copied at all. Good question why. I
guess it's because the filesystem needs to be repaired to get the
checksums and parities sorted out first. That's also understandable
(though unexpected; I thought RAID-5 Btrfs is sort-of "self-healing"
in these situations - it should theoretically still be able to
reconstruct and present the correct data, based on checksums and
parities, seamlessly, and only place an error in the kernel log...).

But the worst part is that there are some ISO files which were
seemingly copied without errors but their external checksums (the one
which I can calculate with md5sum and compare to the one supplied by
the publisher of the ISO file) don't match!
Well... this, I cannot understand.
How could these files become corrupt from a single disk failure? And
more importantly: how could these files be copied without errors? Why
didn't Btrfs give a read error when the checksums didn't add up?


Isn't Btrfs supposed to constantly check the integrity of the file
data during any normal read operations and give an error instead of
spitting out corrupt data as if it was perfectly legit?
I thought that's how it is supposed to work.
What's the point of full data checksumming if only an explicitly
requested scrub operation might look for errors? I thought the logical
thing to do is that checksum verification happens during every single
read operation, and that passing that check is mandatory in order to
get any data out of the filesystem (perhaps excluding Direct-I/O
mode, but I never use that on Btrfs - if that's even actually
supported, I don't know).


Now I am really considering moving from Linux to Windows and from
Btrfs RAID-5 to Storage Spaces RAID-1 + ReFS (the only limitation is
that ReFS is only "self-healing" on RAID-1, not RAID-5, so I need a
new motherboard with more native SATA connectors and an extra HDD).
That one seemed to actually do what it promises (abort any read
operation upon checksum errors [which always happens seamlessly on
every read] but look at the redundant data first and seamlessly
"self-heal" if possible). The only thing which made Btrfs look like a
better alternative was the RAID-5 support. But I recently experienced
two cases of 1 drive failing of 3, and it always turned out to be a
smaller or bigger disaster (completely lost data or inconsistent data).


Does anybody have ideas what might have gone wrong in this second scenario?


Re: Btrfs/RAID5 became unmountable after SATA cable fault

2015-11-04 Thread Austin S Hemmelgarn

On 2015-11-04 13:01, Janos Toth F. wrote:

But the worst part is that there are some ISO files which were
seemingly copied without errors but their external checksums (the one
which I can calculate with md5sum and compare to the one supplied by
the publisher of the ISO file) don't match!
Well... this, I cannot understand.
How could these files become corrupt from a single disk failure? And
more importantly: how could these files be copied without errors? Why
didn't Btrfs give a read error when the checksums didn't add up?
If you can prove that there was a checksum mismatch and BTRFS returned 
invalid data instead of a read error or going to the other disk, then 
that is a very serious bug that needs to be fixed.  You need to keep in 
mind also however that it's completely possible that the data was bad 
before you wrote it to the filesystem, and if that's the case, there's 
nothing any filesystem can do to fix it for you.


Isn't Btrfs supposed to constantly check the integrity of the file
data during any normal read operations and give an error instead of
spitting out corrupt data as if it was perfectly legit?
I thought that's how it is supposed to work.
Assuming that all of your hardware is working exactly like it's supposed 
to, yes it should work that way.  If however, you have something that 
corrupts the data in RAM before or while BTRFS is computing the checksum 
prior to writing the data, then it's fully possible for bad data to get 
written to disk and still have a perfectly correct checksum.  Bad RAM 
may also explain your issues mentioned above with not being able to copy 
stuff off of the filesystem.
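
One cheap way to sanity-check part of that path (a sketch; sizes and
paths are examples) is to write a file, drop the page cache, and hash it
twice:

  # if the two hashes differ, something corrupted the data between RAM
  # and the disk (or on the way back)
  dd if=/dev/urandom of=/mnt/test.bin bs=1M count=512
  md5sum /mnt/test.bin
  sync && echo 3 > /proc/sys/vm/drop_caches
  md5sum /mnt/test.bin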


Also, if you're using NOCOW files (or just the mount option), those very 
specifically do not store checksums for the blocks, because there is no 
way to do it without significant risk of data corruption.

What's the point of full data checksumming if only an explicitly
requested scrub operation might look for errors? I thought the logical
thing to do is that checksum verification happens during every single
read operation, and that passing that check is mandatory in order to
get any data out of the filesystem (perhaps excluding Direct-I/O
mode, but I never use that on Btrfs - if that's even actually
supported, I don't know).


Now I am really considering moving from Linux to Windows and from
Btrfs RAID-5 to Storage Spaces RAID-1 + ReFS (the only limitation is
that ReFS is only "self-healing" on RAID-1, not RAID-5, so I need a
new motherboard with more native SATA connectors and an extra HDD).
That one seemed to actually do what it promises (abort any read
operation upon checksum errors [which always happens seamlessly on
every read] but look at the redundant data first and seamlessly
"self-heal" if possible). The only thing which made Btrfs look like a
better alternative was the RAID-5 support. But I recently experienced
two cases of 1 drive failing of 3, and it always turned out to be a
smaller or bigger disaster (completely lost data or inconsistent data).
Have you considered looking into ZFS?  I hate to suggest it as an 
alternative to BTRFS, but it's a much more mature and well tested 
technology than ReFS, and has many of the same features as BTRFS (and 
even has the option for triple parity instead of the double you get with 
RAID6).  If you do consider ZFS, make a point to look at FreeBSD in 
addition to the Linux version, the BSD one was a much better written 
port of the original Solaris drivers, and has better performance in many 
cases (and as much as I hate to admit it, BSD is way more reliable than 
Linux in most use cases).


You should also seriously consider whether the convenience of having a 
filesystem that fixes internal errors itself with no user intervention 
is worth the risk of it corrupting your data.  Returning correct data 
whenever possible is one thing, being 'self-healing' is completely 
different.  When you start talking about things that automatically fix 
internal errors without user intervention is when most seasoned system 
administrators start to get really nervous.  Self correcting systems 
have just as much chance to make things worse as they do to make things 
better, and most of them depend on the underlying hardware working 
correctly to actually provide any guarantee of reliability.  I cannot 
count the number of stories I've heard of 'self-healing' hardware RAID 
controllers destroying data.






[PATCH] Btrfs: fix extent accounting for partial direct IO writes

2015-11-04 Thread fdmanana
From: Filipe Manana 

When doing a write using direct IO we can end up not doing the whole write
operation using the direct IO path, in that case we fallback to a buffered
write to do the remaining IO. This happens for example if the range we are
writing to contains a compressed extent.
When we do a partial write and fallback to buffered IO, due to the
existence of a compressed extent for example, we end up not adjusting the
outstanding extents counter of our inode which ends up getting decremented
twice, once by the DIO ordered extent for the partial write and once again
by btrfs_direct_IO(), resulting in an arithmetic underflow at
extent-tree.c:drop_outstanding_extent(). For example if we have:

  extents   [ prealloc extent ][ compressed extent ]
  offsets   A        B         C         D          E

and at the moment our inode's outstanding extents counter is 0, if we do a
direct IO write against the range [B, D[ (which has a length smaller than
128Mb), we end up bumping our inode's outstanding extents counter to 1, we
create a DIO ordered extent for the range [B, C[ and then fallback to a
buffered write for the range [C, D[. The direct IO handler
(inode.c:btrfs_direct_IO()) decrements the outstanding extents counter by
1, leaving it with a value of 0, through a call to
btrfs_delalloc_release_space() and then shortly after the DIO ordered
extent finishes and calls btrfs_delalloc_release_metadata(), which ends
up attempting to decrement the inode's outstanding extents counter by 1,
resulting in an assertion failure at drop_outstanding_extent() because
the operation would result in an arithmetic underflow (0 - 1). This
produces the following trace:

  [125471.336838] BTRFS: assertion failed: BTRFS_I(inode)->outstanding_extents 
>= num_extents, file: fs/btrfs/extent-tree.c, line: 5526
  [125471.338844] [ cut here ]
  [125471.340745] kernel BUG at fs/btrfs/ctree.h:4173!
  [125471.340745] invalid opcode:  [#1] PREEMPT SMP DEBUG_PAGEALLOC
  [125471.340745] Modules linked in: btrfs f2fs xfs libcrc32c dm_flakey dm_mod 
crc32c_generic xor raid6_pq nfsd auth_rpcgss oid_registry nfs_acl nfs lockd 
grace fscache sunrpc loop fuse parport_pc acpi_cpufreq psmouse i2c_piix4 
parport pcspkr serio_raw microcode processor evdev i2c_core button ext4 crc16 
jbd2 mbcache sd_mod sg sr_mod cdrom ata_generic virtio_scsi ata_piix virtio_pci 
virtio_ring floppy libata virtio e1000 scsi_mod [last unloaded: btrfs]
  [125471.340745] CPU: 10 PID: 23649 Comm: kworker/u32:1 Tainted: GW
   4.3.0-rc5-btrfs-next-17+ #1
  [125471.340745] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
rel-1.8.1-0-g4adadbd-20150316_085822-nilsson.home.kraxel.org 04/01/2014
  [125471.340745] Workqueue: btrfs-endio-write btrfs_endio_write_helper [btrfs]
  [125471.340745] task: 8804244fcf80 ti: 88040a118000 task.ti: 
88040a118000
  [125471.340745] RIP: 0010:[]  [] 
assfail.constprop.46+0x1e/0x20 [btrfs]
  [125471.340745] RSP: 0018:88040a11bc78  EFLAGS: 00010296
  [125471.340745] RAX: 0075 RBX: 5000 RCX: 

  [125471.340745] RDX: 81098f93 RSI: 8147c619 RDI: 

  [125471.340745] RBP: 88040a11bc78 R08: 0001 R09: 

  [125471.340745] R10: 88040a11bc08 R11: 81651000 R12: 
8803efb4a000
  [125471.340745] R13: 8803efb4a000 R14:  R15: 
8802f8e33c88
  [125471.340745] FS:  () GS:88043dd4() 
knlGS:
  [125471.340745] CS:  0010 DS:  ES:  CR0: 8005003b
  [125471.340745] CR2: 7fae7ca86095 CR3: 01a0b000 CR4: 
06e0
  [125471.340745] Stack:
  [125471.340745]  88040a11bc88 a04ca0cd 88040a11bcc8 
a04ceeb1
  [125471.340745]  8802f8e33940 8802c93eadb0 8802f8e0bf50 
8803efb4a000
  [125471.340745]   8802f8e33c88 88040a11bd38 
a04eccfa
  [125471.340745] Call Trace:
  [125471.340745]  [] drop_outstanding_extent+0x3d/0x6d 
[btrfs]
  [125471.340745]  [] 
btrfs_delalloc_release_metadata+0x51/0xdd [btrfs]
  [125471.340745]  [] btrfs_finish_ordered_io+0x420/0x4eb 
[btrfs]
  [125471.340745]  [] finish_ordered_fn+0x15/0x17 [btrfs]
  [125471.340745]  [] normal_work_helper+0x14c/0x32a [btrfs]
  [125471.340745]  [] btrfs_endio_write_helper+0x12/0x14 
[btrfs]
  [125471.340745]  [] process_one_work+0x24a/0x4ac
  [125471.340745]  [] worker_thread+0x206/0x2c2
  [125471.340745]  [] ? rescuer_thread+0x2cb/0x2cb
  [125471.340745]  [] ? rescuer_thread+0x2cb/0x2cb
  [125471.340745]  [] kthread+0xef/0xf7
  [125471.340745]  [] ? kthread_parkme+0x24/0x24
  [125471.340745]  [] ret_from_fork+0x3f/0x70
  [125471.340745]  [] ? kthread_parkme+0x24/0x24
  [125471.340745] Code: a5 55 a0 48 89 e5 e8 42 50 bc e0 0f 0b 55 89 f1 48 c7 
c2 f0 a8 55 a0 48 89 fe 31 c0 48 c7 c7 14 aa 55 a0 48 89 e5 e8 22 50 bc e0 <0f> 
0b 0f 1f 

[PATCH] fstests: test for btrfs direct IO write against compressed extent

2015-11-04 Thread fdmanana
From: Filipe Manana 

Test that doing a direct IO write against a file range that contains one
prealloc extent and one compressed extent works correctly.

From the linux kernel 4.0 onwards, this either triggered an assertion
failure (leading to a BUG_ON) when CONFIG_BTRFS_ASSERT=y or resulted
in an arithmetic underflow of an inode's space reservation for write
operations.

That issue is fixed by the following linux kernel patch:

  "Btrfs: fix extent accounting for partial direct IO writes"

Signed-off-by: Filipe Manana 
---
 tests/btrfs/114 | 108 
 tests/btrfs/114.out |  15 
 tests/btrfs/group   |   1 +
 3 files changed, 124 insertions(+)
 create mode 100755 tests/btrfs/114
 create mode 100644 tests/btrfs/114.out

diff --git a/tests/btrfs/114 b/tests/btrfs/114
new file mode 100755
index 000..915d1e8
--- /dev/null
+++ b/tests/btrfs/114
@@ -0,0 +1,108 @@
+#! /bin/bash
+# FSQA Test No. 114
+#
+# Test that doing a direct IO write against a file range that contains one
+# prealloc extent and one compressed extent works correctly.
+#
+#---
+#
+# Copyright (C) 2015 SUSE Linux Products GmbH. All Rights Reserved.
+# Author: Filipe Manana 
+#
+# This program is free software; you can redistribute it and/or
+# modify it under the terms of the GNU General Public License as
+# published by the Free Software Foundation.
+#
+# This program is distributed in the hope that it would be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program; if not, write the Free Software Foundation,
+# Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301  USA
+#---
+#
+
+seq=`basename $0`
+seqres=$RESULT_DIR/$seq
+echo "QA output created by $seq"
+tmp=/tmp/$$
+status=1   # failure is the default!
+trap "_cleanup; exit \$status" 0 1 2 3 15
+
+_cleanup()
+{
+   cd /
+   rm -f $tmp.*
+}
+
+# get standard environment, filters and checks
+. ./common/rc
+. ./common/filter
+
+# real QA test starts here
+_need_to_be_root
+_supported_fs btrfs
+_supported_os Linux
+_require_scratch
+_require_xfs_io_command "falloc"
+
+rm -f $seqres.full
+
+_scratch_mkfs >>$seqres.full 2>&1
+_scratch_mount "-o compress"
+
+# Create a compressed extent covering the range [700K, 800K[.
+$XFS_IO_PROG -f -s -c "pwrite -S 0xaa -b 100K 700K 100K" $SCRATCH_MNT/foo \
+   | _filter_xfs_io
+
+# Create prealloc extent covering the range [600K, 700K[.
+$XFS_IO_PROG -c "falloc 600K 100K" $SCRATCH_MNT/foo
+
+# Write 80K of data to the range [640K, 720K[ using direct IO. This range
+# covers both the prealloc extent and the compressed extent. Because there's a
+# compressed extent in the range we are writing to, the DIO write code path
+# ends up only writing the first 60k of data, which goes to the prealloc
+# extent, and then falls back to buffered IO for writing the remaining 20K of
+# data - because that remaining data maps to a file range containing a
+# compressed extent.
+# that remaining data maps to a file range containing a compressed extent.
+# When falling back to buffered IO, we used to trigger an assertion when
+# releasing reserved space due to bad accounting of the inode's outstanding
+# extents counter, which was set to 1 but we ended up decrementing it by 1
+# twice, once through the ordered extent for the 60K of data we wrote using
+# direct IO, and once through the main direct IO handler
+# (inode.c:btrfs_direct_IO()) because the direct IO write wrote less than 80K
+# of data (60K).
+$XFS_IO_PROG -d -c "pwrite -S 0xbb -b 80K 640K 80K" $SCRATCH_MNT/foo \
+   | _filter_xfs_io
+
+# Now similar test as above but for very large write operations. This triggers
+# special cases for an inode's outstanding extents accounting, as internally
+# btrfs logically splits extents into 128Mb units.
+$XFS_IO_PROG -f -s \
+   -c "pwrite -S 0xaa -b 128M 258M 128M" \
+   -c "falloc 0 258M" \
+   $SCRATCH_MNT/bar | _filter_xfs_io
+$XFS_IO_PROG -d -c "pwrite -S 0xbb -b 256M 3M 256M" $SCRATCH_MNT/bar \
+   | _filter_xfs_io
+
+# Now verify the file contents are correct and that they are the same even
+# after unmounting and mounting the fs again (or evicting the page cache).
+#
+# For file foo, all bytes in the range [0, 640K[ must have a value of 0x00, all
+# bytes in the range [640K, 720K[ must have a value of 0xbb and all bytes in
+# the range [720K, 800K[ must have a value of 0xaa.
+#
+# For file bar, all bytes in the range [0, 3M[ must have a value of 0x00, all
+# bytes in the range [3M, 259M[ must have a value of 0xbb and all bytes in the
+# range [259M, 386M[ must have a value of 0xaa.
+#
+echo "File digests before remounting 

[PATCH v2] fstests: generic test for fsync after hole punching

2015-11-04 Thread fdmanana
From: Filipe Manana 

Test that a file fsync works after punching a hole for the same file
range multiple times, and that after log/journal replay the file's
content and layout are correct.

This test is motivated by a bug found in btrfs, which is fixed by
the following linux kernel patch:

  "Btrfs: fix hole punching when using the no-holes feature"

Signed-off-by: Filipe Manana 
---

V2: Removed setting -O no-holes for MKFS_OPTIONS when the fs being
tested is btrfs.
Made the test use the new function _flakey_drop_and_remount,
suggested by Dave Chinner, which is introduced by:
"[PATCH] fstests: add helper function _flakey_drop_and_remount"

 tests/generic/110 | 108 ++
 tests/generic/110.out |  13 ++
 tests/generic/group   |   1 +
 3 files changed, 122 insertions(+)
 create mode 100755 tests/generic/110
 create mode 100644 tests/generic/110.out

diff --git a/tests/generic/110 b/tests/generic/110
new file mode 100755
index 000..08aa883
--- /dev/null
+++ b/tests/generic/110
@@ -0,0 +1,108 @@
+#! /bin/bash
+# FSQA Test No. 110
+#
+# Test that a file fsync works after punching a hole for the same file range
+# multiple times and that after log/journal replay the file's content is
+# correct.
+#
+# This test is motivated by a bug found in btrfs.
+#
+#---
+#
+# Copyright (C) 2015 SUSE Linux Products GmbH. All Rights Reserved.
+# Author: Filipe Manana 
+#
+# This program is free software; you can redistribute it and/or
+# modify it under the terms of the GNU General Public License as
+# published by the Free Software Foundation.
+#
+# This program is distributed in the hope that it would be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program; if not, write the Free Software Foundation,
+# Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301  USA
+#---
+#
+
+seq=`basename $0`
+seqres=$RESULT_DIR/$seq
+echo "QA output created by $seq"
+tmp=/tmp/$$
+status=1   # failure is the default!
+trap "_cleanup; exit \$status" 0 1 2 3 15
+
+_cleanup()
+{
+   _cleanup_flakey
+   cd /
+   rm -f $tmp.*
+}
+
+# get standard environment, filters and checks
+. ./common/rc
+. ./common/filter
+. ./common/punch
+. ./common/dmflakey
+
+# real QA test starts here
+_need_to_be_root
+_supported_fs generic
+_supported_os Linux
+_require_scratch
+_require_xfs_io_command "fpunch"
+_require_xfs_io_command "fiemap"
+_require_dm_target flakey
+_require_metadata_journaling $SCRATCH_DEV
+
+rm -f $seqres.full
+
+_scratch_mkfs >>$seqres.full 2>&1
+_init_flakey
+_mount_flakey
+
+# Create our test file with some data and then fsync it.
+# We do the fsync only to make sure the last fsync we do in this test triggers
+# the fast code path of btrfs' fsync implementation, a condition necessary to
+# trigger the bug btrfs had.
+$XFS_IO_PROG -f -c "pwrite -S 0xaa 0K 128K" \
+   -c "fsync"  \
+   $SCRATCH_MNT/foobar | _filter_xfs_io
+
+# Now punch a hole against the range [96K, 128K[.
+$XFS_IO_PROG -c "fpunch 96K 32K" $SCRATCH_MNT/foobar
+
+# Punch another hole against a range that overlaps the previous range and ends
+# beyond eof.
+$XFS_IO_PROG -c "fpunch 64K 128K" $SCRATCH_MNT/foobar
+
+# Punch another hole against a range that overlaps the first range
+# ([96K, 128K[) and ends at eof.
+$XFS_IO_PROG -c "fpunch 32K 96K" $SCRATCH_MNT/foobar
+
+# Fsync our file. We want to verify that, after a power failure and mounting
+# the filesystem again, the file content reflects all the hole punch
+# operations.
+$XFS_IO_PROG -c "fsync" $SCRATCH_MNT/foobar
+
+echo "File digest before power failure:"
+md5sum $SCRATCH_MNT/foobar | _filter_scratch
+
+echo "Fiemap before power failure:"
+$XFS_IO_PROG -c "fiemap -v" $SCRATCH_MNT/foobar | _filter_fiemap
+
+_flakey_drop_and_remount
+
+echo "File digest after log replay:"
+# Must match the same digest we got before the power failure.
+md5sum $SCRATCH_MNT/foobar | _filter_scratch
+
+echo "Fiemap after log replay:"
+# Must match the same extent listing we got before the power failure.
+$XFS_IO_PROG -c "fiemap -v" $SCRATCH_MNT/foobar | _filter_fiemap
+
+_unmount_flakey
+
+status=0
+exit
diff --git a/tests/generic/110.out b/tests/generic/110.out
new file mode 100644
index 000..ba016c8
--- /dev/null
+++ b/tests/generic/110.out
@@ -0,0 +1,13 @@
+QA output created by 110
+wrote 131072/131072 bytes at offset 0
+XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+File digest before power failure:
+d26bbb9a8396a9c0dd76423471b72b15  SCRATCH_MNT/foobar
+Fiemap before