Re: Possible deadlock when writing

2018-12-01 Thread Janos Toth F.
I obviously can't be sure (due to the obscure nature of this issue) but I
think I observed similar behavior. For me it usually kicked in during
scheduled defragmentation runs. I initially suspected it might have
something to do with running defrag on files which are still open
for appending writes (through specifying the entire root subvolume
folder recursively). But it kept happening after I omitted those
folders, and I think defrag has nothing more to do with this than
generating a lot of IO. The SMART status is fine on all disks in
the multi-device filesystem. When this happens, the write buffer in
system RAM is ~full, a manual sync hangs forever, but some read
operations still succeed. A normal reboot is not possible since sync
won't finish (and attempting one usually locks the system up pretty
well).

I haven't seen this happen since I updated to 4.19.3 (if I recall
correctly), although not much time has passed.
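
For reference, next time it happens I want to capture the blocked-task state
before forcing a reboot; a minimal sketch (assuming magic SysRq is enabled,
e.g. kernel.sysrq=1, and that the log can still be saved somewhere, e.g. to a tmpfs):

echo w > /proc/sysrq-trigger    # dump tasks stuck in uninterruptible (D) state
echo t > /proc/sysrq-trigger    # dump all task states (very verbose)
dmesg > /tmp/hang-trace.txt
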
On Mon, Nov 26, 2018 at 8:14 PM Larkin Lowrey  wrote:
>
> I started having a host freeze randomly when running a 4.18 kernel. The
> host was stable when running 4.17.12.
>
> At first, it appeared that it was only IO that was frozen since I could
> run common commands that were likely cached in RAM and that did not
> touch storage. Anything that did touch storage would freeze and I would
> not be able to ctrl-c it.
>
> I noticed today, when it happened with kernel 4.19.2, that backups were
> still running and that the backup app could still read from the backup
> snapshot subvol. It's possible that the backups are still able to
> proceed because the accesses are all read-only and the snapshot was
> mounted with noatime so the backup process never triggers a write.
>
> There never are any errors output to the console when this happens and
> nothing is logged. When I first encountered this back in Sept. I managed
> to record a few sysrq dumps and attached them to a redhat ticket. See
> links below.
>
> https://bugzilla.redhat.com/show_bug.cgi?id=1627288
> https://bugzilla.redhat.com/attachment.cgi?id=1482177
>
> I do have several VMs running that have their image files nocow'd.
> Interestingly, all the VMs, except 1, seem to be able to write just
> fine. The one that can't has frozen completely and is the one that
> regularly generates the most IO.
>
> Any ideas on how to debug this further?
>
> --Larkin


Re: lazytime mount option—no support in Btrfs

2018-08-23 Thread Janos Toth F.
On Tue, Aug 21, 2018 at 4:10 PM Austin S. Hemmelgarn wrote:
> Also, once you've got the space cache set up by mounting once writable
> with the appropriate flag and then waiting for it to initialize, you
> should not ever need to specify the `space_cache` option again.

True.
I am not sure why I thought I had to keep space_cache=v2 in the rootflags
list. I guess I forgot it was meant to be removed after a reboot. Or
maybe some old kernels misbehaved (always cleared the free-space tree for
some reason and re-initialized the v1 space cache instead unless the flag
was there). But I removed it now and everything works as intended.
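
For reference, the one-time switch is roughly this (device and mount point are
only examples; the check assumes a btrfs-progs version that decodes the
superblock compat_ro flags):

mount -o space_cache=v2 /dev/sdXn /mnt    # one writable mount builds the free-space tree
btrfs inspect-internal dump-super /dev/sdXn | grep -i free_space_tree
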
Thanks.


Re: lazytime mount option—no support in Btrfs

2018-08-21 Thread Janos Toth F.
> >> so pretty much everyone who wants to avoid the overhead from them can just
> >> use the `noatime` mount option.

It would be great if someone finally fixed this old bug then:
https://bugzilla.kernel.org/show_bug.cgi?id=61601
Until then, it seems practically impossible to use both noatime (it
can't be added as a rootflag on the kernel command line and won't apply if
the kernel has already mounted the root as RW) and space_cache=v2 (which has
to be added as a rootflag, along with RW, to take effect) for the root
filesystem (at least not without an init*fs, which I never use, so I can't
tell).
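
To illustrate with kernel command line fragments (the device name is an
example; the second line is what bug 61601 is about):

root=/dev/sdXn rootflags=space_cache=v2 rw    # works, the flag applies at the initial RW mount
root=/dev/sdXn rootflags=noatime rw           # panics at boot here, without an init*fs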


Re: Writeback errors in kernel log with Linux 4.15 (m=s=raid1, d=raid5, 5 disks)

2018-02-01 Thread Janos Toth F.
Hmm... Actually, I just discovered that a different machine with s=m=d=dup
(single HDD) spat out a few similar messages (a lot fewer, and it took
longer for them to appear at all, but it handles very little load):

[  333.197366] WARNING: CPU: 0 PID: 134 at fs/fs-writeback.c:2339 __writeback_inodes_sb_nr+0xc6/0xd0
[  333.197371] CPU: 0 PID: 134 Comm: btrfs-transacti Not tainted 4.15.0-gentoo #2
[  333.197372] Hardware name: Dell Inc. Vostro 1700 /  , BIOS A07 04/21/2008
[  333.197373] RIP: 0010:__writeback_inodes_sb_nr+0xc6/0xd0
[  333.197376] RSP: 0018:96eb80777e00 EFLAGS: 00010246
[  333.197378] RAX:  RBX: 88d19a5dd000 RCX: 
[  333.197379] RDX: 0002 RSI: 1100 RDI: 88d19a4cb000
[  333.197380] RBP: 96eb80777e04 R08: fffe R09: 0003
[  333.197383] R10: 96eb80777d10 R11: 0400 R12: 88d19a0bc600
[  333.197384] R13: 88d19a96a5b8 R14: 00014c0a R15: 88d19a96a5e0
[  333.197385] FS:  () GS:88d19fc0() knlGS:0000
[  333.197387] CS:  0010 DS:  ES:  CR0: 80050033
[  333.197389] CR2: 7fd058028000 CR3: c6a0a000 CR4: 06f0
[  333.197390] Call Trace:
[  333.197397]  btrfs_commit_transaction+0x776/0x860
[  333.197399]  ? start_transaction+0x99/0x380
[  333.197403]  transaction_kthread+0x176/0x1a0
[  333.197404]  ? btrfs_cleanup_transaction+0x490/0x490
[  333.197407]  kthread+0x109/0x120
[  333.197411]  ? __kthread_bind_mask+0x60/0x60
[  333.197413]  ret_from_fork+0x35/0x40
[  333.197417] Code: 8d 74 24 08 e8 ec fd ff ff 48 89 ee 48 89 df e8 d1 fe ff ff 48 8b 44 24 48 65 48 33 04 25 28 00 00 00 75 0b 48 83 c4 50 5b 5d c3 <0f> ff eb c0 e8 c1 af ef ff 90 48 85 ff 53 48 c7 c3 60 5b 83 89
[  333.197459] ---[ end trace 4427bc8429f8bec7 ]---

On Fri, Feb 2, 2018 at 2:28 AM, Janos Toth F. <toth.f.ja...@gmail.com> wrote:
> I started seeing these on my d=raid5 filesystem after upgrading to Linux 4.15.
>
> Some files created since the upgrade seem to be corrupted.
>
> The disks seem to be fine (according to btrfs device stats and
> smartmontools device logs).
>
> The rest of the Btrfs filesystems (with m=s=d=single profiles) do not
> seem to have any issues but it could be linked to the specific load
> the d=raid5 filesystem handles (unlikely since it's nothing too
> fancy).
>
> The mount options are (for all Btrfs filesystems on this machine):
> noatime,max_inline=0,commit=300,space_cache=v2,flushoncommit

Writeback errors in kernel log with Linux 4.15 (m=s=raid1, d=raid5, 5 disks)

2018-02-01 Thread Janos Toth F.
I started seeing these on my d=raid5 filesystem after upgrading to Linux 4.15.

Some files created since the upgrade seem to be corrupted.

The disks seem to be fine (according to btrfs device stats and
smartmontools device logs).

The rest of the Btrfs filesystems (with m=s=d=single profiles) do not
seem to have any issues but it could be linked to the specific load
the d=raid5 filesystem handles (unlikely since it's nothing too
fancy).

The mount options are (for all Btrfs filesystems on this machine):
noatime,max_inline=0,commit=300,space_cache=v2,flushoncommit

dmesg is full of these:

[12848.106644] Code: 8d 74 24 08 e8 ec fd ff ff 48 89 ee 48 89 df e8
d1 fe ff ff 48 8b 44 24 48 65 48 33 04 25 28 00 00 00 75 0b 48 83 c4
50 5b 5d c3 <0f> ff eb c0 e8 71 70 ef ff 90 53 48 c7 c3 60 62 83 9d 48
85 ff
[12848.106664] ---[ end trace bb7f6cdb25fa37db ]---
[12850.114566] WARNING: CPU: 0 PID: 7277 at fs/fs-writeback.c:2339
__writeback_inodes_sb_nr+0xc6/0xd0
[12850.114573] CPU: 0 PID: 7277 Comm: btrfs Tainted: GW
4.15.0-gentoo #2
[12850.114575] Hardware name: Gigabyte Technology Co., Ltd. X150M-PRO
ECC/X150M-PRO ECC-CF, BIOS F22b 07/04/2017
[12850.114581] RIP: 0010:__writeback_inodes_sb_nr+0xc6/0xd0
[12850.114584] RSP: 0018:a2e7c619b9c0 EFLAGS: 00010246
[12850.114588] RAX:  RBX: 8963b99ca800 RCX: 
[12850.114591] RDX: 0002 RSI: 2fcf RDI: 8963b982a800
[12850.114593] RBP: a2e7c619b9c4 R08: fffe R09: 0003
[12850.114596] R10: f2ec00094c40 R11: 008f R12: 89612accc400
[12850.114598] R13: 8963b98235b8 R14: 8961166a5000 R15: 8963b98235e0
[12850.114602] FS:  7f9c6280c8c0() GS:8963bec0()
knlGS:
[12850.114605] CS:  0010 DS:  ES:  CR0: 80050033
[12850.114607] CR2: 7ffd26332e78 CR3: 00024443a005 CR4: 002606f0
[12850.114609] Call Trace:
[12850.114618]  btrfs_commit_transaction+0x776/0x860
[12850.114626]  prepare_to_merge+0x226/0x240
[12850.114632]  relocate_block_group+0x234/0x620
[12850.114638]  btrfs_relocate_block_group+0x17e/0x230
[12850.114643]  btrfs_relocate_chunk+0x2e/0xc0
[12850.114647]  btrfs_balance+0xb42/0x1270
[12850.114654]  ? terminate_walk+0x8c/0x100
[12850.114660]  btrfs_ioctl_balance+0x330/0x390
[12850.114665]  btrfs_ioctl+0x4a6/0x23da
[12850.114674]  ? __alloc_pages_nodemask+0xa9/0x130
[12850.114678]  ? do_vfs_ioctl+0x9f/0x620
[12850.114684]  ? btrfs_ioctl_get_supported_features+0x20/0x20
[12850.114687]  do_vfs_ioctl+0x9f/0x620
[12850.114692]  SyS_ioctl+0x36/0x70
[12850.114698]  ? get_vtime_delta+0x9/0x40
[12850.114702]  do_syscall_64+0x72/0x360
[12850.114707]  ? get_vtime_delta+0x9/0x40
[12850.114711]  ? get_vtime_delta+0x9/0x40
[12850.114716]  ? __vtime_account_system+0x10/0x50
[12850.114722]  entry_SYSCALL64_slow_path+0x25/0x25
[12850.114726] RIP: 0033:0x7f9c616063a7
[12850.114728] RSP: 002b:7ffe0d33a3d8 EFLAGS: 0206 ORIG_RAX:
0010
[12850.114733] RAX: ffda RBX: 0001 RCX: 7f9c616063a7
[12850.114735] RDX: 7ffe0d33a470 RSI: c4009420 RDI: 0003
[12850.114738] RBP: 0003 R08: 0003 R09: 0001
[12850.114740] R10: 0559 R11: 0206 R12: 7ffe0d33a470
[12850.114743] R13: 7ffe0d33a470 R14: 0001 R15: 7ffe0d33caf6


Re: deleted subvols don't go away?

2017-08-27 Thread Janos Toth F.
ID=5 is the default, "root" or "toplevel" subvolume which can't be
deleted anyway (at least normally, I am not sure if some debug-magic
can achieve that).
I just checked this (out of curiosity) and all my Btrfs filesystems
report something very similar to yours (I thought DELETED was a made
up example but I see it was literal...):

~ # btrfs sub list -a /
ID 303 gen 172881 top level 5 path /gentoo
~ # btrfs sub list -ad /
ID 5 gen 172564 top level 0 path /DELETED

I guess this entry is some placeholder, like a hidden "trash"
directory on some filesystems. I don't think this means all Btrfs
filesystems forever hold on to their last deleted subvolumes (and only
one).


Re: Btrfs Raid5 issue.

2017-08-21 Thread Janos Toth F.
I lost enough Btrfs m=d=s=RAID5 filesystems in past experiments (I
haven't tried using RAID5 for metadata and system chunks in the last few
years) to faulty SATA cables plus hotplug-enabled SATA controllers (where
a disk could disappear and reappear "as the wind blew"). Since then, I
have made a habit of always disabling hotplug for all SATA disks involved
with Btrfs, even those with m=d=s=single profiles (and I never desired
to build multi-device filesystems from USB-attached disks anyway, but
this is a good reason for me to explicitly avoid that).

I am not sure if other RAID profiles are affected in a similar way or
if it's just RAID56. (Well, RAID0 is obviously toast and RAID1/10
will obviously get degraded, but I am not sure whether it's possible to
re-sync RAID1/10 with a simple balance [possibly even without
remounting and doing a manual device delete/add?] or whether the filesystem
has to be recreated from scratch [like RAID5].)

I think this hotplug problem is an entirely different issue from the
RAID56-scrub race conditions (which are now considered fixed in Linux
4.12) and nobody is currently working on it (if it's RAID56-only
then I don't expect a fix anytime soon [think years]).


Re: [RFC] Checksum of the parity

2017-08-13 Thread Janos Toth F.
On Sun, Aug 13, 2017 at 8:45 PM, Chris Murphy  wrote:
> Further, the error detection of corrupt reconstruction is why I say
> Btrfs is not subject *in practice* to the write hole problem. [2]
>
> [1]
> I haven't tested the raid6 normal read case where a stripe contains
> corrupt data strip and corrupt P strip, and Q strip is good. I expect
> instead of EIO, we get a reconstruction from Q, and then both data and
> P get fixed up, but I can't find it in comments or code.

Yes, that's what I would expect (which theoretically makes the odds of
successful recovery better on RAID6, possibly "good enough") but I
have no clue how that actually gets handled right now (I guess the
current code isn't that thorough).

> [2]
> Is Btrfs subject to the write hole problem manifesting on disk? I'm
> not sure, sadly I don't read the code well enough. But if all Btrfs
> raid56 writes are full stripe CoW writes, and if the prescribed order
> guarantees still happen: data CoW to disk > metadata CoW to disk >
> superblock update, then I don't see how the write hole happens. Write
> hole requires: RMW of a stripe, which is a partial stripe overwrite,
> and a crash during the modification of the stripe making that stripe
> inconsistent as well as still pointed to by metadata.

I guess the problem is that the stripe size or the stripe element size is
(sort of) fixed (I am not sure which one; I guess it's the latter, in which
case the actual stripe size depends on the number of devices) and
relatively big (much bigger than the usual 4k sector size, or even the
leaf size, which now defaults to 16k if I recall correctly [but I set this
to 4k myself]), so a partial stripe update (RMW) is certainly possible
during generic use.
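
A rough back-of-the-envelope illustration, assuming the commonly cited 64K
stripe element size:

  full data stripe (RAID5) = (number of devices - 1) * element size
  e.g. with 5 devices: (5 - 1) * 64K = 256K of data (plus 64K of parity) per stripe

so any write that covers less than that whole 256K has to read-modify-write
an existing stripe.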

This is why I threw the idea around a few months ago of resurrecting that
old (but dead-looking / stuck) project about making the stripe
(element) size configurable by the user. That would allow making
the stripe size equal to the filesystem sector size on a limited
set of setups (for example, 5 or 6 HDDs with 512-byte physical
sectors in RAID-5 or RAID-6 respectively), which would (as I
understand it) practically eliminate the problem, at least on the
filesystem side. I am not sure if the HDD's volatile write cache, or at
least its internal re-ordering feature, would still have to be disabled for
this to really avoid inconsistencies between stripe elements. I
can't recall ever seeing partially written sectors (we would know,
since these are checksummed in place and thus appear unreadable if
partially written); I guess there is usually enough electricity
in some small capacitor to finish the current sector after the power
gets cut (???).


Re: btrfs issue with mariadb incremental backup

2017-08-12 Thread Janos Toth F.
On Sat, Aug 12, 2017 at 11:34 PM, Chris Murphy  wrote:
> On Fri, Aug 11, 2017 at 11:08 PM,   wrote:
>
> I think the problem is that the script does things so fast that the
> snapshot is not always consistent on disk before btrfs send starts.
> It's just a guess though. If I'm right, this means the rsync mismatches
> mean the destination snapshots are bad.

Hmm. I was wondering about this exact issue the other day when I
fiddled with my own backup script:
- Should I issue a sync between creating a snapshot and starting to
rsync (or send/receive) the content of that fresh snapshot to
external backup storage?
I dismissed the thought because I figured rsync should see the proper
state regardless of whether some data is still waiting in the system
write-back cache.
Now I am confused again. Is this only an issue for send/receive? (I don't use
that feature yet but have thought about switching to it from rsync.)
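
For now I would probably just play it safe in the script; a minimal sketch of
the conservative ordering (paths are only examples):

btrfs subvolume snapshot -r /data /data/.snapshots/today
btrfs filesystem sync /data    # make sure the snapshot is committed before sending
btrfs send /data/.snapshots/today | btrfs receive /backup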


Re: BTRFS warning: unhandled fiemap cache detected

2017-08-09 Thread Janos Toth F.
As far as I can tell it's nothing to worry about. (I have thousands
of these warnings.) I don't know why the patch was submitted for 4.13
but not applied to the next 4.12.x release, since it looks like a trivial,
tiny fix for an annoying problem.

On Wed, Aug 9, 2017 at 10:48 AM, Mario Fugazzi®  wrote:
> Hi everyone :-)
>
> After updating the kernel to the latest one 4.12 I receive a lot of warnings 
> in
> my dmesg but only on my ssd that is a kingston hyperx predator.
> No message get logged for my other two traditional HD that are btrfs also.
> Warnings show only on boot and not on every boot.
> Kernel is now 4.12.5 but I got that till upgrading to 4.12, in 4.11 no 
> warnings.
>
> Just wanted to know if I have to worry about this, scrubs are ok and btrfs
> check give me no problems.
>
> Thank you very much.
> Mario.
>
> Some sample:
>
> [   13.486100] BTRFS warning (device sdd2): unhandled fiemap cache detected: 
> offset=0 phys=1960767488 len=8192 flags=0x0
> [   13.486110] BTRFS warning (device sdd2): unhandled fiemap cache detected: 
> offset=0 phys=1960767488 len=8192 flags=0x0
> [   13.486261] BTRFS warning (device sdd2): unhandled fiemap cache detected: 
> offset=0 phys=1960824832 len=12288 flags=0x0
> [   13.486280] BTRFS warning (device sdd2): unhandled fiemap cache detected: 
> offset=0 phys=1960763392 len=4096 flags=0x0
> [   13.486434] BTRFS warning (device sdd2): unhandled fiemap cache detected: 
> offset=0 phys=1960767488 len=8192 flags=0x0
> [   13.486446] BTRFS warning (device sdd2): unhandled fiemap cache detected: 
> offset=0 phys=1960775680 len=4096 flags=0x0
> [   13.486616] BTRFS warning (device sdd2): unhandled fiemap cache detected: 
> offset=0 phys=1960783872 len=24576 flags=0x0
> [   13.486650] BTRFS warning (device sdd2): unhandled fiemap cache detected: 
> offset=0 phys=1960808448 len=4096 flags=0x0
> [   13.486719] BTRFS warning (device sdd2): unhandled fiemap cache detected: 
> offset=0 phys=1960812544 len=8192 flags=0x0
> [   13.486742] BTRFS warning (device sdd2): unhandled fiemap cache detected: 
> offset=0 phys=1960820736 len=4096 flags=0x0
>


Re: how to benchmark schedulers

2017-08-08 Thread Janos Toth F.
I think you should consider using Linux 4.12, which has bfq (bfq-mq)
for blk-mq. So you don't need out-of-tree BFQ patches if you switch
to blk-mq (and now you are free to do so even if you have HDDs or SSDs
which benefit from software schedulers, since multi-queue
schedulers exist for them). Just make sure to enable blk-mq (it has to be a
boot parameter or a build-time choice) in order to gain access to
bfq-mq. And remember that bfq-mq has to be activated manually: the
build-time choice of a default scheduler does not apply to multi-queue
schedulers, so you will default to "none", which is effectively the new
"no-op".
Note: there is only one BFQ in 4.12 and it's bfq-mq, which simply runs under
the name BFQ (not bfq-mq; I only used that name to make it
clear that the BFQ in 4.12 is a multi-queue version of BFQ).
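
Roughly, assuming bfq was built in (CONFIG_IOSCHED_BFQ) and sdX is the device
in question:

scsi_mod.use_blk_mq=1                        # kernel boot parameter: route SATA/SCSI through blk-mq
echo bfq > /sys/block/sdX/queue/scheduler    # then select bfq per device at runtime
cat /sys/block/sdX/queue/scheduler           # verify (the default shows up as [none])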

I always wondered if Btrfs makes any use of FUA if it's enabled for
compatible SATA devices (it's disabled by default because there are
some drives with faulty firmware).

I also wonder if RAID10 is any better (or actually worse?) for
metadata (and system) chunks than RAID1.

On Tue, Aug 8, 2017 at 1:59 PM, Bernhard Landauer  wrote:
> Hello everyone
>
> I am looking for a way to test different available schedulers with Manjaro's
> bfq-patched kernels on sytems with both SSD and spinning drives. Since
> phoronix-test-suite apparently is currently useless for this task due to its
> bad config for bfq I am looking for alternatives. Do you have any
> suggestions for me?
> Thank you.
>
> kind regards
> Bernhard Landauer


Re: write corruption due to bio cloning on raid5/6

2017-07-29 Thread Janos Toth F.
Reply to the TL;DR part, so TL;DR marker again...

Well, I live on the other extreme now. I want as few filesystems as
possible and viable (it's obviously impossible to have a real backup
within the same fs and/or device, and with the current
size/performance/price differences between HDD and SSD it makes sense
to separate the "small and fast" from the "big and slow" storage, but
other than that...). I always believed (even before I got a real grasp
on these things and could explain my view or argue about it) that
"subvolumes" (in a general sense, but let's use this word here) should
reside below filesystems (and be totally optional), that filesystems
should spread over a whole disk or (md or hardware) RAID volume
(forget the MSDOS partitions), and that even these ZFS/Btrfs-style
subvolumes should be used sparingly (only when you really have a good
enough reason to create a subvolume, although it doesn't matter nearly
as much with subvolumes as it does with partitions).

I remember the days when I thought it was important to create separate
partitions for different kinds of data (10+ years ago, when I was aware
I didn't have the experience to deviate from common general
teachings). I remember all the pain of randomly running out of space
on any and all filesystems, eventually mixing the various kinds of
data on every theoretically-segregated filesystem (wherever I found
free space), which caused a nightmare of a broken sorting system (like a
library after a tornado), and then all the horror of my first
Russian-roulette-like experiences of resizing partitions and filesystems
to make the segregation decent again. And I saw much worse on other
people's machines. At one point, I decided to create as few partitions as
possible (and I really like the idea of zero partitions; I don't miss
MSDOS).
I still get shivers if I need to resize a filesystem, due to the
memories of those early tragic experiences (when I never won the
lottery on the "trial and error" runs but lost filesystems with both
hands, and learned what widespread silent corruption is and how you
can refresh your backups with corrupted copies...). Let's not take me
back to those early days, please. I don't want to live in a cave
anymore. Thank you, modern filesystems (and their authors). :)

And on that note... Assuming I had interference problems, they were
caused by my own mistake/negligence. I can always make similar or
bigger human mistakes, independent of disk-level segregation. (For
example, no number of partitions will save any data if I accidentally
wipe the entire drive with dd, or if I have it security-locked by the
controller and lose the passwords, etc...)


Re: write corruption due to bio cloning on raid5/6

2017-07-28 Thread Janos Toth F.
The read-only scrub finished without errors/hangs (with kernel
4.12.3). So, I guess the hangs were caused by:
1: other bug in 4.13-RC1
2: crazy-random SATA/disk-controller issue
3: interference between various btrfs tools [*]
4: something in the background did DIO write with 4.13-RC1 (but all
affected content was eventually overwritten/deleted between the scrub
attempts)

[*] I expected scrub to finish in ~5 rather than ~40 hours (and didn't
expect interference issues), so I didn't disable the scheduled
maintenance script which deletes old files, recursively defrags the
whole fs and runs a balance with usage=33 filters. I guess either of
those (especially balance) could potentially cause scrub to hang.
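
(For reference, the script is roughly of this shape; the path and the
retention rule are placeholders, and I pass the usage=33 filter to both data
and metadata:)

find /mnt/pool/recordings -type f -mtime +30 -delete
btrfs filesystem defragment -r -t 32M /mnt/pool
btrfs balance start -dusage=33 -musage=33 /mnt/pool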

On Thu, Jul 27, 2017 at 10:44 PM, Duncan <1i5t5.dun...@cox.net> wrote:
> Janos Toth F. posted on Thu, 27 Jul 2017 16:14:47 +0200 as excerpted:
>
>> * This is off-topic but raid5 scrub is painful. The disks run at
>> constant ~100% utilization while performing at ~1/5 of their sequential
>> read speeds. And despite explicitly asking idle IO priority when
>> launching scrub, the filesystem becomes unbearably slow (while scrub
>> takes a days or so to finish ... or get to the point where it hung the
>> last time around, close to the end).
>
> That's because basically all the userspace scrub command does is make the
> appropriate kernel calls to have it do the real scrub.  So priority-
> idling the userspace scrub doesn't do what it does on normal userspace
> jobs that do much of the work themselves.
>
> The problem is that idle-prioritizing the kernel threads actually doing
> the work could risk a deadlock due to lock inversion, since they're
> kernel threads and aren't designed with the idea of people messing with
> their priority in mind.
>
> Meanwhile, that's yet another reason btrfs raid56 mode isn't recommended
> at this time.  Try btrfs raid1 or raid10 mode instead, or possible btrfs
> raid1, single or raid0 mode on top of a pair of mdraid5s or similar.  Tho
> parity-raid mode in general (that is, not btrfs-specific) is known for
> being slow in various cases, with raid10 normally being the best
> performing closest alternative.  (Tho in the btrfs-specific case, btrfs
> raid1 on top of a pair of mdraid/dmraid/whatever raid0s, is the normally
> recommended higher performance reasonably low danger alternative.)

If this applies to all RAID flavors then I consider the built-in help
and the manual pages of scrub misleading (and if it's RAID56-only, the
manual should still mention that RAID56 is an exception).

Also, a resumed scrub seems to skip a lot of data. It picks up where
it left off but then prematurely reports a job well done. I remember
noticing similar behavior with balance cancel/resume on RAID5 a few
years ago (it went on for a few more chunks but left the rest alone
and reported completion --- I am not sure whether that's fixed now or
whether these have a common root cause).


Re: write corruption due to bio cloning on raid5/6

2017-07-27 Thread Janos Toth F.
> It should only affect the dio-written files, the mentioned bug makes
> btrfs write garbage into those files, so checksum fails when reading
> files, nothing else from this bug.

Thanks for confirming that.  I thought so but I removed the affected
temporary files even before I knew they were corrupt, yet I had
trouble with the follow-up scrub, so I got confused about the scope of
the issue.
However, I am not sure if some other software which regularly runs in
the background might use DIO (I don't think so but can't say for
sure).

> A hang could normally be caught by sysrq-w, could you please try it
> and see if there is a difference in kernel log?

It's not a total system hang. The filesystem in question effectively
becomes read-only (I forgot to check if it actually turns RO or writes
just silently hang) and scrub hangs (it doesn't seem to do any disk
I/O and can't be cancelled gracefully). A graceful reboot or shutdown
silently fails.

In the mean time, I switched to Linux 4.12.3, updated the firmware on
the HDDs and ran an extended SMART self-test (which found no errors),
used cp to copy everything (not for backup but as a form of "crude
scrub" [see *], which yielded zero errors) and now finally started a
scrub (in foreground and read-only mode this time).
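
(Concretely, something like this; the mount point is an example:)

btrfs scrub start -B -r -c 3 /mnt/pool    # -B foreground, -r read-only, -c 3 idle IO class
btrfs scrub status /mnt/pool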

* This is off-topic but raid5 scrub is painful. The disks run at
constant ~100% utilization while performing at ~1/5 of their
sequential read speeds. And despite explicitly asking for idle IO priority
when launching scrub, the filesystem becomes unbearably slow (while
scrub takes a day or so to finish ... or to get to the point where it
hung the last time around, close to the end).

I find it a little strange that BFQ and the on-board disk caching with
NCQ + FUA (look-ahead read caching and write-cache reordering enabled,
with 128MB on-board caches) can't mitigate the issue a little better
(whatever scrub is doing "wrong" from a performance perspective).

If scrub hangs again, I will try to extract something useful from the logs.


Re: write corruption due to bio cloning on raid5/6

2017-07-24 Thread Janos Toth F.
I accidentally ran into this problem (it's pretty silly because I
almost never run RC kernels or do dio writes but somehow I just
happened to do both at once, exactly before I read your patch notes).
I didn't initially catch any issues (I see no related messages in the
kernel log) but after seeing your patch, I started a scrub (*) and it
hung.

Is there a way to fix a filesystem corrupted by this bug or does it
need to be destroyed and recreated? (It's m=s=raid10, d=raid5 with
5x4TB HDDs.) There is a partial backup (of everything really
important, the rest is not important enough to be kept in multiple
copies, hence the desire for raid5...) and everything seems to be
readable anyway (so could be saved if needed) but nuking a big fs is
never fun...

Scrub just hangs and pretty much leaves the whole system hanging (it
needs a power cycle to reboot), although everything else runs smoothly
besides this. Btrfs check (read-only, normal-mem mode) finds no errors,
the kernel log is clean, etc.

I think I deleted all the affected dio-written test-files even before
I started scrubbing, so that doesn't seem to do the trick. Any other
ideas?


* By the way, I see raid56 scrub is still painfully slow (~30MB/s per
disk, with raw disk speeds of >100 MB/s). I forgot about this issue
since I last used raid5 a few years ago.


Re: Btrfs/SSD

2017-05-13 Thread Janos Toth F.
> Anyway, that 20-33% left entirely unallocated/unpartitioned
> recommendation still holds, right?

I never liked that idea. And I really disliked how people considered
it to be (and even passed it down as) some magical, absolute
stupid-proof fail-safe thing (because it's not).

1: Unless you reliably trim the whole LBA space (and/or run
ata_secure_erase on the whole drive) before you (re-)partition it,
you have zero guarantee that the drive's controller/firmware
will treat the unallocated space as empty rather than keep its content
around as useful data (even if it's full of zeros, because zeros could be
very useful data unless they are specifically marked as "throwaway" by
trim/erase). On the other hand, a trim-compatible filesystem should
properly mark (trim) all (or at least most) of the free space as free
(= free to erase internally at the controller's discretion). And even
if trim isn't fail-proof either, those bugs should be temporary (it's
not like a sane SSD will die in a few weeks from this kind of
issue during sane usage, and crazy drives will often fail under crazy
usage regardless of trim and spare space). See the example commands
after this list.

2: It's not some daemon-summoning, world-ending catastrophe if you
occasionally happen to fill your SSD to ~100%. It probably won't like
it (it will probably get slow by the end of the writes and the
internal write amplification might skyrocket at its peak), but nothing
extraordinary will happen, and normal operation (high write speed,
normal internal write amplification, etc.) should resume soon after you
make some room (for example, you delete your temporary files or move
some old content to archive storage and properly trim that
space). That space is there to be used; just don't leave it close to
100% all the time, and try never to leave it close to 100% when you plan
to keep it busy with many small random writes.

3: Some drives have plenty of hidden internal spare space (especially
the expensive kinds offered for datacenters or "enthusiast" consumers
by big companies like Intel and such). Even some cheap drives might
have plenty of erased space at 100% LBA allocation if they use
compression internally (and you don't fill them up to 100% with
incompressible content).
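
Regarding point 1, the example commands I mean (device/mount point names are
examples, and blkdiscard destroys data):

blkdiscard /dev/sdX      # explicitly discard the whole device before re-partitioning
fstrim -v /mountpoint    # or just trim the free space of a mounted filesystem periodically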


Re: "corrupt leaf, invalid item offset size pair"

2017-05-08 Thread Janos Toth F.
Maybe someone more talented will be able to assist you, but in my
experience this kind of damage is fatal in practice (even if you could
theoretically fix it, it's probably easier to recreate the fs and
restore the content from backup, or to use the rescue tool to save some
of the old content you never had copies of and restore that).
I think the problem is that the disturbed disk gets out of sync with the
rest of the fs/disk(s) (obviously, it misses some queued/buffered writes)
but later gets accepted back as if it were in a perfectly
fine state (and/or as if Btrfs were ready to deal with problems like this,
though it looks like it is not), and then some fatal corruption starts
developing (because the problematic disk is treated as having
correct data, even though it has some errors). If you keep it mounted
RW long enough, it will probably get worse and become unmountable at
some point (making it harder, if not impossible, to rescue any data).
This is how I usually lost my RAID-5 mode Btrfs filesystems before I
stopped experimenting with that. I have never had this problem since I
disabled SATA hotplug (in the firmware setup of the motherboard) and
switched to RAID-10 mode (and eventually replaced both faulty SATA
cables in the system, one at a time, after an incident...).

On Mon, May 8, 2017 at 6:58 AM, Roman Mamedov  wrote:
> Hello,
>
> It appears like during some trouble with HDD cables and controllers, I got 
> some disk corruption.
> As a result, after a short period of time my Btrfs went read-only, and now 
> does not mount anymore.
>
> [Sun May  7 23:08:02 2017] BTRFS error (device dm-8): parent transid verify 
> failed on 13799442505728 wanted 625048 found 624487
> [Sun May  7 23:08:02 2017] BTRFS info (device dm-8): read error corrected: 
> ino 1 off 13799442505728 (dev /dev/mapper/vg-r6p1 sector 6736670512)
> [Sun May  7 23:08:33 2017] BTRFS error (device dm-8): parent transid verify 
> failed on 13799589576704 wanted 625088 found 624488
> [Sun May  7 23:08:33 2017] BTRFS error (device dm-8): parent transid verify 
> failed on 13799589576704 wanted 625088 found 624402
> [Sun May  7 23:08:33 2017] [ cut here ]
> [Sun May  7 23:08:33 2017] WARNING: CPU: 3 PID: 2022 at 
> fs/btrfs/extent-tree.c:6555 __btrfs_free_extent.isra.67+0x2c2/0xd40 [btrfs]()
> [Sun May  7 23:08:33 2017] BTRFS: Transaction aborted (error -5)
> [Sun May  7 23:08:33 2017] Modules linked in: dm_mirror dm_region_hash dm_log 
> ip6t_REJECT nf_reject_ipv6 ipt_REJECT nf_reject_ipv4 xt_tcpudp xt_nat 
> xt_limit xt_length nf_conntrack_ipv6 nf_defrag_ipv6 xt_conntrack 
> ip6t_rpfilter ipt_rpfilter xt_multiport iptable_nat nf_conntrack_ipv4 
> nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack ip6table_raw iptable_raw 
> ip6table_mangle iptable_mangle ip6table_filter ip6_tables iptable_filter 
> ip_tables x_tables cpufreq_userspace cpufreq_conservative cpufreq_stats 
> cpufreq_powersave nbd nfsd nfs_acl rpcsec_gss_krb5 auth_rpcgss oid_registry 
> nfsv4 dns_resolver nfs lockd grace sunrpc fscache 8021q garp mrp bridge stp 
> llc bonding tcp_illinois aoe crc32 loop it87 hwmon_vid fuse kvm_amd kvm 
> irqbypass crct10dif_pclmul eeepc_wmi crc32_pclmul ghash_clmulni_intel 
> asus_wmi sparse_keymap rfkill
> [Sun May  7 23:08:33 2017]  video sha256_ssse3 sha256_generic hmac mxm_wmi 
> drbg ansi_cprng snd_hda_codec_realtek aesni_intel snd_hda_codec_generic 
> aes_x86_64 snd_hda_intel lrw gf128mul snd_hda_codec glue_helper snd_pcsp 
> snd_hda_core ablk_helper snd_hwdep cryptd snd_pcm snd_timer cp210x joydev snd 
> serio_raw k10temp evdev usbserial edac_mce_amd edac_core soundcore 
> fam15h_power sp5100_tco acpi_cpufreq tpm_infineon wmi i2c_piix4 tpm_tis tpm 
> 8250_fintek shpchp processor button ext4 crc16 mbcache jbd2 btrfs 
> dm_cache_smq raid10 raid1 raid456 async_raid6_recov async_memcpy async_pq 
> async_xor async_tx xor raid6_pq crc32c_generic md_mod dm_cache_mq dm_cache 
> dm_persistent_data dm_bio_prison dm_bufio libcrc32c dm_mod sg sd_mod 
> ata_generic hid_generic usbhid hid ohci_pci sata_mv ahci pata_jmicron libahci 
> crc32c_intel sata_sil24
> [Sun May  7 23:08:33 2017]  ehci_pci ohci_hcd xhci_pci psmouse xhci_hcd 
> ehci_hcd libata usbcore scsi_mod e1000e usb_common ptp pps_core
> [Sun May  7 23:08:33 2017] CPU: 3 PID: 2022 Comm: btrfs-transacti Not tainted 
> 4.4.66-rm1+ #181
> [Sun May  7 23:08:33 2017] Hardware name: To be filled by O.E.M. To be filled 
> by O.E.M./M5A97 LE R2.0, BIOS 2601 03/24/2015
> [Sun May  7 23:08:33 2017]  0286 2595262f 
> 8800d675baf0 812ff351
> [Sun May  7 23:08:33 2017]  8800d675bb38 a03929d2 
> 8800d675bb28 8107eb95
> [Sun May  7 23:08:33 2017]  0c42b8ffb000 fffb 
> 8805b6c60800 
> [Sun May  7 23:08:33 2017] Call Trace:
> [Sun May  7 23:08:33 2017]  [] dump_stack+0x63/0x82
> [Sun May  7 23:08:33 2017]  [] 
> warn_slowpath_common+0x95/0xe0
> [Sun May  7 23:08:33 2017]  [] 

Re: btrfs filesystem keeps allocating new chunks for no apparent reason

2017-04-10 Thread Janos Toth F.
>> The command-line also rejects a number of perfectly legitimate
>> arguments that BTRFS does understand too though, so that's not much
>> of a test.
>
> Which are those? I didn't encounter any...

I think this bug still stands unresolved (for 3+ years, probably
because most people use an initrd/initramfs without ever considering
omitting it when they don't really need it at all):
Bug 61601 - rootflags=noatime causes kernel panic when booting without initrd.
The last time I tried, it applied to Btrfs as well:
https://bugzilla.kernel.org/show_bug.cgi?id=61601#c18


Re: linux 4.8 kernel OOM

2017-02-26 Thread Janos Toth F.
So far 4.10.0 seems to be flawless for me. All the strange OOMs (which
may or may not have been related to Btrfs, but it looked that way), the
random unexplained Btrfs mount failures (as well as various other things
totally unrelated to filesystems, like sdhc card reader driver
problems) which were present with 4.8.x and 4.9.x seem to be all gone
(at least so far... it has only been a few days since I updated to
4.10.0).


Re: btrfs recovery

2017-01-28 Thread Janos Toth F.
I usually compile my kernels with CONFIG_X86_RESERVE_LOW=640 and
CONFIG_X86_CHECK_BIOS_CORRUPTION=N because 640 kilobytes seems like a
very cheap price to pay in order to avoid worrying about this (and to
skip the associated checking + monitoring).
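
In .config terms that is simply:

CONFIG_X86_RESERVE_LOW=640
# CONFIG_X86_CHECK_BIOS_CORRUPTION is not set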

Out of curiosity (after reading this email) I set these to 4 and Y (so
a 1-page = 4k reserve, with checking turned ON and activated by default)
on a useless laptop. Right after reboot, the kernel log was full of
the same kind of Btrfs errors reported in the first email of this
topic ("bad key order", etc). I could run a scrub with zero errors and
successfully reboot with a read-write mounted root filesystem with the
old kernel build (but the kernel log was still full of errors, as you
might imagine). I tried to run "btrfs check --repair" but it seems to
be useless in this situation; the filesystem needs to be recreated
(not too hard in my case while it's still fully readable). That said,
the kernel log was free of the "Corrupted low memory at" kind of
messages (even though I let it run for hours).

On Sat, Jan 28, 2017 at 6:00 AM, Duncan <1i5t5.dun...@cox.net> wrote:
> Austin S. Hemmelgarn posted on Fri, 27 Jan 2017 07:58:20 -0500 as
> excerpted:
>
>> On 2017-01-27 06:01, Oliver Freyermuth wrote:
 I'm also running 'memtester 12G' right now, which at least tests 2/3
 of the memory. I'll leave that running for a day or so, but of course
 it will not provide a clear answer...
>>>
>>> A small update: while the online memtester is without any errors still,
>>> I checked old syslogs from the machine and found something intriguing.
>
>>> kernel: Corrupted low memory at 88009000 (9000 phys) = 00098d39
>>> kernel: Corrupted low memory at 88009000 (9000 phys) = 00099795
>>> kernel: Corrupted low memory at 88009000 (9000 phys) = 000dd64e
>
> 0x9000 = 36K...
>
>>> This seems to be consistently happening from time to time (I have low
>>> memory corruption checking compiled in).
>>> The numbers always consistently increase, and after a reboot, start
>>> fresh from a small number again.
>>>
>>> I suppose this is a BIOS bug and it's storing some counter in low
>>> memory. I am unsure whether this could have triggered the BTRFS
>>> corruption, nor do I know what to do about it (are there kernel quirks
>>> for that?). The vendor does not provide any updates, as usual.
>>>
>>> If someone could confirm whether this might cause corruption for btrfs
>>> (and maybe direct me to the correct place to ask for a kernel quirk for
>>> this device - do I ask on MM, or somewhere else?), that would be much
>>> appreciated.
>
>> It is a firmware bug, Linux doesn't use stuff in that physical address
>> range at all.  I don't think it's likely that this specific bug caused
>> the corruption, but given that the firmware doesn't have it's
>> allocations listed correctly in the e820 table (if they were listed
>> correctly, you wouldn't be seeing this message), it would not surprise
>> me if the firmware was involved somehow.
>
> Correct me if I'm wrong (I'm no kernel expert, but I've been building my
> own kernel for well over a decade now so having a working familiarity
> with the kernel options, of which the following is my possibly incorrect
> read), but I believe that's only "fact check: mostly correct" (mostly as
> in yes it's the default, but there's a mainline kernel option to change
> it).
>
> I was just going over the related kernel options again a couple days ago,
> so they're fresh in my head, and AFAICT...
>
> There are THREE semi-related kernel options (config UI option location is
> based on the mainline 4.10-rc5+ git kernel I'm presently running):
>
> DEFAULT_MMAP_MIN_ADDR
>
> Config location: Processor type and features:
> Low address space to protect from user allocation
>
> This one is virtual memory according to config help, so likely not
> directly related, but similar idea.
>
> X86_CHECK_BIOS_CORRUPTION
>
> Location: Same section, a few lines below the first one:
> Check for low memory corruption
>
> I guess this is the option you (OF) have enabled.  Note that according to
> help, in addition to enabling this in options, a runtime kernel
> commandline option must be given as well, to actually enable the checks.
>
> X86_RESERVE_LOW
>
> Location: Same section, immediately below the check option:
> Amount of low memory, in kilobytes, to reserve for the BIOS
>
> Help for this one suggests enabling the check bios corruption option
> above if there are any doubts, so the two are directly related.
>
> All three options apparently default to 64K (as that's what I see here
> and I don't believe I've changed them), but can be changed.  See the
> kernel options help and where it points for more.
>
> My read of the above is that yes, by default the kernel won't use
> physical 0x9000 (36K), as it's well within the 64K default reserve area,
> but a blanket "Linux doesn't use stuff in that physical address range at
> all" is incorrect, as if the defaults have been 

Re: RAID56 status?

2017-01-23 Thread Janos Toth F.
On Mon, Jan 23, 2017 at 7:57 AM, Brendan Hide  wrote:
>
> raid0 stripes data in 64k chunks (I think this size is tunable) across all
> devices, which is generally far faster in terms of throughput in both
> writing and reading data.

I remember seeing some proposals for a configurable stripe size in the
form of patches (which changed a lot over time) but I don't think the
idea reached a consensus (let alone that a final patch materialized and
got merged). I think it would be a nice feature though.


Re: Unocorrectable errors with RAID1

2017-01-16 Thread Janos Toth F.
> BTRFS uses a 2 level allocation system.  At the higher level, you have
> chunks.  These are just big blocks of space on the disk that get used for
> only one type of lower level allocation (Data, Metadata, or System).  Data
> chunks are normally 1GB, Metadata 256MB, and System depends on the size of
> the FS when it was created.  Within these chunks, BTRFS then allocates
> individual blocks just like any other filesystem.

This always seems to confuse me when I try to get an abstract idea
about de-/fragmentation in Btrfs.
Can meta-/data be fragmented on both levels? And if so, can defrag
and/or balance "cure" both levels of fragmentation (if any)?
But how? Maybe several defrag and balance runs, repeated until the
returns diminish (or at least until you consider them meaningless and/or
unnecessary)?


> What balancing does is send everything back through the allocator, which in
> turn back-fills chunks that are only partially full, and removes ones that
> are now empty.

Doesn't this have a potential chance of introducing (additional)
extent-level fragmentation?

> FWIW, while there isn't a daemon yet that does this, it's a perfect thing
> for a cronjob.  The general maintenance regimen that I use for most of my
> filesystems is:
> * Run 'btrfs balance start -dusage=20 -musage=20' daily.  This will complete
> really fast on most filesystems, and keeps the slack-space relatively
> under-control (and has the nice bonus that it helps defragment free space.
> * Run a full scrub on all filesystems weekly.  This catches silent
> corruption of the data, and will fix it if possible.
> * Run a full defrag on all filesystems monthly.  This should be run before
> the balance (reasons are complicated and require more explanation than you
> probably care for).  I would run this at least weekly though on HDD's, as
> they tend to be more negatively impacted by fragmentation.

I wonder if one should always run a full balance instead of a full
scrub, since balance should also read (and thus theoretically verify)
the meta-/data (does it though? I would expect it to check the
checksums, but who knows... maybe it's "optimized" to skip that
step?) and also perform the "consolidation" at the chunk level.

I wish there were a more "integrated" solution for this: a
balance-like operation which consolidates the chunks and
de-fragments the file extents at the same time, while passively
uncovering (and fixing, if necessary and possible) any checksum mismatches
/ data errors, so that balance and defrag can't work against
each other and the overall work is minimized (compared to several full
runs or many different commands).
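
Until something like that exists, I suppose your regimen boils down to a cron
sketch like this (mount point and times picked arbitrarily):

# /etc/cron.d/btrfs-maintenance
30 3 * * *   root   btrfs balance start -dusage=20 -musage=20 /mnt/pool
0  4 * * 0   root   btrfs scrub start -B /mnt/pool
0  5 1 * *   root   btrfs filesystem defragment -r /mnt/pool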


Re: [PATCH] recursive defrag cleanup

2017-01-04 Thread Janos Toth F.
I separated these 9 camera storages into 9 subvolumes (so now I have
10 subvols in total in this filesystem, including the "root" subvol). It's
obviously way too early to talk about long-term performance, but I
can already tell that recursive defrag does NOT descend into "child"
subvolumes (it does not pick up the files located in these "child"
subvolumes when I point it at the "root" subvolume with the -r
option). That's very inconvenient (one might need to write a script
with a long static list of subvolumes and maintain it over time, or
write a script which acquires the list from the subvolume list command
and feeds it to the defrag command one by one).
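
Something along these lines is what I have in mind (an untested sketch; it
assumes /mnt/pool is the mounted toplevel and that no subvolume path contains
spaces):

btrfs filesystem defragment -r -t 32M /mnt/pool
btrfs subvolume list -o /mnt/pool | awk '{ print $NF }' | while read -r sv; do
    btrfs filesystem defragment -r -t 32M "/mnt/pool/$sv"
done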

> Because each subvolume is functionally it's own tree, it has it's own
> locking for changes and other stuff, which means that splitting into
> subvolumes will usually help with concurrency.  A lot of high concurrency
> performance benchmarks do significantly better if you split things into
> individual subvolumes (and this drives a couple of the other kernel
> developers crazy to no end).  It's not well published, but this is actually
> the recommended usage if you can afford the complexity and don't need
> snapshots.

I am not a developer but this idea drives me crazy as well. I know
it's silly reasoning, but if you blindly extrapolate this idea you
come to the conclusion that every single file should be transparently
placed in its own unique subvolume (by some automatic background
task) and every directory should automatically be a subvolume. I guess
there must be some inconveniently sub-optimal behavior in the tree
handling which could theoretically be optimized (or the observed
performance improvement from subvolume segregation is some kind of
measurement error which does not really translate into an actual real-life
overall performance benefit but only looks like one from the
specific perspective of the tests).

> As far as how much your buffering for write-back, that should depend
> entirely on how fast your RAM is relative to your storage device.  The
> smaller the gap between your storage and your RAM in terms of speed, the
> more you should be buffering (up to a point).  FWIW, I find that with
> DDR3-1600 RAM and a good (~540MB/s sequential write) SATA3 SSD, about
> 160-320MB gets a near ideal balance of performance, throughput, and
> fragmentation, but of course YMMV.

I don't think I share your logic on this. I usually consider the write
load random, and I don't like my software possibly stalling while
there is plenty of RAM lying around to be used as a buffer until
other tasks might stop thrashing the disks (i.e. "bigger is always
better").

> Out of curiosity, just on this part, have you tried using cgroups to keep
> the memory usage isolated better?

No, I didn't even know cgroups could control the page cache based on the
process which generates the cacheable IO.
To be honest, I don't think it's worth the effort for me (I would need
to learn how to use cgroups; I have zero experience with that).

> Also, if you can get ffmpeg to spit out the stream on stdout, you could pipe
> to dd and have that use Direct-IO.  The dd command should be something along
> the lines of:
> dd of=<output file> oflag=direct iflag=fullblock bs=<multiple of node-size>
> The oflag will force dd to open the output file with O_DIRECT, the iflag
> will force it to collect full blocks of data before writing them (the size
> is set by bs=, I'd recommend using a power of 2 that's a multiple of your
> node-size, larger numbers will increase latency but reduce fragmentation and
> improve throughput).  This may still use a significant amount of RAM (the
> pipe is essentially an in-memory buffer), and may crowd out other
> applications, but I have no idea how much it may or may not help.

This I can try (when I have no better things to play with). Thank you.
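
(If I do, it would presumably look something like this; the stream URL, output
path and block size are all made up:)

ffmpeg -i rtsp://camera1/stream -c copy -f matroska - \
  | dd of=/recordings/cam1.mkv oflag=direct iflag=fullblock bs=16M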


Re: [PATCH] recursive defrag cleanup

2017-01-03 Thread Janos Toth F.
On Tue, Jan 3, 2017 at 5:01 PM, Austin S. Hemmelgarn wrote:
> I agree on this point.  I actually hadn't known that it didn't recurse into
> sub-volumes, and that's a pretty significant caveat that should be
> documented (and ideally fixed, defrag doesn't need to worry about
> cross-subvolume stuff because it breaks reflinks anyway).

I think it descends into subvolumes to pick up the files (data)
inside them. I was referring to picking up the "child" subvolumes
(trees) themselves and defragging those (as if you fed all the subvolumes
to a non-recursive defrag one by one with the current implementation --- if
I understand the current implementation correctly*).

To keep it simple: the recursive mode (IMO) should not ignore any
entities which the defrag tool is able to meaningfully operate on (no
matter whether these are file data, directory metadata, subvolume tree
metadata, etc. --- if it can be fragmented and can be defragged by this
tool, it should be done during a recursive-mode operation with no
exceptions, unless you can and do set explicit exceptions). I think
only the subvolume and/or the directory (*) metadata are currently
ignored by the recursive mode (if anything).

* But you got me a little bit confused again. After reading the first
few emails in this thread I thought only files (data) and subvolumes
(tree metadata) can be defragged by this tool and that it's a no-op for
regular directories. Yet you seem to imply it's possible to defrag
regular directories (the directory metadata), meaning defrag can
operate on 3 types of entities in total (file data, subvolume tree
metadata, regular directory metadata).

> For single directories, -t almost certainly has near zero effect since
> directories are entirely in metadata.  For single files, it should only have
> an effect if it's smaller than the size of the file (it probably is for your
> usage if you've got hour long video files).  As far as the behavior above
> 128MB, stuff like that is expected to a certain extent when you have highly
> fragmented free space (the FS has to hunt harder to find a large enough free
> area to place the extent).
>
> FWIW, unless you have insanely slow storage, 32MB is a reasonable target
> fragment size.  Fragmentation is mostly an issue with sequential reads, and
> usually by the time you're through processing that 32MB of data, your
> storage device will have the next 32MB ready.  The optimal value of course
> depends on many things, but 32-64MB is reasonable for most users who aren't
> streaming multiple files simultaneously off of a slow hard drive.

Yes, I know, and it's not a problem to use <=32MB. I just wondered why
>=128MB seems to be so incredibly slow for me.
Well, actually, I also wondered if the defrag tool can "create" big
enough contiguous free space chunks by relocating other (probably
small[ish]) files (including non-fragmented files) in order to make
room for huge fragmented files to be re-assembled there as contiguous
files. I just didn't make the connection between these two questions.
I mean, defrag will obviously fail with huge target extent sizes and
huge fragmented files if the free space is fragmented (and why
wouldn't it be somewhat fragmented...? Deleting fragmented files will
result in fragmented free space, and new files will be fragmented if
free space is fragmented, so you will delete fragmented files once
again, and it goes on forever -> that was my nightmare with ZFS... it
feels like the FS can only become more and more fragmented over time
unless you keep a lot of free space [let's say >50%] all the time, and
even then it still remains somewhat random).

Although, this brings up complications. A really extensive defrag
would require some sort of smart planning: building a map of objects
(including free space and contiguous files), devising the best
possible target layout and trying to achieve that shape by heavy
reorganization of (meta)data.

> Really use case specific question, but have you tried putting each set of
> files (one for each stream) in it's own sub-volume?  Your metadata
> performance is probably degrading from the sheer number of extents involved
> (assuming H.264 encoding and full HD video with DVD quality audio, you're
> probably looking at at least 1000 extents for each file, probably more), and
> splitting into a subvolume per-stream should segregate the metadata for each
> set of files, which should in turn help avoid stuff like lock contention
> (and may actually make both balance and defrag run faster).

Before I had a dedicated disk+filesystem for these files I did think
about creating a subvolume for all these video recordings rather than
keeping them in a simple directory under a big multi-disk filesystem's
root/default subvolume. (The decision to separate these files was
forced by an external scalability problem --- limited number of
connectors/slots for disks and limited "working" RAID options in Btrfs
--- rather than an explicit desire for segregation -> although in the

Re: [PATCH] recursive defrag cleanup

2017-01-03 Thread Janos Toth F.
So, in order to defrag "everything" in the filesystem (which is
possible to / potentially needs defrag) I need to run:
1: a recursive defrag starting from the root subvolume (to pick up all
the files in all the possible subvolumes and directories)
2: a non-recursive defrag on the root subvolume + (optionally)
additional non-recursive defrag(s) on all the other subvolume(s) (if
any) [but not on all directories like some old scripts did]

In my opinion, the recursive defrag should pick up and operate on all
the subvolumes, including the one specified in the command line (if
it's a subvolume) and all subvolumes "below" it (not on files only).
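
Concretely, under the current implementation the full sequence would be
something like this sketch (the mount point, the subvolume name and the 32M
target size are only examples):

# 1: recursive defrag, picks up the file data
btrfs filesystem defragment -r -t 32M /mnt
# 2: non-recursive defrag on the root subvolume (tree metadata),
#    optionally repeated for every other subvolume
btrfs filesystem defragment /mnt
btrfs filesystem defragment /mnt/some-subvolume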


Does the -t parameter have any meaning/effect on non-recursive (tree)
defrag? I usually go with 32M because -t >=128MB tends to be unduly slow
(it takes a lot of time, even if I run it repeatedly on the
same static file several times in a row, whereas -t <=32M finishes rather
quickly in that case -> could this be a bug or design flaw?).


I have a Btrfs filesystem (among others) on a single HDD with
single,single,single block profiles which is effectively write-only.
Nine concurrent ffmpeg processes write files from real-time video
streams 24/7 (there is no pre-allocation, the files just grow and grow
for an hour until a new one starts). A daily cronjob deletes the old
files every night and starts a recursive defrag on the root subvolume
(there are no other subvolumes, only the default id=5). I appended a
non-recursive defrag to this maintenance script now but I doubt it
does anything meaningful in this case (it finishes very fast, so I
don't think it does too much work). This is the filesystem which
"degrades" in speed for me very fast and needs metadata re-balance
from time to time (I usually do it before every kernel upgrades and
thus reboots in order to avoid a possible localmount rc-script
timeouts).
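
The maintenance job is roughly along these lines (a sketch only; the path,
the retention period and the target size are not the exact values from my
script):

#!/bin/sh
# drop recordings older than two weeks
find /data/recordings -type f -name '*.avi' -mtime +14 -delete
# recursive defrag of the remaining files
btrfs filesystem defragment -r -t 32M /data/recordings
# the recently appended non-recursive defrag (root subvolume tree)
btrfs filesystem defragment /data/recordings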

I know I should probably use a much simpler filesystem (maybe even
vfat, or ext4 - possibly with the journal disabled) for this kind of
storage but I was curious how Btrfs can handle the job (with CoW
enabled, no less). All in all, everything is fine except the
degradation of metadata performance. Since it's mostly write-only, I
could even skip the file defrags (I originally scheduled them in the hope
they would overcome the metadata slowdown problems, and it's also useful
[even if not necessary] to have the files defragmented in case I
occasionally want to use them). I am not sure, but I guess defragmenting
the files helps to reduce the overall metadata size and thus makes the
balance step faster (quick balancing) and more efficient (better
post-balance performance).


I can't remember the exact script but it basically fed every single
directory (not just subvolumes) to the defrag tool using 'find', and
it was meant to complement a separate recursive defrag step. It was
supposed to defrag the metadata (the metadata of every single
directory below the specified location, one by one, so it was very
quick on my video-archive but very slow on my system root and didn't
really seem to achieve anything on either of them).


Re: [PATCH] recursive defrag cleanup

2016-12-28 Thread Janos Toth F.
I still find the defrag tool a little bit confusing from a user perspective:
- Does the recursive defrag (-r) also defrag the specified directory's
extent tree or should one run two separate commands for completeness
(one with -r and one without -r)?
- What's the target scope of the extent tree defragmentation? Is it
recursive on the tree (regardless of the -r option) and thus
defragments all the extent trees in case one targets the root
subvolume?

In other words: What is the exact sequence of commands if one wishes
to defragment the whole filesystem as extensively as possible (all
files and extent trees included)?

There used to be a script floating around on various wikis (for
example, the Arch Linux wiki) which used the "find" tool to feed each
and every directory to the defrag command. I always thought that must
be overkill and now it's gone, but I don't see further explanations
and/or new scripts in place (other than a single command with the -r
option).


It's also a bit of a mystery to me whether balancing the metadata chunks is
supposed to effectively defragment the metadata or not, and what
the best practice regarding that issue is.
In my personal experience Btrfs filesystems tend to get slower over
time, up to the point where it takes several minutes to mount them or
to delete some big files (observed on HDDs, not on SSDs where the
sheer speed might mask the problem and the filesystem tends to be smaller
anyway). When it gets really bad, Gentoo's localmount script starts to
time out on boot and Samba based network file deletions tend to freeze
the client Windows machine's file explorer.
It only takes 3-6 months and/or roughly 10-20 times the total
disk(s) capacity's worth of write load to get there. Defrag doesn't
seem to help with that, but running a balance on each and every
metadata block group (data and system blocks can be skipped) seems to
"renew" it (no more timeouts or noticeable delays on mount, metadata
operations are as fast as expected, it works like a young
filesystem...).
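
For clarity, the balance I am talking about is a metadata-only one, roughly
like this (the mount point is just an example):

# rebalance only the metadata block groups; data and system chunks are left alone
btrfs balance start -m /mnt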

One might expect that targeting the root subvolume with a recursive
defrag will take care of metadata fragmentation as well but it doesn't
seem to be the case and I don't see anybody recommending regular
metadata balancing.


Re: some free space cache corruptions

2016-12-25 Thread Janos Toth F.
I am not sure I can remember a time when btrfs check did not print
this "cache and super generation don't match, space cache will be
invalidated" message, so I started ignoring it a long time ago because
I never seemed to have problems with missing free space and never got
any similar warnings/errors in the kernel log.
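
(For completeness, a full clear can be forced explicitly with newer
btrfs-progs; the device name below is just a placeholder:)

# the filesystem must be unmounted for btrfs check
btrfs check --clear-space-cache v1 /dev/sdX
# or mount once with -o clear_cache and let the cache be rebuilt on the fly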

On Mon, Dec 26, 2016 at 1:12 AM, Duncan <1i5t5.dun...@cox.net> wrote:
> Christoph Anton Mitterer posted on Sun, 25 Dec 2016 23:00:34 +0100 as
> excerpted:
>
>> # btrfs check /dev/mapper/data-a2 ; echo $?
>> Checking filesystem on /dev/mapper/data-a2
>
> [...]
>
>> checking free space cache
>> block group 5431552376832 has wrong amount of free space
>> failed to load free space cache for block group 5431552376832
>
> [...]
>
>> 0
>>
>> => same error again...
>>
>> Any ideas how to resolve? And is this some serious error that could have
>> caused corruptions?
>
> By themselves, free-space cache warnings are minor and not a serious
> issue at all -- the cache is just that, a cache, designed to speed
> operation but not actually necessary, and btrfs can detect and route
> around space-cache corruption on-the-fly so by itself it's not a big deal.
>
> These warnings are however hints that something out of the routine has
> happened, and that you might wish to freshen your backups, run btrfs
> check and scrub and see if anything else is wrong (if you get them at
> mount, you got them /running/ btrfs check and nothing else out of the
> ordinary was reported), etc.
>
> Three things to note:
>
> 1) Plain btrfs check, without options that trigger fixes, is read-only,
> so you are likely to see anything unusual it reports again in repeated
> runs, unless the filesystem itself, or a scrub, etc, has fixed things in
> the mean time.  (And as I said, the space-cache is only a cache, designed
> to speed things up, cache corruption is fairly common and btrfs can and
> does deal with it without issue.  In fact btrfs has the nospace_cache
> option to entirely disable it at mount.)
>
> 2) It recently came to the attention of the devs that the existing btrfs
> mount-option method of clearing the free-space cache only clears it for
> block-groups/chunks it encounters on-the-fly.  It doesn't do a systematic
> beginning-to-end clear.  As such, in some instances it's possible to run
> with the clear_cache mount option (see the btrfs (5) manpage for mount
> option specifics, but to head off another question, it's recommended to
> stay with v1 cache for now) and still see space-cache warnings you
> thought should be cleared up, if btrfs didn't deal with those chunks in
> the run where the cache was cleared.
>
> 3) As a result of #2, the devs only very recently added support in btrfs
> check for a /full/ space-cache-v1 clear, using the new
> --clear-space-cache option.  But your btrfs-progs v4.7.3 is too old to
> support it.  I know it's in the v4.9 I just upgraded to... checking the
> wiki it appears the option was added in btrfs-progs v4.8.3 (v4.8.4 for v2
> cache).
>
> So if you want you can try the clear_cache mount option, and if that
> doesn't do it, upgrade to a current btrfs-progs and run it with the
> --clear-space-cache option, but you're not endangering your filesystem or
> anything by simply waiting until you get a btrfs-progs update from your
> distro, if you decide to.  The space-cache warnings aren't indicative of
> a serious problem now and btrfs deals with them on its own, they are
> simply hints that something, perhaps a crash with the btrfs mounted
> writable, happened at some time in the past, and that it it might be wise
> to investigate further for other damage, which you've already done, so
> you're good. =:^)
>
> Tho if you haven't recently run a scrub, I'd do that as well (and in fact
> recommend running it before check if you can successfully mount), since
> the problems it detects and fixes are conceptually different than the
> ones btrfs check deals with.  Scrub deals with actual on-media
> corruption, blocks not matching their checksum, while check deals with
> filesystem logic errors, whether or not the blocks containing them match
> the checksum.
>
> --
> Duncan - List replies preferred.   No HTML msgs.
> "Every nonfree program has a lord, a master --
> and if you use the program, he is your master."  Richard Stallman
>


Re: [PATCH 4/6] Btrfs: add DAX support for nocow btrfs

2016-12-07 Thread Janos Toth F.
I realize this is related very loosely (if at all) to this topic but
what about these two possible features:
- a mount option, or
- an attribute (which could be set on directories and/or sub-volumes
and applied to any new files created below these)
which effectively forces every read/write operation to behave as if
the file was explicitly opened with DirectIO by the application (even
if the application has no DirectIO support)?

This could achieve something loosely similar to DAX while keeping more
of the "advanced" Btrfs features (I think only compression is ruled
out by DIO).


Re: btrfs flooding the I/O subsystem and hanging the machine, with bcache cache turned off

2016-12-01 Thread Janos Toth F.
Is there any fundamental reason not to support huge writeback caches?
(I mean, besides working around bugs and/or questionable design
choices which no one wishes to fix.)
The obvious drawback is the increased risk of data loss upon hardware
failure or kernel panic but why couldn't the user be allowed to draw
the line between probability of data loss and potential performance
gains?

The last time I changed hardware, I put double the amount of RAM into
my little home server for the sole purpose of using a relatively huge
cache, especially a huge writeback cache. Although I realized soon
enough that writeback ratios like 20/45 will make the system unstable
(OOM reaping) even if ~90% of the memory is theoretically free, i.e. used
as some form of cache, read or write, depending on these ratio
parameters, and I ended up below the defaults to get rid of the OOM reaper.

My plan was to try and decrease the fragmentation of files which are
created by dumping several parallel real-time video streams into
separate files (and also minimize the HDD head seeks due to that).
(The computer in question is on a UPS.)
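
(The knobs I mean are just the standard writeback sysctls; the ratio values
are the ones mentioned above, and the byte values are only an illustration of
the absolute alternative Michal suggests below:)

# ratios, e.g. in /etc/sysctl.conf
vm.dirty_background_ratio = 20
vm.dirty_ratio = 45
# or, instead of ratios, absolute limits
vm.dirty_background_bytes = 1073741824
vm.dirty_bytes = 4294967296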

On Thu, Dec 1, 2016 at 4:49 PM, Michal Hocko  wrote:
> On Wed 30-11-16 10:16:53, Marc MERLIN wrote:
>> +folks from linux-mm thread for your suggestion
>>
>> On Wed, Nov 30, 2016 at 01:00:45PM -0500, Austin S. Hemmelgarn wrote:
>> > > swraid5 < bcache < dmcrypt < btrfs
>> > >
>> > > Copying with btrfs send/receive causes massive hangs on the system.
>> > > Please see this explanation from Linus on why the workaround was
>> > > suggested:
>> > > https://lkml.org/lkml/2016/11/29/667
>> > And Linux' assessment is absolutely correct (at least, the general
>> > assessment is, I have no idea about btrfs_start_shared_extent, but I'm more
>> > than willing to bet he's correct that that's the culprit).
>>
>> > > All of this mostly went away with Linus' suggestion:
>> > > echo 2 > /proc/sys/vm/dirty_ratio
>> > > echo 1 > /proc/sys/vm/dirty_background_ratio
>> > >
>> > > But that's hiding the symptom which I think is that btrfs is piling up 
>> > > too many I/O
>> > > requests during btrfs send/receive and btrfs scrub (probably balance 
>> > > too) and not
>> > > looking at resulting impact to system health.
>>
>> > I see pretty much identical behavior using any number of other storage
>> > configurations on a USB 2.0 flash drive connected to a system with 16GB of
>> > RAM with the default dirty ratios because it's trying to cache up to 3.2GB
>> > of data for writeback.  While BTRFS is doing highly sub-optimal things 
>> > here,
>> > the ancient default writeback ratios are just as much a culprit.  I would
>> > suggest that get changed to 200MB or 20% of RAM, whichever is smaller, 
>> > which
>> > would give overall almost identical behavior to x86-32, which in turn works
>> > reasonably well for most cases.  I sadly don't have the time, patience, or
>> > expertise to write up such a patch myself though.
>>
>> Dear linux-mm folks, is that something you could consider (changing the
>> dirty_ratio defaults) given that it affects at least bcache and btrfs
>> (with or without bcache)?
>
> As much as the dirty_*ratio defaults a major PITA this is not something
> that would be _easy_ to change without high risks of regressions. This
> topic has been discussed many times with many good ideas, nothing really
> materialized from them though :/
>
> To be honest I really do hate dirty_*ratio and have seen many issues on
> very large machines and always suggested to use dirty_bytes instead but
> a particular value has always been a challenge to get right. It has
> always been very workload specific.
>
> That being said this is something more for IO people than MM IMHO.
>
> --
> Michal Hocko
> SUSE Labs


Re: RFC: raid with a variable stripe size

2016-11-29 Thread Janos Toth F.
I would love to have the stripe element size (the per-disk portion of
a logical "full" stripe) changeable online with balance anyway
(starting from 512 bytes/disk, not placing artificial arbitrary
limitations on it at the low end).
A small stripe size (for example 4k/disk, or even 512 bytes/disk if you
happen to have HDDs with real 512 byte physical sectors) would help
minimize this temporary space waste problem a lot (16-fold if you go
from 64k to 4k, or even eliminate it completely if you go down to 512 bytes
on a 5-disk RAID-5).

And regardless of that, I think having to remember to balance
regularly, or even artificially running out of space from time to time,
is much better than living with constantly impending doom in mind
(and probably experiencing that disaster for real in your lifetime).

In case you wonder and/or care, ZFS not only allows for setting the
parameter which is closest to the "stripe element size" (smallest unit
which can be written to a disk at once) to 512 byte but that still
continues to be the default for many ZFS implementations, with 4k (or
more) being only optional (it's controlled by "ashift" and set
statically at pool creation time, although additional cache/log
devices might be added later with a different ashift). And I like it
that way. I never used a bigger ashift than the one matching the
physical sector size of the disks (usually 512 bytes for HDDs or 4k for
SSDs). And I always used the smallest recordsize (effectively the
minimum "full" stripe size) I could get away with before notably
throttling the performance of sustained sequential writes. In this
regard, I never understood why people tend to crave huge units like
1MiB stripe size or so. Don't they ever store small files or read
small chunks of big files and/or care about latency (and even
minimizing the potential data loss in case multiple random sectors get
damaged on multiple disks, or upon power failure / kernel panic), not
even as long as benchmarks can show it's almost free to go lower...?
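
(For comparison, this is the kind of choice I mean on the ZFS side; the pool
name, devices and recordsize are made-up examples and the syntax is the
ZFS-on-Linux one:)

# ashift=9 keeps the 512 byte allocation unit on true 512n disks (ashift=12 would force 4k)
zpool create -o ashift=9 tank raidz /dev/sda /dev/sdb /dev/sdc
# a smallish recordsize, as small as sequential write throughput allows
zfs set recordsize=32k tank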

On Tue, Nov 29, 2016 at 6:49 AM, Qu Wenruo  wrote:
>
>
> At 11/29/2016 12:55 PM, Zygo Blaxell wrote:
>>
>> On Tue, Nov 29, 2016 at 12:12:03PM +0800, Qu Wenruo wrote:
>>>
>>>
>>>
>>> At 11/29/2016 11:53 AM, Zygo Blaxell wrote:

 On Tue, Nov 29, 2016 at 08:48:19AM +0800, Qu Wenruo wrote:
>
> At 11/19/2016 02:15 AM, Goffredo Baroncelli wrote:
>>
>> Hello,
>>
>> these are only my thoughts; no code here, but I would like to share it
>> hoping that it could be useful.
>>
>> As reported several times by Zygo (and others), one of the problem
>
> of raid5/6 is the write hole. Today BTRFS is not capable to address it.
>
> I'd say, no need to address yet, since current soft RAID5/6 can't
> handle it
> yet.
>
> Personally speaking, Btrfs should implementing RAID56 support just like
> Btrfs on mdadm.


 Even mdadm doesn't implement it the way btrfs does (assuming all bugs
 are fixed) any more.

> See how badly the current RAID56 works?


> The marginally benefit of btrfs RAID56 to scrub data better than
> tradition
> RAID56 is just a joke in current code base.


>> The problem is that the stripe size is bigger than the "sector size"
>
> (ok sector is not the correct word, but I am referring to the basic
> unit of writing on the disk, which is 4k or 16K in btrfs).  >So when
> btrfs writes less data than the stripe, the stripe is not filled; when
> it is filled by a subsequent write, a RMW of the parity is required.
>>
>>
>> On the best of my understanding (which could be very wrong) ZFS try
>
> to solve this issue using a variable length stripe.
>
> Did you mean ZFS record size?
> IIRC that's file extent minimum size, and I didn't see how that can
> handle
> the write hole problem.
>
> Or did ZFS handle the problem?


 ZFS's strategy does solve the write hole.  In btrfs terms, ZFS embeds
 the
 parity blocks within extents, so it behaves more like btrfs compression
 in the sense that the data in a RAID-Z extent is encoded differently
>>>
>>> >from the data in the file, and the kernel has to transform it on reads

 and writes.

 No ZFS stripe can contain blocks from multiple different
 transactions because the RAID-Z stripes begin and end on extent
 (single-transaction-write) boundaries, so there is no write hole on ZFS.

 There is some space waste in ZFS because the minimum allocation unit
 is two blocks (one data one parity) so any free space that is less
 than two blocks long is unusable.  Also the maximum usable stripe width
 (number of disks) is the size of the data in the extent plus one parity
 block.  It means if you write a lot of discontiguous 4K blocks, you
 effectively get 2-disk RAID1 and that may result in disappointing
 storage 

Re: RFC: raid with a variable stripe size

2016-11-18 Thread Janos Toth F.
Yes, I don't think one could find any NAND based SSDs with <4k page
size on the market right now (even =4k is hard to get) and 4k is
becoming the new norm for HDDs. However, some HDD manufacturers
continue to offer drives with 512 byte sectors (I think it's possible
to get new ones in sizable quantities if you need them).

I am aware it wouldn't solve the problem for >=4k sector devices
unless you are ready to balance frequently. But I think it would still
be a lot better to waste padding space on 4k stripes than, say, 64k
stripes until you can balance the new block groups. And, if the space
waste ratio is tolerable, then this could be an automatic background
task as soon as an individual block group or their totals get to a
high waste ratio.

I suggest this as a quick temporary workaround because it could be
cheap in terms of work if the above mentioned functionalities (stripe
size change, auto-balance) were worked on anyway (regardless of
RAID-5/6 specific issues) until some better solution is realized
(probably through a lot more work over a much longer development
period). RAID-5 isn't really optimal for a huge number of disks (URE
during rebuild issue...), so the temporary space waste is probably
<=8x per unbalanced block group (which is 1GB or maybe ~10GB if I
am not mistaken, so usually <<8x of the whole available space). But
maybe my guesstimates are wrong here.
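
(Side note: whether a given drive is really 512n or 512e/4Kn is easy to
check; the device name is just an example:)

# physical vs. logical sector size as reported by the kernel
lsblk -o NAME,PHY-SEC,LOG-SEC /dev/sda
cat /sys/block/sda/queue/physical_block_size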

On Fri, Nov 18, 2016 at 9:51 PM, Timofey Titovets <nefelim...@gmail.com> wrote:
> 2016-11-18 23:32 GMT+03:00 Janos Toth F. <toth.f.ja...@gmail.com>:
>> Based on the comments of this patch, stripe size could theoretically
>> go as low as 512 byte:
>> https://mail-archive.com/linux-btrfs@vger.kernel.org/msg56011.html
>> If these very small (0.5k-2k) stripe sizes could really work (it's
>> possible to implement such changes and it does not degrade performance
>> too much - or at all - to keep it so low), we could use RAID-5(/6) on
>> <=9(/10) disks with 512 byte physical sectors (assuming 4k filesystem
>> sector size + 4k node size, although I am not sure if node size is
>> really important here) without having to worry about RMW, extra space
>> waste or additional fragmentation.
>>
>> On Fri, Nov 18, 2016 at 7:15 PM, Goffredo Baroncelli <kreij...@libero.it> 
>> wrote:
>>> Hello,
>>>
>>> these are only my thoughts; no code here, but I would like to share it 
>>> hoping that it could be useful.
>>>
>>> As reported several times by Zygo (and others), one of the problem of 
>>> raid5/6 is the write hole. Today BTRFS is not capable to address it.
>>>
>>> The problem is that the stripe size is bigger than the "sector size" (ok 
>>> sector is not the correct word, but I am referring to the basic unit of 
>>> writing on the disk, which is 4k or 16K in btrfs).
>>> So when btrfs writes less data than the stripe, the stripe is not filled; 
>>> when it is filled by a subsequent write, a RMW of the parity is required.
>>>
>>> On the best of my understanding (which could be very wrong) ZFS try to 
>>> solve this issue using a variable length stripe.
>>>
>>> On BTRFS this could be achieved using several BGs (== block group or 
>>> chunk), one for each stripe size.
>>>
>>> For example, if a filesystem - RAID5 is composed by 4 DISK, the filesystem 
>>> should have three BGs:
>>> BG #1,composed by two disks (1 data+ 1 parity)
>>> BG #2 composed by three disks (2 data + 1 parity)
>>> BG #3 composed by four disks (3 data + 1 parity).
>>>
>>> If the data to be written has a size of 4k, it will be allocated to the BG 
>>> #1.
>>> If the data to be written has a size of 8k, it will be allocated to the BG 
>>> #2
>>> If the data to be written has a size of 12k, it will be allocated to the BG 
>>> #3
>>> If the data to be written has a size greater than 12k, it will be allocated 
>>> to the BG3, until the data fills a full stripes; then the remainder will be 
>>> stored in BG #1 or BG #2.
>>>
>>>
>>> To avoid unbalancing of the disk usage, each BG could use all the disks, 
>>> even if a stripe uses less disks: i.e
>>>
>>> DISK1 DISK2 DISK3 DISK4
>>> S1    S1    S1    S2
>>> S2    S2    S3    S3
>>> S3    S4    S4    S4
>>> []
>>>
>>> Above is show a BG which uses all the four disks, but has a stripe which 
>>> spans only 3 disks.
>>>
>>>
>>> Pro:
>>> - btrfs already is capable to handle different BG in the filesystem, only 
>>> the allocator has 

Re: RFC: raid with a variable stripe size

2016-11-18 Thread Janos Toth F.
Based on the comments of this patch, stripe size could theoretically
go as low as 512 byte:
https://mail-archive.com/linux-btrfs@vger.kernel.org/msg56011.html
If these very small (0.5k-2k) stripe sizes could really work (it's
possible to implement such changes and it does not degrade performance
too much - or at all - to keep it so low), we could use RAID-5(/6) on
<=9(/10) disks with 512 byte physical sectors (assuming 4k filesystem
sector size + 4k node size, although I am not sure if node size is
really important here) without having to worry about RMW, extra space
waste or additional fragmentation.
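
(For reference, the combination assumed above would be created with something
like this; the device names are placeholders:)

# 4k sector size, 4k node size, RAID-5 for both data and metadata
mkfs.btrfs -s 4k -n 4k -d raid5 -m raid5 /dev/sdb /dev/sdc /dev/sdd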

On Fri, Nov 18, 2016 at 7:15 PM, Goffredo Baroncelli  wrote:
> Hello,
>
> these are only my thoughts; no code here, but I would like to share it hoping 
> that it could be useful.
>
> As reported several times by Zygo (and others), one of the problem of raid5/6 
> is the write hole. Today BTRFS is not capable to address it.
>
> The problem is that the stripe size is bigger than the "sector size" (ok 
> sector is not the correct word, but I am referring to the basic unit of 
> writing on the disk, which is 4k or 16K in btrfs).
> So when btrfs writes less data than the stripe, the stripe is not filled; 
> when it is filled by a subsequent write, a RMW of the parity is required.
>
> On the best of my understanding (which could be very wrong) ZFS try to solve 
> this issue using a variable length stripe.
>
> On BTRFS this could be achieved using several BGs (== block group or chunk), 
> one for each stripe size.
>
> For example, if a filesystem - RAID5 is composed by 4 DISK, the filesystem 
> should have three BGs:
> BG #1,composed by two disks (1 data+ 1 parity)
> BG #2 composed by three disks (2 data + 1 parity)
> BG #3 composed by four disks (3 data + 1 parity).
>
> If the data to be written has a size of 4k, it will be allocated to the BG #1.
> If the data to be written has a size of 8k, it will be allocated to the BG #2
> If the data to be written has a size of 12k, it will be allocated to the BG #3
> If the data to be written has a size greater than 12k, it will be allocated 
> to the BG3, until the data fills a full stripes; then the remainder will be 
> stored in BG #1 or BG #2.
>
>
> To avoid unbalancing of the disk usage, each BG could use all the disks, even 
> if a stripe uses less disks: i.e
>
> DISK1 DISK2 DISK3 DISK4
> S1    S1    S1    S2
> S2    S2    S3    S3
> S3    S4    S4    S4
> []
>
> Above is show a BG which uses all the four disks, but has a stripe which 
> spans only 3 disks.
>
>
> Pro:
> - btrfs already is capable to handle different BG in the filesystem, only the 
> allocator has to change
> - no more RMW are required (== higher performance)
>
> Cons:
> - the data will be more fragmented
> - the filesystem, will have more BGs; this will require time-to time a 
> re-balance. But is is an issue which we already know (even if may be not 100% 
> addressed).
>
>
> Thoughts ?
>
> BR
> G.Baroncelli
>
>
>
> --
> gpg @keyserver.linux.it: Goffredo Baroncelli 
> Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5


Re: [Bug 186671] New: OOM on system with just rsync running 32GB of ram 30GB of pagecache

2016-11-18 Thread Janos Toth F.
It could be totally unrelated but I have a similar problem: processes
get randomly OOM'd when I am doing anything "sort of heavy" on my
Btrfs filesystems.
I did some "evil tuning", so I assumed that must be the problem (even
if the values looked sane for my system). Thus, I kept cutting back on
the manually set values (mostly dirty/background ratio, io scheduler
request queue size and such tunables) but it seems to be a dead end. I
guess anything I change in order to try and cut back on the related
memory footprint just makes the OOMs less frequent but it's only a
matter of time and coincidence (lots of things randomly happen to do
some notable amount of IO) until OOMs happen anyway.
It seems to be enough to start a defrag or balance on more than
a single filesystem (in parallel), and pretty much any notable "useful"
user load will then have a high chance of triggering OOMs (and getting killed)
sooner or later.
loads [like that of bitcoind] (sync writes and/or frequent flushes?)
or high priority buffered writes (ffmpeg running with higher than
default priority and saving live video streams into files without
recoding) seem to have higher chance of triggering this (more so than
simply reading or writing files sequentially and asynchronously,
either locally or through Samba).
I am on gentoo-sources 4.8.8 right now but it was there with 4.7.x as well.
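
(The "evil tuning" was roughly this kind of thing; the exact values varied
over time and the ones below are only examples:)

# writeback cache limits
echo 10 > /proc/sys/vm/dirty_background_ratio
echo 30 > /proc/sys/vm/dirty_ratio
# deeper IO scheduler request queue on one of the HDDs
echo 1024 > /sys/block/sdb/queue/nr_requests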

On Thu, Nov 17, 2016 at 10:49 PM, Vlastimil Babka  wrote:
> On 11/16/2016 02:39 PM, E V wrote:
>> System panic'd overnight running 4.9rc5 & rsync. Attached a photo of
>> the stack trace, and the 38 call traces in a 2 minute window shortly
>> before, to the bugzilla case for those not on it's e-mail list:
>>
>> https://bugzilla.kernel.org/show_bug.cgi?id=186671
>
> The panic screenshot has only the last part, but the end marker says
> it's OOM with no killable processes. The DEBUG_VM config thus didn't
> trigger anything, and still there's tons of pagecache, mostly clean,
> that's not being reclaimed.
>
> Could you now try this?
> - enable CONFIG_PAGE_OWNER
> - boot with kernel option: page_owner=on
> - after the first oom, "cat /sys/kernel/debug/page_owner > file"
> - provide the file (compressed, it will be quite large)
>
> Vlastimil
>


btrfs check --repair: ERROR: cannot read chunk root

2016-10-30 Thread Janos Toth F.
I stopped using Btrfs RAID-5 after encountering this problem twice
(once due to a failing SATA cable, once due to a random kernel problem
which caused the SATA or the block device driver to reset/crash).
As far as I can tell, the main problem is that after a de- and a
subsequent re-attach (on purpose or due to failing cable/controller,
kernel problem, etc), your de-synced disk is taken back to the "array"
as if it was still in sync despite having a lower generation counter.
The filesystem detects the errors later but it can't reliably handle
them, let alone fully correct them. If you search on this list with "RAID 5"
you will see that RAID-5/6 scrub/repair has some known serious
problems (which probably contribute to this but I guess something on
top of those problems plays a part here to make this extra messy, like
the generations never getting synced). The de-synchronization will get
worse over time if you are in writable mode (the generation of the
de-synced disk is stuck), up to the point of the filesystem becoming
unmountable (if not already, probably due to scrub causing errors on
the disks in sync).
I was able to use "rescue" once and even achieve a read-only mount
during the second time (with only a handful of broken files). But I
could see a pattern and didn't want to end up in situations like that
for a third time.
Too bad, I would prefer RAID-5/6 over RAID-1/10 any day otherwise
(RAID-5 is faster than RAID-1 and you can lose any two disks from a
RAID-6, not just one from each mirror on RAID-10) but most people
think it's slow and obsolete (I mean they say that about hardware or
mdadm RAID-5/6 and Btrfs's RAID-5/6 is frowned upon for valid reasons)
while it's actually the opposite with a limited number of drives (<=6,
or maybe up to 10).
It's not impossible to get right though, RAID-Z is nice (except the
inability to defrag the inevitable fragmentation), so I keep hoping...
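
(If anyone wants to verify the generation mismatch on a re-attached member,
the per-device superblocks can be compared with something like this; the
device names are examples:)

# compare the generation recorded in each device's superblock
btrfs inspect-internal dump-super /dev/sdb | grep -w generation
btrfs inspect-internal dump-super /dev/sdc | grep -w generation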


Re: Btrfs/RAID5 became unmountable after SATA cable fault

2016-07-23 Thread Janos Toth F.
It seems like I accidentally managed to break my Btrfs/RAID5
filesystem, yet again, in a similar fashion.
This time around, I ran into some random libata driver issue (?)
instead of a faulty hardware part, but the end result is quite similar.

I issued the command (replacing X with valid letters for every
hard-drives in the system):
# echo 1 > /sys/block/sdX/device/queue_depth
and I ended up with read-only filesystems.
I checked dmesg and saw write errors on every disks (not just those in RAID-5).

I tried to reboot immediately without success. My root filesystem with
a single-disk Btrfs (which is an SSD, so it has "single" profile for
both data and metadata) was unmountable, thus the kernel was stuck in
a panic-reboot cycle.
I managed to fix this one by booting from an USB stick and trying
various recovery methods (like mounting it with "-o
clear_cache,nospace_cache,recovery" and running "btrfs rescue
chunk-recovery") until everything seemed to be fine (it can now be
mounted read-write without error messages in the kernel-log, can be
fully scrubbed without errors reported, it passes in "btrfs check",
files can be actually written and read, etc).

Once my system was up and running (well, sort of), I realized my /data
is also un-mountable. I tried the same recovery methods on this RAID-5
filesystem but nothing seemed to help (there is an exception with the
recovery attempts: the system drive was a small and fast SSD so
"chunk-recovery" was a viable option to try but this one consists of
huge slow HDDs - so I tried to run it as a last resort overnight but
I found an unresponsive machine in the morning with the process stuck
relatively early in the process).

I can always mount it read-only and access files on it, seemingly
without errors (I compared some of the contents with backups and it
looks good) but as soon as I mount it read-write, all hell breaks
loose and it falls into read-only state in no time (with some files
seemingly disappearing from the filesystem) and the kernel log is
starting to get spammed with various kinds of error messages (including
missing csums, etc).


After mounting it like this:
# mount /dev/sdb /data -o rw,noatime,nospace_cache
and doing:
# btrfs scrub start /data
the result is:

scrub status for 7d4769d6-2473-4c94-b476-4facce24b425
scrub started at Sat Jul 23 13:50:55 2016 and was aborted after 00:05:30
total bytes scrubbed: 18.99GiB with 16 errors
error details: read=16
corrected errors: 0, uncorrectable errors: 16, unverified errors: 0

The relevant dmesg output is:

 [ 1047.709830] BTRFS info (device sdc): disabling disk space caching
[ 1047.709846] BTRFS: has skinny extents
[ 1047.895818] BTRFS info (device sdc): bdev /dev/sdc errs: wr 4, rd
0, flush 0, corrupt 0, gen 0
[ 1047.895835] BTRFS info (device sdc): bdev /dev/sdb errs: wr 4, rd
0, flush 0, corrupt 0, gen 0
[ 1065.764352] BTRFS: checking UUID tree
[ 1386.423973] BTRFS error (device sdc): parent transid verify failed
on 24431936729088 wanted 585936 found 586145
[ 1386.430922] BTRFS error (device sdc): parent transid verify failed
on 24431936729088 wanted 585936 found 586145
[ 1411.738955] BTRFS error (device sdc): parent transid verify failed
on 24432322764800 wanted 585779 found 586145
[ 1411.948040] BTRFS error (device sdc): parent transid verify failed
on 24432322764800 wanted 585779 found 586145
[ 1412.040964] BTRFS error (device sdc): parent transid verify failed
on 24432322764800 wanted 585779 found 586145
[ 1412.040980] BTRFS error (device sdc): parent transid verify failed
on 24432322764800 wanted 585779 found 586145
[ 1412.041134] BTRFS error (device sdc): parent transid verify failed
on 24432322764800 wanted 585779 found 586145
[ 1412.042628] BTRFS error (device sdc): parent transid verify failed
on 24432322764800 wanted 585779 found 586145
[ 1412.042748] BTRFS error (device sdc): parent transid verify failed
on 24432322764800 wanted 585779 found 586145
[ 1499.45] BTRFS error (device sdc): parent transid verify failed
on 24432312270848 wanted 585779 found 586143
[ 1499.230264] BTRFS error (device sdc): parent transid verify failed
on 24432312270848 wanted 585779 found 586143
[ 1525.865143] BTRFS error (device sdc): parent transid verify failed
on 24432367730688 wanted 585779 found 586144
[ 1525.880537] BTRFS error (device sdc): parent transid verify failed
on 24432367730688 wanted 585779 found 586144
[ 1552.434209] BTRFS error (device sdc): parent transid verify failed
on 24432415821824 wanted 585781 found 586144
[ 1552.437325] BTRFS error (device sdc): parent transid verify failed
on 24432415821824 wanted 585781 found 586144


btrfs check /dev/sdc results in:

Checking filesystem on /dev/sdc
UUID: 7d4769d6-2473-4c94-b476-4facce24b425
checking extents
parent transid verify failed on 24431859855360 wanted 585941 found 586144
parent transid verify failed on 24431859855360 wanted 585941 found 586144
checksum verify failed on 24431859855360 found 3F0C0853 wanted 165308D5

Unexpectedly slow removal of fragmented files (RAID-5)

2016-04-16 Thread Janos Toth F.
As you can see from the attached terminal log below, file deletion can
take an unexpectedly long time, even if there is little disk I/O from
other tasks.

Listing the contents of similar directories (<=1000 files of ~1GB
each) can also be surprisingly slow (several seconds for a simple ls
command).
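
(A crude way to quantify both effects, in case someone wants to reproduce
this; the file name is one of the recordings listed further below:)

# how fragmented is the file, and how long does deleting it take?
filefrag -v 2016-04-01_18.30.avi | tail -n 1
time rm 2016-04-01_18.30.avi
# cold-cache directory listing
echo 3 > /proc/sys/vm/drop_caches
time ls -Al > /dev/null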

According to nmon's disk statistics the hard drives were not
over-saturated with I/O and mostly did reads rather than writes:

xDiskName Busy  Read WriteKB|0  |25 |50  |75   100|
xsdb   46%  435.70.0| >
xsdc   56%  461.70.0| >
xsdd   51%  431.70.0|RR   >
xTotals Read-MB/s=1.3  Writes-MB/s=0.0  Transfers/sec=237.3

I see nothing strange in the kernel log and my btrfs stats are clean
from errors (and the SMART attributes are also fine):

# btrfs device stats /dev/sdb
[/dev/sdb].write_io_errs   0
[/dev/sdb].read_io_errs0
[/dev/sdb].flush_io_errs   0
[/dev/sdb].corruption_errs 0
[/dev/sdb].generation_errs 0

(same as above for all drives)

This is a RAID-5 filesystem with 3 devices:

# btrfs fi show
Label: none  uuid: 7d4769d6-2473-4c94-b476-4facce24b425
Total devices 3 FS bytes used 5.13TiB
devid1 size 3.64TiB used 2.89TiB path /dev/sdc
devid2 size 3.64TiB used 2.89TiB path /dev/sdd
devid3 size 3.64TiB used 2.89TiB path /dev/sdb


# btrfs fi df /data
Data, RAID5: total=5.78TiB, used=5.13TiB
System, RAID5: total=64.00MiB, used=560.00KiB
Metadata, RAID5: total=11.00GiB, used=7.47GiB
GlobalReserve, single: total=512.00MiB, used=23.94MiB


These are fairly fast HDDs (3 identical models):
Model Number:   ST4000NM0033-9ZM170
Firmware Revision:  SN06

My current kernel is 4.5.1 (gentoo ~amd64) but this is not a new phenomenon.

Here is what I did for a quick benchmark (I would welcome suggestions
on how to measure/quantify this better):

# ls -Al
total 460860220
-rw-rw+ 1 ipcamrec nogroup 1147656438 Apr  1 19:06 2016-04-01_18.30.avi
-rw-rw+ 1 ipcamrec nogroup 1714611460 Apr  1 20:09 2016-04-01_19.08.avi
-rw-rw+ 1 ipcamrec nogroup 1889729894 Apr  1 21:09 2016-04-01_20.09.avi
-rw-rw+ 1 ipcamrec nogroup 1889900328 Apr  1 22:09 2016-04-01_21.09.avi
-rw-rw+ 1 ipcamrec nogroup 1889872856 Apr  1 23:09 2016-04-01_22.09.avi
-rw-rw+ 1 ipcamrec nogroup 1889847800 Apr  2 00:09 2016-04-01_23.09.avi
-rw-rw+ 1 ipcamrec nogroup 1889926046 Apr  2 01:09 2016-04-02_00.09.avi
-rw-rw+ 1 ipcamrec nogroup  623610962 Apr  2 01:29 2016-04-02_01.09.avi
-rw-rw+ 1 ipcamrec nogroup 1886283564 Apr  2 02:29 2016-04-02_01.29.avi
-rw-rw+ 1 ipcamrec nogroup 1889900926 Apr  2 03:29 2016-04-02_02.29.avi
-rw-rw+ 1 ipcamrec nogroup 1889861230 Apr  2 04:29 2016-04-02_03.29.avi
-rw-rw+ 1 ipcamrec nogroup 1889804690 Apr  2 05:29 2016-04-02_04.29.avi
-rw-rw+ 1 ipcamrec nogroup 1889857562 Apr  2 06:29 2016-04-02_05.29.avi
-rw-rw+ 1 ipcamrec nogroup 1889818618 Apr  2 07:29 2016-04-02_06.29.avi
-rw-rw+ 1 ipcamrec nogroup 1889895388 Apr  2 08:29 2016-04-02_07.29.avi
-rw-rw+ 1 ipcamrec nogroup 1025430638 Apr  2 09:01 2016-04-02_08.29.avi
-rw-rw+ 1 ipcamrec nogroup 1890289896 Apr  2 10:03 2016-04-02_09.03.avi
-rw-rw+ 1 ipcamrec nogroup 1889813210 Apr  2 11:03 2016-04-02_10.03.avi
-rw-rw+ 1 ipcamrec nogroup 1233504110 Apr  2 11:42 2016-04-02_11.03.avi
-rw-rw+ 1 ipcamrec nogroup 1641288188 Apr  2 12:37 2016-04-02_11.45.avi
-rw-rw+ 1 ipcamrec nogroup  386146966 Apr  2 12:53 2016-04-02_12.41.avi
-rw-rw+ 1 ipcamrec nogroup 1681678050 Apr  2 13:48 2016-04-02_12.55.avi
-rw-rw+ 1 ipcamrec nogroup 1539937212 Apr  2 14:39 2016-04-02_13.50.avi
-rw-rw+ 1 ipcamrec nogroup6094144 Apr  2 14:41 2016-04-02_14.40.avi
-rw-rw+ 1 ipcamrec nogroup 1047267124 Apr  2 15:14 2016-04-02_14.41.avi
-rw-rw+ 1 ipcamrec nogroup  646180950 Apr  2 15:37 2016-04-02_15.16.avi
-rw-rw+ 1 ipcamrec nogroup  647643096 Apr  2 15:58 2016-04-02_15.37.avi
-rw-rw+ 1 ipcamrec nogroup 1885294996 Apr  2 17:00 2016-04-02_16.00.avi
-rw-rw+ 1 ipcamrec nogroup   53816900 Apr  2 17:02 2016-04-02_17.00.avi
-rw-rw+ 1 ipcamrec nogroup 1890431554 Apr  2 18:03 2016-04-02_17.03.avi
-rw-rw+ 1 ipcamrec nogroup 1889900590 Apr  2 19:03 2016-04-02_18.03.avi
-rw-rw+ 1 ipcamrec nogroup 1601785838 Apr  2 20:03 2016-04-02_19.03.avi
-rw-rw+ 1 ipcamrec nogroup 1889895806 Apr  2 21:03 2016-04-02_20.03.avi
-rw-rw+ 1 ipcamrec nogroup 1889839996 Apr  2 22:03 2016-04-02_21.03.avi
-rw-rw+ 1 ipcamrec nogroup 1889895188 Apr  2 23:03 2016-04-02_22.03.avi
-rw-rw+ 1 ipcamrec nogroup 1889775096 Apr  3 00:03 2016-04-02_23.03.avi
-rw-rw+ 1 ipcamrec nogroup 1889922352 Apr  3 01:03 2016-04-03_00.03.avi
-rw-rw+ 1 ipcamrec nogroup 1889870066 Apr  3 02:03 2016-04-03_01.03.avi
-rw-rw+ 1 ipcamrec nogroup 1889866336 Apr  3 03:03 2016-04-03_02.03.avi

Re: Btrfs/RAID5 became unmountable after SATA cable fault

2015-11-06 Thread Janos Toth F.
I created a fresh RAID-5 mode Btrfs on the same 3 disks (including the
faulty one which is still producing numerous random read errors) and
Btrfs now seems to work exactly as I would anticipate.

I copied some data and verified the checksum. The data is readable and
correct regardless of the constant warning messages in the kernel log
about the read errors on the single faulty HDD (the bad behavior is
confirmed by the SMART logs and I tested it in a different PC as
well...).

I also ran several scrubs and now it always finishes with X corrected
and 0 uncorrected errors. (The errors are supposedly corrected but the
faulty HDD keeps randomly corrupting the data...)
The last time, I saw uncorrected errors during the scrub and not all the
data was readable. Rather strange...

I ran 24 hours of the GIMPS/Prime95 Blend stress test without errors on the
problematic machine.
Although I updated the firmware of the drives. (I found an IMPORTANT
update when I went there to download SeaTools, although there was no
changelog to tell me why this was important.) This might have changed the
error handling behavior of the drive...?


Re: Btrfs/RAID5 became unmountable after SATA cable fault

2015-11-04 Thread Janos Toth F.
Well. Now I am really confused about Btrfs RAID-5!

So, I replaced all SATA cables (which are explicitly marked as being
aimed at SATA3 speeds) and all the 3x2TB WD Red 2.0 drives with 3x4TB
Seagate Constellation ES.3 drives and started from scratch. I
secure-erased every drive, created an empty filesystem and ran a
"long" SMART self-test on all drives before I started using the
storage space (the tests finished without errors, all drives looked
fine, zero bad sectors, zero read or SATA CRC errors... all looked
perfectly fine at the time...).

It didn't take long before I realized that one of the new drives
started failing.
I started a scrub and it reported both corrected and uncorrectable errors.
I looked at the SMART data. 2 drives look perfectly fine and 1 drive
seems to be really sick. The latter one has some "reallocated" and
several hundred "pending" sectors among other error indications in
the log. I guess it's not the drive surface but the HDD controller (or
maybe a head) which is really dying.

I figured the uncorrectable errors are write errors which is not
surprising given the perceived "health" of the drive according to its
SMART attributes and error logs. That's understandable.


Although, I tried to copy data from the filesystem and it failed in
various ways. There was a file which couldn't be copied at all. Good
There was a file which couldn't be copied at all. Good question why. I
guess it's because the filesystem needs to be repaired to get the
checksums and parities sorted out first. That's also understandable
(though unexpected, I thought RAID-5 Btrfs is sort-of "self-healing"
in these situations, it should theoretically still be able to
reconstruct and present the correct data, based on checksums and
parities seamlessly and only place errors in the kernel log...).

But the worst part is that there are some ISO files which were
seemingly copied without errors but their external checksums (the one
which I can calculate with md5sum and compare to the one supplied by
the publisher of the ISO file) don't match!
Well... this, I cannot understand.
How could these files become corrupt from a single disk failure? And
more importantly: how could these files be copied without errors? Why
didn't Btrfs give a read error when the checksums didn't add up?


Isn't Btrfs supposed to constantly check the integrity of the file
data during any normal read operations and give an error instead of
spitting out corrupt data as if it was perfectly legit?
I thought that's how it is supposed to work.
What's the point of full data checksumming if only an explicitly
requested scrub operation might look for errors? I thought the
logical way is for checksum verification to happen during every
single read operation and for passing that check to be mandatory in order to
get any data out of the filesystem (might be excluding the Direct-I/O
mode but I never use that on Btrfs - if that's even actually
supported, I don't know).


Now I am really considering moving from Linux to Windows and from
Btrfs RAID-5 to Storage Spaces RAID-1 + ReFS (the only limitation is
that ReFS is only "self-healing" on RAID-1, not RAID-5, so I need a
new motherboard with more native SATA connectors and an extra HDD).
That one seemed to actually do what it promises (abort any read
operations upon checksum errors [which always happens seamlessly on
every read] but look at the redundant data first and seamlessly
"self-heal" if possible). The only thing which made Btrfs to look as a
better alternative was the RAID-5 support. But I recently experienced
two cases of 1 drive failing of 3 and it always tuned out as a smaller
or bigger disaster (completely lost data or inconsistent data).


Does anybody have any ideas about what might have gone wrong in this second scenario?


Re: Btrfs/RAID5 became unmountable after SATA cable fault

2015-10-21 Thread Janos Toth F.
I went through all the recovery options I could find (starting from
read-only to "extraordinarily dangerous"). Nothing seemed to work.

A Windows-based proprietary recovery tool (ReclaiMe) could scratch
the surface, but only that (it showed me the whole original folder
structure after a few minutes of scanning and the "preview" of
some plaintext files was promising, but most of the bigger files seemed
to be broken).

I used this as a bulk storage for backups and all the things I didn't
care to keep in more than one copy, but that includes my
"scratchpad", so I cared enough to use RAID5 mode and to try restoring
some things.

Any last ideas before I "ata secure erase" and sell/repurpose the disks?
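
(For the record, the read-only end of that spectrum is the offline salvage
tool, invoked roughly like this; the device and target path are only
examples:)

# list the tree roots first, then copy out whatever files are still reachable
btrfs restore -l /dev/sda
btrfs restore /dev/sda /mnt/recovery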


Re: Btrfs/RAID5 became unmountable after SATA cable fault

2015-10-21 Thread Janos Toth F.
I am afraid the filesystem right now is really damaged regardless of
its state upon the unexpected cable failure, because I tried some
dangerous options after the read-only restore/recovery methods all failed
(including zero-log, followed by init-csum-tree and even
chunk-recovery -> all of them just spit out several kinds of errors
which suggested they probably didn't even write anything to the disks
before they decided that they had already failed, but if they did write
something, they only caused more harm than good).

Actually, I almost got rid of this data myself intentionally when my
new set of drives arrived. I was considering whether I should simply start
from scratch (maybe reviewing and saving the "scratchpad"
portion of the data, but nothing really irreplaceable and/or valuable)
but I thought it was a good idea to test the "device replace" function
in real life.

Even though the replace operation seemed to be successful, I am
beginning to wonder whether it really was.


On Wed, Oct 21, 2015 at 7:42 PM, ronnie sahlberg
<ronniesahlb...@gmail.com> wrote:
> Maybe hold off erasing the drives a little in case someone wants to
> collect some extra data for diagnosing how/why the filesystem got into
> this unrecoverable state.
>
> A single device having issues should not cause the whole filesystem to
> become unrecoverable.
>
> On Wed, Oct 21, 2015 at 9:09 AM, Janos Toth F. <toth.f.ja...@gmail.com> wrote:
>> I went through all the recovery options I could find (starting from
>> read-only to "extraordinarily dangerous"). Nothing seemed to work.
>>
>> A Windows based proprietary recovery software (ReclaiMe) could scratch
>> the surface but only that (it showed me the whole original folder
>> structure after a few minutes of scanning and the "preview" of some
>> some plaintext files was promising but most of the bigger files seemed
>> to be broken).
>>
>> I used this as a bulk storage for backups and all the things I didn't
>> care to keep in more than one copies but that includes my
>> "scratchpad", so I cared enough to use RAID5 mode and to try restoring
>> some things.
>>
>> Any last ideas before I "ata secure erase" and sell/repurpose the disks?


Re: Btrfs/RAID5 became unmountable after SATA cable fault

2015-10-21 Thread Janos Toth F.
I tried several things, including the degraded mount option. One example:

# mount /dev/sdb /data -o ro,degraded,nodatasum,notreelog
mount: wrong fs type, bad option, bad superblock on /dev/sdb,
   missing codepage or helper program, or other error

   In some cases useful info is found in syslog - try
   dmesg | tail or so.

# cat /proc/kmsg
<6>[  262.616929] BTRFS info (device sdd): allowing degraded mounts
<6>[  262.616943] BTRFS info (device sdd): setting nodatasum
<6>[  262.616949] BTRFS info (device sdd): disk space caching is enabled
<6>[  262.616953] BTRFS: has skinny extents
<6>[  262.652671] BTRFS: bdev (null) errs: wr 858, rd 8057, flush 280,
corrupt 0, gen 0
<3>[  262.697162] BTRFS (device sdd): parent transid verify failed on
38719488 wanted 101765 found 101223
<3>[  262.697633] BTRFS (device sdd): parent transid verify failed on
38719488 wanted 101765 found 101223
<3>[  262.697660] BTRFS: Failed to read block groups: -5
<3>[  262.709885] BTRFS: open_ctree failed
<6>[  267.197365] BTRFS info (device sdd): allowing degraded mounts
<6>[  267.197385] BTRFS info (device sdd): setting nodatasum
<6>[  267.197397] BTRFS info (device sdd): disabling tree log
<6>[  267.197406] BTRFS info (device sdd): disk space caching is enabled
<6>[  267.197412] BTRFS: has skinny extents
<6>[  267.232809] BTRFS: bdev (null) errs: wr 858, rd 8057, flush 280,
corrupt 0, gen 0
<3>[  267.246167] BTRFS (device sdd): parent transid verify failed on
38719488 wanted 101765 found 101223
<3>[  267.246706] BTRFS (device sdd): parent transid verify failed on
38719488 wanted 101765 found 101223
<3>[  267.246727] BTRFS: Failed to read block groups: -5
<3>[  267.261392] BTRFS: open_ctree failed

On Wed, Oct 21, 2015 at 6:09 PM, Janos Toth F. <toth.f.ja...@gmail.com> wrote:
> I went through all the recovery options I could find (ranging from
> read-only to "extraordinarily dangerous"). Nothing seemed to work.
>
> A Windows-based proprietary recovery tool (ReclaiMe) could scratch
> the surface but only that (it showed me the whole original folder
> structure after a few minutes of scanning and the "preview" of some
> plaintext files was promising, but most of the bigger files seemed
> to be broken).
>
> I used this as bulk storage for backups and everything I didn't
> care to keep in more than one copy, but that includes my
> "scratchpad", so I cared enough to use RAID5 mode and to try restoring
> some things.
>
> Any last ideas before I "ata secure erase" and sell/repurpose the disks?


Btrfs/RAID5 became unmountable after SATA cable fault

2015-10-19 Thread Janos Toth F.
I was in the middle of replacing the drives of my NAS one by one (I
wanted to end up with bigger and faster storage), so I was using one
more SATA drive + SATA cable than usual. Unfortunately, the extra
cable turned out to be faulty and it looks like it caused some heavy
damage to the file system.

There was no "devive replace" running at the moment or the disaster.
The first round already got finished hours ago and I planned to start
the next one before going to sleep. So, it was a full RAID-5 setup in
normal state. But one of the active, mounted devices was the first
replacment HDD and it was hanging on the spare SATA cable.

I tried to save a file to my mounted Samba share and realized the
file system had become read-only. I rebooted the machine and found
that /data could no longer be mounted.
According to smartmontools, one of the drives was suffering from SATA
communication errors.
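
For reference, the kind of check I mean is roughly this (the device
name is a placeholder and the exact attribute names vary by vendor):

# smartctl -A /dev/sdX | grep -i -E 'crc|err'

UDMA_CRC_Error_Count is typically the attribute that keeps climbing
when a SATA cable or connector is bad.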

I tried some trivial recovery methods and searched the mailing list
archives, but I didn't really find a solution. I wonder if somebody
can help with this.

Should I run "btrfs rescue chunk-recover /dev/sda"?

Here are some raw details:

# uname -a
Linux F17a_NAS 4.2.3-gentoo #2 SMP Sun Oct 18 17:56:45 CEST 2015
x86_64 AMD E-350 Processor AuthenticAMD GNU/Linux

# btrfs --version
btrfs-progs v4.2.2

# btrfs check /dev/sda
checksum verify failed on 21102592 found 295F0086 wanted 
checksum verify failed on 21102592 found 295F0086 wanted 
checksum verify failed on 21102592 found 99D0FC26 wanted B08FFCA0
checksum verify failed on 21102592 found 99D0FC26 wanted B08FFCA0
bytenr mismatch, want=21102592, have=65536
Couldn't read chunk root
Couldn't open file system
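
Since check cannot even read the chunk root through the primary
superblock, I assume the next read-only step is to look at the
superblock copies and at the other member devices, roughly like this
(device names as above):

# btrfs-show-super -fa /dev/sda
(-f also dumps the sys_chunk_array and the backup roots, -a prints
all superblock copies, so generations can be compared between copies
and between devices)

# btrfs check -s 1 /dev/sda
(retries the check using the first backup superblock copy instead of
the primary one)

Neither of these should write anything to the disks.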

# mount /dev/sda /data -o ro,recovery
mount: wrong fs type, bad option, bad superblock on /dev/sda, ...

# cat /proc/kmsg
<6>[ 1902.033164] BTRFS info (device sdb): enabling auto recovery
<6>[ 1902.033184] BTRFS info (device sdb): disk space caching is enabled
<6>[ 1902.033191] BTRFS: has skinny extents
<3>[ 1902.034931] BTRFS (device sdb): bad tree block start 0 21102592
<3>[ 1902.051259] BTRFS (device sdb): parent transid verify failed on
21147648 wanted 101748 found 101124
<3>[ 1902.111807] BTRFS (device sdb): parent transid verify failed on
44613632 wanted 101770 found 101233
<3>[ 1902.126529] BTRFS (device sdb): parent transid verify failed on
40595456 wanted 101767 found 101232
<6>[ 1902.164667] BTRFS: bdev /dev/sda errs: wr 858, rd 8057, flush
280, corrupt 0, gen 0
<3>[ 1902.165929] BTRFS (device sdb): parent transid verify failed on
44617728 wanted 101770 found 101233
<3>[ 1902.166975] BTRFS (device sdb): parent transid verify failed on
44621824 wanted 101770 found 101233
<3>[ 1902.271296] BTRFS (device sdb): parent transid verify failed on
38621184 wanted 101765 found 101223
<3>[ 1902.380526] BTRFS (device sdb): parent transid verify failed on
38719488 wanted 101765 found 101223
<3>[ 1902.381510] BTRFS (device sdb): parent transid verify failed on
38719488 wanted 101765 found 101223
<3>[ 1902.381549] BTRFS: Failed to read block groups: -5
<3>[ 1902.394835] BTRFS: open_ctree failed
<6>[ 1911.202254] BTRFS info (device sdb): enabling auto recovery
<6>[ 1911.202270] BTRFS info (device sdb): disk space caching is enabled
<6>[ 1911.202275] BTRFS: has skinny extents
<3>[ 1911.203611] BTRFS (device sdb): bad tree block start 0 21102592
<3>[ 1911.204803] BTRFS (device sdb): parent transid verify failed on
21147648 wanted 101748 found 101124
<3>[ 1911.246384] BTRFS (device sdb): parent transid verify failed on
44613632 wanted 101770 found 101233
<3>[ 1911.248729] BTRFS (device sdb): parent transid verify failed on
40595456 wanted 101767 found 101232
<6>[ 1911.251658] BTRFS: bdev /dev/sda errs: wr 858, rd 8057, flush
280, corrupt 0, gen 0
<3>[ 1911.252485] BTRFS (device sdb): parent transid verify failed on
44617728 wanted 101770 found 101233
<3>[ 1911.253542] BTRFS (device sdb): parent transid verify failed on
44621824 wanted 101770 found 101233
<3>[ 1911.278414] BTRFS (device sdb): parent transid verify failed on
38621184 wanted 101765 found 101223
<3>[ 1911.283950] BTRFS (device sdb): parent transid verify failed on
38719488 wanted 101765 found 101223
<3>[ 1911.284835] BTRFS (device sdb): parent transid verify failed on
38719488 wanted 101765 found 101223
<3>[ 1911.284873] BTRFS: Failed to read block groups: -5
<3>[ 1911.298783] BTRFS: open_ctree failed


# btrfs-show-super /dev/sda
superblock: bytenr=65536, device=/dev/sda
-
csum                    0xe8789014 [match]
bytenr                  65536
flags                   0x1
                        ( WRITTEN )
magic                   _BHRfS_M [match]
fsid                    2bba7cff-b4bf-4554-bee4-66f69c761ec4
label
generation              101480
root                    37892096
sys_array_size          258
chunk_root_generation   101124
root_level              2
chunk_root              21147648
chunk_root_level        1
log_root